Name | Last modified | Size
---|---|---
import-dataset.sql | 2022-11-15 11:51 | 1.8K
README.md | 2024-01-05 11:07 | 12K
replication-package.tar.gz | 2022-11-09 11:42 | 13K
licenses-annotated-sample.tar.gz | 2023-05-17 07:23 | 808K
blobs-sample20k.tar.zst | 2022-10-12 14:22 | 27M
blobs-scancode.csv.zst | 2022-10-26 08:52 | 116M
blobs-nb-origins.csv.zst | 2022-10-31 09:34 | 144M
blobs-fileinfo.csv.zst | 2022-11-09 11:24 | 161M
blobs-origins.csv.zst | 2022-10-11 17:52 | 231M
license-blobs.csv.zst | 2022-09-08 14:41 | 288M
blobs-earliest.csv.zst | 2022-10-18 15:32 | 475M
blobs-scancode.ndjson.zst | 2022-10-26 08:53 | 797M
blobs.tar.zst | 2022-11-09 11:40 | 13G
Author: Stefano Zacchiroli
Contact: zack@upsilon.cc
WARNING: you are looking at an old version of this dataset. A new version is available at: ../latest/.
This dataset contains all "license files" extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (A long-term archival version of this dataset is also available from Zenodo).
In this context, a license file is a unique file content (or "blob") that
appeared in a software origin archived by Software Heritage as a file whose
name is often used to ship licenses in software projects. Some name examples
are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name
pattern used to select the blobs contained in the dataset can be found in the
SQL query file 01-select-blobs.sql. Note that the file was not required
to be at the project root, because project subdirectories can contain different
licenses than the top-level one, and we wanted to include those too.
The dataset is organized as follows:
blobs.tar.zst: a Zst-compressed tarball containing deduplicated
license blobs, one per file. The tarball contains 6'859'189 blobs, for a
total uncompressed size on disk of 66 GiB.
The blobs are organized in a sharded directory structure that contains files
named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:
blobs/ is the root directory containing all license blobs
8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a
specific license blob, a copy of the GPL3 license in this case. Each
license blob is ultimately named with its SHA1:
$ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
$ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
8624bcdae55baeef00cd11d5dfcfa60f68710a02  blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
86 and 24 are, respectively, the first and second group of two hex
digits in the blob SHA1
One blob is missing because its size (313 MB) prevented its inclusion (it was originally a tarball containing source code):
swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"
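The sharding rule described for blobs.tar.zst can be sketched in Python as follows. This is a minimal illustration, not part of the dataset tooling: the function names blob_path and verify_blob are hypothetical, and the code assumes the tarball has already been extracted so that the blobs/ directory is reachable from the given root.

```python
import hashlib
from pathlib import Path


def blob_path(sha1_hex: str, root: str = "blobs") -> Path:
    """Return the sharded path of a blob: <root>/<hex 0-1>/<hex 2-3>/<sha1>."""
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError(f"not a hex-encoded SHA1: {sha1_hex!r}")
    return Path(root) / sha1_hex[:2] / sha1_hex[2:4] / sha1_hex


def verify_blob(path: Path) -> bool:
    """Check that a blob's file name matches the SHA1 of its content."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large blobs do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == path.name
```

For the GPL3 example above, blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02") yields blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, and verify_blob on that file performs the same check as the sha1sum command.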
blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing
"only" 20'000 randomly selected license blobs
license-blobs.csv.zst: a Zst-compressed CSV index of all the
blobs in the dataset. Each line in the index (except the first one, which
contains column headers) describes a license blob and is in the format
SWHID,SHA1,NAME, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"
where:
SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
SHA1: the blob SHA1, which can be used to cross-reference blobs in the
blobs/ directory
NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it's an original typo from some repository!).
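Because one blob can appear under several names, a common first step is to group the index by SHA1. The sketch below does this with Python's csv module, which also copes with quoted names that contain commas. The function name names_by_sha1 and the header text in the sample are assumptions; the three data rows are the example rows from this README.

```python
import csv
import io
from collections import defaultdict


def names_by_sha1(lines):
    """Group the NAME values of the index by blob SHA1."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the first line, which holds the column headers
    names = defaultdict(set)
    for swhid, sha1, name in reader:
        names[sha1].add(name)
    return names


# The header text is an assumption; only its presence is documented.
sample = io.StringIO(
    "SWHID,SHA1,NAME\n"
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"\n'
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"\n'
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"\n'
)
index = names_by_sha1(sample)
print(sorted(index["8624bcdae55baeef00cd11d5dfcfa60f68710a02"]))
```

On the real file, the same function could be fed a decompressed text stream, for instance sys.stdin when the script is run behind zstdcat license-blobs.csv.zst.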
blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from
blobs to basic file information in the format:
SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:
blobs-scancode.csv.zst: a Zst-compressed CSV mapping from
blobs to the software licenses detected in them by ScanCode, in the
format: SHA1,LICENSE,SCORE, where:
There may be zero or arbitrarily many lines for each blob.
blobs-scancode.ndjson.zst: a Zst-compressed
line-delimited JSON, containing a superset of the information in
blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:
scancode.api.get_licenses(..., min_score=0)
scancode.api.get_copyrights(...)
There is exactly one line for each blob. The licenses and copyrights keys
are omitted for files not detected as plain text.
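Since the licenses and copyrights keys may be absent, code reading the NDJSON file should not index them unconditionally. The following sketch counts records with and without scan results. The function name count_scan_status is hypothetical, and the sample records are made up for illustration; they are not taken from the dataset.

```python
import json


def count_scan_status(lines):
    """Count records that carry ScanCode results and records that do not."""
    scanned = skipped = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        # Both keys are omitted for files not detected as plain text.
        if "licenses" in record and "copyrights" in record:
            scanned += 1
        else:
            skipped += 1
    return scanned, skipped


# Made-up records: one scanned blob, one blob not detected as plain text.
sample_lines = [
    '{"licenses": [], "copyrights": []}',
    "{}",
]
print(count_scan_status(sample_lines))
```

Counting instead of collecting records keeps memory use flat, which matters when streaming one line per blob for the whole dataset.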
blobs-origins.csv.zst: a Zst-compressed CSV mapping of where
license blobs come from. Each line in the index associates a license blob with
one of its origins in the format SWHID<TAB>URL, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis
Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.
If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the blob's origins were either still being loaded when the dataset was generated, or the loader process crashed before completing the ingestion of the blob's origin.
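A reader of this mapping has to handle the blank-URL case explicitly. The sketch below parses one line and returns None when no origin is known. The function name parse_origin_line is hypothetical, and since the exact form of the "blank" is not specified here, the code assumes it is either an empty field or whitespace only.

```python
from typing import Optional, Tuple


def parse_origin_line(line: str) -> Tuple[str, Optional[str]]:
    """Split a SWHID<TAB>URL line; the URL is None when the field is blank."""
    swhid, _, url = line.rstrip("\r\n").partition("\t")
    url = url.strip()  # assumption: a blank is an empty or whitespace-only field
    return swhid, (url or None)


line = "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2\thttps://github.com/pombreda/Artemis\n"
print(parse_origin_line(line))
```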
blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of
how many origins of each blob are known to Software Heritage.
Each line in the index associates a license blob with this count in the format
SWHID<TAB>NUMBER, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260
Two blobs are missing because the computation crashed:
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc
This issue will be fixed in a future version of the dataset.
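For orientation, these two identifiers can be reproduced locally. Assuming, as in the SWHID scheme for contents, that the identifier is the SHA1 of the file bytes prefixed with a Git-style "blob <length>" header, the sketch below recomputes them from the empty file and from a file holding a single newline. The function name content_swhid is illustrative. The same header is why the SWHID and SHA1 columns of the index differ for the same blob.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute a content SWHID, assuming Git-style blob hashing."""
    header = b"blob %d\0" % len(data)  # e.g. b"blob 0\x00" for empty content
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


print(content_swhid(b""))    # empty file
print(content_swhid(b"\n"))  # file containing a single newline
```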
blobs-earliest.csv.zst: a Zst-compressed CSV mapping from
blobs to information about their (earliest) known occurrence(s) in the
archive. Format: SWHID<TAB>EARLIEST_SWHID<TAB>EARLIEST_TS<TAB>OCCURRENCES,
where:
replication-package.tar.gz: code and scripts used to produce the dataset
licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated
random sample of license blobs, with details about the kind of information
they contain.
More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.
Values in the NAME column of license-blobs.csv.zst are quoted, as some
file names now contain commas.
The replication package now contains all the steps needed to reproduce all artefacts,
including the licenseblobs/fetch.py script.
blobs-nb-origins.csv.zst is added.
blobs-origins.csv.zst is now generated using the first origin returned by
swh-graph's leaves endpoint, instead of its randomwalk endpoint.
This should have no impact on the result, other than a different distribution
of "random" origins being picked.
blobs-origins.csv.zst was missing ~10% of its results in previous versions
of the dataset, due to errors and/or timeouts in its generation;
this is now down to 0.02% (1254 of the 6859445 unique blobs).
Blobs with no known origins are now present, with a blank instead of a URL.
blobs-earliest.csv.zst was missing ~10% of its results in previous versions
of the dataset. It is complete now.
blobs-scancode.csv.zst is generated with a newer scancode-toolkit version
(31.2.1).
blobs-scancode.ndjson.zst is added.
A file named .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the
initial version of the dataset (published on 2022-11-07). It was removed on
2022-11-09 using these two commands:
pv blobs-fileinfo.csv.zst | zstdcat | grep -v "\.tmp" | zstd -19
pv blobs.tar.zst| zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12
The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.
If you use this dataset for research purposes, please acknowledge its use by citing the following paper:
The dataset has been built using primarily the data sources described in the following papers:
Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.