Name | Last modified | Size | Description |
---|---|---|---|
README.md | 2025-01-24 11:05 | 10K | |
blobs-sample20k.tar.zst | 2025-01-24 09:50 | 69M | |
blobs-scancode.csv.zst | 2025-01-24 09:50 | 218M | |
blobs-nb-origins.csv.zst | 2025-01-24 09:50 | 229M | |
blobs-fileinfo.csv.zst | 2025-01-24 09:50 | 276M | |
blobs-origins.csv.zst | 2025-01-24 09:50 | 382M | |
blobs-earliest.csv.zst | 2025-01-24 09:50 | 496M | |
blobs.csv.zst | 2025-01-24 09:50 | 526M | |
blobs-scancode.ndjson.zst | 2025-01-24 09:50 | 1.4G | |
blobs.tar.zst | 2025-01-24 09:52 | 24G | |
Author: Stefano Zacchiroli
Contact: zack@upsilon.cc
This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2024-12-06. (A long-term archival version of this dataset is also available from Zenodo).
In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: `COPYING`, `LICENSE`, `NOTICE`, `COPYRIGHT`, etc.
The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query. Note that files were not required to be at the project root, because project subdirectories can contain licenses that differ from the top-level one, and we wanted to include those too.
The dataset is organized as follows:
`blobs.tar.zst`: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 10’951’035 blobs, for a total uncompressed size on disk of 122 GiB.
The blobs are organized in a sharded directory structure that contains files named like `blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02`, where:
`blobs/` is the root directory containing all license blobs
`8624bcdae55baeef00cd11d5dfcfa60f68710a02` is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:
$ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
$ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
8624bcdae55baeef00cd11d5dfcfa60f68710a02 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
`86` and `24` are, respectively, the first and second group of two hex digits in the blob SHA1
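
The sharded layout described above can be reproduced in a few lines. The following is a minimal Python sketch, not part of the dataset tooling: the helper names `shard_path` and `verify_blob` are hypothetical, and it assumes the tarball has already been extracted so that blob contents are available as bytes.

```python
import hashlib


def shard_path(sha1_hex: str, root: str = "blobs") -> str:
    """Return the sharded relative path of a blob, given its SHA1 in hex."""
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError(f"not a SHA1 hex digest: {sha1_hex!r}")
    # The first and second groups of two hex digits select the two directory levels.
    return f"{root}/{sha1_hex[0:2]}/{sha1_hex[2:4]}/{sha1_hex}"


def verify_blob(path: str, content: bytes) -> bool:
    """Check that a blob's content hashes to the SHA1 it is named after."""
    expected = path.rsplit("/", 1)[-1]
    return hashlib.sha1(content).hexdigest() == expected


print(shard_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02"))
```

Calling `verify_blob` on each extracted file is a cheap integrity check after download, equivalent to the `sha1sum` invocation shown above.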
`blobs-sample20k.tar.zst`: analogous to `blobs.tar.zst`, but containing “only” 20’000 randomly selected license blobs
`license-blobs.csv.zst`: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format `SWHID,SHA1,NAME`, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"
where:
SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
SHA1: the blob SHA1, which can be used to cross-reference blobs in the `blobs/` directory. May be empty in the rare case where the blob is not present in the archive (e.g., because it was too large).
NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it’s an original typo from some repository!).
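
Since one blob can appear under several names, a common first step is to group index rows by blob. The sketch below is illustrative only: it reads the three example rows shown above from an in-memory string, assumes the index has already been decompressed to text (e.g. with `zstd -dc`), and uses a made-up header line, which is skipped rather than interpreted. The function name `names_by_blob` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# The three example rows from the index, preceded by a header line.
# The header text is an assumption; the code skips it without reading it.
SAMPLE_INDEX = '''SWHID,SHA1,NAME
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"
'''

ARCHIVE_BASE = "https://archive.softwareheritage.org/"


def names_by_blob(lines):
    """Group file names by SWHID and record each blob's SHA1 (None if empty)."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the column headers
    names = defaultdict(set)
    sha1_of = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        swhid, sha1, name = row
        names[swhid].add(name)
        # An empty SHA1 means the blob content is missing from the archive.
        sha1_of[swhid] = sha1 or None
    return names, sha1_of


names, sha1_of = names_by_blob(io.StringIO(SAMPLE_INDEX))
for swhid, blob_names in names.items():
    print(ARCHIVE_BASE + swhid, sha1_of[swhid], sorted(blob_names))
```

Keying on the SWHID rather than the SHA1 keeps blobs with an empty SHA1 distinguishable from one another.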
`blobs-fileinfo.csv.zst`: a Zst-compressed CSV mapping from blobs to basic file information, in the format `sha1,mime_type,encoding,line_count,word_count,size`.
`blobs-scancode.csv.zst`: a Zst-compressed CSV mapping from blobs to the software licenses detected in them by ScanCode, in the format `sha1,license,score`. There may be zero or arbitrarily many lines for each blob.
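
Because a blob may have several detection rows, consumers often reduce them to one license per blob. The following sketch shows one way to do that under stated assumptions: the file is already decompressed, it starts with a header line naming the columns `sha1,license,score`, and `score` is numeric with higher values meaning a stronger match. All sample rows, license keys, and scores below are made up for illustration, and `best_license_per_blob` is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample data; none of these rows come from the real dataset.
SAMPLE = """sha1,license,score
8624bcdae55baeef00cd11d5dfcfa60f68710a02,gpl-3.0,100.0
0000000000000000000000000000000000000001,mit,95.5
0000000000000000000000000000000000000001,bsd-new,12.0
"""


def best_license_per_blob(lines, min_score=0.0):
    """Keep, for each blob, the detection row with the highest score."""
    best = {}
    for row in csv.DictReader(lines):
        score = float(row["score"])
        if score < min_score:
            continue  # drop weak detections
        current = best.get(row["sha1"])
        if current is None or score > current[1]:
            best[row["sha1"]] = (row["license"], score)
    return best


for sha1, (license_key, score) in best_license_per_blob(io.StringIO(SAMPLE)).items():
    print(sha1, license_key, score)
```

Blobs with no detection rows simply do not appear in the result, which matches the "zero or arbitrarily many lines" property of the file.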
`blobs-scancode.ndjson.zst`: a Zst-compressed line-delimited JSON file, containing a superset of the information in `blobs-scancode.csv.zst`. Each line is a JSON dictionary with three keys; the `licenses` and `copyrights` values correspond to the ScanCode calls `scancode.api.get_licenses(..., min_score=0)` and `scancode.api.get_copyrights(...)`, respectively.
There is exactly one line for each blob. The `licenses` and `copyrights` keys are omitted for files not detected as plain text.
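
Readers of the line-delimited JSON need to handle records where `licenses` and `copyrights` are absent. The sketch below separates scanned from unscanned records. It makes several assumptions: the input is already decompressed, the third key is named `sha1`, and the values under `licenses` and `copyrights` are treated as opaque JSON. The sample records are invented and `split_by_scan_status` is a hypothetical name.

```python
import io
import json

# Hypothetical records; the "sha1" key name and the empty values are assumptions.
SAMPLE = "\n".join([
    json.dumps({"sha1": "8624bcdae55baeef00cd11d5dfcfa60f68710a02",
                "licenses": {}, "copyrights": {}}),
    json.dumps({"sha1": "0000000000000000000000000000000000000002"}),
]) + "\n"


def split_by_scan_status(lines):
    """Split records into those with ScanCode results and those without."""
    scanned, skipped = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "licenses" in record and "copyrights" in record:
            scanned.append(record)
        else:
            # Keys are omitted for files not detected as plain text.
            skipped.append(record)
    return scanned, skipped


scanned, skipped = split_by_scan_status(io.StringIO(SAMPLE))
print(len(scanned), "scanned;", len(skipped), "without ScanCode results")
```

Processing one line at a time keeps memory use flat, which matters for a file of this size.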
`blobs-origins.csv.zst`: a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob with one of its origins in the format `swhid,url`, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,https://github.com/pombreda/Artemis
Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.
If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the blob's origins were either still being loaded when the dataset was generated, or the loader process crashed before completing their ingestion.
`blobs-nb-origins.csv.zst`: a Zst-compressed CSV mapping of how many origins of this blob are known to Software Heritage. Each line in the index associates a license blob with this count in the format `SWHID,NUMBER`, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,2822260
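
The origin sample and the origin count share the SWHID as a key, so the two mappings can be joined. The sketch below is one hedged way to do it: it assumes both files are decompressed and start with a header line (an assumption for these two files), and it maps a blank URL to `None`. The second sample row and both header lines are invented, and the helper names are hypothetical.

```python
import csv
import io

# First data row of each sample is the example from the text; the rest is made up.
ORIGINS = """swhid,url
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,https://github.com/pombreda/Artemis
swh:1:cnt:0000000000000000000000000000000000000003,
"""
COUNTS = """swhid,number
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,2822260
swh:1:cnt:0000000000000000000000000000000000000003,1
"""


def load_pairs(lines):
    """Read a two-column CSV into a dict, skipping the header row."""
    reader = csv.reader(lines)
    next(reader, None)  # assumed header line
    return {row[0]: row[1] for row in reader if row}


def summarize(origins_lines, counts_lines):
    """Map each SWHID to (number of origins, one sample origin URL or None)."""
    origins = load_pairs(origins_lines)
    counts = load_pairs(counts_lines)
    summary = {}
    for swhid, count in counts.items():
        url = origins.get(swhid) or None  # a blank URL means no origin was found
        summary[swhid] = (int(count), url)
    return summary


for swhid, (count, url) in summarize(io.StringIO(ORIGINS), io.StringIO(COUNTS)).items():
    print(swhid, count, url)
```

Loading both mappings into dictionaries is the simplest approach; for the full files, a streaming merge or a database import may be a better fit.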
`blobs-earliest.csv.zst`: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: `SWHID,EARLIEST_SWHID,EARLIEST_TS,OCCURRENCES`.
`licenses-annotated-sample.tar.gz`: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.
More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.
32 contents referenced by directories but missing from the archive (e.g., because they are too big) are now included in `license-blobs.csv.zst` (with an empty SHA1), `blobs-origins.csv.zst`, `blobs-nb-origins.csv.zst`, and `blobs-earliest.csv.zst`. They are not included in other files, as that would require access to their content.
If you use this dataset for research purposes, please acknowledge its use by citing the following paper:
The dataset has been built using primarily the data sources described in the following papers:
Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.