Index of /public/dataset/license-blobs/2021-03-23

Name	Last modified	Size

Parent Directory		-
blobs.tar.zst	2021-10-23 14:06	13G
blobs-earliest.csv.zst	2022-01-26 16:35	284M
license-blobs.csv.zst	2021-10-01 11:11	273M
blobs-origins.csv.zst	2021-10-03 18:52	214M
blobs-fileinfo.csv.zst	2022-01-26 16:35	161M
blobs-scancode.csv.zst	2022-01-26 16:36	119M
blobs-sample20k.tar.zst	2021-10-03 09:08	30M
README.md	2024-01-05 11:07	8.2K
replication-package.tar.gz	2022-03-23 10:55	6.2K

Software Heritage — License Blob Dataset

Author: Stefano Zacchiroli
Contact: zack@upsilon.cc

WARNING: you are looking at an old version of this dataset. A newer version is available at: ../latest/.

Description

This dataset contains all "license files" extracted from a snapshot of the Software Heritage archive taken on 2021-03-23. (A long-term archival version of this dataset is also available from Zenodo.)

In this context, a license file is a unique file content (or "blob") that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file was not required to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.

Format

The dataset is organized as follows:

blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6'008'466 blobs, for a total uncompressed size on disk of 52 GiB.

The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:
- blobs/ is the root directory containing all license blobs
- 8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is named after its SHA1:
  
  $ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
  GNU GENERAL PUBLIC LICENSE
  Version 3, 29 June 2007
  
  $ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
  8624bcdae55baeef00cd11d5dfcfa60f68710a02  blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
- 86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1
blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing "only" 20'000 randomly selected license blobs
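The sharding scheme described above can be sketched in a few lines of Python. This is only an illustration, assuming the tarball has already been extracted; the helper names `blob_path` and `matches_name` are hypothetical and not part of the dataset or its replication package.

```python
import hashlib
from pathlib import PurePosixPath


def blob_path(sha1_hex: str, root: str = "blobs") -> PurePosixPath:
    """Return the sharded location of a blob: <root>/<hex 0-1>/<hex 2-3>/<sha1>."""
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError(f"not a hex-encoded SHA1: {sha1_hex!r}")
    return PurePosixPath(root) / sha1_hex[0:2] / sha1_hex[2:4] / sha1_hex


def matches_name(content: bytes, sha1_hex: str) -> bool:
    """Check that a blob's content hashes to the SHA1 it is named after."""
    return hashlib.sha1(content).hexdigest() == sha1_hex.lower()


# The GPL3 example from the text above.
print(blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02"))
```

After extraction, `matches_name(path.read_bytes(), path.name)` performs the same integrity check as the `sha1sum` invocation shown earlier.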
license-blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:
```
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,COPYING
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,COPYING.GPL3
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,COPYING.GLP-3
```
where:
- SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
- SHA1: the blob SHA1, which can be used to cross-reference blobs in the blobs/ directory
- NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it's an original typo from some repository!).
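As a rough illustration of how the index might be consumed, the sketch below groups file names by blob SHA1 and recomputes a content SWHID from raw bytes. It assumes the index has already been decompressed into text lines (for example with the `zstd` command-line tool) and that the header row is simply skipped; the exact header text used in the sample is a guess, and the function names are hypothetical. The SWHID of a content object is the same hash Git uses for blobs, so it differs from the plain SHA1 in the second column.

```python
import csv
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Set


def names_by_sha1(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Group the NAME column by SHA1, skipping the first (header) row."""
    reader = csv.reader(lines)
    next(reader, None)  # header row; its exact wording is not assumed here
    names: Dict[str, Set[str]] = defaultdict(set)
    for swhid, sha1, name in reader:
        names[sha1].add(name)
    return dict(names)


def content_swhid(content: bytes) -> str:
    """Compute the SWHID of a blob: SHA1 over a Git-style 'blob <len>\\0' header plus content."""
    header = b"blob %d\x00" % len(content)
    return "swh:1:cnt:" + hashlib.sha1(header + content).hexdigest()


# Sample rows copied from the example above; the header line is hypothetical.
sample = [
    "SWHID,SHA1,NAME\n",
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,COPYING\n",
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,COPYING.GPL3\n",
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,COPYING.GLP-3\n",
]
print(names_by_sha1(sample))
```

Holding every name in memory is fine for a sample like this; for the full index, streaming the rows and aggregating only what is needed is likely a better fit.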
blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:
- SHA1: blob SHA1
- MIME_TYPE: blob MIME type, as detected by libmagic
- ENCODING: blob character encoding, as detected by libmagic
- LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)
- WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)
- SIZE: blob size in bytes
blobs-scancode.csv.zst: a Zst-compressed CSV mapping from blobs to the software licenses detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:
- SHA1: blob SHA1
- LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)
- SCORE: confidence score in the result, as a decimal number between 0 and 100
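Since a blob may appear on several rows, one per detected license, a common first step is to keep only the highest-scoring detection for each blob. The following is a minimal sketch under stated assumptions: the input is already decompressed text, whether the file starts with a header row is not stated above (hence the `skip_header` switch), and the function name, the `min_score` threshold, and the sample rows are all made up for illustration.

```python
import csv
from typing import Dict, Iterable, Tuple


def best_license_per_blob(
    lines: Iterable[str],
    skip_header: bool = False,
    min_score: float = 0.0,
) -> Dict[str, Tuple[str, float]]:
    """Map each SHA1 to its (LICENSE, SCORE) pair with the highest score."""
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)
    best: Dict[str, Tuple[str, float]] = {}
    for sha1, license_id, score_text in reader:
        score = float(score_text)  # SCORE is a decimal between 0 and 100
        if score < min_score:
            continue
        if sha1 not in best or score > best[sha1][1]:
            best[sha1] = (license_id, score)
    return best


# Made-up rows, only to show the expected shape of the input.
blob_a, blob_b = "a" * 40, "b" * 40
sample = [
    f"{blob_a},MIT,95.5\n",
    f"{blob_a},Apache-2.0,40.0\n",
    f"{blob_b},GPL-3.0-only,100\n",
]
print(best_license_per_blob(sample, min_score=50))
```

Ties are resolved in favor of the first row seen; whether that is appropriate depends on the analysis at hand.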
blobs-origins.csv.zst: a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob with one of its origins in the format SWHID<TAB>URL, for example:
```
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2  https://github.com/pombreda/Artemis
```
Note that a license blob can come from many different places; only one arbitrary (in fact, random) origin is listed in this mapping.
blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID<TAB>EARLIEST_SWHID<TAB>EARLIEST_TS<TAB>OCCURRENCES, where:
- SWHID: blob SWHID
- EARLIEST_SWHID: SWHID of the earliest known commit containing the blob
- EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer
- OCCURRENCES: number of known commits containing the blob
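The tab-separated format above can be parsed one line at a time. The sketch below assumes decompressed text input with no header row and interprets EARLIEST_TS as seconds since the Unix epoch in UTC; the `Earliest` type, the `parse_earliest` function, and the sample line (including its placeholder commit identifier) are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Earliest(NamedTuple):
    swhid: str
    earliest_swhid: str
    earliest_ts: datetime
    occurrences: int


def parse_earliest(line: str) -> Earliest:
    """Parse one SWHID<TAB>EARLIEST_SWHID<TAB>EARLIEST_TS<TAB>OCCURRENCES line."""
    swhid, earliest_swhid, ts_text, occ_text = line.rstrip("\n").split("\t")
    timestamp = datetime.fromtimestamp(int(ts_text), tz=timezone.utc)
    return Earliest(swhid, earliest_swhid, timestamp, int(occ_text))


# Made-up values: the commit SWHID is a placeholder, not a real archive object.
line = (
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2\t"
    "swh:1:rev:" + "0" * 40 + "\t1600000000\t42\n"
)
record = parse_earliest(line)
print(record.earliest_ts.isoformat(), record.occurrences)
```

Converting to a timezone-aware `datetime` up front avoids accidental use of the local time zone when comparing or bucketing timestamps.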
replication-package.tar.gz: code and scripts used to produce the dataset

Citation

If you use this dataset for research purposes, please acknowledge its use by citing the following paper:

Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM 2022.

References

The dataset has been built using primarily the data sources described in the following papers:

Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.