Index of /public/dataset/license-blobs/2022-04-25

Name                              Last modified     Size
Parent Directory                                    -
README.md                         2024-01-05 11:07  12K
blobs-earliest.csv.zst            2022-10-18 15:32  475M
blobs-fileinfo.csv.zst            2022-11-09 11:24  161M
blobs-nb-origins.csv.zst          2022-10-31 09:34  144M
blobs-origins.csv.zst             2022-10-11 17:52  231M
blobs-sample20k.tar.zst           2022-10-12 14:22  27M
blobs-scancode.csv.zst            2022-10-26 08:52  116M
blobs-scancode.ndjson.zst         2022-10-26 08:53  797M
blobs.tar.zst                     2022-11-09 11:40  13G
import-dataset.sql                2022-11-15 11:51  1.8K
license-blobs.csv.zst             2022-09-08 14:41  288M
licenses-annotated-sample.tar.gz  2023-05-17 07:23  808K
replication-package.tar.gz        2022-11-09 11:42  13K

Software Heritage — License Blob Dataset

Author: Stefano Zacchiroli
Contact: zack@upsilon.cc

WARNING: you are looking at an old version of this dataset. A new version is available at: ../latest/.

Description

This dataset contains all "license files" extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (A long-term archival version of this dataset is also available from Zenodo.)

In this context, a license file is a unique file content (or "blob") that appeared in a software origin archived by Software Heritage under a file name commonly used to ship licenses in software projects. Example names include COPYING, LICENSE, NOTICE, and COPYRIGHT. The exact file name pattern used to select the blobs in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file was not required to be at the project root: project subdirectories can contain licenses that differ from the top-level one, and we wanted to include those too.
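
As a rough illustration of name-based selection that ignores directory depth, the Python sketch below tests only the final path component. The pattern LICENSE_NAME_RE and the helper looks_like_license_file are hypothetical and exist only for this example; the pattern actually used for the dataset is the one in 01-select-blobs.sql and may well differ.

```python
import posixpath
import re

# Hypothetical pattern for illustration only; the dataset's real pattern
# lives in 01-select-blobs.sql and may differ.
LICENSE_NAME_RE = re.compile(
    r"^(copying|licen[cs]e|notice|copyright)(\.[a-z0-9]+)?$",
    re.IGNORECASE,
)


def looks_like_license_file(path: str) -> bool:
    """Return True if the last path component resembles a license file name."""
    name = posixpath.basename(path)
    return bool(LICENSE_NAME_RE.match(name))


if __name__ == "__main__":
    # Only the file name matters, not how deep the file sits in the tree.
    for candidate in ["LICENSE", "vendor/lib/COPYING.txt", "src/main.c"]:
        print(candidate, looks_like_license_file(candidate))
```

Matching on the base name alone is what lets a license in a vendored subdirectory be selected alongside the top-level one.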

Format

The dataset is organized as follows:

Changes since the 2021-03-23 dataset

Errata

A file named .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:

pv blobs-fileinfo.csv.zst | zstdcat | grep -v "\.tmp" | zstd -19
pv blobs.tar.zst | zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12
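
For readers who prefer to apply the same row filter outside the shell, the first command's grep -v step can be sketched in Python as follows. The helper name drop_tmp_rows and the sample rows are made up for illustration; the sketch works on already-decompressed text lines and does not handle the zstd compression itself.

```python
import re
from typing import Iterable, Iterator

# Same regular expression as the grep call: a literal ".tmp" anywhere in the line.
TMP_RE = re.compile(r"\.tmp")


def drop_tmp_rows(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not contain '.tmp' (like grep -v)."""
    for line in lines:
        if not TMP_RE.search(line):
            yield line


if __name__ == "__main__":
    # Made-up rows; the real column layout of blobs-fileinfo.csv is not assumed.
    sample = ["row-a,1\n", ".tmp_example,2\n", "row-b,3\n"]
    print("".join(drop_tmp_rows(sample)), end="")
```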

The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.
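
One plausible reason for a gap of this kind is that a filesystem allocates space in whole blocks, so many small files occupy more disk space than the sum of their byte lengths. The Python sketch below illustrates the arithmetic under stated assumptions: a 4 KiB block size (a common ext4 default, not confirmed for the machine used here) and no inline data, sparse files, or metadata overhead. The function names and file sizes are made up for the example.

```python
from typing import Iterable

BLOCK_SIZE = 4096  # assumed block size in bytes; the real value depends on the filesystem


def apparent_size(sizes: Iterable[int]) -> int:
    """Sum of file lengths in bytes, independent of the filesystem."""
    return sum(sizes)


def allocated_size(sizes: Iterable[int], block_size: int = BLOCK_SIZE) -> int:
    """Sum of file lengths, each rounded up to a whole number of blocks."""
    # -(-s // block_size) is integer ceiling division.
    return sum(-(-s // block_size) * block_size for s in sizes)


if __name__ == "__main__":
    sizes = [100, 5000, 4096]  # made-up file sizes in bytes
    print(apparent_size(sizes))   # 9196
    print(allocated_size(sizes))  # 16384
```

With many small text files, as is typical for license blobs, the rounded-up total can exceed the apparent total by a wide margin.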

Citation

If you use this dataset for research purposes, please acknowledge its use by citing the following paper:

References

The dataset has been built using primarily the data sources described in the following papers: