Index of /public/dataset/license-blobs/2024-12-06

Name                        Last modified     Size

blobs-earliest.csv.zst      2025-01-24 09:50  496M
blobs-fileinfo.csv.zst      2025-01-24 09:50  276M
blobs-nb-origins.csv.zst    2025-01-24 09:50  229M
blobs-origins.csv.zst       2025-01-24 09:50  382M
blobs-sample20k.tar.zst     2025-01-24 09:50  69M
blobs-scancode.csv.zst      2025-01-24 09:50  218M
blobs-scancode.ndjson.zst   2025-01-24 09:50  1.4G
blobs.csv.zst               2025-01-24 09:50  526M
blobs.tar.zst               2025-01-24 09:52  24G
README.md                   2025-01-24 11:05  10K

Software Heritage — License Blob Dataset

Author: Stefano Zacchiroli
Contact: zack@upsilon.cc

Description

This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2024-12-06. A long-term archival version of this dataset is also available from Zenodo.

In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage under a file name commonly used to ship licenses in software projects. Example names include COPYING, LICENSE, NOTICE, and COPYRIGHT. The exact file name pattern used to select the blobs in the dataset can be found in the SQL query. Note that matching files were not required to be at the project root: project subdirectories can contain licenses that differ from the top-level one, and we wanted to include those too.
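
As a rough illustration of the selection idea described above, the following Python sketch matches on the file name alone, regardless of how deep the file sits in the project tree. The regular expression is hypothetical: it covers only the example names given here and is not the pattern used to build the dataset (that one is in the SQL query). Case-insensitive prefix matching and the helper name `looks_like_license_file` are likewise assumptions made for the example.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern, NOT the dataset's actual one (see the SQL query).
# It only covers the example names from the text; matching case-insensitively
# and on the start of the name are assumptions of this sketch.
LICENSE_NAME_RE = re.compile(r"^(COPYING|LICENSE|NOTICE|COPYRIGHT)", re.IGNORECASE)


def looks_like_license_file(path: str) -> bool:
    """Return True if the last path component matches, at any directory depth."""
    return LICENSE_NAME_RE.match(PurePosixPath(path).name) is not None


# Made-up paths, for illustration only.
paths = [
    "LICENSE",
    "vendor/libfoo/COPYING.txt",
    "src/main.c",
    "docs/NOTICE",
]
selected = [p for p in paths if looks_like_license_file(p)]
print(selected)
```

Only the final path component is tested, so a license shipped in a vendored subdirectory is selected just like one at the project root.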

Format

The dataset is organized as follows:

Changes from the 2022-12-07 dataset

Citation

If you use this dataset for research purposes, please acknowledge its use by citing the following paper:

References

The dataset has been built using primarily the data sources described in the following papers: