Index of /public/dataset/license-blobs/latest

Name                        Last modified     Size
README.md                   2024-01-05 11:07  11K
blobs-earliest.csv.zst      2023-04-24 10:23  333M
blobs-fileinfo.csv.zst      2023-04-24 10:25  208M
blobs-nb-origins.csv.zst    2023-04-24 10:26  163M
blobs-origins.csv.zst       2023-04-24 10:26  284M
blobs-sample20k.tar.zst     2023-04-24 10:27  36M
blobs-scancode.csv.zst      2023-04-24 10:27  144M
blobs-scancode.ndjson.zst   2023-04-24 10:29  890M
blobs.csv.zst               2023-04-24 10:30  335M
blobs.tar.zst               2023-04-24 10:55  14G

Software Heritage — License Blob Dataset

Author: Stefano Zacchiroli


This dataset contains all "license files" extracted from a snapshot of the Software Heritage archive taken on 2022-12-07. (A long-term archival version of this dataset is also available from Zenodo.)

In this context, a license file is a unique file content (or "blob") that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Examples of such names include COPYING, LICENSE, NOTICE, and COPYRIGHT. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file was not required to be at the project root, because project subdirectories can contain licenses that differ from the top-level one, and we wanted to include those too.
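
As a rough illustration of name-based selection at any directory depth, here is a minimal Python sketch. The regular expression and the function name `looks_like_license_file` are assumptions made for this example only; the pattern actually used to build the dataset is the one in 01-select-blobs.sql and may differ.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern for illustration only. The dataset's real selection
# pattern is defined in 01-select-blobs.sql and may differ from this one.
LICENSE_NAME_RE = re.compile(
    r"^(copying|licen[cs]e|notice|copyright)(\.[a-z0-9]+)?$",
    re.IGNORECASE,
)


def looks_like_license_file(path: str) -> bool:
    """Return True if the last path component looks like a license file name.

    Only the file name is inspected, so a match can occur at any directory
    depth, not just at the project root.
    """
    name = PurePosixPath(path).name
    return LICENSE_NAME_RE.match(name) is not None


if __name__ == "__main__":
    for p in ["LICENSE", "vendor/libfoo/COPYING.txt", "src/main.c"]:
        print(p, looks_like_license_file(p))
```

Because only the final path component is checked, a file such as `vendor/libfoo/COPYING.txt` is selected just like a top-level `LICENSE`.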


The dataset is organized as follows:

Changes from the 2022-04-25 dataset



If you use this dataset for research purposes, please acknowledge its use by citing at least one of the following papers:


The dataset has been built using primarily the data sources described in the following papers: