Index of /public/dataset/license-blobs/2021-03-23

Name                         Last modified     Size  Description

Parent Directory                               -
README.md                    2024-01-05 11:07  8.2K
blobs-earliest.csv.zst       2022-01-26 16:35  284M
blobs-fileinfo.csv.zst       2022-01-26 16:35  161M
blobs-origins.csv.zst        2021-10-03 18:52  214M
blobs-sample20k.tar.zst      2021-10-03 09:08  30M
blobs-scancode.csv.zst       2022-01-26 16:36  119M
blobs.tar.zst                2021-10-23 14:06  13G
license-blobs.csv.zst        2021-10-01 11:11  273M
replication-package.tar.gz   2022-03-23 10:55  6.2K

Software Heritage — License Blob Dataset

Author: Stefano Zacchiroli

WARNING: you are looking at an old version of this dataset. A newer version is available at: ../latest/.


This dataset contains all "license files" extracted from a snapshot of the Software Heritage archive taken on 2021-03-23. (A long-term archival version of this dataset is also available from Zenodo.)

In this context, a license file is a unique file content (or "blob") that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Examples of such names include COPYING, LICENSE, NOTICE, and COPYRIGHT. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that matching files were not required to be at the project root, because project subdirectories can contain licenses different from the top-level one, and we wanted to include those too.
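
The selection rule described above (match on the file name only, at any depth in the project tree) can be sketched as follows. This is a minimal illustration, not the query used to build the dataset: the regular expression `LICENSE_NAME_RE` and the helper `is_license_path` are hypothetical, cover only the four example names given above, and may differ substantially from the actual pattern in 01-select-blobs.sql.

```python
import posixpath
import re

# Hypothetical pattern for illustration only; the real selection pattern
# lives in 01-select-blobs.sql and is not reproduced here.
LICENSE_NAME_RE = re.compile(
    r"^(copying|licen[cs]e|notice|copyright)(\.[a-z0-9]+)?$",
    re.IGNORECASE,
)


def is_license_path(path: str) -> bool:
    """Return True if the last path component looks like a license file name.

    Only the base name is tested, so files in subdirectories qualify
    just like files at the project root.
    """
    basename = posixpath.basename(path)
    return bool(LICENSE_NAME_RE.match(basename))


# Made-up paths, not taken from the dataset.
paths = ["LICENSE", "vendor/libfoo/COPYING.txt", "src/main.c", "docs/NOTICE"]
selected = [p for p in paths if is_license_path(p)]
print(selected)
```

Testing only the base name is what allows a nested file such as `vendor/libfoo/COPYING.txt` to be selected alongside a top-level `LICENSE`.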


The dataset is organized as follows:


If you use this dataset for research purposes, please acknowledge its use by citing the following paper:


The dataset has been built using primarily the data sources described in the following papers: