Software Heritage — License Blob Dataset

Author: Stefano Zacchiroli


This dataset contains all "license files" extracted from a snapshot of the Software Heritage archive taken on 2021-03-23. A long-term archival version of this dataset is also available from Zenodo.

In this context, a license file is a unique file content (or "blob") that appeared in a software origin archived by Software Heritage under a file name commonly used to ship licenses in software projects. Examples of such names include COPYING, LICENSE, NOTICE, and COPYRIGHT. The exact file name pattern used to select the blobs in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that files were not required to be at the project root: project subdirectories can carry licenses that differ from the top-level one, and we wanted to include those too.
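To illustrate the selection rule described above, here is a minimal Python sketch of matching on the file name alone, regardless of the directory the file lives in. The regular expression is a hypothetical stand-in that covers only the four example names listed above, plus an assumed optional extension and case-insensitive matching; it is not the pattern from 01-select-blobs.sql, which remains the authoritative definition. The sample paths are also made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern for illustration only. It is NOT the pattern used in
# 01-select-blobs.sql; it covers just the example names given in the text.
# The optional extension and case-insensitivity are assumptions.
LICENSE_NAME_RE = re.compile(
    r"(copying|license|notice|copyright)(\.[a-z0-9]+)?",
    re.IGNORECASE,
)


def is_license_file(path: str) -> bool:
    """Return True if the last path component looks like a license file name.

    Only the file name is inspected, so files in subdirectories qualify
    just like files at the project root.
    """
    name = PurePosixPath(path).name
    return LICENSE_NAME_RE.fullmatch(name) is not None


# Made-up example paths, not taken from the dataset.
paths = [
    "LICENSE",
    "vendor/libfoo/COPYING",
    "docs/NOTICE.txt",
    "src/main.c",
]
selected = [p for p in paths if is_license_file(p)]
```

With these sample paths, the three license-like files are kept wherever they sit in the tree, while src/main.c is skipped.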


The dataset is organized as follows:


If you use this dataset for research purposes, please acknowledge its use by citing the following paper:


The dataset was built primarily using the data sources described in the following papers: