Software Heritage — Software Citation Blob Dataset

Authors: Roberto Di Cosmo and Valentin Lorentz Contact:


This dataset contains all “software citation files” extracted from a snapshot of the Software Heritage archive taken on 2023-09-06.

In this context, a software citation file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file named codemeta.json or citation.cff, two kinds of machine-readable metadata files used to describe or cite software. The exact file name pattern used to select the blobs in the dataset can be found in the SQL query file 01-get-citation-swhids.sql. Note that the file was not required to be at the project root: project subdirectories may contain different software modules with their own citation information, and we wanted to include those too.
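As a rough illustration of the selection rule described above, the following Python sketch tests whether a path would qualify based on its final component alone, regardless of directory depth. The function name is_citation_file and the case-insensitive comparison are assumptions made for this example; the authoritative pattern is the one in 01-get-citation-swhids.sql and may differ.

```python
from pathlib import PurePosixPath

# Assumed set of file names, compared case-insensitively (an assumption);
# the authoritative pattern is in 01-get-citation-swhids.sql.
CITATION_FILE_NAMES = {"codemeta.json", "citation.cff"}


def is_citation_file(path: str) -> bool:
    """Return True if the last path component is a citation file name.

    Only the base name is checked, so files in subdirectories match too.
    """
    return PurePosixPath(path).name.lower() in CITATION_FILE_NAMES
```

For instance, under these assumptions both "CITATION.cff" at a project root and "lib/module/codemeta.json" in a subdirectory would be selected, while "docs/citation.cff.bak" would not.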


The dataset is organized as follows:

Changes from the 2022-12-07 dataset