Index of /public/dataset/citation-blobs/2022-04-25

Name                       Last modified      Size
README.md                  2023-04-24 10:25   7.3K
blobs-earliest.csv.zst     2023-01-07 07:43   843K
blobs-fileinfo.csv.zst     2023-01-07 07:43   507K
blobs-nb-origins.csv.zst   2023-01-07 07:43   409K
blobs-origins.csv.zst      2023-01-07 07:43   543K
blobs.csv.zst              2023-01-07 07:43   792K
blobs.tar.zst              2023-01-07 07:43   4.6M

Software Heritage — Software Citation Blob Dataset

Authors: Roberto Di Cosmo and Valentin Lorentz

Contact: roberto@dicosmo.org

Description

This dataset contains all “software citation files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25.

In this context, a software citation file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file named codemeta.json or citation.cff, two kinds of machine-readable metadata files used to describe software or to cite it. The exact file name pattern used to select the blobs in the dataset can be found in the SQL query file 01-get-citation-swhids.sql. Note that the file was not required to be at the project root, because project subdirectories may contain different software modules with their own citation information, and we wanted to include those too.
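The selection rule described above can be illustrated with a minimal sketch. It is not the query from 01-get-citation-swhids.sql: the helper name `is_citation_file`, the sample paths, and the case-insensitive comparison are all assumptions made for illustration, and the authoritative pattern remains the one in the SQL file. The sketch only shows the idea of matching on the file name alone, at any directory depth.

```python
from pathlib import PurePosixPath

# Target names taken from the description above. The authoritative
# pattern lives in 01-get-citation-swhids.sql and may differ in detail.
CITATION_FILE_NAMES = {"codemeta.json", "citation.cff"}


def is_citation_file(path: str) -> bool:
    """Return True if the last path component is one of the target names.

    Hypothetical helper. Only the file name is inspected, so files in
    subdirectories match as well as files at the project root.
    """
    # Assumption: names are compared case-insensitively in this sketch.
    return PurePosixPath(path).name.lower() in CITATION_FILE_NAMES


# Made-up example paths, not taken from the dataset.
paths = [
    "codemeta.json",
    "modules/solver/CITATION.cff",
    "docs/citation.cff.bak",
    "src/main.py",
]
matches = [p for p in paths if is_citation_file(p)]
```

With these sample paths, `matches` keeps the root-level codemeta.json and the nested CITATION.cff, and rejects the two other entries because their file names are not exact matches.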

Format

The dataset is organized as follows:

Changes from the 2022-04-25 license dataset

This dataset is inspired by the license dataset, but uses different file names.

If you worked with the license dataset before and want to reuse your scripts on this one, beware of the following changes: