Name | Last modified | Size |
---|---|---|
blobs-earliest.csv.zst | 2023-01-07 07:43 | 843K |
blobs-fileinfo.csv.zst | 2023-01-07 07:43 | 507K |
blobs-nb-origins.csv.zst | 2023-01-07 07:43 | 409K |
blobs-origins.csv.zst | 2023-01-07 07:43 | 543K |
blobs.csv.zst | 2023-01-07 07:43 | 792K |
blobs.tar.zst | 2023-01-07 07:43 | 4.6M |
README.md | 2023-04-24 10:25 | 7.3K |
Authors: Roberto Di Cosmo and Valentin Lorentz
Contact: roberto@dicosmo.org
This dataset contains all “software citation files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25.
In this context, a software citation file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file named codemeta.json or citation.cff, two kinds of machine-readable metadata files used for describing or citing software. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-get-citation-swhids.sql. Note that the file was not required to be at the project root, because project subdirectories may contain different software modules with different citation information, and we wanted to include those too.
The dataset is organized as follows:
blobs.tar.zst: a Zst-compressed tarball containing deduplicated citation blobs, one per file. The tarball contains 17’984 blobs, for a total uncompressed size of 118 MiB.
The blobs are organized in a sharded directory structure that contains files named like blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2, where:
blobs/ is the root directory containing all citation blobs
ff4779adbace73349374b9fd5d77a42ae4ec66c2 is the SHA1 checksum of a specific citation blob, a codemeta.json for rdflib in this case. Each citation blob is ultimately named with its SHA1:
$ head blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
{
"@context": [
"https://doi.org/doi:10.5063/schema/codemeta-2.0",
"http://schema.org"
],
"@type": "SoftwareSourceCode",
"identifier": "rdflib",
"description": "A friendly and concise user interface for performing common [...]",
"name": "rdflib: A high level wrapper around the 'redland' package for common 'rdf' applications ",
"codeRepository": "https://github.com/cboettig/rdflib",
$ sha1sum blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
ff4779adbace73349374b9fd5d77a42ae4ec66c2 blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
ff and 47 are, respectively, the first and second groups of two hex digits in the blob SHA1
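The sharding rule described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the dataset tooling: the helper names `blob_path` and `matches_name` are hypothetical, and the sketch assumes the tarball has been extracted so that `blobs/` is reachable from the working directory.

```python
import hashlib
from pathlib import PurePosixPath


def blob_path(sha1_hex: str, root: str = "blobs") -> str:
    """Return the sharded path of a blob: <root>/<hex 0-1>/<hex 2-3>/<sha1>."""
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError(f"not a hexadecimal SHA1: {sha1_hex!r}")
    return str(PurePosixPath(root, sha1_hex[0:2], sha1_hex[2:4], sha1_hex))


def matches_name(content: bytes, sha1_hex: str) -> bool:
    """Check that a blob's content hashes to the SHA1 it is named after."""
    return hashlib.sha1(content).hexdigest() == sha1_hex.lower()


print(blob_path("ff4779adbace73349374b9fd5d77a42ae4ec66c2"))
```

Reading a file at the returned path and passing its bytes to `matches_name` performs the same check as the `sha1sum` call shown above.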
blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a citation blob and is in the format SWHID,SHA1,NAME, for example:
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,ff4779adbace73349374b9fd5d77a42ae4ec66c2,"codemeta.json"
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"codemeta.json"
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"citation.cff"
where:
SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the citation blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e
SHA1: the blob SHA1, which can be used to cross-reference blobs in the blobs/ directory
NAME: a file name given to the citation blob in a given software origin. As the same citation blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 is the SWHID of the empty file).
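As an illustration of how the index can be consumed, here is a small Python sketch that groups file names by blob and recomputes a content SWHID. It assumes the index has already been decompressed to text (for example with `zstd -dc`). The function names are hypothetical, and the header row in `SAMPLE` is an assumption, since the exact column header text is not stated here. The SWHID of a content object is the Git-style blob hash, that is, the SHA1 of `blob <length>\0` followed by the content.

```python
import csv
import hashlib
import io
from collections import defaultdict


def content_swhid(content: bytes) -> str:
    """Compute the SWHID of a content object (Git-style blob SHA1)."""
    header = b"blob %d\x00" % len(content)
    return "swh:1:cnt:" + hashlib.sha1(header + content).hexdigest()


def names_by_blob(csv_text: str) -> dict[str, set[str]]:
    """Map each SWHID in the index to the set of file names it was seen under."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (its exact text is not relied upon)
    names = defaultdict(set)
    for swhid, _sha1, name in reader:
        names[swhid].add(name)
    return dict(names)


# Header line is an assumption; data rows are the examples quoted above.
SAMPLE = """SWHID,SHA1,NAME
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,ff4779adbace73349374b9fd5d77a42ae4ec66c2,"codemeta.json"
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"codemeta.json"
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"citation.cff"
"""

for swhid, names in names_by_blob(SAMPLE).items():
    print(swhid, sorted(names))
```

Using `csv.reader` rather than splitting on commas keeps the quoted NAME column intact even if a file name contains a comma.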
blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format SWHID,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE.
blobs-origins.csv.zst: a Zst-compressed CSV mapping of where citation blobs come from. Each line in the index associates a citation blob with one of its origins in the format SWHID,URL, for example:
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,https://github.com/denatahvildari/rdflib
Note that a citation blob can come from many different places; only one of them, chosen arbitrarily, is listed in this mapping.
If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the origin was either still being loaded when the dataset was generated, or the loader process crashed before completing the ingestion of the blob’s origin.
blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a citation blob with this count in the format SWHID,NUMBER, for example:
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,9
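The two origin mappings share the SWHID as key, so they can be joined into one record per blob. The following Python sketch shows one way to do that on already-decompressed text. The function names, the header rows, and the second row of `ORIGINS_SAMPLE` (a blob with a blank origin) are assumptions made for illustration only. A blank URL and a missing count are both mapped to `None`.

```python
import csv
import io
from typing import Optional


def read_mapping(csv_text: str) -> dict[str, str]:
    """Read a two-column CSV (SWHID, value) into a dict, skipping the header."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # header row
    return {row[0]: row[1] for row in reader if row}


def origin_summary(origins_text: str, counts_text: str) -> dict[str, tuple[Optional[str], Optional[int]]]:
    """Join the sample origin URL and the origin count for each blob."""
    origins = read_mapping(origins_text)
    counts = read_mapping(counts_text)
    summary = {}
    for swhid, url in origins.items():
        count = int(counts[swhid]) if swhid in counts else None
        summary[swhid] = (url.strip() or None, count)  # blank URL becomes None
    return summary


# Header lines and the blank-origin row are assumptions for illustration.
ORIGINS_SAMPLE = """SWHID,URL
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,https://github.com/denatahvildari/rdflib
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,
"""
COUNTS_SAMPLE = """SWHID,NUMBER
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,9
"""

for swhid, (url, count) in origin_summary(ORIGINS_SAMPLE, COUNTS_SAMPLE).items():
    print(swhid, url, count)
```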
blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive, in the format SWHID,EARLIEST_SWHID,EARLIEST_TS,OCCURRENCES.
This dataset is inspired by the license dataset, but uses different file names.
If you worked with the license dataset before and want to reuse your scripts on this one, beware of the following changes:
license-blobs.csv.zst is renamed blobs.csv.zst
Blob size is computed based on the real blob size, rather than the file size on ext4 filesystems
No blobs-scancode.csv.zst, blobs-scancode.ndjson.zst, or licenses-annotated-sample.tar.gz files, as they would not be relevant
No replication-package.tar.gz, as this dataset was generated with a new data pipeline, now part of swh-graph
Inconsistencies in file formats are fixed:
blobs-fileinfo.csv.zst identifies blobs by SWHID instead of SHA1
blobs-origins.csv.zst and blobs-nb-origins.csv.zst are comma-separated instead of space-separated, and each has a header
blobs-earliest.csv.zst is comma-separated instead of space-separated
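For scripts that must read both layouts, the separator and header differences listed above can be isolated in one small helper. This is a hedged sketch: `read_pairs` is a hypothetical name, the legacy sample line is a made-up row in the old space-separated layout, and the header text in `NEW_LINES` is an assumption. It covers the two-column origin mappings only and splits on the first separator rather than doing full CSV parsing.

```python
from typing import Iterable


def read_pairs(lines: Iterable[str], legacy: bool = False) -> list[tuple[str, str]]:
    """Read (SWHID, value) pairs from a two-column mapping.

    legacy=True:  space-separated, no header (layout of the license dataset).
    legacy=False: comma-separated, first line is a header (this dataset).
    """
    separator = " " if legacy else ","
    rows = iter(lines)
    if not legacy:
        next(rows, None)  # drop the header row
    pairs = []
    for line in rows:
        line = line.rstrip("\n")
        if not line:
            continue
        swhid, _, value = line.partition(separator)  # split on first separator only
        pairs.append((swhid, value))
    return pairs


# Illustrative rows; the legacy line and the header text are assumptions.
LEGACY_LINES = ["swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e 9\n"]
NEW_LINES = ["SWHID,NUMBER\n", "swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,9\n"]

print(read_pairs(LEGACY_LINES, legacy=True) == read_pairs(NEW_LINES))
```

Splitting on the first separator keeps an origin URL intact even if it contains further commas or spaces, which is sufficient as long as the values are not quoted.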