Name | Last modified | Size | Description
---|---|---|---
README.md | 2023-10-17 10:21 | 6.8K |
blobs-earliest.csv.zst | 2023-03-01 11:14 | 1.1M |
blobs-fileinfo.csv.zst | 2023-03-01 11:14 | 666K |
blobs-nb-origins.csv.zst | 2023-03-01 11:14 | 534K |
blobs-origins.csv.zst | 2023-03-01 11:14 | 719K |
blobs.csv.zst | 2023-03-01 11:14 | 1.0M |
blobs.tar.zst | 2023-03-01 11:14 | 6.2M |
Authors: Roberto Di Cosmo and Valentin Lorentz
Contact: roberto@dicosmo.org
This dataset contains all “software citation files” extracted from a snapshot of the Software Heritage archive taken on 2022-12-07.
In this context, a software citation file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file named codemeta.json or citation.cff, two kinds of machine-readable metadata files used for describing software or citing it. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-get-citation-swhids.sql. Note that the file was not required to be at the project root, because project subdirectories may contain different software modules with different citation information, and we wanted to include those too.
The dataset is organized as follows:
blobs.tar.zst: a Zst-compressed tarball containing deduplicated citation blobs, one per file. The tarball contains 23’452 blobs, for a total uncompressed size of 142MiB. The blobs are organized in a sharded directory structure that contains files named like blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2, where:
blobs/ is the root directory containing all citation blobs
ff4779adbace73349374b9fd5d77a42ae4ec66c2 is the SHA1 checksum of a specific citation blob, a codemeta.json for rdflib in this case. Each citation blob is ultimately named with its SHA1:
$ head blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
{
"@context": [
"https://doi.org/doi:10.5063/schema/codemeta-2.0",
"http://schema.org"
],
"@type": "SoftwareSourceCode",
"identifier": "rdflib",
"description": "A friendly and concise user interface for performing common [...]",
"name": "rdflib: A high level wrapper around the 'redland' package for common 'rdf' applications ",
"codeRepository": "https://github.com/cboettig/rdflib",
$ sha1sum blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
ff4779adbace73349374b9fd5d77a42ae4ec66c2 blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
ff and 47 are, respectively, the first and second group of two hex digits in the blob SHA1
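The sharded layout described above can be derived from a SHA1 alone. The following is a minimal sketch in Python; the helper names `blob_path` and `matches_sha1` are illustrative and not part of the dataset tooling, and the code assumes the tarball has already been extracted.

```python
import hashlib
from pathlib import PurePosixPath


def blob_path(sha1: str, root: str = "blobs") -> str:
    """Return the sharded path of a blob inside the extracted tarball."""
    sha1 = sha1.lower()
    if len(sha1) != 40 or any(c not in "0123456789abcdef" for c in sha1):
        raise ValueError(f"not a hex SHA1: {sha1!r}")
    # The first and second groups of two hex digits form the two shard levels.
    return str(PurePosixPath(root, sha1[0:2], sha1[2:4], sha1))


def matches_sha1(content: bytes, sha1: str) -> bool:
    """Check that raw blob content hashes to the SHA1 it is named after."""
    return hashlib.sha1(content).hexdigest() == sha1.lower()


print(blob_path("ff4779adbace73349374b9fd5d77a42ae4ec66c2"))
# prints: blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
```

`matches_sha1` performs the same check as the `sha1sum` invocation shown above, so it can be used to validate an extracted blob against its file name.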
blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a citation blob and is in the format SWHID,SHA1,NAME, for example:
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,ff4779adbace73349374b9fd5d77a42ae4ec66c2,"codemeta.json"
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"codemeta.json"
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"citation.cff"
where:
SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the citation blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e
SHA1: the blob SHA1, which can be used to cross-reference blobs in the blobs/ directory
NAME: a file name given to the citation blob in a given software origin. As the same citation blob can have different names in different contexts, the index may contain multiple entries for the same blob with different names, as is the case in the example above (swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 is the SWHID of the empty file).
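As an illustration of how the index can be consumed, the sketch below groups file names by blob and shows how the SWHID column relates to the blob content: a content SWHID is the Git-style SHA1 of the content prefixed with a `blob <length>\0` header, which is why it differs from the plain SHA1 column. The function names are illustrative, the header text in the sample is an assumption (only the existence of a header line is documented), and the code assumes blobs.csv.zst has been decompressed beforehand, for instance with the zstd command-line tool.

```python
import csv
import hashlib
import io
from collections import defaultdict


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file content (Git-style salted SHA1)."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def names_by_blob(lines):
    """Map each SWHID to the set of file names listed for it in the index."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header line
    names = defaultdict(set)
    for swhid, _sha1, name in reader:
        names[swhid].add(name)
    return dict(names)


# Sample rows copied from the example above; the header text is assumed.
sample = io.StringIO(
    "SWHID,SHA1,NAME\n"
    'swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,ff4779adbace73349374b9fd5d77a42ae4ec66c2,"codemeta.json"\n'
    'swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"codemeta.json"\n'
    'swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"citation.cff"\n'
)
index = names_by_blob(sample)
print(sorted(index[content_swhid(b"")]))
# prints: ['citation.cff', 'codemeta.json']
```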
blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format SWHID,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE
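A possible way to load this mapping into typed records is sketched below. Several things here are assumptions rather than documented facts: that the three count columns parse as integers, that SIZE is expressed in bytes, and that rows not starting with `swh:` (such as a header line, if present) can be skipped. The sample values are made up for illustration, and the file is assumed to have been decompressed beforehand.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class FileInfo(NamedTuple):
    swhid: str
    mime_type: str
    encoding: str
    line_count: int
    word_count: int
    size: int  # assumed to be in bytes


def read_fileinfo(lines: Iterable[str]) -> Iterator[FileInfo]:
    """Yield one FileInfo per data row, skipping anything that is not a SWHID row."""
    for row in csv.reader(lines):
        if len(row) != 6 or not row[0].startswith("swh:"):
            continue  # header line (if any) or malformed row
        swhid, mime, enc, n_lines, n_words, size = row
        yield FileInfo(swhid, mime, enc, int(n_lines), int(n_words), int(size))


# Made-up rows, for illustration only; real values will differ.
sample = io.StringIO(
    "SWHID,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE\n"
    "swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,application/json,us-ascii,40,120,1500\n"
    "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,inode/x-empty,binary,0,0,0\n"
)
infos = list(read_fileinfo(sample))
total_size = sum(info.size for info in infos)
```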
blobs-origins.csv.zst: a Zst-compressed CSV mapping of where citation blobs come from. Each line in the index associates a citation blob with one of its origins in the format SWHID,URL, for example:
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,https://github.com/denatahvildari/rdflib
Note that a citation blob can come from many different places; only one of them, chosen arbitrarily, is listed in this mapping.
If no origin URL is found in the Software Heritage archive, a blank is used instead. This happens when the origin was still being loaded when the dataset was generated, or when the loader process crashed before completing the ingestion of the blob’s origin.
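Because the URL column may be blank, code reading this mapping should treat a missing origin explicitly. The sketch below is one way to do so; the function name is illustrative, the second sample row uses a placeholder SWHID invented for the example, and the file is assumed to have been decompressed beforehand.

```python
import csv
import io
from typing import Iterable, Iterator, Optional, Tuple


def read_origins(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (swhid, url) pairs; url is None when no origin is recorded."""
    for row in csv.reader(lines):
        if not row or not row[0].startswith("swh:"):
            continue  # header line (if any) or empty line
        url = row[1].strip() if len(row) > 1 else ""
        yield row[0], (url or None)


# First row copied from the example above; the second is a hypothetical
# placeholder showing a blank origin.
sample = io.StringIO(
    "swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,https://github.com/denatahvildari/rdflib\n"
    "swh:1:cnt:0000000000000000000000000000000000000000,\n"
)
origins = dict(read_origins(sample))
missing = [swhid for swhid, url in origins.items() if url is None]
```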
blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a citation blob with this count in the format SWHID,NUMBER, for example:
swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,9
blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID,EARLIEST_SWHID,EARLIEST_TS,OCCURRENCES
Note that the timestamps in blobs-earliest.csv.zst are shifted 1 or 2 hours back, based on the Europe/Paris timezone.