Index of /public/dataset/citation-blobs/2022-04-25

Name	Last modified	Size

Parent Directory		-
blobs-earliest.csv.zst	2023-01-07 07:43	843K
blobs-fileinfo.csv.zst	2023-01-07 07:43	507K
blobs-nb-origins.csv.zst	2023-01-07 07:43	409K
blobs-origins.csv.zst	2023-01-07 07:43	543K
blobs.csv.zst	2023-01-07 07:43	792K
blobs.tar.zst	2023-01-07 07:43	4.6M
README.md	2023-04-24 10:25	7.3K

Software Heritage — Software Citation Blob Dataset

Authors: Roberto Di Cosmo and Valentin Lorentz
Contact: roberto@dicosmo.org

Description

This dataset contains all “software citation files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25.

In this context, a software citation file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is codemeta.json or citation.cff, two kinds of machine-readable metadata files used for describing software or citing it. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-get-citation-swhids.sql. Note that the file was not required to be at the project root, because project subdirectories may contain different software modules with different citation information, and we wanted to include those too.

Format

The dataset is organized as follows:

blobs.tar.zst: a Zst-compressed tarball containing deduplicated citation blobs, one per file. The tarball contains 17’984 blobs, for a total uncompressed size of 118MiB.

The blobs are organized in a sharded directory structure that contains files named like blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2, where:
- blobs/ is the root directory containing all citation blobs
- ff4779adbace73349374b9fd5d77a42ae4ec66c2 is the SHA1 checksum of a specific citation blob, a codemeta.json for rdflib in this case. Each citation blob is named after its SHA1:
```
$ head blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2 
{
  "@context": [
    "https://doi.org/doi:10.5063/schema/codemeta-2.0",
    "http://schema.org"
  ],
  "@type": "SoftwareSourceCode",
  "identifier": "rdflib",
  "description": "A friendly and concise user interface for performing common [...]",
  "name": "rdflib: A high level wrapper around the 'redland' package for common 'rdf' applications ",
  "codeRepository": "https://github.com/cboettig/rdflib",
$ sha1sum blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
ff4779adbace73349374b9fd5d77a42ae4ec66c2  blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
```
- ff and 47 are, respectively, the first and second group of two hex digits in the blob SHA1
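The sharding rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `blob_path` and `matches_name` are hypothetical and not part of the dataset tooling, and the tarball is assumed to have been extracted already.

```python
import hashlib
from pathlib import PurePosixPath


def blob_path(sha1_hex: str, root: str = "blobs") -> PurePosixPath:
    """Return the sharded path of a blob, given its 40-digit hex SHA1."""
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError(f"not a hex SHA1: {sha1_hex!r}")
    # First two hex digits, then the next two, then the full checksum.
    return PurePosixPath(root, sha1_hex[0:2], sha1_hex[2:4], sha1_hex)


def matches_name(content: bytes, sha1_hex: str) -> bool:
    """Check that a blob's content hashes to the SHA1 it is named after."""
    return hashlib.sha1(content).hexdigest() == sha1_hex.lower()


print(blob_path("ff4779adbace73349374b9fd5d77a42ae4ec66c2"))
# prints blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
```

`matches_name` does the same check as the `sha1sum` call shown above, which can be handy for validating an extracted copy of the tarball.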
blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a citation blob and is in the format SWHID,SHA1,NAME, for example:
```
  swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,ff4779adbace73349374b9fd5d77a42ae4ec66c2,"codemeta.json"
  swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"codemeta.json"
  swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"citation.cff"
```
where:
- SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the citation blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e
- SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory
- NAME: a file name given to the citation blob in a given software origin. As the same citation blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 is the SWHID of the empty file).
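As a rough sketch of how the index might be consumed, the Python below groups file names by SWHID and recomputes a content SWHID from raw bytes. It assumes the index has already been decompressed (for instance with the `zstd` command-line tool) and that a content SWHID is `swh:1:cnt:` followed by the Git blob hash of the content. The function names and the header line in the sample are illustrative assumptions.

```python
import csv
import hashlib
import io
from collections import defaultdict


def names_by_swhid(index_file):
    """Map each blob SWHID to the set of file names it appears under."""
    names = defaultdict(set)
    for row in csv.reader(index_file):
        # Skip the header line and anything that is not an index entry.
        if len(row) != 3 or not row[0].startswith("swh:1:cnt:"):
            continue
        swhid, _sha1, name = row
        names[swhid].add(name)
    return dict(names)


def content_swhid(content: bytes) -> str:
    """Compute a content SWHID (assumed here to be the Git blob hash)."""
    header = b"blob %d\0" % len(content)
    return "swh:1:cnt:" + hashlib.sha1(header + content).hexdigest()


# The header text below is an assumption; the three entries are the
# example lines quoted above.
sample = io.StringIO(
    "SWHID,SHA1,NAME\n"
    'swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,ff4779adbace73349374b9fd5d77a42ae4ec66c2,"codemeta.json"\n'
    'swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"codemeta.json"\n'
    'swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"citation.cff"\n'
)
index = names_by_swhid(sample)
print(sorted(index[content_swhid(b"")]))  # names given to the empty file
```

Grouping by SWHID rather than reading the index as a flat table avoids double-counting blobs that are listed under several names.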
blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format: SWHID,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:
- SWHID: blob SWHID
- MIME_TYPE: blob MIME type, as detected by libmagic
- ENCODING: blob character encoding, as detected by libmagic
- LINE_COUNT: number of lines in the blob (only for textual blobs with UTF-8 encoding)
- WORD_COUNT: number of words in the blob (only for textual blobs with UTF-8 encoding)
- SIZE: blob size in bytes
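A minimal sketch of reading this mapping into typed records follows. It assumes the file has been decompressed beforehand and that LINE_COUNT and WORD_COUNT are left empty for blobs that are not UTF-8 text; that convention is a guess, as are the `FileInfo` class, the function names, and the sample values.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class FileInfo:
    swhid: str
    mime_type: str
    encoding: str
    line_count: Optional[int]  # None when the count is absent
    word_count: Optional[int]  # None when the count is absent
    size: int


def _optional_int(field: str) -> Optional[int]:
    """Parse an integer field, treating an empty field as missing."""
    return int(field) if field.strip() else None


def read_fileinfo(csv_file) -> Iterator[FileInfo]:
    """Yield one FileInfo per data row, skipping any header line."""
    for row in csv.reader(csv_file):
        if len(row) != 6 or not row[0].startswith("swh:1:cnt:"):
            continue
        yield FileInfo(row[0], row[1], row[2],
                       _optional_int(row[3]), _optional_int(row[4]),
                       int(row[5]))


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,application/json,us-ascii,40,120,1500\n"
    "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,inode/x-empty,binary,,,0\n"
)
records = list(read_fileinfo(sample))
total_size = sum(r.size for r in records)
print(len(records), total_size)
```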
blobs-origins.csv.zst: a Zst-compressed CSV mapping of where citation blobs come from. Each line in the index associates a citation blob with one of its origins in the format SWHID,URL, for example:
```
  swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,https://github.com/denatahvildari/rdflib
```
Note that a citation blob can come from many different places; only one of them, chosen arbitrarily, is listed in this mapping.

If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the blob’s origin was still being loaded when the dataset was generated, or when the loader process crashed before completing the ingestion of that origin.
blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of how many origins of this blob are known to Software Heritage. Each line in the index associates a citation blob with this count in the format SWHID,NUMBER, for example:
```
  swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,9
```
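The two origin mappings share the SWHID key, so they can be joined. The sketch below is one possible way to do it, assuming both files have been decompressed beforehand. The function names are hypothetical, and the second sample entry (an all-zero SWHID with a blank URL) is a made-up placeholder.

```python
import csv
import io


def read_pairs(csv_file) -> dict:
    """Read a two-column SWHID,VALUE mapping, skipping the header line."""
    pairs = {}
    for row in csv.reader(csv_file):
        if len(row) != 2 or not row[0].startswith("swh:1:cnt:"):
            continue
        pairs[row[0]] = row[1]
    return pairs


def origin_summary(origins_file, counts_file) -> dict:
    """Map each SWHID to (sample origin URL or None, number of origins)."""
    origins = read_pairs(origins_file)
    counts = read_pairs(counts_file)
    summary = {}
    for swhid, count in counts.items():
        # A blank URL means no origin was found; represent it as None.
        url = origins.get(swhid, "").strip() or None
        summary[swhid] = (url, int(count))
    return summary


placeholder = "swh:1:cnt:" + "0" * 40  # hypothetical blob with no origin URL
origins_csv = io.StringIO(
    "SWHID,URL\n"
    "swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,https://github.com/denatahvildari/rdflib\n"
    f"{placeholder},\n"
)
counts_csv = io.StringIO(
    "SWHID,NUMBER\n"
    "swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,9\n"
    f"{placeholder},1\n"
)
summary = origin_summary(origins_csv, counts_csv)
print(summary["swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e"])
```

Because the listed URL is only one arbitrary origin, the count is the better figure to use when estimating how widely a citation file is shared.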
blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID,EARLIEST_SWHID,EARLIEST_TS,OCCURRENCES, where:
- SWHID: blob SWHID
- EARLIEST_SWHID: SWHID of the earliest known commit containing the blob
- EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer
- OCCURRENCES: number of known commits containing the blob
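
As an illustration of working with these fields, the sketch below parses the mapping and converts EARLIEST_TS to a timezone-aware UTC datetime. It assumes the file has been decompressed beforehand. The `Earliest` class and `read_earliest` function are hypothetical names, and the sample row uses made-up values, including an all-zero commit SWHID.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator


@dataclass
class Earliest:
    swhid: str
    earliest_swhid: str
    earliest_ts: datetime  # converted from the Unix time integer, in UTC
    occurrences: int


def read_earliest(csv_file) -> Iterator[Earliest]:
    """Yield one record per data row, skipping a header line if present."""
    for row in csv.reader(csv_file):
        if len(row) != 4 or not row[0].startswith("swh:1:cnt:"):
            continue
        timestamp = datetime.fromtimestamp(int(row[2]), tz=timezone.utc)
        yield Earliest(row[0], row[1], timestamp, int(row[3]))


# Made-up row for illustration only; it is not taken from the dataset.
sample = io.StringIO(
    "swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,"
    "swh:1:rev:" + "0" * 40 + ",1500000000,12\n"
)
for record in read_earliest(sample):
    print(record.swhid, record.earliest_ts.isoformat(), record.occurrences)
```

Using an explicit UTC timezone keeps the conversion independent of the local timezone of the machine running the script.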

Changes from the 2022-04-25 license dataset

This dataset is inspired by the license dataset, but uses different file names.

If you worked with the license dataset before and want to reuse your scripts on this one, beware of the following changes:

license-blobs.csv.zst is renamed blobs.csv.zst
Blob size is computed based on the real blob size, rather than the file size on ext4 filesystems
No blobs-scancode.csv.zst, blobs-scancode.ndjson.zst, or licenses-annotated-sample.tar.gz files, as they would not be relevant
No replication-package.tar.gz, because this dataset was generated with a new data pipeline, now part of swh-graph
Inconsistencies in file formats are fixed:
- blobs-fileinfo.csv.zst identifies blobs by SWHID instead of SHA1
- blobs-origins.csv.zst and blobs-nb-origins.csv.zst are comma-separated instead of space-separated, and each have a header
- blobs-earliest.csv.zst is comma-separated instead of space-separated