Software Heritage — Software Citation Blob Dataset
==================================================

Authors: Roberto Di Cosmo and Valentin Lorentz
Contact: roberto@dicosmo.org


Description
-----------

This dataset contains all "software citation files" extracted from a snapshot of
the [Software Heritage][swh] archive taken on 2025-10-08.

[swh]: https://www.softwareheritage.org

In this context, a *software citation file* is a unique file content (or "blob")
that appeared in a software origin archived by Software Heritage as a file whose
name is `codemeta.json` or `citation.cff`, two kinds of machine-readable
metadata files used for describing software or citing it. The exact file name
pattern used to select the blobs contained in the dataset can be found in the
SQL query file `01-get-citation-swhids.sql`. Note that the file is not
required to be at the project root, because project subdirectories may contain
different software modules with different citation information, and we wanted to
include those too.


Format
------

The dataset is organized as follows:

- `blobs.tar.zst`: a [Zst][zstd]-compressed tarball containing deduplicated
  citation blobs, one per file. The tarball contains 125'612 blobs, for a
  total uncompressed size of 426MB.
  
  The blobs are organized in a sharded directory structure that contains files
  named like `blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2`, where:
  
  - `blobs/` is the root directory containing all citation blobs

  - `ff4779adbace73349374b9fd5d77a42ae4ec66c2` is the SHA1 checksum of a
    specific citation blob, a `codemeta.json` for rdflib in this case. Each
    citation blob is ultimately named with its SHA1:

        $ head blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2 
        {
          "@context": [
            "https://doi.org/doi:10.5063/schema/codemeta-2.0",
            "http://schema.org"
          ],
          "@type": "SoftwareSourceCode",
          "identifier": "rdflib",
          "description": "A friendly and concise user interface for performing common [...]",
          "name": "rdflib: A high level wrapper around the 'redland' package for common 'rdf' applications ",
          "codeRepository": "https://github.com/cboettig/rdflib",
        $ sha1sum blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2
        ff4779adbace73349374b9fd5d77a42ae4ec66c2  blobs/ff/47/ff4779adbace73349374b9fd5d77a42ae4ec66c2

  - `ff` and `47` are, respectively, the first and second group of two hex
    digits in the blob SHA1
  
- `blobs.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] index of all the
  blobs in the dataset. Each line in the index (except the first one, which
  contains column headers) describes a citation blob and is in the format
  `SWHID,SHA1,NAME`, for example:
  
        swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,ff4779adbace73349374b9fd5d77a42ae4ec66c2,"codemeta.json"
        swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"codemeta.json"
        swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,da39a3ee5e6b4b0d3255bfef95601890afd80709,"citation.cff"

  
  where:

  - **SWHID:** the [Software Heritage persistent identifier][swhid] of the
    blob. It can be used to retrieve and cross-reference the citation blob via
    the Software Heritage archive, e.g., at:
    <https://archive.softwareheritage.org/swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e>

  - **SHA1:** the blob SHA1, which can be used to cross-reference blobs in the
    `blobs/` directory

  - **NAME:** *a* file name given to the citation blob in a given software
    origin. As the same citation blob can have different names in different
    contexts, the index may contain multiple entries for the same blob with
    different names, as is the case in the example above
    (`swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391` is the SWHID of the
    empty file).

- `blobs-fileinfo.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping from
  blobs to basic file information in the format:
  `SWHID,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE`, where:
  
  - **SWHID:** blob SWHID
  - **MIME_TYPE:** blob MIME type, as detected by [libmagic][libmagic]
  - **ENCODING:** blob character encoding, as detected by [libmagic][libmagic]
  - **LINE_COUNT:** number of lines in the blob (only for textual blobs with
    UTF-8 encoding)
  - **WORD_COUNT:** number of words in the blob (only for textual blobs with
    UTF-8 encoding)
  - **SIZE:** blob size in bytes

- `blobs-origins.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping of where
  citation blobs come from. Each line in the index associates a citation blob
  with one of its origins in the format `SWHID,URL`, for example:
  
        swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,https://github.com/denatahvildari/rdflib

  Note that a citation blob can come from many different places; only one
  arbitrarily chosen origin is listed in this mapping.

  If no origin URL is found in the Software Heritage archive, a blank value is
  used instead. This happens when the blob's origin was still being loaded when
  the dataset was generated, or when the loader process crashed before
  completing the ingestion of that origin.

- `blobs-nb-origins.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping from
  blobs to the number of their origins known to Software Heritage.
  Each line in the index associates a citation blob with this count in the
  format `SWHID,NUMBER`, for example:

        swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,9

- `blobs-earliest.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping from
  blobs to information about their (earliest) known occurrence(s) in the
  archive. Format: `SWHID,EARLIEST_SWHID,EARLIEST_TS,OCCURRENCES`,
  where:
  
  - **SWHID:** blob SWHID
  - **EARLIEST_SWHID:** SWHID of the earliest known commit containing the blob
  - **EARLIEST_TS:** timestamp of the earliest known commit containing the
    blob, as a [Unix time][unixtime] integer
  - **OCCURRENCES:** number of known commits containing the blob
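
The relationship between a blob's SHA1, its sharded path, and its SWHID can be
sketched in a few lines of Python. This is only an illustration, not part of
the dataset tooling: the function names are hypothetical, and the SWHID
computation relies on the fact that `swh:1:cnt:` identifiers use the same hash
as Git blob objects (SHA1 over a `blob <size>\0` header followed by the
content).

```python
import hashlib
from pathlib import PurePosixPath


def blob_path(sha1_hex: str) -> PurePosixPath:
    """Return the sharded path of a blob inside the extracted tarball."""
    # The two shard directories are the first and second pairs of hex digits.
    return PurePosixPath("blobs") / sha1_hex[0:2] / sha1_hex[2:4] / sha1_hex


def sha1_of(content: bytes) -> str:
    """Plain SHA1 of the content, as used for blob file names."""
    return hashlib.sha1(content).hexdigest()


def swhid_of(content: bytes) -> str:
    """Content SWHID: SHA1 over a Git-style blob header plus the content."""
    header = b"blob %d\0" % len(content)
    return "swh:1:cnt:" + hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    print(blob_path("ff4779adbace73349374b9fd5d77a42ae4ec66c2"))
    # The empty file, which appears in the index example.
    print(sha1_of(b""), swhid_of(b""))
```

Reading a file from the extracted `blobs/` directory and passing its bytes to
`sha1_of` and `swhid_of` should reproduce the `SHA1` and `SWHID` columns of the
index for that blob.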

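As a rough sketch of how the CSV files might be consumed, the Python snippet
below groups the file names recorded for each blob and converts an
`EARLIEST_TS` value to a UTC date. It assumes the file has already been
decompressed (for instance with the `zstd` command-line tool) and is passed in
as a text stream. The function names and the header spelling in the sample are
assumptions; columns are read by position, so the actual header text does not
matter.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone

# Sample rows copied from the blobs.csv example; the header line is assumed.
SAMPLE = (
    "SWHID,SHA1,NAME\n"
    'swh:1:cnt:6daebd857f6f6a98dd9288ef7b942283f7fa4f0e,'
    'ff4779adbace73349374b9fd5d77a42ae4ec66c2,"codemeta.json"\n'
    'swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,'
    'da39a3ee5e6b4b0d3255bfef95601890afd80709,"codemeta.json"\n'
    'swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391,'
    'da39a3ee5e6b4b0d3255bfef95601890afd80709,"citation.cff"\n'
)


def names_by_swhid(index_file):
    """Group the NAME column of a decompressed blobs.csv by SWHID."""
    reader = csv.reader(index_file)
    next(reader, None)  # skip the header row
    names = defaultdict(set)
    for swhid, _sha1, name in reader:
        names[swhid].add(name)
    return dict(names)


def earliest_datetime(earliest_ts: str) -> datetime:
    """Convert an EARLIEST_TS value (Unix time integer) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(earliest_ts), tz=timezone.utc)


if __name__ == "__main__":
    for swhid, names in names_by_swhid(io.StringIO(SAMPLE)).items():
        print(swhid, sorted(names))
```

With the real dataset, `io.StringIO(SAMPLE)` would be replaced by an open text
file, e.g. `open("blobs.csv", newline="")`.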

[zstd]: http://facebook.github.io/zstd/
[csv]: https://en.wikipedia.org/wiki/Comma-separated_values
[libmagic]: https://linux.die.net/man/3/libmagic
[unixtime]: https://en.wikipedia.org/wiki/Unix_time
[swhid]: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html


Changes from the 2025-10-08 dataset
-----------------------------------

- More input data, due to the SWH archive growing: more origins in supported forges
  and package managers; and support for more forges and package managers.
  See the [SWH Archive Changelog](https://docs.softwareheritage.org/devel/archive-changelog.html)
  for details.

