Software Heritage — License Blob Dataset
========================================

Author: Stefano Zacchiroli  
Contact: zack@upsilon.cc  


Description
-----------

This dataset contains all "license files" extracted from a snapshot of the
[Software Heritage][swh] archive taken on 2025-05-18. (A long-term archival
version of this dataset is also [available from Zenodo][datasetzenodo].)

[swh]: https://www.softwareheritage.org

In this context, a *license file* is a unique file content (or "blob") that
appeared in a software origin archived by Software Heritage as a file whose
name is often used to ship licenses in software projects. Examples of such
names include `COPYING`, `LICENSE`, `NOTICE`, and `COPYRIGHT`. The exact file name
pattern used to select the blobs contained in the dataset can be found in the
[SQL query](https://archive.softwareheritage.org/swh:1:cnt:7c6143bccbb0ff6604417b723f38bee06db0138b;origin=https://gitlab.softwareheritage.org/swh/devel/swh-graph;visit=swh:1:snp:d34d87373bb367ba310002693cb7c4c139c3b882;anchor=swh:1:rev:985dcf705e03fde55285ca8aaff2488f43e9a55f;path=/swh/graph/luigi/blobs_datasets.py;lines=128-131).
Note that matching files were not required to be at the project root, because
project subdirectories can contain licenses different from the top-level one,
and we wanted to include those too.


Format
------

The dataset is organized as follows:

- `blobs.tar.zst`: a [Zst][zstd]-compressed tarball containing deduplicated
  license blobs, one per file. The tarball contains 11'975'472 blobs, for a
  total uncompressed size on disk of 145 GB.
  
  The blobs are organized in a sharded directory structure that contains files
  named like `blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02`, where:
  
  - `blobs/` is the root directory containing all license blobs

  - `8624bcdae55baeef00cd11d5dfcfa60f68710a02` is the SHA1 checksum of a
    specific license blob, a copy of the GPL3 license in this case. Each
    license blob is ultimately named with its SHA1:

        $ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
                            GNU GENERAL PUBLIC LICENSE
                               Version 3, 29 June 2007
        
        $ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
        8624bcdae55baeef00cd11d5dfcfa60f68710a02  blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02

  - `86` and `24` are, respectively, the first and second group of two hex
    digits in the blob SHA1

- `blobs-sample20k.tar.zst`: analogous to `blobs.tar.zst`, but containing
  "only" 20'000 randomly selected license blobs
  
- `license-blobs.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] index of all the
  blobs in the dataset. Each line in the index (except the first one, which
  contains column headers) describes a license blob and is in the format
  `SWHID,SHA1,NAME`, for example:
  
        swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
        swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
        swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"
  
  where:

  - **SWHID:** the [Software Heritage persistent identifier][swhid] of the
    blob. It can be used to retrieve and cross-reference the license blob via
    the Software Heritage archive, e.g., at:
    <https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>

  - **SHA1:** the blob SHA1, which can be used to cross-reference blobs in the
    `blobs/` directory. May be empty in the rare case where the blob is not present
    in the archive (e.g., because it was too large).

  - **NAME:** *a* file name given to the license blob in a given software
    origin. As the same license blob can have different names in different
    contexts, the index contains multiple entries for the same blob with
    different names, as is the case in the example above (yes, one of those
    has a typo in it, but it's an *original* typo from some repository!).

- `blobs-fileinfo.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping from
  blobs to basic file information in the format:
  `sha1,mime_type,encoding,line_count,word_count,size`, where:
  
  - **sha1:** blob SHA1
  - **mime_type:** blob MIME type, as detected by [libmagic][libmagic]
  - **encoding:** blob character encoding, as detected by [libmagic][libmagic]
  - **line_count:** number of lines in the blob (only for textual blobs with
    UTF-8 encoding)
  - **word_count:** number of words in the blob (only for textual blobs with
    UTF-8 encoding)
  - **size:** blob size in bytes

- `blobs-scancode.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping from
  blobs to the software licenses detected in them by [ScanCode][scancode], in
  the format: `sha1,license,score`, where:
  
  - **sha1:** blob SHA1
  - **license:** license detected in the blob, as an [SPDX][spdx] identifier
    (or [ScanCode identifier][scancode-licensedb] for non-SPDX-indexed
    licenses)
  - **score:** confidence score in the result, as a decimal number between 0
    and 100

  There may be zero or arbitrarily many lines for each blob.

- `blobs-scancode.ndjson.zst`: a [Zst][zstd]-compressed
  [line-delimited JSON][ndjson], containing a superset of the information in
  `blobs-scancode.csv.zst`. Each line is a JSON dictionary with three keys:

  - **swhid**: blob SWHID
  - **licenses**: output of `scancode.api.get_licenses(..., min_score=0)`
  - **copyrights**: output of `scancode.api.get_copyrights(...)`

  There is exactly one line for each blob. The `licenses` and `copyrights`
  keys are omitted for files not detected as plain text.

- `blobs-origins.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping of where
  license blobs come from. Each line in the index associates a license blob with
  one of its origins in the format `swhid,url`, for example:
  
        swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,https://github.com/pombreda/Artemis

  Note that a license blob can come from many different places; only an
  arbitrary (and somewhat random) one is listed in this mapping.

  If no origin URL is found in the Software Heritage archive, then a blank is
  used instead. This happens either when the blob's origin was still being
  loaded when the dataset was generated, or when the loader process crashed
  before completing the ingestion of the blob's origin.

- `blobs-nb-origins.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping of
  how many origins of each blob are known to Software Heritage.
  Each line in the index associates a license blob with this count in the format
  `SWHID,NUMBER`, for example:

        swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,2822260

- `blobs-earliest.csv.zst`: a [Zst][zstd]-compressed [CSV][csv] mapping from
  blobs to information about their (earliest) known occurrence(s) in the
  archive. Format: `SWHID,EARLIEST_SWHID,EARLIEST_TS,OCCURRENCES`,
  where:
  
  - **SWHID:** blob SWHID
  - **EARLIEST_SWHID:** SWHID of the earliest known commit containing the blob
  - **EARLIEST_TS:** timestamp of the earliest known commit containing the
    blob, as a [Unix time][unixtime] integer
  - **OCCURRENCES:** number of known commits containing the blob
  
- `licenses-annotated-sample.tar.gz`: ground truth, i.e., a manually annotated
  random sample of license blobs, with details about the kind of information
  they contain.
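
As an illustration of the layout described above, the following Python sketch
derives the sharded path of a blob from its SHA1 and walks the rows of an
already decompressed `license-blobs.csv.zst` index. It is only a sketch under
stated assumptions: the helper names `blob_path` and `iter_index` are
hypothetical, the header row in the sample is a placeholder (the exact
spelling of the column headers is not specified here), and the index is
assumed to have been decompressed beforehand (e.g., with `zstd -d`).

```python
import csv
import io
import string
from pathlib import PurePosixPath


def blob_path(sha1_hex, root="blobs"):
    """Return the sharded path of a blob inside the tarball (hypothetical helper)."""
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or not set(sha1_hex) <= set(string.hexdigits):
        raise ValueError(f"not a SHA1 hex digest: {sha1_hex!r}")
    # The first and second groups of two hex digits select the two shard levels.
    return PurePosixPath(root, sha1_hex[:2], sha1_hex[2:4], sha1_hex)


def iter_index(lines):
    """Yield (swhid, path, name) per index row; path is None when SHA1 is empty."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the column-header row
    for swhid, sha1, name in reader:
        yield swhid, (blob_path(sha1) if sha1 else None), name


# Rows copied from the index example above; the header row is a placeholder.
sample = io.StringIO(
    "SWHID,SHA1,NAME\n"
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,"
    '8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"\n'
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,"
    '8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"\n'
)

rows = list(iter_index(sample))
for swhid, path, name in rows:
    print(name, path)
```

Rows with an empty SHA1 (blobs missing from the archive) yield `None` as their
path, so callers can skip them instead of looking for a file that is not in the
tarball. When reading a real file, opening it with `newline=""` is what the
`csv` module documentation recommends.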


[zstd]: http://facebook.github.io/zstd/
[csv]: https://en.wikipedia.org/wiki/Comma-separated_values
[ndjson]: http://ndjson.org/
[swhid]: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html
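
For content blobs, the hexadecimal part of a `swh:1:cnt:` SWHID is the same
identifier that Git assigns to a blob: the SHA1 of a `blob <size>\0` header
followed by the content. The blob SHA1 used under `blobs/`, by contrast, is the
plain SHA1 of the content alone, which is why the two columns of the index
differ for the same blob. The Python sketch below recomputes both from a blob's
bytes; the function names are hypothetical and not part of any Software
Heritage tool.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """SWHID of a content blob: SHA1 over a Git-style blob header plus the data."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def content_sha1(data: bytes) -> str:
    """Plain SHA1 of the content, as used to name files under blobs/."""
    return hashlib.sha1(data).hexdigest()


# The empty blob has well-known identifiers in both schemes.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(content_sha1(b""))   # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

Applied to the bytes of a file extracted from `blobs.tar.zst`, the two results
can be compared against the `SWHID` and `SHA1` columns of
`license-blobs.csv.zst`.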


Changes from the 2022-12-07 dataset
-----------------------------------

- More input data, due to the growth of the SWH archive: more origins in
  already supported forges and package managers, and support for additional
  forges and package managers.
  See the [SWH Archive Changelog](https://docs.softwareheritage.org/devel/archive-changelog.html)
  for details.

- 32 contents referenced by directories but missing from the archive (e.g., because they
  are too big) are now included in `license-blobs.csv.zst` (with an empty SHA1),
  `blobs-origins.csv.zst`, `blobs-nb-origins.csv.zst`, and `blobs-earliest.csv.zst`.
  They are not included in other files, as that would require access to their content.


Citation
--------

If you use this dataset for research purposes, please acknowledge its use by
citing the following paper:

- [[pdf][preprintmsr2022], [bib][bibmsr2022]]
  Stefano Zacchiroli. [*A Large-scale Dataset of (Open Source) License Text
  Variants*][doimsr2022]. In Proceedings of the [2022 Mining Software
  Repositories Conference (MSR 2022)][msr2022], 23-24 May 2022, Pittsburgh,
  Pennsylvania, United States. ACM 2022.


References
----------

The dataset has been built primarily using the data sources described in the
following papers:

- [[pdf][preprintipres2017], [bib][bibipres2017]]
  Roberto Di Cosmo, Stefano Zacchiroli. [Software Heritage: Why and How to
  Preserve Software Source Code][handleipres2017]. In Proceedings of iPRES
  2017: 14th International Conference on Digital Preservation, Kyoto, Japan,
  25-29 September 2017.
  
- [[pdf][preprintmsr2019], [bib][bibmsr2019]]
  Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. [The Software
  Heritage Graph Dataset: Public software development under one
  roof][doimsr2019]. In Proceedings of MSR 2019: The 16th International
  Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages
  138-142, IEEE 2019.

  
[bibipres2017]: https://dblp.uni-trier.de/rec/conf/ipres/CosmoZ17.html?view=bibtex
[bibmsr2019]: https://dblp.uni-trier.de/rec/conf/msr/PietriSZ19.html?view=bibtex
[bibmsr2022]: https://dblp.uni-trier.de/rec/conf/msr/Zacchiroli22.html?view=bibtex
[datasetzenodo]: https://doi.org/10.5281/zenodo.6379164 
[doimsr2019]: https://doi.org/10.1109/MSR.2019.00030
[doimsr2022]: https://doi.org/10.1145/3524842.3528491
[handleipres2017]: https://hdl.handle.net/11353/10.931064
[libmagic]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/file.html
[msr2022]: https://conf.researchr.org/home/msr-2022
[preprintipres2017]: https://phaidra.univie.ac.at/open/o:931064
[preprintmsr2019]: https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf
[preprintmsr2022]: https://arxiv.org/pdf/2204.00256.pdf
[scancode-licensedb]: https://scancode-licensedb.aboutcode.org/
[scancode]: https://www.aboutcode.org/projects/scancode.html
[spdx]: https://spdx.dev/
[unixtime]: https://en.wikipedia.org/wiki/Unix_time
