The Software Heritage License Blob Dataset - Replication Package

This directory contains the software used to assemble the Software Heritage License Blob Dataset (also available from Zenodo).

The main components used are:

  1. blob selection: Athena queries and a cleanup script used to generate the list of blobs to be included: 01-select-blobs.sql, 02-to-sha1.sql, 03-clean.sh
  2. fileinfo metadata: Python module licenseblobs.stats, available under python/
  3. scancode metadata: Python module licenseblobs.scancode, available under python/
  4. origin metadata: 04-swhid-to-origin.sh, querying the swh-graph API
  5. earliest metadata: custom Java code available under java/
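
The file names above (02-to-sha1.sql, 04-swhid-to-origin.sh) suggest that blobs are referred to both by SHA1 checksum and by SWHID. As a rough illustration of how those two identifiers relate for a file's content, and not a description of what the scripts in this package actually do, the sketch below computes both from raw bytes. The function names are hypothetical and do not come from the licenseblobs module.

```python
import hashlib


def sha1(data: bytes) -> str:
    """Plain SHA1 checksum of the blob content."""
    return hashlib.sha1(data).hexdigest()


def sha1_git(data: bytes) -> str:
    """Git-style blob hash: SHA1 over a 'blob <length>\\0' header plus the content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """SWHID for a content object (hypothetical helper name)."""
    return "swh:1:cnt:" + sha1_git(data)


if __name__ == "__main__":
    blob = b"hello\n"  # illustrative content, not taken from the dataset
    print(content_swhid(blob))
    print(sha1(blob))
```

The two hashes differ for the same content because the Git-style hash covers a length header in addition to the bytes themselves, so converting between them requires a lookup table or the content itself.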