Name | Last modified | Size | Description
---|---|---|---
2024-08-23-baseorigins-with-count.csv.zst | 2024-12-05 14:24 | 48K |
2024-08-23-knownorigins.csv.zst | 2024-12-05 09:31 | 4.4G |
README.md | 2024-12-05 14:44 | 1.9K |
Authors: Roberto Di Cosmo
Contact: roberto@dicosmo.org
This dataset contains all the software origins (locations of a repository, package or source code archive) known to Software Heritage as of 2024-08-23.
The dataset is organized as follows:
2024-08-23-knownorigins.csv.zst
: a Zstandard-compressed CSV containing every origin known in the archive (e.g. https://github.com/rdicosmo/parmap). The file contains 310,330,786 entries and weighs 4,716,278,511 bytes. It is generated using the following query on the S3 graph dataset: `SELECT url FROM origin;`
2024-08-23-baseorigins-with-count.csv.zst
: a Zstandard-compressed CSV containing only the base URLs of the origins known in the archive (e.g. https://github.com/), each with the number of origins it covers. The file contains 5,456 entries and weighs 49,125 bytes. It is generated using the following query on the S3 graph dataset:

        SELECT COUNT(*) AS c, baseurl
        FROM (
          SELECT
            (CASE
              WHEN (url LIKE '%.googlecode.com%') THEN 'googlecode.com'
              WHEN (url LIKE '%/%/%') THEN split(url, '/')[3]
              ELSE url
            END) AS baseurl
          FROM origin
        )
        GROUP BY baseurl
        ORDER BY c DESC;
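
For readers who want to recompute the per-base counts from the full origin list without the S3 graph dataset, the query above can be approximated in Python. This is a minimal sketch, not the procedure used to build the published file. It assumes that `split(url, '/')[3]` uses 1-based array indexing (as in Presto/Athena-style SQL), that the decompressed CSV has one URL per row in its first column, and that it has no header row. The function names `base_url` and `count_base_urls` are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable


def base_url(url: str) -> str:
    """Mirror the CASE expression of the base-origin query for one URL."""
    # LIKE '%.googlecode.com%' matches any URL containing that substring.
    if ".googlecode.com" in url:
        return "googlecode.com"
    # LIKE '%/%/%' matches any URL containing at least two slashes.
    if url.count("/") >= 2:
        # Assumption: the SQL index [3] is 1-based, i.e. index 2 in Python.
        return url.split("/")[2]
    return url


def count_base_urls(lines: Iterable[str]) -> list[tuple[str, int]]:
    """Count origins per base URL, most frequent first."""
    counts: Counter = Counter()
    # Assumption: no header row; each non-empty row holds a URL in column 0.
    for row in csv.reader(lines):
        if row:
            counts[base_url(row[0])] += 1
    return counts.most_common()
```

One way to use it is to decompress the file as a stream, for example with `zstd -dc 2024-08-23-knownorigins.csv.zst`, and pass the resulting text lines (such as `sys.stdin`) to `count_base_urls`, so that the 4.4G archive never has to be fully decompressed on disk.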
This dataset contains only factual data and is therefore placed in the public domain. If you use it in your publications, we would appreciate a citation to Software Heritage, as suggested on the Software Heritage publications page.