Index of /public/dataset/knownorigins/2024-08-23

Name                                         Last modified      Size

Parent Directory                                                -
2024-08-23-baseorigins-with-count.csv.zst    2024-12-05 14:24   48K
2024-08-23-knownorigins.csv.zst              2024-12-05 09:31   4.4G
README.md                                    2024-12-05 14:44   1.9K

Software Heritage — Known origins and base urls

Authors: Roberto Di Cosmo
Contact: roberto@dicosmo.org

Description

This dataset contains all the software origins (locations of a repository, package or source code archive) known to Software Heritage as of 2024-08-23.

Format

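The dataset consists of two zstd-compressed CSV files, each the result of a query over the origin table. 2024-08-23-knownorigins.csv.zst lists every origin URL and corresponds to the first query below. 2024-08-23-baseorigins-with-count.csv.zst gives the number of origins per base URL, most frequent first, and corresponds to the second query.

SELECT url FROM origin;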
SELECT COUNT(*) as c, baseurl 
FROM (
  SELECT 
    (CASE 
      WHEN (url LIKE '%.googlecode.com%') THEN 'googlecode.com'
      WHEN (url LIKE '%/%/%') THEN split(url,'/')[3] 
      ELSE url
    END) as baseurl 
  FROM origin
) 
GROUP BY baseurl 
ORDER BY c DESC;
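
To make the base URL rule concrete, here is a minimal Python sketch of the same CASE logic. It assumes the SQL dialect uses 1-based array indexing (as Presto/Trino-style engines do), so split(url,'/')[3] corresponds to Python index 2. The function names and the sample URLs are hypothetical illustrations, not part of the dataset or its tooling.

```python
from collections import Counter


def base_url(url: str) -> str:
    """Approximate the CASE expression of the query above for one URL."""
    # LIKE '%.googlecode.com%' is a plain substring match.
    if ".googlecode.com" in url:
        return "googlecode.com"
    parts = url.split("/")
    # LIKE '%/%/%' holds when the URL has at least two slashes,
    # i.e. when splitting on '/' yields at least three parts.
    if len(parts) >= 3:
        # Assumption: SQL arrays are 1-based, so [3] is Python index 2.
        return parts[2]
    return url


def count_base_urls(urls):
    """Return (baseurl, count) pairs, highest count first."""
    return Counter(base_url(u) for u in urls).most_common()


# Hypothetical sample origins, for illustration only.
SAMPLE_URLS = [
    "https://github.com/example/project",
    "https://github.com/example/other",
    "http://foo.googlecode.com/svn/",
    "https://gitlab.com/example/repo",
    "no-slashes-here",
]
```

On this sample, count_base_urls(SAMPLE_URLS) places github.com first with a count of 2, and the URL without slashes is kept unchanged as its own base URL.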

License

This dataset contains only factual data, and is hence put in the public domain. If you use it in your publications, we would appreciate a citation to Software Heritage, as suggested in the Software Heritage publications page.