Index of /public/dataset/swh-graph-2019-01-28

Name        Last modified
athena/     2019-02-07 16:52
edges/      2019-03-09 20:50
parquet/    2019-02-08 14:50
sql/        2019-02-05 16:38

Using the Software Heritage Graph Dataset

This README contains instructions on how to use the Software Heritage graph dataset in the different formats in which it is distributed.

References

If you use this dataset for research purposes, please cite the following paper:

Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories. May 2019, Montréal, Canada.

You can also refer to the above paper for more information on the dataset and sample queries.

Schema

The detailed schema of the database dumps is available in sql/swh_import_scripts/30-schema.sql. The different fields are documented in the comments of the schema itself.
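
For a quick overview before importing anything, you can search the schema file for table definitions and their documenting comments (a minimal sketch; the patterns assume standard SQL DDL keywords):

# list table definitions and their documenting comments
grep -E 'CREATE TABLE|COMMENT ON' sql/swh_import_scripts/30-schema.sql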

PostgreSQL dumps

PostgreSQL dumps are available in the sql/ folder. They can be imported into a local database using:

cd sql
createdb softwareheritage
psql softwareheritage < swh_import.sql
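
Once the import finishes, a quick sanity check can confirm the tables are populated. This is a sketch that assumes the directory_entry_file table used in the Athena example below is present in the dumps:

# count the rows of one imported table (table name borrowed from the Athena example below)
psql softwareheritage -c "SELECT COUNT(*) FROM directory_entry_file;"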

Parquet dumps

Parquet dumps are available in the parquet/ folder. They can be imported in a Hadoop cluster to be analyzed with any data processing framework that supports Parquet files (Hive, Drill, Spark, etc.).

The parquet dataset is stored in tarballs that can be unpacked using:

cd parquet
for f in *; do tar xvf "$f"; done
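
As an illustrative next step, assuming a working Hadoop setup, the unpacked directories can be copied to HDFS before being registered with your processing framework (the destination path /swh/graph below is a placeholder):

# copy the unpacked Parquet data to HDFS; /swh/graph is an example path
hdfs dfs -mkdir -p /swh/graph
hdfs dfs -put ./* /swh/graph/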

Using Amazon Athena

The Software Heritage graph dataset is available as a public dataset in Amazon Athena.

Setup

In order to query the dataset using Athena, you will first need to create an AWS account and set up billing.

Once your AWS account is ready, you will need to install a few dependencies on your machine: Python 3, the boto3 Python library, and the AWS command line interface (awscli).

On Debian, the dependencies can be installed with the following commands:

sudo apt install python3 python3-boto3 awscli
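
You can check that the tools are available before proceeding, for example:

# verify that boto3 and the AWS CLI are installed
python3 -c 'import boto3; print(boto3.__version__)'
aws --version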

Once the dependencies are installed, run:

aws configure

and enter your AWS Access Key ID and AWS Secret Access Key to give the scripts access to your AWS account.
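
You can confirm that the credentials are accepted by asking AWS for your account identity:

# verify the configured credentials
aws sts get-caller-identity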

Create the tables

To import the schema of the dataset into your account, run the following command from the athena/ folder:

./gen_schema.py

This will create the required tables in your AWS account. You can check that the tables were successfully created by going to the Amazon Athena console and selecting the "swh" database.
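
The same check can be done from the command line, for example by listing the tables of the "swh" database through the AWS Glue catalog that Athena uses:

# list the tables created in the swh database
aws glue get-tables --database-name swh --query 'TableList[].Name'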

Run queries

From the console, once you have selected the "swh" database, you can directly run queries from the Query Editor.

Here is an example query that computes the most frequent file name in the archive:

SELECT FROM_UTF8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 1;
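
The same query can also be submitted from the command line. The sketch below assumes you have an S3 bucket where Athena can write query results; the bucket name is a placeholder:

# submit the query through the Athena API; replace YOUR-BUCKET with a bucket you own
QUERY_ID=$(aws athena start-query-execution \
    --query-string "SELECT FROM_UTF8(name, '?') AS name, COUNT(DISTINCT target) AS cnt FROM directory_entry_file GROUP BY name ORDER BY cnt DESC LIMIT 1" \
    --query-execution-context Database=swh \
    --result-configuration OutputLocation=s3://YOUR-BUCKET/athena-results/ \
    --query QueryExecutionId --output text)

# fetch the results once the query completes
aws athena get-query-results --query-execution-id "$QUERY_ID"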

More documentation on Amazon Athena is available in the official AWS documentation: https://docs.aws.amazon.com/athena/