The dataset in this folder contains only information about the edges of the Software Heritage Graph, with none of the associated metadata. This is useful for studying the topology of the graph.
Each .edges.csv.gz file contains all edges of a given <src, dst> type. The format is a compressed text file with one edge per line, written as a "SRC_ID SPACE DST_ID" string, where identifiers are the intrinsic SHA1 checksums of each node (hex-encoded, as usual).
Each .nodes.csv.gz file contains a sorted list of the unique node identifiers appearing in the corresponding .edges.csv.gz file. The format is a compressed text file with one hex-encoded SHA1 checksum per line.
Each .count text file contains the number of lines of its matching file.
If you want the entire graph and do not care about the division by edge type, it should be enough to cat all the edge files together and process the result as if it were a single file.
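As a rough illustration of the format described above, the following Python sketch reads edges from one or more edge files and compares a file's line count with its .count file. The function names (iter_edges, iter_all_edges, count_matches) are hypothetical. The sketch assumes that a .count file holds a single integer. The exact name of the .count file is not specified here, so it is passed in explicitly.

```python
import gzip
from itertools import chain
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def iter_edges(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (src_id, dst_id) pairs from one .edges.csv.gz file."""
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate a trailing empty line
            # Each line is "SRC_ID SPACE DST_ID".
            src, dst = line.split(" ", 1)
            yield src, dst


def iter_all_edges(paths: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Treat several edge files as one stream, ignoring edge types."""
    return chain.from_iterable(iter_edges(p) for p in paths)


def read_count(count_path: str) -> int:
    """Read a .count file (assumption: it contains a single integer)."""
    return int(Path(count_path).read_text().split()[0])


def count_matches(data_path: str, count_path: str) -> bool:
    """Check that a compressed file has as many lines as its .count file says."""
    with gzip.open(data_path, "rt", encoding="ascii") as f:
        n_lines = sum(1 for _ in f)
    return n_lines == read_count(count_path)
```

Chaining the per-file iterators gives the same result as concatenating the files with cat, without writing a combined file to disk.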
If you want to pay attention to the edge types, the files are named as follows:
ori_to_snp.edges.csv.gz
: the edges from the origin ID (integer) to the snapshot ID (sha1).
snp_to_cnt.edges.csv.gz
: the edges from the snapshot ID (sha1) to the content it points to (sha1).
snp_to_dir.edges.csv.gz
: the edges from the snapshot ID (sha1) to the directory it points to (sha1).
snp_to_rev.edges.csv.gz
: the edges from the snapshot ID (sha1) to the revision it points to (sha1).
snp_to_rel.edges.csv.gz
: the edges from the snapshot ID (sha1) to the release it points to (sha1).
rel_to_cnt.edges.csv.gz
: the edges from the release ID (sha1) to the content it points to (sha1).
rel_to_dir.edges.csv.gz
: the edges from the release ID (sha1) to the directory it points to (sha1).
rel_to_rev.edges.csv.gz
: the edges from the release ID (sha1) to the revision it points to (sha1).
rel_to_rel.edges.csv.gz
: the edges from the release ID (sha1) to the release it points to (sha1).
rev_to_rev.edges.csv.gz
: the edges from each revision (sha1) to its parent revisions (sha1). This is the full development history of the dataset.
rev_to_dir.edges.csv.gz
: the edges from each revision (sha1) to the directory it points to (sha1).
dir_to_dir.edges.csv.gz
: the edges from each directory (sha1) to its children directories (sha1).
dir_to_cnt.edges.csv.gz
: the edges from each directory (sha1) to its children files (sha1).
dir_to_rev.edges.csv.gz
: the edges from each directory (sha1) to its children revisions (sha1).
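The naming scheme above can be decoded mechanically. The Python sketch below maps an edge file name to its source and destination node types. The helper name edge_type_from_filename and the NODE_TYPES table are hypothetical. The abbreviations are taken from the file names listed above.

```python
import os
import re
from typing import Tuple

# Abbreviations used in the edge file names, with the node type each stands for.
NODE_TYPES = {
    "ori": "origin",
    "snp": "snapshot",
    "rel": "release",
    "rev": "revision",
    "dir": "directory",
    "cnt": "content",
}

_ABBR = "|".join(NODE_TYPES)
_EDGE_FILE_RE = re.compile(rf"^({_ABBR})_to_({_ABBR})\.edges\.csv\.gz$")


def edge_type_from_filename(path: str) -> Tuple[str, str]:
    """Return (source node type, destination node type) for an edge file."""
    name = os.path.basename(path)
    match = _EDGE_FILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised edge file name: {name!r}")
    src, dst = match.groups()
    return NODE_TYPES[src], NODE_TYPES[dst]
```

For example, edge_type_from_filename("rev_to_dir.edges.csv.gz") yields the pair ("revision", "directory"), which can be used to select only the files relevant to a given analysis, such as rev_to_rev for the development history.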