easygraph.datasets package

Submodules

easygraph.datasets.citation_graph module

Cora, Citeseer and Pubmed datasets.

Note (lingfan): the dataset loading and preprocessing code follows tkipf/gcn: https://github.com/tkipf/gcn/blob/master/gcn/utils.py

class easygraph.datasets.citation_graph.CitationGraphDataset(name, raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: EasyGraphBuiltinDataset

The citation graph dataset, including cora, citeseer and pubmed. Nodes represent scientific publications and edges represent citation relationships.

Parameters
  • name (str) – name can be ‘cora’, ‘citeseer’ or ‘pubmed’.

  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes
num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

reverse_edge
save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Loads input data from the data directory and reorders the graph for better locality.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

property num_classes
property num_labels
process()[source]

Loads input data from the data directory and reorders the graph for better locality.

The input files are:

  • ind.name.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;

  • ind.name.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;

  • ind.name.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.name.x) as scipy.sparse.csr.csr_matrix object;

  • ind.name.y => the one-hot labels of the labeled training instances as numpy.ndarray object;

  • ind.name.ty => the one-hot labels of the test instances as numpy.ndarray object;

  • ind.name.ally => the labels for instances in ind.name.allx as numpy.ndarray object;

  • ind.name.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict object;

  • ind.name.test.index => the indices of test instances in graph, for the inductive setting as list object.
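As a rough illustration of the ind.name.graph format described above, the following sketch converts a {index: [index_of_neighbor_nodes]} dict into a deduplicated edge list, optionally adding reverse edges in the spirit of the reverse_edge parameter. The helper name adjacency_to_edges and the toy data are assumptions made for this example; they are not part of EasyGraph, and the library's own processing may differ.

```python
from collections import defaultdict


def adjacency_to_edges(graph, reverse_edge=True):
    """Return a sorted, deduplicated list of (src, dst) pairs.

    `graph` maps a node index to a list of neighbor indices. When
    `reverse_edge` is true, (dst, src) is added for every (src, dst).
    """
    edges = set()
    for src, neighbors in graph.items():
        for dst in neighbors:
            edges.add((src, dst))
            if reverse_edge:
                edges.add((dst, src))
    return sorted(edges)


# Toy adjacency dict (hypothetical data, not taken from any dataset).
toy_graph = defaultdict(list)
toy_graph[0] = [1, 2]
toy_graph[1] = [0]
toy_graph[2] = [2, 2]  # duplicate self-loop entries collapse to one edge

edges_with_reverse = adjacency_to_edges(toy_graph)
edges_without_reverse = adjacency_to_edges(toy_graph, reverse_edge=False)
```

Using a set keeps the example short and makes the deduplication explicit; a real loader would more likely build arrays or a sparse matrix.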

property reverse_edge
property save_name
class easygraph.datasets.citation_graph.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Citeseer citation network dataset.

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 3703 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.

Statistics:

  • Nodes: 3327

  • Edges: 9228

  • Number of Classes: 6

  • Label Split:

    • Train: 120

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes

Number of label classes

Type

int

Notes

The node feature is row-normalized.

The Citeseer dataset contains some isolated nodes. These are added as zero vectors at the corresponding positions.
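The sketch below illustrates what row normalization typically means for a feature matrix, and how an all-zero row (such as the zero vector of an isolated node) can be left untouched. It assumes that each row is divided by its sum; the function name row_normalize and the toy matrix are hypothetical, and the library's actual preprocessing may differ in detail.

```python
def row_normalize(features):
    """Scale each row so its entries sum to 1; all-zero rows stay zero."""
    normalized = []
    for row in features:
        total = sum(row)
        if total == 0:
            # Avoid division by zero for rows with no features.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


toy_features = [
    [1.0, 1.0, 2.0],
    [0.0, 0.0, 0.0],  # stands in for an isolated node's zero vector
    [0.0, 5.0, 0.0],
]
normalized = row_normalize(toy_features)
```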

Examples

>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes
num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

reverse_edge
save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Loads input data from the data directory and reorders the graph for better locality.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

class easygraph.datasets.citation_graph.CoraBinary(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]

Bases: EasyGraphBuiltinDataset

A mini-dataset for binary classification task using Cora.

Once loaded, the dataset has the following members:

  • graphs : list of DGLGraph

  • pmpds : list of scipy.sparse.coo_matrix

  • labels : list of numpy.ndarray

Parameters
  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Overwrite to realize your own logic of processing the input data.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

process()[source]

Overwrite to realize your own logic of processing the input data.

property save_name
class easygraph.datasets.citation_graph.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Cora citation network dataset.

Nodes represent papers and edges represent citation relationships. Each node has a predefined feature vector with 1433 dimensions. The dataset is designed for the node classification task: predicting the category of a given paper.

Statistics:

  • Nodes: 2708

  • Edges: 10556

  • Number of Classes: 7

  • Label split:

    • Train: 140

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes

Number of label classes

Type

int

Notes

The node feature is row-normalized.

Examples

>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes
num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

reverse_edge
save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Loads input data from the data directory and reorders the graph for better locality.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

class easygraph.datasets.citation_graph.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Pubmed citation network dataset.

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 500 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.

Statistics:

  • Nodes: 19717

  • Edges: 88651

  • Number of Classes: 3

  • Label Split:

    • Train: 60

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes

Number of label classes

Type

int

Notes

The node feature is row-normalized.

Examples

>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes
num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

reverse_edge
save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Loads input data from the data directory and reorders the graph for better locality.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

easygraph.datasets.citation_graph.load_citeseer(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get CiteseerGraphDataset

Parameters
  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type

CiteseerGraphDataset

easygraph.datasets.citation_graph.load_cora(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get CoraGraphDataset

Parameters
  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type

CoraGraphDataset

easygraph.datasets.citation_graph.load_pubmed(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get PubmedGraphDataset

Parameters
  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type

PubmedGraphDataset

easygraph.datasets.eg_dataset module

Basic EasyGraph Dataset

class easygraph.datasets.eg_dataset.EasyGraphBuiltinDataset(name, url, raw_dir=None, hash_key=(), force_reload=False, verbose=True, transform=None)[source]

Bases: EasyGraphDataset

The Basic EasyGraph Builtin Dataset.

Parameters
  • name (str) – Name of the dataset.

  • url (str) – Url to download the raw dataset.

  • raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/

  • hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on the disk) from the same dataset class by comparing the hash values.

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

save_dir

Directory to save the processed dataset.

save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Overwrite to realize your own logic of processing the input data.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

download()[source]

Automatically download data and extract it.

class easygraph.datasets.eg_dataset.EasyGraphDataset(name, url=None, raw_dir=None, save_dir=None, hash_key=(), force_reload=False, verbose=False, transform=None)[source]

Bases: object

The basic EasyGraph dataset for creating graph datasets. This class defines a basic template class for EasyGraph Dataset. The following steps will be executed automatically:

  1. Check whether there is a dataset cache on disk (already processed and stored on the disk) by invoking has_cache(). If true, goto 5.

  2. Call download() to download the data if url is not None.

  3. Call process() to process the data.

  4. Call save() to save the processed dataset on disk and goto 6.

  5. Call load() to load the processed dataset from disk.

  6. Done.

Users can overwrite these functions with their own data processing logic.
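The call order listed above can be sketched with a small stand-in class. This is not the EasyGraphDataset implementation: the class names TemplateDataset and CachedDataset, the calls list, and the example URL are all hypothetical, and the sketch only mirrors the documented sequence of has_cache(), download(), process(), save() and load().

```python
class TemplateDataset:
    """Hypothetical stand-in that mimics the documented call order."""

    def __init__(self, url=None):
        self.url = url
        self.calls = []  # records which hooks ran, in order
        self._run()

    def _run(self):
        if self.has_cache():       # step 1
            self.load()            # step 5
            return                 # step 6
        if self.url is not None:
            self.download()        # step 2
        self.process()             # step 3
        self.save()                # step 4

    def has_cache(self):
        self.calls.append("has_cache")
        return False

    def download(self):
        self.calls.append("download")

    def process(self):
        self.calls.append("process")

    def save(self):
        self.calls.append("save")

    def load(self):
        self.calls.append("load")


class CachedDataset(TemplateDataset):
    """Variant whose cache check succeeds, so only load() runs."""

    def has_cache(self):
        self.calls.append("has_cache")
        return True


fresh = TemplateDataset(url="https://example.invalid/data.zip")
offline = TemplateDataset()  # url is None, so download() is skipped
cached = CachedDataset(url="https://example.invalid/data.zip")
```

Recording the hook names instead of doing real I/O keeps the control flow visible; a real subclass would replace each hook body with its own logic.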

Parameters
  • name (str) – Name of the dataset

  • url (str) – Url to download the raw dataset. Default: None

  • raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/

  • save_dir (str) – Directory to save the processed dataset. Default: same as raw_dir

  • hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on the disk) from the same dataset class by comparing the hash values. Default: (), the corresponding hash value is 'f9065fa7'.

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

url

The URL to download the dataset

Type

str

name

The dataset name

Type

str

raw_dir

Directory to store all the downloaded raw datasets.

Type

str

raw_path

Path to the downloaded raw dataset folder. An alias for os.path.join(self.raw_dir, self.name).

Type

str

save_dir

Directory to save all the processed datasets.

Type

str

save_path

Path to the processed dataset folder. An alias for os.path.join(self.save_dir, self.name).

Type

str

verbose

Whether to print more runtime information.

Type

bool

hash

Hash value for the dataset and the setting.

Type

str

Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

save_dir

Directory to save the processed dataset.

save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Overwrite to realize your own logic of downloading data.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Overwrite to realize your own logic of processing the input data.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

download()[source]

Overwrite to realize your own logic of downloading data.

It is recommended to download the data to the self.raw_dir folder. This step can be skipped if the dataset is already in self.raw_dir.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

property hash

Hash value for the dataset and the setting.

load()[source]

Overwrite to realize your own logic of loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load dgl graph from files and use dgl.data.utils.load_info to load extra information into python dict object.

property name

Name of the dataset.

abstract process()[source]

Overwrite to realize your own logic of processing the input data.

property raw_dir

Raw file directory that contains the input data folder.

property raw_path
save()[source]

Overwrite to realize your own logic of saving the processed dataset into files.

It is recommended to use dgl.data.utils.save_graphs to save dgl graph into files and use dgl.data.utils.save_info to save extra information into files.

property save_dir

Directory to save the processed dataset.

property save_path

Path to save the processed dataset.

property url

Get url to download the raw dataset.

property verbose

Whether to print information.

easygraph.datasets.get_sample_graph module

easygraph.datasets.get_sample_graph.get_graph_blogcatalog()[source]

Returns the undirected graph of blogcatalog.

Returns

get_graph_blogcatalog – The undirected graph instance of blogcatalog from dataset: https://github.com/phanein/deepwalk/blob/master/example_graphs/blogcatalog.mat

Return type

easygraph.Graph

References

[1] https://github.com/phanein/deepwalk/blob/master/example_graphs/blogcatalog.mat

easygraph.datasets.get_sample_graph.get_graph_flickr()[source]

Returns the undirected graph of Flickr dataset.

Returns

get_graph_flickr – The undirected graph instance of Flickr from dataset: http://socialnetworks.mpi-sws.mpg.de/data/flickr-links.txt.gz

Return type

easygraph.Graph

References

[1] http://socialnetworks.mpi-sws.mpg.de/data/flickr-links.txt.gz

easygraph.datasets.get_sample_graph.get_graph_karateclub()[source]

Returns the undirected graph of Karate Club.

Returns

get_graph_karateclub – The undirected graph instance of karate club from dataset: http://vlado.fmf.uni-lj.si/pub/networks/data/Ucinet/UciData.htm

Return type

easygraph.Graph

References

[1] http://vlado.fmf.uni-lj.si/pub/networks/data/Ucinet/UciData.htm

easygraph.datasets.get_sample_graph.get_graph_youtube()[source]

Returns the undirected graph of Youtube dataset.

Returns

get_graph_youtube – The undirected graph instance of Youtube from dataset: http://socialnetworks.mpi-sws.mpg.de/data/youtube-links.txt.gz

Return type

easygraph.Graph

References

[1] http://socialnetworks.mpi-sws.mpg.de/data/youtube-links.txt.gz

easygraph.datasets.gnn_benchmark module

class easygraph.datasets.gnn_benchmark.AmazonCoBuyComputerDataset(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]

Bases: GNNBenchmarkDataset

‘Computer’ part of the AmazonCoBuy dataset for node classification task.

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics:

  • Nodes: 13,752

  • Edges: 491,722 (note that the original dataset has 245,778 edges, but DGL adds the reverse edges and removes the duplicates, hence the different number)

  • Number of classes: 10

  • Node feature size: 767

Parameters
  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

num_classes

Number of classes for each node.

Type

int

Examples

>>> data = AmazonCoBuyComputerDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes

Number of classes.

raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

save_dir

Directory to save the processed dataset.

save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Overwrite to realize your own logic of processing the input data.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

property num_classes

Number of classes.

Return type

int

easygraph.datasets.karate module

class easygraph.datasets.karate.KarateClubDataset(transform=None)[source]

Bases: EasyGraphDataset

Karate Club dataset for Node Classification

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002. Official website: http://konect.cc/networks/ucidata-zachary/

Karate Club dataset statistics:

  • Nodes: 34

  • Edges: 156

  • Number of Classes: 2

Parameters

transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

num_classes

Number of node classes

Type

int

Examples

>>> dataset = KarateClubDataset()
>>> num_classes = dataset.num_classes
>>> g = dataset[0]
>>> labels = g.ndata['label']
Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes

Number of classes.

raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

save_dir

Directory to save the processed dataset.

save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Overwrite to realize your own logic of downloading data.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Overwrite to realize your own logic of processing the input data.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

property num_classes

Number of classes.

process()[source]

Overwrite to realize your own logic of processing the input data.

easygraph.datasets.ppi module

PPIDataset for inductive learning.

class easygraph.datasets.ppi.LegacyPPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: PPIDataset

Legacy version of PPI Dataset

Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

save_dir

Directory to save the processed dataset.

save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Overwrite to realize your own logic of processing the input data.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

class easygraph.datasets.ppi.PPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: EasyGraphBuiltinDataset

Protein-Protein Interaction dataset for inductive node classification

A toy Protein-Protein Interaction network dataset. The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels. 20 graphs for training, 2 for validation and 2 for testing.

Reference: http://snap.stanford.edu/graphsage/

Statistics:

  • Train examples: 20

  • Valid examples: 2

  • Test examples: 2

Parameters
  • mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’

  • raw_dir (str) – Directory to download the raw data to, or directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

num_labels

Number of labels for each node

Type

int

labels

Node labels

Type

Tensor

features

Node features

Type

Tensor

Examples

>>> dataset = PPIDataset(mode='valid')
>>> num_labels = dataset.num_labels
>>> for g in dataset:
...     feat = g.ndata['feat']
...     label = g.ndata['label']
...     # your code here
>>>
Attributes
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

save_dir

Directory to save the processed dataset.

save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Overwrite to realize your own logic of processing the input data.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

property num_labels
process()[source]

Overwrite to realize your own logic of processing the input data.

save()[source]

Overwrite to realize your own logic of saving the processed dataset into files.

It is recommended to use dgl.data.utils.save_graphs to save dgl graph into files and use dgl.data.utils.save_info to save extra information into files.

easygraph.datasets.utils module

easygraph.datasets.utils.download(url, path=None, overwrite=True, sha1_hash=None, retries=5, verify_ssl=True, log=True)[source]

Download a given URL.

Code borrowed from mxnet/gluon/utils.py

Parameters
  • url (str) – URL to download.

  • path (str, optional) – Destination path to store downloaded file. By default stores to the current directory with the same name as in url.

  • overwrite (bool, optional) – Whether to overwrite the destination file if it already exists. By default always overwrites the downloaded file.

  • sha1_hash (str, optional) – Expected sha1 hash in hexadecimal digits. Will ignore existing file when hash is specified but doesn’t match.

  • retries (integer, default 5) – The number of times to attempt downloading in case of failure or non-200 return codes.

  • verify_ssl (bool, default True) – Verify SSL certificates.

  • log (bool, default True) – Whether to print the progress for download

Returns

The file path of the downloaded file.

Return type

str
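To make the sha1_hash parameter more concrete, here is a minimal sketch of how a downloaded file can be compared against an expected SHA-1 hex digest using only the standard library. The helper name check_sha1 and the temporary file are assumptions for this example; the sketch does not show how download() itself performs the check.

```python
import hashlib
import os
import tempfile


def check_sha1(filename, sha1_hash):
    """Return True if the file's SHA-1 hex digest equals `sha1_hash`."""
    sha1 = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in 1 MiB chunks so large downloads need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    return sha1.hexdigest() == sha1_hash


# Demonstration on a temporary file (the contents are arbitrary).
payload = b"easygraph"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

expected = hashlib.sha1(payload).hexdigest()
matches = check_sha1(tmp_path, expected)
mismatch = check_sha1(tmp_path, "0" * 40)
os.remove(tmp_path)
```

A mismatch would normally mean the existing file should be ignored and downloaded again, which is the behavior the parameter description suggests.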

easygraph.datasets.utils.extract_archive(file, target_dir, overwrite=False)[source]

Extract archive file.

Parameters
  • file (str) – Absolute path of the archive file.

  • target_dir (str) – Target directory of the archive to be uncompressed.

  • overwrite (bool, default False) – Whether to overwrite the contents inside the directory. By default, existing contents are not overwritten.
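The following sketch shows one plausible reading of the overwrite flag, given that the signature's default is overwrite=False: extraction is skipped when the target directory already exists, unless overwrite is set. The function name extract_zip_sketch is hypothetical, only .zip archives are handled for brevity, and the actual behavior of extract_archive() is an assumption here.

```python
import os
import tempfile
import zipfile


def extract_zip_sketch(file, target_dir, overwrite=False):
    """Extract a .zip archive into `target_dir`.

    Assumption: if `target_dir` exists and `overwrite` is false,
    nothing is extracted. Returns True when extraction happened.
    """
    if os.path.exists(target_dir) and not overwrite:
        return False
    with zipfile.ZipFile(file) as archive:
        archive.extractall(target_dir)
    return True


with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny archive to extract (toy contents).
    archive_path = os.path.join(workdir, "toy.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("data.txt", "hello")

    target = os.path.join(workdir, "out")
    first = extract_zip_sketch(archive_path, target)   # directory is new
    second = extract_zip_sketch(archive_path, target)  # directory exists
    third = extract_zip_sketch(archive_path, target, overwrite=True)
    with open(os.path.join(target, "data.txt")) as f:
        content = f.read()
```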

easygraph.datasets.utils.generate_mask_tensor(mask)[source]

Generate a mask tensor according to the backend in use. For torch, this creates a bool tensor.

Parameters

mask (numpy ndarray) – Input mask tensor.
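For readers unfamiliar with mask tensors, the sketch below shows the idea in plain Python: a 0/1 mask is built from a list of indices, converted to booleans, and used to select entries. The helper names sample_mask and to_bool_mask and the toy data are hypothetical; generate_mask_tensor() itself works on a numpy ndarray and returns a backend tensor, which this sketch does not reproduce.

```python
def sample_mask(indices, length):
    """Return a 0/1 list of `length` with ones at the given indices."""
    mask = [0] * length
    for i in indices:
        mask[i] = 1
    return mask


def to_bool_mask(mask):
    """Convert a 0/1 mask into a list of Python bools."""
    return [bool(value) for value in mask]


# Toy example: nodes 0 and 2 of a 5-node graph belong to the training set.
train_mask = to_bool_mask(sample_mask([0, 2], 5))
labels = [3, 1, 4, 1, 5]
train_labels = [label for label, keep in zip(labels, train_mask) if keep]
```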

easygraph.datasets.utils.get_download_dir()[source]

Get the absolute path to the download directory.

Returns

dirname – Path to the download directory

Return type

str

easygraph.datasets.utils.makedirs(path)[source]

Module contents