easygraph.datasets package

Submodules

easygraph.datasets.citation_graph module

Cora, citeseer, pubmed dataset.

(lingfan): following dataset loading and preprocessing code from tkipf/gcn https://github.com/tkipf/gcn/blob/master/gcn/utils.py

class easygraph.datasets.citation_graph.CitationGraphDataset(name, raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: EasyGraphBuiltinDataset

The citation graph dataset, including cora, citeseer and pubmed. Nodes represent publications and edges represent citation relationships.

Parameters

name (str) – name can be ‘cora’, ‘citeseer’ or ‘pubmed’.
raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
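The transform parameter is applied on every access rather than once at load time. A minimal sketch of that behavior follows; `TransformedList` and `add_self_loops` are hypothetical names used only for illustration, and the "graph" is a plain dict rather than a library graph object.

```python
class TransformedList:
    """Hypothetical container: applies `transform` each time an item is read."""

    def __init__(self, items, transform=None):
        self._items = list(items)
        self._transform = transform

    def __getitem__(self, idx):
        item = self._items[idx]
        # The stored item is left untouched; the transform runs per access.
        if self._transform is None:
            return item
        return self._transform(item)

    def __len__(self):
        return len(self._items)


def add_self_loops(graph):
    """Example transform: return a copy of the graph with one self-loop per node."""
    edges = set(graph["edges"]) | {(n, n) for n in graph["nodes"]}
    return {"nodes": list(graph["nodes"]), "edges": sorted(edges)}


raw = {"nodes": [0, 1], "edges": [(0, 1)]}
dataset = TransformedList([raw], transform=add_self_loops)
g = dataset[0]  # transformed view; `raw` itself is unchanged
```

Because the transform returns a new object, the cached data stays in its original form and a different transform can be swapped in without reprocessing.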

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
num_classes
num_labels
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
reverse_edge
save_dir: Directory to save the processed dataset.
save_name
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Loads input data from the data directory and reorders the graph for better locality.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

property num_classes

property num_labels

process()[source]

Loads input data from the data directory and reorders the graph for better locality.

ind.name.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
ind.name.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
ind.name.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.name.x) as scipy.sparse.csr.csr_matrix object;
ind.name.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
ind.name.ty => the one-hot labels of the test instances as numpy.ndarray object;
ind.name.ally => the labels for instances in ind.name.allx as numpy.ndarray object;
ind.name.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict object;
ind.name.test.index => the indices of test instances in graph, for the inductive setting as list object.
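To make the ind.name.graph format concrete, here is a small sketch that turns a `{index: [index_of_neighbor_nodes]}` dict into a list of directed edges. The function name `adjacency_to_edges` and the toy graph are assumptions for illustration; the library's own loading code is not shown here.

```python
from collections import defaultdict


def adjacency_to_edges(graph):
    """Turn {index: [index_of_neighbor_nodes]} into a sorted list of (src, dst) pairs."""
    edges = set()
    for src, neighbors in graph.items():
        for dst in neighbors:
            edges.add((src, dst))  # the set drops repeated neighbor entries
    return sorted(edges)


# Toy stand-in for the unpickled ind.name.graph object.
toy_graph = defaultdict(list, {0: [1, 2], 1: [0], 2: []})
edges = adjacency_to_edges(toy_graph)
```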

property reverse_edge

property save_name

class easygraph.datasets.citation_graph.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Citeseer citation network dataset.

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 3703 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given publication.

Statistics:

Nodes: 3327
Edges: 9228
Number of Classes: 6
Label Split:
- Train: 120
- Valid: 500
- Test: 1000

Parameters

raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes

Number of label classes

Type: int

Notes

The node feature is row-normalized.

In the citeseer dataset, there are some isolated nodes in the graph. These isolated nodes are added as zero vectors at their corresponding positions.
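The handling of isolated nodes can be pictured with a short sketch, assuming features are keyed by node id and that a node without stored features should receive an all-zero row. `pad_isolated_nodes` and the toy input are hypothetical; plain lists are used in place of sparse matrices.

```python
def pad_isolated_nodes(features, num_nodes, feat_dim):
    """Place known feature rows by node id; missing ids get an all-zero row."""
    padded = [[0.0] * feat_dim for _ in range(num_nodes)]
    for node_id, row in features.items():
        padded[node_id] = list(row)
    return padded


# Hypothetical toy input: node 2 is isolated and has no stored features.
known = {0: [1.0, 0.0], 1: [0.5, 0.5], 3: [0.0, 1.0]}
feat = pad_isolated_nodes(known, num_nodes=4, feat_dim=2)
```

Keeping the zero rows in place means that row i of the feature matrix always corresponds to node i, so masks and labels stay aligned.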

Examples

>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
num_classes
num_labels
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
reverse_edge
save_dir: Directory to save the processed dataset.
save_name
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Loads input data from the data directory and reorders the graph for better locality.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

class easygraph.datasets.citation_graph.CoraBinary(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]

Bases: EasyGraphBuiltinDataset

A mini-dataset for binary classification task using Cora.

After loading, it has the following members:

graphs : list of DGLGraph
pmpds : list of scipy.sparse.coo_matrix
labels : list of numpy.ndarray

Parameters

raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
save_dir: Directory to save the processed dataset.
save_name
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Overwrite to realize your own logic of processing the input data.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

process()[source]: Overwrite to realize your own logic of processing the input data.

property save_name

class easygraph.datasets.citation_graph.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Cora citation network dataset.

Nodes represent papers and edges represent citation relationships. Each node has a predefined feature with 1433 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given paper.

Statistics:

Nodes: 2708
Edges: 10556
Number of Classes: 7
Label split:
- Train: 140
- Valid: 500
- Test: 1000

Parameters

raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes

Number of label classes

Type: int

Notes

The node feature is row-normalized.
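As an illustration of what row normalization means here, the sketch below divides each feature row by its sum so that every non-empty row sums to 1. This assumes sum-based (L1) normalization; `row_normalize` is a hypothetical helper operating on plain lists, not the library's preprocessing routine.

```python
def row_normalize(rows):
    """Scale each row so its entries sum to 1; all-zero rows are left as zeros."""
    normalized = []
    for row in rows:
        total = sum(row)
        if total == 0:
            # Avoid division by zero for rows without any active feature.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


feat = row_normalize([[1, 1, 2], [0, 0, 0], [3, 0, 1]])
```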

Examples

>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
num_classes
num_labels
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
reverse_edge
save_dir: Directory to save the processed dataset.
save_name
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Loads input data from the data directory and reorders the graph for better locality.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

class easygraph.datasets.citation_graph.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Pubmed citation network dataset.

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 500 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given publication.

Statistics:

Nodes: 19717
Edges: 88651
Number of Classes: 3
Label Split:
- Train: 60
- Valid: 500
- Test: 1000

Parameters

raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes

Number of label classes

Type: int

Notes

The node feature is row-normalized.

Examples

>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
num_classes
num_labels
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
reverse_edge
save_dir: Directory to save the processed dataset.
save_name
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Loads input data from the data directory and reorders the graph for better locality.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

easygraph.datasets.citation_graph.load_citeseer(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get CiteseerGraphDataset

Parameters

raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type

CiteseerGraphDataset

easygraph.datasets.citation_graph.load_cora(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get CoraGraphDataset

Parameters

raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type

CoraGraphDataset

easygraph.datasets.citation_graph.load_pubmed(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get PubmedGraphDataset

Parameters

raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type

PubmedGraphDataset

easygraph.datasets.eg_dataset module

Basic EasyGraph Dataset

class easygraph.datasets.eg_dataset.EasyGraphBuiltinDataset(name, url, raw_dir=None, hash_key=(), force_reload=False, verbose=True, transform=None)[source]

Bases: EasyGraphDataset

The Basic EasyGraph Builtin Dataset.

Parameters

name (str) – Name of the dataset.
url (str) – Url to download the raw dataset.
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on the disk) from the same dataset class by comparing the hash values.
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
save_dir: Directory to save the processed dataset.
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Overwrite to realize your own logic of processing the input data.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

download()[source]: Automatically download data and extract it.

class easygraph.datasets.eg_dataset.EasyGraphDataset(name, url=None, raw_dir=None, save_dir=None, hash_key=(), force_reload=False, verbose=False, transform=None)[source]

Bases: object

The basic EasyGraph dataset for creating graph datasets. This class defines a basic template class for EasyGraph Dataset. The following steps will be executed automatically:

1. Check whether there is a dataset cache on disk (already processed and stored on the disk) by invoking has_cache(). If true, goto 5.
2. Call download() to download the data if url is not None.
3. Call process() to process the data.
4. Call save() to save the processed dataset on disk and goto 6.
5. Call load() to load the processed dataset from disk.
6. Done.

Users can overwrite these functions with their own data processing logic.
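The control flow described above can be sketched as follows. `TemplateDataset` and its `cached` flag are hypothetical and exist only for illustration; treating `force_reload=True` as a way to bypass the cache check is an assumption about the library's behavior, not something this page states.

```python
class TemplateDataset:
    """Hypothetical stand-in that mirrors the documented steps."""

    def __init__(self, url=None, force_reload=False, cached=False):
        self.url = url
        self.cached = cached
        self.calls = []  # records which hooks ran, in order
        if not force_reload and self.has_cache():
            self.load()          # cache hit: load the processed data
        else:
            if self.url is not None:
                self.download()  # only when a URL is given
            self.process()
            self.save()

    def has_cache(self):
        self.calls.append("has_cache")
        return self.cached

    def download(self):
        self.calls.append("download")

    def process(self):
        self.calls.append("process")

    def save(self):
        self.calls.append("save")

    def load(self):
        self.calls.append("load")


fresh = TemplateDataset(url="https://example.com/data.zip")
warm = TemplateDataset(cached=True)
local = TemplateDataset(url=None)
```

A subclass would replace the hook bodies with real logic while leaving the constructor's ordering untouched.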

Parameters

name (str) – Name of the dataset
url (str) – Url to download the raw dataset. Default: None
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
save_dir (str) – Directory to save the processed dataset. Default: same as raw_dir
hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on the disk) from the same dataset class by comparing the hash values. Default: (), the corresponding hash value is 'f9065fa7'.
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
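One way to picture how hash_key distinguishes cached instances is the sketch below. It assumes the hash is the first 8 hex digits of a SHA-1 digest over the string form of the tuple; the library's actual scheme may differ, so this sketch is not expected to reproduce the documented default value. `short_hash` is a hypothetical helper.

```python
import hashlib


def short_hash(hash_key=()):
    """Hypothetical helper: first 8 hex digits of SHA-1 over str(hash_key)."""
    digest = hashlib.sha1(str(hash_key).encode("utf-8")).hexdigest()
    return digest[:8]


default_hash = short_hash()
with_reverse = short_hash(("cora", True))
without_reverse = short_hash(("cora", False))
```

Two instances built with different settings get different hashes, so their caches on disk can be kept apart.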

url

The URL to download the dataset

Type: str

name

The dataset name

Type: str

raw_dir

Directory to store all the downloaded raw datasets.

Type: str

raw_path

Path to the downloaded raw dataset folder. An alias for os.path.join(self.raw_dir, self.name).

Type: str

save_dir

Directory to save all the processed datasets.

Type: str

save_path

Path to the processed dataset folder. An alias for os.path.join(self.save_dir, self.name).

Type: str

verbose

Whether to print more runtime information.

Type: bool

hash

Hash value for the dataset and the setting.

Type: str

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
save_dir: Directory to save the processed dataset.
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Overwrite to realize your own logic of downloading data.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Overwrite to realize your own logic of processing the input data.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

download()[source]

Overwrite to realize your own logic of downloading data.

It is recommended to download the data to the self.raw_dir folder. This step can be skipped if the dataset is already in self.raw_dir.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

property hash: Hash value for the dataset and the setting.

load()[source]

Overwrite to realize your own logic of loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load dgl graph from files and use dgl.data.utils.load_info to load extra information into python dict object.

property name: Name of the dataset.

abstract process()[source]: Overwrite to realize your own logic of processing the input data.

property raw_dir: Raw file directory contains the input data folder.

property raw_path

save()[source]

Overwrite to realize your own logic of saving the processed dataset into files.

It is recommended to use dgl.data.utils.save_graphs to save dgl graph into files and use dgl.data.utils.save_info to save extra information into files.

property save_dir: Directory to save the processed dataset.

property save_path: Path to save the processed dataset.

property url: Get url to download the raw dataset.

property verbose: Whether to print information.

easygraph.datasets.get_sample_graph module

easygraph.datasets.get_sample_graph.get_graph_blogcatalog()[source]

Returns the undirected graph of blogcatalog.

Returns: get_graph_blogcatalog – The undirected graph instance of blogcatalog from dataset: https://github.com/phanein/deepwalk/blob/master/example_graphs/blogcatalog.mat
Return type: easygraph.Graph

References

1: https://github.com/phanein/deepwalk/blob/master/example_graphs/blogcatalog.mat

easygraph.datasets.get_sample_graph.get_graph_flickr()[source]

Returns the undirected graph of Flickr dataset.

Returns: get_graph_flickr – The undirected graph instance of Flickr from dataset: http://socialnetworks.mpi-sws.mpg.de/data/flickr-links.txt.gz
Return type: easygraph.Graph

References

1: http://socialnetworks.mpi-sws.mpg.de/data/flickr-links.txt.gz

easygraph.datasets.get_sample_graph.get_graph_karateclub()[source]

Returns the undirected graph of Karate Club.

Returns: get_graph_karateclub – The undirected graph instance of karate club from dataset: http://vlado.fmf.uni-lj.si/pub/networks/data/Ucinet/UciData.htm
Return type: easygraph.Graph

References

1: http://vlado.fmf.uni-lj.si/pub/networks/data/Ucinet/UciData.htm

easygraph.datasets.get_sample_graph.get_graph_youtube()[source]

Returns the undirected graph of Youtube dataset.

Returns: get_graph_youtube – The undirected graph instance of Youtube from dataset: http://socialnetworks.mpi-sws.mpg.de/data/youtube-links.txt.gz
Return type: easygraph.Graph

References

1: http://socialnetworks.mpi-sws.mpg.de/data/youtube-links.txt.gz

easygraph.datasets.gnn_benchmark module

class easygraph.datasets.gnn_benchmark.AmazonCoBuyComputerDataset(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]

Bases: GNNBenchmarkDataset

‘Computer’ part of the AmazonCoBuy dataset for node classification task.

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics:

Nodes: 13,752
Edges: 491,722 (note that the original dataset has 245,778 edges, but DGL adds the reverse edges and removes the duplicates, hence the different number)
Number of classes: 10
Node feature size: 767
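The change in edge count comes from symmetrizing the edge list and then dropping duplicate pairs. A minimal sketch of that operation on a toy edge list follows; `add_reverse_edges` is a hypothetical helper, not the routine used to build this dataset.

```python
def add_reverse_edges(edges):
    """Add (v, u) for every (u, v) and drop duplicate pairs."""
    directed = set()
    for u, v in edges:
        directed.add((u, v))
        directed.add((v, u))
    return sorted(directed)


# (0, 1) appears twice and (1, 0) is already its reverse.
raw_edges = [(0, 1), (0, 1), (1, 0), (1, 2)]
sym_edges = add_reverse_edges(raw_edges)
```

Here four input pairs collapse to four directed edges covering two undirected ones, which is why the final count is not simply double the raw count.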

Parameters

raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

num_classes

Number of classes for each node.

Type: int

Examples

>>> data = AmazonCoBuyComputerDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
num_classes: Number of classes.
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
save_dir: Directory to save the processed dataset.
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Overwrite to realize your own logic of processing the input data.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

property num_classes

Number of classes.

Return type: int

easygraph.datasets.karate module

class easygraph.datasets.karate.KarateClubDataset(transform=None)[source]

Bases: EasyGraphDataset

Karate Club dataset for Node Classification

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002. Official website: http://konect.cc/networks/ucidata-zachary/

Karate Club dataset statistics:

Nodes: 34
Edges: 156
Number of Classes: 2

Parameters: transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

num_classes

Number of node classes

Type: int

Examples

>>> dataset = KarateClubDataset()
>>> num_classes = dataset.num_classes
>>> g = dataset[0]
>>> labels = g.ndata['label']

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
num_classes: Number of classes.
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
save_dir: Directory to save the processed dataset.
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Overwrite to realize your own logic of downloading data.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Overwrite to realize your own logic of processing the input data.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

property num_classes: Number of classes.

process()[source]: Overwrite to realize your own logic of processing the input data.

easygraph.datasets.ppi module

PPIDataset for inductive learning.

class easygraph.datasets.ppi.LegacyPPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: PPIDataset

Legacy version of PPI Dataset

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
num_labels
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
save_dir: Directory to save the processed dataset.
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Overwrite to realize your own logic of processing the input data.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

class easygraph.datasets.ppi.PPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: EasyGraphBuiltinDataset

Protein-Protein Interaction dataset for inductive node classification

A toy Protein-Protein Interaction network dataset. The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels. 20 graphs for training, 2 for validation and 2 for testing.

Reference: http://snap.stanford.edu/graphsage/

Statistics:

Train examples: 20
Valid examples: 2
Test examples: 2

Parameters

mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’
raw_dir (str) – Directory to which the raw files are downloaded, or which already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

num_labels

Number of labels for each node

Type: int

labels

Node labels

Type: Tensor

features

Node features

Type: Tensor

Examples

>>> dataset = PPIDataset(mode='valid')
>>> num_labels = dataset.num_labels
>>> for g in dataset:
...     feat = g.ndata['feat']
...     label = g.ndata['label']
...     # your code here
>>>

Attributes

hash: Hash value for the dataset and the setting.
name: Name of the dataset.
num_labels
raw_dir: Raw file directory that contains the input data folder.
raw_path: Directory that contains the input data files.
save_dir: Directory to save the processed dataset.
save_path: Path to save the processed dataset.
url: Get url to download the raw dataset.
verbose: Whether to print information.

Methods

`download`()	Automatically download data and extract it.
`has_cache`()	Overwrite to realize your own logic of deciding whether there exists a cached dataset.
`load`()	Overwrite to realize your own logic of loading the saved dataset from files.
`process`()	Overwrite to realize your own logic of processing the input data.
`save`()	Overwrite to realize your own logic of saving the processed dataset into files.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

property num_labels

process()[source]: Overwrite to realize your own logic of processing the input data.

save()[source]

Overwrite to realize your own logic of saving the processed dataset into files.

It is recommended to use dgl.data.utils.save_graphs to save dgl graph into files and use dgl.data.utils.save_info to save extra information into files.

easygraph.datasets.utils module

easygraph.datasets.utils.download(url, path=None, overwrite=True, sha1_hash=None, retries=5, verify_ssl=True, log=True)[source]

Download a given URL.

Code borrowed from mxnet/gluon/utils.py

Parameters

url (str) – URL to download.
path (str, optional) – Destination path to store downloaded file. By default stores to the current directory with the same name as in url.
overwrite (bool, optional) – Whether to overwrite the destination file if it already exists. By default always overwrites the downloaded file.
sha1_hash (str, optional) – Expected sha1 hash in hexadecimal digits. Will ignore existing file when hash is specified but doesn’t match.
retries (integer, default 5) – The number of times to attempt downloading in case of failure or non-200 return codes.
verify_ssl (bool, default True) – Verify SSL certificates.
log (bool, default True) – Whether to print the progress for download

Returns

The file path of the downloaded file.

Return type

str
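The sha1_hash parameter implies a checksum comparison against the downloaded file. A minimal sketch of such a check using only the standard library is shown below; `check_sha1` is a hypothetical helper name, and the temporary file stands in for a real download.

```python
import hashlib
import os
import tempfile


def check_sha1(filename, sha1_hash):
    """Return True if the file's SHA-1 hex digest equals `sha1_hash`."""
    sha1 = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in chunks so large downloads do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    return sha1.hexdigest() == sha1_hash


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "payload.bin")
    with open(path, "wb") as f:
        f.write(b"easygraph")
    expected = hashlib.sha1(b"easygraph").hexdigest()
    matches = check_sha1(path, expected)
    mismatches = check_sha1(path, "0" * 40)
```

When the digest does not match, the existing file would be treated as invalid and fetched again.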

easygraph.datasets.utils.extract_archive(file, target_dir, overwrite=False)[source]

Extract archive file.

Parameters

file (str) – Absolute path of the archive file.
target_dir (str) – Target directory of the archive to be uncompressed.
overwrite (bool, default False) – Whether to overwrite the contents inside the directory.
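A sketch of how an overwrite flag could gate extraction is given below, following the `overwrite=False` default in the signature. It handles only .zip files and assumes that an existing target directory means "already extracted"; `extract_zip` is a hypothetical helper, not the library function.

```python
import os
import tempfile
import zipfile


def extract_zip(file, target_dir, overwrite=False):
    """Hypothetical helper: unpack a .zip unless target_dir exists and overwrite is False."""
    if os.path.exists(target_dir) and not overwrite:
        return False  # nothing extracted
    with zipfile.ZipFile(file, "r") as archive:
        archive.extractall(path=target_dir)
    return True


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "toy.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("ind.toy.x", "placeholder")
    out_dir = os.path.join(tmp, "toy")
    first = extract_zip(archive_path, out_dir)    # target missing: extracts
    second = extract_zip(archive_path, out_dir)   # target exists: skipped
    forced = extract_zip(archive_path, out_dir, overwrite=True)
    extracted = os.path.exists(os.path.join(out_dir, "ind.toy.x"))
```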

easygraph.datasets.utils.generate_mask_tensor(mask)[source]

Generate a mask tensor according to the backend in use. For torch, it will create a bool tensor.

Parameters

mask (numpy ndarray) – Input mask tensor.
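Boolean masks of this kind are what entries such as train_mask hold. The sketch below builds one from a list of node indices and uses it to select labels; it works on plain Python lists rather than backend tensors, and `indices_to_mask` is a hypothetical helper.

```python
def indices_to_mask(indices, num_nodes):
    """Return a list of bools that is True exactly at the given node indices."""
    mask = [False] * num_nodes
    for i in indices:
        mask[i] = True
    return mask


train_mask = indices_to_mask([0, 2], num_nodes=5)
labels = [3, 1, 4, 1, 5]
# Keep only the labels of nodes selected by the mask.
train_labels = [y for y, keep in zip(labels, train_mask) if keep]
```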

Module contents