easygraph.datasets package
Submodules
easygraph.datasets.citation_graph module
Cora, citeseer, pubmed dataset.
(lingfan): following dataset loading and preprocessing code from tkipf/gcn https://github.com/tkipf/gcn/blob/master/gcn/utils.py
- class easygraph.datasets.citation_graph.CitationGraphDataset(name, raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]
Bases:
EasyGraphBuiltinDataset
The citation graph dataset, including cora, citeseer and pubmed. Nodes represent publications and edges represent citation relationships.
- Parameters
name (str) – name can be ‘cora’, ‘citeseer’ or ‘pubmed’.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_classes
- num_labels
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
- reverse_edge
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Loads input data from data directory and reorder graph for better locality
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- has_cache()[source]
Overwrite to realize your own logic of deciding whether there exists a cached dataset.
By default False.
- property num_classes
- property num_labels
- process()[source]
Loads input data from data directory and reorder graph for better locality
ind.name.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
ind.name.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
ind.name.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.name.x) as scipy.sparse.csr.csr_matrix object;
ind.name.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
ind.name.ty => the one-hot labels of the test instances as numpy.ndarray object;
ind.name.ally => the labels for instances in ind.name.allx as numpy.ndarray object;
ind.name.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict object;
ind.name.test.index => the indices of test instances in graph, for the inductive setting as list object.
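The adjacency format of ind.name.graph and the effect of the reverse_edge option can be illustrated with a minimal sketch. The helper graph_dict_to_edges and the toy adjacency dict are hypothetical and are not part of the package; the real loader may build its edge list differently.

```python
from collections import defaultdict


def graph_dict_to_edges(adjacency, reverse_edge=True):
    """Flatten {index: [neighbor indices]} into a sorted list of (src, dst) pairs.

    Hypothetical helper for illustration only.
    """
    edges = set()
    for src, neighbors in adjacency.items():
        for dst in neighbors:
            edges.add((src, dst))
            if reverse_edge:
                # Mirror each edge so both directions are present.
                edges.add((dst, src))
    return sorted(edges)


# Toy stand-in for the content of ind.name.graph (not real dataset content).
toy_graph = defaultdict(list, {0: [1, 2], 1: [2], 2: []})

directed = graph_dict_to_edges(toy_graph, reverse_edge=False)
symmetric = graph_dict_to_edges(toy_graph, reverse_edge=True)
print(directed)
print(symmetric)
```

Using a set removes duplicate pairs, which is one reason an edge count after adding reverse edges is not always exactly twice the original count.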
- property reverse_edge
- property save_name
- class easygraph.datasets.citation_graph.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]
Bases:
CitationGraphDataset
Citeseer citation network dataset.
Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 3703 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given publication.
Statistics:
Nodes: 3327
Edges: 9228
Number of Classes: 6
Label Split:
Train: 120
Valid: 500
Test: 1000
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes
- Type
int
Notes
The node feature is row-normalized.
In the citeseer dataset, some nodes in the graph are isolated. These isolated nodes are added as zero vectors at the right positions.
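The two notes above can be illustrated with a small sketch. It assumes that "row-normalized" means each feature row is divided by its own sum, as in the tkipf/gcn preprocessing this module follows, and that all-zero rows (such as the padded isolated nodes) are left as zeros. The function row_normalize and the toy feature matrix are hypothetical.

```python
def row_normalize(rows):
    """Divide every row by its sum; all-zero rows stay all-zero."""
    normalized = []
    for row in rows:
        total = sum(row)
        if total == 0:
            # Isolated nodes padded with zero vectors have nothing to scale.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([value / total for value in row])
    return normalized


# Toy bag-of-words features; the middle row plays the role of an isolated node.
features = [[1, 1, 0, 2], [0, 0, 0, 0], [0, 3, 0, 0]]
normalized = row_normalize(features)
for row in normalized:
    print(row)
```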
Examples
>>> from easygraph.datasets.citation_graph import CiteseerGraphDataset
>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_classes
- num_labels
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
- reverse_edge
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Loads input data from data directory and reorder graph for better locality
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- class easygraph.datasets.citation_graph.CoraBinary(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]
Bases:
EasyGraphBuiltinDataset
A mini-dataset for binary classification task using Cora.
Once loaded, it has the following members:
graphs : list of DGLGraph
pmpds : list of scipy.sparse.coo_matrix
labels : list of numpy.ndarray
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Overwrite to realize your own logic of processing the input data.
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- has_cache()[source]
Overwrite to realize your own logic of deciding whether there exists a cached dataset.
By default False.
- property save_name
- class easygraph.datasets.citation_graph.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]
Bases:
CitationGraphDataset
Cora citation network dataset.
Nodes represent papers and edges represent citation relationships. Each node has a predefined feature with 1433 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given paper.
Statistics:
Nodes: 2708
Edges: 10556
Number of Classes: 7
Label split:
Train: 140
Valid: 500
Test: 1000
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes
- Type
int
Notes
The node feature is row-normalized.
Examples
>>> from easygraph.datasets.citation_graph import CoraGraphDataset
>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_classes
- num_labels
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
- reverse_edge
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Loads input data from data directory and reorder graph for better locality
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- class easygraph.datasets.citation_graph.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]
Bases:
CitationGraphDataset
Pubmed citation network dataset.
Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 500 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given publication.
Statistics:
Nodes: 19717
Edges: 88651
Number of Classes: 3
Label Split:
Train: 60
Valid: 500
Test: 1000
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes
- Type
int
Notes
The node feature is row-normalized.
Examples
>>> from easygraph.datasets.citation_graph import PubmedGraphDataset
>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_classes
- num_labels
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
- reverse_edge
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Loads input data from data directory and reorder graph for better locality
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- easygraph.datasets.citation_graph.load_citeseer(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]
Get CiteseerGraphDataset
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type
CiteseerGraphDataset
- easygraph.datasets.citation_graph.load_cora(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]
Get CoraGraphDataset
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type
CoraGraphDataset
- easygraph.datasets.citation_graph.load_pubmed(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]
Get PubmedGraphDataset
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type
PubmedGraphDataset
easygraph.datasets.eg_dataset module
Basic EasyGraph Dataset
- class easygraph.datasets.eg_dataset.EasyGraphBuiltinDataset(name, url, raw_dir=None, hash_key=(), force_reload=False, verbose=True, transform=None)[source]
Bases:
EasyGraphDataset
The Basic EasyGraph Builtin Dataset.
- Parameters
name (str) – Name of the dataset.
url (str) – Url to download the raw dataset.
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on the disk) from the same dataset class by comparing the hash values.
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
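The phrase "transformed before every access" can be read as: the stored graph is left untouched, and the transform is applied each time an item is fetched. The sketch below illustrates that reading with plain Python objects. ToyDataset and add_self_loops are hypothetical names, and an adjacency dict stands in for the graph object.

```python
class ToyDataset:
    """Hypothetical container that applies an optional transform on access."""

    def __init__(self, items, transform=None):
        self._items = list(items)
        self._transform = transform

    def __getitem__(self, idx):
        item = self._items[idx]
        if self._transform is None:
            return item
        # The transform runs on every access; the stored item is not replaced.
        return self._transform(item)

    def __len__(self):
        return len(self._items)


def add_self_loops(adjacency):
    """Return a new adjacency dict in which every node also links to itself."""
    return {
        node: sorted(set(neighbors) | {node})
        for node, neighbors in adjacency.items()
    }


raw = {0: [1], 1: [0]}
plain = ToyDataset([raw])
looped = ToyDataset([raw], transform=add_self_loops)
print(plain[0])
print(looped[0])
```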
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
save_dir
Directory to save the processed dataset.
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Overwrite to realize your own logic of processing the input data.
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- class easygraph.datasets.eg_dataset.EasyGraphDataset(name, url=None, raw_dir=None, save_dir=None, hash_key=(), force_reload=False, verbose=False, transform=None)[source]
Bases:
object
The basic EasyGraph dataset for creating graph datasets. This class defines a basic template class for EasyGraph Dataset. The following steps will be executed automatically:
1. Check whether there is a dataset cache on disk (already processed and stored on the disk) by invoking has_cache(). If true, goto 5.
2. Call download() to download the data if url is not None.
3. Call process() to process the data.
4. Call save() to save the processed dataset on disk and goto 6.
5. Call load() to load the processed dataset from disk.
6. Done.
Users can overwrite these functions with their own data processing logic.
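The control flow described above can be sketched with a stand-alone class. SketchDataset is a hypothetical stand-in that mirrors only the documented order of calls; it is not the EasyGraph implementation. The JSON cache file and the recorded call list are assumptions made for illustration.

```python
import json
import os
import tempfile


class SketchDataset:
    """Hypothetical stand-in mirroring the documented load flow."""

    def __init__(self, name, save_dir, url=None):
        self.name = name
        self.url = url
        self.save_path = os.path.join(save_dir, name)
        self.calls = []  # records which hooks ran, in order
        self.data = None
        self._load()

    def _load(self):
        if self.has_cache():      # step 1: a cache exists, jump to load
            self.load()           # step 5
            return
        if self.url is not None:  # step 2: download only when a url is set
            self.download()
        self.process()            # step 3
        self.save()               # step 4

    def _cache_file(self):
        return os.path.join(self.save_path, "processed.json")

    def has_cache(self):
        self.calls.append("has_cache")
        return os.path.exists(self._cache_file())

    def download(self):
        self.calls.append("download")

    def process(self):
        self.calls.append("process")
        self.data = {"edges": [[0, 1], [1, 2]]}

    def save(self):
        self.calls.append("save")
        os.makedirs(self.save_path, exist_ok=True)
        with open(self._cache_file(), "w") as f:
            json.dump(self.data, f)

    def load(self):
        self.calls.append("load")
        with open(self._cache_file()) as f:
            self.data = json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    first = SketchDataset("toy", save_dir=tmp)   # no cache yet
    second = SketchDataset("toy", save_dir=tmp)  # cache written by `first`
    first_calls, second_calls = first.calls, second.calls

print(first_calls)
print(second_calls)
```

The first construction processes and saves the data; the second finds the cache and only loads it.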
- Parameters
name (str) – Name of the dataset
url (str) – Url to download the raw dataset. Default: None
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
save_dir (str) – Directory to save the processed dataset. Default: same as raw_dir
hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on the disk) from the same dataset class by comparing the hash values. Default: (), the corresponding hash value is 'f9065fa7'.
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
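How a hash_key tuple can map to a short hash string is sketched below. The text does not state the hash function, so this assumes a SHA-1 digest of the tuple's string form truncated to 8 hexadecimal characters. The function dataset_hash is hypothetical, and whether it reproduces the documented default value depends on the actual implementation.

```python
import hashlib


def dataset_hash(hash_key=()):
    """Hypothetical helper: short hex digest derived from a hash_key tuple."""
    digest = hashlib.sha1(str(hash_key).encode("utf-8")).hexdigest()
    return digest[:8]


default_hash = dataset_hash()
with_reverse = dataset_hash(("cora", True))
without_reverse = dataset_hash(("cora", False))
print(default_hash, with_reverse, without_reverse)
```

Instances created with different settings get different hashes, so their on-disk caches can be told apart.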
- url
The URL to download the dataset
- Type
str
- name
The dataset name
- Type
str
- raw_dir
Directory to store all the downloaded raw datasets.
- Type
str
- raw_path
Path to the downloaded raw dataset folder. An alias for os.path.join(self.raw_dir, self.name).
- Type
str
- save_dir
Directory to save all the processed datasets.
- Type
str
- save_path
Path to the processed dataset folder. An alias for os.path.join(self.save_dir, self.name).
- Type
str
- verbose
Whether to print more runtime information.
- Type
bool
- hash
Hash value for the dataset and the setting.
- Type
str
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
save_dir
Directory to save the processed dataset.
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Overwrite to realize your own logic of downloading data.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Overwrite to realize your own logic of processing the input data.
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- download()[source]
Overwrite to realize your own logic of downloading data.
It is recommended to download the data to the self.raw_dir folder. This step can be skipped if the dataset is already in self.raw_dir.
- has_cache()[source]
Overwrite to realize your own logic of deciding whether there exists a cached dataset.
By default False.
- property hash
Hash value for the dataset and the setting.
- load()[source]
Overwrite to realize your own logic of loading the saved dataset from files.
It is recommended to use dgl.data.utils.load_graphs to load dgl graph from files and use dgl.data.utils.load_info to load extra information into python dict object.
- property name
Name of the dataset.
- property raw_dir
Raw file directory contains the input data folder.
- property raw_path
- save()[source]
Overwrite to realize your own logic of saving the processed dataset into files.
It is recommended to use dgl.data.utils.save_graphs to save dgl graph into files and use dgl.data.utils.save_info to save extra information into files.
- property save_dir
Directory to save the processed dataset.
- property save_path
Path to save the processed dataset.
- property url
Get url to download the raw dataset.
- property verbose
Whether to print information.
easygraph.datasets.get_sample_graph module
- easygraph.datasets.get_sample_graph.get_graph_blogcatalog()[source]
Returns the undirected graph of blogcatalog.
- Returns
get_graph_blogcatalog – The undirected graph instance of blogcatalog from dataset: https://github.com/phanein/deepwalk/blob/master/example_graphs/blogcatalog.mat
- Return type
easygraph.Graph
- easygraph.datasets.get_sample_graph.get_graph_flickr()[source]
Returns the undirected graph of Flickr dataset.
- Returns
get_graph_flickr – The undirected graph instance of Flickr from dataset: http://socialnetworks.mpi-sws.mpg.de/data/flickr-links.txt.gz
- Return type
easygraph.Graph
- easygraph.datasets.get_sample_graph.get_graph_karateclub()[source]
Returns the undirected graph of Karate Club.
- Returns
get_graph_karateclub – The undirected graph instance of karate club from dataset: http://vlado.fmf.uni-lj.si/pub/networks/data/Ucinet/UciData.htm
- Return type
easygraph.Graph
- easygraph.datasets.get_sample_graph.get_graph_youtube()[source]
Returns the undirected graph of Youtube dataset.
- Returns
get_graph_youtube – The undirected graph instance of Youtube from dataset: http://socialnetworks.mpi-sws.mpg.de/data/youtube-links.txt.gz
- Return type
easygraph.Graph
easygraph.datasets.gnn_benchmark module
- class easygraph.datasets.gnn_benchmark.AmazonCoBuyComputerDataset(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]
Bases:
GNNBenchmarkDataset
‘Computer’ part of the AmazonCoBuy dataset for node classification task.
Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.
Reference: https://github.com/shchur/gnn-benchmark#datasets
Statistics:
Nodes: 13,752
Edges: 491,722 (note that the original dataset has 245,778 edges, but DGL adds the reverse edges and removes the duplicates, hence the different number)
Number of classes: 10
Node feature size: 767
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- num_classes
Number of classes for each node.
- Type
int
Examples
>>> from easygraph.datasets.gnn_benchmark import AmazonCoBuyComputerDataset
>>> data = AmazonCoBuyComputerDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
num_classes
Number of classes.
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
save_dir
Directory to save the processed dataset.
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Overwrite to realize your own logic of processing the input data.
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- property num_classes
Number of classes.
- Return type
int
easygraph.datasets.karate module
- class easygraph.datasets.karate.KarateClubDataset(transform=None)[source]
Bases:
EasyGraphDataset
Karate Club dataset for Node Classification
Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002. Official website: http://konect.cc/networks/ucidata-zachary/
Karate Club dataset statistics:
Nodes: 34
Edges: 156
Number of Classes: 2
- Parameters
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- num_classes
Number of node classes
- Type
int
Examples
>>> from easygraph.datasets.karate import KarateClubDataset
>>> dataset = KarateClubDataset()
>>> num_classes = dataset.num_classes
>>> g = dataset[0]
>>> labels = g.ndata['label']
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
num_classes
Number of classes.
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
save_dir
Directory to save the processed dataset.
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Overwrite to realize your own logic of downloading data.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Overwrite to realize your own logic of processing the input data.
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- property num_classes
Number of classes.
easygraph.datasets.ppi module
PPIDataset for inductive learning.
- class easygraph.datasets.ppi.LegacyPPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]
Bases:
PPIDataset
Legacy version of PPI Dataset
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_labels
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
save_dir
Directory to save the processed dataset.
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Overwrite to realize your own logic of processing the input data.
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- class easygraph.datasets.ppi.PPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]
Bases:
EasyGraphBuiltinDataset
Protein-Protein Interaction dataset for inductive node classification
A toy Protein-Protein Interaction network dataset. The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels. 20 graphs for training, 2 for validation and 2 for testing.
Reference: http://snap.stanford.edu/graphsage/
Statistics:
Train examples: 20
Valid examples: 2
Test examples: 2
- Parameters
mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- num_labels
Number of labels for each node
- Type
int
- labels
Node labels
- Type
Tensor
- features
Node features
- Type
Tensor
Examples
>>> from easygraph.datasets.ppi import PPIDataset
>>> dataset = PPIDataset(mode='valid')
>>> num_labels = dataset.num_labels
>>> for g in dataset:
...     feat = g.ndata['feat']
...     label = g.ndata['label']
...     # your code here
>>>
- Attributes
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_labels
raw_dir
Raw file directory contains the input data folder.
raw_path
Directory contains the input data files.
save_dir
Directory to save the processed dataset.
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Overwrite to realize your own logic of deciding whether there exists a cached dataset.
load() – Overwrite to realize your own logic of loading the saved dataset from files.
process() – Overwrite to realize your own logic of processing the input data.
save() – Overwrite to realize your own logic of saving the processed dataset into files.
- has_cache()[source]
Overwrite to realize your own logic of deciding whether there exists a cached dataset.
By default False.
- property num_labels
easygraph.datasets.utils module
- easygraph.datasets.utils.download(url, path=None, overwrite=True, sha1_hash=None, retries=5, verify_ssl=True, log=True)[source]
Download a given URL.
Codes borrowed from mxnet/gluon/utils.py
- Parameters
url (str) – URL to download.
path (str, optional) – Destination path to store downloaded file. By default stores to the current directory with the same name as in url.
overwrite (bool, optional) – Whether to overwrite the destination file if it already exists. By default always overwrites the downloaded file.
sha1_hash (str, optional) – Expected sha1 hash in hexadecimal digits. Will ignore existing file when hash is specified but doesn’t match.
retries (integer, default 5) – The number of times to attempt downloading in case of failure or non 200 return codes.
verify_ssl (bool, default True) – Verify SSL certificates.
log (bool, default True) – Whether to print the progress for download
- Returns
The file path of the downloaded file.
- Return type
str
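How the overwrite and sha1_hash parameters interact can be sketched without touching the network. The helpers check_sha1 and needs_download are hypothetical and model only the decision of whether an existing file would be fetched again; the temporary file and its payload are made up for the demonstration.

```python
import hashlib
import os
import tempfile


def check_sha1(filename, sha1_hash):
    """Return True if the file's SHA-1 hex digest equals sha1_hash."""
    sha1 = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    return sha1.hexdigest() == sha1_hash


def needs_download(path, overwrite=True, sha1_hash=None):
    """Decide whether the file at path should be (re)downloaded."""
    if overwrite or not os.path.exists(path):
        return True
    # An existing file is ignored when a hash is given but does not match.
    if sha1_hash is not None and not check_sha1(path, sha1_hash):
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"example payload")
    good = hashlib.sha1(b"example payload").hexdigest()

    cached_ok = needs_download(path, overwrite=False, sha1_hash=good)
    cached_bad = needs_download(path, overwrite=False, sha1_hash="0" * 40)
    forced = needs_download(path, overwrite=True, sha1_hash=good)

print(cached_ok, cached_bad, forced)
```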
- easygraph.datasets.utils.extract_archive(file, target_dir, overwrite=False)[source]
Extract archive file.
- Parameters
file (str) – Absolute path of the archive file.
target_dir (str) – Target directory of the archive to be uncompressed.
overwrite (bool, default False) – Whether to overwrite the contents inside the directory. By default, existing contents are not overwritten.
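The role of the overwrite flag can be sketched with the standard library. This assumes, following the overwrite=False default in the signature, that extraction is skipped when target_dir already exists. The function extract_archive_sketch and its boolean return value are hypothetical and are not the package's own implementation.

```python
import os
import shutil
import tempfile
import zipfile


def extract_archive_sketch(file, target_dir, overwrite=False):
    """Extract file into target_dir; return True if extraction took place."""
    if os.path.exists(target_dir) and not overwrite:
        # Assumed behavior: an existing target directory is left untouched.
        return False
    # shutil.unpack_archive picks the format from the file extension.
    shutil.unpack_archive(file, target_dir)
    return True


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "toy.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("toy/readme.txt", "hello")

    target = os.path.join(tmp, "extracted")
    first_run = extract_archive_sketch(archive_path, target)
    second_run = extract_archive_sketch(archive_path, target)
    forced_run = extract_archive_sketch(archive_path, target, overwrite=True)
    extracted_ok = os.path.exists(os.path.join(target, "toy", "readme.txt"))

print(first_run, second_run, forced_run, extracted_ok)
```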
- easygraph.datasets.utils.generate_mask_tensor(mask)[source]
Generate a mask tensor according to the backend in use. For torch, it will create a bool tensor.
- Parameters
mask (numpy ndarray) – Input mask tensor.