[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

Last update: Dec 21, 2022

Overview

This repository contains the code and datasets accompanying the paper:
New Benchmarks for Learning on Non-Homophilous Graphs
Derek Lim (Cornell), Xiuyu Li (Cornell), Felix Hohne (Cornell), and Ser-Nam Lim (Facebook AI).
Workshop on Graph Learning Benchmarks, WWW 2021.
[PDF link]

The code loads our proposed datasets, computes our measure of the presence of homophily, and trains various graph machine learning models in our experimental setup.

Organization

main.py contains the main experimental scripts.

dataset.py loads our datasets.

models.py contains implementations of the graph machine learning models, except for C&S, which is in separate files (correct_smooth.py, cs_tune_hparams.py). gcn-ogbn-proteins.py contains code for running GCN and GCN+JK on ogbn-proteins. Running several of the GNN models on the larger datasets may require at least 24GB of VRAM.

homophily.py contains functions for computing homophily measures, including the one that we introduce in our_measure.
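
To make concrete what such measures compute, here is a minimal sketch in plain Python that is independent of homophily.py. The first function is the usual edge homophily ratio. The second follows the class-insensitive formula described in the paper, (1 / (C - 1)) * sum_k max(0, h_k - |C_k| / n), under the assumption that the graph is undirected and given as a list of node pairs. The function names are illustrative and are not the ones used in homophily.py.

```python
from collections import Counter


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def class_insensitive_homophily(edges, labels):
    """(1 / (C - 1)) * sum_k max(0, h_k - |C_k| / n) for an undirected graph."""
    n = len(labels)
    class_sizes = Counter(labels)
    num_classes = len(class_sizes)
    if num_classes < 2:
        raise ValueError("need at least two classes")

    same_deg = Counter()   # per class: edge endpoints whose neighbour has the same label
    total_deg = Counter()  # per class: all edge endpoints (sum of node degrees)
    for u, v in edges:
        for a, b in ((u, v), (v, u)):  # count each undirected edge from both ends
            total_deg[labels[a]] += 1
            if labels[a] == labels[b]:
                same_deg[labels[a]] += 1

    score = 0.0
    for k, size in class_sizes.items():
        if total_deg[k] == 0:
            continue  # a class with no incident edges contributes nothing
        h_k = same_deg[k] / total_deg[k]       # class-wise homophily
        score += max(0.0, h_k - size / n)      # subtract the class-proportion baseline
    return score / (num_classes - 1)


# 4-cycle with two balanced classes: half of the edges are intra-class.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = [0, 0, 1, 1]
print(edge_homophily(edges, labels))               # 0.5
print(class_insensitive_homophily(edges, labels))  # 0.0
```

On this toy graph the edge homophily ratio is 0.5, yet the class-corrected score is 0.0, because with two balanced classes a ratio of 0.5 is no better than the class-proportion baseline.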

Datasets

As discussed in the paper, our proposed datasets are "twitch-e", "yelp-chi", "deezer", "fb100", "pokec", "ogbn-proteins", "arxiv-year", and "snap-patents". Each can be loaded with load_nc_dataset in dataset.py by passing in its string name. Many of these datasets are included in the data/ directory, but due to their size, yelp-chi, snap-patents, and pokec are automatically downloaded from a Google Drive link when loaded from dataset.py. The arxiv-year and ogbn-proteins datasets are downloaded using OGB downloaders.

load_nc_dataset returns an NCDataset, which is documented in dataset.py. It is functionally equivalent to OGB's Library-Agnostic Loader for Node Property Prediction, except that it returns torch tensors. See the OGB website for more specific documentation. Just like the OGB function, dataset.get_idx_split() returns a fixed dataset split for training, validation, and testing.
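
A split returned by dataset.get_idx_split() can be sanity-checked before training. The sketch below assumes the split is a dictionary with OGB's "train", "valid", and "test" keys, each holding a sequence of node indices. check_split is a hypothetical helper and not part of dataset.py. It is shown on plain lists so that it runs without the repository, but 1-D torch tensors of indices should work the same way.

```python
def check_split(split_idx, num_nodes):
    """Validate an OGB-style split dict and return the size of each part."""
    seen = set()
    sizes = {}
    for part in ("train", "valid", "test"):  # assumed OGB key names
        idx = [int(i) for i in split_idx[part]]
        if any(i < 0 or i >= num_nodes for i in idx):
            raise ValueError(f"{part} split has an index outside [0, {num_nodes})")
        if len(set(idx)) != len(idx):
            raise ValueError(f"{part} split contains duplicate indices")
        if seen.intersection(idx):
            raise ValueError(f"{part} split overlaps an earlier split")
        seen.update(idx)
        sizes[part] = len(idx)
    return sizes


# Toy split over 8 nodes, standing in for dataset.get_idx_split().
toy_split = {"train": [0, 1, 2, 3], "valid": [4, 5], "test": [6, 7]}
print(check_split(toy_split, num_nodes=8))
```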

When there are multiple graphs (as in the case of twitch-e and fb100), different ones can be loaded by passing in the sub_dataname argument to load_nc_dataset in dataset.py.

twitch-e consists of seven graphs ["DE", "ENGB", "ES", "FR", "PTBR", "RU", "TW"]. In the paper we test on DE.

fb100 consists of 100 graphs. We only include ["Amherst41", "Cornell5", "Johns Hopkins55", "Penn94", "Reed98"] in this repo, although others may be downloaded from the internet archive. In the paper we test on Penn94.
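
The naming rules above can be checked before calling the loader. The following is a minimal sketch: resolve_dataset and load are hypothetical helpers, and the tables only list the dataset and sub-dataset names mentioned in this README (so fb100 graphs downloaded separately would need to be added). The call to load_nc_dataset assumes that the dataset name is the first positional argument and that sub_dataname can be passed by keyword.

```python
# Datasets with several graphs, and the sub-dataset names listed in this README.
SUB_DATASETS = {
    "twitch-e": ["DE", "ENGB", "ES", "FR", "PTBR", "RU", "TW"],
    "fb100": ["Amherst41", "Cornell5", "Johns Hopkins55", "Penn94", "Reed98"],
}
# Datasets that consist of a single graph.
SINGLE_GRAPH = ["yelp-chi", "deezer", "pokec", "ogbn-proteins",
                "arxiv-year", "snap-patents"]


def resolve_dataset(name, sub_dataname=None):
    """Check a (name, sub_dataname) pair against the lists above."""
    if name in SUB_DATASETS:
        if sub_dataname not in SUB_DATASETS[name]:
            raise ValueError(
                f"{name} needs sub_dataname in {SUB_DATASETS[name]}, got {sub_dataname!r}")
        return name, sub_dataname
    if name in SINGLE_GRAPH:
        if sub_dataname:
            raise ValueError(f"{name} has a single graph; drop sub_dataname")
        return name, None
    raise ValueError(f"unknown dataset {name!r}")


def load(name, sub_dataname=None):
    """Validate the names, then defer to the repository's loader."""
    name, sub = resolve_dataset(name, sub_dataname)
    from dataset import load_nc_dataset  # only importable from the repository root
    if sub is None:
        return load_nc_dataset(name)
    return load_nc_dataset(name, sub_dataname=sub)


print(resolve_dataset("twitch-e", "DE"))
print(resolve_dataset("pokec"))
```

Only resolve_dataset is exercised here. load has to be run from the root directory of the repository, where dataset.py is importable.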

Installation instructions

Create a new conda environment using python=3.8 (e.g. conda create --name non-hom python=3.8)
Activate your conda environment
Check your CUDA version using nvidia-smi
In the root directory of this repository, run bash install.sh cu110, replacing cu110 with the tag for your CUDA version (e.g. CUDA 11 -> cu110, CUDA 10.2 -> cu102, CUDA 10.1 -> cu101). We tested on Ubuntu 18.04 with CUDA 11.0.
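
The mapping from the reported CUDA version to the install.sh argument can be sketched as follows. cuda_tag is a hypothetical helper, not part of this repository. It assumes the nvidia-smi banner contains a "CUDA Version: X.Y" field and only knows the three tags listed above, so any other version raises an error instead of guessing a tag.

```python
import re

# Only the tags mentioned in the installation steps above.
SUPPORTED_TAGS = {"11.0": "cu110", "10.2": "cu102", "10.1": "cu101"}


def cuda_tag(smi_output):
    """Extract 'CUDA Version: X.Y' from nvidia-smi text and map it to a tag."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in the given text")
    version = f"{match.group(1)}.{match.group(2)}"
    if version not in SUPPORTED_TAGS:
        raise ValueError(f"no tag listed for CUDA {version}")
    return SUPPORTED_TAGS[version]


# Example banner line; in practice the text would come from running nvidia-smi.
banner = "| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |"
print("bash install.sh", cuda_tag(banner))
```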

Running experiments

Make sure a results folder exists in the root directory.
Our experiments are in the experiments/ directory. There are bash scripts for running methods on single and multiple datasets. Please note that the experiments must be run from the root directory. For instance, to run the MixHop experiments on snap-patents, use:

bash experiments/mixhop_exp.sh snap-patents

Some datasets require a second sub_dataset argument. For example, to run the MixHop experiments on the DE sub_dataset of twitch-e, use:

bash experiments/mixhop_exp.sh twitch-e DE

Otherwise, run python main.py --help to see the full list of options for running experiments. As one example, to train a GAT with max jumping knowledge connections on (directed) arxiv-year with 32 hidden channels and 4 attention heads, run:

python main.py --dataset arxiv-year --method gatjk --hidden_channels 32 --gat_heads 4 --directed
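
When launching many runs, the commands above can be assembled from Python. This is a minimal sketch using only the scripts and flags shown in this section. experiment_command, main_command, and run are hypothetical helpers, and run creates the results folder before executing the command from the repository root.

```python
import os
import subprocess


def experiment_command(script, dataset, sub_dataset=None):
    """Build e.g. ['bash', 'experiments/mixhop_exp.sh', 'twitch-e', 'DE']."""
    cmd = ["bash", script, dataset]
    if sub_dataset:
        cmd.append(sub_dataset)
    return cmd


def main_command(dataset, method, hidden_channels=None, gat_heads=None, directed=False):
    """Build a main.py invocation from the flags shown in the example above."""
    cmd = ["python", "main.py", "--dataset", dataset, "--method", method]
    if hidden_channels is not None:
        cmd += ["--hidden_channels", str(hidden_channels)]
    if gat_heads is not None:
        cmd += ["--gat_heads", str(gat_heads)]
    if directed:
        cmd.append("--directed")
    return cmd


def run(cmd, root="."):
    """Run a command from the repository root, creating results/ if needed."""
    os.makedirs(os.path.join(root, "results"), exist_ok=True)
    return subprocess.run(cmd, cwd=root, check=True)


print(" ".join(experiment_command("experiments/mixhop_exp.sh", "twitch-e", "DE")))
print(" ".join(main_command("arxiv-year", "gatjk", hidden_channels=32,
                            gat_heads=4, directed=True)))
```

The two print calls reproduce the command lines given above. Passing either list to run would execute it, which requires the repository and its environment to be set up.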
