KDD Cup 2020
The 1st AutoGraph Challenge
AutoML for Graph Representation Learning
(Provided by 4Paradigm, ChaLearn, Stanford and Google)

Overview

KDD 2020 will be held in San Diego, CA, USA from August 23 to 27, 2020. The Automatic Graph Representation Learning challenge (AutoGraph), the first-ever AutoML challenge applied to graph-structured data, is the AutoML track of KDD Cup 2020, provided by 4Paradigm, ChaLearn, Stanford and Google. The challenge website can be found here: https://www.automl.ai/competitions/3

Machine learning on graph-structured data. Graph-structured data are ubiquitous in the real world, e.g., social networks, scholar networks, and knowledge graphs. Graph representation learning has become a very active topic; its goal is to learn a low-dimensional representation of each node in the graph, which is then used for downstream tasks such as friend recommendation in a social network, or classifying academic papers into different subjects in a citation network. Traditionally, heuristics are exploited to extract features for each node from the graph, e.g., degree statistics or random-walk-based similarities. In recent years, however, sophisticated models such as graph neural networks (GNNs) have been proposed for graph representation learning, leading to state-of-the-art results in many tasks, such as node classification and link prediction.

Challenges in developing versatile models. Nevertheless, whether one uses traditional heuristic methods or recent GNN-based methods, considerable computational resources and expert effort must be invested to achieve satisfying performance on a given task. For example, in DeepWalk and node2vec, two well-known random-walk-based methods, various hyper-parameters, such as the length and number of walks per node and the window size, have to be fine-tuned to obtain good performance. When using GNN models such as GraphSAGE or GAT, we have to spend considerable time choosing the optimal aggregation function in GraphSAGE, or the number of self-attention heads in GAT. This heavy demand for expert fine-tuning limits the application of existing graph representation models.

Autograph Challenge. AutoML/AutoDL (https://autodl.chalearn.org) is a promising approach to lowering the human cost of machine learning applications, and it has achieved encouraging successes in hyper-parameter tuning, model selection, neural architecture search, and feature engineering. In order to enable more people and organizations to fully exploit their graph-structured data, we organize the AutoGraph challenge, dedicated to such data.

In this challenge, participants should design a computer program capable of solving graph representation learning problems autonomously (without any human intervention). Compared to previous AutoML competitions we organized, the new focus is on graph-structured data, where nodes with features and edges (connections among nodes) are available.

To prevail in the proposed challenge, participants should propose automatic solutions that can effectively and efficiently learn a high-quality representation for each node based on the given features and the neighborhood and structural information underlying the graph. The solutions should automatically extract and exploit any useful signal in the graph, whether through heuristics or learned models.
 
Here, we list some specific questions that the participants should consider and answer:
  • How to automatically design heuristics to extract features for a node in a graph?
  • How to automatically exploit the neighborhood information in a graph?
  • How to automatically tune an optimal set of hyper-parameters for random-walk-based graph embedding methods?
  • How to automatically choose the aggregation function when using GNN-based models?
  • How to automatically design an optimal GNN architecture for different datasets?
  • How to automatically and efficiently select appropriate hyper-parameters for different models?
  • How to make the solution more generic, i.e., applicable to unseen tasks?
  • How to keep the computational and memory cost acceptable?

Tentative Timeline 

  • March 25th, 2020: Beginning of Feedback Phase, release of public datasets. Participants can start submitting codes and obtaining immediate feedback in the leaderboard.
  • May 25th, 2020: End of Feedback Phase
  • May 26th, 2020: Beginning of Check Phase
  • June 1st, 2020: End of Check Phase; organizers notify participants of the Check Phase results
  • June 2nd, 2020: Beginning of Final Phase
  • June 4th, 2020: Deadline for re-submitting to Final Phase
  • June 5th, 2020: Deadline for submitting the fact sheets
  • June 7th, 2020: End of Final Phase, beginning of post competition process
  • June 9th, 2020: Announcement of the KDD Cup 2020 Winners
  • August 22nd, 2020: Beginning of KDD 2020

 

Dataset

This page describes the datasets used in the AutoGraph challenge. 15 graph datasets are prepared for this competition: 5 public datasets, which can be downloaded, are provided so that participants can develop their solutions offline; another 5 feedback datasets are used to compute the public leaderboard scores of their AutoGraph solutions; afterwards, the solutions will be evaluated on 5 final datasets without human intervention.

This challenge focuses on the problem of graph representation learning, where node classification is chosen as the task to evaluate the quality of learned representations. 

Note that you can use additional datasets to debug your solutions, e.g., from the Open Graph Benchmark and the SNAP project at Stanford University.

Components

The datasets are collected from real-world business scenarios, and are shuffled and split into training and testing parts. Each dataset contains two node files (training and testing), an edge file, a feature file, two label files (training and testing) and a metadata file.
Please note that the data files are read by our program and sent to the participant's program; for details, please see Evaluations. A short offline loading sketch is given after the list below.

  • The training node file (train_node_id.txt) and the testing node file (test_node_id.txt) list all node indices used for training and testing, respectively. The node indices are of type int.

    Example:

    node_index
          0
          1
          2
          3
          4
          5
          6
          7
          8
  • The edge file (edge.tsv) contains a set of triplets. A triplet of the form (src_idx, dst_idx, edge_weight) describes a connection from node index src_idx to node index dst_idx with edge weight edge_weight. The type of edge_weight is numerical (float or int).

    Example:

    src_idx	dst_idx	edge_weight
          0	62	1
          0	40	1
          0	127	1
          0	178	1
          0	53	1
          0	67	1
          0	189	1
          0	135	1
          0	48	1
  • The feature file (feature.tsv) is in tsv format. A line of the file is in the format (node_index f0 f1 ...), where node_index is the index of a node and f0, f1, ... are its features.

    The types of the features are all numerical.

    Example:

    node_index	f0	f1	f2	f3	f4
          0	0.47775876104073356	0.05387578793865644	0.729954200019264	0.6908184238803438	0.9235037015600726
          1	0.34224099072954905	0.6693042243297719	0.08736572053032532	0.07358721227831977	0.27398819586899037
          2	0.8259856025619777	0.4421366756096389	0.9872258141866499	0.4865590790508849	0.12633483872234397
          3	0.11177231902956064	0.40446709473609854	0.2293892960354328	0.4021930454713125	0.40698138834963693
          4	0.34427740190016	0.26622372452918375	0.8042497280547812	0.0022605424347530434	0.8903425653304337
          5	0.08640169107378592	0.43038539444039425	0.6635778390235518	0.9229371884297638	0.8912709075205572
          6	0.6765202023072282	0.9039673560303431	0.986304900152288	0.23661480664770496	0.7140162062880935
          7	0.043651531427249424	0.010090830922163785	0.758404203984433	0.05315076246728134	0.8017402643849966
          8	0.49802375200717	0.6735698429117265	0.04292694482433346	0.3033723691640159	0.43132281219124635
  • The training label file (train_label.tsv) and the testing label file (test_label.tsv) are also in tsv format and contain the label information of the training and testing nodes, respectively. A line in these files is in the format (node_index class), where node_index is the index of a node and class is its label.

    Example:

    node_index  class
          0	1
          1	3
          2	1
          3	1
          4	3
          5	1
          6	1
          7	3
          8	1
  • The metadata file (config.yml) is in yaml format. It provides meta-information about the dataset, including:

    • schema: DEPRECATED
    • n_class: the number of label classes in the dataset
    • time_budget: the time budget of the dataset

    Example:

    time_budget: 5000
    n_class: 7
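
To make the file formats concrete, below is a minimal offline loading sketch. It is an illustration only: the platform reads these files and passes them to your program (see Evaluations), and the pandas/yaml calls and the load_dataset helper are assumptions for offline debugging, not part of the platform API.

# Minimal offline loading sketch (illustrative; the platform does this for you).
# Assumes pandas and PyYAML are installed and the files follow the formats above.
import pandas as pd
import yaml

def load_dataset(dataset_dir):
    # node id files have a single 'node_index' column
    train_ids = pd.read_csv(f"{dataset_dir}/train_node_id.txt")["node_index"].tolist()
    test_ids = pd.read_csv(f"{dataset_dir}/test_node_id.txt")["node_index"].tolist()
    # tab-separated files: (src_idx, dst_idx, edge_weight), (node_index, f0, f1, ...), (node_index, class)
    edges = pd.read_csv(f"{dataset_dir}/edge.tsv", sep="\t")
    features = pd.read_csv(f"{dataset_dir}/feature.tsv", sep="\t")
    train_label = pd.read_csv(f"{dataset_dir}/train_label.tsv", sep="\t")
    # metadata: time_budget and n_class
    with open(f"{dataset_dir}/config.yml") as f:
        meta = yaml.safe_load(f)
    return train_ids, test_ids, edges, features, train_label, meta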

Rules

This challenge has three phases. Participants are provided with 5 public datasets, which can be downloaded, so that they can develop their solutions offline. Then, the code is uploaded to the platform and participants receive immediate feedback on the performance of their method on another 5 feedback datasets. After the Feedback Phase terminates, there is a Check Phase, in which participants are allowed to submit their code only once on the final datasets in order to debug. Participants cannot read detailed logs, but they can see whether their code reports errors. Finally, in the Final Phase, participants' solutions are evaluated on 5 final datasets. The ranking in the Final Phase determines the winners.

Submitted code is trained and tested automatically, without any human intervention. Code submitted in the Feedback (resp. Final) Phase is run on all 5 feedback (resp. final) datasets in parallel on separate compute workers, each with its own time budget.

The identities of the datasets used for testing on the platform are concealed. The data are provided in a raw form (no feature extraction) to encourage researchers to use deep learning methods that perform automatic feature learning, although this is NOT a requirement. All problems are node classification problems, and the tasks are constrained by the time budget.

Here is some pseudo-code of the evaluation protocol:

# For each dataset, our evaluation program executes the following steps:

# load the dataset
dataset = Dataset(args.dataset_dir)

# get information about the dataset
time_budget = dataset.get_metadata().get("time_budget")
n_class = dataset.get_metadata().get("n_class")
schema = dataset.get_metadata().get("schema")

# import and initialize the participant's Model class
umodel = init_usermodel()

# initialize the timer
timer = _init_timer(time_budget)

# train the model and predict the labels of the testing data
predictions = _train_predict(umodel, dataset, timer, n_class, schema)
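
Because each run is stopped when its time budget expires, a submission may also want to track the remaining budget itself. Below is a minimal sketch of such bookkeeping; the BudgetTimer class and the wall-clock accounting are illustrative assumptions, not the platform's actual timer.

import time

class BudgetTimer:
    """Tracks wall-clock time against a task's time budget."""

    def __init__(self, time_budget):
        self.deadline = time.time() + time_budget

    def remaining(self):
        # seconds left before the budget is exhausted
        return self.deadline - time.time()

# Usage sketch inside train_predict: stop refining when little budget remains.
# timer = BudgetTimer(time_budget)
# while timer.remaining() > 60:
#     train_one_epoch()  # hypothetical training step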
      

Metrics

For both the Feedback Phase and the Final Phase, accuracy is evaluated on each dataset. Submissions are ranked by their average rank over all datasets of a phase.

Note that if a submission fails on a certain dataset, a default score (-1 in this challenge) is recorded for that dataset on the leaderboard.
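
As an illustration of this ranking scheme, the sketch below computes per-dataset ranks from accuracy scores and averages them; it ignores tie-breaking details, which are not specified here.

import numpy as np

def average_rank(scores):
    """scores: (n_submissions, n_datasets) accuracies; returns each
    submission's rank averaged over datasets (rank 1 = best)."""
    scores = np.asarray(scores, dtype=float)
    order = (-scores).argsort(axis=0)  # best submission first, per dataset
    ranks = np.empty_like(order)
    for j in range(scores.shape[1]):
        ranks[order[:, j], j] = np.arange(1, scores.shape[0] + 1)
    return ranks.mean(axis=1)

# Example: 3 submissions on 2 datasets -> average ranks [1.5, 1.5, 3.0]
print(average_rank([[0.90, 0.85],
                    [0.80, 0.95],
                    [0.70, 0.60]]))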


API

The participants should implement a class Model with a method train_predict, which is described as follows:

class Model:
    """User model."""

    def __init__(self):
        # initialization, if any
        pass

    def train_predict(self, data, time_budget, n_class, schema):
        """Train and predict.

        This method is called by the competition platform and is
        constrained by time_budget.

        Parameters
        ----------
        data: dict storing all input data; keys and values are:
            'fea_table': pandas.DataFrame, features of the training and testing nodes
            'edge_file': pandas.DataFrame, edge information of the graph; dtypes of all columns are int
            'train_indices': list of int, indices of all training nodes
            'test_indices': list of int, indices of all testing nodes
            'train_label': pandas.DataFrame, labels of the training nodes
            For details, please check the format of the data files.
        time_budget: the time budget of this task (see config.yml)
        n_class: int, the number of classes in this task
        schema: deprecated

        Returns
        -------
        pred: list (or pandas.Series / 1D numpy.ndarray) of predictions
            for all testing samples, in the same order as test_indices.
        """
        # ... compute pred here ...
        return pred
      

It is the responsibility of the participants to make sure that the "train_predict" method does not exceed the time budget.
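
For concreteness, here is a minimal baseline sketch implementing this API. It deliberately ignores the graph structure and classifies nodes from their raw features with scikit-learn's logistic regression; scikit-learn and the column names from the Dataset section are assumptions, and a competitive solution would also exploit the edges.

from sklearn.linear_model import LogisticRegression

class Model:
    """Feature-only baseline; ignores edges entirely."""

    def __init__(self):
        pass

    def train_predict(self, data, time_budget, n_class, schema):
        # index features and labels by node so rows can be fetched by node index
        fea = data['fea_table'].set_index('node_index')
        labels = data['train_label'].set_index('node_index')['class']
        train_idx = data['train_indices']
        test_idx = data['test_indices']

        clf = LogisticRegression(max_iter=200)
        clf.fit(fea.loc[train_idx].values, labels.loc[train_idx].values)

        # predictions must follow the order of test_indices
        return clf.predict(fea.loc[test_idx].values).tolist()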


Prizes

1st Prize: 15000 USD

2nd Prize: 10000 USD

3rd Prize: 5000 USD

4th - 10th Prizes: 500 USD each

 

About

Please contact the organizers if you have any questions concerning this challenge.


Advisors

  • Wei-Wei Tu, 4Paradigm Inc., China and ChaLearn, USA
  • Jure Leskovec, Stanford University, USA
  • Hugo Jair Escalante, INAOE, Mexico and ChaLearn, USA
  • Isabelle Guyon, Université Paris-Saclay, France and ChaLearn, USA
  • Qiang Yang, Hong Kong University of Science and Technology, Hong Kong, China

Committee (alphabetical order)

  • Xiawei Guo, 4Paradigm Inc., China
  • Shouxiang Liu, 4Paradigm Inc., China
  • Zhen Xu, 4Paradigm Inc., China
  • Rex Ying, Stanford University, USA
  • Huan Zhao, 4Paradigm Inc., China

Organizing Institutes

4Paradigm, ChaLearn, Stanford and Google



About 4Paradigm Inc.

Founded in early 2015, 4Paradigm is one of the world’s leading AI technology and service providers for industrial applications. 4Paradigm’s flagship product, the AI Prophet, is an AI development platform that enables enterprises to effortlessly build their own AI applications and thereby significantly increase their operational efficiency. Using the AI Prophet, a company can develop a data-driven “AI Core System”, which can largely be regarded as a second core system next to the traditional transaction-oriented Core Banking System (IBM Mainframe) often found in banks. Beyond this, 4Paradigm has successfully developed more than 100 AI solutions for use in various settings such as finance, telecommunication and internet applications. These solutions include, but are not limited to, smart pricing, real-time anti-fraud systems, precision marketing and personalized recommendation. While 4Paradigm can set up a completely new paradigm for how an organization uses its data, its scope of services does not stop there. 4Paradigm combines state-of-the-art machine learning technologies with practical experience, bringing together a team of experts ranging from scientists to architects. This team has successfully built China’s largest machine learning system and the world’s first commercial deep learning system. With its core team pioneering research on “Transfer Learning”, 4Paradigm takes the lead in this area and, as a result, has drawn great attention from worldwide tech giants.

About ChaLearn

ChaLearn is a non-profit organization with vast experience in the organization of academic challenges. ChaLearn is interested in all aspects of challenge organization, including data gathering procedures, evaluation protocols, novel challenge scenarios (e.g., competitions), training for challenge organizers, challenge analytics, result dissemination and, ultimately, advancing the state-of-the-art through challenges.