KDD Cup 2019
The 5th AutoML Challenge:
AutoML for Temporal Relational Data
(Provided by 4Paradigm, ChaLearn and Microsoft)

Overview

The competition has been launched on CodaLab. Please follow the link to participate:

https://competitions.codalab.org/competitions/21948

Temporal relational data is very common in industrial machine learning applications, such as online advertising, recommender systems, financial market analysis, medical treatment, fraud detection, etc. With timestamps to indicate the timings of events and multiple related tables to provide different perspectives, such data contains useful information that can be exploited to improve machine learning performance. However, currently, the exploitation of temporal relational data is often carried out by experienced human experts with in-depth domain knowledge in a labor-intensive trial-and-error manner.

In this challenge, participants are invited to develop AutoML solutions to binary classification problems for temporal relational data. The provided datasets are in the form of multiple related tables, with timestamped instances. Five public datasets (without labels in the testing part) are provided to the participants so that they can develop their AutoML solutions. Afterward, solutions will be evaluated with five unseen datasets without human intervention. The results of these five datasets determine the final ranking.

To participate, please visit our challenge platform and follow the instructions to learn about the problem setup, data, submission interface, and evaluation, and to get started. We also provide a starting-kit that includes the demo data, a baseline method, and everything needed to simulate the running environment on a participant's own PC.

This is the first AutoML competition that focuses on temporal relational data and it will pose new challenges to the participants, as listed below:

- How to automatically generate useful temporal information?

- How to efficiently merge the information provided by multiple related tables?

- How to automatically capture meaningful inter-table interactions?

- How to automatically avoid data leakage when the data is temporal?

Additionally, participants should also consider:

- How to automatically and efficiently select appropriate machine learning models and hyper-parameters?

- How to make the solution more generic, i.e., how to make it applicable for unseen tasks?

- How to keep the computational and memory cost acceptable?

This challenge has been posted on the KDD Cup official website (Automated Machine Learning Competition Track, Auto-ML Track).

Platform

Participants should log in to our platform to start the challenge. Please follow the instructions in "Learn the Details - Instructions" to get access to the data, learn the data format and submission interface, and download the starting-kit.

Data

This challenge focuses on the problem of binary classification for temporal relational data collected from real-world businesses. Reflecting the temporal nature of most real-world applications, the datasets are chronologically split into training and testing parts. Both the training and testing parts consist of a main table, a set of related tables, and a relation graph:

- The main table contains the instances (with labels in the training part), some features, and timestamps. Its instances are the targets of the binary classification.

- Related tables contain valuable auxiliary information about the instances in the main table and can be utilized to improve predictive performance. Entries in the related tables occasionally have timestamps.

- The relations among data in different tables are described by a Relation Graph. It should be noted that any two tables (main or related table) can have a relation, and any pair of tables can have at most one relation. It is guaranteed that the Relation Graphs are the same in training and testing parts.

The following figure illustrates the form of the datasets:

 

More details about the data can be found on the platform [Learn the Details - Data].
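As an illustration of why timestamps matter when combining a main table with a related table, the following is a minimal sketch using only the Python standard library. The table contents, the column names (`id`, `user`, `t`, `amount`), and the helper `past_only_features` are all hypothetical and are not the competition's actual data format or baseline. The sketch aggregates, for each main-table instance, only those related-table entries whose timestamp is not later than the instance's own timestamp, which is one simple way to keep future information out of the features.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical toy data: names and values are illustrative only,
# not the competition's actual schema.
main_table = [
    {"id": 1, "user": "u1", "t": 10},
    {"id": 2, "user": "u1", "t": 20},
    {"id": 3, "user": "u2", "t": 15},
]
related_table = [
    {"user": "u1", "t": 5, "amount": 1.0},
    {"user": "u1", "t": 12, "amount": 2.0},
    {"user": "u1", "t": 25, "amount": 4.0},
    {"user": "u2", "t": 30, "amount": 8.0},
]


def past_only_features(main_rows, related_rows, key="user", time="t", value="amount"):
    """For each main-table row, count and sum the related rows that share its
    key and whose timestamp is at or before the main row's timestamp."""
    # Group related rows by key in chronological order.
    by_key = defaultdict(list)
    for row in sorted(related_rows, key=lambda r: r[time]):
        by_key[row[key]].append(row)

    # Per key: sorted timestamps plus prefix sums of the value column.
    index = {}
    for k, rows in by_key.items():
        times = [r[time] for r in rows]
        prefix = [0.0]
        for r in rows:
            prefix.append(prefix[-1] + r[value])
        index[k] = (times, prefix)

    features = []
    for row in main_rows:
        times, prefix = index.get(row[key], ([], [0.0]))
        # Number of related events that happened at or before this instance.
        n = bisect_right(times, row[time])
        features.append({"id": row["id"], "past_count": n, "past_sum": prefix[n]})
    return features


features = past_only_features(main_table, related_table)
for f in features:
    print(f)
```

In this toy example the instance with `id` 3 gets a count of zero, because the only related entry for `u2` occurs after the instance's timestamp. A naive join on `user` alone would have leaked that future entry into the features.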

Rules

Submission & Evaluation

Participants should form teams with one or more members.

Teams are required to submit AutoML solutions that automatically build machine learning models using the training main table, related tables, and relation graph. Once trained, the models should take the testing main table (labels excluded), related tables, and relation graph as input and predict the testing labels. Solutions will be tested under restricted resources and time limits that are the same for every competitor.

A practical AutoML solution should be able to generalize to a wide range of unseen learning tasks. To enable participants to develop and evaluate such solutions, we prepared a total of 10 temporal relational datasets for the competition, five of which are termed ‘public datasets’ and the other five ‘private datasets’. The challenge comprises three phases:

- Feedback Phase: In this phase, participants are provided with the training data of the five public datasets to develop their AutoML solutions. Code is uploaded to the platform, and participants receive immediate feedback on the performance of their method on a holdout set. The number of submissions per day is limited. Participants can download the labeled training data and the unlabeled testing sets of the five public datasets so that they can prepare their code submissions offline. The LAST code submission of each team in this phase will be forwarded to the next phase for final testing.

- Check Phase: In this phase, participants will be able to check whether their submissions (migrated from the Feedback Phase) can be successfully evaluated on the five private datasets. Participants will be informed if their submission fails on any of the five private datasets because it exceeds the time or memory limits. Each team has only one chance to correct its submission. No testing performance will be revealed.

- AutoML Phase: This is the blind test phase, with no submissions. In this phase, solutions will be evaluated on the private datasets. Participants’ code will automatically train machine learning models without human intervention. The final score is determined by the results of this blind test.

The score of the solution on a dataset is calculated as:

score = (auc - auc_base) / (auc_max - auc_base),

where auc is the AUC of the solution on the dataset, auc_base is the AUC of the baseline method on the dataset, and auc_max is the AUC of the best solution on the dataset. If auc_max < auc_base, all submissions will receive a score of 0 on the dataset. The baseline method can be found in the starting-kit (see "Get Started" on the platform website for more details). The scores of the baseline method, submitted by team "chalearn", are also listed on the leaderboard.

The average score over the five public datasets is used to score a solution in the Feedback Phase, and the average score over the five private datasets is used in the AutoML Phase. The average score over the five private datasets is the final score of a team.
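The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the platform's actual scoring program. The function names `dataset_score` and `average_score` and the sample AUC values are hypothetical. The rules do not say what happens when auc_max equals auc_base, so returning 0 in that case is an assumption made here only to avoid a division by zero.

```python
def dataset_score(auc, auc_base, auc_max):
    """Normalized score of one solution on one dataset:
    (auc - auc_base) / (auc_max - auc_base)."""
    if auc_max < auc_base:
        # Stated in the rules: every submission scores 0 on this dataset.
        return 0.0
    if auc_max == auc_base:
        # ASSUMPTION (not covered by the rules): treat the degenerate case
        # as 0 to avoid dividing by zero.
        return 0.0
    return (auc - auc_base) / (auc_max - auc_base)


def average_score(results):
    """Average the per-dataset scores; `results` is a list of
    (auc, auc_base, auc_max) tuples, one per dataset."""
    scores = [dataset_score(auc, base, best) for auc, base, best in results]
    return sum(scores) / len(scores)


# Hypothetical AUC values for five datasets (illustration only).
example_results = [
    (0.75, 0.5, 1.0),     # halfway between baseline and best
    (1.0, 0.5, 1.0),      # the best solution on this dataset
    (0.5, 0.5, 1.0),      # equal to the baseline
    (0.25, 0.5, 1.0),     # worse than the baseline: negative score
    (0.625, 0.75, 0.625), # best solution is below the baseline
]
print(average_score(example_results))
```

Under this formula, the best solution on a dataset scores 1, a solution matching the baseline scores 0, and a solution below the baseline receives a negative score.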

Solutions with better interpretability will be given preference in case there is a tie in the final score. The interpretability of the solutions will be judged by a committee of experts (some experts from the organization team, and some invited experts).

Please note that the final score evaluates the LAST code submission, i.e., the last code submission in the Feedback Phase, or the resubmission in the Check Phase if there is one.

The following figure illustrates how submissions work in both Feedback and AutoML Phases:

 

Computational and memory resources are limited in all three phases to ensure that solutions are adequately efficient.

More details about submission and evaluation can be found on the platform [Learn the Details - Submission/Evaluation].

Terms & Conditions

Please find the terms and conditions on the platform website [Learn the Details - Terms and Conditions].

Prizes

1st Place: $15,000

2nd Place: $10,000

3rd Place: $5,000

4th - 10th Places: $500 each

Timeline (UTC Time)

Apr 1st, 2019: Beginning of the competition and release of the public datasets. Participants can start submitting code and obtaining immediate feedback on the leaderboard.

Jun 27th, 2019: End of the Feedback Phase. Code from the Feedback Phase is migrated automatically to the Check Phase.

Jul 7th, 2019: End of the Check Phase. Organizers start code verification.

Jul 11th, 2019: Deadline for submitting the fact sheets.

Jul 16th, 2019: End of the AutoML Phase and beginning of the post-competition process.

Jul 20th, 2019: Announcement of the KDD Cup Winner.

Aug 4th, 2019: Beginning of KDD 2019.

About

Please contact the organizers if you have any questions about this challenge.

Sponsors

  

Advisors

- Isabelle Guyon, Université Paris-Saclay, France, ChaLearn, USA, (Advisor, Platform Administrator) guyon@chalearn.org

- Qiang Yang, Hong Kong University of Science and Technology, Hong Kong, China, (Advisor, Sponsor) qyang@cse.ust.hk

Committee

- Wei-Wei Tu, 4Paradigm Inc., China, (Coordinator, Platform Administrator, Data Provider, Baseline Provider, Sponsor) tuweiwei@4paradigm.com

- Hugo Jair Escalante, INAOE, Mexico, ChaLearn, USA, (Advisor, Platform Administrator) hugo.jair@gmail.com

- Sergio Escalera, University of Barcelona, Spain, (Advisor) sergio@maia.ub.es

- Evelyne Viegas, Microsoft Research, (Sponsor) evelynev@microsoft.com

- Mengshuo Wang, 4Paradigm Inc., China, (Web Administrator) wangmengshuo@4paradigm.com

- Xiawei Guo, 4Paradigm Inc., China, (Platform Administrator) guoxiawei@4paradigm.com

- Ling Yue, 4Paradigm Inc., China, (Platform Administrator) yueling@4paradigm.com

- Jian Liu, 4Paradigm Inc., China, (Operations) liujian@4paradigm.com

- Hai Wang, 4Paradigm Inc., China, (Data Provider) wanghai@4paradigm.com

- Wenhao Li, 4Paradigm Inc., China, (Baseline Provider) liwenhao@4paradigm.com

- Yuanfei Luo, 4Paradigm Inc., China, (Data Provider, Baseline Provider) luoyuanfei@4paradigm.com

- Jingsong Wang, 4Paradigm Inc., China, (Logo Designer) wangjingsong@4paradigm.com

- Runxing Zhong, 4Paradigm Inc., China, (Baseline Provider) zhongrunxing@4paradigm.com

- Yadong Zhao, 4Paradigm Inc., China, (Operations, Web Administrator) zhaoyadong@4paradigm.com

- Feng Bin, 4Paradigm Inc., China, (Platform Developer) binfeng@4paradigm.com

- Xiaojie Yu, 4Paradigm Inc., China, (Platform Developer) yuxiaojie@4paradigm.com

- Yuanmeng Huang, 4Paradigm Inc., China, (Web Administrator) huangyuanmeng@4paradigm.com

- Shiwei Hu, 4Paradigm Inc., China, (Sponsor) hushiwei@4paradigm.com

- Yuqiang Chen, 4Paradigm Inc., China, (Sponsor) chenyuqiang@4paradigm.com

- Wenyuan Dai, 4Paradigm Inc., China, (Sponsor) daiwenyuan@4paradigm.com

Organization Institutes

 

About AutoML

Previous AutoML Challenges:

- First AutoML Challenge

- AutoML@PAKDD2018

- AutoML@NeurIPS2018

- AutoML@PAKDD2019

AutoML workshops can be found here.

Microsoft Research blog post on the AutoML Challenge can be found here.

KDnuggets post on the AutoML Challenge can be found here.

I. Guyon et al. A Brief Review of the ChaLearn AutoML Challenge: Any-time Any-dataset Learning Without Human Intervention. ICML W 2016. link

I. Guyon et al. Design of the 2015 ChaLearn AutoML challenge. IJCNN 2015. link

Springer Series on Challenges in Machine Learning. link

Q. Yao et al. Taking Human out of Learning Applications: A Survey on Automated Machine Learning. (a comprehensive survey on AutoML). link

About 4Paradigm Inc.

Founded in early 2015, 4Paradigm is one of the world’s leading AI technology and service providers for industrial applications. 4Paradigm’s flagship product, the AI Prophet, is an AI development platform that enables enterprises to effortlessly build their own AI applications and thereby significantly increase the efficiency of their operations. Using the AI Prophet, a company can develop a data-driven “AI Core System”, which can be regarded as a second core system alongside the traditional transaction-oriented Core Banking System (IBM Mainframe) often found in banks. Beyond this, 4Paradigm has also successfully developed more than 100 AI solutions for use in various settings such as finance, telecommunication and internet applications. These solutions include, but are not limited to, smart pricing, real-time anti-fraud systems, precision marketing, and personalized recommendation. While 4Paradigm can establish a new paradigm for how an organization uses its data, its scope of services does not stop there. 4Paradigm combines state-of-the-art machine learning technologies and practical experience with a team of experts ranging from scientists to architects. This team has successfully built China’s largest machine learning system and the world’s first commercial deep learning system. With its core team pioneering research on “Transfer Learning,” 4Paradigm takes the lead in this area and, as a result, has drawn considerable attention from tech giants worldwide.

About ChaLearn

ChaLearn is a non-profit organization with vast experience in the organization of academic challenges. ChaLearn is interested in all aspects of challenge organization, including data gathering procedures, evaluation protocols, novel challenge scenarios (e.g., competitions), training for challenge organizers, challenge analytics, result dissemination and, ultimately, advancing the state-of-the-art through challenges.

About KDD 2019 Conference

The annual KDD conference is the premier interdisciplinary conference bringing together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data.

KDD 2019 Conference:

- August 4 - 8, 2019

- Anchorage, Alaska USA

- Dena’ina Convention Center and William Egan Convention Center

About Other 2019 KDD Cup Competitions

KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners. SIGKDD-2019 will take place in Anchorage, Alaska, US from August 4 - 8, 2019. The KDD Cup competition is anticipated to last for 2-4 months, and the winners will be notified by mid-July 2019. The winners will be honored at the KDD conference opening ceremony and will present their solutions at the KDD Cup workshop during the conference. 

In KDD Cup 2019, there are three competition tracks:

- Automated Machine Learning Competition Track (Auto-ML Track) (This competition)

- Regular Machine Learning Competition Track (Regular ML Track)

- “Research for Humanity” Reinforcement Learning Competition Track (Humanity RL Track) 

About KDD Cup Chairs

- Taposh Dutta-Roy (KAISER PERMANENTE)

- Wenjun Zhou (UNIVERSITY OF TENNESSEE KNOXVILLE)

- Iryna Skrypnyk (PFIZER)