AutoSpeech 2020


InterSpeech 2020 will be held in Shanghai, China from September 14 to 18, 2020. AutoSpeech 2020 is one of the competitions in the main conference, provided by 4Paradigm, ChaLearn, Southern University of Science and Technology, Northwestern Polytechnical University and Google. Public datasets and instructions for getting started can be found on the challenge website.

In the last decade, deep learning (DL) has achieved remarkable success in speech-related tasks, e.g., speaker verification, language identification and emotion classification. In practice, however, it is very difficult to get good performance without expertise in deep learning and speech processing. As the complexity of these tasks is often beyond non-experts, the rapid growth of a vast range of speech classification applications has created a demand for off-the-shelf speech classification methods that can be used easily and without expert knowledge. Automated Deep Learning (AutoDL) is proposed to explore automatic pipelines that train an effective DL model for a specific task requirement without any human intervention. Since its proposal, AutoDL has been explored in various applications, and a series of AutoDL competitions, e.g., Automated natural language processing (AutoNLP) and Automated computer vision (AutoCV), have been organized by 4Paradigm, Inc. and ChaLearn (sponsored by Google). These competitions have drawn a lot of attention from both academic researchers and industrial practitioners.

Call For Papers

The special session of AutoSpeech 2020 invites papers with the theme of Automated Machine Learning (AutoML). Paper contributions within the scope are welcome even if the authors do not participate in the Challenge. Recommended topics related to the theme of AutoML include (but are not limited to):
  • Meta Learning
  • Transfer Learning
  • Network Architecture Search
  • Few-shot Learning
  • Reinforcement Learning
  • Model Compression
  • Data Augmentation
  • Hyperparameter Optimization
  • Learning to Learn
  • Algorithm Configuration
  • Model Selection
  • Model Initialization

Speech topics include (but are not limited to):

  • Automatic Speech Recognition
  • Analysis of Paralinguistics in Speech and Language
  • Speaker Identification
  • Language Identification
  • Emotion Classification
  • Accent Recognition
  • Music Genre Classification

This special session follows the same submission policy as INTERSPEECH 2020.

Important Dates (UTC time)

(Regarding competition dates, please read the following sections for details.)

  • March 4th: Releasing practice data and baseline
  • March 11th 15:59: Beginning of Feedback Phase
  • May 8th: Interspeech 2020 paper submission deadline
  • May 8th 15:59: End of Feedback Phase
  • May 8th 16:00: Beginning of Check Phase
  • May 12th: Notification of the Check Phase Result
  • May 15th 15:59: Deadline of the re-submission, end of the Check Phase
  • May 15th 16:00: Beginning of the Final Phase
  • May 18th 16:00: End of the Final Phase, Notification
  • TBA: Final paper submission to Interspeech 2020
  • TBA: Camera-ready paper

AutoSpeech 2020 Challenge

In this challenge, we further propose the Automated Speech (AutoSpeech) competition, which aims at automated solutions for speech-related tasks. This challenge is restricted to multi-label classification problems, which come from different speech classification domains. The provided solutions are expected to discover various kinds of paralinguistic speech attribute information, such as speaker, language, emotion, etc., when only raw data (speech features) and meta information are provided. There are two kinds of datasets, which correspond to the public and private leaderboards respectively. Five public datasets (without labels in the testing part) are provided to the participants for developing AutoSpeech solutions. Afterward, solutions will be evaluated on private datasets without human intervention. The results on these private datasets determine the final ranking.
This is the second AutoSpeech competition (AutoSpeech 2020). AutoSpeech 2019 was held at ACML 2019, and many solutions significantly improved the performance of automated speech classification. However, the problem setting remains challenging, especially when 1) the datasets become larger and 2) there are more classes of labels. Other challenges for the participants include:
  • How to automatically discover various kinds of paralinguistic information in spoken conversation?
  • How to automatically extract useful features for different tasks from speech data?
  • How to automatically handle both long and short duration speech data?
  • How to automatically design effective neural network structures?
  • How to build and automatically adapt pre-trained models?
Additionally, participants should also consider:
  • How to automatically and efficiently select an appropriate machine learning model and hyper-parameters?
  • How to make the solution more generic, i.e., how to make it applicable to unseen tasks?
  • How to keep the computational and memory cost acceptable?


This page describes the datasets used in the AutoSpeech challenge. 15 speech categorization datasets are prepared for this competition. Five public datasets, which can be downloaded, are provided to the participants so that they can develop their AutoSpeech solutions offline. Besides that, another five feedback datasets are provided to participants to evaluate the public leaderboard scores of their AutoSpeech solutions. Afterward, their solutions will be evaluated with five final datasets without human intervention. Each provided dataset comes from one speech classification domain, such as Speaker Identification, Emotion Classification, etc. In the datasets, the number of classes is greater than 2 and less than 500, while the number of instances varies from several to hundreds. All the audios are first converted to single-channel, 16-bit streams at a 16kHz sampling rate for consistency, then they are loaded by librosa and dumped to pickle format (a list of vectors that contains all train or test audios in one dataset). Note that the datasets contain both long and short audios, without padding.


All the datasets consist of a content file, a label file and a meta file, where the content file and the label file are split into train parts and test parts:
  • Content file ({train,test}.pkl) contains the samples of the audios, in the format of a list of vectors.


    [
      [-1.2207031e-04, 3.0517578e-05, -1.5258789e-04, ..., -8.8500977e-04, -8.5449219e-04, -1.3732910e-03],
      [ 9.1552734e-05, 7.0190430e-04, 1.0375977e-03, ..., -7.6293945e-04, 2.7465820e-04, 1.0375977e-03],
      [ 1.8920898e-03, 1.6784668e-03, 1.4648438e-03, ..., 3.0517578e-05, -2.7465820e-04, -3.0517578e-04],
      [0.02307129, 0.02386475, 0.02462769, ..., 0.02420044, 0.02410889, 0.02429199],
      [ 6.1035156e-05, 1.2207031e-04, 4.5776367e-04, ..., -1.2207031e-04, -6.1035156e-04, -3.6621094e-04],
      [0.03787231, 0.03686523, 0.03723145, ..., 0.03497314, 0.03594971, 0.0350647 ],
      ...,
    ]

  • Label file ({train, dataset_name}.solution) consists of the labels of the instances in one-hot format. Each line holds the labels of the instance at the same position in the content file.

    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

  • Meta file (meta.json) is a JSON file containing the meta information about the dataset. Descriptions of the keys in the meta file:

    class_num : number of classes in the dataset
    train_num : the number of training instances
    test_num : the number of test instances
    time_budget : the time budget of the dataset, 1800s for all the datasets

  • Example:

    { "class_num": 10, "train_num": 428, "test_num": 107, "time_budget": 1800 }
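As a rough illustration of how the three files fit together, the sketch below loads one dataset and cross-checks it against the meta file. The concrete file names (`train.pkl`, `train.solution`, `meta.json`) and the helper name `load_dataset` are assumptions made for this example and may differ from the actual starting kit. The real content files hold vectors produced by librosa, so NumPy would need to be installed to unpickle them.

```python
import json
import pickle
from pathlib import Path


def load_dataset(data_dir):
    """Load the training part of one dataset (hypothetical helper).

    Assumed file names: train.pkl, train.solution, meta.json.
    """
    data_dir = Path(data_dir)

    # Content file: a pickled list of sample vectors, one per audio clip.
    # Only unpickle files from a source you trust.
    with open(data_dir / "train.pkl", "rb") as f:
        train_audio = pickle.load(f)

    # Label file: one line per instance, space-separated 0/1 values.
    with open(data_dir / "train.solution") as f:
        train_labels = [[int(v) for v in line.split()] for line in f if line.strip()]

    # Meta file: class_num, train_num, test_num, time_budget.
    with open(data_dir / "meta.json") as f:
        meta = json.load(f)

    # Sanity checks tying the three files together.
    if len(train_audio) != len(train_labels):
        raise ValueError("content and label files disagree on the instance count")
    if len(train_audio) != meta["train_num"]:
        raise ValueError("instance count does not match train_num")
    if any(len(row) != meta["class_num"] for row in train_labels):
        raise ValueError("label width does not match class_num")

    return train_audio, train_labels, meta
```

Because clips are stored without padding, the vectors in `train_audio` generally have different lengths, so any batching step has to pad or crop them itself.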

Datasets Credits

- V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, 2015, pp. 5206-5210.

- Weinberger, Steven. (2015). Speech Accent Archive. George Mason University. Retrieved from

- Berlin emotional speech database

- CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages

- D. Ellis (2007). Classifying Music Audio with Timbral and Chroma Features, Proc. Int. Conf. on Music Information Retrieval ISMIR-07, Vienna, Austria, Sep. 2007.


This challenge has three phases. The participants are provided with five practice datasets which can be downloaded, so that they can develop their AutoSpeech solutions offline. Then, the code is uploaded to the platform and participants receive immediate feedback on the performance of their method on another five validation datasets. After the Feedback Phase terminates, there is a Check Phase, in which participants are allowed to submit their code only once on private datasets in order to debug. Participants won't be able to read detailed logs, but they can see whether their code reports errors. Last, in the Final Phase, participants' solutions will be evaluated on five test datasets. The ranking in the Final Phase will count towards determining the winners.

Code submitted is trained and tested automatically, without any human intervention. Code submitted in the Feedback (resp. Final) Phase is run on all five feedback (resp. final) datasets in parallel on separate compute workers, each one with its own time budget.

The identities of the datasets used for testing on the platform are concealed. The data are provided in a raw form (no feature extraction) to encourage researchers to use Deep Learning methods performing automatic feature learning, although this is NOT a requirement. All problems are multi-label classification problems. The tasks are constrained by the time budget.

It is the responsibility of the participants to make sure that neither the "train" nor the "test" method exceeds the "remaining_time_budget". The "train" method can manage its time budget so that it trains in varying time increments. Note that the model is initialized only once during the submission process, so participants can control the model's behavior at each train step through its member variables. There is pressure not to use all of the "overall_time_budget" in the first iteration, because the metric is the area under the learning curve.
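The incremental training pattern described above might look like the following minimal sketch. Only the names `train`, `test` and `remaining_time_budget` come from the text. The constructor argument, the exact method signatures, the doubling schedule and the 10% safety margin are assumptions for illustration, and the training step is a placeholder rather than a real learner.

```python
import time


class Model:
    """Toy model that trains in growing increments under a time budget."""

    def __init__(self, metadata):
        # The model is built once, so these members persist across train() calls.
        self.metadata = metadata
        self.train_calls = 0
        self.steps_done = 0

    def train(self, train_data, remaining_time_budget=None):
        start = time.time()
        self.train_calls += 1
        # Short first increment, doubling afterwards, so that an early
        # prediction is available quickly (this helps the area under the curve).
        planned_steps = 2 ** self.train_calls
        # Assumed safety margin: spend at most 10% of what is left per call.
        limit = None if remaining_time_budget is None else 0.1 * remaining_time_budget
        for _ in range(planned_steps):
            if limit is not None and time.time() - start >= limit:
                break
            self._one_training_step(train_data)
            self.steps_done += 1

    def test(self, test_data, remaining_time_budget=None):
        # Placeholder prediction: assign class 0 to every instance.
        class_num = self.metadata["class_num"]
        return [[1] + [0] * (class_num - 1) for _ in test_data]

    def _one_training_step(self, train_data):
        pass  # placeholder for a real optimisation step
```

The point of the sketch is the bookkeeping: because the counters live on the instance, each call to `train` can decide how much work to do based on what earlier calls already did and on the budget that remains.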


For each dataset, we compute the Area under Learning Curve (ALC). The learning curve is drawn as follows:

  • at each timestamp t, we compute s(t), the balanced accuracy of the most recent prediction. In this way, s(t) is a step function w.r.t. time t;
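  • in order to normalize time to the [0, 1] interval, we perform a time transformation by

    $$\tilde{t}(t) = \frac{\log(1 + t/t_0)}{\log(1 + T/t_0)}$$

    where T is the time budget and t0 is a reference time amount (of default value 60 seconds).
  • then compute the area under learning curve using the formula

    $$ALC = \int_0^1 s(t)\, d\tilde{t}(t) = \int_0^T s(t)\, \tilde{t}'(t)\, dt = \frac{1}{\log(1 + T/t_0)} \int_0^T \frac{s(t)}{t + t_0}\, dt$$

    we see that s(t) is weighted by 1/(t + t0), giving a stronger importance to predictions made at the beginning of the learning curve.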
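To make the metric concrete, here is a small sketch that computes the ALC of a step-function learning curve. It assumes the logarithmic time transformation log(1 + t/t0) / log(1 + T/t0), which is the form consistent with the 1/(t + t0) weighting stated above. It also assumes the score is 0 before the first prediction and that predictions arriving after the budget are ignored. The function names are hypothetical.

```python
import math


def time_transform(t, T, t0=60.0):
    """Map elapsed time t in [0, T] to [0, 1] on a log scale (assumed form)."""
    return math.log(1 + t / t0) / math.log(1 + T / t0)


def area_under_learning_curve(timestamps, scores, T, t0=60.0):
    """ALC of a step function.

    timestamps: increasing times (seconds) at which predictions were made.
    scores: the score s(t) of each of those predictions.
    """
    alc = 0.0
    for i, (t, s) in enumerate(zip(timestamps, scores)):
        if t >= T:
            break  # predictions made after the budget do not count
        # The score holds until the next prediction or the end of the budget.
        t_next = timestamps[i + 1] if i + 1 < len(timestamps) else T
        t_next = min(t_next, T)
        alc += s * (time_transform(t_next, T, t0) - time_transform(t, T, t0))
    return alc


# The same score earns more area when it is reached earlier.
early = area_under_learning_curve([10], [0.8], T=1800)
late = area_under_learning_curve([900], [0.8], T=1800)
```

Since the curve is a step function, the integral reduces to a sum of score times transformed-interval width, which is why no numerical integration is needed.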

After we compute the ALC for all datasets, the overall ranking is used as the final score for evaluation and will be used in the leaderboard. It is computed by averaging the ranks (among all participants) of ALC obtained on the datasets.
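The rank averaging can be sketched as follows. The input layout, the function name and the tie handling (tied participants share the mean of the ranks they occupy) are assumptions for this example, since the text does not say how ties are resolved.

```python
def average_ranks(alc_by_participant):
    """Average each participant's per-dataset rank (1 = best ALC).

    alc_by_participant: {name: [ALC on dataset 1, ALC on dataset 2, ...]}
    """
    names = list(alc_by_participant)
    n_datasets = len(next(iter(alc_by_participant.values())))
    totals = {name: 0.0 for name in names}

    for d in range(n_datasets):
        # Higher ALC is better, so sort in descending order.
        ordered = sorted(names, key=lambda n: alc_by_participant[n][d], reverse=True)
        i = 0
        while i < len(ordered):
            # Find the run of participants tied with position i.
            j = i
            while (j + 1 < len(ordered)
                   and alc_by_participant[ordered[j + 1]][d]
                   == alc_by_participant[ordered[i]][d]):
                j += 1
            # Assumed tie rule: share the mean of the occupied ranks.
            shared_rank = ((i + 1) + (j + 1)) / 2
            for k in range(i, j + 1):
                totals[ordered[k]] += shared_rank
            i = j + 1

    return {name: totals[name] / n_datasets for name in names}


# A lower average rank is better.
result = average_ranks({"a": [0.9, 0.5], "b": [0.8, 0.7], "c": [0.1, 0.7]})
```

Ranking before averaging means that a very large ALC on one dataset cannot compensate for poor results on the others, which rewards solutions that generalize across tasks.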

Examples of learning curves:


1st Prize: 2000 USD

2nd Prize: 1500 USD

3rd Prize: 500 USD


Please contact the organizers if you have any problem concerning this challenge.


- Wei-Wei Tu, 4Paradigm Inc., China and ChaLearn, USA

- Tom Ko, Southern University of Science and Technology, China

- Lei Xie, Northwestern Polytechnical University, Xi'an, China

- Hugo Jair Escalante, INAOE, Mexico and ChaLearn, USA

- Isabelle Guyon, Université Paris-Saclay, France, ChaLearn, USA

- Qiang Yang, Hong Kong University of Science and Technology, Hong Kong, China

Committee (alphabetical order)

- Jingsong Wang, 4Paradigm Inc., China

- Shouxiang Liu, 4Paradigm Inc., China

- Xiawei Guo, 4Paradigm Inc., China

- Zhen Xu, 4Paradigm Inc., China

Previous AutoML Challenges:

- First AutoML Challenge

- AutoCV2@ECML PKDD2019
About 4Paradigm Inc.

Founded in early 2015, 4Paradigm is one of the world’s leading AI technology and service providers for industrial applications. 4Paradigm’s flagship product, the AI Prophet, is an AI development platform that enables enterprises to effortlessly build their own AI applications, and thereby significantly increase the efficiency of their operations. Using the AI Prophet, a company can develop a data-driven “AI Core System”, which could be largely regarded as a second core system next to the traditional transaction-oriented Core Banking System (IBM Mainframe) often found in banks. Beyond this, 4Paradigm has also successfully developed more than 100 AI solutions for use in various settings such as finance, telecommunication and internet applications. These solutions include, but are not limited to, smart pricing, real-time anti-fraud systems, precision marketing, personalized recommendation and more. And while it is clear that 4Paradigm can set up an entirely new paradigm for how an organization uses its data, its scope of services does not stop there. 4Paradigm uses state-of-the-art machine learning technologies and practical experience to bring together a team of experts ranging from scientists to architects. This team has successfully built China’s largest machine learning system and the world’s first commercial deep learning system. However, 4Paradigm’s success does not stop there. With its core team pioneering the research of “Transfer Learning,” 4Paradigm takes the lead in this area, and as a result, has drawn great attention from tech giants worldwide.

About ChaLearn

ChaLearn is a non-profit organization with vast experience in the organization of academic challenges. ChaLearn is interested in all aspects of challenge organization, including data gathering procedures, evaluation protocols, novel challenge scenarios (e.g., competitions), training for challenge organizers, challenge analytics, result dissemination and, ultimately, advancing the state-of-the-art through challenges.