Auto-KWS 2021


Voice wake-up has recently become more common in everyday life through smart speakers and in-vehicle devices. Personalized voice wake-up, which includes customized wake-up word detection and speaker voiceprint verification, is also gaining attention. Many topics in the personalized voice wake-up scenario are still worth exploring, such as how speech models can better adapt to different wake-up words, and how models can jointly optimize the two tasks of wake-up word detection and speaker voiceprint recognition. In addition, given that automated machine learning (AutoML), meta-learning and other artificial intelligence methods have already achieved successful results in speech recognition tasks, whether those methods can improve the personalized wake-up scenario is another problem worth exploring.

In order to promote technological development and narrow the gap between academic research and practical applications, 4Paradigm, together with National Taiwan University, Northwestern Polytechnical University, Southern University of Science and Technology, and ChaLearn, will organize the Auto-KWS Challenge and Special Session at the INTERSPEECH 2021 Conference. In this challenge we will release a multilingual dataset (dialects included) for Personalized Keyword Spotting (Auto-KWS). The dataset closely resembles real-world scenarios, as each recorder is assigned a unique wake-up word and can freely choose their recording environment and a familiar dialect. In addition, the competition will test and evaluate the participants' algorithms through the competition platform: participants will submit code and pre-trained models, which are evaluated under unified platform resources. After the competition, the dataset will be released on the platform as an open benchmark available for research, to further encourage the exchange of ideas and discussion in this area.


The challenge website is:

Please sign up on the platform and register for the competition through the following entrance:

Once the registration is approved, you will receive a link to the training datasets at your registered email address.

Note: Participants are only allowed to submit code via one account. We will manually check participants' identities and disqualify duplicate and inauthentic accounts.

Important Dates

Feb 5th: Release of training data and practice data
Feb 10th: Release of baseline
Feb 26th: Feedback phase starts
Mar 26th: Feedback phase ends, private phase starts
Mar 26th: Paper submission deadline
Mar 27th: Check phase starts
Apr 1st: Check phase ends, final phase starts
Apr 5th: Final phase ends
Jun 2nd: Paper acceptance/rejection notification
Aug 31st: Opening of INTERSPEECH 2021

Auto-KWS 2021 Challenge

In the last decade, machine learning (ML) and deep learning (DL) have achieved remarkable success in speech-related tasks, e.g., speaker verification (SV), automatic speech recognition (ASR) and keyword spotting (KWS). In practice, however, it is very difficult to achieve good performance without expertise in machine learning and speech processing. Automated Machine Learning (AutoML) has been proposed to explore automatic pipelines that train effective models for a specific task requirement without any human intervention. Moreover, some methods belonging to AutoML, such as Automated Deep Learning (AutoDL) and meta-learning, have been used in KWS and SV tasks respectively. A series of AutoML competitions, e.g., automated natural language processing (AutoNLP) and automated computer vision (AutoCV), have been organized by 4Paradigm, Inc. and ChaLearn (sponsored by Google). These competitions have drawn a lot of attention from both academic researchers and industrial practitioners.

Keyword spotting, usually the entry point of smart devices such as mobile phones, smart speakers, or other intelligent terminals, has received a lot of attention in both academia and industry. Meanwhile, for reasons of fun and security, the personalized wake-up mode has more application scenarios and requirements. Conventionally, the solution pipeline combines a KWS system with a text-dependent speaker verification (TDSV) system, in which case the two systems are optimized separately. Moreover, there is usually little data from the target speaker, so both KWS and speaker verification (SV) can be considered low-resource tasks in this setting.

In this challenge, we propose automated machine learning for Personalized Keyword Spotting (Auto-KWS), which aims at automated solutions for personalized keyword spotting tasks. There are several specific questions that participants can explore further, including but not limited to:

• How to automatically handle multilingual, multi-accent or varied keywords?
• How to make better use of additional tagged corpus automatically?
• How to integrate keyword spotting task and speaker verification task?
• How to jointly optimize personalized keyword spotting with speaker verification?
• How to design multi-task learning for personalized keyword spotting with speaker verification?
• How to automatically design effective neural network structures?
• How to reasonably use meta-learning, few-shot learning, or other AutoML technologies in this task?

Additionally, participants should also consider:

• How to automatically and efficiently select an appropriate machine learning model and hyper-parameters?
• How to make the solution more generic, i.e., how to make it applicable to unseen tasks?
• How to keep the computational and memory cost acceptable?

We have already organized two successful automated speech classification challenges, AutoSpeech1 at ACML 2019 and AutoSpeech 2020 at INTERSPEECH 2020, which were the first two challenges to combine AutoML and speech tasks. This time, our Auto-KWS challenge will focus on personalized keyword spotting tasks for the first time, and the released database will also serve as a benchmark for research in this field and encourage the exchange of ideas and discussion in this area.


All data are recorded with near-field mobile phones located in front of the speakers at a distance of around 0.2 m. Each sample is recorded as a single-channel, 16-bit stream at a 16 kHz sampling rate.

There are 4 datasets: training dataset, practice dataset, feedback dataset, and private dataset. The training dataset, recorded from around 100 recorders, is used by participants to develop Auto-KWS solutions. The practice dataset contains data from 5 speakers, each with 5 enrollment audio samples and several test audio samples. Together with the downloadable docker, the practice dataset provides an example of how the platform calls the participants' code. Both the training and practice datasets can be downloaded for local debugging.

The feedback dataset and private dataset have the same format as the practice dataset; they are used for evaluation on the platform and are therefore hidden from participants. Participants' final solutions will be evaluated by the platform on those two datasets during the feedback phase and private phase respectively, without any human intervention.
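The recording format stated above (single channel, 16-bit, 16 kHz) can be checked locally before training. The following is a minimal sketch that assumes the downloaded audio is stored as WAV files, which the description above does not state; the function name check_wav_format is hypothetical and not part of the challenge toolkit.

```python
import wave

# Format taken from the dataset description: mono, 16-bit samples, 16 kHz.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits
EXPECTED_SAMPLE_RATE_HZ = 16000


def check_wav_format(path):
    """Return (ok, duration_seconds) for the WAV file at `path`.

    Assumption: the audio is stored in a WAV container.
    """
    with wave.open(path, "rb") as wav:
        ok = (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
        )
        # Duration in seconds = number of frames / frames per second.
        duration = wav.getnframes() / wav.getframerate()
    return ok, duration
```

The returned duration is also useful later, since the test-time budget depends on the total duration of the test audio.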

Here is a summary of the 4 datasets. (The specific numbers may be slightly adjusted when the challenge is officially launched.)

Dataset             Speaker num   Phase                   Keywords/enrollment num
Training dataset    100           Before feedback phase   10
Practice dataset    5             Before feedback phase   5
Feedback dataset    20            Feedback phase          5
Private dataset     40            Final phase             5


This challenge has three phases: Feedback Phase, Check Phase and Final Phase.

Before the feedback phase, participants are provided with the training dataset (recorded from around 100 persons) and the practice dataset (recorded from 5 persons). Participants can download those data and use them to develop their solutions offline. (Note: Using other open source data is also allowed; just remember to indicate the data source in your submission.)

During the feedback phase, participants can upload their solutions to the platform to receive immediate performance feedback for validation. The data used in this phase are from another 20 recorders and are hidden from the participants.

Then, in the check phase, participants can submit their code only once to make sure it works properly on the platform. The platform will indicate success or failure to the users, but detailed logs will be hidden.

Lastly, in the final phase, participants' solutions will be evaluated on a private dataset (recorded from another 40 persons). Once the participants submit their code, the platform will automatically run their algorithm on separate compute workers, testing it against each recorder's data with its own time budget. The final ranking will be generated based on the scores calculated in this phase.

The datasets in all phases share the same structure, and the platform applies the same evaluation logic in all phases. The evaluation task is constrained by a time budget. In each task, after initialization, the platform calls the 'enrollment' function, which may run for up to 5 minutes. The platform then calls the 'test' function to predict the label of each test audio. For each speaker, the test process is constrained by a time budget, which is calculated from the real time factor (RTF) and the total duration of the test audio. When the time budget of 'enrollment' or 'test' runs out, the platform automatically terminates the process, and samples still waiting to be predicted are counted as errors.
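To make the time-budget rule concrete, here is a rough sketch of how a per-speaker test budget could be simulated locally. It assumes the budget is the RTF multiplied by the total test audio duration, which is the usual meaning of a real time factor but is not spelled out above. The names time_budget_seconds and run_test_phase, the predict callable, and the RTF value are all hypothetical; the platform's actual interface is defined by the downloadable docker, not by this sketch.

```python
import time


def time_budget_seconds(total_audio_seconds, rtf):
    """Assumed rule: budget = real time factor x total test audio duration."""
    return rtf * total_audio_seconds


def run_test_phase(predict, test_items, rtf, clock=time.monotonic):
    """Call `predict` on each (audio, duration_seconds) item within the budget.

    Returns one entry per item. Items not reached before the budget is used
    up get None, which a scorer would count as errors. Unlike the real
    platform, this sketch only checks the clock between calls and never
    interrupts a running prediction.
    """
    budget = time_budget_seconds(sum(d for _, d in test_items), rtf)
    start = clock()
    results = []
    for audio, _ in test_items:
        if clock() - start >= budget:
            results.append(None)  # out of time: counted as an error
            continue
        results.append(predict(audio))
    return results
```

For example, with three test clips of 2 seconds each and a hypothetical RTF of 0.5, the budget is 3 seconds, so a model that needs 2 seconds per clip would finish only the first two clips. Timing a solution this way on the practice dataset gives an early warning before submitting.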


1st Prize: 2000 USD
2nd Prize: 1500 USD
3rd Prize: 500 USD


Hung-Yi Lee, College of Electrical Engineering and Computer Science, National Taiwan University

Lei Xie, Audio, Speech and Language Processing Lab (NPU-ASLP), Northwestern Polytechnical University

Tom Ko, Southern University of Science and Technology

Wei-Wei Tu, 4Paradigm Inc.

Isabelle Guyon, Université Paris-Saclay, ChaLearn

Qiang Yang, Hong Kong University of Science and Technology

Committee (alphabetical order)

Chunyu Zhao, 4Paradigm Inc.

Jie Chen, 4Paradigm Inc.

Jingsong Wang, 4Paradigm Inc.

Qijie Shao, NPU-ASLP

Shouxiang Liu, 4Paradigm Inc.

Xiawei Guo, 4Paradigm Inc.

Xiong Wang, NPU-ASLP

Yuxuan He, 4Paradigm Inc.

Zhen Xu, 4Paradigm Inc.

Organization Institutes


Please contact the organizers if you have any questions about this challenge.


Previous AutoML Challenges:

- First AutoML Challenge
- AutoCV2@ECML PKDD2019

About 4Paradigm Inc.

Founded in early 2015, 4Paradigm is one of the world’s leading AI technology and service providers for industrial applications. 4Paradigm’s flagship product, the AI Prophet, is an AI development platform that enables enterprises to effortlessly build their own AI applications and thereby significantly increase their operational efficiency. Using the AI Prophet, a company can develop a data-driven “AI Core System”, which can largely be regarded as a second core system next to the traditional transaction-oriented Core Banking System (IBM Mainframe) often found in banks. Beyond this, 4Paradigm has also successfully developed more than 100 AI solutions for use in various settings such as finance, telecommunication and internet applications. These solutions include, but are not limited to, smart pricing, real-time anti-fraud systems, precision marketing, personalized recommendation and more. And while it is clear that 4Paradigm can set up a completely new paradigm for how an organization uses its data, its scope of services does not stop there. 4Paradigm uses state-of-the-art machine learning technologies and practical experience to bring together a team of experts ranging from scientists to architects. This team has successfully built China’s largest machine learning system and the world’s first commercial deep learning system. However, 4Paradigm’s success does not stop there. With its core team pioneering the research of “Transfer Learning”, 4Paradigm takes the lead in this area and, as a result, has drawn great attention from tech giants worldwide.

About ChaLearn

ChaLearn is a non-profit organization with vast experience in the organization of academic challenges. ChaLearn is interested in all aspects of challenge organization, including data gathering procedures, evaluation protocols, novel challenge scenarios (e.g., competitions), training for challenge organizers, challenge analytics, result dissemination and, ultimately, advancing the state-of-the-art through challenges.