ASRU 2017 Challenge tasks

Following the success of the challenge special sessions held at ASRU 2015, we will again hold challenge tasks during the workshop program.

ASRU 2017 will include special sessions for each challenge task, where accepted papers will be presented.

The challenge tasks themselves are operated by their respective organizers, independently of ASRU. Descriptions of each are given in the next three sections. For more information about these challenge tasks, contact their respective organizers.

The Zero Resource Speech Challenge 2017: unsupervised discovery of linguistic units

Scientific committee: Xavier Anguera, Emmanuel Dupoux, Laurent Besacier, Okko Rasanen and Sharon Goldwater

Organizing committee: Xavier Anguera, Emmanuel Dupoux, Laurent Besacier, Ewan Dunbar, Neil Zeghidour, Thomas Schatz, Xuan-Nga Cao, Mathieu Bernard, Julien Karadayi and Juan Benjumea


The Zero Resource Speech Technologies series of challenges targets the unsupervised discovery of linguistic units from raw speech in an unknown language. Human infants accomplish this task within the first year of life through mere immersion in a language-speaking community, but it remains very difficult for machines, where the dominant paradigm is massive supervision with large human-annotated datasets.

The idea behind this challenge is to push the envelope on adaptability and flexibility in speech recognition systems by setting up the rather extreme situation where a whole language processing system is learned from scratch. We expect this to impact the Speech and Language Technology field by providing algorithms that can supplement supervised systems when human-annotated corpora are scarce (under-resourced languages or dialects), help language documentation and preservation (endangered languages, languages without orthographies), and help the basic science of infant research by providing predictive models of language acquisition.

The Zero Resource Speech Challenge 2017 at ASRU covers two levels of linguistic structure: subword units and word units. The first track uses a psychophysically inspired evaluation task (minimal-pair ABX discrimination); the second uses metrics inspired by those used in NLP word segmentation applications (segmentation and token F-scores). The focus is on constructing systems that are language-general (there will be three training languages, and the evaluation will be done on two held-out surprise languages) and able to perform unsupervised speaker adaptation.
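As a rough illustration of the two kinds of metric named above, the following Python sketch computes an ABX error rate and a token F-score on toy data. It is not the official evaluation code: the function names are hypothetical, representations are assumed to be fixed-size vectors compared with Euclidean distance, and a discovered token is assumed to count only when both of its boundaries match a gold token exactly. The challenge's own tooling may define distances and matching differently.

```python
from math import dist


def abx_error_rate(triplets, distance):
    """Fraction of (a, b, x) triplets, where x belongs to a's category,
    in which x is not closer to a than to b. Ties count as half an error."""
    errors = 0.0
    for a, b, x in triplets:
        d_ax, d_bx = distance(a, x), distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triplets)


def token_f_score(gold_tokens, found_tokens):
    """Precision, recall and F-score over word tokens given as
    (start, end) intervals; a hit needs both boundaries to match."""
    gold, found = set(gold_tokens), set(found_tokens)
    hits = len(gold & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Toy ABX data (made up): x is always from the same category as a.
triplets = [
    ((0.0, 0.0), (1.0, 1.0), (0.1, 0.0)),  # x closer to a: correct
    ((0.0, 0.0), (1.0, 1.0), (0.9, 1.0)),  # x closer to b: error
]
abx = abx_error_rate(triplets, dist)

# Toy segmentation (made up): one of two discovered tokens is correct.
gold = [(0, 3), (3, 5), (5, 9)]
found = [(0, 3), (3, 9)]
precision, recall, f = token_f_score(gold, found)
```

On this toy data the ABX error rate is 0.5, and the token precision, recall and F-score are 0.5, 1/3 and 0.4 respectively. A lower ABX error means the learned subword representation separates minimal pairs better; a higher token F-score means more of the discovered words coincide with gold words.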

Multi-genre Broadcast Media Transcription Challenge: MGB-3

Organizers: Peter Bell, Phil Woodland, Thomas Hain, Ahmed Ali, Andrew McParland


The MGB Challenge is a core evaluation of speech recognition, speaker diarization, and lightly supervised alignment using TV recordings from the BBC and Aljazeera.

The 2017 Challenge, MGB-3, will feature both Arabic and English tasks. The speech data is broad and multi-genre, spanning the whole range of TV output. The Arabic track will include dialectal Arabic data.

For both English and Arabic tasks, all acoustic and language model training data will be provided to participants. Acoustic model data will comprise around 1,000 hours of broadcast TV data with subtitles matched by a lightly-supervised alignment process. Systems will be able to make use of supplied metadata including programme title, genre tag, and date/time of transmission.

There will be three main evaluation conditions: (1) speech-to-text transcription of broadcast television; (2) alignment of broadcast audio to a subtitle file (which may be regarded as a lightly-supervised transcript); and (3) longitudinal speaker diarization and linking, requiring the identification of common speakers across multiple recordings.

Speech synthesis as a machine learning problem — Exploring new types of acoustic models

Organizers: Keiichi Tokuda, Simon King, and Alan Black

(see Blizzard Machine Learning Challenge 2017: 2017-ES1 task and 2017-ES2 task)

In the HMM era, a unified view of automatic speech recognition (ASR) and text-to-speech synthesis (TTS) allowed us to develop various new ASR and TTS techniques, e.g., cross-lingual speaker adaptation, adaptive training for TTS, and the use of prosody in ASR. We expect that taking the same unified view in the current DNN era will make it possible to develop new types of acoustic modeling techniques that are common and useful to both ASR and TTS.

In the Blizzard Challenge series, we have found that participants had to spend much time on ad-hoc tweaking and manual labor. We would like to set another challenge that removes all of these parts, in order to attract ASR and machine learning (ML) researchers to TTS challenges and allow them to concentrate purely on the acoustic modeling part.

To keep it as an ML problem, we would like to provide 'ready-made' data, i.e., 1) the organizers provide pairs of linguistic features and speech features; 2) participants train their systems using the data (participants may neither modify the data nor use any external data); 3) after a training period, the organizers provide open linguistic features and ask participants to predict the corresponding speech features; 4) the organizers synthesize speech waveforms from the predicted speech features and conduct a large-scale subjective evaluation test similar to that in the Blizzard Challenge (w.r.t. naturalness, intelligibility, and similarity).

If some participants wish to predict speech samples directly (as WaveNet does), they can use speech waveforms for training. However, each participant may use either speech features or speech waveforms when building their system, not both.
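The four-step protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature values are made up, the function names are hypothetical, and a nearest-neighbour lookup stands in for whatever acoustic model a participant would actually train. The real feature formats and submission procedure are defined by the organizers.

```python
from math import dist

# Step 1 (organizers): paired linguistic and speech feature vectors.
# All values here are invented for illustration.
train_pairs = [
    ((0.0, 1.0), (110.0, 0.2)),
    ((1.0, 0.0), (220.0, 0.8)),
    ((1.0, 1.0), (180.0, 0.5)),
]


def train(pairs):
    """Step 2 (participant): build a model from the supplied pairs only,
    without modifying them or adding external data. A nearest-neighbour
    lookup is used here as a placeholder for a real acoustic model."""
    memory = list(pairs)

    def predict(linguistic):
        # Return the speech features paired with the closest training input.
        _, speech = min(memory, key=lambda p: dist(p[0], linguistic))
        return speech

    return predict


# Step 3: the organizers release open linguistic features and the
# participant predicts the corresponding speech features.
model = train(train_pairs)
open_linguistic = [(0.1, 0.9), (0.9, 0.1)]
predicted_speech = [model(x) for x in open_linguistic]

# Step 4 (organizers, not shown): synthesize waveforms from
# predicted_speech and run the subjective listening test.
```

Because the participant only ever maps supplied inputs to supplied targets, the task reduces to a supervised regression problem, which is the point of providing 'ready-made' data.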

