Welcome to SJTU SpeechLab!

What is SpeechLab

The Speech Research group is part of the SJTU Computer Science and Technology Department. Its mission is to advance research on intelligent speech and language processing for human-machine interaction and to develop effective algorithms for real-world applications. The work of the group spans a broad range of related topics, from synthesis, recognition and understanding to end-to-end spoken dialogue systems. The work focuses on using data-driven statistical approaches to achieve a natural human-machine speech interface. Hence, flexible and expressive speech synthesis, robust and adaptable speech recognition, error-tolerant understanding and dialogue management, and innovative combinations of rich speech technologies are the primary interests in terms of fundamental research. Compared to other machine learning research areas, speech technology is unique in that it relies heavily on appropriate computer engineering methods to implement research algorithms and make real-world systems work. Therefore, another important goal of our group is to advance the computer engineering algorithms and methods for implementing speech technologies. We believe that an outstanding scientist is not necessarily an excellent engineer, while an outstanding engineer must be an excellent scientist!

SpeechLab Group

Automatic Speech Recognition

2 Ph.D | 6 Masters | 6 Bachelors

Automatic speech recognition (ASR) converts human speech waveforms to text. Statistical ASR approaches are the focus. HMM-based acoustic modelling, statistical language modelling and decoding algorithms are the main areas. Research topics include, but are not limited to, adaptation, low-resource ASR, robust and multi-lingual ASR, deep learning, discriminative training and software engineering for ASR.

Statistical Speech Synthesis

1 Ph.D | 2 Masters | 2 Bachelors

Speech Synthesis is the technique of producing natural human speech. It mainly consists of Text-to-Speech (TTS) and Voice Conversion (VC). A TTS system produces human speech from natural-language text. We follow the latest end-to-end techniques (e.g. Tacotron, WaveNet) to improve the quality and expressiveness of the generated waveform. A VC system converts a speech waveform from a source style to a target style (e.g. speaker, emotion). Our research interest is to improve the naturalness and similarity of the converted speech.

Spoken Dialogue System

2 Ph.D | 0 Masters | 7 Bachelors

Spoken Dialogue System (SDS) research mainly focuses on the application of statistical approaches to speech understanding and dialogue management. SDS architecture, joint optimisation and system engineering are also studied. The aim is to build intelligent end-to-end systems, especially task-oriented systems, which can explicitly deal with the uncertainty arising in human-machine interaction and correctly understand the intention of the users.

Spoken Language Understanding

1 Ph.D | 1 Masters | 2 Bachelors

Spoken Language Understanding (SLU) serves as an interface between ASR and SDS, converting a sentence into a structured representation of user meaning. Unlike general-domain NLU, SLU focuses only on specific application domains (in the current state of technology). Typically, SLU comprises three tasks: domain classification, intent detection and slot filling. Our main research interests include deep learning for SLU, SLU domain adaptation and transfer, ASR-error-robust SLU, deeper understanding and end-to-end SLU.
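
As a rough illustration of the "structured representation" mentioned above, the following minimal Python sketch maps a sentence to a semantic frame with a domain, an intent and slots. It is a toy keyword-and-pattern approach, not the group's method; the names SemanticFrame, CITIES and parse, as well as the domains, intents and slot labels, are hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SemanticFrame:
    """Structured representation of one user utterance (hypothetical schema)."""
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)


# Hypothetical toy lexicon; a real system would learn this from data.
CITIES = ("shanghai", "beijing", "boston")


def parse(utterance: str) -> SemanticFrame:
    text = utterance.lower()

    # Domain classification and intent detection, here by simple keywords.
    if "flight" in text:
        domain, intent = "travel", "book_flight"
    elif "weather" in text:
        domain, intent = "weather", "get_forecast"
    else:
        domain, intent = "unknown", "unknown"

    # Slot filling: pick up city names that follow a preposition.
    slots = {}
    for preposition, slot_name in (("from", "from_city"), ("to", "to_city"), ("in", "city")):
        match = re.search(rf"\b{preposition} (\w+)", text)
        if match and match.group(1) in CITIES:
            slots[slot_name] = match.group(1)

    return SemanticFrame(domain, intent, slots)


frame = parse("Book a flight from Shanghai to Beijing")
print(frame)
```

In practice, each of the three tasks would typically be handled by a trained statistical or neural model rather than hand-written rules; the sketch only shows the shape of the input and output.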

Rich Audio Analysis

2 Ph.D | 2 Masters | 3 Bachelors

Rich Audio Analysis (RAA) focuses on the analysis and classification of non-text information within human speech. This information may concern the speaker, emotion, noise, speaking style and so on. In addition, pronunciation evaluation and oral communication skill evaluation are related research topics. The aim is to use intelligent speech technology to assist language learning and examination.

Language Model

1 Ph.D | 3 Masters | 1 Bachelors

Language Model (LM) research studies the statistical probability distribution of human languages. LMs are widely used in natural language processing, speech recognition, machine translation, handwriting recognition and other applications. Our aim is to propose general LMs for both evaluation and generation. We are now focusing on the combination of traditional statistical LMs with deep learning or reinforcement learning. We are also interested in structured LSTM LMs and large-vocabulary LM applications.
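
To make the two uses of an LM concrete, here is a minimal Python sketch of a traditional statistical model: a bigram LM with add-one smoothing that can both score a sentence (evaluation, via perplexity) and sample new text (generation). The toy corpus and the function names train_bigram, prob, perplexity and generate are assumptions made for this example, not part of any SpeechLab toolkit.

```python
import math
import random
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their contexts over whitespace-tokenised sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # Words the model can predict: every training word plus end-of-sentence.
    vocab = sorted({w for s in sentences for w in s.split()} | {"</s>"})
    return contexts, bigrams, vocab


def prob(word, prev, model):
    """P(word | prev) with add-one (Laplace) smoothing."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def perplexity(sentence, model):
    """Evaluation: lower perplexity means the model finds the sentence more likely."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_p = sum(math.log(prob(w, p, model)) for p, w in zip(tokens[:-1], tokens[1:]))
    return math.exp(-log_p / (len(tokens) - 1))


def generate(model, rng, max_len=10):
    """Generation: sample one word at a time until end-of-sentence or max_len."""
    _, _, vocab = model
    prev, words = "<s>", []
    for _ in range(max_len):
        weights = [prob(w, prev, model) for w in vocab]
        word = rng.choices(vocab, weights=weights)[0]
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)


corpus = ["the cat sat", "the dog sat", "the cat ran"]  # toy data for illustration
model = train_bigram(corpus)
print(perplexity("the cat sat", model), perplexity("sat cat the", model))
print(generate(model, random.Random(0)))
```

A neural LM such as an LSTM would replace the count-based prob function with a learned conditional distribution, while the perplexity and sampling loops stay essentially the same.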