You are here

News and Events

  • [2018-04-13] Distinguished Lecture: Spoofing Attacks for Automatic Speaker Verification (ASV)
  • 13th April 2018, Shanghai, China
  • More Information: here

  • [2016-01-19] Upcoming: SJTU Young Researchers Forum on Speech and Language Processing
  • 21st March 2016, Shanghai, China
  • More Information: here

  • [2015-10-31] Talk: "Real" Speaker-independent Features
  • Time: 2015.10.31, 10:00-11:30
  • Location: SEIEE 3-410, Shanghai Jiao Tong University
  • Speaker: Prof. Nobuaki Minematsu (The University of Tokyo)
  • Abstract: Speech signals convey various kinds of information, which can be broadly grouped into two kinds: linguistic and extra-linguistic information. Many speech applications, however, focus on only a single aspect of speech. For example, speech recognizers try to extract only word identity from speech signals, and speaker recognizers extract only speaker identity. Here, irrelevant features are often treated as hidden or latent by applying probability theory to a large number of samples, or they are normalized to quasi-standard values. In speech analysis, however, phases are usually removed, not hidden or normalized, and pitch harmonics are likewise removed, not hidden or normalized. The resulting speech spectrum still contains both linguistic information (word identity) and extra-linguistic information (speaker identity). Is there any good method to remove extra-linguistic information from the spectrum? In this talk, our proposal of "really" speaker-independent, or speaker-invariant, features, called speech structure, is explained. Speaker variation can be modeled as feature space transformation, and our speech structure model is based on the transform-invariance of f-divergence. This proposal was inspired by findings in classical studies of structural phonology and recent studies of developmental psychology. In this talk, we show how we technically implemented findings from phonology and psychology as computational modules, and we also show some examples of applying speech structure to speech applications. If time allows, we show some relationships between speech structure and DNN-based robust feature extraction.
  • Biography: Nobuaki Minematsu received his Doctor of Engineering degree from the University of Tokyo in 1995. Currently, he is a full professor there. He has a wide interest in speech communication, ranging from science to engineering, and has published more than 400 scientific and technical papers, including conference papers. His work focuses on speech analysis, speech perception, speech disorders, speech recognition, speech synthesis, dialogue systems, language learning systems, etc. He received paper awards from RISP, JSAI, ICIST, and O-COCOSDA in 2005, 2007, 2011, and 2014, respectively, and received an encouragement award from PSJ in 2014. He gave tutorial talks on CALL at APSIPA 2011 and INTERSPEECH 2012. He is a member of IEEE, ISCA, IPA, SLaTE, IEICE, ASJ, etc.
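The transform-invariance of f-divergence mentioned in the abstract can be checked numerically. The sketch below is purely illustrative, not the speaker's actual implementation: it uses the closed-form KL divergence (one member of the f-divergence family) between 1-D Gaussians and shows that applying the same invertible affine "speaker transform" to both distributions leaves the divergence unchanged.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KL divergence between 1-D Gaussians N(m1, s1^2) and N(m2, s2^2)
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Two phoneme-like distributions in one speaker's feature space: (mean, std)
a = (0.0, 1.0)
b = (2.0, 1.5)

# An invertible affine "speaker transform" x -> 0.7*x + 3.0 applied to both
scale, shift = 0.7, 3.0
a2 = (scale * a[0] + shift, scale * a[1])
b2 = (scale * b[0] + shift, scale * b[1])

d_before = kl_gauss(*a, *b)
d_after = kl_gauss(*a2, *b2)
assert abs(d_before - d_after) < 1e-12  # divergence is unchanged by the transform
```

This is the property that makes a matrix of pairwise divergences between sound distributions ("speech structure") insensitive to speaker-dependent feature-space transformations.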


  • [2015-10-07] The Kick-off Meeting of the 2015 Autumn Term
  • The kick-off meeting of the SJTU Speech Lab was successfully held at SEIEE on October 7th.


  • [2015-5-15] Seminar: Sino-German Symposium on “Advanced Computer-Aided Feedback Methods for Second Language and Pronunciation Training”
  • Organizers: Prof. Kai Yu, Prof. Peter Birkholz
  • Introduction: The growing academic exchange between the PR China and Germany requires ever greater effort in teaching each other's languages. This is a big challenge due to the large number of students. According to the Federal Statistics Office, the number of Chinese students in Germany increased from about 9,000 in 2001 to about 26,000 in 2013. The quality of foreign language teaching can be substantially improved by including the methods of Computer-Aided Language Learning (CALL) and Computer-Aided Pronunciation Training (CAPT).
    The symposium will discuss the development of CALL/CAPT with special respect, but not restricted, to the Chinese-German pair of languages. There is much experience with other languages, especially English and Slavonic languages, which can be included in a synergetic way.
  • [2015-5-13] Event: Translation of ASR Book Finished
  • Automatic Speech Recognition: A Deep Learning Approach is a 2014 book by Dong Yu and Li Deng (DOI 10.1007/978-1-4471-5779-3).
  • The Chinese version of this book was translated in the Speech Lab. In addition, weekly seminars were held in the lab to review and discuss the details of each chapter.
  • Book Introduction: This is the first book on automatic speech recognition (ASR) focused on the deep learning approach, and in particular on deep neural network (DNN) technology. The landmark book represents a big milestone in the journey of DNN technology, which has achieved overwhelming success in ASR over the past few years. The background material of ASR and the technical details of DNNs, including rigorous mathematical descriptions and software implementations, are provided in this book, making it invaluable for ASR experts as well as advanced students.

  • [2015-4-13] Talk: Electromagnetic Articulography for Speaker Independent Acoustic-to-Articulatory Inversion
  • Location: SEIEE 3-414, Shanghai Jiao Tong University
  • Speaker: Dr. Michael T. Johnson
  • Abstract: Electro-Magnetic Articulography (EMA) is a leading technology for measuring articulatory kinematics, providing real-time movement data with relatively good spatial and temporal resolution. EMA provides a number of significant advantages over other technologies such as X-ray cinematography, cine MRI, and ultrasound, including full three-dimensional representation including both sensor position and orientation, sufficient temporal resolution to capture articulatory dynamics, relatively low measurement error, and low cost. EMA data is valuable for both clinical applications, such as assessment and rehabilitation of dysarthric patients, and speech technology applications, such as the use of speech recognition in computer aided language learning systems.

    In this talk we will review the basics of EMA technology, and introduce the articulatory features and methods we have developed for building speaker independent acoustic-to-articulatory inversion systems, in which articulatory trajectories are estimated directly from acoustic data. This is a difficult but important problem for many speech processing applications, including automatic speech recognition and computer aided pronunciation training. In recent years, several approaches have been successfully implemented for speaker dependent models with parallel acoustic and kinematic training data. However, in most practical applications inversion is needed for new speakers for whom no articulatory data is available. We have developed a novel speaker adaptation approach called Parallel Reference Speaker Weighting (PRSW), which uses speaker-weighted adaptation to form an inversion mapping for new speakers that can accurately estimate articulatory trajectories. The proposed PRSW method is evaluated on the newly collected Marquette EMA corpus of Mandarin Accented English (EMA-MAE). Cross-speaker inversion results show that given a good selection of reference speakers with consistent acoustic and articulatory patterns, the PRSW approach gives good speaker independent inversion performance even without kinematic training data.
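The weighted-combination idea behind PRSW can be illustrated with a toy sketch. Everything below (the linear per-speaker mappings, the softmax weighting from similarity scores, and the dimensions) is a hypothetical simplification for illustration, not the system evaluated on the EMA-MAE corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reference-speaker inversion mappings: each maps a 3-dim
# acoustic vector to a 2-dim articulatory estimate via a linear matrix.
n_refs, acoustic_dim, artic_dim = 4, 3, 2
ref_maps = [rng.standard_normal((artic_dim, acoustic_dim)) for _ in range(n_refs)]

def prsw_weights(similarities):
    """Turn per-reference acoustic similarity scores into convex weights."""
    s = np.asarray(similarities, dtype=float)
    e = np.exp(s - s.max())  # softmax, shifted for numerical stability
    return e / e.sum()

def invert(acoustic, weights):
    """Apply the speaker-weighted combination of reference mappings to one frame."""
    combined = sum(w * M for w, M in zip(weights, ref_maps))
    return combined @ acoustic

# Example: a new speaker whose acoustics resemble reference speaker 2 most
w = prsw_weights([0.1, 0.3, 2.0, 0.2])
frame = rng.standard_normal(acoustic_dim)
estimate = invert(frame, w)  # 2-dim articulatory estimate, no kinematic data needed
```

The key point the sketch captures is that the new speaker's mapping is built entirely from reference speakers' mappings, weighted by acoustic similarity, so no articulatory data from the new speaker is required.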

  • [2014-11-13] Event: Won Second Prize of the Wu Wenjun Artificial Intelligence Science and Technology Award

  • Prize-winners: Kai Yu, Yanmin Qian
  • Media Report: The Award Ceremony of the 4th Wu Wenjun Artificial Intelligence Science and Technology Award Held in Shanghai -- Department of Computer Science and Engineering, Shanghai Jiao Tong University

  • [2014-11-6] Talk: Computational Networks and the Computational Network Toolkit
  • Location: SEIEE 3-410, Shanghai Jiao Tong University
  • Speaker: Dr. Dong Yu (Microsoft Research)
  • Abstract: Many popular machine learning models for prediction and classification can be described as a series of computation steps. Such models can be represented using a structure known as a computational network. A computational network expresses a model’s operation as a graph, where leaf nodes represent input values or learnable parameters and parent nodes represent basic computations, such as sum, multiplication, or logarithm. Arbitrarily complex computations can be performed using a sequence of such nodes. Deep neural networks, convolutional neural networks, recurrent neural networks and maximum entropy models are all examples of models that can be expressed using computational networks. In this talk we will first introduce computational networks (CNs) and describe the benefits of such generalization and the key algorithms involved in CNs. I will then introduce the computational network toolkit (CNTK), a general purpose C++ implementation of computational networks. I will describe its architecture and core functionalities and demonstrate that it can construct and learn models of arbitrary topology, connectivity, and recurrence. 
  • Biography: Dr. Dong Yu is a principal researcher in the Microsoft speech and dialog research group. His current research interests include speech processing, robust speech recognition, discriminative training, and machine learning. He has published over 140 papers in these areas and is the co-inventor of more than 50 granted/pending patents. His work on the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) has helped to shape the new direction of large vocabulary speech recognition research and was recognized by the IEEE SPS 2013 best paper award. Most recently he has focused on applying computational networks, a generalization of many neural network models, to speech recognition. Dr. Dong Yu is currently serving as a member of the IEEE Speech and Language Processing Technical Committee (2013-) and an associate editor of IEEE Transactions on Audio, Speech, and Language Processing (2011-). He has served as an associate editor of IEEE Signal Processing Magazine (2008-2011) and the lead guest editor of the IEEE Transactions on Audio, Speech, and Language Processing special issue on deep learning for speech and language processing (2010-2011).
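The node-based view of a computational network described in the abstract can be sketched in a few lines. This is an illustrative toy (the class names and structure are invented here, not CNTK's actual API): leaf nodes hold input values or learnable parameters, and parent nodes apply basic computations to their children.

```python
import math

class Leaf:
    """A leaf node: an input value or a learnable parameter."""
    def __init__(self, value):
        self.value = value
    def forward(self):
        return self.value

class Node:
    """A computation node: applies a basic operation to its child nodes."""
    def __init__(self, op, *children):
        self.op, self.children = op, children
    def forward(self):
        # Recursively evaluate children, then apply this node's operation
        return self.op(*(c.forward() for c in self.children))

# Build the graph for log(x * w + b), mixing an input with two parameters
x, w, b = Leaf(2.0), Leaf(3.0), Leaf(1.0)
prod = Node(lambda u, v: u * v, x, w)
total = Node(lambda u, v: u + v, prod, b)
out = Node(math.log, total)

result = out.forward()  # log(2*3 + 1) = log(7)
```

Because any model expressible as such a graph can be evaluated (and, with gradients added, trained) by the same generic machinery, DNNs, CNNs, RNNs, and maximum entropy models all become special cases of one framework.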

  • [2014-11-2] Talk: Towards Deep Understanding: Deep Learning for Natural Language Applications
  • Location: SEIEE 3-410, Shanghai Jiao Tong University
  • Speaker: Xiaodong He
  • Abstract: Deep learning techniques have enjoyed tremendous success in the speech and language processing community in recent years. In this talk, I will focus on deep learning approaches to problems in natural language processing, with particular emphasis on important applications. In this talk, I'll highlight the general issues of language processing, and elaborate on how new deep learning technologies are proposed to address these issues. I'll then place particular emphasis on several important applications such as semantic modeling, spoken language understanding, information retrieval, machine translation, etc. For each of the tasks we discuss what particular architectures of deep learning models are suitable given the nature of the task, and how learning can be performed efficiently and effectively using end-to-end optimization strategies.
  • Biography: Xiaodong He is a Researcher at Microsoft Research, Redmond, WA, USA. He is also an Affiliate Professor in Electrical Engineering at the University of Washington. His research interests include deep learning, information retrieval, natural language understanding, machine translation, and speech recognition. Dr. He has co-authored a book and more than 70 technical papers, and has given tutorials at international conferences in these fields. He and his colleagues developed entries that took first place in the 2008 NIST Machine Translation Evaluation (NIST MT) and the 2011 International Workshop on Spoken Language Translation Evaluation (IWSLT), both in Chinese-English translation. He serves as Associate Editor of IEEE Signal Processing Letters and IEEE Signal Processing Magazine, and as Guest Editor of IEEE TASLP and J-STSP. He has served on organizing committees and program committees of major speech and language processing conferences. He is a Senior Member of IEEE and a member of ACL.


  • [2014-10-24] Talk: A New Deep Learning Perspective on Acoustic Signal Processing
  • Time: 24th October, 2014
  • Location: SEIEE 3-410, Shanghai Jiao Tong University
  • Organizers: Dr. Yanmin Qian and Prof. Kai Yu
  • Speaker: Prof. Chin-Hui Lee
    School of Electrical and Computer Engineering, Georgia Institute of Technology


  • [2014-09-22]  2014 SJTU Speech and Language Processing Workshop
  • Time: 22nd September, 2014
  • Location: SEIEE 3-410, Shanghai Jiao Tong University
  • Organizers: Dr. Yanmin Qian and Prof. Kai Yu


  • [2014-09-20]  Talk: High Accuracy Keyword Spotting from Low Resource Languages
  • This talk will describe several different methods we developed to build a system that performs accurate Keyword Spotting on languages with very little training data.


  • [2013-10-14]  Deep Learning: From Academic Concepts to Industrial Triumph
  • On October 15, 2013, we were honored to invite Li Deng from MSR to give us a talk on deep learning. The talk attracted many students and teachers from SEIEE at SJTU. The talk was brilliant and was received enthusiastically.


  • [2013-08-29]   14th Annual Conference of the International Speech Communication Association       

  • From August 25 to August 29, 2013, lab instructors Kai Yu and Yanmin Qian went to Lyon, France, to attend Interspeech 2013, the 14th Annual Conference of the International Speech Communication Association (ISCA). Our lab published two papers at the conference. Kai Yu and Yanmin Qian gave presentations entitled "Cluster Adaptive Training with Factorized Decision Trees for Speech Recognition" and "MLP-HMM Two-Stage Unsupervised Training for Low-Resource Languages on Conversational Telephone Speech Recognition", respectively.


  • [2013-08-13]   Lab meeting: DNN and adaptation                     

  • On August 15, 2013, to improve laboratory members' scientific research abilities and to enhance academic exchange among members, we held another group meeting. At this meeting, Tian Tan introduced relevant background on DNNs.


  • [2013-08-05]   Lab meeting: Human-computer game
  • On the evening of 5 August 2013, from 7:00 to 9:30, the Speech Lab held a group meeting featuring a talk entitled "Man-Machine Chess", with Kai Sun as the speaker.


  • [2013-07-26]   Dragon Star Program 2013

  • From July 21 to July 26, 2013, the summer course "Computational Auditory", co-founded by the Graduate School of the Chinese Academy of Sciences and the Dragon Star Program, was successfully held on the Minhang campus. The course was taught by Professor DeLiang Wang from Ohio State University. The program attracted many teachers and students from across the country, including Tsinghua University, Zhejiang University, and the Chinese University of Hong Kong, as well as many scholars already in the workforce. The lectures were brilliant and were received enthusiastically.