Suspending Deep Belief
Gerald Penn
University of Toronto
Thursday, May 2, 2013
2:00 p.m., ICSI Lecture Hall
Abstract:
Deep neural networks (DNNs), often accompanied by generative pre-training with deep belief networks (DBNs), have started to supplant Gaussian mixture models (GMMs) as the default acoustic models for automatic speech recognition (ASR). When the output nodes of a DNN are expanded from a small number of phonemes to a large number of tied states of triphone HMMs, the resulting so-called context-dependent DNN/HMM hybrid model has been reported to achieve unprecedented performance gains on many challenging ASR tasks, including the well-known Switchboard task.
Why, specifically, is this happening? What is it about DNNs that grants them this ability? In this talk, I'll describe some experiments that provide clues to answering this question. The results of these experiments suggest that DNNs do not necessarily yield better modelling capability than conventional GMMs for standard speech features. DNNs are, however, very powerful at leveraging highly correlated features. The unprecedented gain of the context-dependent DNN/HMM model can be attributed almost entirely to the DNN's input feature vectors, which are concatenated from several consecutive speech frames within a relatively long context window.
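(To make the frame-splicing point concrete, here is a minimal NumPy sketch of the kind of context-window concatenation described above; the window half-width of 5 frames and the function name are illustrative choices, not details from the talk.)

    import numpy as np

    def splice_frames(features, context=5):
        """Concatenate each frame with its +/- `context` neighbours.

        features: (T, D) array of per-frame acoustic features
                  (e.g. log mel filterbank coefficients).
        Returns:  (T, (2*context + 1) * D) array of spliced inputs,
                  one long vector per centre frame.
        Edge frames are handled by repeating the first/last frame;
        context=5 (an 11-frame window) is an assumed, typical choice.
        """
        T, D = features.shape
        # Pad by repeating edge frames so every frame has a full window.
        padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
        # Stack the 2*context+1 consecutive frames around each position.
        return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])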
Then we'll turn our attention to DBN pre-training. Again, where does the benefit come from? Our recent attempts at answering this question have revealed a simple but novel use of convolutional neural networks that can beat a DBN-pre-trained network with a similar number of trainable weights.
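(The abstract does not spell out the architecture; as a rough sketch of one such convolutional design, the PyTorch model below convolves along the frequency axis of spliced filterbank features and classifies tied triphone states. Every layer size, the filter shapes, and the output count are placeholder assumptions, not details from the talk.)

    import torch
    import torch.nn as nn

    class ConvAcousticModel(nn.Module):
        """Sketch of a convolutional acoustic model: a convolution
        over the frequency axis of spliced filterbank features,
        followed by fully connected layers. All sizes are placeholders.
        """
        def __init__(self, n_bands=40, n_frames=11, n_states=6000):
            super().__init__()
            # Treat the (frames x bands) input as a 1-channel "image"
            # and share filter weights across frequency bands.
            self.conv = nn.Conv2d(1, 64, kernel_size=(n_frames, 8))
            self.pool = nn.MaxPool2d(kernel_size=(1, 3))
            conv_out = 64 * ((n_bands - 8 + 1) // 3)
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(conv_out, 1024),
                nn.Sigmoid(),
                nn.Linear(1024, n_states),  # one output per tied state
            )

        def forward(self, x):
            # x: (batch, 1, n_frames, n_bands) spliced feature window
            h = torch.sigmoid(self.conv(x))
            return self.classifier(self.pool(h))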
Bio:
Gerald Penn is an associate professor in the Department of Computer Science at the University of Toronto. He has been a member of technical staff at Bell Labs and a visiting professor at Stanford. His interests include typed unification grammars, logic programming, spoken dialog systems, and speech recognition. He received his Ph.D. from Carnegie Mellon University.