Probabilistic Modeling of Speech and Language

Zhijian Ou

Tsinghua University, Beijing, China

Tuesday, May 12, 2015
12:30 p.m., Conference Room 5A

The first part of the talk introduces the PAT (Probabilistic Acoustic Tube) model of speech. Current model-based speech analysis tends to be incomplete: only some of the parameters of interest (e.g., the pitch or the vocal tract) are modeled, while the rest, which may be equally important, are disregarded. The drawback is that analysis of speech parameters may be inaccurate or even incorrect without joint modeling of correlated parameters. Based on the fundamental physics of speech production, we propose PAT, which jointly models pitch, vocal tract, and energy. Notably, PAT incorporates mixed excitation, glottal wave, and phase modeling. We demonstrate the capability of PAT on a number of speech analysis/synthesis tasks, such as pitch tracking under both clean and additive-noise conditions, speech synthesis, and phoneme clustering. One reviewer commented that it is "to my knowledge the most complete attempt on developing a true generative model for speech."
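To make the source-filter ingredients mentioned above concrete (pitch, vocal tract, energy, mixed excitation), here is a minimal, hypothetical sketch of frame synthesis. It is not the actual PAT model; the function name, parameterization, and the simple all-pole vocal-tract filter are illustrative assumptions.

```python
import math
import random


def synthesize_frame(f0, vt_coeffs, energy, voicing, n=160, fs=8000):
    """Toy source-filter sketch (NOT the PAT model itself).

    Mixed excitation = voiced pulse train + unvoiced noise, shaped by
    an all-pole vocal-tract filter and scaled to the frame energy.
    f0: pitch in Hz; vt_coeffs: all-pole filter coefficients;
    voicing: 0..1 mix between noise and pulse train.
    """
    random.seed(0)  # reproducible noise for the illustration
    period = max(1, int(fs / f0))
    # Mixed excitation: impulse train weighted by voicing, plus noise
    excitation = [
        voicing * (1.0 if i % period == 0 else 0.0)
        + (1.0 - voicing) * random.gauss(0.0, 0.3)
        for i in range(n)
    ]
    # All-pole filter: y[t] = x[t] - sum_k a[k] * y[t-k]
    y = [0.0] * n
    for t in range(n):
        acc = excitation[t]
        for k, a in enumerate(vt_coeffs, start=1):
            if t - k >= 0:
                acc -= a * y[t - k]
        y[t] = acc
    # Normalize the peak, then scale to the desired energy level
    peak = max(abs(v) for v in y) or 1.0
    return [energy * v / peak for v in y]
```

The point of the sketch is the joint dependence: the same waveform is a function of pitch, vocal tract, and energy together, which is why estimating one while ignoring the others can go wrong.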

In the second part, we introduce the random field approach to language modeling. Language modeling (LM) involves determining the joint probability of the words in a sentence. The conditional approach is dominant: it represents the joint probability as a product of conditionals, as in n-gram LMs and neural network LMs. An alternative, called the random field (RF) approach, is used in whole-sentence maximum entropy (WSME) LMs. Although the RF approach has potential benefits, the empirical results of previous WSME models have not been satisfactory. We have recently revisited the RF approach to language modeling with a number of innovations: we propose a trans-dimensional RF model and develop a training algorithm based on joint stochastic approximation and trans-dimensional mixture sampling. In speech recognition experiments on Wall Street Journal data, our RF models perform as well as recurrent neural network LMs while being computationally more efficient in use. To our knowledge, this result is the first strong empirical evidence for the power of the whole-sentence language modeling approach.
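The contrast between the two approaches can be sketched in a few lines. This is a hypothetical toy illustration, not the talk's trans-dimensional RF model: the conditional approach scores a sentence by the chain rule, while a random-field model assigns an unnormalized whole-sentence score exp(sum_k lambda_k * f_k(sentence)); the function names and feature choices below are assumptions for illustration.

```python
import math


def chain_rule_logprob(sentence, cond_prob):
    """Conditional approach: log p(w1..wn) = sum_t log p(w_t | history).
    cond_prob(word, history) is any conditional model (n-gram, NN, ...)."""
    logp = 0.0
    history = ("<s>",)
    for w in sentence:
        logp += math.log(cond_prob(w, history))
        history = history + (w,)
    return logp


def rf_unnormalized_logscore(sentence, features, lambdas):
    """Random-field approach: log p(sentence) = sum_k lambda_k * f_k(sentence)
    minus log Z, where Z normalizes over all sentences. Computing (or
    sidestepping) Z is the hard part, which is what the training
    innovations in the talk address; here we return only the
    unnormalized score."""
    return sum(lam * f(sentence) for f, lam in zip(features, lambdas))
```

Note that the RF score is a function of the whole sentence at once, so features need not decompose left-to-right, which is the flexibility the whole-sentence approach offers.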

Bio:

Zhijian Ou received the BS degree with highest honors in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 1998, and the MS and PhD degrees in electronic engineering from Tsinghua University, Beijing, China, in 2000 and 2003, respectively. Since 2003, he has been with the Department of Electronic Engineering, Tsinghua University, where he is now an associate professor. His current research interests include speech processing and statistical machine intelligence.

http://oa.ee.tsinghua.edu.cn/~ouzhijian/index.htm