Audio-visual Speech Recognition using Deep Neural Networks

Presented by Ahmed Hussen Abdelaziz

Tuesday, February 21, 2017
11:00 a.m.
ICSI Lecture Hall

Abstract:

Many experiments have demonstrated that humans' ability to understand speech is only slightly affected by noise, even without training and under unnatural distortions. Despite the breakthroughs achieved by deep learning in the realm of automatic speech recognition (ASR), such systems still cannot match the noise robustness of human speech recognition. To enhance the performance of ASR systems in noisy environments, a non-traditional approach exploits the fact that speech perception is bi-modal (audio-visual). Video recordings of speakers' mouths can be very helpful for ASR systems in uncontrolled noisy conditions, since the visual modality is almost unaffected by acoustic noise. Using the visual modality in conjunction with the acoustic modality, however, introduces new challenges. For example, in addition to the acoustic front-end, a visual front-end is needed to extract speech-related features from the video stream. Audio-visual fusion is another critical challenge, as the resulting audio-visual (AV)-ASR system should always outperform the single-modality recognizers in every test condition. In this talk, I will give an overview of my project at ICSI, which addressed the major challenges of AV-ASR and developed a noise-robust AV-ASR framework based on deep neural networks (DNNs). After a brief summary of the challenges introduced by AV-ASR, I will review the algorithms and approaches used to develop a noise-robust DNN-based AV-ASR system. Finally, I will present a summary of experimental results comparing different fusion models for AV-ASR systems.

Speaker Bio:

Ahmed Hussen Abdelaziz received his B.Sc. degree in electrical engineering and information technology from Al-Azhar University in Cairo, Egypt, in 2006. He obtained his M.Sc. and Ph.D. degrees in electrical engineering and information technology from Ruhr-Universität Bochum, Germany, in 2010 and 2015, respectively. He is currently a postdoctoral fellow at the International Computer Science Institute (ICSI) in Berkeley, funded by the German Academic Exchange Service (DAAD). His research interests include audio-visual speech recognition, audio-visual speech enhancement, and noise-robust speech recognition with uncertain or missing data.