Publication Details
Title: On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval
Author: R. Mertens, P.-S. Huang, L. Gottlieb, G. Friedland, and A. Divakaran
Bibliographic Information: Proceedings of the IEEE International Symposium on Multimedia (ISM 2011), Dana Point, California, pp. 446-451
Date: December 2011
Research Area: Audio and Multimedia
Type: Article in conference proceedings
PDF: http://www.icsi.berkeley.edu/pubs/speech/applicabilityof12.pdf
Overview:
Recently, audio concepts have emerged as a useful building block in multimodal video retrieval systems. Information like “this file contains laughter”, “this file contains engine sounds” or “this file contains slow music” can significantly improve purely visual-based retrieval. The weak point of current approaches to audio concept detection is that they rely heavily on human annotators. In most approaches, audio material is manually inspected to identify relevant concepts. Then instances that contain examples of relevant concepts are selected – again manually – and used to train concept detectors. This approach comes with two major disadvantages: (1) it leads to rather abstract audio concepts that hardly cover the audio domain at hand and (2) the way human annotators identify audio concepts likely differs from the way a computer algorithm clusters audio data, which introduces additional noise into the training data. This paper explores whether unsupervised audio segmentation systems can be used to identify useful audio concepts by analyzing training data automatically and whether these audio concepts can be used for multimedia document classification and retrieval. A modified version of the ICSI (International Computer Science Institute) speaker diarization system finds segments in an audio track that have similar perceptual properties and groups these segments. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and of the theoretical fitness of this approach to discern one document class from another.
Keywords: Audio Clustering, Audio Indexing, Speaker Diarization, Video Indexing
Acknowledgements:
This work was partially supported by funding provided to ICSI by the Intelligence Advanced Research Projects Agency (IARPA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of IARPA or of the U.S. Government.
Bibliographic Reference:
R. Mertens, P.-S. Huang, L. Gottlieb, G. Friedland, and A. Divakaran. On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval. Proceedings of the IEEE International Symposium on Multimedia (ISM 2011), Dana Point, California, pp. 446-451, December 2011