New Research Group: Audio and Multimedia
ICSI has established a new research area, Audio and Multimedia, to focus on problems related to multimedia analysis and retrieval. Researchers in this area, led by Gerald Friedland, are working on ways to automatically extract meaning from the vast amounts of consumer-produced data freely available on Web sites like Flickr and YouTube: a dataset of billions of images and videos with few constraints on quality, size, or content. Videos are of particular interest to the group because they provide textual, audio, and visual information for analysis and are the fastest-growing type of content on the Internet. YouTube alone claims that 72 hours of video are uploaded to its site every minute.
Consumer-produced content includes entertainment, instructions, personal records, and glimpses of life as it was when the media was recorded. Such recordings represent a compendium of information about trends, phenomena, events, and social context and dynamics, and so are useful for qualitative and quantitative empirical research on a larger scale than has ever been possible before. The hope is that multimedia research will produce tools that make these large collections of social media content easy to organize and search, aiding such research in both academic and industrial contexts.
The Audio and Multimedia group evolved from ICSI's Speech Group and places special emphasis on audio analysis. Audio content frequently complements visual content, as in videos, but has received less attention from the multimedia research community.
For example, in the ALADDIN project, funded by IARPA, researchers are building a system that uses acoustic analysis to search for concepts in videos. IARPA has provided ICSI, along with other teams from institutions around the world, with tens of thousands of consumer-produced videos, some of which are labeled as belonging to one of 15 categories. Given the labeled examples, the challenge is to find videos that belong in any of the 15 categories within a set of about 150,000 unlabeled videos. The team is developing parallelization methods to handle the immense amount of data and was recently given multi-CPU cards by Intel for this purpose. ICSI is collaborating with SRI, Carnegie Mellon University, and other research institutions.
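In broad strokes, this is a supervised retrieval problem: learn from the labeled videos, then rank the unlabeled ones by how likely they are to contain each concept. The sketch below illustrates that general pattern only, not ICSI's actual system; it assumes each video has already been reduced to a fixed-length acoustic feature vector, and the file names and feature pipeline are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical inputs: one fixed-length acoustic feature vector per
    # video (e.g., averaged spectral features) and labels 0..14 for the
    # 15 event categories.
    X_labeled = np.load("labeled_features.npy")      # (n_labeled, n_dims)
    y_labeled = np.load("labels.npy")                # (n_labeled,)
    X_unlabeled = np.load("unlabeled_features.npy")  # (n_unlabeled, n_dims)

    # Train a multinomial logistic-regression classifier on the labeled set.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_labeled, y_labeled)

    # Rank the ~150,000 unlabeled videos by the classifier's confidence
    # that each belongs to a given category.
    probs = clf.predict_proba(X_unlabeled)           # (n_unlabeled, 15)
    for category in range(probs.shape[1]):
        ranked = np.argsort(probs[:, category])[::-1]
        print(f"category {category}: top candidates {ranked[:5]}")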
A major focus of Audio and Multimedia researchers has been estimating where videos were recorded, and this work led to the development of a method for qualifying Mechanical Turk users. Mechanical Turk is Amazon's crowd-sourcing marketplace, in which ordinary users complete tasks for small amounts of money. Crowd-sourcing is most often used for tasks that are easy for humans but difficult for computers; location estimation, however, is difficult for both. The researchers wanted to find a group of people qualified to estimate the locations of videos, partly to establish a human baseline against which to compare their automatic system. They were also developing qualification methods so that they could eventually run a Mechanical Turk task to place videos from areas that are not well represented on the Web, such as West Africa.
To identify qualified people, they asked participants to estimate the locations of a series of videos using online resources such as Google Maps and image search. The participants' results were compared to those achieved by researchers in the group and in the wider multimedia community. About a fifth of the participants turned out to be qualified to place videos, meaning that they could locate a video within 10 kilometers of its true origin 80 percent of the time.
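That qualification criterion is easy to state precisely. The following is a minimal sketch using the standard haversine great-circle distance; the article does not say which distance measure the researchers actually used.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two (lat, lon) points, in km.
        r = 6371.0  # mean Earth radius
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = (math.sin(dphi / 2) ** 2
             + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
        return 2 * r * math.asin(math.sqrt(a))

    def is_qualified(guesses, truths, radius_km=10.0, required_rate=0.8):
        # A participant qualifies if at least `required_rate` of their
        # guessed (lat, lon) points fall within `radius_km` of the
        # videos' true locations.
        hits = sum(haversine_km(g[0], g[1], t[0], t[1]) <= radius_km
                   for g, t in zip(guesses, truths))
        return hits / len(truths) >= required_rate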
The ability to automatically find the origins of videos has implications for online privacy, and Audio and Multimedia researchers, along with Robin Sommer and Nick Weaver of Networking and Security, are working to expose the dangers. One project focuses on teaching high school students about the privacy implications of sharing content online: young people use social media the most, are the least aware of its potential consequences, and are therefore particularly vulnerable to attacks. The researchers are developing classroom tools and building a Web site that helps students understand how photos, videos, and text updates shared on social media can be used against them. Much of this work is in collaboration with Dan Garcia of UC Berkeley and the Berkeley Foundation for Opportunities in Information Technology, an ICSI project that supports female students and students from underrepresented ethnicities who want to pursue careers in computer science.
In that project and others, researchers are interested in chains of inference: the aggregation of public and seemingly innocuous information from different Web sites in order to attack privacy. For instance, the researchers have shown that it is possible to link accounts on different sites, such as Yelp, Twitter, and Flickr, based on the lengths of posted videos and other factors, even when the usernames differ.
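The article does not describe the exact matching algorithm, but a toy version of the idea is straightforward: if two accounts have posted videos whose durations line up unusually well, they plausibly belong to the same person. In the sketch below, the duration lists, the one-second tolerance, and the 0.9 threshold are all illustrative assumptions.

    def duration_overlap(durations_a, durations_b, tolerance=1):
        # Fraction of account A's video durations (in seconds) that match
        # some duration posted by account B, within `tolerance` seconds.
        if not durations_a:
            return 0.0
        matched = sum(any(abs(a - b) <= tolerance for b in durations_b)
                      for a in durations_a)
        return matched / len(durations_a)

    # Hypothetical duration lists scraped from two public profiles.
    yelp_durations = [34, 121, 78, 15]
    flickr_durations = [15, 34, 78, 121, 240]

    if duration_overlap(yelp_durations, flickr_durations) >= 0.9:
        print("The two accounts likely belong to the same person.")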
It was the privacy implications of such innocuous public information that first sparked Friedland's interest in multimedia analysis. In 2010, Friedland and Sommer published a technical report that coined the term “cybercasing”: the use of information freely available online to mount real-world attacks. The case studies they presented relied on geo-tags, precise information about where a video or photo was taken that is automatically embedded in media captured by devices such as smartphones. They showed how easy it is, for example, to extract the geo-tag from a photo in a Craigslist ad selling valuables and to combine it with the ad's stated best time to call to infer when the seller would be away from home.
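Geo-tags are stored in a photo's EXIF metadata and can be read with a few lines of code. The following sketch uses the Pillow imaging library, a modern tool chosen for illustration rather than anything the original study describes; the file name is hypothetical.

    from PIL import Image

    GPS_IFD = 0x8825  # EXIF pointer to the GPS information block

    def extract_geotag(path):
        # Return (latitude, longitude) in decimal degrees, or None if
        # the photo carries no geo-tag.
        gps = Image.open(path).getexif().get_ifd(GPS_IFD)
        if not gps or 2 not in gps or 4 not in gps:
            return None

        def to_degrees(dms, ref):
            # Convert a (degrees, minutes, seconds) tuple to decimal
            # degrees, negated for southern or western hemispheres.
            degrees, minutes, seconds = (float(v) for v in dms)
            value = degrees + minutes / 60 + seconds / 3600
            return -value if ref in ("S", "W") else value

        lat = to_degrees(gps[2], gps[1])  # GPSLatitude, GPSLatitudeRef
        lon = to_degrees(gps[4], gps[3])  # GPSLongitude, GPSLongitudeRef
        return lat, lon

    print(extract_geotag("craigslist_photo.jpg"))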