Xiaoyong Wei, Professor at Sichuan University, Joins Audio and Multimedia to Work on Video Content Search

Thursday, February 20, 2014

Xiaoyong Wei, an associate professor at Sichuan University in China, has joined Audio and Multimedia for a year.

Xiaoyong’s research works toward systems that can analyze visual information in videos and respond to search queries about their content. His group’s goal is to build reasoning systems that parallel the way humans think about the world: take a query submitted by a user, understand it by matching it to a keyword that the video search system recognizes, and respond with videos that match.

His group is interested in two kinds of reasoning that humans use to analyze the world: semantic and contextual reasoning. Semantic reasoning relies upon the hierarchical relationships among concepts. For example, we know that a car is a kind of ground vehicle, and a ground vehicle is a kind of vehicle; this helps us understand what a car is. With contextual reasoning, we use the context of an object to help us understand what it is. For example, we know cars are usually on the ground, so if we see an object in the air, we know it is unlikely to be a car. Machines typically can’t do this sort of reasoning; they can only do what we train them to do. In other words, a machine can recognize a face if we train it to recognize faces, but its ability to reason about why a face is a face is limited.
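The "is-a" side of semantic reasoning can be sketched with a tiny parent-link table. A minimal sketch, assuming a hypothetical hierarchy (the concept names here are illustrative, not drawn from any actual ontology his group uses):

```python
# Hypothetical is-a hierarchy: each concept maps to its parent concept.
PARENT = {
    "car": "ground vehicle",
    "bus": "ground vehicle",
    "ground vehicle": "vehicle",
    "aircraft": "vehicle",
}

def is_a(concept, ancestor):
    """Walk parent links upward to decide whether `concept` is a kind of `ancestor`."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)  # None once we pass the root
    return False
```

Under this toy hierarchy, `is_a("car", "vehicle")` holds because the chain car → ground vehicle → vehicle reaches the target, while `is_a("car", "aircraft")` does not.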

Xiaoyong and his group build computable models to mimic these kinds of reasoning. In one project, they use ontological models, which are normally expressed as a hierarchy. These hierarchies are not easily processed by computers, but mapping the hierarchy into a linear space produces vectors that computers can work with. Xiaoyong’s group is in the early stages of this kind of work.

Another model uses ideas from computational linguistics and again starts with a hierarchical graph, which is used to compute the distance, or number of hops, between concepts. The fewer the intervening nodes between two concepts, the more likely they are to be related. These distances, however, are not very reliable. For example, imagine trying to figure out whether a user is asking for videos about a person or about an obscure kind of animal of which little is known. Because there is less information about the animal, fewer concepts exist to describe it, so the distances among those concepts will be short, while the distances among concepts related to humans will be much longer (because there are more concepts). To cope with this unreliability, Xiaoyong and his group build models that rely on many reference nodes surrounding the target concept, similar to geographical triangulation.
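The two pieces of this idea are (1) hop distance in a concept graph and (2) describing a concept by its distances to several reference nodes rather than by one raw distance. A minimal sketch, assuming a small hypothetical concept graph (the nodes and edges are invented for illustration):

```python
from collections import deque

# Hypothetical undirected concept graph; edges link related concepts.
GRAPH = {
    "person": ["face", "athlete"],
    "face": ["person"],
    "athlete": ["person", "sport"],
    "sport": ["athlete", "ball"],
    "ball": ["sport"],
}

def bfs_distance(graph, start, goal):
    """Number of hops between two concepts; None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def signature(graph, concept, references):
    """Distances from one concept to several reference nodes -- the
    'triangulation' profile used in place of a single raw distance."""
    return [bfs_distance(graph, concept, ref) for ref in references]
```

Comparing two concepts by their full signatures against the same reference set is less sensitive to how densely a region of the graph is populated than comparing a single pairwise distance.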

Xiaoyong has worked mainly with computer vision systems. While a professor at Sichuan University, he began to take attendance in a computer vision course by photographing the class and running the photos through a facial recognizer. In a later data mining course, he cross-referenced the photos with grade records to test the hypothesis that those who sit closer to the front of the class perform better. (The data showed that it was true up to a point.) While at ICSI, he looks forward to taking advantage of Audio and Multimedia’s emphasis on audio analysis.
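Once seat positions and grades are paired up, the seating hypothesis reduces to checking whether seat row and grade are negatively correlated. A minimal sketch of that check using the Pearson coefficient (the function is generic; any example inputs would be invented, not his class's actual records):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples.

    For the seating hypothesis: xs = row number counted from the front,
    ys = final grade; a clearly negative value supports the hypothesis.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

A result near -1 would mean grades fall steadily toward the back rows; values near 0 would mean seat choice tells us little.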


Xiaoyong breaking a brick with his bare hand

Xiaoyong received his PhD from the City University of Hong Kong in computer science and multimedia content analysis in 2009, after which he returned to Sichuan, where he grew up. He enjoys the martial arts and, for 20 years, practiced kung fu. He now practices (and teaches) tai chi. In his Practices in Scientific Projects class at Sichuan, he teaches about the application of scientific principles in an unusual but effective manner: he uses physics to teach how one can chop a brick with a bare hand – and then proceeds to do exactly that. You can watch Xiaoyong break a brick at http://www.56.com/u31/v_NzYyODYzNDA.html.