Featured Research: FrameNet
The FrameNet project is one of the longest-running projects at ICSI. Led by Professor Charles Fillmore and Dr. Collin Baker, FrameNet researchers are creating "an online lexical resource for English, based on frame semantics and supported by corpus evidence." The theory of frame semantics used in the FrameNet project originated with Fillmore while he was at UC Berkeley, before his work at ICSI.
Frame semantic theory categorizes words and ideas according to the frames that the words evoke. Some frames are quite simple, such as the Placing frame, which involves an object, the location where it goes, and a word that suggests the object is being put in its place - for example, put, lay, shelve, or file. In the sample sentence below, the words highlighted in black are frame-evoking words. Thought evokes the Awareness/Cognition frame, might evokes the Likelihood frame, and die evokes the Death frame. The color-highlighted words are elements of the frames. In the Awareness/Cognition frame, for example, there is the person who is thinking - I - and the thought - that I might die. In the Likelihood frame, I die is the thing that might happen. In the Death frame, I is the person who may die.
The mapped image below shows the relationship between the frame-evoking words and their frame elements in more detail, using the same sentence.
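The relationships described above can also be written down as a small data structure. The following is only an illustrative sketch in Python: the class name, the helper function, and the role labels (taken from the plain-English descriptions in this article) are assumptions made for the example, and they are not FrameNet's own data format or its official frame element names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FrameAnnotation:
    """One frame-evoking word together with the frame elements it relates to.

    Hypothetical structure for illustration; not the FrameNet database schema.
    """
    frame: str                      # frame name as given in the article
    target: str                     # the frame-evoking word
    elements: Dict[str, str] = field(default_factory=dict)  # role -> text


# The three annotations described in the article for the sample sentence.
# Role labels are informal descriptions, not official FrameNet names.
annotations: List[FrameAnnotation] = [
    FrameAnnotation(
        frame="Awareness/Cognition",
        target="thought",
        elements={"person thinking": "I", "thought": "that I might die"},
    ),
    FrameAnnotation(
        frame="Likelihood",
        target="might",
        elements={"thing that might happen": "I die"},
    ),
    FrameAnnotation(
        frame="Death",
        target="die",
        elements={"person who may die": "I"},
    ),
]


def frames_evoked_by(word: str, annots: List[FrameAnnotation]) -> List[str]:
    """Return the names of the frames that `word` evokes in these annotations."""
    return [a.frame for a in annots if a.target == word]


if __name__ == "__main__":
    for a in annotations:
        print(f"{a.target!r} evokes the {a.frame} frame")
        for role, text in a.elements.items():
            print(f"    {role}: {text!r}")
```

Keeping each frame-evoking word in its own record reflects the point made above: the same stretch of text (here, I) can fill a different role in each frame.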
FrameNet annotators strive to document "the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences". These fully annotated examples are displayed automatically and are being used in a variety of artificial intelligence and Natural Language Processing (NLP) applications. For NLP tasks that require semantic information, FrameNet's semantic mapping gives the computer a means to extract meaning from a string of words.
Currently, the FrameNet database contains over 10,000 lexical units (word senses), of which more than 6,100 are fully annotated. More than 825 semantic frames are represented and exemplified in over 140,000 sentences. The data is available through the FrameNet web site and is already being used by researchers around the world, including NLP researchers at ICSI.
Srini Narayanan, head of the AI Group, used FrameNet to aid in semantic information detection in the ongoing question-answering project known as AQUAINT, and a new effort by Adam Janin of the Speech Group and Michael Ellsworth of the AI Group will focus on paraphrasing, using FrameNet data to provide semantic information. Last year, Thomas Schmidt, then a visiting German postdoc, created a multilingual dictionary of soccer terms, called Kicktionary, using a FrameNet-style semantic analysis of each term. (See www.kicktionary.de for more information.)
A significant improvement to FrameNet is the development of tools to automate much of the annotation process. This is essential to enable the widespread use of FrameNet data in NLP research, as it will allow NLP researchers to quickly annotate the text they are using in their projects. FrameNet developers are working to create software that will annotate semantic frame information, and are collaborating with scientists working on practical applications for FrameNet data.
One such collaboration is with researchers led by Nancy Ide at Vassar, who are developing a large corpus of American English called the American National Corpus. The corpus includes a wide variety of language use, both speech and text, covering everything from sermons to sitcoms. The FrameNet team is working on a FrameNet-style analysis of part of this corpus, to provide semantic information for NLP research that uses the corpus. Another collaboration is with a team led by Christiane Fellbaum at Princeton University. Fellbaum's team developed WordNet, an online dictionary which provides less detailed information than FrameNet but for many more words. The NSF-funded collaboration between FrameNet and WordNet will explore theoretical issues involved in aligning the two resources.
Katrin Erk of the University of Texas at Austin, who has collaborated with the ICSI FrameNet project in the past, is working on automatic annotation of German and English. Erk worked previously on the development of SALSA, a German project which annotated German newspaper articles using English frames, and more recently collaborated with Sebastian Pado to develop the Shalmaneser system, which analyzes text both syntactically and semantically. The system uses existing syntactic parsers for the syntactic analysis. Then, using FrameNet data for training, it performs Word Sense Disambiguation and Semantic Role Labeling. The system currently works for English and German. For English, it has been trained on the FrameNet data. For German, it has been trained on the frame-semantic annotation of the SALSA project.
Another NSF-funded effort is the rapid development of a frame semantic lexicon. This project aims to provide an improved interface for people working on defining semantic frames, which should reduce the labor involved in creating and annotating frames. In a similar vein, a collaboration with researchers at Lawrence Livermore Lab is working on increasing the speed of programs for automatic frame recognition, using the inexpensive parallel processors that are commonly used in modern video gaming systems. This involves rewriting algorithms to run on parallel processors, but should improve the efficiency of automatic frame recognition software.
In recent years, FrameNet projects in several other languages have begun. ICSI regularly hosts visiting scientists working to create FrameNet databases in their native languages, which to date include Spanish, Japanese, and German.
Spanish FrameNet - Carlos Subirats
Perennial ICSI visitor Carlos Subirats is working with colleagues in Spain on the creation of a Spanish-language FrameNet. Spanish FrameNet uses its own software to process a 370-million-word Spanish corpus, and uses ICSI's FrameNet software to annotate the sentences extracted from the corpus. Because of differences between the languages, some frames in Spanish differ from their English counterparts. Subirats expects a Spanish FrameNet release in February or March of 2008, which will include over 700 annotated lexical units (over 600 have already been annotated) and will allow users to look at web reports of the data. Eventually all the data will be searchable online as well. Subirats is currently seeking new funding to replace a previous grant from Spain's Science and Technology department, and has two proposals currently in submission. An integral part of Subirats's work on Spanish FrameNet has resulted from collaborations with the English and Japanese FrameNet developers. Discussions of cross-linguistic frames, as well as of semantic differences between languages that affect the frames for each language, have proved very useful. Some motion verbs, in particular, differ between English and Spanish, requiring some new frames in Spanish FrameNet.
There is also interest in Brazil in starting a Portuguese FrameNet, and Subirats has been invited to visit scientists in Brazil to discuss his work and advise them on how best to begin the Brazilian project.
Japanese FrameNet - Kyoko Ohara and Hiroaki Sato
Kyoko Ohara and Hiroaki Sato are frequent visitors to ICSI who are currently working on Japanese FrameNet through a grant for joint research between Japan and the U.S. Sato has been involved with FrameNet since 1999, when he spent his sabbatical year working on English FrameNet at ICSI. Since then, he has developed software tools that provide an easy way to search and view FrameNet data. He has adapted these tools for Spanish FrameNet and now Japanese FrameNet, allowing direct comparisons between pairs of languages. In addition, he is developing a tool that allows users to compare FrameNet data in different languages.
Japanese FrameNet is based closely on English FrameNet. The project started in 2002, but because there was no freely available corpus of Japanese text, the Japanese FrameNet team had to collect corpus data before beginning annotation work. Every attempt has been made to utilize the frames developed for English FrameNet, but typological differences between English and Japanese sometimes create a need for slightly modified frame definitions. Differences in the way verbs are expressed also complicate the use of English frames for Japanese text. A notable difference is the omission of verb arguments in Japanese, which is not common in English. Some verb constructions are different, which can suggest a different frame in a Japanese translation of an English sentence, despite being semantically the same. An example is the sentence "He lay on the floor". In Japanese, the verb translates as fall + a resultative auxiliary, so the Japanese verb by itself suggests movement, while the English verb "lay" does not.
A new corpus, the Japanese National Corpus, is currently in development, and since Japanese FrameNet is collaborating with that project, Ohara expects to begin using this corpus for Japanese FrameNet soon. She is hopeful that the cross-linguistic tools Sato is developing will be useful for Japanese speakers learning English, as they provide a means to compare the way an idea is expressed in the two languages.
German FrameNet - Hans Boas
Hans Boas, our featured alum for this issue, is working on a German FrameNet. Although the German FrameNet project began several years ago, it is still in its beginning stages. Boas hired three students to set up the infrastructure of German FrameNet, and is currently seeking funding to continue the project. Boas plans to use data from SALSA in building the German FrameNet database. Because of the same kinds of inherent linguistic differences that have caused the need for adapted frames in Spanish and Japanese, the SALSA data will need to be supplemented by human annotators who can fill in missing frame data, both for incomplete frames and those frames whose definitions might need to be changed to fit the German language.
While German FrameNet data is being compiled, related projects are underway focusing on the German language. Birte Loenneker-Rodman, a German postdoctoral researcher at ICSI, is working to incorporate FrameNet data in a bilingual dictionary of German and Slovenian. Her research ultimately will be used to create multi-lingual FrameNet databases. The Shalmaneser system for text analysis and the automatic annotation work on German mentioned previously are additional FrameNet resources for the German language.
Expanding FrameNet cross-linguistically benefits not only NLP but also machine translation and second-language learning. The ICSI FrameNet team is encouraged by the success of these foreign-language efforts and hopes that FrameNet will eventually be expanded to cover all major languages.