Multilingual FrameNet: Merging FrameNets for Cross-linguistic Research
One of the greatest challenges for NLP is the increasing variety of languages on the internet. Part of the answer to this challenge can come from the FrameNet lexical database, which has been developed for English since 1997 at the International Computer Science Institute (ICSI) based on the principles of Frame Semantics (Fillmore 1977; Fillmore 1985). The lexicon is organized by semantic frames, with valence information derived from attested, manually annotated corpus examples (Fillmore & Baker 2010). English FrameNet data is already applied in tasks such as event tracking systems that incorporate automatic Frame Semantic role labeling, on text from domains ranging from national defense to finance. Funded projects have now created FrameNet-like resources for Spanish, German, Japanese, Swedish, Chinese, Portuguese, French, Italian and Gulf Arabic; partial resources exist for six more languages. The general conclusion from these projects is that roughly 70% of the lexical units in the target languages fit well into semantic frames originally defined for English; for the rest, the project teams have modified frames defined for English or defined new ones. However, there is no unified multilingual Frame Semantic lexical resource. ICSI researchers seek to create a single platform where users can access all the Frame Semantic lexicons and their annotations for cross-linguistic research and applications.
The project team at ICSI is collaborating with members of the CISE community to identify promising lines of research using multilingual FrameNet data and to define data formats and distribution channels that make the new resource maximally useful to the community. Since input from the creators of each FrameNet is crucial to getting the frame alignments right and relating argument structures across languages, the team is also collaborating with the PIs of the various FrameNets on the design of the combined lexicon and on any issues that arise while compiling and updating the database. Both the researchers using FrameNet data and those building the various FrameNets are scattered around the world, so most of this coordination is done by e-mail and teleconference. The project team will hold regular tutorials and workshops for CISE researchers and FrameNet creators at major computational linguistics conferences. They will also set up a framework for improved communication between CISE researchers and the FrameNet builders, to facilitate the building of new Frame Semantic resources.
Funded by NSF.