Featured Alum: Matthew Aylett
When Speech Group alum Matthew Aylett's company developed text-to-speech software that would allow Roger Ebert, the popular film critic who lost his voice to thyroid cancer in 2006, to speak in an approximation of his own voice again, he expected a little media attention.
He didn't expect to be giving interviews to CBS, the BBC, and the Reuters news service. That was the same week Ebert appeared on the cover of Esquire magazine and on the Oprah Winfrey Show, demonstrating his new voice for the first time.
Aylett received a bachelor's degree in artificial intelligence from the University of Sussex. After a five-year break, he returned to school to pursue his MSc and PhD at the University of Edinburgh. In 2000 he joined Rhetorical Systems, an Edinburgh company that produced engaging and colorful synthetic voices.
In 2004, after leaving Rhetorical Systems, he was sponsored for a visit to ICSI's Speech Group by the European Union through the Augmented Multi-Party Interaction (AMI) Project. "It was an important time," he says, "when I was asking what I was going to do next." ICSI, Aylett says, allowed him to return to an academic research setting while developing contacts with start-ups around the Bay Area. The environment at ICSI was "really conducive to creating stuff and getting stuff done."
He returned to Edinburgh in 2005 and helped found CereProc, which, like Rhetorical Systems, produces synthetic voices that retain local accents, have character, and are pleasant to interact with. While most companies focus on making synthetic voices intelligible, says Aylett, "we're interested in it sounding good."
CereProc produces voices using unit selection: the company records a voice actor reading a script, transcribes the recording, and segments it into individual phonemes, which are reassembled when a user types a sentence into a program, producing spoken text. In Ebert's case, the company used audio from Ebert's DVD commentary.
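At its core, unit selection is a search: for each target phoneme, pick the recorded unit that best matches the desired context while joining smoothly to its neighbor. The toy Python sketch below illustrates the idea with a greedy search over a single pitch feature; the data structures, cost functions, and names are illustrative assumptions, not CereProc's implementation, which uses far richer features and a full dynamic-programming search.

```python
# A minimal sketch of unit selection over a toy database of
# phoneme-labeled units. Illustrative only; real systems score
# many acoustic and linguistic features at once.
from dataclasses import dataclass

@dataclass
class Unit:
    phoneme: str     # label from the transcription
    pitch: float     # mean F0 of the recorded unit (Hz)
    audio_id: int    # index into the recorded waveform store

def target_cost(unit: Unit, wanted_pitch: float) -> float:
    """How well a candidate unit matches the desired prosody."""
    return abs(unit.pitch - wanted_pitch)

def join_cost(prev: Unit, unit: Unit) -> float:
    """How smoothly two units concatenate (pitch mismatch at the seam)."""
    return abs(prev.pitch - unit.pitch)

def select_units(phonemes, database, wanted_pitch=120.0):
    """Greedy selection: for each phoneme, pick the candidate that
    minimizes target cost plus join cost with the previous choice."""
    chosen = []
    for ph in phonemes:
        candidates = [u for u in database if u.phoneme == ph]
        if not candidates:
            raise ValueError(f"no recorded unit for phoneme {ph!r}")
        prev = chosen[-1] if chosen else None
        chosen.append(min(
            candidates,
            key=lambda u: target_cost(u, wanted_pitch)
                          + (join_cost(prev, u) if prev else 0.0),
        ))
    return chosen

# Toy database: two recordings of each phoneme in "hi" (/h/ /ai/).
db = [
    Unit("h", 118.0, 0), Unit("h", 135.0, 1),
    Unit("ai", 121.0, 2), Unit("ai", 140.0, 3),
]
print([u.audio_id for u in select_units(["h", "ai"], db)])  # [0, 2]
```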
Speech synthesis companies often record speakers reading in a monotone to make transcription easier and the synthetic voice more intelligible. But, Aylett says, such voices have limited use because they are not engaging. "We take more risks with what we're doing," says Aylett. CereProc looks for interesting accents, from Ireland, Scotland, Southern England, America, and the Black Country in central England. In addition, CereProc is developing voices in French, Spanish, Catalan, German, Mandarin, and Japanese.
CereProc is also able to reproduce emotion by asking speakers to read scripts in a calm or tense voice. In the future, Aylett hopes to develop technology that adds emotion to already recorded voices by altering the pitch and speed of the synthesized speech.
Recently, CereProc has begun to produce voices using the HTS system, which uses hidden Markov models to train software on a small amount of recorded sound, minutes instead of hours. With this system, errors in transcription are less of an issue since the system trains itself, and producing an intelligible voice is less expensive, which means more people may be able to use CereProc's services. The system isn't perfect, however: it produces voices with less character and softer accents than the unit selection system.
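The sketch below shows the statistical flavor of this approach, assuming invented acoustic feature frames and using hmmlearn's GaussianHMM as a stand-in for the context-dependent models a real HTS voice trains; the feature dimensions and data are made up for illustration, and a real system would convert the generated parameters to audio with a vocoder.

```python
# A minimal sketch of HMM-based synthesis on invented feature frames.
# Real HTS voices model spectrum, pitch, and duration with
# context-dependent HMMs; this toy shows only the train/generate cycle.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

# Stand-in for acoustic feature frames (e.g., 13-dim cepstra) extracted
# from a few minutes of speech, far less data than unit selection needs.
frames = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=(200, 13)),  # onset-like frames
    rng.normal(loc=3.0, scale=1.0, size=(200, 13)),  # steady-state frames
])

# Train a small HMM on the frames. Because training averages over all
# the data, occasional transcription errors wash out; that same
# averaging is why the resulting voice sounds smoother but less
# characterful than concatenated real recordings.
model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=25)
model.fit(frames)

# "Synthesis": sample a fresh sequence of feature frames from the model.
generated, _states = model.sample(100)
print(generated.shape)  # (100, 13)
```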
As technology becomes more pervasive and mobile, Aylett says, engaging synthetic voices will become more important. On mobile devices like the iPhone, spoken information can be easier to absorb than written text.
Aylett also hopes in the future to help others who have lost their ability to speak, as CereProc did with Ebert. "Our voice is profoundly part of who we are," he says.