Hasegawa-Johnson Decodes Speech for Technology Applications

Mark Hasegawa-Johnson has been able to integrate his two interests – computer engineering and communications – into research that is leading the way in decoding human speech for technology applications.  

As a 12-year-old in the early 1980s, Mark Hasegawa-Johnson and his professor father built a computer from a kit, about the only way most people could obtain a PC in those days.

“It was one of these kits you could buy and put the components together,” he said. “The only computers that were available to the home hobbyist were build-it-yourself kits. Because of that, I was interested in doing something with computers but wasn’t sure exactly what.”

That interest never waned as Hasegawa-Johnson earned undergraduate, Master’s and Ph.D. degrees at MIT in electrical and computer engineering. But along the way he also became interested in communications, and found a way to integrate the two disciplines.

“I spent an internship at Motorola my second year (of college) and learned that of the things that there were available to do in communications, what was most interesting to me was speech signal processing,” he said. “That’s pretty much what I’ve been working on since then.”

Hasegawa-Johnson leads the Statistical Speech Technology group at Beckman, which has a research mission of applying “higher-level knowledge from linguistics and psychology in order to specify the structure of machine learning models for automatic speech recognition.”

With speech recognition software and technologies now an everyday part of life, Hasegawa-Johnson said that fact makes explaining what he does both easier and more difficult.

“I have the blessing and the curse that if I say I do speech recognition, then everybody understands that right now, because most people have seen speech recognition products and talked to speech recognizers on the telephone,” he said. “Describing what I do beyond that is often difficult; often I will tell people that I work on speech and audio understanding, meaning that I want computers to be able to listen to everything that’s happening around them and understand everything that is happening around them.”

That means Hasegawa-Johnson has research interests both in decoding human speech in order to understand it better and in developing automatic speech recognition (ASR) and other technologies. He writes that his research works to “develop large vocabulary speech recognition algorithms using phoneme boundaries rather than phoneme segments as the fundamental phonological class.” His group has created mathematical models for linguistics applications such as a landmark-based speech recognizer and a model that uses the stress and rhythm of natural language (prosody) to disambiguate confusing sentences.

Hasegawa-Johnson, a member of Beckman's Artificial Intelligence group, has a half-dozen current research lines, divided into speech and non-speech categories. One project in the speech category seeks to improve the ability of people with cerebral palsy to communicate by first understanding how the disorder manifests itself in speech production and then by trying to develop ASR technology to address the challenges they face. He said that in normal speech, phonemes (the basic distinctive units of speech sound) are reliably distinct.

“The challenge in cerebral palsy is that if the muscles of your tongue are always tense, then you tend to have a hard time locating your tongue precisely and therefore you tend to have overlapping phonemes,” Hasegawa-Johnson said. “So we’re looking at several ways of solving that.”

The group is applying new algorithms designed specifically for people with cerebral palsy, work that could lead to ASR technology that adapts to their limitations.

“We’re thinking that if the adaptation algorithm knows in advance that this person has a greatly reduced vowel space and that maybe there are some consonants that this person can’t produce at all, and that they produce a reliable set of substitutions, then we can make these options available to the adaptation algorithm so that it knows what to look for in the speech of a person with moderate or severe CP,” he said. “The other thing that helps a lot is if we could bring in any kind of context information, if we know what they are going to be talking about, then we can use vocabulary and language models that can bring that information to bear.”

Hasegawa-Johnson hopes that someday the research leads to software that could be downloaded for free by people with cerebral palsy.

“This is one where a university could actually make a big difference,” he said. “If we can make something easy enough to use, where people can just download it and use it, then they would.”

Other speech research lines include studying multiple Arabic dialects, the dynamics of speech fluency, and prosody and landmarks in speech. The project involving Arabic dialects is investigating ways to improve accuracy in a language with dialects that vary to such a degree that creating ASR technology is a major challenge.

“Arabic is an extreme example of a language in which the different regional dialects are so distinctive that sometimes they aren’t mutually intelligible,” Hasegawa-Johnson said. “Sometimes they are often classed as different languages, and yet they all share a common writing system and they all share a common history.

“There are products right now that do speech detection translations and speech-to-speech translation for Arabic but they are limited to standardized Arabic, like broadcast news Arabic. So if we can get algorithms that do something useful for a dialect, then those algorithms could conceivably be folded into existing products or become new products.”

The prevalence of new speech recognition products is not necessarily a good thing, Hasegawa-Johnson said, when it comes to the accuracy of the products on the market. Two research lines in which he is working to improve accuracy concern the dynamics of speech fluency and the prosody and landmarks of speech.

Hasegawa-Johnson is part of a National Science Foundation-funded collaboration with Beckman colleagues Chilin Shih and Kay Bock involving the dynamics of fluency when it comes to second language learners’ speech. His group does the signal processing work in the project.

“The narrow definition of what we’re trying to do is develop automatic signal measurements that tell you how fluent this person is in their second language,” Hasegawa-Johnson said. “We have had a couple of papers basically showing, number one, that signal measures that don’t know anything about the speech content predict fluency just as well as knowing the speech content and, number two, that that’s true for 60 and 90 second speech snippets, and we get no significant loss of accuracy when we go down to 15 or 20 second snippets of speech.”

The results of their work could lead to software that helps speakers sound more natural in their second language.

“Our eventual goal is to give people pointers about how to sound more fluent, but we’re not at that point yet,” Hasegawa-Johnson said.

Hasegawa-Johnson is also collaborating with Beckman colleague Jennifer Cole on research into the prosody and landmark components of speech, work that has implications for other languages and for speech recognition applications.

“Prosody is the rhythm and intonation of a language and landmarks are the instant in time that kind of wake up your brain – the consonant release and consonant closures that sound like explosions,” Hasegawa-Johnson said. “The brain seems to synchronize itself to syllable centers, especially stress syllables. Since there are neurons that seem to pay close attention to these things, we’re trying to make the speech recognizer pay attention to these things too, with the idea that if it can get these things right, then the overall speech transcription will also be higher.

“Prosody seems to be a universal phenomenon,” he added. “It’s implemented in different ways but every language has some kind of rhythmic grouping of words and some words are stressed and some words are unstressed.

“I think we’ve come to believe that that’s where the prosody is coming from. It’s a natural part of speech planning. You plan your speech in clumps of words, and those groups of words turn into phrases.”

The technologies advanced by this research include language learning and closed captioning software, as well as the familiar automatic speech recognizers people interact with when they call almost any large business these days.

One project Hasegawa-Johnson is involved in for the Department of Homeland Security seeks to develop audio visualization technology to aid the work of intelligence analysts by allowing them to “visualize sound.” It works by taking huge databases of audio information and turning them into something visual, such as a spectrogram on a computer screen, which analysts can process more easily than by listening to hours of audio.

“The idea is to let the computer do what computers are good at and have the humans do what the humans are good at,” Hasegawa-Johnson said. “Humans are good at inference, big picture, and anomaly detection. Computers are really good at processing hundreds of hours of data all at once and then compressing it into some format, into some image.”
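
As a rough illustration of what turning audio into an image can mean, the minimal Python sketch below converts a signal into a spectrogram, a grid of energy by time and frequency that could be drawn on a screen. It is not the group's software: the function name `spectrogram`, the frame length, the hop size and the synthetic 440 Hz test tone are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Return a (time frames x frequency bins) magnitude spectrogram in decibels.

    Illustrative helper only; frame_len and hop are arbitrary example values.
    """
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # one spectrum per frame
    return 20.0 * np.log10(magnitude + 1e-10)        # small offset avoids log(0)

# Made-up test data: one second of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

image = spectrogram(tone)

# The brightest column of the image should sit near the tone's frequency.
peak_bin = int(np.argmax(image.mean(axis=0)))
peak_hz = peak_bin * sample_rate / 256
print(image.shape, peak_hz)
```

Each row of `image` is a short slice of time and each column a frequency band, so a steady tone appears as a single bright vertical stripe, the kind of pattern a human eye can pick out at a glance.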

Another non-speech project involves “opportunistic sensing,” which includes advancing artificial intelligence or robotics applications.

“The overall problem again is you want a robot or computer to understand everything that’s going on around it,” Hasegawa-Johnson said. “Opportunistic means that it adjusts its sensors as much as necessary in order to get a better understanding of what’s going on around it. We’re looking at recognition algorithms where part of the recognition algorithm is tweaking your sensor in order to get a better recognition. It’s almost robotics. We’re doing very similar things to what some of the robotics people are doing.”

While all of his projects are interesting, Hasegawa-Johnson said his most important work involves his students. He has been named numerous times to the list of teachers rated as excellent by their students and has been voted best graduate adviser in his home Department of Electrical and Computer Engineering. Along with Beckman colleague Tom Huang, Hasegawa-Johnson has also led several groups of students in team multimedia search competitions, including the international Star Challenge Multimedia Information Retrieval Competition in 2008, where they reached the Grand Finals in Singapore. He said the competitions are part of his overall teaching philosophy.

“I want to teach students how to build strong careers for themselves,” Hasegawa-Johnson said. “I take mentoring my graduate students very seriously and make sure that they build a deep understanding of their own research and research in general.”