Preethi Jyothi: Automatic Speech Recognition
Conversational speech presents a major challenge for automatic speech recognition (ASR) software because the same word can be pronounced in many different ways in casual speech.
“During conversation, words are more fluid, allowing for sounds to smoothly transition into one another, giving rise to a large variety in pronunciations,” Jyothi said. “The large extent of pronunciation variability makes using ASR tremendously challenging.” Most ASR systems use phonemes to represent speech as a single stream of discrete sub-word units. For example, the word “five” would be broken up into three phonemes: “f”–“ay”–“v.” Most languages have 20 to 60 phonemes.
In a dictionary, each word is associated with a small number of canonical pronunciations. But one of the problems with using this representation for ASR is that there is more than one way in which a word is pronounced during conversation. For instance, the word “everybody” could be pronounced as “eh”–“v”–“r”–“iy”–“b”–“ah”–“d”–“iy”; “eh”–“v”–“er”–“b”–“ah”–“d”–“iy”; “eh”–“u”–“b”–“a”–“iy”; or “eh”–“b”–“ah”–“iy.”
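As a rough illustration of the dictionary representation described above, the following Python sketch stores each word with a list of phoneme sequences and looks words up by exact pronunciation. The names `LEXICON`, `words_matching`, and `edit_distance` are hypothetical and chosen only for this example; the phoneme labels are copied from the article, and nothing here is meant to reflect how Jyothi's models or any production ASR system actually work.

```python
# Toy pronunciation lexicon (hypothetical). Each word maps to a list of
# pronunciations, and each pronunciation is a list of phoneme labels
# taken from the examples in the article.
LEXICON = {
    "five": [["f", "ay", "v"]],
    "everybody": [
        ["eh", "v", "r", "iy", "b", "ah", "d", "iy"],  # careful pronunciation
        ["eh", "v", "er", "b", "ah", "d", "iy"],
        ["eh", "u", "b", "a", "iy"],
        ["eh", "b", "ah", "iy"],                       # heavily reduced
    ],
}


def words_matching(phonemes, lexicon=LEXICON):
    """Return every word that lists this exact phoneme sequence."""
    target = list(phonemes)
    return [word for word, prons in lexicon.items() if target in prons]


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

An exact lookup like `words_matching` only succeeds for variants someone has already listed, which is why enumerating pronunciations by hand scales poorly. Comparing the first and last variants of “everybody” with `edit_distance` gives a sense of how far a casual pronunciation can drift from the careful one.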
It is no easy task to enumerate all possible pronunciations of a word. One way Jyothi is working to address this variability is by modeling how variations in pronunciation arise during speech production. Her models are inspired by linguistic theories that study how the various articulators of our vocal tract, such as the lips, tongue, and vocal cords, move together to create sounds.
“The challenge here,” Jyothi said, “is to find the right way to incorporate these linguistic insights into computational models that are built from large amounts of speech data. In my doctoral research, I was able to build models that improved on earlier works along this line, but I believe we can go much further.”
Jyothi also wants to use articulation-based models for ASR in a variety of languages.
“Conventional recognition systems require large amounts of annotated data for training,” Jyothi said. “As a consequence, only a small fraction of the world’s 7,000 languages have supporting speech recognition systems. I am keen on exploring how to bring speech recognition technologies to languages with low resources, possibly using data from languages like English.”
According to Jyothi, articulation-based models are language-independent, so they could be transferred between many different languages.
Speech recognition requires the combination of a variety of expertise, so Jyothi feels fortunate to have numerous resources available to her at the Beckman Institute.
“Speech recognition is an interdisciplinary area that lies at the intersection of several larger fields like signal processing, statistical modeling, machine learning, and linguistics,” Jyothi said. “I decided to come to Beckman for research because it provides an ideal environment to work on such an interdisciplinary field with access to experts in all these areas. The Beckman fellowship is unique in that it allows us to choose any project of our liking and gives us all the freedom and flexibility to fulfill our goals. This is probably rare unless you are a faculty member. I feel very privileged to be given this opportunity.”
Jyothi is working with several faculty members at Beckman, including Mark Hasegawa-Johnson and Jennifer Cole.
“Hasegawa-Johnson is an expert in developing coherent, mathematical models inspired by linguistic theories to aid automatic speech recognition,” said Jyothi. “Cole is a well-known authority in both experimental and computational phonology. They also have experience with working on a variety of languages. Collaborating with both of them is extremely beneficial to me in my main goal of developing coherent models of speech recognition for low-resource languages, along with developing a solid understanding of the linguistic underpinnings of these models. I also look forward to collaborating with Paris Smaragdis, who looks beyond just speech and broadly works on machine learning approaches for computational audition and signal processing.”
Jyothi hopes to leverage Beckman’s supportive interdisciplinary environment and gain a more well-rounded, big-picture view of the science surrounding her research interests.
“It’s easy in speech recognition research to focus on specific parts of a research area. Because of the resources and expertise here, I can broaden the ways in which I look at a problem,” Jyothi said.