Making Machines That Can Truly Listen

Beckman Institute researcher Paris Smaragdis has a musician’s heart, an engineer’s curiosity, and a computer scientist’s ability to apply both those qualities toward solving one of the biggest problems in machine learning: creating computers that can replicate the human ability to listen.

“At the very top level, what I try to do is make computers understand sound,” Smaragdis said of his research. “I sort of joke that if one day we make this computer that will do everything, I will be the guy who designs its ears.”

Countless scientists and engineers have learned over the past few decades just how difficult it is to get a computer to fully duplicate the capabilities of a human, including the ability to listen and understand. That has been especially problematic when it comes to what are called sound mixtures, such as those found in noisy environments, where distinguishing one voice or sound from another has proven an extremely challenging problem for software designers.

Smaragdis takes a different approach than most engineers working on acoustic signal processing problems. It is one that combines engineering and computer science, statistics and probability, and a longtime love of electronic music. Smaragdis said he would be thrilled if future machines could simply match basic listening abilities.

“I’ll retire a happy man if I can make a machine that can learn to hear as well as a cat can,” he said. “You know, they can’t transcribe speech or music, but they are doing enough to survive and they are still a long way from what we have with computers.”

Smaragdis is a faculty member in Beckman’s Artificial Intelligence group, and in the University of Illinois departments of Electrical and Computer Engineering and Computer Science. Those varied disciplines, along with his research interests in signal processing, machine learning, and statistics as they relate to artificial perception, are just one indication of the unique approach Smaragdis takes. The same holds for how he views his two home departments.

“In my mind, there’s very little difference between those fields,” Smaragdis said. “If you study them formally, if you come from two completely different places, eventually, by the time you get your Ph.D., you see that they are essentially the same thing, but they’re really treated as being two different disciplines.

“So, part of what I’m trying to do is to bring those two fields together and expose a lot of the similarities. And in my case, I am doing it in the audio world, because that’s where I like doing it.”

Smaragdis’ diverse background is also non-traditional for an engineer/computer scientist: a degree from one of the most prestigious music schools in America (Berklee College of Music), and a Ph.D. from MIT in perceptual computing. So when it comes to engineering answers to problems involving deciphering sounds in noisy environments, Smaragdis comes at the problem from a different perspective than most engineers.

“I think part of it is that I didn’t have the proper engineering education. The standard way I’m supposed to solve a problem is not something that’s been drilled into me,” he said. “The other thing is that I do have this very interesting intuition between mathematics and sound. If I see an equation, if I see an algorithm, I know how it sounds and what it means. Likewise, if I want to make my system sound like something, I know what equation I have to go to, to make it happen.

“So, I’m not solving things with pen and paper, I am actually using my education of being a good listener or having a trained ear to inform my systems on how I want them to work. And for a lot of the stuff that I do, I see that most of the traditional methods, although they make sense on paper, the proper thing to do mathematically, they don’t always correlate to something that I’m going for.”

For Smaragdis, that different approach means not using or trying to advance standard signal processing methods that separate out acoustic signals, as is done with applications such as automatic speech recognizers. Instead, he uses statistics and probability to recognize the desired sounds within mixtures. His computational model describes the target sound signal – such as a particular voice – in a sound mixture probabilistically, eliminating the need to separate out different signals, or “clean up” the mixture, first.

Smaragdis said data science has shown that it is extremely difficult to build mathematical models of sounds and use them to separate a voice, for example, from other voices or ambient noise in order to create a clean mix of sounds, as is done in automatic speech recognizers.

“The problem with that is that I would have to make a different model for every person, so it becomes really complicated,” he said. “What we discovered is that what we can do instead is give our system a whole bunch of recordings, and tell it, ‘here are recordings of people speaking, of birds chirping, of cars revving, of music, of whatever…’ and then every time we observe a mixture, we say to our system, ‘try to find bits and pieces from the clean sound that you can combine to approximate what you are observing right now.’

“I put them all together to approximate the mixture, but then since I know how I put things together, I can say, ‘now give me that sound with everything but the background noise, or give me only the speech component of it.’”
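
The idea Smaragdis describes, approximating a mixture from pieces of known clean sounds and then keeping only the pieces of interest, can be illustrated with a small numerical sketch. The Python below is not his implementation. It assumes a non-negative dictionary model fitted with multiplicative updates, which is one common way to realize this kind of approach. The function names (bump_dictionary, fit_activations, separate) and the synthetic Gaussian-bump "spectra" are hypothetical stand-ins for patterns that would in practice be learned from clean recordings.

```python
import numpy as np


def bump_dictionary(centers, n_bins, width=1.5):
    """Hypothetical stand-in for spectra learned from clean recordings.

    Returns an (n_bins, len(centers)) matrix whose columns ("atoms") are
    Gaussian bumps centered on the given frequency bins.
    """
    bins = np.arange(n_bins)[:, None]
    centers = np.asarray(centers)[None, :]
    return np.exp(-((bins - centers) ** 2) / (2 * width ** 2))


def fit_activations(mixture, dictionary, n_iter=300, eps=1e-9):
    """Find non-negative weights H so that dictionary @ H approximates mixture.

    The dictionary stays fixed; only the weights are updated, using
    multiplicative updates for the squared-error objective.
    """
    H = np.ones((dictionary.shape[1], mixture.shape[1]))
    WtV = dictionary.T @ mixture
    WtW = dictionary.T @ dictionary
    for _ in range(n_iter):
        H *= WtV / (WtW @ H + eps)
    return H


def separate(mixture, dict_a, dict_b, n_iter=300, eps=1e-9):
    """Explain the mixture with both dictionaries, then split it by source."""
    W = np.hstack([dict_a, dict_b])
    H = fit_activations(mixture, W, n_iter, eps)
    k = dict_a.shape[1]
    part_a = dict_a @ H[:k]   # portion explained by the first dictionary
    part_b = dict_b @ H[k:]   # portion explained by the second dictionary
    total = part_a + part_b + eps
    # Share out the observed mixture in proportion to each source's model.
    return mixture * part_a / total, mixture * part_b / total, H


# Synthetic demo: "speech" and "noise" occupy different frequency regions.
rng = np.random.default_rng(0)
n_bins, n_frames = 20, 50
W_speech = bump_dictionary([2, 5, 8], n_bins)
W_noise = bump_dictionary([11, 14, 17], n_bins)
true_speech = W_speech @ rng.uniform(0.2, 1.0, (3, n_frames))
true_noise = W_noise @ rng.uniform(0.2, 1.0, (3, n_frames))
mixture = true_speech + true_noise

speech_est, noise_est, H = separate(mixture, W_speech, W_noise)
rel_err = np.linalg.norm(speech_est - true_speech) / np.linalg.norm(true_speech)
print("relative error of speech estimate:", rel_err)
```

Because the system records which dictionary each piece came from, asking for "only the speech component" amounts to keeping the share of the mixture attributed to the speech atoms. Real recordings would not match the dictionaries exactly, so the separation would be approximate rather than near-perfect as in this synthetic case.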

Smaragdis applies those concepts to his current research, including one project that used multiple microphones for surveillance at intersections in order to learn more about how accidents happen – without using cameras.

“What we wanted to figure out is whether something noteworthy happened,” Smaragdis said. “So it was a big win because we could very likely detect a lot of those events. It was very cheap to implement, we could reliably do it, and everybody benefited out of it. And it was a lot simpler than doing computer vision.”

Another project Smaragdis is working on would allow people to combine multiple recordings of, for example, a concert, into one high-quality recording.

“There’s probably a thousand people who are going to be recording parts of the concert on and off at different times,” Smaragdis said. “What I want to do is be able to see all those recordings, figure out how can I put them together and make a reconstruction that is a higher quality recording out of them. How can I intelligently combine all those flawed recordings that everybody makes and make something that sounds good.”

Smaragdis said the musical component of his work also makes the research more fun, among other benefits.

“Of course, it makes it more enjoyable, but it makes a lot of the math a lot more intuitive, because now I don’t have to talk about abstract equations, and I know exactly how they relate and what it sounds like,” he said. “And it just makes it a fun place to play with. It’s also great for recruiting students.”

Smaragdis’ interest in electronic music dates back to his youth in his native Greece, when his father bought him a synthesizer.

“I liked music to begin with so I was pretty good at it when I was young,” he said. “I guess one day my dad made the fateful error of buying me a small synthesizer and I was just so enamored by it that I just had to figure out how all these things worked.”

Even with that love of music, there was in Smaragdis an engineer’s drive to understand how the synthesizer worked.

“It wasn’t really so much about the music, the premise was there, but it was more about interesting ideas of how do we think about music, and how to put machines in the loop, and how to automate certain things that you’d like to do,” he said. “I do have that sort of a music background, and that’s a bit of a driving force.

“But, ultimately, I find the world of sound fascinating; the amount of things we do every day by using our ears is amazing and we don’t really understand it. You can’t cross the street without listening, but you always focus about how ‘oh, I have to look left and right’, everything else is subconscious.”

That fascination may lead to applications that doctors or accident investigators use, but in the end, Smaragdis’ research is driven largely by one overarching motivation.

“I basically want to take all of those tasks that we do with our ears and figure out, how can I make a machine do the same thing,” he said. “How can I make it be as intelligent when it comes to sound as we are? Whether that involves that machine being able to diagnose respiratory diseases, or listen to music, or understand speech, I don’t care, as long as it solves its problems.”