Cornell researchers have created a new kind of wearable device that can read your lips, even if you’re not speaking aloud — and even though it doesn’t have a camera.
Equipped with sonar and AI, this pair of fairly normal-looking glasses can recognize lip and mouth movements to execute up to 31 commands, without the user needing to make a peep.
“We’re very excited about this system because it really pushes the field forward on performance and privacy,” Cheng Zhang, an assistant professor of information science at Cornell, said.
“It’s small, low-power and privacy-sensitive, which are all important features for deploying new, wearable technologies in the real world.”
Straight out of SciFi: Developed at Cornell’s “Smart Computer Interfaces for Future Interactions” (or SciFi) Lab, the glasses, called EchoSpeech, let the wearer control a smartphone and operate software by silently mouthing commands.
Rather than cameras, with all their size, power, and privacy problems, the glasses use minuscule speakers to bathe the face in sonar. Microphones pick up the reflected signal, which is then fed into a SciFi-designed deep learning algorithm that recognizes the mouth movements.
“We noticed that facial movements, especially lip movements, are highly informative for silent speech recognition,” Ruidong Zhang, information science doctoral student and lead author of the EchoSpeech paper, said in a YouTube video.
Two speakers and two microphones are attached to the bottom of either side of the glasses frame. The silent sonar waves bounce off the lips in various directions to the microphones, which pick up various changes in shape for the AI to evaluate.
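For readers curious about the underlying idea, here is a minimal sketch of how a reflected sound signal can be located in a microphone recording. It is not EchoSpeech's actual pipeline, which feeds echo data to a deep learning model. It only illustrates a standard active-sonar step: send a known probe signal, then cross-correlate the recording with that probe to find how long the echo took to return. Every name and number in it (the function names, the 48 kHz sample rate, the 18 to 21 kHz sweep, the 37-sample delay) is an assumption chosen for illustration, not a detail from the Cornell paper.

```python
import math


def make_chirp(n_samples, f_start, f_end, sample_rate):
    """Build a linear frequency sweep, a common probe signal in active sonar."""
    duration = n_samples / sample_rate
    sweep_rate = (f_end - f_start) / duration  # Hz per second
    chirp = []
    for i in range(n_samples):
        t = i / sample_rate
        phase = 2 * math.pi * (f_start * t + 0.5 * sweep_rate * t * t)
        chirp.append(math.sin(phase))
    return chirp


def simulate_echo(probe, delay_samples, attenuation, total_len):
    """Fake a microphone recording: the probe, delayed and weakened (hypothetical)."""
    received = [0.0] * total_len
    for i, sample in enumerate(probe):
        j = i + delay_samples
        if j < total_len:
            received[j] += attenuation * sample
    return received


def echo_profile(probe, received):
    """Cross-correlate the recording with the probe.

    The value at lag d measures how well the recording, starting at
    sample d, matches the transmitted probe.
    """
    n_lags = len(received) - len(probe) + 1
    return [
        sum(p * received[lag + i] for i, p in enumerate(probe))
        for lag in range(n_lags)
    ]


def strongest_echo_lag(profile):
    """Return the lag (in samples) with the largest correlation magnitude."""
    return max(range(len(profile)), key=lambda d: abs(profile[d]))


if __name__ == "__main__":
    SAMPLE_RATE = 48_000  # assumed audio sample rate, in Hz
    probe = make_chirp(256, 18_000.0, 21_000.0, SAMPLE_RATE)
    received = simulate_echo(probe, delay_samples=37, attenuation=0.3, total_len=512)
    lag = strongest_echo_lag(echo_profile(probe, received))
    print("estimated echo delay in samples:", lag)
```

In this toy setup the correlation peak lands on the simulated delay. On a real face, many reflections overlap, and the way the whole profile shifts as the lips move is the kind of pattern a learning algorithm could be trained to classify.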
According to the researchers, their algorithm was able to recognize these sonar echo patterns with 95% accuracy.
Users need to train EchoSpeech before it can work, but the glasses can pick up commands within minutes. In the YouTube video, EchoSpeech learned eight commands for a music player with less than two minutes of training; with less than five minutes of training, the glasses could recognize random strings of numbers spoken without pauses.
Ditching the camera: For the Cornell team, relying on cameras for silent speech recognition poses a number of problems. Aside from the impracticality of constantly wearing one, cameras open up a whole host of privacy concerns both for their users and the people around them.
EchoSpeech avoids potentially filming everyone around you, and its sonar data is also considerably smaller than image and video data. That allows it to be processed and sent directly to a smartphone via Bluetooth, in real time, co-author and professor of information science François Guimbretière said.
“And because the data is processed locally on your smartphone instead of uploaded to the cloud, privacy-sensitive information never leaves your control.”
The sonar tech is also easier on the batteries than a camera, working for up to ten hours.
Looking ahead: The team is currently looking at how to commercialize EchoSpeech’s sonar recognition tech, and sees future use cases that include helping people who have difficulty vocalizing.
“For people who cannot vocalize sound, this silent speech technology could be an excellent input for a voice synthesizer,” Ruidong Zhang said. “It could give patients their voices back.”