Extracting audio from visual information

Algorithm recovers speech from the vibrations of a potato-chip bag filmed through soundproof glass.


Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag photographed from 15 feet away through soundproof glass.

In other experiments, they extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and even the leaves of a potted plant. The researchers will present their findings in a paper at this year’s Siggraph, the premier computer graphics conference.

“When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science at MIT and first author on the new paper. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”

Joining Davis on the Siggraph paper are Frédo Durand and Bill Freeman, both MIT professors of computer science and engineering; Neal Wadhwa, a graduate student in Freeman’s group; Michael Rubinstein of Microsoft Research, who did his PhD with Freeman; and Gautham Mysore of Adobe Research.

Reconstructing audio from video requires that the frequency of the video samples — the number of frames of video captured per second — be higher than the frequency of the audio signal. In some of their experiments, the researchers used a high-speed camera that captured 2,000 to 6,000 frames per second. That’s much faster than the 60 frames per second possible with some smartphones, but well below the frame rates of the best commercial high-speed cameras, which can top 100,000 frames per second.
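The frame-rate requirement above can be illustrated with a small sketch. It assumes an idealized camera that takes one instantaneous sample per frame, and it uses the standard sampling-theorem criterion (a rate of more than twice the tone's frequency), which is stricter than the "higher than" wording above. The function names are hypothetical and are not taken from the researchers' work.

```python
def apparent_frequency(signal_hz: float, frame_rate: float) -> float:
    """Frequency (Hz) at which a pure tone appears after being sampled
    once per frame. Tones above half the frame rate fold back (alias)."""
    folded = signal_hz % frame_rate
    return min(folded, frame_rate - folded)


def is_captured_faithfully(signal_hz: float, frame_rate: float) -> bool:
    """True when the tone lies below half the frame rate (Nyquist limit)."""
    return signal_hz < frame_rate / 2


# A 440 Hz tone filmed at 2,000 frames per second keeps its pitch;
# filmed at 60 frames per second it aliases to a much lower frequency.
for fps in (2000, 60):
    print(fps, apparent_frequency(440, fps), is_captured_faithfully(440, fps))
```

Under these assumptions, the high-speed rates quoted above cover much of the range of human speech, while a plain 60-frames-per-second reading does not.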

Commodity hardware

In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as that with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.

The researchers’ technique has obvious applications in law enforcement and forensics, but Davis is more enthusiastic about the possibility of what he describes as a “new kind of imaging.”

“We’re recovering sounds from objects,” he says. “That gives us a lot of information about the sound that’s going on around the object, but it also gives us a lot of information about the object itself, because different objects are going to respond to sound in different ways.” In ongoing work, the researchers have begun trying to determine material and structural properties of objects from their visible response to short bursts of sound.

Watch how MIT researchers extract audio from the vibrations of a plant, potato-chip bag, and other objects.

Video courtesy of the researchers

In the experiments reported in the Siggraph paper, the researchers also measured the mechanical properties of the objects they were filming and determined that the motions they were measuring were about a tenth of a micrometer. That corresponds to five thousandths of a pixel in a close-up image, but from the change of a single pixel’s color value over time, it’s possible to infer motions smaller than a pixel.

Suppose, for instance, that an image has a clear boundary between two regions: Everything on one side of the boundary is blue; everything on the other is red. But at the boundary itself, the camera’s sensor receives both red and blue light, so it averages them out to produce purple. If, over successive frames of video, the blue region encroaches into the red region — even less than the width of a pixel — the purple will grow slightly bluer. That color shift contains information about the degree of encroachment.
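The red-and-blue example can be turned into a minimal sketch. It assumes the boundary pixel is an exact linear mix of the two pure colors, with no sensor noise; the helper names are hypothetical. The scale of 20 micrometers per pixel is not stated in the article; it is derived here from the figures quoted earlier (0.1 micrometer corresponding to 0.005 pixel).

```python
BLUE = (0.0, 0.0, 1.0)  # RGB of the pure blue region
RED = (1.0, 0.0, 0.0)   # RGB of the pure red region
MICROMETERS_PER_PIXEL = 0.1 / 0.005  # derived from the article's figures


def mix(fraction_blue: float) -> tuple:
    """Color of a boundary pixel covered by the given fraction of blue."""
    return tuple(fraction_blue * b + (1 - fraction_blue) * r
                 for b, r in zip(BLUE, RED))


def blue_fraction(pixel: tuple) -> float:
    """Recover the blue coverage by projecting the pixel color onto the
    red-to-blue axis (assumes a purely linear mix)."""
    num = sum((p - r) * (b - r) for p, b, r in zip(pixel, BLUE, RED))
    den = sum((b - r) ** 2 for b, r in zip(BLUE, RED))
    return num / den


def subpixel_shift(pixel_before: tuple, pixel_after: tuple) -> float:
    """Edge displacement, in pixels, between two frames."""
    return blue_fraction(pixel_after) - blue_fraction(pixel_before)


# The purple grows slightly bluer: blue coverage goes from 50% to 50.5%.
shift_px = subpixel_shift(mix(0.5), mix(0.505))
print(shift_px, shift_px * MICROMETERS_PER_PIXEL)
```

In real footage the color change would be buried in noise, which is one reason for pooling measurements from many pixels instead of relying on a single one.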

Putting it together

Some boundaries in an image are fuzzier than a single pixel in width, however. So the researchers borrowed a technique from earlier work on algorithms that amplify minuscule variations in video, making visible previously undetectable motions: the breathing of an infant in the neonatal ward of a hospital, or the pulse in a subject’s wrist.

That technique passes successive frames of video through a battery of image filters, which are used to measure fluctuations, such as the changing color values at boundaries, at several different orientations — say, horizontal, vertical, and diagonal — and several different scales.
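As a rough illustration of measuring at several orientations and scales, the sketch below uses plain finite differences as a stand-in for the image filters. This is an assumption made for brevity, not the filter bank the researchers used; the orientation names, scale values, and function names are all hypothetical.

```python
# Direction of each difference as (row step, column step).
ORIENTATIONS = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}
SCALES = (1, 2, 4)  # pixel distances standing in for filter scales


def oriented_difference(frame, row, col, orientation, scale):
    """Brightness change across (row, col) along one direction, measured
    between the two pixels `scale` steps away on either side."""
    d_row, d_col = ORIENTATIONS[orientation]
    ahead = frame[row + d_row * scale][col + d_col * scale]
    behind = frame[row - d_row * scale][col - d_col * scale]
    return ahead - behind


def filter_bank_response(frame, row, col):
    """Responses at every orientation and scale for one pixel, one frame."""
    return {(name, scale): oriented_difference(frame, row, col, name, scale)
            for name in ORIENTATIONS for scale in SCALES}


# A 9x9 grayscale frame with a sharp edge between columns 4 and 5.
frame = [[1.0 if col >= 5 else 0.0 for col in range(9)] for _ in range(9)]
print(filter_bank_response(frame, 4, 4))
```

Running the same measurement on successive frames gives one time series per orientation and scale, and the fluctuations in those series are the raw material for the motion estimate.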

The researchers developed an algorithm that combines the output of the filters to infer the motions of an object as a whole when it’s struck by sound waves. Different edges of the object may be moving in different directions, so the algorithm first aligns all the measurements so that they won’t cancel each other out. And it gives greater weight to measurements made at very distinct edges — clear boundaries between different color values.
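The two ideas in the paragraph above, aligning measurements so they don't cancel and weighting distinct edges more heavily, can be sketched as follows. The article does not describe how the alignment is done, so flipping the sign of a signal that is anti-correlated with a reference is an assumption, as are the function names and the use of a simple weighted average.

```python
def align_sign(reference, signal):
    """Flip a signal that moves opposite to the reference, so that
    averaging the two reinforces the motion instead of cancelling it."""
    correlation = sum(r * s for r, s in zip(reference, signal))
    return list(signal) if correlation >= 0 else [-s for s in signal]


def combine_motion_signals(signals, edge_strengths):
    """Weighted average of per-edge motion signals over time.
    `signals` holds one time series per measured edge; `edge_strengths`
    holds one positive weight per edge (a sharper edge gets a larger weight)."""
    reference = signals[0]
    aligned = [align_sign(reference, s) for s in signals]
    total_weight = sum(edge_strengths)
    return [sum(w * s[t] for w, s in zip(edge_strengths, aligned)) / total_weight
            for t in range(len(reference))]


# Two edges moving in opposite directions; the second edge is sharper.
signals = [[1.0, -1.0, 1.0, -1.0], [-2.0, 2.0, -2.0, 2.0]]
print(combine_motion_signals(signals, edge_strengths=[1.0, 3.0]))
```

Without the alignment step, these two example signals would largely cancel when averaged.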

The researchers also produced a variation on the algorithm for analyzing conventional video. The sensor of a digital camera consists of an array of photodetectors — millions of them, even in commodity devices. As it turns out, it’s less expensive to design the sensor hardware so that it reads off the measurements of one row of photodetectors at a time. Ordinarily, that’s not a problem, but with fast-moving objects, it can lead to odd visual artifacts. An object — say, the rotor of a helicopter — may actually move detectably between the reading of one row and the reading of the next.

For Davis and his colleagues, this bug is a feature. Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, contain information about the objects’ high-frequency vibration. And that information is enough to yield a murky but potentially useful audio signal.
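A back-of-the-envelope sketch shows why row-by-row readout helps. It assumes the rows are read at evenly spaced times and, by default, that the readout fills the whole frame interval. The `readout_fraction` parameter and the function names are hypothetical, and real sensors typically pause between frames, which leaves gaps in the signal.

```python
def row_sample_times(frame_index, rows, frame_rate, readout_fraction=1.0):
    """Time (seconds) at which each row of one frame is read.
    `readout_fraction` is the share of the frame interval spent reading rows."""
    frame_start = frame_index / frame_rate
    row_delay = readout_fraction / (frame_rate * rows)
    return [frame_start + row * row_delay for row in range(rows)]


def effective_sample_rate(rows, frame_rate, readout_fraction=1.0):
    """Row readings per second while the sensor is being read out."""
    return rows * frame_rate / readout_fraction


# A 1080-row sensor at 60 frames per second reads rows far more often
# than it completes whole frames.
print(effective_sample_rate(rows=1080, frame_rate=60))
print(row_sample_times(frame_index=0, rows=4, frame_rate=60))
```

Each row records the scene at a slightly different instant, so edge distortions from row to row carry timing information finer than the frame rate alone would allow.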

“This is new and refreshing. It’s the kind of stuff that no other group would do right now,” says Alexei Efros, an associate professor of electrical engineering and computer science at the University of California at Berkeley. “We’re scientists, and sometimes we watch these movies, like James Bond, and we think, ‘This is Hollywood theatrics. It’s not possible to do that. This is ridiculous.’ And suddenly, there you have it. This is totally out of some Hollywood thriller. You know that the killer has admitted his guilt because there’s surveillance footage of his potato chip bag vibrating.”

Efros agrees that the characterization of material properties could be a fruitful application of the technology. But, he adds, “I’m sure there will be applications that nobody will expect. I think the hallmark of good science is when you do something just because it’s cool and then somebody turns around and uses it for something you never imagined. It’s really nice to have this type of creative stuff.”



Comments

This reminds me of a scene towards the end of "Eagle Eye" http://www.imdb.com/title/tt10... The good guys are trying to hide from microphones controlled by the big evil computer, but it still manages to listen in on their conversation by zooming in on a glass of water. Seems that's not as far-fetched as I thought it was.

Amazing accomplishment! I remember using a laser interferometer decades ago to hear audio at a distance, although the quality was low. This is much simpler and less prone to noise.

For years I've wondered whether a material that sets, such as ceramic or brick or anything that turns hard over time, might be capturing nearby sounds in its microstructure. An analysis of micro-density variations on an old brick or piece of pottery might be able to recreate the sounds of voices in some ancient culture - we could actually hear what Sumerian or ancient Egyptian really sounded like, maybe even hear the voices of Socrates and Ramses!

To each his own, and my domain is musical instruments and quality of tone. I don't care much for eavesdropping on bagged chips, or the further enabling of high-tech spyware, but would be quite interested to see more high-speed analyses of soundboards with various string loads and pitches!

I'm also interested in its applications; I bet something really interesting will come out of this someday. Not to say it's uninteresting in and of itself.

This is too much fabulosity!!

Interesting. I guess we need white noise generators to keep the government from quite literally spying on us through our house-plants :p

Knowing what I know about human nature, I am predictably reticent about any and all of today's machinery.

But this is just too cool.

What would happen if video was recorded with a camera using the technology in the Lytro cameras? Clearer, more specific audio?

I think it's possible to tell if people are inside a room, talking, by taking telephoto recording of the window pane glass. It's possible to measure vibrations in the glass, hence people are inside. I think this technique was used in some high profile spy work or something . . . vaguely remember it in the news . . . .

Question: is this an article about serious research at MIT or is it a report about teaching students the basics of sound recording in a fun way?
If it's the latter, nicely done! If it's the first, really guys, don't you know the principles of (analog) sound recording?! Everything in this article has been known for many decades!

I'm interested to hear a silent film with this technology.

You used a DSLR. How hard would it be to use a smartphone instead? I see an idea for an app.

The before and after recording of speech from the chip packet reminded me of this audio illusion, where the brain is better able to process distorted speech when knowing what it should sound like: https://soundcloud.com/whyy-th...

It would be interesting for someone who hasn't already watched the video to skip to here: http://www.youtube.com/watch?f... and see if they can make sense of the voice as recorded from the chip packet, _before_ listening to the undistorted recording of the same voice here: http://www.youtube.com/watch?f...

Even an ordinary photograph or video frame might reveal a subtle motion blur resulting from ambient sound, a summary over the 41.7 milliseconds of the single video frame. Depending on the resolution of the camera, this blur could potentially include a summary of the Fourier (frequency-domain) components of the sound at the moment of the image capture.

A few years ago someone proposed that voices from antiquity could be recovered from pottery being produced at the time. As pottery is spun on a potter's wheel, vibrations would be embedded in the wet clay and small snippets of sound (including voices) could be recovered. -Joseph Brown- joe_sails@yahoo.com

So now the government can spy on you by pointing a camera at your garbage. Great.

So would this work to listen to stars? Say, listen to the eruptions on Io?

It should be possible to locate the origin of audio as well by measuring its propagation through space.

What about a reverse process? Extracting 3D space and materials information from an audio recording?

This is utterly amazing!

If the high speed cameras are pointed at the waves crashing in the sea or a storm blowing through a tree - what sounds could be created?

If for example, the tree trunk movements were mapped to a fundamental frequency and sub-sequent branches/stems and leaves mapped onto corresponding harmonics - would the tree be singing? Would a forest of conifers sing a different melody than a deciduous woodland?

I don't understand how you can pick up such small movements with a regular 1080p consumer camera. To my understanding, in a video you can pick up a movement as small as a pixel, which would translate into what, 1mm? 0.1mm, depending on how much you zoom? The subtle movements you're talking about are way, way smaller than that. It would be as if the object never moved at all. I would have understood if you had used something like 4K and zoomed in on a square inch of the object, where subtle movements are easier to pick up. Please explain.

Hey, couldn't that be used for vibrometry of highly vibration-sensitive technical stuff too?

Shine a laser on the bag of chips and the reflection will give you a much better signal to analyze. By recording the reflection, you actually get amplification (from the geometry) and can get better reproduction.

A laser with the right type of receiver aimed at a window does the same thing. That has been around longer than I have, "The Laser Cookbook". Good read ;)

I think the comments about lasers doing a better job miss the point. With this technique, you can use a video that was taken a few weeks ago and recover the audio without knowing at the time that you would later want to retrieve the audio. The next hurdle is a video taken with a handheld camera, shakes and all, and still be able to retrieve usable audio.

So, can you now add sound to Charlie Chaplin movies?

Please analyze the Zapruder film with this technology to settle once and for all the number of shots fired at JFK in Dealey Plaza. Whatever the results, you will generate significant media attention for your project.

What will happen in a scenario where there is wind? Is it possible to extract the required vibration from the vibration created by wind?

Very nice example of the rolling shutter effect on this vid shot by an iPhone inside a guitar: https://m.youtube.com/watch?v=...

(Wish I'd had anything to do w making it.)

Does this mean we can hear a supernova or a black hole? Also, if it can pick up any vibration, could it be used to find weak spots in structures? Say, film a bridge, run it through the software, and the sound made could let you know if there are any weak spots.

Zapruder film, please. And is it possible to apply this to really old films that had no sound? Perhaps we could hear historically famous people and their voices. This is crazy stuff! Sorry, read more and see that it is not possible. That is too bad.

I'm wondering if an audio signal can be pulled off the video of a nearby pane of glass??? This could have huge monetary implications...

Cone of silence anyone? I know I'm building mine stat@!

Is there a public publication about this technology?

I remember that in an American Embassy somewhere (Moscow?) during the cold war, there was some sort of art gift that was discovered to have a thin metal reed inside it, which the Russians managed to recover audio from by pointing a microwave transmitter at it, and the vibrating reed modulated the reflection.

I'd love to see this applied to sound _canceling_. Point a video camera at a sound source (or a sound proxy like a nearby bag of chips), and generate an inverse waveform that cancels out the sound coming from that source.

It would be a great way to selectively cancel out any given sound source. I wish I had this technology a few weeks ago when someone in my neighborhood was blasting their stereo at 3 AM :P

Kind of a technology-enabled "talk to the hand". I just have to find a sound proxy near you, point my camera at it, and an app on my phone sends canceling sound to my earbuds.

Is it maybe possible to do those kinds of experiments with light as well? That could be interesting, for example, for the solar cell industry!

Some time ago I researched my idea to recover the speech of artists/artisans/workers painting artworks, crypts, pots, walls, tombs etc. by looking at the patterns their voices and other sounds would make by vibration of the bristles or other protuberances on the brushes they were using in the paint they were applying... This sounds similar ;)

As a sound engineer I'm amazed and speechless...

Wow, I wonder if it would be a good idea to invent a chip bag with soundproofing material.

Hmmm... Very Interesting: Does an object's temperature change its ability to vibrate at a given frequency due to sound and hence provide a clue to that object's temperature? Can this be used with older reco

The television series CSI in its last season used the recording of plant vibrations to pick up what two people were saying on a security camera. Naturally the recording of the people's voices was perfect (it's television after all) but one day I'm sure recording a person's voice perfectly will be possible.

I want to hear what a single cell sounds like! I think we could learn a lot from listening to the very small...

I hope this could be used to catch more "bad guys".
