AI Foley Artist Fools Human Ears

Foley artists reproduce everyday sound effects for radio, film and television.

Humans navigate the world through our five senses, and it could be argued that we rely far more heavily on sight and hearing than on touch, taste and smell. Through these senses, we learn from our interactions with our surroundings and begin to predict actions and reactions in our environment.

For robots to function more effectively in industrial and commercial settings, they too will need to be able to make predictions about their surroundings.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an algorithm capable of predicting sound. When shown a video of an object being hit, the algorithm produces a realistic sound capable of fooling human ears.

The ability of artificial intelligence (AI) to replicate sounds is impressive in itself, but Andrew Owens, a CSAIL PhD student, believes the development could also help robots understand the world around them.

Researchers have developed an algorithm that predicts sound, which could help robots decipher their surroundings. Similar algorithms could also lead to better sound effects for film and television.

“When you run your finger across a wine glass, the sound it makes reflects how much liquid is in it,” said Owens. “An algorithm that simulates such sounds can reveal key information about objects’ shapes and material types, as well as the force and motion of their interactions with the world. A robot could look at a sidewalk and instinctively know that the cement is hard and the grass is soft, and therefore know what would happen if it stepped on either of them.”

Deep Learning Algorithm Fools Humans

By sifting through large amounts of data, computers can find patterns on their own rather than relying on hand-written rules. This technique is at the heart of “deep learning.”

To train the sound-predicting algorithm, the researchers gave it roughly 1,000 videos containing about 46,000 sounds, all produced by a drumstick hitting, scraping or prodding various objects.
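In essence, each video pairs what the camera sees with the sound that was actually recorded, and a model is fitted to predict one from the other. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the feature sizes, the tiny feed-forward network and the random stand-in data are assumptions for demonstration, not the CSAIL architecture.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for training data: each row pairs a video-clip
# feature vector with the sound-feature vector of the audio recorded for
# that clip.  Names, sizes and values are illustrative only.
video_feats = torch.randn(512, 128)   # 512 clips, 128 visual features each
sound_feats = torch.randn(512, 32)    # matching 32-dimensional sound features

# A small feed-forward regressor.  The real system is far more sophisticated,
# but the principle is the same: learn, from paired examples, a mapping from
# what is seen to what should be heard.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    predicted = model(video_feats)           # predicted sound features
    loss = loss_fn(predicted, sound_feats)   # distance from the real audio's features
    loss.backward()
    optimizer.step()
```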

The team fed these videos to the deep-learning algorithm, which deconstructed the sounds by analyzing pitch, loudness and other features. Through the study of sound properties, the algorithm built a vast database of sounds (dubbed “Greatest Hits” by the researchers).
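As a rough illustration of that kind of analysis, the sketch below computes two simple audio descriptors for each clip, loudness (RMS energy) and a coarse pitch proxy (the spectral centroid), and stores them in a small database. It uses only NumPy and synthetic “hits”; the choice of features and the clip generation are assumptions for demonstration, not the features the researchers actually extracted.

```python
import numpy as np

SR = 22050     # sample rate in Hz
FRAME = 1024   # samples per analysis frame

def clip_features(clip, sr=SR, frame=FRAME):
    """Summarize a clip as (mean loudness, mean spectral centroid)."""
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    loudness, centroid = [], []
    for i in range(len(clip) // frame):
        window = clip[i * frame:(i + 1) * frame]
        loudness.append(np.sqrt(np.mean(window ** 2)))          # RMS energy
        spectrum = np.abs(np.fft.rfft(window))
        centroid.append(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-9))
    return np.array([np.mean(loudness), np.mean(centroid)])

# Build a toy "database": decaying sine tones standing in for recorded hits.
rng = np.random.default_rng(0)
database = []
for pitch_hz in (200.0, 450.0, 900.0):
    t = np.arange(SR // 2) / SR                                 # half a second
    clip = np.exp(-6 * t) * np.sin(2 * np.pi * pitch_hz * t)    # a decaying "hit"
    clip += 0.01 * rng.standard_normal(len(t))                  # a little noise
    database.append({"pitch_hz": pitch_hz,
                     "features": clip_features(clip),
                     "audio": clip})
```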

Eventually, when shown a new video, the algorithm was able to work out what the new sound should be by matching it against the sounds stored in its database. It can, for example, distinguish high pitches from low ones, and staccato taps from rustling ivy.
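That matching step can be thought of as nearest-neighbor retrieval over stored sound features. The sketch below, again a hypothetical simplification with made-up feature values, picks the stored sound whose features lie closest to the features predicted for the new video:

```python
import numpy as np

# A toy feature database like the one sketched above: each entry pairs a
# (mean loudness, mean spectral centroid) vector with a recorded sound.
database = [
    {"label": "soft leaf rustle", "features": np.array([0.05, 300.0])},
    {"label": "dull wooden thud", "features": np.array([0.30, 450.0])},
    {"label": "bright metal tap", "features": np.array([0.25, 2200.0])},
]

def closest_sound(predicted_features, database):
    """Return the stored entry whose features are nearest to the prediction."""
    distances = [np.linalg.norm(entry["features"] - predicted_features)
                 for entry in database]
    return database[int(np.argmin(distances))]

# Suppose the video model predicts features for a new clip: quiet and low-pitched.
predicted = np.array([0.07, 350.0])
print(closest_sound(predicted, database)["label"])   # -> "soft leaf rustle"
```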

But how convincing are the fake sounds?

To find out, the researchers showed subjects two videos of collisions. One video used the actual sound recorded, while the other used the algorithm's sound. When asked which one was real, viewers picked the fake sound over the real one twice as often.
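In effect this is a two-alternative forced-choice test, and scoring it is simple bookkeeping. The snippet below, using entirely made-up responses, shows how the “fooled” rate would be tallied:

```python
# Each entry records which clip a viewer called "real": True means they picked
# the algorithm's synthesized sound, False means they picked the recording.
# The responses here are made up purely to illustrate the bookkeeping.
responses = [True, False, True, True, False, True, False, True]

fooled = sum(responses)
print(f"Viewers chose the synthesized sound in {fooled / len(responses):.0%} of trials")
```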

Human ears were fooled more often by less distinct sounds made by materials such as leaves and dirt. With wood or metal, they could more easily tell the difference.


Despite initial test successes, there is still room for improvement. For instance, the program falters when movements are erratic. Another limitation is that it only handles “visually indicated sounds,” those caused directly by the physical interactions shown in the video.

Still, the algorithm brings us that much closer to developing smarter, more sophisticated robots that can interact more effectively with their surroundings.

For another example of AI outperforming humans, read about an AI that can replace physicists.