Subliminal Messages: Alexa and Siri Can Hear Commands That Humans Can't

Few people today still believe that rock 'n' roll records are laced with malicious subliminal messages aimed at teens. That old fear, however, may have found a new kind of validity: a new study shows that music or speech can be laced with malicious "subliminal messages" aimed not at people but at the voice-activated software on your smartphone. A research team from UC Berkeley has made waves with a paper in which they turned recordings of ordinary speech and music into "targeted adversarial examples": audio that sounds normal to the human ear but is interpreted as something else entirely by transcription software. This ability to "trick" a computer or smartphone's microphone is troubling when combined with the rise of voice-activated commands.

An adversarial example is an input crafted to make an AI system produce a wrong output of any kind. A targeted adversarial example goes further: it forces the system to produce a specific wrong output of the attacker's choosing (think of it as whispering a particular wrong answer into your classmate's ear when the teacher calls on them). Targeted adversarial examples are obviously the more worrisome kind for phone AI; an attacker steering your phone to a malware website or telling it to record you is far worse than one merely causing a random mistake.
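To make the distinction concrete, here is a minimal sketch, not from the paper, of the two objectives an attacker might optimize against a generic classifier. The model, labels, and loss choice here are all illustrative assumptions.

```python
# Illustrative only: contrasts an untargeted attack objective (make the model
# wrong in any way) with a targeted one (make it output the attacker's label).
# `model` is assumed to be any classifier that returns logits over labels.
import torch
import torch.nn.functional as F

def untargeted_objective(model, x, true_label):
    # Maximize the loss on the correct label: any mistake will do.
    return -F.cross_entropy(model(x), true_label)

def targeted_objective(model, x, attacker_label):
    # Minimize the loss on the attacker's chosen label: one specific mistake.
    return F.cross_entropy(model(x), attacker_label)
```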

A visual targeted adversarial example: overlaying a carefully crafted layer of noise on a picture of a panda makes classifier software label it a gibbon. The difference between the two images is imperceptible to the human eye. (Photo courtesy of OpenAI.)

Until now, most research on targeted adversarial examples has focused on visual inputs, such as photos designed to fool facial-recognition software. To build a visual targeted adversarial example, researchers introduce barely perceptible "noise" into the image. The UC Berkeley team took the same approach with sound: they added a slight "perturbation" to each waveform, leaving it more than 99.9 percent similar to the original while making it transcribe as an entirely different phrase of their choosing.

They measured distortion in decibels, quantified as dB_x(δ) = dB(δ) − dB(x): the loudness of the added perturbation δ minus the loudness of the original waveform x. Because the perturbation is quieter than the original waveform, the distortion is a negative number, and more negative values mean quieter distortions. Since they wanted the recordings to sound as close to the originals as possible, the researchers tried to make the distortion as small, that is as negative, as they could.

By adding a slight "perturbation" to waveforms, the team was able to make them register as entirely different sentences, or register as words when the waveform was not speech at all. (Photo courtesy of Nicholas Carlini and David Wagner.)
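As a quick sanity check of that metric, here is a minimal sketch (ours, not the authors' code) of how the relative loudness of a perturbation can be computed from two audio arrays. The peak-amplitude definition of dB follows the description above; the variable names are our own.

```python
# Compute the distortion metric dB_x(delta) = dB(delta) - dB(x) described above.
# `original` and `perturbed` are assumed to be NumPy arrays of audio samples.
import numpy as np

def db(x):
    # Peak level of a waveform in decibels.
    return 20 * np.log10(np.max(np.abs(x)) + 1e-12)

def relative_distortion_db(original, perturbed):
    delta = perturbed - original          # the added perturbation
    return db(delta) - db(original)       # negative when delta is quieter

# Example: a perturbation at -18 dB relative to the original has a peak
# amplitude of about 10 ** (-18 / 20), roughly an eighth of the original's.
```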

On his website, Carlini explains exactly how his team made their adversarial examples. "At a high level, we first construct a special ‘loss function’ based on CTC Loss that takes a desired transcription and an audio file as input, and returns a real number as output; the output is small when the phrase is transcribed as we want it to be, and large otherwise. We then minimize this loss function by making slight changes to the input through gradient descent. After running for several minutes, gradient descent will return an audio waveform that has minimized the loss, and will therefore be transcribed as the desired phrase."
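The loop Carlini describes can be sketched roughly as follows. This is a simplified illustration, not the authors' released code: it assumes a hypothetical speech_model that maps a raw waveform to per-frame character log-probabilities (the actual work targeted Mozilla's DeepSpeech), and it uses PyTorch's built-in CTC loss in place of their full objective.

```python
# Sketch of the gradient-descent attack loop described above (illustrative only).
import torch

def make_adversarial(waveform, speech_model, target_ids, steps=1000, lr=10.0):
    """Nudge `waveform` until `speech_model` transcribes it as `target_ids`."""
    ctc_loss = torch.nn.CTCLoss(blank=0)
    delta = torch.zeros_like(waveform, requires_grad=True)    # the perturbation
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        log_probs = speech_model(waveform + delta)            # shape (T, 1, chars)
        input_lengths = torch.tensor([log_probs.size(0)])
        target_lengths = torch.tensor([target_ids.size(1)])
        # Small when the audio transcribes as the target phrase, large otherwise;
        # a penalty on the loudness of delta could be added here to keep the
        # distortion quiet.
        loss = ctc_loss(log_probs, target_ids, input_lengths, target_lengths)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        delta.data.clamp_(-2000, 2000)    # keep the change small for 16-bit audio
    return (waveform + delta).detach()
```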

For this study, they hid the phrase "Okay, Google, browse to evil.com" both within a recording of the sentence "Without the data set, the article is useless" and within a four-second clip from Verdi's "Requiem." The mean distortion needed to "hide" the command phrase in spoken sentences was -18 dB, and the mean distortion needed to hide it in music was slightly quieter, at -20 dB.

Crucially, human listeners hear only the original sentence or music; the hidden command is imperceptible to them, yet the transcription software outputs it instead of the audible phrase. (To hear firsthand how imperceptible the difference is, you can visit the authors' website.)

Causing mistakes in machine transcriptions may not seem particularly groundbreaking, but tricking a computer or phone's microphone has much wider applications. Paper author Nicholas Carlini was part of another study in 2016 in which researchers tested whether they could hide smartphone commands in white noise played over YouTube or through a loudspeaker. Using these masked commands, they could make voice-recognition devices turn on airplane mode or open a website, as long as the speakers were within 11.5 ft (3.5 m) of the targeted device. "We wanted to see if we could make it even more stealthy," Carlini, a Ph.D. student at UC Berkeley, told The New York Times about his current study. Carlini believes the "backdoor" he found could be used to help people hack into other people's smartphones.

Other researchers have been exploring digital attacks delivered through sounds too high-pitched for the human ear to hear. In an earlier study, researchers from Princeton University and China's Zhejiang University activated voice-recognition systems using voice commands above the upper limit of human hearing (frequencies greater than 20 kHz). Because computer microphones are designed to pick up a wide range of sounds in order to capture the nuances of human speech, they can "hear" things that humans can't. The researchers tested their attack, which they called "DolphinAttack," against voice-recognition systems including Siri, Google Now, Cortana and Alexa, and got the devices to perform tasks like opening FaceTime, visiting a website, or switching to airplane mode. They ran the attack from a portable rig built around a Samsung Galaxy S6 Edge, an amplifier, an ultrasonic transducer and a battery.
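The core trick is amplitude modulation: the voice command is shifted onto an ultrasonic carrier that microphones pick up but ears don't, and non-linearities in the microphone hardware demodulate it back into the audible band. A toy sketch of the modulation step, our illustration rather than the DolphinAttack code, with the carrier frequency and sample rate as assumptions, might look like this:

```python
# Toy amplitude modulation of a voice command onto a 25 kHz carrier.
# `command` is assumed to be a mono float waveform; `fs` must be high enough
# (here 192 kHz) to represent frequencies above the carrier.
import numpy as np

def modulate_ultrasonic(command, fs=192_000, carrier_hz=25_000, depth=1.0):
    t = np.arange(len(command)) / fs
    carrier = np.cos(2 * np.pi * carrier_hz * t)
    normalized = command / (np.max(np.abs(command)) + 1e-9)
    # Classic AM: the command rides as sidebands around 25 kHz, entirely above
    # the roughly 20 kHz ceiling of human hearing.
    return (1.0 + depth * normalized) * carrier
```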

There were, however, limitations to DolphinAttack: the attack was less likely to succeed amid background noise, and the ultrasonic speaker had to be between 0.8 and 69 in (2 and 175 cm) from the target device.

The commands the researchers gave to the devices they tested. There are two kinds of attack: recognition (the system is already on and is asked to perform a particular action) and activation (the system is off and is asked to turn on). (Table courtesy of Guoming Zhang et al.)

The DolphinAttack team offered several possible defenses against their attack, both hardware and software. On the hardware side, they recommended designing microphones that don't pick up signals above 20 kHz, as well as adding a module to detect and cancel inaudible voice commands. On the software side, they suggested using a machine-learning classifier to detect and reject modulated voice commands like the ultrasonic ones they played. When they ran a classifier against their own samples, it determined which audio was "real" with 100 percent accuracy, suggesting that the software approach is a feasible defense against inaudible attacks.
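A rough sketch of the software-side idea, using our own hypothetical band-energy features and scikit-learn's SVM rather than the authors' exact classifier, might look like this:

```python
# Hedged sketch of a classifier-style defense: separate genuine voice commands
# from modulated/demodulated ones using simple spectral features. The feature
# choice is illustrative, not the DolphinAttack authors' feature set.
import numpy as np
from sklearn.svm import SVC

def band_energy_features(waveform, fs):
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / fs)
    total = spectrum.sum() + 1e-12
    # Ratio of energy in a few coarse bands; recovered (demodulated) commands
    # tend to have a different spectral balance than acoustically captured speech.
    bands = [(0, 500), (500, 1000), (1000, 4000), (4000, fs / 2)]
    return [spectrum[(freqs >= lo) & (freqs < hi)].sum() / total for lo, hi in bands]

def train_detector(genuine_clips, attack_clips):
    # genuine_clips / attack_clips: lists of (waveform, fs) pairs collected offline.
    X = [band_energy_features(w, fs) for w, fs in genuine_clips + attack_clips]
    y = [0] * len(genuine_clips) + [1] * len(attack_clips)
    return SVC(kernel="rbf").fit(X, y)
```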

Carlini et al. don't yet have a comparable defense against their targeted adversarial examples. As they point out in their paper, almost all current research on adversarial examples has been done on images; going into their study, it wasn't even clear that the same kind of effect could be achieved with audio. But Carlini hopes the paper will spur more interest in the topic. "We want to demonstrate that it's possible," he told the Times, "and then hope that other people will say, 'Okay, this is possible, now let's try and fix it.'"

But when asked whether he thought people were already using these techniques with malicious intent, Carlini was far more pessimistic: "My assumption is that the malicious people already employ people to do what I do." Such attacks wouldn't even necessarily be illegal; the Federal Communications Commission lists subliminal messaging as "counter to the public interest" but does not say it is against the law.

Gallup poll data on how Americans use artificial intelligence in their daily lives. (Graph courtesy of The New York Times.)

Speaking to Radio Sputnik's Loud & Clear, web developer Chris Garaffa offered one blunt solution for anyone unsettled by the news: "The best way to protect yourself is not to have any of these devices in your homes." But Garaffa's prescription is likely to be ignored; a recent Gallup poll found that half of Americans use smartphone personal assistants and a fifth already have smart devices like Alexa in their homes. If Americans aren't willing to shed their smart tech over these findings, smart-device companies and the law will be left playing catch-up in a world where subliminal messages have become a genuine threat.