When speech assistants listen even though they should not

“Alexa,” “Hey Siri,” “OK Google” – voice assistants are supposed to react to these triggers. But other words activate them, too.

The research team – here Thorsten Eisenhofer, Jan Wiele, Lea Schönherr, Maximilian Golla, Dorothea Kolossa (left to right) – analysed which terms voice assistants misinterpret as triggers. © RUB/Marquard

HGI/CASA and Max Planck Institute (MPI) for Cyber Security and Privacy researchers have investigated which words inadvertently activate voice assistants. They compiled a list of English, German, and Chinese terms that were repeatedly misinterpreted by various smart speakers as prompts. Whenever the systems wake up, they record a short sequence of what is being said and transmit the data to the manufacturer. The audio snippets are then transcribed and checked by employees of the respective corporation. Thus, fragments of very private conversations can end up in the companies’ systems.

Süddeutsche Zeitung and NDR reported on the results of the analysis on 30 June 2020. Examples yielded by the researchers’ analysis can be found at unacceptable-privacy.github.io.

For the project, Lea Schönherr from the research group Cognitive Signal Processing, headed by Professor Dorothea Kolossa, collaborated with Dr. Maximilian Golla, previously at HGI and now at MPI for Cyber Security and Privacy, as well as Jan Wiele and Thorsten Eisenhofer from the HGI Chair for Systems Security, headed by Professor Thorsten Holz.

Testing all major manufacturers

The researchers tested the voice assistants by Amazon, Apple, Google, Microsoft, and Deutsche Telekom, as well as three Chinese models by Xiaomi, Baidu, and Tencent. They played them hours of English, German, and Chinese audio material, including several seasons of the series “Game of Thrones,” “Modern Family,” and “House of Cards,” as well as news broadcasts. Professional audio data sets that are used to train smart speakers were also included.

Each voice assistant was fitted with a light sensor that registered when the smart speaker’s activity indicator lit up, visibly signalling that the device had switched into active mode and that a trigger had occurred. The setup also registered when a voice assistant sent data to the outside. Whenever one of the devices switched to active mode, the researchers recorded which audio sequence had caused it. They later manually evaluated which terms had triggered the assistant.

False triggers identified and generated

Based on this data, the team created a list of over 1,000 sequences that incorrectly trigger speech assistants. Depending on the pronunciation, Alexa reacts to the words “unacceptable” and “election,” while Google reacts to “OK, cool.” Siri can be fooled by “a city,” Cortana by “Montana,” Computer by “Peter,” Amazon by “and the zone,” and Echo by “tobacco.”

In order to understand what makes these terms false triggers, the researchers broke the words down into their smallest possible sound units and identified the units that were often confused by the voice assistants. Based on these findings, they generated new trigger words and showed that these terms also activated the voice assistants.
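The idea of comparing words at the level of sound units can be illustrated with a small sketch. It is not the researchers’ method: it assumes rough, hand-written phoneme transcriptions (the `PHONEMES` table is hypothetical and only illustrative) and uses a plain edit distance over phoneme sequences as a stand-in for whatever confusion measure the team actually used. All names in it are invented for this example.

```python
# Hypothetical, hand-written phoneme transcriptions (illustrative only).
PHONEMES = {
    "alexa": ["AH", "L", "EH", "K", "S", "AH"],
    "election": ["IH", "L", "EH", "K", "SH", "AH", "N"],
    "window": ["W", "IH", "N", "D", "OW"],
}


def edit_distance(a, b):
    """Levenshtein distance between two sequences of sound units."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = cur
    return prev[-1]


def phoneme_distance(word_a, word_b):
    """Distance between two words, measured on their phoneme sequences."""
    return edit_distance(PHONEMES[word_a], PHONEMES[word_b])


def rank_candidates(wake_word, candidates):
    """Sort candidate words from most to least similar to the wake word."""
    return sorted(candidates, key=lambda w: phoneme_distance(wake_word, w))


if __name__ == "__main__":
    for word in rank_candidates("alexa", ["window", "election"]):
        print(word, phoneme_distance("alexa", word))
```

Under these assumed transcriptions, “election” differs from “alexa” in only a few sound units, whereas an unrelated word such as “window” shares none, which is the kind of closeness that makes a term a plausible false trigger.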

“The devices are intentionally programmed in a somewhat forgiving manner, because they are supposed to be able to understand their humans. Therefore, they are more likely to start up once too often rather than not at all,” concludes Dorothea Kolossa.

Audio snippets are analysed in the cloud

The researchers analysed in more detail how the manufacturers evaluate false triggers. A two-stage process is most common. First, the device analyses locally whether the speech it perceives contains a trigger word. If the device suspects that it has heard the trigger word, it begins to upload the current conversation to the manufacturer’s cloud for further analysis with more computing power. If the cloud analysis identifies the term as a false trigger, the voice assistant remains silent; only its indicator LED lights up briefly. Even in this case, several seconds of audio recording may already have reached the corporation, where they are transcribed by humans in order to avoid such false triggers in the future.
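The two-stage process described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any manufacturer’s implementation: the threshold values, the score inputs, and all names (`LOCAL_THRESHOLD`, `CLOUD_THRESHOLD`, `process_utterance`) are hypothetical. The scores stand in for the confidence that a local and a cloud detector would assign to the same audio.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the on-device check is deliberately lenient,
# the cloud check is stricter.
LOCAL_THRESHOLD = 0.5
CLOUD_THRESHOLD = 0.9


@dataclass
class Outcome:
    uploaded: bool    # audio left the device
    led_on: bool      # activity indicator lit up
    responded: bool   # assistant actually answered


def process_utterance(local_score: float, cloud_score: float) -> Outcome:
    """Apply the two-stage check to one utterance."""
    # Stage 1: local detection. Below the threshold nothing happens.
    if local_score < LOCAL_THRESHOLD:
        return Outcome(uploaded=False, led_on=False, responded=False)

    # The device suspects a trigger: the LED lights up and audio is uploaded.
    # Stage 2: the cloud re-checks the audio with a stricter threshold.
    confirmed = cloud_score >= CLOUD_THRESHOLD
    return Outcome(uploaded=True, led_on=True, responded=confirmed)


if __name__ == "__main__":
    print(process_utterance(0.2, 0.0))   # ignored locally
    print(process_utterance(0.6, 0.3))   # false trigger: uploaded, but silent
    print(process_utterance(0.8, 0.95))  # genuine trigger
```

The middle case is the one the researchers highlight: the cloud rejects the trigger and the assistant stays silent, yet the audio has already been uploaded.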

“From a privacy point of view, this is of course alarming, because sometimes very private conversations can end up with strangers,” says Thorsten Holz. “From an engineering point of view, however, this approach is quite understandable, because the systems can only be improved using such data. The manufacturers have to strike a balance between data protection and technical optimisation.”

Press contact

Christina Scholten
Marketing and Public Relations
Horst Görtz Institute for IT Security
Ruhr-Universität Bochum
Germany
Phone: +49 234 32 27130
Email: hgi-presse(at)rub.de

Lea Schönherr
Cognitive Signal Processing
Faculty of Electrical Engineering and Information Technology
Ruhr-Universität Bochum
Germany
Phone: +49 234 32 29638
Email: lea.schoenherr(at)rub.de

Dr. Maximilian Golla
Max Planck Institute for Cyber Security and Privacy
Germany
Phone: +49 234 32 28667
Email: maximilian.golla(at)csp.mpg.de

