Engineering voice impersonation with machine learning
Text-to-speech (TTS) synthesis is the computer’s way of transforming written text into audio. Most popular AI-driven personal assistants rely on TTS software to generate speech that sounds as natural as possible. In its simplest, concatenative form, TTS works by stitching together words and phrases from pre-recorded audio files.
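To make the concatenative idea concrete, here is a minimal sketch in Python. The word clips, sample rate and durations are all made-up placeholders; a real system would load actual recordings and handle words missing from its inventory.

```python
import numpy as np

# Hypothetical pre-recorded clips: one short waveform per word,
# sampled at 16 kHz (placeholder silent arrays stand in for real audio).
SAMPLE_RATE = 16_000
clips = {
    "hello": np.zeros(int(0.4 * SAMPLE_RATE)),  # 0.4 s recording of "hello"
    "world": np.zeros(int(0.5 * SAMPLE_RATE)),  # 0.5 s recording of "world"
}

def concatenative_tts(text: str, pause_s: float = 0.1) -> np.ndarray:
    """Stitch pre-recorded word clips together, with short pauses between words."""
    pause = np.zeros(int(pause_s * SAMPLE_RATE))
    pieces = []
    for word in text.lower().split():
        pieces.append(clips[word])  # look up the recording for this word
        pieces.append(pause)        # insert silence between words
    return np.concatenate(pieces[:-1])  # drop the trailing pause

audio = concatenative_tts("hello world")
print(len(audio) / SAMPLE_RATE)  # total duration: 0.4 + 0.1 + 0.5 = 1.0 seconds
```

The "fluency" of such a system is limited by its clip inventory, which is why modern assistants have moved toward fully generative models.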
How is voice impersonation technology used?
Google voice cloning and generative adversarial networks
Voice cloning is a line of AI research, pursued at Google among others, that allows a computer to read messages aloud in any voice. The system requires two inputs:
- A text to be read
- A sample of the voice
Generative adversarial networks (GANs) can capture and modulate the audio properties of a voice signal. Deep generative models, such as Google’s WaveNet for audio and GAN-based systems for video, can create media that mimics voices and facial expressions so convincingly that it becomes almost indistinguishable from how the impersonated person sounds and looks.
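WaveNet-style models generate audio autoregressively: each new sample is predicted from the samples that came before it. The toy sketch below illustrates only that sampling loop; the fixed random linear filter is a stand-in for the trained neural network, and all parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_autoregressive_synth(n_samples: int, context: int = 16) -> np.ndarray:
    """Generate a waveform one sample at a time, WaveNet-style:
    each new sample is a function of the previous `context` samples.
    A fixed random linear filter stands in for the trained network."""
    weights = rng.normal(scale=0.1, size=context)  # stand-in for learned parameters
    audio = np.zeros(n_samples)
    audio[:context] = rng.normal(scale=0.01, size=context)  # seed samples
    for t in range(context, n_samples):
        # the next sample depends on the preceding window (autoregression)
        audio[t] = np.tanh(weights @ audio[t - context:t])
    return audio

wave = toy_autoregressive_synth(1000)
print(wave.shape)
```

In a real model the "filter" is a deep network trained on hours of speech, which is what makes the generated samples sound like a particular voice.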
As a rule of thumb, voice-modeling technology improves the more voice data you feed it. Nevertheless, advanced neural networks do not always need a large dataset of recorded audio to pre-train the model.
Tech companies such as Canada’s Lyrebird strive to design AI systems that can mimic a human voice convincingly by analyzing speech recordings and the corresponding text transcripts. Lyrebird’s system relies on the deep learning capabilities of its artificial neural networks to transform bits of sound into speech.
Once the system learns how to generate speech, it can calibrate its settings to resemble any voice after reading through a one-minute sample of someone’s speech. At the moment, that speed comes with trade-offs: a buzzing noise accompanies the generated voice, and there is a slight but noticeable robotic quality, with none of the physical cues of natural speech, such as breathing and mouth movement.
Other voice impersonation technologies
A Real-Time Voice Cloning toolbox on GitHub promises to replicate anyone’s voice from as little as five seconds of sample audio. Adobe already has a prototype platform called Project VoCo which, after listening to 20 minutes of sample audio, can edit human speech the same way Photoshop modifies digital images.
Synthesized voices of Donald Trump, Barack Obama and Hillary Clinton, infused with emotion, are proof that this technology may exacerbate security and privacy problems well before it delivers anything positive.
To understand how good voice impersonation technology has become, you could check out a demo by an AI company called Dessa, which used text-to-speech deep learning techniques to recreate Joe Rogan’s voice. Facebook engineers also created a machine learning system named “MelNet” that successfully replicated the voices of famous people and public speakers.
Security and privacy concerns related to voice impersonation
“Compared to text, voice is just much more natural and intimate to us,” said Timo Baumann, a researcher who works on speech processing at the Language Technologies Institute at Carnegie Mellon University. We now live in a world where whoever has a digital imprint of your voice can impersonate it at will.
Consider also that every device equipped with a voice assistant is pre-programmed to listen quietly for a “wake word” to emerge out of a continuous stream of audio. This process is also typical of how a voice is fed to a machine learning model.
Ambient voice technology, a hands-free, AI-based approach, has practical applications in healthcare, where medical specialists record verbal interactions with visiting patients. Doctors, among others, see this technology as an excellent tool for alleviating the daunting bureaucratic task of typing up medical records.
However, the question of privacy remains a hot potato in these situations. Given that AI is used in retail environments more than ever, there is an ongoing discussion about transparency: Should customers be notified when they engage with AI?
In addition, voice impersonation may entail some serious negative consequences, which are for the most part security-sensitive:
- May confuse voice-based verification systems
- May bring into question the integrity of real-time video in live streams
- May render audio and video recordings unusable as court evidence
While automatic speaker verification systems are good at detecting human imitation, they often fail to spot more advanced machine-generated voice impersonation attacks.
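A common building block of speaker verification is comparing voice embeddings: the system accepts a caller if their embedding is close enough to the one captured at enrollment. The sketch below uses random vectors in place of real learned embeddings, and the 0.8 threshold is an invented value; it also shows why a cloned voice that lands near the enrolled embedding would pass the same check as a genuine recording.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(enrolled: np.ndarray, claimed: np.ndarray,
                   threshold: float = 0.8) -> bool:
    """Accept the identity claim if the embeddings are similar enough.
    A machine-generated voice whose embedding lands close to the
    enrolled one will pass this check just like a genuine recording."""
    return cosine_similarity(enrolled, claimed) >= threshold

rng = np.random.default_rng(42)
enrolled = rng.normal(size=128)                       # embedding from enrollment audio
genuine = enrolled + rng.normal(scale=0.1, size=128)  # same speaker, new recording
imposter = rng.normal(size=128)                       # unrelated speaker

print(verify_speaker(enrolled, genuine))   # expected to pass
print(verify_speaker(enrolled, imposter))  # expected to fail
```

Human imitators rarely get close to the enrolled embedding, which is why these systems catch them; advanced synthesis optimized against the same features is much harder to reject.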
No wonder that Lyrebird’s founders (three university students from the University of Montréal) openly admitted, “This could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else.”
Real-life implications of ML-based voice impersonation
Impersonation is the largest scam category reported to the FTC, with more than 647,400 complaints in 2019 alone. Some of these cases are AI-related.
Voice fraud as a whole is on the rise. One report found that such attacks increased by 350% over the past few years. Another study predicts that as many as 50% of all mobile calls placed in the U.S. will be fraudulent by next year.
Social engineers have many sources to draw from: voicemail greetings, social media, data breaches, visited websites and more. Companies routinely release recordings of their high-ranking employees’ voices, a practice that, unfortunately, makes it easier to fabricate a recording of upper management. Seventy-five percent of targeted victims report that the bad actors already had some personal information about them. Scammers use additional techniques to deceive the victim, such as spoofing area codes so the call appears to come from an area the victim expects.
At least three recent attacks have used deepfake voices to swindle companies out of millions of dollars; in one case, the loss reached $10 million, according to Symantec CTO Hugh Thompson.
A common scenario: “Please transfer money to this person. It is urgent.”
The story of Gary Schildhorn, a 67-year-old lawyer, shows how voice impersonation is used for social engineering. While driving to work, Schildhorn received a phone call from his son, or at least from someone who sounded just like his son.
“It was his voice, his cadence, using words that he would use,” said the lawyer. The crying voice on the phone explained that he had been in an accident and needed $9,000 to pay for a public defender. Ten minutes later, Schildhorn received another call from someone claiming to be his son’s lawyer, a move meant to bolster the swindle. Schildhorn was about to contact his bank to order the payment, but first he called his daughter-in-law, who alerted his son’s workplace; eventually, his son called to tell him not to pay because it was a scam.
Perhaps the most famous AI-based voice impersonation attack is the case of the CEO of a UK energy firm who wired €220,000 ($243,000) because he thought he was speaking on the phone with his boss. By his testimony, he recognized his boss’s German accent and melody of voice.
Experts say that voice impersonation and deepfake attacks are the logical evolution of the business email compromise scam where scammers impersonate company executives via email. David Thomas, CEO of identity verification company Evident, told Threatpost that “it’s no longer enough to just trust that someone is who they say they are. Individuals and businesses are just now beginning to understand how important identity verification is. Especially in the new era of deep fakes.”
Although somewhat burdensome, dual custody is a measure that can help prevent this kind of fraud: whenever transactions above a specific size are involved, two or three co-signatories should be required.
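A dual-custody rule is simple to state in code. The sketch below is a hypothetical policy check, with an invented threshold and role names; a real treasury system would also verify each approver’s identity out of band, precisely because a voice on the phone is no longer proof of anything.

```python
# Hypothetical dual-custody rule for outgoing payments: transfers at or
# above a threshold require two distinct approvers.
THRESHOLD = 10_000  # invented cut-off for illustration

def payment_allowed(amount: float, approvers: set[str]) -> bool:
    """Return True if the payment has enough distinct sign-offs."""
    if amount < THRESHOLD:
        return len(approvers) >= 1   # small transfers: one approver suffices
    return len(approvers) >= 2       # large transfers: dual custody kicks in

print(payment_allowed(5_000, {"alice"}))         # below threshold: allowed
print(payment_allowed(243_000, {"ceo"}))         # needs a co-signer: blocked
print(payment_allowed(243_000, {"ceo", "cfo"}))  # two distinct approvers: allowed
```

Because the approvers are a set, a scammer impersonating one executive cannot satisfy the rule by approving twice; a second, independent person must sign off.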
Voice engineering and its threat
Voice impersonation is another technology that blurs the line between the physical world and cyberspace.
It may not be a pervasive technology yet, but it is a disturbing one that raises ethical questions about misuse on a much larger scale in the near future.
Just as people cannot entirely trust that an image has not been doctored with programs such as Photoshop, they should learn not to entirely trust the voices they hear through their electronic devices.
- Lyrebird claims it can recreate any voice using just one minute of sample audio, The Verge
- Clone a Voice in Five Seconds With This AI Toolbox, Medium
- Alarming AI Clones Both a Person’s Voice and Their Speech Patterns, Futurism
- CEO ‘Deep Fake’ Swindles Company Out of $243K, Threatpost
- A Philly lawyer nearly wired $9,000 to a stranger impersonating his son’s voice, showing just how smart scammers are getting, The Inquirer
- Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case, WSJ
- Rich PII enables sophisticated impersonation attacks, CSO