Given that people speak 7,117 languages worldwide, speech is a universal experience that transcends social barriers and cultural lines. While speech was thought of for centuries as uniquely human, artificial intelligence has recently been tested as a speaker, resulting in shockingly accurate voice deepfakes: AI voices that are specifically crafted to sound human.
AI-generated voices are not a new concept. According to The Verge editor David Pierce, well-known technologies like Alexa, Siri and Waze have relied on AI-generated voices to operate for years. But while some technologies use AI to help people regain their voice, voice deepfakes are trained on hours of human dialogue to sound like real people.
Deepfakes, when paired with unethical intentions, can be used for identity theft. Information security company Pindrop reported that it receives 1,000 to 10,000 calls made by fraudulent voices per year and up to 20 a week. These calls are not merely spam or advertisements, but rather the first step toward securing confidential information or withdrawing large sums of money from a bank. New York Times reporters Emily Flitter and Stacy Cowley reported that in the past three years, “300 million people fell into the hands of hackers, leading to $8.8 billion in losses.”
Some means of deepfake generation, like typing requests in real time, can be spotted by those on the other end of the line. With the technology’s rapid advancement, though, scammers can now type prompts or speak into a microphone and have their words quickly converted into the targeted voice.
Deepfakes can also be dangerous for entire professions. Because they are cheaper than human labor, AI voices have abruptly put commercial voice actors out of work. Any recording these actors have made in their careers could have served as an extra piece of data for AI training companies, refining budding deepfakes into precise clones of their human competition. This phenomenon is truly ironic: humans are being replaced by the very technology they unknowingly helped train.
AI voices have certainly come a long way from their rudimentary 18th-century predecessors, like Wolfgang von Kempelen’s “boxlike contraption that used bellows, pipes and a rubber mouth and nose to simulate a few recognizably human utterances.” Apps and websites like Personal Voice on iOS 17, ElevenLabs, Descript and Microsoft’s VALL-E let anyone create their own AI voice in a matter of days.
The accessibility of these technologies puts individual voices at risk to an unprecedented degree, which is why it is all the more important to exercise control over one’s voice. And, as deepfakes further blur the line between human and robot, it is important to continue to celebrate and preserve the individuality and diversity of language as a whole, even in small ways.
For me, this meant using the app Duolingo to relearn Chinese after several years of not speaking it. Reconnecting with a forgotten language helped me strengthen family relationships and communication, and language learning sites can help others reinforce their sense of identity as a whole. It could also mean using a thesaurus or a word of the day to challenge your conventional diction, or listening to TED Talks to absorb new ideas that broaden your opinions, the truest form of “voice.”
Human voices are composed of complex acoustics, nuances and physiology, making each one individual no matter what. We can use them to express our most trivial preferences or confess our deepest beliefs, unlike a deepfake, which is only used as a profitable tool in an anonymous scheme.
Moreover, a voice’s sound is shaped by one’s family from birth, and what it conveys changes in step with the individual. A voice’s ability to adapt to the person it belongs to is crucial; it is exactly what a deepfake is missing. The word “fake” is in the name itself, and fakery is the antithesis of the human experience, which, no matter what, cannot be replicated with clicks on a keyboard and searches in a database.
We are surrounded by billions of voices, each a reflection of someone’s particular worldview that contributes to the candid, imperfect symphony that is our world. To let a group of hackers, websites or scam calls rob us of our vocal agency is a disservice to humanity. As emeritus professor Klaus Scherer once said, “The voice is not easy to grasp.” For it to remain ours, we must hold on tight.