Voice Cloning and AI Co-Hosts: A Look into the Future of Podcasting
As AI continues to weave its way into our daily lives, many are wondering if podcast hosts are about to be replaced by robotic co-workers. However, the reality may be quite different from what we imagine.
We explored several AI voice generator services and discovered that, while they may become co-hosts in the future, there's no reason for the Ira Glasses and Joe Rogans of the world to worry about being replaced anytime soon.
But before we dive in, let's clarify some terms.
Voice cloning is the process of recreating someone's voice using recordings in order to create a "puppet" voice that can say anything without the original person present.
With TTS (text-to-speech), you write out what you want the cloned voice to say, and the AI turns it into speech.
STS (Speech to Speech) lets you use your own voice instead of text, and the AI makes that audio sound like it's coming from the cloned person's mouth.
Let's see which voice cloning services made the grade.
The AI-Powered Voice Cloning Challenge
The challenge we set our four AI tools was straightforward. We asked them to take a few lines from the Gettysburg Address and transform them into something that could easily pass as being spoken by our CEO, Harry. Here's the excerpt we used:
"Fourscore and seven years ago our fathers brought forth, on this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives, that that nation might live."
Here's the fleshy humanoid himself reading this passage:
You may have heard of ElevenLabs, as it's responsible for many of those incredible celebrity spoofs you've seen online. All you need to do is upload a sample of the voice you want to clone.
You need no more than five minutes of high-quality audio, and once you tick a box to confirm you have the speaker's consent, the bot will get to work.
Once your cloned voice is ready, it's time to input your text and let the magic happen. Of all the services we tested, ElevenLabs was the simplest to set up, but it also offered the least customization.
There are only two sliders, 'stability' and 'clarity+enhancement,' and we left the latter untouched. The stability slider is fun to play with, though: you can get more expressive results, even if it means sacrificing some consistency.
Here's what it created:
The sound of Harry's voice is spot-on, but the delivery comes across as a bit robotic and weird. You know that feeling when something looks almost human, but not quite? Yeah, that's what we mean.
ElevenLabs pricing: Six pricing packages listed, including 'Independent Publisher' for $99 per month, and 'Growing Business' for $330 per month.
Play.ht uses a setup similar to ElevenLabs but requires a minimum of 30 minutes of audio, which is significantly more than ElevenLabs' five. The tool offers customization options such as adding emphasis, speeding up the delivery, and inserting pauses at specific points in the text.
Unfortunately, the customization features can be a bit finicky and sometimes even hard to find. We had to watch a few tutorials to unlock the tool's full potential. Once you get the hang of it, though, the audio output can be impressive.
Here's what it created:
Out of all the outputs we've generated, this one sounds the most mechanical and lacks any natural inflections or emotion. If you stumbled upon this in a podcast where it's pretending to be a human, you'd probably be creeped out.
Play.ht pricing: Annual membership of $351 (equivalent to a monthly fee of $29.25)
As podcasters ourselves, we understand the importance of a good audio editing tool, and there's nothing quite like Descript. This text-based audio editor has been our go-to for years when it comes to creating rough cuts and structural edits of our episodes.
But what's really got us excited lately is their new feature called 'Overdub.' Not only does it let you re-create someone's voice to fix any mistakes in your recordings, but it can actually generate entire sentences and passages using this cloned voice.
Of course, this futuristic technology isn't all fun and games. To ensure that everything is above board, Descript requires you to record a disclaimer before it starts the cloning process.
This means using your own microphone to say your piece, and it's a step that we appreciate as conscientious creators. For the best results, you'll need between 30 and 180 minutes of audio to work with.
After finalizing the voice, you can convert text into speech by choosing your preferred voice and typing the text directly into the document. Descript offers limited options for customization and instead relies on punctuation to indicate emphasis and pace.
Here's what it created:
The good news is that the sound of Harry's voice is perfectly captured. However, the pacing is overly quick, and the emphasis and intonation sound unnatural and mechanical. This tool may be useful for swapping out a word or two, but it struggles with more complex tasks.
Descript pricing: $288 per user per year (equivalent to a monthly fee of $24)
Resemble.ai sets itself apart by offering speech-to-speech capabilities on its 'pro' plan. This means that instead of relying solely on text input, Resemble allows you to infuse your AI output with the natural pacing, emphasis, and intonation of a human voice.
To create your cloned voice, you simply record a series of 25 phrases, the first of which records your consent.
The platform also offers the most customization options, allowing you to tweak the intonation, cadence, and emphasis to your liking. While the product's own limits and our inexperience with it held us back at times, Resemble.ai provides a revolutionary way to bring the natural fleshy goodness of the human voice to AI-generated voices.
Here's what it created with the text-to-speech feature:
Like the other options, this suffers from a mechanical delivery with uneven pacing and inconsistent emphasis. However, let's take a look at how it performs in speech-to-speech conversion:
Expert input: When accents clash, it can result in a jumbled and awkward mix of pronunciation, cadence, and emphasis. Recently, we tested this theory by cloning the voice of someone with an English accent using audio from a Scottish speaker.
The result was a strange hybrid of these two accents that didn't quite hit the mark. But there's more to the story. We didn't give up on this project just yet. Instead, we tried using Resemble's US voice and had an American speaker provide the input. And the outcome?
Well, let's just say it was much more polished, natural, and authentic-sounding. It turns out that the right accent can make all the difference when it comes to voice cloning.
The speech-to-speech output sounds quite impressive. There is still a detectable difference from the real thing, but it is significantly less noticeable than in the text-to-speech version.
While this voice cloning tool comes close to reproducing someone's voice accurately, it falls just short of achieving that goal. For an unfamiliar voice, it may suffice, but if your audience is already familiar with your voice, the output will likely generate an unsettling sensation and fall into the "uncanny valley" category.
Interestingly, this tool represents the closest we've come to fully replicating a person's voice.
Resemble pricing: If you're interested in using Resemble's speech-to-speech feature and premium voice cloning service, be prepared to pay around $1,000 per month according to their quote.
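For a rough side-by-side, here is a minimal Python sketch that ranks the monthly prices quoted above against the cheapest option. The plan labels are our own shorthand rather than official plan names, and Resemble's figure is only the approximate quote we received, so treat the result as a ballpark rather than a price list.

```python
# Monthly prices quoted in this article, in US dollars.
# The labels are our own shorthand, not official plan names, and the
# Resemble figure is an approximate quote for speech-to-speech.
MONTHLY_USD = {
    "ElevenLabs (Independent Publisher)": 99.00,
    "ElevenLabs (Growing Business)": 330.00,
    "Play.ht": 29.25,
    "Descript": 24.00,
    "Resemble (speech-to-speech quote)": 1000.00,
}


def compare_to_cheapest(prices):
    """Return (name, monthly price, multiple of cheapest), sorted by price."""
    cheapest = min(prices.values())
    return [
        (name, price, round(price / cheapest, 1))
        for name, price in sorted(prices.items(), key=lambda item: item[1])
    ]


if __name__ == "__main__":
    for name, price, multiple in compare_to_cheapest(MONTHLY_USD):
        print(f"{name:<36} ${price:>8.2f}/month  {multiple:>5.1f}x")
```

On these figures, Descript is the cheapest at $24 per month, and the Resemble quote is roughly 42 times that.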
AI Podcast Hosting: Key Takeaways
So, what did we learn about the current state of AI podcast hosting? While we may be worried about AI taking over our lives, it seems that we can breathe a little easier when it comes to our favorite podcast hosts.
Based on our testing of these four tools, we found that none of them can yet convincingly clone someone's voice on a consistent basis.
Despite the strengths of services like ElevenLabs, Descript, and Resemble, all of them ran into the same issue: the natural pacing and emphasis of human speech are simply too difficult to recreate.
While some services did a great job of getting the sound of Harry's voice just right, they struggled to capture the variable pacing and rhythm of human speech. Others did a great job of mimicking the natural inflections of another person's speech but struggled to ensure that Harry's voice sounded just right.
Given all of these limitations, we can be reassured that for now, the magic of human speech remains uniquely human.
What about workflows? While voice cloning may seem like a time-saving solution, the reality is that you'd still need to write podcast scripts, and speech-to-speech conversion would still require a human to record the voiceover.
Furthermore, the inconsistency of AI-generated output would require multiple takes of the same script, adding editing time and effort. However, for a high-stakes production where every minute is precious, leveraging AI may indeed make sense, as long as the outcome justifies the investment in both time and money.
The Future of AI-Powered Podcast Production
I see huge potential for a game-changing addition to any show: a robot co-host or AI cast member. Imagine the buzz and entertainment value it could bring! However, this isn't without its challenges. One major concern is how to make the conversation between humans and AI sound natural.
And let's not forget the need to clearly communicate to listeners that the voice they hear is AI-generated. Security of the scripts and ideas fed into these massive learning models is also a pressing issue.
These exciting challenges represent the tip of the iceberg for this evolving technology. It's worth considering that there may be even more exciting applications for AI in the podcasting world beyond just voice.
As podcasters, we all know that time is precious. That's why tools like GPT-4 have the potential to revolutionize the way we work. If you could save a few hours on research, guest preparation, and other tasks, you'd have more time to focus on the exciting parts of podcasting.
However, while AI technology can be a huge help, we must remember that human connection is at the heart of what makes podcasts so special. No algorithm can fully replicate the magic of a genuine human connection with our listeners. So as we embrace new tools and technologies, let's never forget why we fell in love with podcasting in the first place.