Synthetic AI voices are everywhere, from the smart speakers in your home to ads on the radio. This week, we speak with David Ciccarelli about AI voices and voice acting. David is the CEO of Voices, a leading voice over platform where companies can hire voice actors to complete creative projects.
In this episode, we cover the rapid rise of AI voice technology, what goes into creating synthetic voices, and future implications of their development. David also addresses the value of voice actors as compared to synthetic voices, and reasons that companies might choose to use one over the other.
Check out Voices here: https://www.voices.com/
Deep: Hi there, I'm Deep Dhillon. Welcome to Your AI Injection, the podcast where we discuss state-of-the-art techniques in artificial intelligence with a focus on how these capabilities are used to transform organizations, making them more efficient, impactful, and successful.
Welcome back to Your AI Injection. This week, we'll be discussing the role of AI-based synthetic voice generation in audio advertising. We're speaking today with David Ciccarelli. David's the CEO of voices.com, a large audio services marketplace. I can't wait to speak with him and learn more about the world of audio advertising.
So, David, please tell us a little bit: how do AI voice generation models work at a high level, and why does this matter in the world of audio advertising?
David: Yeah. Well, I mean, since the advent of modern computing, people have been engaging with computers in some form or another, call it the progression from typing to tapping to talking. So we're at this phase where we're talking, and I think the first real evidence of AI in a speech or voice situation that was broadly embraced was the advent of the smart speaker. Smart speaker adoption, Deep, has been faster than any other consumer device. Smartphones took 10 years to reach 50% penetration of the US population; smart speakers only took about four. That was because Amazon was basically giving these Alexa devices away, you know, virtually for free.
Deep: It's pretty mind boggling if you think about how intrusive these devices are relative to anything else in our lives with maybe the exception of the smartphone itself. I mean, you have a mic on 24/7 in your living room and people just, you know, kind of took it.
David: Yeah, just rolled with it. I mean, they were $19.95. Amazon was positioning them as Thanksgiving and Christmas gifts, stocking stuffers, and they got a hundred million of these devices into people's homes over, I think it was 2017-18, somewhere around there. And the number one place where people put the smart speaker in the home is the kitchen, followed by the bedroom, which is an even more private and intimate place. I guess because of the going-to-bed and waking-up routines, people like that experience. But I'd say smart speakers were probably the first time an AI synthetic voice was really prevalent. Even before that, Siri was used, but not as much. I think what it prompted was this whole comfort level with speaking commands, searches, desires, requests audibly in the privacy and comfort of your own home, and then getting a response back.
Deep: Maybe importantly or not, but it didn't sound like a robotic 1980s movie voice. It actually had a natural voice.
David: You raise a really good question: how are these voices even created in the first place? There's a three-step process. Call it preprocessing: there's normalization, which starts with formatting the text and formatting the audio. The old way of doing this was preprocessing that tackled words that are pronounced in different ways and can even have different meanings. And then it would be stitching together these phonemes, meaning that the first part of one word could be used to make the sound of another word, converting characters into phonemes, if you will. Or-
Deep: Yeah, like if you have to say the word bark, you might record the b-, the a-, the r-, and the k-, and then you put them together and it sounds like "bark." Haha, it sounds kind of crazy.
David: You get that happening fast enough that it becomes almost indistinguishable. That kind of stitching together is called concatenative, and it's the most natural-sounding approach we have right now: stitching together snippets of sound to create whole new words, whole new sentences, that weren't really there before. Then parametric is this whole new generation, where all the information and data required to control the voice is modeled, so it doesn't even need that base sound to work from. I know it's a little beyond my educational background; I'm an audio engineer, not a data scientist or an AI expert in that sense. But you're creating parameters that allow the contents and characteristics of speech to be created, controlled, and manipulated in ways that weren't in the original recording. So that's basically the process, Deep, zooming out: record a lot of content, parse out the sounds that can be created, and stitch it together. And then the next level is, learning from the sounds that are stitched together, can we make derivative sounds that weren't actually recorded? They're almost like byproducts or derivatives that ultimately create the illusion of a voice that was never recorded.
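The concatenative approach David and Deep describe, stitching pre-recorded phoneme snippets into words that were never recorded whole, can be sketched roughly as follows. This is a toy illustration, not a production TTS system: the "phoneme inventory" here is synthesized sine bursts standing in for real studio recordings, and the snippet names and durations are invented.

```python
import numpy as np

SR = 16_000  # sample rate in Hz

def fake_snippet(freq_hz: float, dur_s: float = 0.12) -> np.ndarray:
    """Stand-in for a recorded phoneme snippet (a short sine burst)."""
    t = np.linspace(0, dur_s, int(SR * dur_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t).astype(np.float32)

# Hypothetical phoneme inventory: in a real concatenative system these
# would be snippets cut from hours of recordings of a single speaker.
inventory = {"B": fake_snippet(220), "AA": fake_snippet(330),
             "R": fake_snippet(260), "K": fake_snippet(440)}

def concatenate(phonemes, crossfade_s: float = 0.01) -> np.ndarray:
    """Stitch snippets together with a short linear crossfade so the
    joins are less audible -- the 'concatenative' step itself."""
    n_fade = int(SR * crossfade_s)
    out = inventory[phonemes[0]].copy()
    for p in phonemes[1:]:
        nxt = inventory[p]
        fade = np.linspace(1, 0, n_fade)
        out[-n_fade:] = out[-n_fade:] * fade + nxt[:n_fade] * (1 - fade)
        out = np.concatenate([out, nxt[n_fade:]])
    return out

# "bark" = B + AA + R + K: a word assembled from parts, never recorded whole
audio = concatenate(["B", "AA", "R", "K"])
print(len(audio) / SR)  # total duration in seconds
```

Parametric synthesis, by contrast, would replace the fixed inventory with a model that generates the waveform from learned speech parameters, so it can produce sounds that were never in the source recordings at all.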
Deep: So, one of the things these models are trying to learn is on behalf of a particular person or speaker, which is like, how are they taking their pauses? How are they speaking it with a particular emotion or particular level of maybe aggression or lack of aggression or calmness or what have you?
David: Well, Google had a demo a number of years ago that really caught the world by storm and captured the imaginations of many of us, because the AI voice was purposely inserting those human elements, going, "Umm, okay, let me just think about that," "Uhhh," and it was like-
Deep: All of the stuff that we spend so much time trying to get rid of-
David: Exactly. There were these humanisms that were completely unnecessary; obviously they were inserted purposely to make the voice sound more natural, as if it were, quote unquote, thinking.
Deep: And then it just sounds really awkward. Cause now it's like somebody like faking being natural.
David: You know, engineers are actually inserting those human elements, the pauses. The other aspect is how you create emotion, which can be done through a higher voice: it sounds more excited, like you've got a lot of energy when you're up here. Speaking faster can sound more aggressive, or again, more energetic. Slower sounds more demure, a contemplative pace, like you're thinking something through. Even just elongating or compressing on the time horizon, how long or short it takes to say something, gives the impression that someone's thinking about it, or that they're rushed and excited. That's how the human ear interprets what we're hearing.
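Those levers, pitch, rate, and elongation, are exactly the kinds of parameters an engine exposes to shape emotion. A deliberately crude sketch: naive resampling, which changes speed and pitch together (real TTS engines work on vocoder-style features so the two can be controlled independently). All names and numbers here are illustrative.

```python
import numpy as np

SR = 16_000  # sample rate in Hz

def tone(freq_hz: float, dur_s: float) -> np.ndarray:
    """A one-frequency stand-in for a recorded phrase."""
    t = np.linspace(0, dur_s, int(SR * dur_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t)

def resample(audio: np.ndarray, factor: float) -> np.ndarray:
    """Naive resampling: factor > 1 plays faster AND higher-pitched;
    factor < 1 slower and lower. This coupling is why cheap speed-ups
    sound 'chipmunked' -- real engines decouple pitch from duration."""
    idx = np.arange(0, len(audio), factor)
    return np.interp(idx, np.arange(len(audio)), audio)

calm = tone(200, 1.0)                 # 1 second at 200 Hz
excited = resample(calm, 1.5)         # shorter and higher: more energy
contemplative = resample(calm, 0.75)  # longer and lower: slower, pensive
print(len(excited) / SR, len(contemplative) / SR)
```

The point of the sketch is David's observation in miniature: stretch the time axis and the same words read as contemplative; compress it and they read as rushed or energetic.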
Deep: So let's take a little bit of a turn. Just so everyone knows where we are: you've got some text sitting there, you pick a voice, and you've got this new generation of AI synthetic voices that are getting increasingly close to human. Tell us a little more about your background. How did you get into the audio recording field, what inspired you to create voices.com, and why are we talking about synthetic voices?
David: Yeah. Growing up, I was always fascinated with sound. When I graduated from high school, I went to an audio engineering school to learn how to record, mix, and produce sound. I actually opened up a small recording studio of my own and got my name in the local newspaper, on my birthday of all days. That's how I met Stephanie, who's now my dear wife. We ended up working together, and I said, listen, I'll be the engineer and you be the female voice talent, and we'll split the money 50/50. So that's how we started working together. Eventually the move for us was to get out of the recording business and instead pivot to running an online marketplace. That's what voices.com is: an online marketplace where we host voice talent so they can showcase their work and be available for hire. Clients post a job, and the talent reply with a quote for how much they're willing to do the work for, along with a sample recording of their voice.
Deep: Interesting. So tell us a little bit about the universe of uses for voice actors.
David: We have 12 use cases. The number one is what we just call online video. Think of it as everything from an explainer video to a product tutorial; it could be a short social media campaign. After online video it's e-learning, so companies teaching their employees, or compliance videos. And then we go down the list: commercials on radio and television, phone system recordings, video games, audiobooks, podcasts. There are a lot of applications, but the number one is these online videos. I think that's because they have such a short shelf life. Once somebody's seen a video once, they're like, oh, I get it, I've already seen this thing before, and they just scroll past it. The shelf life has expired, and companies have to constantly be creating new content.
Deep: Is that part of the driver here? Maybe in the past you had a big-budget ad going out at a national level, and you could afford to pay human talent for that, but for some of these really small ad runs, maybe there just isn't a budget to pay for a human. Is that part of it?
David: Yeah, I'd actually argue it's almost the other way around. What we developed is a two-by-two matrix, inspired by the classic McKinsey two-by-two, called the synthetic voice matrix; people can Google that and see what it looks like. In short, on the horizontal axis we have time, the duration of the content being heard. And Deep, generally speaking, the shorter the instance or interaction you're listening to, the more suited it is to a synthetic voice. I'm talking short, like 10 seconds. Think of turn-by-turn directions, and better yet, places where the content is changing all the time: train stations, subway stations, airports. That's where you have very short, bite-size information, and the content itself is dynamic. It's five seconds, and it has infinite possibilities. You can't possibly have a talent go into a recording studio to record all those variations and iterations.
Deep: What are all these iterations? Is that because you're maybe taking an ad concept and localizing it to a particular city? And now you've got, you know, 10,000 city names that the ad is going to render? Something like that?
David: That's one. Though I'd actually argue that those applications or use cases are more industrial in nature, and an advertisement is probably still better served by working with a human voice actor.
Deep: Ah, okay. So you're thinking more like: you call the corporate office, they ask you something, and it says something back, that kind of stuff.
David: Yes. Or again, going to this news, weather, and sports idea, right? You're asking, what city are you in? Oh, in the city of New York City, the weather today, April 8th, is... You know, it's stitching these together. It does sound a little disjointed, but given that the interaction is under 10 seconds, definitely under 30, people's tolerance for that disjointed listening experience is very high, because it's utilitarian. You just need the information; you're not looking to be entertained. It's "thanks for letting me know, carry on." However, to close off this synthetic voice matrix: when the duration of the listening experience is longer than a minute, and it can go as long as an hour or two, think of movies and games, that's where a voice actor can lend a hand. People want to hear from other people in that situation because they're enthralled in the content. So that's the difference we see between where a voice assistant or an AI voice is suitable versus a live human voice actor.
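The "stitched together, slightly disjointed" experience David describes is essentially template filling: fixed carrier phrases with dynamic slots rendered at request time. A toy sketch (the station names, slot names, and the 2,160-variant arithmetic are all invented for illustration):

```python
# Fixed carrier phrase with dynamic slots -- the case where the content
# changes constantly and no studio session could cover every variant.
TEMPLATE = ("The next train to {destination} departs from "
            "platform {platform} in {minutes} minutes.")

def announce(destination: str, platform: int, minutes: int) -> str:
    # In production, this rendered string would be handed to a TTS
    # engine; here we just produce the text that would be spoken.
    return TEMPLATE.format(destination=destination,
                           platform=platform, minutes=minutes)

print(announce("Grand Central", 4, 7))

destinations = ["Grand Central", "Penn Station", "Newark"]
# Even a tiny slot space explodes: 3 destinations x 12 platforms x 60
# minute values = 2,160 variants, trivial to synthesize on demand.
print(len(destinations) * 12 * 60)
```

The disjointedness listeners tolerate comes from exactly this structure: the carrier phrase and the slot values can be rendered separately and joined, because the interaction is short and purely utilitarian.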
Deep: On the machine learning front, the systems are still fairly early days. I mean, deep learning's opened up an awful lot of ability to learn a particular human's voice, or blend a particular set of humans into one voice, but we probably haven't made too many strides on the acting front.
David: No, and part of that is because of the inputs, right? Remember, we talked about that source material. If the source material for creating a synthetic voice is a blend of 25 different people all mixed together, you do get a distinct sound that is of no one person, but what it lacks is character: it's regular, everyday folks just reading a script, nothing in particular. The point of the script is to capture as many different sounds as possible. However, you can also hire an actor to produce a synthetic voice of their own, and this is where we're starting to see it go. It's actors out there saying, you know what, I can actually replicate myself, so that if somebody wants to hire me, I can do it live or they can license my synthetic voice. It's kind of the best of both worlds for progressive voice actors, and actors of all walks of life, because it opens up new work opportunities: they don't always have to step up to the microphone in the vocal booth, because they've created, in effect, one of these speech engines. And you don't need more than two or three hours of source content anymore. It used to take 60 hours to produce a voice; now it's down to two or three. Adobe claims they can do it with as little as 20 minutes: they've got the source content, and now you can basically edit audio by typing. "Oh, I missed a word." I'm literally typing text and it's playing it out, and they only need 20 minutes of content. Now, I don't think the quality is as good as some of the standalone applications, but you can see it's gone from 60 hours down to arguably 20 minutes in a matter of years.
Deep: And the type of audio you're using, and the range it covers, is probably important for these models too, like being able to capture a lot of emotive states and a lot of different diction. One question I have, though, is about the famous voices out there. I'm just curious how the legal world interprets this. Can I just go grab a bunch of hours of recordings of, I don't know, Sandra Bullock, or somebody whose voice everybody knows, create a voice puppet out of it, and then manipulate it? Or do you have concerns about ownership of training data, and about people using these voices in contexts that aren't allowed?
David: Yeah, so there's two parts to that, Deep. First, let me address the training. If it's training data and you're merely training the model, the output voice that's created actually isn't reproducing the voices that were input; it won't sound like them. They're merely used to train the model, as opposed to creating the synthetic voice itself. It might sound like a distinction with no difference, but it's that 20 people on the input and one sound on the output. There's less concern from voice talent in that situation. However, in situations where somebody wants to go grab content that's out there in public, not legally in the public domain, but out there, and then produce something, that is, candidly, illegal, because you cannot use that person's image and likeness, just like you can't take a photo of someone and use it for commercial purposes without their consent or due consideration, i.e., a payment. So that's where the line is: is this in the public domain, and is it for commercial purposes? If you're an AI-driven company looking to create a voice or work in this space, then hire the voice actor.
Deep: You're listening to Your AI Injection, brought to you by xyonix.com. That's x-y-o-n-i-x.com. Check out our website for more content, or if you need help injecting AI into your organization.
Changing tack a little bit. In the advertising world, if you're running digital ads, say I take out an ad campaign on Google. I'm usually uploading a bunch of asset permutations: I might have 10 different permutations of my ad's title, and Google will go out and start mixing and matching those permutations to figure out the optimal one, the one that resonates most and has the highest click-through. Are you seeing that level of usage with synthetic voices in ads now, where folks create myriad asset permutations and then measure them?
David: Yeah. I mean, historically it was done by just producing 200 radio ads, one for each local and regional market across the United States. Now, imagine it's a gym, a franchise with locations across the country. Instead of creating 200 ads where the tag, that little call to action at the end, has to be recorded each and every time, what we're seeing some companies experiment with is: why don't we just create one ad and have an AI voice produce that call to action? So it's "meet with Lucy at your local Albuquerque gym," right? It's stitching together the local rep and the city at the end, or the local phone number. The producer just types that content out, and then that bit of the ad is read out. It could even be in a different voice; it doesn't necessarily have to be the same one. We see gyms doing this, and car dealerships and insurance companies too. They're trying to localize the content and then mass-produce it. Why? Because they're favoring personalization: the message I'm hearing sounds localized and targeted to me, and it kind of sounds much more-
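The localized-tag workflow David sketches, one human-voiced national ad body plus a synthesized per-market call to action, amounts to a batch job over the franchise roster. A minimal sketch, where the market list, file name, and tag wording are all made up:

```python
markets = [
    {"city": "Albuquerque", "rep": "Lucy"},
    {"city": "Boise", "rep": "Marcus"},
    {"city": "Tucson", "rep": "Priya"},
]

BODY = "national_ad_body.wav"  # the human-voiced portion, recorded once
TAG = "Meet with {rep} at your local {city} gym today."

def build_campaign(markets: list[dict]) -> list[dict]:
    """Pair the shared human-read body with a per-market tag to be
    synthesized; the result is the render manifest for a TTS batch."""
    return [{"market": m["city"],
             "body": BODY,
             "tag_text": TAG.format(**m)}
            for m in markets]

for job in build_campaign(markets):
    print(job["market"], "->", job["tag_text"])
```

The design choice mirrors the conversation: the expensive, emotive part of the ad is recorded once by a human, and only the short, fast-changing tag is regenerated per market.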
Deep: I mean, that makes sense, right? Say you're, I don't know, a younger female and you're buying something. You're going to have a default increased level of trust with a voice that's closer to yours than one that's significantly further away. So it would make sense that ad targeting is going to start directing toward the voices that carry a higher trust or belief level with you.
David: You bring up a really important point, and we've actually validated this with research: people like to buy from people who sound like them.
Deep: Yeah, yeah they do. And people who look like them too. I mean, these are very base human instincts, right?
David: Sure. The "sound like them" part, it's like, oh, who's better for a radio campaign, male or female? I'm like, well, who's the audience you're trying to reach? Just mirror who you're trying to reach, you-
Deep: Humans are always trying to seek comfort in the familiar. And all humans have like a default, I mean, not to get too psychological, but we have a default ‘in love with ourselves’ state. That, that is just sort of the part of human nature.
David: Yeah. You want to feel like there's a sense of belonging, right? And brand advertisers are tapping into that. They know that if they can sound like what we call the authoritative expert, the guy or gal next door who's always up on pop culture and news, "here's the latest about Elon Musk," whatever it is, and can share that information, that's the posture that ends up coming out in the ad. It's like, "Hey Deep, did you hear that thing Musk said the other day? Like, crazy, right?" It's kind of this inside information. That's what a lot of people want to hear, and that's the character you probably want to hire for your advertisement, because that's who you're trying to reach. You want to be speaking to people in a voice, language, tone, and character that relate, so that it's exactly that: relatable.
Deep: How do you measure effectiveness of these voices that are being generated? And how effective are they getting?
David: We actually ran a study with a company called Voicebot.ai to uncover whether people can even distinguish human from robot. That was basically the premise, and candidly, our business is largely based on hiring people to do voice-over projects, so we wanted to understand: hey, how much of an existential threat is this? Most people could distinguish the difference; however, they actually had to listen. It was more a question of how much content you have to hear before you go, oh, that's the robot versus the human. We found it was about five to ten seconds. Now, this was a couple of years ago, but let's go with the ten-second idea. Even at ten seconds, it also matters what the application is. If it's industrial in nature, I don't really care what the automated parking garage system says to me; I know there's not a person back there talking to me. When it's an audiobook or a game, or something where I'm expecting somebody to tell me a story, my expectations are different. So under ten seconds, not only can people not really tell the difference, they also don't seem to care all that much, or in a lot of applications they're already assuming it's some kind of synthetic or AI voice doing that recording. The other emotion uncovered in people listening to these was an elevated sense of anxiety, a tension, because they became unsure. And if you're unsure whether it's human or robot, then you're wondering, do I trust this voice and the information it's telling me or not? That was a very unexpected discovery in the research: at ten to thirty seconds, people were trying to discern whether it was one or the other, and as soon as they heard some glitch, some non-human artifact, they'd feel like, oh, I won as a human, I know that's really a robot.
And now I feel at ease. At least I know. It's the not knowing that created the biggest anxiety. That's why I think Alexa and Siri are successful: people know it's not some recording. It's basically going and fetching some content from Wikipedia and reading it out to you; that's pretty much what most queries are. And people are fine with that.
Deep: You know, as long as there's transparency, folks are fine. But at the same time, even if you know it's a robot, having it be a less annoying robot is also important. Human-like is valuable, because it'd be hard to listen to the Speak & Spell voice for very long before you're just done with it. One question I had for you, it's a little bit different: do you know of datasets out there for measuring voice efficacy in ad delivery? Or is this really unique to the individuals running the ads?
David: Yeah, I'm not familiar with one myself. What I do know is that a lot of this is still done with basically small sample groups, focus groups if you will, listening and making preference choices about what they're hearing and what they're discerning. But in terms of a "this dataset is the most human" benchmark, what we're experiencing at voices.com is, let's just call it big tech without naming names, big tech companies saying: we want 100 hours, we want a thousand hours of sample content to train these systems. It's not to produce the voice; every big tech company has a voice. They're looking to make that voice better, both at speaking and at listening, and at how it responds in a given situation, all of those nuances. The engine for Alexa or Siri, this is why you can change Siri to sound like somebody different, or change Alexa. The engine behind the scenes, as I'm sure you're aware, is Amazon Polly. Basically it's one engine with different, almost like voice skins, if you will, that to our ears sound like different people, but it's virtually the same engine powering it behind the scenes. One of the big challenges is gathering data in a structured way, because there are two parts to it. It's the recording, yes, but then it's having the source of truth of what the actual words said were, with an expected emotion, and having those two matched up. That's what you need to train a model, not source content of random people talking into phones or smart speakers. I don't know what that person was going to say or what the expected emotion was; it's just a recording.
And you might disagree on this one, Deep, but from my understanding and what I've seen, I don't think big tech are just harvesting all these recordings to improve their synthetic voices, because the data's so unstructured. You don't know the location, the emotion, the words to be said; all of that's missing. Which is why many of them are coming to Voices to say, can you create a structured dataset for me so that we can train these engines?
Deep: Well, let's talk a little bit about that. Within voices.com, what's unique about your dataset? I imagine with millions of voice actors, that alone probably makes it quite distinct?
David: Sure. The main reason somebody would come to voices.com, as we talked about earlier, is really to hire a human voice actor to record a commercial, an online video, or their phone system: they want a person to do it. However, there are situations where AI enthusiasts, data scientists, innovation labs are saying, no, no, no, we want to create our own synthetic voice engine, but we need a lot of data to train it.
Deep: So they give you an input, like "excited female voice, this age demographic," and then boom, you can farm that out to your voice actors, because you know who those folks are. They render an excited read of the script, the client taps into it, and they can start building out their training set.
David: Exactly. So what we deliver back is a spreadsheet of file names with the lines that were read, in tabular format, and then, in effect, a Dropbox or equivalent with all of the source audio files. That's what needs to be delivered: audio files matched up with the attributes, the metadata, about each file. That's what's used to do the training. The other thing we're hearing is requests for specific environments: we want this voice to sound like it's in a cafe, or a car, or an airport. So there are different environments as well. Emotions are definitely a big one, not to mention getting into different languages entirely, because that's very untapped. It's highly English-centric right now, all of these voices, and very little has been done elsewhere. Probably because the work's not finished in English yet, so let's perfect that before moving on. Or it just follows the natural distribution of language popularity, intersected with the places that have money to pay for machine learning to be built for those languages.
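The deliverable David describes, audio files matched to the exact line read plus metadata like emotion and environment, is essentially a training manifest. A minimal sketch of what one might look like as CSV; the column names, file names, and values are illustrative, not Voices' actual schema:

```python
import csv
import io

# One row per recorded utterance: the ground-truth text, the intended
# emotion, and the acoustic environment, keyed to the audio file name.
rows = [
    {"file": "spk01_0001.wav", "text": "Your order has shipped.",
     "emotion": "neutral", "environment": "studio", "language": "en-US"},
    {"file": "spk01_0002.wav", "text": "Your order has shipped!",
     "emotion": "excited", "environment": "studio", "language": "en-US"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
manifest = buf.getvalue()
print(manifest)
```

This pairing, recording plus a source of truth for the words and the expected emotion, is exactly the structure David says raw smart-speaker recordings lack.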
Deep: So it's kind of an interesting evolution. You started off with a very specific use for these voice actors and actresses, basically doing things that have been done traditionally over the years. Now you've got these new entrants tapping in, because it's a great way for them to train their models. How do you think about the resulting models they generate, and what role your business plays in them? Do you think, "I will always be the business that provides the training data to improve these, because I have the real humans"? Or do you think, "maybe I take those models and help my humans represent themselves outside of the platform," where they build their own voice puppets, if you will?
David: Yeah. We know it's very hard to accumulate this data at the scale being asked for. I mean, we've had projects that requested over 3 million words, so these are massive engagements. The clients obviously couldn't go out and collect this themselves, or don't have the time or expertise to get that kind of data. And our talent aren't people on the street; they have home studios, so the quality of the audio is high.
Deep: Yeah, no, that's really important, because anyone could go to Mechanical Turk and ask a ton of people to record, but they're going to do it almost exclusively on their iPhone or Android device, and wherever they happen to be, because they're not used to thinking about all the things your voice actors and actresses think about. Your voice actors know about background noise, and about all these other things.
David: And also having a good microphone, and having what I have right in front of me now: it's called a pop filter, or windscreen, to eliminate the plosives. You don't want a pop and those kinds of artifacts. And knowing how close you should even be to the microphone. One secret technique, if you will: the closer you are, the more intimate it sounds; the farther away, the bigger. These are the things that create emotion. It sounds very intimate up close, when I lower my voice and soften it but lean into the microphone, versus when I step back, where I can be much bigger and more verbose and really put on a show. The talent know how to do that, and that comes through with an actor who knows how to, quote unquote, work the microphone. So I think there's a future where talent will be needed for this. There's also a future where I can definitely see talent going, you know what, I've done a number of these to train the systems; can I actually procure the creation of a clone of my own voice? Then if I get a request for an edit or some small change, or think of an on-hold phone system, where you're put on hold and they have these seasonal updates every now and again-
Deep: You're in an interesting position as a company, right? Because you have the voice talent to create really nice, high-quality voices. What do you call them? I call them voice puppets, but that might not be the terminology you use. A clone, a synthetic voice. You're well-poised there, but you're also connected to the consumers of that audio. I'm guessing you might have a lot more consumers lined up in traditional channels, and you're trying to find new channels, maybe video game producers, or other places that need synthetic voices as opposed to human voices. Where are you with respect to the folks who are making the voices? Do you see that ultimately you merge with or acquire one of these firms and bring that capability in-house so you can evolve it? How do you think about that?
David: You know, that's definitely a possibility. The number of firms that are creating these synthetic voice engines, I mean, there's probably 20 or 30. You can Google AI voices and see them and hear them, and they're sounding pretty good now. Each of them wants to claim, you know, with our engine you can pick between 150 voices in 20 different languages. That's an important element, so I would say they want to expand their capability. With all respect, I mean, obviously tremendous innovators, tremendous engineers to even create that engine in the first place. However, if you're an ad agency, you probably want the emotion. You're representing a brand, and therefore you just want to be able to make sure that everything is pronounced exactly the way you want. And you don't want something emulating the human voice, you just want the human voice. So I think some of these text-to-speech applications are looking for the use cases where it really makes sense: where there isn't a budget to hire a voice actor, and where the content is maybe so long, so dynamic, and changing all the time. So I described a number of these industrial applications. The other end of the spectrum, interestingly enough, is corporate compliance videos. Not corporate training where it's like a product launch and it's kind of all hype, almost like a product marketing video; I'm talking about compliance and training. Think of it like, if you're at a pharmaceutical company or a bank or an insurance agency, every one of your agents, every one of your employees, has to go through hours and hours and hours of compliance every year. Does that curriculum designer sitting in the human resources department want to do the voiceover themselves? No. Are they woefully underfunded? Yes. But they still have to produce this video content that 5,000 or 50,000 employees will have to watch.
So they want to augment it from just a PowerPoint. And then they realize they also have an accessibility mandate: oh, you also need to have this dubbed, if you will, into the 10 languages that our employees speak. How are you going to do that with very little budget?
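[Editor's note: the multilingual dubbing workflow David describes can be sketched as a simple translate-then-synthesize pipeline. The `translate` and `synthesize` functions below are hypothetical stand-ins, not a real MT or TTS API, so the sketch only illustrates the shape of the loop over target languages.]

```python
# Minimal sketch of a multilingual dubbing pipeline: one script in,
# one synthesized audio track out per target language.
# translate() and synthesize() are hypothetical placeholders for
# whatever machine-translation and text-to-speech services a team uses.

def translate(text: str, target_lang: str) -> str:
    # Placeholder: a real system would call a machine-translation service.
    return f"[{target_lang}] {text}"

def synthesize(text: str, lang: str) -> bytes:
    # Placeholder: a real system would call a TTS engine and return audio bytes.
    return text.encode("utf-8")

def dub_script(script: str, languages: list[str]) -> dict[str, bytes]:
    """Produce one audio track per target language from a single script."""
    tracks = {}
    for lang in languages:
        localized = translate(script, lang)
        tracks[lang] = synthesize(localized, lang)
    return tracks

audio = dub_script("Welcome to the annual compliance training.",
                   ["es", "fr", "de"])
print(sorted(audio))  # → ['de', 'es', 'fr']
```

The point of the sketch is the cost structure: once the script exists, adding a language is one more loop iteration rather than another studio session, which is why low-budget compliance content is an early fit for synthetic voices.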
Deep: That makes sense, right? It's not like a Hollywood film where you can, you know, hire folks. And there's just an awful lot of scenarios like that, where you've got English and you need to rip it off into another 25 or 30, you know, languages. The legal system comes to mind. There's an awful lot of translation going on there, because everyone has a right to understand the trial or whatever, you know, they're participating in. So let's fast forward five, ten years. How's the world different when we have these, both good and bad? What happens here in this world?
David: Well, I think for the content producers, the B2B marketers, the storytellers, creating audio is difficult, right? And so I think in that world that you're describing, the creation of voice content will be much easier. It will sound much more fluid and natural, again, fast forwarding that many years. And you're right, the tipping point is when the synthetic voice is indistinguishable from a human, and not just over five seconds but over a duration of, like, five minutes, and I really cannot tell the difference. And so that will mean probably more audio content. I think it makes the world a more accessible place. That means whatever environment you're in, whatever content you want to consume: I want this article read out to me. I'm a big podcast and audio guy. I have 500 audio books on my iPhone. I listen to content all the time, the same way.
Deep: I'm the same. I've got a ton of audio books, and I listen to podcasts constantly.
David: Oh yeah. That's really the benefit: a third of the world's population are considered auditory learners, and therefore that's how we like to consume content. And not just content that's merely entertaining, but also content that educates and informs, and at present, not all of it is like that. So I think there's an opportunity for publishers and content producers to make all of their content universally accessible with audio.

Deep: Yeah. Yeah. The accessibility thing is real. There's an awful lot of folks that just simply can't see, you know, and there's other folks in different scenarios. There's the crowd that benefits from having it available via audio, and then there's the crowd where it's like, I must have it in audio.

David: Or, again, engaging with the world around us. I mean, imagine being in a country that speaks a different language. You know, you receive literature, and digitally it could be translated and voiced. It's almost like the universal translator in audio format, in real time. So I actually don't think that's all that far fetched. Our mission is to make the world a more accessible place through the power of the human voice. That's what we're trying to create. And this conversation, Deep, has actually given me an opportunity to share that. So thanks so much, Deep.
Deep: Yeah. Well, thanks so much for coming on the show. This conversation has been a blast.
That's all for this episode of Your AI Injection. As always, thanks so much for tuning in. If you've enjoyed this episode on AI voice generation and advertising, please feel free to tell your friends about us, give us a review, and check out our past episodes at podcast.xyonix.com. That's podcast.xyonix.com. And please drop us a note if you've got specific things that you'd love for us to cover here on Your AI Injection. That's all for this episode. I'm Deep Dhillon, your host, saying check back soon for your next AI injection.
In the meantime, if you need help injecting AI into your business, reach out to us at xyonix.com. That's x-y-o-n-i-x.com. Whether it's text, audio, video, or other business data, we help all kinds of organizations like yours automatically find and operationalize transformative insights.