Your AI Injection

Is AI an ally or adversary? Get Your AI Injection and learn how to transform your business by responsibly injecting artificial intelligence into your projects. Our host Deep Dhillon, long-term AI practitioner and founder of Xyonix.com, interviews successful AI practitioners and domain experts to better understand how AI is affecting the world. AI has been described as a morally agnostic tool that can be used to make the world better, or harm it irrevocably. Join us as we discuss the ethics of AI, including both its astounding promise and sizable societal challenges. We dig in deep and discuss state-of-the-art techniques with a particular focus on how these capabilities are used to transform organizations, making them more efficient, impactful, and successful. Need help injecting AI into your business? Reach out to us @ www.xyonix.com.

Amplifying Equity: The Power of AI-Enhanced Audio Engineering with Emil Winebrand

June 13, 2024 • Deep • Season 4 • Episode 1

In this episode of Your AI Injection, Deep Dhillon and Emil Winebrand discuss the revolutionary advancements in AI-driven audio enhancement. Emil, co-founder and CEO of insoundz, explains how their AI-powered technology addresses and solves audio quality problems, making studio-grade audio more accessible. The two chat about the practical applications of this technology across different industries, from podcasts and video production to everyday communications. They also delve into the technical intricacies of these AI models, including how they handle various types of noise and distortion.

Learn more about Emil: https://www.linkedin.com/in/emil-winebrand-59a4061a/
and insoundz: https://www.linkedin.com/company/insoundz/

[Automated Transcript]

Deep: Hey there, I'm Deep Dhillon, and welcome to Your AI Injection. Today we have Emil Winebrand joining us. Emil is the founder and CEO of insoundz, a generative AI audio enhancement company focused on using AI to remove unwanted noise and output studio-grade audio. Emil, maybe get us started by walking us through: what problem are you trying to solve at insoundz?

And maybe I'd like to start with: what do people who don't use your product do to face the problem?

Emil: Yeah, that's a good question. So, so as you know, audio is everywhere. Specifically, we're on an audio podcast. It's the way you interact with people. It's the way you interact with your loved ones, your car, your devices.

And it's how you do your work and so forth, but it doesn't get the right attention all the time. Audio has always been kind of the second brother of video. The problem is that creating high-quality audio, and really a high-quality production, takes huge efforts that people are seldom aware of.

And why is that important? If you take, for example, a nice study that was done in academia, it's been shown that when you have bad audio, in terms of quality, lost packets, whatever, people tend to think that the speaker is less intelligent, less likable, and that his topic is not interesting. And in communications, audio quality is the main complaint that people have: something like 50 to 60 percent of complaints are about audio quality.

So it's important, and it takes a lot of effort. What people do today is either use professional services, like audio engineers and audio editors, if that's in content creation; and in live settings, they basically use whatever is available out there. That can be any sort of algorithm for live audio enhancement, but really none of them creates the right experience.

So it's still bad, if that makes sense.

Deep: Yeah, and I think this problem has probably gotten bigger in many ways in the last few years. It's always an interesting kind of temporal demarcation, right? Pre-COVID, if you were doing any kind of professional content production, you'd spend a lot of time and energy making sure that you have the right mics, that people are in the same room, that you have sound engineers, really making sure that the raw audio the post-production team is going to work with is of high quality.

Emil: Yeah.

Deep: And then, as soon as COVID hit, one of the things that struck me was really, really big shows, you know, like Trevor Noah and other folks.

They're in their apartment, cutting these huge shows that, a few days before, were done with a lot more hardware up front. And to fast forward to today, people realized a lot of those efficiencies. Like our podcast: you're in Tel Aviv.

I'm actually in Spain right now. We're having this conversation across a low-budget medium, if you will, Zoom. It can be a struggle to get the output quality to sound decent. In our case, we had some problems with audio for a while, and we think we've resolved them and gotten them a lot better sounding, but we're certainly not as good as if we had super high-quality mics and were in the same spot and all of that.

So it seems like you'd have no shortage of potential, uh, clients.

Emil: It's a very good point. So it starts from communications, like we're having now. Typically, communication is corrupted by environment, compression, bandwidth, lots of stuff. But then it goes to high-quality content production. And there is a much increased awareness of the actual content. As you say, post-pandemic, people began to consume more content from home.

They work more from home, being more efficient on one end, but then people are getting more aware of the quality of the content they're consuming, especially as it affects their energy levels, in a sense. Because if you're having a bad conversation for a long time, basically you're drained, you're energetically drained, and there's actually a biological, physiological reason for that.

Your brain works 10 times harder to compensate for bad audio. You actually physically feel that you're drained. And it actually comes from wasting your energy reserves on processing.

Deep: Yeah. And I'm sure a lot of us have felt this. There are different types of extra attention that one has to employ to figure out what somebody's saying, right?

There's the part where, even if all the audio was perfectly clear, you still have the language overhead that your brain is going through. And then your mind's also teasing out the ideas. But what you're getting at is just the problem of understanding the actual words being said, and the burden that puts on you to listen more closely.

Like if you're at a party and there are a lot of other conversations going on. We've all felt that added energy of having to strain and really pay attention to somebody just to figure out what their words are. So I think you're suggesting some milder version of that. One thing I want to do for our listeners, because a lot of folks maybe don't understand why it's hard to clean up a crappy audio file.

And I think before we get into just why this is such a wonderful application of AI and machine learning, can you help set the stage for somebody who maybe hasn't taken a bunch of signal processing classes and really tried to understand the complexity of what happens inside of an audio signal that makes it very difficult to get back to the equivalent of a green screen in imagery, you know, to get to a point where you have something unadulterated, not affected by filters?

So then you can go apply filters or whatever.

Emil: Yeah, very complex question. I would start by blaming, first of all, our ears. We are so much more sensitive to audio than to visuals. Just to give you an example: in vision, we have three types of color receptors, three frequencies, right? Everybody knows that. And there is also, you know, the dark mode, without color.

In our ear, we have a huge spectrum. For each portion of the spectrum, we have different follicles in our ears that actually sense it. There are so many intricate details that your ear is able to extract, which makes any type of processing really very hard. Our sensitivity is too big. And by the way, hearing and listening to something is typically considered more ground truth than seeing something.

You believe your ears much faster and with higher reliability than you believe your eyes.

Deep: That's kind of interesting. I mean, there must be some kind of evolutionary reason for that. I think in the modern era we put audio last, right? We get great AI models for language first, and then everybody talks about imagery and vision, but virtually nobody's really talking about the audio side.

And it's been like that for 40, 50 years, it seems, at least in the tech world, that we think of it as coming later. But I'm sure if you go to Hollywood, it's a totally different answer. When they talk about audio, it's essential, it's at the front. So I'm wondering, is there some biological reason for why we trust our ears more than our eyes?

Emil: Personal predator, danger detection: anything that happens in your environment comes first from the ears. Actually, your ears cover 360 degrees while your eyes cover a 120-degree angle, so it's kind of your radar. And by the way, your ears draw the attention of your eyes and not the other way around. It's very hard to fool our ears because, you know, that's a survival mechanism.

If you're walking in a dark forest and you hear something, you'd better be alert. Otherwise, we wouldn't be talking here. So I believe that there is a biological reason for that. And then it kind of evolved from that: people mainly communicated between themselves, and most of our information, before the written word, was actually audio, right?

People speaking to one another, et cetera. And I think our bandwidth perception, the amount of information we're able to pass to each other, actually increased with time and with the enlargement of the brains of Homo sapiens. But that's going too far, maybe. It kind of evolved from that, I think.

And language, especially the written word, stuff like that, came much, much later. So it's not biologically fundamental. And vision: although you perceive a lot of information from vision, you were unable to actually store and tell any data using visual signs up until maybe the last 30,000 years, when they started drawing stuff.

Deep: Yeah, that's a really good point. It was audio, like storytelling and, um, yeah, that's...

Emil: Storage.

Deep: Or maybe sticks in the mud or something for a little bit, but that resolution is so much lower than audio. Huh, interesting. So then if we fast forward a little bit, let's say to five years ago and before, there are a lot of audio signal processing reasons why it's hard to go from a bad audio signal to, you know, a green screen equivalent.

And maybe you can walk us through some of those as well.

Emil: Yeah. So, as I said earlier, our ears are sensitive to very small and intricate features in the audio. So to create a real green screen, you need to cover a lot of details. And if they're not consistent, not concise, if there's a discontinuity, if anything is going wrong, your ear will be able to hear that. You can have a perfect signal with some disturbance that is a factor of a hundred smaller than the rest of the signal.

But if it's in a certain frequency area where your ear is not saturated, you will still hear it. You don't have that in images. There, pretty much, if that's the background level, everything else is at the same level. Your eyes are unable to separate features across such a big dynamic range, if that makes sense.

But let's get to what it means to perceive speech. Okay, so somebody is talking at some distance from you, or from a microphone, it doesn't matter. Let's assume his speech is perfectly generated by his mouth, throat, et cetera. And then it goes through some propagation medium, which is air.

It bounces off walls, floors, whatever. And already when the audio comes to, let's say, a microphone, it's distorted in some way. On top of that, you have environment noise that is added to this. So you begin masking some of these important features in the audio by overlaying environment noise. And then you have, potentially, microphone noise and some microphone effects and whatever.

And then you also have the medium where the audio is transmitted, which does compression and maybe some pre-processing in order for it to better fit the bandwidth pipeline. All of these incur additional differences from the original signal. When you get to the point that you need to reconstruct the audio, you can actually clean up quite nicely with classical signal processing: you can just remove the bad, deteriorated components of the audio, whether those are frequencies or time instances. You can remove them, you can ignore them.

But then your audio becomes very narrow, shallow, and it's just missing a lot of components. So now, what can you do? With previous technologies, you didn't have a lot to do. Today, you can regenerate those components: for example, missing frequencies; for example, correcting for reverberation; for example, adding some dynamic range and depth across the frequency range.

So you actually mix the residual audio with a regenerated one, and the result is incredible.

Deep: That's interesting. That sort of suggests that you're doing some kind of pre-filtering before. It seems like there are a couple of approaches. One approach you could take is to try to get the generative model to truly learn the filter transformation of all these effects that you itemized: the channel effects, the room, et cetera.

And then what you're saying here is a little different. It's making me think maybe you're band-pass filtering or something, getting down to the raw speech, and then reconstructing the character of the room effects and the other things that would be a lot harder to represent in a filter. Is that right?

Am I hearing you right? Or...

Emil: Uh, it's a mix of things. First of all, even if you can characterize some of the room effects, it doesn't mean that they're invertible, in the sense that you can reconstruct the original sound. Basically, if you multiply something by zero, you cannot extract the original value.

The other thing is additive noise. If it becomes much stronger than some audio component in that frequency space, that component just gets lost. It's pretty much impossible to extract it. So typically, what they used to do in classical signal processing was just discard it. But what I'm trying to say is, not everything is invertible.

And by the way, most of these things are non-invertible. Most of these, you know, hard disturbances are non-invertible. So the real way to, uh, compensate for that...
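Emil's "multiply by zero" point can be illustrated with a toy example. The sketch below is not insoundz's method; it assumes a made-up channel that applies one gain per frequency bin, and the names `apply_channel` and `naive_inverse` are hypothetical. Two different inputs that differ only in the zeroed bin produce identical outputs, so no filter can tell them apart afterwards.

```python
# Toy model (an assumption for illustration): a channel is a list of per-bin gains.
def apply_channel(spectrum, gains):
    """Scale each frequency bin by the channel gain for that bin."""
    return [s * g for s, g in zip(spectrum, gains)]


def naive_inverse(received, gains):
    """Divide the gain back out; bins with zero gain are unrecoverable (None)."""
    return [r / g if g != 0 else None for r, g in zip(received, gains)]


gains = [1.0, 0.5, 0.0, 0.25]      # third bin is wiped out by the channel
voice_a = [4.0, 2.0, 3.0, 1.0]
voice_b = [4.0, 2.0, 9.0, 1.0]     # differs from voice_a only in the wiped bin

received_a = apply_channel(voice_a, gains)
received_b = apply_channel(voice_b, gains)
recovered = naive_inverse(received_a, gains)

print(received_a == received_b)    # both inputs look the same after the channel
print(recovered)                   # the zeroed bin cannot be restored by division
```

The missing bin has to come from somewhere else, which is where a generative model that fills in plausible content would fit.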

Deep: Just a quick side note: is that largely because... let's take a highly reverberant hall, like Carnegie Hall or something. A lot of those effects that you're perceiving are higher-frequency contributions that are very noise-like, right?

They're very stochastic and random-appearing. So it doesn't necessarily make sense to have a filter that can get rid of it and put it back. Is that kind of what you're getting at?

Emil: Yeah, I would say that lower frequencies get scrambled first. You just end up with a lot of high frequencies, and they're not kind of sporadic, but everything on the low end, the things that we correlate with intelligibility and quality, they really get lost first, and they're really sporadic.

And, you know, some components do get preserved, but a lot of it gets lost. So, yeah.

Deep: So maybe let's take a little bit of a turn. I think these two topics will kind of align as we go, but maybe walk us through: what are you training on? What's the training data? And what does it mean to define green screen audio? Are you actually putting somebody in a pristine recording environment and getting a true, you know, green screen version of the audio?

And then also maybe taking that cleansed version and passing it through a bunch of known common effects? Or are you taking an existing, bad-sounding audio signal? I would think you're doing something like the former, at least for the supervised fine-tuning in the end, but for the original unstructured tuning, maybe you're doing something more unstructured.

Yeah. Yeah. More like we do with images or something, where we're training on known omitted parts of the signal.

Emil: All of the above. However, I want to, I want to take you to potentially even one step further, if I can.

Deep: Sure.

Emil: So our company, we don't see ourselves as just creating one model for cleaning.

Different businesses, different enterprises, really look for different things in audio. Even in the same market segment, different companies may want to differentiate themselves by offering something a little different. Let me share an example. We have one platform on board that really wants to remove the "ums" that people say during conversation, during podcasts, while you really want them as part of typical human communication, like in office-type communication platforms.

So, different requirements. You can still call this one clean and this one great audio. And sometimes customers say they want to remove music. Some customers want to keep the music. So you really want to specify. Because we're a B2B company, we're looking for a clear specification from the customer on how to bring value to his business.

So we've built a machine, actually, that can take that type of specification and create a data set from it. We started off with just having catalog data. A lot of the data is our own. The green screen, as you call it, is people and actors that we hired for long periods of time to create high-quality data. We recorded them in studios.

And then we use that data to go through different types of transformations, which in audio are called room impulse responses. By the way, we acquired a huge number of those room impulse responses. We had a team that did a lot of recordings, a lot of noises, and a lot of cataloging.

So we have that kind of structural data.

Deep: I want to just chime in really quick on that, for our listeners that maybe don't know. There's this sort of cool phenomenon in audio analysis. If you walk into a room, take a hammer, and just whack it on something hard, it'll create what we call an impulse. Basically, it sends a pressure wave all the way through the room, and it starts bouncing all over the room. And then if you record that, you can now recreate the essence of that room. That's what Emil's referring to. Not literally with a hammer, but that's the way, I think.

Emil: That's one way. Yeah.

Deep: Could be literally a hammer.
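The idea of pushing a clean studio recording through a room impulse response and then adding noise can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline insoundz uses: the sine wave stands in for studio speech, the three-tap `rir` is an invented toy room, and `convolve`, `power`, and `add_noise` are hypothetical helper names.

```python
import math
import random


def convolve(signal, impulse_response):
    """Direct-form convolution: every sample excites a scaled copy of the room response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude."""
    return sum(v * v for v in x) / len(x)


def add_noise(signal, snr_db, rng):
    """Add Gaussian noise scaled so the signal-to-noise ratio equals snr_db."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    scale = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


rng = random.Random(0)
sample_rate = 8000
# Stand-in for a clean studio recording: 400 samples of a 220 Hz tone.
clean = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(400)]

# Toy room: direct sound plus two decaying echoes (values are made up).
rir = [0.0] * 120
rir[0], rir[50], rir[119] = 1.0, 0.5, 0.25

reverberant = convolve(clean, rir)
noisy = add_noise(reverberant, snr_db=10.0, rng=rng)

# (noisy, clean) would form one training pair: degraded input, pristine target.
print(len(clean), len(reverberant), len(noisy))
```

Swapping in different impulse responses, noise recordings, and SNR values yields many degraded versions of the same clean take, which is the sense in which the training data is generated from physics.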

Emil: So basically you recreate those different environments, as you mentioned, with those responses. Then you add some physical augmentation on top of that, which can be a moving head or a rotating head, so it creates a lot of dynamic changes in that room impulse response. You can have a Doppler effect, a Doppler shift, because somebody is moving in the room.

A lot of that physics has to go into the training pipeline. Then you add on top of that...

Deep: I was just going to say, this is really interesting, because we've talked on the podcast in the past about generating training data from physics. And it seems like this is a perfect example, because in essence, what you're trying to do here is recreate the ability to take any one of these pristine audio recordings that you have of your voice actors.

And put it through basically any room that could exist in the real world or an alternate world, and apply any random filters that somebody might get if they're munging it through GarageBand, or sticking it through hundreds of different guitar effects pedals or voice effects. You're trying to recreate the ability to make all of that. And you're mentioning Doppler, which is an interesting one.

I don't know if somebody's out there recording while somebody drives by, but it speaks to an intriguing idea: leveraging a known and maybe finite set of physics phenomena, and then using that to take these pristine signals and munge them up into all kinds of permutations that ideally represent everything you would see out on the Internet today or something.

Emil: And you don't have to represent everything. You need to represent everything for the specific use cases that a specific client tries to achieve. It's very important, I think, to limit that scope, because otherwise you're solving a lot of problems and not really becoming an expert in a specific problem.

I'll touch on that in a second. Now, I want to take this whole idea that we just talked about, munging everything, to the next level, which is where we are currently. Instead of having a catalog data set where you need queries to pick up the right sound and mix everything together, we decided to take that to the level of GenAI.

So basically, you specify to a GenAI that generates audio, and that was trained on all of that data, what you want it to output. So you actually augment, on a much larger scale, everything that comes out of that data set. If that makes sense, let me reiterate that. Imagine that you have one male speaker and one female speaker, and that's all your data set.

Okay. With today's GenAI, you can draw a line between the male and the female speaker, and you can get the whole spectrum of potential speakers between them. So you want to enjoy that while training an AI to do some task, creating an input into that pipeline that is much wider in that sense.

Deep: What's going through my head is, I'm trying to figure out how you actually train up this model. So I'm envisioning you have two time series signals that are aligned: one with the person in the green screen, the perfect recording, and one with whatever effects you've applied.

And now you back up, and the model that you're training is, for any given sequence... I don't know if you're doing this in time series, but what I'm imagining is that you've got a sequence of values in time. For our audience, you've got that temporal signal that we've all seen a million times.

And now you have to predict the next value of that time series signal. But instead of going to your noisy one, you're going to go to the pristine one. That's the value you want to predict. And yes, you might go out N values into the future. But that's sort of the essence of the training that's going to help your model learn that really complicated transformation that you're applying.

Am I on Mars with that, or is that in the ballpark? And maybe link that up to the point you were making before.

Emil: Yeah, it's kind of the other side of the whole training pipeline. But let's touch on that. So yeah, basically, from the noisy signal, you're trying to create the clean signal, the clean time series, as you mentioned.

Now, it's not done purely in the time domain, because the task of really recreating that time domain signal is too difficult. But luckily our ear is more responsive to the frequency domain, or some kind of mix between the frequency and time domains. So you really define the way you're trying to approximate the clean signal from the noisy one in that time-frequency space.

What I was getting at beforehand is how you create the data for that training. That was my reference beforehand. It doesn't only come from catalog datasets. It's also everything in between, but maybe we'll touch on that later.
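A common way to obtain the "time-frequency space" Emil mentions is a short-time Fourier transform: slice the signal into overlapping frames and take the spectrum of each. The sketch below is a plain-Python illustration only; the frame length, hop size, and Hann window are assumptions chosen for readability, and nothing in the conversation says which representation insoundz actually uses.

```python
import cmath
import math


def stft(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of complex bins 0..frame_len/2."""
    # Periodic Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


sample_rate = 8000
tone_hz = 1000  # 8000 / 64 = 125 Hz per bin, so this tone sits exactly on bin 8
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = stft(signal)
# For each time frame, find the frequency bin with the largest magnitude.
peak_bins = [max(range(len(frame)), key=lambda k: abs(frame[k])) for frame in spec]
print(len(spec), len(spec[0]), peak_bins)
```

Each row is a moment in time and each column a frequency band, so a model can be asked to predict the clean grid from the noisy grid instead of predicting raw samples one at a time.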

Deep: So let's go back to the time series and frequency world that you're operating in.

Can you give us a little more color on what that means? Like, why can't we just work in the time domain, and what problems actually come up? Why can't we just directly predict? Is it because, when the model that we're evolving tries to predict future sequences, some of the errors are, I don't know, glitchy or something? What ends up happening there?

Emil: Yeah, let's just take one thing for granted: you can never predict the exact signal. There will always be a mistake, always an error. Now, how do you control that error? Perhaps there is a way to conceal it in such a way that our ears are less sensitive to it. When you're trying to approximate in the time domain, that's very hard to do.

There are a lot of sporadic zigzags that you get around the signal, and this is something that is very audible to our ear. In some cases, they call it musical noise. These are noises that are really, really annoying to our ear, and the perception is of really bad quality. So perhaps you try to approximate the signal and have a residual noise, but make it inaudible. For the audience that is not aware of a lot about the frequency domain: there's a way to look at the frequency domain as two components.

Those are magnitude and phase. Our ears are much more sensitive to magnitude changes than to phase changes. So potentially you would rather have most of your errors in the phase domain rather than the magnitude domain. That's one way to look at it. In the time domain, in time series, you don't have that kind of clear way to conceal these errors so they won't be audible to the human ear.
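The magnitude and phase split can be made concrete with complex numbers: each frequency bin is a complex value whose absolute value is the magnitude and whose angle is the phase. The sketch below is only an illustration of that decomposition; the bin values and the helper names `to_polar`, `from_polar`, and `magnitude_error` are made up, and it does not model how audible either kind of error is.

```python
import cmath


def to_polar(spectrum):
    """Split complex bins into (magnitudes, phases)."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def from_polar(magnitudes, phases):
    """Rebuild complex bins from magnitudes and phases."""
    return [cmath.rect(m, p) for m, p in zip(magnitudes, phases)]


def magnitude_error(estimate, reference):
    """Total absolute difference in magnitude, ignoring phase entirely."""
    return sum(abs(abs(e) - abs(r)) for e, r in zip(estimate, reference))


clean_bins = [3 + 4j, 1 - 1j, -2 + 0.5j]   # invented example spectrum
mags, phases = to_polar(clean_bins)

# Estimate A: magnitudes exact, every phase off by 0.3 radians.
phase_shifted = from_polar(mags, [p + 0.3 for p in phases])
# Estimate B: phases exact, every magnitude 20 percent too large.
mag_scaled = from_polar([m * 1.2 for m in mags], phases)

print(magnitude_error(phase_shifted, clean_bins))  # essentially zero
print(magnitude_error(mag_scaled, clean_bins))     # clearly non-zero
```

A loss that weights magnitude error more heavily than phase error would be one way to steer a model's mistakes toward the component Emil describes as less audible.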

Deep: Yeah, that makes a lot of sense. Maybe for the audience's benefit: if we think about MP3 or MP4, the modern compression algorithms leverage a psychological concept. I think it's like one third of the power. If you're looking at a particular note, let's say a 440 hertz signal, like a concert A tone, and you're hearing another tone that's, like, one third or less as strong as that tone, but slightly more or less hertz, like 441 hertz or 439 hertz, the vast majority of us can't really hear that difference.

And basically what you're saying is, if you can squish those errors, either into the phase part or the spectral part, the frequency domain, then you can take advantage of, in essence, that limit in our physiological response.

Emil: Yeah, you're spot on. We actually use those magnitude masks, as they call them in MP3, to conceal some of the magnitude errors below them, and AI to really recreate the stuff, to really be accurate where it's audible, and just, you know, throw away everything that is not important, spreading the problems below the hearing threshold.

The same goes for the time domain. There is another interesting psychological effect: if you have high-amplitude audio, then for some time after it there's a masking time. That means some of your follicles are saturated, so they can't hear anything for, I don't know, 20 or a hundred milliseconds after that.

So push some noise there, push some inaccuracies there. There are a lot of effects that you can use. And by the way, AI does that much more nicely and efficiently than classical MP3 and stuff like that, because you can really specify how to do this.

Deep: Well, it's fascinating how you've approached this problem, because you have, in essence, a near-infinite training data set, right?

You can give it a bajillion different permutations, so you really have a lot of audio to work with. The question I would have is, do you find that you miss stuff? Are there any examples of big types of problems that you maybe didn't have represented in your training data, that you needed to go back and find and somehow include?

Emil: Yeah. Well, a great example I can give you: if you have a speech and there is a standing ovation after it, with people shouting and clapping, is that considered noise, or is that part of the content?

Deep: That's hard.

Emil: So, yeah.

Deep: Yeah.

Emil: By the way, that kind of ties the loop back to customers really understanding what they need.

And we help them understand what they need before we create that model for them. So yeah, we also had high-pitched barking dogs. That was a crazy one. Another one: typically, toddlers crying in the background is something that's considered unpleasant, and people want to remove it in professional settings.

But in some content creation, you really want that; it may be part of the media. Not everything can be considered noise in all cases, and not everything considered good and relevant is relevant in all cases. So typically, we have to do one or two iterations to really cover 95 percent of the needs of a customer.

Deep: Well, I could probably nerd out with you all day on the super deep technical stuff. So maybe let's take a shift here and jump out a bit. Maybe walk us through: who are the customers that you're really doubling down on, where you think, hey, this is the use case.

That's. You know, the one that's, you know, really gonna take in sounds to the next level, or are you going kind of like, are you, are you focusing more on these sort of consumer apps, like, like the zoom kind of context, or are you going down into a particular podcasting kind of scenario? Or maybe you're going to like a particular part of Hollywood, trying to reduce costs.

Like, how are you thinking about it from the business standpoint, and what kinds of customers are really attracted?

Emil: Yeah, good question. So, mainly software platforms that do audio processing in their pipeline. We're trying to use the flexibility that gen AI brings in order to really attend to wide, cross-industry markets.

So half of our market is communications, and that means Zoom-like, but not a B2C play. Even Zoom, you know, has its own kind of B2B; I think most of the revenue even comes from B2B. So that's where, for high quality audio, people are really willing to pay extra.

So we have communications. We have content creation, and that content creation can be from podcasts, musicians, singers, and all the way up to Hollywood-level productions.

Deep: Well, so you're not just doing spoken word. You're doing, uh, music applications as well.

Emil: I've specifically focused on singing.

So human voice is the center of everything. We're not doing instruments and stuff like that. We can separate them as a background channel. This is something that we do as part of a specification that these enterprises give us. On the third kind of angle, we're going to post production tools, mainly automated post production that is done in the cloud, rather than, you know, those plugins and stuff like that.

The SaaS platform we're building right now has some manual work in it. So we meet a customer, get the specification, create a pipeline out of it, generate a model, give the model to the customer, he is happy, it's deployed, et cetera. This whole process is about to be fully automated. So a customer can log into our website and specify via text exactly what he wants.

He gets a model deployed on a cloud, he can drag and drop and try it, and he can tune it. He can say, no, you know what, I want to remove dogs as well. He can actually fine-tune that model, listen, and then decide that it's now ready to deploy. And then he can just pull it and deploy it as a cloud container in his product.

So we're really trying to cover an industry-wide solution here, not just focusing on one thing or the other. And the reason we can do that is because we think we uncovered and understood the mechanics of taking audio needs into an AI model.

Deep: I mean, that's an interesting thing.

It sort of implies that a lot of your users require customization, and I wouldn't have guessed that. I would have guessed that you would have maybe a handful, or 10 or 15, different types of application, like one being extract voice: get to the green screen equivalent of voice in any kind of podcasting environment. But you're sort of suggesting that people are more particular than that, almost like that wouldn't be enough to go wide. I'm having a hard time wrapping my head around that. It seems to me that if we just take this podcasting app scenario, that alone should be one model.

Emil: Mainly, yes. But platforms, businesses that serve this area, want to differentiate themselves.

They want to differentiate themselves from other customers by actually giving different capabilities, like removing ums. I don't want to go into too many details because it may, you know, give out trade secrets of customers that we're engaged with. But you're right, podcasting is more of a fixed format.

But a lot of the other stuff really requires customization. I want to perhaps take an analogy from image processing. Let's take the problem of segmentation. A lot of open source projects and a lot of companies have worked on it, and pretty much everybody in the AI space now knows what image segmentation is. But when you take it to real business applications, this is where the difference comes in between, you know, a kind of wide-scale product and a B2B product.

When you want your image segmentation program to work, let's say for driving cars in a specific parking lot, you would really have to tune it to that parking lot. Otherwise, 80 percent performance is not good enough. So you really need to make sure you're tailored to the use case.

I'm taking it back to communications. Let's say we have a communications company. Let's say Zoom-like, I don't know, maybe a smaller one, second tier, third, it doesn't matter. They have their own processing pipeline. They have their own restrictions. They have their own compressions and stuff like that. Compressions add artifacts.

They have their own bandwidth restrictions. They have their different working set. We actually just take those requirements and tune them into the model, and then you have optimal performance for this customer. If you train it for a different type of audio compression, okay, you can train it wide, but you won't really get that level of performance. And this is where the B2B difference comes into play.

Deep: Cool.

I'm going to ask you a question that I ask a lot. A lot of our listeners are product managers or folks who are trying to figure out, how do I use AI and machine learning in my company, in my products? For somebody who's not used to thinking about machine learning and AI, can you offer some advice on what they can do to even understand what's possible with respect to their product, and how they might get started thinking about how to integrate machine learning or AI into their products?

Emil: That's a very interesting question. First of all, at least for me, I'm looking at AI as a very sophisticated tool to solve a problem. So you really start from the problem, not from looking at how to put AI in the business. Otherwise that's going to be very hard. I think to really understand what AI is capable of, it's kind of, for me, thinking about what a human employee can do.

And that's pretty much the limit, give or take, on the capability of an AI machine. So, you know, the AI machine won't be able to launch you to space, right? But it can potentially take some burden off of the mundane tasks, or even not so mundane tasks, that some of your employees do, and create efficiency there.

Also, I would look for structure. If there is a clear structure to the task, or some kind of rules that you can think of, even if they're vague, a set of rules that can describe that task, you can probably get an AI to do that, or at least assist with that.

So that's the way I would suggest watching that.

Deep: I think that's very sage advice. Well, Emil, thanks so much for coming on the show. I've got one last question for you. If we fast forward, let's say five or 10 years out, and everything you're working on and envisioning working on, you know, works out.

Maybe walk us through how the world's different for those of us on the outside of your company.

Emil: Everything, all your audio interactions, everything that you do with audio will be calm. It will always feel as if you're listening to a high quality podcast, rather than rushing and enjoying and having all the noises around you.

And most importantly, audio should be safe. Because we inherently trust our ears so much more than our eyes, audio has to be safe. This is also something that insoundz is really deeply involved in. We are trying to think about how audio is going to be used in 10 years, so that we can still trust our ears and, you know, not be intimidated by fake audio.

Deep: I look forward to that day. I think the harshness you're describing is part of Zoom fatigue, frankly. You know, people are on their headphones, they're watching, and that audio gets kind of harsh. I think it's part of the fatigue. That's a great picture.

Hopefully, we're all going to live more serene lives when you're successful. So thanks so much for coming on the show, Emil. It's been great having you. That's all for this episode of Your AI Injection. As always, thank you so much for tuning in. If you enjoyed this episode and want to know more about implementing AI in your company, you can try chatting with our bot, Xybo.

Just Google Xyonix Xybo. That's X Y O N I X. And Xybo is X-Y-B-O. Also, please feel free to tell your friends about us. Give us a review and check out our past episodes at podcast.xyonix.com.

People on this episode

Deep Dhillon

Host