In this episode of Your AI Injection, host Deep Dhillon and guest Dr. Hans-Martin Will delve into the intricate intersection of AI and music. HM narrates his journey from studying classical violin to harnessing AI for music synthesis. Together they explore cutting-edge technologies like latent diffusion models, which have applications in both the visual and audio domains. HM highlights emerging tools in the music industry, such as sound isolation plugins and AI-driven parameter suggestions, emphasizing the transformative potential of AI in music production. Meanwhile, Deep offers a personal perspective on the value of the human touch in music, stressing the importance of understanding the essence behind musical techniques, especially in an era characterized by its rapid embrace of AI.
Learn more about Hans here: https://www.linkedin.com/in/hmwill/
Check out some related content of ours about applications of AI in creative fields:
Deep: Hey there, I'm Deep Dhillon, your host, and today on our show we have Dr. Hans-Martin Will. Hans-Martin, or H.M., received his M.S. in computer science from the University of Bonn and his Ph.D. from ETH Zurich, also in computer science. H.M. has served as a senior technologist in many companies and roles, most recently serving as head of engineering at Amazon for Textract.
I've always found whatever H.M.'s mulling over to be super fascinating, so I thought we'd have him on today. I thought we'd get started. Maybe tell us a little bit about your, um, CS and technologist background. Tell us how your interest in computer music, and maybe AI and music, got started. You know, maybe there's some colorful intersection there too.
I, I imagine with your, your, your deep tech background. You know, and maybe we'll just start there. So how did it get started with AI music?
Hans-Martin: So, um, fundamentally, um, I have been educated in music, learning instruments, from my youngest age. I think I took my first class when I was four, and at first I learned classical violin.
And then in my teenage years, uh, that's when I developed an interest for electronic music and synthesizers. Uh huh.
Deep: And were you producing music too, like on the electronic front or? I,
Hans-Martin: I, I did. In fact, uh, maybe 2015, um, I even started taking it a little bit to another level in that previously I was playing with friends by making, uh, simple productions at home in our living room, so to speak.
Um, but that 2015, that's kind of when I picked up electronic music after, actually, maybe a decade of not practicing it. Um, and I started actually taking classes at, uh, Berklee College of Music. So they have a great online program. Oh, okay. And I have done quite a couple of classes on electronic music production, composition, sound design, mixing, mastering.
So really kind of all the different aspects
Deep: that make up producing a track. That's good to know. I've been getting obsessed with guitar lately, and, and drums, and just playing music in general. I lived in Cambridge, uh, for a while, and I remember the Berklee College of Music students were always such interesting characters.
Berklee's, of course, produced, you know, all kinds of household-name people, so that's,
Hans-Martin: that's cool to hear, so. Yeah. And maybe just to have a clarification, right? So I'm not in the on campus program. Oh, yeah. Yeah. No, that's, yeah. It's kind of the online branch they have. Yeah. It's still a very, very, um, high quality program, but it's probably easier to get into that than the actual college.
Deep: yeah. No worries. No worries.
Hans-Martin: Yeah. Um, but so that kind of explains the music side of things. Um, and then on the professional side, um, I got started with computers again in my teenage years. That was back in the 80s and I'm dating myself. Um, got into software and actually in high school, I was actually working as a programmer to get the money I needed to buy those synthesizers.
That's really how I started out at the time. But then of course, um, I did my PhD ultimately, um, at ETH, came to the U.S., and then spent probably the last two-plus decades really around data, machine learning, AI systems, in various incarnations. A big focus initially was computational biology, um, and then it became more general AI systems, say at Amazon, looking at Amazon Translate, parts of the Alexa infrastructure, or, um, until about a year ago, right, when I was working at SAP, looking at, say, applications of language models in the enterprise space.
And so recently I decided to take something like a personal sabbatical, and as part of that, aside from practicing my music skills, so my last Berklee class ended in December and I just want to have a little bit more time to follow up on that, I also looked a bit more deeply into, well, what's actually going on in the AI and music intersection space, kind of what the applications are, what's happening, just what's really going on in industry right now.
Deep: Well, that's, that's when I had seen a bunch of your posts coming through on LinkedIn or somewhere. And I thought, oh, okay, so H.M.'s into, uh, generative AI stuff. We had an episode over a year ago, um, you know, on, on generative AI, uh, in the compositional space. So I thought it'd be awesome to have you on.
'Cause it's clear that you're starting to wrap your head around the space. So maybe you can tell us a little bit about what's going on with AI, generative AI, in the music production world, and how does it differ from what we've seen before in the music, um, space?
Hans-Martin: So maybe, um, the space is definitely even broader than just, um, generative AI, but let's start with generative AI, right?
And generative AI, generally, uh, I think all of us think of systems where you provide some simple textual description, right? And the system is generating some form of content for you, right? It can be the ChatGPT experience we see in the textual world, right? And so the equivalent in the music space would be, um, I provide a simple description of a sound, right?
And then the system generates that kind of sound or soundscape for me. Or similarly, I provide a description of the kind of piece of music I want, and the system creates that kind of track for me. And in fact, I mean, for example, for those who, um, have been following along, it is now the first week of August,
right, as we're recording this piece, right, and just this week, for example, Meta released another package, uh, called, um, AudioCraft, as open source. They open-sourced the models for it on Hugging Face, and it has, um, three components, but the two interesting ones here, right: one is AudioGen, which does exactly the first kind of scenario, right?
I'm providing a verbal description, and it generates the sound, the audio, the soundscape. And the second one is MusicGen, for consistency in the names we see. And again, I provide a textual description of a piece, right? And it's creating a music track for me. Got it. Um, I don't, I don't think we'll touch on it here, but an actually fairly interesting part is the third piece, which is a new kind of compression tool they created.
Okay. Um, which also points to a more advanced application of AI in music: it is this notion of actually recreating music based on limited information. And they're using that now as an alternative for audio compression, right? So they're saying it's 10 times more efficient than MP3s,
for example, which is substantial.
Deep: On the generative side, I don't know if you saw this, but is there also another, like maybe a third realm, which is training up models based on the actual audio signal data point sequences, the way you do with, uh, you know, text words to predict future sequences of text?
And then based on that, being able to build out audio forecasting models, if you will, or completions. And I think this was the backbone for, there was a piece maybe a year or two ago with Nirvana, somebody had taken, like, the Nirvana catalog, and the model was able to generate not just, uh, the actual music sounds, uh, but the actual lyrics and everything, all in one coherent model.
I was curious if you had any knowledge of that system or that approach.
Hans-Martin: Effectively, a lot of these music generation systems are built in this way, right? So essentially, right, they're trained on a catalog of music pieces. Mm hmm. And the AI is essentially learning the inherent patterns, and some of them are even using language models, like transformer models, applied to a representation of the music
signal that's part of the training corpus, right? And then the textual piece essentially comes in as an annotation of what it is that's happening musically, right? And in this way, the system can actually learn what the patterns are intrinsically, in song structure and the kind of melodic patterns, for example, like what you described with the Nirvana example.
Deep: Maybe we dig in there a little bit so you can help us understand how that works. Like, so I have just a bunch of questions off the top of my head. So where do you get the unstructured collection of music, first of all? And then once you get it, is it track-separated, or is it all kind of mixed together?
And when you train, are you commingling track-separated audio with fully mixed audio? And then, where are these text descriptions coming from, in the context of what's going on in the music at any point in time? Like, how does that whole world work? I'm just curious how these things are trained.
Hans-Martin: No, these, these are legitimate questions: where do the training corpora come from? And I think, actually, when I think about it, getting more annotation into the corpus is almost where I would start looking if I were to consider building such a system.
However, I think a lot of the annotation is actually high-level metadata, even, right? For example, if you say, I want to have a track which has, and now it's really just a certain moodiness, right? The moodiness may be already in the regular metadata of a song, or we may be able to derive it from, from the lyrics.
Deep: See what I mean? Like.
Hans-Martin: No, even just mood, because a lot of these, these, uh, you mean
Deep: like in, uh,
Hans-Martin: It's, it's dark, it's moody; sometimes they've just annotated it's uplifting, right? So you often get this out of, out of the metadata that comes with the track, right? Or, for example, you would find things like the BPM, which is kind of how fast the track is, so to speak, right?
Or is it major or minor, right? Is it a more optimistic or more sad kind of sound? This is also part of what you get in the regular metadata of audio material that you just get from, um, a song database. I think if you go beyond that, you will need to source the data, but then you need to actually think about what's the kind of annotation that I can learn on top of it.
Or you would, for example, say, if it's a public music catalog, I could imagine you could mine review information, right? You could probably find which reviews have been written about all these tracks, right? Yeah. And then do a sentiment analysis on those. Or you would say, oh, there are a lot of databases on the lyrics of songs, right?
So you could actually pull all these pieces of information together, like, line them up.
Deep: Yeah. This was a hot topic back in, um, the late nineties. I actually started a company with some friends doing a lot of this. So there were, um, let's see, at the time, a lot of folks coming at it from different angles.
You know, there was, um, I think it was, uh, Gracenote, a company where you would put your CD into your CD-ROM drive, and then before uploading the track info, you would be prompted to enter your metadata. Um, and that way they were able to, like, get the info and build a database.
Um, other folks were just paying people to, like, um, actually go in and, and, like, generate metadata. We did it purely with signal-processing approaches. So we, you know, we built models for, for, um, beat extraction, and, you know, measuring tempo, and mood, and all that. And then, if I recall, back in those days there was an open source project, I think it was called MusicBrainz, that, um, started building a public repo of all of this metadata around songs.
So, so now you've got this world of produced songs and metadata. What's next? Like, you have it at the song level. Um, so you might be able to say, okay, like, give me, you know, a bunch of songs that are sad or happy or aggressive or whatever. I can understand how that works. And then, um, so, so maybe talk to us about where the AI contribution is happening and how it maps back to those text descriptors?
Hans-Martin: Yeah.
So, so maybe, um, the simplest system that I came across, and it's also interesting, right, because it's actually a little side project, is called Riffusion, right, like Diffusion with an R at the beginning. Okay. What these guys did, actually, is they just took the regular latent diffusion models for image generation, right, and they're applying them directly to spectrograms.
Okay. And you can go to the website, right? If anybody's interested, you can go to riffusion.com. Um, you can put in your little text prompt, and it's generating in real time this little segment for you. Um, it's limited to five seconds, because they're kind of slicing it out into, uh, 512 by 512 time points and frequency quanta, right? And essentially, the same way you generate an image, right, they're generating the spectrogram. And then, once it's generated, you simply, um, convert the spectrogram into the full audio, right, by essentially running it through an inverse Fourier transform.
Deep: Interesting.
Hans-Martin: And that's, that's the simplest form of what you can do. But it actually works surprisingly well, given its simplicity, I must say. And the interesting part is there's actually a whole community of people who are now using that to build their own models. There's like a little community that you can find from the website, of people describing how they take their own audio material, feed it into the model, and create their own personal models in this little, little sandbox.
Deep: So, so maybe for the benefit of the audience here: so what I'm taking is, you have a three-dimensional signal. You have frequency bands, and then you have the power for each of those bands, and then you are moving that across time. Um, so, for the audience, you know, picture, in the olden days, you had your stereo, you know, with your equalizer; now you have that signal
kind of varying over time. And then that whole thing goes into the same diffusion model architecture that we use for, uh, imagery and video analysis, correct?
Hans-Martin: Correct. Right, so essentially we're kind of quantizing the frequency space into bins, and they kind of correspond to the pixels, say, along the y-axis in my image, and I'm quantizing time, right, into little time intervals.
And then for each time interval, I'm seeing which frequencies are present, and to what extent; that's the third dimension, the magnitude of the signal at each of these frequency-time points. And that's really what the model is learning: how that develops over time.
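To make this concrete for readers, here is a minimal numpy sketch of quantizing an audio signal into a 512 by 512 frequency-time magnitude grid, the kind of "image" a spectrogram-based diffusion model operates on. The function name, window length, and hop logic are illustrative assumptions, not Riffusion's actual implementation.

```python
import numpy as np

def spectrogram_grid(audio, n_freq=512, n_frames=512):
    """Quantize audio into a (frequency x time) magnitude grid, the
    'image' a spectrogram-based diffusion model trains and generates on."""
    win = 2 * n_freq                               # window length -> n_freq usable bins
    hop = max(1, (len(audio) - win) // (n_frames - 1))
    cols = []
    for i in range(n_frames):
        seg = audio[i * hop : i * hop + win]
        if len(seg) < win:                         # zero-pad any short final frame
            seg = np.pad(seg, (0, win - len(seg)))
        seg = seg * np.hanning(win)                # taper to reduce spectral leakage
        cols.append(np.abs(np.fft.rfft(seg))[:n_freq])
    return np.stack(cols, axis=1)                  # shape (n_freq, n_frames)

# A 5-second 440 Hz tone at 44.1 kHz lights up one frequency band in every frame.
sr = 44100
t = np.arange(5 * sr) / sr
grid = spectrogram_grid(np.sin(2 * np.pi * 440.0 * t))
```

Going back from such a magnitude grid to audio needs an inverse short-time Fourier transform plus phase reconstruction (e.g. Griffin-Lim), since the magnitude image alone discards phase.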
Deep: And then the model is trying to, in essence, like, forecast each of those, like, across that full frequency space, where each data point is going to be one or two or 10 or N samples out?
In essence, that unsupervised training process, akin to what we do in text to predict future sequences of text, is what's going to ultimately kind of get the model to learn what the music is doing. Yeah.
Hans-Martin: Even with these diffusion models, actually, they're generating the whole image at once.
So it's not like an autoregressive model, where you kind of work yourself forward through time. It's creating this whole five-second music block all at once. Right, it's kind of, because these models, I mean, for those who are not familiar with these latent diffusion models: they ultimately learn by running a stochastic process backwards, right?
The stochastic process is, I take my image and I'm adding noise over time, and then what the model is actually trained on is the reverse of the process. So I start with really random noise, and the system, the trained model, is applying the inverse process until it comes out with something that looks like a true image, or in this case, something that comes out like a true spectrogram, one that represents music, the kind of music that the system was trained on.
So it's iterative, but it's across the whole time window at once.
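As a toy illustration of the forward noising process described here, the numpy sketch below adds noise to a tiny synthetic "spectrogram" and shows the quantity a denoiser is trained to predict. The noise schedule values are made up; an oracle that knows the exact noise can invert the process perfectly, which is what a trained network can only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectrogram" with obvious structure: one bright horizontal band.
x0 = np.zeros((16, 16))
x0[5, :] = 1.0

# Forward process: blend the clean image with Gaussian noise over T steps.
T = 50
betas = np.linspace(1e-3, 0.2, T)           # illustrative noise schedule
alpha_bar = np.cumprod(1.0 - betas)         # fraction of original signal remaining

def noisy_at(t):
    """Sample x_t in closed form: sqrt(a_bar)*x0 + sqrt(1 - a_bar)*noise."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

# The denoiser is trained to predict eps from (x_t, t). Given the true eps,
# the clean image is recovered exactly; a trained network approximates this.
x_t, eps = noisy_at(T - 1)
x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[T - 1]) * eps) / np.sqrt(alpha_bar[T - 1])
```

By the last step almost no signal remains in `x_t`, which is why generation can start from pure noise and denoise iteratively.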
Deep: Perhaps you're not sure whether AI can really transform your business. Maybe you don't know what it means to inject AI into your business. Maybe you need some help actually building models. Check us out at xyonix.com. That's x-y-o-n-i-x dot com.
Maybe we can help you.
So you've got this full time interval. Walk us through that again. Like, what exactly happens during that five-second interval? Is there some kind of, like, future prediction, like in the way with text, where, you know, we're trying to figure out, based on this statistical interpretation across a large corpus, like, what's the likelihood of the next word, uh, being, you know, X or Y?
Like, is it a similar process? And if, if not, how's it different?
Hans-Martin: As I said, it's really the image world we're in right now, right? Yeah. With these latent diffusion models, the way to think about them is, abstractly, you have an image and you apply a function to it that is adding noise, and you iterate this function.
So as you iteratively apply this function, the image goes from a true image to ending up being a noise pattern. What you're training the model on is the inverse function. So essentially, it's a noise removal function that the model is learning. Gotcha. So what it really needs to learn to do that is, it kind of looks at correlations, and how these correlations are still maintained somewhat in the distribution as noise is added, and it's trying to compensate for that.
Deep: It's basically learning like visual structures.
Hans-Martin: Exactly, exactly. That's exactly what it does. And so, so that's how these latent diffusion models for images work, right? You give it some noise as input, potentially some conditioning information like a text prompt, right, in the same embedding or a combined embedding space, right?
And then this model really tries to reverse the noise process, so it's denoising, until it comes out with an image that is then presented to the user, or in this case, this image happens to be a spectrogram.
Deep: So then that inverse function is learning the audio structures.
Hans-Martin: Exactly, exactly. They're manifested visually in this case, but yeah. But of course, there are also the kinds of systems I think you've been, been looking at, or kind of, uh, getting at, right? Similar to how I take textual structure and predict the next word or the next paragraph, something about it.
There are also these music language models, which really apply the same kind of transformer architecture, now applied to a representation of musical events. So again, this can be a spectrogram, or it can be even more abstract, just the melodic patterns, the rhythmic patterns.
Deep: Yeah, I've even seen it with just time series signals, like, actually taking the time series directly.
Hans-Martin: Or something like the MIDI information. Yeah. So MIDI is a way to essentially represent digitally what a player is doing when they're hitting the keys on a keyboard. So it's not the actual audio, but it's kind of what the input is, right?
Which note is played when, and with what intensity. And so quite a few of these systems are essentially, uh, trained using large databases of these MIDI tracks. And then it's fairly straightforward, because it's just a time series, I guess, right, to apply a regular transformer architecture that would, would, uh, predict the next note, or a representation of the next musical bar, and then refine it into the actual notes that are being played.
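As a drastically simplified stand-in for the MIDI language models described here, the sketch below counts next-note transitions in a toy corpus and generates autoregressively. The melodies and the bigram counting model are illustrative only; real systems use transformers over much longer contexts and richer event vocabularies that also encode timing and velocity.

```python
from collections import Counter, defaultdict

# Toy corpus of MIDI pitch sequences (made up for illustration).
corpus = [
    [60, 62, 64, 65, 67, 65, 64, 62, 60],
    [60, 64, 67, 64, 60, 62, 64, 62, 60],
    [67, 65, 64, 62, 60, 62, 64, 65, 67],
]

# Count next-note transitions: the crudest stand-in for what a transformer
# learns from a large MIDI database.
transitions = defaultdict(Counter)
for seq in corpus:
    for a, b in zip(seq, seq[1:]):
        transitions[a][b] += 1

def predict_next(pitch):
    """Most frequent continuation of the given pitch in the corpus."""
    return transitions[pitch].most_common(1)[0][0]

def continue_melody(seed, n):
    """Autoregressive generation: repeatedly append the predicted next note."""
    out = list(seed)
    for _ in range(n):
        out.append(predict_next(out[-1]))
    return out
```

For example, `continue_melody([60], 4)` extends a melody one note at a time, exactly the "predict the next event, then feed it back in" loop that transformer-based music models run at scale.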
Deep: Maybe walk us through, how do we get from that to the text descriptions? Going back and forth, how do we use the text in the, in the model training and in the output?
Hans-Martin: So the thing is, of course, that a lot of the systems that exist, at least from what I've seen, are actually fairly high level.
So they're really just using these high-level genre and mood type descriptions, which are used to essentially condition the generation process. However, of course, and now this goes back to what you were alluding to earlier, right, we could build more audio analysis into the pipeline, right?
And have a lot more fine-grained annotation for individual segments and events in a more fine-grained, uh, representation of, of what's happening musically, right? And then the system can learn these correlations between the text, or maybe text that's generated from other annotations I have, and the musical structure.
Say, for example, you have a music analysis system like the one you described earlier, right? Yeah. And it finds out a lot about what's happening musically. I can translate that, of course, into language, and then use that language representation as I'm building the conditioning, right?
Deep: Yeah, so you're saying, literally, I generate a transcription out of it.
Hans-Martin: Exactly, exactly. I kind of translate the analysis into, into text, so that I can map it back to the text representation that, ultimately, the user will give to me.
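A minimal sketch of this translation step, turning song-level analysis or metadata into a text caption that could condition a music generator. The field names and the caption template here are hypothetical; real pipelines derive much richer text from the audio analysis.

```python
def annotation_to_prompt(meta):
    """Render song-level metadata as a text caption for conditioning.
    All field names and the template are hypothetical examples."""
    mood = meta.get("mood", "neutral")          # mood may come from tags, reviews, or lyrics
    key = "major" if meta.get("mode") == "major" else "minor"
    return f"a {mood} {meta['genre']} track at {meta['bpm']} bpm in a {key} key"

prompt = annotation_to_prompt(
    {"genre": "drum and bass", "bpm": 174, "mode": "minor", "mood": "dark"})
```

During training, captions like this are paired with the corresponding audio, so the model learns correlations between the words and the musical structure; at generation time, the user's own text prompt plays the same conditioning role.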
Deep: Yeah. Interesting. And so, so I suppose you could also, if you had lyrics, for example, they could just be included right alongside the audio information, but it makes sense that it just, it just shakes out on its own.
Hans-Martin: It's also a moving space, right? For example, lyric extraction from music, I mean, it's also an active research topic, right? Because, I mean, it's one thing to have the lyrics coming from a database; it's another thing to have them really, um, aligned, note to note, with where it's happening as, as my song is being played. Right. So there's a lot of development still going on in that space.
Deep: Let's take the conversation up a level, then. So we have some deeper understanding by these generative AI systems that are learning the music, and, similar to the way you're describing them learning the lyrics, they're also learning the independent tracks as well. Like, you know, isolating the bass from the electric guitar, from the vocals, from the drums, et cetera. Which, I guess, is also one of the exciting areas of, of production software now:
some of these AI systems are being used to actually segregate tracks. Um, maybe you could talk us through all the different ways this, this kind of system is manifesting itself in the world of the musician or the producer.
Hans-Martin: So why is isolation interesting? I think, most immediately, right, what isolation allows is to essentially get to the building blocks of a song.
Without having access to how it was made at the time, right? Because usually it's much easier to get the final result of a song than, say, all the raw recordings and the mixing steps, et cetera, that were created as part of the production process, which sometimes contain really the information you want to get to, but, um, usually just aren't available alongside the final result.
And so, for music, right, I mean, uh, it's really about sampling in the most immediate application, right? So back in the 80s, when sampling started, right, people would just take snippets of other songs and then incorporate them into their own tracks, right? And of course, at the time, the part you were interested in had to be fairly isolated in that track. That's
the kind of breakbeat that almost defines all of drum and bass, right? It was a drum solo; that's why it was easy to pull out of a track, right? And if that wasn't the case, well, now you can do it with audio isolation. Now I can pull many, many more signals out of a track. Yeah.
Deep: I mean, there's, there's a lot of, I mean, this has been a bit of a holy grail within the computer music, uh, arena for, for decades.
I remember back in the mid-nineties, folks were trying all kinds of things to be able to isolate tracks, and, you know, a lot of it leaned heavily on spectrograms, and, you know, it just gets complicated fast to segregate these things, particularly for complex instrumentation. Instruments don't just sit in a nice, clean, isolated spectral band; they'll have harmonics and all kinds of, uh, um, constituents, you know, across the spectrum.
And even something as simple as, like, you know, a drum, like a snare hit, is going to have a lot of frequency components, going from, like, the low end of the beat that you've got all the way up to all the resonance and that shiny sound, if you will. It's for a bunch of reasons, right?
Like, one is the sampling idea that you're talking about. But also, like, if you think back to Mozart's time, you know, Mozart could sit and compose at a piano, and the composer, you know, was expected to really compose primarily on piano, or maybe guitar, you know, an instrument where you can cover rhythm and low frequencies and high frequencies and melody and harmony; like, all that stuff can be represented in one instrument.
You know, the left hand of the piano can be the bass instruments, and later on you can tease them out into bass clarinets or whatever, uh, and tubas, but the composer could work that way. But nowadays, well, not even just nowadays, in the last 30, 40 years, you know, a lot of times musicians, they just play, and then somebody has to write it down.
'Cause at some point your guitar player ODs or something, and you've got to go get another one, someone from Berklee College of Music, and you have to give them a score and have them play it. And so being able to score is also really important, you know. Being able to, like, go from, you know, a group of musicians who are typically really quite good at just getting together and jamming until something exciting comes out, but maybe don't
at all know how to write it down, you know, a lot of them don't even know notation. It feels like a powerful building block.
Hans-Martin: It is, for that. Um, and I think also, beyond sampling, there's, um, remixing, right? Because you can actually create stems without having access to the original stems.
Another sub-element here, um, in the isolation part, and I don't know if all the listeners will appreciate it, is dereverberation, right? So essentially, as I'm producing a song, I'm adding digital effects that create space: echoes, reflections. Part of sound isolation is also removing those, and removing noise, right?
So if I take a signal and I want to bring it into a different musical context, right, then as part of it, ideally, I can remove the space that was created around this voice or this instrument in the original recording, because I need to now place it into a new acoustic context, so to speak.
Deep: Right. You want to take your lead vocals and put them in Carnegie Hall after the fact, or put them in a, in a crowded jazz bar, you know, after the fact, something like that.
Hans-Martin: Well, or the other way around, right? It was recorded in a crowded jazz bar and I want to get a clean version, right? And of course, now I'm going to Carnegie Hall, right? With that kind of process, I can take away the jazz club acoustics, which are typically not that good, right? And once I have it isolated, I can recreate it in the Carnegie environment, which is acoustically well treated, right?
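To ground the reverb discussion for readers: a room is commonly approximated as a linear system, so the recorded "wet" signal is the dry source convolved with the room's impulse response. The numpy sketch below builds a synthetic impulse response and applies it; dereverberation is the much harder inverse problem (deconvolution without knowing the room) that the AI models discussed here approximate. All the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic quarter-second impulse response (IR) at an 8 kHz sample rate.
sr = 8000
ir = np.zeros(sr // 4)
ir[0] = 1.0                              # direct sound
ir[400] = 0.5                            # one early reflection, 50 ms later
tail = 0.05 * rng.normal(size=len(ir) - 800)
ir[800:] = tail * np.exp(-np.arange(len(tail)) / 600.0)  # decaying diffuse tail

# "Place" a dry source in the room: wet = dry convolved with the IR.
dry = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)     # one second of tone
wet = np.convolve(dry, ir)
```

Moving a voice from the jazz bar to Carnegie Hall is, in this picture, removing one convolution and applying another; the removal step is ill-posed, which is why learned models are used for it.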
Deep: So, what kinds of features are winding up inside of professional audio editing tools, you know, that are powered by some of these capabilities? I assume one of them is exactly this, just track isolation. And how far along are we? Like, are we just at the early stages, where maybe a startup or two has gotten a few more sophisticated AI-powered features in place?
Or are we, you know, where are we on that curve and what kinds of stuff are you seeing?
Hans-Martin: Um, so for sound isolation, I've seen primarily startups, to be honest. These are really specialized players. Um, but they're offering these as, as plugins, right, that I can insert into my regular, um, digital audio workstation.
So maybe for the audience members who are not in, in that space: a digital audio workstation, essentially, is a general system on a computer that allows you to work with music and sound, basically from the recording all the way to creating the final production. And, um, it has been highly componentized for the last 20 years.
So it's easy to add a new plugin, a new component, into that environment, and have it readily integrated with all the other pieces that I have. So, for example, it's like GarageBand or something. Exactly. There's a plugin that I can put in, and now I have this new tool, without having to be too concerned with integrating across different software pieces.
Another kind of adjacent part that we are seeing is AI analysis of the audio signal being used essentially to find parameter settings.
So, for example, say for, for the, for the mixing or the mastering process, or an effects plugin that I want to use, like adding back the space, where I don't want to hunt for the right preset that matches my music style.
We now see companies like, like iZotope, actually a fairly large player in that space; they have been adding these kinds of AI models into their tools, so that this mastering tool or this mixing tool listens to my track, and it essentially understands, okay, what's the kind of style that's being applied,
what's the kind of genre I'm in, and based on that, it will now apply a different rule set of what would be good starting points for the parameters I need to set, say, in my, my channel strip on a mixer, or to guide the selection of a reverb for the artificial space I want to create around this instrument.
And maybe also, some of these tools not only listen to the one track I'm interested in; they also listen to the rest, right, and then try to find the right kind of mix, or the correct kind of settings, so that what I'm adding in here kind of works well with what I already have.
Deep: Gotcha. Are you seeing some higher-level capabilities around style being applied? You know, like with imagery, how we've all seen folks taking some Picasso imagery and, like, applying that cubist style to some other imagery. Are we seeing analogs in the audio realm, where you can take a Charlie Parker style and add that filter to a generator of a saxophone track or something?
Hans-Martin: There are these systems, so, kind of some form of style transfer. For example, I can actually use, um, say, a singing voice or one instrument as an input signal, and there's a saxophone model that then takes whatever's played as the input signal. And it's now creating a saxophone playing the same melody, with the same kind of expressiveness, to the extent it can approximate what the input was, was doing. But it's not just a generic saxophone track:
But it's really transforming my input signal from one instrument to another. So that's definitely one application. Another somewhat adjacent is, so sometimes you have these, particularly if it's like all the analog devices, for example, an amplifier, guitar amplifier, I'm taking a guitar amplifier from the 60s.
I want to emulate how this thing affects my input, my guitar signal coming cleanly from the pickup on the guitar, to the resulting sound, right? So again, there are AI models that essentially learn the transformation process that this old analog circuitry applies to my input signal. I plug them into my chain, and now my clean guitar plays as if it had been run through this, say, 60s vintage amplifier.
But as I said, this is different from previous approaches, where the electronics were modeled.
Deep: Where someone sat down with the physics and modeled it. Exactly. Yeah. This is almost like you could learn the quirks of a particular musician, you know, and their smashed guitar, the one they accidentally, you know, drank too much beer one night and smashed against the stage. And now you're learning that function.
Hans-Martin: Exactly. You're just learning the transfer function between input and output signals.
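As an editorial aside, the input-to-output "transfer function" idea can be sketched in miniature. The toy below fits a memoryless waveshaper y = a * tanh(b * x) to paired clean/processed samples by gradient descent; real amp-capture tools train neural networks on windows of actual recordings, and every constant and name here is invented purely for illustration.

```python
import math, random

# Toy stand-in for amp capture: learn a memoryless "transfer function"
# y = a * tanh(b * x) from paired (clean, processed) samples, the way
# a capture model learns from an input/output recording of a real amp.

def capture_transfer_function(pairs, steps=5000, lr=0.05):
    """Fit a, b in y = a * tanh(b * x) to (clean, processed) samples."""
    a, b = 1.0, 1.0  # initial guess
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in pairs:
            t = math.tanh(b * x)
            err = a * t - y
            ga += 2 * err * t                    # d(err^2)/da
            gb += 2 * err * a * x * (1 - t * t)  # d(err^2)/db
        a -= lr * ga / len(pairs)
        b -= lr * gb / len(pairs)
    return a, b

# Synthetic "recording": a vintage amp that saturates at 0.8,
# driven fairly hard (gain 3.0).
random.seed(0)
clean = [random.uniform(-1, 1) for _ in range(200)]
processed = [0.8 * math.tanh(3.0 * x) for x in clean]

a, b = capture_transfer_function(list(zip(clean, processed)))
print(round(a, 2), round(b, 2))  # should land near a=0.8, b=3.0
```

The same input/output framing covers the smashed-guitar example above: whatever nonlinearity the rig applies, the model only ever sees pairs of signals.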
Deep: As a musician, what can musicians expect? You know, in the eighties, computer music had this impact, the kinds of effects and synthesizer sounds that we got. Can we expect something similar with these AI sounds, something really new and different and fresh in some way, or is it something else?
Is it sort of, you know, a lot of personalization? Like, what kinds of stuff can musicians look forward to?
Hans-Martin: I would look at a couple of directions, starting with the simple ones and then getting further out. So I think an immediate application of AI that we'll see is tools that simplify using music creation tools. For example, simplifying a synthesizer, which has potentially hundreds of parameters, down to the few key parameters you need to control to still cover the meaningful sonic territory, but without overwhelming the user.
Now, in the past, these systems have been built manually, right? We had experts going in and trying to find some macro functions and some rule systems to make a simple interface on top of a complex synthesis engine. But consider something like a variational autoencoder. Variational autoencoders learn exactly that, right?
They learn what the key parameters in this intermediate space are that still allow me to reach all the points in my overall space, without working in the fully high-dimensional space. So for the first one, we actually see, for example, a company that's building an AI-driven synthesizer, right?
It has a couple of core settings, and in each setting it only has a few parameters, but those parameters are learned and specific to that class of sound, so I still have a lot of degrees of freedom, but as a user I'm no longer overwhelmed. So it's probably easier for me to get to a larger variety of sounds.
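The few-learned-knobs idea can be shown in miniature: a decoder maps a small latent space onto the full parameter set. In a real system that decoder is the trained half of a variational autoencoder fit on a corpus of presets; the matrix, the parameter names, and the "brightness"/"slowness" labels below are made up purely for illustration.

```python
# Two macro controls decoded into a full synthesizer patch. A trained
# VAE would learn DECODER and BIAS from preset data; here they are
# hand-picked, hypothetical values just to show the mechanics.

PARAM_NAMES = ["osc_detune", "filter_cutoff", "filter_res",
               "env_attack", "env_release", "lfo_rate"]

# Each row: how strongly one latent knob moves each parameter.
DECODER = [
    [0.1, 0.9, 0.4, 0.0, 0.2, 0.0],   # latent knob 1: "brightness"
    [0.0, 0.1, 0.0, 0.8, 0.9, 0.3],   # latent knob 2: "slowness"
]
BIAS = [0.2, 0.3, 0.1, 0.05, 0.2, 0.25]  # a "neutral" starting preset

def decode(z1, z2):
    """Map two macro knobs (0..1) to the full parameter vector."""
    params = {}
    for i, name in enumerate(PARAM_NAMES):
        v = BIAS[i] + z1 * DECODER[0][i] + z2 * DECODER[1][i]
        params[name] = min(1.0, max(0.0, v))  # clamp to valid range
    return params

patch = decode(0.7, 0.2)
print(round(patch["filter_cutoff"], 2))  # 0.3 + 0.7*0.9 + 0.2*0.1 -> 0.95
```

Turning one macro knob moves several underlying parameters at once along a learned direction, which is exactly what keeps the interface small without shrinking the reachable sound space.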
Deep: Without navigating everything.
Hans-Martin: Right, without being stuck in choice overload. Because I think the other thing we're seeing right now is that there are just too many sounds. When I'm in this audio environment, there are so many different sounds available that I get lost even finding what I'm looking for, right?
So these systems will make it easier to find what I'm looking for, and then easy to tweak it to what I want.
Deep: I mean, that seems valuable even for live performers, who really dislike a lot of this complexity, because you're on stage, you're playing; you can't be dinking around in your synthesizer menus forever. It's got to be really terse and intuitive.
So that sounds promising.
Hans-Martin: Exactly.
Hans-Martin: And then I think another part, a little bit higher up, is when we have systems that understand more about musical structure, that have a musical sense in a certain way, right? And now they can support me in the process of going from the core creative idea, like my melody, my chord progression, a specific rhythm,
into building out all the pieces that make the full piece of music. It's not at the level of the generative systems we talked about in the beginning, where I just give a little description and here comes the track. I'm still a musician, right? I have my ideas, I have a certain vision. But really, how do I go from my melody to the different repeats of that melody over the course of the track? Because I will have variations, and they may be genre dependent; and genre dependent, too, may be the interplay with the different other tracks.
What is meaningful? What is the meaningful sound pad I want to create? There's all this very subtle and nuanced musical knowledge that goes into the production process, and I think we'll see tools that augment you and help you do that iteratively and interactively.
Deep: Yeah, I mean, it's almost like we're taking that music composition theory world and chopping it up into little bite-sized pieces, and giving it to you at the right time and the right moment in your post-production facility, or maybe we even get it into your instrumentation and your pedals, so that we can help you not just have a rough template, like, okay, three chords and a transition and a chorus and whatever, but actually be suggestive
along the way, so that it's really helping a musician have ideas. When I look at some of the capabilities that, say, GarageBand on the iPad came out with, it feels like they almost stepped too far. They made everything so easy: you just get this drum kit, and you've got this two-dimensional thing, and you drag it, and boom, you get a drum beat.
That's not very interesting to most musicians. They see something like that and they're just not engaged, whereas if it's built into the process... I mean, it's nice in the sense that, if you have no musical inclinations, it might get you motivated and excited enough to invest in learning the guitar or the piano and going to the next step.
So in that sense, it's wonderful. But for a musician, it can be seen as doing too much. It feels like there's a world of markets and startups that try to get more people into music. And then, you know, I get advertised weird products, like little gadgets where you wiggle your fingers around and it makes music or something.
It's like a fancy fidget spinner for audio. A lot of people are doing that kind of stuff: hey, I'm bored, I've got my laptop, I'm stuck in the mountains somewhere, I can make some music. But on the other end of the spectrum, you have really sophisticated audio editing environments.
And now you have the capability of bringing that deeper, machine-generated understanding of the structures of music into a suggestive, recommendation-style environment, where all of a sudden you can imagine the net effect being a lot more people who had some level of musical ability suddenly accelerated a few levels beyond where they would have been.
And it feels like, on the whole, we're just going to get better music. But I don't know, because I feel like people get dark too quick when it comes to AI everywhere, and they start presuming that it's going to replace people. I feel humans have a deep ability to move the bar, in the same way that realistic painters had to change what they were doing in the wake of the photograph.
No one's going to be impressed with what the machine does out of the box. It's really: what happens when you pair that with great musicians and you get even further?
Hans-Martin: Exactly. I think it's that interactivity that I want. The AI becomes an element, almost like a partner, in this creative process; it helps you, it augments you. And it's really the iteration around that which probably also keeps the fun in it, right, when people create music.
Because it's fun to create music; it's just part of human nature, we're attracted to music.
Deep: I mean, literally, as soon as we're done here, I'm heading off to grab some friends. We're grabbing guitars and just showing up at the park. We're going to play some music. It's a way of communicating.
It's a way of socializing that's just different from, you know, speech.
Hans-Martin: Yeah, exactly. And just two thoughts on some of the things you were talking about earlier. The one thing, of course: you gave this example about the virtual drummer in GarageBand, which I think is a really great example.
Right, because that thing right now, as you discovered, is this two-parameter thing, and it's creating something, but it's a static process: I set these parameters and then that's it, right? But think about yourself as a guitar player. Yes, you have this rhythm line you're playing over with your guitar, maybe with some other friends, right?
And things happen as you play, and this drum track may no longer be the best fit, right? So if the system can actually pick up on that and react and change and adapt, suddenly it really gets in there, and this will trigger, again, how you may play differently, and ideas come up because you have this iteration.
It's not just: I do it once and that's it. The system can essentially work with me in a much more dynamic way as I'm creating music and iterating on the music I'm playing. So I think that's one part here.
Deep: Yeah. I mean, one of the problems I have with GarageBand is that I go in, I compose something, and I really like it.
And then I want to go play the drum part or the guitar part, and I'm not skilled enough to play that really complex bebop. So I go back to my drum set and try to get in the ballpark, and then I go back and try to simplify it to something I can actually play. But it feels like there's a lot of room for instruments, for this compositional music theory space we're talking about,
to go not just into post-production or composition, but actually into play. Because at the end of the day, nothing against DJs, but it's just a lot more exciting to see somebody with a guitar, playing live, than to see somebody making pastries behind, you know, the thing. And I get that that has its role and it's exciting, but...
Like, people actually enjoy the physical act of playing their instrument a lot. It's not a replacement to go and sit in front of your computer, especially for people like us who spend our working days in front of a computer. Anyway, I don't want to go sit in front of my computer; when I come home, I want to grab my guitar or a drum set. And if I project out five or ten years, it feels like I could have a drum kit with electronics in it that can talk to my
GarageBand-style compositional environment, bootstrapped by a lot of AI features, and it can work with me to tailor or personalize some skill improvement, to get me to the point where I can play that thing, and I can go back and forth. Maybe I didn't quite get there, but I got something interesting.
Same thing with a guitar. I have a smart guitar that I can use for compositional reasons, you know, and maybe even more, and I can access a lot of these AI-powered features, going back and forth between leveraging that deeper compositional understanding of music and my limitations, or realities, as a player in the physical world.
So I don't know, do you have any thoughts on that kind of concept, or is that too out there?
Hans-Martin: No, no. But maybe even beyond the creative side, because we talked a lot about the creative side, right? I mean, how do I write songs? But now we're talking about playing music, which doesn't necessarily mean new songs.
It's just songs I like playing. But there may also be an almost separate topic that's quite interesting: how could I actually improve at learning and playing an instrument? Yeah, yeah. Because then you really get this kind of personalized learning experience. Guitar is a good example. Or, I learned the violin, right?
And there's so much dependency on how you hold the instrument. And that's the thing you practice, actually. If you practice the wrong way,
Deep: you're just building on bad building blocks, exactly.
Hans-Martin: Exactly, right. You meet your normal teacher once a week, and for the seven days in between you're potentially practicing the wrong thing, right?
So also having these personal teachers could be super interesting.
Deep: Oh, for sure. I mean, you may just not have the money for a private guitar teacher. They're not cheap, you know, 70, 80 bucks a lesson or something.
But one of the things I noticed during the pandemic: I just said, okay, I'm going to throw down and learn how to play guitar. So, you know, YouTube videos and all that kind of stuff. And I got somewhere, though I don't think anywhere particularly exciting for anybody else or myself.
Then, maybe four months ago, I started taking actual guitar lessons, and my guitar teacher is amazing. He's an accomplished guitarist, first of all, but he's also just amazing at teaching, and, being an AI person, I'm constantly trying to debug what it is he's doing that's so much more effective than what I was doing before.
It's a combination of things. It's not just, here, play a song. It's, let's take a break and talk about what chord progressions are and why you need to know them, and then let's go back to the song. Or let's talk specifically about the p-i-m-a fingering style and why you need to do something. Because the whys are really important too; whys are connected to motivation. If you understand something as being important, then you're more likely to practice it and try it and get better at it. And I'm horrible at this: if I don't understand why, I just won't spend any time with it.
So I have to. And it feels like there's a lot of room there in the intelligent instructional space, whether you still have your weekly instrument teacher who can reach out to you virtually, via this agent-like form, throughout the week in an efficient way, or whether you don't have one and you're just trying not to learn everything wrong that then has to be undone by a proper teacher, which, you know, is what I went through.
Hans-Martin: Exactly. I mean, I don't know if somebody's working on that, but it's something that needs to be done, so to speak. I think the impact on education may be even more fundamental than on the production side, potentially.
Deep: I mean, there are all kinds of weird guitar things out there.
Like, I ended up buying one at one point. It was this thing you stick onto the fret at the end of your guitar neck, with little lights that told you where to put your fingers, and therefore you could learn songs and play great music. Well, no, because it's missing like 98 percent of what actually matters. It's a big deal how you put your fingers on the strings.
It's a big deal how you move between chords. And that's the part where I feel like, with machine assistance, we always need that human piece in the loop.
Need help with computer vision, natural language processing, automated content creation, conversational understanding, time series forecasting, or customer behavior analytics? Reach out to us at xyonix.com. That's X, Y, O, N, I, X dot com. Maybe we can help.
Hans-Martin: That's dope. Um, let me come at it from a different perspective. I'm with you on the playing side and getting real feedback from humans as I'm playing. But, like you mentioned earlier, you made the comment that people go dark with AI music, right? And here's an interesting, different perspective.
The person I heard talk about it really very explicitly was will.i.am, right? Black Eyed Peas. I'm not only living in Seattle; I'm also going back and forth between LA and Seattle. We had Tech Week LA in June, and that also included a discussion on AI music.
And he had the perspective that, for professional musicians, in the future they will no longer create IP in the form of tracks. Rather, they will want to invest in the model; it is the model itself that is the IP they're licensing out.
And they may even go so far as to invest in adding more skills to the model, and more styles, making it richer; as much as they can create goes into that model, because that will just make it more interesting for others to bring this model into their own creative process.
Instead of just the result, I'm licensing you, I'm providing you, right, this thing that essentially emulates what I would be doing if I were in that context.
Deep: In that context, yeah. Which, you know, opens up a whole new arena. I feel like at some point I need to do a whole episode on just this.
Historically, songs have kind of been the binding medium of music, right? Mm-hmm. And a song in the modern digital era winds up being an MP4 or whatever, some kind of recording, with everything smooshed together, all the tracks, and that's your thing. But think about something like video games. It's a massive entertainment vehicle, but the idea of playing the same singular song
start to finish in a video game context isn't necessarily what you want. What I might want is: hey, I want the Moby model, and I'm building a video game, and somewhere in the game somebody goes into a club, and I want the Moby model to react to how many people are in the club, to react to maybe the time of day.
Because, you know, clubs don't play the same music at three in the afternoon as they play at three in the morning, all that kind of stuff. And maybe somebody slams a door; all kinds of stuff that we have in the video game world. We have this very sophisticated, evolved, physics-modeled universe that we can build from, and tools, but our music abilities there are relatively quite limited, right?
Like, we don't have that kind of sophistication in those virtual spaces, compared with what I would argue we have in the sophistication around the physics.
Hans-Martin: Yeah, but I mean, there are companies working exactly in that space, right, creating this personalized soundtrack for the gaming experience, where you can select your own style and really get your own soundtrack
as you navigate the virtual environment, based on the settings and preferences you provided. What I find interesting, though (I forget the name of the company now, but it was also one of the podium participants back at this joint event), is that the key business driver is that nobody wants royalty-based music in video games.
Oh, really? Yeah.
Deep: Because they just don't have a history of paying out royalties the way they do with film.
Hans-Martin: Exactly, exactly. And so that's actually driving the adoption of AI music generation in gaming audio, which I found really interesting.
Deep: They just don't have a way to put the budget in or something.
Hans-Martin: Because it can get expensive, right? I mean, if you're creating a video game, you don't want per-play royalties going to some artist who just happened to create the music.
Deep: Oh, and a lot of the economics are different too, right? The economics around a movie are easy to model; you can put them on a spreadsheet. You pay 16 bucks for a ticket, so you can afford X amount to go to the musician. But you can't do that exactly in the game world, because there might be advertising components to the revenue. And similarly, even in social media posts, there's an incredibly small fraction of monetization that can happen on a given use of an audio track.
Hans-Martin: So I find it quite interesting, right? We're seeing companies go into this space by building personalized music soundtracks for the game experience, and it really works because they're in a world that tries to stay royalty free. And that's really driving the adoption of these AI tools.
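As an editorial aside, the adaptive club-soundtrack idea from earlier can be sketched as a tiny rule that maps runtime game state onto music control parameters a generator would consume. The state fields, thresholds, and parameter names below are all invented for illustration; real systems drive far richer generative models from engine events.

```python
# Toy adaptive soundtrack: derive music parameters from game state,
# e.g. how crowded the in-game club is and what time it is.

def club_music_params(crowd_size, hour):
    """Derive tempo/energy settings from simple game state."""
    # Busier club -> higher energy, capped at 1.0.
    energy = min(1.0, crowd_size / 100)
    # Late night (22:00 to 04:00) plays faster, darker sets.
    late_night = hour >= 22 or hour < 4
    bpm = 126 if late_night else 100
    return {
        "bpm": bpm + int(10 * energy),  # nudge tempo with the crowd
        "energy": energy,
        "mood": "dark" if late_night else "mellow",
    }

print(club_music_params(crowd_size=80, hour=3))
# a packed club at 3 a.m. gets a fast, dark, high-energy cue
```

A game engine would call something like this on every scene change or event (a door slam, a crowd surge) and feed the result to the music generator, which is what makes the soundtrack react rather than loop.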
Deep: Cool. Well, thanks so much for coming on. I feel like we covered a lot of great terrain. It's been super valuable and interesting. Thanks so much, HM, for coming on.
Hans-Martin: Yeah, thanks so much for having me.
Deep: That's all for this episode. I'm Deep Dhillon, your host, saying check back soon for your next AI injection.
In the meantime, if you need help injecting AI into your business, reach out to us at xyonix.com. That's X, Y, O, N, I, X dot com. Whether it's text, audio, video, or other business data, we help all kinds of organizations like yours automatically find and operationalize transformative insights.