Your AI Injection

Evaluating Efficacy, Assessing Trust & Editorial Alignment in LLMs

June 13, 2023 Deep Season 3 Episode 1
Evaluating Efficacy, Assessing Trust & Editorial Alignment in LLMs
Your AI Injection
More Info
Your AI Injection
Evaluating Efficacy, Assessing Trust & Editorial Alignment in LLMs
Jun 13, 2023 Season 3 Episode 1

Join host Deep Dhillon and Bill Constantine as they explore the intricate process of assessing efficacy, assessing trust, and achieving editorial alignment in large language models. In a world where LLMs are increasingly powerful and usually reasonable, traditional efficacy techniques are insufficient and a new focus that leverages semantic comparisons and even the LLMs themselves for attribute driven editorial alignment is a new way forward. The AI experts delve into a number of topics including the importance of editorial considerations, communication style and the challenges of efficacy assessment in increasingly personalized models.  

Check out some of our related content:

Show Notes Transcript

Join host Deep Dhillon and Bill Constantine as they explore the intricate process of assessing efficacy, assessing trust, and achieving editorial alignment in large language models. In a world where LLMs are increasingly powerful and usually reasonable, traditional efficacy techniques are insufficient and a new focus that leverages semantic comparisons and even the LLMs themselves for attribute driven editorial alignment is a new way forward. The AI experts delve into a number of topics including the importance of editorial considerations, communication style and the challenges of efficacy assessment in increasingly personalized models.  

Check out some of our related content:

[automated transcript]

Deep Dhillon: Hi there, I'm Deep Dhillon. Welcome to Your AI Injection, the podcast where we discuss state-of-the-art techniques and artificial intelligence with a focus on how these capabilities are used to transform organizations, making them more efficient, impactful, and successful.

Bill Constantine: So what are we talking about today? 

Deep Dhillon: Yeah, so we'll just roll into it. Um, so hey, thanks everybody for joining. I've got, uh, bill here. So this is, um, the two of us are gonna talk about, uh, assessing efficacy or accuracy in, uh, these large language models. So, yeah, so let, let's kind of dig in here because one of the things that happens with these LLMs, you know, is you, you get.

A lot of, um, like it's very easy to kind of get carried away and get them to say things and you can quite quickly get them to do something that seems an awful lot, like what you're trying to, uh, go after. But everyone seems to kind of be coming around to the fact that, you know, GPT-4 is, um, You know, I think you, you call it God at this point, but it's like, it's, it's really, really good.

And as you're pointing out, folks now are just sort of using it for training data. So if, like for summarization, you know, you've got a bunch of, um, stuff that you, um, maybe pass into GPT-4, or maybe it's like the CNN Corpus that everyone kind of knows at this point where you can, you've got reporters who've actually, you know, humans that have summarized this content.

But once you've got that, like maybe we'd start with like what are some of the old school techniques that people were doing to, to assess efficacy for a task? Really specific like that where you've got text that's being generated, how do you actually even say how well it's adhering to the Yeah. Sort of that's a per perfect definition.

Why don't we start there and then we can talk about some of the newer techniques and like what GPT-4 makes possible. 

Bill Constantine: So let's go, let's go old school. You and I can ramble a little bit. And we're gonna sound, we are gonna sound pretty old here. Ultimately it comes down to if you take a blob of text and you have one of these models, summarize it in some way.

Um, it's gonna put, it's gonna return an answer. Let's call that the bot answer. But then you're gonna have, uh, a ground truth. Maybe that ground truth actually comes from a human being who's done the same thing as you mentioned. For example, there are some articles out there where they have a whole article and the author summarized that entire article in just another sentence, and you can.

You can still download that corpus of text, for example, but ultimately you have a bot's answer and you have a human's answer, or you have a ground truth answer and you need to compare them somehow. So you can say, well, the bot's answer was actually close or really far away. So if we're going really, really old school, you would literally do like a string uh, comparison.

Um, between those things, without going into the weeds there, there's lots of little, uh, techniques like levinstein distance and so forth. Essentially, you're calculating. Calculating distance. Yeah, you're calculating distance metrics between, um, you know, those things. But then later on we got these, this idea, you know, when you deal with text, you have to turn it into numbers so that we can deal with it in a machine learning sense.

And there was a technique that came out there that, that formed these so-called word embeddings by, uh, Nicol. Uh, Nicholas Toof was, uh, was I think the first, I hope I have his name correct. Uh, he was the first one to sort of publish a, um, A, a real seminal paper on this, you know, writing in the back said a lot of technology, but basically it would take a given word and inform a nu numeric vector.

And so you could do this, you could do, you could think of like, if I have sentence A and sentence B, you wanna compare them, you would take the word vectors and sort of compare them and C, how close they are. But the problem with that is that that doesn't really tell you so much about context. 

Deep Dhillon: Before we get in there, I wanna make sure our list for our listeners benefit, like what we're trying to do here is you've got a human that supposedly gave you the perfect summary and you've got a bot that created a summary and you're trying to find out how close to the perfect summary. May, maybe it is, it isn't even always human cuz like lately we've been using a lot more GPT-4 for the summaries cuz they're better than the humans that we've got, but mm-hmm.

You've got a perfect summary quote and then you've got this one that you're trying to measure. Its distance to perfect, which in and of itself is sort of flawed, right? Because you can imagine multiple summaries that are. Great. And it's hard to know whether or not one's perfect over the other, but that is sort of the premise that we're marching forward on.

And then part of what you're describing is that similarity process, like how do you determine how similar to the quote perfect yours is. And then there's various techniques from like simple string comparisons, like how many. How, like edit distance or levinstein where you know, you just, how many characters does it take to move one to the other?

Which of course doesn't really account for like, things like synonyms. And then we talk about word embeddings, which we've talked about on this program before. Um, you know, where we've got now a semantic understanding of, of words based on these more sophisticated models that can. Predict like, you know, future word sequences.

And now we have even more powerful techniques where we're predicting future sentences and we get what we think of as sentence and embeddings or semantic understandings of the sentence, all of which kind of give you a similarity, all of which let you basically say like, how close to the perfect are you?

But before we go on, let's kind of. Talk a little bit about what we're assuming there and how applicable in real life scenarios that is, because how do you get that perfect, um, that perfect stuff that you're comparing against for, for one two, like how, how is that variability across perfect even expressed?

Or is it, is it even the right mo modality? And sometimes you get things that semantically seem really quite pop, quite good. But maybe you violate some kind of, um, unwritten or some, some kind of like rule. So like, let's say it's HIPAA sensitive data, um, maybe it's like providing medical advice or something like that where maybe semantically, that's still a perfect answer, but it's still wrong.

Like, how, how do we think through these things? Things. 

Bill Constantine: Yeah, you're absolutely right. When it comes right down to it, it's not only a task driven exercise where you have a very specific thing that you're doing. Like take this blob the text and summarize it. But it's also you as, as a person that's developing a product, you might have these constraints that, for example, you can't have city names or personal names.

You can't have any pii. It has to be a certain length, for example. That's something that's quite common. You may want, for example, uh, you know, Something that's maybe tweet sized, and so it's something quite small. Maybe you even want it down to like a couple buzzwords or it's more, you know, edging towards like keywords.

So in this sense, the efficacy of the model, it's not enough for it to, to be very accurate in some sense. Like, say it has a good, uh, sentence, embedding approximation between the two are, are very close. It's also that it, it adheres to these constraints that you've put up. And, uh, and the only way really to put up the boundaries for these models to, to adhere to those constraints is for you to do some sort of, typically it's some sort of fine tuning.

Um, so, and I just wanna point out that it's important to have all these different metrics and you think that these string metrics are kind of silly, but actually when you talk about very, very short responses, it's a, you might wanna include like a string matching. When you talk about longer responses, you might wanna include, you know, string, uh, uh, sentence embeddings, or even word embeddings and so forth.

But you also might wanna bring in, as we just said, measures of how closely they adhere to whatever you're trying to achieve. Like length. You might have a length, you know, metric, and you might have something like, well, does this thing. Return, uh, um, something with PII in it. Does it return, um, something that's kind of cheeky in its response, et cetera?

Yeah. I mean, 

Deep Dhillon: you might have various fuzzy things, right? Like one of the things people realize really quickly with the default settings in ChatGPT is it tends to get kind of preachy and it tends to just barf advice at you in like very large volumes unless you control it more. And that's, that's often like something like, you know, folks who are trying to tweak these.

Outputs. They, you know, they have, they have their idea of quote. Perfect. Right. And it's, it's very editorially driven in some sense. You know, like it's, it's not that different from, you know, a magazine in the, you know, in a decade prior or something, or even today, but like, you know, where a, an editor had a real voice and really specific ideas of what they wanted to express and those things, if we think about classical, Text analysis with machine learning.

You know, like normal precision recall metrics don't really like apply. Exactly. Um, these similarity to perfect metrics are a new way of thinking about things cuz it's very sort of granular in some sense. Like you can define perfect at the granular setting, but when you start talking about bigger and bigger and bigger and bigger summaries that.

Complexity gets bigger and you, if you got like, you know, five, let's say you know, of the key editors in a room, you know, do you get agreement? So all of those become challenges in this space. 

Bill Constantine: Well, and you mentioned one thing is that, how does one go about the business of forming a ground truth here? Like what is your, what is your gold, what are you comparing to?

What is your standard conventionally, you know, we at SCI and I think people. Typically you would just grab humans. You would have the discussion, right? Like editors in the room, Hey, when we put out this material, this is what, this is the content that we're looking for, this is how we're gonna communicate it, and so forth.

So you have that same sort of discussions around this material for specific project. So it's important to point out here as not only are you looking for things that we talked about, like it has, it has to match in terms of context and length maybe, but even attitude. Because you might be speaking to an audience that's young teenagers who are not gonna want a formal response.

So you might be, you might be looking for summaries that are more, you know, like colloquial. Colloquial, like, you know, hey, you know, a lot of haze, or dude, or, you know what, I'm too old to understand modern, you know, TV stock, but. 

Deep Dhillon: Yeah, discourse style discourse is something Thank you. Yeah. On, on some, on some level.

You wanna almost decouple that too though, right? Like that's, that's also part of this is like with respect to efficacy assessment. Yes. You may not want to co-mingle discourse style cuz it sort of pollutes your, uh, your metric 

Bill Constantine: on some level. Or you may want to include it because that's the, that's the only style of communication that you're looking for, you know?

So, but I guess, What you ultimately, ultimately is from a sort of like a, an analytical perspective, is you have these different dimensions by which you can assess, uh, equality between essentially two strengths. You know, one is the gold standard provided by what you think of as a human. This is what we're going for, this is, this is what the editors sort of think is the best response, you know, or the best summary.

And then what, say one of these models spat out and you have these different dimensions that you know, could be context, could be length, it could be. Uh, you know, how, how cheeky it is, attitude, et cetera. These things are difficult. Some of them are difficult to actually measure, but you can get close, right?

And you can do, uh, you can, then you have sort of these different scores and these different dimensions, and what I personally do is I will form an overall score, an efficacy score, say, based on a weighted. Uh, average of all of those dimensions. So I'll, I'll say, you know, it's very important for me that the context be very close, so I'm gonna give that score, whatever that is, uh, you know, like a, a five of a weight.

Deep Dhillon: Yeah, there's some kind of combination there, you know, but it, you might n know like how, what, what's the ratio between cheekiness and, and accuracy, you know? 

Bill Constantine: When it comes to attitude, that's very difficult to assess. 

Deep Dhillon: Well, not not only attitude though, it's things like, you know, if you're building a whatever, like a virtual, um, like, you know, uh, career like therapist or something, like how empathetic is it?

Basically what we're getting at is like measuring, um, Like quantifying these editorial activities that historically have been very playbook driven and in the hands of quite creative folks that don't think quite as, um, like a like metric, like in such a specific metric orientation, like geeks like we do.

And so, That's almost like there's like a big gap there, right? Because historically it would be like, Hey, you're an investigative reporter. Go do your thing. And then we're interacting at this very high level. Now we're talking about this thing that can converse and trying to like, and, and maybe it's not even conversational, right?

Like we were, we were talking about summarization, but it, it's like writing, basically assessing writing is what we're talking about. A against a, a very. Granular attributes sort of laid against an editorial construct of some sort. 

Bill Constantine: I do wanna point out that we haven't quite moved the timeline here up to the current standard.

Uh, what we've discussed thus far. Let's say when we're talking about a world of sort of like GPT-3 and prior models, these are LLMs that aren't quite as sophisticated as the current models are. We definitely would involve the editorial board. As you said, we would involve humans. We would say, uh, we, we have annotators, for example, at Xyonix that we use and we say, here's a bl text.

If I were to tell you to summarize this in a single sentence, You know, go ahead and generate that. And we, and we, we talk to them and we train them. We say, here are the types of things we're looking for. We want proper grammar. We want it to be a certain 

Deep Dhillon: length. Yeah. We come up with really specific guidelines too.

Typically, like brutal documents with detailed examples and things like this is on this side of the boundary, or that side's 

Bill Constantine: true. In the past. You could also steer the models towards that direction. Right. You could try to, Essentially give it prompts, so to speak, and maybe in that prompt you can give it an example of what you're looking for so it has a good idea and you could see it could lock onto that prompt and that example quite nicely.

Now, in the current standard, which is GPT-4, you have these things almost like you tell it the type of assistant that you, you would like to generate. Like you say, Hey, I'm a, you know, I'm a. I'm a principal in a school district and I wanna be able to communicate better with my parents. And so, you know, start from that attitude, you know, start from that position and then do your stuff forward.

From that, maybe the forward from that are instructions. I'd like you to summarize these large documents and, and put it in simple language that I can more easily communicate to the parents, et cetera. Ed, I will tell you that that technique and that approach got so good. From my perspective that it actually was beating, um, humans.


Deep Dhillon: listening to your AI injection brought to you by That's Check out our website for more content or if you need help injecting AI into your organization

Taking the conversation back to efficacy assessment. One of the things we know that, like. These, like some of these higher level models like so good at is you can, you can very easily with like a few examples come up with like a categorization scheme or a scoring scheme. So let's talk for a moment about actually just using the model to, you know, itself to assess efficacy along different ax axis.

Like you, you mentioned like offensiveness, um, you know, there's things like empathy. There's like, there's a number of. A attributes. So, so if we go back to the editorial aspect, you can imagine, um, an editorial guideline coming up with we're gonna, we are gonna always be whatever, empathetic, um, clear in our response, non offensive, um, yet edgy.

And you can imagine just like you have a st like an old stereo dial where you've got, you know, you've, you're, you're sort of got different abilities to light up those attributes. Now, now you can go back and very easily use these models themselves to assess for a given generation how offensive are you?

How, um, edgy were you, et cetera. And now you, and so let's talk about that. Like, like what does that mean? Cuz it's, it's like you're using the same model to generate your stuff. Or, or maybe like, it's not quite. I mean, it's the same model, but maybe different prompting or whatever. 

Bill Constantine: Uh, that's the way I, that's the way I look at it.

It's same, it's the same giant model with a billion capabilities. That's an idiot savant, right? It can do like, it can do most anything that has to do with communication of human beings, you know, uh, coding, uh, speaking in French, uh, you know, writing a business plan. It does a million things. Well, actually, um, and, you know, steering it back to this conversation, I love talking to you deep about this because we're kind of old school folks, right?

If we were to say, look, we want these. Completions, these things coming back from the models to whatever they, whatever they say, we want them to have these characteristics, as you suggested, empathy. They wanna be inspirational, they wanna be clear, they wanna be edgy. Imagine, say even last year, for us to be able to take any given sentence and then go along these different dimensions to say, well, how empathetic was it?

How clear was this? How edgy is this response? Because these are the things, these are the check boxes that you're looking for in those responses. So part of the efficacy in your model could be almost what we call like communication traits, right? We, we want the model to be able to communicate the information in a particular way, and here are the ways we want it to be edgy, et cetera.

But then to actually give that a number, imagine that we had to do that. So what you do is you like, like you say, you know that this, these LLMs and particularly GTP four is so good. That you can go back and say, well, look, why don't you give me a rating on empathy, uh, for this response. And you can even say to it, give me a number.

I mean, I think this is, 

Deep Dhillon: I think this is really a, a big deal. So like, you know, many of our users are product managers or folks who build, you know, sophisticated, uh, products. And the question I always ask myself when I'm talking about efficacy is like, here's this, the scenario I want to avoid is, Some high powered executive, whatever, storms into the room and says, this thing stinks on this, in this case, and fix it and, and expects you as if it's all a big if.

Then you know, you're just gonna jump into some specific line of code and like fix something. That's obviously, we all know that's not how these systems work, so we need statistically meaningful. Um, test suites as product managers, as you know, um, you know, founders and people who are like pushing this technology and integrating it into their products, we need statistically meaningful ways of feeling confident when we release and being able to stand up to that, you know, high power exec that's freaking about the, out about the one example that they think is terrible, which will, is inevitable and definitely will happen and keep happening.

So, That's where I think I always imagine myself, like, I'm about to release this thing. I need to have, I need to have a methodology, uh, some, some stats that I can bubble up and I can crisply say, you know, why we released this thing and, and I need to have those knobs and dials there. That makes sense to the exec walking into the room.

So the knob and dial in this case is like, well, what do you wanna set the edginess level to? What do you wanna set the aggressiveness level to? And this is, we, we now ran against these, you know, 50,000, you know, kind of curated, um, candidates. Like, and, and now we've run the model against it with these curated, you know, ideal.

Answers and we got this score using the kind of, let's call it proximity to perfect. But we also got these scores on these at Cs, and somehow we bubbled it all up into a single score. But everything we can track back and describe because we, it's like a moving target. We're always gonna want to change things a little bit and fi and I don't know, like, to me that seems like where we're headed here is a world where you can assemble an efficacy score.

Based on like high level knobs and dials that you can through experience, um, and exposure to the soft spots there, tweak over time, but you can always link it back. 

Bill Constantine: Well, I think what you're saying is super important. We began this conversation with how do you control these beasts? And we have certain dials and ways of, of doing so as best we can.

But then I think the, the, probably the most important aspect of our conversation here is how, how do you measure the beast? We've talked about efficacy. I almost think of these traits that you're, efficacy to me is more along the line is I, I tend to think of things more along the line as is the information coming back correct and presented in a sort of a decent fashion, um, like the length is suitable for text, et cetera.

Then I almost, I almost feel like there's a communication. Um, trait and, and, and alignment with those, like is, well, did you give an empathetic response? Did you, was it clear? Et cetera. These other attributes, I kind of tend to divide up those territories into two d different realms. Then the reason I do that is because, let's imagine just, this is a bit, uh, abstract, but imagine that there's information to be had out there in the world, and you're, you're assuming at this point that that's very accurate, but now you want to disseminate that information.

Effectively, well that information being disseminated to an old professor in Oxford probably is gonna be put one way, whereas disseminated to a, a young teenager who's very sassy and doesn't like his parents at the time might be disseminated another way, et cetera. Right. And in order to, or I think it's, I almost call this communication alignment because you, we also have the ability now to communicate with people in sort of these different ways.

And we can actually, as you suggest, we can actually measure specifically how aligned we are with those communication strategies. And that's super important to, I think, uh, people and managers out there who are building a product and they say, you know, Hey, we wanna, we, we mainly deal with, you know, professionals and doctors, you know, so we're not gonna want cheeky responses.

We're gonna want very formal, et cetera. Or you might be dealing with the, the general public and maybe you're targeting teenagers for like advertising or whatever. I don't know. It's very important that we be able to measure these sort of communication strategies and with these large language models, in a nutshell, you could do so.

I would say that in our experience, we have done so and. Uh, like most things with GTP four were somewhat blown away. Maybe not getting as blown away anymore because we sort of expect it to be incredible at, at these efforts. But there are cases where in some of these dimensions we can see maybe its own assessment is maybe a little bit off according to what we might think.

Like maybe it says, for example, that, you know, this wasn't very clear, uh, gives it a score of 0.1 out of one. But we, but we go back and say, you know what? Actually I thought that was very, very clear. Um, so we, so what I think that speaks to a larger issue that we haven't yet touched on is we can always go to God and say, please do what our bidding, but we can also that one of the things I brought up earlier is that that's not very scalable, um, in terms of cost and, and.

Com computational speed 

Deep Dhillon: competition. Just to be clear for our audience, bill likes to call GPT-4 God, but like, um, no offenses intended, 

Bill Constantine: but I mean, I mean that in Yeah. In a non offensive way. It, it's, let's just call it a very, uh, huge single source of information. 

Deep Dhillon: That high probability of correctness.

Yeah. Or at 

Bill Constantine: least reasonable. Yes. Well, anyway, you can, you can take these, you can take these huge models and you large language models and you can bring 'em down to a lower level model and use GPT-4 to, to basically generate training 

Deep Dhillon: data. I think part of what you're saying here is if we rewind six months ago, maybe even, you know, like a good chunk of our outputs, we were kind of satisfied by saying, this is.

Good or bad, you know, uh, correct or not. Now, we now fast forward to now we're in a world where almost everything these things say are reasonable. So now we're trying to get to perfect responses. And so perfect means assessing them in a more holistic way. And now we're sort of struggling with like, what are the efficacy?

Um, Metrics one uses to be able to, to do that.

I want to, I want to take a little bit of, y you know, I, I think this has been a super interesting discussion, but I want to jump out, let's say five or 10 years into the future, and I want to ask the question like, what does the world look like with respect to this? Um, how we actually interact and define these things.

And how do we assess efficacy like five, 10 years out? Because it feels to me like there are certain areas that are gonna get quite mature now where, you know, startups, whoever is gonna get, are gonna get really good at helping. So for example, I'm envisioning a tool that's like a workflow tool that can take, um, a set of people that are, you know, on the editorial committee of some dialogue system.

Um, for example, and it can get really good at interacting with them, getting them to debate the issues, figure out exactly what this respon, where this response is suboptimal and why, how it could be better, facilitates agreement amongst them, and then manifests itself in, um, revised, uh, entry in examples for, for, for efficacy assessment or in like more fine tuning examples.

Something like that. Like that's sort of one example of five years out. I imagine somebody's done really well at that, you know, has gotten it so that it's like a few clicks and, you know, and, and, and five folks can like steer this thing pretty quickly. Are there, like, do you see, like, what do you see five, five years out?

Bill Constantine: Well, first of all, it's scary to think about five years out for me. And I, and I wanna, I think it's an important point because, uh, maybe two years out, I mean, at the rate that we are maybe six 

Deep Dhillon: months out, 

Bill Constantine: BEC no, and totally. I mean, I, I think we're totally, I think we have. I, I don't wanna get off track here, but we have gone over the top of the mountain now and we're go, we are fast.

I mean, there are, as you say, there are so many people that have latched on to the possibilities. So there's just a lot more folks using these LLMs and I personally, I, I think we're, we're headed where I guess I would like to see us head is in developing these communication strategies that are very personalized.

Everybody in the earth has a particular way of communicating and, um, APIC particular style. So I, I see these models being highly, highly personalized. Um, you know, in a, in a couple years to the degree that satisfies most people. I think I've mentioned a couple times, maybe in a former podcast, I see a day, not long for now, where I, I literally wake up and, and go in the mirror and it's like, oh, bill, how are you doing today?

You know, and I have a conversation with them, and it's, it's, you know, it, it uses computer vision to recognize that it's me. Whereas my wife comes in soon after me and it's, you know, well, hello, you know, uh, how are you doing? Different way of interacting. So I, I do think that these, what we're talking about is the, the ability to switch personas essentially in communicating basic level information, um, in a way that not only is.

Relevant to the current day, but maybe your entire history, you know, how you've, how you have been, maybe you're not, maybe, maybe for example, you've been feeling down for the last six months so that it takes on a more empathetic tone and so forth. 

Deep Dhillon: But efficacy assessment in that world, in that hyper personalized world is, is challenging, right?

Because now you're almost like assessing efficacy relative to a set of dials and all their potential parameters. And now you're just setting the parameters back for the individual or something. So in other words, um, I have the ability to say to, you know, to like be a 10 on the, on the empathy dial and a six on the aggressiveness and a, and that's because that's what I want and that's how I want to be talked to.

And there's a separate system that figures out that out. But now I have another system that can assess, like, given that I'm trying to get, you know, a 6, 7 83 across these attributes, Output. I now generated it. To what extent did I adhere to that? That might be your efficacy assessment in that world. 

Bill Constantine: That's right.

But we haven't talked about, well, what happens when we have an assessment of the typical responses? Are they adhering, are they aligning to these, uh, communication traits or not? What do we do about it when, when it, when it's not? I think that in a couple years there will be the ability for us to steer these models.

Um, Even more so than we have today. Cuz right now, to be honest with you, I think we're, we're steering, I think open AI's, this is my view, is that they were able to steer it a enough to be able to release it to the, to the general public. Um, but now it's gonna be a game of personalization. Can you steer it towards something that you like?

Well, we're 

Deep Dhillon: almost, uh, I think we're out of time here. Um, it's been. Super fun chatting and digging in on all things that l l m and efficacy assessments, although we, you know, as always, we kind of meander off into other, other related areas in some kind of often to further out. But that's kind of part of the fun, uh, here on your AI injection.

So thanks for, thanks for a great conversation, bill. And, uh, thanks to our audience and, um, for, for being here. And, um, feel free for those online to. You know, we've got a lot of content, uh, articles that kind of are more, um, structured and, and, and in depth than what you get conversationally. But we also have a lot of podcast episodes that touch on, uh, on, on things that are quite related to this topic.

So feel free to go to and, uh, and poke around in our articles or podcast section or, you know, drop us a card or letter, um, you know, if you've got anything that you want to add to the conversation. So, thanks everybody and uh, we'll see you next time. That's all for this episode. I'm Deep Dhillon, your host, saying Check back soon for your next AI injection.

In the meantime, if you need help injecting AI into your business, reach out to us at That's Whether it's text, audio, video, or other business data, we help all kinds of organizations like yours automatically find an operationalized transformative insights.