On this episode of "Your AI Injection", we unpack the challenges of implementing large language models in production. Host Deep and Xyonixians Andy and Bill explore the chasm between the dazzling demos of models like GPT-4 and the nitty-gritty of everyday use in business settings. They dissect topics like user expectation versus API reality, the speed of AI evolution versus operational stability, and the pressing issues of costs and "hallucinations" in AI responses. The episode shines a light on the "picks and shovels" of AI—the vital tools and strategies that enable practical, responsible use of AI technology in diverse domains.
Learn more:
[Automated Transcript]
Deep: All right. Thanks so much, everybody, for joining us on this episode of Your AI Injection. We've got Andy and Bill, fellow Xyonixians, uh, who have been on the show before. So, you know, welcome back, guys. Thanks so much for coming on.
Andy: Thank you for having us.
Deep: Yeah. So today this is sort of a question that's been on our minds a lot lately, and we're going to kind of dig into it with everyone. The high-level question is: what are the picks and shovels of the large language model, or large multimodal model, world that's emerging here?
So that's kind of what we're going to dig into. So we're gonna try to explore the gap between sort of the dream of like seamlessly integrating a large language model into your project and like the stark realities of the process. I think all of us certainly here at Xyonix, but I think many of our listeners have had this experience too.
You know, a boss, an exec, a client comes in, they jump into GPT-4, they get some kind of wonderful killer response for that one question that they ask. And they're like, why can't you get this into production faster? Why is it so slow when it's in production? Um, you know, all of this stuff, like getting from playing around with GPTs to...
Andy: ...like, GPT-4-quality results across our documents in our SharePoint server, whatever the data situation is.
Deep: Yeah, with the scale, you know, the millions of users that folks have, and then the latency. You know, when you're in that kind of think-it-through mode with GPT-4, it's got that incremental typing thing happening.
And there's a lot of little tricks that make you think that it's not horrifically slow, even though generally we feel like it is. Um, and of course Bard, now with the Gemini model, is a serious contender. And my prediction is that by the end of the year we're going to have a lot of contenders.
Uh, you know, Perplexity is getting really good. So it's no longer just about OpenAI.
Andy: A brief side note. I don't think we explained what we mean by picks and shovels. We're in the Seattle area, in the U.S., and during the Klondike gold rush a lot of people were heading up to Alaska to make their fortunes finding gold. The people who really made the money on that deal were folks in Seattle selling picks and shovels and wall tents and winter wear to those heading up to the Klondike.
And so, uh, just to sort of fill in the gap, I think.
Deep: No, thank you. Cause we do have quite an international audience.
Andy: We know that there's a lot going on with LLMs and a lot of gold will be found, but we kind of need these tools to do it right, and to fill in the gaps between the ideal gold of GPT-4 and actually finding it in your business's hills, I guess is one way to put it.
Deep: But the question is still the same. Maybe we start there: what are some of the problems that happen going from the play-with-it-in-GPT-4 scenario to the make-it-real scenario?
Bill: I have a lot of comments on that. That's, that's something that we experienced, as you said, with a lot of our clients.
I mean, you have to know that with OpenAI, when you go and use their user interface, it's like watching a movie. It's all, you know, sort of post-production. It's very pretty. They have a lot of slick types of things riding on top of it that, frankly, even as we sit here, we are aware of, but it's certainly much different than hitting GPT-4 directly. The difference between the UI and the API is quite vast.
And so one of the things, for example, that you mentioned is that you'll go and type something, and you'll get something that looks beautiful on the UI side, but there's a lot to that. They might shorten it, for example: you might say, generate 100 paragraphs about a particular story, and they might just give you 10 of those.
It's in bulleted format and so forth, and there's a lot of checking that goes on behind the scenes in that UI to make certain that the output is not insulting, that it's aligned with what you're saying, and that it's presented in a nice, beautiful format. The picks-and-shovels version, as you mentioned, is what happens behind the scenes.
If you were to do that same query with the API, and you requested the result in a particular data format, like, say, JSON, you may not get exactly what you want out of it. In fact, even the best models, given the same input, will give you what you want
maybe 85 percent of the time, but 15 percent of the time they won't. What I mean by that is they might surround the text that you want with jibber-jabber or things that are unnecessary, and make it so that it's not parsable in the data format that you'd like. Uh, one other thing that popped up for me is that you mentioned there's a lot of competition out there.
So OpenAI is moving fast. And what that means is that when they release a new model, you can get new funky things happening. They have a Python interface package, for example, to connect to the API, and that changed drastically overnight.
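The parsing failure Bill describes, where the JSON you asked for comes back wrapped in extra chatter, can be illustrated with a small sketch. This is only one possible way to handle it, not what the speakers say they use: the function name `extract_json_object` and the sample reply are made up for illustration, and the approach assumes the payload is a single JSON object somewhere in the reply.

```python
import json
from typing import Any, Optional


def extract_json_object(text: str) -> Optional[Any]:
    """Return the first JSON object embedded in `text`, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next opening brace.
            start = text.find("{", start + 1)
    return None


# Hypothetical model reply: the wanted JSON is surrounded by extra text.
reply = (
    "Sure! Here is the JSON you asked for:\n"
    '{"title": "Gold Rush", "paragraphs": 3}\n'
    "Let me know if you need anything else."
)
print(extract_json_object(reply))
```

A `None` result is the signal to retry or to fall back to another strategy, which is where the extra calls and latency discussed later come from.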
Deep: Yeah. They're not shy about breaking things. Uh, yeah.
Moving fast. And I know we at Xyonix have got our own wrappers around this stuff, cause we're trying to stay fairly LLM agnostic. But yeah, I mean, they'll just break stuff. Like, fairly recently they deprecated any kind of custom fine-tunings built on GPT-3. That kind of stuff
is sort of another category of problems. Maybe we start with the categories of challenges, right? So one category you mentioned is this UX/API gap, and I think that's a real one. And then another category is the bleeding-edge challenge: the APIs are in flight, what they're supporting and not supporting is in flight, and you're constantly having to
evaluate and think about, should we be running this on Azure? Should we be using Bard and the new Gemini model? So let's call that the bleeding-edge problems. I don't know, Andy, like...
Bill: I want to add another category. Sorry, Andy. It's that system-level or company-level dependence.
You know, that's super important to talk about.
Deep: What do you mean there?
Andy: Yeah, I think what you're getting at, Bill, is domain-specific needs, where GPT-4 is going to create a great answer for a general-audience type of question, and the CEO that we were talking about earlier might not have a clear idea of why that doesn't work so well against a company-specific domain.
Bill: Yeah. That is also important, Andy. I was thinking more along the lines of: one day Sam Altman got fired from OpenAI, and I wondered if the models that I had put up there were going to be supported next week. And what is my alternative? This is the don't-put-all-your-eggs-in-one-basket thing.
And I think OpenAI is killing it. They're doing a great job, but it's going to be very disruptive. Over the next year there's gonna be a lot of competition, and you've got to think about that. So at a commercial level, who are you going to rely on? And can you do anything about it, too?
Deep: Right.
Like, we've seen a number of clients thinking through, hey, should we be dependent on a company like OpenAI? Aren't we safer with Microsoft, running their version of GPT-4 in house and being dependent on that? Um, and then there's Google kind of entering the ring.
And then there's a couple of startup players. And then there's just...
Andy: Yeah, it's a really difficult engineering challenge, because what you would traditionally do, if we rewind to when people didn't want to be totally dependent on AWS, is maybe write an interface for, you know, an S3-type service that can talk to Google's backend or Amazon's backend.
Well, I don't really think it's feasible to do that at this stage with LLMs, because you're going to get such different performance, or different peculiarities of efficacy, from Bard versus OpenAI or whatever. So yeah, you can build an abstraction layer, but ultimately you're going to have a lot of testing to do if you're making the transition in production from one to the other.
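The abstraction layer Andy mentions can be pictured as a thin router that tries providers in priority order. The sketch below is a minimal illustration under stated assumptions: `LLMRouter`, `primary`, and `backup` are hypothetical names, and the provider functions are stand-ins rather than real vendor SDK calls. As Andy notes, a layer like this only makes switching mechanically possible; it does not make two providers behave the same, so testing is still needed.

```python
from typing import Callable, Dict, List, Tuple

# A "provider" here is just a function from prompt text to response text.
Provider = Callable[[str], str]


class LLMRouter:
    """Try providers in priority order and return the first successful reply."""

    def __init__(self, providers: Dict[str, Provider], order: List[str]) -> None:
        self.providers = providers
        self.order = order

    def complete(self, prompt: str) -> Tuple[str, str]:
        """Return (provider_name, reply) from the first provider that succeeds."""
        errors = []
        for name in self.order:
            try:
                return name, self.providers[name](prompt)
            except Exception as exc:  # a real wrapper would catch narrower errors
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in providers (assumptions); real ones would wrap each vendor's client.
def primary(prompt: str) -> str:
    raise ConnectionError("service unavailable")


def backup(prompt: str) -> str:
    return f"[backup] echo: {prompt}"


router = LLMRouter({"primary": primary, "backup": backup}, order=["primary", "backup"])
print(router.complete("hello"))
```

Returning the provider name alongside the reply makes it possible to log which backend actually answered, which matters when their behavior differs.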
Bill: That could change in a heartbeat. Like, you know, OpenAI released a new Python package that completely changed the way you interface with the API. That changed overnight. So you could lock down versions, right? But at some point, with them changing so quickly, you want to keep up with them.
You don't want to be left behind, you know.
Andy: And it's not the same as locking down a version of a library where the code is just sitting there locally, because their web services might just stop supporting the old version.
Bill: This is exactly right.
Deep: Yeah, for sure. So, so far, I've got the UX/API gap, these sort of bleeding-edge issues, and then there's the system issues themselves: APIs in flight being changed, engineers optimizing things behind the scenes, and suddenly calls are falling apart. Like, we've all experienced that. There's a whole other category that I wanted to point out, which I think
can seem sort of innocent but is very much not so, and that's cost: both dollar costs and latency or time costs, right? Some of these calls are really expensive. You know, you throw in an image and ask for some analysis of it, and you can easily get over 20 cents for one image to analyze.
So that dollar cost is super substantial, because it impacts the process for developing these things. And it's not something that, you know, an exec who wants the level of efficacy they get when they just interact with the UX thinks through, right? Like, it makes sense to spend 20 cents analyzing one image if the underlying business model supports it,
and you would normally pay a human, or a set of humans, to look at it. But it doesn't work in, like, an advertising context. All of a sudden that cost becomes a non-starter. And so there's this whole new world that maybe you guys can dig into a little bit with me, which is: when do you decide to go to lower-level models?
What are the criteria for going to those lower-level models? When do you have to do it? When do you not have to do it? And when you do it, what's all the stuff that happens?
Andy: I think your starting point is really good, Deep, looking at whether the business model supports it.
Right? If you have a high-cost situation where you have a human review, sure, maybe 20 cents is no big deal. But we see a lot of cases where maybe the client is looking at what GPT-4 can do, and sure, maybe that's worth a subscription cost to you personally, but when you break it down to analyzing millions of images a month, it's not going to be tractable in terms of cost.
But I think people are pretty focused on what the latest models at the high end can do, and that is an exciting place to be. There's not a lot of structure and understanding around, not really a well-traveled path to, a lower-cost approach. It kind of seems like you have to take a shotgun approach right now to see if you can solve your problem with a lower-cost model.
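The arithmetic behind Deep's and Andy's cost point is simple but worth writing down. In the sketch below, the 20 cents per image comes from the conversation; the monthly volumes are illustrative assumptions, and `monthly_cost_usd` is a made-up helper name.

```python
def monthly_cost_usd(cost_per_call_usd: float, calls_per_month: int) -> float:
    """Total monthly spend for a fixed per-call price."""
    return cost_per_call_usd * calls_per_month


# $0.20 per image is the figure mentioned in the conversation; the volumes
# below are illustrative assumptions, not numbers from the episode.
for volume in (1_000, 100_000, 2_000_000):
    cost = monthly_cost_usd(0.20, volume)
    print(f"{volume:>9,} images/month -> ${cost:,.2f}")
```

The same per-call price that is negligible next to a human reviewer becomes the dominant line item once volume reaches the millions.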
Bill: Andy, what you said is so spot on. I really appreciate your comments. And it's interesting, you know, before Sam got fired, he had this great development conference, right, where they announced a whole bunch of new stuff. And one of the things that they hit upon was cost,
cause this stuff is actually quite expensive. As Deep said, you're exactly right. You go hit the UI: wow, look at what I got there. Well, that's great. But how much did that thing cost? And how are you going to scale that for a million people? That's some serious money all of a sudden. Yes, it's amazing, but can we scale that out?
One of the things that they did was to create a GPT-4 Turbo model, which you can basically think of as a fine-tuned model on top of the GPT-4 base. It's said to be really good. It's supposed to be a lot cheaper and faster, and it does quite well, but I've actually found it to be not as performant as GPT-4, for example,
uh, in many of the tougher LLM tests that we've given it, it does
Deep: well from
Bill: an efficacy standpoint, the number of times it messes up. We're talking about why you would go to a fine-tuned model of your own. You know, sometimes you can even consider that almost like a new base model. It doesn't return the format of the thing that you want.
Uh, it's not a JSON-parsable object. That's a real problem. So you've got to call the LLM again and try to fix the object after you get it back. There's a lot of issues involving the nuts and bolts that you have to deal with. So GPT-4 Turbo, for example: I will just tell you, yesterday, and I can't speak about our client, but we tried to use it for two different things for a client.
The first thing was kind of a simple thing, and it did great there. So you're like, okay, cool. It's faster, it does a great job in terms of its efficacy relative to GPT-4, and it's cheaper. Now, the more complex problem: total bomb. It failed probably 50 to 60 percent of the time. There you have to go to GPT-4.
Now, that's kind of a good lead-in to a conversation about why you'd do your own fine-tuned models, and it's a perfect example. What you want to do with these fine-tuned models is capture the essence of a much smarter model like GPT-4, but at a highly reduced cost and a much improved speed.
So that's your huge impetus to do so. Another thing that I think is super important to talk about with fine-tuned modeling is that, ultimately, these LLMs are black boxes, and you cannot really control them. You can try to persuade them by doing great prompt engineering, being very, very specific about what you want them to do and what you want them not to do.
But at the end of the day, you do not have a deterministic output and you cannot control them.
Deep: Yeah, I think that speaks to another category. Um, so are you talking about the hallucination problem, or...
Bill: No, no, not even hallucinations. It could be that the thing that you want is not in the desired format.
It could be that you've asked for specific things and you don't get those back. Those are actually considered failures, right? Maybe it's summaries or something of a paragraph that you wanted and you're not getting back. So, to corral that as much as you can, you do go to a fine-tuned model.
Deep: Yeah, so there's a few things that I'm hearing, and I'll throw in some more. One of them is just the expressiveness that you get in the training-slash-configuration world when you have more examples.
Right. And that's one motivation for going to a fine-tuned model: you're not relying on a few shots to guide the output. You can have a few thousand, or even more, to really capture a lot of the nuance and subtlety. And that's something that doesn't always show up when you have somebody messing around in GPT-4 and coming up with a concept for the feature.
Right? Like, they come up with that. Maybe they tried it on 1 or 2 or 5 or 10, maybe even 15 or 20 examples, but they certainly didn't try it out systematically across hundreds of thousands of scenarios, where they cluster the response categories and look at how well it's performing in each of those categories.
That's, I think, another category where the fine-tuning really excels. And then of course the cost, which you touched on, right? The dollar cost difference can be huge, and the latency cost difference can be significant.
Have data? Have a hypothesis on some high-value insights that, if extracted automatically, could transform your business? Not sure how to proceed? Bounce your ideas off one of our data scientists with a free consult. Reach out at xyonix.com. You'll talk to an expert, not a salesperson.
Bill: You're able to cut your costs by 10, 20. The speed is incredibly fast. Honestly, I love fine-tuned models.
Andy: Yeah, Bill, when you give that scale of improvement, are you talking about a fine-tuned model on GPT-3.5, or...?
Bill: Yeah, that's a good question. So it's always relative to some base model, and here the base model is an instructional model,
like 3.5 Turbo, for example. When you go about the business of fine-tuning, you give it some sort of base model. But let me take a little bit of a step back, Andy, because you asked the question. What you usually do is go to the best model that you can, which is, say, GPT-4, and ask it to do some pretty complicated stuff. You create a bunch of responses, which we can call training samples, but then you also scrutinize those training samples.
So you say, were my instructions followed to the letter? When I said to put things in first-person voice, did it actually do that? And if you find one of those samples that violated that, you kick it out. So you go through this semi-arduous process of going to the best model you can and having it run through a bunch of samples covering a wide domain.
And then you just get rid of all the stuff that's not pristine. You take that pristine stuff and use it to train your model. What you do then is select any one of the base models. You can select 3.5 Turbo, for example, and base it off of that. And soon there'll be a GPT-4 Turbo model available for people to train on as well.
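The "kick out anything that isn't pristine" step Bill describes can be sketched as a filter over candidate samples. Everything here is an illustrative assumption: the sample layout, the two checks (`uses_first_person`, `within_length`), and the word limit are hypothetical, and real checks would likely be richer, possibly including human review.

```python
import re
from typing import Callable, Dict, List

Sample = Dict[str, str]            # {"prompt": ..., "completion": ...}
Check = Callable[[Sample], bool]   # returns True when the sample passes


def uses_first_person(sample: Sample) -> bool:
    """Crude heuristic: the completion contains a first-person pronoun."""
    pattern = r"\b(i|my|we|our)\b"
    return re.search(pattern, sample["completion"], flags=re.IGNORECASE) is not None


def within_length(sample: Sample, max_words: int = 60) -> bool:
    """The completion stays under a (hypothetical) word budget."""
    return len(sample["completion"].split()) <= max_words


def keep_pristine(samples: List[Sample], checks: List[Check]) -> List[Sample]:
    """Keep only samples that pass every check; anything else is kicked out."""
    return [s for s in samples if all(check(s) for check in checks)]


# Made-up candidates standing in for responses generated by the strongest model.
candidates = [
    {"prompt": "Summarize the visit.", "completion": "I reviewed your chart and I recommend rest."},
    {"prompt": "Summarize the visit.", "completion": "The patient should rest."},
]
pristine = keep_pristine(candidates, [uses_first_person, within_length])
print(len(pristine), "of", len(candidates), "samples kept")
```

The surviving samples would then be written out in whatever format the chosen fine-tuning service expects.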
So the cost savings is relative to having to go to GPT-4. When you talk about a cost savings relative to GPT-3.5 Turbo, the problem is that that's only when 3.5 Turbo actually succeeds. There's a lot of failure there. So it's hard to compare costs to something that fails 50 percent of the time on the test that you want it to do.
So I'm usually talking about relative to GPT-4, say.
Andy: Yeah, correct me if I'm wrong, but I think we've seen examples where, even with fine-tuning, GPT-4 has performed better, and other cases where the lower model, like 3.5 with the fine-tuning, has performed better than the higher model, GPT-4, without the fine-tuning, depending on the domain and what the fine-tuning has been, right?
So basically you have to test it. Obviously you still get the cost advantage, but your performance might not be adequate in the lower model, even with the fine-tuning.
Bill: You know, if we roll back the clock to when we were doing legacy GPT-3 models, how would we create the training data for that?
We would go to the humans, right?
Andy: Absolutely.
Bill: We would go to our annotators. We would say, here's your prompt; what is the best completion? Here's a prompt; what is the best completion? Right? And the thing is, is that cost effective? At some point, and that point was exactly when GPT-4 came out,
I personally found it to be as good as, if not better than, a lot of humans. So, to be absolutely clear, you typically would go to humans to get this, right?
Deep: For most tasks, it was definitely better than the humans that we had. Yeah, we found that. But what we also found was that you still need a human in the loop to vet that into some training data from which you can assess efficacy.
Absolutely. Yeah.
Andy: I mean, the annotators we're talking about are typically grad students. They're smart, they're motivated, they know a lot, but they don't necessarily know a ton about the specific domain that we're working with. And sometimes, when we're using the higher-level model to generate this, the human in the loop we end up using is the data scientist, to review the GPT-4 output and make sure it's all in line.
Deep: A lot of times the feedback we're looking for is more editorial in nature, right? It's not so much that the model output is wrong. It's more that it's speaking in the wrong person in this context, or there's something about it that's editorially insufficient. Like, we've had a number of applications where
we want the response to show a lot of empathy. So we're following these paradigms of therapists and advisors and others that have a really specific way of talking. Um, and so we want to score it against these kinds of rubrics around, like, empathy and encouragement, and it's not enough to just leave it in the prompt and hope it does it.
If we're going to get to the point where we're trying to run statistically meaningful models in, like, A/B scenarios to figure out how we're going to achieve scale and cost efficiencies, then we have to have those metrics defined, which in and of itself... we've had a couple of episodes on that, actually.
But you know, we all sort of understand how challenging a problem that is, and the state of the art is really not very far along there. I don't know, Bill, do you wanna maybe characterize what historically happened there? Maybe what the literature looked like the last time you took a look, and the kinds of stuff that we're doing to assess efficacy?
Bill: Well, I don't know if there's anything historic to say. I think it's such a problem-by-problem, you know, specific adventure. I think you were just talking about how the overall content might be good coming back. Like, you get it. But maybe, for example, you're dealing with users that you're only going to be communicating with via text.
So you don't want to return this huge response, which is super great and covers all the territory; you really want it boiled down to something a bit more brief. So there might be a brevity dimension that you're interested in. Maybe you're talking to a student, and you're more interested in a more banter-ish conversation, but if you're going to be talking to a doctor, you want it to be a lot more professional, say, so you might have a professionalism dimension, you know?
So I think the way that we've approached this is super important, and it's not really talked about. I think it's really important for anybody who wants to create a chatbot for their company or whatever to ask: when I'm communicating with our customers, what do I care about?
You know, if I have a customer service representative talking to them, how do we train that customer service representative? Do we want them to be professional? Do we want them to be brief? Do we want them to be empathetic? There are all these different dimensions. So efficacy, in this sense, is: how do we take a response from a bot and measure these different attributes?
And we have done so at Xyonix to, I think, a pretty successful degree. What we do then is take the scores that are assigned to these different metrics and boil them down by forming some sort of weighted average, say, or something. We can talk more about...
Andy: That weighted average is
customized to a particular situation, right? If you have a customer service representative, accuracy is important, but friendliness and customer satisfaction may actually be above that. Whereas in a different scenario, that's not a big deal: you're just trying to get to the information, and brevity is way more important.
So you have to decide what the mix should be for that weighted average across all those dimensions to score something globally.
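The weighted average Bill and Andy describe can be written as a short function: with per-dimension scores s_d and weights w_d, the overall score is the sum of w_d * s_d divided by the sum of w_d. The sketch below is illustrative only. The dimension names, the weight profiles, and the assumption that each score lies in [0, 1] are all hypothetical, not values from the episode.

```python
from typing import Dict


def overall_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each assumed to be in [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[dim] * scores[dim] for dim in weights) / total_weight


# Hypothetical weight profiles; in practice these might live in a per-bot
# configuration file.
CUSTOMER_SERVICE = {"accuracy": 2.0, "empathy": 3.0, "brevity": 1.0}
INFO_LOOKUP = {"accuracy": 3.0, "empathy": 0.5, "brevity": 2.5}

# Made-up scores for one bot response.
response_scores = {"accuracy": 0.9, "empathy": 0.4, "brevity": 0.8}

print(round(overall_score(response_scores, CUSTOMER_SERVICE), 3))
print(round(overall_score(response_scores, INFO_LOOKUP), 3))
```

The same response scores differently under the two profiles, which is the point Andy makes: the mix of weights encodes what a given deployment cares about.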
Bill: Yeah. Great point, Andy. And in fact, knowing that every single bot's responses are assessed differently, we often keep those weights
associated with these different dimensions in a configuration file.
Deep: I want to just jump in for the sake of our audience, because there may be some folks that aren't following the conversation. So I want to set the context for what's happening here. The idea is the LLM is generating some response to a bunch of text.
So you might have a dialogue, and then the bot has to figure out what it's going to say next. Or you might have a bunch of dialogue and you're trying to summarize it. So there exists a response from the bot. And what we're talking about is a measure of difference between that response from the bot and what we're defining to be an ideal or perfect response, and how we characterize those differences across a large set of candidate responses.
And of course our more machine-learning-inclined folks are certainly used to more traditional measures in classical ML, like precision and recall, but this is about getting down to a singular number that basically says how far off from an ideal answer you are. So the first challenge is, how do you even get the ideal answer? Going back to the
picks-and-shovels context that we set at the beginning of this episode, I'm going to try to start calling out some of these tools that I think are going to emerge over the next year. Lots of companies are going to emerge that are going to facilitate
the gathering and assessment of just the ideal responses. The same way we have tools like Label Studio and others in the machine learning world, they're probably going to evolve towards that in this arena. And we're probably going to have stuff that's easy to embed
in product, in a mobile app, in a web app and all that, so you can gather all that training data. Once you have that kind of stuff instrumented, then there's the problem, of course, of assessing it. We're talking about getting down to a singular score, but I think there's also a ton of value in characterizing, you know, V3 versus V2: what is it doing that's different?
Oh, well, on average, its brevity scores are a little bit worse, like it's talking longer than it was before, but its quality of response has gone up. So you start to get characterization. So a team of data scientists prior to release can feel confident that, yeah, we might get an exec in here freaking out about their favorite query gone bad, but we can stand credibly on a mountain of statistics to say why we know this to be better.
Well said.
Andy: It's so important in the context of a lot of the things that we were talking about earlier in the call. Like, what if there's a new model? Do we just put the blindfold on and assume that it's better? No, we evaluate it against these cases that we've set forth and score it. Or what if OpenAI goes down and we need to suddenly switch to another platform?
How do we go into that and not be blind, and have an idea of what's going on? And is the fine-tuned model performing better against these metrics? It really supports a lot of different decisions that we might need to make in this fast-moving landscape.
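The version-to-version comparison Deep and Andy describe, scoring a new model on the same evaluation cases and seeing which dimensions moved, might look like the sketch below. The dimension names and all the numbers are made-up illustrations, and `mean_by_dimension` and `compare` are hypothetical helper names; a real comparison would also want many more cases and some check of statistical significance.

```python
from statistics import mean
from typing import Dict, List

Scores = List[Dict[str, float]]  # one dict of dimension scores per eval case


def mean_by_dimension(results: Scores) -> Dict[str, float]:
    """Average each dimension over all eval cases."""
    dims = results[0].keys()
    return {d: mean(r[d] for r in results) for d in dims}


def compare(old: Scores, new: Scores) -> Dict[str, float]:
    """Per-dimension change in mean score (new minus old) on the same eval cases."""
    old_means, new_means = mean_by_dimension(old), mean_by_dimension(new)
    return {d: new_means[d] - old_means[d] for d in old_means}


# Illustrative scores for two model versions on the same two eval cases.
v2 = [{"quality": 0.70, "brevity": 0.90}, {"quality": 0.60, "brevity": 0.80}]
v3 = [{"quality": 0.85, "brevity": 0.70}, {"quality": 0.75, "brevity": 0.75}]

for dim, delta in compare(v2, v3).items():
    print(f"{dim}: {delta:+.3f}")
```

A per-dimension breakdown like this is what lets a team say "quality went up, brevity got a bit worse" rather than arguing over one favorite query.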
Bill: Great comments, Andy. It goes back to the bleeding-edge thing that we were talking about before. Everybody wants to know how the latest and greatest model is doing relative to what came before it.
And, uh, it's such an important conversation, because this is a standard by which you can assess not only within a certain company's models, say OpenAI's, but also when you switch to Google's Bard, or AI21's, or whatever other company's model is out there. In fact, by the way, we haven't spoken about having your own in-house supported models, like Llama 2.
That might be your backup of a backup of a backup. You know, how does it compare to those big boys, just in case the plug gets pulled on any one of the bigger models?
Deep: You're listening to Your AI Injection, brought to you by xyonix.com. That's X-Y-O-N-I-X dot com. Check out our website for more content, or if you need help injecting AI into your organization.
I want to talk about an area that we haven't covered yet, which is, gets a lot of attention, but, and I think rightfully so, which is this kind of question of hallucinations. Like, you know, these LLMs, they answer, They, they, they're generating text and if they're not necessarily precisely guided, they'll make stuff up, you know, could be dates or URLs or, you know, that you're trying to send people to or Jason code that's gibberish or could just be odd things that it says.
And I think I'm going to even go so far as to say there's a typical project trajectory that we see, where we start off and things are going all swimmingly, and stuff maybe makes it into production and, you know, customers are using the thing and the client's quite happy. And inevitably the bot says something bizarre.
Right? Like, you know, we had a case where there's folks that are talking to a bot, and the bot's like, hey, you know, imagine like a hotel room, like, hey, I'm going to send somebody out immediately. But, you know, of course, the bot's just in play-pretend land. It doesn't necessarily have the ability to actually send anyone out, you know. And so there's these different scenarios where the hallucinations arise, and I want to talk a little bit about, you know, there's some techniques on the prompting side that are sort of, um, increasingly well known, you know, that folks are starting to publish.
Well, there was like a pretty awesome paper, maybe a month ago, that came out on this idea, and this is sort of the idea of, like, telling the bot to assess its own answer, in essence. Like, you're giving it more reasons to think about why it's saying what it is, which sort of has this beautiful effect of getting it to, like, think things through a little better.
It's like telling the toddler, you know, before they, I don't know, punch somebody in the face: before you go and take an action, imagine what's going to happen, you know. And then there's this whole other methodology around this concept of, like, auditing, like auditing what you're about to say, and then measuring that against a separate criterion, kind of all in the moment.
So why don't we talk a little bit about hallucinations? To what extent are they even manageable? And when we do manage them, what's the arena of potential activities, and how does that manifest itself in costs, both in dollars and latency?
Bill: So certainly one way to rein in and corral hallucinations is to develop that pristine training data that we spoke of earlier and create a custom model. That doesn't necessarily mean you're out of the woods there, but, you know, these custom models definitely tend to steer you in the right direction. You know, when it's been given thousands of pristine examples, it's going to be pretty hard for the model to sort of stray from that.
It doesn't say that it can't, but it's harder for it to do so. So that's definitely one way of, you know, sort of corralling that kind of behavior. I think, to me, there's this thing that's going on too, and you guys touched on it briefly, and that is the models, they live in this virtual world, right? But we're developing tools right now for them to interface, um, and go outside of the LLM, so you can do things like, hey, let's go to the Internet and get, you know, the latest scores. Hey, let's go get the latest weather. You can imagine a function that is actually written in whatever language that can interface with a real system that does real things, you know, in the real world, you know, like, uh, go mow my yard right now.
And we have a bot that does that. But my point is that right now, with the LLMs, you can facilitate these so-called functions that allow them to interface beyond themselves with the real world. And that becomes this kind of scary, uh, situation, because on the user's side, if it's a person that needs to figure out when, for example, the job fair is going to be happening locally, and you get a response from the bot, you have so much confidence, maybe, in the bot that you think that job fair day is going to be Saturday the 30th. But it could be that the bot is just totally hallucinating, making that up, and it didn't go out and use one of these tools and go out onto the Internet and sort of double-check that, right?
And, uh, sure enough, it turns out it was, uh, not this Saturday but next, or it already happened, so you missed it. That's not a threatening thing, but you can think of, you know, other examples that could be a little bit more threatening to be aware of. So I don't know, Andy, if you want to sort of take off from there and talk about this interface between data retrieval versus just text generation or creativity, and how that world kind of plays out.
Andy: Yeah, it's a super interesting interface. I mean, the function kind of model allowing the, the LLM to take action certainly is, uh, a really powerful jumping off point to the AI being able to work on our behalf and actually get things done rather than only telling us what to do. Um, but, uh, there's a lot of ways that one could imagine problems with that.
I think one of the examples I was looking at imagined, or, you know, illustrated, a misinterpretation where the bot was going to order 4,000 pizzas and have them delivered to your house, you know, and this is a good place for an auditor, right? But, uh, you know, I think one of the things we're seeing is that using the intelligence of this system from different angles to audit the response that has then been brought back, to look at the user's input and see what the intent is, and kind of mesh that with the answer that another prompting of the AI comes back with, starts to become, uh, potentially really important here in terms of safeguards and making sure that we get the right things out of these interfaces.
Um, but it's certainly new territory that we're still figuring out.
Deep: Yeah, I mean, like, one of the things the literature really supports is forcing the model to first answer, for example: hey, this thing that we're asking you to generate or respond to, is it even possible for you to answer it? Just give me a yes or no.
And that sort of, um, kind of forces a thought process. And then, okay, give us your answer, but you must explain why. Right? So now it's like a thoughtful explanation, and for some reason, this also forces a significant reduction in hallucinations. And then how you...
Andy: those are those are so important because we all have that friend who will never say, I don't know, no matter what I'll tell you something and they're trained on text completion.
So that. That they're
Deep: trained to do that, they're
Andy: going to give you an answer, but, you know, you should first ask them, are you going to be able to give me a real answer on this? It's a very smart thing to do.
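The two-step prompting Deep and Andy describe here (first ask whether the question is answerable, then ask for an answer with an explanation) can be sketched roughly as follows. This is a minimal illustration, not the speakers' implementation: `ask_llm` is a hypothetical callable standing in for whatever model client you use, and the prompt wording is an assumption.

```python
from typing import Callable


def answer_with_feasibility_check(question: str, ask_llm: Callable[[str], str]) -> str:
    """Ask the model whether it can answer before asking it to answer.

    `ask_llm` is a hypothetical stand-in: it takes a prompt string and
    returns the model's text reply.
    """
    # Step 1: force a yes/no feasibility judgement.
    gate = ask_llm(
        "Can you answer the following question reliably? "
        "Reply with only YES or NO.\n\nQuestion: " + question
    )
    if not gate.strip().upper().startswith("YES"):
        # Give the model (and the user) an explicit "I don't know" path.
        return "I don't know."

    # Step 2: ask for the answer together with the reasoning behind it.
    return ask_llm(
        "Answer the question and explain why your answer is correct."
        "\n\nQuestion: " + question
    )
```

Note that this makes two model calls per question, which is the extra cost and latency discussed later in the conversation.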
Deep: It's also something that like humans aren't really naturally good at. Like the, I mean, there's other issues that are at play with humans, right?
Like that's the, the thing, the story I always tell is like, if you're anywhere in, in like rural, uh, in rural India in general, but definitely in the North. And you ask some random person on the corner for directions. Now, if this person knows where to send you, they will send you there. But if this person has no idea where to send you, they will still point in a direction and send you there.
And the way you tell the difference is the swagger and confidence with which they sent you on the way. But they will never say, I don't know. And it's because, you know, in North India, like in these rural areas, not knowing is like a loss of face, and so they'll just say something, and so you have to be able to tell. And the models are like that. That human nature is sort of manifested in all the data these things are trained on.
So that's one thing. And then having to explain stuff. Like, if you have to decide, do I buy X or Y? If you never ask why you would choose X versus Y, or, I should've used different letters, why you would choose, you know, B versus A, then you don't necessarily have as good an end decision.
But if you force that thought process... So then the next part of it with the LLMs, though, is like, hey, how do you achieve that? Because a lot of folks are achieving it in subsequent calls. But if you swirl it all together in one call, then, you know, you're sort of relying on, like, the higher-stack, GPT-4-level, you know, reasoning muscle.
Because as you go down in the models, you know, the level of reasoning can't be quite as multi-layered and complicated. And, of course, the time to respond also goes up when you do that. And then, finally, once you get your answer... so there's, like, prompting and back-and-forth tuning till you get an answer.
And then we all know that if you ask the LLM repeatedly the exact same thing with the exact same inputs, it's also going to have differences. So then, you know, a lot of times we'll look at an auditing mechanism where you've got an auditing of specific criteria, like, don't ever promise an action that you can't actually achieve, which includes this list of them, right?
That could be in the, um, instructions for the LLM, but it can also be in the auditor instructions. And what we find is that despite being in the instructions to the LLM, it will violate those things at some frequency level, and the auditor will catch them in many cases. And then the question is, what do you do?
Well, you might just try for another answer. Sometimes you get something different. Sometimes you get basically the same thing. And so all of that is part of the process, but all of it costs more money to do because you're now hitting it again. You know, you've got more latency involved. So there's also, like, reality imposing, you know: well, okay, maybe I'll give you a few audits every once in a while, you know, or maybe I will let you, you know, do this multi-layer reasoning thing every once in a while.
But I can't let you do that all the time because it costs too much.
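The audit-and-retry loop Deep describes, with a cap on attempts because every retry costs money and latency, might look something like the sketch below. All names are hypothetical: `generate` and `audit` stand in for calls to a responder model and an auditor model, and `max_attempts` is the budget knob. Feeding rejected answers back into the prompt follows the approach mentioned later in the episode, and as noted there, it helps but is not a guarantee.

```python
from typing import Callable, List, Optional


def generate_with_audit(
    prompt: str,
    generate: Callable[[str], str],   # hypothetical: responder model call
    audit: Callable[[str], bool],     # hypothetical: True if the answer passes
    max_attempts: int = 3,            # cost/latency budget (assumed value)
) -> Optional[str]:
    """Return the first answer that passes the audit, or None if the budget runs out."""
    rejected: List[str] = []
    for _ in range(max_attempts):
        full_prompt = prompt
        if rejected:
            # Tell the model which answers already failed so it tries something else.
            full_prompt += (
                "\n\nThese earlier answers failed an audit; do not repeat them:\n"
                + "\n".join(f"- {r}" for r in rejected)
            )
        candidate = generate(full_prompt)
        if audit(candidate):
            return candidate
        rejected.append(candidate)
    # Budget exhausted: the caller decides on a fallback (canned reply, human handoff).
    return None
```

Returning `None` rather than the last failed answer keeps the decision about what to show the user outside the loop.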
Bill: I love these comments by you. They spark a lot of questions and debate. I love the front-loading of the psychology in the prompts at the beginning, because I read a paper not long ago where, you know, they said initially everybody was like, you have to define what the bot is.
And so you say you're an expert in ABC, right? You're the absolute, you know, you're a god. You're the bee's knees. Yeah, you're the bee's knees, right? But they said, you know, what you could do is take a little bit more of a less deified approach and say, okay, you know, you're pretty good in this, but please... They literally would do things like, please do a good job here because, like, my job depends on it, you know.
Deep: Appealing...
Andy: ...to the empathy.
Bill: I thought that was just brilliant. I don't recall if it was psychologists that did this study, but how great is it to take the wealth of human communication, which is fraught with psychology, right? How we communicate changes so much with the psychology of behavior, our visual inputs, and so forth.
But, you know, it ultimately results in what we say and do. And it's very interesting to kind of plead with the bot, like, hey, you may not be the best, but can you just please get this right for me? You're pretty darn...
Deep: Above average. Slightly above average.
Andy: Well, you know, Bill, where I thought you were going with that is, hey, bot, you're decent at this task.
Uh, but sometimes you don't know. Yeah. To kind of encourage it not to hallucinate, maybe.
Deep: That's definitely a thing, right? That forced feasibility assessment.
Bill: Well, I wanted to comment. I wanted to ask you guys a question. How often has this happened? You know, at some point I have a lot of, sort of, confidence in what the bot can do in terms of, like, code production, right?
And basically I've become quite lazy if I'm being honest. I have it just rip out some code snippets that I, you know, I know I can do but I just don't want to take the time to do and it's a real efficiency thing for me. But then there are some times when I ask it to do something like that and I know it's off.
And I know that I'm right, and I'll have this conversation with the bot about, well, have you thought about this? And typically they bend to what you say. They say, oh, you're right.
Deep: They do bend, it's funny.
Bill: I've even done the thing where I've lied to it.
I've known it's right, and I say, hey, that's wrong. And they say, oh yeah, it's wrong. No, no, no, I was just kidding. You were right. And they say, oh yeah. And then it's like, you know, you wonder what state the bot is in. But my point is that it's a level of trust, right, that you have with these guys.
And, uh, to try to corral that, I think, you know, we're getting back to the main point, this idea that you have these auditors that go and oversee the process. Yeah. And we at Xyonix, of course, we do this process quite a bit. We have auditor bots that, you know, observe the output of other bots, and they critique them. And in terms of correcting them, this is very interesting.
I don't know if you guys have noticed lately, if you go to GPT-4 Turbo with OpenAI's UI, you can ask it to do something, and maybe you ask it for something in a very specific format. And what it'll do is it'll create an output, and you can see it's testing it, but sometimes it'll fail, and then you'll see that it fails and that it tries again.
Have you guys seen this?
Deep: No. Is it doing an API call? This is in the UI?
Bill: In the UI, and you ask it...
Deep: Right. I've definitely seen it running off to Bing and pulling all kinds of external info.
Bill: There is, for lack of a better term, an auditing process that's going on there.
There's something else that's looking at the output and saying, you know what? No, try again. This is the part you were just referring to, Deep: what is the try again? We don't know. We don't know what they're actually doing. We can imagine maybe some scenarios where they're going to, you know, a smarter model that they know is maybe a bit more in-depth and more intelligent, or who knows, whatever tricks they have. Maybe they're psychologically begging it: please do it right this time.
Deep: You know, I mean, I've done stuff like, you know, the subsequent prompt when the audit fails: okay, you've already answered this once and your audit failed.
And this was the answer you had last time, so don't give me that answer again. And that, yeah, that can help, but it's not a guarantee.
Need help with computer vision, natural language processing, automated content creation, conversational understanding, time series forecasting, customer behavior analytics? Reach out to us at xyonix.com. That's X-Y-O-N-I-X dot com. Maybe we can help.
So I want to go back to the picks and shovels stuff. I want to go hit some of these areas. We've just outlined a long litany of problems, and we actually didn't even cover some of the really big ones. Like, we did miss something.
So, but maybe we'll touch on that. Maybe we'll get them in a different episode. I don't know. But given all these problems, like, I don't know, in our own heads, let's sort of stack-rank a little bit, and let's sort of talk about, like, what are some of the external tools, maybe not built by, you know, the big tech companies, but maybe some arena for startups or smaller companies, you know, that are sort of more localized services.
Like, what are some things that we think, over the course of 2024, are going to start to make themselves known and start to make it out there?
Bill: Gosh, I would speak to security. I don't know how you guys think. I think one of the major kind of issues that I have with any of these companies is centered around the security of your data. Ultimately, if you do a fine-tuned model, for example, you're uploading, you know, data that's potentially related to your customers' responses and so forth. I mean, I just think there has to be something along the lines of having sort of a separation of these models from the customer data.
You know, you think about health care, for example, if it's ever used in that area. Now you're talking about PII and so forth.
Deep: So are you talking about the, like, run your own, um, containers with Llama 2 models in your own data center versus hitting some external service? Or are you talking about something inside of that world where you're running your own LLM?
Bill: Well, I mean, to run your own LLMs is quite expensive and daunting. So are there companies out there whose picks and shovels are related to, well, we'll provide the security for you? We'll essentially be a host, uh, to these models.
Deep: Like, we'll run it in a HIPAA-compliant environment tailored for medical.
Bill: I can imagine that to be a very acute...
Andy: I think it's all of the above. Yeah. I think it's all of the above, because you're going to have organizations at the far end who won't move their data anywhere onto machines that they don't own. And so they need a solution where they can buy some hardware that runs LLMs on it, or they have, you know, prebuilt software that runs LLMs that have already been trained in their environment.
And they're going to have lots of integration headaches and scalability headaches and everything else that goes along with, uh, that closed environment. And probably you also have some, you know, specialty providers who provide these kinds of services, but with a higher level of security or protocols or certifications than, you know, someone like, uh, Amazon or OpenAI is willing to go through.
Um, so I think we'll see it in a bunch of different parts of the equation, but I think one of the most interesting parts is that on-premise, well, not on-premise, but running your own LLM in whatever form that takes.
Deep: That's in your own VPC on a cloud or in your hardware.
Andy: Exactly. It could be in your environment or whatever.
But I think as these things get better, and we have a better idea of what they're capable of, we should see that lower-complexity tasks are things that we can do at a low cost in our own environment. Um, or maybe using an AWS service that's low cost or something like that.
Deep: Yeah, like, we've already seen that, right?
Like, straightforward summarization tasks, which, you know, by the way, just, you know, two years ago were not straightforward, they were pretty hard. Right, yeah. But in our new world, where these models are so stinking powerful, summarizing a bunch of information is now a perfectly feasible thing to train up on your own instance of Llama 2, a smaller model with fine-tuning.
Like, we've done it. You know, it's super straightforward. Um, but it's also a type of reasoning that's actually quite sophisticated, you know. Like, plenty of college students can attest to the difficulty of summarizing what they know. Um, but because it's sort of, I don't know, well bounded, it's a certain type of reasoning.
It's not as, um, you know... we get pretty good at that performance. So I think I agree with that. There's one area we didn't really talk a lot about, which I think already has tons of picks-and-shovels companies happening in it. And I guess it's driven, you know, in large part by the hallucination problem, but there's a whole bunch of folks using LLMs, but only to tailor a response from an a priori pruned set of candidate responses.
So there's companies that make these, you know, like, the sparse vector databases so that you can find the closest matches in your knowledge base, which has totally curated responses that you've stamped and approved for particular questions. And now they're aligning those rapidly via, you know, a quick search in that larger space to get, like, maybe 10 of them or 15 of them, and then going to the LLM and saying, hey, um, answer this question, but only from this reference material.
So, like, the question-answering arena has a lot of tooling there already. You see the big guns, like Amazon and other places, coming up with a lot of tools like this. You have everything from Elasticsearch to, like, other, uh, you know, kind of traditional search vendors introducing sparse vector databases where you can, you know, build these kinds of FAQ searching systems quite quickly. So I just wanted to throw that out there.
Andy: Like, in that context, Deep, maybe you've used the LLM to do the heavy lifting of creating those questions and answers. And then you've done the curation, and then, uh, at runtime, you're not doing anything expensive, really. You're just using that FAQ database to, you know, get pre-curated answers back.
Deep: Uh, yeah. So you bring up an interesting point, which is, like, how do you populate the knowledge bases in the first place? And I think, um, that is also another arena where we're going to see a lot of picks and shovels, right? And we already have the traditional approach.
So, just crawling content, getting reference material, um, that's like one way to do it. Uh, and then on top of that, we've seen increases in efficacy when we take that reference material, that natural language content, and then leverage the LLM to transform that into question-answer pairs, if you will, so that there's answers for particular questions. But I can also imagine a lot of tooling to just simply help a limited set of human curators vet those knowledge bases, because they can get quite large and exhaustive. Uh, like, we've seen it with medical applications where, you know, you really cannot say anything outside of the vetted responses. And so now you have to be able to try to guarantee that. And so you have to make sure that a human has looked at every single thing in there in some fashion, and maybe have an audit trail of what you did to vet that curation.
That feels to me like a sort of smart workflow type of application where you've got curation combined with some, you know, some smarts, combined with, like, cleansing the knowledge base of duplicates and, you know, potentially clashing answers, all kind of in a system.
But, yes, once you've got all that, now, you know, you can, like, pop all those question-answer pairs or the raw materials into a vector database, and then you can, like, hit it and take these 10. And now you go to the LLM and you say, hey, look, um, answer from only these questions, but tailor your answer with this wording, or, you know, like, say it in this particular way, you know, include some empathy, whatever.
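The retrieve-then-constrain flow described above (find the closest curated question-answer pairs, then tell the LLM to answer only from them) can be sketched in a few lines. This is a toy illustration under stated assumptions: a bag-of-words cosine similarity stands in for a real vector database, and the function names, prompt wording, and `style` parameter are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

QAPair = Tuple[str, str]  # (curated question, vetted answer)


def _vector(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, knowledge_base: List[QAPair], k: int = 10) -> List[QAPair]:
    """Return the k curated pairs whose questions are most similar to the user's."""
    q = _vector(question)
    ranked = sorted(knowledge_base, key=lambda qa: _cosine(q, _vector(qa[0])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, candidates: List[QAPair], style: str = "warm and empathetic") -> str:
    """Build a prompt that restricts the model to the vetted reference answers."""
    refs = "\n".join(f"{i}. Q: {q}\n   A: {a}" for i, (q, a) in enumerate(candidates, 1))
    return (
        "Answer the user's question using ONLY the reference answers below. "
        "If none of them applies, say you don't know.\n"
        f"Tone: {style}.\n\nReference material:\n{refs}\n\nUser question: {question}"
    )
```

The string from `build_prompt` would then be sent to whichever model you use, and its reply could still be passed through an auditor before it reaches the user.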
Bill: Well, I love how you corral it heavily. Uh, here's the 10 answers we want you to formulate your response around, but you can be slightly creative, right? You can sort of maybe change the wording slightly. But I think you should always have an auditor bot on top of that to assess whether you're doing anything that violates whatever, you know, standards you've come up with.
But I could see a business forming around that, as you say.
Deep: Yeah. And I think, like, I'm just going to throw out one last category where we have picks and shovels. I think this whole testing thing is just going to be an increasingly large problem, because people want to put their editorial stamp on responses and they need to know that it's there.
Folks need to, like, know that the hallucinations are not present. Uh, and they need to be able to maybe not get it 100 percent of the time, but, you know, understand the statistical nature at which it does respond, and understand that maybe broken out by categories. Like, okay, well, when I'm talking about, you know, something that's maybe, um, like medical information going to a professional, I screw up X percent of the time.
But when I'm talking about something more benign, like what the operating hours of the hospital are, or, I don't know, of like a follow-up clinic or something, maybe I screw up more. But knowing that and having visibility into all of that is, I think, going to become increasingly important.
Andy: Yeah, I would definitely agree.
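The per-category visibility Deep is asking for boils down to a simple tally over graded test cases. A minimal sketch, assuming each test result has already been labelled with a category and a pass/fail judgement (how that grading is done, by humans or an auditor model, is outside this example; the function name and data shape are illustrative):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of failed responses per category.

    Each result is (category, is_correct), e.g. ("medical", False).
    """
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        if not is_correct:
            errors[category] += 1
    return {category: errors[category] / totals[category] for category in totals}
```

Tracking these rates across model versions or vendors gives a like-for-like comparison when a provider ships a new model.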
Deep: Alright, I feel like we covered a lot of terrain. I don't know if we covered everything, but I feel like we, uh, dug in. Is there anything, uh, that you guys, like, kind of, like, a poignant, big-ticket thing that we missed, or a point you really want to emphasize?
Bill: I really do think that something that's very important is this ability for us to have more transparency and control as to when certain tools are being called and used. Right now, we have this functionality that's kind of an OpenAI, uh, thing where you can specify these different tools that you could potentially call. I think that's super powerful, but I'd like to have more control of when those specific tools are being called. And maybe that is the case with OpenAI, and I'm just not aware of it.
The ability to have more control over, uh, how we're reaching out beyond the LLM answer and interacting with the real world is going to be important for, um, the future, uh, for people to be able to trust that the information, uh, they're getting back is reasonable.
Deep: I would just add citations to your list. Like, Perplexity does that well, I think. Uh, yeah, they've got to get to the point where you're citing stuff, right?
Bill: Like, yeah, have some evidence.
Deep: Yeah, not like War and Peace tomes, but you tell me, like, where you got the responses from.
Bill: Right, right, exactly.
Andy: Yeah, and the other thing I was going to add, which is probably a whole other podcast worth of content, uh, is just that one of the dimensions that we didn't... you know, we talked about brevity or friendliness or, you know, efficacy of giving the right answer, but biases in these LLMs and, uh, the way that they're trained, whether they're, you know, representing people of different backgrounds well, or a whole variety of bias issues, are ones that should certainly be considered as you're evaluating these LLMs, and could certainly be different as you move, especially, from, you know, a platform that's been trained by one organization to one that's been trained by another.
And, uh, you know, how that comes into play in your specific concerns for your business or audience is something that each organization will have to figure out how to test for, but I feel like it's important for us to bring that up as well.
Deep: Yeah, absolutely. Like, the ethical ramifications of these large, uh, language and multimodal models, it's just massive.
We've got a number of episodes on just that. So to our audience listening, just, you know, jump on the Xyonix site on the podcast or articles front, and there's tags for ethical stuff. It's a big, big category, and it's going to be increasingly important. Um, and I think it's already important, but it's gonna become a lot more important.
Well, thanks so much for coming on. Um, I think this was a super fun discussion. That's all for this episode. I'm Deep Dhillon, your host, saying check back soon for your next AI injection. In the meantime, if you need help injecting AI into your business, reach out to us at xyonix.com. That's X-Y-O-N-I-X dot com. Whether it's text, audio, video, or other business data, we help all kinds of organizations like yours automatically find and operationalize transformative insights.