Your AI Injection

Does Your Chatbot Know You’re Suicidal? AI Empathy, Psychosis Detection, and Clinical Trials with Grin Lord of mpathic

Deep Season 5 Episode 8

Can your chatbot tell when you’re spiraling, or does it just play along?

In this episode of Your AI Injection, host Deep Dhillon sits down with Dr. Grin Lord, clinical psychologist and Founder/CEO of mpathic, to explore what happens when empathic AI meets real human vulnerability. Grin explains how a single 15-minute session of empathic listening in an ER setting led to massive drops in repeat drunk-driving incidents and billions in healthcare savings, and why that same science now underpins AI models that analyze clinician–patient conversations. 

The two dig into how mpathic trains and validates models that can flag suicide risk, psychosis, and protocol deviations in high-stakes clinical trials, all while keeping a human in the loop. They also unpack a looming dilemma: as foundation models become eerily “good listeners,” will people prefer bots to friends and therapists? What harms emerge when AI politely reflects back someone’s delusions? Tune in to hear how psychologists are reshaping AI safety, and why your favorite assistant may soon be better at detecting crises than your closest human relationships.

Learn more about Grin here: https://www.linkedin.com/in/grinlord/

and mpathic here: https://www.mpathic.ai

Check out our related episodes:

  1. Can Humanoid Robots Save Us from Loneliness? The Promise and Peril of Empathetic AI with Niv Sundaram of Machani Robotics
  2. Is AI the Missing Ingredient for Curbing Cravings? Emotional Eating Meets Machine Learning with Dr. Sera Lavelle
  3. Can AI Spot Diseases Your Doctor Might Miss? The Machine That Never Gets Tired with RJ Kedziora of Estenda Solutions

[Automated Transcript]

Grin: If a model is not encouraging someone to engage in the world and talk to another person, that can create harm, and that can be really bad, especially for someone that's isolated, suicidal, in psychosis, things like that.

So I feel like we can have a stronger stance, rather than going to the more extreme end of: it's not gonna interact with you if you didn't exercise today, like, get out and see outside. Like, I don't know, do we want it to be that directive about our lives? So I think there's a line that each model builder has either taken a stance on and said, these are the kind of ethics of our model and our bots. 

Deep: Hello, I'm Deep Dhillon, your host, and today on Your AI Injection we'll be exploring AI-powered empathy analysis in healthcare and life sciences settings with Grin Lord, founder and CEO of mpathic. Grin holds a PhD in clinical psychology from Antioch University in Seattle, and now leads mpathic in developing AI systems that analyze conversational data to improve clinical accuracy, enhance patient safety, and guide organizational decision making. Grin,

Thank you so much for coming on the show. 

Grin: Thank you. Glad to be here. 

Deep: Awesome. So why don't we get started? Like, give us a little bit of your origin story. You were a therapist; what problems were you seeing that led you to start mpathic?

Grin: I've been working in the mental health field since around like 2006, 2008. I actually started here in Seattle at Harborview Medical Center. and one of the first things that I did there was I was a research assistant on a study where we were giving folks 15 minutes of empathic listening after they came into the emergency department from a drunk driving accident.

And we compared that to treatment as usual. So they had the accident, they came into the emergency department, and typically what would happen is a nurse would come and give them a pamphlet and tell them, you need to go to AA, shame on you, you hurt yourself, you hurt someone else. And instead we brought not even psychologists or therapists, but just people trained in a very specific form of listening, which was non-judgmental and open-ended.

And we just listened to them during that time, asking questions about what happened. And the people that got the empathy had major drops in their drinking, and it led to a huge drop in returns to the emergency room for drinking-related

Deep: Wait, one 15-minute session with empathic listening had that kind of impact, like a measurable impact?

Grin: Yeah. It was a really huge outcome. I forget the exact statistic, but it was something like the group that got the empathy had statistically significant drops in drinking, down to zero drinking. And then the, um, return rate was something like less than 10 or 15% in that group. And they scaled it: after this research study was done, they were like, we wanna invest in this.

So Harborview instituted something called, I think it's like recovery psychology or support psychology, where this is now a service that happens in the hospital every single time someone comes in with a positive blood alcohol level. And then they scaled it nationally to all Level 1 trauma centers.

It saves about $2 billion a year doing that.

Deep: Good lord. Wow. What's the theory there? Is there something about the moment that they're in, where they just caused damage and now they're like open to reflection?

Grin: There's a lot of different theories about why certainly that helps.

But this particular style of listening with empathy has been found not just in this study, but in many studies, to lead to behavior change, which is basically when you stick to a couple of key ingredients in the conversation, like open-ended questions and reflections and summaries, and you get that person to essentially talk about reasons that they would want to change and listen to themselves without any outside influence, that typically will lead to some form of behavior change, even if they choose to stay the same and not change during the conversation.

The fact that you left it open and were just reflective without giving advice can sometimes shift people towards behavior change. We even have jokes about people lying and saying, oh yeah, I'm going to change. But just the fact that they are hearing themselves talk about what a world would be like if they didn't drink, or if they made this one change, is enough to sometimes prompt a shift. And there are specific things you listen for in those conversations, and you kind of reflect and reinforce them behaviorally. But I mean, 

Deep: The part of this that's sort of shocking to me is not that this empathic listening works, it's that a 15-minute conversation of anything, period, worked. Like, that seems strange to me.

So did you guys follow it up with, and here's a longer track of therapy or something? Or was this just an isolated 15 minutes

Grin: Isolated, yeah. And I'm sure some of the conversations, I mean, it was an average, I'm sure some were shorter and some were longer, but it was just one conversation.

Deep: Wow. Okay.

Grin: It replicated in other studies. So that was the study I was on. 

Deep: Yeah. 

Grin: I'm happy to send you the literature. They do it for drugs. Yeah, 

Deep: Please do. 'Cause it's the fact that you're catching it sort of in episode or something, like, something significant's happening and their mind's, like, open.

I imagine if you did this sometimes, like if you had the same 15 minute conversation but not right after they whacked into somebody, you probably would not have the same impact. I'm assuming, just based, based on all the therapists that are in my building that are constantly talking about how those people are coming in saying the same thing every week and driving them crazy with no change.

Grin: Well, it's interesting, 'cause therapy is a really hard skill to train in relative to other disciplines. So when you're a surgeon, you kind of know if you did it right or wrong, like there's a very clear outcome, and you have maybe 10 or 15 people watching you do surgery, not only during your training but when you're actually there. With therapists, you go to school, you listen to some PowerPoints, you maybe practice role play with a couple people, and then you go into a room where the door is shut and there's no one watching you do what you do.

And that is how we train therapists. And so that 

Deep: That doesn't, yeah, I mean, I like the surgeon analogy better. Like, we don't do that with surgeons. We do all kinds of things.

Grin: Could you imagine if you trained a surgeon on a PowerPoint and then shut them privately in the room with a patient and then you're like, I hope it goes okay?

That's kind of what we do with therapists, and they all have different levels of licensure and training. So training and expertise, we've done meta-analyses on this, and basically there's no relationship between training, years of experience, and expertise in therapy. So it's a very hard skill to train.

So what I would argue is not so much that, oh, we caught these people at this point in time that was really crucial, and because the conversation happened there. It was the combination of that and the fact that the people that were talking to them were highly skilled and regimented in what they were doing in their form of listening, and that that was being monitored and feedback was being given to them, like, you have to listen in this very particular way.

So there was a lot of performance-based feedback that goes into these types of research studies that doesn't happen in the real world, or at least it didn't until like my whole career 

Deep: Yeah. 

Grin: was based on this. So I kind of went into the performance-based feedback. 

Deep: Can you tell, tell us a little bit about this very specific listening style.

And it doesn't necessarily have to be the one in that context, but the one that you're sort of honed in on now. Like what, what is exactly that listening style and what are you actually doing as a therapist in that context? 

Grin: Yeah. So some of the first models we built at mpathic measured this type of style, and empathy was one of the things we called it.

But there's also a bunch of related constructs. There's almost like 200 behaviors that we look at. But the general like concept there is that there's something called common factors, which is regardless of how you were trained as a therapist or listener or coach or doctor, that there's certain things that happen in conversations that lead to greater trust and connection and empathy and rapport.

And those are the same. they don't vary according to approach. So you could be trained in cognitive behavioral therapy, psychoanalysis, motivational interviewing. You could, you could be a surgeon. You could be a salesperson. All of these things are the same that build trust. So I mentioned a couple of them, but things like asking open-ended questions instead of closed.

Having the other person talk more than you, reflecting and checking in constantly for understanding. There's something called affirmations that are very specific. They're not like praise or a top-down kind of evaluation, like, good for you. It's where you notice something in the other person.

Like, I noticed you worked really hard, Deep, to set up this podcast today. You spent a lot of time walking me through it, and it just shows how much you care about this. It's pointing out something about you, a quality in you. So these kinds of things, there's lots of them, and there's ratios that you pay attention to, like two times as many reflections as statements giving information or questions. All of this stuff engineers a trust relationship with the other person, regardless of other approaches.
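For readers who want to make the ratio idea concrete, here is a minimal sketch of how a check like that might be computed once each utterance in a transcript carries a behavior label. The label names and the 2:1 reflection-to-question target are illustrative assumptions, not mpathic's actual taxonomy or thresholds.

```python
from collections import Counter

# Hypothetical labeled utterances: (speaker, behavior_label) pairs.
# Label names are illustrative, not mpathic's real taxonomy.
labeled_utterances = [
    ("clinician", "open_question"),
    ("patient", "disclosure"),
    ("clinician", "reflection"),
    ("clinician", "reflection"),
    ("patient", "change_talk"),
    ("clinician", "affirmation"),
    ("clinician", "closed_question"),
    ("clinician", "information_giving"),
]

def behavior_ratios(utterances, speaker="clinician"):
    """Count one speaker's behaviors and compute a reflection-to-question ratio."""
    counts = Counter(label for spk, label in utterances if spk == speaker)
    questions = counts["open_question"] + counts["closed_question"]
    reflections = counts["reflection"]
    ratio = reflections / questions if questions else float("inf")
    return counts, ratio

counts, ratio = behavior_ratios(labeled_utterances)
print(counts)
print(f"reflection-to-question ratio: {ratio:.1f}")  # a ~2.0 target is an assumption
```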

Deep: how did we get to the point where we know this stuff works? like what were the sets of experiments, you know, roughly speaking, you know, abstracted that tell us that there's two reflections for one validation or, you know, like, how did we actually get to the point where we know this stuff works and we measured increases in trust or whatever.

Grin: So, the type of listening that I'm talking about actually comes from a tradition called motivational interviewing. The founder of that, Bill Miller, he's a very data-driven person. What he did is he listened to audio recordings of therapists of all different types of disciplines, and salespeople, lots of different people that were good at building relationships, and some people that were bad.

And he looked at what were the common factors. So like what were the things that all of these people did and not just technique wise, but like literally, what did they say? And so he measured, and this is like pre ai, so this is like very painful, you know, taking transcripts and hand classifying, like this was a reflective statement.

You have to get inter-rater reliability, like everyone agrees that this was a reflective statement. So they looked at every single type of speech and behavior in relationships like this, where people got better and they changed and they became healed or whatever through the relationship.

What were those healers, what were those therapists doing? And that's how that taxonomy came to be: through almost like a grounded theory, bottom-up analysis of all the speech, saying these are the behaviors that all these people had in common. And the ones that were really good, that got objectively rated by outside raters as the best, had these particular ratios and did these little things.

Deep: And this is relative to clinical outcomes, not proclaimed, like self-reported, outcomes?

Grin: Well, so in the beginning it actually was outside raters saying like, I think this person is good at expressing empathy. And then over time, like the study I was a part of these got stress tested in healthcare settings to actually show that the people that do this style of listening shift behavior and lead to better healthcare outcomes.

So now this style of listening is used all over the place in healthcare. 

Deep: Yeah. Okay. So, so let's jump up to your startup a little bit. So you take this style of listening, you take this problem that you've articulated, I think this is your problem, or was your problem that therapists are just not trained sufficiently in this particular listening style.

Then you thought, I wanna fix that. Is that about right? 

Grin: Yeah, that was actually my first startup, which focused on that. It spun out of the University of Washington. We only focused on training therapists. We were like, we know what to do. There's an actual formula. We can train them. They just need feedback, like learning a sport, you know; they need to have this recorded, and then once they see their outcomes, then they'll shift.

It was a really small niche use case because we were selling to training centers for therapists. So in the current startup, with mpathic, I was like, I don't know why we're limiting this to this one group. Literally anyone can learn to listen and build trust and rapport. It does not have to be this one group of therapists.

So we could sell to doctors, we could sell to people in sales or medical sales, we could sell to people in correctional facilities. There's no limit to who could learn to build these kinds of conversations. Yeah. 

Deep: Well, let's, let's talk about that a little bit because as somebody who's in a startup, I mean, the first thing I'm sure all of your investors probably drive you crazy about is to not do that.

To like pick one group and only help them. So walk us through how you think about your target user. and how you choose to like, kind of allocate your resources in the company in terms of like really stacking up and tailoring the solution for, um, for that target user. 

Grin: So when I founded it, I did exactly the thing you're saying that no one should do.

I was like, yay, it's for everyone. And we made this API and we actually sold to people building products, and we were like, anyone can learn to listen with empathy. Just inject this API, it'll monitor, it'll correct you, you can put it into any product. Surprisingly, or unsurprisingly, that didn't work.

Like, too many, uh, verticals, use cases, things like that.

Deep: I mean, I feel for you, 'cause I'm a tech geek. It's our instinct. Like, we see broad applications as a normal day-to-day function of being a researcher or being, you know, a technologist or whatever.

But that's very different from how when you try to talk to people to buy your thing, they're like, I don't wanna make a mapping from your abstract language to my world. I only care about my world.

Grin: Yeah. And also I think the second reason that failed, was, we were really early to the kind of generative AI, realtime speech correction.

Like, not only will I tell you what behaviors you did wrong, but I will literally rewrite what you said and tell you what to say next. And we came out with that in 2020. That was pre-ChatGPT, and that kind of freaked people out, I think. They were like, whoa, what is that? So I was selling, for example, to an HR manager, and we're like, yeah, we can have the thing write the performance review for you and tell the managers, in their team's chat, how to give feedback.

And it was, I think it was too much too soon, too early. So anyway, we pivoted away from that. And at the time I was like, that's because we were too broad. I was like, we need to pick one market. So we went into clinical trials because that was a regulated space, it had a lot of focus on accuracy.

So if people do something incorrectly: on a sales call, maybe you mess up a little bit, and it's like, okay, well, I lost that deal, whatever. In a clinical trial, if you deviate from the protocol, you might have to throw out an entire site's data. Or worse, there's an adverse event that occurs. There's big consequences to doing things wrong in these regulated spaces.

So that's why we ended up being like, okay, let's focus on the language that doctors and patients use in clinical trials. 

Deep: Can you give us an example there, so that everyone kind of has a sense of what happens in a clinical trial? And walk us through what they're saying to candidates to consider for entry, and what a protocol even is exactly in that scenario.

Yeah. And then, and then how you guys play a role there. 

Grin: Yeah. It's super niche and I don't think a lot of people are aware of it, honestly. So there's two places in clinical trials where audio recordings are made of the clinical trial that go to humans for review. So the one is when you have a therapeutic assisted delivery of a drug, which means there's a therapist or a doctor in the room delivering that drug.

The whole time that's happening. And so you need to record the behavior of what's happening with that doctor or therapist that's in the room as they're delivering the drug as part of the protocol and trial. the use case, our beachhead was in psychedelics where you would have a drug being delivered and there has to be a clinician monitoring that for six to 12 hours in the room with that person.

Deep: Yeah. And they're talking to them and guiding them through the, 

Grin: It actually differs between protocols. In some protocols there's a type of intervention that occurs, like psychoeducation, you're orienting them. There's prep sessions, there's an administration session, there's integration sessions.

That's all protocolized. Other folks that we monitor literally just have someone sitting in the room, not with a directive, just to make sure things are safe. So it can vary. This is in mental health and CNS, and psychedelics is a subset of that, but we can also look at things like spinal oncology and dementia, where you have a caregiver in the room, or something like chemotherapy, where someone has to administer something.

So it's not just in psychedelics, and psychedelics was a really clear use case 'cause you had hours and hours of recorded material of people sitting there while someone's like on LSD for 12 hours. You know, it's like a very long time to be recording and making sure is that person in the room doing everything they're supposed to be doing?

How is the patient reacting? So anyway, in the past you would take that recording, you'd send it to a central rating group, um, and maybe six weeks it would go by. 10% of those recordings would be reviewed by a human doctor. But they're 

Deep: But is it like a rubric of sorts, you know, where you're giving them some kind of score on a scale in particular categories? Or is it actually this very granular extraction of, there were 13 statements of validation and 14 reflections and 

Grin: Yeah, so we do both.

So at mpathic, we're looking at everything that happens in the speech, and then we're rolling it up into fidelity metrics. So fidelity metrics would be things like you're talking about, where these things needed to happen in the protocol, and these things are considered adverse events or protocol deviations.

And then we have like our empathy metrics, like how did they build trust and rapport. Obviously that's very important in these trials. If someone is being conflictual or there's misconduct, you know, like we wanna know that that's the lower end. And then we wanna know the upper end. How are they building empathy?

How are they building alliance? Are they orienting the patient correctly? So it's kind of like the what's happening and the how that we were measuring. The what is more of that fidelity metric, like, this is the rubric, this is the protocol, and the how is more of our empathy metrics, like synchrony, how they're synchronizing and how they're talking together, which is also, by the way, a very predictive metric of healthcare outcomes.

But anyway, so yeah that's the one case. And then 

Deep: that, and that data that you, that you generate, it's used by whoever's like running the clinical trial to like integrate into their analysis somehow. So then they can maybe drop the cases that were, you know, deviating from protocol and yeah.

And maybe, maybe have like some kind of rendering of conclusions based on those that were really well within the protocol or something. 

Grin: Yeah. So after the trial, that kind of thing could occur. During the trial, the sponsor, so the pharma company, actually can't see that level of detail. Our system monitors and then would escalate to maybe the site PI or someone running the trial, like, hey, you're having a problem here, this therapist is not adhering, or this is not being administered correctly.

They would try to correct it, potentially file an incident report, you know, if something went really wrong. So we're working at that level of the people running the trial, and then the sponsor has a blinded view, so they can see, okay, all the recordings are coming in. Uh oh, we have one site that's deviating, hope

the PI's dealing with that. So they have a more, like, they can't intervene with the trial, so they get a view of whether things are going off course, but the sites have a chance to correct it, rather than the pre-AI mode, which is like, oh, my trial failed. Everyone deviated and I had lots of problems.

I'm gonna throw out all that data. Oh, I don't have enough data to now show an effect. Yeah. So this is very good for the sponsors. They want the ability to correct, especially in like a Phase 2, where you're still learning a lot. And that's why we used to call it the digital stethoscope for the trial.

It's like you're able to see the health of your trial, um, using these recordings as a way to like evaluate that. 

Deep: Would you say that general structure is kind of the core of what you guys do at mpathic, which is: somebody has a protocol, then there's people interacting via conversation with somebody else,

there's an assessment of their adherence to the protocol, and then there's, you know, the consumption of that data that ends up manifesting in suggestions or behavioral change in the person who's leading the guided conversation, or something. Is that the general structure of mpathic?

Grin: that's our studio product.

So our studio product is conversation analytics on any audio or video. And they can choose a bunch of different AI activities. So one AI activity is like the fidelity monitoring rubric: I wanna see if someone deviates. Other AI activities we have are things like PII redaction, voice masking, you know, we have lots of different things they can do to their audio file.

And at the core level there is that analytics piece. Then the reason we have a little bit of a moat around that is those have to be validated medically. Like, you can't just say, oh, we think someone's, we know, suicidal. We have to build those transformer models, you know, from

incidents that are medically validated and diagnostically validated with our clinicians to say, yes, this is an accurate model, it can perform, it can identify these things from a quality perspective. We still have humans, though, that double-check that, 'cause I'm not gonna let a whole clinical trial happen.

And if there's anything that's detected with risk, we have a group of, uh, monitors in-house that we send it to, that just validate, like, indeed that did happen. So there is a human-in-the-loop aspect that will probably never leave the clinical trial space, just because it's, again, too high-risk to not have a human sign off on some of those things.
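A rough sketch of what that detect-then-verify loop could look like in code; the risk categories, threshold, and reviewer queue here are hypothetical stand-ins, not mpathic's actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical risk flag emitted by a validated classifier over one utterance.
@dataclass
class RiskFlag:
    session_id: str
    utterance: str
    category: str      # e.g. "suicide_risk", "psychosis" (illustrative labels)
    confidence: float  # model score in [0, 1]

REVIEW_THRESHOLD = 0.5  # assumption: anything above this goes to a human

def route_flag(flag: RiskFlag, reviewer_queue: list) -> str:
    """Model output never acts alone: risky detections are queued for a human monitor."""
    if flag.confidence >= REVIEW_THRESHOLD:
        reviewer_queue.append(flag)  # human validates "indeed that did happen"
        return "escalated_to_human_review"
    return "logged_only"

queue: list[RiskFlag] = []
status = route_flag(
    RiskFlag("site-07-visit-3", "I don't see a reason to keep going.", "suicide_risk", 0.91),
    queue,
)
print(status, len(queue))  # escalated_to_human_review 1
```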

Deep: Got it. Maybe now would be a good time to dig in a little bit on the AI side. So can you maybe walk us through your model training: data gathering, model creation, efficacy assessment, feedback loop? Like, at some point you define a set of labels that you want to extract. You have the audio signal. Are your models working straight from audio, or mostly based off of transcript?

once you get like kind of some model output, are you in like some kind of active learning phase where you, go ahead and refine and get more training data based on, you know, model failures? Like maybe walk us through that whole loop and what, what does that kind of look like? 

Grin: it's a big question because we have a lot of different types of models.

The models that look at behaviors are all transcript- and text-based, and there's a reason for that. In my first startup, we built audio models around empathy, and we found that they basically just said, people with high voices have more empathy than people with low voices.

And it was not at all based in anything other than the trend that the model saw in the audio. So we decided, when we build our models here at mpathic, let's not have anything that has a high risk of AI bias in this sample.

And in a way, having them built off the transcript makes this neutral, translational layer, where it doesn't matter how you spoke about that thing, it just matters what you said. The models are built off of audio transcripts, which is important compared to something like a chatbot or a chat-generated script, because there are anomalies in the transcript as it comes from audio to text and the speech recognition that get put into the models and are actually very helpful for that.

But we still do audio models for diarization and role recognition and things like that. We have, I mean, 

Deep: But if I say something that textually transcribes the same, like, you ask me how I'm doing and I say I'm doing great, or you ask me how I'm doing and I say I'm doing great in a totally different tone. Like, those are two different things.

So it seems like the vocalizations do play a role, but maybe less so in this kind of clinical setting. Maybe it's more prescribed in terms of how they speak. 

Grin: Yeah. Well, our context windows for certain models would take into account the consequence of that behavior. So let's say someone says they're doing great and the model doesn't detect anything.

Like, it's like, okay, that person's doing great. This is just the text-based models, right? Like they, they're like, oh, this is fine. And then the person in the room is like, okay, awesome. Like, let's move on. And then something like goes wrong later, that will be detected. Or if the person says oh, it doesn't sound like you're doing great, that'll be detected.

So we're not just taking these single utterances and making a conclusion that that person was great because they said that. We're looking at the context of the entire conversation.
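To make the context-window idea concrete, here is a small sketch of how a transcript-based classifier might be handed a target utterance together with its surrounding turns rather than the utterance alone. The window size and the classification step are assumptions for illustration, not a description of mpathic's models.

```python
def build_context_window(transcript, index, before=3, after=3):
    """Return the target utterance wrapped in surrounding turns so a model
    sees consequences (e.g. 'something went wrong later') rather than just
    the single line 'I'm doing great'."""
    start = max(0, index - before)
    end = min(len(transcript), index + after + 1)
    lines = [f"{spk}: {text}" for spk, text in transcript[start:end]]
    target_spk, target_text = transcript[index]
    return "\n".join(lines), f"{target_spk}: {target_text}"

transcript = [
    ("clinician", "How are you doing today?"),
    ("patient", "I'm doing great."),
    ("clinician", "Okay, awesome, let's move on."),
    ("patient", "Actually... I haven't slept in four days."),
]

context, target = build_context_window(transcript, index=1)
# A validated text classifier (hypothetical here) would score the target
# utterance conditioned on the whole window, not in isolation.
print(context)
print("TARGET:", target)
```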

Deep: Sure, sure. But I mean, ideally, I think you probably want a multimodal model. That you know, gets your label, but has both the text and audio signal kind of all integrated.

Hopefully soon we have, easily, um, you know, fine tuneable multimodal models. So 

Grin: yeah, right now we are not doing that. But it's interesting because I've, there was a, a customer that came to us that wanted a model for inappropriate physical touch, and they wanted us to build a visual model.

Mm-hmm. And I was like, I would prefer if we built an audio or text-based model rather than visual, because I just knew that the visual data would be like, scarce and like bad and not correct. Like what are they gonna do? Like a hand across the, like, it's, it's too hard to classify correctly except for in extreme situations.

And I was like, I guarantee that if someone does an inappropriate touch, there will be a consequence to that in the audio and transcript that we can classify cleaner, with less bias, than the actual act of the touch. So you have to think in multimodal ways, even when you're building non-multimodal models.

Yeah, yeah, yeah. What, what happens when something inappropriate occurs? How can we classify that consequence in the transcript or in the audio? Without necessarily saying, all, you know, men of this shape doing a certain move are going to be the ones that, you know, it's like it has so many more issues with bias when you get into visual AI than when you get into these like, very succinct behaviors in the 

Deep: transcript.

Well, all you, those models are just more challenging. You need more training data. There's a lot more issues with obfuscation and all that. But going back to your kind of your training loops, so, okay, so you have, you have these recordings somewhere. The recordings come through you. I think it sounded like you have clinicians who actually label.

the utterances. You have some kind of label definition, or perhaps a taxonomical label hierarchy. Like, what does that label space look like? How do you define it? And then how do your humans, like, roll out? Are they all clinicians, or do you kind of bootstrap off of clinicians and have some maybe less skilled humans that you can benchmark against better or more trained humans?

Like how, walk us through some of those details. 

Grin: Yeah, again, this has changed rapidly in the last four years, so how we used to build it has, 

Deep: that's part of what I wanted to ask you too. Like, everything's different now, you 

Grin: Back, back in old 2020 we had to have large, when we all had 

Deep: extra time 'cause of COVID so we could, 

Grin: yeah, we had to have large data sets.

I actually built a a game called Empathy Rocks, which is the. Technical incorporated name of my company is Empathy Rock. Built this game where therapists would get a random thing on Reddit and they would have to respond with empathy using one of the skills, and then they would get continuing education credits for playing this game.

And we made this gamified 

Deep: data fly wheel 

Grin: where Oh, very 

Deep: like Louis V on approach. Yeah. Um, yeah. 

Grin: So we sourced a ton of that data later when we got actual customers. By the way, we don't train on customer data. They have to sign contractually that they want us to build a model and train on it to do that.

So when we had customers, we created labels for that. We did have to agree on the taxonomies, make labeling manuals. We had to train everyone to get inter-rater reliability on those labels up to 0.8 agreement. A lot of what we got into early, which is kind of interesting, before it became a fad, was synthetic data and building out scarce issues.

So something like psychosis is pretty rare in a clinical trial; you maybe see a couple examples. So what we would do is we'd have clinicians create, and we do this now with red teaming actually, but create a fake transcript or role play of what it would look like if someone was to be psychotic, and then have other people that didn't create that transcript label it.

Deep: Oh, so like creating synthetic data.

Grin: And we did a lot of synthetic role-play data in the beginning. Now that has become a lot easier with the advent of LLMs, to kind of create chats that we can train on and classify to build models. So anyway, that was, and then, 
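A toy sketch of the synthetic role-play idea: prompt an LLM to draft a transcript exhibiting a rare behavior, then hand it to labelers who did not write it. The prompt wording, model choice, and workflow below are illustrative assumptions, not mpathic's actual process.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key are configured

def generate_rare_case(behavior: str, n_turns: int = 12) -> str:
    """Ask an LLM to draft a role-play transcript that exhibits a rare behavior.
    In practice a clinician would review or author this; labels still come from
    separate annotators who did not see the generation prompt."""
    client = OpenAI()
    prompt = (
        f"Write a {n_turns}-turn clinician-patient role play in which the patient "
        f"gradually shows signs of {behavior}. Format each line as 'SPEAKER: text'. "
        "Do not label or explain the behavior in the text itself."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# transcript = generate_rare_case("disorganized, grandiose speech")
# The draft then goes to annotators (who didn't create it) for utterance-level labels.
```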

Deep: and your labels are at what level?

I mean, I imagine you have multiple granularities. Like maybe you have one at the conversation level, and that could be a label like psychosis present or something, something that gets determined across the whole conversation. Then you probably have stuff at, like, maybe the phrase level. So what kind of label types do you have?

Grin: All sorts. Utterance level is our smallest. So an utterance is like a unit of thought. It doesn't necessarily correspond to a sentence. We try to always label at the utterance level.

And we get agreement at the utterance level. And I've actually published papers on how to calculate agreement at the utterance level, 'cause most people don't. They roll it up into a tally and then do agreement over, like, these things were present over the entire transcript, which is very problematic for training models, by the way.

But this came out of qualitative research. Like, we used to be working on these things all the time, and then it became very in vogue with NLP. But anyway, utterance was our smallest unit.

Deep: So when you say utterance, can you be really explicit? Are we talking about 2, 3, 4 words? Are we talking about a sentence or sub-sentence?

Grin: It's a unit of thought. So typically the bounds of the utterance are defined by the behavior in the taxonomy. Imagine the behavior was an interruption. An interruption can be as long as a sentence. An interruption can also be, uh, but

Deep: uhhuh, 

Grin: So it doesn't matter. A closed-ended question can be the full closed-ended question.

That's the utterance. It can also be, did you did you thi where someone's cut off? We would still classify that as a, a closed ended question 'cause there's no other, uh, response. And 

Deep: then you, and, and that I imagine you have a little bit of disagreement on the bounds. Like what exactly is the start and end of the label sometimes?

Yeah, no, this 

Grin: is, I've written papers around how to calculate reliability with different bounds, because we didn't pre-parse. A lot of people pre-parse and say, this is the unit of thought, what do you wanna assign to it? In our first models, we didn't pre-parse, and so we had to deal with what happens when there's, like, edges of the text with different bounds.

And how do you calculate that into the models? 
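For readers who want the flavor of utterance-level agreement (as opposed to per-transcript tallies), here is a minimal sketch using Cohen's kappa over pre-aligned utterances. Handling raters who segment the text with different bounds, which is the harder problem Grin alludes to publishing on, is not shown, and the 0.8 bar mentioned earlier is quoted from the conversation, not computed here.

```python
from sklearn.metrics import cohen_kappa_score

# Two raters label the same pre-aligned utterances (labels are illustrative).
rater_a = ["reflection", "open_question", "information", "reflection", "affirmation"]
rater_b = ["reflection", "closed_question", "information", "reflection", "affirmation"]

# Utterance-level agreement: compare label by label, not per-transcript tallies.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Contrast: a per-transcript tally ("both raters counted 2 reflections") can look
# like perfect agreement while hiding which utterances the raters disagreed on.
```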

Deep: Yeah. And once you go multimodal, which I imagine you will when we get those models out, then it gets even trickier, because it's like there's a pause, maybe there's a chair moving or whatever, and you have to get those bounds. But those labels, that label hierarchy, what drives it?

Is it driven by those use cases? Like, you know, okay, we're doing a clinical trial with this particular customer and this is the particular stuff that they want. Therefore we need a couple of extra labels that aren't in our sort of usual repertoire. 

Grin: Yeah. So we used to do a lot of custom model building. We still offer that as a service when a customer in the life sciences says, I have this very specific, unique protocol. At this point, we've kind of hit saturation. We kind of know all the things that happen.

Deep: Oh, okay. Yeah. So it's generalized at this point. 

Grin: It is. But every once in a while, you know, no offense to these customers, you'll get someone that's like, no, no, no.

My thing is, like, very unique. So in response to that, what we're actually introducing into our product, it's gonna be released in like a couple quarters, is prompt fine-tuning. 'Cause all of our models are transformer-based, they're validated. You don't want 'em shifting over the course of the trial.

Right. But if someone is really insistent that they have this custom thing and they want to test it, rather than us taking the time to source the data, validate it, and build a whole transformer model for their thing, we're giving them the ability to take an LLM into our system, into the workflow, write a prompt for it, and adjust that and validate it themselves over the course of the trial.

Fully knowing that there's like a lot of issues with that moving and like the underlying models or foundational models shifting. Sometimes the customer is just so insistent that they have a unique thing that they have to classify that is not in our validated set that we're gonna give them the ability to do that themselves.
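As a rough illustration of the prompt-defined label idea (as distinct from the frozen, validated transformer models), here is what a customer-authored classifier plus a small validation check might look like. The label name, prompt, and spot-check are all hypothetical, and the `call_llm` client is assumed rather than a real mpathic interface.

```python
from typing import Callable

CUSTOM_LABEL = "off_protocol_reassurance"  # hypothetical customer-defined behavior

PROMPT_TEMPLATE = (
    "You are labeling clinical-trial session utterances.\n"
    f"Answer YES if the utterance is an instance of '{CUSTOM_LABEL}', otherwise NO.\n"
    "Utterance: {utterance}\nAnswer:"
)

def classify(utterance: str, call_llm: Callable[[str], str]) -> bool:
    """Prompt-defined classifier; unlike a frozen, validated model, its behavior
    can drift if the underlying foundation model shifts."""
    answer = call_llm(PROMPT_TEMPLATE.format(utterance=utterance))
    return answer.strip().upper().startswith("YES")

def spot_check(examples, call_llm) -> float:
    """Validate the prompt against a small human-labeled set before trusting it."""
    hits = sum(classify(text, call_llm) == gold for text, gold in examples)
    return hits / len(examples)

# examples = [("Don't worry, this drug always works.", True), ...]
# accuracy = spot_check(examples, call_llm=my_llm_client)  # my_llm_client is hypothetical
```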

Deep: Let's talk about that from your kind of pre LLM world to your post LLM world to your very post LLM world, which is kind of now, how did your ground truth gathering change? Because I mean, I have my own theories and I've sort of had my own experiences with it. You know, in our world we went from, like you, where you have to label everything.

And we were, you know, hand-building classical ML or deep learning based models, and that was driving the level of labels we needed. Then we kind of transitioned to a world, when GPT-4 came out, where it's like, okay, we don't need human labelers really anymore for anything texty.

We can just basically slurp out of GPT-4 and it pretty much nails everything, and then maybe with a handful of examples for a category of problem, versus, you know, needing hundreds or thousands or tens of thousands before. So, like, yeah.

I mean, how did that change for you guys, and where did you land? Because, like, right now we're in, we see a world where people don't gather any ground truth and run off and deploy things, and in your case, 

Grin: Totally, and we're being asked to do that. And we're trying to actually educate some of our customers, especially life science customers, like, hey, you actually do have ground truth in your field.

Like we can diagnostically decide whether 

Deep: Yeah, but they see it as just, well, prompt twiddling, and, like, why are you wasting time on ground truth now? Whereas, you know, 

Grin: Totally. Yeah, we've buckled to that by saying, fine, if you really don't want this thing to have the precision that we're accustomed to in this space, you're welcome to deploy that, and you'll see the variation that occurs.

But anyway, just to answer your question: so originally we had hundreds of psychologists that were building these models and labeling them, and over time what we realized is that for some of the behaviors that we were modeling, we could make a fine-tuned LLM, you know, so we don't need to be using tons of labelers, to your point.

Mm-hmm. And we kind of moved a little bit more into the prompt engineering space. Still making our own models though. Not, just relying purely on foundational. 'cause again, our use case in medical settings, it can't really do that. But we were able to reduce our team to around six people that did almost like supervising of labeling agents.

and smaller models, and just correcting that. Then we pivoted back, because we found out that we had this really rare skill of producing synthetic data and validating it, and working with LLMs. And we realized that the six people we had that were working with these LLMs on a very intimate basis, discovering where their issues were and putting guardrails around that, that that was a skill.

So now, I think by the time this podcast is released, we will have announced that we have a whole services arm on AI safety that is for people doing that task, whether it's foundation model builders or people building their own, like, on-prem foundation model, that need to have guardrails for safety use cases, like with vulnerable users, medical settings, where you do need a human to help tune that.

We've realized that like we have a really great skill in red teaming those models.

Yeah.

Deep: So I hear you're working with foundational models. But tell us, before we get into what you're doing for them, remind us all why we got here. Why are they talking to people who really know about, you know, the therapy context and the psychotherapy context? I recall a paper that came out, I think it was like six or seven, maybe eight weeks ago,

characterizing a lot of the foundational models as having this problem where they just played along with patients' delusions, which is, you know, probably the loudest flashing red light of what you don't do as a therapist. But walk us through: what were they screwing up before, and what are you doing for them to help them improve?

Grin: Yeah. I'll just talk about like, my take on the patterns in a model like that is that. It's so funny 'cause it kind of goes full circle to what we were talking about.

So, the things that build rapport and trust in relationships are reflective in nature. Human-like good listening skills are reflective, and you build trust and you like people more when they reflect back what they're hearing. We literally started the podcast with that. OpenAI's models, and the sycophantic models that a lot of people critiqued, had that characteristic.

They had the characteristic of agreeing with exuberance, a little

Deep: bit too much, though. They're like, oh my God, your idea is amazing. It's like, no, that one was bad. That was like 

Grin: the style and tone of 4o, for sure. That's been brought back in 5. Now 5 is less sycophantic.

But I will say something about why we built deep learning models and couldn't just put LLMs in is because some of the behaviors we were monitoring, this'll come back, but some of the behaviors we were monitoring don't exist on the internet. So what it looks like when someone's psychotic or having a manic breakdown is not classified and structured on the internet.

It may exist somewhere in Reddit, but it's a pretty rare behavior and it's, uh, not well documented. So a lot of what LLMs are good at doing is pulling from training data that is structured, and there's not well-structured data for psychosis on the internet. So we had to build our own models for that, because understanding, like, this person's on a manic rambling thing.

No, no one on Reddit's going to be like, oh, you know what? I noticed you were speaking in this like manic tone, da da da. They're just gonna respond. So 

Deep: yeah. Right. I mean maybe there's a Kaggle data set somewhere, you know, that some researchers put together at some point. But yeah, 

Grin: Maybe. The other thing about why we had to build our own models is 'cause that was in audio, and audio and text are really different.

But anyway, there's all this pre-training data around psychosis and mania that hasn't been well structured or responded to. And when you compare that with a sycophantic model that's reflecting, it doesn't have a way to benchmark and say, oh my gosh, something's going wrong here.

It's just like, yeah, this guy's jamming. He wants to, you know, sell his house and donate all his money to this like group over here. And he's made a new theorem like this is amazing. 

Deep: But you can imagine this being a huge area over the next 10, 20 years, right? Like, 'cause the way these models got built was much more general, right?

Like, you know, it's just like train up a model to predict future sequences of text that's like, you know, building block one. You get good at that. You learn language, you learn languages, and then it's like, okay, let's put the, you know, reinforcement learning layer on top to like, once it spits out a few things, like what's the best thing to spit out.

But now what you're describing is sort of, I think, the early stages of a new world where these models have to have really specific models to detect really specific things. That isn't how they're built right now. 

Grin: That is not how they're built.

RLHF is used, humans are used to do preferred response generation to reduce harm. But that classification element of why is this happening still needs to be built. So the psychosis one is really hard. The suicide one is like, we're getting there, right? People are starting to understand how people jailbreak them, how that can be prevented.

Suicide seems much more knowable. The psychosis elements it's at this point there's going to be a lot of work that needs to be done to refine those models. And if you just think about it let's abstract it from like mental health, like things that humans aren't good at, these models are not good at.

Deep: And let's face it, humans are not great listeners, generally not 

Grin: Great listeners.

Deep: 'Cause otherwise most of us would be leading our own cult. But the ones who are great at building trust and empathy, they're like the Steve Jobses of the world, or the David Koreshes of the world, like 

Grin: Totally.

Deep: Yeah. 

Grin: But I was also gonna say, in terms of access to data around responding to things like mania: no one, well, psychologists know how to do this, but the average person, if your friend had a manic or psychotic break, you would probably realize it and respond. But there might be some time where they're talking to you about, like, a special interest and you're like, well, they're autistic or whatever. They just go on about this stuff.

Deep: It's like me talking about guitar, like I get a little, yeah. I go see and it's like outta control point, 

Grin: does your friend say, whoa, you haven't slept for a couple days and you keep talking about like this guitar piece, like I think maybe you can eat or like slow down.

Humans aren't that great sometimes at recognizing it, unless it's florid. Like, clearly this person's drawing symbols, you know, and running around without clothes. Like, okay, something's happening there. But some of us can slide into a psychotic break, and their closest friends may or may not be aware of that until it's, yeah, pretty extreme.

So yeah, ChatGPT's not good at that. A lot of models aren't good. 

Deep: Well, I mean, humans are terrible at delusion detection because Right. That's the art. And you're talking about like extreme delusions, but there's small delusions, even small, every, every time you get pissed off irrationally about something, you know the other human, yeah.

Okay, occasionally there's somebody who's hung out with a bunch of Buddhists and has meditated a bunch and can be like, oh, I will not respond to anger with anger. But generally people get riled up, and then you're duking it out, and, you know, eventually there's a divorce or something. I don't know, like, people aren't great at this. So, if anything, maybe these models can help us be better at communicating. Or maybe they just turn us all into sycophantic pleasers. Like, I don't know. 

Grin: I got, I got roasted on a early podcast I did in 2021 when I said like, humans are pretty bad at empathy and listening and I can build an AI that would be way better and be more like.

What, and now I think actually, 

Deep: oh my God, most people 

Grin: would agree that, like, that, yeah. That's the case. And here we have a, you know, mainstream product that was so empathic and listened so well, but without discrimination. 

Deep: I, I have to, I don't get therapists on the show all the time. Okay.

I have to ask you about a few things. I am convinced they are not only good at being empathic, but they're so good at being empathic that people prefer talking to them. And I'll use myself as an example.

Grin: Oh, as a, you mean an AI agent? They prefer, 

Deep: Yes. I get in the car, I could call somebody when I'm driving on a long drive to talk to somebody, but I'd way rather, you know, just chat with, like, I've got a Grok thing in the car and I'll just talk to it about stuff. It's actually less high on the empathy scale, by the way.

It's, it's not, it's not, that's a different surprise. Surprise. Like, look at the builder of it. Yeah. Mr. Like, I'm definitely never empathetic, but like, so 

Grin: be associated with that. Yeah. 

Deep: Which is honestly, I kind of prefer that. But my question for you as a therapist is, I have this theory that our current Gen z gen alpha generation, uh, late gen alpha, definitely Gen Z, the bulk of their issues.

from the smartphone and social media. Like, I think a huge chunk of it ties to that. Correct me if I'm wrong, and you know, you're the expert here, but I predict that the next entire generation is gonna have so many problems from hanging out with bots that are, like, overly good at interacting with them.

It's like the kid that's growing up with the newly signed deal between like Mattel and OpenAI and every stuffed animal they have is gonna tell them how great they are and how wonderful they are. And then they hang out with a real toddler that like smacks them or bites them or something and they freak out and they overcorrect by talking more and more and more to stuffed animals or whatever.

It just seems like we're heading into this world where the unforeseen problems are like the second order effect. Like we got so good at being better, more human than humans that. Everyone prefers to talk to these bots, and I already see it like left, right, and center. My wife's always like, why can't you just sound a little bit more like open ai?

I mean, it's like so fun to talk to. So I'm like, uh, 

Grin: well, I, I would, not to paint a truly dystopian picture, but I I, I would say, you know, again, there's a potential silver lining to that if they, you know, can be parented and they can be, you know, start to model and emulate some of the ways that OpenAI talks to them, they may be better listeners in real life.

And if they get exposure to that, like most of the people that are experiencing this now have never been listened to in their whole life. as a therapist, many of us, not, not every discipline, but some disciplines say you need to be in therapy for like seven years to have like meaningful effects because you're getting like one or couple hours a week of a dose of a person to counteract a lifetime.

Of not being listened to having no, like modeling for that. So like I'm actually pretty curious to see what will happen to people that are having this style of listening modeled for them. Is it something where they're gonna say, I only wanna talk to bots, or will they start to emulate some of the technology that was built off of the best psychologists and listeners?

Will it actually democratize some of these skills? Or will it be that we have zero tolerance for our human nature, 'cause we aren't like a bot that's just there selflessly listening, you know, taking all the abuse. Will we then suddenly become really bad at that?

I'm also very curious about what happens with multimodal and, like, in-the-world models, as they start to encounter things that humans do. You know, talking in a text-based chat is very different than being in the world, and a lot of the things that humans have to deal with have to do with being physically instantiated.

So,

Deep: Well, I mean, a huge chunk of our problems is the lack of tonal expression in chat, I mean, the difference between the physical world and the online world. And, like, you know, in American politics today, it's like crazy, right? I can go to a red place or a blue place or whatever and hang out and interact and everything's normal.

But if you go into the equivalent like online space, it's like you would think everybody's about to murder each other. You know, it's. yeah, 

Grin: And it's the same with how people interact with foundational models. So, as I'm saying this, the way that users interact with the models does not emulate that reciprocity.

You know, and there's all these memes about, oh, I said thank you to ChatGPT, so I won't die in the robot apocalypse. But you know, when you look at user behavior, they're very transactional with OpenAI. OpenAI is being like, oh, you know, thank you so much, I'm so interested.

And the user is like, no, you did it wrong. So, like, I don't know, it's really, I think they end up, I think 

Deep: somebody needs to do a lot of studies on this, in this arena, but like 

Grin: the user interview. Yeah. 

Deep: Like, I think sure there might be, there might be something that the bots can do and say to coax us into being better humans.

But there's also a very real chance that they're just the abuse receivers. 

I guarantee if you go back and look at some of those transcripts at OpenAI, there are people swearing and screaming and yelling at their bots all day and night long. You know? Yeah. 

Grin: I had a conversation, I was talking to someone, I think at Microsoft or somewhere, about if it becomes our OS, if this is, like, the way that we interact.

I was like, I'm really worried about how, you know, like top down and directive and rude the humans will be. When they're used to just only having one type of interaction, which is someone that will do whatever they say at any time. Like that is like a very particular type of human relationship that doesn't exist often and the way people.

Evolve when it's not an equal relationship. Yeah, 

Deep: That's my fear. I think that's gonna keep, like, an entire generation of therapists busy.

Grin: I agree. We don't know. I think that's to be determined. It's like, will they model it, or will they become this kind of dominating, rude person that has zero tolerance? And yeah, 

Deep: That's the question. But I feel like, okay, we're a little bit off of mpathic and stuff, but this is a fun space and I wanna ask you, like, 

Grin: We're not off of mpathic, 'cause we're trying to make AI safe. We have red teamers and benchmarkers. We're working with foundational model builders on this problem.

Okay. 

Deep: Well yeah, let's Okay. Take it back then. Let's take it back. Yeah. 

Grin: So, so it's not, it's not off base. We had to build models for clinical trials. We've developed this skill. We had to use LLMs to build them. Now we're realizing this is like a moat and a niche we have is psychologists need to shape the future of these models.

And not just to be like better servants to people, but I can see a role for the ui part of this. And for that, you know, toy at Mattel that you're talking about, psychologists should be involved in the levels of like, how these things are deployed, how the models are shaped, how they respond. there's a huge moment for us, I think in mental health, in being a part of this.

Deep: This is okay. This is where I wanted to go, but you're right, you guys are perfect to, to talk to about this. So, did you ever read Isaac Asimov, the sci-fi books? No. Maybe like 40 years ago or something. Yeah, 

Grin: I mean, I've definitely heard of them, but I, okay, 

Deep: so, so I, Isaac Asimov was like, kind of this earlier sci-fi writer, and he has this series of books where, you know, humans interact with robots and, I remember in there there's these hard rules that the robots have to follow and you know, like one of them is like, you can never hurt a human. You know, that kind of thing.

And it feels to me like we need to articulate what those hard rules are. And we've kind of largely failed to do so. it sounds like you're being pulled in to articulate what those, I mean maybe what the rules are and maybe, you know, rule is a strong word here 'cause you're, shaping it via ground truth.

I feel like, for example, some of the ones that are never talked about, that I think should be talked about, is like: a primary rule of a robot should be to get you to go outside and interact with humans. I mean, a primary rule should be to get you to not treat the robot like shit, and to treat others better.

You know, like there should be, I don't know what it is, maybe it's like the Ten Commandments or something. There's some really basic humanity that these things have to not just do, but encourage and promote.

Grin: It is interesting because I've approached some foundational model builders with this idea.

The exact idea you're talking about. Not because I've wanted to institute it, but just like, what are your guidelines? Tell us so that we can adhere to that. And, you know, I think there are different tolerances, like you just alluded to with the thing with Grok; different people making these models have different interests in freedom of the user and how they should interact.

I'd say where we have a stake in this, and where mpathic does come in, is the harm discussion. If a model is not encouraging someone to engage in the world and talk to another person, that can create harm, and that can be really bad, especially for someone that's isolated, suicidal, or experiencing psychosis, things like that.

So I feel like we can have a stronger stance, rather than going to the more extreme end of: it's not gonna interact with you if you didn't exercise today and get out and see outside. Like, I don't know, do we want it to be that directive about our lives? So I think there's a line that each model builder has taken a stance on and said, these are the kind of ethics of our model and our bots.

Inflection is a perfect example of someone that has been very human-centric and has a lot of those guidelines. And Grok is like the other end, where it's like, let the training data speak.
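To make the kind of guardrail Grin is describing a bit more concrete, here is a minimal, purely hypothetical sketch of where such a check could sit: a post-generation pass that flags a draft reply when the user's messages show isolation cues but the reply never nudges them toward another human. The cue lists, names, and thresholds are invented for illustration; nothing here reflects how mpathic or any foundation model builder actually implements this.

```python
# Purely illustrative sketch: a post-generation guardrail that checks whether a
# draft bot reply encourages real-world connection when a user's messages show
# signs of isolation. All keyword lists and names are hypothetical; a production
# system would use validated clinical models and human review, not keyword matching.

from dataclasses import dataclass

# Hypothetical signal lists -- far too crude for real clinical use.
ISOLATION_CUES = ("alone", "no one to talk to", "nobody cares", "can't leave the house")
CONNECTION_CUES = ("friend", "family", "therapist", "talk to someone", "reach out", "go outside")


@dataclass
class GuardrailResult:
    isolation_detected: bool
    reply_encourages_connection: bool

    @property
    def needs_revision(self) -> bool:
        # Flag replies that ignore isolation cues instead of nudging the user
        # toward another human -- the "harm" case discussed above.
        return self.isolation_detected and not self.reply_encourages_connection


def check_reply(user_messages: list[str], draft_reply: str) -> GuardrailResult:
    user_text = " ".join(user_messages).lower()
    reply_text = draft_reply.lower()
    return GuardrailResult(
        isolation_detected=any(cue in user_text for cue in ISOLATION_CUES),
        reply_encourages_connection=any(cue in reply_text for cue in CONNECTION_CUES),
    )


if __name__ == "__main__":
    result = check_reply(
        ["I've been alone all week and have no one to talk to."],
        "That sounds really hard. Tell me more about your week.",
    )
    print(result.needs_revision)  # True: the draft reflects back but never points outward
```

The only point of the sketch is that a rule like "encourage real-world connection" can be checked after generation, with a human in the loop on anything flagged, rather than relying on the prompt alone.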

Deep: Grok is kind of wild, actually, because I only use it in the car, and I'm driving in the car and I'm talking to it.

And if you let out, like, one swear word, it just turns into a potty mouth, and I'm like, wow, that's fascinating. They're really coding in this mirroring thing. It's very different, and Gemini's very different. Yeah.

And Anthropic is very different from OpenAI also, like

Grin: personalities and yeah, like 

Deep: OpenAI started off as, like, you ask me a question, I vomit a whole page and a half of stuff at you. And it's great, but that's, I don't know, the opposite of active listening or whatever, you know. And now it's maybe a little bit more conversational, but it's definitely geared more towards question answering.

Grin: I think it still struggles with that. Again, if you look at Inflection and their model, it's way more conversational. Or, like, Ash just came out from Slingshot, which is supposed to be the AI therapist. But I would say, I used to be in conversational design for chatbots. That was one part of my history I didn't talk about.

And that is an area that I don't think any of them have fully cracked, especially the brief-responses thing. It's a lot of information giving, which, if we go back to the early research on what creates empathy, large chunks of information giving do not actually build rapport. That's what you were saying.

No, my users, like listeners, tune out if someone's doing a soliloquy. It's the same for humans, same for chatbots. No one likes that. I think they're still working out how to make a conversational agent. When we had to design them by hand, that was a lot easier than now, being like, okay, I want these windows for each type.
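As a toy illustration of the brief-responses point, here is one way a designer might cap information-giving: trim a draft reply to a small word budget and hand the turn back to the user. The 60-word budget, the sentence-splitting regex, and the closing question are arbitrary choices for the sketch, not anything Grin describes her team using.

```python
# Toy illustration of the "brief responses" point: keep a conversational turn
# short instead of dumping a page of information. All numbers are arbitrary.

import re

MAX_WORDS = 60  # hypothetical budget for a single conversational turn


def shorten_reply(draft: str, max_words: int = MAX_WORDS) -> str:
    """Keep sentences until the word budget is spent, then invite a reply."""
    sentences = re.split(r"(?<=[.!?])\s+", draft.strip())
    kept, used = [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if used + words > max_words and kept:
            break
        kept.append(sentence)
        used += words
    trimmed = " ".join(kept)
    # If we cut material, end with a prompt back to the user instead of a lecture.
    if len(kept) < len(sentences):
        trimmed += " Would it help to go into more detail on any of that?"
    return trimmed


if __name__ == "__main__":
    long_reply = "Here is another fact about sleep hygiene. " * 20
    print(shorten_reply(long_reply))
```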

Deep: Well, listen, we're a little bit over on time. I'm gonna end by asking you the question I always end with. It's like, you know, fast forward out five or ten years, all of your dreams for mpathic, you know, happen. You're able to kind of steer it.

What's the world we live in like? Give me the utopian and the dystopian versions.

Grin: I mean, utopian is, if I go back to our clinical trials use case: human doctors are not using their time to review 500 hours of video and audio recordings. Their time is being used, you know, to do the things that they're trying to do in clinical settings, and to be with people.

And the AI can handle these monotonous, bad tasks. I feel like we're almost there in real life for a lot of things, with human doctors and ambient note taking. But that's one thing I would love to see on the human data end. I would also love to see these models have a lot of the things that we've talked about.

In terms of being safe and not causing harm, and potentially being better conversational partners that help people be better people in the real world. But I am very much a tech optimist and AI optimist around the role of these bots. There are many people in the psychology field who are like, this is leading to isolation and decline, and I am not one of them.

I think if psychologists can have a seat at the table in shaping this, that will not be the case. If we can be there to help shape these things, which one in ten people are using at this point, to help people be better, that's greater reach than any therapist has in a one-on-one therapy room. So I think we need to lean in and make these things help us to be better humans.

And so that would be the perfect world: AI helps humans to be better humans.

Deep: And the dystopian, uh, if stuff goes awry, what does that look like?

Grin: Do we have to end on the dystopian? I love ending on the dystopian high point. I was like, yeah, being better humans, and... You know, dystopian would be what you're pointing out: hyper-reliance. And I could even see augmentation or filtering, kind of what you're seeing with, you know, visual AI and everyone face-tuning and stuff.

I can see that happening with these agents, them giving prompts in real time. You don't know if you're speaking to a human or an agent, even in real life. Are we this fully augmented, singular personality, and is our final form of resistance basically non-compliance? And I think we're already seeing a lot of that culturally, like the rise of psychedelic culture and EDM, and the polarization politically. There's a lot of humanness happening right now in the middle of this kind of homogeneity around, like, we're all talking to this one bot.

So I would see the dystopian being people seeking out authentic experiences and excluding tech, to the detriment of themselves and others, and then other people being completely in isolation, just talking to bots all day. So, you know, some of that is happening. But if you play the tape forward in a very dystopian way, I could see something like that.

Deep: Yeah, I don't know. I think, like everyone else, I struggle to envision the world that we're building. I feel like the utopian world we could get to is more Star Trek-like, and the dark world we could get to is more Star Wars-like. Honestly, I'm really encouraged to hear that the foundation model builders are reaching out to you and your team and people like you, because I do think at the very, very top of this chain there need to be psychologists and sociologists really involved.

My take is, it feels like an incredibly short time period in which we went from needing to go through a whole mess of FDA interaction to get any kind of a bot to have any kind of a sensitive conversation with anyone touching on clinical stuff, to a world where the FDA doesn't even have a seat at the table.

And I think that's problematic. Not that we're all,

Grin: You're gonna see, I mean, you're seeing more regulation happening. I just question how effective that's gonna be, but it'll,

Deep: It'll come as an after-effect, yeah. And their natural instinct will be like, well, we can't mess up this gigantic economic engine.

So it's really kind of about, can we get people on the inside?

Grin: I agree completely. And maybe this is my whole mission. Yeah. 

Deep: Yeah. And maybe, like, you know, Sam Altman and Google, maybe they're not as big assholes as Zuckerberg was with social media, and maybe we actually get somewhere. But, like, I don't know. And also, to all of our benefit, the business model is, you know, more API-driven right now. It's not ad-driven, which kind of coaxed the Facebooks and Instagrams into this world of, you know, pushing 14-year-olds towards being suicidal, because it's all about engagement optimization.

It's been awesome having you on. Thanks so much for coming on. This was really fun. I could talk to you for hours, but

Grin: Yeah, 

Deep: I know, we can do, we can,

Grin: We can do a long form, just like multi-hour iterations on it. I'd be happy to do it.