Structuring Enterprise & Process Automation Data Using AI with Slater Victoroff

Your AI Injection

Is AI an ally or adversary? Get Your AI Injection and learn how to transform your business by responsibly injecting artificial intelligence into your projects. Our host Deep Dhillon, long term AI practitioner and founder of Xyonix.com, interviews successful AI practitioners and domain experts to better understand how AI is affecting the world. AI has been described as a morally agnostic tool that can be used to make the world better, or harm it irrevocably. Join us as we discuss the ethics of AI, including both its astounding promise and sizable societal challenges. We dig in deep and discuss state of the art techniques with a particular focus on how these capabilities are used to transform organizations, making them more efficient, impactful, and successful. Need help injecting AI into your business? Reach out to us @ www.xyonix.com.

Structuring Enterprise & Process Automation Data Using AI with Slater Victoroff

September 29, 2022 • Deep • Season 2 • Episode 8

Learn how AI can help transform unstructured enterprise and business process automation data into valuable insights with Slater Victoroff, founder and CTO of Indico. Indico is an enterprise AI solution for unstructured content, with a focus on document understanding and process automation. In this episode, Deep and Slater discuss the fundamentals of multi-persona data science platforms and the complexities of defining and building models for unstructured data, and take a deep dive into the advantages of hierarchical labelling and a multi-label approach.

Slater begins by discussing what motivated him to found Indico, and talks about how Indico’s unstructured data platform allows enterprises of all sizes to automate, analyze, & apply unstructured data to a wide variety of workflows.

Learn more about Slater here: https://www.linkedin.com/in/slatervictoroff/ and Indico here: https://indicodata.ai/

[Automated Transcript]

Deep: Hi there. I'm Deep Dhillon. Welcome to Your AI Injection, the podcast where we discuss state-of-the-art techniques in artificial intelligence with a focus on how these capabilities are used to transform organizations, making them more efficient, impactful, and successful.

Welcome back to Your AI Injection. This week we'll be discussing how AI can transform unstructured enterprise data into valuable insights with Slater Victoroff. Slater is the founder and CTO of Indico, an enterprise AI solution for unstructured content with a focus on document understanding and process automation. He has years of experience building AI and deep learning solutions, and even made it onto the Forbes 30 Under 30 list.

Slater, tell us a little bit about yourself. Why did you start Indico and what is it about Indico that you think is really speaking to our times? Right.

Slater: Three great questions. Maybe let me start. I'm Slater Victoroff. I am the CTO and founder of Indico, so that's, uh, somewhat relevant. I'm also an EIR at, uh, .406 Ventures.

And for folks that don't necessarily know what an EIR is, that's an executive in residence, which means that I spend a lot of time doing technical due diligence on companies. So that's another little bit of perspective. As far as why I created Indico, I think the simplest way I can tell this story is around one quote that I told to a professor of mine in 2012, uh, that I sometimes say is the most wrong I've ever been.

Uh, and I said to her, "The war is over. Deep learning lost." I was completely certain of this as, you know, a sophomore in undergraduate who had had, you know, a couple of... What year was this? 2012, which is important. Which is very important, right? Because in 2012, that wasn't a strange position to have.

Right. I mean, in 2012, that was basically the majority position in AI. But AlexNet... 2012 was the year that everything changed. Yes. Um, so very, very roughly, I was doing competitions with a friend of mine and I realized that I was completely wrong about deep learning, and Indico started as this goal to really figure out what we could do to make this incredible technology accessible.

Right. And, you know, that was the question. That was the problem that we saw as kind of the kernel of this. Things change, right? You know, we've been at this for eight, nine years. The world is a very, very different place from back then, you know, when we were writing our own CUDA code and hand-modifying Fortran 77 to get ourselves...

Right. Uh, and so really what Indico has settled on today, you know, we've got three big technical pillars, but really before that, what I like to say is our philosophy is bionic arms, not androids. And that's because the position that we fill is very much trying to be this connective tissue, both of the organization, but very specifically between humans and AI.

Uh, and we work in the document processing space, and really with all manner of unstructured data. So document processing is, like, super, super hot. But you can also imagine anything that is image or text kind of falls under that very broad brand.

Deep: Yeah, I mean, that was a question I had for you. Just perusing your website and looking at some of the background info, you know, "we handle all unstructured data." As you know, that's kind of a pretty general world. So what specific areas do you focus on? I see a lot of RPA-related stuff in there, so process automation work. Is it really the kinds of docs that you encounter in the enterprise that you're focused on, and the types of imagery that you might encounter in an enterprise context, or is it something else?

Slater: It's a really, really good question. And I think the reason that we make this claim is because we've got a really different model from what I think a lot of people would assume.

So we don't, for instance, do what is the predominant way of doing things, which is to say, hey, you know, there's this very specific thing, like mining 10-Qs, right? And we're going to create a model that does mining 10-Qs, and we're gonna release that. We don't do that. What we instead do is we have a platform, right?

And well, a product really that we provide to customers that allows them to build their own models so quickly and so easily that they don't need to go and, you know, spend, spend a huge amount of time kind of vetting something, uh, externally. So, you know, we, we will, you know, integrate with folks. You know, if you've got something that's a, you know, particularly good model for a particular domain, you know, you'll have that in the platform.

But really, when I say sort of the connective tissue, it's all of that staggered loop learning and the human-in-the-loop stuff, the connections to upstream and downstream systems, you know? Yeah, process automation is a great example of where things do get a bit complex. We might connect with RPA maybe upstream and downstream, but unstructured processes have their own unique complexities.

Deep: Do you have labeling and annotation capabilities? And do you help companies handle a workforce of labelers, and do you facilitate that across the entire spectrum of things that you might label within images, like segments, image-level labeling, and temporal segments even?

Slater: So we are much more, uh, kind of language and document centric. Right. So things like temporal segments, right? We're not very deep on video. We're not very deep on time series or things along those lines. But other than that, you've absolutely got the characterization. And, you know, I think it's very interesting the way you talk about it. It's like, oh, you know, it's a labeling tool.

And I think that's a really, you know, data scientist centric way of thinking about the world. I think it's very, very fair. But I think that one of the things that we really have realized is that what in one pane looks like a labeling tool, the function that it has to play within the organization actually gets much, much deeper than that.

Right. So you think about the things that you actually have to offer to effectively get you from "I have a problem" to "I've got something in production that is going to effectively learn continuously and have a change management process," right, and all that.

Deep: Let me see if I can guess a little bit further then.

So, yeah. In addition to facilitating labeling of a particular problem, with more of a focus on the kinds of stuff you would encounter in the enterprise, you also facilitate the construction of models, maybe even the parameter optimization, the hyperparameter tuning of models, the deploying, the runtime inference of the models as well.

Slater: Um, certainly all of that. Explainability is really, really key. I think one of the things that's important also is that this is not exclusively a tool built for the data scientist. Actually, Gartner has a new term for this that I absolutely love: multi-persona data science platforms.

Um, and, and that's what it is, right? Because actually, It's not like this is for the data scientist and then, you know, a subject matter expert comes in and labels some data. It's like, it's just as much for the subject matter expert that is processing the contracts day to day as it is for the data scientists.

So we, we kind of view those as two equal users, you know, and, and that's why I would say in practice it's not really that they've got like a team of, of labelers, right? Cause labeling is never the goal. Labeling is usually just some like weird side process that you have to do cuz your stuff is. So instead it's like we plug into the production process just the way that it works today.

Subject matter expert can use the thing even before any AI is turned on at all. Right? But then the data scientist also doesn't have to go through all that hassle of like, Oh, I've got to find some data and I've got to do some separate labels and all that. They're actually like right there in the stream that the subject matter expert is sending along.

Like they've got metrics, they've got models automatically tuning on that. And then yes, you know, you've got all the things that you talked about, but also a really rich set of explainability tools that facilitate that conversation between the subject matter expert and the data scientist to help really get the problem framing right.

Cause it turns out that that is really the hardest problem in practice.

Deep: So maybe walk me through that a little bit. Like, what does that mean to facilitate the problem framing from a practical standpoint?

Slater: Yeah, so, so usually what, what happens, and I'll, I'll use a really specific example.

You know, let's say you've got some insurance company and they've got insurance policies. Over the years, they've come out with different insurance products that are going to be, you know, different in maybe significant ways. Maybe you've got life insurance and health insurance and, you know, like ice cream truck insurance. Or maybe these are more marginal things, like, oh, there's the silver 2007 life insurance policy and the gold 2018 life insurance policy, but you can only get that one in Connecticut.

You know, stuff like that. So they've got this huge, huge, huge kind of sea of stuff and they want some structured information on top of that. Now, in the data scientist's world, right, the problem hasn't necessarily started yet, because they actually really need to figure out that very critical piece of what structure they want to put on top of these documents.

Deep: Yeah. And that's where the interesting conversations start, right? Like, that's what we do all day long. We're a custom machine learning AI shop. That's where it gets interesting, right? It's like, what do you actually... why do you think you need that? Let's validate whether that really matters.

If we build that, like, all those sorts of questions.

Slater: Right. And what we find is a really, really helpful thing: it's all about accelerating the rate at which you can do those experiments. You know, we find the non-technical person has to be able to be like, "I think this is the right schema," and try it and label a couple documents and have, you know, maybe this is not the perfect model they'll ultimately end up with, but some very rapid feedback to let them know, like, hey, is this kind of working? Where is this running into problems?

Deep: So, like, rapid, early, quick labeling, propagation of those early, let's call them mediocre, models at the beginning against the back corpus, and then active-learning-driven corrections of mistakes, something like that.

Slater: Uh, something like that. I shy slightly away from the active learning term, not because it's not wholly accurate, but just because I think that usually means pure statistical active learning.

And for us it is very much a human-driven process. And I think that is an important distinction. So there is a process of, like, I've got this schema totally right and I'm labeling things for you...

Deep: Human driven in the sense that you may not yet have the labels correctly defined. The label universe is still evolving.

Slater: That's a huge part of it, yes. In fact, that's the first 80% of the process, I'd say.

Deep: Okay. So your label universe definition, you want to basically quickly facilitate this. Now, your text pieces, is that operating at the document level, the sentence level? Are they spans? Is it categorization, or is it all of the basic stuff for natural language?

Slater: All of the above and, you know, some more. I would say spans are kind of the truest answer to that. But we do have a really true multimodal fusion system on the back end here, which says, like, we handle things at the document level. So, you know, it can be anything from a totally raw image to just pure text with no image information, and it kind of handles that seamlessly and can leverage the relative information across that spectrum.

Deep: And then do you have hooks into, like, Mechanical Turk and all kinds of places where you can leverage crowds of labelers, or...

Slater: So here's the thing that ends up being really interesting.

You need very, very little data to do it. We find actually that people usually are either in a production state or very close to it by the time they've done 200. Which is actually not that crazy if you're staying up with some of this modern stuff, where people are doing zero- and one-shot learning and things along those lines.

Like, okay, that is believable. But I think one of the things that is really, really cool about that is that it means that you don't really need to do this kind of large-scale data labeling anymore. It's like, 200 examples is kind of what it takes to get to a consistent problem definition.

Deep: Sort of, if you have the right 200.

Slater: Well, that's the key, right?

It's not any 200. It's almost never 200 of, like, a database dump that they've got. In 99% of cases, right, they end up having to make the 200 from scratch because the existing data is just not usable, for some reason or...

Deep: Yeah. I mean, that's sort of the reality of these sorts of things, right?

Like, yeah, you start labeling, you have to think about what to label. So you get an idea, you grab, like, recipes, and now you're labeling ingredients or something, right? You're gonna get all the usual ingredients. But are you gonna get to nets and like, just kind of weird esoteric stuff?

Maybe not right away.

Slater: Well, I'll give you actually maybe a slightly better example with recipes, right? Okay. So the issue you run into is like, oh, well, we've been processing recipes for 40 years. These are the things you pull out of a recipe. And you're like, oh, what? Field five is, like, the hair color of the chef.

What? That's not in the recipe anywhere. And they're like, oh yeah, no, we actually Google that as a part of our process. And you're like, oh, wait a minute. And then maybe there's another field that's like overall weight of the recipe, right? And you're like, oh, wait a minute, that also clearly doesn't appear anywhere on the recipe.

They're like, oh yeah, well, no, I kind of pull these numbers out, which I don't track anywhere, and then I kind of add them up. So that's a lot of what it ends up being.

Deep: So you're speaking to the corpus itself. That content inside there may not represent what actually the business goal is or the thing trying to be...

Slater: It represents the business goal. It does not represent something that is ready for use by a machine learning process, right? Because even though what they're doing is often close to extraction or close to classification, it is never that exactly. And it's never the case that their historic data is so close to that ML task that you can use it as is. And that, I mean, is an unfortunate reality.

You know, we often have to show up at customers and be like, sorry, I know you've been doing this for 40 years. I know you think this data is super valuable. You've not been tracking it in the right way. You know, the...

Deep: Well, that's, I mean, that's a good point. And it's, it's sort of like, you know, maybe taking it out from your system in particular into a kind of a general machine learning space.

One thing that we find is there's a difference between labeling things for the, I don't know what to call it, but like the human playbook of whatever's been going on for a while. There's a very big difference between that and, and human labeling of data to feed into models, uh, you know, and algorithms that are gonna learn from it.

And one of the things that we find is that, you know, however humans have been doing something, humans have less specific labeling requirements, if you will. Because, you know, we... brains, they're inconsistent in the gaps. And we can ask people, and things can happen out of band, whereas with machines, you know, there's a very specific way of giving them examples.

That was like a question I kind of had for you. Like, do you run across scenarios... So one of the things that happens a lot is somebody has a product. The product does a thing, it's got an audience. The audience is sort of available for labeling in a sense. That's not their main goal. Like, they're off doing things, but inevitably, if you frame the problem correctly, they can provide labels. So let's say Gmail as an example. You've got email, and somebody at the very beginning of the Gmail thing had to decide, like, how do we prep the perfect responses given a particular scenario. But eventually, once you kind of juice the system, you bootstrap it.

Now there's a set of categories, you know, like somebody says something like, "Hey, you did a great job," and it's "Thanks," you know, canned responses. Once you get to the canned responses, now the users of the system are giving you labeling. The question I have for you though is how does your approach both leverage an audience for a given product in the labeling process, and how do you leverage other humans who are sort of more specifically building for the models?

Slater: Yeah. So I will say this is an ideal we haven't quite achieved, but it maybe well summarizes how I think it should work. I don't believe in a labeling process. I think the fact that we think about an explicit labeling process is actually, I think it's a problem.

Because there is that implicit division between what we are doing at labeling and what we're training at that time and what we're trying to accomplish in the real world. And I think that's actually very, very tough on both sides, because it kind of implies that the CSV becomes this black-boxy kind of handoff, and I think that's something that we found makes it very, very hard to lead to long-term success in problems like that.

So I think that the way that we tend to frame it is we actually say the subject matter expert, right, what they're doing is not labeling data. What they're doing is teaching the machine, which is very, very different, I think, from the traditional framing. The traditional framing is like, okay, you're figuring out the data and then there's some magic black box that is the data scientist that the data goes into, and then you get some magic app out the other side.

So we say by changing that paradigm really significantly, you actually have to give, we think, the non-technical person ownership over this: how this data is being tagged, what is being extracted. The data scientist is there primarily in an architectural and advisory role, right? And we actually think that's where they can be really, really effective, helping them when they're like, "Hey, these two fields, I can't model what's going on."

And the data scientist can be like, "Oh, I'll click through. Oh, hey, you got some bad recall. Oh hey, look, these two people are labeling it in totally different ways. Go fix it." So that's kind of, when we think about what is the happy loop cycle, that's actually much more what it looks like.

And, and I think that we try to lean into this idea that you wanna fit into the existing thing that is happening at the business. Like someone is processing these contracts today, So as much as possible, right? You want to fit into the way that they're processing contracts today and, and sort of to your point earlier, you want to figure out how to fit into that cycle.

To get the natural, you know, sort of supervision that they're doing anyway. But, but crossing that gap is, is not easy. You know, coming up with those categories in Gmail, you know, maybe those auto responses aren't, aren't the best, but, but maybe, you know, looking at the difference between those auto responses and that tab auto complete, that just gives you a couple of words.

Mm-hmm. It's like a very, very different kind of interaction.
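
The earlier remark about a data scientist spotting that "these two people are labeling it in totally different ways" can be made concrete with an agreement score. What follows is a minimal sketch only, not a description of Indico's product: it assumes two annotators have labeled the same documents with one class each, and it uses Cohen's kappa, a standard chance-corrected agreement statistic chosen here purely for illustration. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of documents on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two subject matter experts on six policies.
annotator_1 = ["life", "life", "health", "health", "life", "health"]
annotator_2 = ["life", "health", "health", "life", "life", "health"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A low score on a particular field is a prompt for the conversation Slater describes (agreeing on what the field means) rather than a reason to collect more labels.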

Deep: You said something that I'm trying to unpack a little bit. It sounded like you have some kind of dialogue that happens between data scientists and a labeler who, I think you called them a non-technical person, but it's like a subject matter expert.

Okay. So you're a subject matter expert. It sounds like they're operating in the enterprise with a particular thing they're trying to do. Like, I don't know, it's a fleet of lawyers somewhere, and there's a particular type of law that they practice, and they're labeling, whatever.

Slater: Whatever the data you're processing with the ML, that's the person that processes that data without the ML.

Deep: What does that dialogue look like? Like, how does the non-technical person find out about a given label's efficacy? I assume you're maybe reviewing efficacy, and the data scientist is somehow telling them, "I need more examples of this, I need less..."

Slater: Totally. Yeah. So, you know, with these nice new transfer learning models, right?

If you get the timing right, you know, someone can just like label even a couple dozen examples and get this really powerful kind of live feedback from the model. And even if they're low confidence predictions at first, right? If you think about the UX very intentionally, you know, you, you can present that in a really reasonable way to them.

And I think what we found is giving them that piloting ability, giving them that ability to even often just be like, "Hey, actually, these three labels are totally wrong. It shouldn't be these three classes, it should be these five classes instead." And I think just really importantly, they're never going to get it right the first time.

We found that telling them to be very intentional in the early days doesn't help that much. You know, automatic model training is the way that we do it, and directly displaying these results to them in context as much as possible.

Deep: If you go back a level, do you find yourself having a hard time before your customers even know about your product? You have to kind of talk to them a little bit about what they're trying to achieve.

And if you're interacting with data scientists, you know, they've already filtered the problem into a language that's easy for other data scientists to understand. But if you're talking to these non-technical folks, you have to assess if the product's a fit. How do you do that?

Slater: We really push them for specificity.

I think that's why we try to center the conversation around data as much as possible. And look, there's plenty of heuristics that we use, but nine times out of 10, it's whether someone is like, "I understand that coming up with a consistent schema and labeling this data for the outcome I want is my responsibility."

That's number one. They've gotta get in with that mentality. Because if they don't believe that, you know, if they think the AI is this magical, autonomous thing that's going to guess what they want, we just know they're not going to get to success. And then the second thing is making sure that they can actually put together 200 examples of this.

I think 200 is a really, really important piece because, look, if it is a useful example that you should be using ML on, where you've got real volume, then 200 should be a drop in the bucket, right? This should be the easiest thing on the planet. And if you can't actually process 200 in a consistent way, we know this is not a real use case.

We know like nobody actually cares about solving this or it is not close enough to being solvable that someone can actually define the problem. And, and that's, that's a huge, huge.

Deep: So one of the challenges that you run across when you're defining a label space like this is category boundaries. And, you know, there are very different approaches to labeling. A common approach, which I suspect you're following, is this binary labeling approach, positive examples of the label and negatives, but there's also, like, multi-labeling.

Slater: We do not, actually. We found 90% of everything is multi-label. That's my short answer to what you say.

Deep: That's interesting, because that's what I say, and that's what our business does: we do an awful lot of multi-labeling projects. So that's interesting.

Slater: Yes. Multi-labels are the default, like 100%. Even when the category boundaries are fuzzy in, like, 10% of the cases, it's just, having that additional power, it's just a better way of approaching those problems. I think there are exceptions, but they're few and far between.

Deep: It feels to me like one of the challenges with multi-label labeling, as you know, is you have to be able to wrap your head around the full label universe.

Because if you don't label something, then the presumption is that it's not that thing. And the other challenge with multi-label labeling in this context is that you start labeling and then, you know, you may, I don't know, label a few hundred, a few thousand things. Now you come up with a new label. In theory, you gotta cycle back through and, like, you know, filter. How do you address that problem?

Slater: Yeah. So, great on both those fronts. First, on understanding the universe of labels, I think that's a huge point. And I think actually that is a problem that you see with a lot of problems that people think are flat classification problems. My major thought here is, just like most classification problems are multi-label, I think that most classification problems are also hierarchical.

Deep: Well, you are like in a mind meld, because this is what we...

Slater: Oh, I just steal the words from you.

Deep: This is literally what we've been doing for the last few years.

Slater: I, I, I don't think you can reasonably think of more than a dozen labels at a time.
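
To illustrate the combination of hierarchical and multi-label labeling discussed here, the following is a rough sketch under stated assumptions, not either company's implementation. It assumes the hierarchy is stored as a child-to-parent mapping and that labelers tag only the specific labels; ancestor labels are derived. The label names borrow from the insurance example earlier in the conversation, and the function names and the `regional_products` category are hypothetical.

```python
def ancestors(label, parent):
    """Return the chain of ancestors for a label, nearest parent first."""
    chain = []
    seen = {label}
    current = parent.get(label)
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in hierarchy at {current!r}")
        chain.append(current)
        seen.add(current)
        current = parent.get(current)
    return chain


def expand(labels, parent):
    """Add every ancestor of every label, giving the full multi-label set."""
    expanded = set(labels)
    for label in labels:
        expanded.update(ancestors(label, parent))
    return expanded


# Hypothetical hierarchy: child label -> parent label.
parent = {
    "silver_2007_life": "life",
    "gold_2018_life": "life",
    "life": "insurance",
    "health": "insurance",
}

# A document tagged with two specific labels by a subject matter expert.
doc_labels = {"gold_2018_life", "health"}
full_labels = expand(doc_labels, parent)

# Regrouping a parent category later does not require relabeling documents:
# only the mapping changes, and the expansion is recomputed.
regrouped = dict(parent, gold_2018_life="regional_products",
                 regional_products="insurance")
regrouped_labels = expand({"gold_2018_life"}, regrouped)
```

Because labelers only ever choose among the specific labels under one branch, they rarely have to hold more than a handful of options in mind at once, which is the point made above about a dozen labels at a time.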

Deep: Yeah. And when you frame things hierarchically, one of the advantages as a labeler is you can be labeling really specific cases, and then later on you can rearrange the boundary definitions of parent categories.

Slater: And the other good thing when you're taking that multi-label approach at the same time, to your second piece, is that when I've got a new category to add later on, actually it's not as bad.

Like, if I were trying to keep them as distinct categories, I'm just shit outta luck. I've gotta start from scratch. Instead it's like, okay, I don't have to start from scratch. I have to find a couple of indicative examples, and then that's where some active learning stuff would actually be helpful to plug in, in a situation like that.

Right. Like, find historic stuff in the corpus that we probably also, you know, have mislabeled there.

Deep: Interesting. So, like, one of the challenges we have, and maybe you have the same challenge, is for certain problems... You mentioned, hey, you get a couple of hundred examples, life is dandy. But for some problems, particularly visual ones, what we find is not all labels are created equal in terms of the importance to a customer.

And so you find that there's some particular case that's much more important, and you really have to focus on it. And it can be very nuanced. It can be a complex image scene, and there's something really specific in there that you're trying to go after.

And so one of the things that we found is that building multi-label models gets you a good chunk of the way there. It gives you a nice backstop, but we've actually started layering binary models on top, and even co-mingling the binary labeling universe with the multi-label labeling.

Slater: Yeah, well, I guess we really are mind melding, because that's exactly what I was gonna say. I think that's exactly right. If you really do have this case that is super specific, and in this region a totally different set of rules applies, right? That's a different model.

It's actually another case that I would say is very, very common. It might not even be the case that you've got one edge case; maybe there are five or 10 really important edge cases, and you need specific metrics on those edge cases. It's why the cycle time and iteration speed is so important, because very often the answer is: hey, we thought this was one model, but by the time we were done and had actually mapped out the whole process, this thing is 12 models.
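[Editor's sketch, not from the conversation] One way to picture the layering described above is a multi-label model acting as the backstop, with a dedicated binary model taking over any label that has one. The function names, the keyword-based toy models, and the 0.5 threshold are all assumptions made for illustration; neither speaker describes their actual implementation.

```python
from typing import Callable, Dict

# Hypothetical model signatures: each takes text and returns scores in [0, 1].
MultiLabelModel = Callable[[str], Dict[str, float]]
BinaryModel = Callable[[str], float]


def predict_with_overrides(
    text: str,
    multi_label: MultiLabelModel,
    specialists: Dict[str, BinaryModel],
    threshold: float = 0.5,
) -> Dict[str, bool]:
    """Use the multi-label model as a backstop, then let a dedicated
    binary model decide any label that has one."""
    decisions = {
        label: score >= threshold for label, score in multi_label(text).items()
    }
    for label, model in specialists.items():
        decisions[label] = model(text) >= threshold
    return decisions


# Toy stand-ins for trained models (keyword rules, purely illustrative).
def toy_multi_label(text: str) -> Dict[str, float]:
    lowered = text.lower()
    return {
        "invoice": 0.9 if "invoice" in lowered else 0.1,
        "late_fee": 0.4 if "fee" in lowered else 0.05,
    }


def toy_late_fee_specialist(text: str) -> float:
    return 0.95 if "late fee" in text.lower() else 0.02


result = predict_with_overrides(
    "Invoice 17 includes a late fee.",
    toy_multi_label,
    {"late_fee": toy_late_fee_specialist},
)
print(result)
```

Here the multi-label model alone would miss the "late_fee" label (score 0.4), while the specialist catches it. Adding a thirteenth model is then just one more entry in the `specialists` dictionary.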

Deep: It's funny that you and I are having this conversation. I've never felt like anyone else in the ML universe really appreciates these two issues. Almost everyone else that I know of is in binary labeling mode, and they're in a heavily active-learning-driven mode within binary labeling.

Slater: It's because it's so much easier, right?

Deep: Yes, it is. And it's easier to define categories that way.

Slater: And to define your efficacy metrics, most of all. Right. Yeah. Because the second you're outside of binary categorization, the question of what I'm even optimizing is suddenly tremendously non-trivial. So people try very hard to make things fit into that mold.
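[Editor's sketch, not from the conversation] A small worked example of why "what am I optimizing" gets harder outside binary classification: the same multi-label predictions score very differently under micro-averaged and macro-averaged F1. The label names and toy data are assumptions chosen to make the gap visible.

```python
from typing import Dict, List, Set

Counts = Dict[str, Dict[str, int]]


def count_outcomes(
    truth: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Counts:
    """Tally true positives, false positives, and false negatives per label."""
    counts: Counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for true_set, pred_set in zip(truth, pred):
        for label in labels:
            if label in true_set and label in pred_set:
                counts[label]["tp"] += 1
            elif label in pred_set:
                counts[label]["fp"] += 1
            elif label in true_set:
                counts[label]["fn"] += 1
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(counts: Counts) -> float:
    """Pool all labels first, so frequent labels dominate the score."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts: Counts) -> float:
    """Average per-label F1, so a rare label counts as much as a common one."""
    scores = [f1(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    return sum(scores) / len(scores)


# Four documents; the model never predicts the rare label.
truth = [{"common"}, {"common"}, {"common"}, {"common", "rare"}]
pred = [{"common"}, {"common"}, {"common"}, {"common"}]
counts = count_outcomes(truth, pred, ["common", "rare"])
print(round(micro_f1(counts), 3), round(macro_f1(counts), 3))  # 0.889 0.5
```

If the rare label is the one the customer cares most about, the micro score of 0.889 hides a total miss that the macro score of 0.5 exposes. Which number to optimize is a judgment call that binary classification never forces.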

Deep: Yeah. But one of the things, and I think we're probably in the same ballpark, one of the things that we find is that customers very rarely need just one or two, or even a handful, of models built. It's usually hundreds, or many hundreds, of models. Yeah. And when you're wrapping your head around how to label that, you wind up in this multi-label universe pretty fast.

The other thing that I think we have in common is, so you've got these SMEs, so these are people who actually can wrap their head around the problem. In our case, we actually just hire bright college kids, because we can't rely on...

Slater: We do have an internal team as well, maybe to be clear. But that's actually another place where we depart, I think, very much from a traditional model.

But it sounds like y'all maybe do a similar thing, which is to not think about these as random third-party contractors that are gonna come and go, pick up a project, and leave. No, these are experts, right? These are people that are very, very good at what they do and need to be picking up new tasks quickly.

And I think that for a lot of people it's a compelling job in a lot of ways. I wish that people considered that kind of work more often as full-time employment internally, because we've just seen such a tremendous benefit from treating it that way.

Deep: Yeah, I know. We get asked the question a lot: well, why don't you just use Mechanical Turk for your labeling?

And it's like, look, if we get to the point where we understand the problem well enough to Turk it, then yes, of course we will. But usually there are a few other reasons why we can't. Security is one.

Slater: And frankly, usually by the time you understand it well enough to Turk it, you've got enough data that you don't need to do it.

Deep: Right, exactly. Yeah, that's exactly right. And meanwhile, you have all of these annotation companies that have the most bizarre pitches, which have always surprised me. Somebody's like, hey, I need a hundred thousand things labeled, which is the wrong thing to ask for. And then somebody else is like, oh, well, I'll charge you X amount per label, which is the wrong way to price something, because now you're incentivized for quantity, not quality. And in the first case, you're presuming that the hundred thousand is gonna cover the distribution of what you're trying to actually label.

Slater: It's something that's made me scratch my head for the longest time. Because you're right, it's a huge trend, and I think it is largely just a symptom of ML teams having just stupid amounts of money.

And I think there is, yes, this idea, which is not totally wrong but also not totally right, that no matter what they do, if they're not really sure what the next step is, then investing in data is a safe bet. So, like, I don't know what the next architecture is, but if I buy 10 times more data, I know I will see a model improvement.

So I think that's a lot of the ethos on the one side, and I think that's what props a lot of the companies up on the other side. I really question how much longer a model like that is going to last. Look at the really high-performing organizations, places like Tesla, and at how much ownership they take and how unique a process they have put around their labeling and their own flavor of human-driven active learning.

I think you're going to see more of that, because people are really starting to realize that data is not this fungible, one-size-fits-all, "I get it, that's good, thank you" black-box kind of interaction. It is this very holistic, organic kind of thing.

Deep: Also, I mean, I think there's a sociological element, right?

Like, as a graduate student doing anything that involves machine learning, I mean, maybe less so now that other disciplines are getting into it, but in the traditionally technical disciplines you're rewarded for algorithmic improvements. Nobody's rewarded for going out and finding better training data; they're rewarded because they came up with a new network architecture or some kind of interesting insight. And so when they go out into the working world, they carry that mindset, and training data is a burden. It's not a thing to move the needle; it's just this annoying thing you get stuck with.

Slater: Or it's what you have at the start of the process. Yeah, it's the CSV you got from your professors.

Deep: Yes. And it's fixed. It's presumed fixed, or it's, you know, standard, a specific corpus or whatever, and everybody's competing on it. But when you're out practically building models, I say this a lot to customers: look, if you have a choice between spending an hour getting more training data or a day tweaking your models, take the former, because you're absolutely gonna see more impact from doing something around the data.

Slater: I wanna say this was Richard Socher, but it might have been some other researcher, so I apologize if I'm mangling that. But it's like: if I had 24 hours to build a model, I would use 23 of them labeling data and one hour building the model.

Yeah. Yeah. I'm like, yes, absolutely.

Deep: I know, it's funny, because it's just not the way people think, you know.

Slater: And I'll tell you, the craziest thing is this weird kind of aversion to touching data. I don't know if this is quite right, but it's almost like people feel it is beneath them.

And it's something that I've always found really strange and very curious. I'll tell you something kind of funny: there was this one company, and they had a bunch of survey responses that they wanted to analyze, and they were looking to pay six figures to have someone analyze all these survey responses.

You wanna know how many there were? Less than 300. Right? These are, like, a paragraph each. And the second I got them, because I'd assumed this was a hundred thousand responses or something ridiculous, I'm like, well, have you considered reading them?

Deep: You know, it's funny that you bring this up. A little side note: hopefully nobody I interview is listening to this, but one of the questions that I ask is this whole thing about, like, rewind 15 years, pre-Google image search, and basically build one of these from scratch. And part of the setup is somebody just gives you a drive with a hundred thousand images in it, and all you know is they're numbered one to a hundred thousand, dot JPG. Hardly ever does anyone say, well, I'm gonna open up the images and just scan them and look at them. I mean, granted, not that many. Yeah, you look at a sample, you know?

Slater: And the thing is, that is exactly what you should do, right?

Deep: It's just...

Slater: And people will literally spend years of their life trying weird statistical tricks to avoid spending 10 minutes opening the images. I know, it's absolutely bizarre.

Deep: Need help with computer vision, natural language processing, automated content creation, conversational understanding, time series forecasting, customer behavior analytics? Reach out to us at xyonix.com. That's x-y-o-n-i-x dot com. Maybe we can help.

So I wanna change the topic just slightly. On your site, there's a lot of talk about RPA and BPA, this process automation world. Are those some clients that, for whatever reason, are really attracted to your platform? And tell me a little bit about why: what are they trying to do, where do they get stuck, and how does your platform help them?

Slater: Yeah, so I would call them more ancillary technologies. These are very strong, rich partnerships for us. Looking at automation writ large, there's been this really big explosion of RPA, which I would characterize as the structured side of automation.

Deep: Right. But just for the sake of our audience, can you explain what RPA is?

Slater: Yes, I can. RPA stands for robotic process automation. The best analogy I can use for someone who's a little bit technical is that it's basically Selenium with flow-based programming attached. Which is to say that it's very good for things like: open this window, click that, copy this value, paste it in here. Where you've got this very, very procedural kind of flow.

Deep: You've got these office workers that are doing something over and over, like maybe preparing a particular report. You've got a hundred accountants, and they're all going in here, opening this thing, closing this thing, transforming this thing, doing stuff, and now you can sort of script that at the window level or something.

Slater: Yeah. And I think a lot of developers don't understand how much work inside the enterprise that describes.

Deep: No, because we would never do anything this way.

Slater: Right, right, right. Exactly. It's alien.

Deep: Developers don't think to automate this stuff, because they're like, good Lord, why would you even use a UI?

Slater: Right, exactly. This is stuff like: I have to copy and paste all of these values from this email into this Excel sheet, and it takes me 10 hours, and I do this every week. And by the way, we've been doing this for 20 years, right throughout the enterprise. And that's what RPA is really, really good at.

And I think the specific way that RPA plugs in, which is kind of clever, is, like you said, usually if you could fix these architecturally, it would be really simple. But RPA does everything on this UI level, right? On this kind of click, see-what's-on-the-screen level. Because in a lot of cases these APIs are very difficult to access.

Or maybe IT is super overburdened, right? Or maybe the guy that built that system quit two years ago, or something like that.

Deep: Right. Or you just need different people, right? You could write an RPA process just by being the person who clicks and opens and closes and does the thing, as opposed to going off and talking to somebody who knows how to interact with some, you know, 1990-something SOAP.

Slater: It's the speed also, right? You get it done so quickly; you don't have to deal with six months of bureaucracy getting your request put in the queue. And a lot of these use cases are small enough that, when your standard-size IT project is gonna be seven figures in time and salaries, how are you gonna make time to solve the 10 hours every week?

And RPA is really meant to fit in for folks that are not usually able to code up automation on their own, and to make that more accessible.

Deep: Let me just try to guess. I might be totally wrong, but let me try to guess where you probably fit in. So the customers are looking at this thing and they're scripting something, and somehow they wind up in a window where a particular value in there, or a particular sequence of words or so, is what routes them to the next step. And so they need to stop and maybe make a model to extract that thing so that they can get to the next step. Is that...

Slater: That's a very good start. And that is initially how we thought about it too, and I think that's how a lot of people think about it. You might think initially: okay, I've got some process end to end, and one block in that process is "do the invoice", like, get the values out. Right. And there's a PDF or whatever. Yeah, something along those lines. And, sort of like I said earlier, it's not totally wrong, but it's not totally right.

And I think the place where we have been most surprised is that when you look at just the processing of whatever the unstructured document is itself, it turns out that it usually has its own internal process unto itself.

And the thing that's really interesting is, often you're thinking, oh, these might be coming in as email attachments, right? This might actually be a completely parallel path. I was very surprised. In the early days, I think we expected 80 or 90% of our use cases were gonna be exactly like you laid out: RPA gets the document, passes it to us, we do some thing, pass the payload back into RPA, off to the races. That's probably 30% of what we see. In the other 70%, the unstructured use case, the invoice processing, or, it makes more sense if you think about things like contracts, where you have this much more complex notion, these are just systems unto themselves that actually stand totally alone.

But they are the same kind of companies that are interested in this automation. And I think more of them hope that turning this into one consistent automation fabric in the future would be really helpful. But we actually see a division now. There are, like, structured folks and unstructured folks.

Deep: I don't know, I'm imagining, like, a fleet of accountants generating tax returns, and somewhere in there is a massive database of receipts, and somebody's like, oh, we've got the receipts; we should know the amount, and the company, and the type of company, and maybe we can use this thing that the RPA guys are using?

Slater: Yeah. I mean, that's decent. Let me maybe give a slightly better one that might make it a little clearer, because I think the problem with invoices and receipts is that they are definitely on the simpler side.

On the one hand, if you talk to folks that are used to the form side of the world, invoices and receipts are the high end of complexity. But in terms of the things that we deal with, invoices and receipts are kind of the simplest; there's only a little bit of ambiguity.

But think instead of something like a mortgage. Someone has to process your package of mortgage documents, and there are 30 different things in there, 400 pages of stuff. It's like: I've got to extract these values here, and then I've gotta compare them against this, and then I've gotta do this business logic.

And it turns out that for a lot of those operations, a lot of that "compare this to this and extract this, and you've gotta sequence this before that, and you need human review over here and not over there", we thought that RPA would be able to handle a lot of it. But it turns out that unstructured workflows are really weird.

And so a lot of the traditional tools that you might think of in RPA just don't map really well to those kinds of highly unstructured workflows. So we find, by and large, folks do them in parallel, and then they tend to multiply in terms of benefits to the organization. Maybe this person used to have 40 hours a week of this work.

Ten hours now is handled by their RPA box, and 20 hours now is handled by their Indico box. And maybe there's a really nice five hours that is right at the juncture of the two, or something like that. Which, very truly, we did not expect. But I think a lot of it is that the complexity of these unstructured flows and these document flows is just much, much higher than folks expect.

It is not classification. I mean, classification sometimes happens, but it's kind of rare and more simple. Extraction, even, is one of the simpler things we deal with, and, in the ML scheme of things, custom named entity recognition on small data is a tricky problem.

But even that, I think we would characterize as an easier problem than complex multi-stage reasoning over a 200-page contract.

Deep: I can imagine a lot of nuance here. I mean, this is kind of the challenge with how this RPA stuff is evolving in many ways. It's like, okay, I can automate this piece, and then I get stuck, and now I've got this bridge to cross to get to this other piece.

And so then maybe you set up the RPA piece to just fill up a queue at that place where you get stuck, and then you get across the bridge, and then you RPA a little bit more.

Slater: In some cases, that works really well. I think, though, what RPA has done that's really amazing for us is they've just triggered this whole mental shift around automation.

People are hungry for it, and I think it is clear that RPA has kind of stoked this appetite that they can't fulfill. People clearly have this desire for automation well beyond what RPA can provide. And you see all of these industries now popping up that can provide pieces of this.

Right? I mean, task mining, I think, was another one that started off almost as a subsection of RPA and now is its own separate, you know...

Deep: You said something at the very beginning when we were chatting about the value that your system brings in monitoring models once they're deployed. Yes. And these kinds of workflows must get very awkward to debug, and you must get some weird bugs in there, like a lot of them at once, because somebody ships a new version of software and the look and feel changes, and your visual imagery model, whatever that model was, might not yet have learned this new transparency look or whatever got rolled out, and now all your clients who had that piece suddenly break. Do you get anything like that? Or...

Slater: We have to work very, very, very hard not to. The magic of Indico is, I can say no, right? You actually don't have that problem. We have totally... It was really hard.

Deep: I didn't see how you could address it. I mean, unless you're artificially contorting your imagery so that you can create fake data to guess how things are gonna change or something.

Slater: So I guess there are some elements there of just being robust, but it's actually different than that point. And I appreciate that, because that should be your... That was kind of one of our sayings in the early days: if you believe what we are saying, then you're not paying close attention.

And so a lot of what it comes down to is, you have to really change your mind about what production looks like and what production is defined by. And that's really the key, because you're correct: if you use these two assumptions that I think a lot of people bring to the problem, you will never solve this problem.

Number one, if you think that learning happens in a closed loop, which is like, I have to be able to learn from each new incremental example as is, you won't be able to solve this problem.

Deep: Oh, I think I get where you're going. Okay.

Slater: Yeah. And then the second thing is, if you go to the other side and believe that your model is totally static, you will also never solve this problem, right?

Because you don't have any recourse.

Deep: So the people, let me just guess, that are building the flows, the process flows, they're also your SMEs. They're also the ones who are providing the training data. They're never actually out of the loop, not completely, because they always have to spot check for you and they always have to help in the monitoring.

And so they're gonna be the ones to actually flag the problem in the first place, and they're gonna be the ones to give you a few more examples. Model gets shipped, and then boom, it sort of starts feeling seamless.

Slater: Exactly. So you don't get blindsided. Right, so you never turn it all the way off. You maintain this thing at all times, which is just like, this is a knob, and this knob is gonna turn up and it's gonna turn down, but it's gonna turn up and down naturally over time.
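[Editor's sketch, not from the conversation] The "knob" idea can be pictured as a review policy with two dials: a confidence floor below which every prediction goes to a person, and a spot-check rate that keeps sampling confident predictions so human review is never fully off. The function name, parameters, and thresholds below are assumptions for illustration, not a description of Indico's product.

```python
import random
from typing import Optional


def needs_human_review(
    confidence: float,
    min_confidence: float,
    spot_check_rate: float,
    rng: Optional[random.Random] = None,
) -> bool:
    """Route a prediction to a person if the model is unsure, or if it is
    randomly sampled for a spot check (the knob that never reaches zero
    in practice)."""
    if rng is None:
        rng = random.Random()
    if confidence < min_confidence:
        return True
    # rng.random() is in [0.0, 1.0), so a rate of 0.0 never samples
    # and a rate of 1.0 always samples.
    return rng.random() < spot_check_rate


rng = random.Random(0)  # seeded only so the demo is repeatable
unsure = needs_human_review(0.40, min_confidence=0.80, spot_check_rate=0.0, rng=rng)
confident_knob_off = needs_human_review(0.95, 0.80, 0.0, rng=rng)
confident_knob_full = needs_human_review(0.95, 0.80, 1.0, rng=rng)
print(unsure, confident_knob_off, confident_knob_full)  # True False True
```

Turning `spot_check_rate` up after an upstream change (say, a redesigned document layout) and back down once error rates settle is one way to read "it's gonna turn up and down naturally over time".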

And then the other thing that turns out to be really important: 80% of all production errors can be traced directly back to bad training data. 80%. Now, obviously that's a rough number; I can't say that in all cases, but I don't think anyone is surprised by that. So the other piece that ends up being really, really critical is that we've built all of this explainability around interrogating that. You can see any error that comes through the system and trace it back to what training data the model is relying on to make this decision.

Deep: Oh, that's... yeah. And what are you using there? Are you leveraging SHAP and some of those kinds of techniques, or...

Slater: Our own stuff that we built before them and that works better. But very, very roughly, the strategy all comes down to prediction attribution.

So I think we do have some of those more traditional tools, right? Things like saliency maps. Those are definitely useful, but I think useful in a different way. For us, really what it comes down to, a hundred percent, is: I need to be able to find that training data. So we gear almost everything around that.
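[Editor's sketch, not from the conversation] Slater does not describe how Indico's attribution works, so the following is a deliberately simple stand-in: given an embedding of a mispredicted input, rank training examples by cosine similarity and inspect the nearest ones for bad labels. The embeddings, example IDs, and function names are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_training_examples(
    query: Sequence[float],
    training: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k training example IDs most similar to the query embedding."""
    scored = [(example_id, cosine(query, vector)) for example_id, vector in training]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-D embeddings for three labeled training documents.
training_set = [
    ("doc_a", [1.0, 0.0]),
    ("doc_b", [0.0, 1.0]),
    ("doc_c", [1.0, 1.0]),
]
suspects = nearest_training_examples([1.0, 0.1], training_set, k=2)
print([example_id for example_id, _ in suspects])  # ['doc_a', 'doc_c']
```

With a couple of hundred training examples, a reviewer can read every suspect this returns, which matches the point made later in the conversation about small datasets being feasible to interrogate by hand.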

Deep: This is what machine learning has been for 20, 30, 40 years: going through training data, finding out, hmm, we're missing this, we need more of that. And you're just building it into the workflow, which is super, super powerful.

Slater: Combining it with the transfer learning stuff and those modern techniques, just because when you're doing that on a hundred thousand examples, you need a team to figure that out.

That is almost impossible. Even on 200 examples, it's very, very challenging. But at 200 examples it's possible; you can go through that, you can interrogate that and attribute it in a really meaningful way.

Deep: I would love to talk more about even just the prediction attribution part; we could go on for hours, but we're running outta time.

This has been a fascinating conversation. I'm gonna ask you one final question, at least for this episode. Maybe we'll talk more on the other topics. Ten years have passed, and what you're aspiring to do has happened, at whatever level of success you and your team can bring to bear. What's different in the world? How is the world better off? And what does it look like?

Slater: I'll answer that first for society, and then I'll maybe give a more specific answer. One of the things that's always been very shocking to us at Indico, working in so many of these mission-critical document use cases, and I don't mean to say that's the only thing we do, it's just very relatable... No, I get it.

You can have people doing the same process next to each other for 30 years, and they will not realize that they have been doing it differently. These errors compound. And the things that we're talking about are not insignificant applications. These are mortgages, these are loan approvals, right?

These are health insurance. And it used to be, in the early days, when we saw the inconsistent state of the data, that we were just so incredibly shocked. How can something so important have such insufficient tooling around it? But, you know, we just didn't think about it.

These are things where people have been doing this in the same way for 30, 40 years. And I think about a lot of the problems that we see today in terms of differential outcomes, and how non-transparent a lot of the most important processes are today, even if you think about things like court documents and legislation and the transparency there. I would really like Indico to provide a serious and meaningful layer there.

I want us to look at the idea that we would have people just decide in a black box whether a loan is okay or not, and I want us to be a little horrified at the idea that that's how we used to do things, that there was no transparency, that you didn't have this layer of explainability on top.

Deep: That's all for this episode of Your AI Injection. As always, thank you so much for tuning in. If you enjoyed this episode on using AI to extract insights from unstructured enterprise data, please feel free to tell your friends about us, give us a review, and check out our past episodes at podcast.xyonix.com.

That's podcast dot x-y-o-n-i-x dot com.

People on this episode

Deep Dhillon

Host