What do data centers loaded on tractor trailers with self destruct buttons in war zones have to do with ML models built from networks of hospital data? Join us in this fascinating episode as Deep speaks with Dr. David Bauer, Co-founder and CTO of BOSS AI. Dr. Bauer shares colorful stories and discusses how his experience leading big data and distributed learning initiatives for the U.S. Intelligence community inspired him to take hard lessons learned on the battlefield into the commercial sector and start BOSS AI.
Deep and Dr. Bauer dive into Federated AI, a type of AI that allows machine learning models to be built from data across multiple disparate systems while leaving the data in each system encrypted and decentralized. Federated AI encourages businesses to leverage their data to extract insights, alleviating the need for expensive data lakes, while also avoiding the risk, hassle and increased latency involved in centralizing data.
Deep: Hi there. I'm Deep Dhillon. Welcome to Your AI Injection, the podcast where we discuss state-of-the-art techniques in artificial intelligence, with a focus on how these capabilities are used to transform organizations, making them more efficient, impactful, and successful.
Welcome back to Your AI Injection. This week, we'll be speaking with Dr. David Bauer. Dr. Bauer received his doctorate in CS from RPI and is a co-founder and CTO of BOSS AI, a firm helping make AI solutions easier to deploy, monitor, and manage. Dr. Bauer has led multiple big data and distributed learning initiatives for the U.S. intelligence community, including a role where he helped with the architectural integration of Synthesis, an NLP, ML, and deep learning product.
So David, start us off today. Tell us a little bit about your background in machine learning, and how did you get started with BOSS AI?
Dr. Bauer: So BOSS AI is a startup that I founded with my partner, Ross Blair. We decided to start this company after about 20 years in the U.S. intelligence community. What happened was we had been building cloud computing environments and big data analytic environments for the DoD for quite a long time. And then, around 2010, 2011, AI models started to come in, you know, classifiers and some unsupervised deep learning models. And suddenly what we started to see was that our analytic approaches, where we were trying to do, for example, machine language translation on languages like Pashto and Farsi and other things, went from an accuracy of like 22% to like 80 or 90% overnight. Some of our video analytics, some of our imagery analytics, things that we couldn't do previously, were suddenly showing some real promise with AI.
Deep: Is this just because you were using some deep learning techniques, some more advanced techniques? Or is there something else going on?
Dr. Bauer: Well, we couldn't do them without AI. Right? So, for example, trying to do object detection inside of an image. Kind of the canonical example that came about in 2012 was identifying cats in images. That was really groundbreaking for us. And so we were building these very large scale systems. And when I say large scale, I mean 10,000 processors would've been small for us in 2005. We've worked on systems with hundreds of thousands, millions of processors. And you need that type of compute capacity when you're doing ODE and PDE solvers. Right? And so those physical analysis techniques then started to be replaced by a need for GPU processors for deep learning and for AI. And suddenly we could solve problems that we couldn't solve before. So that really kind of opened my eyes. And then one of the key things that really happened for us was the NSA began to open source a lot of the technology that we had developed in partnership with companies like Google and others.
Deep: So you mentioned that these DoD agencies were helping you, or you were helping them, kind of get into the cloud. So is this non-secure work, or were you in an air-gapped cloud? Like what...
Dr. Bauer: Yeah. So one of the exciting things about working in the Department of Defense, which is where the intelligence community is, is that you have the opportunity to work on projects with much longer timeframes and with much more funding. And so in 2005, we started to work with Google. And then in 2008, they gave us something called Google Bigtable. We took that, we added an enormous amount of security to it, and we developed something called Accumulo from it. And what we would call big data today, back in 2007, 8, 9, 10, we called cloud. Right? So for us, at the time, that's what cloud computing was. Which is kind of funny to me, that we now call it big data and we think of cloud as something else. We were also doing cloud using technology like OpenStack, but we didn't really think of that as being very much of anything at all. But the big data technologies, working with companies like Google, working with Facebook, working on Cassandra, and then bringing in AI and machine learning, really opened our eyes. So we open sourced Accumulo in 2011, and then in 2015 we open sourced our security architecture for cyber security. And we also open sourced our ETL framework for data ingestion, which is called Apache NiagaraFiles, or NiFi.
Deep: With respect to the cloud and DoD agencies, were you just working on non-secure data? Is that why you were able to work on it?
Dr. Bauer: Oh, no. Everything we did was classified, in fact.
Deep: But then there's an impression out here that that sort of work is not happening in a public cloud, like AWS. So is this a private cloud that you...
Dr. Bauer: Oh yeah. Everything we did was, well, you would consider it to be like an on-prem type environment. You know, listen, Deep, in 2007 we were working with Hewlett Packard to build the first of what we call the pod, a portable optimized data center: a data center in a tractor trailer body that could be flown into a country and be prepopulated with all of the kit necessary to operate what we called a cloud compute environment. And by 2011, we had a worldwide global enterprise cloud computing architecture that was deployed on all of the major classified networks. Each pod, each tractor trailer body, would have thousands of servers in it. It would have petabytes of capacity. I mean, think about how much fun that is, in 2008, to be delivering five-petabyte systems that are just airdropped into a country.
Deep: Walk us through what that looks like. What does one of these units look like? How are you cooling it? Just give us a little visual there.
Dr. Bauer: Yeah. So, you know, everybody knows what a tractor trailer looks like. Right? So imagine that white box on the back of the tractor trailer having 50U racks, and having about 22 of them. Pods kind of come in two sizes, right? There's 22 feet and 40 feet, and you can get anywhere from 11 to 22 racks. And they're 50U, which is kind of a non-traditional size. And today they're really nice, right? The racks just slide in, they're all pre-wired, pre-everything. But the first systems were a complete and total nightmare. The first system we built was actually three tractor trailers. One was a generator, one was a compute node, and the third one was the air conditioner, basically. By 2009, our systems had fire suppression systems, and they were much smaller in size and scale. I will tell you one funny story. The very first system we built, you know, they're supposed to be taken off of the flatbed and put on the ground. But on our very first system, we left them on the flatbeds, because we wanted to emphasize the fact that it was mobile. But the problem is you can't get into them when they're on the flatbed. Right? They're four or five feet up in the air.
Deep: Like, a staircase or something?
Dr. Bauer: Yeah. So we had to go and install staircases on all of the systems so that we could do tourism. Right? We would have all of these key people coming in to see what we were doing and review it. And the military was supposed to take the equipment off and lay it on the pad, on a concrete platform. And they liked it. They liked to be able to see the wheels, cuz they thought that emphasized it being mobile.
Deep: They're not driven around, right? They're driven to a location and then you keep 'em there, and then you're just hooking in your electricity. There's an air conditioning unit to keep everything cool.
Dr. Bauer: That's right. They have to be totally self-sufficient, because you don't know where you're putting them. So, depending on where you need them, you might have to provide all your own power or all your own cooling. We actually designed them to be connected up to a local lake or river for cooling.
Deep: And I was just gonna ask that, cuz it seems impractical to just have air conditioners up there. And a lot of times, if you can plug in a water source, you might be able to run some cool water.
Dr. Bauer: We did that. And then by 2009 we had deployed, I don't remember the exact number, but a few dozen of these things around the world, and interconnected them together. And as I said, the virtualization component, what we think of as cloud today, Amazon, that was just a necessity, right? You just had to have a compute framework. But what was really key was what we put in it, which was the big data components, so that we could bring in petabytes of data for data analysis, for whatever mission the DoD needed to execute.
Deep: I see. So you're setting up these hardware environments, these trailers, if you will. And you're configuring them with the expectation of doing a lot of heavy machine learning and AI, but your company wasn't actually doing the machine learning and AI. It was the users.
Dr. Bauer: Oh no, we were. We developed everything. We had to develop the hardware architecture. We had to develop, actually, the containerized pods. No one had ever done a secure one. You asked me, you know, were these public? No, nothing we did was public. Everything we did was classified. And so when you're doing a classified pod, you have to have special X-09 locks on the doors. All of the vents have to be welded shut, and so on and so forth. And then all of your cables, right? If you have a wire running into the pod, it has to be sheathed in a metal tube, so that nobody can just come in and tap into your cables. So there are a lot of considerations going into it, and nobody had ever certified or accredited one of these things back in 2008.
Deep: What about a self-destruct button? Is there such a thing?
Dr. Bauer: We did. We did in fact have a self-destruct button as well. And, you know, with the very first one that we put in, they warned everybody, don't lean against it. We literally had a button, but nobody had thought to put a cover on it. So you could literally bump into it and hit that button.
Deep: And what would physically happen if you did that?
Dr. Bauer: Well, I can't really discuss the details of that. Um, but you know, you could be assured that system would no longer be operational.
Deep: Yeah. And that makes a lot of sense, cuz, you know, when we think about the evacuation of Afghanistan and how quickly something like that went down, whoever the planners were that had the foresight to put the buttons on all the systems, it makes a lot of sense.
Dr. Bauer: Well, there was a real possibility of a system like that being hit by a stray RPG. And so truly our primary concern was for the people who would be there in country, you know, intel analysts, soldiers, warfighters, and making sure that they were safe. So in the event something like that happened, the button was really there to shut everything down, extinguish any fires, and make sure that there wasn't a more catastrophic failure that could be toxic to the people that were there. So that was really our primary concern.
Deep: Oh, I see. So it wasn't so much about an enemy breaching security and being able to access intel. So one question is, did you have a way of hooking multiples of these units together? And how do you deal with redundancy outside the unit itself, given that someone might just lean up against the button, you know?
Dr. Bauer: Well, we didn't plan for somebody hitting the button, and nobody ever did, in fact, hit the button. And for that particular system, we did have somebody fashion a cover for the button. And then, of course, covers became mandatory in all the future systems. But how do you make them redundant? What we tended to do was focus on different capabilities within each. So, for example, you may have one that held petabytes of geospatial data. You might have another one that holds petabytes of, you know, other data, for example. And so in that way, we kind of built them around their function and then leveraged the capabilities within each individually. So there were, in fact, multiples sitting side by side, but then you really do kind of hit a limit of what you're able to do in many of these places. And so you were kind of limited in chaining too many of them together. So we didn't do that too frequently. Now, companies today certainly do that. Right? Google has adopted this technology in a massive way. They chain them together very intelligently, and they've really taken the technology far beyond anything we had ever done.
Deep: You're listening to Your AI Injection, brought to you by xyonix.com. That's xyonix.com. Check out our website for more content, or if you need help injecting AI into your organization.
So we started off with the question of how you got started with BOSS AI. What do you guys do? And how does the evolution of these AI-purposed trailer hardware setups feed into BOSS AI?
Dr. Bauer: Well, we don't do those any longer. Right? When we were working inside the government on a regular basis, doing these large programs for massive efforts like Operation Iraqi Freedom and Afghanistan, that was one thing. We're not doing that anymore. So to answer your original question, the reason why we started the company is that, as we started to open source these major components of the architecture, we actually decided to leave the government space, because we felt like we really understood how to operationalize this technology in a way that other people did not. We would go to conferences, for example, and people would say, yes, this technology is great, but of course nobody knows how to secure it, or how to secure the data, or how to secure the algorithms. We knew how to do that. They said, this technology is great, but it's in a beta form; nobody would ever actually operationalize this and try to use it in production. We had already put it into production on a massive worldwide scale and knew how to do that. So we really felt like we could be very successful in the commercial space with this technology, bringing a lot of the capabilities that we had developed in the intelligence community into the commercial realm. That is exactly what we're doing today. Now, I don't want to mislead your audience. We do still do some work within the intelligence community, and they have been behind our push to do federated machine learning, which is really a very new, or I should say novel, approach within machine learning that we're really excited to be on the forefront of. And somewhat to your question about, maybe not pods, but multiple data centers: think of multi-cloud environments, where you have Azure, GCP, and Amazon resources. Maybe you're a major company with multiple regions deployed around the world.
One of the things enterprises have really had to face over the years is the centralization of all of their data for analysis. Right? Many of these algorithms that we use require the data to be centralized. Now, in the intelligence community, centralization is a bad thing. We don't want things to be centralized.
Deep: It's risky. They can potentially look at it, that kind of thing.
Dr. Bauer: That's right. And it's a signal. Right? So if I am in a battlefield environment, if I'm in another person's country and I'm collecting massive amounts of data, and then I try to move all that data to a centralized location in the United States, that network bandwidth consumption becomes a signal of what's happening. So, having federated machine learning capabilities means I can leave large amounts of data in multiple disparate data centers around the world. I can have it in big data centers. I can have it in little data centers. I can have it in edge-based systems. If I can leverage that data in that type of an environment, I can use it to train models, and I can use it to serve models for inferencing and gain predictions from my data. Then what I can begin to do is learn. In the government space, I can learn strategies and tactics as they change over a battlefield environment. In the commercial domain, I can learn geographic and demographic differences between different parts of a country like the United States, or different parts of a continent like Europe, very quickly and easily.
Deep: Couldn't you just achieve that by maintaining the geography? Like, you could centralize all the data still, but you'd just know the geography it originated from. You could do that. Why the physical separation in the commercial world?
Dr. Bauer: So many organizations today don't want to put all their eggs in one basket with a single cloud provider. Right? They also need to break up their services and their data collection across multiple different regions in order to reduce bandwidth and latency in the network. So it's natural for them to distribute services and data around the country and around the world, for access speed, for locality.
So then, when you've built out that kind of a distributed architecture on a national or global scale, in order to be able to analyze massive amounts of data, having to then centralize it into a single location adds a storage requirement and a network requirement that you really didn't have before. But the reality is that the most important thing is the time component. And this is true whether you're in the public or the private sector. Being able to learn from your data, being able to inference from your data, as quickly as possible turns out to have an enormous value for companies. If you have to wait weeks, or in some cases our customers were waiting months, for data to be centralized so that they could use it for analysis, that data is now very stale, and it's not what we call actionable intelligence, because it's old. Right? It doesn't follow the trend. You know, we just saw this over the last couple of years. We've been on this rollercoaster ride of shutting down, opening up, things changing with the pandemic. And so if you want to be able to adjust very quickly to those types of, I guess you would call them market forces in the commercial sector, then you need to be very agile in terms of your machine learning infrastructure.
Deep: So I'm trying to wrap my head around what your system actually is. You're basically talking about making a cloud in a box in multiple regions and being able to coordinate across them. Is that a reasonable way to think about what you're describing, or am I off?
Dr. Bauer: Somewhat. I mean, I don't want to overcomplicate it. Right? Let's take a very simple case. Actually, I can give you two examples. One is a hospital group. The healthcare sector as a whole has undergone enormous mergers and acquisitions. They buy hospitals all the time. One of the challenges they have in centralizing all of their data is that, as they're buying these different hospitals around the country, each hospital comes with its own set of infrastructure, and they're very heterogeneous. They're very disparate. And so it then becomes very difficult to leverage that data, because beyond moving that data into a centralized location, that data also has to be somewhat normalized and unified in order for it to be useful. What we do in a situation like that, using federated machine learning, is put smaller instances in with each of those hospital infrastructures, where we're able to index their data directly and make it available for training data sets, testing data sets, or inference right at the source. Now, in the machine learning world, as we all know, if you only have a small amount of data, it may be difficult for you to really get a very high quality model. So you would want to centralize all of your hospitals' data into one place, so you have a large enough data set to be able to train a highly accurate model. So how do we do that? What we're doing is we leave the data in each of these, and I'm thinking 12, because that's what this customer has, these 12 hospital sites. And we're training a model across those different sites simultaneously. So we're training against all the data. We're not moving the data. What we're moving is what changes in between the epochs.
For example, if it's a neural network, we're moving model weights and losses across to a centralized location, what we call an arbiter. And at that arbiter, we're aggregating the model for that round. And then, once that aggregation is complete, we can send the pertinent information back to each of the 12 systems.
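As a rough illustration of the round Dr. Bauer describes (sites train locally, an arbiter aggregates, the result goes back out), here is a minimal sketch of size-weighted federated averaging on a toy linear model. It is not BOSS AI's implementation. The function names, the three sites, and the synthetic data are all assumptions made up for the example, and nothing here is encrypted.

```python
import random

def local_epoch(weights, data, lr=0.1):
    """One pass of gradient descent for y ~ w*x + b over a site's own rows."""
    w, b = weights
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return (w, b)

def arbiter_aggregate(site_weights, site_sizes):
    """Average the sites' weights, weighted by how many rows each site holds."""
    total = sum(site_sizes)
    w = sum(sw[0] * n for sw, n in zip(site_weights, site_sizes)) / total
    b = sum(sw[1] * n for sw, n in zip(site_weights, site_sizes)) / total
    return (w, b)

def federated_train(sites, rounds=50):
    """Only weights travel between sites and arbiter; the rows never move."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_epoch(global_weights, data) for data in sites]
        global_weights = arbiter_aggregate(updates, [len(d) for d in sites])
    return global_weights

# Three hypothetical sites, each holding private samples of y = 2x + 1.
rng = random.Random(0)
sites = [
    [(x, 2 * x + 1) for x in (rng.uniform(-1, 1) for _ in range(20))]
    for _ in range(3)
]
w, b = federated_train(sites)
```

In this toy setup the aggregated weights settle close to the true slope and intercept even though no site ever sees another site's rows.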
Deep: One of the challenges we have a lot, using this particular example, is that each of these hospital units has a ton of data, from which a small set is gonna get used for a particular model. So do you have some way for your machine learning practitioner, who's interacting with the system, to be able to see into those other systems?
Dr. Bauer: Yeah, that was really actually the more difficult portion of it to build.
Deep: I imagine it would be.
Dr. Bauer: Right. They need to see what's available to them. And in most systems, what we typically say is, if you're calling read_csv in a notebook, you're probably doing it wrong. That's not how our users leverage the environment. What our users are doing is executing very complex queries across these different systems, in accordance with the data models that exist. We've got a data unification that happens across the global environment, and then we're leveraging a federated query capability so they're able to query any possible way they could want across that data. That result set becomes the basis for training data, for example.
Deep: Before you get there, though, somebody who's running one of these organizations presumably had to decide what data to include, cause they might have had multiple databases, all kinds of things that were inappropriate for machine learning, for example, and some that were appropriate. And they're gonna have Epic instances, and there's gonna be all kinds of security issues. Somebody decided to put it into this pool from which all this work happens. Is that right?
Dr. Bauer: That's right. That is why we implement data-level security throughout what we're doing. This is one of the things that we have to do in the classified domain. There is a protocol for handling classified data called ICD 503. It is effectively a superset of all of the other security and compliance requirements that you may be familiar with. For example, HIPAA, PCI, PHI, Sarbanes-Oxley, and even some of the newer restrictions like GDPR are a subset of what we're doing in ICD 503.
So what that means is that, yes, each machine owner, each hospital, is able to decide not only what they want to put in, but how they want to disseminate it to different people in accordance with security groups. So then what happens is, when the user authenticates, they're not only querying across the data to discover what they have access to, but they're also only being exposed to the data that they're cleared to see.
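To make the idea of data-level security in a federated query concrete, here is a small sketch in which every record carries the security groups its owner has released it to, and a query only returns rows the authenticated user is cleared for. The `Record` type, the group names, and the sample rows are hypothetical and exist only for illustration; the actual system's query layer is not described in this conversation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    site: str                  # which hospital holds the row
    values: dict               # the row itself
    allowed_groups: frozenset  # security groups the owner releases this row to

def federated_query(sites, user_groups, predicate):
    """Run the predicate at every site; return only rows the user is cleared for."""
    visible = []
    for records in sites.values():
        for rec in records:
            cleared = bool(rec.allowed_groups & user_groups)
            if cleared and predicate(rec.values):
                visible.append(rec)
    return visible

# Hypothetical rows at two hospital sites, tagged by their owners.
sites = {
    "hospital_a": [
        Record("hospital_a", {"age": 64, "dx": "I10"}, frozenset({"cardiology", "research"})),
        Record("hospital_a", {"age": 31, "dx": "J45"}, frozenset({"pulmonology"})),
    ],
    "hospital_b": [
        Record("hospital_b", {"age": 70, "dx": "I10"}, frozenset({"research"})),
        Record("hospital_b", {"age": 58, "dx": "I10"}, frozenset({"billing"})),
    ],
}

# A user in the "research" group asks for hypertension rows across all sites.
rows = federated_query(sites, frozenset({"research"}), lambda v: v["dx"] == "I10")
```

The same query issued by a user in a different group returns a different result set, which is the behavior described above: discovery and access are both filtered by clearance.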
Deep: Oh, that makes sense. And is it reasonable for me to think about the exposure of their systems of record, their legacy systems, whatever they have, to your system the way I would think about the exposure of that to, like, a data warehouse?
Dr. Bauer: It is very similar to a data warehouse, and people have asked us, can I use this as a data warehouse? And we do have customers that do that. But unlike a data warehouse, where you can do 12 other things with it besides AI and ML, we are really focused primarily on making data access efficient and scalable for AI and machine learning workloads. Now, I do wanna give you a second use case. This one's a little bit more exciting for us. Where we really get excited about federated machine learning is when we use it across different parties. We have a major manufacturer who creates products. They sell them through their own brick and mortar, but then they also sell them through partners. They sell them at Costco, they sell them at other places. And so each of these individual companies is its own distinct company. They have their own storefronts, they have their own customer information. They're all selling the same product. Not one of them had enough data to really create high quality models, but leveraging federated machine learning, they were able to take all of that customer data together and begin to train models in a multitude of different ways. And that's not just off of having more customer records, but actually having more features as well. That gets into an area that we call vertical learning. Vertical learning is where one federate may have columns or features A, B, C, and the second federate might have columns D, E, F. You can't get the value from the columns if you don't even have them. Right? So federating different companies together allows them to have more features to work with.
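The vertical case can be sketched as follows: two federates hold different columns for overlapping customers, each computes a partial score from its own columns and weights, and only those partial scores are combined. This covers just the scoring half of a vertically split linear model and leaves out training and all encryption. The party names, customer IDs, feature values, and weights are invented for the example.

```python
import math

def partial_score(weights, row):
    """A federate's contribution, computed only from columns it owns."""
    return sum(weights[name] * value for name, value in row.items())

def joint_scores(parties):
    """parties: list of (rows_by_customer_id, weights) pairs, one per federate.

    Only customers known to every federate can be scored jointly, and only
    the per-federate partial scores are combined, never the raw columns.
    """
    shared_ids = set.intersection(*(set(rows) for rows, _ in parties))
    return {
        cid: sum(partial_score(weights, rows[cid]) for rows, weights in parties)
        for cid in sorted(shared_ids)
    }

# Hypothetical federates: one holds features A, B, C and the other D, E, F.
manufacturer = (
    {"c1": {"A": 1.0, "B": 2.0, "C": 0.0}, "c2": {"A": 0.0, "B": 1.0, "C": 3.0}},
    {"A": 0.5, "B": -1.0, "C": 2.0},
)
retailer = (
    {"c1": {"D": 4.0, "E": 0.0, "F": 1.0}, "c3": {"D": 1.0, "E": 1.0, "F": 1.0}},
    {"D": 0.25, "E": 1.0, "F": -0.5},
)
scores = joint_scores([manufacturer, retailer])
```

Here only customer `c1` appears at both federates, so it is the only one that benefits from all six features; `c2` and `c3` would need a single-party model.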
It also allows them to have different types of labels to work with, different types of categories to work with in their classifiers. And so that really gets exciting for us, because it starts to unlock these new use cases that you could not do before without fully sharing your data. Maybe it's somebody you're partnered with, but you wouldn't want to give them your customer database. Right? So my data never leaves my system. Their data never leaves their system. And then we use a variety of techniques to protect each of the partners in this federation, to provide enhanced security around the modeling effort itself. And this comes out of an area in the DoD space as well that we call adversarial AI. So, for example, I hope your listeners understand that if I had access to a model, I could actually use those model weights to reverse engineer the data that was used to train the model. That is an active research area, and people have had pretty good results with that. So what do we do to protect these models, given that we are moving the model weights and losses around the network? We use a variety of techniques. We use homomorphic encryption, so that the model is encrypted in memory throughout the lifespan of the model training. We use something called garbled circuits to further obfuscate the model training and the features that are used from the different partners. We do use, as I mentioned earlier, an arbiter, which is a third piece of software that does just the model aggregation, and that arbiter can be placed in one or other of the feds, or it can be placed in a third location that may be more neutral. And then we do a couple of other techniques, such as differential privacy and SMPC. I'm trying to remember what SMPC stands for. I lost that one off the top of my head. But in that way, what we're really trying to do is protect the models themselves.
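One simple way to see how an arbiter can aggregate updates without seeing any single partner's update is pairwise additive masking, sketched below. To be clear, this is a swapped-in teaching technique, not homomorphic encryption or garbled circuits, and it is not claimed to be what BOSS AI runs. In a real protocol each pair of partners would derive its mask from a shared secret; the single seeded generator here is a stand-in for that, and all names and numbers are assumptions.

```python
import random

def mask_updates(updates, seed=0):
    """Add pairwise masks that cancel in the sum, hiding each individual update.

    For every pair (i, j), partner i adds a random mask and partner j
    subtracts the same mask, so the total over all partners is unchanged.
    """
    rng = random.Random(seed)  # stand-in for per-pair shared secrets
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(dim):
                m = rng.uniform(-100.0, 100.0)
                masked[i][k] += m
                masked[j][k] -= m
    return masked

def aggregate(vectors):
    """What the arbiter computes: the element-wise mean of what it receives."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical weight updates from three partners.
updates = [[0.10, -0.20], [0.30, 0.00], [-0.10, 0.50]]
masked = mask_updates(updates)
mean_from_masked = aggregate(masked)
mean_from_raw = aggregate(updates)
```

The arbiter only ever receives `masked`, whose rows look nothing like the raw updates, yet the mean it computes matches the mean of the raw updates up to floating-point error.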
So, for example, one technique that we test for is one fed training with legitimate data and labels, and the other fed training with no data. Right? And what they're trying to do is see what they're gonna get back, so that they can maybe discover something about your portion of the model, your portion of the features. And so that's the type of testing that we're doing.
Deep: That kind of obfuscation of the features or encryption on the features can be difficult for the model builder to build intuition about what's going on.
Dr. Bauer: Well, keep in mind, they can still see their local version of the model. Right? So they can decrypt it. They have the keys for their portion of the model.
Deep: Ah, okay. That's how you handle that, basically.
Perhaps you're not sure whether AI can really transform your business. Maybe you don't know what it means to inject AI into your business. Maybe you need some help actually building models. Check this out at xyonix.com. That's xyonix.com. Maybe we can help.
Dr. Bauer: Right. So they can see their portion of the model. What I'm talking about is what if I, as a fed, like as a bad actor, say, I'm only gonna give you. You know, let's say we're training against numerical tabular data, and I just give you all ones, right. I'm just gonna, all my data records are ones you're just gonna train 'em ones. Right. And I wanna see what I get back from that to see if I can determine anything about the features that you're using or the, you know, what, what the model training looks like on your side. And so that's what these steps do. They obfuscate the features that are being used on the other side. So then you don't have to worry about any. Any, uh, data tampering. You don't have to worry about any, um, you know, kind of funny business in terms of somebody messing with your model or your data in the process, and frankly, on the adversarial AI problem. Uh, the biggest problem that we have is what if somebody is tampering with your data stream, what if somebody is, you know, modifying your data as you're centralizing it. So we make that problem go away by never transmitting the data. We reduced the threat surface greatly by not transmitting the data out of the system. So, um, it makes it a lot harder because now in order for them to tamper with your data, they have to actually get inside of your system. So that's one of the biggest problems is once you put your data out there in the wild. Yeah. Right. You put your data in an S3 bucket. People think, well, I secured my S3 bucket. I'm good. That's a fallacy, right? First of all, Amazon can see what's in your S3 bucket. Right? We know that right? When you put these, when you put your data into these systems, these companies can see your data. If they want to, um, the. There's definitely a certain level of trust there. Right. And if you're worried about you in certain data's that's right. 
And then we know that other people have had varying degrees of success in breaking into those systems. And so you don't really know what's being applied, right? We sign up for these services thinking that they're FedRAMP compliant, thinking that they're SOC compliant, and that we're all good and we're covered. But the reality is, because we're not monitoring it and we're not implementing it, we don't really know what the security is. We don't even necessarily know when breaches occur. And so we don't really know what's happening with our data. The more you can keep it within a single environment and not spread it around, the safer you can be. If you're collecting in one place, centralizing in another, and sending some of it off to a partner in a third, you're just increasing what we call the threat surface.
Deep: Sure. So keeping that threat surface as small as possible is an important first step. What are the rules of thumb that you use, for our listeners, to see whether they need a federated machine learning approach or system, or could benefit from it? What are the standard questions that you should ask yourself, just to see if you might need a federated ML solution?
Dr. Bauer: Well, as with any AI solution, the first thing we ask is: is AI, or really machine learning, appropriate for the problem to be solved? Once we know the answer to that question, then the next portion of that question becomes: where is the data? What is the source of this information? How is it collected? What is the process by which it's gathered and used? And what does it really mean? Many times, if we are collecting data at the tail end of an infrastructure, at the tail end of a process, it may be too refined, right? It may be too clean. It may be too cleansed to really be useful to us in determining certain problems. So understanding where that data is, understanding where the right place to collect and analyze that data is, is the next step in the process. But frankly, the answer to that question tends to be a lot simpler for our customers. One of the things we say to many of our customers is just as simple as: do you have data in more than one cloud? Many customers do. Immediately, that's the answer right there. Work with the data where it is. Work with the data live. Get the value from the data in real time. That's what we should be doing. That's what companies should be doing. What our customers frequently say is, "We wish we knew about this three years ago, before we started putting everything in a data lake, before we started putting everything into a data warehouse," because it would've greatly simplified the process. One of my favorite questions to ask a company that's been successful with a data lake is: okay, so now that you've got it, what is the value you get? And frankly, many times the answer is none, right? They don't know how to leverage it at that point. Right.
Deep: Well, I mean, that's a generic problem. I don't know, 90% of the companies that I've seen that have a data warehouse don't leverage it for much more than some simple business analytics, you know?
Dr. Bauer: It's insane to me that that's the case. One company that we worked with, they were very successful. I mean, it did take them three years, but they were very successful. They said, "The number one thing we do is export data out of the data warehouse." Yeah, sure. That's exactly the number one thing everybody does with it. And I'm like, how does that make any sense? You spent all this time centralizing. This is something that's been a bit of a bee in my bonnet for the last 10 years or so. The whole idea of big data was sold with the promise of machine learning outputs that would be transformational, business insights that would do this exciting thing or that exciting thing. We were quite successful in convincing an awfully high percentage of businesses to go ahead and move to data warehouses. They did so, and then somewhere along the line they forgot about why they were doing so, and it basically became exports. But no one really cycled back and figured out all the models that could actually be built, that could actually drive value. And the heterogeneity in the data, all these other issues, just never quite got hacked away with a machete and dealt with. So if I rewind 20 minutes to your original question: in order to really pull this off, the biggest challenges we faced were not so much with the federated machine learning, but in building a platform that could do federated query, federated feature engineering, and federated, essentially, ed operations across the data. Because what happens in an environment like this? When you're working in a notebook on your laptop with a CSV or a JSON file, you pretty much know what the data looks like. It's fairly uniform already. When you're dealing with 12 different hospitals, yes, they all have EHRs that kind of look the same from one location to the next, right?
They may have a different version, HL7 or 8 or whatever. One hospital may not collect a certain field, and the other one does, and so that shows up as a gap in the data. But even in a simple case like that, the data can look very different from place to place. And so now what you're doing in the federated environment is you're working with data that is far less uniform to begin with. Yeah. Right. And so you, as the data scientist, really have to have good tools for that. Unfortunately, Jupyter notebooks don't work in a federated environment. So we've spent an enormous amount of time building out no-code and low-code interfaces to really make that data visible, both statistically and, really more importantly, visually, for users, so that they can really understand what the data is that they're putting into models and what types of operations they have to apply to the models. And then of course it all has to be scalable and efficient and secure, and all of those things.
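The field-level gaps described here (one hospital collects a field, another does not) can be surfaced without moving any records. What follows is only a minimal sketch under stated assumptions, not BOSS AI's implementation: the function names `summarize_site` and `find_gaps`, the site names, the field names, and the 50% fill threshold are all hypothetical. Each site computes per-field fill rates locally, and a coordinator compares only those summaries.

```python
def summarize_site(records):
    """Runs inside a site: returns per-field fill rates, never the records."""
    if not records:
        return {}
    counts = {}
    for rec in records:
        for field, value in rec.items():
            if value is not None:
                counts[field] = counts.get(field, 0) + 1
    return {field: c / len(records) for field, c in counts.items()}


def find_gaps(summaries, min_fill=0.5):
    """Runs at the coordinator: flags fields a site rarely or never fills."""
    all_fields = set()
    for summary in summaries.values():
        all_fields.update(summary)
    gaps = {}
    for site, summary in summaries.items():
        missing = sorted(f for f in all_fields if summary.get(f, 0.0) < min_fill)
        if missing:
            gaps[site] = missing
    return gaps


# Hypothetical records; in practice these would stay inside each hospital.
hospital_a = [{"age": 61, "blood_pressure": 128}, {"age": 47, "blood_pressure": 135}]
hospital_b = [{"age": 70, "blood_pressure": None}, {"age": 39, "blood_pressure": None}]

summaries = {
    "hospital_a": summarize_site(hospital_a),
    "hospital_b": summarize_site(hospital_b),
}
gaps = find_gaps(summaries)
print(gaps)
```

Only the small summary dictionaries cross site boundaries in this sketch, which is the property the conversation emphasizes; a real system would also have to consider what those summaries themselves reveal.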
Deep: One question I had for you is, you must get into a lot of tedious conversations with lawyers trying to figure out exactly what data is leaving the org and what its ramifications are. Does that conversation come up, and how do you address it?
Dr. Bauer: Typically, to be honest with you, that question doesn't really come up all that often. But at the end of the day, I mean, there's a leap of trust they have to have with your system, which is that maybe none of the original data is leaving, but you're enabling things to understand that data and move something else out, like derived knowledge from it.
Deep: Yeah. That feels to me like it's gonna trigger a bunch of questions.
Dr. Bauer: Yeah, I think, Deep, my answer to that would be that, at their level of understanding of what we're doing with federated machine learning, they don't know to ask those questions. Right. What we tell them is that the data doesn't leave the system.
Deep: I think that's brilliant on your part, to be honest, like when you say it like that, you're just done.
Dr. Bauer: That's right. And they go, "Ever?" And I said, "Nope, not ever." I think you've mastered that. Yeah, that's the right answer. They're like, "Okay, we're moving on." Now, we do have some customers, I don't want to be silly about this, right? We do have some very mature customers that know all about that. So for example, inside of some of the intelligence community, they're actually looking to leverage federated machine learning across classification boundaries.
Deep: Oh, wow.
Dr. Bauer: Yeah. So going from top secret to secret, right? Going from top secret to unclassified. How do I leverage open source intelligence data against my classified data? If I don't want to move a petabyte of open source data across the wire into my database on the classified side, into my data system on the classified side, how am I going to leverage it? What we've been doing in the IC is we train open source models on open source data, and we train classified models on classified data. The question is, how do I commingle the two? How do I get a model that learns from both data sets?
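One widely used technique for getting a single model to learn from data sets that never leave their own environments is federated averaging: each site trains locally, and only model weights are combined. The conversation does not say which method BOSS AI or the IC uses, so the following is only an illustrative sketch. The function names, the two synthetic sites, the learning rate, and the round counts are all assumptions, and the model is a toy one-variable linear regression.

```python
import random


def local_update(weights, data, lr=0.5, epochs=20):
    """Runs inside one site: gradient descent on that site's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def federated_average(updates):
    """Runs at the coordinator: average weights, weighted by sample count."""
    total = sum(n for _, n in updates)
    w = sum(wts[0] * n for wts, n in updates) / total
    b = sum(wts[1] * n for wts, n in updates) / total
    return (w, b)


def train(sites, rounds=50):
    """Alternate local training and averaging; raw data is never pooled."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in sites]
        global_weights = federated_average(updates)
    return global_weights


def make_site(n, lo, hi, rng):
    """Synthetic stand-in for one site's data, drawn from y = 3x + 1 plus noise."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.05)) for x in xs]


rng = random.Random(0)
# Two sites that each see a different slice of the input range.
sites = [make_site(200, -1.0, 0.0, rng), make_site(50, 0.0, 1.0, rng)]
w, b = train(sites)
print(f"learned slope={w:.3f}, intercept={b:.3f}")
```

In this sketch the coordinator only ever sees weight pairs and sample counts. Crossing a real classification boundary would involve additional controls on what those weights can reveal, which this toy example does not address.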
Deep: This reminds me of a funny story. I was doing some work for some three-letter agencies. And as you know, you've worked with a lot of these folks, they're really intellectually driven in many ways. Oh yeah. Right. These are really bright folks. In order to get to the roles they're in, they've been in graduate school, writing papers, all that kind of stuff. So they love to share. And so we had just shared for, I don't know, an hour or so about our work. And one of the sponsors is like, "Hey, I have something I really want to show you guys." And we said, "Okay, great." He's like, "You know, I can't tell you where the data came from or what it's about or what it represents or anything, but it's really cool. And I got through all the lawyers to show it to you." And he shows us this, it's like a three-dimensional damped sinusoid, just kind of doing its thing, bouncing around. And we're like, "That's great." But that got cleared, you know? It was based on real data. We didn't know anything about it, but it got through the system, just through the process.
Yeah. So, listen, this has been an amazing conversation. I want to end with one final question, and I like to look out into the future. So let's go out 10 years. You've achieved everything that you can possibly imagine with respect to federated ML. What does the world look like? Is it better off, or worse off, or just different?
Dr. Bauer: I'll shy away from any social commentary about how tools are used, because in this day and age it's impossible to predict, right? I would not know how to even begin to predict how these tools will be used in the future. From a technological perspective, what I really like about your question, what immediately jumps to my mind when you talk about the future, what gets me excited is this: listen, AI is becoming a commodity. For those of us who have worked in it for a long time, it's rote. It's not a major challenge any longer, right? It's becoming a commodity. And as it becomes a commodity, what systems like federated machine learning are enabling us to do is to build repositories of data and repositories of models that can be shared across the environment. Right. Yeah. And so what we're really trying to do is build out, for right now the best word I have for it is an ecosystem, an AI ecosystem. Today, what do we do? I've got to go find some data. I've got to write some code. I slap them together. Maybe I get a model. Maybe I've got to go refine it. Yeah. And the find-some-data part can be quite a challenge in these environments where the data's littered all over the place. Right? Imagine instead you register for access to the system, and there are thousands, tens of thousands of data sets already out there. I can subscribe to them, some for free, some for a fee. I can create a marketplace around this, right? In this ecosystem I can sell inferences, a penny an inference. I can sell access to data. Maybe I want to take some of my data and do federated transfer learning and combine it with somebody else's data, because I don't have labels or I only have a small data set. Yeah. And as you pointed out, frankly, the hardest part is always just finding the right data. Right?
So if I can come into a system on day one and I see thousands upon thousands of data sets, I can quickly find what I need. Somebody else has already written the model. Maybe they've already trained the model and I can just subscribe to it for inference. Listen, there's an enormous amount of work that has to happen before that can really take place. We're building what we call AI marketplaces to facilitate publishing and subscribing to data. We're leveraging smart contracts on the blockchain to try to make that happen. But there's also an enormous amount of work that has to be done for model validation and model verification, so that you really understand what you're even subscribing to. Right? When I build a model, I know what it does because I wrote it. When you build a model, I have no idea what it does. I have no idea if it's applicable to my data. Those are things I need to know. Google's done some really fantastic thinking in this area with their model cards notion. That work needs to progress quite a bit more for people to really understand the value of a model. We're also working in an area called augmented ML, where you can really describe a problem statement, a hypothesis, and have the system auto-find these things for you. It would be great to log into a system and see 10,000 data sources. But frankly, that actually makes my job harder, right? Now I have to sift through 10,000 data sources. I want the system, in an augmented sense, to go out and identify potential data sources that may represent concrete, that may have cracks in that concrete. I want it to go out and find models that may do that.
Deep: I like that. I like that. Because, I mean, this is one of the things that a lot of us that work in ML are realizing: the forefront's not so much on the algorithmic front and the tweaking of the networks. The forefront's really on the data side. How do you prep the data efficiently? How do you get access to large data sets, or label the data sets that you need? And what you're describing here is a picture of a world where it's so much easier to augment and access data that's going to help you take your models to the next level. One of the things that I tell folks all the time is, if you spend an hour improving the data, you're going to get more than an hour's worth of whatever you're going to get from tweaking your model for that hour.
That's all for this episode of your AI injection. As always, thanks so much for tuning in. If you enjoyed this episode and want to know more about how AI can be applied to your business, check out a recent article of ours called "How to Spot Great AI Opportunities in Your Business" by going to xyonix.com/articles. That's xyonix.com/articles.
Also, please feel free to tell your friends about us. Give us a review and check out our past episodes at podcast.xyonix.com. That's all for this episode, I'm Deep Dhillon, your host saying check back soon for your next AI injection. In the meantime, if you need help injecting AI into your business, reach out to us at xyonix.com.
That's xyonix.com. Whether it's text, audio, video, or other business data, we help all kinds of organizations like yours automatically find and operationalize transformative insights.