Mastering Your Own Destiny

Andy Palmer & Mike Stonebraker, Co-founders of Tamr

Andy Palmer is a serial entrepreneur who’s been a founding investor, board member, or advisor to more than 50 start-up companies. Mike Stonebraker is a pioneer of database research and technology who’s an adjunct professor at MIT Computer Science & Artificial Intelligence Laboratory (CSAIL). Together, they founded Tamr, the enterprise data mastering company.

Andy Palmer

Co-founder

Tamr

Mike Stonebraker

Co-founder

Tamr

Satyen Sangani

As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”

Satyen Sangani

CEO & Co-founder

Alation

Producer: (00:02) Hello and welcome to Data Radicals. In today's episode, Satyen sits down with Tamr co-founders Michael Stonebraker and Andy Palmer. Tamr is the data mastering leader delivering data products that provide clean, consolidated and curated data to help businesses stay ahead in a rapidly changing world. In this episode, Michael, Andy, and Satyen discussed Tamr's tech evolution, third normal form, and probabilistic methods.

Producer: (00:28) This podcast is brought to you by Alation. Alation achieved 8 top rankings and 11 leading positions in two different peer groups in the latest edition of the Data Management Survey '23, conducted by BARC, the Business Application Research Center. Read the report at Alation.com/barc23.

Satyen Sangani: (1:00) Today on Data Radicals, we have Andy Palmer and Dr. Michael Stonebraker. Andy is a serial entrepreneur and has served as a founder, board member, or advisor to more than 50 startups. Most recently, Andy founded Koa Labs, a startup club in the heart of Harvard Square. He also co-founded Tamr, the enterprise data mastering company. Joined by Andy is his Tamr co-founder and Turing Award winner, Dr. Michael Stonebraker. Michael is a legendary database pioneer, MIT professor, and entrepreneur. Today his products are central to many relational database systems and he has founded 9 database startups in the last 40 years. In 2014, he won the A.M. Turing Award, known as the Nobel Prize of computing. Michael and Andy, welcome to Data Radicals.

Michael Stonebraker: (01:41) Thank you, Satyen.

Andy Palmer: (01:43) Great to be here.

Satyen Sangani: (1:43) So, you two founded Tamr in 2013, which was roughly the time that we founded Alation, and it was based on some work that you, Mike, had done at the MIT Computer Science and AI Lab. Can you give us a little bit more background about why you started Tamr and why you chose that area of inquiry, given that you'd spent most of your career working on the internals of a database, but not necessarily on what was inside of the database?

Michael Stonebraker: (02:10) Sure. I've been interested in data integration for years, and the first dumb idea was to pretend it was a distributed database problem. I quickly found out that schemas are never the same, ever. Distributed databases have a data integration problem. The second thing is I founded another company called Goby, and then I said, "Well, the problem then is to easily write transforms from one thing to another." And that didn't really get any traction either. And then the next idea turned out to come from a collaboration with the Qatar Computing Research Institute. That was a cooperation between MIT CSAIL and the government of Qatar where we figured out some stuff to work on and they paid all the bills. They had an active data integration project. And at the same time Joey Hellerstein, who turned out to be the founder of Trifacta, was visiting Harvard for a year. We started collaborating and we were both interested in data integration. So we basically built a thing called Data Tamr, which was, roughly, Tamr as a pilot. And through MIT (and through Andy Palmer) we got some people to actually try and use it, and it really worked very well to put disparate schemas together. We figured we had a commercial story here and, with Andy's help, we ran with it. So it was basically serendipity in a bunch of dimensions.

Andy Palmer: (04:02) Another perspective in addition to what Mike said: Mike and I had also spent a lot of time on a previous company, Vertica, an early column-oriented data warehousing system. One of the things we saw over and over again when people were doing data warehousing was that the success of these warehouses was limited by the quality and the comprehensive nature of the data that was being loaded into the warehouse. The bottleneck for doing that was the ability to add new sources of data very quickly and easily and map all of the data sources really quickly, and using the power of the machine to sort of make that completely automated thing was very intriguing to us based on the failures that we had seen in a lot of data warehousing projects.


The procurement proliferation

Satyen Sangani: (4:47) Right. Because as a builder of a database, you were completely reliant on the upstream population of that database being correct and consistent and comprehensive. This point that you made, Mike, around the schemas never being the same, can you dive a little bit deeper into that? What was your prior, before you sort of got into the problem, and what did you learn along the way in terms of why they're not the same, and how did you think about the world before and after?

Michael Stonebraker: (05:17) One of the very early Tamr customers was GE, the big conglomerate. I'm sure Alation has exactly one purchasing system because that's the obvious answer to how many procurement systems you have. The best answer is one. You want to guess how many procurement systems GE had as of three years ago?

Satyen Sangani: (5:46) I'm gonna go with just short of 20, 15.

Michael Stonebraker: (05:51) Seventy-five! And so just for your listeners, a procurement system is, you want to buy some paper clips. You go to your procurement system, you type in some magic number, it spits out a purchase order, you take the purchase order down to Staples, and Staples gives you your paper clips. So that's what a procurement system does. And GE has 75 of them. You might say, “Why in the world could that possibly be?” And the answer is, it looks like they are buying and selling business units all the time and every time they buy a business unit or a company, it comes with a procurement system. And unless you take the time to stamp out these data silos at acquisition time, they just pile up. So in every big company there are many, many data silos. They turn out to arrive when you buy stuff. When you start a new business unit, the trouble is the CEO gives you, Satyen, a budget, and says you have, say, a year to make something happen.

Michael Stonebraker: (06:56) And the last thing you want to do is stamp out data silos. So you just build whatever you need and you just created another data silo. So there are tons and tons of them, and it just turns out to be a fact that if you build a schema and I build a schema, your schema is for employees, my schema is for workers. Let's say you are in Paris and I'm in New York, so your workers have a salary, it's net after taxes in euros, my salaries are called wages and they're gross in U.S. dollars before taxes. So your employee table and my employee table simply aren't compatible. Either the names of the tables aren't compatible, the names of the columns aren't compatible and chances are the actual data types of the columns aren't compatible either. You may well have different stuff in your tables than I do. So the schemas just don't, they never line up. Tattoo that on your brain.
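Illustration (not part of the conversation): the Paris and New York example can be sketched as two small tables mapped into one target schema. This is a minimal sketch under stated assumptions: the field names, the exchange rate, and the flat tax rate used to approximate gross pay from net pay are all hypothetical and chosen only to make the mismatch concrete.

```python
# Hypothetical rows from two data silos (all names and numbers are assumptions).
paris_employees = [{"employee": "A. Martin", "salary": 3000.0}]  # net, EUR
new_york_workers = [{"worker": "B. Jones", "wages": 5000.0}]     # gross, USD

EUR_TO_USD = 1.10        # assumed exchange rate
ASSUMED_TAX_RATE = 0.25  # assumed flat rate to approximate gross from net


def from_paris(row):
    """Map a Paris row (net EUR) into the target schema (gross USD)."""
    gross_eur = row["salary"] / (1 - ASSUMED_TAX_RATE)
    return {
        "name": row["employee"],
        "gross_usd": round(gross_eur * EUR_TO_USD, 2),
        "source": "paris",
    }


def from_new_york(row):
    """Map a New York row (already gross USD) into the target schema."""
    return {
        "name": row["worker"],
        "gross_usd": round(row["wages"], 2),
        "source": "new_york",
    }


# One consumer-facing table built from two incompatible source tables.
unified = [from_paris(r) for r in paris_employees] + [
    from_new_york(r) for r in new_york_workers
]

for row in unified:
    print(row)
```

Even in this toy case, the table names, the column names, and the meaning of the pay column all differ, so each source needs its own mapping before the rows can be compared.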

Satyen Sangani: (8:03) Yeah, because everybody has this model of the world around them. And no matter how much you understand or think about a domain, you're gonna model that based upon your own experience, your own knowledge, and your own set of requirements as a software developer or data modeler. And, therefore, there'll be slight differences even if you're talking about the same real world things.


Third normal form and analytics at scale

Andy Palmer: (08:25) Yeah, in database systems, the most extreme example of this is third normal form, which was popularized by [Edgar F.] Codd way back in the day. And the reality is, third normal form is just wrong: this idea that you can design these schemas upfront, especially for warehouse, read-oriented kinds of things, and that they're then gonna persist. The core schema has changed so much. In some ways, Mike and I are calling BS on third normal form.

Michael Stonebraker: (08:54) The thing to keep in mind is that there are these data silos in every large enterprise — often hundreds of them. And there's immense financial value to integrating these silos. For example, I said GE had 75 procurement systems. Every one of those procurement systems has a supplier table, or one or more supplier tables. And GE figured out that if you're one of these 75 procurement agents, if you can find out the terms and conditions, you're trying to negotiate a deal with Staples to get more paper clips. If you can figure out the terms and conditions that were negotiated by your 74 counterparts and then demand most favored-nation status, the CFO thought that would be worth $100,000,000 a year. But what you have to do is integrate 75 disparate supplier schemas. The upside value is enormous, but you've got to do data integration.

Satyen Sangani: (10:03) In some ways this is sort of the essential problem of analytics at scale, because you have all this data about all these things and unless you reconcile the way in which you talk about the things that you're trying to analyze, you can't actually analyze them. You can't actually make sense of them. Getting to this question of third normal form, though, Andy, what is that criticism? So make that argument one more time. As I understand third normal form, and as I have experienced it in my career, the premise is, basically, you sort of model everything at its most atomic level. So a person has two arms, and so arms are an attribute of a person and an arm is then its own table and arms have fingers. There's this premise of just modeling everything just as it exists in the world. Tell us why that doesn't work.

Andy Palmer: (10:51) When it comes to designing schemas for databases, there are different purposes. When you have a schema that you design while you're inputting the data, you have one use of that schema. And over time, that schema's gonna change. And then when you go to consume data downstream, there may be a very different definition of it. The hubris associated with the definition of a schema and believing that that schema is canonical and shouldn't change or doesn't change much, is where the problem lies with third normal form. As Mike likes to say, "The design principle is schema last," or our good friend Bill Hakim, who's an amazing data engineer and architect, likes to describe it as “lazy schema mapping,” where, when you're assembling data to be consumed by people, the way you create that target schema should be very dynamic and should be based on what the consumers of the data need and how they think of the things that you're combining together, the core elements.

Andy Palmer: (11:58) And the other reason it's dynamic is because the underlying source data has changed and there are all kinds of schemas and variations in what things are called. The real problem with third normal form is the hubris of predetermination. At some point you have to decide what schemas are and you kind of lock them in. But the real problem is believing that they're not gonna change, either in that data source, or believing that the schemas downstream, when you load the data somewhere else, are going to be consistent with the original schema. Another great reference here: one of our good friends, Elena, who's been a customer of Tamr three times now through many big companies, likes to say over and over again, the source is not the master. And I think it's a very powerful idea because, just as Mike described with GE, there are many people that try to rationalize systems down to one, because if you've got one source, like the one procurement system, everything's a lot simpler.

Andy Palmer: (12:49) But it just doesn't work that way in big companies. You inevitably have lots and lots of source systems and if you try and treat any one source system as the master, you've got this massive mapping problem. And, of course, every source system inside of an ecosystem wants to be the one to define everything, especially the schema. And so kind of saying, “Listen, none of these source systems are the master. And you need a dynamic schema for consuming the data downstream that's generated in relatively real time” — it's a very powerful thing to do. And we practice this a lot at Tamr.


The scheme of the schema

Satyen Sangani: (13:25) When we were all founding these companies, Hadoop was all the rage at the time, and everybody was talking about this notion of schema on read as a criticism of the traditional relational database, where everybody had to declare the schema up front and populate the schema. How much of that informed your thinking or is that a different way of saying exactly what you're saying right now, which is the schema needs to be sort of ideally binding later?

Michael Stonebraker: (13:47) This is sort of a little bit of a different dimension, but several years ago we had a collaboration with a large retail company in South America who made their historical schemas available to us. We found that the average schema lasted less than three months because business conditions just plain change and they change all the time. One of my pet peeves with current enterprises is that when presented with changed business conditions, they make every attempt not to change the schema but to overload it with junk. That may be short-term optimal because it minimizes the amount of disruption, but long-term that means that sooner or later this whole system is gonna die after you junk it up too many times. So I think schemas change all the time and that we would all be well advised to figure out good ways to accommodate that.

Michael Stonebraker: (14:56) And you guys at Alation, I mean, when you do data cataloging at large scale inside of a company, as you guys do all day every day, you realize the quantity and the heterogeneity of the data out there is insane, right? At some level, a lot of the data that people use is in spreadsheets, in this massively idiosyncratic kind of world. If you really catalog everything that's out there and you say, “Okay, well, now I need to create some comprehensive view around a key entity, and there are hundreds or, god forbid, thousands of sources, with a schema associated with each one of those sources,” then you'd be like, “Oh wait, the only way to solve that problem is by binding together a schema late and using probabilistic methods to do it.” There aren't enough good rules, or enough three-star wizards in lab coats (data engineers), to dynamically do all that stuff.


Solving data chaos

Satyen Sangani: (15:51) Yeah, and to me, the point that you're both mentioning, and that Mike is pointing out, is what comes up constantly when people say, “Doesn't a data warehouse do what Alation does? Or once you've got a big data lake, do you need an Alation or do you need a Tamr?” I think there's this kind of ethos in IT. The data discipline attracts people who love order and love organization and love optimization. There's this premise that you sort of assume “no friction,” like you would in a physics problem, and that the business problem's never gonna change. The questions are never gonna change. Of course, that's just not reality. So I think it's a really key observation. So the schema on read question is really a computational question. And I think historically databases have been limited in their size and scale.

Satyen Sangani: (16:36) In some ways a schema is not only a mechanism of modeling the world, but also of making an application run in a performant way. You mentioned something Andy, and I think this is the genesis of Tamr, which is like, you've got to take a probabilistic approach on the sort of interrogation of that schema, that database to understand or ask a question and that the notion of reconciling things can happen on a late basis. Talk to us a little bit about that, because that sounds like that's probably the genesis of where Tamr came to be and how was that different from traditional or original data mastering techniques?

Andy Palmer: (17:12) My favorite example is from a couple decades ago when I was a part-time employee of Informix, which is a database company that got bought by IBM. The new CEO walked into the executive staff meeting on day one and turned to the director of human resources and asked how many employees do we have, which is the ground zero analytic question. And the HR person said, "I don't know, but I'll find out." Fast forward to the next staff meeting he asked the same question again. And the HR person says, "I don't know. And you can't ask that question," because Informix was operating in 58 countries and there's no consistent definition of an employee. In Eastern Europe they were all consultants, in the U.S. you only count people who got a W-2, etc. Without data mastering to put these disparate datasets together, there is no ground truth. And the minute there's no ground truth, your analytics are just total garbage. Until you solve the data mastering problem, your analytics just won't make any sense.
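Illustration (not part of the conversation): the "how many employees do we have?" problem can be sketched by counting the same rows under two definitions of "employee." This is only a sketch; the rows, country codes, contract labels, and both rules are hypothetical.

```python
# Hypothetical HR rows; names, countries, and contract types are assumptions.
people = [
    {"name": "Ana", "country": "US", "contract": "w2"},
    {"name": "Ben", "country": "US", "contract": "contractor"},
    {"name": "Cyril", "country": "PL", "contract": "consultant"},
    {"name": "Dana", "country": "PL", "contract": "consultant"},
]


def is_employee_w2_only(person):
    """One possible definition: only people who receive a W-2."""
    return person["contract"] == "w2"


def is_employee_anyone_working(person):
    """Another possible definition: anyone working for the company."""
    return True


def headcount(rows, rule):
    """Count the rows that satisfy the given definition of 'employee'."""
    return sum(1 for person in rows if rule(person))


w2_count = headcount(people, is_employee_w2_only)
broad_count = headcount(people, is_employee_anyone_working)
print(w2_count, broad_count)
```

The data is identical in both calls; only the definition changes, and the answer changes with it. Agreeing on the definition is the mastering step that has to come before the count means anything.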


Receiver makes “wrong”

Satyen Sangani: (18:33) How previously did people do that? What was your observation and what was the change at Tamr that you guys brought?

Andy Palmer: (18:39) So one of the things that Mike and I saw, as Mike mentioned earlier, came from working with Joey Hellerstein and the team that ultimately founded Trifacta. Those guys were really thinking about the problem from a different perspective, which was the analytic perspective, the downstream perspective of, “Okay, I've got an analytic; how do I in the last mile organize and clean up the data so that it works for my analytic?” As we started to work more and more with Joey and [Jeffrey] Heer and the rest of that team, it became clear to us that both of these things were required and neither alone was sufficient to solve the overall problem in the enterprise. So self-service data prep, and what Alteryx became and what Trifacta became, were really picking up steam. And there were individual people at the very end of the analytic cycle, at the very last mile, that were organizing and doing these tweaks to their data to clean it up.

Andy Palmer: (19:31) But they would get their files from some extract from a warehouse or a mart or some combination thereof and usually put it into a spreadsheet, then load it into Trifacta or Alteryx and then visualize it in Tableau or Qlik. But when we looked across all the things that those kinds of users were doing, there were a bunch of things that were very consistent. There were fundamental problems with how the data was organized and how it was mastered. And the result of doing this highly idiosyncratic prep was that you looked at one person's view of analytics of customers and another person's view and the numbers were inconsistent. They weren't the same. And back to Mike's employee example, if you're an executive and you've got two people that have done their independent analysis and they're both telling you they have a different number of employees on any given day, you're like, “Who do I trust?”

Andy Palmer: (20:21) We think that last-mile data prep and organization is good and important, but it needs to be complemented by much broader mastering of the data. If you have an endpoint that says, “Hey, this is the definitive list of employees today based on all the information from all of our HR systems and it's dynamically mapped using machine-driven, human-guided methods,” then at least the people consuming into those data prep tools have one consistent endpoint for the best, cleanest, most comprehensive, well-curated data for these key logical entities.

Michael Stonebraker: (21:05) So I want to stress one thing that Andy just said which is the trouble with “make it right for my analytics” is that then you've got something that works for you at “time equals now.” Fast forward a day and you're now out of date, so you have to do it again. If you add a new data source, then you got to do it all again. And so it becomes extremely time-intensive to do “receiver makes right.” Because data mastering is really a global problem, and having everybody do it at their end point on the piece of data that they're interested in is unbelievably inefficient. So, data mastering is a global problem and it isn't well done by “receiver makes right.”


Transcending traditional MDM

Satyen Sangani: (21:53) In this example that you guys raised — let's call it the HR system, not the procurement system — I've got 75 of them. Day one, I have an option. I can turn 75 into something less than 75 by integrating and deprecating the original systems or rationalizing these systems or I can live with them. In a world where most people just choose to live with them, to Mike's point, system rationalization might be high on the list during a time of efficiency, but if you're focused on growth and all these other things, you probably don't do it. And so now I'm leaving these poor analysts at the end of the chain with tools like — at the time, Trifacta or other data prep tools like Alteryx and the like — and they need something that tells them what the employee list is and to look at, and you guys come in and say, “Hey, buy Tamr.” But at the time, at least as I remember it, MDM had existed for many years prior to Tamr. What was it that was different about Tamr and the technology that you were proposing that was different from what had existed in the past?

Andy Palmer: (22:49) The inspiration for me came when I was running software and data engineering, at the time when Mike was working on the Data Tamr project at MIT. We had tried and had access to every popular MDM tool, but the level of idiosyncrasy amongst the sources was so extreme that there wasn't any amount of rules-based approach (like predetermined data mapping and matching) that could even come close to handling what we had. One of our data systems that we used the Data Tamr project on early, at Novartis, had like 15,000 tables in it, and the idiosyncrasy in each one of these tables was so extreme, you just couldn't think in advance and design a rules-based [system], right? You had to just probabilistically (this is what we used Tamr to do) look at all the schema in each one of those tables and then come up with recommendations as to how they mapped and matched, and then use humans very deliberately to tune the mappings and matchings and even version control those mappings and matchings over time because they were always changing. It was this probabilistic approach. Before we started the company, we went out and talked to all of the people that do this stuff really well, like Informatica and Microsoft and IBM and Oracle, because we've been around for so long, we just know everybody. And this is like 2011, Mike, is that right? That we were doing that?

Michael Stonebraker: (24:12) Yeah. Probably. Yeah.

Andy Palmer: (24:14) The weird part is we were using machine learning back in 2011 and almost universally with the exception of one person — James Markarian, who was the CTO at Informatica at the time — almost all the people said, “We don't understand how to use machine learning for this. We're all set, we do this all manually.” James Markarian — it was just before he left Informatica — was like, “This is the future. This is how people are gonna do schema mapping, record matching, and classification and datasets. You're gonna use fundamentally probabilistic approaches.” So I think it was the probabilistic technique that was the real difference.


Probabilistically solving data problems

Satyen Sangani: (24:49) That was the real insight. And so, old world, I would basically have a list of values, like a list of employees from, I don't know, call it my HRIS system. I'd have a list of employees from some other system and I would just match them. I'd say, “Oh, maybe I'd use a regex [regular expression] or I'd declare a rule or I'd manually do it.” And then you guys basically came along and said, “Nope, we're gonna take a probabilistic approach.”

Andy Palmer: (25:11) Yeah. The key difference is the first thing Tamr does is it looks at both of the schemas and it tells you, “Hey, based on the models we have, we think this is how things map,” right? As opposed to, again, in the old world it's like, “No, there's a human being that looks at the two schemas and then starts to map them manually.” The minute you do that, you're kind of in this deterministic state as opposed to a probabilistic state.
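Illustration (not part of the conversation): the idea of a machine proposing schema mappings with a score, and routing the uncertain ones to a person, can be sketched as below. This is not Tamr's method. The columns, sample values, equal weighting of the two signals, and review threshold are all hypothetical; the only library call used is `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical columns and sample values from two source tables.
source_a = {
    "employee_name": ["Ana Diaz", "Ben Okoro"],
    "salary": ["3000", "4200"],
}
source_b = {
    "worker": ["Ben Okoro", "Ana Diaz", "Cy Lee"],
    "wages": ["4200", "5100"],
}


def name_similarity(a, b):
    """String similarity of two column names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(xs, ys):
    """Jaccard overlap of the sample values in two columns."""
    xs, ys = set(xs), set(ys)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def score(col_a, col_b):
    """Blend both signals; the equal weighting is an assumption."""
    names = name_similarity(col_a, col_b)
    values = value_overlap(source_a[col_a], source_b[col_b])
    return 0.5 * names + 0.5 * values


def suggest_mappings(review_below=0.3):
    """Return (column_a, best column_b, score, needs_human_review)."""
    suggestions = []
    for col_a in source_a:
        best = max(source_b, key=lambda col_b: score(col_a, col_b))
        best_score = score(col_a, best)
        suggestions.append(
            (col_a, best, round(best_score, 2), best_score < review_below)
        )
    return suggestions


suggestions = suggest_mappings()
for suggestion in suggestions:
    print(suggestion)
```

The output is a ranked suggestion with a score rather than a hand-written rule, and low-scoring suggestions are flagged so that a human can confirm or correct them. A real system would use far richer signals and would learn its weights from those human corrections.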


The impact (and limitations) of GenAI

Satyen Sangani: (25:35) Yeah. At the time there was a company called Waterline Data, I don't know if you remember it, which was in some ways like a hybrid between what you guys did and what we did. On some level they were a catalog, but then they also used this idea of a fingerprint where they said, “We think these two things are the same and we're gonna use some matching in order to make sure that they are actually the same.” It's funny how the gradations between the companies were different. There was sort of a continuum between them. Let's fast forward actually to today, because now everybody's talking about GenAI, and certainly there's a whole bunch of entity matching capabilities, obviously in text, that people are talking about. How does that change the landscape for you today? Is that technology that obviously makes you guys more relevant? Because one of the big problems of GenAI is just having a clearly defined semantic model, but under the hood, is that stuff that you're also leveraging in order to make the models better?

Michael Stonebraker: (26:24) I would say without much exaggeration that the entire CSAIL research laboratory at MIT is now oriented toward “What can we do with GenAI?” The big problem is that GenAI is trained on public data, basically, on the web. The trouble is that enterprise data doesn't look like web data. I don't expect GenAI to do anywhere near as well on enterprise data as it does on public data for that reason. The big huge problem that Tamr has every day with every customer is getting enough training data to make an ML model work. You can't use public data off the web because it's not relevant to your problem at hand. You can't really ask somebody that you find off the web to help you because a lot of the problems are very domain- and customer-specific.

Michael Stonebraker: (27:28) For example, is Merck with an address in Berlin, Germany, the same as Merck with an address in New York City? That's just a yes or no question. And the answer is, the three of us probably don't know. I turn out to know because I looked it up. They're totally different companies. Getting enough training data is the big problem. ChatGPT requires even more training data because it has a lot more parameters than other deep learning models. Deep learning hasn't been very successful at doing our kind of stuff, number one, because you can't get enough training data. Number two, invariably if you're dealing with money, you've got to be able to explain decisions. If you have a deep learning model that gives you a credit score and you then immediately say, “Why did you give me a particular score?”

Michael Stonebraker: (28:32) And all the person at the other end can say is, “Well, here's this black box. It said 596.” Nobody who handles money can get away with that kind of answer. Explainability and training data are the challenges with deep learning. That's the reason why Tamr at the present time does not use deep learning because traditional ML requires a lot less training data.
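Illustration (not part of the conversation): one way to picture an explainable match decision is a simple linear scorer where every feature's contribution can be read off directly. This is a sketch under stated assumptions, not a description of Tamr's models. The features, weights, and bias are hypothetical; in practice the weights would be learned from labeled pairs, which is exactly the training data that is hard to get.

```python
import math

# Hypothetical weights and bias; in practice these would be learned
# from labeled example pairs rather than set by hand.
WEIGHTS = {"same_name": 2.0, "same_city": 1.5, "same_country": 1.0}
BIAS = -3.0


def features(a, b):
    """Simple yes/no comparisons between two company records."""
    return {
        "same_name": float(a["name"].lower() == b["name"].lower()),
        "same_city": float(a["city"] == b["city"]),
        "same_country": float(a["country"] == b["country"]),
    }


def explain_match(a, b):
    """Return a match probability plus each feature's contribution."""
    contributions = {
        name: WEIGHTS[name] * value for name, value in features(a, b).items()
    }
    logit = BIAS + sum(contributions.values())
    probability = 1 / (1 + math.exp(-logit))
    return probability, contributions


merck_berlin = {"name": "Merck", "city": "Berlin", "country": "DE"}
merck_new_york = {"name": "Merck", "city": "New York", "country": "US"}

probability, why = explain_match(merck_berlin, merck_new_york)
print(round(probability, 3), why)
```

With these assumed weights, the shared name alone is not enough to call the two records the same company, and the breakdown shows why: the name contributes, the city and country do not. That kind of answer is what a black-box score cannot give.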

Andy Palmer: (28:58) Exactly. What Mike said. There's a bunch of places inside of Tamr where we use things that resemble GenAI kind of techniques, but our math is so specifically tuned to do these core functions of schema mapping, record matching, and classification. It reminds me that we go through this swinging of a pendulum. I've been studying AI since 1986. I started studying with Marvin Minsky back in the day. The reality is that we spent a lot of time doing these very generalized algorithms, and GenAI and the resulting LLMs are an example of a pretty generalized set of things. Then the pendulum swings back as we try to apply that math and make use of it in the world. For Tamr, we've just been very, very focused on very specific math that works to do these data integration and cleaning things.

Andy Palmer: (29:54) It's so highly tuned that these generalized techniques don't do nearly as well. There are places where they can be helpful and supportive, but our stuff is so specialist, and I think that's true in general with GenAI and LLMs, as we're gonna find over the next 5 or 10 years. Just like Mike said, there's a lot of training that's required to make this stuff work. Very few people will have the resources and the patience to do the training. Oftentimes you're gonna find, well, there's actually better math for any given problem. The very generalized math is good for general problems like ChatGPT. But when it comes to mapping attributes in columns of enterprise data, if you've got this very specialized, tuned math, it's probably gonna do a lot better.


Commoditizing commercial data products

Satyen Sangani: (30:37) Do you think there's an opportunity to sort of speed up the development of reference datasets? I mean this problem that you mentioned, Mike, with regard to Merck in Berlin versus Merck in New York. In some ways, if you think about a lot of large-scale mastering problems, I mean like my personal life, like I used to collect all these records and CDs and used to want to keep inventory of them and it was terrible to track all the metadata. But now I've got Apple Music and the problem got solved because everybody's record collection in the galaxy — whether it's Apple Music or Spotify’s website — it's all in the same place. Could you imagine the same thing for entities? Is that an area that you guys have considered or are looking into?

Andy Palmer: (31:13) We think this is happening right now with firmographic [data] for corporate entities. The reference dataset for company information is commoditizing incredibly fast.

Satyen Sangani: (31:24) Those are data products. And I use that word specifically because you guys have articulated in your messaging that Tamr is a mechanism for data product development. Tell us a little more about that because that sounds like a different way of articulating what you do, but an important distinction. What led you to that and why did you lean so hard into that message?

Andy Palmer: (31:43) We used to call these “logical entity types,” which is probably a very technical way to think about them. But calling them data products is really useful. This is really the mapping of all the physical data that exists in an enterprise and is cataloged using Alation but then using the power of the machine to map all of that physical data and all of its idiosyncrasy into a set of logical entities — representative data products — so that that clean curated data inside of each one of these data products can be consumed by lots of people inside the enterprise and/or machines that might want to consume it. And so the “data product thing,” I think it's a really effective way to describe the deliverable or the artifact that the modern data organization is delivering to the consumers of data inside of their enterprise. I think we're at the very beginning of realizing what data products can be and how beneficial they can be. It's like, still very, very early days.

Michael Stonebraker: (32:41) I think the thing you should also realize is that Tamr in theory does whatever kind of data integration you want, but essentially all of our customers want to either integrate suppliers, as in GE, they want to integrate customers, they want to integrate projects, they want to integrate parts, they want to integrate employees. There's a dozen or less very common things that people want to master and they are all amenable to the ML approaches that learn from the first project and apply it to the second project. I think there will be a dozen-ish very comprehensive data products. In other words, that stuff will get commoditized and will start costing next to nothing. The real value, I think, is elsewhere. It turns out that I play bluegrass banjo, and the bluegrass banjos that were built before World War II are prized by collectors and there's probably 10,000 of them.

Michael Stonebraker: (33:55) And my hope is that somebody will curate that collection of 10,000, because there's all kinds of fraud, people claiming a banjo is something other than it is. It's a very specific market. It's something where an expert curator could put a bunch of datasets together with other stuff and it would have a lot of value to this community. There's thousands of such communities that would value data products. The number of data products will be thousands and thousands and there will be a few very, very popular ones and all the value is going to be in the long tail.

Satyen Sangani: (34:41) What I think is fascinating about this is that it allows companies to finally bridge the wall between their internal data and truth about the world externally. So now I don't have to map to this sort of, to your point, Andy, the source is not the master. The master may not even exist inside of your company, and this idea that the knowledge about the world around a singular set of things can be flexibly shared is a pretty transformative one and I think would accelerate then all of this stuff around GenAI that we're all talking about, because now a thing is a thing. It doesn't have to be 15 things.

Andy Palmer: (35:15) Yes, exactly. Oftentimes, we engage with a customer who's doing supplier mastering. Our data products at Tamr come out of the box with third-party data sources already integrated. The simple question we ask is, “What do you think the odds are that the name, address, phone number, contact information that was fat-fingered into your procurement system a year ago, two years ago, 10 years ago, is better than the data that was scraped off of their website yesterday?” [laughter] They’re always like, “Yeah, you've probably got better data than we do in our own procurement system.” When we start with a supplier mastering project for the supplier data product, it's like, no, we start with the most recent refresh of the best data that's available on the open web and then you take their supplier file and we're like, “Oh, those are the companies that they're interested in. Okay, now here are all the best values for those companies at this point in time.” In retrospect, it's kind of an obvious thing.
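The workflow Andy describes (start from a fresh external reference dataset, then use the customer's supplier file only to pick out which companies matter) can be sketched roughly as follows. This is a minimal illustration, not Tamr's method: Tamr uses machine learning for matching, whereas this sketch swaps in plain string similarity from Python's standard library. The function names, the suffix list, the 0.85 threshold, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "corp", "llc", "ltd", "co"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and drop common legal suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def best_reference_match(supplier_name, reference, threshold=0.85):
    """Return the most similar reference record, or None if nothing is close enough."""
    target = normalize(supplier_name)
    best, best_score = None, 0.0
    for record in reference:
        score = SequenceMatcher(None, target, normalize(record["name"])).ratio()
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None


# Assumed external reference data (for example, a recent web refresh).
reference = [
    {"name": "Acme Corp.", "address": "1 Main St, Springfield", "phone": "555-0100"},
    {"name": "Globex LLC", "address": "9 Ocean Ave, Cypress Creek", "phone": "555-0199"},
]

# Assumed internal supplier file with hand-entered, partly wrong values.
supplier_file = [{"name": "ACME corp", "address": "1 Mian Street", "phone": ""}]

# The supplier file says which companies matter; the reference supplies the values.
mastered = []
for row in supplier_file:
    match = best_reference_match(row["name"], reference)
    mastered.append(match if match is not None else row)
```

In this sketch the internal row is kept only when no reference record clears the threshold, which mirrors the idea that the source system is not assumed to be the master.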


Defining a data product

Satyen Sangani: (36:14) Yeah. So when you use this idea of a data product, for you guys it means something more narrow than I've historically considered it. I've always thought about a data product as being like a report, a schema, a table, a spreadsheet — any of those things could be data products. When you're talking about data products, you're really talking about a very specific notion or instance within the concept of a data product, which is like, “This is the list of values that is really canonical.” It's the thing that we all refer to and refer back to.

Andy Palmer: (36:39) Yeah, it's the reference dataset. The other stuff, like the metadata, is a part of it and ideally you have all the lineage for like, “This value for that address for that customer or supplier was selected from these five other addresses, two of which came from external sources, some of which came from internal sources.” The mapping of all the provenance information is sort of a part of that data product. But it's not the primary thing. The primary thing is the data itself and the best, cleanest, curated version of that data — because that's what consumers want. In some ways they don't really want to worry about how you got there, they just want to consume it. It's just like your music, there's all this stuff behind Apple Music and Spotify and YouTube Music where they're curating aggressively all of the content related to that. As a consumer, you kind of don't care, you just want to find the song and play it.
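As a rough illustration of the lineage Andy mentions, where one address is selected from five candidates and two of them come from external sources, the following sketch keeps every candidate alongside the chosen value. The class names, source names, and the selection rule (most frequent value, with ties broken by external support) are assumptions made for this example, not a description of how Tamr selects values.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    value: str      # the candidate attribute value
    source: str     # hypothetical name of the system it came from
    external: bool  # True if the source is outside the enterprise


@dataclass
class MasteredValue:
    attribute: str
    value: str         # the selected "best" value consumers read
    candidates: list   # full lineage: every value that was considered


def select_value(attribute, candidates):
    """Pick the most common value; break ties by number of external sources."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    counts = Counter(c.value for c in candidates)

    def rank(value):
        external_votes = sum(1 for c in candidates if c.value == value and c.external)
        return (counts[value], external_votes)

    winner = max(counts, key=rank)
    return MasteredValue(attribute, winner, list(candidates))


candidates = [
    Candidate("1 Main St", "web_scrape", external=True),
    Candidate("1 Main St", "firmographic_feed", external=True),
    Candidate("1 Mian Street", "procurement", external=False),
    Candidate("1 Main St", "crm", external=False),
    Candidate("PO Box 12", "billing", external=False),
]

address = select_value("address", candidates)
external_count = sum(1 for c in address.candidates if c.external)
```

A consumer would normally read only `address.value`; the `candidates` list is there for anyone who needs to audit where the value came from.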


The future of Tamr

Satyen Sangani: (37:34) Yeah, for sure. As you think about evolving Tamr now in this new world of data products and, certainly, where the probabilistic approach does, especially when you have large data sets to compare it with — large reference datasets to compare it with — probably has fairly high fidelity. How do you think about evolving the company and what are the big problems that you're focused on today?

Michael Stonebraker: (37:56) I think it's pretty obvious to anybody with half a brain that the world is going to move to the public cloud. Everything you can possibly move to the cloud, you will move, and we'll start with decision-support stuff because that's typically easy to move, but all your transaction systems will move there over time except for the stuff that's in COBOL that you've lost the source code for; that's never gonna move. Data products and data mastering are basically a cloud problem. You want to be cloud native, you want to run software as a service, you want to be friendly to the cloud vendors. Tamr spent a lot of time over the last two or three years doing exactly that. There's a big difference between running on the cloud and being cloud native and running software as a service. That's what we're focused on big time right now. After that I think there's a lot of research directions we're paying attention to, trying to build more semantics into tables to be able to leverage; you can think of this as leveraging more exhaustive catalogs to do our stuff better. I think that's something we're thinking about a bunch.

Andy Palmer: (39:16) Mike's point is dead-on. One of the reasons we love working with Alation is that the better job that people have done in cataloging their data, the easier it is to create a better data product from all of that source data. And we've only just begun to scratch the surface on data cataloging; there's a lot of work to be done in these big enterprises of getting all the data cataloged, getting it all mastered and curated and then delivering it out. And as Mike mentioned, we see the cloud and cloud infrastructure, multi-tenant cloud infrastructure, as a key enabler to making that happen faster. Early on at Tamr, we did a lot of stuff on-premise and those projects just took so much longer and you ended up doing a whole bunch of infrastructure stuff that's just not required. We're really encouraging all of our customers to think of cloud-native, multi-tenant infrastructure as the de facto starting point, because that'll let them get to better outcomes much faster. Again, there's so much work to be done in cataloging, organizing and then delivering all this data.


A bottomless well of ideas

Satyen Sangani: (40:20) Yeah, it does certainly feel that way. Speaking of which, I mean there's so many problems to solve. Andy, you're a prolific angel [investor], you're involved in tons of companies, you advise tons of companies. Mike, it sounds like being a professor at MIT and a Turing Award–winner would keep one occupied. How do you guys do this? You've come up with collaboration. Just on a personal level, just outside-in, I've always marveled at that — I have a hard time doing my job and raising my kids and I can't imagine having all the stuff that you guys do on the side. How do you manage that and what's kept it going? What has allowed you to do what you do at this level of scale and at this level of efficiency?

Andy Palmer: (40:56) Mike has a neverending set of crazy-cool new ideas and he's got a really cool new one called DBOS. We get these things started together and then I just try to keep up with him and carry his bag around. That's my view of — our formula is like me just scrambling around trying to keep up with Mike [laughter].

Michael Stonebraker: (41:14) One answer to this question is that the research community is relatively large and the difference between very successful researchers and everybody else is the very successful researchers make sure they solve a problem that somebody is interested in. I spend a lot of time talking to real world people asking, “What are your pain points?” That keeps me focused on things that are relevant to somebody. That's the high-level bit that keeps me producing stuff that somebody's interested in. My favorite current example is, security is getting to be a more and more and more serious problem. The number of ransomware attacks that you never hear about is astonishing and the number of billion-dollar losses as a result is also mind-boggling. So what you start out with is Linux, which is 40-year-old software. That's a very leaky boat, as Andy loves to call it.

Michael Stonebraker: (42:21) And the development of Linux has been very, very slow because it's very elderly code. People have layered stuff on top of it like Kubernetes, which is just some more stuff and then the security companies layer a bunch more stuff on top of that. So you paper over a leaky boat and in my opinion, that's not the way to get a secure reliable system. We've written a thing called DBOS, which puts all operating system data into a database inside the kernel. So you only have to trust one thing, not a whole bunch of things, namely an OLTP multi-node database system. So that's my latest thing. You can check out DBOS. There's a bunch of papers and there's a startup company that’s working hard at commercializing it.


A great technologist versus a great entrepreneur

Satyen Sangani: (43:16) Oh, interesting. I think Matei Zaharia is a collaborator with you on that, as well as, I'm sure, an army of researchers.

Michael Stonebraker: (43:22) All correct, yes.

Satyen Sangani: (43:24) Which strikes me as one of the core skills. I mean, to your point, the difference between a great technologist and a great entrepreneur — or in your case a great and prolific researcher and academic and maybe somebody who is just able to push out a whole bunch of papers — is just building stuff that other people want. But it also strikes me that you've just recruited some incredible people to work with and are able to share the podium with other pretty phenomenal minds and both of those things are pretty incredible skills.

Andy Palmer: (43:48) One of the coolest things about Mike and Matei and their relationship is that back in the early days of Spark, Mike was not a big fan [laughter], and Mike and Matei used to go at it pretty hard about Spark, but in the process they built such a level of trust and relationship as they were working through the hardest problems. And now, watching Mike and Matei work on DBOS together over the last two or three years has really been remarkable. When you get people that are crazy smart, who respect each other's opinions very deeply, when you get them in the same room, it is really inspirational to kind of watch 'em go at it. Very cool.

Satyen Sangani: (44:22) Yeah, it's super privileged to be a part of that era. Mike and Andy, I probably could ask you questions for the next two hours, but this was a lot of fun and I think there's at least three insights that I can catalog in my brain that people will just be floored by. So just thank you for the time and thank you for sharing all your insight and wisdom and experience. It's just fun to see and fun to be a part of the same community as you guys.

Michael Stonebraker: (44:42) Thanks for your time.

Andy Palmer: (44:44) Yeah, thank you.

Satyen Sangani: (44:51) People like Mike and Andy have revolutionized the database. They've transformed how we use data today. How do they do it? To hear Mike tell it, his research has always started with a problem that many people would want to solve. He asks his target audience, why is this hard for you and what would make it better? He has also collaborated with other data radicals like Matei Zaharia. He didn't love Spark at first. He challenged it — and Matei. And in doing so, they were both able to grow. Today, Mike and Andy are helping customers and companies create high quality data products so they can drive business growth and enhance decision-making. Thank you for listening and thank you, Mike and Andy, for joining. I'm your host, Satyen Sangani, CEO of Alation, and data radicals, stay the course, keep learning and sharing. Until next time.

Producer: (45:34) This podcast is brought to you by Alation. Your entire business community uses data, not just your data experts. Learn how to build a village of stakeholders throughout your organization to launch a data governance program and find out how a data catalog can accelerate adoption. Watch the on-demand webinar titled Data Governance Takes a Village at alation.com/village.
