Building a Data Warehouse from Scratch

with Jacob Baskin

Episode 28 | June 24th, 2026

BLURB

In university Jacob Baskin studied at the intersection of computer science and economics, thinking about systems that incentivize people to express their true preferences. He put those ideas into practice at Google, where he worked on ad serving, before joining Jane Street’s database infrastructure team. In this episode, Ron and Jacob discuss Superstore, a distributed columnar database now central to Jane Street’s tech stack that Jacob began building practically the day he started. How do you support wide-ranging analytical queries while transactional writes stream in at the speed of trading systems? And what’s it like when your first design doc leads to an eight-figure hardware purchase? After building Superstore Jacob has since gone back to his roots, thinking about schemes for bidding on compute time as he works to optimize usage of the Hive, Jane Street’s massive compute cluster for research.

SUMMARY

In university Jacob Baskin studied at the intersection of computer science and economics, thinking about systems that incentivize people to express their true preferences. He put those ideas into practice at Google, where he worked on ad serving, before joining Jane Street’s database infrastructure team. In this episode, Ron and Jacob discuss Superstore, a distributed columnar database now central to Jane Street’s tech stack that Jacob began building practically the day he started. How do you support wide-ranging analytical queries while transactional writes stream in at the speed of trading systems? And what’s it like when your first design doc leads to an eight-figure hardware purchase? After building Superstore Jacob has since gone back to his roots, thinking about schemes for bidding on compute time as he works to optimize usage of the Hive, Jane Street’s massive compute cluster for research.

Some links to topics that came up in the discussion:

CONTENTS

Mechanism design and economics (00:00:02)
Building ad markets at Google (00:03:57)
What the theory misses (00:14:21)
Distributed systems lessons (00:18:21)
A curb-mapping startup (00:21:56)
Joining Jane Street and the limits of Postgres (00:26:02)
The Wild West and long-running transactions (00:36:09)
Designing Superstore (00:42:06)
No direct Parquet access (00:51:04)
Build vs. buy (00:56:14)
What he’d do differently (01:05:23)
The Hive (01:12:58)
Scheduling compute as mechanism design (01:21:17)
Distributing work and queries (01:27:33)

TRANSCRIPT

Mechanism design and economics (00:00:02)

Ron

Well, it’s my pleasure to have Jacob Baskin here on Signals and Threads. Jacob’s a software engineer here who has worked on a bunch of large scale compute and data systems that we’ve built here, and we’ll talk about a few of those. But first of all, just welcome. Thanks for joining me.

Jacob

Happy to be here.

Ron

And I always like to hear a little bit about how people got started. Can you tell me a little bit more about how you got into computer science in the first place?

Jacob

Well, so I’ve been coding since I was a little kid. My dad taught me basic back when I was seven. But then in college, I was actually not going to be a computer science major. I initially planned to be a psychology major, but I couldn’t get away. I enjoyed coding too much and there was too much cool stuff to work on. So I went to Brown University and I ended up doing a bunch of stuff for my thesis around mechanism design and the connection between economics and algorithms, which I found super interesting and had no particular belief that I would ever end up using this in my career. There’s some really cool stuff where incorporating the economics makes finding efficient algorithms and finding good ways to actually do the computations within a reasonable amount of time much more difficult.

Ron

And just say a bit of jargon, like mechanism design is weird econ jargon for designing markets, right?

Jacob

Yeah. Well, markets are one kind of mechanism. I guess if you want to be technical about it, there’s a lot of mechanisms that don’t look like markets. But usually when people talk about mechanism design, they’re talking about something that looks like an auction, where people have some number that’s denominated in dollars or whatever, that they’re willing to pay for a certain thing or a certain collection of things. And your job as a mechanism designer is getting everyone to say their amount of dollars without having to be strategic and underhanded in a way that lets you figure out a socially good, so something that will result in the most actual value for real human beings, allocation of resources to the people who are contending for them.

Ron

And then how does mechanism design connect into software engineering?

Jacob

Well, often the thing that you’re trying to do is hard, not just from a mechanism perspective. It’s not just that different people want things in different amounts and you have to figure out who wants stuff more and who wants stuff less, but actually figuring out the allocation is computationally difficult. So some classic problems and mechanism design are things like building networks. If you want to connect a bunch of different people’s houses to the internet, you have to literally go in the world and dig trenches and put fiber in trenches, right? And figuring out what are the right trenches to dig and what’s the right fiber to put. Even if you’re the king of the world and you get to dig whatever trenches you want and command the resources of everyone in the economy, still involves some algorithms and computation, but in a world where you actually have to find contractors to do this and pay them and Verizon is out there and AT&T is out there and they want to do this in different places and are willing to charge you different amounts, right? Then you get this combination of this algorithmically challenging problem, this economically challenging problem. And you have to kind of solve both at once and they have interesting effects on each other.

Ron

Sure. Although I would like to point out that I asked you, how does it connect to software engineering? And then you told me how it connected –

Jacob

It connected to computer science.

Ron

To digging trenches.

Jacob

Yes. Fair enough. Okay. It connects to software engineering because a lot of the problems that we’re trying to build software to solve are problems that are economically important, where the ways that people interact with software are driven by their incentives and by their economic desires.

Ron

Right. And obviously you can totally see this stuff playing out in design of exchanges and markets and stuff like that. And there you see a rich intertwining of what you do on the mechanism design and how people behave and how that affects the technology. So despite you imagining this wouldn’t come up much in life, here you are at a trading firm.

Building ad markets at Google (00:03:57)

Jacob

Right. Here I am. It turns out that my interests are what they are and are somewhat consistent over the course of my life. But in fact, the way that I ended up dealing with this first at work involved advertising. My first job out of school was at Google, and I wound up working on Google’s very large and very interesting advertising business, helping Google do a better job of showing ads on other people’s websites. The system that I was working on, interestingly enough, was called AdExchange.

Ron

So yet another market.

Jacob

Yet another market. And this one was in fact a two-sided market, much like financial markets because you had people with space who wanted to show ads and then you had people who had ads and wanted to find space for them. And our job was to connect them. And we did in fact have to do some mechanism design to make the system work well.

Ron

So I mean, Google obviously selling advertising is very deep in that business. And in fact, early on it was kind of the only way they made any money at all. Where was this in the kind of evolution of Google’s ad business? What had been happening in their business at the time?

Jacob

So in 2008, which was in fact the year that I started at Google, they bought a company called DoubleClick, which is one of the original companies for doing banner ads. Back in the olden days in the internet, advertisements were much more about like you go to a website and there’s like an image and that image is an image that advertises something and you have to pick which image to load when someone visits the website. And that particular problem of loading the right image when a person visits the website and then also keeping track of all of these, that was DoubleClick’s bread and butter. And Google was very good at showing ads on Google search, but not so good at showing ads on other people’s websites. I mean, Google’s had that business for a long time too, but particularly when it came to sort of larger and fancier websites, DoubleClick was the industry leader. So Google acquired them and then they had this problem of, well, can we do both? Can we take the stuff that we’re so good at when it comes to showing ads on google.com, which was an auction forever. They wrote very interesting papers back when they were building the system about how they did this auction mechanism for showing ads on Google. Can we combine that kind of auction mechanism with something that takes into account the needs and the desires of large web publishers, people like newspapers or large news operations, Huffington Post, Buzzfeed, this was the aughts. So those were large companies at the time. So how do we adapt this auctioning world to one where there’s all these publishers, but also where we don’t know what all the ads are in the world, where in fact there’s all of these other companies, whether they’re ad agencies or whether they’re middlemen who understand the advertisers and what they want and want to integrate with our system in some reasonable way.

Ron

Can you say a little bit more about what the kind of two sides of this two company combo was? I can sort of guess at what Google was good at here. Google is really good at building infrastructure and I bet did a lot of good work on setting up the auction so it made sense and had coherent incentives and all of that.

Jacob

So what was DoubleClick bringing to the table?

Ron

What was DoubleClick good at?

Jacob

Double-Click understood enterprise customers. I don’t want to overgeneralize about Google in the mid 2000s, but there is certainly an ethos of we don’t know why anyone would rather talk to a salesperson than use a web UI.

Ron

I mean, I do sympathize.

Jacob

I sympathize too. I would almost always rather use a web UI than talk to a salesperson, but there are definitely cases, especially when it’s one company talking to another company where first of all, having the attitude as a business that we really want to understand what it is that our customers are trying to do and why, rather than think that we should know better than them and tell them what we should try to do. And then also have a much more managed relationship with these people that’s not just like, “Here you go, enjoy your advertising system.” So that was sort of where DoubleClick was coming in. They had built relationships with the very largest web publishers and the very largest advertisers. And obviously Google appreciated having those relationships, but they also really appreciated having the DNA of, well, here’s what it takes to serve large customers, which was just not something that Google did very much of at all in the early days.

Ron

Got it. So then what was the kind of technical work that you were doing? You have these two companies that you need to integrate their businesses together. How does that turn into a technical problem?

Jacob

Well, so we were building all of this on Google infrastructure. So DoubleClick and Google had completely separate technical infrastructure and pretty quickly the decision that, and this was in fact before I started, Google made the decision we’re going to throw out all the DoubleClick stuff and build it all again on Google infrastructure because of course they did. So the challenge became sort of trying to take the insights that DoubleClick had about their business and combine them with Google’s … It’s worth saying that at the time, Google’s infrastructure felt like it came from the future. A lot of the stuff that Google built … And these days, we just sort of live with it in the cloud. So we don’t recognize after the fact how revolutionary it was back in the mid 2000s, but things like MapReduce and BigTable, which were Google’s distributed systems for storing and processing large amounts of data. They had this very large distributed file system called GFS that stored many petabytes of data. You could just run applications over thousands of computers at once just by typing in the right numbers into a config file. These things were not table stakes across the industry.

Ron

Yeah. And I remember when those papers came out, they made a huge splash because it was … In some sense, the underlying techniques were not super new or surprising exactly, but they had sort of implemented it at scale and gotten … And there’s a lot of hard things to go from taking an idea that kind of makes sense to doing it really effectively at scale and had learned a lot from that experience. And it was actually great that they published those papers and I think it influenced the industry broadly a bunch.

Jacob

So really the technical work or one very hard piece of technical work was taking this Google infrastructure that was, again, very horizontally scalable and combining it with these double click insights about what is the feature set that you need to serve these large publishers and large advertisers. For ad exchange specifically, there’s another very interesting technical piece, which is called real-time bidding. To integrate with the systems on the advertiser’s side, the things that actually know what ads people want to show, rather than have them give us all of their ad inventory in advance and be able to pick out of a database, we actually would query these systems across the internet for every single ad impression that they wanted to show and let them submit a bid in real time.

Ron

Oh, wait, how does that actually work practically? They have ads. Those ads have relatively heavy assets like images and stuff that need to get delivered. And is it like literally just in time, like someone loads a webpage and it needs to find an ad and it goes all the way end to end to the actual advertiser who’s providing the ad and gets the image at the last moment?

Jacob

Yes.

Ron

Is it really like just in time the whole way through?

Jacob

Well, so it’s actually, it’s just in time twice. So the first thing that happens is you say like, “Hey, I want to show an ad here.” And then you need people to bid. So that goes through Google systems to look at all of the people who sort of have bids. They have some filter criteria. We don’t call everyone for every single time Google shows an ad on the internet, but there’s some filter criteria. We choose which advertisers call. We say like, “Hey, here’s information about this advertising spot on this website. What’s your bid?” And then they return us a bid and then if they win, we inject a little HTML snippet into the ad unit on the webpage. That goes to their system and fetches the image.

Ron

Got it. So then the final asset is not actually flowing through Google systems, that’s just being directly loaded from the final advertiser.

Jacob

Exactly. With a lot of asterisks around … This is all very much integrated with all of the crazy ad tracking pixels and cookies stuff that got built in the internet around that time. So really the load on pages for loading ads, I mean these days as well, but also in those days, was incredibly heavy just in terms of all of the various tracking cookies and pixels and different domains that provided all of this different stuff to do metrics and monitoring for the ad. That ended up being much worse than the image itself.

Ron

Got it. And then also this whole infrastructure is also probably kind of a security disaster, although maybe in a way that wasn’t as obvious back then.

Jacob

Definitely all of the tracking and cookie stuff is absolutely a security and privacy mess, but we did in fact think about this. And one of the things that we built was a way for advertisers to link what they knew about a given user with a particular ad impression without us being able to extend their reach to that user across the internet. We would anonymize users identifiers in a way that didn’t give advertisers access to them, but it still let them know if they had seen that user before.

Ron

Got it. So I guess that sounds like part of the security problem. I guess the other security problem is around the like delivery of weird malware through JavaScript nonsense,

Jacob

Yes.

Ron

This kind of complex routing system.

Jacob

Yes. So Google also had pretty reasonable controls for that. That can be very difficult. That was a real arms race with malware actors, but at least we tried very hard. But yes, certainly once you get to put arbitrary HTML inside a user’s browser, yeah, it’s a pretty fraught scenario. And certainly at the time, browsers didn’t have any particularly good sandboxing abilities for managing this well.

Ron

Right. Luckily when this all started, it was less important and then we had some time to spend time making it better as it got more central to the world.

Jacob

Yeah, I think that’s fair. It’s certainly very important to the publishers at the time. If you remember back in the day, the New York Times didn’t have a paywall. Just so much journalism on the internet and just content more generally on the internet was ad supported. And this is one of the exciting things about working in that industry at the time is we were like on a mission to make the ad supported internet actually work for people. Now we failed, but we tried really hard.

What the theory misses (00:14:21)

Ron

So you’d studied something about mechanism design in school, and then here you were right in the middle of designing these mechanisms. How well did the lessons that you learned in school pan out? What do you feel like you learned from the whole process of trying to do this stuff with a real system?

Jacob

So the main message that you get or that I got at the time from the sort of academic community around mechanism design was that building incentive compatible mechanisms was extremely valuable. That if you build the mechanism right, it benefits people to tell you the truth. And if you build mechanisms wrong, then people are going to try and game the system, and this will end up resulting in a worse situation than you otherwise would have. And I think the main thing that I learned in industry was that’s definitely a valuable thing that it’s nice to do, but that is a much smaller part of the problem than I believed it was back when I was in school.

Ron

So what’s the rest of the problem?

Jacob

So two big things that you have to focus on that are not covered in that view of mechanism design are first of all, people need to understand the mechanism. It’s no good if you can give them the complicated proof that the mechanism is incentive compatible if they don’t have any idea what the heck is going on. So being able to point to, well, here’s why this good thing happened to you. Here’s why this bad thing happened to you. Here’s why we charge you this amount without having to get a whiteboard out is really important when you’re actually trying to sell something to people who are not mechanism design PhDs.

Ron

Can you give an example of a part of the technical design that was done to achieve that goal of making the results more explainable to people?

Jacob

Okay. Well, so a great example is budgets. If I’m an advertiser and I want to go and show ads on websites, probably I don’t want to spend infinite amounts of money, even if I can get really good prices all day.

Ron

Sure.

Jacob

Probably I want to spend a thousand bucks, maybe a thousand bucks a day.

Ron

You want some risk limits?

Jacob

Yeah, I want some risk limits.

Ron

Fantastic.

Jacob

So let’s say that I go and I start spending my budget and I spend it really quickly and then I have like five bucks left and it’s getting towards the end of the day. What should I do to give myself better outcomes? Well, I should probably lower my bid, right? Because the chances are that I have been spending too fast and I could get more of the good stuff that I want for less money if I just bid less for it. And this is not really an incentive compatible result because another way of saying that is if I have a budget, then I should bid less than my marginal value of each thing in order to get the best results for me. Right. So budgets break sort of a very naively memoryless incentive compatible auction all the time. And yet very few people try and model this for their market participants. In almost all the cases, what you’ll see is that people offer, maybe they have some kind of second price auction, maybe not even, but nobody tries to take into account people’s budgets when figuring out how to allocate them stuff.

Ron

When you say nobody, you mean like in the academic literature, nobody thinks about budgets, or do you mean in real world systems?

Jacob

In real world systems, people let you manage your budget. They don’t try and manage your budget for you. And they will even maybe say like, “Hey, you should bid less so that your budget lasts all day.” And rather than trying to model budgets as part of the mechanism, they just let it be this other thing that people have to manage on their own. Because when we try and incorporate people’s budgets for them in some way, they just get results that they don’t understand. And in fact, people have a lot of constraints that they’re not telling you that you’re not really in a good place to know and you just kind of have to let them do their thing and be sort of incentive compatible in a narrower sense if you can without really trying to be fully incentive compatible in some theoretical sense. I remember –

Ron

So simplicity is a value.

Jacob

Simplicity is a value.

Distributed systems lessons (00:18:21)

Ron

So Google was obviously an interesting place to learn about all of this kind of market stuff around ads, but there’s also a lot of interesting technical stuff happening there as you were mentioning. What do you feel like you learned from the experience of being a kind of systems engineer and systems designer in that world?

Jacob

Just so much. I feel like I’m still mining a lot of my early Google experience for software engineering insights. One of the main things that I learned was kind of how to think about distributed systems in ways that let you design them well. The Google approach to building distributed systems is to try and make the hard part be as contained as possible. You take some existing system that does a very hard distributed system thing, whether that’s distributed consensus or whether that’s like building a large stateful key value store, and then you build your system in terms of that existing thing. And often this is sort of a nested process where like Big Table, which is this great distributed key value store is built on top of Chubby, which is this distributed consensus mechanism. And most engineers at Google are not building big table or Chubby, they’re building something on top of both of them. And in particular, building large stateless systems is much easier than building large stateful systems. So you really want to contain the state of your system in as few different components as you can.

Ron

Although I find it’s always kind of funny when people talk about building stateless systems, like often there is state there. It’s just like not being managed by you.

Jacob

Right

Ron

And so it’s more about like, it’s not so much about the system you’re building being stateless, but the component that you’re adding on, not being responsible for managing this data on itself.

Jacob

Well, almost every system is like stateful in some sense, or what is it even doing? Maybe you’re building a data center of a million pocket calculators, but otherwise you’re pretty stateful. But yeah, again, the goal is having as few of the components be stateful as you can. So a very Google approach to this kind of problem is like, well, we’ll have this stateful core that goes and gets the data and loads into memory in some extremely efficient form, and then we’ll just have a bazillion web servers that I’ll go talk to it and we’ll push as much of the workout to these bazillion web servers as we can, and then they’ll make what are hopefully very efficient queries to the thing maintaining the state, and that’s what the system will look like. This does not look a lot like the way that trading systems work, where you end up keeping the state very close to the edge because you prioritize latency at levels that Google just didn’t. Our latency target for ad serving was 250 milliseconds.

Ron

Which by some measures is kind of fast, but-

Jacob

Yeah, it’s fast enough that you’re not going to go and get a cup of coffee waiting for the website to load, which is really the important thing.

Ron

Right. But it’s very, very different from the kind of latencies that we worry about when we’re building systems.

Jacob

And so you can do a lot of things with a 250 millisecond latency budget that you can’t if your latency budget is in the microseconds.

Ron

Right. Yeah. And people often talk about technical organizations pointing out that the technical organizations are often kind of shaped like their initial application and Google’s early infrastructure in many ways was built around search and ours was built very much around trading systems and that’s led to some very different choices.

Jacob

And in ads at Google, we were 100% using components that were initially developed for Google search. Google looked at advertising, this was before my time, Google looked at advertising as just a search problem. Our problem is like we have to index all of the ads in the world and find the best one to show for a given website. Yeah, if you squint hard enough, sure.

Ron

Okay. So you learned a bunch of this fun stuff at Google. How did you end up here?

A curb-mapping startup (00:21:56)

Jacob

Well, so from Google, I ended up going and doing a startup focused on helping cities manage their curbs.

Ron

What do you mean manage their curbs?

Jacob

So if you walk down the sidewalk, certainly in New York, but also in really just about any city in the world, there will be all kinds of arcane paint markings and signs with icons on them and so forth, denoting if someone pulls a vehicle over to the side of this particular street, what are they allowed to do and what is going to get them a ticket? Okay. Maybe you’re allowed to park, but do you have to pay? Can you park for four hours or can you only park for one hour? Maybe you can’t park, but if you’re a truck, you can load and unload goods from the truck. Maybe you can’t unload goods, but you can unload passengers, but only for three minutes. All of these rules have sort of been built up over time and accrued in these cities and almost nobody has a good record of what they are and why. Okay. So we were trying to solve this problem and build a platform for cities to manage both the actual physical control infrastructure that causes these rules to take effect on a given street, and then also better be able to understand and analyze what the rules even were. How many parking spots are there in New York? This is a harder question to answer than one might think, especially because it changes day to day and hour to hour.

Ron

Wait, it changes day to day. How does it change from hour to hour?

Jacob

Well, there’s a sign that says no stopping 4:00 PM to 6:00 PM Monday to Friday.

Ron

Okay. And then how is this company going to solve those problems?

Jacob

One of the interesting things about parking signs is that exactly where they are matters to plus or minus one parking spot, which is a little tighter than you can do reliably in urban areas with GPS. Okay. So one piece of infrastructure we had was an app that used sort of AR inspired visual odometry techniques to figure out exactly where along the street different things like signs were.

Ron

Wait, AR, augmented reality.

Jacob

Yes.

Ron

What?

Jacob

So if you like Pokemon GO, you lift up your phone and there on the table is sitting at Pokemon and you move your phone around and the Pokemon stays where they are relative to the things in the scene, even if your phone moves. So to do that, you have to build a map of the scene inside your phone.

Ron

Sure.

Jacob

If you can build a map of a scene inside your phone, then you can build a map of a block inside your phone.

Ron

Okay.

Jacob

And then we can use those to position the parking signs along the block with extremely high accuracy.

Ron

Nice. Okay. Fun.

Jacob

So that data collection problem is actually really fun to solve. And then once you have the data of like, well, where are the parking signs? Well, then you have to figure out what they say, which is a transcription problem. We didn’t have Claude back in the day. We had to figure out how to do this using slightly less general techniques.

Ron

Sure.

Jacob

And then even once you know what, they say you have to know what that means.

Ron

Right.

Jacob

And parking signs have this language that’s sort of all their own where often-

Ron

I would think one advantage is like the vocabulary is relatively small.

Jacob

The vocabulary is relatively small. We ended up essentially hand building parsers for this, but it’s actually quite tricky because this is designed to be sort of easily looked at by humans who know that particular city very well. So the vocabulary varies city to city and also things like the position of different numbers and words on the sign looks very different than the linear reading of a sentence of text on a computer. There’s a big two over here and then somewhere else on the sign, it’ll say maximum hours parking. And do you connect those? Well, yes, as a human, we’re very good at using spatial context to put these things together, but figuring out a way to model this in a computer can be quite challenging.

Ron

So this sounds like a big pile of fun, grotty engineering problems and a bunch of weird domain specific knowledge that you need to make this thing work. How did this work out as a business?

Jacob

Not so well. Not so well, Ron. I love this problem to death. I still wish this data existed in the world, but we were not really set up very well to turn this into a business.

Ron

Okay.

Joining Jane Street and the limits of Postgres (00:26:02)

Jacob

That didn’t so much work out. And then I was saying, “Well, okay, what do I do next?” I knew I didn’t want to go back to Google. Partly just kind of been there, done that. Partly the company had gotten much, much bigger in the time that I was there and in the time since I was there. And I was looking for something that I could do in New York. I loved living in New York. I didn’t want to move.

Ron

Back to being enough of an urbanist to want to do a startup around cities.

Jacob

Exactly. And I loved this kind of economic thinking that I was doing back earlier in my career. And I knew of, in fact, a bunch of people who were doing finance-y type stuff in New York. And so I thought, well, where do I do this? What’s the right place for me to put all these things together? Jane Street was a super obvious choice for me, partly because almost all the people I knew who were working in the financial sphere were in fact at Jane Street already. These are people who I’d worked with at university. These are people I knew from Google. These were random people I knew socially. It was a bunch of people who I thought were extremely smart, who I really respected, and they all managed to wind up at this place. And you’re kind of like, “Hmm, I wonder how that happened.” I mean, also I had done some functional programming in college, and even back in 2008, already in the back of my head was, “Oh, Jane Street’s a place you can go and write OCaml.” Oh, that sounds pretty fun.

Ron

So our odd taste in programming languages strikes again.

Jacob

Absolutely. And I had in fact written some OCaml to do some combinatorial optimization problem solving for a class back in university. Not that that came in that handy when I started here. The OCaml that Jane Street writes actually looks pretty different than the OCaml that I wrote 15 years ago. But yeah, I’d known about the place and it seemed really cool. And so I thought, why don’t I interview with them and see what there is?

Ron

Cool. Okay. So when you got here, what did you find yourself doing? What part of the organization did you land in?

Jacob

So one of the things that Jane Street does before you start is there’s this team match process where you go and meet with people who seem like teams that you might want to work on and you chat with them. So one of the teams that I did a team match with was this team called database infrastructure. And I talked to this guy, Sam, who talked about, well, Jane Street runs all these Postgres databases. Some of them are really big and we’re having trouble managing them and we just sort of let people put whatever data in there and it doesn’t always go very well, so we really want to make this work better. And I’m like, okay, well, that definitely sounds like a hard problem, but why are you trying to do this? Why don’t you have some kind of distributed analytical database that can query over all of this different data without it having to live on a single computer that runs Postgres? And so I’m like, yeah, that kind of does seem like something we thought about. We haven’t built it yet, but it seems like that would be a good direction to at least explore. And I’m like, okay, I’m very happy to work on these database problems, but just so you know, if I come and work on your team, I’m going to try and build this data warehouse analytics thing.

Ron

So can we dig in a little bit more to the kind of like, status quo ante, what is wrong with the throw a lot of data into a Postgres database? Isn’t that what databases are for is throwing a lot of data into them? How were you using them and what was weird about that use from your perspective?

Jacob

So Postgres is a great database, but it has a lot of different goals that it’s trying to meet. One of the things that Postgres is trying to do is be able to do really fast single row or few row transactions. Think about running a store. Someone clicks a checkout button on your website and they go and buy a widget. And you want to both record this order that someone bought this widget. You want to decrease the number of widgets in your inventory by one. And you want this to all happen while the page is loading in a very short amount of time and you want it to all succeed atomically. Postgres has to solve that problem. There’s this other set of problems which we kind of call sort of analytical processing, which are about I have a lot of data and I want to answer questions about all of this data altogether in aggregate in some way that could involve not just single rows, but like many millions of rows of data. Postgres can do this for you as well, but it’s much slower at it because it’s constrained by needing to store this data in a way that lets it do these atomic transactions over very few rows. It’s like for example, Postgres maintains indexes. You have all this data in a table and Postgres can look up this data for you by potentially a lot of different things, a lot of different potential keys into this piece of data. But that means that all these indexes have to live on disk. And also Postgres can very efficiently update individual rows. And if you think about it, what this means is that Postgres can’t really compress across rows. Because if I have to be able to change that one bit on that one row, if that bit is compressed into like some 50 kilobyte or one megabyte chunk, well then every time I change a row, I got to write that 50 kilobytes or that one megabyte of data back to disk, and that’s just going to be a non-starter for these transactional use cases. So Postgres does as good of a job as it can given the constraints that it has. But ultimately, when you’re doing these analytical processing use cases, you really want to lay out the actual data at rest in a very, very different way.

Ron

Right. And this kind of goes back to this classic distinction between row oriented databases and column oriented databases, which has always struck me as super weird. The idea is like, I would like to get a piece of software for managing my data. It’s like, wait, do you want a row oriented database or a column oriented data? And I’m like, you mean I have to pick my data structure now? Why can’t I pick it on a table by table basis? Why can’t I take my data and represent it multiple different ways so that I can do both this kind of access and that kind of access efficiently, depending on the particular kind of thing I want to do?

Jacob

And a lot of people try and build systems that do both. But what we found is that the abstractions that make columnar databases efficient kind of leak through the interface layer. You can go out and find right now a lot of SQL compliant, whatever that means, different things to different people, databases that have columnar storage. In fact, at Jane Street, we were using one at the time that I came and made my bold claim, which was called Vertica, which is an existing columnar database. It’s in fact very efficient for a lot of these sorts of analytical problems. What we found with Vertica is that because it tried to give you these full SQL transactional semantics, it was very easy to break. A lot of the things that you could do with SQL that Postgres can support very efficiently, trying to build that transactional layer over underlying columnar data just gets rid of a lot of the advantages that you’re trying to get in the first place. And the way that this gets implemented with a lot of these systems just dramatically slows down the queries and can even make the database unusable. Now what most-

Ron

SQL is like the blessing and curse of the database industry.

Jacob

Exactly.

Ron

There’s this language defined like, what, in the 70s, I think, that somehow we’re all still using more or less the same language today. And I guess the problem you’re pointing at here is like a kind of overexpressiveness problem of like SQL can do a lot of different things, but the implementation you put under it, maybe we’ll do some of those things really well and some of those things really badly. And when you give the full API surface to people, they’ll maybe try and do all the things that you tell them that you can do and it’s going to be a bad time.

Jacob

Now the way that I’d put that is that SQL does such a good job of being what it is that it doesn’t feel like programming. If someone told you like, “Oh, C++ is way too expressive.” It lets you do whatever you want to do. And sometimes people do bad things with C++ and that means it’s a mistake. No, God knows C++ has plenty of faults, but you’re not going to say, “Oh, it’s too expressive. It lets you do too many different things.” If people treated the SQL that they write as if they were writing a software system when they were constructing their queries, I think we wouldn’t mind that SQL can do all these different things and it has leaky abstractions and et cetera, et cetera. This is all stuff that as software engineers we take for granted, but SQL makes it so natural to write queries that you just feel like you’re asking a question of the system. You don’t feel like you’re writing a program. Even though what a database’s query planner actually is, is a compiler that generates code to run your SQL query as a program. It doesn’t feel like that’s what it’s doing to us because SQL is such a good abstraction.

Ron

Right. But I think it also goes back to almost like the mechanism design point you were making before of sometimes it’s more important to be explainable. And the key thing that SQL trades away is explainability, which is like you do put in a SQL query and then it’s optimized and then a miracle happens or maybe the miracle doesn’t happen. And depending on not just the shape of the tables you run it on, but the statistics that were gathered from that table, it will optimize in different ways. It can be very hard to predict what the behavior is. Right, so you’re sort of not used to as a user who’s sitting there writing SQL work.

Jacob

Whereas we can all simulate our compilers in our head and understand exactly what they do to our OCaml code. I think a lot of the last 10 years or 15 years even maybe of CPU design has been designed to give people the illusion that they still understand what the CPU is doing to their code when in fact they have no freaking idea what makes it fast.

Ron

I mean, I think what you’re saying is obviously true in the sense that if you want to understand what’s actually happening in a program that you write in OCaml or C or whatever, and you really want to understand it at a very low level of detail, God help you. There’s an enormous amount of complex things happening at many different levels and lots of optimizations that are in there that try to make it fast for you even when you haven’t done all the careful work for yourself. But I think there’s a difference of degree. There are deep algorithmic big O notation changes to the execution of a SQL query, which makes it dramatically harder to predict what it’s going to do just from the query alone than you can from looking at a program in almost any ordinary-

Jacob

This is true. Very few compilers will change the big O complexity of your code up from under you. Whereas the decisions of the query optimizer for a SQL query engine can often be extremely load bearing. That’s a hundred percent true.

The Wild West and long-running transactions (00:36:09)

Ron

Right. And when you say we were throwing lots of data and creating lots of queries and stuff, what for, what was the use case that you were running into that we were all kind of running into and how we were using databases?

Jacob

Well, so the funny thing is that this was in many ways just kind of emergent from lots of different people’s connected use of these databases. So a great example is like a list of all the trades that Jane Street has done every day. You can totally imagine why that’s valuable. Okay, let’s take some information that I have in my trading. Let’s combine that with all the trades that Jane Street is doing and derive some insight from this. Super valuable. Let’s understand things about the fees that we’re paying to exchanges or the positions that we keep in our different accounts.

Ron

Or the symbology.

Jacob

Or the symbology.

Ron

Or the different symbols.

Jacob

Yeah. Yeah. What markets does this stock trade on? What currency is it denominated and stuff like that.

Ron

And part of what gets surfaced here is part of the value here is database as ecosystem. The set of things that you can join together, like every new thing that you add to that increases the value of the platform. And the key thing about joining the data behavior is literally join, like SQL joins or the essential data operation for bringing this data together.

Jacob

And many companies, when faced with these problems with transactional databases, end up essentially getting rid of a lot of the flexibility and a lot of the usability of these systems in order to maintain the scale. They will have a schema council that before you’re allowed to add a table to the database, you have to have these people sit in judgment on you and find a schema. You’ll end up with databases where the only queries you’re allowed to run are ones that have been code reviewed.

Ron

And in fact, often what you’ll have is you’ll have code that sits in front of the database and you can’t just use the database directly. You go to the program that you type something into a GUI and it goes off and constructs the queries. And so you don’t have any kind of free reign in just running whatever queries that you want.

Jacob

But we are getting huge value from traders, not just traders, but among other people, traders, being able to do more or less whatever they want on these databases. And we were putting in a ton of work on maintaining the database in such ways to give traders the illusion that they could do whatever they want. Don’t get me wrong, Postgres is a very good database, but there’s a lot of things that you can do once you have hundreds of users on a Postgres database that will slow down and kill the database for everyone. Long running transactions being the best example. So rather than just not let people run transactions on the shared database, we had a long running transaction killer that would go around and kill long running transactions if they got too long.

Ron

So can you say more about this? Why are long running transactions bad?

Jacob

The way that Postgres does animicity is essentially every time you have a transaction that writes to a table, it sort of forks the table and keeps it in two states, the state before your transaction and the state after your transaction. And then once you commit, it then has to do essentially almost like a version control merge.

Ron

So that on the face of it sounds insanely expensive. When you say forks the table, do you mean like makes a whole copy of the table?

Jacob

Well so –

Ron

That would be bad.

Jacob

It’s all copy on write, right? So it only forks the pieces of the table that you touch.

Ron

So it’s both copy on write and it’s copy on write in a kind of segmented way.

Jacob

Exactly.

Ron

Got it. Okay.

Jacob

So

Ron

Then why is that bad? A nice analogy for this is like version control. When you like Git or whatever, you can have lots of forks and actually it’s like kind of okay to have lots of fork. You in fact have lots of long running forks and that doesn’t necessarily cause other things to fall apart. Why is it problematic to have these long running transactions in a database context?

Jacob

The way the transaction semantics that databases give people aren’t like, okay, begin transaction starts a fork and you work in that fork and then later on you have to merge. So the way that the semantics of transactions in SQL and thus in Postgres is that while a transaction is in flight, it’s effectively locking all the rows that it’s changing. Not necessarily every row that it’s reading, but every row that it’s writing to, it gets effectively a lock. This is to stop merge hell from happening when transactions commit, which turns out to not be what most people want out of their databases. So instead of letting you do whatever you want and then merging, we assume that if you’re touching something, you want to lock it until your transaction’s done and then the next person gets to run on that. But what that means is that long running transactions can accrue along a large amount of locks and a large amount of data that we don’t know if it’s the future or not. And this can end up effectively costing a lot of storage and a lot of compute time on the database for having to take this into account and do all the bookkeeping for this transaction while it’s open. Once a transaction gets closed, you just can very quickly garbage collect all of the past stuff that happened before this transaction.

Ron

Is the main issue there that if you’re acquiring a bunch of locks that you’re basically running into issues of performance isolation, like this long running transaction blocks other transactions from completing because maybe somebody else wanted to write to that

Jacob

Yes.

Ron

And now they have to be pushed later in time. Yeah,

Jacob

That’s certainly the brunt of the problem. There’s other ways that this interacts with replication, for instance, and the ability to do what Postgres calls vacuuming, which is cleaning up old data in a table that where long running transactions make that hard too. But yeah, certainly this ability to have this chain of long running transactions is one of the things that makes this difficult for Postgres to manage performably.

Ron

Got it. Okay. So the state of the world is we have this incredibly valuable, not just a single database, but a set of databases that are being used in this kind of somewhat wild and undisciplined way where lots of people are pretty freely throwing in new kinds of data, throwing new SQL transactions at it and relatively routinely running into exciting performance problems, that you have a bunch of people who are doing a lot of operational work to try and keep this Frankenstein monstrosity running efficiently.

Designing Superstore (00:42:06)

Jacob

And just another big part of the problem is that Postgres very inherently has to keep all of your data on a single machine. So we had, I think at the time it was like the largest computer that Jane Street owned was running Trader DB. So yeah, so what are our requirements when we’re trying to replace the system? What are the things that we want out of the new system? Well, we still want it to be the Wild West. We still want people to feel like they can do anything. We absolutely want this to be able to store more data that you can fit on a computer. And ideally, we want this to be much more compressible and much better at doing sort of large reads over arbitrarily large amounts of data than Postgres.

Ron

Right. And essentially you want to move from something that is focused on transactions to things that are focused on big read queries that get lots of data and figure out the results of complex computations.

Jacob

Exactly. And we also want it to be much harder for people to break isolation. We want it to be much harder for me to go and do some crazy thing on the system that will impact your ability to use it. Right. Both from a correctness perspective and also from a performance perspective. It should be much harder for a single user to bork the whole system for everyone.

Ron

Did you have problems? I would’ve thought you’d have lots of performance isolation problems in something like Postgres and very few correctness problems.

Jacob

Well, you had deadlocks were sort of the main ways. Is a deadlock a performance problem? Well, in a sense.

Ron

It’s a very, very bad performance problem. Got it. And inherently when you’re grabbing lots of locks … Actually, how do you deal with deadlocks in the Postgres world? Does the whole thing lock?

Jacob

So we had a deadlock detector, and if queries got deadlocked, we would just kill them.

Ron

Makes sense.

Jacob

So this is what we wanted. There’s no free lunch, right? As I said, we’re not going to just do Postgres, but better because we think Postgres is pretty good. So what can we give up? Well, it would be really nice if we could give up transactions.

Ron

Sure. Although transactions are really nice.

Jacob

These transactions are really nice. Well,

Ron

It’s nice to be able to reason consistently about the data in your database.

Jacob

It sure is. So what if we couldn’t … Is there a way that we could avoid giving up transactions entirely while still avoiding the kinds of problems that transactions caused on Postgres? So this ended up being a lot of the meat of what we tried to build when we were specing out our data warehouse. And we looked at a lot of the existing systems around this. A lot of them either give up transactions entirely. You have something like Kafka, say, which Kafka isn’t a database. Yeah, I know you

Ron

Don’t think of Kafka as a database.

Jacob

But it is a system that stores a lot of data, right? For sure. And if you look at Kafka, it doesn’t have read-modify-write and –

Ron

To the degree that it has any notion of consistency, it’s the ordering guarantees that are provided, and those are very narrowly scoped.

Jacob

Yes.

Ron

And there’s the amount of data that you could fit inside of a single partition of a single topic, that’s the scope over which you have well-defined ordering.

Jacob

Yeah. Or you have something like Redis, for instance, which has no guarantees essentially across keys.

Ron

Okay. Yeah. Another way of limiting the scope.

Jacob

Exactly.

Ron

Which is totally not good enough. If you’re trying to do some complex join across multiple tables where you need consistency across those things, its like you definitely need to cross keys.

Jacob

Yeah. And so on the flip side, you have something like Vertica, which tries to provide full transactional semantics, but if you use it just right, you can make it efficient, which again, we still want this to be the Wild West. So we want something that you don’t need to use just right, and it’ll still work well. So what we ended up landing on is something in the middle. In particular, we wanted to make it possible to do consistent rights to a single table because this is something that comes up a lot is I want to have some log of a systems behavior or going back to the list of all the trades that we do. You really want to not miss any trades. You really want to not double count any trades, but you don’t necessarily need to insert new trades transactionally with other pieces of data because either the trade happens or it doesn’t and you have some source of truth for that. So you can always just pull in new trades from the source of truth to keep this up to date.

Ron

That’s right. And also to the degree that you’re making changes, most of the changes are appends.

Jacob

Exactly.

Ron

You’re mostly adding new data. And so a lot of the crosstable stuff that happens is like, it’s not quite consistent for free, but it’s mostly consistent in the sense that you might be missing data, but the data is mostly not being changed under your feet.

Jacob

Yeah. So we wanted to have data that you could atomically write to a single table, you could get order preserving writes to a single table. You could essentially get all of the guarantees that you want for that table, but not between tables. Another key thing that we gave up, so one of the big problems with columnar databases is that rewriting stuff is really annoying because … So the way that columnar databases store their data is every column becomes its own separate stream of data. And the reason you do that is because a single column of data turns out to be extremely easy to compress and manage. The other reason you do this is because most queries don’t hit all the columns in a table, especially analytical queries. You have these very wide tables, and maybe a given query needs like seven out of a hundred columns, and you just get to ignore the other 93. And so this makes adding new columns to a table nearly free at query time, which is super cool. Sure. So given this layout, doing an update of a single row, something that is in Postgres, extremely easy and efficient, becomes very difficult because first of all, all these columns are their own separate things, and second of all, they’re all compressed. But again, we didn’t want to give up the ability to update and delete data because it turns out that being able to do that is really important.

Ron

Even though it’s by far not the ordinary thing, mostly you stream it in

Jacob

Exactly.

Ron

Sometimes, you get it wrong.

Jacob

Exactly.

Ron

And you’d like to go and edit the table.

Jacob

So we want to have these be able to happen, but we’re sort of okay with them being slow, as long as they don’t slow down other stuff that’s going on in the database. So what we decided to do, and this is kind of a weird decision, is we decided to make all writes in this database asynchronous. Now you write to this table and your writes gets committed, which is the database says, yes, I acknowledge your write. This is going to go in in the correct order. This is going to go in atomically if you ask it to be atomic and so forth, but readers aren’t necessarily going to be able to see it just yet. And this lets us take the time in between when you do the write and when it becomes readable to make sure that the data that you’re writing gets readable efficiently, that we put it in the right place in the columnar data store, that we redo the compression if we have to do the compression and so forth and so on. And we’re much happier doing that neither when someone’s waiting for their read to complete, nor when someone’s waiting for their write to commit, but somewhere in the middle.

Ron

Right. And this is very different from doing updates to a database via SQL, where SQL is kind of part of the language and you’re kind of just mixing together the reads and the writes into one set of SQL expressions and they kind of all are supposed to operate in this kind of integrated way. And here you’re kind of breaking out the writes essentially as a totally separate API that has different semantics and where the kind of results of those writes happen asynchronously.

Jacob

Exactly. And part of you will not see very many commercial or open source systems that make this decision just because, I think mostly because you have to educate your users every single time. But this is one of the cool things about being at Jane Street where all of our users are in house is we just get to tell them how it works and why. And we can even build the first few examples ourselves to give them examples of how this works really nicely. And also very importantly, we can have, even if SQL is like the first class read interface to this system, it doesn’t need to be the first class write interface. Because one thing that we noticed looking at Trader DB, for instance, is that even though a lot of the read queries were extremely ad hoc and just written by hand by a human being, most of the writes were happening in much better to find ways through actual software systems.

Ron

Right. And maybe this goes back to its role as an analytical database. The place where the innovation is happening is on the query side, on the read side. And there’s data loading. You got to do the data loading,

Jacob

Yep

Ron

But the data loading is kind of done once per data source and that’s kind of that.

Jacob

Often. Now we didn’t want to stop people from writing ad hoc data to the database if they wanted to. A very common case is like, I have this shared data that’s very big and important and I want to query it, but I want to join it with some other thing that nobody else has ever heard of. So you want to make it easy to upload your little join table to the database to join with the big existing table. We want to have ad hoc data. We don’t necessarily need to have that have full SQL semantics. You just upload the whole table at once. And if that’s fast enough, then if you want to change the analysis, re-upload it.

No direct Parquet access (00:51:04)

Ron

Got it. So you sort of avoid the whole writing an individual row by just not having that. Exactly. Not a thing in the system. So another thing that’s interesting about this system, it’s called Superstore, is that another thing that we … There’s various things that we have not given people who want it, who want it from it. In particular, just to say a little more about the underlying design, the underlying data is actually stored in Parquet files. Parquet is lovely open source format for storing columnar data. And a request that people have put out a lot, which has always been denied is like, oh, can I just access the Parquet files directly? You have all these nice parquet files distributed in this nice distributed system. Can I just go and read them directly and do something whatever I want to them? And the answer from the Superstore team has been no. I’m curious, why is that?

Jacob

So in some sense, this seems like a very natural decision to make because if you look at, again, Postgres, can you just go and read the files on just that Postgres uses to store its tables? No, that would be crazy. You have a database for that. So if you think about it as like, well, we’re building a software system and offering it to the firm, it’s a very natural decision. But the funny thing is how little software that Jane Street develops actually works like that. People are very smart and good at programming and love looking under the hood and understanding what’s going on. And so once you tell them like, “Hey, this is a bunch of parquet files on a disk.” They’re like, “Oh, so I can read them, right?” So why don’t we let them read? Because we get to manage all of the reads and all of the writes, we get to build a lot of features that we can reason about. So one of the best examples is just access control. Sure. If we let people access our Parquet files on the disk, then we’d be stuck with Unix file system access controls because they were reading files off of a disk and we don’t have any code that’s running in between the disk and them. But in fact, since they’re reading them through our system, we get to build an access control layer and have ACLs and so forth for these data sets. Another thing that we can do is we can do usage logging. So this came out of a real problem with Trader DB and with other systems at Jane Street, which is they become nearly impossible to deprecate or to do schema evolution on. Imagine you have a table and a database and there are like 500 different people querying this database every day and you want to delete this table. Well, which of these 500 people are querying this table? How do you find out? I can go, maybe, okay, so I write down every SQL query that everyone is running and I grep for the name of my table. Well, okay, maybe this works and I find and I grep and I don’t see any matches to my table. Well, except there’s a view that references my table and now I have to grep for the view too. Well, maybe there’s a view on that view. So it ends up being very difficult to understand who is using this table and who’s not. And then even if you do understand it, why are they using it? And what data are they reading from it and how much … Maybe you can keep this data that you have, you only retain the last month instead of going back to all time. Can you do that efficiently? Who knows? So being able to have full read and write access logs to Superstore turned out to be very important. And to even just build a taxonomy of like, well, here’s the list of all of the data sets that people are storing in our system. Now again, Postgres has that, but this is a system that’s many orders of magnitude larger than Postgres. I guess this is another important thing to say is single computer databases only scale so far. If you want more than a few thousand tables, you’re going to need something that scales horizontally.

Ron

Right. And all of this is happening in the context of a huge explosion in the amount of data that we’re using and the kind of intensity we’re using that data and many different new use cases for all of this data. And somewhere along the way in the middle of this, a kind of revolution in the kind of machine learning driven approaches that we’re using also arrived and also put a lot more pressure on this system.

Jacob

And it just turns out that once you have a system that’s able to query over many terabytes of data efficiently, you keep finding new use cases. And again, we were building this, this database we keep talking about is called Trader DB. We’re building this mostly for traders and researchers, but we keep finding use cases across the firm. Cybersecurity wants to have access logs for our network. Oh, why don’t you just put that in Superstore?

Ron

Tools and compilers wants to have access

Jacob

Right

Ron

For all the history of the code review that people have done for all sorts of ad hoc queries that people want to do.

Jacob

Right. Every build that’s ever run at Jane Street. And you just kind of put stuff in. And then from the trading side, you end up wanting a record of the thing that a trading system thought at the time that it made a given order is a very good example of a kind of data that has very high cardinality. You want to write down rows like that across the firm like many billions of times a day. And we just couldn’t do that really, except that now we can. So you end up doing it a lot.

Build vs. buy (00:56:14)

Ron

So one thing I wonder about this design is like, we sort of pointed a few different kinds of systems where there are kind of clear distinctions and approach, but data warehouses aren’t like a new idea. Lots of people have written data warehouses and there’s lots of commercial products for data warehouses. How did you think about the question of, should we just like, I don’t know, sign up for yield software-as-a-service data warehouse thing versus, “No, actually we’re going to go and design our own thing from scratch.”

Jacob

So software as a service is a very interesting question here because really most of the data warehouses that are at the cutting edge in the world that we see are cloud software as a service offerings, Snowflake and Databricks being kind of the two primary examples that it makes sense to point to. And these are perfectly good software. There are a few disadvantages there. One is they’re extremely expensive. And in particular, whenever I talk to people who work at companies that use these systems, they end up thinking very hard about what queries they want to run or not just because the cost of running a lot of queries in dollars is quite high and we just sort of don’t want to think about that. The other problem is that our data isn’t in the cloud. We have a lot of systems producing data and managing data on premises and requiring people to copy all of their data to the cloud in order to run queries over them just seemed like not a great idea. Sure. But even still, once you’re on premises, there’s still many offerings. There’s commercial offerings, there’s open source offerings, so why don’t we just use one of those? Well, one of the things that we’ve learned, and really I say we, but this was really not me who was pushing this because I kind of came in here and I’m like, “Oh, Jane Street has so much not invented here syndrome. We never use open source software. We just write everything from scratch.” Probably it’s because this Ron guy needs everyone write everything in OCaml.

Ron

That is a fair complaint.

Jacob

But I do think, I guess now I’ve been here for long enough that I kind of see some of the method and the madness. In particular, the finance use case does not get very well understood by the open source community because we keep pretty quiet about exactly what we’re doing and why, because we think it’s a competitive advantage. Sure. And our competitors think the same thing. So you’ll have a lot of open source people who can spout off what’s sort of the tech company reason to use an analytical. Okay, people visit your website and record all their clicks and all their visits and sessions and stuff, and you need a big database for this because you have millions of people visiting your website. That’s a very well understood use case for data warehouses. And so a lot of the open source products are very good at solving that. I think very few people who are developing these systems can spout off some of the finance related use cases quite as well. And there’s just a huge advantage for us in being able to do the specific thing that we want to do very well. And because these systems aren’t necessarily built for these use cases from the top, we don’t have any particular reason to believe that they’re going to prioritize the things that we care the most about. Honestly, even in the ones that do think more about finance use cases, there are time series databases where a lot of the people who are using them are finance companies, but Jane Street’s a weird finance company. So being able to just sort of go into our system and make this thing work is extremely valuable.

Ron

The exact choice of trade-offs is an enormously load bearing thing. And the fact that we get to make the choice of which of these features matters and which doesn’t and where can we give up on something and where are we going to demand that we really get really perfect. That’s just an extremely powerful place to be.

Jacob

And we have use cases that if we were to share them with people building these pieces of software commercially, which is make them shudder and walk out of the room. And honestly, sometimes they make me shudder and walk out of the room. And one impulse that I’ve had to really stifle in myself is asking people who are using tools like Superstore are like, “Have you tried not doing that? “ I ask it more than no times. It’s definitely good to push on people and see if there’s a better way for them to meet their use case and the first thing they have to have come up with, but you really want to not push back on the use case itself because our traders are very smart and they’ve thought very hard about what is the thing that they want to understand. So being able to say yes in many more cases than a vendor would perhaps say yes to us is a huge advantage.

Ron

So one thing I wonder about this story in general that you’re telling is that you kind of joined Jane Street and were kind of immediately plopped into a team and started to help lead a project that was to replace a pretty central piece of infrastructure that the day it falls over, it’s really bad, like a pretty important central piece. And here you are, like a person pretty new to the organization. And how did that all happen?

Jacob

Well, so a lot of this is not on me. And I had many collaborators doing this. Many of them had been at Jane Street much longer than me. I talked briefly about Sam, who was my boss when I started at Jane Street. He was very much bought into this general project and did a lot of the work of giving me talking tos when I proposed something that just would not work at Jane Street or that is a bad idea for reasons that he understood and that I didn’t understand yet. And just sort of transmitting to me what are the things that make software valuable at Jane Street and why? And also doing a lot of the work with the prospective users of this system to explain to them what we’re building and why, and make sure that they were ready to use it once something existed. So these were both incredibly important. I was involved with them, but I don’t think that they worked, I don’t think I deserved the credit for them going well. I think really this involved a lot of very impressive organizational flexibility from across Jane Street. One kind of conversation that I had a lot that I found very impressive was I would go and talk to a team who we thought might one day want to use Superstore. This is before it existed. And they would say like, “Hmm, this seems pretty hard. Are you sure you’re going to be able to do it? “ And we’d say like, “Well, we’re going to try really hard to do it. We’re going to prioritize this exact thing that you’re talking about working for the following reasons, but you just have to kind of trust us that we really think that this is achievable because we’ve run some tests and it seems to look good.” And they’d be like, “Okay, this sounds great. We will use it when it’s available.” And I was like, “Okay, wow. This was a very ego-free conversation with a lot of trust between teams and trust of me who had not particularly done anything at that point.” And once we did launch the thing, we got a lot of usage very quickly. I think in many ways it turned out to work quite well, which is nice, but people were willing to give it a try pretty easily and early on, which I think really helped contribute to the system’s success.

Ron

Yeah. Maybe there’s two key cultural things going on there. One is this kind of extension of grace, the trusting of someone else that’s trying to do things, something, and it seems like a reasonable thing to try and do. And I guess the other thing is just the willingness to make bets. It’s a trading firm. We make bets all the time. And the idea that some of the software projects we’re doing are also bets. And I think I’m fairly confident that the various people that you talked to who said, “Oh, this seems hard.” Some of them thought, “Yeah, this probably won’t work, but it’s okay. We’re allowed to make bets that there’s some risk and sometimes they work and sometimes they don’t.”

Jacob

Well, and one of the fun conversations on that note that I got to have pretty early on in my tenure at Jane Street was buying the storage appliances that were going to back Superstore. So there’s a few questions here. One is when do you buy these? And the second one is how do you size them? Now this is in 2022 when supply chain dislocations from COVID were still going on and lead times were quite long.

Ron

Thank God there are no more supply chain dislocations

Jacob

Yeah

Ron

It’s all good now.

Jacob

All fixed. So we didn’t have a running system yet, right? So we sat down with a relatively small group of people like, “Okay, well, what’s the right amount of storage for us to buy?” And I don’t know how, but we ended up landing on a number of 10 petabytes. I was like, “What, petabytes, really? That seems like pretty big.” And they’re like, “Yeah, but you don’t really want to go smaller than that. “ Well, how much is this going to cost? And it was some eight figure amount of money. And I just, to be clear, I had just come from a startup where we had never seen an eight figure amount of money in the whole company’s existence.

Ron

And before that, you were at Google where they hadn’t seen a storage number as small as eight petabytes.

Jacob

Right. But where decisions were made in a much larger and more centralized way. So the fact that a few people could get in the room and decide to write this check on the strength of a design doc was pretty … I mean, it made me nervous that day, I would say, but I think it’s also a very impressive thing to be able to do as a company of Jane Street’s size.

What he’d do differently (01:05:23)

Ron

So Superstore had a bunch of embedded bets in it of different design choices, different things it was trying to achieve. And on the whole, it has panned out very well. Are there any aspects of the initial approach or design or things you did that you kind of in retrospect think didn’t pan out particularly well, decisions that you kind of regret?

Jacob

Yes. So I think the main decision that I regret from Superstore is the way that we handled the metadata. So in all of these columnar databases, the metadata ends up being a pretty key piece. Metadata being just like, what are the tables and what are the pointers to the data that’s currently in them?

Ron

And I guess also what are the permissions and who has access to what? Is that also part of the metadata?

Jacob

In our case, no. But in many other systems, yes, permissions live in those metadata stores. Sure. So we thought, well, what do we want to use for the metadata store? This pretty clearly has to be some sort of distributed globally consistent storage system because you certainly don’t want your data to be consistent or not just you don’t want. It’s impossible to have a system where your data is consistent, but your metadata is not if the metadata is like the pointers to the data. Sure. It’s like, well, what should we use for this? We decided to use CockroachDB. It’s

Ron

It’s like a well-known, well-regarded, open source-

Jacob

Transactional data.

Ron

Transactional data, thats right.

Jacob

And again, you want the metadata store to be transactional because very often doing these transactional, even lightly transactional operations over the underlying data require transactional metadata operations.

Ron

Right. And you do a transaction every time you add a new chunk of data, you’re doing a transaction and it really looks more like the Postgres use case in that instead of operating on whole columns, you really are operating on rows of like, oh, there’s a new chunk of data and I want to register that chunk of data. And so the ordinary use case is like the dual. It’s the opposite thing here.

Jacob

Exactly.

Ron

Okay.

Jacob

And in many ways, CockroachDB was a great fit, but it turned out that we’ve had more and more problems with it as we’ve been scaling. I think one of the key areas where we’ve been having problems with CockroachDB comes to write latency. So we have a lot of the data that’s in Superstore. I would say probably by bytes the majority of the data that’s in Superstore is data that comes in streaming. Some system is writing some stream of data and then we want to take that stream of data and materialize it as a Superstore table as fast as we can. As fast as we can is ideally within less than a second, which is like, again, for a trading system, not fast, but for analytical database pretty fast.

Ron

Right. They’re fast enough that you could build stuff for analyzing trading on the day and it would be fine, fast enough that you could build monitoring systems based on this, not fast enough that you would want your path from quote to order or something to go through the system.

Jacob

And this particular style of access ends up being pretty bad for CockroachDB because every time you need to write to a row, you end up doing a transactional update, which for CockroachDB being distributed means like distributed consensus round. And CockroachDB gets at scale by just dividing all of the rows in the world into many different Paxos group, or sorry, not Paxos, Raft, many different consensus groups that are independent of each other and like having a bunch of stuff on top of that to make that mutually consistent. But if you’re touching the same row over and over and over and over and over again, in effect, you’re serialized. There’s no way for them to do those transactions in parallel. And the wall time that a single consensus round takes ends up being quite high.

Ron

Got it. And just to unpack a little bit of the distributed systems terminology here, consensus is like the name for the algorithm that does the simplest like get a bunch of servers in different places to agree on anything. It could be a single bit.

Jacob

Exactly. I don’t want to have there be a single computer that’s the only place that stores like what’s the latest data in this table because then if that computer gets unplugged, I can’t access my table. So what CockroachDB does and what many distributed systems do is they write this to a bunch of computers and then essentially they vote. If the majority of the computers agree on where the data is, then congratulations you get to write your table. And this gives you very nice fault tolerance guarantees. But the way that these computers have to exchange information ends up being a really limiting factor in the design of the system.

Ron

And just to be clear, you were saying that the bad thing is having to go to a single place over and over to make a decision, but actually if you just had to go to a single place over and over to make a decision, it would be fine.

Jacob

Yeah. It turns out CPUs are pretty fast. CPUs are fast. They make a lot of decisions and it’s all of this network communications and distributed synchronization that ends up making this slow.

Ron

And one of the things we’re doing now is like ripping out all of the Cockroach DB and replacing it with Aria.

Jacob

Yes.

Ron

And Aria is this very finance style state machine replication system that basically shoves all of the data, it gets consistency by shoving all the data through a single core and you just distribute that data out by reliable ordered multicast and it gets to all the people. And the thing that’s weird about the system from a kind of big systems person kind of perspective is there is a single point of failure, which is the sequencer, the thing in the middle that gets all the transactions and brought –

Jacob

There is a plug that if someone unplugs, the entire Aria system will go down. And once Superstore is built on Aria, there is a single plug that you can pull to disable an entire Superstore cluster.

Ron

Although it’s worth saying, it disables it in a kind of temporary way and that you can actually recover in that all of the information that goes through the sequencer is also replicated and captured by other components of the system.

Jacob

Effectively, you’re in read-only mode for a while.

Ron

That’s right. And then there’s a careful, in the end involving some human intervention step to make sure the original sequencer is good and dead to flip over to a backup sequencer. So when it’s a single point of failure, you’re not totally dead, but you are paused in a way that’s not awesome. Although I think one of the points about this design that I think is interesting is it’s true that it’s not great to have single points of failure, but if you have a single single point of failure, it’s not as bad. A well-tended machine might experience a hardware failure once every four years in a well-cooled data center and all of that. And then what you’re really taking on in exchange for way higher performance is like, it will be down for a minute once every four years. And that’s not necessarily so bad.

Jacob

And you have a background rate of outages. I’m not going to sit here and say, “Oh, Superstore never has outages. This is going to be our primary cause of outages.” No, we’re humans. We write bugs the same as anyone. And if you look at what causes system downtime, including for big tech systems that don’t have single points of failure, it turns out that it’s almost always caused by software bugs and not by the kinds of failures that people are designing against.

Ron

And anyway, so here, the point that you’re pointing to as a bit of regret is the core metadata system was basically architected with a kind of higher requirements, like made a bunch of trade-offs in favor of very high availability that blocked you from having the kind of throughput you needed to really scale the system as far as we want to.

Jacob

And throughput in this particular dimension, I think what we, the reasoning-

Ron

The query throughput is fine.

Jacob

Query throughput is fine. Total commit throughput across all of our tables is fine. It turns out the thing that we can’t scale is just like the rate of commits to a single table. And I just don’t think that we fully appreciated how much of a blocker that would be when we were designing Superstore initially.

The Hive (01:12:58)

Ron

Got it. Okay. So let’s talk about the after Superstore world. So you no longer work on Superstore. What’s your day job today?

Jacob

So today I work on Jane Street’s compute cluster, which is called The Hive. So the Hive is not like all of the computers at Jane Street, but it is sort of the place that you go to if you have a computation that needs to be done by a lot of computers at once. So examples of this include, for instance, training neural networks, but also things like analyzing many days worth of trading to understand what the patterns might be that we see there.

Ron

Or running large scale simulations where you take the actual code of our trading systems and run them in simulated mode, like over large swaths of historical data.

Jacob

Yes, exactly.

Ron

Right. And this system is actually like, I have a great fondness for it because it’s like one of the first … The early prototype of this thing is one of the first systems I worked on. The original version of the hive was like six Dell boxes piled up on a card table and I think was like set up what we used to call a Beowulf cluster back in the day. It was like very simple and very primitive. And it’s a very different system now. First of all, then it was a very kind of simple and naive kind of scheduling story of like you would run one thing at a time, it would get the whole hive, you’d have a master process, which is like go off and schedule jobs running on individual cores and gather the results together and write them down to a file. It was also just about CPUs. We didn’t do any GPU stuff back then. It was like more than 20 years ago. Things are dramatically different now. Can you maybe say a little bit more on what is the kind of use cases that are driving the work on the hive today?

Jacob

So there’s a ton of different use cases on the hive, but if you look at where most of our growth is coming from, most of the growth in demand for hive compute is coming from training neural networks and processing the feature data that we’re going to feed to these neural networks.

Ron

Right. Some of which ends up being on GPUs and some of which ends up being on CPUs.

Jacob

Yeah. So the training itself for neural networks happens almost exclusively on GPUs, but a lot of the data production and the data analysis that go before and after the training runs happen on CPUs. Great. The hive is we have hundreds of thousands of CPU cores and over 10,000 GPUs running on the hive. And this is big enough that it’s become essentially a giant DOS machine. You can point this many things at almost anything and make it go down. So one of the big challenges has been just scaling the underlying infrastructure that powers the hive to make sure that it doesn’t rely on anything that it’s going to DOS, and that it can actually feed all of these computers efficiently. So what are some of the things that we’ve had to make sure that we can scale? So one of them is just storage. And all of these computers have to have their input somewhere and then also have their output somewhere. Sometimes this is Superstore, but in fact, some of the larger data sets on the Hive are too big for us to want to store them in Superstore. We just store them directly on a storage appliance or on a large object storage system. And it’s very easy to accidentally write code that makes these storage appliances fall over. So we’ve been working very hard to make sure that hive jobs don’t bring down storage appliances.

Ron

Is that a problem you solve by making the hive itself better or is it a problem you solve by isolating the hive so that its queries don’t like reach out and smack some unsuspecting system on the head and knock it over?

Jacob

So we have to do both. We definitely isolate the hive in many ways and we make it much harder for Hive. We try and make it impossible for HiveJobs to accidentally interact with trading systems, which would be a big no-no. But even the stuff that the Hive is supposed to talk to, we want to make sure that the Hive talks to it gently enough that that thing doesn’t get taken down. A great example, we’re still using, for better or for worse, NFS as the protocol by which Hive machines access much of their data. And NFS has some bizarre locking semantics around certain interactions, especially when it comes to things like listing directories and creating new files in a directory. A very easy way to bring down our storage appliances on the hive is to create thousands of files in the same directory from different computers all at once at the same time.

Ron

Oh, interesting. Just because there’s a kind of consistency semantics on the directory itself.

Jacob

Yeah. Pretty clearly, I think most people would realize that if you tried to write to the same file from 5,000 machines at the same time, that would go badly. But people don’t realize that creating a file in a directory is effectively writing to a file.

Ron

Right. It feels like it should be okay, right? It’s just like some kind of eventually consistent, eventually all the files that you added show up there would be fine, but like- That’s not how that works. That’s not how NFS works, right? Okay.

Jacob

So obviously you shouldn’t do that. We put some checks in place like disabled jobs that we think are doing that, but then a big part of this is just building the interfaces to the hive in such a way that people won’t do that. Rather than having people open files directly, we build frameworks that they can use that can manage their data in larger chunks.

Ron

So one thing that’s notable is that your experience coming to work on The Hive is sort of the opposite of your experience of coming to work on Superstore. You started working on Superstore when it didn’t exist. And then you got to ask the nice question of like, what are the APIs I would like to have people to have in order to use this system? And how can I pick just the right trade-offs and all of that? And in The Hive, you came up in a much worse position of a system that the worst of all situations I initially wrote and then has been used by lots of people and evolved a lot over time and is the beating heart of a lot of our research work. And you don’t have the freedom to just be like, “Ah, and here’s what all the APIs are going to be. “ It’s already being used actively and all the time. So what do you do about that? How do you think about improving the system and changing the kind of trade-offs in a world where there’s already a ton of uses before you get to make any of those decisions?

Jacob

It’s really hard, I think is the long and short of it. One of the things that I really try and do is I try and understand the full stack of code. One of the things that I found was happening a lot when I started on The Hive is that we were building features and then shipping the features and then nobody would use them. And so why was that? It’s because the people who the features were designed for were not actually using our APIs directly. They were using other systems that used our APIs in turn. So it’s not enough to design a feature that solves a problem. You have to design a feature in a way that solves a problem in concert with the other layers of our research stack.

Ron

And there’s like 20 years of people hacking away on these layers. So there’s like a lot there.

Jacob

Exactly. So having that sort of full end-to-end understanding and actually interacting with researchers. In fact, the first thing that I did when I started on the Hive team was I didn’t work on the Hive at all. I sat on a trading desk for a month using the Hive with them. I worked with these traders to take some of their code and rewrite it in a way that was sort of more maintainable and better integrated with the rest of the code base, but that involved a lot of using the Hive and changing the way that this code interacted with the Hive, which really gave me an appreciation for at least one of the ways that actual trading code interacted with the Hive. And my goal was to use that kind of insights to better align the way that we released our features with the ways that users were actually working.

Ron

So now you’ve had some experience going off and using it, an experience, seeing what’s going on in the team and making changes to the system. What do you think of as the problems that there are to solve? What are the important things to be done to make the Hive better fit for both what we’re doing today and the insane amount of compute we are bringing up over the next several years?

Scheduling compute as mechanism design (01:21:17)

Jacob

So bringing this all the way back to the beginning, one of the main challenges that the Hive has is a mechanism design challenge. So one of the things we have to do is we have to schedule different jobs. We have to decide what computers are going to run which people’s code. And sometimes more people are demanding hive resources than we have resources to give. So we have to decide how to prioritize them. And being a trading firm, we do this using money. People bid in dollars per CPU hour or dollars per GPU hour for how much they want to run their code on the system and we run a second price auction. So this all sounds well and good, but in fact, the structure of the mechanism makes it very hard for people to say things that we want them to be able to tell us about how they want their code to run on the hive.

Ron

Are you telling me you would like them to be able to express their true preferences about how valuable they think their job is?

Jacob

Well, so the funny thing again is that it’s not so much about getting the exact dollar value, which they don’t know down to the cent how valuable their thing is. It’s about what is the structure of their preferences. And particularly the thing that we haven’t been doing a good job at modeling is urgency, right? Maybe this piece of work is worth $10,000 and this piece of work is worth $5,000, but the $5,000 work decays much faster. If I don’t do it this hour, it’s going to be worth $0. Whereas the $10,000 piece of work, if I don’t get to it until tomorrow, it’ll still be worth $10,000. That’s a very valuable piece of information when you’re trying to figure out what the computers should do right now.

Ron

Yeah. This is like a key insight that was not at all obvious to me, but as I’ve kind of heard people talk about the kind of scheduling problems on the hive, just like makes a ton of sense, which is if you think about it in terms of utility curve, the thing that matters is the kind of derivative, right? It’s like, if your utility curve is flat, I would like to get this done, but it’s worth the same if it happens today or tomorrow, then while I’m not very urgent to get it scheduled right now, I’m kind of happy to wait. And if there’s free time on the hive, happy to schedule it then, but it can also wait and other things that are more urgent can go first. And so this very simple thing of not a single dollar value, but a curve seems like a much better way of thinking about what’s going on.

Jacob

Yeah. So assuming again that you give people good ways to specify this, and there’s a real API design problem for how do you not just make this thing of running a program on the hive be an entire economics quiz?

Ron

Having everyone have full freedom in writing out their utility curve for every job sounds also like a nightmare.

Jacob

Right. And plus you hope that they’re like convex.

Ron

Sure. Right. Yeah. It would be weird if they went up in time.

Jacob

Well, but sometimes they do, right? Think about a thing that runs every hour or every day that you want sort of a regular monitoring cadence. In that case, really your utility is how close you get to being exactly one day after the last thing and one day before the next thing. You’re probably fine with like a 20-hour gap and then a 28-hour gap, but like a four-hour gap and then a 44-hour gap sucks a lot.

Ron

Oh, so that means there’s like a peak around where you would like to get the job to run.

Jacob

It’s not a convex utility curve.

Ron

But we in fact don’t –

Jacob

We’re not modeling that.

Ron

Okay, good. Back to the-

Jacob

We’re not modeling that. There’s a paper from CMU that talks about this. So you want to model this thing over time. This of course creates an NP hard scheduling problem. I mean, it’s already NP hard because we have these chunky jobs that need a certain number of GPUs in order to run. So that’s solving problem.

Ron

Gang scheduling now you want a whole group. And this is because the GPU problems themselves, in fact, all of the original stuff that we scheduled on the hive was this kind of like, you just have a bag of jobs, they’re all independent, they all just want to run, you want to get the result, that’s kind of that.

Jacob

And this was before multi-core OCaml. They all used exactly one core. We really had an abstract problem.

Ron

And now we have, especially GPU jobs where they’re locked into a tight barrier synchronization thing and they are like, if you’re missing any GPU, you can’t do anything and you just kind of have to have this very tight scheduling.

Jacob

So even solving the scheduling problem at a given time T involves a lot of sort of combinatorial work. As our GPU clusters get more advanced, we want to be topologically aware also. We want to make sure that people get scheduled on GPUs that have good connectivity to each other. And then of course, as our GPUs are now in multiple data centers, we have to pick which data centers things run in, which often means copying their data from where it lives to where the free computers are. So that also gets very challenging. Sure. But then on top of that, we want to add this time dimension, which just makes the problem much more complex. So we know that we’re not going to be solving this problem in an algorithmically perfect way, but we still think that the moves that we can do, like if you think about this as a local search problem, the moves that we can do to go from the schedule to a better schedule are valuable enough that it’s worth adding this extra dimension in the modeling.

Ron

Sure. Right. And what’s the key pathology in the original kind of just like where it’s going to have, everyone’s going to bid for right now, whether they can run and you just have them bid against each other and the highest bid wins. What’s the particular thing that goes wrong in that case?

Jacob

So the particular thing that goes wrong is people end up paying much, too much money to get their large jobs run and then feeling sad that they didn’t just wait until the night. Essentially, people sort of schedule hive jobs for the future when they think the resources are going to be cheap because there’s no good way of them modeling their urgency in the bid.

Ron

Right. So the main issue is the auction happens when you propose the job and what you really want to do is give the system your job and be like, “Here are my preferences, run it at some point when it makes sense, given what I’ve told you about my preferences.”

Jacob

Exactly.

Ron

And that’s the thing that’s missing.

Jacob

Right. People end up doing their own work in their head or sometimes even writing little Python notebooks, modeling what they expect to be the future usage of the hive to figure out when to submit their jobs.

Ron

And thus we have pushed a lot of complexity on the users.

Jacob

Exactly.

Ron

Okay. Okay. So that’s one thing that’s wrong. There’s like some problems in the basic scheduling … Mechanism that don’t let us exercise the kind of intelligence we want to in terms of picking what jobs run when. What else is wrong with the system?

Distributing work and queries (01:27:33)

Jacob

Well, another class of problems is that we are not very good at starting at running people’s work efficiently on a worker by worker basis. So I want to run this thing on the hive. Maybe I want to run a thousand different copies of the same process over a thousand different shards of my data. This means that the very first thing that the hive has to do is get my code onto a thousand different cores. Okay. There’s a bunch of problems with this. One is, well, maybe my code is quite large and just literally physically copying the bytes is going to be an expensive thing to do. Another thing that’s a problem with this is, well, right now all of these bytes are located in one particular place and now I have a thousand different computers all trying to talk to the same one computer that happens to have these bytes. And that’s not going to necessarily go very well either. Yep. So it turns out the way to solve this is more or less with BitTorrent.

Ron

The same thing that we use to download copies of Red Hat Linux back in the day.

Jacob

Yeah. It’s built almost exactly for this problem of like, I want to have this data fan out from this one place to many different places. The fundamental insight here is that think about the first person who successfully manages to get a chunk of your file. Well, now they can share that chunk with everyone else while taking load off of the place that the file originally was. So if you have your readers also be writers or your receivers also be transmitters of data, then you end up evening out the work distribution much more nicely across all of the machines that are trying to do work at once. And in this particular case of like a hive worker that’s trying to start running a binary that it doesn’t have all of yet, that’s just wasted cores. So you might as well use those cores transmitting chunks of the executable to other machines.

Ron

It’s funny, BitTorrent is also like an exercise in mechanism design. And this is a problem that we don’t have, but the original BitTorrent design is designed for mutually distrustful people to share data. And one of the things that it wanted to do was incentivize people to be transmitters by giving them more download speed if they donated upload speed.

Jacob

Yes.

Ron

Which I guess is not actually a part of the thing that we have to worry about here.

Jacob

No, we like to think that all of our … In fact, not only do we like to think that all of our employees are good actors for stuff like this, but also there’s no particular benefit to you in not transmitting while you receive, given that you’re just sort of wasting space out.

Ron

So the incentives are in fact perfectly well aligned.

Jacob

Yes.

Ron

Cool. What about the APIs that end users get? Those grew very organically. The initial APIs that we provided were like rock simple and not very scalable and not very expressive. And then people have built lots of stuff on the outside of those and kind of extending in all sorts of ways those original designs, but there wasn’t ever a, let’s stop and think about what the right API really should be. So what kind of problems does that create?

Jacob

So one of the primary APIs that people use to access the hive is essentially this computation graph API, where you build this sort of graph of tasks that have dependencies of data that are produced by other tasks and so forth and so on. And you end up developing this entire graph where you have your source data at one end of the graph and then like your computation result at the other end. And you tell the hive, “Hey, run this stuff,” and it will do it in topological order and so forth. So this is nice that we have this. One question you might ask is, are there optimizations you can do on this graph to make it faster to run? And well, hey, now you’re back to writing a query planner. In fact, one very common case of this is where your task graph is actually using a data frame library like Polars in each individual task. And then in fact, you can analyze the computation that polars or a different library is doing and optimize not just how the computation graph is structured between the machines, but what computation gets done on each existing machine versus when the computations get distributed.

Ron

And the Polars story kind of brings us all the way back to SQL.

Jacob

All the way back to SQL, a lot of the work that’s done on the hive looks a lot like a big distributed query. Not all of it, God knows. I’m not going to say that GPU training is something we can write SQL for, but a lot of the work that’s done preparing data sets and analyzing data sets on the hive looks much more like an SQL query than most code in the world.

Ron

And if you go look at the difference between Polars and Pandas, which are like Pandas being the standard dominant data frame library and Polars being this growing upstart, the big difference there is again, the shape of the APIs and Polars is like, rather Pandas is the eager mode thing. It’s just like you tell it what you want it to do and it immediately does it. And Polars does what’s called lazy execution, but really the way to think about it is the operations don’t really run the thing immediately. What they do is they build up the query piece by piece, and then you get the whole query and you can see the whole thing and the system can optimize it.

Jacob

Exactly. And other distributed systems like Spark end up having the same insight. Spark also looks a lot like a distributed query engine. So really we want to build sort of a distributed query engine style interface onto the hive. And we’ve been taking a lot of steps in that direction.

Ron

So are you saying that the API that we want is literally Polars or do you think that something of that style is what you want?

Jacob

I want Polars to be part of it. I don’t think … So Polars has a distributed product. I don’t think we are literally going to have people write a Polars LazyFrame, and that is the way that you use this interface in the hive. But my guess is that Polars will be in there somewhere, but you’ll also have other capabilities that map to some of, again, the Jane Street specific or the finance specific things that we care about where we sort of privilege things like operations over many dates or operations over different symbols, and we can do a better job because we understand our domain of making sure that things are, for instance, balanced across shards.

Ron

But the kind of key enabler is you want the computation to be run in a way that gives you a kind of graph of a computation that is like-

Jacob

Exactly. You need some declarative computation graph. And the more that we know about what exactly you’re trying to run over what exact data, the better we can do of scheduling this and making it efficient.

Ron

Awesome. All right. So that is the mission that you are currently on or part of the mission

Jacob

That you’re creating on. That’s definitely part of the mission that I’m currently on is how do we run these very large DAGs of computation efficiently and what’s more information that we can get about these computations that can help us do so?

Ron

Awesome. Okay. Well, maybe that’s a good place to stop. All right. Thanks so much.

Jacob

Thank you, Ron. This is a lot of fun.

Ron

You’ll find a complete transcript of the episode along with show notes and links at signalsandthreads.com.