All Episodes

Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.

Why Testing is Hard and How to Fix it

with Will Wilson

Episode 26   |   March 16th, 2026

BLURB

Will Wilson is the founder and CEO of Antithesis, which is trying to change how people test software. The idea is that you run your application inside a special hypervisor environment that intelligently (and deterministically) explores the program’s state space, allowing you to pinpoint and replay the events leading to crashes, bugs, and violations of invariants. In this episode, he and Ron take a broad view of testing, considering not just “the unreasonable effectiveness of example-based tests” but also property-based testing, fuzzing, chaos testing, type systems, and formal methods. How do you blend these techniques to find the subtle, show-stopper bugs that will otherwise wake you up at 3am? As Will has discovered, making testing less painful is actually a tour of some of computer science’s most vexing and interesting problems.

Some links to topics that came up in the discussion:

TRANSCRIPT

00:00:03

Ron

Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky. All right. It is my pleasure to introduce Will Wilson, who’s the co-founder and CEO of Antithesis, someone who started out studying math and then somehow found himself working on distributed databases and now running a startup that is trying to change how we all do testing, hopefully for the better. Jane Street is actually both a customer of Antithesis and an investor, something I want to talk about a little bit further in, but thanks for joining me.

00:00:32

Will

Yeah, hopefully for the better, but I think it would be hard to make it a whole lot worse.

00:00:37

Ron

Fair. So, let’s just talk a little bit about how you got here. You started off studying mathematics, you’ve done a bunch of other things. You’re now doing a lot of what seems to me is really hardcore systems work. Tell us a little more about that journey.

00:00:52

Will

Sure. So when I got to college, it was the time when everybody was super, super excited about computer science. Facebook was new, Google was new, everybody was going off and joining those companies and making a lot of money and doing really cool stuff. And I basically made a very large mistake, which was I got to college and I was like, “Wow, that computer science stuff seems really cool. Too bad it’s over. Too bad all the interesting problems have been solved already. Look, somebody’s already made Google. What else could there be to do?” So I basically ran kind of in the opposite direction. I knew a little bit about how to program. I taught myself when I was a kid, but I basically avoided studying computer science at all and ran into the most abstruse forms of mathematics, which just seemed more intellectually interesting, and also nobody was going to run out of math anytime soon.

00:01:46

Ron

That’s true. Although there’s this whole thing of maybe AIs will run us out of math, but that’s a much newer problem.

00:01:51

Will

If I were making that decision again today, I might have picked something different that AI is not so good at.

00:01:56

Ron

So when you say abstruse mathematics, what kind of stuff were you interested in?

00:02:00

Will

I did a bunch of different things. I liked, a lot, something called representation theory, which is very useful in mathematical physics. It’s basically the study of homomorphisms from general abstract groups into vector spaces, either finite or infinite dimensional. It’s pretty neat. That was actually a little bit too useful. That was a little bit too applied. So I also got into some mathematical-

00:02:23

Ron

They’re like actual matrices there.

00:02:24

Will

Right. Well, there’s actual matrices and you can actually use this to do particle physics, which I don’t know. So I also did a little bit of set theory. I got into something called large cardinal theory, which is so abstract. It almost sounds like a parody. It’s basically what new forms of mathematics can we develop if we add assumptions that certain very large infinite numbers exist. And the Wikipedia pages on this stuff are a total hoot if you want to look at it.

00:02:53

Ron

I have sadly looked more than a little bit at large cardinal theory and it is fun and wild and indeed not the most practical of all.

00:03:01

Will

This is the only podcast I can imagine where the host might say that as a response.

00:03:06

Ron

All right. So you had a promising start of a career in mathematics. Why did that not go anywhere?

00:03:11

Will

Oh, well, I basically, I got to my senior year and I did actually apply to grad school and I actually got into grad school a few different places and I was all set to go off and do my PhD in math. And then I just looked around and I looked at my fellow classmates who were going to grad school and I looked at my professors and I looked at myself and I had a very important moment of self-realization, which was that I am never going to be a world-class mathematician because basically, I mean, basically for the same reason that I’m never going to dunk, right? I’m never going to be a world-class basketball player. There’s a certain measure of natural talent and random variation that is just required. And yes, you can definitely get better at basketball or better at math by working very, very hard, but these are both professions with this incredibly skewed return distribution where if you’re not in the top 0.0001% of people, you’re just never actually going to have a great time. And so I realized that I could spend six years in grad school or longer and eventually get some job teaching somewhere as an adjunct or something, or I could not do that and I could sort of bail out of this process sooner. And I realized that was what I had to do.

00:04:29

Ron

Got it. And then you transitioned into what? What did you do from there?

00:04:33

Will

Well, I basically, I actually initially was off doing a little bit of biomedical research. I had interned when I was in college and actually before college at a small biotech startup. And I’d done a bit of that. And then after that, I bopped around in a few different sort of dead end-ish jobs. And it was at one of those that I had this crucial realization. And the crucial realization was that actually my ability to write a janky Python script was unbelievably economically valuable. I was sitting at my job and my boss had assigned me some enormous pile of drudgery and I looked at it and so I wrote a Python script and it took me 45 minutes and it automated the enormous pile of drudgery. And I was like, okay, here I’m done. And he looked at me with this expression of dread and was like, “That was supposed to be your work for the next three months.” And that made me get something in my head. I was like, “Ah, interesting. Maybe I should get better at this programming thing. That seems like it could be good.” So I went and I taught myself how to code for real and I did some online classes and then I eventually got my way into a number of tech startups. So

00:05:43

Ron

How do you actually learn how to program? My overall sense is that the world is actually very bad at teaching people how to program. Universities, I feel, are especially bad at it. They do this weird form of performance art where professors hand out assignments, and then students fill them in and solve them, and then they’re handed back and looked at once, and then they vanish like a puff of smoke. It’s like the evanescence is part of the art of it all. And real software is nothing like that. The permanent, evolving state of the software is part of what’s important about it. Part of what you need to optimize for when you’re writing software is not just the functional properties of what the software does, but the non-functional properties: how extensible is it, how easy will it be for people in the future to understand, what kind of performance problems are you creating for the future? All these things don’t show up in the very small-scale fake environments where you learn how to code, and you need to do very different things to learn to be good at it. So what did you do?

00:06:40

Will

Yeah, no, that is super, super true. And so I did actually try to solve that problem a little bit, but I will also qualify my answer by saying that my main goal was to get hired at a software company, not to become a great engineer yet. I think I knew somewhere in the back of my head that becoming a great engineer would require working with other great engineers and being mentored by them, as indeed it did. But basically what I did was I followed two tracks and I was on paternity leave at the time, which made it easier because I could sort of do this nights and weekends. And basically I studied a lot of academic knowledge, all the stuff that I had missed in college. I went and learned about complexity theory and I learned about the theory of algorithms and I learned what a data structure is and all the stuff that everybody else learns their sophomore year. So I jammed all that into my head using a bunch of YouTube videos and online resources and so on, which there’s a lot of these days. And then I also just tried building things and I mostly focused on things that were interesting to me and things that were hard. And I tried to pick a pretty broad set of things that would force me to learn different skills. So I wrote my own little Raytracer and it was like a- A classic. … pretty crappy ray tracer, but I did learn C++ and I did learn a lot about how to do object oriented programming and how to do memory management and so on in the course of that. And then I wrote a little toy compiler and I wrote a little computer game and I wrote like a bunch of different … I wrote a little graph database. I did this-

00:08:16

Ron

Prescient. Yeah,

00:08:18

Will

That’s right. That’s right. Turns out that those … Well, those were actually a fad. They never really took off. Sure.

00:08:23

Ron

Graph databases have not really taken off, but there’s a lot of database theory.

00:08:27

Will

There’s a lot of database theory. That’s right. And that actually was part of what got me interested in databases and what eventually led me to working at FoundationDB, which is where I did find really great engineers who were able to mentor me and who made me actually somewhat competent.

00:08:41

Ron

Got it. And then somehow from the work at FoundationDB, you ended up eventually founding Antithesis. So tell us about that.

00:08:50

Will

Yeah. So FoundationDB was a magical place. I mean, I think in some ways a little bit like Jane Street. It’s just one of these places that you walk into and everybody is brilliant and everybody is incredibly humble and everybody is incredibly nice and good at their jobs and it just hums with this extraordinary energy. And one of the brilliant things that had happened at FoundationDB, it’s a thing that should happen in more software projects, I think. They sat down and were like, we’re going to build a new kind of database. This is a kind of database which at the time people believed was literally physically impossible to build because of a misunderstanding of something called the CAP theorem. And we can get into that more if you want. But basically, basically they were like, “Okay, we’re going to try and build this new kind of database. What do we need to have in order to build this database?” And they realized that in order to build such a system, you would be totally foolish to do it without a powerful deterministic simulation framework that could sort of test the database in every possible configuration, in every possible mode of operation, in all possible network conditions and failure conditions and so on, with any amount of concurrent user activity and have that all be replayable deterministically. And if you think about it for a second, it’s like, yeah, you would be foolish to build a database without that, but they were the only people I knew of who had actually acted on that insight. And so they built this extraordinary system-

00:10:18

Ron

Can you just pause again and say, “What is a deterministic simulation framework?” There are a few words there, deterministic simulation. I feel like understanding how those play out is maybe useful.

00:10:27

Will

Right, right, right. Sure. So basically, let’s start by talking about property-based testing in general, in the abstract. QuickCheck from Haskell, or I think OCaml has its own property-based testing system, right?

00:10:42

Ron

Every functional programming language has at least three of them.

00:10:45

Will

Right. And then in Python, you’ve got Hypothesis. So property-based testing, the basic idea of it is I have some piece of code. Rather than sit there and write a bunch of unit tests that do particular things that I’ve thought of ahead of time, that take particular actions, I’m going to just tell my testing framework what you can do to my code, what actions you can take. If it’s a little data structure, it’s like maybe I can insert an item and I can pop an item and I can query for some item or something. And then you set up a bunch of randomized generators which do all these things in random orders, and then you figure out what the invariants of your program are. An easy one is it shouldn’t crash, but maybe a more interesting one for a data structure is like, if I insert five things, then there are five things in it. But actually that’s not a great one. There’s a higher-order one, which is if I insert N things and don’t remove anything, there are N things there. But then we can make that even more abstract and be like, if I insert N things and then remove M things, so long as N is bigger than M, I’ll have N minus M things in there. And so you can get quite clever with these things. And then the magic is you now have not a test. You have a thing that will produce an infinite number of tests so long as you keep running it, and it will basically try your thing in many, many more permutations and combinations than you would ever have thought of. That’s the basic idea of property-based testing, right?
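The N-inserts-then-M-removes invariant Will describes can be written down concretely. Here is a minimal hand-rolled sketch in Python, not using a real framework like QuickCheck or Hypothesis; the names (`run_property_test`, a set as the container under test) are invented for illustration:

```python
import random

# Hand-rolled property-based test: generate random sequences of
# insert/remove operations against a set and check the invariant
# "after N inserts and M removes (M <= N, no duplicates), N - M
# items remain". A real framework would also shrink failing cases.

def run_property_test(trials=200, seed=42):
    rng = random.Random(seed)  # seeded, so every run is replayable
    for _ in range(trials):
        container = set()
        # Random batch of inserts, with duplicates dropped so the
        # counting invariant holds exactly.
        inserted = [rng.randrange(1_000_000) for _ in range(rng.randrange(1, 50))]
        inserted = list(dict.fromkeys(inserted))
        for x in inserted:
            container.add(x)
        n = len(inserted)
        m = rng.randrange(0, n + 1)          # remove M <= N of them
        for x in rng.sample(inserted, m):
            container.remove(x)
        # The invariant under test: N inserts then M removes
        # leaves exactly N - M items.
        assert len(container) == n - m, (n, m, len(container))
    return trials

print(run_property_test())  # prints 200: every random trial upheld the invariant
```

The seeded `random.Random` is doing a small version of the replayability theme that comes up later in the conversation: the "random" test is a pure function of the seed.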

00:12:17

Ron

That’s right. And these classic frameworks like QuickCheck in some sense automate the hardest part of this, which is generating a good probability distribution. And you were framing this in terms of operations, where you have sequences of operations on some kind of system, and that’s already leaning a little more systemsy. I feel like the classic functional programming version is more like, I’m going to test my map data structure or whatever. And then often what you’re putting in is just lists or whatever shapes of containers you want to use for doing straightforward things. And often you’re thinking about it less in terms of sequences of operations and more in terms of some fairly broad shape of data that you might want to put in. And you want nice ways of generating good probability distributions. The question of what counts as a good probability distribution is actually quite a complicated one.

00:13:00

Will

It is very complicated.

00:13:01

Ron

And so in some sense, there’s like two things you need to specify. There’s like the properties that are supposed to be true and the probability distributions for generating examples, and that’s kind of the whole bulk.

00:13:11

Will

Right. And so then one of the rules of all human endeavors is that every good idea is rediscovered 17 different times by different people who are in slightly different subdomains and so they didn’t talk to each other and then they create their own language and set of concepts for it. And it’s all very confusing. And this is also true of property-based testing, which has been reinvented tons of times. And one of the most well-known other times it was invented, it was called fuzzing, which is a very, very similar thing conceptually. Fuzzing is more from the security world, but if you squint, it’s the same thing. I have a property, which is my program shouldn’t crash, shouldn’t have memory corruption, shouldn’t have security vulnerabilities, and then I’m going to feed in a distribution and the distribution happens to look like stuff to parse maybe that has errors in it or has maliciously crafted content. And I’m going to have a random generator, which is my fuzzer, which is going to keep sending in stuff until I find a failure of the property that I care about. And this is a totally separate group of people who solved many very similar problems in some different ways and in some similar ways, and the two sides just never talk.

00:14:22

Ron

That’s right. And the early versions of fuzzing were very simple on the probability distribution side. It was basically white noise: some of the very early research just took the Unix utilities, threw white noise at them, and saw what happened. And the language of properties was incredibly impoverished. It was not much better than “doesn’t crash.”

00:14:41

Will

Yep. But the fuzzing people had a clever trick, which the property-based testing people did not have. The fuzzing people realized that you don’t need to make this a black box process. You can actually track things like code coverage and you can see what your inputs make your code do. And then you can use a genetic algorithm or an evolutionary algorithm to adapt your input distribution as you go to find more and more interesting behaviors.
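That coverage-guided loop can be illustrated with a toy sketch in Python. The target program, the hand-rolled "coverage" instrumentation, and the mutation strategy are all invented for illustration (real fuzzers like AFL instrument compiled code), but the structure is the same: keep inputs that reach new coverage, and mutate those rather than starting from scratch:

```python
import random

def target(data: bytes, hits: set):
    # Hand-instrumented branches: each branch records an ID in `hits`,
    # standing in for real code-coverage feedback.
    if len(data) > 3:
        hits.add("len>3")
        if data[0] == ord("F"):
            hits.add("F")
            if data[1] == ord("U"):
                hits.add("FU")
                if data[2] == ord("Z"):
                    raise RuntimeError("crash: found the magic prefix")

def fuzz(max_iters=500_000, seed=0):
    rng = random.Random(seed)
    corpus = [b"AAAA"]   # seed input
    seen = set()         # coverage observed so far
    for i in range(max_iters):
        parent = rng.choice(corpus)
        # Mutate one random byte of a corpus entry.
        pos = rng.randrange(len(parent))
        child = parent[:pos] + bytes([rng.randrange(256)]) + parent[pos + 1:]
        hits = set()
        try:
            target(child, hits)
        except RuntimeError:
            return i                 # crashing input found
        if hits - seen:              # new coverage: keep this input
            seen |= hits
            corpus.append(child)
    return None

iters = fuzz()
print("crash found after", iters, "iterations")
```

Blindly guessing the three magic bytes at once is a roughly one-in-sixteen-million shot per try; the evolutionary loop finds them stage by stage in a few thousand iterations, which is exactly the trick Will is describing.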

00:15:07

Ron

That’s right. You basically have these tentacles into the program and you feel out where you are in the state space and try and explore more of the state space of which branches you’ve gone through and all of that. It’s definitely an extra idea. And a bunch of the property-based stuff came out of the functional programming world, which has this, oh, we’re going to derive probability distributions from types, totally makes sense from that. And this is like, no, no, no. We’re going to modify the compiler and we’re going to do a bunch of weird ad hoc stuff to try and explore the state space. It’s a very different, but very good idea.

00:15:36

Will

Yeah. Well, the interesting thing is you are actually, I mean, you are trying to solve the Turing halting problem here. We know you cannot do it. We know that there’s no one technique that’s going to find all the bugs. And so I actually believe that the correct response to that is just to throw everything at the wall and see what sticks. You should try and have very clever probability distributions and you should try to have evolutionary algorithms and you should have constraints and constraint solvers. And you do everything you can, add some ML, whatever. We’re up against a very hard problem. And the nice thing about a basket of tools is that if you’re careful about how you architect them, no tool can make the situation that much worse, but there are certain situations where it can make it much better. And so by having a broad distribution of techniques, you’re likely to have something that works in a larger space of programs.

00:16:32

Ron

Right. Particularly because we’re doing testing, right? It’s just like you do an extra thing, it takes some time, but it doesn’t break anything. It’s just like the worst thing it can do is not find any bugs for you.

00:16:41

Will

That’s right. And you have to be a little bit more careful about that once you have sophisticated evolutionary tactics, because it could be that some technique you use pollutes your distribution in some way that makes it harder to find other bugs, but that just means you have to not be totally naive. Got it. Yeah. Okay. So there’s all these people doing randomized testing. And what’s interesting is nobody until very recently had ever applied any technique like this to what I would call real software. And this is not a knock on Haskell or small functional data structures, certainly not a knock on parsers written in C and C++. What I mean by that is nobody fuzzed or used property-based testing on a database or on a computer game or on a large distributed system or on an operating system or a kernel. People have lately started to do these things, but by and large, it was not happening until quite recently.

00:17:45

Ron

I feel like it wasn’t common, but is it really that it wasn’t done at all? I’ve talked to John Hughes about stuff that the QuickCheck folks did where they worked with auto manufacturers on fuzzing the super weird network inside of the car, and things like that. So I feel like there is stuff that should qualify as real software, more than the traditional toys to which this stuff is applied. There have at least been some commercial applications.

00:18:07

Will

I think people did some of it, but I would say it was vanishingly rare. I mean, all of these techniques maybe arguably are vanishingly rare to a first order approximation, 0% of people use them, but I think it was especially uncommon to try and use it on big stuff.

00:18:24

Ron

Yeah. I mean, I think it’s felt relatively niche. I think there are things that qualify as more serious applications of it, but much rarer than they deserve to be applied or something.

00:18:33

Will

And basically, I think that this is actually for somewhat good reason. So when you have big software, big complicated software, and I promise I’m getting back to your original question, which is what is deterministic simulation testing. Basically, when you have big, complicated software, there are two things that get dramatically harder. The first thing is the state space of the software that you are trying to explore is really complicated. And it is probably complicated in such a way that the fuzzing trick of just recording code coverage is no longer a very good map for where you have gotten in the software. Consider something like a Python interpreter. If you hit 100% code coverage in that, you have not gotten anywhere close to exhausting its behavior. Or consider something like-

00:19:25

Ron

And that one is just because the state space is much bigger than just where you are in each branch of the code. Your code location doesn’t tell you that much about the state space. There’s lots of other things going on that are really important.

00:19:36

Will

What’s in various variables, what’s in memory, all this other stuff. And if you try and take the Cartesian product of that with all the coverage, it’s way too big and you’re not going to make any progress. Or consider a distributed system where just what coverage you have gotten might be less important than what order you have encountered coverage across different nodes in some distributed algorithm. And so basically knowing where you are and fully exploring the program becomes harder, both from the fuzzing philosophy of we’re going to use signals like coverage to determine where we are. And it also gets harder from the PBT philosophy of we’re going to have really clever, intelligent, random distributions because basically you have to just get lucky so many times in a row to get something useful happening that it’s intractable to solve the problem purely that way.

00:20:33

Ron

Right. You more or less probably can’t do it fully obliviously. That’s right. The oblivious thing where you have the distribution chosen ahead of time and you’re just throwing things at the system, you kind of have to be responsive to the state of the system if you’re going to get the right kind of coverage. Although it’s worth saying, when you say covering, you never actually cover the state space. The thing that you’re doing is always weirder and more heuristic because the actual state space is highly exponential, and so you will not in any reasonable testing budget be able to test any appreciable fraction of it. So there’s some weird question of taste of which vanishingly small subset of the scenarios is it important for you to cover?

00:21:08

Will

Yes, totally true. And we will come back to that. You want to cover all of the interesting parts of the state space and you want to try and do it as quickly as you can. And that is a whole other dimension along which this is hard. Okay. So then there’s a second problem with these larger systems, more quote-unquote real systems, which is that they don’t really look like the kinds of systems that people have traditionally applied fuzzing and property-based testing to, in two ways. One is that they tend to be interactive. They tend to not be things that accept an input and then do a bunch of computation and then crash or don’t, which is kind of what fuzzing is optimized for. They tend to be things that take a little bit of input and then send you a response, then get a little more input and then do something. Imagine a web server or a computer game. It’s got this interactive flow to it, which makes the whole fuzzing model of, I’m going to come up with what is a good input to break the system and send it in and see what happens, a little bit more complicated. Then the second thing, which makes the state space exploration problem even harder, is that these systems are all non-deterministic. And this is in some ways, I think, the crux of it, because basically computers are machines, right? They’re real physical machines in the real world.
And in order to make those machines really efficient, CPU designers have done all kinds of evil and awful things that have the side effect of making them non-deterministic, meaning that if you try and perform the same computation on the same computer twice with all the same inputs, once you have things like threads involved, once you have things like timers, once you have things that need to interact in any way with the real world, with network sockets, with hard drives, suddenly your computer program is not a pure function, unless you have written it in Haskell and have been very, very careful. It’s a big, complicated, weird state machine with all kinds of co-effects from the environment that can mean it does something totally different each time you run it.

00:23:21

Ron

Yep. Although one of the weird paradoxes of this is it is often the case that the individual components are actually all very close to deterministic. It’s just that they wildly depend on initial conditions and their behavior is kind of chaotic and diverges from predictable things. So it’s like, actually the thread scheduler is a completely deterministic program in some sense, and the timers work largely deterministically, but your memory doesn’t always have the same latency. There’s a cycle where the memory gets refreshed and it’ll block out for a very little piece of time. And did you start your program at exactly the same time in the memory refresh cycle the two times that you ran it? Probably not. And then all of these things compound and multiply. You have multiple systems talking to each other and the small differences become big differences, and effectively this non-determinism gets pulled almost out of nothing.

00:24:16

Will

Yeah, that is a fantastically accurate intuition. We haven’t started talking about our technology yet, but we are actually able to measure that intuition. We can empirically tell you what the Lyapunov exponent of your software is and what its chaotic doubling time is. And it turns out that for Linux, it’s insanely fast. Basically, if you change one bit in the memory of a Linux computer, the whole state of the system is completely different within tens of microseconds. It’s actually crazy.

00:24:49

Ron

That’s shocking.

00:24:50

Will

Yeah, it’s nuts. I did not believe it, but it’s true.

00:24:55

Ron

Yeah, I’m still not sure I do, but-

00:24:59

Will

I can show you. Okay. Anyway, so why is this non-determinism so bad? So it’s bad for two reasons. The more obvious reason is it means that if I do my cool fuzzing property-based testing thing, I run some fantastically expensive computational search. I find the bug that’s going to ruin my life. And then if I don’t have exactly the right logging in place, if I can’t just look at the source code and one-shot the bug, I may never make it happen again. And that is very, very frustrating. Now my testing system has just made me feel bad.

00:25:32

Ron

Something is wrong.

00:25:33

Will

That’s right. Good luck. You’ll never know what it is until you find out at 3:00 AM when your pager goes off. So that sucks. And then there’s a second problem with it, which is that it makes the fuzzing trick of, look at what inputs have made me do useful things so far and then try small modifications on those inputs, break down and become much less performant. Because if putting the same input into the system again might not get me to the same point in the state space, then putting in a slightly tweaked one is even less likely to get me to the same point in the state space. And so this optimization loop that all of fuzzing kind of implicitly depends on doesn’t work very well.

00:26:13

Ron

You basically need the fact that there’s a kind of random input, like more or less you have a random number generator and a function from that into the behavior. And you really want that function to be a real function, which you can always run and get the same answer, so that you can actually explore that space. Whereas every time you try it, there’s just a new version of the function that is spiritually similar, but has all different behavior-

00:26:37

Will

It makes fuzzing degrade into random guessing. That’s right. Okay. So that brings me back to what is deterministic simulation testing. And the idea here is the somewhat crazy one of, we can sidestep all these issues if we just make all of the software deterministic, which sounds a little bit insane and maybe a little bit useless. It’s like, assume you had a can opener. How do you make your software deterministic? And that’s a very fair criticism, up until the existence of Antithesis, which I will get to later, has kind of solved this problem for people. But in the absence of that, what we did at FoundationDB was we wrote our software in such a way that it could be run completely deterministically. So we could simulate an entire interacting network of database processes within one physical Linux process with deterministic task scheduling and execution, with fake concurrency, with mocked implementations of communication with networks and with disks, right? We could cause database processes to have simulated failures and restart. We had to do all this with no dependencies whatsoever, because as soon as you add a dependency on ZooKeeper or Kafka or some other program, you lose this ability to run in this totally deterministic mode. But it made us so much more productive to be able to test our software this way that it was worth it to us to not have any dependencies.

00:28:04

Ron

So is it fair to say that the key enabling technology here is dependency injection? Like, you have a bunch of APIs that let you interact with the world. Most of what you write in a usual program is in fact deterministic components. You do some computation, the result is deterministic. But there are some things that you do that aren’t, like you ask what time it is. It’s like, well, now there are really two different pieces of hardware, there’s a clock and the CPU, and they’re interacting, and who knows what’s going to happen when you ask what time it is. You send or receive a network packet, you ask for something from disk. So the thing you can do is enumerate all of the APIs you have that introduce non-determinism and just have them have two modes. There’s the regular production mode, where it hits the real world and is non-deterministic. And then there’s test mode, where behind all of those calls you have a simulation that produces the response to the API, and you have control over it and can thereby force it to be deterministic. Is that the basic trick?
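A minimal sketch of the two-mode trick Ron describes, in Python. All of the names (`ProductionEnv`, `SimulationEnv`, `make_request_id`) are hypothetical, not Jane Street’s or FoundationDB’s actual APIs; the point is just that the same program logic runs against either environment, and the simulated one is a pure function of its seed.

```python
import os
import random
import time

class ProductionEnv:
    """Real mode: hits the actual wall clock and OS entropy."""
    def now(self) -> float:
        return time.time()
    def random_bytes(self, n: int) -> bytes:
        return os.urandom(n)

class SimulationEnv:
    """Test mode: the same API, but driven entirely by a seed,
    so every run with the same seed behaves identically."""
    def __init__(self, seed: int, start: float = 0.0):
        self._rng = random.Random(seed)
        self._clock = start
    def now(self) -> float:
        self._clock += self._rng.uniform(0.001, 0.010)  # simulated time advances
        return self._clock
    def random_bytes(self, n: int) -> bytes:
        return bytes(self._rng.randrange(256) for _ in range(n))

def make_request_id(env) -> str:
    """Example program logic: identical code runs against either environment."""
    return f"{env.now():.6f}-{env.random_bytes(4).hex()}"
```

Running `make_request_id(SimulationEnv(42))` twice gives the same answer; running it against `ProductionEnv()` gives a different one every time.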

00:28:58

Will

Right. Well, that’s the basic trick, but you’re left with one really, really, really big problem, which is concurrency. Even if your program only runs on one computer, you probably have threads and then the OS is going to schedule them in like God knows what order. And they also, by the way, will take non-deterministic amounts of time to execute actions, thank you Intel, and thank you everything else running on your computer, right?

00:29:25

Ron

Well, I mean, thank you, Intel, because if they didn’t do that, things would be way slower.

00:29:29

Will

Super true. So people can solve that. There are languages with sort of cooperative multitasking models of concurrent programming, where you can actually plug in a deterministic scheduler and make that all work. But then if you have multiple processes running on different computers, now you’re really in trouble. Now, how long it takes that network packet to get from this computer to that other one is something that’s completely outside of your control. And if you want to try and run them all on the same computer, you need to create some way of faking processes on different computers running on the same computer, in some sort of cooperative multitasking runtime, so that you can make it all deterministic. And there are people who’ve done that. We did it at FoundationDB. I think you guys did it at Jane Street.
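A toy illustration of the “fake network, fake time” idea: a discrete-event simulation in which two simulated nodes ping-pong messages, and every delivery delay comes from a seeded RNG. This is an illustrative sketch, not the FoundationDB or Jane Street scheduler; real systems schedule actual tasks, but the reproducibility property is the same — the entire interleaving is a function of the seed.

```python
import heapq
import random

def simulate(seed: int, steps: int = 6):
    """Two simulated 'nodes' exchange pings over a fake network.
    Events are processed in simulated-time order, and delays come
    from a seeded RNG, so the whole run replays exactly from the seed."""
    rng = random.Random(seed)
    events = []   # heap of (deliver_time, seq, destination, message)
    seq = 0
    log = []

    def send(now, dest, msg):
        nonlocal seq
        delay = rng.uniform(0.1, 2.0)   # "network latency", but simulated
        heapq.heappush(events, (now + delay, seq, dest, msg))
        seq += 1

    send(0.0, "B", "ping-0")
    while events and len(log) < steps:
        t, _, dest, msg = heapq.heappop(events)
        log.append((round(t, 3), dest, msg))
        n = int(msg.rsplit("-", 1)[1])
        send(t, "A" if dest == "B" else "B", f"ping-{n + 1}")
    return log
```

Because nothing in the loop touches a real clock or socket, a failing interleaving can be replayed as many times as you like by re-running with the same seed.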

00:30:22

Ron

That’s right. Yeah. One of the reasons I know the bag of tricks is that this is more or less exactly what we have done, and we hit the exact same set of issues and made the same basic commitment to “we will write all the code ourselves,” which we had kind of weirdly fallen into by using an obscure programming language. So we had this whole OCaml ecosystem where we had really deep control over the whole thing. And so yeah, a lot of our systems, not all of our systems, but a bunch of our systems are built in this way where we have this kind of end-to-end control and can do this kind of deterministic simulation. And it’s absolutely critical for all the reasons you said. It really helps you go faster in many different ways.

00:30:57

Will

Yeah. I think something I haven’t said yet is this all sounds like a lot of work, and it is a lot of work, but it was so game-changing at FoundationDB. That company could not have existed without this technology. We built a thing that everybody thought was impossible with a team of 10 people. And we did it really, really fast. And we did crazy things that nobody would ever dare to do without a testing system like this. I mean, I’ll give you two examples. One was we deleted all of our dependencies. And in particular, we deleted Apache ZooKeeper, which we had been using as our implementation of consensus, of Paxos. And nobody writes their own Paxos implementation. That’s like a thing that insane people do who want to have bugs. And we did it, and our new one was less buggy than the officially good one from ZooKeeper that everybody uses. Later, we basically deleted and completely rewrote from scratch our core database concurrency control and conflict checking algorithm to make it more parallelizable and more scalable and faster, which again is just a totally crazy thing to do. I don’t know of other databases that, once they have gotten that piece working, have rewritten it, let alone rewritten it to make it more theoretically scalable and cleaner. That’s just nuts. But if you have a system that can find all the bugs really, really fast, it frees you to just do crazy stuff like that.

00:32:29

Ron

Okay. So this seems like a great idea. We think it’s a great idea, which is why we’ve done it. FoundationDB thought it was a great idea. It’s also totally impractical.

00:32:38

Will

Totally impractical.

00:32:39

Ron

Because the whole thing of like, we’ll just do everything from scratch. It’s like, okay, yeah, maybe a database system should do that. And maybe some crazy trading company that made a decision 20 years ago to use a weird tech stack can do that for all sorts of reasons, but it’s not a generalizable tool. And Antithesis is trying to be a company that sells a generalizable tool. So how do you go from the good idea that’s totally impractical to a thing people can use?

00:33:03

Will

Right. So basically we’ve talked about how there are sort of two key obstacles to making a really, really powerful randomized testing system, what we call an autonomous testing system that can find all your bugs really, really fast. One is you need to actually explore the state space extremely quickly and find all the bugs. And the other is this determinism issue, which both impacts the usefulness of finding those bugs and also makes it just harder to explore the state space. And basically what we’re trying to do is the absolutely insane, hubristic goal of solving both those problems in full generality for every piece of software in existence. And the important thing is we solve them in the reverse order. Once you solve determinism, that actually gives you a huge leg up in efficient state space exploration, for all the reasons we’ve already talked about. And I can go into more detail about how we use that. Okay. So how do we solve determinism? That sounds kind of hard because, as we’ve just talked about, all kinds of things that you want to do on a computer are non-deterministic. There are other people who have tried to do this. There are people who use frameworks like the one that you guys have at Jane Street or like the one that we built at FoundationDB. There have since been a bunch of open source ones built for various programming languages and runtimes. That’s cool, but it only helps people who are committed to using that framework, willing to write all of their software in that framework, and willing to not use any dependency that’s not in that framework. It’s not general, right? Yep. Not a general solution, can’t do it that way. There are people who have tried to solve this problem with record and replay, where basically, as I’m running my program, I write down the result of every single system call and the exact moment at which it was delivered. And then if I want to run my program again, I can just replay all of that without actually talking to the system. And that works pretty well for a thing running on a single node, but it doesn’t work very well for distributed systems. It’s also just not very scalable.
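A minimal sketch of the record-and-replay idea at a single “syscall” boundary (the clock), with hypothetical class names. Recording logs every result; replay answers from the log instead of the real world. The limitation Will and Ron discuss next is visible in `ReplayClock`: any call the recording never saw simply falls off the end of the log.

```python
import time

class RecordingClock:
    """Record mode: pass each call through to the real clock and log the result."""
    def __init__(self):
        self.log = []
    def gettime(self) -> float:
        t = time.time()
        self.log.append(t)
        return t

class ReplayClock:
    """Replay mode: answer every call from the recorded log.
    A call the recording never captured raises StopIteration."""
    def __init__(self, log):
        self._log = iter(log)
    def gettime(self) -> float:
        return next(self._log)

def elapsed(clock) -> float:
    """Example program logic: runs identically against either clock."""
    start = clock.gettime()
    end = clock.gettime()
    return end - start
```

Replay reproduces the recorded run exactly, but only that run; as soon as execution diverges and makes a call the log never saw, you are back to the real, non-deterministic world.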

00:35:13

Ron

There’s a critical idea that you snuck in there, which is where you said the word syscall, right? The kind of FoundationDB/Jane Street/whatever version of doing this at the library level is: there are particular function calls inside of a language that are going to be swappable. But here what you’re doing is saying, “You know what? Actually, we’re going to do this at the OS level.” Yes. At the bottom, actually, all the non-determinism generally comes in from the operating system and from concurrency. And concurrency is somewhat mediated by the operating system. So system calls are, anyway, one huge source of non-determinism. And so the idea of these kinds of record-replay things is: we’re just going to do the dependency injection at that level. And we’ve already now stepped up a big level in generality. I no longer have to own your programming language.

00:35:58

Will

It’s gotten better. It’s gotten better.

00:35:59

Ron

It’s a big step.

00:36:00

Will

We’re not there yet, though. Okay. So we’re not there yet for two reasons. One is it’s still not fully general. This is only going to work for the operating systems that you’ve designed it to support, and maybe that’s okay. Maybe you think it’s fine because everybody uses Linux. Weirdly, that seems to be true now. But people write iOS apps, man. People write computer games; they mostly run on Windows. There are other OSs out there. But I think it also doesn’t work great for distributed systems, although you can kind of hack it. And there are a few people who have.

00:36:30

Ron

Why doesn’t it work great for distributed systems? The syscall layer gets you a hook into all the distributed … all the distributed communication comes again through the OS. So why can’t this generalize to that?

00:36:41

Will

Basically, all of the record replay systems out there are designed to do this for one process.

00:36:46

Ron

Got it. So it’s not so much a fundamental question as an engineering question.

00:36:49

Will

Correct. Correct. It’s just that the UX is not very good. Sure. But I think the more fundamental limitation of these things is the scalability problem. It is just a vast amount of data to write down every single syscall that your thing ever did. You’re already doing a computationally expensive search. You really don’t want to hugely increase the overhead of that. And it doesn’t actually get you true determinism.

00:37:11

Ron

It lets you replay a non-deterministic run. Correct. But it doesn’t let you play things out in a deterministic way, because every time you do a thing you haven’t previously captured, you just have to go do it.

00:37:23

Will

Exactly.

00:37:24

Ron

Exactly. So it’s a weird halfway house, right?

00:37:26

Will

Exactly. So basically what we decided to do was just go another step beyond that and say, okay, we’re going to do the dependency injection as you put it at an even lower level. Let’s just get under the operating system and let’s implement a deterministic computer, which is a thing that you can do these days without creating custom silicon because people have virtual machines, hooray. So basically we just have to write a hypervisor that emulates a fully deterministic machine, and then we don’t have to touch your OS at all. We don’t have to touch anything you do at all. You can just run your stuff unmodified.

00:38:02

Ron

Right. And so your crazy hard thing to do is possible because people did a super weird, crazy hard thing years ago. And this was like part of the historical failure of the operating system, where it’s like, oh, back in the 60s or 70s, these multi-user operating systems are going to have ways of isolating different programs from each other and stuff. And then over the years we’re going to be like, “Oh yeah, none of this works actually. Unix is very badly designed. It doesn’t solve any of these problems.” So instead we’re going to have a new abstraction where we simulate things at the level of machines. The hypervisor is basically a layer whose upward interface is a fake machine, and it lets you run different virtual machines on that hypervisor. And then once you have the hypervisor, in some sense, the path is clear, right? That’s the layer at which … In some sense, before we said, oh, all of the non-determinism comes from the operating system.

00:38:53

Will

But no, it comes from the CPU.

00:38:54

Ron

It comes from the … well, it comes from the-

00:38:56

Will

Or the hardware. It comes from the hardware. Timers, right?

00:38:59

Ron

All the different pieces of hardware are introducing that. So you just got to be like, oh, we just got… That’s the layer at which the non-determinism comes in, and that’s the layer at which we can instead do a deterministic simulation of what a machine is.

00:39:09

Will

Correct. And our hypervisor is a little bit more ambitious than just being a deterministic hypervisor, which was already kind of hard. In order to make this really work well, it also needs to be really fast, close to native speed, or even in some weird cases a little faster than native speed for most code, which is an interesting thing that we’ve pulled off. But then there’s another property that’s also really important, which is that we are trying to do this huge branching exploration through the state space of a computer system. And so if we’re running down multiple branches on the same physical host that is running the hypervisor, it’s really annoying if we have to store a separate copy of the memory for each of the guest operating systems that are running inside of it. That would be a lot of RAM. And so what we do instead is deduplicate memory pages at the host level using copy-on-write, so that if one of the guests is doing something and it doesn’t affect some particular page in memory, it just inherits a copy of that page from its ancestor. And sibling VMs can just be addressing the same underlying memory on the host system, which means that we can do this with massive concurrency on very big computers and explore really fast.
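The page-sharing idea can be illustrated with a toy page table. This is a sketch of the principle only, not the actual hypervisor mechanism, which shares pages in host memory: snapshots copy only page references, and a page’s contents are duplicated the first time a branch writes to it.

```python
class CowMemory:
    """Guest 'memory' as a page table mapping page index -> immutable bytes.
    A snapshot shares every page with its ancestor; a page is only
    copied when one branch first writes to it (copy-on-write)."""
    PAGE = 4096

    def __init__(self, pages=None):
        self._pages = dict(pages or {})

    def snapshot(self) -> "CowMemory":
        # Copies only the page *references*, never the page contents.
        return CowMemory(self._pages)

    def read(self, idx: int) -> bytes:
        return self._pages.get(idx, bytes(self.PAGE))  # untouched pages read as zeros

    def write(self, idx: int, offset: int, data: bytes):
        page = bytearray(self.read(idx))   # the copy happens here, on write
        page[offset:offset + len(data)] = data
        self._pages[idx] = bytes(page)
```

A thousand branches that each touch only a few pages cost roughly a thousand times a few pages of extra memory, not a thousand full copies of the guest’s RAM, which is what makes the massively parallel exploration affordable.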

00:40:32

Ron

Got it. Okay. So this kind of brings into focus what is the thing that Antithesis is providing in the end, right? It’s trying to give all of the upsides you described of having this very powerful testing system that can efficiently explore lots of different behavior, but it does it in a way where the amount of work that you have to do to use the system is very low.

00:40:55

Will

That’s right.

00:40:55

Ron

It’s just like, what is your API to Antithesis? It’s actually what you’re doing already. You threw a bunch of stuff in a Docker container before; you throw a bunch of stuff in a Docker container now. You’re just running a VM somewhere? It’s like, yeah, you just run a VM somewhere else. You run a VM on Antithesis’ servers, and then they get to use all of this fancy tech to make it efficient and be able to do all this exploration. And you don’t have to do anything clever to make your system testable. It’s just like-

00:41:23

Will

That’s right. We magically find all the bugs and they’re magically reproducible. That’s right. It’s very straightforward. And the key there, I said we magically find all the bugs. That’s the second really hard thing I mentioned. Once you’ve made the system deterministic, you still need to find all the bugs. You still need to do this state space exploration, and you now need to do it for way more complicated computer programs than parsers and little data structures written in Haskell and so on, so you now need really, really smart state space exploration. But because we have determinism, we can be smart about it. It doesn’t degrade to random search. And so we’ve also got a whole large chunk of our company doing fundamental research on how to do this state space exploration faster and more efficiently for wider and wider classes of programs.

00:42:17

Ron

So to jump back for a second to the initial framing, that this all kind of comes out of property-based testing in a sense: we’ve spent an enormous amount of time talking about one half of property-based testing, which is essentially the generation side, the probability distributions, how you explore the space, and a bunch on the mechanics of how you run it, but very little on the properties. And if you want to find all the bugs, you have to know what the program’s supposed to do in the first place. Yes. So how do properties fit into this story?

00:42:49

Will

Right. So this is actually a little bit easier than people think it is. I think a lot of the problem here is that property-based testing was invented by mathematicians and functional programming people who were thinking of it in the same area as formal methods and stuff like that. My colleague David MacIver calls this the original sin of property-based testing: people were coming from this very, very mathematical background, and so they were thinking of it as, you have to exhaustively enumerate all of the properties of your system. And my belief is that you don’t actually have to do that. And the reason I don’t think you have to do that is that computers and computer programs are very chaotic, and they are very good at escalating any misbehavior of your program into much more obvious and extravagant misbehavior. And so you can actually catch a very, very large number of bugs with a partially specified system. To give you a concrete example: if I have some memory corruption in my C++ program and I don’t have ASan enabled, so I’m not going to find the memory corruption directly, that could still manifest in a lot of ways. It could manifest as my program giving wrong answers. It could manifest as weird garbage or glitches in some response I get. It could manifest as a crash. It could manifest as an infinite loop. It could manifest as corruption of some other random invariant in my program somewhere. And so if I have a property that’s set up to catch any of those things, there’s a decent chance that when I shake the box enough, I will be able to detect that bug, even though I haven’t thoroughly specified every aspect of its behavior. And that same idea actually is true for much broader classes of bug than memory corruption.
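A small illustration of catching a bug with only partial properties, using a hypothetical `buggy_sort`. None of the checks spells out what sorting means in full; the generic “doesn’t crash, output is ordered, output is a permutation of the input” invariants are enough to shake the bug loose.

```python
import random

def buggy_sort(xs):
    """A 'sort' with a subtle bug: it silently drops duplicates."""
    return sorted(set(xs))

def shake_the_box(fn, seed=0, runs=500):
    """Check only generic, partial properties on random inputs: the call
    must not crash, the output must be ordered, and the output must be a
    permutation of the input. Nothing here fully specifies 'sorting'."""
    rng = random.Random(seed)
    for _ in range(runs):
        xs = [rng.randrange(10) for _ in range(rng.randrange(8))]
        try:
            ys = fn(list(xs))
        except Exception as e:
            return f"crashed on {xs}: {e}"
        if ys != sorted(ys):
            return f"not ordered on {xs}"
        if sorted(ys) != sorted(xs):
            return f"lost or invented elements on {xs}"
    return None
```

The permutation check was never written with “drops duplicates” in mind, but random inputs containing duplicates violate it almost immediately, which is exactly the escalation effect Will describes.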

00:44:42

Ron

So I think what you’re saying is true for a part of the space, but I don’t think you’re going to get all the bugs that way. I think there are lots of areas, and I think we as computer scientists, really as software engineers, rely on this kind of brittleness property all over the place, where the fact that you can find the bugs that you can …

00:45:02

Will

It’s actually kind of why normal non-randomized testing works so well, I think.

00:45:07

Ron

That’s right. But I also think whether it works depends on the kind of things you’re doing and the way the code is structured, in important ways. The most obvious exception is numerical bugs, which just don’t show up this way. You get the calculation a little bit wrong, and then your curve doesn’t go up and to the right quite at the speed that you want it to, but it’s often very hard to get any kind of bright-line demonstration that you’ve done something wrong, or to know where you’ve done something wrong. That’s right. I think there are other properties too. I mean, from our side, if you’re building a trading system, the trading system might operate perfectly well and never break, but it’s just more aggressive than it should be. It sends larger orders more often, or less often, or not placed quite properly in the book. And if you don’t do a good job of specifying the properties, I think those kinds of systems are very hard to test. And this kind of coarse-grained “let’s look for gross misbehavior and shake the box a lot” is just not going to catch those things at all.

00:46:07

Will

Yeah. So totally agree with you. What I’ve said so far only covers a subset of the bugs. I think that there are a lot of other ways to add and refine properties incrementally. I am interested in how to do this absolutely perfectly because I’m a testing fanatic, but I’m also a pragmatic business owner. So I want to give customers an easy on-ramp, which is just: add the most basic possible properties that all software should have. And then I want to give them a nice gentle ramp towards more advanced usage. And I think what the nice gentle ramp looks like for most people who are not sophisticated property-based testing experts is actually something others have already built for us: observability and alerting. If you think about a system like CloudWatch or a system like Datadog or whatever, they have already built, in some sense, the second half of a property-based testing system. You can specify what you don’t want to see, and then you can define alerts on those things. “Hey, the memory of this thing should never exceed this number. Oh my gosh, if you ever see this log message, I want to be alerted right away.” Those are all properties. They’re not very good properties, but they’re properties. And I think the main reason they’re inadequate is that with something like CloudWatch or something like Datadog, you only find out that those properties have been violated when your customers do. If you could move that experience, that UX, into the testing world, into before you deploy, I think it’s actually an amazing sort of interactive way of defining and then refining your system’s properties. And I think it’s a thing that’s actually quite accessible and intuitive to quote-unquote normal developers.

00:48:01

Ron

I see why you say that, but I worry actually that observability-style approaches will scale very poorly, because part of what the Antithesis approach relies on, as I understand it, is the ability to take the testing work you’re doing and run it at kind of massive, incomprehensible scale. And I think observability rules tend to rely on the fact that you see things as often as they happen in real life. And so you can get away with soft properties that aren’t quite the properties you care about, but are indicators of and predictors of the things you care about. So you sort of get to specify things to flag you, and the key thing is to make them not flag you incorrectly too often. But I feel like something like Antithesis depends pretty critically on the violations being real at a high rate, because otherwise Antithesis is just going to say, “Oh, we did your run and you have 68 million exceptions. You might want to look at which ones are real.”

00:49:01

Will

Totally, totally. You should definitely not take every single thing that you would find interesting in your observability system or whatever and turn it into a property. But I think taking the ones that would page you and turning them into properties is a great way to get people who have never thought about property-based testing to start thinking about what the properties of their system might be. I think the other thing that can help here is a little bit of the Socratic method. A thing I found when talking to customers is, often, if you ask somebody, “What are the properties of your system?” they get this deer-in-the-headlights look and they’re like, “Oh my God, get me out of here.” And then if you say to them, “Hey, should your system always return an answer if two out of three replicas are up?” they’re like, “Yeah, yeah, totally.” And if you’re like, “Okay, cool, do you expect that answer to come within some defined SLA?” they’re like, “Oh yeah, obviously.” I’m like, “Okay, great.” And it’s like, “Okay, well, should two users ever be able to stomp on each other’s data?” No, no, definitely not. And so you can kind of tease it out of people. And I think one thing that we’re very interested in experimenting with is, can we automate teasing it out of people a little bit more?

00:50:12

Ron

You should clearly train an LLM to have the dialogue with customers to figure out what their properties are.

00:50:15

Will

Or to just look at their code and guess at some properties, and then present them to the user: “Hey, are these properties of your system?” And by the way, even if the user says that they’re not properties of their system, they’re often pretty good guideposts in the state space exploration, because they’re often the kinds of things that some other developer at that company might have mistaken for a property of the system, which means that if you get one to happen, it might lead to a bug later on.

00:50:45

Ron

Do you want to present those semi-properties to the person who owns the system, or do you want to present them to Antithesis and see whether the system follows them? And then I feel like you want to classify these: the properties that seem to always be followed, and maybe those are properties; the ones that are not followed at all, and those you discard; and then the ones that are mostly followed, and maybe those are the interesting ones.

00:51:07

Will

Yeah, this is not an original idea. This is an idea that the fuzzing people came up with relatively recently, but they did come up with it first. I think they call it speculative properties. I forget exactly what the term is … it’s in a paper somewhere. But basically the fuzzing spin on this is: I look at a function that I’ve executed a million times, and if I see that one of the parameters is positive every single time that function is executed, I just go ahead and add an assertion that that parameter will always be positive. And often that just is a property. And even if it’s not, it’s very likely that if every time I execute it the thing is positive, and then I get it to be negative one time, that’s going to lead to some interesting behavior later in the system, possibly a bug, because everybody else assumed it was always positive. And so the idea is we can both use it to guide exploration and use it as a kind of preemptive property creation.
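The speculative-property trick can be sketched as follows. This is illustrative code, not the mechanism from the fuzzing literature: if an argument was positive on every observed call, install an assertion that it always is, and treat any violation as an interesting event worth exploring.

```python
def speculate_positive(fn, observed_inputs):
    """If every input we've ever seen was positive, speculatively assert
    'this argument is always positive' on all future calls, turning a
    silent assumption into a checkable property."""
    always_positive = all(x > 0 for x in observed_inputs)

    def checked(x):
        if always_positive and x <= 0:
            raise AssertionError(f"speculative property violated: {x} <= 0")
        return fn(x)

    return checked
```

As a hypothetical use: wrap a function like `lambda x: 1000 // x` that implicitly assumes positivity. After observing `[1, 5, 9]`, the wrapper accepts `checked(4)` but flags `checked(-2)` the moment exploration first produces a non-positive value, before any downstream code misbehaves on it.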

00:52:10

Ron

So I want to step back for a second. I think a meta thing I’m observing from this whole conversation is, I wonder how you sell this to customers. This whole conversation has been about what I think of as a really compelling and important kind of superpower that you can give software engineering teams, but we’ve already had a pretty long and complicated story to explain what’s actually going on. To take, for a second, the perspective of somebody running a business: how do you think about convincing people that this is a thing they should be excited about and want to pay for?

00:52:45

Will

How do we sell to you?

00:52:47

Ron

So that’s a good question, actually, right? How did we actually get to Antithesis? A little randomly, actually. So from our perspective, one of our engineers is someone who had just followed the FoundationDB work, kind of knew about it, thought it was cool, and was wondering what those people were doing. And at some point he saw an Antithesis webpage go up and was like, “Oh, we have testing problems. Maybe this would be good.” This was someone working on our ultra-low-latency team that does a lot of very complicated, multi-system, extremely subtle behavior kind of stuff, and was like, “Oh, it’d be nice to have deterministic testing for this. Maybe that would be good.” And so we reached out and set up some conversations. I got to sit in on a couple of them, not because we were like, “Oh, we need the old guy who’s been here for a long time,” but more because I’m nerdily really interested in testing stuff. So I thought it would be interesting. And then one of our engineers, a guy named Doug Patty, who’s actually been on the show before, decided to try it out with Aria, which is one of our internally developed distributed systems that already has a ton of very careful work on the testing. Indeed, it’s one of the places where we’ve done a lot of very careful work around deterministic simulation testing, and yet we thought there was some actual real value add from Antithesis’ version of this. And that’s basically how we found you guys. But it’s a very expert-oriented, people-who-are-already-in-the-tank kind of customer acquisition story. It’s like the people who already built their own deterministic simulation framework are like, “We’d like a better one.”

00:54:23

Will

Yeah. Well, we had a –

00:54:24

Ron

I don’t think we’re a big audience.

00:54:27

Will

No, we actually had a debate internally in the early days, which was: would people who are already doing fuzzing or PBT or deterministic simulation be better customers because they are into this stuff, or would they be worse customers because they already have one and they’re not going to pay a lot of money for another one? And it turned out that they’re very conclusively better customers. But as you say, there are not that many people like you. And so you’re right, we do have to make it broader. There are a few arguments that we use, and then I think there are a few trends that are really, really acting in our favor and that are giving us actually considerable success in selling this to normal people. Not saying you’re abnormal.

00:55:10

Ron

All good. I wouldn’t object if you had.

00:55:12

Will

Basically, the two main arguments are safety and speed. And you can think of those things as being on a frontier. For a given level of programming technology and skill and architecture and language choices and problem domain and whatever, there’s some efficient frontier between safety, like how sure you are that your program has no bugs, and speed, which is how fast you can add new features and solve business problems. And I think of a tool like Antithesis as just pushing that frontier outward, and you can decide to reap the benefits in more safety. You can keep going at the same speed but be really, really, really sure you don’t have any bugs, which might be the right call if you’re making pacemakers or airplane guidance software or something like that. Or you can keep the same level of quality but do everything a lot faster, because you’re not writing as many tests, because when you do hit bugs you’re hitting them in your tests rather than in production, and you’re not doing some really long, slow, boring triage process with a badly written bug report from a customer while you’re not sleeping, and so on. So you can just go faster with the same level of quality, or you can get a little bit of both. And we sort of have all three kinds of customers, I would say, and they’re all getting some real benefit from it somewhere on that frontier. There are two trends that have helped us a lot. One is just that all kinds of software is now either responsible for very, very critical stuff that needs to keep working, or responsible for making lots and lots of money and needs to keep working and keep getting better at making money. And so a lot more people care, relative to 10 or 20 years ago, that their software works correctly and that they’re able to continue to develop it.

00:57:11

Ron

Just that there are more critical systems.

00:57:12

Will

That’s right. The other trend that I think has been really good for us is AI code generation, which hugely increases the salience of all these issues. And I think, moreover, it has made everybody realize the Amdahl’s law nature of verification: being able to verify that your software works correctly is a critical limiting factor in how much software you can write. And I believe this was always true, and it just wasn’t obvious enough to people. But now it’s really, really, really obvious. I can have 10 Claude Codes all writing code, and it doesn’t matter. I’m not going to go any faster if I can’t merge those PRs after reviewing them and actually deciding they work.

00:57:58

Ron

Right. And there are two paths towards this. One is figure out ways of making your software less critical. If you can find a domain where you can do a lot of stuff and get value out of it, but correctness isn’t incredibly important, you can move at lightning speed, and that’s great. And there are all sorts of cases where this is true. If I’m a software developer who wants an analysis tool to understand what’s going on in my program, it doesn’t have to be all that right. It can help me some of the time and not succeed other times, and that’s kind of fine. It’s kind of a throwaway tool that I just make and use, and that’s super great. And you can just vibe code that, and it’s awesome.

00:58:32

Will

And by the way, I think there will be many successful companies built to make it easier to have that kind of software. Things like zero trust hosting, things like very powerful security guarantees around a piece of software so it can’t do any damage, things-

00:58:47

Ron

Security guarantees. And I think also just like picking the right abstraction boundaries, figuring out if I want to make this whole thing useful, what pieces have to be reliable and what pieces don’t have to be reliable? So it’s like there’s a whole new kind of software engineering challenge of how do you build these architectures that let you leverage less reliable code. So that’s like one direction to go and the other direction is just getting much better at verifying.

00:59:07

Will

That’s right. That’s right. And right now, I think that has suddenly become interesting. It’s very hot all of a sudden, which is kind of fun because this was like a backwater, in some ways, a deliberately chosen backwater for a long time, and now there’s all this interest.

00:59:23

Ron

What do you mean by a deliberately chosen backwater?

00:59:26

Will

Oh, if you are … Sure. So basically if you’re trying to decide what to make a career in, right? I talked before about how there’s a lot of careers where you’re not going to have world-class success unless you are at an extreme of the distribution of people.

00:59:44

Ron

Like being a violinist or a mathematician.

00:59:47

Will

Correct. One good way to be at the extreme of the distribution is to pick something where nobody else who is very talented is interested in it, and then it just is actually much, much, much easier. And you can’t pick making paper airplanes or whatever. Nobody super talented is interested in that, because there’s not a ton of economic benefit in it, not a lot of benefit to the world in it. But if you can find a sweet spot where something is both really, really, really important, but for some reason nobody else has noticed it’s really, really, really important, or people know it, but they don’t want to do it anyway.

01:00:29

Ron

Because it’s painful.

01:00:30

Will

Because it’s painful or because it sucks or because it’s low status. That’s actually an-

01:00:32

Ron

Testing is boring.

01:00:33

Will

Incredible arbitrage opportunity. And so that was actually a lot of why I got interested in testing, is this is janitorial work. All developers hate it. The number of smart people who have thought about software testing is very low because-

01:00:53

Ron

Although I have to say, when I started at Jane Street, I was super incompetent. What did I have? I had a PhD in computer science, which doesn’t tell you how to be a software engineer. And I was not super good at it and I didn’t know anything about testing. But over time, over the many years of thinking about these systems and building them, I’ve come to realize not just that testing is important, but that it’s super interesting and fun. It is, when you do it well, right? It’s one of these things where if you don’t do the engineering work to build good systems for it, it’s horrible and nobody likes to do it. And there are lots of companies that solve this problem by hiring a whole different tier of people to be the testers, because it’s so low status that you can’t get the real software engineers to do it. So you get other classes of people to do it and you just make it a different class of job, and it seems like a terrible way to run a business.

01:01:44

Will

Yeah. I actually believe that the world is fractally full of things that are incredibly interesting and incredibly ignored by the entire world. I believe that there is tremendous low-hanging fruit here. It’s not just software testing. There are things that are super economically valuable, super interesting, and that nobody is doing. The key, though, is even if you find such a thing, your problems have not ended, because now you need to convince other people that it is actually super exciting and cool, which you might be able to do kind of one-on-one. But if you want to start a successful company, you need to somehow make yourself legible to capital, in the words of somebody I like. So that’s a whole other challenge. We got a little bit lucky there. As our company was growing and scaling, we’d kind of laid all the groundwork and the foundations here. And then suddenly this giant thing happened in the world that made what we were doing super legible to capital. And that was just a nice stroke of luck. I think we would’ve succeeded anyway, but it’s always nice when things break your way.

01:02:48

Ron

Right. So what you should ideally do is pick a neglected area of the world, operate and sell for a while, get a headstart, and then cause the area to suddenly not be neglected.

01:02:57

Will

That’s right.

01:02:57

Ron

But only after you have done a lot of pre-work.

01:02:59

Will

That’s right. That’s what we somehow managed to do.

01:03:02

Ron

Actually, the capital thing may be a good thing to talk about for a second. So one thing that we got involved with, so we started using Antithesis as a product and we were excited about the actual results. I guess a thing I didn’t say before, which is one thing that made us really happy about it, is that it actually found bugs that we hadn’t found before. It allowed us to do more aggressive, larger scale kinds of simulations, even though we already had a deterministic simulation.

01:03:27

Will

And your systems were really well tested.

01:03:29

Ron

Right. Really well tested, and with a really good record of very few bugs in production. But the curse of a system with a really good record of very few bugs in production is that people start relying on it having a very good record of very few bugs in production in the future. And so the stakes go higher and you end up using it for more, and you want to put more engineering into making it yet more reliable.

01:03:53

Will

That’s a super interesting testing problem in its own right, by the way. If your system performs better than its SLA, everybody who depends on you will start to assume in code and otherwise that it will always perform better than SLA. And then if you ever merely meet your SLA, everything will go down and crash.

01:04:13

Ron

Yeah, that’s basically right. I’ve often wondered what SLAs are for. I have not found the whole form of an SLA to be a particularly useful engineering mechanism in practice.

01:04:26

Will

At FoundationDB, we actually invented a technique. I mean, we invented it, but others have invented it too. We call the technique bugification. And the basic idea is: if you have a piece of code that you have written so well that 99.99% of the time it does way better than its promise, say it returns an optional value, but it always returns a value, then when running in test, you should sometimes just make it do the pathological thing with some low but real probability, so that anything that depends on that code, all the callers, gets used to the fact that it can exercise its full spectrum of behavior.
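The bugification idea Will describes could be sketched as follows. This is a hypothetical illustration, not Antithesis or FoundationDB code; the function name, the `BUGIFY` environment flag, and the probability are all made up for the example.

```python
import os
import random
from typing import Optional

# Hypothetical test-mode switch; in a real system this would be wired
# into your build configuration or deterministic simulation harness.
BUGIFY = os.environ.get("BUGIFY") == "1"
BUGIFY_PROBABILITY = 0.001  # low but real

def lookup_config(key: str, table: dict) -> Optional[str]:
    """Promises only an Optional, but in practice always finds the key.

    Under bugification, the test build occasionally exercises the
    pathological None case so that callers can't quietly start
    depending on the stronger-than-promised behavior.
    """
    if BUGIFY and random.random() < BUGIFY_PROBABILITY:
        return None  # deliberately do the worst thing the contract allows
    return table.get(key)

def get_timeout(table: dict) -> int:
    # Callers must handle None, and under bugification they will
    # actually see it during testing.
    value = lookup_config("timeout", table)
    return int(value) if value is not None else 30  # documented fallback
```

The point is that the injected failure lives behind a test-only flag, so production behavior is unchanged while test runs exercise the full contract.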

01:05:03

Ron

Right. And I guess famously, Netflix was like, actually we’ll do this in production. That’s the whole Chaos Monkey idea.

01:05:09

Will

I’m not such a fan of that.

01:05:10

Ron

Yeah. There’ve been spots where we’ve used it. I have complicated feelings about it. I do feel like it degrades the quality of your overall service in a way that is often just economically inefficient and you just don’t want.

01:05:24

Will

I think Amazon actually might offer an entire region where you deploy your code there and their services will just-

01:05:32

Ron

Take a bad region.

01:05:33

Will

They’ll just return 500 sometimes whenever … Yeah, it’s actually a pretty good idea.

01:05:39

Ron

Yeah. Certainly seems good as a mode of testing.

01:05:42

Will

Yeah. Sorry, you were saying I interrupted you.

01:05:44

Ron

Right. Yeah. So I guess we were talking about capital. So we got involved as customers. We found it useful for finding real bugs. We found that it, again, in the way that you might expect, increased the ambition of the kinds of things that we would try to do. There are things that we are willing to experiment with that are riskier, but we feel like more of that risk is tamped down by the better testing story. So it’s been a very positive experience for the places that we’ve applied it. And then we got involved actually in leading Antithesis Series A. It was pretty cool. Yeah. Which I think you guys found to be a little bit of an interesting and weird experience and we found to be a kind of novel experience too. And I’m kind of curious how it felt from your perspective.

01:06:26

Will

Yeah. Well, you guys haven’t invested in very many companies, so it was not a thing that we thought was even on the table or likely to happen. I think it basically happened as a happy coincidence. You heard through the grapevine that we were raising money. And then I think one of your corp dev people came and chatted with us and we were initially like, oh yeah, whatever. They want to do some small strategic investment. And then I think he was like, “We would consider leading the round.” We were like, “What? That’s completely unheard of.” But it was actually an incredible experience. Silicon Valley VCs are great and they have many forms of knowledge and they have many form … They have deep networks and they have deep experience from working with many, many companies that lets them give you all kinds of good advice. But they’re generally not super active users of your product and-

01:07:23

Ron

Certainly not of this product.

01:07:24

Will

It’s certainly not of this product. That’s right. That’s right. Maybe Carta had an easier time with that. And so I think one of the amazingly cool things about having Jane Street as an investor is just that I feel so very aligned in terms of you understand our vision and what we want to do. You guys give us constant good ideas about the product and strategic perspective on the world informed by your use of it as a customer. It’s like a very different kind of advice than most investors can give. And we’ve already got the Silicon Valley VCs, right? We have that. And so having you guys as well just feels like an incredible superpower.

01:08:07

Ron

Yeah. And I do think this lines up. I mean, we are certainly not primarily a VC. That’s not at all the primary thing we do, but we have been doing more and more investments over the years. And those investments have mostly been in companies where we are connected to the underlying work, where we care about it, where we are customers or want to be customers of it, and where we feel like we have direct subject matter expertise on the area in question. And we really believe in the product and think it’s great and want to use it ourselves. And I think that’s the kind of thing where we think we actually have some meaningful leverage

01:08:44

Will

For that. Yeah. And I think you guys have done quite well. You’re not VCs, but I think you’ve done a pretty good job of spotting opportunities, and I think you’ve got a track record of spotting them before they become quote unquote legible to capital. I think you guys were very early in Anthropic, if I don’t misremember. And I think you guys invested in Anthropic at a time when they actually had trouble raising money, hard as that might be to believe, because you saw something that others didn’t. And I think with us, I mean, three months is a very long time in tech these days. I think today, every single investor is probably lining up to invest in testing companies because it feels so salient with AI code gen. But three months ago, when you guys made this investment, no investor had ever heard of software testing or cared, to a first approximation. And I talked to a lot of people who were like, nobody has ever made a software testing company that has made any money, why do you think you’ll be any different, and who really needed to hear those arguments. Whereas I think you guys spotted that opportunity before the professionals did.

01:09:58

Ron

And it’s worth saying, I think we were interested in and excited about Antithesis and the value it provided independent of the AI angle. I think the AI angle added a lot more, but to some degree, I think we share some of your basic intuition that this stuff has always been important. But it definitely, as a kind of market hypothesis, makes a lot more sense in the present world, where this stuff is becoming more salient because of the challenge of verifying AI-generated code. I’m actually curious how you think about the product really working in this context, because in some ways I think it’s a really good fit and in some ways it’s not quite perfect. One of the critical things you want, both when you’re thinking about RL, where you want to provide feedback to agents as you are training them, and also when you actually try to use this stuff, is reliable feedback on whether the thing that they did is good. But you also want fast feedback. And Antithesis is good at a lot of things, but it’s not super fast, right? When you kick off an Antithesis run to find your bugs, you might come back tomorrow to look at the results.

01:11:07

Will

So I actually think that last … I do think that there are ways it doesn’t fit well, but I think that last thing you said is an unfortunate current limitation that is highly contingent and will not be for long. Basically, Antithesis began with very, very large distributed systems as its bread and butter, and those very large distributed systems tend to just be expensive to run, period. And so there was not tremendous pressure on us to make all of the constant factors of running our software really zippy and snappy. And basically people who were testing this stuff were just okay with getting a relatively slow answer, and so we weren’t under a lot of pressure to do otherwise. As we move beyond distributed systems, which we are doing this year, that equation changes. And I think you are going to see that Antithesis gets way faster at giving results. And we have a lot of really, really cool projects underway that are going to enable that and make that possible. And by the way, I think that even for distributed systems, you might be able to start getting pretty fast results. I don’t think there’s a law of the universe which says you can’t test a distributed system fast. At FoundationDB, we often got good quality answers within minutes or tens of minutes, very thorough answers. Sometimes we’d even find the first bug in less than a minute. And I think that that is totally a thing that you will be getting from Antithesis in the next year or so. So

01:12:42

Ron

What are ways in which, beyond the kind of timescale issues, you think maybe it doesn’t solve all the problems?

01:12:49

Will

Or for AI in particular? Yeah. Yeah. Well, okay, there’s a few things. Let’s talk first about what I consider the most fundamental one and I think the most interesting one. And I don’t think that this is catastrophic, but I think it’s an interesting challenge that everybody who’s doing any kind of autonomous software verification, whether that’s property-based testing or formal methods or code review or whatever, is in my opinion not thinking about. Okay. So code generation tools, code synthesis tools, specification-driven tools like that have existed since way before ChatGPT existed. These have existed for 20, 30 years, and nobody ever used them because they suck. And why did they suck? Basically because they all acted like evil genies. You would say exactly what you wanted the program to do, and the non-LLM program synthesis machine would crank out a program that exactly matched your specification and totally did not do what you wanted it to do. You’ve had experience with this.

01:13:59

Ron

Yeah. I mean, I’ve been sort of paying attention to the program synthesis literature for a long time. And there’s a lot of really great research and a lot of great researchers doing interesting stuff, but remarkably few practical applications of it, and all the things that people work on end up looking mostly like toys. I think maybe the single most successful program synthesis style thing is Microsoft’s Flash Fill in Excel, which is pretty good. But I feel like for all the smart work that’s gone into this stuff, you would expect in some ways to have more practical impact. But the problem is just really hard to do well. And I think one of the reasons why LLMs are better than classic program synthesis is that they are less evil genies. They’re not really specification driven, they’re vibes driven. You say some stuff and it makes some inferences, and there’s a lot of leaning on the priors of the things it’s seen in the past in what it generates, and it’s just optimizing less.

01:14:59

Will

Exactly. Exactly, exactly, exactly.

01:15:01

Ron

And of course the RL process makes it optimize more, right? So you get this whole thing where you have basically eval hacking, where it does whatever it can to try and get the light to turn green. This is a problem with LLMs. It’s a problem with people, right? Sometimes you have some system where you have some checks in place, and a thing we talk about internally is, don’t just play the video game, right? You don’t just try to score. You want to actually do the right thing and use the alerting as a way of understanding what’s going wrong. But if you turn the alerting into the thing that you’re actually optimizing for, very bad stuff happens.

01:15:36

Will

Goodhart’s Law. Yeah. Yes. Yeah. So that’s exactly right. Basically, the reason I personally thought, like a year or two ago, that AI code generation wouldn’t go anywhere was exactly this. I had experience with program synthesis tools. I was like, oh, they’re all evil genies. They suck. I think a lot of people who had experience with these tools had the same kind of reaction. And what we all missed was exactly the thing you just said. LLMs are not … They actually want to make you happy, right?

01:16:06

Ron

For good and ill.

01:16:07

Will

Exactly. They’re like, the sycophancy thing is like, there’s actually a nice flip side to it. They’ve been trained on zillions and zillions of examples of people on Reddit and Stack Overflow being helpful. And then they’ve been RLHF’d by people who reward it for being helpful. And so it actually is kind of trying to write the code that you’re asking for as opposed to write code that fits the specification that you asked for in the least amount of work or whatever. And what happens when you put these things in a loop with something that’s like, eh, no, try again, no, right? It kind of shifts it back into being an evil genie a little bit.

01:16:45

Ron

That’s right. Although to be clear, I think that the people who are doing the training are no fools. And you’ve talked to some people who do this kind of training work, and they pull the system simultaneously in multiple directions. There are things that you do to pull it in the direction of trying to just satisfy the immediate feedback goal, and also things to pull it in the direction of fitting more of the general distribution and not just getting completely twisted out of shape.

01:17:11

Will

Yes. But the problem is that when you’re done training, when you’re actually running this thing, if you run it in a loop, it’s still pushing it back towards being an evil genie, not in terms of shifting its weights and so on, but just in terms of its behavior and what it tries next. I’ve seen this happen even with just very, very, very not sophisticated, not property-based testing. I have Claude Code and I’m like, “Hey, do this thing for me and make sure the tests all pass.” And if the thing is hard and it can’t do it correctly, eventually it deletes the tests or eventually it makes the test pass in some trivial way or in some way that is totally not what I want.

01:17:49

Ron

I do think this is getting a little better, but the phenomenon is still very strong.

01:17:52

Will

Yes. And I basically think that the more powerful and unyielding the validation step is, probably the worse this overall effect gets.

01:18:03

Ron

Yeah. And another, I think, general problem with these issues, we talked before about the kind of functional properties of the software that you’re optimizing for, and then the non-functional properties, like all these kind of architectural and clarity and extensibility properties. And those

01:18:16

Will

Probably get worse.

01:18:17

Ron

Yeah. Right. Because if you look at the agents, their efficacy depends a lot on those non-functional properties. They just do better in contexts where things are tighter and more extensible and easier to understand, and where the systems are fundamentally simpler, but they’re super bad at maintaining those properties. I feel like the C compiler that Anthropic came out with was a really interesting example where they got really far. They built a pretty good compiler. I mean, not actually a good compiler, you wouldn’t want to use it for anything, but it was an impressive technical feat. It’s a little bit like the talking dog. It’s not that what it says is so great, it’s that it talks at all. That they got a compiler to that level is impressive. But a lot of people have focused on, “Oh, it didn’t do any type checking and it didn’t do this and it didn’t do that.” And that’s a little interesting, but the thing I was more struck by was the way in which it ended, and they were unable to make more progress with this team-of-agents approach, because it just started to be the case that as the agents made improvements, they would break other stuff at such a rate that they couldn’t actually make net progress.

01:19:25

Will

Which is an experience every junior engineer has had too. And it’s why things like architecture matter and it’s why things like making your system actually fit together in a minimal and clean way and have concerns be orthogonal and well factored and all that stuff. Yeah.

01:19:41

Ron

It’s just like bringing to life the deconstruction of the non-functional properties of your software. And I think that’s one of the reasons why it seems to me like testing, while still important, just isn’t enough. You still need to think about architecture. You still need to think about the cleanliness of the code and all of that. That’s right. I think you just have to maintain those non-functional

01:20:04

Will

Properties. And it’s possible that if you put an LLM or an agent swarm or something in a loop with a really strong test or a really strong formal verification system or something, it’s just going to make the architecture worse and worse in order to get the test to pass. That seems like a very plausible failure mode.

01:20:22

Ron

Yep. So how do you think about the completeness of Antithesis as an approach? To what degree are you an Antithesis maximalist? I mean, I don’t so much mean Antithesis the product, but the approach. The approach is: we are going to have the ability to do these high-powered, end-to-end randomized tests of our systems, in a way that is very cross-cutting and can check lots of different properties. That’s not the only way to write tests. There’s the classic: I’m going to, at a small scale, write a unit test, which sticks an example in there and sees whether the example behaves in the way that I want. To what degree do you think the Antithesis approach is really the approach that people should be doubling down on, and to what degree do you think we should be throwing many things at the wall?

01:21:10

Will

Yep. So I will first say that I want to dispute the idea that there’s an Antithesis approach. So the thing that we’ve told people, including all of our investors from the start, is that this is not a solutions-based company, it’s a problem space company. Our goal is to make software validation incredibly cheap and easy, like running water, and find all the bugs in all the software by any means necessary. And it just so happens that we thought that the lowest hanging fruit, the best way to start making money and really start making a dent, was to do this deterministic simulation thing and to make that cheap and easy for people to adopt. But that is not the full extent of our ambitions. I kind of dream of a day where software engineers don’t need to know what deterministic simulation or unit testing or formal methods or concolic solving or any of these things are. They just hand their software to a box and get back: it worked, or it didn’t. And obviously there’s going to have to be a lot of very complicated things that happen in order to enable that vision, but that’s the dream. Okay. That said, there’s a reason we started where we did, and it’s that I think we do believe that this technique is uniquely high leverage and a little bit uniquely under-adopted for how high leverage it is. And I have seen both … Okay, so our team is always dogfooding our own product, which is a thing that every team that’s making a developer tool should do, or really any kind of tool. It

01:22:53

Ron

It can be harder if it’s not a tool that you use, right? Developer tools are where it’s easiest.

01:22:56

Will

Yes. And so that’s both fun. I feel like that both shows the power and the limitations of the current basket of tools that we offer to our customers. We have gotten ridiculously far with just doing Antithesis-style deterministic PBT on everything that we write, including UI components, browser-based stuff, including very low level things, just everything. We have entire extremely complex systems that are literally only tested with Antithesis and nothing else, where nobody has written a unit test; one of the policies of that area of the code base is that people don’t write unit tests. You just add more sophistication to the property-based tests to cover whatever you need to cover. And then there’s some parts of our code where I’m like, man, there should just be a unit test here and that would make this a lot more straightforward. And so I feel like this is kind of a wimpy answer to your question, but I kind of feel like there is a line. There is a place at which you should just write the stupid unit test, or you should not use testing at all, you should be using something like proof-based techniques because of the nature of your problem domain, or you should be using exhaustive testing. If your function takes an Int32, you can just try all of them. Won’t take that long.
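Exhaustive testing of the kind Will mentions is easy to sketch. This is a made-up example (the function and property are hypothetical), run over a 16-bit domain just to keep it quick; an Int32 is the same loop over about 4.3 billion cases.

```python
def clamp_to_byte(x: int) -> int:
    """Made-up function under test: clamp a value into 0..255."""
    return min(max(x, 0), 255)

# Exhaustively check properties over the function's entire input domain.
# For a 16-bit input that's 65,536 cases; for an Int32 it's ~4.3 billion,
# which is still feasible for a fast function on modern hardware.
for x in range(0, 1 << 16):
    result = clamp_to_byte(x)
    assert 0 <= result <= 255, f"range property violated at input {x}"
    if 0 <= x <= 255:
        assert result == x, f"identity property violated at input {x}"
```

Unlike example-based or randomized testing, this gives a genuine proof of the property, but only for functions whose input domain is small enough to enumerate.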

01:24:20

Ron

Definitely done that.

01:24:22

Will

So I think that that line does exist. I think it is a lot farther away than most people realize. I think more things are amenable to property-based testing than people think, and that if we can make it easier and more powerful, people will use it in situations where they don’t currently use it.
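A minimal property-based test can even be hand-rolled with the standard library; this toy sketch (the run-length encoder and the properties are invented for illustration) shows the shape of the idea, while real frameworks like Hypothesis add input shrinking and much smarter generation, and Antithesis's approach adds deterministic whole-system exploration on top.

```python
import random
import string

def run_length_encode(s: str) -> list:
    """Toy function under test: classic run-length encoding."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def run_length_decode(runs: list) -> str:
    return "".join(ch * n for ch, n in runs)

def random_string(rng: random.Random, max_len: int = 20) -> str:
    # A small alphabet ("abc") makes repeated runs likely.
    n = rng.randrange(max_len + 1)
    return "".join(rng.choice(string.ascii_lowercase[:3]) for _ in range(n))

# Property: decode(encode(s)) == s for every s. Instead of a few
# hand-picked examples, generate many random inputs; a fixed seed
# keeps the test deterministic and replayable.
rng = random.Random(42)
for _ in range(1000):
    s = random_string(rng)
    assert run_length_decode(run_length_encode(s)) == s, f"round trip failed on {s!r}"
```

The key difference from example-based testing is that you state an invariant over the whole input space rather than checking a handful of specific outputs.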

01:24:39

Ron

Yeah, I think that’s right. And I think your point about it being neglected feels right to me as well: if you’re going to see where you can add a new thing and make a big change, I feel like that’s a natural thing to work on. I do think the other kind of testing is really important. I think there’s a kind of unreasonable effectiveness of example-based testing. In some ways it almost sounds like a comically bad idea: I’m going to have a big complicated program, and then I’m going to test it by writing six examples. But to a surprising degree, for modest complexity things, it actually works super well. And I think it works especially well in code bases that have other good non-functional properties. A thing I’ve long been struck by is the degree to which having a really good and expressive type system that captures a lot of useful properties of your program, together with tests, has a kind of multiplicative effect, where the program kind of snaps into place. You just put your finger on a couple of spots and make sure that the behavior is what you expect it to be, and the kind of analytic continuation of your program, the rest of the behavior, is smooth enough that there’s only one natural thing for it to do, and it just clicks in and does that one thing.

01:25:58

Will

Yes. A thing I’ve said before is, there’s this funny thing about impossibility results, where they often are actually cluing you into a thing that you should really try to do. And the reason is that a lot of impossibility results, and this is true in mathematics, true in computer science, true everywhere, kind of rely on this anti-inductive property. It’s like, I’m going to prove that the thing that you’re trying to do is impossible by constructing a really fiendishly awful example, and, ha, your technique fails here, and I’m going to adapt it based on the technique that you’re bringing. And that’s how impossibility results in mathematics often work, like diagonalization arguments. It’s also true of many famous impossibility results in computer science. And I think what’s significant about this is we’re not trying to find bugs in every random Turing machine, or even in a random Turing machine drawn from the space of all Turing machines. We’re trying to find bugs in software that people write to accomplish business purposes. And that is an infinitesimally small subset of the space of all possible programs. And it’s a really nice one. It’s like smooth functions, or functions that are everywhere differentiable, or something. These are programs that people have built for a reason, and have built so that they can come back and modify them and extend them someday. And I think it just turns out that in that space of programs, testing is actually way more tractable than it would be on a completely random program.

01:27:40

Ron

Yeah, there are tons of things like this. And another fun example from our world is type checking in OCaml and any language in that ecosystem or in that kind of rough space of languages is like doubly exponential. You can write an 18 line program that will not finish type checking until the heat death of the universe, but nobody

01:27:59

Will

Does.

01:27:59

Ron

Nobody does. It turns out those programs don’t make any sense. If you think really hard, you can figure out what those programs are, but they’re not actually a practical part of the things that you run into when you do the real work. And again, I think this behavior of real-world programs being a much smoother, tamer, better behaved subspace is a really important one for lots of engineering questions.

01:28:21

Will

It’s true. Although inside our company we do trollishly have the inside joke: at our last company, we violated the CAP theorem, and at this one we’re violating the halting theorem. So you’re just moving up the hierarchy of theorems.

01:28:35

Ron

Yeah. What’s next? What’s the next theorem to violate?

01:28:36

Will

I don’t know. It’s a good question.

01:28:37

Ron

It’s a good company formation question. So we’ve talked a bunch about the kind of engineering practices you’re trying to create in the outside world, and a little bit about your engineering practices internally, but I’d like to hear a little bit more about that. How does Antithesis operate internally? And I’m kind of curious how that differs from what you guys see on the outside.

01:29:00

Will

Sure. So I think I learned a useful trick from somebody recently, which is when you’re talking about your company’s culture, culture is always a set of trade-offs. There’s no purely good cultural attribute. They’re all just choices on a spectrum and being one thing implies that you are not the good things about the opposite. And so I’m going to try and phrase this in the most edgy way possible maybe. So I think that we generally believe a couple important premises that have led us to pick a pretty weird by outside standards place on a lot of these culture spectrums. I think we believe that for many kinds of projects, the overall cost of the project is dominated by the number of mistakes you make. Big architectural mistakes early on in a project It can just have an exponential effect on the amount of work that it takes to get the project done. I think we also believe that one of the biggest scalability barriers to human organizations is communication, and that one of the things that is worst for communication is lack of trust. Let’s just start there. So given that you believe these things about the world, what would you want your engineering culture to look like? Well, basically we try really, really, really hard to talk a lot about what we are going to do before we do it and to debate multiple possibilities for how we could accomplish some important objective before we go all in on one. And that doesn’t mean that we don’t prototype. Often these discussions do involve people bringing prototypes and showing them to each other and debating the merits of them. But it is basically considered uncouth at Antithesis to be like, “We’re going to do it this way and to not come with two alternatives and then explain why you picked this one over that other one and then explain why you don’t think there’s a great third alternative.” And that I think drives some people completely insane. There’s a lot of people who are just like, “Man, I want to put on my headphones. 
I want to write my code. Leave me alone.” And they just won’t have a great time at Antithesis, where people are going to walk by and be like … And we all work in a big open room, exactly like you guys do here. And people will just come look at your screen and be like, “Hey, why are you doing that?” Which is not a thing that would happen at some other companies I’ve worked at. So we’re highly collaborative, highly deliberative. Collaborative does mean that we’re all in a physical office together for the most part, because adding any friction to communication just means that you get a whole lot less of it. It means that we don’t really care about hierarchy very much. There is hierarchy. Every human society and organization has hierarchy.

01:32:16

Ron

I’ve heard you’re the CEO.

01:32:17

Will

Ah, that’s right. But everybody’s opinions can be questioned and debated. And just because somebody is the big boss of some particular part of our software architecture does not mean that they get to be dictatorial or rule by fiat. People can just come and be like, “I think you’re making a stupid decision and that’s a very normal thing.” And we try to praise people for sticking their necks out and making statements like that.

01:32:45

Ron

Yeah. A lot of this feels very familiar. I think we’ve taken a pretty similar approach. It’s not like the whole big tech thing of you’re an L8 sergeant second class or something, which we just don’t think makes a lot of sense for us. And people have functional titles as someone who’s responsible for a given area or whatever, but there’s no kind of general notion of title that shows up

01:33:09

Will

Somewhere. We’re the same way. And we’re actually debating whether we need to change this at some point. But basically every single person on our engineering team has the same title on their job offer. It’s senior engineer.

01:33:20

Ron

Yeah. For a while, I think for weird legal reasons, we thought we needed two different ones. And for the first two years you were a software engineer and then afterwards you were … But with no internal reference or anyone paying attention to that kind of stuff.

01:33:32

Will

Yeah. So the thing which I should probably not be saying, but it’s true, is we sort of treat titles as tools. So when we’re interacting with the outside world, people can adopt pretty much any title they wish. So it’s like if somebody really needs to get into a conference, suddenly they’re a senior staff engineer, third class, or whatever. Whatever our marketing people decided would be the correct title for you to get into that conference. And people can use whatever titles in their bylines that they think would be most useful, or put on LinkedIn. This is a form of compensation. Please pick your title. But internally, there are no titles.

01:34:08

Ron

Right. And I think part of that is we very much want a culture where the thing that matters is the idea and what’s the actual thing you’re trying to do, and not the particular position and rank. And no culture is perfect. Our culture is certainly far from perfect, and I don’t think this ideal 100% works out in all cases, but I think it’s definitely directionally much more this way here than in lots of other places. And I think it’s actually a little disorienting sometimes for a strong, experienced person who comes from somewhere else, lands at Jane Street, and doesn’t have a rank that helps them navigate. And we have to be much more intentional about trying to get them into the right spot and make sure that people quickly realize that, oh, this actually is a person who’s substantively worth including and listening to in a bunch of different contexts. Because we just don’t have the title tool as a way of making that happen. So you have to use other methods to get people in the right spot.

01:35:03

Will

So how do you guys think about maintaining that as you grow? Because I think this kind of organization is really, really effective and also really hard to preserve if you grow quickly.

01:35:15

Ron

Right. So I think one of the things is, even though it feels kind of quick, we just kind of haven’t grown quickly. We’ve been relatively disciplined about growing at, I don’t know, what feels like a fast pace between 10 and 30%, depending on the year, usually south of 30. And when we’ve been on the upper range of that, we’re like, wow, this is really uncomfortable. We kind of maybe want to slow down a little bit. And we really feel like it’s important to be able to take the time to absorb people into the organization. I don’t know how to run a company where you need to double every year for a few years. It seems terrifying, and it’s just not how we’ve operated. So that’s one thing. We’ve also just been very rigorous about interviewing. I’m just trying to make sure we’re bringing in people who are very good technically, that’s really important, but also who fit in culturally, who are nice and humble and have good second order knowledge and aren’t made super uncomfortable about being wrong because we’re all wrong a lot. You make a lot of mistakes and you want people who are comfortable owning up to those mistakes.

01:36:21

Will

Yeah. We actually deliberately design our interview to try and assess these qualities. That’s a significant part of why it’s set up the way it is.

01:36:29

Ron

Yep. Yeah. No, we have similar things on our side. After some early mistakes based on not understanding this, we realized that you really don’t just want the people who are best at solving the puzzles. Being good at solving puzzles is really good, right? Having high wattage and just being really smart at stuff is good, but you really want to make sure that whoever you’re interviewing, you see how they operate under challenge. Because you’re going to take everyone, and there’s more they can do, and you’re going to keep on asking them to do more until the job is hard, and there’s no end of hard problems to solve. And so you want to see how people operate in that context.

01:37:10

Will

The thing you mentioned of niceness and being good to work with and so on, I think we fully agree with that. And that comes from another sort of fundamental observation about the world, which is that most problems are hard enough that one person alone cannot solve them. And even if they weren’t, the individual value that you bring just by the stuff that you do is, in almost every case, dwarfed by the positive and negative externalities that you cause on the team. You are going to be chatting with your friend or your colleague at lunch and have some good idea that makes their job easier, or you’re going to be mentoring some junior engineer and teaching them some trick that’s going to make them more valuable for the rest of their career. Or conversely, you’re going to be really mean to somebody, and then they’re in a bad mood for the rest of the day and aren’t as productive, and it also just makes the place a less fun place to work. And so that stuff just kind of dominates, actually, when you get to a sufficiently large organization size. And it’s not to say that you can be ineffectual and really nice and have a job.

01:38:17

Ron

You do have to get things done.

01:38:18

Will

There is still a bar. That’s right. Not least because having people around like that is terrible for morale.

01:38:23

Ron

That’s right. Lowers the intellectual density.

01:38:25

Will

That’s exactly right. But it’s sort of like you just need both, and we’re just not going to accept you unless you are both really great on your own and also magnify the abilities of the people around you. Yep.

01:38:38

Ron

Yeah, I think that’s totally true. One point about the idea that externalities really matter: I think that’s true, but I feel like you could take that kind of thinking in the direction of thinking that what really matters is organizational stuff and how things are put together and teams and all that. And I think that stuff is all really important. I also feel like the shape of this business makes very clear to us how amazingly valuable strong individual contributors are. And a lot of that value is the externalities that they have. But individuals, in trading and technology and various other contexts, who are just super good at their job and not kind of built to be large-scale leaders can still be just enormously valuable and enormously well paid, because that kind of individual contribution can just move the needle in a huge way. So it’s both: this kind of collective stuff really matters, but people’s individual power to do amazing things is also super important, and it’s really important to recognize and compensate people for that kind of stuff.

01:39:37

Will

Yeah, totally agree. I think another thing that helps with keeping that kind of environment as you grow is just having strong esprit de corps and a strong sense of yourself as an organization. And I think quirky cultural choices and quirky technology choices actually help with that. I think it makes people hold their heads a little bit higher. It’s like, yeah, I work at Jane Street. I work at Antithesis. It’s like a slightly weird place. People who don’t work here definitely don’t work here. It’s not just like another interchangeable company. And I think that actually makes all of these cultural problems a little bit easier to solve on every dimension.

01:40:13

Ron

Yeah, certainly I like to think so, since I think I’m deeply culpable for our weird choice of programming language. So I hope that has some positive externalities.

01:40:19

Will

There’s actually a really interesting paper I read recently that talks about this in the context of Hasidic Jewish merchants in the New York Diamond District.

01:40:29

Ron

Amazing.

01:40:30

Will

Are you familiar with this? The researcher is named Barak Richman.

01:40:34

Ron

I mean, I am familiar with the stores. I have seen those guys and been in there, but I have not heard about the research.

01:40:40

Will

So they have incredibly low transaction costs with each other. They lend on credit. They don’t require huge amounts of collateral. They don’t sue each other. They are very, very, very low transaction cost. And that is a big part of why they’re so successful. And Richman studies them and basically concludes that a lot of why they have such low transaction costs is because they are clearly not the whole world. They’re clearly an insular group of people who all know each other, who all trust each other, and where leaving that group or joining that group is very expensive. And he basically thinks that that kind of makes all of their economic dealings more efficient and smoother. It’s actually a super interesting paper. Yeah,

01:41:29

Ron

That’s interesting. I do think the high-trust thing matters a lot for us. I do think it reduces the kind of internal transaction costs. It’s kind of easier to get things done. A thing that I’m kind of always worried about, but am still delighted seems to be in place, is that it’s still a place that can pivot quickly. When something different needs to happen, when you realize something new is emerging and we have to change things and move people around and focus less on this and more on that, we’re able to do it in a way that feels generally pretty positive. People who come from other organizations are sometimes … We say, “Oh, we’re reorganizing this area.” And people are like, “There’s a reorg.” And they stiffen up in their chair, and it’s like, “What are you worried about? What’s wrong?” We reorganize stuff all the time. We change where the seats are; it all happens kind of routinely. And then I hear stories about what reorgs are like at various big tech firms, and I’m like, “Oh, now I see what you’re scared of.”

01:42:22

Will

We’ve made two huge pivots in the last two years that I’m actually just tremendously proud of our team for doing, because they both required astonishing levels of intellectual humility and dealing with reality, which is a thing that organizations are usually pretty bad at. The first was basically: we had been in stealth mode doing deep R&D research for five years, and then we came out and started selling it. And at some point we realized that we were still thinking of the world in a very R&D way. In particular, we just were not listening to our customers and did not have the customer service mindset at all, and were really, really bad at listening to their feedback, and were really, really bad at doing what our customers wanted, and that maybe this is not a great property for a company that’s trying to have more customers.

01:43:15

Ron

Yeah, that makes sense.

01:43:17

Will

And so this kind of sense dawned on us, and eventually we were just like, oh, we have to change how we think about everything and how we do everything. And the company all pulled together and we were like, “Okay, we’re going to be different now.” And we did, and we turned on a dime, and I think it went really well. It’s not 100% done, but it’s notably and distinctly different. And the second one was AI, where basically, for most of the last few years, we were kind of like, AI coding is dumb. It doesn’t work. It’s mostly a waste of time; you shouldn’t do it. And then Opus 4.5 came out, and everybody played with it at home, and we were like, “Oh crap, this actually works now.” And again, a lot of places, I think, would have trouble admitting that they had been that wrong about something that important. And instead, the technical leaders at our company, who I respect tremendously, not least for this, were sort of like, “Okay, we were wrong. Let’s deal with the world now; time to change.” And very quickly, everything got reoriented and recalibrated. And I think that’s what it looks like for an organization to be able to adapt to a changing environment.

01:44:38

Ron

I do, by the way, think that was, in some sense, the right pivot point. I kind of feel like we’ve actually been spending an enormous amount of energy building tools and trying to get agentic coding working effectively for a few years now. And I think up until now, it’s kind of been bad. There are a bunch of things for which it’s great, but for the majority of the work you’re doing, doing critical software, I think it had been more likely to slow you down than speed you up. And it’s had this feeling of spending a bunch of time building a boat, and having a sail there, and holding the sail up, and there’s no wind coming. And we get some utility out of it, people use it for some things, the tools get better. But with the recent round of models, from all the vendors actually at this point, the models are much better, and suddenly it feels like there’s wind in the sails. And now it feels like we’re pretty well prepared, have a good team in place, and are able to deliver a lot of value based on this stuff. But there was an awkward period of … I mean, these things are miraculous, but also not super useful, and now they seem both miraculous and useful.

01:45:46

Will

Yep, yep, yep. Yeah. So, I don’t know, I think also on all of this cultural stuff, one of the most important things is just having senior people modeling good behavior. The senior people at the company take great pains to give credit to others, to loudly proclaim when they were wrong or did something dumb, just showing that that is what we do. Everybody is always looking at the implicit … We all have the same title, but you’re looking at the implicit leaders and seeing how they act. And so having them act the way that you want everybody to act is kind of step one. Yeah.

01:46:26

Ron

And I just want to say, don’t give it up. It is possible to maintain at larger scale. I don’t want to say we’ve done all of this perfectly, but it echoes a lot of the kinds of things that you’re talking about. I think we really have been able to keep up with it. By the way, one other thing that has been, I think, important is that the place is designed for long tenures. We just have people who have been around here for a long time. The turnover rate is pretty low, and I think that affects a lot of things about the culture. It keeps a lot of institutional knowledge around, and it helps maintain the culture. I think one of the things about cultures is they’re kind of mysterious. You don’t actually know which parts of it are the ones that are load bearing. And so you want to be very careful about preserving it in a somewhat conservative way. There’s a lot of Chesterton’s fence kind of thinking going on.

01:47:09

Will

You know, that’s why we’re in DC. Everybody always asks me, why on earth did you put an ambitious deep tech company in DC and not the Bay Area? And it’s basically 100% so that we can actually keep people and invest in them for the long term. It’s not just that the Bay Area has tons and tons of competition; it’s that the Bay Area has a meta-culture of job hopping every nine months to get slightly more RSUs. And basically, once every company is in that equilibrium, nobody invests in anybody, and it’s very hard to be the one that stands out and doesn’t act that way. Whereas in DC, people are used to working for the government and working there for 30 years. And so the kind of ambient expectation in the water is, yeah, you’re going to go work somewhere and work there for 30 years. And so we have ridiculously good tenure among our engineers and are able to invest in them. And it’s just way nicer, in my opinion.

01:48:04

Ron

Okay. That’s amazing. Okay, that seems like a great note to end on. Thank you so much. This has been really fun.

01:48:09

Will

This was awesome. Thank you so much for having me.

01:48:12

Ron

You’ll find a complete transcript of the episode along with show notes and links at signalsandthreads.com.