All Episodes

Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.

Performance Engineering on Hard Mode

with Andrew Hunter

Season 3, Episode 3   |   November 28th, 2023

BLURB

Andrew Hunter likes making code go fast. Before joining Jane Street, he worked for seven years at Google on multithreaded architecture, and was a tech lead for tcmalloc, Google’s world-class scalable malloc implementation. In this episode, Andrew and Ron discuss how, paradoxically, in some ways it’s easier to optimize systems at hyperscale because of the impact that even minuscule changes can have. Finding performance wins in trading systems, which operate at a smaller scale, but which have bursty, low-latency workloads, is often trickier. Andrew explains how he approaches the problem, including his favorite profiling techniques and visualization tools; the unique challenges of optimizing OCaml versus C++; and when you should and shouldn’t care about nanoseconds. They also touch on the joys of musical theater, and how to pass an interview when you’re sleep-deprived.

Some links to topics that came up in the discussion:

TRANSCRIPT

00:03

Ron

Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack, from Jane Street. I’m Ron Minsky. It’s my pleasure to introduce Andrew Hunter. Andrew is a software engineer who’s worked here for the last five years and he’s currently working on our market data team. Andrew is, by inclination and background, an expert performance engineer, and performance is really what we’re going to talk about today. Andrew, just to start off, can you tell me a little bit about how you got into performance engineering, and a little bit more about what your path was to getting into software engineering in general?

00:36

Andrew

I can, but I’m just going to be lying to you because the problem is I can give you all sorts of reasons why I do this, but the real reason is just I find it really addictive. It’s just hard for me not to get excited about how systems work. There’s all sorts of reasons why it’s cool or complicated or fun, but I just get this electric high when I make something faster. I guess it’s not a path, right?

00:54

Ron

Right. So I kind of know why you like it, but I’m curious how you got here. One thing that I think of as characteristic of performance engineering and actually lots of different software engineering disciplines is part of how you get really good at them is getting really interested in stuff that’s objectively kind of boring. You get super psyched about the details of how CPUs work and interconnects and compilers and just all sorts of these little pieces and there’s just a lot of knowledge you need to build up over time to be really good at it. And I’m always interested in what people’s paths were, where they built up that kind of knowledge and understanding and background that lets them really dive in deeply in the way that’s necessary.

01:27

Andrew

Well, I think that’s exactly right. I just have to care deeply about all the parts of the board. And the way I got there: there were a couple of interesting times when, in college, for example, I was taking an operating systems class and I realized that the best way to study this and to learn it well was to just go into the details of the code. And it’s like whenever we’re talking about a topic about virtual memory or whatever, I would go look at Linux’s virtual memory implementation, I’d see what it looked like and I’d have more questions. And I just kept asking these questions and I never said, well, that’s out of scope. And I just kept finding these things interesting. And from then I just realized that, like you say, all of these little details matter and if you keep caring about them and you just don’t accept no for an answer, you get pushed towards the places where people really do care about this, which often means performance. And then once you start doing performance you get that little high that I’ve talked about.

02:15

Ron

So in what context did you first experience working in a kind of serious way on performance sensitive systems?

02:22

Andrew

Grad school at the very least, where one of the projects I worked on was this big graph traversal system that was trying to replicate some really weird complicated hardware and do it in software but maintain reasonable levels of performance. And we just had to think really carefully about, okay, how wide is this memory controller? How many accesses can it do in parallel? What happens when one of them stalls? Wait, what does this even mean to have parallel memory accesses? How many cycles do these things take? Because we were roughly trying to replicate this really complicated chip in software, which meant you had to know exactly how would the original hardware have worked and how did all the parts of it that you can replicate in software work and you end up looking up all these bizarre details and you learn so much about it.

02:59

Ron

Is that work that you went in thinking you would spend your time thinking about fancy math-y graph algorithms and ended up spending most of your time thinking about gritty operating system and hardware details?

03:08

Andrew

A little bit. I definitely thought there was going to be a little bit more algorithmic content, but I really rapidly realized that the hard and interesting part here was in fact just, oh God, how do you keep this much stuff in flight? And the hardware has actually gotten way better or more aggressive about these sort of things over time. So I’m glad I learned that.

03:23

Ron

So how did that work in grad school end up leading to you working professionally in this area?

03:28

Andrew

Well, I was interning at Google at the time for the summers and I kind of realized that I could do the same large-scale systems research that I was doing in grad school in a place that just had a lot more scale and a lot more usage of it. A lot of grad school research is done for the point of doing the research, whereas with proper industrial research, the coolest part is it just shows up in production and suddenly people care about it.

03:48

Ron

And this is a challenge for lots of different kinds of academic work where there’s some part of it that really only connects and makes sense at certain scales and a lot of those scales are just only available inside of these very large organizations that are building enormous systems.

04:01

Andrew

Well, that’s true, but I think even more than just the scale issue, it’s the issue of what happens when this actually meets real data. And I think this isn’t just true about performance. One thing I will tell interns, for example: the most common question I get towards the end of an internship is, what’s going to be different when I come back as a full-timer? And what I tell them is that I was shocked the first time that I had a professional project in my first real job and I finished it and we submitted it to the code base and it rolled into production and everyone was using it. And then a couple weeks later I got an IM from somebody in some other group saying, Hey, we’re using your system and it’s behaving weirdly in this way. What do you think about it? And my mental reaction to this was like, “What do you mean? I turned it in, I got an A!” You have to actually keep this going until you can hand it off to someone else or quit, right? Which sounds depressing, but at the same time means you actually see what it does in reality and under fire and then you learn how to make it even better and you get to do something even more complicated and just this virtuous cycle of optimizations that you get, or features, depending on what you’re working on. Right?

04:58

Ron

Yeah. This always strikes me when I go off and give lectures at various universities: it’s really hard to teach software engineering in an academic context because there’s a weird thing where all of the work writing software in a university context is this weird kind of performance art, where you create this piece of software and then, poof, it vanishes like a puff of smoke, and this is just not what the real world is like. Alright, so let’s get back to performance engineering. One thing I’m curious about is you have a bunch of experience at Google thinking about the kind of performance engineering problems you ran into there and also thinking about it here in a number of different spots, and I’m curious how those two different problems feel different to you.

05:36

Andrew

The difference between performance engineering at Google and performance engineering at Jane Street to me is fundamentally one of leverage. The easy thing about performance engineering at a place like Google or any of the other hyperscalers is that they operate with so many machines consuming so many cycles, doing so many things that any optimization that moves the needle on how much CPU you consume is really valuable, which means that it’s very easy to find good targets, it’s very easy to find things that are worth doing and it may be very difficult to fix those problems. You have to think really carefully and you have to understand these systems, but the return on that investment means you can really support a lot of people who just sit there doing nothing else other than how do I make memory allocation faster? How do I make serialization faster? What can I do in my compiler to just optimize code generation and compression and all these things?

There’s actually a really interesting paper. It’s called Profiling a Warehouse-Scale Computer, which looked at, okay, if you just look at all the things that a data center does for one of these hyperscalers, the business logic is really, really, really diverse. Some things are serving videos of cats and some things are doing searches or social networking or whatever. And all of this does different stuff, but it all uses the same infrastructure. And it turns out that the infrastructure is a huge percentage. They coined a term that I like a lot, the data center tax: the 10, 15, 20% of your cycles that you spend on low-level infrastructure that everything uses. And it’s not even that that infrastructure is bad or slow, it’s just that’s the common link that scales. Whereas fixing business logic kind of doesn’t.

07:14

Ron

Right? It takes individual engineer effort on each individual piece of business logic, but everyone’s using the same compiler or one of a small set of compilers. Everyone’s using just a small set of operating systems. And so you grab those lower levels of the infrastructure and you optimize them and you can just improve everyone.

07:31

Andrew

Yeah, that’s exactly right. And you can improve it enough by just moving things by half a percent. Making logging cheaper just pays for itself way, way more easily than it does if you’re not operating at that level of scale, which means that you get this nice cycle where you hit one hotspot and you point another profiler at the system as a whole, you see the next hotspot and you just get better and better at this just by doing the really obvious thing that sits in front of your face. So I like to think of this as easy mode, not because the optimizations are easy or because the work is easy or because it doesn’t take skill, but just because it’s really clear what to do.

08:05

Ron

It is a target rich environment.

08:06

Andrew

It’s a really target rich environment. There’s money falling from the sky if you make something faster.

08:10

Ron

Right? In some sense, this has to do with cost structure. This works best in an organization where a large amount of the costs are going to the actual physical hardware that you’re deploying.

08:19

Andrew

Right. When we say that the business logic doesn’t matter, what it really means is we just don’t really care what you’re working on. You can be serving videos of cats, you can be doing mail, you can do whatever you want. We don’t really have to care. As long as you make something that everyone uses a little bit faster, it’ll pay for itself, because the only things you care about are the number of CPUs you buy, the amount of power you buy, and the number of user queries you service. Those are the only three things that matter.

08:43

Ron

It’s not that the business logic doesn’t matter, and in fact, optimizing the business logic might be the single most impactful thing you can do to improve a given application.

08:51

Andrew

But it’s harder.

08:51

Ron

But the business logic doesn’t matter to you because you are working in the bowels of the ship fixing the things that affect everyone, and it’s kind of impossible for you to at scale improve the business logic. So you are focused on improving the things you can improve.

09:04

Andrew

That’s exactly right.

09:06

Ron

Great. So how’s it different here?

09:08

Andrew

We don’t have that scale. The amount of total compute that we spend is a fair bit of money, but it’s not enough that making things 1% faster matters, which means that the average CPU cycle we spend is just not very important. It’s kind of worthless. If you make logging faster, everyone’s kind of gonna shrug and say like, okay, but that doesn’t move the dial. In fact, a surprising thing is that most of our systems spend most of their CPU time intentionally doing nothing. It is just table stakes in a trading environment that you do user space polling IO, you just sit there spinning in a hard loop on the CPU waiting for a packet to arrive, and most of the time there’s nothing there. So if you point a profiler at a trading system, it’s going to tell you it’s spending 95, 99% of its time doing nothing.
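
As a rough sketch of what that kind of busy-poll loop looks like in OCaml: the poll and handle_packet functions below stand in for whatever kernel-bypass networking library is actually in use, so the names are illustrative only.

(* Busy-poll loop: burn the CPU checking for packets instead of blocking
   in the kernel, so that when a packet does arrive we react immediately
   rather than paying for a wakeup. A sampling profiler pointed at this
   program will report almost all of its time inside [run], even though
   the interesting work is the rare [handle_packet] call. *)
let rec run ~poll ~handle_packet =
  (match poll () with
   | None -> ()                      (* nothing there: the common case *)
   | Some packet -> handle_packet packet);
  run ~poll ~handle_packet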

09:49

Ron

And actually at this point, I want to push back a little bit on this narrative, because when you say most of the systems are doing nothing, it’s not actually most of our systems; we actually have a really diverse set of different programs doing all sorts of different things. But a lot of the work that you’re doing thinking about performance optimization is focused specifically on trading systems. And trading systems have exactly the character of what you’re describing, which is they’re sitting there consuming market data. The whole idea, which you hear about a lot in large-scale, traditional web and tech companies, of trying to build systems that have high utilization, is totally crazy from our perspective. We’re trying to get systems that have low latencies, they need to be able to quickly respond to things and also to be able to perform well when there are bursts of activity. So that means that most of the time when things are quiet, they need to be mostly idle.

10:33

Andrew

Yeah, it’s definitely true that we have easy-mode targets that we care a lot about at Jane Street. A really good example is historical research. If you are trying to run simulations of what some strategy might’ve done over the last 10 years, turns out that’s just a question of throughput. You can pile as much input data as you can on the hopper and you see how many trades fall out the other side in the next 10 seconds, and the faster that gets the happier we are. You can just do the same easy-mode tactics that you would on hyperscalers.

10:58

Ron

But even there, the size of changes we chase there is considerably larger.

11:03

Andrew

Yeah, you don’t care about 1% because it’s not the CPU you care about here, it’s the user latency in some sense, it’s whether or not the user gets a result in an hour or a day or a week.

11:12

Ron

Just to speak for the compilers team: we totally care about a 1% improvement in code generation, but you don’t care about it on its own. You care about it in combination with a bunch of other changes because you want to build up bigger changes out of smaller ones. And if you look at a much larger organization, people are looking for compiler-level improvements that are an order of magnitude smaller than that.

11:31

Andrew

I sometimes have to push people about this. In fact, I sometimes have to say, oh, no, no, no. It’s not that this optimization matters on its own. It’s not that this thing that I did that removes a cache line from each data structure is going to make our trading exponentially faster. It’s that I’m in a long process of pulling out the slack from a system and every time I do this, it gets a little bit better and everything is going to slowly converge to a good state. But it’s hard to get the statistical power to say any of these small changes matter. Sometimes I get pushback from people saying, well, did you measure this? I’m like, no, I didn’t bother. I know it’s below the noise floor, but I also know it’s right. That sort of small change incrementally applied over time is really good. But the hard part about it is you just have to have faith that you’re going to get there and you have to have faith that you know that this is making an improvement or find ways you can test it in isolation. Whereas if you operate at the huge scale that some other people do, you can just look at a change. It can be five bips and you can know like, oh no, that’s really real and it’s actually worth a lot of money on its own.

12:24

Ron

I think, yeah, this is a problem that hits people who are thinking about performance almost everywhere.

12:28

Andrew

It’s kind of funny to me in that a common line of pushback I get from people who are not performance-focused people is like, well, I remember in undergrad when my professor said, well, you should never make a performance change without profiling it, and knowing that it matters. And like, no, no, I actually think that’s wrong. If you know you are working on systems where this is important, you have to have a certain amount of self-discipline and, you know, not where it’s too costly, or where it’s going to make the system more dangerous or riskier or make your life worse, but make efficient choices as a matter of default in your brain.

12:54

Ron

And this is one of the reasons why I think performance engineering depends a lot on intuition and background and experience.

13:00

Andrew

And mechanical sympathy. Knowing that you know deep down what the CPU is actually doing when you compile the code that you’ve got.

13:07

Ron

So actually stop for a second on that word, mechanical sympathy, which is a phrase I really like. Tell me what that phrase means to you.

13:12

Andrew

What that phrase means to me—I think a race car driver invented it actually— is just having an innate knowledge

13:19

Ron

Or maybe not innate, you probably weren’t born with it.

13:23

Andrew

Some people seem to come out of infancy just knowing these things. Did you not read books about CPU architecture to your children?

13:28

Ron

I did not.

13:29

Andrew

What are you even doing?

13:30

Ron

Lambda calculus?

13:31

Andrew

Oh, that tracks. I think that it’s not innate, but this really unconscious knowledge, when you look at code, of how it’s structured on real systems, because different languages have very different models of how reality works, but reality only has one model. As much as I love Lisp, the world and computers are not made of cons cells, they’re made of big arrays—one big array—and some integer arithmetic that loads things from arrays. That’s all a computer actually does, right? And you have to understand what is the model by which we get from that low level thing to my high level types with structure in them. And you have to understand what layouts mean and how this branching structure gets compiled into something that a CPU actually knows how to operate on. And you can’t just construct this from scratch every time you do it. You have to develop an intuition towards looking at something and knowing what that’s going to be.

14:19

Ron

So I asked you what the difference was between the easy-mode performance optimization that you experienced at Google and this kind of harder-to-figure-out version of the story where you don’t have the same kind of scale. I’d love to hear a little bit more about what the texture of these problems is. What kind of problems do you run into? What is interesting and hard about the version of the performance optimization problem you see here?

14:39

Andrew

Hard mode performance optimization typically, but not always, is a question of latency. And latency is a question of what is something doing at a really important time. Not what is something doing in general, not what it does usually, but what it does when you care about it, which means it’s fundamentally a measurement problem, because to measure general performance, what your system is doing, you point a profiler at it, you get an idea of it’s spending 20% of its time here and 10% of its time here and 5% of its time here. I don’t care about any of those percents. I care about what was it doing for the nanosecond, the millisecond, the microsecond sometimes (or, for some of our worst systems, the second) that something interesting was happening. I care about what happens when it was sending an order or analyzing market data. I care only about that and I don’t care about anything else. So how do I even know what it’s doing at that point in time? How do I measure this? That’s really the key question.

15:28

Ron

Got it. And maybe somewhat inherently, that puts you in the opposite situation that you’re in when you’re looking at a very big organization where you’re thinking about the low levels of the infrastructure and how to make them as fast as possible. Because if you want to know what’s happening at a given point in time, you’re somewhat unavoidably tied up in the business logic. You care about what is the thing that happens when the important decision is made and what are the latencies that occur and what are the code paths that drive that latency? Is that a fair description?

15:54

Andrew

Yeah, it’s not universally true. There’s some really interesting cases where the infrastructure rears its ugly head in the middle of stuff you want to be doing otherwise, right? But generally a large part of it is in fact just the business logic of how is this trading system making a decision? And you have to look at that, and that’s what’s happening at the interesting point of time, sort of by definition.

16:11

Ron

So you talked about sampling profilers as one common tool. Can you actually just go in a little more detail of what is a sampling profiler and how does it actually work at a low level?

16:21

Andrew

So there’s a lot of different implementations of this, but the general shape of it is you take a system and you point a tool at it that stops the world every so often. Let’s say every hundred microseconds maybe. And it stops the world and asks, where are you right now? And it looks at where the instruction pointer is, what are we currently executing? And it generally looks at the stack trace, how did you get here? And then it writes this down and it lets the program keep going. And profilers only really differ in how they stop the world and how they write this down. My favorite is the Linux kernel profiler. It’s called perf, and it just uses a bunch of hardware features to get an interrupt at exactly the right moment in time. And then it just very quickly writes down the stack trace in this compressed format. It’s very optimized. And then you take all these stack traces. The profile is really just a list of stack traces and sometimes a little bit of augmented information, but that’s fundamentally the core idea. And then you present it to the user in some way that adds them up. And like I say, the key thing is it tells you, okay, 30% of the stack traces ended in the function foo. That’s a hotspot. You’re spending 30% of your time there.
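
To make the shape of that concrete, here is a toy in-process sampling profiler in OCaml: a SIGPROF timer fires every hundred microseconds of CPU time, the handler records the current call stack, and a report at the end adds up how often each stack was seen. This is only a sketch of the technique, not how perf itself works (perf does this in the kernel, driven by hardware counters).

(* Toy sampling profiler: count how many times each call stack is seen. *)
let samples : (string, int) Hashtbl.t = Hashtbl.create 1024

let record_sample (_ : int) =
  (* Capture up to 32 frames of the current call stack. *)
  let stack =
    Printexc.raw_backtrace_to_string (Printexc.get_callstack 32)
  in
  let count = try Hashtbl.find samples stack with Not_found -> 0 in
  Hashtbl.replace samples stack (count + 1)

let start_sampling () =
  ignore (Sys.signal Sys.sigprof (Sys.Signal_handle record_sample));
  (* Fire SIGPROF every 100 microseconds of CPU time consumed. *)
  ignore
    (Unix.setitimer Unix.ITIMER_PROF
       { Unix.it_interval = 0.0001; it_value = 0.0001 })

let report () =
  Hashtbl.iter
    (fun stack count -> Printf.printf "%d samples in:\n%s\n" count stack)
    samples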

17:19

Ron

But there’s all these different kernel counters that you can use for driving when you’re doing the sampling. How does the choice of kernel counter affect the nature of the information you’re getting out of the profiler?

17:28

Andrew

Yeah, people tend to think about sampling profiles in terms of time, where the counter is just the number of cycles that’s elapsed. But one of the cool things about it is it lets you sample on L2 cache misses or branch prediction misses or any of these weird architectural events. And so you get a profile of when did these interesting things happen? You know each of them is costly and they probably have some cost in cycles, so you can get much more precise measurements. And in particular, the nice thing about it is that, let’s say 10% of your program is slowed down by branch prediction misses. If you just look at the cycles, you’re just going to see like, well, it’s somewhere in this function. If you profile on branch misses, you will see the branch that is hard to predict and you can actually do something about that branch.

18:07

Ron

Got it. So branch mispredictions is one. What’s the next most interesting counter that you might use?

18:12

Andrew

Actually the most interesting thing isn’t even branch prediction, or it isn’t even a hardware counter. The next most interesting thing to profile on is the unit of memory allocation. A lot of allocators, in fact the one we have in OCaml, but also various C++ ones, will let you get a profile – not out of perf but kind of done in software – that tells you where you were allocating memory the most. Because that’s just a common thing that’s very costly, and reducing it can really improve the performance of a system.
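
For OCaml, the tool that fills this role is memtrace, which comes up again later in this conversation. Assuming the memtrace library is linked in, hooking it up is roughly a one-liner; the trace destination comes from an environment variable, so nothing happens when that variable isn’t set.

(* Start recording a (sampled) allocation trace if MEMTRACE is set,
   e.g.  MEMTRACE=./app.ctf ./main.exe
   The resulting file can then be explored in the memtrace viewer. *)
let () = Memtrace.trace_if_requested ()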

18:36

Ron

And this comes down to something that we see in OCaml a lot, which is when we write really, really high performance systems, we often try to bring the amount of heap allocation we do all the way down to zero.

18:45

Andrew

We try.

18:46

Ron

Right, it’s hard to get it all the way down to zero. When something’s misbehaving in a system performance-wise, a relatively common problem is there’s a thing in the hot path that shouldn’t be allocating but is.

18:55

Andrew

Yeah, that’s right. And even a good, optimized C++ system should be spending 5-10% of its time memory allocating, and sometimes you just have to do this, it’s necessary, but maybe you’re allocating twice as much as you really need to be and you can look at a memory profile and take a look at it. It’s important to remember that profiles aren’t just about time. They’re about measuring the overall consumption or use of resources.
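
A tiny illustration of the kind of accidental allocation that shows up in hot paths, with made-up names: the option-returning lookup below allocates a fresh Some block on every hit, while the sentinel version allocates nothing on either path.

(* Prices keyed by symbol, stored as ints (say, fixed-point ticks). *)
let prices : (string, int) Hashtbl.t = Hashtbl.create 64

(* Allocates: every successful call builds a fresh [Some] block. *)
let last_price_opt symbol = Hashtbl.find_opt prices symbol

(* Allocation-free: return a sentinel instead of an option. *)
let no_price = min_int

let last_price symbol =
  match Hashtbl.find prices symbol with
  | p -> p
  | exception Not_found -> no_price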

19:15

Ron

So that’s how perf essentially works to get you the information that you need. But there’s a fundamental trade-off in a tool like perf where it’s sampling information and that itself is essentially a performance optimization. Part of the reason that you sample rather than capturing all of the information is you want to make it fast. You want to not distort the performance behavior of the program by grabbing information out of it. But in some sense there’s a real tension there because you’re saying I don’t want some broad statistical sampling of the overall behavior of my program. I want to know in detail how it’s behaving at the times that matter most.

19:47

Andrew

I think a really instructive example of that was an optimization that I hit on last year where we had a system, it was trying to send some orders and it was doing it in a totally reasonable way and it was doing it in a really cheap way. The thinking about whether or not I want to send an order was really cheap. It happened really fast. The physical sending of the order really cheap, happened really fast.

20:08

Ron

The thinking, what do you mean by the thinking?

20:10

Andrew

The business logic to decide: looking at the markets and saying, oh yeah, I want to buy, and then the physical act of sending the message, right? Both of these were really cheap. If you pointed a profiler at them, even a profiler that was magically restricted to the times of interest, it would tell you, yep, 5% of your time was doing this. It’s all good. That’s not a hotspot. Here’s the problem: the order sending was happening 200 microseconds after the thinking was, and the reason was it was being put on a low priority queue of work to do later. It was a misconfiguration of the system. It was using this kind of older API that it needed for complicated reasons, that did reasonable things under most circumstances. But it assumed that network traffic you wanted to send couldn’t be that latency sensitive. So it just waited to do it in an idle moment, and this was not a good thing to wait on. But the profiler tells you nothing about this, because I didn’t care about the overall cost, I didn’t care about the overall time. I cared that it happened promptly. And so fixing this, again, was really easy. I just switched to an eager API that didn’t wait. But a profiler tells you nothing about this.

21:07

Ron

So what kind of tools do you use to uncover things like that?

21:10

Andrew

Magic Trace.

21:11

Ron

So what’s Magic Trace?

21:12

Andrew

Magic Trace is a tool we wrote that gives you a view into what a system was doing over a short window of the past. It’s retrospective in some sense. And what I mean is that at any point in time you can just yell, stop, tell me what you were doing for the last two milliseconds, three milliseconds maybe. And you write it down. And exactly like you said earlier, this is not a profile, this is not some statistical average of what you’re doing at various times. This is the exact sequence of where your code went. It went from this function to this function to this function, and you get a different visualization of it that just shows you what was happening over time. And in fact, exactly like you say, there’s more overhead for using this, but it gives you this really direct view into what happened at an interesting time that a profiler fundamentally can’t give. Traces are in some sense really better. In fact, traces aren’t restricted to Magic Trace. I said there’s memory profiles that are really useful. Memory allocation traces are another thing that we care about a lot. We have a really good one, in fact, that gives you the literal trace of–

22:06

Ron

Although a memory profiler is actually statistical also, right? That’s a sample profile.

22:09

Andrew

Memory tracing is in fact, it’s not a profiler, it’s a tracer, right?

22:12

Ron

Maybe it’s misnamed, actually. There’s a lot of annoying terminological fuzz around this. I think people

22:16

Andrew

We are not unique in this, I’ll say

22:18

Ron

Right, at least the terms I’ve come to the most are people use “profilers” for what you might call statistical profilers. And then people use “tracing” when they’re talking about capturing all of the data. So a common form of tracing that shows up all over the place is there’s all these nice systems for doing RPC tracing where you write down the details of every message that goes by. And this is sort of a common thing to do if you want to debug, why was that web query slow? And some query came in and it kicked off a bunch of RPCs, it kicked off another bunch of RPCs and you can pull up an API that lets you see the whole cascade of messages that were sent. So that’s like a nice example of tracing. And then we also, as you mentioned, we have a thing called memtrace, which sadly I think is actually a profiler in that it is a statistical sample and does not actually capture everything,

23:01

Andrew

But it does give you a time series of events, which is a key thing that a profiler can’t give you.

23:07

Ron

That’s interesting. I guess in some sense, all of these systems start by giving you a time series of events and then it’s how you process them. Perf is sampling at some rate and grabbing data, and then you turn that into an overall statistical summary. But you could look at that information temporally. You just don’t. And in any case, the information is sampled. I think what I just said is also true about memtrace. You get this information just like perf, sampled randomly from the memory allocations. Then you can interpret it statistically as being about the overall state of the heap.

23:35

Andrew

A key difference here is that memtrace gives you all of the information about a statistical sample of some of the allocations. It tells you when this guy was allocated and then freed.

23:48

Ron

That’s true

23:48

Andrew

Whereas you might get a set of stacks out of perf that says here’s some allocations and here’s some frees, but you have no guarantee it’s the same thing. This lifecycle example, it’s exactly like the RPCs. A thing people frequently do is just capture traces for 1% of their RPCs. And getting the whole lifecycle of an individual one is way more interesting than 1% of the individual moments.

24:09

Ron

I mean, maybe this just highlights this is a total terminological car crash. It’s a little hard to separate it out.

24:13

Andrew

All of these tools are way too hard to use and very inexact and all of them use the same terminology in different ways.

24:18

Ron

Okay, so we’ve talked about the trace part of the name Magic Trace, and the key thing there is it’s not just sampling, it’s giving you the complete summary of the behavior. It’s maybe worth talking a little bit about the magic, which is how do you do this thing? You just said, oh, something interesting happened and then you retrospectively look at what happened before and grab it. How can that be done in an efficient way? What’s mechanically happening here?

24:38

Andrew

Magic. No, there’s two parts to this, right? First is how do you get the data? And the second part is how do you decide when to take the sample? Let’s take those in order. So how do you get the data? Well, there’s this, I’m really just going to call it magic. I don’t know how they managed to do this efficiently, but Intel has this technology, it’s called Processor Trace, that just keeps a ring buffer of everything the CPU does in a really compressed format, like it uses one bit per branch or something along those lines. And it just continually writes this down and the ring buffer is the size that it is and it contains some amount of history—in practice, it’s a couple milliseconds—and at any point in time you just snap your fingers and say, give me that ring buffer.

25:14

Ron

And the critical thing is this is a feature integrated into the hardware.

25:18

Andrew

Yeah, we couldn’t possibly implement this. The kernel couldn’t possibly implement this. This is in the silicon and it’s a huge advantage for Intel processors.

25:26

Ron

Yeah, although I really do not understand Intel’s business strategy around this, which is to say they built this amazing facility in their CPUs, but they never released an open source toolkit that makes it really easy to use. They did a lot of great work at the perf level. So perf has integrated support for Processor Trace. In fact, we relied heavily on that. But I think Magic Trace is the first thing that’s actually a nice usable toolkit built around this. There’s various companies that have built internal versions of this, but it seems like such a great competitive advantage. I’m surprised that Intel hasn’t invested more in building nice, easy to use tools for it, because it’s a real differentiator compared to say AMD chips.

26:01

Andrew

There are a lot of performance analysis tools in the world, and there’s very limited hours in the day. I really don’t feel like I know what everything in the world does, but I generally agree with you that a really underinvested thing is good, easy to use, obvious, idiot-proof APIs, right? And tools that just work in the obvious way you want them to.

26:18

Ron

I was mentioning before how part of being a good performance engineer is building up a lot of experience and knowledge and intuition. Another part of it is just building encyclopedic knowledge of all the bizarre ins and outs of the somewhat awkward tooling for doing performance analysis. Perf is a great tool, in many ways it’s beautifully engineered, but the user experience is designed for experts and the command lines kind of work the way they do. And sometimes their meanings have evolved over time and the flag still says the old thing but you have to know that it has the new meaning. And there’s really, I think a lot of space for building tools that you just turn on and hit the button and it just does the obvious thing for you and gives you the result.

26:55

Andrew

So it’s a user interface built by experts for experts, and I think it’s easy for them to forget that most people don’t have, just like you say, a lot of random esoteric knowledge about what CPUs do. And I also just have a lot of memorized little command lines that I happen to know will point at certain problems. And it’s just an issue of having done this a lot. And I don’t know a good way of teaching this other than getting people to do the reps. A better way would be to give them tools that just give them obvious defaults, but I haven’t figured out how to do this universally.

27:23

Ron

But Magic Trace is one good example of a tool like that, where the defaults have been pretty carefully worked out. The UX is really nice and you can just use it. It’s not perfect. It doesn’t work in all the contexts, but it usually gives you something usable without a lot of thinking.

27:34

Andrew

It is exactly like all the best tools I know in that I’m frequently furious at it for not doing quite the right thing, and that’s a sign of how much I want to use it. I’m just like, oh, can it also do this? Can it also do this? Can it also do this? But yeah, I mostly just use it in the obvious way with the one obvious flag, which gives me what I want to know.

27:50

Ron

One of the interesting things about a tool like Magic Trace is you’ve told a whole narrative of: when doing easy mode optimization, you care about these broad-based things, and sampling profilers are mostly the right tools. When you care about focused, latency oriented performance, then you want these kind of narrowed-in-time analysis tools, and Magic Trace is the right thing for that. But a thing that’s actually struck me from seeing people use Magic Trace inside of the organization is it’s often a better tool for doing easy mode style optimization than perf is, because just the fact that you get all of the data, every single event in order and when it happened, with precision of just like a handful of nanoseconds, makes the results just a lot easier to interpret. I feel like when you get results from perf, there’s a certain amount of thinking and interpretation: how do I infer from this kind of statistical sample of what’s going on what my program was actually probably doing and where really is the hotspot? But with Magic Trace, you can often just see in bright colors exactly what’s happening. I’ve seen people just take a magic trace at a totally random point in the day and use that and be able to learn more from that than they’re able to learn from looking at perf data.

28:59

Andrew

Yeah, this is not universally true, but it’s very frequent that you can get a bunch of extra information. I think one of the really good examples is that a profiler tells you you’re spending 40% of your time in the hotspot of the send order function. But here’s an interesting question. Is that one call to send order that’s taking forever or is that a thousand calls, each of which was cheap, but why are you doing this in a tight loop? And it turns out it’s really easy to make the mistake of calling some function in a tight loop you didn’t intend to. And in a profiler these two things look pretty identical. There are really esoteric tricks you can use to tease this out, but in Magic Trace, you just see, oh God, that’s really obvious. There’s a tight loop where it has a thousand function calls right in front of one another. That’s embarrassing. We should fix it, right? You actually develop this weird intuition for looking at the shape of the trace that you get, the physical shape on your screen and the visualization, and like, oh, that weird tower. I’m clearly calling a recursive function a thousand deep. That doesn’t seem right. You get to see a lot of these things.

29:54

Ron

Yeah, it makes me wonder whether or not there’s more space in the set of tools that we use for turning the dial back and forth between stuff that’s more about getting broad statistical samples and stuff that gives you this more detailed analysis and just more in the way of visualizations, more ways of taking the data and throwing it up in a graph or a picture that gives you more intuition about what’s going on.

30:15

Andrew

A huge fraction of what I do in my day-to-day is visualization work, whether looking at visualizations or trying to build better ones. It’s really important and we have barely scratched the surface of how to visualize even the simplest profiles.

30:26

Ron

Yeah. One thing I’ve heard you rant a lot over time is that probably the most common visualization that people use for analyzing performance is flame graphs. And flame graphs are great. They’re easy to understand, they’re pretty easy to use, but they also drop some important information. And you’re a big advocate of pprof, which is a tool that has a totally different way of visualizing performance data. Can you give a quick testimonial for why you think people should use pprof more?

30:49

Andrew

Yeah. Flame graphs were way better than anything that came before them, by the way. They were a revelation when they were invented, I think. And so this is one of those things where we have to be really careful not to say something is bad. I just think something better has been invented. So it’s called a flame graph because it’s a horizontal line that has a bunch of things going up out of it that look like flames. And what this is is that—

31:05

Ron

And people often use orange and red as the color, so it really looks like flames.

31:09

Andrew

Exactly right, it wouldn’t look as cool if it was green. And so the first level of this is broken down, say, 40% and 60%: 40% of your stack traces start this way and 60% start this way. And then the next level is just each of those gets refined up. And so every stack trace corresponds to an individual peak on this mountain range. And then the width of that peak is how many stack traces looked like that. So this is good. It tells you where is your time going?

31:32

Ron

And one nice property is that at a glance, it makes it really easy to intuitively see the percentages, right? Because the width of the lines as compared to the width of the overall plot gives you the percentage that part of the flame graph is responsible for.

31:43

Andrew

If there’s one really fat mountain that’s 60% wide, you know what you’re doing. It’s that. Here’s the problem with it. It misses a thing I like to call join points, which are points where stack traces start differently and then reach the same important thing. Because what happens is suppose you’ve got 15 or 16 little peaks, none of which is that big, and then right at the tippy top of each of them, in tiny, tiny narrow slivers, it’s all calling the same function. It’d be really easy to dismiss; you don’t even notice the thing at the top, but it turns out if you add them all together, that’s 40% of your time. And different visualizations can really show you everything comes together here.

32:17

Ron

So how does pprof try and deal with this?

32:19

Andrew

So we’re now going to try to proceed to describe a representation of a directed acyclic graph over a podcast. Which, of all the dumb ways people have visualized directed acyclic graphs, might be the worst. But what it does is it draws a little DAG on your screen where each node is a function you end up calling and each arrow is a path. You can imagine that you have one node for every function you’re ever in. And then for each stack trace, you just draw a line through each node in order and they go towards the things that were called. And then you highlight each node with the total percentage of time you spent there and you put some colors on it like you say, and you make the arrows thicker or thinner for bigger or smaller weights. And that’s the basic idea. If you close your eyes, and imagine with me for a second, I claim that what will happen in the scenario I described is that you’ll see a bunch of bushy random small paths at the top of your screen and then a bunch of arrows all converge on one function. And that function now is really obviously the problem.

33:20

Ron

And then underneath it’ll also branch out to lots of different things.

33:23

Andrew

Yeah, that’s actually a really good point. It tells you in fact that maybe it’s not that function where the time is, it’s the things it calls, but now you at least know where all this is coming from. And there’s a single example of this that is the most common thing: at least if you’re working in, say, C++, that function’s always malloc.

33:37

Ron

Oh, interesting.

33:38

Andrew

Like I said, the business logic may be very diverse, but everything allocates memory, and it’s really easy to not realize it, because it’s doing a little bit of malloc here, it’s doing a little bit of malloc here, it’s doing a little bit of malloc here.
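
To make the join point concrete, imagine four call paths in a profile (the function names here are invented for illustration), each accounting for roughly 10% of samples and each bottoming out in malloc:

  parse_fix   -> build_order  -> malloc
  parse_itch  -> update_book  -> malloc
  format_log  -> copy_message -> malloc
  snapshot    -> clone_state  -> malloc

In a flame graph these show up as four modest towers, each with a thin sliver of malloc at its tip, and it is easy to miss that malloc is really a single 40% cost. In a pprof-style graph, all four arrows converge on one malloc node labeled 40%, which is exactly the join point being described.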

33:49

Ron

And I guess this ties a little bit into what you were talking about of that style of trying to look at the foundations and get the common things that everyone sees. It becomes really important to see those common things. Although I would’ve thought one of the ways you could deal with this just with flame graphs is there’s just two different orientations. You can take your flame graph and turn it upside down, and so you can decide which do you want to prioritize? Thinking about the top of the stack or the bottom of the stack.

34:11

Andrew

It’s the bush below malloc that screws you over there. It’s when the thing’s in the middle that it gets problematic. It’s when the really key function is in the middle, where everything converges through one particular point (it’s not a bottleneck exactly, but you can think about it as a bottleneck in the graph), that flame graphs really fail. And to their credit, people have built flame graph libraries where you can say, give me a flame graph that goes up and down from malloc, but you need to know to focus on that function.

34:33

Ron

I see. And so pprof somehow has some way of essentially figuring out what are the natural join points and presenting them.

34:39

Andrew

I think that it outsources that problem to the graph drawing library, which tries various heuristics for how people tend to view graphs.

34:46

Ron

I’ve looked at flame graphs and I’ve looked at pprof visualizations. I do think pprof visualizations are a little bit harder to intuitively grok what’s going on. So I feel like there’s probably some space there yet to improve the visualization to make it a little more intuitively clear.

34:59

Andrew

I definitely agree. I think, just like we were saying earlier, this is one of those things that experts know and new people don’t; you just kind of have to get used to staring at these for a couple days and then you get used to it. But it’d be nice if we didn’t have to. It’d be nice if it was just as obvious as the other things are.

35:12

Ron

So we’ve talked a bunch here about performance engineering tools and measurement tools more than anything else, which I think makes sense. I think the core of performance engineering is really measurement. And we’ve talked about ways in which these focused tracing tools like Magic Trace in some ways can kind of outperform sampling tools for a lot of the problems that we run into. What are the kinds of cases where you think sampling tools are better?

35:33

Andrew

To me, the case I wish I could use sampling the most is once I’ve identified a range of interest. The thing we care about a lot when we think about latency and optimization of trading systems is tails. If your system 99% of the time responds in five microseconds and then 1% of the time responds in a millisecond, that’s not great. You don’t always get a choice of which of those is the one you trade on, right?

35:57

Ron

And also there’s the usual correlation of the busiest times are often the times that are best to be fast at.

36:02

Andrew

That’s right.

36:02

Ron

And so exactly where it’s bad is the case where you care the most.

36:06

Andrew

And Magic Trace is pretty good at finding tails, because you can just do some interesting tricks and hacks to get Magic Trace to sample at a time where you’re in a 99th-percentile tail. Now sometimes you look at those tails in Magic Trace and you see, oh, I stopped the world to do a major GC. I should maybe avoid that. I need to allocate less, or some other bizarre weird event. Sometimes it’s a strange thing that is happening. But a remarkably common pattern we see is that your tails aren’t weird, aren’t interesting, they’re just your medians repeated over and over. And like you said, you’re having a tail because the market is very busy because it’s very interesting. You saw a rapid burst of messages and each of them takes you a microsecond to process, but that’s not the budget you have. You have 800 nanoseconds and you’re just falling behind.

36:50

Ron

It’s just a classic queuing theory result.

36:51

Andrew

It’s nothing but queuing theory.

36:52

Ron

Get lots of data in. You’re going to have large tails when it all piles up in time.
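
The arithmetic is worth spelling out: if messages arrive every 800 nanoseconds during a burst but each takes a microsecond to process, you fall roughly 200 nanoseconds further behind per message; after a burst of 10,000 messages, the last one has been sitting in the queue for about 10,000 × 200 ns = 2 milliseconds, even though every individual message was handled at the usual median speed.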

36:57

Andrew

So you pull up this magic trace and you say, oh, I processed 10,000 packets. What do you do now? And sometimes you can think of obvious solutions. Sometimes you realize, oh, I can know I’m in this case and wait and process all of these in a batch. That’s a great optimization that we do all the time. But sometimes you’re just, oh wow, I just really need to make processing each of these packets 20% faster. And what I really wish you could do is take that magic trace and select a range and say, Hey, show that to me in pprof. Because it’s just like the flame graph: you have all these little tiny chunks in the magic trace and I really want to see them aggregated, and you have some trouble doing it.

37:31

Ron

So I asked one question and I think you answered a different one.

37:33

Andrew

I do that a lot.

37:35

Ron

I asked the question of when do you want sampling? You answered the question of when do you want a profile view? And I totally get that, and that seems like a super natural thing and maybe a nice feature request for the Magic Trace folks, to have a nice way of flipping it into the profile view. But I really wanted to poke at the other question, which is when is sampling itself a better technique versus this kind of exhaustive tracing that Magic Trace is mostly built around?

37:57

Andrew

One easy answer is the easy mode problems we have. Things like historical research, things like training machine learning models, things that really are throughput problems, and we do have throughput problems. And there it’s just easier to look at the sampling profiler, or you can really target it at cache misses or whatever you want. So that’s a case where we definitely want it.

38:15

Ron

And maybe another thing to say about it is it is definitively cheaper. The whole point of sampling is you’re not grabbing the data all the time. I guess we didn’t talk about this explicitly, but Intel processor trace, you turn it on and you end up eating 5-15% or something of the performance of the program. There is a material—

38:31

Andrew

Don’t say that out loud or they’ll stop letting me use it on our trading systems.

38:35

Ron

I mean, it is the case that we don’t just have it turned on everywhere all the time. We turn it on when we want it.

38:40

Andrew

And that’s a thing that the hyperscalers do. They just leave a sampling profiler on across the fleet, just getting 1% of things. And that gives you kind of a great sense of what the overall world is doing. And that is actually a thing I kind of wish we had. It would be less valuable for us than it would be for them. But I would love it if I could just look at a global view of hotspots. I think the best thing in the world would be like, can I get a sampled profile of all the things all of our trading systems did when they weren’t spinning idly? If I could know that, oh, overall I could get a little bit more return from optimizing the order sending code versus the market data parsing code, I think that would be a really valuable thing to me.

39:15

Ron

So another interesting thing about the way in which we approach and think about performance is our choice of programming language. We are not using any of the languages that people typically use for doing this kind of stuff. We’re not programming in C or C++ or Rust. We’re writing all of our systems in OCaml and that changes the structure of the work in some ways. And I’m kind of curious how that feels from your perspective as someone who very much comes from a C++ background and is dropped into weird functional programming land. What have you learned about how we approach these problems? What do you think about the trade-offs here?

39:44

Andrew

Well, the best thing about it is the employment guarantee: anyone can write fast C++, but it takes a real expert to write fast OCaml, right? You can’t fire me.

39:54

Ron

Although I think that’s actually totally not true. I didn’t mean the part about firing you, but the point about writing fast C++, I actually think there’s a kind of naive idea of, oh, anyone can write fast C++. It’s like, oh man, there’s a lot of ways of writing really slow C++. And actually a lot of the things that you need to get right when designing high performance systems are picking the right architecture, the right way of distributing the job over multiple processes, figuring out how to structure the data, structure the process of pulling things in and out. There are lots of stories of people saying, oh, we’ll make it faster in C or C++. And sometimes you implement it in some other language and it can be made faster still because often the design and the details of how you build it can dominate the language choice.

40:30

Andrew

I think it’s really easy for people who are performance obsessed like myself to just get a little too focused on, oh, I’m going to make this function faster. And maybe the better answer is, can we avoid that function being called? Can we not listen to that data source? Can we outsource this to a different process that feeds us interesting information?

40:47

Ron

The single most important thing in performance engineering I think is figuring out what not to do. How do you make the thing that you’re actually doing as minimal as possible? That is job one.

40:56

Andrew

Honestly, I think one of the reasons that I really like performance optimization as a topic to focus on is that I don’t like writing code very much. I’m not very productive. It takes me a long time to do good work. So I want to do the stuff that requires me to write the fewest lines of code and have the biggest impact. This is like one of those hypotheticals: you make a two line change and everything gets 10% faster. The hard part was the three weeks of investigation proving that it was going to work. Right? And I think this is actually a good example of: you really have to think about the whole board. You have to think about how you’re structuring the code and how you’re structuring the system, how many hops this is going to go through, how many systems is it going to go through, how can you get assistance from hardware, any of these things? And micro optimizing the fact that clang has a better loop optimizer than GCC or than the OCaml compiler. It’s really annoying to look at the bad loop. Is that really what’s killing you? No. What’s killing you is that you’re looping over something that you shouldn’t be looking at at all.

41:46

Ron

Right. So there’s a bunch of stuff you just talked about that I would love to talk about more, in fact the hardware stuff in particular. But I don’t want to lose track of the language issue. So I believe what I said, that people often overfocus on the details of the language, but the language does matter, and I think it matters in particular for performance. And I’m kind of curious, what’s your feeling about how that affects how we approach the work, and what your own kind of interaction and engagement with it has been?

42:08

Andrew

I break it down into three categories. The first category in which OCaml provides us a challenge is, I’d call it the most annoying, but the least important. And that’s what I was saying earlier about, oh, our code generation isn’t as good. We’re branching too much. We have too many silly register spills. It’s annoying to look at. It’s really not what’s killing you. I wish it were better. And really the limit isn’t OCaml, the limit is scale. The C compiler people have been working for 30 more years than the OCaml compiler people have, and there’s more people working on optimizing clang right now across the world than we probably have employees at Jane Street. We’re never going to catch up. That’s okay. It’s not really what’s killing you. The second category is things that are maybe more of an actual problem, and hard to deal with, but not really the key issue.

Our memory model requires us to do slightly more expensive things in some cases. A good example is we’re a garbage collected language, and our garbage collector inspects values at runtime. Therefore, uninitialized data can be really problematic. And so we have to do things that feel stupid in my brain, like, oh, it’s really important to null out the pointers in this array and not just leave them behind or they’ll leak, or, you can’t just have an uninitialized array that I promise I’ll get to soon. Because what happens if you GC in that range? And I do actually think this is meaningfully costly in some scenarios, but I’m willing to put up with it. In most cases, there are things you can do about it. The thing that I think is most problematic for our use of a language like OCaml gets back to mechanical sympathy. I said that the world is not made out of the parentheses that Lisp uses, and it’s also not made out of algebraic data types. OCaml’s fundamental representations of the world are very boxy. There’s a lot of pointers. There’s a lot of this object contains a pointer to something else, where in C++ it would just be splatted right there in the middle. And the reasons we do this are reasons that make it easy to write good, clean, safe code, but it is fundamentally costly, and the language, if anything, lacks some mechanical sympathy.
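
To make the GC point concrete, here is a minimal, hypothetical sketch (invented for illustration, not Jane Street code) of the kind of pattern Andrew describes: an object pool backed by an array, where every slot must hold a valid value because the collector scans it, and a slot you are finished with has to be overwritten or the pool quietly keeps the old object alive.

```ocaml
(* Hypothetical example: an array-backed pool of boxed records. Because the
   GC inspects every slot at runtime, the array can never contain
   uninitialized memory, and a slot we are done with must be overwritten or
   it keeps the old record reachable (the "leak" Andrew mentions). *)

type order = { id : int; price : float; qty : int }

let dummy = { id = -1; price = 0.; qty = 0 }

type pool = { slots : order array; mutable used : int }

let create capacity = { slots = Array.make capacity dummy; used = 0 }

let push pool order =
  pool.slots.(pool.used) <- order;
  pool.used <- pool.used + 1

let pop pool =
  pool.used <- pool.used - 1;
  let order = pool.slots.(pool.used) in
  (* Null out the slot: without this line the array still points at [order],
     so the GC can never reclaim it even after the caller drops it. *)
  pool.slots.(pool.used) <- dummy;
  order
```

Note that each element of slots is itself a pointer to a separately boxed record, which is the “boxy” representation Andrew contrasts with a C++ struct laid out flat inside the array.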

44:02

Ron

Right. Or at least it makes it hard to express your mechanical sympathy, because getting control over the low level details is challenging. And I don’t want to go too much into it, but we’re actually doing a lot of work to try and make OCaml better exactly at this. The question I’m more interested in talking about with you is, how do you see us working around these limitations in the language, in the code base that we have?

44:20

Andrew

There’s a couple options here. The first is you can kind of write it the hard way, because OCaml’s a real language. You can write whatever you want. It’s just a question of difficulty. If nothing else, I could in theory allocate a 64 gigabyte int array at the beginning of startup and then just write C in OCaml that manipulates that as memory, right? It would work, it would never GC, it would do all the things you wanted it to. It’d just be miserable. And clearly I’m not going to do that. But given that we’re a company that has a lot of people who care about programming languages, one thing we’re pretty good at is DSLs. And so we have some DSLs, for example, that let you describe a layout, and we’re going to embed this layout into some low level string that doesn’t know a lot, but it’s still, if you glance at it the right way, typesafe, in that the DSL doesn’t let you write out of bounds accesses or anything like this.

45:09

Ron

Right? And the DSL, you sit down and write down what’s the format of a packet that you might get from the NASDAQ exchange. And then it generates some actually relatively reasonable, easy to understand interfaces that are backed by the low level, horrible manipulation of raw memory. And so you write a DSL, you generate some code, and what you surface to the user is a relatively usable thing, but you get the physical behavior that you want with all the flattening and inlining and tight representation of data.
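
As a rough illustration of what such a generated interface might boil down to (the message, field names, and offsets below are invented, not an actual exchange format or the output of any real Jane Street DSL), the accessors are just typed reads at fixed offsets into a flat buffer, with no intermediate OCaml record built per message:

```ocaml
(* Hypothetical sketch of DSL-style generated accessors over a flat message.
   The layout is made up for this example. *)
module Add_order = struct
  type t = bytes                            (* the message is raw memory *)

  let order_id t = Bytes.get_int64_le t 0
  let price    t = Bytes.get_int32_le t 8
  let size     t = Bytes.get_int32_le t 12
  let side     t = if Bytes.get_uint8 t 16 = 0 then `Buy else `Sell
end

(* Reading a field is an offset load into the buffer; the message is never
   unpacked into a separate record on the OCaml heap. *)
let is_large (msg : Add_order.t) = Int32.to_int (Add_order.size msg) > 10_000
```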

45:35

Andrew

You’re hitting on a really good point, that a lot of these originated from our need to parse formats that were given to us. But it turns out you can also just use them for representing your data in memory. I can build a book representation of the state of the market that’s just laid out flatly and packed for me. It’s much less pleasant to use than real OCaml. It’s difficult, and we only do this in the places that it matters, but you can do it. This is what I like to call a dialect of OCaml that we speak in sometimes, and sometimes we gently say it’s zero-alloc OCaml. The most notable thing about it is that it tries to avoid touching the garbage collector, but implied in that zero-alloc dialect are also a lot of representational things. We have little weird corners of the language that are slightly less pleasant to use, but will give you more control over layout and more control over not touching the GC and using malloc instead. And it works pretty well. It’s harder, but you can do it. In the same way, another thing we think about a lot is interoperability. Again, sort of out of necessity: there are libraries we have to interact with that only work in C. So we have these little C stubs that we can call into and it’s really cheap. It’s not like Java. It’s not like one of those languages where there’s this huge costly process for going cross language. You just make a function call and it just works, right?
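
As a minimal sketch of what the C interop looks like on the OCaml side (the stub name and its behavior here are invented for illustration), an external declaration binds an OCaml name to a C function, and the [@@noalloc] attribute tells the compiler the C code neither allocates nor raises, so the call compiles down to essentially a plain function call:

```ocaml
(* Hypothetical stub: the C side would be an ordinary C function named
   "example_checksum_stub", written against OCaml's caml/mlvalues.h API. *)
external checksum : Bytes.t -> int -> int = "example_checksum_stub" [@@noalloc]

(* Calling into C reads like calling any other OCaml function. *)
let looks_valid packet = checksum packet (Bytes.length packet) = 0
```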

46:40

Ron

Yeah. The overhead for a function call to C, I think, is at the least, I dunno, three or four nanos. And I think in Java it’s like 300 or 400 nanos, because JNI is a beast for reasons I’ve never understood.

46:50

Andrew

Option two is cry a little bit and deal with it. Yeah, we fight a fundamental disadvantage. We’re working on reducing it. I’m super excited about getting more control over the layout of OCaml types. To me this is maybe the biggest change that will ever happen in the compiler: being able to write down a representation of memory that is what I want it to be, in a real OCaml type that is fun to play with. But fundamentally we’re kind of at a disadvantage and we just have to work hard, and we have to think more about, okay, we’re going to have a higher cache footprint. What does this mean about our architecture? How can we get cache from other places? How can we spread out the job across more steps, more processes, pre-process things in one place? It gets back to, you don’t want to focus on over-optimizing this one function. You want to make your overall architecture do the right things, and that just informs infrastructural changes.

47:33

Ron

And I think you make an important point that it’s not that any of the optimizations you want to do are impossible, it’s that they’re a little bit more awkward than you would like them to be and you have to do a little extra work to get them to happen. And that means, fundamentally, that we don’t always do them. And so we really do pay a cost in performance there: the harder you make it for people to do the right thing, the less it happens.

47:53

Andrew

One of the hardest things to learn when you’re doing this sort of work is discipline. I have to go through the code base every day and say, no, I’m not fixing that. Yes, it’s offensive to me on a personal level that it’s slow and it allocates and does these things, but it just doesn’t matter. It’s legitimately hard for me not to stop whatever I’m doing and just fix this optimization that I know is sitting there. If this doesn’t bother you on a fundamental physical level, I just don’t understand.

48:17

Ron

But you have to prioritize.

48:18

Andrew

You have to prioritize. There are so many more important things to be doing.

48:21

Ron

So another thing I’m wondering about is how you think about the role of hardware in all of this. In some sense, if you’re thinking about making things as low latency as possible, why do we even bother with a CPU? Right? You look at the basic latency of consuming and emitting a packet, and on any ordinary computer you’re going to cross twice over the PCI Express bus, and it’s going to cost you about 400 nanos each way. Between that and a little bit of slop between the pieces, it’s kind of hard to get under a mic, really, for anything where you’re like, I’m going to consume some data off the network, do something, and respond to it. And on an FPGA attached to a NIC, you can write a hardware design that can turn around a packet in under a hundred nanoseconds. So there’s an order of magnitude improvement that’s just physically impossible to get to with a computer architecture. And so in some sense, if all you cared about is, well, I just want the absolute lowest latency thing possible, it’s like, why are we even using CPUs at all? So how do you think about the role of hardware, and how it integrates into how you think about performance in the context of building these kinds of systems?

49:18

Andrew

It informs the architecture you choose. Yeah, nothing’s ever going to be as fast as hardware, but it’s really hard to write hardware. It can’t do complicated things. And even the things it can do are just exponentially harder to write. I have never in my life written Verilog, which feels like a personal sin. I am reliably informed that it is miserable and unpleasant and your compiler takes 24 hours to run. So we have a lot of strategies with really complicated logic, and that logic is important and it’s valuable. And implementing that in hardware is, I’m just going to say, flatly impossible. You couldn’t do it. And so the question becomes, what can you outsource to hardware that is easy? How do you architect your system so that you can do the really, really hyper-focused speed things in a simple, simple hardware system that only does one thing, and you feed that hardware the right way, but the rest of the software system still needs to be fast? It has to be fast on a different scale, but it turns out there are optimizations that matter on roughly every single timescale you can imagine. We have trades at this firm that, like you say, complete in less than a hundred nanoseconds or you might not even bother. We also have trades where we send someone an email and the next day you get back a fill, right? And at every level in between, it turns out you can do useful optimization work.

50:32

Ron

And even with stuff that has no humans in the loop, we really do think about nanoseconds, microseconds, and milliseconds. Depending on what you’re doing and how complicated it is, you really do care about many different orders of magnitude.

50:43

Andrew

Yeah. There’s a system that I’ve worked on where our real goal is to get it down from having 50 millisecond tails to one millisecond tails, and we celebrate when we get there, and it still does a lot of great trading. We have other systems that are doing simpler, more speed-competitive things, where your software needs to be 20 microseconds or 10 microseconds or five microseconds. And that’s achievable. It’s harder, and you have to do simpler things, just like with the hardware, but it’s achievable, and you care about both of these latencies. And I think another good thing to point out is that you said systems that don’t interact with humans, but it turns out some of the most important latencies are in the systems that do interact with humans. I don’t know about you, Ron, but when my editor freezes up for five seconds while I’m typing, I just want to put a keyboard through the window.

It just drives me nuts. And putting aside the aggravation, human-responsive systems are just really important too, both when you’re actively trading by hand and you want good latency on the thing that’s displaying the prices in front of you. That matters a lot, but also I think it matters a lot for just your ability to adapt and improve your systems over time. I said earlier, think about historical research. That’s a throughput problem, but it’s also a latency problem on a human scale. A thing that will give you feedback in a minute on whether your trading idea was good is worth so much more than one that’ll give you an idea in a day of whether it’s worth anything.

52:00

Ron

Yeah, that’s absolutely right. I think for lots of different creative endeavors, and I think trading and software engineering both count from my perspective, and also all sorts of different kinds of research, what matters is the speed of that interactive loop of, I have an idea, I try the idea, and I get feedback on how well that idea works out. The faster you can make that loop, the more people can experiment with new things, the more the creative juices can get flowing, and the more ideas you can create and try out and evaluate.

52:25

Andrew

The thing I’m obsessed with telling people about is this Air Force colonel from the fifties or the sixties, his name was John Boyd, and he invented this idea called the OODA loop. It’s observe, orient, decide, act, just like the four stages you go through in figuring out, oh, I see something. I think about it. I decide how I’m going to adjust to that. I implement the adjustment. And the faster this loop happens, the more control you have over the system and the better you can iterate. I think a great example of this outside software, oddly enough, is whiskey. I like bourbon a lot. And it turns out that to make good bourbon takes 5, 7, 10 years. And so you don’t get a lot of opportunity to iterate. And there are some people who are doing a really controversial thing, which is they’re using technology to rapidly age spirits.

And some people call this sacrilege, and it’s never going to be quite as good as doing it the hard way. But on the other hand, it lets you taste it in a month and go, I think I’m going to change this, and they’re going to get 12 iterations in the time someone else might get one. And I just think this sort of process turns out to matter a lot in software too, of being able to rapidly iterate on what you’re doing and get feedback, either on the quality of the trading or for that matter on the quality of the performance. One of the things I really care about a lot is building systems that let me really quickly evaluate the performance of a system. There’s a huge difference between “I think this is going to be faster, I did a profile, I know this is a hotspot, I’ve made it better, okay, I’ll run it in prod for a couple days” and “okay, I’ve made this change, I think it’s going to be better, I’m going to run it on this really realistic testbed, I know in 10 minutes if it’s better, and I can change it and try it again.”

53:51

Ron

Yeah, iteration speed matters almost everywhere you are. And I really like your point about this kind of performance analysis mattering for trading systems and also mattering for systems with human interaction. And actually I feel like the performance mindset isn’t really so different. You look at someone who’s really good at thinking hard about and optimizing the performance of stuff in the browser. There’s a very different instruction set and, oh my God, is that a weird and complicated virtual machine. But a lot of the same intuitions and habits of mind apply, and that focus on being really interested in details that are really boring. All of that really matters a lot. You really have to care about all the kind of gory details of the guts of these things to do a really good job of digging in.

54:30

Andrew

And this is why I kind of wonder if there’s just a mindset that can’t be trained. You have to just look at this and go, what the hell are you talking about, Ron, this isn’t boring. I get why you say that, but I just look at this stuff and go like, you don’t have to pay me to look at this. Sorry, I take that back. You do have to pay me to look at this. I would not do this for free. I promise.

54:49

Ron

Boring in quotes. I love this stuff too, and I totally understand why, but from the outside it is. It’s a little hard to explain to your friends and family why you like this stuff. You can’t even explain out loud, in words, what the details are that are going on. People will fall asleep.

55:00

Andrew

Do you know, my dad once sat me down in college and said, are you sure you want to do this CS thing and not go into something where you can find a job? Like being a lawyer? I’m the only person who disappointed his parents by not becoming an English major.

55:12

Ron

Well, maybe that’s a good point to end it on. Thanks so much for joining me. This has been great.

55:16

Andrew

Thanks for having me on. This was a really good talk.