Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.
Electronic exchanges like Nasdaq need to handle a staggering number of transactions every second. To keep up, they rely on two deceptively simple-sounding concepts: single-threaded programs and multicast networking. In this episode, Ron speaks with Brian Nigito, a 20-year industry veteran who helped build some of the earliest electronic exchanges, about the tradeoffs that led to the architecture we have today, and how modern exchanges use these straightforward building blocks to achieve blindingly fast performance at scale.
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky.
Today I’m going to have a conversation with Brian Nigito, essentially, about the technological underpinnings of the financial markets, and some of the ways in which those underpinnings differ from what you might expect if you’re used to things like the open Internet and the way in which cloud infrastructures work. We’re going to talk about a lot of things, but there’s going to be a lot of focus on networking, and some of the technologies at that level, including things like IP multicast.
Brian Nigito is a great person to have this conversation with because he has a deep and long history with the financial markets. He’s worked in the markets for 20 years, and some of that time he spent working at the exchange level, where he did a lot of the foundational work that led to the modern exchange architectures that we know today. He’s also worked on the side of various different trading firms. For the last eight years, Brian’s been working here at Jane Street and his work here has covered a lot of different areas, but today, he spends a lot of time thinking about high performance, low latency, and especially network level stuff.
So let’s dive in. I think one thing that I’m very sensitive to is a lot of the people who are listening don’t know a ton about the financial markets and how they work. And so just to get started, Brian can give a fairly basic explanation of what an exchange is.
I think when you hear about an exchange, you can think of lots of different kinds of marketplaces. But when we talk about an exchange, we’re talking about a formal securities exchange. And these are the exchanges that the SEC regulates, and they meet all of the rules necessary to allow people to trade in securities. So when we use that loosely, yeah, it’s pretty different than your average flea market. Supposed to be anyway.
That’s obviously a function which, once upon a time, was done with physical people in the same location, right? Those got moved into more formal, more organized exchanges with more electronic support. Then eventually there’s this kind of transformation that’s happened essentially over the last 20 years, where the human element has changed an enormous amount. Now, humans are obviously deeply involved in what’s going on, but the humans are almost all outside of the exchange and the exchange itself has become essentially a kind of purely electronic medium.
Yeah, it’s a really interesting story because you have examples of communication technologies and electronic trading going back to the late ’60s, but probably more mid-‘70s (I’m being a little loose with dates), but it was kind of always present. But the rule set was not designed to force people to operate at the kinds of timescales that electronic systems would cause you to operate at. It was rather forgiving. So you know, if somebody on the floor didn’t want to deal with the electronic exchange, the electronic exchange had to wait. Over the past 10 to 15 years, that’s kind of flipped and so generally, we favor always-accessible electronic quotations.
To step back a little bit: the exchanges are the places for people to meet and trade, as you said, to advertise their prices and for people to transact with each other. Other than people who are buying and selling, who are the other people who interact at the exchange level, what are the other kinds of entities that get hooked in there?
So you have, obviously, the entities who either in their own capacity or on behalf of other people are transacting securities, but then you have financial institutions that are clearing and guaranteeing those trades, providing some of the capital or leverage to the participants who are trading, and they obviously want to know what’s going on there. You have other exchanges, because the rule set requires the exchanges to respect each other’s quotations – in this odd way, there’s a web where the exchanges are customers of each other. And you may also have various kinds of market data providers. So those quotes that reflect the activity on the exchange are eventually making their way all the way down to what you might see scrolling on the bottom of the television or your brokerage screen or financial news website, etc. I guess they even make it all the way down to the printed page when the Wall Street Journal prints transaction prices.
So what does this look like at a more systems level? What are the messages the different participants are sending back and forth?
The most primitive sorts of things are orders, or instructions – there are other platforms where we have quotes, and we may use that loosely, but we’ll just say orders. An order would just say that I would like to buy or sell, let’s say, a specific stock, and I’d like to do so at no worse than this price, and for no more than this quantity. That may mean I could get filled at a slightly better price than that, I could get filled for less [quantity] than that, or I could get filled not at all.
That order could basically check the book immediately and then come right back to me if there’s nothing to be done. Or it can rest there for some nonzero amount of time where it could advertise and other people may see it and choose to interact with it. And then obviously, I can withdraw that interest or cancel it. So when we talk about orders or cancels, those go hand in hand.
Finally, there are execution messages: if you and I intersect in our interest – I want to buy, you want to sell, or vice versa – then the exchange is going to generate an execution to you and to me, saying that that happened and the terms of that trade.
One of the key properties here is that you have a fairly simple core set of messages, this basic data structure at the heart of it called “the book,” which is the set of orders that have not been satisfied. And then people can send messages to add orders and remove orders. And then if two orders cross, if there are two orders that are compatible, where a trade can go up, then an execution occurs and the information flows out.
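As a concrete illustration, here is a minimal toy sketch of that core “book” data structure: resting orders keyed by ID, a cancel that removes one, and an incoming order that executes against any resting interest it crosses. The names and structure are made up for illustration; a real matching engine is far more involved (price-time priority, order types, and so on).

```python
# Toy limit-order book: a hypothetical sketch of the "book" described above,
# not anything an actual exchange runs.
from collections import namedtuple

Order = namedtuple("Order", "order_id side price qty")        # side: "buy" or "sell"
Execution = namedtuple("Execution", "resting_id incoming_id price qty")

class Book:
    def __init__(self):
        self.orders = {}  # order_id -> Order, the open ("resting") interest

    def cancel(self, order_id):
        return self.orders.pop(order_id, None)

    def add(self, order):
        """Match an incoming order against the book; rest any remainder."""
        executions = []
        remaining = order.qty
        # Resting orders on the other side whose price is compatible, best price first.
        contra = sorted(
            (o for o in self.orders.values()
             if o.side != order.side
             and (o.price <= order.price if order.side == "buy" else o.price >= order.price)),
            key=lambda o: o.price,
            reverse=(order.side == "sell"))
        for resting in contra:
            if remaining == 0:
                break
            traded = min(remaining, resting.qty)
            executions.append(Execution(resting.order_id, order.order_id, resting.price, traded))
            remaining -= traded
            if traded == resting.qty:
                del self.orders[resting.order_id]
            else:
                self.orders[resting.order_id] = resting._replace(qty=resting.qty - traded)
        if remaining > 0:
            self.orders[order.order_id] = order._replace(qty=remaining)
        return executions  # these flow out to both sides, and to market data
```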
There’s a fairly simple core machine at the heart of it but then lots of different players who want different subsets of information for different purposes. There are people who are themselves trading, who want to see, of course, their own activity and all the detail about that. And they also want to see what we call “market data,” that kind of public, anonymized version of the trading activity, so you can see what prices are out there, advertised for you to go and transact against.
In the end, you need to build a machine that’s capable of running this core engine, doing it at sufficient speed, doing it reliably enough… Maybe a thing that’s not apparent if you haven’t thought about it is: there’s like a disturbing dizzying amount of money at stake. And oh my God, you do not want to lose track of the transactions, right? If you say like, “Oh, you guys did this trade,” and then you forget about it and don’t report it, or you report to one side and not the other, terrible things happen. So reliability is a key thing.
Yeah and I think to go back, there’s lots of different consumers, lots of different participants. And I think the key word there is there’s lots of competing participants. So one thing you didn’t mention in there is disseminating all that information fairly. So trying to get it to everybody at the same time is a real challenge, and one that participants are studying very, very carefully, looking for any advantage they can get technologically, within the rule set, etc. So that extra layer of competition sort of makes the problem a little more complicated, and a little more challenging.
This fairness issue is one that you’ve seen from the inside working on early exchange infrastructure at Island and Instinet, which eventually became the technology that NASDAQ is built on. Early on, you guys built an infrastructure that I think didn’t have all the fairness guarantees that modern exchanges have today. Can you say more about how that actually plays out in practice?
When working on the Island system, it was very close, originally, to sort of “fair” in that you had the same physical machines, you had an underlying delivery mechanism, which we’ll talk about, that was very fair at getting it to those individual machines. And then you were sending copies of orders or instructions after going through one application to everyone. So you were all passing through about the exact same amount of work and about the exact same number of devices. But it was actually very inefficient, we were using thousands of machines that were mostly idle.
So once we started trying to handle multiple clients on a single machine, it exposed some obvious and silly problems. In the naive implementation, people would connect, we would collect all of those connections, and then when we had a message, we would send it on the connections serially, often in the order in which people connected. Well, that immediately led to thousands of messages per second before the exchange opened, as people tried to be the very first connection in that line.
Then you start sort of round-robining, so you know, you start from one and then the next time around, you start from two etc, etc, to try to randomize this. And then you had people who were connecting to as many different ports as they could and taking the fastest of each one. And so these incentives are very, very strong, and we’d like to use machines to their fullest, but to literally provide each participant their own unique machine for each connection starts to get ridiculous as well.
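The rotating-start fan-out Brian describes is simple to picture in code. This is a hypothetical sketch, not the actual Island implementation:

```python
# Hypothetical sketch of a rotating-start ("round-robin") fan-out: rather than
# always writing to subscribers in the order they connected, rotate the
# starting index on every message so no connection is systematically first.
class Fanout:
    def __init__(self):
        self.connections = []   # kept in the order clients connected
        self.start = 0          # rotates by one on every message

    def publish(self, message, send):
        n = len(self.connections)
        for i in range(n):
            conn = self.connections[(self.start + i) % n]
            send(conn, message)  # e.g. conn.sendall(message) for a TCP socket
        self.start = (self.start + 1) % n if n else 0
```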
Where did that lead you? How did you end up resolving that problem?
A lot of these were TCP protocols. In those days, we actually had a decent number of people connecting over the open Internet. I don’t think we provided trading services directly over the open Internet, but we did actually provide market data that way. And TCP is probably your only reasonable option over something like the Internet. But once you started moving towards colocation, and towards more private networks, where people’s machines were in the same data center, and really only two or three network devices away from the publishing machine, it became a lot more feasible to start using different forms of networking, unreliable networking, UDP, and that leads you to something called multicast where rather than you sending multiple copies of the message to n people, you send one copy that you allow the network infrastructure to copy and deliver electrically much more deterministically and quickly.
For someone who’s less familiar with the low-level networking story, give a quick tour of the different options you have in terms of how you get a packet of data from one computer to another.
The Internet and Ethernet protocols are generally a series of layers. And at the lowest layer, we have an unreliable, best-effort service to deliver a packet’s worth of data. And it’s sort of a one-shot thing, more or less point-to-point, from this machine to some destination address. Then we build services on top of that that make it reliable by sequencing the data – attaching sequence numbers so we know the original order that was intended – and having a system of retransmissions, measuring the average round trip time, probabilistically guessing whether packets are lost, etc, etc. That all gets built up into a fairly complex protocol that most of the Internet uses, TCP. Maybe not all, and there are some people pushing for future extensions to that, but by and large, I’d say that the vast majority of reliable, in-order, connected data over the Internet is sent via TCP.
TCP assumes that there’s one sender, and one receiver and it has unique sequence numbers for each of those connections. So I really can’t show the same data to multiple participants, I actually have to write a unique copy to each participant. UDP is a much lighter layer on top of the underlying raw transport, still unreliable but with a little bit of routing information. That protocol has some features where you can say I want to direct this to a specific participant, or a virtual participant, which the network could interpret as a group. Machines can join that group and then that same message can be delivered by network hardware to all the interested parties.
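To make the mechanics concrete, here’s a minimal sketch of subscribing to a multicast group with the standard sockets API. The group address and port are made up; the point is that a publisher sends one datagram to the group address and the network delivers it to everyone who has joined.

```python
# Minimal multicast receiver sketch using the standard sockets API.
# The group address and port are hypothetical, chosen only for illustration.
import socket
import struct

GROUP, PORT = "239.1.2.3", 31337   # made-up multicast group and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Ask the kernel (and, via IGMP, the switches) to deliver this group to us.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# A publisher, by contrast, just sends one datagram to the group:
#   sock.sendto(payload, (GROUP, PORT))

while True:
    data, sender = sock.recvfrom(65535)
    # Each datagram arrives unreliably and possibly out of order; higher layers
    # (sequence numbers, retransmit servers) deal with that, as discussed later.
    print(len(data), "bytes from", sender)
```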
One of the key features there which I think is maybe not obvious – why would I prefer multicast over unicast? Why is it better for me to send one copy to the switch that then sends a bunch of copies to a bunch of different recipients, versus me just sending a bunch of individual copies on my own. What’s the advantage of baking this into the actual switch infrastructure?
The switches are very fast and deterministic about how they do that. And because of, I think, their usage in the industry, they’ve gotten faster and more deterministic. So they can just electrically repeat those bits simultaneously to 48 ports, or whatever that switch might have. And that’s just going to be much faster and more regular than you trying to do it on a general purpose server, where you might be writing multiple copies to multiple places – you really can’t compare the two.
One of the key advantages of using switches is that the switches are doing the copying in specialized hardware, which is just fundamentally faster than you can do on your own machine.
Also there’s a distributed component of this. When you make a multicast stream available, there’s this distributed algorithm that runs on the switches where it learns essentially what we call the “multicast tree.” At each layer, each switch knows to what other switches it needs to forward packets, and then those switches know which port they need to forward packets to. That gives you the ability to kind of distribute the job of doing the copying. So if you have like 12 recipients in some distant network, you can send one copy to the local switch, and then the final copying happens at the last layer, at the place where it’s most efficient. That’s the fundamental magic trick that multicast is providing for you.
And the networks were even simpler then – the very first versions we were using weren’t even using multicast. We were using something called broadcast, which just basically said, “anything you get, I want you to repeat everywhere.” It’s funny because you could imagine that you could certainly overwhelm a network that way. And a large part of the uncertainty and the variation that comes from TCP is these self-learning algorithms that are very concerned about network health. When we would work with the Linux kernel maintainers and have questions about variability that we saw, they would say, “Well, you shouldn’t be using TCP. If you care about latency, you shouldn’t be using TCP. TCP makes trade offs all the time, for network health, and so on and so forth. And for the Internet that is absolutely necessary. And if you really have these super tight requirements, and you really want them to get there fast, and you have a controlled network with very little packet loss, and very few layers between participants, you should be using UDP.” And they were probably right. We mostly do nowadays for this stuff, but it took a while to get there.
And they were right in a way that was kind of totally unactionable. Which is to say, there are a bunch of standardized protocols about how you communicate when you’re sending orders. Another thing to say about the trading world is if you step back and look at how the protocols we use are broken up, there are two kinds of primary data flows that a trading firm encounters, at least when we’re talking to an exchange. There is the order flow connection, where we send our specific orders and our specific cancels and see the specific responses to those, and that is almost always done on a TCP connection; then there is the receipt of market data, where everyone needs to see exactly the same anonymized stream of data, and that’s almost always done through multicast.
So there is part of the data which is done via UDP in the way the Linux kernel developers would recommend, and there’s part of the data flow that’s still to this day, done under TCP. I think the difference is, we no longer use the open Internet in the way that we once did. I think there’s been this transformation where instead of sending things up to the trunk and having things routed around the big bucket of the open Internet, trading firms will typically have lots in the way of colocation sites where they will put some of their servers very near to the switches that the exchange manages, and they will have what we call cross-connects: We will connect our switch to their switch and then bridge between the two networks and deliver multicast across these kind of local area networks that are very tightly controlled, that have very low rates of message loss. So in some sense, we’re running these things over a very different network environment than the one that most of the world uses.
Yeah, a couple interesting observations: It means that colocation makes competition between professional participants more fair – it enables us to use these kinds of technologies, whereas, without colocation, you have less control over how people are reaching you, and you end up with probably more variation between participants.
I think it’s also worth saying that a lot of things we’re talking about are a little bit skewed towards US equities and equities generally. There’s lots of other trading protocols that are a little bit more bilateral, there isn’t like a single price that everybody observes. You know, in currencies often people show different prices to different people, there’s RFQ workflows in fixed income and somewhat in equities and ETFs. But by and large, probably the vast majority of the messages generated, look a bit like this, with shared public market data that’s anonymized but viewable by everyone, and then the private stream, as you say, of your specific transactions and your specific involvement.
I think from a perspective of what was going on, you know, 15 years ago, I feel like the obvious feeling was, Well, yeah, the equity markets are becoming more electronic, and more uniform, and operating this way with this kind of central open exchange, and not much in the way of bilateral trading relationships, and surely this is the way of the future and everything else is going to become like this. No, actually, the world is way more complicated than that. And currencies and fixed income and various other parts of the world just have not become that same thing.
Yeah, and I think that’s partly because those products are just legitimately different, and the participants have different needs. Sometimes it’s because in the equity markets a lot of the transformation happened, I think, relatively rapidly. And so other markets that were a little bit behind saw the playbook, they saw how it changed, and they sort of positioned and controlled some of that change to maintain their current business models, etc.
I wanted to go back to one thing – you said, we mostly use TCP. It’s interesting because there were attempts (I know of at least one off the top of my head, probably there are more) to use UDP for order entry. Specifically, somebody had a protocol called UFO, UDP for Orders. There wasn’t a ton of uptake because look, if you’re a trading firm connecting to 60 exchanges, and 59 of them require you to be really good at managing a TCP connection, and one of them offers a unique UDP way, like, that’s great. But that’s one out of 60, so I kind of have to be good at the other thing anyway. So there just wasn’t as much adoption; there’s just enough critical mass and momentum that the industry kind of hovers around a certain set of conventions.
The places where you see other kinds of technologies really taking hold are where there’s a much bigger advantage to using them. I think when distributing market data, it’s just, obviously, almost grotesquely wasteful to send things in unicast form where you send one message per recipient. And so multicast is a huge simplifier. It makes the overall architecture simpler, more efficient, fairer. There’s a big win there, so it really got adopted pretty broadly.
We’ve kind of touched on half of the reason people use multicast. Which is, I think, one of the core things I’m kind of interested in in this whole story is “why is trading so weird in this way?”
When I was a graduate student many years ago, multicast was going to be a big thing. It was going to be the way in the Internet that we delivered video to everyone. Totally dead. Multicast on the open Internet doesn’t work. Multicast in the cloud basically doesn’t work. But multicast in trading environments is a dominant technology. And one of the reasons I think it’s a dominant technology is because it turns out there are a small number of videos that we all want to watch at the same time. Unlike Netflix, where everybody watches a different thing, we actually want, in the trading world, to all see what’s going on on NASDAQ and ARCA and NYSE and CBOE, and so on and so forth, live in real time – we’re all stuck to the same cathode ray tube. But there’s a whole different way that people use multicast that has less to do with that, which is that multicast is used as a kind of internal coordination tool for building certain kinds of highly performant, highly scalable infrastructure. What is the role that multicast plays on the inside of exchanges and also on the inside of lots of firms’ trading infrastructure?
The exchange, the primary thing it’s doing is determining the order of the events that are happening. And then the exchange wants to disseminate that information to as many participants as possible. Certain parts of this don’t parallelize very well – the sequencing pretty much has to be done in one place for the same security. So you ended up trying to funnel a lot of traffic down into one place and then report those results back. In that one place, you want to do as little work as possible so that you can be fast and deterministic. Then you were spreading that work out into lots of other applications that were sort of following along and providing value-added information, value-added services, and reporting what was happening in whatever their specific protocol was. So the same execution that tells you that you bought the security you’re interested in can also tell your clearing firm, maybe in a slightly different form, can tell the general public via market data that’s anonymized and takes your name off of it, etc, etc.
Let me try and kind of sharpen the point you’re making here. Because I think it’s an interesting fact about how this kind of architecture all comes together, which is: the kind of move you’re talking about making is taking this very specific and particular problem of “we want to manage a book of open orders on an exchange and distribute this and that kind of data” and turning it into a fairly abstract CS problem of transaction processing. You’re saying, “Look, there are all these things that people want to do, but the actual logic and data structure at the core of this thing is not incredibly complicated. So what we want to do is simplify all of the work around it: we’re just going to have a system whose primary job is taking the events, you know, the requests to add orders and cancel and so forth, choosing an ordering, and then distributing that ordering to all the different players in the system so that they can do the concrete computations that need to be done to figure out what are the actual executions that happen, what things need to be reported to regulators, what needs to be reported on the public market data.”
Then, multicast becomes essentially the core fabric that you use for doing this. You have one machine that sits in the middle – you can call it the matching engine, but you could also reasonably just call it a sequencer, because its primary role is getting all the different requests and then publishing them back out in a well defined order. It’s worth noting that multicast gives you part of the story but not all of the story, because it gets messages out to everyone, but it misses two components. It doesn’t get them there reliably, meaning messages can get lost, and it doesn’t actually get them to each participant in order. Essentially, the sequencer puts a counter on each message, so it’s like, “Oh, I got messages 1-2-4-3, well, okay, I’ve got to reorder them and interpret them as 1-2-3-4.” That ordering also lets you detect when you lose messages, and then you have another set of servers out there whose job is to retransmit when things are lost, so they can fill the gaps. Now this is a sort of specialized supercomputer architecture, which gives you this very specialized bus for building what you might call state machine style applications.
Right, and I will say, I think I’m aware of a number of exchanges that actually do have a model where they actually have just a sequencer piece that does no matching that really just determines the order. And then some of these sidecar pieces are the ones that are actually determining whether matches do indeed happen, and then sequencing them back, reporting them back, etc, etc. So there’s definitely examples of that.
A couple of other points: yeah, the gap filling and recovery has been a problem that I think is covered by other protocols. There are reliable multicast RFCs and protocols out there, and everywhere I’ve been, when we’ve looked at them, we’ve run into the problem that they have the ability for receivers to slow or stop publication. In those cases, if you scale up to having thousands of participants, there’s sort of somebody somewhere who always has a problem. So using any of these general purpose, reliable multicast protocols never seemed to quite fit any of the problems that we had. And I think because of the lack of use for the other reasons you mentioned, they were generally not super robust compared to what we had to build ourselves. And so we ended up doing exactly that, where we added sequencing and the ability to retransmit missed messages in various specialized ways.
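Here’s a rough sketch of what that receive-side logic might look like: sequence numbers to reorder, gap detection, and a request to a retransmit server to fill holes. The names and interfaces are hypothetical, not any particular exchange’s protocol.

```python
# Hypothetical sketch of the receive side of a sequenced multicast feed:
# reorder by sequence number, detect gaps, and ask a retransmitter to fill them.
class SequencedReceiver:
    def __init__(self, request_retransmit, deliver):
        self.next_seq = 1                   # next sequence number we expect
        self.pending = {}                   # out-of-order messages, keyed by sequence
        self.request_retransmit = request_retransmit  # e.g. a unicast request to a gap-fill server
        self.deliver = deliver              # hands in-order messages to the application

    def on_packet(self, seq, payload):
        if seq < self.next_seq:
            return                          # duplicate, already delivered
        self.pending[seq] = payload
        if seq > self.next_seq:
            # Gap detected: ask for everything we're missing up to this point.
            self.request_retransmit(self.next_seq, seq - 1)
        # Drain whatever is now contiguous.
        while self.next_seq in self.pending:
            self.deliver(self.next_seq, self.pending.pop(self.next_seq))
            self.next_seq += 1
```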
It’s also worth noting that you get some domain-specific benefits that I think also can generalize. If you’ve missed a sufficient amount of data, I guess you can always replay everything from the beginning. But it sort of turns out that if you know your domain really well, and you can compress that data down to some fixed amount of state, you can have an application that starts after 80% of the day is complete and be immediately online, because you can give it just a smaller subset of the state. And a general purpose protocol like TCP, where you’d have to sort of replay any missed data, has a number of problems in trading: that data can be buffered for an arbitrarily long time, it assumes you still want it to get there, and it’s buffering it byte-by-byte. Whereas, you know, if you say, “Oh, I’d like to place an order, I’d like to cancel an order, I’d like to place an order, I’d like to cancel an order,” and all of those are sitting in your buffers, the ideal thing to do, if you know the domain, would be to let them cancel each other out before they even go out, and you send nothing. So when we design those protocols ourselves, optimized for the specific domain, we can pick up a little bit more efficiency.
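A tiny sketch of the kind of domain-specific coalescing Brian describes: if an order and its cancel are both still sitting in the outbound queue, they can annihilate each other and nothing needs to be sent. The message representation here is made up.

```python
# Hypothetical sketch: coalesce an outbound queue of ("add", id) / ("cancel", id)
# messages that have not yet gone out on the wire. If an order and its cancel
# are both still queued, neither needs to be sent at all.
def coalesce(queue):
    cancelled = {oid for kind, oid in queue if kind == "cancel"}
    added = {oid for kind, oid in queue if kind == "add"}
    dead = cancelled & added              # orders that never need to leave the building
    return [(kind, oid) for kind, oid in queue if oid not in dead]

# coalesce([("add", 1), ("add", 2), ("cancel", 1)]) == [("add", 2)]
```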
This is, in fact, in some ways, a general story about optimizing many different kinds of systems: specialization, understanding the value system of your domain, and being able to optimize for those values. I think the thing you were just saying about not waiting for receivers, that’s in some sense part of the way in which people approach the business of trading. The people who are participating in trading care about the latency and responsiveness of their systems. People who are running exchanges, who are disseminating data, care about getting data out quickly and fairly, but they care more about getting data to almost everyone in a clean way than they do about making sure that everyone can keep up. So you’d much rather just kind of pile forward obliviously and keep on pushing the data out, and then if people are behind, well, you know, they need to think about how to engineer their systems differently so they’re going to be able to keep up. You worry about the bulk of the herd, but not about everyone in the herd. You know, the stragglers in the herd, they can catch up and get retransmissions later, and they’re going to be slower, but we’re not going to slow down. When you understand what’s important, the applications can be massively simplified. A huge step you can take in any technical design is figuring out what parts of the problem you don’t have to solve.
I think it’s also worth saying that the problem is somewhat exacerbated by fragmentation. We said it’s important for people to determine the order of events, but you also need to report it back to them quickly. And, you know, reliably quickly, deterministically quickly, because that translates directly into better prices. If I told you that you could submit an order and it would be live for the next six or eight hours, you’re going to enter probably a much more conservative price. And let’s say I’m actually acting as an agent for you, I’m routing your order to one of these other 14 exchanges. Well, I may want to check one and then go on to the next one. The faster and more reliable it is for me to check this one, the more frequently I’ll do so. If I think there’s a good chance that the order will get held up there, well, that’s opportunity cost, I may miss other places. So this is all kind of a rambling way of saying that speed and determinism translate directly into better prices when you have markets competing like this.
People often don’t appreciate some of the reasons that people care about performance in exactly the way that you’re kind of highlighting. To give an example in the same vein, like this fragmentation story, you might want to put out bids and offers at all of the different exchanges where someone might want to transact, right, there’s a bunch of different marketplaces, you want to show up on all of them. You may think, “Oh, I’m willing to buy or sell 1000 shares of the security. And I’m happy to do it anywhere.” But you may not be happy to do it everywhere. There’s like a missing abstraction in the market. So they want to be able to express something like, I would be willing to buy the security at any one of these places, but they can’t do it. So they try and simulate that abstraction by being efficient, by being fast. So they’ll put out their orders on lots of different exchanges, and then when they trade on one of them, they’ll say “Okay, I’m no longer interested,” so they’ll pull their orders from the others. And they’re now worried about other professionals who are also very fast, who try to route quickly and in parallel to all the different places and take all the liquidity that shows up all at once. This dynamic means that the speed and determinism of the markets now becomes something that essentially affects the trade offs between different professional participants in the market.
Yeah, that’s right.
Another thing I kind of want to talk about for a second is: what are some of the trade offs that you walk into when you start building systems on multicast? Like I remember a bunch of years ago, you were like, in the guts of systems like Island and Instinet, and NASDAQ, and Chi-X, and all that, building this infrastructure. Before you came to Jane Street, I was on the other side. At the time, I think Jane Street understood much less about this part of the system. I remember the first time we heard a description from NASDAQ about how their system worked, and I basically didn’t believe them, for two reasons. One reason is, it seemed impossible. The basic description was, the way NASDAQ works is, every single transaction on the entire exchange goes through a single machine on a single core. And on that core is running a more or less ordinary Java program that processes every single transaction, and that single machine was the matching engine, the sequencer. I didn’t really know how you could make it go fast enough for this to work. There was essentially a bunch of optimization techniques that at the time, we just didn’t understand well enough. And also, it just seemed perverse – like, what was the point, why go to all that trouble? Maybe you could do it, but why?
Well, a couple things. I mean, first, I want to say like on all the systems you mentioned, I like to think I did some good engineering work, but I was certainly part of many excellent teams, and worked with just a tremendous bunch of people over the years.
But yes, from a performance perspective, you said, well, the fewer processes I have, the simpler the system is, and it gives you some superpowers there where you just don’t have to worry about splitting things up in various ways. There were certainly some benefits to adding complexity, but a lot of that came about as hardware itself started to change. And that should provide probably the baseline for optimization. I think you want to understand the hardware and the machines you’re using, the machines that are available, the hardware that’s available to you, deeply. And you want to basically model out what the theoretical bounds are. Then when you look at what you’re doing in software, and you look at the kind of performance you’re getting, if you can’t really explain where that is relative to what the machine is capable of, you’re leaving some performance on the floor. And so we were trying very, very hard to understand what the machine could theoretically do, and really utilize it to its fullest.
Part of what you’re saying is instead of thinking about having systems where you fan out and distribute and break up the work into pieces, you stop and you think if we can just optimize to the point where we can handle the entire problem in a single core, a bunch of things get simpler, right? We’re just going to keep everything going through this one stream. There’s a lot of work that goes into making things uniformly fast enough for this to make sense. But it simplifies the overall architecture in a dramatic way.
It definitely does. And it’s been pretty powerful. I mean, not every exchange operates exactly on these principles, there’s certainly lots of unique variations that people have put out there. But I do think that it is pretty ubiquitous. Certainly the idea that exchanges want some kind of multicast functionality, I think is universal at this stage. I’m sure there may be an exception here or there. But amongst high-performance exchanges with professional participants like this, I think it’s pretty universal.
When you’re talking about publication of market data, we can see that directly since we’re actually subscribing to multicast in order to receive the data ourselves, but their internal infrastructure often depends on multicast as well, right?
True, although, you know, I’m not as familiar with the crypto side of the world. But since a lot of that is happening over the open Internet, UDP is probably not one of the options. And so you have people using more, you know, WebSockets, and JSON API and things like that. But it is kind of the exception that proves the rule, right? Because of that focus on the open Internet and everything, you’ve got a totally different set of tools.
It highlights the fact that the technical choices are in some sense conditional on the background of the people building it. There’s two sides of the question we were just talking about, there’s a question of what’s the advantage of doing all this performance engineering and the other question is, how do you do it? How do you go about it?
It has moved around over the years. You know, many years ago, I remember that we had interrupt driven I/O – packets would come into the network card where they would essentially wait for some period of time and if it had waited there long enough or enough data had accumulated, then the network card would request an interrupt for the CPU to come back and service the network card. How frequently should we allow interrupts? If we allow them anytime a packet arrives, that’d be way too much CPU overhead. So there were trade offs of throughput and latency.
But once you end up with the sheer number of cores that we do nowadays, we can essentially do away with interrupts and just wait for the network card by polling and checking “Do you have data? Do you have data? Do you have data? Do you have data?” and the APIs have shifted a bit away from general purpose sockets. The sockets APIs require lots of copies of the data. There’s an ownership change there: When you read, you give the API a buffer that you own, the data is filled in from the network and then given back to you. So this basically implies a copy on all of the data that comes in. If you start to look at, say, 25 gigabit networking, that means you basically have to copy 25 gigabits a second to do anything, at a baseline. The alternative is you try to reduce those copies as much as possible, and you have the network card just delivering data into memory, the application polling, waiting for that data to change, seeing the change, showing it to the application logic, and then telling the network card it’s free to overwrite that data, you’re done with it. When you get down to that level, you really are getting very close to the raw performance of what the machine is capable of.
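The shape of that poll-mode, copy-avoiding receive loop might look roughly like this sketch; real kernel-bypass APIs (DPDK, ef_vi, and the like) differ in the details, and the slot layout here is made up.

```python
# Hypothetical sketch of a poll-mode receive loop: the NIC writes frames into
# slots of a ring in memory, the application spins checking the next slot,
# processes the frame in place, then returns the slot to the NIC to overwrite.
class RxRing:
    def __init__(self, slots):
        self.slots = slots          # each slot: {"ready": bool, "frame": bytes}
        self.index = 0

    def poll_once(self, handle_frame):
        slot = self.slots[self.index]
        if not slot["ready"]:
            return False            # nothing new; the caller just spins and asks again
        handle_frame(slot["frame"]) # process in place, no copy into a user buffer
        slot["ready"] = False       # hand the slot back to the NIC to overwrite
        self.index = (self.index + 1) % len(self.slots)
        return True

def run(ring, handle_frame):
    while True:                     # "Do you have data? Do you have data?"
        ring.poll_once(handle_frame)
```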
So eliminating the copies and the unnecessary work in the system, that’s certainly one. Trying to make your service times for every packet and every event as reliable and deterministic as possible, so that you have very smooth sort of behavior when you queue, is another. You don’t end up having to do that everywhere – the critical path tends to be pretty small when it’s all said and done. I think one of the guys who had built the Island system really kind of had the attitude that if any piece of the system is so complicated that you can’t rewrite it correctly and perfectly in a weekend, it’s wrong. And so I think that, you know, probably the average length of an application there was, you know, 2000 lines or something like that, and the whole exchange probably was maybe four or five applications like this put together.
Sad to say, I think we do not follow that rule in our engineering: I think we could not rewrite all of our components in a weekend, I’m afraid.
The world has gotten more complicated, but it’s not a bad goal! I often ask people – and I think it’s consistent with reliability and performance – to constantly ask themselves, “yes, but can it be simpler?” We want it to be as simple as possible – no simpler, but as simple as possible. And it really is a mark of you deeply understanding the problem when you can get it down to something that seems trivial to the next person who looks at it. It’s a little depressing because you kill yourself to get to that point. And then the next person that sees it is like, “Oh, that makes sense. That seems obvious. What did you spend all your time on?” Like, if only you knew what was on the cutting room floor.
One thing that strikes me about this conversation is that just talking with you about software is pretty different than lots of the other software engineers that I talk to because you almost immediately in talking about these questions go to the question of what does the machine do? And how do you map the way you’re thinking about the program onto the actual physical hardware? Can you just talk for a minute about the role of mechanical sympathy (which is a term I know you like for this) in the process of designing the software where you really care about performance?
I do love that term. I did not coin that term. I think there was a blog by a number of people in the UK who ran a currency exchange called LMAX, and a gentleman ran a blog under that name. But it comes from a racecar driver, who was talking about how drivers with mechanical sympathy, who really had a deep understanding of the car itself, were better drivers, in some way. And I think that that translates to performance in that if you have some appreciation for the physical, mechanical aspects, just the next layer of abstraction in how our computers are built, you can design solutions that are really much closer to the edge of what they’re capable of. And it helps you a lot, I think in terms of thinking about performance when you know where those bounds are. So I think what’s important there is it gives you a yardstick.
Without that, without knowing what the machine is capable of, you can’t quickly back of the envelope say, “does the system even hang together? Can this work at all?” If you don’t know what the machine is capable of you can’t even answer that question. Then when you look at where you’re at, you say, “Well, how far am I from optimal?” Without knowing what the tools you have are capable of, I just don’t know how you answer that question and when you stop digging, so to speak. If you’re observing that the demands of the market – be it from a competitive perspective, or just the demands of the customer – are much higher than what you think is possible, well, you’ve probably got the wrong architecture, you’ve probably got the wrong hardware. It’s kind of hard for me not to consider that.
As a practical matter as a software engineer, how do you get a good feel for that? I feel like lots of software engineers, in some sense, operate most of the time at an incredible remove from the hardware they work on. There’s like, the programming language they’re in, and that compiles down to whatever representation and maybe it’s a dynamic programming language, and maybe it’s a static one. There’s like several different layers of infrastructure and frameworks they’re using, and there’s operating system, and you know, they don’t even know what hardware they’re running on. They’re running on virtualized hardware in lots of different environments for lots of software engineers… A kind of concrete and detailed understanding of what the hardware can do feels kind of unachievable. How do you go about building intuition trying to understand better what the actual machine is capable of?
I think you’re separating programmers into people who get a lot of things done, and people like myself – is that fair to say?
It’s a good question. I think part of it is interest and I think you really need to construct a lot of experiments. And you have to have a decent amount of curiosity and you have to be blessed with either a problem that demands it, or the freedom to be curious and to dig, because you are going to waste some time constructing experiments and your judgment, initially, is probably not going to be great. The machines nowadays are getting more and more complicated. They’re trying to guess and anticipate a lot of what your programs do. So very simple sorts of benchmarks, simple sorts of experiments don’t actually give you the insight you think you’re getting from them. So I do think it is a hard thing to develop. But certainly a good understanding of computer architecture or grounding in computer architecture helps. There are now a decent number of tools that give you this visibility, but you do have to develop an intuition for what are the key experiments? What are the kinds of things that are measurable? Do they correlate with what I’m trying to discover, etc, etc. And I think it requires a lot of work – staying current with the technology and following the industry solutions, as well as what’s happening generally in computing technology.
You’ve gotta kind of love it, right? Gotta spend enough time to develop the right kind of intuition and judgment, to pick your spots when you do your experiments.
I think in lots of cases, people approach problems with a kind of, in some sense, fuzzy notion of scalability. There are some problems that, if you’re like, “No, actually I can write this as one piece,” admit simpler solutions some of the time than they do if you try and make them scalable in a general way. You can make a thing that is scalable. But the question of being scalable isn’t the same as being efficient. So when you think about scalability and think about performance, it’s useful to think about it in concrete numerical terms, and in terms that are at least dimly aware of what the machine is capable of.
I think it’s actually easy to get programmers to focus on this sort of thing: If you just stop hardware people from innovating, they will have no choice. Right? So many programming paradigms and layers of complexity have been empowered by the good work of hardware folks who have continued to provide us with increasing amounts of power. If that stops, and it does seem like in a couple key areas that is slowing, I don’t know about stopping but certainly slowing, then yeah, people will pay a lot more attention to efficiency.
So this is maybe a good transition to talking about some of the work that you do now. You, these days, spend a bunch of your time thinking about a lot of the kind of lowest level work that we do and some of that has to do with building abstractions over network operations that give us the ability to more simply and more efficiently do the kind of things that we want to do. And part of it has to do with hardware. So I wanted to just talk for a minute about the role that you think custom hardware plays in trading infrastructures, and some of the work that we’ve done around that.
Jane Street has always had a large and diversified business and for lots of our business, it’s just not super relevant. But in the areas where message rates and competitiveness are a little extreme, it becomes a lot more efficient for us to take some of these programmable pieces of hardware and really specialize for our domain. Like a network card is actually very good at filtering for multicast data, it can compare these addresses bit by bit, but there’s really nothing that stops us from going deeper into the data and filtering based on content, looking for specific securities, things like that, and there aren’t a lot of general purpose solutions out there to do that at hardware speeds. But we can get programmable network cards, custom pieces of hardware, where we can stitch together solutions ourselves and I think that’s going to become increasingly relevant and maybe even necessary, as we start to move up in terms of data rates.
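As a software illustration of the content-based filtering Brian describes – logic which in practice would run on a programmable NIC or FPGA at line rate – here is a sketch with an entirely made-up message layout:

```python
# Illustration only: in practice this filtering runs in hardware at line rate.
# The message layout (an 8-byte, space-padded symbol at a fixed offset) and the
# symbols themselves are hypothetical.
INTERESTING = {b"AAPL    ", b"MSFT    "}   # securities we care about (made up)
SYMBOL_OFFSET, SYMBOL_LEN = 24, 8          # fixed offset into the payload (made up)

def keep(payload: bytes) -> bool:
    """Drop any market-data message whose symbol field isn't one we follow."""
    symbol = payload[SYMBOL_OFFSET:SYMBOL_OFFSET + SYMBOL_LEN]
    return symbol in INTERESTING
```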
I think earlier I mentioned that we have – I don’t have the exact number, maybe there’s 12 now, going up to something like 15, 16, 17 – different US equity exchanges. If each one of those can provide us data at something close to 10 gigabits per second, and the rule set requires that we consolidate and aggregate all that information in one place, well, we have something of a fundamental mismatch if we only have 10 gigabit network cards, right? So, for us to do that quickly and reliably in a relatively flat architecture we’re going to need some magic, and the closest thing I think we have to magic is some of the custom hardware.
This feels to me like the evolution of the multicast story. If you step back from it, you can think of the use of multicast in these systems as a way of using specialized hardware to solve problems that are associated with trading. But in this case, it’s specialized networking hardware. So it’s general purpose at the level of networking, but it’s not a general purpose programming framework for doing all sorts of things. It’s specialized to copying packets efficiently. Is there anything else at the level of switching and networking worth talking about?
Yeah, I think that it’s funny. I don’t know if I’ve ever come across these Layer 1 crosspoint devices outside of our industry. I think certainly some use them, maybe in the cybersecurity field. But within our industry, there’s been a couple of pioneering folks that have built devices that allow us to, with no switching or intermediate analysis of the packet just merely replicate things electrically everywhere according to a fixed set of instructions. It turns out that that actually covers a tremendous number of our use cases when we’re distributing things like market data.
So, you know, the more traditional, very general switch will take in the packet, look at it, think about it, look up in some memory where it should go and then route it to the next spot. That got sped up with slightly more specialized switches, based on concepts from InfiniBand, that would do what was known as cut-through. They would look at the early part of the packet, begin to make a routing decision while the rest of the bytes were coming in, start setting up that flow, send that data out, and then forward the rest of the bytes as they arrive. Those were maybe an order of magnitude or even two faster than the first generation. Well, these actually do no work whatsoever, but just mechanically, electrically replicate this data. They’re another order of magnitude or two faster than that.
A store-and-forward switch, the first kind I was describing, I don’t know, maybe that was seven to 10 microseconds, a cut-through switch, looking at part of the packet and moving it forward, maybe that’s 300 to 500 nanoseconds. And now these switches, these Layer 1 crosspoints, maybe they’re more like three to five nanoseconds themselves. And so now we can take the same packet and make it available in maybe hundreds of machines with two layers of switching like that, and we’re talking about a low single to double digit number of nanoseconds in terms of overhead from the network itself.
I think it’s an interesting point in general that having incredibly fast networking, changes your feelings about what kind of things need to be put on a single box and what kind of things can be distributed across boxes. Computer scientists like to solve problems by adding layers of indirection. The increasing availability of very cheap layers of indirection suddenly means that you can do certain kinds of distribution of your computation that otherwise wouldn’t be easy or natural to do. What are the latencies like inside of a single computer versus between computers these days?
It’s starting to vary quite a bit, especially with folks like AMD having a slightly different structure than Intel, but it’s true that moving between cores is starting to get fairly close to what we can do with individual network cards. To throw out some numbers that somebody will then probably correct me on, maybe that’s something on the order of 100 nanoseconds. It’s not that different when we’re going across the PCI bus and going through a highly optimized network card. That might be something like 300 to 600 nanoseconds, and this is one way to get the data in. It is not unreasonable for the sorts of servers that we work with to get frames all the way up into user space, into the application, do, you know, very little work on them, and then turn around and get that out in something less than a microsecond. Moving, you know, context switching, things like that in the OS can start to be on the order of a microsecond or two.
Yeah, and I think a thing that’s shocking and counterintuitive about that is the quoted number for going through an L1 crosspoint switch versus going over the PCI Express bus… we’re talking 300 nanos to go across the PCI Express bus, and two orders of magnitude faster to go in and out of a crosspoint switch.
Well you gotta add the wires in, the wire starts to…
Oh, yeah, yeah, that’s right.
The physical wiring starts to matter.
The wiring absolutely starts to matter. To go back to the mechanical sympathy point, when you think about the machines, we’re not just talking about the computers. We’re also talking about the networking fabric and things like that. I think an aspect of the performance of things that people often don’t think about is serialization delay. Can you explain what serialization delay is and how it plays into that story?
We’ve been talking about networking at specific speeds. I can send one gigabit per second, I can send 10 gigabits per second, 25 gigabits per second, 40 gigabits per second, etc. I can’t take data in at 10 gigabit and send it out at 25 gigabit, I have to have the data continuously available. I have to buffer enough and wait for enough to come in before I start sending because I can’t underflow, I can’t run out of data to deliver. Similarly, if I’m taking data in at 10 gigabit and trying to send it out at one gigabit, I can’t really do this bit for bit. I’ve kind of gotta queue some up, and I’ve gotta wait. The lowest latency is happening when things run at the same speeds, where you can do that, and certainly the L1 crosspoints are operating at such a low level, as far as I understand, that certainly no speed conversions are happening at the latencies that I described.
Just to kind of clarify the terminology, by serialization delay, you were making this point that, oh, yeah, when you’re in at 10 gigabit and out at 25, well, you can’t pause or anything, right? You have to have all of the data available at the high rate, which means you have to queue it up; when you send out a packet, it kind of has to be emitted in real time from beginning to end at a particular fixed rate. That means that there’s a translation between how big the packet is, and temporally how long it takes to get emitted onto the wire, right? There’s a kind of electrically-determined space to time conversion that’s there. It means if you have a store-and-forward switch, and you have, say, a full, what’s called an MTU – which is the maximum transmission unit of an Ethernet switch, which is typically 1500 bytes-ish – that just takes a fixed amount of time. On a 10 gigabit network, what does that translation look like?
I think it roughly works out to something like a nanosecond per byte. I think this comes back to the thing we were talking about in the beginning and a little bit of appreciation for multicast. So imagine I have 600 customers and I have one network card, and I would like to write a message to all 600. Well, let’s say the message is 1000 bytes. Okay, so that’s about a microsecond per… so the last person in line is going to be 600 microseconds, at a minimum, behind the first person in line. Whereas with multicast, if I can send one copy of that, and have the switch replicate that in parallel, with one of these Layer 1 crosspoints, I’m getting that to everybody, in something close to a microsecond.
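A quick back-of-the-envelope check of those numbers, using the rough rule of thumb of about a nanosecond per byte on a 10 gigabit link:

```python
# Back-of-the-envelope serialization-delay check; the inputs are illustrative.
LINK_BITS_PER_SEC = 10e9          # a 10 gigabit link
MESSAGE_BYTES = 1000
RECIPIENTS = 600

ns_per_byte = 8 / LINK_BITS_PER_SEC * 1e9      # 0.8 ns/byte, i.e. "about a nanosecond"
per_copy_us = MESSAGE_BYTES * ns_per_byte / 1000
last_recipient_us = per_copy_us * RECIPIENTS   # serial unicast: last in line waits for every prior copy

print(f"one copy on the wire: {per_copy_us:.1f} us")           # ~0.8 us
print(f"600th recipient lags by ~{last_recipient_us:.0f} us")  # ~480 us, i.e. hundreds of microseconds
```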
That affects latency, but it also affects throughput. If it takes you half a millisecond of wire time to just get the packets out the door, well, you could do at most 2000 messages per second over that network card and that’s that, right? Again, this goes back to… there are real physical limits imposed by the hardware. You can be as clever as you want, but there’s just a limit to how much stuff you can emit over that one wire, and that’s a hard constraint that’s worth understanding. Multicast is a story of “the technology that could,” it’s incredibly successful in this niche. There’s other networking technology that had a more complicated story and I’m in particular thinking about things like InfiniBand and RoCE. What is RDMA? What is InfiniBand?
InfiniBand is a networking technology that was very ahead of its time. I think it’s still used in supercomputing areas and a lot of high-performance Ethernet has begged, borrowed, and stolen ideas from InfiniBand. InfiniBand provided things like reliable delivery at the hardware layer. They had APIs that allowed for zero-copy I/O. They had the concept of remote direct memory access. So, direct memory access is something that peripherals, devices on your computer, can use to sort of move memory around without involving the CPU. The CPU doesn’t have to stop what it’s doing, and copy a little bit over here, from here to there, from here to there, the device itself can say “okay, that memory over there, I just want you to put this data right there.” Remote DMA extends that concept and says “I’d like to take this data and I’d like to put it on your machine over there in memory without your CPU being involved.” This is obviously powerful, but requires different APIs to interact with.
A number of the places I’ve been at used InfiniBand, some very much in production, some a little more experimentally. There were some bumps in the road there. InfiniBand had this problem where, by default, it essentially had flow control in hardware, meaning that it was concerned about network bandwidth and could slow down the sender. So we’d have servers that didn’t seem to be doing anything, but their network cards were oversubscribed, they had more multicast groups than they could realistically filter, and so they were pushing back on the sender. So when we scaled it up to big infrastructures, we’d have market data slow down, and it was very difficult to figure out why and to track down who was slowing things down. So the Ethernet model of best-effort, of “fail fast and throw things away quickly,” is in some cases a little bit easier to get your head around and to debug.
You mentioned that when we talk about multicast one of the key issues of multicast is it’s not reliable. We don’t worry about dealing with people who can’t keep up, right? People who can’t keep up fall behind and have a separate path to recover, and that’s that. You just mentioned that InfiniBand had a notion of reliability, and reliability is a two-edged sword, right? The way you make things reliable is in part by constraining what can be done, and so the pushback on senders of data is kind of part and parcel of these reliability guarantees, I’m assuming. Is that the right way of thinking about it?
Yeah, I think that’s a good way to think about it. But certainly the visibility and the debuggability could have been improved as well. And you mentioned RoCE; I never worked with it personally, but it was a way to extend Ethernet to support the RDMA concept from InfiniBand. But I don’t believe it… it involves some proprietary technology still, so it was a little bit of the “embrace and extend” approach applied to Ethernet. So when you look at the kinds of custom hardware that were being developed, I think there were more interesting things happening in the commodity world than RoCE.
We spent a lot of time talking about the value of customizing, of doing just exactly the right thing and understanding the hardware. I guess the Ethernet versus InfiniBand story is in some sense about the value of not customizing, of using the commodity thing.
There is a strong lesson there. I mean, I had a couple of instances over my career where I was very surprised at the power of commodity technologies. I was at a place that did telecommunications equipment, and they were doing special-purpose devices for processing phone calls: phone-number recognition, “what number did you press” sorts of menus… and these had very special cards with digital signal processors and algorithms to do all of this detection, some basic voice recognition. This was in the ’90s, and these were complex devices. And it turned out that somebody in the research office in California built a pure software version of the API that could use like a $14 card, just enough to generate ring voltages, and could emulate like 80% of the product line in software. When I saw that, I was like, I’m not really sure I want to work on custom hardware. I don’t know that I want to swim upstream against the relentless advance of x86 hardware and commodity vendors. Just the price/performance… you’ve got a million people helping you, whereas in the other direction you’ve got basically yourself. It took a lot to get me convinced to consider some alternative things.
But I do think the trends in how processors and memory latency are improving, and things like deep learning and GPUs, make it pretty clear that we’re starting to see some gains from specializing again, even though for the first 10 or 15 years of my career it was pretty clear that commodity hardware was relentless.
And it’s worth saying, I think, that in some sense the question of what counts as commodity hardware shifts over time. A standard joke in the networking world is “always bet on Ethernet”: you have no idea what it is, but the thing that’s called Ethernet is the thing that’s going to win. And I think that has played out over multiple generations of networking hardware, where, like you said, it’s stealing lots of ideas from other places, InfiniBand and whatever. But there is the chosen commodity thing, and learning how to use it, and how to identify what that thing is going to be, is valuable.
The work that we’re doing now in custom hardware is also sensitive to the fact that FPGAs themselves are a new kind of commodity hardware. It’s not the case that we have to go out and get a big collection of chips fabricated on one of those awesome reflective discs; we get to use a kind of commodity hardware that lots of big manufacturers are getting better and better at making bigger, more powerful, and easier to use. Is there anything else that you see coming down the line in the world of networking that you think is going to be increasingly relevant and important, say over the next two to five years?
I think what we’re going to see is a little bit more of the things we’ve been doing already, standardized and more common. This sort of userspace polling, that form of I/O, I think you’re seeing some of that start to hit Linux with io_uring. These are very, very similar models to what we’ve already been doing with a lot of our own cards, but now they’re going to become a bit more standardized, you’re going to see more I/O devices meet that design, and then you’re going to see more efficient, zero-copy, polling sorts of things come down the line. Some of the newer networking technologies, like 25 gigabit, I do think are going to have a decent amount of applicability; they’re waiting for things like an L1 crosspoint. It’s not always a clear net win; some of the latency has gone up as we’ve gone to these higher signaling rates. With large quantities of data, the gain in serialization delay will overcome some of the baseline latency as the data gets big enough, but it’s complicated.
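(As a very rough illustration of the subscribe-and-poll receive model being described, here is a minimal Python sketch using ordinary UDP sockets. This is not io_uring, not a kernel-bypass card, and not what any production market data path actually looks like; the multicast group address and port below are made-up examples.)

```python
# A plain-sockets sketch of "subscribe to a multicast group and poll in
# userspace": check for a packet, do other work, check again, rather than
# blocking in the kernel waiting for one.
import socket
import struct
from typing import Optional

GROUP = "239.1.1.1"   # hypothetical multicast group for a market data feed
PORT = 31337          # hypothetical port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Ask the kernel (and, via IGMP, the switches upstream) to deliver this group here.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# Poll instead of blocking.
sock.setblocking(False)

def poll_once() -> Optional[bytes]:
    """Return one datagram if one is available right now, else None."""
    try:
        data, _addr = sock.recvfrom(65535)
        return data
    except BlockingIOError:
        return None

while True:
    packet = poll_once()
    if packet is not None:
        print(f"got {len(packet)} bytes")   # decode the market data message here
```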
Can you say why that is? Why do switches that run at faster rates sometimes have higher latency than switches that run at lower rates?
Part of it is decisions by the vendors, where they’re finding the right market for the mix of features and the sensitivity to latency. I do think that we are at the mercy, so to speak, of the major buyers of hardware, which at this point is probably the cloud providers. That’s just an enormous market, so the requirements hew a little closer to their needs than to our specific industry’s, and we’ve got to contend with that. As the signaling rates go up, and again I’m no expert here, I think you start to have to rely more on error correction. Forward error correction is built into 25 gigabit, and it eats up a decent amount of latency if you have runs of any length. So that’s also a thing that we have to contend with, an added complexity.
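(The tradeoff being described is that the per-byte serialization gain from a faster link has to pay back any fixed overhead the faster link adds. A small sketch of that, assuming a made-up fixed per-hop cost standing in for things like forward error correction; the number is a placeholder, not a measured figure.)

```python
# Why a faster signaling rate isn't automatically lower latency: the
# serialization delay shrinks per byte, but any fixed per-hop overhead
# (e.g. FEC on the faster link) has to be paid back first.

FEC_OVERHEAD_NS = 250.0   # hypothetical fixed per-hop cost on the faster link

def wire_time_ns(num_bytes: int, gbps: float) -> float:
    """Serialization delay: time to clock the bytes onto the wire at the line rate."""
    return num_bytes * 8 / gbps

for size in (64, 256, 1500, 9000):
    t10 = wire_time_ns(size, 10)
    t25 = wire_time_ns(size, 25) + FEC_OVERHEAD_NS
    winner = "25G" if t25 < t10 else "10G"
    print(f"{size:>5} bytes: 10G {t10:7.1f} ns   25G+overhead {t25:7.1f} ns   -> {winner}")

# Small packets favor the 10G link under this assumption; as the data gets
# big enough, the serialization gain of 25G overcomes the fixed overhead.
```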
I think it’s going to be important, I think it’s going to be something that does come to our industry. And maybe quickly. I think at this point, there’s a decent amount of 25 gigabit outside of the finance industry, but not quite as much in the trading space. Once you start to see a little bit of it, you’ll see a lot very quickly.
All right, well, thanks a lot. This has been super fun. I’ve really enjoyed walking through some of the history and some of the low level details of how this all works.
I think we do this basically all the time, you and I, it’s just kind of like now we’re doing it for somebody else!
You can find links to more information about some of the topics we discussed, as well as a full transcript of the episode on signalsandthreads.com. And while you’re at it, please rate us and review us on Apple Podcasts. Thanks for joining us, and see you next week.
Colocation: Putting the servers that run the exchange and the servers of market participants in the same datacenter. Usually, these machines will operate on the same private network. This helps reduce latency for market participants.
Exchange: A place where orders to buy and sell stocks or other securities can be placed and executed.
Instinet: An early electronic trading platform.
Island: An early electronic equities exchange. Island was later acquired by Instinet.
Market data: Information about orders and executions, subscribed to by exchange participants so they can make trading decisions and satisfy regulatory reporting obligations.
Matching engine: The piece of an exchange that matches compatible buy and sell orders with each other and records that a trade has happened.
Multicast: A network protocol that allows messages to be efficiently duplicated to all members of a group that have subscribed to that particular stream of messages.
NASDAQ: The first electronic stock exchange. The NASDAQ is still one of the major US exchanges.
Network switch: A hardware device that takes in packets and forwards them to the correct recipients.
Order: A message sent to an exchange offering to buy or sell a security, often including other parameters like a price and quantity.
Order book: A data structure representing the open (active) orders on an exchange.
UDP: Or, User Datagram Protocol. A network protocol that sacrifices reliable delivery and congestion control for improved performance.