Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.
Doug Patti is a developer in Jane Street’s Client-Facing Tech team, where he works on a system called Concord that undergirds Jane Street’s client offerings. In this episode, Doug and Ron discuss how Concord, which has state-machine replication as its core abstraction, helps Jane Street achieve the reliability, scalability, and speed that the client business demands. They’ll also discuss Doug’s involvement in building a successor system called Aria, which is designed to deliver those same benefits to a much wider audience.
Some links to topics that came up in the discussion:
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky. It’s my pleasure to introduce Doug Patti. Doug has worked here at Jane Street for about five years and in that time he’s worked on a bunch of interesting systems and, really, distributed systems. That’s mostly what I want to talk about today, is the kind of design and evolution of those systems. Doug, just to get us started, why don’t you tell us a little bit about where you came from and where you worked before you joined us at Jane Street.
Sure, and thanks for having me, Ron. Before Jane Street, I worked at a few smaller web startups and my primary day to day was writing Node.js and Ruby and building little web frameworks and APIs. So it was quite a departure from the things that I’m working on today.
So I know a bunch about what you work on, but what’s different about the character of the work that you do here versus that?
Well, I think that, in a way, we just use completely different technologies. For example, we very rarely come across databases in our day to day here at Jane Street. There are some teams that work with databases, of course, but I personally have almost never interacted with them, and that was a core part of being a web developer: there has to be a database. I think another big divergence in my world is that a lot of the Jane Street systems are based around a trading day, so things start up in the morning, some stuff happens, and then it shuts down at the end of the day and you have time to go do maintenance because there’s nothing going on. Whereas with web services, you’re constantly thinking about how to do continuous rollouts over the course of the day so that there’s no downtime.
Yup. Although that’s a thing that used to be more true here than it is now and we are more and more building services that run 24 hours a day, in fact, seven days a week, so the trading day over time has become more continuous. But for sure, there’s still a lot of that going on in many of the systems we build. Okay. So let’s talk a little bit about the system. So the first thing you came and worked on when you joined here was a system called Concord. Can you tell us a little bit more about what Concord was for, what business problem you were working on and the team that worked with Concord was doing?
Yeah, definitely. So Concord predates me; when I joined, I think Concord had already been around for at least four years or so. At the time, we were entering into a new business where instead of doing the normal Jane Street thing, which is like, we go to an exchange and we trade there, we’re getting into relationships, like some bilateral relationships with different clients, different firms, where we trade with them directly. So this brings on a whole new swath of responsibilities, and we built Concord to manage those things.
Right, and I guess maybe it’s worth saying, from a business level, the thing we’re doing in those businesses is not so different from our on-exchange activity. In both cases, we’re primarily liquidity providers. Other people want to trade and we try to be the best price for that trade, but in these bilateral relationships, we have an arrangement with some other kind of financial institution that wants to reach out to us and see what liquidity we have available. That’s one of the kinds of relationships we might have. And suddenly we’re a service provider. That’s the difference: we’re not just providing the trading, but also the platform on which the interaction happens.
That’s totally right, being a service provider is where all this responsibility comes from. We have this obligation to be reliable, to be available. We need to have good uptime. We need to be fast in turnaround. When we get a message from a client, we need to be able to respond really quickly. And we also need to maintain some sort of an audit trail more so than in our traditional trading systems for regulatory reasons.
So I understand this point about being a service provider and how that kind of changes your obligations, but can you maybe sketch out almost at the network level, what are the services that we are providing to customers that connect to us?
Yeah. Maybe I can walk through what it looks like from end to end, perhaps. Is that what you’re asking?
Yeah. Okay. So you can start out with, we are running some server and a client is going to connect into us, into one of our ports. A client is going to send us an order and the order flows through our system and tries to match against one of the orders that we have from one of our desks, our trading desks. If there is a match, then we produce an execution and we send a copy of that execution back to the client. We send a copy to the desk and we send a copy to a regulatory reporting agency, and that’s a very simplistic version of what we do.
In some sense, that’s not so different from the interaction you have on an exchange, right? When we want to provide liquidity on exchange, we post orders on exchange that express what we’re willing to trade, and then people who want to trade come in and send orders, and then, maybe they cross one of the orders that we have and an execution goes up. But the difference is we’re playing two roles here. We’re both running the exchange-like infrastructure, and we’re also the same organization that’s actually putting in the orders that are providing liquidity to clients who might want to come in and connect.
Yup. That’s exactly right and in fact, I think the original design for Concord was taken from inspiration, from what other exchanges use in the real world today.
Got it. So you said before that this puts pressure in terms of reliability. Can you say a little bit more about how that plays out?
Yeah, if we go back to that example of a client sends an order, we have to be ready for the possibility of some sort of a network loss, so the client might get disconnected. If the client reconnects, they really need to be able to see, like, “Hey, did the order actually get sent, and if so, did it get filled?” These are important pieces of information that we need to hold onto, so it’s not just what happens if the network goes down, but what happens if one of our entire machines crashes and we suddenly lost the piece of state, the program that was running that was accepting those connections. We need to make sure that when we come back online, we have the right answers for the client.
Right. So the scenario is a client connects. They send an order. Something terrible happens and they’re disconnected, and then when they do connect again, they would really like to know whether they traded or not. So, there’s that kind of resynchronization process that you have to get right.
That’s right. That’s really important.
So how does that work?
There are multiple levels at which you can think about this, but let’s talk about just the client and the server, talking to each other. We use a protocol called FIX. It’s a pretty standard exchange protocol in the financial world. In FIX, there is this notion of a sequence number that is attached to every single message. You start at one, you expect it to go up by one and you never expect it to go down or skip a sequence or something like that. I should say, this is not unlike TCP, where in TCP you have sequence numbers based on the bytes and making sure that packets arrive in the same order, but the difference is that this persists over the course of the day, so that means if you lose connection and you reconnect, what you would end up doing is you’d log on and say, “Hey, I want to start from message 45,” and we might say, “Oh, wait a minute. We don’t know about 45. Can you send us 45?”
And all of a sudden, there’s this protocol for exchanging missed messages in that sense, based on sequence number.
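To make that resend negotiation concrete, here’s a toy sketch in Python. The class and method names are invented for illustration, and this is a schematic of the idea, not the real FIX protocol:

```python
class Session:
    """Toy model of FIX-style per-session sequence tracking (not real FIX)."""

    def __init__(self):
        self.next_outbound = 1   # seq number we'll stamp on our next message
        self.sent = {}           # seq -> message, kept so we can honor resends

    def send(self, msg):
        seq = self.next_outbound
        self.sent[seq] = msg
        self.next_outbound += 1
        return (seq, msg)

    def on_resend_request(self, begin, end):
        # Counterparty reconnected and says "I last saw begin-1; replay from there."
        return [(seq, self.sent[seq]) for seq in range(begin, end + 1)]

session = Session()
for text in ["order-1", "order-2", "order-3"]:
    session.send(text)

# After a disconnect, the client logs back on having only seen message 1,
# so it asks us to resend messages 2 through 3:
replayed = session.on_resend_request(2, 3)
```

The key point is that sequence numbers persist across reconnects, so either side can tell exactly which messages the other is missing.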
Right, so TCP gives you an illusion of reliability on top of an unreliable network. Your base network sends unreliable datagrams, you might lose any packet. TCP does a bunch of sequence number stuff to synthesize a kind of reliable connection on top of that unreliable one. But it’s limited in that it’s just for the scope of the connection. You lose this connection for any reason, and certainly you lose one host or the other, and you’re dead, right? There are no guarantees provided by TCP that let you be certain about what information did and didn’t get through. So, we’re providing an extra layer of reliability on top of that, where we can recover and we can even recover when the server that’s on the receiving point itself crashes, right?
So it may have lost any of its local state and still we have to be able to recover, because we need to make sure that clients know what they did, even if one of our boxes crashes. It’s kind of not an acceptable customer service thing where you’d be like, “Oh, we don’t really know if you traded. Maybe.” That’s an awkward conversation to have with a client.
Cool. Are there any other ways in which the constraints of the business that we’re doing affect the kind of technology that we have to build?
One of the things that I said was that we need to be very available and we need to have uptime. And what that means is in the case that a box crashes like, yes, it’s a very rare event and hopefully we can bring it back up. But we also need to be able to move quickly and sometimes that means moving all of our things off of that box and onto a new box, not exactly instantly, but on the order of a few minutes, instead of a few hours potentially to debug an issue with a box. So we have the flexibility of kind of moving processes around in this distributed system, pretty freely.
Got it. All right, so there are requirements around reliability and durability of transactions, like when you’ve told a client about something that happened, you’d better remember it and you better be able to resynchronize them and know all the stuff about their state that you can recover, even when you lose components in your system. What about scalability?
Scalability is a big thing, because we sometimes have days where trading is busier than a normal day by orders of magnitude. We need to be able to handle all of that traffic just like we would on a normal day. So we have to design the system such that we can be efficient, not only in the common case, but even under a lot of load.
Got it, is all of the load of what you’re dealing with, just the messages that are coming in from clients who want to trade? What else drives the traffic on the system?
No, the system is actually full of lots of other data as well. For example, for regulatory reasons, we need to have all of the market data from all the public exchanges in our system at the same time as all of the orders for all the clients. That alone is a huge amount of data and especially on a busy day, that’s going to go up significantly. So we end up with these very, very hot paths in our code where you need to be able to process certain messages way more quickly than others.
Got it. You’ve told us a bunch about what’s the problem that’s being solved. Can you give us a sketch of the system that was built to do it? What’s the basic architectural story of Concord?
Yeah. We’ve already alluded to messages a little bit. What we’ve built is a replicated state machine, which is the fancy term for it. When I say state machine, I don’t mean a finite state machine, the kind of thing you would learn in CS with boxes and arrows and things changing from one state to another. Think more of a module that encapsulates some state, where that module has the ability to process messages one at a time, and each time it processes a message, it updates its state. It might perform an action, but that’s it. That’s the state machine. It’s a very contained, encapsulated piece of software.
In some sense, that sounds very generic, like I write programs, they consume data, they update their state. How does the state machine approach differ from just any way I might write some kind of program that processes messages off the network?
One of the things that we do is we process messages immediately. We don’t do any sort of, “Hey, let me go out and send a network request,” or, “Let me block for a second while I talk to something on the disk.” It’s really important that a single message just updates your state, and that it does so deterministically, because that means that if you have the same piece of code and you run the same messages through it, you get to the same state each time.
Okay. So why does that ability to rerun the same state machine in multiple places and get the same result, why is that so important to the architecture?
Right. So one of the things that we talked about was reliability and how if an order was sent by the client, we need to be able to remember it and make sure that even if we crash, we’re able to recover it later and tell them about the order and some potential execution that happened. So, we can do this by leveraging these state machines because when you crash and you come back online, one of the things that Concord does is it forces you to read all of the messages that happened in the system from the beginning of the day in the same order, and what that means is that you will end up as an exact copy of what you were when you crashed and you will have all the information that you need to know.
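That recovery-by-replay property can be illustrated with a minimal deterministic state machine. This is a schematic sketch, not Concord’s actual API; the class and message shapes are made up:

```python
class PositionMachine:
    """Minimal deterministic state machine: consumes fills, tracks net position."""

    def __init__(self):
        self.position = 0
        self.fills_seen = 0

    def apply(self, msg):
        # No I/O, no clocks, no randomness: the state depends only on the
        # message sequence, so replaying the log reproduces the state exactly.
        side, qty = msg
        self.position += qty if side == "buy" else -qty
        self.fills_seen += 1

log = [("buy", 100), ("sell", 30), ("buy", 10)]

live = PositionMachine()
for m in log:
    live.apply(m)

# Simulate a crash: start a fresh copy and replay the same log in the same order.
recovered = PositionMachine()
for m in log:
    recovered.apply(m)
```

Because `apply` is deterministic and synchronous, `recovered` ends up bit-for-bit identical to `live`, which is exactly the guarantee replay-based recovery relies on.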
Got it. So there are really two things going on here. One is the individual applications are structured as these synchronous deterministic state machines, so that they have a well-defined behavior upon receiving the sequence of transactions, and somewhere, there has to be some mechanism for delivering a reliable and ordered sequence of transactions to everyone who starts up, right? That’s like the other leg upon which this whole thing stands.
Yep. That’s right. We call that sequence of transactions the transaction log. And there are kind of two questions around the transaction log. One is like you mentioned, how do you disseminate it? How do you make sure that in this distributed system with hundreds of processes, everything is seeing the same messages in the same order. The other question is how do you append to the transaction log, because obviously you have to be able to add new messages.
Got it, so how does that work in Concord?
Yeah. So for the first part, when it comes to dissemination, what we do is we use something called UDP Multicast. If you’re not familiar, multicast is a kind of protocol where you can take a single UDP packet, send it to a special network address, and the networking devices like your switch will then take it and make copies of it and send it to all the other switches or devices that are subscribed to that special address. It’s lossy because it’s UDP, but it’s a way of taking one packet and suddenly turning it into 50, and you don’t have to think about it at all from your application level.
The actual work of doing the copying is done by the switches, and importantly, the switches have specialized hardware for this, so they can send the packet down all of the ports that need it in parallel, in a very efficient way and at very high speed. And weirdly, nobody else uses this. IP multicast was a big deal historically. When I was a graduate student, so long ago I’m embarrassed to admit it, it was supposed to be the thing that was going to make video transmission over the internet happen. It turned out nobody uses it in the outside world, but it’s extremely common in the finance world.
It is. One of the drawbacks is that it’s much harder to maintain over larger and larger networks, but when you’re dealing with a nice enclosed setup, like, “Hey, Concord is just a single cabinet of hardware,” we actually get a lot of the benefits of it without those drawbacks. One drawback that we do have to think about, though, is that this is UDP, which means it’s unreliable. Packets can get dropped. So we need to build some layer of additional functionality to deal with that.
UDP multicast is efficient, but it’s unreliable. Meaning you can lose data and messages can get reordered, and we need to claw that back. So how do you claw back those reliability properties?
Right, so this kind of goes back to what we were talking about earlier when we talked about the FIX protocol and sequence numbers. We actually just use sequence numbers inside of our own messaging as well. So each packet has a sequence number attached to it, and that means that if you were to say, receive packet 45 and then 47 and then 48, you would very quickly realize that you may have missed 46. Maybe it’s still in flight. Maybe it hasn’t gotten there yet, or maybe it will never get there. So what you can do then is you can go out to a separate service and contact it and say, “Hey, I’m missing packet 46. Can you fill me in?” And this is a thing that we do all the time and it actually works quite seamlessly, even when drops are very common, which might be the case due to a slow process.
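The gap-detection logic Doug describes amounts to tracking the next expected sequence number and buffering anything that arrives early. Here’s a sketch of that idea; the names are invented and this glosses over actually contacting a retransmitter:

```python
class GapDetector:
    """Track a sequenced stream; deliver in order, report gaps to fill."""

    def __init__(self, first_seq=1):
        self.next_expected = first_seq
        self.pending = {}     # early arrivals, keyed by sequence number
        self.delivered = []

    def on_packet(self, seq, payload):
        if seq < self.next_expected:
            return []         # duplicate; we already delivered it
        self.pending[seq] = payload
        # Anything between next_expected and seq that we don't have is a gap
        # we'd ask a retransmitter to fill before we can make progress.
        missing = [s for s in range(self.next_expected, seq)
                   if s not in self.pending]
        # Deliver the contiguous prefix in order.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        return missing

d = GapDetector(first_seq=45)
d.on_packet(45, "a")
gaps = d.on_packet(47, "c")   # 46 is missing: ask the retransmitter for it
d.on_packet(48, "d")
d.on_packet(46, "b")          # gap filled; 46 through 48 now deliver in order
```

This mirrors the 45, 47, 48 example from the conversation: the consumer notices 46 never arrived and can request just that packet while holding the later ones.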
Right, so an important thing about these kinds of unreliable protocols is you might lose data because something is wrong with the network, but as any good network engineer will tell you, it’s almost never the network. Just like when you’re starting to work in a new programming language and you think, “Oh, maybe it’s a compiler bug.” It’s like, “No, it’s not a compiler bug. It’s your bug.” If you are losing packets, it’s probably not because the network can’t keep up. It’s probably because the damn program you wrote can’t keep up, and you’re dropping packets for that reason. Again, you can always recover them the same way, which is there’s some retransmission service. So really, you’ve described two things here, right?
There are sequence numbers, which let you know the ordering, and there’s a retransmission server, which lets you recover data that might have been lost. But these both sound kind of magical. Where do the sequence numbers come from, and how do you get this magical retransmission service that always knows all the things that your poor application missed?
Right, and before we talk about that, I think it might be more interesting to start from how do we even add things to the stream in the first place? How do you append to the transaction log? The answer there is, well, spoiler alert, we use multicast for that as well. Every process in the system has the ability to send these packets out into the multicast void, where they say, I want to add this message to the transaction log and there is a single app that we call the sequencer and that sequencer is responsible for taking packets from that multicast group and re-emitting them on a different multicast group. When it does that, it also stamps the packets with a timestamp and it stamps them with their sequence number.
It sends them all out to all of the things listening. So, you just have this unordered mess of messages coming in and a single canonical ordering of messages coming out of the sequencer.
Got it, so in some sense, you get ordering in the stupidest possible way, which is you just pick a single point in space and you shove all the messages through that one point and you have a counter there that increments every time it sees a message and that’s how you get the sequence numbers.
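The sequencer itself is conceptually tiny. Something like this sketch, which ignores the network and the disk entirely and uses made-up names:

```python
import itertools
import time

def make_sequencer():
    """One choke point: stamp every inbound message with a sequence number
    and a timestamp, then re-emit it. The total order falls out of the
    single shared counter."""
    counter = itertools.count(1)

    def sequence(msg):
        return {"seq": next(counter), "ts": time.time(), "body": msg}

    return sequence

sequence = make_sequencer()
# Unordered proposals arrive from many apps; one ordered stream comes out.
out = [sequence(m) for m in ["order-A", "quote-B", "order-C"]]
```

Every consumer that reads the re-emitted stream sees the same sequence numbers in the same order, which is the entire trick.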
Yup. That’s all it is, and when it comes to re-transmitters, they’re just normal processes, just like all the other processes. They listen to the sequencer, they write down everything that they hear, sometimes they might lose information. It’s very possible. So, re-transmitters will try to talk to other re-transmitters or in a worst case scenario, talk to the sequencer because there’s a chance that the sequencer tried to send a packet that never made it to anyone. So you eventually have to go back to the sequencer and say, “Hold on a second, we missed 46, everyone missed 46. Please help fill us in.”
Got it and then what happens if you lose the sequencer?
Well, in the case of Concord, everything stops, everything grinds to a halt. There are no new messages. You can’t really do things since everything that you do is based around messages, and fortunately we have something that is a backup sequencer running. What we do is we intentionally want to make that a manual process because we want to make sure we can inspect the system and feel confident because we’re talking about trading, we’re talking about money. We would much rather have a few seconds of downtime while we collect our thoughts and make sure we’re in a healthy state. What we can do is we can send a message through a different channel to this backup sequencer, and it will be promoted into the new sequencer and take over responsibilities from there.
Right, and I feel like this is the point at which we could like fork the conversation and have a whole separate conversation about how this compares to all sorts of fancy distributed systems, algorithms like Paxos and Raft and blah, blah, blah, for dealing with the fact that the sequencer might fail. Let’s not do that conversation because that’s a long and complicated and totally different one. What I am curious about is this design is very simple, right? How are we going to drive the system? We’re going to have a single stream of transactions. How are we going to do it? We’re going to have one box that just is responsible for stamping all of the transactions, going through and re-broadcasting them to everyone.
Then we’re going to build a bunch of apps that do whatever they do by listening to the stream of transactions. It also seems kind of perverse in the sense that, you said scalability was important, but it sounds like there’s a single choke point for the entire system, that everything has to go through this one process that has to do the timestamping on every message. So how does that actually work out for a scalability perspective?
Surprisingly well, actually. We’re limited by the speed of the network, which is like 10 gigabits per second. We’re limited by the program that is trying to pull packets off and put them on a different interface, and we’re also limited to some degree by disk writing, because we have to be able to write things to disk in case we need to recall them later; we can’t store everything in memory. So those are a few limiting factors, but in practice, this fits perfectly well for Concord. We have seen a transaction log from 9:30 to 4, roughly a trading day, that was over 400 gigabytes, which is pretty large. We also see a baseline message rate during the trading day of over 100,000 per second. Some of those messages are smaller than others, but you still see maybe three to five times that on a busier day.
Right, those messages actually pack a lot of punch, right? Part of the way that you optimize the system like this is you try and have a fairly concise encoding of the data that you actually stick on the network. So you can have pretty high transaction rates with a fairly small number of bytes per transaction. In fact, you put more energy into optimizing the things that come up the most. So market data inputs are pretty concise and you’ve put less energy into messages that happen 10 times a day. You don’t think very much at all about how much space they take.
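The point about concise encodings can be made concrete with Python’s struct module. The wire format here is one I invented for illustration, not Concord’s:

```python
import json
import struct

# Hypothetical market-data update: (symbol id, price in ticks, size), packed
# into a fixed 12-byte record instead of a self-describing text format.
RECORD = struct.Struct("<IiI")  # little-endian: u32 symbol, i32 price, u32 size

def encode(symbol_id, price_ticks, size):
    return RECORD.pack(symbol_id, price_ticks, size)

def decode(buf):
    return RECORD.unpack(buf)

binary = encode(7, 10125, 300)
as_json = json.dumps({"symbol_id": 7, "price_ticks": 10125, "size": 300}).encode()
# The fixed binary record is 12 bytes; the JSON equivalent is several times
# larger, and on the hottest paths that difference dominates throughput.
```

Spending this kind of effort on the market-data messages that arrive hundreds of thousands of times a second, and not on the ten-times-a-day administrative ones, is the optimization Ron is describing.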
On the exchange side, where you see a very similar architecture, they’re trying to solve a larger problem, and the peak transaction rates for that version of this kind of system are on the order of several million transactions a second. I think this is the kind of case where people look at a design like this and think, “Oh, there’s one choke point, so it’s not scalable,” but that’s a kind of fuzzy-headed approach to thinking about scalability. Whatever problem you have has some finite size, even if it’s web-scale, and whatever design you come up with is going to scale a certain distance and not farther. What I’ve always liked about this class of designs is that you think in advance like, “How much data is there really? What are the transaction rates we need to handle?”
And then, if you can scope your problem to fit within that, you have an enormous amount of flexibility and you can get really good performance out of that, both in terms of the throughput of the data you can get into kind of one coherent system and latency too, right? Do you know roughly what the latency is in and out of the sequencer that you guys currently have in Concord?
Yeah. We measure somewhere between, I want to say, eight and 12 microseconds as a round-trip time from an app, a single process wants to sequence a message and the sequencer sends it back to the app.
Right, which is like within an order of magnitude of the theoretical limits of a software solution here. It takes like about 400 nanos to go one way over the PCI express bus. It’s kind of hard to do better than like one to one and a half mics for something like this. So, the eight to 12 is like within a factor of 10 of the best that you can do in theory.
Cool. Okay, so that tells us a little bit about this kind of scaling question. What else is good about this design? How does this play out in practice?
The thing that we’ve talked about the most so far is how Concord acts as a transport mechanism, and I think there’s probably more that we can say about that, but I think it’s not just a transport mechanism. It’s also a framework. It’s a framework for building applications and thinking in a different style when you’re writing these applications so that they work efficiently with the giant stream of data that you have to consume. I think one of the things that’s really nice about this is that our testability story is really, really good in Concord, and that’s because we have the ability to take these synchronous state machines that we’ve talked about and put messages in them, one at a time in an exact order that we can control.
And we can really simulate race conditions to an amazing degree, and take all sorts of nearly impossible things and make them reproduce perfectly. As for why the testability is so nice: I mentioned that Concord is like a framework for building applications, and the framework itself forces you to write your application in an inverted style. What that means is your application doesn’t have the ability to say, “Hey, I want to send a message. I want to propose this message.” Instead, your application has to process messages and make decisions about what it might want to send, when something asks it what it wants to send. So you’re no longer pushing messages out of your system. You are just waiting around for something to say, “Hey, what’s your next message?” “Here’s my next message.”
And this seems like a kind of weird design at first, but when it comes down to the testability story, what it does is it really forces you to put more of your state in your state. In the former example of, I want to send a message, you might have some code that looks like, “Okay, well I send the message, then I wait for it, then I’m going to send another message.” But what you’re actually doing is you’re taking a very subtle piece of state there, the fact that you want to send a second message, and you’re hiding it in a way that’s not observable anymore. So, by inverting control, you have no choice but to put all of your state down into some structure that you can then poke at, and from the testing perspective, you can say, “All right, we’ve seen one message happen. Let’s investigate what every other app wants to send at this point.”
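The inverted style might look roughly like this sketch: the app never sends on its own; it records intent in its state and answers when polled. The names are invented for illustration:

```python
class QuoteMachine:
    """Inverted-control state machine: all intent lives in explicit state."""

    def __init__(self):
        self.last_price = None
        self.outbox = []   # messages we *want* to send, represented as data

    def apply(self, msg):
        # React to an input by updating state, including queuing intent.
        kind, value = msg
        if kind == "market_data":
            self.last_price = value
            self.outbox.append(("quote", value + 1))

    def next_message(self):
        # The framework, not the app, decides when to ask for output.
        return self.outbox.pop(0) if self.outbox else None

m = QuoteMachine()
m.apply(("market_data", 100))
# A test can pause right here and inspect the pending intent directly:
pending = list(m.outbox)
sent = m.next_message()
```

Because “I want to send another message” is just data in `outbox` rather than a suspended thread, a test harness can feed in any interleaving of inputs and inspect exactly what every machine wants to do at every step.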
So when you talk about inversion of control here, this goes back to a very old idea, where in the mid ‘70s, if you were writing an application on the Unix system that needed to handle multiple different connections, there was a system call called select that you would use. You’d call select and it would tell you which file descriptors were ready and had data to handle, and you’d say, I’m going to handle the data as far as I can without doing anything that required me to block and wait, and I’ll just keep on calling select over and over. So, it’s inverted in the sense that rather than expressing your code as, I want to do one damn thing after another, like writing it down in terms of threads of control, you’re essentially writing it in terms of event handlers, saying, when the thing happens, here’s how I respond.
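That select-style loop survives directly in Python’s standard library as the selectors module. A minimal sketch of the pattern, using a socketpair so it’s self-contained:

```python
import selectors
import socket

# Handle readiness events in one loop, rather than blocking a thread per
# connection. This is one turn of a classic select-style event loop.
sel = selectors.DefaultSelector()
a, b = socket.socketpair()
received = []

def on_readable(sock):
    received.append(sock.recv(1024))

# Register the callback as the data associated with the file descriptor.
sel.register(b, selectors.EVENT_READ, on_readable)
a.sendall(b"hello")

# Ask which descriptors are ready, then dispatch to their handlers.
for key, _events in sel.select(timeout=1.0):
    key.data(key.fileobj)

sel.close()
a.close()
b.close()
```

The control flow lives in the loop, not in the handlers, which is exactly the inversion being described: the handler only says how to respond when data arrives.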
That’s a really good way to put it.
Then, at some point, again, back in the Pleistocene era where I was a graduate student, there was this relatively new idea of threads, which was really cool, where instead of doing all of this weird inversion of control, we could live in a much better world with lots of different threads, in which you can just express one damn thing after another in a straightforward way, and then there is a minor problem that the threads have to communicate with each other, with mutexes and locks.
A minor problem. Yes.
Right, it’s a minor issue, and it turns out actually that that latter thing is kind of terrible, and a long, disastrous story unfolds from all of that. But it sounds like what you’re saying is the Concord design leans very heavily in the, “No, we’re going to be event-driven in the way that we think about writing concurrent programs.”
Yeah, that’s totally right. We don’t take advantage of using multiple threads because we still have to process every message one at a time, and we have to fully process a message before we can start moving onto the next message.
Right, and your point about how this forces you to pull your state into your state, it’s a funny thing to say, because where else would your state be? But the other place your state can be is in the kind of hidden, encapsulated state inside of the thread that you’re running, right? If you’re writing programs that have lots of threads, then there’s all this hard-to-track state of your program that essentially has to do with the program counter and the stack for each of the threads that you’re running in parallel. We’re saying, no, no, no, we’re going to make all of that explicit in state that’s always accessible. So in some sense, any question you want to ask about the state of a particular process, you can always ask.
Whereas in threaded computations, there are all these intermediate points between when the threads communicate with each other where those questions become really hard to ask. In the event-driven, synchronous style, you reflect all of this in regular data structures, so you can always ask any question that you want to ask about the state of the process and what’s currently going on.
Yeah, that’s a good summary of what this whole story is about. On the testability story, I talked about race conditions and how they’re easy to simulate, and that is purely because you are able to control precisely what goes into the system and when to pause and examine the system.
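A sketch of why this makes races easy to test, in OCaml (a toy example, not Concord’s real message types): if a state machine is just a pure step function, then a “race” is nothing more than a particular interleaving of messages you can write down in a test, and replaying it is always deterministic.

```ocaml
(* A state machine is a pure step function: state -> message -> state.
   Determinism means replaying the same messages always rebuilds the same state. *)
type message = Deposit of int | Withdraw of int

let step balance = function
  | Deposit n -> balance + n
  | Withdraw n -> balance - n

let replay messages = List.fold_left step 0 messages

(* In a test we control exactly which messages arrive and in what order,
   so a race condition is just an interleaving we can enumerate explicitly. *)
let interleaving_a = [ Deposit 100; Withdraw 30; Deposit 5 ]
let interleaving_b = [ Withdraw 30; Deposit 100; Deposit 5 ]

let () = assert (replay interleaving_a = 75)
let () = assert (replay interleaving_b = 75)
(* Replaying the same interleaving twice is always identical. *)
let () = assert (replay interleaving_a = replay interleaving_a)
```

Nothing here depends on timing or scheduling; every “concurrent” scenario is an ordinary list of inputs.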
Okay, so I can see why this is nice from a kind of testing perspective. How does it do in terms of composability? In terms of making it easy to build complicated systems out of little ones, right? That’s one of the most important things about a programming framework is how easy is it to build up bigger and more complicated things out of smaller pieces that you can reason about individually. How does that story work for Concord?
So this is actually one of the nice things that I also really like about Concord is that if you take a single process and look at it, it’s not just a state machine. It’s actually a state machine that’s composed of smaller state machines who themselves might be composed of smaller state machines. What that means is that you can kind of draw your boundaries in lots of different places. You can take a single program that is maybe performing too slowly and split it into two smaller ones or you can take a swath of programs that all have similar kind of needs in terms of data and stitch them together, and it lets you play with the performance profiles of them in a really nice way.
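A hedged sketch of that composition in OCaml (the sub-machines and message types are invented for illustration): a process-level state machine is just the product of smaller step functions, so you can redraw the boundary by moving a sub-machine into a different process without changing its code.

```ocaml
type message = Trade of int | Quote of int

(* Sub-machine 1: counts what it has seen. *)
type counts = { trades : int; quotes : int }

let step_counts c = function
  | Trade _ -> { c with trades = c.trades + 1 }
  | Quote _ -> { c with quotes = c.quotes + 1 }

(* Sub-machine 2: remembers the most recent quote. *)
type last_quote = int option

let step_last_quote q = function
  | Quote px -> Some px
  | Trade _ -> q

(* The process-level machine is just the product of its sub-machines;
   either one could be lifted out into its own process unchanged. *)
type process_state = { counts : counts; last : last_quote }

let step s msg =
  { counts = step_counts s.counts msg; last = step_last_quote s.last msg }

let final =
  List.fold_left step
    { counts = { trades = 0; quotes = 0 }; last = None }
    [ Quote 101; Trade 5; Quote 102 ]
```

Because each sub-machine only depends on the message stream, splitting a slow process in two is a repackaging exercise rather than a rewrite.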
So is scaling by breaking up things into smaller pieces, how does the Concord architecture make that easier? In some sense, I can always say … In any program I want, you can say, “Yeah, I could break it up into smaller pieces and have them run on different processes.” What about Concord simplifies that?
So the thing is, if you have two state machines that are completely deterministic, you know that they’re going to see all the same messages in the same order. You actually don’t need to have two processes that talk to each other. In fact, we forbid that from happening, except over the stream, over the transaction log. If these processes want to know about each other’s state, all you do is include the same state machines in both of the processes. All of a sudden, you know what the other is thinking, because you’ve both processed the same messages in a deterministic way. That’s why this becomes more composable, or decomposable, because you’re just running the same code in different places.
Rather than communicate with a different process in the system, you just think the same way as that process. You look at the same set of inputs. You come to the same conclusions and you necessarily consistently know the same things because you are running the same little processes underneath.
That’s exactly right. Yeah.
That’s neat. Okay. So that’s a lot of lovely things you’ve said about Concord. What’s not so great about the system?
So one of the things that we’ve talked about a few times now is performance, and how things are pretty fast, but they also have to be fast. I think that this is actually a really tricky thing, and something that we have to keep tweaking and tuning over time, because every single process needs to see every single message, even though it might not do something with every one of them. If you’re really slow at handling a message, your program is going to fall behind, and you’re not going to be reading the stream at the rate that you need to. So we have to find ways to improve the performance, which usually boils down to being smart about our choice of data structures.
The way that we write code, we actually try to write OCaml in a way that’s pretty much allocation-free so that we don’t have to run the garbage collector as much as we normally would. Let’s say you have a process that’s trying to do too much; you might want to split that into two processes, and all of a sudden they have more bandwidth in terms of CPU power.
When we’re talking about this basic Concord design, we talked about the fact that the sequencer has to see every single message, but how do you get any kind of scaling and parallelism if every single process has to process every single message? If we all have to do the same work everywhere, then aren’t we kind of in trouble?
We certainly would be if we had to actually see every single message. The way that this works is Concord has a giant message type that is the protocol for each of these messages, and the message type is actually a tree of different message types. So what you can do as a program is you can, very first thing, look at the message type and say, “Oh, this is for a completely different subsystem of Concord and I don’t need to look at it,” and throw it away. When you find the messages that you want, now, you filter down to those and you process them through your state machines like you normally would.
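A minimal OCaml sketch of that tree-shaped protocol and the cheap discard (the subsystems and message names are made up, not Concord’s real protocol): the very first thing a program does with a message is a top-level match, and whole subtrees it doesn’t care about are thrown away without ever being examined further.

```ocaml
(* The protocol is a tree of message types. *)
type market_data = Quote of int | Trade of int
type order_msg = New_order of int | Cancel of int

type message =
  | Market_data of market_data   (* one subtree of the protocol *)
  | Orders of order_msg          (* another subtree *)
  | Admin of string

let quotes_seen = ref 0

(* The cheap first step: a top-level match that discards whole subsystems. *)
let handle = function
  | Market_data (Quote _) -> incr quotes_seen
  | Market_data (Trade _) -> ()   (* in our subtree, but not interesting *)
  | Orders _ | Admin _ -> ()      (* different subsystem: discard immediately *)

let () =
  List.iter handle
    [ Market_data (Quote 10); Orders (New_order 1); Admin "eod";
      Market_data (Quote 11) ]
```

The performance point from the conversation is exactly this shape: the discard branch must be nearly free, because it runs on every message in the firm.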
Got it. So, the critical performance thing is that you don’t necessarily have to be efficient at processing the actual messages, but you definitely have to be an efficient discarder of messages that you don’t care about because otherwise, you’re going to spend a bunch of time thinking about things that you don’t really mean to be thinking about, and that’s just going to slow everybody down. Okay, so one of the downsides of Concord is you have to spend a lot of time thinking about performance. What else?
So I mentioned that we have this message type, and you have to be really careful about it. Sure, over time you’re going to want to change your messages; that’s a normal thing. But let’s say you make a change to a message type and then you have an issue in production that same day, maybe completely unrelated to the change you made. If you try to roll back, all of a sudden you’re not going to be able to understand the messages that were committed to the transaction log earlier in the day. So you have to be really careful about when you release a change that just touches messages, and when you release a change that adds logic, which is more dangerous.
So it sounds like the versioning story for Concord as it is right now is, in some sense, very simple and very brittle. There’s essentially, morally, one type that determines that entire tree of messages, and when you change that one type, it’s a new and incompatible type. There’s no story for having messages that are understandable across different versions. So if you’re going to roll a change to the message type, you want that to be a very low-risk roll, because those rolls are hard to roll back.
That’s right. When you change the world, you have to change the whole world at once. In fact, our very first message comes with a little digest, such that if an app reads it and doesn’t recognize the digest, it will just say, “I’m not even going to try,” and crash.
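A sketch of that digest check in OCaml, using the stdlib’s `Digest` module (the protocol “source” string and function names are stand-ins; Concord’s real mechanism is surely more involved): an app hashes its own protocol definition and refuses to read a log whose first message carries a different hash.

```ocaml
(* The first message on the log carries a digest of the protocol definition.
   An app built against a different protocol refuses to proceed rather than
   misinterpret messages. *)
let protocol_source = "type message = Heartbeat | Order of int"

let my_digest = Digest.string protocol_source

let check_first_message ~log_digest =
  if not (Digest.equal log_digest my_digest) then
    failwith "protocol mismatch: refusing to read this transaction log"

(* Matching digest: we proceed. *)
let compatible =
  match check_first_message ~log_digest:(Digest.string protocol_source) with
  | () -> true

(* Mismatched digest: the app deliberately dies instead of guessing. *)
let incompatible_rejected =
  match check_first_message ~log_digest:(Digest.string "some other protocol") with
  | () -> false
  | exception Failure _ -> true
```

Crashing early is the point: a wrong guess about message layout would silently corrupt a deterministic state machine, which is far worse than a loud failure.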
Got it, do you think this versioning problem is a detail of how Concord was built or a fundamental aspect of this whole approach?
Well, I think that in a lot of ways you have protocols and stability, you need to think about what is your story for upgrading and downgrading, if that is possible. It’s a natural part of a lot of things we do, especially at Jane Street. We’ve gotten really good at building out frameworks for thinking about stability. At the same time, I feel like it’s more of a problem in Concord because of that transaction log and because of the message persistence. It makes it so that as soon as something is written down, you really have to commit to being able to read it back in that same form.
So there’s a kind of forward compatibility requirement: any message that’s written has to be understandable in the future. So you’re maybe not tied to this very simple and brittle, “Well, there’s just one type and that’s that.” You could have programs that are capable of understanding multiple versions at the same time, but the constraint of reading a single transaction log in an unintermediated way does put some extra constraints on you. A thing that you’ll often do in point-to-point protocols is have a negotiation phase. You’ll have two processes, and process one knows how to speak versions A, B, and C, and process two knows how to speak B, C, and D. They can decide which language to speak between them.
When you’re sending stuff over a broadcast network, there’s no negotiation. Somebody is broadcasting messages as fast as they can, and all the consumers just have to understand them. So there is something fundamental at the model level that’s constraining here.
Yeah, that’s right. You can tell I’ve been in Concord for a while when I forget that things like negotiation phases even exist in the rest of the world.
Right, I guess another thing that’s implicit in the way you’ve described all of this is that this is a system that’s run by a coordinated team. So you don’t have a bunch of people rolling new apps and unrolling new apps, forward and backwards, and trying out new things in an uncoordinated way. The system rolls as a unit, and at a minimum, this versioning lockstep requires you to do that.
Yes, that’s right.
Is there anything else about how the system works that requires this kind of team orientation for how you build and deploy Concord applications?
Yeah, actually there’s a lot. When I mentioned that there were multiple Concord instances at Jane Street, what that actually means is that each of them has its own transaction log, its own sequencer, its own re-transmitters for filling in gaps when a message is dropped. They have their own config management for the entire system, which can get pretty complicated because it is a hard thing to configure, and they have their own cabinets of hardware, including all the physical boxes that you need to run the processes and specialized networking hardware to do all sorts of things. So, every time you want to make a brand new Concord instance, you need to do all of those things. You need to tick all those boxes.
The overhead there is massive and so, we actually have been shying away from creating new Concord instances because it’s just so expensive just from a managerial standpoint.
Got it, so the essential complaint here is there is a certain kind of heavyweight nature of rolling out Concord to a new thing. You want to roll out a new one and there’s a bunch of stuff, and maybe it kind of makes sense at the given scale, the amount of work you have to do configuring it and setting up is proportionate to a large complex app and pays back enough dividends that you’re happy to do it, but if you want to write some small simple thing, on top of Concord, it’s basically a disaster. You could not contemplate doing it.
So this segues nicely into what you’re working on now, where there’s a new system that’s been being worked on in the last few years in the Concord team called Aria, which is meant to address a bunch of these problems, maybe you can give us a thumbnail sketch of what is it that Aria is trying to achieve.
Yeah, so a lot like Concord, Aria is both a transport mechanism and a framework for writing applications. We actually liked the Concord inversion-of-control style of writing things so much that we are basically forcing it onto people who want to write things on Aria, because we really just think it’s the right way to write these kinds of programs. So, Aria itself, you can actually think about it as just a Concord instance off in the corner in its own little hardware cabinet, and it’s managed by the Aria team. If you want to write an application that uses Aria as a transport, then you don’t need to do much. You just need to get your own box, put your process on it, and talk to our core infrastructure through a special proxy layer.
So the idea is that Aria is trying to be a general-purpose version of Concord.
Right, so anyone can use it for whatever their little application is. So we talked about all the ways in which Concord does not work well as general-purpose infrastructure. What are the features that you’ve added to Aria to make it suitable for people broadly to just pick it up and use it?
So, first of all, the amount of overhead and management is much smaller. There is no configuration story for individual users now because we do all the configuration of the core system. So, if a user needs to configure their app, yeah, they go and do that, but they don’t have to worry about us in any way. That ultimately lowers the management overhead drastically. So if I have my simple toy app that I wanted to get working on Aria, there’s actually a really short time from, “I read the Aria docs” to “I have something that’s sequencing messages on a transaction log right now.” And that’s a really useful thing for people getting into the system for the first time.
Got it. So step number one, shared infrastructure and shared administration.
One of the things that we’ve talked about is performance and how hard performance is. In Concord, you have to read every message that every process has sequenced and this is obviously just not going to scale in a multi-user world. If there is one application that someone builds that sequences hundreds of thousands of messages and then, someone else has their tiny toy app that only sequences a handful, you don’t want to have to read 100,000 plus a handful in order to do your normal job. So, what we do is at this proxy layer that I alluded to, we have a filtering mechanism. So you can say, “I’m actually just interested in this small swath of the world, so please give me just those messages. Don’t bother me with all of the other 100,000 massive swarm of market data messages.”
It sounds like making a system like Aria work well still requires a lot of performance engineering, but the performance engineering is at the level of the sequencer and of the proxy applications, and user code only needs to be as fast as that user code has stuff that it needs to process. If you pick various streams of data to consume, and you want to read those messages, well, you have to be able to keep up with those messages, but you don’t have to keep up with everything else.
That’s right and scaling the core is actually much easier than scaling an application because when we get a message from the stream, we don’t need to look at its payload at all, because we actually can’t understand its payload. It’s just an opaque bag of bytes to us. We just have to process the message and get those bytes out or persist them onto disk or whatever we need to do with them. So the biggest bottleneck there is just copying bytes from one place to another.
Got it. So this is another kind of segmentation of the work, right? There’s like some outer envelope of the basic structure of the message that Aria needs to understand and presumably it has to have enough information to do things like filtering, right? Because it can’t look at the parts of the message it doesn’t understand and filter based on those. Then, the whole question of what the data is and how it’s versioned is fobbed off on the clients. The clients have to figure out how they want to think about their own message types and Aria kind of doesn’t intrude there and the core system is just blindly ferrying their bytes around.
That’s exactly right.
So what is the information that Aria uses for filtering, how is that structured?
We have this thing that we call a topic, and a topic is a hierarchical namespace, where you write data to some point in that namespace and you can subscribe to data from some subtree of that namespace. So you can separate your data into separate streams and later on have a single consumer that joins all those streams together. The important thing about these topics is that even among topics, there is a single canonical ordering that everyone gets to see.
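A minimal sketch of that namespace in OCaml (the representation is an assumption for illustration; Aria’s real topic type is internal): if a topic is a path of name components, then subscribing to a subtree is just a prefix test on that path.

```ocaml
(* A topic is a path in a hierarchical namespace; a subscription to a
   subtree is a prefix test on that path. *)
type topic = string list   (* e.g. [ "market-data"; "us"; "equities" ] *)

let rec subtree_matches ~subscription ~topic =
  match subscription, topic with
  | [], _ -> true   (* the whole subtree under the subscription point *)
  | s :: sub', t :: topic' ->
    String.equal s t && subtree_matches ~subscription:sub' ~topic:topic'
  | _ :: _, [] -> false

let sub = [ "market-data"; "us" ]

let matches =
  subtree_matches ~subscription:sub ~topic:[ "market-data"; "us"; "equities" ]

let does_not_match = subtree_matches ~subscription:sub ~topic:[ "orders"; "us" ]
```

The proxy layer can apply a test like this per subscriber, so each application only ever sees the slice of the total order it asked for, while the order among those messages is preserved.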
Why is that ordering so important?
So the ordering is actually important if you ever want to interface with another team. So let’s say that I have a team that’s providing market data on Aria and all they’re doing is putting the market data out there on a special topic. If I have an app that wants to do something a little bit more complicated and transactional, it might need to rely on market data for something like regulatory reporting, knowing that, “Oh, at the time that this order happened, the market data was this.” It’s really important for reproducibility. Let’s say something crashes and you bring it back up; it sees those messages in the same order, and the relative ordering of the market data and the order messages doesn’t change.
This is, I think, in contrast… when someone hears the description of this system, “Oh, we have messages and we shoot them around and we have topics and we can filter,” it starts to sound a lot like things like Kafka. I think this difference about ordering is pretty substantial. The way Kafka works, as I understand it, there’s some fundamental notion of a shard, which is some subset of your data, and topics can have collections of shards. You can subscribe to Kafka topics, but you’re only guaranteed reproducible ordering within a shard. You can make sure that your topic is only one shard, which limits its scalability, but at that point, at least you’re guaranteed an ordering within a topic.
Between topics, you’re never guaranteed an ordering. So if all of your state machines are single-topic state machines, that’s in some sense fine, because every state machine you build will unfold the same way as long as it’s just subscribing to data from one topic. But the moment you want to combine multiple topics, have multiple state machines, and make decisions based on what those different state machines say concurrently, at the same time, decisions like that become non-reproducible unless you have that total ordering. When things are totally ordered, you can replicate even things that are making decisions across topics, but without that total order, it becomes a lot more chaotic and hard to predict what’s going to happen if you have two different applications running at the same time.
Actually even worse, if you do a replay, you can sometimes have extremely weird and unfair orderings, right? You’re listening to two Kafka topics and one of them has a thousand messages and one of them has a million messages. You can imagine that the thousand-message topic is going to finish replaying back to you way faster than the million-message topic. So stuff is going to be super skewed on replay and your replay from scratch is going to behave very, very differently than your live system did.
Yeah. It really all goes back to that point of determinism and how central that is to the Aria and Concord story.
Okay, so far what do we have? We have a proxy layer, which insulates applications from messages they don’t care about. We have this hierarchical space of topics into which people can organize their data. Is there anything else interesting about the Aria design, other pieces that distinguish it from how Concord works?
So one of the things that we also kind of talked about was the messaging protocol and how Concord’s got this one giant message protocol that everything needs to adhere to. But in Aria, like we mentioned, it’s kind of up to the user. So, each user might use a different protocol for a different topic. So you now have lots of different protocols and in fact, you can even get a negotiation-like system by, say, writing out one protocol of data and then writing out a newer version of it and then letting another subscriber choose which one they want to do. So if I don’t know how to read V3 yet, I can fall back to V2 and be happy for the day.
Got it, although in that case, do you have a guarantee that you get things in the same order when you consume data … if what we’re doing is we’re taking the data that’s in different versions and publishing it on different topics, are those necessarily going to be published in the same order? What kind of ordering guarantees do you get in that environment?
No, you’re right. There are no ordering guarantees in that sense. You can do your best to publish them one after the other and get them as close to each other as possible, but it’s a trade-off that you’d have to make if you were trying to do some sort of upgrade, or trying to service older clients that haven’t upgraded yet.
Right. So I guess you can make sure that your messages are published in the same order within a topic. If I have one message and I want to publish it in versions V1, V2, and V3, I can publish it first at V1, then at V2, then at V3, and then do the next message at V1, then V2, then V3. So within a topic, people will get them in the same order. But if I have one app that consumes V1 along with some other, totally unrelated topic for some other state machine, and a different app that consumes V2 along with that same other state machine, those streams can be interleaved in different ways. I guess I can still reliably replicate things that are consuming the same versions; I don’t totally lose the ability to replicate. Some noise is added between applications that are using different versions of the data.
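A sketch of the negotiation-like fallback just described, in OCaml (the version names and selection function are hypothetical): with each protocol version published on its own topic, a subscriber simply picks the newest version it knows how to read.

```ocaml
(* Each protocol version is published on its own topic, oldest first.
   A subscriber picks the newest version it understands. *)
let published = [ "v1"; "v2"; "v3" ]

let pick_version ~i_understand =
  List.rev published   (* newest first *)
  |> List.find_opt (fun v -> List.mem v i_understand)

(* A client that hasn't learned v3 yet falls back to v2... *)
let old_client = pick_version ~i_understand:[ "v1"; "v2" ]

(* ...while an up-to-date client reads v3. *)
let new_client = pick_version ~i_understand:[ "v1"; "v2"; "v3" ]
```

Unlike a point-to-point negotiation, nothing is agreed between publisher and subscriber; the publisher writes every version it supports, and each reader makes its choice unilaterally.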
You’re kind of hitting on a point that’s really subtle and we talked about how message versioning was so hard to get right, but there’s actually a whole second class of message versioning that we didn’t talk about, which is the state machines have code that interprets the messages, and that itself is like a form of versioning. If you have a state machine that reads a bunch of messages and then crashes, and then let’s say you roll a different binary that interprets them differently and start it up, it’s going to read the same messages and end up at a different state. So, there is actually like a subtle version attached to the source code itself.
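A toy OCaml illustration of that subtle point (the messages and step functions are invented): two binaries whose step functions interpret the same log differently will rebuild different states, even though neither the log nor the message type changed.

```ocaml
(* The step function itself is a version: two binaries that interpret the
   same messages differently rebuild different states from the same log. *)
type message = Add of int

(* Binary A: sums every payload. *)
let step_a acc (Add n) = acc + n

(* Binary B: a later "bug fix" that ignores non-positive payloads. *)
let step_b acc (Add n) = if n > 0 then acc + n else acc

let log = [ Add 3; Add (-1); Add 4 ]

let state_a = List.fold_left step_a 0 log   (* interprets -1 as a debit *)
let state_b = List.fold_left step_b 0 log   (* drops the -1 entirely *)
```

Both binaries are “correct” against the message type, yet after a crash-and-restart with the new binary, the rebuilt state silently disagrees with the state the old binary had; this is the versioning that lives in source code rather than in the protocol.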
Right, and Concord is kind of rolled as one big application that is many different processes, but it’s in some ways less a distributed system and more a supercomputer on which you run a single program. Here, in the context of Aria, you’re talking about a complicated zoo of many different programs. So now you open up the somewhat new question of subtly different state machines that are consuming the same stream of transactions but maybe reading them differently, and all of a sudden the guarantees we have about the perfection of the replication between different copies are much harder to think about.
That’s right. The story there is that yes, a lot of the burden will fall on the users using the system to make sure that they’re configured correctly.
Not just configured correctly, but that people are making changes in a reasonable way: changes such that, even if they change how the state machine runs, they’re not changing its semantics in ways that are deeply incompatible.
So a thing that you haven’t talked at all about is permissioning. If you have one enormous soup of topics, there’s a big hierarchy and anyone can publish anywhere, it seems like awful things are going to happen, so what’s the permissioning story here?
Well, we give clients the ability to write to certain topics or certain trees of topics, and within those trees, they’re allowed to create their own topics. So they can carve out a whole segment of Aria that they can read and write to freely, but this is still managed to some degree by the Aria team. We still want to partition up the topics in a reasonable way. It’s not like everyone can just create whatever topics they want.
So part of the central administration of the system is essentially saying users, X, Y, and Z, you can use this sub-tree. Users A, B, and C, you can use that other sub-tree. And preventing users from stomping on each other, that way.
Yeah, that’s a simple way of putting it.
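A sketch of that write-permissioning scheme in OCaml (the grant table, users, and topic names are all hypothetical): each user is granted one or more subtrees, and a write is allowed only if the target topic sits under one of them.

```ocaml
(* Each user is granted subtrees of the topic namespace; a write is
   allowed iff the target topic sits under one of the user's prefixes. *)
let grants =
  [ ("desk-a", [ [ "desk-a" ] ]);
    ("desk-b", [ [ "desk-b" ]; [ "shared"; "scratch" ] ]) ]

let is_prefix prefix topic =
  let rec go = function
    | [], _ -> true
    | p :: ps, t :: ts -> String.equal p t && go (ps, ts)
    | _ :: _, [] -> false
  in
  go (prefix, topic)

let can_write ~user ~topic =
  match List.assoc_opt user grants with
  | None -> false
  | Some prefixes -> List.exists (fun p -> is_prefix p topic) prefixes

let ok = can_write ~user:"desk-b" ~topic:[ "shared"; "scratch"; "tmp" ]
let denied = can_write ~user:"desk-a" ~topic:[ "desk-b"; "orders" ]
```

Because a grant covers a subtree rather than a single topic, users can create new topics under their own prefix without the Aria team touching the table.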
Do you think differently about permissioning reading and writing to a given topic?
Well, actually it’s a little interesting because reading is a lot harder to get right. One of the ways that we actually allow … I talked about these proxy layers. One of the ways that we let you interface with these proxies is with TCP, so you can connect to the proxy and just get a stream of all the messages over TCP and it’s really simple. However, there’s a second way of doing it and that’s UDP multicast, bringing multicast back. So we have multicast both inside the core of Aria and outside at the proxy layer. The reason why that’s interesting is because you can’t really put permissioning on multicast packets. When you’re sending them out into the network, there’s nothing stopping someone from listening to an address and receiving packets on that address.
The only thing that you can really do is make it hard for them to figure out what address they should be listening to.
I mean, I guess, technically you can do at the network layer segmentation of which hosts can this multicast be routed to and which host can that multicast be routed to, but it’s not easy to hit those kind of controls from the application layer, I would think.
That’s right. That is kind of like a management nightmare in terms of switch rules and trying to keep things in sync and like you said, the Aria team doesn’t really have a lot of ability to dictate what the networking team does and configures.
So is the current model that you have write permissioning, but anybody can read anything?
That is explicitly what we have right now, but we are actually working on adding read permissioning in a sort of best-effort, trust-based way, which is to say, if you are using our libraries to connect to Aria, which you probably are, we will validate that you are able to read the data before we even start hooking up your listeners to the stream.
Right. I think even if you don’t have any security concerns about the confidentiality of the data that’s in Aria, you still might want this kind of permissioning, essentially for abstraction reasons. To put it another way, some of the topics to which you sequence data, I imagine you think of as the internal implementation of your system; you freely version it and change it and think about it with respect to how you roll your own applications, without worrying about outside users. And maybe there are some topics of data that you publish as a kind of external service that you want people to use. I would imagine you would dearly like people to not accidentally subscribe to your implementation, because otherwise you’re going to walk in one morning and roll your system, and the whole firm is going to be broken because people went and depended on a thing that they shouldn’t have depended on.
That’s exactly right. We have those things internal to Aria too. We actually keep a lot of the metadata about the current session of Aria on Aria. It’s really quite nice because we get all of the niceties of Aria from being able to inspect the system at any point in time. We don’t want other people to be able to read and depend on those things for that exact reason.
This points to another really nice thing about these kinds of systems that I think is not well understood by people who haven’t played around with them before. You sort of think, “Oh, there are these messages, and we put the data onto the system,” whatever the core transactions are: if you’re running a trading system, maybe you put orders and market data on there; if you’re Twitter, maybe you put tweets on there or something. The other kind of thing you put on there is configuration data, right? It’s nice to have this place where you can put dynamically changing configuration and have a way of thinking consistently about exactly when the new config applies and when the old config applies.
And the fact that you just integrate it into the ordinary data stream, gives you this lovely model for talking about configuration in a way that’s fully dynamic and also fully consistent across all the consumers of that data.
Yeah. The very simple example there is, let’s say in the middle of the day we want to give someone write abilities on a new topic. How do you do that? Well, you just send an Aria message that says, they’re now allowed to write to this topic.
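A small OCaml sketch of that in-band config idea (message shapes invented for illustration): a permission grant is just another message on the stream, so every consumer applies it at exactly the same point in the total order, and there is never ambiguity about which writes came before or after it.

```ocaml
(* Permissions live on the stream itself: a grant is an ordinary message,
   applied by every consumer at the same point in the total order. *)
type message =
  | Grant_write of string * string   (* user, topic *)
  | Publish of string * string       (* user, topic *)

type state = { allowed : (string * string) list; rejected : int }

let step s = function
  | Grant_write (u, t) -> { s with allowed = (u, t) :: s.allowed }
  | Publish (u, t) ->
    if List.mem (u, t) s.allowed then s
    else { s with rejected = s.rejected + 1 }

let final =
  List.fold_left step
    { allowed = []; rejected = 0 }
    [ Publish ("desk-a", "orders");      (* before the grant: rejected *)
      Grant_write ("desk-a", "orders");  (* mid-day config change *)
      Publish ("desk-a", "orders") ]     (* after the grant: accepted *)
```

Because every replica folds over the same sequence, they all agree on which publish attempts were rejected; there is no window where two consumers disagree about whether the grant had taken effect.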
One thing that strikes me as interesting about all this is there’s a lot of work in various parts of the world on this kind of transaction log. We’re talking about transaction logs as if this were in some sense a new and interesting idea, but it’s one of the oldest ideas in computer science. There’s a paper from the mid ‘70s that introduces this notion of distributed transaction logs, and all sorts of things are built on it. Standard database management systems use very similar techniques, as does 30 years of distributed systems research, and then the entire blockchain world has rediscovered the notion of distributed ledgers and Byzantine fault tolerance. This idea just keeps on being invented and reinvented over and over.
The thing about the Aria work that strikes me as interesting and new is there’s a bunch of work about not so much the substrate of how do you get the ledger to exist in the first place, but how do you use it? How do you build an open, heterogeneous system with multiple different clients who are working together? I feel like I’ve seen less work in that space. I guess in some sense, the whole smart contract thing is a species of the same kind of work. That’s a way of writing general-purpose programs and sticking them onto this kind of distributed ledger based system. But Aria is very different from that and it’s kind of a whole different way and a more software engineering-oriented way of thinking about this.
One of the things that’s been interesting in following along with all of this work is just seeing the working out of these idioms. I don’t know, I feel like idioms like this in computer science are super powerful and easy to underestimate. I don’t know who came up with the idea of a process when designing early operating systems, but that was a good idea and it had a lot of legs, but I feel like there’s a lot of interesting ideas in how you guys are building out Aria as a thing that makes it easy for people to build new complex applications, it has kind of a similar feel to me. So along those lines, how’s it going? I know there’s a system that’s been in progress, you’re not the only person who’s worked on it.
There’s been a whole team of people working on the Aria core technology, and it’s starting to be used. How is that going? Who’s using it, and what’s the process been like of getting it out into real use at Jane Street?
Yeah, so we have a system, it’s in production. We have deployments in several regions now and it’s actually being used by teams at Jane Street in a production manner. So I would say so far, so good. Things are looking good. We specifically don’t have too many people using it because we want to make sure that when we get a new client, there’s always going to be some new requirements, and we want to make sure that we’re able to address them in a relatively straightforward way, and a lot of the decisions that we make, we think really, really hard about because the decisions that we make now are going to be ones that we’ll have to live with for a long time. So when we’re thinking about new features, we’re constantly trying to figure out ways in which we can do them in the best possible way.
On one hand, we are actually missing a bunch of features that would be really nice. This has kind of worked out just fine for our current clients, but we know that there are upcoming ones where we’re going to need some new features.
Can you give me examples of features that you know you need but they’re not yet in production?
A small example of this is rate limiting. It’s a thing that you might expect would be there from day one, perhaps, but it’s not. So a single client can, say, get into a bad state where they’re now sending the same message over and over and over again, and we will happily inject those into the Aria stream as fast as possible. This is not great, because it can actually hurt the performance of the system for other people too. There’s lots of isolation, but there are still a limited number of processes, and limits on what they can do, so there is some sharing and overlap.
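The per-client limit being discussed here is often implemented as a token bucket. Here’s a minimal sketch in Python; the names and parameters are illustrative, not anything from Aria:

```python
import time

class TokenBucket:
    """Per-client rate limiter: allows bursts up to `capacity` messages,
    refilling at `rate` messages per second."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.now = now          # injectable clock, for testing
        self.last = now()

    def try_acquire(self):
        """Return True if the client may inject one more message."""
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A sequencer front end could keep one such bucket per client and reject or queue injections when `try_acquire` returns False, protecting the shared processes and disks downstream from a client stuck in a resend loop.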
Focusing in on that for a second. If you have one client who’s injecting an arbitrary amount of data into the system, that data is not going to hit the end client, who’s subscribing to a different topic, but all of the internal infrastructure, the sequencers and the republishers and whatever, all of those interior pieces, the proxies that are sitting there, they have to deal with all of the data. So, if you have one injector who’s running as fast as it can, it is going to slow down all of those pieces, even if the actual end client isn’t exactly directly implicated, the path by which they receive data is.
That’s right, and not to mention, we are still currently in a world where we’re writing everything, every single byte that we get down to disk, because we have no idea. Maybe someone has to start up and say, “Please fill me in from the beginning of the day.” We’re going to have to go to disk to get that. So there’s actually just also limited disk space. If every client was hitting us with max throughput, as fast as possible, we might be concerned about disk.
So why aren’t you more worried about rate limiting? That all sounds kind of terrible.
It is. Yeah. That’s definitely the first thing on our list right now to work on. It just happens to be a relatively easier problem, compared to another thing that we’ve been thinking about a lot, which is snapshots. Snapshots tie in a lot with the thing that I actually just said, which is, let’s say your app starts up and it’s the middle of the day, it needs to receive messages, the only thing that it can do right now is start at the beginning of the day and read every single message on its set of topics that it needs to read up to the current tip of the stream.
And it’s maybe worth saying that whole phrase, “from the beginning of the day,” is kind of a weird thing to say. Like, which day, whose day? When does Hong Kong’s day start? When does the day start in Sydney? There are different notions of day, and I think this just reflects the thing you talked about before, about how this technology has grown out of a trading environment, where there’s a concept of which trading region you’re in, and a notion of a day’s begin and end. That actually is messy and complicated and not nearly as clean as you might naively hope. It worked much better 15 years ago than it does today, but still, this notion of a daily boundary is a real thing that lots of our systems have.
Yeah, and to even expand on that, well, first the thing I was going to say was that obviously that might be really expensive. It might take a lot of time to replay all those messages and we want to prevent that, even going further than that, we are trying to think about a single session as not just a day, but we want to expand it to be a whole week. So like, wow, it’s like seven times what we’re doing already. So it would be really bad if an app that starts up on Sunday and an app that starts up on Friday night had completely different performance characteristics because one of them has to replay maybe hundreds of gigabytes worth of data before it can even do anything. So, this is where we introduce snapshots. Snapshots being, “Hey, here’s a chunk of the state at a given point in time. You don’t need to read messages older than this. Just load up this state and start from there.”
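The recovery choice being described here, replaying the whole log versus loading a snapshot and replaying only the suffix, can be sketched as a fold over the stream. This is a hypothetical illustration in Python, not Aria’s actual interface:

```python
def recover(apply, initial, log, snapshot=None):
    """Rebuild application state from a message log.

    Either fold over the whole log starting from `initial`, or start
    from a (state, index) snapshot and fold over only the suffix.
    Both paths must yield the same state, because the state is a pure
    function of the message stream.
    """
    if snapshot is None:
        state, start = initial, 0
    else:
        state, start = snapshot
    for msg in log[start:]:
        state = apply(state, msg)
    return state
```

With a week’s worth of messages, the snapshot path only touches the tail of the log, which is what makes an app starting up on Sunday and an app starting up on Friday night behave alike.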
And snapshots from a finance world perspective, this is a familiar concept because that’s what happens for market data feeds, right? If you go to any exchange, they have two things. They have a reliable multicast stream, very similar to reliable multicast that we use inside something like Concord or Aria. They have a gap filler, which is actually part of reliable multicast. And they have a snapshot service. The gap filler is just the thing we talked about as a re-transmitter, where you go and request a packet you missed. Then, the snapshot service is this other thing which gives you some summary of the state, which is meant to be enough for you to go forward from there and compute.
In the context of a market data feed, that snapshot usually consists of the set of orders that are currently open and maybe the set of executions that have happened on the day, because typically there are way more orders than there are executions. So you don’t want to replay all of the orders to someone if you can get away with it, but replaying all of the executions is not so bad necessarily.
Yeah, you’re touching on how it’s actually not as simple as I just described it originally, because you might need different snapshot techniques for different services. Some of them need to retain more data than others, and some of them, you might want to say, “Hey, this single topic has two different ways of being interpreted.” You can interpret it in a way that you need a really large snapshot in order to continue reading it or you can interpret it in a way that’s more simple and straightforward, like the market data example.
Right, and even within market data, you could imagine … let’s say I have an application whose goal is to print out the total number of open orders at any moment in time, right? If you have that application and it gets, as a snapshot, the current state of the book and the set of open orders, that’s cool. It can replay, and from the moment it gets to the snapshot, it knows how many open orders there are, and with every further update it gets to the book, it can update its view of the book and therefore update its count, and everything is cool. But if you have an application that needs to count the number of orders that happened on the day, it’s just dead in the water with that snapshot. That snapshot does not let it recover its state, because just knowing the current set of open orders doesn’t tell you how many orders have been sent in a given day.
It’s not like there can’t exist a snapshot that would solve the problem. You would just need a different one. So there’s a way in which replaying from the beginning of time is a general-purpose recovery mechanism. It doesn’t matter what you’re doing, as long as what you’re doing is a function of the stream of data, you can do it on a replay just as you did it live. Whereas snapshots are kind of by their nature, to some degree, application specific. You can do some kinds of computations with some kind of snapshots and others, you can’t.
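To make that concrete, here is a hypothetical sketch in Python of the two applications just described, both as folds over the same order stream. The event encoding is made up for illustration:

```python
from functools import reduce

# Events on a made-up order stream: ("add", order_id) or ("cancel", order_id).

def open_orders(state, event):
    """State machine 1: the set of currently open order ids."""
    kind, order_id = event
    return state | {order_id} if kind == "add" else state - {order_id}

def total_orders(state, event):
    """State machine 2: the count of all orders added today."""
    kind, _ = event
    return state + 1 if kind == "add" else state

events = [("add", 1), ("add", 2), ("cancel", 1), ("add", 3)]

live = reduce(open_orders, events, set())   # {2, 3}
count = reduce(total_orders, events, 0)     # 3

# An open-orders snapshot (`live`) lets a restarted open_orders app carry
# on, but total_orders can't recover from it: len(live) == 2, yet 3
# orders happened today. It needs a different snapshot, or a full replay.
```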
Yeah, I think that’s exactly right.
What does that mean for providing a snapshot service for Aria?
Well, this goes back to the thing, we have to think very hard and carefully about it. One of the ways that we are approaching the problem is to give clients the ability to build their own snapshot servers. So you can say, “I’m a client and I know that I’m using these topics and hey, I’m also using these other topics from a different team. I’m going to put them all together into one snapshot server that can compute a snapshot that is tailored to my use case.” And by doing that, by empowering clients to build these servers out, you kind of solve the problem at the user level.
Got it, but what’s interesting is that the thing that we describe the exchanges doing is kind of the opposite of that and it’s demonstrably very useful, right? Because I feel like the market data story highlights that snapshots are application specific, but they’re not completely application specific. There are actually lots of different state machines that you can reasonably build on top of the kind of snapshot that exchanges provide. So you can do a lot of things, but not everything. So the thing you’re describing now is going all the way in the direction of the application specific idea. We are going to give users just the full ability to customize the kind of snapshot they want, and they can remember for restarting, whatever it is they want to remember.
Each application is on its own. I guess you can imagine something in between, where you allow someone who’s providing a general-purpose system to also provide a kind of general-purpose snapshot.
Yeah, exactly. To me, this feels a lot like composing state machines. If I’m building a service that’s using another topic, I might use their state machine to consume that topic. In fact, I must use their state machine to consume that topic. In the same way, if I’m building a snapshot server, I will probably actually use some components that that team has released to compute the snapshot, but I might embellish it in some way, like add some details to it, or maybe compact it in ways that I can get away with to make it simpler. There are lots of degrees of freedom here, and I think composability is your answer to them.
Right, and even if the only snapshot mechanism you give people at the Aria level is this kind of build-your-own-snapshot-service for your application, you can still use the ordinary modularity tools, right? You can have a library that somebody wrote for their particular stream that provides most of the code you need for computing their part of the snapshot. So it’s not like you’re totally lost from a reusability perspective, even if you decide to build completely per-application snapshot servers.
You said before, there’s a small number of teams that are using Aria right now. Can you say more about what teams these are and what kind of things they’re building?
Yeah. So Concord and Aria are part of a department at Jane Street called client-facing tech. A lot of the other current clients of Aria are also in this same department. As the name alludes to, we write programs that interface with clients directly. What we end up dealing with a lot is something called high-touch flow, which means that unlike our kind of traditional Jane Street systems, where it’s just a lot of computers making decisions about what to do, these are systems that actually involve a lot of human interaction. So it actually gets a lot more complicated, because even though we can automate some things, sometimes we just have to kick it back to a human, to look at it and examine it and decide what to do. So that’s the kind of tooling and infrastructure that we’re building around handling new types of orders and automating them as much as we can.
Right, and I guess one of the advantages of that is just the fact that people are kind of organizationally close, makes it easier to kind of coordinate and prioritize together while you’re still in the phase of building out the core functionality of Aria itself.
That’s right. Yeah. Some of these people have also worked on Concord, so they are pulling their prior experience into the new Aria world.
I guess the other thing that’s funny about Aria is that it really is a different way of programming and it takes some time to get used to, which, to say a straightforward thing, is in some sense just a misfeature, right? It’s great to learn things, but having a system that requires people to learn stuff is kind of bad on the face of it. So I think you and I both think that it’s worth learning: it’s a valuable thing to learn, but it’s also hard. It’s a different way of thinking about designing your system. And by starting with people who are near at hand, it’s easier to slowly acculturate them and teach them how to approach the problem in the way that you think it needs to be approached. Have you thought much about what the process of evangelizing something like Aria looks like, as you try and reach out to the firm more broadly?
Jane Street is kind of big now. We have like 1,700 people, hundreds and hundreds of engineers across a bunch of different teams. How do you think about the process of teaching people a new way of approaching the architecture of their systems when they’re more spread out, and they’re people you don’t know and talk to every day?
Yeah. I think one of the things that we’ve learned the most from Concord is that there are patterns that arise. Concord has, as I mentioned, hundreds of different processes running, and several dozen of them are completely different types of applications. What this means is that over time, we’ve kind of converged on a few patterns. For example, we have something that we call the injector pattern. It’s a kind of app that just consumes some external data, puts it into a format that it can put onto the stream, and makes sure the stream is up to date with the external world. It’s a very simple thing once you actually sit down and get a chance to do it. So really, what I want to do is take these patterns and encapsulate them in documentation and examples.
Maybe even a sort of, let’s call it a bootcamp, where in order to learn how to do things, you have to implement your own injector, and with the extensive testing that we can build around it, you get immediate feedback on whether it’s working or not.
So for something like this kind of simple injector, which the way you describe it sounds very simple, what are the kind of mistakes that you think someone who comes to Aria, not having really learned about it and doesn’t have the advantage of the examples or the bootcamp or the documentation, what kind of things are they going to get wrong when they try and build an application that does this injecting?
Going back to the inversion of control thing that we talked about way earlier on, the applications themselves do not send messages. They instead wait for something to say, “What is the next message you want to send,” and then they propose it. However, there’s kind of something subtle there, which is there’s a chance that they just keep proposing the same message over and over. At some point, you should be updating your state in such a way that you change your mind about what you want to propose. So let’s say I want to sequence message A. Once A is sequenced, I probably don’t want to sequence it again. I probably want to sequence B, but there are certain classes of bugs where you actually don’t update your state correctly and now you’re just stuck sequencing A over and over and over again and that will just continue to infinity.
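As a sketch of that loop (illustrative Python, not Aria’s real API): the framework polls the app for its next proposal, and the app must shrink its pending set when it sees its own message come back sequenced. Dropping that update is exactly the bug that proposes A forever.

```python
class Injector:
    """Proposes each pending message until it observes it sequenced."""

    def __init__(self, pending):
        self.pending = list(pending)   # messages the stream doesn't know yet

    def propose(self):
        """Framework asks: what message do you want to sequence next?"""
        return self.pending[0] if self.pending else None

    def on_sequenced(self, msg):
        """Called for every sequenced message, including our own.
        Forgetting to update state here is the infinite-loop bug:
        propose() would keep returning the same message forever."""
        if self.pending and self.pending[0] == msg:
            self.pending.pop(0)
```

Note that `propose` is idempotent by design: it can be called any number of times between sequencing events and always names the same next message, so it doesn’t matter how often the framework asks.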
By the way, I always think of this way of styling an application as writing your application as a problem solver, right? So the application is doing stuff, interacting with the outside world, and it decides that there’s a problem with the world. The problem is it knows about some order that the stream doesn’t yet know about. Then, that means it wants to tell the stream about that. So, when asked what message you want to send, “Oh, here’s a message that will tell you that this order existed in the outside world.” Then, as you said, once you receive the message that said, that thing has occurred, well, now you don’t want to send it anymore, because the problem has been resolved, and what’s nice about that is you don’t care who resolved the problem, right?
You might have resolved it, somebody else might have resolved it. Maybe you have a whole fleet of things that are actively going off and trying to resolve these problems, and you get to be agnostic to who the solver is. It means if you crash and come back, and the problem is solved, you don’t care, you don’t see the problem there anymore. And the buggy version of this system is like the hypochondriac who thinks there’s always a problem: whatever is going on, it’s constantly trying to send a message resolving a problem that isn’t really a problem anymore, and that’s how you get the infinite loop bug.
Yep. That’s a great way to put it.
One of the things you can do there is you can create documentation to explain things to people. Is there also just some notion of trying to bake in abstractions that help people build these applications in ways that avoid some of the common pitfalls?
Yeah, totally. I’m a big fan of abstractions myself. So, an example of an abstraction that we built: you can take a pipe, which is a core library data structure that lets you push things onto one end and then lazily pull them off, and you can attach that into an Aria app as a state machine, where pulling things off means injecting them into the Aria stream, sequencing them. Suddenly you can use an interface that you’re familiar with, and it’s going to do the right thing of only sending each thing once.
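A rough sketch of what such a pipe adapter might look like, in hypothetical Python standing in for the OCaml pipe abstraction: the familiar push interface on one side, and the propose/on-sequenced protocol on the other, sending each item exactly once.

```python
from collections import deque

class PipeStateMachine:
    """Wraps a pipe-like queue as an Aria-style state machine: items
    pushed on one end are proposed for sequencing exactly once each."""

    def __init__(self):
        self.queue = deque()
        self.in_flight = None   # the item currently being proposed

    def push(self, item):
        """The familiar pipe interface: just push things on."""
        self.queue.append(item)

    def propose(self):
        """Framework asks for the next message; keep re-proposing the
        in-flight item until we see it sequenced."""
        if self.in_flight is None and self.queue:
            self.in_flight = self.queue.popleft()
        return self.in_flight

    def on_sequenced(self, msg):
        """Our item hit the stream: stop proposing it and move on."""
        if msg == self.in_flight:
            self.in_flight = None
```

The retry-until-sequenced bookkeeping lives entirely inside the adapter, so a user who just calls `push` gets the correct behavior for free.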
So we’re constantly looking at code that people are writing and looking for places that we can build abstractions around the patterns or anti-patterns that they are doing.
In terms of how broadly it’s used at Jane Street, what are you hoping to see out of Aria a year from now? If you look back a year from now, what will you count as a successful rollout of the system?
Well, I would love to see more users on it. I’m assuming that at some point in the near future we’re going to see a user that has a much higher bandwidth requirement than other users, starting to push the system to its limits and forcing us to get a little bit more clever about how we shuffle those bytes around internally, in order to both be fair to all of the clients that are using it and scale up past the point that we need to, to satisfy everything that’s going on. So I think some of our biggest concerns are going to be about making sure we can scale the system.
Right, and in fact, there’s a lot of headroom there, right? There are many things that we know that we could do to make Aria scale up by quite a bit.
That’s right. Yes, we have lots of ideas. It’s a matter of implementing them at the right time, as in, before they’re direly needed.
But not too far before, so you have time to build other features for people too.
Exactly. Yes. It’s all a balancing act and no one really knows what they’re doing.
All right. Well maybe that’s a good point to end it on, a note of clear humility. Thanks a lot, Doug. This has been really fun.
Thanks again for having me.
You’ll find a complete transcript of the episode, along with links to some of the things that we discussed, at signalsandthreads.com. Thanks for joining us and see you next time.