Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.
Liora Friedberg is a Production Engineer at Jane Street with a background in economics and computer science. In this episode, Liora and Ron discuss how production engineering blends high-stakes puzzle solving with thoughtful software engineering, as the people doing support build tools to make that support less necessary. They also discuss how Jane Street uses both tabletop simulation and hands-on exercises to train Production Engineers; what skills effective Production Engineers have in common; and how to create a culture where people aren’t blamed for making costly mistakes.
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky. Alright, it is my pleasure to introduce Liora Friedberg. Liora is a Production Engineer and she’s worked here for the last five years in that role. Liora, welcome to the podcast.
Thank you for having me.
So just to kick things off, maybe you could tell us a little bit more about, what is Production Engineering at Jane Street?
So, Production Engineering is a role at Jane Street. It is a flavor of engineering that focuses on the production layer of our systems, which is a pretty big statement and I can definitely break that down. But I’ll say the motivation here is that Jane Street is writing software that trades billions of dollars a day. And so it’s important that that software behaves as we expect in production, right? And if it doesn’t, we want people to notice right away and to address what’s coming up. Production Engineers have support as a first-class part of their role. So, when we are on support, we are the first line of defense for our team and we are responding to any issues that arise in our systems during the day, whether that be from an alert or from a human raising some behavior that they observe to us. And we are really tackling those issues right away.
And I guess I will say as a clarifying bit here — that it’s not really the same thing as kind of being on call overnight or on the weekends. This is really like, during the trading day you are present and responding to issues that are popping up live. And then of course Software Engineers do this type of work too. So the lines are a bit blurry, but roughly I would just say this is a first-class part of your role as a Production Engineer. So that is one big chunk. And then the other chunk of work as a Production Engineer is longer-term work to make your response to these issues better in the first place, and also to make it less likely that you even need to respond. Sometimes Production Engineers do work that looks very similar to that of a Software Engineer. So, say you might build an OCaml application that helps users self-service some requests that they currently come to your team for. Some Production Engineers, they might have roles that look pretty different from a Software Engineer and maybe they’re spending a lot of their time off support, thinking about processes and how we can respond to massive issues in a more efficient, effective way.
So, off-rotation work is much more varied depending on the Engineer’s interests and skillset and the team that they’ve been placed on. But they all share that overarching goal of making our support story and our production story better.
This sounds similar in spirit to the Site Reliability Engineer role that Google popularized over time, and similar production engineering roles in other places, where the core thing that is organized around is the live support of the systems, but it’s not just the activity of doing the support, it’s also various, kind of, project work around making that support work well. So, what does the split look like? What amount of people’s time is spent sitting and actively thinking about the day-to-day support, and how much time is spent doing these projects that make the support world better?
Yeah, it’s gonna vary a bit by team. But I would say roughly between a quarter and a third of your time you’ll actually be on rotation for your team, and then for the rest of your time you’ll be doing that longer-term work.
And I’m a little curious about how you think about this fitting in with Software Engineering. You mentioned it’s not a sharp line between the two, and certainly Software Engineers here also do various kinds of support. So, what does the difference here boil down to?
Yeah, that’s a good question. And I think it is, again, a little bit in some ways challenging to answer, because there are Production Engineers who look eerily similar to a Software Engineer and then there are Production Engineers where that difference is much starker. But I guess I would say roughly that Software Engineers typically tend to be experts in a few systems, and they’re gonna know right down to the depths those systems really well. And typically Production Engineers will have a really strong working mental model of a broader set of systems and how all those systems fit together. And I think it’s the same way that we have other types of engineers and roles embedded on the same team. You might have a team with Software Engineers, Production Engineers, a PM, a UX Designer, et cetera. Everyone is tackling that same team goal with a bit of a different perspective on it. And I think that is really how you get that excellent product in the end.
So, can you tell us a little bit more about what your path into Production Engineering was like?
Yeah, so I studied computer science and economics in college, and after college, I think looking back, it’s obvious what path I was going to take. But at the time, I truly could not decide and wanted to try everything. So I went into consulting, as many do after college. And I think the problems were actually very interesting, but I wasn’t motivated by the problems themselves. And I think I also just wanted a bit more of a work-life balance and for various reasons didn’t feel like consulting was the place I wanted to end up in. So, I decided to move back into the tech world and I was thinking about where to apply, and I already knew about Jane Street because I had taken an OCaml class in college and my TA, who I thought was really smart and cool, had gone on to Jane Street. And so I thought, maybe I should apply there too.
Shout out to Meyer, I think he’s still here. (laughs)
(laughs) Indeed.
So I applied and Jane Street reached out to me and said, “Hey, you applied to Software Engineering, but we actually think you might be more interested in Production Engineering.” And this is just ‘cause I had a bit more of an interdisciplinary background than the typical CS grad applicant. And I said, “What is that?” And they explained it to me as I’m doing to you now, and I thought it sounded interesting, so I went for it. And that was all about five years ago.
And you started in this role, you’ve been doing it for a long time, you enjoy it. What is it that you find appealing about Production Engineering? Why do you like it so much?
So, I guess I can say why I like support. I think that is, to me, the real differentiating factor, although I do enjoy my longer-term work as well. But I think for me, when you’re on support, it’s kind of like you’re on a puzzle hunt. You put your detective hat on, if you will, and you’re sleuthing around and trying to build a story and find the answer to this unsolved puzzle. And you might have to look at a bunch of different places and gather evidence and form a hypothesis. Sometimes you have that “aha!” moment when you discover what’s going on, and that feels really good, and it’s kind of just brain teasers all day. And that can be really fun. Also, this might sound a little mushy, but you’re just helping people all day, which can feel really good, right? Like, people at Jane Street are nice and they’re going to be really grateful that you helped make their day go better by solving this problem that was clearly enough of an issue that they messaged you about it, and then at the end of the day, maybe you helped 15 people, right? That just is really rewarding in kind of, like a short-term way.
I dunno, that doesn’t sound too mushy. I think the human aspect of getting to interact with a lot of people, understand their problems, that’s exciting. This other point you’re making is about debugging itself — not just debugging of a single program gone wrong, but debugging of a large and complicated system with lots of different components, lots of different human and organizational aspects to it. Sometimes you’re debugging our systems, sometimes you are debugging the systems that we are interacting with, those of clearing firms and exchanges and all sorts of places. So there’s a lot of richness to the kind of problems that you run into when things go horribly wrong.
Yeah, definitely.
So can we make this all a little more concrete? Like what’s an example of a system that you’ve worked on and supported and maybe we can talk about some of the kinds of things that have gone wrong.
Yeah, definitely. So I sit on a team that owns applications that process the firm’s trading activity. So they ingest the firm’s trading activity, parse them, normalize them, group them, and then ship them off downstream. So, trades you can think of as going through a pipeline of systems that generally live on my team and those are the applications that I’ve been supporting for years.
Maybe it’s worth talking about, what is the system for? Like, I get the idea of, “it ingests our trading activity and processes it, normalizes it, tries to put it into a regular representation and then ships it off to other things.” But why? What are those other things for? What are we achieving with this whole system?
So concretely, where do they go after they are processed by our pipeline? They’re going to go into a giant database of all of the firm’s trading activity over time. They’re also going to go to a Kafka topic that people can subscribe to, to read the firm’s trading activity. They’re gonna go to our banks — and each of these is gonna have a different purpose, right? So for example, sending to our banks, that’s really important because unless we do that, our banks aren’t gonna know what to make happen in the real world. And so that is really critical. Or if you think of writing them to downstream systems, those downstream systems are gonna want to process the firm’s trading activity in a really consistent format. They’re gonna want one central source for all of their calculations. And what systems will care about that, right?
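To make the shape of that fan-out concrete, here is a minimal OCaml sketch of a pipeline stage handing a normalized trade to several downstream consumers. The record fields and destination names are invented for illustration; they are not the real system’s types.

```ocaml
(* Hypothetical sketch of the fan-out described above: a normalized trade is
   handed to several independent downstream consumers. The names here are
   placeholders, not the real system's. *)

type trade = {
  id : string;
  symbol : string;
  quantity : int;
  price : float;
  counterparty : string;
}

(* Each downstream destination is just a function that consumes a trade. *)
type consumer = {
  name : string;
  publish : trade -> unit;
}

let consumers = [
  { name = "trade-database"; publish = (fun t -> Printf.printf "db: stored %s\n" t.id) };
  { name = "kafka-topic";    publish = (fun t -> Printf.printf "kafka: published %s\n" t.id) };
  { name = "bank-feed";      publish = (fun t -> Printf.printf "bank: sent %s\n" t.id) };
]

(* Normalization would map many upstream formats onto the single [trade]
   record; here it is a no-op placeholder. *)
let normalize (t : trade) : trade = t

let process_trade raw =
  let t = normalize raw in
  List.iter (fun c -> c.publish t) consumers
```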
Good examples of these things are like, people have all sorts of live monitoring and tracking of trading, traders on the trading desk want to see the activity scroll by for the given trading system or want up-to-date calculations of the current profit or loss of a given trading strategy. And the way they get that live representation of what’s going on with the trading we’re doing, is by subscribing to these upstream systems that collect, normalize, fix any problems with the data, and distribute it on to clients. So it’s really connected to the beating heart of the trading work that we do. So that seems like a pretty important system. What are the kinds of issues that you run up to when supporting a system like that?
Yeah, so something we get more routinely is a new type of trading activity hitting our system that we haven’t seen before. So for me, I think of a type of trading as the collection of all the fields that that trade has. So, I mean the date the trade was booked on, the counterparty it’s with, the settlement system on the trade, and a bunch of other financy words that are attached to the trade that can take on different values. And so when I say a new type of trading enters our system, I mean a trade with the same collection of fields and values has not appeared before in our systems. And so then, when our system is looking at the trade and maybe, say, trying to match it up against some configuration files, there’s no path for that trade through the system because that collection of fields has not been seen before.
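A rough sketch of what “a new type of trading” means in code terms might look like the following, where the routing key is the combination of fields on a trade and an unconfigured combination has no path through the system. The field names, configuration shape, and lookup are all hypothetical.

```ocaml
(* Hypothetical illustration: the "type" of a trade is the combination of its
   fields, looked up against configuration. An unseen combination has no
   route, which is the case that ends up with the support rotation. *)

type trade_key = {
  counterparty : string;
  settlement_system : string;
  product : string;
}

type route = { destination : string }

(* Configuration maps known field combinations to a path through the system. *)
let config : (trade_key * route) list = [
  ({ counterparty = "BROKER_A"; settlement_system = "DTC"; product = "equity" },
   { destination = "us-equities-pipeline" });
]

let route_trade key =
  match List.assoc_opt key config with
  | Some route -> Ok route
  | None ->
    Error (Printf.sprintf "no route for counterparty=%s settlement=%s product=%s"
             key.counterparty key.settlement_system key.product)
```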
And the reason it might not have been configured, is there are concrete decisions that have to be made that haven’t been made. Like each new settlement system — you mentioned a settlement system — these are like, the back-end systems that involve a rendezvous point between different people who are trading the same security and the way in which the shares flow from one person to another. When someone trips into some new kind of trading, there’s an actual human process of thinking about, “Well actually, how do we need to handle this particular case?”
Yeah. And it’s this fun collaboration between business and tech, because of course, the actual decisions that we’re making, that you’re describing, my team is not gonna be typically best placed to make those calls because we just lack some of that business context that the amazing Operations team will have. And so they will give us the information that we’ll try to translate into the technical language that our system will understand, although we are doing work to try to make that self-service.
In some ways you have maybe less of the context about that than people on like the Operations team, but also a lot more than most other people do. I sort of feel like people who are in this production role learn an enormous amount about the nooks and crannies of the financial system and how the business actually fits together. In some sense, I think that’s some of what’s exciting about the role, is you really get to think not just about the technological piece, like that’s really key, but also about how this all wires up and connects to trading.
Yeah, definitely. And I think there is this just, breadth and variety in what you learn. I mean, that to me is another thing that I enjoy about the role, because, I mean, maybe it was obvious from my not being able to pick an industry to go into at the beginning, but I think I like to see and touch everything, and when you come in you’re not quite sure what’s gonna pop up and what you’ll learn about that day. And so, there is this variety that is consistently present in your life as a Production Engineer.
Okay, so one new thing that can happen is, you said there’s a new trade flow that shows up, a new collection of attributes that makes you say, “Ah, we actually don’t know how to handle this and we need to figure it out.” And you’re going and talking to Legal and Compliance, and Operations people, and the traders, and trying to put all that together. What other kinds of things can go wrong?
So I think that that is a pretty common example. I guess to give some sense of the other side of things, when it’s less common or more extreme, you might have something that we call an “incident” at Jane Street, which is basically a big issue that might impact multiple teams, something you might be somewhat stressed about until it’s resolved, with all hands on deck to address it. And that will come up too, as it will for almost every team at some point. For example, this year we had a case where a human had booked some trades and had manually modified the trade identifiers, and then when they reached our system we actually tried to raise errors, and then the system crashed on the error creation, which is kind of funny. And so then our system is down, and that’s not great because trades cannot get booked downstream. And so we had to go find the cause of the crash and get it back up. Thankfully, we already had a tool that let you remove trades in real time from our stream of trades. So we did that to the malformed trades and then reflected on everything that had happened. But that is kind of a more drastic example of trade data flowing through our system that our system doesn’t expect or doesn’t like, that we then have to handle on support.
And how do those different incidents differ in terms of the time intensity of the work that you’re doing?
So, for my team, we do generally have time to dig into issues and get to the root cause and talk to human beings about what should happen. And then once in a while, you will have something like the latter example where something falls over and it’s really closer to crunch time in that case. But there are teams where that is much more common. So for example, our Order Engines team, which is a team that owns applications that send trades to the exchange — super roughly, I’m sure you can give a better definition — but that team has support that is quite urgent and they might be losing, I don’t know, a million dollars in 10 minutes. I just made up that number (laughs). Take it with a grain of salt, but I don’t think it’s crazy. And so there, it’s really high energy, fast-paced, adrenaline pumping, and some people find that thrilling and thrive in that environment and that support is much more urgent than the typical case that I see on support.
It’s worth saying most of the time when you’re like, “Oh, you know, we are losing money every minute this isn’t fixed,” it’s overwhelmingly opportunity cost, i.e., there is trading we could be doing, and something is down so we can’t be doing that trading, and we are losing the opportunity of being able to make the money of doing those trades. In the Order Engines world, which is this intermediate piece that takes our internal language of how we want to talk about our system’s orders and stuff, and then translates those to whatever the exchange or broker or venue language is, and this fundamental connective tissue there, that’s in the live flow of trading. And so if that thing is down, you can’t trade right now. Whereas the stuff that you’re talking about is more post-trade, it’s about what we do after the fact, and there we have time to adjust and fix and solve problems at a longer timescale because it doesn’t immediately stop us from doing the trading. It’s just getting in the way of the booking stuff that’s going to need to happen, say, by end of day.
Yeah, exactly.
Although I will say there are monitoring things that are like, pretty critical. So this stuff all kind of falls in between those two things: if those systems are totally down, then we are totally down. We cannot trade without monitoring. That’s just not safe.
Yeah, I actually used to work on our monitoring software closer to when I joined Jane Street. I learned a lot because that system is very redundant and robust, and the review process is very intense, and the rollout process is pretty intense — as it should be, because if something goes wrong it can really impact our ability to trade.
Yeah. So maybe we should dig into that a little bit. I think one of the things I’m really interested in is, what are the technical foundations of doing support in a good way? What are the tools that we need to build in order to be effective at support? And you’ve worked on some of those, maybe you can tell us a little bit about the work you’ve done in that context as part of the project side of your work.
Yeah, so for the team that I just referenced, I joined that team somewhat early on, maybe about a year or two in, because I was using that software every day — as were my teammates — and we had strong opinions about the features we wanted to see. And of course we are not the only team using that software, it’s used firmwide. So we decided to dedicate our time and put me as an engineer on that team to advocate and work on the features that we wanted. So for example, I really wanted the ability to be able to snooze pages from this system. And so that is something that I added to our monitoring software and now still to this day I will snooze pages that I get.
Using the code that you yourself wrote?
Yes. It feels great. (laughs)
And just to say, there’s actually lots of different kinds of monitoring software we have out there. The particular system we’re talking about I think is called Oculus and it’s very specifically alerting software. So this is like, you have a whole bunch of systems that are doing stuff, it detects that something has gone wrong and it raises a discrete alert which needs to be brought to the attention of some particular set of humans. And it’s like a workflow system at that point. Something goes wrong, there’s a view that you have and something pops up in your view that tells you the thing that has gone wrong, and now you have various operations that you can do to handle those. This alerting and the resulting workflow management is really critical to the whole support role, because part of what you’re doing is paying attention to lots of things — and being able to handle and not lose track of those things that go wrong — in the middle of a potentially hectic time where lots of things are potentially going wrong, and there’s lots of things that need your attention, and there’s a lot of people who are collaborating together to work on the support and management of that system.
Yeah, and I think you do see this pattern where support team members are often super users of some technology. For example, Oculus, the one that we’re talking about, that same team put out a tool that lets you auto-act on pages that come up. And we actually had some Production Engineers make our own version on top of that with state so that it was more expressive and you had the ability to take more actions automatically. And that’s something that most teams probably don’t need, but Production Engineering teams really find valuable.
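A toy sketch of what a stateful auto-action rule could look like appears below. This is not Oculus’s API; the types, alert names, and the repeat-count heuristic are invented to illustrate the idea of acting on a page automatically only when some accumulated state justifies it.

```ocaml
(* Invented types for illustration only; not the real alerting system's API. *)

type action = Own | Resolve | Snooze of int (* minutes *) | Reassign of string

type page = { alert_name : string; body : string }

(* Per-alert state: how many times we have seen this alert today. *)
let seen : (string, int) Hashtbl.t = Hashtbl.create 16

let auto_act (p : page) : action option =
  let count = (try Hashtbl.find seen p.alert_name with Not_found -> 0) + 1 in
  Hashtbl.replace seen p.alert_name count;
  match p.alert_name, count with
  (* A known-noisy alert: resolve automatically, but only the first few times,
     so a real storm still reaches a human. *)
  | "stale-feed-warning", n when n <= 3 -> Some Resolve
  (* Anything else is left for the person on support. *)
  | _ -> None
```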
Right. And I think one of the lovely things about the fact that a lot of the Production Engineers, not all of them, but a lot of them spend a lot of time doing software engineering, is it means the tools that they’re using most intensely are tools that they can shape to their own needs, right? So you’re not merely stuck as the victim. We’re using the system in whatever the way it happens to work right now, and when you find things that are wrong, you can dive in and make those things better.
Yeah, and you also might be the only team that wants a tool. So sometimes, there will be a tool that you can build off of at the firm level, but sometimes you will come up with it from scratch, right? As an example, we have been talking as Production Engineers about the health of our rotation. What percentage of alerts are we actually actioning? And what percentage are we sending onwards or snoozing, or what percentage close themselves? And what percentage of alerts that I’m seeing are from this system versus that system? And all these questions are well and good to ask, but it’s way better if you have technology that helps you answer that question. So there were a few Production Engineers this year who built a system that analyzes your team’s alerting history and tries to answer those questions for you, so that you can take action and improve the health of your rotation more actively.
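A minimal sketch of that kind of rotation-health analysis, over an invented record of alert history, might look like this. The outcome categories mirror the questions above; everything else is made up for illustration.

```ocaml
(* Invented history record; the real analysis tool's data model is not shown
   in the discussion, so this is only a sketch of the idea. *)

type outcome = Actioned | Reassigned | Snoozed | Transient (* closed itself *)

type history_entry = { source_system : string; outcome : outcome }

let summarize (history : history_entry list) =
  let total = List.length history in
  let count p = List.length (List.filter p history) in
  let pct n = 100. *. float_of_int n /. float_of_int (max total 1) in
  Printf.printf "actioned: %.1f%%  reassigned: %.1f%%  snoozed: %.1f%%  transient: %.1f%%\n"
    (pct (count (fun e -> e.outcome = Actioned)))
    (pct (count (fun e -> e.outcome = Reassigned)))
    (pct (count (fun e -> e.outcome = Snoozed)))
    (pct (count (fun e -> e.outcome = Transient)))
```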
So, having tools that let you statistically analyze the historical behavior of alerting systems and find anti-patterns and things that are going wrong and where you can focus on improving things. Sounds great. I’m curious a little more concretely, what are the kinds of things that you find when you apply a tool like that? What are the anti-patterns that show up in people’s alerting workflows?
I think you kind of get desensitized sometimes when you’re on support. So I would say one example is that often you’ll find that there’s just a lot of flickering going on in your view where pages will open and close and you did nothing to resolve them, they closed themselves. And probably that is taking up mental space because you saw the thing raise and close itself, and you never even needed to see it. And probably sometimes you start to look into it before it just closes itself. And so, that’s an example of a case where we should go improve the call site of the alert and make it such that it does not show up on your screen at all. So maybe that means there’s some threshold you need to make longer or maybe there’s something fundamentally wrong with the alert itself.
What do you mean by the call site of the alert?
So there’s some code that is literally calling a function in our monitoring library that is raising this thing on your screen and it’s possible that that spot in the code needs to be refactored such that that alert is not causing noise on your screen.
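As a hypothetical illustration of a “call site,” here is application code calling into a monitoring library when it hits a condition it doesn’t like. The Alert module and the threshold are invented; the point is that fixing a noisy alert often means editing this spot in the application, not the alerting system itself.

```ocaml
(* A stand-in for the real monitoring library; invented for illustration. *)
module Alert = struct
  let raise ~name ~message = Printf.printf "ALERT %s: %s\n" name message
end

(* If this alert flickers, the fix might be raising the threshold, or
   requiring the backlog to stay high for several checks in a row. *)
let queue_backlog_threshold = 10_000

let check_backlog ~current_backlog =
  if current_backlog > queue_backlog_threshold then
    Alert.raise ~name:"trade-queue-backlog"
      ~message:(Printf.sprintf "backlog is %d" current_backlog)
```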
Right. And I guess this points to an architectural point about the way in which these alerting systems work, which is, you could imagine — and in fact we have this kind of system too — you could imagine that like, individual systems just export bland facts about the world. And then outside of the systems you have things that monitor those facts, look at metrics exported by the system, and decide when something is in a bad state. But that’s not the only thing we have. We also have systems that themselves have their own internal notion separate from any large-scale alerting things of like, “Oh, something bad has happened.” They see some behavior, you get down some series of conditionals inside of your code and like, “Ooh, I never wanted to get here. This is uncomfortable.” And then like, you raise an alert.
And I think that’s a common case that I see, even.
I think this kind of more metrics-based alerting thing is something that we are actually growing more now, and the historical approach has much more been: you have the internal state of, say, some trading system, and it sees a weird thing and it flags that particular weird thing. So when you want to go to fix the thing, you often have to go many steps back from the alerting system, all the way back into the core application that is the one that is doing the activity and that found the bad condition.
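The contrast being described might look roughly like this: the application exports plain metrics, and a separate monitor applies rules to decide what is alert-worthy. The metric names, thresholds, and rule shape are hypothetical.

```ocaml
(* The application side: export facts, no opinions. *)
let metrics : (string, float) Hashtbl.t = Hashtbl.create 16
let record_metric name value = Hashtbl.replace metrics name value

(* The monitoring side: rules over exported metrics decide what is alert-worthy. *)
type rule = { metric : string; max_ok : float; alert_name : string }

let rules = [
  { metric = "trade_queue_depth"; max_ok = 10_000.; alert_name = "trade-queue-backlog" };
  { metric = "seconds_since_last_trade"; max_ok = 300.; alert_name = "stale-trade-feed" };
]

let evaluate () =
  List.iter
    (fun r ->
       match Hashtbl.find_opt metrics r.metric with
       | Some v when v > r.max_ok ->
         Printf.printf "ALERT %s: %s = %.0f\n" r.alert_name r.metric v
       | _ -> ())
    rules
```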
Yes. Although I recently saw that they added a feature to our software where you can jump from a page to the call site itself in our codebase, which is pretty exciting.
Oh that’s cool. So you could like hit a button and it’ll just bring up the source for the particular thing that raised the alert.
Yeah.
Oh that’s awesome. One problem you’re talking about here is desensitization, right? You get too many alerts thrown at you after a while, you just can’t see them anymore and then people aren’t reacting. How does desensitization show up in the statistics that you gather?
So I mean, I think that that would be this higher proportion of alerts that are “transient,” is what we call them. So they close themselves without anyone taking action on them. We’ll have recordings of what actions people took on a page, right? Did they own it? Did they send it elsewhere? Did they remove it from their view entirely? If so, for how long? And maybe you’ll have an alert that opened and closed within a short timeframe with no one doing anything to it. And probably that’s an example of an alert that should be looked at.
So one indicator of noise is stuff that flickers. Are there other things that are markers of noise?
Yeah, so stuff that flickers is one, but I think if you’re routinely, say, snoozing an alert, that is probably an example of the alert not behaving quite as you intended. So snoozing can be a really powerful tool, but I remember when we were adding the ability to snooze to Oculus, we had conversations and we were thinking to ourselves, is this enabling bad behavior? (laughs) And we ultimately decided no, you should trust Jane Streeters to use snoozing with thought and care. But we were wondering, you know, if people can just snooze alerts, will they not care as much about addressing the problem with the alert itself? And so I think any action that is not owning and then resolving could conceivably be noise. If you are sending it to someone else, it should probably go to that person in the first place. If you are snoozing it, then why are you snoozing it? Should it have raised later? All those types of questions can come up if the action isn’t pretty straightforward owning and resolving.
Got it. That all makes sense. Another bad behavior in the context of alerting systems that I often worry about is the temptation to “win the video game.” There’s a bunch of alerts that pop up and then there are buttons you can press to resolve those alerts in one way or another. And one of the things that sometimes happens is people end up pushing the buttons to resolve the alerts and clear their screen of problems without maybe engaging their brain a hundred percent in between and actually understanding and fixing the underlying problem. And I’m curious, to what degree have you seen that video game dynamic going on and how do you notice it and what do you do about it?
I will say for me, I think what even makes a good Production Engineer in the first place is a healthy dose of self-reflection for every issue that pops up. The goal really should not be to clear it from your screen. After every issue, there are so many questions you can ask yourself to try to fight against this tendency of, “just close the alert.” You should ask yourself, did it need to raise in the first place at all? Could I have mitigated the impact for the next time it raises? Could I make the alert clearer for the next person who has to look at it? Could I automate part of the solution?
So, one of the techniques you mentioned in there, which is close to my heart, is automation. You see lots of things are going wrong. You can build automation to resolve, filter, make the alerts more actionable, and get rid of a lot of that noise. I think that’s an important, necessary part of having a good alerting system. You need people to be constantly curating and combing over the list of things that happen. You want a very high signal-to-noise ratio. When an alert goes off, you want it to be as meaningful as possible, so humans are trained to actually care about what the alerts say and to respond to them. There’s a danger on the other side of that too. If you have a lot of powerful automation tools, those themselves can get pretty complicated and pretty hairy and sometimes can contain their own bugs, where you’ve tried to get rid of a lot of noise and in the end — in the process of getting rid of a lot of noise — have silenced a lot of real things that you shouldn’t in the end have silenced. And the scary thing about that is, it’s really hard to notice when the mistake is that nothing happened. You’ve over-silenced things. So I’m curious how you feel about the role of automation and what do you do to try and build systems that make it possible for people to automate and clean things up without encouraging them to over-automate in a way that creates its own risks?
Yeah, that’s a really good question and I’m not sure I have an amazing answer. I think at Jane Street we generally try to expose a lot, at least to the expert user. So applications will often expose their internal state via command-line tools. And so they should be relatively easy — maybe “should” is the wrong word — but they’re often relatively easy to hack on or build some automation around, because Jane Streeters will have exposed some helpful commands for you to interact with the state of their system. And I think that is great. And then I think the question is, how do you make sure you’re not doing that too much? And I don’t think I have a good answer. I guess code review is one thing, hopefully someone is reading the changes you’ve made and agrees that they are sane. But I think also triage and prioritization is something that Production Engineering teams talk a lot about. I think for example, on my team, at the end of every support day people write a summary of all the things that they worked on during that day, all the things that popped up live. And that will often generate a discussion in thread about things that we could improve. And then I think it’s really survival of the fittest there, where the ideas that have the most traction and get people the most excited are the ones that will be picked up.
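As a sketch of that pattern, the snippet below wraps a hypothetical command-line tool that an application exposes, auto-resolving one class of issue while writing every automatic action to an audit log so that over-silencing stays visible. The tool name and its flag are invented.

```ocaml
(* Every automatic action is logged so a human can later review what was
   handled without them; this is the guard against silent over-automation. *)
let audit_log = open_out_gen [ Open_append; Open_creat ] 0o644 "auto_actions.log"

let log_suppression ~alert ~reason =
  Printf.fprintf audit_log "suppressed %s: %s\n%!" alert reason

(* Suppose the application exposes a hypothetical [trade-tool requeue]
   command; the automation just shells out to it. *)
let auto_requeue ~alert ~trade_id =
  let cmd = Printf.sprintf "trade-tool requeue --id %s" (Filename.quote trade_id) in
  match Sys.command cmd with
  | 0 ->
    log_suppression ~alert ~reason:(Printf.sprintf "requeued trade %s" trade_id);
    true
  | _ -> false (* leave the alert for the person on support *)
```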
I think part of what I’m hearing you say here is that you need to take the automation that you build around the support role, and take that seriously as software, and take the correctness of that software seriously. In some ways it’s easy to think, “Oh, this is just like the monitoring stuff, getting it wrong isn’t that important.” But it’s actually really important, right? You get it wrong in the direction of under-filtering, and in the end the filtering happens in the brain of the person who’s watching: they can’t keep up and they just stop paying attention. And if you over-filter, well now you hide important information from the people who need to see it. Even this stuff is surprisingly correctness-critical and you have to take it as its own pretty serious engineering goal. So we’ve talked a lot about one important phase of support, which is the discovery and managing of alerts, things that go wrong. That’s not the only thing you need to do in support. A thing you were mentioning before is this debugging and digging and trying to understand and analyze and explore. I think that kind of work rewards things in the space of debugging and introspection tools, and I’m curious what are the tools there that you think are important? Are there any interesting things that we’ve built that have helped make that part of the support work easier?
We’re definitely making strides in this area now. For example, to be honest, it’s pretty recent that the firm has had this big push towards CLM, or centralized log management, which is very helpful in terms of combing through events that have happened. And that is actually something that I have just seen come about in the past maybe two years. We were kind of in our happy world SSH-ing onto production boxes and reading log files and then just parsing them or reading them by hand, and we are kind of upgrading now. So I think that’s one big area where things are happening. I think people often will write their own tools to parse data. So a lot of our data at Jane Street is in S-expressions, or “sexps,” and we store a lot of data that way, and that means that people kind of have shared tools around the firm for how to pull data nicely out of those expressions.
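For example, pulling a field out of an S-expression with the open-source sexplib library might look like this; the log line and field names here are invented.

```ocaml
open Sexplib

(* An invented, association-list-shaped sexp of the kind a log might contain. *)
let line = "((trade_id T123) (counterparty BROKER_A) (quantity 100))"

(* Find the value paired with [field] in the sexp. *)
let lookup field sexp =
  match sexp with
  | Sexp.List pairs ->
    List.find_map
      (function
        | Sexp.List [ Sexp.Atom key; value ] when key = field -> Some value
        | _ -> None)
      pairs
  | _ -> None

let () =
  match lookup "counterparty" (Sexp.of_string line) with
  | Some v -> print_endline (Sexp.to_string v)
  | None -> print_endline "not found"
```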
All of this stuff you’re talking about highlights the way in which Jane Street is off in its own slightly independent technical universe from the rest of the world. If you look at our internal tooling, you’ll find it simultaneously dramatically better and dramatically worse than the thing you might expect from some big tech infrastructure. And also just in some ways weird and different. Other people use JSON and protobufs and we use S-expressions, which is the file format basically from the mid-fifties involving lots of heavily parenthesized expressions. The thing from Lisp if you happen to know what that is.
Oh but I’ve grown to love them.
(laughs) I like them too. And in many ways we’ve built a system where people are much more connected to and using the particular hardware and particular systems, and the fact that we actually, like, SSH into particular boxes and look at the log files on those boxes, in some ways is just kind of a holdover from those origins. But over time that’s not actually how we want things to work. And so we’ve done, as you said, much more building of centralized systems for bringing together data. Log data is one of them but not the only one. In fact, there’s now a whole Observability team which has built a ton of different tools like distributed tracing tools and all sorts of observability stuff that lets you quickly and efficiently aggregate data from systems and show it together. There’s some kind of observability tools that we’ve had for a long time that you don’t find in other places.
Like we have extremely detailed, super high-resolution packet captures with shockingly accurate timestamps of everything that comes in and out of a large swath of our trading systems so that we can, like, down to a handful of nanoseconds, see exactly when things happened and put complicated traces of things together. That’s the kind of tool that you don’t see in lots of other contexts, but a lot of the more standard observability tools are things we’re really only leaning into in the last handful of years. So I’m curious, as those tools have landed, have they made a big difference to how support works?
Yeah, so I think there’s definitely a bit of inertia here when it comes to change. And I think it’s this interesting combination where engineers hear about these things and are really excited and I absolutely have seen a lot of them being taken on board. But like I said before, everyone has so much always on their stack that they want to work on. And migrating to a new library is something that will be somewhere on that stack, but it may or may not be at the top.
Yeah, it’s always a tough thing: when you come up with a new and better way of doing something, how do you deal with the process of migrating things over? And I think there’s that last tail of things to migrate that can take a long time, and that’s not great, right? Because it, in complicated ways, increases the complexity and risk of what’s going on: you’re in the world where most things have been moved over into the new standard shiny way of doing it, and then there’s the handful of things in the old way, and people have a little forgotten how the old way works. So I do think it’s important to actually make time on teams to clean up this kind of technical debt and migrate things over maybe a little more eagerly than people might do naturally.
Yeah, and Jane Street has a giant monorepo, and so often you’ll go look at someone else’s code, and certainly Production Engineers will often go read the code of the applications that are creating pages that come our way. And so being able to jump into a codebase that you haven’t seen before and understand all the tools that it’s using and what’s going on is just really important. And that’s true also for Software Engineers, who are also often gonna be jumping into codebases in our big monorepo that they haven’t explored before. And so that’s another benefit on top of the benefit that these libraries are just better than the old way in most cases.
And what goes into that kind of Production Engineering role? What’s the texture of the work if you’re not doing very much in the way of writing code? What is the kind of project work you do in that context?
I can definitely give you an example of that type of work. I probably can’t come up with an amazing summary of it just ‘cause I’m not in that part of the world. But I can say — we keep going back to Order Engines — I have a friend on Order Engines, and something he might do when he’s off support is, he might work on a new type of trading that we’re doing, and he might go talk to the desk about the trading they want to start doing. He might read a spec about that type of trading, he might then go write some code to make it happen. He might talk to downstream clients of his systems to make sure they’re ready for it, and kind of be that glue between all of these systems. So that is an example of something he might do.
And I think that kind of thing does connect to the support role more than it might seem, in the sense that there’s all this operational work where you’re trying to understand the trading that’s coming up, and understand the systems on the other side. That first day when you try and do something, as you said, there are often failures that you then have to respond to and support, and understand maybe the specifics of that particular flow. But also just understanding in general what is the process of connecting to and getting set up with a new counterparty and what are all the, kind of, little corners and details of how those systems work and hook together. That I think very much connects to the kind of things that go wrong and the kind of things you need to understand when you’re doing support, even though it’s a different kind of operational work that’s not just about machining down the support role. I do think there’s a lot of synergy between those two kinds of thinking.
Yeah, totally.
So we just talked a bunch about the tooling that makes support better, but along the way you pointed out in a number of ways the importance of culture. How do you build a good culture around support and around safety? And I’m curious, what do you think are the important ingredients of the culture that we bring to support and to safely supporting the systems that we have here?
Yeah, I think a big part of Jane Street culture in general that I have noticed, and maybe even someone has said this already on your podcast because it’s pretty pervasive, but it’s that you should just be totally comfortable making mistakes. And if you make a mistake, you should say it. And I think if you are making a mistake that’s gonna impact the production environment, that’s okay. Humans do that. The important thing is that you raise it to someone around you urgently so that we can mitigate the impact and resolve it. And I think this shows up in our postmortems of incidents after they take place. There’s really not a blaming culture here, right? It’s just people describing what happened so we can learn from it.
Right. People are going to make mistakes and sometimes it’s true that when someone makes a mistake, the right response is like, “Oh I made a mistake, I need to think about how to do that better personally.” But most of the time, certainly when the rate of mistakes is high, the thing that you need to think about is, how do I change the system? How do I make it so that mistakes are less likely, or that even when mistakes happen, the blast radius of the mistake is reduced. So you talked about postmortems, what is a postmortem? When do we write a postmortem? What are these things for? Where do they go? What’s the story here?
Roughly after an incident, we basically do some reflection and writing about what happened. We’ll sometimes write in a pretty detailed way, with timestamps, the sequence of events. And we’ll write down what led to it, what caused it, how it got resolved. And then we’ll really have a big chunk of the postmortem that is dedicated to, how can we do better? It’s all well and good to reflect, but you really want to come away with concrete actionable items that you can do. You know, you might have some process takeaway, like, “Oh, I should have reached out to impacted users sooner.” And that’s great. But I think technical takeaways are often a result of a postmortem.
And how do you help people actually write these things effectively? I’ll say, maybe an obvious thing — even though we try really hard and I think largely succeed in having a culture where people are encouraged to admit their mistakes, it’s awkward. It is hard for people to sit down and say, “Yeah, here’s the thing that went wrong and here are the mistakes that I made.” I think a thing that’s actually unusually hard to handle is, when you’re the person who’s writing down the postmortem, we ideally try and get the person who made a lot of the mistakes to write it themselves. If it was their hand that did the thing that caused the bad thing, that’s the person who you want to give the opportunity to do the explanation. But sometimes a lot of people have made different mistakes in different places. I think in a well-engineered system when things go wrong, it’s often because there’s like a long parlay that was hit, a number of different things failed, and that’s why we got into a bad state. And so you need to talk about other people’s failures, which is especially awkward. What are the things that you do to try and help new people to the organization get through this process, learn how to write a good postmortem — how do we help spread this culture?
Yeah, so I think some of it they’re going to pick up by reading other postmortems and experiencing other incidents. It’s pretty rare that someone will join, and then right away be involved in a large incident where it was their mistake that was, you know, heavily the cause. And to be honest, if that happens, probably their mentor should have avoided that situation. I think probably by the time they’re in this situation, they have seen enough things go wrong, seen enough people admit that they have made mistakes, seen enough people still be respected and not shamed for that. And I kind of think it’s just a matter of time, but you do want to be intentional with the way you talk about it. So if something did come up where a new person was involved, I think it would be really important that someone pulls them aside and makes sure they’re comfortable with everything going on, and that they feel okay. It’s up to their teammates to instill that culture and make sure that everyone is talking about it in the right, thoughtful way.
One of the tools that I’ve noticed coming up a lot that’s meant to help people write good postmortems about which I have complicated mixed feelings—
The template?
Templates. Yeah, right. Where we have templates of, what is the shape of a postmortem, what is the set of things you write in a postmortem, there’s the place for the timeline of an event and the place to write down — you were kind of echoing this, you know — here are the things that went well, here are the things that went badly, here are the takeaways. And one of the things I worry about is that sometimes people take the template as the thing they need to do and they go in and fill in the template. And the thing you most want to have people do when they’re writing a postmortem is to stop and think, and be like, “Yeah, yeah, there’s a lot of detail but like, big picture, how scary was this? What really went wrong? What are the deep lessons that we should learn from this?”
And I totally see the value of giving people structure that they can walk through. You know, the old five paragraph essay. You give people some kind of structure that shapes what they’re doing. But at the same time there’s something hard about getting people to, like, take a deep breath, look wide, think about how this matters to the overall organization, and pull out those big picture lessons. And I sometimes feel there’s just tension there where you give people a lot of structure and they end up focusing on the structure, and spend less time leaning back and thinking, “Wait, no, no, no. What is actually important to take away from this?” And I’m curious whether you’ve seen that and if you have thoughts about how to get people to grow at this harder task of doing the synthesis of, like, “No, no, no, what’s really going on here?”
Yeah, I definitely have observed that. My personal opinion is that the template can be really helpful for people who have never written one before. Because it can be kind of intimidating — this big thing went wrong, now go write about it and reflect on it. And having a bit of structure to guide you when you’re starting off can make it much more approachable. So I do think there is a role for the template, but I definitely agree with you that it can be restrictive. And I think once people are kind of in the flow of writing postmortems and are a bit more used to it and know what to expect, they’re not gonna get writer’s block sitting down. I think removing the template and giving them a bit more space is totally reasonable. Then your question is like, how do you get them from the template onwards, and how do you get them to think about this big-picture framing?
And I think the answer to a lot of these questions is that hopefully the Production Engineers around them are instilling this in them. I mean, to be honest, postmortems are not the only way we reflect and take action after an incident. In reality, you’re going to have team meetings about this and you’re gonna get a lot of people in a room who are talking about what could have gone better or maybe they’re just in the row talking about it, but there’s going to be a real conversation where a lot of these big-picture questions are coming up. And I think it’s really important to have those discussions. And the postmortem is a very helpful tool in reflecting, but it I think should not be the only tool in your arsenal.
Yeah, that makes sense. Another thing that I often worry about when it comes to support, is the problem of, how do you train people? And especially, how do you train people over time as the underlying systems get better? I think about this a lot, because early on, stuff was on fire all the time. Every time the markets got busy, you know, things would break and alarms would go off and there were all sorts of alerts. And you had lots of opportunities in some sense to learn from production incidents because there were production incidents all the time. And, we’ve gotten a lot better. And there are still places that are, like, new things that we’ve done and we’re still working it out and the error rate is high. But there are lots of places where we’ve done a really good job of machining things down and getting the ordinary experience to be quite reliable. But when things do go wrong, you want the experience of knowing how to debug things. How do you square that circle? How do you continue to train people to be good Production Engineers in an environment where the reliability of those systems is trending up over time?
Yeah, I think different teams have come up with different solutions to this question. And the question of how to train people on incidents is a really hard one. I’ve seen some creative solutions out there. So I know one popular method of training is what we call the “incident simulation.” And it’s kind of a choose-your-own-adventure through a simulated incident. And this is all happening in conversation, in discussion. It’s not on a machine, but the trainer is gonna present to you some scenario that you’re in, where something’s going wrong and you are gonna step through it. It’s kind of like D&D and you’re gonna step through it and pick your path and then they will tell you, okay, you’ve taken this step, here’s the situation now. And you’ll walk through the incident and talk about how to resolve it, what updates you would give stakeholders, how you would mitigate it, what you would be thinking about, all of those things.
That is one approach and I think that gamifying approach has proved pretty useful. I know other teams that actually use some Nintendo Switch video games as training. So if you know the games, Overcooked! or Keep Talking and Nobody Explodes, those are both fun, team-based games where you’re actually communicating a fair amount under pressure. It manufactures a bit of a stressful situation, and you’re talking to your teammates, and it’s fun, but also it does simulate how to keep a level head and think clearly and communicate to people under pressure. So we have definitely had to come up with creative ways because like you said, bad incidents are probably just coming up less frequently.
There’s at least some teams that I’ve seen build ways of intentionally breaking non-production versions of the system. So they have some test version of the system with lots of the components deployed and operating in a kind of simulated mode and will do a kind of war-gaming thing. But a war gaming where you actually get to use your keyboard and computer to dig into the actual behavior of the system. There’s a big investment in doing that. You have to both have this whole parallel system that looks close enough to production, that it’s meaningful to kind of bump around inside of it. And then people have to design fun and creative ways of breaking the system to create an interesting experience for new people to go in and try and debug it. In some ways that seems kind of ideal. I don’t know how often people do that, how widespread that kind of thing is. I’ve seen a couple of examples of people doing that.
Yeah, I haven’t seen it be as widespread as I might like. I agree. It is the ideal scenario. I can think of, off of the top of my head, one team that I know has built that. But I think like you said, it’s just a pretty high investment cost, and so I think people have tried to steer away from that where possible — or maybe that’s too strong — but they have put it off in favor of trainings that have a very low investment cost. I do know that we had a simulation, kind of like how you’re describing, where a real thing breaks and you go investigate it. That was used firmwide, or at least in my part of the firm, for many years. And it wasn’t really a system that you had to know about ahead of time, it would break. It’s called Training Wheels, if you know it. Someone wrote it maybe 10 years ago or something like that. This was before we had Production Engineers. Any engineer could try their hand at fixing this system with a small codebase and with a pretty manageable file system presence that you could kind of just go explore off the cuff. So I do know that things like this have popped up over the years, but I don’t think it’s something that you’re gonna find on every team.
So beyond simulating outages and simulating working together and communicating effectively there, what other things go into training people to be effective Production Engineers?
I think that comes with time on the job. So all Software Engineers go through OCaml bootcamp when they join, we have Production Engineers do that, and then also go through a production bootcamp. And you can see a similar pattern with the classes that we have Software Engineers take in their first year, where Production Engineers take them and also take a production class. But then beyond that, it’s just gonna be so team-specific, right? You wanna be a strong debugger, you wanna remain calm, you wanna be careful, you wanna communicate well, but the actual support you’re doing is gonna have such a different shape and color depending on your team. A lot of it is team-driven rather than firm-driven, where someone is gonna sit with you and literally do support with you for weeks, and be teaching you a ton of context about the systems. And hopefully every time you handle a support issue, they’ll be providing active feedback on what you could do better.
So a lot of this stuff comes through in essentially an apprenticeship model. You are sitting next to people who have done this for longer and they’re showing you the ropes, and you over time absorb by osmosis the things that you need to learn to do it effectively.
Yeah, and some teams do have much stricter training models. For example, the team I mentioned earlier, Order Engines that has pretty high-stakes support. I think they have a stricter training model that people follow when they’re joining the team compared to a team that has a much lower support load.
So how does this all filter down to the recruiting level? I know that you’re involved in some recruiting stuff, both on the Software Engineer and the Production Engineer side. What does the recruiting story for Production Engineering look like, and how does it differ from the Software Engineering pipeline?
Yeah, so I guess if you go back to my story, right, I was applying to Software Engineering and Jane Street raised Production Engineering to me as something I might be interested in. And that pattern does pop up a fair amount because certainly, at least at the college level, students don’t know about Production Engineering and so they’re gonna default to Software Engineering, which is totally reasonable. And it’s also kind of just hard to tell what a student might be interested in because you only have so much data. So I think often if we think someone might be interested in Production Engineering, we will propose it to them, see how they feel about it, and have a conversation. We have interviews that are Production Engineering-focused, so someone can even try it out and see if they find it fun or not. We do legitimately get feedback that some of our Production Engineering interviews are pretty fun.
Maybe people are just being nice, but I think they are legitimately fun if you’re into this solving-a-puzzle business. So, part of this is us reaching out to people, or trying to identify people who will be interested in it. But I think there’s also this lateral pool of candidates who, like you mentioned at the beginning, have done something similar at another tech company and so are somewhat in this space already. And then those people will be a bit more opinionated about what type of work they would want to do and if this would be a good fit for them.
Can you identify any of what you think of as the personal qualities of someone who’s likely to be an excellent Production Engineer? I think there are lots of people we hire who would be kind of terrible, unhappy, ineffective Production Engineers, and some people who are great at it. What do you think distinguishes the people who are really well-fitted to the role?
So, I think strong communication is really important for Production Engineers. And communication, sometimes I think people think of it as a throwaway skill or something like that, but it’s so key because Production Engineers are really the glue between a lot of teams. They’re gonna be speaking to people who have very different mental models of all the data and systems in play. You know, like I said, when we’re talking to Operations, we might speak a different language about a trade, right? I think the debugging skill is of course important and I think that’s a great example of how all of these skills are obviously also important for Software Engineering and other engineering disciplines — but I think especially important for Production Engineering because you’re doing an extra high level of investigation and debugging. I would say carefulness, just because you’re interacting with production systems and it’s important that you are taking extra care and thought and aren’t gonna do something crazy, you’re gonna think it through.
You wanna be pretty level-headed and I guess that would be the last quality I would mention, which is just remaining calm. Even in a stressful situation, which might not — to be fair — help you on every team because some teams don’t need that. But typically no matter what team you’re on, a lot of stuff might come up at once. You might feel a little bit overwhelmed and that’s okay, you just need to keep a level head, remain calm, not panic. And that is a certain type of person and that is totally not everyone. And some people would hate that and some people like me don’t mind that situation at all. And some people love that, and then they go to the Order Engines team, right? So you get this big spectrum. But I think across all these teams, those are the qualities that kind of stand out.
Having just a personal enjoyment of putting out fires. I think some people find it unpleasant and hard and some people find it really energizing. I think I actually found it really energizing. I don’t do as much of it as I once did, but there’s something exciting about an emergency — not exciting enough that you would create one when you didn’t need one. But when it’s there, there’s something joyful about it.
Yeah, and I think it kind of ups the reward factor because then once you’re done solving it, you feel extra excited about having tackled this really big issue.
Do you do anything in the interviewing process as a way of trying to get to these qualities? The one that seems like it maybe should be possible to get at is the debugging skill. What do you guys do that’s different for evaluating Production Engineers from what you might do for Software Engineers?
So we do ask candidates software engineering questions, but we also have production-focused questions that we’ll ask them. And for example, one of these questions puts you in an environment with some code, with some command-line tools, with some data, and we tell you, “Hey, here’s the thing that went wrong. We wanna figure out what happened.” And it is their job to go piece together that story. In fact, actually, that specific interview question is based on a real thing that my team had to do on support a lot in the past before we eventually built a website that helps people do it for themselves. But we used to have that all the time on support. So it really is kind of having the candidate go through an action that we used to take ourselves on support a lot, and a thought process that we used to think about ourselves a lot.
So we do try to get at it in as realistic a way as possible. Obviously it’s a bit of a synthetic situation, there’s no way around that, but I think if you talk to a candidate about it and hear how they’re brainstorming about solving this problem and what steps they wanna take to get there and what’s the reasoning behind it, you can kind of get at a lot of these skills. But you do have to ask a lot of questions, and you don’t wanna take a backseat. Not that any interviewer should, but you certainly can’t just look at the code at the end and form a really strong opinion. You really need to observe them as they get to the answer throughout the whole process to have a strong opinion of the candidate.
I do think that’s the thing that’s, in general, true about interviewing. Very early on when we started doing interviewing, I think we had the incorrect mental frame of like, oh, we’ll give people problems and we’ll see how well they solve the problems. But I think in reality you learn much more by seeing what it’s like to work together and seeing what’s the process they go through. You talked a lot about the feeling of solving puzzles and in some sense we solve puzzles for a living, but we don’t really solve puzzles for a living. There are puzzles that are part of it, but it’s much more collaborative and connected than that. And just seeing what it’s like working with a person, and how their brain works, and how those gears turn seems much more important.
There are Production Engineers who do very little coding; like I said, there’s that spectrum. So we also want to make sure that we have interviews that really can help those people shine and get at people who would be good for that type of role.
Alright, well thank you very much for joining me. This has been a lot of fun.
Yeah, thank you for having me.
You’ll find a complete transcript of the episode along with links to some of the things that we discussed at signalsandthreads.com. Thanks for joining us and see you next time.