BLURB

Liora Friedberg is a Production Engineer at Jane Street with a background in economics and computer science. In this episode, Liora and Ron discuss how production engineering blends high-stakes puzzle solving with thoughtful software engineering, as the people doing support build tools to make that support less necessary. They also discuss how Jane Street uses both tabletop simulation and hands-on exercises to train Production Engineers; what skills effective Production Engineers have in common; and how to create a culture where people aren’t blamed for making costly mistakes.

SUMMARY

Some links to topics that came up in the discussion:

More about production engineering at Jane Street, including how to apply.
Notes on Site reliability engineering in the wider world.
Alarm fatigue and desensitization.
Jane Street’s 1950’s era serialization-format of choice,
Some games that Streeters have used for training people to respond to incidents.

TRANSCRIPT

00:03

Ron

Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky. Alright, it is my pleasure to introduce Liora Friedberg. Liora is a Production Engineer and she’s worked here for the last five years in that role. Liora, welcome to the podcast.

00:21

Liora

Thank you for having me.

00:22

Ron

So just to kick things off, maybe you could tell us a little bit more about, what is Production Engineering at Jane Street?

00:28

Liora

So, Production Engineering is a role at Jane Street. It is a flavor of engineering that focuses on the production layer of our systems, which is a pretty big statement and I can definitely break that down. But I’ll say the motivation here is that Jane Street is writing software that trades billions of dollars a day. And so it’s important that that software behaves as we expect in production, right? And if it doesn’t, we want people to notice right away and to address what’s coming up. Production Engineers have support as a first-class part of their role. So, when we are on support, we are the first line of defense for our team and we are responding to any issues that arise in our systems during the day, whether that be from an alert or for a human raising some behavior that they observe to us. And we are really tackling those issues right away.

Templates. Yeah, right. Where we have templates of, what is the shape of a postmortem, what are the set of things you write in a postmortem, there’s the place for the timeline of an event and the place to write down — you were kind of echoing this, you know — here are the things that went well, here are the things that went badly, here are the takeaways. And one of the things I worry about is that sometimes people take the template as the thing they need to do and they go in and fill it in the template. And the thing you most want to have people do when they’re writing a postmortem is to stop and think, and be like, “Yeah, yeah, there’s a lot of detail but like, big picture, how scary was this? What really went wrong? What are the deep lessons that we should learn from this?”

And I totally see the value of giving people structure that they can walk through. You know, the old five paragraph essay. You give people some kind of structure that shapes what they’re doing. But at the same time there’s something hard about getting people to, like, take a deep breath, look wide, think about how this matters to the overall organization, and pull out those big picture lessons. And I sometimes feel there’s just tension there where you give people a lot of structure and they end up focusing on the structure, and spend less time leaning back and thinking, “Wait, no, no, no. What is actually important to take away from this?” And I’m curious whether you’ve seen that and if you have thoughts about how to get people to grow at this harder task of doing the synthesis of, like, “No, no, no, what’s really going on here?”

40:11

Liora

Yeah, I definitely have observed that. My personal opinion is that the template can be really helpful for people who have never written one before. Because it can be kind of intimidating — this big thing went wrong, now go write about it and reflect on it. And having a bit of structure to guide you when you’re starting off can make it much more approachable. So I do think there is a role for the template, but I definitely agree with you that it can be restrictive. And I think once people are kind of in the flow of writing postmortems and are a bit more used to it and know what to expect, they’re not gonna get writer’s block sitting down. I think removing the template and giving them a bit more space is totally reasonable. Then your question is like, how do you get them from the template onwards, and how do you get them to think about this big-picture framing?

And I think the answer to a lot of these questions is that hopefully the Production Engineers around them are instilling this in them. I mean, to be honest, postmortems are not the only way we reflect and take action after an incident. In reality, you’re going to have team meetings about this and you’re gonna get a lot of people in a room who are talking about what could have gone better or maybe they’re just in the row talking about it, but there’s going to be a real conversation where a lot of these big-picture questions are coming up. And I think it’s really important to have those discussions. And the postmortem is a very helpful tool in reflecting, but it I think should not be the only tool in your arsenal.

41:28

Ron

Yeah, that makes sense. Another thing that I often worry about when it comes to support, is the problem of, how do you train people? And especially, how do you train people over time as the underlying systems get better? I think about this a lot, because early on, stuff was on fire all the time. Every time the markets got busy, you know, things would break and alarms would go off and there were all sorts of alerts. And you had lots of opportunities in some sense to learn from production incidents because there were production incidents all the time. And, we’ve gotten a lot better. And there are still places that are, like, new things that we’ve done and we’re still working it out and the error rate is high. But there are lots of places where we’ve done a really good job of machining things down and getting the ordinary experience to be quite reliable. But when things do go wrong, you want the experience of knowing how to debug things. How do you square that circle? How do you continue to train people to be good Production Engineers in an environment where the reliability of those systems is trending up over time?

42:29

47:31

Liora

Yeah, so I guess if you go back to my story, right, I was applying to Software Engineering and Jane Street raised Production Engineering to me as something I might be interested in. And that pattern does pop up a fair amount because certainly, at least at the college level, students don’t know about Production Engineering and so they’re gonna default to Software Engineering, which is totally reasonable. And it’s also kind of just hard to tell what a student might be interested in because you only have so much data. So I think often if we think someone might be interested in Production Engineering, we will propose it to them, see how they feel about it, and have a conversation. We have interviews that are Production Engineering-focused, so someone can even try it out and see if they find it fun or not. We do legitimately get feedback that some of our Production Engineering interviews are pretty fun.

Maybe people are just being nice, but I think they are legitimately fun if you’re into this solving-a-puzzle business. So, part of this is us reaching out to people, or trying to identify people who will be interested in it. But I think there’s also this lateral pool of candidates who, like you mentioned at the beginning, have done something similar at another tech company and so are somewhat in this space already. And then those people will be a bit more opinionated about what type of work they would want to do and if this would be a good fit for them.

48:46

Ron

Can you identify any of what you think of as the personal qualities of someone who’s likely to be an excellent Production Engineer? I think there are lots of people we hire who would be kind of terrible, unhappy, ineffective Production Engineers, and some people who are great at it. What do you think distinguishes the people who are really well-fitted to the role?

49:03

Liora

So, I think strong communication is really important for Production Engineers. And communication, sometimes I think people think of it as a throwaway skill or something like that, but it’s so key because Production Engineers are really the glue between a lot of teams. They’re gonna be speaking to people who have very different mental models of all the data and systems in play. You know, like I said, when we’re talking to Operations, we might speak a different language about a trade, right? I think the debugging skill is of course important and I think that’s a great example of how all of these skills are obviously also important for Software Engineering and other engineering disciplines — but I think especially important for Production Engineering because you’re doing an extra high level of investigation and debugging. I would say carefulness, just because you’re interacting with production systems and it’s important that you are taking extra care and thought and aren’t gonna do something crazy, you’re gonna think it through.

You wanna be pretty level-headed and I guess that would be the last quality I would mention, which is just remaining calm. Even in a stressful situation, which might not — to be fair — help you on every team because some teams don’t need that. But typically no matter what team you’re on, a lot of stuff might come up at once. You might feel a little bit overwhelmed and that’s okay, you just need to keep a level head, remain calm, not panic. And that is a certain type of person and that is totally not everyone. And some people would hate that and some people like me don’t mind that situation at all. And some people love that, and then they go to the Order Engines team, right? So you get this big spectrum. But I think across all these teams, those are the qualities that kind of stand out.

50:35

Ron

Having just a personal enjoyment of putting out fires. I think some people find it unpleasant and hard and some people find it really energizing. I think I actually found it really energizing. I don’t do as much of it as I once did, but there’s something exciting about an emergency — not exciting enough that you would create one when you didn’t need one. But when it’s there, there’s something joyful about it.

50:53

Liora

Yeah, and I think it kind of ups the reward factor because then once you’re done solving it, you feel extra excited about having tackled this really big issue.

51:02

Ron

Do you do anything in the interviewing process as a way of trying to get to these qualities? The one that seems like it maybe should be possible to get at is the debugging skill. What do you guys do that’s different for evaluating Production Engineers from what you might do for Software Engineers?

51:17

Listen and subscribe:

Solving Puzzles in Production

with Liora Friedberg

Episode 20 | October 7th, 2024

BLURB

SUMMARY

TRANSCRIPT

00:03

Ron

00:21

Liora

00:22

Ron

00:28

Liora

02:29

Ron

03:02

Liora

03:13

Ron

03:27

Liora

04:23

Ron

04:27

Liora

05:17

Ron

05:20

Liora

05:42

Ron

05:50

Liora

06:43

Ron

07:15

Liora

07:16

Ron

07:25

Liora

07:47

Ron

08:02

Liora

08:44

Ron

09:21

Liora

10:06

Ron

10:30

Liora

10:53

Ron

11:19

Liora

11:43

Ron

12:01

Liora

13:15

Ron

13:22

Liora

14:17

Ron

15:11

Liora

15:12

Ron

15:24

Liora

15:42

Ron

16:00

Liora

16:42