00:00:04

Ron

Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky.

All right, so, it’s my pleasure today to sit down and have a conversation with Dominick LoBraico about email. In particular, we’re going to talk about a system that Dominick architected and led the development of called Mailcore, which is Jane Street’s own homegrown mail server.

And I think this is interesting on its own, because email and the whole architecture behind it are interesting topics, but it’s also a lens into some broader questions about software design and how you manage infrastructure: how you choose between building your own thing and using standard, existing tools, and how programming language plays a role in systems design.

00:00:48

Dominick

Hi, Ron.

00:00:49

Ron

Hey, DLo. So, to get started, can you tell us a little bit about how email works?

00:00:54

Dominick

Sure. Yeah. So email is based on an old and venerable protocol on the Internet called the Simple Mail Transfer Protocol, SMTP, and SMTP… you can kind of think of it as playing the role that the Postal Service plays in delivering regular mail. It is a way for one server that wants to deliver a message somewhere, to hand that message off to another party, who can get it to its final destination, whether that is the eventual destination server itself or some intermediary who can help you get a little bit closer.

Email as we know it today came into being in the early days of the Internet, and the protocol itself is very simple. You basically have the body of the message, which has its own separate format and specification, and then you have a set of instructions for expressing who that message is destined for and who it’s coming from, and so one server connects to another.

And it says, “I’ve got a message. It’s coming from so and so, and it’s meant to be delivered to some other person. Here’s the body of the message,” and the receiving server can do with that what it will. It can either say, “Great, I’ll take that, and I’ll be responsible for it from here on out.” It can say, “No, I don’t know anything about that person. You have to find somebody else to deliver that to” or reject it for any number of other reasons, like, “This looks like it has a virus,” or “You’re not allowed to connect to me,” or “I’m not available for receiving mail right now.”
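As an illustration of the exchange Dominick describes, here is a minimal OCaml sketch of both sides. The client lines follow the standard SMTP command sequence from RFC 5321, simplified to one string per line (no CRLF endings, and ignoring that the client waits for a reply after each command). The `decision` type and its mapping to reply codes are hypothetical: the codes themselves are standard, but which one a real server uses for, say, a virus rejection varies by implementation, and the reply text is free-form.

```ocaml
(* Client side: the envelope (who the message is from and who it is for)
   is sent as SMTP commands, separate from the message content itself. *)
let client_lines ~(sender : string) ~(rcpts : string list) ~(body : string list) =
  (* RFC 5321 dot-stuffing: a line containing only "." ends the data, so any
     body line that starts with "." gets an extra "." prepended. *)
  let stuff l = if String.length l > 0 && l.[0] = '.' then "." ^ l else l in
  [ "EHLO client.example"; Printf.sprintf "MAIL FROM:<%s>" sender ]
  @ List.map (Printf.sprintf "RCPT TO:<%s>") rcpts
  @ [ "DATA" ] @ List.map stuff body @ [ "."; "QUIT" ]

(* Server side: the decisions a receiving server can make (hypothetical). *)
type decision =
  | Accept            (* "I'll be responsible for it from here on out" *)
  | Unknown_recipient (* "I don't know anything about that person" *)
  | Rejected_content  (* e.g. "This looks like it has a virus" *)
  | Not_permitted     (* "You're not allowed to connect to me" *)
  | Unavailable       (* "I'm not available for receiving mail right now" *)

(* 2xx = accepted, 4xx = transient failure (the sender keeps the message and
   retries later), 5xx = permanent failure. Reply text is illustrative. *)
let reply = function
  | Accept -> (250, "OK")
  | Unknown_recipient -> (550, "No such user here")
  | Rejected_content -> (554, "Transaction failed")
  | Not_permitted -> (554, "Access denied")
  | Unavailable -> (421, "Service not available, closing transmission channel")

let sender_should_retry d =
  let code, _ = reply d in
  code >= 400 && code < 500
```

The 4xx/5xx split is what lets the sending server know whether to keep trying later or give up and bounce the message.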

00:02:17

Ron

And one thing that always strikes me about email is that it’s this kind of wondrous artifact from the early Internet: a truly open social network. People talk a lot about whether we could make existing social networks better and more open, and email, from its initial design and through its complete history, has been this very open thing. As you point out, the core protocols and transports are relatively simple, although there is a surprising amount of complexity in the RFCs that tell you how to parse a particular email. The overall system is pretty simple, but there’s a lot of complexity in all of the different players who build systems that actually manage and transfer email, and in how they deal with the various problems that come up, like spam and people attacking systems via email. So the foundations are relatively simple, but the emergent complexity of the system is actually pretty high.

00:03:10

Dominick

Like many protocols of the old Internet, it was designed at a time when the world was much simpler than it is today, especially the Internet-connected world. You know, there were probably 50 institutions that had Internet connections or ARPANET connections at the time, and you didn’t really have to worry that anybody was going to be spamming, because barely anybody even knew what email was in the first place.

00:03:30

Ron

When you start building a new thing, the early properties of what you build can often be really sticky and really matter in a way that’s kind of hard to predict. So this one early property of being open has stayed there. Email is a thing that anyone can participate in. Organizations can build their own infrastructure to connect to it, and through all the rather large transformations that the email system has gone through, that openness remains as a core property.

This is the horrible thing about building something new. When you design something new, you have to make a bunch of choices, and clearly you shouldn’t worry about them that much, because the thing you build is probably going to fail, and even if it doesn’t, you’re going to learn more about the problem later. But also, some of the early decisions are going to turn out to be very hard to change, and you don’t know which ones.

00:04:13

Dominick

That’s right.

00:04:14

Ron

And you’ll be stuck with them until the end of time.

00:04:16

Dominick

And in fact, the big players in email today (obviously Google and Gmail account for a really large percentage of the email sent and received on the Internet) are still wrestling with some of those early decisions and some of the openness that was architected in, as they try to figure out how they can make email more secure, protect their users, and rein in some of the malicious actors on the Internet. That’s just a hard thing to do while trying to maintain the existing openness that email has; it cuts both ways, I guess.

00:04:45

Ron

That openness, in the end, has a lot of value.

00:04:46

Dominick

Absolutely. Yeah.

00:04:48

Ron

So the story here is about how you ended up building the system called Mailcore. What did email at Jane Street look like when you first ran into the problem?

00:04:56

Dominick

So you might think that there’s really not much special about the way Jane Street uses email compared to any other company, and largely, that’s true. I think we have a few special requirements by dint of the fact that we are in a regulated industry; in particular, for compliance purposes, we have requirements around logging every message that is sent or received by somebody at Jane Street.

But other than that, our email system looks pretty similar, or has looked, in the past, pretty similar, to the way an email system in any organization might look, and the rough summary is we have some mail gateways that sit on the outside of our network for receiving email from foreign servers, you know, from external parties, and then we have some mail server, or a set of servers, inside of our network that handle all of the complicated business logic around what to do with those messages.

So, in some cases, it’s as simple as receiving the message and delivering it into the user’s mailbox, if we are the intended recipient. In other cases, it’s applying filtering for things like spam and viruses and other things that we might want to extract from messages before we deliver them, or doing expansion for mailing lists. So if you send an email to some group at Jane Street, you want to be able to expand that group name to the actual list of recipient mailboxes to make sure that it actually ends up in the inboxes of the recipients it’s destined for. And then there’s this extra compliance implication of making sure that we’re logging all of the right messages with all of the right metadata.
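Group expansion is naturally recursive, since a group can contain other groups. Here is a minimal sketch, assuming a hypothetical in-memory directory (`groups`) in place of whatever directory service a real deployment would query:

```ocaml
(* Hypothetical group directory: group name -> members. A member can be a
   mailbox or another group. *)
let groups : (string * string list) list =
  [ ("trading", [ "alice"; "bob"; "ops" ]);
    ("ops", [ "carol"; "trading" ]) (* deliberate cycle back to "trading" *) ]

(* Expand a recipient into the sorted list of mailboxes it refers to.
   [seen] records every name already visited, which both breaks cycles
   between groups and stops a mailbox from being listed twice. *)
let expand (recipient : string) : string list =
  let rec go (seen, acc) name =
    if List.mem name seen then (seen, acc)
    else
      let seen = name :: seen in
      match List.assoc_opt name groups with
      | None -> (seen, name :: acc) (* not a group: a plain mailbox *)
      | Some members -> List.fold_left go (seen, acc) members
  in
  let _, mailboxes = go ([], []) recipient in
  List.sort compare mailboxes
```

With this directory, `expand "trading"` yields `["alice"; "bob"; "carol"]` even though `ops` points back at `trading`, and a plain address such as `"dave"` expands to itself.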

And at the time that I started, the mail infrastructure here was all based on an open source mail server that has its own config language and is pretty widely used on the Internet at large, and we had about 400 or 500 lines of configuration in the most complex case, I think, for this system to get it to do all of these different things that we wanted it to be able to do.

00:06:44

Ron

Great. So that sounds like a reasonable approach in terms of how to build oneself a mail system. What problems did we run into with it?

00:06:50

Dominick

Yeah, so the biggest problem here, at the end of the day, was the complexity required for configuring this system to do all of the things that we needed it to do. So, now, I said 400 or 500 lines of configuration – that probably doesn’t sound like a huge number, but when it’s in a kind of bespoke configuration language that’s unlike the configuration of any other system and unlike any programming language that a developer or engineer at Jane Street would be familiar with, the complexity of 400 or 500 lines in a foreign language is pretty large and can be a little bit imposing to deal with.

In particular, we had some scary near-misses where we realized that we had done the wrong thing when archiving, for compliance purposes, some email that we were supposed to archive. Luckily, in each of those cases, there were mitigating factors such that it didn’t end up being a big deal, but those near-misses gave us a little bit of a scare, because we went and looked at the configuration to understand how we had gotten ourselves into this position, and it was harder than it felt like it should be to understand what had gone wrong and how to fix it.

00:07:50

Ron

It’s maybe also worth mentioning that the problem of logging all of your messages for compliance purposes may sound easy, but it’s made more complicated by the fact that Jane Street operates in lots of different regulatory regimes and actually has different rules for some of the different places it operates. So even the seemingly simple “Let’s just write everything down” is more complicated than it might appear at first.

00:08:11

Dominick

That’s right, yeah. We have different requirements in terms of what has to be written down and what kinds of metadata we need to store and where the extra copies need to be physically located around the world and things like that, which are reasonable sounding when you think about the human aspects of it, you know, when you reason about, okay, yeah, you need a copy for this and a copy for that, but actually implementing the rules in practice ends up being pretty complicated.

00:08:31

Ron

So, one of the things that motivated you to try and do something new was this kind of near-miss situation of things almost going horribly astray. Were there any other reasons that you wanted to try something different?

00:08:41

Dominick

As I said, one aspect of it was certainly this realization that the complexity of the system had gotten to a point where we were actually scared to make changes to it. Another came from the fact that it required this kind of specialized knowledge. We have a team (at the time, we were much smaller than we are now, but even today) made up primarily of generalists, people who are able to work on a lot of different kinds of problems and have a general background across an area of technology. Understanding the configuration for this particular open source mail server is not something that you just have as part of your general knowledge. It really required specialized understanding and background, more so than the general skills required to administer an email system or understand the concepts behind email. You really just needed to know the particular weird semantics and dark corners of this particular language, and the idea that we needed to build or keep a team specifically to understand and be comfortable working with it just didn’t feel like a good use of our people resources. There are a lot of other problems that we need to be solving, and we’d much rather be able to take as general an approach to them as we can.

00:09:48

Ron

Can you give me an example of the way in which the config language was hard to reason about?

00:09:53

Dominick

I think this is an example of a pretty common pattern that you see in a lot of systems that are intended to be highly flexible and configurable. They start with a relatively simple core that handles the basic functionality, and over time, as they try to add more features to the system, they add more and more knobs that you can turn and more and more configuration parameters or elements in the configuration language to make it possible to express all of those different things you might want to be able to express.

And in this particular case, the configuration language is a bespoke, domain-specific language developed just for this system. It kind of resembles, in some places, the old-school .INI format of having, like, a key and then an equal sign and then a value and sections separated with kind of headers in brackets and things like that, but then when you look a little bit closer, you realize it has all this extra power layered on top.

So, in particular, it has support for these kind of advanced macros that look a little bit like function calls where you can call a macro with some set of arguments and it expands to something else, and there are these different phases of expansion of these configuration elements where you can do this kind of meta-programming, or you can have macros that produce macros that then get expanded to some resulting values, and then, on top of that, the set of fields that are required in the configuration and the interaction between those things is not made very clear, and it’s not really very consistent.

So, for example, you might have a section that defines the way that you can route a message, the way that you can decide where a particular incoming message should go, whether you’re going to send it to a mailbox or relay it to some other server, and you can define multiple routers, and the semantics in terms of which router is going to get selected for a given message are not made explicit by the configuration language. And there are a bunch of other examples like this where there are some set of elements that you define, and the semantics for how the system chooses which of those to apply in a given case are not explicit and clear from the configuration superficially speaking. You just have to know. You have to go and read the documentation and understand how it is that these things interact with each other.
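By contrast, when routing is written as ordinary code, the selection rule can be made explicit. This sketch assumes a hypothetical `router` record and a first-match-wins ordering; it does not model the old server’s actual semantics, and that is the point: here the rule is visible in the code rather than in the documentation.

```ocaml
(* Hypothetical routing model: where an incoming message should go. *)
type destination = Mailbox of string | Relay of string

(* A router either claims a recipient address or declines it. *)
type router = { name : string; route : string -> destination option }

let domain_of addr =
  match String.index_opt addr '@' with
  | Some i -> String.sub addr (i + 1) (String.length addr - i - 1)
  | None -> ""

let local_users = [ "alice@example.com"; "bob@example.com" ]

let routers =
  [ { name = "local";
      route = (fun a -> if List.mem a local_users then Some (Mailbox a) else None) };
    { name = "partner";
      route =
        (fun a ->
          if domain_of a = "partner.example" then Some (Relay "mx.partner.example")
          else None) } ]

(* The selection rule is explicit: routers are tried in list order, and the
   first one to return [Some _] wins. *)
let select (addr : string) : (string * destination) option =
  List.find_map (fun r -> Option.map (fun d -> (r.name, d)) (r.route addr)) routers
```

Reordering the list is the only way to change precedence, so anyone reading the code can answer "which router fires?" without consulting a manual.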

00:12:00

Ron

Right, and the rules for picking which particular rule fires in a particular case, I assume those rules are not simple themselves?

00:12:07

Dominick

They’re not simple, and in some cases, for good reason. I mean, the system, it’s worth saying, is highly, highly flexible, and it is the case that it could do all of the things that we wanted it to do at the time, but, ultimately, the way in which you needed to contort yourself to understand how it was going to do that and how to fit those different pieces together required an expert-level knowledge of the semantics of the particular system.

00:12:29

Ron

So you had a clear problem in front of you. What approach did you decide to follow to address it?

00:12:34

Dominick

Ultimately, what we decided to do was the ostensibly crazy-sounding thing of writing our own email server. In particular, we wrote a new email server in OCaml, the functional programming language that we use here at Jane Street, and crucially (and maybe the most interesting part of this) the system was also configured in OCaml. The real problem we had identified was that, while we were happy with the core functionality of the old system, the configuration language was what we felt was really limiting us.

And we came to this fundamental realization that, ultimately, you can think of an email server as a function. You can think of it as a black box that implements a function that takes a message and outputs one or more resulting messages, and that black box is responsible for making all of the decisions about how to transform those messages and how to route them to further servers or to inboxes. At the end of the day, you can encapsulate everything in a function that looks roughly like that. And OCaml, as you know, is a functional language, and it really lends itself to writing functions in this way: composable units that you can stitch together to implement bits of functionality that ultimately take some inputs and generate some outputs without any side effects. That was what we realized we needed, and so we started down that path.
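The "server as a function" idea can be sketched in a few lines of OCaml. Everything here (the `message` record, the `stage` type, the `>>>` composition operator, and the example stages) is hypothetical and far simpler than a real mail server; it only illustrates how small pure stages compose into a policy.

```ocaml
(* Hypothetical, heavily simplified message type. *)
type message = { sender : string; recipients : string list; body : string }

(* The server's policy as a pure function: one message in, zero or more out. *)
type stage = message -> message list

(* Sequential composition: run [g] on every message that [f] produces. *)
let ( >>> ) (f : stage) (g : stage) : stage = fun m -> List.concat_map g (f m)

(* Stand-in for a spam/virus filter: here it just drops empty messages. *)
let drop_empty : stage = fun m -> if m.body = "" then [] else [ m ]

(* Compliance: emit an extra copy for a (hypothetical) archive mailbox. *)
let archive_copy : stage =
  fun m -> [ m; { m with recipients = [ "archive@example.com" ] } ]

(* Fan out so that each resulting copy has exactly one recipient. *)
let split_recipients : stage =
  fun m -> List.map (fun r -> { m with recipients = [ r ] }) m.recipients

let policy : stage = drop_empty >>> archive_copy >>> split_recipients
```

Because each stage is a pure function, it can be unit-tested on its own, independently of the SMTP machinery that feeds it messages.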

00:13:53

Ron

That simple pivot in the design lets you bypass all of the complexity of this custom language. You just get to pick a really well-thought-through, well-engineered abstraction at the heart of OCaml, which is the function, and use the ordinary tools for software composition that you have there to build the abstraction that you want. That kind of gets you out of the problem of having to think about a weird, complicated special case that comes up for mail and for nothing else.

00:14:17

Dominick

Exactly. We already had OCaml developers, and we already had a lot of people who understood the semantics of OCaml and the way in which the various language features might interact with each other. So we didn’t now have to go out and find a bunch of people who understood this esoteric configuration language. We could just find people who knew OCaml.

00:14:33

Ron

So one thing that strikes me is that, in some sense, this story sounds very familiar. What you’re describing about this mail server configuration language sounds an enormous amount like the story around build systems, say, Make. Make has a relatively simple core domain-specific language which, instead of talking about things you do with mail, talks about rules for building things, with dependencies and targets and all of that, and that language is indeed insufficient for doing big and complicated things. So people have built complicated macro systems. In fact, there’s a macro system inside of Make, so that you can write Make rules that generate Make rules that generate Make rules, and that’s kind of horrible, and no one’s super happy about that as a way of doing complex builds. But there’s another way that people sometimes get out of this problem, which isn’t necessarily to create their own build system, although plenty of people do that, including us.

00:15:25

Dominick

Us, twice!

00:15:26

Ron

Embarrassingly. That’s right. But another approach is what you might call the config-gen approach, which is to say, okay, there’s a simple core configuration language, and there’s a bunch of complicated stuff on top, which is about increasing the generality of the language. Let’s forget all of that terrible stuff and instead write code in another language where we have better abstractions and better tools, and have it just generate things in the simple core calculus that’s exposed by the underlying config language. Then we get the best of both worlds: we get to write our configurations in a nice high-level language that we understand well and that isn’t a special-purpose skill, and we get to use the core engine that has been built and maintained by other people, so we don’t have to re-implement it. So why wasn’t that the path that you chose with Mailcore?
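For concreteness, the config-gen approach might look like the following sketch. The INI-like output format, the `route` record, and its fields are all hypothetical stand-ins for some server’s simple core configuration, not the format of the server Jane Street was using.

```ocaml
(* Hypothetical high-level description of a route, written in OCaml. *)
type route = { domain : string; relay_host : string; port : int }

(* Render one route into a hypothetical INI-like "simple core" format. *)
let render_route r =
  Printf.sprintf "[route %s]\nrelay = %s\nport = %d\n" r.domain r.relay_host r.port

(* Ordinary OCaml abstractions (lists, functions) do the work that would
   otherwise need macros in the configuration language itself. *)
let render_config routes = String.concat "\n" (List.map render_route routes)

let routes =
  List.map
    (fun d -> { domain = d; relay_host = "relay." ^ d; port = 25 })
    [ "a.example"; "b.example" ]
```

`print_string (render_config routes)` would emit two flat sections, one per domain. The trade-off Dominick describes next is that this only pays off if the generated core format is itself simple enough to reason about.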

00:16:12

Dominick

I think there are three reasons. I think two are good reasons, and one is a bad reason. I’ll start with the good reasons.

The first is we really, at the time, were not happy with the primitives that the system that we were using provided to us. So the configuration language was complicated, even in its simplest form, and it’s not like we had some nice primitives that we could work with where we just needed to generate those and we could do anything with those and everything else was built on top of it. We would’ve had to generate complex macros and some of the config elements I was talking about before, and we didn’t feel like we would be saving ourselves very much by generating those versus writing them by hand. In fact, we still would’ve needed to understand it just as well. It’s not like we could’ve limited our understanding to a subset of the language and just implemented everything we needed using that. That was the first reason.

The second reason is that we did want some runtime dynamism. We did want the ability, in some cases, to actually change behavior based on other things out in the environment, other things out in the world, and with the configuration that we would’ve had to generate to do that, we would’ve been back in the exact same position that we were in before. So we ended up feeling that we’d be happier implementing those more dynamic features in a language that we were much more familiar and comfortable with, rather than trying to implement them via config generation into some lower-level language.

The third reason, and the kind of worst reason, like I said, is, ultimately, at the time, I think config generation was a much less popular, much less widely-used technique at Jane Street, and I think we probably didn’t consider it as seriously as we should have at the time, because the tooling and the kind of prior art and other examples of it internally just wasn’t widespread enough for it to be top-of-mind as a possible solution.

00:17:53

Ron

So here’s another idea of how you might’ve gotten yourself out of the problem. It sounds like you looked at the config language and said, “Wow, this is incredibly hard to reason about. It’s hard to understand. Let’s move to a different language.” There’s another way you could respond to the problem of “wow, this thing is really hard to understand,” which is to try to test it much better.

You step back and look at it from the outside. What does a mail server look like? It looks like a big bundle of functions, or maybe one big function, that takes in email and decides what emails need to be emitted out of the other end. You could take that view as an approach to testing it: build some framework around the system and make a bunch of assertions like, “Oh, yeah, if we put this email in, we expect these emails to go out.” That’s another way to build confidence that the system behaves the way you expect, even if the underlying config language is kind of a disaster. With a good testing framework on the outside, you might not completely nail down the behavior, but you can get yourself a lot of confidence about the way in which the system behaves.
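A black-box test of this kind might look like the sketch below. The `system` function is a toy stand-in, an assumption for illustration; in the approach Ron describes, it would wrap the real server running inside a test harness.

```ocaml
(* Toy message type: just enough to observe routing behavior. *)
type message = { rcpt : string; subject : string }

(* Stand-in for the system under test, viewed from outside as a black box.
   Here it expands a hypothetical "all" list into two mailboxes. *)
let system (m : message) : message list =
  if m.rcpt = "all@example.com" then
    List.map (fun r -> { m with rcpt = r }) [ "a@example.com"; "b@example.com" ]
  else [ m ]

(* Each case: "if we put this email in, we expect these recipients out". *)
let cases =
  [ ({ rcpt = "a@example.com"; subject = "hi" }, [ "a@example.com" ]);
    ({ rcpt = "all@example.com"; subject = "hi" }, [ "a@example.com"; "b@example.com" ]) ]

(* Run every case and return the inputs whose outputs differ from what was
   expected. Recipients are compared as sorted lists, so order is ignored. *)
let failures () : message list =
  List.filter
    (fun (input, expected) ->
      let got = List.sort compare (List.map (fun m -> m.rcpt) (system input)) in
      got <> List.sort compare expected)
    cases
  |> List.map fst
```

A failing case tells you *that* behavior changed, but, as Dominick points out, not *which* piece of configuration caused it when the system itself isn’t composable.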

00:18:52

Dominick

We did consider that at the time, and I think the reason that we didn’t feel like that was sufficient was, primarily, that while we could’ve tested the full end-to-end system that way, the units of configuration are not composable enough for us to be able to test smaller subsets of it, and so you might be able to say, “yup, this didn’t do what I expected it to; this broke,” but that doesn’t necessarily help you figure out why it broke or what it was that changed, especially in the face of these confusing semantics that I talked about before.

And then, beyond that, we would really have been fighting kind of an uphill battle in the sense that this is not a software system that was designed to be testable in this way or a configuration language that was designed to be testable in this way, and so we ultimately would’ve had to build a lot of our own tools and a whole harness to run this system within to be able to even get there, and then get these suboptimal results.

00:19:40

Ron

There’s this general problem. If the basic system isn’t composable, that’s a problem that’s hard to get around.

00:19:44

Dominick

One thing we did consider is moving to a different open source system, or just another mail server implementation. You know, this isn’t the only one in the world. There are others, and the reason we ended up ruling that out is that we went and looked around at the most common other mail servers out there, and we saw basically two variants, two different potential alternatives that we could’ve looked at.

One was a class of very popular and widely used systems that look pretty similar to the system that we were already using in terms of how they were configured and the kind of complexity and sort of system-specific knowledge required to work with them. It didn’t feel like there was enough justification to migrate to some other system just to understand a whole new set of semantics and a whole new set of complexities.

And then the other flavor of system that we came across was a set of much newer systems, much less widely used and much less popular in the world at large, that were implemented in ways similar to how we eventually architected Mailcore: a small core implemented in some language and a very flexible configuration based on a common programming language, something like Python or Lua, and there are a handful of those around.

I think the two reasons that we didn’t go down that path, one is none of them were widely used and baked in enough for us to feel confident that they were the right choice. You know, there wasn’t kind of an obvious one that was a frontrunner that we could just say, “Oh, yes, everyone’s using that. It must be good. It must be well tested and used in production,” so to speak.

And the other reason is if we were going to switch to a configuration language that was an actual programming language, we would be much happier using the language that we use for almost everything else where we have great tooling and a lot of experienced engineers around who are already familiar with the language. It just didn’t seem like, you know, switching to Python was going to be a net win for us in the long term.

00:21:29

Ron

You would’ve had to have gotten a lot of benefit from the engineering that had gone into the other system to compensate for the fact that you have to switch languages. There’s the language and the tooling, which is a big deal.

00:21:40

Dominick

Exactly.

00:21:41

Ron

Okay. So you had an architecture in mind. You had an approach to take. How did it go? What were the problems as you ran down this path?

00:21:48

Dominick

Initially, things moved really quickly and went really well. We were able to implement the core SMTP protocol pretty quickly. Like I said, it’s a relatively simple protocol, so that went relatively smoothly, and then we started down the path of writing the configuration, you know, writing this pile of OCaml that was meant to replicate the functionality that we had in the old system.

This was kind of an interesting experience, because we found, pretty quickly, cases where the old system either had non-deterministic behavior or was just doing the wrong thing in some case that we hadn’t noticed in production or that hadn’t really bitten us yet, but could have. The exercise was really one of reverse engineering many years of configuration changes, of people slapping things in to fix issues or to add functionality, and trying to figure out the intent behind those changes so that we could reproduce that intent in the new system.

I think we eventually got to the point where we felt pretty confident that we had addressed most of the existing functionality, but then we were faced with a new problem: how do you build enough confidence in a completely new system, one that’s never been run in production anywhere before and that has a completely rewritten configuration, to move the entire firm’s communications over to it? You know, it’s not something you want to do overnight.

00:23:01

Ron

This may be worth highlighting. Email is absolutely critical to Jane Street, sometimes in a way that’s incredibly important over short periods of time. If you turn off traders’ email, things go wrong right quick, and a trading business needs to be able to respond to information quickly, so problems in the communication systems are incredibly critical.

00:23:18

Dominick

And we’re a global team. You know, we’re spread across, you know, three continents, and a lot of the way that we make sure that we’re keeping things consistent and that we’re keeping in touch between regions is email, and so it’s not like you can say, “Oh, well, we’ll do it overnight.” The Hong Kong office isn’t going to be happy about that, and similarly, you know, you don’t necessarily want to just do a big bang migration over a weekend and hope that Monday morning goes smoothly. It’s something that’s kind of fraught with peril.

So we started thinking about how to build this confidence, you know, what we could do to test the new system enough that we would feel ready to actually make that flip, and we came back to something that we talked about a little bit earlier in this conversation, which is this idea of testing the end-to-end behavior of the system and demonstrating that it was doing what you expected it to do in all cases.

But we didn’t really have a reference implementation that we could use in OCaml. We weren’t sure how to produce something like that, and so we looked back at our existing configuration. You know, we had this thing that had been working for years, or at least working enough that we thought it was working, and so we thought about how we could leverage that, and what we ended up doing is setting up, basically, what we called a shadow instance of our new OCaml-based email server, Mailcore, and running it in parallel with the existing system.

And so, for each message that came into our walls, we would fork off a second copy of the message and send it to the new system, in addition to sending the original message to the old system. Then we set up some endpoints sitting on the output side of the old system and of our new system to keep track of the output generated by each: what messages got generated, what transformations had been applied, and where the messages were going to be directed. And then we just diffed those. We set up, essentially, a streaming diff of all of the messages coming through both, and we made a lot of noise to ourselves each time we saw a case where the new system and the old system didn’t behave the same way. We found, like, 5 or 10 different cases where entire classes of mail were doing a slightly wrong thing. In the majority of those cases, I think, it was actually the old system that was doing the wrong thing and not the new system, but it still made us feel sort of warm and fuzzy to know that, for the vast majority of email, the behavior was the same. And so we ran that for a long time, for months.
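The streaming diff can be sketched as follows, assuming (hypothetically) that each system’s output for one incoming message is summarized as an `outcome` record keyed by the original message’s ID. Outcomes from the two systems can arrive in either order; as soon as both sides have reported on a message, they are compared.

```ocaml
(* Which system produced an observed outcome. *)
type side = Old | New

(* Hypothetical summary of what one system did with one incoming message. *)
type outcome = { msg_id : string; destinations : string list }

(* Outcomes seen from one side but not yet from the other, keyed by ID. *)
let pending : (string, side * string list) Hashtbl.t = Hashtbl.create 64

(* Feed in one outcome as it arrives. Returns [Some id] when both systems
   have now handled message [id] and disagree, [None] otherwise. Sorting
   and deduplicating means ordering differences are not reported. *)
let observe (side : side) (o : outcome) : string option =
  let dests = List.sort_uniq compare o.destinations in
  match Hashtbl.find_opt pending o.msg_id with
  | Some (other, dests') when other <> side ->
    Hashtbl.remove pending o.msg_id;
    if dests = dests' then None else Some o.msg_id
  | _ ->
    Hashtbl.replace pending o.msg_id (side, dests);
    None
```

A real harness would also need to time out messages that only one system ever reports; this sketch simply leaves them in `pending`.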

And then once we’d built up enough confidence, we started cutting users over one by one. We started from the setup where we were already forking off a copy of each message, and added some logic to decide which primary server a given user’s mail was supposed to go through. We sent our own mail through the new system for a little while and kind of ramped it up that way until all mail was going through the new system.
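The per-user cutover logic could be as simple as the following sketch. The `migrated` list, the user names, and the assumption that the non-primary server keeps receiving a shadow copy during the ramp-up are illustrative, not a description of Mailcore’s actual implementation.

```ocaml
type server = Legacy | Mailcore

(* Hypothetical migration state: users whose mail is already primarily
   handled by the new system. The ramp-up is just growing this list. *)
let migrated = [ "alice"; "bob" ]

(* Returns (primary, shadow): the primary's output is what actually gets
   delivered; the shadow still sees a copy so the diff keeps running. *)
let plan (user : string) : server * server =
  if List.mem user migrated then (Mailcore, Legacy) else (Legacy, Mailcore)
```

Keeping the shadow path alive in both directions means a regression after cutover still shows up in the diff, and rolling a user back is just removing them from the list.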

00:25:57

Ron

One of the things that strikes me about this story is that the thing your standard software engineer thinks of as the problem being solved is a fairly small part of the story you’re talking about. Writing the software that actually does the thing is, like, a little bit of the work, and then there’s a significant chunk of work in writing the config, which is, again, a programming task, and then there’s a bunch of just careful operational thinking about how the overall system works.

00:26:23

Dominick

And writing more software to set up this harness and the monitoring and the kind of diffing and all of that stuff, as well.

00:26:29

Ron

The other notable thing to me is that a big part of the work here was essentially wrestling knowledge out of the old system into the new system. When I came to Jane Street as a naïve young person out of grad school and thought about writing software, I thought, “Oh, software is where, like, someone has an idea of a thing that they want to happen, and then you write software that makes that thing happen.” And no, a lot of the time, software is replacing some old thing, and there’s no clear idea about what should happen. No human understands specifically what needs to be done. There’s just some old system that encodes all of this knowledge in a way that maybe no individual human ever knew all of it, but a bunch of people, over time, slowly added knowledge to this weirdly encoded knowledge base. And then you, as a software engineer, have to figure out how to wrestle it out of that.

I remember running into this many years ago when we were replacing the early version of our order engines, which were these systems that connected to other brokers and exchanges and routed our orders there. I came to this problem after working on trading systems, and with trading systems, it was like, a really smart person had written a thoughtful spec about how this thing was supposed to behave. And it was like, oh, yeah, okay, I can write to this spec. That was relatively easy. Whereas, you know, we had some order engine which had buried inside of it knowledge about how Bear Stearns’ internal infrastructure worked, which, again, was not anything that anybody internally really knew explicitly, and it took a long time to claw that knowledge out. It sounds like you ran into more or less the same problem here.

00:27:56

Dominick

Absolutely.

00:27:57

Ron

One of the problems I think one always runs into with the decision of whether we should use some external thing or build something on our own is the question of, like, how deeply do you mis-underestimate the size of the problem? How did that part go? How hard did you think it was going to be, and how hard did it turn out to be?

00:28:12

Dominick

If I’m completely honest, I don’t really remember what our estimate was at the time. This is now probably 5 or 6 years ago when we originally started on this effort, and I don’t remember what we thought would happen. I think it definitely took longer than we expected, because I think that’s a rule, basically, for almost everything that I’ve ever been involved in, but maybe the interesting point to highlight is that we were probably closer to our estimate on the implementation of the core system. But the implementation of the actual configuration and the migration to the new system I think exceeded the estimate that we would’ve made at the time. I think it just took a lot longer to build that confidence and to get to the point where we really did feel ready to flip to the new system. I think it took probably on the order of a year total, from kind of start to finish, or, you know, start to some version of finished, and then, of course, there’s been a long tail of improvements and changes and extensions since then.

00:29:05

Ron

Right. I guess one of the funny questions there is what’s the alternative? I guess one alternative was not doing anything new and just kind of suffering with the system as it was and kind of incrementally moving along, and there, I think that time estimate matters a lot, but if the alternative is switching to some other system that you hope will be better, it sounds like most of the work that you had to do, the stuff that took a long time, was stuff you would’ve had to do anyway.

00:29:29

Dominick

I think that’s right. Yeah.

00:29:30

Ron

Might’ve taken longer the other way.

00:29:32

Dominick

Yeah, and then it would’ve been harder to find people with the right expertise and knowledge to work on it, as well, in some sense. I think if we had switched to some other system with some other arcane configuration language, we would’ve had to learn all about that first and then start down those other paths, and at least in this case, we could say, “Okay, you know, you and you and you, you know, some OCaml engineers already working at Jane Street, here’s a new domain to apply your existing knowledge and sort of expertise to.”

00:29:58

Ron

So once the project landed and we started using it as our primary mail system, what were the benefits that we got as an organization from this work? How is Jane Street’s setup better now because of all of this change?

00:30:05

Dominick

There are a lot of reasons. I think the very first one is maybe obvious from the way I described the migration: we now had all this infrastructure around making changes. First of all, now we could write tests. We could use the same normal inline tests we use for any unit of OCaml code, and the configuration was composable, so we could reuse bits of it across different instances of our email server internally and across different use cases where we needed to run, you know, separate mail servers for one reason or another.
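What "composable, testable configuration" can look like is sketched below, with an invented vocabulary: `rule`, `instance`, `baseline_rules` and `rejects` are hypothetical, since Mailcore's real configuration types are not described in this conversation. One shared rule set is defined once and reused by two instances, and a plain function lets a test ask what a given instance would do.

```ocaml
(* Hypothetical config vocabulary. *)
type rule =
  | Reject_attachment_suffix of string
  | Add_header of string * string

type instance = { name : string; port : int; rules : rule list }

(* A shared rule set, defined once and reused by several instances. *)
let baseline_rules =
  [ Reject_attachment_suffix ".exe"; Reject_attachment_suffix ".js" ]

let external_gateway =
  { name = "external-gateway";
    port = 25;
    rules = baseline_rules @ [ Add_header ("X-Received-Externally", "yes") ] }

let internal_relay = { name = "internal-relay"; port = 2525; rules = baseline_rules }

(* Would this instance reject a message carrying this attachment? *)
let rejects instance ~attachment =
  List.exists
    (function
      | Reject_attachment_suffix suffix -> Filename.check_suffix attachment suffix
      | Add_header _ -> false)
    instance.rules
```

A check such as `rejects external_gateway ~attachment:"invoice.exe"` is then an ordinary unit test that can sit beside the config. In a codebase using ppx_inline_test it would more likely be written as a `let%test`, though that is an assumption about the setup, not something stated here.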

But more importantly, we now had this system that we could use to gain confidence in changes that were going to be far-reaching in the environment. So we could run a new version of Mailcore and an old version of Mailcore next to each other, and we could diff their behavior in the same way that we had diffed the old system versus Mailcore, and we used that to great effect. We still use that.

We also, as part of this, implemented a nice OCaml library for working with SMTP, both as a server and as a client, because, obviously, we needed to implement this core functionality as part of the construction of this system, and we found a bunch more use cases for that internally. You know, other cases where it’s useful to be able to stand up a small email server and take some automated action or write down a copy of a message or something like that. The biggest and most important impact that the system had, though, in the end was that it really lowered the barrier to making changes to the system. So, for a long time, we had kind of trained ourselves to not make changes, to not make improvements, to not really touch it when dealing with the old system because we were just ultimately scared that we were going to break something and that we didn’t have good tools to confirm or to give us confidence that we weren’t breaking something, but with Mailcore, we now had this system that looked very familiar. It looked like just about any other software system at Jane Street. It lived in our normal code-reviewed repository and was built with our normal build tools and had all of our normal OCaml-specific tooling and functionality, and that means that the set of people who felt like they could propose or make changes went way up.
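To give a flavor of the server half of an SMTP library, here is a minimal sketch of the envelope state machine for one transaction, written from RFC 5321 and not from Jane Street's library. It deliberately leaves out HELO/EHLO, RSET, QUIT and the message body. The types and `step` are hypothetical. It returns the reply code a server would send: 250 (OK), 354 (start mail input) or 503 (bad sequence of commands).

```ocaml
(* Envelope commands only. *)
type command = Mail_from of string | Rcpt_to of string | Data

type state =
  | Idle
  | Have_sender of string
  | Have_recipients of { sender : string; recipients : string list }

(* Returns the next state and the reply code for the command. *)
let step state command =
  match state, command with
  | Idle, Mail_from sender -> (Have_sender sender, 250)
  | Have_sender sender, Rcpt_to r ->
    (Have_recipients { sender; recipients = [ r ] }, 250)
  | Have_recipients env, Rcpt_to r ->
    (Have_recipients { env with recipients = r :: env.recipients }, 250)
  | Have_recipients _, Data -> (state, 354)
  (* Anything out of order, e.g. DATA before any RCPT TO. *)
  | _, (Mail_from _ | Rcpt_to _ | Data) -> (state, 503)
```

Encoding the protocol state as a variant means "DATA with no recipients" is a case the code has to name explicitly, rather than a flag someone forgets to check.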

So, you know, if somebody in the cyber-security team here wanted to implement a new kind of scanner for a particular kind of malicious attachment or something like that, they could just easily go and write a feature, write some OCaml code, to integrate that scanner or to implement that check, and it didn’t require, you know, finding the one person wearing a cape and pointed hat that happened to know the particular dark corners of the old system’s configuration language. It just really, you know, required the same kind of knowledge that we expect any of our software engineers to have.

00:32:36

Ron

Yeah, and the cyber-security example is not a random one, in the sense that I think one of the big wins from all of this change to our mail, which I think wasn’t really contemplated as much when we made the decision to start in on Mailcore, was that it gave a lot more power to people thinking about security to put all sorts of remediations in place. It’s hard to overstate how important email is as an attack vector and how much work you have to do to protect yourself and the ability to just widen out the set of people who could make those changes and to be able to kind of accelerate the level of work there I think had a very powerful effect on improving our cyber-security protections.

00:33:10

Dominick

It absolutely did. Yeah, and I think that is one example, like you said, but there are other cases, too, where not only from a security perspective, but from a kind of functionality and sort of productivity perspective, we were able to make changes that we wouldn’t have even contemplated in the original system. So, you know, if we wanted to, for example, do things like add support for new kinds of mailing lists, you know, mailing lists that had different behavior or that sent messages to somewhere else besides somebody’s inbox, because we wanted to get something out of email or something like that or a mailing list that we wanted to note was deprecated so that it would alert the sender that we’re not using this mailing list anymore. These are little things, little enhancements that smooth over paper cuts, but the flexibility to be able to throw in a 50-line feature to implement something like that just opened our eyes to the set of customizations that we could make to the way that we work with email.
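The deprecated-mailing-list feature is the kind of small addition Dominick means. Here is a minimal sketch, assuming a hypothetical list record with an optional `deprecated_in_favor_of` field: the list keeps delivering, but expansion also produces a notice back to the sender. None of these names come from Mailcore.

```ocaml
(* Hypothetical list definition with an optional deprecation pointer. *)
type mailing_list = {
  address : string;
  members : string list;
  deprecated_in_favor_of : string option;
}

type action =
  | Deliver of string                (* deliver to a member's mailbox *)
  | Notify_sender of string * string (* (sender, notice text) *)

(* Deprecated lists still deliver, but also tell the sender where to go instead. *)
let expand list ~sender =
  let deliveries = List.map (fun member -> Deliver member) list.members in
  match list.deprecated_in_favor_of with
  | None -> deliveries
  | Some replacement ->
    let notice =
      Printf.sprintf "%s is deprecated; please use %s instead." list.address
        replacement
    in
    Notify_sender (sender, notice) :: deliveries
```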

00:34:07

Ron

Yeah, and simplifying and improving and removing paper cuts in the primary communication mechanism of a 1,300-person company is a surprisingly powerful thing, right, when you step back and think about it. So the individual thing seems small, but the value to the organization is quite large, and I think people are actually just kind of, on a person-to-person basis, incredibly grateful for the work that’s gone into email because it’s so much better than it used to be in a bunch of ways that I think really affect the quality of people’s lives here.

00:34:35

Dominick

A great example of something that, at the time, seemed small, but that has been really widely used and that a lot of people have gotten a lot of value out of, is something we implemented internally that we call a relay list, which is like a regular mailing list. So, you know, a mailing list, normally, we think of as an address that contains some set of members, and when you send mail to that address, it goes to each of the members of the list. So, you know, we might have a distribution list for everybody at Jane Street so that we can send out announcements firm-wide.

Well, a relay list is a special kind of list where instead of including human user email addresses as the members, it includes hosts and ports. So, you know, some internal host name and a port on that machine, and what happens is when you send an email to a relay list, Mailcore actually relays that message on to a mail server listening on that host and port. You pair that with a small library for easily standing up a little mail server, and we’ve now given the power to automate various workflows related to email to everybody around the firm, without requiring them to have any special privileges or any access to actually make changes to this core, super critical piece of infrastructure. And we’ve kind of federated that ability out to everyone, and people have made use of it. They’ve implemented little enhancements to their own workflows for their specific teams around support rotations and little tools for improving their monitoring and all sorts of things like that that they just wouldn’t have been able to do or that we maybe wouldn’t have wanted them to do if it was going to be in the core system that we’re using to handle everyone’s important email.
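A relay list can be modeled as a list whose members are either mailboxes or host/port pairs. The sketch below uses hypothetical types and a hypothetical `expand`. Expansion turns each member into a delivery instruction, and a relay member becomes "hand this message to the SMTP server listening at that host and port".

```ocaml
(* Hypothetical member type: a mailbox, or a host and port to relay to. *)
type member =
  | Mailbox of string
  | Relay of { host : string; port : int }

type delivery =
  | Store_in_mailbox of string
  | Forward_smtp of string * int (* (host, port) *)

(* Relay members hand the message to whatever small SMTP server listens there. *)
let expand members =
  List.map
    (function
      | Mailbox address -> Store_in_mailbox address
      | Relay { host; port } -> Forward_smtp (host, port))
    members
```

The design point is that the team owning the little server on the receiving end needs no access to the core mail system. It only needs a list entry pointing at its host and port.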

00:36:09

Ron

And part of I guess what you’re doing here is, again, leveraging the openness of the email architecture, where you can just have routing between different hosts that are implemented differently and are doing different things, but of course, in this case they’re not implemented completely differently. We get to reuse the same libraries that you built for Mailcore. Those now get to be leveraged in all sorts of places.

00:36:25

Dominick

Exactly.

00:36:25

Ron

So one of the things you were pointing out there is the fact that you are now able, in writing the configuration, to leverage the standard software tools that we use for building all sorts of software. Can you say a bit more about how that played out and what were the valuable pieces of those software-oriented workflows?

00:36:41

Dominick

I think this is a pattern that you see a lot around Jane Street. We’re a place that really highly values automation, and so we’d much rather take a problem that maybe is traditionally viewed as an administration or an operational problem and turn it into a software problem if we can, because we do get to benefit from the kind of work that’s being done around the firm to improve our ability to work with software, and what do I mean when I say that?

I think there are a lot of very concrete, specific, technical things that I mean, so things like editor integration, you know, syntax highlighting, the integration with the build system, tools that help us view the type definition for a given value, tools that help us jump to the definition of some bit of code so that we can kind of move around the source code repository, tools for writing automated tests and for kind of demonstrating that the behavior of some system hasn’t changed over time.

All different kinds of things like that, but the other thing that I think is important about taking this kind of software approach to what are traditionally viewed as more systems-y or more administration problems, is I think kind of a cultural one. I think people treat code differently from the way they treat other things. Like, there’s some switch in our brains where, if we’re messing with a config file, then we’re much more willing to copy and paste some stanza or to, you know, hack something together and just throw it into a repo without a good commit message or maybe to not even put it in a repo in the first place.

Whereas when we’re dealing with code, the expectations just change. There’s this general shared understanding that code should be reviewed and code should be tested and there should be a description for why you’re making a particular change, and we work with it in a different way, and we hold ourselves to a different standard when working with code. We refactor code; how often do you refactor configuration? And I think, ultimately, the thing that we were most excited about here was being able to leverage that kind of cultural shift, just this inclination to work with this pile of stuff in a different way, even if it was, you know, for a system where, normally, it wouldn’t be handled that way.

00:38:43

Ron

So part of it is about the tools of software, and part of it is about the culture of software?

00:38:48

Dominick

That’s right.

00:38:48

Ron

So why do you think the culture of software and the culture of configuration are as different as they are? Sort of step back, thinking about it from an abstract perspective, it doesn’t feel very different. Configuration languages are languages. Essentially, very restrictive programming languages. Why should they develop such different cultures?

00:39:05

Dominick

Yeah, it’s a good question. I’m not really sure. I can speculate a little bit. I think one potential reason is that the tools are just not as available to you. So sort of the tools breed the culture in some sense with software. The fact that you have all of these nice tools for writing tests and for doing code review and for working with code definitely encourages you to do good things.

You know, it’s much easier to refactor some code if you have good tools for helping you refactor it, and with configuration, you often don’t have those things because, in many cases, it’s a bespoke language for a particular system, and it’s just not worth the effort of going and building all of that tooling specifically for the system. So I think that’s probably a big part of it. I think another part of it is that, in many cases, we store configurations separate from where we store source code. A common pattern is you build some core functionality, some basic system that just handles the most common kernel of operations that you need to handle.

And then we have many instances of that system, each with their own configuration, and the result of that is you end up with configuration kind of strewn all about, managed by different teams maybe if the system is being run by different groups or something like that, and in general, you don’t get the same kind of consistency that you might get out of the way that we would approach making changes to the core functionality, and I think one of the other big takeaways with Mailcore was we actually just stored the configuration right next to the core functionality. The configuration lives in the same repo right next to all of the other code, and when we roll it out, we deploy everything all as one big bundle, and we don’t have this problem of, like, oh, well, the configuration lives over here, and the code for the implementation for the core functionality lives over here, and we get some tooling that works well over there and some tooling that works well over there, and you know, it’s kind of annoying to inter-op between them or something like that. I think that that plays a role in a lot of cases, as well.

00:40:52

Ron

At least the second problem you describe, the one about where you store the config versus where you store the code, is the kind of thing where, you know, thinking is enough to make it so. Like, if you just have different ideas about how you should store configuration, you can adopt them. Whereas your previous point, about the culture depending on the tools being there, is a much harder problem to fix, and just even from Jane Street’s own history… I think of Jane Street as having a very good and well-developed culture around testing, but it didn’t always. The tools used for testing used to be much worse, and the practices around testing were much worse, and I think the thing you describe is exactly right. The culture was able to be established only in concert with building the tools. Like, people decided testing was important. We spent more time doing it. People got frustrated about how hard it was, and they spent more time building tools to make it easier, and then when it got to be really easy, which is kind of how I think of it now, that culture got really widely spread.

You mentioned this refactoring thing, which is another thing that struck me. You talked about refactoring configs, and one of the practices that I think is very common in code is that we have a fairly strong approach of trying to avoid repeating things, because cutting and pasting is a super easy way to introduce bugs: you cut and paste something, and then make just the changes you need to make it right.

But it’s so easy to miss something. OCaml, though, has incredibly good, very lightweight tools for making very simple templates, typically in the form of functions, so that you can figure out which parts really need to differ and avoid any excess duplication. It doesn’t even necessarily make your code shorter, but it very often makes it cleaner and less likely to be buggy.
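A minimal sketch of "templates as functions", assuming a hypothetical set of regional mail gateways. Instead of three copy-pasted stanzas, one function holds everything the gateways share. Only the genuinely differing parts are parameters: a required `~region` and an optional size override. The hostnames, field names and default size are illustrative.

```ocaml
type gateway = {
  hostname : string;
  port : int;
  max_message_bytes : int;
  tls_required : bool;
}

(* Shared settings live here once; callers supply only what differs. *)
let regional_gateway ?(max_message_bytes = 25_000_000) ~region () =
  { hostname = Printf.sprintf "smtp-%s.example.internal" region;
    port = 25;
    max_message_bytes;
    tls_required = true }

let gateways =
  [ regional_gateway ~region:"nyc" ();
    regional_gateway ~region:"ldn" ();
    regional_gateway ~region:"hkg" ~max_message_bytes:50_000_000 () ]
```

If TLS policy changes, it changes in one place, and there is no stale copy that someone forgot to update.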

And the tendency to do that depends critically on working in a system that’s friendly to that kind of refactoring, and if you’re in, like, some random config language that was never really designed as a programming language and has no core principles for how it’s organized, that stuff is just not going to go so well.

00:42:44

Dominick

That’s right. Yeah, and I think this kind of speaks to the specific functionality of OCaml, the specific capabilities of OCaml that make it such a good language for a wide array of problems, including this one, and I think you highlighted one. Another that I think is pretty important for the config management case is it’s really nice to have checking at compile time for unused values – for things that you specified somewhere but then never did anything with – because a really common mistake in a config language that lets you do this is to go and define some value somewhere and then forget to list it in the place where you meant to say, “Oh, and use this thing now.” Having, in OCaml, the ability to move things around and reorganize the config and get alerted by the compiler if we forgot to make use of a value or left some stale bit of code around helps us keep the implementation as lean as we can and prevents a wide class of mistakes.

00:43:45

Ron

Yeah, I think that’s incredibly important. It’s a simple decision, but we make fairly aggressive choices about turning on warnings in the OCaml compiler, including that one, and not just warnings: we turn them into errors, so you cannot even compile your code when you have an unused variable. That can be annoying in some contexts, but it’s so incredibly useful, and it catches so many bugs.

I was talking with a guy who works in the Tools and Compilers team who had previously worked at various other big tech companies, and he was talking about various, like, fancy techniques that are out there for, like, machine learning, blah, blah, blah, for catching common bugs, and he was like, “Yeah, this seems interesting, but honestly, the fact that we have things like, you know, automatic detection of unused variables just smashes a lot of bugs this stuff would catch anyway. So it’s not clear it’s worth the complexity.”

00:44:35

Dominick

Pattern match exhaustivity is another one that’s like that, where the fact that you can know that you matched on all of the possible values of this type just eliminates a whole class of bugs that you might easily make in other languages that don’t have that.
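The exhaustivity check is easiest to see on a small variant. This example uses SMTP reply classes, which RFC 5321 groups by a code's first digit. The type and function names are illustrative, not from Mailcore. If someone later added a constructor to `reply_class`, the compiler would flag `next_step` with its non-exhaustive-match warning (warning 8), and with warnings treated as errors the build would fail until the new case was handled.

```ocaml
(* 2xx success, 3xx intermediate, 4xx transient failure, 5xx permanent failure. *)
type reply_class =
  | Positive_completion
  | Positive_intermediate
  | Transient_failure
  | Permanent_failure

let classify code =
  match code / 100 with
  | 2 -> Some Positive_completion
  | 3 -> Some Positive_intermediate
  | 4 -> Some Transient_failure
  | 5 -> Some Permanent_failure
  | _ -> None

(* Every constructor is named, so no case can be silently forgotten. *)
let next_step = function
  | Positive_completion -> "done"
  | Positive_intermediate -> "send more data"
  | Transient_failure -> "retry later"
  | Permanent_failure -> "bounce to sender"
```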

00:44:47

Ron

For anyone who’s thinking about whether a language like OCaml is interesting and wants to understand why people like it, pattern matching and the exhaustivity check on pattern matching is the single best feature, and it continues to mystify me that more programming languages have not picked it up. Like, you don’t have to take all of the decisions, but, like, that one is so good.

00:45:03

Dominick

Just take that one!

00:45:08

Ron

So you talked a bunch about what was good about Mailcore, what the advantages are, but going and building your own thing isn’t all sunshine and roses. What are the downsides of having built our own homegrown mail server?

00:45:18

Dominick

There are plenty.

One obvious one that stands out to me is I’ve been referring to SMTP as this simple protocol, and the core of it is a simple protocol. You know, there really are very few things you’d need to implement to support the functionality that you would expect of a basic mail server, but as we alluded to at the very beginning when we were talking about the openness of email and the reality of the modern Internet and all of the things that you have to consider that weren’t considered when it was originally designed, there have been many extensions to SMTP and many extensions to the kind of surrounding mail ecosystem to add on an extra level of security or extra functionality, and Mailcore has to implement all of those if we want to get that functionality. You know, we can’t rely on some open source community or some vendor maintaining the mail server that we’re using and just adding functionality as new specifications become approved or go into wide use. That I think means that this is a kind of a never-ending project in some sense.
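SMTP extensions are advertised in the server's multiline reply to EHLO. Per RFC 5321, every line starts with the code 250, followed by "-" on continuation lines and a space on the last line. The first line is the greeting, and each later line names one extension keyword, optionally followed by parameters. Below is a minimal parsing sketch. The function names are hypothetical, and keywords are uppercased because RFC 5321 treats extension keywords as case-insensitive.

```ocaml
(* Returns (KEYWORD, params) for each advertised extension. *)
let parse_ehlo_reply lines =
  (* Strip the "250-" / "250 " prefix; lines without it are dropped. *)
  let strip_code line =
    if String.length line >= 4
       && String.sub line 0 3 = "250"
       && (line.[3] = '-' || line.[3] = ' ')
    then Some (String.sub line 4 (String.length line - 4))
    else None
  in
  match List.filter_map strip_code lines with
  | [] -> []
  | _greeting :: extension_lines ->
    List.map
      (fun ext ->
        match String.split_on_char ' ' (String.trim ext) with
        | keyword :: params -> (String.uppercase_ascii keyword, params)
        | [] -> ("", []) (* unreachable: split_on_char never returns [] *))
      extension_lines

let supports extensions keyword =
  List.mem_assoc (String.uppercase_ascii keyword) extensions
```

Each keyword found here (STARTTLS, SIZE, 8BITMIME and so on) is its own specification that a homegrown server has to implement before it can advertise it.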

00:46:16

Ron

How about from a security perspective? I can imagine that cutting both ways, which is to say we have a lot more power to decide exactly how it works. At the same time, I imagine there are, like, rookie email mistakes that someone implementing a mail server can make, and a mail server that’s existed for 20-odd years has had the opportunity to fix some of those, and we get to make those mistakes from scratch. How much of a role does that play?

00:46:37

Dominick

I actually expected more issues of this type, but we’ve seen fewer than I would’ve thought, and we have taken a good hard look at it and considered that angle. I think a big relevant detail of the way the popular open source mail servers are written is that they’re mostly written in C, and so a lot of the security issues they have run into over the years have been the normal C memory-unsafe style of security flaw that many, many, many systems have been bitten by. Writing our system in OCaml rules out that whole class of things, or at least limits them to, you know, bugs in the OCaml compiler or in external libraries that we link in or something like that. So that’s a big plus. The other thing we get out of this is that, because we’ve written our own thing, it’s going to be a lot less common on the Internet. There aren’t that many other people using it. It’s a lot less interesting to find a vulnerability in our mail server than in some popular mail server that’s on, you know, 50 million servers around the Internet, and so I think we get a little bit of security from that fact.

00:47:43

Ron

And the buffer overrun story you talk about with mail servers written in C, it’s really no joke. I think Microsoft, a little while back, came out with a study finding that something like 70% of their vulnerabilities were memory-safety bugs, like buffer overruns, in code written in C and C++. And I think it just highlights to me, again, the importance of programming languages in systems design. Problems at the programming language layer are incredibly hard to solve at higher levels, right? If you use a safe programming language, like Java or OCaml or Rust, then there’s a whole class of bugs that just go away. Otherwise, you can try to smash the bug count at higher levels, with address randomization and all sorts of fuzz testing and all that, and it’s effective.

But it’s an enormous amount of work, and it doesn’t get you anywhere near as good a situation as you would’ve been in if you had just used a safe language to begin with. So the kind of ongoing train wreck of people building Internet-facing software in C and C++ and other unsafe languages, like, it continues to amaze me. It seems like a really serious mistake just from a security perspective, all other aspects of software engineering and language design aside.

00:48:57

Dominick

Totally. Yeah, we get a big win out of just not having to think about that and being able to focus our energies on other things.

00:49:03

Ron

So we’ve spent a lot of time talking very positively about email, but at the same time, email is terrible, right? Like, we all live in a world where we have way too much email. Certainly, I live in a world where I have too much email, and email is I think kind of clearly the best collaboration tool I have used, the best communication tool I have used, and you know, for lots of people, things like Slack and whatever, I like and I think are useful in various contexts, but you know, you can pry email from my cold, dead hands. At the same time, oh, lord, I wish it was better, and I’m curious. You spend a lot of time thinking about email and about how email works at Jane Street and not just the technical, but also the kind of organizational and human concerns. How do you wish email was better?

00:49:48

Dominick

There are kind of two trains of thought here that I want to cover. One is how do I wish email was better as a protocol and as a kind of citizen on the Internet, and the other is how do I wish email was better at Jane Street, or you know, what changes do I think that we need to make? I’ll start with the first one.

I think the biggest thing that the world at large, the email world at large, has wrestled with for the past, I don’t know, 20 years probably, and continues to wrestle with, is a consequence of this openness that we’ve talked about a few times in the architecture and specification of the way email works, and the consequence is, essentially, that email is really difficult to authenticate. With just the core SMTP specification, it’s really difficult to know that a message was actually sent by the person who claims to have sent it.

So this is where we get things like spoofing and phishing and other kinds of malicious impersonation, things that, at their most mundane, just result in more spam, more junk mail for you to clean up, but at their worst, this is how you get things like, you know, people pretending to be you and asking your bank to wire all your money to some offshore, untraceable account or something like that. So it’s a huge problem, and it’s kind of a fundamental problem in the way that email is designed, and lots of people have made attempts at improving it, adding extensions to the email specification and new protocols and things like that to fix this.

There are things like SPF, the Sender Policy Framework, or DKIM, DomainKeys Identified Mail. Both of these are just attempts to further lock down and authenticate email, whether it is, you know, authenticating that the person who sent it is who they claim they are or authenticating that the actual contents of the message are the same contents that the original sender intended to send to you. So these help a lot, and they definitely make a big difference, but one of the issues that crops up with both of these things is that they require participation by both the senders and the recipients. The sender has to be configured to authenticate and say, yes, this email was sent by me, but the recipient also has to be configured to check for it. You know, it’s kind of like the equivalent of – it’s one thing for me to carry around my driver’s license and have a nice picture on it and my name and my license number and all that, but if you don’t ask me to see it and you don’t look at it and make sure that it looks like it’s a real driver’s license and that it was actually issued by the state and all that kind of stuff, then it doesn’t really do anybody any good. It doesn’t actually, you know, demonstrate any identity or validity. So this ends up being a big problem, because if you’re using a big provider, you know, somebody like Gmail or Microsoft 365, Google or Microsoft are going to be highly incentivized to build in a lot of good tooling and implement all these things and do as much as they can to help you authenticate the mail that you’re sending and check that the mail that you’re receiving was also authenticated.
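The recipient's half of SPF, "actually looking at the license", can be sketched as evaluating the sender domain's published record against the connecting IP. This is a heavily simplified sketch. It assumes the record text has already been fetched from DNS (no lookup here), supports only the `ip4` and `all` mechanisms with their `+ - ~ ?` qualifiers, and assumes 64-bit OCaml ints. Real SPF evaluation under RFC 7208 also covers `include`, `a`, `mx`, `redirect`, IPv6 and more. All function names are hypothetical.

```ocaml
type spf_result = Pass | Fail | Softfail | Neutral

(* Dotted-quad IPv4 to an int (assumes 64-bit ints). *)
let parse_ipv4 s =
  match List.map int_of_string_opt (String.split_on_char '.' s) with
  | [ Some a; Some b; Some c; Some d ]
    when List.for_all (fun x -> x >= 0 && x <= 255) [ a; b; c; d ] ->
    Some ((a lsl 24) lor (b lsl 16) lor (c lsl 8) lor d)
  | _ -> None

let in_cidr ~net ~prefix ip =
  let mask = if prefix = 0 then 0 else (0xFFFF_FFFF lsl (32 - prefix)) land 0xFFFF_FFFF in
  ip land mask = net land mask

(* spec is "<addr>" or "<addr>/<prefix-length>" *)
let ip4_matches ip spec =
  let addr, prefix =
    match String.index_opt spec '/' with
    | Some i ->
      ( String.sub spec 0 i,
        int_of_string_opt (String.sub spec (i + 1) (String.length spec - i - 1)) )
    | None -> (spec, Some 32)
  in
  match parse_ipv4 addr, prefix with
  | Some net, Some p when p >= 0 && p <= 32 -> in_cidr ~net ~prefix:p ip
  | _ -> false

let has_prefix p s =
  String.length s >= String.length p && String.sub s 0 (String.length p) = p

(* Some result if this non-empty term matches [ip], otherwise None. *)
let eval_term ip term =
  let rest () = String.sub term 1 (String.length term - 1) in
  let qualifier, mechanism =
    match term.[0] with
    | '+' -> (Pass, rest ())
    | '-' -> (Fail, rest ())
    | '~' -> (Softfail, rest ())
    | '?' -> (Neutral, rest ())
    | _ -> (Pass, term) (* no qualifier means "+" *)
  in
  if mechanism = "all" then Some qualifier
  else if has_prefix "ip4:" mechanism
       && ip4_matches ip (String.sub mechanism 4 (String.length mechanism - 4))
  then Some qualifier
  else None

(* First matching term decides; no match gives Neutral. None if malformed. *)
let check_spf ~record ~ip =
  match parse_ipv4 ip, List.filter (fun s -> s <> "") (String.split_on_char ' ' record) with
  | Some ip, "v=spf1" :: terms ->
    let rec first_match = function
      | [] -> Some Neutral
      | term :: rest ->
        (match eval_term ip term with Some r -> Some r | None -> first_match rest)
    in
    first_match terms
  | _ -> None
```

Publishing `v=spf1 ip4:192.0.2.0/24 -all` only helps if receivers run a check like this and act on a `Fail`.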

But if you’re trying to run your own mail server or even your own mail client that doesn’t do some of the things that Gmail or Office 365 would do and you’re trying to keep up with all these things as new improvements crop up and things like that, it’s just really, really difficult, and it’s kind of a continuing problem. You know, once you’ve authenticated that the sender of the message is who they said they were, now you have this whole separate problem which is, well, that’s great, but what if other people aren’t authenticating that mail claiming to come from you was actually sent by you? It’s great if you check that the mail that I sent you actually came from me, but if my bank isn’t checking, then it’s not doing me any good.

And so this ends up being kind of a pretty hard problem to solve in a uniform and global way. You know, there’s progress being made and you know, continued improvements to some of these things and new ideas cropping up for how to make this better, but it’s just a really hard problem, and a lot of it stems from all those nice things that we talked about with SMTP and all of its openness, and so it’s just kind of a double-edged sword.

On the Jane Street side of things, I think, kind of ironically, the biggest problem that we have with email is we send too much of it. I think email is great, and I feel the same as you. I think it’s an awesome tool, and it’s a really, really effective way for a lot of kinds of communication, but I think it’s too easy to send an email, and it’s too easy to send an email to a large number of people, and it’s too hard to remove yourself from the list of recipients in some cases.

So, you know, we have all these mailing lists internally that we use for organizing ourselves and making sure that people who want to follow along with different kinds of discussions can follow along, but we don’t have enough tooling to make it easy for people to understand the sheer impact that consuming email can have on your productivity, your ability to focus, your ability to do anything besides read and respond to email.

So this is something that we are actually focusing on within our team right now, which is what kinds of information can we put in front of people? What kinds of tools can we build for people to either get things out of email, you know, to move things that don’t actually belong in email into other systems? You know, you probably don’t want your monitoring system to be primarily alerting you via email. That’s just not the place. You know, you don’t need everybody a week later to see that you got close to running out of memory on some server at some point. You know, that’s just something that’s a transient fact about the world that you kind of don’t want to deal with ever again once it’s resolved, but we’re working on a lot of tooling to make it easier for people to get those things out of email and into other systems and also for people to wrangle their inboxes, better understand what it is that’s coming into their inbox and where it’s coming from and why they’re receiving it and how much of it they’re getting so that they can make better decisions about what they should and shouldn’t be getting.

00:55:40

Ron

I literally ran into this issue this morning, in that yesterday, shockingly, for the first time in about a year, I got myself down to inbox zero, which is a mythical stage that one almost never gets to, and so, you know, as your inbox fills up again, you’re like, oh, what is this stuff, and can I please turn off the stuff that’s irrelevant? And there’s emails I looked at. I’m like, “How do I even know why I am on this mailing list, and how do I unsubscribe from it in a clean way,” and it’s all way more complicated than it feels like it should be.

And then filters, which feel like they should be a good answer to this problem, are actually a surprisingly bad answer in a few different ways. One is that the filter language, like Google’s filter language in Gmail, is surprisingly primitive. I can’t say, like, I want to not receive emails that I received only because I was on this list, but if there’s some other reason that I should’ve received it, I still want to receive it, and expressing that is surprisingly hard.
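A rule like “only because I was on this list” becomes expressible if a filter can see how a message reached you. A minimal sketch, assuming the filter has access to the message’s To/Cc addresses and to a directory mapping each list address to its members (which may include other lists): archive the message only if you are reachable from its recipients, but stop being reachable once the muted lists are removed. The directory, addresses, and function names are all hypothetical.

```python
def reachable(start, me, lists, blocked=frozenset()):
    """True if `me` can be reached from the addresses in `start` by expanding
    mailing lists, never passing through an address in `blocked`."""
    stack = [addr for addr in start if addr not in blocked]
    seen = set()
    while stack:
        addr = stack.pop()
        if addr == me:
            return True
        if addr in seen:
            continue  # guards against cyclic list memberships
        seen.add(addr)
        stack.extend(m for m in lists.get(addr, ()) if m not in blocked)
    return False


def should_archive(me, recipients, lists, muted):
    """Archive only if every route by which `me` received the message goes
    through a muted list; any other reason (direct, another list) keeps it."""
    return reachable(recipients, me, lists) and not reachable(recipients, me, lists, muted)


# Hypothetical directory: a list that merges two sub-lists, plus a team list.
lists = {
    "compilerdev": ["compilerdev-actual", "compilerdev-also"],
    "compilerdev-actual": ["alice"],
    "compilerdev-also": ["ron", "bob"],
    "my-team": ["ron", "carol"],
}
muted = {"compilerdev-also"}
print(should_archive("ron", ["compilerdev"], lists, muted))             # True
print(should_archive("ron", ["compilerdev", "my-team"], lists, muted))  # False
```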

The other thing that’s difficult about filters in email is that some people at Jane Street have taken a kind of radical, extreme view of email where they, like, block everything and then whitelist the things that they want to see, and that means it can be very hard to know whether the email that you sent to someone has actually gotten through or has just been filtered out by their system. But yeah, I think maybe the most important thing you said was the one about cost: somehow giving people who send emails a way to feel the cost of sending them to all those people.

00:57:05

Dominick

Right. If you’re going to write that email to 1,500 people, it’s probably worth spending an extra five minutes editing it to make it as short and concise and direct as possible. Whereas if you’re sending it just to your buddy who sits down the row, okay, that’s fine. You know, send it in whatever form you want, but I think we… at times, it can be easy to forget the impact that sending an email that takes just an extra 30 seconds to read multiplied out over 1,500 people can have. That’s just a big cost, and so we’re working on ways of making that better.
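Spelling out the multiplication with the figures Dominick uses (an extra 30 seconds of reading per person, 1,500 recipients):

```latex
30\,\text{s/reader} \times 1{,}500\ \text{readers} = 45{,}000\,\text{s} = 12.5\,\text{hours}
```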

00:57:33

Ron

How? What are your ideas for making that better? I’m fascinated.

00:57:36

Dominick

I think the biggest one is putting that information in front of you when you send an email. So because we have mailing lists, it’s easy to just say, “Oh, I’m going to send this message to everybody@janestreet.com” and forget that everybody@janestreet.com is a mailing list that contains… everybody at janestreet.com and just how many people that is and just what the cost of sending an email to that wide of an audience is. So we’re working on ways to put that information in front of people at the moment when they’re writing the email so that they can at least make a more informed decision so they don’t forget the impact that their message might have.
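A compose-time warning like this needs the real audience size, which means expanding mailing lists (possibly nested, as with a list that merges two other lists) down to unique people. A minimal sketch, assuming a hypothetical in-memory directory from list address to members; the threshold and message wording are illustrative choices, not a description of Jane Street’s tooling.

```python
def expand(addresses, lists):
    """Expand (possibly nested) mailing lists into the set of unique people."""
    people, seen = set(), set()
    stack = list(addresses)
    while stack:
        addr = stack.pop()
        if addr in seen:
            continue  # skip duplicates and cycles between lists
        seen.add(addr)
        if addr in lists:
            stack.extend(lists[addr])
        else:
            people.add(addr)  # not a list, so treat it as an individual
    return people


def compose_warning(addresses, lists, threshold=50):
    """Return a warning to show while the message is being written, or None."""
    count = len(expand(addresses, lists))
    if count < threshold:
        return None
    return f"This message will reach {count} people."


# Hypothetical directory where "everybody" is built from two office lists.
lists = {"everybody": ["nyc", "ldn"], "nyc": ["a", "b"], "ldn": ["b", "c"]}
print(compose_warning(["everybody"], lists, threshold=3))  # This message will reach 3 people.
```

Counting unique people rather than summing list sizes matters here: someone on two of the nested lists still only reads the message once.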

00:58:12

Ron

I’ve seen some of the reverse problem. There are lists that people sometimes really want to be bothered by. Like, they want to lurk on them, and they kind of want the opposite thing there. I want to say, like, yes, there’s a lot of people on here, and you shouldn’t worry about them. They’ve signed onto this fire hose, and I don’t want it to be slowed down. If it’s too much for them, they should sign off.

One of these at Jane Street is the mailing list called CompilerDev, and it turns out a lot of people like lurking on CompilerDev, because compiler questions are interesting and people like to kind of pick through them. We have this organized so that CompilerDev is actually the merger of two lists: CompilerDev “Actual”, for the people really on the team, and CompilerDev “Also”, for other people who just kind of want to hang on.

And then we have this problem where people, taking seriously what you said, will email CompilerDev “Actual” instead, because they’re like, well, I don’t want to email all of those people, and we have to, like, go and manually redirect them and be like, “No, no, no, you should normally worry about the safety of your coworkers, but in this one case, people have decided that they really want this and have, like, asked us to make it so that all the emails go here. So please redirect.”

00:59:14

Dominick

Right. I totally agree, and I think I’m probably infamous internally for the amount of dispatching between mailing lists that I do because I am, like, militant about making sure that a message has gone to the right mailing list. Even if, you know, you send a message directly to me, I might redirect you to a mailing list that contains only me just so that, like, if I ever decide to stop working on that thing, somebody else will start receiving your emails and you won’t just cache that you should always send an email to me. So, yeah, it’s a problem that definitely cuts both ways.

I think the other thing that I wanted to highlight that you reminded me of is when we were talking about filters, another big problem that we run into is a lot of the filtering technology, in addition to not being flexible enough to express some of the things we want to express, is not really built to be used by groups. Whereas, in practice, we organize ourselves in groups, in many cases.

So if you have a team of people that are on some support rotation, for example, there’s really not a lot of value in each of them independently coming to their own conclusions about what kinds of emails they need to see first and what kinds of emails they can just skim later on, and so we’d really like a better way to build tooling that allows for sharing some of this stuff, so that we can kind of implement it once and people can sign on to use the same well-developed, general set of rules that somebody else on their team has decided on. We’ve actually built some tooling to this effect. We’ve started doing the maybe predictable thing of generating some of our filters in OCaml and allowing for better sharing of those OCaml filters and code reviewing them and doing all the things that we do with everything, and that has helped a lot, but we’re looking for ways to expand some of that functionality to support more things and add some more expressiveness, along the lines of the things that you were talking about before.

01:01:04

Ron

Right, and presumably, even in that group-oriented environment, you also want a composability story. You’d like some way of sharing a set of decisions about how to handle emails among a team while also allowing customizations. You might want to use both the Trade Support email filters and some good set of filters that a friend of yours came up with for some particular case, and you want to be able to mix those in and have the semantics at the end be something that you can reason about.
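One simple way to get predictable semantics out of mixed rule sets is ordered, first-match-wins composition: each rule is a predicate plus a destination folder, rule sets are concatenated in priority order, and the first rule that matches decides. This is a minimal sketch under that assumption; it is not Jane Street’s OCaml filter library, and the rule contents and addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Message:
    sender: str
    subject: str


# A rule is a predicate plus the folder a matching message is routed to.
Rule = Tuple[Callable[[Message], bool], str]


def compose(*rule_sets: List[Rule]) -> List[Rule]:
    """Concatenate rule sets; earlier sets take precedence (first match wins)."""
    return [rule for rules in rule_sets for rule in rules]


def route(rules: List[Rule], msg: Message, default: str = "inbox") -> str:
    """Return the folder of the first matching rule, or `default`."""
    for predicate, folder in rules:
        if predicate(msg):
            return folder
    return default


# Hypothetical rule sets: shared team rules, a friend's rules, personal overrides.
team_rules = [
    (lambda m: m.subject.startswith("[URGENT]"), "urgent"),
    (lambda m: m.sender.startswith("monitoring@"), "alerts"),
]
friend_rules = [(lambda m: "nightly build" in m.subject.lower(), "builds")]
personal_rules = [(lambda m: m.sender == "boss@example.com", "inbox")]

rules = compose(personal_rules, team_rules, friend_rules)
print(route(rules, Message("monitoring@example.com", "disk 90% full")))      # alerts
print(route(rules, Message("boss@example.com", "[URGENT] quick question")))  # inbox
```

Putting personal rules first lets an individual override a shared team rule without editing it, and because evaluation is a single walk down one list, you can predict where any message lands by reading the rules in order.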

01:01:31

Dominick

Yeah, and there’s an obvious scary thing that can happen here, which is a kind of smaller version of the scary thing that we were worried about when we were rolling out Mailcore initially, which is once you start sharing filters, now you’ve given somebody the ability to black hole all of your email, and it does happen from time to time with some of this tooling where, you know, somebody new to the team is like, oh, I’m going to add a filter. I’m going to, you know, try to add support for this new thing that I ran into, and they accidentally, you know, confuse the rules in some way and end up, you know, with a filter that says send everything to the archive, and it takes a little while for somebody to notice sometimes.

01:02:08

Ron

Yeah, and I think this highlights why wanting to have this kind of more complex, shareable system quite naturally goes along with wanting to have things like code review and testing, because, suddenly, you’ve taken what had been a very low-impact thing, like mucking with your own filters, and turned it into a thing where you might black-hole all the email for the entire Trade Support team, which is now a critical firm risk issue: now no one who’s supposed to support the trading sees any of the things they’re supposed to see.
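Treating a filter change like any other code change suggests writing tests for it. A minimal sketch, assuming a filter is just a function from a message to a folder name: check that a few “canary” messages that must be seen still land in the inbox, and flag a change that would archive more than some fraction of a sample of recent mail. The helper names, message shape, and 50% threshold are hypothetical.

```python
def audit_filter(filter_fn, sample, canaries, max_archived_fraction=0.5):
    """Return a list of problems with a proposed filter (empty means it looks safe).

    filter_fn maps a message (a dict here) to a folder name.
    canaries are messages that must always land in the inbox.
    """
    problems = []
    for msg in canaries:
        folder = filter_fn(msg)
        if folder != "inbox":
            problems.append(f"canary {msg['subject']!r} routed to {folder!r}")
    archived = sum(1 for msg in sample if filter_fn(msg) == "archive")
    if sample and archived / len(sample) > max_archived_fraction:
        problems.append(f"{archived}/{len(sample)} sample messages would be archived")
    return problems


def make_subject_filter(prefix):
    """Archive messages whose subject starts with `prefix`; keep the rest."""
    return lambda msg: "archive" if msg["subject"].startswith(prefix) else "inbox"


sample = [{"subject": "[build] nightly ok"}, {"subject": "Trade break on XYZ"}, {"subject": "Lunch?"}]
canaries = [{"subject": "Trade break on XYZ"}]

print(audit_filter(make_subject_filter("[build]"), sample, canaries))  # []
# An empty prefix matches every subject: the accidental "archive everything" rule.
print(audit_filter(make_subject_filter(""), sample, canaries))
```

The empty-prefix case is the kind of accidental “send everything to the archive” rule Dominick describes; a check like this fails before the change is rolled out instead of silently black-holing a team’s mail.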

01:02:33

Dominick

That’s right.

01:02:34

Ron

So maybe to close it out: email is, you know, a big, long-term system that has been around for a long time and changed a lot over the years, and every now and then, you hear people coming up with new things that are going to kill email and replace it. Projects like Google Wave, which was this grand new thing that was going to replace email, and then, instead, you know, the wave crashed, and that was the end of that. I’m wondering, are you optimistic about the future of email?

01:03:07

Dominick

I am optimistic about it. I think that email, in its flexibility and openness, is something that it would be really hard to replace with any of these other systems, and I think there’s a reason why it’s had the staying power that it’s had. You know, it’s been around for 40 years, and while it’s changed around the edges and while we’ve had to adapt to some of the developments on the Internet, at the end of the day, the core functionality has stayed basically the same that entire time, and I think the fact that it is so open and so flexible and makes it so easy for you to build your own things on top of it means that it’s got a long, bright future, and I certainly think that it’s far from being outmoded at this point.

01:03:53

Ron

Well, thank you very much for joining me. This has been a real pleasure.

01:03:55

Dominick

Thanks, Ron.

01:04:00

Ron

You can find a full transcript of the episode, along with more information about some other topics we discussed, including a link to a talk that DLo gave about Mailcore and also links to some of our mail-handling libraries at signalsandthreads.com. Thanks for joining us, and see you next week.


Building a functional email server

with Dominick LoBraico

Season 1, Episode 8 | October 28th, 2020
