
Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.

What is an Operating System?

with Anil Madhavapeddy

Season 2, Episode 3   |   November 3rd, 2021

BLURB

Anil Madhavapeddy is an academic, author, engineer, entrepreneur, and OCaml aficionado. In this episode, Anil and Ron consider the evolving role of operating systems, security on the internet, and the pending arrival (at last!) of OCaml 5.0. They also discuss using Raspberry Pis to fight climate change; the programming inspiration found in British pubs and on Moroccan beaches; and the time Anil went to a party, got drunk, and woke up with a job working on the Mars Polar Lander.



TRANSCRIPT

00:00:04

Ron

Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky.

00:00:13

Ron

It is my pleasure to introduce Anil Madhavapeddy. Anil and I have worked together for many years in lots of different contexts. We wrote a book together, Anil and I and Jason Hickey together wrote a book called “Real World OCaml.” We have spent lots of years talking about and scheming about OCaml and the future of the language and collaborating together in many different ways including working together to found a lab at Cambridge University that focused on OCaml.

00:00:41

Ron

Anil is also a systems researcher in his own right, an academic who’s done a lot of interesting work, and also an industrial programmer who’s built real systems with enormous scale and reach. We’re going to talk about a lot of different parts of the work that Anil has done over the years.

00:00:55

Ron

To start with, though, I want to focus on one particular project that you’re pretty well known for which is Mirage. Can you give us a capsule summary of what Mirage is?

00:01:04

Anil

Sure, I can. And it’s great to be here, Ron.

00:01:06

Anil

The story of Mirage starts at the turn of the century. In the early 2000s, pretty much every bit of software that ran on the internet was written in C, and back then we had internet worms that were just tearing through services because there were lots of problems like buffer overflows and memory errors. The unreliability of old systems code that had been written in the past was becoming really obvious. And the internet was really insecure.

00:01:31

Anil

So there I was, as a fresh graduate student in Cambridge, and I decided that after years of doing systems programming in C, I would just have a go and see what it was like to rewrite some common internet protocols using a modern, high-level language. So I looked around and I looked at Java which was obviously the big language back then, I looked at Perl which was heavily used for scripting purposes. But in the end, I decided I wanted something that was the most Unix-like language I could find and I ended up using OCaml.

00:01:58

Anil

It had fast native code compilation that just ran in Unix, it could be debugged very easily, it had a very thin layer to the operating system. I spent a great couple of years figuring out how to write really safe applications in OCaml. I started by rewriting the domain name service which is how we resolve human-readable names like google.com to IP addresses and I rewrote the Secure Shell protocol which is how most computers just talk to each other over remote connections.

00:02:23

Anil

I rewrote all of these in pure OCaml and I showed it as part of my PhD research that you could make these not only as high-performance as the C versions, which really wasn’t that well-known then because there was a perception that these high-level languages would be quite slow, but then I also showed that you could start doing some high-level reasoning about them as well. You could use model checking and early verification techniques to prove high-level properties.

00:02:44

Anil

This is all really good fun. I wrote loads of OCaml code and then I published all these papers. And then I asked myself a simple question. I’ve written all of this code to rewrite network protocols and have safe applications, but then the compiler just seemed to stop. So after all of these beautiful abstractions and compilation processes, I got a binary at the end and this binary just talked to this operating system. I might have written 100,000 lines of OCaml, but this operating system had 25 million lines of C code, the Linux kernel.

00:03:13

Anil

So why after all of my hard work in perfecting this beautiful network protocol do I have to drag along 25 million lines of code? What value is that adding to me when I’ve done so much in my high-level language? And this is where MirageOS comes in.

00:03:26

Anil

MirageOS is a system written in pure OCaml where not only can common network protocols, file systems, and high-level things like web servers and web stacks all be expressed in OCaml, but the compiler just refuses to stop. We then provide different abstractions to plug in the actual operating system as well. And so the compiler, instead of stopping and generating a binary that you then run inside Linux or Windows, will continue to specialize the application that it is compiling and it will emit a full operating system that can just boot by itself.

00:03:58

Anil

The compiler has specialized your high-level application into an operating system that can only do one thing, the thing that it is written to do. It does this not just by looking at the source code; it also looks at your configuration files, which are also written in OCaml, and evaluates all of those in combination with your business logic. Then it compiles the whole thing in combination with operating system components written in OCaml, like TCP/IP stacks and low-level file systems and network drivers and those kinds of things, and emits what’s known as a unikernel. A unikernel is a highly specialized binary output.

00:04:30

Anil

So MirageOS started off as an experiment in my PhD 15 years ago. I’ve been joined by an incredible community initially by Thomas Gazagnaire and David Scott and now by a large MirageOS core team. We have hundreds of protocols and file systems and pieces that can all fit together and be combined into very bespoke, artisanal infrastructure. You can design a kernel that does exactly what you want it to do and you don’t have to drag along other people’s code unless you want to.

00:04:56

Ron

Maybe an overly pithy summary of this is: Your operating system is a library.

00:05:01

Anil

This is an operating system. But what is an operating system? In a normal operating system, you run a bunch of processes in what’s known as userland. This is a failure domain: if something goes wrong in a process or it needs resources from the outside world, it’s got to ask a higher-level system, and the higher-level system in conventional operating systems is the kernel. The Linux kernel, for example, is the thing that mediates all of the resources in your system; it manages the hardware, and it acts as middleware to give software safe, isolated, and high-performance access to the underlying hardware.

00:05:33

Anil

Unikernels use a different approach to structuring operating systems, one known as library operating systems. This is one where, instead of the kernel acting as a big wrapper around all of your code, it is simply provided as a set of libraries, so it is no different from any other library that you link to, such as OpenSSL or some kind of graphics library. The kernel is just another one of those things.

00:05:55

Anil

But what you sacrifice is multiuser modes because if one application is accessing some system libraries, it needs exclusive access to the hardware. It’s quite hard to provide competing or untrusted access to different parts of your hardware stack.

00:06:10

Anil

So library operating systems work really well if you’re trying to build a specialized application that is maximally using the hardware at hand. If you just want to have a desktop with lots of applications running, then you should just use conventional operating systems. It’s only if you can benefit from the specialization that you want to switch into this different mode of operating system construction.

00:06:27

Ron

What I like about MirageOS as an idea is it’s so weird. It’s hard to know whether it’s a research project or a stunt. It’s also part of what I think of as a larger story of the multi-decade-long failure of the original idea of an operating system.

00:06:42

Ron

Back in the day, we had this idea that what we were really going to do was build multiuser operating systems. Computers were really expensive, and we needed to share them, share one big computer among a bunch of people. And so we built systems like Multics. And then systems that took inspiration from them like Unix and lots of other systems along the way. We built all of these abstractions that were designed to make it easy to share hardware among multiple people and to do it safely.

00:07:07

Ron

And then in the last couple of decades, we have more completely and utterly given up on that project, and things like virtualization and containers are all examples where we’re like, “No, no, no, that’s not what an operating system is for. Operating systems are for piling up your complicated stack of all the different components you need to throw together to build an application, and you want to add them up and freeze them in place so you can replicably build up this weird agglomeration of stuff that you’ve thrown together.”

00:07:38

Ron

The original purpose of actually having multiple users share the same operating system has basically vanished from the scene. And once you’ve made all those changes, the idea that instead of all of the traditional abstractions that we needed when we were separating out different users, maybe we could do something radically different. That’s where I see something like MirageOS showing up.

00:07:57

Anil

That’s right. It’s an interesting perspective to think that operating systems have been a failure because what’s really happened in the last 20 or 30 years is that we have invisibly added layers that provide the right level of abstractions needed for that point in time.

00:08:11

Anil

For example, in the late ’90s, I would spend ages building a beautifully configured Windows machine because I knew exactly all the registry keys and all the magic that went into it. But in the early 2000s, I worked on the Xen hypervisor.

00:08:23

Anil

The Xen hypervisor started off with a very simple thesis which is, it is possible to run multiple instances of an operating system not designed to run on the same physical hardware simultaneously and make sure it’s completely isolated from the other operating systems running in the machine but also do so with minimal performance hit. There was a serious balancing act there.

00:08:40

Anil

And so what we did with the Xen hypervisor was, don’t touch anything in user space because you don’t want to have people rewriting all their applications, their Oracle databases or their SQL servers or whatever they’re running. So we scooped out the guts of the kernel. And normally, the guts of the kernel in Linux is what manages the low-level hardware, the memory management subsystem, the interrupt controller, and the things that map hardware to operating systems. With this simple modification, we adopted a technique called paravirtualization.

00:09:07

Anil

What paravirtualization did was, it just fooled the kernel into thinking it was running on real hardware but we shimmed in a little layer called a hypervisor, the Xen hypervisor, which then did all the real mapping to real hardware. It turned out this was extraordinarily effective because we could take entire physical operating system stacks and tens of millions of lines of code all combined and run them simultaneously in a single physical machine and make sure that they’re all utilized to their maximum potential. If you had a bunch of machines all being used 10% of the time, we could shove these in one place.

00:09:38

Anil

Now, this worked out so well because the notion of a user wasn’t someone who’s logging into a Windows machine but it became the person who’s booting up an operating system. And then suddenly, the Xen hypervisor became its own operating system and cloud computing and all of these kinds of things took off by the mid 2000s. But they just provided a different interface. And when MirageOS came along, it was kind of the leftover portions of the Xen experiment.

00:10:01

Anil

Xen also interestingly started off as a stunt. It was a bet in the Castle Pub in Cambridge that Keir Fraser couldn’t hack Linux over a weekend. And then Monday came along and we had the first version of Xen and then a big team of us continued working on it.

00:10:14

Anil

I then spent the next few years at a startup company called XenSource building all of the support to make it production quality so we could sell the Xen hypervisor as a product so that we had Windows drivers and Linux drivers. Those years were filled with compatibility woes. So you have to look at every single edge case and make sure it works perfectly.

00:10:32

Anil

And then life just got frustrating. You just get bored of making other people’s code work well in your virtualization layer. So we had to have some way to test Xen. And so MirageOS, the first version of it, came along because we built a minimal operating system that didn’t have all of the Windows baggage and all of the Linux baggage and all it did was exercise the lowest levels of the Xen functionality, the device drivers, the memory subsystem and so on. I needed to have slightly more complicated tests.

00:10:59

Anil

With Thomas Gazagnaire, we just linked in the OCaml runtime because we just needed to write some high-level logic and then that was running inside the Xen hypervisor as a minimal operating system. It was a few hundred kilobytes in size at most. And then we’re sending Ethernet packets so wouldn’t it be nice if you could just hook up an OCaml library to send TCP frames instead of low-level Ethernet? So then I started writing a TCP/IP stack in pure OCaml. And then once you have TCP, it’s a pretty small step to go write an HTTP stack in OCaml. And then that happened.

00:11:28

Anil

MirageOS became this organic growth of starting from low-level interfaces, figuring out what the system abstractions that we need are, and then filling in the blanks with libraries. So it did start as a stunt. I think all good systems projects start with a stunt because you’re trying to test an experimental hypothesis, you’re trying to show that if we modify the world to be the way we want it to be with our hypothesis, that it’s worth doing. And you need that stunt to show that all of the effort and all the hard work that goes into productizing something is actually worthwhile.

00:11:56

Anil

So Xen, the hypervisor, was a stunt just to show that you could just boot three Linuxes on one machine and then to this day it remains one of the industry’s most popular hypervisors. MirageOS also started as a stunt just to show you could build a credible sequence of OCaml applications and protocols and compose them together and build something useful. MirageOS today has tens of millions of daily active users. It’s embedded in all kinds of systems that use the libraries and the protocols in lots of different ways and it’s invisibly servicing lots and lots of cloud infrastructure.

00:12:30

Ron

Yeah, I think it’s hard to overstate how impactful the Xen work has been. It’s the foundation on which the entire modern internet is built, right? The virtualization is absolutely at the core of what an enormous number of companies have done, an enormous number of different systems that have been built have been built on top of this. There’s been a bunch of ways that MirageOS has gotten into big and important pieces of infrastructure.

00:12:54

Ron

One thing I wonder about is, are you happy with the set of abstractions that we’ve started to build up around this? In some ways I feel like the stunt-like nature of all of this shows a little bit in the happenstance of what we got. A lot of the things that we’ve ended up building are things that you could kind of shim in. We started off building a big multiuser set of operating systems and we’re like, “Oh, actually, the abstractions aren’t good enough for supporting multiple users truly isolated from each other.” So we started doing this, in some sense, very strange thing where we said, “You know what’s the right abstraction? Hardware.”

00:13:30

Ron

Whatever the physical hardware happened to provide at the bottom layer, that’s the thing that will allow us to take our operating systems and just port them cheaply to new places. So let’s pick hardware as the new abstraction. And I find it hard to believe on some level that either of these are really good choices. If you were to actually start from scratch in a way that’s not just like a stunt but a multi-decade-long commitment to rebuild the entire world, do you have a feel for what abstractions you’d actually pick?

00:13:58

Anil

That’s a great question.

00:14:00

Anil

Mirage is now 15 years old, and we are never happy with our abstractions. I don’t think there’s been a single day where the core team has sat down and said, “We have the perfect set of interfaces and it will survive for the next few years.” It’s worth stepping back a little bit to explain why OCaml was the right choice for MirageOS and why that empowers this continuous evolution of our interfaces.

00:14:21

Anil

In OCaml, you have the notion of modules. This is one of the defining features of OCaml beyond being a functional programming language. And what modules do is that they let you define an interface and this interface is a series of types which can then have functions that operate over those types and that collection is known as a module signature. And whenever in MirageOS we are defining some abstract hardware or even a high-level thing, we define a module signature for this thing. And all that does is sketch out what goes in and what goes out and how you create things of this module type.

00:14:57

Anil

But then in OCaml, you also have this notion of module implementations, modules themselves, and if they satisfy that module signature, then you can apply this in a type-safe way and you can compose lots and lots of different module types with lots and lots of different implementations.

00:15:11

Anil

In Mirage, we have a sequence of module types which represent the full set of our possible hardware- and application-level and protocol-level signatures. But then, we also have hundreds and hundreds of concrete libraries which satisfy some of those module signatures.

00:15:26

Anil

For example, if I have a networking module signature that just says you can open a connection and you can read and write from it, we call this a flow in MirageOS, then there are several possible implementations of this flow interface. One of them is just a normal Linux socket stack which will compile only on Linux and another one is a full OCaml-based implementation of TCP/IP which exports the same socket interface but instead of delegating the requirements to actually send the network traffic to the kernel, it actually implements it in pure OCaml.
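The split between a module signature and its implementations can be sketched in a few lines of OCaml. This is a minimal illustration under stated assumptions: the `FLOW` signature below is a simplified, hypothetical one invented for this sketch (MirageOS's real flow interface is asynchronous and richer), and the two implementations are in-memory stand-ins rather than a Linux socket stack or a TCP/IP stack.

```ocaml
(* Hypothetical, simplified signature; not MirageOS's real flow interface. *)
module type FLOW = sig
  type t
  val connect : string -> t          (* open a connection to a named peer *)
  val write : t -> string -> unit    (* send some bytes *)
  val read : t -> string option      (* receive bytes; None when drained *)
end

(* One implementation: an in-memory loopback standing in for a real stack. *)
module Loopback : FLOW = struct
  type t = string Queue.t
  let connect _peer = Queue.create ()
  let write t s = Queue.push s t
  let read t = Queue.take_opt t
end

(* A second implementation with different behaviour behind the same
   signature: it upper-cases everything written to it. *)
module Shouting : FLOW = struct
  type t = string Queue.t
  let connect _peer = Queue.create ()
  let write t s = Queue.push (String.uppercase_ascii s) t
  let read t = Queue.take_opt t
end

(* Application logic written once, against the signature only. *)
module Echo_client (F : FLOW) = struct
  let round_trip peer msg =
    let flow = F.connect peer in
    F.write flow msg;
    F.read flow
end

(* Type-safe composition: the same logic over two implementations. *)
module Over_loopback = Echo_client (Loopback)
module Over_shouting = Echo_client (Shouting)

let () =
  let show r = print_endline (Option.value r ~default:"<nothing>") in
  show (Over_loopback.round_trip "example" "hello");
  show (Over_shouting.round_trip "example" "hello")
```

Because `Echo_client` only sees `FLOW`, swapping the implementation is a one-line change at the point where the functor is applied, and the type checker rejects any implementation that does not satisfy the signature.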

00:15:58

Anil

So in MirageOS, whenever we’re not happy with the lack of some safe code, we go write an implementation. Whenever we’re unhappy with the evolution of some hardware interfaces or virtualization interfaces, we go rewrite our module signatures and all we have to do is to adjust our implementations so that they match the new module signatures. We can do this in an incremental and evolutionary way.
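One way to picture this kind of incremental evolution is an adapter functor that lifts an implementation of an older signature to a newer one. The sketch below is illustrative only: `FLOW_V1`, `FLOW_V2`, `chunk`, and `Upgrade` are hypothetical names made up here, and the transcript does not say that MirageOS migrates its signatures this way.

```ocaml
(* Hypothetical signatures for illustration; not MirageOS's real ones. *)

(* Result type used by the revised signature. *)
type chunk = Data of string | Eof

(* Original signature: a read returns an option. *)
module type FLOW_V1 = sig
  type t
  val read : t -> string option
end

(* Revised signature: a read reports end-of-stream explicitly. *)
module type FLOW_V2 = sig
  type t
  val read : t -> chunk
end

(* An existing implementation, written against the old signature. *)
module Queue_flow = struct
  type t = string Queue.t
  let read q = Queue.take_opt q
end

(* Adapter functor: adjusts any V1 implementation to match V2. *)
module Upgrade (F : FLOW_V1) : FLOW_V2 with type t = F.t = struct
  type t = F.t
  let read t =
    match F.read t with
    | Some s -> Data s
    | None -> Eof
end

module Queue_flow_v2 = Upgrade (Queue_flow)

let () =
  let q = Queue.create () in
  Queue.push "ping" q;
  (match Queue_flow_v2.read q with
   | Data s -> print_endline ("data: " ^ s)
   | Eof -> print_endline "eof")
```

The old implementation is left untouched; only the adapter knows about both signatures, which is what lets the change land one library at a time.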

00:16:21

Anil

Over the years, we’ve learned a ton of stuff. We’ve seen an evolution of hardware both in terms of performance and straight-line capabilities, and we’ve seen it change in terms of the security model: we started with just page tables for memory, and now we have all kinds of trusted, encrypted memory enclaves and we have nested virtualization. It’s become an incredibly sophisticated interface there. And then we also have the dimensionality of distributed systems which is just another way of programming and abstracting across the failure domain.

00:16:51

Anil

OCaml lets us split up our implementations and our signatures into two discrete halves and then try to evolve continuously. That’s why the Mirage project is called Mirage because our idea was that the Mirage project would disappear and just become the default way that people programmed systems because our signatures would just become part of the standard community and part of the standard way that people build things and we’ve been seeing that over the last few years.

00:17:14

Ron

One, I think, subtle advantage of Mirage which is not, I think, totally obvious to someone who encounters it as an operating systems project is you can take a program that was built for Mirage and you can run it with an ordinary operating system. Your point about one of the ways that you can get network services is to just use the standard network services on the operating system of your choice. And the other way is to have a pure OCaml implementation that goes all the way down and run that inside of a hypervisor or maybe run it on an actual, bare metal server.

00:17:45

Ron

There’s an enormous amount of flexibility in terms of how you take these things and deploy them. This may be not obvious if you just think about it as an operating system. In some sense, it’s both more than that and kind of less, in the sense that, as you said, there’s a way in which the more you look at it, the more you wonder, like, “What actually is here?” In some sense, the whole architecture disappears into the background.
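The idea that one program can be deployed against either the host operating system's services or a pure OCaml stack can be sketched with first-class modules. Everything named here (`NETWORK`, `Host_sockets`, `Ocaml_tcpip`, `target`) is a hypothetical stand-in: the two backends only return strings describing what they would do, and the real Mirage tooling makes this selection at configuration time rather than with a runtime `match`.

```ocaml
(* Hypothetical interface the application is written against. *)
module type NETWORK = sig
  val send : string -> string  (* returns a description of what was done *)
end

(* Stand-in for delegating to the host operating system's sockets. *)
module Host_sockets : NETWORK = struct
  let send payload = Printf.sprintf "host kernel socket <- %s" payload
end

(* Stand-in for a TCP/IP stack implemented in OCaml itself. *)
module Ocaml_tcpip : NETWORK = struct
  let send payload = Printf.sprintf "pure OCaml TCP/IP <- %s" payload
end

(* Deployment targets, named for this sketch only. *)
type target = Unix_process | Unikernel

(* Pick a backend for a target; the result is a first-class module. *)
let network_for = function
  | Unix_process -> (module Host_sockets : NETWORK)
  | Unikernel -> (module Ocaml_tcpip : NETWORK)

(* The application logic is identical for every target. *)
let fetch_index (module Net : NETWORK) = Net.send "GET /"

let run target = fetch_index (network_for target)

let () =
  print_endline (run Unix_process);
  print_endline (run Unikernel)
```

The application function never mentions a target, so moving from a Unix process to a unikernel changes only which module is handed to it.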

00:18:03

Anil

That’s right. That’s right.

00:18:05

Anil

Well, to give you a concrete example of this. Right now, we’re really worried about climate change. We thought we would build a website that is purely solar powered and one observation about websites, for example the OCaml Labs website, is that most people probably only look at the website when it’s daytime. There’s not much machine access to the website. So we thought, “Well, what if we had a bunch of Raspberry Pis around the world that were just solar powered?”

00:18:26

Anil

The process of writing this kind of thing is, first of all, just start writing it in Unix like a normal OCaml Unix application and we built the web server with my colleague, Patrick Ferris. And then at this point, we start measuring the energy usage and the energy usage is high because it’s running Linux on the Raspberry Pi and then it’s just taking up more budget than our solar is letting us provide. So then we wrap it in a more constrained MirageOS interface, one that doesn’t give you the full access to Linux and all the syscalls and only requires a small file system.

00:18:56

Anil

And so this is just an evolution over our existing Linux code and then suddenly it becomes compatible with all of the direct unikernel interfaces. And then you can replace the Raspberry Pi with an ESP32, one of those tiny little 32-bit microcontrollers, and your energy budget drops dramatically. But obviously, your capabilities drop, but I had the luxury of developing the Raspberry Pi environment which is a full Linux environment and then when I decide, “Well, okay, my high level logic is right, I can bisect it, and then get rid of the lower half of the operating system.” It’s all just done through iterative, normal, pure OCaml development.

00:19:25

Anil

It’s worth noting as well that anyone can build their own custom kernel. If you’ve never done any kernel hacking, you can still use MirageOS, program in pure OCaml, and have a custom kernel that you can boot. It is really quite dramatic if you think that there’s mystique in kernel programming because there isn’t. It’s just another very, very large program that is hard to debug.

00:19:45

Ron

I think I have a pretty good sense of what’s to like about this approach. One advantage is that you get all of the flexibility that you get out of a powerful programming language for building rich abstractions in a kind of kernel environment. You are restricted in various ways to building abstractions that are, in some sense, safe via the hardware support that you have for separating kernel code and non-kernel code. There’s a bunch of constraints about how you can build that kind of system.

00:20:11

Ron

Here, you get to use the abstractions very freely, you can build just what you want, and you can have a compilation process that just doesn’t link in the stuff that you’re not using so you get things that are truly minimal and as a result more secure. So that all seems really exciting.

00:20:26

Ron

Yeah, I have an enormous amount of sympathy for the idea that part of the way that you make your world better is by extending the programming language. I think this is a luxury that Jane Street has had over the years and I think that, in some sense, everyone, whether they know it or not, is enormously dependent on the fundamental tools they use including the programming language and people mostly think of themselves as being in the position of victim with respect to their programming language of choice. They mostly use it and don’t have a lot of control over how it works.

00:20:55

Ron

But being in a place where you can be in real conversation with the community of developers that defines the language lets you, when you find really important ways of changing that ecosystem, actually being able to push that forward, that’s a very powerful thing.

00:21:08

Anil

It is.

00:21:09

Anil

And OCaml, in my mind, is a generational language. One of the properties I want from systems I build is that they stand the test of time. It’s so frustrating that a system I built in the early 2000s, if you put it on the internet today, would be hacked in seconds. It just would not survive for any length of time. So how do we even begin the discipline of building systems that can last for, forget a decade, just even a year without having some kind of security holes or some kind of terrible, terrible flaw?

00:21:36

Anil

Now, there is one argument saying that you should build living systems that are perpetually refreshed but also we should have the hope of building eternal systems that have beautiful mathematical properties and still perform useful, utilitarian functions in the world.

00:21:48

Ron

There’s one big downside I feel like I see in all of this which you haven’t talked about yet which is it requires you to write all of your code in OCaml. And I really like OCaml, you really like OCaml, it’s in some sense not a downside, but if you’re trying to build software that’s broadly useful and usable and can build a big ecosystem around it, restricting down to one particular programming language can be awkward.

00:22:12

Ron

Just to say the obvious, I would find it somewhat awkward if there’s some operating system I wanted to use and I had to use whatever their favorite language was and I couldn’t write in my favorite language. How do you think about this trade off?

00:22:23

Anil

Totally. Well, first of all, we must use multiple languages. It’s not really OCaml that is the lure for this notion of generational computing. It’s the fact that there’s at the heart of it a simple semantic that could be expressed in a machine-specifiable form. And although we have the OCaml syntax and everything at the heart of it, there’s no formal specification of OCaml, but it’s obvious that one is emerging and one can be written in the next, certainly, five to 10 years.

00:22:46

Anil

This means that once you have a large body of code that has semantics, that has meaning, it’s possible to transform it into other languages and other future semantics. That self-description is a really, really important part of the reason why I chose OCaml. It’s still possible to compile code I wrote in the early 2000s using the modern OCaml compiler. I’ve compiled code I wrote 20 years ago. In fact, it was OCaml’s 25th birthday just a few months ago and I tested out the first program I could find, it was my CVS repository, and it compiles fine.

00:23:14

Anil

But when you want to use another language, then we just go through the foreign function interface and it’s just like the process abstraction I talked about. All you have to do is spin up another process, which is another runtime, and you have to talk to it. The industry has made tremendous progress in understanding how multilanguage interoperability should work, specifically through WebAssembly, for example, at the moment. We have a substrate where modern browsers can run quite portable code, but more important than the bytecode is the emerging understanding of what it means to make function calls across languages. All we have to do is take advantage of whatever those advances are and we can link multiple libraries for multiple languages together.

00:23:53

Anil

Again, it’s a mirage, right? By using other people’s advances, Mirage can benefit because all we need are libraries to build these operating systems. Nothing else. Everyone loves libraries, everyone has them. That’s the only thing we need, and standards for how they can talk to each other.

00:24:08

Ron

One of the things that I think is really important about programming-language design is that building a good programming language is as much about what you leave out as about what you put in. Having a set of abstractions that smoothly work together, language features that really click, where it’s really easy to use other people’s code no matter which subset of the language features they tried to use, and it’ll still all hook together: it’s hard to build a language that encourages that kind of simplicity, that embodies that kind of simplicity.

00:24:37

Ron

And if what you need is now languages that need to be fully interoperable with each other, there’s a degree to which each language has to fully embrace the complexity of the other languages and it can get awkward fast. I wonder if some of the simplicity that Mirage offers would get harder to maintain in a context where you’re trying to have lots and lots of different languages interacting with each other.

00:25:01

Anil

It definitely does because you’re trying to get end-to-end guarantees. One of the big users of Mirage unikernels is the Tezos proof-of-stake blockchain, and Tezos is a complicated distributed system with lots of nodes and validators and security keys flying around.

00:25:15

Anil

To build that as a unikernel, it involves a lot of OCaml code, it’s a large OCaml codebase, but also Rust code. There’s been really interesting work on hooking together the Rust type system, which is based around a borrowing model, so there’s a lifetime model for how long values persist, and the OCaml model which is based around garbage collection, it involves dynamic collection.

00:25:32

Anil

But this works because typically, the Rust code is at the lowest levels of the system, it’s kind of at the runtime part of the system. So as long as you have a clean layering where you’re starting from a C runtime, then you’re moving into the Rust code which is very unopinionated from a garbage collection perspective but very opinionated from a lifetime perspective, and then calling into the OCaml code, things work out pretty well.

00:25:51

Anil

We’ve made tremendous progress in building some really complicated unikernels from a very, very complicated distributed system but you have to just make sure you look at your entire language stack and your dependency stack ahead of time, make sure you understand how they interoperate at a high level, and then dive into turning it into a unikernel. So it’s definitely not a magic wand that you can just wave and expect the build systems to just work.

00:26:11

Anil

Another example that we use Mirage for is in Docker which is a container management system. If you’ve ever used Docker for Mac or Docker for Windows, then every byte of every container that you’re using in your desktop is going through a MirageOS translation layer because whenever you mount a file system on the Mac, for example, something has to translate the semantics of your Mac file system, which is APFS or HFS, into a Linux container which is a similar-looking file system but actually completely different under the hood.

00:26:41

Anil

What we did was we did a very special Mirage. Dave Scott, David Sheets, and Jeremy Yallop, they figured out that if you treat one end of a Mirage compilation target as Linux and the other end as macOS we can build translation proxies simply by serializing network packets into the OCaml stack and then deserializing them on the other end and turning them into socket calls. So now the Mac transparently reconstructs traffic coming out of a container and then emits it on your Mac desktop as normal Mac networking calls. A lot of the tricky difficulties of network bridging and firewalls and all of that stuff just go away.

00:27:18

Anil

When we run a Linux container on the Mac, it goes through MirageOS and it looks just like a Mac application. When we deployed that in Docker, I think our support calls went down by about 99%. So anytime this software was deployed in the enterprise, everyone’s got some crazy firewall and antivirus software and things that break some integration of a virtualization stack with your system.

00:27:38

Anil

Today, Docker for Mac, you just double click on it, you install it on Mac or Windows and it’s like a background daemon that just runs in the system with minimal interruption and that’s the user experience we’re going for. But it’s only possible because, again, we understood how to interface Go with OCaml, but made sure we did it in exactly the right order. Then once you deploy it, it’s incredibly robust in production. But you just have to take the time to make sure you understand the lifetime of Go values, the lifetime of OCaml values, and make sure they can interoperate correctly.

00:28:03

Ron

This is another example of the flexibility of Mirage, right? It’s not just an all-at-once operating system, it needs to know everything and then you run it on bare metal. Like here you are integrating it as a very carefully designed shim between two operating systems running on the same machine.

00:28:18

Anil

That’s right.

00:28:19

Anil

Along the way, KC Sivaramakrishnan joined OCaml Labs to work on multicore parallelism. Hannes Mehnert from Robur and David Kaloper were on a beach in Morocco and they wrote us a TLS stack and then they did this incredible stunt where they decided they loved Mirage and they’d never talked to me or any of the Mirage team and on this beach in Marrakesh they wrote a complete SSL stack in the wake of the Heartbleed attack.

00:28:40

Anil

And then they put up what we called a Bitcoin piñata and this Bitcoin piñata was in about 2015 or so I think, they hid 10 Bitcoins inside a unikernel, put it on the internet, and they left the private keys inside the unikernel and they said to the internet, “If anyone could break into this unikernel and take those keys and trade those Bitcoin, we can’t deny the fact that this thing has been hacked, and you can keep the money.”

00:29:02

Anil

Back then I think Bitcoin was worth not very much but then during the course of the experiment, there were hundreds of thousands of attacks against the system and it got onto Hacker News and all of the social media networks. People kept crashing the system by denial-of-servicing it. But then like a real piñata it just bounced back and rebooted in 20 milliseconds because that’s how long a unikernel takes to reboot and it was back up again and no one managed to take the Bitcoin. In the end, I think we donated it to charity because it was growing a bit much.

00:29:25

Anil

But it just goes to show how you can assemble all these things, you can get a community who can then do what they want to do with it, and then contribute back to the whole. So today, if you use a TLS stack in OCaml or indeed an HTTP stack, you’re probably using one of the Mirage libraries. There’s many, many alternatives but for a long time, the Mirage libraries became the de facto community stacks that people used.

00:29:48

Ron

Right. And I would assume that Mirage in its various forms, maybe it’s Mirage plus Xen together, are responsible for most of the deployments of OCaml code onto people’s actual machines. How many machines do you think software that you’ve worked on has now been installed on?

00:30:04

Anil

It’s a hard question to answer because we’re deployed in products.

00:30:07

Anil

There was an OCaml Xenstored, which is the management daemon behind XenStore, which I believe Amazon used for many years so that would cover quite a lot of machines in the cloud. I can’t say exactly how many, but a lot.

00:30:18

Anil

And then Docker for Mac and Docker for Windows, I think, was the second most popular developer tool behind Visual Studio Code. So it’s deployed on tens of millions of desktops, for sure.

00:30:27

Anil

But then, of course, in the community, you have people like Facebook who have written their frontend for their Messenger application in a variant of OCaml known as ReasonML and compiled that to JavaScript. That’s also to some extent deployed but not deployed in the same way.

00:30:40

Ron

That’s a good point. That might be more desktops than all of the Docker desktops combined. In fact, it kind of has to be.

00:30:45

Anil

It does. That would probably address a few billion desktops. But it’s a website, right? It’s not an application running on the other side.

00:30:51

Anil

But our plans right now are even bigger. I’m working on some climate change projects where we need to deploy millions of sensors around the world. And of course, we’re using Mirage to deal with the complicated logic of carbon CO2 sensing and chemical tasting and deploying it on RISC-V hardware that’s quite embedded. The Mirage journey is just continuing but on different paths and different use cases.

00:31:11

Anil

We have in Germany the Robur team deploying all kinds of different unikernels for the German government. I think they have a contract to build secure VPN tunnels and lightweight overlay networks and all of these are unikernels that are being deployed. So who knows how far it’s going to go inside critical infrastructure on the internet in the coming years.

00:31:28

Ron

A thing I’ve always found striking about your background is you’ve dug deeply into a bunch of different areas, you’ve done a lot of different open source work over the years of various different forms, you’ve done lots of impactful academic research, and you’ve been involved in a bunch of pretty major industrial projects. Can you tell us a bit about how you got into this whole line of work in the first place? How did you get into computers and into systems research? Where did this journey start?

00:31:55

Anil

Well, I’m actually not a computer scientist. I began my training as an engineer and I actually planned to get into electrical engineering. I was fascinated by power systems and cars and planes and so on.

00:32:04

Anil

But then when I was studying in London, I got working on a computer game, an online MUD where you could program this game and it was programmed in a really interesting language called LPC which is a pseudo-functional, object-oriented language from the late ’90s. I went to a party. It was known as a MUD meet and I got drunk and I woke up the next day and I’d been offered an internship at NASA to work on the Mars Polar Lander. And this is in California, it’s an exotic land far away from gray and dreary London.

00:32:30

Anil

I ended up that summer working on the various bits of infrastructure for helping the Mars Polar Lander land and when it finally landed, this was the first time that we had the technology to livestream the photographs that were coming out of Mars. I was set up, I would say, as the person who set up all the infrastructure for supporting one in three people on the internet to access a website all at once because the world’s attention was focused on this landing in 1999.

00:32:57

Anil

I rapidly learned how computers worked and stuff and operating systems and things and I set up all of these Solaris boxes and the first thing that happened was those boxes got hacked. I put them up on the internet and obviously hackers love mars.nasa.gov as a domain to control and so they took them over. I then looked around for more secure alternatives and I found this operating system called OpenBSD.

00:33:17

Anil

And what OpenBSD is, it’s an all-in-one operating system designed with reliability and correctness in mind and uses a variety of security techniques. I wiped all of these expensive Solaris boxes, installed OpenBSD, and then managed to get the system running stably again. And then OpenBSD was open source. So I found a few bugs because when you’re deploying something as large as that you can’t not find some bugs and it turns out that I could just send in some patches and they got interested and they accepted my patches.

00:33:43

Anil

This is some massive dopamine rush because when someone takes your code and incorporates it into this operating system used by loads of other people, it’s an incredible feeling. I got more and more into that development and I ended up going to an OpenBSD hackathon and these are regular, semiannual events. Back then, it was in Calgary in Canada because the U.S. export restrictions prevented any cryptographic code from being written in the U.S.

00:34:06

Anil

I got to travel and go to Canada. And then talking to Damien Miller who’s one of the core maintainers of SSH, it set me on the path to thinking, “Well how can you start rewriting systems in a more secure fashion?”

00:34:16

Anil

And then I went back to Cambridge because the Mars Polar Lander crashed straight into Mars at very high speed. So all of the infrastructure we set up never actually got used. Well, it got onto CNN and lots of people looked at our sad faces.

00:34:28

Ron

People got to watch the crash due to your hard work.

00:34:30

Anil

People got to watch the crash.

00:34:32

Anil

We had to wait for two days until we decided it had crashed, so people stopped watching after about five minutes but we waited two days. And then I had to find a new job because I was so depressed that all of our hard work had hit Mars at high speed.

00:34:42

Anil

I decided to go back to Cambridge and do a PhD and then I really started my training as a computer scientist. So during the PhD, I did lots and lots of different projects but I started working on the Xen hypervisor, I started using OCaml and functional programming more seriously in order to build the stacks that I described earlier, and then it became this wonderful journey where all of the code I’ve ever written has pretty much been open source. A lot of it is terrible but has been included in lots and lots of products. It’s really easy to move between industry and academia and government jobs because you’re taking your secret weapons with you wherever you go.

00:35:14

Anil

So now, it’s not like I’m obsessed with OCaml, it’s just the most efficient thing for me to use to solve any given problem because I’ve just deployed it in so many contexts that if I’m doing anything for building my website or doing a bit of data processing, it’s just what I reach for. It’s a really fun thing to work with even after all these years.

00:35:28

Ron

You’ve talked some about why you think OCaml is a good fit for Mirage and what you’re trying to do there. But OCaml is not a tool that systems programmers reach for early. How did you end up coming across it in the first place?

00:35:41

Anil

Well, in Cambridge, OCaml is now taught to first year students because first of all, it’s a reset button because most students would come with a background of JavaScript or Python and they’d have partial knowledge so we wanted to find something that’s a little bit obscure and certainly not massively in the mainstream. Secondly, it’s the easiest way to teach the foundations of computer science: the basics of data structures and recursion and representations and all the beautiful logics and proofs that follow from that.

00:36:08

Anil

At Cambridge, there’s a long tradition of using ML-style languages from Standard ML to OCaml so I couldn’t help but be exposed to it because of the university environment. Secondly, it was also the most practical way to do systems programming in the early 2000s. There weren’t really any other alternatives back then. You could go for Java which is very heavyweight, you could go for Perl which was write-once, it still is, to some extent. Python and Ruby were still very much in their fledgling phases. There weren’t many other compiled languages. Today, we have this wonderful spring of programming languages but we didn’t back then.

00:36:38

Anil

But languages have momentum as well. This is that generational concept I keep going back to. It’s not like we’re just avoiding other languages but when you build up such a large codebase of OCaml code, it just gets easier and easier to build and advance it every single day so it’s almost at the tipping point now where it’s easier to extend OCaml with Rust-style features than it is to rewrite all of our code in Rust, for example, or in any other language that comes along. It’s easier to go do a machine proof using the Coq Proof Assistant and extract OCaml than it is to do anything else. It’s this reduction of friction that just builds up over the years.

00:37:11

Ron

I understand what you’re saying but I feel like what you’re saying is also on some level, objectively false. Meaning, you’re saying, “Well, back in the ’90s, what systems programming languages were there other than OCaml?” And I’m like, “There was C.” And in fact, that’s what everybody used. It is not the case that system programmers in general in the ’90s looked around and were like, “Oh, yeah, we’re definitely going to write all our systems in OCaml.”

00:37:33

Anil

No, that’s right. If I could go back in time, I would evangelize OCaml, not now, but in the late ’90s because I feel like I missed a leap of innovation there. No one had heard of OCaml back then and it was just this incredibly productive tool to write Unix-like code. It was just better than writing in C. And this is me emerging out of writing lots of C code for many, many years and indeed writing lots of PHP code for websites and web mail stacks and so on.

00:37:55

Anil

But OCaml went through a period of stagnation because, like any open source project, if it’s not invested in, if it doesn’t have a large body of programmers, then it gets really hard to sustain it over the years. So around 10 years into OCaml’s life, which is roughly when I was using it in about 2005, the rate of progress really stalled.

00:38:13

Anil

At this point, we’ve missed a window where we could have heavily evangelized this to more systems programmers that didn’t have the tools and the right development environment to make it easily possible. So while we used it heavily at XenSource, it never got picked up by other developers within XenSource because of that lack of tooling.

00:38:28

Ron

We talked some about your background in open source. Some of the work that you’ve done, and in fact that you and I have collaborated on over the years, has been about developing the open source community around OCaml and helping in part, certainly not just us, but helping in part to combat some of that stagnation and part of that was the creation of OCaml Labs. Can you tell us a little more about where OCaml Labs came from?

00:38:50

Anil

I can.

00:38:51

Anil

Whenever we finished at XenSource, we got acquired by Citrix and I left after a few years of happily hacking on Xen within Citrix. I went back to academia and I knew that I had this burning desire to build MirageOS because everything was set. I had all the code from the previous startups, I had the problem, I had five years of funding, I had this wonderful research fellowship to work on. But it was just me and I knew that if I wanted to make this as big as I wanted it to be, I needed help and it was help on multiple fronts.

00:39:18

Anil

The first thing was that the OCaml development team was incredible. I remember having dinner with Xavier Leroy in about 2009 and he just said that they would maintain OCaml forever but they were struggling with all of the bug reports coming in and the fact that they didn’t have any dedicated staff working on it. But he said, “Anyone can work on it. But why isn’t anyone doing it?” I got talking to you, Ron, and we said, “Well, why don’t we find someone that will help us do this?”

00:39:39

Anil

It was really hard to find anyone who would actually work in the core compiler, look at bug reports, and build out tooling because these are all the things that we needed. In the end, it came to a hard decision: “If you can’t find anyone else, then perhaps I should do it myself.” And the reason I was really motivated to do this myself was because I wanted this for MirageOS. So anything I did to improve OCaml would directly leverage and improve MirageOS, the project I’m really passionate about.

00:40:02

Anil

We founded OCaml Labs in Cambridge, and one of the beautiful things about Cambridge University is that individual staff retain their intellectual property, it’s not owned by the university. This meant that working in open source became really easy because anyone we hired at the university could just write code and there wasn’t any need for any legal agreements or anything with the university, we just released it.

00:40:22

Anil

I’m really, really proud that what we started with a seed in Cambridge has now become a diaspora of people all around the world working in different geographies in different environments but continue to communicate and share their code through the open source ecosystem.

00:40:36

Ron

I think Cambridge as an institution deserves an enormous amount of credit for all of this because this thing was messy and complicated and does not fit in in an ordinary way to a simple notion of academic research. A lot of the work that needed to be done was work about coordinating open source ecosystems and maintainership work. It’s not the kind of stuff that gets you tenure. Most institutions aren’t willing to take it on and Cambridge was.

00:41:03

Ron

I think it was important to have an academic institution that was willing to do it because OCaml is, in many ways, a deeply academic language, its roots, and much of the expertise just realistically resides in academic institutions. There’s an enormous amount of connection to various different kinds of real and legitimate research work. We saw lots of exciting things coming out of Cambridge on that kind of research side that were secondary to this and all of this other real infrastructure that was created.

00:41:32

Ron

We looked around and tried to find various homes for OCaml Labs and Cambridge was the place that was willing to do it. It was an enormously important find that we found an institution that was really willing to partner with us effectively in doing this kind of work.

00:41:46

Ron

Another thing that strikes me about the story you’re telling is the degree to which OCaml Labs acted as an effective form of glue. A lot of the work you’re talking about, which is important advances in the state of the art for OCaml, is not all stuff that was done at OCaml Labs. Merlin was created by some Inria undergrads, if I remember correctly, but they were later working with and supported by OCaml Labs. OCamlFormat was just done as an internal Facebook project and then Jane Street adopted it and made a bunch of further changes.

00:42:20

Ron

But it was OCaml Labs that provided the glue to take it and turn it into a maintained and general purpose piece of software and figure out how to share between the various different contributors.

00:42:32

Ron

Dune is another example. Dune was created at Jane Street for Jane Street’s kind of narrow purposes and now there’s been a really deep collaboration between engineers at Jane Street including Jeremie Dimino who wrote the first version of it and runs the team that manages it at Jane Street and collaborates very closely with OCaml Labs. And so both the industrial side of that work and the open source side of that work are well handled and handled by different parts of what is essentially one big team that’s working on multiple aspects of the problem.

00:43:07

Anil

That’s right.

00:43:07

Anil

The fundamental value that Cambridge brings is training, mentoring, and graduation. Graduation is a really important part of Cambridge where you leave and you go do something else. And the same is true for Inria and the universities in France where the Merlin developers came from. I’m particularly proud of the number of people that have learned and moved on from Cambridge to other jobs in the ecosystem and succeeded.

00:43:28

Anil

Stephen Dolan and Leo White, both of whom are on this podcast, started off their degrees in Cambridge, did their PhDs there, and have moved into Jane Street and many other graduates have done similar as well.

00:43:38

Anil

It’s crucial for the longevity of a community to have this easy flow of people across jobs because obviously people’s lives change. They can’t just stay working in a university and Cambridge was extraordinarily flexible in figuring out how to get people in.

00:43:51

Anil

David Allsopp, for example, who is one of the most prolific contributors to core OCaml, is also a countertenor singer in his spare time, but when I say spare time it is actually his career. So I had to convince him to come be a developer here because he was working on OCaml in his spare time while also maintaining his singing career. He successfully juggled both of those and became an incredible contributor and an incredible singer. But explaining to Cambridge HR exactly why I was hiring a singer to work in my research group was a challenge but they didn’t say no. He’s still at OCaml Labs and he’s still one of the prime maintainers many years on.

00:44:22

Ron

One of the big and long-running projects that OCaml Labs has taken on and really driven is the work towards having a multicore garbage collector for OCaml and a multicore-capable runtime. This is a long-running sore point about OCaml. You mentioned one of the limitations in Mirage is that OCaml is not multicore-capable in the sense that you can’t run multiple OCaml threads that share the same heap. This has been a thing that people have talked about for a very long time and there’s been some amount of work on it and some discussion about how to get there for many years.

00:44:56

Ron

One question I have is why has it taken so long? Why has this been such a big and long-running project to add multicore to the language?

00:45:03

Anil

A really important part of research is understanding that 90% of what we do is fail. Whenever we started adding multicore parallelism to OCaml, we were taking an existing ecosystem, an existing semantic for the language, and just trying to extend it with the ability to run two things at the same time instead of one thing. And the number of assumptions that break when you do two things at the same time instead of just one thing is incredible.

00:45:27

Anil

Our first naive attempt was in 2013. We presented our confident plan for exactly how multicore would go into OCaml and it got okayed by Damien Doligez and Xavier Leroy. Then a couple of years on, we just realized just how many edge cases there were and the need for a better conceptual core for what it means to be multicore.

00:45:46

Anil

We went to a Caml Consortium meeting which was where the industrial users of OCaml a few years ago would present their needs and requirements and we presented our work to that team and they said, “Well, look, you can’t add this without having a memory model to OCaml.” So without a memory model which says, “This is what happens when two threads simultaneously access a single OCaml value.” Without that definition, it’s really hard to ascribe any meaning to multicore OCaml because what does the program do whenever the situation happens?
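
To make the question concrete, here is a minimal sketch of the kind of program a memory model has to give meaning to. It assumes the `Domain` and `Atomic` modules as they later shipped in the OCaml 5 standard library; the names `payload`, `ready`, `producer`, `consumer`, and `run` are illustrative and not taken from the conversation. One domain writes a plain mutable cell and then sets an atomic flag; the other waits for the flag and then reads the cell. Because all cross-domain communication goes through the atomic flag, the program has no data race, and the reader is expected to see the written value.

```ocaml
(* Plain (non-atomic) mutable cell holding the data to hand over. *)
let payload = ref 0

(* Atomic flag used to publish the data to the other domain. *)
let ready = Atomic.make false

let producer () =
  payload := 42;          (* 1. write the data *)
  Atomic.set ready true   (* 2. publish it through the atomic flag *)

let consumer () =
  (* Spin until the flag has been set by the producer. *)
  while not (Atomic.get ready) do
    Domain.cpu_relax ()
  done;
  (* The atomic read above orders this read after the producer's write. *)
  !payload

(* Run consumer and producer in separate domains; return what was read. *)
let run () =
  let c = Domain.spawn consumer in
  let p = Domain.spawn producer in
  Domain.join p;
  Domain.join c

let () = Printf.printf "consumer read %d\n" (run ())
```

Without a memory model there is no principled answer to what `consumer` may return here; with one, the answer follows from the ordering that the atomic operations establish.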

00:46:15

Anil

So we then had to go off for a year and figure out new theorems and we came up with something called LDRF, local data race freedom, which was published in PLDI, a top-tier conference, but it also crucially resulted, in addition to this nice new theorem, in a clean, well-defined semantics for multicore parallelism in OCaml.

00:46:33

Anil

Then we went back to the core development team and we said, “Hey, here is this clean memory model semantic.” They went, “Yay, great. Where’s the rest of it?” But remember, there were only about two or three of us working on this while juggling many other things. So we then went off and frantically started writing the garbage collector and making sure that we could finish off the job.

00:46:51

Anil

The garbage collector is more difficult than a normal, single-threaded one because it has to deal with multiple cores simultaneously wanting to trigger garbage collections and you have to make sure that irrespective of when the garbage collection is happening that the program is still maintaining type safety so nothing can ever observably be violated by garbage collection happening.

00:47:09

Anil

We ended up with two separate schemes for garbage collection and we couldn’t decide between them. We then had to write a full paper about this, we had to make sure that we evaluated both sides, and we also had to do this against a backdrop where we could not tolerate more than a few percent of a performance hit for old OCaml code.

00:47:28

Anil

If you’re building a new language, you could just go ahead and build it and you could build the perfect parallel algorithm because you have no compatibility to worry about. But meanwhile, we had the entire Coq Proof Assistant community that said, “We’re not going to use multicore for a few years, but if we compile our existing code with multicore OCaml, it shouldn’t get any slower.”

00:47:48

Anil

Back then, we’d have maybe a 10 or 20% performance hit, so a significant slowdown until you use multicore. Now after a few years of work, we got that down to a few percent. It was almost indistinguishable from noise because of all of the various techniques that we put into the garbage collector and the compilation model to ensure that that happened.

00:48:05

Anil

This was, again, real research. It got published in ICFP. We then had to figure out how to present this to the core development team, get consensus, and then move it forward.

00:48:16

Anil

I think we have been working on multicore incrementally since OCaml 4.02. OCaml 4.02 was where we had the first branch of OCaml for multicore. We’re now in OCaml 4.13 which has just been branched and I think in every version since 4.08, we’ve put in a significant chunk of work in order to get towards multicore parallelism.

00:48:35

Anil

Most of these things are invisible to the OCaml users so you at Jane Street have been using different parts of the multicore compiler that we have upstreamed into lots and lots of different versions of OCaml and we’ve done so in such a way that it totally respects backwards compatibility because if you don’t get it just right then we’ll end up with a split world where the multicore OCaml compiler is a new language and it won’t work with older existing OCaml code and that would be a disaster.

00:48:59

Anil

The reason we’re so careful in threading the needle is that whenever OCaml 5.0 lands, it will compile almost every bit of existing code from the last 25 years with a minimal performance hit. It will then allow you to add multicore parallelism through this domains interface and it has one of the best and cleanest memory models of any language.
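
As a rough idea of what the domains interface looks like in use, here is a minimal sketch assuming the `Domain.spawn` and `Domain.join` functions as they later shipped in the OCaml 5 standard library. The helper names `sum_range` and `parallel_sum` are hypothetical, chosen only for this example. Ordinary sequential code (`sum_range`) is left untouched, and parallelism is added around it by running one half of the work in a second domain.

```ocaml
(* Sum the integers lo..hi (inclusive) with plain sequential code. *)
let sum_range lo hi =
  let total = ref 0 in
  for i = lo to hi do
    total := !total + i
  done;
  !total

(* Split 1..n into two halves: a new domain sums the lower half
   while the current domain sums the upper half. *)
let parallel_sum n =
  let mid = n / 2 in
  let worker = Domain.spawn (fun () -> sum_range 1 mid) in
  let upper = sum_range (mid + 1) n in
  (* Domain.join waits for the worker and returns its result. *)
  Domain.join worker + upper

let () =
  Printf.printf "sequential: %d\n" (sum_range 1 1_000_000);
  Printf.printf "parallel:   %d\n" (parallel_sum 1_000_000)
```

Each call to `sum_range` uses its own local accumulator, so the two domains share no mutable state and the result matches the sequential one.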

00:49:19

Anil

Our research paper on bounding data races in space and time showed that C++ and Java, the two gold standards for their memory models, have disastrous issues, is the best way to put it. So that’s the opening of our paper. We show that with just no performance hit on x86, 0.4% on ARM, and 2% on PowerPC, we could make it all work. So that’s a pretty big result.

00:49:44

Anil

It took a lot of theoretical computer science, a lot of experimental evaluation, and a lot of implementation. All of these had to happen simultaneously. It wouldn’t have been possible without KC Sivaramakrishnan who’s worked with me on this project for the last six years and we’ve gotten two top-tier papers out of it. So it’s not been a great ratio of coding to papers but the end result is something we’re very, very proud of.

00:50:06

Ron

The story you’re telling highlights a lot of the ways in which OCaml is legitimately an academic language and that part of the way of moving things forward and of convincing people to accept a new feature is actually going through the trouble of writing serious academic papers to really outline the design and explain what the novel contributions are. And there are some novel contributions.

00:50:25

Ron

From a more ordinary workaday systems programmer perspective, how should someone who is used to the parallelism story in Java think about the advances in OCaml? How from a pragmatic point of view is the coming OCaml multicore runtime going to be better?

00:50:44

Anil

It’s only going to be better because it will not have any surprises. So whenever you use multicore parallelism in Java, you have to know a lot of things. You have to know about the memory model in Java, you have to understand the atomics and the various interfaces they expose. There’s different levels of things exposed in different versions of the JVM. In OCaml, this is potentially just because of the young age of multicore in OCaml, we think we just have a cleaner model that avoids a lot of pitfalls that Java fell into.

00:51:10

Anil

Now one of the interesting properties about programming languages is that it’s very hard to take back a semantic. If someone has written some code in it, there’s just a vast number of complaints if that changes because it can fail at runtime. Just by waiting for this long and observing how all the different languages have built their systems and then doing the research to thread that needle to find the least surprising memory model across all of the hardware deployed today, that’s what we have in OCaml.

00:51:34

Anil

A Java programmer should find it the most boring experience to do multicore parallelism in OCaml. They’ll just use high-level libraries like domainslib that gives them all of the usual parallel programming libraries and it’ll just work. No surprises, fast.

00:51:49

Ron

Do you have a pithy example of a pitfall in multicore Java that doesn’t exist in multicore OCaml?

00:51:54

Anil

There’s something called a data race. And when you have a data race, this means that two threads of parallel execution are accessing the same memory at the same time. At this point, the program has to decide what the semantics are. In C++, for example, when you have a data race, it results in undefined behavior for the rest of the program, the program can do anything. Conventionally, “demons could fly out of your nose” is an example of just what the compiler can do.

00:52:18

Anil

In Java, you can have data races that are bounded in time so the fact that you change a value can mean later on in execution, because of the workings of the JVM, you can then have some kind of undefined behavior. It’s very hard to debug because it is happening temporally across executions of multiple threads.

00:52:35

Anil

In OCaml, we guarantee that the program is consistent and sequentially consistent between data races. It’s hard to explain any more without showing you fragments of code. But conceptually, if there’s a data race in OCaml code, it will not spread in either space or time. In C++, if there’s a data race, it’ll spread to the rest of the codebase. In Java, if there’s a data race, it’ll spread through potentially multiple executions of that bit of code in the future.

00:53:05

Anil

In OCaml, none of those things happen. The data race happens, some consequence exists in that particular part of the code but it doesn’t spread through the program. So if you’re debugging it, you can spot your data race because it happens in a very constrained part of the application and that modularity is obviously essential for any kind of semantic reasoning about the program because you can’t be looking in your logging library for undefined behavior when you’re working on a trading strategy or something else. It’s got to be in your face, at the point.
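A minimal sketch of the kind of code fragment being alluded to, assuming the `Domain` and `Atomic` modules as they appear in the OCaml 5 standard library. The names `racy`, `safe`, `work`, and `iterations` are made up for illustration. Two domains increment both a plain `ref` (a data race, since the accesses are unsynchronized and include writes) and an `Atomic.t` (no race). The racy counter may lose updates, but the damage stays confined to that one location: the program remains memory-safe and the atomic counter is unaffected.

```ocaml
(* Hypothetical example: names are illustrative, not from the episode. *)
let iterations = 100_000

(* A plain mutable cell: unsynchronized access from two domains is a data race. *)
let racy = ref 0

(* An atomic cell: concurrent increments are well-defined and never lost. *)
let safe = Atomic.make 0

let work () =
  for _i = 1 to iterations do
    racy := !racy + 1;   (* racy read-modify-write, updates may be lost *)
    Atomic.incr safe     (* atomic increment *)
  done

let () =
  let d1 = Domain.spawn work in
  let d2 = Domain.spawn work in
  Domain.join d1;
  Domain.join d2;
  (* [safe] is exactly 2 * iterations; [racy] may be smaller, but it is
     still an ordinary integer that some increment actually wrote. *)
  Printf.printf "atomic = %d, racy = %d\n" (Atomic.get safe) !racy
```

The value printed for `racy` can vary from run to run, which is the "consequence in that particular part of the code"; nothing else in the program is invalidated by it.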

00:53:32

Ron

Yeah, it seems to me like the core thing you’re talking about is that buggy code is easier to reason about. It’s enormously important because almost all code is buggy, like parts of every codebase have bugs and problems, and this is why the classic undefined-behavior stance of traditional C and C++ compilers is so maddening, because there’s an amplification of error where you make some mistake where you step outside of the standard and suddenly, anything can happen.

00:54:00

Ron

I’ve actually been seeing this happening with my son who has a summer internship where he’s off hacking at a bunch of C code. When you make a mistake in C code, it can be really hard to nail it down because the compiler can make all sorts of assumptions and push the mistakes into places where you totally wouldn’t expect it.

00:54:18

Ron

It sounds like the same thing happens in the context of data races in C and C++ and to some degree in Java and reducing that just makes it more predictable. It makes debugging easier. I feel pretty convinced by this story.

00:54:31

Anil

It’s quite pleasant working in multicore OCaml when it comes to debugging things because of this property.

00:54:36

Ron

Are you brave enough to venture a date by which a mere mortal who installs the latest version of OCaml will be able to run two threads in parallel that access the same heap?

00:54:46

Anil

Well, I can’t give you a date but I will give you… Well, I can give you a date. You can do that today.

00:54:52

Anil

So what I did…

00:54:53

Ron

Ha.

00:54:54

Anil

What I did a couple of weeks ago was to merge the multicore OCaml working tree that we use which is a set of patches against the latest stable OCaml into the mainline opam repository. This means that with one line, you can switch from OCaml 4.12.0 to OCaml 4.12.0-plus-domains and all the work that the multicore OCaml team has been doing has been focused around ecosystem compatibility. You can just start with your existing projects and you can then start adding in domain support.

00:55:23

Anil

If you’re really, really experimental, we have a future-looking branch which also adds something called an effect system on top of this patch set. This effect system is the ability to interpret certain external events that happen and just deal with them through what are known as effect handlers.

00:55:41

Anil

For example, if I’m writing to a blocking network socket instead of having to then use async/await or Lwt or monadic-style concurrency, our effects system just lets another part of the OCaml program deal with the blocking IO and then resume the thread of execution whenever it is ready to happen again.
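As a rough sketch of the idea, here is what direct-style code over an effect can look like, assuming the `Effect` module from the OCaml 5 standard library (the experimental branch discussed here used different syntax). The effect `Read_line` and the functions `program` and `run_with_input` are hypothetical names. Instead of a real socket, the handler feeds lines from a list, which keeps the example self-contained; a real I/O handler would suspend the continuation and resume it when data arrives.

```ocaml
open Effect
open Effect.Deep

(* Hypothetical effect: "give me the next line of input". *)
type _ Effect.t += Read_line : string Effect.t

(* Straight-line code: no monads, no callbacks, just [perform]. *)
let program () =
  let a = perform Read_line in
  let b = perform Read_line in
  a ^ " " ^ b

(* A handler that decides what a "blocking read" means. Here it pops lines
   from a list and resumes the suspended computation with each one. *)
let run_with_input lines =
  let remaining = ref lines in
  try_with program ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Read_line ->
          Some (fun (k : (a, _) continuation) ->
            match !remaining with
            | [] -> continue k ""          (* no input left: resume with "" *)
            | x :: rest ->
              remaining := rest;
              continue k x)                (* resume where [perform] paused *)
        | _ -> None) }

let () = print_endline (run_with_input ["hello"; "world"])
```

The point of the design is that `program` does not know or care how the read is satisfied; swapping the handler changes the I/O strategy without touching the straight-line code.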

00:55:58

Anil

This is highly experimental but it results in some of the most pleasant and straight-line OCaml code I’ve ever written. It reminds me of writing code in the early 2000s when we just used pthreads and Unix for everything. All of these different variants and levels of the OCaml compiler are now available in opam. Depending on how near-line features you want to test, all of the trees are available for you to try out.

00:56:21

Anil

The next thing we’re doing is that we’re working on OCaml 5.0 and this is hopefully going to be the release after 4.13 which contains the domains-only patch set. It will expose just two extra modules that provide you with the ability to launch multiple threads of execution. After six years of work, it’s two modules. But those two modules obviously have enormous power because you can then use those to spin up, without having to fork multiple processes or do lots of complicated serialization, multiple threads.
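To make this concrete, here is a small sketch using `Domain.spawn` and `Domain.join` from the OCaml 5 standard library. The helper names `sum_range` and `parallel_sum` are invented for illustration, and the chunking scheme is just one simple choice. Each domain sums its own slice of `1..n`, and the results are combined in the parent; all domains share one heap, so no serialization is needed to return a result.

```ocaml
(* Sum the integers lo..hi inclusive (0 if the range is empty). *)
let sum_range lo hi =
  let acc = ref 0 in
  for i = lo to hi do
    acc := !acc + i
  done;
  !acc

(* Split 1..n into [num_domains] contiguous chunks, sum each chunk in its
   own domain, then add up the partial sums in the calling domain. *)
let parallel_sum ~num_domains n =
  let chunk = (n + num_domains - 1) / num_domains in
  let workers =
    List.init num_domains (fun d ->
      let lo = d * chunk + 1 in
      let hi = min n ((d + 1) * chunk) in
      Domain.spawn (fun () -> sum_range lo hi))
  in
  List.fold_left (fun total w -> total + Domain.join w) 0 workers

let () =
  Printf.printf "sum = %d\n" (parallel_sum ~num_domains:4 1_000_000)
```

In practice the number of domains would usually be tied to the number of available cores rather than hard-coded; higher-level libraries such as domainslib handle that kind of pooling.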

00:56:50

Anil

And then our plan gets more experimental. 5.0 is the sole focus of features that we have been approved to get into core OCaml because they’ve gone through extensive peer review. Then for 5.1, our plan is to propose the runtime parts of this effect system. This lets us not only express parallelism, which is what you get in OCaml 5.0, but concurrency directly in the language, so the ability to interleave multiple threads of control in a very natural way.

00:57:16

Anil

This is the original research that we just published in PLDI this year on how we made the runtime part of the effect system as flexible as possible, and again without breaking any compatibility with your existing tools. It uses GDB and all of the familiar debugging tools you’re used to.

00:57:31

Anil

And then later on, in 5.2, we’re going to expose that effect system into the core OCaml language using something known as effect handlers and typed effect handlers. We’re doing that in close collaboration with Jane Street engineers as well.

00:57:42

Anil

This roadmap is multiple years of work but the first step, OCaml 5.0, we’ll get into your hands as soon as we can. But all the trees are in open source and the way to speed it up is by giving it a try, trying your application against it, and giving us bug reports. So that’s the heart of open source and how you get a concrete date. Help us to help you.

00:58:03

Ron

Question well dodged.

00:58:07

Ron

By the way, just to highlight a little point you said there. You mentioned how the domains-only version of it is meant to provide the basic parallelism and then on top of that, you want to add some notion of concurrency. In some sense, once you add parallelism, there’s already some amount of concurrent execution. But I guess this reminds me of the old Solaris-style model: you have some number of kernel-provided, truly in-parallel threads, and then you have some kind of micro-thread notion that operates inside of there that’s lighter weight.

00:58:37

Ron

That’s the split that’s really being talked about here. The idea is you have something like one domain that you’d run per, say, physical CPU that you have and then you might have tens or hundreds or tens of thousands of little micro-threads that are running inside each domain and importantly, migratable so you can take one of these and pick them up and move them to a different core. That’s an important part of that model.

00:59:00

Anil

It’s a really important point. Instead of calling them micro-threads, we call them fibers. These are really lightweight data structures. You can have millions of these in your heap, resuming them on a different core is just a matter of writing some OCaml code.

00:59:10

Anil

The really nice thing with effect handlers is that your schedulers, the things that normally the operating system would decide to do for you like thread scheduling, are written in OCaml as well. And so this means that you can write application-specific logic for things that conventionally the kernel would take care of for you.
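A scheduler written in ordinary OCaml can be sketched as follows, assuming the `Effect` module from the OCaml 5 standard library. This is a toy single-domain, round-robin scheduler in the style of well-known effect-handler examples, not the scheduler of any particular library; the effects `Fork` and `Yield` and the functions `run` and `worker` are illustrative names. Each fiber is just a captured continuation sitting in a queue.

```ocaml
open Effect
open Effect.Deep

(* Hypothetical effects understood by the toy scheduler below. *)
type _ Effect.t +=
  | Fork : (unit -> unit) -> unit Effect.t   (* start a new fiber *)
  | Yield : unit Effect.t                    (* let other fibers run *)

(* Run [main] and every fiber it forks, round-robin, on one domain. *)
let run (main : unit -> unit) : unit =
  let run_q : (unit -> unit) Queue.t = Queue.create () in
  (* Resume the next runnable fiber, if there is one. *)
  let dequeue () =
    if not (Queue.is_empty run_q) then (Queue.pop run_q) ()
  in
  let rec spawn (f : unit -> unit) : unit =
    match_with f ()
      { retc = (fun () -> dequeue ());       (* fiber finished: run the next *)
        exnc = raise;
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Yield ->
            Some (fun (k : (a, unit) continuation) ->
              Queue.push (fun () -> continue k ()) run_q;
              dequeue ())
          | Fork g ->
            Some (fun (k : (a, unit) continuation) ->
              Queue.push (fun () -> continue k ()) run_q;
              spawn g)
          | _ -> None) }
  in
  spawn main

(* Usage: two fibers that record their progress and yield after each step. *)
let log = Buffer.create 16

let worker name () =
  for i = 1 to 2 do
    Buffer.add_string log (Printf.sprintf "%s%d " name i);
    perform Yield
  done

let () =
  run (fun () ->
    perform (Fork (worker "a"));
    perform (Fork (worker "b")));
  print_endline (Buffer.contents log)
```

Because the policy lives in `dequeue` and the queue, changing the scheduling discipline (priorities, deadlines, batching) is a matter of editing this OCaml code rather than tuning the kernel.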

00:59:26

Anil

The kernel doesn’t really know how to do things optimally. It knows how to do things to cause the least harm. And so by just domain specialization, your applications in OCaml can get really, really fast.

00:59:38

Anil

Now this should be familiar to you, right? Because this is the future of MirageOS. The goal of the effect system is to internalize about a decade’s worth of learnings about how to build portability libraries, how to build abstractions and device drivers, and now we’re having the time of our lives rebuilding all of these things in direct-style code using the effect system.

00:59:56

Anil

We have a new effects stack called Eio which is pure direct-style code, and its performance is competitive with Rust and Go and so on. I think it’s faster than Go by quite a long way and it’s competitive with Rust, and it uses all of the new features in operating systems: it uses io_uring on Linux, it uses Grand Central Dispatch on macOS and iOS, and it uses the IOCP subsystem on Windows. All of these things happen invisibly inside the I/O subsystem written in OCaml. But as a programmer, you just write normal, straight-line OCaml code and the effect system takes care of all of that for you.

01:00:28

Anil

It’s a very, very exciting frontier for what’s coming in OCaml in the future and it makes MirageOS code even more Mirage-y because it’s just normal OCaml code that you write and all of this stuff is being handled for you in the background through various effect handlers.

01:00:40

Ron

Well, I think that’s a fantastic place to stop as you tie a little bow around connecting Mirage and the most recent work you’ve been doing in OCaml.

01:00:47

Ron

Anil, thank you so much for joining me. This has been a real pleasure.

01:00:50

Anil

Thanks, Ron. Fun as always.

01:00:53

Ron

All right, cheers.

01:00:54

Ron

You’ll find a complete transcript of the episode along with links to some of the things that we discussed including Mirage and some of Anil’s other research at signalsandthreads.com. Thanks for joining us and see you next time.