Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.
Stephen Dolan works on Jane Street’s Tools and Compilers team where he focuses on the OCaml compiler. In this episode, Stephen and Ron take a trip down memory lane, discussing how to manage computer memory efficiently and safely. They consider trade-offs between reference counting and garbage collection, the surprising gains achieved by prefetching, and how new language features like local allocation and unboxed types could give OCaml users more control over their memory.
Some links to topics that came up in the discussion:
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack, from Jane Street. I’m Ron Minsky. It is my pleasure to introduce Stephen Dolan. Stephen Dolan works here at Jane Street on the Tools and Compilers team, in particular working on the OCaml compiler. Stephen has an interesting set of experiences, both in industry and in academia.
A thing that you might have heard about that he’s done is he wrote a little tool called jq for querying JSON objects, which lots and lots of people have run into. And his dissertation, which made a lot of waves, was about a system for reconciling subtyping and type inference, and a prototype system for doing that together called MLsub. And we are totally not going to talk about that today. Instead, we’re going to talk about what Stephen has spent his time procrastinating on while he was doing his dissertation, which is memory management.
In particular, Stephen did a bunch of work on OCaml’s multicore garbage collector. Stephen joined Jane Street in 2019, and he’s worked on OCaml’s garbage collector directly and also on extending OCaml’s language to give programmers more direct control over how memory is managed, which is very important from a performance perspective. And we’re going to talk about a lot of that today. To start with, Stephen, maybe you could give us a little bit of an overview of how memory management works, not just in OCaml, which is what you’ve been working on here, but in general, across lots of different programming languages.
Sure, that sounds like a good idea. And thanks for the introduction. So I suppose at a basic level, programs need to put their data in memory somewhere and need to make sure that there aren’t two things in the same place at the same time. I suppose at the very basic level, the simplest way of doing that is you just write a big map of where everything goes. Everything goes into a specific spot and never moves. And there’s just a fixed amount of memory in use for all time.
I mean, people actually do that for tiny programs running on tiny machines like microcontrollers, but it reaches its limits fairly quickly after that. So I suppose for full-scale programming languages, the two basic things that you have for managing memory are the stack and the heap. So the stack is, you notice that an awful lot of values that our program manipulates are used for a very short space of time.
There’s a local variable in a function. You need some memory to store it. Then when the function returns, it’s gone and you don’t need it again. So you can just reserve a large piece of memory, use it as a stack. So it just grows from one end, whenever you need a bit of memory, you put it on the end, whenever you’re done with it, you pop it off and you ensure that you do things in a strictly last in, first out manner.
Right. And the ordering there is in part enforced by the discipline of function calls. You call a function, it gets a new stack frame, you do some stuff. Then when the function returns, the stack frame vanishes, and it looks like a stack data structure.
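The last-in, first-out discipline described here can be sketched as a toy model. This is illustrative Python, not how a real call stack is implemented; the frame contents are just placeholders:

```python
# Toy model of stack allocation: each function call pushes a frame,
# and returning pops it, strictly last in, first out.
stack = []

def call(name, locals_size):
    # Reserve space for the function's local variables on entry.
    stack.append((name, [0] * locals_size))

def ret():
    # The frame vanishes when the function returns.
    stack.pop()

call("main", 4)
call("helper", 2)   # helper's frame sits on top of main's
ret()               # helper returns first: last in, first out
ret()               # then main
assert stack == []  # all memory reclaimed, in reverse order of allocation
```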
Yeah, exactly. It works particularly well for the local variables of a function. Of course not all memory that a program uses is going to work in such a straightforward way. Sometimes exactly how long a piece of data is going to hang around for is not completely obvious, or even when it’s obvious doesn’t line up perfectly with a single function call.
So then you’ve got heap allocations where you have some lifetime which is more complicated than a stack. And there are lots and lots of techniques for dealing with those. I suppose the oldest one is the C-style one, where there’s a function called to malloc some memory, to allocate a new piece of memory. And then when you’re done with it, you free it. And if you make some minor slip-up here, so you free something twice accidentally, or you continue using something after you’ve freed it, then your program has an exploitable security vulnerability and everything breaks in a million different ways.
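A toy model of that malloc/free bookkeeping, in Python for illustration. Unlike real C, this toy checks for the double-free slip-up and raises an error, where a real C allocator would silently corrupt its own data structures:

```python
# Toy model of malloc/free bookkeeping. Real C does no such checking,
# which is why a double free or use-after-free corrupts the allocator
# silently instead of failing cleanly.
heap = {}        # address -> live block
next_addr = 0    # toy address counter, not a real machine address

def malloc(size):
    global next_addr
    addr = next_addr
    next_addr += size
    heap[addr] = bytearray(size)
    return addr

def free(addr):
    if addr not in heap:
        # In C this is undefined behavior, not an error you can catch.
        raise RuntimeError("double free or invalid free")
    del heap[addr]

p = malloc(16)
free(p)
try:
    free(p)      # the "minor slip-up": freeing the same block twice
except RuntimeError as e:
    print(e)
```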
So the ones I’m interested in are what’s called safe languages, where there’s some means of managing heap memory, where this doesn’t happen, where it’s actually memory safe. You end up not crashing or being exploitable or reusing memory with different types and so on. Those boil down to some form of garbage collection when the language is responsible for working out when memory is in use and freeing it when it’s no longer in use.
And the alternative being some form of static tracking, a static verification that you’ve done your manual memory management correctly, like the ownership and borrowing in Rust. So at the moment OCaml, the programming language very heavily used at Jane Street, is very far toward the garbage-collect-everything end of the scale, which is the right trade-off only some of the time. And some of the things that we’re working on try to allow some other styles of memory management, so you can have more explicit control over how long things are alive for.
Right. So just to summarize that, heap and stack are the two basic data structures. It is maybe worth saying with the heap, you have some big region of memory and one way or another, you keep track of some notion of what’s free and what’s in use and in the C world, where you’re doing things manually and unsafely, malloc and free are the things for moving those data structures around.
And you have to make sure you call them in exactly the right places. And then in a garbage-collected language, we have some system that’s still got some notion of what’s allocated and what isn’t allocated, but we have automated techniques for computing an approximation of what are the things that you actually need to keep around and making sure that you keep around at least all the things that you need to keep around and maybe a little bit extra. Is it maybe worth talking about what are those techniques for approximating what is the stuff that you need to keep around?
Sure. Yeah. So I suppose the big two of them are what’s called tracing collection and reference counting, which have their own innumerably many different variants. But in tracing collection, you periodically work out what’s live by starting with the stuff that you definitely know is in use, so the local variables of functions that are running and global variables, and tracing from there: this structure has a reference to that structure, which has a reference to another structure. So you follow all of the references until you find the set of all of the structures that are in use.
And when you get to the end and you have no more references left to follow, then everything that you haven’t yet touched is garbage. That space cannot possibly ever be used again. So that’s available for future allocations. The alternative is reference counting. So each piece of memory that’s in use, you keep track of how many references there are to it. And you increment or decrement that when more references are created or destroyed. So you’re constantly maintaining this counter for how many other pieces of data refer to a piece of data. And if it ever reaches zero, then that bit of memory can be reused.
Right. And I guess one immediate problem with the naive ref counting approach is when you have cycles, when you have a set of data that points circularly to itself. And then if you just do the kind of naive ref counting thing, the count will never get decremented to zero. And as a result, you’ll never collect that data.
Yeah, exactly. I suppose a really good overview of this is there’s a paper called “A Unified Theory of Garbage Collection,” which is David Bacon, I think, which, one way of looking at this is that in tracing GC, you need to walk over all of the stuff that’s still in use. And then everything that you don’t touch is the stuff that’s dead and can be garbage collected.
Whereas in reference counting, you need to walk over all of the stuff that’s dead. Whenever a reference reaches zero, you need to go through all of its references, decrement all of those references and see that those reach zero. So you end up tracing a different set of stuff. You end up tracing through all the stuff that you’re about to get rid of instead of the stuff that you’re trying to keep alive. But as you say, if there are cycles here, then the cycle ref counts never go to zero, essentially.
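CPython makes a convenient live demonstration of this: its reference counts alone can never free a cycle, and it is a periodic tracing pass, exposed through the standard gc module, that cleans cycles up. A minimal sketch:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

gc.collect()              # clear any pre-existing garbage first

a, b = Node(), Node()
a.ref, b.ref = b, a       # a cycle: each object keeps the other's count above zero

del a, b                  # no outside references remain, but naive ref counts
                          # alone would never reach zero...
collected = gc.collect()  # ...so the backup *tracing* collector finds the cycle
assert collected >= 2     # both nodes (and their attribute dicts) were reclaimed
```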
Right. Although there are various hacks that people have for getting rid of this. I don’t know if I have the right summary of this, but my sense is that you, roughly speaking, you occasionally do some tracing to figure out what’s actually dead to get rid of cycles. Is that a good summary of how these cycle-removal techniques for ref counting systems work?
Python does the one you’re saying, it has a mostly reference-counted system and periodically runs a tracing collector to collect cycles. The other technique is you can have some programmer annotation that says certain references should be weak and shouldn’t contribute to the reference counts so that those references aren’t enough to keep things alive.
So a cycle that involves those gets collected anyway. But it’s a little trickier though and requires the programmer to really understand what’s going on and cooperate in telling you which references are which. The general solutions that work without any programmer annotation boil down to occasionally do tracing.
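Python’s weakref module implements exactly this kind of programmer-annotated weak reference: a weak reference does not contribute to the count, so it is not enough to keep its target alive. A small sketch (the Cache class is just a hypothetical example):

```python
import weakref

class Cache:          # hypothetical example class
    pass

obj = Cache()
wr = weakref.ref(obj)     # a weak reference: does not bump the ref count
assert wr() is obj        # dereferencing works while the target is alive

del obj                   # the only *strong* reference goes away...
assert wr() is None       # ...so the object is freed despite the weak ref
```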
And how do you think of the trade-off between ref counting and tracing-based garbage collection? I feel like there’s an intuition out there of, oh, ref counting is faster and tracing is slower and that’s the trade-off between them, which I think is a bad summary. How would you describe the difference?
Yeah, I think that’s certainly not a great summary. It is a big enough chunk of work to implement a memory-management system for a language that it’s quite difficult to compare like with like, because there aren’t that many that have a choice between the two. The few experiments that have been done have mostly shown that ref counting, especially ref counting in a multi-threaded system, is really quite slow.
There are some tricks called deferred reference counting to avoid some of the overhead, but mostly you end up having to do an awful lot of increasing and decreasing the reference counts every time you move pointers around. And in a multi-threaded system, those operations end up being atomic operations, which carry their own overhead. The benefit that you do get though is that memory can be reused much faster than it can be reused in a tracing system.
So ref counting systems tend to detect more or less immediately when a piece of data is free and is available for reuse, whereas tracing garbage collected systems, even when they don’t stop the whole world to do everything at once, even when they run incrementally, they tend to operate in phases. And if a piece of memory becomes available for reuse, becomes garbage during this phase, it’s not going to be noticed until the next phase.
So it tends to hang around for a bit longer before being reused. So that trade-off usually means that ref counting can often work in a smaller amount of memory than tracing garbage collection can. I think a very rough one would be ref counting tends to take up less memory and tracing would usually be faster, but that’s also pretty simplistic because tracing garbage collection has a lot of knobs to twiddle.
In particular, you get to choose afterwards how often you want the garbage collector to run. You can choose to trace frequently or not to trace frequently, which means that you can’t really describe the performance of a program in terms of a single point saying it takes this long and uses this much memory. It’s more like a curve that you get.
With a tracing system, you could decide to say, “I want my program to not take very long,” and then you’ll need a ton of memory. Or you could say, “I want my program to not use very much memory,” and then it’ll take a long time, garbage collecting constantly. So you get this knob that you can slide from one extreme to the other.
Right. And then one of the dark arts of building performant garbage collected programs is figuring out just how to twiddle that knob just so that you get the right behavior for the program in question.
Yeah. I mean, this varies a lot from collector to collector. I mean, some of them have a lot of things to twiddle and they can be very difficult to get the performance to go the way you want. I quite like the design of the OCaml one in this regard. I mean, there are a few, but there’s basically only one that matters. And if you put it one direction, the program takes more memory and goes faster and if you put it the other direction, the program takes less memory and goes slower.
That’s pleasingly simple. I guess another thing that people think of as a trade-off there has to do with the length of pauses, right? If you want responsive programs, then I think people think, oh, tracing collectors are problematic because they have these phases where they stop the program and scan everything, and then ref counting ones less so, because they’re kind of incrementally doing the work immediately, as soon as it becomes relevant.
But that’s not quite right either, right? Because the whole thing where once you decrement the counts to zero, there could be a whole big thing that becomes immediately unavailable. And then you have to immediately, synchronously collect all of it. Whereas tracing collectors are often designed, and the OCaml one in particular is designed, to operate incrementally, where you can collect things in slices and do a few 100 microseconds of work here and a few 100 microseconds of work there and never have to delete a whole big data structure all at once.
Yeah. It’s a bit of a pity that that’s a fairly common misconception. It is exactly as you say. When you remove the final reference to the root of a very large tree, most ref counting systems will then immediately have to traverse the entire tree just to decrement all of the references and mark them all as free. In a very naively implemented garbage collector where you just periodically stop the world and do everything, you end up with massive pauses, and a very naively implemented reference counting system tends to work reasonably well.
But when you get really down into the details of optimizing for pause times, there’s a bit more that you can do in a tracing collector. The work there is already in phases, and it’s easy to divide up into chunks. It’s a bit trickier to do that kind of thing with a ref counting collector. It’s not impossible. There are ref counting collectors that do lazy destruction of things.
Do they do lazy propagation of the counts as well?
This has been implemented and this has been done. I don’t think this is deployed much in production systems because you lose what’s often described as one of the major benefits of ref counting, in that the destructors run promptly. If you’re going to delay them, then you end up with something that has the bad finalization semantics of a tracing garbage collector with the overhead of actually maintaining reference counts all the time. So it’s not…
It’s a clever design which obtains for you the worst of both worlds.
More or less. Yeah. Actually, it’s not strictly the worst because the case where reference counting does really well is when there are objects that just only ever have one reference and they don’t have references to anything else. A good example of this is a large I/O buffer, that you’re opening a file, allocating a big buffer, using it for some stuff, and then throwing away the buffer.
And in a ref-counted system, you don’t need to worry about the traversals because it’s a buffer. It’s got no pointers in it. It’s just got some bytes. The ref count goes from zero to one when you allocate the buffer. You do some stuff with it, and then it goes from one to zero when you’re done. And then the next buffer you want, you get the same buffer back. So your program keeps running only ever using one of these buffers for every single file that it opens.
That style of memory usage where there’s a single large object, it’s quite difficult to get that to perform well with a tracing garbage collector. You often end up cycling through 100 buffers because it takes that long for the collector to notice that the first one is now available for reuse, because that’s how often it’s cycling through the entire heap. This is partially why it’s so difficult to compare these because they’re so heavily workload dependent. And if you have a favorite, you can easily come up with programs where your favorite strategy is the best one.
Right. And also in a given language ecosystem, the programs are going to evolve towards the efficient idioms. So you take a set of benchmarks and you change how memory management works and you may get worse performance than it seems like you should because things have been tuned for the other behavior.
Yeah, exactly. I mean, to some extent, even the example I just said about buffer management is a candidate for that. In languages with lots of garbage collection, you’ll often find APIs where the caller provides the buffer. So reading a file, should reading a file return you some bytes or should it accept a byte buffer into which to put the contents? And in languages where you’re using tracing GC and it’s difficult to efficiently allocate and deallocate a large buffer all the time, you’ll often end up with APIs where the caller provides the buffer. I mean, that’s more efficient in ref-counted languages, too, but it’s less of a gap.
The other thing that’s interesting about ref-counted GCs is the determinism of finalizers, which I guess you don’t get if you do lazy finalization, but you do get otherwise, also makes it easier to manage things other than memory in the garbage collector, right? Because you know, when things are going to be deallocated.
So if you’re managing something like memory in your GPU, which you have a very small amount of and would like to give back as soon as possible, you can do that in a ref counted system, and in a tracing GC, it’s much harder because you really want to know that when you’re done with this particular round of training, that you’ve given up the relevant resources on the GPU. And it’s just harder to know that if that memory is being again managed by this unpredictable tracer.
Well, I suppose for that sort of example, what you really want is ownership, for instance like Rust provides or uniqueness types like in Clean or linear types or any of the various implementations of this idea, but where you know that you have the only reference to this GPU memory or this hardware resource or whatever, so that you can see on a particular line of code, that’s where I get rid of what I know to be the only reference.
If you’re using ref counting, then you don’t necessarily know that maybe something you have called has accidentally captured a reference to that GPU buffer and stored it in a global data structure or a cache or something. And it’s hanging around there. And at the point where you think you’re done with the reference count is actually two, not one. You don’t quite have the level of determinism with ref counting that you might imagine. You know that if this is the only reference, then it will definitely be cleaned up at this point. But it can be hard to tell that it’s genuinely the only reference to something.
Right. I guess the behavior is more deterministic, but you can’t quite reason about it locally. You can’t just look at the code in front of you and know that it’s gone. Some other thing that you may have called may have stashed a reference. So it’s a little harder to predict what’s going on. So we’ve talked a bunch about the trade-off between two different flavors of automatic garbage collection. How do you think about the trade-off between C- or Rust-style manual memory management versus automatic collection?
Well, since you did say C or Rust style, I’m going to have to jump on that one first because I see a big difference between the safe and the unsafe styles. The unchecked style in C is essentially a mechanism for upgrading any bug in your program into an exploitable security hole. There are no benign mistakes when you’re using truly manual unchecked memory management.
Right. There were some results not so far in the past where Microsoft, I think, said something like 40 or 60 or something percent of their bugs were just buffer overflows or some basic memory management problems.
Yeah, it is a bit saddening, but on happier topics, Rust exists. That basically gets you the coding style of very well written C++, where all of the memory management is being correctly done. And the compiler can actually verify this and tell you that all of the memory management is being correctly done. Which is a style that I like if I’m writing that sort of low-level code where I’m paying lots of attention to exactly how memory is laid out.
But not a style that I like if I’m writing a type checker or writing something higher level that deals with lots of complicated linked tree-like structures, because it’s just a lot of work to explain to the compiler exactly why it’s okay to hold onto a particular reference. Sometimes it’s easier to just have a tracing garbage collector work out at run time which ones it’s safe to hold onto.
Right. So trade-off number one, manual collection is manual. It requires work. Okay. But how about in the other direction? What are the upsides of manual management?
So for purely memory, it’s primarily performance. You can go faster. You can write an API that looks like it’s returning strings without having to deal with manually managing buffers, working out when it’s safe to reuse a buffer, or worrying about aliasing between two mutable structures.
You can write in-place mutation of things and know that you’re mutating the only reference to this thing. So nobody else’s data structures are suddenly going to change underneath them. And you can generally write in the very efficient programming style that you can do in, say, C, but you can write in that style while still writing robust modular software and not worrying about having mysterious side effects because you mutated or freed something.
So I guess you’re really highlighting two benefits. One benefit is more precise control makes things more performant. And then the other one is the same type-level mechanisms that let you get rid of the garbage collector also let you get rid of a bunch of common and difficult programming problems. Basically, in order to get rid of the need for a garbage collector, you want some type-level tracking of who owns this piece of memory, so I can tell when no one owns it, and I can make sure statically, in the code, that something gets rid of it at the right point in time.
And that you never use something that has already had that done to it. At the same time, that tracking of ownership avoids what you refer to as aliasing problems, which is basically the kind of bug you run into when you have some function that has two arguments that are referring to what seem like two different things. And under the covers, in a way that you’re not necessarily thinking about, it turns out they share some structures underneath. And so when you mutate things, you can end up with all sorts of bizarre results in the case where there’s aliasing, where you have two things pointing to the same place, that you wouldn’t see if they were actually distinct.
Yeah. And it’s not just those bugs that have a cost, it’s the things people do to avoid the possibility of those bugs has a cost. In languages where you’re making heavy use of mutable data, you often find people doing defensive deep copying of things. Some function takes in a list. But the first thing it does is copy that to another list because, oh, someone might have passed me their list and I’m going to start adding things to the end of it.
I can’t add stuff to the end of someone else’s list or some of their data structures might be corrupted. So you see these kind of defensive copying, or even deep copying of structures, just to make sure that they’re all separate. That’s often a waste of time, but it’s difficult to tell exactly when it’s a waste of time without doing this careful poring through every line of code.
I know two good strategies to avoid having to do that. One is to have type-level control over aliasing, like Rust does, and know exactly whether two things might overlap, and know that if someone’s passing this to me, then either it’s a borrowed immutable reference and I don’t get to change it, or it’s transferred ownership or a mutable reference and I do. Or alternatively, the other one is to use immutable data structures and a garbage collector. So you can update an OCaml map and you get a new map and you haven’t changed the old one. I would quite like to be able to do both of those tricks in the same language, but we’ll wait for now.
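A rough sketch of that second strategy, in Python for illustration. Note that this copies the whole mapping, whereas OCaml’s Map shares structure between the old and new versions, so the real thing is far cheaper than a full copy:

```python
# "Updating" an immutable mapping means building a new one;
# the old version is untouched, so it can be shared freely
# without any defensive copying.
old = {"x": 1, "y": 2}
new = {**old, "y": 42}

assert new == {"x": 1, "y": 42}
assert old == {"x": 1, "y": 2}   # nobody who holds `old` sees any change
```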
Yeah, I like that account because I feel like it explains the difference between a language like OCaml and a language like Rust, very cleanly. And I think there’s a funny thing that on some level Rust and OCaml seem massively different, but from a kind of sensibility perspective, they seem very close. In fact, I think it is not accidental that the first version of Rust was written in OCaml, certainly written by people who understand languages like OCaml very well.
The problem is shared mutation. And there are two ways out. You can make sure there’s no sharing, or you can make sure there’s no mutation. And sometimes you want to be able to take both in the same language and there are varying trade-offs when you do that. So one of the advantages of manual memory management is just performance. And on the automatic garbage collector side, one of the ways one deals with that is by just piling almost unbounded amounts of cleverness into trying to make the garbage collector more efficient.
You can’t quite close all the gap. There’s nothing quite as good as a human who’s carefully designed exactly what you want. No automated system quite gets there, but there’s a lot of work that can be done to make garbage collectors faster. And there’s in fact been a lot of work recently on OCaml’s garbage collector, which you’ve had some involvement in, making it faster. So I’d like to talk some about that. But before we do, it might be worth having a quick tour of how OCaml’s garbage collector works. Can you say a little bit about the structure of OCaml’s GC?
So it’s what’s called a generational collector, which means there are roughly two almost independent garbage collectors in there. I’m going to start by describing the one called the major collector. So that one is responsible for most of the memory of the program. It’s got an allocator that’s fairly standard. I mean, it’s quite a neat one, but its interface looks more or less like a C allocator, except it doesn’t have a free operation.
And every so often it samples all of the, what are called the roots. Those are the things that are definitely in use. They’re the local variables for the functions that are currently running and the global variables for all the modules that have been loaded. And a couple of other weird special case things, but that’s most of it. And then the idea is it’s going to trace everything from the roots, everything reachable by following any reference, starting from the roots, gets marked is what the term is.
There’s a little bit at the start of each object that indicates whether or not it’s been marked. And it keeps a big stack of stuff that is in the process of being marked. When it finds a new object, it pushes it onto that stack and marks it, and then periodically pops something off and marks all of its fields. When it runs out of stuff to mark, then it enters the other phase, which is called sweeping. It runs through all of the objects, and the ones that haven’t been marked, those are free memory.
The stuff that hasn’t been marked by that stage is garbage and can be reused. It gets re-added to the allocator’s data structures for future allocations. This is known as a mark-and-sweep garbage collector. It’s got the two phases of marking and sweeping. The one in OCaml is what’s called incremental, which means it doesn’t stop everything and then walk over the entire heap. Instead, it divides up that work of marking, and also the work of sweeping, into small, fixed-size slices and just does some marking periodically. And then in the sweep phase, does some sweeping periodically, and maintains all of the state so that it can leave off and resume. So the program is never stopped for very long.
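The marking and sweeping phases can be sketched as a toy collector over an explicit object graph. This is illustrative Python with none of the incrementality of the real OCaml collector, and the dict-based "objects" are purely an assumption of the sketch:

```python
# Toy mark-and-sweep. Each "object" is a dict with a mark bit
# and a list of references to other objects.
def new_obj(heap):
    obj = {"marked": False, "refs": []}
    heap.append(obj)
    return obj

def mark(roots):
    stack = list(roots)                # the mark stack of work in progress
    while stack:
        obj = stack.pop()
        if not obj["marked"]:
            obj["marked"] = True
            stack.extend(obj["refs"])  # push this object's fields

def sweep(heap):
    live = [o for o in heap if o["marked"]]
    for o in live:
        o["marked"] = False            # reset mark bits for the next cycle
    return live                        # everything unmarked is reclaimed

heap = []
a, b, c = new_obj(heap), new_obj(heap), new_obj(heap)
a["refs"].append(b)                    # a -> b is reachable; c is garbage

mark([a])                              # a is a root (a "local variable")
heap = sweep(heap)
assert heap[0] is a and heap[1] is b   # c was never marked, so it was swept
```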
Right. And importantly, this matters a lot if you’re building various kinds of soft real-time applications, where your response time matters. The idea that you might need to scan your entire heap periodically is just really destructive to having any kind of predictable latency profile.
Sure. Yeah. I mean, it matters for soft real-time programs, but I think it even matters for any kind of program. I mean, if you have a heap that’s maybe a gigabyte or a couple of gigabytes, doing a full mark and sweep collection is going to take a second or two. Even a program that doesn’t have any soft, real-time constraint, anything even vaguely interactive will be unhappy or will cause unhappiness if it just stops for a second or two.
Any sort of graphical program or anything that’s going to respond to any sort of user requests is much better if it can divide up that work into short pauses of hundreds of microseconds or a few milliseconds. That’s the major collector. Then there’s an observation, a very old one, called the generational hypothesis. And it’s a remarkable fact about programming languages, in that it’s one of the few observations that’s been well confirmed across a huge range of completely different programming languages.
This is one that’s been shown to hold in functional languages and object-oriented languages and imperative languages. It seems to be remarkably robust to what sort of language or even program you’re dealing with. And it’s the observation that objects which have recently been allocated are likely to be the ones with the shortest lifetimes. That lots and lots of objects don’t live for very long.
They’re used just as temporary storage and then thrown away. If you’ve got a collection of objects, the ones that have been around the longest are the ones that are likely to stay around the longest. The ones that you’ve just allocated are likely to go away soon. So you can use this to massively speed up a garbage collector. What you do is you do what’s called a generational collector. You have a separate area for objects that have very recently been allocated.
You allocate new objects there using a very, very fast allocator, whose work is just a subtraction. It’s known as a bump-pointer allocator, which means there’s a pointer, and when you want to allocate five bytes, you subtract five from the pointer and just put all of the objects one by one in this contiguous region of memory. And then when that fills up, you do what’s called a minor garbage collection.
And you just go over the objects that are live in that region, which, because that region is relatively small, doesn’t take too long. And anything that’s still alive there, you move onto the major heap, or you promote it in some way: you do something more expensive to keep it alive. You copy the objects out of that region, or you move them to one end of that region, or you do something with them. But the point is that usually you can eliminate 90% of the allocations that way, sometimes even more.
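The bump-pointer scheme, and the wholesale reset that a minor collection allows, can be sketched in a few lines of Rust. This is an illustrative toy with invented names, not OCaml runtime code: real survivors would be promoted before the reset.

```rust
// Toy bump-pointer allocator: allocation is a single subtraction, and a
// "minor collection" makes the whole region reusable at once.
struct MinorHeap {
    region: Vec<u8>,
    next: usize, // bump pointer, moving downward like OCaml's
}

impl MinorHeap {
    fn new(size: usize) -> Self {
        MinorHeap { region: vec![0; size], next: size }
    }

    // Allocate `n` bytes by subtracting from the pointer. `None` means
    // the region is full: that is when a minor collection would run.
    fn alloc(&mut self, n: usize) -> Option<usize> {
        if self.next < n {
            None
        } else {
            self.next -= n;
            Some(self.next) // offset of the newly allocated object
        }
    }

    // After survivors have been promoted elsewhere, the whole region is
    // free again in O(1): no per-object deallocation work.
    fn minor_collect(&mut self) {
        self.next = self.region.len();
    }
}
```

The point of the sketch is the asymmetry: each allocation is one subtraction, and reclaiming every dead object in the region is a single pointer reset.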
Right. And a neat trick about the minor collector is that you only pay for allocation. For the things that aren’t promoted to the next stage, the deallocation is essentially free, because you only had to do work proportional to the number of things that were live.
Yes. And also you pay very, very little for allocation. A real allocator, like the major heap uses, or like all heap allocations in, say, C or Rust use, has to do a certain amount of work. It can’t just slap objects anywhere in memory. It needs to find somewhere where they fit, to avoid using too much memory, and it needs to make sure gaps in the heap are filled in with appropriate-sized objects.
There are reasonably efficient allocators, but there’s some actual work that has to be done there. One of the advantages of a generational collector is that you can defer that work until objects have survived at least one minor collection. So you use an extremely stupid allocator the first time around, and only if the objects survive a while is it worth actually trying to find them a permanent home in a fragmented heap. And that way you can avoid even using the major allocator in 90% of cases.
I guess one thing that’s important to mention about the OCaml collector is that it’s a single-core collector, or a single-threaded collector. You can’t have multiple threads concurrently allocating on the same heap and concurrently accessing data on the same heap. In OCaml we have a single lock that separates out different threads to make sure that they don’t interact. And one of the other pieces of work, which I don’t know how much we’ll get to, is that you did a lot of work on the multicore garbage collector, which is currently being very actively pushed on upstream and which people are trying to get out.
But within its scope as a single-core collector, OCaml’s collector is pretty well respected. It’s considered to be a nice balance: it’s simple and pretty easy to understand, and efficient, and all of that. And people say lots of nice things about it. And I used to believe those things too. And then recently, you worked along with a guy named Will Hasenplaugh on some changes to OCaml’s collector that made the marking phase two to three times faster, which I found very surprising, for a well-regarded, 20- or 25-year-old garbage collector to suddenly get so much better and have such low-hanging fruit. So what is the thing that you did to make it so much faster?
Yeah, it was surprising to us as well quite how much of a difference that change made. I think it might be best to start by explaining some stuff about how people write high-performance code. One of the first things that you learn when you get into writing really efficient programs is that following pointers is slow. There’s a cache, and there’s the whole memory system of the CPU, and it’s really good if you’re just streaming in a big, long array sequentially.
But if you’re chasing pointers randomly, it’s going to be much slower. And the reason is, if you’re just copying a load of memory or just adding up a really long array of numbers, then the CPU can stream that in from memory at, I think on Skylake processors, something like 12 gigabytes a second on one core, which is a lot of memory coming in.
So if your core is running at three gigahertz, that’s four bytes a cycle. If a cache line is 64 bytes, you’re getting a new cache line worth of memory every 16 cycles. They’re just constantly coming in. Whereas if, instead of a big, long array, you’re following a linked list, every time you have to get the next element of the list, you have to dereference a new pointer.
So each time you follow one of those, you have to make a new request to the memory system to look at a new place. Not somewhere it’s been streaming sequentially out of, but a new, different place. And that request can take something like 300 cycles. So instead of getting a new item every 16 cycles, you’re getting a new item every 300 cycles, and your program, even though it looks exactly the same as the array version (it’s going over a list and adding up numbers, just for a slightly different notion of list), is now dozens of times slower.
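The contrast being described is the same computation over two memory layouts. Here is a Rust sketch with invented helper names: both functions compute the same sum, but the array version streams sequentially while the list version must finish one dependent load before the next address is even known. (The toy sizes in a test won’t show the slowdown; on large heaps the list version is typically many times slower.)

```rust
// A classic pointer-chasing structure: each element lives behind a Box.
enum List {
    Cons(i64, Box<List>),
    Nil,
}

// Sequential access: the hardware prefetcher keeps cache lines coming.
fn sum_array(xs: &[i64]) -> i64 {
    xs.iter().sum()
}

// Pointer chasing: each step must dereference a pointer before the next
// element's address is known, so cache misses cannot overlap.
fn sum_list(mut l: &List) -> i64 {
    let mut total = 0;
    while let List::Cons(x, next) = l {
        total += x;
        l = &**next;
    }
    total
}

// Build a list with the same contents as the slice, in the same order.
fn build_list(xs: &[i64]) -> List {
    xs.iter().rev().fold(List::Nil, |acc, &x| List::Cons(x, Box::new(acc)))
}
```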
Right. I think the key insight here is to think about the things inside of your single computer as, in fact, a distributed system. The issue is that the memory is very far away. You have a lot of bandwidth, but the latency is high. So it hurts if you need to do a bunch of dependent fetches, where you get some data and make a decision, then get some other data and make another decision, and so on.
Just like you’re pinging some server far away and have all of these extra round trips, it doesn’t matter that you have 10 gigabits worth of bandwidth through the computer, you’re going to be kind of poking along request by request if the latency is long. And it’s essentially the same story here, except the long latencies are all kind of on the scale of what fits inside your very small computer.
Yeah, exactly that. So the marking phase of a garbage collector is kind of the maximally bad example for this. It’s a thing that has to chase every single pointer in the system. Every time it finds a new object, it has to dereference every single pointer that it sees. And they all point to random other locations on the heap. So it spends most of its time just making these requests one at a time: wait to see what that comes back with, oh, that was another thing I have to follow, okay, follow some of those references, wait to see what those come back with.
It’s like a garbage collector is a machine designed for creating cache misses.
Yeah, exactly so. Just sitting there doing cache misses all day. And so at some fundamental level, it kind of has to do all of these cache misses. I mean, we’re trying to mark a chunk of a heap. The heap might be gigabytes large, the cache or at least the L1 level of the cache is 32 kilobytes large. It doesn’t fit, and it doesn’t fit by a factor of tens of thousands. But the trouble is that the garbage collector is doing those cache misses one at a time, it’s waiting for the answers to this one before it even begins processing the next one.
Yeah, that totally makes sense.
So it turns out that the current crop of processors can have maybe 10 or so cache misses active in parallel. It’s still that model of sending requests to a faraway server that take a while to get back, and there’s no really good way of making them come back any faster, but you don’t have to wait for the answer to the first one before you send the next. You can have about 10 of them in flight, or more.
I think the new Apple chips, or so I’ve heard, can handle about 20 of them in flight at once. So there’s quite a lot of width here. So the change that we made to the garbage collector is to use what are called prefetching instructions. Those are instructions where you say in advance, “Please start loading this piece of memory,” with the idea being that by the time you actually get around to using it, it’s already made its way into cache.
The observation that makes this work is that in the marking phase of a garbage collector, your job is to mark everything that’s reachable through any of the references. But you don’t actually care what order you mark them in. It doesn’t really matter. Instead of marking them just one at a time, what we did is we put a small queue in front of marking.
So every time we discover a new object that needs marking, instead of immediately walking over its fields, we issue a prefetch and add it to this small queue, and then work on the next thing out of that queue instead. And by the time we actually get round to traversing the fields of an object, we prefetched it maybe 10 objects ago. So firstly, there’s a good chance that the data we’re looking for has made it into cache at this point. But more importantly, while it was sitting in that queue, it was 10th in line: there were 10 other things being prefetched at the same time.
We get to overlap 10 requests, or 20 or 30, or however many requests the hardware can handle, so we end up making much better use of the hardware’s resources. In that example, each individual memory request that’s not in cache still has to wait the full 300 cycles. But if we can get 10 of them going at the same time, then we can be hitting a new cache line every 30 cycles. That’s not as good as getting one every 16 cycles, but it’s close. You’re actually able to get a reasonable fraction of the raw memory bandwidth of the machine while traversing randomly over this huge, one-gigabyte heap. It worked out very neatly.
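The queue-in-front-of-marking idea can be sketched roughly as follows. This is a simplified Rust model, not the actual OCaml runtime code: heap objects are arena entries, and `prefetch` is a stand-in comment for a real hardware prefetch hint.

```rust
use std::collections::VecDeque;

// Simplified model of prefetch-buffered marking. Each object's `fields`
// are indices of other objects in the arena.
struct Obj {
    fields: Vec<usize>,
}

// Stand-in for a hardware prefetch hint; a real collector would issue a
// prefetch instruction here so the object's cache line starts loading.
#[inline]
fn prefetch(_obj: &Obj) {}

// On discovery, mark and prefetch the object, but defer traversing it.
fn enqueue(heap: &[Obj], marked: &mut [bool], queue: &mut VecDeque<usize>, idx: usize) {
    if !marked[idx] {
        marked[idx] = true;
        prefetch(&heap[idx]); // start the memory request now...
        queue.push_back(idx); // ...and traverse the object only later
    }
}

// Mark everything reachable from `roots`. Because newly discovered
// objects sit in the queue before being traversed, several memory
// requests can be in flight at once instead of one at a time.
fn mark(heap: &[Obj], roots: &[usize]) -> Vec<bool> {
    let mut marked = vec![false; heap.len()];
    let mut queue = VecDeque::new();
    for &r in roots {
        enqueue(heap, &mut marked, &mut queue, r);
    }
    while let Some(idx) = queue.pop_front() {
        for &f in &heap[idx].fields {
            enqueue(heap, &mut marked, &mut queue, f);
        }
    }
    marked
}
```

This works precisely because, as described, marking doesn’t care about traversal order: any order that visits everything reachable is correct.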
Yeah. In fact, one of the things that surprised me about this whole story is how surprised we were. As you point out, the idea isn’t new. There are papers about using prefetching to make garbage collectors faster, and they report speedups, but none of them are as good as this. I’m curious if you have any insight into it. What happened? Is it that the OCaml garbage collector is somehow different, or have machines just evolved since those papers were written, so that the constants are different and the value of making this kind of change is just higher than it used to be?
So yeah, there’s the memory-level parallelism. That number that I’ve been saying is around 10, of how many things you can have going at the same time, used to be smaller, which meant there was a smaller benefit from being able to do these tricks. As well as that, there’s the fact that the OCaml garbage collector has to touch very little memory other than the stuff that’s on the heap.
The thing that all garbage collectors have to do is tell whether a piece of memory contains a pointer, because they need to traverse all the pointers, and they need to not traverse the things which are not pointers. In OCaml that’s done using what’s called tagging. Of each eight-byte word of memory on the heap, the ones that end in a one are integers, and the ones that end in a zero are pointers.
So it’s very easy to tell in the garbage collector, as you’re walking along, which is which. That’s not the case for all garbage collectors. In a lot of other garbage collectors, there’s a separate data structure off to one side that you have to look up with some tag from each object to work out its layout, to decide where the pointers are.
So, I mean, this is fairly speculative, I haven’t really dug into many other implementations of prefetching, but that’s one reason why it’s perhaps easier to get good results out of OCaml, because there are just very few memory accesses to be done in the first place. There are only the memory accesses to the actual data. There are no, like, off-to-one-side table look-ups to worry about.
So in a world where we have a lot more parallel memory-fetching hardware than we used to, are we under-using software prefetching in general? One of the nice things about using software prefetch instructions in a garbage collector is it’s a relatively simple story.
The garbage collector is running, while it’s running, nothing else is running. It’s kind of taken over your program. You don’t have to worry about competing with resources for anything else. Do you think we should be surfacing software prefetching more directly to programmers in a more general context and one that’s not quite as simple as the garbage collector case, where people are doing all sorts of ad hoc data structures?
Yeah. I think that’s probably worthwhile. It’s a little bit difficult though. In fact, I think Sam Ainsworth and Timothy Jones at Cambridge have a couple of papers on actually doing this in OCaml, about exposing some prefetching instructions and using them to try and speed things up. In general though, it’s pretty difficult because to get good performance out of prefetching, you need to get the timing just right, you need to prefetch something so that you’re going to need to access it soon.
If you need to access it immediately, then there was no point in prefetching, because by the time you end up using the data, the prefetch hasn’t completed. And if you end up accessing it quite a while later, then your prefetch gets fetched all the way into cache, but then it gets flushed out of cache because something else comes in before you ever get around to using it.
The marking phase of a garbage collector is maybe an easy case for that sort of work, because it genuinely doesn’t care what order it traverses things in. Adding this little ring buffer to make sure everything gets traversed 10 units of time after it was prefetched, or 100, or whatever the numbers are, was a relatively straightforward change.
Whereas adding prefetches in ways that are definitely going to be useful is a bit trickier in other contexts. An example of a case where you might want to do it is in the List.map function, a very commonly used function in OCaml, which applies some transformation to every element of a list. It needs to traverse the list, and lists in OCaml are linked lists. So a nice thing that you might try to do is, when you’re about to access this element of the list, and you’re about to run the user function on this element of the list, prefetch the next element.
But the trouble is, if the function that you’re mapping, the user function that you’re calling, is very simple, if it’s just adding one or something, then the prefetch won’t have any real benefit, because it won’t actually be issued very far ahead of time. The actual access to the list will happen before the prefetch has completed. And if the function is very complicated and very slow, then the prefetching will actually be harmful, because you’ll have pushed something out of cache early and taken up space, but the prefetched data won’t still be in cache by the time you get to the stuff that you need.
And if the thing in the list is itself a complicated linked structure, then merely having prefetched the next linked-list node doesn’t mean you’ve prefetched its contents. So it might not actually be enough to speed things up. There are definitely gains to be had in exposing it to programmers, but it’s quite tricky to do in a modular, general way. You’d really like it if someone who understood memory systems could put the prefetching stuff into the commonly used libraries, and then everyone’s code would go faster. But unfortunately, it doesn’t really seem to work like that. It depends on very subtle details of exactly how fast everything is and exactly how much memory it uses.
Right. There’s essentially a lack of composability in the performance semantics. You can go and optimize some data structure, and it may work well in some context where you use it, and then in some other context, where you’re iterating over it in different ways and doing different things in between each step where you touch the data structure, that can just totally change what’s happening. So even if you were an expert trying to design the ideal data structure, it’s just not clear that such a thing exists, because you don’t have enough knowledge to make the right choices, given that it needs to be general-purpose and be used in lots of different contexts.
Yeah. And unfortunately, it’s also an area where the performance of a lot of these things doesn’t degrade smoothly as things get worse. Hitting L1 is three cycles; missing L1 but getting the next best thing, L2, is, I don’t know, 40 cycles. There’s a sudden big jump, and then hitting the last-level cache is, I don’t know, 60.
And then missing that is 300. So if you get some of this and it’s nearly right, and your program changes in a way that makes it just very slightly slower or very slightly bigger, then suddenly that can have a five or 10 times difference in the performance. Because the thing that was reliably hitting in cache was no longer hitting in cache or the prefetch that was reliably there is now getting evicted.
So the garbage collector work you’ve done is about making the automatic garbage collection story more efficient by just going and improving the actual collector itself, which is great stuff. And in fact it had shockingly good effects on a bunch of different programs that we run in real life. So that’s great.
But that’s not the only story, right? The other thing we’ve been talking about is giving programmers more control over memory. And some of the work that you’ve been doing is about that. And one interesting one is adding what are called local types, essentially adding safe stack allocation to OCaml. Can you give a summary of what you think the benefit of this is? What’s the point of adding safe stack allocation to a language like OCaml?
Right. A lot of functions use temporary data. My go-to example from earlier of a buffer that you’re using to load the file before you parse its contents, or just any small data structure that’s just used to pass options around or something. And these data structures are created, used a couple of times, and then immediately destroyed, or should be immediately destroyed. You know that five lines of code later, nothing will ever use them again.
But in OCaml at the moment, those end up being heap allocations. These small little buffers, or the option or list cells that you might use to pass arguments to a function or pass optional arguments, all of these end up being small allocations. They’re allocations on the minor heap, which is relatively cheap, but they do consume some space on the minor heap, they make the next minor collection happen a bit sooner, and they have a non-zero cost.
It would be nice if these things had a lower cost. If you do a stack allocation in C or C++ or Rust, it has a cost about the same as minor heap allocation in OCaml: there’s just a subtraction from a pointer that’s in a register. But when you’re done with it, you get to undo that stack allocation by just undoing that subtraction, rather than having to wait for the next garbage collection. You don’t contribute towards the next garbage collection pause, you don’t add any GC work, and more importantly, you get to reuse that exact bit of memory, which is definitely in L1 cache, because you’ve just used it. You get to reuse those bytes for the next temporary allocation that you do.
Yeah. I was going to poke on exactly this question, which is if minor allocation is so cheap, doesn’t that get rid of a lot of the benefits of stack allocation? And the story you’re telling about cache provides a good answer. In fact, a thing that we’ve noticed experimentally is that there are two incompatible ways you can tweak the GC to make your programs run faster. First, you can make your program faster by making the minor heap bigger because making it bigger gives you more time to discover that things that were recently allocated are, in fact, going to become unreachable soon.
And so never need to be promoted to the major heap and you can avoid all the work associated with that. And then the other totally incompatible trick is to make the minor heap smaller. In particular, a tightly written program that doesn’t do that much minor allocation can run faster with a smaller minor heap, because that reduces the number of cache lines that are touched as the allocator strides over the minor heap. And so stack allocation, I guess, gives you the best of both worlds. You get better cache behavior, but without creating a lot of unnecessary promotions.
Yeah. With local stack allocations, you actually get both of those advantages. As you say, you get the advantage of reusing that same cache memory, as though your minor heap was much smaller, because it’s all reusing the same bit of stack memory. But also, for all of the allocations that don’t go through this local mechanism, the ones that are still happening on the minor heap and still being minor-collected, there are a lot fewer of those. So if there are half as many of those, it’s as though you had twice as big a minor heap, and you get the same promotion-avoiding benefits.
Okay. So stack allocation is a thing that exists in languages previous to OCaml. Can you describe how the thing you are currently designing for OCaml differs from what shows up in other places? I guess Rust is one that I know about.
Two languages that make heavy use of stack allocation are Rust and Go. The comparison to Rust makes a bit more sense, because the stack allocation in Go is for allocations that the compiler can see definitely don’t escape. There’s no particularly good way of expressing in a function’s interface type that a certain value isn’t going to escape, so if you’re passing things around to other functions, it’s difficult for those to be stack-allocated. It’s kind of a completely best-effort, invisible thing.
We’d really like to be able to express some of this stuff in types, so that programmers can know that something is or isn’t going to be stack-allocated. The main challenge is to do this in a way that’s safe, so that the compiler can guarantee at compile time that memory isn’t going to be accessed after it’s been freed. That means you need to ensure that no references to a locally allocated value are used after it’s gone out of scope.
The tricky case there is about when you locally allocate something like a data structure to describe how to initialize something, and you pass it to a function. So in say, C, you could allocate that data structure locally on the stack, pass a reference to it to a function, then that function could say, “Ooh, a pointer to something.” And they can put that in a global variable or squirrel it away in some cache or hide it somewhere in your program.
Then later on it returns and you’ve got this dangling reference to something which you later access and everything explodes or worse. So the challenge is to allow programmers to write that kind of program where you are really passing references to locally allocated data around to other functions and have that somehow be safe. Definitely the most well-known way of doing that is the one that Rust has.
So in Rust, there are these new entities at the type level called lifetimes. They’re like variables in the type system, and there’s a lifetime variable associated with each reference type. Instead of saying, I have a reference to a tuple, you say, I have a reference to a tuple with a particular lifetime alpha. And then if you try to use a reference where you try to make something point to something else, you can only do so if the lifetimes are compatible.
So if you try to store a reference to this thing with lifetime alpha into the thing with lifetime beta, then you must know that alpha outlives beta, so that you’ll never end up with a dangling reference. That means that if you’ve got this idea of a function that accepts some locally allocated data, then that ends up being expressed as a function that’s polymorphic in lifetime. If you’ve got something which accepts as a parameter a locally allocated tuple, then the type of that in Rust is something like: for all lifetimes alpha, it accepts a tuple of lifetime alpha.
And then that’s enough to know that the function’s not going to store that in a global variable, because the global variable has lifetime static, and if it did that, then it wouldn’t be able to accept a tuple of lifetime alpha; it would have to accept a tuple of lifetime static. The fact that this function is able to compile with the type “for all alpha” means that it genuinely works for any possible lifetime. Callers of that function can make a tuple on the stack, so it has a particular lifetime representing its scope, and can then pass it to this function that works for all lifetimes. And since it works for any lifetime, it will work for this one too.
And since it works for any lifetime, it can’t possibly leak a reference to that tuple into any other scope than the one that came in. So that works very neatly. And you can use these lifetime variables to express various complicated relationships between lifetimes, but there is a tricky part to it from a type-system point of view. It’s that this function, which just takes a tuple, now has a lifetime variable in it, and is in fact now a polymorphic function.
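The pattern being described looks roughly like this in Rust (the function names here are hypothetical, for illustration):

```rust
// Polymorphic in the lifetime 'a: because it must work for every
// possible lifetime, it cannot leak the reference anywhere that
// outlives the call.
fn sum_pair<'a>(p: &'a (i32, i32)) -> i32 {
    p.0 + p.1
}

// By contrast, a function demanding 'static could stash its argument
// in a global, so it can never accept a reference to stack data.
#[allow(dead_code)]
fn stash(_p: &'static (i32, i32)) {}

fn caller() -> i32 {
    let pair = (1, 2); // allocated in this scope, on the stack
    let s = sum_pair(&pair); // OK: works for any lifetime, so this one too
    // stash(&pair); // rejected at compile time: `pair` does not live long enough
    s
}
```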
So if you find yourself writing higher-order code, that is, functions which take functions as parameters, which you find yourself doing in OCaml quite a lot, then if your higher-order function is allocating some local tuple and giving it to a callback that it accepted, the type of that callback needs to be polymorphic. It needs to work for any alpha. It actually needs to be “for all alpha”, because there’s no way of even naming the alpha that it’s supposed to work with from outside the function.
It’s some anonymous alpha representing a particular lifetime that isn’t in scope. You need a surprising amount of type-level machinery to get this to work. There’s a feature in Rust for exactly this purpose called higher-rank trait bounds, which basically allows you to say: I am accepting a parameter, and it’s not that I am polymorphic in the lifetime, it’s that I require my parameter to be polymorphic in the lifetime, so that I can make up a particular lifetime, like the scope from this line to that line, and then use this callback function that I took as a parameter with that lifetime.
It’s fairly well-studied machinery, but it is fairly heavyweight machinery. And in particular, it’s a fact about type systems that once you have that, type inference more or less stops working. In OCaml, even higher-order functions can have their types inferred; you can just write an implementation of List.map or something, and OCaml will say, “Yep, that’s got this type.” Whereas if you’re doing it with these higher-rank types, which are the thing that you need in Rust to do these sorts of callbacks, you really have to explicitly write out the types of all of those higher-order functions, because there’s no way the compiler can figure them out by itself.
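A minimal Rust sketch of the higher-rank case (hypothetical names): the higher-order function allocates the tuple in a scope internal to itself, so its callback parameter must be polymorphic over all lifetimes, which is what the `for<'a>` bound says.

```rust
// The callback's type is higher-rank: `for<'a>` requires the closure to
// work for every lifetime, because the lifetime it is actually used at
// is a scope inside `with_pair` that the caller cannot even name.
fn with_pair<F>(f: F) -> i32
where
    F: for<'a> Fn(&'a (i32, i32)) -> i32,
{
    let pair = (10, 20); // locally allocated inside this call
    f(&pair) // the callback may read it, but cannot keep the reference
}
```

For example, `with_pair(|p| p.0 + p.1)` is accepted, while a closure that tried to smuggle the reference out into longer-lived storage would fail to satisfy the `for<'a>` bound.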
There’s a lot of heavy machinery here, which means there’s work to do for the person implementing the type system and figuring it all out. But worse than that, it leaks into the experience of the programmer. So one thing is that it requires a lot of annotations. Is it also the case that the annotations are themselves subtle, in terms of figuring out what the right ones are? Are these easy annotations to write or hard annotations to write?
In principle, they’re relatively tricky. You’re dealing with rank-two types. These are fairly advanced things. On the other hand, the Rust compiler is very good at guiding you down these paths. It says, “Ah, you’re trying to do something that needs this feature. Copy-paste this piece of code into the type and it’ll probably work.” Because to be honest, in Rust, it mostly just comes up in this particular context.
I’m certainly not saying that I think Rust has made a bad choice here, the Rust design is quite comfortable with a fairly heavy amount of annotations. You end up writing types a lot. So writing these annotations maybe isn’t the end of the world. It’s certainly not appropriate for OCaml, where programmers are used to a compiler that knows an awful lot more about a program, where you can just start mashing out code and have the compiler tell you whether there are type errors in it without having given it very many hints about what it is that you’re trying to do.
Which is really a rather nice feature.
It is. I do like it. So to be able to keep some amount of that, we’ve gone for a different design in the local-types work. Instead of having lifetime variables, we have natively this concept of a function that does not capture its argument. It’s a function whose argument definitely won’t be squirreled away into a global variable or some cache or some memoized structure or anything. It’s definitely not going to capture its argument.
That means if you have one of those functions, you can locally allocate the tuple or whatever and pass it directly to a function like that. The user experience in this case is pretty similar, but importantly, there are no variables and no polymorphism going on at the type level. So the type inference story works out a lot more simply. There is a trade-off here in that it’s a lot less flexible than Rust.
It more or less only covers this case. Rust’s lifetime variables can be used for lots more stuff than just saying a function that accepts stack-allocated arguments. You can have a function with multiple lifetime parameters, and you can say it takes these two arguments of this lifetime, this argument is a tuple where one half is this lifetime, the other half is that lifetime, it returns something which has a third lifetime, which is related to these two in some complicated way. And you can have really precise specifications of lifetime behavior, which are basically impossible with our system for OCaml.
But on the other hand, in the common case of passing a reference to a stack-allocated thing to a function, you don’t have to write any annotations, and the types are much simpler. I think this is kind of an interesting trade-off between the two. A lot of it is driven by the fact that in Rust, this memory management system is the only one there is, and it has to work for everything. All memory goes through the lifetime system in Rust, whereas in OCaml, when we get to the complicated cases, we haven’t thrown away the garbage collector. We can still use it for anything difficult.
Right. Yeah. It’s a very different set of ergonomic choices. In many ways, the way I think about how I want OCaml to evolve over, I don’t know, the next five to 10 years, is kind of exactly this: something that is simple and ergonomic to use by default, but where, when you need precise control, you have some ways of getting more of that precise control. And maybe with trade-offs, where it’s not quite as much precise control as you could get in the most advanced system, but aiming for something that is simple and easy for people to use most of the time.
That’s definitely my goal as well for this kind of thing. It can often be a really tricky needle to thread, because a language that is mostly easy, where you occasionally have precise control over things, is actually a much more difficult design than one which both allows and requires precise control of everything all the time. The interface between the bits where it’s automatic and the bits where it’s manual can be quite a tricky thing to work out.
So one tricky aspect of stack allocation has to do with things like smart constructors, right? Which is to say, some of the time you want to allocate something in basically a stack-like discipline, but you’d also like some abstraction capabilities. I’d like to not have to inline, in my current function, all the work of doing the allocation. I’d like to write some function that wraps up some pattern that allocates something for me, and then I can use it following a stack discipline. How does that fit into this design?
That fits in fairly neatly, actually. There’s a lot of difficulty in returning a stack-allocated thing if you take the Algol, C, Rust view of stacks, where there is the stack, each function call pushes a new frame onto the stack, and that frame is destroyed when the function returns. The trouble is, it gives you very few obvious places for the return value of the function to go.
If you’re just returning an integer or something that fits in a register, then it’s easy enough. You pick some register as the return value. If it’s larger than that, it’s quite tricky. You can reserve some space in the caller’s side, but then the caller needs to know exactly how big that is. So in particular, the callee can’t decide itself how large an object to return. And if the callee is allocating that object on its own stack, you may need to copy these things or move them around or something like that.
What we’ve actually gone for is instead something closer to how Forth does stacks, which never really made any sense to me until I started working on this project, which is that there’s a separate code stack and data stack. The region of memory for calls and returns, so the activation records of functions, is slightly different from the region of memory where the local allocations take place.
And they line up most of the time. So normally a function does local allocations, pushes some stuff onto its data stack, and those local allocations are cleared out at the time it returns. But it means that you can allow a certain amount of slip between them to handle these smart constructors. A smart constructor in this model is just a function which allocates things on the local stack, which it then leaves there as it returns. The return has to pop a return address, but it pops the return address off the code stack, the system stack, and leaves the local allocations where they are.
And they show up in exactly the same place in memory as if the caller had done that allocation. You don’t need to do this little awkward dance where you copy them from one place to another, just to get them to the other side of the return address slot. I think that was quite a tricky problem before we settled on the idea of just having a slightly different region of memory. It also makes the implementation a lot simpler, because the binary format of it is more or less the same as the minor heap, so much of the garbage collector support is already there.
And do programmers have to think explicitly about these stack frames that are going on this other stack? When we write programs with functions, basically the structure of the function stack falls naturally out of the structure of the function calls. Is the structure of the data stack something that the programmer has to think about explicitly, or does that fall out of some other part of the structure of the code?
The programmer has to think about it explicitly when they’re trying to do something other than make it follow the structure of the function calls. By default, it just follows the structure of the function calls. But if you’re writing one of these smart-constructor things that allocates something for the caller to deallocate, then those require an annotation. You need to say, “In those functions, I am actually allocating something that I expect to outlive this function frame and it’s the caller’s problem to clean up.”
Got it. So you have to say that a given function is going to return something on the parent stack, and then beyond that you just use the functions and that’s that. You don’t have to explicitly say, “And now I’m going to open a new stack frame.” So you’re not explicitly in the code manipulating stack frames.
Yeah, that’s correct. We did at one point think, maybe we could just infer all of this and have the compiler decide when things go on the stack. We realized shortly afterwards that if we left it entirely up to the compiler, it would put everything on the stack, which sounds good in theory, but really isn’t, because things on the stack don’t get deallocated for a long time. So if every single allocation goes there, nothing ever gets deallocated and there’s just a massive stack all the time.
Yeah, that doesn’t sound good.
So you really do want stack frames to end occasionally. The current setup is that they end when things go out of scope at the end of the function, unless you explicitly tell it otherwise.
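To make that annotation concrete, here is a rough sketch in the style of the local-allocation extensions Jane Street has been prototyping. The `exclave_` keyword and the exact semantics shown are illustrative assumptions about the prototype syntax, not settled language:

```ocaml
type interval = { lo : int; hi : int }

(* A smart constructor that allocates on the *caller's* portion of
   the data stack. The (hypothetical) [exclave_] annotation says:
   this allocation outlives my frame; the caller cleans it up. *)
let make_interval lo hi = exclave_ { lo; hi }

let width () =
  (* The caller uses it like any ordinary value; the record is
     popped off the data stack when [width]'s own frame ends. *)
  let iv = make_interval 3 7 in
  iv.hi - iv.lo
```

The point is that `make_interval` never forces a copy across the return: its allocation lands exactly where it would have if the caller had written the record literal inline.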
So all of this stuff about local types, that’s all about making stack allocation better and more available for more things. Let’s talk about making the heap better. When I think about what’s wrong with allocations on the OCaml heap, I think of there as being two problems. One is that the data is too scattered, right? There are too many pointers to chase. It’s not rectangular enough in other words. And the other is that the data is too big.
And I think of these as both problems having to do with OCaml’s super nice and simple data representation. Which is to say, OCaml values are one of two things. They’re either a thing that fits inside of a single machine word, call those immediates, or they’re heap-allocated values, right? And heap-allocated values all look the same. There are some number of words; the first word is some header stuff that tells you what kind of thing it is, and then for everything that you put in, there’s one full word.
So would you like to represent a character? Great. Here’s your machine word of eight bytes to represent that one character. Would you like to represent two characters? Great. Here’s eight bytes. Anything you want, it fits into the same size slot. So there’s a lot of uniformity. OCaml, under cover of darkness, turns out to be really just like a Scheme implementation in terms of its memory representation.
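You can actually observe this uniformity from inside the language with the unsafe, introspective `Obj` module: every value is either an immediate or a pointer to a header-prefixed block. This is purely illustrative; `Obj` is not something to use in production code:

```ocaml
(* Immediates: ints, chars, and constant constructors all fit in
   one tagged machine word. *)
let () = assert (Obj.is_int (Obj.repr 42))
let () = assert (Obj.is_int (Obj.repr 'c'))
let () = assert (Obj.is_int (Obj.repr None))

(* Boxed values: a pointer to a header-prefixed heap block. *)
let () = assert (Obj.is_block (Obj.repr "hi"))
let () = assert (Obj.is_block (Obj.repr (1, 2)))

(* [Obj.size] reports the number of fields after the header: a
   pair occupies two full words, whatever its component types. *)
let () = assert (Obj.size (Obj.repr (1, 2)) = 2)
```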
So another thing that we’ve been doing some thinking about is some ways of changing this, which is a system that you’ve been working on called unboxed types. And I’m wondering if you can give us a tour of how unboxed types kind of steps away from this very simple memory model that OCaml has to give the programmer more control over how memory’s laid out.
Right. I mean, there are a few things that you really want to be able to do, which, as you just mentioned, involve avoiding all of these pointers, all of these indirections, all of these separate allocations that OCaml currently forces you to do. Some things that we’d really like to be able to do in OCaml are C-style: in-lining one struct into another.
You can have one record type include that record in another record type, and have that be as though you had manually copy-pasted the fields in, not a separate allocation and a pointer between them. Or do the same, particularly, with primitive integer types. Like, Int32 in OCaml has a somewhat hilarious representation. A 32-bit integer is currently represented as a 64-bit pointer to a sequence of three 64-bit words: there’s a header, there’s the fact that it is a 32-bit integer, and then there’s the 32-bit integer expanded to fill 64 bits.
And there are good reasons why it’s like that from the point of view of keeping everything uniform with respect to the rest of the system. But none of those reasons are very satisfying if you are writing an application that’s primarily processing 32-bit integers and has to use this bizarrely verbose representation for them.
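To make the cost concrete, here is a rough before-and-after sketch. The `#int32` notation for an unboxed 32-bit integer is hypothetical prototype syntax, not something today's compiler accepts:

```ocaml
(* Today: each field is a pointer to a separately boxed int32, so a
   single [point_boxed] value involves three heap blocks. *)
type point_boxed = { x : int32; y : int32 }

(* With unboxed types (hypothetical syntax): the 32-bit payloads
   are stored inline, so a [point] is one flat block of data with
   no interior pointers. *)
type point = { x : #int32; y : #int32 }
```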
So we’d really like some of these things to be able to be represented natively, both to make the heap structures more efficient, so that you can have a pointer to a heap structure which is in fact a flat sequence of bits rather than a nest of linked structures, but also to make some of the heap structures just go away entirely. At the moment, if you want to avoid a heap allocation, and you’re trying to pass two values to an OCaml function, you need to make a tuple on the heap to pass them. Or you can pass two arguments and have two separate variable names. There’s no way you can write one name and have that one name represent two things without it being a heap allocation.
Right. And this, I guess, highlights a general phenomenon in OCaml and I think in other languages, where the abstractions that you use, which you almost want to think about as, like, step back, how am I abstractly structuring my code, the abstractions call out very particular performance choices. And as relevant in this conversation, very particular memory representations. And so you need to make some real changes in order to give yourself the power to have different memory representations than the ones that are there by default.
Yeah. And it’s that abstraction which makes this tricky. And some of those things could be done fairly straightforwardly. In-lining one record into another is a fairly straightforward transformation that the compiler could do if both of those records had visible type definitions. Or if you could see that this type is Int32, you could use an efficient representation for that pretty straightforwardly.
But things like passing two arguments to a function and having those consume two registers instead of one, that kind of flies in the face of a lot of how polymorphism works in OCaml. Take something like List.map, which is supposed to work on any function. The actual implementation of it unpacks a list whose element type it does not know, plucks an element that it knows nothing about off the list, and passes that to a function, not knowing anything about what that function does, only knowing that the element type of the list happens to be the same type as the argument type of the function.
And that works in OCaml without generating a million different copies of List.map for every time it’s used. Sometimes it gets in-lined but there is a generic case. There is one map that works for any possible function or any possible list type. And the only reason that can be made to work is because OCaml knows that an arbitrary argument to an arbitrary function can in fact be represented in a single integer register.
And that’s of course only true because of the inefficient representation that it uses for various things. Things that really should be represented as a floating point register or a pair of integer registers, are instead represented in a single integer register by making a heap allocation, passing around a pointer to that.
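You can see this in the definition of `List.map` itself: nothing in its body depends on what the element types are, which only works because every OCaml value, whatever its type, occupies exactly one word-sized register. A simplified, non-tail-recursive version, close to the standard library's:

```ocaml
(* One compiled copy of this function serves every element type. *)
let rec map f = function
  | [] -> []
  | x :: rest -> f x :: map f rest

(* The same machine code runs for all of these calls: *)
let _ = map (fun x -> x + 1) [ 1; 2; 3 ]
let _ = map String.length [ "a"; "bc" ]
```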
So you’re pointing at this kind of notion of polymorphism as the way of dealing with it. To what degree is this about the compilation strategy, right? Part of what’s going on is OCaml supports separate compilation. You can compile your individual file separately. And when you compile List.map, there really is just one implementation of List.map, which is then used in lots of places, but that’s not the only way to compile.
You could do a whole program compiler and optimizer, where it kind of gets the whole program together, in some sense, before it compiles. And it can see the concrete types at which List.map is used and maybe specialize the behavior for those types. So I’m wondering to what degree the problem with polymorphism is fundamental and to what degree it is a kind of result of the engineering choice about how the compiler works on large code bases.
Right. There have definitely been languages that work like that, as you’re saying, where for things like List.map there is no possible generic implementation that works in all cases, and it has to be specialized for every different type with which it’s used. For functions like List.map, that works pretty well. It gets a little bit trickier when you start dealing with some more advanced forms of polymorphism.
OCaml supports some things like, it supports polymorphic record fields and first-class modules and various other things where you can package up this concept of a polymorphic function, and pass it around as a value or store it in a hash table or do stuff with it. And then pluck it out and call it with different types. So it’s not enough to specialize each call site because sometimes you don’t even know what function it is that you’re calling at the time that you’re specializing. You would need to somehow package up this representation of every possible type or every possible type that it could be called at and pass those things around.
There are certainly language designs that don’t have those more advanced polymorphism features, or restrict them in some way, and work by specializing all of the polymorphic functions at each use, a technique called monomorphization. That works pretty well. It generates fast code. It also generates quite large code, because you’ve got all of these duplicated copies of slightly similar things. And it can take a long time to compile, because it’s not enough to just point at the generic implementation. A specialized List.map can be good for performance, but when you just want a fast build, it’s nice to be able to just point at the generic implementation of it.
Right. And separate compilation is really nice, right? Having an incremental build where you build your program bit by bit, rather than having to do the whole thing in one big go is from a kind of straightforward developer-experience point of view, really rather a big win.
Yeah. Another point in the spectrum of techniques you can use here is what the .NET generics system does, which is essentially monomorphization, but happening at run time, because there’s a JIT compiler. There’s List.map, and there’s this thing that represents the generic List.map. And every time you call it, it says, “Ah, what type are you calling this at? I haven’t actually compiled a version of List.map at that type yet, let me go off and compile one of those for you.” And then you run that.
That works quite neatly. It fixes the code size problem or somewhat fixes it. You still end up with many copies of List.map, but you end up with only the ones that you use rather than every possible one. You do end up with some disadvantages that are hard to get around that are fundamental to JITs in that sometimes you will hit an uncompiled case and your program will stop and it enters JIT-compiling mode for a while.
It’s a new kind of pause, not only the garbage collector, but the compiler itself can now interrupt your program.
Yeah. So this wide spectrum of approaches here, the one that OCaml uses, everything being uniform, is definitely not the only one. I will say that it’s not obviously worse than any of the others. There’s no clear winner here. They’re all quite tricky in various edge cases, especially when you have advanced modular features.
All right. So we’ve gotten a little far afield. Let’s get back to the unboxed types question.
Yeah. Okay. So how does this actually work in the unboxed-types view of the world? The design we were going for there was to try and keep the nice properties of separate compilation, all of the benefits that OCaml has from, as you say, secretly being a Lisp machine, where there is a generic implementation of these functions. And the thing that we’re missing from the type system is the ability to express that this is a generic polymorphic function, but it’s not actually going to work for all types.
It’s going to work for those types that fit in a single integer register and don’t require garbage collection support or those types which contain only floating point data or somewhere between saying, this is a function with concrete types that works only in this case. And this is a function that works for every possible type you could ever write, which are the only two things you have in OCaml at the moment.
But being able to say this works for all of the types that have this layout. And so we’ve prototyped some of this. It works quite neatly for some cases where you can implement stuff like data structures, which are generic and parameterized by a type, but are parameterized only by those types that don’t require any garbage collector support. So you never have to pay the cost of garbage collector barriers, even though you are in generic code.
Right. And just to kind of tie this to some broader nomenclature, this notion of a type of type is what’s typically called a kind, right? What you’re essentially adding is a kind system to OCaml for tracking the fact that different data we have might have different layouts and the regular polymorphic functions work over now, one of the various layouts, and we can have different layouts. And we can still write functions that are polymorphic, they’re just polymorphic for a given fixed layout of data.
Yeah. And there is in fact one particular layout called value, which represents OCaml’s current notion of polymorphism. In the unboxed-types view of the world, a function which is polymorphic over all types of layout value is what OCaml today would call just polymorphic. All of the other layouts, all of the other kinds, are the new additions, where you can say it’s polymorphic over things which are unboxed tuples of a certain width, or things which are immediate values, never containing pointers, and so on.
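In the prototype syntax Jane Street has been experimenting with, layouts appear as annotations on type variables. The keyword choices below are illustrative assumptions and may differ from any shipped version:

```ocaml
(* Ordinary polymorphism: 'a has the default layout [value], which
   is what OCaml today means by "any type". *)
val length : 'a list -> int

(* Layout-restricted polymorphism (hypothetical syntax): this
   container works for any 'a whose layout is [immediate], i.e. a
   non-pointer machine word, so its implementation never needs to
   run GC write barriers. *)
type ('a : immediate) dense_array
val get : 'a dense_array -> int -> 'a
```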
Got it. So the new abstractions that we’re presenting people with are these different data layouts. And some of them are about atomic bits of data. I can say I have an atomic piece of data and I can say how wide it is. This is an eight-bit-wide character. This is a 32-bit-wide integer. And then the next level up from that is I can express collections.
I can say a tuple or a struct and say this is like an unboxed tuple or an unboxed struct, and it contains inside of it things whose layout is specifically specified as well. Once I have this kind of unboxed struct, what can I do with it? How do I actually use it? And in fact, where is that data represented?
Right. The idea of an unboxed struct is that it’s just something which is wider than a normal data value. So there’s never any representation of it on the heap. There’s just, if you have a local variable of type “unboxed tuple of two Ints,” then that’s the same as having two local variables.
It’s not one which is a pointer to something, it’s just two local variables. They live in separate registers. If you pass them to a function, they take up two of its argument slots instead of one. If you return it, the function has to return a wider structure, and so on. The idea is that these things will work like normal OCaml values, but will never live anywhere other than the local variable stack or similar.
Got it. So this is, in some sense, a way of encoding this pattern of taking… I have a bunch of things I want to pass them to some other function, right? I kind of get a calling convention now, where if I pass some unboxed structure to another function, then it’s just going to get turned into ordinary function parameters. And it also gives me a kind of multi-argument return. If I want to return one of these structures, I now have a way of returning a bunch of things without going through an intermediate heap-allocated value.
Yeah. And you’re right to mention calling conventions, because, in fact, a lot of the design here was based on a paper by, I think, Richard Eisenberg and co. from the Haskell community, entitled “Kinds are Calling Conventions.” It’s about determining the calling convention of a function based not on its type, but on the kind of its type.
All functions of a given kind, so all of the things that take floating-point arguments, or all the things that take non-pointer arguments, will have the same calling convention. And you can write generic polymorphic code that calls those functions in a completely uniform way. It’s only when the kinds differ that you might have different calling conventions. The kinds tell you how many registers a value needs and so on.
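A small sketch of what this buys you, using the hypothetical `#( ... )` notation for unboxed tuples discussed above:

```ocaml
(* Returning two results with no heap allocation (hypothetical
   syntax): the pair lives in two registers, exactly as if the
   function had two return values. *)
let divmod a b : #(int * int) = #(a / b, a mod b)

let () =
  (* Destructuring binds the two registers to two names; no
     intermediate tuple is ever allocated. *)
  let #(q, r) = divmod 17 5 in
  assert (q = 3 && r = 2)
```

The kind of `#(int * int)` tells the compiler this function's calling convention uses two integer registers for its result, and any code polymorphic over that kind can call it uniformly.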
Okay. But I claim this whole thing about unboxed types is about how to more efficiently represent things on the heap. And then we’ve said, ah, we have all these structures that do not in the ordinary course admit a representation on the heap. So how does this tie back to the heap?
Oh, actually, that is true. A lot of the interesting stuff, in fact, a lot of the hard stuff with the unboxed types work is not just about getting better heap representations, it’s about not having heap representations at all for things that don’t need to be on the heap in the first place. For any of these unboxed tuples or unboxed structures, you can always choose to put them on the heap if you want to box them up and have a single pointer to represent this collection of things.
There are a couple of restrictions on being able to do that, to do with weird internal details of the OCaml garbage collector, which currently expects an object to consist either entirely of tagged pointers that it knows how to look at, or to not contain any pointers at all. So while an unboxed tuple containing one pointer and one float is a thing that you can pass around or pass to a function and so on, it’s not a thing that you can store on the heap because the garbage collector won’t know how to process it. You can convert back and forth between these unboxed things and heap allocations, but there are a couple of sharp corners at the moment to do with exactly how the garbage collector expects the heap to be laid out.
Got it. So there’s some explicit way… if I just use an unboxed struct on its own, that’s a kind of floating, ethereal thing. It exists at no fixed point in memory. And in fact, I guess, it has to be immutable, because in order for something to be a mutable object, it has to have a fixed location somewhere. You need to be able to alias it; multiple different parts of the program have to talk about the same thing.
So it needs a location as a name in order to be mutable. So we have these ethereal, immutable unboxed things, and then we have some explicit way of packing them into the coarse, low-level, heap-allocated structures. And presumably when we do that, we get all the nice, tight packing of data so that we don’t have to spend a whole 64 bits to represent our eight-bit character.
Yeah. The design at the moment is that you either get the tight packing of data, or sometimes you get a compiler error saying, “I’m afraid I can’t do that, Dave.” You just can’t have a pointer and an eight-bit char right beside each other. The garbage collector will be very confused. It’s a separate bit of work that might be worthwhile doing some time to maybe lift some of those restrictions and let the garbage collector handle more densely-packed, complicated, half-pointer, half-data structures. But I think the goal at the moment is to actually surface the capabilities that the OCaml runtime system already has to the programmer. There’s more stuff that you can do than you can comfortably do in the source language at the moment.
Got it. So there’s still a thing missing from the story, as I understand it, which is, I also want to do some imperative programming on these things. I have some big data store where I have a bunch of nicely, compactly packed things. And now what do I want to do? I want to write some code that walks over that and maybe mutates it. Because often the most efficient way of dealing with large chunks of memory is not to do things always in a purely functional style, sometimes you actually want to imperatively modify data. So how does that fit into the unboxed types design?
I don’t think there are any huge differences there from normal boxed data. So at some point, at some level, there has to be some boxing, in at least the way we’ve been doing things, to get the mutability. Because as you’ve mentioned, a purely unboxed tuple is just some data; there’s nothing there to mutate. But as long as the outer structure is mutable, you can write some syntax that looks like you’re mutating inner fields, even though it’s doing something a little bit more complicated.
Got it. So at the end of the day, we have more primitives than just, take some unboxed structure and flatten it onto a heap-allocated thing or un-flatten it. There’s also some notion of saying, I want to address something deep in the middle of this unboxed structure and say, let’s modify that. And that’s another language feature that kind of is surfaced to let you express that in a concise way.
So one of the really lovely things about OCaml is variants, right? Variants is this language feature that lets you express, I have some data structure which is this or that or the other. And in this discussion of unboxed types, we’ve been talking entirely about records and tuples. The difference between variants, which are sum types represent disjunctions, this or that, versus product types, things that represent this and that and the other. Should the unboxed types support in OCaml also have some kind of support for unboxed variants? Is that a meaningful idea?
It’s definitely meaningful and the degree to which it’s easy varies. So unboxing one variant into another variant is actually a thing that I think Gabriel Scherer and Nicolas Chataing in upstream OCaml development are working on. So this is if I have a variant that says it’s foo or bar or baz, and bar is in fact itself a variant, that’s one of these three things, then maybe we should in-line that and represent that as a single variant with five cases, rather than like a variant whose middle case points to a third one.
And that’s the moral equivalent of in-lining one record into another; it’s a fairly straightforward transformation. You could do it by copying and pasting the code, but you’d rather it was done by the compiler. A trickier one is doing unboxed variants as local values or as return values, return values being a particularly interesting case. This is very useful for small variants like option, which we’d like to be efficient, but it’s useful for larger variants as well.
The difficulty is that the exact size of the variant is not known in advance, because the different cases of a variant can have different numbers of fields. So you end up needing to do some kind of over-approximation: reserve enough space or enough registers for the largest of the possible cases, which might even be larger than any single case, because the fields may have incompatible representations. You may have one case where there’s an unboxed floating-point value and another case where there’s an unboxed integer value, and you need to reserve both the floating-point and the integer registers to handle one case or the other.
And this is one aspect that’s desirable to do, but it doesn’t quite fit into the model of stuff the OCaml system can already do that we’d like to surface to the programmer. The OCaml backend compiler-optimization pipeline doesn’t know how to return things whose layout may depend on which case it’s in. Some of the trickiness involved in getting that to work is what originally pushed Leo White and me to start thinking about the local types, because that provides a different and much simpler way of getting many of the same benefits: a function that’s returning a variant, instead of using this somewhat fiddly but very optimized representation, could instead return a locally allocated variant. And then it would have the same representation as variants normally do in OCaml; it would just be allocated somewhere else and deallocated very cheaply.
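A sketch of that local-types alternative, again using the prototype `exclave_` keyword as an illustrative assumption:

```ocaml
(* Instead of inventing a register-level layout for a returned
   variant, allocate it on the caller's local stack. It keeps the
   ordinary boxed representation of a variant, but is freed
   cheaply when the caller's frame ends, never touching the GC
   heap. *)
let safe_div a b = exclave_
  if b = 0 then Error "division by zero" else Ok (a / b)
```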
Right. And I guess that’s one of the trade-offs. I mean, I guess in some cases you can imagine doing the same kind of feature in both unboxed types and with local types. And one of the nice things about doing it with local types is that it gets to use the same data types, right? The fact that unboxed types is based on a kind system means that you kind of have to throw all your ordinary types away.
Would you like to return an ordinary OCaml option value? The answer is no, you can’t do that as an unboxed type because it’s just not unboxed. It has the value representation. It doesn’t have one of these other representations. And so there’s a kind of code duplication and lack of polymorphism issue here, whereas in the case of local types, well, you can just use exactly the same data layouts, which gives you kind of more data and code sharing than you get otherwise.
Yeah. And in fact, it’s very hard to tell which of the two approaches is going to be the best for any given situation. For variants like Error, I’m leaning towards local types being a nicer way of expressing that. But for things like option, actually unboxed types are great. For an option, what you’d really like is something like null, where you can say… So the option type in OCaml is: a t option is either Some t or None.
So it kind of represents a nullable t. But the issue is that that Some t case involves an extra heap allocation just for this block, and the only thing that the block does is say that it’s not None. Many languages do have a distinguished null value. But the trouble is that that gets you something different from an option, because with an option, you can distinguish Some of None from None. If you have a nested option, you know at which level the null was introduced.
Right. Some of None or Some of Some of None, or Some of Some of Some of Some of None, right?
Yeah. Exactly. You very rarely use these things on purpose, but the lack of them is quite annoying in a lot of languages. For instance, I’m not sure if it’s still this way, but back in the day at least, the standard hashtable interface in Java has a .get method. You pass it the key and it might return null. But if it returns null, does that mean that the key was not present in the hash table, or that the key was present in the hash table, but the value that was associated with it was itself null?
Right. You would never write code that just nested options directly because it would be really confusing. But it’s really an abstraction problem in that you might very well have some abstract type, which under the covers is implemented as an option. And then you might want an optional version of that. And boy oh boy do you not want to have the breaking of abstraction that you would get when you collapse the two None and Some of None. Those mean really different things. And totally non-perverse, reasonable code will run up against this and do crazy bad things.
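The same shape shows up in plain OCaml today: `Hashtbl.find_opt` returns an option, so a table whose values are themselves options gives you a nested option, and `Some None` and `None` carry genuinely different information:

```ocaml
let tbl : (string, int option) Hashtbl.t = Hashtbl.create 16
let () = Hashtbl.replace tbl "present-but-none" None

(* [None]: the key is absent.
   [Some None]: the key is present, and bound to [None]. *)
let () = assert (Hashtbl.find_opt tbl "missing" = None)
let () = assert (Hashtbl.find_opt tbl "present-but-none" = Some None)
```

Collapse those two and the lookup can no longer tell absence apart from a present-but-null binding, which is exactly the Java `.get` ambiguity.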
Yeah. But on the other hand, an awful lot of the cases of option in OCaml are things like string option or Int option, where it’s very obvious there is not going to be any nested options going on and you really could just represent it as a special magical value. So what you really want to be able to do there is write a t option as t nullable, but in order to be allowed to write t nullable, you would need to know that t is itself not a nullable type, so that you’re never going to run into this case of confusing the two nulls.
So that fits very neatly into the kind system of the unboxed types, where there’s a kind of non-nullable types. I mean, I think we were even discussing making this be the default for an abstract type t that you write in an interface file with no actual annotation, because this means that you can then write t option or t nullable, get a type error on writing t nullable nullable, and know that anywhere where t nullable appears, t will itself never be a nullable type. So there’ll be no ambiguity about what the null is attached to and what the null means.
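Sketched in hypothetical interface syntax, the kind system rules the ambiguous case out statically; the `non_nullable` and `nullable` names here are illustrative, not settled:

```ocaml
(* [t nullable] means [t] or a distinguished null, with no
   allocation for the non-null case (hypothetical syntax). *)
type t : non_nullable
val lookup : string -> t nullable

(* Rejected by the kind checker: [t nullable] is itself nullable,
   so nesting it would make the two nulls indistinguishable. *)
(* val bad : t nullable nullable    <- kind error *)
```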
That is really cool. I feel like this problem of how do you deal with nested options efficiently is something that has been kicking around the OCaml community for, I think, a solid 20 years. I think there’s like a 20-year-old discussion post I saw, discussing various ways of fixing this. So it’s neat. I guess one of the things I keep on coming back to is building a good programming language takes a really long time. The idea that there’s a problem that you identify and it takes you 20 years to figure out is actually just not that weird.
Yeah. I think there’s always this trade-off between trying to find an efficient solution for the problem and trying to rearrange things so that the problem never ever comes up. So the nullable thing that we’re doing is not trying to find an efficient representation for nested options, it’s just trying to arrange the whole type system so that we never, ever have to nest them and we can prove that they will never nest.
Yeah. And I guess this is just part of a bigger set of trade-offs of how much do you get the right thing to happen by cleverly arranging the system without any thinking about it, the right thing happens. And how much do you give the programmer explicit control? And there are lots of different trade-off points that you can pick there.
Yeah. And as I mentioned before, the really difficult thing is when you want both in the same language, because it’s very easy to get into a bad state where the bit of the language that assumes everything will be right all the time butts up against the bit of the language where the programmer has explicit control and decided it would be efficient to sometimes do the other thing here.
Well, good luck, I suppose. I think your task for the next little bit is to figure out how effectively we can do this in OCaml.
All right. Well, thanks for coming. This has been a bunch of fun. I feel like the part I’ve liked most about this is the way in which what seems like a low-level, systems-y topic turns out to be so tied into fancy-shmancy type theory stuff. Sometimes the best solution involves pulling things from a couple of different domains. But anyway, thanks for joining. This has been great.
Thanks for having me.
You’ll find a complete transcript of the episode, along with links to some of the things that we discussed, including some of Stephen’s papers and write-ups and some of the new language features he’s worked on, at signalsandthreads.com. Thanks for joining us, and see you next time.