Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.
Sylvain Gugger is a former math teacher who fell into machine learning via a MOOC and became an expert in the low-level performance details of neural networks. He’s now on the ML-infra team at Jane Street, where he helps traders speed up their models. In this episode, Sylvain and Ron go deep on learning rate schedules; the subtle performance bugs PyTorch lets you write; how to keep a hungry GPU well-fed; and the importance of reproducibility in training runs. They also discuss some of the unique challenges of doing ML in the world of trading, like the unusual size and shape of market data and the need to do inference at very low latencies.
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky. It’s my pleasure to introduce Sylvain Gugger. Sylvain is a Machine Learning Engineer here at Jane Street, and he’s done a bunch of interesting stuff in the outside world as well. He was a core maintainer of Hugging Face’s Transformers library. He wrote Hugging Face Accelerate, which is a nice library from them that helps you run your models performantly on a lot of different kinds of hardware. And he also wrote a lovely book along with Jeremy Howard called, Deep Learning For Coders with fastai and PyTorch. So he’s done a bunch of interesting stuff in the outside world, he’s also doing a lot of interesting machine learning work here at Jane Street, so thanks for joining me.
Thanks, I’m very honored to be here.
And just to kick things off, I’d love to hear a little bit more about your background, and in particular, how did you get to work on machine learning in the first place?
So that’s a good question. I was originally a math teacher, like 10 years ago in France, teaching at a first-year university level. And yeah, I moved to the U.S. in 2015. I had kids, so I stayed at home, taking on some small projects while mostly taking care of my kids. In 2017, AI was kind of becoming more mainstream. I actually read an article in the New York Times about it. It was going to steal, like, everyone’s jobs in the next two or three years and that–
(laughs)
–didn’t happen, but it’s still something that became more mainstream. And at the end of the article, they were mentioning a couple of online courses for people interested in diving in more, and I was interested, so I dived into it. So one of the courses mentioned was the fast.ai course by Jeremy Howard, which I followed. It was very interesting and yeah, I started commenting a little bit more and more on the forums, and making a couple of contributions to the fast.ai library, which is used throughout the course to make training models a little bit faster, a little bit easier. And then towards the end of the course, Jeremy led a fast.ai team into this competition called the DAWNBench competition, which is the ancestor of the MLPerf benchmarks. It was organized by Stanford and the goal was to train a computer vision model as fast as possible to a given accuracy. And so, yeah, we entered the competition, I helped the team, and we were positioned first for the longest time, and yeah, at the very end, Google kind of trolled us by–
(laughs)
–publicly releasing TPUs for the first time, and yeah, those massive computers that no one else had access to trashed our best entry and our best time.
So I want to hear more about the competition, but before that, can you tell me a little about, like, what is fast.ai? What’s the basic program there? What’s the mission behind that organization?
So fast.ai is a non-profit whose goal is to educate people about deep learning, especially in those early years. It was starting to become more mainstream, but not necessarily as mainstream as it is today. And the idea behind it – which Jeremy Howard and I believe – is that to get the best model, you need good machine learning engineers, but you also need people who really understand the data that those models are gonna consume. So if you want good modeling for radiology, you need a very good radiologist to kind of understand how machine learning works, so that they’re gonna be able to help you build those best models. The fast.ai courses are aimed at coders who want to, like, really dive deep into machine learning, but they also begin as more of an introduction that anyone who is interested can take to learn more about, “What is machine learning? What are those deep learning models and what can they do?”
So the basic idea is to democratize machine learning, so all sorts of domain experts can actually know enough about it to really leverage it in a meaningful way.
Exactly. You said it way better than I did. (laughs)
(laughs) Let’s get back to the competition. So the end of the competition is a great dramatic story of, “Google gorilla stomps on everything by dropping TPUs at the last moment.” But what were you actually doing in order to get into first place before Google kind of jumped in there?
Yeah, so a couple of things. The main thing is related to the way we were training the model and in particular, the learning rate schedule. So to take a little step back, when you train with machine learning models, initially, your model is random, so it’s outputting crappy predictions. But then you compute loss, and from that loss, some gradients that are going to make your model a little bit better if you adjust the weight following those gradients. This whole process is called stochastic gradient descent.
Right, and just to say a higher level thing about all of this. This is just an example of a more general thing called function optimization, right? You have some function that you want to optimize, in this case, the function is given the set of model weights and the input you want to run on. You want to find the set of model weights that give you the best and most accurate answer. And we just approach this like we do in some sense, almost any other kind of optimization problem with techniques that actually like go back 50 years or something of, we’re just going to compute a derivative and we’re gonna walk the model weights in the direction of the derivative, and just do that over and over until we get to a more optimal result.
Yes, exactly. Like the whole process and the whole math behind it existed for like 50, 60 years. It’s just that with GPUs becoming more and more powerful, we actually had the compute to apply that process to complex problems, like deep learning. So that very important hyperparameter, the learning rate, is kind of the size of the step we take following those gradients. At the time of the competition, the most popular learning rate schedules were, like, very inefficient: just training at a low learning rate for a very long time, and then you divide that low learning rate by 10, and you train for even a longer time. That did converge to a good accuracy, but it was very inefficient. And one of the things we had in our competition entry was to follow a learning rate schedule that is more like a warm-up from a low learning rate to a high learning rate – so not starting at a high learning rate, because otherwise your model immediately explodes. But by warming up, starting from something low and gradually increasing it to the maximum, we can have the model learn a little bit of something first, and then have a high learning rate for a little bit of time, so that we can explore the loss landscape efficiently, and then decrease it towards the end. And this kind of schedule made it possible to train the model to the same accuracy, but way faster.
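As an illustration, here is a minimal sketch of that warm-up-then-decay shape using PyTorch’s built-in OneCycleLR scheduler. The model, data, and step counts below are made-up placeholders, not the actual competition setup.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

# Placeholder model and data, just to make the loop runnable.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))

steps_per_epoch, epochs = 100, 5
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,                        # peak learning rate reached after the warm-up
    total_steps=steps_per_epoch * epochs,
    pct_start=0.3,                     # fraction of training spent warming up
)

for step in range(steps_per_epoch * epochs):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                   # low -> high (warm-up), then anneal back down
```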
Why is that a better schedule? If you kind of just think about this without knowing all of the details. The idea that when you’re very far from the answer, you want to take large steps, and when you get closer to the answer, you want to take smaller steps, seems intuitive. But here, you’re saying instead that you start with small steps, and then go up to big steps, and then go down to small steps. So what about the structure of the problem makes that the right approach?
I mean at the beginning of the problem, like, since your model is randomly initialized, in the landscape of your loss function, you are very, very high, and you actually have very steep canyons. So if you take small steps at the beginning, you can at least begin to descend into one of those canyons of the loss function, and then increase that learning rate to dive through there fast. And you will skip over a lot of local minima because your learning rate is large. So towards the end, you need that decrease, to step down further into one of those smaller parts of the landscape of the loss that have these local minima.
So, is the intuition here that when you start at a randomly initialized point, the terrain around which you’re trying to optimize is just more wild. And if you take big steps, your derivatives are very high, and you’re kind of jumping all over the place. But even with a little bit of optimization away from that initial randomness, you end up in something that feels like a more regular space, and now you can go back to what makes more intuitive sense.
Yes. It depends also, like, we are talking about just a 3D problem. But we have millions of dimensions because your model has millions of parameters. So it’s the idea that, yeah, on some of those dimensions the landscape is very, very spiky. So at least taking care of that at the beginning with a low learning rate is gonna make the whole optimization problem easier, and then you can have larger steps.
Yeah I do think there’s this, kind of, terrible intuition one has when one thinks about a problem like this, of like, “I’ll try and visualize it in two or three–
(laughs)
–dimensions.” And you’re like, “You have just lost all of the important structure and you really need to think about this high-dimensional problem to really know what’s going on.”
That was one of the optimizations. The other optimization was specific to it being a computer vision problem. And so the kind of models we applied to it – which are called CNNs, for convolutional neural networks – can work on any size of images, because it’s just some kind of filter that you apply over all of your image. And so the idea is that at the beginning of training, the model is random, it’s crappy, so it doesn’t really need to see the whole picture. We kind of gave it a more blurry version of the picture, like just 128x128, and then gradually as training goes on, we increase the size of those images up to the standard size people were using for that problem. And that gradual resizing helps because, at the beginning, when your image is smaller and you have your filter to slide all around that image, it’s going to be more efficient if you have fewer pixels, compared to doing the training with always, like, the high-resolution images.
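A rough sketch of that progressive-resizing idea in PyTorch – the size schedule, dummy data loader, and epoch counts below are invented for illustration, not the values used in the competition.

```python
import torch
import torch.nn.functional as F

def resize_batch(images: torch.Tensor, size: int) -> torch.Tensor:
    # Downscale a batch of images with shape (N, C, H, W) to size x size.
    return F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)

# Stand-ins for a real DataLoader of 224x224 images and a real CNN.
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))) for _ in range(4)]
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.AdaptiveAvgPool2d(1),
                            torch.nn.Flatten(), torch.nn.Linear(16, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

size_schedule = {0: 128, 10: 160, 20: 224}          # epoch -> image size (hypothetical)
current_size = 128
for epoch in range(30):
    current_size = size_schedule.get(epoch, current_size)
    for images, labels in train_loader:
        images = resize_batch(images, current_size)  # blurry early on, full-res at the end
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```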
Right, so there’s two neat properties of convolutional neural networks that are coming into play here. Convolutional networks are, in general, a dimensionality reduction trick. You can imagine a big network that’s applying to all of the different inputs, and all the different parts of the image, and then you could just have weights that are individual to all the neurons that are associated with all the different parts of the image. But that’s enormously wasteful because in the early parts of the network, you actually want, sort of, the same regular structure over and over. And so the basic idea of a CNN is you kind of lock those weights together, so you in some sense just have one copy of this neuron which is activated multiple times in multiple places. But then once you’ve done that trick, you also have this resolution independence where you can run it at multiple different resolutions and you’re saying, “Well, we’re just gonna train this thing at low resolution, and then, again, after it gets into the ballpark and we need to more precisely fine tune it, then we’ll increase the resolution and do the rest of it.”
Yeah.
And were these essentially new techniques at the time of this puzzle?
Yeah, both of them were new techniques. Like, the gradual resizing is still not that widely used. The new kind of learning rate schedule – now everyone uses that, like all transformer models, like GPT-3.5. I think we’re at GPT-5 now, I’m not sure, because OpenAI does not publish its research, but like–
(laughs)
–the open source versions of that are trained using that kind of schedule. And since BERT, we have seen that kind of learning rate schedule all the time.
So that’s how you got into fast.ai and you got into this competition space. How did you end up being co-author of this book?
So after collaborating on the fast.ai library and yeah, participating on the forum, that competition, Jeremy Howard kindly offered me a job at fast.ai, which I accepted. So I worked there for two years, built a couple versions of the fast.ai library, and two iterations of the online course. And it was very natural going from, like, the course to publish a reference book with kind of the same content, just in a different format for people who prefer to learn from books, instead of YouTube videos.
Got it. And then what brought you to Hugging Face? And what is Hugging Face?
So, Hugging Face is kind of the GitHub of machine learning. The idea is that we have a website that looks kind of like GitHub – except instead of having repos with code, you have repos with model weights. So like Llama 1, Llama 2, Llama 3, and 3.2, which was released a couple of days ago, are all on Hugging Face, along with, I think, now a million public models from all kinds of applications of machine learning, like computer vision, text, speech, et cetera, et cetera. And yes, the idea is that they are kind of at the forefront of open source AI by allowing people to share those models. And they also have a couple of libraries, because model weights are all very well, but if you don’t have the code to actually instantiate those models, they are kind of useless. To complement that, they have libraries like the Transformers library, which actually contains the code of those models.
And how did you end up at Hugging Face?
In 2020, there was this thing that happened worldwide.
Oh, yeah. I vaguely remember that.
(laughs) Yeah. And so, that kind of disrupted some plans at fast.ai. So I looked for another job, and there was this start-up founded by French people, which was based in New York City. And so, I knew them from, like, the French community in New York City, and I had met them a couple of times before. They were looking to expand. So I applied to Hugging Face and I joined them remotely in June of 2020, as a continuation of my work in open source from fast.ai to democratize machine learning and help people use the Transformers library or their website with all the public weights on it.
So what kind of technical work did you end up doing at Hugging Face?
Couple of things. There was the maintenance of the open source libraries, because there are people opening pull requests and filing issues kind of all the time, so that is already a huge amount of work. Then I developed new tutorials and new examples to help people use those libraries, and that kind of ended with an online course that was meant to be taken after the fast.ai course, for people who wanted to specialize a little bit more in transformers. So there were those two aspects, and then yeah, at some point, all the researchers at Hugging Face were kind of annoyed by our big black box trainer, which contained all the stuff for the training loop. And it became, with time, this huge amount of spaghetti code, because you have new flags that appear to kind of control everything that people want to do with their trainings. And so I created a new open source library, much more lightweight, to help people with their trainings, so that they can have more flexibility. Yeah, the idea is that usually those APIs to train models – you have a trainer API and you give it some things, like your model and your data, and you click “train,” and it trains – are marvelous for people who just want that. But yeah, researchers who wanted to tweak the training loop a little bit were struggling a bit more. So there are various techniques that we applied for that in the past. Like in fast.ai, we had a callback-based system, so we had callbacks that the researcher could implement to change a little bit the behavior of the training loop at this particular point or another. The Hugging Face trainer was less extensible. But for that library, called Accelerate, I went back to the idea that the researcher is just going to write their training loop – there’s not going to be a black box trainer – and they just need to change a couple of lines here and there to make it run on any kind of system. At first, it was like six lines, then five lines, and we tried to reduce that number of lines to the absolute minimum, so that there was as little intrusion as possible. That kind of gave us the API of Accelerate.
And when you say you want to make it possible for people to do their training on multiple different kinds of systems, what is the diversity of systems underneath that you’re thinking about? What are the kinds of systems and different variations on the training that you were trying to enable with Accelerate?
Training requires a lot of data, usually, when you train large language models or, like, even other kinds of models. And to make it more efficient, usually you kind of divide and conquer. If you have multiple GPUs, you give a slice of the dataset to each of your GPUs. And so let’s say you have n GPUs, then your training time should be divided by n at the end of the day, because you fully parallelize the thing that you care about just by splitting your data this way. So this is called data parallelism, and it’s kind of the first level of parallelism we can use when we have multiple GPUs and we want to run a training on them. And so you can do that in PyTorch, except it requires some kind of boilerplate code that is a bit annoying. So the idea of Accelerate was to remove that boilerplate code by just having to change a couple of lines in your training loop, and poof! Your model can now run training on multiple GPUs, also on TPUs – but of course the code to run them with the same kind of distributed data parallelism on TPUs is different from the one on GPUs. That would be too simple otherwise.
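As a rough sketch of that “change a couple of lines” idea (assuming the accelerate package is installed; the model and data here are toy placeholders), the Accelerate pattern looks roughly like this:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()                      # detects CPU / single GPU / multi-GPU / TPU

model = torch.nn.Linear(64, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 64), torch.randn(256, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

# The key Accelerate lines: wrap the objects, and use accelerator.backward().
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:               # each process sees its own shard of the data
    outputs = model(inputs)
    loss = torch.nn.functional.mse_loss(outputs, targets)
    accelerator.backward(loss)                   # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Launched under `accelerate launch`, the same script can run on one CPU, one GPU, or several GPUs, with the dataloader sharded across processes.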
(laughs)
And then, even once you have made that modification, the thing still runs on CPU as well. So the idea is that it kind of deals with all of that crap for you of detecting which kind of environment you are on, and then adding the boilerplate code that is needed for your training to run successfully on all of those kinds of systems. And then also, if you want to train in a mixed precision setting, because you want to use lower precision types – we can talk about that later – it dealt with the additional lines of code that were required to properly do that, kind of automatically.
Yeah, I mean I think this whole discussion kind of underlines just the diversity of different hardware and setups that you can do when you’re doing training. There’s the, kind of, in some sense, simplest thing of like, you can run your training on a CPU, which is a thing that people did for a long time. And then there are multiple different parallel architectures: GPUs, which are like, literally descendants of graphic programming chips, and TPUs, which is this tensor processor that Google came up with. And the main game here, going from the CPUs to the GPUs and TPUs, is about parallelism. It turns out, CPUs are these kind of funny machines that have lots of parallel circuits, but they’re interpreters for a brutally sequential programming language, right? And so they’re not that good at doing lots of things in parallel, and in fact, there’s all the complexities of like, multi-core architectures and stuff on that side, which is how you try and take advantage of parallelism there. But then GPUs and TPUs are machines that are much more directly parallel in their structure, and built for large-scale, highly regular parallel computations. And then at some point, those things aren’t enough either, and now you start getting to various forms of distributed. You want so much parallelism, you want multiple GPUs. And then the first thing that you were talking about was this data parallel training, where what we’re doing is we’re running this, like, stochastic gradient descent, where we’re picking random subsets of data, breaking it up into batches, and then training on individual GPUs and computing, like, a net gradient which we then use for updating the model. And then there’s also pipeline style parallelism, which you might need when your model itself is too big to fit. In fact, not just pipeline parallelism, but various kinds of model-level parallelism, where you actually take the model and break it up and split it among multiple GPUs, because even the model weights themselves are too big to fit there. And then Accelerate is trying to help you write your model once, and your training once, and do a modest amount of modifications to be able to access this whole sweep of different ways of doing the training.
Yeah, exactly. If your model does not fit anymore on one GPU, you can split it different ways. You can split it by layers. If it’s a deep learning model – usually those get bigger because you have stacked more layers – you have layer one on GPU one, layer two on GPU two, layer three on GPU three, et cetera, et cetera. Which is a good idea because then your model fits, but there’s this inefficiency in the sense that GPU two has to wait for GPU one to be finished to be able to process its result and pass it along to GPU three. And so that’s where pipeline parallelism comes into play, where you’re trying to pipeline things efficiently. So like, give a little bit of your data to GPU one, which is gonna send it to GPU two, and then GPU one will process the second little bit of data while GPU two is busy computing the first part. And there is this ping pong between the forward pass, when you run through your model, and the backward pass, where you compute all of your gradients. So you can also efficiently interleave, like, some part of the forward and some part of the backward computation in that pipeline parallelism. And then there is tensor parallelism where, instead of splitting your model by layers, you actually split the weights of your model into chunks. And, like, each GPU only sees one part of the weights. And so the GPUs need to come together and agree on the results of all the matrix multiplies that you compute. So this kind of parallelism requires a very efficient way to communicate between GPUs to be practical.
That’s right. Maybe the other interesting thing about the hardware around this kind of stuff is the criticality of the network. You need these very fast network transfers to do the tensor exchanges. And yeah, there are some contexts where it can be a little less critical because you can overlap compute and data transfer, but in some cases, like this tensor parallelism, the GPUs are just gonna be sitting idle while you’re waiting. So, we nowadays have these, kind of, wild new networks which have much, much higher capacity and are very focused on these very low latency and high determinism data transfers. One of the things I think is interesting about this is the way in which the networking stack has changed, right? I think when I started learning about, “How do you do high-performance trading systems?” I learned about, “Well, the operating system kernel is obviously too slow. So if you want it to be reasonably fast, you have to do kernel bypass. You have to have a user level networking stack that’s doing the communication.” And these systems use a technology called RDMA, remote direct memory access, which I think an easier way of understanding what’s going on here is, it’s CPU bypass, right? Basically network traffic comes in on the NIC, and then without going through any CPU at all, just gets copied directly to the place in memory that it needs to go, maybe directly into the GPU memory. So you’re really cutting away all of the fat from the bones that you can to make this stuff go as fast as possible.
Yes. And even in the more recent hardware that NVIDIA has announced, at the last GTC, you kind of stack your GPUs as close as possible, and you try to put as many as you can in a single cabinet. So there are 72 GPUs in the same cabinet, very close to each other, so that you can have even faster networking between them, because of the way they stack them: the network is in the middle, some GPUs above, some GPUs below, and they have this big NVLink in the back that links everything together very fast, just because they sit very close together.
Yeah, you start caring an enormous amount about the physical layer at this point. Today we can get these NVLink setups where inside of a single box with, say, 8 GPUs in it, you get this fast network. And yeah, what you’re describing is doing this at the cabinet level.
Yes.
Which is funny. Yeah, I mean, I remember hearing people talk about, like, earlier hacks, not for machine learning but for other purposes, where people would, like, you know, basically try and make little supercomputers where you unroll your PCI Express network and basically spread it over an entire cabinet. And in some sense, InfiniBand sort of grew out of the similar supercomputer networking fabric. And indeed, InfiniBand plays a real role in how these GPU networks work as well. Okay. That was the stuff you did at Hugging Face. So more recently you’ve joined Jane Street, tell me a little bit about what your role here entails.
Sure. So at Jane Street, I mostly work on performance engineering around machine learning. So day-to-day life is, a researcher will come to me with a model they’ve trained, and they’re like, “Oh, my training is going really slowly. Could you help me with that?” And we’ll profile it together, try to identify the bottlenecks, and make it faster. To take a step back, most of the researchers here at Jane Street use PyTorch, which is the software to write neural nets and train models, and which has the particularity of being really accessible because it’s eager. The counterparts from Google, TensorFlow and JAX, are more like compiled languages. So it’s kind of harder to get started, because you write your model but then it does not compile, and so you need to fix some of the operations that seem like valid Python operations, but you need to kind of modify them so that TensorFlow or JAX recognize them and see, “Oh, this is what you are trying to do.” Whereas in PyTorch, you can do anything you want, but then your code can be inefficient in surprising ways, because that particular operation, for instance, has no implementation on the GPU. And so the computation needs to transfer data back to the CPU just to be able to execute it, and then send it back the other way. And in general, especially on modern GPUs, the way PyTorch works is that when you want to execute a model, the CPU dispatches the operations to the GPU async, so that, like, the CPU immediately runs to the next instruction. You’re getting your hardware in a good state if your CPU is always ahead of your GPU and the GPU has lots of stuff to process, but as soon as your code requires some synchronization, because you need some data back from the GPU on the CPU, it can become pretty inefficient. Just because you’re kind of stalling the GPU while the CPU waits to get the data back. And then it will take time for the CPU to send new operations to execute to the GPU.
Right, and it’s that waiting, where the GPU’s waiting on the CPU, it’s slow for a lot of reasons. It’s slow because the memory transfers are slow. It’s slow because CPUs are inherently slow. And then, oh my god, it’s slow because the code that’s running is written in Python. Which is maybe like 60 times slower than what the corresponding thing written in C might have looked like.
Exactly. Like, even if you don’t care about the GPU, most of your Python code, you will always try to have it vectorized: you try to write as few for loops in Python as possible, because those will be very slow. Whereas if you can execute an operation from, like, NumPy, which will be backed by C or C++, it will be much faster. And it’s kind of the same idea for the GPU, except on top of that, you have the complexity of avoiding synchronization points between the CPU and the GPU as much as possible.
And notably, when a C programmer says, “Oh, I wanna make sure this is vectorized,” what they mean is, “I wanna make sure I’m using, like, the SSE, AVX,” whatever vector instructions, that are, like, using fundamental parallelism technologies baked into the CPU to be able to do, like, four or eight or whatever, computations in parallel. And when a Python programmer says “vectorize,” what they mean is the inner loop is in C. And maybe it’s also vectorized with AVX or whatever at the bottom. But the fundamental thing is getting away from the Python interpreter loop.
Exactly. Sometimes you can have code that looks very innocuous, but you’re actually executing a for loop which is going to, at every iteration, trigger a synchronization between the CPU and the GPU, which is extremely bad, because you’ll, like, launch a tiny operation on the GPU, and then have to wait for the GPU to finish it and get back the result to the CPU, and then launch a new tiny operation on the GPU, et cetera, et cetera. And this is also really bad, because one thing we forgot to mention is, starting something on the GPU is also very slow. It takes some time for the CPU to send the code of the kernel, all the inputs and the outputs. That takes a couple of microseconds, or even sometimes a millisecond, to get started and actually have your GPU start to do the work.
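A small, hypothetical example of this kind of innocuous-looking loop: calling .item() on every step forces a CPU–GPU synchronization each iteration, while keeping the reduction on the GPU synchronizes only once.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
losses = [torch.rand((), device=device) for _ in range(1_000)]

# Slow pattern: .item() makes the CPU wait for the GPU on every single iteration.
total = 0.0
for loss in losses:
    total += loss.item()           # hidden synchronization point each time around

# Better pattern: accumulate on the GPU, synchronize once at the very end.
total = torch.stack(losses).sum().item()
```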
It’s maybe worth saying, we’re throwing this word around, kernel, a lot, which is kind of a funny, GPU-specific word. And basically the kernel is the small computational program that you are typically running on the GPU. And writing these GPU kernels is actually really hard because they’re highly parallel and they’re hard to reason about. And so the programs, in fact, tend to be numerically very intense, but in terms of lines of code, pretty small. You’re not creating, like, million-line code bases that are running on the GPU. They’re a lot tighter than that.
Yeah. You call those individual small kernels: a kernel to do a matmul, and then a kernel to do some activation function in the neural net. Each is just one Python line which is then dispatched to the GPU to be executed in parallel.
So the thing that always kills me about this whole PyTorch story, is that if you asked me to design something, I would definitely design something like TensorFlow or JAX. Just to say, like, the basic idea of TensorFlow and JAX is that you’re more or less hijacking Python as a metaprogramming system. You kind of write what looks like Python, but what you’re really doing is you’re writing in some domain-specific language for expressing the computational graph that represents the program that you’re gonna run on the GPU. And the reason I would’ve wanted to do it that way is because it seems just dramatically easier to make sure that thing is gonna run fast. You can’t take every arbitrary Python thing and make it run fast on the GPU. So you restrict yourself to some DSL where you can guarantee that things are running fast, and it just seems like the whole thing is gonna be much easier to reason about whether I’m staying inside of the envelope of reasonably fast programs, and all of that. PyTorch has kind of clearly won, JAX is new and exciting and maybe that will get more mindshare over time, but TensorFlow was the big thing and then PyTorch has been much more successful. And it just kind of frustrates my intuitions as a person who designs APIs. Do you have a view as to, like, why is it that PyTorch kind of won, and things like TensorFlow and JAX are more niche?
So, yeah, PyTorch won for the flexibility. ML researchers want to easily fool around with various ideas. And maybe at first it will be very inefficient, but they want to be able to iterate really fast through their ideas and test quickly if they’re going to be worth it or not. And even if the first training round is inefficient, if the idea turns out to be a good idea, then we can spend some time optimizing it and making it as fast as possible. PyTorch kind of represents that model well. You can fool around very easily and yeah. Also, like, with that model of execution that is asynchronous, you still get the performance. Unless your code triggers some of the hidden CPU-GPU synchronizations, your code is still performant when you run it from PyTorch. There is this flexibility, this idea that you can easily fool around. And they did come around to having a compiled thing: PyTorch 2.0 introduced torch.compile, which was kind of what people didn’t like about TensorFlow, but they kind of had to implement it in the end. Modern GPUs are really, really fast, and that programming model of, “I’m just going to dispatch the operation asynchronously from Python” was starting to lose out, just because the GPU was so fast that by the time your CPU had scheduled the kernel, the GPU was already finished, basically. And even if you kept telling the CPU, like, “Schedule this kernel, this kernel, this kernel,” in a row, it would just not be fast enough for the GPU. And this idea behind torch.compile is, again, to get the whole computational graph from your model, and then try to identify in that graph, maybe there are places where you’re doing something that’s very inefficient and we can simplify the instructions. But more importantly, try to take two consecutive instructions and fuse them together on the GPU so that instead of launching a lot of small kernels on the GPU, you launch one big kernel which does a lot of work. And this is very efficient because, first, you don’t pay the launch overhead over and over. And the second thing that’s very efficient is that very often, like, kernels that are in a row read the data that the previous kernel has just written. So you have this inefficiency: “I’m gonna write something in GPU memory,” and then immediately in the next kernel, “Oh, I’m gonna read that GPU memory I just wrote.” And there are some cache systems in the GPU but still, you have some bit of overhead by doing that. Whereas in a fused kernel you can just keep that data in registers in the GPU, and you don’t have to move it around if it isn’t necessary.
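For illustration, turning this on is roughly a one-line change. This is a sketch with a made-up toy model; the actual speed-up depends heavily on the model and the GPU.

```python
import torch

class SmallMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
        )

    def forward(self, x):
        return self.net(x)

model = SmallMLP()
compiled = torch.compile(model)      # traces the graph and fuses ops (e.g. Linear + GELU)

x = torch.randn(64, 512)
out = compiled(x)                    # first call compiles; later calls reuse the fused kernels
```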
Right, so you get to skip both the memory transfers and the kernel launch time?
Yeah, kernel launch overhead. And they do this, which is kind of a crazy hack, by using another Python DSL called Triton, which is kind of a subset of Python where you can directly write efficient CUDA kernels, and which works well. Like, if you want to write a fast matrix multiplication in Triton, it’s relatively easy, and they have some crazy templates, basically, for all of the operations you can do in a PyTorch model. And they fuse these templates, from the graph that they extracted during torch.compile, to create big Triton kernels that can execute big chunks of the model on the GPU at once.
Right. So yeah, maybe we should talk for a second about the programming language ecosystem around GPUs. GPUs have a really interesting underlying computational model. And then there’s a big collection of programming tools for it. Maybe you can walk us through what some of the major pieces of this ecosystem are.
Yeah, so if we start at the bottom level, like the equivalent of C for GPU is CUDA, which is a proprietary language from NVIDIA that they developed to program their GPUs. AMD supports most of CUDA as well because they kind of have to. They are a bit late in the game and if they want people to adopt their product, they kind of need to make sure the software is what people are used to. It’s basically C except you have those kernels that you write, which are executed in parallel on the GPU. And it comes with everything that’s kind of a pain in C, like you have to do lots of pointer arithmetic to make sure that you are looking at the right data, you have undefined behaviors every time you are not super careful, and it’s pretty hard to debug.
So it’s a very low-level system, and it also exposes you directly to the performance characteristics of the GPU. Or like, not exactly directly, because it gives you some layer of abstraction, but you get to see a lot of the underlying details. And I guess one of the things that struck me, as someone who’s mostly used to thinking about performance in the CPU context, is how different the concept of threads is on a GPU versus a CPU. I wonder if you can say a little bit on, “How should someone who’s coming to GPUs for the first time think about threads?”
Oh. (laughs) You will have lots and lots of them, for one. The GPU can launch a million threads pretty easily, and execute them all in parallel. The idea is that you have those blocks that correspond to physical blocks on the hardware, where, like, a bunch of threads are executed. And even those threads are executed in a group which is called a warp. When you write a kernel, each instruction is actually going to be seen exactly at the same time by 32 threads, which together form a warp. And one block has a number of warps – I mean, any number of warps that you want that’s not too, too large. Like, one block can accommodate 1,024 threads maximum, and then you can launch several of those blocks in parallel. Yeah, so the idea of that block layer is that it sits physically on the GPU chip at one location, so you can have, like, some memory that is shared between those threads, which is useful, for instance, if you’re doing a matrix multiply: you are gonna load some of the weights into that shared memory, and then use it with those threads to compute something repeatedly, instead of, like, accessing the same region in global memory several times.
Right. So there’s some more expensive, smaller, closer-to-the-thread memory that sits there to be shared among these threads that are on the same SM, right?
Streaming multiprocessor.
Right. And then maybe the other thing that’s perhaps not obvious to someone who hasn’t thought much about GPUs is, you also have dedicated registers.
Yeah. Up to, like, a certain amount of registers. Like it’s 65K for the whole SM. You can have a program with lots of threads that use few registers or maybe a program that has less threads but each thread can use more registers.
Right. And the critical difference here between CPUs and GPUs, is on a CPU you have a really small number of registers, and then when there’s a thread, there’s just like one thread running on the CPU and using all of those registers. And then, when you want a different thread to run, you have to swap it out in all of the registers, and swap the new thread in. And so you have this fairly large context switch time, and context switch times on GPUs are incredibly small. And so this is part of what enables you to do this, kind of, massive multi-threading. You have all of these different threads, and the threads are both able to execute in these warp groups so they can do stuff in parallel, in groups of 32. But also, they often end up being blocked. Not typically blocked on I/O, because the GPU’s not doing I/O, but just blocked on memory, right? They need to do a thing, they need to wait for memory to be shuffled in. And so you can immediately grab some other group of threads that’s running and get them started, and you can hide a lot of the memory latency by having all of these threads that are consuming different pieces of memory concurrently.
Yeah, that’s the job of the SM. So the warp scheduler is gonna schedule a warp on a unit that’s gonna do some float arithmetic, or, if you need it, a unit specifically dedicated to matrix multiply, which can do like a small 16x16 matrix multiply for those 32 threads that we just mentioned, or units that are going to load something from global memory or from shared memory. Each instruction is dispatched on one of those cores, and then immediately after it’s finished, like, another warp is gonna take its place. And this way most of the latency is hidden from the user, as long as you can express your program in a way that you always have a warp computing something.
And CUDA gives you direct, explicit, low-level access to more or less this computation model in an unsafe programming model, which is not especially clearly documented, and can be, like, hard to figure out and hard to understand. And when you get it wrong, you just get weird, undefined behavior and your program breaks in hard to understand ways. Okay, so that’s CUDA. It’s great and terrible. What else is there in the programming language space?
So we mentioned PyTorch, TensorFlow, and JAX, which are kind of at the exact other end. So it’s something that’s in Python, with all the good and the bad of Python, that is then either gonna compile the computational graph, on the side of JAX and TensorFlow, or directly send instructions to the GPU, on the side of PyTorch, which get dispatched as the CUDA kernels that we just talked about. And in the middle there is a flurry of new languages, because as it turns out, researchers love to hack and test new ideas, but they also don’t love to code in CUDA, for some reason.
(laughs)
And in the middle, there are several languages, like Triton, which kind of sit in Python land in the sense that it’s a syntax that looks like Python and you have, like, some subset of Python operations that are supported, but are actually just DSLs to generate efficient CUDA kernels. So we mentioned Triton is one of them.
And I guess one thing about Triton is, it’s in some ways not quite as general purpose, it’s really good for doing things that kinda vaguely look like matrix multiplies.
Yeah, I mean, in general, modern GPUs are really, really good at matrix multiplies. They have special cores on them called Tensor Cores, which are really efficient. And any way you can make your program look like a matrix multiply, you’re gonna get way more flops than if it’s just regular floating point operations. Triton is really good at programming those tiles of arrays and matrix multiplying them, or, like, then reducing them if you want. If your model computation is slightly different than that, sadly, very often Triton will not compile your code and won’t necessarily tell you why, as the error message is not always super clear. And the debugging experience is also not always super nice, because you’re not in Python anymore. Like, it’s generating a CUDA kernel, and so you can’t really inspect the state of everything in the middle of it. You can try to print a bit of the stuff, but it kind of stops there.
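As a sketch of what that Python-syntax DSL looks like, here is a toy fused multiply-add kernel. It assumes the triton package and a CUDA GPU; the kernel name and block size are arbitrary choices for illustration.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_mul_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # which block of the grid we are
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                            # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * y + 1.0, mask=mask)    # one kernel, no intermediate tensor

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
fused_mul_add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```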
There’s also this weird decision that the whole machine learning world has made, that we can have all the innovation we want on the programming language side, but the syntax always has to be Python.
(laughs). Yeah, most people are used to Python. So, you can try to move them all to another language. There were some attempts, like Google tried Swift for TensorFlow, to try to get Swift programmers into machine learning, or to move Python programmers to another language that is more efficient, but that didn’t go so well. It’s also like, there’s a whole ecosystem in Python, with all the rest of the libraries you need to process your data or inspect your results and stuff like that. You can try moving researchers away from what they like, but usually they don’t really follow you. (laughs).
So another interesting language in the space, which I actually don’t know a ton about is Mojo. I’m kinda curious what your thoughts on that are.
So Mojo, I think – I don’t know a lot about it, so I hope people will excuse me if I make some mistakes – but it’s kind of the same as Triton, except instead of wanting to be a new DSL in Python, it’s kind of its own new language, which looks a bit like Python but still is its own language. The support for GPUs in Mojo is going to be released in a couple of months, from what I heard, but it’s not there yet. But the idea is that you will be able to write those efficient CUDA kernels, like you do in Triton, in that language Mojo. But since it’s not trying to be a DSL in Python, there is going to be support for, like, debugging, or maybe better error handling, just because you are writing in a language that was specifically designed for that, instead of trying to add that functionality onto Python.
Right. And I think unlike writing stuff directly in CUDA, it’s a safe language, right? I think it’s got enough type system support that if you do something crazy, it will actually try and catch it for you. The way I understand it is that it’s, like, a little bit Rust-inspired. I think it has some of the same Rust-like mechanisms, lifetimes, things like that. And so, if it’s following that kind of approach I would expect them to try and make it actually safe.
Yeah and then you have other projects in the same space, so Mosaic GPU and Mosaic TPU are some Google projects that kind of do the same thing of giving you some Python interface to create efficient CUDA kernels. And if you want to write CUDA kernels but in Python, because you really love Python, there are some languages like Numba. You’re doing exactly the same thing as you would do in CUDA, just the syntax is Python.
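For the Numba route he mentions – CUDA-style kernels with Python syntax – a minimal sketch might look like this. It assumes the numba package and an NVIDIA GPU; the kernel and sizes are invented for illustration.

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)              # global thread index, as you would compute it in CUDA C
    if i < x.size:                # guard against the extra threads in the last block
        out[i] = x[i] + y[i]

x = np.random.rand(1_000_000).astype(np.float32)
y = np.random.rand(1_000_000).astype(np.float32)
out = np.empty_like(x)

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)   # Numba copies the host arrays to the GPU
```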
Got it. Stepping away from all this panoply of languages out there, how does this all play into the work you do here? You know, researchers working on a model, they’ve put together their model in PyTorch, it’s not running as fast as they think it should, or they hoped it would. What do you do? How do you approach these performance questions?
First things first is profiling, multiple times, to identify bottlenecks. We talked about CPU and GPU synchronization points, which are inefficient, and a profile will show you that very easily. And you can track, “Oh, this instruction created a choke point by synchronizing GPU and CPU, so let’s remove that, or let’s try to find a way to remove it.” Some of them are easy to remove because you can express them in different ways. Others can be a bit trickier, like for instance if you want your training to stop because your loss is NaN. If the loss you compute from your data on your randomly initialized weights is very large or NaN, all your gradients are going to be NaN and then all your model weights are going to be NaN, so basically your training is finished and completely borked. So you might as well stop and stop wasting GPU hours on it. But even that tiny thing is kind of difficult, because when your loss lives on the GPU, for Python to know which branch of that if statement the CPU should execute, it needs to know if the loss is NaN or not, so it needs to wait for the GPU to have finished computing to be able to inspect the value. You have a synchronization point here that looks difficult to remove. One of the solutions is to do that check, but in another thread. Like, launch another thread that’s gonna do that check, where the CPU is gonna be blocked, but that’s okay because the main thread will continue executing the model. And maybe you will do a couple of extra iterations and your weights are bad – they will ultimately be NaNs – but that’s okay, because your program will be stopped by that thread. This is one example of something that’s a bit trickier to remove. The idea is that once you remove all those GPU-CPU synchronizations, your GPU is fed as fast as possible, and then the next step is you can try to compile your model to access this world of kernel fusion we talked about just before, to make it even faster. In the process you might also want to use different types for your floating point operations. Most models were trained for a long time in float32, but we discovered that for deep neural networks, float16 is actually kind of enough precision for the layers in the middle, as long as you do your sums in higher precision. For instance, when you do a matrix multiply, you can have the weights of both matrices be in float16 and still have results that are kind of correct, as long as you do the accumulation in float32. And that has led to NVIDIA introducing on their GPUs very efficient matrix multiplies for float16. Now it’s even float8, or even FP4 for the new generation of Blackwell GPUs that are going to be released soon.
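A rough sketch of that side-thread NaN check – the model, data, and helper names are placeholders made up for illustration, and a production version would need more care around streams and thread lifetime than is shown here.

```python
import threading
import torch

# Dummy stand-ins so the loop is runnable; the real model and data would come from the job.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(100)]

stop_training = threading.Event()

def check_loss_async(loss: torch.Tensor) -> None:
    # .item() forces a GPU->CPU sync, but only this helper thread blocks on it;
    # the main thread keeps dispatching work to the GPU in the meantime.
    def _check(detached: torch.Tensor) -> None:
        if bool(torch.isnan(detached).item()):
            stop_training.set()
    threading.Thread(target=_check, args=(loss.detach(),), daemon=True).start()

for step, (inputs, targets) in enumerate(train_loader):
    if stop_training.is_set():
        raise RuntimeError(f"Loss went NaN; stopping before step {step}.")
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    check_loss_async(loss)
```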
Is the float4 a real thing, or is that just a joke? (laughs)
I have no idea. It’s on the slides, so. I don’t know. (laughs)
I’m looking forward to float1.
(laughs) That sounds interesting.
(laughs) It’s either zero or one, or something, I don’t even know.
But yeah, without going as deep as that, float16 is really great because you can train as much as 2x or 4x faster, depending on the shapes of your problem, basically for free, by doing this mixed-precision thing, where some operations are computed in float16 and some are computed in float32. Just because you hit those pretty specialized matrix multiply units, and they do it really fast if the two matrices are in float16, or this variant called bfloat16 that was invented at Google, the “b” standing for “brain.”
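A minimal sketch of that mixed-precision pattern with PyTorch’s autocast and gradient scaling. It assumes a CUDA GPU with float16 tensor cores; the toy model and shapes are placeholders.

```python
import torch

device = "cuda"                                # mixed precision as described needs a GPU
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()           # rescales the loss so float16 grads don't underflow

for _ in range(10):
    x = torch.randn(256, 1024, device=device)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out = model(x)                         # the matmul runs in float16 on the tensor cores
        loss = out.float().pow(2).mean()       # reductions accumulate in float32
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```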
So how do you think about the programming language setup influencing the problem that you have of helping people build fast models? Like one thing you might imagine wanting is having this split between, “I am doing the inefficient thing, or, I am doing the efficient thing” be really explicit – so that instead of having to come to someone who knows a lot about performance, they could just look at their code and be like, “Huh, let me press the ‘is it fast’ button” …
(laughs)
And be like, “Oh yeah, it’s not fast here and it’s not fast here,” and then they could move things around until it was fast. But it sounds like that’s kind of not what’s going on. What’s going on is, everything kinda looks okay and you can run it and stuff, but it’s a little harder to figure out whether or not you’re doing the bad thing. So is there something to improve there to make it easier for the end users of the system to understand when they are doing the slow thing?
It’s kind of hard. And in that regard, PyTorch is actually better than TensorFlow, for instance, because it lets you explicitly manage what data is on the GPU and what data is on the CPU. You choose when you do the transfers. Unless, like, there is an instruction such as “if the loss is NaN,” like we talked about, which creates a transfer that is hidden. TensorFlow, for instance, does not even let you handle what is on the GPU and what’s on the CPU. It’s gonna take care of everything for you, because it is compiling everything and it decides for you, like, where your data is and how it moves. And so sometimes it can also result in inefficient code, just because the compiler decided that this line should be executed on CPU and this line should be executed on GPU, but that day, the compiler was wrong. So at least in PyTorch you can fix things, because you get more fine-grained control over stuff like that. And yeah, even someone who was, like, in love with Keras, for instance, would come around and be like, “Huh. This thing in PyTorch is really great. I can choose where my data is, and on which device, and, like, move it to where I want it to move, and it’s not going to move back unless I ask for it.”
So there’s like, two different dimensions along which you might want explicit control. One is about, “Am I doing a thing that can be put on the GPU or not?” And the other is, “Even if I am doing a thing that could be put on the GPU or could be put on the CPU, I can explicitly control where it goes.” And it sounds like PyTorch is more explicit on that second side, whereas the other approach just forces everything into the completely understood domain-specific language, but then the actual execution of that language has a bunch of compiler magic that you don’t get control over in an explicit way. And this echoes actually a bunch of stuff that we’re doing on the OCaml compiler side, where we are trying to do a lot of stuff to try and make OCaml faster, mostly by giving end users more explicit control, as opposed to making the compiler magically faster. Exactly for this reason of – when you’re trying to enable performance engineering, the key to the realm is control.
Which was also the idea behind Accelerate in some way. Like, the key was to give researchers back more control over the training loop, because they wanted to mess around with it. There is this idea that whether a synchronization is good or bad, we only see by profiling it. We are trying to do a better job here at Jane Street of ultimately profiling all the jobs, to identify issues and let researchers know, “Oh by the way, this particular model took very, very long to do this particular step. Are you sure that it’s implemented efficiently? Maybe you should profile it.” And we’re trying to give everyone an easy way to profile and look at traces to kind of identify the bottlenecks. When we have done all of that, sometimes researchers have some ideas that cannot be expressed with the building blocks that we have. And if they want to do something that doesn’t have a fast CUDA implementation already packaged in PyTorch, we need to dive deeper into the stack, so, like we mentioned, Triton, or writing CUDA directly. So yeah, sometimes this is needed just because there is a specific layer a researcher invented and they want to either try it or put it into production, and we need to make it as fast as possible.
Right and then there’s a couple of other interesting features from CUDA that I’ve heard you guys talk about a bunch. One of them is CUDA graphs and the other is CUDA streams.
Oh yeah.
How do those fit in?
So CUDA graphs is something that CUDA added and that was used by PyTorch before torch.compile. It’s been designed explicitly to remove that kernel launch overhead we talked about earlier, like when you’re trying to launch a lot of small kernels and you pay that overhead for each of the small launches. And so CUDA graphs is a technology that allows you to play that graph of kernels once, inefficiently, but it’s going to record all of those kernel launches, and the next time you replay that graph, it’s going to remove all of that overhead because it already knows what it has to dispatch; it can do that much more efficiently. So that technology is really useful to remove the overhead of launching a series of small kernels.
So it gives you like a lightweight form of fusion, where you’re not really changing any of the operations, you’re just taking all of the scheduling work and putting that on the hardware so you never have to go back to the CPU to do it, and you don’t do unnecessary copying of memory back and forth.
Exactly – though you don’t get the kernel fusion, which would give you the additional benefit of avoiding the memory transfers. You’re still doing those memory transfers. If kernel 2 requires something in memory from kernel 1, you still have, like, kernel 1 is going to write it and then kernel 2 is going to read it. The only way to remove that memory inefficiency is to fuse the two kernels, either by hand or using something like torch.compile. But you remove the launch overhead, which is really nice.
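A sketch of the capture-and-replay pattern as exposed in PyTorch. The toy model and shapes are placeholders; the warm-up on a side stream before capture follows the pattern in the PyTorch documentation.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).cuda().eval()
static_input = torch.randn(64, 512, device="cuda")

# Warm-up runs on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: the sequence of kernel launches is recorded once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# ...and replayed later with almost no CPU launch overhead.
new_batch = torch.randn(64, 512, device="cuda")
static_input.copy_(new_batch)     # reuse the same memory the graph was captured with
graph.replay()
result = static_output.clone()
```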
When you think about fusion in lots of programming languages, you can get rid of memory operations, but you can also sometimes get rid of other operations, right? You can do other, kind of, optimizations across the two kernels, where like if I know I’m going to do this set of things, maybe there’s some things that can be merged together. So are you also getting that computational benefit?
Sometimes, yeah. If your two kernels compute something in the middle that you didn’t really need, you can remove that when you fuse them together. Usually the benefits come more from avoiding memory transfers, but in some instances you can remove an intermediate step that wasn’t really needed and avoid computing it.
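As a small, hedged example of the kind of fusion being discussed – assuming PyTorch 2.x and a CUDA device – torch.compile can fuse a chain of pointwise operations so the intermediates never round-trip through GPU memory:

```python
# Illustrative only: in eager mode these pointwise ops run as separate kernels,
# each reading and writing intermediate tensors to GPU memory.
import torch

def biased_gelu(x, bias):
    y = x + bias
    return y * torch.sigmoid(1.702 * y)  # a common GELU approximation

compiled = torch.compile(biased_gelu)  # the compiler can fuse this chain

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = compiled(x, bias)  # first call compiles; later calls reuse the fused code
```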
Got it. Okay so that’s CUDA graphs. What are CUDA streams?
It’s a way of parallelizing stuff with CUDA. When you build those CUDA kernels and ask CUDA to execute them, it’s going to execute them sequentially. So, like, kernel 2 is only going to be executed when kernel 1 is fully finished on the GPU. CUDA streams are a way to parallelize that. If you have two kernels and you know they can run in parallel because they don’t touch the same data, you can put them in different streams and they will be executed in parallel – at least up to a certain limit. You shouldn’t use CUDA streams to run 100 things in parallel; NVIDIA told us it’s not a good idea, and it’s true that CUDA streams don’t really perform well at that scale. This API is exposed all the way up to PyTorch. So for instance, if I’m loading my data and I’m gonna put it on the GPU, and in parallel I would also like to compute some predictions on my previous batch of data, which is already on the GPU, I can use CUDA streams for that: you have one stream that does the compute and one stream where you transfer the data from the CPU to the GPU. If your model is written well, with no synchronization point, your GPU is fully utilized all the time, without any break.
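A minimal sketch of that copy/compute overlap using a side CUDA stream – `model` and `cpu_batches` are hypothetical placeholders, and in practice PyTorch data loaders handle much of this for you:

```python
# Illustrative only: overlap host-to-device copies with compute on the default
# stream by issuing the copies on a separate stream.
import torch

copy_stream = torch.cuda.Stream()

def run(model, cpu_batches):
    preds, current = None, None
    for cpu_batch in cpu_batches:
        # Start copying the next batch on the side stream...
        with torch.cuda.stream(copy_stream):
            staged = cpu_batch.pin_memory().to("cuda", non_blocking=True)

        # ...while the default stream computes on the previously copied batch.
        if current is not None:
            preds = model(current)

        # Don't let the default stream touch `staged` until the copy finishes.
        torch.cuda.current_stream().wait_stream(copy_stream)
        staged.record_stream(torch.cuda.current_stream())
        current = staged

    if current is not None:
        preds = model(current)  # process the final batch
    return preds
```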
So can I just think of CUDA streams as a coarse-grained threading protocol, where each of the threads themselves has lots of little mini threads on the inside?
Yeah, kind of. It’s more like a hint than a hard requirement. Like it’s hinting to the GPU you can run those two things in parallel and it’s safe. The GPU might choose not to do it, sometimes.
Okay so a lot of the different optimizations you’ve talked about here have been very focused on the GPU programming itself and the, kind of, connection between the CPU and GPU pieces. What other parts of the process end up needing to be thought about when you’re trying to get the maximum performance out of a training run?
We talked about CPU-GPU transfer and GPU programming. Networking, we talked a little bit about that as well – that is another part that is really important if you’re training on multiple GPUs: they need to communicate efficiently if you don’t want to be bottlenecked by that.
I think that I’ve seen us spend a bunch of time on this, thinking not just about making the fabric of the network efficient but also about organizing the data loading in a way that you’re not going to stall out. I mean, there’s this, kind of, general problem: the GPUs are these incredibly productive compute machines, and so they’re very hungry, right? They want data as fast as possible. What do you need to do to make sure you can keep them fed?
Yeah, yeah. Data loading is definitely an important subject, especially when you have data that is asymmetrical – you have examples in your training set that are really, really long, and examples that are really, really short, and you kind of need to batch them together to do one iteration of that stochastic gradient descent that we talked about before. There are lots of ways you can do that. For instance, you can just decide, “I’m going to take the long and the short together and I’m going to pad everything,” so that you have a bunch of zeros in your tensors after the short sequence has finished and until the end of the very long sequence – which consumes a lot of memory, so that is not super efficient. Or you can have some kind of representation of tensors where you concatenate everything together but keep the offsets at which each thing sits, which is a more efficient memory layout. But even then, when you do distributed training, it’s kind of sad if one GPU has to load a very, very long sample and the other GPUs have shorter samples: since they need to communicate to agree on which gradient is the right gradient, the GPUs with the very short samples are going to wait for the GPUs with the long samples for a long time. So you kind of need to organize your data with your distributed training in mind, so that each GPU has a balanced load and it all takes about the same time to process the samples – at least, when you have long samples, everyone has a long sample to load; otherwise it is pretty inefficient. But then it might impact your training accuracy, because you’re not fully shuffling your data set, you’re kind of doing a pseudo-shuffle where you still group things by size. So it’s kind of a trade-off between performance and accuracy, because you’re removing some degree of randomness in your shuffle of the data.
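A small illustrative sketch of the two options described – padding everything to the longest sample versus a length-grouped pseudo-shuffle; the sizes and names here are made up:

```python
# Illustrative only: two ways to batch wildly different sequence lengths.
import torch
from torch.nn.utils.rnn import pad_sequence

samples = [torch.randn(n) for n in (3, 500, 7, 480)]  # very uneven lengths

# Option 1: pad to the longest sample -> a (4, 500) tensor that is mostly zeros.
padded = pad_sequence(samples, batch_first=True)

# Option 2: a pseudo-shuffle that groups similar lengths, so short samples are
# batched together and, under distributed training, each GPU gets a comparable
# amount of work per step.
order = sorted(range(len(samples)), key=lambda i: samples[i].numel())
batches = [[samples[i] for i in order[j:j + 2]] for j in range(0, len(order), 2)]
```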
Yeah, one thing that’s maybe not obvious is that a lot of these algorithms are essentially structured around barrier synchronizations. “I’m gonna do a bunch of stuff, and then you’re gonna wait until everyone’s done, and then you’re gonna do a bunch of stuff and wait until everyone’s done.” Barrier synchronizations are super terrible if you have a lot of nondeterminism or just non-uniformity in the different pieces that are going into meeting that barrier. Because some people are going to get to the barrier first, and then you’re gonna wait on the others, and while you’re waiting you’re just not doing anything. Happily, GPUs are mostly pretty deterministic in terms of the amount of time it takes to do a given computation. But you have to feed it a computation of the same shape everywhere in order to get it to really neatly line up.
And we were also talking before, when you were asking me why PyTorch was the winner – I think one thing that really works in PyTorch’s favor is that when you have asymmetrical data, with different sizes in different batches, it’s way easier to code that in PyTorch, which is more flexible, because compiling that kind of thing is really, really hard. In TensorFlow or JAX, you kind of need to go to extreme lengths in your code to make your data the same shape again and then send it to your model. Whereas in PyTorch it’s really easy. You can just batch together small things, and then the next batch is going to be a very long thing, and PyTorch is still happy because it is eager and not compiled.
Right. I mean, I guess this is always the problem when you go to some simpler, more highly structured, domain-specific language, is like – there are some things it’s good at expressing and there’s some things it’s bad at expressing. When you want to do the thing that it’s bad at expressing, you can just be in a world of hurt.
Yeah. Exactly.
You know, you’ve spent a bunch of time in your career working on various, kind of, open source training, machine learning ecosystems, and you now spend a lot of time internally working in our world. I think it’s fair to say we are in various ways more immature than lots of other organizations. I think at the time where Google was already designing their own custom hardware for efficiently evaluating neural net models, we weren’t really using neural net models at all. Like, I think all of this effort on our side has really spun up in the last few years. We’ve been doing various kinds of statistically driven inference of trading strategies for as long as I’ve been at Jane Street. Like, that’s the first job I had like 21 years ago, was doing various kinds of optimizations and model fitting stuff – but very different models and didn’t have any of the same performance shape, and so all of our tooling around this is relatively new. And I’m kind of curious, for you, as someone who’s seen stuff in the outside world and seen our ecosystem here, what are the things that you see as the big gaps? What are the kind of things that don’t work as well here as they should, and that you want to see us improve and that you want to work on?
One nice aspect of the fact that we’re newer to this machine learning stuff is that people are not necessarily aware of the things that are not performant, and they make a lot of mistakes when writing the code. So it’s really easy for me to come in and, like, spend a couple of hours on a project and be like, “Oh yeah, no, it’s gonna train four times faster, and you just have to transpose five lines of code.” So it makes my job very easy in that regard. (laughs)
(laughs)
Sometimes it’s a little bit more difficult than that. But yeah, there have been a couple of instances where optimizing a given training run was really, really easy, just by profiling it once, because of this. We should improve our infrastructure around training loops in general – making the training infrastructure work better for researchers – because we’re kind of making the same mistakes other people already made in the open source world: these giant training loops with lots of spaghetti code that researchers end up not willing to use because they can’t modify what’s inside of them. It feels like sometimes we have that same problem internally as well.
So do you think we need to, sort of, do morally the same thing that Accelerate did of trying to build a set of libraries that are highly configurable and modular, instead of having, like, one training loop that everyone uses – make it easy for people to build their own training loops for different use cases?
Yeah, especially since people here are very smart and really like to hack things together. It feels like a better solution for them. The magic training loop where you press play and your model trains has its appeal, which I can understand for people who are less familiar with machine learning. But for people who are deeply familiar with all the internals of machine learning and want to do deep research into, like, every part of a training loop, they need something that’s akin to Accelerate, where you just have small composable building blocks that are very easy to use, and not this giant black box with, like, 150 arguments that you have to pass in the correct order.
Yeah, that’s terrible. (laughs)
I’m talking about other training APIs, like, not giving a hard time to any engineers here. (laughs) That does not exist internally.
Right. Certainly, there are pieces of code that we have, you know, some of which I’ve written, that have the property of, “You have hard-coded a bunch of concrete behaviors into it, and it has become ossified and hard to change.” And it’s certainly a problem that shows up. Maybe it’s worth just saying, like, a few words about the way in which the problems we solve here are different from problems solved in the outside world, and maybe just to talk about, like, what role machine learning actually has here. So just to say some very high-level things: we use machine learning for a bunch of stuff. We use it for some, kind of, general purposes in the way that any organization might. We have a whole AI Assistants team whose job is to try and leverage various AI techniques for building various kinds of automations, a lot of it focused around LLMs and coding assistance, but not just that. So that’s, like, one kind of use case. And then we have a bunch of use cases that are very focused on trading. Even inside of the trading world, I think there’s, kind of, two major streams of applications. There’s, “We are going off and trying to extract data from the outside world in order to inform our trading,” where we’re using the same kind of data that has already been shown to be a good target for standard machine learning techniques. So maybe we wanna get data out of images, or geospatial data, or text data. There are all sorts of published models and published architectures out there that you can use for this, and we are, like, happy to leverage, and fine-tune, and exploit those existing things. And there, the work that we end up doing looks a lot like the work that people do on machine learning in the outside world. And then, in some sense, the magic is more about, “How do we pick the data that we’re going to apply it to? And how do we integrate that into the decisions we’re making on the trading side?” And then there are places where we are applying machine learning techniques to trading data itself, like the data that you get from exchanges, and various alternative sources of data that can inform that – and I’m kind of curious, like, how do you think of that set of data as being different from the kind of data that you typically see in the larger world of machine learning?
The data is much noisier. So it’s way harder to train good models on it, just because the signal you can extract from it is actually way weaker than in something very structured like text or images. Very often, you will never get the same kind of accuracy as what you get on computer vision and on text. But even extracting a very small amount of signal can still lead to good trading strategies. So you can still get valuable feedback from that kind of data.
It’s maybe worth saying there are fundamental reasons why trading data sets are noisy: the fundamental process of trading is one where, when there’s a signal, people trade that signal, and that signal kind of gets removed from the data, essentially. And so, to a first approximation, the time series of the prices of a security looks kind of like a random walk. There’s a little bit of signal in there, but it really is mostly noise. And so whatever your training technique is, it has to be able to deal with the fact that there’s a ton of noise in the labels, essentially.
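One way to write down that picture, purely as an illustrative sketch: the next price is the current price plus a small predictable component and a much larger noise term.

```latex
% Sketch: s_t is the small predictable piece, \varepsilon_t the noise with much
% larger spread, so labels built from future returns are dominated by the noise.
\[
  p_{t+1} \;=\; p_t + s_t + \varepsilon_t,
  \qquad \lvert s_t \rvert \;\ll\; \operatorname{sd}(\varepsilon_t).
\]
```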
And so, yes, that’s one aspect of it, very noisy. And as you said, it also changes and reacts to the way you use it. So, in contrast with text, for instance – BERT was released in 2018, and if you use it right now, it’s still as good as it was in 2018. The same is definitely not true for a model that you train on market data. The kind of strategies that you could run a couple of years ago won’t necessarily work right now, just because the market has reacted to them and adapted to them. So you kind of need to come up with new modeling strategies all the time and reinvent yourself. Another aspect of that data that is different from the rest of the outside world is that it’s huge. We have massive amounts of market data – I think it’s a couple of terabytes per day. So you multiply that by the number of days in a couple of years, and yeah, you have a massive amount of data to feed your model. That brings its own challenges in terms of data loading, making sure it’s efficient and that the GPU gets saturated.
Right. And in practice, the model sizes that we tend to do are – tend to be smaller than, like, the sizes of the very largest language models. And so, the overall, like, ratios of flops per byte are just very, very different. And so the things that people are optimizing for in the designs of the GPUs and the designs of the network are often not exactly the thing that we’re seeing.
We have to do a whole bunch of research, basically. We can’t just rely on what’s been done for other kinds of modalities like text or images. We need to reinvent new models that are adapted to market data, and new ways of loading that data and keeping the GPU fed as much as possible. And yeah, sometimes we care about algorithms that are, like, completely different from the one NVIDIA or PyTorch care about because they’re not necessarily used in LLMs and everyone is all about LLMs these days. It is a good amount of, like, programming (laughs) to do in terms of GPU performance, so.
Yeah, and I think it’s actually an exciting part of the machine learning world here in general, that there’s a wide variety of work in coming up with and experimenting with new architectures and new models and new ways of applying them to datasets, where it’s like – there just aren’t a lot of papers telling you how to, like, analyze financial time series, because the people who are good at that mostly don’t publish papers.
I wonder why. (laughs)
(laughs) Another thing that I think comes up, which is interesting, is just inference times, right? So we care about using these things in the context of trading, and that, in general, the level of speed at which we care about responsiveness to some input is sometimes very, very small. It can vary by orders of magnitude depending on the application. Like sometimes we care about turning around a packet in literally 100 nanos, and sometimes a few microseconds, and sometimes a few hundred microseconds is slow enough, sometimes milliseconds. There are some kinds of machine learning problems where, like, “Oh, getting an inference once an hour would be great. That’s, like, all we need, and sometimes even less than that.” So you just have a very wide variety of inference times you need, and at the very low end of the scale, it’s nothing anyone else cares about.
Yeah, that’s also why, as you were mentioning before, our models come in various sizes. Some are very small because we want them to run very, very fast. But even if they are small, there are challenges in making sure they can run in the timeframe we need, so that we are as low latency as possible.
Right, and just to keep up with the data rate.
Yeah, because, like, there can be a million events in a single stock in one day. So if you’re not fast enough to, like, just process them, it might not be the case that we need the prediction very fast. But sometimes if you want to keep up and not, like, get behind too much, you just need to be a couple of mics per event and not much more than that.
So those are a whole bunch of differences between the kinds of problems we look at, and what you’ve seen in other places. How do you think that influences the tooling we build? Are there ways in which we need to build different kinds of machine learning tools in response to the ways in which the shape of the problems are different?
Yeah, we talked about data loading, for instance, that comes with its own challenges. So like, obviously we have developed a ton of custom data loading utilities that we can use to make this faster. We also talked about models that are not necessarily the same ones as everyone else cares about for other, kind of, well-studied modalities like text or image. So we have a lot of custom models written internally that we have found work well that we keep trying to optimize for training and inference. So this is a bunch of exciting work. Like, the rest of the training is mostly the same as in any other machine learning job. Like, stochastic gradient descent has not changed. (laughing)
Yeah.
It’s the same algorithm.
And the same algorithm it was 40 years ago.
Yeah. (laughing)
So, another thing that you’ve done a lot of in your career is education, right? You were a math teacher for a bunch of years, and then you did a lot of education work, both at fast.ai and at Hugging Face. So you’re also involved some in the education story here. Can you tell us a little bit more about what that looks like?
Yeah. Jane Street is trying to up its machine learning game, both by hiring more people who do machine learning and by educating the existing people on machine learning. And we talked a little bit before about fast.ai and how it was important to, like, make radiologists, for instance, competent at machine learning so that they can do radiology better with machine learning. It’s the same here. We need to educate traders about machine learning, so that they can do better trading using machine learning, and can inform the choice of models that machine learning researchers then pick, because they know the data very well and they are kind of domain experts. So we have a boot camp that we run, like, every couple of months, with either traders or researchers that are not super familiar with machine learning. And we try to bring them up to speed with the latest techniques, both from the outside world and from inside Jane Street.
You mentioned this point of, in part, wanting to teach people who are not going to be machine learning modelers as, like, their primary job – people who are experts in other aspects of the trading problem – and have them understand more about machine learning. That’s one goal. I think it’s also the case that you can teach people the machine learning stuff itself – like, it can’t be that hard to learn modern machine learning because, in some ways, modern machine learning is 10 years old.
I was saying, like, make them into domain experts so they can help better, but some of them actually end up training models and doing a lot of machine learning themselves. It depends on whether they like it or not, because machine learning is a bit like cooking: you throw a bunch of stuff in, then you let your model training stir for a while, and you see if it was good or not. It’s not the same kind of thing as, like, just programming some training system. So, some people like it, some people really (laughs) dislike it.
Yeah, that makes total sense. So there’s a lot of things you’re trying to convey to people when you are running these classes and these courses. What are things that you find hard to get across?
The point that’s most difficult to get across to people is that, yeah, no one knows anything about machine learning.
(laughs)
Like, it’s really just cooking, in a sense. We still don’t know why neural nets generalize so well. We have a little bit of theory explaining why they are able to fit the training data, but why are they any good on samples outside of training? We still don’t know why they are so good at generalizing. And in general, you can try to build a little bit of intuition about what to try to fix a given kind of problem, like, “I’m overfitting, so I’m gonna try this regularization technique. Maybe that will help.” But yeah, there’s always that maybe – it’s not until you’ve tried it that you know for sure that the thing is gonna work. So, this is really hard to convey. And then I try to get people very disciplined about reproducibility. One mistake that beginners in machine learning make all the time is they train a model and then forget what they did to train that model. And so, like, two months later: “Oh, I did train that model. It was good. I should try to retrain it and maybe use it.” And they never manage to reproduce their initial results, just because they didn’t write down all of the stuff that was needed to train that model. Those two points are really difficult to make, because of course, I guess you can’t fully understand them until you have gone through the pain (laughs) of both of them.
You don’t understand the importance of reproducibility until you’ve gone through your first reproducibility crisis.
Yeah, exactly. (laughing) Then you fully understand why it’s so important to save absolutely everything down to the revision of the code so that you can run the exact same thing at another time.
Yeah, and I think some of the reproducibility stuff is about getting people on board with the discipline required. We’ve talked a lot about technology that meets the researchers where they are and makes it easy for them to express what they want. But there’s some part of it, if you wanna be a good researcher, you actually just need an enormous amount of discipline, because the tools are somewhat imperfect and you just have to be really careful to get that reproducibility story right. And at the same time, I think there’s also a lot of work we can do on the tool side to make reproducibility easier, right? I think it’s complicated in part because the overall ecosystem is complicated. Just managing Python packages is shockingly complicated.
And making sure you didn’t have an upgrade of a random package that broke everything. Yeah, that’s all really difficult. And then making sure your code is checked out and that you know the revision of the code that you are using. That’s another thing, because you can change a small line of code in your model and think, “Oh, this is totally harmless,” but then it actually destroys the performance of your model, because it was a key ingredient in your cooking recipe and you hadn’t realized that. So yeah, making sure that your code is still there. And the last thing is, training involves a certain number of hyperparameters. Usually people write all of that in some kind of config. So make sure that you save that config somewhere, so that when you want to reproduce your training, you actually know that you had used this learning rate, this batch size, et cetera, et cetera.
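A minimal sketch of recording everything mentioned here – the hyperparameter config, the code revision, and the library versions. The file name and fields are made up for illustration:

```python
# Illustrative only: save what's needed to reproduce a training run.
import json
import subprocess
import sys
import torch

def save_run_metadata(config: dict, out_path: str = "run_metadata.json") -> None:
    metadata = {
        "config": config,  # learning rate, batch size, seed, ...
        "git_revision": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "python_version": sys.version,
        "torch_version": torch.__version__,
    }
    with open(out_path, "w") as f:
        json.dump(metadata, f, indent=2)

save_run_metadata({"learning_rate": 3e-4, "batch_size": 256, "seed": 17})
```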
I guess another fun source of nondeterminism is to the degree that you’re doing your research in Python notebooks. The fact that you can run Python cells in Python notebooks in arbitrary orders, and if you do this in the wrong way, you can end up computing a result that you just, like, have no record of exactly where that result came from.
Yeah, (laughs) that’s another kind of fun. Fortunately, notebooks are still a bit difficult to check into any kind of repo, so usually people move away from notebooks by the time it’s time to check the code into some kind of infrastructure, and this issue kind of disappears. But it’s true that while you are experimenting, this is another fun source of irreproducibility. And then you have GPUs being non-deterministic at a fundamental level because they are heavily parallel. So that’s always so fun – like, when you’re trying to debug exactly why your floating-point result at the end is not the same thing as what you were expecting, just because floating-point arithmetic is not associative, and GPUs have many threads which may finish in any kind of random order.
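Two tiny illustrations of those points – floating-point addition is not associative, and PyTorch can be asked to insist on deterministic kernels, usually at some cost in speed:

```python
# Illustrative only: non-associative floats, and PyTorch's determinism knobs.
import torch

a, b, c = 1e8, -1e8, 1e-3
print((a + b) + c == a + (b + c))  # False: the two orders round differently

torch.manual_seed(17)                     # fix RNG state
torch.use_deterministic_algorithms(True)  # error on known-nondeterministic ops
# Some CUDA ops additionally need CUBLAS_WORKSPACE_CONFIG=:4096:8 in the env.
```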
GPU training is, in some sense, non-deterministic because it’s in parallel. But it’s also, in some sense, non-deterministic just because we can tolerate it. You could do things highly in parallel and make sure that you’re always doing things in the same order and do stuff to preserve the determinism–
It’s usually at a huge cost in performance, but yeah.
– but it’s at a huge cost in performance. And it doesn’t totally matter, right? That’s actually one of the interesting things about machine learning: because you’re doing this kind of soft numerical optimization process, you can just take some error. And actually, a lot of the early research in various places used what’s called, kind of, hogwild concurrency, where you just had shared model weights and threads checking them out, producing new gradients, and updating them. And were they data races? Yes, they were. They were.
(laughs)
And like, it was kind of okay. But I think over time, that’s fallen somewhat into disfavor, because it’s just even harder to predict what’s going on.
Yes. It’s completely unreproducible. So you can end up with a model that’s really good, but you have no idea why, and you’re never able to reproduce it. So, that’s a bit annoying. (laughs)
Anyway, this was a lot of fun. Thanks for joining me.
Thanks a lot for having me.
You’ll find a complete transcript of the episode, along with links to some of the things that we discussed, at signalsandthreads.com. Thanks for joining us, and see you next time.