All Episodes

Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.

Why ML Needs a New Programming Language

with Chris Lattner

Season 3, Episode 10   |   September 3rd, 2025

BLURB

Chris Lattner is the creator of LLVM and led the development of the Swift language at Apple. With Mojo, he’s taking another big swing: How do you make the process of getting the full power out of modern GPUs productive and fun? In this episode, Ron and Chris discuss how to design a language that’s easy to use while still providing the level of control required to write state-of-the-art kernels. A key idea is to ask programmers to fully reckon with the details of the hardware, but to make that work manageable and shareable via a form of type-safe metaprogramming. The aim is to support specialization both to the computation in question and to the hardware platform. “Somebody has to do this work,” Chris says, “if we ever want to get to an ecosystem where one vendor doesn’t control everything.”


TRANSCRIPT

00:00:03

Ron

Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack, from Jane Street. I’m Ron Minsky. It is my great pleasure to have Chris Lattner on the show. Typically on Signals and Threads, we end up talking to engineers who work here at Jane Street, but sometimes we like to grab outside folk, and Chris is an amazing figure to bring on because he’s been so involved in a bunch of really foundational pieces of computing that we all use—LLVM, and Clang, and MLIR, and OpenCL, and Swift, and now Mojo. And this has happened at a bunch of different storied institutions—Apple, and Tesla, and Google, and SiFive, and now Modular. So anyway, it’s a pleasure to have you joining us, Chris.

00:00:43

Chris

Thank you, Ron. I’m so happy to be here.

00:00:45

Ron

I guess I want to start by just hearing a little bit more about your origin story. How did you get into computing and how did you get into this world of both compiler engineering and programming language design?

00:00:54

Chris

So I grew up in the ’80s and back before computers were really a thing. We had PCs, but they weren’t considered cool. And so I fell in love with understanding how the computer worked. And back then, things were way simpler. I started with a BASIC interpreter, for example, and you’d get a book from the store. Remember when we had books? [laughs] And you’d learn things from books?

00:01:14

Ron

Did you do the thing where you’d get the hobbyist magazine and copy out the listing of the program?

00:01:19

Chris

That’s exactly right. And so we didn’t have vibe coding, but we did have books. And so just by typing things in, you could understand how things work, and then when you broke it—because inevitably you’re typing something in and you don’t really know what you’re doing—you have to figure out what went wrong and so it encouraged a certain amount of debugging. I really love computer games. Again, back then, things were a little bit simpler. Computer games drove graphics and performance and things like this. And so I spent some time on these things called bulletin board systems and the early internet reading about how game programmers are trying to push the limits of the hardware. And so that’s where I got interested in performance and computers and systems. I went on to college and had an amazing professor at my school, shout out to University of Portland in Portland, Oregon, and he was a compiler nerd.

And so, I think that his love for compilers was infectious. His name was Steven Vegdahl, and that caused me to go on to pursue compilers at the University of Illinois. And there again, I continued to fall down this rabbit hole of compilers and systems, and built LLVM. And ever since I got into the compiler world, I loved it. I love compilers because they’re large-scale systems; there are multiple different components that all work together. And in the university setting, it was really cool in the compiler class, because unlike most of the assignments where you do an assignment, turn it in, forget about it—in compilers, you would do an assignment, turn it in, get graded, and then build on it. And it felt much more realistic, like software engineering, rather than just doing a project to get graded.

00:02:35

Ron

Yeah, I think for a lot of people, the OS class is their first real experience of doing a thing where you really are building layer on top of layer. I think it’s an incredibly important experience for people as they start engineering.

00:02:44

Chris

It’s also one where you get to use some of those data structures. I took this, almost academic, here’s what a binary tree is, and here’s what a graph is. And particularly when I went through it, it was taught from a very math-forward perspective, but it really made it useful. And so that was actually really cool. I’m like, ‘Oh, this is why I learned this stuff.’

00:02:59

Ron

So one thing that strikes me about your career is that you’ve ended up going back and forth between compiler engineering and language design space, whereas I feel like a lot of people are on one side or the other—they’re mostly compilers people and they don’t care that much about the language, and just, how do we make this thing go fast? And there are some people who are really focusing on language design and the work on the compiler is a secondary thing towards that design. And you’ve both popped back and forth. And then also a lot of your compiler engineering work, really starting with LLVM, in some sense is itself, very language-forward. With LLVM, there’s a language in there that’s this intermediate language that you’re surfacing as a tool for people to use. So I’m just curious to hear more about how you think about the back and forth between compiler engineering and language design.

00:03:39

Chris

The reason I do this is that effectively, my career is following my own interests. And so my interests are not static. I want to work on different kinds of problems and solve useful problems and build into things. And so the more technology and capability you have, the higher you can reach. And so with LLVM, for example, built and learned a whole bunch of cool stuff about deep code generation for an X86 chip and that category of technology with register allocation, stuff like this. But then it made it possible to go, say, let’s go tackle C++ and let’s go use this to build the world’s best implementation of something that lots more people use and understand than deep backend code generation technology. And then with Swift, it was, build even higher and say, ‘Okay, well C++, maybe some people like it, but I think we can do better and let’s reach higher.’ I’ve also been involved in AI systems, been involved in building an iPad app to help teach kids how to code. And so, lots of different things over time. And so for me, the place I think I’m most useful and where a lot of my experience is valuable ends up being at this hardware-software boundary.

00:04:36

Ron

I’m curious how you ended up making the leap to working on Swift. From my perspective, Swift looks from the outside, like one of these points of arrival in mainstream programming contexts of a bunch of ideas that I have long thought are really great ideas in other programming languages. And I’m curious, in some ways a step away from like, oh, I’m going to work on really low-level stuff and compiler optimization, and then we will go much higher level and do a C++ implementation, which is still a pretty low level. How did the whole Swift thing happen?

00:05:00

Chris

Great question. I mean, the timeframe for people that aren’t familiar is that LLVM started in 2000. So by 2005, I had exited university and I joined Apple. And so LLVM was an advanced research project at that point. By the 2010 timeframe, LLVM was much more mature and we had just shipped C++ support in Clang, and so it could bootstrap itself, which means the compiler could compile itself. It’s all written in C++, it could build advanced libraries like the Boost template library, which is super crazy advanced template stuff. And so the C++ implementation that I and the team had built was real. Now, C++ in my opinion, is not a beautiful programming language. And so implementing it is a very interesting technical challenge. For me, a lot of problem-solving ends up being, how do you factor the system the right way?

And so Clang has some really cool stuff that allowed it to scale and things like that, but I was also burned out. We had just shipped it. It was amazing. I’m like, there has to be something better. And so, Swift really came starting in 2010. It was a nights and weekends project. It wasn’t like top-down management said, ‘Let’s go build a new programming language.’ It was ‘Chris being burned out’—I was running a 20 to 40 person team at the time, being an engineer during the day, and being a technical leader, but then needing an escape hatch. And so I said, ‘Okay, well, I think we can have something better. I have a lot of good ideas. Turns out, programming languages are a mature space. It’s not like you need to invent pattern matching at this point. It’s embarrassing that C++ doesn’t have good pattern matching.’

00:06:23

Ron

We should just pause for a second, because I think this is like a small but really essential thing. I think the single best feature coming out of languages like ML in the mid-seventies is, first of all, this notion of an algebraic data type, meaning every programming language on earth has a way of saying this and that and the other, a record, or a class, or a tuple.

00:06:38

Chris

A weird programming language, I think it was Barbara Liskov?

00:06:41

Ron

Yeah. And she did a lot of the early theorizing about, ‘What are abstract data types?’ But the ability to do this or that or the other, to have data types that are a union of different possible shapes of the data—and then having this pattern matching facility that lets you basically in a reliable way do the case analysis so you can break down what the possibilities are—is just incredibly useful. And very few mainstream languages have picked it up. I mean Swift again is an example, but languages like ML, SML, and Haskell, and OCaml—

00:07:09

Chris

Standard!

00:07:10

Ron

That’s right. SML. Standard ML. It’s been there for a long time.

00:07:12

Chris

I mean pattern matching, it is not an exotic feature. Here we’re talking about 2010. C# didn’t have it. C++ didn’t have it. Obviously Java didn’t have it. I don’t think JavaScript had it. None of these mainstream languages had it, but it’s obvious. And so part of my opinion about that—and by the way, I’m an engineer, I’m not actually a mathematician, and so type theory goes way over my head. I don’t really understand this. The thing that gets me frustrated about the academic approach to programming languages is that people approach it by saying there’s sum types, and there’s intersection types, and there’s these types, and they don’t start from utility forward. And so pattern matching, when I learned OCaml, it’s so beautiful. It makes it so easy and expressive to build very simple things. And so to me, I always identify with the utility, and then yes, there’s amazing formal type theory behind it, and that’s great and that’s why it actually works and composes. But bringing that stuff forward and focusing on utility and the problems it solves, and how it makes people happy, ends up being the thing that I think moves the needle in terms of adoption, at least in the mainstream.
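For a concrete flavor of the case analysis being discussed, here is a minimal sketch in Python 3.10+, which has structural pattern matching via `match`. The `Circle`/`Square`/`Rect` names are made up for illustration; the point is the union of possible shapes of the data plus a reliable case analysis over them.

```python
# A union ("this or that or the other") of possible shapes of the data,
# plus pattern matching to break down the cases. Illustrative sketch only.
from dataclasses import dataclass


@dataclass
class Circle:
    radius: float


@dataclass
class Square:
    side: float


@dataclass
class Rect:
    width: float
    height: float


Shape = Circle | Square | Rect  # the algebraic-data-type-style union


def area(shape: Shape) -> float:
    match shape:
        case Circle(radius=r):
            return 3.141592653589793 * r * r
        case Square(side=s):
            return s * s
        case Rect(width=w, height=h):
            return w * h
        case _:
            raise TypeError(f"not a Shape: {shape!r}")


print(area(Rect(width=2.0, height=3.0)))  # 6.0
```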

00:08:09

Ron

Yeah, I mean I think that’s right. My approach, and my interest in languages, is also very much not from the mathematical perspective, although my undergraduate degree is in math. I like math a lot, but I mostly approach these things as a practitioner. But the thing I’ve been struck by over the years is that the value of having these features rest on a really strong mathematical foundation is that they generalize and, as you were saying, compose much better. If they are in the end mathematically simple, you’re way more likely to have a feature that actually pans out as it gets used way beyond your initial view as to what the thing was for.

00:08:39

Chris

That’s right. This is actually a personal defect because I don’t understand the math in the way that maybe theoretically would be ideal. I end up having to rediscover certain truths that are obvious. The cliche, ‘If the Russian mathematician invented it 50 years ago…’ And so a lot of what I find is that I can find truth and beauty when things compose and things fit together, and often I’ll find out it’s already been discovered because everything in programming languages has been done. There’s almost nothing novel, but still that design process of saying, let’s pull things together, let’s reason about why it doesn’t quite fit together. Let’s go figure out how to better factor this. Let’s figure out how to make it simpler. That process, to me, is kind of like people working on physics, [from what] I hear: the simpler the outcome becomes, the closer to truth it feels. And so I share that—and maybe it’s more design gene or engineer-design combination, but it’s probably what you mathematicians actually know inherently, and I just haven’t figured it out yet.

00:09:33

Ron

Do you find yourself doing things after you come to it from an engineering perspective, trying to figure out whether there are useful mathematical insights? Do you go back and read the papers? Do you have other PL people who are more mathematically oriented who you talk to? How do you extend your thinking to cover some of that other stuff?

00:09:47

Chris

See, the problem is math is scary to me. So I see Greek letters and I run away. I do follow arXiv and things like this, and there’s a programming language section on that. And so I get into some of it, but what I get attracted to in that is the examples and the results section and the future-looking parts of it. And so it’s not necessarily the ‘how,’ it’s the ‘what it means.’ And so I think a lot of that really speaks to me. The other thing that really speaks to me when you talk about language design and things like this is blog posts from some obscure academic programming language that I’ve never heard of. You just have somebody talking about algebraic effect systems for this and that and the other thing, or something really fancy, but they figure out how to explain it in a way that’s useful. And so when it’s not just, ‘Let me explain to you the type system,’ but it’s, ‘Let me explain this problem this fancy feature enables,’ that’s where I get excited. That’s where it speaks to me because, again, I’m problem-oriented, and having a beautiful way to express and solve problems, I appreciate.

00:10:38

Ron

I think there’s a lot of value in the work that’s done in papers of really working out in detail the theory and the math and how it all fits together. [And] I think the fact that the world has been filled with a lot of interesting blog posts from the same people has been great because I think it’s another modality where it often encourages you to pull out the simpler and easier-to-consume versions of those ideas. And I think that is just a different kind of insight and it’s valuable to surface that too.

00:10:59

Chris

And also when I look at those blog posts, sometimes they have a design smell. Particularly in the C++ community, there’s a lot of really good work to fix C++. They’re adding a lot of stuff to it, and C++ will never get simpler—you can’t really remove things, right? And so a lot of the challenge there is, it’s constrained problem-solving. And so when I look at that, often what I’ll see when I’m reading one of those posts—and again, these are brilliant people and they’re doing God’s work trying to solve problems with C++, best of luck with that—but you look at that and you realize there’s a grain of sand in the system that didn’t need to be there. And so to me, it’s like if you remove that grain of sand, then the entire system gets relaxed and suddenly all these constraints fall away and you can get to something much simpler. Swift, for example, it’s a wonderful language and it’s grown really well and the community is amazing, but it has a few grains of sand in it that cause it to be a lot more complicated. And so this is where I’m not just happy with things that got built. LLVM is amazing, it’s very practical, but it has lots of problems. That’s why when I get a chance to build a next generation system, I want to learn from that and actually try to solve these problems.

00:11:56

Ron

So this is the great privilege of getting to work on a new language, which is a thing you’re doing now. There’s this new language called Mojo, and it’s being done by this company that you co-founded called Modular. Maybe just so we understand the context a little bit, can you tell me a little bit about, what is Modular? What’s the basic offering? What’s the business model?

00:12:12

Chris

Before I even get there, I’ll share more of how I got here. If you oversimplify my background, I did this LLVM thing, and it’s foundational compiler technology for CPUs. It helped unite a lot of CPU-era infrastructure and it provided a platform for languages like Swift, but also Rust, and Julia, and many different systems that all got built on top of it, and I think it really catalyzed and enabled a lot of really cool applications of accelerated compiler technology. People use LLVM in databases and for query engine optimization, lots of cool stuff. Maybe you use it for trading or something. I mean, there can be tons of different applications for this kind of technology—and then [I] did programming language stuff with Swift. But in the meantime, AI happened. And AI brought this entirely new generation of compute: GPUs, tensor processing units, large-scale AI training systems, FPGAs, and ASICs and all this complexity for compute, and LLVM never really worked in that system.

And so one of the things that I built when I was at Google was a bunch of foundational compiler technology for that category of systems. And there’s this compiler technology called MLIR. MLIR is basically LLVM 2.0. And so take everything you learn from building LLVM and helping solve this, but then bring it forward into this next generation of compiler technology so that you can go hopefully unify the world’s compute for this GPU and AI and ASIC kind of world. MLIR has been amazingly successful, and I think it’s used in roughly every one of these AI systems and GPUs. It’s used by Nvidia, it’s used by Google, it’s used by roughly everybody in this space. But one of the challenges is that there hasn’t been unification. And so you have these very large-scale AI software platforms. You have CUDA from Nvidia, you have XLA from Google, you have ROCm from AMD.

It’s countless. Every company has their own software stack. And one of the things that I discovered and encountered, and I think the entire world sees, is that there’s this incredible fragmentation driven by the fact that each of these software stacks built by a hardware maker are just all completely different. And some of them work better than others, but regardless, it’s a gigantic mess. And there’s these really cool high-level technologies like PyTorch that we all love and we want to use. But if PyTorch is built on completely different stacks, and stitching together these megalithic worlds from different vendors, it’s very difficult to get something that works.

00:14:17

Ron

Right. There are both complicated trade-offs around the performance that you get out of different tools and then also a different set of complicated trade-offs around how hard they are to use, how complicated it is to write something in them, and then what hardware you can target from each individual one. And each of these ecosystems is churning just incredibly fast. There’s always new hardware coming out and new vendors in new places, and there’s also new little languages popping up into existence, and it makes the whole thing pretty hard to wrangle.

00:14:42

Chris

Exactly. And AI is moving so fast. There’s a new model every week. It’s crazy. And new applications, new research, the amount of money being dumped into this by everybody is just incredible. And so how does anybody keep up? It’s a structural problem in the industry. And so the structural problem is that the people doing this kind of work, the people doing code generation for advanced GPUs and things like this, they’re all at hardware companies. And the hardware companies, every single one of them is building their own stack because they have to. There is nothing to plug into. There’s nothing like ‘LLVM but for AI,’ that doesn’t exist. And so as they go and build their own vertical software stack, of course they’re focused on their hardware, they got advanced roadmaps, they have a new chip coming out next year, they’re plowing their energy and time into solving for their hardware. But we, out in the industry, we actually want something else. We want to be able to have software that runs across multiple pieces of hardware. And so, if everybody doing the work is at a hardware company, it’s very natural that you get this fragmentation across vendors because nobody’s incentivized to go work together. And even if they’re incentivized, they don’t have time to go work on somebody else’s chip. AMD is not going to pay to work on Nvidia GPUs or something like this.

00:15:45

Ron

That’s true when you think about this, kind of, a split between low-level and high-level languages. So Nvidia has CUDA and AMD has ROCm, which is mostly a clone of CUDA, and then the XLA tools from Google work incredibly well on TPUs, and so on and so forth. Different vendors have different things. Then there’s the high-level tools, PyTorch, and JAX, and Triton, and various things like that. And those are typically actually not made by the hardware vendors. Those are made by different kinds of users—I guess Google is responsible for some of these and they’re also sometimes a hardware vendor—but a lot of the time it’s more stepped back. Although even there, the cross-platform support is complicated and messy and incomplete.

00:16:22

Chris

Because they’re built on top of fundamentally incompatible things. And so that’s the fundamental nature. And so again, you go back to Chris’s dysfunction and my weird career choices, I always end up back at the hardware-software boundary, and there’s a lot of other folks that are really good at adding very high-level abstractions. If you go back a few years ago, MLOps was the cool thing, and it was, ‘Let’s build a layer of Python on top of TensorFlow and PyTorch and build a unified AI platform.’ But the problem with that is that building abstractions on top of two things that don’t work very well can’t solve performance, or reliability, or management, or these other problems. You can only add a layer of duct tape, but as soon as something goes wrong, you end up having to debug this entire crazy stack of stuff that you really didn’t want to have to know about.

And so it’s a leaky abstraction. And so the genesis of Modular (bringing it back to this) was realizing there are structural problems in the industry. There is nobody that’s incentivized to go build a unifying software platform and do that work at the bottom level. And so what we set off to do is we said, ‘Okay, let’s go build…’—and there’s different ways of explaining this. You could say ‘a replacement for CUDA,’ that’s like a flamboyant way to say this, but ‘let’s go build a successor to all of this technology that is better than what the hardware makers are building, and is portable.’ And so what this takes, is doing the work that these hardware companies are doing, and I set the goal for the team of saying, let’s do it better than, for example, Nvidia is doing it for their own hardware.

00:17:38

Ron

Which is no easy feat, right? They’ve got a lot of very strong engineers and they understand their hardware better than anyone does. Beating them on their own hardware is tough.

00:17:45

Chris

That is really hard. And they’ve got a 20-year head start, because CUDA is about 20 years old. They’ve got all the momentum. They’re a pretty big company. As you say, lots of smart people. And so that was a ridiculous goal. Why did I do that? Well, I mean a certain amount of confidence in understanding how the technology worked, having a bet on what I thought we could build and the approach, and some insight and intuition, but also realizing that it’s actually destiny. Somebody has to do this work. If we ever want to get to an ecosystem where one vendor doesn’t control everything, if we want to get the best out of the hardware, if we want to get new programming language technologies, if we want pattern matching on a GPU—I mean, come on, this isn’t rocket science—then we need at some point to do this. And if nobody else is going to do it, I’ll step up and do that. That’s where Modular came from—saying, ‘Let’s go crack this thing open. I don’t know how long it will take, but sometimes it’s worthwhile doing really hard things if they’re valuable to the world.’ And the belief was it could be profoundly impactful and hopefully get more people into even just being able to use this new form of compute with GPUs and accelerators and all this stuff, and just really redemocratize AI compute.

00:18:48

Ron

So you pointed out that there’s a real structural problem here, and I’m actually wondering how, at a business model level, do you want to solve the structural problem? Which is, the history of computing is these days littered with the bodies of companies that try to sell a programming language. It’s a really hard business. How is Modular set up so that it’s incented to build this platform in a way that can be a shared platform that isn’t subject to just one other vendor’s lock-in?

00:19:11

Chris

First answer is, don’t sell a programming language. As you say, that’s very difficult. So we’re not doing that. Go take Mojo, go use it for free. We’re not selling a programming language. What we’re doing is we’re investing in this foundational technology to unify hardware. Our view is, as we’ve seen in many other domains, once you fix the foundation, now you can build high-value services for enterprises. And so for our enterprise layer, often who we talk to are these groups where you have hundreds or thousands of GPUs. Often it’s rented from a cloud on a three-year commit. You have a platform team that’s carrying pagers and they need to keep all this stuff running and all the production workloads running. And then you have these product teams that are inventing new stuff all the time, and there’s new research, there’s a new model that comes out and they want to get it on the production infrastructure, but none of this stuff actually works.

And so the software ecosystem we have with all these brilliant but crazy open source tools that are thrashing around, all these different versions of CUDA and libraries, all this different hardware happening, is just a gigantic mess. And so, helping solve this for the platform engineering team that actually needs to have stuff work, and wants to be able to reason about it, and wants good observability and manageability and scalability and things like this is actually, we think, very interesting. We’ve gotten a lot of good responses from people on that. The cost of doing this is that we have to actually make it work—that’s where we do the fundamental language, compiler, and underlying systems technology and help bring together these accelerators so that we can get, for example, the best performance on an AMD GPU and get it so that the software comes out in the same release train as support for an Nvidia GPU. And being able to pull that together, again, it just multiplicatively reduces complexity, which then leads to a product that actually works, which is really cool and very novel in AI.

00:20:49

Ron

So the way that Mojo plays in here, is it basically lets you provide the best possible performance and I guess the best possible performance across multiple different hardware platforms. Are you primarily thinking about this as an inference platform, or, how does the training world fit in?

00:20:57

Chris

So let me zoom in and I’ll explain our technology components. I have a blog post series I encourage you and any viewers or listeners to check out, called ‘Democratizing AI Compute.’ It goes through the history of all the systems and the problems and challenges that they’ve run into, and it gets to, ‘What is Modular doing about it?’ So Part 11 talks about our architecture, and the innermost piece is Mojo, which is a programming language. I’ll explain Mojo in a second. The next level out is called MAX. And so you can think of MAX as being a PyTorch replacement or a vLLM replacement, something that you can run on a single node and then get high-performance LLM serving, that kind of use case. And then the next level out is called Mammoth, and this is the cluster management Kubernetes layer. And so if you zoom in all the way back to Mojo, you say—your experience, you know what programming languages are, they’re incredibly difficult and expensive to build.

Why would you do that in the first place? And the answer is, we had to. In fact, when we started Modular, I was like, ‘I’m not going to invent a programming language.’ I know that’s a bad idea, it takes too long, it’s too much work. You can’t convince people to adopt a new language. I know all the reasons why creating a language is actually a really bad idea. But it turns out, we were forced to do this because there is no good way to solve the problem. And the problem is, how do you write code that is portable across accelerators? So, that problem: I want portability across—to make it simple, AMD and Nvidia GPUs—but then you layer on the fact that you’re using a GPU because you want performance. And so I don’t want a simplified, watered-down ‘Java that runs on a GPU.’

I want the full power of the GPU. I want to be able to deliver performance that meets and beats Nvidia on their own hardware. I want to have portability and unify this crazy compute where you have these really fancy heterogeneous systems and you have tensor cores and you have this explosion of complexity and innovation happening in this hardware platform layer. Most programming languages don’t even know that there’s an 8-bit floating point that exists. And so we looked around and I really did not want to have to do this, but it turns out that there really is no good answer. And again, we decided that, hey, the stakes are high, we want to do something impactful. We’re willing to invest. I know what it takes to build a programming language. It’s not rocket science, it’s just a lot of really hard work and you need to set the team up to be incentivized the right way. But we decided that, yeah, let’s do that.

00:23:08

Ron

So I want to talk more about Mojo and its design, but before we do, maybe let’s talk a little bit more about the pre-existing environment. I did actually read that blog post series. I recommended it to everyone. I think it’s really great, and I want to talk a little bit about what the existing ecosystem of languages looks like, but even before then, can we talk more about the hardware? What does the space of hardware look like that people want to run these ML models on?

00:23:29

Chris

Yeah, so the one that most people zero in on is the GPU. And so GPUs are, I think, getting better understood now. And so if you go back before that though, you have CPUs. So, modern CPUs in a data center, often you’ll have—I mean today you guys are probably riding quite big iron, but you got 100 cores in a CPU and you got a server with two-to-four CPUs on a motherboard, and then you go and you scale that. And so, you’ve got traditional threaded workloads that have to run on CPUs, and we know how to scale that for internet servers and things like this. If you get to a GPU, the architecture shifts. And so they have basically these things called SMs. And now the programming model is that you have effectively much more medium-sized compute that’s now put together on much higher performance memory fabrics and the programming model shifts. And one of the things that really broke CUDA, for example, was when GPUs got this thing called a tensor core—and the way to think about a tensor core is it’s a dedicated piece of hardware for matrix multiplication. And so, why’d we get that? Well, a lot of AI is matrix multiplication. And so, if you design the hardware to be good at a specific workload, you can have dedicated silicon for that and you can make things go really fast.
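To make the tensor-core idea concrete: one tensor-core-style instruction is essentially a small matrix-multiply-accumulate, D = A @ B + C, done on fixed-size tiles in hardware. Here is a rough numpy sketch of what a single step computes; the 16×16 tile shape and fp16-in/fp32-accumulate choice are only illustrative, since the real tile shapes and dtypes vary by vendor and GPU generation.

```python
# What one tensor-core-style MMA (matrix-multiply-accumulate) step computes,
# written out in plain numpy. Real hardware does this on fixed tile shapes and
# dtypes that differ between vendors and GPU generations; 16x16 fp16 inputs
# with an fp32 accumulator is only an illustrative choice.
import numpy as np

M = N = K = 16
A = np.random.rand(M, K).astype(np.float16)   # low-precision input tile
B = np.random.rand(K, N).astype(np.float16)   # low-precision input tile
C = np.zeros((M, N), dtype=np.float32)        # higher-precision accumulator tile

D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape)  # (16, 16) -- one tile's worth of the larger matmul
```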

00:24:36

Ron

There are really these two quite different models sitting inside of the GPU space. Of course, the name itself is weird. GPU is ‘graphics processing unit,’ which is what they were originally for. And then this SM model is really interesting. They have this notion of a warp. A warp is a collection of typically 32 threads that are operating together in lockstep, always doing the same thing—a slight variation on what’s called the SIMD model, same instruction, multiple data. It’s a little more general than that, but more or less, you can think of it as the same thing. And you just have to run a lot of them. And then there’s a ton of hardware inside of these systems basically to make switching between threads incredibly cheap. So you pay a lot of silicon to add extra registers. So the context switch is super cheap, so you can do a ton of stuff in parallel.

Each thing you’re doing is itself 32-wide parallel. And then because you can do all this very fast context switching, you can hide a lot of latency. And that worked for a while. And then we’re like, actually, we need way more of this matrix multiplication stuff. And you can sort of do reasonably efficient matrix multiplication through this warp model, but not really all that well. And then there’s a bunch of quite idiosyncratic hardware, which changes its performance characteristics from generation to generation, just for doing these matrix multiplications. So that’s the Nvidia GPU story—Volta with the V100, then the A100 and the H100. They just keep on going and changing, pretty materially from generation to generation in terms of the performance characteristics, and then also the memory model, which keeps on changing.
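A toy sketch of the SIMD idea Ron mentions, which is roughly how a warp executes: one instruction applied in lockstep across 32 lanes of data. This is only a numpy analogy, not how GPU threads are actually launched.

```python
# "Same instruction, multiple data" in miniature: one operation applied across
# 32 lanes at once, loosely analogous to a warp of 32 threads running in lockstep.
import numpy as np

lane_ids = np.arange(32)            # think: one value per "thread" in the warp
x = lane_ids.astype(np.float32)
y = 2.0 * x + 1.0                   # every lane runs the same instruction on its own data
print(y[:4])                        # [1. 3. 5. 7.]
```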

00:25:57

Chris

You go back to intuition, CUDA was never designed for this world. CUDA was not designed for modern GPUs. It was designed for a much simpler world. And CUDA being 20 years old, it hasn’t really caught up. And it’s very difficult because, as you say, the hardware keeps changing. And so CUDA was designed from a world where—almost like C is designed for a very simple programming model that it expected to scale, but then as the hardware changed, it couldn’t adapt. Now, if you get beyond GPUs, you get to Google TPU and many other dedicated AI systems. They blow this way out and they say, ‘Okay, well, let’s get rid of the threads that you have on a GPU and let’s just have matrix multiplication units and have really big matrix multiplication units and build the entire chip around that.’ And you get much more specialization, but you get a much higher throughput for those AI workloads.

Going back to, ‘Why Mojo?’ Well, Mojo was designed from first principles to support this kind of system. Each of these chips, as you’re saying, even within Nvidia’s family, from Volta, to Ampere, to Hopper, to Blackwell, these things are not compatible with each other. Actually, Blackwell just broke compatibility with Hopper, so it can’t always run Hopper kernels on Blackwell. Oops, well, why are they doing that? Well, AI software is moving so fast. They decided that was the right trade-off to make. And meanwhile, all of us software people need the ability to target this. When you look at other existing systems, with Triton for example, their goal was, ‘Let’s make it easier to program a GPU,’ which I love, that’s awesome. But then they said, ‘We’ll just give up 20% of the performance of the silicon to do it.’ Wait a second. I want all the performance. And so if I’m using a GPU—GPUs are quite expensive by the way—

I want all the performance. And if it’s not going to be able to deliver the same quality of results you get by writing CUDA, well then, you’re always going to run into this headroom problem, where you get going quickly, but then you run into a ceiling and then have to switch to a different system to get full performance. And so this is where Mojo is really trying to solve this problem where we can get more usability, more portability, and the full performance of the silicon, because it’s designed for these wacky architectures like tensor cores.

00:27:51

Ron

And if we look at the other languages that are out there, there are languages like CUDA and OpenCL, which are low level, typically look like variations on C++, and in that tradition are unsafe languages, which means that there are a lot of rules you have to follow. And if you don’t exactly follow the rules, you’re in undefined behavior land, and it’s very hard to reason about your program.

00:28:10

Chris

And just let me make fun of my C++ heritage because I’ve spent so many years, like, you just have a variable that you forget to initialize, it just shoots your foot off. [laughs] Like, it’s just unnecessary violence to programmers.

00:28:21

Ron

Right. And it’s done in the interest of making performance better because the idea is C++ and its related languages don’t really give you enough information to know when you’re making a mistake, and they want to have as much space as they can to optimize the programs they get. So the stance is just, if you do anything that’s not allowed, we have no obligation to maintain any kind of reasonable semantics or debuggability around that behavior. And we’re just going to try really, really hard to optimize correct programs, which is a super weird stance to take, because nobody’s programs are correct. There are bugs and undefined behavior in almost any C++ program of any size. And so, you’re in a very strange position in terms of the guarantees that you get from the compiler system you’re using.

00:29:02

Chris

Well, so I mean, I can be dissatisfied. I can also be sympathetic with people that work on C++. So again, I’ve spent decades in this language and around this ecosystem, and building compilers for it. I know quite a lot about it. The challenge is that C++ is established, and so there’s tons of code out there. By far, the code that’s already written is the code that’s the most valuable. And so if you’re building a compiler, or you have a new chip, or you have an optimizer, your goal is to get value out of the existing software. And so you can’t invent a new programming paradigm that’s a better way of doing things and defines away the problem. Instead, you have to work with what you’ve got. You have a SPEC benchmark you’re trying to make go fast, and so you invent some crazy heroic hack that makes some important benchmark work because you can’t go change the code.

In my experience, particularly for AI, but also I’m sure within Jane Street, if something’s going slow, go change the code. You have control over the architecture of the system. And so, what I think the world really benefits from, unlike benchmark hacking, is languages that give control and power and expressivity to the programmer. And this is something where I think that, again, you take a step back and you realize history is the way it is for lots of structural and very valid reasons, but the reasons don’t apply to this new age of compute. Nobody has a workload that they can pull forward to next year’s GPU—doesn’t exist. Nobody solved this problem. I don’t know the timeframe, but once we solve that problem, once we solve portability, you can start this new era of software that can actually go forward. And so now, to me, the burden is—make sure it’s actually good. And so, to your point about memory safety, don’t make it so that forgetting to initialize a variable is just going to shoot your foot off. [Instead] produce a good compiler error saying, ‘Hey, you forgot to initialize a variable,’ right? These basic things are actually really profound and important, and the tooling and all this usability and this DNA, these feelings and thoughts, are what flow into Mojo.

00:30:49

Ron

And GPU programming is just a very different world from traditional CPU programming just in terms of the basic economics and how humans are involved. You end up dealing with much smaller programs. You have these very small but very high-value programs whose performance is super critical, and in the end, a relatively small coterie of experts who end up programming in it. And so it pushes you ever in the direction, you’re saying, of performance engineering, right? You want to give people the control they need to make the thing behave as it should, and you want to do it in a way that allows people to be highly productive. And the idea that you have an enormous amount of legacy code that you need to bring over, it’s like, actually you kind of don’t. The entire universe of software is actually shockingly small, and it’s really about how to write these small programs as well as possible.

00:31:32

Chris

And also there’s another huge change. And so this is something that I don’t think that the programming language community has recognized yet, but AI coding has massively changed the game because now you can take a CUDA kernel and say, ‘Hey, Claude, go make that into Mojo.’

00:31:45

Ron

And actually, how good have you guys found the experience of that? Of doing translation?

00:31:48

Chris

Well, we do hackathons and people do amazing things, having never touched Mojo, having never done GPU programming, and within a day they can make things happen that are just shocking. Now, AI coding tools are not magic. You cannot just vibe code DeepSeek-R1 or something, right? But it’s amazing what that can do in terms of learning new languages, learning new tools, and getting into and catalyzing ecosystems. And so this is one of the things where, again, you go back five or 10 years—everybody knows nobody can learn a new language, and nobody’s willing to adopt new things. But the entire system has changed.

00:32:20

Ron

So let’s talk a little bit more in detail about the architecture of Mojo. What kind of language is Mojo, and what are the design elements that you chose in order to make it be able to address this set of problems?

00:32:30

Chris

Yeah, again, just to relate how different the situation is—back when I was working on Swift, one of the major problems to solve was, Objective-C was very difficult for people to use, and you had pointers, and you had square brackets, and it was very weird. And so the name of the game back in the day was: invent new syntax and bring together modern programming language features to build a new language. Fast forward to today, actually, some of that is true. AI people don’t like C++. C++ has pointers, and it’s ugly, and it’s a 40-year-old-plus language, and it actually has the same problem that Swift had to solve back in the day. But today there’s something different, which is that AI people do actually love a thing. It’s called Python. And so, one of the really important things about Mojo is, it’s a member of the Python family. And so, this is polarizing to some, because yes—I get it that some people love curly braces, but it’s hugely powerful because so much of the AI community is Pythonic already.

And so we started out by saying, let’s keep the syntax like Python and only diverge from that if there’s a really good reason. But then what are the good reasons? Well, the good reasons are, we want—as we were talking about—performance, power, full control over the system. And for GPUs, there’s these very important things you want to do that require metaprogramming. And so Mojo has a very fancy metaprogramming system, kind of inspired by this language called Zig, that brings runtime and compile time together to enable really powerful library designs. And the way you crack open this problem with tensor cores and things like this, is you enable really powerful libraries to be built in the language as libraries, instead of hard coding into the compiler.

00:33:57

Ron

Let’s take it a little bit to the metaprogramming idea. What is metaprogramming and why does it matter for performance in particular?

00:34:03

Chris

Yeah, it’s a great question, and I think you know the answer to this too, and I know you, but—

00:34:08

Ron

[Laughs] We are also working on metaprogramming features in our own world.

00:34:11

Chris

Exactly. And so the observation here is, when you’re writing a for loop in a programming language, for example, typically that for loop executes at runtime, so you’re writing code that, when you execute the program, is the instructions that the computer will follow to execute the algorithm within your code. But when you get into designing higher level type systems, suddenly you want to be able to run code at compile time as well. And so there’s many languages out there. Some of them have macro systems, C++ has templates. What you end up getting, in many languages, is this duality between what happens at runtime, and then a different language, almost, that happens at compile time. And C++ is the most egregious, because with templates, you have a for loop at runtime, but then you have unrolled recursive templates, or something like that, at compile time.

Well, so the insight is, hey, these two problems are actually the same. They just run at different times. And so what Mojo does is say: let’s allow the use of effectively any code that you would use at runtime to also work at compile time. And so you can have a list, or a string, or whatever you want in the algorithms—go do memory allocation, deallocation—and you can run those at compile time, enabling you to build really powerful high-level abstractions and put them into libraries. So why is this cool? Well, the reason it’s cool is that on a GPU, for example, you’ll have a tensor core. Tensor cores are weird. We probably don’t need to deep dive into all the reasons why, but the indexing and the layout that tensor cores use is very specific and very vendor different. And so the tensor core you have on AMD, or the tensor cores you have on different versions of Nvidia GPUs, are all very different.

And so what you want, is you want to build, as a GPU programmer, a set of abstractions so you can reason about all of these things in one common ecosystem and have the layouts much higher level. And so what this enables, it enables very powerful libraries—and very powerful libraries where a lot of the logic is actually done at compile time, but you can debug it because it’s the same language that you use at runtime. And it makes the language much simpler, much more powerful, and able to scale into these complexities in a way that’s possible with C++—but in C++, you get some crazy template stack trace that is maddening and impossible to understand. In Mojo, you can get a very simple error message. You can actually debug your code, use a debugger, things like this.

00:36:17

Ron

So maybe an important point here is that metaprogramming is really an old solution to this performance problem. Maybe a good way of thinking about this is, imagine you have some piece of data that represents a little embedded domain-specific language that you’ve written, that you want to execute via a program that you wrote. You can, in a nice high-level way, write a little interpreter for that language that just—you know, I have maybe a Boolean expression language or who knows what else. Maybe it’s a language for computing on tensors in a GPU. And you could write a program that just executes that mini domain-specific language and does the thing that you want and you can do it, but it’s really slow. Writing an interpreter is just inherently slow because of all this interpretation overhead where you are dynamically making decisions about what the behavior of the program is. And sometimes what you want is, you just want to actually emit exactly the code that you want and boil away the control structure and just get the direct lines of machine code that you want to do the thing that’s necessary.

And various forms of code generation let you get past all of this control structure that you would otherwise have to execute at runtime, and instead execute it at compile time and get this minified program that just does exactly the thing that you want. So that’s a really old idea. It goes back to all sorts of programming languages. There’s a lot of Lisps that did a lot of this metaprogramming stuff, but then the problem is this stuff is super hard to think about and reason about and debug. And that’s certainly true if you think about C and all its macro language—if you use the C preprocessor to do this kind of stuff, it’s pretty painful to reason about. And then C++ made it richer and more expressive, but still really hard to reason about. And you write a C++ template and you don’t really know what it’s going to do or if it’s going to compile until you give it all the inputs and let it go and it—

00:37:55

Chris

Feels good in the simple case. But then when you get to more advanced cases, suddenly the complexity compounds and it gets out of hand.
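As a minimal sketch of the interpreter-versus-specialization idea Ron describes above, here it is in plain Python (the Const/Var/Add/Mul expression language is made up for illustration): the interpreter makes all its dispatch decisions at runtime, while the specializer walks the expression once up front and returns a flat closure with that overhead boiled away, standing in for real compile-time code generation.

```python
# A tiny expression "DSL": interpret it (decisions at runtime), or specialize it
# once up front so the control structure is boiled away. Illustrative sketch only.
from dataclasses import dataclass


@dataclass
class Const:
    value: float


@dataclass
class Var:
    name: str


@dataclass
class Add:
    left: object
    right: object


@dataclass
class Mul:
    left: object
    right: object


def interpret(expr, env):
    # Re-walks the tree on every call: all dispatch happens at runtime.
    match expr:
        case Const(value=v):
            return v
        case Var(name=n):
            return env[n]
        case Add(left=l, right=r):
            return interpret(l, env) + interpret(r, env)
        case Mul(left=l, right=r):
            return interpret(l, env) * interpret(r, env)


def specialize(expr):
    # Walks the tree once and returns a flat closure -- a stand-in for
    # emitting exactly the code you want ahead of time.
    match expr:
        case Const(value=v):
            return lambda env: v
        case Var(name=n):
            return lambda env: env[n]
        case Add(left=l, right=r):
            fl, fr = specialize(l), specialize(r)
            return lambda env: fl(env) + fr(env)
        case Mul(left=l, right=r):
            fl, fr = specialize(l), specialize(r)
            return lambda env: fl(env) * fr(env)


expr = Add(Mul(Var("x"), Const(2.0)), Const(1.0))      # x * 2 + 1
fast = specialize(expr)                                # "compile" once
print(interpret(expr, {"x": 3.0}), fast({"x": 3.0}))   # 7.0 7.0
```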

00:38:01

Ron

And it sounds like the thing that you’re going for in Mojo is it feels like one language. It has one type system that covers both the stuff you’re generating statically and the stuff that you’re doing at runtime. It sounds like debugging works in the same way across both of these layers, but you still get the actual runtime behavior you want, from a language where you could more explicitly just be like, here’s exactly the code that I want to generate.

00:38:24

Chris

[…] metaprogramming is one of the fancy features. One of the cool features is it feels and looks like Python, but with actual types.

00:38:31

Ron

Right.

00:38:32

Chris

And let’s not forget the basics. Having something that looks and feels like Python but is a thousand times faster or something is actually pretty cool. For example, if you’re on a CPU, you have access to SIMD—the SIMD registers that allow you to do multiple operations at a time—and being able to get the full power of your hardware even without using the fancy features is also really cool. And so the challenge with any of these systems is, how do you make something that’s powerful, but it’s also easy to use? I think your team’s been playing with Mojo and doing some cool stuff. I mean, what have you seen and what’s your experience been?

00:39:02

Ron

We’re all still pretty new to it, but I think it’s got a lot of exciting things going for it. I mean, the first thing is, yeah, it gives you the kind of programming model you want to get the performance that you need. And actually, in many ways the same kind of programming model that you get out of something like CUTLASS or CuTe DSL, which are these Nvidia-specific, some at the C++ level, some at the Python DSL level—and by the way, every tool you can imagine nowadays is done once in C++ and once in Python. We don’t need to implement programming languages in any other way anymore. They’re all either skins on C++ or skins on Python. But depending on which path you go down, whether you go the C++ path or the Python path, you get all sorts of complicated trade-offs.

Like in the C++ path in particular, you get very painful compilation times. The thing you said about template metaprogramming is absolutely true. The error messages are super bad. If you look at these more Python-embedded DSLs, the compile times tend to be better. It still can be hard to reason about though. One nice thing about Mojo is the overall discipline seems very explicit when you want to understand: Is this a value that’s happening at execution time at the end, or is it a value that is going to be dealt with at compile time? It’s just very explicit in the syntax, you can look and understand. Whereas in some of these DSLs, you have to actively go and poke the value and ask it what kind of value it is. And I think that kind of explicitness is actually really important for performance engineering, making it easy to understand just what precisely you’re doing.

You actually see this a ton, not even with these very low-level things, but if you look at PyTorch, which is a much higher level tool, PyTorch does this thing where you get to write a thing that looks like an ordinary Python program, but really it’s got a much trickier execution model. Python’s an amazing and terrible ecosystem in which to do this kind of stuff, because what guarantees do you have when you’re using Python? None. What can you do? Anything. You have an enormous amount of freedom. The PyTorch people in particular have leveraged this freedom in a bunch of very clever ways, where you can write a Python program that looks like it’s doing something very simple and straightforward that would be really slow, but no—it’s very carefully delaying and making some operations lazy so it can overlap compute on the GPU and CPU and make stuff go really fast. And that’s really nice, except sometimes it just doesn’t work.

00:41:04

Chris

This is the trap again, this is my decades of battle scars now. So as a compiler guy, I can make fun of other compiler people. There’s this trap and it’s an attractive trap, which is called the ‘sufficiently smart compiler.’ And so what you can do is you can take something and you can make it look good on a demo and you can say, ‘Look! I make it super easy and I’m going to make my compiler super smart, and it’s going to take care of all this and make it easy through magic.’ But magic doesn’t exist. And so anytime you have one of those ‘sufficiently smart compilers,’ if you go back in the day, it was like auto-parallelization: just write C code as sequential logic, and then we’re going to automatically map it into running on 100 cores on a supercomputer or something like that.

They often actually do work, they work in very simple cases and they work in the demos. But the problem is that you go and you’re using them and then you change one thing and suddenly everything breaks. Maybe the compiler crashes, it just doesn’t work. Or you go and fix a bug and now instead of 100-times speedup, you get 100-times slowdown because it foiled the compiler. A lot of AI tools, a lot of these systems, particularly these DSLs, have this design point of, let me pretend like it’s easy and then I will take care of it behind the scenes. But then when something breaks, you have to end up looking at compiler dumps, right? And this is because magic doesn’t exist. And so this is where predictability and control is really, I think, the name of the game, particularly if you want to get the most out of a piece of hardware, which is how we ended up here.

00:42:23

Ron

It’s funny, the same issue of, “How clever is the underlying system you’re using?” comes up when you look at the difference between CPUs and GPUs. CPUs themselves are trying to do a weird thing where a chip is a fundamentally parallel substrate. It’s got all of these circuits that in principle could be running in parallel, and then it is yoked to running this extremely sequential programming language, which is just trying to do one thing after another. And then how does that actually work with any reasonable efficiency? Well, there’s all sorts of clever dirty tricks happening under the covers where it’s trying to predict what you’re going to do, this speculation that allows it to dispatch multiple instructions in a row by guessing what you’re going to do in the future. There’s things like memory prefetching where it has heuristics to estimate what memory you’re going to ask for in the future so it can dispatch multiple memory requests at the same time.

And then look at things like GPUs, and I think even more so, TPUs, and then also totally other things like FPGAs, the field-programmable gate arrays where you basically put a circuit design on the chip; that’s a very different kind of software system. But all of them are in some sense simpler and more deterministic and more explicitly parallel. When you write down your program, you have to write an explicitly parallel program—that’s actually harder to write. I don’t want to complain too much about CPUs. The great thing about CPUs is they’re extremely flexible and incredibly easy to use, and all of that dark magic actually works a pretty large fraction of the time.

00:43:42

Chris

Yeah, remarkably well. But your point here, I think, is really great: CPUs are the magic box that makes sequential code go in parallel pretty fast. And then we have new, more explicit machines, somewhat harder to program because they’re not a magic box, but you get something from it. You get performance and power, because that magic box doesn’t come for free. It comes with a very significant cost, often in the amount of power that your machine dissipates. And so it’s not efficient. And so a lot of the reason we’re getting these new accelerators is because people really do care about it being a hundred times faster, or using way less power, or things like this. And I’d never thought about it, but your analogy of Triton to Mojo kind of follows a similar pattern, right? Triton is trying to be the magic box, and it doesn’t give you the full performance, and it burns more power, and all that kind of stuff. And so Mojo is saying, look, let’s go back to being simple. Let’s give the programmer more control. And that more explicit approach, I think, is a good fit for people that are building crazy advanced hardware like you’re talking about—but also people that want to get the best performance out of the existing hardware we have.

00:44:42

Ron

So we talked about how metaprogramming lets you write faster programs by boiling away this control structure that you don’t really need. So that part’s good. How does it give you portable performance? How does it help you on the portability front?

00:44:54

Chris

Yeah, so this is another great question. So in this category of ‘sufficiently smart compilers,’ and particularly for AI compilers, there have been years of work, and MLIR has catalyzed a lot of it, building these magic AI compilers that take TensorFlow or even the new PyTorch stuff and try to generate optimal code for some chip. So take some PyTorch model, put it through a compiler, and magically get out high performance. And so there are tons of these things, and there’s a lot of great work done here, and a lot of people have shown that you can take kernels and accelerate them with compilers. The challenge with this is that people don’t ever measure—what is the full performance of the chip? People always measure from a somewhat unfortunate baseline and then try to climb higher, instead of saying—what is the speed of light? And so if you measure from the speed of light, suddenly you say, okay, how do I achieve several different things?

Even if you zero in on one piece of silicon, how do I achieve the best performance for one use case? And then how do I make it so the software I write can generalize even within the domain? So for example, take a matrix multiplication: well, you want to work on maybe float32, but then you want to generalize it to float16. Okay, well, templates and things like this are easy ways to do this; metaprogramming allows you to say, okay, I will tackle that. And then the next thing that happens is, because you went from float32 to float16, your effective cache size has doubled, because twice as many elements fit into cache at 16 bits as at 32 bits. Well, if that’s the case, now suddenly the access pattern needs to change. And so you get a whole bunch of conditional logic that changes in a very parametric way as a result of one simple change, going from float32 to float16.

Now you play that forward and you say, okay, well actually matrix multiplication is a recursive, hierarchical problem. There are specializations for tall and skinny matrices, or for when a dimension is one, or something. There are all these special cases. Just one algorithm for one chip becomes this very complicated subsystem that you end up wanting to do a lot of transformations to, so you can go specialize it for different use cases. And so Mojo with the metaprogramming allows you to tackle that. Now you bring in other hardware, and so think of matrix multiplication these days as being almost an operating system: there are so many different subsystems, and special cases, and different dtypes, and crazy float4 and float6 and other stuff going on.
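
As a rough Python analogy to the parametric cascade Chris is describing (the cache budget and tile rule here are invented for illustration; in Mojo these decisions would be made at compile time from real hardware parameters):

```python
import numpy as np

# Hypothetical numbers purely for illustration; a real kernel derives these
# from the actual cache hierarchy and matrix-unit shapes of the target chip.
L1_BYTES = 32 * 1024

def pick_tile(dtype) -> int:
    """Choose a square tile edge so three tiles (A, B, C blocks) fit in L1."""
    elem = np.dtype(dtype).itemsize        # 4 for float32, 2 for float16
    budget = L1_BYTES // 3                 # bytes available per operand tile
    tile = int((budget // elem) ** 0.5)    # halving the element size grows the tile
    return max(8, tile - tile % 8)         # keep it a multiple of 8

def matmul_tiled(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy tiled matmul: the access pattern changes with the element type."""
    n = a.shape[0]
    t = pick_tile(a.dtype)
    c = np.zeros((n, n), dtype=a.dtype)
    for i in range(0, n, t):
        for j in range(0, n, t):
            for k in range(0, n, t):
                c[i:i+t, j:j+t] += a[i:i+t, k:k+t] @ b[k:k+t, j:j+t]
    return c

print(pick_tile(np.float32), pick_tile(np.float16))  # float16 gets a larger tile

a = np.random.rand(128, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(matmul_tiled(a, b), a @ b, rtol=1e-3)
```

One change (the element type) flows into the tile size, which flows into the loop structure and access pattern; the point of doing this parametrically is that the cascade is expressed once rather than hand-written per configuration.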

00:47:07

Ron

At some point they’re going to come out with a floating point number so small that it will be a joke. But every time I think that they’re just kidding, it turns out it’s real.

00:47:14

Chris

Seriously, I heard somebody talking about 1.2-bit floating point, right? It’s exactly like you’re saying, is that a joke? You can’t be serious. And so now when you bring in other hardware, other hardware brings in more complexity because suddenly the tensor core has a different layout in AMD than it does on Nvidia. Or maybe to your point about warps, you have 64 threads in a warp on one and 32 threads in a warp on the other. But what you realize is, wait a second—this really has nothing to do with hardware vendors. This is actually true even within, for example, the Nvidia line, because across these different data types, the tensor cores are changing. The way the tensor core works for float32 is different from the way it works for float4 or something. And so you already—within one vendor—have to have this very powerful metaprogramming to be able to handle the complexity and do so in the scaffolding of a single algorithm like matrix multiplication.

And so now as you bring in other vendors, well, it turns out, hey, they all have things that look roughly like tensor cores. And so we’re coming at this from a software engineering perspective, and so we’re forced to build abstractions, and we have this powerful metaprogramming system so we can actually achieve this. And so even for one vendor, we get this thing called LayoutTensor. LayoutTensor is saying, okay, well, I have the ability to reason about not just an array of numbers, or a multidimensional array of numbers, but also how it’s laid out in memory and how it gets accessed. And so now we can declaratively map these things onto the hardware that you have, and these abstractions stack. And so it’s this really amazing triumvirate, and it starts with having a type system that works well; that’s the very important basis. I know you’re a fan of type systems also.

You then bring in metaprogramming, so you can build powerful abstractions and run them at compile time, so you get no runtime overhead. And then you bring in the most important part of this entire equation, which is programmers who understand the domain. I am not going to write a fast matrix multiplication. I’m sorry, that’s not my expertise. But there are people in that space that are just fricking brilliant. They understand exactly how the hardware works, they understand the use cases and the latest research and the new crazy quantized format of the day, but they’re not compiler people. And so the magic of Mojo is it says, ‘Hey, you have a type system, you have metaprogramming, you have effectively the full power of a compiler when you’re building libraries.’ And so now these people that are brilliant at unlocking the power of the hardware can actually do this. And now they can write software that scales both across the complexity of the domain and across hardware. And to me, that’s what I find so exciting and so powerful about this. It’s unlocking the power of the Mojo programmer instead of trying to put it all into the compiler, which is what a lot of earlier systems have tried to do.
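
A deliberately simplified sketch of the ‘one parameterized description, many targets’ idea; the structs, numbers, and tiling rule below are hypothetical stand-ins for illustration, not Modular’s LayoutTensor or actual hardware tables.

```python
from dataclasses import dataclass

# Illustrative only: hypothetical hardware descriptions, not real vendor specs.
@dataclass(frozen=True)
class HardwareSpec:
    name: str
    warp_size: int     # e.g. 32 threads per warp vs. 64 per wavefront
    mma_shape: dict    # per-dtype matrix-unit tile (M, N, K)

TARGET_A = HardwareSpec("target-a", 32, {"float16": (16, 8, 16), "float32": (16, 8, 8)})
TARGET_B = HardwareSpec("target-b", 64, {"float16": (16, 16, 16), "float32": (16, 16, 4)})

def kernel_config(hw: HardwareSpec, dtype: str, warps_per_block: int = 8):
    """Derive launch parameters for one target from its hardware description."""
    m, n, k = hw.mma_shape[dtype]
    threads = hw.warp_size * warps_per_block
    # Toy rule: each warp owns one matrix-unit tile; the block tile stacks them.
    block_tile = (m * warps_per_block, n)
    return {"threads_per_block": threads, "block_tile": block_tile, "k_step": k}

for hw in (TARGET_A, TARGET_B):
    print(hw.name, kernel_config(hw, "float16"))
```

The algorithm is written once against the description; swapping the description (or the dtype) changes the generated configuration, which is the shape of what Mojo does with compile-time parameters rather than runtime dictionaries.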

00:49:49

Ron

So maybe the key point here is that you get to build these abstractions that allow you to represent different kinds of hardware, and then you can conditionally have your code execute based on the kind of hardware that it’s on. It’s not like an #ifdef where you’re picking between different hardware platforms. There are complicated data structures like these layout values that tell you how you traverse data.

00:50:07

Chris

Which is kind of a tree. This isn’t just a simple int that you’re passing around. This is like a recursive hierarchical tree that you need at compile time.
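
To make ‘the layout is a tree’ concrete, here is a tiny Python sketch of a layout as a recursive (shape, stride) value; this is an illustration of the idea, not the actual LayoutTensor API, and in Mojo a value like this would be available at compile time.

```python
def offset(layout, coord):
    """layout = (shape, stride); shapes, strides, and coords are ints or
    matching nested tuples. Maps a logical coordinate to a memory offset."""
    shape, stride = layout
    if isinstance(shape, int):
        return coord * stride
    return sum(offset((s, d), c) for s, d, c in zip(shape, stride, coord))

# A plain row-major 4x4 matrix.
row_major = ((4, 4), (4, 1))

# The same 4x4 matrix stored as a 2x2 grid of 2x2 tiles, each tile contiguous.
# Coordinates are ((local_row, local_col), (tile_row, tile_col)).
tiled = (((2, 2), (2, 2)),
         ((2, 1), (8, 4)))

# Element at row 1, col 0 lives at offset 4 in row-major storage...
print(offset(row_major, (1, 0)))          # -> 4
# ...but at offset 2 in the tiled storage (local (1, 0) inside tile (0, 0)).
print(offset(tiled, ((1, 0), (0, 0))))    # -> 2
```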

00:50:13

Ron

The critical thing is you get to write a thing that feels like one coherent program with one understandable behavior, but then parts of it are actually going to execute at compile time, so that the thing that you generate is in fact specialized for the particular platform that you’re going to run it on. So one concern I have over this is it sounds like the configuration space of your programs is going to be massive, and I feel like there are two directions where this seems potentially hard to do from an engineering perspective. One is, can you really create abstractions that within the context of the program hide the relevant complexity? So it’s possible for people to think in a modular way about the program they’re building, so their brains don’t explode with the 70 different kinds of hardware that they might be running it on. And then the other question is, how do you think about testing? Because there are just so many configurations. How do you know whether it’s working in all the places? Because it sounds like it has an enormous amount of freedom to do different things, including wrong things in some cases. How do you deal with those two problems, both controlling the complexity of the abstractions and then having a testing story that works out?

00:51:11

Chris

Okay, Ron, I’m going to blow your mind. I know you’re going to be resistant to this, but let me convince you that types are cool.

00:51:16

Ron

Okay!

00:51:18

Chris

I know you’re going to fight me on this. Well, so this is again, you go back to the challenges and opportunities of working with either Python or C++. Python doesn’t have types really. I mean it has some stuff, but it doesn’t really have a type system. C++ has a type system, but it’s just incredibly painful to work with. And so what Mojo does is it says, again, it’s not rocket science. We see it all around us. Let’s bring in traits. Let’s bring in a reasonable way to write code so that we can build abstractions that are domain-specific and they can be checked modularly. And so one of the big problems with C++ is that you get error messages when you instantiate layers and layers and layers and layers of templates. And so if you get some magic number wrong, it explodes spectacularly in a way that you can’t reason about. And so what Mojo does, it says, cool, let’s bring in traits that feel very much like protocols in Swift, or traits in Rust, or type classes in Haskell. Like, this isn’t novel.

00:52:08

Ron

This is like a mechanism for what’s called ad hoc polymorphism, meaning I want to have some operation or function that has some meaning, but it’s actually going to get implemented in different ways for different types. And these are all basically mechanisms for, given the operation you’re doing and the types involved, looking up the right implementation that’s going to do the thing that you want.

00:52:25

Chris

Yeah, I mean a very simple case is an iterator. So Mojo has an iterator trait, and you can say, ‘Hey, what is an iterator over a collection?’ Well, you can check whether there’s another element, or you can get the value at the current element. And as you keep pulling things out of an iterator, it will eventually decide to stop. And so this concept can be applied to things like a linked list, or an array, or a dictionary, or an unbounded sequence of packets coming off a network. And so you can write code that’s generic across these different—call them “backends” or “models”—that implement this trait. And what the compiler will do for you is check to make sure that when you’re writing that generic code, you’re not using something that won’t work. What that means is that you can check the generic code without having to instantiate it, which is good for compile time. It’s good for user experience, because being told clearly when you get something wrong as a programmer is important. It’s good for reasoning about the modularity of these different subsystems, because now you have an interface that connects the two components.
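
A rough Python analogy to the trait idea, using typing.Protocol; a checker like mypy plays the role that Mojo’s compiler plays, validating the generic code against the interface rather than against any particular instantiation.

```python
from typing import Iterator, Protocol, TypeVar

T = TypeVar("T")

class SupportsIter(Protocol[T]):
    """The 'trait': anything that can produce an iterator of T."""
    def __iter__(self) -> Iterator[T]: ...

def count_matching(items: SupportsIter[T], target: T) -> int:
    """Generic code checked against the interface, not any one backend."""
    n = 0
    for item in items:   # only what the protocol promises is available here
        if item == target:
            n += 1
    return n

# The same generic function works over a list, a dict's keys, or a generator.
print(count_matching([1, 2, 2, 3], 2))                  # -> 2
print(count_matching({"a": 1, "b": 2}, "a"))            # -> 1
print(count_matching((x % 3 for x in range(10)), 0))    # -> 4
```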

00:53:22

Ron

I think it’s an underappreciated problem with the C++ templates approach to the world, where C++ templates seem like a deep language feature, but really they’re just a code generation feature.

00:53:32

Chris

They’re like C macros.

00:53:33

Ron

That’s right. It means they’re hard to think about and reason about. At first glance it seems not so bad—this property that you don’t really know, when your template expands, whether it’s actually going to compile. But as you start composing things more deeply, it gets worse and worse, because something somewhere is going to fail, and it’s just going to be hard to reason about and understand. Whereas when you have type-level notions of genericity that are guaranteed to compose correctly and won’t just blow up, you drive that whole class of error right down. So that’s one thing that’s nice about getting past templates as a language feature. And then the other thing is it’s just crushingly slow. You’re generating the code, almost exactly the same code, over and over and over again. And so that just means you can’t save any of the compilation work. You just have to redo the whole thing from scratch.

00:54:21

Chris

That’s exactly right. And so this is where, again, we were talking about the sand in the system—these little things that, if you get them wrong, play forward and cause huge problems. The metaprogramming approach in Mojo is cool for usability, compile time, and correctness. Coming back to your point about portability, it’s also valuable there, because what it means is that the compiler parses your code generically and has no idea what the target is. And so when Mojo generates the first level of intermediate representation, the compiler representation for the code, it’s not hard-coding that the pointers are 32-bit or 64-bit, or that you’re on x86 or whatever. And what this means is that you can take generic code in Mojo and put it on a CPU and put it on a GPU. Same code, same function. And again, with these crazy compilery things that Chris gets obsessed about, it means that you can slice out the chunk of code that you want to put onto your GPU in a way where it looks like a distributed system, but it’s a distributed system where the GPU is actually a crazy embedded device that wants this tiny snippet of code, and it wants it fully self-contained. These are worlds of things that normal programming languages haven’t even thought about.

00:55:29

Ron

So does that mean when I compile a Mojo program, I get a shippable executable that contains within it another little compiler that can take the Mojo code and specialize it to get the actual machine code for the final destination that you need? Do I bundle together all the compilers for all the possible platforms in every Mojo executable?

00:55:45

Chris

The answer is no. The world’s not ready for that. And there are use cases for JIT compilers and things like this, and that’s cool, but the default way of building, if you just run mojo build, will give you just an a.out executable, a normal thing. But if you build a Mojo package, the Mojo package retains portability. This is a big difference. This is what Java does: Java, in a completely different way and for different reasons, in a different ecosystem universe, parses all your source code without knowing what the target is and generates Java bytecode. And so it’s not 1995 anymore, and the way we do this is completely different. We’re not Java, obviously, and we have a type system that’s very different. But this concept is something that’s been well known, and it’s something that at least the world of compiled languages like Swift, and C++, and Rust has kind of forgotten.

00:56:28

Ron

So the Mojo package is kind of shipped with the compiler technology required to specialize to the different domains.

00:56:34

Chris

Yes. And so again, by default, if you’re a user, you’re sitting on your laptop and you say, ‘Compile a Mojo program,’ you just want an executable. But the compiler technology has all of these powerful features, and they can be used in different ways. This is similar to LLVM, where LLVM had a just-in-time compiler, and that’s really important if you’re Sony Pictures and you’re rendering shaders for some fancy movie, but that’s not what you’d want to use if you’re just building C++ code that needs to be ahead-of-time compiled.

00:56:57

Ron

I mean, there are some echoes here also of the PTX story with Nvidia. Nvidia has this thing called PTX, which they sort of hide is an intermediate representation, but it’s essentially a portable bytecode. And for many years they’ve maintained compatibility across many, many different generations of GPUs. They have a thing called the assembler that’s part of the driver, used when loading code onto the device, and it’s really not an assembler. It’s a real compiler that takes the PTX and compiles it down to SASS, the accelerator-specific machine code, which they very carefully do not fully document because they don’t want to give away all of their secrets. And so there’s a built-in portability story there, where it’s meant to actually be portable in the future across new generations. Although, as you were pointing out before, it in fact doesn’t always succeed. And there are now some programs that will not actually make the transition to Blackwell.

00:57:42

Chris

So that’s in the category that I’d consider to be like a virtual machine, a very low-level virtual machine, by the way. And so when you’re looking at these systems, the thing I’d ask is, what is the type system? If you look at PTX, you’re totally right, it’s an abstraction between a whole bunch of source code on the top end and that specific SASS hardware thing on the backend, but the type system isn’t very interesting. It’s pointers and registers and memory. And so Java, what is the type system? Well, Java achieves portability by making the type system in its bytecode expose objects. And so it’s a much higher-level abstraction, dynamic virtual dispatch, that’s all part of the Java ecosystem. Mojo’s portable representation isn’t a bytecode, but it maintains the full generic system. And so this is what makes it possible to say, ‘Okay, well I’m going to take this code, compile it once to a package, and now go specialize and instantiate this for a device.’ So the way that works is a little bit different, but coming back to your original question of safety and correctness, it enables all the checking to happen the right way.

00:58:40

Ron

Right, there’s also a huge shift in control. With PTX, the machine-specific details of how it’s compiled are totally out of the programmer’s control. You can generate the best PTX you can, and then it’s going to get compiled. How? Somehow, don’t ask too many questions, it’s going to do what it’s going to do. Whereas here, you’re preserving in the portable object, the programmer-driven instructions about how the specialization is going to work. You’ve just partially executed your compilation, you’ve got partway down, and then there’s some more that’s going to be done at the end when you pick actually where you’re going to run it.

00:59:08

Chris

Exactly. And so these are all very nerdy pieces that go into the stack, but the thing that I like is if you bubble out of that, it’s easy to use. It works. It gives good error messages, right? I don’t understand the Greek letters, but I do understand a lot of the engineering that goes into this. The way this technology stack builds up, the whole purpose is to unlock compute, and we want new programmers to be able to get into the system. And if they know Python, if they understand some of the basics of the hardware, they can be effective and then they don’t get limited to 80% of the performance. They can keep driving and keep growing in sophistication, and maybe not everybody wants to do that. They can stop at 80%, but if you do want to go all the way, then you can get there.

00:59:44

Ron

One thing I’m curious about is, how do you actually manage to keep it simple? You said that Mojo is meant to be Pythonic, and you talked a bunch about the syntax, but one of the nice things about Python is that it’s simple in a deeper sense. The fact that there isn’t by default a complicated type system with complicated type errors to think about—there are a lot of problems with that, but it’s also a real source of simplicity for users who are trying to learn the system. Dynamic errors at runtime are in some ways easier to understand: ‘I wrote a program, it tried to do a thing, it tripped over this particular thing, and I can see it tripping over it.’ How do you preserve that when you’re going to a language which, for both safety and performance reasons, needs much more precise type-level control? How do you do it in a way that still feels Pythonic in terms of the base simplicity that you’re exposing to users?

01:00:28

Chris

I can’t give you the perfect answer, but I can tell you my current thoughts. So again, learn from history. Swift had a lot of really cool features, but it spiraled and got a lot of complexity that got layered in over time. And also one of the challenges with Swift is it had a team that was paid to add features to Swift.

01:00:46

Ron

It’s never a good thing.

01:00:47

Chris

Well, you have a C++ committee, what is the C++ committee going to do? They’re going to keep adding features to C++. Don’t expect C++ to get smaller. It’s common sense. And so with Mojo, there’s a couple of different things. So one of which is, start from Python. So Python being the surface-level syntax enables me as management to be able to push back and say, ‘Look, let’s make sure we’re implementing the full power of the Python ecosystem. Let’s have lists, and for-comprehensions, and all this stuff before just inventing random stuff because it might be useful.’ But there’s also, for me personally, a significant back pressure on complexity. How can we factor these things? How can we get, for example, the metaprogramming system to subsume a lot of complexity that would otherwise exist? And there are fundamental things that I want us to add.

For example, checked generics and things like this, because they have a better UX; they’re part of the metaprogramming system, part of the core of what we’re adding. But I don’t want Mojo to turn into ‘add every language feature that every other language has’ just because it’s useful to somebody. I was actually inspired by and learned a lot from Go, and it’s a language that people are probably surprised to hear me talk about. Go, I think, did a really good job of intentionally constraining the language with Go 1. And they took a lot of heat for that. They didn’t add a generic system, and everybody, myself included, was like, ‘Ha ha ha, why doesn’t this language even have a generic system? You’re not even a modern language.’ But they held the line, they understood how far people could get, and then, when they finally added generics to Go, they did a great job of it.

There was a recent blog post I was reading, talking about Go, and apparently they have an 80-20 rule, and they say they want to have 80% of the features with 20% of the complexity, something like that. And the observation is that that’s a point in the space that annoys everybody, because everybody wants 81% of the features, but 81% of the features maybe gives you 35% of the complexity. And so, figuring out where to draw that line and figuring out where to say no—for example, we have people in the community that are asking for very reasonable things that exist in Rust. And Rust is a wonderful language. I love it. There’s a lot of great ideas and we shamelessly pull good ideas from everywhere. But I don’t want the complexity.

01:03:02

Ron

I often like to say that one of the most critical things about a language design is maintaining the power-to-weight ratio.

You want to get an enormous amount of good functionality, and power, and good user experience while minimizing that complexity. I think it is a very challenging thing to manage, and it’s actually a thing that we are seeing a lot as well. We are also doing a lot to extend OCaml in all sorts of ways, pulling from all sorts of languages, including Rust, and again, doing it in a way where the language maintains its basic character and maintains its simplicity is a real challenge. And it’s kind of hard to know if you’re hitting the actual right point on that. And it’s easier to do in a world where you can take things back, try things out and decide that maybe they don’t work, and then adjust your behavior. And we’re trying to iterate a lot in that mode, which is a thing you can do under certain circumstances. It gets harder as you have a big open-source language that lots of people are using.

01:03:47

Chris

That’s a really great point. And so one of the other lessons I learned with Swift is that I pushed very early to have an open design process where anybody could come in, write a proposal, and then it would be evaluated by the language committee, and then, if it was good, it would be implemented and put into Swift. Again, be careful what you wish for. That enabled a lot of people with really good ideas to add a bunch of features to Swift. And so with Mojo, as a counterbalance, I really want the core team to be small. I want the core team not to just add a whole bunch of stuff because it might be useful someday, but to be really deliberate about how we add things and how we evolve things.

01:04:20

Ron

How are you thinking about maintaining backwards compatibility guarantees as you evolve it forward?

01:04:25

Chris

We’re actively debating and discussing what Mojo 1.0 looks like. And so I’m not going to give you a timeframe, but it will hopefully not be very far away. And what I am fond of is this notion of semantic versioning: saying we’re going to have a 1.0, and then a 2.0, and a 3.0, and a 4.0, et cetera. Each of these may be incompatible with the last, but they can link together. One of the big challenges, and a lot of the damage in the Python ecosystem, was from the Python two-to-three conversion. It took 15 years and it was a heroic mess for many different reasons. The reason it took so long is that you have to convert the entire package ecosystem before you can move to 3.0. If you contrast that to something like C++ (let me say good things about C++), they got the ABI right.

And so once the ABI was set, you could have one package built in C++98 and one package built in C++23, and these things would interoperate and be compatible, even if new keywords or other things were added in the later language version. And so what I see for Mojo is much more similar to the C++ ecosystem or something like this, and that allows us to be a little bit more aggressive in terms of migrating code, fixing bugs, and moving the language forward. But I want to make sure that Mojo 2.0 and Mojo 1.0 packages work together, and that there’s good tooling, probably AI-driven, but good tooling to move from 1.0 to 2.0 and be able to manage the ecosystem that way.

01:05:49

Ron

I think the type system also helps an enormous amount. I think one of the reasons the Python migration was so hard is that you couldn’t be like, ‘And then let me try and build this with Python 3 and see what’s broken.’ You could only see what’s broken by actually walking all of the execution paths of your program. And if you didn’t have enough testing, that would be very hard. And even if you did, it wasn’t that easy. Whereas with a strong type system, you can get an enormous amount of very precise guidance. And actually the combination of a strong type system and an agentic coding system is awesome. We actually have a bunch of experience of just trying these things out now, where you make some small change to the type of something and then you’re like, ‘Hey, AI system, please run down all the type errors, fix them all.’ And it does surprisingly well.

01:06:26

Chris

I absolutely agree. There’s other components to it. So Rust has done a very good job with the stabilization approach with crates and APIs. And I think that’s a really good thing. And so I think we’ll take good ideas from many of these different ecosystems and hopefully do something that works well, and works well for the ecosystem, and allows us to scale without being completely constrained by never being able to fix something once you ship a 1.0.

01:06:45

Ron

I’m actually curious, just to go to the agentic programming thing for a second: having AI agents write good kernels is actually pretty hard. And I’m curious what your experience is of how things work with Mojo. Mojo is obviously not a language deeply embedded in the training set that these models were built on, but on the other hand, you have this very strong type structure that can guide the AI agent as it tries to write and modify code. I’m curious how that pans out in practice as you try to use these tools.

01:07:12

Chris

So this is why Mojo being open source matters: we have hundreds of thousands of lines of Mojo code that are public, with all these GPU kernels and all this other cool stuff, and we have a community of people writing more code. Having hundreds of thousands of lines of Mojo code is fantastic. You can point your coding tool, Cursor or whatever it is, at that repo and say, ‘Go learn about this repo and index it.’ So it’s not that you have to train the model to know the language; just having access to the code enables it to do good work. And these tools are phenomenal. And so that’s been very, very, very important. We have instructions on our webpage for how to set up these tools, and there’s a huge difference between setting it up right, so that it can index that code, and not doing so. So make sure to follow the markdown file that explains how to set up the tool.

01:07:54

Ron

So, I want to talk a little bit about the future of Mojo. I think that the current way that you and Modular have been talking about Mojo, these days at least, is as a replacement for CUDA, an alternative full top-to-bottom stack for building GPU kernels, for writing programs that execute on GPUs. But that’s not the only way you’ve ever talked about Mojo. Especially earlier on, I think, there was more discussion of Mojo as an extension of, maybe an evolution of, and maybe eventually a replacement for Python. And I’m curious, how do you think about that now? To what degree do you think of Mojo as its own new language that takes inspiration and syntax from Python, and to what degree do you want something that’s more deeply integrated over time?

01:08:32

Chris

So today, to pull it back to, ‘What is Mojo useful for today, and how do we explain it?’ Mojo is useful if you want code to go fast. If you have code on a CPU or a GPU and you want it to go fast, Mojo is a great thing. One of the really cool things that is available now—but it’s in preview and it’ll solidify in the next month or something—is it’s also the best way to extend Python. And so if you have a large-scale Python code base, again, tell me if this sounds familiar, you are coding away and you’re doing cool stuff in Python and then it starts to get slow. Typically what people do is, they have to either go rewrite the whole thing in Rust or C++, or they carve out some chunk of it and move some chunk of that package to C++ or Rust. This is what NumPy, or PyTorch, or all modern large-scale Python code bases end up doing.

01:09:13

Ron

If you look at the package mirrors and check the percentage of packages that have C extensions in them, it’s shockingly high. A really large fraction of Python stuff is actually part Python and part some other language, almost always C and C++, and a little bit of Rust.

01:09:27

Chris

That’s right. And so today—this isn’t the distant future—today, you can take your Python package, create a Mojo file, and say, ‘Okay, well these for loops are slow, move them over to Mojo.’ And we have people, for example, doing bioinformatics and other crazy stuff I know nothing about, saying, ‘Okay, well I just take my Python code and move it over to Mojo. Wow, now I get types, I get these benefits, but there are no bindings to write. The pip experience is beautiful. It’s super simple.’ You don’t have to have FFIs and nanobind and all this complexity to be able to do this. You also are not moving from Python with its syntax to curly braces and borrow checkers and other craziness. You get a very simple and seamless way to extend your Python package. And we have people that say, okay, well I did that, and I got it first 10x, then 100x, then 1000x faster on CPU.

But then, because it was easy, I just put it on a GPU. And so to me, this is amazing, because these are people that would never even have thought of it, and would never have gotten it onto a GPU, if they had switched to Rust or something like that. Again, the way I explain it is, Mojo is good for performance. It’s good if you want to go fast on a GPU, on a CPU, if you want to make Python go fast, or if you want to—I mean, some people are crazy enough to go whole hog and write Mojo programs entirely from scratch, and that’s super cool. If you fast forward six, nine months, something like that, I think Mojo will be a very credible top-to-bottom replacement for Rust.

And so we need a few more extensions to the generic system, and there are a few things I want to bake a little bit more. Some of the dynamic features that Rust has around existentials, the ability to use a trait dynamically at runtime, are missing in Mojo. And so we’ll add a few of those kinds of features. And as we do that, I think it’ll be really interesting as an applications-level programming language for people who care about this kind of stuff. You fast forward, I might even project a timeframe, maybe a year, 18 months from now, depending on how we prioritize things, and we’ll add classes. And as we add classes, suddenly it will look and feel much more familiar to a Python programmer. The classes in Mojo will be intentionally designed to be very similar to Python’s, and at that point we’ll have something that looks and feels kind of like a Python 4.

It’s very much cut from the same mold as Python. It integrates really well with Python. It’s really easy to extend Python with it, and so it’s very much a member of the Python family, but it’s not compatible with Python. And so what we’ll do over the course of N years, and I can’t predict exactly how long that is, is continue to run down the line of, okay, well how much compatibility do we want to add to this thing? And then I think that at some point people will consider it to be a Python superset, and effectively it will feel just like the best way to do Python in general. I think that will come in time. But to bring it all the way back, I want us to be very focused on, ‘What is Mojo useful for today?’ Great claims require great proof.

We have no proof that we can do this. I have a vision and a future in my brain, and I’ve built a few languages and some at-scale things before, and so I have quite high confidence that we can do this. But I want people to zero back in on: okay, if you’re writing performance code, if you’re writing GPU kernels or AI, if you have Python code and you don’t want it to go slow (a few of us have that problem), then Mojo can be very useful. And hopefully it’ll be even more useful to more people in the future.

01:12:26

Ron

And I think the practical short-term thing is already plenty ambitious and exciting on its own. Seems like a great thing to focus on.

01:12:32

Chris

Yeah, let’s solve heterogeneous compute and AI. That’s actually a pretty useful thing, right?

01:12:37

Ron

Alright, that seems like a great place to stop. Thank you so much for joining me.

01:12:41

Chris

Yeah, well thank you for having me. I love nerding out with you and I hope it’s useful and interesting to other people too. But even if not, I had a lot of fun with you.

01:12:49

Ron

You’ll find a complete transcript of the episode along with show notes and links at signalsandthreads.com. Thanks for joining us. See you next time.