“LLMs are trying to make the human happy. I hope that formalisation solves this problem.”
Imperial’s Kevin Buzzard on Lean, AI in maths, and the future of mathematical truth.

In an era when large language models are storming mathematics, generating a torrent of plausible-looking arguments that may or may not be correct, the ability to verify proofs quickly and reliably has become more urgent than ever. A machine that produces a hundred candidate proofs a minute is only useful if you can check them without a hundred mathematicians standing by.
This is where formalisation and proof assistants come into play. Programming languages like Lean turn mathematical proof into code. The idea is simple in principle: if a proof compiles in Lean, it is correct. But formalising any serious result requires that everything it depends on has already been formalised too, which demands enormous effort: the underlying library of known results has to be built from scratch.
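To make that concrete, here is a minimal Lean 4 sketch, an illustrative example written for this piece rather than code from Mathlib (the theorem name is made up):

```lean
-- If this file compiles, Lean's kernel has checked every step.
-- `Nat.add_comm` is a lemma from Lean's core library.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Replace the proof term with `sorry` and Lean will still process the file, but it flags the declaration as containing an unproven gap: nothing passes silently.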
That library, Mathlib, has been growing steadily for years, assembled by enthusiasts encoding advanced mathematics piece by piece. It is now increasingly used to validate proofs generated by AI models. In turn, AI is beginning to accelerate formalisation itself, translating mathematical arguments into Lean at a speed no human team could match. This includes truly deep results: last month, a Fields Medal-winning paper was translated into Lean by AI.
Kevin Buzzard has been at the centre of the Lean community for nearly a decade. A professor of pure mathematics at Imperial College London and a specialist in the Langlands programme, he is among the most prominent advocates for formal verification in research mathematics. Buzzard holds a five-year grant to formalise the proof of Fermat’s Last Theorem, one of the most complex and celebrated results in the history of mathematics.
He was also the first speaker in the AI for Mathematical Sciences seminar series held at the London Institute for Mathematical Sciences. This series, sponsored by Nebius, explores how AI is reshaping maths and theoretical physics. After the seminar, we sat down with Kevin Buzzard to talk about the state of formal verification, the growing role of AI in mathematics, and why it’s a problem when a machine silently corrects a flawed proof without telling anyone.
You are now one of the leaders of the formal verification community and of the Lean effort. What drew you to formal verification in the first place?
There were lots of reasons that all happened at the same time. First, I got interested through teaching. I was teaching Imperial’s Introduction to Proof class, a class that every maths student has to sit through, where we try to convince them that what they did at school is somewhat different from university mathematics. I had taught it many times before, but I thought I could jazz it up by introducing technology.
Then there was another issue. I had some bad experiences refereeing papers, with experts writing things that turned out not to be correct. I wondered whether software was in a position to help with that. And of course the answer I found out pretty quickly was no, not in 2017.
I was also interested in using it as a tool for research. I didn’t really know what the state of the art was. But then, very shortly after I got interested, I saw a talk by Tom Hales at the Isaac Newton Institute. He gave a really inspiring talk about how these tools could engage with modern research across all areas of mathematics. This vision really captured my imagination.
Back in 2017, was it already Lean, or were you using some other software? What were your first wins with it?
I was experimenting with several pieces of software, before the course I was going to teach. But in Tom Hales’s talk, he spoke about Lean. That was the first time I’d heard about Lean.
At the beginning, the wins were just making it work. There was a document called Theorem Proving in Lean that Jeremy Avigad had written, a textbook, essentially. I read through it and did all the exercises. You get a little buzz of adrenaline every time you complete an exercise. It taught me how the software worked.
Then when the term started, the challenge was to see if I could handle the course I was about to teach, all this basic intro-to-proof stuff. Lean was not ready to solve some of these problems at all. It was in a pretty primitive state. Back then there were about 30,000 lines of code. It knew what a finite set was, it had the naturals, the integers, the rationals and the reals, but not the complex numbers. It knew what a group, a ring, a field and a topological space were. And that was about it.
So over the summer I ploughed through these problem sheets. My mental model of whether these questions were easy or difficult was based on whether a human school student found them easy or difficult. Doing them with a computer was a completely different experience. Questions about equivalence relations that students typically found very difficult were completely trivial for the machine, because an equivalence relation is defined by just three axioms, you type in the axioms and away you go.
Whereas making the real numbers in Lean is a huge amount of effort. You don’t normally think about maths like that. Students are much better with real numbers than equivalence relations because they’ve been using real numbers since they were twelve, but they’ve never seen equivalence relations before. So the software found some questions much easier than others.
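To see why the machine finds equivalence relations so easy, here is a minimal Lean 4 sketch, a hypothetical example in the spirit of those problem sheets rather than one of them: the three axioms really are the whole story.

```lean
-- "Having the same parity" is a relation on the natural numbers.
def sameParity (a b : Nat) : Prop := a % 2 = b % 2

-- Proving it is an equivalence relation means discharging exactly
-- three obligations: reflexivity, symmetry and transitivity.
theorem sameParity_equiv : Equivalence sameParity :=
  { refl  := fun _ => rfl                  -- a % 2 = a % 2
    symm  := fun h => Eq.symm h            -- flip the equality
    trans := fun h₁ h₂ => Eq.trans h₁ h₂ } -- chain the two equalities
```

Constructing the real numbers, by contrast, takes thousands of lines: Cauchy sequences, a quotient, and proofs that every field operation respects it.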

Is it then that you started creating Lean projects with your students?
Yes. I started using the software in the lectures, and I also started a club. I ran it on Thursday evenings, and I expected the weak students to come along and ask questions about what mathematics was. But what actually happened was all the strong students showed up, and they just wanted projects. They didn’t necessarily even want to do the problem sheets, they wanted to contribute to the maths library and learn more mathematics.
The students who showed up were ones who already knew all the course material, because they’d read notes on the internet before coming to university. So I had to find other things for them to do. I gave some of them my Galois Theory lecture notes, and told a bunch of first-years: teach the computer this material from the third-year course.
I gave others a graduate text in Commutative Algebra and said: teach the computer Commutative Algebra. They came back two weeks later and said: “We’ve developed the theory of localisation.” These were first-year undergraduates. And I thought: well, if we’ve done that, and we’ve got topological spaces, then we can probably make schemes. And we taught a computer what a scheme is.
I told some computer scientists, and they said: “All you’ve done is make a definition. Did you prove any theorems?” I said no. Theorems about schemes are hard. First you’ve got to make the definition, and that’s already an achievement. You couldn’t convince a professional mathematician to use the software at that point because it couldn’t do anything. But the undergraduates were too young and too naive to know any better. So I was constantly trying to find things for them to do.
Do you remember when the first really interesting theorem was encoded in Lean, something that made you feel you’d truly arrived?
Before we had an interesting theorem, we had an interesting definition. In 2019, Johan Commelin, Patrick Massot and I wrote down the definition of a perfectoid space. This was no different in spirit to the definition of a scheme: we just made the definition. We couldn’t really produce any examples.
But when I started telling other mathematicians, people got very excited, because perfectoid spaces are very fashionable, and Peter Scholze had just won the Fields Medal for inventing them. Once we had made this definition, I just thought: if we can do this, we can surely do anything.
The last real challenge was to prove a complicated theorem about a complicated definition. We already knew, from the odd-order theorem and the four-colour theorem, that you could prove complicated theorems about simple objects. And we had now made a very complicated object but proved no theorems about it. I thought that was very exciting because we were pushing the software in a different direction. The next question was: can we prove a complex theorem about complex objects?
And that was the Scholze project. Peter Scholze challenged us to prove a certain theorem about liquid vector spaces, and 18 months later, in July 2022, the community had completed it: the Liquid Tensor Experiment. By that point, it was absolutely clear that we could do anything. The only problem was that it takes a lot of time. If we leave humans to do the formalisation, it might take years.

Now you’re working on a complicated theorem about simple objects, Fermat’s Last Theorem. How is the project going?
It’s a long project. There’s a question about whether AI will help speed it up, but independent of that, I have a vision for what this is going to look like.
At the beginning, I was formalising for fun. But now it’s become my job, and so I need a better answer than “it’s fun.” For Fermat, what I want to do is make some kind of living interactive document that will teach people the proof. You’ll be able to unfold it and look at technical details, or keep it folded and look at the big picture. A living interactive textbook, basically. That’s where we’re going now.
I spent the first year of the project building up stuff from the basics. But then I basically got bored. I wanted to engage with the interesting stuff, and all I was doing in practice was building theory that was known in the 1950s and 1960s. So at the beginning of this year, I thought: that’s enough of that. We proved a technical result in early January: we showed that a certain space was finite-dimensional.
So for the last few months I’ve been re-reading the proof, checking I understand it, and planning what the code is going to look like. We’re going to prove theorems of the form: “this complicated thing implies Fermat’s Last Theorem,” then “this slightly easier thing implies Fermat’s Last Theorem.” So we’re building from the top down, which is not really common in big formalisation projects, but is actually very common when you’re proving an individual theorem. We’re definitely on track to finish by 2029, and ideally earlier.
Have you had any concrete experiences recently where AI was, or wasn’t, helpful?
Here’s a very concrete example of where you can’t trust the machine: making a complex definition. Last week I tried it and it didn’t work. The machine says: “Here’s the definition, I’ve done it.” And then you read it carefully and think: this definition is wrong. I haven’t seen a system which reliably gets the human out of the loop for this.
And then there’s another layer: even if the code is mathematically equivalent to what you wanted, is it idiomatic? Is it usable? Is it in Lean’s Mathlib? If I write a definition of Hilbert modular forms and put it in the Fermat project, and then somebody else writes a different definition and puts it in Mathlib, now I have a big problem. But getting things into Mathlib is a very slow process, because everything is being manually curated. The maths library is a human-generated and human-curated work of art.
So there are various levels: not even mathematically correct; mathematically correct but unusable; mathematically correct and usable but not in Mathlib. And then the gold standard: mathematically correct, usable, has a basic API, and is officially in Mathlib. Definitions have a long journey, and this is going to really complicate things.
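As a toy illustration of the difference between a bare definition and one with a basic API (hypothetical names, written for this piece, not Mathlib code):

```lean
-- A bare definition: mathematically fine, but nearly unusable on its own.
def double (n : Nat) : Nat := 2 * n

-- The beginnings of an API: lemmas that let other proofs use
-- `double` without unfolding the definition by hand.
@[simp] theorem double_zero : double 0 = 0 := rfl

theorem double_add (m n : Nat) : double (m + n) = double m + double n := by
  simp [double, Nat.mul_add]
```

Multiply this by the hundreds of lemmas a definition like Hilbert modular forms would need, and the curation burden becomes clear.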
Do you think making the right definitions will remain a human job for some time, even as proof-writing gets automated?
Right now I can say with confidence that AI is not great at making complicated definitions. But in six months’ time, who knows? A few years ago I would have said I didn’t want to predict what would happen in five years’ time. Right now, I don’t really want to predict what’s happening in one year’s time.
Does it disturb you?
No. I’m very excited. I love the chaos. I love the fact that my area is moving very, very fast.
I was a number theorist for 20 years. The only time I’ve ever really felt that things were happening so quickly was after the proof of Fermat’s Last Theorem. That opened up so many new ideas.

Let’s go back to one of your original motivations for Lean — concerns about poorly written papers and mathematical errors. How bad do you think the situation is in mathematics right now?
I think humans are mostly confident that the important stuff is correct, because the important stuff gets read very carefully. I got paranoid and worried that maybe things were more wrong than most people believed. But this might just have been a midlife crisis I had. Maybe things are fine. Fermat’s Last Theorem is fine. We’ve found some mistakes, but they’ve all been fixable.
It’s a terrible thing to say, but I sort of lost confidence in many of my number theory colleagues. When I started formalising, I started finding mistakes in papers, simple mistakes, things that could be fixed. You read it, and it looks fine. And then you type it into the machine, and the machine says: “Well, you say you’re proving this by induction, but what about the base case?” And then you realise that it’s not entirely clear the base case is true at all. There were several events like that which made me worry more and more about the human literature.
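This is exactly the sort of gap a proof assistant refuses to let you wave away. A minimal Lean 4 sketch, an illustrative example rather than anything from the papers in question:

```lean
-- Lean opens a separate goal for the base case and will not
-- accept the proof until that goal is discharged.
theorem my_zero_add (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                        -- base case: 0 + 0 = 0
  | succ k ih => rw [Nat.add_succ, ih] -- inductive step, using ih : 0 + k = k
```

Delete the `| zero` line and the file fails to compile with an unsolved goal, which is precisely the complaint the machine raises.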
I think we’re going to get to the point where machines will start checking papers. And just last week I saw a remarkable story. An LLM wrote a solution to a problem, and then one of these Lean tools auto-formalised the LLM’s proof into Lean. A human then started writing up the experience, followed the LLM’s informal proof, and found a mistake in it. But how? The proof had been checked in Lean.
And then you look at the Lean code. It turns out the Lean code does not formalise the original proof. It formalises a slightly different proof. The system found the mistake, realised it couldn’t formalise the original, and just found a workaround, without ever saying: “By the way, the proof you gave me was wrong.” It just fixed it and got on with it.
That’s so very LLM.
Exactly! If LLMs start formalising the literature, they’ll just be silently fixing up simple mistakes along the way. What I worry about is the bigger gaps, where the author says “it is obvious that A and B imply C,” and the AI just can’t work out why. The paper doesn’t give enough information. The author might be right, or they might be wrong. And then you have to find the human, who is old, or retired, or dead. And they don’t remember.
Could we end up in a situation like the four-colour theorem, where a computer proves something and humans struggle to truly understand the proof?
There are some nightmare scenarios. One extreme: AI gets really good and starts proving things autonomously. Suddenly we have a billion-line proof of the Riemann Hypothesis, and you ask the AI to explain it, and it says: “Oh, this is just too long and complicated for you puny humans to understand.” That’s one extreme.
But there’s an extreme on the other side that people don’t talk about, because everyone is so convinced AI will do everything. I’ve talked to people who try to get AI to discover new interesting constructions, extremal constructions in combinatorics. You read papers where it says: “Our neural network solved all 10 of these questions in graph theory.” And then you ask the authors: what about these other 10? And they say: oh no, we tried those, it didn’t work. You only see the success stories.
Do you use LLMs yourself in your research now?
Yes. When I was a PhD student, I’d work very hard and try to fill in the details myself. But now I’m lazy. When I’m reading literature for this Fermat course and I don’t understand something, I just ask a language model: “Can you explain how they get from this to this?” And sometimes it writes rubbish. But because I’m an expert, I can tell. Eventually it will say: “Okay, just do this, this and this.” And I say: “That does work. That’s a nice idea.” But whose idea was it?
I had an email from Richard Taylor (British-American mathematician specialising in number theory) that just asserted a certain property of a map of rings without proof. I knew the answer was in the literature somewhere, but I asked a language model, and it came up with a proof that this map had a much stronger property than I’d expected. And I couldn’t tell whether AI had beaten the standard references or just found some article I’d never read.
The problem is, it could never give me a coherent answer as to how it had worked this out. So I hesitate to say we’ve beaten the existing literature, because someone might say: “Well, actually I proved that ten years ago, why aren’t you citing me?” And the answer is: because I didn’t know it was you. The language model told me.

I wonder if the art of writing papers is becoming a dying one, with papers increasingly written by LLMs and read by LLMs.
The whole thing is very weird. When humans write papers, you believe they have good reason not to be lying. If a human submits a paper, they probably believe they’ve proved the theorem. But if an LLM is writing a paper, you have no idea. It doesn’t have any beliefs. It’s just doing next-token prediction. It’s trying to make the human happy.
I’m hoping that formalisation solves this problem. And just in the last month, we’re seeing real indications that this is happening.
For instance, Fabian Gloeckle’s work at Meta FAIR for his PhD used around 13,000 Claude agents to formalise a 700-page textbook in algebraic combinatorics in about a week, at a cost of around $100,000, roughly 140 dollars per page. For humans, that would be a multi-year project.
And then there’s First Proof, this set of ten challenge problems. Six of them have reportedly been solved by LLMs, and for some of those, machines translated the LLM output into Lean. And it works. So auto-formalisation is really becoming a thing.
And there’s the sphere packing result from last month: the Fields Medal-winning paper by Maryna Viazovska, formalised by machine. These are well-written papers that are a joy to read, and now they’ve been translated into Lean. This gives us a great data point about what AI can do in March 2026. In May 2026, AI will probably be able to do more.
Finally, will there ever be an automatic way of evaluating whether a paper is meaningful, not just correct?
A related question is: will a machine ever start asking interesting questions? Because this is something we haven’t seen any examples of. We’ve seen machines asking very, very dull questions. Generating interesting conjectures has been tried many times, and right now the answer is always: “Here are some very boring conjectures.”
Discriminative tasks are usually easier than generative ones. Telling interesting from uninteresting might be easier than actually producing something interesting.
Maybe AI will confidently tell you whether a paper is interesting, maybe that’s easy already. And the question is: will it agree with the human’s opinion?
It would be an interesting experiment. Take a top journal, take 100 papers submitted to it, 50 accepted, 50 rejected because the ideas were not interesting enough, and feed them all into an AI and see if it can tell which is which. The problem is, with a commercial model it’s probably been trained on everything, and will just look at where the paper was published. You’d need some careful, bespoke AI that definitely hasn’t been trained on this data.
I am sure that right now it would confidently give you an answer, but whether it’s the right answer is a different question, right?