Collaborations

The Mixer: Storming Group Theory With Group Intelligence

How imperfect algorithms, working together, can crack the century-old Burnside puzzle. A new study from the Stevens-Nebius-Gradarius Lab for AI in Math.

By Vlad Stepanov
CEO of Gradarius, a U.S. edtech platform that specializes in advanced mathematics education
This section features research emerging from our long-term partnerships with leading universities and frontier labs.

In 1902, the Cambridge-trained mathematician William Burnside posed a question so simple a bright teenager could grasp it, yet so deep it has consumed some of the finest mathematical minds for over a century.

If a machine has a finite number of operations, and repeating any sequence of them n times always brings it back to where it started, must the machine have only a limited number of possible states?

The question sounds easy. The answer has proved anything but. The Burnside problem, as it came to be known, remains one of the hardest challenges in group theory, the mathematics of symmetry.

Group theory isn’t confined to chalkboards. Group-theoretic computations underpin the cryptography behind most online transactions. Subatomic symmetry groups dictate which particle interactions are possible. Matrix groups and linear algebra power computer graphics, quantum mechanics, and much of modern AI. When we sharpen our understanding of group structures, the effects ripple across science and engineering.

For decades, mathematicians have attacked Burnside’s problem with increasingly sophisticated methods. Some simple cases were resolved early. Computational tools helped push the results a little further. But add just a bit more complexity to the task, and the computations explode, overwhelming even the most powerful algorithms. We’ve explored those limits firsthand at a new AI in mathematics lab created by Stevens Institute of Technology, Nebius Academy, and Gradarius.

And recently, by looking at the problem differently, we found a possible way through. Our approach, called algorithm mixing, shifts the focus from refining a single algorithm to letting several imperfect ones run in parallel and cooperate. This method delivers a 60× speedup over existing baselines, with broad implications across computer science. It could prove most powerful for challenges that demand not one solution, but a group of them.

What Is a Group

Take a multiplication table. The first row and column list integers. Choose one of each, and the cell where they meet gives the product. But nothing says the table has to use multiplication. We could fill the same grid with addition or with any operation we want.

Group theory generalizes this idea: start with a set of elements, define an operation on them, and then study what follows, provided the operation satisfies a few natural rules. When you squint, a surprising amount of the world fits this mold. The rotations of a cube, for instance, form a group: take any two rotations (the elements), and the operation is just doing one rotation and then doing the other one right after it.

Groups can be finite, like the 24 rotations of a cube, or infinite, like the integers under addition.
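To make the idea concrete, here is a small Python sketch (our illustration, not code from the study) that builds the "multiplication" table for the four rotations of a square and checks the group axioms directly:

```python
# The rotations of a square, represented as permutations of its four
# corners (corner i moves to position perm[i]). Composing two rotations
# is itself a rotation -- the defining property of a group.
rotations = {
    "r0":   (0, 1, 2, 3),   # do nothing (the identity)
    "r90":  (1, 2, 3, 0),
    "r180": (2, 3, 0, 1),
    "r270": (3, 0, 1, 2),
}

def compose(p, q):
    # apply p first, then q
    return tuple(q[p[i]] for i in range(4))

names = {perm: name for name, perm in rotations.items()}

# The group's "multiplication" table: one cell per pair of rotations.
table = {(a, b): names[compose(rotations[a], rotations[b])]
         for a in rotations for b in rotations}

# Group axioms (associativity holds automatically for composition):
assert all(v in rotations for v in table.values())        # closure
assert all(table[("r0", a)] == a for a in rotations)      # identity
assert all(any(table[(a, b)] == "r0" for b in rotations)  # inverses
           for a in rotations)
```

The cube's 24 rotations work the same way, just with a larger table.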

What Is a Burnside Group

So far, we’ve described a group using a multiplication table that lists every possible combination. But mathematicians rarely work that way. There’s a more compact approach. Instead of listing every element, we begin with a small set of generators, the basic building blocks, and a set of relations, the rules they must obey. Every element of the group is formed by combining these generators and their inverses. The relations determine when two different sequences actually represent the same element.

For example, the group ⟨a | a³ = e⟩ has a single generator and a single rule: a cubed equals the identity e, the element that does nothing, or, put more simply, do a three times and you’re back where you started. Here, any sequence of a’s quickly collapses — aaa = e, aaaa = a, aaaaa = a² — so the whole group consists of just three elements: {e, a, a²}. Three elements, fully determined by one relation.
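The collapse above can be captured in two lines of Python (our illustration): every word of a's reduces to a normal form by counting mod 3.

```python
def normal_form(word):
    # In <a | a^3 = e>, any run of a's collapses mod 3.
    return "a" * (len(word) % 3)

assert normal_form("aaa") == ""      # aaa = e
assert normal_form("aaaa") == "a"    # aaaa = a
assert normal_form("aaaaa") == "aa"  # aaaaa = a^2
```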

A Burnside group has an especially clean definition in this language:

B(m,n)=a1,,amxn=eB(m, n) = ⟨a_1, …, a_m | x^n = e for all xx⟩ — the group with mm generators where every element raised to the nn-th power equals the identity.

That’s it: m generators (usually written a, b, c, …) and one rule — repeat any sequence of generators n times and you get back to the identity. The definition is almost trivially simple, yet it leads straight to one of the most notorious open problems in algebra: the Burnside problem. For which values of m and n is B(m, n) finite?

The small cases are easy to work out by hand. B(2, 1) has one element; B(2, 2) has four.

But the sizes escalate fast. B(4, 3) has over 4 million elements. B(3, 4) contains more than 5 × 10^20. And B(4, 4) weighs in at roughly 1.08 × 10^127. These groups are all known to be finite, but you obviously cannot just list every element. We need better algorithms.
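For the smallest nontrivial case, B(2, 2), we can still brute-force everything. Any group of exponent 2 is abelian, so B(2, 2) is the Klein four-group Z2 × Z2; the sketch below (our illustration) closes the two generators under the group operation and confirms both the size and the defining relation:

```python
# B(2,2) is the Klein four-group Z2 x Z2: take generators a = (1,0)
# and b = (0,1) with componentwise addition mod 2, and close the set
# {identity} under multiplication by the generators.
def op(x, y):
    return ((x[0] + y[0]) % 2, (x[1] + y[1]) % 2)

gens = [(1, 0), (0, 1)]
elements = {(0, 0)}
frontier = [(0, 0)]
while frontier:
    x = frontier.pop()
    for g in gens:
        y = op(x, g)
        if y not in elements:
            elements.add(y)
            frontier.append(y)

assert len(elements) == 4                         # |B(2,2)| = 4
assert all(op(x, x) == (0, 0) for x in elements)  # every x satisfies x^2 = e
```

The same closure idea works in principle for any finite group, but at 10^127 elements it is hopeless — hence the need for symbolic methods like Knuth–Bendix.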

A Century-Long Quest

Burnside himself solved the easy cases in a 1902 paper but could make no headway on the general question. The first major development came in 1964, when Evgeny Golod and Igor Shafarevich constructed a counterexample to the broadest version of the problem: an infinite group generated by finitely many elements, in which every element has finite order. These orders, however, were unbounded, so there was no single number that worked for all elements, as required in the central bounded case, where every element satisfies x^n = e for a fixed n.

The decisive breakthrough arrived in 1968, when Pyotr Novikov and Sergei Adian published an epic proof, three papers, roughly 330 pages, showing that B(m, n) is infinite for all m ≥ 2 and odd n ≥ 4381. The proof was so intricate it became a cautionary tale. English mathematician John Britton spent years crafting a rival 300-page proof, only for Adian to demonstrate that, although the individual lemmas were correct, the inequalities tying them together were inconsistent. Britton never fully recovered; it was the last major paper he published.

Adian later tightened the bound to odd n ≥ 665. Efim Zelmanov won a Fields Medal in 1994 for solving the restricted Burnside problem: not whether B(m, n) is finite, but whether there is a largest finite group with m generators and exponent n. And yet specific small cases remain stubbornly open, above all B(2, 5): just two generators, exponent 5.

William Burnside (1852–1927), the author of one of the oldest questions in group theory.

The Knuth-Bendix Approach

The standard computational tool for this kind of problem is the Knuth–Bendix completion procedure, introduced by Donald Knuth and Peter Bendix. Their 1970 paper “Simple Word Problems in Universal Algebra” has a certain touch of irony in the title, given the formidable challenges this method would face.

The idea is to take the rules that define a group and turn them into a system that can simplify any expression in a consistent way. This is called a confluent rewriting system. If the process works, every complicated expression can be reduced to a single, standard form. And once that’s possible, you can always tell whether two different-looking expressions actually represent the same group element.

In practice, Knuth–Bendix works by repeatedly finding pairs of rules whose interaction creates an ambiguity (a “critical pair”), then resolving the ambiguity by adding a new rule or simplifying an existing one. (Readers familiar with commutative algebra will recognize a close parallel: Buchberger’s algorithm for computing Gröbner bases operates on essentially the same principle.) The trouble is that the number of intermediate rules can explode combinatorially. The procedure is not guaranteed to terminate, and even when it does, it may require enormous memory and time.

For Burnside groups, Knuth–Bendix succeeds on small cases — B(2, 3), B(2, 4), even B(4, 3) with a good implementation. But no existing Knuth–Bendix implementation has ever completed on B(2, 5). The intermediate rule sets simply grow too large, too fast.

Learning From TimSort

Before we describe our approach, consider an analogy from a completely different corner of computer science.

TimSort, the default sorting algorithm in Python, Java, and many other languages, is a hybrid. It combines Insertion Sort, which excels on small or nearly sorted data, with Merge Sort, which efficiently handles large, messy datasets. Instead of committing to one strategy, TimSort inspects the data as it runs and switches to whichever method fits best.

At first glance, this seems unnecessary. Merge Sort already guarantees near-optimal performance in the worst case, so why bother mixing anything in? Well, mainly because real-world data has structure. Its parts may already be nearly sorted, and an algorithm tuned only for the worst case can miss those easy wins. By combining two strategies, TimSort adapts to the data and runs faster in practice. That’s why many of its variants power sorting in modern programming languages.

Our question at the Stevens–Nebius–Gradarius lab was: could this idea apply more broadly?

Algorithm Mixing

The core concept is simple: instead of running a single algorithm in isolation, run several simultaneously and let them exchange intermediate results. Each algorithm may explore the space of relations differently. One might use a different term ordering; another might prioritize different kinds of simplifications. By exploring the problem differently, one algorithm’s progress can help another escape a dead end.

Our framework has three layers:

Base algorithms. These are the workhorses — instances of Knuth–Bendix. Each can be configured with different parameters: term orderings, reduction strategies, etc. This is a crucial point: a single algorithm with different parameters will explore the search space in a fundamentally different order, get stuck in different places, and make progress on different fronts. For the mixing layer’s purposes, two Knuth–Bendix instances with different term orderings are as distinct as two entirely different algorithms, and that diversity is exactly what makes the architecture work.

The Mixing Layer. This is the connective tissue. At configurable intervals, the Mixing Layer takes a snapshot of each algorithm’s current working set of relations and injects selected relations into the other algorithms. The key design question is what to share and when. Sharing too aggressively can flood a struggling algorithm with unhelpful data, while sharing too conservatively misses the whole point.

The Orchestrator. Sitting above the mixing layer is a controller that decides the policy: which algorithms to run, what parameters to give them, when to trigger an exchange, and what selection criteria to use when sharing relations. In our current experiments this is configured by hand. The next step is to replace it with a reinforcement-learning agent that learns the policy from experience.

The framework is deliberately general. Adding a new base algorithm means implementing a thin adapter that exposes its working set and accepts injected relations. The mixing and orchestration layers don’t need to know the internals.
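A structural sketch of the three layers might look as follows. The class and method names here are illustrative — the lab's actual adapter API is not shown in this article — and the base algorithm is stubbed out:

```python
class Worker:
    """One base-algorithm instance (e.g. Knuth-Bendix with a fixed term
    ordering), behind a thin adapter exposing its working set."""
    def __init__(self, name, step_fn):
        self.name = name
        self.step_fn = step_fn      # runs the base algorithm for one slice
        self.rules = set()          # current working set of relations
    def step(self):
        self.rules |= self.step_fn(self.rules)
    def snapshot(self):
        return set(self.rules)
    def inject(self, relations):
        self.rules |= relations     # absorb siblings' discoveries

def mix(workers, select=lambda rels: rels):
    """The Mixing Layer: snapshot every worker, then share a selected
    subset of the others' relations with each one."""
    snaps = {w.name: w.snapshot() for w in workers}
    for w in workers:
        incoming = set().union(*(s for n, s in snaps.items() if n != w.name))
        w.inject(select(incoming))

def orchestrate(workers, rounds, mix_every, select=lambda rels: rels):
    """The Orchestrator: here a fixed hand-tuned policy -- step all
    workers, exchange relations at a fixed interval."""
    for r in range(rounds):
        for w in workers:
            w.step()
        if (r + 1) % mix_every == 0:
            mix(workers, select)
```

The `select` hook is where the "what to share and when" question lives; replacing the fixed schedule in `orchestrate` with a learned policy is exactly the reinforcement-learning step described below.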


Beating Optimized Algorithms

Our first target was B(4, 3), large enough to be a serious benchmark, and with known outcomes to validate against.

The result: a more than 60× speedup over an already highly optimized implementation. It didn’t come from tweaking the core algorithm, but from running two differently configured versions side by side and letting them share what they had learned at just the right moment. The speedup far exceeds what parallelism alone could explain: two algorithms on two cores yield at most a 2× advantage, so the remaining factor comes from the cooperative exchange of relations. When they exchanged relations at a critical juncture, one was able to leapfrog past an intermediate blowup that would otherwise have bogged it down for far longer.

In the world of algorithm design, a 60× improvement on a mature, well-studied problem through a conceptually simple architectural change is remarkable. It is not every day that a straightforward idea beats heavily optimized algorithms by an order of magnitude. But for our team, this result is a waypoint, not a destination.

Method | Time | Notes
GAP / KBMAG [1] | ~100s | Heavily optimized single-algorithm baseline
Single KB (our implementation, RPO ordering) | ~130s | Our reimplementation, one algorithm
Mixed: RPO + shortlex | ~1.5s | Two KB instances exchanging relations

[1]: KBMAG (Knuth–Bendix on Monoids and Automatic Groups) is a widely-used package within the GAP computational algebra system.

Beyond Group Theory

The immediate goal remains the same: push this approach toward B(2, 5), a case no Knuth–Bendix implementation has ever completed. Whether algorithm mixing will be enough is still an open question. But a 60× speedup on B(4, 3) is a powerful signal that the framework has real potential.

The roadmap now has two tracks. First, replace the hand-tuned orchestrator with a reinforcement-learning agent capable of discovering mixing policies we wouldn’t think to try ourselves. The state space is rich: the agent can monitor the size, reduction rate, and overlap of each algorithm’s working set, learning when an exchange is likely to pay off.

Second, expand beyond Knuth–Bendix. Coset Enumeration, Cayley graph construction, and hybrid strategies are all candidates. The mixing framework is deliberately algorithm-agnostic, making it inexpensive to incorporate additional methods.

Of course, the scope is broader than group theory. The same shift applies wherever heuristic algorithms explore a shared combinatorial space: SAT solvers, theorem provers, constraint solvers. In all of these, AI makes it cheap to produce algorithm variants, while finding the right combination remains the real bottleneck. An architecture that turns the question from “which algorithm” into “which mix” could be remarkably useful.
