Turning Pieces Into Performance
Nebius' R&D lead outlines how the company’s research connects the dots in the agentic era.

AI agents are rewriting the interface between humans and computers. Browser-navigating assistants click, scroll, and complete tasks. Software engineering agents autonomously write and fix code. Deep research tools compress hundreds of pages into structured insights. The frontier of AI has rapidly shifted from static text generation to dynamic execution.
This is a thrilling time for researchers. And of course, it comes with a new class of problems.
Under the hood, an AI agent is no longer just a single neural network. Typically, at its core sits a large language model wrapped in scaffolding — prompts, tools, memory, and rules that define how the model acts inside real environments, from Docker containers to laptops. Building better agents goes beyond model quality. It’s a full-stack engineering and research challenge.
Deployed agents routinely consume millions of tokens for complex tasks as they navigate, observe feedback, and correct errors. To make agents faster, smarter, and more reliable, researchers are tackling the problem across several interconnected fronts: collecting environments at scale, reliable evaluation, post-training for agentic behavior, inference-time scaling, and system optimization. At Nebius, R&D focuses on this exact intersection.
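The loop described above — act, observe feedback, correct — can be sketched in a few lines. This is a toy illustration, not a real Nebius API: `model` and `env` are hypothetical stand-ins for an LLM call and an execution environment, and the token budget reflects the multi-million-token runs mentioned above.

```python
# Minimal sketch of an agent loop: the model proposes an action, the
# environment returns feedback, and the agent tracks a running token budget.
# `model` and `env` are toy stand-ins, not a real agent framework.

def run_agent(model, env, max_steps=10, token_budget=1_000_000):
    tokens_used = 0
    observation = env.reset()
    for _ in range(max_steps):
        action, tokens = model(observation)   # LLM call: action + token cost
        tokens_used += tokens
        if tokens_used > token_budget:
            break                             # out of budget; stop acting
        observation, done = env.step(action)  # observe feedback, maybe an error
        if done:
            break
    return observation, tokens_used
```

In a real agent, `env.step` would execute a tool call or shell command and return its output for the model to react to.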
SWE-rebench
Environments are the foundation. You cannot train or evaluate a capable agent without a world it can act in. For an agent, an environment isn’t merely a sandbox; it must also include robust, automated verification of task completion. Static benchmarks break quickly: they get contaminated and fail to reflect real-world complexity. In our SWE-rebench work, we built an automated pipeline that continuously collects verifiable tasks from real GitHub repositories.
This produced around 20,000 verifiable Python environments and allowed us to build a decontaminated benchmark that updates monthly with fresh tasks. Our recent SWE-rebench V2 builds on this foundation.
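The automated verification at the heart of these environments boils down to a simple contract: a task counts as solved only if the repository's test suite passes after the agent's change. A minimal sketch, assuming a checked-out repo and a per-task test command (function name illustrative, not the SWE-rebench API):

```python
# Sketch of automated task verification for a SWE-style environment:
# run the task's designated test command inside the repo and treat a
# zero exit code as success. Illustrative only.
import subprocess

def verify_task(repo_dir: str, test_cmd: list[str]) -> bool:
    """Return True if the test command passes in the given repo directory."""
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0
```

The value of this setup is that the same check works for both evaluation and RL reward computation, with no human in the loop.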
With scalable, verifiable environments in place, we can approach post-training. Reinforcement learning is promising for teaching agents to navigate long horizons, but it introduces severe complexities, which we examine in a dedicated paper.
A persistent hurdle in this kind of agentic RL is credit assignment. When an agent generates thousands of tokens to satisfy multiple constraints, giving it a single reward score is often too coarse. We addressed this fragility in our research by attributing rewards at a finer granularity, so the training signal reflects which requirements were met rather than collapsing everything into one number.
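The intuition behind finer-grained credit assignment can be illustrated with a toy example: instead of one scalar reward for the whole output, score each constraint separately, so the learning signal says which requirement failed. The constraint checkers below are hypothetical, not the actual reward model from our research.

```python
# Illustrative sketch of per-constraint reward decomposition: each named
# check scores the output independently, and the total is their mean.
# A single coarse reward would hide which constraint was violated.

def per_constraint_reward(output: str, checks: dict) -> dict:
    scores = {name: float(check(output)) for name, check in checks.items()}
    scores["total"] = sum(scores.values()) / len(checks)
    return scores
```

With this decomposition, an RL trainer can see that, say, the code compiled but missed a formatting constraint, instead of receiving an uninformative 0.5.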
Chart: resolve-rate metrics over time. Even though task complexity may vary from month to month, the overall trend suggests rapid improvement in agentic capabilities.
System Optimization
Of course, training isn’t the only lever. We can also scale compute at inference time. Search algorithms allow agents to simulate different paths before acting. In Guided Search Strategies in Non-Serializable Environments, we introduced a separate verifier that predicts state-action values, intelligently guiding the agent toward higher-quality trajectories — an approach that proves effective in both process- and outcome-based regimes.
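The verifier-guided idea can be sketched as a toy greedy search: at each depth, a value function scores candidate next states and the agent expands the most promising one, stopping when no candidate improves on the current state. Here `expand` and `value_fn` are hypothetical stand-ins for environment rollouts and the learned verifier, not the algorithm from the paper.

```python
# Toy sketch of verifier-guided search: a value function ranks candidate
# next states; the agent greedily follows the best one and stops when the
# verifier sees no further improvement. Illustrative only.

def guided_search(start, expand, value_fn, depth=3):
    state = start
    for _ in range(depth):
        candidates = expand(state)
        if not candidates:
            break
        best = max(candidates, key=value_fn)   # verifier picks best action
        if value_fn(best) <= value_fn(state):
            break                              # no improvement; stop searching
        state = best
    return state
```

Real guided search explores wider trees and handles non-serializable environment state, but the core loop — propose, score with the verifier, commit — is the same.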
Finally, the practical utility of all these advancements hinges on systems optimization. When agents spend millions of tokens per run, inference speed quickly becomes the ultimate bottleneck. Here, techniques like speculative decoding become essential for accelerating token generation. In our paper LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding, we asked a foundational question: what objective maximizes the acceptance rate directly? By modifying the loss functions, we optimize directly for acceptance and can train more efficient draft models that measurably speed up the entire pipeline.
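To see why acceptance rate is the quantity worth optimizing, recall the standard token-level rule in speculative decoding: a draft token x is accepted with probability min(1, p_target(x) / p_draft(x)). The sketch below implements that rule over toy distributions (plain dicts rather than real model outputs); the higher the draft model's agreement with the target, the more tokens survive this check per verification step.

```python
# Sketch of the standard acceptance rule in speculative decoding:
# accept a draft token x with probability min(1, p_target(x) / p_draft(x)).
# Distributions here are toy dicts, not real model logits.
import random

def accept_draft_token(token, p_target, p_draft, rng=random.random):
    ratio = min(1.0, p_target[token] / p_draft[token])
    return rng() < ratio
```

A better draft model pushes these ratios toward 1, which is exactly what optimizing the acceptance rate directly aims for.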
Framed this way, agent research becomes a coherent narrative rather than a set of scattered problems. Better automated environments lead to more rigorous evaluation. Reliable evaluation unlocks better multi-turn RL. Precise reward attribution improves training. Smarter test-time search boosts final performance. Inference optimization makes the whole loop practically deployable. AI agents hold huge potential, and at Nebius, we’re doing research across the entire stack.


