Turning Pieces Into Performance
Nebius' R&D lead outlines how the company’s research connects the dots in the agentic era.

AI agents are rewriting the interface between humans and computers. Browser-navigating assistants click, scroll, and complete tasks. Software engineering agents autonomously write and fix code. Deep research tools compress hundreds of pages into structured insights. The frontier of AI has rapidly shifted from static text generation to dynamic execution.
This is a thrilling time for researchers. And of course, it comes with a new class of problems.
Under the hood, an AI agent is no longer just a single neural network. Typically, at its core sits a large language model wrapped in scaffolding — prompts, tools, memory, and rules that define how the model acts inside real environments, from Docker containers to laptops. Building better agents goes beyond model quality. It’s a full-stack engineering and research challenge.
Deployed agents routinely consume millions of tokens for complex tasks as they navigate, observe feedback, and correct errors. To make agents faster, smarter, and more reliable, researchers are tackling the problem across several interconnected fronts: collecting environments at scale, reliable evaluation, post-training for agentic behavior, inference-time scaling, and system optimization. At Nebius, R&D focuses on this exact intersection.
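The loop described above — act, observe feedback, correct — can be sketched in a few lines. This is a toy illustration, not a real Nebius API: `model` and `env` are hypothetical stand-ins for an LLM call and an execution environment, and the token budget reflects the multi-million-token runs mentioned above.

```python
# Minimal sketch of an agent loop: the model proposes an action, the
# environment returns feedback, and the agent tracks a running token budget.
# `model` and `env` are toy stand-ins, not a real agent framework.

def run_agent(model, env, max_steps=10, token_budget=1_000_000):
    tokens_used = 0
    observation = env.reset()
    for _ in range(max_steps):
        action, tokens = model(observation)   # LLM call: action + token cost
        tokens_used += tokens
        if tokens_used > token_budget:
            break                             # out of budget; stop acting
        observation, done = env.step(action)  # observe feedback, maybe an error
        if done:
            break
    return observation, tokens_used
```

In a real agent, `env.step` would execute a tool call or shell command and return its output for the model to react to.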
SWE-rebench
Environments are the foundation. You cannot train or evaluate a capable agent without a world it can act in. For an agent, an environment isn’t merely a sandbox; it must also include robust, automated verification of task completion. Static benchmarks break quickly: they get contaminated and fail to reflect real-world complexity. In our SWE-rebench work, we built an automated pipeline that continuously collects verifiable tasks from real GitHub repositories.
This produced around 20,000 verifiable Python environments and allowed us to build a decontaminated benchmark that updates monthly with fresh tasks. Our recent SWE-rebench V2 builds on this foundation.
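The automated verification at the heart of these environments boils down to a simple contract: a task counts as solved only if the repository's test suite passes after the agent's change. A minimal sketch, assuming a checked-out repo and a per-task test command (function name illustrative, not the SWE-rebench API):

```python
# Sketch of automated task verification for a SWE-style environment:
# run the task's designated test command inside the repo and treat a
# zero exit code as success. Illustrative only.
import subprocess

def verify_task(repo_dir: str, test_cmd: list[str]) -> bool:
    """Return True if the test command passes in the given repo directory."""
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0
```

The value of this setup is that the same check works for both evaluation and RL reward computation, with no human in the loop.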
With scalable, verifiable environments in place, we can approach post-training. Reinforcement learning is promising for teaching agents to navigate long horizons, but it introduces severe complexities, which we examine in a dedicated paper.
A persistent hurdle in this kind of agentic RL is credit assignment. When an agent generates thousands of tokens to satisfy multiple constraints, giving it a single reward score is often too coarse. We addressed this fragility in our research by attributing rewards at a finer granularity, so the training signal reflects which requirements were met rather than collapsing everything into one number.
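The intuition behind finer-grained credit assignment can be illustrated with a toy example: instead of one scalar reward for the whole output, score each constraint separately, so the learning signal says which requirement failed. The constraint checkers below are hypothetical, not the actual reward model from our research.

```python
# Illustrative sketch of per-constraint reward decomposition: each named
# check scores the output independently, and the total is their mean.
# A single coarse reward would hide which constraint was violated.

def per_constraint_reward(output: str, checks: dict) -> dict:
    scores = {name: float(check(output)) for name, check in checks.items()}
    scores["total"] = sum(scores.values()) / len(checks)
    return scores
```

With this decomposition, an RL trainer can see that, say, the code compiled but missed a formatting constraint, instead of receiving an uninformative 0.5.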
Chart: resolve-rate metrics over time. Even though task complexity may vary from month to month, the overall trend suggests rapid improvement in agentic capabilities.
System Optimization
Of course, training isn’t the only lever. We can also scale compute at inference time. Search algorithms allow agents to simulate different paths before acting. In Guided Search Strategies in Non-Serializable Environments, we introduced a separate verifier that predicts state-action values, intelligently guiding the agent toward higher-quality trajectories — an approach that proves effective in both process- and outcome-based regimes.
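The verifier-guided idea can be sketched as a toy greedy search: at each depth, a value function scores candidate next states and the agent expands the most promising one, stopping when no candidate improves on the current state. Here `expand` and `value_fn` are hypothetical stand-ins for environment rollouts and the learned verifier, not the algorithm from the paper.

```python
# Toy sketch of verifier-guided search: a value function ranks candidate
# next states; the agent greedily follows the best one and stops when the
# verifier sees no further improvement. Illustrative only.

def guided_search(start, expand, value_fn, depth=3):
    state = start
    for _ in range(depth):
        candidates = expand(state)
        if not candidates:
            break
        best = max(candidates, key=value_fn)   # verifier picks best action
        if value_fn(best) <= value_fn(state):
            break                              # no improvement; stop searching
        state = best
    return state
```

Real guided search explores wider trees and handles non-serializable environment state, but the core loop — propose, score with the verifier, commit — is the same.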
Finally, the practical utility of all these advancements hinges on systems optimization. When agents spend millions of tokens per run, inference speed quickly becomes the ultimate bottleneck. Here, techniques like speculative decoding become essential for accelerating token generation. In our paper LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding, we asked a foundational question: what objective maximizes the acceptance rate directly? By modifying the loss functions, we optimize directly for acceptance and can train more efficient draft models that measurably speed up the entire pipeline.
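To see why acceptance rate is the quantity worth optimizing, recall the standard token-level rule in speculative decoding: a draft token x is accepted with probability min(1, p_target(x) / p_draft(x)). The sketch below implements that rule over toy distributions (plain dicts rather than real model outputs); the higher the draft model's agreement with the target, the more tokens survive this check per verification step.

```python
# Sketch of the standard acceptance rule in speculative decoding:
# accept a draft token x with probability min(1, p_target(x) / p_draft(x)).
# Distributions here are toy dicts, not real model logits.
import random

def accept_draft_token(token, p_target, p_draft, rng=random.random):
    ratio = min(1.0, p_target[token] / p_draft[token])
    return rng() < ratio
```

A better draft model pushes these ratios toward 1, which is exactly what optimizing the acceptance rate directly aims for.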
Framed this way, agent research becomes a coherent narrative rather than a set of scattered problems. Better automated environments lead to more rigorous evaluation. Reliable evaluation unlocks better multi-turn RL. Precise reward attribution improves training. Smarter test-time search boosts final performance. Inference optimization makes the whole loop practically deployable. AI agents hold huge potential, and at Nebius, we’re doing research across the entire stack.


