Field Reports

ICLR: 20 Papers That Matter

From inference to life sciences: impactful publications from ICLR-2026 in Rio de Janeiro, filtered through Nebius research priorities.

By Arseniy Sokolov
Solutions Analyst, Nebius

Founded by AI pioneers Yann LeCun and Yoshua Bengio, the International Conference on Learning Representations (ICLR, widely pronounced “Eye-Clear”) is the youngest of the “Big Three” machine learning conferences. Born during the deep learning boom of the 2010s, it rapidly grew into the second-highest-impact AI research conference globally, according to Google Scholar.

This April 23-27, ICLR arrives in Rio de Janeiro for its latest edition, and the numbers alone tell the story of a field in overdrive. Of nearly 20,000 submissions, a 70% jump from last year, 5,355 papers made the cut after an especially challenging review process. Besides its strong program, ICLR 2026 will be remembered for how hard it fought to keep the selection process honest.

Rapid AI growth came with its share of headaches, such as LLM-generated submissions and reviews. To counter this, ICLR tightened its rules on AI-written papers. Authors and reviewers had to disclose any use of AI tools, and ICLR ran automated checks to flag likely AI-written reviews and hallucinated references, leading to desk rejections.

The conference also weathered a serious security breach when a malicious actor exploited a vulnerability in the OpenReview platform, leaking sensitive data and triggering attempts to pressure reviewers. The conference administration had to reset all review scores and reassign area chairs to salvage the process, finally yielding an acceptance rate similar to prior years.

Despite the turbulence, ICLR opens in Rio with a hard-won program that spans a broad range of topics. Nebius, a gold sponsor of ICLR 2026, is also part of this scene, with a team of researchers and engineers on the ground (find us at Booth 308). Events like this are invaluable for staying close to the frontier, but they come with a navigation challenge. While the organizers have made efforts to structure the program, identifying the papers that matter most to your work is no small task, even when everything is publicly available.

That challenge is what sparked this selection. We scored ICLR papers using a composite of their review ratings across four categories, some narrow, where our team has deep expertise, others broader but also central to what we do. The result is 20 papers worth your attention. This is not a ranking, and we recognize that review scores are an imperfect proxy for impact prediction. Filtered through our own experts, the list reflects something more focused: papers that matter to the directions Nebius is betting on.

Coding Agents

At Nebius, software engineering agents sit at the core of our AI research. We invest in both evaluation and infrastructure for agentic systems, from building SWE-rebench, a benchmark that tests agents on real-world GitHub tasks, to creating datasets for training them. That focus strongly echoes what ICLR 2026 is showing: coding agents are evolving in depth, tackling problems as complex as GPU kernel optimization, and in breadth, expanding into new domains. Both shifts are driven by multi-agent architectures, model scaling, and reinforcement learning.

Kimi-Dev: Agentless Training as Skill Prior for SWE-agents

The SWE field has settled into two camps. Agentless systems decompose software engineering into modular workflows, stable but rigid when tasks require iterative updates. SWE agents take the opposite bet: end-to-end, multi-turn reasoning that plans and acts the way a human developer works. The flexibility comes at a cost: trajectories stretch over hundreds of steps and the model must juggle exploration, reasoning, and tool use at once. The two paradigms have long been treated as mutually exclusive. Kimi-Dev challenges that, showing agentless training can serve as a skill prior that boosts subsequent SWE-agent training and, ultimately, agentic performance.

Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

Most self-improving coding agents grow by a simple rule: favor the modifications that score higher on benchmarks. Sounds reasonable, but a high-scoring agent can seed a dead-end lineage, while a lower-scoring one produces descendants that compound into something far more capable. The Huxley-Gödel Machine paper names this the Metaproductivity-Performance Mismatch and builds an algorithm around fixing it. Inspired by the biological concept of clades, it introduces a metric that judges an agent not by its own benchmark score but by the aggregate performance of its descendants. The agent optimized by HGM matches the best officially checked results of human-engineered coding agents on SWE-bench Lite.

SWINGARENA: Adversarial Programming Arena for Long-context GitHub Issue Solving

SwingArena explores if a model can submit a patch that survives a full CI pipeline and automated peer review, handling the long-context challenge of real C++, Python, Rust, and Go codebases. The framework pairs LLMs in adversarial roles: a submitter that generates patches and a reviewer that writes test cases and validates them through CI pipelines. SwingArena includes a retrieval-augmented code generation module that gives models standardized access to relevant snippets regardless of their context window size. SwingArena surfaces nuanced trade-offs between patch correctness and review strictness that static benchmarks miss.

Reinforcement Learning for Machine Learning Engineering Agents

Most ML engineering agents prompt a powerful model and hope for the best. This Stanford paper asks what happens if you train a weaker model with RL instead. The result: Qwen2.5-3B trained with RL, given enough compute, eventually outperforms Claude-3.5-Sonnet with agent scaffolding by 22% on average across 12 Kaggle tasks. Two engineering contributions make this work. First, duration-aware gradient updates that stop the agent from favoring fast but suboptimal solutions over slower, higher-reward ones. Second, environment instrumentation: a static model inserts print statements into the agent’s code to log experimental progress, extracting partial credit as a reward signal.

STARK: Strategic Team of Agents for Refining Kernels

STARK sits at the intersection of SWE agents and infrastructure, and is worth reading from both angles. GPU kernel optimization is one of the hardest coding tasks: hardware trade-offs are non-obvious, and a single wrong decision in thread scheduling can tank performance. STARK addresses the problem with an LLM agentic framework that systematically explores the design space through multi-agent collaboration, grounded instruction, dynamic context management, and strategic search. On KernelBench, STARK produces correct solutions where single-agent baselines fail outright, and achieves up to 16× faster kernel runtimes, an important step towards fully automated, scalable GPU kernel optimization.

Inference & Infrastructure

Inference and infrastructure optimization are the defining research bets at Nebius, and both are drawing growing attention at ICLR. Inference papers attack the problem from every angle: speculative decoding, quantization, sparsity, and an increasing focus on reasoning models specifically, trimming overthinking and cutting tokens that don’t move the needle. Infrastructure work is broadening too: distributed training efficiency and kernel-level optimization feature alongside federated learning and more honest evaluation environments.

Reasoning with Sampling: Your Base Model is Smarter Than You Think

This paper challenges the assumption that reinforcement learning is necessary to make LLMs better at reasoning. The authors show that sampling from a base model multiple times, using the model’s own confidence scores to iteratively zero in on better answers, can match or even beat RL-trained models on hard math, coding, and science benchmarks. The method requires no additional training data or fine-tuning, making it substantially cheaper and more broadly applicable than RL post-training.
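The paper’s actual procedure iteratively sharpens the sampling distribution; as a rough illustration of the core idea, picking among samples by the model’s own likelihood, here is a plain best-of-N sketch. The `sample_fn` interface and the stub sampler are hypothetical, not the paper’s algorithm:

```python
import math

def best_of_n(prompt, sample_fn, n=8):
    """Pick the candidate the model itself is most confident in.

    sample_fn(prompt) -> (answer, token_logprobs) is a hypothetical
    interface; scoring by mean token log-prob is a common heuristic,
    simpler than the paper's iterative resampling scheme.
    """
    best, best_score = None, -math.inf
    for _ in range(n):
        answer, logprobs = sample_fn(prompt)
        score = sum(logprobs) / max(len(logprobs), 1)  # mean token log-prob
        if score > best_score:
            best, best_score = answer, score
    return best

# Deterministic stub standing in for an LLM, for demonstration only.
_canned = iter([("5", [-2.0, -1.5]), ("4", [-0.2, -0.1]), ("3", [-3.0, -2.5])])
def fake_sampler(prompt):
    return next(_canned)

print(best_of_n("2+2?", fake_sampler, n=3))  # prints "4", the highest-confidence sample
```

The appeal in both cases is the same: no extra training data, no fine-tuning, only more inference-time compute.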

DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

In LLM serving, reusing the KV cache across requests reduces latency and cost, but conflicts with load balancing: grouping similar requests improves cache reuse, while spreading them evenly improves utilization. Existing schedulers favor one and fail at both. DualMap resolves this by mapping each request to two candidate instances via independent hash functions, then selecting the better one based on current load and cache state. Three additional mechanisms handle real-world complexity: SLO-aware routing, hotspot rebalancing, and lightweight instance scaling. Experiments show up to 2.25× capacity improvement under the same latency constraints.
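The two-candidate idea echoes classic power-of-two-choices load balancing. A minimal sketch of that routing step follows; the scoring function (cache hit minus a load penalty) is a placeholder of my own, not DualMap’s SLO-aware cost model:

```python
import hashlib

class Instance:
    def __init__(self, name):
        self.name = name
        self.load = 0        # outstanding requests
        self.cached = set()  # prefixes whose KV cache is resident

def _hash(key: str, salt: str, n: int) -> int:
    # Two independent hash functions, derived via different salts.
    return int(hashlib.sha256((salt + key).encode()).hexdigest(), 16) % n

def route(prefix: str, instances: list) -> Instance:
    """Map a request to two candidate instances, pick the better one."""
    a = instances[_hash(prefix, "salt-a", len(instances))]
    b = instances[_hash(prefix, "salt-b", len(instances))]
    def score(inst):
        # Placeholder trade-off: prefer cache hits, penalize load.
        return (1 if prefix in inst.cached else 0) - 0.1 * inst.load
    chosen = a if score(a) >= score(b) else b
    chosen.load += 1
    chosen.cached.add(prefix)
    return chosen
```

With this scheme, repeated requests sharing a prefix keep landing on the instance that already holds the cache, until its load penalty makes the second candidate more attractive.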

SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

Mixture-of-Experts models save compute by only activating a subset of their parameters per token, but when processing many tokens in a batch, too many experts end up being activated, killing that efficiency gain. SERE fixes this by dynamically redirecting tokens away from secondary experts toward similar ones that are already active, reducing redundancy without statically pruning the model. It achieves up to 2× speedup on reasoning benchmarks with minimal quality loss, and drops into existing serving infrastructure with almost no code changes.

FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

Native Sparse Attention reduces the compute cost of long-context LLMs, but its kernel is only efficient when models use many query heads per GQA group, a configuration most production LLMs don’t use. Flash Sparse Attention fixes this by inverting the kernel’s loop order, making NSA efficient across the query-head configurations that actually exist in practice. The result: up to 3.5× kernel-level speedup and up to 1.25× end-to-end training speedup compared to the standard NSA kernel implementation.

ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Outliers in weights and activations are a core challenge in LLM quantization, causing errors that compound in reasoning models over long chains of thought. ParoQuant addresses this by combining two mathematical transforms with channel-wise scaling to balance extreme values before quantization. The inference kernel is co-designed to keep these transforms lightweight at runtime. The result is a 2.4% accuracy gain over a strong baseline on reasoning tasks with under 10% overhead.

Physical AI

Physical AI is a key vertical at Nebius. ICLR paints robotics as shifting to generalist systems grounded in the real world, with policies trained on large, diverse datasets, fused with VLA models, and fine-tuned through interaction. The center of gravity is moving toward embodied foundation models that can plan, adapt, and reuse skills across tasks. Sim-to-real transfer, data efficiency, and safety are core constraints.

LeRobot: An Open-Source Library for End-to-End Robot Learning

Robot learning research has long been slowed by fragmented tooling, with different libraries handling different parts of the stack. LeRobot is an open-source library that covers the full pipeline, from motor control and hardware communication to large-scale dataset collection and state-of-the-art learning algorithms. The goal is to lower the barrier to entry for robotics research while keeping things reproducible and scalable.

Embodied Navigation Foundation Model

Most navigation models are built for a specific robot type and a specific task. Navigation Foundation Model is trained across quadrupeds, drones, wheeled robots, and vehicles on eight million samples spanning vision-language navigation, object search, tracking, and autonomous driving. A unified architecture with identifier tokens handles varying camera setups and task horizons, reaching state-of-the-art performance across seven benchmarks without task-specific fine-tuning.

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning

Mobile manipulation in households requires understanding not just where objects are, but how they function and which parts can be interacted with. MomaGraph is a unified scene representation that combines spatial, functional, and part-level information into a single scene graph, updated dynamically as the agent acts. The authors also release a large-scale dataset and benchmark for this setting, and train a 7B vision-language model on top that serves as both a scene graph predictor and zero-shot task planner, reaching 71.6% accuracy on their benchmark.

Remotely Detectable Robot Policy Watermarking

Trained robot policies are valuable intellectual property, but verifying ownership is hard when auditors only have access to external video footage rather than the robot’s internal state. This paper introduces Colored Noise Coherency (CoNoCo), the first watermarking strategy designed for remote detection. CoNoCo embeds a spectral signal into the robot’s motion by leveraging the policy’s natural stochasticity. The watermark is detectable from motion capture and video across both simulated and real-world experiments.

Abstracting Robot Manipulation Skills via Mixture-of-Experts Diffusion Policies

Scaling diffusion-based policies to multiple robot manipulation tasks is expensive in terms of model size and data. Skill Mixture-of-Experts Policy learns a compact set of reusable skills and routes each inference step through only the relevant subset of experts, keeping the model small and fast. It outperforms large diffusion baselines on both success rate and inference cost in simulation and on a real dual-arm platform.

Healthcare & Life Sciences

Some of Nebius’ most important partners and customers come from biotech and medicine, fields that AI is visibly rewriting. This section is one of the richest in the ICLR program. Models for protein design are everywhere, brain decoding is becoming a major frontier, and drug discovery is going fully generative and interactive, with diffusion and graph-LLM hybrids treating molecules as optimizable objects. A parallel push is toward clinical realism: models that can handle messy, longitudinal patient data while staying robust and interpretable. The lab and the model are converging fast.

Protein Structure Tokenization via Geometric Byte Pair Encoding

Tokenizing protein structures for use in multimodal models is hard, because these structures are noisy and multi-scale. The paper introduces GEOBPE, a tokenizer that borrows the byte-pair encoding idea from NLP and applies it to protein geometry. GEOBPE iteratively merges geometric primitives into a hierarchical vocabulary of structural motifs, yielding more than 10x compression in bits-per-residue at similar distortion, and outperforming leading protein structure tokenizers on downstream transfer benchmarks. It works with far less training data, and produces tokens that align with known functional families, making them interpretable rather than arbitrary codebook entries.

Extending Sequence Length is Not All You Need: Effective Integration of Multimodal Signals for Gene Expression Prediction

The dominant assumption in gene expression prediction has been that longer input sequences are better, since distant enhancers can influence gene expression from hundreds of kilobases away. This paper shows that for current models longer sequences can actually hurt performance, and that nearby epigenomic signals can matter more than simply extending the input context. Prism treats gene expression prediction as a causal inference problem: a lightweight CNN learns to represent different background chromatin states, and predictions are averaged across these states to cancel out confounding effects. The result is a small model on short sequences that outperforms heavier baselines.

La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching

Generating full-atom protein structures jointly with amino acid sequences is difficult because side chains vary in length across residues. La-Proteina sidesteps this by modeling the α-carbon backbone explicitly while encoding sequence and side-chain details into fixed-size per-residue latent variables, and then applying flow matching over this hybrid space. It reaches state-of-the-art performance on all-atom co-designability, diversity, and structural validity, and scales to proteins of up to 800 residues where prior all-atom baselines fail.

Generating metamers of human scene understanding

MetamerGen is a powerful tool for studying human scene understanding. When humans look at a scene, they combine sharp detail from where they fixate with low-resolution context from peripheral vision. MetamerGen is a latent diffusion model that takes both inputs and generates images, so-called metamers, that are perceptually equivalent to the original from the viewer’s perspective. A behavioral experiment with human participants shows that some generated images are genuine metamers, and analysis suggests which visual features at different processing levels drive human scene judgments.

TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

Neuroscience has progressed by fragmenting into specialized domains, and most brain encoding models focus on a single modality or brain region. TRIBE claims to be the first deep neural network that predicts fMRI responses to stimuli across multiple modalities, cortical areas and individuals. This neural network took first place in the Algonauts 2025 brain encoding competition by a significant margin over other competitors. While unimodal models predict their corresponding sensory cortices well, multimodal integration appears necessary to accurately model higher-level associative areas.

How we picked the papers

We ranked all 5,355 accepted ICLR 2026 papers within thematic categories. The final output is one CSV per category containing every paper ranked by a composite quality score. The methodology has three independent components: (1) LLM-based thematic classification to assign each paper to a category, (2) a reviewer-score-based composite to rank papers within each category, (3) domain experts selecting the final papers from the top of each algorithmically ranked list.

Each paper was classified into exactly one of five focused categories — SWE Agents, Inference/Infrastructure, AI for Life Sciences, Robotics, or Other — using Qwen3.5-397B-A17B from TokenFactory with structured JSON output at temperature 0.

The model received a system prompt defining each category with specific scope boundaries (e.g., “SWE Agent” covers coding agents and repository-level code generation, not any agent framework). Classification rules instructed the model to prioritize the paper’s main contribution over secondary applications.
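As an illustration, a deterministic, JSON-constrained classification call might be assembled like this. The payload is shaped like an OpenAI-compatible chat completions request; the actual prompt, schema, and endpoint used in the pipeline are not public, so the category wording below is a placeholder:

```python
import json

CATEGORIES = ["SWE Agents", "Inference/Infrastructure",
              "AI for Life Sciences", "Robotics", "Other"]

# Placeholder system prompt; the real one defines scope boundaries in detail.
SYSTEM_PROMPT = (
    "Classify the paper into exactly one category, prioritizing the main "
    "contribution over secondary applications. Categories: "
    + ", ".join(CATEGORIES)
    + '. Respond as JSON: {"category": "..."}'
)

def build_classification_request(title: str, abstract: str) -> dict:
    """Assemble an OpenAI-compatible chat payload for paper classification."""
    return {
        "model": "Qwen3.5-397B-A17B",  # model named in the text
        "temperature": 0,               # deterministic output
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": json.dumps({"title": title, "abstract": abstract})},
        ],
    }
```

Temperature 0 plus a JSON-constrained response format keeps the labels reproducible and trivially parseable across all 5,355 papers.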

Papers with “datasets and benchmark” as their primary area were overridden into a separate category and excluded, since their contribution type is qualitatively different from research papers.

Papers within each category were ranked by local_top_overall, a weighted linear combination of five reviewer-derived signals standardized to zero mean and unit variance within the category (not globally). Category-local z-scoring ensures that a paper is measured against its thematic peers — a top Physical AI paper surfaces even if the subfield’s raw scores run lower than, say, Inference & Infrastructure.

The formula is:

local_top_overall = 0.40 × z(contribution_mean) + 0.35 × z(rating_mean) + 0.10 × z(confidence_mean) + 0.10 × z(soundness_mean) − 0.05 × z(rating_std)

where all z-scores are computed over the papers in that category only.

Weight rationale. Contribution (perceived novelty and significance) carries the highest weight (40%), reflecting the goal of surfacing impactful work rather than merely well-executed work. Mean reviewer rating is second (35%) as a general quality signal. Soundness and reviewer confidence serve as minor quality gates (10% each). Rating standard deviation applies a small penalty for disagreement (5%) — kept low because within a focused thematic category, some spread in opinions is natural. The reviewer scores used here were taken from the public ICLR 2026 review data on OpenReview.
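Computed per category, the composite reduces to a few lines of code. A minimal sketch (the dict field names are my assumptions about how the per-paper review statistics are stored):

```python
from statistics import mean, pstdev

# Weights from the formula above; rating_std is a penalty, hence negative.
WEIGHTS = {"contribution_mean": 0.40, "rating_mean": 0.35,
           "confidence_mean": 0.10, "soundness_mean": 0.10,
           "rating_std": -0.05}

def local_top_overall(papers: list) -> list:
    """Score the papers of ONE category by the weighted z-score composite."""
    def z(field):
        # z-scores computed over this category only, not globally
        vals = [p[field] for p in papers]
        mu, sd = mean(vals), pstdev(vals)
        return [(v - mu) / sd if sd else 0.0 for v in vals]
    cols = {f: z(f) for f in WEIGHTS}
    return [sum(w * cols[f][i] for f, w in WEIGHTS.items())
            for i in range(len(papers))]
```

Running this once per category and sorting descending reproduces the per-category CSVs described above.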

The ranking did the heavy lifting; our experts shaped the final cut. After the composite ranking surfaced the top-20 candidates in each category, domain experts at Nebius went through these papers and made the final selection. Our list reflects two signals: quantitative ranking as the starting point, and expert judgment informed by Nebius focus areas as the deciding vote.
