📚 Personalized Daily Arxiv Papers - May 19, 2026

Total relevant papers: 29

💡 Daily Research Summary

### RL for LLMs

A major theme in today's papers is the maturation of reinforcement learning (RL) as the primary engine for post-training, moving beyond simple reward modeling to tackle fundamental challenges in alignment, reasoning, and efficiency. Several works introduce novel reward designs and training frameworks that directly address the limitations of existing methods like GRPO. For instance, **ICRL** proposes a framework to internalize self-critique, while **Calibrating LLMs with Semantic-level Reward (CSR)** and **Reducing the Safety Tax (OPSA)** introduce a semantic calibration reward and on-policy self-distillation, respectively, to improve calibration and to balance the safety-reasoning tradeoff. **VSPO** modifies GRPO with steering vectors for fine-grained behavioral control, and **Nudging Beyond the Comfort Zone (NudgeRL)** introduces structured exploration with strategy-level contexts, outperforming standard GRPO even when GRPO is given up to 8 times larger rollout budgets. On the efficiency front, **DualKV** tackles the computational bottleneck of long-context RL training by eliminating shared-prompt replication in FlashAttention, and **GRLO** demonstrates that strong generalization in reasoning and code tasks can be achieved from scratch with minimal data and compute. The study of **CoT Obfuscation (Training on Documents About Monitoring)** serves as a cautionary tale, reporting that training on documents about monitoring can lead models to obfuscate their chain-of-thought, a finding with direct implications for agent safety and alignment research.

### Multimodal & Video Diffusion

RL is making significant inroads into generative video and multimodal models, with a focus on alignment, verifiable rewards, and training efficiency. **Video Models Can Reason with Verifiable Rewards (VideoRLVR)** carries the RLVR paradigm from LLMs over to video diffusion, introducing a new reward design and training pipeline. **Flash-GRPO** improves the efficiency of aligning video diffusion models through one-step policy optimization, addressing a key bottleneck in this domain. In the multimodal space, **ASRU** combines activation steering with a custom RL reward to perform targeted unlearning in large multimodal models. For vision-language models, **From Failure to Feedback** introduces a group-revision optimization paradigm that uses reward shaping from revised candidates to improve object-level grounding on hard cases, showing how RL can refine perception capabilities.

### Agents & Self-Improving Systems

Much of today's agent research centers on frameworks that enable continuous, autonomous self-improvement, often without costly weight updates to the underlying LLM. **Solvita** and **FORGE** both propose protocols for self-evolving agents: Solvita applies RL updates to graph-structured knowledge networks, while FORGE uses population broadcast with reflection loops. **Argus** introduces a deep research agent with an RL-trained Navigator for scalable evidence assembly and parallel search. **ALSO** addresses dynamic adaptation in multi-agent social environments using an online adversarial bandit framework. The **Differentiable Mixture-of-Agents** framework proposes a self-evolving multi-agent system with differentiable routing, incentivizing emergent swarm intelligence. For scientific domains, **DrugSAGE** applies self-evolving agent experience to drug discovery, while **LEAP** introduces a trajectory-level evaluation metric for iterative scientific design, showing that offline RL with this metric can improve agent performance. **AstraFlow** contributes at the systems level, decoupling rollout, dataflow, and training in a dataflow-oriented RL system for agentic LLMs to enable scalable multi-policy training. Finally, **Agentic Discovery of Neural Architectures (AIRA-Compose and AIRA-Design)** shows that LLM agents can autonomously discover and design novel neural architectures, which the authors frame as a step toward recursive self-improvement.

### Reward Design & Optimization

A closer look at the mechanics of reward and optimization reveals a trend towards more principled, efficient, and controllable methods. **Process Rewards with Learned Reliability (BetaPRM)** advances process reward modeling by predicting both step-level success probability and the reliability of that prediction, enabling adaptive computation. The **$f$-Trajectory Balance** loss family provides a theoretical foundation for tuning generative models with both on- and off-policy data, unifying GFlowNets and LLM fine-tuning. **AdaScope** challenges the assumption that every denoising step is equally important for RL fine-tuning of diffusion models, proposing an adaptive selection strategy that reduces cost and improves alignment. **Embedding-perturbed Exploration Preference Optimization (E2PO)** tackles the problem of intra-group variance collapse in flow models by introducing perturbation at the embedding level. In the domain of machine translation, **Reference-Free RL Fine-Tuning** applies GRPO with a novel reference-free hybrid reward to encoder-decoder models. **PAGER** bridges the semantic-execution gap in GUI control by using a precision-aligned RL reward for point-precise geometric control. **Context Pruning for Coding Agents (LaMR)** introduces a latent multi-rubric reasoning framework for context pruning, optimizing the agent's input space.

📑 Table of Contents

  1. ICRL: Learning to Internalize Self-Critique with Reinforcement Learning (Rel: 8/10, Nov: 8/10) Authors: Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong

  2. DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts (Rel: 8/10, Nov: 8/10) Authors: Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis

  3. Video Models Can Reason with Verifiable Rewards (Rel: 8/10, Nov: 8/10) Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

  4. Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution (Rel: 8/10, Nov: 8/10) Authors: Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu

  5. Calibrating LLMs with Semantic-level Reward (Rel: 8/10, Nov: 7/10) Authors: Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

  6. Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation (Rel: 8/10, Nov: 7/10) Authors: Yu Fu, Longxuan Yu, Haz Sameen Shahgir, Zhipeng Wei, Hui Liu, N. Benjamin Erichson, Yue Dong

  7. Process Rewards with Learned Reliability (Rel: 8/10, Nov: 7/10) Authors: Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

  8. FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast (Rel: 8/10, Nov: 7/10) Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

  9. Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR (Rel: 8/10, Nov: 7/10) Authors: Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang

  10. Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design (Rel: 7/10, Nov: 8/10) Authors: Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach

  11. Argus: Evidence Assembly for Scalable Deep Research Agents (Rel: 7/10, Nov: 8/10) Authors: Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

  12. Training on Documents About Monitoring Leads to CoT Obfuscation (Rel: 7/10, Nov: 8/10) Authors: Reilly Haskins, Bilal Chughtai, Joshua Engels

  13. PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control (Rel: 7/10, Nov: 7/10) Authors: Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan

  14. Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization (Rel: 7/10, Nov: 7/10) Authors: Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang

  15. ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models (Rel: 7/10, Nov: 7/10) Authors: Jiahui Guang, Yingjie Zhu, Cuiyun Gao, Haiyan Wang, Jing Li, Di Shao, Zhaoquan Gu

  16. ALSO: Adversarial Online Strategy Optimization for Social Agents (Rel: 7/10, Nov: 7/10) Authors: Xiang Li, Liping Yi, Mingze Kong, Min Zhang, Zhongxiang Dai, QingHua Hu

  17. Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models (Rel: 7/10, Nov: 7/10) Authors: Xingjian Wu, Junkai Lu, Siyu Yan, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang

  18. VSPO: Vector-Steered Policy Optimization for Behavioral Control (Rel: 7/10, Nov: 7/10) Authors: Xuechen Zhang, Zijian Huang, Kai Yang, Weijia Zhang, Jiasi Chen, Samet Oymak

  19. Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models? (Rel: 7/10, Nov: 7/10) Authors: Renye Yan, Jikang Cheng, Shikun Sun, Yi Sun, You Wu, Wei Peng, Zongwei Wang, Ling Liang, Junliang Xing, Yimao Cai

  20. $f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data (Rel: 7/10, Nov: 7/10) Authors: Jake Fawkes, Jason Hartford

  21. Embedding-perturbed Exploration Preference Optimization for Flow Models (Rel: 7/10, Nov: 7/10) Authors: Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li

  22. From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding (Rel: 7/10, Nov: 7/10) Authors: Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan, Can Peng, Jiajun Deng, Fengbei Liu, Junde Wu

  23. Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective (Rel: 7/10, Nov: 7/10) Authors: Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa

  24. Look Before You Leap: Autonomous Exploration for LLM Agents (Rel: 7/10, Nov: 7/10) Authors: Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng

  25. DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery (Rel: 7/10, Nov: 7/10) Authors: Yikun Zhang, Xiwei Cheng, Tianyu Liu, Yuanqi Du, Wengong Jin

  26. Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning (Rel: 7/10, Nov: 7/10) Authors: Jingjing Wang, Xiwen Chen, Wenhui Zhu, Huayu Li, Zhengxiao He, Feiyang Cai, Ana S. Carreon-Rascon, Xuanzhao Dong, Feng Luo

  27. GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero (Rel: 7/10, Nov: 7/10) Authors: Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi

  28. LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design (Rel: 7/10, Nov: 7/10) Authors: Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting

  29. AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs (Rel: 7/10, Nov: 7/10) Authors: Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen


📝 Detailed Papers

ArXiv ID: 2605.15224 | Relevance Score: 8/10 | Novelty Score: 8/10

👥 Authors: Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes ICRL, a novel RL framework that jointly trains a solver and critic to internalize self-critique, directly addressing RL for LLMs and self-improving agents.

arXiv:2605.15224v1 Announce Type: new Abstract: Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.
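As a rough illustration of two ideas named in the abstract above (a critic rewarded by the solver's performance gain, and advantages estimated separately per role), here is a minimal Python sketch. The abstract does not give the exact formulas, so the gain-as-difference reward, the GRPO-style mean/std normalization, and all function names are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def critic_reward(score_without_critique: float, score_with_critique: float) -> float:
    """Hypothetical critic reward: the solver's score gain after reading the critique."""
    return score_with_critique - score_without_critique


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_wise_advantages(solver_rewards, critic_rewards):
    """Normalize each role against its own group (assumed reading of 'role-wise'),
    so that one role's reward scale cannot swamp the other's."""
    return {
        "solver": group_advantages(solver_rewards),
        "critic": group_advantages(critic_rewards),
    }


if __name__ == "__main__":
    adv = role_wise_advantages(
        solver_rewards=[0.0, 1.0, 1.0, 0.0],
        critic_rewards=[critic_reward(0.0, 1.0), critic_reward(1.0, 1.0)],
    )
    print(adv)
```

Normalizing per role is one simple way to keep the two reward signals on comparable scales; the paper's actual estimator may differ.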

ArXiv ID: 2605.15422 | Relevance Score: 8/10 | Novelty Score: 8/10

👥 Authors: Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes DualKV, a novel FlashAttention kernel that eliminates shared-prompt replication in RL post-training (GRPO/DAPO), achieving significant speedups for large-rollout, long-context LLM training.

arXiv:2605.15422v1 Announce Type: new Abstract: Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes, duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N \geq 16$, $P \geq 8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once, a property not yet exploited at the kernel level for training. We propose **DualKV**, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and backward kernels that iterate over two disjoint KV regions (shared context and per-sequence response) in a single kernel launch, and (2) a data-pipeline redesign in veRL that repacks $N(P+R)$ tokens into $P+NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $\rho = N(P+R)/(P+NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with 8$\times$H100 GPUs ($N=32$, 8K-context), DualKV achieves $1.63$ to $2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from $36\%$ to $76\%$. Similar gains hold for DAPO ($2.47\times$ speedup, $77\%$ MFU). At 30B MoE scale on 16$\times$H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
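To make the token-reduction factor $\rho = N(P+R)/(P+NR)$ from the abstract above concrete, the following Python sketch evaluates it for one configuration. The values $N=32$ and $P=8192$ follow the abstract's 8K-context setting; the response length $R=1024$ is a hypothetical value chosen only for illustration, and the function name is not from the paper.

```python
def token_reduction_factor(n: int, p: int, r: int) -> float:
    """rho = N(P+R) / (P+NR): tokens processed with the prompt replicated per
    response, divided by tokens processed when the prompt is stored once."""
    replicated = n * (p + r)   # N copies of (prompt + response)
    shared = p + n * r         # one prompt plus N responses
    return replicated / shared


if __name__ == "__main__":
    # R = 1024 is an assumed response length, not a figure from the abstract.
    rho = token_reduction_factor(n=32, p=8192, r=1024)
    print(f"rho = {rho:.2f}")  # 294912 / 40960 = 7.20
```

The ratio counts tokens only and is not a predicted speedup: it approaches $N$ when responses are short relative to the prompt and approaches 1 when the prompt is short, which is consistent with the abstract's focus on long prompts and large rollout counts.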

ArXiv ID: 2605.15458 | Relevance Score: 8/10 | Novelty Score: 8/10

👥 Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes VideoRLVR, a novel RL with verifiable rewards (RLVR) recipe for video diffusion models, directly matching the RL for LLMs criterion with a new reward design and training pipeline.

arXiv:2605.15458v1 Announce Type: new Abstract: Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
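The Early-Step Focus idea in the abstract above, restricting policy optimization to the early denoising phase, can be pictured as a mask over per-step losses. The Python sketch below is only a schematic under stated assumptions: step index 0 is taken to be the first (noisiest) denoising step, and the `early_fraction` cutoff is a hypothetical parameter, since the abstract does not say how the early phase is delimited.

```python
def early_step_mask(num_steps: int, early_fraction: float) -> list:
    """Mark which denoising steps take part in the policy update.
    Index 0 is assumed to be the first (noisiest) step."""
    cutoff = max(1, int(num_steps * early_fraction))
    return [i < cutoff for i in range(num_steps)]


def masked_policy_loss(per_step_losses, mask) -> float:
    """Average the loss over the selected early steps only."""
    kept = [loss for loss, keep in zip(per_step_losses, mask) if keep]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    mask = early_step_mask(num_steps=10, early_fraction=0.5)
    print(mask)
    print(masked_policy_loss([1.0, 2.0, 3.0, 4.0], early_step_mask(4, 0.5)))
```

Skipping later steps removes their gradient computation, which is one plausible source of the latency reduction the abstract reports.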

ArXiv ID: 2605.15301 | Relevance Score: 8/10 | Novelty Score: 8/10

👥 Authors: Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes Solvita, a self-evolving agent framework for competitive programming that applies RL updates to graph-structured knowledge networks, enabling continuous learning without weight updates to the underlying LLM.

arXiv:2605.15301v1 Announce Type: new Abstract: Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi-agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem-solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem-solving into a closed-loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph-structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state-of-the-art among code-generation agents, outperforming existing multi-agent pipelines and nearly doubling the accuracy of single-pass baselines.

ArXiv ID: 2605.15588 | Relevance Score: 8/10 | Novelty Score: 7/10

👥 Authors: Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a novel RL framework (CSR) with a semantic calibration reward that directly improves LLM calibration, strongly matching the RL for LLMs criterion with a new reward design.

arXiv:2605.15588v1 Announce Type: new Abstract: As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose **Calibration with Semantic Reward (CSR)**, a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to $40\%$ and improving AUROC by up to $31\%$ over verbalized-confidence baselines, with calibration behavior generalizing robustly across all four evaluation settings.
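The abstract above reports calibration as expected calibration error (ECE). For readers unfamiliar with the metric, here is a minimal Python sketch of the standard equal-width binned ECE. It is a generic definition, not the paper's evaluation code; the bin count of 10 and the left-closed binning convention are assumptions, and how CSR derives a confidence score is not covered here.

```python
def expected_calibration_error(confidences, correct, num_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Two answers given with 0.9 confidence, only one of them correct.
    print(expected_calibration_error([0.9, 0.9], [True, False]))
```

A model that is confident but wrong half the time, as in the usage example, gets a large ECE because its stated confidence sits well above its accuracy.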

ArXiv ID: 2605.15239 | Relevance Score: 8/10 | Novelty Score: 7/10

👥 Authors: Yu Fu, Longxuan Yu, Haz Sameen Shahgir, Zhipeng Wei, Hui Liu, N. Benjamin Erichson, Yue Dong

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes OPSA, a novel on-policy self-distillation method for safety alignment that directly addresses the safety-reasoning tradeoff via dense per-token KL supervision and teacher flip rate, strongly matching RL for LLMs criteria.

arXiv:2605.15239v1 Announce Type: new Abstract: Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce *teacher flip rate*: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We use this signal to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking demonstrations. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B). The gains persist across training-set sizes and adaptive jailbreak evaluations. Token-level analyses further show that OPSA concentrates updates near early compliance-decision tokens, providing a mechanism for improving safety while preserving general reasoning.
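The teacher flip rate described in the abstract above can be read as a simple ratio. The Python sketch below is one plausible formalization, assuming the denominator is the set of prompts whose unconditioned response was judged unsafe; the function name, the boolean safety labels, and that choice of denominator are assumptions rather than details from the paper.

```python
def teacher_flip_rate(baseline_safe, teacher_safe) -> float:
    """Fraction of originally unsafe responses that become safe once the
    privileged safety context is added (assumed definition)."""
    unsafe_indices = [i for i, safe in enumerate(baseline_safe) if not safe]
    if not unsafe_indices:
        return 0.0  # nothing to flip
    flipped = sum(1 for i in unsafe_indices if teacher_safe[i])
    return flipped / len(unsafe_indices)


if __name__ == "__main__":
    # Safety judgments for four prompts, without and with the privileged context.
    baseline = [False, False, True, False]
    with_context = [True, False, True, True]
    print(teacher_flip_rate(baseline, with_context))  # 2 of 3 unsafe responses flipped
```

Under this reading, candidate safety contexts could be ranked by their flip rate before being used to condition the teacher.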

ArXiv ID: 2605.15529 | Relevance Score: 8/10 | Novelty Score: 7/10

👥 Authors: Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes BetaPRM, a distributional process reward model that predicts both step-level success probability and prediction reliability, enabling adaptive computation allocation for PRM-guided reasoning.

arXiv:2605.15529v1 Announce Type: new Abstract: Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
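The Beta-Binomial likelihood mentioned in the abstract above has a standard closed form, sketched below in Python. For $k$ successes out of $n$ Monte Carlo continuations under a $\mathrm{Beta}(\alpha, \beta)$ belief, the log-likelihood is $\log\binom{n}{k} + \log B(k+\alpha,\, n-k+\beta) - \log B(\alpha, \beta)$. How BetaPRM parameterizes or trains $\alpha$ and $\beta$ is not stated in the abstract, so reading the Beta variance as a reliability signal is an assumption made for illustration.

```python
from math import lgamma


def log_beta_fn(a: float, b: float) -> float:
    """log B(a, b) via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_log_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """Log-probability of k successes in n trials under a Beta(alpha, beta) belief."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta_fn(k + alpha, n - k + beta) - log_beta_fn(alpha, beta)


def beta_mean(alpha: float, beta: float) -> float:
    """Expected step-success probability under the belief."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Spread of the belief; lower variance is read here as higher reliability."""
    total = alpha + beta
    return (alpha * beta) / (total * total * (total + 1))


if __name__ == "__main__":
    # Same predicted success probability, very different confidence in it.
    for a, b in [(2.0, 2.0), (20.0, 20.0)]:
        print(a, b, beta_mean(a, b), beta_variance(a, b))
```

The usage example shows why a single scalar score loses information: both beliefs predict a 0.5 success probability, but the second is far more concentrated.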

ArXiv ID: 2605.16233 | Relevance Score: 8/10 | Novelty Score: 7/10

👥 Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes FORGE, a self-evolving agent memory protocol using population broadcast and reflection loops to improve LLM agent performance without weight updates.

arXiv:2605.16233v1 Announce Type: new Abstract: Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
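The outer loop described in the abstract above, which propagates the best-performing instance's memory to the rest of the population between stages, might look roughly like the Python sketch below. The `Instance` class, its fields, and the choice to leave graduated (frozen) instances untouched are all assumptions for illustration; the abstract does not specify these details.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One agent instance in the population (hypothetical structure)."""
    memory: list            # natural-language rules or examples injected into the prompt
    score: float            # average evaluation return from the last stage
    graduated: bool = False  # frozen once it meets the graduation criterion


def broadcast_best_memory(population):
    """Copy the top-scoring instance's memory to every non-graduated instance."""
    best = max(population, key=lambda inst: inst.score)
    for inst in population:
        if inst is not best and not inst.graduated:
            inst.memory = list(best.memory)  # copy, so later edits stay independent
    return best


if __name__ == "__main__":
    population = [
        Instance(memory=["rule A"], score=-50.0),
        Instance(memory=["rule B"], score=-20.0),
        Instance(memory=["rule C"], score=-80.0, graduated=True),
    ]
    best = broadcast_best_memory(population)
    print(best.memory, [inst.memory for inst in population])
```

Each instance would then continue its own reflection loop from the shared memory in the next stage.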

ArXiv ID: 2605.15726 | Relevance Score: 8/10 | Novelty Score: 7/10

👥 Authors: Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes NudgeRL, a structured exploration framework for RLVR that uses strategy-level contexts and a decomposed reward objective to improve reasoning in LLMs, outperforming standard GRPO even when GRPO uses up to 8 times larger rollout budgets.

arXiv:2605.15726v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while outperforming an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
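The abstract above says the reward signal is decomposed into inter- and intra-context components but does not give the formula. The Python sketch below shows one natural decomposition, stated purely as an assumption: the inter-context term compares each strategy context's mean reward with the overall mean, and the intra-context term compares each rollout with its own context's mean. The function name and the example strategy labels are hypothetical.

```python
from statistics import mean


def decompose_rewards(rewards_by_context):
    """Split each centered reward into an inter-context and an intra-context part.

    For a rollout with reward r in context c:
        r - global_mean = (context_mean[c] - global_mean) + (r - context_mean[c])
    """
    all_rewards = [r for rewards in rewards_by_context.values() for r in rewards]
    global_mean = mean(all_rewards)
    result = {}
    for context, rewards in rewards_by_context.items():
        context_mean = mean(rewards)
        result[context] = {
            "inter": context_mean - global_mean,         # how good the strategy is overall
            "intra": [r - context_mean for r in rewards],  # how good each rollout is within it
        }
    return result


if __name__ == "__main__":
    rollouts = {"algebraic": [1.0, 1.0], "geometric": [0.0, 1.0]}
    for context, parts in decompose_rewards(rollouts).items():
        print(context, parts)
```

Separating the two terms lets a training objective weight "which strategy works" differently from "which attempt within a strategy works"; whether NudgeRL does exactly this is not confirmed by the abstract.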

ArXiv ID: 2605.15871 | Relevance Score: 7/10 | Novelty Score: 8/10

👥 Authors: Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a dual-framework approach (AIRA-Compose and AIRA-Design) for LLM agents to autonomously discover and design neural architectures, presented as a step toward recursive self-improvement and backed by strong empirical results.

arXiv:2605.15871v1 Announce Type: new Abstract: Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.

ArXiv ID: 2605.16217 | Relevance Score: 7/10 | Novelty Score: 8/10

👥 Authors: Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes Argus, a deep research agent system with a Navigator trained via RL to verify, dispatch, and synthesize evidence, enabling scalable parallel search and self-improvement.

arXiv:2605.16217v1 Announce Type: new Abstract: Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

ArXiv ID: 2605.15257 | Relevance Score: 7/10 | Novelty Score: 8/10

👥 Authors: Reilly Haskins, Bilal Chughtai, Joshua Engels

🏫 Affiliations: Unknown Institution

💡 Highlight: Directly studies CoT obfuscation in RL-trained models, showing monitor-aware models learn to reward-hack without triggering CoT monitors, which is highly relevant to RL for LLMs and agent safety.

arXiv:2605.15257v1 Announce Type: new Abstract: Chain-of-thought (CoT) monitoring is one of the most promising tools we have for detecting model misbehavior, but its effectiveness depends on models faithfully externalizing their reasoning. Motivated by this vulnerability, we study whether monitor-aware models are capable of obfuscating their reasoning to evade detection. We use synthetic document finetuning to expose eight models to realistic pre-training-style documents describing a CoT monitor and find that monitor-aware models consistently achieve higher rates of undetected misbehavior compared to unaware controls. This effect is weaker but still present on a harder agentic task. We also show that CoT controllability, a model's ability to reshape its own reasoning trace under an imposed constraint, is closely correlated with obfuscation success across the eight models studied ($r=0.800$, $p=0.017$). Monitor-aware models placed under equal reinforcement learning optimization pressure also learn to reward-hack without triggering a CoT monitor substantially faster than unaware controls. Together, these results suggest that knowledge of monitoring combined with high CoT controllability poses a risk to CoT-based monitoring.
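As a quick sanity check on the reported correlation, the sketch below recomputes a p-value from $r=0.800$ over eight models. It assumes a Pearson correlation with the usual two-sided t-test on $n-2$ degrees of freedom; the abstract does not say which test the authors used, so treat the match as illustrative. The helper names `t_pdf` and `pearson_two_sided_p` are hypothetical.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def pearson_two_sided_p(r: float, n: int, steps: int = 10_000) -> float:
    """Two-sided p-value for a Pearson correlation r observed on n points.

    Assumption: the standard t-test, t = r * sqrt((n - 2) / (1 - r^2)).
    `steps` must be even because Simpson's rule is used below.
    """
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the integral of the density over [0, t].
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3
    # Mass in both tails beyond |t|.
    return 2 * (0.5 - central_mass)


p_value = pearson_two_sided_p(r=0.800, n=8)
print(f"p = {p_value:.3f}")  # prints "p = 0.017"
```

Under these assumptions the result agrees with the abstract's $p=0.017$, which suggests the figure comes from a standard correlation test on the eight models.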

ArXiv ID: 2605.15963 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes PAGER, a topology-aware GUI agent that uses precision-aligned RL to bridge the semantic-execution gap in point-precise geometric control, directly matching RL for LLMs criteria.

arXiv:2605.15963v1 Announce Type: new Abstract: Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.

ArXiv ID: 2605.15980 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes Flash-GRPO, a one-step policy optimization framework that significantly improves training efficiency and alignment quality for video diffusion models, directly addressing RL for LLMs via a novel RL training recipe.

arXiv:2605.15980v1 Announce Type: new Abstract: Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding-window subsampling of training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.

ArXiv ID: 2605.15687 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Jiahui Guang, Yingjie Zhu, Cuiyun Gao, Haiyan Wang, Jing Li, Di Shao, Zhaoquan Gu

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a multimodal unlearning framework that uses activation steering and a custom reward function (RL) to optimize unlearning boundaries, directly matching RL for LLMs.

arXiv:2605.15687v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that ASRU significantly improves unlearning effectiveness (+24.6%) on average and generation quality (5.8x) on average while effectively preserving model utility, using only a small amount of retained supervision data.

ArXiv ID: 2605.15768 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Xiang Li, Liping Yi, Mingze Kong, Min Zhang, Zhongxiang Dai, QingHua Hu

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes an online adversarial bandit framework for LLM-based social agents to dynamically adapt strategies in non-stationary multi-agent environments, directly addressing self-improving agent adaptation.

arXiv:2605.15768v1 Announce Type: new Abstract: Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi-turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose **ALSO** (**A**dversarial on**L**ine **S**trategy **O**ptimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate that predicts rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.
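To make the adversarial-bandit framing concrete, here is a minimal sketch using textbook EXP3, a standard algorithm for adversarial bandits. This is a stand-in, not the paper's method: the abstract does not name the bandit algorithm, and the neural reward surrogate is omitted. The personas, strategy instructions, and `toy_reward` function are all hypothetical placeholders.

```python
import math
import random
from itertools import product


class Exp3:
    """Textbook EXP3 for adversarial bandits with rewards in [0, 1]."""

    def __init__(self, n_arms: int, gamma: float = 0.1, seed: int = 0):
        self.n_arms = n_arms
        self.gamma = gamma  # exploration rate
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self) -> list[float]:
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self) -> int:
        return self.rng.choices(range(self.n_arms), weights=self.probabilities())[0]

    def update(self, arm: int, reward: float) -> None:
        # Importance-weighted estimate: only the pulled arm is updated.
        estimate = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)


# Hypothetical arms: every (persona, strategy instruction) pair.
personas = ["cooperative", "assertive"]
strategies = ["ask questions", "make concessions", "hold firm"]
arms = list(product(personas, strategies))


def toy_reward(arm_index: int, turn: int) -> float:
    """Hypothetical non-stationary reward: the best arm changes mid-run."""
    best = 1 if turn < 150 else 4
    return 1.0 if arm_index == best else 0.0


bandit = Exp3(n_arms=len(arms))
for turn in range(300):
    arm = bandit.select()
    bandit.update(arm, toy_reward(arm, turn))
```

EXP3 makes no stationarity assumption about rewards, which is the property the abstract emphasizes. A production version would normalize or rescale the weights to avoid overflow on long runs.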

ArXiv ID: 2605.15706 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Xingjian Wu, Junkai Lu, Siyu Yan, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a self-evolving multi-agent framework with differentiable routing and test-time adaptation, directly matching the self-improving agent criterion.

arXiv:2605.15706v1 Announce Type: new Abstract: Recent advances in Large Language Models (LLMs) have catalyzed the development of multi-agent systems (MAS) for complex reasoning tasks. However, existing MAS typically rely on pre-defined or pre-compiled communication topologies, which limits their flexibility and adaptability to dynamic task requirements. In this work, we propose Differentiable Mixture-of-Agents (DMoA), a self-evolving multi-agent framework that enables elastic and adaptive agent collaboration during inference. Instead of statically constructing workflows, DMoA dynamically routes and activates agents at each reasoning step, allowing the system to implicitly simulate diverse communication topologies and adapt to evolving demands. To achieve this, we design a differentiable, context-aware routing mechanism that leverages recurrent structures to incorporate historical and contextual information, producing sparse agent activations in a step-wise manner. Furthermore, we introduce predictive entropy as self-supervised signals to optimize the routing process, enabling efficient test-time adaptation without external annotations. Extensive experiments across 9 benchmarks demonstrate that DMoA achieves state-of-the-art performance while exhibiting strong efficiency, robustness, and ensembling capabilities.

ArXiv ID: 2605.15604 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Xuechen Zhang, Zijian Huang, Kai Yang, Weijia Zhang, Jiasi Chen, Samet Oymak

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes VSPO, a novel RL method modifying GRPO with steering vectors to control behavioral intensity, directly addressing sparse reward in multi-objective LLM alignment.

arXiv:2605.15604v1 Announce Type: new Abstract: Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in its response. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO) which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves the control along target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.

ArXiv ID: 2605.15855 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Renye Yan, Jikang Cheng, Shikun Sun, Yi Sun, You Wu, Wei Peng, Zongwei Wang, Ling Liang, Junliang Xing, Yimao Cai

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes AdaScope, an RL fine-tuning method for diffusion models that adaptively selects denoising stages for optimization, reducing cost and improving alignment.

arXiv:2605.15855v1 Announce Type: new Abstract: Despite strong image-generation performance, diffusion models' reconstruction objectives limit alignment with human preferences. RL enables such alignment through explicit rewards. However, most studies apply RL to the full denoising trajectory, making it computationally costly and weakening preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose AdaScope, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, AdaScope adaptively identifies the optimal intervention timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare 'dual benefit': a reduction in computational costs alongside a significant performance improvement. We offer theoretical grounds for the design of AdaScope. Compared with state-of-the-art methods, AdaScope improves performance by 66% while cutting computational cost by 59%.

ArXiv ID: 2605.15417 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Jake Fawkes, Jason Hartford

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a family of f-divergence-based surrogate losses for tuning generative models including LLMs, with on-policy gradient equivalence and off-policy validity, directly relevant to RL for LLMs via reward design.

arXiv:2605.15417v1 Announce Type: new Abstract: In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low variance, surrogate loss for training generative models. This loss has the property that when evaluated *on-policy* its gradients correspond to those of the KL divergence, while *off-policy* it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are those of the corresponding $f$-divergence, but retain the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one to one correspondence between translation invariant loss functions on the target and model log probabilities, and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.

ArXiv ID: 2605.15803 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a novel RL framework (E2PO) for flow models that addresses intra-group variance collapse via embedding-level perturbation, directly improving alignment.

arXiv:2605.15803v1 Announce Type: new Abstract: Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose **Embedding-perturbed Exploration Preference Optimization ($E^2$PO)**, a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
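The variance-collapse problem described above can be seen directly in the group-relative advantage that GRPO-style methods compute, where each sample's reward is standardized against its own group. The sketch below shows only that failure mode, not the paper's embedding-level perturbation. The function name and reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when the group has no variance.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group whose samples still differ: advantages carry a ranking signal.
diverse = group_relative_advantages([0.2, 0.9, 0.5, 0.4])

# A collapsed group: identical rewards give zero advantage for every sample,
# so the policy gradient from this group vanishes.
collapsed = group_relative_advantages([0.7, 0.7, 0.7, 0.7])

print(diverse)
print(collapsed)
```

When every sample in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is the lost learning signal the abstract refers to.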

ArXiv ID: 2605.15951 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan, Can Peng, Jiajun Deng, Fengbei Liu, Junde Wu

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a group-revision optimization paradigm for RL finetuning of LVLMs that uses reward shaping from revised candidates to improve learning on hard cases in object-level grounding.

arXiv:2605.15951v1 Announce Type: new Abstract: Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.

ArXiv ID: 2605.15976 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa

🏫 Affiliations: Unknown Institution

💡 Highlight: Applies GRPO with a reference-free hybrid reward to encoder-decoder MT models, demonstrating RL-based fine-tuning for LLMs with a novel reward design and consistent gains across low-resource languages.

arXiv:2605.15976v1 Announce Type: new Abstract: Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at $\geq$7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.

ArXiv ID: 2605.16143 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a training strategy interleaving exploration and task-execution rollouts with verifiable rewards for LLM agents, enabling self-improvement via grounded exploration.

arXiv:2605.16143v1 Announce Type: new Abstract: Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
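One plausible reading of a coverage-style metric like Exploration Checkpoint Coverage is the fraction of predefined checkpoints (key states, objects, affordances) that an agent's trajectory reaches at least once. The abstract does not give the formula, so the definition below is an assumption, and the checkpoint names are hypothetical.

```python
def checkpoint_coverage(visited_events: list[str], checkpoints: set[str]) -> float:
    """Fraction of checkpoints that appear at least once in the trajectory.

    Assumed definition for illustration; the paper's exact metric may differ.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    found = checkpoints & set(visited_events)
    return len(found) / len(checkpoints)


# Hypothetical environment checkpoints.
checkpoints = {"enter_kitchen", "open_drawer", "find_key", "use_stove"}

# A narrow, repetitive trajectory revisits the same two checkpoints.
narrow = ["enter_kitchen", "open_drawer", "enter_kitchen", "open_drawer"]

# A broader trajectory reaches three of the four checkpoints.
broad = ["enter_kitchen", "open_drawer", "find_key", "look_around"]

print(checkpoint_coverage(narrow, checkpoints))  # 0.5
print(checkpoint_coverage(broad, checkpoints))   # 0.75
```

Because the score depends only on which checkpoints were reached, repeated visits add nothing, so a metric of this shape would penalize the narrow, repetitive behavior the abstract describes.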

ArXiv ID: 2605.15461 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Yikun Zhang, Xiwei Cheng, Tianyu Liu, Yuanqi Du, Wengong Jin

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a self-evolving agent framework with cross-task memory for efficient drug discovery model development.

arXiv:2605.15461v1 Announce Type: new Abstract: Building state-of-the-art (SOTA) predictive models for drug discovery requires expensive search over tools, architectures, and training strategies. Current LLM-based agents can find SOTA solutions through extensive trial and error, but they do not retain the experience accumulated along the way and therefore pay the full search cost on every new task. We propose Self-evolving Agent Experience, a framework that accumulates and reuses experience across tasks to build SOTA drug discovery models efficiently. The framework maintains a cross-task memory of verified skills, statistical evidence about effective strategies, and a record of recurring errors and their fixes. In some cases, it transfers a working solution directly without test-time search. In 33 molecular property prediction tasks, the framework ranks first among nine SOTA agents in a single-task setting. With memory accumulated from 16 smaller tasks, it achieves an averaged normalized score of 0.935 on 17 held-out tasks in a cross-task evaluation setting and outperforms all baseline agents by 10-30% in a zero-test-time search regime. In summary, our work shows the advantage of cross-task memory for efficient SOTA model development in drug discovery.

ArXiv ID: 2605.15315 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Jingjing Wang, Xiwen Chen, Wenhui Zhu, Huayu Li, Zhengxiao He, Feiyang Cai, Ana S. Carreon-Rascon, Xuanzhao Dong, Feng Luo

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes LaMR, a latent multi-rubric reasoning framework for context pruning in coding agents, directly matching the latent reasoning criterion with a novel structured decomposition and CRF-based approach.

arXiv:2605.15315v1 Announce Type: new Abstract: LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher's binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.

ArXiv ID: 2605.15464 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes GRLO, a method studying generalization of RLHF from scratch in open-ended environments, showing strong transfer to reasoning and code tasks with minimal data and compute.

arXiv:2605.15464v1 Announce Type: new Abstract: Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation; we call this approach GRLO. Specifically, on Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about $46\times$ less data and $68\times$ less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: https://github.com/SJY8460/GRLO.

ArXiv ID: 2605.15341 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes a trajectory-level evaluation metric for LLMs in iterative scientific design and shows offline RL with this metric improves performance, directly matching RL for LLMs via a new reward design and training loop.

arXiv:2605.15341v1 Announce Type: new Abstract: LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench, Learning Efficiency in Adaptive Processes, a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric with a classical Bayesian-optimization reference and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks where the oracle's reward signal is aligned with configurations from the published-best design, domain-aware prompting leads to LLM choices that match the published-best design approximately 10 percentage points less often than domain-agnostic prompting at iteration 30. The pattern is sharpest on 6 tasks where the literature-typical and published-best configurations diverge, and domain-agnostic prompting matches the published-best design more often on all 6. The trajectory metric also doubles as a tractable training target. Offline reinforcement learning with the metric as a reward improves performance on 14 of 21 held-out tasks.
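
The difference between outcome scoring and a best-so-far AUC trajectory metric can be illustrated with a small sketch. The abstract does not give LEAPBench's exact definition, so the normalisation used here (mean of the running maximum over a fixed horizon, higher is better) is an assumption, and the function names and scores are hypothetical.

```python
from typing import Sequence


def best_so_far(scores: Sequence[float]) -> list[float]:
    """Running maximum of the per-iteration scores (higher is better)."""
    curve = []
    current = float("-inf")
    for score in scores:
        current = max(current, score)
        curve.append(current)
    return curve


def best_so_far_auc(scores: Sequence[float]) -> float:
    """Area under the best-so-far curve, normalised by the horizon length.

    Assumption: a simple mean of the running maximum; the benchmark's
    actual normalisation may differ.
    """
    curve = best_so_far(scores)
    if not curve:
        raise ValueError("need at least one iteration")
    return sum(curve) / len(curve)


# Two hypothetical optimisers that reach the same final design quality.
fast_learner = [0.2, 0.8, 0.8, 0.9]
slow_learner = [0.2, 0.3, 0.4, 0.9]

print("final outcome:", best_so_far(fast_learner)[-1], best_so_far(slow_learner)[-1])
print("trajectory AUC:", best_so_far_auc(fast_learner), best_so_far_auc(slow_learner))
```

Both runs tie on final outcome, but the trajectory score separates them because the first run reaches a good design several iterations earlier. This is the kind of ranking change the abstract attributes to switching from outcome to trajectory scoring.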

ArXiv ID: 2605.15565 | Relevance Score: 7/10 | Novelty Score: 7/10

👥 Authors: Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen

🏫 Affiliations: Unknown Institution

💡 Highlight: Proposes AstraFlow, a dataflow-oriented RL system for agentic LLMs that decouples rollout, dataflow, and training to enable scalable multi-policy collaborative training and efficient use of heterogeneous compute.

arXiv:2605.15565v1 Announce Type: new Abstract: Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.
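
The decoupling idea in this abstract (rollout, dataflow, and training as separate components) can be sketched as a toy in which rollout workers and trainers only ever talk to a shared buffer. This is not AstraFlow's API: every class and function name below is hypothetical, and the "training step" is a placeholder that just averages rewards.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    policy_id: str
    reward: float


class DataflowBuffer:
    """Hypothetical dataflow layer: stores trajectories and routes them by policy."""

    def __init__(self) -> None:
        self._queues: dict[str, deque] = {}

    def put(self, traj: Trajectory) -> None:
        self._queues.setdefault(traj.policy_id, deque()).append(traj)

    def take_batch(self, policy_id: str, size: int) -> list[Trajectory]:
        queue = self._queues.get(policy_id, deque())
        return [queue.popleft() for _ in range(min(size, len(queue)))]


def rollout_service(policy_id: str, rewards: list[float], buffer: DataflowBuffer) -> None:
    """Stand-in for a rollout worker: it only writes to the buffer."""
    for reward in rewards:
        buffer.put(Trajectory(policy_id, reward))


def trainer_step(policy_id: str, buffer: DataflowBuffer, batch_size: int) -> Optional[float]:
    """Stand-in for a trainer: it only reads from the buffer.

    Returns the mean batch reward as a placeholder for a real update,
    or None when no data is available for this policy.
    """
    batch = buffer.take_batch(policy_id, batch_size)
    if not batch:
        return None
    return sum(t.reward for t in batch) / len(batch)


buffer = DataflowBuffer()
rollout_service("planner", [1.0, 0.0], buffer)
rollout_service("coder", [0.5], buffer)

planner_signal = trainer_step("planner", buffer, batch_size=2)
coder_signal = trainer_step("coder", buffer, batch_size=4)
drained_signal = trainer_step("planner", buffer, batch_size=2)
```

Because neither the rollout function nor the trainer function references the other, adding a second policy only requires a new `policy_id`, which loosely mirrors the abstract's claim that multi-policy workloads need no system-level changes.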


Paper Selection Criteria

Research Interest Criteria

Please evaluate papers using a strict filter. We only want a small set of strongly aligned papers, not a broad survey of adjacent work.

Global Selection Principle

  • Mark a paper as relevant only if one of the three topics below is a central contribution of the paper, not a side application, benchmark, or discussion point.
  • Be conservative by default: if the connection is indirect, partial, or ambiguous, treat it as not relevant or score it low.
  • Prefer papers that introduce new methods, training frameworks, reward designs, agent improvement loops, or latent reasoning architectures.
  • Do not favor papers just because they mention "LLM", "agent", "reasoning", "RL", "alignment", or "CoT" in a generic way.
  • Exclude broad foundation model papers, generic post-training papers, generic tool-use agents, application/system papers, domain-specific deployments, and pure benchmark papers unless their core method strongly matches one of the topics below.

1. Reinforcement Learning (RL) for LLMs

We care about RL methods that directly improve LLM behavior, alignment, or reasoning through explicit reward design, RL pipelines, verifier-driven optimization, or scalable preference/feedback mechanisms.

  • Strong match: RLHF/RLAIF variants, verifiable reward based RL, rejection sampling plus RL pipelines, reward model design, outcome/process reward methods, online RL for LLMs, and methods that materially improve helpfulness, honesty, harmlessness, or hallucination behavior.
  • Especially valuable: Papers that propose a new reward function, a new scalable RL training recipe, or a new data-generation / self-improvement loop for post-training LLMs.
  • Usually not relevant: Generic reinforcement learning papers, RL for robotics/control/games, prompt optimization without RL, supervised finetuning only, inference-time search without a training contribution, or alignment papers with no substantive RL component.

2. Agents for LLMs

We care specifically about self-evolving agents rather than static agent systems. The central question is whether the agent can improve itself over time through interaction, reflection, tool-use traces, self-generated supervision, or automated curriculum/data generation.

  • Strong match: Self-improving agent frameworks, reflection/self-critique loops that feed back into learning, trajectory-driven continual improvement, automatic tool-use policy improvement, environment interaction that produces new supervision, and long-horizon agent adaptation.
  • Especially valuable: Closed-loop systems showing measurable capability gain over time, reliable agent learning from interaction traces, or agent architectures that turn execution experience into better future performance.
  • Usually not relevant: Vanilla ReAct-style agents, workflow orchestration, multi-agent communication only, tool calling systems without learning, benchmark papers, memory systems without self-improvement, or application-focused agent papers.

3. Latent CoT/Reasoning

We care about methods that move reasoning away from verbose explicit text chains into compact latent computation while preserving or improving reasoning quality.

  • Strong match: Latent reasoning architectures, hidden-state reasoning, implicit CoT, recurrent depth / recurrent latent computation, pause or thinking tokens for internal computation, explicit-to-latent distillation, and methods that reduce reasoning token cost or KV-cache use while retaining accuracy.
  • Especially valuable: New architectures or objectives for latent reasoning, hidden-state alignment/distillation, compression of reasoning traces, or methods that materially improve reasoning efficiency on math and logic tasks.
  • Usually not relevant: Standard explicit CoT prompting, generic reasoning finetuning, test-time scaling without latent computation, reasoning benchmarks, or papers whose main contribution is evaluation rather than a latent reasoning mechanism.

Final Preference

If a paper is merely adjacent to these areas, do not select it. Favor precision over recall. It is better to miss a borderline paper than to include many weakly related ones.