Total relevant papers: 4
Today's research in RL for LLM alignment demonstrates a shift toward handling uncertainty and personalization. The key development is the move beyond monolithic reward models to frameworks that decompose and model the inherent uncertainty in human preferences. The proposed Variational Reward Factorization (VRF) is a significant step: it treats reward components as probabilistic preference bases, capturing the variability in what different users or contexts consider valuable and enabling more nuanced personalization. This directly addresses a core challenge in building robust, adaptable aligned models.
The frontier of autonomous LLM agents is being pushed on two critical fronts: architectural design and operational integrity. Architecturally, frameworks like the asymmetric actor-critic enable sustained multi-turn reasoning by decoupling action generation from runtime critique into specialized components, supporting self-evolution and more stable long-horizon behavior. Concurrently, there is a strong emphasis on building trustworthy, real-world systems, as evidenced by new frameworks for privacy-compliant agentic reasoning. These systems introduce formal mechanisms to handle conflicting evidence and adhere to compliance constraints, which is essential for moving agents from controlled benchmarks to practical, reliable deployment.
Safety alignment for Multimodal Large Language Models (MLLMs) is advancing beyond simple input filtering and post-hoc correction. The emerging trend is toward deeply integrated, conditional generation strategies that bake safety considerations directly into the decoding process. The proposed conditional decoding approach (CASA) exemplifies this, building a more robust safety layer by letting the model adjust its output dynamically based on a safety condition. This represents a proactive shift from merely detecting unsafe content to structurally preventing its generation, a crucial development for the responsible deployment of increasingly powerful multimodal systems.
Uncertainty-Aware Variational Reward Factorization via Probabilistic Preference Bases for LLM Personalization. Authors: Gyuseok Lee, Wonbin Kweon, Zhenrui Yue, SeongKu Kang, Jiawei Han, Dong Wang
Asymmetric Actor-Critic for Multi-turn LLM Agents. Authors: Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto
CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance. Authors: Haochen Liu, Weien Li, Rui Song, Zeyu Li, Chun Jason Xue, Xiao-Yang Liu, Sam Nallaperuma, Xue Liu, Ye Yuan
Robust Multimodal Safety via Conditional Decoding. Authors: Anurag Kumar, Raghuveer Peri, Jon Burnsky, Alexandru Nelus, Rohit Paturi, Srikanth Vishnubhotla, Yanjun Qi
ArXiv ID: 2604.00997
Authors: Gyuseok Lee, Wonbin Kweon, Zhenrui Yue, SeongKu Kang, Jiawei Han, Dong Wang
Affiliations: Unknown Institution
arXiv:2604.00997v1 Announce Type: new Abstract: Reward factorization personalizes large language models (LLMs) by decomposing rewards into shared basis functions and user-specific weights. Yet, existing methods estimate user weights from scarce data in isolation and as deterministic points, leading to inaccurate and unreliable inference. We introduce Variational Reward Factorization (VRF), an uncertainty-aware framework that represents each user's preferences as a variational distribution in a shared preference space. VRF infers user distributions via a variational encoder, derives weights through Wasserstein distance matching with shared probabilistic bases, and downweights uncertain estimates through a variance-attenuated loss. On three benchmarks, VRF outperforms all baselines across seen and unseen users, few-shot scenarios, and varying uncertainty levels, with gains extending to downstream alignment.
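The abstract's two key mechanisms, matching user distributions to shared bases via Wasserstein distance and downweighting uncertain users, can be sketched for the one-dimensional Gaussian case. This is a minimal illustration, not the paper's implementation: the function names and loss shape are assumptions, though the closed-form 2-Wasserstein distance between 1-D Gaussians is standard.

```python
# Minimal sketch of two VRF ingredients (names and loss form are assumptions).
import math

def w2_gaussian(mu1, sigma1, mu2, sigma2):
    # Closed-form 2-Wasserstein distance between two 1-D Gaussians:
    # W2^2 = (mu1 - mu2)^2 + (sigma1 - sigma2)^2.
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)

def variance_attenuated_loss(errors, variances):
    # Gaussian-NLL-style attenuation: users with high posterior variance
    # contribute less to the fit, but pay a log-variance penalty so the
    # model cannot inflate variance for free.
    total = 0.0
    for err, var in zip(errors, variances):
        total += err / (2.0 * var) + 0.5 * math.log(var)
    return total / len(errors)
```

A user with a large prediction error but high estimated uncertainty is attenuated relative to an equally wrong but confidently estimated user, which is the intended robustness effect for few-shot users.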
ArXiv ID: 2604.00304
Authors: Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto
Affiliations: Unknown Institution
arXiv:2604.00304v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor's actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on $\tau$-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.
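The control flow the abstract describes, a fixed actor supervised at runtime by a critic that can intervene within the same trajectory, can be sketched as a simple revision loop. The function names and the verdict/feedback protocol below are assumptions for illustration; in the paper, the actor is a proprietary LLM and the critic a smaller fine-tuned open-source model.

```python
# Sketch of runtime critique (hypothetical interfaces): actor_fn and
# critic_fn stand in for LLM calls.

def supervised_step(observation, actor_fn, critic_fn, max_revisions=2):
    """One environment step with in-trajectory critic supervision."""
    action = actor_fn(observation, feedback=None)
    for _ in range(max_revisions):
        verdict, feedback = critic_fn(observation, action)
        if verdict == "ok":
            break
        # The critic intervenes inside the same interaction trajectory:
        # the actor revises with the critique, rather than retrying the
        # whole task post hoc (which one-shot settings forbid).
        action = actor_fn(observation, feedback=feedback)
    return action
```

The asymmetry the paper exploits is visible here: `actor_fn` must generate high-quality actions, while `critic_fn` only needs to verify them, a task the authors find smaller models can handle.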
ArXiv ID: 2604.01113
Authors: Haochen Liu, Weien Li, Rui Song, Zeyu Li, Chun Jason Xue, Xiao-Yang Liu, Sam Nallaperuma, Xue Liu, Ye Yuan
Affiliations: Unknown Institution
arXiv:2604.01113v1 Announce Type: new Abstract: Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. Such a scenario exists in real-world healthcare settings, with patient-reported symptoms contradicting medical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon organ dysfunction worsening prediction in the intensive care unit (ICU) setting. We derive this dataset from the widely recognized MIMIC-IV, a publicly available electronic health record dataset, and construct it exclusively from cases in which discordance between signs and symptoms exists. This setting poses a substantial challenge for existing LLM-based approaches, with single-pass LLMs and agentic pipelines often struggling to reconcile such conflicting signals. To address this problem, we propose CARE: a multi-stage privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE achieves stronger performance across all key metrics compared to multiple baseline settings, showing that CARE can more robustly handle conflicting clinical evidence while preserving privacy.
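The privacy boundary in CARE can be sketched as a remote/local split: the remote model receives only a de-identified task description and returns reasoning structure, while the local model combines that structure with the sensitive record. Everything below (function names, category strings, the discordance rule) is an illustrative assumption, not the paper's actual pipeline.

```python
# Hypothetical sketch of CARE's privacy split.

def remote_guidance(task_description):
    # Stand-in for the remote LLM: only the abstract task description
    # crosses this boundary; no patient data is sent.
    return ["vital-sign trend", "symptom report", "sign-symptom discordance"]

def local_decision(patient_record, categories):
    # Stand-in for the local LLM, which reasons over the guidance
    # categories together with the sensitive record.
    signs = patient_record["signs_worsening"]
    symptoms = patient_record["symptoms_worsening"]
    if "sign-symptom discordance" in categories and signs != symptoms:
        # On discordant evidence, favour objectively measured signs
        # over patient self-report (an assumed tie-breaking rule).
        return signs
    return signs or symptoms
```

The point of the sketch is the data flow: `remote_guidance` never sees `patient_record`, so guidance quality and privacy compliance are decoupled.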
ArXiv ID: 2604.00310
Authors: Anurag Kumar, Raghuveer Peri, Jon Burnsky, Alexandru Nelus, Rohit Paturi, Srikanth Vishnubhotla, Yanjun Qi
Affiliations: Unknown Institution
arXiv:2604.00310v1 Announce Type: new Abstract: Multimodal large-language models (MLLMs) often experience degraded safety alignment when harmful queries exploit cross-modal interactions. Models aligned on text alone show a higher rate of successful attacks when extended to two or more modalities. In this work, we propose a simple conditional decoding strategy, CASA (Classification Augmented with Safety Attention) that utilizes internal representations of MLLMs to predict a binary safety token before response generation. We introduce a novel safety attention module designed to enhance the model's ability to detect malicious queries. Our design ensures robust safety alignment without relying on any external classifier or auxiliary head, and without the need for modality-specific safety fine-tuning. On diverse benchmarks such as MM-SafetyBench, JailbreakV-28k, and adversarial audio tests, CASA lowers the average attack success rate by more than 97% across modalities and across attack types. Our empirical evaluations also show that CASA maintains strong utility in benign inputs, a result validated through both automated and human evaluations (via 13 trained annotators). Together, these results highlight CASA as a simple and generalizable framework to improve multimodal LLM safety.
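The decoding-time gate CASA describes, predicting a binary safety token from internal representations before any response tokens are generated, can be sketched as follows. The scoring function and token names are placeholders; in the paper the score comes from a learned safety attention module over the MLLM's hidden states, not a mean.

```python
# Illustrative sketch of safety-token-gated conditional decoding
# (scoring and thresholds are assumptions).

def conditional_decode(hidden_state, generate_fn, threshold=0.5):
    """Emit a binary safety token first, then generate or refuse."""
    # Placeholder for the safety attention module's score over the
    # model's internal representation of the query.
    safety_score = sum(hidden_state) / len(hidden_state)
    safety_token = "<unsafe>" if safety_score > threshold else "<safe>"
    if safety_token == "<unsafe>":
        # Refuse before generation: unsafe content is structurally
        # prevented rather than detected and filtered afterwards.
        return safety_token, "I can't help with that."
    return safety_token, generate_fn()
```

Because the gate runs before decoding and reads the model's own representations, no external classifier or auxiliary head is needed, which matches the design goal stated in the abstract.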
Please evaluate the papers based on the following three core areas:
Supervised Fine-Tuning (SFT) is a critical technique for adapting pre-trained Large Language Models (LLMs) to specific downstream tasks or domains. This process involves further training the model on a curated dataset of high-quality, labeled examples that reflect the target behavior. Our research focuses on optimizing the SFT process to enhance model performance, alignment, and efficiency.
Reinforcement Learning, particularly from human or AI-generated feedback, offers a powerful paradigm for aligning LLM behavior with complex human preferences and objectives. This research area explores methods beyond simple instruction-following, aiming to imbue models with more nuanced capabilities. Our work centers on creating robust RL pipelines and novel reward mechanisms to steer LLM development.
Autonomous agents built upon LLMs are a key step towards creating general-purpose problem-solving systems. Our emphasis is on self-evolving agents: agents that can iteratively improve their policies, tools, and internal knowledge through experience. This direction studies the mechanisms that make agent behavior not only autonomous at inference time, but continually improving over long horizons via self-generated feedback and data. The goal is to design agent architectures that translate self-evolution into measurable capability gains and strong real-world deployability.
In addition to the three core areas above, the reader also has strong interests in: