Reward design has been one of the central challenges for real-world reinforcement learning (RL) deployment, especially in settings with multiple objectives. Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioral outcomes. More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators. However, existing RLAIF work typically focuses on single-objective tasks, leaving open the question of how RLAIF handles systems that involve multiple objectives. In such systems, trade-offs among conflicting objectives are difficult to specify, and policies risk collapsing into optimizing a single dominant goal. In this paper, we explore the extension of the RLAIF paradigm to multi-objective self-adaptive systems. We show that multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities, without laborious reward engineering. We argue that integrating RLAIF into multi-objective RL offers a scalable path toward user-aligned policy learning in domains with inherently conflicting objectives.
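To make the core loop concrete, below is a minimal sketch of how an AI labeller might judge pairs of multi-objective outcomes under a stated user priority, with a reward model then fitted to those labels via the standard Bradley-Terry preference model used in preference-based RL. The specifics are illustrative assumptions rather than the system studied here: `query_llm_preference` is a stand-in for a real LLM call, the latency/energy objectives are hypothetical, and the reward model is deliberately a simple linear one.

```python
# Minimal sketch of multi-objective RLAIF: AI preference labelling plus
# Bradley-Terry reward fitting. Everything here is illustrative;
# query_llm_preference is a stand-in for an actual LLM query.
import math
import random

random.seed(0)
OBJECTIVES = ["latency", "energy"]  # hypothetical conflicting objectives

def query_llm_preference(outcome_a, outcome_b, priority):
    """Stand-in for an LLM judgement: 1 if outcome A is preferred, else 0.

    A real labeller would render both outcomes and the user's stated
    priority into a prompt and parse the LLM's answer; here we simulate
    a judge that weights objectives (lower is better) by the priority.
    """
    score = lambda o: -sum(priority[k] * o[k] for k in OBJECTIVES)
    return 1 if score(outcome_a) > score(outcome_b) else 0

# Linear reward model r(o) = sum_k w[k] * (-o[k]), fitted by SGD on the
# Bradley-Terry likelihood P(A preferred over B) = sigmoid(r(A) - r(B)).
w = {k: 0.0 for k in OBJECTIVES}
priority = {"latency": 0.7, "energy": 0.3}  # one user's trade-off

for _ in range(2000):
    a = {k: random.random() for k in OBJECTIVES}
    b = {k: random.random() for k in OBJECTIVES}
    label = query_llm_preference(a, b, priority)
    diff = sum(w[k] * (b[k] - a[k]) for k in OBJECTIVES)  # r(A) - r(B)
    p_a = 1.0 / (1.0 + math.exp(-diff))
    grad = label - p_a  # d(log-likelihood) / d(r(A) - r(B))
    for k in OBJECTIVES:
        w[k] += 0.1 * grad * (b[k] - a[k])

# The fitted weights should roughly track the stated priority, i.e. the
# reward model recovers the user's trade-off from AI labels alone.
print({k: round(v, 2) for k, v in w.items()})
```

In this sketch, varying `priority` moves the fitted reward (and hence the induced policy) to a different point on the trade-off surface, which is the sense in which the resulting policies can reflect different user priorities without hand-tuned reward weights.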