Offline reinforcement learning (RL) improves policies from logged data alone, using historical returns or other measurable outcomes as world feedback. A key difficulty is improving on observed behavior without extrapolating beyond what the offline data supports. We propose \emph{counterfactual transport flows}, a source-conditioned trajectory refinement framework for offline decision-making guided by world feedback. Given a low-feedback candidate trajectory, we retrieve trajectories from the offline data that are nearby in a latent trajectory space and have higher task-specific feedback, form local preference pairs from them, and use these pairs as weak supervision for conservative refinement. The framework learns instance-specific refinement directions: at inference time, a refinement strength parameter controls how far the candidate trajectory is transported, trading off preservation of the original behavior against stronger improvement. Experiments on D4RL benchmarks, including AntMaze and MuJoCo tasks, show that our method improves behavior when historical returns serve as world feedback, while providing interpretable trajectory-level refinement paths.
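
The pair construction and the strength-controlled transport described above can be sketched in symbols. The notation here is purely illustrative and assumed for this sketch: an encoder $E$ mapping a trajectory $\tau$ to a latent code $z = E(\tau)$, a scalar feedback $r(\tau)$, a neighborhood size $k$, a source-conditioned velocity field $v_\theta$, and a refinement strength $\lambda \in [0, 1]$. One plausible reading is
\begin{align}
  \mathcal{P}(\tau_s) &= \bigl\{ (\tau_s, \tau^{+}) : \tau^{+} \in \mathcal{N}_k(\tau_s),\ r(\tau^{+}) > r(\tau_s) \bigr\}, \\
  z(\lambda) &= z_s + \int_{0}^{\lambda} v_\theta\bigl(z(t), t \mid z_s\bigr)\, dt, \qquad z_s = E(\tau_s),
\end{align}
where $\mathcal{N}_k(\tau_s)$ denotes the $k$ offline trajectories whose latent codes are closest to $z_s$. Under this reading, $\lambda = 0$ returns the original candidate unchanged and larger $\lambda$ transports it further along the learned refinement direction; the actual parameterization and training objective may differ from this sketch.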