Title
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs
Authors
Yifei Li, Pengyiang Liu, Yuhang Zang, Zhongyue Shi, Qi Fu, Hongye Hao, Jiwen Lu
Published
2026/6/3 00:51:32
Source type
preprint
Language
en
Abstract

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points, 59.2 vs. 86.6, with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.
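
The evaluation protocol described in the abstract (each question has a query timestamp and an evidence interval, and the model sees only the video prefix before the query) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `StreamingQuestion` record, its field names, the helper functions, and the choice of a strict "before the query" cutoff are hypothetical and are not taken from the benchmark's actual data format or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingQuestion:
    """Hypothetical question record; field names are illustrative only."""
    video_id: str
    question: str
    query_time: float      # seconds into the stream when the question is asked
    evidence_start: float  # start of the interval containing the evidence
    evidence_end: float    # end of the interval containing the evidence
    level: str             # one of the four abstraction levels


def visible_prefix(frame_times, query_time):
    """Keep only frame timestamps strictly before the query (assumed cutoff)."""
    return [t for t in frame_times if t < query_time]


def evidence_in_recent_window(q, window):
    """True if the evidence interval overlaps the last `window` seconds before the query."""
    return q.evidence_end > q.query_time - window and q.evidence_start < q.query_time


# Usage: a question asked at t=10s whose evidence appeared between 2s and 4s.
q = StreamingQuestion("v1", "Where is the door relative to the sofa?",
                      query_time=10.0, evidence_start=2.0, evidence_end=4.0,
                      level="allocentric mapping")
frames = [0.0, 2.5, 5.0, 7.5, 10.0, 12.5]
print(visible_prefix(frames, q.query_time))  # frames at 10.0 and 12.5 are withheld
print(evidence_in_recent_window(q, 3.0))     # evidence lies outside the last 3 seconds
```

In this sketch, a question whose evidence falls outside the recent window is one the model can answer only by retaining information from earlier in the stream, which is the situation the abstract describes as "evidence outside the current view".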

Metadata
arXiv: 2606.03890v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
SpatialIntelligence
LLM
Multimodal
GeoMultimodal
Agent
GeoSimulation
cs.CV