We present MiniVLA-Nav v1, a simulation dataset for Language-Conditioned Object Approach (LCOA) navigation: given a short natural-language instruction, an NVIDIA Nova Carter differential-drive robot must navigate to the named object and stop within 1 m, across four photorealistic Isaac Sim environments (Office, Hospital, Full Warehouse, and Warehouse with Multiple Shelves). Each of the 1,174 episodes pairs an instruction with synchronized 640×640 RGB images, metric depth maps (float32, metres), and instance segmentation masks, together with continuous velocity commands (v, ω) and 7×7 tokenized expert action labels recorded at 60 Hz from a vision-based proportional controller. Trajectory diversity is ensured through three spawn-distance tiers (near: 1.5–3.5 m; mid: 3.5–7.0 m; far: globally curated points; Pearson r = 0.94 between spawn distance and trajectory length), 12 object categories, 18 training templates, and 12 paraphrase-OOD templates. Five evaluation splits support benchmarking of in-distribution accuracy, template-paraphrase robustness, and OOD object-category generalization. The dataset is publicly available at https://huggingface.co/datasets/alibustami/miniVLA-Nav.
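The expert demonstrations come from a vision-based proportional controller, which the abstract names but does not specify. A minimal sketch of such a controller, assuming it steers toward the horizontal centroid of the target's instance mask and regulates speed from the median metric depth over that mask, might look as follows; the gains, velocity limits, search behaviour, and camera field of view are all illustrative assumptions, not the published implementation:

```python
import numpy as np

# Hypothetical gains and limits; the actual controller parameters
# are not given in the abstract.
KP_LIN, KP_ANG = 0.5, 1.5   # proportional gains (assumed)
V_MAX, W_MAX = 1.0, 1.0     # velocity limits in m/s and rad/s (assumed)
STOP_RADIUS = 1.0           # success criterion from the paper: stop within 1 m

def proportional_controller(mask: np.ndarray, depth: np.ndarray,
                            hfov_rad: float = np.deg2rad(90.0)):
    """One vision-based P-control step: steer toward the target's
    instance mask and slow down as the metric depth approaches 1 m."""
    ys, xs = np.nonzero(mask)           # pixels belonging to the target object
    if xs.size == 0:
        return 0.0, 0.3                 # target not visible: rotate in place to search
    u = xs.mean()                       # horizontal centroid of the mask
    bearing = (u / mask.shape[1] - 0.5) * hfov_rad  # approx. angle to target (rad)
    dist = float(np.median(depth[ys, xs]))          # robust metric distance (m)
    if dist <= STOP_RADIUS:
        return 0.0, 0.0                 # within 1 m of the object: stop
    v = np.clip(KP_LIN * (dist - STOP_RADIUS), 0.0, V_MAX)
    omega = np.clip(-KP_ANG * bearing, -W_MAX, W_MAX)
    return float(v), float(omega)
```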
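The 7×7 tokenization is likewise only named, not specified. One plausible reading, sketched here purely as an assumption, is that v and ω are each quantized into 7 uniform bins, yielding 49 discrete action tokens; the velocity ranges below are hypothetical:

```python
import numpy as np

V_MAX = 1.0   # max linear velocity in m/s (assumption)
W_MAX = 1.0   # max angular velocity magnitude in rad/s (assumption)
N_BINS = 7    # 7 bins per axis -> 7x7 = 49 discrete action tokens

def tokenize_action(v: float, omega: float) -> int:
    """Map a continuous (v, omega) command to one of 49 grid tokens."""
    # Normalize each component to [0, 1], then quantize into 7 uniform bins.
    v_bin = int(np.clip(v / V_MAX, 0.0, 1.0) * (N_BINS - 1) + 0.5)
    w_bin = int(np.clip((omega + W_MAX) / (2 * W_MAX), 0.0, 1.0) * (N_BINS - 1) + 0.5)
    return v_bin * N_BINS + w_bin   # row-major token index in [0, 48]

def detokenize_action(token: int) -> tuple[float, float]:
    """Recover the bin-centre (v, omega) from a token index."""
    v_bin, w_bin = divmod(token, N_BINS)
    v = (v_bin / (N_BINS - 1)) * V_MAX
    omega = (w_bin / (N_BINS - 1)) * 2 * W_MAX - W_MAX
    return v, omega
```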
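For access, a full snapshot of the repository can be pulled with huggingface_hub; this assumes only the Hub URL given above, not any particular file layout or loading script:

```python
from huggingface_hub import snapshot_download

# Download the dataset snapshot from the Hub (repo ID taken from the paper).
local_dir = snapshot_download(
    repo_id="alibustami/miniVLA-Nav",
    repo_type="dataset",
)
print(local_dir)  # local path containing the episode files
```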