What makes a robot gripper useful isn’t that it can pick up one object; it’s that it can pick up the next one, and the one after that, with a tool it’s never held before. What makes an autonomous vehicle system safe isn’t just that it can reason through a situation; it’s that it can do so quickly enough on the hardware actually installed in the car. What makes a virtual agent capable is exposure to as many different environments as possible before it faces the real world.

At this year’s Computer Vision and Pattern Recognition (CVPR) conference, NVIDIA Research is presenting three papers, one for each of these challenges, that share a common theme: training at scale creates systems that generalize across diverse applications. NVIDIA also unveiled at CVPR new physical AI agent skills that help researchers and developers speed the development of autonomous vehicles, robots and vision AI systems.

A vision-language-action policy trained for a two-finger gripper only learns to grasp with those two fingers. Similarly, a policy for dexterous grasping will only work for the bespoke multi-fingered gripper it’s trained on. For every new embodiment, the process typically needs to be repeated, requiring new training data, fine-tuning and validation. This constraint means most robotics companies pick a gripper, train for it and stick with it.

Like a large language model that can apply its understanding of language to a new task without retraining, GraspGen-X applies its understanding of geometry and contact to any robotic gripper it encounters. Given the geometry of a new gripper and an object it has never seen before, the model generates reliable grasp pose proposals that enable the robot to grasp the object. To get there, the researchers needed a dataset that’s impossible to collect in the real world at scale.
They generated 2 billion simulated grasps across thousands of object shapes and synthetic gripper configurations, spanning the diversity of form factors a deployed robot might encounter.

Building on the GraspGen research foundation, another paper, Grasp-MPC, presented at ICRA 2026, advances the next step in the pipeline: moving from grasp generation to closed-loop grasp execution.

In recent years, researchers have found that letting an AI reason, generating intermediate thinking steps before committing to an answer, reliably improves its decision-making. For autonomous vehicles, the challenge is doing that reasoning on the hardware inside an actual vehicle. Text-based chain-of-thought reasoning generates words, and every word is a token that takes time to produce. On the processor running inside a car, token count is a real constraint on how fast the system can respond.

Instead of generating human-readable reasoning steps, the system thinks in a compact latent space whose states capture spatial information rather than text. The architecture alternates between two kinds of thinking: proposing candidate actions, then predicting what the world will look like if those actions are taken. It uses that predicted world state to refine its next step. It’s the same reasoning loop, just in a more computationally efficient form than natural language.

The result: trajectory quality comparable to text-based reasoning, using roughly half the tokens. The model was built on NVIDIA Alpamayo and trained using supervision derived from existing vehicle data.

Isaac GR00T, NVIDIA’s open foundation model for humanoid robots, is built on a simple principle: expose a model to enough diverse situations, and it will generalize to ones it hasn’t seen. NitroGen extends that principle to virtual environments, using the GR00T architecture to train a foundation model for embodied agents across a breadth of virtual worlds.
Video games offer something that’s hard to build from scratch: structured, varied worlds with defined goals and well-specified success conditions. They’re high-quality training environments, available at scale. NitroGen treats them that way, as a training ground for agents that will eventually be trained to handle novel real- or simulated-world situations, like powering a robot that helps with housework based on broad instructions such as, “Put these items away in the pantry.”

The same techniques could eventually help enable more adaptive non-player characters (NPCs), AI companions and gameplay systems inside games, as well as broader testing of complex game environments.

In low-data conditions, where an agent has seen only a handful of examples of a new environment, starting with NitroGen gives agents a huge head start, improving performance by up to 52% over previous state-of-the-art methods.
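
For readers who want a concrete picture of the latent reasoning loop described earlier for autonomous vehicles (propose an action, predict the resulting world state, then use that prediction to refine the next step), here is a minimal toy sketch in Python. It is not the paper's implementation: the real system works with learned latent states and neural networks, while this sketch uses a hand-written two-number state and simple arithmetic. Every name here (`LatentState`, `propose_action`, `predict_next_state`, `plan`), along with the gain, acceleration limits and time step, is a hypothetical choice made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentState:
    """Toy stand-in for a compact latent state (hypothetical, not from the paper)."""
    position: float  # metres along the lane
    speed: float     # metres per second


def propose_action(state: LatentState, target_speed: float) -> float:
    """Thinking mode 1: propose an acceleration that nudges speed toward the target."""
    gain = 0.5  # assumed proportional gain
    accel = gain * (target_speed - state.speed)
    return max(-3.0, min(2.0, accel))  # clamp to assumed comfort limits


def predict_next_state(state: LatentState, accel: float, dt: float) -> LatentState:
    """Thinking mode 2: toy world model predicting the state after `dt` seconds."""
    new_speed = max(0.0, state.speed + accel * dt)
    new_position = state.position + state.speed * dt
    return LatentState(new_position, new_speed)


def plan(start: LatentState, target_speed: float, steps: int, dt: float = 0.5):
    """Alternate action proposal and state prediction for `steps` rounds."""
    state = start
    actions, states = [], [start]
    for _ in range(steps):
        accel = propose_action(state, target_speed)   # propose a candidate action
        state = predict_next_state(state, accel, dt)  # imagine its consequence
        actions.append(accel)                         # next proposal uses the prediction
        states.append(state)
    return actions, states


if __name__ == "__main__":
    actions, states = plan(LatentState(0.0, 10.0), target_speed=14.0, steps=6)
    for a, s in zip(actions, states[1:]):
        print(f"accel={a:+.2f} m/s^2 -> predicted speed={s.speed:.2f} m/s")
```

The point of the sketch is only the control flow: no text is generated between steps, and each new proposal is conditioned on the state the previous step predicted rather than on a written explanation.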