UrbanComp Lab | Learning Resource Library

News

NVIDIA Technical Blog

Industry

Agent

Dataset

Platform

English Title

NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark

Shruti Koparkar

Published

2026/6/13 05:00:08

Source type

blog

Language

Summary

English Original

AgentPerf from Artificial Analysis, the industry’s first agentic AI benchmark, gives developers, enterprises and infrastructure providers a clear way to compare systems for agentic AI. In the first round of published results, the NVIDIA Blackwell Ultra NVL72 platform delivers leading performance across the agentic AI workloads tested, running 20x more agents per megawatt than NVIDIA […]

Full text

English Original

Agentic AI is a fundamentally different workload than conversational AI. A single chat completion is a sprint: one large language model (LLM) call, one response. An agent functions more like a relay: it breaks a goal into many steps and keeps going until the task is done. That results in dozens to hundreds of LLM calls chained together, each passing growing context to the next, with tool calls such as code compilation and execution, database search and web browsing at every handoff. The complexity isn’t additive; it’s multiplicative.

The distinction matters enormously for performance measurement. Existing AI inference benchmarks measure one LLM call: how fast an LLM responds to a single request and how many simultaneous requests a system can handle. They weren’t designed for agentic workloads, where chained LLM calls, tool call delays and growing context stress accelerated computing systems in fundamentally different ways than a single LLM call ever could. For companies building and deploying agents at scale, it’s important to understand how responsive agents are, how many can be deployed simultaneously and how much useful work AI infrastructure can deliver for every dollar and watt invested.

CUDA kernels accelerate this further by overlapping communication and compute, so the cost of coordinating across experts is absorbed rather than added to latency. NVIDIA TensorRT LLM sustains efficiency as concurrent agent sessions scale. For example, it separates the processing of inputs from the generation of outputs so each can be optimized independently.

These results are grounded in a benchmark methodology built from the ground up to reflect how agentic AI actually works in production. AgentPerf is built on real coding agent trajectories: an agent receives a task, reads files, writes and edits code, executes commands and iterates based on the results. All trajectories are drawn from real public code repositories across 12+ programming languages.
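To make the relay picture concrete, here is a minimal Python sketch of how a chained trajectory inflates the work per call. It is not AgentPerf's trace format: the `Step` fields, the `replay` helper and all token counts are hypothetical, and the model deliberately ignores KV-cache reuse, which real serving stacks may use to avoid re-reading earlier context.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One turn of a hypothetical agent trajectory (illustrative fields)."""
    new_input_tokens: int  # tokens added before the call (task text, tool output)
    output_tokens: int     # tokens the model generates in this call
    tool_delay_s: float    # simulated tool time after the call (never executed)


def replay(steps):
    """Walk a trajectory and return per-call context sizes plus totals."""
    context = 0
    context_sizes = []
    total_input = 0
    total_tool_delay = 0.0
    for step in steps:
        context += step.new_input_tokens
        context_sizes.append(context)  # each call reads the whole context so far
        total_input += context
        context += step.output_tokens  # the reply becomes part of the context
        total_tool_delay += step.tool_delay_s
    return context_sizes, total_input, total_tool_delay


# Illustrative numbers only: 4 calls, each adding 1,000 input and 200 output tokens.
steps = [Step(1000, 200, 0.5) for _ in range(4)]
sizes, total_input, tool_s = replay(steps)
print(sizes)        # [1000, 2200, 3400, 4600]
print(total_input)  # 11200
print(tool_s)       # 2.0
```

Under these assumptions the fourth call reads 4.6x the context of the first, and total input tokens grow roughly with the square of the number of steps, which is one way to read the claim that the complexity is multiplicative rather than additive.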
The long sequence lengths, tool call patterns and delays are all representative of real-world coding workflows. AgentPerf then measures how many of these agentic tasks a platform can support simultaneously while meeting defined performance thresholds for responsiveness and output token rate. Tool calls are not executed but simulated using representative CPU processing time, so differences in results reflect accelerated computing performance only.

The results translate directly into infrastructure decisions: how many concurrent agentic tasks can be run per accelerator and per megawatt of power. For enterprises deploying AI agents at scale, those numbers determine how much productive work a given infrastructure investment can actually deliver.

As NVIDIA and the open source ecosystem continue to optimize inference software, performance and efficiency on agentic workloads will only improve. The NVIDIA Vera Rubin architecture is now in full production, bringing the next generation of infrastructure capacity to meet the growing demands of agentic AI at scale. Dive deeper into AgentPerf’s methodology and NVIDIA’s full-stack optimizations for agentic AI in this technical blog.
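The threshold-based measurement and the per-accelerator and per-megawatt conversion described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names, the use of time to first token as the "responsiveness" metric, the threshold values and every measurement are made up, not AgentPerf's definitions or published results.

```python
def max_agents(measurements, max_ttft_s, min_tokens_per_s):
    """Largest tested concurrency that meets both thresholds.

    measurements maps a concurrent-agent count to a tuple of
    (time_to_first_token_s, output_tokens_per_s_per_agent).
    """
    passing = [
        n for n, (ttft, rate) in measurements.items()
        if ttft <= max_ttft_s and rate >= min_tokens_per_s
    ]
    return max(passing, default=0)


def per_unit(agents, accelerators, power_kw):
    """Convert a system-level agent count to per-accelerator and per-megawatt."""
    return agents / accelerators, agents * 1000.0 / power_kw


# Hypothetical sweep: latency rises and per-agent token rate falls with load.
sweep = {
    64: (0.8, 60.0),
    128: (1.5, 45.0),
    256: (2.4, 31.0),
    512: (5.0, 18.0),
}
agents = max_agents(sweep, max_ttft_s=3.0, min_tokens_per_s=30.0)
print(agents)                                          # 256
print(per_unit(agents, accelerators=8, power_kw=10.0)) # (32.0, 25600.0)
```

The point of the sketch is the shape of the calculation: concurrency is pushed up until a service-level threshold breaks, and the last passing value is then normalized by hardware count and power draw so that different systems can be compared on the same footing.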

Resource links

Metadata

Source: NVIDIA Technical Blog

Type: News

Extraction status: raw

Keywords

AI Infrastructure

Hardware

Networking

Software

Agentic AI

CUDA

Inference

NVIDIA Blackwell

TensorRT