UrbanComp Lab | Learning Resource Library

Hugging Face Blog

LLM

Agent

Dataset

Platform

Title

Introducing North Mini Code: Cohere’s First Model For Developers

Cohere Code Agents Team

Published

2026/6/9 23:56:23

Source type

blog

Summary

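North Mini Code is the first model in Cohere’s new family of models, and is specifically designed and trained for agentic software engineering tasks. Figure 1: North Mini Code’s performance in agentic coding tasks and complex code generation benchmarks, compared to leading open-source models of similar size. See here for the details of our benchmarking methodology.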

Full Text

North Mini Code is the first model in Cohere’s new family of models, and is specifically designed and trained for agentic software engineering tasks.

Figure 1: North Mini Code’s performance in agentic coding tasks and complex code generation benchmarks, compared to leading open-source models of similar size. See here for the details of our benchmarking methodology.

Figure 2: North Mini Code is a Mixture-of-Experts Transformer decoder with interleaved sliding-window self-attention and full self-attention.

Figure 3: The post-training pipeline is made up of two phases of supervised fine-tuning (SFT) and a phase of agentic reinforcement learning with verifiable rewards (RLVR) targeting software engineering and terminal tasks.

Figure 4: To power a variety of agentic coding harnesses, North Mini Code is exposed to a variety of coding harnesses during the second SFT stage.

Coding-agent rollouts are long and highly variable in length, with the slowest trajectories routinely an order of magnitude longer than the median. A synchronous RL loop would idle the trainer on every batch while it waits for those rollouts to be generated, so we decouple sampling from learning: a trainer runs alongside a vLLM sidecar that serves rollouts continuously. Policy weights are exported into vLLM every few learner steps (K=4), so the sampler is at most slightly off-policy at any moment. The residual mismatch is then corrected at the loss level.

We train using CISPO [12], a log-likelihood objective with token-level importance sampling correction. CISPO differs from PPO and GRPO in that the importance weight multiplies a log-likelihood rather than a probability ratio, and it enhances RLOO [13] with stronger regularization. We aggregate the loss at the token level rather than the prompt level, so the gradient signal scales with trajectory length and long agentic traces (where most of the credit-assignment signal lives) are not down-weighted relative to short ones.
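The effect of exporting weights every K learner steps can be illustrated with a toy model. The sketch below is not the training code described in the post: `sampler_staleness` and `sync_every` are hypothetical names, and it assumes the sampler picks up exported weights instantly and that lag is counted in learner steps. A real asynchronous system would likely see extra lag from rollouts already in flight, which this toy ignores.

```python
def sampler_staleness(num_learner_steps, sync_every=4):
    """Toy model: how many learner updates the sampler's weights lag behind.

    Assumption (not from the post): weights are exported at every learner
    step that is a multiple of `sync_every`, and the sampler adopts them
    immediately.
    """
    staleness = []
    sampler_version = 0  # learner step at which weights were last exported
    for step in range(num_learner_steps):
        if step % sync_every == 0:
            sampler_version = step  # export current policy weights to the sampler
        staleness.append(step - sampler_version)
    return staleness


# With K=4 the lag cycles through 0, 1, 2, 3 and never exceeds K - 1.
lags = sampler_staleness(10, sync_every=4)
print(lags)       # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
print(max(lags))  # 3
```

Under these assumptions the lag is bounded by K - 1 learner steps, which is the sense in which the sampler stays only slightly off-policy; whatever mismatch remains is what the importance-sampling correction in the loss is meant to absorb.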
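To make the loss description more concrete, here is a minimal pure-Python sketch of a CISPO-style objective as read from the paragraph above: a clipped importance weight, treated as a constant, multiplies the advantage-weighted token log-likelihood. The function names, the `eps_low` and `eps_high` defaults, the single scalar advantage per trajectory, and the use of a per-trajectory mean as a stand-in for prompt-level aggregation are all assumptions for illustration, not values or code from the post.

```python
import math


def clipped_is_weight(logp_new, logp_old, eps_low, eps_high):
    """Token-level importance ratio, clipped to [1 - eps_low, 1 + eps_high]."""
    ratio = math.exp(logp_new - logp_old)
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


def cispo_style_loss(trajectories, eps_low=1.0, eps_high=0.2, token_level=True):
    """Loss value for a batch of non-empty trajectories.

    Each trajectory is a dict with per-token lists 'logp_new' and 'logp_old'
    and a scalar 'advantage' (hypothetical layout). The weight is a plain
    float here; in an autograd framework it would need an explicit
    stop-gradient so that only the log-likelihood term is differentiated.
    """
    sums, lengths = [], []
    for traj in trajectories:
        total = 0.0
        for lp_new, lp_old in zip(traj["logp_new"], traj["logp_old"]):
            w = clipped_is_weight(lp_new, lp_old, eps_low, eps_high)
            total += -w * traj["advantage"] * lp_new
        sums.append(total)
        lengths.append(len(traj["logp_new"]))
    if token_level:
        # Every token counts equally, so long trajectories carry more weight.
        return sum(sums) / sum(lengths)
    # Per-trajectory mean first: every trajectory counts equally.
    return sum(s / n for s, n in zip(sums, lengths)) / len(trajectories)


short = {"logp_new": [-1.0], "logp_old": [-1.0], "advantage": 1.0}
long_ = {"logp_new": [-1.0] * 3, "logp_old": [-1.0] * 3, "advantage": -1.0}
print(cispo_style_loss([short, long_], token_level=True))   # -0.5
print(cispo_style_loss([short, long_], token_level=False))  # 0.0
```

In this toy batch the two trajectories cancel exactly under per-trajectory averaging, while token-level aggregation lets the three-token trajectory dominate. That is the length-scaling behaviour the paragraph describes.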
Evaluators are provided with rubric-based scoring questions to help them assess individual response criteria. They rate individual attempts first, before giving a final preference rating between the two model trajectories. We share evaluation results of North Mini Code, comparing the SFT checkpoint with the final model release checkpoint.

Figure 6: Pairwise preference results for human evaluation comparing the final North Mini Code checkpoint after RLVR against the SFT-only checkpoint across 85 samples.

Our evaluations show that RLVR especially improves model performance on code editing tasks, resulting in an aggregate win rate of 66.1% across subsets for the final model against its SFT-only counterpart.

1. AAII Coding Index includes Terminal Bench Hard as an agentic coding task and SciCode as a code generation benchmark for scientific problems.

@coherecode Thank you for sharing! In our internal tests DSV4 is really good as a baseline, so it would be good if we could SFT/RL over DSV4 (we have run through the whole pipeline in week 0 support, and it will take 2 days to finish training over 10k+ tool calls/coding corpus)?

Resource Links

Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs (aclanthology.org/2024.acl-long.662)
On Leakage of Code Generation Evaluation Datasets (aclanthology.org/2024.findings-emnlp.772)
Careers (apply.workable.com/huggingface)
AAII Coding Index (artificialanalysis.ai/models/capabilities/coding)
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (arxiv.org/abs/2405.15793)
SciCode: A Research Coding Benchmark Curated by Scientists (arxiv.org/abs/2407.13168)
RoPE to NoPE and Back Again: A New Hybrid Attention Strategy (arxiv.org/abs/2501.18795)
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention (arxiv.org/abs/2506.13585)
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? (arxiv.org/abs/2509.16941)
Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models (arxiv.org/abs/2512.13607)
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation (arxiv.org/abs/2603.19220)
External resource (cdn-uploads.huggingface.co...93c12a09b8a3184bff/-85CjexGqOFX5bahIdqxx.png)
External resource (cdn-uploads.huggingface.co...93c12a09b8a3184bff/8DQUkAkjo7Afat2Z4L7ue.png)
External resource (cdn-uploads.huggingface.co...93c12a09b8a3184bff/Oe79J_Vn3Lbi10oKlHiQi.png)
External resource (cdn-uploads.huggingface.co...93c12a09b8a3184bff/f8NXK5yKtc6xE-hJ4XWbd.png)
External resource (cdn-uploads.huggingface.co...93c12a09b8a3184bff/g-SYXPG1oIxHEwnItd3he.png)
External resource (cdn-uploads.huggingface.co...93c12a09b8a3184bff/xPc4PSWREdLtTS62tfl8L.png)
Forum (discuss.huggingface.co)
https://github.com/SWE-agent/mini-swe-agent
https://github.com/anomalyco/opencode
GitHub (github.com/huggingface)
Harbor: A Framework for Evaluating and Optimizing Agents and Models in Container Environments (github.com/laude-institute/harbor)
Terminal-Bench: A Benchmark for AI Agents in Terminal Environments (github.com/laude-institute/terminal-bench)
bf16 (huggingface.co/CohereLabs/North-Mini-Code-1.0)
fp8 (huggingface.co/CohereLabs/North-Mini-Code-1.0-fp8)
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (openreview.net/forum)
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (openreview.net/forum)
Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All (qwen.ai/blog)
Forge: Scalable Agent RL Framework and Algorithm (www.minimax.io...ge-scalable-agent-rl-framework-and-algorithm)
Original source page (huggingface.co...log/CohereLabs/introducing-north-mini-code)

Metadata

Source: Hugging Face Blog

Type: News

Extraction status: raw

Keywords

LLM

Agent

Dataset

Platform