[译][论文] DeepSeek-R1：通过强化学习激励大模型的推理能力（DeepSeek，2024）

Published at 2025-02-15 | Last Update 2025-03-09

译者序

本文翻译自 2024 年 DeepSeek AI 的 paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning。介绍了 DeepSeek 第一代推理模型（reasoning models） （所以缩写为 R1）的设计和训练过程：

Fig. How DeepSeek-R1-series models were trained.

要理解 DeepSeek-R1 的创新之处，可以先阅读如何训练一个企业级 GPT 助手（OpenAI，2023），里面介绍了典型的大模型训练 pipeline，其中包括预训练、SFT、RM、RL等步骤。

OpenAI：训练一个 GPT 助手的流程

DeepSeek-R1-Zero 的创新之处在于完全跳过了 SFT 步骤，直接在基座模型上进行大规模 RM+RL 训练，性能达到了 OpenAI-o1-0912 的水平。
- LLaMA 2：开放基础和微调聊天模型（Meta/Facebook，2023）对基于人类反馈的强化学习（HFRL）有较详细的介绍，DeepSeek 这里用的 RL 没有 HF，离 AGI 更进了一步。
- 更详细的 HFRL 可介绍可以参考 InstructGPT：基于人类反馈训练语言模型遵从指令的能力（OpenAI，2022），
InstructGPT 三部曲：(1) SFT, (2) RM training, (3) RLHF via proximal policy optimization (PPO) on RM.
蓝色箭头表示相应的数据用于训练模型。Step 2 中 A-D 是模型输出的采样，然后标注员对它们进行排序。
为了解决 DeepSeek-R1-Zero 存在的一些问题（可读性差，语言混用），又引入了少量的 SFT 数据作为冷启动，再参考 R1-Zero 的过程，训练了 DeepSeek-R1，在推理任务上的表现与 OpenAI-o1-1217 不相上下。
将 DeepSeek-R1 的推理能力蒸馏到 Qwen/LLaMA 等小型 dense 模型上，性能也很好。

总结下和 OpenAI 的性能对标：

DeepSeek Models	OpenAI Models
DeepSeek-R1-Zero	OpenAI-o1-0912
DeepSeek-R1	OpenAI-o1-1217
DeepSeek-R1 Distilled Models	OpenAI-o1-mini

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

以下是译文。

译者序
摘要
1 引言
2 方法
3 实验（略）
- 3.1 DeepSeek-R1 评估
- 3.2 蒸馏模型评估
4 讨论
- 4.1 蒸馏与强化学习的性能对比
- 4.2 失败的尝试
  - 4.2.1 Process Reward Model (PRM)
  - 4.2.2 Monte Carlo Tree Search (MCTS)
5 结论、局限性和未来工作
参考文献

摘要

本文介绍我们的第一代推理模型，DeepSeek-R1-Zero 和 DeepSeek-R1。

DeepSeek-R1-Zero
- 这是一个跳过监督微调（SFT）步骤，直接通过大规模强化学习（RL）训练得到的模型，具备卓越的推理能力。
  
  译注：下图来自如何训练一个企业级 GPT 助手（OpenAI，2023），展示了 OpenAI 从预训练开始逐步训练出一个 GPT 助手的步骤， pre-training -> SFT -> RM -> RL 也是典型的大模型训练过程。 R1-Zero 是在 DeepSeek-V3 基座大模型上直接进行 RM+RL，跳过中间的 SFT，
  
  OpenAI：训练一个 GPT 助手的流程
- 通过大规模 RL，DeepSeek-R1-Zero 自然地涌现出许多强大且有趣的推理行为。不过，它也存在可读性差、混用语言等问题。
DeepSeek-R1
- 为了解决以上提到的 R1-Zero 存在的问题，并进一步提升推理性能，在 RL 阶段之前引入了多阶段训练和冷启动数据，训练得到的模型称为 DeepSeek-R1。
- DeepSeek-R1 在推理任务上的表现与 OpenAI-o1-1217 不相上下。
  
  Figure 1 | Benchmark performance of DeepSeek-R1.

为了支持研究社区，我们此次开源了 8 个推理模型：

DeepSeek-R1
DeepSeek-R1-Zero
DeepSeek-R1-Distill-Llama-70B
DeepSeek-R1-Distill-Qwen-32B
DeepSeek-R1-Distill-Qwen-14B
DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Qwen-1.5B

其中，后面 6 个是以 Qwen/Llama 作为基座模型，利用 DeepSeek-R1 蒸馏出来的 dense 模型。

1 引言

近年来，大模型的迭代与演进速度非常快（OpenAI, 2024a；Anthropic, 2024；Google, 2024）。

1.0 Post-Training：完整 training pipeline 的重要组成部分

现在，post-training 已成为完整 training pipeline 的一个重要组成部分。

1.0.1 作用

Post-Training 的好处：

提高推理任务的准确性，
与人类社会价值观对齐，
能适应用户偏好，
相对于预训练，所需的计算资源极少。

1.0.2 提高推理能力：与 OpenAI-o1 的思路区别

具体到提高推理能力方面，

OpenAI 的 o1（OpenAI, 2024b）系列模型首次通过增加推理过程中的思维链长度（Chain-of-Thought, CoT）来引入 inference-time scaling。这种方法在数学、编码和科学推理等推理任务上取得了显著的效果。
但是，有效的 test-time scaling 仍然是社区的一个开放性问题。此前，业界已经探索了很多方法，包括 process-based reward models (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Xin et al., 2024; Trinh et al., 2024)，但这些方法都没有达到与 OpenAI o1 相当的通用推理性能。

本文迈出了通过纯强化学习（pure RL）提高模型推理能力的第一步。

我们的目标是探索大模型在没有任何监督数据的情况下 —— 单纯通过 RL 过程自我进化 —— 发展出推理能力的潜力。
具体来说，我们使用 DeepSeek-V3-Base 作为基础模型，采用 GRPO（Shao 等，2024）作为 RL 框架，来提高模型在推理方面的表现。
在训练过程中，DeepSeek-R1-Zero 自然地涌现出许多强大且有趣的推理行为。经过几千步的 RL 训练后， DeepSeek-R1-Zero 在推理基准测试中表现出色。例如，AIME 2024 的 pass@1 得分从 15.6% 提高到 71.0%，加上多数投票，得分进一步提高到 86.7%，与 OpenAI-o1-0912 表现相当。

然而，DeepSeek-R1-Zero 面临着诸如可读性差、语言混用等挑战。为了解决这些问题并进一步提升推理性能，我们引入了少量的冷启动数据和一个 multi-stage training pipeline，训练得到 DeepSeek-R1，其性能与 OpenAI-o1-1217 相当。

最后，我们还进一步探索了从 DeepSeek-R1 蒸馏较小的 dense models。例如，使用 Qwen2.5-32B（Qwen, 2024b）作为基础模型，两种思路：

直接在 Qwen-32B 上进行强化学习（RL），得到一个推理模型；
从 DeepSeek-R1 进行蒸馏（把 DeepSeek-R1 的知识“传授”给 Qwen2.5-32B），得到一个推理模型；

我们发现后者（蒸馏）的性能优于前者（直接 RL）。这表明尺寸更大的基础模型发现的推理模式对于提高推理能力至关重要。

我们开源了基于 Qwen/Llama（Dubey 等，2024）的蒸馏模型。值得注意的是，我们蒸馏出的 14B 模型在 AIME 2024 上的表现大幅超过了现有的开源模型 QwQ-32B-Preview（Qwen, 2024a），而蒸馏出的 32B 和 70B 模型在针对 dense models 的推理基准测试中创下了新纪录。

1.1 贡献

1.1.1 post-training：在基础模型上进行大规模强化学习

我们跳过监督微调（SFT）步骤，直接在基础模型（base model）上应用 RL。这会使模型去探索解决复杂问题时的思维链（CoT），用这种方式训练得到的就是 DeepSeek-R1-Zero。

DeepSeek-R1-Zero 展现出自我验证、反思和生成长 CoT 等能力，为社区研究树立了一个重要的里程碑。
值得注意的是，这是首个证实大模型的推理能力可以通过纯 RL 激励实现（无需 SFT）的公开研究，这一突破为该领域的未来发展铺平了道路。

此外，我们还介绍了开发 DeepSeek-R1 的 pipeline。

Fig. How DeepSeek-R1-Zero and DeepSeek-R1 were trained (based on the same base model).

该 pipeline 包含，

两个 RL stage
- 一个用于发现更强的推理模式（stage 2）
- 一个用于与人类偏好对齐（stage 4）
两个 SFT stage：用于激发出模型的 reasoning and non-reasoning 能力。

1.1.2 蒸馏：小型模型也可以很强大

我们证明了大型模型的推理模式可以被蒸馏到小型模型中，

与在小型模型上进行 RL 发现的推理模式相比，蒸馏可以取得更好的性能。
开源的 DeepSeek-R1 及其 API 将有助于社区在未来蒸馏出更好的小模型。

利用 DeepSeek-R1 生成的推理数据，我们微调了几个在社区中广泛使用的小型 dense 模型。结果显示，这些经过蒸馏的小型 dense model 在基准测试中表现非常好。

DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview.
DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench.
These results significantly outperform previous open-source models and are comparable to o1-mini.

1.2 性能评估结果

1.2.1 推理任务

DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models.
On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming 96.3% human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks.

1.2.2 知识

On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses 4o on this benchmark.

1.2.3 其他

DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing its strong ability to intelligently handle non-exam-oriented queries. Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks.

2 方法

2.1 概述

以往的研究重度依赖于大量的监督数据（人类标注数据）来提升模型性能。本文的研究证明：

不使用监督微调（SFT），单纯通过大规模强化学习（RL）也能显著提升推理能力。
通过引入少量冷启动数据（SFT 训练数据），还可以进一步增强性能。

2.2 DeepSeek-R1-Zero：在基础模型（base model）上进行强化学习

之前的研究（Wang 等，2023；Shao 等，2024）已经证明，强化学习对提高推理性能非常有用。但是，这些前期研究都重度依赖监督数据，而收集监督数据是个费事费力的过程。

本节探索在没有任何监督数据的情况下（单纯通过 RL 过程自我进化），大模型发展出推理能力的过程。

2.2.1 强化学习算法：Group Relative Policy Optimization (`GRPO`)

为了降低 RL 训练成本，我们采用了 GRPO（组相对策略优化）算法（Shao 等，2024），该方法放弃了 critic model（通常尺寸与 policy model 大小相同），而是用 group scores 来估计基线。

具体来说，对于每个问题 $q$, GRPO 从老的 policy $\pi_{\theta_{old}}$ 中采样得到一组输出 ${o_1, o_2, \cdots, o_G}$，然后用下面的目标函数优化 policy model $\pi_{\theta}$：

2.2.2 奖励建模（Reward Modeling）：`rule-based reward system`

奖励是 training signal 的来源，它决定了强化学习的优化方向。训练 DeepSeek-R1-Zero 时，我们采用了一个基于规则的奖励系统（rule-based reward system），该系统主要由两种类型的奖励组成。

类型一：准确性奖励（Accuracy rewards）

准确性奖励模型评估响应是否正确（whether the response is correct）。例如，

对于具有确定性结果的数学问题，要求模型以指定格式提供最终答案，从而能可靠地基于规则验证正确性。
对于 LeetCode 问题，可以使用编译器对生成的程序进行编译，然后运行预定义的测试用例。

类型二：格式奖励（Format rewards）

我们还采用了一个格式奖励模型，强制推理模型将其思考过程放在 <think> 和 </think> tag 内。

这里没有使用结果或过程神经奖励模型（outcome or process neural reward model），因为我们发现神经奖励模型可能会在大规模强化学习过程中出现 reward hacking 行为，并且重新训练奖励模型需要额外的训练资源，也会使整个训练流程变得更加复杂。

2.2.3 训练模板（提示词模板）

我们设计了一个简单直白的模板，指导基础模型遵循我们的具体指令。如表 1 所示，

A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:

表 1：DeepSeek-R1-Zero 的模板。在训练期间，将用具体的推理问题替换提示。

可以看到，这个模板要求 DeepSeek-R1-Zero 首先生产一个推理过程，然后再给出最终答案。我们有意将约束限制在这一结构内，避免任何 content-specific biases —— 例如，mandating reflective reasoning or promoting particular problem-solving strategies —— 以确保我们能够准确地观察模型在 RL 过程中的自然进化。

2.2.4 DeepSeek-R1-Zero 的性能、自我进化过程和顿悟时刻

性能

下图展示了 DeepSeek-R1-Zero 在 AIME 2024 基准测试中的性能轨迹，

Figure 2:AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.

可以看到，随着 RL 训练的进行，DeepSeek-R1-Zero 的性能稳步提升。 AIME 2024 pass@1 得分从 15.6% 跃升至 71.0%，达到了与 OpenAI-o1-0912 相当的性能水平，说明了我们的 RL 算法在优化模型性能方面的有效性。

表 2 是 DeepSeek-R1-Zero 与 OpenAI o1-0912 在多种推理基准测试上的性能对比，

表 2：DeepSeek-R1-Zero 与 OpenAI o1 在推理相关基准测试上的性能对比。

几点结论，

通过 RL，DeepSeek-R1-Zero 能够在无需任何监督微调数据的情况下获得强大的推理能力，也就是说模型仅通过 RL 就能有效学习和泛化。
DeepSeek-R1-Zero 的性能还可以通过多数投票（majority voting）进一步提升。例如，在 AIME 基准测试中采用多数投票时，DeepSeek-R1-Zero 的性能从 71.0% 上升至 86.7%，超过了 OpenAI-o1-0912 的性能。
DeepSeek-R1-Zero 在有无多数投票的情况下都能取得如此高的性能，突显了其强大的基础能力以及在推理任务中进一步发展的潜力。

自我进化过程

DeepSeek-R1-Zero 的自我进化过程非常好地展示了强化学习是如何驱动模型自主提升推理能力的。

直接从基础模型启动 RL 训练，使得我们免受监督微调（SFT）阶段的影响，从而能直观监测模型的进化过程。
这种方法为我们提供了一个观察模型随时间演变的清晰视角，特别是在处理复杂推理任务方面。

Figure 3:The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.

如图 3 所示，DeepSeek-R1-Zero 的思考时间在整个训练过程中呈现出持续改进（增加）的趋势。

这种进步并非外部调整的结果，而是模型内部的自然发展。
DeepSeek-R1-Zero 自然获得了通过增加 test-time computation 来解决越来越复杂的推理任务的能力。
这里所说的 computation 是指生成几百到几千个不等的推理 token，使模型能够更深入地探索和完善其思考过程。

随着 test-time computation 的增加，这种自我进化过程中最显著的方面之一是出现了复杂行为。例如，观察到下面两个行为同时自发出现了，

反思行为：模型重新审视和评估自己先前的步骤
模型主动探索解决问题的替代方法

这些行为并非明确编程的结果，而是模型与强化学习环境互动的结果。这种自发的发展显著增强了 DeepSeek-R1-Zero 的推理能力，使其能够以更高的效率和准确性应对更具挑战性的任务。

顿悟时刻

在 DeepSeek-R1-Zero 的训练过程中，观察到的一个奇特现象是所谓的 “顿悟时刻”。如表 3 所示，

Table 3:An interesting “aha moment” of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.

这一时刻出现在模型的一个中间版本中。在这个阶段，DeepSeek-R1-Zero 学会了通过重新评估其初始处理方法，为问题分配更多的思考时间。这种行为不仅是模型逐步增长的推理能力的证明，也是强化学习能够带来意外且复杂结果的一个迷人例证。

这对于模型和观察其行为的研究者来说都是一个 “顿悟时刻”，它凸显了强化学习的力量和美感：

我们并没有明确地教导模型如何解决问题，而是仅仅提供了正确的激励，模型便能够自主地发展出高级的问题解决策略。
“顿悟时刻” 有力地提醒了我们 RL 激发人工智能系统新智能水平的潜力，为未来更具自主性和适应性的模型铺平了道路。

缺点和解决方式

尽管 DeepSeek-R1-Zero 展示了强大的推理能力，并且能够自主发展出意外且强大的推理行为，但它也面临一些问题。例如，DeepSeek-R1-Zero 遇到了诸如可读性差、语言混用等挑战。为了使推理过程更具可读性，我们探索了 DeepSeek-R1。

2.3 DeepSeek-R1：带冷启动的强化学习

DeepSeek-R1-Zero 的结果令人鼓舞，关于如何进一步提升性能，自然会产生两个问题：

引入少量高质量数据作为冷启动，是否可以进一步提升推理性能或加速收敛？
如何训练一个用户友好的模型，该模型不仅能够产生清晰连贯的思维链（CoT），而且还能展现出强大的通用能力？

为了回答这些问题，我们设计了一个新的 pipeline，训练得到的模型称为 DeepSeek-R1。

该 pipeline 包含四个阶段。

2.3.1 阶段一：冷启动

为了避免从基础模型直接开始 RL 训练导致的不稳定冷启动阶段，我们构建了一定量的长 CoT 数据集并对模型进行微调（SFT），得到一个 initial RL actor。

数据源

几种方式：

提供一个 CoT 作为示例，然后使用 few-shot prompting 生成更多例子；
直接提示模型（directly prompting models），让它生成带有反思和验证过程的详细回答；
收集 DeepSeek-R1-Zero 输出的一些回答，并通过人工标注对输出的质量进行增强。

我们收集了几千个冷启动数据，拿来微调 DeepSeek-V3-Base，得到的模型作为接下来的 RL 过程的起点。

冷启动数据的好处

冷启动数据的好处包括：

提升输出的可读性

DeepSeek-R1-Zero 的主要问题之一是输出的内容经常可读性很差。可能会混杂多种语言，或者不是 markdown 格式，无法高亮一些重点。

因此，在为 DeepSeek-R1 创建冷启动数据时，我们设计了一种可读性很好的格式，在每个响应的末尾包含一个总结，并过滤出对读者不友好的响应。在这里，我们定义输出格式为 |special_token|<reasoning_process>|special_token|<summary>，其中 <reasoning_process> 是用户输入的 query 对应的 CoT（推理过程），而 <summary> 用于总结推理结果。
潜力
- 基于人的先验知识（human priors）精心设计冷启动数据，观察到训练出来的模型比 DeepSeek-R1-Zero 表现更好。
- 我们相信迭代式训练（iterative training）是很好的训练推理模型的方式。

2.3.2 阶段二：面向 reasoning 的强化学习

在使用冷启动数据对 DeepSeek-V3-Base 进行微调后，第二阶段的训练过程与 DeepSeek-R1-Zero 相同：使用大规模强化学习进行后训练。这一阶段专注于提升模型的 reasoning 能力，特别是在推理密集型任务中，如编码、数学、科学和逻辑推理，这些任务具有明确定义的问题和解决方案。

在训练过程中，我们观察到 CoT 经常出现语言混用（language mixing），特别是在 RL 提示词涉及多种语言时。为了缓解这个问题，我们在 RL 训练中引入了一种语言一致性奖励（language consistency reward），计算方式是 CoT 中目标语言单词的比例（proportion of target language words in the CoT）。尽管消融实验表明，这种对齐会导致模型性能略有下降，但这种奖励与人类偏好一致，使其更具可读性。

最后，我们直接将推理任务的准确性与语言一致性奖励相加来形成最终奖励。然后，我们在微调后的模型上应用 RL 训练，直到它在推理任务上收敛。

这个阶段的 RL 收敛时，保存一个 checkpoint 供第三阶段使用。

2.3.3 阶段三：拒绝采样和监督微调

Rejection sampling is a technique where the LLM generates multiple candidate answers and then filters out those that do not meet certain criteria, retaining only the “good” results。It is used to enhance the quality and reliability of the model’s outputs, making them more aligned with desired standards or distributions

更多信息，可参考 LLaMA 2：开放基础和微调聊天模型（Meta/Facebook，2023），里面对 rejection sampling 有较详细的介绍。

译注。

利用第二阶段的 checkpoint 收集 SFT（监督微调）数据。

初始冷启动数据主要关注推理，而这一阶段则纳入了来自其他领域的数据，以增强模型在写作、角色扮演和其他通用任务中的能力。具体来说，我们按照以下方式生成数据并微调模型。

推理数据（Reasoning data）：600k

人工整理一批推理提示词，从上述 RL 训练的 checkpoint 进行拒绝采样来生成推理轨迹。

在第二阶段，我们只纳入了可以使用基于规则的奖励进行评估的数据。

在这一阶段，

引入额外数据来扩展数据集，其中一些数据使用生成式奖励模型 —— 将事实和模型预测输入 DeepSeek-V3 进行判断。
由于模型输出有时会杂乱无章且难以阅读，我们会过滤掉带有混合语言、冗长段落和代码块的思维链。
对于每个提示，我们采样多个响应，并且只保留正确的响应。

总共，我们收集了大约 600k 个与推理相关的训练样本。

非推理数据（Non-Reasoning data）：200k

对于非推理数据，如写作、事实问答、自我认知和翻译，我们采用 DeepSeek-V3 pipeline，并复用 DeepSeek-V3 的一部分 SFT 数据集。

对于某些非推理任务，我们调用 DeepSeek-V3 来生成一个潜在的思维链，然后通过提示回答问题。
对于更简单的查询，如 “hello”，我们不会在响应中提供 CoT。

最终，我们收集了总共大约 200k 个与推理无关的训练样本。

我们使用上述整理的数据集（约 800k 样本）对 DeepSeek-V3-Base 进行了两个 epoch 的微调。

2.3.4 阶段四：所有场景的强化学习

为了进一步使模型与人类偏好对齐，我们又进行了一轮强化学习，在完善模型推理能力的同时，提高模型的有用性和无害性（helpfulness and harmlessness）。

Fig. How DeepSeek-R1-Zero and DeepSeek-R1 were trained (based on the same base model).

具体来说，我们组合使用 reward signals 和多样化的 prompt distributions 来训练模型。

对于推理数据，遵循 DeepSeek-R1-Zero 中的方法，利用基于规则的奖励来指导数学、编码和逻辑推理领域的学习过程。
对于通用数据，借助奖励模型，以捕捉复杂微妙场景中的人类偏好。我们基于 DeepSeek-V3 pipeline，并采用类似的偏好对和训练提示分布。
对于有用性，仅关注最终总结，确保评估强调响应对用户的实用性和相关性，同时尽量减少对底层推理过程的干扰。
对于无害性，评估模型的整个响应，包括推理过程和总结，以识别和减轻在生成过程中可能出现的任何潜在风险、偏见或有害内容。

这些方式组合起来，最终使我们训练出一个在推理方面表现出色、同时还会优先考虑有用性和无害性的模型。

2.4 蒸馏：赋予小型模型推理能力

为了使小型模型具备类似 DeepSeek-R1 的推理能力，我们直接用 DeepSeek-R1 生成的 800k 样本对开源模型进行微调。

我们的研究发现，这种直接蒸馏的方法能显著提升小型模型的推理能力。我们使用的基础模型包括：

Qwen2.5-Math-1.5B
Qwen2.5-Math-7B
Qwen2.5-14B
Qwen2.5-32B
Llama-3.1-8B
Llama-3.3-70B-Instruct。选择 Llama-3.3 是因为其推理能力略优于 Llama-3.1。

蒸馏过程：在以上基础模型上进行监督微调（SFT），

这里不再进行强化学习（RL），尽管叠加 RL 可能会进一步提升模型性能。
我们的主要目的是展示蒸馏技术的有效性，叠加 RL 阶段的探索就留给更社区研究。

3 实验（略）

3.1 DeepSeek-R1 评估

3.2 蒸馏模型评估

4 讨论

4.1 蒸馏与强化学习的性能对比

前面已经看到，通过蒸馏 DeepSeek-R1，小型模型可以取得非常好的效果。但这里还有一个问题待解答：通过本文讨论的大规模 RL 对小模型训练，和蒸馏方式相比，哪个效果来的更好？

为了回答这个问题，我们在 Qwen-32B-Base 上进行了大规模 RL 训练，使用数学、编码和 STEM 数据，训练了超过 10K 步，得到了 DeepSeek-R1-Zero-Qwen-32B。两种方式得到的模型，性能对比如下，

Table 6:Comparison of distilled and RL Models on Reasoning-Related Benchmarks.

大规模 RL 训练的 32B 基础模型，在性能上与 QwQ-32B-Preview 相当。
从 DeepSeek-R1 蒸馏而来的模型，在所有基准测试中都显著优于 DeepSeek-R1-Zero-Qwen-32B。

因此，我们可以得出两个结论：

将更强大的模型蒸馏到小型模型中，可以让小模型获得出色的性能。对小型模型进行大规模 RL 也能取得不错的性能，但需要的算力比蒸馏要多很多，而且可能无法达到蒸馏取得的效果。
蒸馏是一种既经济又高效的方式，但要突破智能边界，可能仍需要更强大的基础模型和更大规模的强化学习。

4.2 失败的尝试

在开发 DeepSeek-R1 早期，我们也遇到了一些失败和挫折。这里分享一些失败经验，提供一些见解，但这并不意味着这些方法无法开发出有效的推理模型。

4.2.1 Process Reward Model (PRM)

PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grain step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model needs additional training resources and it complicates the whole training pipeline. In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

4.2.2 Monte Carlo Tree Search (MCTS)

Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability. This approach involves breaking answers into smaller parts to allow the model to explore the solution space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search. For training, we first use collected prompts to find answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.

However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo’s core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation.

In conclusion, while MCTS can improve performance during inference when paired with a pre-trained value model, iteratively boosting model performance through self-search remains a significant challenge.

5 结论、局限性和未来工作

In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

We further explore distillation the reasoning capability to small dense models. We use DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH. Other dense models also achieve impressive results, significantly outperforming other instruction-tuned models based on the same underlying checkpoints.

In the future, we plan to invest in research across the following directions for DeepSeek-R1.

General Capability: Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn, complex role-playing, and JSON output. Moving forward, we plan to explore how long CoT can be leveraged to enhance tasks in these fields.
Language Mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses, even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates.
Prompting Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results.
Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.

参考文献

AI@Meta (2024) AI@Meta. Llama 3.1 model card, 2024. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md .
Anthropic (2024) Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet .
Chen et al. (2021) M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. et al Evaluating large language models trained on code. , abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374 .
Dubey et al. (2024) A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024.
Dubois et al. (2024) Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475 , 2024.
Feng et al. (2024) X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024. URL https://arxiv.org/abs/2309.17179 .
Gao et al. (2022) L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization, 2022. URL https://arxiv.org/abs/2210.10760 .
Gema et al. (2024) A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and P. Minervini. Are we done with mmlu? , abs/2406.04127, 2024. URL https://doi.org/10.48550/arXiv.2406.04127 .
Google (2024) Google. Our next-generation model: Gemini 1.5, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024 .
He et al. (2024) Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, et al. Chinese simpleqa: A chinese factuality evaluation for large language models. arXiv preprint arXiv:2411.07140 , 2024.
Hendrycks et al. (2020) D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 , 2020.
Huang et al. (2023) Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322 , 2023.
Jain et al. (2024) N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. , abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv.2403.07974 .
Krishna et al. (2024) S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. , abs/2409.12941, 2024. 10.48550/ARXIV.2409.12941 . URL https://doi.org/10.48550/arXiv.2409.12941 .
Kumar et al. (2024) A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917 , 2024.
Li et al. (2023) H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212 , 2023.
Li et al. (2024) T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939 , 2024.
Lightman et al. (2023) H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050 , 2023.
Lin (2024) B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL https://github.com/WildEval/ZeroEval .
MAA (2024) MAA. American invitational mathematics examination - aime. In American Invitational Mathematics Examination - AIME 2024 , February 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime .
OpenAI (2024a) OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/ .
OpenAI (2024b) OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/ .
OpenAI (2024c) OpenAI. Introducing SimpleQA, 2024c. URL https://openai.com/index/introducing-simpleqa/ .
OpenAI (2024d) OpenAI. Introducing SWE-bench verified we’re releasing a human-validated subset of swe-bench that more, 2024d. URL https://openai.com/index/introducing-swe-bench-verified/ .
Qwen (2024a) Qwen. Qwq: Reflect deeply on the boundaries of the unknown, 2024a. URL https://qwenlm.github.io/blog/qwq-32b-preview/ .
Qwen (2024b) Qwen. Qwen2.5: A party of foundation models, 2024b. URL https://qwenlm.github.io/blog/qwen2.5 .
Rein et al. (2023) D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022 , 2023.
Shao et al. (2024) Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024.
Silver et al. (2017a) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. , abs/1712.01815, 2017a. URL http://arxiv.org/abs/1712.01815 .
Silver et al. (2017b) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. , 550(7676):354–359, 2017b. 10.1038/NATURE24270 . URL https://doi.org/10.1038/nature24270 .
Snell et al. (2024) C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314 .
Trinh et al. (2024) T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations. , 2024. 10.1038/s41586-023-06747-5 .
Uesato et al. (2022) J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275 , 2022.
Wang et al. (2023) P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: A label-free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935 , 2023.
Wang et al. (2022) X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 , 2022.
Wang et al. (2024) Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. , abs/2406.01574, 2024. URL https://doi.org/10.48550/arXiv.2406.01574 .
Xia et al. (2024) C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint , 2024.
Xin et al. (2024) H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL https://arxiv.org/abs/2408.08152 .
Zhou et al. (2023) J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911 , 2023.

« [译] AI Workflow & AI Agent：架构、模式与工程建议（Anthropic，2024） [译][论文] Transformer paper | Attention Is All You Need（Google，2017） »

ArthurChiao's Blog