A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer>tags respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. User: {prompt}. Assistant:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer>tags respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. User: What is 2 + 3 * 4?. Assistant:
We then expect the model to generate output that follows the template, for example:
<think> Order of operations: multiply before add. 3 * 4 = 12. 2 + 12 = 14 </think> <answer> 14 </answer>
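The template filling and the format check on the output can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names TEMPLATE, build_prompt, and parse_output are hypothetical, and the regular expression is just one reasonable way to test that an output follows the <think>/<answer> layout.

```python
import re

# Template text copied verbatim from the article; {prompt} is the placeholder.
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer>tags respectively, i.e., <think> reasoning process here "
    "</think><answer> answer here </answer>. User: {prompt}. Assistant:"
)

# Hypothetical format check: one <think> block followed by one <answer> block.
OUTPUT_RE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def build_prompt(question: str) -> str:
    """Insert the user's question into the template."""
    return TEMPLATE.replace("{prompt}", question)


def parse_output(text: str):
    """Return (reasoning, answer) if the output follows the template, else None."""
    match = OUTPUT_RE.match(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


if __name__ == "__main__":
    print(build_prompt("What is 2 + 3 * 4?"))
    sample = (
        "<think> Order of operations: multiply before add. "
        "3 * 4 = 12. 2 + 12 = 14 </think> <answer> 14 </answer>"
    )
    print(parse_output(sample))
```

An output with missing or misordered tags makes parse_output return None, which is one simple way to flag responses that ignore the template.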
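To fix the problems with R1 Zero and get DeepSeek's reasoning right, the researchers collected cold-start data and added supervised fine-tuning. You can think of this as giving the model a solid reasoning foundation before the really intensive reinforcement learning training. Essentially, they wanted to teach DeepSeek-V3 Base what good reasoning looks like and how to present it clearly.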
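Few-shot Prompting with Long CoT
They gave DeepSeek-V3 Base a few example problems together with very detailed step-by-step solutions, known as chain of thought (CoT). The idea is for the model to learn from the examples and start imitating this step-by-step reasoning style. To see how this example-based learning works, take the sample problem What is 2 + 3 * 4?. They might show a prompt like this: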
Problem Examples with Solutions:
Problem: What's the square root of 9 plus 5? Solution: | special_token | First, find the square root of 9, which is 3. Then, add 5 to 3. 3 + 5 equals 8. | special_token | Summary: The answer is 8.
Problem: Train travels at 60 mph for 2 hours, how far? Solution: | special_token | Use the formula: Distance = Speed times Time. Speed is 60 mph, Time is 2 hours. Distance = 60 * 2 = 120 miles. | special_token | Summary: Train travels 120 miles.
Problem: What is 2 + 3 * 4? Solution:
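| special_token | is just a marker that separates the reasoning steps from the Summary, so the model can clearly learn the structure. After seeing these examples, the model should learn to answer in a similar format, for example for What is 2 + 3 * 4?: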
| special_token | Following order of operations (PEMDAS/BODMAS), do multiplication before addition. So, first calculate 3 * 4 = 12. Then, add 2 to 12. 2 + 12 = 14. | special_token | Summary: The answer is 14.
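Assembling a few-shot prompt like the one above is mostly string formatting. The sketch below is illustrative only: SPECIAL, format_example, and build_few_shot_prompt are hypothetical names, and the literal "| special_token |" string stands in for whatever delimiter the real pipeline used.

```python
# Placeholder delimiter as written in the article; the real token is not specified.
SPECIAL = "| special_token |"


def format_example(problem: str, reasoning: str, summary: str) -> str:
    """Render one solved example in the Problem / Solution / Summary layout."""
    return (
        f"Problem: {problem} Solution: {SPECIAL} {reasoning} "
        f"{SPECIAL} Summary: {summary}"
    )


def build_few_shot_prompt(examples, new_problem: str) -> str:
    """Concatenate solved examples, then leave the new problem's solution open."""
    lines = ["Problem Examples with Solutions:"]
    lines += [format_example(*example) for example in examples]
    lines.append(f"Problem: {new_problem} Solution:")
    return "\n".join(lines)


EXAMPLES = [
    (
        "What's the square root of 9 plus 5?",
        "First, find the square root of 9, which is 3. Then, add 5 to 3. "
        "3 + 5 equals 8.",
        "The answer is 8.",
    ),
    (
        "Train travels at 60 mph for 2 hours, how far?",
        "Use the formula: Distance = Speed times Time. Speed is 60 mph, "
        "Time is 2 hours. Distance = 60 * 2 = 120 miles.",
        "Train travels 120 miles.",
    ),
]

if __name__ == "__main__":
    print(build_few_shot_prompt(EXAMPLES, "What is 2 + 3 * 4?"))
```

Because the final line ends at "Solution:", the model is nudged to continue in the same delimiter-separated layout as the solved examples.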
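Direct Prompting
Another way they collected data was to prompt the model directly, asking it not only to solve the problem but also to show its reasoning explicitly, step by step, and then double-check its answer. This is meant to encourage more careful and deliberate problem solving. For What is 2 + 3 * 4? the prompt might be: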
Problem: Solve this, show reasoning step-by-step, and verify: What is 2 + 3 * 4?
They expect the output to contain both the reasoning steps and a verification part:
| special_token | Reasoning: To solve 2 + 3 * 4, I need to use order of operations. Multiplication comes before addition. Step 1: Calculate 3 * 4 = 12. Step 2: Add 2 to the result from step 1: 2 + 12 = 14. Verification: Checking order of operations again, yes, multiplication is before addition. Calculation looks right. | special_token | Summary: The answer is 14.
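A collected response in this layout can be split back into its parts to check that it contains reasoning, a verification step, and a summary. The following is a rough sketch under assumptions: split_response and the dictionary keys are hypothetical, and looking for the literal strings "Summary:" and "Verification:" is only a simple heuristic.

```python
# Placeholder delimiter as written in the article.
SPECIAL = "| special_token |"


def split_response(text: str):
    """Split a delimited response into reasoning and summary, or return None.

    The expected shape is:
        <SPECIAL> reasoning text <SPECIAL> Summary: summary text
    """
    parts = [part.strip() for part in text.split(SPECIAL)]
    # Splitting a well-formed response yields ["", reasoning, "Summary: ..."].
    if len(parts) != 3 or parts[0] != "" or not parts[2].startswith("Summary:"):
        return None
    return {
        "reasoning": parts[1],
        "summary": parts[2][len("Summary:"):].strip(),
        # Heuristic check that the model included a verification step.
        "has_verification": "Verification:" in parts[1],
    }


if __name__ == "__main__":
    response = (
        "| special_token | Reasoning: To solve 2 + 3 * 4, I need to use order "
        "of operations. Step 1: Calculate 3 * 4 = 12. Step 2: 2 + 12 = 14. "
        "Verification: Calculation looks right. "
        "| special_token | Summary: The answer is 14."
    )
    print(split_response(response))
```

Responses that return None or lack a verification step could be discarded or re-prompted before they enter the cold-start dataset.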
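Post-Processing Refinement
They even used outputs from the already trained R1 Zero model. Although R1 Zero had its problems, it could still do some reasoning. So they took R1 Zero outputs and had human annotators refine them, making them clearer and better structured and correcting any errors. For example, a messy R1 Zero output might be: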
<think> ummm... multiply 3 and 4... get 12... then add 2...</think> <answer> 14 </answer>
Human annotators then refine it so that it is clearer and better formatted:
| special_token | Reasoning: To solve this, we use order of operations, doing multiplication before addition. Step 1: Multiply 3 by 4, which is 12. Step 2: Add 2 to the result: 2 + 12 = 14. | special_token | Summary: The answer is 14.
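The rewriting itself is done by people, but the mechanical part, moving an R1 Zero output into the delimiter-based layout so an annotator has a draft to edit, could look roughly like this. This is an assumed helper rather than anything described in the source: to_annotation_draft is a hypothetical name, and the "The answer is ..." wording of the summary simply mirrors the examples above.

```python
import re

# Placeholder delimiter as written in the article.
SPECIAL = "| special_token |"

# Matches the <think>...</think> <answer>...</answer> layout used by R1 Zero.
R1_ZERO_RE = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def to_annotation_draft(r1_zero_output: str):
    """Convert an R1 Zero output into a draft in the cold-start layout.

    Only the format changes here; the reasoning text is left untouched
    for a human annotator to clean up. Returns None if the tags are missing.
    """
    match = R1_ZERO_RE.search(r1_zero_output)
    if match is None:
        return None
    reasoning = match.group(1).strip()
    answer = match.group(2).strip()
    return (
        f"{SPECIAL} Reasoning: {reasoning} "
        f"{SPECIAL} Summary: The answer is {answer}."
    )


if __name__ == "__main__":
    messy = (
        "<think> ummm... multiply 3 and 4... get 12... then add 2..."
        "</think> <answer> 14 </answer>"
    )
    print(to_annotation_draft(messy))
```

The draft still contains the messy reasoning; the human step described above is what turns it into the clean, step-by-step version.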