Not All Correct Answers Are Equal: Why Your Distillation Source Matters
Abstract
Distilling reasoning data from advanced language models improves student model performance across various benchmarks.
Distillation has emerged as a practical and effective approach to enhancing the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that the AM-Thinking-v1-distilled data exhibits greater token-length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The AM-based student consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open, high-performing, reasoning-oriented language models. Both datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
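For readers who want to probe the distributional comparison themselves, the sketch below shows one way to sample the released datasets and estimate token-length spread. It is a minimal illustration, not the authors' analysis code: the tokenizer choice, the sample size, and the schema-agnostic serialization of each record are all assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

# Any tokenizer gives comparable relative statistics; this choice is an assumption.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def length_stats(repo_id, sample_size=1000):
    """Rough token-length statistics over a streamed sample of a dataset."""
    ds = load_dataset(repo_id, split="train", streaming=True)
    lengths = []
    for i, example in enumerate(ds):
        if i >= sample_size:
            break
        # The exact column names are not assumed here; serializing the whole
        # record is a crude but schema-agnostic proxy for response length.
        lengths.append(len(tokenizer.encode(str(example))))
    arr = np.array(lengths)
    return {"mean": float(arr.mean()), "std": float(arr.std()),
            "p05": float(np.percentile(arr, 5)), "p95": float(np.percentile(arr, 95))}

for repo in ("a-m-team/AM-Thinking-v1-Distilled", "a-m-team/AM-Qwen3-Distilled"):
    print(repo, length_stats(repo))
```

A wider spread (larger std and p05-p95 gap) on the AM-Thinking-v1 data would be consistent with the token-length diversity the abstract describes.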
Community
We conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art language models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) over a shared corpus of 1.89 million queries, resulting in three parallel distilled datasets. Among them, the AM-Thinking-v1-distilled data exhibits greater token-length diversity and lower perplexity. Student models trained on each dataset are evaluated on multiple reasoning benchmarks, where the AM-based student consistently achieves the best performance: 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench. Notably, it also demonstrates adaptive response generation, producing longer outputs for harder problems and shorter ones for simpler tasks. These results highlight the importance of high-quality, verified reasoning traces for enhancing model performance. To support future research, we release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets.
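As a companion to the perplexity comparison mentioned above, here is a minimal sketch of how per-example perplexity can be scored under a reference causal language model. The reference model and the example text are placeholders, not the setup used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in reference model for illustration only; the paper does not specify
# which model was used to compute perplexity.
model_name = "Qwen/Qwen2.5-1.5B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def perplexity(text, max_length=4096):
    """Perplexity of `text` under the reference model: exp of the mean per-token NLL."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=max_length).input_ids
    loss = model(input_ids=ids, labels=ids).loss  # mean cross-entropy over shifted tokens
    return float(torch.exp(loss))

print(perplexity("Let x = 3. Then 2x + 1 = 7, so the answer is 7."))
```

Averaging this score over a sample of responses from each distilled dataset gives the kind of dataset-level perplexity comparison the summary refers to.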
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training (2025)
- 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training (2025)
- AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale (2025)
- OpenCodeReasoning: Advancing Data Distillation for Competitive Coding (2025)
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking (2025)
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model (2025)
- Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math (2025)