A primer paper on how reasoning models improve through post-training.
It shows that better reasoning models depend less on raw data size and more on checkable training evidence.
Reasoning data is NOT simple question-and-answer pairs. The useful part is often the feedback that says why an answer, step, tool action, or full attempt was good or bad.
A prompt and a response tell you what a model said, but not why that answer became learnable, which judge blessed it, which failures were hidden, or whether the skill was already inside the base model.
The core idea is to describe each training example as a record that includes the task, the model’s behavior, the checking signal, and metadata about where it came from.
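As a rough illustration of that record idea, here is a minimal Python sketch. The class name, field names, and sample values are all assumptions made for this example; the paper's actual schema may look different.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ReasoningRecord:
    """One training example. All field names here are illustrative assumptions."""
    task: str                # the prompt or problem statement
    behavior: str            # what the model produced (answer, trace, or actions)
    signal: Dict[str, Any]   # the checking signal, e.g. which checker ran and its verdict
    provenance: Dict[str, Any] = field(default_factory=dict)  # where the example came from


# Hypothetical example values, not taken from the paper.
record = ReasoningRecord(
    task="What is 17 * 6?",
    behavior="17 * 6 = 102",
    signal={"checker": "exact_match", "passed": True},
    provenance={"source": "hypothetical-math-set", "generator": "hypothetical-model"},
)
```

Keeping the signal and provenance next to the prompt and response is what lets you later ask which checker approved an example and where it came from.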
The authors sort reasoning data by how it can be checked, such as exact rule-based checks for math and code, environment checks for agents using tools, and human or model judgments when no exact checker exists.
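One way to picture that sorting is a small sketch, assuming three check categories and a simple domain-to-check mapping. The names `CheckKind`, `check_math_answer`, and `check_kind_for` are hypothetical and not from the paper.

```python
from enum import Enum


class CheckKind(Enum):
    RULE = "rule"              # exact, rule-based checks (math, code)
    ENVIRONMENT = "env"        # environment checks for tool-using agents
    JUDGMENT = "judgment"      # human or model judgment when no exact checker exists


def _normalize(answer: str) -> str:
    """Strip whitespace and a trailing period so formatting does not affect the check."""
    return answer.strip().rstrip(".").replace(" ", "")


def check_math_answer(predicted: str, reference: str) -> bool:
    """A simple rule-based check: normalized final answers must match exactly."""
    return _normalize(predicted) == _normalize(reference)


def check_kind_for(domain: str) -> CheckKind:
    """Hypothetical mapping from a task domain to the kind of check available."""
    if domain in {"math", "code"}:
        return CheckKind.RULE
    if domain == "agent":
        return CheckKind.ENVIRONMENT
    return CheckKind.JUDGMENT
```

The exact-match function is only a stand-in for rule-based checking; real math and code checkers usually need more than string comparison, such as equivalence tests or unit tests.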
They also explain why common assumptions fail: long reasoning traces may be fake, harder examples may be useless for some models, and larger datasets may still miss important coverage.
The key point is that agent data should preserve mess: failed actions, retries, recoveries, state differences, and terminal checks, because that is where learning signal often lives.
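To make the "preserve mess" point concrete, here is a minimal sketch of an agent trajectory that keeps failed steps instead of filtering them out. The `Step` and `Trajectory` classes, their fields, and the file names are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: str       # the tool call or command the agent attempted
    ok: bool          # whether the environment accepted the action
    state_diff: str   # what changed in the environment, if anything


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    terminal_check_passed: bool = False   # result of the final environment check

    def recoveries(self) -> int:
        """Count successful steps that directly follow a failed one."""
        return sum(
            1
            for prev, cur in zip(self.steps, self.steps[1:])
            if not prev.ok and cur.ok
        )


# Hypothetical trajectory: one failed action, then a recovery.
traj = Trajectory(
    steps=[
        Step("open config.yaml", ok=False, state_diff=""),
        Step("list files", ok=True, state_diff=""),
        Step("open settings.yaml", ok=True, state_diff="file handle opened"),
    ],
    terminal_check_passed=True,
)
```

If the failed first step were dropped during cleaning, the recovery count would fall to zero and the record would no longer show how the agent got unstuck.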
----
Link – arxiv.org/abs/2606.02113
Title: "A Primer in Post-Training Reasoning Data: What They Know About How It Works"


