A new study, PRISM, provides the most comprehensive empirical analysis to date on the critical but often opaque phase of large language model (LLM) development known as mid-training. The research systematically demonstrates that a targeted, high-quality mid-training phase is not merely beneficial but essential for unlocking significant reasoning capabilities, particularly when followed by reinforcement learning (RL).
What the Researchers Built: A Controlled Study of Mid-Training
The PRISM (Retention and Interaction in Mid-Training) study is a large-scale, controlled experiment designed to isolate the effects of mid-training design choices. The researchers trained seven base models from four different families: Granite, LLaMA, Mistral, and Nemotron-H. This set included both dense Transformer architectures and attention-Mamba hybrids, with parameter counts ranging from 3 billion to 24 billion.
The core intervention was a consistent mid-training phase of approximately 27 billion high-quality tokens, applied to each base model. The study then analyzed the effects of this phase on general capabilities and, crucially, on the subsequent effectiveness of reinforcement learning (RL) fine-tuning.
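To make the scale of that intervention concrete, here is a back-of-the-envelope calculation of how many optimizer steps a ~27B-token phase implies. The batch size and sequence length are illustrative assumptions, not values reported by the study:

```python
# Rough step count for a ~27B-token mid-training phase.
# global_batch and seq_len below are illustrative assumptions,
# not the actual PRISM training configuration.

def midtrain_steps(total_tokens: int, global_batch: int, seq_len: int) -> int:
    """Number of optimizer steps needed to consume `total_tokens`."""
    tokens_per_step = global_batch * seq_len
    return -(-total_tokens // tokens_per_step)  # ceiling division

steps = midtrain_steps(total_tokens=27_000_000_000,
                       global_batch=1024, seq_len=4096)
print(steps)  # ~6.4k steps under these assumed hyperparameters
```

Even under generous batch assumptions, this is a short phase relative to pretraining, which is part of why the paper frames mid-training as a high-leverage intervention.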
Key Results: Consistent Gains and an RL Prerequisite
The results show clear, quantifiable improvements from the mid-training phase alone, and a dramatic shift in RL effectiveness.
Mid-Training Gains:
After mid-training on the 27B-token corpus, models showed consistent improvements across specialized benchmarks:
- Math: +15 to +40 points
- Code: +5 to +12 points
- Science: +6 to +13 points
These gains were achieved while preserving general language modeling performance, indicating a targeted enhancement of reasoning skills.
The RL Multiplier Effect:
The most striking finding concerns the interaction between mid-training and RL. Applying RL directly to most of the original base models was "substantially less effective," resulting in AIME (a math benchmark) scores near zero. In contrast, applying the same RL pipeline after mid-training produced a 3-4x improvement.
The macro-average score across six reasoning benchmarks jumped from under 12 to between 29 and 42. This demonstrates that mid-training creates a necessary precondition for RL to work effectively on reasoning tasks.
| Stage | Math | Code | Science | Macro-avg (6 benchmarks) |
|---|---|---|---|---|
| Base model | - | - | - | ~12 (pre-RL) |
| After mid-training | +15 to +40 pts | +5 to +12 pts | +6 to +13 pts | N/A |
| After full PRISM → RL | N/A | N/A | N/A | 29 to 42 (3-4x improvement) |

How It Works: Data Composition and Mechanistic Changes
The study provides a mechanistic explanation for why mid-training is so pivotal.
Data Composition is Decisive at Mid-Training:
The research found that data composition matters most during mid-training, not during RL. For example, including science data in the mid-training mix unlocked gains of +17 to +28 points on the GPQA-Diamond benchmark during subsequent RL. In contrast, varying the data mixture during RL itself produced differences of less than 2 points. This indicates that the model acquires foundational knowledge and structural priors during mid-training that RL later refines.
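A domain-weighted mixture of this kind can be sketched as a simple sampler. The domain weights below are purely illustrative, not the actual PRISM proportions; the study's point is only that the decisive place to include domains like science is this mid-training mix, not the RL data:

```python
import random

# Hypothetical mid-training mixture. These weights are illustrative
# assumptions, not the proportions used in the PRISM study.
MIX = {"math": 0.40, "code": 0.30, "science": 0.20, "general": 0.10}

def sample_domains(n: int, mix: dict[str, float], seed: int = 0) -> list[str]:
    """Draw a domain label for each of n documents, proportional to mix."""
    rng = random.Random(seed)
    domains, weights = zip(*mix.items())
    return rng.choices(domains, weights=weights, k=n)

batch = sample_domains(1000, MIX)
print({d: batch.count(d) for d in MIX})  # roughly 400/300/200/100
```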
Dense Restructuring vs. Sparse Refinement:
Analysis of model weights revealed two distinct mechanistic regimes:
- Mid-Training densely restructures over 90% of the model's parameters. This is a wholesale, foundational update.
- RL makes sparse, front-loaded refinements to only about 5% of parameters.
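The dense-vs-sparse distinction above can be measured directly by comparing checkpoints. The sketch below counts the fraction of parameters that moved meaningfully between two weight snapshots; the threshold convention and toy data are my assumptions, not the paper's exact methodology:

```python
import numpy as np

def fraction_changed(before: dict, after: dict, rel_tol: float = 1e-4) -> float:
    """Fraction of parameters whose value moved by more than rel_tol
    (relative to each tensor's RMS) between two checkpoints."""
    changed = total = 0
    for name, w0 in before.items():
        w1 = after[name]
        scale = np.sqrt(np.mean(w0 ** 2)) + 1e-12
        changed += np.sum(np.abs(w1 - w0) > rel_tol * scale)
        total += w0.size
    return changed / total

# Toy illustration: a "dense" update perturbs every weight, a "sparse"
# one touches ~5% of them. Real checkpoints would be model state dicts.
rng = np.random.default_rng(0)
w = {"layer": rng.normal(size=(100, 100))}
dense = {"layer": w["layer"] + rng.normal(scale=0.1, size=(100, 100))}
mask = rng.random((100, 100)) < 0.05
sparse = {"layer": w["layer"] + mask * 0.1}
print(fraction_changed(w, dense), fraction_changed(w, sparse))
```

Under this metric, a mid-training-style update registers near 1.0 while an RL-style update registers near 0.05, mirroring the two regimes the study reports.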
Representation analysis using Centered Kernel Alignment (CKA) showed that RL preserves the representational geometry established during mid-training with remarkable fidelity (CKA > 0.998). This means RL operates on top of a stable foundation created by mid-training.
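Linear CKA, the standard form of the metric mentioned above, is compact enough to sketch. The activation matrices below are toy stand-ins for real layer representations before and after an RL-style refinement:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two representation
    matrices of shape (n_examples, n_features). Equals 1.0 when the
    representations match up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 32))                   # toy pre-RL activations
nudged = acts + 0.01 * rng.normal(size=(64, 32))   # small RL-style refinement
print(linear_cka(acts, nudged))  # close to 1.0: geometry preserved
```

A small perturbation leaves CKA near 1.0, which is the signature the study uses to argue that RL preserves the representational geometry laid down in mid-training.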
The researchers note a crucial observation: RL applies essentially identical weight changes regardless of the starting model. However, these changes only lead to performance gains when applied to a mid-trained model. This is consistent with the hypothesis that mid-training places the model in a specific configuration or "basin" in the loss landscape from which RL can effectively navigate to a high-performance solution.
Why It Matters: Practical Guidance for Model Development
The PRISM study moves beyond anecdotal evidence to provide concrete, scalable guidance for building capable reasoning models.
First, it establishes a reliable recipe: a ~27B-token mid-training phase on high-quality, domain-mixed data (especially including science) is a highly effective lever for boosting math, code, and science performance. This "retention-aware" mid-training is shown to be more impactful for final reasoning performance than tweaking the later RL data mix.
Second, it clarifies the distinct roles of training phases. Pretraining builds general knowledge, mid-training restructures the model for specific reasoning domains, and RL performs a sparse, final alignment on that restructured foundation. Attempting RL without the mid-training restructuring phase is largely ineffective for hard reasoning tasks.
For practitioners, this means allocating compute and careful data curation to the mid-training phase is a critical strategic decision. The study provides empirical justification for a three-stage pipeline (pretrain → mid-train → RL) as the most reliable path to state-of-the-art reasoning models.
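The three-stage pipeline can be summarized schematically. The stage functions here are placeholders standing in for real training runs; only the ordering constraint (RL pays off only after mid-training) reflects the study's conclusion:

```python
# Schematic of the pretrain -> mid-train -> RL pipeline. All function
# bodies are hypothetical placeholders, not real training code.

def pretrain(model: dict) -> dict:
    return {**model, "stage": "pretrained"}  # broad general knowledge

def mid_train(model: dict, tokens: int = 27_000_000_000) -> dict:
    return {**model, "stage": "mid-trained", "midtrain_tokens": tokens}

def rl_finetune(model: dict) -> dict:
    # Per the study, RL is largely ineffective without prior mid-training.
    assert model["stage"] == "mid-trained", "RL needs a mid-trained model"
    return {**model, "stage": "rl-tuned"}

model = rl_finetune(mid_train(pretrain({"name": "base"})))
print(model["stage"])  # rl-tuned
```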