PRISM Study: Mid-Training on 27B Tokens Boosts Math Scores by +15 to +40 Points, Enables Effective RL

A comprehensive study shows that mid-training on 27B high-quality tokens consistently improves reasoning in LLMs. This "retention-aware" phase densely restructures over 90% of model weights, creating a configuration in which RL can succeed.


A new study, PRISM, provides the most comprehensive empirical analysis to date on the critical but often opaque phase of large language model (LLM) development known as mid-training. The research systematically demonstrates that a targeted, high-quality mid-training phase is not merely beneficial but essential for unlocking significant reasoning capabilities, particularly when followed by reinforcement learning (RL).

What the Researchers Built: A Controlled Study of Mid-Training

The PRISM (Retention and Interaction in Mid-Training) study is a large-scale, controlled experiment designed to isolate the effects of mid-training design choices. The researchers trained seven base models from four different families: Granite, LLaMA, Mistral, and Nemotron-H. This set included both dense Transformer architectures and attention-Mamba hybrids, with parameter counts ranging from 3 billion to 24 billion.

The core intervention was a consistent mid-training phase of approximately 27 billion high-quality tokens, applied to each base model. The study then analyzed the effects of this phase on general capabilities and, crucially, on the subsequent effectiveness of reinforcement learning (RL) fine-tuning.
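
To get a feel for the scale, the 27B-token budget translates into a concrete number of optimizer steps once a batch size is fixed. A minimal sketch, assuming a hypothetical global batch of 1,024 sequences of 4,096 tokens each (the article does not report the study's actual settings):

```python
def midtrain_steps(token_budget: int, global_batch: int, seq_len: int) -> int:
    """Optimizer steps needed to consume a fixed token budget."""
    tokens_per_step = global_batch * seq_len
    return token_budget // tokens_per_step

# Hypothetical settings: 27B tokens, 1024 sequences of 4096 tokens per step.
steps = midtrain_steps(27_000_000_000, global_batch=1024, seq_len=4096)
```

Under these assumed settings the phase lasts only a few thousand steps, a small fraction of pretraining, yet per the study it is decisive for reasoning.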

Key Results: Consistent Gains and an RL Prerequisite

The results show clear, quantifiable improvements from the mid-training phase alone, and a dramatic shift in RL effectiveness.

Mid-Training Gains:
After mid-training on the 27B-token corpus, models showed consistent improvements across specialized benchmarks:

  • Math: +15 to +40 points
  • Code: +5 to +12 points
  • Science: +6 to +13 points

These gains were achieved while preserving general language modeling performance, indicating a targeted enhancement of reasoning skills.

The RL Multiplier Effect:
The most striking finding concerns the interaction between mid-training and RL. Applying RL directly to most of the original base models was "substantially less effective," resulting in AIME (a math benchmark) scores near zero. In contrast, applying the same RL pipeline after mid-training produced a 3-4x improvement.

The macro-average score across six reasoning benchmarks jumped from under 12 to between 29 and 42. This demonstrates that mid-training creates a necessary precondition for RL to work effectively on reasoning tasks.
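
The macro-average here is simply the unweighted mean over the six benchmarks. A small sketch with illustrative per-benchmark scores (not the paper's actual values, which are not broken out in this article):

```python
def macro_average(scores):
    """Unweighted mean across benchmarks; each benchmark counts equally."""
    return sum(scores) / len(scores)

# Illustrative six-benchmark scores, chosen only to match the reported ranges:
base_model = [0, 14, 18, 10, 12, 15]        # macro-average under 12
after_prism_rl = [30, 35, 42, 28, 33, 40]   # macro-average within 29 to 42
```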

Stage               | Math        | Code       | Science     | Macro-average (6 benchmarks)
Base model          | -           | -          | -           | ~12 (pre-RL)
After mid-training  | +15 to +40  | +5 to +12  | +6 to +13   | N/A
After full PRISM→RL | N/A         | N/A        | N/A         | 29 to 42 (3-4x improvement)

How It Works: Data Composition and Mechanistic Changes

The study provides a mechanistic explanation for why mid-training is so pivotal.

Data Composition is Decisive at Mid-Training:
The research found that data composition matters most during mid-training, not during RL. For example, including science data in the mid-training mix unlocked gains of +17 to +28 points on the GPQA-Diamond benchmark during subsequent RL. In contrast, varying the data mixture during RL itself produced differences of less than 2 points. This indicates that the model acquires foundational knowledge and structural priors during mid-training that RL later refines.
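
In practice, a mid-training mixture is realized by sampling each training document's domain according to fixed weights. A minimal sketch; the weights below are an assumption for illustration, since the study's exact proportions are not given here:

```python
import random

# Hypothetical mixture weights; the study stresses that including science
# data is decisive, but these exact proportions are assumed.
MIX = {"math": 0.40, "code": 0.25, "science": 0.25, "general": 0.10}

def sample_domain(rng: random.Random) -> str:
    """Draw the domain of the next training document from the mixture."""
    r = rng.random()
    cumulative = 0.0
    for domain, weight in MIX.items():
        cumulative += weight
        if r < cumulative:
            return domain
    return domain  # guard against floating-point round-off
```

Varying MIX at this stage (for example, dropping "science") is exactly the kind of intervention the study found to swing downstream GPQA-Diamond results by +17 to +28 points, whereas the same variation during RL barely matters.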

Dense Restructuring vs. Sparse Refinement:
Analysis of model weights revealed two distinct mechanistic regimes:

  1. Mid-training densely restructures over 90% of the model's parameters. This is a wholesale, foundational update.
  2. RL makes sparse, front-loaded refinements to only about 5% of parameters.
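
The density of an update can be estimated directly from weight snapshots: count the fraction of parameters that moved by more than a tolerance. A self-contained sketch with simulated weights (pure illustration, not the paper's measurement code):

```python
import random

def update_density(before, after, tol=1e-6):
    """Fraction of parameters whose value changed by more than tol."""
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)

rng = random.Random(0)
w = [rng.gauss(0, 1) for _ in range(50_000)]
# Mid-training regime: a small update applied to essentially every weight.
w_dense = [x + rng.gauss(0, 0.01) for x in w]
# RL regime: comparable update magnitude, but on only ~5% of weights.
w_sparse = [x + (rng.gauss(0, 0.01) if rng.random() < 0.05 else 0.0) for x in w]
```

Here update_density(w, w_dense) comes out near 1.0 and update_density(w, w_sparse) near 0.05, mirroring the two regimes described above.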

Representation analysis using Centered Kernel Alignment (CKA) showed that RL preserves the representational geometry established during mid-training with remarkable fidelity (CKA > 0.998). This means RL operates on top of a stable foundation created by mid-training.
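
Linear CKA, the similarity measure used here, compares two activation matrices computed on the same inputs and is invariant to rotation and isotropic scaling. A short sketch with synthetic activations (assumes NumPy; the perturbation scale is illustrative, not the paper's measured drift):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape (n_examples, n_features)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
mid = rng.normal(size=(256, 64))              # "mid-trained" representations
rl = mid + 0.01 * rng.normal(size=mid.shape)  # RL adds only a tiny refinement
unrelated = rng.normal(size=(256, 64))        # independent representations
```

In this toy setup linear_cka(mid, rl) lands very close to 1 while linear_cka(mid, unrelated) is far lower; the paper's reported CKA > 0.998 implies RL-induced representational drift on the order of the tiny perturbation here.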

The researchers note a crucial observation: RL applies essentially identical weight changes regardless of the starting model. However, these changes only lead to performance gains when applied to a mid-trained model. This is consistent with the hypothesis that mid-training places the model in a specific configuration or "basin" in the loss landscape from which RL can effectively navigate to a high-performance solution.
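
That observation is directly checkable: flatten the two weight deltas (RL-minus-base and RL-minus-mid-trained) and compare their directions. A toy sketch of the comparison, with simulated deltas standing in for real checkpoints:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two flattened weight-delta vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

rng = random.Random(1)
# Simulate "essentially identical" RL updates from two starting points:
shared = [rng.gauss(0, 1) for _ in range(10_000)]
delta_from_base = [s + rng.gauss(0, 0.05) for s in shared]
delta_from_mid = [s + rng.gauss(0, 0.05) for s in shared]
```

Here cosine(delta_from_base, delta_from_mid) comes out near 1 by construction; the force of the paper's finding is that near-identical deltas produce gains only when applied to the mid-trained starting point.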

Why It Matters: Practical Guidance for Model Development

The PRISM study moves beyond anecdotal evidence to provide concrete, scalable guidance for building capable reasoning models.

First, it establishes a reliable recipe: a ~27B-token mid-training phase on high-quality, domain-mixed data (especially including science) is a highly effective lever for boosting math, code, and science performance. This "retention-aware" mid-training is shown to be more impactful for final reasoning performance than tweaking the later RL data mix.

Second, it clarifies the distinct roles of training phases. Pretraining builds general knowledge, mid-training restructures the model for specific reasoning domains, and RL performs a sparse, final alignment on that restructured foundation. Attempting RL without the mid-training restructuring phase is largely ineffective for hard reasoning tasks.

For practitioners, this means allocating compute and careful data curation to the mid-training phase is a critical strategic decision. The study provides empirical justification for a three-stage pipeline (pretrain → mid-train → RL) as the most reliable path to state-of-the-art reasoning models.

AI Analysis

The PRISM study is significant because it provides rigorous, multi-model evidence for a practice that has become folklore in LLM development: the importance of a high-quality "mid-training" or "continued pretraining" phase. By controlling for model family, architecture, and scale, the authors isolate the effect of this phase, showing it is not an artifact of a particular model but a general principle.

The mechanistic findings are particularly insightful. The dichotomy between mid-training's dense, global restructuring and RL's sparse, local refinement offers a clear conceptual model for what each stage does. The >0.998 CKA similarity between mid-trained and RL-tuned models is a striking result: it suggests that successful RL fine-tuning for reasoning is not about radically changing the model's representations, but about making precise, small adjustments to a foundation that is already well-structured for the task. This helps explain why RL fails on base models: the necessary foundation isn't there.

Practitioners should note the emphasis on data composition during mid-training, not RL. This flips a common intuition: many teams spend significant effort curating RLHF or DPO preference datasets, yet this research suggests that for boosting performance on hard benchmarks like GPQA or AIME, investing in the quality and domain mix of the mid-training data (e.g., ensuring strong science content) is a higher-return activity. The study provides a quantitative basis for pipeline design decisions that have previously been based on trial and error.
Original source: arxiv.org
