ReDiPrune: Training-Free Token Pruning Before Projection Boosts MLLM Efficiency 6x, Gains 2% Accuracy

Researchers propose ReDiPrune, a plug-and-play method that prunes visual tokens before the vision-language projector in multimodal LLMs. On EgoSchema with LLaVA-NeXT-Video-7B, it achieves a +2.0% accuracy gain while reducing computation by over 6× in TFLOPs.

GAla Smith & AI Research Desk
Source: arxiv.org via arxiv_cv

March 25, 2026 — A new paper on arXiv introduces ReDiPrune (Relevance-Diversity Pre-Projection Token Pruning), a training-free method designed to dramatically reduce the computational cost of multimodal large language models (MLLMs) while, counterintuitively, often improving their accuracy. The core innovation is applying token pruning before the vision-language projector, where visual features remain rich and discriminative, rather than after projection where information is already compressed.

What the Researchers Built

Multimodal LLMs like LLaVA, GPT-4V, and Gemini process images and videos by first encoding them into a sequence of visual tokens via a vision encoder (e.g., CLIP's ViT). These tokens are then projected into the language model's embedding space through a trainable vision-language projector (often a simple MLP). The Transformer-based LLM then processes this lengthy sequence of visual tokens alongside text tokens, which is computationally expensive—often the dominant cost.

Existing token pruning methods typically operate after the projector, on the already-projected embeddings. ReDiPrune shifts this operation upstream. It is inserted between the vision encoder and the projector, selecting a small subset of the original visual tokens to send forward. This allows pruning to work with the full, high-dimensional visual features before any compression loss occurs.

How It Works

The pruning algorithm is a lightweight, rule-based scorer built on two criteria:

Figure 3: Qualitative examples from the TGIF dataset using Video-LLaVA-7B.

  1. Text-Conditioned Relevance: The method computes a relevance score between each visual token and the textual query. This is done by leveraging the cross-modal alignment already learned by the vision encoder and text encoder (from pre-training like CLIP). A simple cosine similarity between the visual token embedding and the text query embedding (from the same encoder family) provides a relevance measure.
  2. Max-Min Diversity: To avoid selecting redundant tokens that convey similar information, the method enforces diversity. It uses a "max-min" criterion: after selecting the most relevant token, it chooses the next token that is both relevant and maximally distant (in the feature space) from the already selected tokens. This ensures the final set covers diverse spatial and semantic aspects of the image or video frame.

The final token score is a weighted combination of relevance and diversity. The top-scoring tokens are retained (e.g., 15% of the original set), and only these are passed to the vision-language projector and the subsequent LLM.
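
The selection procedure described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the function name, the greedy loop, and the `alpha` weighting are assumptions based on the paper's description of cosine relevance plus max-min diversity.

```python
import numpy as np

def redi_prune_sketch(vis_tokens, text_query, keep_ratio=0.15, alpha=0.5):
    """Illustrative relevance-diversity token selection (not the official code).

    vis_tokens: (N, D) visual token features from the vision encoder.
    text_query: (D,) text query embedding from the matching text encoder.
    """
    # 1) Text-conditioned relevance: cosine similarity to the query.
    v = vis_tokens / np.linalg.norm(vis_tokens, axis=1, keepdims=True)
    q = text_query / np.linalg.norm(text_query)
    relevance = v @ q                                # (N,)

    k = max(1, int(keep_ratio * len(vis_tokens)))
    selected = [int(np.argmax(relevance))]           # seed with the most relevant token

    # 2) Max-min diversity: greedily add tokens scoring well on a weighted
    #    mix of relevance and distance to the already-selected set.
    for _ in range(k - 1):
        sims = v @ v[selected].T                     # (N, |selected|)
        min_dist = 1.0 - sims.max(axis=1)            # distance to nearest selected token
        score = alpha * relevance + (1 - alpha) * min_dist
        score[selected] = -np.inf                    # never re-pick a token
        selected.append(int(np.argmax(score)))

    return vis_tokens[selected]                      # pass only these to the projector
```

Because the scorer is a handful of matrix operations over features the pipeline already computes, its overhead is negligible next to the LLM forward pass it shortens.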

Crucially, ReDiPrune requires no retraining or fine-tuning. It is fully plug-and-play, compatible with any vision encoder/LLM pair that uses a separate projector. The authors note it can be seamlessly inserted into existing MLLM pipelines with a few lines of code.
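
A minimal sketch of where such a pre-projection hook slots into an MLLM forward pass. All component names (`vision_encoder`, `projector`, `llm`, `prune_fn`) are illustrative stand-ins, not the paper's actual API:

```python
# Hypothetical MLLM forward pass with an optional pre-projection pruning hook.
def mllm_forward(image, prompt, vision_encoder, projector, llm, prune_fn=None):
    vis_tokens = vision_encoder(image)            # (N, D) rich pre-projection features
    if prune_fn is not None:
        # ReDiPrune-style step: drop tokens BEFORE the projector sees them.
        vis_tokens = prune_fn(vis_tokens, prompt)
    vis_embeds = projector(vis_tokens)            # map the survivors into the LLM space
    return llm(vis_embeds, prompt)
```

Because the hook only touches the data flowing between two frozen components, no weights change and no retraining is needed.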

Key Results

The method was evaluated across nine benchmarks: four video understanding tasks (EgoSchema, NExT-QA, IntentQA, Star) and five image understanding tasks (VQAv2, GQA, VizWiz, TextVQA, ScienceQA-IMG).

Figure 2: Overview of ReDiPrune: given a prompt p, a normalized weighted query vector q̂ is built (§3.2).

The standout result is on the EgoSchema benchmark using the LLaVA-NeXT-Video-7B model. By retaining only 15% of visual tokens, ReDiPrune achieved an absolute accuracy gain of +2.0% while reducing computational cost (measured in TFLOPs) by more than 6×.

Benchmark (Model)                    Tokens kept   Accuracy gain   FLOPs reduction
EgoSchema (LLaVA-NeXT-Video-7B)      15%           +2.0%           >6×
NExT-QA (LLaVA-NeXT-Video-7B)        20%           +0.9%           >5×
VQAv2 (LLaVA-1.5-7B)                 20%           +0.5%           >5×
ScienceQA-IMG (LLaVA-1.5-13B)        20%           +0.3%           >5×

The results show a consistent trend: aggressive pruning (keeping 10-20% of tokens) not only saves substantial compute but frequently improves accuracy over using all tokens. The authors hypothesize this is due to the removal of noisy, redundant, or irrelevant tokens, which acts as a form of attention focusing for the LLM.

ReDiPrune also outperformed post-projection pruning baselines and other adaptive token selection methods across nearly all benchmarks, demonstrating the advantage of operating on the richer pre-projection features.

Why It Matters

Efficiency is a critical bottleneck for deploying multimodal LLMs, especially in video contexts where token counts explode with frame count. Methods that reduce FLOPs often come at a cost to accuracy. ReDiPrune is notable for demonstrating that a well-designed, training-free pruning step can improve the accuracy-efficiency Pareto frontier—doing more with less.

Figure 1: Comparison of pruning strategies in MLLMs: (a) post-projection pruning selects diverse tokens but ignores the text query.

The work challenges the default assumption that feeding the LLM more visual information is always better. Instead, it suggests that selectivity is key. A small set of highly relevant, diverse tokens can provide a better signal than the full, noisy sequence.

For practitioners, the plug-and-play nature is a significant advantage. Integrating ReDiPrune into an existing MLLM pipeline could yield immediate reductions in inference cost and latency for visual question answering, video understanding, and other multimodal tasks, with no additional training overhead.

The code is publicly available on GitHub.

gentic.news Analysis

This paper arrives amidst a concentrated wave of efficiency research for large models. Just this week, arXiv has seen a flurry of activity, including studies on RAG chunking strategies and RL for robot planning, reflecting the community's intense focus on making AI systems more practical and deployable. The trend of arXiv hosting pivotal efficiency research is clear: it has been the source for 46 articles this week alone, cementing its role as the primary conduit for rapid dissemination of such ideas.

ReDiPrune's approach is philosophically aligned with other recent work we've covered that emphasizes intelligent data selection over brute-force scaling. For instance, our coverage of MDKeyChunker highlighted the importance of structure-aware chunking for RAG performance. Both works identify a critical bottleneck—excessive token count in MLLMs, excessive chunk count in RAG—and apply a selectivity filter to improve downstream performance while reducing compute. This suggests a broader paradigm shift: instead of merely building models to handle more data, researchers are increasingly building smarter filters to give models better data.

Furthermore, the paper's release follows closely on the heels of GitHub's launch of Spec-Kit, an open-source AI toolkit for generating specs and code. While different in application, both developments share a theme of providing open-source, plug-and-play tooling to improve AI development workflows—one at the system design stage (Spec-Kit), the other at the inference optimization stage (ReDiPrune). The availability of ReDiPrune's code on GitHub (which has been mentioned in 21 articles this week) ensures it can be rapidly tested and integrated by the community, accelerating iteration and potential adoption.

The result that pruning can improve accuracy is particularly compelling and warrants scrutiny. It implies that current MLLM projectors and LLMs may be easily distracted by irrelevant visual tokens. If this finding holds broadly, it could influence the design of future vision-language architectures, perhaps moving towards more integrated, sparse token selection mechanisms from the outset, rather than relying on a dense process followed by pruning.

Frequently Asked Questions

What is token pruning in multimodal LLMs?

Token pruning is a technique to reduce the computational load of multimodal LLMs by removing a portion of the visual tokens created from an image or video before they are fully processed by the language model's Transformer. The goal is to maintain task performance (like answering questions about the image) while using significantly fewer computational resources (FLOPs).
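
A back-of-the-envelope cost model shows why the savings are so large. The constants and model sizes below are generic assumptions for a 7B-class transformer, not the paper's accounting: linear layers scale roughly as O(n·d²) and self-attention as O(n²·d) in the sequence length n.

```python
# Rough per-forward-pass FLOPs for the LLM only (assumed generic sizes).
def llm_flops_estimate(n_vis, n_txt, d=4096, layers=32):
    n = n_vis + n_txt
    linear = layers * 12 * n * d * d   # QKV/output/MLP projections (rough constant)
    attn = layers * 2 * n * n * d      # attention scores + value mixing
    return linear + attn

full = llm_flops_estimate(n_vis=2880, n_txt=64)               # e.g. many video frames
pruned = llm_flops_estimate(n_vis=int(0.15 * 2880), n_txt=64) # keep 15% of tokens
print(f"{full / pruned:.1f}x fewer FLOPs")                    # roughly 6-7x here
```

Because the visual tokens dominate the sequence, keeping 15% of them cuts LLM compute by roughly the same factor reported in the paper's headline result.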

How is ReDiPrune different from other token pruning methods?

Most existing token pruning methods operate after the vision-language projector, which compresses visual features into the LLM's embedding space. ReDiPrune prunes tokens before this projection, while the visual features are still in their original, high-dimensional, and information-rich form from the vision encoder (like CLIP). This allows it to make more informed pruning decisions based on fine-grained spatial and semantic information that is lost after projection.

Does using ReDiPrune require retraining my multimodal model?

No. A key feature of ReDiPrune is that it is training-free and plug-and-play. It does not require any fine-tuning or retraining of the vision encoder, projector, or language model. You can insert the ReDiPrune module between an existing vision encoder and projector in your pipeline, and it will immediately work.

On which models and tasks has ReDiPrune been tested?

The paper evaluates ReDiPrune primarily on the LLaVA family of models (LLaVA-1.5 and LLaVA-NeXT-Video) across nine benchmarks. These include four video understanding tasks (EgoSchema, NExT-QA, IntentQA, Star) and five image understanding tasks (VQAv2, GQA, VizWiz, TextVQA, ScienceQA-IMG). The method is architecture-agnostic and should be applicable to any MLLM that uses a separate vision encoder and projector, such as models based on OpenFlamingo or similar frameworks.

AI Analysis

The ReDiPrune paper is a technically sound contribution to the pressing problem of MLLM efficiency. Its most provocative claim, that aggressive pruning can improve accuracy, is well supported by the benchmark suite. This isn't just a minor efficiency hack; it is an empirical challenge to the design logic of current pipelines. If the vision encoder produces many redundant or noisy tokens, the projector and LLM waste parameters and attention on them. Pruning upstream acts as a forced bottleneck that lets only the most salient signal through, which the LLM apparently uses more effectively.

From an engineering perspective, the choice of a simple, rule-based scorer (cosine similarity plus max-min diversity) is clever. It leverages the existing cross-modal alignment from CLIP-style pre-training, avoiding a learned, parametric pruning network that would add complexity and require tuning. This simplicity is key to the plug-and-play promise. Practitioners should note, however, that its effectiveness is inherently tied to the quality of the vision-text alignment in the base encoders; it may work less well for domains or modalities far from the encoder's pre-training distribution.

The timing and context are significant. This work is part of a clear trend we've been tracking: the move from simply scaling models to making them smarter and more efficient with the data they have. It complements other recent arXiv highlights, like the research on optimal RAG chunking, which tackles an analogous problem in retrieval systems. The community is systematically attacking the 'wasteful data processing' problem across multiple AI subfields. For teams deploying MLLMs in cost-sensitive or latency-sensitive applications (e.g., real-time video analysis, edge devices), ReDiPrune offers an immediately testable solution that could yield substantial reductions in inference cost.