Bingxin Xu¹  Yuzhang Shang¹  Yunhao Ge²  Qian Lou³  Yan Yan¹
¹Illinois Institute of Technology  ²University of Southern California  ³University of Central Florida
victoriaxu@gmail.com
Abstract
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in visual-language tasks but face significant deployment challenges due to their high computational demands. While recent token reduction methods show promise for accelerating LMMs, they typically require extensive retraining or fine-tuning, making them impractical for many state-of-the-art models, especially those with proprietary training data. We propose freePruner, a training-free token reduction approach that can be directly applied to any open-source LMM without additional training. Unlike existing methods that rely heavily on token merging operations, freePruner employs a two-stage token selection strategy: (1) identifying pivotal tokens that capture high-level semantic information using our designed contribution degree metric, and (2) selecting complementary tokens that preserve essential low-level visual details through attention pattern analysis. Extensive experiments demonstrate that freePruner achieves 2× acceleration while maintaining comparable performance across mainstream visual question-answering benchmarks in the training-free setting. Moreover, freePruner is orthogonal to and can be combined with other post-training acceleration techniques, such as post-training quantization, providing a practical solution for efficient LMM deployment.
1 Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation tasks[42, 54, 23, 55]. Building upon these advances, Large Multimodal Models (LMMs) have emerged as powerful tools that integrate visual understanding with language processing[37, 35, 70]. By connecting visual encoders with LLMs, these models tackle complex visual tasks such as visual question-answering, image captioning, and visual reasoning[36, 42, 54, 4].
Despite their impressive capabilities, the deployment of LMMs faces significant challenges due to their high computational demands. This challenge is further amplified in real-world applications where rapid response times and resource efficiency are crucial. Several approaches have been proposed to address the efficiency challenges in LMMs, as surveyed in [24]: methods such as model compression[17, 32, 57, 52] and architectural optimization[11, 12] have shown promise. Beyond the obvious efficiency bottleneck in the LLM backbone, LMMs face an additional challenge: the large number of visual tokens. The computational complexity of the LLM backbone grows quadratically with the number of input tokens, making inference particularly expensive for high-resolution images and videos. Recent works[46, 50] have demonstrated that visual tokens, which serve as prefix content for LMMs, often contain substantial redundancy that could be exploited for more efficient processing.
However, existing token reduction approaches typically require extensive retraining or fine-tuning of the LMMs [46, 50, 6, 66, 28, 30], which presents significant practical limitations. Specifically, reducing the number of multimodal tokens necessitates retraining the entire model on the original dataset to adapt it to the reduced token setting. This retraining requirement poses substantial practical challenges: state-of-the-art LMMs undergo extensive visual instruction fine-tuning on large-scale datasets, requiring intensive computational resources. For instance, training LLaVA-OneVision-72B[26] demands thousands of A100 GPU hours, while even the more modest LLaVA-1.5-13B[35] requires hundreds of A100 GPU hours. Unfortunately, this represents the best-case scenario, in which the training materials (data and training recipes) are publicly accessible. More critically, many high-performing LMMs rely on proprietary training data that is not publicly available, making reproduction or retraining impractical for the broader research community. For example, powerful models like PaliGemma[5], InternVL2[10], and Qwen2-VL[59] only release their model weights while keeping their training data private, rendering retraining impossible.
To address these challenges, we propose freePruner, a training-free approach that can be directly applied to any open-source LMM without additional training or fine-tuning. Through extensive analysis of previous token reduction studies[46, 50, 6, 66, 28, 30], we identify a critical limitation: existing methods rely heavily on token merging operations. While these operations effectively pack information into compressed tokens, they substantially change the distribution of the token representations. This alteration makes it challenging for pretrained models to handle the modified representations without additional retraining, which explains why previous token reduction methods perform poorly in training-free settings and require full model retraining. In contrast, freePruner employs a pure token selection strategy, eschewing any merging operations. To ensure comprehensive visual understanding, our approach captures both high-level and low-level representations through two key components. The first is Pivotal Token Selection: we design a token contribution degree metric to identify tokens that capture high-level semantic information across multiple transformer layers [25, 8]. The second is Complementary Token Selection: based on attention patterns in the penultimate layer, we select additional tokens that exhibit strong relationships with the pivotal tokens, preserving essential low-level visual details.
Our main contributions are threefold:
- We propose a training-free paradigm for accelerating LMMs that applies to any open-source LMM, requiring no access to training data or further fine-tuning.
- We propose a two-stage token selection strategy that balances high-level semantic features with low-level visual details. This approach achieves approximately 2× acceleration while maintaining performance (see Fig.2).
- freePruner is orthogonal to existing post-training acceleration methods such as quantization[32, 49, 47], offering an additional option for LMM acceleration.
2 Related Work
Token Reduction for Efficient LMMs
Large Language Models (LLMs) such as GPT-4[42], LLaMA[55], Mistral[23], and Gemini[54] have demonstrated remarkable capabilities in text-based question answering and reasoning tasks. Building upon these advances, Large Multimodal Models (LMMs)[37, 70, 62, 64] extend these capabilities to visual domains, enabling chat-based interactions that combine image understanding with language generation. Recent developments have further expanded LMM capabilities to region-level understanding[7, 67, 44, 9], video comprehension[31, 65], and 3D scene interpretation[22]. These multimodal architectures typically process visual information by feeding visual tokens directly into the LLM as prefix tokens, utilizing various interface mechanisms such as MLPs[34], Q-Formers[13, 70], or resamplers[2]. However, the number of visual tokens can become prohibitively large, particularly for high-resolution images[35, 43]. This poses a significant challenge due to the quadratic computational complexity of Transformer architectures[56].
Token reduction has emerged as a promising approach for accelerating Transformer-based models[21]. Traditional uni-modal token reduction methods[39, 61, 30, 6, 16] focus on reducing the token count within the internal transformer structure to decrease computational overhead. For example, token merging[6] maintains full attention while progressively reducing tokens through a bipartite-matching-based selection of representative tokens. In the context of LMMs, several recent approaches have been proposed for token reduction. PruMerge[46] and TokenPacker[28] achieve compression rates of 20% to 50% through token aggregation based on similarity metrics, and Qwen-VL[4] employs a resampler to compress visual tokens to a fixed length. However, these methods face significant limitations in training-free settings due to their reliance on token merging operations: while merging effectively condenses information, it fundamentally alters the token distribution, requiring model retraining to maintain performance. This limitation makes existing approaches impractical for scenarios where retraining is not feasible. To the best of our knowledge, our work presents the first exploration of training-free token reduction specifically designed for LMMs.
3 freePruner: A Training-free Approach for Large Multimodal Model Acceleration
In this paper, we propose a post-training LMM acceleration method for the training-free setting. In this section, we first revisit the overall structure of Large Multimodal Models (LMMs) and the architecture of transformer encoders, emphasizing the direct relationship between the number of visual tokens and the computational efficiency of LMMs (Sec.3.1). Subsequently, we present our training-free token reduction method, freePruner, designed specifically for LMM acceleration. Our method features two key components: (1) pivotal token identification, which extracts tokens containing coarse-to-fine feature information from the visual encoder (Sec.3.2); and (2) complementary token selection, which utilizes the pivotal tokens as anchors to select highly correlated visual tokens (Sec.3.3). The pipeline is visualized in Fig.3.
3.1 Preliminaries
Large Multimodal Models (LMMs). The fundamental concept behind LMMs is to leverage large language models (LLMs) for multimodal understanding, as illustrated in Fig.1 (without the freePruner part). Given a multimodal input (e.g., an image or video), the system first employs a Transformer encoder[56, 15] to generate multimodal tokens. These tokens are then processed through a projector that maps them into the text feature space. After aligning the visual tokens and textual tokens in the same feature space, they are concatenated and processed by the LLM backbone to generate the final outputs [64].
The Transformer encoder[15, 56] serves as the bridge that transforms a visual input into a visual token representation, which is later sent to an LLM for understanding[35, 70, 22, 64, 31]. It is the architecture commonly used for visual encoders in LMMs. It is mainly composed of multiple transformer blocks that include multi-head self-attention (MSA) layers, a feed-forward neural network (FFN), skip connections, and layer normalization [3, 56]. In the encoder, a visual input is first segmented into a grid of patches, each of which is then transformed into a token embedding. As these tokens pass through successive transformer layers, their representations become progressively refined. Within the self-attention layer, each input token is transformed into three distinct vectors: query $q$, key $k$, and value $v$, via corresponding linear transformation matrices $W_Q$, $W_K$, and $W_V$. Applying these transformations to the stacked input tokens $X$ yields the query, key, and value matrices $Q = XW_Q$, $K = XW_K$, and $V = XW_V$. The attention matrix $A$ measures how much attention each element contributes to the other elements:
$$A = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) \tag{1}$$
where $d_k$ is the dimension of $Q$ and $K$. A higher value in $A$ indicates a greater relevance between two tokens. Self-attention generates new representations of each token by aggregating information from all other tokens, weighted by the attention scores:
$$\mathrm{Attention}(Q, K, V) = A V \tag{2}$$
Following the self-attention layers, the transformer encoder incorporates a feed-forward network (FFN). This network comprises two layers of linear transformations interspaced by a nonlinear activation function, expressed as $\mathrm{FFN}(x) = \sigma(x W_1) W_2$, where $W_1$ and $W_2$ are the matrices of the linear transformation layers and $\sigma(\cdot)$ denotes the nonlinear activation function.
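To make the preliminaries concrete, the minimal PyTorch sketch below implements a single-head encoder block following Eqs. 1 and 2 and the FFN above. The layer sizes and the choice to return the attention map (so later stages can reuse it) are illustrative assumptions, not the exact encoder configuration of any particular LMM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MinimalEncoderBlock(nn.Module):
    """Single-head transformer block: self-attention (Eqs. 1-2) + FFN."""
    def __init__(self, d_model: int = 768, d_ffn: int = 3072):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_K
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W_V
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                     # x: (N, d_model)
        h = self.norm1(x)
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        a = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)   # Eq. 1: A is (N, N)
        x = x + a @ v                                         # Eq. 2 + skip connection
        x = x + self.ffn(self.norm2(x))                       # FFN + skip connection
        return x, a                                           # attention map reused later
```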
A critical challenge in the LMM architecture is that the computational complexity grows quadratically with the number of input tokens[53]. Since visual tokens comprise the majority of these inputs to the LLM backbone, reducing the visual token quantity is crucial for improving LMM efficiency[46, 50]. However, existing token reduction approaches[46, 50, 6, 28, 66] are not suitable for training-free LMM acceleration scenarios. This limitation stems from their reliance on token merging operations, which fundamentally alter the token distribution. Specifically, these methods typically combine multiple tokens through weighted summation or more complex operations to create compressed representations. While such merging operations can theoretically preserve information in a more condensed form, they introduce perturbations to the original token distribution. These perturbations create a gap between the merged tokens and the expected input distribution of the pretrained projector and LLM. Consequently, the merged representations deviate significantly from the distribution the model was trained on, making them difficult to process without model retraining.
To achieve training-free token reduction while preserving essential information for the LLM backbone, we must first examine the internal architecture of multimodal encoders, specifically the Transformer architecture[15].
3.2 Pivotal Token for High-level Feature
A crucial aspect of LMMs' multimodal understanding is the effective balance between low-level and high-level visual features[7, 58]. While previous works[33, 68] have explored various approaches to feature extraction, maintaining this balance in a training-free token selection setting presents unique challenges. The primary challenge lies in identifying and selecting the most informative tokens using only the internal representations and attention patterns within the transformer, without access to additional training signals or external supervision. To address this, we first introduce a metric called the token contribution degree $c_i^{(l)}$, which quantifies how much each token influences the other tokens in the network. This metric is defined as:
$$c_i^{(l)} = \sum_{j=1}^{N} A_{j,i}^{(l)} \tag{3}$$
where $A^{(l)}$ represents the attention map at layer $l$, computed using Eq.1, and $N$ is the number of visual tokens. This metric measures the extent to which the $i$-th token contributes to the entire image representation at layer $l$. The intuition behind this metric is straightforward: a larger $c_i^{(l)}$ indicates that, in layer $l$, most tokens direct their attention to the $i$-th token, suggesting that the $i$-th token effectively represents the features extracted at that layer.
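In code, Eq.3 is simply a column sum of the attention map. A minimal sketch, assuming a per-layer attention matrix of shape (N, N) with attention heads already averaged:

```python
import torch

def contribution_degree(attn: torch.Tensor) -> torch.Tensor:
    """Token contribution degree (Eq. 3) from one layer's attention map.

    attn: (N, N) attention matrix A^(l); rows are queries, columns are keys,
          with multi-head attention assumed to be averaged over heads.
    Returns a length-N vector c, where c[i] sums the attention that
    all tokens direct to token i.
    """
    return attn.sum(dim=0)
```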
Fig.4 presents the distribution of token contribution degrees across different layers. Notably, in most layers, $c_i^{(l)}$ exhibits a highly sparse distribution, with only a few tokens showing significant contribution degrees. This sparsity pattern is consistent across different depths of the network, from shallow (layer 6) to deep layers (layer 23). The observed sparsity reveals an important property: in the middle layers, while most visual tokens primarily attend to themselves, a small subset of pivotal tokens emerges that significantly influences the representations of other tokens.
Fig.4 also illustrates varying degrees of contribution across different layers. In the shallower layers, the contribution degree is minimal for all tokens, suggesting that no single token significantly influences others. As the layers deepen, the contribution degree noticeably increases, suggesting a progressive sharing of information among tokens. A contribution degree of 400 means that, on average, each visual token directs an attention score of about 0.7 to that pivotal token, so the pivotal token accounts for roughly 70% of the attention exchanged among all tokens. However, the contribution level diminishes in the penultimate layer. At this stage, having interacted through all previous layers, each token has garnered extensive information and has developed a distinct representation, leading to a more balanced interaction where no single token predominates.
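As a rough sanity check of this reading, assume the 576 visual tokens produced by LLaVA-1.5's 336-pixel CLIP ViT-L/14 encoder (an assumption; the figure itself does not state the token count). A contribution degree of 400 then corresponds to

$$c_i^{(l)} = \sum_{j=1}^{N} A^{(l)}_{j,i} \approx 400, \qquad \frac{c_i^{(l)}}{N} \approx \frac{400}{576} \approx 0.69,$$

i.e., each visual token sends, on average, close to 70% of its attention mass to that single pivotal token.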
Building upon this observation, we propose a selective token retention strategy based on the token contribution degree $c_i^{(l)}$. As shown in Fig.3, our approach strategically identifies and retains tokens with high contribution degrees from deeper layers to capture high-level semantic features essential for complex visual understanding.
Intuitive Explanation: These high-contribution visual tokens function analogously to global tokens within the visual token set, similar to the class token in vision transformer architectures. Just as the class token in vision transformers[15] aggregates information from all tokens for classification, our identified high-contribution tokens serve as natural focal points that integrate information from surrounding tokens. This emergent behavior allows these tokens to capture global context effectively, making them particularly valuable for representing complex visual features across different hierarchical levels.
We present the positioning of pivotal tokens in Fig.5. Most pivotal tokens are located in the image areas that contain the most information. However, there is one counter-intuitive observation. Initial tokens, typically positioned in the top left corner with minimal visual content such as a plain white-wall background, are often marked as pivotal. This observation aligns with insights from existing studies on Transformers. For example, retaining initial tokens is noted to significantly enhance window attention (i.e., attention sink), despite their lack of semantic importance[60]. This supports our approach to selecting initial tokens as pivotal, emphasizing their high contribution degree and their role in improving performance in subsequent tasks.
1: Input: the original visual tokens $X = \{x_1, \ldots, x_N\}$, where $N$ is the number of input visual tokens; $L$ is the total number of layers, and $l_0$ is the index of the starting layer for pivotal token identification.
2: Output: a refined set of $m$ (adaptive) visual tokens $X'$, in which $m < N$.
3: freePruner:
4: for $l$ in range($l_0$, $L$) do
5:   Calculate the token contribution degree $c_i^{(l)}$ for every visual token using Eq.3.
6:   Select pivotal token indices based on the highest $c_i^{(l)}$, and obtain the layer-wise pivotal token index list $P^{(l)}$. (see Sec.3.2)
7: end for
8: The complete index list of pivotal tokens is $P = \bigcup_{l=l_0}^{L} P^{(l)}$.
9: for $p$ in $P$ do
10:   Adaptively select outliers from the attention score vector $A^{(L-1)}_{:,p}$ of the penultimate layer (obtained with Eq.1), and obtain the complementary token index list $C_p$.
11: end for
12: The complete index list of complementary tokens is $C = \bigcup_{p \in P} C_p$. (see Sec.3.3)
13: The final index list of selected tokens is $S = P \cup C$.
14: Output the refined stack of visual tokens $X'$ indexed by $S$.
| Method | LLM | Res. | PT | IT | VQAv2 | SQA-I | VQA-T | POPE | MME | MMB |
|---|---|---|---|---|---|---|---|---|---|---|
| BLIP-2 | Vicuna-13B | 224 | 129M | - | 41.0 | 61 | 42.5 | 85.3 | 1293.8 | - |
| InstructBLIP | Vicuna-7B | 224 | 129M | 1.2M | - | 60.5 | 50.1 | - | - | 36 |
| InstructBLIP | Vicuna-13B | 224 | 129M | 1.2M | - | 63.1 | 50.7 | 78.9 | 1212.8 | - |
| Shikra | Vicuna-13B | 224 | 600K | 5.5M | 77.4 | - | - | - | - | 58.8 |
| IDEFICS-9B | LLaMA-7B | 224 | 353M | 1M | 50.9 | - | 25.9 | - | - | 48.2 |
| IDEFICS-80B | LLaMA-65B | 224 | 353M | 1M | 60.0 | - | 30.9 | - | - | 54.5 |
| Qwen-VL | Qwen-7B | 448 | 1.4B | 50M | 78.8 | 67.1 | 63.8 | - | - | 38.2 |
| Qwen-VL-Chat | Qwen-7B | 448 | 1.4B | 50M | 78.2 | 68.2 | 61.5 | - | 1487.5 | 60.6 |
| LLaVA-1.5 | Vicuna-7B | 336 | 558K | 665K | 78.5 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 |
| LLaVA-PruMerge+ | Vicuna-7B | 336 | 558K | 665K | 76.8 | 68.3 | 57.1 | 84.0 | 1462.4 | 64.9 |
| LLaVA + PruMerge+ | Vicuna-7B | 336 | 0 | 0 | 76.6 | 67.5 | 55.6 | 86.5 | 1414.0 | 62.9 |
| LLaVA + freePruner | Vicuna-7B | 336 | 0 | 0 | 77.6 | 68.6 | 60.0 | 87.7 | 1485.2 | 63.8 |
| LLaVA-1.5 | Vicuna-13B | 336 | 558K | 665K | 80.0 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 |
| LLaVA-PruMerge+ | Vicuna-13B | 336 | 558K | 665K | 77.8 | 71.0 | 58.6 | 84.4 | 1485.5 | 65.7 |
| LLaVA + PruMerge+ | Vicuna-13B | 336 | 0 | 0 | 77.6 | 71.9 | 56.3 | 85.5 | 1470.1 | 64.5 |
| LLaVA + freePruner | Vicuna-13B | 336 | 0 | 0 | 78.7 | 72.5 | 60.0 | 87.9 | 1516.7 | 68.0 |
3.3 Complementary Token for Low-level Feature
Pivotal tokens distill essential semantic content from visual inputs, focusing primarily on key object features. However, they often miss finer, low-level details, particularly in information-rich images. To address this, we propose a complementary token selection method to target and incorporate low-level features into the image representation.
As tokens progress through the network layers, they strive to encapsulate key information to form a comprehensive understanding. However, as depicted in Fig.4, certain tokens continue to exhibit a high contribution degree in the penultimate layer, indicating that some information remains unabsorbed by the pivotal tokens. This unabsorbed information primarily consists of low-level details that pivotal tokens, with their focus on high-level semantic content, fail to capture. Thus, there is a clear need to integrate this missing low-level information into the image representation.
Fig.6 demonstrates how the contribution degree varies across network layers, highlighting opposite trends for complementary and pivotal tokens. It reveals that as the network layers deepen, the contribution degree of complementary tokens steadily increases, indicating a consistent release of low-level information. In contrast, pivotal tokens exhibit a high contribution degree in the middle layers, which gradually declines in deeper layers. This shift suggests that the network, initially focused on high-level semantics, gradually incorporates more detailed low-level features, such as textures and edges, which are critical for a refined understanding.
Our selection method for complementary tokens leverages their high attention scores toward pivotal tokens in the penultimate layer, identifying tokens that carry vital low-level details not captured by the pivotal tokens. This approach moves beyond uniformly sampled spatial tokens, which merely capture spatial information at regular intervals, toward a strategy that targets the essential low-level visual details. Complementary tokens do not merely enlarge the pivotal token set; they deliver the nuanced low-level information essential for image understanding. Related experiments and discussions are in Sec.4.5. The complete token selection algorithm is outlined in Alg.1.
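For readers who prefer code to pseudocode, the sketch below mirrors Alg.1, assuming access to the encoder's per-layer attention maps (heads averaged). The per-layer pivotal budget `k_pivotal` and the interquartile-range outlier rule are illustrative placeholders for the adaptive selection, since Alg.1 leaves these choices open.

```python
import torch

def free_pruner_select(attn_maps, start_layer: int, k_pivotal: int = 8,
                       iqr_factor: float = 1.5) -> torch.Tensor:
    """Two-stage token selection (Alg. 1), returning the kept token indices.

    attn_maps:   list of (N, N) attention maps, one per encoder layer.
    start_layer: first layer l_0 used for pivotal token identification.
    """
    num_layers = len(attn_maps)

    # Stage 1: pivotal tokens -- highest contribution degree per layer (Eq. 3).
    pivotal = set()
    for l in range(start_layer, num_layers):
        contrib = attn_maps[l].sum(dim=0)                 # c^(l), shape (N,)
        pivotal.update(contrib.topk(k_pivotal).indices.tolist())

    # Stage 2: complementary tokens -- outliers of the attention that each
    # token directs to a pivotal token in the penultimate layer.
    penult = attn_maps[-2]                                # A^(L-1), shape (N, N)
    complementary = set()
    for p in pivotal:
        scores = penult[:, p]                             # attention toward token p
        q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
        threshold = q3 + iqr_factor * (q3 - q1)           # simple IQR outlier rule
        complementary.update(torch.nonzero(scores > threshold).flatten().tolist())

    keep = sorted(pivotal | complementary)
    return torch.tensor(keep, dtype=torch.long)

# Usage: prune the visual tokens before they enter the projector / LLM.
# kept = free_pruner_select(attn_maps, start_layer=6)
# visual_tokens = visual_tokens[kept]   # (len(kept), d) instead of (N, d)
```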
4 Experiments
We first present the empirical results of our token reduction method, freePruner, applied to LLaVA-1.5 in Sec.4.1. Second, we explore its scalability in Sec.4.2 and its generalization capability in Sec.4.3 through various experiments. Furthermore, we evaluate the efficiency gains achieved by applying freePruner to LMMs in Sec.4.4. Finally, we demonstrate the necessity and effectiveness of each module of our method in Sec.4.5.
4.1 Main Results
We apply our token reduction method to LLaVA-1.5[34], selecting 50% of the original visual tokens without any training or fine-tuning. Our evaluation spans six visual question-answering and reasoning benchmarks: VQAv2[20], ScienceQA[41], TextVQA[51], the POPE hallucination benchmark[29], MME[18], and MMBench[40].
As shown in Tab.1, our approach not only achieves performance comparable to LLaVA-1.5 but also surpasses it on specific benchmarks such as POPE[29] and ScienceQA[41], across both the 7B and 13B LLM backbones. Compared with the existing token reduction method PruMerge+[46], freePruner outperforms both its trained and training-free versions. Furthermore, our training-free freePruner method outperforms previous models that require training, such as BLIP-2[27], InstructBLIP[13], Shikra[9], and both IDEFICS-9B[1] and IDEFICS-80B[1].
4.2 Scalability Analysis
We explore the scalability of our freePruner method, focusing on how increasing the number of selected tokens influences performance on six VQA benchmarks. As illustrated in Fig.7, the model demonstrates improved performance as the number of selected tokens increases on most benchmarks, except for ScienceQA[41]. Additionally, we compare our method against the existing token reduction technique PruMerge+[46], using both its merged and non-merged versions as baselines. Our findings reveal that freePruner consistently outperforms PruMerge+[46] across all benchmarks. As the number of selected tokens increases, freePruner shows a pronounced improvement in performance, while PruMerge+[46] exhibits a more gradual enhancement. This underscores freePruner's robust scalability, even in the absence of further training. When selecting only 50% of the visual tokens, freePruner surpasses the baseline performance of LLaVA-1.5 on the TextVQA[51], ScienceQA[41], and POPE[29] tasks. We would like to highlight our goal: rather than pursuing extreme token reduction ratios, our method focuses on achieving training-free acceleration without the need for model retraining, effectively providing a "free-lunch" speedup for LMM inference. This is why we emphasize the scalability of training-free token reduction methods.
4.3 Generalization Across Modalities and Encoders
To evaluate the broader applicability of freePruner, we examine its performance on video understanding tasks using VideoLLaVA[31] with the LanguageBind[69] encoder, a transformer-based video-language foundation model. Results in Tab.2 demonstrate that our method maintains or exceeds the original VideoLLaVA performance while significantly reducing the token count (by 8×). This effectiveness can be attributed to the inherently higher redundancy in video tokens, which manifests as increased sparsity in the transformer attention patterns (see Secs.3.2 and 3.3). Given that Video-LLMs typically process substantially more tokens than image-based models[45, 38], our method's success in this domain suggests particular promise for video understanding applications.
| Methods | LLM size | MSVD-QA Accuracy | MSVD-QA Score | MSRVTT-QA Accuracy | MSRVTT-QA Score | ActivityNet-QA Accuracy | ActivityNet-QA Score |
|---|---|---|---|---|---|---|---|
| FrozenBiLM | 1B | 32.2 | - | 16.8 | - | 24.7 | - |
| VideoChat | 7B | 56.3 | 2.8 | 45.0 | 2.5 | - | 2.2 |
| LLaMA-Adapter | 7B | 54.9 | 3.1 | 43.8 | 2.7 | 34.2 | 2.7 |
| Video-LLaMA | 7B | 51.6 | 2.5 | 29.6 | 1.8 | 12.4 | 1.1 |
| Video-ChatGPT | 7B | 64.9 | 3.3 | 49.3 | 2.8 | 35.2 | 2.7 |
| Video-LLaVA | 7B | 70.7 | 3.9 | 59.2 | 3.5 | 45.3 | 3.3 |
| Video-LLaVA + PruMerge+ | 7B | 71.1 | 3.9 | 59.3 | 3.6 | 47.7 | 3.4 |
| Video-LLaVA + freePruner | 7B | 71.3 | 3.9 | 59.5 | 3.5 | 48.4 | 3.4 |
4.4 Efficiency Analysis
| Method | LLM Backbone | Quantization | OPs (TB) | Prefill Time (ms) | Accessing Memory (GB) | Storing Activation (GB) |
|---|---|---|---|---|---|---|
| VideoLLaVA | Vicuna-7B | FP16 | 29.4 | 232.2 | 67.6 | 25.7 |
| VideoLLaVA w/ freePruner | Vicuna-7B | FP16 | 7.3 | 52.6 | 20.6 | 3.3 |
| VideoLLaVA | Vicuna-7B | INT4 | 29.4 | 102.6 | 16.9 | 6.4 |
| VideoLLaVA w/ freePruner | Vicuna-7B | INT4 | 7.3 | 24.7 | 5.2 | 0.8 |
| LLaVA-1.5 | Vicuna-13B | FP16 | 15.9 | 112.7 | 39.2 | 6.1 |
| LLaVA-1.5 w/ freePruner | Vicuna-13B | FP16 | 8.2 | 57.1 | 31.5 | 2.5 |
| LLaVA-1.5 | Vicuna-13B | INT4 | 15.9 | 53.4 | 9.8 | 1.5 |
| LLaVA-1.5 w/ freePruner | Vicuna-13B | INT4 | 8.2 | 27.4 | 7.9 | 0.6 |
We further examine computational efficiency by applying freePruner to LLaVA-1.5-13B[35] and VideoLLaVA[31] on an A6000 GPU. The theoretical performance results are estimated with the roofline-based LLM-Viewer analysis[63].
Tab.3 shows the efficiency improvements from applying freePruner to both image and video LMMs. For LLaVA-1.5, freePruner halves the visual tokens, which yields a roughly twofold reduction in prefill time. In the video-LLM VideoLLaVA, the impact of freePruner is even more substantial. By reducing the visual tokens, without any training, to just a quarter of the baseline, freePruner achieves a roughly fourfold reduction in prefill time. This considerable reduction in token input also leads to a 70% decrease in memory access and an eightfold reduction in activation storage on video tasks. As the deployment bottleneck of Video-LLMs lies in the memory consumption of activations and the KV-cache[48, 19], our token reduction method shows even more potential for Video-LLMs[45]. By optimizing the number of visual tokens used, freePruner ensures that models run more efficiently, maintaining high performance across various benchmarks while optimizing resource use, with particularly notable improvements on video tasks. It is essential to recognize that the benefits of freePruner are not limited to efficiency alone: its token reduction can be integrated with additional acceleration methods for large multimodal models, such as quantization (see Appendix).
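The roughly proportional scaling of prefill cost with token count seen in Tab.3 can be reproduced with a first-order FLOPs model, sketched below. The hidden size, layer count, and FFN expansion ratio are generic Vicuna-13B-style values used as illustrative assumptions, not the exact LLM-Viewer configuration.

```python
def prefill_flops(num_tokens: int, d_model: int = 5120, n_layers: int = 40,
                  ffn_mult: int = 4) -> float:
    """First-order prefill FLOPs for a decoder-only LLM (single forward pass).

    Attention projections and the FFN scale linearly with the token count,
    while the attention score/value products scale quadratically, so halving
    the visual tokens cuts prefill compute by roughly 2x (slightly more for
    very long sequences, where the quadratic term matters).
    """
    linear = num_tokens * (4 * d_model**2 + 2 * ffn_mult * d_model**2)  # QKV/out proj + FFN
    quadratic = 2 * num_tokens**2 * d_model                             # QK^T and AV
    return 2 * n_layers * (linear + quadratic)   # x2: multiply-accumulate

# e.g. ratio of full vs. pruned visual-token prefill (text tokens ignored):
# prefill_flops(576) / prefill_flops(288)  ->  ~2.0
```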
4.5 Ablation Study on Different Modules
In this subsection, we examine the effectiveness of the two key modules in freePruner: pivotal tokens (Sec.3.2) and complementary tokens (Sec.3.3). We aim to validate the necessity of each module and verify whether the coexistence of both modules is superior to a single-module setup. All experiments utilize the LLaVA-1.5 framework with the Vicuna-7B LLM as the backbone. We compare three groups: using only pivotal tokens (PT only), using pivotal tokens plus randomly selected tokens (PT + Random), and using pivotal tokens plus complementary tokens (PT + CT). The pairwise comparisons of these groups, shown in Fig.8, reveal three key findings.
First, simply increasing the number of tokens does not guarantee a more comprehensive visual representation. Models using only pivotal tokens outperform those using an equivalent number of tokens composed of pivotal and randomly selected tokens, suggesting that performance is enhanced by selecting tokens that carry important information, not by adding tokens per se. Second, the benefit of complementary tokens is not a random effect. We compare two dual-module configurations, both built on the same quantity of pivotal tokens but differing in the addition of either complementary tokens or random tokens. The results show that the combination of pivotal and complementary tokens outperforms the mix of pivotal and random tokens, indicating that the complementary token module provides meaningful information rather than a random effect. Third, complementary tokens are necessary; they cannot simply be replaced by increasing the number of pivotal tokens. Comparisons show that models using both pivotal and complementary tokens outperform those using only pivotal tokens, illustrating that complementary tokens provide crucial low-level visual details that pivotal tokens alone fail to capture. In summary, both the pivotal tokens and the complementary tokens in our freePruner method carry important visual information that enhances the overall visual representation.
5 Conclusion
We present freePruner, a training-free token reduction approach for accelerating LMMs. Rather than pursuing extreme token reduction ratios, our method focuses on achieving training-free acceleration without the need for model retraining, effectively providing a "free-lunch" speedup for LMM inference. We propose a two-stage token selection strategy that identifies and retains the most informative visual tokens while requiring no model retraining or access to training data. Extensive experiments demonstrate that freePruner achieves 2× acceleration while maintaining comparable performance across various visual question-answering tasks. Moreover, our approach is orthogonal to existing acceleration techniques such as quantization, offering additional pathways for efficient LMM deployment.
References
- [1]Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023.
- [2]Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, etal.Flamingo: a visual language model for few-shot learning.NeurIPS, 35:23716–23736, 2022.
- [3]JimmyLei Ba, JamieRyan Kiros, and GeoffreyE Hinton.Layer normalization.arXiv preprint arXiv:1607.06450, 2016.
- [4]Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, etal.Qwen technical report.arXiv preprint arXiv:2309.16609, 2023.
- [5]Lucas Beyer, Andreas Steiner, AndréSusano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, etal.Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024.
- [6]Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman.Token merging: Your ViT but faster.In International Conference on Learning Representations, 2023.
- [7]Mu Cai, Haotian Liu, SivaKarthik Mustikovela, GregoryP. Meyer, Yuning Chai, Dennis Park, and YongJae Lee.Making large multimodal models understand arbitrary visual prompts.In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
- [8]Yuanhao Cai, Jing Lin, Xiaowan Hu, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc VanGool.Coarse-to-fine sparse transformer for hyperspectral image reconstruction.In ECCV, 2022.
- [9]Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao.Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023.
- [10]Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, etal.Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy, 2024.
- [11]Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, etal.Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices.arXiv preprint arXiv:2312.16886, 2023.
- [12]Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, etal.Mobilevlm v2: Faster and stronger baseline for vision language model.arXiv preprint arXiv:2402.03766, 2024.
- [13]Wenliang Dai, Junnan Li, Dongxu Li, Anthony MengHuat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi.Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
- [14]Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer.Qlora: Efficient finetuning of quantized llms.NeurIPS, 2024.
- [15]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, etal.An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020.
- [16]Mohsen Fayyaz, SoroushAbbasi Koohpayegani, FarnoushRezaei Jafari, Sunando Sengupta, Hamid RezaVaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall.Adaptive token sampling for efficient vision transformers.In European Conference on Computer Vision, 2022.
- [17]Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh.Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022.
- [18]Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, etal.Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023.
- [19]Yao Fu.Challenges in deploying long-context transformers: A theoretical peak performance analysis.arXiv preprint arXiv:2405.08944, 2024.
- [20]Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.Making the v in vqa matter: Elevating the role of image understanding in visual question answering.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
- [21]JoakimBruslund Haurum, Sergio Escalera, GrahamW Taylor, and ThomasB Moeslund.Which tokens to use? investigating token reduction in vision transformers.In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- [22]Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan.3d-llm: Injecting the 3d world into large language models.NeurIPS, 2023.
- [23]AlbertQ Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, etal.Mistral 7b.arXiv preprint arXiv:2310.06825, 2023.
- [24]Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, etal.Efficient multimodal large language models: A survey.arXiv preprint arXiv:2405.10739, 2024.
- [25]YoungKyung Kim, JMatías DiMartino, and Guillermo Sapiro.Vision transformers with natural language semantics.arXiv preprint arXiv:2402.17863, 2024.
- [26]Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li.Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024.
- [27]Junnan Li, Dongxu Li, Silvio Savarese, and Steven C.H. Hoi.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.In International Conference on Machine Learning, 2023.
- [28]Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, and Lei Zhang.Tokenpacker: Efficient visual projector for multimodal llm.arXiv preprint arXiv:2407.02392, 2024.
- [29]Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, WayneXin Zhao, and Ji-Rong Wen.Evaluating object hallucination in large vision-language models.arXiv preprint arXiv:2305.10355, 2023.
- [30]Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie.Not all patches are what you need: Expediting vision transformers via token reorganizations.arXiv preprint arXiv:2202.07800, 2022.
- [31]Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan.Video-llava: Learning united visual representation by alignment before projection.EMNLP, 2024.
- [32]Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han.Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of Machine Learning and Systems, 6:87–100, 2024.
- [33]Daria Lioubashevski, Tomer Schlank, Gabriel Stanovsky, and Ariel Goldstein.Looking beyond the top-1: Transformers determine top tokens in order.arXiv preprint arXiv:2410.20210, 2024.
- [34]Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee.Improved baselines with visual instruction tuning, 2023.
- [35]Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and YongJae Lee.Llava-next: Improved reasoning, ocr, and world knowledge.2024.
- [36]Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and YongJae Lee.Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
- [37]Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning.arXiv:2304.08485, 2023.
- [38]Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel.World model on million-length video and language with ringattention.arXiv preprint arXiv:2402.08268, 2024.
- [39]Xiangcheng Liu, Tianyi Wu, and Guodong Guo.Adaptive sparse vit: Towards learnable adaptive token pruning by fully exploiting self-attention.arXiv preprint arXiv:2209.13802, 2022.
- [40]Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, etal.Mmbench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023.
- [41]Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan.Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 2022.
- [42]OpenAI.Gpt-4 technical report.2023.
- [43]OpenAI.Gpt-4v(ision) system card.https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
- [44]Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei.Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023.
- [45]MichaelS Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, and JuanCarlos Niebles.xgen-mm-vid (blip-3-video): You only need 32 tokens to represent a video even in vlms.arXiv preprint arXiv:2410.16267, 2024.
- [46]Yuzhang Shang, Mu Cai, Bingxin Xu, YongJae Lee, and Yan Yan.Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388, 2024.
- [47]Yuzhang Shang, Gaowen Liu, RamanaRao Kompella, and Yan Yan.Enhancing post-training quantization calibration through contrastive learning.In CVPR, 2024.
- [48]Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, YongJae Lee, and Yan Yan.Interpolating video-llms: Toward longer-sequence lmms in a training-free manner.arXiv preprint arXiv:2409.12963, 2024.
- [49]Yuzhang Shang, Zhihang Yuan, Qiang Wu, and Zhen Dong.Pb-llm: Partially binarized large language models.arXiv preprint arXiv:2310.00034, 2023.
- [50]Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, and Jiaqi Wang.Crossget: Cross-guided ensemble of tokens for accelerating vision-language transformers.ICML, 2024.
- [51]Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach.Towards vqa models that can read.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
- [52]Mingjie Sun, Zhuang Liu, Anna Bair, and JZico Kolter.A simple and effective pruning approach for large language models.arXiv preprint arXiv:2306.11695, 2023.
- [53]Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler.Efficient transformers: A survey.ACM Computing Surveys, 2022.
- [54]Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, AndrewM Dai, Anja Hauth, etal.Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023.
- [55]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, etal.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023.
- [56]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN Gomez, Łukasz Kaiser, and Illia Polosukhin.Attention is all you need.In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [57]Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, and Jiwen Lu.Q-vlm: Post-training quantization for large vision-language models.arXiv preprint arXiv:2410.08119, 2024.
- [58]Feng Wang, Jieru Mei, and Alan Yuille.Sclip: Rethinking self-attention for dense vision-language inference.In ECCV, 2024.
- [59]Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, etal.Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024.
- [60]Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis.Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023.
- [61]Hongxu Yin, Arash Vahdat, JoseM Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov.A-vit: Adaptive tokens for efficient vision transformer.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10809–10818, 2022.
- [62]Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen.A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023.
- [63]Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, YongJae Lee, Yan Yan, etal.Llm inference unveiled: Survey and roofline model insights.arXiv preprint arXiv:2402.16363, 2024.
- [64]Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu.Mm-llms: Recent advances in multimodal large language models.arXiv preprint arXiv:2401.13601, 2024.
- [65]Hang Zhang, Xin Li, and Lidong Bing.Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858, 2023.
- [66]Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, and Liqiang Nie.Token-level correlation-guided compression for efficient multimodal document understanding.arXiv preprint arXiv:2407.14439, 2024.
- [67]Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo.Gpt4roi: Instruction tuning large language model on region-of-interest.arXiv preprint arXiv:2307.03601, 2023.
- [68]Chong Zhou, ChenChange Loy, and Bo Dai.Extract free dense labels from clip.In European Conference on Computer Vision (ECCV), 2022.
- [69]Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, etal.Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment.arXiv preprint arXiv:2310.01852, 2023.
- [70]Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023.
6 Appendix
6.1 Orthogonal Method with Post-training Quantization on LLMs
(Subject: NAT, SOC, LAN; Context Modality: TXT, IMG, NO)

| Bits | Method | NAT | SOC | LAN | TXT | IMG | NO | Average |
|---|---|---|---|---|---|---|---|---|
| FP | - | 89.39 | 96.06 | 85.64 | 88.71 | 87.65 | 88.50 | 89.81 |
| W6A6 | AWQ | 85.39 | 92.01 | 83.27 | 84.80 | 83.54 | 85.99 | 86.23 |
| W6A6 | QLoRA | 88.45 | 94.71 | 84.48 | 87.63 | 86.07 | 87.87 | 88.43 |
| W6A6 | Q-VLM | 89.43 | 95.73 | 84.00 | 88.71 | 87.51 | 87.25 | 89.34 |
| W6A6 | Q-VLM + freePruner | 89.27 | 95.73 | 84.24 | 88.26 | 86.94 | 87.65 | 88.68 |
| W4A4 | AWQ | 74.33 | 72.22 | 74.82 | 73.41 | 67.13 | 77.98 | 74.02 |
| W4A4 | QLoRA | 77.53 | 75.48 | 79.18 | 76.64 | 70.70 | 81.95 | 77.53 |
| W4A4 | Q-VLM | 80.86 | 75.93 | 80.73 | 80.01 | 72.48 | 83.90 | 79.79 |
| W4A4 | Q-VLM + freePruner | 81.21 | 75.48 | 80.70 | 81.30 | 72.52 | 83.01 | 79.03 |
To demonstrate the versatility of freePruner, we evaluate its compatibility with existing training-free LLM acceleration techniques, particularly post-training quantization methods. We conduct experiments combining our approach with several prominent quantization methods: AWQ[32], QLoRA[14], and the multimodal-specific Q-VLM[57]. Results presented in Table 4 show that freePruner successfully integrates with Q-VLM-quantized LLaVA models[57], functioning as a plug-and-play module without requiring additional modifications. This seamless integration demonstrates the orthogonality of our token reduction approach to existing post-training acceleration methods, suggesting promising opportunities for combining multiple acceleration strategies in the LMM optimization pipeline.
6.2 freePruner on LLaVA-Next with AnyRes
| Method | LLM | Res. | PT | IT | MME | MMB |
|---|---|---|---|---|---|---|
| LLaVA-Next | Vicuna-7B | 336 | 558K | 665K | 1519.3 | 65.6 |
| LLaVA-Next + freePruner | Vicuna-7B | 336 | 0 | 0 | 1502.8 | 64.2 |
Effectiveness on High-Resolution LMMs. LLaVA-Next[36] introduced the "AnyRes" technique to effectively process high-resolution images while maintaining data efficiency. This capability enables the model to capture fine-grained visual details, significantly reducing the hallucination artifacts that typically occur when models process low-resolution inputs. The architecture's ability to handle variable high-resolution inputs makes it particularly valuable for detailed visual analysis tasks. We evaluate freePruner's compatibility with LLaVA-Next's high-resolution processing, as shown in Table 5. Note that in our implementation, we modify the standard approach by disabling adaptive token pruning and instead maintaining a fixed token length to align with the AnyRes architecture. Our method reduces the token count by 50% while preserving the model's high-resolution processing capabilities, effectively doubling the inference speed of LLaVA-Next without compromising its ability to capture fine-grained visual details.