Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai
National Key Laboratory for Novel Software Technology, Nanjing University
{zhaof,pangtt,lich,guojj,xsy}@smail.nju.edu.cn, {wuz,daixinyu}@nju.edu.cn
https://aligngpt-vl.github.io
Equal contributions. Corresponding author.
Abstract
Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their capability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: the pre-training phase and the instruction-tuning phase. Despite their success, these models fall short in modeling alignment capabilities. First, during the pre-training phase, the model usually assumes that all image-text pairs are uniformly aligned, but in fact the degree of alignment between different image-text pairs is inconsistent. Second, the instructions currently used for finetuning cover a variety of tasks, and different tasks usually require different levels of alignment capabilities, yet previous MLLMs overlook these differentiated alignment needs. To tackle these issues, we propose a new multimodal large language model, AlignGPT. In the pre-training stage, instead of treating all image-text pairs equally, we divide them into different groups according to their degree of alignment. The model is then trained to learn representations for the different alignment levels. In the instruction-tuning phase, we adaptively combine these alignment-level representations to meet the dynamic alignment needs of different tasks. Extensive experimental results show that our model achieves competitive performance on 12 benchmarks.
1 Introduction
Multimodal Large Language Models (MLLMs) are considered a crucial step towards achieving Artificial General Intelligence (AGI)[31, 1, 13, 30]. The uniqueness of these models lies in their ability to integrate and understand various types of information, especially text and image data. In the pursuit of AGI, this cross-modal understanding and processing capability is essential, as it mimics how humans interact with the world and comprehend complex information through different senses, such as vision and language. The development of multimodal large language models not only advances the field of artificial intelligence but also provides machines with a way to process and understand information that is closer to human cognition.
Currently, MLLMs typically adhere to a unified training paradigm, which is divided into two key phases: the pre-training phase and the instruction-tuning phase[25, 47, 43, 3, 5, 40, 19, 41, 46]. The pre-training phase concentrates on aligning images with text, aiming to train the model to understand the relation between image contents and their corresponding textual descriptions. This alignment imbues the model with cross-modal comprehension abilities. The instruction-tuning phase further enhances its adaptability to specific tasks. This includes enabling the model to complete particular visual-language tasks based on given instructions, such as generating textual descriptions from images, answering questions related to images, or even performing complex reasoning based on both text and images. This training paradigm equips multimodal pre-trained models with not only fundamental cross-modal understanding but also the flexibility to adapt to diverse task demands.
Although current MLLMs have made great progress, their modeling of alignment capabilities is still insufficient for the following reasons:
•
The degree of alignment is inconsistent between different image-text pairs: During the pre-training phase, the model typically operates on a key assumption that all image-text pairs are consistently aligned. However, the degree of alignment in image-text pairs is not always uniform: in some image-text pairs, the text may describe the whole image (as shown in the rightmost pair of Fig.2), while in others the text only describes a part of the image (as shown in the left three image-text pairs in Fig.2). If these differences are not distinguished during the pre-training phase, it could lead to a misunderstanding of the image-text alignment relationships in the learning process.
•
The different tasks require different levels of alignment capabilities: The instructions currently used for finetuning cover a variety of tasks. Some of them, like image captioning[42], rely more on global image-text alignment capabilities. In contrast, other tasks, such as visual question answering (VQA)[2], typically require the model to answer questions based on specific parts of the image, which necessitates not only global image-text alignment but also local image-text alignment capabilities. However, previous work has neglected these differentiated alignment requirements.
To effectively enhance the alignment capabilities, we propose a new multimodal large language model called AlignGPT. In the pre-training phase, we aim to make the model understand different degrees of image-text alignment. Specifically, instead of treating all image-text pairs equally, we divide image-text pairs into different groups according to their degree of alignment and assign an extra group label to each pair. This is achieved with the help of CLIP scores[34], where higher scores indicate higher degrees of alignment[36, 17]. For example, in Fig.2, the degree of alignment of each image-text pair rises from left to right, and the CLIP score of each pair increases accordingly. Subsequently, we utilize these group labels as control signals to enable the model to learn representations of different alignment levels. During the instruction-tuning phase, the model is trained to dynamically combine the representations obtained in pre-training according to the instructions of each task. In this process, we not only assign global alignment capabilities but also adaptively configure different local alignment capabilities according to the alignment needs of each task's instructions. The broad range of tests conducted demonstrates that our model achieves competitive performance across 12 benchmarks, as shown in Fig.1.
Our contribution can be summarized as follows:
•
We propose a new multimodal large language model, AlignGPT, to enhance and empower the alignment capabilities of MLLMs.
•
We propose a novel alignment strategy that learns different alignment levels in the pre-training stage, and then adaptively combines these alignment levels in the instruction-tuning stage to meet the alignment needs of different tasks.
•
We conduct evaluations across multiple academic benchmarks and multimodal instruction-following benchmarks. Extensive experimental results show that our proposed AlignGPT achieves competitive performance. Further analysis verifies the effectiveness of the model.
2 Related Work
In this section, we review the existing studies on large language models and visual language models.
Large Language Models.
In the field of natural language processing, BERT[11] and GPT-2[33], as pioneering large pre-trained language models, marked a significant breakthrough in this technological direction. Their training on vast web text datasets demonstrated unprecedented language understanding and generation capabilities. Subsequently, the launch of GPT-3[4] further accelerated the development of this field, with its large model parameters and extensive training datasets showcasing exceptional abilities in few-shot learning, significantly enhancing task adaptability and flexibility. Following this, the introduction of InstructGPT and ChatGPT[32] focused on optimizing the efficiency and naturalness of interactions between models and humans, where InstructGPT enhanced the capability to execute precise instructions, and ChatGPT improved the conversational experience, making these models more fluent in human-computer communication. Meanwhile, as large language model (LLM) technology continued to evolve, emerging models like LLaMA[39] and GLM[14] began to make their mark. To equip these models with the ability to respond to human instructions similar to ChatGPT, research teams finetune LLaMA and GLM using high-quality instruction datasets, thereby further enhancing their capability to follow instructions, with representative projects such as Alpaca[38], Vicuna[9], and ChatGLM[45].
Although these models have made significant progress in interacting with humans through language, we recognize that human understanding and processing of complex information relies not only on language but also critically on visual and other sensory inputs. This observation has driven us to further explore more comprehensive visual-language models in order to more accurately simulate complex interactions between humans and the real world.
Visual Language Models.
In recent years, multimodal large language models (MLLMs) have garnered increasing attention. The core of MLLMs lies in their ability to achieve cross-modal understanding and generalization. Most current models, such as LLaVA[25], MiniGPT-4[47], mPLUG-Owl[43], Qwen-VL[3], MiniGPT-v2[5], NExT-GPT[41], InternLM-XComposer[46], CogVLM[40], and MM1[29], utilize a standard training framework consisting of two primary phases: pre-training and instruction-tuning. In the pre-training phase, the model utilizes image caption data to establish a rich understanding of cross-modal semantic knowledge. This training phase enables the model to comprehend and capture the correlation between images and text, establishing a solid foundation for the subsequent stage. In the instruction-tuning phase, the model receives specific task instructions to optimize its performance on that task. Through this instruction-tuning phase, the model can further refine its understanding to execute specific tasks, enabling it to flexibly and accurately address various task requirements in practical applications.
Although current MLLMs have achieved promising results, they overlook two critical factors. First, the degree of alignment between different image-text pairs is inconsistent during the pre-training phase. Second, different tasks require different levels of alignment capabilities during the instruction-tuning phase. As a result, the modeling of alignment capabilities in these models remains inadequate. To address these limitations, we propose a new multimodal large language model AlignGPT to effectively enhance the alignment capabilities of MLLMs.
3 Methodology
In this section, we first present the overall architecture of the visual-language model AlignGPT, and then demonstrate how its alignment capability is enhanced by our pre-training and instruction-tuning paradigms.
3.1 Architecture
AlignGPT consists of four components: a visual backbone, a linear projection layer, a large language model, and an alignment module. Fig.4 provides an overview of the AlignGPT architecture and its training process. The following are the implementation details of these components:
Visual backbone.
We utilize the pre-trained CLIP visual encoder ViT-L/14[34] as our visual backbone. We train the model using an image resolution of 336×336.
Linear projection layer.
We adopt a linear projection layer to map the representations of images from the vector space of the vision backbone to that of the language model.
Large language model.
We choose the open-source model Vicuna[9] as our language model backbone, given its strong ability to follow instructions effectively in various language tasks.
Alignment module.
We propose to add alignment embeddings to the inputs of MLLMs to enrich their alignment capabilities. These alignment embeddings are positioned ahead of the image embeddings and text embeddings. In the subsequent sections, we will elaborate on the role of the alignment embeddings and the process to acquire them.
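To make the wiring of these four components concrete, the following PyTorch sketch shows one plausible arrangement: a vision encoder, the linear projection layer, a learnable alignment-embedding table, and a language-model backbone whose input sequence is prefixed with the alignment embedding. The class name, dimensions, and number of alignment levels are placeholders for exposition, not the released implementation.

```python
import torch
import torch.nn as nn

class AlignGPTSketch(nn.Module):
    """Illustrative skeleton of the four components described above.
    Backbone choices (CLIP ViT-L/14 at 336x336, Vicuna) follow the text;
    the wiring and dimensions here are assumptions for exposition."""

    def __init__(self, vision_encoder, language_model,
                 num_levels=8, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # CLIP visual backbone
        self.projector = nn.Linear(vision_dim, llm_dim)   # linear projection layer
        # one learnable embedding per discrete alignment level (num_levels is a placeholder)
        self.alignment_embeddings = nn.Embedding(num_levels, llm_dim)
        self.language_model = language_model              # Vicuna backbone

    def forward(self, pixel_values, text_embeds, align_level):
        # align_level: 0-based alignment-level index, shape (B,)
        image_feats = self.vision_encoder(pixel_values)            # (B, P, vision_dim)
        image_embeds = self.projector(image_feats)                 # (B, P, llm_dim)
        align_embed = self.alignment_embeddings(align_level)       # (B, llm_dim)
        # the alignment embedding is placed ahead of the image and text embeddings
        inputs = torch.cat([align_embed.unsqueeze(1), image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```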
3.2 Image-Text Alignment Assessment
In our methodology, we utilize the similarity scores generated by the CLIP[34] model to evaluate the degree of alignment between images and text. As shown in Fig.2, we present four image-text pairs and their CLIP scores. From left to right, the degree of alignment between each image-text pair rises, i.e., the text description becomes more comprehensive. Correspondingly, the CLIP score of each pair increases. The rationale for adopting CLIP similarity scores lies in the fact that CLIP is pre-trained on a massive dataset of paired images and their corresponding textual descriptions, which enables it to effectively capture the relationship between visual and linguistic information. By employing contrastive learning techniques[8], CLIP minimizes the distance between representations of matched image-text pairs while maximizing the distance between those of mismatched pairs. This training approach relies on 400 million data pairs, allowing the model to develop a nuanced understanding of image-text relationships.
Besides, we also show the CLIP similarity distribution of image-text pairs in the pre-training dataset in Fig.3. The results indicate that the CLIP similarity distribution varies significantly, suggesting a substantial difference in the alignment between images and texts. By jointly observing Fig.2 and Fig.3, we find that pairs with lower scores correspond to texts that describe only partial regions of the image, indicating weaker alignment. In contrast, pairs with higher scores reflect texts that provide a more comprehensive description of the image, suggesting a stronger alignment between the text and the image[36, 17].
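For concreteness, the snippet below sketches how such a CLIP similarity score can be computed with the Hugging Face transformers implementation of CLIP; the checkpoint name and single-pair interface are assumptions for illustration rather than the exact scoring pipeline used here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint, matching the ViT-L/14 (336px) backbone mentioned above.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

@torch.no_grad()
def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings of one pair."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()
```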
3.3 Alignment Level-aware Pre-training
As mentioned before, in the pre-training stage, the model usually assumes that all image-text pairs are uniformly aligned, and these pairs are used to train the model to comprehend the relations between images and their corresponding textual descriptions. However, in reality, the degree of alignment between these image-text pairs may vary considerably. Overlooking this difference could lead to a misunderstanding of the image-text alignment relations during the learning process.
Instead of treating all image-text pairs equally, we divide image-text pairs into different groups according to their degree of alignment and give each pair an extra group label. To achieve this, we leverage the similarity scores provided by CLIP: the higher the CLIP score, the stronger the alignment between the image and the text. Subsequently, we use these group labels as control signals to train the model, enabling it to understand the different alignment relations between different image-text pairs.
More precisely, we start by computing the CLIP similarities for all training image-text pairs. Subsequently, we rank all image-text pairs based on their similarity scores. Finally, we use a bucketing technique to divide them into discrete alignment levels. The process can be expressed as:
$\ell = \mathrm{Bucket}(r), \quad \ell \in \{1, \dots, N\}$  (1)
where $\mathrm{Bucket}(\cdot)$ denotes a bucketing function that assigns each pair, based on its similarity rank $r$, into one of $N$ equally spaced intervals, and $\ell$ is the alignment level (i.e., the group label) of an image-text pair. In this way, image-text pairs with lower CLIP similarity scores are assigned to buckets indicative of lower alignment levels, whereas those with higher CLIP similarity scores are grouped into buckets representing higher alignment levels.
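As a small illustration of Eq. (1), the sketch below ranks pairs by score and splits the ranks into equally populated buckets; the function name and the equal-frequency reading of the bucketing are assumptions for exposition.

```python
import numpy as np

def assign_alignment_levels(scores: np.ndarray, num_levels: int) -> np.ndarray:
    """Rank image-text pairs by CLIP similarity and bucket the ranks into
    num_levels groups: level 1 = weakest alignment, level num_levels = strongest."""
    ranks = scores.argsort().argsort()              # 0-based rank of each pair
    return ranks * num_levels // len(scores) + 1    # equally populated buckets

# Toy usage: six pairs bucketed into three alignment levels.
scores = np.array([0.12, 0.31, 0.18, 0.27, 0.22, 0.35])
print(assign_alignment_levels(scores, num_levels=3))  # [1 3 1 2 2 3]
```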
Once the alignment level of each image-text pair is determined, we can regard it as a special token to express the alignment relation between the image and its textual description. This special token is placed ahead of the image and text tokens. During the pre-training phase, in addition to learning the mapping function in the linear projection layer, we also initialize this special token as an alignment embedding and continuously update its representation.
3.4 Adaptive Alignment-based Instruction-tuning
Currently, the instructions used for finetuning cover various tasks such as image captioning, visual question answering, and visual grounding. These tasks place different requirements on alignment capabilities. For example, image captioning mainly relies on global alignment between images and text, while VQA and visual grounding require not only global alignment but also local alignment capabilities between images and text. To equip the model with adaptive alignment capability, we propose an adaptive alignment-based instruction-tuning paradigm, which dynamically combines the alignment embeddings to meet the alignment needs of each task.
To this end, we first clarify how to represent the global and local alignment capabilities between image-text pairs. As mentioned in Section 3.3, after the pre-training stage, we obtain alignment embeddings $e_1, \dots, e_N$ corresponding to the $N$ discrete alignment levels. Among them, $e_N$ represents the highest level of alignment, i.e., it indicates that the text provides a very comprehensive description of the image. Here we regard it as the global alignment embedding. The remaining embeddings $e_1, \dots, e_{N-1}$ represent different degrees of alignment between the image and the text, meaning that the text only describes part of the information in the image, from weak to strong. Thus, we regard them as local alignment embeddings of varying degrees.
Afterwards, we not only allocate global alignment capabilities to the instructions of each task, but also adaptively distribute varying degrees of local alignment capabilities based on the distinct alignment needs of each instruction. The reason behind this is that global alignment serves as the foundation for cross-modal understanding; only by mastering global alignment capabilities can a model truly focus on enhancing local alignment abilities. Specifically, in addition to the global alignment embedding, we assign different weights to the local alignment embeddings via a gate network. These weights are computed from the input instruction and the image, as the input instruction greatly influences the visual regions the model should focus on. The implementation of the gate network is as follows:
$\alpha = \mathrm{Softmax}\big(W[H_t; H_v] + b\big)$  (2)
where $H_t$ and $H_v$ denote the embeddings of the input instruction and the image, $W$ and $b$ are the weight matrix and bias, and $\alpha = [\alpha_1, \dots, \alpha_{N-1}]$ denotes the weights of the local alignment embeddings. Finally, we aggregate the global alignment embedding and the local alignment embeddings with varying weights to ensure a more precise fulfillment of the alignment requirements for the instructions of each task:
$e = e_N + \sum_{i=1}^{N-1} \alpha_i e_i$  (3)
where $e$ denotes the final alignment embedding for each instruction during the instruction-tuning stage.
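To make Eqs. (2) and (3) concrete, the PyTorch sketch below implements one possible form of the gate and the aggregation. It assumes that token-level instruction and image features are mean-pooled into $H_t$ and $H_v$ and that the gate is a single linear layer followed by a softmax; these choices, like the names, are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAlignmentGate(nn.Module):
    """Combine the global alignment embedding e_N with gate-weighted local
    embeddings e_1..e_{N-1}, in the spirit of Eqs. (2)-(3)."""

    def __init__(self, hidden_dim: int, num_levels: int):
        super().__init__()
        self.num_levels = num_levels
        # W, b of Eq. (2): maps [H_t; H_v] to weights over the N-1 local embeddings
        self.gate = nn.Linear(2 * hidden_dim, num_levels - 1)

    def forward(self, instr_embeds, image_embeds, alignment_table: nn.Embedding):
        # mean-pool token-level features into single vectors (an assumption)
        h_t = instr_embeds.mean(dim=1)                      # (B, hidden_dim)
        h_v = image_embeds.mean(dim=1)                      # (B, hidden_dim)
        alpha = F.softmax(self.gate(torch.cat([h_t, h_v], dim=-1)), dim=-1)  # (B, N-1)

        local_embeds = alignment_table.weight[: self.num_levels - 1]   # e_1..e_{N-1}
        global_embed = alignment_table.weight[self.num_levels - 1]     # e_N
        # Eq. (3): global embedding plus weighted sum of local embeddings
        return global_embed.unsqueeze(0) + alpha @ local_embeds        # (B, hidden_dim)
```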
In general, we can regard the alignment embeddings obtained in the pre-training phase as foundational components, each of which has different alignment capabilities. During the instruction-tuning phase, we dynamically combine these components to meet the alignment needs for instructions of different tasks.