GenAI technology and products in the multimodal space are currently booming, with OpenAI's Sora as the most prominent example, and Apple naturally wants to make its mark in this field as well. In a paper with a long list of authors, "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training", Apple has now officially published its own multimodal large model research: a family of multimodal LLMs with up to 30B parameters.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Brandon McKinzie◦, Zhe Gan◦, Jean-Philippe Fauconnier⋆, Sam Dodge⋆, Bowen Zhang⋆, Philipp Dufter⋆, Dhruti Shah⋆, Xianzhi Du⋆, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch⋆, Alexander Toshev†, and Yinfei Yang†

Apple
bmckinzie@apple.com, zhe.gan@apple.com

◦ First authors; ⋆ Core authors; † Senior authors
Abstract. In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
1 Introduction
In recent years, the research community has achieved impressive progress in language modeling and image understanding. Thanks to the availability of large-scale image-text data and compute at scale, we have seen the emergence of highly performant Large Language Models (LLMs) [9, 10, 19, 21, 26, 92, 93, 103, 108, 110, 117, 130] and Vision Foundation Models [40, 88, 91] that have become the de-facto standard for the majority of language and image understanding problems.
[Figure 1: four example panels of MM1-30B outputs: (a) object counts returned as JSON (e.g. {"dog": 2, "frisbee": 1}), (b) OCR of red-circled image regions ("Hyde Park", "no parking anytime", "Raffaello", "Rue Saint-Paul"), (c) common-sense weight estimates for furniture (e.g. refrigerator: 200 to 300 pounds), and (d) simple arithmetic over counted objects (e.g. total: 4 + 1 = 5).]
Fig. 1: MM1 can perform in-context predictions thanks to its large-scale multimodal pre-training. This allows MM1 to (a) count objects and follow custom formatting, (b) refer to parts of the images and perform OCR, (c) demonstrate common-sense and word knowledge about everyday objects, and (d) perform basic math functions. Images are from the COCO 2014 validation set [72].
Given the above developments, an area of multimodal foundation models has
emerged that marries the above advances into a single model achieving superior
capabilities. In particular, Multimodal Large Language Models (MLLMs) are
large-scale foundation models that consume image and text data and produce
text [28, 67, 79, 111]. After the rise of LLMs, MLLMs are emerging as the next
frontier in foundation models.
When it comes to transparency, existing MLLMs fall into two categories:
closed models [1, 107] and open models [3–5, 77, 90]. In the former category,
the models might be available for use, but little to nothing is known about
the data, model architecture, and training details. In the latter category, the
model parameters might be released together with a detailed description of data, model, and training configurations, thus allowing the community to build upon them.
However, most of the works, both open and closed, release close to nothing about
the process they have undergone to arrive at their algorithmic design choices,
especially regarding multimodal pre-training.
To further research in this area, we believe it is imperative to distill principles and lessons on how to build such models that might outlive concrete component implementations. Thus, in this paper, we document the MLLM building process and attempt to formulate design lessons that we hope are of use to the community.
In particular, our contributions are as follows. First, we perform ablations at small scale across (1) model architecture decisions and (2) pre-training data choices. We identify several interesting trends. On the modeling side, we see that design aspects are in the following order of importance: image resolution, visual encoder loss and capacity, and visual encoder pre-training data. Surprisingly, though, we find little evidence that architectural decisions of how visual data is fed into the LLM matter.

[Figure 2: a multi-image example. The user shows a photo of beers on a table and a menu, and asks how much all the beer on the table costs according to the menu. Emu-Chat-37B answers 15.99 with an inconsistent explanation, LLaVA-NeXT-34B hedges and estimates 44 under assumed quantities, while MM1-30B-Chat counts 2 beers at 6 each and answers 12 (2*6=12).]
Fig. 2: MM1 can follow instructions and reason across images. Example and images from VILA [71]; VILA answers correctly when prompted with chain-of-thought.
Further, we use three different types of pre-training data: image-caption, interleaved image-text, and text-only data. We see that when it comes to few-shot and text-only performance, interleaved and text-only training data is of paramount importance, while for zero-shot performance, caption data matters most. We demonstrate that these trends hold after Supervised Fine-Tuning (SFT), both on the evaluations used in the pre-training as well as on further benchmarks. This shows that capabilities and modeling decisions discovered during pre-training are retained after fine-tuning.
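The base ablation configuration described in Section 3.1 combines these three data types with a 45/45/10 mixing ratio. As a purely illustrative sketch of what such a weighted mixture could look like at the batch level, the snippet below samples a data source for each training batch according to fixed weights; the source names and the per-batch sampling strategy are assumptions made here for illustration, not the paper's actual data pipeline.

```python
import random

# Hypothetical mixture weights, mirroring the 45/45/10 split of the ablation baseline.
MIXTURE = [
    ("image_caption", 0.45),   # captioned images
    ("interleaved",   0.45),   # interleaved image-text documents
    ("text_only",     0.10),   # text-only data
]

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training batch according to the mixture weights."""
    names, weights = zip(*MIXTURE)
    return rng.choices(names, weights=weights, k=1)[0]

# Sanity check: long-run batch proportions approach the mixture weights.
rng = random.Random(0)
counts = {name: 0 for name, _ in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly {'image_caption': 4500, 'interleaved': 4500, 'text_only': 1000}
```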
Finally, we scale up our model by using larger LLMs, from 3B and 7B up to 30B parameters, and by exploring mixture-of-experts (MoE) models, from a 3B MoE with 64 experts to a 7B MoE with 32 experts. This leads to a family of performant models that outperforms most of the relevant works to the best of our knowledge. In particular, the pre-trained model MM1 is SOTA, performing better than Emu2 [106], Flamingo [3], and IDEFICS [47] on captioning and visual question answering (VQA) tasks in few-shot settings, both in small and large size regimes. The final models, after SFT, achieve competitive performance across 12 established multimodal benchmarks.
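For readers unfamiliar with the mixture-of-experts variants mentioned above, the following is a minimal sketch of a token-wise MoE feed-forward block with top-2 routing, written in PyTorch. It illustrates the general technique only; the expert count, routing scheme, and all dimensions below are toy values and are not taken from the MM1 architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEFFN(nn.Module):
    """Minimal mixture-of-experts feed-forward block with top-2 token routing.
    Illustrative only; not the MM1 implementation."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])              # (num_tokens, d_model)
        logits = self.router(tokens)                     # (num_tokens, num_experts)
        top_w, top_i = logits.topk(2, dim=-1)            # each token picks its 2 best experts
        top_w = F.softmax(top_w, dim=-1)                 # normalize the two routing weights
        out = torch.zeros_like(tokens)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(1) * expert(tokens[mask])
        return out.reshape_as(x)

# Toy usage: 8 experts, far fewer than the 32 or 64 experts mentioned above.
layer = Top2MoEFFN(d_model=64, d_hidden=256, num_experts=8)
print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```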
Thanks to large-scale multimodal pre-training, as shown in Figures 1 and 2,
MM1 enjoys appealing properties such as in-context predictions, multi-image and
chain-of-thought reasoning. MM1 also enables strong few-shot learning capability
after instruction tuning. These strong results demonstrate that the presented
recipe for building MLLMs translates the design principles to a competitive
model at scale. We hope that these presented insights will remain relevant, even
as specific modeling components and data sources evolve.
2 Related Work
The type of MLLM considered in this work builds upon a strong pre-trained autoregressive LLM that consumes both text and visual tokens, the latter obtained via an image encoder [5, 17, 28, 45, 64, 76, 90]. Our approach is based on a decoder-only architecture, akin to Kosmos-1 [45].
Recent research has increasingly focused on visual instruction tuning on top of the pre-trained LLM [63]. Prominent examples include LLaVA(-1.5/NeXT) [74–76], MiniGPT-4 [133], mPLUG-Owl(-2/Doc) [124, 124, 124], Otter [60, 61], InstructBLIP [24], Honeybee [12], and SPHINX(-X) [36, 73], to name a few. There is also a rich body of literature on constructing instruction-tuning data [15, 37, 66, 114, 131] and on enabling MLLMs for referring and grounding [14, 57, 90, 116, 125, 129] and for image generation and editing [34, 54, 106].
The body of work that focuses on thorough ablations, in particular on the pre-training side, is relatively sparse. VILA [71] focuses on studying various components of multimodal pre-training, but falls short of providing optimization details or detailed pre-training evaluations. Emu2 [106], on the other hand, provides details regarding pre-training optimization parameters and base model results. However, they do not provide ablations that justify the various component decisions. IDEFICS [58] is another work that provides details regarding large-scale multimodal pre-training. However, their focus is primarily on closely replicating the closed-source Flamingo [3] model.
In contrast to these previous works, we aim to provide details regarding all components of our pre-training strategy, from hyperparameters to data to architecture. We also provide results for our base pre-trained models to help differentiate the impact of multimodal pre-training vs. instruction tuning. Furthermore, we provide extensive ablations on the precise impacts of decisions regarding visual encoders, vision-language connectors, and pre-training data mixture.
3 Recipe for Building MM1
Building performant MLLMs is a highly empirical endeavor. Although the high-level architectural design and training procedure are clear, their concrete form and execution are not. In this work, we present details of the ablations we have performed to arrive at a performant model. We explore three major axes of design decisions:
– Architecture: We investigate different pre-trained image encoders and explore varying ways of connecting LLMs with these encoders.
– Data: We consider different types of data and their relative mixture weights.
– Training Procedure: We explore how to train the MLLM, including the hyperparameters and which parts of the model to train at which stage.
3.1 Empirical Setup for Ablations
In order to identify good choices along each of the above axes, we need an efficient way to assess model performance. As training a large MLLM can take substantial resources, we utilize a simplified setup for ablations.
[Figure 3: ablation overview. Left, model ablations: which image encoder to use (pre-trained with a contrastive vs. reconstructive loss), image resolution (larger images, multi-scale inputs), and the vision-language connector type (average pooling, attention pooling, C-Abstractor) that feeds visual tokens into the decoder-only LLM. Right, data ablations: sources and composition of the data (image+text, interleaved, text-only, synthetic), how to combine the various data and their mixing ratios, use of text-only data, and training hyperparameters as we scale.]
Fig. 3: Left: Model ablations: what visual encoder to use, how to feed rich visual data, and how to connect the visual representation to the LLM. Right: Data ablations: type of data, and their mixture.
More concretely, we use a smaller base configuration of our model that we ablate from. We modify one component at a time, either an architectural module or a data source, and assess the impact of the design choice for each of these components. This allows us to arrive at the final model-data configuration that we scale up, both in terms of model parameters and training time. The base configuration for ablations is as follows (a minimal sketch of this configuration follows the list):
– Image Encoder: A ViT-L/14 [27] model trained with a CLIP loss [91] on
DFN-5B [31] and VeCap-300M [56]; images of size 336×336.
– Vision-Language Connector: C-Abstractor [12] with 144 image tokens.
– Pre-training Data: A mix of captioned images (45%), interleaved image-
text documents (45%), and text-only (10%) data.
– Language Model: A 1.2B transformer decoder-only language model.
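For readability, the list above can be condensed into a single configuration object. The sketch below is only a restatement of those choices; the dataclass and its field names are invented here for illustration and do not correspond to any released MM1 code.

```python
from dataclasses import dataclass, field

@dataclass
class AblationBaseConfig:
    """Base configuration used for the MM1 ablations (field names are illustrative)."""
    # Image encoder: ViT-L/14 trained with a CLIP loss on DFN-5B and VeCap-300M.
    image_encoder: str = "ViT-L/14 (CLIP, DFN-5B + VeCap-300M)"
    image_resolution: int = 336                      # 336x336 input images
    # Vision-language connector: C-Abstractor producing 144 image tokens.
    vl_connector: str = "C-Abstractor"
    num_image_tokens: int = 144
    # Pre-training data mixture (fractions sum to 1.0).
    data_mixture: dict = field(default_factory=lambda: {
        "image_caption": 0.45,
        "interleaved_image_text": 0.45,
        "text_only": 0.10,
    })
    # Language model: 1.2B-parameter decoder-only transformer.
    llm: str = "1.2B decoder-only transformer"

config = AblationBaseConfig()
print(config.data_mixture)
```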
To evaluate the different design decisions, we use zero-shot and few-shot (4- and 8-shot) performance on a variety of VQA and captioning tasks: COCO Captioning [18], NoCaps [2], TextCaps [104], VQAv2 [38], TextVQA [105], VizWiz [39], GQA [46], and OK-VQA [82].
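As an aside on the evaluation protocol, the following is a minimal sketch of how an n-shot VQA prompt could be assembled from in-context examples, with a placeholder token standing in for each image. The prompt template, the placeholder token, and the helper function are hypothetical; the paper does not specify its exact prompt format in this section.

```python
from typing import List, Tuple

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where image features are spliced in

def build_few_shot_vqa_prompt(shots: List[Tuple[str, str]], question: str) -> str:
    """Assemble an n-shot VQA prompt: each shot is a (question, answer) pair tied to an
    image; the final question is left unanswered for the model to complete."""
    parts = [f"{IMAGE_PLACEHOLDER} Question: {q} Short answer: {a}" for q, a in shots]
    parts.append(f"{IMAGE_PLACEHOLDER} Question: {question} Short answer:")
    return "\n".join(parts)

# Example: a 4-shot prompt (the corresponding images would be fed through the encoder).
demo_shots = [
    ("What color is the bus?", "red"),
    ("How many dogs are there?", "2"),
    ("What is the man holding?", "umbrella"),
    ("Is it raining?", "yes"),
]
print(build_few_shot_vqa_prompt(demo_shots, "What sport is being played?"))
```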
3.2 Model Architecture Ablations
In this work, we analyze components that enable an LLM to process visual data.
Specifically, we investigate (1) how to best pre-train a visual encoder, and (2)
how to bridge the visual features to the space of the LLM (see Figure 3, left).
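As one concrete instance of such a bridge, the sketch below implements the simplest connector variant listed among the ablations (average pooling to a fixed number of image tokens, followed by a linear projection into the LLM embedding space). It is a generic illustration under assumed dimensions, not the C-Abstractor used in the MM1 baseline.

```python
import torch
import torch.nn as nn

class AvgPoolConnector(nn.Module):
    """Average-pooling vision-language connector: reduce the encoder's patch grid to a
    fixed number of image tokens, then project them into the LLM's embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int):
        super().__init__()
        side = int(num_image_tokens ** 0.5)
        assert side * side == num_image_tokens, "assumes a square token grid, e.g. 144 = 12x12"
        self.pool = nn.AdaptiveAvgPool2d((side, side))
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) on a square grid,
        # e.g. 24x24 = 576 patches for a ViT-L/14 at 336x336 resolution.
        b, n, c = patch_features.shape
        grid = int(n ** 0.5)
        x = patch_features.transpose(1, 2).reshape(b, c, grid, grid)  # (b, c, h, w)
        x = self.pool(x)                                              # (b, c, side, side)
        x = x.flatten(2).transpose(1, 2)                              # (b, num_image_tokens, c)
        return self.proj(x)                                           # (b, num_image_tokens, llm_dim)

# Example: 576 ViT patch embeddings pooled down to 144 image tokens for the LLM.
connector = AvgPoolConnector(vision_dim=1024, llm_dim=2048, num_image_tokens=144)
print(connector(torch.randn(2, 576, 1024)).shape)  # torch.Size([2, 144, 2048])
```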
Image Encoder Pre-training. Most MLLMs use a CLIP pre-trained image
encoder [24, 74, 76, 124], while recent works also started to explore vision-only
self-supervised models, such as DINOv2 [73, 109], as the image encoder. Similar
to these prior works, we find that the choice of the pre-trained image encoder
can substantially impact downstream results both after multimodal pre-training
and after instruction tuning. Here, we primarily ablate the importance of image
resolution and image encoder pre-training objective. Note that unlike the rest