GenAI technology and products in the multimodal space are currently booming, with OpenAI's Sora as the most prominent example, and Apple naturally wants to make its mark in this field as well. In a paper with a long list of authors, "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training", Apple has now officially published its own multimodal large model research: a family of multimodal LLMs with up to 30B parameters.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Brandon McKinzie◦, Zhe Gan◦, Jean-Philippe Fauconnier⋆, Sam Dodge⋆, Bowen Zhang⋆, Philipp Dufter⋆, Dhruti Shah⋆, Xianzhi Du⋆, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch⋆, Alexander Toshev†, and Yinfei Yang†

Apple
bmckinzie@apple.com, zhe.gan@apple.com

◦ First authors; ⋆ Core authors; † Senior authors
Abstract. In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
1 Introduction
In recent years, the research community has achieved impressive progress in language modeling and image understanding. Thanks to the availability of large-scale image-text data and compute at scale, we have seen the emergence of highly performant Large Language Models (LLMs) [9, 10, 19, 21, 26, 92, 93, 103, 108, 110, 117, 130] and Vision Foundation Models [40, 88, 91] that have become the de-facto standard for the majority of language and image understanding problems.
[Figure 1: four example panels of MM1-30B outputs: (a) object counts returned as JSON (e.g. {"dog": 2, "frisbee": 1}), (b) OCR of red-circled image regions ("Hyde Park", "no parking anytime", "Raffaello", "Rue Saint-Paul"), (c) common-sense weight estimates for furniture (e.g. refrigerator: 200 to 300 pounds), and (d) simple arithmetic over counted objects (e.g. total: 4 + 1 = 5).]
Fig. 1: MM1 can perform in-context predictions thanks to its large-scale multimodal pre-training. This allows MM1 to (a) count objects and follow custom formatting, (b) refer to parts of the images and perform OCR, (c) demonstrate common-sense and word knowledge about everyday objects, and (d) perform basic math functions. Images are from the COCO 2014 validation set [72].
Given the above developments, an area of multimodal foundation models has
emerged that marries the above advances into a single model achieving superior
capabilities. In particular, Multimodal Large Language Models (MLLMs) are
large-scale foundation models that consume image and text data and produce
text [28, 67, 79, 111]. After the rise of LLMs, MLLMs are emerging as the next
frontier in foundation models.
When it comes to transparency, existing MLLMs fall into two categories:
closed models [1, 107] and open models [3–5, 77, 90]. In the former category,
the models might be available for use, but little to nothing is known about
the data, model architecture, and training details. In the latter category, the
model parameters might be released together with a detailed description of data, model, and training configurations, thus allowing the community to build upon them.
However, most of the works, both open and closed, release close to nothing about
the process they have undergone to arrive at their algorithmic design choices,
especially regarding multimodal pre-training.
To further research in this area, we believe it is imperative to distill principles and lessons on how to build such models that might outlive concrete component implementations. Thus, in this paper, we document the MLLM building process and attempt to formulate design lessons that we hope are of use to the community.
In particular, our contributions are as follows. First, we perform ablations at small scale across (1) model architecture decisions and (2) pre-training data choices. We identify several interesting trends. On the modeling side, we see that design aspects are in the following order of importance: image resolution, visual encoder loss and capacity, and visual encoder pre-training data. Surprisingly, though, we find little evidence that architectural decisions of how visual data is fed into the LLM matter.

[Figure 2: a multi-image example. The user shows a photo of beers on a table and a menu, and asks how much all the beer on the table costs according to the menu. Emu-Chat-37B answers 15.99 with an inconsistent explanation, LLaVA-NeXT-34B hedges and estimates 44 under assumed quantities, while MM1-30B-Chat counts 2 beers at 6 each and answers 12 (2*6=12).]
Fig. 2: MM1 can follow instructions and reason across images. Example and images from VILA [71]; VILA answers correctly when prompted with chain-of-thought.
Further, we use three different types of pre-training data: image-caption, interleaved image-text, and text-only data. We see that when it comes to few-shot and text-only performance, interleaved and text-only training data is of paramount importance, while for zero-shot performance, caption data matters most. We demonstrate that these trends hold after Supervised Fine-Tuning (SFT), both on the evaluations used in the pre-training as well as on further benchmarks. This shows that capabilities and modeling decisions discovered during pre-training are retained after fine-tuning.
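The base ablation configuration described in Section 3.1 combines these three data types with a 45/45/10 mixing ratio. As a purely illustrative sketch of what such a weighted mixture could look like at the batch level, the snippet below samples a data source for each training batch according to fixed weights; the source names and the per-batch sampling strategy are assumptions made here for illustration, not the paper's actual data pipeline.

```python
import random

# Hypothetical mixture weights, mirroring the 45/45/10 split of the ablation baseline.
MIXTURE = [
    ("image_caption", 0.45),   # captioned images
    ("interleaved",   0.45),   # interleaved image-text documents
    ("text_only",     0.10),   # text-only data
]

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training batch according to the mixture weights."""
    names, weights = zip(*MIXTURE)
    return rng.choices(names, weights=weights, k=1)[0]

# Sanity check: long-run batch proportions approach the mixture weights.
rng = random.Random(0)
counts = {name: 0 for name, _ in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly {'image_caption': 4500, 'interleaved': 4500, 'text_only': 1000}
```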
Finally, we scale up our model by using larger LLMs, from 3B and 7B up to 30B parameters, and by exploring mixture-of-experts (MoE) models, from a 3B MoE with 64 experts to a 7B MoE with 32 experts. This leads to a family of performant models that outperforms most of the relevant works to the best of our knowledge. In particular, the pre-trained model MM1 is SOTA, performing better than Emu2 [106], Flamingo [3], and IDEFICS [47] on captioning and visual question answering (VQA) tasks in few-shot settings, both in small and large size regimes. The final models, after SFT, achieve competitive performance across 12 established multimodal benchmarks.
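For readers unfamiliar with the mixture-of-experts variants mentioned above, the following is a minimal sketch of a token-wise MoE feed-forward block with top-2 routing, written in PyTorch. It illustrates the general technique only; the expert count, routing scheme, and all dimensions below are toy values and are not taken from the MM1 architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEFFN(nn.Module):
    """Minimal mixture-of-experts feed-forward block with top-2 token routing.
    Illustrative only; not the MM1 implementation."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])              # (num_tokens, d_model)
        logits = self.router(tokens)                     # (num_tokens, num_experts)
        top_w, top_i = logits.topk(2, dim=-1)            # each token picks its 2 best experts
        top_w = F.softmax(top_w, dim=-1)                 # normalize the two routing weights
        out = torch.zeros_like(tokens)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(1) * expert(tokens[mask])
        return out.reshape_as(x)

# Toy usage: 8 experts, far fewer than the 32 or 64 experts mentioned above.
layer = Top2MoEFFN(d_model=64, d_hidden=256, num_experts=8)
print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```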
Thanks to large-scale multimodal pre-training, as shown in Figures 1 and 2,
MM1 enjoys appealing properties such as in-context predictions, multi-image and
chain-of-thought reasoning. MM1 also enables strong few-shot learning capability
after instruction tuning. These strong results demonstrate that the presented
recipe for building MLLMs translates the design principles to a competitive
model at scale. We hope that these presented insights will remain relevant, even
as specific modeling components and data sources evolve.
2 Related Work
The type of MLLM considered in this work builds upon a strong pre-trained autoregressive LLM that consumes both text and visual tokens, the latter obtained via an image encoder [5, 17, 28, 45, 64, 76, 90]. Our approach is based on a decoder-only architecture, akin to Kosmos-1 [45].
Recent research has increasingly focused on visual instruction tuning on top of the pre-trained LLM [63]. Prominent examples include LLaVA(-1.5/NeXT) [74–76], MiniGPT-4 [133], mPLUG-Owl(-2/Doc) [124, 124, 124], Otter [60, 61], InstructBLIP [24], Honeybee [12], and SPHINX(-X) [36, 73], to name a few. There is also a rich body of literature on constructing instruction-tuning data [15, 37, 66, 114, 131] and on enabling MLLMs for referring and grounding [14, 57, 90, 116, 125, 129] and for image generation and editing [34, 54, 106].
The body of work that focuses on thorough ablations, in particular on the pre-training side, is relatively sparse. VILA [71] focuses on studying various components of multimodal pre-training, but falls short of providing optimization details or detailed pre-training evaluations. Emu2 [106], on the other hand, provides details regarding pre-training optimization parameters and base model results. However, they do not provide ablations that justify the various component decisions. IDEFICS [58] is another work that provides details regarding large-scale multimodal pre-training. However, their focus is primarily on closely replicating the closed-source Flamingo [3] model.
In contrast to these previous works, we aim to provide details regarding all components of our pre-training strategy, from hyperparameters to data to architecture. We also provide results for our base pre-trained models to help differentiate the impact of multimodal pre-training vs. instruction tuning. Furthermore, we provide extensive ablations on the precise impacts of decisions regarding visual encoders, vision-language connectors, and pre-training data mixture.
3 Recipe for Building MM1
Building performant MLLMs is a highly empirical endeavor. Although the high-level architectural design and training procedure are clear, their concrete form and execution are not. In this work, we present details of the ablations we have performed to arrive at a performant model. We explore three major axes of design decisions:
– Architecture: We investigate different pre-trained image encoders and explore varying ways of connecting LLMs with these encoders.
– Data: We consider different types of data and their relative mixture weights.
– Training Procedure: We explore how to train the MLLM, including the hyperparameters and which parts of the model to train at which stage.
3.1 Empirical Setup for Ablations
In order to identify good choices along each of the above axes, we need an efficient way to assess model performance. As training a large MLLM can take substantial resources, we utilize a simplified setup for ablations.
[Figure 3: ablation overview. Left, model ablations: which image encoder to use (pre-trained with a contrastive vs. reconstructive loss), image resolution (larger images, multi-scale inputs), and the vision-language connector type (average pooling, attention pooling, C-Abstractor) that feeds visual tokens into the decoder-only LLM. Right, data ablations: sources and composition of the data (image+text, interleaved, text-only, synthetic), how to combine the various data and their mixing ratios, use of text-only data, and training hyperparameters as we scale.]
Fig. 3: Left: Model ablations: what visual encoder to use, how to feed rich visual data, and how to connect the visual representation to the LLM. Right: Data ablations: type of data, and their mixture.
More concretely, we use a smaller base configuration of our model that we ablate from. We modify one component at a time, either an architectural module or a data source, and assess the impact of the design choice for each of these components. This allows us to arrive at the final model-data configuration that we scale up, both in terms of model parameters and training time. The base configuration for ablations is as follows (a minimal sketch of this configuration follows the list):
– Image Encoder: A ViT-L/14 [27] model trained with a CLIP loss [91] on
DFN-5B [31] and VeCap-300M [56]; images of size 336×336.
– Vision-Language Connector: C-Abstractor [12] with 144 image tokens.
– Pre-training Data: A mix of captioned images (45%), interleaved image-
text documents (45%), and text-only (10%) data.
– Language Model: A 1.2B transformer decoder-only language model.
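For readability, the list above can be condensed into a single configuration object. The sketch below is only a restatement of those choices; the dataclass and its field names are invented here for illustration and do not correspond to any released MM1 code.

```python
from dataclasses import dataclass, field

@dataclass
class AblationBaseConfig:
    """Base configuration used for the MM1 ablations (field names are illustrative)."""
    # Image encoder: ViT-L/14 trained with a CLIP loss on DFN-5B and VeCap-300M.
    image_encoder: str = "ViT-L/14 (CLIP, DFN-5B + VeCap-300M)"
    image_resolution: int = 336                      # 336x336 input images
    # Vision-language connector: C-Abstractor producing 144 image tokens.
    vl_connector: str = "C-Abstractor"
    num_image_tokens: int = 144
    # Pre-training data mixture (fractions sum to 1.0).
    data_mixture: dict = field(default_factory=lambda: {
        "image_caption": 0.45,
        "interleaved_image_text": 0.45,
        "text_only": 0.10,
    })
    # Language model: 1.2B-parameter decoder-only transformer.
    llm: str = "1.2B decoder-only transformer"

config = AblationBaseConfig()
print(config.data_mixture)
```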
To evaluate the different design decisions, we use zero-shot and few-shot (4- and 8-shot) performance on a variety of VQA and captioning tasks: COCO Captioning [18], NoCaps [2], TextCaps [104], VQAv2 [38], TextVQA [105], VizWiz [39], GQA [46], and OK-VQA [82].
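As an aside on the evaluation protocol, the following is a minimal sketch of how an n-shot VQA prompt could be assembled from in-context examples, with a placeholder token standing in for each image. The prompt template, the placeholder token, and the helper function are hypothetical; the paper does not specify its exact prompt format in this section.

```python
from typing import List, Tuple

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where image features are spliced in

def build_few_shot_vqa_prompt(shots: List[Tuple[str, str]], question: str) -> str:
    """Assemble an n-shot VQA prompt: each shot is a (question, answer) pair tied to an
    image; the final question is left unanswered for the model to complete."""
    parts = [f"{IMAGE_PLACEHOLDER} Question: {q} Short answer: {a}" for q, a in shots]
    parts.append(f"{IMAGE_PLACEHOLDER} Question: {question} Short answer:")
    return "\n".join(parts)

# Example: a 4-shot prompt (the corresponding images would be fed through the encoder).
demo_shots = [
    ("What color is the bus?", "red"),
    ("How many dogs are there?", "2"),
    ("What is the man holding?", "umbrella"),
    ("Is it raining?", "yes"),
]
print(build_few_shot_vqa_prompt(demo_shots, "What sport is being played?"))
```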
3.2 Model Architecture Ablations
In this work, we analyze components that enable an LLM to process visual data.
Specifically, we investigate (1) how to best pre-train a visual encoder, and (2)
how to bridge the visual features to the space of the LLM (see Figure 3, left).
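As one concrete instance of such a bridge, the sketch below implements the simplest connector variant listed among the ablations (average pooling to a fixed number of image tokens, followed by a linear projection into the LLM embedding space). It is a generic illustration under assumed dimensions, not the C-Abstractor used in the MM1 baseline.

```python
import torch
import torch.nn as nn

class AvgPoolConnector(nn.Module):
    """Average-pooling vision-language connector: reduce the encoder's patch grid to a
    fixed number of image tokens, then project them into the LLM's embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int):
        super().__init__()
        side = int(num_image_tokens ** 0.5)
        assert side * side == num_image_tokens, "assumes a square token grid, e.g. 144 = 12x12"
        self.pool = nn.AdaptiveAvgPool2d((side, side))
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) on a square grid,
        # e.g. 24x24 = 576 patches for a ViT-L/14 at 336x336 resolution.
        b, n, c = patch_features.shape
        grid = int(n ** 0.5)
        x = patch_features.transpose(1, 2).reshape(b, c, grid, grid)  # (b, c, h, w)
        x = self.pool(x)                                              # (b, c, side, side)
        x = x.flatten(2).transpose(1, 2)                              # (b, num_image_tokens, c)
        return self.proj(x)                                           # (b, num_image_tokens, llm_dim)

# Example: 576 ViT patch embeddings pooled down to 144 image tokens for the LLM.
connector = AvgPoolConnector(vision_dim=1024, llm_dim=2048, num_image_tokens=144)
print(connector(torch.randn(2, 576, 1024)).shape)  # torch.Size([2, 144, 2048])
```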
Image Encoder Pre-training. Most MLLMs use a CLIP pre-trained image
encoder [24, 74, 76, 124], while recent works also started to explore vision-only
self-supervised models, such as DINOv2 [73, 109], as the image encoder. Similar
to these prior works, we find that the choice of the pre-trained image encoder
can substantially impact downstream results both after multimodal pre-training
and after instruction tuning. Here, we primarily ablate the importance of image
resolution and image encoder pre-training objective. Note that unlike the rest