Google's newly released technical report describes the open model family Gemma in detail. Built from the research and technology behind Google's Gemini models and trained on up to 6 trillion tokens of text, Gemma demonstrates strong generalist capabilities together with advanced understanding and reasoning. The family comes in two sizes, a 7-billion-parameter model and a 2-billion-parameter model, each optimized for different deployment needs and compute constraints. Google releases both pretrained and fine-tuned checkpoints along with an open-source codebase for inference and serving, so that developers can build and deploy efficiently on GPU, TPU, CPU, and on-device applications. Gemma improves performance across several domains, including question answering, commonsense reasoning, mathematics and science, and coding. The release is also intended to advance research on current instruction-tuning regimes and to foster safer, more responsible model-development methodologies.
2024-02-21
Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Google DeepMind¹

¹ See Contributions and Acknowledgments section for full author list. Please send correspondence to [email protected].
This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research
and technology used to create Gemini models. Gemma models demonstrate strong performance across
academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models
(2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma
outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive
evaluations of safety and responsibility aspects of the models, alongside a detailed description of model
development. We believe the responsible release of LLMs is critical for improving the safety of frontier
models, and for enabling the next wave of LLM innovations.
Introduction
We present Gemma, a family of open models
based on Google’s Gemini models (Gemini Team,
2023).
We trained Gemma models on up to 6T to-
kens of text, using similar architectures, data,
and training recipes as the Gemini model family.
Like Gemini, these models achieve strong gener-
alist capabilities in text domains, alongside state-
of-the-art understanding and reasoning skills at
scale. With this work, we release both pre-trained
and fine-tuned checkpoints, as well as an open-
source codebase for inference and serving.
Gemma comes in two sizes: a 7 billion param-
eter model for efficient deployment and develop-
ment on GPU and TPU, and a 2 billion param-
eter model for CPU and on-device applications.
Each size is designed to address different compu-
tational constraints, applications, and developer
requirements. At each scale, we release raw, pre-
trained checkpoints, as well as checkpoints fine-
tuned for dialogue, instruction-following, help-
fulness, and safety. We thoroughly evaluate the
shortcomings of our models on a suite of quantita-
tive and qualitative benchmarks. We believe the
release of both pretrained and fine-tuned check-
points will enable thorough research and inves-
tigation into the impact of current instruction-
tuning regimes, as well as the development of
increasingly safe and responsible model develop-
ment methodologies.
Gemma advances state-of-the-art performance
relative to comparable-scale (and some larger)
open models (Almazrouei et al., 2023; Jiang
et al., 2023; Touvron et al., 2023a,b) across a
wide range of domains including both automated
benchmarks and human evaluation. Example do-
mains include question answering (Clark et al.,
2019; Kwiatkowski et al., 2019), commonsense
reasoning (Sakaguchi et al., 2019; Suzgun et al.,
2022), mathematics and science (Cobbe et al.,
2021; Hendrycks et al., 2020), and coding (Austin
et al., 2021; Chen et al., 2021). See complete de-
tails in the Evaluation section.
Like Gemini, Gemma builds on recent work
on sequence models (Sutskever et al., 2014) and
transformers (Vaswani et al., 2017), deep learn-
ing methods based on neural networks (LeCun
et al., 2015), and techniques for large-scale train-
ing on distributed systems (Barham et al., 2022;
Dean et al., 2012; Roberts et al., 2023). Gemma
also builds on Google’s long history of open mod-
els and ecosystems, including Word2Vec (Mikolov
et al., 2013), the Transformer (Vaswani et al.,
2017), BERT (Devlin et al., 2018), and T5 (Raffel
et al., 2019) and T5X (Roberts et al., 2022).
We believe the responsible release of LLMs is
critical for improving the safety of frontier models,
for ensuring equitable access to this breakthrough
technology, for enabling rigorous evaluation and
analysis of current techniques, and for enabling
the development of the next wave of innovations.

Figure 1 | Language understanding and generation performance of Gemma 7B across different capabilities compared to similarly sized open models. We group together standard academic benchmark evaluations by capability and average the respective scores; see Table 6 for a detailed breakdown of performance. (Bar chart; scores from 0 to 80 for Question Answering, Reasoning, Math / Science, and Coding; models compared: LLaMA 2 (7B), LLaMA 2 (13B), Mistral (7B), Gemma (7B).)
While thorough testing of all Gemma models has
been conducted, testing cannot cover all appli-
cations and scenarios in which Gemma may be
used. With this in mind, all Gemma users should
conduct rigorous safety testing specific to their
use case before deployment or use. More details
on our approach to safety can be found in the Responsible Deployment section.
In this technical report, we provide a detailed
overview of the model architecture, training in-
frastructure, and pretraining and fine-tuning
recipes for Gemma, followed by thorough eval-
uations of all checkpoints across a wide variety
of quantitative and qualitative benchmarks, as
well as both standard academic benchmarks and
human-preference evaluations. We then discuss
in detail our approach to safe and responsible de-
ployment. Finally, we outline the broader impli-
cations of Gemma, its limitations and advantages,
and conclusions.
Model Architecture
The Gemma model architecture is based on the
transformer decoder (Vaswani et al., 2017). The
core parameters of the architecture are summarized in Table 1. Models are trained on a context length of 8192 tokens.

| Parameters              | 2B     | 7B     |
|-------------------------|--------|--------|
| d_model                 | 2048   | 3072   |
| Layers                  | 18     | 28     |
| Feedforward hidden dims | 32768  | 49152  |
| Num heads               | 8      | 16     |
| Num KV heads            | 1      | 16     |
| Head size               | 256    | 256    |
| Vocab size              | 256128 | 256128 |

Table 1 | Key model parameters.

| Model | Embedding Parameters | Non-embedding Parameters |
|-------|----------------------|--------------------------|
| 2B    | 524,550,144          | 1,981,884,416            |
| 7B    | 786,825,216          | 7,751,248,896            |

Table 2 | Parameter counts for both sizes of Gemma models.
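As a quick consistency check on these tables, the embedding parameter counts follow directly from the vocabulary size and model width (embedding parameters = vocab size × d_model, counted once because input and output embeddings are shared, as noted under RoPE Embeddings below); the short Python sketch below reproduces the Table 2 figures.

```python
# Embedding parameters = vocab_size * d_model; the matrix is counted once
# because input and output embeddings are shared.
configs = {"2B": {"d_model": 2048, "vocab_size": 256128},
           "7B": {"d_model": 3072, "vocab_size": 256128}}

for name, cfg in configs.items():
    embedding_params = cfg["vocab_size"] * cfg["d_model"]
    print(name, f"{embedding_params:,}")  # 2B -> 524,550,144; 7B -> 786,825,216
```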
We also utilize several improvements proposed
after the original transformer paper. Below, we
list the included improvements:
Multi-Query Attention (Shazeer, 2019). Notably, the 7B model uses multi-head attention while the 2B checkpoints use multi-query attention (with num_kv_heads = 1), based on ablation studies showing that the respective attention variants improved performance at each scale (Shazeer, 2019).
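To make the distinction concrete, here is a minimal, illustrative sketch (in JAX) of attention with a configurable number of key/value heads: with num_kv_heads equal to the number of query heads it reduces to ordinary multi-head attention (the 7B configuration), and with num_kv_heads = 1 it becomes multi-query attention (the 2B configuration). The tensor layout and names are our own assumptions, not code from the Gemma release, and causal masking is omitted for brevity.

```python
import jax.numpy as jnp
from jax.nn import softmax

def attention(q, k, v):
    """q: [num_heads, seq, head_dim]; k, v: [num_kv_heads, seq, head_dim]."""
    num_heads, _, head_dim = q.shape
    num_kv_heads = k.shape[0]
    # Multi-query attention shares a single K/V head across all query heads,
    # so we broadcast (repeat) K and V up to the number of query heads.
    k = jnp.repeat(k, num_heads // num_kv_heads, axis=0)
    v = jnp.repeat(v, num_heads // num_kv_heads, axis=0)
    scores = jnp.einsum("hqd,hkd->hqk", q, k) / jnp.sqrt(head_dim)
    return jnp.einsum("hqk,hkd->hqd", softmax(scores, axis=-1), v)
```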
RoPE Embeddings (Su et al., 2021). Rather than
using absolute positional embeddings, we use ro-
tary positional embeddings in each layer; we also
share embeddings across our inputs and outputs
to reduce model size.
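For reference, a minimal sketch of rotary position embeddings applied to a query or key tensor is shown below, following the formulation of Su et al. (2021). The base frequency of 10,000 is the value from that paper and is assumed here; the report does not state Gemma's exact RoPE hyperparameters.

```python
import jax.numpy as jnp

def apply_rope(x, positions, base=10000.0):
    """x: [seq, head_dim] queries or keys; positions: [seq] token positions."""
    head_dim = x.shape[-1]
    # One rotation frequency per pair of feature dimensions.
    freqs = base ** (-jnp.arange(0, head_dim, 2) / head_dim)
    angles = positions[:, None] * freqs[None, :]        # [seq, head_dim // 2]
    cos, sin = jnp.cos(angles), jnp.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                  # split into (even, odd) pairs
    # Rotate each 2-D pair by its position-dependent angle.
    rotated = jnp.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return rotated.reshape(x.shape)
```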
GeGLU Activations (Shazeer, 2020). The stan-
dard ReLU non-linearity is replaced by the GeGLU
activation function.
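For reference, the sketch below shows the common formulation of a GeGLU feed-forward block (Shazeer, 2020), in which a GELU-activated "gate" projection multiplies a parallel linear projection; the exact parameterization and dimensions used in Gemma are not spelled out in this excerpt, so the shapes here are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def geglu_ffn(x, w_gate, w_up, w_down):
    """GeGLU feed-forward block.

    x: [seq, d_model]; w_gate, w_up: [d_model, d_ff]; w_down: [d_ff, d_model].
    """
    gate = jax.nn.gelu(x @ w_gate)   # nonlinear gate path (GELU instead of ReLU)
    up = x @ w_up                    # parallel linear path
    return (gate * up) @ w_down      # elementwise gating, then project back down
```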
Normalizer Location. We normalize both the in-
put and the output of each transformer sub-layer,
a deviation from the standard practice of solely
normalizing one or the other. We use RMSNorm
(Zhang and Sennrich, 2019) as our normalization
layer.
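A minimal sketch of RMSNorm and of normalizing both the input and the output of a sub-layer is given below; the epsilon value and the exact placement of the residual connection are our assumptions, as the report does not specify them.

```python
import jax.numpy as jnp

def rms_norm(x, scale, eps=1e-6):
    """RMSNorm (Zhang and Sennrich, 2019): rescale by the root-mean-square of the features."""
    rms = jnp.sqrt(jnp.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * scale

def sublayer(x, fn, in_scale, out_scale):
    # Normalize both the input and the output of the sub-layer `fn`
    # (e.g. attention or the feed-forward block), then add the residual.
    return x + rms_norm(fn(rms_norm(x, in_scale)), out_scale)
```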
Training Infrastructure
We train the Gemma models using TPUv5e;
TPUv5e are deployed in pods of 256 chips, con-
figured into a 2D torus of 16 x 16 chips. For the
7B model, we train across 16 pods, totaling 4096 TPUv5e chips. We pretrain the 2B model
across 2 pods, totaling 512 TPUv5e. Within a pod,
we use 16-way model sharding and 16-way data
replication for the 7B model. For the 2B, we sim-
ply use 256-way data replication. The optimizer
state is further sharded using techniques simi-
lar to ZeRO-3. Beyond a pod, we perform a data-replica reduce over the data-center network, using the Pathways approach of Barham et al. (2022).
As in Gemini, we leverage the 'single controller'
programming paradigm of Jax (Roberts et al.,
2023) and Pathways (Barham et al., 2022) to
simplify the development process by enabling a
single Python process to orchestrate the entire
training run; we also leverage the GSPMD par-
titioner (Xu et al., 2021) for the training step
computation and the MegaScale XLA compiler
(XLA, 2019).
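To illustrate the within-pod layout described above (16-way model sharding combined with 16-way data replication for the 7B model), the hedged sketch below builds a two-dimensional JAX device mesh and places a small, hypothetical weight matrix so that it is sharded along the model axis and replicated along the data axis. The axis names, array shapes, and the fallback to however many devices are locally visible are all our own choices, not the Gemma training code.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# On a TPUv5e pod this mesh would be 16 x 16 ("data" x "model");
# here we simply arrange whatever devices are visible into a 2-D mesh.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard an illustrative weight matrix along the "model" axis and replicate
# it across the "data" axis (tensor parallelism within the model dimension,
# data parallelism across replicas).
weight = jax.device_put(jnp.zeros((1024, 4096)),
                        NamedSharding(mesh, P(None, "model")))
print(weight.sharding)
```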
Carbon Footprint
We estimate the carbon emissions from pretrain-
ing the Gemma models to be ∼131 tCO₂eq. This
value is calculated based on the hourly energy us-
age reported directly from our TPU datacenters;
we also scale this value to account for the addi-
tional energy expended to create and maintain
the data center, giving us the total energy usage
for our training experiments. We convert total
energy usage to carbon emissions by joining our
hourly energy usage against hourly per-cell car-
bon emission data reported by our data centers.
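As a toy illustration of this accounting, and with entirely made-up numbers rather than the actual measurements, the estimate amounts to summing metered hourly energy, scaled by a data-center overhead factor, multiplied by that hour's grid carbon intensity:

```python
# Purely hypothetical hourly figures, for illustration only.
hourly_energy_mwh = [2.0, 2.1, 1.9]    # metered TPU energy per hour (MWh)
overhead_factor = 1.1                  # assumed scaling for data-center overhead
hourly_intensity = [0.12, 0.10, 0.11]  # grid carbon intensity (tCO2eq per MWh)

emissions = sum(energy * overhead_factor * intensity
                for energy, intensity in zip(hourly_energy_mwh, hourly_intensity))
print(f"{emissions:.3f} tCO2eq")
```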
In addition, Google data centers are carbon
neutral, achieved through a combination of en-
ergy efficiency, renewable energy purchases, and
carbon offsets. This carbon neutrality also applies
to our experiments and the machines used to run
them.
Pretraining
Training Data
Gemma 2B and 7B are trained on 2T and 6T
tokens respectively of primarily-English data from
web documents, mathematics, and code. Unlike
Gemini, these models are not multimodal, nor are
they trained for state-of-the-art performance on
multilingual tasks.
We use a subset of the SentencePiece tokenizer
(Kudo and Richardson, 2018) of Gemini for com-
patibility. It splits digits, does not remove extra
whitespace, and relies on byte-level encodings for
unknown tokens, following the techniques used
for both (Chowdhery et al., 2022) and (Gemini
Team, 2023). The vocabulary size is 256k tokens.
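For illustration, the SentencePiece trainer exposes options matching the properties described above (digit splitting, preserved whitespace, and byte-level fallback for unknown characters). The snippet below is a hedged sketch that trains a small toy tokenizer with those settings; the corpus path, vocabulary size, and model prefix are placeholders, not the Gemma recipe.

```python
import sentencepiece as spm

# Toy tokenizer with the properties described above; paths and sizes are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_tokenizer",
    vocab_size=8000,
    split_digits=True,               # "2024" -> "2", "0", "2", "4"
    remove_extra_whitespaces=False,  # keep whitespace exactly as written
    byte_fallback=True,              # unknown characters fall back to raw bytes
)
sp = spm.SentencePieceProcessor(model_file="toy_tokenizer.model")
print(sp.encode("Gemma was trained on up to 6T tokens.", out_type=str))
```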
Filtering
We filter the pre-training dataset to reduce the
risk of unwanted or unsafe utterances, and filter
out certain personal information and other sen-
sitive data. This includes using both heuristics
and model-based classifiers to remove harmful or
low-quality content. Further, we filter all evalua-
tion sets from our pre-training data mixture, run
targeted contamination analyses to check against
evaluation set leakage, and reduce the risk of
recitation by minimizing proliferation of sensitive
outputs.
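The report does not describe the exact contamination check, but a common approach is n-gram overlap between training documents and held-out evaluation examples; the sketch below is an assumption-laden illustration of that idea, not the method actually used.

```python
def ngrams(text, n=8):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document, eval_ngrams, n=8):
    """Flag a training document that shares any n-gram with an evaluation set."""
    return bool(ngrams(document, n) & eval_ngrams)

# eval_ngrams would be the union of n-grams over all benchmark examples.
eval_ngrams = ngrams("the quick brown fox jumps over the lazy dog near the old river bank")
print(is_contaminated("we noticed the quick brown fox jumps over the lazy dog in a field", eval_ngrams))
```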
The final data mixture was determined through
a series of ablations on both the 2B and 7B mod-
els. Similar to the approach advocated in (Gemini
Team, 2023), we stage training to alter the cor-
pus mixture throughout training to increase the
weight of relevant, high-quality data towards the
end of training.
Instruction Tuning
We finetune Gemma 2B and 7B with supervised fine-tuning (SFT) on a mix of text-only, English-only synthetic and human-generated prompt-response pairs, followed by reinforcement learning from human feedback (RLHF), with the reward model trained on labelled English-only preference data and the policy based on a set of high-quality
prompts. We find that both stages are important
for improved performance on downstream auto-
matic evaluations and human preference evalua-
tions of model outputs.
Supervised Fine-Tuning
We selected our data mixtures for supervised fine-
tuning based on LM-based side-by-side evalua-
tions (Zheng et al., 2023). Given a set of held-
out prompts, we generate responses from a test
model, generate responses on the same prompts
from a baseline model, shuffle these randomly,
and ask a larger, high-capability model to express
a preference between two responses. Different
prompt sets are constructed to highlight specific
capabilities, such as instruction following, factual-
ity, creativity, and safety. The different automatic
LM-based judges we use employ a number of tech-
niques, such as chain-of-thought prompting (Wei
et al., 2022) and use of rubrics and constitutions
(Bai et al., 2022), to be aligned with human pref-
erences.
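A minimal sketch of such a side-by-side protocol is shown below. The `test_model`, `baseline_model`, and `judge` callables stand in for model calls and are hypothetical, as are the judge prompt wording and the coin flip used to randomize response order; the actual evaluation also relies on techniques such as chain-of-thought prompting and rubrics that are not reproduced here.

```python
import random

def side_by_side_win_rate(prompts, test_model, baseline_model, judge, seed=0):
    """Fraction of held-out prompts on which the judge prefers the test model."""
    rng = random.Random(seed)
    wins = 0
    for prompt in prompts:
        test_resp, base_resp = test_model(prompt), baseline_model(prompt)
        # Shuffle the pair so the judge cannot exploit positional bias.
        swapped = rng.random() < 0.5
        first, second = (base_resp, test_resp) if swapped else (test_resp, base_resp)
        verdict = judge(f"Prompt: {prompt}\n\nResponse 1: {first}\n\n"
                        f"Response 2: {second}\n\nWhich response is better? Answer 1 or 2.")
        wins += verdict.strip() == ("2" if swapped else "1")
    return wins / len(prompts)
```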
Filtering
When using synthetic data, we run several stages
of filtering over it, removing examples that show
certain personal information, unsafe or toxic
model outputs, mistaken self-identification data,
or duplicated examples. Following Gemini, we
find that including subsets of data that encour-
age better in-context attribution, hedging, and
refusals to minimize hallucinations can improve
performance on several factuality metrics, with-
out degrading model performance on other met-
rics.
The final data mixtures and supervised fine-
tuning recipe, which includes tuned hyperparam-
eters, were chosen on the basis of improving help-
fulness while minimizing model harms related to
safety and hallucinations.
Formatting
Instruction-tuned models are trained with a spe-
cific formatter that annotates all instruction tun-
ing examples with extra information, both at
training and inference time. It has two purposes:
1) indicating roles in a conversation, such as the
User role, and 2) delineating turns in a conver-
sation, especially in a multi-turn conversation.
Special control tokens are reserved in the tok-
enizer for this purpose. While it is possible to
get coherent generations without the formatter,
it will be out-of-distribution for the model, and
will very likely produce worse generations.
The relevant formatting control tokens are pre-
sented in Table 3, with a dialogue example pre-
sented in Table 4.
| Context                    | Relevant Token  |
|----------------------------|-----------------|
| User turn                  | user            |
| Model turn                 | model           |
| Start of conversation turn | <start_of_turn> |
| End of conversation turn   | <end_of_turn>   |

Table 3 | Relevant formatting control tokens used for both SFT and RLHF of Gemma models.
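For illustration, the sketch below assembles these control tokens into a short two-turn prompt at inference time. The exact arrangement of role names, tokens, and newlines is our assumption, since Table 4 (the dialogue example) is not reproduced in this preview.

```python
def format_turn(role, text):
    # `role` is "user" or "model"; the control tokens come from Table 3.
    return f"<start_of_turn>{role}\n{text}<end_of_turn>\n"

prompt = (format_turn("user", "What is Gemma?")
          + "<start_of_turn>model\n")  # leave the model turn open for generation
print(prompt)
```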
Reinforcement Learning from Human Feedback
We further finetuned the supervised fine-tuned
model using RLHF (Christiano et al., 2017;
Ouyang et al., 2022). We collected pairs of pref-
[The remaining 15 pages of the report are not included in this preview.]