Google's newly released technical report describes the open model family Gemma in detail. Built from the research and technology behind Google's Gemini models and trained on up to 6 trillion tokens of text, Gemma demonstrates strong generalist capabilities together with advanced understanding and reasoning. The family comes in two sizes, a 7-billion-parameter model and a 2-billion-parameter model, each optimized for different deployment needs and compute constraints. Google releases both pretrained and fine-tuned checkpoints along with an open-source codebase for inference and serving, so that developers can build and deploy efficiently on GPU, TPU, CPU, and on-device applications. Gemma improves performance across several domains, including question answering, commonsense reasoning, mathematics and science, and coding. The release is also intended to advance research on current instruction-tuning regimes and to foster safer, more responsible model-development methodologies.
2024-02-21
Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Google DeepMind¹

¹ See Contributions and Acknowledgments section for full author list. Please send correspondence to [email protected].
This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research
and technology used to create Gemini models. Gemma models demonstrate strong performance across
academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models
(2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma
outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive
evaluations of safety and responsibility aspects of the models, alongside a detailed description of model
development. We believe the responsible release of LLMs is critical for improving the safety of frontier
models, and for enabling the next wave of LLM innovations.
Introduction
We present Gemma, a family of open models
based on Google’s Gemini models (Gemini Team,
2023).
We trained Gemma models on up to 6T to-
kens of text, using similar architectures, data,
and training recipes as the Gemini model family.
Like Gemini, these models achieve strong gener-
alist capabilities in text domains, alongside state-
of-the-art understanding and reasoning skills at
scale. With this work, we release both pre-trained
and fine-tuned checkpoints, as well as an open-
source codebase for inference and serving.
Gemma comes in two sizes: a 7 billion param-
eter model for efficient deployment and develop-
ment on GPU and TPU, and a 2 billion param-
eter model for CPU and on-device applications.
Each size is designed to address different compu-
tational constraints, applications, and developer
requirements. At each scale, we release raw, pre-
trained checkpoints, as well as checkpoints fine-
tuned for dialogue, instruction-following, help-
fulness, and safety. We thoroughly evaluate the
shortcomings of our models on a suite of quantita-
tive and qualitative benchmarks. We believe the
release of both pretrained and fine-tuned check-
points will enable thorough research and inves-
tigation into the impact of current instruction-
tuning regimes, as well as the development of
increasingly safe and responsible model develop-
ment methodologies.
Gemma advances state-of-the-art performance
relative to comparable-scale (and some larger)
open models (Almazrouei et al., 2023; Jiang
et al., 2023; Touvron et al., 2023a,b) across a
wide range of domains including both automated
benchmarks and human evaluation. Example do-
mains include question answering (Clark et al.,
2019; Kwiatkowski et al., 2019), commonsense
reasoning (Sakaguchi et al., 2019; Suzgun et al.,
2022), mathematics and science (Cobbe et al.,
2021; Hendrycks et al., 2020), and coding (Austin
et al., 2021; Chen et al., 2021). See complete de-
tails in the Evaluation section.
Like Gemini, Gemma builds on recent work
on sequence models (Sutskever et al., 2014) and
transformers (Vaswani et al., 2017), deep learn-
ing methods based on neural networks (LeCun
et al., 2015), and techniques for large-scale train-
ing on distributed systems (Barham et al., 2022;
Dean et al., 2012; Roberts et al., 2023). Gemma
also builds on Google’s long history of open mod-
els and ecosystems, including Word2Vec (Mikolov
et al., 2013), the Transformer (Vaswani et al.,
2017), BERT (Devlin et al., 2018), and T5 (Raffel
et al., 2019) and T5X (Roberts et al., 2022).
We believe the responsible release of LLMs is
critical for improving the safety of frontier models,
for ensuring equitable access to this breakthrough
technology, for enabling rigorous evaluation and
analysis of current techniques, and for enabling
the development of the next wave of innovations.

Figure 1 | Language understanding and generation performance of Gemma 7B across different capabilities compared to similarly sized open models. We group together standard academic benchmark evaluations by capability and average the respective scores; see Table 6 for a detailed breakdown of performance. (Bar chart; scores from 0 to 80 for Question Answering, Reasoning, Math / Science, and Coding; models compared: LLaMA 2 (7B), LLaMA 2 (13B), Mistral (7B), Gemma (7B).)
While thorough testing of all Gemma models has
been conducted, testing cannot cover all appli-
cations and scenarios in which Gemma may be
used. With this in mind, all Gemma users should
conduct rigorous safety testing specific to their
use case before deployment or use. More details
on our approach to safety can be found in the Responsible Deployment section.
In this technical report, we provide a detailed
overview of the model architecture, training in-
frastructure, and pretraining and fine-tuning
recipes for Gemma, followed by thorough eval-
uations of all checkpoints across a wide variety
of quantitative and qualitative benchmarks, as
well as both standard academic benchmarks and
human-preference evaluations. We then discuss
in detail our approach to safe and responsible de-
ployment. Finally, we outline the broader impli-
cations of Gemma, its limitations and advantages,
and conclusions.
Model Architecture
The Gemma model architecture is based on the
transformer decoder (Vaswani et al., 2017). The
core parameters of the architecture are summarized in Table 1. Models are trained on a context length of 8192 tokens.

| Parameters              | 2B     | 7B     |
|-------------------------|--------|--------|
| d_model                 | 2048   | 3072   |
| Layers                  | 18     | 28     |
| Feedforward hidden dims | 32768  | 49152  |
| Num heads               | 8      | 16     |
| Num KV heads            | 1      | 16     |
| Head size               | 256    | 256    |
| Vocab size              | 256128 | 256128 |

Table 1 | Key model parameters.

| Model | Embedding Parameters | Non-embedding Parameters |
|-------|----------------------|--------------------------|
| 2B    | 524,550,144          | 1,981,884,416            |
| 7B    | 786,825,216          | 7,751,248,896            |

Table 2 | Parameter counts for both sizes of Gemma models.
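As a quick consistency check on these tables, the embedding parameter counts follow directly from the vocabulary size and model width (embedding parameters = vocab size × d_model, counted once because input and output embeddings are shared, as noted under RoPE Embeddings below); the short Python sketch below reproduces the Table 2 figures.

```python
# Embedding parameters = vocab_size * d_model; the matrix is counted once
# because input and output embeddings are shared.
configs = {"2B": {"d_model": 2048, "vocab_size": 256128},
           "7B": {"d_model": 3072, "vocab_size": 256128}}

for name, cfg in configs.items():
    embedding_params = cfg["vocab_size"] * cfg["d_model"]
    print(name, f"{embedding_params:,}")  # 2B -> 524,550,144; 7B -> 786,825,216
```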
We also utilize several improvements proposed
after the original transformer paper. Below, we
list the included improvements:
Multi-Query Attention (Shazeer, 2019). Notably, the 7B model uses multi-head attention while the 2B checkpoints use multi-query attention (with num_kv_heads = 1), based on ablation studies showing that the respective attention variants improved performance at each scale (Shazeer, 2019).
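To make the distinction concrete, here is a minimal, illustrative sketch (in JAX) of attention with a configurable number of key/value heads: with num_kv_heads equal to the number of query heads it reduces to ordinary multi-head attention (the 7B configuration), and with num_kv_heads = 1 it becomes multi-query attention (the 2B configuration). The tensor layout and names are our own assumptions, not code from the Gemma release, and causal masking is omitted for brevity.

```python
import jax.numpy as jnp
from jax.nn import softmax

def attention(q, k, v):
    """q: [num_heads, seq, head_dim]; k, v: [num_kv_heads, seq, head_dim]."""
    num_heads, _, head_dim = q.shape
    num_kv_heads = k.shape[0]
    # Multi-query attention shares a single K/V head across all query heads,
    # so we broadcast (repeat) K and V up to the number of query heads.
    k = jnp.repeat(k, num_heads // num_kv_heads, axis=0)
    v = jnp.repeat(v, num_heads // num_kv_heads, axis=0)
    scores = jnp.einsum("hqd,hkd->hqk", q, k) / jnp.sqrt(head_dim)
    return jnp.einsum("hqk,hkd->hqd", softmax(scores, axis=-1), v)
```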
RoPE Embeddings (Su et al., 2021). Rather than
using absolute positional embeddings, we use ro-
tary positional embeddings in each layer; we also
share embeddings across our inputs and outputs
to reduce model size.
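For reference, a minimal sketch of rotary position embeddings applied to a query or key tensor is shown below, following the formulation of Su et al. (2021). The base frequency of 10,000 is the value from that paper and is assumed here; the report does not state Gemma's exact RoPE hyperparameters.

```python
import jax.numpy as jnp

def apply_rope(x, positions, base=10000.0):
    """x: [seq, head_dim] queries or keys; positions: [seq] token positions."""
    head_dim = x.shape[-1]
    # One rotation frequency per pair of feature dimensions.
    freqs = base ** (-jnp.arange(0, head_dim, 2) / head_dim)
    angles = positions[:, None] * freqs[None, :]        # [seq, head_dim // 2]
    cos, sin = jnp.cos(angles), jnp.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                  # split into (even, odd) pairs
    # Rotate each 2-D pair by its position-dependent angle.
    rotated = jnp.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return rotated.reshape(x.shape)
```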
GeGLU Activations (Shazeer, 2020). The stan-
dard ReLU non-linearity is replaced by the GeGLU
activation function.
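For reference, the sketch below shows the common formulation of a GeGLU feed-forward block (Shazeer, 2020), in which a GELU-activated "gate" projection multiplies a parallel linear projection; the exact parameterization and dimensions used in Gemma are not spelled out in this excerpt, so the shapes here are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def geglu_ffn(x, w_gate, w_up, w_down):
    """GeGLU feed-forward block.

    x: [seq, d_model]; w_gate, w_up: [d_model, d_ff]; w_down: [d_ff, d_model].
    """
    gate = jax.nn.gelu(x @ w_gate)   # nonlinear gate path (GELU instead of ReLU)
    up = x @ w_up                    # parallel linear path
    return (gate * up) @ w_down      # elementwise gating, then project back down
```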
Normalizer Location. We normalize both the in-
put and the output of each transformer sub-layer,
a deviation from the standard practice of solely
normalizing one or the other. We use RMSNorm
(Zhang and Sennrich, 2019) as our normalization
layer.
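A minimal sketch of RMSNorm and of normalizing both the input and the output of a sub-layer is given below; the epsilon value and the exact placement of the residual connection are our assumptions, as the report does not specify them.

```python
import jax.numpy as jnp

def rms_norm(x, scale, eps=1e-6):
    """RMSNorm (Zhang and Sennrich, 2019): rescale by the root-mean-square of the features."""
    rms = jnp.sqrt(jnp.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * scale

def sublayer(x, fn, in_scale, out_scale):
    # Normalize both the input and the output of the sub-layer `fn`
    # (e.g. attention or the feed-forward block), then add the residual.
    return x + rms_norm(fn(rms_norm(x, in_scale)), out_scale)
```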
Training Infrastructure
We train the Gemma models using TPUv5e;
TPUv5e are deployed in pods of 256 chips, con-
figured into a 2D torus of 16 x 16 chips. For the
7B model, we train across 16 pods, totaling 4096 TPUv5e chips. We pretrain the 2B model
across 2 pods, totaling 512 TPUv5e. Within a pod,
we use 16-way model sharding and 16-way data
replication for the 7B model. For the 2B, we sim-
ply use 256-way data replication. The optimizer
state is further sharded using techniques simi-
lar to ZeRO-3. Beyond a pod, we perform a data-replica reduce over the data-center network, using the Pathways approach of Barham et al. (2022).
As in Gemini, we leverage the 'single controller'
programming paradigm of Jax (Roberts et al.,
2023) and Pathways (Barham et al., 2022) to
simplify the development process by enabling a
single Python process to orchestrate the entire
training run; we also leverage the GSPMD par-
titioner (Xu et al., 2021) for the training step
computation and the MegaScale XLA compiler
(XLA, 2019).
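To illustrate the within-pod layout described above (16-way model sharding combined with 16-way data replication for the 7B model), the hedged sketch below builds a two-dimensional JAX device mesh and places a small, hypothetical weight matrix so that it is sharded along the model axis and replicated along the data axis. The axis names, array shapes, and the fallback to however many devices are locally visible are all our own choices, not the Gemma training code.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# On a TPUv5e pod this mesh would be 16 x 16 ("data" x "model");
# here we simply arrange whatever devices are visible into a 2-D mesh.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard an illustrative weight matrix along the "model" axis and replicate
# it across the "data" axis (tensor parallelism within the model dimension,
# data parallelism across replicas).
weight = jax.device_put(jnp.zeros((1024, 4096)),
                        NamedSharding(mesh, P(None, "model")))
print(weight.sharding)
```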
Carbon Footprint
We estimate the carbon emissions from pretrain-
ing the Gemma models to be ∼131 tCO₂eq. This
value is calculated based on the hourly energy us-
age reported directly from our TPU datacenters;
we also scale this value to account for the addi-
tional energy expended to create and maintain
the data center, giving us the total energy usage
for our training experiments. We convert total
energy usage to carbon emissions by joining our
hourly energy usage against hourly per-cell car-
bon emission data reported by our data centers.
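As a toy illustration of this accounting, and with entirely made-up numbers rather than the actual measurements, the estimate amounts to summing metered hourly energy, scaled by a data-center overhead factor, multiplied by that hour's grid carbon intensity:

```python
# Purely hypothetical hourly figures, for illustration only.
hourly_energy_mwh = [2.0, 2.1, 1.9]    # metered TPU energy per hour (MWh)
overhead_factor = 1.1                  # assumed scaling for data-center overhead
hourly_intensity = [0.12, 0.10, 0.11]  # grid carbon intensity (tCO2eq per MWh)

emissions = sum(energy * overhead_factor * intensity
                for energy, intensity in zip(hourly_energy_mwh, hourly_intensity))
print(f"{emissions:.3f} tCO2eq")
```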
In addition, Google data centers are carbon
neutral, achieved through a combination of en-
ergy efficiency, renewable energy purchases, and
carbon offsets. This carbon neutrality also applies
to our experiments and the machines used to run
them.
Pretraining
Training Data
Gemma 2B and 7B are trained on 2T and 6T
tokens respectively of primarily-English data from
web documents, mathematics, and code. Unlike
Gemini, these models are not multimodal, nor are
they trained for state-of-the-art performance on
multilingual tasks.
We use a subset of the SentencePiece tokenizer
(Kudo and Richardson, 2018) of Gemini for com-
patibility. It splits digits, does not remove extra
whitespace, and relies on byte-level encodings for
unknown tokens, following the techniques used
for both (Chowdhery et al., 2022) and (Gemini
Team, 2023). The vocabulary size is 256k tokens.
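For illustration, the SentencePiece trainer exposes options matching the properties described above (digit splitting, preserved whitespace, and byte-level fallback for unknown characters). The snippet below is a hedged sketch that trains a small toy tokenizer with those settings; the corpus path, vocabulary size, and model prefix are placeholders, not the Gemma recipe.

```python
import sentencepiece as spm

# Toy tokenizer with the properties described above; paths and sizes are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_tokenizer",
    vocab_size=8000,
    split_digits=True,               # "2024" -> "2", "0", "2", "4"
    remove_extra_whitespaces=False,  # keep whitespace exactly as written
    byte_fallback=True,              # unknown characters fall back to raw bytes
)
sp = spm.SentencePieceProcessor(model_file="toy_tokenizer.model")
print(sp.encode("Gemma was trained on up to 6T tokens.", out_type=str))
```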
Filtering
We filter the pre-training dataset to reduce the
risk of unwanted or unsafe utterances, and filter
out certain personal information and other sen-
sitive data. This includes using both heuristics
and model-based classifiers to remove harmful or
low-quality content. Further, we filter all evalua-
tion sets from our pre-training data mixture, run
targeted contamination analyses to check against
evaluation set leakage, and reduce the risk of
recitation by minimizing proliferation of sensitive
outputs.
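The report does not describe the exact contamination check, but a common approach is n-gram overlap between training documents and held-out evaluation examples; the sketch below is an assumption-laden illustration of that idea, not the method actually used.

```python
def ngrams(text, n=8):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document, eval_ngrams, n=8):
    """Flag a training document that shares any n-gram with an evaluation set."""
    return bool(ngrams(document, n) & eval_ngrams)

# eval_ngrams would be the union of n-grams over all benchmark examples.
eval_ngrams = ngrams("the quick brown fox jumps over the lazy dog near the old river bank")
print(is_contaminated("we noticed the quick brown fox jumps over the lazy dog in a field", eval_ngrams))
```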
The final data mixture was determined through
a series of ablations on both the 2B and 7B mod-
els. Similar to the approach advocated in (Gemini
Team, 2023), we stage training to alter the cor-
pus mixture throughout training to increase the
weight of relevant, high-quality data towards the
end of training.
Instruction Tuning
We finetune Gemma 2B and 7B with supervised fine-tuning (SFT) on a mix of text-only, English-only synthetic and human-generated prompt-response pairs, followed by reinforcement learning from human feedback (RLHF), with the reward model trained on labelled English-only preference data and the policy based on a set of high-quality
prompts. We find that both stages are important
for improved performance on downstream auto-
matic evaluations and human preference evalua-
tions of model outputs.
Supervised Fine-Tuning
We selected our data mixtures for supervised fine-
tuning based on LM-based side-by-side evalua-
tions (Zheng et al., 2023). Given a set of held-
out prompts, we generate responses from a test
model, generate responses on the same prompts
from a baseline model, shuffle these randomly,
and ask a larger, high-capability model to express
a preference between two responses. Different
prompt sets are constructed to highlight specific
capabilities, such as instruction following, factual-
ity, creativity, and safety. The different automatic
LM-based judges we use employ a number of tech-
niques, such as chain-of-thought prompting (Wei
et al., 2022) and use of rubrics and constitutions
(Bai et al., 2022), to be aligned with human pref-
erences.
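A minimal sketch of such a side-by-side protocol is shown below. The `test_model`, `baseline_model`, and `judge` callables stand in for model calls and are hypothetical, as are the judge prompt wording and the coin flip used to randomize response order; the actual evaluation also relies on techniques such as chain-of-thought prompting and rubrics that are not reproduced here.

```python
import random

def side_by_side_win_rate(prompts, test_model, baseline_model, judge, seed=0):
    """Fraction of held-out prompts on which the judge prefers the test model."""
    rng = random.Random(seed)
    wins = 0
    for prompt in prompts:
        test_resp, base_resp = test_model(prompt), baseline_model(prompt)
        # Shuffle the pair so the judge cannot exploit positional bias.
        swapped = rng.random() < 0.5
        first, second = (base_resp, test_resp) if swapped else (test_resp, base_resp)
        verdict = judge(f"Prompt: {prompt}\n\nResponse 1: {first}\n\n"
                        f"Response 2: {second}\n\nWhich response is better? Answer 1 or 2.")
        wins += verdict.strip() == ("2" if swapped else "1")
    return wins / len(prompts)
```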
Filtering
When using synthetic data, we run several stages
of filtering over it, removing examples that show
certain personal information, unsafe or toxic
model outputs, mistaken self-identification data,
or duplicated examples. Following Gemini, we
find that including subsets of data that encour-
age better in-context attribution, hedging, and
refusals to minimize hallucinations can improve
performance on several factuality metrics, with-
out degrading model performance on other met-
rics.
The final data mixtures and supervised fine-
tuning recipe, which includes tuned hyperparam-
eters, were chosen on the basis of improving help-
fulness while minimizing model harms related to
safety and hallucinations.
Formatting
Instruction-tuned models are trained with a spe-
cific formatter that annotates all instruction tun-
ing examples with extra information, both at
training and inference time. It has two purposes:
1) indicating roles in a conversation, such as the
User role, and 2) delineating turns in a conver-
sation, especially in a multi-turn conversation.
Special control tokens are reserved in the tok-
enizer for this purpose. While it is possible to
get coherent generations without the formatter,
it will be out-of-distribution for the model, and
will very likely produce worse generations.
The relevant formatting control tokens are pre-
sented in Table 3, with a dialogue example pre-
sented in Table 4.
| Context                    | Relevant Token  |
|----------------------------|-----------------|
| User turn                  | user            |
| Model turn                 | model           |
| Start of conversation turn | <start_of_turn> |
| End of conversation turn   | <end_of_turn>   |

Table 3 | Relevant formatting control tokens used for both SFT and RLHF of Gemma models.
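For illustration, the sketch below assembles these control tokens into a short two-turn prompt at inference time. The exact arrangement of role names, tokens, and newlines is our assumption, since Table 4 (the dialogue example) is not reproduced in this preview.

```python
def format_turn(role, text):
    # `role` is "user" or "model"; the control tokens come from Table 3.
    return f"<start_of_turn>{role}\n{text}<end_of_turn>\n"

prompt = (format_turn("user", "What is Gemma?")
          + "<start_of_turn>model\n")  # leave the model turn open for generation
print(prompt)
```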
Reinforcement Learning from Human Feedback
We further finetuned the supervised fine-tuned
model using RLHF (Christiano et al., 2017;
Ouyang et al., 2022). We collected pairs of pref-
[The remaining 15 pages of the report are not included in this preview.]