Published as a conference paper at ICLR 2023
GPTQ: ACCURATE POST-TRAINING QUANTIZATION
FOR GENERATIVE PRE-TRAINED TRANSFORMERS
Elias Frantar∗ (IST Austria)
Saleh Ashkboos (ETH Zurich)
Torsten Hoefler (ETH Zurich)
Dan Alistarh (IST Austria & NeuralMagic)
ABSTRACT
Generative Pre-trained Transformer models, known as GPT or OPT, set them-
selves apart through breakthrough performance across complex language mod-
elling tasks, but also by their extremely high computational and storage costs.
Specifically, due to their massive size, even inference for large, highly-accurate
GPT models may require multiple performant GPUs, which limits the usability
of such models. While there is emerging work on relieving this pressure via
model compression, the applicability and performance of existing compression
techniques is limited by the scale and complexity of GPT models. In this paper,
we address this challenge, and propose GPTQ, a new one-shot weight quantiza-
tion method based on approximate second-order information, that is both highly-
accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with
175 billion parameters in approximately four GPU hours, reducing the bitwidth
down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the
uncompressed baseline. Our method more than doubles the compression gains rel-
ative to previously-proposed one-shot quantization methods, preserving accuracy,
allowing us for the first time to execute a 175 billion-parameter model inside a
single GPU for generative inference. Moreover, we also show that our method
can still provide reasonable accuracy in the extreme quantization regime, in which
weights are quantized to 2-bit or even ternary quantization levels. We show ex-
perimentally that these improvements can be leveraged for end-to-end inference
speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100)
and 4.5x when using more cost-effective ones (NVIDIA A6000). The implemen-
tation is available at https://github.com/IST-DASLab/gptq.
1 INTRODUCTION
Pre-trained generative models from the Transformer (Vaswani et al., 2017) family, commonly known
as GPT or OPT (Radford et al., 2019; Brown et al., 2020; Zhang et al., 2022), have shown break-
through performance for complex language modelling tasks, leading to massive academic and prac-
tical interest. One major obstacle to their usability is computational and storage cost, which ranks
among the highest for known models. For instance, the best-performing model variants, e.g. GPT3-
175B, have in the order of 175 billion parameters and require tens-to-hundreds of GPU years to
train (Zhang et al., 2022). Even the simpler task of inferencing over a pre-trained model, which is
our focus in this paper, is highly challenging: for instance, the parameters of GPT3-175B occupy
326GB (counting in multiples of 1024) of memory when stored in a compact float16 format. This
exceeds the capacity of even the highest-end single GPUs, and thus inference must be performed
using more complex and expensive setups, such as multi-GPU deployments.
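For reference, the 326GB figure follows directly from the parameter count at two bytes per weight in float16:

$$175 \times 10^{9}\ \text{weights} \times 2\ \text{bytes} = 350 \times 10^{9}\ \text{bytes} \approx \frac{350 \times 10^{9}}{1024^{3}}\ \text{GB} \approx 326\ \text{GB}.$$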
Although a standard approach to eliminating these overheads is model compression, e.g. (Hoefler
et al., 2021; Gholami et al., 2021), surprisingly little is known about compressing such models for
inference. One reason is that more complex methods for low-bitwidth quantization or model prun-
ing usually require model retraining, which is extremely expensive for billion-parameter models.
Alternatively, post-training methods (Nagel et al., 2020; Wang et al., 2020; Hubara et al., 2020;
Nahshan et al., 2021), which compress the model in one shot, without retraining, would be very
appealing. Unfortunately, the more accurate variants of such methods (Li et al., 2021; Hubara et al.,
2021; Frantar et al., 2022) are complex and challenging to scale to billions of parameters (Yao et al.,
∗ Corresponding author: elias.frantar@ist.ac.at
2022). To date, only basic variants of round-to-nearest quantization (Yao et al., 2022; Dettmers
et al., 2022) have been applied at the scale of GPT-175B; while these approaches work well for low compression
targets, e.g., 8-bit weights, they fail to preserve accuracy at higher rates. It therefore remains open
whether one-shot post-training quantization to higher compression rates is generally feasible.
[Figure 1 plots: perplexity on WikiText2 versus number of parameters in billions (log scale). Left panel: OPT model family, comparing 4-bit RTN, 4-bit GPTQ, and the FP16 baseline. Right panel: BLOOM model family, comparing 3-bit RTN, 3-bit GPTQ, and the FP16 baseline.]
Figure 1: Quantizing OPT models to 4 and BLOOM models to 3 bit precision, comparing GPTQ
with the FP16 baseline and round-to-nearest (RTN) (Yao et al., 2022; Dettmers et al., 2022).
Contribution. In this paper, we present a new post-training quantization method, called GPTQ,¹
which is efficient enough to execute on models with hundreds of billions of parameters in at most
a few hours, and precise enough to compress such models to 3 or 4 bits per parameter without
significant loss of accuracy. For illustration, GPTQ can quantize the largest publicly-available mod-
els, OPT-175B and BLOOM-176B, in approximately four GPU hours, with minimal increase in
perplexity, known to be a very stringent accuracy metric.
Further, we show that our method can also provide robust results in the extreme quantization regime,
in which models are quantized to 2 bits per component, or even ternary values. On the practical
side, we develop an execution harness which allows us to execute the resulting compressed models
efficiently for generative tasks. Specifically, we are able to run the compressed OPT-175B model
for the first time on a single NVIDIA A100 GPU, or using only two more cost-effective NVIDIA
A6000 GPUs. We also implement bespoke GPU kernels which are able to leverage compression for
faster memory loading, resulting in speedups of ≈ 3.25× when using A100 GPUs, and 4.5× when
using A6000 GPUs.
To our knowledge, we are the first to show that extremely accurate language models with hundreds
of billions of parameters can be quantized to 3-4 bits/component: prior post-training methods only
remain accurate at 8 bits (Yao et al., 2022; Dettmers et al., 2022), while prior training-based tech-
niques have only tackled models that are smaller by one to two orders of magnitude (Wu et al., 2022).
This high degree of compression may appear natural, as these networks are overparametrized; yet,
as we discuss in our detailed analysis of results, compression induces non-trivial tradeoffs between
the accuracy of the language modeling (perplexity), bit-width, and the size of the original model.
We hope that our work will stimulate further research in this area, and can be a further step towards
making these models available to a wider audience. In terms of limitations, our method currently
does not provide speedups for the actual multiplications, due to the lack of hardware support for
mixed-precision operands (e.g. FP16 x INT4) on mainstream architectures. Moreover, our current
results do not include activation quantization, as activations are not a significant bottleneck in our target
scenarios; however, this can be supported using orthogonal techniques (Yao et al., 2022).
2 RELATED WORK
Quantization methods fall broadly into two categories: quantization during training, and post-
training methods. The former quantize models during typically extensive retraining and/or fine-
tuning, using some approximate differentiation mechanism for the rounding operation (Gholami
et al., 2021; Nagel et al., 2021). By contrast, post-training (“one-shot”) methods quantize a pre-
trained model using modest resources, typically a few thousand data samples and a few hours of
computation. Post-training approaches are particularly interesting for massive models, for which
full model training or even finetuning can be expensive. We focus on this scenario here.

¹ This merges the name of the GPT model family with the abbreviation for post-training quantization (PTQ).
Post-training Quantization. Most post-training methods have focused on vision models. Usually,
accurate methods operate by quantizing either individual layers, or small blocks of consecutive
layers. (See Section 3 for more details.) The AdaRound method (Nagel et al., 2020) computes a
data-dependent rounding by annealing a penalty term, which encourages weights to move towards
grid points corresponding to quantization levels. BitSplit (Wang et al., 2020) constructs quantized
values bit-by-bit using a squared error objective on the residual error, while AdaQuant (Hubara et al.,
2021) performs direct optimization based on straight-through estimates. BRECQ (Li et al., 2021)
introduces Fisher information into the objective, and optimizes layers within a single residual block
jointly. Finally, Optimal Brain Quantization (OBQ) (Frantar et al., 2022) generalizes the classic
Optimal Brain Surgeon (OBS) second-order weight pruning framework (Hassibi et al., 1993; Singh
& Alistarh, 2020; Frantar et al., 2021) to apply to quantization. OBQ quantizes weights one-by-one,
in order of quantization error, always adjusting the remaining weights. While these approaches can
produce good results for models up to ≈ 100 million parameters in a few GPU hours, scaling them
to networks orders of magnitude larger is challenging.
Large-model Quantization. With the recent open-source releases of language models like
BLOOM (Laurençon et al., 2022) or OPT-175B (Zhang et al., 2022), researchers have started to
develop affordable methods for compressing such giant networks for inference. While all exist-
ing works—ZeroQuant (Yao et al., 2022), LLM.int8() (Dettmers et al., 2022), and nuQmm (Park
et al., 2022)— carefully select quantization granularity, e.g., vector-wise, they ultimately just round
weights to the nearest (RTN) quantization level, in order to maintain acceptable runtimes for very
large models. ZeroQuant further proposes layer-wise knowledge distillation, similar to AdaQuant,
but the largest model it can apply this approach to has only 1.3 billion parameters. At this scale,
ZeroQuant already takes ≈ 3 hours of compute; GPTQ quantizes models 100× larger in ≈ 4 hours.
LLM.int8() observes that activation outliers in a few feature dimensions break the quantization
of larger models, and proposes to fix this problem by keeping those dimensions in higher preci-
sion. Lastly, nuQmm develops efficient GPU kernels for a specific binary-coding based quantization
scheme.
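For concreteness, the snippet below is a minimal PyTorch sketch of such round-to-nearest quantization, assuming one asymmetric uniform grid per output row; the function name, grid granularity, and tensor shapes are illustrative choices rather than details of any of the cited implementations.

import torch

def rtn_quantize(W: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Round-to-nearest: each row gets a uniform grid with 2**bits levels
    # spanning [row_min, row_max]; weights are simply rounded, with no
    # compensation of the incurred error.
    qmax = 2 ** bits - 1
    w_min = W.min(dim=1, keepdim=True).values
    w_max = W.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax            # grid step per row
    zero = torch.round(-w_min / scale)                        # integer zero-point
    q = torch.clamp(torch.round(W / scale) + zero, 0, qmax)   # integer codes
    return (q - zero) * scale                                 # dequantized weights

# Example: quantize a random layer to 4 bits and measure the squared error.
W = torch.randn(512, 512)
print((W - rtn_quantize(W, bits=4)).pow(2).mean())

Because each weight is rounded independently, no error compensation takes place; the more accurate post-training methods discussed in Section 3 differ from RTN precisely in how they compensate these errors.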
Relative to this line of work, we show that a significantly more complex and accurate quantizer can
be implemented efficiently at large model scale. Specifically, GPTQ more than doubles the amount
of compression relative to these prior techniques, at similar accuracy.
3 BACKGROUND
Layer-Wise Quantization. At a high level, our method follows the structure of state-of-the-art
post-training quantization methods (Nagel et al., 2020; Wang et al., 2020; Hubara et al., 2021; Fran-
tar et al., 2022), by performing quantization layer-by-layer, solving a corresponding reconstruction
problem for each layer. Concretely, let $W_\ell$ be the weights corresponding to a linear layer $\ell$ and let $X_\ell$ denote the layer input corresponding to a small set of $m$ data points running through the network. Then, the objective is to find a matrix of quantized weights $\widehat{W}$ which minimizes the squared error, relative to the full precision layer output. Formally, this can be restated as

$$\operatorname{argmin}_{\widehat{W}} \; \lVert W X - \widehat{W} X \rVert_2^2. \tag{1}$$

Further, similar to (Nagel et al., 2020; Li et al., 2021; Frantar et al., 2022), we assume that the quantization grid for $\widehat{W}$ is fixed before the process, and that individual weights can move freely as in (Hubara et al., 2021; Frantar et al., 2022).
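As a point of reference, the following sketch evaluates the objective of Equation (1) for a candidate $\widehat{W}$ and forms the Hessian $H = 2XX^\top$ on which the second-order method described next operates; the small diagonal dampening is an assumption added here for numerical safety, not part of Equation (1), and all shapes are illustrative ($W$ is $d_{\text{row}} \times d_{\text{col}}$, $X$ stacks the $m$ calibration inputs column-wise).

import torch

def layerwise_error(W, W_hat, X):
    # Squared reconstruction error ||W X - W_hat X||_2^2 of Equation (1).
    return (W @ X - W_hat @ X).pow(2).sum()

def layer_hessian(X, damp_ratio=0.01):
    # Hessian H = 2 X X^T of the layer-wise objective; it depends only on the
    # calibration inputs X, not on the weights. The diagonal dampening term is
    # an assumption (a common numerical safeguard), not part of Equation (1).
    H = 2.0 * (X @ X.T)
    H = H + damp_ratio * H.diagonal().mean() * torch.eye(X.shape[0])
    return H

# W is d_row x d_col; X stacks m calibration inputs as a d_col x m matrix.
d_row, d_col, m = 8, 16, 128
W, X = torch.randn(d_row, d_col), torch.randn(d_col, m)
print(layerwise_error(W, torch.zeros_like(W), X), layer_hessian(X).shape)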
Optimal Brain Quantization. Our approach builds on the recently-proposed Optimal Brain
Quantization (OBQ) method (Frantar et al., 2022) for solving the layer-wise quantization problem
defined above, to which we perform a series of major modifications, which allow it to scale to large
language models, providing more than three orders of magnitude computational speedup. To aid
understanding, we first briefly summarize the original OBQ method.
The OBQ method starts from the observation that Equation (1) can be written as the sum of the
squared errors, over each row of W. Then, OBQ handles each row w independently, quantizing one
weight at a time while always updating all not-yet-quantized weights, in order to compensate for
the error incurred by quantizing a single weight. Since the corresponding objective is a quadratic,
whose Hessian is $H_F = 2 X_F X_F^\top$, where $F$ denotes the set of remaining full-precision weights, the greedy-optimal weight to quantize next, which we denote by $w_q$, and the corresponding optimal update of all weights in $F$, denoted by $\delta_F$, are given by the following formulas, where $\operatorname{quant}(w)$ rounds $w$ to the nearest value on the quantization grid:

$$w_q = \operatorname{argmin}_{w_q} \frac{(\operatorname{quant}(w_q) - w_q)^2}{[H_F^{-1}]_{qq}}, \qquad \delta_F = - \frac{w_q - \operatorname{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}. \tag{2}$$
OBQ quantizes weights iteratively using these two equations, until all the weights of $w$ are quantized. This is done efficiently, avoiding expensive full recomputations of $H^{-1}$, by removing the $q$th row and column of $H$, which is necessary after quantizing $w_q$, directly in the inverse via one step of Gaussian elimination. Namely, the updated inverse is given by the formula

$$H^{-1}_{-q} = \left( H^{-1} - \frac{1}{[H^{-1}]_{qq}} \, H^{-1}_{:,q} H^{-1}_{q,:} \right)_{-q}. \tag{3}$$
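To make Equations (2) and (3) concrete, here is a minimal single-row sketch of one OBQ step; quantize_to_grid is a stand-in for $\operatorname{quant}(\cdot)$ using an arbitrary uniform grid, and instead of physically shrinking $H^{-1}$ the eliminated row and column are simply zeroed, both illustrative simplifications rather than details of the original implementation.

import torch

def quantize_to_grid(w, step=0.05):
    # Stand-in for quant(.): round to the nearest point of a uniform grid.
    return torch.round(w / step) * step

def obq_step(w, Hinv, remaining):
    # One OBQ step on a single row w. 'remaining' is the boolean mask of the
    # not-yet-quantized weights (the set F); Hinv plays the role of H_F^{-1},
    # kept at full size with eliminated rows/columns zeroed out.
    idx = remaining.nonzero(as_tuple=True)[0]
    # Greedy choice of Equation (2): smallest (quant(w_q) - w_q)^2 / [H_F^{-1}]_qq.
    scores = (quantize_to_grid(w[idx]) - w[idx]) ** 2 / Hinv.diagonal()[idx]
    q = idx[torch.argmin(scores)]
    # Compensating update of Equation (2), applied to the remaining weights.
    wq_quant = quantize_to_grid(w[q])
    delta = -(w[q] - wq_quant) / Hinv[q, q] * Hinv[:, q]
    w = torch.where(remaining, w + delta, w)
    w[q] = wq_quant
    remaining = remaining.clone()
    remaining[q] = False
    # Equation (3): remove row/column q from the inverse via one Gaussian
    # elimination step (here the q-th row and column simply become zero).
    Hinv = Hinv - torch.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
    return w, Hinv, remaining

# Usage: quantize a 16-weight row completely, one weight per step.
X = torch.randn(16, 64)
Hinv = torch.linalg.inv(2 * X @ X.T + 0.01 * torch.eye(16))
w, remaining = torch.randn(16), torch.ones(16, dtype=torch.bool)
for _ in range(16):
    w, Hinv, remaining = obq_step(w, Hinv, remaining)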
This method comes with a vectorized implementation, handling multiple rows of W in parallel.
Eventually, the algorithm can achieve reasonable runtimes on medium-sized models: for instance, it
can fully quantize the ResNet-50 model (25M parameters) in ≈ 1 hour on a single GPU, which is
roughly in line with other post-training methods achieving state-of-the-art accuracy (Frantar et al.,
2022). However, the fact that OBQ's runtime for a $d_{\text{row}} \times d_{\text{col}}$ matrix $W$ has cubic input dependency $O(d_{\text{row}} \cdot d_{\text{col}}^3)$ means that applying it to models with billions of parameters is extremely expensive.
4 THE GPTQ ALGORITHM
Step 1: Arbitrary Order Insight. As explained in the previous section, OBQ quantizes weights in
greedy order, i.e. it always picks the weight which currently incurs the least additional quantization
error. Interestingly, we find that, while this quite natural strategy does indeed seem to perform very
well, its improvement over quantizing the weights in arbitrary order is generally small, in particular
on large, heavily-parametrized layers. Most likely, this is because the slightly lower number of
quantized weights with large individual error is balanced out by those weights being quantized
towards the end of the process, when only few other unquantized weights that can be adjusted for
compensation remain. As we will now discuss, this insight that any fixed order may perform well,
especially on large models, has interesting ramifications.
[Figure 2 schematic: the inverse layer Hessian (Cholesky form), computed initially, shown next to a weight matrix / block in which block i is quantized recursively column-by-column; unquantized weights that are updated and already-quantized weights are marked.]
Figure 2: GPTQ quantization procedure. Blocks
of consecutive columns (bolded) are quantized at
a given step, using the inverse Hessian informa-
tion stored in the Cholesky decomposition, and
the remaining weights (blue) are updated at the
end of the step. The quantization procedure is
applied recursively inside each block: the white
middle column is currently being quantized.
The original OBQ method quantizes rows of W
independently, in a specific order defined by the
corresponding errors. By contrast, we will aim
to quantize the weights of all rows in the same
order, and will show that this typically yields
results with a final squared error that is simi-
lar to the original solutions. As a consequence,
the set of unquantized weights $F$ and similarly $H_F^{-1}$ is always the same for all rows (see Figure 2 for an illustration). In more detail, the latter is due to the fact that $H_F$ depends only on the layer inputs $X_F$, which are the same for all rows, and not on any weights. Therefore, we have to perform the update of $H_F^{-1}$ given by Equation (3) only $d_{\text{col}}$ times, once per column, rather than $d_{\text{row}} \cdot d_{\text{col}}$ times, once per weight. This reduces the overall runtime from $O(d_{\text{row}} \cdot d_{\text{col}}^3)$ to $O(\max\{d_{\text{row}} \cdot d_{\text{col}}^2,\, d_{\text{col}}^3\})$, i.e., by a factor of $\min\{d_{\text{row}}, d_{\text{col}}\}$. For larger models, this difference consists of several orders of magnitude.
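The sketch below illustrates this shared-order scheme: all rows are quantized column by column in a single fixed (here simply left-to-right) order, so one inverse-Hessian update per column serves every row. The uniform grid again stands in for $\operatorname{quant}(\cdot)$, and the Cholesky reformulation and batched lazy updates introduced in the following steps are deliberately omitted, so this is only an illustration of Step 1, not of the full GPTQ algorithm.

import torch

def quantize_to_grid(w, step=0.05):
    # Stand-in for quant(.): round to the nearest point of a uniform grid.
    return torch.round(w / step) * step

def quantize_shared_order(W, Hinv):
    # Quantize all rows of W in the same left-to-right column order, sharing a
    # single inverse Hessian: d_col updates of Hinv instead of d_row * d_col.
    W, Hinv = W.clone(), Hinv.clone()
    for q in range(W.shape[1]):
        err = (W[:, q] - quantize_to_grid(W[:, q])) / Hinv[q, q]  # one scalar per row
        W[:, q] = quantize_to_grid(W[:, q])
        # Equation (2), applied to every row at once on the columns after q.
        W[:, q + 1:] -= torch.outer(err, Hinv[q, q + 1:])
        # Equation (3): one shared update, zeroing row/column q of Hinv.
        Hinv = Hinv - torch.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
    return W

# Example: an 8 x 16 layer with a Hessian built from 128 calibration inputs.
d_row, d_col = 8, 16
X = torch.randn(d_col, 128)
Hinv = torch.linalg.inv(2 * X @ X.T + 0.01 * torch.eye(d_col))
W_hat = quantize_shared_order(torch.randn(d_row, d_col), Hinv)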
However, before this algorithm can actually be
applied to very large models in practice, two ad-
ditional major problems need to be addressed.
Step 2: Lazy Batch-Updates. First, a direct implementation of the scheme described previously
will not be fast in practice, because the algorithm has a relatively low compute-to-memory-access
ratio. For example, Equation (3) needs to update all elements of a potentially huge matrix using just a
(The remaining 15 pages of the paper are not included in this excerpt.)