# gemma.cpp
gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma
foundation models from Google.
For additional information about Gemma, see
[ai.google.dev/gemma](https://ai.google.dev/gemma). Model weights, including gemma.cpp
specific artifacts, are [available on
kaggle](https://www.kaggle.com/models/google/gemma).
NOTE: 2024-04-04: if using 2B models, please re-download weights from Kaggle and
ensure you have the latest version (-mqa or version 3). We are changing the code
to match the new weights. If you wish to use old weights, change `ConfigGemma2B`
in `configs.h` back to `kVocabSize = 256128` and `kKVHeads = 8`.
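For reference, the relevant edit looks roughly like the sketch below. Only `kVocabSize` and `kKVHeads` are named above; the new values and the surrounding structure are illustrative assumptions, so treat `configs.h` in your checkout as authoritative.

```cpp
// Hypothetical excerpt of ConfigGemma2B in configs.h.
struct ConfigGemma2B {
  static constexpr int kVocabSize = 256000;  // assumed new value; old weights: 256128
  static constexpr int kKVHeads = 1;         // assumed new (-mqa) value; old weights: 8
  // ... remaining model dimensions unchanged ...
};
```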
## Who is this project for?
Modern LLM inference engines are sophisticated systems, often with bespoke
capabilities extending beyond traditional neural network runtimes. With this
come opportunities for research and innovation through co-design of high-level
algorithms and low-level computation. However, there is a gap between
deployment-oriented C++ inference runtimes, which are not designed for
experimentation, and Python-centric ML research frameworks, which abstract away
low-level computation through compilation.
gemma.cpp provides a minimalist implementation of Gemma 2B and 7B models,
focusing on simplicity and directness rather than full generality. This is
inspired by vertically-integrated model implementations such as
[ggml](https://github.com/ggerganov/ggml),
[llama.c](https://github.com/karpathy/llama2.c), and
[llama.rs](https://github.com/srush/llama2.rs).
gemma.cpp targets experimentation and research use cases. It is intended to be
straightforward to embed in other projects with minimal dependencies and also
easily modifiable with a small ~2K LoC core implementation (along with ~4K LoC
of supporting utilities). We use the [Google
Highway](https://github.com/google/highway) Library to take advantage of
portable SIMD for CPU inference.
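To give a flavor of what portable SIMD with Highway looks like, here is a minimal sketch modeled on Highway's quick-start examples; it is not gemma.cpp's actual kernel code (see `ops.h` for that), and the function name is ours.

```cpp
#include <cstddef>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Dot product vectorized with Highway: the same source compiles to SSE4,
// AVX2, AVX-512, NEON, etc.; the lane count is chosen per target.
float DotProduct(const float* a, const float* b, size_t n) {
  const hn::ScalableTag<float> d;  // widest float vector for this target
  const size_t N = hn::Lanes(d);
  auto acc = hn::Zero(d);
  size_t i = 0;
  for (; i + N <= n; i += N) {
    // acc += a[i..i+N) * b[i..i+N), fused multiply-add per lane
    acc = hn::MulAdd(hn::LoadU(d, a + i), hn::LoadU(d, b + i), acc);
  }
  float sum = hn::ReduceSum(d, acc);      // horizontal sum across lanes
  for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
  return sum;
}
```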
For production-oriented edge deployments we recommend standard deployment
pathways using Python frameworks like JAX, Keras, PyTorch, and Transformers
([all model variations here](https://www.kaggle.com/models/google/gemma)).
## Contributing
Community contributions large and small are welcome. See
[DEVELOPERS.md](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md)
for additional notes for contributing developers, and [join the Discord by following
this invite link](https://discord.gg/H5jCBAWxAe). This project follows
[Google's Open Source Community
Guidelines](https://opensource.google.com/conduct/).
*Active development currently happens on the `dev` branch. Please open pull
requests targeting the `dev` branch instead of `main`, which is intended to be more
stable.*
## Quick Start
### System requirements
Before starting, you should have installed:
- [CMake](https://cmake.org/)
- [Clang C++ compiler](https://clang.llvm.org/get_started.html), supporting at
least C++17.
- `tar` for extracting archives from Kaggle.
Building natively on Windows requires the Visual Studio 2022 Build Tools with the
optional Clang/LLVM C++ frontend (`clang-cl`). This can be installed from the
command line with
[`winget`](https://learn.microsoft.com/en-us/windows/package-manager/winget/):
```sh
winget install --id Kitware.CMake
winget install --id Microsoft.VisualStudio.2022.BuildTools --force --override "--passive --wait --add Microsoft.VisualStudio.Workload.VCTools;installRecommended --add Microsoft.VisualStudio.Component.VC.Llvm.Clang --add Microsoft.VisualStudio.Component.VC.Llvm.ClangToolset"
```
### Step 1: Obtain model weights and tokenizer from Kaggle or Hugging Face Hub
Visit [the Gemma model page on
Kaggle](https://www.kaggle.com/models/google/gemma/frameworks/gemmaCpp) and select `Model Variations
|> Gemma C++`. On this tab, the `Variation` dropdown includes the options below.
Note that bfloat16 weights are higher fidelity, while 8-bit switched floating point
weights enable faster inference. In general, we recommend starting with the
`-sfp` checkpoints.
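For a rough sense of the difference: 8-bit weights occupy about one byte per parameter versus two for bfloat16, so the checkpoint size (and the memory bandwidth needed per generated token) roughly halves. A back-of-envelope sketch using nominal parameter counts; real checkpoint sizes differ due to padding and per-tensor metadata:

```cpp
#include <cstdio>

int main() {
  const double params[] = {2e9, 7e9};  // nominal 2B / 7B parameter counts
  const char* names[] = {"2B", "7B"};
  for (int i = 0; i < 2; ++i) {
    // bfloat16: 2 bytes/param; 8-bit switched floating point: ~1 byte/param
    std::printf("%s: bf16 ~%.0f GB, sfp ~%.0f GB\n", names[i],
                params[i] * 2 / 1e9, params[i] * 1 / 1e9);
  }
  return 0;  // prints: 2B: bf16 ~4 GB, sfp ~2 GB / 7B: bf16 ~14 GB, sfp ~7 GB
}
```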
Alternatively, visit the [gemma.cpp](https://huggingface.co/models?other=gemma.cpp)
models on the Hugging Face Hub. First, go to the repository of the model of interest
(see recommendations below). Then, click the `Files and versions` tab and download the
model and tokenizer files. For programmatic downloading, if you have `huggingface_hub`
installed, you can also download by running:
```sh
huggingface-cli login # Just the first time
huggingface-cli download google/gemma-2b-sfp-cpp --local-dir build/
```
2B instruction-tuned (`it`) and pre-trained (`pt`) models:
| Model name | Description |
| ----------- | ----------- |
| `2b-it` | 2 billion parameter instruction-tuned model, bfloat16 |
| `2b-it-sfp` | 2 billion parameter instruction-tuned model, 8-bit switched floating point |
| `2b-pt` | 2 billion parameter pre-trained model, bfloat16 |
| `2b-pt-sfp` | 2 billion parameter pre-trained model, 8-bit switched floating point |
7B instruction-tuned (`it`) and pre-trained (`pt`) models:
| Model name | Description |
| ----------- | ----------- |
| `7b-it` | 7 billion parameter instruction-tuned model, bfloat16 |
| `7b-it-sfp` | 7 billion parameter instruction-tuned model, 8-bit switched floating point |
| `7b-pt` | 7 billion parameter pre-trained model, bfloat16 |
| `7b-pt-sfp` | 7 billion parameter pre-trained model, 8-bit switched floating point |
> [!NOTE]
> **Important**: We strongly recommend starting off with the `2b-it-sfp` model to
> get up and running.
### Step 2: Extract Files
If you downloaded the models from Hugging Face, skip to step 3.
After you fill out the consent form on Kaggle, the download retrieves a tar
archive file, `archive.tar.gz`. Extract the files from `archive.tar.gz` (this
can take a few minutes):
```sh
tar -xf archive.tar.gz
```
This should produce a weights file such as `2b-it-sfp.sbs` and a tokenizer file
(`tokenizer.spm`). You may want to move these files to a convenient directory
(e.g. the `build/` directory in this repo).
### Step 3: Build
The build system uses [CMake](https://cmake.org/). To build the gemma inference
runtime, create a build directory and generate the build files using `cmake`
from the top-level project directory. Note that if you previously ran `cmake` and
are re-running with a different setting, be sure to clean out the `build/`
directory with `rm -rf build/*` (warning: this will also delete any other files
in the `build/` directory, such as downloaded weights).
For the 8-bit switched floating point weights (sfp), run `cmake` with no options:
#### Unix-like Platforms
```sh
cmake -B build
```
**or** if you downloaded bfloat16 weights (any model *without* `-sfp` in the name),
instead of running `cmake` with no options as above, run `cmake` with `WEIGHT_TYPE`
set to [Highway's](https://github.com/google/highway) `hwy::bfloat16_t` type
(this will be simplified in the future; we recommend `-sfp` weights over
bfloat16 for faster inference):
```sh
cmake -B build -DWEIGHT_TYPE=hwy::bfloat16_t
```
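For intuition, `WEIGHT_TYPE` selects the storage type for weights: with `hwy::bfloat16_t`, each weight occupies 16 bits and is widened to `float` for arithmetic. Below is a scalar sketch of that pattern, assuming Highway's `hwy::F32FromBF16` conversion helper; gemma.cpp's real kernels are vectorized and organized differently.

```cpp
#include <cstddef>

#include "hwy/base.h"  // hwy::bfloat16_t, hwy::F32FromBF16

// Scalar illustration only: bf16-stored weights widened to float per element.
float DotBF16(const hwy::bfloat16_t* w, const float* x, size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    sum += hwy::F32FromBF16(w[i]) * x[i];  // widen weight, then multiply
  }
  return sum;
}
```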
After running whichever of the above `cmake` invocations is appropriate for
your weights, you can enter the `build/` directory and run `make` to build the
`./gemma` executable:
```sh
# Enter the generated build directory
cd build
# Build the gemma executable
make -j [number of parallel threads to use] gemma
```
Replace `[number of parallel threads to use]` with a number - the number of
cores available on your system is a reasonable heuristic. For example,
`make -j4 gemma` will build using 4 threads. If the `nproc` command is
available, you can use `make -j$(nproc) gemma` as a reasonable default
for the number of threads.
If you aren't sure of the right value for the `-j` flag, you can simply run
`make gemma` instead and it should still build the `./gemma` executable.
> [!NOTE]
> On Windows Subsystem for Linux (WSL), users should set the number of
> parallel threads to 1. Using a larger number may result in errors.
If the build is successful, you should now have a `gemma` executable in the
`build/` directory.