# llama.cpp
![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)
Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++
### Recent API changes
- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` https://github.com/ggerganov/llama.cpp/pull/6341
- [2024 Mar 26] Logits and embeddings API updated for compactness https://github.com/ggerganov/llama.cpp/pull/6122
- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` https://github.com/ggerganov/llama.cpp/pull/6017
- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences; see the example after this list) https://github.com/ggerganov/llama.cpp/pull/5328
- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
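The sketch below illustrates two of the changes above: the session-file helpers now grouped under the `llama_state_*` prefix and the `bool` return of `llama_kv_cache_seq_rm()`. It is a minimal example only; the exact signatures are assumed from the linked PRs, and model/context setup is omitted, so treat `llama.h` as the authoritative reference.

```c
#include "llama.h"
#include <stdio.h>

// Minimal sketch: clear one sequence from the KV cache and persist the session.
// `ctx`, `tokens` and `n_tokens` are assumed to come from a normal decode loop.
void demo_state_api(struct llama_context * ctx, const llama_token * tokens, size_t n_tokens) {
    // llama_n_seq_max() reports the upper limit of acceptable seq_id values in a batch
    printf("max sequences: %u\n", llama_n_seq_max(ctx));

    // llama_kv_cache_seq_rm() now returns bool; negative p0/p1 are assumed to mean
    // "entire range". It can fail, e.g. for recurrent models such as Mamba.
    if (!llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/-1, /*p1=*/-1)) {
        fprintf(stderr, "could not remove sequence 0 from the KV cache\n");
    }

    // session-file helpers now live under llama_state_* (formerly llama_save_session_file)
    if (!llama_state_save_file(ctx, "session.bin", tokens, n_tokens)) {
        fprintf(stderr, "failed to save session file\n");
    }
}
```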
### Hot topics
- **MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` https://github.com/ggerganov/llama.cpp/pull/6387**
- Model sharding instructions using `gguf-split` https://github.com/ggerganov/llama.cpp/discussions/6404
- Fix major bug in Metal batched inference https://github.com/ggerganov/llama.cpp/pull/6225
- Multi-GPU pipeline parallelism support https://github.com/ggerganov/llama.cpp/pull/6017
- Looking for contributions to add Deepseek support: https://github.com/ggerganov/llama.cpp/issues/5981
- Quantization blind testing: https://github.com/ggerganov/llama.cpp/discussions/5962
- Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328
----
<details>
<summary>Table of Contents</summary>
<ol>
<li>
<a href="#description">Description</a>
</li>
<li>
<a href="#usage">Usage</a>
<ul>
<li><a href="#get-the-code">Get the Code</a></li>
<li><a href="#build">Build</a></li>
<li><a href="#blas-build">BLAS Build</a></li>
<li><a href="#prepare-and-quantize">Prepare and Quantize</a></li>
<li><a href="#run-the-quantized-model">Run the quantized model</a></li>
<li><a href="#memorydisk-requirements">Memory/Disk Requirements</a></li>
<li><a href="#quantization">Quantization</a></li>
<li><a href="#interactive-mode">Interactive mode</a></li>
<li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
<li><a href="#instruct-mode">Instruct mode</a></li>
<li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
<li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
<li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
<li><a href="#android">Android</a></li>
<li><a href="#docker">Docker</a></li>
</ul>
</li>
<li><a href="#contributing">Contributing</a></li>
<li><a href="#coding-guidelines">Coding guidelines</a></li>
<li><a href="#docs">Docs</a></li>
</ol>
</details>
## Description
The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
variety of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
- Vulkan, SYCL, and (partial) OpenCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (see the sketch below)
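As a rough illustration of the hybrid-inference point above, the sketch below offloads a fixed number of layers to the GPU via `llama_model_params.n_gpu_layers` and keeps the rest on the CPU. The model path and layer count are placeholders, and the decode loop is omitted; it is not a complete program for any particular model.

```c
#include "llama.h"
#include <stdio.h>

int main(void) {
    llama_backend_init();

    // Offload only part of the model; layers that do not fit in VRAM stay on the CPU.
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 20;   // placeholder value, tune to your GPU memory

    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize the prompt, call llama_decode(), sample tokens, etc. ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```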
Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
improved significantly thanks to many contributions. It is the main playground for developing new features for the
[ggml](https://github.com/ggerganov/ggml) library.
**Supported platforms:**
- [X] Mac OS
- [X] Linux
- [X] Windows (via CMake)
- [X] Docker
- [X] FreeBSD
**Supported models:**
Typically finetunes of the base models below are supported as well.
- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [X] Falcon
- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [X] [Persimmon 8B](https://github.com/ggerganov/llama.cpp/pull/3410)
- [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [X] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
(instructions for supporting more models: [HOWTO-add-model.md](./docs/HOWTO-add-model.md))
**Multimodal models:**
- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
**HTTP server**
[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
**Bindings:**
- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)