# llama.cpp
![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)
Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++
### Recent API changes
- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` https://github.com/ggerganov/llama.cpp/pull/6341
- [2024 Mar 26] Logits and embeddings API updated for compactness https://github.com/ggerganov/llama.cpp/pull/6122
- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` https://github.com/ggerganov/llama.cpp/pull/6017
- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences; see the example after this list) https://github.com/ggerganov/llama.cpp/pull/5328
- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
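The sketch below illustrates two of the changes above: the session-file helpers now grouped under the `llama_state_*` prefix and the `bool` return of `llama_kv_cache_seq_rm()`. It is a minimal example only; the exact signatures are assumed from the linked PRs, and model/context setup is omitted, so treat `llama.h` as the authoritative reference.

```c
#include "llama.h"
#include <stdio.h>

// Minimal sketch: clear one sequence from the KV cache and persist the session.
// `ctx`, `tokens` and `n_tokens` are assumed to come from a normal decode loop.
void demo_state_api(struct llama_context * ctx, const llama_token * tokens, size_t n_tokens) {
    // llama_n_seq_max() reports the upper limit of acceptable seq_id values in a batch
    printf("max sequences: %u\n", llama_n_seq_max(ctx));

    // llama_kv_cache_seq_rm() now returns bool; negative p0/p1 are assumed to mean
    // "entire range". It can fail, e.g. for recurrent models such as Mamba.
    if (!llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/-1, /*p1=*/-1)) {
        fprintf(stderr, "could not remove sequence 0 from the KV cache\n");
    }

    // session-file helpers now live under llama_state_* (formerly llama_save_session_file)
    if (!llama_state_save_file(ctx, "session.bin", tokens, n_tokens)) {
        fprintf(stderr, "failed to save session file\n");
    }
}
```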
### Hot topics
- **MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` https://github.com/ggerganov/llama.cpp/pull/6387**
- Model sharding instructions using `gguf-split` https://github.com/ggerganov/llama.cpp/discussions/6404
- Fix major bug in Metal batched inference https://github.com/ggerganov/llama.cpp/pull/6225
- Multi-GPU pipeline parallelism support https://github.com/ggerganov/llama.cpp/pull/6017
- Looking for contributions to add Deepseek support: https://github.com/ggerganov/llama.cpp/issues/5981
- Quantization blind testing: https://github.com/ggerganov/llama.cpp/discussions/5962
- Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328
----
<details>
<summary>Table of Contents</summary>
<ol>
<li>
<a href="#description">Description</a>
</li>
<li>
<a href="#usage">Usage</a>
<ul>
<li><a href="#get-the-code">Get the Code</a></li>
<li><a href="#build">Build</a></li>
<li><a href="#blas-build">BLAS Build</a></li>
<li><a href="#prepare-and-quantize">Prepare and Quantize</a></li>
<li><a href="#run-the-quantized-model">Run the quantized model</a></li>
<li><a href="#memorydisk-requirements">Memory/Disk Requirements</a></li>
<li><a href="#quantization">Quantization</a></li>
<li><a href="#interactive-mode">Interactive mode</a></li>
<li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
<li><a href="#instruct-mode">Instruct mode</a></li>
<li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
<li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
<li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
<li><a href="#android">Android</a></li>
<li><a href="#docker">Docker</a></li>
</ul>
</li>
<li><a href="#contributing">Contributing</a></li>
<li><a href="#coding-guidelines">Coding guidelines</a></li>
<li><a href="#docs">Docs</a></li>
</ol>
</details>
## Description
The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
variety of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
- Vulkan, SYCL, and (partial) OpenCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (see the sketch below)
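As a rough illustration of the hybrid-inference point above, the sketch below offloads a fixed number of layers to the GPU via `llama_model_params.n_gpu_layers` and keeps the rest on the CPU. The model path and layer count are placeholders, and the decode loop is omitted; it is not a complete program for any particular model.

```c
#include "llama.h"
#include <stdio.h>

int main(void) {
    llama_backend_init();

    // Offload only part of the model; layers that do not fit in VRAM stay on the CPU.
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 20;   // placeholder value, tune to your GPU memory

    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize the prompt, call llama_decode(), sample tokens, etc. ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```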
Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
improved significantly thanks to many contributions. It is the main playground for developing new features for the
[ggml](https://github.com/ggerganov/ggml) library.
**Supported platforms:**
- [X] Mac OS
- [X] Linux
- [X] Windows (via CMake)
- [X] Docker
- [X] FreeBSD
**Supported models:**
Typically finetunes of the base models below are supported as well.
- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [X] Falcon
- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [X] [Persimmon 8B](https://github.com/ggerganov/llama.cpp/pull/3410)
- [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [X] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
(instructions for supporting more models: [HOWTO-add-model.md](./docs/HOWTO-add-model.md))
**Multimodal models:**
- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
**HTTP server**
[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
**Bindings:**
- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)