InferenceLlama2inonefileofpureC.zip resource - CSDN Library

25 files in total

py: 8

c: 4

md: 3

Points required: 5 · 142 views · Uploaded 2023-12-31 10:05:50 · 683KB ZIP

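"Inference Llama 2 in one file of pure C" is an archive aimed at C enthusiasts and developers. It contains llama2.c, a project whose inference engine for the Llama 2 architecture is implemented in a single C source file (run.c). Before looking at the project, it helps to clarify what "inference engine" means here and why C suits the job. In this context, an inference engine is a program that loads the weights of a trained neural network and runs its forward pass to generate output, here text produced one token at a time. It is not a rule-based reasoning system of the kind used in expert systems: there is no rule base or fact database, only model weights, a tokenizer, and numeric code.

C is a relatively low-level language known for efficiency, flexibility, and closeness to the hardware. Writing the inference engine in C has several advantages:

1. **Performance**: compiled C runs fast with little overhead, which suits compute-intensive work such as the matrix multiplications that dominate inference.
2. **Portability**: C compilers exist for virtually every platform; the archive includes a Makefile as well as `build_msvc.bat`, `win.c`, and `win.h` for building on Windows.
3. **Control**: C gives the developer direct control over memory layout and allocation, which matters when model weights take up most of a program's memory.
4. **Simplicity**: the language is small, and a single-source-file implementation makes the whole program logic easy to take in at once.

Key topics to look for in the code:

1. **Data structures**: instead of linked lists, trees, or graphs, expect plain structs and flat float arrays that hold the model configuration, the weights, and intermediate activations.
2. **Inference algorithm**: the core is the Transformer forward pass followed by sampling of the next token (the README covers temperature and top-p sampling), not forward or backward chaining or graph search.
3. **Function encapsulation**: although the engine lives in one file, splitting it into functions with clear interfaces keeps it readable.
4. **Error handling**: C has no exceptions, so errors are handled through explicit checks, for example on the results of file I/O and memory allocation.
5. **Memory management**: C has no garbage collection, so memory must be allocated and freed manually, and leaks must be avoided.

Studying the project is a way to learn how LLM inference works at a low level and how far plain C can go in implementing it. Whether you are new to C or an experienced developer, it is useful study material.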

Inference Llama 2 in one file of pure C.zip (25 files)

- sss/
  - build_msvc.bat 45B
  - test_all.py 4KB
  - configurator.py 2KB
  - test.c 3KB
  - tinystories.py 11KB
  - .github/
    - workflows/
      - build.yml 4KB
  - runq.c 42KB
  - assets/
    - llama_cute.jpg 183KB
  - tokenizer.py 3KB
  - doc/
    - stories260K.md 3KB
    - train_llama_tokenizer.md 3KB
  - win.h 2KB
  - Makefile 2KB
  - LICENSE 1KB
  - export.py 24KB
  - run.ipynb 4KB
  - sample.py 3KB
  - model.py 15KB
  - run.c 38KB
  - requirements.txt 107B
  - tokenizer.model 488KB
  - train.py 14KB
  - tokenizer.bin 424KB
  - win.c 4KB
  - README.md 33KB

## llama2.c

<p align="center">
  <img src="assets/llama_cute.jpg" width="300" height="300" alt="Cute Llama">
</p>

Have you ever wanted to inference a baby [Llama 2](https://ai.meta.com/llama/) model in pure C? No? Well, now you can!

Train the Llama 2 LLM architecture in PyTorch then inference it with one simple 700-line C file ([run.c](run.c)). You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough (ref: [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) paper). This repo is a "fullstack" train + inference solution for Llama 2 LLM, with focus on minimalism and simplicity.

As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. Work on model quantization is currently ongoing.

Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
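To make the fp32 limitation concrete, here is a back-of-the-envelope sketch of how much memory the weights alone need at 4 bytes per parameter. The helper names are hypothetical, and the parameter counts used in the note below (15 million and 7 billion) are round-number assumptions rather than exact model sizes.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold n_params weights stored as 32-bit floats.
   64-bit arithmetic keeps multi-billion parameter counts from overflowing. */
uint64_t fp32_weight_bytes(uint64_t n_params) {
    assert(sizeof(float) == 4); /* assumption: float is IEEE 754 single precision */
    return n_params * (uint64_t)sizeof(float);
}

/* Convert a byte count to GiB (2^30 bytes) for display. */
double bytes_to_gib(uint64_t bytes) {
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

Under these assumptions, `fp32_weight_bytes(15000000)` gives 60,000,000 bytes and `fp32_weight_bytes(7000000000)` gives 28,000,000,000 bytes, roughly 26 GiB, before any activations or attention cache are counted. That is in line with the checkpoint sizes quoted elsewhere in this README.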
## feel the magic

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/karpathy/llama2.c/blob/master/run.ipynb)

First, navigate to the folder where you keep your projects and clone this repository to this folder:

```bash
git clone https://github.com/karpathy/llama2.c.git
```

Then, open the repository folder:

```bash
cd llama2.c
```

Now, let's just run a baby Llama 2 model in C. You need a model checkpoint. Download this 15M parameter model I trained on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (~60MB download):

```bash
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
```

Compile and run the C code:

```bash
make run
./run stories15M.bin
```

You'll see the text stream a sample. On my M1 MacBook Air this runs at ~110 tokens/s. See [performance](#performance) or the Makefile for compile flags that can significantly speed this up. We can also try a bit bigger 42M parameter model:

```bash
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
./run stories42M.bin
```

This still runs at interactive rates and samples more coherent and diverse stories:

> Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.

You can also prompt the model with a prefix or a number of additional command line arguments, e.g. to sample at temperature 0.8 for 256 steps and with a prompt:

```bash
./run stories42M.bin -t 0.8 -n 256 -i "One day, Lily met a Shoggoth"
```

> One day, Lily met a Shoggoth. He was very shy, but was also very generous. Lily said “Hello Shoggy! Can I be your friend?” Shoggy was happy to have a friend and said “Yes, let’s explore the universe together!” So they set off on a journey to explore the universe. As they travelled, Shoggy was happy to explain to Lily about all the wonderful things in the universe. At the end of the day, Lily and Shoggy had gathered lots of wonderful things from the universe, and they both felt very proud. They promised to explore the universe as one big pair and to never stop being generous to each other.

There is also an even better 110M param model available, see [models](#models).

Quick note on sampling, the recommendation for ~best results is to sample with `-t 1.0 -p 0.9`, i.e. temperature 1.0 (default) but also top-p sampling at 0.9 (default). Intuitively, top-p ensures that tokens with tiny probabilities do not get sampled, so we can't get "unlucky" during sampling, and we are less likely to go "off the rails" afterwards. More generally, to control the diversity of samples use either the temperature (i.e. vary `-t` between 0 and 1 and keep top-p off with `-p 0`) or the top-p value (i.e. vary `-p` between 0 and 1 and keep `-t 1`), but not both. Nice explainers on LLM sampling strategies include [this](https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/), [this](https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p) or [this](https://huggingface.co/blog/how-to-generate).
## Meta's Llama 2 models

As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https://github.com/facebookresearch/llama). Once we have those checkpoints, we have to convert them into the llama2.c format. For this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export.py` file, e.g. for 7B model:

```bash
python export.py llama2_7b.bin --meta-llama path/to/llama/model/7B
```

The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts. I would not attempt to run anything above 7B right now for two reasons: first, 13B+ currently doesn't work because of integer overflow in pointer arithmetic, which is yet to be fixed, and second, even if it were fixed, this repo is doing float32 inference right now, so it would be fairly unusably slow. Once the export is done, we can run it:

```bash
./run llama2_7b.bin
```

This ran at about 4 tokens/s compiled with [OpenMP](#OpenMP) on 96 threads on my CPU Linux box in the cloud. (On my MacBook Air M1, currently it's closer to 30 seconds per token if you just build with `make runfast`.) Example output:

> The purpose of this document is to highlight the state-of-the-art of CoO generation technologies, both recent developments and those in commercial use. The focus is on the technologies with the highest merit to become the dominating processes of the future and therefore to be technologies of interest to S&T ... R&D. As such, CoO generation technologies developed in Russia, Japan and Europe are described in some depth. The document starts with an introduction to cobalt oxides as complex products and a short view on cobalt as an essential material. The document continues with the discussion of the available CoO generation processes with respect to energy and capital consumption as well as to environmental damage.

base models... ¯\\_(ツ)_/¯. Since we can inference the base model, it should be possible to also inference the chat model quite easily, and have a conversation with it. And if
