Python3数据结构与算法、实现常用算法以及分布式系统相关算法。.zip资源-CSDN文库

共92个文件

py：64个

ipynb：14个

md：4个

需积分: 5 94 浏览量 2024-01-05 22:31:52 上传评论收藏 926KB ZIP 举报

资源推荐

资源详情

资源评论

收起资源包目录

Python3数据结构与算法、实现常用算法以及分布式系统相关算法。.zip （92个子文件）

zyqmv

__init__.py 2B

mqueue

__init__.py 79B

base.py 698B

hashtable

__init__.py 79B

ht.py 900B

tree

__init__.py 79B

red_black_tree.py 1KB

binary_search_tree.py 2KB

b_tree.py 79B

LICENSE 1KB

docker-compose.yaml 0B

linklist

__init__.py 79B

base.py 3KB

heap

__init__.py 79B

max_binary_heap.py 147B

binary_heap.py 2KB

priv_queue.py 79B

min_binary_heap.py 2KB

examples

q_s.py 406B

str_one.py 475B

xuanz_min.py 277B

find_dup_in_list.py 334B

readme.md 0B

randomfun.py 660B

llm

cerebra_lora.ipynb 3KB

alpaca_ft.ipynb 24KB

chatglm_6b.ipynb 31KB

max_min.py 360B

num_com.py 798B

pytorch

ny.ipynb 2KB

bt.ipynb 1KB

transformer.ipynb 0B

pytorch_tensor.ipynb 5KB

attention.ipynb 81KB

pytorch_file.ipynb 4KB

attention_seq2seq.ipynb 773B

softmax_reg.ipynb 83KB

mlp.ipynb 133KB

multi_regre.ipynb 13KB

line_reguration.ipynb 125KB

goldcoin.py 386B

num_set.py 2KB

libs

__init__.py 79B

stack

__init__.py 79B

base.py 733B

nlp

line_regression.py 1KB

pri_algorithm.py 79B

rnn.py 383B

no_line_regression.py 716B

openai_gpt.py 1KB

bert.py 6KB

lrg.py 2KB

transformer.py 10KB

mlp.py 985B

docs

readme.md 1KB

latexf

case.tex 78B

common

__init__.py 79B

reverse_list.py 1KB

recursion.py 393B

requirements.txt 714B

algorithm

__init__.py 79B

fab.py 404B

merge_sort.py 835B

insert_sort.py 58B

distributed_system

__init__.py 79B

pow.py 125B

raft.py 79B

paxos.py 79B

pos.py 6KB

dpos.py 79B

bubble_sort.py 555B

quick_sort.py 2KB

strstr.py 79B

binary_search.py 636B

2sum.py 132B

llmodels

cerebres

qkst.py 505B

alpaca

finetune.py 4KB

README.md 12KB

chat.py 1KB

bert

finetune.py 4KB

graph

__init__.py 79B

undirected_graph.py 2KB

directed_graph.py 2KB

base.py 2KB

.gitignore 1KB

images

light.jpg 258KB

bg.jpg 76KB

data_set.png 230KB

btc_tweet.png 265KB

sentiment_plot.png 11KB

stablity

awesome.py 1KB

README.md 4KB

# 微调 Alpaca与LLaMA模型: 在一个自定义的数据集上进行大模型训练很高兴给大家介绍基于Alpaca的Lora微调教程。在本教程当中, 我们将通过检测Tweets上比特币的情绪分析，来探索Alpaca LoRa的微调过程。 # 环境配置 [lpaca Lora仓库](https://github.com/tloen/alpaca-lora)提供了基于低秩适应(Lora)重现斯坦福Alpaca大模型效果的处理代码。包括一个效果类GPT-3(text-davinci-003)的指令模型。模型参数可以扩展到13b, 30b, 以及65b, 同时Hugging Face的[PEFT](https://github.com/huggingface/peft)以及Dettmers提供的[bitsandybytes](https://github.com/TimDettmers/bitsandbytes)被用于在大模型微调中的提效与降本。我们将在一个特定数据集上对Alpaca Lora进行一次完整的微调，首先从数据准备开始，最后是我们对模型的训练。本教程将会覆盖数据处理、模型训练、以及使用最普世的自然语言处理库比如Transformers和Hugging Face进行结果评估。此外我们也会通过使用Gradio来介绍模型的部署以及测试。在开始教程之前, 首先需要安装依赖包, 在本文中用到的依赖包如下: ```python pip install -U pip pip install accelerate==0.18.0 pip install appdirs==1.4.4 pip install bitsandbytes==0.37.2 pip install datasets==2.10.1 pip install fire==0.5.0 pip install git+https://github.com/huggingface/peft.git pip install git+https://github.com/huggingface/transformers.git pip install torch==2.0.0 pip install sentencepiece==0.1.97 pip install tensorboardX==2.6 pip install gradio==3.23.0 ``` 在安装好以上依赖之后, 即可开始我们本次的课程之旅了，首先让我们来引入对应的依赖包 ```python import json import transformers import textwrap from transformers import LlamaTokenizer, LlamaForCausalLM import os import sys from typing import List from peft import ( LoraConfig, get_peft_model, get_peft_model_state_dict, prepare_model_for_int8_training, ) import fire import torch from datasets import load_dataset import pandas as pd import matplotlib.pyplot as plt import matplotlib as mpl import seaborn as sns from pylab import rcParams # 设置是用GPU还是CPU, 如果是mac M1芯片可以尝试mps device = "cuda" if torch.cuda.is_available() else "cpu" ``` # 数据本文中我们使用的数据是BTC推特上的情绪分析[数据集](https://www.kaggle.com/datasets/aisolutions353/btc-tweets-sentiment), 在Kaggle网站上就可以下载到对应的数据集, 本数据集包含了50000+BTC相关的推文。为清洗这些数据, 本文中移除了所有'RT'开始以及包含链接的数据。OK, 首先我们来下载数据集。在Kaggle网站上，直接选择到对应的数据集，下载即可。当然也可以使用命令来下载。 ![数据集下载](https://github.com/csunny/algorithm/blob/master/images/btc_tweet.png) > !gdown 1xQ89cpZCnafsW5T3G3ZQWvR7q682t2BN 我们可以通过Pandas来加载CSV文件数据 ```python df = pd.read_csv("../../data/BTC_Tweets_Updated.csv") df.head() ``` ![head](https://github.com/csunny/algorithm/blob/master/images/data_set.png) 在数据集上, 处理之后差不多有1900条推文, 情绪标签通过数字来表示，-1表示消极情绪, 0表示中性情绪, 1表示积极情绪。首先看一下数据分布 ```python3 def.sentiment.value_counts() ``` ``` ['positive'] 22937 ['neutral'] 21932 ['negative'] 5983 Name: Sentiment, dtype: int64 ``` ```python df.Sentiment.value_counts().plot(kind="bar") ``` ![plot](https://github.com/csunny/algorithm/blob/master/images/sentiment_plot.png) 通过数据分布我们可以看出, 负面情绪的分布明显较低，在评估模型的效果时我们应该重点考虑。 # 构建JSON数据集在原始的alpaca仓库中，用到的数据集是JSON文件，是一份包含instruction、input、以及output的数据列表。接下来我们将数据转换为对应的json格式。 ```python def sentiment_score_to_name(score: float): if score > 0: return "Positive" elif score < 0: return "Negative" return "Neutral" dataset_data = [ { "instruction": "Detect the sentiment of the tweet.", "input": row_dict["tweet"], "output": sentiment_score_to_name(row_dict["sentiment"]) } for row_dict in df.to_dict(orient="records") ] dataset_data[0] ``` ```json { "instruction": "Detect the sentiment of the tweet.", "input": "@p0nd3ea Bitcoin wasn't built to live on exchanges.", "output": "Positive" } ``` 最后我们将数据保存到文件，用于之后的模型训练。 ```python import json with open("alpaca-bitcoin-sentiment-dataset.json", "w") as f: json.dump(dataset_data, f) ``` # 模型权重虽然没有原始的LLaMA模型的权重可以使用, 但它们被泄漏了并且被改编为HuggingFace的模型库可以跟Transformers一起使用。在这里我们使用decapoda研究的权重。 ```python BASE_MODEL = "decapoda-research/llama-7b-hf" model = LlamaForCausalLM.from_pretrained( BASE_MODEL, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto", ) tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL) tokenizer.pad_token_id = ( 0 ) tokenizer.padding_side = "left" ``` 这段使用LlamaFOrCausalLM类来加载预训练的Llama模型,LlamaFOrCausalLM类被HuggingFace的Transformers库所实现。 load_in_8bit=True参数使用8位量化加载模型以减少内存使用并提高推理速度。同时以上代码也加载了分词器通过同样的Llama模型, 使用Transformers的LlamaTokenizer类，并且设置了一些额外的属性比如pad_token_id设置为了0来表现未知的token, 设置了padding_side 设置为了left, 为了在左侧填充序列。 # 数据集现在我们已经加载了模型和分词器, 我们可以通过HuggingFace提供的load_dataset()方法来处理我们之前保存的数据了。 ``` data = load_dataset("json", data_files="alpaca-bitcoin-sentiment-dataset.json") data["train] ``` ```python Dataset({ features: ['instruction', 'input', 'output'], num_rows: 1897 }) ``` 接下来, 我们需要需要从数据集中构建提示词，并进行标记。 ```python def generate_prompt(data_point): return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. # noqa: E501 ### Instruction: {data_point["instruction"]} ### Input: {data_point["input"]} ### Response: {data_point["output"]}""" def tokenize(prompt, add_eos_token=True): result = tokenizer( prompt, truncation=True, max_length=CUTOFF_LEN, padding=False, return_tensors=None, ) if ( result["input_ids"][-1] != tokenizer.eos_token_id and len(result["input_ids"]) < CUTOFF_LEN and add_eos_token ): result["input_ids"].append(tokenizer.eos_token_id) result["attention_mask"].append(1) result["labels"] = result["input_ids"].copy() return result def generate_and_tokenize_prompt(data_point): full_prompt = generate_prompt(data_point) tokenized_full_prompt = tokenize(full_prompt) return tokenized_full_prompt ``` 上述第一个函数generate_prompt 从数据集里面取一个数据点并通过组合instruction、input、以及output值来生成一个提示。第二个函数tokenize获取生成的提示词并对其进行分词。它也会给词追加一个结束序列并设置一个标签, 保持跟输入序列一致。第三个函数generate_and_tokenize_prompt 组合了第一个和第二个函数在一个步骤里面生成并且分词提示词。数据准备的最后一步是将数据拆分为单独的训练集和验证集 ```python train_val = data["train"].train_test_split( test_size = 200, shuffer=True, seed=42 ) train_data = ( train_val["train"].map(generate_and_tokenize_prompt) ) val_data

评论收藏

内容反馈