<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Translation
This directory contains examples for finetuning and evaluating transformers on translation tasks.
Please tag @patil-suraj with any issues/unexpected behaviors, or send a PR!
For deprecated `bertabs` instructions, see [`bertabs/README.md`](https://github.com/huggingface/transformers/blob/main/examples/research_projects/bertabs/README.md).
For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2seq`](https://github.com/huggingface/transformers/blob/main/examples/legacy/seq2seq).
### Supported Architectures
- `BartForConditionalGeneration`
- `FSMTForConditionalGeneration` (translation only)
- `MBartForConditionalGeneration`
- `MarianMTModel`
- `PegasusForConditionalGeneration`
- `T5ForConditionalGeneration`
- `MT5ForConditionalGeneration`
`run_translation.py` is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets#json-files. You will also find examples of these below.
## With Trainer
Here is an example of fine-tuning a MarianMT model on a translation task:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
MBart and some T5 models require special handling.
T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
If you get a terrible BLEU score, make sure that you didn't forget to use the `--source_prefix` argument.
For the aforementioned group of T5 models, if you switch to a different language pair, make sure to adjust the values of all three language-specific command line arguments: `--source_lang`, `--target_lang` and `--source_prefix`.
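If it helps to see what the prefix actually does, here is a rough sketch of the preprocessing the script applies for these T5 checkpoints; the variable names and maximum lengths are illustrative, not the script's exact code:
```python
from transformers import AutoTokenizer

# Illustrative sketch of how --source_prefix is used: the prefix is prepended to every
# source sentence before tokenization, and the targets become the labels.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

prefix = "translate English to Romanian: "
examples = {
    "translation": [
        {"en": "Others have dismissed him as a joke.", "ro": "Alții l-au numit o glumă."},
    ]
}

inputs = [prefix + ex["en"] for ex in examples["translation"]]
targets = [ex["ro"] for ex in examples["translation"]]

model_inputs = tokenizer(inputs, max_length=128, truncation=True)
labels = tokenizer(text_target=targets, max_length=128, truncation=True)
model_inputs["labels"] = labels["input_ids"]
```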
MBart models require a different format for `--source_lang` and `--target_lang` values, e.g. instead of `en` it expects `en_XX`, for `ro` it expects `ro_RO`. The full MBart specification for language codes can be found [here](https://huggingface.co/facebook/mbart-large-cc25). For example:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path facebook/mbart-large-en-ro \
--do_train \
--do_eval \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--source_lang en_XX \
--target_lang ro_RO \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
And here is how you would use translation fine-tuning on your own files, after adjusting the values of the `--train_file` and `--validation_file` arguments to match your setup:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--train_file path_to_jsonlines_file \
--validation_file path_to_jsonlines_file \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
The task of translation supports only custom JSONLINES files, with each line being a dictionary with a key `"translation"` whose value is another dictionary whose keys are the two languages of the pair. For example:
```json
{ "translation": { "en": "Others have dismissed him as a joke.", "ro": "AlÈii l-au numit o glumÄ." } }
{ "translation": { "en": "And some are holding out for an implosion.", "ro": "Iar alÈii aÈteaptÄ implozia." } }
```
Here the languages are Romanian (`ro`) and English (`en`).
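As a minimal sketch of how such a file is consumed, this is roughly what the script does with `--train_file`/`--validation_file` under the hood via 🤗 Datasets (the file names below are placeholders):
```python
from datasets import load_dataset

# Placeholder file names; each file must follow the jsonlines format shown above.
raw_datasets = load_dataset(
    "json",
    data_files={"train": "train.json", "validation": "val.json"},
)

# Every record is a dict with a "translation" key mapping language codes to text.
print(raw_datasets["train"][0]["translation"]["en"])
```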
If you want to use a pre-processed dataset that leads to high BLEU scores for the `en-de` language pair, you can use `--dataset_name stas/wmt14-en-de-pre-processed`, as follows:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang de \
--source_prefix "translate English to German: " \
--dataset_name stas/wmt14-en-de-pre-processed \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
## With Accelerate
Based on the script [`run_translation_no_trainer.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation_no_trainer.py).
Like `run_translation.py`, this script allows you to fine-tune any of the supported models on a translation task. The main difference is that this script exposes the bare training loop, so you can quickly experiment and add any customization you would like (see the sketch at the end of this section).
It offers fewer options than the `Trainer`-based script (for instance, you can easily change the options for the optimizer or the dataloaders directly in the script), but it still runs in a distributed setup or on TPU and supports mixed precision by means of the [🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the script normally
after installing it:
```bash
pip install git+https://github.com/huggingface/accelerate
```
then
```bash
python run_translation_no_trainer.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir ~/tmp/tst-translation
```
You can then use your usual launchers to run it in a distributed environment, but the easiest way is to run
```bash
accelerate config
```
and reply to the questions asked. Then
```bash
accelerate test
```
which will check that everything is ready for training. Finally, you can launch training with
```bash
accelerate launch run_translation_no_trainer.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir ~/tmp/tst-translation
```
This command is the same and will work for:
- a CPU-only setup
- a setup with one GPU
- a distributed training with several GPUs (single or multi node)
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.
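For reference, here is a minimal sketch of the kind of bare training loop this script exposes, reduced to a single toy batch; it is illustrative only and not the script's exact code:
```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative only: the real script builds proper dataloaders, a scheduler, evaluation, etc.
accelerator = Accelerator()
model_name = "Helsinki-NLP/opus-mt-en-ro"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# accelerator.prepare() handles device placement, distributed wrapping and mixed precision.
model, optimizer = accelerator.prepare(model, optimizer)

# A single toy batch standing in for a tokenized translation dataset.
sources = ["Others have dismissed him as a joke."]
targets = ["Alții l-au numit o glumă."]
batch = tokenizer(sources, text_target=targets, padding=True, return_tensors="pt").to(accelerator.device)

model.train()
loss = model(**batch).loss
accelerator.backward(loss)  # replaces the usual loss.backward()
optimizer.step()
optimizer.zero_grad()
```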