<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Translation
This directory contains examples for finetuning and evaluating transformers on translation tasks.
Please tag @patil-suraj with any issues/unexpected behaviors, or send a PR!
For deprecated `bertabs` instructions, see [`bertabs/README.md`](https://github.com/huggingface/transformers/blob/main/examples/research_projects/bertabs/README.md).
For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2seq`](https://github.com/huggingface/transformers/blob/main/examples/legacy/seq2seq).
### Supported Architectures
- `BartForConditionalGeneration`
- `FSMTForConditionalGeneration` (translation only)
- `MBartForConditionalGeneration`
- `MarianMTModel`
- `PegasusForConditionalGeneration`
- `T5ForConditionalGeneration`
- `MT5ForConditionalGeneration`
`run_translation.py` is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets#json-files. You will also find examples of these below.
## With Trainer
Here is an example of fine-tuning a MarianMT model on a translation task:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
MBart and some T5 models require special handling.
T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
If you get a terrible BLEU score, make sure that you didn't forget to use the `--source_prefix` argument.
For the aforementioned group of T5 models, if you switch to a different language pair, make sure to adjust the values of all three language-specific command line arguments: `--source_lang`, `--target_lang` and `--source_prefix`.
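If it helps to see what the prefix actually does, here is a rough sketch of the preprocessing the script applies for these T5 checkpoints; the variable names and maximum lengths are illustrative, not the script's exact code:
```python
from transformers import AutoTokenizer

# Illustrative sketch of how --source_prefix is used: the prefix is prepended to every
# source sentence before tokenization, and the targets become the labels.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

prefix = "translate English to Romanian: "
examples = {
    "translation": [
        {"en": "Others have dismissed him as a joke.", "ro": "Alții l-au numit o glumă."},
    ]
}

inputs = [prefix + ex["en"] for ex in examples["translation"]]
targets = [ex["ro"] for ex in examples["translation"]]

model_inputs = tokenizer(inputs, max_length=128, truncation=True)
labels = tokenizer(text_target=targets, max_length=128, truncation=True)
model_inputs["labels"] = labels["input_ids"]
```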
MBart models require a different format for `--source_lang` and `--target_lang` values, e.g. instead of `en` it expects `en_XX`, for `ro` it expects `ro_RO`. The full MBart specification for language codes can be found [here](https://huggingface.co/facebook/mbart-large-cc25). For example:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path facebook/mbart-large-en-ro \
--do_train \
--do_eval \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--source_lang en_XX \
--target_lang ro_RO \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
And here is how you would use translation fine-tuning on your own files, after adjusting the values of the `--train_file` and `--validation_file` arguments to match your setup:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--train_file path_to_jsonlines_file \
--validation_file path_to_jsonlines_file \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
The task of translation supports only custom JSONLINES files, with each line being a dictionary with a key `"translation"` whose value is another dictionary whose keys are the two languages of the pair. For example:
```json
{ "translation": { "en": "Others have dismissed him as a joke.", "ro": "AlÈii l-au numit o glumÄ." } }
{ "translation": { "en": "And some are holding out for an implosion.", "ro": "Iar alÈii aÈteaptÄ implozia." } }
```
Here the languages are Romanian (`ro`) and English (`en`).
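As a minimal sketch of how such a file is consumed, this is roughly what the script does with `--train_file`/`--validation_file` under the hood via 🤗 Datasets (the file names below are placeholders):
```python
from datasets import load_dataset

# Placeholder file names; each file must follow the jsonlines format shown above.
raw_datasets = load_dataset(
    "json",
    data_files={"train": "train.json", "validation": "val.json"},
)

# Every record is a dict with a "translation" key mapping language codes to text.
print(raw_datasets["train"][0]["translation"]["en"])
```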
If you want to use a pre-processed dataset that leads to high BLEU scores for the `en-de` language pair, you can use `--dataset_name stas/wmt14-en-de-pre-processed`, as follows:
```bash
python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang de \
--source_prefix "translate English to German: " \
--dataset_name stas/wmt14-en-de-pre-processed \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
## With Accelerate
Based on the script [`run_translation_no_trainer.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation_no_trainer.py).
Like `run_translation.py`, this script allows you to fine-tune any of the supported models on a translation task. The main difference is that this script exposes the bare training loop, so you can quickly experiment and add any customization you would like (see the sketch at the end of this section).
It offers fewer options than the `Trainer`-based script (for instance, you can easily change the options for the optimizer or the dataloaders directly in the script), but it still runs in a distributed setup or on TPU and supports mixed precision by means of the [🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the script normally
after installing it:
```bash
pip install git+https://github.com/huggingface/accelerate
```
then
```bash
python run_translation_no_trainer.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir ~/tmp/tst-translation
```
You can then use your usual launchers to run it in a distributed environment, but the easiest way is to run
```bash
accelerate config
```
and reply to the questions asked. Then
```bash
accelerate test
```
which will check that everything is ready for training. Finally, you can launch training with
```bash
accelerate launch run_translation_no_trainer.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir ~/tmp/tst-translation
```
This command is the same and will work for:
- a CPU-only setup
- a setup with one GPU
- a distributed training with several GPUs (single or multi node)
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.
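For reference, here is a minimal sketch of the kind of bare training loop this script exposes, reduced to a single toy batch; it is illustrative only and not the script's exact code:
```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative only: the real script builds proper dataloaders, a scheduler, evaluation, etc.
accelerator = Accelerator()
model_name = "Helsinki-NLP/opus-mt-en-ro"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# accelerator.prepare() handles device placement, distributed wrapping and mixed precision.
model, optimizer = accelerator.prepare(model, optimizer)

# A single toy batch standing in for a tokenized translation dataset.
sources = ["Others have dismissed him as a joke."]
targets = ["Alții l-au numit o glumă."]
batch = tokenizer(sources, text_target=targets, padding=True, return_tensors="pt").to(accelerator.device)

model.train()
loss = model(**batch).loss
accelerator.backward(loss)  # replaces the usual loss.backward()
optimizer.step()
optimizer.zero_grad()
```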