# MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
[Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), [Xiaoqian Shen](https://xiaoqian-shen.github.io), [Xiang Li](https://xiangli.ac.cn), and [Mohamed Elhoseiny](https://www.mohamed-elhoseiny.com/). *Equal Contribution
**King Abdullah University of Science and Technology**
<a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2304.10592'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/spaces/Vision-CAIR/minigpt4'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a> <a href='https://huggingface.co/Vision-CAIR/MiniGPT-4'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a> [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://www.youtube.com/watch?v=__tftoxpBAw&feature=youtu.be)
## News
We now provide a pretrained MiniGPT-4 aligned with Vicuna-7B! Demo GPU memory consumption can now be as low as 12 GB.
## Online Demo
Click the image to chat with MiniGPT-4 about your images
[![demo](figs/online_demo.png)](https://minigpt-4.github.io)
## Examples
| | |
:-------------------------:|:-------------------------:
![find wild](figs/examples/wop_2.png) | ![write story](figs/examples/ad_2.png)
![solve problem](figs/examples/fix_1.png) | ![write Poem](figs/examples/rhyme_1.png)
More examples can be found in the [project page](https://minigpt-4.github.io).
## Introduction
- MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer.
- We train MiniGPT-4 in two stages. The first, traditional pretraining stage uses roughly 5 million aligned image-text pairs and takes about 10 hours on 4 A100s. After this stage, Vicuna is able to understand the image, but its generation ability is heavily impaired.
- To address this issue and improve usability, we propose a novel way to create high-quality image-text pairs using the model itself together with ChatGPT. Based on this, we then create a small (3,500 pairs in total) yet high-quality dataset.
- The second finetuning stage is trained on this dataset in a conversation template to significantly improve its generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes with a single A100.
- MiniGPT-4 yields many emerging vision-language capabilities similar to those demonstrated in GPT-4.
![overview](figs/overview.png)
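The core alignment idea above can be sketched in a few lines: a single linear projection maps frozen visual features into the frozen LLM's embedding space, and that projection is the only trainable module. The dimensions below are illustrative assumptions, not the exact model sizes.

```python
# Minimal sketch of the MiniGPT-4 alignment idea: only one projection
# layer is trained; the visual encoder and the LLM stay frozen.
import torch
import torch.nn as nn

VISION_DIM = 768   # hypothetical Q-Former output width
LLM_DIM = 5120     # hypothetical Vicuna-13B hidden size

proj = nn.Linear(VISION_DIM, LLM_DIM)  # the only trainable module

# e.g. 32 query tokens per image coming out of the frozen visual encoder
visual_tokens = torch.randn(1, 32, VISION_DIM)
llm_inputs = proj(visual_tokens)  # ready to prepend to text embeddings
print(llm_inputs.shape)  # torch.Size([1, 32, 5120])
```

Because everything except `proj` is frozen, both training stages only need to update this small module, which is why stage two finishes in minutes.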
## Getting Started
### Installation
**1. Prepare the code and the environment**
Git clone our repository, create a Python environment, and activate it via the following commands:
```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
```
**2. Prepare the pretrained Vicuna weights**
The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
Please refer to our instruction [here](PrepareVicuna.md)
to prepare the Vicuna weights.
The final weights should be in a single folder with a structure similar to the following:
```
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```
Then, set the path to the vicuna weight in the model config file
[here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.
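For reference, the relevant line in that config looks roughly like the following (the `llama_model` key name is an assumption about the current config layout; the path is a placeholder):

```yaml
# minigpt4/configs/models/minigpt4.yaml (excerpt)
llama_model: "/path/to/vicuna_weights/"
```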
**3. Prepare the pretrained MiniGPT-4 checkpoint**
Download the pretrained checkpoint corresponding to the Vicuna model you prepared.
| Checkpoint Aligned with Vicuna 13B | Checkpoint Aligned with Vicuna 7B |
:------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:
[Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) | [Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing)
Then, set the path to the pretrained checkpoint in the evaluation config file
in [eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 11.
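The checkpoint path line looks roughly like this (the `ckpt` key name is an assumption; the path is a placeholder):

```yaml
# eval_configs/minigpt4_eval.yaml (excerpt)
ckpt: "/path/to/pretrained_minigpt4.pth"
```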
### Launching Demo Locally
Try out our demo [demo.py](demo.py) on your local machine by running
```
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
```
To save GPU memory, Vicuna is loaded in 8 bit by default, with a beam search width of 1.
This configuration requires about 23G GPU memory for Vicuna 13B and 11.5G GPU memory for Vicuna 7B.
For more powerful GPUs, you can run the model
in 16 bit by setting low_resource to False in the config file
[minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml) and use a larger beam search width.
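The 16-bit setting is a one-line change (assuming the `low_resource` key sits in this eval config):

```yaml
# eval_configs/minigpt4_eval.yaml (excerpt)
low_resource: False   # load Vicuna in 16 bit; requires more GPU memory
```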
Thanks to [@WangRongsheng](https://github.com/WangRongsheng), you can also run our code on [Colab](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing).
### Training
The training of MiniGPT-4 contains two alignment stages.
**1. First pretraining stage**
In the first pretraining stage, the model is trained on image-text pairs from the LAION and CC datasets
to align the vision and language model. To download and prepare the datasets, please check
our [first stage dataset preparation instruction](dataset/README_1_STAGE.md).
After the first stage, the visual features are mapped and can be understood by the language
model.
To launch the first stage training, run the following command. In our experiments, we use 4 A100s.
You can change the save path in the config file
[train_configs/minigpt4_stage1_pretrain.yaml](train_configs/minigpt4_stage1_pretrain.yaml)
```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
```
A MiniGPT-4 checkpoint with only stage one training can be downloaded
[here](https://drive.google.com/file/d/1u9FRRBB3VovP1HxCAlpD9Lw4t4P6-Yq8/view?usp=share_link).
Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.
**2. Second finetuning stage**
In the second stage, we use a small, high-quality image-text pair dataset that we created ourselves
and convert it to a conversation format to further align MiniGPT-4.
To download and prepare our second stage dataset, please check our
[second stage dataset preparation instruction](dataset/README_2_STAGE.md).
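The conversation wrapping can be illustrated as follows. The exact wording and the `<ImageHere>` placeholder are assumptions modeled on the repo's `conversation.py`, not a verbatim copy of the template:

```python
# Illustrative sketch of the stage-2 conversation format. The image
# embedding is spliced in where the <ImageHere> placeholder appears.
def build_prompt(instruction: str) -> str:
    return f"###Human: <Img><ImageHere></Img> {instruction} ###Assistant: "

prompt = build_prompt("Describe this image in detail.")
print(prompt)
```

Wrapping each image-description pair in this human/assistant template is what teaches the model to respond conversationally instead of emitting raw captions.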
To launch the second stage alignment,
first specify the path to the checkpoint file trained in stage 1 in
[train_configs/minigpt4_stage2_finetune.yaml](train_configs/minigpt4_stage2_finetune.yaml).
You can also specify the output path there.
Then, run the following command. In our experiments, we use 1 A100.
```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```
After the second stage alignment, MiniGPT-4 is able to talk about the image coherently and in a user-friendly way.
## Acknowledgement
+ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) The model architecture of MiniGPT-4 follows BLIP-2. Check out this great open-source work if you haven't seen it before!
+ [Lavis](https://github.com/salesforce/LAVIS) This repository is built upon Lavis!
+ [Vicuna](https://github.com/lm-sys/FastChat) The fantastic language ability of Vicuna with only 13B parameters is just amazing. And it is open-source!
If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
```bibtex
@misc{zhu2022minigpt4,
      title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
      author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
      year={2023},
      eprint={2304.10592},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```