TaiSu（太素）--a_large-scale_Chinese_multimodal_datase_TaiSu.zip

99 files in total

py: 46

pyc: 27

yaml: 6


Uploaded 2024-09-16 00:23:43, 4.99MB ZIP


TaiSu（太素）--a_large-scale_Chinese_multimodal_datase_TaiSu.zip (99 files)

TaiSu-main

train

loader.py 6KB

multi_hc.slurm 846B

env_hc.sh 953B

train_clip.py 11KB

tokenizer.py 2KB

utils

__init__.py 107B

simple_tokenizer.py 5KB

sp_tokenizer.py 6KB

util.py 2KB

custom_schedulers.py 3KB

cogview_tokenizer.py 1KB

cog-pretrain.model 998KB

logger.py 3KB

__pycache__

sp_tokenizer.cpython-36.pyc 7KB

simple_tokenizer.cpython-36.pyc 6KB

sp_tokenizer.cpython-38.pyc 7KB

custom_schedulers.cpython-38.pyc 3KB

util.cpython-38.pyc 2KB

logger.cpython-38.pyc 3KB

util.cpython-36.pyc 2KB

custom_schedulers.cpython-36.pyc 4KB

simple_tokenizer.cpython-38.pyc 6KB

__init__.cpython-36.pyc 236B

__init__.cpython-38.pyc 244B

logger.cpython-36.pyc 3KB

bert_chinese_tokenizer.pth 520KB

requirements.txt 74B

models

configs

ViT.yaml 1009B

RN.yaml 1KB

modified_model.py 17KB

model.py 17KB

wrapper.py 14KB

model_infer.py 18KB

__pycache__

wrapper.cpython-36.pyc 10KB

__init__.cpython-36.pyc 212B

modified_model.cpython-36.pyc 14KB

clip

__init__.py 20B

simple_tokenizer.py 5KB

clip.py 9KB

bpe_simple_vocab_16e6.txt.gz 1.29MB

model.py 17KB

single_hc.sh 3KB

semantic_filtering

loader.py 5KB

cleaner.py 6KB

multi_hc.slurm 1002B

env_hc_zjx.sh 1011B

utils

__init__.py 107B

simple_tokenizer.py 5KB

sp_tokenizer.py 6KB

util.py 2KB

custom_schedulers.py 3KB

cogview_tokenizer.py 1KB

cog-pretrain.model 998KB

logger.py 3KB

__pycache__

sp_tokenizer.cpython-36.pyc 7KB

simple_tokenizer.cpython-36.pyc 6KB

util.cpython-36.pyc 2KB

custom_schedulers.cpython-36.pyc 4KB

__init__.cpython-36.pyc 239B

logger.cpython-36.pyc 3KB

collect.py 381B

models

configs

ViT.yaml 1009B

RN.yaml 1KB

modified_model.py 20KB

model.py 17KB

model_infer.py 18KB

single_hc.sh 2KB

LICENSE 919B

download_tool

download.py 2KB

eval

id_lmdb.py 925B

id2lmdb 300B

utils

__init__.py 107B

simple_tokenizer.py 5KB

sp_tokenizer.py 6KB

util.py 2KB

custom_schedulers.py 3KB

cogview_tokenizer.py 1KB

cog-pretrain.model 998KB

logger.py 3KB

__pycache__

sp_tokenizer.cpython-38.pyc 7KB

custom_schedulers.cpython-38.pyc 3KB

util.cpython-38.pyc 2KB

logger.cpython-38.pyc 3KB

simple_tokenizer.cpython-38.pyc 6KB

__init__.cpython-38.pyc 226B

ITRetrieval.py 9KB

bert_tokenizer.py 2KB

get__id.py 5KB

models

configs

ViT.yaml 1009B

RN.yaml 1KB

modified_model.py 20KB

model.py 17KB

model_infer.py 18KB

imgs

image.png 424KB

new_Len.png 64KB

new_noun_bar.png 167KB

txt_file.png 268KB

all_wc.png 1.03MB

README.md 5KB

# TaiSu (太素): a hundred-million-scale Chinese vision-language pre-training dataset

**TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training**

This paper has been accepted by NeurIPS 2022.

* paper link: https://openreview.net/pdf?id=iAxH-ikIP0I
* Dataset Construction:
  1) Data collection
  2) Text-based filtering
  3) Image-text-retrieval-based filtering
  4) Image-Captioning-based text augmentation

![word cloud](/imgs/all_wc.png)

## Dataset download ##

Since most of the original URLs have expired, we provide the images and their corresponding captions directly. To make downloading easier, the image set is split into more than 30 parts, and the captions are gathered in a single TXT file. The format of its content is shown below:

![captions](/imgs/txt_file.png)

```
ID*****Web_caption*****Generated Caption
# Empty captions are replaced with "None"
```

To download TaiSu, please send an email to <datasets_2022@outlook.com> and state your organization; we will reply as soon as possible. Files with the '.tgz' suffix must first be decompressed into '.tar' files with the command `pigz -d baidu_images*.tgz`. Although some of the images are damaged or missing, most of TaiSu's data is still accessible. Each image and its captions can be matched by ID, for example, 'img1baiducomitu1848496827104259151'.

Here is a tutorial on downloading files from BaiduCloud to your server: <https://blog.csdn.net/wxplol/article/details/115283527>. We hope it helps.

## Pretrained models ##

Models trained on the web data of TaiSu and on the complete data of TaiSu are now available.
Baidu cloud link: https://pan.baidu.com/s/1d3UKyQi7J4Qr1XE2j2V8og?pwd=0kjm

* Example for usage:

```
import torch
from PIL import Image

from models.model_infer import build_lit
from clip.clip import _transform
from utils.sp_tokenizer import SentencepieceChineseTokenizer

# Replace the placeholder paths with your downloaded state dicts.
# The visual model and the textual model should be matched.
lit = build_lit(
    visual_model_path="path/to/visual/model/state_dict",
    txt_model_path="path/to/textual/model/state_dict",
)
'''API:
lit.encode_image(imgs)
lit.encode_text(txt)
'''
device = "cpu"
transform = _transform(n_px=224)
tokenizer = SentencepieceChineseTokenizer(context_length=52)

image = transform(Image.open("xxx.png")).unsqueeze(0).to(device)
texts = tokenizer.tokenize(['我爱我的家乡', 'xxxx']).to(device)

with torch.no_grad():
    img_emb = lit.encode_image(image)
    txt_emb = lit.encode_text(texts)
    # The embeddings should be normalized to calculate cosine similarity
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    logits = img_emb @ txt_emb.t()
```

## LICENCE ##

Unless specifically labeled otherwise, these Datasets are provided to You under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (“CC BY-NC-SA 4.0”), with the additional terms included herein. The CC BY-NC-SA 4.0 may be accessed at https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode. When You download or use the Datasets from the Website or elsewhere, You are agreeing to comply with the terms of CC BY-NC-SA 4.0, and also agreeing to the Dataset Terms. Where these Dataset Terms conflict with the terms of CC BY-NC-SA 4.0, these Dataset Terms shall prevail.

We reiterate that this dataset is to be used only for non-commercial purposes such as academic research, teaching, or scientific publications. We prohibit You from using the dataset or any derivative works for commercial purposes, such as selling data or using it for commercial gain.

If any of the images belongs to you and you would like it removed, please let us know and we will remove it from our dataset immediately.

## Contact

Email: datasets_2022@outlook.com

Organization: Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China

## Citation

```
@inproceedings{liu2022taisu,
  author    = {Liu, Yulong and Zhu, Guibo and Zhu, Bin and Song, Qi and Ge, Guojing and Chen, Haoran and Qiao, GuanHui and Peng, Ru and Wu, Lingxiang and Wang, Jinqiao},
  booktitle = {Advances in Neural Information Processing Systems},
  editor    = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
  pages     = {16705--16717},
  publisher = {Curran Associates, Inc.},
  title     = {TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2022/file/6a386d703b50f1cf1f61ab02a15967bb-Paper-Datasets_and_Benchmarks.pdf},
  volume    = {35},
  year      = {2022}
}
```
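
As a supplement to the caption format described under "Dataset download", here is a minimal sketch of reading the caption TXT file in Python. It is not part of the TaiSu repository: the names `CaptionRecord`, `parse_caption_line`, and `iter_captions` are hypothetical, and the sketch assumes UTF-8 encoding, one record per line, and that the `*****` separator never occurs inside a caption.

```python
from typing import Iterator, NamedTuple, Optional

SEPARATOR = "*****"  # field separator shown in the README's format line


class CaptionRecord(NamedTuple):
    """One caption record (hypothetical helper type, not from the repo)."""
    image_id: str
    web_caption: Optional[str]
    generated_caption: Optional[str]


def parse_caption_line(line: str) -> CaptionRecord:
    """Split one 'ID*****Web_caption*****Generated Caption' line into fields."""
    parts = line.rstrip("\n").split(SEPARATOR)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    image_id, web, generated = (p.strip() for p in parts)
    # The README states that empty captions are stored as the string "None".
    return CaptionRecord(
        image_id,
        None if web == "None" else web,
        None if generated == "None" else generated,
    )


def iter_captions(path: str) -> Iterator[CaptionRecord]:
    """Yield records from a caption file; UTF-8 encoding is an assumption."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield parse_caption_line(line)
```

Building a dictionary keyed by `image_id` from `iter_captions(...)` is one way to match each image file to its captions by ID. Lines that do not have exactly three fields raise `ValueError` instead of being skipped silently, so format surprises show up early.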
