---
license: Apache License 2.0
text:
multi-modal:
language:
- zh
tags:
- Youku
- mPLUG
- Video-Text Pre-training
- Multimodal Chinese Video
- Video Classification
- Category Prediction
- Video Retrieval
- Video Captioning
---
# Youku-mPLUG Chinese Large-Scale Video Text Dataset [Github](https://github.com/X-PLUG/Youku-mPLUG)
![Youku-AliceMind](logo.png)
## Dataset Description
To promote the application and development of multimodal technology in the Chinese community, DAMO Academy and Youku jointly release Youku-mPLUG, the industry's first open-source large-scale Chinese video-text dataset. It comprises a multimodal pre-training corpus of millions of high-quality videos mined from Youku's massive video library, together with several downstream multimodal benchmarks (video retrieval, video captioning, and category prediction). In addition, we provide different multimodal pre-trained models as benchmark baselines to accelerate the application and development of multimodal pre-training technology.
### Pretraining Dataset Introduction
The Youku-mPLUG pre-training dataset is mined from Youku's massive collection of high-quality short videos and contains millions of video-text pairs, totaling about 36 TB. The videos are UGC short clips of 10-120 seconds, and the texts are the corresponding title descriptions of 5-30 characters. The data are sampled so that categories are balanced, covering a total of 45 major categories: TV drama clips, TV drama extras, movie clips, movie extras, variety shows, crosstalk and sketch comedy, documentaries, traditional culture, animation, MV, cover songs, musical instrument performances, fitness, street dance, square dance, competitive sports, basketball, football, finance, technology, cars, popular science, life tips, daily life, comedy, education, games, professional workplace, food reviews, cooking, beauty and skincare, makeup, outfits, travel, pets, home decoration, real estate and renovation, medical health, health care, agriculture, adorable kids' daily life, parenting, children's talent, children's animation, and children's toys.
### Downstream Task Datasets
We provide 3 different downstream multimodal video benchmark datasets to measure the capabilities of pre-trained models. The 3 different tasks include:
- **Video Category Prediction**: Given a video and its corresponding title, predict the category of the video.
- **Video-Text Retrieval**: Given a set of videos and a set of texts, retrieve the matching texts for a video and the matching videos for a text.
- **Video Captioning**: Given a video, generate a description of its content.
#### Video Category Prediction
This task aims to predict the category of a video based on its title and content. There are a total of 45 subcategories, with 100k samples in the training set, 14k samples in the validation set, and 20k samples in the test set. The task uses Accuracy (Top-1 / Top-5) as the metric.
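For reference, Top-1 / Top-5 accuracy can be computed from model scores as in the sketch below (a generic NumPy illustration with hypothetical variable names, not code shipped with the dataset):
```python
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]  # indices of the k largest scores per row
    return float((topk == labels[:, None]).any(axis=1).mean())

# toy example: 3 samples, 5 categories
scores = np.array([[0.10, 0.70, 0.10, 0.05, 0.05],
                   [0.30, 0.20, 0.40, 0.05, 0.05],
                   [0.25, 0.25, 0.20, 0.20, 0.10]])
labels = np.array([1, 0, 4])
print(topk_accuracy(scores, labels, k=1))  # 0.333... (only the first sample is correct)
print(topk_accuracy(scores, labels, k=5))  # 1.0 (k equals the number of classes here)
```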
#### Video-Text Retrieval
This task aims to retrieve the relevant samples from a large pool of videos given a text query, and from a large pool of texts given a video query. The training set contains a total of 37k samples, the validation set contains 1.8k samples, and the test set contains 7.4k samples. The task uses Recall@1, Recall@5, and Recall@10 as metrics, reported for both Text-to-Video (T2V) and Video-to-Text (V2T) retrieval.
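Recall@K for both directions can be derived from a text-video similarity matrix; the following is a minimal sketch assuming the matching video of text i sits at column i (hypothetical names, not code shipped with the dataset):
```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@k when the correct candidate for query i is assumed to be candidate i."""
    ranking = (-sim).argsort(axis=1)               # candidates sorted by decreasing similarity
    ground_truth = np.arange(sim.shape[0])[:, None]
    return float((ranking[:, :k] == ground_truth).any(axis=1).mean())

# sim[i, j] = similarity between text i and video j (T2V); transpose it for V2T
sim = np.random.rand(1800, 1800)  # e.g. the 1.8k validation pairs
for k in (1, 5, 10):
    print(f"R@{k}  T2V={recall_at_k(sim, k):.3f}  V2T={recall_at_k(sim.T, k):.3f}")
```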
#### Video Captioning
This task requires the model to generate a one-sentence description of the video content based on the input video frames. The training set contains a total of 170k samples, the validation set contains 7.5k samples, and the test set contains 7.7k samples. The task uses BLEU-4, METEOR, ROUGE, and CIDEr as metrics.
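As a quick sanity check for generated captions, BLEU-4 can be computed on character-tokenized Chinese text with NLTK (a minimal sketch; the benchmark additionally reports METEOR, ROUGE, and CIDEr, and its exact tokenization may differ):
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4_zh(reference: str, hypothesis: str) -> float:
    """BLEU-4 on character tokens, a common choice for Chinese captions."""
    return sentence_bleu([list(reference)], list(hypothesis),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

print(bleu4_zh('白色的小羊站在一旁讲话。', '一只白色的小羊站在一旁讲话。'))
```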
## Data Format
### Dataset Download and Loading
#### Pre-training Dataset Download
**Before downloading the data, please make sure your modelscope package has been updated to version 1.7.1 by running: `pip3 install modelscope==1.7.1`**
```python
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode

api = HubApi()
sdk_token = ""  # Required: obtain it from the ModelScope web personal center
api.login(sdk_token)  # online

data = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # uncomment to clear the cache and re-download
    subset_name='pretrain',
    split='train',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(data)))
```
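Because the dataset is loaded with `use_streaming=True`, records are yielded lazily; below is a small sketch for peeking at the first few samples without materializing the whole split (the field names depend on the subset, see the captioning example further down):
```python
from itertools import islice

for sample in islice(iter(data), 5):  # pull only the first 5 records
    print(sample.keys())
```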
#### Video Category Prediction Dataset Download
```python
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode

api = HubApi()
sdk_token = ""  # Required: obtain it from the ModelScope web personal center
api.login(sdk_token)  # online

data = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # uncomment to clear the cache and re-download
    subset_name='classification',
    split='validation',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(data)))

# Slicing, for modelscope v1.8.4 or above
print(len(data))
data_new = data[10:15]
for item in data_new:
    print(item)
```
#### Video-Text Retrieval Dataset Download
```python
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode

api = HubApi()
sdk_token = ""  # Required: obtain it from the ModelScope web personal center
api.login(sdk_token)  # online

data = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # uncomment to clear the cache and re-download
    subset_name='retrieval',
    split='validation',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(data)))

# Slicing, for modelscope v1.8.4 or above
print(len(data))
data_new = data[10:15]
for item in data_new:
    print(item)
```
#### Video Captioning Dataset Download
```python
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode

api = HubApi()
sdk_token = ""  # Required: obtain it from the ModelScope web personal center
api.login(sdk_token)  # online

ds = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # uncomment to clear the cache and re-download
    subset_name='caption',
    split='train',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(ds)))
# Example output:
# {'video_id:FILE': '~/.cache/modelscope/hub/datasets/modelscope/Youku-AliceMind/master/data_files/e9310682ebd280cf194897524c6725a6f75b7d32629861f5f25f136187bad6a7', 'golden_caption': '白色的小羊站在一旁讲话。旁边还有两只灰色猫咪和一只拉着灰狼的猫咪。'}
# Note: the 'video_id:FILE' field points to the video file in the local cache, which can be opened with any mp4 decoder.

# Slicing, for modelscope v1.8.4 or above
print(len(ds))
ds_new = ds[10:15]
for item in ds_new:
    print(item)
```
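As noted in the example above, the `video_id:FILE` field points to a locally cached mp4 file; below is a minimal decoding sketch using OpenCV (an assumption on our side, any mp4 decoder such as decord or PyAV works just as well):
```python
import os
import cv2  # pip install opencv-python

record = next(iter(ds))
video_path = os.path.expanduser(record['video_id:FILE'])  # cached local video file

cap = cv2.VideoCapture(video_path)
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)  # each frame is an H x W x 3 BGR NumPy array
cap.release()

print(f"decoded {len(frames)} frames; caption: {record['golden_caption']}")
```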
## Downstream Baseline Test Results
### Video Category Prediction
We evaluated different baseline models on the category prediction task after pre-training on the Youku-mPLUG dataset. All checkpoints were selected on the validation set and evaluated on the test set. The test-set results of each model are as follows:
| **Model** | **Pre-Trained** | **Top-1 Acc.(%)** | **Top-5 Acc.(%)** |
| -------- | :--------: | :--------: | :--------: |
| TimeSformer [1] | ✗ | 63.51 | 89.89 |
| ALPRO [2] | ✗ | 69.40 | 90.07 |
| ALPRO [2] | ✓ | 78.15 | 95.15 |
| mPLUG-2 [3,4] | ✓ | 77.79 | 92.44 |
| mPLUG-video (Freeze 1.3B) | ✓ | 80.04 | 98.06 |
| mPLUG-video (Freeze 2.7B) | ✓ | 80.57 | 98.15 |
### Video Retrieval
| **Model** | **Pre-Trained** | **V2T (R1/R5/R10)** | **T2V (R1/R5/R10)** |
| -------- | :--------: | :--------: | :--------: |
| ALPRO [2]