---
license: Apache License 2.0
text:
multi-modal:
language:
- zh
tags:
- Youku
- mPLUG
- Video-Text Pre-training
- Multimodal Chinese Video
- Video Classification
- Category Prediction
- Video Retrieval
- Video Captioning
---
# Youku-mPLUG Chinese Large-Scale Video Text Dataset [Github](https://github.com/X-PLUG/Youku-mPLUG)
![Youku-AliceMind](logo.png)
## Dataset Description
To promote the application and development of multimodal technology in the Chinese community, DAMO Academy and Youku jointly release Youku-mPLUG, the industry's first open-source large-scale Chinese video-text dataset. It comprises a multimodal pre-training corpus of millions of high-quality videos mined from Youku's massive video library, together with several downstream multimodal benchmarks (video retrieval, video captioning, and category prediction). In addition, we provide different multimodal pre-trained models as benchmark baselines to accelerate the application and development of multimodal pre-training technology.
### Pretraining Dataset Introduction
The Youku-mPLUG pre-training dataset is mined from Youku's massive collection of high-quality short videos and contains millions of video-text pairs, totaling about 36 TB. The videos are UGC short clips of 10-120 seconds, and the texts are the corresponding title descriptions of 5-30 characters. The data are sampled so that categories are balanced, covering a total of 45 major categories: TV drama clips, TV drama extras, movie clips, movie extras, variety shows, crosstalk and sketch comedy, documentaries, traditional culture, animation, MV, cover songs, musical instrument performances, fitness, street dance, square dance, competitive sports, basketball, football, finance, technology, cars, popular science, life tips, daily life, comedy, education, games, professional workplace, food reviews, cooking, beauty and skincare, makeup, outfits, travel, pets, home decoration, real estate and renovation, medical health, health care, agriculture, adorable kids' daily life, parenting, children's talent, children's animation, and children's toys.
### Downstream Task Datasets
We provide 3 different downstream multimodal video benchmark datasets to measure the capabilities of pre-trained models. The 3 different tasks include:
- **Video Category Prediction**: Given a video and its corresponding title, predict the category of the video.
- **Video-Text Retrieval**: Given a set of videos and a set of texts, retrieve the matching texts for a video and the matching videos for a text.
- **Video Captioning**: Given a video, generate a description of its content.
#### Video Category Prediction
This task aims to predict the category of a video based on its title and content. There are a total of 45 subcategories, with 100k samples in the training set, 14k samples in the validation set, and 20k samples in the test set. The task uses Accuracy (Top-1 / Top-5) as the metric.
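For reference, Top-1 / Top-5 accuracy can be computed from model scores as in the sketch below (a generic NumPy illustration with hypothetical variable names, not code shipped with the dataset):
```python
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]  # indices of the k largest scores per row
    return float((topk == labels[:, None]).any(axis=1).mean())

# toy example: 3 samples, 5 categories
scores = np.array([[0.10, 0.70, 0.10, 0.05, 0.05],
                   [0.30, 0.20, 0.40, 0.05, 0.05],
                   [0.25, 0.25, 0.20, 0.20, 0.10]])
labels = np.array([1, 0, 4])
print(topk_accuracy(scores, labels, k=1))  # 0.333... (only the first sample is correct)
print(topk_accuracy(scores, labels, k=5))  # 1.0 (k equals the number of classes here)
```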
#### Video-Text Retrieval
This task aims to retrieve the relevant samples from a large pool of videos given a text query, and from a large pool of texts given a video query. The training set contains a total of 37k samples, the validation set contains 1.8k samples, and the test set contains 7.4k samples. The task uses Recall@1, Recall@5, and Recall@10 as metrics, reported for both Text-to-Video (T2V) and Video-to-Text (V2T) retrieval.
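Recall@K for both directions can be derived from a text-video similarity matrix; the following is a minimal sketch assuming the matching video of text i sits at column i (hypothetical names, not code shipped with the dataset):
```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@k when the correct candidate for query i is assumed to be candidate i."""
    ranking = (-sim).argsort(axis=1)               # candidates sorted by decreasing similarity
    ground_truth = np.arange(sim.shape[0])[:, None]
    return float((ranking[:, :k] == ground_truth).any(axis=1).mean())

# sim[i, j] = similarity between text i and video j (T2V); transpose it for V2T
sim = np.random.rand(1800, 1800)  # e.g. the 1.8k validation pairs
for k in (1, 5, 10):
    print(f"R@{k}  T2V={recall_at_k(sim, k):.3f}  V2T={recall_at_k(sim.T, k):.3f}")
```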
#### Video Captioning
This task requires the model to generate a one-sentence description of the video content based on the input video frames. The training set contains a total of 170k samples, the validation set contains 7.5k samples, and the test set contains 7.7k samples. The task uses BLEU-4, METEOR, ROUGE, and CIDEr as metrics.
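As a quick sanity check for generated captions, BLEU-4 can be computed on character-tokenized Chinese text with NLTK (a minimal sketch; the benchmark additionally reports METEOR, ROUGE, and CIDEr, and its exact tokenization may differ):
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4_zh(reference: str, hypothesis: str) -> float:
    """BLEU-4 on character tokens, a common choice for Chinese captions."""
    return sentence_bleu([list(reference)], list(hypothesis),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

print(bleu4_zh('白色的小羊站在一旁讲话。', '一只白色的小羊站在一旁讲话。'))
```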
## Data Format
### Dataset Download and Loading
#### Pre-training Dataset Download
**Before downloading the data, please make sure your modelscope package has been updated to version 1.7.1 by running: `pip3 install modelscope==1.7.1`**
```python
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode

api = HubApi()
sdk_token = ""  # Required: obtain it from the ModelScope web personal center
api.login(sdk_token)  # online

data = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # uncomment to clear the cache and re-download
    subset_name='pretrain',
    split='train',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(data)))
```
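Because the dataset is loaded with `use_streaming=True`, records are yielded lazily; below is a small sketch for peeking at the first few samples without materializing the whole split (the field names depend on the subset, see the captioning example further down):
```python
from itertools import islice

for sample in islice(iter(data), 5):  # pull only the first 5 records
    print(sample.keys())
```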
#### Video Category Prediction Dataset Download
```python
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode

api = HubApi()
sdk_token = ""  # Required: obtain it from the ModelScope web personal center
api.login(sdk_token)  # online

data = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # uncomment to clear the cache and re-download
    subset_name='classification',
    split='validation',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(data)))

# Slicing, for modelscope v1.8.4 or above
print(len(data))
data_new = data[10:15]
for item in data_new:
    print(item)
```
#### Video-Text Retrieval Dataset Download
```python
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode

api = HubApi()
sdk_token = ""  # Required: obtain it from the ModelScope web personal center
api.login(sdk_token)  # online

data = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # uncomment to clear the cache and re-download
    subset_name='retrieval',
    split='validation',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(data)))

# Slicing, for modelscope v1.8.4 or above
print(len(data))
data_new = data[10:15]
for item in data_new:
    print(item)
```
#### Video Captioning Dataset Download
```python
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode

api = HubApi()
sdk_token = ""  # Required: obtain it from the ModelScope web personal center
api.login(sdk_token)  # online

ds = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # uncomment to clear the cache and re-download
    subset_name='caption',
    split='train',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(ds)))
# Example output:
# {'video_id:FILE': '~/.cache/modelscope/hub/datasets/modelscope/Youku-AliceMind/master/data_files/e9310682ebd280cf194897524c6725a6f75b7d32629861f5f25f136187bad6a7', 'golden_caption': '白色的小羊站在一旁讲话。旁边还有两只灰色猫咪和一只拉着灰狼的猫咪。'}
# Note: the 'video_id:FILE' field points to the video file in the local cache, which can be opened with any mp4 decoder.

# Slicing, for modelscope v1.8.4 or above
print(len(ds))
ds_new = ds[10:15]
for item in ds_new:
    print(item)
```
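As noted in the example above, the `video_id:FILE` field points to a locally cached mp4 file; below is a minimal decoding sketch using OpenCV (an assumption on our side, any mp4 decoder such as decord or PyAV works just as well):
```python
import os
import cv2  # pip install opencv-python

record = next(iter(ds))
video_path = os.path.expanduser(record['video_id:FILE'])  # cached local video file

cap = cv2.VideoCapture(video_path)
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)  # each frame is an H x W x 3 BGR NumPy array
cap.release()

print(f"decoded {len(frames)} frames; caption: {record['golden_caption']}")
```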
## Downstream Baseline Test Results
### Video Category Prediction
We evaluated different baseline models on the category prediction task after pre-training on the Youku-mPLUG dataset. All checkpoints were selected on the validation set and evaluated on the test set. The test-set results of each model are as follows:
| **Model** | **Pre-Trained** | **Top-1 Acc.(%)** | **Top-5 Acc.(%)** |
| -------- | :--------: | :--------: | :--------: |
| TimeSformer [1] | ✗ | 63.51 | 89.89 |
| ALPRO [2] | ✗ | 69.40 | 90.07 |
| ALPRO [2] | ✓ | 78.15 | 95.15 |
| mPLUG-2 [3,4] | ✓ | 77.79 | 92.44 |
| mPLUG-video (Freeze 1.3B) | ✓ | 80.04 | 98.06 |
| mPLUG-video (Freeze 2.7B) | ✓ | 80.57 | 98.15 |
### Video Retrieval
| **Model** | **Pre-Trained** | **V2T (R1/R5/R10)** | **T2V (R1/R5/R10)** |
| -------- | :--------: | :--------: | :--------: |
| ALPRO [2]