# PyTorch S3D Text-Video trained HowTo100M
This repo contains a PyTorch S3D Text-Video model trained from scratch on HowTo100M using MIL-NCE [1].
If you use this model, we would appreciate if you could cite [1] and [2] :).
The official TensorFlow Hub version of this model can be found here: https://tfhub.dev/deepmind/mil-nce/s3d/1
with a colab showing how to use it here: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/text_to_video_retrieval_with_s3d_milnce.ipynb
## Getting the data
You will first need to download the model weights and the word dictionary.
```sh
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy
```
## How to use it?
The following code explains how to instantiate the S3D Text-Video model with the pretrained weights and run inference
on a few examples.
```python
import torch as th
from s3dg import S3D
# Instantiate the model with the word dictionary and the dimension (512) of the joint text-video embedding space
net = S3D('s3d_dict.npy', 512)
# Load the model weights
net.load_state_dict(th.load('s3d_howto100m.pth'))
# Video input should be of size Batch x 3 x T x H x W and normalized to [0, 1]
video = th.rand(2, 3, 32, 224, 224)
# Evaluation mode
net = net.eval()
# Video inference
video_output = net(video)
# Text inference
text_output = net.text_module(['open door', 'cut tomato'])
```
NB: The video network is fully convolutional (with global average pooling over time and space at the end). However, we recommend using T=32 frames (as during training); T=16 frames also works reasonably well. For H and W we have been using values between 200 and 256.
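For reference, here is a minimal sketch of how to turn decoded frames into the expected input tensor. The `frames` array below is a stand-in for whatever your video decoder produces (uint8 RGB frames of shape T x H x W x 3); only the scaling to [0, 1] and the reordering to Batch x 3 x T x H x W come from the input convention described above.
```python
import torch as th

# Stand-in for 32 decoded RGB frames of shape (T, H, W, 3), dtype uint8.
frames = th.randint(0, 256, (32, 224, 224, 3), dtype=th.uint8)

# Scale to [0, 1], reorder to (3, T, H, W) and add the batch dimension.
clip = frames.float() / 255.0
clip = clip.permute(3, 0, 1, 2).unsqueeze(0)  # 1 x 3 x 32 x 224 x 224

video_output = net(clip)
```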
*video_output* is a dictionary containing two keys:
- *video_embedding*: This is the video embedding (size 512) from the joint text-video space. It should be used to compute similarity scores with text inputs using the text embedding.
- *mixed_5c*: This is the globally average pooled feature from S3D, of dimension 1024. It should be used for classification on downstream tasks (a minimal sketch is given below).
*text_output* is also a dictionary with a single key:
- *text_embedding*: It is the text embedding (size 512) from the joint text-video space. To compute the similarity score between text and video, you would compute the dot product between *text_embedding* and *video_embedding*.
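As an illustration of using *mixed_5c* for a downstream task, here is a minimal sketch (continuing the snippet above) that puts a linear classifier on top of the frozen 1024-dimensional feature; the number of classes and the linear head are placeholder choices, not part of this repo.
```python
import torch as th

# Placeholder downstream setup: a linear classifier on the frozen mixed_5c feature.
num_classes = 10  # hypothetical number of classes for your downstream task
classifier = th.nn.Linear(1024, num_classes)

with th.no_grad():
    features = net(video)['mixed_5c']  # Batch x 1024
logits = classifier(features)          # Batch x num_classes
```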
## Computing all the pairwise video-text similarities
The similarity scores can be computed with a dot product between the *text_embedding* and the *video_embedding*.
```python
video_embedding = video_output['video_embedding']
text_embedding = text_output['text_embedding']
# We compute all the pairwise similarity scores between video and text.
similarity_matrix = th.matmul(text_embedding, video_embedding.t())
```
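For instance, to retrieve the closest video for each text query from the similarity matrix (a usage sketch continuing the snippet above):
```python
# similarity_matrix has shape (num_texts, num_videos):
# row i holds the similarity of text query i with every video.
best_video_per_text = similarity_matrix.argmax(dim=1)
print(best_video_per_text)  # index of the closest video for each text query
```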
## References
- [1] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic and A. Zisserman. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. https://arxiv.org/abs/1912.06430
- [2] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev and J. Sivic. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. https://arxiv.org/abs/1906.03327
Bibtex:
```bibtex
@inproceedings{miech19howto100m,
title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
booktitle={ICCV},
year={2019},
}
@inproceedings{miech19endtoend,
title={{E}nd-to-{E}nd {L}earning of {V}isual {R}epresentations from {U}ncurated {I}nstructional {V}ideos},
author={Miech, Antoine and Alayrac, Jean-Baptiste and Smaira, Lucas and Laptev, Ivan and Sivic, Josef and Zisserman, Andrew},
booktitle={CVPR},
year={2020},
}
```
## Acknowledgements
We would like to thank Yana Hasson for the help provided with the non-trivial porting of the original TensorFlow weights to PyTorch.