UCF101 Classification with an LSTM-Based Encoder-Decoder Model

90 files in total (16 npy, 14 pkl, 12 py). Uploaded 2023-11-30 14:58:19; 8.93 MB, 7Z archive.

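Video classification is an important deep learning task: the goal is to recognize the behavior or action shown in a video. This tutorial looks at how an encoder-decoder model built around an LSTM (long short-term memory network) can classify actions on the UCF101 dataset. UCF101 is a widely used action recognition dataset with 101 action categories, which makes it a common benchmark for video processing algorithms.

Start with the convolutional part. The archive covers three options: a plain CNN (convolutional neural network), a 3D CNN, and a pretrained ResNet. A 2D CNN is normally applied to still images and extracts features through a stack of convolution and pooling layers. Video adds a time dimension, which is what the 3D CNN addresses: it extends 2D convolution along the time axis and can therefore capture motion between frames. According to the bundled README, the 3D CNN is trained as a standalone classifier, while the plain CNN and the ResNet serve as per-frame encoders in front of the LSTM. The pretrained ResNet (residual network) has already been trained on a large image dataset such as ImageNet, so its features transfer quickly to a new task such as video classification.

The LSTM then acts as the decoder and integrates the feature sequence produced by the encoder. An LSTM is a special kind of RNN (recurrent neural network) that suits sequence data because its gates let it handle long-term dependencies. For video classification, the LSTM receives the encoded features at each time step and uses its gating mechanism to retain earlier information, which lets it interpret the action that unfolds over the sequence.

The implementation steps are:

1. **Data preprocessing**: UCF101 consists of many video clips, which must first be preprocessed (for example cropped, normalized, and sampled) before they can be fed to the model.
2. **Feature extraction**: run a CNN (trained from scratch, or a pretrained ResNet) on every video frame to obtain a sequence of feature vectors.
3. **Sequence encoding**: feed these feature vectors into the LSTM, which processes the sequence one time step at a time and produces a fixed-length context vector that summarizes the whole sequence.
4. **Classification**: pass the context vector through a fully connected layer (the classifier) to obtain a probability for each action class.
5. **Training and optimization**: update the model parameters by backpropagation, typically with a cross-entropy loss and an optimizer such as Adam. Data augmentation may also be needed to improve generalization.
6. **Evaluation and testing**: evaluate the model on held-out videos, using metrics such as accuracy and average precision.

The "video-classification" folder contains the training scripts, helper functions, saved loss and score arrays, prediction files, and notebooks for these models. Studying these files makes it possible to understand and reproduce the LSTM-based encoder-decoder video classifier.

In short, an encoder-decoder model with an LSTM is an effective way to process video sequences, because it handles time-series data and captures dynamic information. With a suitable encoder and a well-tuned decoder, it can serve as a strong classifier for video datasets such as UCF101.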
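Steps 2 to 4 above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the code shipped in the archive: the class names `FrameEncoder` and `LSTMDecoder`, the layer sizes, the 64x64 input resolution, and the choice to classify from the last LSTM output are all assumptions made to keep the example small.

```python
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Hypothetical small 2D CNN: maps each RGB frame x(t) to a vector z(t)."""

    def __init__(self, feature_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(8),
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse the spatial dimensions
        )
        self.fc = nn.Linear(16, feature_dim)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        # Fold time into the batch axis so the 2D CNN sees one frame at a time.
        x = self.conv(frames.reshape(b * t, c, h, w)).flatten(1)
        return self.fc(x).reshape(b, t, -1)  # (batch, time, feature_dim)


class LSTMDecoder(nn.Module):
    """Hypothetical decoder: LSTM over z(t), then a fully connected classifier."""

    def __init__(self, feature_dim=64, hidden_dim=32, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, z):
        h, _ = self.lstm(z)  # h: (batch, time, hidden_dim)
        # Assumption: use the output at the last time step as the context vector.
        return self.classifier(h[:, -1])


encoder, decoder = FrameEncoder(), LSTMDecoder()
video = torch.randn(2, 28, 3, 64, 64)  # 2 random clips of 28 frames each
logits = decoder(encoder(video))
print(logits.shape)  # one score per UCF101 class for each clip
```

Folding the time axis into the batch axis lets an ordinary 2D CNN process every frame in one call; the LSTM then sees the frames again as an ordered sequence.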

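As a quick sanity check for the cross-entropy loss mentioned in step 5, the sketch below computes the loss of a model that assigns equal probability to all 101 classes. The helper `cross_entropy` is written here for illustration only, and the chance-level accuracy assumes roughly balanced classes.

```python
import math

NUM_CLASSES = 101  # number of UCF101 action categories


def cross_entropy(probabilities, target_index):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(probabilities[target_index])


# A model that has learned nothing spreads its probability evenly.
uniform = [1.0 / NUM_CLASSES] * NUM_CLASSES
chance_loss = cross_entropy(uniform, target_index=0)
chance_accuracy = 1.0 / NUM_CLASSES  # assumes balanced classes

print(f"chance-level loss: {chance_loss:.3f}")  # ln(101), about 4.615
print(f"chance-level accuracy: {chance_accuracy:.4f}")
```

A training loss that starts far above ln(101) or never drops below it usually points to a problem in the data pipeline or the learning rate rather than in the model architecture.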

video-classification.7z (90 files)

video-classification

ResNetCRNN_varylength

UCF101_ResNetCRNN_varlen.py 10KB

check_predictions

wrong_predictions.pkl 45KB

check_video_predictions.ipynb 815KB

UCF101_videos_prediction.pkl 667KB

UCF101_frame_count.pkl 428KB

UCF101_tvflow_u_frame_count.pkl 418KB

functions.py 12KB

UCF101actions.pkl 2KB

ResNetCRNN_check_prediction.py 4KB

results

loss_UCF101_CRNN.png 582KB

CRNN_varlen_epoch_training_loss.npy 1KB

CRNN_varlen_epoch_test_score.npy 1KB

replot_loss.ipynb 119KB

CRNN_varlen_epoch_training_score.npy 1KB

CRNN_varlen_epoch_test_loss.npy 1KB

UCF101_tvflow_v_frame_count.pkl 418KB

CRNN.zip 815KB

CRNN

.DS_Store 6KB

check_predictions

.DS_Store 6KB

check_video_predictions.ipynb 815KB

CRNN_check_prediction.py 4KB

UCF101_CRNN.py 9KB

.idea

.name 14B

workspace.xml 2KB

misc.xml 188B

inspectionProfiles

profiles_settings.xml 174B

modules.xml 267B

deployment.xml 430B

.gitignore 184B

CRNN.iml 452B

functions.py 15KB

UCF101actions.pkl 2KB

__pycache__

load_data.cpython-36.pyc 3KB

functions.cpython-36.pyc 9KB

outputs

.DS_Store 6KB

loss_UCF101_CRNN.png 981KB

CRNN_epoch_test_loss.npy 360B

CRNN_epoch_test_score.npy 360B

CRNN_epoch_training_scores.npy 76KB

replot_loss.ipynb 137KB

CRNN_epoch_training_losses.npy 76KB

Conv3D

.DS_Store 6KB

check_predictions

.DS_Store 6KB

wrong_predictions.pkl 87KB

check_video_predictions.ipynb 851KB

UCF101_videos_prediction.pkl 668KB

.ipynb_checkpoints

check_video_predictions-checkpoint.ipynb 851KB

UCF101_3DCNN.py 8KB

functions.py 15KB

UCF101actions.pkl 2KB

Conv3D_check_prediction.py 3KB

__pycache__

load_data.cpython-36.pyc 3KB

functions.cpython-36.pyc 9KB

outputs

Conv3D_epoch_training_losses.npy 13KB

Conv3D_epoch_test_loss.npy 168B

replot_loss.ipynb 104KB

fig_UCF101_3DCNN.png 535KB

Conv3D_epoch_test_score.npy 168B

Conv3D_epoch_training_scores.npy 13KB

UCF101_videos_prediction.pkl 668KB

loss_UCF101_3DCNN.png 518KB

.ipynb_checkpoints

replot_loss-checkpoint.ipynb 104KB

ResNetCRNN

.DS_Store 6KB

check_predictions

wrong_predictions.pkl 45KB

check_video_predictions.ipynb 815KB

UCF101_videos_prediction.pkl 667KB

.ipynb_checkpoints

check_video_predictions-checkpoint.ipynb 815KB

functions.py 15KB

UCF101actions.pkl 2KB

UCF101_ResNetCRNN.py 9KB

ResNetCRNN_check_prediction.py 4KB

__pycache__

load_data.cpython-36.pyc 3KB

functions.cpython-36.pyc 9KB

outputs

.DS_Store 6KB

CRNN_epoch_test_loss.npy 488B

CRNN_epoch_test_score.npy 488B

CRNN_epoch_training_scores.npy 88KB

replot_loss.ipynb 165KB

loss_UCF101_ResNetCRNN.png 778KB

CRNN_epoch_training_losses.npy 88KB

.ipynb_checkpoints

replot_loss-checkpoint.ipynb 154KB

README.md 7KB

fig

kayaking.gif 2.22MB

.DS_Store 6KB

wrong_pred.png 124KB

loss_ResNetCRNN.png 1.04MB

f_CNN.png 27KB

loss_3DCNN.png 867KB

loss_CRNN.png 981KB

CRNN.png 647KB

# Video Classification

This repository provides **quick and simple** code for video classification (or action recognition) on [UCF101](http://crcv.ucf.edu/data/UCF101.php) with PyTorch. A video is viewed as a 3D image or as several continuous 2D images (Fig.1). Below are two simple neural net models:

## Dataset

![alt text](./fig/kayaking.gif)

[UCF101](http://crcv.ucf.edu/data/UCF101.php) has a total of 13,320 videos from 101 actions. Videos have various lengths (frame counts) and different 2D image sizes; the shortest is 28 frames.

To avoid painful video preprocessing such as frame extraction and conversion with [OpenCV](https://opencv.org/) or [FFmpeg](https://www.ffmpeg.org/), I used a preprocessed dataset from [feichtenhofer](https://github.com/feichtenhofer/twostreamfusion) directly. If you want to convert or extract video frames from scratch, here are some nice tutorials:

- https://pythonprogramming.net/loading-video-python-opencv-tutorial/
- https://www.pyimagesearch.com/2017/02/06/faster-video-file-fps-with-cv2-videocapture-and-opencv/

## Models

### 1. 3D CNN (train from scratch)

Use several 3D kernels of size *(a, b, c)* and channels *n*, *e.g., (a, b, c, n) = (3, 3, 3, 16)*, to convolve with the video input, where videos are viewed as 3D images. *Batch normalization* and *dropout* are also used.

### 2. **CNN + RNN** (CRNN)

The CRNN model is a pair of a CNN encoder and an RNN decoder (see figure below):

- **[encoder]** A [CNN](https://en.wikipedia.org/wiki/Convolutional_neural_network) function encodes (meaning it compresses the dimension of) every 2D image **x(t)** into a 1D vector **z(t)** by

  <img src="./fig/f_CNN.png" width="140">

- **[decoder]** An [RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network) receives the sequence of input vectors **z(t)** from the CNN encoder and outputs another 1D sequence **h(t)**. A final fully-connected neural net is attached at the end for categorical predictions.

- Here the decoder RNN uses a long short-term memory [(LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) network and the CNN encoder can be:
  1. trained from scratch
  2. a pretrained model [ResNet-152](https://arxiv.org/abs/1512.03385) using image dataset [ILSVRC-2012-CLS](http://www.image-net.org/challenges/LSVRC/2012/).

<img src="./fig/CRNN.png" width="650">

## Training & testing

- For 3D CNN:
  1. The videos are resized to **(t-dim, channels, x-dim, y-dim) = (28, 3, 256, 342)** since a CNN requires a fixed-size input. The frame count 28 is the length of the shortest video in UCF101, so every video has at least that many frames.
  2. *Batch normalization* and *dropout* are used.
- For CRNN, the videos are resized to **(t-dim, channels, x-dim, y-dim) = (28, 3, 224, 224)** since ResNet-152 only receives RGB inputs of size (224, 224).
- Training videos = **9,990** vs. testing videos = **3,330**
- In the test phase, the models are almost the same as in the training phase, except that dropout is disabled and the batchnorm layers use the moving average and variance instead of mini-batch values. Both are taken care of by calling "**model.eval()**".

## Usage

For tutorial purposes, I try to keep the code as simple as possible. Essentially, **only 3 files are needed for each model**, *e.g.,* for the 3D-CNN model:

- `UCF101_3DCNN.py`: model parameters, training/testing process.
- `functions.py`: modules of 3DCNN & CRNN, data loaders, and some useful functions.
- `UCF101actions.pkl`: 101 action names (labels), e.g., *'BenchPress', 'SkyDiving', 'Bowling', etc.*

### 0. Prerequisites

- [Python 3.6](https://www.python.org/)
- [PyTorch 1.0.0](https://pytorch.org/)
- [Numpy 1.15.0](http://www.numpy.org/)
- [Sklearn 0.19.2](https://scikit-learn.org/stable/)
- [Matplotlib](https://matplotlib.org/)
- [Pandas](https://pandas.pydata.org/)
- [tqdm](https://github.com/tqdm/tqdm)

### 1. Download preprocessed UCF101 dataset

For convenience, we use the preprocessed UCF101 dataset already sliced into RGB images from [feichtenhofer/twostreamfusion](https://github.com/feichtenhofer/twostreamfusion):

- **UCF101 RGB:** [**part1**](http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.001), [**part2**](http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.002), [**part3**](http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.003)

Put the 3 parts in the same folder to unzip. The folder has the default name **jpegs_256**.

### 2. Set parameters & path

In `UCF101_CRNN.py`, for example set

```
data_path = "./UCF101/jpegs_256/"         # UCF101 video path
action_name_path = "./UCF101actions.pkl"
save_model_path = "./model_ckpt/"
```

### 3. Train & test model

- For the **3D CNN / CRNN / ResNetCRNN** model, in each folder run

```bash
$ python UCF101_3DCNN.py   # or UCF101_CRNN.py / UCF101_ResNetCRNN.py in the matching folder
```

### 4. Model outputs

By default, the model outputs:

- Training & testing loss/accuracy: `epoch_train_loss/score.npy`, `epoch_test_loss/score.npy`
- Model parameters & optimizer: e.g. `CRNN_epoch8.pth`, `CRNN_optimizer_epoch8.pth`. They can be used for retraining or as pretrained weights.

To check model predictions:

- Run the `*_check_prediction.py` script in the model folder (e.g. `Conv3D_check_prediction.py`) to load the best trained model and generate the prediction list for all 13,320 videos as a [Pandas](https://pandas.pydata.org/) dataframe. File output: `UCF101_Conv3D_videos_prediction.pkl`.
- Run `check_video_predictions.ipynb` with [Jupyter Notebook](http://jupyter.org/) to see where the model gets things wrong:

<img src="./fig/wrong_pred.png" width="600">

## Version Warning!

As of today (May 31, 2019), it is found that in PyTorch 1.1.0 **flatten_parameters()** doesn't work under [torch.no_grad and DataParallel](https://github.com/pytorch/pytorch/issues/21108) (for multiple GPUs). Earlier versions (before PyTorch 1.0.1) still run OK.
See [Issues](https://github.com/HHTseng/video-classification/issues). Thanks to [raghavgarg97](https://github.com/raghavgarg97) for the report.

## Device & performance

- The models detect and use multiple GPUs by themselves through [torch.nn.DataParallel](https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html).
- A field test was run on 2 GPUs (NVIDIA TITAN V, 12 GB memory) with my default model parameters and batch sizes of `30~60`.
- Some **pretrained models** can be found [here](https://drive.google.com/open?id=117mRMS2r09fz4ozkdzN4cExO1sFwcMvs), thanks to the suggestion of [MinLiAmoy](https://github.com/MinLiAmoy?tab=repositories).

| network | best epoch | testing accuracy |
| ------------ |:-----:| :-----:|
| 3D CNN | 4 | 50.84 % |
| 2D CNN + LSTM | 25 | 54.62 % |
| 2D ResNet152-CNN + LSTM | 53 | **85.68 %** |

<img src="./fig/loss_3DCNN.png" width="650">
<img src="./fig/loss_CRNN.png" width="650">
<img src="./fig/loss_ResNetCRNN.png" width="650">
<br>
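To make the fixed-size 3D CNN input from the "Training & testing" section concrete, the sketch below works out the output size of a single 3D convolution with kernel (3, 3, 3) and 16 output channels applied to a (28, 256, 342) RGB clip. The README gives only the kernel size and channel count; the stride and padding values used here are assumptions, and `conv_output_size` is a helper written for this example.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis (dilation of 1)."""
    return (size + 2 * padding - kernel) // stride + 1


input_shape = (28, 256, 342)  # (t-dim, x-dim, y-dim) from the README
kernel = (3, 3, 3)            # (a, b, c) from the README
in_channels, out_channels = 3, 16

# Assumed settings: no padding, stride 1 and stride 2 for comparison.
output_stride1 = tuple(conv_output_size(s, k, stride=1) for s, k in zip(input_shape, kernel))
output_stride2 = tuple(conv_output_size(s, k, stride=2) for s, k in zip(input_shape, kernel))

# One weight per kernel element per input/output channel pair, plus one bias per output channel.
num_weights = in_channels * out_channels * kernel[0] * kernel[1] * kernel[2] + out_channels

print(output_stride1)  # (26, 254, 340)
print(output_stride2)  # (13, 127, 170)
print(num_weights)     # 1312
```

The layer itself is small; most of the memory cost comes from the activations, which is one reason the batch size has to stay modest for full-resolution clips.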

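Because every clip is reduced to 28 frames, each video needs a rule for choosing which frames to load. The README does not spell out that rule, so the two strategies below (taking the first 28 frames, or spreading 28 frames evenly over the video) are illustrative assumptions, as are the function names and the 1-based frame numbering.

```python
CLIP_LEN = 28  # shortest UCF101 video, per the README


def first_frames(num_frames, clip_len=CLIP_LEN):
    """Indices of the first clip_len frames (1-based)."""
    if num_frames < clip_len:
        raise ValueError("video is shorter than the clip length")
    return list(range(1, clip_len + 1))


def evenly_spaced_frames(num_frames, clip_len=CLIP_LEN):
    """clip_len indices spread from the first to the last frame (1-based)."""
    if num_frames < clip_len:
        raise ValueError("video is shorter than the clip length")
    # Integer arithmetic keeps the indices strictly increasing.
    return [1 + (i * (num_frames - 1)) // (clip_len - 1) for i in range(clip_len)]


print(first_frames(100)[:5])          # [1, 2, 3, 4, 5]
print(evenly_spaced_frames(100)[:5])  # [1, 4, 8, 12, 15]
print(evenly_spaced_frames(100)[-1])  # 100
```

Taking the first frames is simpler and keeps the motion between neighboring frames intact; even spacing covers the whole action but changes the apparent speed from video to video.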