# Video Search Engine
Authors:
* [Abby Gray](https://github.com/abbygray)
* [Akshat Shrivastava](https://github.com/AkshatSh)
* [Kevin Bi](https://github.com/kevinb22)
* [Sarah Yu](https://github.com/sarahyu17)
Semantically search through a database of videos using generated summaries.
Take a look at our [poster](VideoSearchEnginePoster.pdf)!
## Table of Contents
* [System Overview](#system-overview)
* [Video Summarization Overview](#video-summarization-overview)
* [Example output](#example-output)
* [User Interface](#user-interface)
* [Set Up](#set-up)
* [Training Captioning Network](#training-captioning-network)
* [Plan](#plan)
* [Data Sets to Use](#data-sets-to-use)
* [Citations](#citations)
## System Overview
The animation below shows how the entire system works end to end.
![Presentation](figs/presentation.gif)
The image below gives an overview of the user-facing system architecture.
![System Overview](figs/SystemOverview.png)
The backend video-summarization system is distributed to handle large videos. The architecture is described in the image below.
![Distribution](figs/distribution.png)
## Video Summarization Overview
In this project, we attempted to solve video summarization using image captioning. The architecture and motivation are explained in this section.
Below is the initial architecture of the video summarization network used to generate video summaries.
![Video Summarization Network](figs/VideoSummarizationNetwork.png)
We converted this into the following network for the final project.
![Final Network](figs/summarization.png)
We can walk through the steps with explanations here (two code sketches follow the list):
1. We break apart frames into semantically different groups.
* Here we use `SSIM` (the structural similarity index) to determine whether two frames are similar
* We define a threshold for comparison
* Any sequence of frames within that threshold belongs to the same group.
2. Randomly sample from each group
* Since each group contains semantically similar frames, we reduce redundancy in the frame captions by selecting a very small subset (1-5 frames) from each group
3. Feed each selected frame to an image captioning network to determine what happens in the frame
* This uses an Encoder-Decoder model for captioning the images as described in [Object2Text](https://arxiv.org/abs/1707.07102)
* Model description
* `Encoder`
* `EncoderCNN`
* Uses a pretrained ResNet-152 to encode the frame into a feature vector
* `YoloEncoder`
* Performs bounding-box object detection on the frame to determine the objects present and their bounding boxes.
* Uses an RNN (an LSTM in this model) to encode the sequence of objects and their names
* Uses the result to create another encoded feature vector
* `Decoder`
* Combines the two feature vectors from the `EncoderCNN` and the `YoloEncoder` to create a new feature vector, and uses that feature vector as input to start language generation for the frame caption
* Training
* **Dataset:** uses COCO for training
* **Bounding Box:** during training we use `TinyYOLO`, which is faster and forces the network to learn from a less reliable detector; the more reliable full version is used during testing
4. Use `Extractive Summarization` to select unique phrases from all the selected frame captions, creating a reasonable description of what occurred in the video.
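As a rough illustration of steps 1, 2, and 4, here is a minimal sketch of the non-neural parts of the pipeline. It assumes `scikit-image` for SSIM on grayscale frames, uses illustrative threshold and sample-size values rather than our tuned ones, and stubs out the captioning network (step 3) with a hypothetical `caption_frame` function:

```python
import random

from skimage.metrics import structural_similarity as ssim

SSIM_THRESHOLD = 0.7   # illustrative value, not our tuned threshold
FRAMES_PER_GROUP = 3   # sample 1-5 frames from each group

def group_frames(frames):
    """Step 1: group consecutive frames whose SSIM stays above the threshold.

    `frames` is a list of grayscale uint8 numpy arrays.
    """
    if not frames:
        return []
    groups = [[frames[0]]]
    for prev, curr in zip(frames, frames[1:]):
        if ssim(prev, curr) >= SSIM_THRESHOLD:
            groups[-1].append(curr)   # semantically similar: same group
        else:
            groups.append([curr])     # scene changed: start a new group
    return groups

def sample_groups(groups, k=FRAMES_PER_GROUP):
    """Step 2: randomly sample a small subset of frames from each group."""
    return [random.sample(group, min(k, len(group))) for group in groups]

def summarize(frames, caption_frame):
    """`caption_frame` stands in for the captioning network (step 3); the
    de-duplication below is a stand-in for extractive summarization (step 4)."""
    captions = [caption_frame(frame)
                for group in sample_groups(group_frames(frames))
                for frame in group]
    return tuple(dict.fromkeys(captions))  # drop duplicates, keep order
```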
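Similarly, here is a minimal PyTorch sketch of how the `Decoder` can fuse the two encoder outputs before language generation. It assumes both encoders emit fixed-size feature vectors; the class, layer names, and sizes are illustrative, not the actual model implementation:

```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    """Sketch: fuse EncoderCNN and YoloEncoder features, then decode a caption."""
    def __init__(self, feat_size, embed_size, hidden_size, vocab_size):
        super().__init__()
        # project the concatenated features into the LSTM's input space
        self.fuse = nn.Linear(2 * feat_size, embed_size)
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, cnn_feats, yolo_feats, captions):
        # combine the two feature vectors into a single new feature vector,
        # which seeds language generation as described above
        fused = self.fuse(torch.cat([cnn_feats, yolo_feats], dim=1))
        # prepend the fused vector to the embedded caption tokens
        inputs = torch.cat([fused.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)  # per-step vocabulary logits
```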
The next section shows example output:
## Example output
Given a minute-long video of traffic in Dhaka, Bangladesh, the system produces:
```python
(
'a man riding a bike down a street next to a large truck .',
'a man riding a bike down a street next to a traffic light .',
'a green truck with a lot of cars on it',
'a green truck with a lot of cars on the road .',
'a city bus driving down a street next to a traffic light .'
)
```
## User Interface
For our search engine, we built a `Flask`-based application, similar to Google, to search through our database.
### Main UI
This page features the main search functionality: a simple design similar to Google's.
![main](figs/ui-main.png)
### Results UI
This page features all the results for a given query. Every video in our database is returned, sorted by relevance. We use `TF-IDF` scoring to rank each summary against the query.
![results](figs/ui-search.png)
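Below is a minimal sketch of this ranking step using scikit-learn's `TfidfVectorizer` and cosine similarity rather than our own scoring code; the function name and usage are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_videos(query, summaries):
    """Return (video_index, score) pairs sorted by relevance to the query."""
    vectorizer = TfidfVectorizer()
    # fit TF-IDF on the stored summaries, then project the query
    # into the same vector space
    summary_vecs = vectorizer.fit_transform(summaries)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, summary_vecs).ravel()
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)

# usage: rank_videos("truck driving in traffic", all_summaries)
```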
## Set Up
To set up the Python code, create a Python 3 environment with the following:
```bash
# create a virtual environment
$ python3 -m venv env
# activate environment
$ source env/bin/activate
# install all requirements
$ pip install -r requirements.txt
# download data files
$ python downloader.py
```
If you add a new package, you will have to update `requirements.txt` with the following command:
```bash
# add new packages
$ pip freeze > requirements.txt
```
And if you want to deactivate the virtual environment:
```bash
# deactivate the virtual env
$ deactivate
```
## Training Captioning Network
### Caption Network Set up
```bash
python VideoSearchEngine/ImageCaptioningNoYolo/resize.py --image_dir data/coco/train2014/
python VideoSearchEngine/ImageCaptioningNoYolo/resize.py --image_dir data/coco/val2014/ --output_dir data/val_resized2014
```
## Plan
Our project, broadly defined, attempts video search through video summarization. To do this we propose the following objectives and resulting action plan:
* Break videos down into semantically different groups of frames
* Recognize objects in an image (i.e. a frame)
* Convert a frame to text
* Merge summaries of all frames of a video into one large overall summary
* Build a search engine to query videos via summary.
## Data Sets to Use
### [TACoS MultiModal Data Set](https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/tacos-multi-level-corpus/)
Provides lots of labeled data for generating textual video summaries.
The [paper](https://arxiv.org/pdf/1403.6173.pdf) describes how the data was collected and the resulting performance.
The location of the video dataset: [Source](https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/human-activity-recognition/mpii-cooking-2-dataset/)
### [Common Object Data Set](http://cocodataset.org/#home)
Consists of labeled images for image captioning.
### [Sum Me Data Set](https://people.ee.ethz.ch/~gyglim/vsum/)
Consists of action videos that can be used to test summaries.
### [MED Dataset](http://lear.inrialpes.fr/people/potapov/med_summaries)
The "MED Summaries" is a new dataset for evaluation of dynamic video summaries. It contains annotations of 160 videos: a validation set of 60 videos and a test set of 100 videos. There are 10 event categories in the test set.
## Citations
### Papers
* [Microsoft Research Paper on Video Summarization](https://arxiv.org/pdf/1704.01466.pdf)
* [YOLO Paper for bounding box object detection](https://pjreddie.com/media/files/papers/YOLO9000.pdf)
* [Using YOLO for image captioning](https://arxiv.org/abs/1707.07102)
* [Unsupervised Video Summarization with Adversarial Networks](http://web.engr.oregonstate.edu/~sinisa/research/publications/cvpr17_summarization.pdf)
* [Long-term Recurrent Convolutional Networks](https://arxiv.org/pdf/1411.4389.pdf)
* [Coherent Multi-Sentence Video Description with Variable Level of Detail](https://arxiv.org/pdf/1403.6173.pdf)
### GitHubs
* [Original YOLO implementation](https://github.com/pjreddie/darknet)
* [Code for YOLO -> LSTM for image captioning](https://github.com/uvavision/obj2text-neuraltalk2)
* [YOLO PyTorch Implementation for Guidance](https://github.com/longcw/yolo2-pytorch)
* [Tiny YOLO Implementation](https://github.com/marvis/pytorch-yolo2)
* [machinebox -> video analysis/frame partitioning](https