# Video Search Engine
Authors:
* [Abby Gray](https://github.com/abbygray)
* [Akshat Shrivastava](https://github.com/AkshatSh)
* [Kevin Bi](https://github.com/kevinb22)
* [Sarah Yu](https://github.com/sarahyu17)
Semantically search through a database of videos using generated summaries.
Take a look at our [poster](VideoSearchEnginePoster.pdf)!
## Table of Contents
* [System Overview](#system-overview)
* [Video Summarization Overview](#video-summarization-overview)
* [Example output](#example-output)
* [User Interface](#user-interface)
* [Set Up](#set-up)
* [Training Captioning Network](#training-captioning-network)
* [Plan](#plan)
* [Data Sets to Use](#data-sets-to-use)
* [Citations](#citations)
## System Overview
The animation below shows how the entire system works end to end.
![Presentation](figs/presentation.gif)
The image below gives an overview of the user-facing system architecture.
![System Overview](figs/SystemOverview.png)
The backend video-summarization system is distributed to handle large videos. The architecture is described in the image below.
![Distribution](figs/distribution.png)
## Video Summarization Overview
In this project, we attempted to solve video summarization using image captioning. The architecture and motivation are explained in this section.
Below is the initial architecture of the video summarization network used to generate video summaries.
![Video Summarization Network](figs/VideoSummarizationNetwork.png)
We converted this into the following network for the final project.
![Final Network](figs/summarization.png)
We can walk through the steps with explanations here (two code sketches follow the list):
1. We break apart frames into semantically different groups.
* Here we use `SSIM` (the structural similarity index) to determine whether two frames are similar
* We define a threshold for comparison
* Any sequence of frames within that threshold belongs to the same group.
2. Randomly sample from each group
* Since each group contains semantically similar frames, we reduce redundancy in the frame captions by selecting a very small subset (1-5 frames) from each group
3. Feed each selected frame to an image captioning network to determine what happens in the frame
* This uses an Encoder-Decoder model for captioning the images as described in [Object2Text](https://arxiv.org/abs/1707.07102)
* Model description
* `Encoder`
* `EncoderCNN`
* Uses a pretrained ResNet-152 to encode the frame into a feature vector
* `YoloEncoder`
* Performs bounding-box object detection on the frame to determine the objects present and their bounding boxes.
* Uses an RNN (an LSTM in this model) to encode the sequence of objects and their names
* Uses the result to create another encoded feature vector
* `Decoder`
* Combines the two feature vectors from the `EncoderCNN` and the `YoloEncoder` to create a new feature vector, and uses that feature vector as input to start language generation for the frame caption
* Training
* **Dataset:** uses COCO for training
* **Bounding Box:** during training we use `TinyYOLO`, which is faster and forces the network to learn from a less reliable detector; the more reliable full version is used during testing
4. Use `Extractive Summarization` to select unique phrases from all the selected frame captions, creating a reasonable description of what occurred in the video.
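As a rough illustration of steps 1, 2, and 4, here is a minimal sketch of the non-neural parts of the pipeline. It assumes `scikit-image` for SSIM on grayscale frames, uses illustrative threshold and sample-size values rather than our tuned ones, and stubs out the captioning network (step 3) with a hypothetical `caption_frame` function:

```python
import random

from skimage.metrics import structural_similarity as ssim

SSIM_THRESHOLD = 0.7   # illustrative value, not our tuned threshold
FRAMES_PER_GROUP = 3   # sample 1-5 frames from each group

def group_frames(frames):
    """Step 1: group consecutive frames whose SSIM stays above the threshold.

    `frames` is a list of grayscale uint8 numpy arrays.
    """
    if not frames:
        return []
    groups = [[frames[0]]]
    for prev, curr in zip(frames, frames[1:]):
        if ssim(prev, curr) >= SSIM_THRESHOLD:
            groups[-1].append(curr)   # semantically similar: same group
        else:
            groups.append([curr])     # scene changed: start a new group
    return groups

def sample_groups(groups, k=FRAMES_PER_GROUP):
    """Step 2: randomly sample a small subset of frames from each group."""
    return [random.sample(group, min(k, len(group))) for group in groups]

def summarize(frames, caption_frame):
    """`caption_frame` stands in for the captioning network (step 3); the
    de-duplication below is a stand-in for extractive summarization (step 4)."""
    captions = [caption_frame(frame)
                for group in sample_groups(group_frames(frames))
                for frame in group]
    return tuple(dict.fromkeys(captions))  # drop duplicates, keep order
```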
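Similarly, here is a minimal PyTorch sketch of how the `Decoder` can fuse the two encoder outputs before language generation. It assumes both encoders emit fixed-size feature vectors; the class, layer names, and sizes are illustrative, not the actual model implementation:

```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    """Sketch: fuse EncoderCNN and YoloEncoder features, then decode a caption."""
    def __init__(self, feat_size, embed_size, hidden_size, vocab_size):
        super().__init__()
        # project the concatenated features into the LSTM's input space
        self.fuse = nn.Linear(2 * feat_size, embed_size)
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, cnn_feats, yolo_feats, captions):
        # combine the two feature vectors into a single new feature vector,
        # which seeds language generation as described above
        fused = self.fuse(torch.cat([cnn_feats, yolo_feats], dim=1))
        # prepend the fused vector to the embedded caption tokens
        inputs = torch.cat([fused.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)  # per-step vocabulary logits
```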
The next section shows example output:
## Example output
Given a minute-long video of traffic in Dhaka, Bangladesh, the system produces:
```python
(
'a man riding a bike down a street next to a large truck .',
'a man riding a bike down a street next to a traffic light .',
'a green truck with a lot of cars on it',
'a green truck with a lot of cars on the road .',
'a city bus driving down a street next to a traffic light .'
)
```
## User Interface
For our search engine, we built a `Flask`-based application, similar to Google, to search through our database.
### Main UI
This page features the main search functionality: a simple design similar to Google's.
![main](figs/ui-main.png)
### Results UI
This page features all the results for a given query. Every video in our database is returned, sorted by relevance. We use `TF-IDF` scoring to rank each summary against the query.
![results](figs/ui-search.png)
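Below is a minimal sketch of this ranking step using scikit-learn's `TfidfVectorizer` and cosine similarity rather than our own scoring code; the function name and usage are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_videos(query, summaries):
    """Return (video_index, score) pairs sorted by relevance to the query."""
    vectorizer = TfidfVectorizer()
    # fit TF-IDF on the stored summaries, then project the query
    # into the same vector space
    summary_vecs = vectorizer.fit_transform(summaries)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, summary_vecs).ravel()
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)

# usage: rank_videos("truck driving in traffic", all_summaries)
```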
## Set Up
To set up the Python code, create a Python 3 environment with the following:
```bash
# create a virtual environment
$ python3 -m venv env
# activate environment
$ source env/bin/activate
# install all requirements
$ pip install -r requirements.txt
# download data files
$ python downloader.py
```
If you add a new package, you will have to update `requirements.txt` with the following command:
```bash
# add new packages
$ pip freeze > requirements.txt
```
And if you want to deactivate the virtual environment:
```bash
# deactivate the virtual env
$ deactivate
```
## Training Captioning Network
### Caption Network Set up
```bash
python VideoSearchEngine/ImageCaptioningNoYolo/resize.py --image_dir data/coco/train2014/
python VideoSearchEngine/ImageCaptioningNoYolo/resize.py --image_dir data/coco/val2014/ --output_dir data/val_resized2014
```
## Plan
Our project, broadly defined, attempts video search through video summarization. To do this we propose the following objectives and resulting action plan:
* Break videos down into semantically different groups of frames
* Recognize objects in an image (i.e. a frame)
* Convert a frame to text
* Merge summaries of all frames of a video into one large overall summary
* Build a search engine to query videos via summary.
## Data Sets to Use
### [TACoS MultiModal Data Set](https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/tacos-multi-level-corpus/)
Provides lots of labeled data for generating textual video summaries.
The [paper](https://arxiv.org/pdf/1403.6173.pdf) describes how the data was collected and the resulting performance.
The location of the video dataset: [Source](https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/human-activity-recognition/mpii-cooking-2-dataset/)
### [Common Object Data Set](http://cocodataset.org/#home)
Consists of labeled images for image captioning.
### [Sum Me Data Set](https://people.ee.ethz.ch/~gyglim/vsum/)
Consists of action videos that can be used to test summaries.
### [MED Dataset](http://lear.inrialpes.fr/people/potapov/med_summaries)
The "MED Summaries" is a new dataset for evaluation of dynamic video summaries. It contains annotations of 160 videos: a validation set of 60 videos and a test set of 100 videos. There are 10 event categories in the test set.
## Citations
### Papers
* [Microsoft Research Paper on Video Summarization](https://arxiv.org/pdf/1704.01466.pdf)
* [YOLO Paper for bounding box object detection](https://pjreddie.com/media/files/papers/YOLO9000.pdf)
* [Using YOLO for image captioning](https://arxiv.org/abs/1707.07102)
* [Unsupervised Video Summarization with Adversarial Networks](http://web.engr.oregonstate.edu/~sinisa/research/publications/cvpr17_summarization.pdf)
* [Long-term Recurrent Convolutional Networks](https://arxiv.org/pdf/1411.4389.pdf)
* [Coherent Multi-Sentence Video Description with Variable Level of Detail](https://arxiv.org/pdf/1403.6173.pdf)
### GitHubs
* [Original YOLO implementation](https://github.com/pjreddie/darknet)
* [Code for YOLO -> LSTM for image captioning](https://github.com/uvavision/obj2text-neuraltalk2)
* [YOLO PyTorch Implementation for Guidance](https://github.com/longcw/yolo2-pytorch)
* [Tiny YOLO Implementation](https://github.com/marvis/pytorch-yolo2)
* [machinebox -> video analysis/frame partitioning](https