# Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
> 👀 See our paper: [**"Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models"**](https://arxiv.org/abs/2402.17177) [![Paper](https://img.shields.io/badge/Paper-%F0%9F%8E%93-lightblue?style=flat-square)](https://arxiv.org/abs/2402.17177)
>
> 👀 See our newest Video Generation paper: [**"Mora: Enabling Generalist Video Generation via A Multi-Agent Framework"**](http://arxiv.org/abs/2403.13248) [![Paper](https://img.shields.io/badge/Paper-%F0%9F%8E%93-lightblue?style=flat-square)](http://arxiv.org/abs/2403.13248) [![GitHub](https://img.shields.io/badge/GitHub-%F0%9F%8E%93-lightblue?style=flat-square)](https://github.com/lichao-sun/Mora)
>
> 📧 Please let us know if you find a mistake or have any suggestions by e-mail: lis221@lehigh.edu
## Table of Contents
- 💡 [About](#about)
- ✨ [Updates](#updates)
- 🕰️ [History of Generative AI in the Vision Domain](#history-of-generative-ai-in-the-vision-domain)
- 📑 [Paper List](#paper-list)
- [Technology](#technology)
- [Data Pre-processing](#data-pre-processing)
- [Modeling](#modeling)
- [Language Instruction Following](#language-instruction-following)
- [Prompt Engineering](#prompt-engineering)
- [Trustworthiness](#trustworthiness)
- [Application](#application)
- [Movie](#movie)
- [Education](#education)
- [Gaming](#gaming)
- [Healthcare](#healthcare)
- [Robotics](#robotics)
- 📖 [Citation](#citation)
## About
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and shows potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from filmmaking and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation.
<div align="center">
<img src="https://raw.githubusercontent.com/lichao-sun/SoraReview/main/image/sora_framework.png" width="85%"></div>
## Updates
- 🎉 [28/02/2024] Our paper has been uploaded to arXiv and was selected as a Hugging Face Daily Paper.
## History of Generative AI in the Vision Domain
<div align="center">
<img src="https://raw.githubusercontent.com/lichao-sun/SoraReview/main/image/history.png" width="85%"></div>
## Paper List
<div align="center">
<img src="https://raw.githubusercontent.com/lichao-sun/SoraReview/main/image/paper_list_structure.png" width="70%"></div>
### Technology
#### Data Pre-processing
- (*NeurIPS'23*) Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution [[paper]](https://proceedings.neurips.cc/paper_files/paper/2023/file/06ea400b9b7cfce6428ec27a371632eb-Paper-Conference.pdf)
- (*ICLR'21*) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[paper]](https://arxiv.org/abs/2010.11929)[[code]](https://github.com/google-research/vision_transformer)
- (*arXiv 2013.12*) Auto-Encoding Variational Bayes [[paper]](https://arxiv.org/abs/1312.6114)
- (*ICCV'21*) ViViT: A Video Vision Transformer [[paper]](https://openaccess.thecvf.com/content/ICCV2021/html/Arnab_ViViT_A_Video_Vision_Transformer_ICCV_2021_paper.html)[[code]](https://github.com/google-research/scenic/tree/main/scenic/projects/vivit)
- (*ICML'21*) Is Space-Time Attention All You Need for Video Understanding? [[paper]](https://arxiv.org/abs/2102.05095)[[code]](https://github.com/facebookresearch/TimeSformer)
- (*NeurIPS'17*) Neural Discrete Representation Learning [[paper]](https://proceedings.neurips.cc/paper_files/paper/2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf)[[code]](https://github.com/google-deepmind/sonnet/blob/v2/sonnet/src/nets/vqvae.py)
- (*CVPR'22*) High-Resolution Image Synthesis with Latent Diffusion Models [[paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.pdf)[[code]](https://github.com/CompVis/latent-diffusion)
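Several of the papers above (ViT, ViViT) turn pixels into patch tokens, and Sora's technical report describes compressing video into "spacetime patches" in the same spirit. A minimal pure-Python sketch of ViViT-style tubelet patchification (the tubelet size `t, p` and the nested-list video layout are illustrative assumptions, not Sora's actual pipeline):

```python
def spacetime_patches(video, t=2, p=4):
    """Split a video (list of T frames, each an H x W x C nested list)
    into non-overlapping t x p x p "tubelet" tokens (ViViT-style).
    Returns a list of flattened patch vectors, one per token."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    C = len(video[0][0][0])
    tokens = []
    for ts in range(0, T - t + 1, t):          # temporal windows
        for ys in range(0, H - p + 1, p):      # spatial rows
            for xs in range(0, W - p + 1, p):  # spatial cols
                patch = [video[ts + dt][ys + dy][xs + dx][c]
                         for dt in range(t)
                         for dy in range(p)
                         for dx in range(p)
                         for c in range(C)]
                tokens.append(patch)
    return tokens

# Toy example: 4 frames of 8x8 RGB -> (4/2)*(8/4)*(8/4) = 8 tokens,
# each flattened to 2*4*4*3 = 96 values.
video = [[[[0.0] * 3 for _ in range(8)] for _ in range(8)] for _ in range(4)]
tokens = spacetime_patches(video)
print(len(tokens), len(tokens[0]))  # 8 96
```

In the papers above, each flattened patch would then be linearly projected to a token embedding; NaViT's "Patch n' Pack" additionally packs patches from variable-resolution inputs into one sequence.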
#### Modeling
- (*JMLR'22*) Cascaded Diffusion Models for High Fidelity Image Generation [[paper]](https://dl.acm.org/doi/abs/10.5555/3586589.3586636)
- (*ICLR'22*) Progressive Distillation for Fast Sampling of Diffusion Models [[paper]](https://arxiv.org/abs/2202.00512)[[code]](https://github.com/google-research/google-research/tree/master/diffusion_distillation)
- Imagen Video: High Definition Video Generation with Diffusion Models [[paper]](https://arxiv.org/abs/2210.02303)
- (*CVPR'23*) Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [[paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Blattmann_Align_Your_Latents_High-Resolution_Video_Synthesis_With_Latent_Diffusion_Models_CVPR_2023_paper.pdf)
- (*ICCV'23*) Scalable Diffusion Models with Transformers [[paper]](https://openaccess.thecvf.com/content/ICCV2023/papers/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.pdf)
- (*CVPR'23*) All Are Worth Words: A ViT Backbone for Diffusion Models [[paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Bao_All_Are_Worth_Words_A_ViT_Backbone_for_Diffusion_Models_CVPR_2023_paper.pdf)[[code]](https://github.com/baofff/U-ViT)
- (*ICCV'23*) Masked Diffusion Transformer Is a Strong Image Synthesizer [[paper]](https://openaccess.thecvf.com/content/ICCV2023/papers/Gao_Masked_Diffusion_Transformer_is_a_Strong_Image_Synthesizer_ICCV_2023_paper.pdf)[[code]](https://github.com/sail-sg/mdt)
- (*arXiv 2023.12*) DiffiT: Diffusion Vision Transformers for Image Generation [[paper]](https://arxiv.org/abs/2312.02139)[[code]](https://github.com/nvlabs/diffit)
- (*CVPR'24*) GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation [[paper]](https://arxiv.org/abs/2312.04557)
- (*arXiv 2023.09*) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [[paper]](https://arxiv.org/abs/2309.15103)[[code]](https://github.com/Vchitect/LaVie)
- (*arXiv 2024.01*) Latte: Latent Diffusion Transformer for Video Generation [[paper]](https://arxiv.org/abs/2401.03048)[[code]](https://github.com/Vchitect/Latte)
- (*arXiv 2024.03*) Scaling Rectified Flow Transformers for High-Resolution Image Synthesis [[paper]](https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf)
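The diffusion models listed above all train a denoiser on progressively noised latents. A minimal sketch of the forward (noising) process under a cosine schedule, in the style of these papers (the flat-vector latent and schedule constants are assumptions for illustration; the real models operate on patch-token latents):

```python
import math
import random

def cosine_alpha_bar(t, T):
    """Cumulative signal level alpha-bar_t under a cosine noise schedule."""
    f = lambda s: math.cos((s / T + 0.008) / 1.008 * math.pi / 2) ** 2
    return f(t) / f(0)

def forward_diffuse(x0, t, T, rng=random):
    """Sample x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps for a flat latent vector x0.
    Returns (x_t, eps); the denoiser (e.g. a diffusion transformer) is
    trained to predict eps from (x_t, t)."""
    ab = cosine_alpha_bar(t, T)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps
```

At `t = 0` the sample is the clean latent (`alpha-bar = 1`); as `t` approaches `T` it approaches pure Gaussian noise, which is where sampling starts at generation time.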
#### Language Instruction Following
- Improving Image Generation with Better Captions [[paper]](https://cdn.openai.com/papers/dall-e-3.pdf)
- (*arXiv 2022.05*) CoCa: Contrastive Captioners are Image-Text Foundation Models [[paper]](https://arxiv.org/abs/2205.01917)[[code]](https://github.com/lucidrains/CoCa-pytorch)
- (*arXiv 2022.12*) VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners [[paper]](https://arxiv.org/abs/2212.04979)
- (*CVPR'23*) InstructPix2Pix: Learning to Follow Image Editing Instructions [[paper]](https://arxiv.org/abs/2211.09800)[[code]](https://github.com/timothybrooks/instruct-pix2pix)
- (*NeurIPS'23*) Visual Instruction Tuning [[paper]](https://arxiv.org/abs/2304.08485)[[code]](https://github.com/haotian-liu/LLaVA)
- (*ICML'23*) mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [[paper]](https://arxiv.org/abs/2302.00402)