# Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
> 👀 See our paper: [**"Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models"**](https://arxiv.org/abs/2402.17177) [![Paper](https://img.shields.io/badge/Paper-%F0%9F%8E%93-lightblue?style=flat-square)](https://arxiv.org/abs/2402.17177)
>
> 👀 See our newest Video Generation paper: [**"Mora: Enabling Generalist Video Generation via A Multi-Agent Framework"**](http://arxiv.org/abs/2403.13248) [![Paper](https://img.shields.io/badge/Paper-%F0%9F%8E%93-lightblue?style=flat-square)](http://arxiv.org/abs/2403.13248) [![GitHub](https://img.shields.io/badge/GitHub-%F0%9F%8E%93-lightblue?style=flat-square)](https://github.com/lichao-sun/Mora)
>
> 📧 Please let us know if you find a mistake or have any suggestions by e-mail: lis221@lehigh.edu
## Table of Contents
- 💡 [About](#about)
- ✨ [Updates](#updates)
- 🕰️ [History of Generative AI in the Vision Domain](#history-of-generative-ai-in-the-vision-domain)
- 📑 [Paper List](#paper-list)
- [Technology](#technology)
- [Data Pre-processing](#data-pre-processing)
- [Modeling](#modeling)
- [Language Instruction Following](#language-instruction-following)
- [Prompt Engineering](#prompt-engineering)
- [Trustworthiness](#trustworthiness)
- [Application](#application)
- [Movie](#movie)
- [Education](#education)
- [Gaming](#gaming)
- [Healthcare](#healthcare)
- [Robotics](#robotics)
- 📖 [Citation](#citation)
## About
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and shows potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from filmmaking and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation.
<div align="center">
<img src="https://raw.githubusercontent.com/lichao-sun/SoraReview/main/image/sora_framework.png" width="85%"></div>
## Updates
- 🎉 [28/02/2024] Our paper has been uploaded to arXiv and was selected as a Hugging Face Daily Paper.
## History of Generative AI in the Vision Domain
<div align="center">
<img src="https://raw.githubusercontent.com/lichao-sun/SoraReview/main/image/history.png" width="85%"></div>
## Paper List
<div align="center">
<img src="https://raw.githubusercontent.com/lichao-sun/SoraReview/main/image/paper_list_structure.png" width="70%"></div>
### Technology
#### Data Pre-processing
- (*NeurIPS'23*) Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution [[paper]](https://proceedings.neurips.cc/paper_files/paper/2023/file/06ea400b9b7cfce6428ec27a371632eb-Paper-Conference.pdf)
- (*ICLR'21*) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[paper]](https://arxiv.org/abs/2010.11929)[[code]](https://github.com/google-research/vision_transformer)
- (*arXiv 2013.12*) Auto-Encoding Variational Bayes [[paper]](https://arxiv.org/abs/1312.6114)
- (*ICCV'21*) ViViT: A Video Vision Transformer [[paper]](https://openaccess.thecvf.com/content/ICCV2021/html/Arnab_ViViT_A_Video_Vision_Transformer_ICCV_2021_paper.html)[[code]](https://github.com/google-research/scenic/tree/main/scenic/projects/vivit)
- (*ICML'21*) Is Space-Time Attention All You Need for Video Understanding? [[paper]](https://arxiv.org/abs/2102.05095)[[code]](https://github.com/facebookresearch/TimeSformer)
- (*NeurIPS'17*) Neural Discrete Representation Learning [[paper]](https://proceedings.neurips.cc/paper_files/paper/2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf)[[code]](https://github.com/google-deepmind/sonnet/blob/v2/sonnet/src/nets/vqvae.py)
- (*CVPR'22*) High-Resolution Image Synthesis with Latent Diffusion Models [[paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.pdf)[[code]](https://github.com/CompVis/latent-diffusion)
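Several of the papers above (ViT, ViViT) turn pixels into patch tokens, and Sora's technical report describes compressing video into "spacetime patches" in the same spirit. A minimal pure-Python sketch of ViViT-style tubelet patchification (the tubelet size `t, p` and the nested-list video layout are illustrative assumptions, not Sora's actual pipeline):

```python
def spacetime_patches(video, t=2, p=4):
    """Split a video (list of T frames, each an H x W x C nested list)
    into non-overlapping t x p x p "tubelet" tokens (ViViT-style).
    Returns a list of flattened patch vectors, one per token."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    C = len(video[0][0][0])
    tokens = []
    for ts in range(0, T - t + 1, t):          # temporal windows
        for ys in range(0, H - p + 1, p):      # spatial rows
            for xs in range(0, W - p + 1, p):  # spatial cols
                patch = [video[ts + dt][ys + dy][xs + dx][c]
                         for dt in range(t)
                         for dy in range(p)
                         for dx in range(p)
                         for c in range(C)]
                tokens.append(patch)
    return tokens

# Toy example: 4 frames of 8x8 RGB -> (4/2)*(8/4)*(8/4) = 8 tokens,
# each flattened to 2*4*4*3 = 96 values.
video = [[[[0.0] * 3 for _ in range(8)] for _ in range(8)] for _ in range(4)]
tokens = spacetime_patches(video)
print(len(tokens), len(tokens[0]))  # 8 96
```

In the papers above, each flattened patch would then be linearly projected to a token embedding; NaViT's "Patch n' Pack" additionally packs patches from variable-resolution inputs into one sequence.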
#### Modeling
- (*JMLR'22*) Cascaded Diffusion Models for High Fidelity Image Generation [[paper]](https://dl.acm.org/doi/abs/10.5555/3586589.3586636)
- (*ICLR'22*) Progressive Distillation for Fast Sampling of Diffusion Models [[paper]](https://arxiv.org/abs/2202.00512)[[code]](https://github.com/google-research/google-research/tree/master/diffusion_distillation)
- Imagen Video: High Definition Video Generation with Diffusion Models [[paper]](https://arxiv.org/abs/2210.02303)
- (*CVPR'23*) Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [[paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Blattmann_Align_Your_Latents_High-Resolution_Video_Synthesis_With_Latent_Diffusion_Models_CVPR_2023_paper.pdf)
- (*ICCV'23*) Scalable Diffusion Models with Transformers [[paper]](https://openaccess.thecvf.com/content/ICCV2023/papers/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.pdf)
- (*CVPR'23*) All Are Worth Words: A ViT Backbone for Diffusion Models [[paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Bao_All_Are_Worth_Words_A_ViT_Backbone_for_Diffusion_Models_CVPR_2023_paper.pdf)[[code]](https://github.com/baofff/U-ViT)
- (*ICCV'23*) Masked Diffusion Transformer Is a Strong Image Synthesizer [[paper]](https://openaccess.thecvf.com/content/ICCV2023/papers/Gao_Masked_Diffusion_Transformer_is_a_Strong_Image_Synthesizer_ICCV_2023_paper.pdf)[[code]](https://github.com/sail-sg/mdt)
- (*arXiv 2023.12*) DiffiT: Diffusion Vision Transformers for Image Generation [[paper]](https://arxiv.org/abs/2312.02139)[[code]](https://github.com/nvlabs/diffit)
- (*CVPR'24*) GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation [[paper]](https://arxiv.org/abs/2312.04557)
- (*arXiv 2023.09*) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [[paper]](https://arxiv.org/abs/2309.15103)[[code]](https://github.com/Vchitect/LaVie)
- (*arXiv 2024.01*) Latte: Latent Diffusion Transformer for Video Generation [[paper]](https://arxiv.org/abs/2401.03048)[[code]](https://github.com/Vchitect/Latte)
- (*arXiv 2024.03*) Scaling Rectified Flow Transformers for High-Resolution Image Synthesis [[paper]](https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf)
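The diffusion models listed above all train a denoiser on progressively noised latents. A minimal sketch of the forward (noising) process under a cosine schedule, in the style of these papers (the flat-vector latent and schedule constants are assumptions for illustration; the real models operate on patch-token latents):

```python
import math
import random

def cosine_alpha_bar(t, T):
    """Cumulative signal level alpha-bar_t under a cosine noise schedule."""
    f = lambda s: math.cos((s / T + 0.008) / 1.008 * math.pi / 2) ** 2
    return f(t) / f(0)

def forward_diffuse(x0, t, T, rng=random):
    """Sample x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps for a flat latent vector x0.
    Returns (x_t, eps); the denoiser (e.g. a diffusion transformer) is
    trained to predict eps from (x_t, t)."""
    ab = cosine_alpha_bar(t, T)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps
```

At `t = 0` the sample is the clean latent (`alpha-bar = 1`); as `t` approaches `T` it approaches pure Gaussian noise, which is where sampling starts at generation time.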
#### Language Instruction Following
- Improving Image Generation with Better Captions [[paper]](https://cdn.openai.com/papers/dall-e-3.pdf)
- (*arXiv 2022.05*) CoCa: Contrastive Captioners are Image-Text Foundation Models [[paper]](https://arxiv.org/abs/2205.01917)[[code]](https://github.com/lucidrains/CoCa-pytorch)
- (*arXiv 2022.12*) VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners [[paper]](https://arxiv.org/abs/2212.04979)
- (*CVPR'23*) InstructPix2Pix: Learning to Follow Image Editing Instructions [[paper]](https://arxiv.org/abs/2211.09800)[[code]](https://github.com/timothybrooks/instruct-pix2pix)
- (*NeurIPS'23*) Visual Instruction Tuning [[paper]](https://arxiv.org/abs/2304.08485)[[code]](https://github.com/haotian-liu/LLaVA)
- (*ICML'23*) mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [[paper]](https://arxiv.org/abs/2302.00402)