## Open-Sora: Democratizing Efficient Video Production for All
We present **Open-Sora**, an initiative dedicated to **efficiently** producing high-quality videos and making the model,
tools, and content accessible to all. By embracing **open-source** principles,
Open-Sora not only democratizes access to advanced video generation techniques, but also offers a
streamlined and user-friendly platform that simplifies the complexities of video production.
With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the realm of content creation.
<h4>Open-Sora is still at an early stage and under active development.</h4>
## 🎥 Latest Demo
| **2s 512×512** | **2s 512×512** | **2s 512×512** |
| --- | --- | --- |
| [<img src="assets/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="assets/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="assets/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
| A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
| [<img src="assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
| A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...] |
Videos are downsampled to `.gif` for display; click a thumbnail for the original video. Prompts are trimmed for
display; see [here](/assets/texts/t2v_samples.txt) for the full prompts. More samples are available in
our [gallery](https://hpcaitech.github.io/Open-Sora/).
## New Features/Updates
* Open-Sora-v1 released. Model weights are available [here](#model-weights). With only 400K video clips and 200 H800
  days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos.
* ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each
  stage.
* ✅ Support for training acceleration, including an accelerated transformer, faster T5 and VAE, and sequence
  parallelism. Open-Sora improves training speed by **55%** when training on 64×512×512 videos. Details are in
  [acceleration.md](docs/acceleration.md).
* ✅ We provide a data preprocessing pipeline,
  including [downloading](/tools/datasets/README.md), [video cutting](/tools/scenedetect/README.md),
  and [captioning](/tools/caption/README.md) tools. Our data collection plan can be found
  in [datasets.md](docs/datasets.md).
* ✅ We find that the VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) yields low quality, so we
  adopt a better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find that
  patching along the time dimension deteriorates quality; a per-frame encoding sketch follows this list. See our
  **[report](docs/report_v1.md)** for more discussion.
* ✅ We investigate different architectures, including DiT, Latte, and our proposed **STDiT**, which achieves a better
  trade-off between quality and speed; an illustrative sketch follows this list. See our
  **[report](docs/report_v1.md)** for more discussion.
* ✅ Support for CLIP and T5 text conditioning.
* ✅ By viewing images as one-frame videos (see the sketch after this list), our project supports training DiT on both
  images and videos (e.g., ImageNet & UCF101). See [commands.md](docs/commands.md) for more instructions.
* ✅ Support inference with official weights
  from [DiT](https://github.com/facebookresearch/DiT), [Latte](https://github.com/Vchitect/Latte),
  and [PixArt](https://pixart-alpha.github.io/).
<details>
<summary>View more</summary>
* ✅ Refactored the codebase. See [structure.md](docs/structure.md) to learn the project structure and how to use the
  config files.
</details>
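To make the per-frame use of a 2D image VAE concrete (i.e., no patching or compression along the time axis), here is a
minimal sketch. It loads the diffusers-format release of the Stability-AI VAE (`stabilityai/sd-vae-ft-mse`, the
counterpart of the checkpoint linked above); the reshaping around it is illustrative, not Open-Sora's exact code:

```python
# Minimal sketch: encode a video with a 2D image VAE frame by frame,
# leaving the time axis untouched (no temporal patching).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
video = torch.randn(1, 3, 16, 512, 512)  # (batch, channels, frames, H, W)

b, c, t, h, w = video.shape
frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)  # fold time into batch
with torch.no_grad():
    latents = vae.encode(frames).latent_dist.sample()  # 8x spatial downsampling
latents = latents.reshape(b, t, *latents.shape[1:]).permute(0, 2, 1, 3, 4)
print(latents.shape)  # torch.Size([1, 4, 16, 64, 64]) -- all 16 frames preserved
```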
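STDiT's actual design is described in the [report](docs/report_v1.md); the toy block below only illustrates the
general idea behind spatial-temporal factorization (attend within each frame, then across frames at each spatial
location). All names and shapes here are illustrative assumptions, not the real implementation:

```python
# Conceptual sketch of factorized spatial-temporal attention.
# NOT the actual STDiT implementation -- see docs/report_v1.md for that.
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, t, s, d = x.shape
        xs = x.reshape(b * t, s, d)                  # attend within each frame
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]  # attend across frames
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

block = FactorizedSTBlock(dim=64)
out = block(torch.randn(2, 16, 256, 64))  # 2 clips, 16 frames, 16x16 tokens each
print(out.shape)  # torch.Size([2, 16, 256, 64])
```

Compared with full 3D attention over all `t * s` tokens, the factorized form costs roughly `O(t·s²) + O(s·t²)`
instead of `O((t·s)²)`, which is where the speed advantage of such designs comes from.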
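The "images as one-frame videos" trick amounts to adding a singleton time axis so that image batches flow through the
same `(B, C, T, H, W)` code path as videos; a minimal sketch:

```python
# Minimal sketch: treat an image batch as one-frame videos so a single
# (B, C, T, H, W) video pipeline handles both images and videos.
import torch

images = torch.randn(8, 3, 256, 256)      # (B, C, H, W) image batch
videos = images.unsqueeze(2)              # (B, C, T=1, H, W)
print(videos.shape)                       # torch.Size([8, 3, 1, 256, 256])

clips = torch.randn(8, 3, 16, 256, 256)   # a real video batch: same layout, T=16
```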
### TODO list sorted by priority
* [ ] Complete the data processing pipeline (including dense optical flow, aesthetic scores, text-image similarity,
  deduplication, etc.). See [datasets.md](/docs/datasets.md) for more information. **[WIP]**
* [ ] Train a Video-VAE. **[WIP]**
<details>
<summary>View more</summary>
* [ ] Support image and video conditioning.
* [ ] Evaluation pipeline.
* [ ] Incorporate a better scheduler, e.g., rectified flow in SD3.
* [ ] Support variable aspect ratios, resolutions, and durations.
* [ ] Support SD3 when released.
</details>
## Contents
* [Installation](#installation)
* [Model Weights](#model-weights)
* [Inference](#inference)
* [Data Processing](#data-processing)
* [Training](#training)
* [Contribution](#contribution)
* [Acknowledgement](#acknowledgement)
* [Citation](#citation)
## Installation
```bash
# create a virtual env
conda create -n opensora python=3.10
# activate virtual environment
conda activate opensora
# install torch
# the command below is for CUDA 12.1, choose install commands from
# https://pytorch.org/get-started/locally/ based on your own CUDA version
pip install torch torchvision
# install flash attention (optional)
# set enable_flashattn=False in config to avoid using flash attention
pip install packaging ninja
pip install flash-attn --no-build-isolation
# install apex (optional)
# set enable_layernorm_kernel=False in config to avoid using apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
# install xformers
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .
```
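Optionally, you can sanity-check the environment. The snippet below uses only the standard import names of the
packages installed above:

```python
# Optional sanity check for the environment set up above.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# flash-attn and apex are optional; if an import fails, disable the
# corresponding flag in your config (see the install comments above).
try:
    import flash_attn  # noqa: F401
    print("flash-attn: OK")
except ImportError:
    print("flash-attn: missing -> set enable_flashattn=False")

try:
    import apex  # noqa: F401
    print("apex: OK")
except ImportError:
    print("apex: missing -> set enable_layernorm_kernel=False")
```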
After installation, we suggest reading [structure.md](docs/structure.md) to learn the project structure and how to use
the config files.
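For orientation, the two acceleration flags mentioned in the install comments are set in the config files. The sketch
below only illustrates where they live; every field other than the two flags is a hypothetical placeholder, and the
real schema is documented in [structure.md](docs/structure.md):

```python
# Hypothetical config sketch; only enable_flashattn and
# enable_layernorm_kernel are referenced in the install steps above.
# Other fields are illustrative placeholders -- see docs/structure.md.
model = dict(
    type="STDiT-XL/2",             # placeholder model name
    enable_flashattn=True,         # set False if flash-attn is not installed
    enable_layernorm_kernel=True,  # set False if apex is not installed
)
```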