<div align="center">
# VoxFormer: a Cutting-edge Baseline for 3D Semantic Occupancy Prediction
</div>
![](https://img.shields.io/badge/Ranked%20%231-Camera--Only%203D%20SSC%20on%20SemanticKITTI-green "")
![](./teaser/scene08_13_19.gif "")
> **VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion**, CVPR 2023.
> [Yiming Li](https://scholar.google.com/citations?hl=en&user=i_aajNoAAAAJ&view_op=list_works&sortby=pubdate), [Zhiding Yu](https://scholar.google.com/citations?user=1VI_oYUAAAAJ&hl=en), [Chris Choy](https://scholar.google.com/citations?user=2u8G5ksAAAAJ&hl=en), [Chaowei Xiao](https://scholar.google.com/citations?user=Juoqtj8AAAAJ&hl=en), [Jose M. Alvarez](https://scholar.google.com/citations?user=Oyx-_UIAAAAJ&hl=en), [Sanja Fidler](https://scholar.google.com/citations?user=CUlqK5EAAAAJ&hl=en), [Chen Feng](https://scholar.google.com/citations?user=YeG8ZM0AAAAJ&hl=en), [Anima Anandkumar](https://scholar.google.com/citations?user=bEcLezcAAAAJ&hl=en)
> [[PDF]](https://arxiv.org/pdf/2302.12251.pdf) [[Project]](https://github.com/NVlabs/VoxFormer) [[Intro Video]](https://youtu.be/KEn8oklzyvo?si=k2V4c22MCCvu9zFr)
## News
- [2023/07]: We release the code of VoxFormer with a 3D deformable attention module, which achieves slightly better performance.
- [2023/06]: 🔥 We release [SSCBench](https://github.com/ai4ce/SSCBench), a large-scale semantic scene completion benchmark derived from KITTI-360, nuScenes, and Waymo.
- [2023/06]: Welcome to our CVPR poster session on 21 June (**WED-AM-082**), and check our [online video](https://www.youtube.com/watch?v=L0M9ayR316g).
- [2023/03]: 🔥 VoxFormer is accepted by [CVPR 2023](https://cvpr2023.thecvf.com/) as a highlight paper **(235/9155, 2.5% acceptance rate)**.
- [2023/02]: Our paper is on [arxiv](https://arxiv.org/abs/2302.12251).
- [2022/11]: VoxFormer achieves SOTA on the [SemanticKITTI 3D SSC (Semantic Scene Completion) Task](http://www.semantic-kitti.org/tasks.html#ssc) with **13.35% mIoU** and **44.15% IoU** (camera-only)!
<br/>
## Abstract
Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory during training by ~45% to less than 16GB.
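The stage-1 query proposal described above can be sketched in a few lines: back-project each pixel's estimated depth into a coarse voxel grid and keep only the voxels that receive at least one point as query positions. This is a minimal illustrative sketch; the intrinsics, voxel size, and grid shape below are made-up parameters, not VoxFormer's actual configuration.

```python
import numpy as np

def propose_queries(depth, K, voxel_size=0.2, grid_shape=(32, 32, 8)):
    """Class-agnostic query proposal (stage 1), simplified:
    back-project pixels into a voxel grid and return the set of
    occupied voxel coordinates to be used as sparse queries."""
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # pinhole back-projection: pixel coordinates + depth -> 3D points
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # quantize points into voxel indices, discarding out-of-grid ones
    idx = np.floor(pts / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    occ = np.zeros(grid_shape, dtype=bool)
    occ[tuple(idx[keep].T)] = True
    return np.argwhere(occ)  # (N, 3) voxel coords of visible structures
```

Only these occupied-voxel queries attend to image features in stage 2; the remaining voxels are later filled in via mask tokens and self-attention.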
## Method
| ![space-1.jpg](teaser/arch.png) |
|:--:|
| ***Figure 1. Overall framework of VoxFormer**. Given RGB images, 2D features are extracted by ResNet50 and the depth is estimated by an off-the-shelf depth predictor. The estimated depth after correction enables the class-agnostic query proposal stage: the query located at an occupied position will be selected to carry out deformable cross-attention with image features. Afterwards, mask tokens will be added for completing voxel features by deformable self-attention. The refined voxel features will be upsampled and projected to the output space for per-voxel semantic segmentation. Note that our framework supports the input of single or multiple images.* |
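The mask-token completion step in Figure 1 can be sketched as follows: scatter the sparse query features into a dense grid, fill the remaining positions with a shared mask token, and let self-attention propagate information to every voxel. This is a simplified assumption-laden sketch: it uses a zero mask token in place of a learned embedding and vanilla softmax self-attention in place of the deformable self-attention used in the paper.

```python
import numpy as np

def densify(sparse_feats, sparse_idx, grid_shape, d):
    """Stage-2 densification, heavily simplified: refined sparse
    query features are scattered into a dense voxel grid, empty
    voxels receive a shared mask token, and one round of plain
    self-attention completes the voxel features."""
    n_voxels = int(np.prod(grid_shape))
    mask_token = np.zeros(d)              # stands in for a learned embedding
    dense = np.tile(mask_token, (n_voxels, 1))
    flat = np.ravel_multi_index(sparse_idx.T, grid_shape)
    dense[flat] = sparse_feats            # occupied voxels keep their features
    # single-head scaled dot-product self-attention over all voxels
    scores = dense @ dense.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ dense                   # (n_voxels, d) refined features
```

Full self-attention over a dense voxel grid scales quadratically in the number of voxels, which is one reason the actual model uses deformable self-attention that samples only a few locations per query.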
## Getting Started
- [Installation](docs/install.md)
- [Prepare Dataset](docs/prepare_dataset.md)
- [Run and Eval](docs/getting_started.md)
## Model Zoo
The query proposal network (QPN) for stage-1 is available [here](https://drive.google.com/file/d/1NzN6eqCnuxzau0m_N9B02Q2zwLBKhnBp/view?usp=share_link).
For stage-2, please download the trained models listed in the table below.
| Backbone | Method | Lr Schd | IoU| mIoU | Config | Download |
| :---: | :---: | :---: | :---: | :---:| :---: | :---: |
| [R50](https://drive.google.com/file/d/1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE/view?usp=share_link) | VoxFormer-T | 20ep | 44.15| 13.35|[config](projects/configs/voxformer/voxformer-T.py) |[model](https://drive.google.com/file/d/1KOYN3MGHMyCTDZWw4lNNicCdImnKqvlz/view?usp=share_link) |
| [R50](https://drive.google.com/file/d/1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE/view?usp=share_link) | VoxFormer-S | 20ep | 44.02| 12.35|[config](projects/configs/voxformer/voxformer-S.py) |[model](https://drive.google.com/file/d/1UBemF77Cfr0d9rcC_Y9Qmjnqp_c4qoeb/view?usp=share_link)|
| [R50](https://drive.google.com/file/d/1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE/view?usp=share_link) | VoxFormer-T-3D | 20ep | 44.35| 13.69|[config](projects/configs/voxformer/voxformer-T_deform3D.py) |[model](https://drive.google.com/file/d/1JQwaO5XXMMkTcF95tCHk45q6PzZnofS6/view?usp=drive_link)|
| [R50](https://drive.google.com/file/d/1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE/view?usp=share_link) | VoxFormer-S-3D | 20ep | 44.42| 12.86|[config](projects/configs/voxformer/voxformer-S_deform3D.py) |[model](https://drive.google.com/file/d/1kwcMGRl9FOprV2k5kqCS0kfvrbCfJMcZ/view?usp=drive_link)|
## Dataset
- [x] SemanticKITTI
- [ ] KITTI-360
- [ ] nuScenes
## Bibtex
If this work is helpful for your research, please cite the following BibTeX entry.
```
@InProceedings{li2023voxformer,
  title     = {VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion},
  author    = {Li, Yiming and Yu, Zhiding and Choy, Christopher and Xiao, Chaowei and Alvarez, Jose M and Fidler, Sanja and Feng, Chen and Anandkumar, Anima},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}
}
```
## License
Copyright © 2022-2023, NVIDIA Corporation and Affiliates. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC. Click [here](https://github.com/NVlabs/VoxFormer/blob/main/LICENSE) to view a copy of this license.
The pre-trained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
For business inquiries, please visit our website and submit the form: [NVIDIA Research Licensing](https://www.nvidia.com/en-us/research/inquiries/).
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=NVlabs/VoxFormer&type=Date)](https://star-history.com/#NVlabs/VoxFormer)
## Acknowledgement
Many thanks to these excellent open source projects:
- [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
- [mmdet3d](https://github.com/open-mmlab/mmdetection3d)
- [MonoScene](https://github.com/astra-vision/MonoScene)
- [LMSCNet](https://github.com/astra-vision/LMSCNet)
- [semantic-kitti-api](https://github.com/PRBonn/semantic-kitti-api)
- [MobileStereoNet](https://github.com/cogsys-tuebingen/mobilestereonet)
- [Pseudo_Lidar_V2](https://github.com/mileyan/Pseudo_Lidar_V2)
- [wysiwyg](https://github.com/peiyunh/wysiwyg)