CVPR 2021 Object Detection Paper List (70 papers in total)
3D Object Detection
1. 3DIoUMatch: Leveraging IoU Prediction for Semi-Supervised 3D
Object Detection
Abstract:3D object detection is an important yet demanding task that heavily
relies on difficult-to-obtain 3D annotations. To reduce the required amount of
supervision, we propose 3DIoUMatch, a novel semi-supervised method for 3D
object detection applicable to both indoor and outdoor scenes. We leverage a
teacher-student mutual learning framework to propagate information from the
labeled to the unlabeled train set in the form of pseudo-labels. However, due to
the high task complexity, we observe that the pseudo-labels suffer from
significant noise and are thus not directly usable. To that end, we introduce a
confidence-based filtering mechanism, inspired by FixMatch. We set confidence
thresholds based upon the predicted objectness and class probability to filter low-
quality pseudo-labels. While effective, we observe that these two measures do
not sufficiently capture localization quality. We therefore propose to use the
estimated 3D IoU as a localization metric and set category-aware self-adjusted
thresholds to filter poorly localized proposals. We adopt VoteNet as our backbone
detector on indoor datasets while we use PV-RCNN on the autonomous driving
dataset, KITTI. Our method consistently improves state-of-the-art methods on
both ScanNet and SUN-RGBD benchmarks by significant margins under all label
ratios (including the fully labeled setting). For example, when training using only 10%
labeled data on ScanNet, 3DIoUMatch achieves a 7.7 absolute improvement on
mAP@0.25 and an 8.5 absolute improvement on mAP@0.5 over the prior art. On
KITTI, we are the first to demonstrate semi-supervised 3D object detection, and
our method surpasses a fully supervised baseline by 1.8% to 7.6% under
different label ratios and categories.
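
The filtering step described above reduces to a few comparisons per proposal. Below is a minimal Python sketch of confidence- and IoU-based pseudo-label filtering, not the authors' implementation: the proposal fields, default threshold values, and the filter_pseudo_labels helper are illustrative assumptions.

```python
def filter_pseudo_labels(proposals, obj_thresh=0.9, cls_thresh=0.9,
                         iou_thresh_per_class=None):
    """Gate pseudo-labels on confidence and estimated 3D IoU.

    proposals: iterable of dicts with illustrative keys 'objectness',
    'cls_prob', 'pred_iou' (the estimated 3D IoU), and 'category'.
    iou_thresh_per_class: category -> IoU threshold, standing in for the
    paper's category-aware self-adjusted thresholds.
    """
    iou_thresh_per_class = iou_thresh_per_class or {}
    kept = []
    for p in proposals:
        # FixMatch-inspired confidence gating on objectness and class score.
        if p['objectness'] < obj_thresh or p['cls_prob'] < cls_thresh:
            continue
        # Localization gating: the predicted 3D IoU must clear a
        # per-category threshold (0.25 is an assumed default).
        if p['pred_iou'] < iou_thresh_per_class.get(p['category'], 0.25):
            continue
        kept.append(p)
    return kept
```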
2. Categorical Depth Distribution Network for Monocular 3D Object
Detection
Abstract:Monocular 3D object detection is a key problem for autonomous
vehicles, as it provides a solution with simple configuration compared to typical
multi-sensor systems. The main challenge in monocular 3D detection lies in
accurately predicting object depth, which must be inferred from object and scene
cues due to the lack of direct range measurement. Many methods attempt to
directly estimate depth to assist in 3D detection, but show limited performance as
a result of depth inaccuracy. Our proposed solution, Categorical Depth
Distribution Network (CaDDN), uses a predicted categorical depth distribution for
each pixel to project rich contextual feature information to the appropriate depth
interval in 3D space. We then use the computationally efficient bird’s-eye-view
projection and single-stage detector to produce the final output detections. We
design CaDDN as a fully differentiable end-to-end approach for joint depth
estimation and object detection. We validate our approach on the KITTI 3D object
detection benchmark, where we rank 1st among published monocular methods.
We also provide the first monocular 3D detection results on the newly released
Waymo Open Dataset. A code release for CaDDN is publicly available.
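
The central operation, spreading each pixel's features over discrete depth bins according to its predicted categorical distribution, can be sketched in a few lines. This is a simplified illustration, not the released CaDDN code; the function name and array layout are assumptions, and the subsequent frustum-to-voxel warp and bird's-eye-view collapse are omitted.

```python
import numpy as np

def lift_features_to_depth_bins(features, depth_logits):
    """features: (H, W, C) image features; depth_logits: (H, W, K) scores
    over K discrete depth bins. Returns an (H, W, K, C) frustum volume in
    which each pixel's feature is weighted by its depth distribution."""
    # Softmax over the bin axis yields the categorical depth distribution.
    exp = np.exp(depth_logits - depth_logits.max(axis=-1, keepdims=True))
    depth_probs = exp / exp.sum(axis=-1, keepdims=True)          # (H, W, K)
    # Per-pixel outer product: (H, W, K, 1) * (H, W, 1, C).
    return depth_probs[..., :, None] * features[..., None, :]
```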
3. ST3D: Self-training for Unsupervised Domain Adaptation on 3D
Object Detection
Abstract:We present a new domain adaptive self-training pipeline, named ST3D,
for unsupervised domain adaptation on 3D object detection from point clouds.
First, we pre-train the 3D detector on the source domain with our proposed
random object scaling strategy for mitigating the negative effects of source
domain bias. Then, the detector is iteratively improved on the target domain by
alternately conducting two steps: pseudo-label updating with the developed
quality-aware triplet memory bank, and model training with
curriculum data augmentation. These specific designs for 3D object detection
enable the detector to be trained with consistent and high-quality pseudo labels
and to avoid overfitting to the large number of easy examples in pseudo labeled
data. Our ST3D achieves state-of-the-art performance on all evaluated datasets
and even surpasses fully supervised results on KITTI 3D object detection
benchmark. Code will be available at https://github.com/CVMI-Lab/ST3D.
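
Of the pieces above, random object scaling is the simplest to illustrate. The sketch below assumes axis-aligned boxes for brevity (the actual augmentation also handles box heading); the function name and scale range are illustrative.

```python
import numpy as np

def random_object_scaling(points, box, scale_range=(0.9, 1.1)):
    """Rescale one ground-truth object and its interior points by a shared
    random factor, diluting the source domain's object-size bias.

    points: (N, 3) point cloud; box: (cx, cy, cz, l, w, h), axis-aligned.
    """
    cx, cy, cz, l, w, h = box
    center = np.array([cx, cy, cz])
    half = np.array([l, w, h]) / 2.0
    inside = np.all(np.abs(points - center) <= half, axis=1)
    s = np.random.uniform(*scale_range)
    out = points.copy()
    # Scale interior points about the box center; scale the box extents.
    out[inside] = center + (out[inside] - center) * s
    return out, (cx, cy, cz, l * s, w * s, h * s)
```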
4. Center-based 3D Object Detection and Tracking
Abstract:Three-dimensional objects are commonly represented as 3D boxes in a
point-cloud. This representation mimics the well-studied image-based 2D
bounding-box detection but comes with additional challenges. Objects in a 3D
world do not follow any particular orientation, and box-based detectors have
difficulties enumerating all orientations or fitting an axis-aligned bounding box to
rotated objects. In this paper, we instead propose to represent, detect, and track
3D objects as points. Our framework, CenterPoint, first detects centers of objects
using a keypoint detector and regresses to other attributes, including 3D size, 3D
orientation, and velocity. In a second stage, it refines these estimates using
additional point features on the object. In CenterPoint, 3D object tracking
simplifies to greedy closest-point matching. The resulting detection and tracking
algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art
performance on the nuScenes benchmark for both 3D detection and tracking,
with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open
Dataset, CenterPoint outperforms all previous single-model methods by a large
margin and ranks first among all Lidar-only submissions. The code and pretrained
models are available at https://github.com/tianweiy/CenterPoint.
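
The "greedy closest-point matching" the tracker reduces to is short enough to write out. The following is an illustrative sketch rather than the released tracker: the 2.0 gating distance and function names are assumptions, and in practice previous centers are first velocity-compensated using the predicted velocities.

```python
import numpy as np

def greedy_center_matching(prev_centers, prev_ids, curr_centers, max_dist=2.0):
    """Greedy closest-center association between consecutive frames.

    prev_centers: (M, 2) previous-frame BEV centers; prev_ids: M track ids;
    curr_centers: (N, 2) current centers. Returns one id per detection.
    """
    ids = [-1] * len(curr_centers)
    if len(prev_centers) and len(curr_centers):
        dist = np.linalg.norm(curr_centers[:, None] - prev_centers[None], axis=-1)
        used_prev, used_curr = set(), set()
        # Visit candidate (current, previous) pairs from closest to farthest.
        for idx in np.argsort(dist, axis=None):
            i, j = np.unravel_index(idx, dist.shape)
            if dist[i, j] > max_dist:
                break                       # every remaining pair is farther
            if i in used_curr or j in used_prev:
                continue
            ids[i] = prev_ids[j]
            used_curr.add(i)
            used_prev.add(j)
    next_id = max(prev_ids, default=-1) + 1
    for i, tid in enumerate(ids):           # unmatched detections start tracks
        if tid == -1:
            ids[i] = next_id
            next_id += 1
    return ids
```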
2D Object Detection
5. Scaled-YOLOv4: Scaling Cross Stage Partial Network
Abstract:We show that the YOLOv4 object detection neural network, based on
the CSP approach, scales both up and down and is applicable to small and large
networks while maintaining optimal speed and accuracy. We propose a network
scaling approach that modifies not only the depth, width, and resolution, but also
the structure of the network. The YOLOv4-large model achieves state-of-the-art
results: 55.5% AP (73.4% AP50) on the MS COCO dataset at a speed of ~16 FPS
on a Tesla V100, while with test-time augmentation YOLOv4-large achieves
56.0% AP (73.3% AP50). To the best of our knowledge, this is currently the
highest accuracy on the COCO dataset among any published work. The
YOLOv4-tiny model achieves 22.0% AP (42.0% AP50) at a speed of ~443 FPS
on an RTX 2080Ti, and with TensorRT, batch size 4, and FP16 precision,
YOLOv4-tiny achieves 1774 FPS.
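
As a rough illustration of what scaling a stage up or down means in practice, here is a hedged sketch of width/depth scaling with hardware-friendly channel rounding; the multipliers, rounding rule, and helper name are assumptions, and the paper's method additionally adapts input resolution and the network structure itself.

```python
import math

def scale_csp_stage(base_channels, base_blocks, width_mult, depth_mult):
    """Scale one stage's channel width and block depth."""
    # Round channels up to a multiple of 8 so layers stay hardware friendly.
    channels = max(8, int(math.ceil(base_channels * width_mult / 8)) * 8)
    blocks = max(1, round(base_blocks * depth_mult))
    return channels, blocks

# Example: growing a 128-channel, 3-block stage for a larger variant.
print(scale_csp_stage(128, 3, width_mult=1.25, depth_mult=1.33))  # (160, 4)
```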
6. You Only Look One-level Feature
Abstract:This paper revisits feature pyramid networks (FPN) for one-stage
detectors and points out that the success of FPN is due to its divide-and-conquer
solution to the optimization problem in object detection rather than multi-scale
feature fusion. From the perspective of optimization, we introduce an alternative
way to address the problem instead of adopting complex feature pyramids:
utilizing only a single-level feature for detection. Based on this simple and efficient
solution, we present You Only Look One-level Feature (YOLOF). In our method,
two key components, Dilated Encoder and Uniform Matching, are proposed and
bring considerable improvements. Extensive experiments on the COCO
benchmark prove the effectiveness of the proposed model. Our YOLOF achieves
comparable results with its feature pyramids counterpart RetinaNet while being
2.5× faster. Without transformer layers, YOLOF can match the performance of
DETR in a single-level feature manner with 7× fewer training epochs. Code is
available at https://github.com/megvii-model/YOLOF.
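
Of the two components, Uniform Matching is easy to convey in code: each ground-truth box takes its k nearest anchors as positives, so objects of every size receive the same number of positive samples on the single feature level. The sketch below is illustrative (YOLOF additionally filters matches by IoU, which is omitted here):

```python
import numpy as np

def uniform_matching(anchor_centers, gt_centers, k=4):
    """anchor_centers: (A, 2); gt_centers: (G, 2).
    Returns (G, k): indices of each ground truth's k positive anchors,
    chosen purely by center distance on the single feature level."""
    dist = np.linalg.norm(gt_centers[:, None] - anchor_centers[None], axis=-1)
    return np.argsort(dist, axis=1)[:, :k]
```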
7. Sparse R-CNN: End-to-End Object Detection with Learnable
Proposals
8. End-to-End Object Detection with Fully Convolutional Network
9. Dynamic Head: Unifying Object Detection Heads with Attentions
Abstract:The complex nature of combining localization and classification in
object detection has resulted in a flourishing development of methods. Previous
works have tried to improve performance with various object detection heads but
failed to present a unified view. In this paper, we present a novel dynamic head
framework to unify object detection heads with attentions. By coherently
combining multiple self-attention mechanisms between feature levels for scale-
awareness, among spatial locations for spatial-awareness, and within output
channels for task-awareness, the proposed approach significantly improves the
representation ability of object detection heads without any computational
overhead. Further experiments demonstrate the effectiveness and efficiency of
the proposed dynamic head on the COCO benchmark. With a standard
ResNeXt-101-DCN backbone, we largely improve the performance over popular
object detectors and achieve a new state-of-the-art at 54.0 AP. The code will be
released at https://github.com/microsoft/DynamicHead.
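
To make the unified view concrete, the sketch below treats the pyramid features as a (levels, spatial positions, channels) tensor and chains three attentions over its three axes. It is a deliberately simplified stand-in: the paper realizes the spatial step with deformable convolution and the task step with a dynamic-ReLU-style function, for which plain sigmoid gates are substituted here.

```python
import torch
import torch.nn as nn

class SimplifiedDynamicHead(nn.Module):
    """Toy dynamic-head sketch: sequential scale-, spatial-, and
    task-aware gating over a (L levels, S positions, C channels) tensor."""
    def __init__(self, channels):
        super().__init__()
        self.level_fc = nn.Linear(channels, 1)        # scale-aware weight
        self.spatial_fc = nn.Linear(channels, 1)      # spatial-aware weight
        self.task_fc = nn.Linear(channels, channels)  # task-aware gate

    def forward(self, feats):                         # feats: (L, S, C)
        # Scale-aware: one sigmoid weight per level from its mean feature.
        w_level = torch.sigmoid(self.level_fc(feats.mean(dim=1)))   # (L, 1)
        feats = feats * w_level.unsqueeze(1)
        # Spatial-aware: one sigmoid weight per (level, location).
        feats = feats * torch.sigmoid(self.spatial_fc(feats))       # (L, S, 1)
        # Task-aware: channel-wise gating shared over all locations.
        return feats * torch.sigmoid(self.task_fc(feats.mean(dim=(0, 1))))
```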
10. Generalized Focal Loss V2: Learning Reliable Localization Quality
Estimation for Dense Object Detection
Abstract:Localization Quality Estimation (LQE) is crucial and popular in the
recent advancement of dense object detectors since it can provide accurate
ranking scores that benefit the Non-Maximum Suppression processing and
improve detection performance. As a common practice, most existing methods
predict LQE scores through vanilla convolutional features shared with object
classification or bounding box regression. In this paper, we explore a completely
novel and different perspective to perform LQE: based on the learned
distributions of the four parameters of the bounding box. The bounding box
distributions, introduced as the "General Distribution" in GFLV1, describe
the uncertainty of the predicted bounding boxes well. Such a
property makes the distribution statistics of a bounding box highly correlated to its
real localization quality. Specifically, a bounding box distribution with a sharp peak
usually corresponds to high localization quality, and vice versa. By leveraging the
close correlation between distribution statistics and the real localization quality,
we develop a considerably lightweight Distribution-Guided Quality Predictor
(DGQP) for reliable LQE based on GFLV1, thus producing GFLV2. To the best of
our knowledge, it is the first attempt in object detection to use a highly relevant,
statistical representation to facilitate LQE. Extensive experiments demonstrate
the effectiveness of our method. Notably, GFLV2 (ResNet101) achieves 46.2 AP
at 14.6 FPS, surpassing the previous state-of-the-art ATSS baseline (43.6 AP at
14.6 FPS) by an absolute 2.6 AP on COCO test-dev, without sacrificing
efficiency in either training or inference.
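
A compact PyTorch sketch of a DGQP-style head follows; layer sizes and names are assumptions based on the description above (Top-k statistics of the four box-side distributions feeding a tiny subnetwork), not the released code.

```python
import torch
import torch.nn as nn

class DGQP(nn.Module):
    """Map the shape of the learned box-side distributions to a scalar
    localization-quality score: sharp, peaked distributions score high."""
    def __init__(self, topk=4, hidden=64):
        super().__init__()
        self.topk = topk
        self.fc = nn.Sequential(
            nn.Linear(4 * (topk + 1), hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, box_logits):           # (N, 4, bins): 4 box sides
        probs = box_logits.softmax(dim=-1)
        top = probs.topk(self.topk, dim=-1).values          # (N, 4, topk)
        # Top-k values plus their mean summarize each side's distribution.
        stats = torch.cat([top, top.mean(dim=-1, keepdim=True)], dim=-1)
        return self.fc(stats.flatten(1))                    # (N, 1) quality
```

The predicted quality score would then be combined with the classification score to produce the ranking score that benefits NMS, as the abstract describes.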
11. UP-DETR: Unsupervised Pre-training for Object Detection with
Transformers
Abstract:Object detection with transformers (DETR) reaches competitive
performance with Faster R-CNN via a transformer encoder-decoder architecture.
Inspired by the great success of pre-training transformers in natural language
processing, we propose a pretext task named random query patch detection to
Unsupervisedly Pre-train DETR (UP-DETR) for object detection. Specifically, we
randomly crop patches from the given image and then feed them as queries to
the decoder. The model is pre-trained to detect these query patches from the
original image. During the pre-training, we address two critical issues: multi-task
learning and multi-query localization. (1) To trade off classification and localization
preferences in the pretext task, we freeze the CNN backbone and propose a
patch feature reconstruction branch which is jointly optimized with patch
detection. (2) To perform multi-query localization, we introduce UP-DETR from
single-query patch and extend it to multi-query patches with object query shuffle
and attention mask. In our experiments, UP-DETR significantly boosts the
performance of DETR with faster convergence and higher average precision on
object detection, one-shot detection and panoptic segmentation. Code and
pre-trained models: https://github.com/dddzg/up-detr.
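
The pretext-task data preparation is straightforward to sketch: crop random patches and record their normalized boxes as targets. The snippet below is illustrative only; the crop-size ranges and function name are assumptions, and the real pipeline further encodes each patch with the frozen CNN backbone and applies query shuffle and attention masks for the multi-query case.

```python
import numpy as np

def sample_query_patches(image, num_patches=10, rng=None):
    """Build a random-query-patch-detection pretext sample.

    image: (H, W, 3) array. Random crops become decoder queries; their
    normalized boxes become regression targets, so the model learns to
    localize "this patch" in "this image" without any labels.
    """
    rng = rng or np.random.default_rng()
    H, W = image.shape[:2]
    patches, boxes = [], []
    for _ in range(num_patches):
        h = int(rng.integers(H // 8, H // 2))
        w = int(rng.integers(W // 8, W // 2))
        y = int(rng.integers(0, H - h))
        x = int(rng.integers(0, W - w))
        patches.append(image[y:y + h, x:x + w])
        # (cx, cy, w, h), normalized: DETR's box parameterization.
        boxes.append(((x + w / 2) / W, (y + h / 2) / H, w / W, h / H))
    return patches, np.array(boxes)
```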
12. MobileDets: Searching for Object Detection Architectures for
Mobile Accelerators
Abstract:Inverted bottleneck layers, which are built upon depth-wise
convolutions, have been the predominant building blocks in state-of-the-art object
detection models on mobile devices. In this work, we investigate the optimality of
this design pattern over a broad range of mobile accelerators by revisiting the
usefulness of regular convolutions. We discover that regular convolutions are a
potent component to boost the latency-accuracy trade-off for object detection on
accelerators, provided that they are placed strategically in the network via neural
architecture search. By incorporating regular convolutions in the search space
and directly optimizing the network architectures for object detection, we obtain a
family of object detection models, MobileDets, that achieve state-of-the-art results
across mobile accelerators. On the COCO object detection task, MobileDets
outperform MobileNetV3+SSDLite by 1.7 mAP at comparable mobile CPU
inference latencies. MobileDets also outperform MobileNetV2+SSDLite by 1.9
mAP on mobile CPUs, 3.7 mAP on Google EdgeTPU, 3.4 mAP on Qualcomm
Hexagon DSP and 2.7 mAP on Nvidia Jetson GPU without increasing latency.
Moreover, MobileDets are comparable with the state-of-the-art MnasFPN on
mobile CPUs even without using the feature pyramid, and achieve better mAP
scores on both EdgeTPUs and DSPs with up to 2× speedup. Code and models
are available in the TensorFlow Object Detection API [16]:
https://github.com/tensorflow/models/tree/master/research/object_detection.
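
One concrete example of "regular convolutions placed strategically" is a fused variant of the inverted bottleneck, which replaces the 1×1 expansion plus depthwise convolution with a single regular 3×3 convolution. The PyTorch block below is a hedged sketch of that kind of search-space candidate, not an architecture actually found by the search.

```python
import torch.nn as nn

def fused_inverted_bottleneck(cin, cout, expansion=4, stride=1):
    """A regular-convolution block candidate: expand with a full 3x3 conv
    (instead of 1x1 expand + 3x3 depthwise), then project with a 1x1 conv.
    Costs more FLOPs but maps well onto EdgeTPU/DSP-class accelerators."""
    mid = cin * expansion
    return nn.Sequential(
        nn.Conv2d(cin, mid, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(mid),
        nn.ReLU6(inplace=True),
        nn.Conv2d(mid, cout, 1, bias=False),
        nn.BatchNorm2d(cout),
    )
```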
13. Tracking Pedestrian Heads in Dense Crowd
Abstract:Tracking humans in crowded video sequences is an important
constituent of visual scene understanding. Increasing crowd density challenges
visibility of humans, limiting the scalability of existing pedestrian trackers to higher
crowd densities. For that reason, we propose to revitalize head tracking with
Crowd of Heads Dataset (CroHD), consisting of 9 sequences of 11,463 frames
with over 2,276,838 heads and 5,230 tracks annotated in diverse scenes. For
evaluation, we propose a new metric, IDEucl, to measure an algorithm's efficacy
in preserving a unique identity for the longest stretch in image coordinate space,
thus building a correspondence between pedestrian crowd motion and the
performance of a tracking algorithm. Moreover, we also propose a new head
detector, HeadHunter, which is designed for small head detection in crowded
scenes. We extend HeadHunter with a Particle Filter and a color-histogram-based
re-identification module for head tracking. To establish this as a strong baseline,
we compare our tracker with existing state-of-the-art pedestrian trackers on CroHD.
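
Based only on the description above, one possible formalization of IDEucl is the fraction of a pedestrian's traveled image-space distance that is covered under its most persistent predicted identity; the paper's exact definition may differ. The sketch below encodes that reading, with hypothetical input conventions.

```python
import numpy as np

def ideucl_per_track(gt_positions, pred_ids):
    """gt_positions: (T, 2) head positions of one ground-truth track;
    pred_ids: length-T predicted identity per frame (-1 = missed).
    Returns the fraction of traveled Euclidean distance covered by the
    single best-matching predicted identity."""
    steps = np.linalg.norm(np.diff(gt_positions, axis=0), axis=1)  # (T-1,)
    total = steps.sum()
    if total == 0:
        return 1.0
    best = 0.0
    for tid in set(pred_ids) - {-1}:
        # Distance traveled while this identity was continuously assigned.
        mask = np.array([a == b == tid for a, b in zip(pred_ids, pred_ids[1:])])
        best = max(best, steps[mask].sum())
    return best / total
```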