TOD-NetAnend-to-endtransformer-basedobjectdetectionnetwork

43 浏览量 2023-11-25 15:40:36 上传评论收藏 6.45MB PDF 举报

资源推荐

资源详情

资源评论

Computers and Electrical Engineering 108 (2023) 108695

Available online 8 April 2023

TOD-Net: An end-to-end transformer-based object

detection network

Museboyina Sirisha

, S.V. Sudha

School of Computer Science and Engineering, VIT-AP University, Amaravathi, Andhra Pradesh 522237, India

Professor and Dean, School of Computer Science and Engineering, VIT-AP University, Amaravathi, Andhra Pradesh 522237, India

ARTICLE INFO

Editor E. Cabal-Yepez

Keywords:

Object detection

Network learning

Feature representation

Transformer

Predictor module

Feature analysis

Local features

Scaling

Semantic analysis

Network layer

ABSTRACT

Various object detection approaches using a learning model intends to learn the semantic and

multi-scaling information to attain superior object saliency. This research employs a transformer-

based network framework for object detection (TOD − Net) for object detection. It is composed of

encoders, decoders, and transformer and predictor module. The predictor model bridges the

connectivity between the encoder and the transformer module and offers better insight into the

transformer module’s local measures. Here, feature extraction is performed to measure the local

features and establishes dense modeling by analyzing local features. The model gives broader

knowledge of local and global features. Python programming was used to experiment with the MS

COCO dataset (Microsoft Common Objects in Context) where the experimentation gives better

results over existing models. In contrast to existing methods, the proposed method achieves

68.7% precision and 4% accuracy. The proposed model outperforms different prevailing ap-

proaches and establishes a better trade-off.

1. Introduction

Object detection remains one of the biggest challenges in computer vision. Object detection methods are broadly utilized in

different regions, like detecting faces, tracking objects [1], detecting pedestrians, autonomous driving, and medical imaging based on

deep learning.[1–3] The rapid development in the latest days is achieved using CNNs(Convolutional Neural networks), which started

from Alexnet [4], making considerable gains in detecting objects. A pre-trained backbone network requires a large volume of data like

ImageNet and a signicant processing time, broadly utilized to extract features [5] effectively. MobileNetV2 [6] utilizes the linear

bottleneck and the inverted residuals, making it suitable to be a lightweight backbone network in many object detection models. The

process of depth-wise separable convolution was applied to the inception in Xception [7], an approach that separates channels from

spaces in a network. In addition, the point-wise convolution and the channels are used by ShufeNet [8]. ResNet [9] improved

performance by using residual learning to construct a deeper neural network, while ResNeXt [10] introduced grouped convolutions

based on ResNet to enhance accuracy.ResNeXt achieves relative performance by maximizing the cardinality with the help of grouped

convolutions based on the ResNet. The method of image pyramid [11] uses the multi-scale map for the features to identify the

multi-scale objects from the higher layers to the lower layers at the increased cost of the memory space and computational load. The

study helps to integrate the efcient mapping of features for the higher and lower layers, which is performed actively to solve this

problem.

* Corresponding author.

E-mail address: musesirisha@gmail.com (M. Sirisha).

Contents lists available at ScienceDirect

Computers and Electrical Engineering

journal homepage: www.elsevier.com/locate/compeleceng

https://doi.org/10.1016/j.compeleceng.2023.108695

Received 25 November 2022; Received in revised form 20 March 2023; Accepted 22 March 2023

Computers and Electrical Engineering 108 (2023) 108695

Feature Pyramid Network (FPN) represents one of the most common models for generating pyramidal feature representations of

detecting objects [12]. The backbone for extracting the higher-level features like the object parts or the textures is done with the help of

deep layers. However, lower-level features like the curves and the edges are extracted as the backbone in the shallow layers. FPN uses

the higher and lower-level features for hierarchically stacking the feature maps via the top-down way, with lateral connection for

detecting objects of different sizes. The multi-scale objects are detected to have the performance through the enhanced method, and the

problems related to memory and the cost of computation are resolved. A stronger structure of FPN is suggested [13] by adding the

bottom-up way. In addition, a new structure of FPN is suggested [14] via the search architecture of neural model, and the method of

fusion is suggested by Bi-FPN [15] for the multi-scale feature map using the implementation of the weighted bi-directional pyramid

network of the features.

The interpolation or deconvolution [16] is used with the structure based on the previous FPN for fusing the higher-level and

lower-level feature maps. Interpolation is performed mainly by using nearest neighbor interpolation or bilinear interpolation. The

calculation of the nearest neighbor interpolation is performed in a quick way using the value lled with the help of pixel values at the

nearest location. Still, the edge becomes ambiguous due to the presence of aliasing. The neighboring four-pixel values are multiplied

with the weights to obtain the values using the bilinear interpolation during the resize of the original image size using N times,

resulting in a blur on the image. The feature map size is expanded using deconvolution through inverse calculation of the convolution.

The artifact of the checkerboard arises because of the overlapped phenomenon based on the size of the kernel and the size of the stride.

PPM (Pyramid pooling module) is suggested for resolving the above issue, which uses various pooling sizes [17]. In addition, the

ASPP (Atrous spatial pyramid pooling) is suggested using the superimposition of the dilated convolution having different rates on the

atrous layer of pooling [16]. PPM, however, has the drawback of losing information about pixel location when applying pooling to a

range of sizes. In particular, the broad receptive eld is used at the loss of cost with the information of local context since the ASPP

utilizes dilated convolution. The system uses upsampling to reduce the loss of semantic information related to the images of input,

having high-level maps of features in the structure of FPN and the attraction to the global and local context information. Therefore,

modeling multi-scale information is an essential requirement in object detection. The major research contributions are summarized

below:

• To propose a novel TOD − Net model with a transformer for object detection, which employs an attention module to encode the

information and enhance feature learning.

• The feature extractor and the attention modules help to obtain the multi-scale features, and interpolation helps in fusing the

features and enhancing the prediction accuracy.

• The classier is used for feature mapping related to multi-scale. For performing the experimentation and to get better prediction

outcomes, Python programming was used to experiment with the MS COCO dataset, and the average precision metric value was

compared with other models to establish a trade-off.

1.1. Paper organization

The article is drafted as follows: Section 2 gives a wider analysis of various prevailing approaches and provides the merits and

demerits. The proposed TOD − Net is elaborated in Section 3, where the object detection process is elaborated. The outcomes are

represented in Section 4, with a summary in Section 5.

2. Related works

An object is usually detected using two types of techniques: one-stage detection and two-stage detection. As a result of applying

regional proposals sequentially, the two-stage detection methods proved to be highly accurate. Regional Convolutional Neural Net-

works (R-CNN) [18] use CNNs to propose candidate regions and extract features from each region. It is the rst method to use both

CNNs and features extracted from each region in two stages. Multiple enhanced variants for the R-CNN have been proposed, such as

faster-RCNN [19], cascade R-CNN, libra R-CNN [20], and mask R-CNN [21]. On the other hand, one-stage detection methods are faster

than two-stage methods but have lower accuracy. Solid State Devices (SSD) [22], You Only Look Once (YOLO) [23], EfcientDet, and

RetinaNet [24] are some examples of methods of one-stage detection. Some advanced methods for object detection, such as CenterNet

[25], FCOS [26], center-based ATSS, and CornerNet [27] that are anchor-free are proposed in recent years. Recently, many object

detection methods have eliminated NMS or the generation of anchors using transformers, which has greatly improved detection ac-

curacy [28].

Furthermore, detecting multi-scale objects is an essential issue to be addressed in the detection task. Image pyramids were widely

used in object detection to improve multiscale object detection accuracy. There is, however, a redundant computational cost associated

with extracting features at each level of the image pyramid. FPN is used to compensate for the semantic information loss during

forwarding with the help of a top-down way and lateral connection for lowering the computation load and enhancing the detection

accuracy. In recent years, more robust pyramid structures have been proposed based on FPN. The PANet adds the augmentation of the

bottom-up path for delivering the lower level of information to the higher level again to enable localization in a precise way. A new

FPN structure called NAS-FPN (Neural architecture search - FPN) was designed that uses natural architecture research (NAS). The

network’s performance improved but required more memory. In addition, the weighted BiFPN, a bi-directional FPN, is suggested,

which resolves the issue of cost computation, and multi-scale feature maps are fused more efciently and quickly.

M. Sirisha and S.V. Sudha

Computers and Electrical Engineering 108 (2023) 108695

Natural language processing has actively studied attention mechanisms as essential elements for neural networks. In recent years,

attention mechanisms in computer vision have been applied to various elds to enhance performance by highlighting regions of in-

terest in images and detecting long-range dependencies. Segmentation, generation of images, and classication of images are some of

the applications of computer vision using them. Previous studies have proven that attention mechanisms can improve object detection

performance by locating and recognizing objects in images. In addition, a MAD unit is suggested in [29] to identify the activation of the

neuron in the lower and higher streams via the aggressive search. A new structure couplenet is suggested, which combines the local and

global information related to objects to improve detection performance. Recent improvements have been made in the accuracy of

object detection methods using transformers.

The attention-based NLP models’ success has inspired a recent initiative to integrate transformers into CNNs. The ViT Vision

Transformer is a pure transformer used to classify the images. The 2D image is reshaped into attened patches to handle it. The

transformer attens the patches and maps them to dimensions using a trainable linear projection using constant latent vector sizes

across all its layers. The results obtained when trained on large datasets prove that the transformers efciently extract features of an

image. Video Vision Transformer (ViViT) is proposed for the classication of videos. A sequence of spatiotemporal tokens extracted

from the video input is used in this architecture as the primary method of computation for self-attention. Several methods are applied

to factorize the model along spatial and temporal dimensions to increase efciency and scalability. The author suggested the Tran-

sUNet, which integrates the transformer’s encoder and the U-Net. Transformer encodes the tokenized image patches based on the CNN

feature map to detect global context initially and extract global context later. Decoding ensures that the encoded features are

upsampled, and the CNN feature maps are combined for precise localization. TransUNet leverages detailed, high-resolution geospatial

information and the global context by combining CNN features and transformers. In various medical applications, TransUNet out-

performs other competing methods [29].

Many methods based on U-Net related to detecting video anomaly utilize the successive stacked frames as input and the constraints

of motion like loss of optical ow to extract the temporal information. The structure is limited, and inadequate temporal information is

used to predict and reconstruct. Better performance is obtained by ViViT than other models for video classication using different

variants of the transformers [29]. Using the transformers, the model performs better, encoding both spatial and temporal features in

the videos. Furthermore, the U-Net is widely used for detecting video anomalies. In addition, the potential integration of the U-Net and

the transformer is mentioned by the TransUNet. The model inspires us to modify the encoder of the transformer in the proposed system

to make reliable predictions in the video. The model encodes spatial data, while the suggested encoder of the transformer encodes the

temporal data. The model performs better than the baseline model based on prediction without the transformer module. However, the

performance is comparatively lesser and triggers the need for a novel approach. This research intends to solve the complexity of the

existing approaches by proposing a transformer-based network framework for object detection (TOD − Net).

3. Methodology

The backbone of the proposed network uses TOD − Net as the feature extractor that is modied by taking the existing ResNet-18 and

removing the primary fully connected layer. Hence, the feature extractor has a convolution layer of 7 × 7 and four residual blocks

Fig 1. Overview of TOD − Net.

M. Sirisha and S.V. Sudha

剩余11页未读，继续阅读

评论收藏

内容反馈

DrYJ

粉丝: 40
资源: 24

TOD-Net An end-to-end transformer-based object detection network

最新资源

TOD-Net An end-to-end transformer-based object detection network

Image object detection and semantic segmentation based on convolutional neural network

ToD-BERT:ToD-BERT的预训练模型

DSHerzberg-TOD-R

【3月】未来社区建设之TOD－27页.pdf

未来社区－TOD城市低碳发展实践浙－105页.pdf

Python库 | tod0-0.2.0-py3-none-any.whl

Miami-TOD-PMT:迈阿密戴德TPO的TOD性能监控工具包的脚本和文档

TOD-godot：从Gradle到坟墓

awesome-tiny-object-detection::sunglasses:精选的微小物体检测论文和相关资源清单

Side Chain Engineering of Naphthalenediimide-Bithiophene-Based Polymer Acceptors in All-Polymer Solar Cells

398-苏州市以公交为导向的发展模式（TOD）研究-word资料.pdf

024-2-北京市实施TOD的政策机制研究-word资料.pdf

【3月－浙江省院】 杭州之江未来社区建设试点方案（小街区密路网型立体TOD）－344页.pdf

朱晓兵-轨道交通四网融合背景下的TOD推进路径思考-20191110-r.pdf

187-轨道交通导向下的多维TOD发展策略思考-以合肥为例-word资料.pdf

特色小镇-南京市高淳区TOD小镇城市设计-澳撒中国-154页.pdf

[四川]TOD社区-现代风超大型未来住宅方案

118-96-面向TOD的山地中小城市公交规划探讨-word资料.pdf

以公共交通为导向的开发(TOD)标准2.1-交通与发展政策研究所(ITDP)-外文-报告资料.pdf

YOLOv8-deepsort 实现智能车辆目标检测+车辆跟踪+车辆计数

YOLOv8网络结构图，自制visio文件，yolov8.vsds，需要的自取，在原有的基础上直接改就行了

yolov8(2023年8月版本),已经下好yolov8s.pt和yolov8n.pt

Transformer模型实现长期预测并可视化结果（附代码+数据集+原理介绍）

社交平台上经济类话题的文章热度信息，数据是真实的，但不是真实日期

行人跌倒数据集（VOC格式）

Unet眼底血管图像分割数据集+代码+模型+系统界面+教学视频.zip

全新的SOTA模型YOLOv9

YOLOV5 + 双目相机实现三维测距（新版本）

YOLOV5口罩检测数据集+代码+模型 2000张标注好的数据+教学视频.zip

最新资源

【3月－浙江省院】杭州之江未来社区建设试点方案（小街区密路网型立体TOD）－344页.pdf