rectangle whose size can be chosen according to the size of the objects in each image. Thus, the network should not learn features that rely on the whole object of interest. The same idea underlies other methods, such as GridMask [17] and Hide-and-Seek (HaS) [18].
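As an illustration, a minimal sketch of this rectangle-erasing idea (the function name, the NumPy usage, and the fixed fill value are illustrative choices, not taken from the cited works) could look as follows:

```python
import numpy as np

def erase_random_rectangle(image, box_h, box_w, fill=0):
    """Zero out a randomly placed box_h x box_w rectangle in an HxWxC image."""
    h, w = image.shape[:2]
    top = np.random.randint(0, max(h - box_h, 1))    # random vertical position
    left = np.random.randint(0, max(w - box_w, 1))   # random horizontal position
    out = image.copy()
    out[top:top + box_h, left:left + box_w] = fill   # mask the rectangle
    return out
```

Here, box_h and box_w would be chosen according to the object sizes in the dataset, as described above.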
Augmenting across a batch of samples can also be beneficial, as it extends the vicinity of the dataset by mixing several images to produce a new one [19]. Several image classification augmentations operate on a batch of images, such as mixup [19], CutMix [20], and Puzzle Mix [21]. Attempts have been made to extend these techniques beyond image classification to object detection [19]; a particularly compelling example is the mosaic augmentation implemented in [22].
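For concreteness, a minimal sketch of the mixup idea for a batch of images with one-hot labels (the use of NumPy arrays and the mixing parameter alpha are assumptions for illustration) is:

```python
import numpy as np

def mixup_batch(images, one_hot_labels, alpha=0.2):
    """Mix every sample with a randomly chosen partner from the same batch."""
    lam = np.random.beta(alpha, alpha)           # mixing coefficient
    perm = np.random.permutation(len(images))    # partner indices
    mixed_x = lam * images + (1 - lam) * images[perm]
    mixed_y = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y
```

CutMix and mosaic augmentation follow the same batch-mixing spirit but paste image regions instead of interpolating pixel values.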
Studies suggest that the severity and the number of augmentation techniques used during training affect model accuracy [23]–[25]. By training a reinforcement learning agent on a small dataset, AutoAugment attempts to find a policy for combining augmentation transformations. The high computational cost of AutoAugment encouraged the authors of [24] to develop RandAugment, which parameterizes the data augmentation process with only two parameters: the number of operations (N) and the severity (M). Combining ideas from RandAugment [24] and mixup [19], AugMix augments an image separately along several augmentation chains. The chain outputs are combined by a weighted sum whose coefficients are drawn from a Dirichlet distribution. Finally, a coefficient drawn from a Beta distribution weights the sum of the original and the augmented image.
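A simplified sketch of this mixing step (the chains argument stands for user-supplied augmentation chains; parameter names and defaults are illustrative) is:

```python
import numpy as np

def augmix_combine(image, chains, alpha=1.0):
    """Blend augmentation chains with Dirichlet weights, then mix with the original image."""
    w = np.random.dirichlet([alpha] * len(chains))      # one weight per chain
    mixed = np.zeros_like(image, dtype=np.float32)
    for weight, chain in zip(w, chains):
        mixed += weight * chain(image).astype(np.float32)
    m = np.random.beta(alpha, alpha)                    # original-vs-augmented weight
    return m * image.astype(np.float32) + (1 - m) * mixed
```

Each element of chains is a callable that applies a random sequence of RandAugment-style operations to the image.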
B. Regularization
As with augmentation, overfitting can be reduced by regularization techniques such as dropout. Although dropout works well with fully connected layers, the authors of [26] developed DropBlock for convolutional layers. Rather than dropping features at independent random locations, DropBlock drops a contiguous region. According to their study, gradually decreasing the probability of keeping blocks during training is more effective than using a fixed probability.
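A simplified PyTorch-style sketch of the block-dropping idea (this omits DropBlock's exact mask statistics and valid-seed-region handling) is:

```python
import torch
import torch.nn.functional as F

def drop_block(x, keep_prob=0.9, block_size=7, training=True):
    """Simplified DropBlock: zero out contiguous block_size x block_size regions of an NCHW tensor."""
    if not training or keep_prob >= 1.0:
        return x
    gamma = (1.0 - keep_prob) / (block_size ** 2)        # rate of block centres (simplified)
    centres = (torch.rand_like(x) < gamma).float()
    # Expand every sampled centre into a square block with a max-pool.
    block_mask = F.max_pool2d(centres, block_size, stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask
    # Rescale so the expected activation magnitude is preserved.
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```

The gradual schedule suggested in [26] can be realized by passing a keep_prob value that decreases over the course of training.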
Some methods apply only to a specific structure, like shake-shake regularization [27], which targets multi-branch networks. The approach was developed for a three-branch block: during the training forward pass, two branches are scaled by random numbers and summed with the third branch, while different random numbers drawn from a Beta distribution are used as multipliers during backpropagation. At test time, the two branches are each multiplied by 0.5.
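A PyTorch-style sketch of the scaling logic for such a three-branch block (uniform coefficients in [0, 1] are used here for simplicity, and the branch outputs are assumed to be given) is:

```python
import torch

def shake_shake_combine(identity, branch1, branch2, training=True):
    """Combine two residual branches with random weights during training and 0.5 each at test time."""
    if not training:
        return identity + 0.5 * branch1 + 0.5 * branch2
    alpha = torch.rand(1, device=branch1.device)   # forward-pass coefficient
    beta = torch.rand(1, device=branch1.device)    # backward-pass coefficient
    mix_fwd = alpha * branch1 + (1 - alpha) * branch2
    mix_bwd = beta * branch1 + (1 - beta) * branch2
    # Forward value uses alpha; gradients flow as if beta had been used.
    return identity + mix_bwd + (mix_fwd - mix_bwd).detach()
```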
C. Attention Mechanisms
The application of attention mechanisms in artificial neural networks has been associated with NLP tasks [28]. In machine translation, the network must concentrate on certain parts of the input sequence from the source language to predict a word in the target language. The attention mechanism proposed in [29] helps the network focus on the relevant parts of the source sequence. This work encouraged other researchers to investigate the applicability of attention to different tasks [30]–[33]. Previously, the common choices for NLP tasks such as machine translation were recurrent and convolutional neural networks. The authors of [34] proposed a different architecture called the Transformer. Unlike earlier works, which combined attention with either recurrent or convolutional neural networks, the Transformer is based solely on scaled dot-product multi-head self-attention (SDMHSA). They claimed that attention could solve the machine translation task on its own, a concept further investigated in studies such as BERT [35].
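At the core of SDMHSA is the scaled dot-product attention operation; a single-head sketch in PyTorch (tensor shapes and names are illustrative) is:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for tensors shaped (..., seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)   # similarity of every query to every key
    weights = F.softmax(scores, dim=-1)               # attention weights per query
    return weights @ v
```

The multi-head variant runs several such heads on learned linear projections of the inputs and concatenates their outputs.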
The promising results of attention in NLP have motivated computer vision researchers to improve their results by adding attention to their networks [36]–[39]. In [40], the convolutional block attention module (CBAM) was introduced for convolutional neural networks. This module includes two sub-modules: a spatial attention module (SAM) and a channel attention module (CAM). The authors embedded CBAM in several state-of-the-art architectures such as ResNet50 [41], ResNeXt50 [42], and MobileNet [43]. By taking advantage of attention, they achieved higher accuracy in image classification on ImageNet [44] and in object detection on Pascal VOC [45] and Microsoft COCO [46]. The vision transformer (ViT) has bridged the gap between image classification and the transformer architecture by treating an image as a sequence of patches. This network has achieved state-of-the-art accuracy on ImageNet classification. Similar to [35] and [34], ViT uses SDMHSA as the main component throughout the network [12].
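As an illustration of the two CBAM sub-modules described above, a simplified PyTorch sketch (the reduction ratio and kernel size follow common defaults and are assumptions here) is:

```python
import torch
import torch.nn as nn

class SimplifiedCBAM(nn.Module):
    """Channel attention (CAM) followed by spatial attention (SAM)."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP for channel attention
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        n, c, _, _ = x.shape
        # CAM: average- and max-pooled channel descriptors share one MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        # SAM: pool across channels, then apply a convolution over the spatial map.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(pooled))
```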
III. NETWORK STRUCTURE
The YOLOv4 architecture can be divided into three sub-networks: the backbone, the neck, and the head. The backbone of YOLOv4 is called CSP-Darknet-53. CSP-Darknet-53 extracts features from the input image and generates outputs at three levels. The first-level output has the highest spatial resolution and is suitable for detecting small objects. The second-level output has a lower spatial resolution than the first, making it appropriate for finding medium-sized objects in the image; its feature map is deeper than the first-level feature map. The third and last level output has the deepest feature map with the lowest spatial resolution. The YOLOv4 neck takes these feature maps and up-samples the lowest-resolution feature map with bilinear interpolation to match the spatial resolution of the second-level feature map. This up-sampled feature map is then concatenated with the second-level feature map to enrich the mid-resolution features for detecting medium-sized objects. The resulting feature map is up-sampled again and concatenated with the highest-resolution feature map. The YOLOv4 head receives the feature maps from the neck and detects objects at three scales.
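A sketch of this top-down neck flow (omitting the SPP and the convolutional blocks of the actual YOLOv4 neck; the names and the use of PyTorch are illustrative) is:

```python
import torch
import torch.nn.functional as F

def top_down_neck(p_high, p_mid, p_low):
    """p_high/p_mid/p_low: backbone outputs of decreasing spatial resolution."""
    # Up-sample the deepest map and enrich the mid-resolution features.
    low_up = F.interpolate(p_low, size=p_mid.shape[-2:], mode="bilinear", align_corners=False)
    mid_cat = torch.cat([p_mid, low_up], dim=1)
    # Repeat one level up for the highest-resolution features.
    mid_up = F.interpolate(mid_cat, size=p_high.shape[-2:], mode="bilinear", align_corners=False)
    high_cat = torch.cat([p_high, mid_up], dim=1)
    return high_cat, mid_cat, p_low    # passed to the three detection scales
```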
Residual blocks evidently play a vital role inside the YOLOv4 backbone, as CSP-Darknet-53 contains 23 of them. Motivated by the ViT transformer block, a transformer attention block is implemented to replace the residual blocks in CSP-Darknet-53. Replacement of the