feature extractors and classifiers to detect faces from coarse to fine. Despite their great success,
it is important to note that cascade detectors suffer from drawbacks such as difficult training
and slow detection speed. The other branch is derived from general-purpose
object detection algorithms [4, 5, 6]. General-purpose object detectors capture the more
common features and broader characteristics of objects. Task-specific detectors can therefore
share this information and then enforce face-specific properties through special designs. Some
popular face detectors including YOLO [7, 8, 9, 10], Faster R-CNN [5] and RetinaNet [6] fall
into this category. In this paper, inspired by YOLOv5 [11], TridentNet [12] and Attention
Network in FAN [13], we propose a novel face detector that achieves state-of-the-art performance
in one-stage face detection.
Although deep convolutional networks have improved face detection remarkably, detecting
faces with high variance in scale, pose, occlusion, expression, appearance, and illumination
in realistic scenes remains a great challenge. In our previous work, we proposed YOLO-
Face [14], an improved face detector based on YOLOv3 [9], which mainly focused on the
problem of scale variance, designed anchor ratios suitable for human faces, and utilized a more
accurate regression loss function. Its mAP on the Easy, Medium, and Hard subsets of the WiderFace [15]
validation set reached 0.899, 0.872, and 0.693, respectively. Since then, a variety of new detectors
have been presented and face detection performance has been significantly improved.
However, for small objects, one-stage detectors have to divide the search space with a finer
granularity, which is apt to cause an imbalance between positive and negative samples
[16]. Furthermore, face occlusion [13] in complex scenes remarkably degrades the accuracy of face
detectors. To address the problems of varying face scales, imbalance between easy and hard
samples, and face occlusion, we propose a YOLOv5-based face detection method
called YOLO-FaceV2.
By carefully analyzing the difficulties encountered by face detectors and the shortcomings
of the YOLOv5 detector, we carry out the following solutions.
Multi-scale fusion: In many scenarios, faces of different scales coexist in the same image,
and it is difficult for a face detector to detect them all. Handling faces at varying scales is
therefore a very important task for face detection algorithms. Currently, the main
approach to the scale-variance problem is to construct a pyramid that fuses the multi-scale
features of faces [17, 18, 19, 20]. For example, in YOLOv5, the FPN [20] fuses the features of the P3,
P4 and P5 layers. However, for small-scale objects, information is easily lost after
multi-layer convolutions, and very little pixel information is retained, even in the shallower
P3 layer. Therefore, increasing the resolution of the feature map can undoubtedly benefit the
detection of small objects.
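The top-down fusion performed by an FPN can be sketched in a few lines; the following is a minimal numpy illustration (lateral 1x1 and smoothing convolutions omitted, shapes are hypothetical), not YOLOv5's actual implementation:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(p3, p4, p5):
    # Top-down pathway: each coarser map is upsampled and added
    # element-wise to the next finer map.
    f5 = p5
    f4 = p4 + upsample2x(f5)
    f3 = p3 + upsample2x(f4)
    return f3, f4, f5

# Toy feature maps at strides 8, 16 and 32 of a 640x640 input.
p3 = np.zeros((8, 80, 80))
p4 = np.zeros((8, 40, 40))
p5 = np.ones((8, 20, 20))
f3, f4, f5 = fpn_fuse(p3, p4, p5)
```

Note how semantic information from the coarse P5 map propagates down into the high-resolution F3 map, which is why the finest level retains the most pixel information for small faces.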
Attention mechanism: In many complex scenes, face occlusion frequently occurs, and it is
one of the main reasons for the accuracy decline of face detectors. To address this problem,
some researchers apply attention mechanisms to facial feature extraction. FAN [13] proposes
an anchor-level attention: the idea is to maintain the response of the unoccluded
region and to compensate, through the attention mechanism, for the reduced response of the
occluded region. However, it does not fully utilize the information between channels.
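To illustrate what channel-wise information can add, here is a minimal squeeze-and-excitation style sketch in numpy; the weights are hypothetical, and this is an illustration of channel attention in general, not FAN's anchor-level attention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    # Squeeze-and-excitation style channel attention on a (C, H, W) map:
    # global-average-pool each channel, pass the result through a small
    # bottleneck MLP, and rescale every channel by a weight in (0, 1).
    s = x.mean(axis=(1, 2))                    # squeeze: (C,)
    e = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))  # excite:  (C,)
    return x * e[:, None, None]

# Toy example: 4 channels, bottleneck of size 2, fixed (hypothetical) weights.
x = np.ones((4, 2, 2))
w1 = np.full((2, 4), 0.5)   # reduction weights
w2 = np.full((4, 2), 0.5)   # expansion weights
out = channel_attention(x, w1, w2)
```

Unlike a purely spatial attention map, the excitation vector here reweights whole channels, so feature maps that respond to occluders can be suppressed relative to those that respond to visible facial parts.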
Hard Samples: In one-stage detectors, many bounding boxes are not filtered out
iteratively, so the number of easy samples is very large. During training,
their cumulative contribution dominates the update of the model, leading to
overfitting [16]. This is known as the sample imbalance problem. To deal with this
problem, Lin et al. proposed Focal Loss, which dynamically assigns more weight to hard
examples [6]. Similarly, the Gradient Harmonizing Mechanism (GHM) [21] suppresses
the gradients from easy positive and negative samples to focus more on hard samples.
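The down-weighting behaviour of Focal Loss can be seen in a small numpy sketch of its simplified binary form (this is an illustration, not the detector's full loss):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Binary focal loss: the (1 - pt)^gamma factor down-weights
    # well-classified (easy) samples so hard ones dominate the gradient.
    pt = np.where(y == 1, p, 1.0 - p)        # probability of the true class
    a = np.where(y == 1, alpha, 1.0 - alpha)
    return -a * (1.0 - pt) ** gamma * np.log(pt)

easy = float(focal_loss(np.array(0.9), 1))  # confidently correct positive
hard = float(focal_loss(np.array(0.1), 1))  # badly misclassified positive
```

For the easy sample, pt = 0.9 and the modulating factor (1 - 0.9)^2 = 0.01 nearly zeroes out its loss, while the hard sample with pt = 0.1 keeps a modulating factor of 0.81, so its loss is orders of magnitude larger.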
Prime Sample Attention (PISA) [22], proposed by Cao et al., assigns weights to positive and