ISPRS Int. J. Geo-Inf. 2020, 9, 571 2 of 20
dependent on longer-term learning results. Obtaining such a large number of remote sensing image
segmentation labels is very difficult, so it is of limited practical utility for the segmentation of remote
sensing images.
Another effective way of tackling the issues described above is to use self-attention mechanisms.
These are popular and simple to adapt to semantic segmentation tasks because of their varied and
flexible structure [17–22]. Self-attention mechanisms focus on local features by generating weight
feature maps and fusing downstream feature maps. This may involve having one or more modules
built upon a basic backbone, with each module focusing on things such as the channel or spatial
information. However, downstream feature maps can lose much of the spatial information, and existing
mechanisms cannot capture the original spatial information directly. Yet precise spatial
information is crucial for the effective segmentation of remote sensing images.
To address the above issues, we propose here a novel self-attention mechanism model, called a
Dual Path Attention Network (DPA-Net), which is designed for remote sensing semantic segmentation.
It uses two attention modules: a total spatial attention module (TSAM) to capture spatial information and a
channel attention module to capture the channel information separately. The two modules can easily be
appended to other segmentation models such as PSP-Net [23]. At present, there are many methods for
the efficient extraction of different kinds of feature information. However, the input of almost all spatial
attention methods is the feature map after downsampling. As mentioned above, compared with the original
image, the downsampled feature map contains far less spatial information. This kind of
spatial attention is therefore inevitably inefficient, as it cannot fully utilize the spatial information in the
data. Instead of the downsampled feature map, we use the original image as the input of the spatial attention
method. In the total spatial attention module, spatial information is captured
from the original image according to the self-attention mechanism mentioned above. The output
of the TSAM is a single-channel weight matrix. When this matrix is fused with the final feature map of
DPA-Net, each pixel is updated according to its corresponding weight, so the TSAM provides a weight for each pixel.
During training, the network pays more attention to the areas with larger weights. This means
that each pixel has its own focus in the network. For the channel attention module, the self-attention
mechanism captures the channel information according to the channel maps. As with the total spatial
attention module, it generates a weight factor. The feature maps are updated by integrating this weight
factor. Once the two modules have completed their operations, two feature maps are obtained that
contain spatial information and channel information, respectively. Then, these two feature maps are
aggregated to generate the final output.
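As a rough illustration of the dual-path idea described above, the following NumPy sketch weights a final feature map along a spatial path and a channel path and then aggregates the two results. It is only a sketch under stated assumptions, not the DPA-Net implementation: the per-band projection `proj`, the sigmoid normalization, the pooling-based channel weight, the element-wise sum used for aggregation, and all function names are hypothetical choices made for illustration. The feature map is also assumed to have the same height and width as the input image.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def spatial_weight_map(image, proj):
    """Single-channel spatial weight map computed from the ORIGINAL image.

    image: array of shape (bands, H, W).
    proj:  array of shape (bands,), a hypothetical stand-in for learned weights.
    Returns an (H, W) array with one weight per pixel.
    """
    score = np.tensordot(proj, image, axes=([0], [0]))  # contract bands -> (H, W)
    return sigmoid(score)


def channel_weights(features):
    """One weight per channel of the feature map.

    features: array of shape (C, H, W).
    Assumption: global average pooling followed by a sigmoid.
    """
    pooled = features.mean(axis=(1, 2))  # (C,)
    return sigmoid(pooled)


def dual_path_fuse(image, features, proj):
    """Weight the final feature map along both paths and aggregate them.

    Assumption: aggregation is an element-wise sum of the two weighted maps.
    """
    w_spatial = spatial_weight_map(image, proj)   # (H, W)
    w_channel = channel_weights(features)         # (C,)
    spatial_path = features * w_spatial[None, :, :]
    channel_path = features * w_channel[:, None, None]
    return spatial_path + channel_path
```

In this sketch, pixels with a larger spatial weight contribute more strongly to the fused output, which mirrors the statement that the network pays more attention to areas with larger weights.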
It is worth emphasizing that although the proposed method is more effective than the original
self-attention method, it does not significantly change the memory footprint. Overall, it solves the
conventional problems associated with self-attention mechanisms in a straightforward way. First of all,
the TSAM makes its calculations on the basis of the original image. When compared to downstream
feature maps, original remote sensing images contain more spatial information. Secondly, the output of
the two modules acts on the last feature map in the model. Thus, the two modules can control the
backpropagation of the entire model. In addition, the simplicity of the module structure makes the modules
easy to use with any segmentation model. To verify the effectiveness of our method, we conducted
experiments with U-Net, PSP-Net, and DeepLab V3+ [24,25] on the Gaofen Image Dataset (GID) [26].
Our method improved the mean IoU of the three models by 0.84%, 2.54% and 1.32%, respectively.
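For reference, the mean IoU averages the per-class intersection-over-union between the predicted and reference label maps. The pure-Python sketch below shows one common way to compute it. The function name `mean_iou` and the choice to skip classes that appear in neither map are assumptions, since the evaluation protocol is not detailed in this section.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes.

    y_true, y_pred: flat sequences of integer class labels of equal length.
    Classes absent from both sequences are skipped (an assumption).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps.
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        # Pixels labelled c in at least one map.
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else float("nan")
```

For example, with reference labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean IoU is 7/12.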
The main contributions of the paper can be summarized as follows:
• We propose a Dual Path Attention Network (DPA-Net) that uses a self-attention mechanism to
enhance a network’s ability to capture key local features in the semantic segmentation of remote
sensing images.
• A total spatial attention module is used to extract pixel-level spatial information, and a channel
attention module is proposed to focus on different features. After the dual path feature extraction
has taken place, the performance of the semantic segmentation is significantly improved.