criminative keypoints and calculate adaptive warps. Later, line segments proved to be another effective feature for achieving better stitching quality and preserving linear structures [31, 49, 32, 19]. Recently, large-scale edges were also introduced in [10] to preserve contour structures. Besides, a great variety of other geometric features have been leveraged to improve stitching quality, such as depth maps [33], semantic planar regions [26], etc.
Having calculated the warps, seam cutting is usually
used to remove parallax artifacts. To find an invisible seam, various energy functions have been designed using color [22], edges [35, 8], saliency maps [30], depth [6], etc.
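As a concrete illustration of the basic idea (not the formulation of any specific cited method), a minimal seam-cutting sketch can use a per-pixel color-difference energy over the overlap region and dynamic programming to extract a minimum-cost vertical seam. The function names and the pure-NumPy setting below are our own simplifications:

```python
import numpy as np

def color_energy(img1, img2):
    """Per-pixel color-difference cost over the overlap region.
    (One simple energy choice; cited methods also add edge,
    saliency, or depth terms.)"""
    return np.abs(img1.astype(np.float64) - img2.astype(np.float64)).sum(axis=2)

def min_vertical_seam(energy):
    """Find the minimum-cost top-to-bottom seam by dynamic programming."""
    h, w = energy.shape
    cost = energy.astype(np.float64)
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 2, w)
            cost[y, x] += cost[y - 1, lo:hi].min()
    # Backtrack from the cheapest bottom pixel, moving at most
    # one column left or right per row.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(cost[-1].argmin())
    for y in range(h - 2, -1, -1):
        lo = max(seam[y + 1] - 1, 0)
        hi = min(seam[y + 1] + 2, w)
        seam[y] = lo + int(cost[y, lo:hi].argmin())
    return seam
```

Pixels left of the seam would then be taken from one image and pixels right of it from the other; graph-cut formulations generalize this to arbitrary seam topologies.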
The broad use of geometric features reveals a clear development trend: increasingly sophisticated features are being leveraged. We ask: are these complex designs practical in real applications? We attempt to answer this question from two perspectives. 1) Elaborate algorithms built on complicated geometric features adapt poorly to scenes without sufficient geometric structures, such as medical images, industrial images, and other natural images with low texture (Fig. 9b), low light, or low resolution. 2) When abundant geometric structures do exist, the running speed is intolerable (please refer to Tables 2 and 3 for details). Such a trend seems to violate the original “practical” intent.
Recently, deep stitching techniques using convolutional neural networks (CNNs) have attracted widespread attention in the community. They abandon geometric features in favor of high-level semantic features that can be adaptively learned in a data-driven fashion, in a supervised [24, 40, 44, 47, 23], weakly-supervised [46], or unsupervised [41] manner. Although robust to various natural or unnatural conditions, they cannot handle large parallax and demonstrate unsatisfactory generalization in cross-dataset and cross-resolution conditions. A large-parallax
case is shown in Fig. 9a, where the tree is in the middle of
the car in the reference image while it is on the left in the
target image. To deal with parallax, UDIS [41] reconstructs
stitched images from feature to pixel. However, the parallax
is so large that undesired blurs are produced as a side effect.
In this paper, we propose a parallax-tolerant unsuper-
vised deep image stitching technique, addressing the robust-
ness issue in traditional stitching and the large-parallax is-
sue in deep stitching simultaneously. The proposed deep learning-based solution is naturally robust to various scenes thanks to effective semantic feature extraction. It then overcomes large parallax via two stages: warp and
composition. In the first stage, we propose a robust and
flexible warp to model the image registration. Particularly,
we simultaneously parameterize homography transforma-
tion and thin-plate spline (TPS) transformation as unified
representations in a compact framework. The former offers
a global linear transformation, while the latter produces lo-
cal nonlinear deformation, allowing our warp to align im-
ages with parallax. Besides, this warp contributes to both
content alignment and shape preservation simultaneously
via combined optimization of alignment and distortion. In the second stage, we note that the existing reconstruction-based method [41] treats artifact elimination as a reconstruction process from feature to pixel, leading to inevitable blurs around parallax regions. To overcome this drawback, we incorporate the idea of seam cutting into deep composition and implicitly find a “seam” through unsupervised learning of seam-driven composition masks. To this end, we
design boundary and smoothness constraints to restrict the
endpoints and route of a “seam”, compositing the stitched
image seamlessly. In addition to the two stages, we design a simple iterative strategy to enhance generalization, rapidly improving the registration performance of our warp across different datasets and resolutions.
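To make the warp model concrete, the sketch below fits a plain thin-plate spline between control points in NumPy. In our framework the homography supplies the global linear part (here it could be emulated by pre-warping the control points), while TPS adds the local nonlinear deformation. The function names are illustrative and this is not the network's actual parameterization:

```python
import numpy as np

def tps_kernel(r2):
    """TPS radial basis U(r) = r^2 log(r^2), with U(0) = 0."""
    out = np.zeros_like(r2, dtype=np.float64)
    m = r2 > 0
    out[m] = r2[m] * np.log(r2[m])
    return out

def fit_tps(src, dst):
    """Solve for TPS coefficients mapping src control points to dst.
    Returns an (n + 3, 2) array: n kernel weights plus an affine part."""
    n = len(src)
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    P = np.hstack([np.ones((n, 1)), src])  # affine basis [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = tps_kernel(d2)
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    return np.linalg.solve(A, b)

def tps_transform(pts, src, coeffs):
    """Apply the fitted TPS to arbitrary query points."""
    n = len(src)
    d2 = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return tps_kernel(d2) @ coeffs[:n] + P @ coeffs[n:]
```

By construction the TPS interpolates the control points exactly while minimizing bending energy between them, which is what allows locally different displacements to align parallax regions without distorting the rest of the image.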
Furthermore, we conduct extensive experiments on the warp and composition, demonstrating our superiority over other SoTA solutions. Our contributions are summarized as follows:
• We propose a robust and flexible warp by parameteriz-
ing the homography and thin-plate spline into unified
representations, realizing unsupervised content align-
ment and shape preservation in various scenes.
• A new composition approach is proposed to generate
seamless stitched images via unsupervised learning for
composition masks. Compared with the reconstruc-
tion [41], our composition eliminates parallax artifacts
without introducing undesirable blurs.
• We design a simple iterative strategy to enhance warp adaptation across different datasets and resolutions.
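The boundary and smoothness constraints on the composition mask can be illustrated with simple differentiable penalties. The exact loss terms below (a total-variation smoothness term and a border-pinning boundary term) are hypothetical stand-ins for our actual formulation, written in NumPy for clarity:

```python
import numpy as np

def smoothness_loss(mask):
    """Total-variation penalty: discourages a ragged implicit seam
    by penalizing abrupt spatial changes in the composition mask."""
    dx = np.abs(mask[:, 1:] - mask[:, :-1]).mean()
    dy = np.abs(mask[1:, :] - mask[:-1, :]).mean()
    return dx + dy

def boundary_loss(mask):
    """Pin the mask to 1 on the reference-side border and 0 on the
    target-side border, constraining the endpoints of the implicit seam."""
    return ((mask[:, 0] - 1.0) ** 2).mean() + (mask[:, -1] ** 2).mean()

def composite(ref, tgt, mask):
    """Blend the aligned images with the (soft) composition mask."""
    return mask[..., None] * ref + (1.0 - mask[..., None]) * tgt
```

Minimizing such terms drives the soft mask toward a sharp 0/1 transition, i.e., an implicit seam, without any ground-truth supervision.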
2. Related Work
2.1. Traditional Image Stitching
Adaptive warp. AutoStitch [4] leveraged SIFT [38] to
extract discriminative keypoints for constructing a global homography transformation. After that, SIFT became an indispensable feature for calculating various flexible warps, such as DHW [13], SVA [36], APAP [50], ELA [28], and TFA [27] for better alignment, and SPHP [5], AANAP [34], and GSP [7] for better shape preservation. Then, DFW [13] adopted line segments extracted by LSD [48] together with keypoints to
enrich structural information in man-made environments. Furthermore, line-guided mesh deformation [49] was designed by optimizing an energy function with various line-preserving terms [32, 19]. To preserve nonlinear structures,
edge features are used in GES-GSP [10] to achieve a smooth
transition between local alignment and structural preserva-
tion. In addition to these basic geometric features (point,
line, and edge), the depth maps and semantic planes are
also used to assist the feature matching using extra depth
consistency [33] and planar consensus [26].
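For reference, the global homography that these keypoint-based pipelines estimate from matched points can be sketched with a plain direct linear transform (DLT); real pipelines wrap this in RANSAC for robustness to mismatches. This is a textbook sketch, not the implementation of any cited method:

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H mapping src -> dst from >= 4 point pairs via the
    direct linear transform (least-squares null vector of A)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Cross-multiplied projection constraints, two rows per pair.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=np.float64))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply a homography to an (n, 2) array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]
```

A single homography is exact only for planar scenes or pure camera rotation, which is precisely why the adaptive warps surveyed above refine it with spatially-varying models.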