CS231 Image Stitching: Stanford Open Course Assignment | CSDN Library

90 files in total (m: 51, jpg: 33, mat: 4)

Tags: image stitching, CS231, Stanford assignment

Uploaded 2018-06-16 16:15:55 · 19.84 MB · RAR

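**Image stitching** is an important technique in computer vision: it merges several overlapping pictures into a larger, continuous view or panorama. In this **Stanford open course assignment**, students of the CS231 course (possibly short for "Convolutional Neural Networks for Visual Recognition", although that is the title of CS231n) study and practice the technique. The main goal of image stitching is to eliminate the seams between images so that the final panorama looks natural and uninterrupted. This usually means handling geometric transformations between images, inconsistent illumination, and color matching. In the assignment, students may need to master the following key points:

1. **Geometric transformation**: The images must first be brought into spatial alignment. This may involve rotation, translation, and scaling, which can be derived from camera parameters (such as focal length and principal point) or estimated directly from matched points.
2. **Feature matching**: To find out how to align the images, corresponding points in different images must be detected and matched. Feature detection and description algorithms such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded-Up Robust Features), or ORB (Oriented FAST and Rotated BRIEF) can be used for this.
3. **Homography estimation**: Once feature matches are found, a homography matrix can be estimated by minimizing the error between matched feature points in the overlap region. The matrix describes the planar projective transformation from one image to the other. The file names `ComputeAffineMatrix.m` and `RANSACFit.m` in this archive suggest that the assignment itself fits an affine transformation with RANSAC.
4. **Image blending**: To remove seams, the pixels in the overlap region must be blended. This may involve weight assignment, illumination compensation, or color-consistency algorithms, for example alpha blending or optimal seam selection.
5. **Deep learning applications**: In recent years, deep learning methods such as convolutional neural networks (CNNs) have also been applied to image stitching, either end to end or for specific sub-tasks such as feature matching and image blending, to improve the quality and robustness of the result.

In this **Stanford open course assignment**, students may be asked to implement some or all of these steps and to solve a practical problem in code. This may involve Python together with image processing libraries such as OpenCV and PIL and deep learning frameworks such as TensorFlow and PyTorch, although the source files in this archive are MATLAB (`.m`) scripts. Students need to understand the theory and also have the programming skills to turn it into runnable code.

The assignment aims to deepen students' understanding of the basic principles of image processing and computer vision while improving their practical skills, preparing them for later work in visual recognition, panorama generation, virtual reality, and related areas. By solving a real problem, students gain a better grasp of the core techniques and difficulties of image stitching.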
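The blending step described above can be illustrated with a minimal sketch of linear alpha (feather) blending. It is not taken from the archive: the function name `feather_blend` is hypothetical, and the sketch assumes the two images are already aligned, have the same height, and overlap by a known number of columns.

```python
import numpy as np

def feather_blend(left, right, overlap):
    """Join two aligned images whose last/first `overlap` columns show the same scene.

    Hypothetical helper for illustration; assumes a purely horizontal offset.
    """
    if left.shape[0] != right.shape[0]:
        raise ValueError("images must have the same height")
    if not 0 < overlap <= min(left.shape[1], right.shape[1]):
        raise ValueError("overlap must fit inside both images")
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    # Weight of the left image falls linearly from 1 to 0 across the overlap.
    alpha = np.linspace(1.0, 0.0, overlap)
    # Trailing axes let alpha broadcast over rows and, if present, color channels.
    alpha = alpha.reshape((1, overlap) + (1,) * (left.ndim - 2))
    blended = alpha * left[:, -overlap:] + (1.0 - alpha) * right[:, :overlap]
    return np.concatenate([left[:, :-overlap], blended, right[:, overlap:]], axis=1)

# Usage: a bright strip fades into a dark strip over 3 shared columns.
pano = feather_blend(np.full((2, 4), 100.0), np.zeros((2, 4)), overlap=3)
print(pano[0])
```

A linear ramp is the simplest weighting choice; it hides small exposure differences but will ghost if the images are misaligned.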


CS231-图像拼接-斯坦福.rar (90 files)

图像拼接作业

PlotSIFTDescriptor.m 4KB

KeypointDetect

plot_matched.m 718B

resample_bilinear.m 1KB

gauss2dx.m 573B

plotpoints.m 1KB

symmetric_match.m 975B

motion_corr.m 6KB

find_features.m 6KB

show_plist.m 194B

mv2.m 166B

affine.m 1KB

construct_key.m 817B

kill_edges.m 170B

showfeatures.m 1KB

filter_gaussian.m 2KB

eliminate_edges.m 137B

process_loop.m 283B

generate_parta_comparisons.m 406B

gauss1d.m 236B

find_extrema.m 3KB

match_dv_odometry.m 392B

get_dvtime.m 615B

motion_corr2.m 4KB

show_points.m 97B

fmransac_test2.m 7KB

structure2.m 678B

make_cost.m 227B

detect_features_DoG.m 3KB

filter_laplacian.m 2KB

drawbox.m 244B

refine_features.m 9KB

fit_paraboloid.m 432B

getpts.m 6KB

f_class.m 397B

fit_parabola.m 119B

build_pyramid.m 2KB

interp.m 408B

SIFTSimpleMatcher.m 2KB

MatcherTester.m 651B

StitchTester.m 2KB

EvaluateSIFTMatcher.m 576B

RANSACFit.m 5KB

TransformationTester.m 1KB

SIFTDescriptor.m 15KB

说明.docx 18KB

EvaluateAffineMatrix.m 687B

MultipleStitch.m 7KB

PlotMatch.m 10KB

ComputeAffineMatrix.m 2KB

SIFTTester.m 514B

SIFT paper.pdf 444KB

data

yard1.jpg 21KB

campus_00.jpg 2.69MB

trees_000.jpg 190KB

building_01.jpg 174KB

campus_02.jpg 2.86MB

uttower2.jpg 39KB

yosemite1.jpg 198KB

uttower2_bad.jpg 38KB

yosemite4.jpg 248KB

campus_01.jpg 2.81MB

pine3.jpg 123KB

trees_002.jpg 247KB

watering_cart_00.jpg 249KB

yard2.jpg 19KB

trees_003.jpg 261KB

watering_cart_03.jpg 162KB

watering_cart_02.jpg 169KB

building_03.jpg 151KB

watering_cart_01.jpg 215KB

trees_001.jpg 225KB

yosemite2.jpg 195KB

pine4.jpg 118KB

pine1.jpg 106KB

yosemite3.jpg 179KB

pine2.jpg 126KB

building_02.jpg 133KB

yard3.jpg 16KB

yard4.jpg 14KB

campus_03.jpg 3.1MB

uttower1.jpg 39KB

campus_04.jpg 3.04MB

building_04.jpg 174KB

uttower2_scaledup.jpg 166KB

EvaluateSIFTDescriptor.m 720B

checkpoint

Match_ref.mat 255B

Match_input.mat 964KB

Affine_ref.mat 207B

SIFT_ref.mat 75KB

PairStitch.m 2KB

Distinctive Image Features from Scale-Invariant Keypoints

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., Canada

lowe@cs.ubc.ca

January 5, 2004

Abstract

This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

Accepted for publication in the International Journal of Computer Vision, 2004.

1 Introduction

Image matching is a fundamental aspect of many problems in computer vision, including object or scene recognition, solving for 3D structure from multiple images, stereo correspondence, and motion tracking. This paper describes image features that have many properties that make them suitable for matching differing images of an object or scene. The features are invariant to image scaling and rotation, and partially invariant to change in illumination and 3D camera viewpoint. They are well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise. Large numbers of features can be extracted from typical images with efficient algorithms. In addition, the features are highly distinctive, which allows a single feature to be correctly matched with high probability against a large database of features, providing a basis for object and scene recognition.

The cost of extracting these features is minimized by taking a cascade filtering approach, in which the more expensive operations are applied only at locations that pass an initial test. Following are the major stages of computation used to generate the set of image features:

1. Scale-space extrema detection: The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation.

2. Keypoint localization: At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.

3. Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations.

4. Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

This approach has been named the Scale Invariant Feature Transform (SIFT), as it transforms image data into scale-invariant coordinates relative to local features.

An important aspect of this approach is that it generates large numbers of features that densely cover the image over the full range of scales and locations. A typical image of size 500x500 pixels will give rise to about 2000 stable features (although this number depends on both image content and choices for various parameters). The quantity of features is particularly important for object recognition, where the ability to detect small objects in cluttered backgrounds requires that at least 3 features be correctly matched from each object for reliable identification.

For image matching and recognition, SIFT features are first extracted from a set of reference images and stored in a database. A new image is matched by individually comparing each feature from the new image to this previous database and finding candidate matching features based on Euclidean distance of their feature vectors. This paper will discuss fast nearest-neighbor algorithms that can perform this computation rapidly against large databases.
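The matching rule in the paragraph above (pick the database feature whose descriptor vector is closest in Euclidean distance) can be sketched as follows. This is a brute-force illustration rather than the fast nearest-neighbor algorithm the paper refers to, and the function name `match_nearest` is an assumption made for this example.

```python
import numpy as np

def match_nearest(desc_new, desc_db):
    """For each row of desc_new, find the closest row of desc_db.

    Returns (indices, distances). Exhaustive search: cost grows with
    len(desc_new) * len(desc_db), so it only suits small databases.
    """
    desc_new = np.asarray(desc_new, dtype=np.float64)
    desc_db = np.asarray(desc_db, dtype=np.float64)
    # Squared distances from ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (
        np.sum(desc_new ** 2, axis=1)[:, None]
        + np.sum(desc_db ** 2, axis=1)[None, :]
        - 2.0 * desc_new @ desc_db.T
    )
    sq = np.maximum(sq, 0.0)  # guard against tiny negative rounding errors
    nearest = np.argmin(sq, axis=1)
    distances = np.sqrt(sq[np.arange(len(desc_new)), nearest])
    return nearest, distances

# Usage with toy 2-D "descriptors" (real SIFT descriptors are much longer vectors).
database = [[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]]
queries = [[3.0, 5.0], [0.1, 0.0]]
idx, dist = match_nearest(queries, database)
print(idx, dist)
```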

The keypoint descriptors are highly distinctive, which allows a single feature to find its correct match with good probability in a large database of features. However, in a cluttered image, many features from the background will not have any correct match in the database, giving rise to many false matches in addition to the correct ones. The correct matches can be filtered from the full set of matches by identifying subsets of keypoints that agree on the object and its location, scale, and orientation in the new image. The probability that several features will agree on these parameters by chance is much lower than the probability that any individual feature match will be in error. The determination of these consistent clusters can be performed rapidly by using an efficient hash table implementation of the generalized Hough transform.
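A hash table of pose bins, as mentioned above, might look like the following simplified sketch. It assumes each match has already been converted into a predicted object pose `(x, y, scale, orientation_deg)`; the function name, the bin widths, and the single-bin voting are illustrative assumptions, not the paper's stated parameters.

```python
import math
from collections import defaultdict

def hough_clusters(pose_votes, loc_bin=32.0, ori_bin_deg=30.0, min_votes=3):
    """Group matches whose predicted poses fall into the same coarse bin.

    pose_votes: iterable of (x, y, scale, orientation_deg), one per match,
    with scale > 0. Bin sizes are placeholder values chosen for illustration.
    Returns a list of clusters, each a list of match indices.
    """
    bins = defaultdict(list)  # hash table: quantized pose -> match indices
    for index, (x, y, scale, orientation) in enumerate(pose_votes):
        key = (
            math.floor(x / loc_bin),
            math.floor(y / loc_bin),
            math.floor(math.log2(scale)),  # one bin per factor of 2 in scale
            math.floor((orientation % 360.0) / ori_bin_deg),
        )
        bins[key].append(index)
    # Keep only bins that collected enough agreeing matches.
    return [members for members in bins.values() if len(members) >= min_votes]

# Usage: three matches agree on a pose, one stray match does not.
votes = [(10, 10, 1.0, 5), (12, 11, 1.1, 10), (15, 9, 1.2, 20), (200, 200, 4.0, 180)]
print(hough_clusters(votes))
```

Because every match casts a single vote, poses that straddle a bin boundary can be split across bins; a fuller implementation would have to address that.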

Each cluster of 3 or more features that agree on an object and its pose is then subject to further detailed verification. First, a least-squared estimate is made for an affine approximation to the object pose. Any other image features consistent with this pose are identified, and outliers are discarded. Finally, a detailed computation is made of the probability that a particular set of features indicates the presence of an object, given the accuracy of fit and number of probable false matches. Object matches that pass all these tests can be identified as correct with high confidence.
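The least-squares affine fit and the consistency check described above can be sketched with NumPy. The function names `fit_affine` and `consistent_matches` and the pixel tolerance are assumptions for illustration; the probability computation mentioned in the text is not covered here.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix A with dst ~= A @ [x, y, 1]^T.

    Needs at least 3 point pairs; more pairs give an overdetermined system.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    if src.shape != dst.shape or src.ndim != 2 or src.shape[1] != 2 or len(src) < 3:
        raise ValueError("need matching (N, 2) arrays with N >= 3")
    homogeneous = np.hstack([src, np.ones((len(src), 1))])  # N x 3
    solution, *_ = np.linalg.lstsq(homogeneous, dst, rcond=None)  # 3 x 2
    return solution.T

def consistent_matches(affine, src, dst, tol=3.0):
    """Boolean mask of matches whose mapped position lies within tol pixels."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    predicted = src @ affine[:, :2].T + affine[:, 2]
    return np.linalg.norm(predicted - dst, axis=1) <= tol

# Usage: recover a known transform, then flag one corrupted match.
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]], dtype=float)
true_affine = np.array([[2.0, 0.5, 1.0], [-0.5, 3.0, -2.0]])
dst = src @ true_affine[:, :2].T + true_affine[:, 2]
estimate = fit_affine(src, dst)
dst[4] += 50.0  # simulate a false match
print(consistent_matches(estimate, src, dst))
```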

2 Related research

The development of image matching by using a set of local interest points can be traced back to the work of Moravec (1981) on stereo matching using a corner detector. The Moravec detector was improved by Harris and Stephens (1988) to make it more repeatable under small image variations and near edges. Harris also showed its value for efficient motion tracking and 3D structure from motion recovery (Harris, 1992), and the Harris corner detector has since been widely used for many other image matching tasks. While these feature detectors are usually called corner detectors, they are not selecting just corners, but rather any image location that has large gradients in all directions at a predetermined scale.

The initial applications were to stereo and short-range motion tracking, but the approach was later extended to more difficult problems. Zhang et al. (1995) showed that it was possible to match Harris corners over a large image range by using a correlation window around each corner to select likely matches. Outliers were then removed by solving for a fundamental matrix describing the geometric constraints between the two views of a rigid scene and removing matches that did not agree with the majority solution. At the same time, a similar approach was developed by Torr (1995) for long-range motion matching, in which geometric constraints were used to remove outliers for rigid objects moving within an image.

The ground-breaking work of Schmid and Mohr (1997) showed that invariant local feature matching could be extended to general image recognition problems in which a feature was matched against a large database of images. They also used Harris corners to select interest points, but rather than matching with a correlation window, they used a rotationally invariant descriptor of the local image region. This allowed features to be matched under arbitrary orientation change between the two images. Furthermore, they demonstrated that multiple feature matches could accomplish general recognition under occlusion and clutter by identifying consistent clusters of matched features.

The Harris corner detector is very sensitive to changes in image scale, so it does not provide a good basis for matching images of different sizes. Earlier work by the author (Lowe, 1999) extended the local feature approach to achieve scale invariance. This work also described a new local descriptor that provided more distinctive features while being less sensitive to local image distortions such as 3D viewpoint change. This current paper provides a more in-depth development and analysis of this earlier work, while also presenting a number of improvements in stability and feature invariance.

There is a considerable body of previous research on identifying representations that are stable under scale change. Some of the first work in this area was by Crowley and Parker (1984), who developed a representation that identified peaks and ridges in scale space and linked these into a tree structure. The tree structure could then be matched between images with arbitrary scale change. More recent work on graph-based matching by Shokoufandeh, Marsic and Dickinson (1999) provides more distinctive feature descriptors using wavelet coefficients. The problem of identifying an appropriate and consistent scale for feature detection has been studied in depth by Lindeberg (1993, 1994). He describes this as a problem of scale selection, and we make use of his results below.

Recently, there has been an impressive body of work on extending local features to be invariant to full affine transformations (Baumberg, 2000; Tuytelaars and Van Gool, 2000; Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002; Brown and Lowe, 2002). This allows for invariant matching to features on a planar surface under changes in orthographic 3D projection, in most cases by resampling the image in a local affine frame. However, none of these approaches are yet fully affine invariant, as they start with initial feature scales and locations selected in a non-affine-invariant manner due to the prohibitive cost of exploring the full affine space. The affine frames are also more sensitive to noise than those of the scale-invariant features, so in practice the affine features have lower repeatability than the scale-invariant features unless the affine distortion is greater than about a 40 degree tilt of a planar surface (Mikolajczyk, 2002). Wider affine invariance may not be important for many applications, as training views are best taken at least every 30 degrees rotation in viewpoint (meaning that recognition is within 15 degrees of the closest training view) in order to capture non-planar changes and occlusion effects for 3D objects.

While the method to be presented in this paper is not fully affine invariant, a different approach is used in which the local descriptor allows relative feature positions to shift significantly with only small changes in the descriptor. This approach not only allows the descriptors to be reliably matched across a considerable range of affine distortion, but it also makes the features more robust against changes in 3D viewpoint for non-planar surfaces. Other advantages include much more efficient feature extraction and the ability to identify larger numbers of features. On the other hand, affine invariance is a valuable property for matching planar surfaces under very large view changes, and further research should be performed on the best ways to combine this with non-planar 3D viewpoint invariance in an efficient and stable manner.

Many other feature types have been proposed for use in recognition, some of which could be used in addition to the features described in this paper to provide further matches under differing circumstances. One class of features are those that make use of image contours or region boundaries, which should make them less likely to be disrupted by cluttered backgrounds near object boundaries. Matas et al. (2002) have shown that their maximally-stable extremal regions can produce large numbers of matching features with good stability. Mikolajczyk et al. (2003) have developed a new descriptor that uses local edges while ignoring unrelated nearby edges, providing the ability to find stable features even near the boundaries of narrow shapes superimposed on background clutter. Nelson and Selinger (1998) have shown good results with local features based on groupings of image contours. Similarly, Pope and Lowe (2000) used features based on the hierarchical grouping of image contours, which are particularly useful for objects lacking detailed texture.

The history of research on visual recognition contains work on a diverse set of other image properties that can be used as feature measurements. Carneiro and Jepson (2002) describe phase-based local features that represent the phase rather than the magnitude of local spatial frequencies, which is likely to provide improved invariance to illumination. Schiele and Crowley (2000) have proposed the use of multidimensional histograms summarizing the distribution of measurements within image regions. This type of feature may be particularly useful for recognition of textured objects with deformable shapes. Basri and Jacobs (1997) have demonstrated the value of extracting local region boundaries for recognition. Other useful properties to incorporate include color, motion, figure-ground discrimination, region shape descriptors, and stereo depth cues. The local feature approach can easily incorporate novel feature types because extra features contribute to robustness when they provide correct matches, but otherwise do little harm other than their cost of computation. Therefore, future systems are likely to combine many feature types.

3 Detection of scale-space extrema

As described in the introduction, we will detect keypoints using a cascade filtering approach that uses efficient algorithms to identify candidate locations that are then examined in further detail. The first stage of keypoint detection is to identify locations and scales that can be repeatably assigned under differing views of the same object. Detecting locations that are invariant to scale change of the image can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space (Witkin, 1983).

It has been shown by Koenderink (1984) and Lindeberg (1994) that under a variety of reasonable assumptions the only possible scale-space kernel is the Gaussian function. Therefore, the scale space of an image is defined as a function, L(x, y, σ), that is produced from the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y),$$

where ∗ is the convolution operation in x and y, and

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}.$$

To efficiently detect stable keypoint locations in scale space, we have proposed (Lowe, 1999) using scale-space extrema in the difference-of-Gaussian function convolved with the image, D(x, y, σ), which can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

$$\begin{aligned} D(x, y, \sigma) &= (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) \\ &= L(x, y, k\sigma) - L(x, y, \sigma). \end{aligned} \tag{1}$$

There are a number of reasons for choosing this function. First, it is a particularly efficient function to compute, as the smoothed images, L, need to be computed in any case for scale space feature description, and D can therefore be computed by simple image subtraction.
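As a rough illustration of equation (1), the sketch below blurs a grayscale image at several scales and subtracts neighboring levels to obtain D. The function names, the starting scale `sigma=1.6`, the factor `k = sqrt(2)`, the 3σ kernel truncation, and the edge padding are all assumptions chosen for this example, not values given in the text above.

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Approximate L(x, y, sigma) for a 2-D grayscale image."""
    radius = int(np.ceil(3 * sigma))  # truncate the kernel at 3 sigma (assumption)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()  # normalize so flat regions keep their value
    padded = np.pad(np.asarray(image, dtype=np.float64), radius, mode="edge")
    # The 2-D Gaussian is separable: blur along rows, then along columns.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def difference_of_gaussians(image, sigma=1.6, k=2 ** 0.5, levels=4):
    """Return (blurred, dog) for the scales sigma * k**i, i = 0..levels-1."""
    blurred = [gaussian_blur(image, sigma * k ** i) for i in range(levels)]
    # D at level i is plain image subtraction of two neighboring L images.
    dog = [blurred[i + 1] - blurred[i] for i in range(levels - 1)]
    return blurred, dog

# Usage: a single bright pixel gives a negative DoG response at its center.
image = np.zeros((9, 9))
image[4, 4] = 1.0
blurred, dog = difference_of_gaussians(image, levels=3)
print(len(dog), dog[0][4, 4] < 0)
```

Reusing the already smoothed images is what makes D cheap: each extra DoG level costs one subtraction once the L images exist.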

Uploaded by m0_38132407
