A Comprehensive Guide to Machine Learning
About
CS 189 is the Machine Learning course at UC Berkeley. We have created this comprehensive course guide in order to share our knowledge with students and the general public, and hopefully to draw the interest of students from other universities to Berkeley's Machine Learning curriculum.

This guide was started by CS 189 TAs Soroush Nasiriany and Garrett Thomas in Fall 2017, with the assistance of William Wang and Alex Yang.

We owe gratitude to Professors Anant Sahai, Stella Yu, and Jennifer Listgarten, as this book is heavily inspired by their lectures. In addition, we are indebted to Professor Jonathan Shewchuk for his machine learning notes, from which we drew inspiration.

The latest version of this document can be found either at http://www.eecs189.org/ or http://snasiriany.me/cs189/. Please report any mistakes to the staff, and contact the authors if you wish to redistribute this document.
Notation
Notation      Meaning
R             set of real numbers
R^n           set (vector space) of n-tuples of real numbers, endowed with the usual inner product
R^(m×n)       set (vector space) of m-by-n matrices
δ_ij          Kronecker delta, i.e. δ_ij = 1 if i = j, 0 otherwise
∇f(x)         gradient of the function f at x
∇²f(x)        Hessian of the function f at x
p(X)          distribution of random variable X
p(x)          probability density/mass function evaluated at x
E[X]          expected value of random variable X
Var(X)        variance of random variable X
Cov(X, Y)     covariance of random variables X and Y
Other notes:
• Vectors and matrices are in bold (e.g. x, A). This is true for vectors in R^n as well as for vectors in general vector spaces. We generally use Greek letters for scalars and capital Roman letters for matrices and random variables.
• We assume that vectors are column vectors, i.e. that a vector in R^n can be interpreted as an n-by-1 matrix. As such, taking the transpose of a vector is well-defined (and produces a row vector, which is a 1-by-n matrix).
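As a small aside (an illustrative example, not part of the original text), the column-vector convention is easy to check with NumPy:

    import numpy as np

    # A quick check (illustrative only) of the column-vector convention: a vector in R^n
    # is treated as an n-by-1 matrix, so its transpose is a 1-by-n row vector.
    x = np.array([[1.0], [2.0], [3.0]])  # a vector in R^3, stored as a 3-by-1 column
    print(x.shape)    # (3, 1)
    print(x.T.shape)  # (1, 3): the transpose is a row vector
    print(x.T @ x)    # 1-by-1 matrix [[14.]], the usual inner product <x, x>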
Contents
1 Regression I 5
1.1 Ordinary Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Hyperparameters and Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Regression II 17
2.1 MLE and MAP for Regression (Part I) . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Bias-Variance Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Multivariate Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 MLE and MAP for Regression (Part II) . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Kernels and Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6 Sparse Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7 Total Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3 Dimensionality Reduction 63
3.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4 Beyond Least Squares: Optimization and Neural Networks 79
4.1 Nonlinear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Line Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5 Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.7 Gauss-Newton Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.8 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.9 Training Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5 Classification 107
5.1 Generative vs. Discriminative Classification . . . . . . . . . . . . . . . . . . . . . . . 107
5.2 Least Squares Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4 Gaussian Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.6 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.7 Nearest Neighbor Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6 Clustering 151
6.1 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.2 Mixture of Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.3 Expectation Maximization (EM) Algorithm . . . . . . . . . . . . . . . . . . . . . . . 156
7 Decision Tree Learning 163
7.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.2 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8 Deep Learning 175
8.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.2 CNN Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.3 Visualizing and Understanding CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Chapter 1
Regression I
Our goal in machine learning is to extract a relationship from data. In regression tasks, this relationship takes the form of a function y = f(x), where y ∈ R is some quantity that can be predicted from an input x ∈ R^d, which should for the time being be thought of as some collection of numerical measurements. The true relationship f is unknown to us, and our aim is to recover it as well as we can from data. Our end product is a function ŷ = h(x), called the hypothesis, that should approximate f. We assume that we have access to a dataset D = {(x_i, y_i)}_{i=1}^n, where each pair (x_i, y_i) is an example (possibly noisy or otherwise approximate) of the input-output mapping to be learned. Since learning arbitrary functions is intractable, we restrict ourselves to some hypothesis class H of allowable functions. More specifically, we typically employ a parametric model, meaning that there is some finite-dimensional vector w ∈ R^d, the elements of which are known as parameters or weights, that controls the behavior of the function. That is,

    h_w(x) = g(x, w)

for some other function g. The hypothesis class is then the set of all functions induced by the possible choices of the parameters w:

    H = {h_w | w ∈ R^d}

After designating a cost function L, which measures how poorly the predictions ŷ of the hypothesis match the true output y, we can proceed to search for the parameters that best fit the data by minimizing this function:

    w* = arg min_w L(w)
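To make this framework concrete, here is a minimal sketch in Python with NumPy. The example is illustrative only and not taken from the text: the choice of a linear g, a squared-error cost, and gradient descent as the optimizer are all assumptions made purely for the demonstration.

    import numpy as np

    # A minimal sketch (illustrative only) of the parametric-model framework above:
    # the hypothesis h_w(x) = g(x, w) is determined by the weight vector w, and
    # training searches for w* = arg min_w L(w).

    def g(x, w):
        # linear hypothesis; for a single input x this is the inner product of x and w
        return x @ w

    def cost(w, X, y):
        # L(w): how poorly the predictions h_w(x_i) match the true outputs y_i
        return np.mean((X @ w - y) ** 2)

    def fit(X, y, step=0.1, iters=500):
        # minimize L(w) by plain gradient descent (one of many possible optimizers)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the average squared error
            w -= step * grad
        return w

    # Usage: recover a known linear relationship from noisy examples (x_i, y_i).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = g(X, w_true) + 0.1 * rng.normal(size=100)
    w_hat = fit(X, y)
    print(cost(w_hat, X, y), w_hat)  # small cost; w_hat close to w_true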
1.1 Ordinary Least Squares
Ordinary least squares (OLS) is one of the simplest regression problems, but it is well-understood and practically useful. It is a linear regression problem, which means that we take h_w to be of the form h_w(x) = x^⊤ w. We want

    y_i ≈ ŷ_i = h_w(x_i) = x_i^⊤ w
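As an illustration (an assumption-laden sketch, not part of the original text), the same setup can be written in matrix form: stacking the inputs x_i as the rows of a design matrix X, the predictions for all examples are X w, and OLS seeks the w that minimizes the sum of squared residuals ||X w − y||^2. The snippet below simply uses NumPy's built-in least-squares solver to find that w.

    import numpy as np

    # Illustration (assumed, not from the text) of the OLS setup in matrix form.
    rng = np.random.default_rng(1)
    n, d = 50, 4
    X = rng.normal(size=(n, d))                   # row i is the input x_i in R^d
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.05 * rng.normal(size=n)    # noisy targets y_i ≈ x_i^T w_true

    # np.linalg.lstsq returns the w minimizing ||X w - y||^2
    w_ols, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    print(w_ols)            # recovered weights, close to w_true
    print(X[0] @ w_ols)     # prediction ŷ_1 = h_w(x_1) = x_1^T w for the first example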