Algorithms for Big Data Lecture Notes (Harvard CS229r)
CS 229r: Algorithms for Big Data Fall 2015
Lecture 1 — September 3, 2015
Prof. Jelani Nelson Scribes: Zhengyu Wang
1 Course Information
• Professor: Jelani Nelson
• TF: Jarosław Błasiok
2 Topic Overview
1. Sketching/Streaming
• A “sketch” C(X) with respect to some function f is a compression of the data X. It allows us to compute f(X) (approximately) given access only to C(X).
• Sometimes f has two arguments. For data X and Y, we want to compute f(X, Y) given C(X) and C(Y).
• Motivation: maybe you have some input data and I have some input data, and we want to compute some similarity measure of these two databases across the data items. One way is that I just send you my database and you compute the similarity measure locally, or vice versa. But imagine these are really big data sets, and I don't want to send the entire data across the wire; instead, I compute a sketch of my data and send you the sketch, which is very small after compression. Now the sketch C(X) is much smaller than X, and given the sketch you can compute the function.
• Trivial example: imagine you have a batch of numbers, and I also have a batch of numbers, and we want to compute their sum. The sketch I can use is just the sum of all my input data, which I then send to you.
• Streaming: we want to maintain a sketch C(X) on the fly as X is updated. In the previous example, if the numbers arrive on the fly, I can keep a running sum, which is a streaming algorithm. The streaming setting appears in many places; for example, your router can monitor online traffic, sketching the packet counts to detect traffic patterns. (A minimal code sketch of the running-sum example follows this item.)
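To make the running-sum example concrete, here is a minimal Python sketch (the class and method names are our own, for illustration only): each party maintains a tiny running sum as its stream arrives, and the two sketches combine by addition without exchanging the raw data.

class SumSketch:
    def __init__(self):
        self.total = 0  # the entire state of the sketch

    def update(self, x):
        # Streaming: process one element on the fly, without storing it.
        self.total += x

    def merge(self, other):
        # Combine sketches computed on different data sets.
        merged = SumSketch()
        merged.total = self.total + other.total
        return merged

mine, yours = SumSketch(), SumSketch()
for x in [3, 1, 4, 1, 5]:
    mine.update(x)
for x in [9, 2, 6]:
    yours.update(x)
print(mine.merge(yours).total)  # 31, without sending the raw data across the wire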
2. Dimensionality Reduction
• Input data is high-dimensional. Dimensionality reduction transforms the high-dimensional data into a lower-dimensional version such that, for the computational problem you are considering, solving the problem on the lower-dimensional transformed data yields an approximate solution for the original data. Since the data is now low-dimensional, your algorithm can run faster.
• Applications: speeding up clustering, nearest neighbor search, etc. (A minimal numerical sketch follows this item.)
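As a minimal numerical sketch of such a transform, the snippet below uses a random Gaussian projection. This particular map is one standard example rather than the only option, and the dimensions and seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
n, d, m = 100, 10_000, 500                    # n points in dimension d; target dimension m
A = rng.standard_normal((m, d)) / np.sqrt(m)  # random projection matrix
X = rng.standard_normal((n, d))               # high-dimensional input data

Y = X @ A.T                                   # lower-dimensional versions of the points
# Pairwise distances are approximately preserved, so clustering or nearest
# neighbor on Y approximates the answer on X, at lower cost.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))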
3. Large-scale Machine Learning
• For example, regression problems: we collect data points $\{(z_i, b_i) \mid i = 1, \ldots, n\}$ such that $b_i = f(z_i) + \text{noise}$. We want to recover $\tilde{f}$ “close” to $f$.
• Linear regression: $f(z) = \langle x, z \rangle$, where $x$ is the parameter that we want to recover. If the noise is Gaussian, the popular (and in some sense optimal) estimator we use is least squares:
$$x^{LS} = \operatorname{arg\,min}_x \|Zx - b\|_2^2 = (Z^T Z)^{-1} Z^T b, \qquad (1)$$
where $b = (b_1, \ldots, b_n)^T$ and $Z = (z_1, \ldots, z_n)^T$. If $Z$ is big, the matrix multiplication can be very expensive. In this course, we will study techniques that allow us to solve least squares much faster than just computing the closed form $(Z^T Z)^{-1} Z^T b$. (A toy numerical check of (1) appears at the end of this item.)
• Other regression problems: PCA (Principal Component Analysis), matrix completion. For example, matrix completion for the Netflix problem: you are given a big product-customer matrix of customer ratings of certain products. The matrix is very sparse because not every user rates every product. Based on this limited information, you want to fill in the rest of the matrix to make product suggestions.
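As a toy numerical check of the closed form (1) (a hedged sketch: the synthetic data and dimensions are our own, and directly forming $(Z^T Z)^{-1}$ is exactly the expensive computation the course's techniques aim to avoid for large $Z$):

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
Z = rng.standard_normal((n, d))                  # rows are the data points z_i
x_true = rng.standard_normal(d)                  # unknown parameter x
b = Z @ x_true + 0.1 * rng.standard_normal(n)    # b_i = <x, z_i> + Gaussian noise

x_closed = np.linalg.inv(Z.T @ Z) @ Z.T @ b      # the closed form (1)
x_lstsq, *_ = np.linalg.lstsq(Z, b, rcond=None)  # numerically preferable library solver

print(np.allclose(x_closed, x_lstsq))            # True: both recover x up to noise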
4. Compressed Sensing
• Motivation: compress / cheaply acquire a high-dimensional signal (using linear measurements).
• For example, images are very high-dimensional vectors. If the dimensions of an image are thousands by thousands, the image has millions of pixels. If we write the image in the standard basis, i.e., as pixels, it is likely that the representation is not sparse (by sparse we mean that most coordinates are almost zero), because even a photo taken in a dark room has some intensity at most pixels. But there is a basis, the wavelet basis, in which pictures are usually very sparse. Once something is sparse, you can compress it.
• JPEG (image compression).
• MRI (faster acquisition of the signal means less time in the machine).
5. External Memory Model
• Motivation: measure disk I/O’s instead of number of instructions (because random seeks
are very expensive).
• Model: we have an infinite disk divided into blocks of size b bits, and memory of size M divided into pages of size b bits. If the data we want to read, or the location we want to write, is in memory, we can simply do the operation for free; if the location we want to access is not in memory, it costs 1 unit of time to load the block from disk into memory (and likewise to write one back). We want to minimize the number of times we go to disk.
• B-trees are designed for this model.
6. Other Models (time permitting)
• For example, MapReduce.
3 Approximate Counting Problem
In the following, we discuss the problem considered in the first streaming paper [1].
Problem. A batch of events happens, and we want to count the number of events while minimizing the space we use.
Note that we have a trivial solution, namely maintaining a counter, which takes $\log n$ bits, where $n$ is the number of events. On the other hand, by the pigeonhole principle, we cannot beat $\log n$ bits if we want to count exactly.
For the approximate counting problem, we want to output $\tilde{n}$ such that
$$\mathbb{P}(|\tilde{n} - n| > \varepsilon n) < \delta, \qquad (2)$$
where let's say $\varepsilon = 1/3$ and $\delta = 1\%$.
First of all, even if we only want a deterministic algorithm for the approximate counting problem, we cannot beat $\log \log n$ bits: similar to the previous lower-bound argument, there are $\log n$ different bands (one per power of 2), and it takes $\log \log n$ bits to distinguish them. Therefore, we may hope for an $O(\log \log n)$-bit algorithm. In fact, the following Morris algorithm gives us the desired bound:
1. Initialize $X \leftarrow 0$.
2. For each event, increment $X$ with probability $\frac{1}{2^X}$.
3. Output $\tilde{n} = 2^X - 1$.
Intuitively, we have $X \approx \lg n$, where $\lg x = \log_2(2 + x)$. Before giving a rigorous analysis of the algorithm (in Section 5), we first give a probability review.
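As a concrete companion to the pseudocode above, here is a direct Python transcription (a toy sketch; the class and variable names are our own, not from the notes):

import random

class MorrisCounter:
    def __init__(self):
        self.X = 0  # stores roughly lg(n), so O(log log n) bits

    def increment(self):
        # For each event, increment X with probability 1/2^X.
        if random.random() < 2.0 ** (-self.X):
            self.X += 1

    def estimate(self):
        # Output n~ = 2^X - 1; the standard analysis shows E[2^X] = n + 1.
        return 2 ** self.X - 1

c = MorrisCounter()
for _ in range(100_000):
    c.increment()
print(c.X, c.estimate())  # X is around lg(100000); the estimate is unbiased but noisy

A single counter has large variance; the guarantee (2) is obtained by averaging independent copies, as in the rigorous analysis the notes develop later.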
4 Probability Review
We are mainly discussing discrete random variables. Let the random variable $X$ take values in $S$. The expectation of $X$ is defined to be $\mathbb{E} X = \sum_{j \in S} j \cdot \mathbb{P}(X = j)$.
Lemma 1 (Linearity of expectation).
$$\mathbb{E}(X + Y) = \mathbb{E} X + \mathbb{E} Y. \qquad (3)$$
Lemma 2 (Markov). If $X$ is a non-negative random variable, then
$$\forall \lambda > 0, \quad \mathbb{P}(X > \lambda) < \frac{\mathbb{E} X}{\lambda}. \qquad (4)$$
Lemma 3 (Chebyshev).
$$\forall \lambda > 0, \quad \mathbb{P}(|X - \mathbb{E} X| > \lambda) < \frac{\mathbb{E}(X - \mathbb{E} X)^2}{\lambda^2}. \qquad (5)$$
Proof. $\mathbb{P}(|X - \mathbb{E} X| > \lambda) = \mathbb{P}\big((X - \mathbb{E} X)^2 > \lambda^2\big)$, and the claim follows by Markov.
Moreover, Chebyshev can be generalized to higher moments:
$$\forall p > 0, \ \forall \lambda > 0, \quad \mathbb{P}(|X - \mathbb{E} X| > \lambda) < \frac{\mathbb{E} |X - \mathbb{E} X|^p}{\lambda^p}. \qquad (6)$$
Lemma 4 (Chernoff). Let $X_1, \ldots, X_n$ be independent random variables with $X_i \in [0, 1]$, and let $X = \sum_i X_i$. Then for $\lambda > 0$,
$$\mathbb{P}(|X - \mathbb{E} X| > \lambda \cdot \mathbb{E} X) \le 2 \cdot e^{-\lambda^2 \cdot \mathbb{E} X / 3}. \qquad (7)$$
Proof. Since the proof is quite standard and the details can be found both in the previous scribe notes (Lecture 1 in Fall 2013, http://people.seas.harvard.edu/~minilek/cs229r/fall13/lec/lec1.pdf) and on Wikipedia (https://en.wikipedia.org/wiki/Chernoff_bound), we only include a proof sketch here. We can prove that both $\mathbb{P}(X - \mathbb{E} X > \lambda \cdot \mathbb{E} X)$ and $\mathbb{P}(X - \mathbb{E} X < -\lambda \cdot \mathbb{E} X)$ are smaller than $e^{-\lambda^2 \cdot \mathbb{E} X / 3}$, and then apply a union bound to prove the lemma.
The proof for $\mathbb{P}(X - \mathbb{E} X < -\lambda \cdot \mathbb{E} X) < e^{-\lambda^2 \cdot \mathbb{E} X / 3}$ is symmetric to that for $\mathbb{P}(X - \mathbb{E} X > \lambda \cdot \mathbb{E} X) < e^{-\lambda^2 \cdot \mathbb{E} X / 3}$, so we can focus on proving the latter. Since for any $t > 0$
$$\mathbb{P}(X - \mathbb{E} X > \lambda \cdot \mathbb{E} X) = \mathbb{P}\big(e^{t(X - \mathbb{E} X)} > e^{t \lambda \mathbb{E} X}\big) < \frac{\mathbb{E}\, e^{t(X - \mathbb{E} X)}}{e^{t \lambda \mathbb{E} X}}$$
by Markov, we can optimize over $t$ to get the desired bound.
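As a quick empirical sanity check of (7), the following Monte Carlo snippet is a sketch under made-up parameters (n, p, lam, and the trial count are arbitrary choices of ours):

import math
import random

n, p, lam, trials = 200, 0.5, 0.2, 20_000
mu = n * p  # E[X] for a sum of n independent Bernoulli(p) variables

deviations = 0
for _ in range(trials):
    X = sum(random.random() < p for _ in range(n))  # one sample of X
    if abs(X - mu) > lam * mu:
        deviations += 1

print(deviations / trials)              # empirical P(|X - E X| > lam * E X)
print(2 * math.exp(-lam**2 * mu / 3))   # the Chernoff bound (7), about 0.53

The empirical frequency (roughly 0.005 here) falls well below the bound, as it must; Chernoff is loose but has the right exponential shape.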
Lemma 5 (Bernstein). Let $X_1, \ldots, X_n$ be independent random variables with $|X_i| \le K$ for all $i$. Let $X = \sum_i X_i$ and $\sigma^2 = \sum_i \mathbb{E}(X_i - \mathbb{E} X_i)^2$. Then for all $t > 0$,
$$\mathbb{P}(|X - \mathbb{E} X| > t) \lesssim e^{-c t^2 / \sigma^2} + e^{-c t / K}, \qquad (8)$$
where $\lesssim$ means $\le$ up to a constant factor, and $c$ is a constant.
Proof. First, we define the $p$-norm ($p \ge 1$) of a random variable $Z$ to be $\|Z\|_p = (\mathbb{E} |Z|^p)^{1/p}$. In the proof, we will also use Jensen's inequality: if $f$ is convex, then $f(\mathbb{E} Z) \le \mathbb{E} f(Z)$.
To prove Bernstein, it is equivalent to show (the equivalence is left to the pset) that
$$\forall p \ge 1, \quad \Big\| \sum_i X_i - \mathbb{E} \sum_i X_i \Big\|_p \lesssim \sqrt{p} \cdot \sigma + p \cdot K. \qquad (9)$$
Let $Y_i$ be distributed identically to $X_i$, with $\{X_i \mid i = 1, \ldots, n\}$ and $\{Y_i \mid i = 1, \ldots, n\}$ independent. We have
$$\begin{aligned}
\Big\| \sum_i X_i - \mathbb{E} \sum_i X_i \Big\|_p
&= \Big\| \mathbb{E}_Y \Big( \sum_i X_i - \sum_i Y_i \Big) \Big\|_p && (10) \\
&\le \Big\| \sum_i (X_i - Y_i) \Big\|_p && \text{(Jensen's inequality)} \quad (11) \\
&= \Big\| \sum_i \alpha_i (X_i - Y_i) \Big\|_p && \text{(add uniform random signs } \alpha_i = \pm 1\text{)} \quad (12) \\
&\le \Big\| \sum_i \alpha_i X_i \Big\|_p + \Big\| \sum_i \alpha_i Y_i \Big\|_p && \text{(triangle inequality)} \quad (13) \\
&= 2 \Big\| \sum_i \alpha_i X_i \Big\|_p && (14) \\
&= 2 \sqrt{\tfrac{\pi}{2}} \cdot \Big\| \mathbb{E}_g \sum_i \alpha_i |g_i| X_i \Big\|_p && \text{(let } g \text{ be a vector of iid Gaussians)} \quad (15) \\
&\lesssim \Big\| \sum_i \alpha_i |g_i| X_i \Big\|_p && \text{(Jensen's inequality)} \quad (16) \\
&= \Big\| \sum_i g_i X_i \Big\|_p && (17)
\end{aligned}$$
Note that, conditioned on the $X_i$ and $\alpha_i$, the sum $\sum_i \alpha_i |g_i| X_i$ is Gaussian with variance $\sum_i X_i^2$ (and $\alpha_i |g_i|$ has the same distribution as $g_i$, which gives (17)). The $p$th moment of a Gaussian $Z \sim \mathcal{N}(0, 1)$ is
$$\mathbb{E} Z^p = \begin{cases} 0, & p \text{ odd}, \\ \dfrac{p!}{(p/2)! \, 2^{p/2}} \le \sqrt{p}^{\,p}, & p \text{ even}. \end{cases} \qquad (18)$$
Therefore,
$$\begin{aligned}
\Big\| \sum_i g_i X_i \Big\|_p
&\le \sqrt{p} \cdot \Big\| \Big( \sum_i X_i^2 \Big)^{1/2} \Big\|_p && (19) \\
&= \sqrt{p} \cdot \Big\| \sum_i X_i^2 \Big\|_{p/2}^{1/2} && (20) \\
&\le \sqrt{p} \cdot \Big\| \sum_i X_i^2 \Big\|_p^{1/2} && (\|Z\|_p \le \|Z\|_q \text{ for } p < q) \quad (21) \\
&= \sqrt{p} \cdot \Big\| \sum_i X_i^2 - \mathbb{E} \sum_i X_i^2 + \mathbb{E} \sum_i X_i^2 \Big\|_p^{1/2} && (22) \\
&\le \sqrt{p} \Big[ \Big\| \mathbb{E} \sum_i X_i^2 \Big\|_p^{1/2} + \Big\| \sum_i X_i^2 - \mathbb{E} \sum_i X_i^2 \Big\|_p^{1/2} \Big] && (23) \\
&= \sigma \sqrt{p} + \sqrt{p} \cdot \Big\| \sum_i X_i^2 - \mathbb{E} \sum_i X_i^2 \Big\|_p^{1/2} && (24) \\
&\lesssim \sigma \sqrt{p} + \sqrt{p} \cdot \Big\| \sum_i g_i X_i^2 \Big\|_p^{1/2} && \text{(apply the same trick (10)--(17))} \quad (25)
\end{aligned}$$
Note that $\sum_i g_i X_i^2$ is Gaussian with variance $\sum_i X_i^4 \le K^2 \cdot \sum_i X_i^2$, and $\sum_i g_i X_i$ is Gaussian with variance $\sum_i X_i^2$,