Algorithms for Big Data Lecture Notes (Harvard CS229r)
CS 229r: Algorithms for Big Data Fall 2015
Lecture 1 — September 3, 2015
Prof. Jelani Nelson Scribes: Zhengyu Wang
1 Course Information
• Professor: Jelani Nelson
• TF: Jarosław Błasiok
2 Topic Overview
1. Sketching/Streaming
• A “sketch” C(X) with respect to some function f is a compression of the data X. It allows us to compute f(X) (approximately) given access only to C(X).
• Sometimes f has two arguments. For data X and Y, we want to compute f(X, Y) given C(X) and C(Y).
• Motivation: maybe you have some input data and I have some input data, and we want to compute some similarity measure of these two databases across the data items. One way is that I just send you my database and you compute the similarity measure locally, or vice versa. But imagine these are really big data sets, and I don't want to send the entire data across the wire; instead, I compute a sketch of my data and send you the sketch, which is very small after compression. Now the sketch C(X) is much smaller than X, and given the sketch you can compute the function.
• Trivial example: imagine you have a batch of numbers, and I also have a batch of numbers, and we want to compute their sum. The sketch I can use is just the sum of all my input data, which I then send to you.
• Streaming: we want to maintain a sketch C(X) on the fly as X is updated. In the previous example, if the numbers arrive on the fly, I can keep a running sum, which is a streaming algorithm. The streaming setting appears in many places; for example, your router can monitor online traffic, sketching the packet counts to detect traffic patterns. (A minimal code sketch of the running-sum example follows this item.)
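To make the running-sum example concrete, here is a minimal Python sketch (the class and method names are our own, for illustration only): each party maintains a tiny running sum as its stream arrives, and the two sketches combine by addition without exchanging the raw data.

class SumSketch:
    def __init__(self):
        self.total = 0  # the entire state of the sketch

    def update(self, x):
        # Streaming: process one element on the fly, without storing it.
        self.total += x

    def merge(self, other):
        # Combine sketches computed on different data sets.
        merged = SumSketch()
        merged.total = self.total + other.total
        return merged

mine, yours = SumSketch(), SumSketch()
for x in [3, 1, 4, 1, 5]:
    mine.update(x)
for x in [9, 2, 6]:
    yours.update(x)
print(mine.merge(yours).total)  # 31, without sending the raw data across the wire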
2. Dimensionality Reduction
• Input data is high-dimensional. Dimensionality reduction transforms the high-dimensional data into a lower-dimensional version such that, for the computational problem you are considering, solving the problem on the lower-dimensional transformed data yields an approximate solution for the original data. Since the data is now low-dimensional, your algorithm can run faster.
• Applications: speeding up clustering, nearest neighbor search, etc. (A minimal numerical sketch follows this item.)
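As a minimal numerical sketch of such a transform, the snippet below uses a random Gaussian projection. This particular map is one standard example rather than the only option, and the dimensions and seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
n, d, m = 100, 10_000, 500                    # n points in dimension d; target dimension m
A = rng.standard_normal((m, d)) / np.sqrt(m)  # random projection matrix
X = rng.standard_normal((n, d))               # high-dimensional input data

Y = X @ A.T                                   # lower-dimensional versions of the points
# Pairwise distances are approximately preserved, so clustering or nearest
# neighbor on Y approximates the answer on X, at lower cost.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))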
3. Large-scale Machine Learning
• For example, regression problems: we collect data points $\{(z_i, b_i) \mid i = 1, \ldots, n\}$ such that $b_i = f(z_i) + \text{noise}$. We want to recover $\tilde{f}$ “close” to $f$.
• Linear regression: $f(z) = \langle x, z \rangle$, where $x$ is the parameter that we want to recover. If the noise is Gaussian, the popular (and in some sense optimal) estimator we use is least squares:
$$x^{LS} = \operatorname{arg\,min}_x \|Zx - b\|_2^2 = (Z^T Z)^{-1} Z^T b, \qquad (1)$$
where $b = (b_1, \ldots, b_n)^T$ and $Z = (z_1, \ldots, z_n)^T$. If $Z$ is big, the matrix multiplication can be very expensive. In this course, we will study techniques that allow us to solve least squares much faster than just computing the closed form $(Z^T Z)^{-1} Z^T b$. (A toy numerical check of (1) appears at the end of this item.)
• Other regression problems: PCA (Principal Component Analysis), matrix completion. For example, matrix completion for the Netflix problem: you are given a big product-customer matrix of customer ratings of certain products. The matrix is very sparse because not every user rates every product. Based on this limited information, you want to fill in the rest of the matrix to make product suggestions.
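As a toy numerical check of the closed form (1) (a hedged sketch: the synthetic data and dimensions are our own, and directly forming $(Z^T Z)^{-1}$ is exactly the expensive computation the course's techniques aim to avoid for large $Z$):

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
Z = rng.standard_normal((n, d))                  # rows are the data points z_i
x_true = rng.standard_normal(d)                  # unknown parameter x
b = Z @ x_true + 0.1 * rng.standard_normal(n)    # b_i = <x, z_i> + Gaussian noise

x_closed = np.linalg.inv(Z.T @ Z) @ Z.T @ b      # the closed form (1)
x_lstsq, *_ = np.linalg.lstsq(Z, b, rcond=None)  # numerically preferable library solver

print(np.allclose(x_closed, x_lstsq))            # True: both recover x up to noise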
4. Compressed Sensing
• Motivation: compress / cheaply acquire a high-dimensional signal (using linear measurements).
• For example, images are very high-dimensional vectors. If the dimensions of an image are thousands by thousands, the image has millions of pixels. If we write the image in the standard basis, i.e., as pixels, it is likely that the representation is not sparse (by sparse we mean that most coordinates are almost zero), because even a photo taken in a dark room has some intensity at most pixels. But there is a basis, the wavelet basis, in which pictures are usually very sparse. Once something is sparse, you can compress it.
• JPEG (image compression).
• MRI (faster acquisition of the signal means less time in the machine).
5. External Memory Model
• Motivation: measure disk I/O’s instead of number of instructions (because random seeks
are very expensive).
• Model: we have an infinite disk divided into blocks of size b bits, and memory of size M divided into pages of size b bits. If the data we want to read, or the location we want to write, is in memory, we can simply do the operation for free; if the location we want to access is not in memory, it costs 1 unit of time to load the block from disk into memory (and likewise to write one back). We want to minimize the number of times we go to disk.
• B-trees are designed for this model.
6. Other Models (time permitting)
• For example, MapReduce.
3 Approximate Counting Problem
In the following, we discuss the problem considered in the first streaming paper [1].
Problem. A batch of events happens, and we want to count the number of events while minimizing the space we use.
Note that we have a trivial solution, namely maintaining a counter, which takes $\log n$ bits, where $n$ is the number of events. On the other hand, by the pigeonhole principle, we cannot beat $\log n$ bits if we want to count exactly.
For the approximate counting problem, we want to output $\tilde{n}$ such that
$$\mathbb{P}(|\tilde{n} - n| > \varepsilon n) < \delta, \qquad (2)$$
where let's say $\varepsilon = 1/3$ and $\delta = 1\%$.
First of all, even if we only want a deterministic algorithm for the approximate counting problem, we cannot beat $\log \log n$ bits: similar to the previous lower-bound argument, there are $\log n$ different bands (one per power of 2), and it takes $\log \log n$ bits to distinguish them. Therefore, we may hope for an $O(\log \log n)$-bit algorithm. In fact, the following Morris algorithm gives us the desired bound:
1. Initialize $X \leftarrow 0$.
2. For each event, increment $X$ with probability $\frac{1}{2^X}$.
3. Output $\tilde{n} = 2^X - 1$.
Intuitively, we have $X \approx \lg n$, where $\lg x = \log_2(2 + x)$. Before giving a rigorous analysis of the algorithm (in Section 5), we first give a probability review.
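As a concrete companion to the pseudocode above, here is a direct Python transcription (a toy sketch; the class and variable names are our own, not from the notes):

import random

class MorrisCounter:
    def __init__(self):
        self.X = 0  # stores roughly lg(n), so O(log log n) bits

    def increment(self):
        # For each event, increment X with probability 1/2^X.
        if random.random() < 2.0 ** (-self.X):
            self.X += 1

    def estimate(self):
        # Output n~ = 2^X - 1; the standard analysis shows E[2^X] = n + 1.
        return 2 ** self.X - 1

c = MorrisCounter()
for _ in range(100_000):
    c.increment()
print(c.X, c.estimate())  # X is around lg(100000); the estimate is unbiased but noisy

A single counter has large variance; the guarantee (2) is obtained by averaging independent copies, as in the rigorous analysis the notes develop later.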
4 Probability Review
We are mainly discussing discrete random variables. Let the random variable $X$ take values in $S$. The expectation of $X$ is defined to be $\mathbb{E} X = \sum_{j \in S} j \cdot \mathbb{P}(X = j)$.
Lemma 1 (Linearity of expectation).
$$\mathbb{E}(X + Y) = \mathbb{E} X + \mathbb{E} Y. \qquad (3)$$
Lemma 2 (Markov). If $X$ is a non-negative random variable, then
$$\forall \lambda > 0, \quad \mathbb{P}(X > \lambda) < \frac{\mathbb{E} X}{\lambda}. \qquad (4)$$
Lemma 3 (Chebyshev).
$$\forall \lambda > 0, \quad \mathbb{P}(|X - \mathbb{E} X| > \lambda) < \frac{\mathbb{E}(X - \mathbb{E} X)^2}{\lambda^2}. \qquad (5)$$
Proof. $\mathbb{P}(|X - \mathbb{E} X| > \lambda) = \mathbb{P}\big((X - \mathbb{E} X)^2 > \lambda^2\big)$, and the claim follows by Markov.
Moreover, Chebyshev can be generalized to higher moments:
$$\forall p > 0, \ \forall \lambda > 0, \quad \mathbb{P}(|X - \mathbb{E} X| > \lambda) < \frac{\mathbb{E} |X - \mathbb{E} X|^p}{\lambda^p}. \qquad (6)$$
Lemma 4 (Chernoff). Let $X_1, \ldots, X_n$ be independent random variables with $X_i \in [0, 1]$, and let $X = \sum_i X_i$. Then for $\lambda > 0$,
$$\mathbb{P}(|X - \mathbb{E} X| > \lambda \cdot \mathbb{E} X) \le 2 \cdot e^{-\lambda^2 \cdot \mathbb{E} X / 3}. \qquad (7)$$
Proof. Since the proof is quite standard and the details can be found both in the previous scribe notes (Lecture 1 in Fall 2013, http://people.seas.harvard.edu/~minilek/cs229r/fall13/lec/lec1.pdf) and on Wikipedia (https://en.wikipedia.org/wiki/Chernoff_bound), we only include a proof sketch here. We can prove that both $\mathbb{P}(X - \mathbb{E} X > \lambda \cdot \mathbb{E} X)$ and $\mathbb{P}(X - \mathbb{E} X < -\lambda \cdot \mathbb{E} X)$ are smaller than $e^{-\lambda^2 \cdot \mathbb{E} X / 3}$, and then apply a union bound to prove the lemma.
The proof for $\mathbb{P}(X - \mathbb{E} X < -\lambda \cdot \mathbb{E} X) < e^{-\lambda^2 \cdot \mathbb{E} X / 3}$ is symmetric to that for $\mathbb{P}(X - \mathbb{E} X > \lambda \cdot \mathbb{E} X) < e^{-\lambda^2 \cdot \mathbb{E} X / 3}$, so we can focus on proving the latter. Since for any $t > 0$
$$\mathbb{P}(X - \mathbb{E} X > \lambda \cdot \mathbb{E} X) = \mathbb{P}\big(e^{t(X - \mathbb{E} X)} > e^{t \lambda \mathbb{E} X}\big) < \frac{\mathbb{E}\, e^{t(X - \mathbb{E} X)}}{e^{t \lambda \mathbb{E} X}}$$
by Markov, we can optimize over $t$ to get the desired bound.
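As a quick empirical sanity check of (7), the following Monte Carlo snippet is a sketch under made-up parameters (n, p, lam, and the trial count are arbitrary choices of ours):

import math
import random

n, p, lam, trials = 200, 0.5, 0.2, 20_000
mu = n * p  # E[X] for a sum of n independent Bernoulli(p) variables

deviations = 0
for _ in range(trials):
    X = sum(random.random() < p for _ in range(n))  # one sample of X
    if abs(X - mu) > lam * mu:
        deviations += 1

print(deviations / trials)              # empirical P(|X - E X| > lam * E X)
print(2 * math.exp(-lam**2 * mu / 3))   # the Chernoff bound (7), about 0.53

The empirical frequency (roughly 0.005 here) falls well below the bound, as it must; Chernoff is loose but has the right exponential shape.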
Lemma 5 (Bernstein). Let $X_1, \ldots, X_n$ be independent random variables with $|X_i| \le K$ for all $i$. Let $X = \sum_i X_i$ and $\sigma^2 = \sum_i \mathbb{E}(X_i - \mathbb{E} X_i)^2$. Then for all $t > 0$,
$$\mathbb{P}(|X - \mathbb{E} X| > t) \lesssim e^{-c t^2 / \sigma^2} + e^{-c t / K}, \qquad (8)$$
where $\lesssim$ means $\le$ up to a constant factor, and $c$ is a constant.
Proof. First, we define the $p$-norm ($p \ge 1$) of a random variable $Z$ to be $\|Z\|_p = (\mathbb{E} |Z|^p)^{1/p}$. In the proof, we will also use Jensen's inequality: if $f$ is convex, then $f(\mathbb{E} Z) \le \mathbb{E} f(Z)$.
To prove Bernstein, it is equivalent to show (the equivalence is left to the pset) that
$$\forall p \ge 1, \quad \Big\| \sum_i X_i - \mathbb{E} \sum_i X_i \Big\|_p \lesssim \sqrt{p} \cdot \sigma + p \cdot K. \qquad (9)$$
Let $Y_i$ be distributed identically to $X_i$, with $\{X_i \mid i = 1, \ldots, n\}$ and $\{Y_i \mid i = 1, \ldots, n\}$ independent. We have
$$\begin{aligned}
\Big\| \sum_i X_i - \mathbb{E} \sum_i X_i \Big\|_p
&= \Big\| \mathbb{E}_Y \Big( \sum_i X_i - \sum_i Y_i \Big) \Big\|_p && (10) \\
&\le \Big\| \sum_i (X_i - Y_i) \Big\|_p && \text{(Jensen's inequality)} \quad (11) \\
&= \Big\| \sum_i \alpha_i (X_i - Y_i) \Big\|_p && \text{(add uniform random signs } \alpha_i = \pm 1\text{)} \quad (12) \\
&\le \Big\| \sum_i \alpha_i X_i \Big\|_p + \Big\| \sum_i \alpha_i Y_i \Big\|_p && \text{(triangle inequality)} \quad (13) \\
&= 2 \Big\| \sum_i \alpha_i X_i \Big\|_p && (14) \\
&= 2 \sqrt{\tfrac{\pi}{2}} \cdot \Big\| \mathbb{E}_g \sum_i \alpha_i |g_i| X_i \Big\|_p && \text{(let } g \text{ be a vector of iid Gaussians)} \quad (15) \\
&\lesssim \Big\| \sum_i \alpha_i |g_i| X_i \Big\|_p && \text{(Jensen's inequality)} \quad (16) \\
&= \Big\| \sum_i g_i X_i \Big\|_p && (17)
\end{aligned}$$
Note that, conditioned on the $X_i$ and $\alpha_i$, the sum $\sum_i \alpha_i |g_i| X_i$ is Gaussian with variance $\sum_i X_i^2$ (and $\alpha_i |g_i|$ has the same distribution as $g_i$, which gives (17)). The $p$th moment of a Gaussian $Z \sim \mathcal{N}(0, 1)$ is
$$\mathbb{E} Z^p = \begin{cases} 0, & p \text{ odd}, \\ \dfrac{p!}{(p/2)! \, 2^{p/2}} \le \sqrt{p}^{\,p}, & p \text{ even}. \end{cases} \qquad (18)$$
Therefore,
$$\begin{aligned}
\Big\| \sum_i g_i X_i \Big\|_p
&\le \sqrt{p} \cdot \Big\| \Big( \sum_i X_i^2 \Big)^{1/2} \Big\|_p && (19) \\
&= \sqrt{p} \cdot \Big\| \sum_i X_i^2 \Big\|_{p/2}^{1/2} && (20) \\
&\le \sqrt{p} \cdot \Big\| \sum_i X_i^2 \Big\|_p^{1/2} && (\|Z\|_p \le \|Z\|_q \text{ for } p < q) \quad (21) \\
&= \sqrt{p} \cdot \Big\| \sum_i X_i^2 - \mathbb{E} \sum_i X_i^2 + \mathbb{E} \sum_i X_i^2 \Big\|_p^{1/2} && (22) \\
&\le \sqrt{p} \Big[ \Big\| \mathbb{E} \sum_i X_i^2 \Big\|_p^{1/2} + \Big\| \sum_i X_i^2 - \mathbb{E} \sum_i X_i^2 \Big\|_p^{1/2} \Big] && (23) \\
&= \sigma \sqrt{p} + \sqrt{p} \cdot \Big\| \sum_i X_i^2 - \mathbb{E} \sum_i X_i^2 \Big\|_p^{1/2} && (24) \\
&\lesssim \sigma \sqrt{p} + \sqrt{p} \cdot \Big\| \sum_i g_i X_i^2 \Big\|_p^{1/2} && \text{(apply the same trick (10)--(17))} \quad (25)
\end{aligned}$$
Note that $\sum_i g_i X_i^2$ is Gaussian with variance $\sum_i X_i^4 \le K^2 \cdot \sum_i X_i^2$, and $\sum_i g_i X_i$ is Gaussian with variance $\sum_i X_i^2$,