

Mining of Massive Datasets

Anand Rajaraman

Kosmix, Inc.

Jeffrey D. Ullman

Stanford Univ.

Copyright © 2010, 2011 Anand Rajaraman and Jeffrey D. Ullman

Contents

1 Data Mining
1.1 What is Data Mining?
1.1.1 Statistical Modeling
1.1.2 Machine Learning
1.1.3 Computational Approaches to Modeling
1.1.4 Summarization
1.1.5 Feature Extraction
1.2 Statistical Limits on Data Mining
1.2.1 Total Information Awareness
1.2.2 Bonferroni’s Principle
1.2.3 An Example of Bonferroni’s Principle
1.2.4 Exercises for Section 1.2
1.3 Things Useful to Know
1.3.1 Importance of Words in Documents
1.3.2 Hash Functions
1.3.3 Indexes
1.3.4 Secondary Storage
1.3.5 The Base of Natural Logarithms
1.3.6 Power Laws
1.3.7 Exercises for Section 1.3
1.4 Outline of the Book
1.5 Summary of Chapter 1
1.6 References for Chapter 1

2 Large-Scale File Systems and Map-Reduce
2.1 Distributed File Systems
2.1.1 Physical Organization of Compute Nodes
2.1.2 Large-Scale File-System Organization
2.2 Map-Reduce
2.2.1 The Map Tasks
2.2.2 Grouping and Aggregation
2.2.3 The Reduce Tasks
2.2.4 Combiners
