Mining of Massive Datasets (2nd edition)
Big Data: Mining of Massive Internet Datasets and Distributed Processing (2nd edition), English version
Mining of Massive Datasets
Jure Leskovec
Stanford Univ.
Anand Rajaraman
Milliway Labs
Jeffrey D. Ullman
Stanford Univ.
Copyright © 2010, 2011, 2012, 2013, 2014 Anand Rajaraman, Jure Leskovec,
and Jeffrey D. Ullman
Preface
This book evolved from material developed over several years by Anand
Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course
CS345A, titled “Web Mining,” was designed as an advanced graduate course,
although it has become accessible and interesting to advanced undergraduates.
When Jure Leskovec joined the Stanford faculty, we reorganized the material
considerably. He introduced a new course CS224W on network analysis and
added material to CS345A, which was renumbered CS246. The three authors
also introduced a large-scale data-mining project course, CS341. The book now
contains material taught in all three courses.
What the Book Is About
At the highest level of description, this book is about data mining. However,
it focuses on data mining of very large amounts of data, that is, data so large
it does not fit in main memory. Because of the emphasis on size, many of our
examples are about the Web or data derived from the Web. Further, the book
takes an algorithmic point of view: data mining is about applying algorithms
to data, rather than using data to “train” a machine-learning engine of some
sort. The principal topics covered are:
1. Distributed file systems and map-reduce as a tool for creating parallel
algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-
sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data
that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google’s PageRank, link-spam
detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the
A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and
recommendation systems.
8. Algorithms for analyzing and mining the structure of very large graphs,
especially social-network graphs.
9. Techniques for obtaining the important properties of a large data set by
dimensionality reduction, including singular-value decomposition and
latent semantic indexing.
10. Machine-learning algorithms that can be applied to very large data, such
as perceptrons, support-vector machines, and gradient descent.
Prerequisites
To appreciate fully the material in this book, we recommend the following
prerequisites:
1. An introduction to database systems, covering SQL and related program-
ming systems.
2. A sophomore-level course in data structures, algorithms, and discrete
math.
3. A sophomore-level course in software systems, software engineering, and
programming languages.
Exercises
The book contains extensive exercises, with some for almost every section. We
indicate harder exercises or parts of exercises with an exclamation point. The
hardest exercises have a double exclamation point.
Support on the Web
Go to http://www.mmds.org for slides, homework assignments, project
requirements, and exams from courses related to this book.
Gradiance Automated Homework
There are automated exercises based on this book, using the Gradiance
root-question technology, available at www.gradiance.com/services. Students may
enter a public class by creating an account at that site and entering the class
with code 1EDD8A1D. Instructors may use the site by making an account there
and then emailing support at gradiance dot com with their login name, the
name of their school, and a request to use the MMDS materials.
Acknowledgements
Cover art is by Scott Ullman.
We would like to thank Foto Afrati, Arun Marathe, and Rok Sosic for critical
readings of a draft of this manuscript.
Errors were also reported by Rajiv Abraham, Ruslan Aduk, Apoorv Agarwal,
Aris Anagnostopoulos, Yokila Arora, Stefanie Anna Baby, Atilla Soner
Balkir, Arnaud Belletoile, Robin Bennett, Susan Biancani, Amitabh Chaudhary,
Leland Chen, Hua Feng, Marcus Gemeinder, Anastasios Gounaris, Clark
Grubb, Shrey Gupta, Waleed Hameid, Saman Haratizadeh, Julien Hoachuck,
Przemyslaw Horban, Jeff Hwang, Rafi Kamal, Lachlan Kang, Ed Knorr,
Haewoon Kwak, Ellis Lau, Greg Lee, David Z. Liu, Ethan Lozano, Yunan Luo,
Michael Mahoney, Justin Meyer, Bryant Moscon, Brad Penoff, John Phillips,
Philips Kokoh Prasetyo, Qi Ge, Harizo Rajaona, Timon Ruban, Rich Seiter,
Hitesh Shetty, Angad Singh, Sandeep Sripada, Dennis Sidharta, Krzysztof
Stencel, Mark Storus, Roshan Sumbaly, Zack Taylor, Tim Triche Jr., Wang Bin,
Weng Zhen-Bin, Robert West, Oscar Wu, Xie Ke, Christopher T.-R. Yeh,
Nicolas Zhao, and Zhou Jingbo. The remaining errors are ours, of course.
J. L.
A. R.
J. D. U.
Palo Alto, CA
March, 2014