Data Algorithms: Recipes for Scaling Up with Hadoop and Spark
by Mahmoud Parsian
Publisher: O'Reilly Media, Inc.
Release Date: February 15, 2015
ISBN: 9781491906187
Book Description
Learn the algorithms and tools you need to build MapReduce applications with Hadoop for processing gigabyte, terabyte, or
petabyte-sized datasets on clusters of commodity hardware. With this practical book, author Mahmoud Parsian, head of the big
data team at Illumina, takes you step by step through the design of machine-learning algorithms, such as Naive Bayes and
Markov Chain, and shows you how to apply them to clinical and biological datasets, using MapReduce design patterns.
Table of Contents
1. Preface
2. 0.1 Introduction
3. 0.2 Relationship of Spark and Hadoop
4. 0.3 What is MapReduce?
5. 0.4 Why use MapReduce?
6. 0.5 What Is in This Book?
7. 0.6 What Is the Focus of This Book?
8. 0.7 What are Core Concepts of MapReduce/Hadoop?
9. 0.8 Is MapReduce for Everything?
10. 0.9 What is not MapReduce
11. 0.10 Who Is This Book For?
12. 0.11 What Software Is Used in This Book?
13. 0.12 Using Code Examples
14. 0.13 Where NOT to use MapReduce?
15. 0.14 Chapters in This Book?
16. 0.15 Online Resources
17. 0.16 Comments and Questions for This Book?
18. 1 Secondary Sort: Introduction
19. 1.1 What is a Secondary Sort Problem?
20. 1.2 Solutions to Secondary Sort Problem
21. 1.2.1 Sort Order of Intermediate Keys
22. 1.3 Data Flow Using Plug-in Classes
23. 1.4 MapReduce/Hadoop Solution
24. 1.4.1 Input
25. 1.4.2 Expected Output
26. 1.4.3 map() function
27. 1.4.4 reduce() function
28. 1.4.5 Hadoop Implementation
29. 1.4.6 Sample Run of Hadoop Implementation
30. 1.4.7 Sample Run
31. 1.5 What If Sorting Ascending or Descending
32. 1.6 Spark Solution To Secondary Sorting
33. 1.6.1 Time-Series as Input
34. 1.6.2 Expected Output
35. 1.6.3 Option-1: Secondary Sorting in Memory
36. 1.6.4 Spark Sample Run
37. 1.6.5 Option-2: Secondary Sorting using Framework
38. 2 Secondary Sorting: Detailed Example
39. 2.1 Introduction
40. 2.2 Secondary Sorting Technique
41. 2.3 Complete Example of Secondary Sorting
42. 2.3.1 Problem Statement
43. 2.3.2 Input Format
44. 2.3.3 Output Format
45. 2.3.4 Composite Key
46. 2.3.5 Sample Run
47. 2.4 Secondary Sort using New Hadoop API
48. 3 Top 10 List
49. 3.1 Introduction
50. 3.2 Top-N Formalized
51. 3.3 MapReduce Solution
52. 3.4 Implementation in Hadoop
53. 3.4.1 Input
54. 3.4.2 Sample Run 1: find top 10 list
55. 3.4.3 Output
56. 3.4.4 Sample Run 2: find top 5 list
57. 3.5 Bottom 10
58. 3.6 Spark Implementation: Unique Keys
59. 3.6.1 Introduction
60. 3.6.2 What is an RDD?
61. 3.6.3 Spark's Function Classes
62. 3.6.4 Spark Solution for Top-10 Pattern
63. 3.6.5 Complete Spark Solution for Top-10 Pattern
64. 3.6.6 Input
65. 3.6.7 Sample Run: find top-10 list
66. 3.7 What If for Top-N
67. 3.7.1 Shared Data Structures Definition and Usage
68. 3.8 What If for Bottom-N
69. 3.9 Spark Implementation: Non-Unique Keys
70. 3.9.1 Complete Spark Solution for Top-10 Pattern
71. 4 Left Outer Join in MapReduce
72. 4.1 Introduction
73. 4.2 Implementation of Left Outer Join in MapReduce
74. 4.2.1 MapReduce Phase-1
75. 4.2.2 MapReduce Phase-2: Counting Unique Locations
76. 4.2.3 Implementation Classes in Hadoop
77. 4.3 Sample Run
78. 4.3.1 Input for Phase-1
79. 4.3.2 Run Phase-1
80. 4.3.3 View Output of Phase-1 (Input of Phase-2)
81. 4.3.4 Run Phase-2
82. 4.3.5 View Output of Phase-2
83. 4.4 Spark Implementation
84. 4.4.1 Spark Program
85. 4.4.2 STEP-0: Import Required Classes
86. 4.4.3 STEP-1: Read Input Parameters
87. 4.4.4 STEP-2: Create JavaSparkContext Object
88. 4.4.5 STEP-3: Create a JavaPairRDD for Users
89. 4.4.6 STEP-4: Create a JavaPairRDD for Transactions
90. 4.4.7 STEP-5: Create a union of RDD's created by STEP-3 and STEP-4
91. 4.4.8 STEP-6: Create a JavaPairRDD(userID, List(T2)) by calling groupBy()
92. 4.4.9 STEP-7: Create a productLocationsRDD as JavaPairRDD(String,String)
93. 4.4.10 STEP-8: Find all locations for a product
94. 4.4.11 STEP-9: Finalize output by changing "value"
95. 4.4.12 STEP-10: Print the final result RDD
96. 4.4.13 Running Spark Solution
97. 4.5 Running Spark on YARN
98. 4.5.1 Script to Run Spark on YARN
99. 4.5.2 Running Script
100. 4.5.3 Checking Expected Output
101. 4.6 Left Outer Join by Spark's leftOuterJoin()
102. 4.6.1 High-Level Steps
103. 4.6.2 STEP-0: import required classes and interfaces
104. 4.6.3 STEP-1: read input parameters
105. 4.6.4 STEP-2: create Spark's context object
106. 4.6.5 STEP-3: create RDD for user's data
107. 4.6.6 STEP-4: Create usersRDD: The "right" Table
108. 4.6.7 STEP-5: create transactionRDD for transaction's data
109. 4.6.8 STEP-6: Create transactionsRDD: The Left Table
110. 4.6.9 STEP-7: use Spark's built-in JavaPairRDD.leftOuterJoin() method
111. 4.6.10 STEP-8: create (product, location) pairs
112. 4.6.11 STEP-9: group (K=product, V=location) pairs by K
113. 4.6.12 STEP-10: create final output (K=product, V=Set(location))
114. 4.6.13 Sample Run by YARN
115. 5 Order Inversion Pattern
116. 5.1 Introduction
117. 5.2 Example of Order Inversion Pattern
118. 5.3 MapReduce for Order Inversion Pattern
119. 5.3.1 Custom Partitioner
120. 5.3.2 Relative Frequency Mapper
121. 5.3.3 Relative Frequency Reducer
122. 5.3.4 Implementation Classes in Hadoop
123. 5.4 Sample Run
124. 5.4.1 Input
125. 5.4.2 Running MapReduce Job
126. 5.4.3 Generated Output
127. 6 Moving Average
128. 6.1 Introduction
129. 6.1.1 Example-1: Time Series Data
130. 6.1.2 Example-2: Time Series Data
131. 6.2 Formal Definition
132. 6.3 Moving Average by POJO
133. 6.3.1 First Solution: using Queue
134. 6.3.2 Second Solution: using Array
135. 6.3.3 Testing of Moving Average
136. 6.3.4 Sample Run
137. 6.4 MapReduce Solution
138. 6.4.1 Input
139. 6.4.2 Output
140. 6.4.3 MapReduce Solution: Option-1: sort in RAM
141. 6.4.4 Hadoop Implementation: sort in RAM
142. 6.4.5 Sample Run
143. 6.4.6 MapReduce Solution: Option-2: Sort by MR Framework
144. 6.5 Sample Run
145. 7 Market Basket Analysis
146. 7.1 What is Market Basket Analysis?
147. 7.2 MapReduce/Hadoop Solution
148. 7.3 What are the Application areas for MBA?
149. 7.4 Market Basket Analysis using MapReduce
150. 7.4.1 Mapper Formal
151. 7.4.2 Reducer
152. 7.5 MapReduce/Hadoop Implementation Classes
153. 7.5.1 Find Sorted Combinations
154. 7.5.2 Market Basket Analysis Driver: MBADriver
155. 7.5.3 Market Basket Analysis Mapper: MBAMapper
156. 7.5.4 Sample Run
157. 7.6 Spark/Hadoop Solution
158. 7.6.1 MapReduce Algorithm
159. 7.6.2 Input
160. 7.6.3 Spark Implementation
161. 7.6.4 Creating Item Sets From Transactions
162. 8 Common Friends
163. 8.1 Introduction
164. 8.2 Input
165. 8.3 Common Friends Algorithm
166. 8.4 MapReduce Algorithm
167. 8.4.1 MapReduce Algorithm in Action
168. 8.5 Solution 1: Hadoop Implementation using Text
169. 8.5.1 Sample Run for Solution 1
170. 8.6 Solution 2: Hadoop Implementation using ArrayListOfLongsWritable
171. 8.6.1 Sample Run for Solution 2
172. 8.7 Spark Solution
173. 8.7.1 STEP-0: Import Required Classes
174. 8.7.2 STEP-1: Check Input Parameters
175. 8.7.3 STEP-2: Create a JavaSparkContext Object
176. 8.7.4 STEP-3: Read Input
177. 8.7.5 STEP-4: Apply a Mapper
178. 8.7.6 STEP-5: Apply a Reducer
179. 8.7.7 STEP-6: Find Common Friends
180. 8.8 Sample Run of a Spark Program
181. 8.8.1 HDFS Input
182. 8.8.2 Script to Run Spark Program
183. 8.8.3 Log of Sample Run
184. 9 Recommendation Engines using MapReduce
185. 9.1 Customers Who Bought This Item Also Bought
186. 9.1.1 Input
187. 9.1.2 Expected Output
188. 9.1.3 MapReduce Solution
189. 9.2 Frequently Bought Together
190. 9.2.1 Input
191. 9.2.2 MapReduce Solution
192. 9.3 Recommend People Connection
193. 9.3.1 Input
194. 9.3.2 Output
195. 9.3.3 MapReduce Solution
196. 9.4 Spark Implementation
197. 9.4.1 STEP-0: Import Required Classes
198. 9.4.2 STEP-1: Handle Input Parameters
199. 9.4.3 STEP-2: Create Spark's Context Object
200. 9.4.4 STEP-3: Read HDFS Input File
201. 9.4.5 STEP-4: Implement map() Function
202. 9.4.6 STEP-5: Implement reduce() Function
203. 9.4.7 STEP-6: Generate Final Output
204. 9.4.8 Convenient Methods
205. 9.4.9 HDFS Input
206. 9.4.10 Script to Run Spark Program
207. 9.4.11 Program Run Log
208. 10 Content-Based Recommendation: Movies
209. 10.1 Input
210. 10.2 MapReduce PHASE-1
211. 10.3 MapReduce PHASE-2 and PHASE-3
212. 10.4 MapReduce-Phase-2 Mapper
213. 10.5 MapReduce-Phase-2 Reducer
214. 10.6 MapReduce-Phase-3 Mapper
215. 10.7 MapReduce-Phase-3 Reducer
216. 10.8 More Similarity Measures
217. 10.9 Movie Recommendation in Spark
218. 10.9.1 High-Level Solution in Spark
219. 10.9.2 High-Level Solution: All Steps
220. 10.9.3 STEP-0: Import Required Classes
221. 10.9.4 STEP-1: Handle Input Parameters
222. 10.9.5 STEP-2: Create a Spark's Context Object
223. 10.9.6 STEP-3: Read Input File and Create RDD
224. 10.9.7 STEP-4: Find Who Has Rated Movies
225. 10.9.8 STEP-5: Group moviesRDD by Movie