Hadoop Application Architectures

Uploaded: 2015-04-06 | Size: 9.2 MB
Preface ix
1. Introduction 1
2. Data Modeling in Hadoop 5
Data Storage Options 6
Standard File Formats 7
Hadoop File Types 9
Serialization Formats 11
Columnar formats 13
Compression 15
HDFS Schema Design 18
Location of HDFS files 19
Advanced HDFS Schema design 21
HBase Schema Design 24
Row Key 25
Timestamp 27
Hops 28
Tables and Regions 29
Using Columns 30
Using Column Families 32
Time-to-live (TTL) 32
Managing Metadata 33
What is metadata? 33
Why care about metadata? 34
Where to store metadata? 34
Examples of managing metadata 36
Limitations of Hive metastore and HCatalog 36
Other ways of storing metadata 37
3. Data Movement 39
Data Ingestion Considerations 39
Timeliness of Data Ingestion 40
Incremental Updates 42
Access Patterns 43
Original Source System and Data Structure 44
Transformations 47
Network Bottleneck 48
Network Security 48
Push or Pull 48
Build to Handle Failure 49
Level of Complexity 50
Data Ingestion Options 50
File Transfers 51
Considerations for File Transfers vs. Other Ingest Methods 54
Sqoop 55
Flume 60
Kafka 70
Data Extraction 75
Conclusion 76
4. Processing Data in Hadoop 79
MapReduce 80
MapReduce Overview 80
Example For MapReduce 88
When to Use MapReduce 94
Spark 94
Spark Overview 94
Overview of Spark Components 95
Basic Spark concepts 96
Benefits of using Spark 100
Spark Example 101
When to Use Spark 103
Abstractions 104
Pig 106
Pig Example 106
Crunch 109
Crunch Example 110
When to use Crunch 114
Abstraction: Cascading 114
Cascading Example 115
When to use Cascading 117
SQL: Hive 118
Hive Overview 118
Example of Hive Code 120
When to use Hive 124
Impala 125
Impala overview 126
Designed for Speed 127
Example of Using Impala 129
When to Use Impala 130
Conclusion 131
5. Common Hadoop Processing Patterns 133
Pattern: Removing Duplicate Records by Primary Key 133
Data Generation for Deduplication Example 134
Code Example: Spark Deduplication in Scala 135
Code Example: Deduplication in SQL 137
Pattern: Windowing Analysis 138
Data Generation for Windowing Analysis Example 139
Code Example: Peaks and Valleys in Spark 140
Code Example: Peaks and Valleys in SQL 143
Windowing Summary 145
Pattern: Time Series Modifications 145
Code Example: Data Generation for Time Series Example 148
Code Example: Time Series in Spark 149
Code Example: Time Series in SQL 151
Conclusion 154
6. Graph Processing on Hadoop 155
What is a Graph? 155
So What is Graph Processing? 157
So how do you Process a Graph in a Distributed System? 158
Giraph 161
GraphX 170
Just another RDD 170
GraphX Pregel Interface 171
GraphX versus Giraph 174
Which Tool to Use? 174
Conclusion 174
7. Orchestration 177
Why Do We Need Workflow Orchestration 177
Bring Your Own Orchestration 178
The enterprise job scheduler and Hadoop 179
Terminology 181
Oozie 181
Oozie Overview 181
Oozie workflow 185
Azkaban 186
Azkaban overview 186
Workflow Patterns 187
Point-to-point workflow 187
Fan-out workflow 190
Fan-out workflow in Oozie 190
Capture-and-decide Workflow 194
Parameterizing workflows 200
Parameterizing workflows in Oozie 200
Parameterizing workflows in Azkaban 201
Scheduling patterns 201
Frequency scheduling 202
Time and Data triggers 202
More on Oozie 207
Executing an Oozie Workflow or a Coordinator 207
More on Azkaban 207
Executing or scheduling an Azkaban flow 207
Conclusion 207
8. Near Real-Time Processing with Hadoop 209
Stream Processing 211
Apache Storm 214
Storm Example 222
Trident 230
Trident Example 230
Spark Streaming 233
Flume Interceptors 240
Which tool to use? 240
Conclusion 244
9. Clickstream Analysis 245
Defining the use case 245
Using Hadoop for Clickstream Analysis 247
Design Overview 247
Detailed Design 248
Ingestion 249
Processing 257
Analyzing 263
Orchestration 264
Sessionization 266
Sessionization in Hive 268
Sessionization in Spark 269
Sessionization in MapReduce 270
Sessionization in Pig 270
Data Deduplication 270
Deduplication in Hive 270
Conclusion 271
10. Introduction to Fraud Detection 273
Example 275
High Level Design 275
Transaction Approval Logic 277
Profile Storage and Retrieval 278
Caching 278
HBase Data Definition 280
Delivering Transaction Status - Approved or Denied? 283
Ingest 284
Path Between the Client and Flume 284
Near Real-Time and Exploratory Analytics 290
Near Real-Time Processing 290
Exploratory Analytics 292
Conclusion 294
11. Data Warehouse 295
Use case and data set 300
OLTP schema 301
Data Warehouse schema 302
Data Warehousing with Hadoop 304
Data Modeling and Storage 305
HDFS Schema design 316
Ingestion 316
Data Processing and Access 320
Aggregations 322
Data Export 323
Orchestration 324
Final architecture 325
Conclusion 326
A. Joins in Impala 327
Overall rating: 4.6 (14 user ratings)

Comments (10 total)

ecnuzp 2016-07-20 11:54:48
Explains how to integrate Hadoop into existing or new data storage and management systems.
dianeylee 2016-06-27 14:58:35
Seeing that there is even a CTO among the commenters makes me feel inadequate.
technicaec7 2016-04-25 23:04:40
Clear content; good for learning Hadoop application architecture.
wangbn99 2016-03-21 10:40:19
Still worth skimming through, thanks!
ziyouma_lijin 2016-01-14 17:33:57
Good, exactly what our CTO wanted to read. It is just a pity there is no Chinese edition.
mumat 2015-10-10 16:29:11
Thank you very much for sharing. Keep it up, and let's encourage each other!
paulwong888 2015-07-24 13:42:17
A pity this is the early release edition. Is there a complete version?
bandari_cn 2015-06-03 15:38:17
The 2015 early release preview edition. Highly recommended!
erikaeriga 2015-05-18 04:18:45
A fairly new book, very helpful for understanding architecture. Recommended.
o200830740204 2015-05-15 00:47:22
Last year's book; I could only find the English edition. Bumping this. Verified.
