778 pages
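Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing. Spark is a general-purpose parallel framework in the style of Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. It keeps the advantages of Hadoop MapReduce, but unlike MapReduce, a job's intermediate output can be held in memory, so it no longer has to be written to and read back from HDFS. This makes Spark a better fit for iterative MapReduce-style algorithms such as those used in data mining and machine learning.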
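The difference between passing intermediate results through storage and keeping them in memory can be sketched with a toy model. This is plain Python, not the Spark or Hadoop API: the `Storage` class, the `step` update rule, and both `run_*` functions are hypothetical names invented for this illustration, and the read/write counters only stand in for the cost of going through HDFS.

```python
# Toy model in plain Python. None of these names come from Spark or Hadoop.


class Storage:
    """Hypothetical stand-in for a distributed file system such as HDFS."""

    def __init__(self):
        self._blob = []
        self.reads = 0
        self.writes = 0

    def write(self, values):
        # Count every write so the storage traffic is visible afterwards.
        self.writes += 1
        self._blob = list(values)

    def read(self):
        # Count every read so the storage traffic is visible afterwards.
        self.reads += 1
        return list(self._blob)


def step(values):
    """One iteration of a made-up update rule: each value moves halfway toward 2."""
    return [v / 2 + 1 for v in values]


def run_via_storage(initial, iterations, storage):
    """MapReduce-style loop: each iteration reads its input from storage
    and writes its output back before the next iteration starts."""
    storage.write(initial)
    for _ in range(iterations):
        values = storage.read()
        storage.write(step(values))
    return storage.read()


def run_in_memory(initial, iterations):
    """Spark-style loop: the intermediate result stays in memory between iterations."""
    values = list(initial)
    for _ in range(iterations):
        values = step(values)
    return values


if __name__ == "__main__":
    storage = Storage()
    print(run_via_storage([0.0, 4.0], 3, storage), storage.reads, storage.writes)
    print(run_in_memory([0.0, 4.0], 3))
```

Both loops compute the same values; the storage-backed version simply adds one read and one write per iteration, which is the overhead the description above says Spark avoids for iterative workloads.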
Spark Technical Reference Manual
Contents
Synopsis
Recommendation Foreword
Translator's Preface
Foreword
Preface
Audience
How This Book Is Organized
Related Books
Typographical Conventions
Using Code Examples
Chapter 1: Introduction to Data Analysis with Spark
1.1 What Is Spark?
1.1.1 The MapReduce Computing Model
1.1.2 Similarities and Differences Between Hadoop and Spark Parallel Computing
1.1.3 Common Problems When Running MapReduce Programs
1.2 A Unified Software Stack
1.2.1 Spark Core
1.2.2 Spark SQL
1.2.3 Spark Streaming
1.2.4 Structured Streaming
1.2.5 MLlib
1.2.6 GraphX
1.2.7 Cluster Managers
1.3 Who Uses Spark, and for What?
1.3.1 Data Science Tasks
1.3.2 Data Processing Applications
1.4 A Brief History of Spark
1.5 Spark Versions and Releases
1.5.1 Common Stable Spark Versions
1.6 Storage Layers for Spark
1.7 Spark System Architecture
1.8 How Spark Runs
Chapter 2: Downloading Spark and Getting Started
2.1 Downloading Spark
2.2 Spark's Python and Scala Shells
2.3 Introduction to Core Spark Concepts
2.3.1 Working with a SparkContext Instance in the Spark Shell
2.3.2 Passing Functions to Spark
2.4 Standalone Applications
2.4.1 Initializing a SparkContext
2.4.2 Building Standalone Applications
2.5 Installing PySpark
2.5.1 Supported Python Versions
2.5.2 Using PyPI
2.5.3 Using Conda
2.5.4 Manual Download
2.5.5 Installing from Source
2.5.6 Dependencies
2.6 Installing Spark in Single-Machine Mode on Windows and Configuring a Scala Development Environment in IntelliJ IDEA
2.6.1 Pre-Installation Notes
2.6.2 Required Installation Packages
2.6.3 Installing JDK 1.8 on Windows
2.6.4 Installing Scala on Windows
2.6.5 Installing IntelliJ IDEA
2.6.6 Installing the Scala Plugin in IntelliJ IDEA
2.6.7 Running wordcount
2.7 Installing Spark in Single-Machine Mode on Windows and Configuring a Python Development Environment
2.7.1 Preparation
2.7.2 Downloading Anaconda
2.7.3 Installing Anaconda
2.7.4 Configuring Anaconda Environment Variables
2.7.5 Testing Anaconda
2.7.6 Downloading JDK 1.8
2.7.7 Configuring JDK Environment Variables
2.7.8 Testing JDK 1.8
2.7.9 Installing Scala
2.7.10 Downloading Spark 3.1.0
2.7.11 Installing Spark 3.1.0
2.7.12 Configuring Spark Environment Variables
2.7.13 Configuring the Log Level
2.7.14 Downloading the Hadoop Support Module
2.7.15 Installing the Hadoop Support Module
2.7.16 Configuring Environment Variables for the Hadoop Support Module
2.7.17 Testing Spark
2.7.18 Testing pyspark
2.7.19 Running the Example Code
2.8 Setting Up a Spark Development Environment (IntelliJ IDEA)
2.8.1 Creating a New Project
2.8.2 Writing the Code
2.8.3 Building the Package
2.8.4 Testing a Decision Tree on Single-Machine Spark
2.9 Installing and Configuring a Spark Cluster
2.9.1 Setting Up a Hadoop 2.2.0 Cluster on Ubuntu 12.04
2.9.2 Installing Hadoop 2.4.0 on Ubuntu 14.04 (Single-Machine Mode)
2.10 Starting the Spark Cluster
2.11 Running WordCount on Spark
2.11.1 Installing the Spark Environment
2.11.2 Starting the Spark Services
2.11.3 Summary
2.12 Detailed Steps for Installing Apache Spark 3.1.0 on Linux
Installing and Configuring Spark
2.13 Summary
Chapter 3: Programming with RDDs
3.1 RDD Basics
3.2 Creating RDDs
3.3 RDD Operations
3.3.1 Transformations
3.3.2 Actions
3.3.3 Lazy Evaluation
3.4 Passing Functions to Spark
3.4.1 Python
3.4.2 Scala
3.4.3 Java
3.5 Common Transformations and Actions
3.5.1 Basic RDDs
3.5.2 Converting Between RDD Types
3.6 Persistence (Caching)
3.6.1 The Caching Mechanism and the Purpose of cache
3.7 Summary
Chapter 4: Working with Key/Value Pairs
4.1 Motivation
4.2 Creating Pair RDDs
4.3 Transformations on Pair RDDs
4.3.1 Aggregations
4.3.2 Grouping Data
4.3.3 Joins
4.3.4 Sorting Data
4.4 Actions on Pair RDDs
4.5 Data Partitioning (Advanced)
4.5.1 Determining an RDD's Partitioner
4.5.2 Operations That Benefit from Partitioning
4.5.3 Operations That Affect Partitioning
4.5.4 Example: PageRank
4.5.5 Custom Partitioners
4.6 Summary
Chapter 5: Loading and Saving Data
5.1 Motivation
5.2 File Formats
5.2.1 Text Files
5.2.2 JSON
5.2.3 Comma-Separated Values and Tab-Separated Values
5.2.4 SequenceFile
5.2.5 Object Files
5.2.6 Hadoop Input and Output Formats
5.2.7 File Compression
5.3 Filesystems
5.3.1 Local/"Regular" Filesystem
5.3.2 Amazon S3
5.3.3 HDFS
5.4 Structured Data in Spark SQL
5.4.1 Apache Hive
5.4.2 JSON
5.5 Databases