Learning Spark: Lightning-Fast Big Data Analysis

Points/C-coins required: 17 · 2017-10-17 20:51:09 · 7.28 MB · PDF

This book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates. Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You'll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.
Learning Spark
Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
Beijing · Cambridge · Farnham · Köln · Sebastopol · Tokyo
O'Reilly

Learning Spark
by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia

Copyright © 2015 Databricks. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Ann Spencer and Marie Beaugureau
Production Editor: Kara Ebrahim
Copyeditor: Rachel Monaghan
Proofreader: Charles Roumeliotis
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

February 2015: First Edition

Revision History for the First Edition
2015-01-26: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449358624 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Learning Spark, the cover image of a small-spotted catshark, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-35862-4
[LSI]

Table of Contents

Foreword
Preface

1. Introduction to Data Analysis with Spark
What Is Apache Spark?; A Unified Stack; Spark Core; Spark SQL; Spark Streaming; MLlib; GraphX; Cluster Managers; Who Uses Spark, and for What?; Data Science Tasks; Data Processing Applications; A Brief History of Spark; Spark Versions and Releases; Storage Layers for Spark

2. Downloading Spark and Getting Started
Downloading Spark; Introduction to Spark's Python and Scala Shells; Introduction to Core Spark Concepts; Standalone Applications; Initializing a SparkContext; Building Standalone Applications; Conclusion

3. Programming with RDDs
RDD Basics; Creating RDDs; RDD Operations; Transformations; Actions; Lazy Evaluation; Passing Functions to Spark; Python; Scala; Java; Common Transformations and Actions; Basic RDDs; Converting Between RDD Types; Persistence (Caching); Conclusion

4. Working with Key/Value Pairs
Motivation; Creating Pair RDDs; Transformations on Pair RDDs; Aggregations; Grouping Data; Joins; Sorting Data; Actions Available on Pair RDDs; Data Partitioning (Advanced); Determining an RDD's Partitioner; Operations That Benefit from Partitioning; Operations That Affect Partitioning; Example: PageRank; Custom Partitioners; Conclusion

5. Loading and Saving Your Data
Motivation; File Formats; Text Files; JSON; Comma-Separated Values and Tab-Separated Values; SequenceFiles; Object Files; Hadoop Input and Output Formats; File Compression; Filesystems; Local/"Regular" FS; Amazon S3; HDFS; Structured Data with Spark SQL; Apache Hive; JSON; Databases; Java Database Connectivity; Cassandra; HBase; Elasticsearch; Conclusion

6. Advanced Spark Programming
Introduction; Accumulators; Accumulators and Fault Tolerance; Custom Accumulators; Broadcast Variables; Optimizing Broadcasts; Working on a Per-Partition Basis; Piping to External Programs; Numeric RDD Operations; Conclusion

7. Running on a Cluster
Introduction; Spark Runtime Architecture; The Driver; Executors; Cluster Manager; Launching a Program; Summary; Deploying Applications with spark-submit; Packaging Your Code and Dependencies; A Java Spark Application Built with Maven; A Scala Spark Application Built with sbt; Dependency Conflicts; Scheduling Within and Between Spark Applications; Cluster Managers; Standalone Cluster Manager; Hadoop YARN; Apache Mesos; Amazon EC2; Which Cluster Manager to Use?; Conclusion

8. Tuning and Debugging Spark
Configuring Spark with SparkConf; Components of Execution: Jobs, Tasks, and Stages; Finding Information; Spark Web UI; Driver and Executor Logs; Key Performance Considerations; Level of Parallelism; Serialization Format; Memory Management; Hardware Provisioning; Conclusion

9. Spark SQL
Linking with Spark SQL; Using Spark SQL in Applications; Initializing Spark SQL; Basic Query Example; SchemaRDDs; Caching; Loading and Saving Data; Apache Hive; Parquet; JSON; From RDDs; JDBC/ODBC Server; Working with Beeline; Long-Lived Tables and Queries; User-Defined Functions; Spark SQL UDFs; Hive UDFs; Spark SQL Performance; Performance Tuning Options; Conclusion

10. Spark Streaming
A Simple Example; Architecture and Abstraction; Transformations; Stateless Transformations; Stateful Transformations; Output Operations; Input Sources; Core Sources; Additional Sources; Multiple Sources and Cluster Sizing; 24/7 Operation; Checkpointing; Driver Fault Tolerance; Worker Fault Tolerance; Receiver Fault Tolerance; Processing Guarantees; Streaming UI; Performance Considerations; Batch and Window Sizes; Level of Parallelism; Garbage Collection and Memory Usage; Conclusion

11. Machine Learning with MLlib
Overview; System Requirements; Machine Learning Basics; Example: Spam Classification; Data Types; Working with Vectors; Algorithms; Feature Extraction; Statistics; Classification and Regression; Clustering; Collaborative Filtering and Recommendation; Dimensionality Reduction; Model Evaluation; Tips and Performance Considerations; Preparing Features; Configuring Algorithms; Caching RDDs to Reuse; Recognizing Sparsity; Level of Parallelism; Pipeline API; Conclusion

Index

要努力啊要努力: Very good, thank you!
weixin_38245345: A very good introductory resource, thanks!
树哥: Not bad; downloading it to take a look.