Points/C-coins required: 50 · 2018-04-08 16:28:53 · 4.84 MB · PDF

Hadoop: The Definitive Guide
Tom White
Foreword by Doug Cutting

O'REILLY
Beijing · Cambridge · Farnham · Köln · Sebastopol · Taipei · Tokyo

Hadoop: The Definitive Guide
by Tom White

Copyright © 2009 Tom White. All rights reserved.
Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Mike Loukides
Production Editor: Loranah Dimant
Proofreader: Nancy Kotary
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
    June 2009: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Hadoop: The Definitive Guide, the image of an African elephant, and related trade dress are trademarks of O'Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN: 978-0-596-52197-4
1243455573

For Eliane, Emilia, and Lottie

Table of Contents

Foreword
Preface

1. Meet Hadoop
    Data
    Data Storage and Analysis
    Comparison with Other Systems
        RDBMS
        Grid Computing
        Volunteer Computing
    A Brief History of Hadoop
    The Apache Hadoop Project

2. MapReduce
    A Weather Dataset
        Data Format
    Analyzing the Data with Unix Tools
    Analyzing the Data with Hadoop
        Map and Reduce
        Java MapReduce
    Scaling Out
        Data Flow
        Combiner Functions
        Running a Distributed MapReduce Job
    Hadoop Streaming
        Ruby
        Python
    Hadoop Pipes
        Compiling and Running

3. The Hadoop Distributed Filesystem
    The Design of HDFS
    HDFS Concepts
        Blocks
        Namenodes and Datanodes
    The Command-Line Interface
        Basic Filesystem Operations
    Hadoop Filesystems
        Interfaces
    The Java Interface
        Reading Data from a Hadoop URL
        Reading Data Using the FileSystem API
        Writing Data
        Directories
        Querying the Filesystem
        Deleting Data
    Data Flow
        Anatomy of a File Read
        Anatomy of a File Write
        Coherency Model
    Parallel Copying with distcp
        Keeping an HDFS Cluster Balanced
    Hadoop Archives
        Using Hadoop Archives
        Limitations

4. Hadoop I/O
    Data Integrity
        Data Integrity in HDFS
        LocalFileSystem
        ChecksumFileSystem
    Compression
        Compression and Input Splits
        Using Compression in MapReduce
    Serialization
        The Writable Interface
        Writable Classes
        Implementing a Custom Writable
        Serialization Frameworks
    File-Based Data Structures
        SequenceFile
        MapFile

5. Developing a MapReduce Application
    The Configuration API
        Combining Resources
        Variable Expansion
    Configuring the Development Environment
        Managing Configuration
        GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test
        Mapper
        Reducer
    Running Locally on Test Data
        Running a Job in a Local Job Runner
        Testing the Driver
    Running on a Cluster
        Packaging
        Launching a Job
        The MapReduce Web UI
        Retrieving the Results
        Debugging a Job
        Using a Remote Debugger
    Tuning a Job
        Profiling Tasks
    MapReduce Workflows
        Decomposing a Problem into MapReduce Jobs
        Running Dependent Jobs

6. How MapReduce Works
    Anatomy of a MapReduce Job Run
        Job Submission
        Job Initialization
        Task Assignment
        Task Execution
        Progress and Status Updates
        Job Completion
    Failures
        Task Failure
        Tasktracker Failure
        Jobtracker Failure
    Job Scheduling
        The Fair Scheduler
    Shuffle and Sort
        The Map Side
        The Reduce Side
        Configuration Tuning
    Task Execution
        Speculative Execution
        Task JVM Reuse
        Skipping Bad Records
        The Task Execution Environment

7. MapReduce Types and Formats
    MapReduce Types
        The Default MapReduce Job
    Input Formats
        Input Splits and Records
        Text Input
        Binary Input
        Multiple Inputs
        Database Input (and Output)
    Output Formats
        Text Output
        Binary Output
        Multiple Outputs
        Lazy Output
        Database Output

8. MapReduce Features
    Counters
        Built-in Counters
        User-Defined Java Counters
        User-Defined Streaming Counters
    Sorting
        Preparation
        Partial Sort
        Total Sort
        Secondary Sort
    Joins
        Map-Side Joins
        Reduce-Side Joins
    Side Data Distribution
        Using the Job Configuration
        Distributed Cache
    MapReduce Library Classes

9. Setting Up a Hadoop Cluster
    Cluster Specification

Preview: 127 pages · Hadoop: The Definitive Guide (original English edition)

