Big Data Principles and best practices of scalable realtime data systems.pdf

Points/C-coins required: 18 · 2018-05-25 20:44:45 · 6.58 MB · PDF

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2015 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editors: Renae Gregoire, Jennifer Stout
Technical development editor: Jerry Gaines
Copyeditor: Andy Carroll
Proofreader: Katie Tennant
Technical proofreader: Jerry Kuch
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor

ISBN 9781617290343
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 20 19 18 17 16 15

brief contents

1 A new paradigm for Big Data

PART 1 BATCH LAYER
2 Data model for Big Data
3 Data model for Big Data: Illustration
4 Data storage on the batch layer
5 Data storage on the batch layer: Illustration
6 Batch layer
7 Batch layer: Illustration
8 An example batch layer: Architecture and algorithms
9 An example batch layer: Implementation

PART 2 SERVING LAYER
10 Serving layer
11 Serving layer: Illustration

PART 3 SPEED LAYER
12 Realtime views
13 Realtime views: Illustration
14 Queuing and stream processing
15 Queuing and stream processing: Illustration
16 Micro-batch stream processing
17 Micro-batch stream processing: Illustration
18 Lambda Architecture in depth

contents

preface
acknowledgments
about this book

1 A new paradigm for Big Data
1.1 How this book is structured
1.2 Scaling with a traditional database
    Scaling with a queue · Scaling by sharding the database · Fault-tolerance issues begin · Corruption issues · What went wrong? · How will Big Data techniques help?
1.3 NoSQL is not a panacea
1.4 First principles
1.5 Desired properties of a Big Data system
    Robustness and fault tolerance · Low latency reads and updates · Scalability · Generalization · Extensibility · Ad hoc queries · Minimal maintenance · Debuggability
1.6 The problems with fully incremental architectures
    Operational complexity · Extreme complexity of achieving eventual consistency · Lack of human-fault tolerance · Fully incremental solution vs. Lambda Architecture solution
1.7 Lambda Architecture
    Batch layer · Serving layer · Batch and serving layers satisfy almost all properties · Speed layer
1.8 Recent trends in technology
    CPUs aren't getting faster · Elastic clouds · Vibrant open source ecosystem for Big Data
1.9 Example application: SuperWebAnalytics.com
1.10 Summary

PART 1 BATCH LAYER

2 Data model for Big Data
2.1 The properties of data
    Data is raw · Data is immutable · Data is eternally true
2.2 The fact-based model for representing data
    Example facts and their properties · Benefits of the fact-based model
2.3 Graph schemas
    Elements of a graph schema · The need for an enforceable schema
2.4 A complete data model for SuperWebAnalytics.com
2.5 Summary

3 Data model for Big Data: Illustration
3.1 Why a serialization framework?
3.2 Apache Thrift
    Nodes · Edges · Properties · Tying everything together into data objects · Evolving your schema
3.3 Limitations of serialization frameworks
3.4 Summary

4 Data storage on the batch layer
4.1 Storage requirements for the master dataset
4.2 Choosing a storage solution for the batch layer
    Using a key/value store for the master dataset · Distributed filesystems
4.3 How distributed filesystems work
4.4 Storing a master dataset with a distributed filesystem
4.5 Vertical partitioning
4.6 Low-level nature of distributed filesystems
4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem
4.8 Summary

5 Data storage on the batch layer: Illustration
5.1 Using the Hadoop Distributed File System
    The small-files problem · Towards a higher-level abstraction
5.2 Data storage in the batch layer with Pail
    Basic Pail operations · Serializing objects into pails · Batch operations using Pail · Vertical partitioning with Pail · Pail file formats and compression · Summarizing the benefits of Pail
5.3 Storing the master dataset for SuperWebAnalytics.com
    A structured pail for Thrift objects · A basic pail for SuperWebAnalytics.com · A split pail to vertically partition the dataset
5.4 Summary

6 Batch layer
6.1 Motivating examples
    Number of pageviews over time · Gender inference
6.2 Computing on the batch layer
6.3 Recomputation algorithms vs. incremental algorithms
    Performance · Human-fault tolerance · Generality of the algorithms · Choosing a style of algorithm
6.4 Scalability in the batch layer
6.5 MapReduce: a paradigm for Big Data computing
    Scalability · Fault-tolerance · Generality of MapReduce
6.6 Low-level nature of MapReduce
    Multistep computations are unnatural · [...] complicated to implement manually · Logical and physical execution tightly coupled
6.7 Pipe diagrams: a higher-level way of thinking about batch computation
    Concepts of pipe diagrams · Executing pipe diagrams via MapReduce · Combiner aggregators · Pipe diagram examples
6.8 Summary

7 Batch layer: Illustration
7.1 An illustrative example
7.2 Common pitfalls of data-processing tools
    Custom languages · Poorly composable abstractions
7.3 An introduction to JCascalog
    The Cascalog data model · The structure of a Cascalog query · Querying multiple datasets · Grouping and aggregators · Stepping through an example query · Custom predicate operations
7.4 Composition
    Combining subqueries · Dynamically created subqueries · Predicate macros · Dynamically created predicate macros
7.5 Summary

8 An example batch layer: Architecture and algorithms
8.1 Design of the SuperWebAnalytics.com batch layer
    Supported queries · Batch views
8.2 Workflow overview
8.3 Ingesting new data
8.4 URL normalization
8.5 User-identifier normalization
8.6 Deduplicate pageviews
8.7 Computing batch views
    Pageviews over time · Unique visitors over time · Bounce-rate analysis
8.8 Summary

9 An example batch layer: Implementation
9.1 Starting point
9.2 Preparing the workflow
9.3 Ingesting new data
9.4 URL normalization
9.5 User-identifier normalization
9.6 Deduplicate pageviews
9.7 Computing batch views
    Pageviews over time · Uniques over time · Bounce-rate analysis
9.8 Summary

PART 2 SERVING LAYER

10 Serving layer
10.1 Performance metrics for the serving layer
10.2 The serving layer solution to the normalization/denormalization problem
10.3 Requirements for a serving layer database
10.4 Designing a serving layer for SuperWebAnalytics.com
    Pageviews over time · Uniques over time · Bounce-rate analysis
10.5 Contrasting with a fully incremental solution
    Fully incremental solution to uniques over time · Comparing to the Lambda Architecture solution
10.6 Summary

11 Serving layer: Illustration
11.1 Basics of ElephantDB
    View creation in ElephantDB · View serving in ElephantDB · Using ElephantDB
11.2 Building the serving layer for SuperWebAnalytics.com
    Pageviews over time · Uniques over time · Bounce-rate analysis
11.3 Summary

    PlayYoung The content is clear and complete.
    2018-08-31
    ivyivylynn good version.
    2018-08-03