Paco Nathan
Enterprise Data Workflows
with Cascading
Enterprise Data Workflows with Cascading
by Paco Nathan
Copyright © 2013 Paco Nathan. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Mike Loukides and Courtney Nash
Production Editor: Kristen Borg
Copyeditor: Kim Cofer
Proofreader: Julie Van Keuren
Indexer: Paco Nathan
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest
July 2013: First Edition
Revision History for the First Edition:
2013-07-10: First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449358723 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly
Media, Inc. Enterprise Data Workflows with Cascading, the image of an Atlantic cod, and related trade dress
are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark
claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume no
responsibility for errors or omissions, or for damages resulting from the use of the information contained
herein.
ISBN: 978-1-449-35872-3
[LSI]
Table of Contents
Preface vii
1. Getting Started 1
Programming Environment Setup 1
Example 1: Simplest Possible App in Cascading 3
Build and Run 4
Cascading Taxonomy 6
Example 2: The Ubiquitous Word Count 8
Flow Diagrams 10
Predictability at Scale 14
2. Extending Pipe Assemblies 17
Example 3: Customized Operations 17
Scrubbing Tokens 21
Example 4: Replicated Joins 22
Stop Words and Replicated Joins 25
Comparing with Apache Pig 27
Comparing with Apache Hive 29
3. Test-Driven Development 33
Example 5: TF-IDF Implementation 33
Example 6: TF-IDF with Testing 41
A Word or Two About Testing 48
4. Scalding: A Scala DSL for Cascading 51
Why Use Scalding? 51
Getting Started with Scalding 52
Example 3 in Scalding: Word Count with Customized Operations 54
A Word or Two About Functional Programming 57