Programming Hive
Edward Capriolo, Dean Wampler, and Jason Rutherglen
Beijing
•
Cambridge
•
Farnham
•
Köln
•
Sebastopol
•
Tokyo
Downloa d f r o m W o w ! e B o o k < w w w.woweb o o k . c o m >
Programming Hive
by Edward Capriolo, Dean Wampler, and Jason Rutherglen
Copyright © 2012 Edward Capriolo, Aspect Research Associates, and Jason Rutherglen. All rights re-
served.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Mike Loukides and Courtney Nash
Production Editors: Iris Febres and Rachel Steely
Proofreaders: Stacie Arellano and Kiel Van Horn
Indexer: Bob Pfahler
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest
October 2012: First Edition.
Revision History for the First Edition:
2012-09-17 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449319335 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Programming Hive, the image of a hornet’s hive, and related trade dress are trade-
marks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information con-
tained herein.
ISBN: 978-1-449-31933-5
[LSI]
1347905436
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
An Overview of Hadoop and MapReduce 3
Hive in the Hadoop Ecosystem 6
Pig 8
HBase 8
Cascading, Crunch, and Others 9
Java Versus Hive: The Word Count Algorithm 10
What’s Next 13
2. Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Installing a Preconfigured Virtual Machine 15
Detailed Installation 16
Installing Java 16
Installing Hadoop 18
Local Mode, Pseudodistributed Mode, and Distributed Mode 19
Testing Hadoop 20
Installing Hive 21
What Is Inside Hive? 22
Starting Hive 23
Configuring Your Hadoop Environment 24
Local Mode Configuration 24
Distributed and Pseudodistributed Mode Configuration 26
Metastore Using JDBC 28
The Hive Command 29
Command Options 29
The Command-Line Interface 30
CLI Options 31
Variables and Properties 31
Hive “One Shot” Commands 34
iii