Getting Started with Pentaho Data Integration
This document is copyright © 2010 Pentaho Corporation. No part may be reprinted without written
permission from Pentaho Corporation. All trademarks are the property of their respective owners.
Help and Support Resources
If you have questions that are not covered in this guide, or if you would like to report errors in the
documentation, please contact your Pentaho technical support representative.
Support-related questions should be submitted through the Pentaho Customer Support Portal at
http://support.pentaho.com.
For information about how to purchase support or enable an additional named support contact,
please contact your sales representative, or send an email to sales@pentaho.com.
For information about instructor-led training on the topics covered in this guide, visit
http://www.pentaho.com/training.
Limits of Liability and Disclaimer of Warranty
The author(s) of this document have used their best efforts in preparing the content and the
programs contained in it. These efforts include the development, research, and testing of the
theories and programs to determine their effectiveness. The author and publisher make no warranty
of any kind, express or implied, with regard to these programs or the documentation contained in
this book.
The author(s) and Pentaho shall not be liable in the event of incidental or consequential damages
in connection with, or arising out of, the furnishing, performance, or use of the programs, associated
instructions, and/or claims.
Trademarks
Pentaho (TM) and the Pentaho logo are registered trademarks of Pentaho Corporation. All
other trademarks are the property of their respective owners. Trademarked names may appear
throughout this document. Rather than list the names and entities that own the trademarks or insert
a trademark symbol with each mention of the trademarked name, Pentaho states that it is using the
names for editorial purposes only and to the benefit of the trademark owner, with no intention of
infringing upon that trademark.
Company Information
Pentaho Corporation
Citadel International, Suite 340
5950 Hazeltine National Drive
Orlando, FL 32822
Phone: +1 407 812-OPEN (6736)
Fax: +1 407 517-4575
http://www.pentaho.com
E-mail: communityconnection@pentaho.com
Sales Inquiries: sales@pentaho.com
Documentation Suggestions: documentation@pentaho.com
Sign up for our newsletter: http://community.pentaho.com/newsletter/
Contents
Introduction
    Common Uses
    Key Benefits
Pentaho Data Integration Architecture
Downloading Pentaho Data Integration
Installing Pentaho Data Integration
    Starting the Spoon Designer
    Pentaho Data Integration Folders and Scripts
    Installing Enterprise Edition Licenses
    Adding a JDBC Driver
Connecting to the Enterprise Repository
Navigating through the Interface
Creating Your First Transformation
    Retrieving Data from a Flat File (Text File Input Step)
        Saving Your Transformation
    Filter Records with Missing Postal Codes (Filter Rows Step)
    Loading Your Data into a Relational Database (Table Output Step)
    Retrieving Data from your Lookup File (Text File Input Step)
    Resolving Missing Zip Code Information (Stream Lookup Step)
    Completing your Transformation (Select Values Step)
    Running Your Transformation
Building Your First Job
Scheduling the Execution of Your Job
Building Business Intelligence Solutions Using Agile BI
    Using Agile BI
    Correcting the Data Quality Issue
    Creating a Top Ten Countries by Sales Chart
    Breaking Down Your Chart by Deal Size
    Wrapping it Up
Why Choose Enterprise Edition?
    Professional, Technical Support
    Enterprise Edition Features
    Certified Software Releases
Troubleshooting
    I don't know what the default login is for the DI Server, Enterprise Console, and/or Carte
Introduction
Pentaho Data Integration (PDI) is a powerful extract, transform, and load (ETL) solution that uses an
innovative metadata-driven approach. It includes an easy-to-use graphical design environment for building
ETL jobs and transformations, which results in faster development, lower maintenance costs, interactive
debugging, and simplified deployment.
Common Uses
Pentaho Data Integration is an extremely flexible tool that addresses a broad range of use cases,
including:
• Data warehouse population with built-in support for slowly changing dimensions and surrogate key
creation
• Data migration between different databases and applications
• Loading huge data sets into databases, taking full advantage of cloud, clustered, and massively parallel
processing environments
• Data cleansing with steps ranging from very simple to very complex transformations
• Data integration, including the ability to leverage real-time ETL as a data source for Pentaho Reporting
• Rapid prototyping of ROLAP schemas
• Hadoop functions: Hadoop job execution and scheduling, simple Hadoop map/reduce design, Amazon
EMR integration
Key Benefits
Pentaho Data Integration features and benefits include:
• Installs in minutes; you can be productive in one afternoon
• 100% Java with cross-platform support for Windows, Linux, and Macintosh
• Easy-to-use graphical designer with over 100 out-of-the-box mapping objects including inputs,
transforms, and outputs
• Simple plug-in architecture for adding your own custom extensions
• Enterprise Data Integration server providing security integration, scheduling, and robust content
management including full revision history for jobs and transformations
• Integrated designer (Spoon) combining ETL with metadata modeling and data visualization, providing
the perfect environment for rapidly developing new Business Intelligence solutions
• Streaming engine architecture provides the ability to work with extremely large data volumes
• Enterprise-class performance and scalability with a broad range of deployment options including
dedicated, clustered, and/or cloud-based ETL servers