Getting Started with PDI
This document is copyright © 2010 Pentaho Corporation. No part may be reprinted without written
permission from Pentaho Corporation. All trademarks are the property of their respective owners.
Trademarks
Pentaho (TM) and the Pentaho logo are registered trademarks of Pentaho Corporation. All
other trademarks are the property of their respective owners. Trademarked names may appear
throughout this document. Rather than list the names and entities that own the trademarks or insert
a trademark symbol with each mention of the trademarked name, Pentaho states that it is using the
names for editorial purposes only and to the benefit of the trademark owner, with no intention of
infringing upon that trademark.
Company Information
Pentaho Corporation
Citadel International, Suite 340
5950 Hazeltine National Drive
Orlando, FL 32822
Phone: +1 407 812-OPEN (6736)
Fax: +1 407 517-4575
http://www.pentaho.com
E-mail: [email protected]
Sales Inquiries: [email protected]
Documentation Suggestions: [email protected]
Sign-up for our newsletter: http://community.pentaho.com/newsletter/
Contents
Introduction
  Common Uses
  Key Benefits
Pentaho Data Integration Architecture
Downloading Pentaho Data Integration
Installing Pentaho Data Integration
  Starting the Spoon Designer
  Pentaho Data Integration Folders and Scripts
  Installing Enterprise Edition Licenses
  Adding a JDBC Driver
Connecting to the Enterprise Repository
Navigating through the Interface
Creating Your First Transformation
  Retrieving Data from a Flat File (Text File Input Step)
  Saving Your Transformation
  Filter Records with Missing Postal Codes (Filter Rows Step)
  Loading Your Data into a Relational Database (Table Output Step)
  Retrieving Data from your Lookup File (Text File Input Step)
  Resolving Missing Zip Code Information (Stream Lookup Step)
  Completing your Transformation (Select Values Step)
  Running Your Transformation
Building Your First Job
Scheduling the Execution of Your Job
Building Business Intelligence Solutions Using Agile BI
  Using Agile BI
  Correcting the Data Quality Issue
  Creating a Top Ten Countries by Sales Chart
  Breaking Down Your Chart by Deal Size
  Wrapping it Up
Why Choose Enterprise Edition?
  Professional, Technical Support
  Enterprise Edition Features
  Certified Software Releases
Troubleshooting
  I don't know what the default login is for the DI Server, Enterprise Console, and/or Carte
Introduction
Pentaho Data Integration (PDI) is a powerful extract, transform, and load (ETL) solution that uses an
innovative metadata-driven approach. It includes an easy to use, graphical design environment for building
ETL jobs and transformations, resulting in faster development, lower maintenance costs, interactive
debugging, and simplified deployment.
Common Uses
Pentaho Data Integration is an extremely flexible tool that addresses a broad number of use cases
including:
• Data warehouse population with built-in support for slowly changing dimensions and surrogate key
creation
• Data migration between different databases and applications
• Loading huge data sets into databases taking full advantage of cloud, clustered and massively parallel
processing environments
• Data Cleansing with steps ranging from very simple to very complex transformations
• Data Integration including the ability to leverage real-time ETL as a data source for Pentaho Reporting
• Rapid prototyping of ROLAP schemas
• Hadoop functions: Hadoop job execution and scheduling, simple Hadoop map/reduce design, Amazon
EMR integration
Key Benefits
Pentaho Data Integration features and benefits include:
• Installs in minutes; you can be productive in one afternoon
• 100% Java with cross platform support for Windows, Linux and Macintosh
• Easy to use, graphical designer with over 100 out-of-the-box mapping objects including inputs,
transforms, and outputs
• Simple plug-in architecture for adding your own custom extensions
• Enterprise Data Integration server providing security integration, scheduling, and robust content
management including full revision history for jobs and transformations
• Integrated designer (Spoon) combining ETL with metadata modeling and data visualization, providing
the perfect environment for rapidly developing new Business Intelligence solutions
• Streaming engine architecture provides the ability to work with extremely large data volumes
• Enterprise-class performance and scalability with a broad range of deployment options including
dedicated, clustered, and/or cloud-based ETL servers
Pentaho Data Integration Architecture
The diagram below depicts the core components of Pentaho Data Integration Enterprise Edition.
Spoon is the design interface for building ETL jobs and transformations. Spoon provides a drag-and-drop
interface that allows you to graphically describe what you want to take place in your transformations, which
can then be executed locally within Spoon, on a dedicated Data Integration Server, or across a cluster of servers.
The Enterprise Edition (EE) Data Integration Server is a dedicated ETL server whose primary functions are
executing and scheduling jobs and transformations, integrating with your security infrastructure, and providing
centralized content management with full revision history.
The Enterprise Console provides a thin client for managing deployments of Pentaho Data Integration
Enterprise Edition, including management of Enterprise Edition licenses, monitoring and controlling activity
on a remote Pentaho Data Integration server, and analyzing performance trends of registered jobs and
transformations.
Before you begin to download Pentaho Data Integration, you must have Java 6.0 already installed.
You will receive a confirmation email that provides you with credentials to access the Pentaho Knowledge
Base, which contains product documentation, support tips, and how-to articles.
It is assumed that you will follow the default installation instructions and that you are installing to a local
device (localhost).
1. Read and accept the License Agreement.
2. Specify the location where you want to install Pentaho Data Integration or click Next to accept the
default.
3. Set the user name and password for the Administrator account. For the purposes of this evaluation,
accept the default user name, "admin," and type "password" in the Password and Confirm Password
fields.
4. Click Next to accept the default installation options on the Summary page.
5. Click Next to begin installation.
Pentaho Data Integration is installed as a Windows service. When installation is complete, the Spoon
designer is launched.
3. Alternatively, in Windows, go to Start -> Pentaho Enterprise Edition -> Design Tools to launch the
designer.
Note: Microsoft SQL Server users frequently use an alternative, non-vendor-supported driver called
JTDS. Ensure that you are downloading the expected driver before installing it.
Before you can add a data source to a Pentaho server or client tool, you must copy the appropriate JDBC
driver JAR to certain directories. To add support for a database, obtain the correct version of the JDBC
driver from your database vendor and copy it to the following locations, depending on which products need
to connect to this database:
Note: Ensure that there are no other versions of the same vendor's JDBC driver installed in these
directories before copying driver JARs. If there are other versions of the same driver, you may
have to remove them to avoid confusion and potential class loading problems. This is of particular
concern when you are installing a driver JAR for a data source that is the same database type as your
Pentaho solution repository.
Next, you will create a connection to the Enterprise Repository that is part of the Data Integration
Server. The Enterprise Repository is used to store and schedule the example transformation and job you
will create when performing the exercises in this document.
To create a connection to the Enterprise Repository:...
The Welcome page contains useful links to documentation, community links for getting involved in the
Pentaho Data Integration project, and links to blogs from some of the top contributors to the Pentaho Data
Integration project.
The Data Integration perspective of Spoon allows you to create two basic document types: transformations
and jobs. Transformations are used to describe the data flows for ETL such as reading from a source,
transforming data, and loading it into a target location. Jobs are used to coordinate ETL activities such as
defining the flow and dependencies for the order in which transformations should be run, or preparing for
execution by checking conditions such as, "Is my source file available?" or "Does a table exist in my database?"
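Although this guide focuses on designing and running transformations interactively in Spoon, the same engine can also be invoked programmatically. The following is a minimal sketch only, assuming the PDI (Kettle) libraries are on the classpath and that a transformation file named sample.ktr exists; the file name is illustrative and the package names reflect the Kettle Java API, which may differ between PDI versions.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunSampleTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                     // initialize the PDI engine
            TransMeta meta = new TransMeta("sample.ktr"); // load the transformation definition (illustrative path)
            Trans trans = new Trans(meta);
            trans.execute(null);                          // start execution with no command-line arguments
            trans.waitUntilFinished();                    // block until all steps complete
            if (trans.getErrors() > 0) {
                System.err.println("Transformation finished with errors.");
            }
        }
    }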
This exercise will step you through building your first transformation with Pentaho Data Integration
introducing common concepts along the way. The exercise scenario includes a flat file (CSV) of sales data
that you will load into a database so that mailing lists can be generated. Several of the customer records
are missing postal codes (zip codes) that must be resolved before loading into the database. The logic
looks like this:
1. Click (New) in the upper left corner of the Spoon graphical interface.
2. Select Transformation from the list.
3. Under the Design tab, expand the Input node; then, select and drag a Text File Input step onto the
canvas on the right.
10. Click the Fields tab and click Get Fields to retrieve the input fields from your source file.
A dialog box appears asking you to specify the number of lines to scan, allowing you to
determine default settings for the fields such as their format, length, and precision. Type 0 (zero) to scan all lines.
12. Click Preview Rows to verify that your file is being read correctly. You can change the number of rows
to preview. Click OK to exit the step properties dialog box.
3. In the Directory field, click (folder icon) to select a repository folder where you will save your
transformation.
4. Expand the Home directory and double-click the joe folder.
Your transformation will be stored in the joe folder in the Enterprise Repository.
5. Click OK to exit the Transformation Properties dialog box.
The Enter Comment dialog box appears.
6. Click in the Enter Comment dialog box and press <Delete> to remove the default text string. Type a
meaningful comment about your transformation.
The comment and your transformation are tracked for version control purposes in the Enterprise
Repository.
7. Click OK to exit the Enter Comment dialog box.
Alternatively, you can draw hops by hovering over a step until the hover menu appears. Drag the hop
painter icon from the source step to your target step.
Note: You will return to this step later and configure the Send true data to step and Send false
data to step settings after adding their target steps to your transformation.
8. Save your transformation.
3. Double-click the Table Output step to open its edit properties dialog box.
4. Rename your Table Output Step to Write to Database.
5. Click New next to the Connection field. You must create a connection to the database.
The Database Connection dialog box appears.
6. Provide the settings for connecting to the database as shown in the table below.
Connection Name: Sample Data
Connection Type: H2
Host Name: localhost
Database Name: sampledata
Port Number: 9092
User Name: sa
Password: (leave blank; no password)
7. Click Test to make sure your entries are correct. A success message appears. Click OK.
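For reference, the connection settings above correspond to a standard H2 JDBC URL. The following is a minimal sketch of the same connection made directly from Java, assuming the H2 driver JAR is on the classpath; it is included only to show how the settings fit together.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class TestSampleDataConnection {
        public static void main(String[] args) throws Exception {
            // Mirrors the Spoon connection settings: H2 server on localhost, port 9092,
            // database "sampledata", user "sa", blank password.
            String url = "jdbc:h2:tcp://localhost:9092/sampledata";
            try (Connection conn = DriverManager.getConnection(url, "sa", "")) {
                System.out.println("Connected: " + !conn.isClosed());
            }
        }
    }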
Retrieving Data from your Lookup File (Text File Input Step)
You have been provided a second text file containing a list of cities, states, and postal codes that you will
now use to look up the postal codes for all of the records where they were missing (the ‘false’ branch of
your Filter rows step). First, you will use a Text file input step to read from the source file; then you will use
a Stream lookup step to bring the resolved postal codes into the stream.
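Conceptually, the Stream lookup step builds an in-memory table from the lookup stream and joins it to the main stream on the key fields. A rough sketch of the equivalent logic is shown below; the field values are illustrative only, and in the exercise this work is done entirely by the step, not by hand-written code.

    import java.util.HashMap;
    import java.util.Map;

    public class ZipLookupSketch {
        public static void main(String[] args) {
            // Lookup table keyed on CITY and STATE, built from the postal codes file.
            Map<String, String> zipByCityState = new HashMap<>();
            zipByCityState.put("Orlando|FL", "32822");  // illustrative entry

            // For a record that arrived without a postal code, resolve ZIP_RESOLVED.
            String city = "Orlando";
            String state = "FL";
            String zipResolved = zipByCityState.get(city + "|" + state);
            System.out.println("ZIP_RESOLVED = " + zipResolved);
        }
    }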
1. Add a new Text File Input step to your transformation. In this step you will retrieve the records from
your lookup file.
2. Rename your Text File input step to Read Postal Codes.
3. Click Browse to locate the source file, Zipssortedbycitystate.csv, located at ...\design-tools
\data-integration\samples\transformations\files.
4. Click Add.
The path to the file appears under Selected Files.
Note: Click Show File Content to view the contents of the file. This file is comma (,) delimited,
with an enclosure of quotation mark (“), and contains a single header row.
5. Under the Content tab, enable the Header option. Change the separator character to a comma (,), and
confirm that the enclosure setting is correct.
6. Under the Fields tab, click Get Fields to retrieve the data from your .csv file.
7. Click Preview Rows to make sure your entries are correct and click OK to exit the Text File input
properties dialog box.
1. Add a Stream Lookup step to your transformation. Under the Design tab, expand the Lookup folder
and choose Stream Lookup.
2. Draw a hop between the Filter Missing Zips (Filter rows) step and the Stream Lookup step. When
prompted, select Result is FALSE.
3. Create a hop from the Read Postal Codes step (Text File input) to the Stream lookup step.
4. Double-click on the Stream lookup step to open its edit properties dialog box.
5. Rename Stream Lookup to Lookup Missing Zips.
6. Select the Read Postal Codes (Text File input) as the Lookup step.
7. Define the CITY and STATE fields in the key(s) to look up the value(s) table. Click the drop down in
the Field column and select CITY. Then, click in the LookupField column and select CITY. Perform the
same actions to define the second key based on the STATE fields coming in on the source and lookup
streams:
1. Add a Select Values step to your transformation. Expand the Transform folder and choose Select
Values.
2. Create a hop between the Lookup Missing Zips and Select Values steps.
3. Double-click the Select Values step to open its properties dialog box.
4. Rename the Select Values step to Prepare Field Layout.
5. Click Get fields to select to retrieve all fields and begin modifying the stream layout.
6. Select the ZIP_RESOLVED field in the Fields list and use <CTRL><UP> to move it just below the
POSTALCODE field (the one that still contains null values).
7. Select the old POSTALCODE field in the list (line 20) and delete it.
This final part of the creating a transformation exercise focuses exclusively on the local execution option.
For more information on remote, clustered and other execution options review the links in the additional
resources section later in this guide or in the Pentaho Data Integration User Guide found in the Knowledge
Base.
1. In the Spoon graphical interface, click (Run this Transformation or Job).
The Execute a Transformation dialog box appears. You can run a transformation locally, remotely, or
in a clustered environment. For the purposes of this exercise, keep the default Local Execution.
2. Click Launch.
The transformation executes. Upon running the transformation, the Execution Results panel opens
below the graphical workspace.
The Step Metrics tab provides statistics for each step in your transformation, including how many
records were read and written, whether any errors occurred, processing speed (rows per second), and
more. If any of the steps caused the transformation to fail, they would be highlighted in red as shown below.
The Logging tab displays the logging details for the most recent execution of the transformation. Error
lines are highlighted in red.
Like the Execution History, this feature requires you to configure your transformation to log to a
database through the Logging tab of the Transformation Settings dialog box. For more information on
configuring logging or performance monitoring, see the Pentaho Data Integration User Guide found in
the Knowledge Base.
The Start job entry defines where the execution will begin.
4. Expand the Conditions folder and add a File Exists job entry.
5. Draw a hop from the Start job entry to the File Exists job entry.
6. Double-click the File Exists job entry to open its edit properties dialog box. Click Browse and select the
sales_data.csv from the following location: ...\design-tools\data-integration\samples
\transformations\files.
Be sure to set the filter to CSV files to see the file.
13. Click Run Job. When the Execute a Job dialog box appears, choose Local Execution and click
Launch.
The Execution Results panel should open showing you the job metrics and log information for the job
execution.
The Enterprise Edition Pentaho Data Integration Server provides scheduling services allowing you to
schedule the execution of jobs and transformations in the future or on a recurring basis. In this example,
you will create a schedule that runs your Sample Job every Sunday at 9:00 a.m.
4. Under the Repeat section, select the Weekly option. Enable the Sunday check box.
5. For the End date, select Date and then enter a date several weeks in the future using the calendar
picker.
8. If the scheduler is stopped, you must click (Start Scheduler) on the sub-toolbar. If the button
appears with a red stop icon, the scheduler is already running. Your scheduled activity will take place as
indicated at the Next Run time.
Historically, starting new Business Intelligence projects required careful consideration of a broad set of
factors including:
Data Considerations
• Where is my data coming from?
• Where will it be stored?
• What cleansing and enrichment is necessary to address the business needs?
Information Delivery Considerations
• Will information be delivered through static content like pre-canned reports and dashboards?
• Will users need the ability to build their own reports or perform interactive analysis on the data?
Skill Set Considerations
• If users need self-service reporting and analysis, what skill sets do you expect them to have?
• Assuming the project involves some combination of ETL, content creation for reports and dashboards,
and meta-data modeling to enable business users to create their own content, do we have all the tools
and skill sets to build the solution in a timely fashion?
Cost
• How many tools and from how many vendors will it take to implement the total solution?
• If expanding the use of a BI tool already in house, what are the additional licensing costs associated
with rolling it out to a new user community?
• What are the costs in both time and money to train up on all tools necessary to roll out the solution?
• How long is the project going to take and when will we start seeing some ROI?
Because of this, many new projects are scratched before they even begin. Pentaho’s Agile BI initiative
seeks to break down the barriers to expanding your use of Business Intelligence through an iterative
approach to scoping, prototyping, and building complete BI solutions. It is an approach that centers on the
business needs first, empowers the business users to get involved at every phase of development, and
prevents projects from going completely off track from the original business goals.
In support of the Agile BI methodology, Spoon provides an integrated design environment for performing
all tasks related to building a BI solution, including ETL, reporting, OLAP metadata modeling, and end-user
visualization. With a single click, business users can instantly start interacting with data, building reports
with zero knowledge of SQL or MDX, and working hand in hand with solution architects to refine the solution.
Using Agile BI
This exercise builds upon your sample transformation and highlights the power an integrated design
environment can provide for building solutions using Agile BI.
For this example, your business users have asked to see what the top 10 countries are based on sales.
Furthermore, they want the data broken down by deal size where small deals are those less than $3,000,
medium sized deals are between $3,000 and $7,000, and large deals are over $7,000.
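The deal-size rules above amount to a simple bucketing function. The sketch below shows equivalent logic for reference only; the boundary handling is illustrative, and in the exercise the categorization is performed by a PDI step rather than hand-written code.

    public class DealSizeSketch {
        // Buckets a sales amount according to the thresholds described above.
        static String dealSize(double sales) {
            if (sales < 3000) {
                return "Small";
            } else if (sales <= 7000) {
                return "Medium";
            } else {
                return "Large";
            }
        }

        public static void main(String[] args) {
            System.out.println(dealSize(2500));   // Small
            System.out.println(dealSize(5000));   // Medium
            System.out.println(dealSize(12000));  // Large
        }
    }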
1. Open or select the tab containing the sample transformation you just created.
2. Right-click the Write to Database (Table Output) step, and select Visualize -> Analyzer.
In the background, Pentaho Data Integration automatically generates the OLAP model that allows you
to begin interacting immediately with your new data source.
3. Drag the COUNTRY field from the Field list on the left onto the report.
4. Drag the SALES measure from the Field list onto the report.
2. Right-click the Table output step from the flow and choose Detach step. Repeat this process to detach
the second hop.
3. Expand the Transform folder in the Design Palette and add a Value Mapper step to the
transformation.
4. Draw a hop from the Filter Missing Zips (Filter rows) step to the Value Mapper step and select Result
is TRUE.
5. Draw a hop from the Prepare Field Layout (Select values) step to the Value Mapper step.
6. Draw a hop from the Value Mapper step to the Write to Database (Table output) step. Your
transformation should look like the sample below:
7. Double-click on the Value Mapper step to open its edit step properties dialog box.
8. Select the COUNTRY field in the Fieldname to use input.
9. In the first row of the Field Values table, type United States as the Source value and USA as the
Target value. Click OK to exit the dialog box.
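The Value Mapper configuration above is equivalent to a simple source-to-target substitution on the COUNTRY field. A minimal sketch of that logic, with illustrative values, is shown here; the step itself handles this for every row in the stream.

    import java.util.HashMap;
    import java.util.Map;

    public class CountryMapperSketch {
        public static void main(String[] args) {
            // Source value -> target value, as entered in the Field Values table.
            Map<String, String> countryMap = new HashMap<>();
            countryMap.put("United States", "USA");

            // Values without a mapping pass through unchanged.
            String raw = "United States";
            String mapped = countryMap.getOrDefault(raw, raw);
            System.out.println(mapped); // USA
        }
    }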
1. Right-click the COUNTRY header and select Top 10.
2. Confirm that the default settings are set to return the Top 10 COUNTRY members by the SALES
measure. Click OK.
3. Click (chart) and select Stacked Bar to change the visualization to a bar chart.
Note: Because this step adds a new field to the stream, you must update your target
database table to add the new column in the next steps.
7. Double-click on the Write to Database (Table output) step.
8. Click SQL to generate the DDL necessary to update the target table.
Wrapping it Up
Follow the instructions below to complete your Agile BI exercise:
1. Click Visualize to return to your Top 10 Countries chart. Next, you will update your dimensional model
with the new Deal Size attribute.
2. Click View in the Visualization Properties panel on the right to display the Model perspective and begin
editing the model used to build your chart.
3. Drag the DEALSIZE field from the list of available fields on the left onto the Dimensions folder in the
Model panel in the middle. This adds a new dimension called DEALSIZE with a single default hierarchy
and level of the same name.
4. Click Save on the main toolbar to save your updated model. Click Visualize to return to your Top 10
Countries chart.
5. Click Refresh to update your field list to include the new DEALSIZE attribute.
6. Click (Toggle Layout) to open the Layout panel.
7. Drag DEALSIZE from the field list on the left into the Color Stack section of the Layout panel.
8. Click (Toggle Layout) to close the Layout panel. You have successfully delivered your business
user’s request.
Enterprise Edition enables you to deploy Pentaho Data Integration with confidence, security, and far lower
total cost of ownership than proprietary and open source alternatives. Benefits of Pentaho Data Integration
Enterprise Edition include:
This section contains known problems and solutions relating to DI Server administration.
I don't know what the default login is for the DI Server, Enterprise Console,
and/or Carte
For the DI Server administrator, it's username admin and password secret.
For Enterprise Console administrator, it's username admin and password password.
For Carte, it's username cluster and password cluster.
Be sure to change these to new values in your production environment.
Note: DI Server users are not the same as BI Server users.