Guided Tutorial For Pentaho Data Integration Using Oracle
In the data integration exercise, you will use the Pentaho Data Integration tool to transform
two data sources and load data into an Oracle fact table. You will perform transformations to
parse date strings, combine fields, and perform validation checks. Before starting this tutorial,
you need to install necessary software, download data sources, and create tables used in the
tutorial.
1. Tutorial Prerequisites
Before starting this tutorial, you should download and install the server and client for either
Oracle or MySQL server. You can find details in Module 1 about Oracle installation. If you have
access to a remote Oracle server (perhaps through your employer), you do not need to install
the server software on your own machine.
You also need to install Pentaho Data Integration before starting this tutorial. After installing
Pentaho Data Integration, you need to install the Java Database Connectivity (JDBC) driver for
Oracle. Module 1 contains installation instructions about Pentaho Data Integration and JDBC
drivers. This tutorial demonstrates the community edition of the most recent stable version
(5.0.1) of Pentaho Data Integration.
After installing Pentaho Data Integration, you need to obtain the data sources used in the
tutorial from the class website.
The tutorial uses the Store Sales data warehouse as depicted in Figure 1. Sales is the fact entity
type surrounded by 1-M relationships with dimension entity types, Item, Customer, Store, and
TimeDim. The schema design has a snowflake for the 1-M relationship from Division to Store. In
the table design, table names have been preceded with the prefix “SS” to avoid conflicts with
other tables. Thus, the fact table is SSSales, not Sales as shown in the ERD of Figure 1.
The class website contains documents for Oracle and MySQL. You need to create and populate
the tables using one of these documents. The Oracle document also contains a statement to
create a sequence object for the SSSales table.
29 July 2020 Guided Tutorial for Pentaho Data Integration using Oracle Page 2
Figure 1: Oracle Snowflake Schema for the Store Sales Data Warehouse
This exercise will step you through building your first transformation with Pentaho Data
Integration introducing common concepts along the way. Follow the instructions below to
create a new transformation.
1. After starting Pentaho Data Integration, you will see the opening window (Figure 2) and the
Spoon window (Figure 3).
3. Select Transformation from the list of components (Figure 4) displayed after selecting the
New button.
Step 1 – In the View tab, right click the new transformation 1 and select “settings…”
Step 2 – Set the Transformation name for the new transformation as: SSTORETEST and click OK.
Step 3 – Save the transformation by choosing Save from the File menu. You will see the empty
transformation window in Spoon (Figure 5).
o Under the Design tab, expand the Input node (Figure 6).
o Select and drag a Microsoft Excel Input step into the canvas on the right.
o Double Click on the Microsoft Excel Input step. The edit properties dialog box (Figure 7)
associated with the Microsoft Excel Input step appears. In this dialog box, you specify
the properties related to a particular step.
o Set name for the Excel Input as SSExcelData and specify the Excel data source path in
the Files tab.
o In the tab named Files, click the button “Browse…” and locate the Excel file that you
downloaded from the class website. Then, Click “Add” to add the file to the selected
files area.
o In the tab named Sheets, click the button “Get sheetname(s)…”. An Enter List dialog
(Figure 8) appears for choosing sheets. Select Sheet 1 and press “>” to move it into the
right area. Click OK.
o In the tab named Fields, click “Get fields from header row…”. Change the data types,
lengths, and precisions to match the specification in Figure 9.
o Click OK at the bottom of the window. The input icon will change to the SSExcel icon
displayed in Figure 10.
Step 5 – In this part of the tutorial, you will add constraint checking for null values and
appropriate data types for the Excel data source.
o Add a Filter Rows step to your transformation. Under the Design tab, expand the Flow
node and select Filter Rows (Figure 10).
o Create a “hop” between the SSExcelSource (Excel file input) step and the Filter Rows
step. Hops describe the flow of data in your transformation. To create the
hop, click the SSExcelSource (Excel file input) step, then hold down the <SHIFT> key
and draw a line to the Filter Rows step (Figure 11).
Figure 11: Hop Connecting an Excel Input Node to a Filter Node
o Alternatively, you can draw hops by hovering over a step until the hover menu (Figure
12) appears. Drag the hop painter icon from the source step to your target step.
o Double-click the Filter Rows step. The Filter Rows edit properties dialog box appears
(Figure 13).
o Click on the comparison operator (Figure 15) (set to = by default) and select the IS NOT
NULL function and click OK.
o Click the button . A new condition row appears with null = [ ] as a default.
o Click on the expression and add constraints for the next column similarly to what you
did for “SalesUnits”
o Click on UP. This will allow you to see both conditions joined by AND.
o Click the button again. Another new condition row appears with null = [ ] as a
default.
o Keep repeating these steps for all fields.
o The final view of filter conditions is shown in Figure 16.
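The effect of chaining IS NOT NULL conditions with AND can be sketched outside the tool. The following is a minimal Python illustration, not the way Pentaho implements the Filter Rows step. The field names come from the tutorial, while the sample values and the helper name filter_not_null are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def filter_not_null(rows: Iterable[Row], required: List[str]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (passed, rejected).

    A row passes only when every required field IS NOT NULL, which mirrors
    a chain of IS NOT NULL conditions joined by AND.
    """
    passed: List[Row] = []
    rejected: List[Row] = []
    for row in rows:
        if all(row.get(name) is not None for name in required):
            passed.append(row)
        else:
            rejected.append(row)
    return passed, rejected


# Hypothetical sample rows; the real spreadsheet has more columns.
sample_rows = [
    {"SalesUnits": 3, "CustID": "C1", "StoreID": "S1"},
    {"SalesUnits": None, "CustID": "C2", "StoreID": "S1"},
    {"SalesUnits": 5, "CustID": "C3", "StoreID": None},
]
true_rows, false_rows = filter_not_null(sample_rows, ["SalesUnits", "CustID", "StoreID"])
print(len(true_rows), "row(s) pass;", len(false_rows), "row(s) rejected")
```

Only the rows in the first list would continue along the "Result is TRUE" hop to the next step.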
Step 6 – Create a step to sort the result of the Filter Rows step.
o Under the Design tab, expand the contents of the Transform node.
o Click and drag a Sort Rows step into your transformation; create a hop between the
Filter rows and Sort Rows steps. Select Result is TRUE in the filter results selection list
(Figure 17).
o Double-click the Sort Rows step to open its edit properties dialog box (Figure 18). Click
“Get Fields” to obtain the fields. Delete all fields except the Day, Month, and Year
fields. Then click OK.
When you define a database connection, the connection information (username, password,
port number, and so on) is stored in the Pentaho Enterprise Repository and is available to other
users when they connect to the repository. If you are not using the Pentaho Enterprise
Repository, the database connection information is stored in the XML file associated with a
transformation or job.
Connections that are available for use with a transformation or job are listed under Database
Connection node in the explorer View in Spoon.
o In Spoon, under View in the navigation tab, right click Database connections and choose
New.
o In Spoon, under View in the navigation tab, right click Database connections and choose
New Connection Wizard.
o In the Table input configuration box, click on New.
This part of the tutorial involves looking up the date from the SSTimeDim table to check the
validity of dates in the Excel data source. In addition, you will look up primary key columns from
other Oracle tables to ensure that loaded data does not contain invalid foreign keys.
o Under the Design tab, expand the contents of the Input node.
o Click and drag a Table Input step into your transformation.
o Double-click the Table Input step to open its edit properties dialog box (Figure 19).
o Rename your Table Input step to SSTimeDim.
o Click “New…” next to the connection field. You must create a connection to the
database. The Database connection dialog box appears.
o Before setting the connection information, you should first configure the JDBC driver
according to the instructions described in the installation procedure for Pentaho Data
Integration. You must also have created and populated the tables of the Store Sales data
warehouse.
o Provide the settings for connecting to the database as shown in Figure 20. You have two
options for connection details. If you created and populated the store sales tables under
the SYSTEM account, you should use the first connection details. If you created and
populated the store sales in an account you created (LocalUser1), you should use the
second connection details. Note that host name and port are left blank in both
connection details. The Database Name is only partially shown in Figure 20. You must
enter the full value exactly into the Database Name field. The full value for database
name is shown in the connection details.
o Connection for Oracle Database Virtual Box Appliance. This connection requires that
you have PDI installed on the Oracle Virtual Box, not Windows. The predefined
connection in SQL Developer uses the privileged account, SYSTEM. You can use ORCL or
CDB1 as the service name in the connection string.
Connection Name: Oracle12cDB
Connection Type: Oracle
Host Name:
Database Name:
(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)
(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=ORCL)))
Port Number:
User Name: SYSTEM *** or other user name that you created. ***
Password: oracle *** You may have changed this default password for SYSTEM. ***
Access: Native (JDBC)
o Connection for local 12c server using SYSTEM account and SID of ORCL.
Connection Name: Oracle12cDB
Connection Type: Oracle
Host Name:
Database Name:
(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)
(PORT=1521)))(CONNECT_DATA=(SID=ORCL)))
Port Number:
User Name: SYSTEM
Password: *** use the administrative password that you gave during installation ***
Access: Native (JDBC)
o Alternative connection using a local user and service name of PDBORCL. Note that
PDBORCL must be open and the local user must have been previously created. You
should see instructions in the document about making Oracle connections in the
software installations lesson in module 1.
Connection Name: Oracle12cDB
Connection Type: Oracle
Host Name:
Database Name:
(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)
(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=PDBORCL)))
Port Number:
User Name: LocalUser1 *** or other user name that you created. ***
Password: *** use the password that you gave for LocalUser1 or other user that you
created ***
Access: Native (JDBC)
o Click “Test” to test the connection. A successful test result is shown in Figure 21.
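The Database Name values in the connection details are printed across two lines, but the field expects one unbroken string with balanced parentheses. As a hedged aid for checking what you type, the sketch below assembles the same descriptor text in Python. The function name connect_descriptor is hypothetical and is not part of Pentaho or Oracle; only the descriptor layout is taken from the connection details in this tutorial.

```python
from typing import Optional


def connect_descriptor(host: str, port: int, *,
                       service_name: Optional[str] = None,
                       sid: Optional[str] = None) -> str:
    """Build a single-line connect descriptor in the layout used in this tutorial.

    Exactly one of service_name or sid must be given.
    """
    if (service_name is None) == (sid is None):
        raise ValueError("give exactly one of service_name or sid")
    if service_name is not None:
        target = f"SERVICE_NAME={service_name}"
    else:
        target = f"SID={sid}"
    return (
        "(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)"
        f"(HOST={host})(PORT={port})))"
        f"(CONNECT_DATA=({target})))"
    )


by_service = connect_descriptor("localhost", 1521, service_name="ORCL")
by_sid = connect_descriptor("localhost", 1521, sid="ORCL")
print(by_service)
print(by_sid)
```

Comparing the printed string with the value entered in the Database Name field is a quick way to spot a missing parenthesis or an accidental line break.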
o Type “SELECT * FROM SSTimeDim” in the SQL section (Figure 22). You can click the
Preview button to view the rows returned by the query. Click OK to close the Table Input
dialog box.
Figure 22: SQL Edit Section in Property Window of Table Input Node
o Add another Sort rows step, Sort rows 2, and a hop connecting it to the SSTimeDim
step. In the field specification (Figure 23), delete all fields except the TIMEDAY,
TIMEMONTH, and TIMEYEAR fields.
o Under the Design tab, expand the contents of the Joins node.
o Click and drag a Merge Join step into your transformation; create hops from the
Sort rows and Sort rows 2 steps to the Merge Join step (Figure 24).
Figure 24: Two Sort Rows Nodes Connected to Merge Join Node
o Double-click the Merge Join step to specify its properties (Figure 25). Set First step as
Sort rows, Second step as Sort rows 2, and Join Type as INNER. Click both “Get key
fields” buttons (left and right) to get the possible fields to join. In the left table, delete
all fields except the Day, Month, and Year fields. In the right table, delete all fields
except the TIMEDAY, TIMEMONTH, and TIMEYEAR fields. Then click OK.
o You have now finished the inner join between the Excel input and the SSTimeDim table.
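The Merge Join step expects both inputs to arrive sorted on the join keys, which is why each input passes through a Sort rows step first. The Python sketch below illustrates the general sort-merge inner join technique under that assumption. It is not Pentaho's implementation, and the sample rows and the function name merge_inner_join are hypothetical. Rows whose date has no match in the time dimension simply drop out, which is how the inner join acts as a validity check.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def key_of(row: Row, fields: Sequence[str]) -> Tuple[Any, ...]:
    """Return the join key of a row as a tuple of field values."""
    return tuple(row[name] for name in fields)


def merge_inner_join(left: List[Row], right: List[Row],
                     left_keys: Sequence[str], right_keys: Sequence[str]) -> List[Row]:
    """Inner join two row lists with a sort-merge pass over both inputs."""
    # Sorting here plays the role of the two Sort rows steps.
    left = sorted(left, key=lambda r: key_of(r, left_keys))
    right = sorted(right, key=lambda r: key_of(r, right_keys))
    joined: List[Row] = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk = key_of(left[i], left_keys)
        rk = key_of(right[j], right_keys)
        if lk < rk:
            i += 1          # left row has no partner
        elif lk > rk:
            j += 1          # right row has no partner
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and key_of(right[j_end], right_keys) == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and key_of(left[i], left_keys) == lk:
                for match in right[j:j_end]:
                    joined.append({**left[i], **match})
                i += 1
            j = j_end
    return joined


# Hypothetical sample data; 31 February is deliberately invalid.
sales = [
    {"CustID": "C1", "Day": 15, "Month": 1, "Year": 2013},
    {"CustID": "C2", "Day": 31, "Month": 2, "Year": 2013},
    {"CustID": "C3", "Day": 2, "Month": 1, "Year": 2013},
]
time_dim = [
    {"TIMENO": 1, "TIMEDAY": 2, "TIMEMONTH": 1, "TIMEYEAR": 2013},
    {"TIMENO": 2, "TIMEDAY": 15, "TIMEMONTH": 1, "TIMEYEAR": 2013},
]
result = merge_inner_join(sales, time_dim,
                          ["Day", "Month", "Year"],
                          ["TIMEDAY", "TIMEMONTH", "TIMEYEAR"])
print([(r["CustID"], r["TIMENO"]) for r in result])
```

If either input were left unsorted, the single forward pass could skip matching rows, so both lists are sorted on the same key order before merging.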
Similar to getting data from the SSTimeDim table in the previous section, inner joining these
tables requires Table Input components. First, we set the connection and query properties for
the SSItem table. Note that these tables should exist in your Oracle schema before these steps.
o Drag and drop the Table Input 2 into the design pane.
o Double click on the newly created component to open its Basic Settings pane. Specify
the connection as shown in previous figure.
o Use “SSItem” as the Table Name value and “SELECT * FROM SSItem” as the Query value.
o Create two Sort rows steps, Sort rows 3 and Sort rows 4, connected to Merge Join
and SSItem, respectively. Set the fields to be sorted as ItemID and ITEMID, respectively.
o Drag and drop Merge Join 2 into the design pane. Connect Sort rows 3 and Sort
rows 4 to Merge Join 2. Set the fields to be joined as ItemID and ITEMID.
o The global view of all nodes and connections after Step 2 is shown by Figure 26.
Figure 26: Global View of All Nodes and Connections after Step 2
o Inner join the tables named SSCustomer and SSStore in your transformation using the
same method described previously.
o For the SSCustomer step, connect the CustID (from Excel file) and CUSTID (from
Database) fields.
o For the SSStore step, connect the StoreID (from Excel file) and STOREID (from Database)
fields.
o The global view of all nodes and connections after Step 3 is shown by Figure 27.
Figure 27: Global View of All Nodes and Connections after Step 3
Step 4 – Create and connect an Add Sequence step to generate values for the SalesNo column.
o Under the Design tab, expand the contents of the Transform node.
o Click and drag an Add sequence step into your transformation; create a hop between
the Merge Join 4 and Add Sequence steps (Figure 28).
o Double click on the newly created component to open its Basic Settings pane.
o Set SalesNo as the name of value. Check the box for use DB to get sequence. Select the
connection as Oracle12cDB. Set SSSalesNoSeq as sequence name (Figure 29).
Figure 28: Global View of All Nodes and Connections after Step 4
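Conceptually, the Add sequence step attaches the next number from a counter to each incoming row. The sketch below shows that idea with a local Python counter. It is only an illustration: with the database option checked, the values come from the SSSalesNoSeq sequence in Oracle rather than from a counter inside the transformation, and the starting value 1001 used here is an arbitrary assumption.

```python
import itertools
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def add_sequence(rows: Iterable[Row], field: str, start: int, increment: int = 1) -> List[Row]:
    """Return copies of the rows with a new field holding consecutive numbers."""
    counter = itertools.count(start, increment)
    return [{**row, field: next(counter)} for row in rows]


# Hypothetical rows arriving from the last join.
incoming = [{"CustID": "C1"}, {"CustID": "C3"}]
numbered = add_sequence(incoming, "SalesNo", start=1001)
print(numbered)
```

A database sequence keeps its position between runs, so repeated executions continue numbering instead of restarting, which a local counter like this one would not do.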
Figure 30 shows the Insert/Update node (SSSales) connected to Add sequence Node.
o Double click the Insert/Update component, to specify its properties (Figure 31). Set the
step name as SSSales. Select the connection as Oracle12cDB. Type in the Target table
as SSSales. DON’T click the button “Get fields”. Instead, select the names from the two
table fields and set the comparator between them to “=”. The final window should look
like Figure 31.
o Click the “Get Updated fields” button and then click the “Edit mapping” button. The
mapping edit window is shown in Figure 32. Select the fields named
SalesUnits, SalesDollar, SaleCost, CustID, StoreID, ItemID, TIMENO, and SalesNo into the
mappings field. Pentaho will automatically match the corresponding name in the Target
field. Only the SalesNo field has to be manually matched with the SALESNO field. Then click OK.
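The Insert/Update step looks up each incoming row by its key fields, inserts the row when no match exists, and updates the stored row when a match exists but the mapped values differ. The sketch below imitates that decision with Python's built-in sqlite3 module as a stand-in for Oracle. The reduced three-column table, the column names SALESUNITS and SALESDOLLAR, and the function name insert_or_update are assumptions made for illustration; the real SSSales table has more columns.

```python
import sqlite3
from typing import Any, Dict


def insert_or_update(conn: sqlite3.Connection, row: Dict[str, Any]) -> str:
    """Look up the row by SALESNO, then insert it, update it, or leave it alone."""
    found = conn.execute(
        "SELECT SALESUNITS, SALESDOLLAR FROM SSSales WHERE SALESNO = ?",
        (row["SalesNo"],),
    ).fetchone()
    if found is None:
        conn.execute(
            "INSERT INTO SSSales (SALESNO, SALESUNITS, SALESDOLLAR) VALUES (?, ?, ?)",
            (row["SalesNo"], row["SalesUnits"], row["SalesDollar"]),
        )
        return "inserted"
    if found != (row["SalesUnits"], row["SalesDollar"]):
        conn.execute(
            "UPDATE SSSales SET SALESUNITS = ?, SALESDOLLAR = ? WHERE SALESNO = ?",
            (row["SalesUnits"], row["SalesDollar"], row["SalesNo"]),
        )
        return "updated"
    return "unchanged"


# In-memory SQLite table standing in for the Oracle fact table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SSSales (SALESNO INTEGER PRIMARY KEY, SALESUNITS INTEGER, SALESDOLLAR REAL)"
)
first = insert_or_update(conn, {"SalesNo": 1, "SalesUnits": 3, "SalesDollar": 30.0})
second = insert_or_update(conn, {"SalesNo": 1, "SalesUnits": 3, "SalesDollar": 30.0})
third = insert_or_update(conn, {"SalesNo": 1, "SalesUnits": 4, "SalesDollar": 40.0})
row_count = conn.execute("SELECT COUNT(*) FROM SSSales").fetchone()[0]
print(first, second, third, row_count)
```

Because SalesNo is freshly generated for every row in this tutorial, each row normally takes the insert path.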
o Select the SSSales step and run a preview by clicking on . In the transformation debug
dialog click on Quick Launch (Figure 33).
o Connect to your Oracle account (on your PC or remote server) so you can verify the
number of rows in the SSSales table. You should see 104 rows with 8 new rows added to
the 96 rows in the sample data (Figure 35).
o If you do not see the extra rows, the Oracle output component had a failure. To see the
error, check the Execution Results section.
o Under the Design tab, expand the Input node. Figure 36 shows the Design tab and
Input node.
o Select and drag a Microsoft Access Input step onto the canvas on the right.
o Double Click on the Microsoft Access Input. The edit properties dialog box associated
with the Microsoft Access Input step appears (Figure 37). In this dialog box, you specify
the properties related to a particular step.
o Set the name for the Access Input as Sales and specify the Access data source path in the
Files tab.
o In the tab named Content, click the “Get tables” button in the table section. A window
(Figure 38) appears. Select Sales as the table name and click OK.
o In the tab named Fields, click the “Get fields” button. A list (Figure 39) appears
showing the fields in the table named Sales.
Figure 39: Fields Window for Microsoft Access Input Property Editing
o Click the button “Preview rows” to preview the database (Figure 40). When asked for
the number of rows type 12 and click OK.
o Click OK at the bottom of the window. The input icon will change to the shape shown by
Figure 41.
Step 2 –You will add constraint checking for null values using the Filter Rows step.
o Add a Filter Rows step to your transformation. Under the Design tab, expand the Flow
node and select Filter Rows (Figure 42).
o Create a hop between the Sales (Access file input) step and the Filter Rows step. Hops
are used to describe the flow of data in your transformation. To create the hop, click the
Sales (Access file input) step, then hold down the <SHIFT> key and draw a line to the
Filter Rows step.
o Alternatively, you can draw hops by hovering over a step until the hover menu appears.
Drag the hop painter icon from the source step to your target step.
o Double-click the Filter Rows step. The Filter Rows edit properties dialog box appears.
o In the Step Name field, type Filter rows.
o Under The condition, click <field>. A dialog box that contains the fields you can use to
create your condition appears.
o In the Fields: dialog box select SalesUnits and click OK.
o Click on the comparison operator (set to = by default) and select the IS NOT NULL
function and click OK.
o Click the button , add constraints for other columns (Figure 43).
o Under the Design tab, expand the contents of the Transform node.
o Click and drag a Select values step into your transformation.
o Create a “hop” between the Filter rows step and the Select values step (Figure 44).
o Double-click the Select values step to open its edit properties dialog box.
o In the tab named Metadata, click the “Get fields to change” button to list the fields
(Figure 45). Change the Type of the myDate field to String and
its Format to dd-MM-yyyy. Click OK.
o Under the Design tab, expand the contents of the Transform node.
o Click and drag a Split fields step into your transformation (Figure 46).
o Create a “hop” between the Select values step and the Split fields step.
o Double-click the Split fields step to open its edit properties dialog box (Figure 47).
o Select myDate in the Field to split, type “-” as the Delimiter. Type in Year, Month and
Day in the Column named New field, and set their Type as Number.
o Click OK.
o Click , to preview this transform (Figure 48). Make sure that Split Fields step is
selected from the left side panel of the transformation debug dialog and click on “Quick
Launch” button.
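The Select values and Split fields steps together turn a date into a delimited string and then into separate numeric parts. The Python sketch below shows the same two moves. It is an illustration only, and the function names format_date and split_fields are hypothetical. It assumes the parts are named positionally, so the list of new field names follows the order in which the parts appear in the formatted string (day, month, year for a dd-MM-yyyy string).

```python
from datetime import date
from typing import Dict, Sequence


def format_date(value: date) -> str:
    """Render a date as text in the dd-MM-yyyy layout."""
    return value.strftime("%d-%m-%Y")


def split_fields(text: str, delimiter: str, names: Sequence[str]) -> Dict[str, int]:
    """Split text on the delimiter and name the numeric parts in order."""
    parts = text.split(delimiter)
    if len(parts) != len(names):
        raise ValueError(f"expected {len(names)} parts, got {len(parts)}")
    return {name: int(part) for name, part in zip(names, parts)}


# Hypothetical sample date.
as_text = format_date(date(2013, 1, 5))
as_parts = split_fields(as_text, "-", ["Day", "Month", "Year"])
print(as_text, as_parts)
```

When previewing the Split fields step, it is worth checking that the value under each new column really is the day, month, or year you expect.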
o Under the Design tab, expand the contents of the Input node.
o Click and drag a Table Input step into your transformation.
o Double-click the Table Input step to open its edit properties dialog box.
o Rename your Table Input step to SSTimeDim.
o Click “New” next to the connection field. You must create a connection to the database.
The Database connection dialog box appears.
o Provide the settings for connecting to the database as shown in Figure 20.
o Connection Name: Oracle12cDB
Connection Type: Oracle
Host Name:
Database Name: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)
(HOST=132.194.167.74)(PORT=1521)))
(CONNECT_DATA=(SERVICE_NAME=portdb2.ucdenver.pvt)))
Port Number:
Access: Native (JDBC)
Replace the IP address, port number, and service name with the values for your own server.
Also, use your assigned user name and password. Do not use ISMG6480ClassStudent as the
user name.
o Click “Test” to test the connection.
o Type “SELECT * FROM SSTimeDim” in the SQL section. You can click the Preview button to
view the rows returned by the query. Click OK to close the Table Input dialog box.
o Under the Design tab, expand the contents of the Transform node.
o Click and drag a Sort Rows step into your transformation; create a hop between the
Split fields and Sort Rows steps.
o Double-click the Sort Rows step to open its edit properties dialog box. Click “Get fields”
to obtain the fields. Delete all fields except the Day, Month, and Year fields. Then click
OK.
o Add one more Sort rows step, Sort rows 2, and a hop connecting it to the SSTimeDim
step. In the field specification, delete all fields except the TIMEDAY, TIMEMONTH, and
TIMEYEAR fields.
o Under the Design tab, expand the contents of the Joins node.
o Click and drag a Merge Join step into your transformation; create hops from the
Sort rows and Sort rows 2 steps to the Merge Join step.
o Double-click the Merge Join step to specify its properties. Set First step as Sort rows,
Second step as Sort rows 2, and Join Type as INNER. Click both “Get key fields” buttons
(left and right) to get the possible fields to join. In the left table, delete all fields
except the Day, Month, and Year fields. In the right table, delete all fields except the
TIMEDAY, TIMEMONTH, and TIMEYEAR fields. Then click OK.
o You have now finished the inner join between the Access table and the SSTimeDim table.
o Figure 49 shows the global view of all nodes and connections after Step 1.
Figure 49: Global View of All Nodes and Connections after Step 1
o Inner join the tables named SSItem, SSCustomer, and SSStore in your transformation
using the same method described before.
o For the SSItem step, connect the ItemID (from Access file) and ITEMID (from Database) fields.
o For the SSCustomer step, connect the CustID (from Access file) and CUSTID (from Database) fields.
o For the SSStore step, connect the StoreID (from Access file) and STOREID (from Database) fields.
o Figure 50 shows the global view of all nodes and connections after Step 2.
Figure 50: Global View of All Nodes and Connections after Step 2
o Under the Design tab, expand the contents of the Transform node.
o Click and drag an Add sequence step into your transformation; create a hop between the
Merge Join 4 and Add Sequence steps (Figure 51).
o Double click on the newly created component to open its Basic Settings pane.
o Set SalesNo as the name of value. Check the box for use DB to get sequence. Select the
connection as Oracle12cDB. Set SSSalesNoSeq as sequence name (Figure 52).
Figure 51: Global View of All Nodes and Connections after Step 3
Connect to your Oracle account (on your PC or remote server) so you can verify the number of
rows in the SSSales table. You should see 112 rows with 8 new rows added to the 104 rows in
the sample data (Figure 54).