
Hitachi NEXT 2018

Building a Data Pipeline With Pentaho – From Ingest to Analytics

Contents
Page 2: Guided Demonstration: Data Source to Dashboard
Page 3: Review the InputData Transformation
Page 11: Review and Run the CT2000 Job
Page 14: Create an Analysis Using the RenewableEnergy Model
Page 16: View the CT2000 Dashboard
Page 17: Resources

HITACHI is a trademark or registered trademark of Hitachi, Ltd.
Guided Demonstration: Data Source to Dashboard
Introduction In this guided demonstration, you will review a Pentaho Data Integration (PDI)
transformation that obtains data about energy generation and usage around the
world, prepares the data for analytics by building a data model (cube), and
publishes the data to the repository as a data service. You will then review a PDI
job that runs the transformation and publishes the cube to the repository so it
can be used for analytics. Finally, you will use Analyzer to analyze and visualize
the data.

Objectives After completing this guided demonstration, you will be able to:

• Describe the purpose of a Transformation and the following transformation steps:
- Microsoft Excel Input
- Select Values
- Modified Java Script Value
- Filter Rows
- Sort Rows
- Row Denormaliser
- Annotate Stream
• Create a Pentaho Data Service from a transformation step
• Describe the purpose of a Job and the following job entries:
- Start
- Transformation
- Build Model
- Publish Model
• Use Pentaho Analyzer to analyze and visualize data

Note The transformation and job reviewed in this demonstration use a sampling of
PDI steps and job entries. The steps and job entries used in production vary
depending on the incoming data and the business objectives.


Review the InputData Transformation

Start Pentaho Data Integration (Spoon) and Connect to the Repository

1. On the desktop, double-click the Data Integration icon.


2. To connect to the repository, at the far right of the toolbar, click Connect, and then click
Pentaho Repository.
3. Enter the User Name as admin, and the Password as password, and then click Connect.

Open the InputData Transformation

Transformations are used to describe the data flows for Extract, Transform, and Load (ETL) processes,
such as reading from a source, transforming data, and loading it into a target location. Each “step” in a
transformation applies specific logic to the data flowing through the transformation. The steps are
connected with “hops” that define the pathways the data follow through the transformation. The data
flowing through the transformation is referred to as the “stream.”
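The step/hop/stream idea can be pictured as a pipeline of row-level operations. The following is an illustrative sketch only (not Pentaho's API): each "step" is a function over the rows, chaining them plays the role of "hops", and the array of rows is the "stream". The field names are invented for the example.

```javascript
// Illustrative sketch only -- not Pentaho's API. Each "step" is a function
// over the row stream; connecting the functions in sequence plays the role
// of "hops" between steps.
const steps = [
  rows => rows.map(r => ({ ...r, Country: r.Country.trim() })), // transform step
  rows => rows.filter(r => r.Year >= 2000),                     // filter step
];

// The "stream" is the data flowing through the connected steps.
function runTransformation(stream, steps) {
  return steps.reduce((rows, step) => step(rows), stream);
}

const stream = [
  { Country: " Norway ", Year: 1999 },
  { Country: "Brazil", Year: 2010 },
];
console.log(runTransformation(stream, steps));
// -> [ { Country: 'Brazil', Year: 2010 } ]
```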

The InputData transformation receives data from a Microsoft Excel file containing data about energy
generation and usage around the world. It then fine-tunes the data, creates a data model (OLAP cube),
and publishes the data to the repository as a Pentaho Data Service.

To open the InputData transformation:


1. From the menu, select File, and then click Open.
2. Navigate to the Public>CT2000>files>KTR folder.
3. Double-click InputDataTransformation.


Review the Microsoft Excel Input Step

The Microsoft Excel Input step provides the ability to read data from one or more Excel and OpenOffice
files. In this example, the Excel file contains data about energy generation and usage by country for the
years 2000-2015.

To review the Microsoft Excel Input step:


1. Double-click the Input Data xls step, and then review the configuration of the Files tab.

2. Click the Fields tab, and then review the configuration.

3. To preview the data, click Preview Rows, and then click OK.

4. To close the preview, click Close, and then to close the step dialog, click OK.


Review the Select Values Step

The Select Values step is useful for selecting, removing, and renaming fields in the stream, changing
their data types, and configuring their length and precision. In this example, the fields are reordered, and the
Technology field is replicated four times to create the Tech1, Tech2, Tech3, and Tech4 fields. You will
see the purpose of those fields later in this demonstration.

To review the Select Values step:


1. Double-click the Defines fields step, and then review the configuration.

2. To close the step dialog, click OK.

Review the Modified Java Script Value Step

The Modified Java Script Value step provides an expression-based user interface for building JavaScript
expressions, and it allows you to create multiple scripts within a single step. The Technology field from
the spreadsheet contains the specific type of energy (for example, Renewable Municipal Waste). Since
the specific energy sources can be categorized into higher levels, the expressions in this step assign the
energy source to various categories to create a hierarchy that will be used in the OLAP cube.

For example, the Technology “Renewable Municipal Waste” gets turned into the following four fields:
Tech1: Total Renewable Energy
Tech2: Bioenergy
Tech3: Solid Biofuels
Tech4: Renewable Municipal Waste
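The actual script lives in the step itself; a minimal sketch of this kind of hierarchy lookup might look like the following. The mapping table here is an assumption reconstructed from the example above; the real step covers many more technologies.

```javascript
// Illustrative sketch, NOT the actual step script. The mapping table is an
// assumption based on the "Renewable Municipal Waste" example in the text.
const techHierarchy = {
  "Renewable Municipal Waste": {
    Tech1: "Total Renewable Energy",
    Tech2: "Bioenergy",
    Tech3: "Solid Biofuels",
    Tech4: "Renewable Municipal Waste",
  },
};

// Assigns the four hierarchy-level fields to a row based on its Technology.
function buildTechHierarchy(row) {
  const levels = techHierarchy[row.Technology];
  return levels ? { ...row, ...levels } : row;
}

const row = buildTechHierarchy({ Technology: "Renewable Municipal Waste" });
console.log(row.Tech2); // -> Bioenergy
```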


To review the Modified Java Script Value step:
1. Double-click the Builds tech hierarchy step.
2. Click the Item_0 tab, and then review the script.

3. Click the Script 1 tab, and then review the script.

4. To close the step dialog, click OK.


Review the Filter Rows Step

The Filter Rows step filters rows based on conditions and comparisons. The rows are then directed
based on whether the filter evaluates to ‘true’ or ‘false.’ In this example, the previous JavaScript step
results in some redundant data, so those rows are filtered out of the stream.
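What "filtering out redundancy" amounts to can be sketched as keeping only the first row for each key, as below. This is assumed logic for illustration only; the actual condition is defined in the step dialog, and the key fields here are invented.

```javascript
// Assumed illustration of redundancy filtering -- the real Filter Rows
// condition lives in the step dialog. Rows duplicating an earlier
// (Country, Year, Technology) key are dropped from the stream.
function filterRedundant(rows) {
  const seen = new Set();
  return rows.filter(r => {
    const key = `${r.Country}|${r.Year}|${r.Technology}`;
    if (seen.has(key)) return false; // redundant row: filter out
    seen.add(key);
    return true;
  });
}

const sample = [
  { Country: "Spain", Year: 2010, Technology: "Solar" },
  { Country: "Spain", Year: 2010, Technology: "Solar" }, // redundant
  { Country: "Spain", Year: 2011, Technology: "Solar" },
];
console.log(filterRedundant(sample).length); // -> 2
```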

To review the Filter Rows step:


1. Double-click the Filters out redundancy step, and then review the configuration.

2. To close the step dialog, click OK.

Review the Sort Rows Step

The Sort rows step sorts rows based on the fields you specify and on whether they should be sorted in
ascending or descending order.

To review the Sort Rows step:

1. Double-click the Sort rows step, and then review the configuration.

2. To close the step dialog, click OK.


Review the Row Denormaliser Step

The Row Denormaliser step allows you to denormalize data by looking up key-value pairs. It also allows
you to convert data types on the fly. In this example, the Indicator field is used to denormalize the rows
and create two additional fields: Total Generated GWh and Total Capacity MW.
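The pivot this step performs can be sketched as follows. This is an in-memory illustration of the idea, not the step's implementation; the grouping fields and sample values are assumptions, while the Indicator-derived field names follow the text.

```javascript
// Illustrative sketch of what the Row Denormaliser does here: rows sharing a
// grouping key, each carrying an Indicator (key) and Value, are pivoted into
// one row with a column per distinct Indicator value.
function denormalise(rows, groupFields) {
  const grouped = new Map();
  for (const r of rows) {
    const key = groupFields.map(f => r[f]).join("|");
    if (!grouped.has(key)) {
      const base = {};
      for (const f of groupFields) base[f] = r[f];
      grouped.set(key, base);
    }
    // One new field per distinct Indicator value.
    grouped.get(key)[r.Indicator] = r.Value;
  }
  return [...grouped.values()];
}

const input = [
  { Country: "Chile", Year: 2012, Indicator: "Total Generated GWh", Value: 70 },
  { Country: "Chile", Year: 2012, Indicator: "Total Capacity MW", Value: 18 },
];
console.log(denormalise(input, ["Country", "Year"]));
// -> [ { Country: 'Chile', Year: 2012,
//        'Total Generated GWh': 70, 'Total Capacity MW': 18 } ]
```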

To review the Row Denormaliser step:


1. Double-click the Denormalises Indicator step, and then review the configuration.

2. To close the step dialog, click OK, and then click Close.

Review the Second Filter Rows Step

The second Filter Rows step removes rows with Total Capacity MW of zero.

To review the Filter Rows step:


1. Double-click the Remove Capacity = 0 step, and then review the configuration.

2. To close the step dialog, click OK.


Review the Annotate Stream Step

The Annotate Stream step helps you refine your data for the Streamlined Data Refinery by creating
measures, link dimensions, or attributes on the stream fields you specify. In this example, Total
Generated GWh and Total Capacity MW are defined as measures, and the remaining fields are defined
as dimensions within hierarchies for the location and the technologies. The Annotate Stream step
modifies the default model produced by the Build Model job entry. You will review the Build Model job
entry later in this demonstration.
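Conceptually, the annotations mark each stream field as either a measure or a level in a dimension hierarchy, which the Build Model job entry later uses to shape the cube. The sketch below is an assumed, simplified representation of that idea; it is not the step's actual configuration format, and the aggregation and level values are illustrative.

```javascript
// Assumed illustration of the annotations applied in this example (not the
// step's real file format): measures carry an aggregation, and dimension
// fields are placed at a level within a named hierarchy.
const annotations = [
  { field: "Total Generated GWh", type: "measure", aggregation: "SUM" },
  { field: "Total Capacity MW", type: "measure", aggregation: "SUM" },
  { field: "Continent", type: "dimension", hierarchy: "Location", level: 1 },
  { field: "Country", type: "dimension", hierarchy: "Location", level: 2 },
  { field: "Tech1", type: "dimension", hierarchy: "Technology", level: 1 },
  { field: "Tech2", type: "dimension", hierarchy: "Technology", level: 2 },
];

// A model builder would read the annotations to decide what becomes a
// measure and what becomes a hierarchy level in the cube.
const measures = annotations.filter(a => a.type === "measure").map(a => a.field);
console.log(measures); // -> [ 'Total Generated GWh', 'Total Capacity MW' ]
```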

To review the Annotate Stream step:


1. Double-click the Sets measures and hierarchies step, and then review the configuration.

2. To close the step dialog, click OK.

Review the Output Step

Prototyping a data model can be time-consuming, particularly when it involves setting up databases,
creating the data model, setting up a data warehouse, and then negotiating access so that analysts can
visualize the data and provide feedback. One way to streamline this process is to make the output of a
transformation step a Pentaho Data Service. The output of the transformation step is exposed by the
data service so that the output data can be queried as if it were stored in a physical table, even though
the results of the transformation are not stored in a physical database. Instead, results are published to
the Pentaho Server as a virtual table. The results of this transformation are being used to create a
Pentaho Data Service called DataServiceCT2000.


To review the Data Service:
1. Right-click the OUTPUT step, then click Data Services, and then click Edit.

2. To close the Data Service dialog, click OK.


Review and Run the CT2000 Job

Open the CT2000 Job

Jobs are used to coordinate ETL activities, such as defining the flow and dependencies that determine
the order in which transformations run, or preparing for execution by checking various conditions, such
as ensuring a source file is available.

The CT2000 job executes the InputDataTransformation, builds the data model (cube) based on the
Annotate Stream step, and then publishes the model to the repository. After the job runs, the data
service and model are available for reporting, analysis, and dashboarding.

To open the CT2000 job:

1. From the menu, select File, and then click Open.


2. Double-click CT2000JOB.

Review the Build Model Job Entry

The Build Model job entry creates Data Source Wizard (DSW) data models. In this example, the
RenewableEnergy model is created from the DataServiceCT2000 data service based on the annotations
defined in the Annotate Stream step.


To review the Build Model job entry:
1. Double-click the Build Model job entry.

2. To close the job entry dialog, click OK.

Review the Publish Model Job Entry

The Publish Model job entry allows you to publish the data model created with the Build Model job
entry so it is available for use on the Pentaho Server.

To review the Publish Model job entry:


1. Double-click the Publish Model job entry.

2. To close the job entry dialog, click OK.


Run the CT2000 Job

To run the CT2000 job:


1. On the sub-toolbar, click the Run button.
2. Verify the Run Options, and then click Run.

Notice the green checkmarks indicating that each job entry successfully completed.


Create an Analysis Using the RenewableEnergy Model

Start the Pentaho User Console

1. On the desktop, double-click the User Console Login icon.


2. In the User Name field, type admin, then in the Password field, type password, and then click
Login.

Create an Analysis Using the RenewableEnergy Model

To create a new analysis:


1. From the Home Perspective, click Create New>Analysis Report.
2. In the Select Data Source window, click Renewable Energy:Renewable Energy, and then click
OK.
3. Review the RenewableEnergy model/cube.

4. To add Total Generated (GWh) to the Measures, double-click Total Generated (GWh).
5. To add Continent to the Rows, double-click Continent.


6. To add Tech2 to the Columns, select Tech2 and drag it to the Columns drop zone on the Layout
panel.
7. To drill down to the Tech3 level for Bioenergy, double-click the Bioenergy column header.
8. To drill down to the Tech4 level for Solid biofuels, double-click the Solid biofuels column
header.
9. To keep only the Renewable municipal waste data, right-click the Renewable municipal waste
column header, and then click Keep Only Renewable municipal waste.
10. To drill down to the Country level for Europe, double-click the Europe row header.
11. To view the analysis as a chart, on the toolbar, click the Choose chart type icon, and then click
Column.

12. To return to the table, on the toolbar, click the Switch to table format icon.
13. To close the analysis, on the Analysis Report tab, click the X, and then click Yes. (It is not
necessary to save this analysis.)


View the CT2000 Dashboard

The CT2000 dashboard was created with the CTools using the RenewableEnergy data model and the
DataServiceCT2000 data service to provide an interactive dashboard that allows users to explore the
data from various perspectives.
To view the CT2000 dashboard:
1. From the Home Perspective, click Browse Files.
2. In the Folders panel, navigate to the Public>CT2000>dashboards folder.
3. In the Files panel, double-click the CDE sample file.


Resources

Hitachi Vantara Web Site

https://www.hitachivantara.com

Innovate with Data and Analytics

https://www.hitachivantara.com/en-us/solutions/data-analytics.html

Pentaho Data Integration

https://www.hitachivantara.com/en-us/products/big-data-integration-analytics/pentaho-data-
integration.html

Pentaho Business Analytics

https://www.hitachivantara.com/en-us/products/big-data-integration-analytics/pentaho-business-
analytics.html

Training

https://www.hitachivantara.com/en-us/services/training-certification/training/pentaho.html

Pentaho Data Integration

DI1000: Pentaho Data Integration Fundamentals

DI1500: Pentaho Data Integration Advanced

Pentaho Business Analytics

BA1000: Business Analytics User Console

BA2000: Business Analytics Report Designer

BA3000: Business Analytics Data Modeling

CTools

CT1000: CTools Fundamentals

CT1500: CTools Advanced
