Building Data Pipeline With Pentaho Lab Guide
Contents
Guided Demonstration: Data Source to Dashboard
Review the InputData Transformation
Review and Run the CT2000 Job
Create an Analysis Using the RenewableEnergy Model
View the CT2000 Dashboard
Resources
Guided Demonstration: Data Source to Dashboard
Introduction In this guided demonstration, you will review a Pentaho Data Integration (PDI)
transformation that obtains data about energy generation and usage around the
world, prepares the data for analytics by building a data model (cube), and
publishes the data to the repository as a data service. You will then review a PDI
job that runs the transformation and publishes the cube to the repository so it
can be used for analytics. Finally, you will use Analyzer to analyze and visualize
the data.
Objectives After completing this guided demonstration, you will be able to:
• Review the InputData transformation
• Review and run the CT2000 job
• Create an analysis using the RenewableEnergy model
• View the CT2000 dashboard
Note The transformation and job reviewed in this demonstration use a sampling of
PDI steps and job entries. The steps and job entries used in production vary
depending on the incoming data and the business objectives.
Review the InputData Transformation
Transformations are used to describe the data flows for Extract, Transform, and Load (ETL) processes, such as reading from a source, transforming data, and loading it into a target location. Each “step” in a transformation applies specific logic to the data flowing through the transformation. The steps are connected with “hops” that define the pathways the data follows through the transformation. The data flowing through the transformation is referred to as the “stream.”
The InputData transformation receives data from a Microsoft Excel file containing data about energy
generation and usage around the world. It then fine-tunes the data, creates a data model (OLAP cube),
and publishes the data to the repository as a Pentaho Data Service.
The Microsoft Excel Input step reads data from one or more Excel or OpenOffice files. In this example, the Excel file contains data about energy generation and usage by country for the
years 2000-2015.
3. To preview the data, click Preview Rows, and then click OK.
4. To close the preview, click Close, and then to close the step dialog, click OK.
The Select Values step is useful for selecting, removing, and renaming fields, changing their data types, and configuring the length and precision of the fields in the stream. In this example, the fields are reordered, and the
Technology field is replicated four times to create the Tech1, Tech2, Tech3, and Tech4 fields. You will
see the purpose of those fields later in this demonstration.
The Modified Java Script Value step provides an expression-based user interface for building JavaScript expressions, and it allows you to create multiple scripts within a single step. The Technology field from the spreadsheet contains the specific type of energy (for example, Renewable Municipal Waste). Since the specific energy sources can be grouped into higher-level categories, the expressions in this step assign each energy source to a set of categories, creating a hierarchy that will be used in the OLAP cube.
For example, the Technology “Renewable Municipal Waste” is expanded into the following four fields:
Tech1: Total Renewable Energy
Tech2: Bioenergy
Tech3: Solid Biofuels
Tech4: Renewable Municipal Waste
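The expressions themselves run as JavaScript inside the Modified Java Script Value step, but the underlying idea can be sketched in Java for illustration. Only the Renewable Municipal Waste mapping comes from this demonstration; the fallback behavior is an assumption:

```java
// Illustrative sketch only: the real logic runs as JavaScript expressions inside
// the Modified Java Script Value step. Apart from the Renewable Municipal Waste
// example above, the fallback behavior here is an assumption.
public class TechnologyHierarchy {

    /** Returns the four hierarchy levels {Tech1, Tech2, Tech3, Tech4} for a Technology value. */
    static String[] categorize(String technology) {
        if ("Renewable Municipal Waste".equals(technology)) {
            return new String[] {"Total Renewable Energy", "Bioenergy", "Solid Biofuels", technology};
        }
        // Fallback (assumed): repeat the technology at every level of the hierarchy.
        return new String[] {technology, technology, technology, technology};
    }

    public static void main(String[] args) {
        // Prints: Total Renewable Energy > Bioenergy > Solid Biofuels > Renewable Municipal Waste
        System.out.println(String.join(" > ", categorize("Renewable Municipal Waste")));
    }
}
```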
The Filter Rows step filters rows based on conditions and comparisons. The rows are then directed
based on whether the filter evaluates to ‘true’ or ‘false.’ In this example, the previous JavaScript step
results in some redundant data, so those rows are filtered out of the stream.
The Sort rows step sorts rows based on the fields you specify and on whether they should be sorted in
ascending or descending order.
1. Double-click the Sort rows step, and then review the configuration.
The Row Denormaliser step allows you to denormalize data by looking up key-value pairs. It also allows you to immediately convert data types. In this example, the Indicator field is used to denormalize the rows and create two additional fields: Total Generated GWh and Total Capacity MW.
2. To close the step dialog, click OK, and then click Close.
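To make the key-value pivot performed by the Row Denormaliser step more concrete, here is a minimal Java sketch of the same idea. The country name, indicator labels, and figures are placeholder assumptions, not values from the demonstration spreadsheet, and the real step is configured in its dialog rather than coded:

```java
import java.util.*;

// Sketch of the key-value pivot idea behind the Row Denormaliser step.
// Country, indicator labels, and figures are placeholder assumptions.
public class DenormaliseSketch {

    record InputRow(String country, String indicator, double value) {}

    public static void main(String[] args) {
        List<InputRow> rows = List.of(
            new InputRow("CountryA", "Total Generated GWh", 1000.0),
            new InputRow("CountryA", "Total Capacity MW", 250.0));

        // Group rows by country, turning each indicator/value pair into its own column.
        Map<String, Map<String, Double>> pivoted = new LinkedHashMap<>();
        for (InputRow r : rows) {
            pivoted.computeIfAbsent(r.country(), k -> new LinkedHashMap<>())
                   .put(r.indicator(), r.value());
        }

        // One output row per country with Total Generated GWh and Total Capacity MW columns.
        pivoted.forEach((country, columns) -> System.out.println(country + " -> " + columns));
    }
}
```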
The second Filter Rows step removes rows with Total Capacity MW of zero.
The Annotate Stream step helps you refine your data for the Streamlined Data Refinery by creating measures, link dimensions, or attributes on the stream fields you specify. In this example, Total Generated GWh and Total Capacity MW are defined as measures, and the remaining fields are defined as dimensions within hierarchies for the location and the technologies. The Annotate Stream step modifies the default model produced by the Build Model job entry. You will review the Build Model job entry later in this demonstration.
Prototyping a data model can be time consuming, particularly when it involves setting up databases, creating the data model, setting up a data warehouse, and then negotiating access so that analysts can
visualize the data and provide feedback. One way to streamline this process is to make the output of a
transformation step a Pentaho Data Service. The output of the transformation step is exposed by the
data service so that the output data can be queried as if it were stored in a physical table, even though
the results of the transformation are not stored in a physical database. Instead, results are published to
the Pentaho Server as a virtual table. The results of this transformation are being used to create a
Pentaho Data Service called DataServiceCT2000.
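As a rough sketch of what this enables, a client application could query the DataServiceCT2000 virtual table over JDBC. The driver class name, connection URL, credentials, and SQL identifier quoting below are typical defaults and assumptions to verify against your own Pentaho installation:

```java
import java.sql.*;

// Sketch: querying the DataServiceCT2000 virtual table through Pentaho's thin JDBC
// driver. The driver class name, URL format, credentials, and SQL quoting are
// assumptions to verify against your environment.
public class QueryDataService {
    public static void main(String[] args) throws Exception {
        Class.forName("org.pentaho.di.trans.dataservice.jdbc.ThinDriver"); // assumed driver class
        String url = "jdbc:pdi://localhost:8080/kettle?webappname=pentaho-di"; // assumed URL format

        try (Connection conn = DriverManager.getConnection(url, "admin", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT Country, SUM(\"Total Generated GWh\") AS total_gwh "
               + "FROM DataServiceCT2000 GROUP BY Country")) {
            while (rs.next()) {
                System.out.println(rs.getString("Country") + ": " + rs.getDouble("total_gwh"));
            }
        }
    }
}
```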
Review and Run the CT2000 Job
Jobs are used to coordinate ETL activities, such as defining the flow and dependencies that determine the order in which transformations run, or preparing for execution by checking various conditions, such as ensuring that a source file is available.
The CT2000 job executes the InputData transformation, builds the data model (cube) based on the
Annotate Stream step, and then publishes the model to the repository. After the job runs, the data
service and model are available for reporting, analysis, and dashboarding.
The Build Model job entry creates Data Source Wizard (DSW) data models. In this example, the
RenewableEnergy model is created from the DataServiceCT2000 data service based on the annotations
defined in the Annotate Stream step.
The Publish Model job entry allows you to publish the data model created with the Build Model job
entry so it is available for use on the Pentaho Server.
Notice the green checkmarks indicating that each job entry successfully completed.
Create an Analysis Using the RenewableEnergy Model
4. To add Total Generated (GWh) to the Measures, double-click Total Generated (GWh).
5. To add Continent to the Rows, double-click Continent.
12. To return to the table, on the toolbar, click the Switch to table format icon.
13. To close the analysis, on the Analysis Report tab, click the X, and then click Yes. (It is not
necessary to save this analysis.)
View the CT2000 Dashboard
The CT2000 dashboard was created with CTools, using the RenewableEnergy data model and the
DataServiceCT2000 data service to provide an interactive dashboard that allows users to explore the
data from various perspectives.
To view the CT2000 dashboard:
1. From the Home Perspective, click Browse Files.
2. In the Folders panel, navigate to the Public>CT2000>dashboards folder.
3. In the Files panel, double-click the CDE sample file.
Resources
https://www.hitachivantara.com
https://www.hitachivantara.com/en-us/solutions/data-analytics.html
https://www.hitachivantara.com/en-us/products/big-data-integration-analytics/pentaho-data-integration.html
https://www.hitachivantara.com/en-us/products/big-data-integration-analytics/pentaho-business-analytics.html
Training
https://www.hitachivantara.com/en-us/services/training-certification/training/pentaho.html
CTools