
Introduction to Dimensional Database Analysis

ARTICLE 1:

Design a multidimensional cube using Schema Workbench in Pentaho CE BI Suite 3.0
This exercise will teach you how to design a new cube, publish it to the Pentaho server, and view the cube via the Pentaho User Console.

This exercise uses Pentaho Schema Workbench 3.0.4 (stable) for designing the cubes, Pentaho CE BI Suite 3.0 for hosting and viewing the cube we design, and ‘SampleData’, a Hypersonic SQL (HSQLDB) database, as the data source. This ‘SampleData’ database is available as part of the free download of Pentaho CE BI Suite 3.0. Follow the steps below to design and view a new cube.

Step 1:
Make sure the Pentaho server is up and running.

Step 2:
Once the Pentaho server is started, go to the folder where the ‘Schema Workbench’ tool is installed on your system.

Step 3:
In the ‘schema-workbench’ folder, double-click the batch file ‘workbench’ (or right-click it and choose to run it) to start the Schema Workbench tool. This will open the Schema Workbench window along with a command prompt. Maximize the window.

Step 4:
Click on the ‘Tools’ menu and select ‘Preferences’. This will open the ‘Workbench Preferences’ window, where we need to provide the JDBC details for the data source we use.

Step 5:
In the ‘Workbench Preferences’ window, provide the following details.

Note: As we are using the ‘SampleData’ HSQLDB database, the details given here are specific to that database. If you use Oracle, MySQL, etc., they will change.

Driver Class Name: org.hsqldb.jdbcDriver
Connection URL: jdbc:hsqldb:hsql://localhost/sampledata
User Name: pentaho_user
Password: password
Schema (Optional): <leave it blank>
Require Schema Attributes: check this option.

Click on the ‘Accept’ button.

Step 6:


To create a new schema file, in the menu bar select the ‘File -> New -> Schema’ menu item.

This will open the ‘New Schema 1’ window with the schema file name as ‘Schema1.xml’. Refer to the screenshot below.

Step 7:
Click on ‘Schema’ as shown above and set its required properties, such as the name of the schema. For now, enter the name as ‘SchemaTest’.

Step 8:
Right-click on the ‘Schema’ element and select the ‘Add Cube’ option. This will add a new cube to the schema.

Step 9:
Set the name of the cube as ‘CubeTest’. Once that is done, the schema design will look like the image below.


Step 10:
Basically, a cube is a structure made up of a number of dimensions, measures, etc. Cubes usually rely on two kinds of tables: a ‘fact table’ (for the cube itself) and ‘dimension tables’ (for the cube’s different dimensions). A cube can have only one fact table and ‘n’ dimension tables (based on the number of dimensions in the cube).
So our next step is to set the fact table for the cube ‘CubeTest’. To do so, click on the icon before the cube image, as indicated by #2 in the above screenshot. This will expand the cube node as in the image below.

Step 11:
Now click on the ‘Table’ element; this will list the attributes specific to the ‘Table’ element. Clicking on the ‘name’ attribute will display all tables available under the current data source (the database we set in Step 5). Select the table ‘CUSTOMERS’.

Once you choose the table ‘PUBLIC -> CUSTOMERS’, the ‘schema’ attribute value will be filled in automatically as ‘PUBLIC’.

Note: If the fact table doesn’t belong to the schema mentioned in Step 5, then you must explicitly specify the schema to which your fact table belongs.

Step 12:
Now add a new dimension called ‘CustomerGeography’ to the cube by right-clicking the cube element ‘CubeTest’ and selecting ‘Add Dimension’.


Step 13:
For the newly added dimension, set the required attribute values, such as name and foreign key.

Set the name of the dimension as ‘CustomerGeography’ and the foreign key as ‘CUSTOMERNUMBER’.

Double-click on the dimension name ‘CustomerGeography’. This will expand the node and display the ‘Hierarchy’ element.

Click on the ‘Hierarchy’ element in the left-side pane to see its attribute properties.

Set name -> CustomerGeo; allMemberName -> ‘All Countries’

Step 14:
Double-clicking the ‘Hierarchy’ element in the left-side pane will expand the node further and show the ‘Table’ element. Click on the ‘Table’ element to set the dimension table for the dimension ‘CustomerGeography’. This will list the related attributes in the right-side pane. Clicking on the ‘name’ attribute’s value field will list the tables available in the current schema.

Select ‘CUSTOMERS’. This will automatically fill the ‘schema’ field as ‘PUBLIC’.

Step 15:
Right-click on the ‘Hierarchy’ element in the left-side pane and select ‘Add Level’.


This will add a new level named ‘New Level 0’. Refer to the screenshot below.

To rename it and set its other attributes, set the attribute values (as listed below) for the newly created level in the right-side pane.

Name -> CustomerCountry
Column -> COUNTRY
Type -> String
uniqueMembers -> true

Now we have added a level called ‘CustomerCountry’.

Step 16:
To add another level, right-click on ‘Hierarchy’ in the left-side pane (as we did in Step 15) and select ‘Add Level’. This will add a new level named ‘New Level 1’. To rename it and set its other attributes, set the attribute values in the right-side pane as below:

Name -> CustomerCity
Column -> CITY
Type -> String

So far, we have created a cube with one dimension that will show two hierarchical levels of detail.

Step 17:
To add a new dimension to the cube, right-click on the cube item (‘CubeTest’) in the left-side pane, then select ‘Add Dimension’.


This will add a new dimension to the cube with a default name. To rename it and set other attribute values, click on the newly created dimension in the left-side pane. This will list the attributes of the dimension.

Set name -> CustomerContact; foreignKey -> ‘CUSTOMERNUMBER’


Step 18:
To add a hierarchy and levels for this dimension, double-click on the dimension name, which will expand the dimension node ‘CustomerContact’. Click on the ‘Hierarchy’ element in the left-side pane, then set the attribute values below in the right-side pane.

Set name -> ‘’ (leave it blank); allMemberName -> ‘All Contacts’


Step 19:
Double-clicking the ‘Hierarchy’ element will expand the node, where you can set the dimension table for the dimension ‘CustomerContact’.

Click on the ‘Table’ element and select the table ‘CUSTOMERS’.

Step 20:
To add a new level to this dimension’s hierarchy, right-click on the ‘Hierarchy’ element and select ‘Add Level’. This will add a new level named ‘New Level 0’. Rename it by changing the attribute values as below:

Name -> CustomerNames
Column -> CONTACTFIRSTNAME
Type -> String

Step 21:
To add a new measure to the cube, right-click on the cube ‘CubeTest’ and select ‘Add Measure’. This will add a new measure named ‘New Measure 0’. You can rename it by changing the attribute values.


For example, we want to count the number of customers under each country/city.

Set the attribute values as below:

Name -> CustomerCount
Aggregator -> count
Column -> CUSTOMERNUMBER
Format string -> ####
Datatype -> Integer

After setting up the measure, the cube (‘CubeTest’) schema structure will look like below.
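For reference, here is a minimal sketch of the Mondrian schema XML that these steps produce when the file is saved. It is reconstructed from the values set above; the primaryKey attribute on each hierarchy is an assumption, and Workbench may emit slightly different attributes depending on its version.

<Schema name="SchemaTest">
  <Cube name="CubeTest">
    <!-- Fact table set in Step 11 -->
    <Table name="CUSTOMERS" schema="PUBLIC"/>
    <Dimension name="CustomerGeography" foreignKey="CUSTOMERNUMBER">
      <!-- primaryKey is an assumption, not set explicitly in the steps above -->
      <Hierarchy name="CustomerGeo" hasAll="true" allMemberName="All Countries" primaryKey="CUSTOMERNUMBER">
        <Table name="CUSTOMERS" schema="PUBLIC"/>
        <Level name="CustomerCountry" column="COUNTRY" type="String" uniqueMembers="true"/>
        <Level name="CustomerCity" column="CITY" type="String"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="CustomerContact" foreignKey="CUSTOMERNUMBER">
      <Hierarchy hasAll="true" allMemberName="All Contacts" primaryKey="CUSTOMERNUMBER">
        <Table name="CUSTOMERS" schema="PUBLIC"/>
        <Level name="CustomerNames" column="CONTACTFIRSTNAME" type="String"/>
      </Hierarchy>
    </Dimension>
    <Measure name="CustomerCount" column="CUSTOMERNUMBER" aggregator="count" formatString="####" datatype="Integer"/>
  </Cube>
</Schema>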


Step 22:
Now select the ‘File -> Save’ menu item to save the cube schema in your desired path.

For example, save it as ‘TestCube.mondrian.xml’.


Step 23:
Once the schema is saved, you can publish the cube to the Pentaho server (we are using the Pentaho Community Edition BI Suite 3.0 stable version).

Select the ‘File -> Publish…’ menu item to publish the cube. This will open a publish dialog like the one below. Follow the instructions in the screenshot and click the ‘OK’ button, or click the ‘Cancel’ button to cancel the publishing action. After clicking ‘OK’, you can see the processing action.

Then a publish dialog will open, where you have to specify the location in the Pentaho server to which the cube should be published.
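Note: publishing will prompt for a publish password. In Pentaho CE this is not a user password; it has to be set on the server beforehand in pentaho-solutions/system/publisher_config.xml. A sketch of that file (the password value here is just an example; choose your own):

<publisher-config>
    <publisher-password>password</publisher-password>
</publisher-config>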


Choose the location where you want to publish and click on the ‘Publish’ button.

On successful publishing, the system will display a dialog box with the message ‘Publish successful’. Click the ‘OK’ button.

Step 24:
To view the published cube, browse to the Pentaho server URL.

For example, http://localhost:8080/pentaho


Click on the ‘Pentaho User Console Login’ button; a login dialog box will appear.

Log in as user ‘joe’ (an administrator) with password ‘password’, or as any other user with administrative privileges. After selecting the user (or entering the user credentials), click the ‘Login’ button to log in, or the ‘Cancel’ button to cancel the login process.

Step 25:
After logging in, you will be redirected to the Pentaho BI home page.


Click on the ‘New Analysis View’ button. This will list the schemas accessible to the currently logged-in user.

By default, this schema list includes ‘SampleData’ and ‘SteelWheels’, along with the schema we published earlier.

Step 26:
Select ‘SchemaTest’ as the Schema and ‘CubeTest’ as the Cube, then click the ‘OK’ button.

Step 27:
This will render the cube in an ‘Analysis View’ window like the one below.
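Behind the scenes, the Analysis View queries the cube using MDX. As a rough sketch (member and measure names match the schema defined above; the exact query the console generates may differ), a query counting customers per country would look like:

SELECT {[Measures].[CustomerCount]} ON COLUMNS,
       {[CustomerGeography].[All Countries].Children} ON ROWS
FROM [CubeTest]

Drilling down on a country in the Analysis View expands it into its cities, i.e. the second level of the hierarchy.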

Result:
A new cube has been designed, configured, and published to the Pentaho server, and we have viewed the cube via the Pentaho User Console.


ARTICLE 2:
Implementing a database cubes system on MySQL

I’ve already talked about how I solved the problem of managing huge amounts of data in my
last post. Now, I’m going to explain how to implement one of the solutions found in order
to comfortably face this continuously increasing avalanche of information.
Let’s imagine that, as I explained before, I have separated data input into tables coded by a
station ID. All of them are stored in the same database for maintenance simplicity’s sake. So
we have ‘database’, which contains hundreds of tables called ‘Station_XYZ’. Every table has
the same structure; to simplify: SensorID, Time (UNIX), Value. All right then, time to make cubes!
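For concreteness, here is a sketch of what these tables might look like. The post doesn’t show the DDL, so the column types and the ‘Station_001’ placeholder name are assumptions based on the description:

-- One source table per station; 'Station_001' is a placeholder name.
CREATE TABLE `database`.`Station_001` (
  SensorID INT NOT NULL,
  Time     INT NOT NULL,     -- UNIX epoch seconds
  Value    DOUBLE NOT NULL
);

-- The matching hourly cube table. SensorID + Time form the primary key,
-- which is what lets ON DUPLICATE KEY UPDATE work later in the post.
CREATE TABLE `database_cubes_1h`.`Station_001` (
  SensorID INT NOT NULL,
  Time     INT NOT NULL,     -- start of the hour, UNIX epoch seconds
  Value    DOUBLE NOT NULL,  -- hourly average
  ValueMax DOUBLE NOT NULL,
  ValueMin DOUBLE NOT NULL,
  PRIMARY KEY (SensorID, Time)
);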

First of all, I define a MySQL stored procedure to extract maximum, minimum, and average
values from these tables and to save them in a second database, ‘database_cubes_1h’, named so
because this process will run every hour. There is also a table Stations in ‘database_main’
(a third database that keeps other application configuration), where all the stations ever
installed are registered. We will use this table to know whether a station exists, and therefore
whether its related table exists in the database. A first draft would look like this:
-- -----------------------------------------------------
-- Procedure `fill_cubes_1h`
-- -----------------------------------------------------

DROP PROCEDURE IF EXISTS fill_cubes_1h;

DELIMITER //

CREATE PROCEDURE fill_cubes_1h($StationID INT)
BEGIN
  -- Only proceed if the station is registered (so its table should exist).
  IF $StationID IN (SELECT StationID FROM `database_main`.`Stations`)
  THEN
    -- Compose the INSERT ... SELECT dynamically, since the table names
    -- depend on the station ID.
    SET @strSQL = CONCAT('
      INSERT INTO `database_cubes_1h`.`Station_', $StationID, '`
      SELECT
        SensorID,
        UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time), "%Y-%m-%d %H:00:00")) AS Hour,
        AVG(Value) AS Value,
        MAX(Value) AS ValueMax,
        MIN(Value) AS ValueMin
      FROM `database`.`Station_', $StationID, '`
      WHERE
        Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR, "%Y-%m-%d %H:00:00"))
      GROUP BY
        Hour, SensorID
      ORDER BY
        Hour ASC
    ');

    PREPARE statement FROM @strSQL;
    EXECUTE statement;
    DEALLOCATE PREPARE statement;

  ELSE
    SET @foo = "";  -- no-op: unknown station, nothing to do

  END IF;

END;//

DELIMITER ;

Basically, it composes a query that extracts the aggregates from one table and inserts them into
another in a single statement. The query is run as a prepared statement so that we can reuse the
procedure for all the stations in our database, as long as tables are always named Station_XYZ.
But what exactly does it do?

1. It uses a statement of the form 'INSERT INTO table SELECT ... FROM another_table', which
   copies the data automatically as long as the number and types of the SELECT output columns
   match those of the target table.
2. As Time is stored in UNIX epoch time, it is converted to ISO time, then truncated to the
   hour, then converted back to UNIX. This makes it possible to group the result set by hour,
   getting rid of minutes and seconds:
   'UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time), "%Y-%m-%d %H:00:00")) AS Hour'.
3. A WHERE condition filters the results to the last hour:
   'Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR, "%Y-%m-%d %H:00:00"))'.
4. Finally, the whole result set is grouped by time so that averages, maximums, and minimums
   can be calculated. The grouping respects the different sensors that might have sent data in
   the same hourly interval ('GROUP BY Hour, SensorID'), and the aggregate functions perform
   the calculations ('AVG(Value) AS Value, MAX(Value) AS ValueMax, MIN(Value) AS ValueMin').

Up to now we seem to have resolved the performance problem. Hourly cubes can be constructed,
and we would only need to add a sort of cron job. But it is not so easy… I haven’t mentioned
yet that data is not received synchronously. That means that, within a time frame of three or
four hours, we could still receive data from a lazy or out-of-range station. That may be
problematic, so…

I’ve changed my stored procedure to do as follows:

1. Calculate the whole last day in hourly tranches, while still running every hour:
   'WHERE Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 DAY, "%Y-%m-%d %H:00:00"))
   AND Time <= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR, "%Y-%m-%d %H:59:59"))'


2. Modify the query to allow an update in case an hourly cube has already been calculated:
   'ON DUPLICATE KEY UPDATE `Value` = VALUES(`Value`)'

Please note that, in order to get this running correctly, there must be an effective way to
detect a duplicate key. In my case, I’ve used all fields but the value as the primary key, instead
of defining a new artificial key field. Thus, SensorID and Time together form the primary key, so
there should never be more than one value for each combination of both. This way, MySQL detects
the duplicate and updates the value column without throwing errors. The whole stored
procedure would look like this:

-- -----------------------------------------------------
-- Procedure `fill_cubes_1h`
-- -----------------------------------------------------

DROP PROCEDURE IF EXISTS fill_cubes_1h;

DELIMITER //

CREATE PROCEDURE fill_cubes_1h($StationID INT)
BEGIN

  IF $StationID IN (SELECT StationID FROM `database_main`.`Stations`)
  THEN
    SET @strSQL = CONCAT('
      INSERT INTO `database_cubes_1h`.`Station_', $StationID, '`
      SELECT
        SensorID,
        UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time), "%Y-%m-%d %H:00:00")) AS Hour,
        AVG(Value) AS Value,
        MAX(Value) AS ValueMax,
        MIN(Value) AS ValueMin
      FROM `database`.`Station_', $StationID, '`
      WHERE
        -- recompute the whole last day, hour by hour, to pick up late data
        Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 DAY, "%Y-%m-%d %H:00:00")) AND
        Time <= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR, "%Y-%m-%d %H:59:59"))
      GROUP BY
        Hour, SensorID
      ORDER BY
        Hour ASC
      ON DUPLICATE KEY UPDATE `Value` = VALUES(`Value`)
    ');

    PREPARE statement FROM @strSQL;
    EXECUTE statement;
    DEALLOCATE PREPARE statement;

  ELSE
    SET @foo = "";  -- no-op: unknown station

  END IF;

END;//

DELIMITER ;

Docente: MSc. Ing. Arturo Díaz Pulido.


Introducción al Análisis Dimensional de Bases de Datos

The next step is to get this procedure running every hour for all the stations in the database.
Now we are going to use our Stations table. The steps to follow are:

1. Get the highest station ID in the Stations table.
2. Since station IDs are not sequential, find a way to determine whether the related table should exist.
3. Tell the procedure to keep going if any exception is raised.
4. Loop until the final station ID is reached, calling the previous procedure with the current
   index as a parameter.

The final result looks like:

-- -----------------------------------------------------
-- Procedure `database_main`.`fill_all_cubes_1h`
-- -----------------------------------------------------

DROP PROCEDURE IF EXISTS fill_all_cubes_1h;

DELIMITER //

CREATE PROCEDURE fill_all_cubes_1h()
BEGIN
  DECLARE v INTEGER;
  DECLARE m INTEGER;

  -- Keep looping even if one station raises an exception
  -- (e.g. its table does not exist).
  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION BEGIN END;

  SET v = 0;
  -- Highest station ID ever registered.
  SET m = (SELECT StationID FROM database_main.Stations ORDER BY StationID DESC LIMIT 1);

  WHILE v <= m DO

    CALL fill_cubes_1h(v);
    SET v = v + 1;

  END WHILE;

END;//

DELIMITER ;

We’ve nearly finished! We now have a way to build hourly cubes for a variable number of stations,
and we have solved the problem of asynchronous data. We just need to create an event to run this
last procedure every hour, as stated above. This should work on MySQL 5.1:

-- -----------------------------------------------------
-- Event `database_main`.`fill_cubes_1h`
-- -----------------------------------------------------

DROP EVENT IF EXISTS fill_cubes_1h;

DELIMITER //

CREATE EVENT fill_cubes_1h
ON SCHEDULE EVERY 1 HOUR
DO BEGIN
  CALL fill_all_cubes_1h();
END //

DELIMITER ;
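Note that the MySQL event scheduler must be enabled for this event to fire. A one-line sketch (requires the SUPER privilege; alternatively set event_scheduler=ON in the server’s my.cnf):

SET GLOBAL event_scheduler = ON;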

If for any reason we are not allowed to create events (all of this should be run as a user with
full access to the tables involved), or we are running a MySQL version lower than 5.1, a cron
job should be run instead. Just get EXECUTE access on the database where the procedures have
been created (in my case, database_main), and add this line to the crontab:

0 * * * * mysql -e 'CALL database_main.fill_all_cubes_1h()'

Please note that, to get this statement running without being asked for a password, you need to
create a .my.cnf file in your home directory (on UNIX systems).
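A minimal sketch of that file (the credentials are placeholders; use an account with EXECUTE rights on database_main):

[client]
user=cube_user
password=secret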
