Design A Multidimensional Cube Using Schema Workbench in Pentaho CE BI Suite 3.0
ARTICLE 1:
This exercise uses Pentaho Schema Workbench 3.0.4 (stable) for designing the cubes, Pentaho CE BI Suite 3.0 for hosting and viewing the cube we design, and 'SampleData', a Hypersonic SQL (HSQLDB) database, as the data source. This 'SampleData' database is available as part of the free download of Pentaho CE BI Suite 3.0.
Please follow the steps below to design and view a new cube.
Step 1:
Make sure the Pentaho server is up and running.
Step 2:
Once the Pentaho server is started, go to the folder where the 'Schema Workbench' tool is installed on your system.
Step 3:
In the 'schema-workbench' folder, double-click the batch file 'workbench' to start the Schema Workbench tool (or right-click the batch file and run it from there). This will open the Schema Workbench window along with a command prompt. Please maximize the window.
Step 4:
Click on the 'Tools' menu and select 'Preferences'. This will open the 'Workbench Preferences' window, where we need to provide the JDBC details for the data source we use.
Step 5:
In the 'Workbench Preferences' window, please provide the following details.
Note: As we are using the 'SampleData' HSQLDB, the details given here are specific to that database; they will change if you use Oracle, MySQL, etc.
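For reference, with the bundled HSQLDB the preferences would be along these lines (a sketch; these are the usual Pentaho CE defaults and may differ in your install):

Driver Class Name: org.hsqldb.jdbcDriver
Connection URL: jdbc:hsqldb:hsql://localhost:9001/sampledata
User Name: pentaho_user
Password: password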
Step 6:
To create a new schema file, select the 'File -> New -> Schema' menu item from the menu bar.
This will open the 'New Schema 1' window with the schema file name 'Schema1.xml'. Please refer to the screenshot below.
Step 7:
Click on 'Schema' as shown above and set its properties, for example the name of the schema. For now, enter the name as 'SchemaTest'.
Step 8:
Right-click on the 'Schema' element and select the 'Add Cube' option. This will add a new cube to the schema.
Step 9:
Set the name of the cube to 'CubeTest'. Once that is done, the schema design will look like below.
Step 10:
Basically, a cube is a structure made up of a number of dimensions, measures, etc. Cubes rely on two kinds of tables: a 'fact table' (for the cube itself) and 'dimension tables' (for the cube's different dimensions). A cube can have only one fact table and any number of dimension tables (based on the number of dimensions in the cube).
So our next step is to set the fact table for the cube 'CubeTest'. To do so, click on the icon before the cube image, as indicated by #2 in the above screenshot. This will expand the cube node as in the image below.
Step 11:
Now click on the 'Table' element; this will list the attributes specific to the 'Table' element. Clicking on the 'name' attribute will display all tables available under the current data source (the database we set in Step 5). Select the table 'CUSTOMERS'.
Once you choose the table 'PUBLIC -> CUSTOMERS', the 'schema' attribute value will be filled in automatically as 'PUBLIC'.
Note: If the fact table doesn't belong to the schema mentioned in Step 5, you must explicitly specify the schema to which your fact table belongs.
Step 12:
Now add a new dimension called 'CustomerGeography' to the cube by right-clicking the cube element 'CubeTest' and selecting 'Add Dimension'.
Step 13:
For the newly added dimension, set the required attribute values such as the name, foreign key, etc.
Just double-click on the dimension name 'CustomerGeography'. This will expand the node and display the 'Hierarchy' element.
Click on the 'Hierarchy' in the left-side pane and you can see the attribute properties for the hierarchy.
Step 14:
Double-clicking on the 'Hierarchy' element in the left-side pane will expand the node further and show the 'Table' element. Click on the 'Table' element to set the dimension table for the dimension 'CustomerGeography'. This will list the related attributes in the right-side pane. Clicking on the 'name' attribute's value field will list the tables available in the current schema.
Select 'CUSTOMERS'. This will automatically fill the 'schema' field as 'PUBLIC'.
Step 15:
Right-click on the 'Hierarchy' element in the left-side pane and select 'Add Level'.
This will add a new level named 'New Level 0'. Refer to the screenshot below.
To rename it and set its other attributes, set the attribute values (as listed below) for the newly created level in the right-side pane.
Step 16:
To add one more level, right-click on 'Hierarchy' in the left-side pane (as we did in Step 15) and select 'Add Level'. This will add a new level named 'New Level 1'. To rename it and set other attribute values, set the attribute values in the right-side pane as below.
Step 17:
To add a new dimension to the cube, right-click on the cube item (CubeTest) in the left-side pane, then select 'Add Dimension'.
This will add a new dimension to the cube with a default name. To rename it and set other attribute values, click on the newly created dimension in the left-side pane. This will list the attributes for the dimension.
Step 20:
To add a new level to this dimension's hierarchy, right-click on the 'Hierarchy' element and select 'Add Level'. This will add a new level named 'New Level 0' to the hierarchy. We can rename it by changing the attribute values as below.
For example, we are trying to obtain the number of customers under each country/city, so a count measure is added to the cube. After setting up the measure, the cube (CubeTest) schema structure will look like below.
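Since Schema Workbench simply edits a Mondrian schema XML file, the saved Schema1.xml would at this point look roughly like the sketch below. The level and measure names, and the COUNTRY, CITY and CUSTOMERNUMBER columns, are assumptions based on the country/city example above:

<Schema name="SchemaTest">
  <Cube name="CubeTest">
    <!-- fact table -->
    <Table schema="PUBLIC" name="CUSTOMERS"/>
    <Dimension name="CustomerGeography" foreignKey="CUSTOMERNUMBER">
      <Hierarchy hasAll="true" primaryKey="CUSTOMERNUMBER">
        <!-- dimension table (here the same table as the fact table) -->
        <Table schema="PUBLIC" name="CUSTOMERS"/>
        <Level name="Country" column="COUNTRY" uniqueMembers="true"/>
        <Level name="City" column="CITY" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <!-- the measure: number of customers -->
    <Measure name="CustomerCount" column="CUSTOMERNUMBER" aggregator="count"/>
  </Cube>
</Schema>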
Step 22:
Now select 'File -> Save' to save the cube schema. You can save it to any path you like.
Select the 'File -> Publish…' menu item to publish the cube. This will open a publish dialog like the one below.
Follow the instructions in the above screenshot and click the 'OK' button (or click 'Cancel' to abort the publishing action). Once you click OK, you can see the processing action as below.
Then a publish dialog will open, where you have to specify where on the Pentaho server the cube should be published.
Choose the location where you want to publish and click the 'Publish' button.
On successful publishing, the system will display a dialog box with the message 'Publish successful'.
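With a default local install, the publish settings would look something like this (the values are assumptions for a default setup; the publish password must match the one configured in publisher_config.xml on the server):

Server URL: http://localhost:8080/pentaho/
Publish Password: (the password set in publisher_config.xml)
User: joe
Password: password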
Step 24:
To view the published cube, browse to the Pentaho server URL.
Log in as the user 'joe' (an administrator) with the password 'password', or as any other user with administrative privileges. After selecting the user or entering the credentials, click the 'Login' button to log in (or 'Cancel' to abort the login).
Step 25:
After logging in, you will be redirected to the Pentaho BI home page.
Click on the 'New Analysis View' button. This will list the schemas accessible to the currently logged-in user.
By default, this schema list includes 'SampleData' and 'SteelWheels', along with the schema we published earlier.
Step 26:
Select 'SchemaTest' as the Schema and 'CubeTest' as the Cube. Click the 'OK' button.
Step 27:
This will render the cube in an 'Analysis View' window like the one below.
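Behind the scenes, the Analysis View issues an MDX query against the published cube. A minimal example, assuming the measure and level names from the XML sketch earlier:

SELECT
  {[Measures].[CustomerCount]} ON COLUMNS,
  {[CustomerGeography].[Country].Members} ON ROWS
FROM [CubeTest]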
Result:
A new cube has been designed, configured and published to the Pentaho server, and we have viewed the cube via the Pentaho User Console.
ARTICLE 2:
Implementing a database cubes system on MySQL
I’ve already talked about how I solved the problem of managing huge amounts of data in my
last post. Now, I’m going to explain how to implement one of the solutions found in order
to comfortably face this continuously increasing avalanche of information.
Let's imagine that, as I explained before, I have separated the data input into tables keyed by a station ID. All of them are stored in the same database for maintenance simplicity's sake. So we have the 'database' in which there are hundreds of tables called 'Station_XYZ'. Every table has the same structure; to simplify: SensorID, Time (UNIX), Value. All right then, time to make cubes!
First of all, I define a MySQL stored procedure to extract the maximum, minimum and average values from these tables and to save them in a second database, 'database_cubes_1h', named so because this process will run every hour. There is also a table 'Stations' in 'database_main' (a third database that keeps other application configuration), where all the stations ever installed are registered. We will use this table to know whether a station exists and, therefore, whether it has a related table in the database. A first draft would look like this:
-- -----------------------------------------------------
-- Procedure `fill_cubes_1h`
-- -----------------------------------------------------

DROP PROCEDURE IF EXISTS fill_cubes_1h;

DELIMITER //

CREATE PROCEDURE fill_cubes_1h($StationID INT)
BEGIN
  -- Only proceed if the station is registered in the Stations table
  IF $StationID IN (SELECT StationID FROM `database_main`.`Stations`)
  THEN
    SET @strSQL = CONCAT('
      INSERT INTO `database_cubes_1h`.`Station_',$StationID,'`
      SELECT
        SensorID,
        UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time),"%Y-%m-%d %H:00:00")) AS Hour,
        AVG(Value) AS Value,
        MAX(Value) AS ValueMax,
        MIN(Value) AS ValueMin
      FROM `database`.`Station_',$StationID,'`
      WHERE
        Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR,"%Y-%m-%d %H:00:00"))
      GROUP BY
        Hour, SensorID
      ORDER BY
        Hour ASC
    ');

    PREPARE statement FROM @strSQL;
    EXECUTE statement;
    DEALLOCATE PREPARE statement;
  END IF;
END //

DELIMITER ;
Basically, it composes a query that extracts the aggregates from one table and inserts them into another in one go. The query is run as a prepared statement so that we can reuse the procedure for all the stations in our database, as long as the tables are always named Station_XYZ. But what exactly does it do?
1. It uses a statement of the form 'INSERT INTO table SELECT * FROM another_table'. This copies the data automatically, provided the number and format of the columns in the SELECT output match those expected by the INSERT INTO target.
2. As Time is stored as a UNIX epoch, it is converted to ISO time, truncated to the hour, and then converted back to UNIX time. As a result, the result set can be grouped by hour, getting rid of minutes and seconds: UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time),"%Y-%m-%d %H:00:00")) AS Hour.
3. A WHERE condition filters the results to the last hour: Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR,"%Y-%m-%d %H:00:00")).
4. Finally, the whole result set is grouped by time so that the averages, maximums and minimums can be calculated. The grouping keeps apart the different sensors that might have sent data in the same hourly interval (GROUP BY Hour, SensorID), while the aggregate expressions perform the calculations (AVG(Value) AS Value, MAX(Value) AS ValueMax, MIN(Value) AS ValueMin).
Up to now we seem to have resolved the performance problem. Hourly cubes can be constructed, and we would only need to add a cron job of sorts. But it is not so easy… I haven't mentioned yet that data is not received synchronously. That means that, within a time frame of three or four hours, we could still receive data from a lazy or out-of-range station. That may be problematic, so…
1. Calculate the whole last day in hourly tranches, while still running every hour: WHERE Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 DAY,"%Y-%m-%d %H:00:00")) AND Time <= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR,"%Y-%m-%d %H:59:59")).
2. Modify the query to allow updates in case an hourly cube has already been calculated: ON DUPLICATE KEY UPDATE `Value` = VALUES(`Value`).
Please note that, for this to run correctly, there must be an effective way to detect a duplicate key. In my case, I've used all fields except the value as the primary key, instead of defining a new artificial key field. Thus SensorID and Time together form the primary key, so there should never be more than one row for each combination of the two. This way, MySQL detects the duplicate and updates the value columns without throwing any errors.
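For illustration, the hourly cube tables could be laid out like this (a sketch; the column types are assumptions):

CREATE TABLE `database_cubes_1h`.`Station_XYZ` (
  SensorID INT NOT NULL,
  Time     INT NOT NULL,    -- start of the hour, UNIX epoch
  Value    DOUBLE,          -- hourly average
  ValueMax DOUBLE,
  ValueMin DOUBLE,
  -- a duplicate key means "same sensor, same hour", which triggers the update
  PRIMARY KEY (SensorID, Time)
);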
With that in place, the whole stored procedure would look like this:
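A sketch combining the first draft with the two modifications above; extending the update clause to the ValueMax and ValueMin columns is an assumption beyond the single `Value` assignment shown:

DROP PROCEDURE IF EXISTS fill_cubes_1h;

DELIMITER //

CREATE PROCEDURE fill_cubes_1h($StationID INT)
BEGIN
  -- Only proceed if the station is registered in the Stations table
  IF $StationID IN (SELECT StationID FROM `database_main`.`Stations`)
  THEN
    SET @strSQL = CONCAT('
      INSERT INTO `database_cubes_1h`.`Station_',$StationID,'`
      SELECT
        SensorID,
        UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time),"%Y-%m-%d %H:00:00")) AS Hour,
        AVG(Value) AS Value,
        MAX(Value) AS ValueMax,
        MIN(Value) AS ValueMin
      FROM `database`.`Station_',$StationID,'`
      WHERE
        -- recompute the whole last day in hourly tranches
        Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 DAY,"%Y-%m-%d %H:00:00"))
        AND Time <= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR,"%Y-%m-%d %H:59:59"))
      GROUP BY
        Hour, SensorID
      -- update hourly cubes that have already been calculated
      ON DUPLICATE KEY UPDATE
        `Value`    = VALUES(`Value`),
        `ValueMax` = VALUES(`ValueMax`),
        `ValueMin` = VALUES(`ValueMin`)
    ');

    PREPARE statement FROM @strSQL;
    EXECUTE statement;
    DEALLOCATE PREPARE statement;
  END IF;
END //

DELIMITER ;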
The next step is to get this procedure running every hour for all the stations in the database. For that, we are going to use our Stations table: iterate over every registered station and call fill_cubes_1h for each one, as sketched below.
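A minimal sketch of such a wrapper, assuming a cursor over the Stations table (the name fill_all_cubes_1h is taken from the event below):

DROP PROCEDURE IF EXISTS fill_all_cubes_1h;

DELIMITER //

CREATE PROCEDURE fill_all_cubes_1h()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE current_station INT;
  -- Iterate over every station ever registered
  DECLARE station_cursor CURSOR FOR
    SELECT StationID FROM `database_main`.`Stations`;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN station_cursor;
  station_loop: LOOP
    FETCH station_cursor INTO current_station;
    IF done = 1 THEN
      LEAVE station_loop;
    END IF;
    -- Build (or update) the hourly cube for this station
    CALL fill_cubes_1h(current_station);
  END LOOP;
  CLOSE station_cursor;
END //

DELIMITER ;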
We've nearly finished! We now have a way to build hourly cubes for a variable number of stations, and we have solved the problem of asynchronous data along the way. We just need to create an event to run this last procedure, as stated above, every hour. This should work on MySQL 5.1 or later, provided the event scheduler is enabled:
DELIMITER //

CREATE EVENT fill_cubes_1h
ON SCHEDULE EVERY 1 HOUR
DO BEGIN
  CALL fill_all_cubes_1h();
END //

DELIMITER ;
If for any reason we are not allowed to create events (all of this should be run as a user with full access to the tables involved), or we are running a MySQL version lower than 5.1, a cron job should be used instead. Just get EXECUTE access on the database where the procedures have been created (in my case, database_main), and add a line like the one below to the crontab.
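For instance, something along these lines (a sketch; scheduling at minute 0 and the exact client invocation are assumptions):

# m h dom mon dow  command
0 * * * * mysql database_main -e "CALL fill_all_cubes_1h();"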
Please note that, to get this statement running without being prompted for a password, you need to create a .my.cnf file in your home directory (on UNIX systems).
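For example, with placeholder credentials (keep the file private, e.g. chmod 600 ~/.my.cnf):

[client]
user=cubes_user
password=your_password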