Experiment No: 01 Data Exploration & Data Preprocessing
Aim: To study and implement data exploration using the WEKA tool, to investigate the properties
of data, to visualize data, and to see how preprocessing can improve the information content of
data.
Objectives:
To understand data exploration, an approach similar to initial data analysis in which a data analyst
uses visual exploration to understand what is in a dataset and the characteristics of the data, and to
understand different data preprocessing techniques.
Theory:
a) Data Exploration:
Data Exploration is about describing the data by means of statistical and visualization techniques.
We explore data in order to bring important aspects of that data into focus for further analysis.
Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual
exploration to understand what is in a dataset and the characteristics of the data, rather than through
traditional data management systems. These characteristics can include size or amount of data,
completeness of the data, correctness of the data, possible relationships amongst data elements or
files/tables in the data. Data exploration is typically conducted using a combination of automated
and manual activities. Automated activities can include data profiling or data visualization or
tabular reports to give the analyst an initial view into the data and an understanding of key
characteristics.
This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns
identified through the automated actions. Data exploration can also require manual scripting and
queries into the data or using Excel or similar tools to view the raw data. All of these activities are
aimed at creating a clear mental model and understanding of the data in the mind of the analyst,
and defining basic metadata (statistics, structure, relationships) for the data set that can be used in
further analysis. Once this initial understanding of the data has been established, the data can be
pruned or refined by removing unusable parts, correcting poorly formatted elements and defining
relevant relationships across datasets. This process is also known as determining data quality.
For the exercises in this tutorial you will use the ‘Explorer’. Click the ‘Explorer’ button in the
‘WEKA GUI Chooser’ window.
The following statistics are shown in the ‘Selected attribute’ box on the right panel of the ‘Preprocess’
window:
Name is the name of the attribute.
Type is most commonly Nominal or Numeric.
Missing is the number (percentage) of instances in the data for which this attribute is unspecified.
Distinct is the number of different values that the data contains for this attribute.
Unique is the number (percentage) of instances in the data having a value for this attribute that no
other instances have.
Exercise:
Attribute-Relation File Format (ARFF) is the file format used to work with the WEKA tool.
Download a few data sets in .arff format into a local folder, open one of them, and study the layout
of the file header and data. The header contains the list of attributes along with their data types;
the data section contains one tuple (data object) per line. The data type of an attribute can be
numeric, nominal, or date. Load the iris.arff data set in the Explorer and answer the following
questions (a minimal ARFF sketch is shown first for reference):
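For reference, a minimal ARFF file might look like the sketch below. The relation and attribute
names here are invented purely for illustration; iris.arff follows the same layout with its own
attribute names.

    @relation sample
    @attribute temperature numeric
    @attribute outlook {sunny, overcast, rainy}
    @attribute recorded_on date "yyyy-MM-dd"
    @attribute class {yes, no}
    @data
    25.4, sunny, 2021-01-15, yes
    18.0, rainy, 2021-01-16, no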
Q.2 What are the acceptable values of the class label? How many instances of each label type are
available?
i. Open the WEKA Explorer.
ii. Click on the Preprocess tab.
iii. Click on “Open File”.
iv. Load the iris.arff data set from the data folder inside the WEKA installation folder.
v. Click on the class attribute.
vi. Find the number of instances of each class label.
vii. Click on each attribute, one by one.
viii. Find the value counts shown for each attribute.
Q.3 How can you view the values of all tuples? Change the petal width of the 7th instance to 0.3.
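In the Explorer GUI the tuples can be viewed and changed through the Edit button on the Preprocess
tab. The same inspection can also be done programmatically through WEKA's Java API. The sketch
below is only illustrative; it assumes iris.arff is in the working directory, that the class is the last
attribute, and that petal width is the fourth attribute (index 3), as in the standard iris data set.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ExploreIris {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");    // load the ARFF file
            data.setClassIndex(data.numAttributes() - 1);     // class = last attribute

            // Q.2: count the instances of each class label
            int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
            for (int i = 0; i < counts.length; i++) {
                System.out.println(data.classAttribute().value(i) + ": " + counts[i]);
            }

            // Q.3: view all tuples and change the petal width of the 7th instance to 0.3
            System.out.println(data);                         // prints every tuple
            data.instance(6).setValue(3, 0.3);                // 7th instance, attribute index 3
        }
    }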
b) Data Preprocessing:
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Real world data are generally
a. Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
b. Noisy: containing errors or outliers
c. Inconsistent: containing discrepancies in codes or names
• Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts within
the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a data
warehouse.
• Data Discretization: Involves reducing the number of values of a continuous attribute by
dividing the range of the attribute into intervals.
Pre-processing tools in WEKA are called “filters”. WEKA contains filters for discretization,
normalization, resampling, attribute selection, transformation and combination of attributes. Some
techniques, such as association rule mining, can only be performed on categorical data. This
requires performing discretization on numeric or continuous attributes.
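As a minimal sketch of applying such a filter programmatically (assuming the WEKA Java API, the
iris.arff data set, and an arbitrarily chosen bin count):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class DiscretizeDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            Discretize discretize = new Discretize();      // unsupervised, equal-width binning
            discretize.setBins(3);                         // 3 bins per numeric attribute (assumption)
            discretize.setInputFormat(data);               // must be called before useFilter
            Instances discretized = Filter.useFilter(data, discretize);
            System.out.println(discretized.attribute(0));  // first attribute is now nominal
        }
    }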
At the bottom of the editor window there are four buttons. ‘Open’ and ‘Save’ buttons allow you to
save object configurations for future use. ‘Cancel’ button allows you to exit without saving
changes. Once you have made changes, click ‘OK’ to apply them.
Q.2 Open the data file diabetes.arff. Choose the Class attribute. Select ‘Visualize all’.
i. Open the WEKA Explorer.
ii. Click on the Preprocess tab.
iii. Click on “Open File”.
iv. Load the diabetes.arff data set from the data folder inside the WEKA installation folder.
v. Click on ‘Visualize all’.
Result:
_____________________________________________________________________________
Conclusion:
_____________________________________________________________________________
References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. www.cs.waikato.ac.nz/ml/weka/documentation.html
3. www.ittc.ku.edu/~nivisid/WEKA_MANUAL.pdf
Industrial Applications:
Database-driven applications such as customer relationship management
Rule-based applications such as expert systems
4. How can you view the values of all tuples of the relation?
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
Theory:
The Decision Tree algorithm, like Naive Bayes, is based on conditional probabilities. Unlike Naive
Bayes, decision trees generate rules. A rule is a conditional statement that can easily be understood
by humans and easily used within a database to identify a set of records. In some applications of
data mining, the accuracy of a prediction is the only thing that really matters.
1) A Naive Bayes classifier
It is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive)
independence assumptions. In simple terms, a naive Bayes classifier assumes that the presence (or
absence) of a particular feature of a class is unrelated to the presence (or absence) of any other
feature, given the class variable.
The Naive Bayes model
Abstractly, the probability model for a classifier is a conditional model over a dependent class
variable C with a small number of outcomes (classes), conditional on several feature variables F1
through Fn. The problem is that if the number of features n is large, or a feature can take on a
large number of values, then basing such a model on probability tables is infeasible. We therefore
reformulate the model to make it more tractable. By Bayes' theorem,

p(C | F1, …, Fn) = p(C) · p(F1, …, Fn | C) / p(F1, …, Fn)
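Under the naive independence assumption, the likelihood in the numerator factorizes into a product
of per-feature terms, and because the denominator p(F1, …, Fn) does not depend on C, the classifier
simply picks the class that maximizes the numerator:

p(C | F1, …, Fn) ∝ p(C) · p(F1 | C) · p(F2 | C) · … · p(Fn | C)

predicted class = argmax over classes c of  p(C = c) · Π i p(Fi | C = c)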
2) Random Forests
We assume that the user knows about the construction of single classification trees. Random
Forests grows many classification trees. To classify a new object from an input vector, put the
input vector down each of the trees in the forest. Each tree gives a classification, and we say the
tree "votes" for that class. The forest chooses the classification having the most votes (over all the
trees in the forest).
Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random - but with
replacement, from the original data. This sample will be the training set for growing the
tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m
variables are selected at random out of the M and the best split on these m is used to split
the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
In the original paper on random forests, it was shown that the forest error rate depends on two
things:
▪ The correlation between any two trees in the forest. Increasing the correlation increases
the forest error rate.
▪ The strength of each individual tree in the forest. A tree with a low error rate is a strong
classifier. Increasing the strength of the individual trees decreases the forest error rate.
Reducing m reduces both the correlation and the strength; increasing it increases both. Somewhere
in between is an "optimal" range of m – usually quite wide. Using the error rate (see below), a value
of m in this range can quickly be found. This is the only adjustable parameter to which random
forests are somewhat sensitive.
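The sketch below shows how these two knobs (the number of trees and m, the number of attributes
sampled per node) are typically set when using WEKA's RandomForest from Java. It is only
illustrative: the data set name is an assumption, and the exact setter for the number of trees differs
between WEKA versions (setNumIterations in recent releases, setNumTrees in older ones).

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RandomForestDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");   // assumed data set
            data.setClassIndex(data.numAttributes() - 1);

            RandomForest rf = new RandomForest();
            rf.setNumIterations(100);  // number of trees (recent WEKA; older versions: setNumTrees)
            rf.setNumFeatures(2);      // m: attributes sampled at each node (0 = WEKA's default)

            // 10-fold cross-validation as a stand-in for the out-of-bag error estimate
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(rf, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }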
Example of ID3
Suppose we want ID3 to decide whether the weather is amenable to playing baseball. Over the
course of two weeks, data is collected to help ID3 build a decision tree (see Table 1). The target
classification is "should we play baseball?", which can be yes or no. The weather attributes are
outlook, temperature, humidity, and wind speed.
They can have the following values:
outlook = { sunny, overcast, rain }
We need to find which attribute will be the root node in our decision tree. The gain is calculated
for all four attributes:
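The quantities involved are the entropy of a set S and the information gain of an attribute A. Assuming
the standard 14-example "play" data from Quinlan's ID3 example (with temperature = {hot, mild, cool},
humidity = {high, normal} and wind = {weak, strong}):

Entropy(S) = − Σ over classes c of  p(c) · log2 p(c)
Gain(S, A) = Entropy(S) − Σ over values v of A of  (|S_v| / |S|) · Entropy(S_v)

For that data set the gains work out to roughly Gain(S, Outlook) ≈ 0.246, Gain(S, Humidity) ≈ 0.151,
Gain(S, Wind) ≈ 0.048 and Gain(S, Temperature) ≈ 0.029, so Outlook has the highest gain and is
chosen as the root.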
Since Outlook has three possible values, the root node has three branches (sunny, overcast,
rain). The next question is "what attribute should be tested at the Sunny branch node?" Since
we have used Outlook at the root, we only decide on the remaining three attributes: Humidity,
Temperature, or Wind.
Ssunny = {D1, D2, D8, D9, D11} = 5 examples from table 1 with outlook = sunny
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process goes on
until all data is classified perfectly or we run out of attributes.
Flowchart:-
Start
Select the data set
Stop
Algorithm:
Steps to implement the following classifiers- Decision tree, Naïve Bayes and Random Forest:
1. Select any data set, say weather.arff.
2. After preprocessing step, click the Classify tab.
3. In the Choose option, expand ‘trees’ and select the J48 option.
4. Click the Start button. When classification completes, the results appear in the Classifier
output panel on the right.
5. In the Result list panel on the left, right-click the trees.J48 entry and choose the Visualize
tree option.
6. A separate window opens showing the decision tree. If the nodes appear cluttered, right-
click inside the window and select the Fit to screen option for a clearer view of the tree.
7. Repeat steps 1-4 for Naïve Bayes and Random Forest, selecting Naïve Bayes and Random
Forest instead of J48.
8. Compare the outputs and accuracy of the three classifiers (a programmatic sketch of this
comparison is given after these steps).
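A minimal sketch of the same comparison using the WEKA Java API (the data set name and the use
of 10-fold cross-validation are assumptions; the GUI steps above remain the primary procedure):

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.trees.J48;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareClassifiers {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");   // assumed data set
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] models = { new J48(), new NaiveBayes(), new RandomForest() };
            for (Classifier model : models) {
                // 10-fold cross-validation with a fixed seed, so the runs are comparable
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new Random(1));
                System.out.printf("%-14s accuracy: %.2f%%%n",
                        model.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }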
Result
_____________________________________________________________________________
Conclusion
Industrial Applications:-
ID3 has been incorporated in a number of commercial rule-induction packages. Some specific
applications include medical diagnosis, credit risk assessment of loan applications, diagnosing
equipment malfunctions by their cause, classification of soybean diseases, and web search
classification.
1. Banking:
In the banking sector, the random forest algorithm is widely used in two main applications:
identifying loyal customers and detecting fraudulent customers.
A loyal customer is not only one who pays well, but also one who can take a large loan and pay
the interest on it back to the bank properly. Since the growth of the bank depends on its loyal
customers, customer data is analyzed intensively to find the patterns that characterize loyal
customers, based on the customer details.
In the same way, there is a need to identify customers who are not profitable for the bank, such as
those who take a loan but do not pay the interest properly, or who are outliers. If the bank can
identify these kinds of customers before granting the loan, it gets a chance to refuse the loan to
them. In this case, too, the random forest algorithm is used to identify the customers who are not
profitable for the bank.
2. Medicine
In the field of medicine, the random forest algorithm is used to identify the correct combination of
components to validate a medicine. It is also helpful for identifying a disease by analyzing the
patient's medical records.
3. Stock Market
In the stock market, the random forest algorithm is used to identify stock behavior as well as the
expected loss or profit from purchasing a particular stock.
4. E-commerce
In e-commerce, the random forest is used in a small segment of the recommendation engine to
estimate the likelihood that a customer will like the recommended products, based on similar kinds
of customers.
References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. www.cs.waikato.ac.nz/ml/weka/documentation.html
3. www.ittc.ku.edu/~nivisid/WEKA_MANUAL.pdf
2. What is a rule?
_____________________________________________________________________________
3. What is classification?
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
Objectives:
A cluster of data objects can be treated as one group. In cluster analysis, we first partition the set
of data into groups based on data similarity and then assign labels to the groups. The main
advantage of clustering over classification is that it is adaptable to changes and helps single out
useful features that distinguish different groups.
Theory:
Clustering is “the process of organizing objects into groups whose members are similar in some
way”. Clustering is the task of assigning a set of objects into groups (called clusters) so that the
objects in the same cluster are more similar (in some sense or another) to each other than to those
in other clusters. Clustering is a main task of exploratory data mining and a common technique
for statistical data analysis used in many fields, including machine learning, pattern recognition,
image analysis, information retrieval, and bioinformatics. Clustering can be considered the most
important unsupervised learning technique; as with every other problem of this kind, it deals with
finding a structure in a collection of unlabeled data.
Clustering Methods
To determine the distance between clusters based on their member elements, the following
methods have been implemented (a small code sketch of the first three linkage criteria appears
after this list):
a. Single Linkage- minimum distance between any members of each group
b. Complete Linkage-maximum distance between any members of each group
c. Average Linkage-average pair-wise distance between each member of one cluster to each
member of another cluster
d. Average Group Linkage -average distance between all possible element pairs of the union
of the two clusters
e. Centroid -distance between the mean vectors (centroids) of the two clusters
f. Ward's Method - the increase in variance when merging two clusters
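As a rough sketch of what the first three linkage criteria compute, the helper below evaluates single,
complete and average linkage between two clusters of 2-D points (plain Java with invented sample
points; this is not WEKA code):

    public class LinkageDemo {
        static double euclid(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return Math.sqrt(dx * dx + dy * dy);
        }

        // method is "single", "complete" or "average"
        static double linkage(double[][] c1, double[][] c2, String method) {
            double min = Double.MAX_VALUE, max = 0, sum = 0;
            for (double[] p : c1) {
                for (double[] q : c2) {
                    double d = euclid(p, q);
                    min = Math.min(min, d);   // single linkage: closest pair
                    max = Math.max(max, d);   // complete linkage: farthest pair
                    sum += d;                 // average linkage: mean of all pairs
                }
            }
            if (method.equals("single")) return min;
            if (method.equals("complete")) return max;
            return sum / (c1.length * c2.length);
        }

        public static void main(String[] args) {
            double[][] c1 = { {1, 1}, {2, 1} };
            double[][] c2 = { {4, 3}, {5, 4} };
            System.out.println("single   = " + linkage(c1, c2, "single"));
            System.out.println("complete = " + linkage(c1, c2, "complete"));
            System.out.println("average  = " + linkage(c1, c2, "average"));
        }
    }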
Algorithm:-
Steps to implement the K-means clustering algorithm using WEKA:
1. Choose any data set, for example, diabetes.arff.
2. Go to the Cluster panel and choose SimpleKMeans. (A programmatic equivalent using the
WEKA Java API is sketched below.)
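A minimal sketch of the same task with the WEKA Java API. Assumptions: diabetes.arff is in the
working directory, the class attribute is the last one and is removed before clustering, and K = 2 with
an arbitrary random seed.

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class KMeansWeka {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff");

            // Clustering is unsupervised, so drop the class attribute (assumed to be last).
            Remove remove = new Remove();
            remove.setAttributeIndices("last");
            remove.setInputFormat(data);
            Instances input = Filter.useFilter(data, remove);

            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(2);    // K = 2
            kmeans.setSeed(10);          // arbitrary seed for the initial centroids
            kmeans.buildClusterer(input);
            System.out.println(kmeans);  // prints cluster centroids and sizes
        }
    }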
Flowchart:-
Start
Stop
Result:
_____________________________________________________________________________
Conclusion:
_____________________________________________________________________________
Industrial Applications:-
Credit card companies mine transaction records for fraudulent use of their cards based on purchase
patterns of consumers - They can deny access if your purchase patterns change drastically.
References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. www.cs.waikato.ac.nz/ml/weka/documentation.html
3. www.ittc.ku.edu/~nivisid/WEKA_MANUAL.pdf
1. What is clustering?
_____________________________________________________________________________
3. What is a dendrogram?
_____________________________________________________________________________
4. Enlist the methods used to calculate the distance between the clusters.
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
Aim: To use WEKA tool to implement Association Mining using – Apriori algorithm and FP-Tree
Growth algorithm. To find out the frequent itemsets and find out the association rules.
Objectives:
To learn association data mining function that discovers the probability of the co-occurrence of
items in a collection. The relationships between co-occurring items are expressed as association
rules. Association rules are often used to analyze sales transactions.
In data mining, association rule learning is a popular and well researched method for discovering
interesting relations between variables in large databases. Association rules are usually required to
satisfy a user-specified minimum support and a user-specified minimum confidence at the same
time. Association rule generation is usually split up into two separate steps:
1. First, minimum support is applied to find all the frequent itemsets in a database.
2. Second, these frequent itemsets and the minimum confidence constraint are used to form
rules.
For example, consider a small database of seven transactions, {1,2,3,4}, {1,2,4}, {1,2}, {2,3,4},
{2,3}, {3,4} and {2,4}, with a minimum support threshold of 3 (this transaction list is an
assumption, chosen to be consistent with the support counts shown below). Counting each single
item gives:
Item Support
{1} 3
{2} 6
{3} 4
{4} 5
All the itemsets of size 1 have a support of at least 3, so they are all frequent.
The next step is to generate a list of all pairs of the frequent items.
For example, regarding the pair {1,2}: items 1 and 2 appear together in three of the transactions
listed above; therefore, we say the itemset {1,2} has a support of three.
Item Support
{1,2} 3
{1,3} 1
{1,4} 2
{2,3} 3
{2,4} 4
{3,4} 3
Item Support
{2,3,4} 2
In the example, there are no frequent triplets: {2,3,4} is below the minimum support threshold,
and the other triplets were excluded because they are supersets of pairs that were already below
the threshold.
We have thus determined the frequent sets of items in the database, and illustrated how some
items were not counted because one of their subsets was already known to be below the
threshold.
Steps to implement association mining using Apriori and FP-Growth:
1. Load the .arff file, say supermarket.arff.
2. Click the Associate tab, and choose Apriori as the associator.
3. Run with the default values.
4. Study the output. (A programmatic sketch using the WEKA Java API is given below.)
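A rough equivalent of these steps through the WEKA Java API (the parameter values below are
illustrative assumptions rather than the defaults of every WEKA version):

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AprioriDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("supermarket.arff");

            Apriori apriori = new Apriori();
            apriori.setLowerBoundMinSupport(0.1);  // minimum support (10%, an assumption)
            apriori.setMinMetric(0.9);             // minimum confidence
            apriori.setNumRules(10);               // report the 10 best rules
            apriori.buildAssociations(data);

            System.out.println(apriori);           // prints frequent itemsets and rules
        }
    }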
Flowchart:-
Start
Load the data set. Click the associate tab, and choose Apriori
in the associator
Stop
Result :
_____________________________________________________________________________
References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. www.cs.waikato.ac.nz/ml/weka/documentation.html
3. www.ittc.ku.edu/~nivisid/WEKA_MANUAL.pdf
4. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
Industrial Applications:-
1. Market Basket Analysis
2. Supermarket
3. Analyzing customer interests in the retail industry
4. Application of the Apriori algorithm for adverse drug reaction detection
5. Detection of adverse drug reactions (ADR) in health care data. The Apriori algorithm is used to
perform association analysis on the characteristics of patients, the drugs they are taking, their
primary diagnosis, co-morbid conditions, and the ADRs or adverse events (AE) they experience.
This analysis produces association rules that indicate what combinations of medications and
patient characteristics lead to ADRs.
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
10. How can you find frequent itemset without candidate generation?
_____________________________________________________________________________
Objectives:
To learn the actual implementation of classification algorithm and analyze the result.
Theory:
The Decision Tree algorithm is based on conditional probabilities. Decision trees generate rules.
A rule is a conditional statement that can easily be understood by humans and easily used within
a database to identify a set of records. In some applications of data mining, the accuracy of a
prediction is the only thing that really matters. It may not be important to know how the model
works. In others, the ability to explain the reason for a decision can be crucial. For example, a
Marketing professional would need complete descriptions of customer segments in order to launch
a successful marketing campaign. The Decision Tree algorithm is ideal for this type of application.
Consider, for instance, a rule taken from a decision tree that predicts the probability that customers
will increase spending if given a loyalty card. A target value of 0 means not likely to increase
spending; 1 means likely to increase spending. A hypothetical rule of this kind is sketched below.
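A hypothetical rule (invented purely for illustration, not taken from an actual model) might read:

    IF customer_marital_status = 'married' AND household_size >= 3
    THEN prediction = 1 (likely to increase spending)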
Flowchart:-
Result:-
_____________________________________________________________________________
Conclusion:-
_____________________________________________________________________________
References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. https://data-flair.training/blogs/data-mining-algorithms/
3. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
7. What is precision?
_____________________________________________________________________________
8. What is recall?
_____________________________________________________________________________
Objectives:
To learn the actual implementation of a clustering algorithm and analyze the results.
Theory:
Flowchart :
Suppose we have several objects (4 types of medicines), and each object has two attributes or
features, as shown in the table below. Our goal is to group these objects into K = 2 groups of
medicines based on the two features (pH and weight index).

Object        Attribute X (pH)    Attribute Y (weight index)
Medicine A    1                   1
Medicine B    2                   1
Medicine C    4                   3
Medicine D    5                   4
1. Initial centroids: Suppose we take medicine A and medicine B as the initial centroids, so
c1 = (1, 1) and c2 = (2, 1).
2. Objects-Centroids distances: We compute the Euclidean distance of every object to each
centroid, giving the distance matrix D0. Each column in the distance matrix corresponds to one
object; the first row holds the distance of each object to the first centroid, and the second row the
distance of each object to the second centroid. For example, the distance from medicine C = (4, 3)
to the first centroid c1 = (1, 1) is √((4−1)² + (3−1)²) = 3.61, and its distance to the second centroid
c2 = (2, 1) is √((4−2)² + (3−1)²) = 2.83.
3. Objects clustering: We assign each object to the centroid with the minimum distance. Thus,
medicine A is assigned to group 1, and medicines B, C and D to group 2. The element of the group
matrix below is 1 if and only if the object is assigned to that group.

           A  B  C  D
G0 =   [   1  0  0  0  ]   Group 1
       [   0  1  1  1  ]   Group 2
4. Iteration 1, determine centroids: Knowing the members of each group, we now compute the
new centroid of each group from these memberships. Group 1 has only one member, so its centroid
remains c1 = (1, 1). Group 2 now has three members, so its centroid is the average coordinate of
the three members: c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3).
5. Iteration 1, Objects-Centroids distances: The next step is to compute the distance of all
objects to the new centroids c1 = (1, 1) and c2 = (11/3, 8/3). Similar to step 2, this gives the distance
matrix D1.
6. Iteration 1, Objects clustering: Assigning each object to its nearest centroid now places
medicines A and B in group 1 and medicines C and D in group 2:

           A  B  C  D
G1 =   [   1  1  0  0  ]   Group 1
       [   0  0  1  1  ]   Group 2
7. Iteration 2, determine centroids: We repeat step 4 to calculate the new centroid coordinates
based on the clustering of the previous iteration. Group 1 and group 2 each have two members, so
the new centroids are c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1) and c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5).
8. Iteration 2, Objects-Centroids distances: As before, we compute the distance of every object
to the new centroids, giving the distance matrix D2.
9. Iteration 2, Objects clustering: Again, we assign each object based on the minimum distance,
which gives the same grouping as before:

           A  B  C  D
G2 =   [   1  1  0  0  ]   Group 1
       [   0  0  1  1  ]   Group 2
We obtain G2 = G1. Comparing the grouping of the last iteration and this iteration reveals that the
objects no longer change groups. The k-means clustering computation has therefore reached
stability, no more iterations are needed, and we take this final grouping as the result.
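The same computation can be reproduced with a few lines of plain Java (a sketch only; the four
points and the initial centroids A and B are taken from the worked example above):

    public class KMeansDemo {
        public static void main(String[] args) {
            double[][] points = { {1, 1}, {2, 1}, {4, 3}, {5, 4} };   // A, B, C, D
            double[][] centroids = { {1, 1}, {2, 1} };                // start with A and B
            int[] assign = new int[points.length];

            for (int iter = 0; iter < 10; iter++) {
                // Assignment step: each point joins its nearest centroid.
                boolean changed = false;
                for (int i = 0; i < points.length; i++) {
                    int best = dist(points[i], centroids[0]) <= dist(points[i], centroids[1]) ? 0 : 1;
                    if (best != assign[i]) { assign[i] = best; changed = true; }
                }
                if (!changed) break;   // G(t) == G(t-1): converged
                // Update step: each centroid moves to the mean of its members.
                for (int k = 0; k < 2; k++) {
                    double sx = 0, sy = 0; int n = 0;
                    for (int i = 0; i < points.length; i++)
                        if (assign[i] == k) { sx += points[i][0]; sy += points[i][1]; n++; }
                    if (n > 0) { centroids[k][0] = sx / n; centroids[k][1] = sy / n; }
                }
            }
            for (int i = 0; i < points.length; i++)
                System.out.println("Point " + (char) ('A' + i) + " -> group " + (assign[i] + 1));
            System.out.printf("Centroids: (%.2f, %.2f) and (%.2f, %.2f)%n",
                    centroids[0][0], centroids[0][1], centroids[1][0], centroids[1][1]);
        }

        static double dist(double[] p, double[] q) {
            return Math.hypot(p[0] - q[0], p[1] - q[1]);
        }
    }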
_____________________________________________________________________________
Conclusion:
_____________________________________________________________________________
Industrial Applications:-
1.Credit card companies mine transaction records for fraudulent use of their cards based on
purchase patterns of consumers - They can deny access if your purchase patterns change
drastically.
2. Pattern recognition
3. Image analysis
4. Bioinformatics
5. Machine learning
6. Voice mining
7. Image processing
References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. https://data-flair.training/blogs/clustering-in-data-mining/
3. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
7. What is threshold?
_____________________________________________________________________________
Objectives: To give the student maximum exposure to Pentaho components. To define and
describe business analytics and business intelligence.
Theory:
With a lot of numbers, spreadsheets full of values, charts, diagrams, histograms, and so forth, it is
rarely easy to discover trends or specific dependencies soon enough to act on them. Prediction is
hard because a trend usually needs to grow significantly before it is noticed. Pentaho Data Mining,
like any other data mining solution, helps discover these trends before they are actually noticeable
by a human. The idea of the system is to quickly analyze extremely large volumes of data and
search for any trends that are taking shape. Its most meaningful advantage is that it highlights
trends long before they become "traditionally" noticeable, which makes it a real asset in today's
Business Intelligence.
Which contractor should we choose? Which product should we buy? Sometimes it truly is difficult
to decide, and these are the cases where an efficient data mining solution can help. Who, besides
the system, would be able to find out that contractors of a given type tend to delay their payments,
and that there is therefore a significant risk that the one we want to begin cooperating with would
do the same? None.
With full support for data integration, analysis, dashboards, and reporting, Pentaho Data Mining
is a solution well worth considering.
The idea of data mining isn't complex, but doing it "manually" would be difficult and time-
consuming, and would not guarantee success. Data mining within the Pentaho Data Mining tools
begins with choosing a model. There are numerous options to choose from: segmentation, decision
trees, neural networks, random forests, clustering, and many others. The efficiency of data mining
can depend on the chosen model. Then data is added. After this, the chosen model has to be fitted
to the sample data; this is a crucial step, and there are two ways to carry it out. It can be done
automatically (following the most common procedures and parameters), but it is sometimes
possible to do it manually as well. Either way, the adjusted parameters require testing: it is
suggested to verify the model on some data from the future and check whether the output is more
or less the same. The main capabilities of Pentaho Data Mining include:
• powerful engine working well even with the largest data volumes
• numerous and differentiated learning algorithms originating from Weka (principal component
analysis, random forests, decision trees, neural networks, segmentation, clustering, and so
forth)
• simplified and accelerated data integration
• automated data transforming capability (from almost any other to the format Pentaho Data
Mining requires)
• two ways of applying algorithms (from Java code or directly to the dataset)
• various methods for output presentation
• differentiated filters for data analysis
• PMML (Predictive Model Markup Language) support
• graphical user interfaces
• efficient capabilities for uncovering hidden relationships and patterns
• using already discovered patterns in future data mining
• the capability to embed insights into other applications (patterns can then be displayed
whenever they could be useful, not only when one explicitly checks for them)
Pentaho Data Mining resources
Pentaho BI Suite
● It provides support for data integration, reporting, OLAP analysis, dashboards, and data mining.
Pentaho Data Integration (PDI) provides the Extract, Transform, and Load (ETL) capabilities that
facilitate the process of capturing, cleansing, and storing data using a uniform and consistent
format that is accessible and relevant to end users and IoT technologies.
PDI – Example
Kettle/Spoon
Pentaho Schema Workbench (PSW)
With a physical multidimensional data model in place, you must create a logical model that maps
to it. A Mondrian schema is essentially an XML file that performs this mapping, thereby defining
a multidimensional database (its cubes, dimensions, hierarchies, and measures).
In a very basic scenario, you will create a Mondrian schema with one cube that consists of a single
fact table and a few dimensions, each with a single hierarchy consisting of a handful of levels.
More complex schemas may involve multiple virtual cubes, and instead of mapping directly to the
single fact table at the center of a star schema, they might map to views or inline tables instead.
Before you start using Schema Workbench, you should be aware of the following points:
Your data source must be available, its database driver JAR must be present in
the /pentaho/design-tools/schema-workbench/drivers/ directory, and you should know or be
able to obtain the database connection information and user account credentials for it.
Your data is now available to Schema Workbench, and you can proceed with creating a
Mondrian schema.
Remove Mondrian Data Sources
As you phase out old analysis schemas, you will have to manually remove their data source entries
in the Data Source Wizard in the User Console.
1. Login to the User Console with administrator credentials.
2. On the Home page of the User Console, click Manage Data Sources. The Data Source
Wizard appears.
3. Click to highlight the data source to be deleted, and click Remove.
In order to complete this process, you should have already connected to your data source in Schema
Workbench.
This section explains the basic procedure for creating a barebones Mondrian schema using Schema
Workbench.
1. To create a new Mondrian schema, click the New button, or go to the File menu, then select New,
then Schema. A new schema sub-window will appear. Resize it to fit your preference.
3. Typically your first action when creating a schema is to add a cube. Right-click the Schema icon
in the schema window, then select Add cube from the context menu. Alternatively you can click
the New Cube button in the toolbar. A new default cube will show up in your schema.
5. Add a table by clicking the New Table button, or by right-clicking your cube, then selecting Add
Table. This will be your fact table. Alternatively you can select View or Inline Table if these are
the data types you need for your fact table.
6. Click the Table entry in the name field of your new table, and select or type in the name of the table
in your physical model that you want to use for this cube's fact table.
7. Add a dimension by right-clicking the cube, then selecting Add Dimension, or by clicking the New
Dimension button.
9. Select a foreign key for this dimension from the foreignKey drop-down box, or just type it into the
field.
10. When you add a dimension, a new hierarchy is automatically created for it. To configure the
hierarchy, expand the dimension by clicking the lever icon on the left side of the dimension's tree
entry, then click on New Hierarchy 0. Choose a primaryKey or primaryKeyTable.
11. Add a table to the hierarchy by right-clicking the hierarchy, then selecting Add Table from the
context menu.
13. Add a level to the hierarchy by right-clicking the hierarchy, then selecting Add Level from the
context menu.
14. Give the level a name and choose a column for it.
15. Add a member property to the level by right-clicking the level, then selecting Add Property from
the context menu.
16. Give the property a name and choose a column for it.
18. Choose a column that you want to provide values for, then select an aggregator to determine how
the values should be calculated.
These instructions have shown you how to use Schema Workbench's interface to add and configure basic
Mondrian schema elements.
When your schema is finished, you should test it with a basic MDX query; the MDX query editor
described below can be used for this.
In order to use your schema as a data source in any Pentaho Business Analytics client tools, you
must publish it to the Pentaho Server. To do this, select Publish from the File menu, then enter in
your Pentaho Server connection information and credentials when requested.
Edit a Schema
There are two advanced tools in Schema Workbench that enable you to work with raw MDX and
XML. The first is the MDX query editor, which can query your logical data model in real time. To
open this view, go to the File menu, select New, then click MDX Query.
The second is XML viewing mode, which you can get to by clicking the rightmost icon (the pencil)
in the toolbar. This replaces the name/value fields with the resultant XML for each selected
element. To see the entire schema, select the top-level schema entry in the element list on the left
of the Schema Workbench interface. Unfortunately, you won't be able to edit the XML in this view;
if you want to edit it by hand, you'll have to open the schema in an XML-aware text editor.
Add Business Groups
The available fields list in Analyzer organizes fields in folders according to
the AnalyzerBusinessGroup annotation. To implement business groups, add these annotations to
your member definitions appropriately. If no annotation is specified, then the group defaults to
"Measures" for measures and the hierarchy name/caption for attributes.
Below is an example that puts Years, Quarters and Months into a "Time Periods" business group:
...
By adding description attributes to your Mondrian schema elements, you can enable tooltip
(mouse-over) field descriptions in Analyzer reports.
Remove the line-wrap or this may not work. These variables will not work unless you localize
schemas.
A few Mondrian features are not yet functional in Pentaho Analyzer. You must adapt your schemas
to adjust for these limitations and enable some Analyzer functions to work properly.
Localization and Internationalization of Analysis Schemas
You can create internationalized message bundles for your analysis schemas and deploy them
with your Pentaho web applications.
It provides the following functionalities:
●Schema editor integrated with the underlying data source for validation
●Test MDX queries against schema and database
●Browse underlying databases structure
Pentaho BI-Server
The Pentaho BI-Server is a web application for sharing and managing reports. With the BI platform
you are able to make reports available to a wider audience and to distribute them automatically,
for example on a schedule.
Pentaho Dashboards
Dashboard Designer has dynamic filter controls, which enable dashboard viewers to change a
dashboard's details by choosing different values from a drop-down list, and to control the content
in one dashboard panel by changing the options in another. This is known as content linking.
Get Started with the Dashboard Designer
You can view the editable version of the Sales Performance (Dashboard) in Dashboard Designer
by clicking Browse Files on the User Console Home page. Follow these quick steps.
1. In the Folders pane, click to expand the Public folder, then click to highlight the Steel
Wheels folder.
2. In the center pane, double-click on Sales Performance (Dashboard).
3. After the dashboard opens, click Edit in File Actions.
Add a Website
Use these steps to display contents of a website in a dashboard panel.
1. Select a panel in the Dashboard Designer.
2. Click (Insert) and choose URL. The Enter Web site dialog box appears.
3. Enter the website URL in the text box and click OK.
4. If applicable, click (Edit) to make changes.
Drag-and-Drop Content
Use these steps to add an existing chart, table, or file to your dashboard panels using the drag-and-
drop feature.
1. In the left pane of the Pentaho User Console, under Files, locate the content (chart, table,
or file) you want to add to your dashboard.
2. Click and drag the content into a blank panel on your dashboard. You will see the "title" of
the content as you move it around the dashboard. Notice that the title background is red; it
turns green when you find a panel where the content can be dropped.
3. Repeat steps 1 and 2 until your dashboard contains all the content you want to display. To
swap content from one panel to another, click the title bar of the panel that contains the
content you want to move and drag it onto the other panel.
If you are working with an existing dashboard, you can perform these steps as well; however, a
warning message appears when you try to place content in a panel that already contains content.
The new content will override the existing content.
Use Chart Designer
The Chart Designer allows you to create bar, pie, line, dial, and area charts that can be added to a
dashboard.
Adjust White Space in Dashboard Panels
Sometimes you must adjust the white space in dashboard panels, (or the filter panel), so that
content appears correctly. Use these steps to adjust white space.
1. In the lower pane, click General Settings and then click the Properties tab.
2. Click Resize Panels. The white space between the dashboard panels turns blue.
3. Adjust the panel size by clicking and holding the left mouse button down as you move the
blue lines (white space) around. Release the mouse button when you are satisfied with the
positioning of the panel.
4. Click Close in the lower-right corner of the dashboard to exit resize layout mode.
5. Examine the dashboard contents to make sure they are placed correctly. You can return to
the resize layout mode if you need to make additional changes.
Set the Refresh Interval
The content in your dashboard may need to be refreshed periodically if users are examining
real-time data. You can set refresh intervals for individual panels on your dashboard or for the
entire dashboard.
To set the refresh interval for individual panels in the dashboard, click the Edit button and then, in
the Objects panel, choose the panel that contains the content you want refreshed. Under Refresh
Interval (sec), enter the interval time in seconds and click Apply.
1. Click the Save As button, which is a floppy disk and pencil button, to open the Save
As dialog box.
2. In the File Name text box, type a file name for your dashboard.
3. Enter the path to the location where you want to save the dashboard. Alternatively, use the
up/down arrows or click Browse to locate the solution (content files) directory in which
you will save your dashboard.
4. Click Save. The report saves with the name specified.
Result :
_____________________________________________________________________________
Conclusion:
_____________________________________________________________________________
Industrial Applications
Creating and managing databases, data warehouses, and data mining applications.
References:
1. https://www.hitachivantara.com/en-in/products/big-data-integration-analytics/pentaho-trial-download.html
2. https://www.tutorialspoint.com/pentaho/index.htm
3. https://intellipaat.com/tutorial/pentaho-tutorial/introduction-to-pentaho/
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
b) Identify and use a standard data mining dataset available for the problem. Some links for data
mining datasets are: WEKA site, UCI Machine Learning Repository, KDD site, KDD Cup etc.
Advantages