
Experiment No: 01

Data Exploration & Data Preprocessing

Aim: To study and implement data exploration using the WEKA tool, to investigate the properties of data and how to visualize data, and to see how pre-processing can improve the information content of data.

Objectives:
To understand data exploration, an approach similar to initial data analysis in which a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, and to understand different data preprocessing techniques.

Software Required: MS word, Weka

Theory:
a) Data Exploration:

Data Exploration is about describing the data by means of statistical and visualization techniques.
We explore data in order to bring important aspects of that data into focus for further analysis.
Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual
exploration to understand what is in a dataset and the characteristics of the data, rather than through
traditional data management systems. These characteristics can include size or amount of data,
completeness of the data, correctness of the data, possible relationships amongst data elements or
files/tables in the data. Data exploration is typically conducted using a combination of automated
and manual activities. Automated activities can include data profiling or data visualization or
tabular reports to give the analyst an initial view into the data and an understanding of key
characteristics.
This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns
identified through the automated actions. Data exploration can also require manual scripting and
queries into the data or using Excel or similar tools to view the raw data. All of these activities are
aimed at creating a clear mental model and understanding of the data in the mind of the analyst,
and defining basic metadata (statistics, structure, relationships) for the data set that can be used in
further analysis. Once this initial understanding of the data is established, the data can be pruned or refined by removing unusable parts, correcting poorly formatted elements, and defining relevant relationships across datasets. This process is also known as determining data quality.

For the exercises in this tutorial you will use the ‘Explorer’. Click on the ‘Explorer’ button in the ‘WEKA GUI Chooser’ window.



The ‘WEKA Explorer’ window appears on the screen.

Opening file from a local file system


Click on the ‘Open file…’ button. It brings up a dialog box allowing you to browse for the data file on the local file system; choose the “weather.arff” file.



Some databases have the ability to save data in CSV format. In this case, you can select a CSV file from the local file system. If you would like to convert this file into ARFF format, you can click on the ‘Save’ button; WEKA automatically creates an ARFF file from your CSV file.



Once the data is loaded, WEKA recognizes the attributes, which are shown in the ‘Attribute’ window. The left panel of the ‘Preprocess’ window shows the list of recognized attributes: No. is a number identifying the order of the attributes as they appear in the data file, the selection tick boxes allow you to select attributes for the working relation, and Name is the name of an attribute as it was declared in the data file. The ‘Current relation’ box above the ‘Attribute’ box displays the base relation (table) name and
the current working relation (which are initially the same) - “weather”, the number of instances –
14 and the number of attributes - 5. During the scan of the data, WEKA computes some basic
statistics on each attribute.

The following statistics are shown in ‘Selected attribute’ box on the right panel of ‘Preprocess’
window:
Name is the name of the attribute,
Type is most commonly Nominal or Numeric,
Missing is the number (percentage) of instances in the data for which this attribute is unspecified,
Distinct is the number of different values that the data contains for this attribute, and
Unique is the number (percentage) of instances in the data having a value for this attribute that no other instances have.

Exercise:
Attribute-Relation File Format (ARFF) is the file format used to work with the WEKA tool. Load a few datasets in .arff format into a local folder. Open any dataset and examine the format of the file header and data. The header contains the list of attributes along with their data types; the data part contains one tuple (data object) per line. The data type of an attribute can be numeric, nominal, or date (a minimal example is sketched below). Load the iris.arff dataset in the Explorer and
Answer the following questions:
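For orientation, here is a minimal ARFF sketch, modeled on the standard weather data bundled with WEKA (treat the exact values as illustrative):

% minimal ARFF sketch (illustrative)
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes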

Q.1 What is the number of attributes and instances?


i. Open the WEKA Explorer.
ii. Click on the Preprocess tab.
iii. Click on “Open File”.
iv. Load the iris.arff dataset from the data folder in the WEKA installation folder.
v. Find out the number of attributes and instances.

Q.2 What are the acceptable values of the class label? How many instances of each label type are
available?
i. Open the WEKA Explorer.
ii. Click on the Preprocess tab.
iii. Click on “Open File”.
iv. Load the iris.arff dataset from the data folder in the WEKA installation folder.
v. Click on the class attribute.
vi. Find out the number of instances of each class label.
vii. Click on each attribute one by one.
viii. Find out the number of instances of each attribute value.

Q.3 How can you view the values of all tuples? Change the petal width of the 7th instance from 0.3 to 0.5.



i. Open the WEKA Explorer.
ii. Click on the Preprocess tab.
iii. Click on “Open File”.
iv. Load the iris.arff dataset from the data folder in the WEKA installation folder.
v. Click on the “Edit” button.
vi. Click on the 7th instance and change the value of petalwidth from 0.3 to 0.5.

b) Data Preprocessing:
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Real-world data are generally

a. Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
b. Noisy: containing errors or outliers
c. Inconsistent: containing discrepancies in codes or names

Data goes through a series of steps during preprocessing:

• Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts within
the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a data
warehouse.
• Data Discretization: Involves the reduction of the number of values of a continuous attribute
by dividing the range of the attribute into intervals.

Pre-processing tools in WEKA are called “filters”. WEKA contains filters for discretization,
normalization, resampling, attribute selection, transformation and combination of attributes. Some
techniques, such as association rule mining, can only be performed on categorical data. This
requires performing discretization on numeric or continuous attributes.
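Outside WEKA, the same ideas can be sketched in a few lines of Python. The snippet below uses pandas (an assumption, not part of this lab's toolchain) and a hypothetical 'temperature' column to illustrate cleaning, normalization, and discretization:

# A hedged sketch of cleaning, normalization and discretization with pandas
# (an assumption; the lab itself uses WEKA filters). 'temperature' is a
# hypothetical numeric column.
import pandas as pd

df = pd.DataFrame({"temperature": [64, 72, None, 85, 80, 68, None, 75]})

# Data cleaning: fill missing values with the column mean
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

# Data transformation: min-max normalization to [0, 1]
t = df["temperature"]
df["temperature_norm"] = (t - t.min()) / (t.max() - t.min())

# Data discretization: bin the continuous attribute into 3 equal-width intervals
df["temperature_bin"] = pd.cut(df["temperature"], bins=3, labels=["cool", "mild", "hot"])
print(df)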



Note: when you right-click on a filter, a ‘GenericObjectEditor’ dialog box comes up on your screen. The box lets you choose the filter configuration options. The same box is used for classifiers, clusterers, and association rules. Clicking on the ‘More’ button brings up an ‘Information’ window describing what the different options do.

At the bottom of the editor window there are four buttons. The ‘Open’ and ‘Save’ buttons allow you to load and save object configurations for future use. The ‘Cancel’ button allows you to exit without saving changes. Once you have made changes, click ‘OK’ to apply them.



Exercise:
Before choosing a data mining task, we may have to do some preprocessing, such as choosing
attributes, converting file formats, adding or removing attributes, and filling in missing instance
values.
We can use filters to preprocess a data set. The filters are arranged in a hierarchy and can be classified in two ways: a filter is either supervised or unsupervised, and it is either an instance filter or an attribute filter.

Answer the following questions:

Q.1 Choose a particular attribute, say ‘sepalwidth’ and remove it.

i. Open the WEKA Explorer.
ii. Click on the Preprocess tab.
iii. Click on “Open File”.
iv. Load the iris.arff dataset from the data folder in the WEKA installation folder.
v. Click on the ‘sepalwidth’ attribute.
vi. Click on the Remove button.

Q.2 Open the data file diabetes.arff. Choose the class attribute. Select ‘Visualize all’.
i. Open the WEKA Explorer.
ii. Click on the Preprocess tab.
iii. Click on “Open File”.
iv. Load the diabetes.arff dataset from the data folder in the WEKA installation folder.
v. Click on ‘Visualize all’.

Result:
_____________________________________________________________________________

Conclusion:
_____________________________________________________________________________

References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. www.cs.waikato.ac.nz/ml/weka/documentation.html
3. www.ittc.ku.edu/~nivisid/WEKA_MANUAL.pdf

Industrial Applications:
Database-driven applications such as customer relationship management
Rule-based applications



Questionnaire:
1 What is data exploration?
_____________________________________________________________________________

2 Which file format does weka tool use?


______________________________________________________________________________

3 What is the data type of attributes?


___________________________________________________________________________

4 How can you view the values of all tuples of the relation?
_____________________________________________________________________________

5 Why data preprocessing is required?


_____________________________________________________________________________

6 What is data preprocessing?


_____________________________________________________________________________

7 Enlist the steps in data preprocessing.


_____________________________________________________________________________

_____________________________________________________________________________

8 What are the steps in data cleaning?


_____________________________________________________________________________

_____________________________________________________________________________

9 What does data preprocessing include?


_____________________________________________________________________________

_____________________________________________________________________________

10 What are the types of filters?


_____________________________________________________________________________

_____________________________________________________________________________



Experiment No: 02
Classification using Weka Tool
Aim: To use WEKA to implement the following classifiers:
Decision tree
Naïve Bayes
Random Forests
and to determine the class labels of the given data sets using these classifiers.

What will you learn by performing this experiment?


By applying classification, the student will learn the data mining function that assigns items in a
collection to target categories or classes. The goal of classification is to accurately predict the target
class for each case in the data. For example, a classification model could be used to identify loan
applicants as low, medium, or high credit risks.

Software Required: MS word, Weka

Theory:
The Decision Tree algorithm, like Naive Bayes, is based on conditional probabilities. Unlike Naive
Bayes, decision trees generate rules. A rule is a conditional statement that can easily be understood
by humans and easily used within a database to identify a set of records. In some applications of
data mining, the accuracy of a prediction is the only thing that really matters.
1) A Naive Bayes classifier
It is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive)
independence assumptions. In simple terms, a naive Bayes classifier assumes that the presence (or
absence) of a particular feature of a class is unrelated to the presence (or absence) of any other
feature, given the class variable.
The Naive Bayes model

Abstractly, the probability model for a classifier is a conditional model p(C | F1, …, Fn) over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features is large, or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.

Using Bayes' theorem, we write

p(C | F1, …, Fn) = p(C) p(F1, …, Fn | C) / p(F1, …, Fn)

(Bayes' theorem)



Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). The naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.

• P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
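To make the computation concrete, here is a minimal Python sketch of naive Bayes over categorical data (an illustration, not WEKA's implementation; the data is the play-ball table shown later in this experiment):

# Minimal naive Bayes on categorical data (illustrative).
from collections import Counter, defaultdict

data = [  # (outlook, temperature, humidity, wind, play)
    ("sunny","hot","high","weak","no"), ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"), ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rain","mild","high","strong","no"),
]

classes = Counter(row[-1] for row in data)      # counts for the prior P(c)
cond = defaultdict(Counter)                     # counts for the likelihoods P(xi|c)
for *features, label in data:
    for i, v in enumerate(features):
        cond[(i, label)][v] += 1

def posterior(x):
    # returns the unnormalized P(c) * prod_i P(xi|c) for each class c
    scores = {}
    for c, n_c in classes.items():
        p = n_c / len(data)
        for i, v in enumerate(x):
            p *= cond[(i, c)][v] / n_c
        scores[c] = p
    return scores

print(posterior(("sunny", "cool", "high", "strong")))   # 'no' scores higher here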

2) Random Forests
We assume that the user knows about the construction of single classification trees. Random
Forests grows many classification trees. To classify a new object from an input vector, put the
input vector down each of the trees in the forest. Each tree gives a classification, and we say the
tree "votes" for that class. The forest chooses the classification having the most votes (over all the
trees in the forest).
Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random - but with
replacement, from the original data. This sample will be the training set for growing the
tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m
variables are selected at random out of the M and the best split on these m is used to split
the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.

In the original paper on random forests, it was shown that the forest error rate depends on two
things:

▪ The correlation between any two trees in the forest. Increasing the correlation increases
the forest error rate.
▪ The strength of each individual tree in the forest. A tree with a low error rate is a strong
classifier. Increasing the strength of the individual trees decreases the forest error rate.

Reducing m reduces both the correlation and the strength. Increasing it increases both. Somewhere
in between is an "optimal" range of m - usually quite wide. Using the error rate (see below) a value
of m in the range can quickly be found. This is the only adjustable parameter to which random
forests is somewhat sensitive.
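As a side illustration (outside the WEKA workflow of this lab), the voting scheme described above is available off the shelf in scikit-learn, assuming that library is installed:

# Random forest via scikit-learn (an assumption; the lab itself uses WEKA).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators is the number of trees; max_features plays the role of m above.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))    # majority vote over all the trees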



3) Growing a Decision Tree
A decision tree predicts a target value by asking a sequence of questions. At a given stage in the
sequence, the question that is asked depends upon the answers to the previous questions. The goal
is to ask questions that, taken together, uniquely identify specific target values. Graphically, this
process forms a tree structure.

Example of ID3

Day Outlook Temperature Humidity Wind Play ball

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

Suppose we want ID3 to decide whether the weather is amenable to playing baseball. Over the
course of two weeks, data is collected to help ID3 build a decision tree (see Table 1). The target
classification is "should we play baseball?", which can be yes or no. The weather attributes are
outlook, temperature, humidity, and wind speed.
They can have the following values:
outlook = {sunny, overcast, rain}
temperature = {hot, mild, cool}
humidity = {high, normal}
wind = {weak, strong}
The examples in set S are the 14 days listed in Table 1.

We need to find which attribute will be the root node in our decision tree. The gain is calculated
for all four attributes:

Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
The Outlook attribute has the highest gain; therefore it is used as the decision attribute in the root
node.

Since Outlook has three possible values, the root node has three branches (sunny, overcast,
rain). The next question is "what attribute should be tested at the Sunny branch node?" Since
we have used Outlook at the root, we only decide on the remaining three attributes: Humidity,
Temperature, or Wind.

Ssunny = {D1, D2, D8, D9, D11} = 5 examples from table 1 with outlook = sunny
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process goes on
until all data is classified perfectly or we run out of attributes.
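The gain values above can be checked with a few lines of Python; this sketch recomputes Gain(S, Outlook) from Table 1:

# Recompute Gain(S, Outlook) for the play-ball data (values from Table 1).
from math import log2
from collections import Counter

labels  = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
outlook = ["sunny","sunny","overcast","rain","rain","rain","overcast",
           "sunny","sunny","rain","sunny","overcast","overcast","rain"]

def entropy(ys):
    n = len(ys)
    return -sum(c/n * log2(c/n) for c in Counter(ys).values())

def gain(xs, ys):
    n = len(ys)
    rem = sum(xs.count(v)/n * entropy([y for x, y in zip(xs, ys) if x == v])
              for v in set(xs))
    return entropy(ys) - rem

print(round(gain(outlook, labels), 3))   # prints 0.246, matching Gain(S, Outlook)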

The final decision tree can also be expressed in rule format:
IF outlook = sunny AND humidity = high THEN playball = no
IF outlook = sunny AND humidity = normal THEN playball = yes
IF outlook = overcast THEN playball = yes
IF outlook = rain AND wind = strong THEN playball = no
IF outlook = rain AND wind = weak THEN playball = yes

Flowchart:

Start

Select the data set

Click the Classify tab

In the Choose option, expand ‘trees’ and select the J48 option

Click the Start button

In the left panel, right-click the trees.J48 entry and choose the ‘Visualize tree’ option

A separate window opens, showing the decision tree

If the nodes are very clumsy, right-click the window and select the ‘Fit to screen’ option

Stop
Algorithm:
Steps to implement the following classifiers- Decision tree, Naïve Bayes and Random Forest:
1. Select any data set, say weather.arff.
2. After the preprocessing step, click the Classify tab.
3. In the choose option, expand ‘trees’. Select the J48 option.
4. Click the start button. The classification is completed, and you can see the data in the
second panel.
5. In the left panel, right click the trees.J48 option, and choose the visualize tree option.
6. A separate window opens, showing the decision tree. If the nodes are very clumsy, right
click the window and select Fit to screen option, you can now see a clear tree view.
7. Repeat steps 1-4 for Naïve Bayes and Random Forest, selecting NaiveBayes (under ‘bayes’) and RandomForest (under ‘trees’) instead of J48.
8. Compare the outputs and accuracy of the three classifiers.

Result

_____________________________________________________________________________

Conclusion



_____________________________________________________________________________

Industrial Applications:
ID3 has been incorporated in a number of commercial rule-induction packages. Some specific applications include medical diagnosis, credit risk assessment of loan applications, diagnosis of equipment malfunctions by their cause, classification of soybean diseases, and web search classification.
1. Banking:
In the banking sector, the random forest algorithm is widely used in two main applications: finding loyal customers and finding fraudulent customers.

A loyal customer is not only a customer who pays well, but also one who can take a large loan and pay the interest properly to the bank. As the growth of the bank depends on its loyal customers, customer data is analyzed in depth to find the patterns that identify loyal customers from their details.

In the same way, there is a need to identify customers who are not profitable for the bank, such as those who take loans but do not pay the interest properly, or outlier customers. If the bank can identify these kinds of customers before granting a loan, it gets the chance to refuse the loan to them. Here too, the random forest algorithm is used to identify customers who are not profitable for the bank.

2. Medicine
In the medical field, the random forest algorithm is used to identify the correct combination of components to validate a medicine. It is also helpful for identifying a disease by analyzing the patient's medical records.

3. Stock Market
In the stock market, the random forest algorithm is used to identify stock behavior as well as the expected loss or profit from purchasing a particular stock.

4. E-commerce
In e-commerce, random forest is used in a small segment of the recommendation engine to identify the likelihood of a customer liking the recommended products, based on similar kinds of customers.

References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. www.cs.waikato.ac.nz/ml/weka/documentation.html
3. www.ittc.ku.edu/~nivisid/WEKA_MANUAL.pdf



Questionnaire:

1. Give the applications of classification?


_____________________________________________________________________________

2. What is a rule?
_____________________________________________________________________________

3. What is classification?

_____________________________________________________________________________

4. Enlist the steps in classification.

_____________________________________________________________________________

5. How can an internal node in a decision tree be denoted?

_____________________________________________________________________________

6. What does a branch node in a decision tree denote?

_____________________________________________________________________________

7. What does a leaf node in a decision tree denote?

_____________________________________________________________________________

8. What happens in learning step?

_____________________________________________________________________________

9. What happens in classification step?

_____________________________________________________________________________

10. How do you calculate the accuracy of a classifier?

_____________________________________________________________________________



Experiment No: 03
Clustering using Weka Tool
Aim: To study and implement the following clustering methods using the WEKA tool:
K-Means
Agglomerative
Divisive
and to organize objects with similarities into groups called clusters.

Objectives:
A cluster of data objects can be treated as one group. While doing cluster analysis, we first partition
the set of data into groups based on data similarity and then assign labels to the groups. The main
advantage of clustering over classification is that it is adaptable to changes and helps single out useful
features that distinguish different groups.

Software Required: MS word, WEKA

Theory:
Clustering is “the process of organizing objects into groups whose members are similar in some
way”. Clustering is the task of assigning a set of objects into groups (called clusters) so that the
objects in the same cluster are more similar (in some sense or another) to each other than to those
in other clusters. Clustering is a main task of explorative data mining, and a common technique
for statistical data analysis used in many fields, including machine learning, pattern recognition,
image analysis, information retrieval, and bioinformatics. Clustering can be considered the most
important unsupervised learning technique; as with every other problem of this kind, it deals with
finding structure in a collection of unlabeled data.

Clustering Methods
To determine the distance between clusters based on their member elements, the following
methods have been implemented:
a. Single Linkage - the minimum distance between any members of each group
b. Complete Linkage - the maximum distance between any members of each group
c. Average Linkage - the average pair-wise distance between each member of one cluster and each member of the other cluster
d. Average Group Linkage - the average distance between all possible element pairs of the union of the two clusters
e. Centroid - the distance between the mean vectors (centroids) of the two clusters
f. Ward's Method - the increase in variance when merging two clusters
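For illustration (not part of the WEKA exercise), these linkage criteria can be tried with SciPy, assuming it is installed:

# Agglomerative clustering with different linkage criteria via SciPy
# (an assumption; the lab itself uses WEKA's HierarchicalCluster).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])      # four 2-D points

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                   # merge history (dendrogram data)
    print(method, fcluster(Z, t=2, criterion="maxclust"))   # cut into 2 clusters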

Algorithm:
Steps to implement the k-means clustering algorithm using WEKA:
1. Choose any data set, for example, diabetes.arff.
2. Go to the Cluster panel and choose SimpleKMeans.
3. Click the field containing the text. This pops up another window where you can choose
different options.
4. Choose appropriate training and testing sets, in case you want to perform supervised
clustering. The defaults are two clusters and the Euclidean distance, respectively.
5. Repeat steps 1-4 for hierarchical clustering, selecting HierarchicalCluster instead of
SimpleKMeans.

Flowchart:

Start

Choose any data set

Go to the Cluster panel and choose SimpleKMeans

Click the field containing the text

Choose appropriate training and testing sets

Stop

Result:

_____________________________________________________________________________

Conclusion:

_____________________________________________________________________________

Industrial Applications:
Credit card companies mine transaction records for fraudulent use of their cards based on the purchase patterns of consumers; they can deny access if your purchase patterns change drastically.

References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. www.cs.waikato.ac.nz/ml/weka/documentation.html
3. www.ittc.ku.edu/~nivisid/WEKA_MANUAL.pdf



Questionnaire

1. What is clustering?
_____________________________________________________________________________

2. Give one example where clustering is used.


_____________________________________________________________________________

3. What is a dendrogram?
_____________________________________________________________________________

4. Enlist the methods used to calculate the distance between the clusters.
_____________________________________________________________________________

5. How does K-means algorithm work?


____________________________________________________________________________

_____________________________________________________________________________

6. Why is clustering known as unsupervised learning?


_____________________________________________________________________________

7. What are the different clustering methods?

_____________________________________________________________________________

8. Give the examples of partitioning method of clustering.

_____________________________________________________________________________

9. Give the examples of density-based method of clustering.

_____________________________________________________________________________

10. Give the examples of grid-based method of clustering.

_____________________________________________________________________________



Experiment No: 04
Apriori Algorithm using WEKA

Aim: To use the WEKA tool to implement association mining using the Apriori algorithm and the FP-Growth algorithm; to find the frequent itemsets and derive the association rules.

Objectives:
To learn the association data mining function, which discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules. Association rules are often used to analyze sales transactions.

Software Required: MS word, Weka


Theory:

In data mining, association rule learning is a popular and well researched method for discovering
interesting relations between variables in large databases. Association rules are usually required to
satisfy a user-specified minimum support and a user-specified minimum confidence at the same
time. Association rule generation is usually split up into two separate steps:

1. First, minimum support is applied to find all the frequent itemsets in a database.
2. Second, these frequent itemsets and the minimum confidence constraint are used to form
rules.

The Apriori algorithm:

1. k = 1;
2. Find the frequent itemset Lk from Ck, the set of all candidate itemsets;
3. Form Ck+1 from Lk;
4. k = k + 1;
5. Repeat 2-4 until Ck is empty.
Step 2 is called the frequent itemset generation step.
Step 3 is called the candidate itemset generation step.

Frequent itemset generation

Scan D and count each itemset in Ck; if the count is greater than minSupp, then add that itemset
to Lk.

Candidate itemset generation
For k = 1, C1 = all itemsets of length 1.
For k > 1, generate Ck from Lk-1 as follows:
The join step:
Ck = the (k-2)-way join of Lk-1 with itself.
If both {a1, .., ak-2, ak-1} and {a1, .., ak-2, ak} are in Lk-1, then add {a1, .., ak-2, ak-1, ak} to Ck.
The items are always stored in sorted order.
The prune step:
Remove {a1, …, ak-2, ak-1, ak} if it contains a non-frequent (k-1)-subset.

Sample usage of Apriori algorithm


Assume that a large supermarket tracks sales data by stock-keeping unit (SKU) for each item:
each item, such as "butter" or "bread", is identified by a numerical SKU. The supermarket has
a database of transactions where each transaction is a set of SKUs that were bought together.
Let the database of transactions consist of following itemsets:
TID Itemsets
1 {1,2,3,4}
2 {1,2,4}
3 {1,2}
4 {2,3,4}
5 {2,3}
6 {3,4}
7 {2,4}
We will use Apriori to determine the frequent item sets of this database. To do this, we will say
that an item set is frequent if it appears in at least 3 transactions of the database: the value 3 is
the support threshold.
The first step of Apriori is to count up the number of occurrences, called the support, of each
member item separately. By scanning the database for the first time, we obtain the following
result

Item Support

{1} 3
{2} 6
{3} 4
{4} 5

All the itemsets of size 1 have a support of at least 3, so they are all frequent.
The next step is to generate a list of all pairs of the frequent items.
For example, regarding the pair {1,2}: the transaction table above shows items 1 and 2
appearing together in three of the itemsets; therefore, we say the itemset {1,2} has a support of three.
Item Support
{1,2} 3
{1,3} 1
{1,4} 2
{2,3} 3
{2,4} 4
{3,4} 3



The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum support of 3, so they
are frequent. The pairs {1,3} and {1,4} are not. Now, because {1,3} and {1,4} are not frequent,
any larger set which contains {1,3} or {1,4} cannot be frequent. In this way, we can prune sets:
we will now look for frequent triples in the database, but we can already exclude all the triples
that contain one of these two pairs:

Item Support
{2,3,4} 2

In the example, there are no frequent triplets: {2,3,4} is below the minimal threshold, and the
other triplets were excluded because they were supersets of pairs that were already below the
threshold.
We have thus determined the frequent sets of items in the database, and illustrated how some
items were not counted because one of their subsets was already known to be below the
threshold.
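These counts can be reproduced with a short Python sketch of the two Apriori steps, join and prune (an illustration, not WEKA's implementation):

# Apriori on the transaction database above; prints each frequent level Lk.
from itertools import combinations

transactions = [{1,2,3,4}, {1,2,4}, {1,2}, {2,3,4}, {2,3}, {3,4}, {2,4}]
min_support = 3

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
k = 2
while frequent:
    print({tuple(sorted(s)): support(s) for s in frequent})
    # join step: unions of frequent (k-1)-itemsets that give k items
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # prune step: keep a candidate only if every (k-1)-subset is frequent
    candidates = [c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    frequent = [c for c in candidates if support(c) >= min_support]
    k += 1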
Steps to implement association mining using Apriori and FP-Growth:
1. Load the .arff file, say supermarket.arff.
2. Click the Associate tab, and choose Apriori as the associator.
3. Run with the default values.
4. Study the output.
(For FP-Growth, repeat the steps choosing FPGrowth as the associator.)

Flowchart:

Start

Load the data set. Click the Associate tab, and choose Apriori as the associator

Run with the default values

Study the output

Stop

Result :
_____________________________________________________________________________



Conclusion:
_____________________________________________________________________________

References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. www.cs.waikato.ac.nz/ml/weka/documentation.html
3. www.ittc.ku.edu/~nivisid/WEKA_MANUAL.pdf
4. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.

Industrial Applications:
1. Market basket analysis
2. Supermarket sales analysis
3. Analyzing customer interests in the retail industry
4. Application of the Apriori algorithm for adverse drug reaction detection
5. Detection of adverse drug reactions (ADR) in health care data: the Apriori algorithm is used to perform association analysis on the characteristics of patients, the drugs they are taking, their primary diagnosis, co-morbid conditions, and the ADRs or adverse events (AE) they experience.



Questionnaire :

1. Define frequent itemset.

_____________________________________________________________________________

2. What is minimum support count?

_____________________________________________________________________________

3. What is prune step?

_____________________________________________________________________________

4. What is join step?

_____________________________________________________________________________

5. Give the steps in association rule mining.

_____________________________________________________________________________

6. Write the steps in Apriori algorithm?

_____________________________________________________________________________

7. How do you calculate support?

_____________________________________________________________________________

8. How do you calculate confidence?

_____________________________________________________________________________

9. What is association mining?

_____________________________________________________________________________

10. How can you find frequent itemset without candidate generation?

_____________________________________________________________________________



Experiment No: 05
Classification using Java/Python/R
Aim: Implementation of any classification algorithm using Java/Python/R, to predict the class label of the given data set.

Objectives:
To learn the actual implementation of a classification algorithm and analyze the result.

Software Required: Java/Python/R

Theory:

The Decision Tree algorithm is based on conditional probabilities. Decision trees generate rules.
A rule is a conditional statement that can easily be understood by humans and easily used within
a database to identify a set of records. In some applications of data mining, the accuracy of a
prediction is the only thing that really matters. It may not be important to know how the model
works. In others, the ability to explain the reason for a decision can be crucial. For example, a
Marketing professional would need complete descriptions of customer segments in order to launch
a successful marketing campaign. The Decision Tree algorithm is ideal for this type of application.
For example, such a rule might come from a decision tree that predicts the probability that customers will increase spending if given a loyalty card. A target value of 0 means not likely to increase spending; 1 means likely to increase spending.

Growing a Decision Tree


A decision tree predicts a target value by asking a sequence of questions. At a given stage in the
sequence, the question that is asked depends upon the answers to the previous questions. The goal
is to ask questions that, taken together, uniquely identify specific target values. Graphically, this
process forms a tree structure.

Tuning the Decision Tree Algorithm


The Decision Tree algorithm is implemented with reasonable defaults for splitting and termination
criteria. It is unlikely that you will need to use any of the build settings that are supported for
Decision Tree.
Algorithm for decision trees:

BuildTree(data set S)
    if all records in S belong to the same class
        return
    for each attribute Ai
        evaluate splits on attribute Ai
    use the best split found to partition S into S1 and S2
    BuildTree(S1)
    BuildTree(S2)

PruneTree(node t)
    if t is a leaf
        return C(S) + 1           /* C(S) is the cost of encoding the classes
                                     for the records in set S */
    minCost1 := PruneTree(t1)     /* t1, t2 are t's children */
    minCost2 := PruneTree(t2)
    minCostt := min(C(S) + 1, Csplit(t) + 1 + minCost1 + minCost2)
    return minCostt

The entropy measure used to evaluate splits is E(S) = - Σ pi log2 pi.
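As a concrete starting point for this experiment, here is a minimal Python sketch (one of the languages named in the aim) using scikit-learn, which is an assumption rather than a required library:

# An entropy-based decision tree in Python via scikit-learn (an assumption).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# criterion="entropy" matches the E(S) = -sum(pi log2 pi) measure above
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)
print(export_text(clf))                      # the learned rules as readable text
print("accuracy:", clf.score(X_te, y_te))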

Flowchart:-

Result:-

_____________________________________________________________________________

Conclusion:-

_____________________________________________________________________________



Industrial Applications:
1.Weather forecast
2.Predict sales and stock up
3.Health Sciences

References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf

2. https://data-flair.training/blogs/data-mining-algorithms/

3. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.



Questionnaire:

1. Give the formula for Bayes’ Theorem?


_____________________________________________________________________________

2. Which type of training is used to train the naïve bayes classifier?


_____________________________________________________________________________

3. Enlist the common approaches for tree pruning?


_____________________________________________________________________________

4. Differentiate between prepruning and postpruning?


_____________________________________________________________________________

5. How do you calculate true positive rate?


_____________________________________________________________________________

6. How do you calculate true negative rate?


_____________________________________________________________________________

7. What is precision?
_____________________________________________________________________________

8. What is recall?
_____________________________________________________________________________

9. What is the assumption of Naïve Bayes’ classification?


_____________________________________________________________________________

10. Enlist different classification methods.


_____________________________________________________________________________



Experiment No: 06
K-means Clustering using Java/Python
Aim: Implementation of the K-means algorithm using Java/Python, to identify the similarities and dissimilarities among objects and group them into clusters according to their similarities.

Objectives:
To learn the actual implementation of a clustering algorithm and analyze the result.

Software Required: Java/Python

Theory:

Algorithm: k-means clustering


k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.
The main idea is to define k centers, one for each cluster. These centers should be placed in a cunning way, because different locations cause different results. So the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to a given data set and associate it to the nearest center. When no point is pending, the first step is completed and an early grouping is done.
At this point we need to re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new center. A loop has been generated. As a result of this loop we may notice that the k centers change their location step by step until no more changes are done, or in other words until the centers do not move any more.
Finally, this algorithm aims at minimizing an objective function known as the squared error function, given by

J(V) = Σ(j=1..c) Σ(i=1..cj) ( ||xi - vj|| )²

where

‘||xi - vj||’ is the Euclidean distance between point xi and center vj,
‘cj’ is the number of data points in the jth cluster, and
‘c’ is the number of cluster centers.

Algorithmic steps for k-means clustering


Let X = {x1, x2, x3, ……, xn} be the set of data points and V = {v1, v2, ……, vc} be the set of centers.

1) Randomly select ‘c’ cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4) Recalculate the new cluster centers using vi = (1/ci) Σ(j=1..ci) xj, where ‘ci’ represents the number of data points in the ith cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop; otherwise repeat from step 2.

Flowchart: (figure omitted)

Worked example:

Suppose we have several objects (4 types of medicines) and each object have two attributes or
features as shown in table below. Our goal is to group these objects into K=2 group of medicine
based on the two features (pH and weight index).

Object        Feature 1 (X): weight index    Feature 2 (Y): pH
Medicine A    1                              1
Medicine B    2                              1
Medicine C    4                              3
Medicine D    5                              4
1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1).
2. Objects-Centroids distance: we calculate the distance from each cluster centroid to each object. Using the Euclidean distance, the distance matrix at iteration 0 is

          A      B      C      D
D0 =  [   0      1      3.61   5    ]   c1 = (1,1) ---- Group 1
      [   1      0      2.83   4.24 ]   c2 = (2,1) ---- Group 2

with objects A = (1,1), B = (2,1), C = (4,3), D = (5,4).

Each column in the distance matrix corresponds to one object. The first row of the distance matrix holds the distance of each object to the first centroid and the second row the distance of each object to the second centroid. For example, the distance from medicine C = (4,3) to the first centroid c1 = (1,1) is √((4-1)² + (3-1)²) = 3.61, and its distance to the second centroid c2 = (2,1) is √((4-2)² + (3-1)²) = 2.83.

3. Objects clustering: We assign each object based on the minimum distance. Thus, medicine A is assigned to group 1, and medicines B, C and D to group 2. The element of the group matrix below is 1 if and only if the object is assigned to that group.

          A  B  C  D
G0 =  [   1  0  0  0 ]   Group 1
      [   0  1  1  1 ]   Group 2

4. Iteration 1, determine centroids: Knowing the members of each group, we now compute the new centroid of each group based on these new memberships. Group 1 has only one member, so its centroid remains c1 = (1,1). Group 2 now has three members, so its centroid is the average coordinate of those members: c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3).

5. Iteration 1, Objects-Centroids distances: The next step is to compute the distance of all objects to the new centroids. As in step 2, the distance matrix at iteration 1 is

          A      B      C      D
D1 =  [   0      1      3.61   5    ]   c1 = (1,1) ---- Group 1
      [   3.14   2.36   0.47   1.89 ]   c2 = (11/3, 8/3) ---- Group 2



6. Iteration 1, Objects clustering: As in step 3, we assign each object based on the minimum distance. Based on the new distance matrix, we move medicine B to Group 1 while all the other objects remain where they are. The group matrix becomes

          A  B  C  D
G1 =  [   1  1  0  0 ]   Group 1
      [   0  0  1  1 ]   Group 2

7. Iteration 2, determine centroids: We repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 both have two members, so the new centroids are c1 = ((1+2)/2, (1+1)/2) = (3/2, 1) and c2 = ((4+5)/2, (3+4)/2) = (9/2, 7/2).

8. Iteration 2, Objects-Centroids distances: Repeating step 2, the new distance matrix at iteration 2 is

          A      B      C      D
D2 =  [   0.5    0.5    3.20   4.61 ]   c1 = (3/2, 1) ---- Group 1
      [   4.30   3.54   0.71   0.71 ]   c2 = (9/2, 7/2) ---- Group 2

9. Iteration 2, Objects clustering: Again, we assign each object based on the minimum distance.

          A  B  C  D
G2 =  [   1  1  0  0 ]   Group 1
      [   0  0  1  1 ]   Group 2

We obtain G2 = G1. Comparing the grouping of this iteration with that of the last iteration reveals that the objects no longer move between groups. Thus the computation of the k-means clustering has reached its stability and no more iterations are needed. We get the final grouping as the result.

Object        Feature 1 (X): weight index    Feature 2 (Y): pH    Group (Result)
Medicine A    1                              1                    1
Medicine B    2                              1                    1
Medicine C    4                              3                    2
Medicine D    5                              4                    2
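The same computation can be scripted; here is a minimal pure-Python sketch that reproduces the grouping above, starting from centroids A and B:

# Pure-Python k-means reproducing the medicine example (illustrative).
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]                  # initial c1, c2

def nearest(p):
    # index of the closest centroid (squared Euclidean distance)
    return min(range(len(centroids)),
               key=lambda i: (p[0]-centroids[i][0])**2 + (p[1]-centroids[i][1])**2)

assignment = {}
while True:
    new_assignment = {name: nearest(p) for name, p in points.items()}
    if new_assignment == assignment:          # no object moved: stop
        break
    assignment = new_assignment
    for i in range(len(centroids)):           # recompute each centroid
        members = [points[n] for n, g in assignment.items() if g == i]
        centroids[i] = (sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members))

print(assignment)   # {'A': 0, 'B': 0, 'C': 1, 'D': 1} -> groups {A,B} and {C,D}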



Result :

_____________________________________________________________________________

Conclusion:

_____________________________________________________________________________

Industrial Applications:
1. Credit card companies mine transaction records for fraudulent use of their cards based on the purchase patterns of consumers; they can deny access if your purchase patterns change drastically.
2. Pattern recognition
3. Image analysis
4. Bioinformatics
5. Machine learning
6. Voice mining
7. Image processing

References
1. statweb.stanford.edu/~lpekelis/13_datafest.../WekaManual-3-7-8.pdf
2. https://data-flair.training/blogs/clustering-in-data-mining/

3. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.



Questionnaire:

1. Which approach is the top-down approach: agglomerative or divisive?


_____________________________________________________________________________

2. Give the formula of single linkage clustering.


_____________________________________________________________________________

3. What is the formula for complete linkage clustering?


_____________________________________________________________________________

4. What is the average linkage clustering?


_____________________________________________________________________________

5. What are the parameters of CF tree?


_____________________________________________________________________________

6. What is a branching factor?


_____________________________________________________________________________

7. What is threshold?
_____________________________________________________________________________

8. What does DBSCAN stand for?


_____________________________________________________________________________

9. What does BIRCH stand for?


_____________________________________________________________________________

10. Give the steps for agglomerative clustering?


_____________________________________________________________________________



Experiment No: 07
Pentaho

Aim: Study of BI Tool- Pentaho BI.

Objectives: To give the student maximum exposure to Pentaho components. To define and
describe business analytics and business intelligence.

Theory:

Having a lot of numbers, spreadsheets full of values, charts, diagrams, histograms, and so forth, it is rarely easy to discover trends or specific dependencies soon enough to act on them. Prediction is then difficult, because everything needs to grow significantly before it is noticed. Pentaho Data Mining, like any other data mining solution, helps discover these trends before they are actually noticeable by a human. The system's idea is to quickly analyze extremely large volumes of data and search for any trends that are taking shape. The most meaningful advantage is that the Pentaho Data Mining tool highlights trends long before they become "traditionally" noticeable, which makes it a real asset in today's Business Intelligence.
Which contractor to choose? Which product to buy? Sometimes it truly is difficult to decide, and these are the cases an efficient data mining solution can help with. Who, besides the system, would be able to find out that contractors of a given type tend to delay their payments, and that there is therefore a significant risk that the one we want to begin cooperating with would do the same? No one.
With full support for data integration, analysis, dashboards, and reporting, Pentaho Data Mining is truly a solution worth considering.

Data mining with Pentaho in practice

The idea of data mining isn't complex; however, doing it "manually" would be difficult and time consuming, and wouldn't guarantee final success. Data mining within the Pentaho Data Mining tools begins with choosing a model. There are numerous options to choose from: segmentation, decision trees, neural nets, random forests, clustering, and many others. The efficiency of the data mining can depend on the chosen model. Then data is added. After this, the chosen model has to be adapted to the sample data; this is a crucial moment, and there are two methods to choose from. In all cases, it can be done automatically (following the most common procedures and parameters), but it is sometimes possible to do it manually as well. Nonetheless, the adjusted parameters require testing: it is suggested to verify the model on some data from the future (and check whether the output is more or less the same as what happened next). Refinement, if needed, can be applied later to ensure the model suits the data as well as possible. Finally, it comes to data mining in the form most people understand it. Once the data is inputted and the model is fitted and refined, it is time to deliver the output. How the results will look depends on the Pentaho Data Mining user; there are always different options to choose from (alerts in other applications, graphical illustrations, etc.).

Pentaho Data Mining features

Among many others, the main Pentaho Data Mining features are:

• powerful engine working well even with the largest data volumes
• numerous and differentiated learning algorithms originated from the Weka (principal component
analysis, random forests, decision trees, neural networks, segmentation, clustering, so on and so
forth)
• simplified and accelerated data integration
• automated data transforming capability (from almost any other to the format Pentaho Data
Mining requires)
• two ways of applying algorithms (from Java code or directly to the dataset)
• various methods for output presentation
• differentiated filters for data analysis
• PMML (Predictive Model Markup Language) support
• graphical user interfaces
• efficient capabilities for uncovering hidden relationships and patterns
• use of already discovered patterns in future data mining
• capability of embedding insights into other applications (patterns can then be displayed every
time they could be useful, not only when one explicitly checks for them)
Pentaho Data Mining resources
Pentaho BI Suite

●Open source Business Intelligence tool

●It provides support for data integration, reporting, OLAP analysis, dashboards, and data mining.



Pentaho Architecture

Pentaho Data Integration (PDI)

Pentaho Data Integration (PDI) provides the Extract, Transform, and Load (ETL) capabilities that
facilitate the process of capturing, cleansing, and storing data using a uniform and consistent
format that is accessible and relevant to end users and IoT technologies.

Common uses of Pentaho Data Integration include:



• Data migration between different databases and applications
• Loading huge data sets into databases taking full advantage of cloud, clustered and
massively parallel processing environments
• Data Cleansing with steps ranging from very simple to very complex transformations
• Data Integration including the ability to leverage real-time ETL as a data source for
Pentaho Reporting
• Data warehouse population with built-in support for slowly changing dimensions and
surrogate key creation

PDI – Example
Kettle/Spoon
Pentaho Schema Workbench (PSW)

With a physical multidimensional data model in place, you must create a logical model that maps
to it. A Mondrian schema is essentially an XML file that performs this mapping, thereby defining



a multidimensional database structure. You can create Mondrian schemas using the Pentaho
Schema Workbench.

In a very basic scenario, you will create a Mondrian schema with one cube that consists of a single
fact table and a few dimensions, each with a single hierarchy consisting of a handful of levels.
More complex schemas may involve multiple virtual cubes, and instead of mapping directly to the
single fact table at the center of a star schema, they might map to views or inline tables instead.

Get Started with the Schema Workbench

Before you start using Schema Workbench, you should be aware of the following points:

• You start Schema Workbench by executing the /pentaho/design-tools/schema-workbench/workbench script. On Linux and OS X this is a .sh file; on Windows it is a .bat file.
• You must be familiar with your physical data model before you use Schema Workbench.
If you don't know which tables are your fact tables and how your dimensions relate to them, you
will not be able to make significant progress in developing a Mondrian schema.
• When you make a change to any field in Schema Workbench, the change will not be
applied until you click out of that field such that it loses the cursor focus.
• Schema Workbench is designed to accommodate multiple sub-windows. By default they
are arranged in a cascading fashion. However, you may find more value in a tiled format,
especially if you put the JDBC Explorer window next to your Schema window so that you
can see the database structure at a glance. Simply resize and move the sub-windows until
they are in agreeable positions.

Add a Data Source

Your data source must be available, its database driver JAR must be present in
the /pentaho/design-tools/schema-workbench/drivers/ directory, and you should know or be
able to obtain the database connection information and user account credentials for it.

Follow the process below to connect to a data source in Schema Workbench.


1. Establish a connection to your data source by going to the Options menu and
selecting Connection. The Database Connection dialog appears.

2. Select your database type, then enter in the necessary database connection information,
then click Test. When you've verified that the connection settings work, click OK. The
database connection information includes the database name, port number, and user
credentials. If you don't know what to type into any of these fields, consult your database
administrator or database vendor's documentation.
3. The Require Schema check box, when selected in the Options menu, puts Schema
Workbench into a mode where unpopulated elements appear in the schema.
4. If you are using an Oracle data source, selecting Require Schema will dramatically improve
your Analysis schema load time.
5. If you required a database schema in the previous step, you must now define it by going to the Options section of the database dialog and creating a parameter called FILTER_SCHEMA_LIST with a value of the schema name you want to use.

Your data is now available to Schema Workbench, and you can proceed with creating a
Mondrian schema.
Remove Mondrian Data Sources
As you phase out old analysis schemas, you will have to manually remove their data source entries
in the Data Source Wizard in the User Console.
1. Log in to the User Console with administrator credentials.
2. On the Home page of the User Console, click Manage Data Sources. The Data Source
Wizard appears.
3. Click to highlight the data source to be deleted, and click Remove.

The data source is removed and is no longer available for use.


Create a Mondrian Schema

In order to complete this process, you should have already connected to your data source in Schema
Workbench.

This section explains the basic procedure for creating a barebones Mondrian schema using Schema Workbench.
1. To create a new Mondrian schema, click the New button, or go to the File menu, then select New,
then Schema. A new schema sub-window will appear. Resize it to fit your preference.

2. It's easier to visualize your physical data model if you have it in front of you. Turn on the JDBC
Explorer from the New section of the File menu and position it according to your preference. If you
have a third-party database visualization tool that you are more familiar with, use that instead. The
JDBC Explorer is not interactive; it only shows the table structure of your data source so that you can see at a glance the names of the tables and columns in it.

3. Typically your first action when creating a schema is to add a cube. Right-click the Schema icon
in the schema window, then select Add cube from the context menu. Alternatively you can click
the New Cube button in the toolbar. A new default cube will show up in your schema.

4. Give your cube a name.

5. Add a table by clicking the New Table button, or by right-clicking your cube, then selecting Add
Table. This will be your fact table. Alternatively you can select View or Inline Table if these are
the data types you need for your fact table.

6. Click the Table entry in the name field of your new table, and select or type in the name of the table
in your physical model that you want to use for this cube's fact table.

7. Add a dimension by right-clicking the cube, then selecting Add Dimension, or by clicking the New
Dimension button.

8. Type in a friendly name for this dimension in the name field.

9. Select a foreign key for this dimension from the foreignKey drop-down box, or just type it into the
field.

10. When you add a dimension, a new hierarchy is automatically created for it. To configure the
hierarchy, expand the dimension by clicking the lever icon on the left side of the dimension's tree
entry, then click on New Hierarchy 0. Choose a primaryKey or primaryKeyTable.

11. Add a table to the hierarchy by right-clicking the hierarchy, then selecting Add Table from the
context menu.

12. Choose a column for the name attribute.

13. Add a level to the hierarchy by right-clicking the hierarchy, then selecting Add Level from the
context menu.

14. Give the level a name and choose a column for it.

15. Add a member property to the level by right-clicking the level, then selecting Add Property from
the context menu.

16. Give the property a name and choose a column for it.

17. Add a measure to the cube by right-clicking the cube and selecting Add Measure from the context
menu.

18. Choose a column that you want to provide values for, then select an aggregator to determine how
the values should be calculated.
These instructions have shown you how to use Schema Workbench's interface to add and configure basic
Mondrian schema elements.
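For instance, the member property added in steps 15 and 16 ends up in the schema XML as a Property element nested inside its Level. A hypothetical sketch (the level, property, and column names are placeholders):

<Level name="City" column="city" uniqueMembers="false">
  <!-- A member property attached to each City member -->
  <Property name="Phone" column="phone"/>
</Level>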

When your schema is finished, you should test it with a basic MDX query such as:

select {[Dim1].[All Dim1s]} on rows, {[Measures].[Meas1]} on columns from [CubeName]
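Against the hypothetical SalesSchema sketched earlier, that template would instantiate to something like the following (Mondrian's default "all" member naming is assumed here):

select {[Store].[All Stores]} on rows, {[Measures].[Unit Sales]} on columns from [Sales]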

In order to use your schema as a data source in any Pentaho Business Analytics client tools, you
must publish it to the Pentaho Server. To do this, select Publish from the File menu, then enter your Pentaho Server connection information and credentials when requested.

Edit a Schema
There are two advanced tools in Schema Workbench that enable you to work with raw MDX and
XML. The first is the MDX query editor, which can query your logical data model in real time. To
open this view, go to the File menu, select New, then click MDX Query.

The second is XML viewing mode, which you can get to by clicking the rightmost icon (the pencil)
in the toolbar. This replaces the name/value fields with the resultant XML for each selected
element. To see the entire schema, select the top-level schema entry in the element list on the left
of the Schema Workbench interface. Unfortunately you won't be able to edit the XML in this view;
if you want to edit it by hand, you'll have to open the schema in an XML-aware text editor.
Add Business Groups
The available fields list in Analyzer organizes fields in folders according to
the AnalyzerBusinessGroup annotation. To implement business groups, add these annotations to
your member definitions appropriately. If no annotation is specified, then the group defaults to
"Measures" for measures and the hierarchy name/caption for attributes.

Below is an example that puts Years, Quarters and Months into a "Time Periods" business group:

...

<Level name="Years" ... >
<Annotations><Annotation name="AnalyzerBusinessGroup">Time
Periods</Annotation></Annotations>
</Level>
<Level name="Quarters" ... >
<Annotations><Annotation name="AnalyzerBusinessGroup">Time
Periods</Annotation></Annotations>
</Level>
<Level name="Months" ... >
<Annotations><Annotation name="AnalyzerBusinessGroup">Time
Periods</Annotation></Annotations>
</Level>
...

The AnalyzerBusinessGroup annotation is supported on the following schema elements:


• Level
• Measure
• CalculatedMember
• VirtualCubeMeasure

Add Field Descriptions

By adding description attributes to your Mondrian schema elements, you can enable tooltip
(mouse-over) field descriptions in Analyzer reports.

<Level name="Store Country" column="store_country" uniqueMembers="true"


caption="%{foodmart.dimension.store.country.caption}"
description="%{foodmart.dimension.store.country.description}"/>

Remove the line-wrap or this may not work. These variables will not work unless you localize
schemas.

This attribute can be set on the following schema elements:

• Level
• Measure
• CalculatedMember

Build a Schema and Detect Errors


Analysis schemas are built by publishing them to the Pentaho Server. Each schema is validated to
make sure that there are no errors before it is built; if there are any, they'll be shown to you and the
schema will fail to publish. If you want to see the errors marked in Schema Workbench before you
publish, go to the Options menu and select Require Schema. When this option is checked,
schema validation will happen as new elements are added, and any errors will show as a red x next
to the offending element.
Adapt Mondrian Schemas to Work with Pentaho Analyzer

A few Mondrian features are not yet functional in Pentaho Analyzer. You must adapt your schemas
to adjust for these limitations and enable some Analyzer functions to work properly.
Localization and Internationalization of Analysis Schemas
You can create internationalized message bundles for your analysis schemas and deploy them
with your Pentaho web applications.
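For instance, the %{foodmart.dimension.store.country.caption} style variables shown earlier resolve against such a message bundle. As a hedged sketch, an English bundle for those keys might contain lines like the following (the bundle's file name and deployment location are assumptions that depend on your Pentaho configuration):

foodmart.dimension.store.country.caption=Country
foodmart.dimension.store.country.description=The country in which the store is located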
In summary, Schema Workbench provides the following functionality:
• A schema editor integrated with the underlying data source for validation
• Testing of MDX queries against the schema and database
• Browsing of the underlying database structure

PSW – Example

Pentaho OLAP Analysis


An OLAP analysis allows us to:
• Study a whole bulk of data at once
• Observe data from different points of view
• Support decision-making processes
The most common functions are Slicing, Dicing, Drill-down, Drill-across, and Drill-through; a slicing example in MDX follows below.
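In MDX terms, slicing corresponds to a WHERE clause that fixes one dimension to a single member. A hedged sketch against the hypothetical Sales cube used earlier (the Time dimension and its 2023 member are assumptions, not defined above):

select {[Measures].[Unit Sales]} on columns, {[Store].[Country].Members} on rows from [Sales] where ([Time].[2023])

Dicing would restrict several dimensions at once to select a sub-cube, and drill-down would move from a level such as Country to a finer one such as City.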

Pentaho Reporting
Pentaho Reporting is a suite of tools for creating pixel perfect reports. With Pentaho Reporting
you are able to transform data into meaningful information tailored to your audience. You can
create HTML, Excel, PDF, Text or printed reports. If you are a developer, you can also produce
CSV and XML reports to feed other systems.
Pentaho Reporting's development is driven by the goal to create a flexible yet simple to use
reporting engine. Using the reporting engine gives you unmatched flexibility to create reports
that adapt to your data as almost every property can be computed during report generation. Your
reports can include data from virtually any data-source due to Pentaho Reporting's large
selection of data-sources, including SQL-databases, OLAP data sources and even the Pentaho
Data-Integration ETL tool.
With high performance and low memory consumption, report processing can scale from small-footprint embedded scenarios to large-scale enterprise reporting scenarios. Pentaho Reporting integrates with the Pentaho BI Server so you can share your reports with your co-workers and peers.

Pentaho Report Designer


The Pentaho Report Designer is the primary design tool for creating report definitions. Its user interface lets you define the report's data flow and visual appearance, so you can create pixel-perfect reports quickly and efficiently.
The Pentaho Report Designer offers complete access to all settings and configuration options of the Pentaho Reporting Engine. Its many options and great flexibility can be overwhelming for novice users; it is aimed at technically skilled power users.
You can use the Pentaho Report Designer as a desktop reporting tool by running your reports locally. It can also publish your finished reports to a Pentaho BI Server so that they are available to others.

Pentaho Reporting Engine


The Pentaho Reporting Engine is the underlying technology that creates the reports from your
report definitions. The Pentaho Reporting Engine is used both in the Pentaho Report Designer
and the Pentaho BI-Platform to drive the report generation.
If you require reporting or printing capabilities in your own application, you can easily embed
the reporting engine there.

Pentaho BI-Server
The Pentaho BI Server is a web application for sharing and managing reports. With the BI Platform you are able to make reports available to a wider audience. You can automatically send reports by e-mail to a list of recipients via a process called "bursting", and you can schedule large reports to run at night so that you get up-to-date information in the morning.

Pentaho Dashboards

Creating a dashboard in Dashboard Designer is as simple as choosing a layout template, a theme, and the content you want to display. In addition to displaying content generated from Interactive Reports and Analyzer, Dashboard Designer can also include the following content types.
• Charts: simple bar, line, area, pie, and dial charts created with Chart Designer
• Data Tables: tabular data
• URLs: Web sites that you want to display in a dashboard panel

Dashboard Designer has dynamic filter controls, which enable dashboard viewers to change a
dashboard's details by choosing different values from a drop-down list, and to control the content
in one dashboard panel by changing the options in another. This is known as content linking.
Get Started with the Dashboard Designer
You can view the editable version of the Sales Performance (Dashboard) in Dashboard Designer
by clicking Browse Files on the User Console Home page. Follow these quick steps.
1. In the Folders pane, click to expand the Public folder, then click to highlight the Steel
Wheels folder.
2. In the center pane, double-click on Sales Performance (Dashboard).
3. After the dashboard opens, click Edit in File Actions.

Create a Dashboard
You must be logged into the User Console. Use these steps to create a new dashboard.
1. From the User Console Home page, click Create New, then select Dashboard.
2. On the bottom of the page, click the Properties tab, and enter a title for your dashboard
page in the Page Title text box. The name you entered appears on the top left corner of the
dashboard. This name helps you identify the page if you want to edit, copy, or delete it
later.
3. Click Templates to choose a dashboard layout. A blank dashboard with the layout you
selected appears.
4. Click Theme to choose a theme for your dashboard. The theme you selected is applied to
your dashboard.

You now have the basic framework for a Pentaho dashboard.

Add a Report from Analyzer

Use these steps to display an analysis report in a dashboard.


1. Select a panel in the Dashboard Designer.
2. Click Insert and choose File.
3. Locate the appropriate analysis report and click Select. The report appears inside the
dashboard panel.

Add a Report from Report Designer

Use these steps to add a report created with Report Designer.


1. Select a panel in the Dashboard Designer.
2. Click the Insert icon and choose File. A browser window opens.
3. Locate the appropriate report file.
4. Click Select to place the report inside the dashboard panel. Pagination control arrows at the top of a report allow you to scroll through long reports. Notice that the report file name, Inventory.prpt, appears under Content: in the dashboard edit pane in the sample below. This sample report contains parameters. You can enter values manually and link them to a dashboard filter in the text boxes under Source. When the report renders again, the parameter value(s) you entered are included in the report.

Add a Website
Use these steps to display contents of a website in a dashboard panel.
1. Select a panel in the Dashboard Designer.
2. Click (Insert) and choose URL. The Enter Web site dialog box appears.
3. Enter the website URL in the text box and click OK.
4. If applicable, click (Edit) to make changes.

Drag-and-Drop Content
Use these steps to add an existing chart, table, or file to your dashboard panels using the drag-and-
drop feature.
1. In the left pane of the Pentaho User Console, under Files, locate the content (chart, table,
or file) you want added to your dashboard.
2. Click and drag the content into a blank panel on your dashboard. You will see the "title" of
the content as you move it around the dashboard. Notice that the title background is red; it
turns green when you find a panel where the content can be dropped.
3. Repeat step 2 until your dashboard contains all the content you want to display. To swap content from one panel to another, click the title bar of the panel that contains the content you want moved and drag it over the panel you want swapped. You will see the swap icon as you move the content.

If you are working with an existing dashboard, you can perform the same steps; however, a warning message appears when you try to place content in a panel that already contains content. The new content will override the existing content.
Use Chart Designer

The Chart Designer allows you to create bar, pie, line, dial, and area charts that can be added to a
dashboard.
Adjust White Space in Dashboard Panels

Sometimes you must adjust the white space in dashboard panels (or the filter panel) so that content appears correctly. Use these steps to adjust white space.
1. In the lower pane, click General Settings and then click the Properties tab.
2. Click Resize Panels. The white space between the dashboard panels turns blue.
3. Adjust the panel size by clicking and holding the left mouse button down as you move the
blue lines (white space) around. Release the mouse button when you are satisfied with the
positioning of the panel.
4. Click Close in the lower-right corner of the dashboard to exit resize layout mode.
5. Examine the dashboard contents to make sure they are placed correctly. You can return to
the resize layout mode if you need to make additional changes.
Set the Refresh Interval

The content in your dashboard may need to be refreshed periodically if users are examining
real-time data. You can set refresh intervals for individual panels on your dashboard or for the
entire dashboard.

To set the refresh interval for individual panels in the dashboard, click the edit button, then choose the panel that contains the content you want refreshed in the Objects panel. Under Refresh Interval (sec), enter the interval time in seconds and click Apply.

If you want the entire dashboard to refresh, click the Prompts tab in the dashboard and set your
refresh interval.
Save a Dashboard

You must be in Edit mode to save a dashboard.

1. Click the Save As button, which is a floppy disk and pencil button, to open the Save
As dialog box.
2. In the File Name text box, type a file name for your dashboard.
3. Enter the path to the location where you want to save the dashboard. Alternatively, use the
up/down arrows or click Browse to locate the solution (content files) directory in which
you will save your dashboard.
4. Click Save. The dashboard is saved with the specified name.

Result :
_____________________________________________________________________________

Conclusion:
_____________________________________________________________________________

Industrial Applications
Creating and managing databases, data warehouses, and data mining applications.

References:

1. https://www.hitachivantara.com/en-in/products/big-data-integration-analytics/pentaho-trial-download.html

2. https://www.tutorialspoint.com/pentaho/index.htm

3. https://intellipaat.com/tutorial/pentaho-tutorial/introduction-to-pentaho/

4. WEKA, RapidMiner, and Pentaho resources from the Web.

Questionnaire:

1. What does Pentaho BI suite include?

_____________________________________________________________________________

2. What are the different BI tools used?

_____________________________________________________________________________

3. State True or False. Pentaho is an open source BI tool.

_____________________________________________________________________________

4. What is the use of Pentaho Report Designer?

_____________________________________________________________________________

5. What are the uses of Pentaho Data Integration?

_____________________________________________________________________________

6. What are the uses of Pentaho dashboard?

_____________________________________________________________________________

7. What are the functions of OLAP analysis?

_____________________________________________________________________________

8. What are the different features of Pentaho data mining?

_____________________________________________________________________________

9. How can you share and manage the reports?

_____________________________________________________________________________

10. How Pentaho is different from other BI tools?

_____________________________________________________________________________

Experiment No: 08
Business Intelligence Mini Project
Each group is assigned one new case study. A BI report must be prepared outlining the following steps:

a) Problem definition, identifying which data mining task is needed

b) Identify and use a standard data mining dataset available for the problem. Sources of data mining datasets include the WEKA site, the UCI Machine Learning Repository, the KDD site, the KDD Cup, etc.

c) Implement the data mining algorithm of choice

d) Interpret and visualize the results

e) Provide clearly the BI decision that is to be taken as a result of mining.

Sample Problem Statements

1. Hotel Recommendation System Based on Hybrid Recommendation Model


We present a machine learning and SentiWordNet based method for opinion mining from hotel reviews, and a sentence relevance score based method for summarizing those reviews. The classified and summarized review information helps web users understand review contents easily and quickly. The Opinion Mining for Hotel Reviews system detects hidden sentiments in customer feedback and rates the feedback accordingly, using opinion-mining methodology to achieve the desired functionality. It is a web application that evaluates the feedback posted by various users: based on the opinions expressed, the system specifies whether a given hotel is good, bad, or worst. Based on a user's hotel searches, recommendations are shown according to how many times the user visited a particular hotel page. We use a database of sentiment keywords, each with a positivity or negativity weight, and each review is ranked based on the sentiment keywords mined from it. Once users log in to the system, they can view hotels and post reviews; the system matches each review against the keywords in the database, ranks the review accordingly, and rates the hotel based on the rank of its reviews. The role of the admin is to post new hotels and add keywords to the database. This application is useful for those who are exploring new places and for those who travel often: a user gets to know which hotel is best and most suitable for them, and can decide which hotel to stay in before reaching the place.

2. Heart Disease Prediction Project
It may have happened many times that you or someone close to you needs a doctor's help immediately, but no doctor is available for some reason. The Heart Disease Prediction application is an end-user support and online consultation project. We propose a web application that allows users to get instant guidance on their heart disease through an intelligent online system. The application is fed with various details and the heart diseases associated with those details. It allows users to share their heart-related issues and then processes the user-specific details to check for the various illnesses that could be associated with them. Here we use intelligent data mining techniques to determine the illness most likely to be associated with the patient's details. Based on the result, the user can contact a doctor for further treatment. The system also allows users to view doctors' details, and it can be used for free online heart disease consultation.

3. Social Media Community Using Optimized Clustering Algorithm


Nowadays social media is used to introduce new issues and host discussions, and many users participate in these discussions. Different users belong to different kinds of groups, and they post positive and negative comments as they take part in a discussion. We propose a system to group different kinds of users and to specify which category they belong to, for example the film industry or politics. Once the social media data, such as user messages, are parsed and network relationships are identified, data mining techniques can be applied to group different types of communities. We use the K-Means clustering algorithm to cluster the data: the system detects communities by clustering messages from large streams of social data. The proposed approach gives better clustering results and provides a novel use case of grouping user communities based on their activities. The application identifies the group of people who viewed a post and commented on it, which helps to categorize the users.

Advantages

• Helps to categorize groups of people
• Helps to identify the group of people who participated in a discussion
• Helps to reach a targeted crowd
• Uses an effective algorithm that provides accurate results
