Batch B DWM Experiments

The document describes an experiment to create a data warehouse for a library management system. It involves building a star schema or snowflake schema with dimensions for books, members, publishers, and staff. It also involves creating fact tables to track transactions like book loans and returns. The data warehouse would help users efficiently find books and track how the library's contents change over time as more books are added. It justifies the need for slowly changing dimensions to track book details and member information over time. Factless fact tables are not required since the dimensions all contain measures. Large dimensions would be broken into mini-dimensions. Surrogate keys and junk dimensions may also be used to track book replacement transactions without increasing table sizes.

Experiment 01 - Create a data warehouse

Aim-
Build a Data Warehouse/Data Mart for a given problem statement (one case study). Write a detailed problem statement and design the dimensional model (creation of star and snowflake schemas).

Part 1-

Problem statement - Library users require an efficient method to find a specific book or keyword(s) within a book in a continuously expanding library. Efficiency requires that processing time stay roughly constant even as the library's contents grow.

Star schema
Snowflake schema
Part 2
Postlab questions -

1. Justify whether slowly changing dimension modelling is required for the problem selected.

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse.
Slowly changing dimension modelling is required for library management because it is important to track the issue and return dates of every book borrowed by a member, and to keep a record of both the old books and the newly added books in the library.
2. Justify why factless fact table modelling is not required for the problem selected. Give examples of factless fact tables with their types.

A factless fact table is essentially an intersection of dimensions: it contains nothing but dimensional keys.
Every dimension table of the library has measures available, so factless fact tables are not required for this dimensional model.
There are two types of factless fact tables:
1. Event tracking tables - e.g., average attendance for a given course
2. Coverage tables - e.g., tracking students' attendance

3. Justify your approach to deal with large dimension tables for the problem selected.

To handle large dimension tables, mini-dimension tables can be carved out of a large dimension table according to the attributes of interest and represented in a star or snowflake schema. In the library management model, the Book, Staff, Publisher and Member tables act as the mini-dimension tables.

4. Justify the use of junk dimension and surrogate key for the problem selected.

For the library management model it is important to record all transactions for book replacement or the ordering of new books, so surrogate keys can be used in place of the natural primary keys. The low-cardinality flags describing such transactions can be combined into a junk dimension whose key is stored in the fact table, so these details are tracked without increasing the size of the dimension tables.

Conclusion -

We designed the warehouse structure by creating a star/snowflake schema, imported the details from a CSV file into a single table, and created the dimension and fact tables by selecting specific attributes from the imported table contents.
Experiment 02 - OLAP Operations

Aim-
Perform OLAP operations on a dataset using MS Excel, PostgreSQL & Python.

Screenshots:
Part 1) Using Excel - Tool: MS Excel with pivot table
Do all OLAP operations in separate sheets (cube, roll up, drill down, slice, dice)
Cube:

RollUp
Drill Down:

Slice:

Dice:
2. Need to see monthly, quarterly, yearly profit, sales of each category
region wise.

3. Comparison of sales and profit in various years.


4. Comparison of sales in various months for product name =x.

5. Need to know which product has more demand on which location?

6. Need to study trend of sales by time period of the day over the month,
and year?
7. What is sale of product category wise, city wise for year=2015

8. What is the sales of product categories in the Jan month of each year?
9. What is trend of sales on year 2014 and 2015 for product category
furniture and technology?

10. Need to compare weekly, monthly and yearly sales to know growth.
Part 2 - Perform OLAP operations using Python pivot table

Cube:
Rollup:
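The pivot-table outputs above are screenshots; a minimal pandas sketch of the same cube- and roll-up-style aggregations (assuming a hypothetical Superstore-style CSV with Order Date, Region, Category, Sales and Profit columns) might look like this:

import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset used in the experiment.
df = pd.read_csv("superstore.csv", parse_dates=["Order Date"])
df["Year"] = df["Order Date"].dt.year
df["Quarter"] = df["Order Date"].dt.quarter
df["Month"] = df["Order Date"].dt.month

# Cube-style view: Sales and Profit by Category and Region, with grand totals (margins).
cube = pd.pivot_table(df, values=["Sales", "Profit"],
                      index="Category", columns="Region",
                      aggfunc="sum", margins=True)
print(cube)

# Roll-up-style view: aggregate along the time hierarchy Month -> Quarter -> Year.
monthly = df.groupby(["Year", "Quarter", "Month"])[["Sales", "Profit"]].sum()
print(monthly)
print(monthly.groupby(level=["Year", "Quarter"]).sum())  # roll up to quarters
print(monthly.groupby(level="Year").sum())               # roll up to years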

2. Need to see monthly, quarterly, yearly profit and sales of each category, region wise.

Yearly:

Monthly:

Quarterly:

3. Comparison of sales and profit in various years.


4. Comparison of sales in various months for product name =x.

5. Need to know which product has more demand on which location?


6. Need to study trend of sales by time period of the day over the month,
and year?
Yes

7. What is sale of product category wise, city wise for year=2015

8. What is the sales of product categories in the Jan month of each year?

1. Describe all dimension tables (DT) and count the number of records.


Dim_cust

Dim_prod
Dim_time
2. Describe the fact table (FT) and count the number of records

3. Create a basic cube by considering all of the following queries and the cube command:

a. Cube with the following attributes (month, year, product category, city, sum-of-sales) - UNION ALL of the 16 group-by combinations

b. Cube with the following attributes (month, year, product category, city, sum-of-sales) - GROUPING SETS / ROLLUP / CUBE (PostgreSQL)

c. Cube with the following attributes (month, year, product category, city, sum-of-sales) - CUBE command (PostgreSQL) / MySQL WITH ROLLUP and ROLLUP HAVING GROUPING()

4. Describe cube and select all data


5. Give total sales for (all quarters (q1, q2, q3, q4), all product categories (c1, c2, c3, c4), all store states (s1, s2, s3, s4)) for year 2014

6. Write appropriate SQL queries for slice operation


Give the total sales of the east region for all product categories, for all quarters for the year 2016

Selection conditions on some attributes using <WHERE clause> <Group by> and aggregation
on some attribute

7. Write appropriate SQL queries for dice operation


Give the total sales of east region and west region for product
categories technology and furniture, for the quarters q1 and q2
for the year 2015

Selection conditions on some attributes using <WHERE clause> Group by and aggregation on
some attribute
8. Write appropriate SQL queries for the drill-down operation
Give the total sales of the east region for all product categories, for all months for the year 2016.
(SQL has no ROLLDOWN keyword; drill-down is achieved by grouping on a finer-grained column, e.g. SELECT ... GROUP BY month instead of GROUP BY quarter.)

9. Write appropriate SQL queries for the roll-up operation

Give the total sales of all regions, for all product categories, for all years 2014, 2015, 2016 and 2017.
SELECT … GROUP BY ROLLUP (grouping_column_reference_list);
10. Calculate the total number of cuboids for the problem.
d1 - Product (p-name, sub-cat, cat): 3 levels + 1 = 4
d2 - Time (year, quarter, month): 3 levels + 1 = 4
d3 - Location (city, state, country, region): 4 levels + 1 = 5
With n = 3 dimensions, total cuboids = 4 * 4 * 5 = 80

PostLab:
1)In data warehouse technology, a multiple dimensional view can be
implemented by a relational database technique (ROLAP), by a
multidimensional database technique (MOLAP), or by a hybrid
database technique (HOLAP).
(a) Briefly describe each implementation technique.

i. The generation of a data warehouse (including aggregation).

Initial aggregation may be accomplished in SQL via group-bys. The compute cube operator computes aggregates over all subsets of the dimensions in the specified operation; this leads to the generation of a single cube. ROLAP relies on tuples and relational tables as its basic data structures. The base fact table (a relational table) stores data at the abstraction level indicated by the join keys in the schema for the given data cube. Aggregated data can also be stored in fact tables (summary fact tables). ROLAP uses value-based addressing, where dimension values are accessed via key-based search strategies. To optimise ROLAP cube computation we may use the following techniques:
- sorting, hashing and grouping operations
- grouping performed on sub-aggregates
- aggregates derived from previously computed aggregates
ii. Roll-up: Aggregation on a data cube (i.e. dimension reduction). In ROLAP, this means that the relational tables are aggregated from more specific to less specific.

iii. Drill-down:

The opposite of roll-up. We introduce additional dimensions into the relational tables and, hence, the cubes.
iv. Incremental updating:

Data warehouse implementation can be broken down into segments or increments. An increment is a defined data warehouse implementation project that has a specified beginning and end. An increment may also be referred to as a departmental data warehouse within the context of an enterprise.

A ROLAP server would require the use of appropriate tools, such as those made by Informix. Since ROLAP is based on relational databases, updating would be performed in a manner similar to a traditional RDBMS and then grafted onto the data cube(s). Some of the relevant techniques are summarised below:

Incremental data capture is a time-dependent model for capturing changes to operational systems. This technique is best applied in circumstances where changes in the data are significantly smaller than the size of the data set for a specific period of time (i.e., the time between captures). These techniques are more complex than static capture, because they are closely tied to the DBMS or the operational software which updates the DBMS. Three different techniques in this category are application-assisted capture, trigger-based capture and transaction log capture. In circumstances where DBMSs are used to store operational data, transaction log capture is the most powerful and provides the most efficient approach to incremental capture. Some of the incremental techniques used are listed in Figure 1 below.

FIGURE 1: Incremental Update Techniques


MOLAP (Multidimensional OLAP servers): These servers allow for multidimensional views of
data through “array-based multidimensional engines.” They can map multidimensional views
onto data cube arrays. The advantage to this is quicker indexing to pre-computed
summarised data. MOLAPs may have a two-level storage system in order to handle sparse
and dense data – dense “subcubes” are identified/stored as array structures; sparse
“subcubes” use compression to make storage more efficient.

(b) For each technique, explain how each of the following functions may
be implemented:

i. The generation of a data warehouse (including aggregation)


The generation would consist of a combined approach of both MOLAP and ROLAP. A HOLAP server would store large volumes of detail data in a relational database (see the ROLAP generation methods above), while aggregations would be stored separately in a MOLAP store (see the MOLAP methods above).
ii. Roll-up
The Roll-up method for MOLAPs would be somewhat similar to the process described above
for ROLAP. Except, now, we are rolling up chunks that make up the subcubes…and rolling
up the subcubes that make up the array.
iii. Drill-down
The opposite of roll-up. We introduce additional dimensions into the subcubes or array.
iv. Incremental updating
The technique for MOLAP updates would be more sophisticated due to the additional
complexity of arrays and subcubes. Selecting the right tools would seem to be the key. An
example would be Microsoft Data Warehousing Framework and OLAP Manager which is
included with SQL Server OLAP Services.
(c) Which implementation technique do you prefer, and why?
The hybrid (HOLAP) approach seems to be the best solution for most applications. It is backward compatible with existing ROLAP implementations and retains their scalability, while at the same time incorporating the more sophisticated features and faster computation of MOLAP.

2) Explain the difference between the ROLLUP and CUBE functions in PostgreSQL with an example.

Ans. The ROLLUP operator generates aggregated results for the selected columns in a hierarchical way. On the other hand, CUBE generates an aggregated result that contains all the possible combinations of the selected columns.

Example:
A table product has the following records:

Apparel | Brand | Quantity
Shirt | Gucci | 124
Jeans | Lee | 223
Shirt | Gucci | 101
Jeans | Lee | 210

CUBE can be used to return a result set that contains the Quantity subtotal for all possible combinations of Apparel and Brand (note that in PostgreSQL the syntax is GROUP BY CUBE (...), not the SQL Server style WITH CUBE):
SELECT Apparel, Brand, SUM(Quantity) AS QtySum
FROM product
GROUP BY CUBE (Apparel, Brand);
The query above will return:

Apparel | Brand | QtySum
Shirt | Gucci | 225
Shirt | (null) | 225
Jeans | Lee | 433
Jeans | (null) | 433
(null) | Gucci | 225
(null) | Lee | 433
(null) | (null) | 658

ROLLUP: Calculates multiple levels of subtotals for a group of columns.

Example:
SELECT Apparel, Brand, SUM(Quantity) FROM product GROUP BY ROLLUP (Apparel, Brand);
This query returns the quantity subtotal for each (Apparel, Brand) pair, a subtotal per Apparel, and a grand total; unlike CUBE, it does not produce subtotals per Brand alone.

Conclusion: OLAP is a software technology that allows users to analyse information from multiple database systems at the same time. It is based on a multidimensional data model and allows the user to query multi-dimensional data. In the drill-down operation, less detailed data is converted into more highly detailed data. Roll-up is the opposite of the drill-down operation: it performs aggregation on the OLAP cube, either by climbing up the concept hierarchy or by reducing the number of dimensions. Dice selects a sub-cube from the OLAP cube by selecting two or more dimensions with given criteria. Slice selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube.
Experiment 03 - Data Exploration In Python
Aim: To perform various data exploration operations and data cleaning using python

Steps
1) Load the libraries.
2) Download the dataset from Kaggle or other sources.
3) Read the file - select the appropriate file-read function according to the data type of the file (refer link 1).
4) Display attributes in the dataset - 10 samples.
5) Describe the attributes: name, count of values, and find min, max, data type, range, quartiles, percentiles, box plot and outliers.
6) Give a visualisation of the statistical description of the data in the form of a histogram, scatter plot and pie chart.
7) Give the correlation matrix.
8) Identify missing values and outliers and fill them with the average.

Code:
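Since the experiment's code appears only as screenshots, here is a minimal sketch of the listed steps using pandas and matplotlib (the file name dataset.csv and its columns are placeholders, not the actual dataset):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; any Kaggle CSV with numeric columns will do.
df = pd.read_csv("dataset.csv")

# Display 10 samples and a basic description (count, min, max, quartiles, dtypes).
print(df.head(10))
print(df.dtypes)
print(df.describe())

# Box plot to inspect quartiles, percentiles and outliers of the numeric attributes.
num_cols = df.select_dtypes("number").columns
df[num_cols].boxplot()
plt.show()

# Histograms, plus a scatter plot of the first two numeric attributes.
df[num_cols].hist()
plt.show()
plt.scatter(df[num_cols[0]], df[num_cols[1]], alpha=0.5)
plt.xlabel(num_cols[0])
plt.ylabel(num_cols[1])
plt.show()

# Correlation matrix of the numeric attributes.
print(df[num_cols].corr())

# Fill missing values with the column average and replace outliers
# (beyond 1.5 * IQR) with the column average as well.
for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    df.loc[outliers, col] = df[col].mean()
print(df.isnull().sum())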
Experiment 04 - Apriori Algorithm
Aim: Implementation of Association Rule Mining algorithm (Apriori in
java/python).

Part1 : WAP in Python/java to implement Apriori

CODE:
from itertools import combinations

# 'pl' is assumed to hold the frequent itemsets (as frozensets) found earlier, and
# 'data' the list of transactions as (id, items) pairs; both are built in the
# preceding part of the program (not shown here).
for l in pl:
    # All (len(l)-1)-sized subsets of the frequent itemset l are candidate antecedents.
    c = [frozenset(q) for q in combinations(l, len(l) - 1)]
    mmax = 0
    for a in c:
        b = l - a          # consequent
        ab = l             # the full itemset
        sab = 0
        sa = 0
        sb = 0
        # Count the supports of a, b and the full itemset over all transactions.
        for q in data:
            temp = set(q[1])
            if a.issubset(temp):
                sa += 1
            if b.issubset(temp):
                sb += 1
            if ab.issubset(temp):
                sab += 1
        # Track the maximum confidence over both rule directions.
        temp = sab / sa * 100
        if temp > mmax:
            mmax = temp
        temp = sab / sb * 100
        if temp > mmax:
            mmax = temp
        print(str(list(a)) + " -> " + str(list(b)) + " = " + str(sab / sa * 100) + "%")
        print(str(list(b)) + " -> " + str(list(a)) + " = " + str(sab / sb * 100) + "%")
    # Report which of the printed rules achieve the maximum confidence.
    curr = 1
    print("choosing:", end=' ')
    for a in c:
        b = l - a
        ab = l
        sab = 0
        sa = 0
        sb = 0
        for q in data:
            temp = set(q[1])
            if a.issubset(temp):
                sa += 1
            if b.issubset(temp):
                sb += 1
            if ab.issubset(temp):
                sab += 1
        temp = sab / sa * 100
        if temp == mmax:
            print(curr, end=' ')
        curr += 1
        temp = sab / sb * 100
        if temp == mmax:
            print(curr, end=' ')
        curr += 1
    print()
    print()
OUTPUT:

Part2: using Std libraries


CODE & OUTPUT:
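The code and output for Part 2 are shown as screenshots; a minimal sketch with the mlxtend functions listed in post-lab question 3 (reusing the transactions T1-T6 from post-lab question 2, with support 50% and confidence 60%) might look like this:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Transactions T1-T6 from the post-lab question, reused here as sample data.
transactions = [['I1', 'I2', 'I3'],
                ['I2', 'I3', 'I4'],
                ['I4', 'I5'],
                ['I1', 'I2', 'I4'],
                ['I1', 'I2', 'I3', 'I5'],
                ['I1', 'I2', 'I3', 'I4']]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 50%, then rules with confidence >= 60%.
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)

print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])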
Post lab:

1) Give limitations of Apriori algorithms

Ans:

The Apriori algorithm can be slow. Its main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, a low minimum support or large itemsets; i.e. it is not an efficient approach for very large datasets. Overall performance is also reduced because the database is scanned multiple times. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7 candidate 2-itemsets, which must then be tested and accumulated. Furthermore, to detect a frequent pattern of size 100, e.g. {v1, v2, ..., v100}, it has to generate 2^100 candidate itemsets, which makes candidate generation costly and time consuming. The algorithm checks many candidate itemsets and scans the database repeatedly to find them, so Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
2) Support threshold=50%, Confidence= 60%

Solution: Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3

1. Considering the root node null.


2. The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1},
where I2 is linked as a child to root, I1 is linked to I2 and I3 is linked to I1.
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to I2 and I4
is linked to I3. But this branch would share I2 node as common as it is already used
in T1.
4. Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as a child
to I3. The count is {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node,
hence it will be incremented by 1. Similarly I1 will be incremented by 1 as it is already
linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7. T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2},
{I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
1. The lowest node item I5 is not considered as it does not have a min support count,
hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}.
Therefore, considering I4 as the suffix, the prefix paths will be {I2,I1,I3:1} and {I2,I3:1}. This
forms the conditional pattern base.
3. The conditional pattern base is considered a transaction database, an FP-tree is
constructed. This will contain {I2:2, I3:2}, I1 is not considered as it does not meet the
min support count.
4. This path will generate all combinations of frequent patterns :
{I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this will generate a 2-node FP-tree
{I2:4, I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path would be: {I2:4} this will generate a single node FP-tree: {I2:4}
and frequent patterns are generated: {I2, I1:4}.

3) List the name of packages/libraries used in Part 2 of this experiments

Ans:

● numpy
● pandas
● mlxtend.frequent_patterns:
1. apriori
2. association_rules
● apyori.apriori
Experiment 05 - KMeans Clustering
Code :
Part -1
# coding: utf-8

import matplotlib.pyplot as plt


from matplotlib import style
import numpy as np
k=3

X=[[2,10],[2,5],[8,4],[5,8],[7,5],[6,4],[1,2],[4,9]]
Y=np.array(X)

style.use('fivethirtyeight')
plt.scatter(Y[:, 0], Y[:, 1], alpha=0.3)
plt.show()

def recalculate_clusters(X, centroids, k):
    """ Recalculates the clusters """
    # Initiate empty clusters
    clusters = {}
    # Set the range for value of k (number of centroids)
    for i in range(k):
        clusters[i] = []
    # Assign each data point to the cluster of its nearest centroid
    for data in X:
        # Set up list of euclidian distances and iterate through
        euc_dist = []
        for j in range(k):
            euc_dist.append(np.linalg.norm(data - centroids[j]))
        # Append the data point to the cluster with the smallest distance
        clusters[euc_dist.index(min(euc_dist))].append(data)
    return clusters

def recalculate_centroids(centroids, clusters, k):
    """ Recalculates the centroid position based on the plot """
    for i in range(k):
        # Finds the average of the cluster at given index
        centroids[i] = np.average(clusters[i], axis=0)
    return centroids

def plot_clusters(centroids, clusters, k):
    """ Plots the clusters with centroid and specified graph attributes """
    colors = ['red', 'blue', 'green', 'orange', 'blue', 'gray', 'yellow', 'purple']
    plt.figure(figsize=(6, 4))
    for i in range(k):
        for cluster in clusters[i]:
            plt.scatter(cluster[0], cluster[1], c=colors[i % k], alpha=0.6)
        plt.scatter(centroids[i][0], centroids[i][1], c='black', s=200)

def k_means_clustering(X, centroids={}, k=3, repeats=10):
    """ Calculates full k_means_clustering algorithm """
    for i in range(k):
        # Sets up the centroids based on the data
        centroids[i] = X[i]

    # Outputs the recalculated clusters and centroids
    print(f'First and last of {repeats} iterations')
    for i in range(repeats):
        clusters = recalculate_clusters(X, centroids, k)
        centroids = recalculate_centroids(centroids, clusters, k)

        # Plot the first and last iteration of k_means given the repeats specified
        # Default is 10, so this would output the 1st iteration and the 10th
        if i == range(repeats)[-1] or i == range(repeats)[0]:
            plot_clusters(centroids, clusters, k)

k_means_clustering(Y)
Part 2
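Part 2's code is shown as a screenshot; a minimal scikit-learn sketch on the same eight points (assuming Part 2 also uses k = 3) might look like this:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])

# Fit k-means with k = 3 and report each point's cluster and the final centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Labels:", kmeans.labels_)
print("Centroids:\n", kmeans.cluster_centers_)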
Post lab:
1. Differentiate K-means and K-medoids algorithms with one example

● The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoid shift algorithm.
● Both the k-means and k-medoids algorithms are partitional (they break the dataset up into groups).
● K-means attempts to minimise the total squared error, while k-medoids minimises the sum of dissimilarities between points labelled to be in a cluster and a point designated as the centre of that cluster.
● In contrast to the k-means algorithm, k-medoids chooses data points as centres (medoids or exemplars).
● K-medoids is also a partitioning technique of clustering that clusters the data set of n objects into k clusters, with k known a priori.
● A useful tool for determining k is the silhouette coefficient.
● K-medoids can be more robust to noise and outliers than k-means, because it minimises a sum of general pairwise dissimilarities instead of a sum of squared Euclidean distances.
● The possible choice of the dissimilarity function is very rich, but we normally stick to the Euclidean distance.
● A medoid of a finite dataset is a data point from this set whose average dissimilarity to all the data points is minimal, i.e. it is the most centrally located point in the set.
● K-means has a comparatively easier implementation than k-medoids.
● K-means is sensitive to outliers whereas k-medoids is not, which is a plus point for k-medoids.
● Example: for the one-dimensional cluster {1, 2, 3, 100}, k-means places the cluster centre at the mean 26.5, which is pulled towards the outlier 100, whereas k-medoids chooses an actual data point (2 or 3, both of which minimise the total distance) as the centre.

2. List the packages/ libraries of python used in this experiments


● MatplotLib
● Numpy
● Pandas
● Sklearn
Experiment 06 - Weka
A) Part 1

weather.numeric.arff: Visualize

Preprocess and select attributes

Discuss the new insights you found from visualizing the data, the techniques you tested, and
the results you obtained
Using filters we can change data from numeric to nominal types.

Part 2

Diabetes.arff (https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/diabetes.arff)

Find outliers using the filter (unsupervised-attribute-InterquartileRange)

Applying Interquartile Range Filter:


Results in formation of two additional attributes, Outlier and ExtremeValue

As we can see, the result is that there are 49 outliers and 0 extreme values in the given dataset. Saving this new file

Making required changes to filter

Applying filter

Result: Outliers are removed, the dataset doesn’t have outliers


B) Use the Apriori algorithm to generate association rules
The association rules are generated as follows
C) Perform classification and prediction using following Algorithms

1. Logistic Regression
2. Naive Bayes
3. Decision Tree
4. K-Nearest Neighbors
5. Support Vector Machines
D) Perform clustering simple K-means clustering using iris dataset-use Euclidian
distance.
Postlabs

1. Load the ‘weather.nominal.arff’ dataset into Weka and run Id3 classification
algorithm. Answer the following questions

· List the attributes of the given relation along with the type details

· Study the classifier output and answer the following questions

1. Paste the decision tree generated by the classifier

2. Compute the entropy values for each of the attributes


3. What is the relationship between the attribute entropy values and the
nodes of the decision tree?

Entropy is a measure of disorder or uncertainty, and the goal of machine learning models (and data scientists in general) is to reduce uncertainty. The formula for calculating entropy is:

E = - sum over classes i of ( pi * log2(pi) )

where pi is the probability of randomly selecting an example in class i.

In a decision tree, each internal node represents a "test" on an attribute. Information gain helps to determine the order of attributes in the nodes of the decision tree. It is represented by the formula:

Gain = E(parent) - average E(children)

Here Gain represents the information gain, E(parent) is the entropy of the parent node and E(children) is the (weighted) average entropy of the child nodes. Attributes whose splits give the largest information gain (i.e. the largest drop in entropy) are placed nearest the root of the tree.

2. Draw the confusion matrix. What information does the confusion matrix provide?

A confusion matrix is a table that is used to describe the performance of a classification algorithm; it visualises and summarises the classifier's predictions against the actual classes. The classifier output gives 8 true positives, 1 false positive, 1 false negative and 4 true negatives, i.e.:

               predicted yes | predicted no
actual yes     8 (TP)        | 1 (FN)
actual no      1 (FP)        | 4 (TN)

3. Describe the Kappa statistic?

The Kappa statistic (or value) is a metric that compares an Observed Accuracy
with an Expected Accuracy (random chance). The kappa statistic is used not
only to evaluate a single classifier, but also to evaluate classifiers amongst
themselves.

Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy)

The kappa statistic for one model is directly comparable to the kappa statistic
for any other model used for the same classification task.

4. Describe the following quantities and write their formulas:

· TP Rate: the rate of true positives (instances correctly classified as a given class); TP Rate = TP / (TP + FN)

· FP Rate: the rate of false positives (instances falsely classified as a given class); FP Rate = FP / (FP + TN)

· Precision: the proportion of instances that are truly of a class divided by the total instances classified as that class; Precision = TP / (TP + FP)

· Recall: the proportion of instances classified as a given class divided by the actual total in that class; Recall = TP / (TP + FN)

2. Load the ‘weather.arff’ dataset in Weka and run the Id3 classification
algorithm. What problem do you have and what is the solution?

The weather.arff dataset consists of numeric as well as nominal attributes but the
ID3 classification algorithm works only on nominal attributes. To solve this
problem, we convert the numeric attributes to nominal by using the filter in the
preprocessing tab as weka.filters.unsupervised.attribute.NumericToNominal.

3. Load the ‘weather.arff’ dataset in Weka and run the OneR rule generation
algorithm. Write the rules that were generated.
4. Load the ‘weather.arff’ dataset in Weka and run the PRISM rule generation
algorithm. Write down the rules that are generated.

5. Load the glass.arff dataset and perform the following tasks?

● How will you determine the color assigned to each class?

On going to the visualize tab, the class color section tells us which color has
been assigned to which class

● By examining the histogram, how will you determine which attributes should
be the most important in classifying the types of glass?

From the histograms we can see that most of the attributes are numeric and their class distributions overlap heavily. We choose the attributes with the least overlap between classes as the most useful for classifying the types of glass.
6. Perform the following classification tasks:

● Run the IBk classifier for various values of K.

● What is the accuracy of this classifier for each value of K?

For k=1:

For k=2:

For k=20:
● What type of classifier is the IBk classifier?

The IBk classifier is a lazy learning (instance-based, k-nearest-neighbour) algorithm. A lazy learner first stores the training dataset and waits until it receives the test dataset; classification is then done on the basis of the most closely related data stored in the training dataset. It takes less time in training but more time for predictions.

7. Perform the following classification tasks:

● Run the J48 classifier

● What is the accuracy of this classifier?


● What type of classifier is the J48 classifier?

The J48 classifier is a decision tree classifier. Decision Tree is a Supervised


Machine Learning Algorithm that uses a set of rules to make decisions,
similarly to how humans make decisions.

8. Compare the results of the IBk (k-nearest neighbours classifier; an appropriate value of K can be selected based on cross-validation, and distance weighting can also be used) and the J48 classifiers. Which is better?

On comparing the results of the IBk and J48 classifiers, we see that the number of correctly classified instances is higher for J48 than for IBk.

9. Run the J48 and the NaiveBayes classifiers on the following datasets and determine the
accuracy:
1. vehicle.arff ,
J48

Correctly classified instances in percentage = 72.45%


Naive Bayes

Correctly classified instances in percentage = 44.78%

2. Kr-vs-kp.arff
J48

Correctly classified instances: 99.43%


Naive Bayes

Correctly classified instances: 99.43%

3. Glass.arff
J48
Correctly classified instances: 96.26%

NaiveBayes

Correctly classified instances: 87.89%

4. Wave-form-5000.arff
J48

Correctly classified instances: 75.08%

NaiveBayes
Correctly classified instances: 80%

Q. On which datasets does the NaiveBayes perform better?

Ans. Based on the accuracies above, NaiveBayes performs better than J48 on Wave-form-5000.arff (80% vs 75.08%); on Kr-vs-kp.arff the two classifiers give the same accuracy (99.43%).

10. Perform the following tasks

● Use the results of the J48 classifier to determine the most important attributes
● Remove the least important attributes

● Run the J48 and IBk classifiers and determine the effect of this change on the accuracy of these classifiers. What will you conclude from the results?

Before:

J48

IBk

After: J48
IBk:

As we can see, there is an improvement in the IBk classification when we remove the least important attributes, whereas there is a slight deterioration in the J48 classification. Thus, IBk works better when only a select number of attributes is used.

Prediction:

Preprocessing diabetes.arff

Classification - Logistic Regression


Save Model at any location

Open Weka and load any old dataset, then go to Classify, right-click in the result list and select Load model. From here, load the previously saved model.

After Loading Model


Opening the dataset in the editor

Open the copy of diabetes.arff by selecting ‘Supplied test set’ option

Do the following changes in ‘more options’

Then click on ‘Re-evaluate model on current test set’ (which appears when you right-click in the result list)

Output shown in ‘Classifier Output’


Experiment 07 - Hierarchical Clustering
Code:
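The code for this experiment is shown as a screenshot; a minimal sketch of the agglomerative clustering described in the conclusion (assuming the commonly used Mall_Customers.csv file with Annual Income (k$) and Spending Score (1-100) columns) might look like this:

import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

# Hypothetical dataset and column names; adjust to the actual customer file.
df = pd.read_csv("Mall_Customers.csv")
X = df[["Annual Income (k$)", "Spending Score (1-100)"]].values

# Dendrogram (Ward linkage) to help choose the number of clusters.
sch.dendrogram(sch.linkage(X, method="ward"))
plt.title("Dendrogram")
plt.show()

# Agglomerative clustering into 5 clusters and a scatter plot of the result.
labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.show()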
Output:

Conclusion:
Thus, by using hierarchical clustering we have been able to divide customers into clusters according to their spending tendency, on the basis of their Annual Income and Spending Score.

—This experiment does not include postlabs—


Experiment 08- Naive Bayes algorithm
Aim: Implementation of Bayesian algorithm

Bayes' theorem is stated mathematically as:


P(A|B) = ( P(B|A) * P(A) )/ P(B)
where A and B are events and P(B) != 0.
P(A|B) is a conditional probability: the probability of event A occurring given
that B is true.
P(B|A) is also a conditional probability: the probability of event B occurring
given that A is true.
P(A) and P(B) are the probabilities of observing A and B respectively without
any given conditions.
A and B must be different events.
# 'dataset' is assumed to be a list of rows whose last element is the class label
# (1 = yes, 0 = no), and 'mp' a dict mapping each class label to its rows; both are
# built earlier in the program (not shown here).
test = [0, 1, 0, 0]   # the record to classify

probYes = 1
count = 0
total = 0
for row in dataset:
    if row[-1] == 1:
        count += 1
    total += 1
print("Total yes: " + str(count) + " / " + str(total))
probYes *= count / total          # prior P(yes)
for i in range(len(test)):
    count = 0
    total = 0
    for row in mp[1]:
        if test[i] == row[i]:
            count += 1
        total += 1
    print('for feature ' + str(i + 1))
    print(str(count) + " / " + str(total))
    probYes *= count / total      # likelihood P(feature_i | yes)

probNo = 1
count = 0
total = 0
for row in dataset:
    if row[-1] == 0:
        count += 1
    total += 1
probNo *= count / total           # prior P(no)
print("Total no: " + str(count) + " / " + str(total))
for i in range(len(test)):
    count = 0
    total = 0
    for row in mp[0]:
        if test[i] == row[i]:
            count += 1
        total += 1
    print('for feature ' + str(i + 1))
    print(str(count) + " / " + str(total))
    probNo *= count / total       # likelihood P(feature_i | no)

# The predicted class is the one with the larger (unnormalised) posterior.
print("Prediction:", "yes" if probYes > probNo else "no")

—This experiment does not include postlabs—


Experiment 09- Page rank/HITS algorithm

Aim: Implementation of Page rank/HITS algorithm

Code - Page Rank Algorithm(Web Surfer) in Python:

Output:
Code - Simple PageRank Algorithm in Python
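The simple PageRank code is shown as a screenshot; a minimal power-iteration sketch (the four-page link structure here is a made-up example, not the graph used in the experiment) might look like this:

import numpy as np

# Hypothetical link structure: page -> list of pages it links to.
links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}
pages = sorted(links)
n = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# Column-stochastic transition matrix M: M[i][j] = 1/out_degree(j) if page j links to page i.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[idx[dst], idx[src]] = 1 / len(outs)

# Power iteration: repeatedly multiply the rank vector by M.
v = np.ones(n) / n
for _ in range(100):
    v = M @ v
print(dict(zip(pages, np.round(v, 4))))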
Output:

Code - HITS Algorithm in Python:
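The HITS code is shown as a screenshot; the format of the printed scores below suggests networkx was used. A minimal sketch (the edge list here is an assumption, so its scores need not match the output below) might look like this:

import networkx as nx

# Hypothetical directed graph on nodes A-H; the actual edges are in the screenshot.
edges = [('A', 'D'), ('B', 'C'), ('B', 'E'), ('C', 'A'), ('D', 'C'),
         ('E', 'D'), ('E', 'B'), ('E', 'F'), ('E', 'C'), ('F', 'C'),
         ('F', 'H'), ('G', 'A'), ('G', 'C'), ('H', 'A')]
G = nx.DiGraph(edges)

# Compute hub and authority scores with the HITS algorithm.
hubs, authorities = nx.hits(G, max_iter=100, normalized=True)
print("Hub Scores:", hubs)
print("Authority Scores:", authorities)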


Output:

Hub Scores: {'A': 0.04642540403219995, 'D': 0.13366037526115382, 'B':


0.15763599442967322, 'C': 0.03738913224642654, 'E': 0.25881445984686646, 'F':
0.15763599442967322, 'H': 0.03738913224642654, 'G': 0.17104950750758036}

Authority Scores: {'A': 0.10864044011724344, 'D': 0.13489685434358, 'B':


0.11437974073336446, 'C': 0.38837280038761807, 'E': 0.06966521184241477, 'F':
0.11437974073336446, 'H': 0.06966521184241475, 'G': 0.0}

Conclusion:
● PageRank (PR) is an algorithm used by Google Search to rank websites in their
search engine results. PageRank works by counting the number and quality of links
to a page to determine a rough estimate of how important the website is. The
underlying assumption is that more important websites are likely to receive more
links from other websites.
● Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates webpages. This algorithm is applied to the Web's link structure to discover and rank the webpages relevant to a particular search.

Postlab Questions:

1) Give the efficient approach to handle M and write reason


Ans)
1. The transition matrix(M) is very sparse, since the average Web page has about 10
out-links. If, say, we are analysing a graph of ten billion pages, then only one in a
billion entries is not 0.
2. The proper way to represent any sparse matrix is to list the locations of the nonzero
entries and their values. If we use 4-byte integers for coordinates of an element and
an 8-byte double-precision number for the value, then we need 16 bytes per nonzero
entry. That is, the space needed is linear in the number of nonzero entries, rather
than quadratic in the side of the matrix.
3. However, for a transition matrix of the Web, there is one further compression
that we can do. If we list the nonzero entries by column, then we know what
each nonzero entry is; it is 1 divided by the out-degree of the page.
4. We can thus represent a column by one integer for the out-degree, and one integer
per nonzero entry in that column, giving the row number where that entry is located.
Thus, we need slightly more than 4 bytes per nonzero entry to represent a transition
matrix.
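A small sketch of this compressed, column-wise representation (the three-page graph here is a made-up example) follows:

# For each column (source page) store only its out-degree and the row indices
# (destination pages) of its nonzero entries; every nonzero value is 1/out_degree.
M_compressed = {
    'A': {'out_degree': 2, 'rows': ['B', 'C']},   # A links to B and C
    'B': {'out_degree': 1, 'rows': ['C']},
    'C': {'out_degree': 2, 'rows': ['A', 'B']},
}

def entry(M, row, col):
    """Recover the full-matrix entry M[row][col] from the compressed form."""
    info = M[col]
    return 1 / info['out_degree'] if row in info['rows'] else 0.0

print(entry(M_compressed, 'B', 'A'))   # 0.5
print(entry(M_compressed, 'A', 'B'))   # 0.0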

2) Give the improvement of the PageRank algorithm for the spider trap problem and dead ends
Ans)
1. A dead end is a page that has no links out. Surfers reaching such a page disappear, and the result is that, in the limit, no page that can reach a dead end can have any PageRank at all.
2. The second problem is groups of pages that all have out-links but never link to any page outside the group. These structures are called spider traps.
3. There are two approaches to dealing with dead ends.
1. We can drop the dead ends from the graph, and also drop their incoming
arcs. Doing so may create more dead ends, which also have to be dropped,
recursively. However, eventually we wind up with a strongly-connected
component, none of whose nodes are dead ends. Recursive deletion of dead ends
will remove parts of the out-component, tendrils, and tubes, but leave the SCC and
the in-component, as well as parts of any small isolated components.
2. We can modify the process by which random surfers are assumed to move
about the Web. This method, which we refer to as “taxation,” also solves
the problem of spider traps.
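The taxation idea can be written as the iteration v' = beta * M * v + (1 - beta) * e / n, where beta is typically around 0.85 and e is the all-ones vector; a small numpy sketch (assuming M is already the column-stochastic transition matrix, with dead ends handled) follows:

import numpy as np

def pagerank_with_taxation(M, beta=0.85, iters=50):
    """Power iteration with taxation (random teleports), which escapes spider traps."""
    n = M.shape[0]
    v = np.ones(n) / n
    for _ in range(iters):
        v = beta * (M @ v) + (1 - beta) * np.ones(n) / n
    return v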
Experiment 10 - Text Mining

Aim: Perform text summarization of Wikipedia Page of your choice

Task :
1. Fetching Latest Articles as per your hobby from Wikipedia and display
2. Preprocessing and show output of preprocessing
3. Converting Text To Sentences
4. Find Weighted Frequency of Occurrence and display output
5. Calculating Sentence Scores and display output
6. Display the Summary

Code :
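The code is shown as screenshots; a minimal NLTK-based sketch of the six tasks (the Wikipedia page title is a placeholder for the page of your choice) might look like this:

import re
import heapq
import urllib.request
import bs4 as bs
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# 1. Fetch a Wikipedia article (placeholder title) and keep its paragraph text.
raw = urllib.request.urlopen('https://en.wikipedia.org/wiki/Machine_learning').read()
paragraphs = bs.BeautifulSoup(raw, 'html.parser').find_all('p')
article_text = ' '.join(p.text for p in paragraphs)

# 2. Preprocessing: drop reference markers like [1] and extra whitespace.
clean_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
clean_text = re.sub(r'\s+', ' ', clean_text)
formatted_text = re.sub(r'[^a-zA-Z]', ' ', clean_text)
print(clean_text[:500])

# 3. Converting text to sentences.
sentences = nltk.sent_tokenize(clean_text)

# 4. Weighted frequency of occurrence.
stop_words = set(nltk.corpus.stopwords.words('english'))
word_freq = {}
for word in nltk.word_tokenize(formatted_text.lower()):
    if word not in stop_words:
        word_freq[word] = word_freq.get(word, 0) + 1
max_freq = max(word_freq.values())
word_freq = {w: f / max_freq for w, f in word_freq.items()}
print(sorted(word_freq.items(), key=lambda x: -x[1])[:10])

# 5. Sentence scores (short sentences only, to keep the summary readable).
sentence_scores = {}
for sent in sentences:
    if len(sent.split(' ')) < 30:
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_freq:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_freq[word]

# 6. Display the summary: the five highest-scoring sentences.
summary = ' '.join(heapq.nlargest(5, sentence_scores, key=sentence_scores.get))
print(summary)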
—This experiment does not include postlabs—
Experiment 11 - Text Summarisation with Spacy
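The experiment's code is not reproduced in the text; a minimal spaCy sketch of extractive summarisation (assuming the en_core_web_sm model is installed; the sample text is a stand-in for the article used) might look like this:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

nlp = spacy.load("en_core_web_sm")

# 'text' would normally be the article fetched in Experiment 10; a stub is used here.
text = ("Data mining is the process of extracting patterns from large data sets. "
        "It combines methods from statistics, machine learning and database systems. "
        "Text summarisation selects the most informative sentences of a document. "
        "spaCy and NLTK are two popular Python libraries for such NLP tasks.")
doc = nlp(text)

# Normalised word frequencies, ignoring stop words, punctuation and whitespace.
word_freq = {}
for token in doc:
    w = token.text.lower()
    if w not in STOP_WORDS and w not in punctuation and not token.is_space:
        word_freq[w] = word_freq.get(w, 0) + 1
max_freq = max(word_freq.values())
word_freq = {w: f / max_freq for w, f in word_freq.items()}

# Score each sentence by the frequencies of the words it contains.
sent_scores = {}
for sent in doc.sents:
    for token in sent:
        w = token.text.lower()
        if w in word_freq:
            sent_scores[sent] = sent_scores.get(sent, 0) + word_freq[w]

# Keep the top 2 sentences as the summary.
summary = nlargest(2, sent_scores, key=sent_scores.get)
print(" ".join(s.text for s in summary))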
Conclusion :

Spacy:
● Considered an entire paragraph as a sentence and gave a score for it.
● Word frequency table has fewer rows.
● Summarised text has more lines due to the paragraph scores specified.

NLTK:
● Considered every line between full stops as a sentence and gave a score to it.
● Word frequency table has more rows; considered even blank spaces.
● Summarised text has fewer lines due to the sentence scores specified.
