Batch B DWM Experiments
Aim-
Build a Data Warehouse/Data Mart for a given problem statement (one case study). Write a detailed problem statement and design the dimensional model (creation of star and snowflake schemas).
Part 1-
Problem statement - Library users require an efficient method to find a specific book, or keyword(s) within a book, in a continuously expanding library. Efficiency requires that the processing time stays roughly the same even as the library's contents increase.
Star schema
Snowflake schema
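The star and snowflake schema diagrams are shown above. As a minimal, hedged sketch of how such a star schema could be declared (table and column names here are illustrative assumptions, not the exact design in the diagrams), using SQLite from Python:

import sqlite3

# Minimal sketch of a possible star schema for the library case study.
# Table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_book (
    book_key   INTEGER PRIMARY KEY,   -- surrogate key
    isbn       TEXT,
    title      TEXT,
    category   TEXT,
    publisher  TEXT
);

CREATE TABLE dim_member (
    member_key  INTEGER PRIMARY KEY,
    member_name TEXT,
    member_type TEXT                   -- e.g. student / staff
);

CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,    -- e.g. 20240131
    full_date  TEXT,
    month      INTEGER,
    quarter    INTEGER,
    year       INTEGER
);

-- Fact table: one row per book issue transaction.
CREATE TABLE fact_issue (
    book_key        INTEGER REFERENCES dim_book(book_key),
    member_key      INTEGER REFERENCES dim_member(member_key),
    issue_date_key  INTEGER REFERENCES dim_date(date_key),
    return_date_key INTEGER REFERENCES dim_date(date_key),
    days_borrowed   INTEGER,           -- measure
    fine_amount     REAL               -- measure
);
""")
conn.commit()

In the snowflake variant, dim_book would be further normalised, for example by moving publisher into its own dim_publisher table referenced through a publisher key.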
Part 2
Postlab questions -
1. Justify whether slowly changing dimension modelling is required for the problem selected?
A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current
and historical data over time in a data warehouse.
Slowly changing dimension modelling is required for library management because it is important to keep track of the issue and return dates of all books borrowed by members, and to keep records of both the old books and the new books in the library.
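As a hedged sketch of the idea (column names are assumptions): with SCD Type 2 handling, a change to a member's details closes out the current dimension row and inserts a new one with validity dates, so history is preserved:

from datetime import date

# Illustrative SCD Type 2 handling for the member dimension
# (column names are assumptions).
dim_member = [
    {"member_key": 1, "member_id": "M001", "address": "Old Hostel",
     "effective_from": date(2020, 1, 1), "effective_to": None, "is_current": True},
]

def scd2_update(dim, member_id, new_address, change_date):
    """Close out the current row and insert a new current row."""
    for row in dim:
        if row["member_id"] == member_id and row["is_current"]:
            row["effective_to"] = change_date
            row["is_current"] = False
    dim.append({
        "member_key": max(r["member_key"] for r in dim) + 1,  # new surrogate key
        "member_id": member_id,
        "address": new_address,
        "effective_from": change_date,
        "effective_to": None,
        "is_current": True,
    })

scd2_update(dim_member, "M001", "New Hostel", date(2023, 6, 1))
# dim_member now keeps both the historical and the current address.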
2. Justify why factless fact table modelling is not required for the problem selected? Give examples of fact tables with their types.
A factless fact table is essentially an intersection of dimensions: it contains nothing but dimension keys.
The library's fact table has measures available (for example, the number of copies issued or the fine amount), so factless fact tables are not required for this dimensional model.
There are two types of factless fact tables:
1. Event tracking tables - e.g. average attendance for a given course
2. Coverage tables - e.g. tracking student attendance
3. Justify your approach to deal with large dimension tables for the problem selected.
To handle large dimension tables, mini-dimension tables can be created from a large dimension table based on the attributes of interest, and these can be represented in the form of a star schema or snowflake schema. In the library management model, the Book, Staff, Publisher and Member tables serve as the mini-dimension tables.
4. Justify the use of a junk dimension and surrogate key for the problem selected.
For the library management model it is important to keep note of all transactions done for book replacement or ordering new books, so surrogate keys can be used in place of the natural primary keys. Miscellaneous low-cardinality flags can be grouped into a junk dimension whose key is stored in the fact table, so these attributes are recorded without increasing the size of the main dimension tables.
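A minimal sketch of these two ideas (the flag and column names are illustrative assumptions): miscellaneous low-cardinality flags are combined into one junk dimension, and the fact row stores only a small surrogate key that points at the matching flag combination:

from itertools import product

# Junk dimension: every combination of the miscellaneous flags gets one
# surrogate key (flag names are illustrative assumptions).
flags = {"is_reference_only": [0, 1], "is_new_arrival": [0, 1], "is_damaged": [0, 1]}
junk_dim = {}
for key, combo in enumerate(product(*flags.values()), start=1):
    junk_dim[combo] = key          # e.g. (0, 1, 0) -> surrogate key 3

def junk_key(is_reference_only, is_new_arrival, is_damaged):
    """Look up the surrogate key the fact row should store."""
    return junk_dim[(is_reference_only, is_new_arrival, is_damaged)]

# A fact row stores one small integer instead of three separate flags.
fact_row = {"book_key": 42, "member_key": 7, "junk_key": junk_key(0, 1, 0)}
print(fact_row)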
Conclusion -
We designed the structure by creating star and snowflake schemas, imported the details from a CSV file into a single table, and created the dimension and fact tables by selecting specific attributes from the imported table's contents.
Experiment 02 - OLAP Operations
Aim-
Perform OLAP operations on a dataset using MS Excel, PostgreSQL & Python.
Screenshots:
Part 1) Using Excel - Tool: MS Excel with pivot tables
Do all OLAP operations in separate sheets (cube, roll-up, drill-down, slice, dice)
Cube:
RollUp
Drill Down:
Slice:
Dice:
2. Need to see monthly, quarterly, yearly profit, sales of each category
region wise.
6. Need to study trend of sales by time period of the day over the month,
and year?
7. What is sale of product category wise, city wise for year=2015
8. What is the sales of product categories in the Jan month of each year?
9. What is trend of sales on year 2014 and 2015 for product category
furniture and technology?
10. Need to compare weekly, monthly and yearly sales to know growth.
Part 2 - Perform OLAP operations using a Python pivot table
Cube:
Rollup:
Yearly :
Monthly :
Quarterly :
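The pivot-table outputs above are in the screenshots. As a minimal pandas sketch of the same operations (the file name "superstore.csv" and column names such as 'Order Date', 'Region', 'Category' and 'Sales' are assumptions about the dataset):

import pandas as pd

# Illustrative OLAP-style aggregation with a pandas pivot table.
# Column names ('Order Date', 'Region', 'Category', 'Sales') are assumptions.
df = pd.read_csv("superstore.csv", parse_dates=["Order Date"])
df["Year"] = df["Order Date"].dt.year
df["Quarter"] = df["Order Date"].dt.quarter
df["Month"] = df["Order Date"].dt.month

# Cube-like view: sales by every (Region, Category, Year) combination.
cube = pd.pivot_table(df, values="Sales", index=["Region", "Category"],
                      columns="Year", aggfunc="sum", margins=True)

# Roll-up: drop the Month/Quarter detail and aggregate to Year only.
rollup_year = df.groupby("Year")["Sales"].sum()

# Drill-down: add finer time granularity back in.
drill_month = df.groupby(["Year", "Quarter", "Month"])["Sales"].sum()

# Slice: fix one dimension (Year == 2015).
slice_2015 = df[df["Year"] == 2015].groupby("Category")["Sales"].sum()

# Dice: select a sub-cube on two or more dimensions.
dice = df[(df["Year"].isin([2014, 2015])) &
          (df["Category"].isin(["Furniture", "Technology"]))]
print(cube.head())

Here margins=True adds the "All" totals, which correspond to the roll-up/grand-total rows shown in the Excel pivot table.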
8. What is the sales of product categories in the Jan month of each year?
Dim_prod
Dim_time
2. Describe the fact table (FT) and count the number of records
3. Create a basic cube by considering all of the following queries and the CUBE command
a. Cube with the following attributes: (month, year, product category, city, sum of sales) - implemented as a UNION ALL of all 16 group-by combinations
Selection conditions on some attributes using <WHERE clause> <Group by> and aggregation
on some attribute
8. Write appropriate SQL queries for drill down operation
Give the total sales of east region for all product categories, for all
months for the year 2016
SELECT … GROUP BY <additional, finer-grained columns>;
(There is no ROLLDOWN keyword in SQL; a drill-down is obtained simply by grouping on more detailed columns, e.g. adding month and category to the GROUP BY.)
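As a self-contained stand-in for the PostgreSQL session (table and column names are assumptions, and DuckDB is used here only because it supports the same GROUP BY CUBE/ROLLUP syntax in-process), the cube, roll-up and drill-down queries can be sketched from Python as follows:

import duckdb

# DuckDB supports the same GROUP BY CUBE / ROLLUP syntax as PostgreSQL,
# so the lab queries can be sketched without a running server.
# Table and column names are illustrative assumptions.
con = duckdb.connect()
con.execute("CREATE TABLE sales_fact (year INTEGER, month INTEGER, region VARCHAR, category VARCHAR, sales DOUBLE)")
con.execute("INSERT INTO sales_fact VALUES "
            "(2015, 1, 'East', 'Furniture', 120.0), "
            "(2015, 1, 'East', 'Technology', 300.0), "
            "(2016, 2, 'West', 'Furniture', 80.0), "
            "(2016, 3, 'East', 'Technology', 210.0)")

# Cube over (year, category, region): all 8 grouping combinations.
cube = con.execute(
    "SELECT year, category, region, SUM(sales) FROM sales_fact "
    "GROUP BY CUBE (year, category, region)").fetchall()

# Roll-up: (year, month) subtotals plus yearly subtotals and a grand total.
rollup = con.execute(
    "SELECT year, month, SUM(sales) FROM sales_fact "
    "GROUP BY ROLLUP (year, month)").fetchall()

# Drill-down is just a more detailed GROUP BY, e.g. East region in 2016,
# broken down per month and category.
drill = con.execute(
    "SELECT month, category, SUM(sales) FROM sales_fact "
    "WHERE region = 'East' AND year = 2016 "
    "GROUP BY month, category").fetchall()
print(len(cube), len(rollup), drill)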
PostLab:
1)In data warehouse technology, a multiple dimensional view can be
implemented by a relational database technique (ROLAP), by a
multidimensional database technique (MOLAP), or by a hybrid
database technique (HOLAP).
(a) Briefly describe each implementation technique.
ROLAP stores the warehouse in relational (or extended-relational) tables and implements OLAP operations with SQL; MOLAP stores pre-computed data cubes in multidimensional arrays; HOLAP combines the two, typically keeping detailed data in relational tables and aggregations in MOLAP arrays.
i. Generation of the data warehouse (including aggregation), in ROLAP: initial aggregation may be accomplished in SQL via group-bys. The compute cube operator computes aggregates over all subsets of the dimensions in the specified operation; this leads to the generation of a single cube. ROLAP relies on tuples and relational tables as its basic data structures. The base fact table (a relational table) stores data at the abstraction level indicated by the join keys in the schema for the given data cube. Aggregated data can also be stored in fact tables known as summary fact tables. ROLAP uses value-based addressing, where dimension values are accessed by key-based search strategies. To optimise ROLAP cube computation we may use the following techniques:
- sorting, hashing, grouping operations
- grouping performed on sub-aggregates
- aggregates derived from previously computed aggregates
ii. Roll-up: Aggregation on a data cube (aka dimension reduction). In ROLAP, this means
that the relational tables are aggregated from more to less specific.
iii. Drill-down:
The opposite of Roll-up. We introduce additional dimensions into the relation tables and,
hence, cubes.
iv. Incremental updating:
A ROLAP server would use appropriate tools (such as those made by Informix). Since ROLAP is based on relational databases, incremental updates can be performed in much the same way as in a traditional RDBMS and then propagated to the affected summary fact tables and data cube(s).
(b) For each technique, explain how each of the following functions may
be implemented:
Example:
A table product has the following records:-
Apparel Brand Quantity
CUBE can be used to return a result set that contains the Quantity subtotal for all possible
combinations of Apparel and Brand:
SELECT Apparel, Brand, SUM(Quantity) AS QtySum
FROM product
GROUP BY Apparel, Brand WITH CUBE
The query above will return:
Apparel Brand Quantity
Example:
SELECT Apparel,Brand,sum(Quantity) FROM Product GROUP BY ROLLUP
(Apparel,Brand);
The query above returns the total quantity for each (Apparel, Brand) combination, a subtotal for each Apparel value, and a grand total.
Conclusion: OLAP is a software technology that allows users to analyse information from multiple database systems at the same time. It is based on a multidimensional data model and lets the user query multidimensional data. In the drill-down operation, less detailed data is converted into more detailed data. Roll-up is the opposite of drill-down: it performs aggregation on the OLAP cube, either by climbing up a concept hierarchy or by reducing the number of dimensions. Dice selects a sub-cube from the OLAP cube by applying criteria on two or more dimensions. Slice selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube.
Experiment 03 - Data Exploration In Python
Aim: To perform various data exploration operations and data cleaning using python
Steps
1) Load the libraries
2) Download the data set from kaggle/ other sources
3) Read the file - select the appropriate file-read function according to the data type of the file (refer link 1).
4) Display the attributes in the dataset (10 samples).
5) Describe the attributes: name, count of values, and find min, max, data type, range, quartiles, percentiles, box plot and outliers.
6) Give a visualisation of the statistical description of the data in the form of a histogram, scatter plot and pie chart.
7) Give correlation matrix
8) Identify missing values and outliers and fill them with average.
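A minimal pandas sketch of steps 3 to 8 (the file name "dataset.csv" is an assumption; the actual code is in the screenshots under "Code:" below) might look like this:

import pandas as pd
import matplotlib.pyplot as plt

# Steps 3-8 sketched with pandas; "dataset.csv" is an assumed file name.
df = pd.read_csv("dataset.csv")

print(df.head(10))                 # step 4: display 10 samples
print(df.describe())               # step 5: count, min, max, quartiles
print(df.dtypes)                   # step 5: data types

numeric = df.select_dtypes(include="number")
numeric.plot(kind="box")           # step 5: box plot to spot outliers
numeric.hist()                     # step 6: histograms
plt.show()

print(numeric.corr())              # step 7: correlation matrix

# Step 8: fill missing values with the column mean, and replace outliers
# (beyond 1.5 * IQR) with the mean as well.
for col in numeric.columns:
    mean = df[col].mean()
    df[col] = df[col].fillna(mean)
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    df.loc[outliers, col] = mean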
Code:
Experiment 04 - Apriori Algorithm
Aim: Implementation of Association Rule Mining algorithm (Apriori in
java/python).
CODE:
from itertools import combinations

# NOTE: in the original notebook, `data` (a list of (transaction-id, items)
# pairs) and `pl` (the frequent itemsets found by the earlier Apriori passes,
# as frozensets) are built before this step; a small illustrative stand-in is
# given here so that the rule-generation code runs on its own.
data = [
    (1, ['bread', 'milk', 'butter']),
    (2, ['bread', 'butter']),
    (3, ['milk', 'butter']),
    (4, ['bread', 'milk', 'butter']),
    (5, ['bread', 'milk']),
]
pl = [frozenset(['bread', 'milk', 'butter'])]

# For every frequent itemset, generate rules from each (size-1) subset and
# report the confidence of both directions, a -> b and b -> a.
for l in pl:
    c = [frozenset(q) for q in combinations(l, len(l) - 1)]
    mmax = 0
    for a in c:
        b = l - a      # consequent
        ab = l         # the full itemset a ∪ b
        sab = 0        # support count of a ∪ b
        sa = 0         # support count of a
        sb = 0         # support count of b
        for q in data:
            temp = set(q[1])
            if a.issubset(temp):
                sa += 1
            if b.issubset(temp):
                sb += 1
            if ab.issubset(temp):
                sab += 1
        # Confidence of a -> b and of b -> a; remember the maximum seen.
        temp = sab / sa * 100
        if temp > mmax:
            mmax = temp
        temp = sab / sb * 100
        if temp > mmax:
            mmax = temp
        print(str(list(a)) + " -> " + str(list(b)) + " = " + str(sab / sa * 100) + "%")
        print(str(list(b)) + " -> " + str(list(a)) + " = " + str(sab / sb * 100) + "%")

    # Second pass: print the rule numbers whose confidence equals the maximum.
    curr = 1
    print("choosing:", end=' ')
    for a in c:
        b = l - a
        ab = l
        sab = 0
        sa = 0
        sb = 0
        for q in data:
            temp = set(q[1])
            if a.issubset(temp):
                sa += 1
            if b.issubset(temp):
                sb += 1
            if ab.issubset(temp):
                sab += 1
        temp = sab / sa * 100
        if temp == mmax:
            print(curr, end=' ')
        curr += 1
        temp = sab / sb * 100
        if temp == mmax:
            print(curr, end=' ')
        curr += 1
    print()
    print()
OUTPUT:
Postlab Questions:
1) What are the limitations of the Apriori algorithm?
Ans:
The Apriori algorithm can be slow. Its main limitation is the time required to hold the huge number of candidate itemsets that arise when there are many frequent itemsets, a low minimum support, or large itemsets, so it is not an efficient approach for very large datasets. Overall performance also suffers because the database is scanned multiple times. For example, if there are 10^4 frequent 1-itemsets, more than 10^7 candidate 2-itemsets must be generated, tested and accumulated. Furthermore, to detect a frequent pattern of size 100, i.e. {v1, v2, ..., v100}, it has to generate on the order of 2^100 candidate itemsets, which makes candidate generation costly and time-consuming. Apriori therefore checks many candidate itemsets and scans the database repeatedly to find them, and it becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
2) Support threshold=50%, Confidence= 60%
Ans:
● numpy
● pandas
● mlxtend.frequent_patterns:
1. apriori
2. association_rules
● apyori.apriori
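Using the libraries listed above, a hedged sketch of mining rules at support = 50% and confidence = 60% with mlxtend could look like this (the transaction list is an illustrative stand-in, not the lab's dataset):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative transactions (an assumption, not the lab's actual dataset).
transactions = [
    ['bread', 'milk', 'butter'],
    ['bread', 'butter'],
    ['milk', 'butter'],
    ['bread', 'milk', 'butter'],
    ['bread', 'milk'],
]

# One-hot encode the transactions for mlxtend.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets with min support 50%, then rules with min confidence 60%.
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])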
Experiment 05 - KMeans Clustering
Code :
Part -1
# coding: utf-8
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style

# Sample points to be clustered.
X = [[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]]
Y = np.array(X)

style.use('fivethirtyeight')
plt.scatter(Y[:, 0], Y[:, 1], alpha=0.3)
plt.show()

# Excerpt from inside k_means_clustering (the full definition is in the
# notebook screenshots): plot the first and last iteration of k-means given
# the number of repeats specified (default 10, so the 1st and the 10th).
#     if i == range(repeats)[-1] or i == range(repeats)[0]:
#         plot_clusters(centroids, clusters, k)

k_means_clustering(Y)
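For comparison with the from-scratch version above, a short scikit-learn sketch on the same points (k = 3 and the random seed are assumed choices):

import numpy as np
from sklearn.cluster import KMeans

# Same sample points as above; k = 3 is an assumed number of clusters.
Y = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Y)
print("labels:   ", km.labels_)
print("centroids:\n", km.cluster_centers_)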
Part 2
Post lab:
1. Differentiate K-means and K-medoids algorithms with one example
weather.numeric.arff: Visualize
Discuss the new insights you found from visualizing the data, the techniques you tested, and
the results you obtained
Using filters we can change data from numeric to nominal types.
Part 2
Diabetes.arff (https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/diabetes.arff)
As we can see, the result is that there are 49 outliers and 0 extreme values in the given dataset.
Saving this new file
Applying filter
1. Logistic Regression
2. Naive Bayes
3. Decision Tree
4. K-Nearest Neighbors
5. Support Vector Machines
D) Perform simple K-means clustering using the iris dataset; use Euclidean distance.
Postlabs
1. Load the ‘weather.nominal.arff’ dataset into Weka and run Id3 classification
algorithm. Answer the following questions
· List the attributes of the given relation along with the type details
Here, Gain represents the information gain, computed as Gain = E(parent) − E(children), where E(parent) is the entropy of the parent node and E(children) is the weighted average entropy of the child nodes.
2. Draw the confusion matrix? What information does the confusion matrix
provide?
A confusion matrix is a table that is used to define the performance of a
classification algorithm. It visualizes and summarizes the performance of a
classification algorithm. The confusion matrix tells us that there are 8 true
positive values, 1 false positive value, 1 false negative value and 4 true
negative values.
The Kappa statistic (or value) is a metric that compares an Observed Accuracy
with an Expected Accuracy (random chance). The kappa statistic is used not
only to evaluate a single classifier, but also to evaluate classifiers amongst
themselves.
The kappa statistic for one model is directly comparable to the kappa statistic
for any other model used for the same classification task.
· TP Rate
· FP Rate
· Precision
Proportion of instances that are truly of a class divided by the total instances
classified as that class
· Recall
Proportion of instances classified as a given class divided by the actual total
in that class
2. Load the ‘weather.arff’ dataset in Weka and run the Id3 classification
algorithm. What problem do you have and what is the solution?
The weather.arff dataset consists of numeric as well as nominal attributes but the
ID3 classification algorithm works only on nominal attributes. To solve this
problem, we convert the numeric attributes to nominal by using the filter in the
preprocessing tab as weka.filters.unsupervised.attribute.NumericToNominal.
3. Load the ‘weather.arff’ dataset in Weka and run the OneR rule generation
algorithm. Write the rules that were generated.
4. Load the ‘weather.arff’ dataset in Weka and run the PRISM rule generation
algorithm. Write down the rules that are generated.
On going to the visualize tab, the class color section tells us which color has
been assigned to which class
● By examining the histogram, how will you determine which attributes should
be the most important in classifying the types of glass?
From the histograms we can see that most of the attributes are numeric and their class distributions overlap heavily. We therefore choose the attributes with the least overlap as the ones most useful for classifying the types of glass.
6. Perform the following classification tasks:
For k=1:
For k=2:
For k=20:
● What type of classifier is the IBk classifier?
The IBk classifier is a lazy learning algorithm. A lazy learner first stores the training dataset and waits until it receives the test dataset; classification is then done on the basis of the most closely related data stored in the training dataset. It takes less time in training but more time for predictions.
8. Compare the results of the IBk classifier (k-nearest neighbours; an appropriate value of k can be selected based on cross-validation, and distance weighting can also be used) and the J48 classifier. Which is better?
On comparing the results of the IBk and J48 classifiers, we see that the number of correctly classified instances is higher for J48 than for IBk.
9. Run the J48 and the NaiveBayes classifiers on the following datasets and determine the
accuracy:
1. vehicle.arff ,
J48
2. Kr-vs-kp.arff
J48
3. Glass.arff
J48
Correctly classified instances: 96.26%
NaiveBayes
4. waveform-5000.arff
J48
NaiveBayes
Correctly classified instances: 80%
● Use the results of the J48 classifier to determine the most important attributes
● Remove the least important attributes
● Run the J48 and IBk classifiers and determine the effect of this change on the accuracy of these classifiers. What will you conclude from the results?
Before:
J48
IBk
After: J48
IBk:
Prediction:
Preprocessing diabetes.arff
Open Weka and load any old dataset, then go to Classify; in the Result list, right-click and select 'Load model'. From there, load the previously saved model.
Then click 'Re-evaluate model on current test set' (which appears when you right-click in the Result list).
Conclusion:
Thus, by using hierarchical clustering we were able to divide customers into clusters on the basis of their Annual Income and Spending Score, i.e. their spending tendency.
Code - Naive Bayes Classifier in Python

# NOTE: `dataset` (rows whose last element is the class label, 1 = yes, 0 = no),
# `test` (the sample to classify) and `mp` (a map from class label to the rows
# of that class) are built earlier in the original notebook; a small
# illustrative stand-in is given here so the snippet runs on its own.
dataset = [
    ['sunny', 'high', 0],
    ['sunny', 'normal', 1],
    ['rainy', 'high', 0],
    ['overcast', 'high', 1],
    ['rainy', 'normal', 1],
]
test = ['sunny', 'normal']
mp = {
    0: [row for row in dataset if row[-1] == 0],
    1: [row for row in dataset if row[-1] == 1],
}

# P(yes): prior probability of the "yes" class.
probYes = 1
count = 0
total = 0
for row in dataset:
    if row[-1] == 1:
        count += 1
    total += 1
print("Total yes: " + str(count) + " / " + str(total))
probYes *= count / total

# Multiply in P(feature_i = test[i] | yes) for every feature.
for i in range(len(test)):
    count = 0
    total = 0
    for row in mp[1]:
        if test[i] == row[i]:
            count += 1
        total += 1
    print('for feature ' + str(i + 1))
    print(str(count) + " / " + str(total))
    probYes *= count / total

# P(no): prior probability of the "no" class.
probNo = 1
count = 0
total = 0
for row in dataset:
    if row[-1] == 0:
        count += 1
    total += 1
probNo *= count / total
print("Total no: " + str(count) + " / " + str(total))

# Multiply in P(feature_i = test[i] | no) for every feature.
for i in range(len(test)):
    count = 0
    total = 0
    for row in mp[0]:
        if test[i] == row[i]:
            count += 1
        total += 1
    print('for feature ' + str(i + 1))
    print(str(count) + " / " + str(total))
    probNo *= count / total

# The predicted class is the one with the larger (unnormalised) probability.
print("Prediction:", "yes" if probYes > probNo else "no")
Output:
Code - Simple PageRank Algorithm in Python
Output:
Conclusion:
● PageRank (PR) is an algorithm used by Google Search to rank websites in their
search engine results. PageRank works by counting the number and quality of links
to a page to determine a rough estimate of how important the website is. The
underlying assumption is that more important websites are likely to receive more
links from other websites.
● The Hyperlink-Induced Topic Search (HITS) algorithm is a link-analysis algorithm that rates web pages. It uses the web's link structure to discover and rank the pages relevant to a particular search.
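As a hedged illustration of both measures (the small example graph is an assumption, not data from the experiment), networkx provides ready-made pagerank and hits implementations:

import networkx as nx

# Small illustrative link graph (an assumption, not data from the experiment).
G = nx.DiGraph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")])

# PageRank: importance from the number and quality of incoming links.
pr = nx.pagerank(G, alpha=0.85)

# HITS: hub and authority scores from the link structure.
hubs, authorities = nx.hits(G)

print("PageRank:   ", pr)
print("Hubs:       ", hubs)
print("Authorities:", authorities)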
Postlab Questions:
2) Give improvements of the PageRank algorithm for the spider trap and dead end problems.
Ans)
1. Dead end is a page that has no links out. Surfers reaching such a page disappear,
and the result is that in the limit no page that can reach a dead end can have any
PageRank at all.
2. The second problem is groups of pages that all have outlinks, but those links never lead to any pages outside the group. These structures are called spider traps.
3. There are two approaches to dealing with dead ends.
1. We can drop the dead ends from the graph, and also drop their incoming
arcs. Doing so may create more dead ends, which also have to be dropped,
recursively. However, eventually we wind up with a strongly-connected
component, none of whose nodes are dead ends. Recursive deletion of dead ends
will remove parts of the out-component, tendrils, and tubes, but leave the SCC and
the in-component, as well as parts of any small isolated components.
2. We can modify the process by which random surfers are assumed to move
about the Web. This method, which we refer to as “taxation,” also solves
the problem of spider traps.
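A minimal sketch of the taxation idea (the four-page graph and beta = 0.85 are assumptions): with probability beta the random surfer follows an outlink, and with probability 1 - beta it teleports to a random page, so a spider trap can no longer absorb all of the PageRank:

import numpy as np

# Power iteration with "taxation" (teleportation), sketching the fix
# described above. The 4-page graph and beta = 0.85 are assumptions:
# page 3 links only to itself, i.e. it is a spider trap.
links = {0: [1, 2], 1: [0, 3], 2: [0], 3: [3]}
n = len(links)

# Column-stochastic transition matrix M: M[j, i] = 1/deg(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

beta = 0.85
v = np.full(n, 1.0 / n)
for _ in range(100):
    # With probability beta follow a link, otherwise teleport to a random
    # page; this stops the spider trap (page 3) from absorbing all the rank.
    v = beta * M @ v + (1 - beta) / n

print(np.round(v, 4))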
Experiment 10 - Text Mining
Task :
1. Fetching Latest Articles as per your hobby from Wikipedia and display
2. Preprocessing and show output of preprocessing
3. Converting Text To Sentences
4. Find Weighted Frequency of Occurrence and display output
5. Calculating Sentence Scores and display output
6. Display the Summary
Code :
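The original code is in the screenshots; as a hedged sketch of steps 3 to 6 with NLTK (the article text is hard-coded here instead of being fetched from Wikipedia):

import heapq
import nltk

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

# The article text is hard-coded here as a stand-in for the Wikipedia fetch.
text = ("Chess is a board game for two players. Chess is played on a square "
        "board of eight rows and eight columns. The modern rules of chess "
        "emerged in Europe at the end of the fifteenth century.")

# Step 3: split the text into sentences.
sentences = nltk.sent_tokenize(text)

# Step 4: weighted word frequencies (frequency divided by the maximum frequency).
stop_words = set(stopwords.words("english"))
freq = {}
for word in nltk.word_tokenize(text.lower()):
    if word.isalpha() and word not in stop_words:
        freq[word] = freq.get(word, 0) + 1
max_freq = max(freq.values())
weighted = {w: f / max_freq for w, f in freq.items()}

# Step 5: score each sentence by the weighted frequencies of its words.
scores = {}
for sent in sentences:
    for word in nltk.word_tokenize(sent.lower()):
        if word in weighted:
            scores[sent] = scores.get(sent, 0) + weighted[word]

# Step 6: the summary is the top-scoring sentences.
summary = " ".join(heapq.nlargest(2, scores, key=scores.get))
print(summary)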
—This experiment does not include postlabs—
Experiment 11 - Text Summarisation with Spacy
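The code and screenshots for this experiment are not reproduced here; as a hedged sketch of the spaCy side of the comparison (it assumes the en_core_web_sm model has been downloaded):

import heapq
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Chess is a board game for two players. Chess is played on a square "
        "board of eight rows and eight columns. The modern rules of chess "
        "emerged in Europe at the end of the fifteenth century.")
doc = nlp(text)

# Word frequencies over non-stopword, non-punctuation tokens.
freq = {}
for token in doc:
    if not token.is_stop and not token.is_punct and not token.is_space:
        word = token.text.lower()
        freq[word] = freq.get(word, 0) + 1
max_freq = max(freq.values())

# Sentence scores from normalised word frequencies.
scores = {}
for sent in doc.sents:
    for token in sent:
        word = token.text.lower()
        if word in freq:
            scores[sent] = scores.get(sent, 0) + freq[word] / max_freq

# Summary: the top-scoring sentence(s).
summary = " ".join(sent.text for sent in heapq.nlargest(1, scores, key=scores.get))
print(summary)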
Conclusion :
Spacy vs NLTK:
- Spacy considered an entire paragraph as a sentence and gave a score for it, whereas NLTK considered every line between full stops as a sentence and gave a score to it.
- Spacy's word frequency table has fewer rows, whereas NLTK's word frequency table has more rows.
- Spacy considered even blank spaces.
- Spacy's summarised text has more lines because of the paragraph scores specified, whereas NLTK's summarised text has fewer lines because of the sentence scores specified.