
IV YEAR B.TECH I-SEMESTER
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DATA MINING LAB MANUAL

LORDS INSTITUTE OF ENGINEERING & TECHNOLOGY


Survey no.32, Himayathsagar, near police academy junction, Hyderabad-18

Vision of the institute:



Strive continuously for excellence in professional education through Quality,
Innovation, Team Work and Value creation, to emerge as a premier Institute in
the State and the Nation.

Mission of the institute:

1. To impart quality professional education that meets the needs of the present and
emerging technological world.
2. To strive for student achievement and success, while preparing them for life,
career and leadership.
3. To produce graduates with professional ethics and responsibility towards the
development of industry and society, and towards sustainable development.
4. To ensure abilities in the graduates to lead technical and management teams
for conception, development and management of projects for industrial and
national development.
5. To forge mutually beneficial relationships with government organizations,
industries, society and the alumni.
• The mission and vision are published in the department, laboratories and all the
instructional rooms.
• They are also provided on the college website and department notice boards.
• They are explained to students and their parents as a part of the induction
programme.
• The mission and vision are exhibited in the library and in the seminar halls.
• They are published in the lab manuals, newsletters and course files.

Based on the needs of local and global employers, industry, advances in
technology and opportunities for higher studies, the department has defined its
vision and mission.



Vision of the department

To emerge as a centre of excellence by imparting quality technical education


through innovation, team work and value creation, and to contribute to
advancement of knowledge in the field of Computer Science and
Engineering.

Mission of the department

1) Providing the students with an in-depth understanding of fundamentals and
practical training related to professional skills and their applications, through an
effective Teaching-Learning Process and state-of-the-art laboratories
pertaining to Computer Science and Engineering and interdisciplinary areas.
2) Preparing students by developing research, design, entrepreneurial skills and
employability capabilities.
3) Providing consultancy services and promoting Industry-Department
Interactions.



DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DM LAB
SYLLABUS


Program Educational Objectives (PEOs)

PEO 1: Shall have strong foundations in Basic Sciences, Mathematics, Computer
Science and allied engineering.

PEO 2: Shall be capable of identifying, formulating, analyzing and creating
engineering solutions using appropriate modern engineering techniques, design
skills, and tools to develop novel products, solutions and simulations for real-life
problems in Computer Science and Engineering.

PEO 3: Shall have successful and productive engineering careers, with emphasis on
technical competency and managerial skills, so that they are readily accepted by
industry with minimal orientation.

PEO 4: Shall have professional and lifelong learning skills, ethics, research skills
and leadership for independent working or the team spirit to work cohesively
within a group.
Course Name: Data Mining Lab (U21CS7L2)    Year of Study: 2024-25

C418.1 Student will be able to implement algorithms to solve data mining
problems using the WEKA tool.
C418.2 Ability to add a mining algorithm as a component to the existing tools.
C418.3 Ability to apply mining techniques to realistic data.
C418.4 Ability to demonstrate classification and clustering on large data sets.

PROGRAM OUTCOMES (POs)


Engineering Graduates will be able to:

PO1  Engineering knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of
complex engineering problems.

PO2  Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first
principles of mathematics, natural sciences, and engineering sciences.

PO3  Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for public health and safety, and the cultural,
societal, and environmental considerations.

PO4  Conduct investigations of complex problems: Use research-based knowledge
and research methods including design of experiments, analysis and interpretation
of data, and synthesis of the information to provide valid conclusions.

PO5  Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and modelling
to complex engineering activities with an understanding of the limitations.

PO6  The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.

PO7  Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for, sustainable development.

PO8  Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.

PO9  Individual and team work: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.

PO10 Communication: Communicate effectively on complex engineering activities
with the engineering community and with society at large, such as being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.

PO11 Project management and finance: Demonstrate knowledge and understanding
of the engineering and management principles and apply these to one's own work,
as a member and leader in a team, to manage projects and in multidisciplinary
environments.

PO12 Life-long learning: Recognize the need for, and have the preparation and
ability to engage in, independent and life-long learning in the broadest context of
technological change.

PROGRAM SPECIFIC OUTCOMES (PSOs)

PSO1 Professional Skills: The ability to research, understand and implement


computer programs in the areas related to algorithms, system software,
multimedia, web design, big data analytics, and networking for efficient
analysis and design of computer-based systems of varying complexity.



PSO2 Problem-Solving Skills: The ability to apply standard practices and
strategies in software project development using open-ended programming
environments to deliver a quality product for business success.
PSO3 Successful Career and Entrepreneurship: The ability to employ modern
computer languages, environments, and platforms in creating innovative
career paths, to be an entrepreneur, and a zest for higher studies.

Course Name: Data Mining Lab (U21CS7L2)    Year of Study: 2024-25

CO           PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
C418.1        3    -    2    2    3    -    -    -    -    -     -     -     2     -
C418.2        3    -    2    2    3    -    -    -    -    -     -     -     2     -
C418.3        3    -    2    2    3    -    -    -    -    -     -     -     2     -
C418.4        3    -    2    2    3    -    -    -    -    -     -     -     2     -
Avg (C418)    3    -    2    2    3    -    -    -    -    -     -     -     2     -

CO-PO mapping: PO1, PO3, PO4, PO5, PSO1

LIST OF EXPERIMENTS

S.No.  TITLE                                                              PAGE NO.

1   Data Processing Techniques:                                              10
    a) Data Cleaning
    b) Data Transformation – Normalization
    c) Data Integration
2   Partitioning – Horizontal, Vertical, Round Robin, Hash Based;            46
    Explore the WEKA Data Mining/Machine Learning Toolkit
3   Data Warehouse Schemas – Star, Snowflake, Fact Constellation             90
4   Data Cube Construction – OLAP Operations                                 93
5   Data Extraction, Transformation & Loading Operations                     96
6   Implementation of Attribute-Oriented Induction Algorithm                103
7   Implementation of Apriori Algorithm                                     108
8   Implementation of FP Algorithm                                          110
9   Implementation of FP-Growth Algorithm                                   117
10  Implementation of Decision Tree Induction                               126
11  Calculating Information Gain Measures                                   130
12  Classification of Data Using Bayesian Approach                          132
13  Implementation of K-Means Algorithm                                     134
14  Implementation of BIRCH Algorithm                                       140
15  Implementation of PAM Algorithm                                         158
16  Implementation of DBSCAN Algorithm                                      227



DATA MINING



EXPERIMENT 1 :

Data Processing Techniques :

Experiments using Weka & Pentaho Tools

1. Data Processing Techniques: (i) Data cleaning (ii) Data transformation –


Normalization

Here’s a step-by-step guide on how to perform Data cleaning and Data


transformation (normalization) using Weka and Pentaho tools:

1. Data Cleaning Using Weka

Weka is a popular tool for data mining and machine learning tasks. It also provides
functionalities for basic data preprocessing like data cleaning.

Steps:

1. Open Weka Explorer:


o Start Weka and choose the "Explorer" interface.
2. Load the Dataset:
o Click on the "Open file" button to load your dataset (in formats like
CSV, ARFF, etc.).
3. Identify Missing Values:
o In the "Preprocess" tab, you can check for missing values in the
dataset. If any, they will be indicated in the attribute summary.
4. Remove Instances with Missing Values:
o You can remove instances with missing values using the "Filter"
option.
o Choose filters.unsupervised.instance.RemoveWithValues.
o Set the parameters to remove instances with missing values.
5. Replace Missing Values:
o If you want to replace missing values instead of removing them, use
the filter:
o filters.unsupervised.attribute.ReplaceMissingValues.



o This will replace missing values with the mean/mode (for
numerical/categorical attributes).
6. Remove Outliers:
o Weka can flag outliers using the filter:
o filters.unsupervised.attribute.InterquartileRange (it adds Outlier and
ExtremeValue indicator attributes); the flagged instances can then be
removed with filters.unsupervised.instance.RemoveWithValues.
7. Save the Cleaned Dataset:
o After cleaning, you can save the modified dataset by clicking on the
"Save" button.

2. Data Transformation (Normalization) Using Weka

Normalization is essential for many machine learning algorithms, especially when


dealing with features that have different scales.

Steps:

1. Load the Dataset:


o As before, load the dataset in Weka.
2. Apply Normalization:
o In the "Preprocess" tab, use the filter:
o filters.unsupervised.attribute.Normalize.
o This filter normalizes all numeric attributes to a range between 0 and
1.
o Another option is filters.unsupervised.attribute.Standardize, which
will standardize attributes to have a mean of 0 and a standard
deviation of 1.
3. Save the Transformed Dataset:
o Once normalization is applied, save the transformed dataset for further
analysis.
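
The effect of the Normalize and Standardize filters can be approximated in pandas as well; the minimal sketch below assumes a hypothetical data.csv with numeric attributes.

import pandas as pd

df = pd.read_csv("data.csv")              # hypothetical input file
num = df.select_dtypes(include="number")  # numeric attributes only

# Min-max normalization to the range [0, 1] (what the Normalize filter does).
normalized = (num - num.min()) / (num.max() - num.min())

# Standardization to mean 0 and standard deviation 1 (Standardize filter).
standardized = (num - num.mean()) / num.std()

df[num.columns] = normalized              # or use `standardized`
df.to_csv("data_normalized.csv", index=False)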

3. Data Processing Using Pentaho Data Integration (PDI)

Pentaho Data Integration (PDI), also known as Kettle, is a powerful tool for data
processing, including ETL (Extract, Transform, Load) processes.

Steps:

1. Open Pentaho Data Integration (Spoon):


o Launch the Spoon application.
2. Create a New Transformation:
o Go to File -> New -> Transformation.
3. Load the Data:
o Use the "Input" step to load your dataset. This could be from various
sources like CSV, Excel, databases, etc.
4. Data Cleaning:
o Handling Missing Values:
 Use the "Select Values" step to select the columns you want to
clean.
 Use the "Replace Values" step to handle missing values, or use
the "Filter Rows" step to remove rows with missing data.
o Removing Duplicates:
 Use the "Unique Rows" step to remove duplicate records.
5. Data Transformation (Normalization):
o Use the "Normaliser" step to normalize your data.
o Alternatively, use "Calculator" to apply custom normalization
formulas.
6. Output the Processed Data:
o Use the "Output" step to save the cleaned and normalized data to a file
or database.
7. Run the Transformation:
o After configuring all steps, click on the "Run" button to execute the
transformation.

1. Data Integration Using Weka

Weka is not typically designed for complex data integration tasks like merging
datasets from different sources. However, you can perform basic data integration
operations, such as merging datasets with similar structures.

Steps:

1. Open Weka Explorer:


o Start Weka and choose the "Explorer" interface.
2. Load the First Dataset:
o Click on the "Open file" button and load your first dataset.
3. Merge Datasets:
o Weka allows you to merge datasets horizontally (adding new
attributes) or vertically (adding new instances).
o For horizontal merging (if the datasets have the same instances but
different attributes):



 Use the "Preprocess" tab and select the "Merge Two Datasets"
option under the "Edit" menu.
o For vertical merging (if the datasets have the same attributes but
different instances):
 Use the "Append" option to add instances from another dataset.
o Ensure that the datasets have matching structures (the same number of
attributes with the same data types) for a successful merge.
4. Save the Integrated Dataset:
o Once the datasets are merged, save the integrated dataset by clicking
on the "Save" button.

2. Data Integration Using Pentaho Data Integration (PDI)

Pentaho Data Integration (PDI) is well-suited for more complex data integration
tasks, especially when dealing with data from different sources (e.g., databases,
CSV files, Excel, etc.).

Steps:

1. Open Pentaho Data Integration (Spoon):


o Launch the Spoon application.
2. Create a New Transformation:
o Go to File -> New -> Transformation.
3. Load the First Data Source:
o Use the "Input" step (like "CSV Input," "Table Input," or "Excel
Input") to load your first dataset.
o Configure the connection details and select the data you want to
import.
4. Load Additional Data Sources:
o Use additional "Input" steps to load other datasets that you want to
integrate. These can be from different sources, such as other CSV
files, Excel files, databases, etc.
5. Join or Merge Data:
o Joining Data:
 If you need to join datasets based on a common key, use the
"Merge Join" step.
 Specify the join type (e.g., Inner join, Left join) and select the
key attributes for joining.
o Merging Data Vertically:



If the datasets have the same structure and you want to merge
them vertically, use the "Append Streams" step.
o Merging Data Horizontally:
 Use the "Join Rows (cartesian product)" step if you want to
create combinations of records from different datasets.
6. Handle Data Discrepancies:
o Use transformation steps like "Select Values," "Value Mapper," or
"Calculator" to harmonize data types and resolve discrepancies
between datasets.
7. Output the Integrated Data:
o Use the "Output" step (like "Text File Output" or "Table Output") to
save the integrated dataset.
8. Run the Transformation:
o Click on the "Run" button to execute the data integration process.

Conclusion:

 Weka: Suitable for basic data integration when datasets have similar
structures and are small in size.
 Pentaho Data Integration: Ideal for more complex data integration tasks
involving multiple data sources, different formats, and larger datasets.

Pentaho is generally preferred for comprehensive data integration tasks due to its
flexibility and wide range of functionalities.

EXPERIMENT – 2



Partitioning - Horizontal, Vertical, Round Robin, Hash based

Partitioning is a crucial data preprocessing technique, especially when dealing with


large datasets. It allows you to divide data into smaller, more manageable parts
based on different criteria. Below is a guide on how to perform horizontal, vertical,
round-robin, and hash-based partitioning using Weka and Pentaho Data Integration
(PDI).

1. Partitioning Using Weka

Weka is more commonly used for data mining and analysis than for data
partitioning, but you can perform basic horizontal and vertical partitioning through
its functionalities.

Horizontal Partitioning

Horizontal partitioning involves dividing the dataset into multiple subsets of rows
(instances).

Steps:

1. Open Weka Explorer:


o Start Weka and choose the "Explorer" interface.
2. Load the Dataset:
o Load your dataset by clicking on the "Open file" button.
3. Apply Filters for Partitioning:
o In the "Preprocess" tab, use the
filters.unsupervised.instance.RemovePercentage filter.
o Configure the filter to remove a percentage of instances (e.g., remove
50% to get the first half and then apply the filter again on the
remaining dataset to get the second half).
4. Save the Partitions:
o After partitioning, save the resulting subsets as separate files.
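
The same 50/50 horizontal split can be sketched in pandas; the file names below are hypothetical.

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# Keep the first 50% of the rows in one partition and the rest in another,
# mirroring two passes of the RemovePercentage filter.
cut = len(df) // 2
partition1 = df.iloc[:cut]
partition2 = df.iloc[cut:]

partition1.to_csv("partition_1.csv", index=False)
partition2.to_csv("partition_2.csv", index=False)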



Vertical Partitioning

Vertical partitioning involves splitting the dataset by columns (attributes).

Steps:

1. Load the Dataset:


o Load your dataset in Weka.
2. Select Attributes:
o In the "Preprocess" tab, use the "Remove" button under the
"Attributes" section to select which columns you want to remove or
keep.
o Create subsets of your data by removing different sets of attributes.
3. Save the Partitions:
o Save each partition as a separate file.

2. Partitioning Using Pentaho Data Integration (PDI)

Pentaho PDI offers more advanced and flexible partitioning methods, including
horizontal, vertical, round-robin, and hash-based partitioning.

Horizontal Partitioning

Steps:

1. Open Pentaho Data Integration (Spoon):


o Launch the Spoon application.
2. Create a New Transformation:
o Go to File -> New -> Transformation.
3. Load the Dataset:
o Use the "Input" step (like "CSV Input," "Table Input") to load your
dataset.
4. Apply Row Filtering:
o Use the "Filter Rows" or "Switch/Case" step to create multiple
branches based on conditions (e.g., filter by specific ranges of rows or
by a condition).
5. Output the Partitions:



o Use the "Output" step to save each filtered dataset to a different output
file.

Vertical Partitioning:

Steps:

1. Load the Dataset:


o Load your dataset using an "Input" step.
2. Select Attributes:
o Use the "Select Values" step to select specific columns (attributes) for
each partition.
3. Output the Partitions:
o Save the different attribute subsets as separate files.

Round-Robin Partitioning :

Round-robin partitioning distributes rows evenly across multiple partitions in a


cyclical manner.

Steps:

1. Load the Dataset:


o Load your dataset using an "Input" step.
2. Use Round-Robin Step:
o Use the "Row Distributor" step and configure it for round-robin
distribution.
o Define how many output branches (partitions) you want and connect
them.
3. Output the Partitions:
o Save each partition to a different file.

Hash-Based Partitioning :

Hash-based partitioning distributes rows based on a hash function applied to one or


more columns.



Steps:

1. Load the Dataset:


o Load your dataset using an "Input" step.
2. Apply Hash Function:
o Use the "Modified Java Script Value" step to create a hash function
based on one or more columns.
o Alternatively, use the "Partitioning" step to define a custom hash-
based partitioning method.
3. Distribute Based on Hash:
o Use "Switch/Case" or "Filter Rows" steps to distribute rows based on
the hash value.
4. Output the Partitions:
o Save each partition to a separate output.

Conclusion:

 Weka: Suitable for basic horizontal and vertical partitioning.


 Pentaho Data Integration: Highly versatile for all types of partitioning,
including round-robin and hash-based methods.

Pentaho’s flexibility makes it ideal for complex data partitioning tasks, especially
when you need to automate and scale these processes across large datasets.


1. Sales fact table - contains all details regarding sales
2. Orders fact table - in some cases the table can be split into open orders and
historical orders. Sometimes the values for historical orders are stored in a
sales fact table.
3. Budget fact table - usually grouped by month and loaded once at the
end of a year.
4. Forecast fact table - usually grouped by month and loaded daily,
weekly or monthly.
5. Inventory fact table - reports stock levels, usually refreshed daily

Dimension table

Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining Dimension Tables is to allow
browsing the categories quickly and easily.

The primary keys of each of the dimension tables are linked together to form the
composite primary key of the fact table. In a star schema design, there is only
one de-normalized table for a given dimension. Typical dimension tables in a
data warehouse are:



Experiment No. 2

Partitioning – Horizontal, Vertical, Round Robin, Hash based

i). Horizontal Partitioning

Definition: In horizontal partitioning, rows (records) of a table are divided into


different tables based on a certain condition.

Example:

import pandas as pd

# Sample DataFrame
data = {'ID': [1, 2, 3, 4, 5], 'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Horizontal partition based on condition (e.g., Value > 30)


partition1 = df[df['Value'] <= 30]
partition2 = df[df['Value'] > 30]

print(partition1)
print(partition2)

ii). Vertical Partitioning

Definition: In vertical partitioning, columns (attributes) of a table are divided into


different tables.

Example:

# Sample DataFrame
data = {'ID': [1, 2, 3, 4, 5], 'Value': [10, 20, 30, 40, 50], 'Category': ['A', 'B', 'A', 'B',
'A']}
df = pd.DataFrame(data)

# Vertical partitioning into two DataFrames


partition1 = df[['ID', 'Value']]
partition2 = df[['ID', 'Category']]

print(partition1)
print(partition2)

iii). Round Robin Partitioning

Definition: Round robin partitioning distributes records evenly across a specified


number of partitions in a cyclic manner.

Example:

# Sample DataFrame
data = {'ID': [1, 2, 3, 4, 5], 'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Round Robin partitioning
partitions = [[] for _ in range(3)]  # Create 3 partitions
for index, row in df.iterrows():
    partitions[index % 3].append(row)

for i, partition in enumerate(partitions):
    print(f"Partition {i + 1}: {partition}")
iv). Hash-Based Partitioning

Definition: Hash-based partitioning divides records based on the hash value of a


particular attribute.

Example:
# Sample DataFrame
data = {'ID': [1, 2, 3, 4, 5], 'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Hash-based partitioning using 'ID' as the hash attribute
partitions = {0: [], 1: []}  # Two partitions based on hash mod 2
for index, row in df.iterrows():
    partition_key = hash(row['ID']) % 2
    partitions[partition_key].append(row)

for key, partition in partitions.items():
    print(f"Partition {key}: {partition}")

Experiment No. 3

Objective: Create a simple data warehouse with three schemas: Star, Snowflake,
and Fact Constellation. Load sample data and perform queries to demonstrate their
structure and functionality.

Step 1: Environment Setup

1. Database: Use a relational database management system like MySQL or


PostgreSQL.
2. Client Tool: Use a SQL client like MySQL Workbench or pgAdmin to
execute SQL commands.

Step 2: Create the Database

CREATE DATABASE DataWarehouse;


USE DataWarehouse;



Step 3: Create Schemas

Star Schema:

1. Create Tables:

-- Sales Fact Table


CREATE TABLE SalesFact (
SaleID INT PRIMARY KEY,
ProductID INT,
CustomerID INT,
DateID INT,
Amount DECIMAL(10, 2),
Quantity INT
);

-- Dimension Tables
CREATE TABLE ProductDimension (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(100),
Category VARCHAR(100)
);

CREATE TABLE CustomerDimension (


CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
Region VARCHAR(100)
);

CREATE TABLE DateDimension (


DateID INT PRIMARY KEY,
Date DATE,
Day INT,
Month INT,
Year INT
);



Snowflake Schema:

1. Create Tables:

-- Sales Fact Table


CREATE TABLE SalesFact (
SaleID INT PRIMARY KEY,
ProductID INT,
CustomerID INT,
DateID INT,
Amount DECIMAL(10, 2),
Quantity INT
);

-- Dimension Tables
CREATE TABLE ProductDimension (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(100),
CategoryID INT
);

CREATE TABLE CategoryDimension (


CategoryID INT PRIMARY KEY,
CategoryName VARCHAR(100)
);

CREATE TABLE CustomerDimension (


CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
RegionID INT
);

CREATE TABLE RegionDimension (


RegionID INT PRIMARY KEY,
RegionName VARCHAR(100)
);

CREATE TABLE DateDimension (


DateID INT PRIMARY KEY,
Date DATE,
Day INT,
Month INT,
Year INT
);

Fact Constellation Schema:

1. Create Tables:

-- Sales Fact Table


CREATE TABLE SalesFact (
SaleID INT PRIMARY KEY,
ProductID INT,
CustomerID INT,
DateID INT,
Amount DECIMAL(10, 2),
Quantity INT
);

-- Inventory Fact Table


CREATE TABLE InventoryFact (
InventoryID INT PRIMARY KEY,
ProductID INT,
StoreID INT,
DateID INT,
StockLevel INT
);

-- Dimension Tables
CREATE TABLE ProductDimension (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(100),
Category VARCHAR(100)
);

CREATE TABLE CustomerDimension (


CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
Region VARCHAR(100)
);

CREATE TABLE StoreDimension (


StoreID INT PRIMARY KEY,
StoreName VARCHAR(100),
Location VARCHAR(100)
);

CREATE TABLE DateDimension (


DateID INT PRIMARY KEY,
Date DATE,
Day INT,
Month INT,
Year INT
);
Step 4: Load Sample Data

Load some sample data into the tables. Here’s an example for the Star Schema:

-- Sample Data for Product Dimension


INSERT INTO ProductDimension (ProductID, ProductName, Category) VALUES
(1, 'Laptop', 'Electronics'),
(2, 'Smartphone', 'Electronics'),
(3, 'Tablet', 'Electronics');

-- Sample Data for Customer Dimension


INSERT INTO CustomerDimension (CustomerID, CustomerName, Region)
VALUES
(1, 'Alice', 'North'),
(2, 'Bob', 'South');

-- Sample Data for Date Dimension


INSERT INTO DateDimension (DateID, Date, Day, Month, Year) VALUES
(1, '2023-10-01', 1, 10, 2023),
(2, '2023-10-02', 2, 10, 2023);

-- Sample Data for Sales Fact


INSERT INTO SalesFact (SaleID, ProductID, CustomerID, DateID, Amount,
Quantity) VALUES
(1, 1, 1, 1, 1000.00, 1),
(2, 2, 2, 2, 500.00, 2);

Repeat similar INSERT statements for the Snowflake and Fact Constellation
schemas, adapting the values as necessary.

Step 5: Perform Queries

1. Star Schema Queries:

-- Total Sales Amount by Product


SELECT
p.ProductName,
SUM(s.Amount) AS TotalSales
FROM
SalesFact s
JOIN
ProductDimension p ON s.ProductID = p.ProductID
GROUP BY
p.ProductName;

2. Snowflake Schema Queries:

-- Total Sales Amount by Category


SELECT
c.CategoryName,
SUM(s.Amount) AS TotalSales
FROM
SalesFact s
JOIN
ProductDimension p ON s.ProductID = p.ProductID
JOIN
CategoryDimension c ON p.CategoryID = c.CategoryID
GROUP BY
c.CategoryName;

3. Fact Constellation Queries:

-- Total Inventory by Store


SELECT
st.StoreName,
SUM(i.StockLevel) AS TotalStock
FROM
InventoryFact i
JOIN
StoreDimension st ON i.StoreID = st.StoreID
GROUP BY
st.StoreName;



Experiment No. 4

Objective: Create a data cube from a sample dataset and perform various OLAP
operations, such as slicing, dicing, drilling down, and rolling up.

Step 1: Environment Setup

1. Database: Use a relational database like MySQL or PostgreSQL.


2. Client Tool: Use a SQL client (like MySQL Workbench or pgAdmin) to
execute SQL commands.

Step 2: Create the Database and Tables

1. Create a New Database:

CREATE DATABASE OLAPDataCube;


USE OLAPDataCube;

2. Create Sample Tables:

Create tables for the dimensions and fact data.

Dimension Tables

 Product Dimension:

CREATE TABLE ProductDimension (


ProductID INT PRIMARY KEY,
ProductName VARCHAR(100),
Category VARCHAR(100)
);

 Customer Dimension:

CREATE TABLE CustomerDimension (


CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
Region VARCHAR(100)
);

 Date Dimension:

CREATE TABLE DateDimension (


DateID INT PRIMARY KEY,
Date DATE,
Year INT,
Month INT,
Day INT
);
Fact Table

 Sales Fact Table:

CREATE TABLE SalesFact (


SaleID INT PRIMARY KEY,
ProductID INT,
CustomerID INT,
DateID INT,
Amount DECIMAL(10, 2),
Quantity INT,
FOREIGN KEY (ProductID) REFERENCES ProductDimension(ProductID),
FOREIGN KEY (CustomerID) REFERENCES
CustomerDimension(CustomerID),
FOREIGN KEY (DateID) REFERENCES DateDimension(DateID)
);
Step 3: Load Sample Data

Insert sample data into the tables.

Product Dimension Data


INSERT INTO ProductDimension (ProductID, ProductName, Category) VALUES
(1, 'Laptop', 'Electronics'),
(2, 'Smartphone', 'Electronics'),
(3, 'Tablet', 'Electronics');



Customer Dimension Data
INSERT INTO CustomerDimension (CustomerID, CustomerName, Region)
VALUES
(1, 'Alice', 'North'),
(2, 'Bob', 'South'),
(3, 'Charlie', 'East');
Date Dimension Data
INSERT INTO DateDimension (DateID, Date, Year, Month, Day) VALUES
(1, '2023-10-01', 2023, 10, 1),
(2, '2023-10-02', 2023, 10, 2);
Sales Fact Data
INSERT INTO SalesFact (SaleID, ProductID, CustomerID, DateID, Amount,
Quantity) VALUES
(1, 1, 1, 1, 1000.00, 1),
(2, 2, 2, 2, 500.00, 2),
(3, 3, 3, 1, 300.00, 1);
Step 4: Data Cube Construction

You can construct a data cube using SQL queries. In OLAP, a data cube is
typically created by aggregating data across multiple dimensions.

-- Creating a basic data cube


SELECT
p.Category,
c.Region,
d.Year,
SUM(s.Amount) AS TotalSales,
SUM(s.Quantity) AS TotalQuantity
FROM
SalesFact s
JOIN
ProductDimension p ON s.ProductID = p.ProductID
JOIN
CustomerDimension c ON s.CustomerID = c.CustomerID
JOIN
DateDimension d ON s.DateID = d.DateID
GROUP BY
p.Category, c.Region, d.Year;
Step 5: OLAP Operations

1. Slicing:

Extracting a subset of the cube by fixing a dimension.

-- Slice for Electronics category


SELECT
c.Region,
d.Year,
SUM(s.Amount) AS TotalSales
FROM
SalesFact s
JOIN
ProductDimension p ON s.ProductID = p.ProductID
JOIN
CustomerDimension c ON s.CustomerID = c.CustomerID
JOIN
DateDimension d ON s.DateID = d.DateID
WHERE
p.Category = 'Electronics'
GROUP BY
c.Region, d.Year;
2. Dicing:

Selecting a sub-cube by selecting specific values from multiple dimensions.

-- Dice for specific regions and products


SELECT
p.ProductName,
c.CustomerName,
d.Year,
SUM(s.Amount) AS TotalSales
FROM
SalesFact s
JOIN
ProductDimension p ON s.ProductID = p.ProductID
JOIN
CustomerDimension c ON s.CustomerID = c.CustomerID
JOIN
DateDimension d ON s.DateID = d.DateID
WHERE
c.Region IN ('North', 'East') AND p.ProductID IN (1, 2)
GROUP BY
p.ProductName, c.CustomerName, d.Year;
3. Drill Down:

Increasing the detail level of the data.

-- Drill down to view monthly sales instead of yearly


SELECT
p.ProductName,
c.CustomerName,
d.Month,
SUM(s.Amount) AS TotalSales
FROM
SalesFact s
JOIN
ProductDimension p ON s.ProductID = p.ProductID
JOIN
CustomerDimension c ON s.CustomerID = c.CustomerID
JOIN
DateDimension d ON s.DateID = d.DateID
GROUP BY
p.ProductName, c.CustomerName, d.Month;
4. Roll Up:

Aggregating data to a higher level.

-- Roll up to get total sales by product category


SELECT
p.Category,
SUM(s.Amount) AS TotalSales
FROM
SalesFact s
JOIN
ProductDimension p ON s.ProductID = p.ProductID
GROUP BY
p.Category;

Experiment – 4
i) Develop an application to implement OLAP, roll up, drill down,
slice and dice operation

OLAP is an acronym for On-Line Analytical Processing. An OLAP system manages
large amounts of historical data, provides facilities for summarization and
aggregation, and stores and manages information at different levels of granularity.

Multidimensional Data: Sales volume as a function of product, month, and region.

Dimensions: Product, Location, Time.

Hierarchical summarization paths:

Product:  Industry -> Category -> Product
Location: Region -> Country -> City -> Office
Time:     Year -> Quarter -> Month/Week -> Day

OLAP operations:

The analyst can understand the meaning contained in the databases using multi-
dimensional analysis. By aligning the data content with the analyst's mental
model, the chances of confusion and erroneous interpretations are reduced. The
analyst can navigate through the database and screen for a particular subset of the
data, changing the data's orientations and defining analytical calculations. The
user-initiated process of navigating by calling for page displays interactively,
through the specification of slices via rotations and drill down/up is sometimes
called "slice and dice". Common operations include slice and dice, drill down, roll
up, and pivot.

Slice: A slice is a subset of a multi-dimensional array corresponding to a single


value for one or more members of the dimensions not in the subset.

Dice: The dice operation is a slice on more than two dimensions of a data cube (or
more than two consecutive slices).
Drill Down/Up: Drilling down or up is a specific analytical technique whereby
the user navigates among levels of data ranging from the most summarized (up)
to the most detailed (down).

Roll-up: A roll-up involves computing all of the data relationships for one or more
dimensions. To do this, a computational relationship or formula might be defined.
Pivot: To change the dimensional orientation of a report or page display.

Other operations

Drill through: through the bottom level of the cube to its back-end relational tables
(using SQL)

Slice and dice

The slice operation performs a selection on one dimension of the given cube,
resulting in a sub_cube.

The dice operation defines a sub_cube by performing a selection on two or more


dimensions.
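
These operations can be illustrated with a small, hypothetical sales table in pandas; this is a rough sketch of the ideas only, not of any particular OLAP engine.

import pandas as pd

# A tiny, hypothetical sales table with three dimensions and one measure.
sales = pd.DataFrame({
    "Category": ["Electronics", "Electronics", "Clothing", "Clothing"],
    "Region":   ["North", "South", "North", "South"],
    "Month":    [1, 1, 2, 2],
    "Amount":   [1000, 500, 300, 200],
})

# Slice: fix one dimension to a single value.
slice_ = sales[sales["Category"] == "Electronics"]

# Dice: select specific values on two or more dimensions.
dice = sales[sales["Region"].isin(["North"]) & (sales["Month"] == 1)]

# Roll-up: aggregate to a higher level (total sales per category).
rollup = sales.groupby("Category")["Amount"].sum()

# Drill-down: move to a finer level (sales per category and month).
drilldown = sales.groupby(["Category", "Month"])["Amount"].sum()

print(rollup, drilldown, sep="\n\n")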

Drill Down:

Before drill-down (sales totals for Asia):

             Food Line    Outdoor Line    CATEGORY_total
Asia         59,728       151,174         210,902

After drill-down (sales totals by country):

             Food Line    Outdoor Line    CATEGORY_total
Malaysia     618          9,418           10,036
China        33,198.5     74,165          107,363.5
India        6,918        0               6,918
Japan        13,871.5     34,965          48,836.5
Singapore    5,122        32,626          37,748
Belgium      7,797.5      21,125          28,922.5

Roll up:

Slice:

Dice:



(iii) Develop an application to construct a multidimensional data model.

Multidimensional Data Model:

One way to view the multidimensional data model is as a cube. The table on the left
contains detailed sales data by product, market and time; the cube on the right
associates sales numbers (units sold) with the dimensions product type, market and
time, with the unit variables organized as cells in an array.
This cube can be extended to include another array, price, which can be associated
with all or only some dimensions.
As the number of dimensions increases, the number of cube cells increases
exponentially. Dimensions are hierarchical in nature; e.g., the time dimension may
contain hierarchies for years, quarters, months, weeks and days, and geography may
contain country, state, city, etc.



Fig. A.0

The Multidimensional Data Model

This chapter describes the multidimensional data model and how it is


implemented in relational tables and standard form analytic workspaces.
It consists of the following topics:

 The Logical Multidimensional Data Model


 The Relational Implementation of the Model
 The Analytic Workspace Implementation of the Model

1. The Logical Multidimensional Data Model

The multidimensional data model is an integral part of On-Line Analytical


Processing, or OLAP. Because OLAP is on-line, it must provide answers
quickly; analysts pose iterative queries during interactive sessions, not in batch
jobs that run overnight. And because OLAP is also analytic, the queries are
complex. The multidimensional data model is designed to solve complex queries
in real time.

The multidimensional data model is important because it enforces simplicity. As


Ralph Kimball states in his landmark book, The Data Warehouse Toolkit:

"The central attraction of the dimensional model of a business is its simplicity....


that simplicity is the fundamental key that allows users to understand databases,
and allows software to navigate databases efficiently."

The multidimensional data model is composed of logical cubes, measures,


dimensions, hierarchies, levels, and attributes. The simplicity of the model is



inherent because it defines objects that represent real-world business entities.
Analysts know which business measures they are interested in examining, which
dimensions and attributes make the data meaningful, and how the dimensions of
their business are organized into levels and hierarchies.

Fig. A.1 shows the relationships among the logical objects.

Diagram of the Logical Multidimensional Model

Fig.A.1

a) Logical Cubes



Logical cubes provide a means of organizing measures that have the same shape,
that is, they have the exact same dimensions. Measures in the same cube have the
same relationships to other logical objects and can easily be analyzed and
displayed together.

b) Logical Measures

Measures populate the cells of a logical cube with the facts collected about
business operations. Measures are organized by dimensions, which typically
include a Time dimension.

An analytic database contains snapshots of historical data, derived from data in a


legacy system, transactional database, syndicated sources, or other data sources.
Three years of historical data is generally considered to be appropriate for
analytic applications.

Measures are static and consistent while analysts are using them to inform their
decisions. They are updated in a batch window at regular intervals: weekly,
daily, or periodically throughout the day. Many applications refresh their data by
adding periods to the time dimension of a measure, and may also roll off an
equal number of the oldest time periods. Each update provides a fixed historical
record of a particular business activity for that interval. Other applications do a
full rebuild of their data rather than performing incremental updates.

A critical decision in defining a measure is the lowest level of detail (sometimes


called the grain). Users may never view this base level data, but it determines the
types of analysis that can be performed. For example, market analysts (unlike
order entry personnel) do not need to know that Beth Miller in Ann Arbor,
Michigan, placed an order for a size 10 blue polka-dot dress on July 6, 2002, at
2:34 p.m. But they might want to find out which color of dress was most popular
in the summer of 2002 in the Midwestern United States.

The base level determines whether analysts can get an answer to this question.
For this particular question, Time could be rolled up into months, Customer
could be rolled up into regions, and Product could be rolled up into items (such
as dresses) with an attribute of color. However, this level of aggregate data could
not answer the question: At what time of day are women most likely to place an
order? An important decision is the extent to which the data has been pre-
aggregated before being loaded into a data warehouse.

c) Logical Dimensions

Dimensions contain a set of unique values that identify and categorize data. They
form the edges of a logical cube, and thus of the measures within the cube.
Because measures are typically multidimensional, a single value in a measure
must be qualified by a member of each dimension to be meaningful. For
example, the Sales measure has four dimensions: Time, Customer, Product, and
Channel. A particular Sales value (43,613.50) only has meaning when it is
qualified by a specific time period (Feb-01), a customer (Warren Systems), a
product (Portable PCs), and a channel (Catalog).

d) Logical Hierarchies and Levels

A hierarchy is a way to organize data at different levels of aggregation. In


viewing data, analysts use dimension hierarchies to recognize trends at one level,
drill down to lower levels to identify reasons for these trends, and roll up to
higher levels to see what affect these trends have on a larger sector of the
business.

Each level represents a position in the hierarchy. Each level above the base (or
most detailed) level contains aggregate values for the levels below it. The
members at different levels have a one-to-many parent-child relation. For
example, Q1-02 and Q2-02 are the children of 2002, thus 2002 is the parent of
Q1-02 and Q2-02.

Suppose a data warehouse contains snapshots of data taken three times a day,
that is, every 8 hours. Analysts might normally prefer to view the data that has
been aggregated into days, weeks, quarters, or years. Thus, the Time dimension
needs a hierarchy with at least five levels. Similarly, a sales manager with a
particular target for the upcoming year might want to allocate that target amount
among the sales representatives in his territory; the allocation requires a
dimension hierarchy in which individual sales representatives are the child
values of a particular territory.

Hierarchies and levels have a many-to-many relationship. A hierarchy typically


contains several levels, and a single level can be included in more than one
hierarchy.
e) Logical Attributes

An attribute provides additional information about the data. Some attributes are
used for display. For example, you might have a product dimension that uses
Stock Keeping Units (SKUs) for dimension members. The SKUs are an excellent
way of uniquely identifying thousands of products, but are meaningless to most
people if they are used to label the data in a report or graph. You would define
attributes for the descriptive labels.



You might also have attributes like colors, flavors, or sizes. This type of attribute
can be used for data selection and answering questions such as: Which colors
were the most popular in women's dresses in the summer of 2002? How does this
compare with the previous summer?

Time attributes can provide information about the Time dimension that may be
useful in some types of analysis, such as identifying the last day or the number of
days in each time period.

2. The Relational Implementation of the Model

The relational implementation of the multidimensional data model is typically a


star schema, as shown in Figure b, or a snowflake schema. A star schema is a
convention for organizing the data into dimension tables, fact tables, and
materialized views. Ultimately, all of the data is stored in columns, and metadata
is required to identify the columns that function as multidimensional objects.

In Oracle Database, you can define a logical multidimensional model for


relational tables using the OLAP Catalog or AWXML. The metadata
distinguishes level columns from attribute columns in the dimension tables and
specifies the hierarchical relationships among the levels. It identifies the various
measures that are stored in columns of the fact tables and aggregation methods
for the measures. And it provides display names for all of these logical objects.



FigA.2

a) Dimension Tables

A star schema stores all of the information about a dimension in a single table.
Each level of a hierarchy is represented by a column or column set in the
dimension table. A dimension object can be used to define the hierarchical
relationship between two columns (or column sets) that represent two levels of a
hierarchy; without a dimension object, the hierarchical relationships are defined
only in metadata. Attributes are stored in columns of the dimension tables.
A snowflake schema normalizes the dimension members by storing each level in
a separate table.
b) Fact Tables

Measures are stored in fact tables. Fact tables contain a composite primary key,
which is composed of several foreign keys (one for each dimension table) and a
column for each measure that uses these dimensions.

c) Materialized Views

Aggregate data is calculated on the basis of the hierarchical relationships defined


in the dimension tables. These aggregates are stored in separate tables, called
summary tables or materialized views. Oracle provides extensive support for
materialized views, including automatic refresh and query rewrite.

Queries can be written either against a fact table or against a materialized view.
If a query is written against the fact table that requires aggregate data for its
result set, the query is either redirected by query rewrite to an existing
materialized view, or the data is aggregated on the fly.

Each materialized view is specific to a particular combination of levels; in fig


A.2, only two materialized views are shown of a possible 27 (3 dimensions with
3 levels have 3**3 possible level combinations).
EXPERIMENT – 13:
K-Means Clustering Example with Weka Explorer

K-means is the most popular clustering algorithm. The user needs to specify
the number of clusters (k) in advance. The algorithm randomly selects k objects as
the initial cluster means (centers) and works towards optimizing the square-error
criterion function, defined as follows.
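
A standard statement of this criterion is:

E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^{2}

where x ranges over the objects in cluster C_i and m_i is the mean of C_i.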



The main steps of the k-means algorithm are:

1. Assign initial means m_i.

2. Assign each data object x to the cluster with the closest mean.

3. Compute the new mean of each cluster C_i.

4. Iterate until the criterion function converges, that is, there are no more new
assignments.

The following is an example of k-means on the weather data:

i. Select the Cluster tab from the upper tabs.

ii. Select SimpleKMeans from the Choose button.

iii. You can select the attributes for clustering.

iv. If class attribute is known, then user can select that attribute for “classes to
cluster evaluation” to check for accuracy of results.

v. In order to store the results, select “Store cluster for visualization”

vi. Click on start to run the algorithm.


vii. Right click on the result and select visualize cluster assignment.
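
The same clustering can also be reproduced programmatically; the sketch below uses scikit-learn's KMeans (an alternative to the Weka GUI, assuming scikit-learn is installed) on a small, hypothetical numeric dataset.

import numpy as np
from sklearn.cluster import KMeans

# A tiny, hypothetical numeric dataset; in the Weka example the nominal
# weather attributes would first have to be encoded numerically.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster assignment for each object

print("Cluster centers:\n", kmeans.cluster_centers_)
print("Assignments:", labels)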



Fig. 9: K means clustering in weka

Figure 10 shows the results of k-means on the weather data. The confusion matrix
specifies the classes of the obtained clusters, since we selected classes-to-clusters
evaluation. For example, cluster 0 has a total of 9 objects, of which the majority
(6) are from the "yes" category, so this cluster is treated as the "yes" cluster.
Similarly, cluster 1 has a total of 5 objects, of which 3 are from the "no"
category, so it is considered the "no" cluster.



Fig. 10: Results of K means on weather data

1. Perform data preprocessing tasks and demonstrate performing


association rule mining on data sets
A. Explore the various options available in Weka for preprocessing data
and apply them (e.g., Discretization filters, the Resample filter, etc.) to each
dataset.
Demonstration of preprocessing on dataset student.arff

Aim: This experiment illustrates some of the basic data preprocessing operations
that can be performed using WEKA-Explorer. The sample dataset used for this
example is the student data available in arff format.

1. Step 1: Loading the data. We can load the dataset into Weka by clicking on the
Open file button in the preprocess interface and selecting the appropriate file.

2. Step 2: Once the data is loaded, Weka will recognize the attributes and, during
the scan of the data, compute some basic statistics on each attribute.
The left panel in the above figure shows the list of recognized attributes, while
the top panel indicates the names of the base relation (table) and the current
working relation (which are the same initially).
3. Step 3: Clicking on an attribute in the left panel will show the basic statistics on
that attribute. For categorical attributes the frequency of each attribute value
is shown, while for continuous attributes we can obtain the minimum, maximum,
mean, standard deviation, etc.
4. Step 4: The visualization in the right panel appears in the form of a
cross-tabulation across two attributes.
5. Note: we can select another attribute using the dropdown list.
6. Step 5: Selecting or filtering attributes.
7. Removing an attribute: when we need to remove an attribute, we can do this by
using the attribute filters in Weka. In the Filter panel, click on the Choose
button; this will show a popup window with a list of available filters.
8. Scroll down the list and select the weka.filters.unsupervised.attribute.Remove
filter.
9. Step 6: a) Next, click the text box immediately to the right of the Choose
button. In the resulting dialog box, enter the index of the attribute to be filtered
out.
10. b) Make sure that the invertSelection option is set to false, then click OK. In
the filter box you will now see "Remove -R 7".
11. c) Click the Apply button to apply the filter to this data. This will remove the
attribute and create a new working relation.



d) Save the new working relation as an ARFF file by clicking the Save button on
the top panel (student.arff).

Discretization

1) Sometimes association rule mining can only be performed on categorical data.
This requires performing discretization on numeric or continuous attributes. In
the following example, let us discretize the age attribute.

Let us divide the values of the age attribute into three bins (intervals).

First load the dataset into Weka (student.arff).

Select the age attribute.

Activate the filter dialog box and select
weka.filters.unsupervised.attribute.Discretize from the list.

To change the defaults for the filter, click on the box immediately to the right of
the Choose button.

We enter the index of the attribute to be discretized. In this case the attribute is
age, so we must enter '1', corresponding to the age attribute.

Enter '3' as the number of bins. Leave the remaining field values as they are.

Click the OK button.

Click Apply in the filter panel. This will result in a new working relation with the
selected attribute partitioned into 3 bins.

Save the new working relation in a file called student-data-discretized.arff.
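
The effect of this equal-width binning can be sketched in pandas with pd.cut; the age values below are hypothetical.

import pandas as pd

ages = pd.DataFrame({"age": [23, 28, 34, 38, 42, 47, 55, 61]})  # hypothetical values

# Divide age into 3 equal-width bins, roughly what Weka's unsupervised
# Discretize filter does with bins = 3.
ages["age_binned"] = pd.cut(ages["age"], bins=3)
print(ages)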



Dataset student .arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low, medium, high}

@attribute student {yes, no}

@attribute credit-rating {fair, excellent}

@attribute buyspc {yes, no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no


30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%

The following screenshot shows the effect of discretization.



Experiment 7 :

Implementation of Apriori Algorithm :

Load each dataset into Weka and run Apriori algorithm with different
support and confidence values. Study the rules generated.

Demonstration of Association rule process on dataset contact lenses.arff using


apriori algorithm



Aim: This experiment illustrates some of the basic elements of association rule
mining using WEKA. The sample dataset used for this example is
contact-lenses.arff.

Step 1: Open the data file in Weka Explorer. It is presumed that the required data
fields have been discretized; in this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for association rule
algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g., support, confidence),
we click on the text box immediately to the right of the Choose button.

Dataset contact-lenses.arff:



The following screenshot shows the association rules that were generated when
apriori algorithm is applied on the given dataset.
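
Outside Weka, the same rule-mining idea can be sketched in Python with the mlxtend library (an assumption; it is not part of this lab's toolchain) on a few hypothetical one-hot encoded transactions.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions (hypothetical items).
transactions = pd.DataFrame({
    "bread":  [1, 1, 0, 1, 1],
    "butter": [1, 1, 0, 0, 1],
    "milk":   [0, 1, 1, 1, 1],
}).astype(bool)

# Frequent itemsets with minimum support 0.4.
itemsets = apriori(transactions, min_support=0.4, use_colnames=True)

# Rules with minimum confidence 0.7.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])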

A. Apply different discretization filters on numerical attributes and run the
Apriori association rule algorithm. Study the rules generated. Derive
interesting insights and observe the effect of discretization in the rule
generation process.

Demonstration of Association rule process on dataset test.arff using apriori


algorithm

Aim: This experiment illustrates some of the basic elements of association rule
mining using WEKA. The sample dataset used for this example is test.arff.

Step 1: Open the data file in Weka Explorer. It is presumed that the required data
fields have been discretized.

Step 2: Clicking on the Associate tab will bring up the interface for association rule
algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g., support, confidence),
we click on the text box immediately to the right of the Choose button.

Dataset test.arff

@relation test

@attribute admissionyear {2005,2006,2007,2008,2009,2010}

@attribute course {cse,mech,it,ece}

@data

2005, cse

2005, it

2005, cse

2006, mech

2006, it

2006, ece

2007, it
2007, cse

2008, it

2008, cse

2009, it

2009, ece

The following screenshot shows the association rules that were generated when
apriori algorithm is applied on the given dataset.



1. Demonstrate performing classification on data sets
Aim: This experiment illustrates the use of the J48 classifier in Weka. The sample
data set used in this experiment is the "student" data available in ARFF format.
This document assumes that appropriate data preprocessing has been performed.

Steps involved in this experiment:

Step-1: We begin the experiment by loading the data (student.arff)into weka.

Department of Computer Science and Engineering DM Lab


Step 2: Next we select the "Classify" tab and click the "Choose" button to select
the "J48" classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking
in the text box to the right of the choose button. In this example we accept the
default values. The default configuration performs some pruning (subtree raising)
but does not perform reduced-error pruning.

Step 4: Under the "Test options" panel we select 10-fold cross-validation as our
evaluation approach. Since we do not have a separate evaluation data set, this is
necessary to get a reasonable idea of the accuracy of the generated model.

Step 5: We now click "Start" to generate the model. The ASCII version of the tree
as well as the evaluation statistics will appear in the right panel when model
construction is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This
indicates that more work may be needed (either in preprocessing or in selecting
better parameters for the classification).

Step 7: Weka also lets us view a graphical version of the classification tree.
This can be done by right-clicking the last result set and selecting "Visualize
tree" from the pop-up menu.

Step 8: We will use our model to classify new instances.

Step 9: In the main panel, under "Test options", click the "Supplied test set"
radio button and then click the "Set" button. This will pop up a window which will
allow you to open the file containing test instances.
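The same workflow can be reproduced programmatically with the WEKA Java API. The
sketch below is a minimal example, under the assumption that a student.arff file
with the class attribute in the last position is available; it builds a J48 tree
with default options and reports 10-fold cross-validation results, mirroring the
Explorer steps above.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Student {
    public static void main(String[] args) throws Exception {
        // Load the training data; the class is assumed to be the last attribute
        Instances data = DataSource.read("student.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // J48 with default options (pruned tree, no reduced-error pruning)
        J48 tree = new J48();

        // 10-fold cross-validation, as chosen under "Test options" in the Explorer
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Build on the full data set and print the ASCII version of the tree
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}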

Dataset test.arff



@relation test

@attribute admissionyear {2005,2006,2007,2008,2009,2010}

@attribute course {cse,mech,it,ece}

@data

2005, cse

2005, it

2005, cse

2006, mech

2006, it

2006, ece

2007, it

2007, cse

2008, it

2008, cse

2009, it

2009, ece



The following screenshot shows the classification results that were generated
when the J48 classifier is applied to the supplied test set.



Demonstration of the classification rule process on dataset employee.arff using
the ID3 algorithm

Aim: This experiment illustrates the use of the ID3 classifier in Weka. The sample
data set used in this experiment is the "employee" data available in ARFF format.
This document assumes that appropriate data pre-processing has been performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (employee.arff) into Weka.



Step 2: Next we select the "Classify" tab and click the "Choose" button to select
the "ID3" classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking
in the text box to the right of the choose button. In this example we accept the
default values. Note that ID3 builds an unpruned tree and requires nominal
attributes with no missing values.

Step 4: Under the "Test options" panel we select 10-fold cross-validation as our
evaluation approach and click "Start" to build and evaluate the model.

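A minimal programmatic sketch of this run is given below. It assumes a
hypothetical employee.arff file whose attributes are all nominal with the class
in the last position, and a WEKA release that ships weka.classifiers.trees.Id3
(in recent 3.8 releases ID3 is obtained through the
simpleEducationalLearningSchemes package instead of the core distribution).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.Id3;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Id3Employee {
    public static void main(String[] args) throws Exception {
        // employee.arff is assumed to contain only nominal attributes,
        // with the class attribute in the last position
        Instances data = DataSource.read("employee.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Id3 id3 = new Id3();

        // 10-fold cross-validation, as selected under "Test options"
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(id3, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Train on the full data set and print the induced tree
        id3.buildClassifier(data);
        System.out.println(id3);
    }
}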
Demonstration of the clustering process on dataset iris.arff using the simple
k-means algorithm

Aim: This experiment illustrates the use of simple k-means clustering with the
Weka Explorer. The sample data set used for this example is based on the iris
data available in ARFF format. This document assumes that appropriate
preprocessing has been performed. The iris dataset includes 150 instances.

Steps involved in this experiment:

Step 1: Run the Weka Explorer and load the data file iris.arff in the
preprocessing interface.

Step 2: In order to perform clustering, select the "Cluster" tab in the Explorer
and click on the "Choose" button. This step results in a dropdown list of
available clustering algorithms.

Step 3: In this case we select "SimpleKMeans".

Step 4: Next, click the text box to the right of the choose button to get the
popup window shown in the screenshots. In this window we enter six as the number
of clusters and leave the seed value as it is. The seed is used to generate the
random numbers that drive the initial assignment of instances to clusters.



Step 5: Once the options have been specified, we run the clustering algorithm. In
the "Cluster mode" panel we make sure that the "Use training set" option is
selected, and then we click the "Start" button. This process and the resulting
window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as statistics
on the number and percentage of instances assigned to the different clusters. The
cluster centroids are the mean vectors of each cluster and can be used to
characterize the clusters. For example, the centroid of cluster 1 (class
Iris-versicolor) shows a mean sepal length of 5.4706, sepal width 2.4765, petal
width 1.1294, petal length 3.7941.
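The Explorer steps above can also be reproduced with the WEKA API. The following
minimal sketch assumes iris.arff is available locally; as a design choice of this
sketch the class attribute is removed before clustering so that only the four
measurements are used (in the Explorer the same effect is obtained by ignoring
the class attribute), and six clusters with the default seed of 10 mirror the
settings described above.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");

        // Drop the class attribute (last) so clustering uses only the measurements
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances features = Filter.useFilter(data, remove);

        // Six clusters and the default seed, mirroring the Explorer settings above
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(6);
        kmeans.setSeed(10);
        kmeans.buildClusterer(features);

        // Prints the cluster centroids and within-cluster sum of squared errors
        System.out.println(kmeans);
    }
}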

Step 7: Another way of understanding the characteristics of each cluster is
through visualization. To do this, right-click the result set in the result list
panel and select "Visualize cluster assignments".

The following screenshot shows the clusters that were generated when the simple
k-means algorithm is applied to the given dataset.



Explore visualization features of Weka to visualize the clusters. Derive
interesting insights and explain.

Interpretation of the above visualization

From the above visualization, we can understand the distribution of sepal length
and petal length in each cluster. For instance, the clusters are largely
distinguished by petal length. By changing the colour dimension to other
attributes we can see their distribution within each of the clusters.

Step 8: We can save the resulting dataset, which includes each instance along
with its assigned cluster. To do so, we click the save button in the
visualization window and save the result as iris k-mean. The top portion of
this file is shown in the
following figure.
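Programmatically, the same "dataset plus assigned cluster" output can be
produced with the AddCluster filter and an ARFF writer. This sketch is
illustrative only and reuses the assumptions of the previous sketch; the output
file name is arbitrary.

import java.io.File;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class SaveClusterAssignments {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");

        // AddCluster appends a nominal attribute holding each instance's cluster
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(6);
        kmeans.setSeed(10);

        AddCluster addCluster = new AddCluster();
        addCluster.setClusterer(kmeans);
        addCluster.setIgnoredAttributeIndices("last"); // ignore the class while clustering
        addCluster.setInputFormat(data);
        Instances clustered = Filter.useFilter(data, addCluster);

        // Write the augmented dataset to disk (file name chosen for this sketch)
        ArffSaver saver = new ArffSaver();
        saver.setInstances(clustered);
        saver.setFile(new File("iris-kmeans.arff"));
        saver.writeBatch();
    }
}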
PROCEDURE FOR ALL EXPERIMENTS WITH VIVA QUESTIONS

(A) PROCEDURE FOR ALL EXPERIMENTS


1. Aim of the Experiment
2. Requirements
(a) Software (WEKA tool)
(b) Hardware (Desktop, Internet connection)
3. Algorithm Analysis
4. Implementation of Algorithm
5. Results Analysis

(B) VIVA QUESTIONS


1. What are the benefits of a data warehouse?
o A data warehouse helps to integrate data and store them historically so that
we can analyze different aspects of the business, including performance
analysis, trends and prediction, over a given time frame and use the results of
our analysis to improve the efficiency of business processes.

2. What is the difference between OLTP and OLAP?


o OLTP is the transaction system that collects business data. Whereas OLAP
is the reporting and analysis system on that data.
OLTP systems are optimized for INSERT, UPDATE operations and
therefore highly normalized. On the other hand, OLAP systems are
deliberately denormalized for fast data retrieval through SELECT
operations.

3. What is data mart?


o Data marts are generally designed for a single subject area. An organization
may have data pertaining to different departments like Finance, HR,
Marketing, etc. stored in the data warehouse, and each department may have
separate data marts. These data marts can be built on top of the data
warehouse.

4. What is dimension?
o A dimension is something that qualifies a quantity (measure).
For an example, consider this: If I just say… “20kg”, it does not mean
anything. But if I say, "20kg of Rice (Product) is sold to Ramesh (customer)
on 5th April (date)", then that gives a meaningful sense. The product,
customer and date are dimensions that qualify the measure, 20 kg.
Dimensions are mutually independent. Technically speaking, a dimension is
a data element that categorizes each item in a data set into non-overlapping
regions.

5. What is Fact?
o A fact is something that is quantifiable (Or measurable). Facts are typically
(but not always) numerical values that can be aggregated.

6. Briefly state the difference between a data warehouse and a data mart.
o A data warehouse is made up of many data marts. A DWH contains many subject
areas, but a data mart generally focuses on one subject area. For example, if
there is a DWH for a bank, there can be one data mart for accounts, one for
loans, etc. These are high-level definitions. Metadata is data about data; for
example, if a data mart receives a file, the metadata will contain information
such as how many columns there are, whether the file is fixed-width or
delimited, the ordering of fields, the datatypes of the fields, etc.

7. What is the difference between dependent data warehouse and


independent data warehouse?
o A dependent data mart draws its data from a central data warehouse, whereas an
independent data mart is built directly from operational systems or external
files without a central warehouse. There is also a third, hybrid type of data
mart that takes source data both from operational systems or external files and
from the central data warehouse.



8. What are the storage models of OLAP?
o ROLAP, MOLAP and HOLAP

9. What are CUBES?


o A data cube stores data in a summarized version which helps in a faster
analysis of data. The data is stored in such a way that it allows reporting
easily.
o E.g. using a data cube A user may want to analyze weekly, monthly
performance of an employee. Here, month and week could be considered as
the dimensions of the cube.

10. What is MODEL in Data mining world?


o Models in Data mining help the different algorithms in decision making or
pattern matching. The second stage of data mining involves considering
various models and choosing the best one based on their predictive
performance.

11. Explain how to mine an OLAP cube.


o A data mining extension (DMX) can be used to slice the data of the source cube
in the order discovered by data mining. When a cube is mined, the case table is
a dimension.

12. Explain how to use DMX-the data mining query language.


o Data mining extension is based on the syntax of SQL. It is based on
relational concepts and mainly used to create and manage the data mining
models. DMX comprises of two types of statements: Data definition and
Data manipulation. Data definition is used to define or create new models,
structures.

13. Define Rollup and cube.


o Custom rollup operators provide a simple way of controlling the process of
rolling up a member to its parent's values. The rollup uses the contents of the
column as custom rollup operator for each member and is used to evaluate
the value of the member’s parents.
If a cube has multiple custom rollup formulas and custom rollup members,
then the formulas are resolved in the order in which the dimensions have
been added to the cube.

14. Differentiate between Data Mining and Data warehousing.



o Data warehousing is merely extracting data from different sources, cleaning
the data and storing it in the warehouse, whereas data mining aims to examine or
explore the data using queries. These queries can be fired on the data
warehouse. Exploring the data with data mining helps in reporting, planning
strategies, finding meaningful patterns, etc.
E.g. a data warehouse of a company stores all the relevant information about
projects and employees. Using data mining, one can use this data to generate
different reports, such as profits generated.

15. What is Discrete and Continuous data in Data mining world?


o Discrete data can be considered as defined or finite data, e.g. mobile
numbers or gender. Continuous data can be considered as data which changes
continuously and in an ordered fashion, e.g. age.

16. What is a Decision Tree Algorithm?


o A decision tree is a tree in which every node is either a leaf node or a
decision node. This tree takes an object as input and outputs a decision.
All Paths from root node to the leaf node are reached by either using AND
or OR or BOTH. The tree is constructed using the regularities of the data.
The decision tree is not affected by Automatic Data Preparation.

17. What is Naïve Bayes Algorithm?


o Naïve Bayes Algorithm is used to generate mining models. These models
help to identify relationships between input columns and the predictable
columns. This algorithm can be used in the initial stage of exploration. The
algorithm calculates the probability of every state of each input column given
each possible state of the predictable column. After the model is built, the
results can be used for exploration and making predictions.
18. Explain clustering algorithm.
o Clustering algorithm is used to group sets of data with similar characteristics
also called as clusters. These clusters help in making faster decisions, and
exploring data. The algorithm first identifies relationships in a dataset
following which it generates a series of clusters based on the relationships.
The process of creating clusters is iterative. The algorithm redefines the
groupings to create clusters that better represent the data.

19. Explain Association algorithm in Data mining?



o The association algorithm is used in recommendation engines based on market
basket analysis. Such an engine suggests products to customers based on what
they bought earlier. The model is built on a dataset containing identifiers,
both for individual cases and for the items that the cases contain. A group of
items in a data set is called an item set. The algorithm traverses the data set
to find items that appear together in cases; the MINIMUM_SUPPORT parameter
controls how frequently associated items must occur to be included in an item
set.

20. What are the goals of data mining?


o Prediction, identification, classification and optimization

21. Is data mining independent subject?


o No, it is an interdisciplinary subject. It includes database technology,
visualization, machine learning, pattern recognition, algorithms, etc.

22. What are different types of database?


o Relational database, data warehouse and transactional database.

23. What are the data mining functionalities?


o Mining frequent patterns, association rules, classification and prediction,
clustering, evolution analysis and outlier analysis.

24. What are issues in data mining?


o Issues in mining methodology, performance issues, user interactive issues,
different source of data types issues etc.

25. List some applications of data mining.


o Agriculture, biological data analysis, call record analysis, DSS, Business
intelligence system etc

26. What do you mean by interesting pattern?


o A pattern is said to be interesting if it is (1) easily understood by humans,
(2) valid, (3) potentially useful and (4) novel.

27. Why do we pre-process the data?


o To ensure the data quality. [accuracy, completeness, consistency, timeliness,
believability, interpret-ability]

28. What are the steps involved in data pre-processing?


o Data cleaning, data integration, data reduction, data transformation.

29. What is distributed data warehouse?


o Distributed data warehouse shares data across multiple data repositories for
the purpose of OLAP operation.

30. Define virtual data warehouse.


o A virtual data warehouse provides a compact view of the data inventory. It
contains meta data and uses middle-ware to establish connection between
different data sources.

31. What are the different data warehouse models?


o Enterprise data warehouse
o Data marts
o Virtual data warehouse

32. List few roles of data warehouse manager.


o Creation of data marts, handling users, concurrency control, updation etc,

33. What are different types of cuboids?


o 0-D cuboids are called as apex cuboids
o n-D cuboids are called base cuboids
o Middle cuboids

34. What are the forms of multidimensional model?


o Star schema
o Snow flake schema
o Fact constellation Schema

35. What are frequent patterns?


o A set of items that appear frequently together in a transaction data set.
o E.g. milk, bread, sugar

36. What are the issues regarding classification and prediction?


o Preparing data for classification and prediction
o Comparing classification and prediction

37. Define model over fitting.


o A model that fits the training data too well may still have a high
generalization error. Such a situation is called model overfitting.
38. What are the methods to remove model over fitting?
o Pruning [Pre-pruning and post pruning)
o Constraint in the size of decision tree
o Making stopping criteria more flexible

39. What is regression?


o Regression can be used to model the relationship between one or more
independent and dependent variables.
o Linear regression and non-linear regression

40. Compare the K-means and K-medoids algorithms.


o K-medoids is more robust than K-means in the presence of noise and outliers,
but K-medoids can be computationally costly.

41. What is K-nearest neighbor algorithm?


o It is one of the lazy learner algorithms used in classification. It finds the
k nearest neighbors of the point of interest.

42. What is Baye's Theorem?


o P(H|X) = P(X|H) * P(H) / P(X)

43. What is concept Hierarchy?


o It defines a sequence of mapping from a set of low level concepts to higher -
level, more general concepts.

44. What are the causes of model over fitting?


o Due to presence of noise
o Due to lack of representative samples
o Due to multiple comparison procedure

45. What is decision tree classifier?


o A decision tree is a hierarchical classifier which compares data against a
range of properly selected features.

46. If there are n dimensions, how many cuboids are there?


o There would be 2^n cuboids.

47. What is spatial data mining?



o Spatial data mining is the process of discovering interesting, useful,
non-trivial patterns from large spatial datasets.
Spatial Data Mining = Mining Spatial Data Sets (i.e. Data Mining
+ Geographic Information Systems)

48. What is multimedia data mining?


o Multimedia Data Mining is a subfield of data mining that deals with
an extraction of implicit knowledge, multimedia data relationships, or other
patterns not explicitly stored in multimedia databases

49. What are different types of multimedia data?


o image, video, audio

50. What is text mining?


o Text mining is the procedure of synthesizing information by analyzing
relations, patterns and rules among textual data. These procedures include
text summarization, text categorization and text clustering.

51. List some application of text mining.


o Customer profile analysis
o patent analysis
o Information dissemination
o Company resource planning

52. What do you mean by web content mining?


o Web content mining refers to the discovery of useful information from
Web contents, including text, images, audio, video, etc.

53. Define web structure mining and web usage mining.


o Web structure mining studies the model underlying the link structures of
the Web. It has been used for search engine result ranking and other
Web applications.

Web usage mining focuses on using data mining techniques to analyze search
logs to find interesting patterns. One of the main applications of Web usage
mining is learning user profiles.

54. What is data warehouse?



o A data warehouse is an electronic storage of an organization's historical data
for the purpose of reporting, analysis and data mining or knowledge
discovery.

58. What are frequent patterns?


o These are the patterns that appear frequently in a data set.
o item-set, sub sequence, etc

60. What is data characterization?


o Data characterization is a summarization of the general features of a target
class of data; for example, analyzing the features of software products whose
sales increased by 10%.

61. What is data discrimination?


o Data discrimination is the comparison of the general features of the target
class objects against one or more contrasting objects.

62. What can business analysts gain from having a data warehouse?



o First, having a data warehouse may provide a competitive advantage by
presenting relevant information from which to measure performance and
make critical adjustments in order to help win over competitors.
Second, a data warehouse can enhance business productivity because it is
able to quickly and efficiently gather information that accurately describes
the organization.
Third, a data warehouse facilitates customer relationship
management because it provides a consistent view of customers and items
across all lines of business, all departments and all markets.
Finally, a data warehouse may bring about cost reduction by tracking
trends, patterns, and exceptions over long periods in a consistent and reliable
manner.

63. Why is association rule necessary?


o In data mining, association rule learning is a popular and well researched
method for discovering interesting relations between variables in large
databases.
o It is intended to identify strong rules discovered in databases using
different measures of interestingness.

64. What are two types of data mining tasks?


o Descriptive task
o Predictive task

65. Define classification.


o Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts.

66. What are outliers?


o A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are called outliers.

67. What do you mean by evolution analysis?


o Data evolution analysis describes and models regularities or trends for
objects whose behavior changes over time.
Although this may include characterization, discrimination, association
and correlation analysis, classification, prediction, or clustering of time
related data.
Distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data
analysis.

68. Define KDD.


o The process of finding useful information and patterns in data.

69. What are the components of data mining?


Database, Data Warehouse, World Wide Web, or other information
repository
Database or Data Warehouse Server
Knowledge Base
Data Mining Engine
Pattern Evaluation Module
User Interface

70. Define metadata.


o A database that describes various aspects of data in the warehouse is
called metadata.

71. What are the usage of metadata?


o Map source system data to data warehouse tables
Generate data extract, transform, and load procedures for import jobs
Help users discover what data are in the data warehouse
Help users structure queries to access data they need

72. List the demerits of distributed data warehouse.


o There is no metadata, no summary data or no individual DSS (Decision
Support System) integration or history. All queries must be repeated,
causing additional burden on the system.
Since queries compete with production transactions, performance can be
degraded.
There is no refreshing process, causing the queries to be very complex.

73. Define HOLAP.


o The hybrid OLAP approach combines ROLAP and MOLAP
technology.



74. What are data mining techniques?
o Association rules
o Classification and prediction
o Clustering
o Deviation detection
o Similarity search
o Sequence Mining

75. List different data mining tools.


o Traditional data mining tools
o Dashboards
o Text mining tools

76. Define sub sequence.


o A subsequence, such as buying first a PC, then a digital camera, and then a
memory card, if it occurs frequently in a shopping history database, is a
(frequent) sequential pattern.

78. What is the main goal of data mining?


o Prediction

79. List the typical OLAP operations.


o Roll UP
o DRILL DOWN
o ROTATE
o SLICE AND DICE
o Drill-through and drill-across

80. If there are 3 dimensions, how many cuboids are there in cube?
o 2^3 = 8 cuboids

81. Differentiate between star schema and snowflake schema.


o •Star schema is a multi-dimensional model where each of its disjoint
dimensions is represented in a single table.
•Snowflake schema is a normalized multi-dimensional schema where each disjoint
dimension is represented in multiple tables.
•Star schema can become a snow-flake
•Both star and snowflake schemas are dimensional models; the difference is
in their physical implementations.
•Snowflake schemas support ease of dimension maintenance because they
are more normalized.
•Star schemas are easier for direct user access and often support simpler and
more efficient queries.
•It may be better to create a star version of the snowflaked dimension for
presentation to the users

82. List the advantages of star schema.


o •Star Schema is very easy to understand, even for non technical
business manager.
•Star Schema provides better performance and smaller query times
•Star Schema is easily extensible and will handle future changes easily

83. What are the characteristics of data warehouse?


o Integrated
o Non-volatile
o Subject oriented
o Time-variant

84. Define support and confidence.


o The support of a rule X->Y is the fraction of all transactions that contain
both X and Y. The confidence of a rule X->Y is the fraction of the transactions
containing X that also contain Y, i.e. support(X->Y) divided by support(X). For
example, if 2 of 10 transactions contain both bread and milk and 4 contain
bread, then support(bread->milk) = 0.2 and confidence(bread->milk) = 0.5.

85. What are the criteria on the basic of which classification and prediction
can be compared?
o speed, accuracy, robustness, scalability, goodness of rules, interpret-ability

86. What is Data purging?


o The process of cleaning junk data is termed as data purging. Purging data
would mean getting rid of unnecessary NULL values of columns. This
usually happens when the size of the database gets too large.
