
(A Constituent College of Anna University, Chennai)

Villupuram 605103

CCS341
DATA WAREHOUSE LABORATORY
EX NO:1 DATA EXPLORATION AND INTEGRATION WITH WEKA - WEATHER DATASET

Aim:

The goal of this lab is to install Weka and become familiar with it. To demonstrate the available preprocessing features, we will use the Weather dataset.

Procedure:

Step1: Download and install Weka.

Step 2: Open Weka and have a look at the interface. It is an open-source project written in Java from the University of Waikato.
Step 3: Click on the Explorer button on the right side
Step 4: Weka comes with a number of small datasets. Those files are located at C:\Program Files\Weka-3-8 (if it is installed at this location; otherwise, search for Weka-3-8 to find the installation location). In this folder, there is a subfolder named 'data'. Open that folder to see all files that come with Weka.

Using the Open file... option under the Preprocess tab, select the weather.nominal.arff file.

DATASET
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

When opening the file, the screen looks like this.

Step 5: Check different tabs to familiarize with the tool.

Understanding Data
Let us first look at the highlighted Current relation sub window. It shows the name of the dataset that is
currently loaded. You can infer two points from this sub window −
• There are 14 instances - the number of rows in the table.
• The table contains 5 attributes - the fields, which are discussed in the upcoming sections.
On the left side, notice the Attributes sub window that displays the various fields in the database.
The weather dataset contains five fields - outlook, temperature, humidity, windy and play. When you select
an attribute from this list by clicking on it, further details on the attribute itself are displayed on the right
hand side.
Let us select the temperature attribute first. When clicking on it, we would see the following screen −
In the Selected Attribute subwindow, you can observe the following −
• The name and the type of the attribute are displayed.
• The type for the temperature attribute is Nominal.
• The number of Missing values is zero.
• There are three distinct values with no unique value.
• The table underneath this information shows the nominal values for this field as hot, mild and cool.
• It also shows the count and weight in terms of a percentage for each nominal value.
At the bottom of the window, you see the visual representation of the class values.
If you click on the Visualize All button, you will be able to see all features in one single window as shown
here −

Removing Attributes:
Often, the data that you want to use for model building comes with many irrelevant fields. For example, a customer database may contain the customer's mobile number, which is irrelevant when analysing their credit rating.
To remove attributes, select them and click on the Remove button at the bottom.
The selected attributes would be removed from the database. After we fully pre-process the data, we can
save it for model building.

Applying Filters:

Some machine learning techniques, such as association rule mining, require categorical data. To illustrate the use of filters, we will use the weather.numeric.arff dataset, which contains two numeric attributes
- temperature and humidity.

We will convert these to nominal by applying a filter on our raw data. Click on the Choose button in
the Filter subwindow and select the following filter −

weka→filters→supervised→attribute→Discretize
Click on the Apply button and examine the temperature and/or humidity attribute. You will notice that
these have changed from numeric to nominal types.

After we fully pre-process the data, we can save it for model building.
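The same Discretize filter can also be applied outside the GUI. The following is a minimal sketch, assuming the third-party python-weka-wrapper3 package and a Java runtime are installed; it is an illustration, not part of the manual's GUI workflow.

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.filters import Filter

jvm.start()
# Load the numeric weather dataset and mark the last attribute (play) as the class
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("weather.numeric.arff")
data.class_is_last()

# Apply the supervised Discretize filter, as done in the Explorer GUI
discretize = Filter(classname="weka.filters.supervised.attribute.Discretize")
discretize.inputformat(data)
discretized = discretize.filter(data)

print(discretized)   # temperature and humidity are now nominal
jvm.stop()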

Result:

Thus the data exploration and integration with WEKA is successfully executed.
EX NO:2 APPLY WEKA TOOL FOR DATA VALIDATION

Aim:
To apply the Weka tool to a dataset for data validation. Weka supports several standard data mining tasks such as data pre-processing, clustering, classification, regression, visualization and feature selection.

Preprocess :

Initially as you open the explorer, only the Pre-process tab is enabled. The first step is to pre-process
the data. Thus, in the Pre-process option, we will select the data file, process it and make it fit for applying
the various algorithms.
Loading Data:

The first four buttons at the top of the preprocess section enable you to load data into WEKA

1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file system.

2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.

3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit the file in
weka/experiment/DatabaseUtils.props.)

4. Generate.... Enables you to generate artificial data from a variety of Data Generators. Using the Open
file... button you can read files in a variety of formats:

WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
DATASET
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Select classify tab in WEKA explorer.

The Classify tab provides several machine learning algorithms for the classification of your data. To list a few, we may apply algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is very exhaustive and provides both supervised and unsupervised machine learning algorithms.

Cross-Validation
Procedure for cross-validation:
1. Load the dataset into the Weka tool.
2. Go to the Classify tab; in the left-hand navigation panel, different classification algorithms are listed under the functions section.
3. Select the Linear Regression algorithm and click Start with the cross-validation test option set to 10 folds.
4. The regression model and its result are obtained as shown below.

Here, we enabled the cross-validation test option with 10 folds and clicked the Start button, as represented below.
Using the cross-validation strategy with 20 folds: here, we enabled the cross-validation test option with 20 folds and clicked the Start button, as represented below.

Comparing the cross-validation results above, the error rate is lower with 20 folds (97.3% correctly classified) than with 10 folds (94.6% correctly classified).
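For comparison only, the fold count can also be varied programmatically. Below is a rough scikit-learn sketch of 10-fold versus 20-fold cross-validation; the dataset and classifier are illustrative stand-ins and are not part of the Weka workflow.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative dataset and model (stand-ins for the Weka experiment)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

for folds in (10, 20):
    scores = cross_val_score(model, X, y, cv=folds)
    print(f"{folds}-fold CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")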

Result:

Thus the data validation using WEKA is successfully executed.


Ex:No: 3
Real-time anomaly detection with Apache Kafka and Python
(Plan the architecture for real time application)

Aim:
To make real-time predictions on incoming stream data from Apache Kafka, and to implement notification messages for data such as credit card transactions, GPS logs and system consumption metrics.

Project ideas:
• Train an anomaly detection algorithm using unsupervised machine learning.
• Create a new data producer that sends the transactions to a Kafka topic.
• Read the data from the Kafka topic to make the prediction using the trained ml model.
• If the model detects that the transaction is not an inlier, send it to another Kafka topic.
• Create the last consumer that reads the anomalies and sends an alert to a Slack channel.

Architecture:

Procedure:
Step 1: Project Structure:
i) First, check settings.py; it has some variables to set, like the Kafka broker host and port; leave the defaults (listening on localhost and the default ports of Kafka and ZooKeeper).
ii) The streaming/utils.py file contains the configurations to create Kafka consumers and producers.
iii) Install the requirements.

Step 2: Train The Model


i) Generate random data; it will have two variables.
ii) Train an Isolation Forest model to detect the outliers (it isolates the data points by tracing random splits over one of the (sampled) variables' axes and, after several iterations, measures how "hard" it was to isolate each observation).
Step 3:Create The Topics
i) "transactions," where the producer will send new transaction records.
ii) "anomalies," the module that detects anomalies will send the data, and the last consumer will read it to
send a slack notification:

Step 4:Transaction Producer


i) Generate the first producer, which will send new data to the Kafka topic "transactions"; use the confluent-kafka package, in the file streaming/producer.py.
ii) The producer will send data to the Kafka topic; with a probability of OUTLIERS_GENERATION_PROBABILITY the data will come from an "outlier generator". Each record carries an auto-incremented id, the data needed for the machine learning model, and the current time in UTC.

Step 5:Outlier Detector Consumer


i) To make the predictions and filter the outliers; this is done in the streaming/anomalies_detector.py file.
ii) A consumer reads messages from the "transactions" topic, and a producer sends the detected outliers to the "anomalies" topic.

Step 6:Slack notification


i) To take some actions with these detected outliers; in a real-life scenario, it could block a transaction,
scale a server, generate a recommendation, send an alert to an administrative user.

Program:
Step 1: Install the requirements

pip install -r requirements.txt

Step 2: Train and build the model
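The manual shows this step as a screenshot; the following is a minimal sketch of training and saving the model, assuming scikit-learn and joblib are installed. The generated data, parameters and file name are illustrative.

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Generate random training data with two variables (illustrative)
rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(500, 2) + 2

# Train the Isolation Forest outlier detector and persist it for the consumer
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(X_train)
joblib.dump(model, "isolation_forest.joblib")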

Step 3: Create the topics


kafka-topics.sh --zookeeper localhost:2181 --topic transactions --create --partitions 3 --replication-factor 1
kafka-topics.sh --zookeeper localhost:2181 --topic anomalies --create --partitions 3 --replication-factor 1
Step 4: Transaction producer

kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic transactions
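A minimal sketch of streaming/producer.py, assuming the confluent-kafka package; the payload fields and timing are illustrative, not the manual's exact code.

import json
import random
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

for tx_id in range(100):
    # Auto-incremented id, the two model features and the current UTC timestamp
    record = {"id": tx_id,
              "amount": random.gauss(2, 0.3),
              "latency": random.gauss(2, 0.3),
              "ts": time.time()}
    producer.produce("transactions", value=json.dumps(record).encode("utf-8"))
    producer.poll(0)   # serve delivery callbacks
    time.sleep(0.1)

producer.flush()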

Step 5: Outlier detector consumer
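A minimal sketch of streaming/anomalies_detector.py, assuming confluent-kafka, scikit-learn and the model file saved in Step 2; names and fields are illustrative.

import json
import joblib
from confluent_kafka import Consumer, Producer

model = joblib.load("isolation_forest.joblib")
consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "anomaly-detector",
                     "auto.offset.reset": "earliest"})
consumer.subscribe(["transactions"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    features = [[record["amount"], record["latency"]]]
    if model.predict(features)[0] == -1:   # -1 means outlier
        producer.produce("anomalies", value=json.dumps(record).encode("utf-8"))
        producer.flush()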


Step 6: Slack notification
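A minimal sketch of the Slack alert consumer (streaming/bot_alerts), assuming the requests package and a Slack incoming-webhook URL provided through a SLACK_WEBHOOK_URL environment variable; both are assumptions for illustration.

import json
import os
import requests
from confluent_kafka import Consumer

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "slack-alerts",
                     "auto.offset.reset": "earliest"})
consumer.subscribe(["anomalies"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Post the anomaly as a message to the Slack channel behind the webhook
    requests.post(SLACK_WEBHOOK_URL, json={"text": f"Anomalous transaction detected: {record}"})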
Output:
Anomaly detection

CHATBOT ALERT NOTIFICATION

Result:
Thus real-time anomaly detection with Apache Kafka and Python was executed, and the streaming/bot_alerts notification was delivered successfully.
Ex:No:4 Implement The Query For Schema Definition
(Star, snowflake and Fact constellation schemas)

Aim:

To design data warehouse schema definitions, namely the star, snowflake and fact constellation schemas, through a MySQL database connection in the Weka tool.
Procedure:

Step 1: Click Start, All Programs, XAMPP; start the Apache and MySQL servers, then open the Weka tool and click Explorer.
Step 2: Click the Open DB tab for database connectivity.
Step 3: Enter the database connection parameters (URL, username, password), then check the database connection.
Step 4: Double-click localhost:3306 to establish the database connection.
Step 5: Double-click dwtp and click Schemas(1); right-click, select New Schema and type the schema name.
Step 6: Double-click dw, Tables(0), click the SQL icon, type the query and click the Run icon.
Step 7: Then close the SQL query dialog box.
Step 8: Up to this point, the database has been created along with the primary key. Next, import a CSV file and store its data in the database: go to Tables, choose the table name, right-click and choose the Import option, then select the .csv file from its location, select the format as CSV and click the Import button. Now the data are stored in the database.
Step 9: Similarly, create the following tables: snowflake, star and fact constellation.
Step 10: Click the SQL icon, type the queries and click the Run button.

Implementation:
STAR SCHEMA:
• Each dimension in a star schema is represented with only one dimension table.
• The fact table also contains the attributes, namely dollars sold and units sold.
STAR SCHEMA DEFINITION:
The star schema can be defined using Data Mining Query Language (DMQL) as follows:
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

SELECT pdim.Name Product_Name,
       SUM(sfact.sales_units) Quantity_Sold
FROM Product pdim,
     Sales sfact,
     Store sdim,
     Date ddim
WHERE sfact.product_id = pdim.product_id
  AND sfact.store_id = sdim.store_id
  AND sfact.date_id = ddim.date_id
  AND sdim.state = 'Kerala'
  AND ddim.month = 1
  AND ddim.year = 2018
  AND pdim.Name IN ('Novels', 'DVDs')
GROUP BY pdim.Name
SNOWFLAKE SCHEMA:
The snowflake schema can be defined using DMQL as follows:
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city (city_key, city, province_or_state, country))

SELECT pdim.Name Product_Name,
       SUM(sfact.sales_units) Quantity_Sold
FROM Sales sfact
INNER JOIN Product pdim ON sfact.product_id = pdim.product_id
INNER JOIN Store sdim ON sfact.store_id = sdim.store_id
INNER JOIN State stdim ON sdim.state_id = stdim.state_id
INNER JOIN Date ddim ON sfact.date_id = ddim.date_id
INNER JOIN Month mdim ON ddim.month_id = mdim.month_id
WHERE stdim.state = 'Kerala'
  AND mdim.month = 1
  AND ddim.year = 2018
  AND pdim.Name IN ('Novels', 'DVDs')
GROUP BY pdim.Name

FACT CONSTELLATION SCHEMA:


The fact constellation schema can be defined using DMQL as follows:
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Chrome: in the search bar, open localhost:3306.
Output:
Star schema

Snowflake schema

Fact constellation schema

Result:

Thus the implementation of the star, snowflake and fact constellation schemas using Weka and MySQL was executed successfully.
Ex:No:5 Design Data Warehouse For Real Time Applications

(populate a sample database and create the user-details and hockey tables)

Aim:
To build a data warehouse/data mart of source tables and populate sample data using the MySQL Administrator and SQLyog Enterprise tools.

Procedure:(MySQL Administrator connection establishment)


Step 1: Start MySQL Administrator and log in; after a successful login, a new window opens.
Step 2: Check the MySQL database connection on localhost:3306.
Step 3: Open the SQLyog Enterprise tool.
Step 4: Build the tables and populate their data in the database through SQL queries.
Step 5: After creating the database with MySQL queries for the real-time application, stop the process and log out of the server.

Implement:
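The manual documents this step with SQLyog screenshots; as a rough equivalent, the sketch below creates and populates a sample table over MySQL using Python, assuming SQLAlchemy 2.x and the PyMySQL driver. The connection URL, credentials, columns and rows are illustrative only.

from sqlalchemy import create_engine, text

# Illustrative connection to a sample database (adjust user/password/host/schema)
engine = create_engine("mysql+pymysql://root:password@localhost:3306/sampledata")

with engine.connect() as con:
    # Create a sample user-details table (columns are illustrative)
    con.execute(text("""
        CREATE TABLE IF NOT EXISTS user_details (
            user_id   INT PRIMARY KEY,
            user_name VARCHAR(50),
            city      VARCHAR(50)
        )"""))
    # Populate a couple of sample rows (REPLACE keeps the sketch re-runnable)
    con.execute(text("REPLACE INTO user_details VALUES (:id, :name, :city)"),
                [{"id": 1, "name": "Arun", "city": "Villupuram"},
                 {"id": 2, "name": "Priya", "city": "Chennai"}])
    con.commit()

    for row in con.execute(text("SELECT * FROM user_details")):
        print(row)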
Result:

Thus the data warehouse for a real-time application, with the sample dataset containing the user-details and hockey data tables, was designed successfully.
Ex:No:6 Implement and Analyse the Dimensional Modeling of Data Model

Aim:

To implement the creation of dimension tables and the analysis of the data model.

Procedure:
Step 1: Identify the Business Process
Step 2: Identify the Grain
Step 3: Identify the Dimensions
Step 4: Identify the Facts
Step 5: Build the Schema

Implementation:

Create the data warehouse


create database TopHireDW
go
use TopHireDW
go
-- Create Date Dimension
if exists (select * from sys.tables where name = 'DimDate')
drop table DimDate
go
create table DimDate
( DateKey int not null primary key,
[Year] varchar(7), [Month] varchar(7), [Date] date, DateString varchar(10))
go
-- Populate Date Dimension
truncate table DimDate
go
declare @i int, @Date date, @StartDate date, @EndDate date, @DateKey int,
@DateString varchar(10), @Year varchar(4),
@Month varchar(7), @Date1 varchar(20)
set @StartDate = '2006-01-01'
set @EndDate = '2016-12-31'
set @Date = @StartDate
insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row
while @Date <= @EndDate
begin
set @DateString = convert(varchar(10), @Date, 20)
set @DateKey = convert(int, replace(@DateString,'-',''))
set @Year = left(@DateString,4)
set @Month = left(@DateString, 7)
insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
values (@DateKey, @Year, @Month, @Date, @DateString)
set @Date = dateadd(d, 1, @Date)
end
go
select * from DimDate

-- Create Customer dimension


if exists (select * from sys.tables where name = 'DimCustomer')
drop table DimCustomer
go
create table DimCustomer
( CustomerKey int not null identity(1,1) primary key,
CustomerId varchar(20) not null,
CustomerName varchar(30), DateOfBirth date, Town varchar(50),
TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go
insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo,
DrivingLicenceNo, Occupation)
select * from HireBase.dbo.Customer
select * from DimCustomer
-- Create Van dimension
if exists (select * from sys.tables where name = 'DimVan')
drop table DimVan
go
create table DimVan
( VanKey int not null identity(1,1) primary key,
RegNo varchar(10) not null,
Make varchar(30), Model varchar(30), [Year] varchar(4),
Colour varchar(20), CC int, Class varchar(10)
)
go
insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
select * from HireBase.dbo.Van
go
select * from DimVan
-- Create Hire fact table
if exists (select * from sys.tables where name = 'FactHire')
drop table FactHire
go
create table FactHire
( SnapshotDateKey int not null, --Daily periodic snapshot fact table
HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys
HireId varchar(10) not null, --Degenerate Dimension
NoOfDays int, VanHire money, SatNavHire money,
Insurance money, DamageWaiver money, TotalBill money
)
go
select * from FactHire
Output:

Snowflake Schema image source of dimensional data model

Result:

Thus the implementation and analysis of the dimensional data model was carried out successfully using SQL queries.
Ex:No:7 Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot

Aim:
To perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot using Microsoft Excel.

Procedure:
Step 1: Open Microsoft Excel, go to the Data tab at the top and click on "Existing Connections".
Step 2: The Existing Connections window will open; click the "Browse for more" option to import a .cub extension file for performing the OLAP operations.
Step 3: Select "PivotTable Report" and click "OK".
Step 4: Analyse the different OLAP operations. First, perform the drill-down operation.
Step 5: Perform the roll-up (drill-up) operation.
Step 6: Perform the slicing operation by inserting a slicer.
Step 7: Perform the dicing operation, which is similar to slicing.
Step 8: Finally, perform the pivot (rotate) operation by swapping rows and columns.
Step 9: After visualization, save and exit the process.

Implement:
Open data

Import data
Drill down

Drill Up
Slice

Dice

Pivot table
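The screenshots above capture the Excel output. As a rough programmatic illustration of the same OLAP operations (not part of the Excel workflow), the pandas sketch below uses an illustrative sales table.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "region":  ["South", "South", "North", "North", "South", "North"],
    "product": ["Novels", "DVDs", "Novels", "DVDs", "Novels", "DVDs"],
    "units":   [120, 80, 95, 60, 150, 70],
})

# Slice: fix a single value on one dimension (region = 'South')
south = sales[sales["region"] == "South"]

# Dice: select a sub-cube on two or more dimensions
diced = sales[(sales["region"] == "South") & (sales["product"] == "Novels")]

# Roll-up: aggregate from the quarter level up to the year level
rollup = sales.groupby("year")["units"].sum()

# Drill-down: go back to the finer (year, quarter) level
drilldown = sales.groupby(["year", "quarter"])["units"].sum()

# Pivot: rotate the cube so regions become columns
pivot = pd.pivot_table(sales, index="product", columns="region",
                       values="units", aggfunc="sum")
print(pivot)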
Result:

Thus the OLAP operations were implemented successfully using Microsoft Excel.
Ex:No:8 Write ETL Scripts of OLTP Operations and Implement Using Data Warehouse Tools

Aim:
To implement OLTP operations using an ETL tool for the extraction of data from several sources and its cleansing, customization, reformatting, integration, and insertion into a data warehouse.

Procedure:
Step 1: Open the tool.
Step 2: Set up: download or create data, save it in one folder, then select and copy the file path.
Step 3: Extract the features of the data, or retrieve the data.
Step 4: Transform the data from one form to another.
Step 5: Load the data from the sources.
Step 6: Save and exit the process.

Implement:
Extract
# Import statements
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey
from sqlalchemy import inspect
# Connect the engine to the database file we'll be using
engine = create_engine('sqlite:///chinook.db')
Engine
DB Query
# SQL Expression Language creates metadata that contains objects that define the customers table
metadata = MetaData()
# Reflection loads the table definitions that already
# exist in the database, which the engine is connected to.
metadata.reflect(bind=engine)
# Checking this out, we can see the table structure and variable types for the employees table
inspector = inspect(engine)
# Check out the columns in the employees table
inspector.get_columns('employees')

Creation of Employees Table


# Let's execute raw SQL on some tables using SQLAlchemy
with engine.connect() as con:
    rs = con.execute('SELECT * FROM employees')
    for row in rs:
        print(row)
# Don't forget to close your connection to the database when the query is done
con.close()
Transform
# How many employees are there?
with engine.connect() as con:
    rs = con.execute("""SELECT COUNT(EmployeeId)
                        FROM employees;""")
    for row in rs:
        print(row)
con.close()

# How many customers did each sales rep help?
with engine.connect() as con:
    rs = con.execute("""SELECT COUNT(SupportRepId)
                        FROM customers
                        GROUP BY SupportRepId;""")
    for row in rs:
        print(row)
con.close()

with engine.connect() as con:
    rs = con.execute("""SELECT HireDate, EmployeeId
                        FROM employees
                        WHERE EmployeeId BETWEEN 3 AND 5
                        ORDER BY HireDate ASC""")
    for row in rs:
        print(row)

# Does their length of tenure map to how many customers they helped?
with engine.connect() as con:
    rs = con.execute("""SELECT MIN(HireDate), EmployeeId
                        FROM employees;""")
    for row in rs:
        print(row)
con.close()
with engine.connect() as con:
    # Grab the variables you want then inner join them on the respective private keys
    rs = con.execute(
        """SELECT
               invoices.InvoiceId AS invid,
               invoices.CustomerId AS invcustid,
               customers.CustomerId AS custcustid,
               COUNT(customers.CustomerId) AS numcustomers,
               customers.Country AS country,
               invoice_items.InvoiceId AS invitemid,
               invoice_items.TrackId AS invtrackid,
               tracks.TrackId AS tracktrackid,
               tracks.GenreId AS trackgenreid,
               tracks.Bytes AS trackbytes,
               SUM(tracks.Milliseconds) / 1000 / 60 AS minutes
           FROM
               invoices INNER JOIN customers ON invcustid=custcustid
               INNER JOIN invoice_items ON invitemid=invid
               INNER JOIN tracks ON tracktrackid=invtrackid
           GROUP BY country
           ORDER BY minutes DESC
        """
    )
    for row in rs:
        print(row)
con.close()

Load

# Connecting the query to pd.read_sql_query. To simplify, you could modify the query to create
# a table and then just do pd.read_sql_table into the dataframe.
import pandas as pd
df = pd.read_sql_query("""SELECT
invoices.InvoiceId AS invid,
invoices.CustomerId AS invcustid,
customers.CustomerId AS custcustid,
COUNT(customers.CustomerId) AS numcustomers,
customers.Country as country,
invoice_items.InvoiceId AS invitemid,

invoice_items.TrackId AS invtrackid,
tracks.TrackId AS tracktrackid,
tracks.GenreId AS trackgenreid,
tracks.Bytes AS trackbytes,
SUM(tracks.Milliseconds) / 1000 / 60 AS minutes

FROM
invoices INNER JOIN customers ON invcustid=custcustid
INNER JOIN invoice_items ON invitemid=invid
INNER JOIN tracks ON tracktrackid=invtrackid

GROUP BY country

ORDER BY minutes DESC

""", con=engine.connect())

Output:
Employees table

To count the number of employees


3,000,000 employees

Sales representative helped


Employee 3 helped 21 customers, 4 helped 20, and 5 helped 18 respectively.

Hiring date for employees 3-5

Invoice customer

3,277 minutes of music were sold to 494 Americans! Cool!


Load dataset

Result:

Thus the implementation of OLTP operations using an ETL tool was executed successfully.
