Data Warehousing Lab Manual
CCS341
DATA WAREHOUSE LABORATORY
EX NO:1 DATA EXPLORATION AND INTEGRATION WITH WEKA - WEATHER DATASET
Aim:
The goal of this lab is to install Weka and become familiar with it. To demonstrate the available preprocessing features, we will use the Weather dataset.
Procedure:
Step 1: Download and install Weka on your machine.
Step 2: Open Weka and have a look at the interface. It is an open-source project written in Java from the University of Waikato.
Step 3: Click on the Explorer button on the right side.
Step 4: Weka comes with a number of small datasets. Those files are located at C:\Program Files\Weka-3-8 (if it is installed at this location; otherwise, search for Weka-3-8 to find the installation location). In this folder, there is a subfolder named 'data'. Open that folder to see all the files that come with Weka.
Using the Open file... option under the Preprocess tab, select the weather.nominal.arff file.
DATASET
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Understanding Data
Let us first look at the highlighted Current relation sub window. It shows the name of the dataset that is
currently loaded. You can infer two points from this sub window −
• There are 14 instances - the number of rows in the table.
• The table contains 5 attributes - the fields, which are discussed in the upcoming sections.
On the left side, notice the Attributes sub window that displays the various fields in the database.
The weather dataset contains five fields - outlook, temperature, humidity, windy and play. When you select
an attribute from this list by clicking on it, further details on the attribute itself are displayed on the right
hand side.
Let us select the temperature attribute first. When clicking on it, we would see the following screen −
In the Selected Attribute subwindow, you can observe the following −
• The name and the type of the attribute are displayed.
• The type for the temperature attribute is Nominal.
• The number of Missing values is zero.
• There are three distinct values with no unique value.
• The table underneath this information shows the nominal values for this field as hot, mild and cool.
• It also shows the count and weight in terms of a percentage for each nominal value.
At the bottom of the window, you see the visual representation of the class values.
If you click on the Visualize All button, you will be able to see all features in one single window as shown
here −
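The same attribute statistics can be cross-checked outside Weka. The following is a minimal Python sketch, assuming the weather.nominal.arff file has been copied from Weka's data folder into the working directory (scipy and pandas are used here only for illustration and are not part of the Weka workflow).
# Reproduce the Selected Attribute statistics for 'temperature' in Python
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("weather.nominal.arff")    # assumed local copy
df = pd.DataFrame(data).apply(lambda col: col.str.decode("utf-8"))

print(meta)                                           # attribute names and nominal values
print(df["temperature"].value_counts())               # counts of hot, mild, cool
print("missing :", (df["temperature"] == "?").sum())  # missing values are encoded as '?'
print("distinct:", df["temperature"].nunique())       # three distinct values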
Removing Attributes:
Many a time, the data that you want to use for model building comes with many irrelevant fields. For example, a customer database may contain the customer's mobile number, which is not relevant in analysing the customer's credit rating.
To remove attribute(s), select them and click on the Remove button at the bottom.
The selected attributes would be removed from the database. After we fully pre-process the data, we can
save it for model building.
Applying Filters:
Some machine learning techniques, such as association rule mining, require categorical data. To illustrate the use of filters, we will use the weather.numeric.arff dataset, which contains two numeric attributes - temperature and humidity.
We will convert these to nominal by applying a filter on our raw data. Click on the Choose button in
the Filter subwindow and select the following filter −
weka→filters→supervised→attribute→Discretize
Click on the Apply button and examine the temperature and/or humidity attribute. You will notice that
these have changed from numeric to nominal types.
After we fully pre-process the data, we can save it for model building.
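Weka's supervised Discretize filter uses entropy-based (MDL) binning. As a rough, unsupervised stand-in that shows the effect of discretization, the following pandas sketch (assuming a local copy of weather.numeric.arff) bins the two numeric attributes into three labelled intervals.
# Convert the numeric attributes to nominal by simple equal-width binning
from scipy.io import arff
import pandas as pd

data, _ = arff.loadarff("weather.numeric.arff")      # assumed local copy
df = pd.DataFrame(data)

for col in ["temperature", "humidity"]:
    df[col] = pd.cut(df[col], bins=3, labels=["low", "medium", "high"])

print(df.dtypes)   # temperature and humidity are now categorical (nominal)
print(df.head())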
Result:
Thus the data exploration and integration with WEKA is successfully executed.
EX NO:2 APPLY WEKA TOOL FOR DATA VALIDATION
Aim:
To apply the Weka tool to a dataset for data validation. Weka supports several standard data mining tasks such as data pre-processing, clustering, classification, regression, visualization and feature selection.
Preprocess :
Initially as you open the explorer, only the Pre-process tab is enabled. The first step is to pre-process
the data. Thus, in the Pre-process option, we will select the data file, process it and make it fit for applying
the various algorithms.
Loading Data:
The first four buttons at the top of the preprocess section enable you to load data into WEKA:
1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit the file in
weka/experiment/DatabaseUtils.props.)
4. Generate.... Enables you to generate artificial data from a variety of Data Generators.
Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
DATASET
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Select the Classify tab in the WEKA Explorer.
The Classify tab provides several machine learning algorithms for the classification of your data. To list a few, we may apply algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is very exhaustive, and many more classifiers are available to experiment with.
Cross-Validation
Procedure for cross-validation:
1. Load the dataset into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification algorithms under the functions section.
3. Select the Linear Regression algorithm and click the Start option with the cross-validation option set to 10 folds.
4. Then we get the regression model and its result as shown below.
Here, we enabled the cross-validation test option with 10 folds and clicked the Start button, as represented below.
Using the cross-validation strategy with 20 folds: here, we enabled the cross-validation test option with 20 folds and clicked the Start button, as represented below.
Comparing the above results of cross-validation with 10 folds and 20 folds, we observe that the error rate is lower with 20 folds (97.3% correctness) than with 10 folds (94.6% correctness).
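The same comparison can be reproduced outside Weka. The following is a minimal scikit-learn sketch of k-fold cross-validation with 10 and 20 folds; it uses the built-in Iris dataset as a stand-in, since the 14-row weather data cannot be split into 20 folds, and the choice of classifier is an illustrative assumption.
# Compare 10-fold and 20-fold cross-validation accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

for folds in (10, 20):
    scores = cross_val_score(model, X, y, cv=folds)
    # Mean accuracy across folds, comparable to Weka's "correctly classified instances"
    print(f"{folds}-fold accuracy: {scores.mean() * 100:.1f}%")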
Result:
Thus data validation using the Weka tool with cross-validation was performed successfully.
Ex:No:3 Real-Time Anomaly Detection with Apache Kafka and Python
Aim:
To make real-time predictions on incoming stream data from Apache Kafka, and to implement notification messages for credit card transactions, GPS logs, and system consumption metrics.
Project ideas:
• Train an anomaly detection algorithm using unsupervised machine learning.
• Create a new data producer that sends the transactions to a Kafka topic.
• Read the data from the Kafka topic to make the prediction using the trained ml model.
• If the model detects that the transaction is not an inlier, send it to another Kafka topic.
• Create the last consumer that reads the anomalies and sends an alert to a Slack channel.
Architecture:
Procedure:
Step 1: Project structure:
i) First, check settings.py; it has some variables to set, like the Kafka broker host and port. Leave the defaults (listening on localhost and the default ports of Kafka and ZooKeeper).
ii) The streaming/utils.py file contains the configurations to create Kafka consumers and producers.
iii) Install the requirements.
Program:
Step1:
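The following is a minimal sketch of the program, assuming the kafka-python and scikit-learn packages, a broker on localhost:9092 (the settings.py default mentioned above), and illustrative topic names and transaction fields; it is not the original project's code.
# Train an anomaly detector, stream transactions through Kafka, and forward outliers
import json
import numpy as np
from kafka import KafkaProducer, KafkaConsumer
from sklearn.ensemble import IsolationForest

BROKER = "localhost:9092"                               # assumed broker address

# 1. Train an unsupervised model on historical transaction amounts (stand-in data)
history = np.random.normal(loc=50, scale=10, size=(1000, 1))
model = IsolationForest(contamination=0.01).fit(history)

# 2. Producer: send incoming transactions to the 'transactions' topic
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("transactions", {"id": 1, "amount": 250.0})
producer.flush()

# 3. Consumer: read transactions, predict, and forward outliers to 'anomalies'
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers=BROKER,
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
for message in consumer:
    tx = message.value
    if model.predict([[tx["amount"]]])[0] == -1:        # -1 means outlier
        producer.send("anomalies", tx)                  # a Slack-alerting consumer
                                                        # subscribes to this topic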
Result:
Thus real-time anomaly detection with Apache Kafka and Python was executed, and the streaming/bot_alerts notification was produced successfully.
Ex:No:4 Implement The Query For Schema Definition
(Star, snowflake and Fact constellation schemas)
Aim:
To design a data warehouse database and queries for schema definition, namely the Star, Snowflake and Fact constellation schemas, through the MySQL database connection of the Weka tool.
Procedure:
Step 1: Click Start - All Programs - XAMPP, start the Apache and MySQL servers, then open the Weka tool and click Explorer.
Step 2: Click the Open DB tab for database connectivity.
Step 3: Enter the database connection parameters (URL, username, password), then check the database connection.
Step 4: Double-click localhost:3306 to open the database connection.
Step 5: Double-click dwtp, click Schemas(1), right-click, select New Schema and type the schema name.
Step 6: Double-click dw - Tables(0), click the SQL icon, type the query and click the Run icon.
Step 7: Then close the SQL query dialog box.
Step 8: Up to this point, the database has been created along with the primary key. Next, import a CSV file and store the data in the database: go to Tables, choose the table name, right-click, choose the Import option, then select the .csv file from its location, select the format as CSV and click the Import button. Now the data are stored in the database.
Step 9: Similarly, create the following tables: Snowflake, Star, Fact constellation.
Step 10: Click the SQL icon, type the queries and click the Run button.
Implementation:
STAR SCHEMA:
• Each dimension in a star schema is represented with only one-dimension table.
• The fact table also contains the attributes, namely dollars sold and units sold.
STAR SCHEMA DEFINITION:
The star schema is defined using Data Mining Query Language (DMQL) as follows:
define cube sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
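For Step 6 and Step 10 of the procedure (typing queries in the SQL viewer), the equivalent tables can also be created programmatically. The following is only a minimal sketch: the database name dwtp follows Step 5, while the login, the PyMySQL driver and the column types are assumptions.
# Create the star schema tables over the same MySQL connection
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://root:@localhost:3306/dwtp")  # assumed login

ddl = [
    """CREATE TABLE time_dim (time_key INT PRIMARY KEY, day INT,
           day_of_week VARCHAR(10), month INT, quarter INT, year INT)""",
    """CREATE TABLE item_dim (item_key INT PRIMARY KEY, item_name VARCHAR(50),
           brand VARCHAR(30), type VARCHAR(30), supplier_type VARCHAR(30))""",
    """CREATE TABLE branch_dim (branch_key INT PRIMARY KEY, branch_name VARCHAR(50),
           branch_type VARCHAR(30))""",
    """CREATE TABLE location_dim (location_key INT PRIMARY KEY, street VARCHAR(50),
           city VARCHAR(30), province_or_state VARCHAR(30), country VARCHAR(30))""",
    """CREATE TABLE sales_fact (time_key INT, item_key INT, branch_key INT,
           location_key INT, dollars_sold DECIMAL(10,2), units_sold INT,
           FOREIGN KEY (time_key) REFERENCES time_dim(time_key),
           FOREIGN KEY (item_key) REFERENCES item_dim(item_key),
           FOREIGN KEY (branch_key) REFERENCES branch_dim(branch_key),
           FOREIGN KEY (location_key) REFERENCES location_dim(location_key))""",
]

# Each dimension table holds one row per dimension member; the fact table holds the measures
with engine.begin() as con:
    for stmt in ddl:
        con.execute(text(stmt))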
Snowflake schema
Result:
Thus the implementation of the star, snowflake and fact constellation schemas using Weka and MySQL was executed successfully.
Ex:No:5 Design Data Warehouse for Real-Time Applications
Aim:
To build a Data Warehouse/Data Mart of source tables and populate sample data using the MySQL Administrator and SQLyog Enterprise tools.
Implement:
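The following is a minimal sketch of the implementation, assuming a local MySQL server reached through SQLAlchemy with the PyMySQL driver; the database name, login and the column layouts for the user_details and hockey tables mentioned in the result are illustrative assumptions.
# Create and populate two sample source tables in the data mart
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://root:@localhost:3306/realtime_dw")  # assumed

with engine.begin() as con:
    con.execute(text("""CREATE TABLE IF NOT EXISTS user_details (
        user_id INT PRIMARY KEY, user_name VARCHAR(50), city VARCHAR(30))"""))
    con.execute(text("""CREATE TABLE IF NOT EXISTS hockey (
        match_id INT PRIMARY KEY, team VARCHAR(30), goals INT)"""))
    con.execute(text("INSERT INTO user_details VALUES (1, 'arun', 'Chennai')"))
    con.execute(text("INSERT INTO hockey VALUES (1, 'India', 3)"))

print(pd.read_sql_table("user_details", engine))   # verify the populated data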
Result:
Thus a data warehouse for a real-time application, with sample user-details and hockey data tables, was designed successfully.
Ex:No:6 Implement and Analyse the Dimensional Modeling of Data Model
Aim:
To implement and analyse the dimensional modelling of a data model using the Weka tool with MySQL queries.
Procedure:
Step 1: Identify the Business Process
Step 2: Identify the Grain
Step 3: Identify the Dimensions
Step 4: Identify the Facts
Step 5: Build the Schema
Implementation:
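The following is a minimal pandas sketch of the five steps applied to an assumed retail-sales process; the sample rows and column names are illustrative only and are not part of the lab data.
# Steps 1-2: business process = retail sales; grain = one row per sale line item
import pandas as pd

sales = pd.DataFrame({
    "date":    ["2024-01-01", "2024-01-01", "2024-01-02"],
    "product": ["pen", "book", "pen"],
    "store":   ["S1", "S2", "S1"],
    "units":   [10, 2, 5],
    "amount":  [50.0, 300.0, 25.0],
})

# Step 3: dimensions = date, product, store (each gets a surrogate key)
date_dim = sales[["date"]].drop_duplicates().reset_index(drop=True)
date_dim["date_key"] = date_dim.index + 1
product_dim = sales[["product"]].drop_duplicates().reset_index(drop=True)
product_dim["product_key"] = product_dim.index + 1
store_dim = sales[["store"]].drop_duplicates().reset_index(drop=True)
store_dim["store_key"] = store_dim.index + 1

# Steps 4-5: facts = units and amount; the fact table keeps only keys and measures
fact = (sales.merge(date_dim).merge(product_dim).merge(store_dim)
             [["date_key", "product_key", "store_key", "units", "amount"]])
print(fact)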
Result:
Thus the implementation and analysis of the dimensional data model using the Weka tool with MySQL queries was completed successfully.
Ex:No:7 Perform various OLAP operations such as slice, dice, roll up, drill down and pivot
Aim:
To perform various OLAP operations such as slice, dice, roll up, drill down and pivot using Microsoft Excel.
Procedure:
Step 1: Open Microsoft Excel, go to the Data tab at the top and click on "Existing Connections".
Step 2: The Existing Connections window will open; there the "Browse for more" option should be clicked to import a .cub extension file for performing OLAP operations.
Step 3: Select "PivotTable Report" and click "OK".
Step 4: Analyse the different OLAP operations. Firstly, the drill-down operation is performed.
Step 5: Perform the roll-up (drill-up) operation.
Step 6: The next OLAP operation, slicing, is performed by inserting a slicer.
Step 7: The dicing operation is similar to the slicing operation.
Step 8: Finally, the pivot (rotate) OLAP operation is performed by swapping rows and columns.
Step 9: After visualization, save and exit the process.
Implement:
Open data
Import data
Drill down
Drill Up
Slice
Dice
Pivot table
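The same OLAP operations can be illustrated outside Excel with a short pandas sketch; the sales table below is an assumed stand-in for the cube file, and pivot_table plays the role of the cube.
# Build a small cube and apply slice, dice, roll-up, drill-down and pivot
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q2", "Q1", "Q1", "Q2"],
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["pen", "book", "pen", "book", "pen", "book"],
    "amount":  [100, 200, 150, 300, 120, 250],
})

cube = pd.pivot_table(sales, values="amount", index=["year", "quarter"],
                      columns="region", aggfunc="sum")

print(cube)                                   # base cube: amount by (year, quarter) x region
print(cube.xs(2023, level="year"))            # slice: fix one dimension (year = 2023)
print(sales[(sales.year == 2023) & (sales.region == "North")])   # dice: select a sub-cube
print(cube.groupby(level="year").sum())       # roll-up: aggregate quarter up to year
print(pd.pivot_table(sales, values="amount",  # drill-down: add the product level
                     index=["year", "quarter", "product"],
                     columns="region", aggfunc="sum"))
print(cube.T)                                 # pivot (rotate): swap rows and columns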
Result:
Thus the implementation of OLAP operations using Microsoft Excel was executed successfully.
Ex:No:8 Write ETL Scripts of OLTP Operations and Implement Using Data Warehouse Tools
Aim:
To implement OLTP operations using an ETL tool for the extraction of data from several sources and its cleansing, customization, reformatting, integration, and insertion into a data warehouse.
Procedure:
Step 1: Open the tool.
Step 2: Set up: download or create the data, save it in one folder, then select and copy the file path.
Step 3: Extract the features of the data, or retrieve the data.
Step 4: Transform the data from one format to another.
Step 5: Load the data from the sources into the target.
Step 6: Save and exit the process.
Implement:
Extract
# Import statements
import sqlalchemy
from sqlalchemy import create_engine, text
from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey
from sqlalchemy import inspect
# Connect the engine to the database file we'll be using
engine = create_engine('sqlite:///chinook.db')
DB Query
# SQL Expression Language creates metadata that contains objects describing the database tables
metadata = MetaData()
# Reflect the tables that already exist in the database
# the engine is connected to.
metadata.reflect(bind=engine)
# Checking this out, we can see the table structure and variable types for the employees table
inspector = inspect(engine)
# Checked out the columns in the employees table
inspector.get_columns('employees')
# Does their length of tenure map to how many customers they helped?
with engine.connect() as con:
    rs = con.execute(text("""SELECT MIN(HireDate), EmployeeId
                             FROM employees;"""))
    for row in rs:
        print(row)
with engine.connect() as con:
    # Grab the variables you want, then inner join the tables on their primary/foreign keys
    rs = con.execute(text(
        """SELECT
               invoices.InvoiceId AS invid,
               invoices.CustomerId AS invcustid,
               customers.CustomerId AS custcustid,
               COUNT(customers.CustomerId) AS numcustomers,
               customers.Country AS country,
               invoice_items.InvoiceId AS invitemid,
               invoice_items.TrackId AS invtrackid,
               tracks.TrackId AS tracktrackid,
               tracks.GenreId AS trackgenreid,
               tracks.Bytes AS trackbytes,
               SUM(tracks.Milliseconds) / 1000 / 60 AS minutes
           FROM
               invoices
               INNER JOIN customers ON customers.CustomerId = invoices.CustomerId
               INNER JOIN invoice_items ON invoice_items.InvoiceId = invoices.InvoiceId
               INNER JOIN tracks ON tracks.TrackId = invoice_items.TrackId
           GROUP BY country
           ORDER BY minutes DESC
        """
    ))
    for row in rs:
        print(row)
Load
# Connecting the query to pd.read_sql_query. To simplify, you could modify the query to create
# a table and then just use pd.read_sql_table to load it into the dataframe.
import pandas as pd

df = pd.read_sql_query("""SELECT
        invoices.InvoiceId AS invid,
        invoices.CustomerId AS invcustid,
        customers.CustomerId AS custcustid,
        COUNT(customers.CustomerId) AS numcustomers,
        customers.Country AS country,
        invoice_items.InvoiceId AS invitemid,
        invoice_items.TrackId AS invtrackid,
        tracks.TrackId AS tracktrackid,
        tracks.GenreId AS trackgenreid,
        tracks.Bytes AS trackbytes,
        SUM(tracks.Milliseconds) / 1000 / 60 AS minutes
    FROM
        invoices
        INNER JOIN customers ON customers.CustomerId = invoices.CustomerId
        INNER JOIN invoice_items ON invoice_items.InvoiceId = invoices.InvoiceId
        INNER JOIN tracks ON tracks.TrackId = invoice_items.TrackId
    GROUP BY country
    """, con=engine.connect())
Output:
Employees table
Invoice customer
Result:
Thus the ETL scripts for OLTP operations were written and implemented successfully using data warehouse tools.