DMW Assignment 1
Title:
For an organization of your choice, choose a set of business processes.
Design star / snowflake schemas for analyzing these processes. Create a
fact constellation schema by combining them. Extract data from different
data sources, apply suitable transformations and load into destination tables
using an ETL tool.
Problem Definition:
Design a basic ETL model using the RapidMiner application.
Theory Concepts:
What does ETL mean?
ETL stands for Extract, Transform and Load. An ETL tool extracts data from
different RDBMS source systems, transforms it by applying calculations,
concatenations, etc., and then loads it into the Data Warehouse system,
where it is stored in the form of dimension and fact tables.
Extraction
A staging area is required during the ETL load, for several reasons:
- The source systems are often available only for a specific period of time
for extraction, and this window is shorter than the total data-load time.
The staging area lets you extract the data from the source system and keep
it before the time slot ends.
- A staging area is needed when you want to bring data from multiple
sources together, or to join two or more systems; for example, you
generally cannot run a single SQL query that joins two tables held in two
physically different databases (see the sketch after this list).
- The extraction time slots of different systems vary with their time
zones and operational hours.
- Data extracted from the source systems can be reused by multiple data
warehouse systems, operational data stores, etc.
- ETL allows you to perform complex transformations, and these require
extra space to store the intermediate data.
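To make the cross-database join problem above concrete, here is a minimal
Python sketch using the standard-library sqlite3 module. Every database,
table and column name in it is hypothetical, chosen only for illustration:
it extracts two tables from two physically separate source databases into
one staging database, where they can finally be joined.

import sqlite3

# Two hypothetical, physically separate source databases.
crm = sqlite3.connect("crm.db")
crm.execute("DROP TABLE IF EXISTS customers")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
crm.commit()
crm.close()

billing = sqlite3.connect("billing.db")
billing.execute("DROP TABLE IF EXISTS orders")
billing.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
billing.execute("INSERT INTO orders VALUES (1, 120.0), (1, 80.0), (2, 50.0)")
billing.commit()
billing.close()

# Extract: copy each source table into a single staging database, so the
# two data sets finally live side by side.
staging = sqlite3.connect("staging.db")
for table, path in [("customers", "crm.db"), ("orders", "billing.db")]:
    src = sqlite3.connect(path)
    cur = src.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    staging.execute(f"DROP TABLE IF EXISTS {table}")
    staging.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    staging.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' * len(cols))})", rows)
    src.close()
staging.commit()

# The join that could not run across the two separate databases now runs
# inside the staging area.
for name, total in staging.execute(
        "SELECT c.name, SUM(o.amount) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"):
    print(name, total)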
Transform
In data transformation, you apply a set of functions to the extracted data
in order to load it into the target system. Data that does not require any
transformation is known as direct move or pass-through data.
You can apply different transformations to the data extracted from the
source system. For example, you can perform customized calculations: if you
want sum-of-sales revenue and it is not stored in the database, you can
apply a SUM formula during transformation and load the result.
Similarly, if a table stores the first name and the last name in separate
columns, you can concatenate them before loading.
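Both transformations mentioned above can be sketched in a few lines of
Python; the field names and values are assumptions for illustration only.

# Hypothetical rows extracted from the source system.
extracted = [
    {"first_name": "Asha", "last_name": "Patil", "sales": 120.0},
    {"first_name": "Ravi", "last_name": "Kumar", "sales": 80.0},
]

# Transformation 1: concatenate first and last name into one column.
for row in extracted:
    row["full_name"] = f"{row['first_name']} {row['last_name']}"

# Transformation 2: a customized calculation, the sum-of-sales revenue
# that does not exist anywhere in the source database.
sum_of_sales = sum(row["sales"] for row in extracted)

print(extracted)
print("sum of sales:", sum_of_sales)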
Load
During the Load phase, data is loaded into the end target system, which can
be a flat file or a Data Warehouse system.
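Continuing the sketch, the load phase writes the transformed rows to the
end target. Both target types mentioned above are shown, a flat file and a
warehouse table, again with hypothetical names:

import csv
import sqlite3

transformed = [
    {"full_name": "Asha Patil", "sales": 120.0},
    {"full_name": "Ravi Kumar", "sales": 80.0},
]

# Target 1: a flat file.
with open("sales_extract.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["full_name", "sales"])
    writer.writeheader()
    writer.writerows(transformed)

# Target 2: a table in the warehouse database.
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS sales_fact (full_name TEXT, sales REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (:full_name, :sales)", transformed)
dw.commit()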
Tool for ETL: RapidMiner
RapidMiner is a world-leading open-source system for data mining. It is
available as a standalone application for data analysis and as a data
mining engine for integration into your own products. RapidMiner is now
RapidMiner Studio, and RapidAnalytics is now called RapidMiner Server.
In a few words, RapidMiner Studio is a "downloadable GUI for machine
learning, data mining, text mining, predictive analytics and business
analytics". It can also be used (for most purposes) in batch
(command-line) mode.
RapidMiner supports nominal and numerical values: integers, real numbers,
two-value nominal, multi-value nominal, etc.
STEPS FOR INSTALLATION:
1. Downloading RapidMiner Server
2. Installing RapidMiner Server
3. Configuring RapidMiner Server settings
4. Configuring RapidMiner Server's database connection
5. Installing Radoop Proxy
6. Completing the installation
Once logged in, complete the final installation steps.
1. From the SQL Dialect pulldown, verify that the database type displayed
is the one you used to create the RapidMiner Server database.
2. Verify the setting for the integrated Quartz scheduler, which is enabled
by default.
3. Specify the path to the plugin directory. You can install additional
RapidMiner extensions by placing them in, or saving them to, this
directory. Note that all extensions bundled with RapidMiner Studio are
also bundled with RapidMiner Server (no installation is necessary).
These bundled extensions are stored in a separate directory that is
independent of the path specified here. Be sure that you have write
permission to the directory.
4. Click Start installation now.
5. The installation completes.
Data Warehousing Schemas
1. Star Schema
2. Snowflake Schema
3. Fact Constellation
Star Schema
A star schema is the simplest warehouse schema: a single fact table sits at
the center and references a set of dimension tables, so the diagram
resembles a star. For example, in a star schema for vehicle sales, the
central fact table contains a key into every dimension table, such as
Deal_ID, Model_ID, Date_ID, Product_ID and Branch_ID, together with
measures such as units sold and revenue.
Characteristics of Star Schema:
- Every dimension is represented by exactly one dimension table.
- Each dimension table contains a set of attributes.
- Each dimension table is joined to the fact table using a foreign key.
- Dimension tables are not joined to each other.
- The fact table contains keys and measures.
- The star schema is easy to understand and provides optimal disk usage.
- The dimension tables are not normalized; for instance, a Country_ID
column would not have a separate Country lookup table, as an OLTP design
would.
- The schema is widely supported by BI tools.
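The layout described above can be sketched as SQL DDL, executed here
through Python's sqlite3 module. The table and column names are assumptions
based on the vehicle-sales example, not a fixed standard:

import sqlite3

db = sqlite3.connect("star_schema.db")
db.executescript("""
-- One denormalized table per dimension.
CREATE TABLE dim_deal    (deal_id    INTEGER PRIMARY KEY, deal_type TEXT);
CREATE TABLE dim_model   (model_id   INTEGER PRIMARY KEY, model_name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY,
                          day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
-- Note: country stays inline here (no lookup table), as described above.
CREATE TABLE dim_branch  (branch_id  INTEGER PRIMARY KEY,
                          branch_name TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    deal_id    INTEGER REFERENCES dim_deal(deal_id),
    model_id   INTEGER REFERENCES dim_model(model_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    branch_id  INTEGER REFERENCES dim_branch(branch_id),
    units_sold INTEGER,
    revenue    REAL
);
""")
db.commit()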
Snowflake Schema
A snowflake schema is an extension of a star schema in which the dimension
tables are normalized, splitting their data into additional tables. It is
called a snowflake because its diagram resembles a snowflake. For example,
a Country attribute can be normalized out of a branch dimension into an
individual lookup table.
Characteristics of Snowflake Schema:
- The main benefit of the snowflake schema is that it uses less disk space.
- It is easier to add a dimension to the schema.
- Query performance is reduced because more tables have to be joined.
- The primary challenge of the snowflake schema is the extra maintenance
effort caused by the larger number of lookup tables.
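Continuing the star schema sketch above, snowflaking the branch dimension
means normalizing the Country attribute out into its own lookup table
(again with hypothetical names):

import sqlite3

db = sqlite3.connect("snowflake_schema.db")
db.executescript("""
-- The Country attribute is split out of the branch dimension ...
CREATE TABLE dim_country (
    country_id   INTEGER PRIMARY KEY,
    country_name TEXT
);

-- ... so the branch dimension now references a second-level lookup table.
CREATE TABLE dim_branch (
    branch_id   INTEGER PRIMARY KEY,
    branch_name TEXT,
    country_id  INTEGER REFERENCES dim_country(country_id)
);
""")
db.commit()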
Star Schema vs. Snowflake Schema:
- Hierarchies: In a star schema, hierarchies for the dimensions are stored
in the dimension tables; in a snowflake schema, hierarchies are divided
into separate tables.
- Structure: A star schema contains a fact table surrounded by dimension
tables; a snowflake schema has one fact table surrounded by dimension
tables which are in turn surrounded by further dimension tables.
- Joins: In a star schema, a single join creates the relationship between
the fact table and any dimension table; a snowflake schema requires many
joins to fetch the data.
- Design: Star is a simple DB design; snowflake is a very complex DB design.
- Normalization: Star uses a denormalized data structure, so queries also
run faster; snowflake uses a normalized data structure.
- Redundancy: Star has a high level of data redundancy; snowflake has very
low data redundancy.
- Dimension tables: In a star schema, a single dimension table contains the
aggregated data; in a snowflake schema, the data is split into different
dimension tables.
- Cube processing: Faster for star; possibly slow for snowflake because of
the complex joins.
- Query performance: Star offers higher-performing queries using star-join
query optimization, and tables may be connected with multiple dimensions;
the snowflake schema is represented by a centralized fact table which is
unlikely to be connected with multiple dimensions directly.
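The third schema listed earlier, the fact constellation, combines several
star schemas so that multiple fact tables share the same dimension tables.
A minimal sketch follows, assuming a hypothetical shipments process
alongside sales; the names are illustrative only:

import sqlite3

db = sqlite3.connect("constellation_schema.db")
db.executescript("""
-- Dimensions shared by both business processes.
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY,
                          day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_branch  (branch_id  INTEGER PRIMARY KEY, branch_name TEXT);

-- Fact table for the sales process.
CREATE TABLE sales_fact (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    branch_id  INTEGER REFERENCES dim_branch(branch_id),
    units_sold INTEGER,
    revenue    REAL
);

-- Fact table for a second process, reusing the same dimension tables.
CREATE TABLE shipments_fact (
    date_id       INTEGER REFERENCES dim_date(date_id),
    product_id    INTEGER REFERENCES dim_product(product_id),
    branch_id     INTEGER REFERENCES dim_branch(branch_id),
    units_shipped INTEGER,
    freight_cost  REAL
);
""")
db.commit()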
Star Schema
1. Design Model
Step 1: Import data from the source.
Step 2: Select the data location.
Step 3: Open a sample data set, e.g. the Iris data set that ships with the
tool.
Step 4: Click the Retrieve operator and drag it into the Process view.
Step 5: The Retrieve operator appears in the Process view with its out
(output) port.
Step 6: Click on the repository entry.
Step 7: Select the local repository.
Step 8: Select the sample file.
Step 9: Connect the operator's out port to the result (res) port.
Step 10: Start execution of the current process.
Step 11: The output result is generated after execution of the current
process.
Step 12: You can now add a Store operator and connect it to the result
port.
Step 13: You can also plot charts of the sample data set.
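For comparison, the same retrieve-and-store flow can be sketched outside
RapidMiner in a few lines of Python, assuming scikit-learn only as a
convenient local source of the same Iris sample data:

import sqlite3

from sklearn.datasets import load_iris

# "Retrieve": load the Iris sample data set.
iris = load_iris(as_frame=True)
frame = iris.frame  # a pandas DataFrame: four features plus the target

# "Store": persist the example set into a local repository, modelled here
# as a SQLite database.
db = sqlite3.connect("local_repository.db")
frame.to_sql("iris", db, if_exists="replace", index=False)
db.close()

# "Results": inspect the data, as the Results view would display it.
print(frame.head())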