DMW Assignment 1
Title:
For an organization of your choice, choose a set of business processes.
Design star / snowflake schemas for analyzing these processes. Create a
fact constellation schema by combining them. Extract data from different
data sources, apply suitable transformations and load into destination tables
using an ETL tool.
Problem Definition:
Design a basic ETL model using the RapidMiner application.
Theory Concepts:
What does ETL mean?
ETL stands for Extract, Transform and Load. An ETL tool extracts data from
different RDBMS source systems, transforms it by applying calculations,
concatenations, etc., and then loads it into the Data Warehouse system,
where it is stored in the form of dimension and fact tables.
Extraction
A staging area is required during the ETL load, for several reasons:
- The source systems are often available only for a specific period of time
for extraction, and this window is shorter than the total data-load time.
The staging area lets you extract the data from the source system and keep
it before the time slot ends.
- A staging area is needed when you want to bring data from multiple
sources together, or to join two or more systems; for example, you
generally cannot run a single SQL query that joins two tables held in two
physically different databases (see the sketch after this list).
- The extraction time slots of different systems vary with their time
zones and operational hours.
- Data extracted from the source systems can be reused by multiple data
warehouse systems, operational data stores, etc.
- ETL allows you to perform complex transformations, and these require
extra space to store the intermediate data.
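To make the cross-database join problem above concrete, here is a minimal
Python sketch using the standard-library sqlite3 module. Every database,
table and column name in it is hypothetical, chosen only for illustration:
it extracts two tables from two physically separate source databases into
one staging database, where they can finally be joined.

import sqlite3

# Two hypothetical, physically separate source databases.
crm = sqlite3.connect("crm.db")
crm.execute("DROP TABLE IF EXISTS customers")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
crm.commit()
crm.close()

billing = sqlite3.connect("billing.db")
billing.execute("DROP TABLE IF EXISTS orders")
billing.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
billing.execute("INSERT INTO orders VALUES (1, 120.0), (1, 80.0), (2, 50.0)")
billing.commit()
billing.close()

# Extract: copy each source table into a single staging database, so the
# two data sets finally live side by side.
staging = sqlite3.connect("staging.db")
for table, path in [("customers", "crm.db"), ("orders", "billing.db")]:
    src = sqlite3.connect(path)
    cur = src.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    staging.execute(f"DROP TABLE IF EXISTS {table}")
    staging.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    staging.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' * len(cols))})", rows)
    src.close()
staging.commit()

# The join that could not run across the two separate databases now runs
# inside the staging area.
for name, total in staging.execute(
        "SELECT c.name, SUM(o.amount) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"):
    print(name, total)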
Transform
In data transformation, you apply a set of functions to the extracted data
in order to load it into the target system. Data that does not require any
transformation is known as direct move or pass-through data.
You can apply different transformations to the data extracted from the
source system. For example, you can perform customized calculations: if you
want sum-of-sales revenue and it is not stored in the database, you can
apply a SUM formula during transformation and load the result.
Similarly, if a table stores the first name and the last name in separate
columns, you can concatenate them before loading.
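Both transformations mentioned above can be sketched in a few lines of
Python; the field names and values are assumptions for illustration only.

# Hypothetical rows extracted from the source system.
extracted = [
    {"first_name": "Asha", "last_name": "Patil", "sales": 120.0},
    {"first_name": "Ravi", "last_name": "Kumar", "sales": 80.0},
]

# Transformation 1: concatenate first and last name into one column.
for row in extracted:
    row["full_name"] = f"{row['first_name']} {row['last_name']}"

# Transformation 2: a customized calculation, the sum-of-sales revenue
# that does not exist anywhere in the source database.
sum_of_sales = sum(row["sales"] for row in extracted)

print(extracted)
print("sum of sales:", sum_of_sales)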
Load
During the Load phase, data is loaded into the end target system, which can
be a flat file or a Data Warehouse system.
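Continuing the sketch, the load phase writes the transformed rows to the
end target. Both target types mentioned above are shown, a flat file and a
warehouse table, again with hypothetical names:

import csv
import sqlite3

transformed = [
    {"full_name": "Asha Patil", "sales": 120.0},
    {"full_name": "Ravi Kumar", "sales": 80.0},
]

# Target 1: a flat file.
with open("sales_extract.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["full_name", "sales"])
    writer.writeheader()
    writer.writerows(transformed)

# Target 2: a table in the warehouse database.
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS sales_fact (full_name TEXT, sales REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (:full_name, :sales)", transformed)
dw.commit()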
Tool for ETL: RapidMiner
RapidMiner is a world-leading open-source system for data mining. It is
available as a standalone application for data analysis and as a data
mining engine for integration into your own products. RapidMiner is now
RapidMiner Studio, and RapidAnalytics is now called RapidMiner Server.
In a few words, RapidMiner Studio is a "downloadable GUI for machine
learning, data mining, text mining, predictive analytics and business
analytics". It can also be used (for most purposes) in batch
(command-line) mode.
RapidMiner supports nominal and numerical values: integers, real numbers,
two-value nominal, multi-value nominal, etc.
STEPS FOR INSTALLATION:
1. Downloading RapidMiner Server
2. Installing RapidMiner Server
3. Configuring RapidMiner Server settings
4. Configuring RapidMiner Server's database connection
5. Installing Radoop Proxy
6. Completing the installation
Once logged in, complete the final installation steps.
1. From the SQL Dialect pulldown, verify that the database type displayed
is the one you used to create the RapidMiner Server database.
2. Verify the setting for the integrated Quartz scheduler, which is enabled
by default.
3. Specify the path to the plugin directory. You can install additional
RapidMiner extensions by placing them in, or saving them to, this
directory. Note that all extensions bundled with RapidMiner Studio are
also bundled with RapidMiner Server (no installation is necessary).
These bundled extensions are stored in a separate directory that is
independent of the path specified here. Be sure that you have write
permission to the directory.
4. Click Start installation now.
5. The installation completes.
Data Warehousing Schemas
1. Star Schema
2. Snowflake Schema
3. Fact Constellation
Star Schema
A star schema is the simplest warehouse schema: a single fact table sits at
the center and references a set of dimension tables, so the diagram
resembles a star. For example, in a star schema for vehicle sales, the
central fact table contains a key into every dimension table, such as
Deal_ID, Model_ID, Date_ID, Product_ID and Branch_ID, together with
measures such as units sold and revenue.
Characteristics of Star Schema:
- Every dimension is represented by exactly one dimension table.
- Each dimension table contains a set of attributes.
- Each dimension table is joined to the fact table using a foreign key.
- Dimension tables are not joined to each other.
- The fact table contains keys and measures.
- The star schema is easy to understand and provides optimal disk usage.
- The dimension tables are not normalized; for instance, a Country_ID
column would not have a separate Country lookup table, as an OLTP design
would.
- The schema is widely supported by BI tools.
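The layout described above can be sketched as SQL DDL, executed here
through Python's sqlite3 module. The table and column names are assumptions
based on the vehicle-sales example, not a fixed standard:

import sqlite3

db = sqlite3.connect("star_schema.db")
db.executescript("""
-- One denormalized table per dimension.
CREATE TABLE dim_deal    (deal_id    INTEGER PRIMARY KEY, deal_type TEXT);
CREATE TABLE dim_model   (model_id   INTEGER PRIMARY KEY, model_name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY,
                          day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
-- Note: country stays inline here (no lookup table), as described above.
CREATE TABLE dim_branch  (branch_id  INTEGER PRIMARY KEY,
                          branch_name TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    deal_id    INTEGER REFERENCES dim_deal(deal_id),
    model_id   INTEGER REFERENCES dim_model(model_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    branch_id  INTEGER REFERENCES dim_branch(branch_id),
    units_sold INTEGER,
    revenue    REAL
);
""")
db.commit()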
Snowflake Schema
A snowflake schema is an extension of a star schema in which the dimension
tables are normalized, splitting their data into additional tables. It is
called a snowflake because its diagram resembles a snowflake. For example,
a Country attribute can be normalized out of a branch dimension into an
individual lookup table.
Characteristics of Snowflake Schema:
- The main benefit of the snowflake schema is that it uses less disk space.
- It is easier to add a dimension to the schema.
- Query performance is reduced because more tables have to be joined.
- The primary challenge of the snowflake schema is the extra maintenance
effort caused by the larger number of lookup tables.
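Continuing the star schema sketch above, snowflaking the branch dimension
means normalizing the Country attribute out into its own lookup table
(again with hypothetical names):

import sqlite3

db = sqlite3.connect("snowflake_schema.db")
db.executescript("""
-- The Country attribute is split out of the branch dimension ...
CREATE TABLE dim_country (
    country_id   INTEGER PRIMARY KEY,
    country_name TEXT
);

-- ... so the branch dimension now references a second-level lookup table.
CREATE TABLE dim_branch (
    branch_id   INTEGER PRIMARY KEY,
    branch_name TEXT,
    country_id  INTEGER REFERENCES dim_country(country_id)
);
""")
db.commit()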
Star Schema vs. Snowflake Schema:
- Hierarchies: In a star schema, hierarchies for the dimensions are stored
in the dimension tables; in a snowflake schema, hierarchies are divided
into separate tables.
- Structure: A star schema contains a fact table surrounded by dimension
tables; a snowflake schema has one fact table surrounded by dimension
tables which are in turn surrounded by further dimension tables.
- Joins: In a star schema, a single join creates the relationship between
the fact table and any dimension table; a snowflake schema requires many
joins to fetch the data.
- Design: Star is a simple DB design; snowflake is a very complex DB design.
- Normalization: Star uses a denormalized data structure, so queries also
run faster; snowflake uses a normalized data structure.
- Redundancy: Star has a high level of data redundancy; snowflake has very
low data redundancy.
- Dimension tables: In a star schema, a single dimension table contains the
aggregated data; in a snowflake schema, the data is split into different
dimension tables.
- Cube processing: Faster for star; possibly slow for snowflake because of
the complex joins.
- Query performance: Star offers higher-performing queries using star-join
query optimization, and tables may be connected with multiple dimensions;
the snowflake schema is represented by a centralized fact table which is
unlikely to be connected with multiple dimensions directly.
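The third schema listed earlier, the fact constellation, combines several
star schemas so that multiple fact tables share the same dimension tables.
A minimal sketch follows, assuming a hypothetical shipments process
alongside sales; the names are illustrative only:

import sqlite3

db = sqlite3.connect("constellation_schema.db")
db.executescript("""
-- Dimensions shared by both business processes.
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY,
                          day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_branch  (branch_id  INTEGER PRIMARY KEY, branch_name TEXT);

-- Fact table for the sales process.
CREATE TABLE sales_fact (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    branch_id  INTEGER REFERENCES dim_branch(branch_id),
    units_sold INTEGER,
    revenue    REAL
);

-- Fact table for a second process, reusing the same dimension tables.
CREATE TABLE shipments_fact (
    date_id       INTEGER REFERENCES dim_date(date_id),
    product_id    INTEGER REFERENCES dim_product(product_id),
    branch_id     INTEGER REFERENCES dim_branch(branch_id),
    units_shipped INTEGER,
    freight_cost  REAL
);
""")
db.commit()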
Star Schema
1. Design Model
Step 1: Import data from the source.
Step 2: Select the data location.
Step 3: Open a sample data set, e.g. the Iris data set that ships with the
tool.
Step 4: Click the Retrieve operator and drag it into the Process view.
Step 5: The Retrieve operator appears in the Process view with its out
(output) port.
Step 6: Click on the repository entry.
Step 7: Select the local repository.
Step 8: Select the sample file.
Step 9: Connect the operator's out port to the result (res) port.
Step 10: Start execution of the current process.
Step 11: The output result is generated after execution of the current
process.
Step 12: You can now add a Store operator and connect it to the result
port.
Step 13: You can also plot charts of the sample data set.
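For comparison, the same retrieve-and-store flow can be sketched outside
RapidMiner in a few lines of Python, assuming scikit-learn only as a
convenient local source of the same Iris sample data:

import sqlite3

from sklearn.datasets import load_iris

# "Retrieve": load the Iris sample data set.
iris = load_iris(as_frame=True)
frame = iris.frame  # a pandas DataFrame: four features plus the target

# "Store": persist the example set into a local repository, modelled here
# as a SQLite database.
db = sqlite3.connect("local_repository.db")
frame.to_sql("iris", db, if_exists="replace", index=False)
db.close()

# "Results": inspect the data, as the Results view would display it.
print(frame.head())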