
Noida Institute of Engineering and Technology, Greater Noida

Unit-3
Data Mining and Warehousing

FOUNDATIONS OF DATA SCIENCE


(BCSDS0301)

B-Tech
Ms. Garima Dhawan
CSE(DS)
Assistant Professor
3rd Sem
Data Science

Noida Institute of Engineering and Technology, Greater Noida
Name : Ms. Garima Dhawan
Designation: Assistant Professor
Department: Data Science

Qualification:

• B.Tech (CSE) from Kurukshetra University, Kurukshetra, 2014
• M.Tech (CSE) from Guru Jambheshwar University of Science and Technology, 2016
• Qualified UGC NET 2019
• Qualified HTET 2016
• JRF Aspirant

Research Publications and Patents :


5 Papers in International Journals
1 Paper in IEEE
1 SCI
5 Patents

Syllabus

Objectives

The objectives of this course are:

1. To understand the fundamental concepts of Data Science.
2. To learn about various types of data formats and their manipulation.
3. To help students learn exploratory data analysis and visualization techniques, in addition to the R programming language.

Course Outcome

Course outcomes: After completion of this course, students will be able to:

CO 1: Understand the fundamental ideas behind data science and statistical techniques, as well as the applications in which students may use these concepts. (K2, K3)
CO 2: Explain and exemplify the most common forms of data and their representations. (K2)
CO 3: Illustrate data mining and warehousing so students can learn to clean and analyze the stored data. (K3)
CO 4: Illustrate data pre-processing techniques using R. (K2, K4)
CO 5: Evaluate various visualization methods for different types of data sets and application scenarios. (K3)

Program Outcome

1. Engineering knowledge: Apply the knowledge of mathematics, science,


engineering fundamentals, and an engineering specialization to the solution of
complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first
principles of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified
needs with appropriate consideration for the public health and safety, and the
cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based
knowledge and research methods including design of experiments, analysis and
interpretation of data, and synthesis of the information to provide valid
conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and modeling
to complex engineering activities with an understanding of the limitations.
Program Outcome

6. The engineer and society: Apply reasoning informed by the contextual


knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the
professional engineering solutions in societal and environmental contexts, and
demonstrate the knowledge of, and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering
activities with the engineering community and with society at large, such as,
being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear
instructions.

Program Outcome

11. Project management and finance: Demonstrate knowledge and


understanding of the engineering and management principles and apply
these to one’s own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation
and ability to engage in independent and life-long learning in the broadest
context of technological change.

CO-PO Mapping

PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 1 1 1 1 0 0 0 1 0 1 2
CO2 3 2 2 2 2 0 0 1 1 0 1 2
CO3 2 2 2 2 2 1 2 0 1 0 1 2
CO4 2 3 2 3 3 1 0 0 1 0 1 2
CO5 2 3 2 3 3 0 0 0 0 2 1 2
AVERAGE 2.2 2.2 1.8 2.2 2.2 0.4 0.4 0.2 0.8 0.4 1 2

CO-PSO Mapping

PSO1 PSO2 PSO3


CO1 1 1 1
CO2 1 1 1
CO3 2 2 2
CO4 2 2 2
CO5 2 2 2
AVERAGE 1.6 1.6 1.6

PSOs

At the end of the program, the student will be able to-

PSO 1: Analyse, design and develop solutions by applying fundamental concepts of Data Science.

PSO 2: Apply technical knowledge while using modern tools and technologies for solving complex problems.

PSO 3: Collaborate across different fields of science and technology with the right attitude, to work as an individual or as a team, demonstrating professional ethics for the well-being of society.

PEOs

PEO 1: Solve real-time complex problems and adapt to technological changes with the ability of lifelong learning.

PEO 2: Work as data scientists, entrepreneurs and bureaucrats for the goodwill of society and pursue higher education.

PEO 3: Exhibit professional ethics and moral values with good leadership qualities and effective interpersonal skills.

Result Analysis

Subject Name | Faculty Name | Section | Total Students | Pass Students | Section-wise Result % | Overall Result % | Highest Marks
Foundations of Data Science | Ms. Garima Dhawan | A | 86 | 85 | 98.84% | 99.13% | 98
Foundations of Data Science | Ms. Garima Dhawan | B | 85 | 85 | 100% | 99.13% | 93

Textbooks and References
Textbooks:

1) Glenn J. Myatt, Making sense of Data: A practical Guide to Exploratory Data Analysis and Data Mining,
John Wiley Publishers, 2007.
2) Data Analysis and Data Mining, 2nd Edition, John Wiley & Sons Publication, 2014.

Reference Books:

1) Open Data for Sustainable Community: Glocalized Sustainable Development Goals, Neha Sharma, Santanu
Ghosh, Monodeep Saha, Springer, 2021.
2) The Data Science Handbook, Field Cady, John Wiley & Sons, Inc, 2017

3) Data Mining Concepts and Techniques, Third Edition, Jiawei Han, Micheline Kamber, Jian Pei, Morgan
Kaufmann, 2012.

Brief Introduction about subject

Data science enables businesses to process huge amounts of structured


and unstructured big data to detect patterns. This in turn allows
companies to increase efficiencies, manage costs, identify new market
opportunities, and boost their market advantage.

Asking a personal assistant like Alexa or Siri for a recommendation demands data science. So does operating a self-driving car, using a search engine that provides useful results, or talking to a chatbot for customer service. These are all real-life applications of data science.

Contents

Data Mining & Warehousing:


1. Data Pre-processing: Form of Data Pre-processing
2. Why pre-process the data; attributes and their types
3. Understanding and extracting useful variables
4. KDD process
5. Data Cleaning: Missing Values, Noisy Data, (Binning, Clustering, Regression),
Inconsistent Data, Data Integration and Transformation
6. Data Reduction: Data Cube Aggregation, Dimensionality reduction, Data
Compression, Numerosity Reduction, Discretization and Concept hierarchy
generation
Data Warehouse Process and Technology:
7. Overview, Definition, Data Warehousing Components
8. Building a Data Warehouse,
9. Difference between Database System and Data Warehouse
10. Multi-Dimensional Data Model
11. Data Cubes, Stars, Snowflakes,
12. Fact Constellations, Warehousing Strategy, Warehouse Management and
Support Processes, Warehouse Schema Design. Aggregation, Query Facility,
OLAP function and Tools. OLAP Servers, ROLAP, MOLAP, HOLAP.

Prerequisite and Recap

Prerequisites:
• Knowledge about Database Management Systems

• SQL

• MS Office 2019

• Programming Language like R/Python

Recap:
• Types of data
• Data Handling Techniques

Unit Objective

The objectives of this unit are:

1. To understand the concepts of data mining and warehousing.
2. To learn about data warehouse process and technology.
3. To help students learn exploratory data pre-processing, integration and transformation, and data warehouse schemas and management.
4. To give an understanding of the building and architecture of a data warehouse, and to familiarize students with warehouse schema design, aggregation, query facilities, OLAP functions and tools, and OLAP servers (ROLAP, MOLAP, HOLAP).

Data Pre-Processing

Objective:
In this topic we will learn about data preprocessing and techniques of data
preprocessing. By preprocessing data, we make it easier to interpret and
use.
Data preprocessing is an integral step in Machine Learning as the quality of
data and the useful information that can be derived from it directly affects
the ability of our model to learn; therefore, it is extremely important that
we preprocess our data before feeding it into our model.

Data Pre-Processing

Data Preprocessing:
Data preprocessing is the process of transforming raw data into an understandable
format. It is also an important step in data mining as we cannot work with raw data.
Data preprocessing transforms the data into a format that is more easily and
effectively processed in data mining, machine learning and other data science tasks.
The techniques are generally used at the earliest stages of the machine learning and
AI development pipeline to ensure accurate results.
The quality of the data should be checked before applying machine learning or data
mining algorithms.

Techniques of Data Pre-Processing
1. Data Cleaning/Cleansing

Real-world data tend to be incomplete, noisy, and inconsistent. Data


Cleaning/Cleansing routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
Data can be noisy, having incorrect attribute values, owing to the following: the data collection instruments used may be faulty, human or computer errors may have occurred at data entry, and errors in data transmission can also occur.
“Dirty” data can cause confusion for the mining procedure. Although most mining routines have some procedures for dealing with incomplete or noisy data, these are not always robust. Therefore, a useful data preprocessing step is to run the data through some data cleaning/cleansing routines.

Techniques of Data Pre-Processing

Data Integration:

•The process of combining multiple sources into a single dataset. The Data
integration process is one of the main components in data management.
There are some problems to be considered during data integration.
•Schema integration
•Entity identification problem
•Detecting and resolving data value conflicts

Techniques of Data Pre-Processing

3. Data Transformation:

The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements. There
are some methods in data transformation.
•Smoothing
•Aggregation
•Discretization
•Normalization

Techniques of Data Pre-Processing

4. Data Reduction

This process helps in the reduction of the volume of the data which makes the
analysis easier yet produces the same or almost the same result. This reduction also
helps to reduce storage space. There are some of the techniques in data reduction
are Dimensionality reduction, Numerosity reduction, Data compression.

Why preprocess data attributes and its types?

Raw data can have missing or inconsistent values as well as present a lot of redundant
information. The most common problems you can find with raw data can be divided into 3
groups:
Missing data: you may see missing attribute values, missing certain attributes of
importance, or having only aggregate data since the information that isn’t there creates
gaps that might be relevant to the final analysis. Missing data often appears when there’s a
problem in the collection phase, such as a glitch that caused a system’s downtime, mistakes
in data entry, or issues with biometrics use, among others.
Noisy data: this group encompasses erroneous data and outliers that you can find in the
data set but that is just meaningless information. Here you can see noise made of human
mistakes, rare exceptions, mislabels, and other issues during data gathering.

Why preprocess data attributes and its types?

Inconsistent data: inconsistencies happen when you keep files with similar data in
different formats and files. Duplicates in different formats, mistakes in codes of names,
or the absence of data constraints often lead to inconsistent data, which introduces deviations that you have to deal with before analysis.

Understanding and Extracting useful variables

Variables are any characteristics that can take on different values, such as height, age,
temperature, or test scores.
Researchers often manipulate or measure independent and dependent variables in
studies to test cause-and-effect relationships.
•The independent variable is the cause. Its value is independent of other variables in
your study.
•The dependent variable is the effect. Its value depends on changes in the independent
variable.

Understanding and Extracting useful variables

An independent variable is the variable you manipulate or vary in an experimental


study to explore its effects. It’s called “independent” because it’s not influenced by any
other variables in the study.
Independent variables are also called:
•Explanatory variables (they explain an event or outcome)
•Predictor variables (they can be used to predict the value of a dependent variable)
•Right-hand-side variables (they appear on the right-hand side of
a regression equation).
These terms are especially used in statistics, where you estimate the extent to which an
independent variable change can explain or predict changes in the dependent variable.

Understanding and Extracting useful variables

Types of independent variables


There are two main types of independent variables.
•Experimental independent variables can be directly manipulated by researchers.
•Subject variables cannot be manipulated by researchers, but they can be used to group
research subjects categorically.
Dependent variable:
A dependent variable is the variable that changes as a result of the independent
variable manipulation. It’s the outcome you’re interested in measuring, and it
“depends” on your independent variable.
In statistics, dependent variables are also called:
•Response variables (they respond to a change in another variable)
•Outcome variables (they represent the outcome you want to measure)
•Left-hand-side variables (they appear on the left-hand side of a regression equation)
Understanding and Extracting useful variables

The dependent variable is what you record after you’ve manipulated the independent
variable. You use this measurement data to check whether and to what extent your
independent variable influences the dependent variable by conducting statistical
analyses.
Based on your findings, you can estimate the degree to which your independent
variable variation drives changes in your dependent variable. You can also predict how
much your dependent variable will change as a result of variation in the independent
variable.
Identifying independent vs. dependent variables
A dependent variable from one study can be the independent variable in another
study, so it’s important to pay attention to research design.

Understanding and Extracting useful variables

Recognizing independent variables


Use this list of questions to check whether you’re dealing with an independent variable:
•Is the variable manipulated, controlled, or used as a subject grouping method by the
researcher?
•Does this variable come before the other variable in time?
•Is the researcher trying to understand whether or how this variable affects another
variable?
Recognizing dependent variables
Check whether you’re dealing with a dependent variable:
•Is this variable measured as an outcome of the study?
•Is this variable dependent on another variable in the study?
•Does this variable get measured only after other variables are altered?

KDD Process

Objective:
This topic introduces the KDD process, in which we learn how to extract information from data in the context of large databases.
Knowledge Discovery in Databases (KDD) is considered an automated, exploratory analysis and modeling of vast data repositories.

KDD Process

The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
1. Developing an understanding of
   1. the application domain,
   2. the relevant prior knowledge,
   3. the goals of the end-user.
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
   1. Removal of noise or outliers.
   2. Collecting necessary information to model or account for noise.
   3. Strategies for handling missing data fields.
   4. Accounting for time sequence information and known changes.
KDD Process

4. Data reduction and projection.
   1. Finding useful features to represent the data depending on the goal of the task.
   2. Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task.
   1. Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s).
   1. Selecting method(s) to be used for searching for patterns in the data.
   2. Deciding which models and parameters may be appropriate.
   3. Matching a particular data mining method with the overall criteria of the KDD process.
KDD Process
7. Data mining: searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
The terms knowledge discovery and data mining are distinct.
KDD refers to the overall process of discovering useful knowledge from data. It
involves the evaluation and possibly interpretation of the patterns to make the
decision of what qualifies as knowledge. It also includes the choice of encoding
schemes, preprocessing, sampling, and projections of the data prior to the data
mining step.
Data mining refers to the application of algorithms for extracting patterns from data
without the additional steps of the KDD process.

Data Cleaning: Missing Data

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle
this part, data cleaning is done. It involves handling of missing data,
noisy data etc.
(a) Missing Data:
This situation arises when some data is missing in the data. It can
be handled in various ways.
• Ignore the tuples
• Fill the Missing values manually
• Use a Global constant to fill in the missing value
• Use the attribute mean to fill in the missing value.
• Use the attribute mean for all samples belonging to
the same class as given tuple.
• Use the most probable value to fill in the missing value.
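A minimal R sketch of three of the approaches above, using a small hypothetical numeric attribute (the vector age and the constant -1 are invented for illustration):

# Hypothetical attribute with missing values
age <- c(23, 25, NA, 31, NA, 28)

# Ignore the tuples that contain missing values
age_complete <- age[!is.na(age)]

# Fill with a global constant
age_const <- ifelse(is.na(age), -1, age)

# Fill with the attribute mean
age_mean <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)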
Data Cleaning : Noisy Data

b)Noisy Data:
Noisy data is a meaningless data that can’t be interpreted by machines. It can be
generated due to faulty data collection, data entry errors etc. It can be handled in
following ways :
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.
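A minimal R sketch of smoothing by bin means and by bin boundaries, on a small sorted example split into three equal-size bins (the data values and the bin count are assumptions for illustration):

# Sorted data, split into 3 equal-depth bins of 3 values each
x <- sort(c(4, 8, 9, 15, 21, 21, 24, 25, 34))
bins <- split(x, rep(1:3, each = 3))

# Smoothing by bin means: replace every value in a bin by the bin mean
by_means <- lapply(bins, function(b) rep(mean(b), length(b)))

# Smoothing by bin boundaries: snap each value to the nearer bin boundary
by_bounds <- lapply(bins, function(b) {
  lo <- min(b); hi <- max(b)
  ifelse(b - lo < hi - b, lo, hi)
})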

Data Cleaning : Noisy Data

2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Values that fall outside of the clusters may be considered outliers.
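A minimal R sketch of both ideas: smoothing one variable with a fitted linear regression and flagging values that lie far from their cluster centre with k-means (the toy data and the 95th-percentile threshold are assumptions):

set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 5)      # noisy, roughly linear data

# Smoothing by regression: replace y by the fitted values of a linear model
fit <- lm(y ~ x)
y_smooth <- fitted(fit)

# Outlier detection by clustering: points far from their k-means centre
km <- kmeans(cbind(x, y), centers = 2)
centres <- km$centers[km$cluster, ]
dist_to_centre <- sqrt(rowSums((cbind(x, y) - centres)^2))
outliers <- which(dist_to_centre > quantile(dist_to_centre, 0.95))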
C) Inconsistent data:
There may be inconsistencies in the data recorded for some transactions. Some data
inconsistencies may be corrected manually using external references.

Data Integration

The process of combining multiple sources into a single dataset. The Data integration
process is one of the main components in data management. There are some
problems to be considered during data integration.
•Schema integration: Integrates metadata(a set of data that describes other data)
from different sources.
•Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that the student_id of one database and the student_no of another database belong to the same entity.
•Detecting and resolving data value conflicts: The data taken from different databases
while merging may differ. Like the attribute values from one database may differ from
another database. For example, the date format may differ like “MM/DD/YYYY” or
“DD/MM/YYYY”.

Data Integration

Redundancy is another important issue. An attribute may be redundant if it can be


“derived” from another table, such as annual revenue.
Some redundancies can be detected by correlation analysis. For example, given two
attributes, such analysis can measure how strongly one attribute implies the other, based
on the available data. The correlation between attributes A and B can be measured by

r(A, B) = Σ (ai − mean(A)) (bi − mean(B)) / (n · σA · σB)

where
n = number of tuples,
ai = value of A in tuple i,
bi = value of B in tuple i,
mean(A), mean(B) = the mean values of A and B,
σA, σB = the standard deviations of A and B.

Data Integration

The correlation coefficient is a statistic that measures the degree to which one variable varies in tandem with
another. It ranges from -1 to +1. A +1 correlation means that as one variable rises, the
other rises proportionally; a -1 correlation means that as one rises, the other falls
proportionally. A 0 correlation means that there is no relationship between the
movements of the two variables.
From the above discussion, we can say that the greater the correlation coefficient, the more strongly the attributes are correlated to each other, and we can ignore one of them (either A or B) as redundant. If the value of the correlation coefficient is zero, the attributes are independent. If the value of the correlation coefficient is negative, one attribute discourages the other: as the value of one attribute increases, the value of the other attribute decreases.
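A minimal R sketch of this redundancy check during integration, computing the correlation coefficient both from the formula above and with the built-in cor() (the two attributes are hypothetical; both calculations give the same value):

# Hypothetical attributes coming from two sources being integrated
annual_revenue  <- c(120, 150, 180, 210, 260)
monthly_revenue <- c(10, 12.4, 15.1, 17.6, 21.5)

n <- length(annual_revenue)
sd_pop <- function(v) sqrt(mean((v - mean(v))^2))   # population std. deviation
r_manual <- sum((annual_revenue - mean(annual_revenue)) *
                (monthly_revenue - mean(monthly_revenue))) /
            (n * sd_pop(annual_revenue) * sd_pop(monthly_revenue))

r_builtin <- cor(annual_revenue, monthly_revenue)   # built-in Pearson correlation
# A value close to +1 suggests one of the attributes is redundant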

Data Transformation

The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements.
There are some methods in data transformation.
• Smoothing: With the help of algorithms, we can remove noise from the dataset
and helps in knowing the important features of the dataset.
• Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated into a data analysis description. This is an important step, since the accuracy of the results depends on the quantity and quality of the data: when the quality and the quantity of the data are good, the results are more relevant.
• Generalization: In this method, low level or “primitive” data are replaced by
higher level concepts through the use of concept hierarchies. For example,
categorical attributes like street, can be generalized to higher level concepts like
city or country.
Data Transformation
• Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we can
set an interval like (3 pm-5 pm, 6 pm-8 pm).
• Normalization: It is the method of scaling the data so that it can be represented
in a smaller range. Example ranging from -1.0 to 1.0. There are many methods for
data normalization. Decimal Scaling, Min-Max Normalization, z-Score
Normalization(zero-mean Normalization)
• Decimal Scaling: It normalizes by moving the decimal point of the values of the data. To normalize the data by this technique, we divide each value of the data by a power of ten (10^j) chosen from the maximum absolute value of the data. A data value vi is normalized to vi' by using the formula below:

vi' = vi / 10^j

Data Transformation

where j is the smallest integer such that max(|vi‘|)<1.


Example: Let the input data be: -20, 302, 401, -501, 601, 801, 901
To normalize the above data:
Step 1: Maximum absolute value in the given data (m): 901
Step 2: Divide the given data by 1000 (i.e. j = 3)
Result: The normalized data is: -0.02, 0.302, 0.401, -0.501, 0.601, 0.801, 0.901
Min-Max Normalization –

In this technique of data normalization, a linear transformation is performed on the original data. The minimum and maximum values of the data are fetched, and each value is replaced according to the following formula:

v' = ((v − Min(A)) / (Max(A) − Min(A))) × (new_max(A) − new_min(A)) + new_min(A)
Data Transformation
where A is the attribute data,
Min(A), Max(A) are the minimum and maximum values of A respectively,
v' is the new value of each entry in the data,
v is the old value of each entry in the data,
new_max(A), new_min(A) are the maximum and minimum values of the required new range (i.e. its boundary values) respectively.

Z-score normalization –
In this technique, values are normalized based on the mean and standard deviation of the data A. The formula used is:

v' = (v − mean(A)) / σA

where v' and v are the new and old values of each entry in the data respectively, and σA and mean(A) are the standard deviation and mean of A respectively.
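A minimal R sketch of the three normalization methods described above, applied to the decimal-scaling example data from the slide (the new min-max range of [0, 1] is an assumption):

x <- c(-20, 302, 401, -501, 601, 801, 901)

# Decimal scaling: divide by 10^j; for this data j = 3, so divide by 1000
j <- ceiling(log10(max(abs(x))))
x_decimal <- x / 10^j

# Min-max normalization to the new range [0, 1]
new_min <- 0; new_max <- 1
x_minmax <- (x - min(x)) / (max(x) - min(x)) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation
x_zscore <- (x - mean(x)) / sd(x)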
Data Reduction
Data mining is a technique used to handle huge amounts of data. While working with a huge volume of data, analysis becomes harder. To get around this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.
2. Dimensionality Reduction:
The highly relevant attributes should be used; all the rest can be discarded.
3. Numerosity Reduction:
The data are replaced or estimated by alternative, smaller data representations such as parametric models, or nonparametric methods such as clustering, sampling and the use of histograms.
Data Reduction
4. Data Compression:
This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, such reduction is called lossless reduction; otherwise it is called lossy reduction. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a small PCA sketch is shown after this list.
5. Discretization and concept hierarchy generation:
where raw data values for attributes are replaced by ranges or higher conceptual levels.
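A minimal R sketch of PCA-based dimensionality reduction with the built-in prcomp(), keeping only the first two principal components of a small hypothetical data set with correlated columns:

set.seed(7)
# Hypothetical data set with 5 numeric attributes, two of them nearly redundant
data <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5))
data$V4 <- data$V1 + rnorm(100, sd = 0.1)
data$V5 <- data$V2 + rnorm(100, sd = 0.1)

pca <- prcomp(data, center = TRUE, scale. = TRUE)
summary(pca)              # proportion of variance explained per component
reduced <- pca$x[, 1:2]   # keep only the first two principal components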

Concept Hierarchy Generation

Objective:
To organize information or concepts in a hierarchical structure or a specific partial order, which is used for defining knowledge in brief, high-level terms and for enabling mining of knowledge at several levels of abstraction.

Concept Hierarchy Generation

A concept hierarchy represents a series of mappings from a set of low-level


concepts to larger-level, more general concepts. A conceptual hierarchy includes a
set of nodes organized in a tree, where the nodes define values of an attribute
known as concepts. A specific node, “ANY”, is reserved for the root of the tree. A level number is assigned to each node in a conceptual hierarchy. The level of the root node is one; the level of a non-root node is one more than the level of its parent.
Because values are defined by nodes, the levels of nodes can also be used to
describe the levels of values. Concept hierarchy enables raw information to be
managed at a higher and more generalized level of abstraction. There are several
types of concept hierarchies which are as follows −

Concept Hierarchy Generation

Schema Hierarchy − Schema hierarchy represents the total or partial order between
attributes in the database. It can define existing semantic relationships between
attributes. In a database, more than one schema hierarchy can be generated by
using multiple sequences and grouping of attributes.
Set-Grouping Hierarchy − A set-grouping hierarchy constructs values for a given
attribute or dimension into groups or constant range values. It is also known as
instance hierarchy because the partial series of the hierarchy is represented on the
set of instances or values of an attribute. These hierarchies have more functional sense and are thus preferred over other hierarchies.
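A minimal R sketch of a set-grouping hierarchy, mapping a numeric age attribute into constant range groups with the built-in cut() (the break points and group labels are assumptions):

age <- c(19, 23, 35, 42, 51, 67)

# Group raw ages into range values, one level of a set-grouping hierarchy
age_group <- cut(age,
                 breaks = c(0, 25, 45, 65, Inf),
                 labels = c("young", "adult", "middle_aged", "senior"))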

Concept Hierarchy Generation

Operation-Derived Hierarchy − Operation-derived hierarchy is represented by a


set of operations on the data. These operations are defined by users,
professionals, or the data mining system. These hierarchies are usually
represented for mathematical attributes. Such operations can be as easy as a range value comparison, or as difficult as a data clustering and data distribution analysis algorithm.
Rule-based Hierarchy − In a rule-based hierarchy either a whole concept hierarchy
or an allocation of it is represented by a set of rules and is computed dynamically
based on the current information and rule definition. A lattice-like architecture is
used for graphically defining this type of hierarchy, in which each child-parent
route is connected with a generalization rule.

Data Warehouse

Objective:
To create a trove of historical data that can be retrieved and analyzed to
provide useful insight into the organization's operations.

Data Warehouse

• A Data Warehouse consists of data from multiple heterogeneous data sources and
is used for analytical reporting and decision making. Data Warehouse is a central
place where data is stored from different data sources and applications. The term
Data Warehouse was first coined by Bill Inmon in 1990.

Features of Data Warehouse

• A data warehouse is a database, which is kept separate from the


organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to
analyze its business.
• A data warehouse helps executives to organize, understand, and use their
data to take strategic decisions.
• Data warehouse systems help in the integration of diversity of application
systems.
• A data warehouse system helps in consolidated historical data analysis.

Characteristics of a Data Warehouse

• The following are the key characteristics of a Data Warehouse −

• Subject Oriented − In a DW system, the data is categorized and stored by a


business subject rather than by application like equity plans, shares, loans, etc.

• Integrated − Data from multiple data sources are integrated in a Data


Warehouse.

• Non Volatile − Data in data warehouse is non-volatile. It means when data is


loaded in DW system, it is not altered.

• Time Variant − A DW system contains historical data as compared to


Transactional system which contains only current data. In a Data warehouse you
can see data for 3 months, 6 months, 1 year, 5 years, etc.

Architecture & components of Data Warehouse

Data warehouse architecture defines the comprehensive architecture of data


processing and presentation that will be useful for data analysis and decision making
within the enterprise and organization. Each organization has different data
warehouses depending upon their need, but all of them are characterized by some
standard components.

The architecture of the data warehouse mainly consists of the proper arrangement of
its elements, to build an efficient data warehouse with software and hardware
components. The elements and components may vary based on the requirement of
organizations. All of these depend on the organization’s circumstances.

Components of Data Warehouse

1.Source Data Component:


In the Data Warehouse, the source data comes from different places. It is grouped into four categories:

External Data: For data gathering, most executives and data analysts rely on information coming from external sources for much of the information they use, such as statistical figures associated with their organization that are brought out by external sources and departments.

Internal Data: In every organization, users keep their “private” spreadsheets, reports, client profiles, and often even department databases. This is the internal information, a part that might be helpful in every data warehouse.

Components of Data Warehouse

• Operational System data: Operational systems are principally meant to run the
business. In each operational system, we periodically take the old data and store it in archived files.
• Flat files: A flat file is nothing but a text database that stores data in a plain text
format. Flat files generally are text files that have all data processing and structure
markup removed. A flat file contains a table with a single record per line.

2.Data Staging:
• After the data is extracted from various sources, now it’s time to prepare the data
files for storing in the data warehouse. The extracted data collected from various
sources must be transformed and made ready in a format that is suitable to be saved
in the data warehouse for querying and analysis.

Components of Data Warehouse

The data staging contains three primary functions that take place in this part:
• Data Extraction
• Data Transformation
• Data Loading

Components of Data Warehouse

3. Data Storage in Warehouse:


Data storage for data warehousing is split into multiple repositories. These data
repositories contain structured data in a very highly normalized form for fast and
efficient processing.
•Metadata: Metadata means data about data, i.e. it summarizes basic details regarding the data, making it easier to find and work with particular instances of data. Metadata may be created manually or generated automatically and can contain basic information about the data.
•Raw Data: Raw data is a set of data and information that has not yet been processed; it is delivered from a particular data entity to the data supplier and has not yet been processed by machine or human. This data is gathered from online sources to deliver deep insight into users’ online behavior.

Components of Data Warehouse

Summary Data or Data summary:


A data summary is a brief conclusion drawn from an enormous body of data. It is often produced where analysts write code and, in the end, declare the ultimate result in the form of summarized data. The data summary is one of the most essential things in data mining and processing.
4. Data Marts:
Data marts are also part of the storage component in a data warehouse. A data mart can store the information of a specific function of an organization that is handled by a
single authority. There may be any number of data marts in a particular organization
depending upon the functions. In short, data marts contain subsets of the data stored
in data warehouses.

Building a Data Warehouse

Some steps that are needed for building any data warehouse are as following
below:
1. To extract the (transactional) data from different data sources:
For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft has come up with an excellent tool; when you purchase Microsoft SQL Server, this tool is available free of cost.
2. To transform the transactional data:
There are various DBMSs where many companies store their data, such as MS Access, MS SQL Server, Oracle, Sybase etc. These companies also save data in spreadsheets, flat files, mail systems etc. Relating the data from all these sources is done while building a data warehouse.

Building a Data Warehouse

3. To load the (transformed) data into the dimensional database:


After building a dimensional model, the data is loaded in the dimensional
database. This process combines the several columns together or it may split one
field into the several columns. There are two stages at which transformation of
the data can be performed and they are: while loading the data into the
dimensional model or while data extraction from their origins.

4. To purchase a front-end reporting tool:


There are top-notch analytical tools available in the market, provided by several major vendors. Microsoft has also released its own cost-effective tool, Data Analyzer.

Difference between Database System and Data Warehouse
Database System | Data Warehouse
It supports operational processes. | It supports analysis and performance reporting.
Capture and maintain the data. | Explore the data.
Current data. | Multiple years of history.
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs. | Data is updated on scheduled processes.
Data verification occurs when entry is done. | Data verification occurs after the fact.
100 MB to GB. | 100 GB to TB.
ER based. | Star/Snowflake.
Application oriented. | Subject oriented.
Primitive and highly detailed. | Summarized and consolidated.
Flat relational. | Multidimensional.

Multidimensional Data Model

Objective:
In this topic we will learn about method for arranging the data in the
database, with better structuring and organization of the contents in the
database.

Multidimensional Data Model

A multidimensional model views data in the form of a data-cube. A data cube enables
data to be modeled and viewed in multiple dimensions. It is defined by dimensions and
facts.
The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the
store's sales for the dimension time, item, and location. These dimensions allow the save
to keep track of things, for example, monthly sales of items and the locations at which the
items were sold. Each dimension has a table related to it, called a dimensional table,
which describes the dimension further. For example, a dimensional table for an item may
contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.

Multidimensional Data Model

Consider the data of a shop for items sold per quarter in the
city of Delhi. The data is shown in the table. In this 2D
representation, the sales for Delhi are shown for the time
dimension (organized in quarters) and the item dimension
(classified according to the types of an item sold). The fact or
measure displayed is rupee_sold (in thousands).
Multidimensional Data Model

Now, if we want to view the sales data with a third dimension, For example, suppose
the data according to time and item, as well as the location is considered for the
cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table.
The 3D data of the table are represented as a series of 2D tables.

Multidimensional Data Model

Conceptually, it may also be represented by the same data in the


form of a 3D data cube, as shown in fig:

Data Cube

Objectives:
In this topic we will learn how,
• To easily interpret data.
• To represent data together with dimensions as certain measures of
business requirements

Data Cube

Grouping of data in a multidimensional matrix is called a data cube. In data warehousing, we generally deal with various multidimensional data models as the data
will be represented by multiple dimensions and multiple attributes. This
multidimensional data is represented in the data cube as the cube represents a high-
dimensional space. The Data cube pictorially shows how different attributes of data
are arranged in the data model. Below is the diagram of a general data cube.

Data Cube

The example above is a 3D cube having attributes like branch (A, B, C, D), item type (home, entertainment, computer, phone, security), and year (1997, 1998, 1999).
Data cube operations:

Data Cube Operations

Data cube operations are used to manipulate data to meet the needs of users. These
operations help to select particular data for the analysis purpose. There are mainly 5
operations listed below-
•Roll-up: this operation aggregates similar data attributes having the same dimension together. For example, if the data cube displays the daily income of a customer, we can use a roll-up operation to find his monthly income.
•Drill-down: this operation is the reverse of the roll-up operation. It allows us to
take particular information and then subdivide it further for finer granularity
analysis. It zooms into more detail. For example- if India is an attribute of a country
column and we wish to see villages in India, then the drill-down operation splits India
into states, districts, towns, cities, villages and then displays the required
information.

Data Cube

Slicing: this operation filters the unnecessary portions. Suppose in a particular


dimension, the user doesn’t need everything for analysis, rather a particular attribute.
For example, with country = ”Jamaica”, this will display only information about Jamaica and will not display the other countries present in the country list.
Dicing: this operation does a multidimensional cutting, that not
only cuts only one dimension but also can go to another dimension
and cut a certain range of it. As a result, it looks more like a subcube
out of the whole cube(as depicted in the figure). For example- the
user wants to see the annual salary of Jharkhand state employees.
Pivot: this operation is very important from a viewing point of view.
It basically transforms the data cube in terms of view. It doesn’t
change the data present in the data cube. For example, if the user is
comparing year versus branch, using the pivot operation, the user can change the viewpoint and now compare branch versus item.
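A minimal R sketch of roll-up, slice, dice and pivot on a small hypothetical cube (branch x item type x year) built as a base-R array; all figures are invented for illustration:

# Hypothetical sales cube: 2 branches x 3 item types x 2 years
sales <- array(c(10, 20, 30, 40, 50, 60,
                 15, 25, 35, 45, 55, 65),
               dim = c(2, 3, 2),
               dimnames = list(branch = c("A", "B"),
                               item   = c("home", "computer", "phone"),
                               year   = c("1997", "1998")))

# Roll-up over the year dimension (margins 1 = branch, 2 = item)
rollup_branch_item <- apply(sales, c(1, 2), sum)

# Slice: fix one dimension to a single value (year = "1998")
slice_1998 <- sales[, , "1998"]

# Dice: select a sub-cube over several dimensions
sub_cube <- sales["A", c("home", "phone"), ]

# Pivot: swap the viewing axes without changing the data
pivoted <- aperm(sales, c(2, 1, 3))   # now item x branch x year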
Advantages of Data Cube

Advantages of data cubes:


•Helps in giving a summarized view of data.
•Data cubes store large data in a simple way.
•Data cube operations provide quick and better analysis.
•Improve performance of data.

Schemas in Datawarehouse

Objectives:
This topic provides knowledge of the structure of a data warehouse and its complexity.

Star Schema

The star schema is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.

Star Schema

Fact Tables
A fact table in a star schema contains facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of a fact table is generally a composite key that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table generally contains facts with the same level of aggregation.
Dimension Tables
A dimension is an architecture usually composed of one or more hierarchies that
categorize data. If a dimension has not got hierarchies and levels, it is called a flat
dimension or list.

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 87


Star Schema

The primary keys of each of the dimensions table are part of the composite
primary keys of the fact table. Dimensional attributes help to define the
dimensional value. They are generally descriptive, textual values. Dimensional
tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the
geographic region (markets, cities), clients, products, times, channels.
Characteristics of Star Schema
•It provides a flexible design that can be changed easily or added to throughout
the development cycle, and as the database grows.
•It provides a parallel in design to how end-users typically think of and use the
data.
•It reduces the complexity of metadata for both developers and end-users.

Star Schema

Advantages of Star Schema

Snowflake Schema

Snowflake Schema in data warehouse is a logical arrangement of tables in a


multidimensional database such that the ER diagram resembles a snowflake shape. A
Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions.
The dimension tables are normalized which splits data into additional tables.

In the following Snowflake Schema example, Country is further normalized into an


individual table.

Example of Snowflake Schema

Snowflake Schema

Characteristics of Snowflake Schema:


•The main benefit of the snowflake schema is that it uses smaller disk space.
•It is easier to implement when a dimension is added to the schema.
•Query performance is reduced due to the multiple tables.
•The primary challenge that you will face while using the snowflake schema is that you need to perform more maintenance effort because of the larger number of lookup tables.

Fact Constellation Schema

• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.

Fact Constellation Schema

•The sales fact table is the same as that in the star schema.


•The shipping fact table has the five dimensions, namely item_key, time_key,
shipper_key, from_location, to_location.
•The shipping fact table also contains two measures, namely dollars sold and
units sold.
•It is also possible to share dimension tables between fact tables. For example,
time, item, and location dimension tables are shared between the sales and
shipping fact table.

Critical Data Warehousing Best Practices (Strategy)

1. Involve stakeholders early and often


A data warehouse must meet the needs of stakeholders like department
managers, business analysts, and data scientists, as they all access the
information the warehouse contains to run analyses and create reports.
Incorporating feedback from these parties improves the chances that an
organization's decision-makers will have the information necessary to
make informed choices and reduces the chance of requiring substantial
changes later.

2. Incorporate data governance


If the data fed into a warehouse is of poor quality, then centralizing it for
analytics is pointless — the results of analyses will be inaccurate and
misleading. To avoid this, organizations should implement robust data
governance processes. Departments should work to define the security,
collaboration, and retention policies for their data assets based on their
business and legal requirements.

Warehouse Strategies

3. Define user roles


User roles determine who can read, write, and update data within a warehouse.
If, for instance, a field needs to be updated in a department's source data,
whose responsibility is it to update the ETL jobs to account for this? Do code
and data updates need management approval? What happens if someone wants
to integrate a new data source? Organizations should strike a balance between
security and the operational flexibility necessary for analysts to work
effectively.

4. Understand data warehouse schema design


An organization should design its schemas to suit the data warehouse
technology it's using and its business needs. For example, the normalized
structure of a snowflake schema requires less storage and processing resources
than the slightly more denormalized data structure of a star schema, but the
latter facilitates faster data queries. The scalability of cloud data warehouses
now allows enterprises to denormalize their data to increase querying speed
free of resource constraints.

Warehouse Strategies
5. Iterate and test — then do it again
Taking an agile approach to developing and maintaining a data warehouse can
improve the repository's performance and ability to adapt to an organization’s
changing needs. By utilizing short development cycles with small, well-
defined tasks and testing plans, development teams can get faster feedback on
their results from relevant stakeholders and then iterate to improve their
systems and processes. This creates a quick feedback loop for product
development, and allows an enterprise to identify and resolve issues with its
warehouse before they impact users.

6. Take advantage of ELT and cloud data warehouses


ETL (extract, transform, load) and ELT (extract, load, transform) are the
processes used to ingest data from its source, transform it as needed, and store
it in the data warehouse. By moving the transformation step to the end of the
process, ELT allows organizations to ingest data and begin analyzing it more
quickly.

Warehouse Management Processes

There are four major processes that contribute to a data warehouse −


1. Extract and load the data.
2. Cleaning and transforming the data.
3. Backup and archive the data.
4. Managing queries and directing them to the appropriate data sources.

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 98


Warehouse Management Processes

1.Extract and Load Process:


Data extraction takes data from the source systems. Data load takes the extracted
data and loads it into the data warehouse.
Note − Before loading the data into the data warehouse, the information
extracted from the external sources must be reconstructed.
2. Clean and Transform Process:
Once the data is extracted and loaded into the temporary data store, it is time to
perform cleaning and transforming. Here is the list of steps involved in cleaning
and transforming −
•Clean and transform the loaded data into a consistent structure
•Partition the data
•Aggregation
A minimal sketch of these steps is shown below.
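The sketch uses a hypothetical staging data frame; all names and values are illustrative.

# Hypothetical staging table loaded from the source systems.
staging <- data.frame(region  = c("North", "North", "South", "South"),
                      quarter = c("Q1", "Q2", "Q1", "Q1"),
                      sales   = c(100, 120, NA, 90))

# 1. Clean and transform the loaded data into a consistent structure.
cleaned <- staging[complete.cases(staging), ]

# 2. Partition the data (here, one partition per region).
partitions <- split(cleaned, cleaned$region)

# 3. Aggregation: pre-compute summaries used by later queries.
aggregate(sales ~ region, data = cleaned, FUN = sum)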

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 99


Warehouse Management Processes

3.Backup and Archive the Data


In order to recover the data in the event of data loss, software failure, or hardware
failure, it is necessary to keep regular backups. Archiving involves removing the old
data from the system in a format that allows it to be quickly restored whenever
required.
4. Query Management Process
This process performs the following functions −
•manages the queries.
•helps speed up the execution time of queries.
•directs the queries to their most effective data sources.
•ensures that all the system sources are used in the most effective way.
•monitors actual query profiles.
12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 100
Aggregation

Data aggregation is the process where data is collected and presented in a


summarized format for statistical analysis and to effectively achieve business
objectives. Data aggregation is vital to data warehousing as it helps to make
decisions based on vast amounts of raw data. Data aggregation provides the
ability to forecast future trends and aids in predictive modeling. Effective data
aggregation techniques help to minimize performance problems.
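As a simple illustration (all names and values below are hypothetical), base R's aggregate() function can roll detail rows up into the kind of pre-computed summaries a warehouse query would use.

# Hypothetical detail rows.
sales <- data.frame(city    = c("Delhi", "Delhi", "Kolkata", "Kolkata"),
                    item    = c("Car", "Bus", "Car", "Bus"),
                    dollars = c(500, 200, 450, 250))

# Summary per city and item.
aggregate(dollars ~ city + item, data = sales, FUN = sum)

# Coarser summary per city only, e.g. for trend reporting.
aggregate(dollars ~ city, data = sales, FUN = sum)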
Data Aggregators:
Data aggregators are systems used in data mining to collect data from various
sources, then process the data and present it as summarized, useful information.
They play a vital role in enhancing customer data by acting as agents, and they
support the query and delivery procedure in which a customer requests data
instances about a specific product.

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 101


Aggregation

Working of data aggregators


The working of data aggregators is typically described in three stages: collection
(gathering data from the source systems), processing (cleaning and combining the
collected data), and presentation (delivering the summarized result).

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 102


OLAP

Objectives:
This topic provides us the knowledge of how to analyze database
information from multiple database systems at one time. The primary
objective is data analysis.

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 103


OLAP(Online analytical Processing)

OLAP (Online Analytical Processing):


OLAP is part of the broader category of business intelligence, which also
encompasses relational databases, report writing, and data mining. OLAP tools
enable users to analyze multidimensional data interactively from multiple
perspectives.

Types of OLAP Servers


We have four types of OLAP servers −
•Relational OLAP (ROLAP)
•Multidimensional OLAP (MOLAP)
•Hybrid OLAP (HOLAP)
•Specialized SQL Servers

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 104


OLAP(Online analytical Processing)

OLAP stands for Online Analytical Processing. It is a software technology that
allows users to analyze information from multiple database systems at the same
time. It is based on the multidimensional data model and allows the user to query
multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are
divided into one or more cubes, and these cubes are known as hyper-cubes.

Garima Dhawan Foundation of Data Science BCSDS0301
12/09/2024 Unit-3 105
OLAP(Online analytical Processing)

OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed (summary) data is converted
into more detailed data. It can be done by:
•Moving down in the concept hierarchy
•Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by
moving down in the concept hierarchy of the Time dimension (Quarter -> Month), as
sketched below.
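The sketch below uses hypothetical sales data held at month granularity; names and values are illustrative.

# Hypothetical cube data at month granularity.
cube <- data.frame(quarter = c("Q1", "Q1", "Q1", "Q2", "Q2", "Q2"),
                   month   = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"),
                   sales   = c(10, 12, 15, 9, 14, 11))

# Quarter-level (less detailed) view.
aggregate(sales ~ quarter, data = cube, FUN = sum)

# Drill down: move to the finer Month level of the Time hierarchy.
aggregate(sales ~ quarter + month, data = cube, FUN = sum)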

Garima Dhawan Foundation of Data Science BCSDS0301 Unit-


12/09/2024 106
3
OLAP(Online analytical Processing)

2. Roll up: It is the opposite of the drill-down operation. It performs aggregation on
the OLAP cube. It can be done by:
•Climbing up in the concept hierarchy
•Reducing the number of dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of the Location dimension (City -> Country), as
sketched below.
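The sketch below uses a hypothetical city-level table; names and values are illustrative.

# Hypothetical cube data at city granularity.
cube <- data.frame(city    = c("Delhi", "Kolkata", "Toronto", "Vancouver"),
                   country = c("India", "India", "Canada", "Canada"),
                   sales   = c(120, 80, 60, 40))

# Roll up: climb the Location hierarchy by aggregating City into Country.
aggregate(sales ~ country, data = cube, FUN = sum)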

Garima Dhawan Foundation of Data Science BCSDS0301 Unit-


12/09/2024 107
3
OLAP(Online analytical Processing)

3. Dice: It selects a sub-cube from the OLAP cube by applying selection criteria on
two or more dimensions. In the cube given in the overview section, a sub-cube is
selected with the following criteria (see the sketch below):
•Location = “Delhi” or “Kolkata”
•Time = “Q1” or “Q2”
•Item = “Car” or “Bus”
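The sketch builds a small hypothetical cube with expand.grid() and applies the same criteria; the measure values are dummies.

# Hypothetical cube with Location, Time, and Item dimensions.
cube <- expand.grid(location = c("Delhi", "Kolkata", "Mumbai"),
                    time     = c("Q1", "Q2", "Q3"),
                    item     = c("Car", "Bus", "Mobile"))
cube$sales <- seq_len(nrow(cube))   # dummy measure values

# Dice: keep the sub-cube defined by criteria on two or more dimensions.
subset(cube,
       location %in% c("Delhi", "Kolkata") &
       time     %in% c("Q1", "Q2") &
       item     %in% c("Car", "Bus"))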

Garima Dhawan Foundation of Data Science BCSDS0301 Unit-


12/09/2024 108
3
OLAP(Online analytical Processing)

4. Slice: It fixes a single dimension of the OLAP cube at one value, which results in
the creation of a new sub-cube over the remaining dimensions. In the cube given in
the overview section, the slice is performed on the dimension Time = “Q1”, as
sketched below.
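The sketch applies the slice to a small hypothetical cube; the measure values are dummies.

# Hypothetical cube with Time, Location, and Item dimensions.
cube <- expand.grid(time     = c("Q1", "Q2"),
                    location = c("Delhi", "Kolkata"),
                    item     = c("Car", "Bus"))
cube$sales <- seq_len(nrow(cube))   # dummy measure values

# Slice: fix one dimension at a single value (Time = "Q1"),
# leaving a sub-cube over Location and Item.
subset(cube, time == "Q1")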

Garima Dhawan Foundation of Data Science BCSDS0301 Unit-


12/09/2024 109
3
OLAP(Online analytical Processing)

5. Pivot: It is also known as the rotation operation, as it rotates the current
view to get a new view of the representation. In the sub-cube obtained
after the slice operation, performing the pivot operation gives a new view
of it, as sketched below.
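The sketch pivots a hypothetical Q1 slice by swapping the row and column dimensions; xtabs() builds the cross-tabulated view and t() rotates it.

# Hypothetical Q1 slice with Location and Item dimensions.
slice_q1 <- data.frame(location = c("Delhi", "Delhi", "Kolkata", "Kolkata"),
                       item     = c("Car", "Bus", "Car", "Bus"),
                       sales    = c(20, 10, 15, 12))

# Current view: Location as rows, Item as columns.
view1 <- xtabs(sales ~ location + item, data = slice_q1)
view1

# Pivot (rotate): swap the axes so Item becomes the rows.
t(view1)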

Garima Dhawan Foundation of Data Science BCSDS0301 Unit-


12/09/2024 110
3
OLAP(Online analytical Processing) Tools

How does OLAP Work?


OLAP works by extracting data from multiple sources and storing it in a data
warehouse, where it is cleansed and then organized into OLAP cubes; user queries
are answered from these cubes. The new term here is the OLAP cube. In a cube,
data is categorized by dimensions (e.g., geographical region, time period) derived
from the dimensions in the data warehouse, and each dimension is filled by
members (e.g., name, id).
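As a rough illustration, the sketch below builds a tiny cube from hypothetical fact rows with base R's xtabs() and then answers a multidimensional query against it; a real OLAP server would manage far larger cubes with pre-computed aggregates.

# Hypothetical fact rows extracted into the warehouse.
facts <- data.frame(region  = c("Delhi", "Delhi", "Mumbai", "Mumbai"),
                    year    = c("2018", "2019", "2018", "2019"),
                    product = c("Car", "Car", "Bus", "Bus"),
                    sales   = c(100, 120, 80, 95))

# Build a small cube with one cell per (region, year, product) combination.
sales_cube <- xtabs(sales ~ region + year + product, data = facts)

# Query the cube, e.g. "Delhi -> 2018 -> Sales" across all products.
sales_cube["Delhi", "2018", ]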

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 111


Multidimensional OLAP (MOLAP)

Multidimensional OLAP (MOLAP):


• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
• MOLAP stores data in an optimized multi-dimensional array storage, rather
than in a relational database.
• Therefore it requires the pre-computation and storage of information in the cube,
the operation known as processing.
• MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
The data cube contains all the possible answers to a given range of questions.
• MOLAP tools have a very fast response time and the ability to quickly write back
data into the data set.
• Some of the tools for that are:
• IBM Cognos, SAP NetWeaver BW, Microsoft Analysis Services, Mondrian OLAP
server, Infor BI OLAP Server, icCube.
Garima Dhawan Foundation of Data Science BCSDS0301
12/09/2024 112
Unit-3
Multidimensional OLAP (MOLAP)

Some of the tools for that are:


• IBM Cognos
• SAP NetWeaver BW
• Microsoft Analysis Services
• MicroStrategy Intelligence Server
• Mondrian OLAP server
• icCube
• Oracle Database OLAP option
• SAS OLAP Server
• IBM TM1
• Jedox OLAP Server
• Infor BI OLAP Server

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 113


Relational OLAP (ROLAP)

Relational OLAP (ROLAP):


• ROLAP works directly with relational databases. The base data and the dimension
tables are stored as relational tables and new tables are created to hold the
aggregated information. It depends on a specialized schema design.
• This methodology relies on manipulating the data stored in the relational database
to give the appearance of traditional OLAP's slicing and dicing functionality. In
essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause
in the SQL statement (see the sketch after this list).
• ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required
to answer the question.
• Common ROLAP tools include IBM Cognos, SAP NetWeaver BW, Microsoft Analysis
Services, Essbase, Jedox OLAP Server, SAS OLAP Server, and MicroStrategy
Intelligence Server.
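The sketch below illustrates that equivalence with a hypothetical fact table held as an R data frame; the SQL in the comment is conceptual only.

# Hypothetical fact table stored as a relational table (here, a data frame).
sales_fact <- data.frame(location = c("Delhi", "Delhi", "Kolkata"),
                         quarter  = c("Q1", "Q2", "Q1"),
                         dollars  = c(500, 300, 450))

# Conceptually: SELECT * FROM sales_fact WHERE quarter = 'Q1';
# The equivalent relational filter expressed in R:
subset(sales_fact, quarter == "Q1")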

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 114


Relational OLAP (ROLAP)

Some of the top ROLAP tools are as follows:


1. IBM Cognos
2. SAP NetWeaver BW
3. Microsoft Analysis Services
4. Essbase
5. Jedox OLAP Server
6. SAS OLAP Server
7. MicroStrategy Intelligence Server
8. Oracle Database OLAP option

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 115


Hybrid OLAP (HOLAP)

Hybrid OLAP (HOLAP):


• There is no clear agreement across the industry as to what constitutes Hybrid
OLAP, except that a database will divide data between relational and specialized
storage.
• For example, for some vendors, a HOLAP database will use relational tables to
hold the larger quantities of detailed data and use specialized storage for at least
some aspects of the smaller quantities of more-aggregate or less-detailed data.
• HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the
capabilities of both approaches.
• HOLAP tools can utilize both pre-calculated cubes and relational data sources.
12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 116
• Some of the top HOLAP tools include IBM Cognos, SAP NetWeaver BW, and Mondrian
OLAP Server (the full list follows on the next slide).
Hybrid OLAP (HOLAP)

• Some of the top HOLAP tools are as follows:


1. IBM Cognos
2. SAP NetWeaver BW
3. Mondrian OLAP server
4. Microsoft Analysis Services
5. Essbase
6. Jedox OLAP Server
7. SAS OLAP Server
8. MicroStrategy Intelligence Server
9. Oracle Database OLAP option

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 117


Basis | ROLAP | MOLAP | HOLAP
Storage location for summary aggregation | Relational database | Multidimensional database | Multidimensional database
Processing time | Very slow | Fast | Fast
Storage space requirement | Large (compared to MOLAP and HOLAP) | Medium | Small
Storage location for detail data | Relational database | Multidimensional database | Relational database
Latency | Low (compared to MOLAP and HOLAP) | High | Medium
Query response time | Slow (compared to MOLAP and HOLAP) | Fast | Medium

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 118


Daily Quiz

1. Define Data mining.


2. What are the steps involved in the KDD process?
3. Define Slice and Dice operation.
4. What is data warehouse?
5. What is the need for preprocessing the data?
6. What is dimensionality reduction?
7. Write the strategies for data reduction.
8. What is Datamart?
9. What is Snowflake Schema?
10. What is the definition of Cube in Data warehousing?

Garima Dhawan Foundation of Data Science BCSDS0301


12/09/2024 Unit-3 119
Weekly Assignment

1. What is the difference between OLTP and OLAP?


2. What are the stages of Data warehousing?
3. What is dimensionality reduction?
4. What do you mean by data mining? Differentiate between data mining and data
warehousing.
5. Differentiate between data warehouse and database.
6. What is data warehouse metadata and why is it used for?
7. Explain the process of KDD?
8. What are the issues in data mining?
9. Differentiate between fact table and dimension table.
10. Define Numerosity Reduction and Discretization ?

Garima Dhawan Foundation of Data Science BCSDS0301


12/09/2024 Unit-3 120
Topic Links

• https://www.youtube.com/watch?v=J61r--lv7-w
• https://www.youtube.com/watch?v=1NjPTh0Eoeg
• https://www.youtube.com/watch?v=mOHbYrXtKbc
• https://www.youtube.com/watch?v=CHYPF7jxlik
• https://www.youtube.com/watch?v=uigKK02XGxE
• https://www.youtube.com/watch?v=GkZre_zkJJ0

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 121


MCQs

1. What is KDD in data mining?


a) Knowledge Discovery Database
b) Knowledge Discovery Data
c) Knowledge Data definition
d) Knowledge data house
2. What is the use of data cleaning?
a) to remove the noisy data
b) correct the inconsistencies in data
c) transformations to correct the wrong data
d) All of the above
3. The partition of overall data warehouse is _______.
a) database
b) data cube
c) data mart
d) operational data

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 122


MCQs
4.____ refers to the mapping or classification of a class with some predefined
group or class.
a) Data Discrimination
b) Data Characterization
c) Data Definition
d) Data Visualization
5. A data warehouse is usually modeled by a multidimensional data structure.
This data structure is called ________
a) multidimensional schema
b) data cubes
c) data cells
d) all of the above
6.OLAP stands for
a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 123


MCQs
7. The operation of moving from finer-granularity data to a coarser granularity (by
means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting
8. _________ uses array-based multidimensional storage engines for
multidimensional views of data.
a) KOLAP
b) MOLAP
c) ROLAP
d) ZOLAP
9. OLAP is based on?
a) one dimensional data model
b) two dimensional data model
c) multidimensional data model
d) All of the above

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 124


MCQs

10.The operation of moving from finer-granularity data to a coarser granularity (by


means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 125


Glossary Questions

1. The process in which multiple data sources are combined is called _____


2. A _____ is a collection of tables, each of which is assigned a unique name which
uses the entity-relationship (ER) data model.
3. Relational data can be accessed by _____ written in a relational query language.
4. Why is the snowflake schema applied?
5. What are the different types of dimension tables in the context of data
warehousing?
6. What are the advantages of a data warehouse?
7. Which one is faster, Multidimensional OLAP or Relational OLAP? Explain.
8. What does data purging mean?
9. What are Aggregate tables?
10. Define data analytics in the context of data warehousing

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 126


Previous Year Question Paper(Online)

SECTION-B
Answer any TEN of the following
ACSDS0301 10*3=30
Q.No. Question Content Marks
1 Explain the process of datafication. 3
2 Discuss all phases of Data Science lifecycle in brief. 3

3 Differentiate between qualitative and quantitative data with examples. Mention their types. 3
4 What is an outlier? How can one detect outliers in the data? 3
5 What is the process of loading a .csv file in R? 3

df <- data.frame(Name = c(NA, 'John', 'Arun', NA, 'Andrew'),
                 Sales = c(20, 18, 22, 55, 59),
                 Price = c(33, 51, 20, 40, 20),
                 stringsAsFactors = FALSE)
6 Write R code that will remove all NA values from the Name column of the data frame above. 3

7 Explain the process of Principal Component Analysis and illustrate with example 3
8 List main functions of Janitor package and explain any 2 in brief 3
9 List down the advantages of data visualization in R 3

10 How can we visualize spatial data and maps in R? What are the packages available for spatial data? 3

11 Explain how Uber and Facebook are using data science techniques for data analytics. 3
12 Describe the working of a web scraper. 3

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 127


Previous year Question Paper (22-23)

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 128


Previous year Question Paper (22-23)

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 129


Expected Questions

1. What are the characteristics of a data warehouse?


2. Differentiate between fact table and dimension table.
3. Differentiate between star schema and snowflake schema in the context of
data warehousing.
4. Define the concept of Datamart in brief.
5. List different types of OLAP servers.
6. What is the difference between OLTP & OLAP?
7. What is the level of granularity of a fact table?
8. Which one is faster, Multidimensional OLAP or Relational OLAP?
9. How are metadata and data dictionaries different?
10. Are databases and data warehousing the same thing?

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 130


Recap of Unit

This unit provides us with:


• The fundamentals of Data Preprocessing
• An understanding of the Knowledge Discovery (KDD) process
• The major tasks in Data Preprocessing
• Knowledge about the Data Warehouse
• An understanding of the various OLAP servers

12/09/2024 Garima Dhawan Foundation of Data Science BCSDS0301 Unit-3 131
