Unit-3
Data Mining and Warehousing
B.Tech CSE (DS), 3rd Sem, Data Science
Ms. Garima Dhawan, Assistant Professor
Noida
CO 1: Understand the fundamental ideas behind data science and statistical techniques, as well as the applications in which students may use these concepts. (K2, K3)
CO 2: Explain and exemplify the most common forms of data and their representations. (K2)
         PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10 PO11 PO12
CO1       2    1    1    1    1    0    0    0    1    0    1    2
CO2       3    2    2    2    2    0    0    1    1    0    1    2
CO3       2    2    2    2    2    1    2    0    1    0    1    2
CO4       2    3    2    3    3    1    0    0    1    0    1    2
CO5       2    3    2    3    3    0    0    0    0    2    1    2
AVERAGE   2.2  2.2  1.8  2.2  2.2  0.4  0.4  0.2  0.8  0.4  1    2
Text Books:
1) Glenn J. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining, John Wiley Publishers, 2007.
2) Data Analysis and Data Mining, 2nd Edition, John Wiley & Sons Publication, 2014.
Reference Books:
1) Open Data for Sustainable Community: Glocalized Sustainable Development Goals, Neha Sharma, Santanu
Ghosh, Monodeep Saha, Springer, 2021.
2) The Data Science Handbook, Field Cady, John Wiley & Sons, Inc, 2017
3) Data Mining Concepts and Techniques, Third Edition, Jiawei Han, Micheline Kamber, Jian Pei, Morgan
Kaufmann, 2012.
Prerequisites:
• Knowledge about Database Management Systems
• SQL
• MS Office 2019
Recap:
• Types of data
• Data Handling Techniques
Objective:
In this topic we will learn about data preprocessing and the techniques used for it. By preprocessing data, we make it easier to interpret and use.
Data preprocessing is an integral step in Machine Learning as the quality of
data and the useful information that can be derived from it directly affects
the ability of our model to learn; therefore, it is extremely important that
we preprocess our data before feeding it into our model.
Data Preprocessing:
Data preprocessing is the process of transforming raw data into an understandable
format. It is also an important step in data mining as we cannot work with raw data.
Data preprocessing transforms the data into a format that is more easily and
effectively processed in data mining, machine learning and other data science tasks.
The techniques are generally used at the earliest stages of the machine learning and
AI development pipeline to ensure accurate results.
The quality of the data should be checked before applying machine learning or data
mining algorithms.
Data Integration:
•The process of combining multiple sources into a single dataset. The Data
integration process is one of the main components in data management.
There are some problems to be considered during data integration.
•Schema integration
•Entity identification problem
•Detecting and resolving data value conflicts
3. Data Transformation:
The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements. There
are some methods in data transformation.
•Smoothing
•Aggregation
•Discretization
•Normalization
4. Data Reduction
This process helps in the reduction of the volume of the data which makes the
analysis easier yet produces the same or almost the same result. This reduction also
helps to reduce storage space. Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction, and data compression.
Raw data can have missing or inconsistent values as well as present a lot of redundant
information. The most common problems you can find with raw data can be divided into 3
groups:
Missing data: you may see missing attribute values, missing attributes of importance, or only aggregate data; the information that isn't there creates gaps that might be relevant to the final analysis. Missing data often appears when there's a problem in the collection phase, such as a glitch that caused a system's downtime, mistakes in data entry, or issues with biometrics use, among others.
Noisy data: this group encompasses erroneous data and outliers that you can find in the
data set but that is just meaningless information. Here you can see noise made of human
mistakes, rare exceptions, mislabels, and other issues during data gathering.
Inconsistent data: inconsistencies happen when you keep files with similar data in
different formats and files. Duplicates in different formats, mistakes in codes of names,
or the absence of data constraints often lead to inconsistent data, that introduces
deviations that you have to deal with before analysis.
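A minimal Python sketch (an illustration added here, not part of the original notes) of how the three problem groups can be spotted with pandas; the DataFrame and its column names are hypothetical:

import pandas as pd
import numpy as np

# Hypothetical raw data containing the three common problems described above.
df = pd.DataFrame({
    "age":   [21, 22, np.nan, 23, 210, 22],                       # a gap and an outlier
    "city":  ["Delhi", "delhi", "Mumbai", "Delhi", "Mumbai", "delhi"],  # inconsistent codes
    "marks": [78, 81, 75, np.nan, 90, 81],
})

# 1) Missing data: count the gaps in each attribute.
print(df.isna().sum())

# 2) Noisy data: flag outliers with a simple interquartile-range rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)])  # age = 210

# 3) Inconsistent data: duplicates and differently coded category values.
print(df.duplicated().sum())
print(df["city"].str.lower().value_counts())   # 'Delhi' vs 'delhi'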
Variables are any characteristics that can take on different values, such as height, age, temperature, or test scores.
Researchers often manipulate or measure independent and dependent variables in
studies to test cause-and-effect relationships.
•The independent variable is the cause. Its value is independent of other variables in
your study.
•The dependent variable is the effect. Its value depends on changes in the independent
variable.
The dependent variable is what you record after you’ve manipulated the independent
variable. You use this measurement data to check whether and to what extent your
independent variable influences the dependent variable by conducting statistical
analyses.
Based on your findings, you can estimate the degree to which your independent
variable variation drives changes in your dependent variable. You can also predict how
much your dependent variable will change as a result of variation in the independent
variable.
Identifying independent vs. dependent variables
A dependent variable from one study can be the independent variable in another
study, so it’s important to pay attention to research design.
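As a small illustrative sketch (not from the notes), a simple linear regression can estimate how much an independent variable drives a dependent variable; the variables (study hours, test score) and numbers below are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variable: hours studied. Dependent variable: test score.
hours = np.array([[1], [2], [3], [4], [5], [6]])
score = np.array([52, 55, 61, 64, 70, 74])

model = LinearRegression().fit(hours, score)
print("estimated effect per extra hour:", model.coef_[0])
print("share of variation explained (R^2):", model.score(hours, score))
print("predicted score for 7 hours:", model.predict([[7]])[0])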
Objective:
This topic introduces the KDD process, in which we learn how to extract information from data in the context of large databases.
Knowledge Discovery in Databases (KDD) is an automatic, exploratory analysis and modeling of vast data repositories.
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
1.Developing an understanding of
1. the application domain
2. the relevant prior knowledge
3. the goals of the end-user
2.Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3.Data cleaning and preprocessing.
1. Removal of noise or outliers.
2. Collecting necessary information to model or account for noise.
3. Strategies for handling missing data fields.
4. Accounting for time sequence information and known changes.
KDD Process
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle
this part, data cleaning is done. It involves handling of missing data,
noisy data etc.
(a) Missing Data:
This situation arises when some values are missing from the data. It can be handled in various ways.
• Ignore the tuples
• Fill the Missing values manually
• Use a Global constant to fill in the missing value
• Use the attribute mean to fill in the missing value.
• Use the attribute mean for all samples belonging to
the same class as given tuple.
• Use the most probable value to fill in the missing value.
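A short pandas sketch (added for illustration, not from the notes) of several of the missing-value strategies listed above; the data and column names are hypothetical:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "marks": [70.0, np.nan, 55.0, 60.0, np.nan],
})

# Ignore the tuples (drop rows with missing values).
dropped = df.dropna()

# Fill with a global constant.
const_filled = df.fillna({"marks": -1})

# Fill with the attribute mean.
mean_filled = df.assign(marks=df["marks"].fillna(df["marks"].mean()))

# Fill with the attribute mean of all samples in the same class as the given tuple.
class_mean_filled = df.assign(
    marks=df.groupby("class")["marks"].transform(lambda s: s.fillna(s.mean()))
)
print(class_mean_filled)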
Data Cleaning : Noisy Data
b)Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size and then various methods are applied to each of them. Each segment is handled separately: one can replace all data in a segment by its mean, or the bin boundary values can be used instead (a short sketch follows this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Values that do not belong to any cluster, or that fall outside the clusters, may be treated as outliers.
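A small sketch (added for illustration) of smoothing sorted data by bin means and by bin boundaries, assuming equal-size bins of three values; the numbers are hypothetical:

import numpy as np

# Sorted data, divided into equal-size segments (bins) of 3 values each.
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = data.reshape(-1, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value is replaced by the closer boundary.
by_boundaries = np.where(
    np.abs(bins - bins[:, [0]]) <= np.abs(bins - bins[:, [-1]]),
    bins[:, [0]], bins[:, [-1]]
).ravel()

print(by_means)        # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_boundaries)   # [ 4  4 15 21 21 24 25 25 34]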
C) Inconsistent data:
There may be inconsistencies in the data recorded for some transactions. Some data
inconsistencies may be corrected manually using external references.
The process of combining multiple sources into a single dataset. The Data integration
process is one of the main components in data management. There are some
problems to be considered during data integration.
•Schema integration: Integrates metadata (a set of data that describes other data) from different sources.
•Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that student_id of one database and student_no of another database belong to the same entity.
•Detecting and resolving data value conflicts: The data taken from different databases
while merging may differ. Like the attribute values from one database may differ from
another database. For example, the date format may differ like “MM/DD/YYYY” or
“DD/MM/YYYY”.
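A hypothetical pandas sketch (not from the notes) of the entity identification and data value conflict issues described above, assuming one source uses student_id with MM/DD/YYYY dates and the other uses student_no with DD/MM/YYYY dates:

import pandas as pd

# Source 1 uses 'student_id' and MM/DD/YYYY dates.
src1 = pd.DataFrame({"student_id": [1, 2],
                     "admission_date": ["08/15/2023", "09/01/2023"]})

# Source 2 uses 'student_no' and DD/MM/YYYY dates.
src2 = pd.DataFrame({"student_no": [1, 2],
                     "fee_paid_on": ["20/08/2023", "05/09/2023"]})

# Entity identification: treat student_id and student_no as the same key.
merged = src1.merge(src2, left_on="student_id", right_on="student_no")

# Resolve data value conflicts: bring both date columns to one common format.
merged["admission_date"] = pd.to_datetime(merged["admission_date"], format="%m/%d/%Y")
merged["fee_paid_on"] = pd.to_datetime(merged["fee_paid_on"], format="%d/%m/%Y")
print(merged)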
The correlation between two numerical attributes A and B can be measured with the correlation coefficient (Pearson's product moment coefficient):

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}

Where,
n = number of tuples
a_i = value of A in tuple i
b_i = value of B in tuple i
\bar{A}, \bar{B} = mean values of A and B
\sigma_A, \sigma_B = standard deviations of A and B
It is a statistic that measures the degree to which one variable varies in tandem with
another. It ranges from -1 to +1. A +1 correlation means that as one variable rises, the
other rises proportionally; a -1 correlation means that as one rises, the other falls
proportionally. A 0 correlation means that there is no relationship between the
movements of the two variables.
From the above discussion, we can say that the greater the absolute value of the correlation coefficient, the more strongly the attributes are correlated with each other, and we can ignore either one of them (A or B) as redundant. If the value of the correlation coefficient is zero, the attributes have no linear relationship and are treated as independent. If the value of the correlation coefficient is negative, one attribute discourages the other: as the value of one attribute increases, the value of the other attribute decreases.
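A brief sketch (added for illustration) of correlation-based redundancy checking with pandas; the attributes below are synthetic, with height_in deliberately made almost a copy of height_cm:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
height_cm = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54 + rng.normal(0, 0.1, size=200),  # nearly redundant
    "marks":     rng.normal(70, 8, size=200),                      # unrelated attribute
})

# Pairwise Pearson correlation coefficients r_{A,B}.
corr = df.corr()
print(corr.round(2))

# Attribute pairs with |r| close to 1 are strongly correlated; one of them can be dropped.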
The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements.
There are some methods in data transformation.
• Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset.
• Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated and presented together with the data analysis description. This is an important step since the accuracy of the results depends on the quantity and quality of the data: when the quality and the quantity of the data are good, the results are more relevant.
• Generalization: In this method, low level or “primitive” data are replaced by
higher level concepts through the use of concept hierarchies. For example,
categorical attributes like street, can be generalized to higher level concepts like
city or country.
Data Transformation
• Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we can
set an interval like (3 pm-5 pm, 6 pm-8 pm).
• Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0. Common methods for data normalization are Decimal Scaling, Min-Max Normalization, and z-Score Normalization (zero-mean Normalization).
• Decimal Scaling: It normalizes by moving the decimal point of the values of the data. To normalize the data by this technique, we divide each value by 10^j, where j is the smallest integer such that the maximum absolute normalized value is less than 1. The data value v_i is normalized to v_i' using the formula:

v_i' = \frac{v_i}{10^{j}}
• Z-score normalization: In this technique, values are normalized based on the mean and standard deviation of the attribute A. The formula used is:

v_i' = \frac{v_i - \bar{A}}{\sigma_A}
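A compact Python sketch (added for illustration) of the transformation methods above: min-max, z-score and decimal-scaling normalization plus interval discretization; all values are hypothetical:

import numpy as np
import pandas as pd

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0, 1].
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score (zero-mean) normalization: (v - mean) / standard deviation.
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j so that max(|v'|) < 1 (here j = 4).
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal_scaled = v / (10 ** j)

# Discretization: split a continuous attribute (hour of the day) into intervals.
class_time = pd.Series([15.5, 16.0, 18.5, 19.0])
intervals = pd.cut(class_time, bins=[15, 17, 19, 21],
                   labels=["3pm-5pm", "5pm-7pm", "7pm-9pm"])

print(min_max, z_score, decimal_scaled, sep="\n")
print(intervals.tolist())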
Objective:
To organize information or concepts in a hierarchical structure or a specific partial order, which is used for defining knowledge briefly at a high level and for mining knowledge at several levels of abstraction.
Schema Hierarchy − Schema hierarchy represents the total or partial order between
attributes in the database. It can define existing semantic relationships between
attributes. In a database, more than one schema hierarchy can be generated by
using multiple sequences and grouping of attributes.
Set-Grouping Hierarchy − A set-grouping hierarchy organizes the values of a given attribute or dimension into groups or constant-range values. It is also known as an instance hierarchy because the partial order of the hierarchy is defined on the set of instances or values of an attribute. For example, the values of the attribute age can be grouped into ranges such as young, middle_aged, and senior. These hierarchies have more functional sense and are therefore more widely used than other hierarchies.
Objective:
To create a trove of historical data that can be retrieved and analyzed to
provide useful insight into the organization's operations.
• A Data Warehouse consists of data from multiple heterogeneous data sources and is used for analytical reporting and decision making. A Data Warehouse is a central place where data is stored from different data sources and applications. The term Data Warehouse was first coined by Bill Inmon in 1990.
The architecture of the data warehouse mainly consists of the proper arrangement of
its elements, to build an efficient data warehouse with software and hardware
components. The elements and components may vary based on the requirement of
organizations. All of these depend on the organization’s circumstances.
External Data: For data gathering, most executives and data analysts rely on information coming from external sources for a large amount of the information they use. They use statistics related to their organization that are produced by external sources and departments.
Internal Data: In every organization, users keep their "private" spreadsheets, reports, client profiles, and often even department databases. This is the internal data, part of which could be helpful in a data warehouse.
• Operational System data: Operational systems are principally meant to run the business. In each operational system, we periodically take the old data and store it in archived files.
• Flat files: A flat file is nothing but a text database that stores data in a plain text
format. Flat files generally are text files that have all data processing and structure
markup removed. A flat file contains a table with a single record per line.
2.Data Staging:
• After the data is extracted from various sources, now it’s time to prepare the data
files for storing in the data warehouse. The extracted data collected from various
sources must be transformed and made ready in a format that is suitable to be saved
in the data warehouse for querying and analysis.
The data staging contains three primary functions that take place in this part:
• Data Extraction
• Data Transformation
• Data Loading
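A toy sketch (not from the notes) of the three staging functions, assuming a hypothetical source file sales.csv with sale_date, item and amount columns, and SQLite standing in for the warehouse store:

import sqlite3
import pandas as pd

# Data Extraction: read raw records from an operational source (here a flat file).
raw = pd.read_csv("sales.csv")            # hypothetical source file

# Data Transformation: clean and reshape into the format expected by the warehouse.
raw["sale_date"] = pd.to_datetime(raw["sale_date"])
raw["amount"] = raw["amount"].fillna(0)
transformed = raw.groupby(["sale_date", "item"], as_index=False)["amount"].sum()

# Data Loading: write the prepared data into the warehouse storage.
con = sqlite3.connect("warehouse.db")
transformed.to_sql("fact_sales", con, if_exists="append", index=False)
con.close()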
Some steps that are needed for building any data warehouse are as follows:
1. To extract the (transactional) data from different data sources:
For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft provides a tool that is available free of cost when you purchase Microsoft SQL Server.
2. To transform the transactional data:
There are various DBMSs in which many companies store their data, such as MS Access, MS SQL Server, Oracle, Sybase, etc. These companies also save data in spreadsheets, flat files, mail systems, etc. Relating the data from all these sources is done while building a data warehouse.
Operational System | Data Warehouse
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data verification occurs when entry is done. | Data verification occurs after the fact.
ER based. | Star/Snowflake based.
Objective:
In this topic we will learn about a method for arranging the data in the database, with better structuring and organization of its contents.
A multidimensional model views data in the form of a data-cube. A data cube enables
data to be modeled and viewed in multiple dimensions. It is defined by dimensions and
facts.
The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the
store's sales for the dimensions time, item, and location. These dimensions allow the store
to keep track of things, for example, monthly sales of items and the locations at which the
items were sold. Each dimension has a table related to it, called a dimensional table,
which describes the dimension further. For example, a dimensional table for an item may
contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of items sold). The fact or measure displayed is rupee_sold (in thousands).
Multidimensional Data Model
Now, suppose we want to view the sales data with a third dimension. For example, the data according to time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are represented as a series of 2D tables.
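A small sketch (added for illustration) of the 2D and 3D views described above, built with a pandas pivot table; the sales figures and names are made up:

import pandas as pd

sales = pd.DataFrame({
    "location": ["Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai", "Kolkata"],
    "quarter":  ["Q1", "Q1", "Q2", "Q1", "Q2", "Q1"],
    "item":     ["Keyboard", "Mobile", "Keyboard", "Mobile", "Mobile", "Keyboard"],
    "rupee_sold": [605, 825, 680, 925, 1002, 372],   # measure, in thousands
})

# 2D view: sales for one city, by time and item.
delhi_2d = sales[sales["location"] == "Delhi"].pivot_table(
    index="quarter", columns="item", values="rupee_sold", aggfunc="sum")
print(delhi_2d)

# 3D view: add location as a third dimension (a series of 2D tables).
cube_3d = sales.pivot_table(
    index="quarter", columns=["location", "item"],
    values="rupee_sold", aggfunc="sum")
print(cube_3d)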
Objectives:
In this topic we will learn how,
• To easily interpret data.
• To represent data together with dimensions as certain measures of
business requirements
Data cube operations are used to manipulate data to meet the needs of users. These
operations help to select particular data for the analysis purpose. There are mainly 5
operations listed below-
•Roll-up: this operation aggregates similar data attributes along a dimension (climbing up its concept hierarchy). For example, if the data cube displays the daily income of a customer, we can use a roll-up operation to find his monthly income.
•Drill-down: this operation is the reverse of the roll-up operation. It allows us to take particular information and then subdivide it further for finer-granularity analysis. It zooms into more detail. For example, if India is an attribute of a country column and we wish to see villages in India, then the drill-down operation splits India into states, districts, towns, cities, and villages and then displays the required information.
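A minimal sketch (added for illustration) of roll-up and drill-down using pandas grouping along a time concept hierarchy (day -> month -> quarter); the income data is hypothetical:

import pandas as pd

# Daily income of a customer (finest level of the time hierarchy).
daily = pd.DataFrame({
    "date":   pd.date_range("2024-01-01", periods=120, freq="D"),
    "income": 1000,
})

# Roll-up: aggregate days up to months (climbing the time hierarchy).
monthly = daily.groupby(daily["date"].dt.to_period("M"))["income"].sum()
print(monthly)

# Further roll-up: months up to quarters.
quarterly = daily.groupby(daily["date"].dt.to_period("Q"))["income"].sum()
print(quarterly)

# Drill-down is the reverse: returning from the quarterly view to the
# monthly or daily table above for finer granularity.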
Objectives:
This topic provides us with knowledge of the structure of a data warehouse schema and its complexity.
The star schema is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table in a star schema contains facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of the fact table is generally a composite key made up of all of its foreign keys.
A fact table may contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). A fact table generally contains facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list.
The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are generally descriptive, textual values. Dimension tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities), clients, products, times, and channels.
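A small sketch (added for illustration) of a star schema represented with pandas DataFrames: one fact table whose foreign keys point to two dimension tables; all names and values are hypothetical:

import pandas as pd

# Dimension tables: descriptive, textual attributes with surrogate primary keys.
dim_item = pd.DataFrame({"item_key": [1, 2],
                         "item_name": ["Keyboard", "Mobile"],
                         "brand": ["BrandA", "BrandB"]})
dim_location = pd.DataFrame({"location_key": [10, 20],
                             "city": ["Delhi", "Mumbai"]})

# Fact table: foreign keys to the dimensions plus numeric measures.
fact_sales = pd.DataFrame({"item_key": [1, 2, 2],
                           "location_key": [10, 10, 20],
                           "units_sold": [5, 3, 7],
                           "rupees_sold": [2500, 45000, 105000]})

# A star join: resolve the foreign keys to answer an analytical query.
report = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_location, on="location_key")
          .groupby(["city", "item_name"])["rupees_sold"].sum())
print(report)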
Characteristics of Star Schema
•It provides a flexible design that can be changed easily or added to throughout
the development cycle, and as the database grows.
•It provides a parallel in design to how end-users typically think of and use the
data.
•It reduces the complexity of metadata for both developers and end-users.
• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
Objectives:
This topic provides us the knowledge of how to analyze database
information from multiple database systems at one time. The primary
objective is data analysis.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1.Drill down: In drill-down operation, the less detailed data is converted into highly
detailed data. It can be done by:
1. Moving down in the concept hierarchy
2. Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can be done by:
• Climbing up in the concept hierarchy
• Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept hierarchy of the Location dimension (City -> Country).
Hybrid OLAP (HOLAP)
• HOLAP is a combination of ROLAP and MOLAP, except that a database will divide data between relational and specialized storage.
• For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed data and use specialized storage for at least some of the smaller quantities of aggregated, less-detailed data.
• HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches.
• HOLAP tools can utilize both pre-calculated cubes and relational data sources.
• Some of the top HOLAP tools are: IBM Cognos, SAP NetWeaver BW, Mondrian OLAP
Query response time:
ROLAP – Slow query response time as compared to MOLAP and HOLAP.
MOLAP – Fast query response time as compared to ROLAP and HOLAP.
HOLAP – Medium query response time as compared to MOLAP and ROLAP.
• https://www.youtube.com/watch?v=J61r--lv7-w
• https://www.youtube.com/watch?v=1NjPTh0Eoeg
• https://www.youtube.com/watch?v=mOHbYrXtKbc
• https://www.youtube.com/watch?v=CHYPF7jxlik
• https://www.youtube.com/watch?v=uigKK02XGxE
• https://www.youtube.com/watch?v=GkZre_zkJJ0
SECTION-B
Answer any TEN of the following
ACSDS0301 10*3=30
Q.No. Question Content Marks
1 Explain the process of datafication. 3
2 Discuss all phases of Data Science lifecycle in brief. 3
3 Differentiate between qualitative and quantitative data with examples. Mention their types. 3
4 What is an outlier? How can one detect outliers in the data? 3
5 What is the process of loading a .csv file in R? 3
7 Explain the process of Principal Component Analysis and illustrate with example 3
8 List main functions of Janitor package and explain any 2 in brief 3
9 List down the advantages of data visualization in R 3
10 How can we visualize spatial data and maps in R? What are the packages available for spatial data? 3
11 Explain how Uber and Facebook are using data science techniques for data analytics. 3
12 Describe the working of a web scraper. 3