
Data Mining

Prepared By: Eesha Tur Razia Babar


Quality of Data 2
• Data quality is a measure of whether data is good enough to support the outcomes it is used for

• Measuring data quality levels can help organizations identify data errors that need to be resolved and
assess whether the data in their IT systems is fit to serve its intended purpose

• High-quality data is the foundation of all digital businesses

• Poor data quality can have the following consequences:


• Wastage of time and money
• Incorrect reports resulting in misguided business decisions
• Negative image of company
• Lower customer satisfaction
Quality of Data 3

• Some of the most common issues affecting data quality are:

• inconsistent formatting of dates and numbers
• unusual character sets and symbols
• duplicate entries
• different languages and measurement units, etc.

So it is really important to assess the quality of data before using it


Quality of Data 4
• There are seven pillars of Data Quality:
• Accuracy (Is the information correct in every detail?)
• No errors in the data
• The data reflects actual, real-world scenarios
• Inaccurate information can cause significant problems with severe consequences
• Completeness (How comprehensive is the information?)
• Completeness is a measure of the data’s ability to effectively deliver all the required values that are available
• It refers to how comprehensive the information is
• If information is incomplete, it might be unusable
• Consistency (Does the information contradict other trusted resources?)
• Data consistency refers to the uniformity of data as it moves across networks and applications
• The same data values stored in different locations should not conflict with one another
Quality of Data 5
• Relevance (Do you really need this information?) / Timeliness (How up-to-date is the information?)
• Timely data is data that is available when it is required.
• Data may be updated in real time to ensure that it is readily available and
accessible
• The timeliness of information is an important data quality characteristic, because
information that isn’t timely can lead to people making the wrong decisions. In
turn, that costs organizations time, money, and reputational damage
• Conformity (What data is stored in a non-standardized format?)
• Uniqueness (What data measures or attributes are repeated?)
• Interpretability and Validity
• These reflect how easily the data can be understood and whether the data records contain no invalid entries
Step 2: To explore the dataset (EDA) 6
• Preliminary investigation of the data to better understand its specific characteristics
• It can help to answer some of the data mining questions
• To help in selecting pre-processing tools
• To help in selecting appropriate data mining algorithms

• Things to look at
• Class balance
• Dispersion of data attribute values
• Skewness, outliers, missing values
• Attributes that vary together
• Visualization tools are important
• Histograms, scatter plots
Statistics’ Concepts In Data Science 7

• There are four types of statistical measures used to describe data:


• Measure of Frequency
• It indicates the number of occurrences of any particular data value in the given dataset. The measures of frequency are count and percentage
• Measure of Central Tendency
• It indicates whether the data values accumulate in the middle of distribution or toward the end.
The measures are mean, median, and mode
• Measure of Spread
• Measures of spread, also called measures of dispersion, show how widely the data values are spread out. The measures are standard deviation, variance, and quartiles
• Measure of Position
• It indicates the exact location of a particular data value in the given dataset. The measures are
percentiles and quartiles
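
As a small illustration (this code is not from the slides), the following Python sketch computes one example of each of the four kinds of measures for a made-up retirement-age sample; the percentile-rank shortcut at the end is only a rough approximation:

import pandas as pd

ages = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])

# Measure of frequency: counts and percentages of each value
counts = ages.value_counts()
percentages = ages.value_counts(normalize=True) * 100

# Measures of central tendency
mean, median, mode = ages.mean(), ages.median(), ages.mode().iloc[0]

# Measures of spread (pandas uses the sample formulas, i.e. n - 1)
std, var = ages.std(), ages.var()
q1, q3 = ages.quantile([0.25, 0.75])

# Measure of position: rough percentile rank of the value 58
rank_58 = (ages < 58).mean() * 100

print(counts, percentages, mean, median, mode, std, var, q1, q3, rank_58)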
Statistics’ Concepts In Data Science 8
• A data scientist needs to understand the fundamental concepts of:

• descriptive statistics
• It provides a summary of data numerically
• Summarizes and describes features of a dataset
• Descriptive statistics consists of two basic categories of measures: measures of central tendency, measures
of variability (or spread)

• inferential statistics
• generalizes from a sample to the larger population and applies probability theory to draw conclusions/predictions

• Key concepts include:


• Probability distribution
• Statistical significance
• Hypothesis testing (inferential statistical technique)
• Regression
• Bayesian thinking (machine learning)
• Conditional probability
• Maximum likelihood
Descriptive Statistics – (Measure of Central Tendency): MEAN 9

Looking at the retirement age distribution:


54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

• Advantage of the mean:


• The mean can be used for both continuous and discrete numeric data
• Limitations of the mean:
• The mean cannot be calculated for categorical data
• It is not robust because a single large value (an outlier) can skew the average
Measures of central tendency: MEDIAN 10

Looking at the retirement age (skewed) distribution:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81
• Advantage of Median:
• Useful when the distribution of data is not symmetrical
• Not affected by outliers
• Limitation of the median:
• The median cannot be identified for categorical nominal data, as it cannot be logically
ordered.
Measures of central tendency: MODE
11

• Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

• Advantage of Mode:
• It can be found for both numerical and categorical (non-numerical) data
• Limitations of the mode:
• In some distributions, the mode may not reflect the center of the distribution very well
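
A quick sketch (plain Python standard library) contrasting the two retirement-age datasets used on these slides; it shows how the single high value 81 pulls the mean while the median and mode are unaffected:

from statistics import mean, median, mode

symmetric = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
skewed    = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81]

print(mean(symmetric), median(symmetric), mode(symmetric))  # about 56.6, 57, 54
print(mean(skewed),    median(skewed),    mode(skewed))     # about 58.5, 57, 54
# The outlier 81 raises the mean by roughly 2 years; the median and mode do not move.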
Descriptive Statistics – Variability (Measures of Spread) 12
Note: use "n" in the denominator of the variance formula when the mean is known; use (n − 1) when the mean is estimated from the data
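
For reference, the two versions of the variance formula that this note contrasts can be written as follows (standard definitions, added here for clarity): the population form uses the known mean μ and divides by n, while the sample form uses the estimated mean x̄ and divides by (n − 1):

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2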
Distribution of Data 13

• The distribution of a data set is the shape of the graph when all
possible values are plotted on a frequency graph (showing how often
they occur).
• It is used to organize and disseminate large amounts of information in
a way that is meaningful and simple for audiences to digest.
• The most common type of distribution of data is NORMAL
DISTRIBUTION. For example:
• Height of People
• Size of things produced by machines
• Errors in measurements
• Blood Pressure
• Academic Test Scores/ Grades
Normal Distribution 14

• There are many cases where the data tends to be around a central
value with no bias left or right, and it gets close to a "Normal
Distribution" like this:

• The "Bell Curve" is a Normal Distribution. The yellow histogram shows


some data that follows it closely, but not perfectly (which is usual).
How does the shape of a distribution influence
the Measures of Central Tendency? 15
• Symmetrical distributions:

When a distribution is symmetrical, the mode, median and mean are all in the
middle of the distribution.
• The following graph shows a larger retirement age dataset with a distribution
which is symmetrical. The mode, median and mean all equal 58 years.
Skewness 16

• A measure of asymmetry

• Skewness refers to deviation from the symmetrical bell curve, or normal distribution, of the data

• In a skewed distribution, the median is often a preferred measure of central


tendency, as the mean is not usually in the middle of the distribution.
Positive Skewness / Right Skewed: 17

• A distribution is said to be positively or right skewed when the tail on the


right side of the distribution is longer than the left side.
• In a positively skewed distribution it is common for the mean to be ‘pulled’
toward the right tail of the distribution.
Negative Skewness /Left Skewed: 18
• A distribution is said to be negatively or left skewed when the tail on the
left side of the distribution is longer than the right side
• In a negatively skewed distribution, it is common for the mean to be ‘pulled’
toward the left tail of the distribution
Normal Distribution 19
Standard Normal Distribution 20
Standard Normal Distribution 21
Standard Normal Distribution 22

Here is the Standard Normal


Distribution with percentages for
every half of a standard
deviation, and cumulative
percentages:
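
The figure itself is not reproduced here; as a rough sketch of the same idea (assuming scipy is available), the following computes the percentage of values falling in each half-standard-deviation band of the standard normal distribution, together with the cumulative percentage:

from scipy.stats import norm

edges = [i * 0.5 for i in range(-6, 7)]            # -3.0, -2.5, ..., +3.0
for lo, hi in zip(edges[:-1], edges[1:]):
    band = (norm.cdf(hi) - norm.cdf(lo)) * 100     # % of values between lo and hi
    cumulative = norm.cdf(hi) * 100                # % of values below hi
    print(f"{lo:+.1f} to {hi:+.1f}: {band:5.2f}%   cumulative: {cumulative:6.2f}%")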
DATA PREPROCESSING 23
Why Data Preprocessing? 24

• Preprocessing data is an important step, as raw data can be inconsistent or incomplete in its formatting, or may contain errors and outliers

• Effective preprocessing of raw data can increase its accuracy and reliability, which in turn improves the quality of the projects that use it
Advantages of Data Preprocessing 25

• Improvement in accuracy and reliability of dataset


• By removal of missing and/or inconsistent data

• Improvement in data consistency


• By removing duplicate records and checking that the data values used for analysis are consistent

• Increase in the data's readability for algorithms

• Preprocessing enhances the data's quality and makes it easier for machine learning algorithms to read, use, and interpret it
Steps for Data Preprocessing 26
• Data cleaning
• Remove inconsistencies
• Filling missing values
• Remove outliers (Z-score, IQR)
• Smooth out noise in data

• Data integration
• Merge data from multiple sources, e.g., data cubes, files, or notes, into a coherent data store

• Data transformation
• Normalization (scaling to a specific range; see the sketch after this list)
• Feature Type conversion
• Aggregation
• Attribute/feature construction
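
As a small sketch of the normalization step listed above (min-max scaling to [0, 1] is only one common choice; the slide does not prescribe a specific formula):

import numpy as np

values = np.array([2, 10, 18, 18, 19, 20, 22, 25, 28], dtype=float)
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)   # 2 maps to 0.0, 28 maps to 1.0, everything else falls in between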
Steps for Data Preprocessing 27

• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical
results by using
• Data aggregation
• Eliminating features
• Clustering

• Data Discretization
• Transforms numeric data by mapping values to interval or concept labels
• Binning
• Clustering
• ML techniques (Decision Tree Analysis, etc.)
Steps for Data Preprocessing 28
Data Cleaning – Missing Data Handling 29

• Manual Entry
• Using measure of Central Tendency (Mean, Median, Mode)
• Ignore the tuple/row/instance
• Global constant
• Measure of Central tendency for each class
• Most probable value (ML algorithms: Regression, Decision Trees )
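
A minimal pandas sketch of the central-tendency approaches listed above (the column and class names are made up for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "age":   [25, np.nan, 40, 42, np.nan],
})

# Fill missing ages with the overall median
df["age_overall"] = df["age"].fillna(df["age"].median())

# Fill missing ages with the median of each class (often a closer estimate)
df["age_by_class"] = df.groupby("class")["age"].transform(lambda s: s.fillna(s.median()))

print(df)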
Data Cleaning – Smooth Out Noisy Data 30
• Noise is a random error or variance in a measured variable. The following techniques are used to remove it:
• Binning
• Clustering
• Regression

• Binning: first sort the data and then divide it into bins/buckets

• Equal-width (distance) partitioning
• W = (max_value – min_value)/x, where x = number of bins
• bin#1 range = {min, min+W−1}, bin#2 range = {min+W, min+2W−1}, bin#3 range = {min+2W, max}
• Example: 2, 10, 18, 18, 19, 20, 22, 25, 32 with x = 3 bins

• Equal-depth (frequency) partitioning
• F = n/x, where n = total no. of values and x = number of bins
• Example: 2, 10, 18, 18, 19, 20, 22, 25, 28 with x = 3 bins
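
A rough Python sketch of both partitioning schemes, reproducing the two examples above (the helper function names are my own; x is the number of bins, 3 in both examples):

def equal_width_bins(values, x):
    values = sorted(values)
    w = (values[-1] - values[0]) / x                   # bin width W
    edges = [values[0] + i * w for i in range(1, x)]   # interior cut points
    bins = [[] for _ in range(x)]
    for v in values:
        idx = min(sum(v >= e for e in edges), x - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, x):
    values = sorted(values)
    f = len(values) // x                               # values per bin F
    return [values[i * f:(i + 1) * f] for i in range(x)]

print(equal_width_bins([2, 10, 18, 18, 19, 20, 22, 25, 32], 3))
# [[2, 10], [18, 18, 19, 20], [22, 25, 32]]
print(equal_depth_bins([2, 10, 18, 18, 19, 20, 22, 25, 28], 3))
# [[2, 10, 18], [18, 19, 20], [22, 25, 28]]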
Data Cleaning – Smooth Out Noisy Data 31

• After creating the bins, the data is smoothed using

• Bin means
• Bin medians
• Bin boundaries
• Example:
• Bin1 = [2,10,18], Bin2 = [18,19,20], Bin3 = [22,25,28]
• Smooth by mean: Bin1 = [10,10,10], Bin2 = [19,19,19], Bin3 = [25,25,25]
• Smooth by median: Bin1 = [10,10,10], Bin2 = [19,19,19], Bin3 = [25,25,25]
• Smooth by boundaries: Bin1 = [2,2,18], Bin2 = [18,18,20], Bin3 = [22,22,28] (in this example, the minimum boundary is used in case of a tie)
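
A short sketch applying the three smoothing rules to the bins above; as on the slide, ties in boundary smoothing are broken toward the minimum boundary:

from statistics import mean, median

bins = [[2, 10, 18], [18, 19, 20], [22, 25, 28]]

by_mean   = [[round(mean(b))] * len(b) for b in bins]
by_median = [[median(b)] * len(b) for b in bins]
by_bound  = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_mean)    # [[10, 10, 10], [19, 19, 19], [25, 25, 25]]
print(by_median)  # [[10, 10, 10], [19, 19, 19], [25, 25, 25]]
print(by_bound)   # [[2, 2, 18], [18, 18, 20], [22, 22, 28]]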
Data Cleaning – Smooth Out Noisy Data 32

• Equal-width (distance) partitioning does not handle skewed data well
• In Equal-depth (frequency) partitioning, categorical data can be tricky


Data Cleaning – Identify and Remove Outliers 33

The detection and removal of outliers is an important step in the data cleaning process
What are outliers?

• These are the data points/ samples which do not follow the general trend of the data
• Outliers can indicate variation or error in the data
• Outliers in a single variable/column are called univariate while outliers in multiple
variables/columns are called multivariate
• Outliers can be detected both visually and mathematically
Data Cleaning – Identify and Remove Outliers 34
• Outliers can be identified visually using following plots:

• Scatter Plot
Data Cleaning – Identify and Remove Outliers 35

• Box Plot
Data Cleaning – Identify and Remove Outliers 36

• Histogram
Data Cleaning – Identify and Remove Outliers 37

IMPORTANT: You need to be sure whether the outlier should be removed or not.
HOW TO DO THAT?
The context and domain knowledge should guide your approach to dealing with outliers.
Data Cleaning – Identify and Remove Outliers 38
Data Cleaning – Identify and Remove Outliers 39
Outliers can be detected mathematically using:
• Linear Regression
• Clustering
Data Cleaning – Identify and Remove Outliers 40

• The Z-score is a statistical measure that quantifies how far a particular data point is from the mean of a dataset, in terms of standard deviations. It's a commonly used method for detecting outliers in a dataset

• Define a threshold or criterion for what constitutes an outlier. Commonly used thresholds include 2 standard deviations (Z <= -2 or Z >= 2) or 3 standard deviations (Z <= -3 or Z >= 3) from the mean

• Data points with Z-scores that exceed the defined threshold are
considered outliers
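
A small sketch of z-score based outlier detection, reusing the skewed retirement-age data from earlier in the slides and the 2-standard-deviation threshold mentioned above:

import numpy as np

data = np.array([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81], dtype=float)
z = (data - data.mean()) / data.std()     # z = (x - mean) / standard deviation
outliers = data[np.abs(z) >= 2]           # flag points at least 2 SDs from the mean
print(outliers)                           # only the value 81 is flagged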
Data Cleaning – Identify and Remove Outliers 41

Data Cleaning – Identify and Remove Outliers 42
• IQR (Inter Quartile Range)
• The quartiles of a data set divide the data set into the following four parts, each containing 25% of the
data:
• The first quartile (Q1) is the 25th percentile
• The second quartile (Q2) is the 50th percentile, that is, the median.
• The third quartile (Q3) is the 75th percentile
• The IQR is a measure of variability that is much more robust than the SD; it may be interpreted as the spread of the middle 50% of the data
• IQR = Q3 − Q1
• A data value is an outlier if:
• it is located more than 1.5 × IQR below the first quartile Q1, i.e., below Q1 − (1.5 × IQR), or
• it is located more than 1.5 × IQR above the third quartile Q3, i.e., above Q3 + (1.5 × IQR)
Data Cleaning – Identify and Remove Outliers 43

• Suppose you have a dataset of exam scores: [65, 72, 75, 78, 80, 81, 82, 85, 90, 110].
1. Calculate quartiles: Q1 = 75 (the 25th percentile), Q3 = 85 (the 75th percentile)
2. Calculate the IQR: IQR = Q3 − Q1 = 85 − 75 = 10
3. Define a threshold: 1.5 × IQR = 1.5 × 10 = 15
4. Identify outliers: any data point below Q1 − 15 = 60 or above Q3 + 15 = 100 is considered an outlier
• In this example, the data point 110 is above the upper threshold and is therefore considered an outlier.
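
A sketch reproducing the worked example above; quartiles are computed here as the medians of the lower and upper halves of the sorted data, one common textbook convention that matches the slide's Q1 and Q3 (library routines such as numpy.percentile interpolate and may return slightly different quartiles):

from statistics import median

scores = sorted([65, 72, 75, 78, 80, 81, 82, 85, 90, 110])
half = len(scores) // 2
q1 = median(scores[:half])                        # 75
q3 = median(scores[-half:])                       # 85
iqr = q3 - q1                                     # 10
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # 60 and 100
print([x for x in scores if x < lower or x > upper])   # [110]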
Data Cleaning – Identify and Remove Outliers 44

• The IQR method is robust


• It is not influenced by extreme values as much as other methods like the
Z-score method.
• It's particularly useful when you have data that may not follow a normal
distribution or when you want to detect outliers in skewed datasets.
• However, it's essential to choose an appropriate threshold based on the
characteristics of your data and the context of your analysis.
BOX PLOT 45

• The box plot is one of the visualization techniques used to show the distribution of numerical data

• It can be drawn vertically or horizontally

• The length of the box indicates the spread of the data

• The whiskers represent the remaining data values, and the maximum length of a whisker is 1.5 × IQR

• Any data beyond the whiskers is considered to be an outlier
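
As a minimal sketch (assuming matplotlib is available), the following draws a box plot of the exam scores from the IQR example; by default the whiskers extend at most 1.5 × IQR beyond the box and any point outside them is drawn individually as an outlier:

import matplotlib.pyplot as plt

scores = [65, 72, 75, 78, 80, 81, 82, 85, 90, 110]
plt.boxplot(scores, whis=1.5)    # whis=1.5 is the default whisker length
plt.title("Exam scores")
plt.ylabel("Score")
plt.show()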
IQR & BOX PLOT 46
IQR & BOX PLOT - Example 47
IQR & BOX PLOT - Example 48
IQR & BOX PLOT - Example 49
