Data Mining: Prepared By: Eesha Tur Razia Babar
• Measuring data quality levels can help organizations identify data errors that need to be resolved and
assess whether the data in their IT systems is fit to serve its intended purpose
• Things to look at
• Class balance
• Dispersion of data attribute values
• Skewness, outliers, missing values
• Attributes that vary together
• Visualization tools are important
• Histograms, scatter plots
Statistics Concepts in Data Science
• Descriptive statistics
• Provides a numerical summary of the data
• Summarizes and describes the features of a dataset
• Consists of two basic categories of measures: measures of central tendency and measures of variability (or spread)
• Inferential statistics
• Generalizes from the available data to the larger dataset and applies probability theory to draw conclusions or make predictions
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81
• Advantages of the median:
• Useful when the distribution of the data is not symmetrical
• Largely unaffected by outliers
• Limitation of the median:
• The median cannot be identified for nominal categorical data, because such data cannot be logically ordered.
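A minimal Python sketch of the outlier point, using the retirement-age values listed above (the variable names are illustrative, not from the slides). The single large value 81 pulls the mean upward, while the median stays at the middle of the sorted data.

```python
from statistics import mean, median

# Retirement ages from the slide; 81 is an unusually large value
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81]

age_mean = mean(ages)      # pulled upward by the value 81
age_median = median(ages)  # middle value of the sorted data

print(f"mean = {age_mean:.2f}, median = {age_median}")
```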
Measures of central tendency: MODE
• Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• Advantage of Mode:
• It can be found for both numerical and categorical (non-numerical) data
• Limitations of the mode:
• In some distributions, the mode may not reflect the center of the distribution very well
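As a small illustration, the mode of the retirement-age data can be found with Python's standard library. The `colors` list is a hypothetical categorical example, not data from the slides. Note that the mode of the ages (54) is also the smallest value, which shows how the mode may sit far from the center.

```python
from statistics import multimode

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
colors = ["red", "blue", "blue", "green"]  # hypothetical categorical data

# multimode returns a list because several values can tie for most frequent
age_modes = multimode(ages)
color_modes = multimode(colors)  # works for non-numerical data too

print(age_modes, color_modes)
```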
Descriptive Statistics – Variability (Measures of Spread)
Note: use “n” in the denominator of the variance formula if you know the mean; use (n-1) when you estimate the mean.
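The two denominators can be compared with a short sketch. It assumes Python's `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1; reusing the retirement ages here is only for illustration.

```python
from statistics import pvariance, variance

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
n = len(ages)

var_n = pvariance(ages)         # sum of squared deviations divided by n
var_n_minus_1 = variance(ages)  # same sum divided by n - 1

print(f"n denominator: {var_n:.3f}, n-1 denominator: {var_n_minus_1:.3f}")
```

The n - 1 version is always slightly larger, and the gap shrinks as n grows.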
Distribution of Data
• The distribution of a data set is the shape of the graph when all
possible values are plotted on a frequency graph (showing how often
they occur).
• It is used to organize and disseminate large amounts of information in
a way that is meaningful and simple for audiences to digest.
• The most common type of data distribution is the NORMAL DISTRIBUTION. Examples of data that tend to follow it:
• Height of People
• Size of things produced by machines
• Errors in measurements
• Blood Pressure
• Academic Test Scores/ Grades
Normal Distribution
• There are many cases where the data tends to cluster around a central value with no bias left or right, so that it comes close to a "Normal Distribution".
When a distribution is symmetrical, the mode, median and mean are all in the
middle of the distribution.
• The following graph shows a larger retirement age dataset with a distribution
which is symmetrical. The mode, median and mean all equal 58 years.
Skewness
• A measure of asymmetry
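One way to put a number on asymmetry is the moment coefficient of skewness, m3 / m2^1.5, where m2 and m3 are the second and third central moments. The slides do not give a formula, so this choice is an assumption made for illustration.

```python
from statistics import mean

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (assumed formula)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81]

print(skewness(symmetric))     # zero for perfectly symmetric data
print(skewness(right_tailed))  # positive: the long tail is on the right
```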
• Effective preprocessing of raw data can increase its accuracy and reliability, which in turn improves the quality of the projects built on it
Advantages of Data Preprocessing
• Data integration
• Merge data from multiple sources (e.g., data cubes, files, or notes) into a coherent data store
• Data transformation
• Normalization (scaling to a specific range)
• Feature Type conversion
• Aggregation
• Attribute/feature construction
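Normalization to a specific range can be sketched with min-max scaling, one common choice. The function name, the default range [0, 1], and the income values are assumptions for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

incomes = [20_000, 35_000, 50_000, 80_000]  # hypothetical attribute values
scaled = min_max_scale(incomes)
print(scaled)
```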
Steps for Data Preprocessing
• Data reduction
• Obtains a representation that is reduced in volume but produces the same or similar analytical results, by using:
• Data aggregation
• Eliminating features
• Clustering
• Data Discretization
• Transforms numeric data by mapping values to interval or concept labels
• Binning
• Clustering
• ML techniques (Decision Tree Analysis, etc.)
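Mapping numeric values to concept labels can be sketched as follows. The cut points and the labels are hypothetical and chosen only to show the idea.

```python
from bisect import bisect_right

# Hypothetical cut points and concept labels, for illustration only
CUT_POINTS = [30, 60]  # boundaries between the intervals
LABELS = ["young", "middle-aged", "senior"]

def discretize(value):
    """Return the concept label of the interval that contains value."""
    return LABELS[bisect_right(CUT_POINTS, value)]

ages = [22, 30, 45, 61]
labels = [discretize(a) for a in ages]
print(labels)
```

With `bisect_right`, a value equal to a cut point falls into the interval above it.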
Data Cleaning – Missing Data Handling
• Manual Entry
• Using measure of Central Tendency (Mean, Median, Mode)
• Ignore the tuple/row/instance
• Global constant
• Measure of Central tendency for each class
• Most probable value (ML algorithms: Regression, Decision Trees)
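Two of the options above, filling with an overall measure of central tendency and filling with the measure for each class, can be sketched in plain Python. The rows are hypothetical and `None` marks a missing value; the mean is used here, but the median or mode would work the same way.

```python
from statistics import mean

# Hypothetical rows: (class label, measured value); None marks a missing value
rows = [("A", 10.0), ("A", None), ("A", 14.0),
        ("B", 30.0), ("B", None), ("B", 34.0)]

observed = [v for _, v in rows if v is not None]
overall_mean = mean(observed)

# Option 1: replace every missing value with the overall mean
filled_overall = [(c, overall_mean if v is None else v) for c, v in rows]

# Option 2: replace each missing value with the mean of its own class
class_means = {
    c: mean(v for k, v in rows if k == c and v is not None)
    for c in {c for c, _ in rows}
}
filled_by_class = [(c, class_means[c] if v is None else v) for c, v in rows]

print(filled_overall)
print(filled_by_class)
```

The class-wise fill keeps the imputed values closer to what is typical for each class.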
Data Cleaning – Smooth Out Noisy Data
• Noise is a random error or variance in a measured variable. The following techniques are used to smooth it out:
• Binning
• Clustering
• Regression
• Binning: First sort the data and then divide it into bins/buckets
• Equal-width (distance) partitioning
• W = (max_value – min_value)/x, where x = number of bins
• bin#1 range = {min, min+W-1}
• bin#2 range = {min+W, min+2W-1}
• bin#3 range = {min+2W, max}
Example:
• 2,10,18,18,19,20,22,25,32
• Number of bins = 3
• W = (32 – 2)/3 = 10, so the ranges are {2, 11}, {12, 21} and {22, 32}
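Equal-width partitioning of the example data can be sketched as below, with the width computed as (max – min) divided by the number of bins. The function names are illustrative. Replacing each value by its bin mean is one common way to smooth, assumed here because the slides do not name a specific smoothing rule.

```python
def equal_width_bins(values, num_bins):
    """Sort values and split them into num_bins bins of equal width."""
    data = sorted(values)
    lo, hi = data[0], data[-1]
    width = (hi - lo) / num_bins  # assumes hi > lo
    bins = [[] for _ in range(num_bins)]
    for v in data:
        # The maximum would land one past the last bin, so clamp it
        index = min(int((v - lo) / width), num_bins - 1)
        bins[index].append(v)
    return bins

def smooth_by_bin_means(bins):
    """Replace every value by the mean of its bin (assumed smoothing rule)."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]

data = [2, 10, 18, 18, 19, 20, 22, 25, 32]
bins = equal_width_bins(data, 3)
print(bins)
print(smooth_by_bin_means(bins))
```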
• Outliers are the data points/samples which do not follow the general trend of the data
• Outliers can indicate variation or error in the data
• Outliers in a single variable/column are called univariate while outliers in multiple
variables/columns are called multivariate
• Outliers can be detected both visually and mathematically
Data Cleaning – Identify and Remove Outliers
• Outliers can be identified visually using the following plots:
• Scatter Plot
• Box Plot
• Histogram
Data Cleaning – Identify and Remove Outliers
• Data points with Z-scores that exceed the defined threshold are
considered outliers
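A minimal sketch of Z-score detection, using z = (x - mean) / standard deviation. The threshold value and the use of the population standard deviation are assumptions; the slides only say that a threshold is defined. The exam scores are the same values as in the IQR example.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs((v - mu) / sigma) > threshold]

scores = [65, 72, 75, 78, 80, 81, 82, 85, 90, 110]
print(z_score_outliers(scores, threshold=2.0))
```

The result depends strongly on the threshold: with 2.0 the score 110 is flagged, while with the stricter default of 3.0 nothing in this small dataset is.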
• IQR (Interquartile Range)
• The quartiles of a data set divide the data set into the following four parts, each containing 25% of the
data:
• The first quartile (Q1) is the 25th percentile
• The second quartile (Q2) is the 50th percentile, that is, the median.
• The third quartile (Q3) is the 75th percentile
• The IQR is a measure of variability that is much more robust than the standard deviation (SD); it can be interpreted as the spread of the middle 50% of the data
• IQR = Q3 − Q1
• A data value is an outlier if:
• it lies more than 1.5 × IQR below the first quartile, i.e., below Q1 − 1.5 × IQR, or
• it lies more than 1.5 × IQR above the third quartile, i.e., above Q3 + 1.5 × IQR
• Suppose you have a dataset of exam scores: [65, 72, 75, 78, 80, 81, 82, 85, 90, 110].
1. Calculate Quartiles:
   • Q1 = 75 (since it's the 25th percentile)
   • Q3 = 85 (since it's the 75th percentile)
2. Calculate IQR:
   • IQR = Q3 - Q1 = 85 - 75 = 10.
3. Define a Threshold:
   • Threshold = 1.5 * IQR = 1.5 * 10 = 15.
4. Identify Outliers:
   • Any data point below Q1 - Threshold = 75 - 15 = 60 or above Q3 + Threshold = 85 + 15 = 100 is considered an outlier.
• In this example, the data point 110 is above the upper threshold and is therefore
considered an outlier.
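The worked example can be reproduced with a short sketch. The quartile method used here, taking the medians of the lower and upper halves of the sorted data, is an assumption chosen because it yields the Q1 = 75 and Q3 = 85 shown above; other percentile conventions give slightly different quartiles.

```python
from statistics import median

def iqr_outliers(values, k=1.5):
    """Return (q1, q3, outliers) using the Q1 - k*IQR and Q3 + k*IQR fences."""
    data = sorted(values)
    n = len(data)
    # Assumed quartile method: medians of the lower and upper halves
    # (the middle value is left out of both halves when n is odd)
    q1 = median(data[: n // 2])
    q3 = median(data[(n + 1) // 2 :])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return q1, q3, [v for v in data if v < lower or v > upper]

scores = [65, 72, 75, 78, 80, 81, 82, 85, 90, 110]
q1, q3, outliers = iqr_outliers(scores)
print(q1, q3, outliers)
```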
• A box plot is one of the visualization techniques used to show the distribution of numerical data