
Data Mining

Prepared By: Eesha Tur Razia Babar


Quality of Data 2
• Data quality is a measure of whether data is good enough to support the outcomes it is used for

• Measuring data quality levels can help organizations identify data errors that need to be resolved and
assess whether the data in their IT systems is fit to serve its intended purpose

• High-quality data is the foundation of all digital businesses

• Poor data quality can have the following consequences:


• Wastage of time and money
• Incorrect reports resulting in misguided business decisions
• Negative image of company
• Lower customer satisfaction
Quality of Data 3

• Some of the most common issues affecting data quality are:

• inconsistent formatting of dates and numbers
• unusual character sets and symbols
• duplicate entries
• different languages and measurement units, etc.

So it is really important to assess the quality of data before using it


Quality of Data 4
• There are seven pillars of Data Quality:
• Accuracy (Is the information correct in every detail?)
• No errors in the data
• The data reflects actual, real-world scenarios
• Inaccurate information can cause significant problems with severe consequences
• Completeness (How comprehensive is the information?)
• Completeness is a measure of the data’s ability to effectively deliver all the required values that are available
• It refers to how comprehensive the information is
• If information is incomplete, it might be unusable
• Consistency (Does the information contradict other trusted resources?)
• Data consistency refers to the uniformity of data as it moves across networks and applications
• The same data values stored in different locations should not conflict with one another
Quality of Data 5
• Relevance (Do you really need this information?) / Timeliness (How up-to-date is the information?)
• Timely data is data that is available when it is required.
• Data may be updated in real time to ensure that it is readily available and
accessible
• The timeliness of information is an important data quality characteristic, because
information that isn’t timely can lead to people making the wrong decisions. In
turn, that costs organizations time, money, and reputational damage
• Conformity (What data is stored in a non-standardized format?)
• Uniqueness (What data measures or attributes are repeated?)
• Interpretability and Validity
• These reflect how easily the data can be understood and whether the data records contain no invalid entries
Step 2: To explore the dataset (EDA) 6
• Preliminary investigation of the data to better understand its specific characteristics
• It can help to answer some of the data mining questions
• To help in selecting pre-processing tools
• To help in selecting appropriate data mining algorithms

• Things to look at
• Class balance
• Dispersion of data attribute values
• Skewness, outliers, missing values
• Attributes that vary together
• Visualization tools are important
• Histograms, scatter plots
Statistics’ Concepts In Data Science 7

• There are four types of statistical measures used to describe data:


• Measure of Frequency
• It indicates the number of occurrences of any particular data value in the given dataset. The measures of frequency are count and percentage
• Measure of Central Tendency
• It indicates whether the data values accumulate in the middle of distribution or toward the end.
The measures are mean, median, and mode
• Measure of Spread
• Measures of spread, also called measures of dispersion, show how widely the data values are spread out. The measures are standard deviation, variance, and quartiles
• Measure of Position
• It indicates the exact location of a particular data value in the given dataset. The measures are
percentiles and quartiles
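
As a small illustration (this code is not from the slides), the following Python sketch computes one example of each of the four kinds of measures for a made-up retirement-age sample; the percentile-rank shortcut at the end is only a rough approximation:

import pandas as pd

ages = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])

# Measure of frequency: counts and percentages of each value
counts = ages.value_counts()
percentages = ages.value_counts(normalize=True) * 100

# Measures of central tendency
mean, median, mode = ages.mean(), ages.median(), ages.mode().iloc[0]

# Measures of spread (pandas uses the sample formulas, i.e. n - 1)
std, var = ages.std(), ages.var()
q1, q3 = ages.quantile([0.25, 0.75])

# Measure of position: rough percentile rank of the value 58
rank_58 = (ages < 58).mean() * 100

print(counts, percentages, mean, median, mode, std, var, q1, q3, rank_58)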
Statistics’ Concepts In Data Science 8
• A data scientist needs to understand the fundamental concepts of:

• descriptive statistics
• It provides a summary of data numerically
• Summarizes and describes features of a dataset
• Descriptive statistics consists of two basic categories of measures: measures of central tendency, measures
of variability (or spread)

• inferential statistics
• generalizes from a sample to the larger population and applies probability theory to draw conclusions/predictions

• Key concepts include:


• Probability distribution
• Statistical significance
• Hypothesis testing (inferential statistical technique)
• Regression
• Bayesian thinking (machine learning)
• Conditional probability
• Maximum likelihood
Descriptive Statistics – (Measure of Central Tendency): MEAN 9

Looking at the retirement age distribution:


54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

• Advantage of the mean:


• The mean can be used for both continuous and discrete numeric data
• Limitations of the mean:
• The mean cannot be calculated for categorical data
• It is not robust because a single large value (an outlier) can skew the average
Measures of central tendency: MEDIAN 10

Looking at the retirement age (skewed) distribution:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81
• Advantage of Median:
• Useful when the distribution of data is not symmetrical
• Not affected by outliers
• Limitation of the median:
• The median cannot be identified for categorical nominal data, as it cannot be logically
ordered.
Measures of central tendency: MODE
11

• Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

• Advantage of Mode:
• It can be found for both numerical and categorical (non-numerical) data
• Limitations of the mode:
• In some distributions, the mode may not reflect the center of the distribution very well
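
A quick sketch (plain Python standard library) contrasting the two retirement-age datasets used on these slides; it shows how the single high value 81 pulls the mean while the median and mode are unaffected:

from statistics import mean, median, mode

symmetric = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
skewed    = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81]

print(mean(symmetric), median(symmetric), mode(symmetric))  # about 56.6, 57, 54
print(mean(skewed),    median(skewed),    mode(skewed))     # about 58.5, 57, 54
# The outlier 81 raises the mean by roughly 2 years; the median and mode do not move.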
Descriptive Statistics – Variability (Measures of Spread) 12
Note: use "n" in the denominator of the variance formula when the mean is known; use (n − 1) when the mean is estimated from the data
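
For reference, the two versions of the variance formula that this note contrasts can be written as follows (standard definitions, added here for clarity): the population form uses the known mean μ and divides by n, while the sample form uses the estimated mean x̄ and divides by (n − 1):

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2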
Distribution of Data 13

• The distribution of a data set is the shape of the graph when all
possible values are plotted on a frequency graph (showing how often
they occur).
• It is used to organize and disseminate large amounts of information in
a way that is meaningful and simple for audiences to digest.
• The most common type of distribution of data is NORMAL
DISTRIBUTION. For example:
• Height of People
• Size of things produced by machines
• Errors in measurements
• Blood Pressure
• Academic Test Scores/ Grades
Normal Distribution 14

• There are many cases where the data tends to be around a central
value with no bias left or right, and it gets close to a "Normal
Distribution" like this:

• The "Bell Curve" is a Normal Distribution. The yellow histogram shows


some data that follows it closely, but not perfectly (which is usual).
How does the shape of a distribution influence
the Measures of Central Tendency? 15
• Symmetrical distributions:

When a distribution is symmetrical, the mode, median and mean are all in the
middle of the distribution.
• The following graph shows a larger retirement age dataset with a distribution
which is symmetrical. The mode, median and mean all equal 58 years.
Skewness 16

• A measure of asymmetry

• Skewness refers to deviation from the symmetrical bell curve, or normal distribution, of the data

• In a skewed distribution, the median is often a preferred measure of central


tendency, as the mean is not usually in the middle of the distribution.
Positive Skewness / Right Skewed: 17

• A distribution is said to be positively or right skewed when the tail on the


right side of the distribution is longer than the left side.
• In a positively skewed distribution it is common for the mean to be ‘pulled’
toward the right tail of the distribution.
Negative Skewness /Left Skewed: 18
• A distribution is said to be negatively or left skewed when the tail on the
left side of the distribution is longer than the right side
• In a negatively skewed distribution, it is common for the mean to be ‘pulled’
toward the left tail of the distribution
Normal Distribution 19
Standard Normal Distribution 20
Standard Normal Distribution 21
Standard Normal Distribution 22

Here is the Standard Normal


Distribution with percentages for
every half of a standard
deviation, and cumulative
percentages:
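
The figure itself is not reproduced here; as a rough sketch of the same idea (assuming scipy is available), the following computes the percentage of values falling in each half-standard-deviation band of the standard normal distribution, together with the cumulative percentage:

from scipy.stats import norm

edges = [i * 0.5 for i in range(-6, 7)]            # -3.0, -2.5, ..., +3.0
for lo, hi in zip(edges[:-1], edges[1:]):
    band = (norm.cdf(hi) - norm.cdf(lo)) * 100     # % of values between lo and hi
    cumulative = norm.cdf(hi) * 100                # % of values below hi
    print(f"{lo:+.1f} to {hi:+.1f}: {band:5.2f}%   cumulative: {cumulative:6.2f}%")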
DATA PREPROCESSING 23
Why Data Preprocessing? 24

• Preprocessing data is an important step, as raw data can be inconsistent or incomplete in its formatting, or may contain errors and outliers

• Effective preprocessing of raw data can increase its accuracy and reliability, which in turn improves the quality of the projects that use it
Advantages of Data Preprocessing 25

• Improvement in accuracy and reliability of dataset


• By removal of missing and/or inconsistent data

• Improvement in data consistency


• By removing duplicate records and checking that the data values used for analysis are consistent

• Increase in the data's readability for algorithms

• Preprocessing enhances the data's quality and makes it easier for machine learning algorithms to read, use, and interpret it
Steps for Data Preprocessing 26
• Data cleaning
• Remove inconsistencies
• Filling missing values
• Remove outliers (Z-score, IQR)
• Smooth out noise in data

• Data integration
• Merge data from multiple sources, e.g., data cubes, files, or notes, into a coherent data store

• Data transformation
• Normalization (scaling to a specific range; see the sketch after this list)
• Feature Type conversion
• Aggregation
• Attribute/feature construction
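
As a small sketch of the normalization step listed above (min-max scaling to [0, 1] is only one common choice; the slide does not prescribe a specific formula):

import numpy as np

values = np.array([2, 10, 18, 18, 19, 20, 22, 25, 28], dtype=float)
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)   # 2 maps to 0.0, 28 maps to 1.0, everything else falls in between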
Steps for Data Preprocessing 27

• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical
results by using
• Data aggregation
• Eliminating features
• Clustering

• Data Discretization
• Transforms numeric data by mapping values to interval or concept labels
• Binning
• Clustering
• ML techniques (Decision Tree Analysis, etc.)
Steps for Data Preprocessing 28
Data Cleaning – Missing Data Handling 29

• Manual Entry
• Using measure of Central Tendency (Mean, Median, Mode)
• Ignore the tuple/row/instance
• Global constant
• Measure of Central tendency for each class
• Most probable value (ML algorithms: Regression, Decision Trees )
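
A minimal pandas sketch of the central-tendency approaches listed above (the column and class names are made up for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "age":   [25, np.nan, 40, 42, np.nan],
})

# Fill missing ages with the overall median
df["age_overall"] = df["age"].fillna(df["age"].median())

# Fill missing ages with the median of each class (often a closer estimate)
df["age_by_class"] = df.groupby("class")["age"].transform(lambda s: s.fillna(s.median()))

print(df)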
Data Cleaning – Smooth Out Noisy Data 30
• Noise is a random error or variance in a measured variable. The following techniques are used to remove it:
• Binning
• Clustering
• Regression

• Binning: first sort the data and then divide it into bins/buckets

• Equal-width (distance) partitioning
• W = (max_value – min_value)/x, where x = number of bins
• bin#1 range = {min, min+W−1}, bin#2 range = {min+W, min+2W−1}, bin#3 range = {min+2W, max}
• Example: 2, 10, 18, 18, 19, 20, 22, 25, 32 with x = 3 bins

• Equal-depth (frequency) partitioning
• F = n/x, where n = total no. of values and x = number of bins
• Example: 2, 10, 18, 18, 19, 20, 22, 25, 28 with x = 3 bins
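
A rough Python sketch of both partitioning schemes, reproducing the two examples above (the helper function names are my own; x is the number of bins, 3 in both examples):

def equal_width_bins(values, x):
    values = sorted(values)
    w = (values[-1] - values[0]) / x                   # bin width W
    edges = [values[0] + i * w for i in range(1, x)]   # interior cut points
    bins = [[] for _ in range(x)]
    for v in values:
        idx = min(sum(v >= e for e in edges), x - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, x):
    values = sorted(values)
    f = len(values) // x                               # values per bin F
    return [values[i * f:(i + 1) * f] for i in range(x)]

print(equal_width_bins([2, 10, 18, 18, 19, 20, 22, 25, 32], 3))
# [[2, 10], [18, 18, 19, 20], [22, 25, 32]]
print(equal_depth_bins([2, 10, 18, 18, 19, 20, 22, 25, 28], 3))
# [[2, 10, 18], [18, 19, 20], [22, 25, 28]]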
Data Cleaning – Smooth Out Noisy Data 31

• After creating the bins, the data is smoothed using

• Bin means
• Bin medians
• Bin boundaries
• Example:
• Bin1 = [2,10,18], Bin2 = [18,19,20], Bin3 = [22,25,28]
• Smooth by mean: Bin1 = [10,10,10], Bin2 = [19,19,19], Bin3 = [25,25,25]
• Smooth by median: Bin1 = [10,10,10], Bin2 = [19,19,19], Bin3 = [25,25,25]
• Smooth by boundaries: Bin1 = [2,2,18], Bin2 = [18,18,20], Bin3 = [22,22,28] (in this example, the minimum boundary is used in case of a tie)
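
A short sketch applying the three smoothing rules to the bins above; as on the slide, ties in boundary smoothing are broken toward the minimum boundary:

from statistics import mean, median

bins = [[2, 10, 18], [18, 19, 20], [22, 25, 28]]

by_mean   = [[round(mean(b))] * len(b) for b in bins]
by_median = [[median(b)] * len(b) for b in bins]
by_bound  = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_mean)    # [[10, 10, 10], [19, 19, 19], [25, 25, 25]]
print(by_median)  # [[10, 10, 10], [19, 19, 19], [25, 25, 25]]
print(by_bound)   # [[2, 2, 18], [18, 18, 20], [22, 22, 28]]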
Data Cleaning – Smooth Out Noisy Data 32

• Equal-width (distance) partitioning does not handle skewed data well
• In Equal-depth (frequency) partitioning, categorical data can be tricky


Data Cleaning – Identify and Remove Outliers 33

The detection and removal of outliers is an important step in the data cleaning process
What are outliers?

• These are the data points/ samples which do not follow the general trend of the data
• Outliers can indicate variation or error in the data
• Outliers in a single variable/column are called univariate while outliers in multiple
variables/columns are called multivariate
• Outliers can be detected both visually and mathematically
Data Cleaning – Identify and Remove Outliers 34
• Outliers can be identified visually using following plots:

• Scatter Plot
Data Cleaning – Identify and Remove Outliers 35

• Box Plot
Data Cleaning – Identify and Remove Outliers 36

• Histogram
Data Cleaning – Identify and Remove Outliers 37

IMPORTANT: You need to be sure whether the outlier should be removed or not.
HOW TO DO THAT?
The context and domain knowledge should guide your approach to dealing with outliers.
Data Cleaning – Identify and Remove Outliers 38
Data Cleaning – Identify and Remove Outliers 39
Outliers can be detected mathematically using:
• Linear Regression
• Clustering
Data Cleaning – Identify and Remove Outliers 40

• The Z-score is a statistical measure that quantifies how far a particular data point is from the mean of a dataset, in terms of standard deviations. It's a commonly used method for detecting outliers in a dataset

• Define a threshold or criterion for what constitutes an outlier. Commonly used thresholds include 2 standard deviations (Z <= -2 or Z >= 2) or 3 standard deviations (Z <= -3 or Z >= 3) from the mean

• Data points with Z-scores that exceed the defined threshold are
considered outliers
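
A small sketch of z-score based outlier detection, reusing the skewed retirement-age data from earlier in the slides and the 2-standard-deviation threshold mentioned above:

import numpy as np

data = np.array([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81], dtype=float)
z = (data - data.mean()) / data.std()     # z = (x - mean) / standard deviation
outliers = data[np.abs(z) >= 2]           # flag points at least 2 SDs from the mean
print(outliers)                           # only the value 81 is flagged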
Data Cleaning – Identify and Remove Outliers 41

Data Cleaning – Identify and Remove Outliers 42
• IQR (Inter Quartile Range)
• The quartiles of a data set divide the data set into the following four parts, each containing 25% of the
data:
• The first quartile (Q1) is the 25th percentile
• The second quartile (Q2) is the 50th percentile, that is, the median.
• The third quartile (Q3) is the 75th percentile
• The IQR is a measure of variability that is much more robust than the SD; it may be interpreted as the spread of the middle 50% of the data
• IQR = Q3 − Q1
• A data value is an outlier if:
• it is located more than 1.5 × IQR below the first quartile Q1, i.e., below Q1 − (1.5 × IQR), or
• it is located more than 1.5 × IQR above the third quartile Q3, i.e., above Q3 + (1.5 × IQR)
Data Cleaning – Identify and Remove Outliers 43

• Suppose you have a dataset of exam scores: [65, 72, 75, 78, 80, 81, 82, 85, 90, 110].
1. Calculate quartiles: Q1 = 75 (the 25th percentile), Q3 = 85 (the 75th percentile)
2. Calculate the IQR: IQR = Q3 − Q1 = 85 − 75 = 10
3. Define a threshold: 1.5 × IQR = 1.5 × 10 = 15
4. Identify outliers: any data point below Q1 − 15 = 60 or above Q3 + 15 = 100 is considered an outlier
• In this example, the data point 110 is above the upper threshold and is therefore considered an outlier.
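
A sketch reproducing the worked example above; quartiles are computed here as the medians of the lower and upper halves of the sorted data, one common textbook convention that matches the slide's Q1 and Q3 (library routines such as numpy.percentile interpolate and may return slightly different quartiles):

from statistics import median

scores = sorted([65, 72, 75, 78, 80, 81, 82, 85, 90, 110])
half = len(scores) // 2
q1 = median(scores[:half])                        # 75
q3 = median(scores[-half:])                       # 85
iqr = q3 - q1                                     # 10
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # 60 and 100
print([x for x in scores if x < lower or x > upper])   # [110]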
Data Cleaning – Identify and Remove Outliers 44

• The IQR method is robust


• It is not influenced by extreme values as much as other methods like the
Z-score method.
• It's particularly useful when you have data that may not follow a normal
distribution or when you want to detect outliers in skewed datasets.
• However, it's essential to choose an appropriate threshold based on the
characteristics of your data and the context of your analysis.
BOX PLOT 45

• The box plot is one of the visualization techniques used to show the distribution of numerical data

• It can be drawn vertically or horizontally

• The length of the box indicates the spread of the data

• The whiskers represent the remaining data values, and the maximum length of a whisker is 1.5 × IQR

• Any data beyond the whiskers is considered to be an outlier
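
As a minimal sketch (assuming matplotlib is available), the following draws a box plot of the exam scores from the IQR example; by default the whiskers extend at most 1.5 × IQR beyond the box and any point outside them is drawn individually as an outlier:

import matplotlib.pyplot as plt

scores = [65, 72, 75, 78, 80, 81, 82, 85, 90, 110]
plt.boxplot(scores, whis=1.5)    # whis=1.5 is the default whisker length
plt.title("Exam scores")
plt.ylabel("Score")
plt.show()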
IQR & BOX PLOT 46
IQR & BOX PLOT - Example 47
IQR & BOX PLOT - Example 48
IQR & BOX PLOT - Example 49
