Part2 Statistics

The document discusses exploratory data analysis (EDA). It describes EDA as the critical process of initially investigating data to discover patterns, spot anomalies, test hypotheses, and check assumptions. This is done through summary statistics and graphical representations to understand the data and gather insights. It also discusses different data types, measures of central tendency (mean, median, mode), measures of dispersion (variance, standard deviation, range, interquartile range), frequency, normal distribution, skewness, kurtosis, outliers, and provides an example of EDA on a wine quality dataset.

Uploaded by

TechManager SaharaTvm

EXPLORATORY DATA ANALYSIS

EXPLORATORY DATA ANALYSIS REFERS TO THE CRITICAL
PROCESS OF PERFORMING INITIAL INVESTIGATIONS ON
DATA SO AS
• TO DISCOVER PATTERNS
• TO SPOT ANOMALIES
• TO TEST HYPOTHESES
• TO CHECK ASSUMPTIONS

THIS IS DONE WITH THE HELP OF SUMMARY STATISTICS AND GRAPHICAL REPRESENTATIONS.

THE GOAL IS TO UNDERSTAND THE DATA AND GATHER AS MANY INSIGHTS FROM IT AS POSSIBLE.

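DATA LEVELS
NOMINAL (QUALITATIVE)
DATA LABELLED WITH WORDS, LETTERS, SYMBOLS ETC., WITH NO NATURAL ORDER
ORDINAL (QUALITATIVE)
DATA PLACED IN AN ORDER ACCORDING TO RANK, WHERE THE GAPS BETWEEN RANKS ARE NOT NECESSARILY EQUAL
INTERVAL (QUANTITATIVE)
ORDERED DATA MEASURED ON A SCALE WITH EQUAL INTERVALS BUT NO TRUE ZERO (E.G. TEMPERATURE IN °C)
RATIO (QUANTITATIVE)
SAME AS INTERVAL, EXCEPT THAT THE SCALE HAS A TRUE ZERO, SO RATIOS OF VALUES ARE MEANINGFUL (E.G. WEIGHT)
RANDOM VARIABLE:
A FUNCTION THAT ASSIGNS A NUMERICAL VALUE TO EACH OUTCOME OF A RANDOM EXPERIMENT
DISCRETE (E.G. TOSSING A COIN)
CONTINUOUS (E.G. A FLOATING INTEREST RATE)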

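PROBABILITY
IT IS THE CHANCE OF SOME EVENT HAPPENING.
DISCRETE PROBABILITY DISTRIBUTION (GIVES THE PROBABILITY OF EACH POSSIBLE VALUE OF A DISCRETE RANDOM VARIABLE, SUCH AS 1, 2, 3, 4)
CONTINUOUS PROBABILITY DISTRIBUTION (GIVES THE PROBABILITY THAT A CONTINUOUS RANDOM VARIABLE FALLS WITHIN A RANGE OF VALUES)
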
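MEASURES OF CENTRAL TENDENCY
One number that best summarizes the entire set of measurements
A number that is in some way “central” to the set
MEAN / AVERAGE
THE MEAN (OR AVERAGE) IS A MEASURE OF THE CENTRAL TENDENCY OF THE DATA: THE SUM OF ALL VALUES DIVIDED BY THE NUMBER OF VALUES

x̄ = (12+24+41+51+67+67+85+99) / 8 = 446 / 8 = 55.75
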
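Median
The median is the value that divides the sorted data into 2 equal parts.

The median is the middle term if the number of terms is odd.

The median is the average of the middle 2 terms if the number of terms is even. For example, for 12, 24, 41, 51, 67, 67, 85, 99 the median is (51 + 67) / 2 = 59.
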
Mode
The mode is the value that appears the greatest number of times in a data set. For
example, in the following data set, 0 appears the most times (4 times).
Therefore, it is the mode.
0,0,1,2,3,0,4,5,0

A data set with two modes is BIMODAL, one with three modes is TRIMODAL, and one with several modes is MULTIMODAL.

Variance
Variance is a measure of dispersion in a data set. In other words, it
measures how spread out a data set is. It is calculated by first finding the
deviation of each element in the data set from the mean, and then
squaring each deviation.
The variance is the average of all squared deviations.

Steps to Calculate Variance:

1. List the elements of the data set

The following are the ages of students pursuing a Master’s degree:
Data set 1: 28,25,26,27,31,32,24

2. Calculate the mean

(28 + 25 + 26 + 27 + 31 + 32 + 24) / 7 = 193 / 7 ≈ 27.57

3. Find the deviation from the mean for each data point

0.43, -2.57, -1.57, -0.57, 3.43, 4.43, -3.57

4. Square each deviation

0.1849, 6.6049, 2.4649, 0.3249, 11.7649, 19.6249, 12.7449

5. The average of all squared deviations is the variance
To find it, add all the squared deviations and divide the sum by the number of elements in the data set (n)
(0.1849 + 6.6049 + 2.4649 + 0.3249 + 11.7649 + 19.6249 + 12.7449) / 7

53.7143 / 7 ≈ 7.6735

Dividing by n gives the population variance; dividing by n - 1 instead gives the sample variance.

Now that we know how to calculate the measures of central tendency and
dispersion of a data set, let us look at how to find the same using Python.
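
The slide that followed here appears to have held code that did not survive extraction. As a minimal sketch, and assuming the standard-library `statistics` module is an acceptable stand-in for whatever the original slide used, the worked examples above can be reproduced like this (the variable names are illustrative):

```python
import statistics

# Data used in the mean example above
values = [12, 24, 41, 51, 67, 67, 85, 99]
# Data set 1 from the variance example (ages of students)
ages = [28, 25, 26, 27, 31, 32, 24]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # average of the two middle terms (even count)
mode_value = statistics.mode(values)      # most frequent value

# pvariance divides by n (population variance), matching the steps above;
# statistics.variance would divide by n - 1 (sample variance).
variance_value = statistics.pvariance(ages)

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("population variance:", round(variance_value, 4))
```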

Standard Deviation
Standard deviation is the square root of the variance.

Standard deviation tells us how concentrated the data are around the mean
of the data set.

Standard deviation is inversely related to the concentration of the data
around the mean, i.e. with high concentration, the standard deviation will be
low, and vice versa.

To find the standard deviation using Python, we will use data set 2. The numbers
are listed below.
results = [3,3,3,5,6,1]
To calculate the standard deviation, use the built-in function “std” from NumPy.
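
The original code slide is missing from the extracted text, so the following is a sketch of what the call might look like rather than the author's exact code. It uses `numpy.std`, whose default (`ddof=0`) divides by n; passing `ddof=1` is shown as an optional variation for the sample standard deviation.

```python
import numpy as np

# Data set 2 from the slide
results = [3, 3, 3, 5, 6, 1]

# Population variance and standard deviation (divide by n, NumPy's default)
variance = np.var(results)
std_dev = np.std(results)

# Sample standard deviation (divide by n - 1)
sample_std = np.std(results, ddof=1)

print("variance:", round(float(variance), 4))
print("standard deviation:", round(float(std_dev), 4))
print("sample standard deviation:", round(float(sample_std), 4))
```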

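RANGE

Difference between the upper and lower limits of the data on a specific scale

Range = max - min

INTERQUARTILE RANGE

The sorted data set is divided into four equal portions by the quartiles
Q1 = 25th percentile
Q2 = 50th percentile (the median)
Q3 = 75th percentile

Interquartile range (IQR) = Q3 - Q1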
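
As an illustration, the range and IQR of data set 1 (the student ages used earlier) can be computed with NumPy. This is a sketch under one assumption: it relies on `numpy.percentile` with its default linear interpolation, and other quartile conventions can give slightly different Q1 and Q3 values.

```python
import numpy as np

# Data set 1 (ages of students) from the variance example
ages = np.array([28, 25, 26, 27, 31, 32, 24])

# Range = max - min
data_range = ages.max() - ages.min()

# Quartiles as the 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(ages, [25, 50, 75])

# IQR = Q3 - Q1
iqr = q3 - q1

print("range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
```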

Frequency
The frequency of an event is the number of times “n” that the event occurred in an
experiment.

Frequencies are often represented graphically in histograms.
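
A small sketch of counting frequencies, reusing the data set from the Mode slide. The use of `collections.Counter` is an illustrative choice, not something the slides specify; the counts it produces are the bar heights a histogram of this data would show.

```python
from collections import Counter

# Data set from the Mode example
data = [0, 0, 1, 2, 3, 0, 4, 5, 0]

# Frequency of each distinct value
frequency = Counter(data)

# The value with the highest frequency is the mode
most_common_value, most_common_count = frequency.most_common(1)[0]

for value in sorted(frequency):
    print(value, "occurs", frequency[value], "time(s)")
```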

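NORMAL DISTRIBUTION

A SYMMETRIC, BELL-SHAPED DISTRIBUTION CENTRED ON THE MEAN. QUANTITIES THAT OFTEN FOLLOW IT APPROXIMATELY:

• HEIGHT OF PEOPLE
• MARKS ON A TEST
• BLOOD PRESSURE MEASUREMENT
• ERRORS IN MEASUREMENT
• SIZE OF THINGS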
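SKEWNESS

• MEASURE OF THE ASYMMETRY OF A DISTRIBUTION
• A SYMMETRICAL DISTRIBUTION HAS A SKEWNESS OF ZERO
• A DISTRIBUTION CAN BE SYMMETRIC (NO SKEW), RIGHT-SKEWED (LONG RIGHT TAIL, POSITIVE SKEWNESS) OR LEFT-SKEWED (LONG LEFT TAIL, NEGATIVE SKEWNESS)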
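KURTOSIS

• MEASURE OF HOW HEAVY THE TAILS OF A DISTRIBUTION ARE, OFTEN DESCRIBED AS ITS PEAKEDNESS OR FLATNESS
• A PERFECT GAUSSIAN DISTRIBUTION HAS A KURTOSIS OF 3, I.E. AN EXCESS KURTOSIS (KURTOSIS - 3) OF ZERO
• THREE TYPES: MESOKURTIC, PLATYKURTIC, LEPTOKURTIC
• MESOKURTIC: KURTOSIS = 3 (EXCESS KURTOSIS IS ZERO)
• LEPTOKURTIC: KURTOSIS > 3
• PLATYKURTIC: KURTOSIS < 3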
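
To make the two measures concrete, here is a minimal sketch that computes skewness and excess kurtosis from central moments with NumPy. The function name and the two small samples are illustrative assumptions; the formulas are the plain population-moment versions (no bias correction), so library routines that apply a correction may return slightly different numbers.

```python
import numpy as np

def skewness_and_excess_kurtosis(data):
    """Return (skewness, excess kurtosis) using population central moments."""
    x = np.asarray(data, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # variance
    m3 = np.mean(deviations ** 3)  # third central moment
    m4 = np.mean(deviations ** 4)  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # a normal distribution gives 0
    return skewness, excess_kurtosis

# A symmetric sample: skewness is zero
symmetric_skew, symmetric_kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])

# A sample with a long right tail: skewness is positive
right_skew, right_kurt = skewness_and_excess_kurtosis([1, 1, 2, 2, 3, 10])

print("symmetric sample:", symmetric_skew, symmetric_kurt)
print("right-tailed sample:", right_skew, right_kurt)
```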
STANDARD NORMAL DISTRIBUTION: Z SCORE

THE Z SCORE CONVERTS DATA INTO A DISTRIBUTION WHOSE MEAN IS ZERO AND STANDARD DEVIATION IS ONE

Z SCORE = (X - µ) / σ
µ - MEAN OF DATA
σ - STANDARD DEVIATION OF DATA
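
The formula above can be sketched in a few lines of NumPy. The data (data set 1, the student ages) is reused from the variance example; µ and σ are taken as the mean and population standard deviation of that data.

```python
import numpy as np

# Data set 1 (ages of students)
ages = np.array([28, 25, 26, 27, 31, 32, 24], dtype=float)

mu = ages.mean()    # mean of the data
sigma = ages.std()  # population standard deviation of the data

# Z score = (X - mu) / sigma, applied to every element
z_scores = (ages - mu) / sigma

print("z scores:", np.round(z_scores, 3))
print("mean of z scores:", round(float(z_scores.mean()), 6))
print("std of z scores:", round(float(z_scores.std()), 6))
```

After the transformation the values have mean zero and standard deviation one, which is what makes z scores comparable across features measured on different scales.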
CONFIDENCE INTERVAL

• DESCRIBES THE AMOUNT OF UNCERTAINTY ASSOCIATED WITH A SAMPLE ESTIMATE OF A POPULATION PARAMETER
• CONFIDENCE INTERVAL = POINT ESTIMATE ± MARGIN OF ERROR
• MARGIN OF ERROR = CRITICAL VALUE × STANDARD DEVIATION OF THE STATISTIC (ITS STANDARD ERROR)
• THE CRITICAL VALUE AND STANDARD DEVIATION DEPEND ON THE TYPE OF DATA DISTRIBUTION
• Z-DISTRIBUTION
• T-DISTRIBUTION
Z DISTRIBUTION
The confidence interval using the Z distribution is
given by:

C.I = x̄ ± Z* × (σ / √n)

where
x̄ - point estimate or sample mean
Z* - critical value for the chosen confidence level 1 - α
σ - population standard deviation
n - sample size
C.I - confidence interval
α - significance level
Z-TABLE
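
Instead of reading the critical value from a Z-table, it can be computed. The sketch below uses `statistics.NormalDist` from the Python standard library. Note the assumptions: the population standard deviation `sigma = 3.0` is a made-up value chosen only for illustration (the slides do not give one), and the sample is data set 1 from the variance example.

```python
from math import sqrt
from statistics import NormalDist, mean

# Sample: data set 1 (ages of students)
ages = [28, 25, 26, 27, 31, 32, 24]

sigma = 3.0        # ASSUMED population standard deviation (illustrative only)
confidence = 0.95  # confidence level, so alpha = 0.05
alpha = 1 - confidence

# Two-sided critical value Z*: the point with alpha/2 in the upper tail
z_star = NormalDist().inv_cdf(1 - alpha / 2)

x_bar = mean(ages)
margin_of_error = z_star * sigma / sqrt(len(ages))

ci_lower = x_bar - margin_of_error
ci_upper = x_bar + margin_of_error

print("Z*:", round(z_star, 3))
print("95% CI:", round(ci_lower, 2), "to", round(ci_upper, 2))
```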
T-DISTRIBUTION
The confidence interval using the T-distribution is given by:

C.I = x̄ ± t* × (s / √n)

where
x̄ - point estimate or sample mean
t* - critical value with n - 1 degrees of freedom
s - sample standard deviation
n - sample size
C.I - confidence interval
α - significance level
T TABLE
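
A sketch of the same calculation when the population standard deviation is unknown. To stay within the standard library, the critical value is hard-coded as it would be read from a t-table: 2.447 for a 95% two-sided interval with n - 1 = 6 degrees of freedom. The sample is again data set 1; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Sample: data set 1 (ages of students), n = 7
ages = [28, 25, 26, 27, 31, 32, 24]
n = len(ages)

# Critical value read from a t-table: 95% two-sided, n - 1 = 6 degrees of freedom
t_star = 2.447

x_bar = mean(ages)
s = stdev(ages)  # sample standard deviation (divides by n - 1)

margin_of_error = t_star * s / sqrt(n)
ci_lower = x_bar - margin_of_error
ci_upper = x_bar + margin_of_error

print("sample std:", round(s, 3))
print("95% CI:", round(ci_lower, 2), "to", round(ci_upper, 2))
```

For small samples the t critical value is larger than the corresponding Z value, so the interval is wider.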
Outlier
A data point that lies far outside the normal range of the rest of the data

• May indicate bad data
• Needs close attention, since it can have a large impact on the result
• Can lead to wildly wrong estimations
• Include or exclude it based on its impact

Univariate Outlier: A univariate outlier is a data point that has
an extreme value on one variable.

Multivariate Outlier: A multivariate outlier is a combination of
unusual scores on at least two variables.

CAUSES OF OUTLIERS:
• Data entry errors (human errors)
• Measurement errors (instrument errors)
• Experimental errors (data extraction or experiment planning/executing errors)
• Intentional (dummy outliers made to test detection methods)
• Data processing errors (data manipulation errors)
• Sampling errors (extracting or mixing data from wrong or various sources)
• Natural (not an error, novelties in data)

Detecting Outliers:
We can use visualization to detect outliers.
Various visualization methods, such as the box plot,
histogram and scatter plot, can be used.

Data visualization

EDA EXPLAINED USING WINE QUALITY DATASET

1. IMPORT NECESSARY LIBRARIES
• PANDAS
• NUMPY
• MATPLOTLIB
• SEABORN

2. IMPORT DATASET
• USE PANDAS

3. CHECK FOR DIMENSION OF THE DATASET
• NUMBER OF DATA RECORDS AND FEATURES

4. CHECK FOR TYPE OF FEATURES AND MISSING VALUES

5. OBTAIN SUMMARY STATISTICS OF DATA

6. FIND CORRELATION OF FEATURES
• USE SEABORN

7. CHECK FOR OUTLIERS AND VISUALISE DATA DISTRIBUTION
• USE BOXPLOT

8. CHECK THE DISTRIBUTION SHAPE (NORMALITY AND SKEWNESS) OF EACH FEATURE
• USE HISTOGRAM
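
The non-graphical steps above could be sketched with pandas roughly as follows. Several things here are assumptions rather than the author's code: the helper name `basic_eda`, the file name and separator in the commented `read_csv` line, and the tiny in-memory table, whose values are placeholders and not real wine data.

```python
import pandas as pd

def basic_eda(df):
    """Collect the basic EDA outputs for a DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,             # number of records and features
        "dtypes": df.dtypes,           # type of each feature
        "missing": df.isnull().sum(),  # missing values per column
        "summary": df.describe(),      # count, mean, std, min, quartiles, max
        "correlation": df.corr(),      # pairwise Pearson correlation
    }

# With the real data one would load a file first, for example:
# df = pd.read_csv("winequality-white.csv", sep=";")  # path and separator assumed
# Placeholder table so the sketch runs on its own (values are made up):
toy = pd.DataFrame({
    "alcohol": [9.5, 10.1, 11.3, 12.0, 9.8],
    "density": [0.998, 0.996, 0.993, 0.991, 0.997],
    "quality": [5, 6, 6, 7, 5],
})

report = basic_eda(toy)
print("shape:", report["shape"])
print(report["missing"])
print(report["summary"])
print(report["correlation"])
```

The correlation table can then be passed to seaborn's `heatmap` function (with `annot=True` to print the coefficients in each cell), and box plots and histograms cover the outlier and distribution checks.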
DATASET PREVIEW
DIMENSION

THE DATASET COMPRISES 4898 OBSERVATIONS AND 12 CHARACTERISTICS

OF WHICH ONE IS THE DEPENDENT VARIABLE (QUALITY) AND THE REMAINING 11 ARE INDEPENDENT
VARIABLES
FEATURE TYPES AND MISSING VALUES

• The data has only float and integer values

• No column has null/missing values
SUMMARY STATISTICS

• Here, the median of each column, shown as 50% (the 50th percentile) in the index column, is less than its mean

• There is a notably large difference between the 75th percentile and the max values of the predictors “residual sugar”, “free sulfur dioxide” and “total sulfur dioxide”

• Thus observations 1 and 2 suggest that there are extreme values (outliers) in our data set
• The target (dependent) variable is discrete and categorical in nature

• The “quality” score scale ranges from 1 to 10, where 1 is poor and 10 is the best

• No observation has a quality rating of 1, 2 or 10. The only scores obtained are between 3 and 9
• THIS TELLS US THE COUNT OF OBSERVATIONS FOR EACH QUALITY SCORE IN DESCENDING ORDER

• “QUALITY” HAS MOST VALUES CONCENTRATED IN THE CATEGORIES 5, 6 AND 7

• ONLY A FEW OBSERVATIONS WERE MADE FOR THE CATEGORIES 3 & 9
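
The output these bullets describe was most likely a table of counts per score. A sketch of how such a table can be produced with pandas `value_counts` is shown below; the scores in the series are made-up placeholders, not the real wine data.

```python
import pandas as pd

# Placeholder quality scores (illustrative only, not the real dataset)
quality = pd.Series([6, 5, 6, 7, 5, 6, 3, 9, 6, 5], name="quality")

# Count of observations per score, sorted in descending order of count
counts = quality.value_counts()

# Scores on the 1 to 10 scale that no observation received
unused_scores = sorted(set(range(1, 11)) - set(quality))

print(counts)
print("scores never given:", unused_scores)
```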

NOW, LET’S EXPLORE THE DATA USING BEAUTIFUL GRAPHS

CORRELATION MATRIX: HEAT MAP (SEABORN)

CORRELATION MATRIX WITH ANNOTATION
1) HERE WE CAN INFER THAT “DENSITY” HAS A STRONG POSITIVE CORRELATION WITH
“RESIDUAL SUGAR”, WHEREAS IT HAS A STRONG NEGATIVE CORRELATION WITH “ALCOHOL”

2) “FREE SULFUR DIOXIDE” AND “CITRIC ACID” HAVE ALMOST NO CORRELATION WITH
“QUALITY”

3) SINCE THE CORRELATION IS CLOSE TO ZERO, WE CAN INFER THAT THERE IS ALMOST NO LINEAR RELATIONSHIP BETWEEN
THESE TWO PREDICTORS AND “QUALITY”
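
Point 3 speaks of a linear relationship for a reason: a correlation near zero rules out a straight-line relationship but not other kinds. The following sketch, using made-up numbers and `numpy.corrcoef`, contrasts a perfectly linear pair with a perfectly quadratic one.

```python
import numpy as np

# Illustrative values, symmetric around zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

y_linear = 2 * x + 1  # exact straight-line relationship
y_quadratic = x ** 2  # exact relationship, but not a straight line

# Pearson correlation coefficient is the off-diagonal entry of the matrix
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print("linear r:", round(float(r_linear), 6))
print("quadratic r:", round(float(r_quadratic), 6))
```

The quadratic pair is completely determined by x, yet its correlation coefficient is zero, so a near-zero cell in the heat map should be read as "no linear relationship" and nothing stronger.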
DATA DISTRIBUTION

BOXPLOT
IN THE SIMPLEST BOX PLOT THE CENTRAL RECTANGLE SPANS THE FIRST QUARTILE TO THE
THIRD QUARTILE
(THE INTERQUARTILE RANGE OR IQR)

A SEGMENT INSIDE THE RECTANGLE SHOWS THE MEDIAN, AND “WHISKERS” ABOVE AND
BELOW THE BOX SHOW THE LOCATIONS OF THE MINIMUM AND MAXIMUM

WHEN OUTLIERS ARE DRAWN SEPARATELY, POINTS 1.5×IQR OR MORE ABOVE THE THIRD QUARTILE OR
BELOW THE FIRST QUARTILE ARE COMMONLY FLAGGED AS SUSPECTED OUTLIERS, AND POINTS 3×IQR OR MORE
BEYOND THE QUARTILES AS EXTREME OUTLIERS
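
The IQR rule can also be applied numerically, without drawing the plot. In this sketch the function name `iqr_outliers` and the multiplier argument `k` are illustrative; the data is data set 1 (the student ages) with one extra value of 60 added purely as an assumption so that there is something to detect.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    x = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return x[(x < lower_fence) | (x > upper_fence)]

# Student ages plus one artificially added extreme value (60)
ages = [28, 25, 26, 27, 31, 32, 24, 60]

suspected = iqr_outliers(ages, k=1.5)  # the usual box-plot fence
extreme = iqr_outliers(ages, k=3.0)    # the stricter 3 x IQR fence

print("suspected outliers:", suspected)
print("extreme outliers:", extreme)
```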
IN OUR DATA SET, ALL FEATURE COLUMNS EXCEPT “ALCOHOL” SHOW OUTLIERS
DISTRIBUTION SHAPE
HISTOGRAM

• The “pH” column appears to be normally distributed

• All the remaining independent variables are right-skewed (positively skewed)

Data Cleaning
The next step is to look for irregularities in
the data by doing some data exploration.

This phase involves finding the null values and either replacing
them with other values or dropping the affected rows altogether.
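
Both options can be sketched with pandas. The table below is a made-up placeholder with two missing entries, and filling with the column median is only one possible replacement strategy, chosen here for illustration.

```python
import pandas as pd

# Placeholder table with missing values (illustrative only)
df = pd.DataFrame({
    "pH": [3.0, None, 3.2, 3.4],
    "alcohol": [9.5, 10.0, None, 11.0],
})

# Find the null values: count per column
null_counts = df.isnull().sum()

# Option 1: drop every row that contains a null value
dropped = df.dropna()

# Option 2: replace nulls with another value, here each column's median
filled = df.fillna(df.median())

print(null_counts)
print(dropped)
print(filled)
```

Dropping rows is simplest but discards data; replacing values keeps every row at the cost of introducing estimated entries.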

After we are done cleaning, we can move ahead and make some visualizations
to understand the relationships between various aspects of our dataset.

Based on our analysis, we can draw conclusions and provide data-driven
insights into the problems.
