Part 2: Statistics

The document discusses exploratory data analysis (EDA). It describes EDA as the critical process of initially investigating data to discover patterns, spot anomalies, test hypotheses, and check assumptions. This is done through summary statistics and graphical representations to understand the data and gather insights. It also discusses different data types, measures of central tendency (mean, median, mode), measures of dispersion (variance, standard deviation, range, interquartile range), frequency, normal distribution, skewness, kurtosis, outliers, and provides an example of EDA on a wine quality dataset.


EXPLORATORY DATA ANALYSIS

EXPLORATORY DATA ANALYSIS REFERS TO THE CRITICAL PROCESS OF PERFORMING INITIAL INVESTIGATIONS ON DATA SO AS:
• TO DISCOVER PATTERNS
• TO SPOT ANOMALIES
• TO TEST HYPOTHESES
• TO CHECK ASSUMPTIONS

WITH THE HELP OF SUMMARY STATISTICS AND GRAPHICAL REPRESENTATIONS, IN ORDER TO UNDERSTAND THE DATA AND GATHER AS MANY INSIGHTS FROM IT AS POSSIBLE
DATA LEVELS
NOMINAL (QUALITATIVE)
DATA MEASURED AS CATEGORIES SUCH AS WORDS, LETTERS, SYMBOLS, ETC.
ORDINAL (QUALITATIVE)
DATA PLACED IN AN ORDER ACCORDING TO THEIR RANK OR WEIGHT
INTERVAL (QUANTITATIVE)
DATA ORDERED ON A SCALE WITH EQUAL INTERVALS BETWEEN VALUES, BUT WITH NO TRUE ZERO
RATIO (QUANTITATIVE)
SAME AS INTERVAL, BUT WITH A TRUE (ABSOLUTE) ZERO POINT
RANDOM VARIABLE:
A FUNCTION THAT ASSIGNS A NUMERICAL VALUE TO EACH OUTCOME OF A RANDOM EXPERIMENT
DISCRETE (E.G., TOSSING A COIN, ROLLING A DIE)
CONTINUOUS (E.G., A FLOATING INTEREST RATE)

PROBABILITY
IT IS THE CHANCE OF SOME EVENT HAPPENING.
DISCRETE PROBABILITY (LISTS THE POSSIBLE VALUES OF A DISCRETE RANDOM VARIABLE, E.G., 1, 2, 3, 4)
CONTINUOUS PROBABILITY (DESCRIBES THE POSSIBLE VALUES OF A CONTINUOUS RANDOM VARIABLE OVER A RANGE)
One number that best summarizes the entire set of measurements
A number that is in some way “central” to the set
MEAN / AVERAGE
THE MEAN OR AVERAGE IS A MEASURE OF CENTRAL TENDENCY OF THE DATA

x̄ = (12 + 24 + 41 + 51 + 67 + 67 + 85 + 99) / 8 = 55.75
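A minimal Python sketch (using numpy, which the later examples also rely on) that reproduces this calculation:

```python
import numpy as np

data = [12, 24, 41, 51, 67, 67, 85, 99]

# Mean = sum of all values divided by the number of values
print(np.mean(data))   # 55.75
```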

Median
The median is the value which divides the data into 2 equal parts.

The median is the middle term, if the number of terms is odd.

The median is the average of the middle 2 terms, if the number of terms is even.
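A short sketch of both cases using numpy's median (the data values here are only illustrative):

```python
import numpy as np

odd_count = [12, 24, 41, 51, 67]        # odd number of terms -> middle term
even_count = [12, 24, 41, 51, 67, 85]   # even number of terms -> average of the middle two

print(np.median(odd_count))    # 41.0
print(np.median(even_count))   # 46.0  ((41 + 51) / 2)
```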

Mode
It is the value which appears the greatest number of times in a data set. For
example, in the following data set, 0 appears the most often. Therefore, it is
the mode.
0, 0, 1, 2, 3, 0, 4, 5, 0

A data set can also be bimodal, trimodal or multimodal, when two or more values are tied for the highest frequency.
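The same data set in Python, using the standard library's statistics module (multimode also covers the bimodal/multimodal case):

```python
from statistics import mode, multimode

data = [0, 0, 1, 2, 3, 0, 4, 5, 0]

print(mode(data))       # 0 - the value that appears most often
print(multimode(data))  # [0] - all modes; returns several values for bimodal/multimodal data
```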

Variance
Variance is the measure of dispersion in a data set. In other words, it
measures how spread out a data set is. It is calculated by first finding the
deviation of each element in the data set from the mean, and then by
squaring it.
Variance is the average of all squared deviations.

Steps to Calculate Variance:

1. List the elements of the data set. The following are the ages of students pursuing a Master's degree:
Data set 1: 28, 25, 26, 27, 31, 32, 24

2. Calculate the mean:
(28 + 25 + 26 + 27 + 31 + 32 + 24) / 7 = 27.57

3. Find the deviation from the mean for each data point.

4. Square each deviation.
5. The average of all squared deviations is the variance.
To find it, add all squared deviations and divide the sum by the number of elements in the data set (n):

(0.1849 + 6.6049 + 2.4649 + 0.3249 + 11.7649 + 19.6249 + 12.7449) / 7 = 53.7143 / 7 ≈ 7.67

Now that we know how to calculate the measures of central tendency and
dispersion of a data set, let us look at how to find the same using Python.
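A minimal numpy sketch for data set 1 (np.var divides by n by default, matching the calculation above; ddof=1 gives the sample variance, which divides by n - 1):

```python
import numpy as np

ages = [28, 25, 26, 27, 31, 32, 24]   # Data set 1

print(np.mean(ages))          # 27.571...
print(np.var(ages))           # 7.673...  (population variance, divides by n)
print(np.var(ages, ddof=1))   # 8.952...  (sample variance, divides by n - 1)
```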

Standard Deviation
Standard deviation is the square root of the variance.

Standard deviation tells us about the concentration of the data around the mean
of the data set.

Standard deviation is inversely proportional to the concentration of the data
around the mean, i.e. with high concentration, the standard deviation will be low,
and vice versa.

To find the standard deviation using Python, we will use data set 2. The numbers
are listed below, and we already know the variance.
results = [3, 3, 3, 5, 6, 1]
To calculate the standard deviation, use the built-in function "std" from numpy
as shown below:
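A sketch of that call (np.std divides by n by default; pass ddof=1 for the sample standard deviation):

```python
import numpy as np

results = [3, 3, 3, 5, 6, 1]

# Standard deviation = square root of the variance
print(np.std(results))           # ~1.607 (population standard deviation)
print(np.std(results, ddof=1))   # ~1.761 (sample standard deviation)
```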

RANGE

Difference between the upper and lower limits on a specific scale

Range = max - min

INTER QUARTILE RANGE

The dataset is divided into four equal portions
Q1 = 25%
Q2 = 50%
Q3 = 75%

Inter Quartile Range (IQR) = Q3 - Q1
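A small sketch computing both measures with numpy (the data values are only illustrative):

```python
import numpy as np

data = [1, 3, 3, 3, 5, 6]

data_range = np.max(data) - np.min(data)   # Range = max - min
q1, q3 = np.percentile(data, [25, 75])     # first and third quartiles
iqr = q3 - q1                              # Inter Quartile Range = Q3 - Q1

print(data_range, q1, q3, iqr)             # 5 3.0 4.5 1.5
```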


Frequency
The frequency of an event is the number of times ("n") the event occurred in an
experiment.

Frequencies are often represented graphically in histograms.
NORMAL DISTRIBUTION

• HEIGHT OF PEOPLE
• MARKS ON A TEST
• BLOOD PRESSURE MEASUREMENT
• ERRORS IN MEASUREMENT
• SIZE OF THINGS
SKEWNESS

• MEASURE OF THE ASYMMETRY OF A DISTRIBUTION
• A SYMMETRICAL DISTRIBUTION HAS A SKEWNESS OF ZERO
• A DISTRIBUTION CAN BE NORMAL (SYMMETRIC), RIGHT-TAILED (POSITIVELY SKEWED) OR LEFT-TAILED (NEGATIVELY SKEWED)
KURTOSIS

• MEASURE OF THE PEAKEDNESS OR FLATNESS (TAILEDNESS) OF A DISTRIBUTION
• A PERFECT GAUSSIAN DISTRIBUTION HAS A KURTOSIS OF 3 (AN EXCESS KURTOSIS OF ZERO)
• MESOKURTIC, PLATYKURTIC, LEPTOKURTIC
• MESOKURTIC: KURTOSIS = 3 (EXCESS KURTOSIS OF ZERO)
• LEPTOKURTIC: KURTOSIS > 3
• PLATYKURTIC: KURTOSIS < 3
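A sketch using scipy.stats to check these quantities on simulated normal data (scipy reports excess kurtosis by default, i.e. about 0 for a Gaussian; pass fisher=False for the convention where a Gaussian is about 3):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=10_000)   # simulated Gaussian data

print(skew(sample))                    # ~0 for a symmetric distribution
print(kurtosis(sample, fisher=False))  # ~3 for a Gaussian (kurtosis)
print(kurtosis(sample, fisher=True))   # ~0 (excess kurtosis, scipy's default)
```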
STANDARD NORMAL DISTRIBUTION - Z SCORE

CONVERTS DATA INTO A DISTRIBUTION WHOSE MEAN IS ZERO AND STANDARD DEVIATION IS ONE

Z SCORE = (X - µ) / σ
µ - MEAN OF DATA
σ - STANDARD DEVIATION OF DATA
CONFIDENCE INTERVAL

• DESCRIBES THE AMOUNT OF UNCERTAINTY ASSOCIATED WITH A SAMPLE ESTIMATE OF A POPULATION PARAMETER
• CONFIDENCE INTERVAL = POINT ESTIMATE ± MARGIN OF ERROR
• MARGIN OF ERROR = CRITICAL VALUE × STANDARD DEVIATION OF THE STATISTIC
• THE CRITICAL VALUE AND STANDARD DEVIATION DEPEND ON THE TYPE OF DATA DISTRIBUTION
• Z-DISTRIBUTION
• T-DISTRIBUTION
Z DISTRIBUTION
The confidence interval using the Z distribution is given by:

C.I. = x̄ ± Z* (σ / √n)

where
x̄ - point estimate or sample mean
Z* - critical value (Z at α/2)
σ - population standard deviation
n - sample size
C.I - confidence interval
α - significance level
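A sketch of this interval in Python, assuming a hypothetical sample and a known population standard deviation (scipy.stats.norm.ppf supplies the critical value Z*):

```python
import numpy as np
from scipy import stats

sample = [28, 25, 26, 27, 31, 32, 24]   # hypothetical sample
sigma = 3.0                             # assumed known population standard deviation
confidence = 0.95                       # 1 - α

x_bar = np.mean(sample)                             # point estimate
z_star = stats.norm.ppf(1 - (1 - confidence) / 2)   # critical value Z* (about 1.96)
margin = z_star * sigma / np.sqrt(len(sample))      # margin of error

print((x_bar - margin, x_bar + margin))             # confidence interval
```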
Z-TABLE
T-DISTRIBUTION
The confidence interval using the T-distribution is given by:

C.I. = x̄ ± t* (s / √n)

where
x̄ - point estimate or sample mean
t* - critical value (t at α/2 with n - 1 degrees of freedom)
s - sample standard deviation
n - sample size
C.I - confidence interval
α - significance level
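The same sketch with the T-distribution, using the sample standard deviation and a critical value that depends on n - 1 degrees of freedom:

```python
import numpy as np
from scipy import stats

sample = [28, 25, 26, 27, 31, 32, 24]   # hypothetical sample
confidence = 0.95                       # 1 - α

x_bar = np.mean(sample)                                               # point estimate
s = np.std(sample, ddof=1)                                            # sample standard deviation
t_star = stats.t.ppf(1 - (1 - confidence) / 2, df=len(sample) - 1)    # critical value t*
margin = t_star * s / np.sqrt(len(sample))                            # margin of error

print((x_bar - margin, x_bar + margin))                               # confidence interval
```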
T TABLE
Outlier
Data which lies outside the normal range of the data set.

May indicate bad data.

Needs close attention, as it can have a large impact on results.

Can lead to wildly wrong estimations.

Include or exclude outliers based on their impact.

Univariate Outlier – A univariate outlier is a data point that consists
of an extreme value on one variable.

Multivariate Outlier – A multivariate outlier is a combination of
unusual scores on at least two variables.
CAUSES OF OUTLIERS:
Data entry errors (human errors)

Measurement errors (instrument errors)

Experimental errors (data extraction or experiment planning/executing errors)

Intentional (dummy outliers made to test detection methods)

Data processing errors (data manipulation errors)

Sampling errors (extracting or mixing data from wrong or various sources)

Natural (not an error, novelties in data)

Detecting Outliers:
We can use visualization to detect outliers.
Various visualization methods, like box plot, histogram and scatter plot, can be used.
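A minimal sketch of one such plot, assuming the white-wine CSV file name and its semicolon separator:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("winequality-white.csv", sep=";")   # assumed file name and separator

# Box plot: points drawn beyond the whiskers are potential outliers
sns.boxplot(x=df["residual sugar"])
plt.show()
```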

Data visualization

EDA EXPLAINED USING WINE QUALITY DATASET

• IMPORT NECESSARY LIBRARIES (PANDAS, NUMPY, MATPLOTLIB, SEABORN)

• IMPORT DATASET (USE PANDAS)

• CHECK FOR DIMENSION OF THE DATASET (NUMBER OF DATA RECORDS AND FEATURES)

• CHECK FOR TYPE OF FEATURES AND MISSING VALUES
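A sketch of these first steps, assuming the UCI white-wine file name and its semicolon separator:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import the dataset with pandas (assumed file name and separator)
df = pd.read_csv("winequality-white.csv", sep=";")

print(df.shape)           # dimension: (number of records, number of features)
print(df.dtypes)          # type of each feature
print(df.isnull().sum())  # missing values per column
```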


EDA CONTD…

• OBTAIN SUMMARY STATISTICS OF DATA

• FIND CORRELATION OF FEATURES (USE SEABORN)

• CHECK FOR OUTLIERS AND VISUALISE DATA DISTRIBUTION (USE BOXPLOT)

• CHECK FOR LINEARITY OF DATA (USE HISTOGRAM)
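A sketch of these steps, continuing from the same assumed CSV file:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("winequality-white.csv", sep=";")    # assumed file name and separator

print(df.describe())                                  # summary statistics of the data

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")   # correlation of features (Seaborn)
plt.show()

df.boxplot(rot=90, figsize=(12, 6))                   # outliers and data distribution
plt.show()

df.hist(bins=30, figsize=(12, 10))                    # distribution shape of each feature
plt.show()
```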
DATASET PREVIEW
DIMENSION

THE DATASET COMPRISES 4898 OBSERVATIONS AND 12 CHARACTERISTICS,
OUT OF WHICH ONE IS THE DEPENDENT VARIABLE (QUALITY) AND THE REMAINING 11 ARE
INDEPENDENT VARIABLES
FEATURE TYPES AND MISSING VALUES

• Data has only float and integer values
• No variable column has null/missing values
SUMMARY STATISTICS

• Here, the median value of each column (represented by 50%, the 50th percentile, in the index column) is less than its mean value

• There is a notably large difference between the 75th percentile and max values of the predictors "residual sugar", "free sulfur dioxide" and "total sulfur dioxide"

• The first two observations suggest that there are extreme values (outliers) in our data set

• The target/dependent variable "quality" is discrete and categorical in nature

• The "quality" score scale ranges from 1 to 10, where 1 is poor and 10 is the best

• Quality ratings 1, 2 and 10 were not given by any observation; the only scores obtained lie between 3 and 9
• THIS TELLS US THE VOTE COUNT OF EACH QUALITY SCORE IN DESCENDING ORDER

• "QUALITY" HAS MOST VALUES CONCENTRATED IN THE CATEGORIES 5, 6 AND 7

• ONLY A FEW OBSERVATIONS WERE MADE FOR THE CATEGORIES 3 AND 9
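The counts described above can be obtained with pandas' value_counts (same assumed file as before):

```python
import pandas as pd

df = pd.read_csv("winequality-white.csv", sep=";")   # assumed file name and separator

# Vote count of each quality score, largest category first
print(df["quality"].value_counts())
```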


NOW, LET'S EXPLORE THE DATA USING SOME BEAUTIFUL GRAPHS

CORRELATION MATRIX - HEAT MAP - SEABORN

CORRELATION MATRIX WITH ANNOTATION
1) HERE WE CAN INFER THAT "DENSITY" HAS A STRONG POSITIVE CORRELATION WITH
"RESIDUAL SUGAR", WHEREAS IT HAS A STRONG NEGATIVE CORRELATION WITH "ALCOHOL"

2) "FREE SULFUR DIOXIDE" AND "CITRIC ACID" HAVE ALMOST NO CORRELATION WITH
"QUALITY"

3) SINCE THE CORRELATION IS CLOSE TO ZERO, WE CAN INFER THAT THERE IS NO LINEAR
RELATIONSHIP BETWEEN THESE PREDICTORS AND "QUALITY"
DATA DISTRIBUTION

BOXPLOT
IN THE SIMPLEST BOX PLOT, THE CENTRAL RECTANGLE SPANS THE FIRST QUARTILE TO THE
THIRD QUARTILE (THE INTERQUARTILE RANGE, OR IQR)

A SEGMENT INSIDE THE RECTANGLE SHOWS THE MEDIAN, AND "WHISKERS" ABOVE AND
BELOW THE BOX SHOW THE MINIMUM AND MAXIMUM VALUES THAT ARE NOT FLAGGED AS OUTLIERS

POINTS MORE THAN 1.5×IQR BEYOND THE FIRST OR THIRD QUARTILE ARE COMMONLY FLAGGED AS
OUTLIERS, AND POINTS MORE THAN 3×IQR BEYOND THEM AS EXTREME OUTLIERS

IN OUR DATA SET, ALL FEATURE COLUMNS EXCEPT "ALCOHOL" SHOW OUTLIERS
LINEARITY
HISTOGRAM

• The "pH" column appears to be normally distributed

• All the remaining independent variables are right-skewed/positively skewed

Data Cleaning
The next step is to look for irregularities in
the data by doing some data exploration.

This phase involves finding the null values and either
replacing them with other values or dropping those rows
altogether.
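A sketch of this phase with pandas, showing both options (same assumed file as before):

```python
import pandas as pd

df = pd.read_csv("winequality-white.csv", sep=";")   # assumed file name and separator

print(df.isnull().sum())             # find the null values per column

df_filled = df.fillna(df.median())   # option 1: replace nulls with another value (here the median)
df_dropped = df.dropna()             # option 2: drop the rows that contain nulls
```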

After we are done cleaning, we can move ahead and make some visualizations
to understand the relationships between various aspects of our dataset.

Based on our analysis, we can draw conclusions and provide data-driven insights
into the problem.
