EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSIS REFERS TO THE CRITICAL
PROCESS OF PERFORMING INITIAL INVESTIGATIONS ON
DATA SO AS
TO DISCOVER PATTERNS
TO SPOT ANOMALIES
TO TEST HYPOTHESES
TO CHECK ASSUMPTIONS
WITH THE HELP OF SUMMARY STATISTICS AND
GRAPHICAL REPRESENTATIONS
TO UNDERSTAND THE DATA AND GATHER AS MANY
INSIGHTS FROM IT AS POSSIBLE
DATA LEVELS
NOMINAL(QUALITATIVE)
DATA LABELLED WITH WORDS, LETTERS, SYMBOLS ETC., WITH NO INHERENT ORDER
ORDINAL(QUALITATIVE)
DATA PLACED IN AN ORDER ACCORDING TO THEIR RANK
INTERVAL(QUANTITATIVE)
ORDERED DATA WITH EQUAL INTERVALS BETWEEN VALUES BUT NO TRUE ZERO
RATIO(QUANTITATIVE)
SAME AS INTERVAL, EXCEPT THAT IT HAS A TRUE ZERO VALUE
RANDOM VARIABLE:
A FUNCTION THAT ASSIGNS A NUMERICAL VALUE TO EACH OUTCOME OF A RANDOM EXPERIMENT
DISCRETE(TOSSING A COIN)
CONTINUOUS(FLOATING INTEREST RATE)
PROBABILITY
IT IS THE CHANCE OF SOME EVENT HAPPENING.
DISCRETE PROBABILITY(THE RANDOM VARIABLE TAKES COUNTABLE VALUES THAT CAN BE LISTED, LIKE 1,2,3,4)
CONTINUOUS PROBABILITY(THE RANDOM VARIABLE CAN TAKE ANY VALUE WITHIN A RANGE)
One number that best summarizes the entire set of measurements
A number that is in some way “central” to the set
MEAN /AVERAGE
MEAN OR AVERAGE IS A MEASURE OF CENTRAL TENDENCY: THE SUM OF ALL VALUES DIVIDED BY THE NUMBER OF VALUES
x̄ = (12+24+41+51+67+67+85+99) / 8 = 55.75
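As a minimal sketch, the mean worked out above can be checked in Python with the standard-library statistics module. The variable names are only illustrative.

```python
from statistics import mean

# Values from the worked example above
values = [12, 24, 41, 51, 67, 67, 85, 99]

# Sum of all values divided by the number of values
average = mean(values)
print(average)  # 55.75
```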
Median
Median is the value which divides the sorted data into 2 equal parts
Median will be the middle term, if the number of terms is odd
Median will be the average of the middle 2 terms, if the number of terms is even.
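The odd and even cases can be illustrated with a short sketch using the standard-library statistics module. The two lists are example values chosen here for illustration; median() sorts the data internally.

```python
from statistics import median

odd_count = [28, 25, 26, 27, 31, 32, 24]        # 7 terms: the middle term is the median
even_count = [12, 24, 41, 51, 67, 67, 85, 99]   # 8 terms: average of the two middle terms

print(median(odd_count))    # 27
print(median(even_count))   # 59.0
```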
Mode
It is that value which appears the most number of times in a data set. For
example, in the following data set, 0 appears the most number of times.
Therefore, it is the mode.
0,0,1,2,3,0,4,5,0
A data set with two modes is BIMODAL, one with three modes is TRIMODAL,
and one with several modes is MULTIMODAL
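A small sketch of finding the mode in Python, assuming Python 3.8 or later for multimode. The second list is a hypothetical bimodal example, not data from these slides.

```python
from statistics import mode, multimode

data = [0, 0, 1, 2, 3, 0, 4, 5, 0]
print(mode(data))            # 0

# Hypothetical data set in which two values are equally frequent (bimodal)
bimodal = [1, 1, 2, 3, 3]
print(multimode(bimodal))    # [1, 3]
```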
Variance
Variance is the measure of dispersion in a data set. In other words, it
measures how spread out a data set is. It is calculated by first finding the
deviation of each element in the data set from the mean, and then by
squaring it.
Variance is the average of all squared deviations.
Steps to Calculate Variance:
1. List elements of data set
The following are ages of students pursuing a Master’s degree:
Data set 1: 28,25,26,27,31,32,24
2. Calculate the mean
(28 + 25 + 26 + 27 + 31 + 32 + 24) / 7 = 27.57
3. Find the deviation from the mean for each data point
4. Square each deviation
5. The average of all squared deviations is the variance
To find it, add all squared deviations and divide the sum by the number of elements in the data set (n)
(0.1849 + 6.6049 + 2.4649 + 0.3249 + 11.7649 + 19.6249 + 12.7449) / 7
53.7143 / 7 = 7.6735
Now that we know how to calculate the measures of central tendency and
dispersion of a data set, let us look at how to find the same using Python.
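The variance steps for data set 1 can be sketched in plain Python as follows. This is one possible way to write it, using the population formula (divide by n) described in the steps; the variable names are illustrative.

```python
from statistics import fmean

ages = [28, 25, 26, 27, 31, 32, 24]          # step 1: data set 1

mean_age = fmean(ages)                        # step 2: mean
deviations = [x - mean_age for x in ages]     # step 3: deviation from the mean
squared = [d ** 2 for d in deviations]        # step 4: square each deviation
variance = sum(squared) / len(ages)           # step 5: average of squared deviations

print(round(mean_age, 2), round(variance, 4))
```

statistics.pvariance(ages) or numpy.var(ages) should give the same population variance in a single call.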
Standard Deviation
Standard deviation is the square root of variance.
Standard deviation tells about the concentration of the data around the mean
of the data set.
Standard deviation is inversely related to the concentration of the data
around the mean,
i.e., with high concentration, the standard deviation will be low, and vice
versa.
To find standard deviation using Python, we will use data set 2. The numbers
are listed below, and we already know the variance.
results = [3,3,3,5,6,1]
To calculate the standard deviation, use the built-in function “std” from
numpy.
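A minimal sketch of the numpy call on data set 2. Note that np.std uses the population formula (ddof=0) by default, which matches the definition of variance used in these slides.

```python
import numpy as np

results = [3, 3, 3, 5, 6, 1]   # data set 2

# np.std computes the population standard deviation by default (ddof=0),
# i.e. the square root of the average squared deviation from the mean
std_dev = np.std(results)
print(round(float(std_dev), 4))  # 1.6073
```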
RANGE
Difference between the upper and lower limits on a specific scale
Range=max-min
INTERQUARTILE RANGE
Sorted dataset is divided into four equal portions by the quartiles
Q1=25th percentile
Q2=50th percentile (median)
Q3=75th percentile
Interquartile range (IQR)=Q3-Q1
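As a sketch, range and IQR can be computed with numpy. The ages list is reused here as example data, and np.percentile interpolates linearly between data points by default, so other quartile conventions may give slightly different values.

```python
import numpy as np

ages = [28, 25, 26, 27, 31, 32, 24]

data_range = max(ages) - min(ages)                # Range = max - min
q1, q2, q3 = np.percentile(ages, [25, 50, 75])    # quartiles (linear interpolation by default)
iqr = q3 - q1                                     # IQR = Q3 - Q1

print(data_range, q1, q2, q3, iqr)
```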
Frequency
Frequency of an event: the number of times (“n”) the event occurred in an experiment
Frequencies are often graphically represented in histograms
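A frequency table can be sketched with the standard-library Counter. The list below is example data; a histogram is essentially a plot of these counts.

```python
from collections import Counter

observations = [0, 0, 1, 2, 3, 0, 4, 5, 0]

# Counter maps each distinct value to the number of times it occurred
frequency = Counter(observations)
for value, n in sorted(frequency.items()):
    print(value, n)
```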
NORMAL DISTRIBUTION
• HEIGHT OF PEOPLE
• MARKS ON A TEST
• BLOOD PRESSURE MEASUREMENT
• ERRORS IN MEASUREMENT
• SIZE OF THINGS
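SKEWNESS
• MEASURE OF ASYMMETRY OF A DISTRIBUTION
• SYMMETRICAL DISTRIBUTION HAS SKEWNESS OF ZERO
• NORMAL (SYMMETRIC), RIGHT TAIL (POSITIVE SKEW), LEFT TAIL (NEGATIVE SKEW)
KURTOSIS
• MEASURE OF TAIL HEAVINESS (OFTEN DESCRIBED AS PEAKEDNESS OR FLATNESS)
• PERFECT GAUSSIAN DISTRIBUTION HAS A KURTOSIS OF 3, I.E. AN EXCESS KURTOSIS (KURTOSIS - 3) OF ZERO
• MESOKURTIC, PLATYKURTIC, LEPTOKURTIC
• MESOKURTIC - KURTOSIS = 3 (EXCESS KURTOSIS IS ZERO)
• LEPTOKURTIC - KURTOSIS > 3
• PLATYKURTIC - KURTOSIS < 3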
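A rough sketch of both measures using the population (moment-based) formulas; excess kurtosis here means kurtosis minus 3. The helper names and sample lists are illustrative, and some libraries apply sample-size corrections, so their values can differ slightly.

```python
def central_moment(xs, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth central moment scaled by variance squared, minus 3."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # symmetric around 3
right_tailed = [1, 1, 1, 2, 10]    # one large value pulls the tail to the right

print(skewness(symmetric))          # zero for symmetric data
print(skewness(right_tailed))       # positive for a right tail
print(excess_kurtosis(symmetric))   # negative: flatter than a Gaussian
```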
STANDARD NORMAL DISTRIBUTION-Z SCORE
CONVERTS DATA INTO A DISTRIBUTION WHOSE MEAN IS
ZERO AND STANDARD DEVIATION IS ONE
Z SCORE=(X-µ)/σ
µ- MEAN OF DATA
σ - STANDARD DEVIATION OF DATA
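The Z score formula can be applied to a whole data set as in the sketch below. The list is example data, and the population standard deviation is assumed for σ.

```python
from statistics import fmean, pstdev

results = [3, 3, 3, 5, 6, 1]

mu = fmean(results)        # mean of the data
sigma = pstdev(results)    # population standard deviation of the data

# Z score = (x - mu) / sigma for every data point
z_scores = [(x - mu) / sigma for x in results]
print([round(z, 2) for z in z_scores])
```

After the conversion the Z scores have mean 0 and standard deviation 1.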
CONFIDENCE INTERVAL
• DESCRIBES THE AMOUNT OF UNCERTAINTY ASSOCIATED WITH A SAMPLE ESTIMATE OF A POPULATION
PARAMETER
• CONFIDENCE INTERVAL= POINT ESTIMATE ± MARGIN OF ERROR
• MARGIN OF ERROR= CRITICAL VALUE * STANDARD DEVIATION OF THE STATISTIC (STANDARD ERROR)
• CRITICAL VALUE AND STANDARD ERROR DEPEND ON THE TYPE OF DATA DISTRIBUTION
• Z-DISTRIBUTION
• T-DISTRIBUTION
Z DISTRIBUTION
Confidence Interval using Z distribution is
given by:
C.I = x̄ ± Z* × (σ / √n)
where
x̄ - point estimate or sample mean
Z* -critical value
σ – population standard deviation
n- sample size
C.I- confidence interval
α – significance level
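A minimal sketch of a Z-based interval, assuming a two-sided interval so that the critical value is taken at 1 - α/2. The function name and the numbers passed to it are hypothetical; the critical value comes from the standard library instead of a printed Z table.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, alpha=0.05):
    """Return (lower, upper) for the mean when the population sigma is known."""
    z_star = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    margin = z_star * sigma / sqrt(n)              # margin of error
    return sample_mean - margin, sample_mean + margin

# Hypothetical numbers: sample mean 50, population sigma 10, sample size 100
low, high = z_confidence_interval(sample_mean=50, sigma=10, n=100)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```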
Z-TABLE
T- DISTRIBUTION
Confidence Interval using T-distribution is given by:
C.I = x̄ ± t* × (s / √n)
where
x̄ - point estimate or sample mean
t* -critical value
s – sample standard deviation
n- sample size
C.I- confidence interval
α – significance level
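A sketch of a t-based interval, treating the seven ages as a sample (an assumption made here for illustration). The critical value 2.447 is read from a t table for 95% confidence and n - 1 = 6 degrees of freedom.

```python
from math import sqrt
from statistics import fmean, stdev

ages = [28, 25, 26, 27, 31, 32, 24]
n = len(ages)

x_bar = fmean(ages)        # point estimate (sample mean)
s = stdev(ages)            # sample standard deviation (divides by n - 1)
t_star = 2.447             # t-table value: 95% confidence, 6 degrees of freedom

margin = t_star * s / sqrt(n)
low, high = x_bar - margin, x_bar + margin
print(round(low, 2), round(high, 2))  # 24.8 30.34
```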
T TABLE
Outlier
A data point that lies outside the normal range of the data
May indicate bad data
Needs close attention, as it can have a large impact on the result
Can lead to wildly wrong estimations
Include or exclude based on the impact
Univariate Outlier – A univariate outlier is a data point that consists
of an extreme value on one variable.
Multivariate Outlier – A multivariate outlier is a combination of
unusual scores on at least two variables
CAUSES OF
OUTLIERS:
Data entry errors (human errors)
Measurement errors (instrument errors)
Experimental errors (data extraction or experiment planning/executing errors)
Intentional (dummy outliers made to test detection methods)
Data processing errors (data manipulation errors)
Sampling errors (extracting or mixing data from wrong or various sources)
Natural (not an error, novelties in data)
Detecting Outliers:
We can use visualization to detect outliers.
Various visualization methods, like box plot,
histogram and scatter plot, can be used
Data visualization
EDA EXPLAINED USING WINE QUALITY
DATASET
IMPORT NECESSARY LIBRARIES
• PANDAS
• NUMPY
• MATPLOTLIB
• SEABORN
IMPORT DATASET
USE PANDAS
CHECK FOR DIMENSION OF THE DATASET
• NUMBER OF DATA RECORDS AND FEATURES
CHECK FOR TYPE OF FEATURES AND MISSING VALUES
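The first checks can be sketched with pandas as below. The helper name overview, the tiny stand-in frame, and the file name and separator in the comment are assumptions for illustration, not taken from the slides.

```python
import pandas as pd

def overview(df: pd.DataFrame) -> dict:
    """Collect the basic checks: dimension, feature types, missing values."""
    return {
        "shape": df.shape,                            # (records, features)
        "dtypes": df.dtypes.astype(str).to_dict(),    # type of each feature
        "missing": df.isnull().sum().to_dict(),       # missing values per column
    }

# Tiny stand-in frame; with the real data this would be something like
# df = pd.read_csv("winequality-white.csv", sep=";")  # hypothetical path and separator
df = pd.DataFrame({"pH": [3.0, 3.3, None], "quality": [6, 6, 5]})
print(overview(df))
```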
EDA CONTD…
OBTAIN SUMMARY STATISTICS OF DATA
FIND CORRELATION OF FEATURES
USE SEABORN
CHECK FOR OUTLIERS AND VISUALISE DATA DISTRIBUTION
USE BOXPLOT
CHECK FOR NORMALITY AND SKEWNESS OF DATA
USE HISTOGRAM
DATASET PREVIEW
DIMENSION
DATASET COMPRISES 4898 OBSERVATIONS AND 12 CHARACTERISTICS,
OUT OF WHICH ONE IS THE DEPENDENT VARIABLE (QUALITY) AND THE REMAINING 11 ARE INDEPENDENT
VARIABLES
FEATURE TYPES AND MISSING VALUES
• Data has only float and integer values
• No variable column has null/missing values.
SUMMARY STATISTICS
•The median (shown as 50%, the 50th percentile, in the index column) is less than the mean in each column
•There is a notably large difference between the 75th percentile and max values of the predictors “residual sugar”, “free sulfur dioxide” and “total
sulfur dioxide”
•Thus observations 1 and 2 suggest that there are extreme values (outliers) in our data set
• Target variable/Dependent variable is discrete and categorical in nature
• “quality” score scale ranges from 1 to 10, where 1 is poor and 10 is the best
• Quality ratings 1, 2 and 10 are not given to any observation. The only scores obtained are between 3 and 9
• THE VALUE COUNTS SHOW THE NUMBER OF OBSERVATIONS FOR EACH QUALITY SCORE IN DESCENDING ORDER
• “QUALITY” HAS MOST VALUES CONCENTRATED IN THE CATEGORIES 5, 6 AND 7
• ONLY A FEW OBSERVATIONS ARE MADE FOR THE CATEGORIES 3 & 9
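Counts per quality score can be obtained with pandas value_counts, which sorts by count in descending order. The Series below is a small hypothetical stand-in for the “quality” column, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the "quality" column
quality = pd.Series([6, 5, 6, 7, 6, 5, 3, 9, 6, 7], name="quality")

counts = quality.value_counts()   # sorted by count, largest first
print(counts)
```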
NOW..LET’S EXPLORE DATA USING BEAUTIFUL GRAPHS
CORRELATION MATRIX-HEAT MAP -SEABORN
CORRELATION MATRIX WITH ANNOTATION
1)HERE WE CAN INFER THAT “DENSITY” HAS A STRONG POSITIVE CORRELATION WITH
“RESIDUAL SUGAR” WHEREAS IT HAS A STRONG NEGATIVE CORRELATION WITH “ALCOHOL”
2)“FREE SULFUR DIOXIDE” AND “CITRIC ACID” HAVE ALMOST NO CORRELATION WITH
“QUALITY”
3)SINCE THE CORRELATION IS CLOSE TO ZERO, WE CAN INFER THERE IS NO LINEAR RELATIONSHIP BETWEEN
THESE TWO PREDICTORS AND “QUALITY”
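A correlation matrix of this kind can be computed with pandas as sketched below. The numbers are invented stand-in values chosen to mimic the direction of the relationships described, not rows from the wine data.

```python
import pandas as pd

# Hypothetical stand-in values
df = pd.DataFrame({
    "residual sugar": [1.0, 5.0, 10.0, 15.0, 20.0],
    "density":        [0.990, 0.993, 0.996, 0.998, 1.001],
    "alcohol":        [12.5, 11.8, 10.9, 10.1, 9.2],
})

corr = df.corr()   # pairwise Pearson correlation between numeric columns
print(corr.round(2))
```

Passing the matrix to seaborn.heatmap(corr, annot=True) would draw it as an annotated heat map.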
DATA DISTRIBUTION
BOXPLOT
IN THE SIMPLEST BOX PLOT THE CENTRAL RECTANGLE SPANS THE FIRST QUARTILE TO THE
THIRD QUARTILE
(THE INTERQUARTILE RANGE OR IQR)
A SEGMENT INSIDE THE RECTANGLE SHOWS THE MEDIAN AND “WHISKERS” ABOVE AND
BELOW THE BOX SHOW THE LOCATIONS OF THE MINIMUM AND MAXIMUM
WHEN OUTLIERS ARE DRAWN SEPARATELY, POINTS 1.5×IQR OR MORE ABOVE THE THIRD QUARTILE OR
BELOW THE FIRST QUARTILE ARE COMMONLY FLAGGED AS SUSPECTED OUTLIERS, AND POINTS 3×IQR OR MORE
BEYOND THE QUARTILES AS EXTREME OUTLIERS
IN OUR DATA SET ALL FEATURE COLUMNS EXCEPT “ALCOHOL” SHOW OUTLIERS
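The IQR rule behind a box plot can be sketched numerically as follows. The function name, the multiplier parameter k and the sample values are illustrative; k=1.5 is a widely used default, and k=3 is a stricter threshold.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr   # outlier fences
    return values[(values < lower) | (values > upper)]

# Hypothetical measurements with one extreme value
sample = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(sample))         # [95.]
print(iqr_outliers(sample, k=3))    # [95.]
```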
NORMALITY AND SKEWNESS
HISTOGRAM
•“pH” column appears to be normally distributed
•All remaining independent variables are right skewed (positively skewed)
Data Cleaning
The next step is to look for irregularities in
the data by doing some data exploration.
This phase involves finding the null values and
either replacing them with other values or
dropping those rows altogether.
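Both options can be sketched with pandas. The frame below is a hypothetical example with missing values, and filling with the column median is only one possible replacement strategy.

```python
import pandas as pd

# Hypothetical frame with missing values
df = pd.DataFrame({"pH": [3.0, None, 3.4], "alcohol": [9.5, 10.1, None]})

print(df.isnull().sum())           # null values per column

filled = df.fillna(df.median())    # option 1: replace nulls with the column median
dropped = df.dropna()              # option 2: drop rows containing any null

print(filled)
print(dropped)
```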
After we are done cleaning, we can move ahead and make some visualizations
to understand the relationship between various aspects of our dataset.
Based on our analysis we can draw conclusions and provide data-driven
insights into the problem.