Unit 2 - DA - Statistical Concepts
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
In the field of data, there is nothing more important than understanding the data
that needs to be analyzed. In order to understand the data, it is important to
understand the purpose of the analysis because this will help to save time and
dictate how to go about analyzing the data.
Exploratory data analysis (EDA) can be classified as univariate, bivariate,
and multivariate analysis. EDA refers to the critical process of performing initial
investigations on data to discover patterns, spot anomalies, test hypotheses,
and check assumptions with the help of summary statistics and graphical
representations.
So, the goal is to present data in a form that makes sense to people. Tools that
are used to do this include:
Graphs: bar charts, pie charts, histograms, scatter charts, and time series
graphs.
Numerical summary measures: counts, percentages, averages, and
measures of variability.
Tables of summary measures: totals, averages, and counts, grouped by
categories.
Exploratory Data Analysis cont…
Raw data are not very informative. Exploratory Data Analysis (EDA)
is how we make sense of the data by converting them from their raw
form to a more informative one.
In particular, EDA consists of:
Organizing and summarizing the raw data
Discovering important features and patterns in the data and any
striking deviations from those patterns, and then
Interpreting our findings in the context of the problem
Usefulness:
Describing the distribution of a single variable (center, spread,
shape, outliers)
Checking data (for errors or other problems)
Checking assumptions underlying more complex statistical analyses
Investigating relationships between variables
Univariate data and its analysis
This type of data consists of only one variable. A variable is a characteristic that
can be measured and that can assume different values. Height, age, income,
province or country of birth, grades obtained at school and type of housing are all
examples of variables.
The analysis of univariate data is thus the simplest form of analysis since the
information deals with only one quantity that changes.
It does not deal with causes or relationships and the main purpose of the
analysis is to describe the data and find patterns that exist within it.
An example of univariate data is height (in cm):
Height: 164, 168, 170, 169, 173, 175, 180, 175, 176
Suppose that the heights of nine students of a class are recorded; there is only
one variable, height, and it does not deal with any cause or relationship.
The description of patterns found in this type of data can be made by drawing
conclusions using central tendency measures (mean, median and mode),
dispersion or spread of data (range, minimum, maximum, quartiles, variance
and standard deviation) and by using frequency distribution tables, histograms,
pie charts, frequency polygons and bar charts.
Bivariate data and its analysis
Population
A KIIT administrator wants to analyse the final exam scores of all
graduating students to see if there is a trend. Since they are interested in
applying their findings to all graduating students at KIIT University, they
use the whole population dataset.
Sample
KIIT wants to study political attitudes among students. The KIIT population is
around 30,000 undergraduate students. Because it is not practical to
collect data from all of them, one may use a sample of 300
undergraduate volunteers from different schools who meet the inclusion
criteria. This is the group expected to take part in the survey.
The tables below illustrate variables, observations and types of data for two
categorical variables, marital status and gender.
Frequency counts: Married 6866, Single 7193, Female 7170, Male 6889.
Row marginals (proportion of each gender within each marital status):
             Female    Male
Married      0.52      0.48
Single       0.49      0.51
Bar chart and Pie chart are most often used to visualize categorical variables.
Another efficient way to find counts for a categorical variable is to use dummy
(0–1) variables.
Recode each variable so that one category is replaced by 1 and all others by
0.
Find the count of that category by summing the 0s and 1s.
Find the percentage of that category by averaging the 0s and 1s.
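A minimal Python sketch of this counting trick, using made-up marital-status
responses (the data values here are assumptions for illustration):

    # Hypothetical marital-status responses for illustration.
    status = ["Married", "Single", "Single", "Married", "Single",
              "Married", "Single", "Single"]

    # Recode: 1 where the category of interest appears, 0 elsewhere.
    dummy = [1 if s == "Single" else 0 for s in status]

    count_single = sum(dummy)              # count = sum of the 0-1 values
    pct_single = sum(dummy) / len(dummy)   # percentage = average of the 0-1 values

    print(count_single, pct_single)        # 5 0.625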
Definition:
k = the kth percentile. It may or may not be part of the data.
i = the index (ranking or position of a data value)
n = the total number of data values
To calculate percentile:
1. Order the data from smallest to largest.
2. Calculate i = (k / 100) * (n + 1)
3. If i is an integer, then the kth percentile is the data value in the ith position in
the ordered set of data. If i is not an integer, then round i up and round i down to
the nearest integers. Average the two data values in these two positions in the
ordered data set.
Listed are 29 ages for academy award winning best actors in order from
smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64;
67; 69; 71; 72; 73; 74; 76; 77
Find the 70th percentile:
i = (k / 100) * (n + 1) = (70 / 100) * (29 + 1) = 21
21 is an integer, and the data value in the 21st position in the ordered data set is
64. The 70th percentile is 64 years.
Find the 83rd percentile:
i = (k / 100) * (n + 1) = (83 / 100) * (29 + 1) = 24.9 which is not an integer.
Round it down to 24 and up to 25. The age in the 24th position is 71 and the age
in the 25th position is 72. Average 71 and 72. The 83rd percentile is 71.5 years.
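The three-step rule above translates directly into a short Python function. A
minimal sketch (the function name is my own choice, and no guard is added for
values of k so extreme that the position falls outside the data):

    def percentile(data, k):
        # kth percentile using the (n + 1)-position rule described above.
        xs = sorted(data)                 # Step 1: order the data
        n = len(xs)
        i = (k / 100) * (n + 1)           # Step 2: compute the index
        if i == int(i):                   # Step 3a: integer position
            return xs[int(i) - 1]         # -1 because Python lists start at 0
        lo, hi = int(i), int(i) + 1       # Step 3b: average the two neighbours
        return (xs[lo - 1] + xs[hi - 1]) / 2

    ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
            52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

    print(percentile(ages, 70))   # 64
    print(percentile(ages, 83))   # 71.5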
The sample standard deviation, denoted by s, is the square root of the sample
variance.
Mean absolute deviation (MAD): It is the average distance between each data point
and the mean.
The heights (at the shoulders) are: 600 mm, 470 mm, 170 mm, 430 mm and
300 mm. Find out the range, the mean, the variance, and the standard deviation.
The mean is (600 + 470 + 170 + 430 + 300) / 5 = 394 mm, and the range is
600 − 170 = 430 mm. The deviations from the mean are 206, 76, −224, 36 and −94.
To calculate the variance, take each deviation, square it, and then average the result:
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108520 / 5 = 21704
The standard deviation is just the square root of the variance, so:
σ = √21704 = 147.32... ≈ 147
The standard deviation is useful: it shows which heights lie within one
standard deviation (147 mm) of the mean. So, using the standard deviation,
there is a "standard" way of knowing what is normal, and what is extra large
or extra small. Rottweilers are tall dogs and Dachshunds are a bit short, right?
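The whole calculation can be checked with a few lines of Python (a sketch using
the population formulas, as in the worked example above):

    heights = [600, 470, 170, 430, 300]   # shoulder heights in mm

    n = len(heights)
    mean = sum(heights) / n               # 394 mm
    rng = max(heights) - min(heights)     # 600 - 170 = 430 mm

    # Population variance: average of the squared deviations from the mean.
    variance = sum((h - mean) ** 2 for h in heights) / n   # 21704
    std_dev = variance ** 0.5                              # ~147.32

    print(mean, rng, variance, round(std_dev, 2))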
Q1. Consider your travel time in minutes from the hostel to the library: 15, 29, 8,
42, 35, 21, 18, 42, 26. Calculate:
Mean
Median
Mode
Standard deviation
Variance
Range
Quartiles
Sample size
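One way to check your answers is with Python's statistics module (note that
quartile conventions vary between textbooks and software, so st.quantiles may
differ slightly from a hand calculation):

    import statistics as st

    times = [15, 29, 8, 42, 35, 21, 18, 42, 26]   # travel time in minutes

    print("mean     :", st.mean(times))
    print("median   :", st.median(times))
    print("mode     :", st.mode(times))
    print("stdev    :", st.stdev(times))      # sample standard deviation
    print("variance :", st.variance(times))   # sample variance
    print("range    :", max(times) - min(times))
    print("quartiles:", st.quantiles(times, n=4))
    print("n        :", len(times))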
Q2. The following dollar amounts were the hourly collections from a Salvation
Army kettle at a local store one day in December: $12, $12, $12, $12, $12,
$12, $12, $12, $12, $12, $12, and $12. Determine the Interquartile Range for
the amount collected.
Normal Distribution
[Figure: histogram showing the shape of the distribution]
Skewed to the Right: Data that are skewed to the right have a long tail that
extends to the right. In this situation, the mean and the median are both
greater than the mode.
Skewed to the Left: Data that are skewed to the left have a long tail that
extends to the left. In this situation, the mean and the median are both less
than the mode.
Using the data from the example above (12, 13, 54, 56, 25), determine the
kurtosis, as sketched below.
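A possible way to compute it, using scipy (an assumption, since the original
notes do not name a tool; note that scipy reports excess kurtosis by default, so
fisher=False is needed for the Pearson definition where a normal distribution
has kurtosis 3):

    from scipy.stats import kurtosis

    data = [12, 13, 54, 56, 25]

    # fisher=False -> Pearson kurtosis (normal = 3);
    # the default fisher=True -> excess kurtosis (normal = 0).
    print("kurtosis (Pearson):", kurtosis(data, fisher=False))
    print("excess kurtosis   :", kurtosis(data))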
Two lines (called whiskers) outside the box extend to the smallest (i.e.,
lower whisker) and largest (i.e., upper whisker) observations. The upper
and lower whiskers can be understood as the boundaries of data, and any
data lying outside it are the outliers.
Lower whisker = Q1 − 1.5 × (Q3 − Q1)
Upper whisker = Q3 + 1.5 × (Q3 − Q1)
A z-score tells where a data point lies on a distribution curve, i.e., how far
from the mean the data point is:
z = (xi − x̄) / s
where xi is the respective data point, x̄ is the mean and s is the
standard deviation.
Usually a z-score of 3 is taken as the cut-off value to set the limit. Therefore,
any z-score greater than +3 or less than −3 is considered an outlier.
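A short sketch of this rule in Python (the data values are hypothetical; note
that with very small samples no point can ever reach |z| = 3, so the rule is
meaningful only for reasonably large data sets):

    import statistics as st

    def zscore_outliers(data, cutoff=3.0):
        # Flag points whose |z-score| exceeds the cutoff.
        mean = st.mean(data)
        s = st.stdev(data)                  # sample standard deviation
        return [x for x in data if abs((x - mean) / s) > cutoff]

    # Hypothetical data with one extreme value.
    values = [10, 12] * 9 + [11, 95]
    print(zscore_outliers(values))          # [95]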
Most real data sets have gaps in the data. There are two issues: how to
detect these missing values and what to do about them.
Missing values occur when no data value is stored for the variable in an
observation. Missing data are a common occurrence and can have a significant
effect on the conclusions that can be drawn from the data.
Missing Completely At Random
In MCAR, the probability of data being missing is the same for all the
observations. In this case, there is no relationship between the
missing data and any other values observed or unobserved (the data
which is not recorded) within the given dataset. That is, missing
values are completely independent of other data. There is no
pattern.
In the case of MCAR data, the value could be missing due to human
error, some system/equipment failure, loss of sample, or some
unsatisfactory technicalities while recording the values.
For Example, suppose in a library there are some overdue books.
Some values of overdue books in the computer system are missing.
The reason might be a human error, like the librarian forgetting to
type in the values. So, the missing values of overdue books are not
related to any other variable/data in the system. MCAR should not be
assumed by default, as it is a rare case in practice.
Missing At Random
MAR data means that the reason for missing values can be explained
by variables on which one has complete information, as there is
some relationship between the missing data and other values/data.
In this case, the data is not missing for all the observations.
It is missing only within sub-samples of the data, and there is some
pattern in the missing values.
For example, in the survey data, one may find that all the people
have answered their ‘Gender,’ but ‘Age’ values are mostly missing for
people who have answered their ‘Gender’ as ‘female.’ (The reason
being most of the females don’t want to reveal their age.)
So, the probability of data being missing depends only on the
observed value or data. In this case, the variables ‘Gender’ and ‘Age’
are related. The reason for missing values of the ‘Age’ variable can be
explained by the ‘Gender’ variable, but one cannot predict the
missing value itself.
Missing Not At Random
If the missing data does not fall under the MCAR or MAR, it can be
categorized as MNAR. It can happen due to the reluctance of people
to provide the required information. A specific group of respondents
may not answer some questions in a survey.
For example, suppose the name and the number of overdue books
are asked in the poll for a library. So most of the people having no
overdue books are likely to answer the poll. People having more
overdue books are less likely to answer the poll. So, in this case, the
missingness of the number of overdue books depends on that very value:
people with more overdue books are less likely to report it.
Another example is that people having less income may refuse to
share some information in a survey or questionnaire.
The middle table indicates that only 6% of the nondrinkers are heavy
smokers, whereas 31% of the heavy drinkers are heavy smokers.
Similarly, the bottom table indicates that 43.1% of the nonsmokers are
nondrinkers, whereas only 11.3% of the heavy smokers are nondrinkers.
In short, these tables indicate that smoking and drinking habits tend to go
with one another.
These tendencies are reinforced by the column charts of the two percentage
tables.
Correlation is a technique that can show whether and how strongly pairs of
variables are related. For example, height and weight are related; taller people
tend to be heavier than shorter people. The relationship isn't perfect. People of
the same height vary in weight. Nonetheless, the average weight of people 5'5''
is less than the average weight of people 5'6'', and their average weight is less
than that of people 5'7'', etc. Correlation can tell you just how much of the
variation in peoples' weights is related to their heights. The main result of a
correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to
+1.0. The closer r is to +1 or -1, the more closely the two variables are related.
Sales: 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408
Let us call the two sets of data "x" and "y" (in our case Temperature is x and Ice
Cream Sales is y).
Step 1: Find the mean of x, and the mean of y.
Step 2: Subtract the mean of x from every x value (call them "a"); do the
same for y (call them "b").
Step 3: Calculate a×b, a² and b² for every value.
Step 4: Sum up a×b, sum up a² and sum up b².
Step 5: Divide the sum of a×b by the square root of [(sum of a²) × (sum of
b²)].
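The five steps map directly onto a few lines of Python. A sketch (the
temperature values are made up for illustration, since only the sales row
survives in these notes):

    def pearson_r(x, y):
        # Pearson correlation following the five steps above.
        n = len(x)
        mean_x = sum(x) / n                             # Step 1
        mean_y = sum(y) / n
        a = [xi - mean_x for xi in x]                   # Step 2
        b = [yi - mean_y for yi in y]
        sum_ab = sum(ai * bi for ai, bi in zip(a, b))   # Steps 3 and 4
        sum_a2 = sum(ai ** 2 for ai in a)
        sum_b2 = sum(bi ** 2 for bi in b)
        return sum_ab / (sum_a2 * sum_b2) ** 0.5        # Step 5

    temps = [14, 16, 12, 15, 19, 22, 19, 25, 23, 18, 20, 17]  # hypothetical x
    sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]
    print(round(pearson_r(temps, sales), 3))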
ρ(X,Y) = Cov(X,Y) / (σX × σY)
where
ρ(X,Y) – the correlation between the variables X and Y
Cov(X,Y) – the covariance between the variables X and Y
σX – the standard deviation of the X-variable
σY – the standard deviation of the Y-variable
The sample covariance is Cov(X,Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)
where:
Xi – the values of the X-variable
Yi – the values of the Y-variable
X̄ – the mean (average) of the X-variable
Ȳ – the mean (average) of the Y-variable
n – the number of data points
The prices of ABC Corp. and the S&P 500 are as follows. Find the covariance.
Year    S&P 500    ABC Corp
2013    1692       68
2014    1978       102
2015    1884       110
2016    2151       112
2017    2519       154
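A quick check of the covariance in Python, using the sample-covariance
formula (divide by n − 1) given above:

    sp500 = [1692, 1978, 1884, 2151, 2519]
    abc = [68, 102, 110, 112, 154]

    n = len(sp500)
    mean_sp = sum(sp500) / n    # 2044.8
    mean_abc = sum(abc) / n     # 109.2

    # Sample covariance: sum of products of deviations, divided by n - 1.
    cov = sum((s - mean_sp) * (a - mean_abc)
              for s, a in zip(sp500, abc)) / (n - 1)
    print(round(cov, 1))        # 9107.3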
Explore the relationship between the salaries of males and females in the
sample below by illustrating it with a box plot.
Convenience Sampling
It involves the sample being drawn from that part of the population
that is close to hand, i.e., easiest to access. This can be due to
geographical proximity, availability at a given time, or willingness to
participate.
Suppose one is researching public perception towards the city of
Seattle and determined that a sample of 100 people is sufficient to
answer a question. To collect the data, one can stand at a subway
station and approach passersby, asking them whether they want to
participate in the survey. One should continue to ask until the sample
size is reached.
Another example would be a new NGO wants to establish itself in 20
cities. It selects the top 20 cities to serve based on the proximity to
where they are based.
Snowball Sampling
New units are recruited by other units to form part of the sample.
Also known as chain sampling or network sampling, snowball
sampling begins with one or more study participants. It then continues
on the basis of referrals from those participants. This process continues
until you reach the desired sample size, or a saturation point.
The table (on the last page) shows that there are seven possible values of the
sample mean x̄ . The value x̄ =152 happens only one way (the rower weighing
152 pounds must be selected both times), as does the value x̄ =164, but the
other values happen more than one way, hence are more likely to be
observed than 152 and 164 are. Since the 16 samples are equally likely, we
obtain the probability distribution of the sample mean just by counting:
x̄      152      154      156      158      160      162      164
P(x̄)   1/16     2/16     3/16     4/16     3/16     2/16     1/16
       0.0625   0.125    0.1875   0.25     0.1875   0.125    0.0625
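The distribution in the table can be reproduced by enumerating all 16 samples
in Python. One assumption: the four rower weights are taken to be 152, 156,
160 and 164 pounds, which is consistent with the seven sample means above:

    from collections import Counter
    from fractions import Fraction
    from itertools import product

    weights = [152, 156, 160, 164]   # assumed rower weights (pounds)

    # All 16 equally likely ordered samples of size 2, with replacement.
    means = Counter((a + b) / 2 for a, b in product(weights, repeat=2))

    for m in sorted(means):
        print(m, Fraction(means[m], 16))   # e.g. 158.0 1/4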
Most often, the methods of finding the parameters of large populations are
unrealistic. For example, when finding the average age of kids attending
kindergarten, it is impossible to collect the exact age of every kindergarten kid
in the world. Instead, the point estimator is used to make an estimate of the
population parameter.
It is desirable for a point estimate to be
Consistent: the larger the sample size, the more accurate the estimate. For
the point estimator to be consistent, the expected value should move toward
the true value.
Unbiased: The expectation of the observed values of many samples equals
the corresponding population parameter i.e., the sample mean is an
unbiased estimator for the population mean.
Most efficient: The most efficient point estimator is the one with the
smallest variance. Generally, the efficiency of the estimator depends on the
distribution of the population. For example, in a normal distribution, the
mean is considered more efficient than the median, but the same does not
apply in asymmetrical distributions.
How to find Point Estimate?
Example 2: Calculate the best point estimate from the list of data: 15.22,
14.34, 18.12, 12.61, 15.61, 14.22, 19.41, 12.22, 17.12, 14.22, 12.91 and 18.12.
Solution: In such a case, the sample mean (i.e., 15.34) is the best point estimate
for the population mean.
An interval is a range of values. Let’s say we wanted to find out the average
cigarette use of senior citizens. We can’t survey every senior citizen on the
planet (due to time constraints and finances), so we take a sample of 1000
senior citizens and find that 10% of them smoke cigarettes. Although we have
only taken a sample, we can use that figure to estimate that “about” 10% of the
whole population smoke cigarettes. In reality, it’s unlikely to be exactly 10% (as
we only sampled a small percentage of people), but it’s probably somewhere
around there, perhaps between 5 and 15%. That “somewhere between 5 and
15%” is an interval estimate.
There’s nothing wrong with making a good guess at an interval, but sometimes
we want to be very confident that our results are sound and repeatable.
“Repeatable” means that if we do the whole thing over again, we’ll get the same
results. One way to do this is to express a confidence level. Confidence levels
are percentages of certainty. For example, we might say we are 99% confident
(i.e., we have a 99% confidence level) that between 5 and 15% of senior
citizens smoke cigarettes. When the interval estimate has a confidence level
attached, it’s called a confidence interval.
Confidence Interval Estimation
The lower bound (in the example, 5%) is called a lower confidence limit and
the upper bound (in the example, 15%) is called an upper confidence limit.
The bigger the sample size, the narrower the confidence interval will be.
How to determine the lower and upper confidence limit?
Confidence limit = Mean ± Z-score × (Standard deviation / √Sample size)
where the z-score is a measure of how many standard deviations a value lies
below or above the population mean.
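Putting the pieces together, a minimal sketch with hypothetical numbers
(sample mean 394, standard deviation 147, n = 50, and z = 1.96 for a 95%
confidence level):

    import math

    mean, sd, n = 394, 147, 50
    z = 1.96                        # z-score for a 95% confidence level

    margin = z * sd / math.sqrt(n)
    lower, upper = mean - margin, mean + margin
    print(round(lower, 1), round(upper, 1))   # ~353.3 and ~434.7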
Recap: a sampling distribution
represents the probability of varied outcomes when a study is conducted.
There are 3 types of sampling distributions i.e.,
Sampling Distribution of Mean – [Discussed earlier]
T-Distribution
As the df increases, the t-distribution will get closer and closer to matching
the standard normal distribution.
The value of the t-statistic is: t = (x̄ − μ) / (s / √n), where
t = t score,
x̄ = sample mean,
μ = population mean,
s = standard deviation of the sample,
n = sample size
Note: A t-score is equivalent to the number of standard deviations away
from the mean of the t-distribution.
A law school claims its graduates earn an average of $300 per hour. A
sample of 15 graduates is selected and found to have a mean salary of $280
with a sample standard deviation of $50. Assuming the school’s claim is
true, what is the t-score?
Solution: t = (280 − 300) / (50 / √15) = −20 / 12.909945 = −1.549.
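The same arithmetic in Python:

    import math

    x_bar, mu, s, n = 280, 300, 50, 15   # sample mean, claim, std dev, n

    t = (x_bar - mu) / (s / math.sqrt(n))
    print(round(t, 3))                   # -1.549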
In today’s data-driven world, decisions are based on data all the time.
Hypotheses play a crucial role in that process, whether in making
business decisions, in the health sector, in academia, or in quality improvement.
Without hypotheses and hypothesis tests, you risk drawing the wrong conclusions
and making bad decisions.
Hypothesis testing is a type of statistical analysis in which assumptions
about a population parameter are put to the test. It is used to estimate the
relationship between variables.
Examples:
A faculty assumes that 60% of his students come from higher-middle-class
families.
A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for
diabetic patients.
It involves setting up a null hypothesis and an alternative hypothesis. These
two hypotheses will always be mutually exclusive. This means that if the null
hypothesis is true then the alternative hypothesis is false and vice versa.
The null hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected.
Example:
Smokers are no more susceptible to heart disease than nonsmokers.
The new drug has a cure rate no higher than other drugs on the market.
H0 is the symbol for it, and it is pronounced H-naught.
Hypothesis testing is used to conclude if the null hypothesis can be rejected or not.
Suppose an experiment is conducted to check if girls are shorter than boys at the
age of 5. The null hypothesis will say that they are the same height.
The alternate hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null hypothesis.
It indicates that there is a statistical significance between two possible outcomes
and can be denoted as Ha.
For the above-mentioned example, the alternative hypothesis would be that girls
are shorter than boys at the age of 5.
The null hypothesis is usually the current thinking, or status quo. The alternative
hypothesis is usually the hypothesis to be proved. The burden of proof is on the
alternative hypothesis.
Null Hypothesis and Alternate Hypothesis cont…
How to write the null and alternative hypotheses - The only things to know are the
dependent variable (DV) and the independent variable (IV). To write the null
hypothesis and the alternative hypothesis, fill in the following sentence with the
variables: does the independent variable affect the dependent variable?
Null hypothesis (H0): IV does not affect DV.
Alternative Hypothesis (Ha): IV affects DV.
Characteristics of a Hypothesis
It has to be clear and accurate in order to look reliable.
It has to be specific.
There should be scope for further investigation and experiments.
It should be explained in simple language while retaining its
significance.
IVs and DVs must be included with the relationship between them.
Example: the effect of a newly passed bill on farmers' loans, with H0: there is
no significant effect of the new bill on farmers' loans. The main
intention is to check whether the new bill can affect farmers' loans in either
direction, an increase or a decrease.
Types of Error
                          Truth
Decision                  H0 is true        Ha is true
Reject H0                 Type I error      No error
Do not reject H0          No error          Type II error
The question, then, is how strong the evidence in favor of the alternative
hypothesis must be to reject the null hypothesis.
This is done by means of a p-value. The p-value is the probability of seeing
a random sample at least as extreme as the observed sample, given that the
null hypothesis is true. The smaller the p-value, the more evidence there is
in favor of the alternative hypothesis.
The p-values are expressed as decimals and can be converted into
percentages. For example, a p-value of 0.0237 is 2.37%, which means there is
a 2.37% chance of observing results at least this extreme if the null hypothesis
is true.
In the hypothesis test, if the value is:
A small p value (<=0.05), reject the null hypothesis.
A large p value (>0.05), do not reject the null hypothesis
The p-values are usually calculated using p-value tables, or calculated
automatically using statistical software like R, SPSS, Python etc.
Note: another way to decide the rejection region is with the z-score; this is
applicable when the sample size is greater than 30 (for smaller samples, the
t-distribution is used instead).
Hypothesis Testing Example
The test statistic is z = (x̄n − µ0) / (σ / √n),
where x̄n is the sample mean, µ0 is the population mean under the null hypothesis
(i.e., the mean to be tested), σ is the standard deviation, and n is the sample size.
Hypothesis Testing Numerical cont…
Using the data given in the equation we would have the following:
μ0 = 100, σ = 15, n = 30, x̄n = 140
Plugging the values into the formula:
z = (140 − 100) / (15 / √30) = 40 / 2.739 = 14.606
Step 4: Look up the value of z (called the critical value) from the statistical
table (the table is predefined and should be consulted).
From the table: the value is 1.96.
Step 5: Draw a conclusion
In this case the tested statistic value of z calculated is more than the critical
value obtained from statistical tables (i.e., 14.606 > 1.96). Therefore the null
hypothesis is rejected.
This means, from the question, that the medication administered does affect
intelligence.
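The whole test can be reproduced in a few lines of Python (the two-tailed
critical value 1.96 at α = 0.05 is taken from the example above):

    import math

    mu0, sigma, n, x_bar = 100, 15, 30, 140   # values from the example

    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    print(round(z, 3))          # 14.606

    critical = 1.96             # two-tailed critical value at alpha = 0.05
    print(abs(z) > critical)    # True -> reject the null hypothesis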
Let’s take a closer look at the movie snacks example. Suppose we collect
data for 600 people at our theater. For each person, we know the type of
movie they saw and whether or not they bought snacks.
For the valid Chi-square test, the following conditions to be satisfied:
1. Data values that are a simple random sample from the population of
interest.
2. Two categorical or nominal variables.
3. For each combination of the levels of the two variables, we need at
least five expected values. When we have fewer than five for any one
combination, the test results are not reliable. To confirm this, we need
to know the total counts for each type of movie and the total counts for
whether snacks were bought or not. For now, we assume we meet this
requirement and will check it later.
Lastly, to get our test statistic, we add the numbers in the final row for each
cell: 3.46 + 3.75 + 5.81 + 6.21 + 12.65 + 13.52 + 9.14 + 10.70 = 65.24
Now, we need to find the critical value from the Chi-square distribution based
on degrees of freedom and significance level. This is the value to expect if the
two variables are independent.
The degrees of freedom depend on how many rows and how many columns
we have. The degrees of freedom (df) are calculated as df=(r−1)×(c−1) where
r is the number of rows, and c is the number of columns in the contingency
table. From the example, r is 4 and c is 2. Hence, df = (4−1)×(2−1)=3×1=3.
The Chi-square value with α = 0.05 (it is given and represents the probability
of rejecting the null hypothesis when it is true) and three degrees of freedom
is 7.815. Note: this value of 7.815 is to be read from the Chi-square
distribution table. Refer to the Appendix for further details.
We compare the value of our test statistic (65.24) to the Chi-square value.
Since 65.24 > 7.815, we reject the idea that movie type and snack purchases
are independent.
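The critical value and the decision can be checked with scipy (a sketch; the
test statistic 65.24 is the sum of cell contributions computed above):

    from scipy.stats import chi2

    statistic = 65.24            # sum of the cell contributions above
    df = (4 - 1) * (2 - 1)       # (rows - 1) x (columns - 1) = 3
    alpha = 0.05

    critical = chi2.ppf(1 - alpha, df)
    print(round(critical, 3))    # 7.815
    print(statistic > critical)  # True -> movie type and snacks are not independent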