UNIT 1 Notes by ARUN JHAPATE
DESCRIPTIVE STATISTICS
Introduction
Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can
represent either an entire population or a sample of it. Descriptive statistics are broken down
into measures of central tendency and measures of variability (spread). Measures of central tendency
include the mean, median, and mode, while measures of variability include the standard deviation,
variance, the minimum and maximum values, and the kurtosis and skewness.
PROBABILITY DISTRIBUTIONS
6 Common Probability Distributions every data science professional
should know.
Example.
Suppose you are a teacher at a university. After checking assignments for a week, you graded all
the students. You gave these graded papers to a data entry operator at the university and told him
to create a spreadsheet containing the grades of all the students. But he stored only the grades
and not the corresponding students.
He made another blunder: in a hurry, he missed a couple of entries, and we have no idea whose
grades are missing. Let’s find a way to solve this.
One way is to visualize the grades and see if you can find a trend in the data.
The graph that you have plotted is called the frequency distribution of the data. You see that
there is a smooth curve-like structure that defines our data, but do you notice an anomaly? We
have an abnormally low frequency at a particular score range. So the best guess would be that the
missing values are the ones that would remove the dent in the distribution.
This is how you would try to solve a real-life problem using data analysis. For any Data
Scientist, a student or a practitioner, distribution is a must know concept. It provides the basis for
analytics and inferential statistics.
While the concept of probability gives us the mathematical calculations, distributions help us
actually visualize what’s happening underneath.
This article covers some important probability distributions, explained in a lucid as well as
comprehensive manner.
Note: This article assumes you have a basic knowledge of probability.
Types of Distributions
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution
Common Data Types
Before we jump into the explanation of distributions, let’s see what kind of data we can
encounter. The data can be discrete or continuous.
Discrete data, as the name suggests, can take only specified values. For example, when you roll
a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or 2.45.
Continuous data can take any value within a given range. The range may be finite or infinite.
For example, a girl’s weight or height, or the length of a road. The weight of a girl can be any
value, such as 54 kg, 54.5 kg, or 54.5436 kg.
Types of Distributions
1. Bernoulli Distribution
Let’s start with the easiest distribution, the Bernoulli distribution. It is actually easier to
understand than it sounds!
All you cricket junkies out there! At the beginning of any cricket match, how do you decide who
is going to bat or bowl? A toss! It all depends on whether you win or lose the toss, right? Let’s
say that if the toss results in a head, you win. Else, you lose. There’s no midway.
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and
a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with
the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.
Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only two possible
outcomes.
The probability mass function is given by $P(X = x) = p^x (1-p)^{1-x}$, where $x \in \{0, 1\}$.
It can also be written as
$P(X = x) = 1 - p$ if $x = 0$, and $P(X = x) = p$ if $x = 1$.
The probabilities of success and failure need not be equal, as in a fight between me and
Undertaker. He is pretty much certain to win, so in this case the probability of my success is
0.15 while the probability of my failure is 0.85.
Here, the probability of success (p) is not the same as the probability of failure, so the
Bernoulli distribution of our fight puts probability 0.15 on the value 1 and 0.85 on the value 0.
The expected value is exactly what it sounds like. If I punch you, I may expect you to punch me
back. Basically, the expected value of any distribution is the mean of the distribution. The
expected value of a random variable X from a Bernoulli distribution is found as follows:
$E(X) = 1 \cdot p + 0 \cdot (1 - p) = p$
There are many examples of the Bernoulli distribution, such as whether it is going to rain
tomorrow (rain denotes success and no rain denotes failure) or winning (success) versus losing
(failure) a game.
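The Bernoulli PMF and its expected value can be sketched in a few lines of Python. This is only an illustration: the function name `bernoulli_pmf` is made up for this note, and p = 0.15 is the success probability from the fight example above.

```python
def bernoulli_pmf(x, p):
    """Return P(X = x) for a Bernoulli(p) random variable; x must be 0 or 1."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 and 1")
    return p if x == 1 else 1 - p

p = 0.15  # probability of success in the fight example

# Expected value: sum of (value * probability) over both outcomes, which equals p.
expected_value = sum(x * bernoulli_pmf(x, p) for x in (0, 1))

print("P(X=1) =", bernoulli_pmf(1, p))
print("P(X=0) =", bernoulli_pmf(0, p))
print("E(X)   =", expected_value)
```

The sum over the two outcomes reduces to p, which is why the mean of a Bernoulli variable is simply its probability of success.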
2. Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are
equally likely and that is the basis of a uniform distribution. Unlike Bernoulli Distribution, all the
n number of possible outcomes of a uniform distribution are equally likely.
The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of
40 and a minimum of 10.
Let’s try calculating the probability that the daily sales will fall between 15 and 30.
The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5
Similarly, the probability that daily sales are greater than 20 is (40-20)*(1/(40-10)) = 0.667.
The standard uniform density has parameters a = 0 and b = 1, so the PDF for the standard uniform
density is given by:
$f(x) = 1$ for $0 \le x \le 1$, and $f(x) = 0$ otherwise.
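The flower shop probabilities can be reproduced with a small helper. This is a minimal sketch that treats daily sales as a continuous uniform variable on [10, 40]; the function name `uniform_prob` is an assumption of this note.

```python
def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b]; the interval is clipped to [a, b]."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)

a, b = 10, 40  # minimum and maximum daily sales from the example

p_15_to_30 = uniform_prob(15, 30, a, b)  # (30 - 15) / (40 - 10)
p_above_20 = uniform_prob(20, b, a, b)   # (40 - 20) / (40 - 10)

print(p_15_to_30, round(p_above_20, 3))
```

Because the density is flat, each probability is just the length of the interval of interest divided by the length of the whole range.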
3. Binomial Distribution
Let’s get back to cricket. Suppose that you won the toss today, and this counts as a successful
event. You toss again but lose this time. Winning a toss today does not mean that you will win
the toss tomorrow. Let’s assign a random variable, say X, to the number of times you won the
toss. What can be the possible values of X? It can be any whole number from 0 up to the number
of times you tossed the coin.
There are only two possible outcomes. Head denoting success and tail denoting failure.
Therefore, probability of getting a head = 0.5 and the probability of failure can be easily
computed as: q = 1- p = 0.5.
A distribution where only two outcomes are possible, such as success or failure, gain or loss, win
or lose and where the probability of success and failure is same for all the trials is called a
Binomial Distribution.
The outcomes need not be equally likely. Remember the example of a fight between me and
Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of
failure can be easily computed as q = 1 – 0.2 = 0.8.
Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss. An experiment with only two possible outcomes repeated n number
of times is called binomial. The parameters of a binomial distribution are n and p where n is the
total number of trials and p is the probability of success in each trial.
On the basis of the above explanation, the properties of a Binomial Distribution are
There are only two possible outcomes in a trial- either a success or a failure.
The probability of success and failure is same for all trials. (Trials are identical.)
A binomial distribution graph where the probability of success does not equal the probability of
failure looks like
Now, when probability of success = probability of failure, in such a situation the graph of
binomial distribution looks like
The mean and variance of a binomial distribution are given by:
Mean: $\mu = np$
Variance: $\sigma^2 = npq = np(1-p)$
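The binomial probabilities, and the fact that the mean is np and the variance is np(1-p), can be checked numerically. The sketch below assumes a hypothetical experiment of n = 10 fair tosses; `binomial_pmf` is an illustrative name, and `math.comb` is the standard-library binomial coefficient.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical: 10 tosses of a fair coin

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the probabilities.
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print("mean:", mean, "expected np =", n * p)
print("variance:", variance, "expected np(1-p) =", n * p * (1 - p))
```

Changing p to a value such as 0.2 gives a skewed set of probabilities, while p = 0.5 gives a symmetric one.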
4. Normal Distribution
Normal distribution represents the behavior of most of the situations in the universe (That is why
it’s called a “normal” distribution. I guess!). The large sum of (small) random variables often
turns out to be normally distributed, contributing to its widespread application. Any distribution
is known as Normal distribution if it has the following characteristics:
The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
Exactly half of the values are to the left of the center and the other half to the right.
A normal distribution is highly different from Binomial Distribution. However, if the number of
trials approaches infinity then the shapes will be quite similar.
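The mean and variance of a random variable X that is normally distributed are given by:
$E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$
Here μ (the mean) and σ (the standard deviation) are the parameters, and the PDF is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$.
A standard normal distribution is defined as the distribution with mean 0 and standard deviation
1. For such a case, the PDF becomes:
$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$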
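The bell shape and the "half on each side" property can be checked with the standard library alone. In this sketch `normal_pdf` and `normal_cdf` are illustrative names; the CDF uses the well-known identity with the error function `math.erf`.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, written with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Symmetry about the mean: the density is the same one unit either side of 0.
print(normal_pdf(-1), normal_pdf(1))

# Exactly half of the probability lies to the left of the mean.
print(normal_cdf(0))
```

With the default arguments (mu = 0, sigma = 1) these functions describe the standard normal distribution.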
5. Poisson Distribution
Suppose you work at a call center. Approximately how many calls do you get in a day? It can be
any number. The total number of calls at a call center in a day can be modeled by a Poisson
distribution. Some more examples are
You can now think of many examples following the same course. Poisson Distribution is
applicable in situations where events occur at random points of time and space wherein our
interest lies only in the number of occurrences of the event.
A distribution is called Poisson distribution when the following assumptions are valid:
1. Any successful event should not influence the outcome of another successful event.
2. The probability of success over a short interval must equal the probability of success over a
longer interval.
3. The probability of success in an interval approaches zero as the interval becomes smaller.
Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some
notations used in Poisson distribution are:
λ is the rate at which an event occurs,
t is the length of a time interval,
and X is the number of events in that time interval.
Here, X is called a Poisson random variable and the probability distribution of X is called
the Poisson distribution.
Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.
The mean µ is the parameter of this distribution; it equals λ times the length of the
interval. The graph of a Poisson distribution is shown below:
The graph shown below illustrates the shift in the curve due to increase in mean.
CS- 503 (A) Data Analytics Notes By –ARUN KUMAR JHAPATE, CSE, SIRT, BHOPAL, MP
It is perceptible that as the mean increases, the curve shifts to the right.
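A short sketch can make the role of µ = λ*t concrete. It uses the standard Poisson PMF, $P(X = x) = e^{-\mu}\mu^x / x!$, which is a textbook formula and is not printed in these notes. The rate of 2 calls per hour and the 3-hour interval are hypothetical values chosen only for illustration.

```python
import math

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson random variable with mean mu."""
    return math.exp(-mu) * mu**x / math.factorial(x)

rate = 2.0     # hypothetical: lambda = 2 calls per hour
t = 3.0        # hypothetical: interval of 3 hours
mu = rate * t  # mean number of calls in the interval

# Probabilities for 0..59 calls; the tail beyond that is negligible for mu = 6.
probs = [poisson_pmf(x, mu) for x in range(60)]
mean = sum(x * px for x, px in enumerate(probs))

print("P(X = 4) =", round(poisson_pmf(4, mu), 4))
print("mean from the probabilities:", round(mean, 6))
```

Re-running with a larger µ moves the bulk of the probabilities toward larger counts, which is the rightward shift described above.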
6. Exponential Distribution
Let’s consider the call center example one more time. What about the interval of time between
the calls ? Here, exponential distribution comes to our rescue. Exponential distribution models
the interval of time between the calls.
Exponential distribution is widely used for survival analysis. From the expected life of a machine
to the expected life of a human, exponential distribution successfully delivers the result.
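$f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and $f(x) = 0$ otherwise, where the parameter $\lambda > 0$ is called the rate.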
For survival analysis, λ is called the failure rate of a device at any time t, given that it has
survived up to t.
Also, the greater the rate, the faster the curve drops and the lower the rate, flatter the curve. This
is explained better with the graph shown below.
$P\{X > x\} = e^{-\lambda x}$ corresponds to the area under the density curve to the right of x.
$P\{x_1 < X \le x_2\} = e^{-\lambda x_1} - e^{-\lambda x_2}$ corresponds to the area under the density curve between $x_1$ and $x_2$.
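The two area formulas above translate directly into code. In this sketch the function names are illustrative, and the rate of 0.5 calls per minute is a hypothetical value.

```python
import math

def exp_survival(x, lam):
    """P(X > x) for an exponential distribution with rate lam."""
    return math.exp(-lam * x) if x >= 0 else 1.0

def exp_between(x1, x2, lam):
    """P(x1 < X <= x2), the area under the density between x1 and x2."""
    return exp_survival(x1, lam) - exp_survival(x2, lam)

lam = 0.5  # hypothetical rate: 0.5 calls per minute

print("P(wait > 2 min)      =", round(exp_survival(2, lam), 4))
print("P(1 < wait <= 3 min) =", round(exp_between(1, 3, lam), 4))

# A larger rate makes the survival curve drop faster.
print(exp_survival(2, 2.0) < exp_survival(2, 0.5))
```

The last line illustrates the remark about the rate: with λ = 2.0 far less probability remains beyond x = 2 than with λ = 0.5.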
The "mean" is the "average" you're used to, where you add up all the numbers and then divide by
the number of numbers. The "median" is the "middle" value in the list of numbers. To find the
median, your numbers have to be listed in numerical order from smallest to largest, so you may
have to rewrite your list before you can find the median. The "mode" is the value that occurs
most often. If no number in the list is repeated, then there is no mode for the list.
Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
The mean is the usual average, so I'll add and then divide:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean, in this case, isn't a value from the original list. This is a common result. You
should not assume that your mean will be one of your original numbers.
The median is the middle value, so first I'll have to rewrite the list in numerical order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th
number:
13, 13, 13, 13, 14, 14, 16, 18, 21
So the median is 14.
The mode is the number that is repeated more often than any other, so 13 is the mode.
The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.
mean: 15
median: 14
mode: 13
range: 8
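The same four answers can be checked with Python's standard `statistics` module. This is only a quick verification of the worked example, using the list of values given above.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(values)           # sum of the values divided by their count
median = statistics.median(values)       # middle value of the sorted list
mode = statistics.mode(values)           # most frequently occurring value
value_range = max(values) - min(values)  # largest minus smallest

print("sorted:", sorted(values))
print("mean:", mean, "median:", median, "mode:", mode, "range:", value_range)
```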
REGRESSION
A regression analysis is a statistical procedure that allows you to make a prediction about an
outcome (or criterion) variable based on knowledge of some predictor variable. To create a
regression model, you first need to collect (a lot of) data on both variables, similar to what you
would do if you were conducting a correlation. Then you would determine the contribution of the
predictor variable to the outcome variable. Once you have the regression model, you would be
able to input an individual’s score on the predictor variable to get a prediction of their score on
the outcome variable.
- Example: You want to try to predict whether a student will come back for a second year
based on how many on-campus activities s/he attended. You would have to collect data on how
many activities students attended and then whether or not those students returned for a second
year. If activity attendance and retention are significantly related to each other, then you can
generate a regression model where you could identify at-risk students (in terms of retention)
based on how many activities they have attended.
- Example: You want to try to identify students who are at risk of failing College Algebra
based on their scores on a math assessment so you can direct them to special services on campus.
You would administer the math assessment at the start of the semester and then match each
student’s score on the math assessment to their final grade in the course. Eventually, your data
may show that the math assessment is significantly correlated to their final grade, and you can
create a regression model to identify those at-risk students so you can direct them to tutors and
other resources on campus.
You want to be able to make a prediction about an outcome given what you already know about
some related factor.
Another option with regression is to do a multiple regression, which allows you to make a
prediction about an outcome based on more than just one predictor variable. Many retention
models are essentially multiple regressions that consider factors such as GPA, level of
involvement, and attitude towards academics and learning.
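The simple linear regression equation is Y’ = a + bX, where Y’ is the predicted value of the
outcome variable, X is the predictor variable, a is the intercept, and b is the slope.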
Step 1: Make a chart of your data, filling in the columns in the same way as you would fill in the chart if you were
finding the Pearson’s Correlation Coefficient.
SUBJECT  AGE (X)  GLUCOSE LEVEL (Y)  XY  X²  Y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Σ 247 486 20485 11409 40022
From the above table, Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022. n is the sample size (6, in our
case).
Step 2: Use the following equations to find a and b.
$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2}$
$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$
Find a:
((486 × 11,409) – (247 × 20,485)) / (6(11,409) – 247²)
= 484,979 / 7,445
= 65.1416 ≈ 65.14
Find b:
(6(20,485) – (247 × 486)) / (6(11,409) – 247²)
= (122,910 – 120,042) / (68,454 – 61,009)
= 2,868 / 7,445
= .385225
Step 3: Insert the values into the equation.
y’ = a + bx
y’ = 65.14 + .385225x
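The intercept and slope can be recomputed from the raw age and glucose data as a check on the hand calculation. This is a minimal sketch using the least-squares formulas for a and b in terms of the column sums; the `predict` helper and its input age are hypothetical additions for illustration.

```python
ages = [43, 21, 25, 42, 57, 59]     # x values from the table
glucose = [99, 65, 79, 75, 87, 81]  # y values from the table

n = len(ages)
sum_x = sum(ages)
sum_y = sum(glucose)
sum_xy = sum(x * y for x, y in zip(ages, glucose))
sum_x2 = sum(x * x for x in ages)

# Both coefficients share the same denominator.
denominator = n * sum_x2 - sum_x**2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # intercept
b = (n * sum_xy - sum_x * sum_y) / denominator       # slope

def predict(age):
    """Predicted glucose level y' = a + b*x for a given age (illustrative helper)."""
    return a + b * age

print("a =", round(a, 4), "b =", round(b, 6))
print("predicted glucose at a hypothetical age of 50:", round(predict(50), 2))
```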
Formula-
In order to determine how strong the relationship is between two variables, a formula must be
followed to produce what is referred to as the coefficient value. The coefficient value can range
between -1.00 and 1.00. If the coefficient value is in the negative range, then that means the
relationship between the variables is negatively correlated, or as one value increases, the other
decreases. If the value is in the positive range, then that means the relationship between the
variables is positively correlated, or both values increase or decrease together. Let's look at the
formula for conducting the Pearson correlation coefficient value.
Step one: Make a chart with your data for two variables, labeling the variables (x) and (y), and
add three more columns labeled (xy), (x^2), and (y^2). A simple data chart might look like this:
Step-1: Complete the chart using basic multiplication of the variable values.
Person Age (x) Score (y) (xy) (x^2) (y^2)
1 20 30 600 400 900
2 24 20 480 576 400
Step-2: After you have multiplied all the values to complete the chart, add up all of the columns
from top to bottom.
Person Age (x) Score (y) (xy) (x^2) (y^2)
Step-3: Use this formula to find the Pearson correlation coefficient value:
$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$
Sample question: Find the value of the correlation coefficient from the following
table:
SUBJECT  AGE (X)  GLUCOSE LEVEL (Y)
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and
y².
SUBJECT  AGE (X)  GLUCOSE LEVEL (Y)  XY  X²  Y²
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would
be 43 × 99 = 4,257.
SUBJECT  AGE (X)  GLUCOSE LEVEL (Y)  XY  X²  Y²
1 43 99 4257
2 21 65 1365
3 25 79 1975
4 42 75 3150
5 57 87 4959
6 59 81 4779
Step 3: Take the square of the numbers in the x column, and put the result in the
x² column.
SUBJECT  AGE (X)  GLUCOSE LEVEL (Y)  XY  X²  Y²
1 43 99 4257 1849
2 21 65 1365 441
3 25 79 1975 625
4 42 75 3150 1764
5 57 87 4959 3249
6 59 81 4779 3481
Step 4: Take the square of the numbers in the y column, and put the result in the
y² column.
SUBJECT  AGE (X)  GLUCOSE LEVEL (Y)  XY  X²  Y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Step 5: Add up all of the numbers in the columns and put the result at the bottom
of the column. The Greek letter sigma (Σ) is a short way of saying “sum of.”
SUBJECT  AGE (X)  GLUCOSE LEVEL (Y)  XY  X²  Y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Σ 247 486 20485 11409 40022
Step 6: Use the following correlation coefficient formula.
Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case = 6
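The correlation coefficient is
$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$
= (6(20,485) – (247 × 486)) / √([6(11,409) – 247²] × [6(40,022) – 486²])
= 2,868 / √(7,445 × 3,936)
= 0.5298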
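The six steps can be reproduced in code as a check on the value 0.5298. This is a minimal sketch that builds the same column sums from the age and glucose data and then applies the Pearson formula with the standard library only.

```python
import math

ages = [43, 21, 25, 42, 57, 59]     # x
glucose = [99, 65, 79, 75, 87, 81]  # y

n = len(ages)
sum_x = sum(ages)
sum_y = sum(glucose)
sum_xy = sum(x * y for x, y in zip(ages, glucose))  # xy column total
sum_x2 = sum(x * x for x in ages)                   # x squared column total
sum_y2 = sum(y * y for y in glucose)                # y squared column total

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r = numerator / denominator

print("sums:", sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print("r =", round(r, 4))
```

A value of about 0.53 lies in the positive range, so glucose level tends to increase with age in this small sample.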
INFERENTIAL STATISTICS
What is the main purpose of inferential statistics?
The main purpose of inferential statistics is to:
a. Summarize data in a useful and informative manner.
b. Estimate a population characteristic based on a sample.
c. Determine if the data adequately represents the population.
Inferential statistics allows you to make inferences about the population from the sample data.
When you have quantitative data, you can analyze it using either descriptive or inferential
statistics. Descriptive statistics do exactly what they sound like: they describe the data.
Descriptive statistics include measures of central tendency (mean, median, mode), measures of
variation (standard deviation, variance), and relative position (quartiles, percentiles). There are
times, however, when you want to draw conclusions about the data. This may include making
comparisons across time, comparing different groups, or trying to make predictions based on
data that has been collected. Inferential statistics are used when you want to move beyond simple
description or characterization of your data and draw conclusions based on your data. There are
several kinds of inferential statistics that you can calculate; here are a few of the more common
types:
Suppose the null hypothesis is that a penny has a 50% chance of landing on heads. A random
sample of 100 coin flips is taken, and the null hypothesis is then tested. If the 100 coin
flips are distributed as 40 heads and 60 tails, and this departure from 50/50 is judged
statistically significant, the analyst would conclude that a penny does not have a 50%
chance of landing on heads, reject the null hypothesis, and accept the alternative
hypothesis. Afterward, a new hypothesis could be tested, this time that a penny has a 40%
chance of landing on heads.
1. The first step is for the analyst to state the two hypotheses so that only one can be
right.
2. The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.
3. The third step is to carry out the plan and physically analyze the sample data.
4. The fourth and final step is to analyze the results and either accept or reject the
null hypothesis.
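The four steps above can be sketched for the coin-flip example. This sketch makes two assumptions that the notes do not state: the analysis plan is an exact two-sided binomial test, and the significance level is 0.05. The function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(k, n, p0):
    """Exact two-sided p-value: total probability, under the null hypothesis,
    of all outcomes that are no more likely than the observed one."""
    observed = binom_pmf(k, n, p0)
    return sum(
        binom_pmf(i, n, p0)
        for i in range(n + 1)
        if binom_pmf(i, n, p0) <= observed * (1 + 1e-9)  # tolerance for ties
    )

# Step 1: state the hypotheses. H0: P(heads) = 0.5, H1: P(heads) != 0.5.
p0 = 0.5
# Step 2: analysis plan (assumed): exact binomial test at the 5% level.
alpha = 0.05
# Step 3: analyze the sample data (40 heads in 100 flips).
p_value = two_sided_p_value(40, 100, p0)
# Step 4: decide.
reject_null = p_value < alpha

print("p-value:", round(p_value, 4), "reject H0 at 5%:", reject_null)
```

Under these assumptions the p-value comes out slightly above 0.05, so whether 40 heads in 100 flips leads to rejecting the null hypothesis depends on the significance level fixed in the analysis plan; at a 10% level the same data would reject it.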
Analysis of variance (ANOVA) is a parametric statistical technique, and its use involves
certain key assumptions, including the following:
1. Independence of cases: The observations of the dependent variable should be independent
of each other, and the sample should be selected randomly.
There should not be any pattern in the selection of the sample.
2. Normality: The distribution of each group should be approximately normal.
3. Homogeneity: Homogeneity means that the variance of each group should be approximately the
same. Levene’s test is used to test the homogeneity of variance between groups.
If the data satisfy the above assumptions, then the analysis of variance (ANOVA) is
a suitable technique to compare the means of two or more populations.
One way analysis: When we are comparing three or more groups based on one factor
variable, it is said to be a one way analysis of variance (ANOVA). For example, we may want
to compare whether or not the mean output of three workers is the same based on the
working hours of the three workers.
Two way analysis: When there are two factor variables, it is said to be a two way
analysis of variance (ANOVA). For example, based on working condition and working
hours, we can compare whether or not the mean output of three workers is the same.
K-way analysis: When there are k factor variables, it is said to be a k-way analysis of
variance (ANOVA).
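A one way ANOVA compares the variation between group means with the variation inside the groups through the F statistic. The sketch below computes F by hand for three workers; the output figures are made-up illustrative data, and the function name `one_way_anova_f` is an assumption of this note. Turning F into a decision still requires an F table or a statistics library, which is not shown here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical daily output of three workers (illustrative numbers only).
worker_output = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

f_stat, df_b, df_w = one_way_anova_f(worker_output)
print("F =", f_stat, "with", df_b, "and", df_w, "degrees of freedom")
```

A large F means the group means differ by more than the within-group spread would suggest; a value near 1 means the groups look alike.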