UNIT 1 Notes by ARUN JHAPATE

This document discusses descriptive statistics and probability distributions. Descriptive statistics include measures of central tendency like mean, median, and mode, and measures of variability like standard deviation. Six common probability distributions are discussed: Bernoulli, uniform, binomial, normal, Poisson, and exponential. Examples are provided to illustrate key properties of each distribution like possible outcomes, probability mass or density functions, expected value, and variance. Understanding these fundamental probability distributions is important for data analysis and statistics.

Uploaded by

Ankit “अंकित मौर्य” Mourya


UNIT -1

DESCRIPTIVE STATISTICS

Introduction

Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can
represent either an entire population or a sample of it. Descriptive statistics are broken down
into measures of central tendency and measures of variability (spread). Measures of central tendency
include the mean, median, and mode, while measures of variability include the standard deviation,
variance, and the minimum and maximum values. Kurtosis and skewness, which describe the shape of
the distribution, are usually reported alongside them.

PROBABILITY DISTRIBUTIONS
6 Common Probability Distributions every data science professional
should know.
Example.

Suppose you are a teacher at a university. After checking assignments for a week, you graded all
the students. You gave these graded papers to a data entry operator at the university and told him to
create a spreadsheet containing the grades of all the students. But he stored only the grades
and not the corresponding students.

He made another blunder: in a hurry, he missed a couple of entries, and we have no idea whose
grades are missing. Let’s find a way to solve this.

One way is to visualize the grades and see if you can find a trend in the data.
The graph that you have plotted is called the frequency distribution of the data. You see that there is
a smooth curve-like structure that defines our data, but do you notice an anomaly? We have an
abnormally low frequency at a particular score range. So the best guess would be that the missing
values are the ones that would remove the dent in the distribution.

This is how you would try to solve a real-life problem using data analysis. For any Data
Scientist, a student or a practitioner, distribution is a must know concept. It provides the basis for
analytics and inferential statistics.

While the concept of probability gives us the mathematical calculations, distributions help us
actually visualize what’s happening underneath.

In this article, I have covered some important probability distributions which are explained in a
lucid as well as comprehensive manner.

Note: This article assumes you have a basic knowledge of probability. If not, review the basics
of probability before reading on.

Types of Distributions
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution

Common Data Types

Before we jump to the explanation of distributions, let’s see what kinds of data we can
encounter. The data can be discrete or continuous.

Discrete Data, as the name suggests, can take only specified values. For example, when you roll
a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or 2.45.

Continuous Data can take any value within a given range. The range may be finite or infinite.
Examples are a girl’s weight or height, or the length of a road. The weight of a girl can be any
value, such as 54 kg, 54.5 kg, or 54.5436 kg.

Now let us start with the types of distributions.

Types of Distributions

1. Bernoulli Distribution
Let’s start with the easiest distribution that is Bernoulli Distribution. It is actually easier to
understand than it sounds!

All you cricket junkies out there! At the beginning of any cricket match, how do you decide who
is going to bat or bowl? A toss! It all depends on whether you win or lose the toss, right? Let’s say
if the toss results in a head, you win. Else, you lose. There’s no midway.

A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and
a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with
the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.

Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only two possible
outcomes.

The probability mass function is given by: $P(X = x) = p^x (1-p)^{1-x}$, where $x \in \{0, 1\}$.
It can also be written as

$$P(X = x) = \begin{cases} 1 - p, & x = 0 \\ p, & x = 1 \end{cases}$$

The probabilities of success and failure need not be equally likely, like the result of a fight
between me and Undertaker. He is pretty much certain to win. So in this case probability of my
success is 0.15 while my failure is 0.85

Here, the probability of success (p) is not the same as the probability of failure: the probability
of success = 0.15 and the probability of failure = 0.85.

[Figure: Bernoulli distribution of our fight, with probability 0.85 at 0 and 0.15 at 1]

The expected value is exactly what it sounds like. If I punch you, I may expect you to punch me
back. Basically, the expected value of any distribution is the mean of the distribution. The
expected value of a random variable X from a Bernoulli distribution is found as follows:

E(X) = 1p + 0(1-p) = p

The variance of a random variable from a Bernoulli distribution is:

V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
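
As a quick check of the two formulas above, the following sketch computes the mean and variance directly from the probability mass function. The value p = 0.15 comes from the fight example; the function name `bernoulli_pmf` is an illustrative choice and not part of the original notes.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) for a Bernoulli(p) variable; x must be 0 or 1."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p if x == 1 else 1 - p


p = 0.15  # probability of success from the fight example

# E(X) = sum of x * P(X = x) over the two outcomes
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))

# V(X) = E(X^2) - [E(X)]^2
second_moment = sum(x**2 * bernoulli_pmf(x, p) for x in (0, 1))
variance = second_moment - mean**2

print(mean)      # should match p
print(variance)  # should match p * (1 - p)
```

Because X only takes the values 0 and 1, E(X²) equals E(X), which is why the variance collapses to p(1-p).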

There are many examples of the Bernoulli distribution, such as whether it is going to rain tomorrow
(rain denotes success and no rain denotes failure) or whether you win (success) or lose
(failure) a game.

2. Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. These outcomes are
equally likely, and that is the basis of a uniform distribution. Unlike the Bernoulli distribution, all
n possible outcomes of a uniform distribution are equally likely. The die is a discrete
example; for a continuous variable, the same idea is expressed with a density function.

A variable X is said to be uniformly distributed if the density function is:

$$f(x) = \frac{1}{b - a} \quad \text{for } -\infty < a \le x \le b < \infty$$

[Figure: graph of the uniform density, a horizontal line of height 1/(b-a) between a and b]

The shape of the uniform distribution curve is rectangular, which is why the
uniform distribution is also called the rectangular distribution.

For a Uniform Distribution, a and b are the parameters.

The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of
40 and a minimum of 10.

Let’s try calculating the probability that the daily sales will fall between 15 and 30.

The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5

Similarly, the probability that daily sales are greater than 20 is (40-20)*(1/(40-10)) = 0.667

The mean and variance of X following a uniform distribution is:

Mean -> E(X) = (a+b)/2

Variance -> V(X) = (b-a)²/12
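
The flower shop calculation can be reproduced with a few lines of Python. This is a minimal sketch, assuming a continuous uniform distribution on [10, 40]; the helper `uniform_prob` is a hypothetical name introduced here for illustration.

```python
def uniform_prob(lo: float, hi: float, a: float, b: float) -> float:
    """P(lo <= X <= hi) for X uniform on [a, b]; the interval is clipped to [a, b]."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)


a, b = 10, 40  # minimum and maximum daily sales from the example

p_15_30 = uniform_prob(15, 30, a, b)   # sales between 15 and 30
p_over_20 = uniform_prob(20, b, a, b)  # sales greater than 20

mean = (a + b) / 2            # E(X) = (a+b)/2
variance = (b - a) ** 2 / 12  # V(X) = (b-a)^2/12

print(p_15_30, round(p_over_20, 3), mean, variance)
```

The probability of any interval is just its length times the constant height 1/(b-a), which is the area of a rectangle.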

The standard uniform density has parameters a = 0 and b = 1, so the PDF for the standard uniform
density is given by:

$$f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$$

3. Binomial Distribution
Let’s get back to cricket. Suppose that you won the toss today, and this indicates a successful
event. You toss again, but you lose this time. Winning a toss today does not mean that
you will win the toss tomorrow. Let’s assign a random variable, say X, to the number of times
you won the toss. What can be the possible value of X? It can be any whole number from 0 up to the
number of times you tossed the coin.

There are only two possible outcomes: head denoting success and tail denoting failure.
Therefore, the probability of getting a head = 0.5, and the probability of failure can be easily
computed as: q = 1 - p = 0.5.

A distribution where only two outcomes are possible, such as success or failure, gain or loss, win
or lose, and where the probability of success and failure is the same for all the trials is called a
Binomial Distribution.

The outcomes need not be equally likely. Remember the example of a fight between me and
Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of
failure can be easily computed as q = 1 – 0.2 = 0.8.

Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss. An experiment with only two possible outcomes repeated n number
of times is called binomial. The parameters of a binomial distribution are n and p where n is the
total number of trials and p is the probability of success in each trial.

On the basis of the above explanation, the properties of a Binomial Distribution are

Each trial is independent.

There are only two possible outcomes in a trial- either a success or a failure.

A total number of n identical trials are conducted.

The probability of success and failure is same for all trials. (Trials are identical.)

The mathematical representation of the binomial distribution is given by:

$$P(X = x) = \frac{n!}{(n - x)!\,x!}\, p^x q^{n - x}, \quad x = 0, 1, \dots, n$$

[Figure: binomial distribution where the probability of success does not equal the probability of failure; the graph is skewed]

[Figure: binomial distribution where the probability of success equals the probability of failure; the graph is symmetric]

The mean and variance of a binomial distribution are given by:

Mean -> µ = n*p

Variance -> Var(X) = npq
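
The mean and variance formulas can be checked numerically by summing over the probability mass function. In this sketch p = 0.2 is taken from the text, while n = 10 trials is an assumed value chosen only for illustration.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x): probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.2  # n is an assumed number of trials; p = 0.2 as in the text
q = 1 - p

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                   # probabilities should sum to 1
mean = sum(x * px for x, px in enumerate(pmf))     # should match n*p
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))  # should match n*p*q

print(total, mean, variance)
```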

4. Normal Distribution
The normal distribution describes the behavior of many situations in the universe (that is why
it’s called a “normal” distribution, I guess!). The sum of a large number of small random variables often
turns out to be normally distributed, which contributes to its widespread application. A distribution
is known as a normal distribution if it has the following characteristics:

The mean, median and mode of the distribution coincide.

The curve of the distribution is bell-shaped and symmetrical about the line x=μ.

The total area under the curve is 1.

Exactly half of the values are to the left of the center and the other half to the right.

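A normal distribution is quite different from a binomial distribution. However, as the number of
trials approaches infinity, the shape of the binomial distribution becomes quite similar to that of
the normal distribution.

The PDF of a random variable X following a normal distribution is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < \infty$$

The mean and variance of a random variable X which is normally distributed are given
by: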

Mean -> E(X) = µ

Variance -> Var(X) = σ^2

Here, µ (mean) and σ (standard deviation) are the parameters.

[Figure: graph of a random variable X ~ N(µ, σ)]

A standard normal distribution is defined as the distribution with mean 0 and standard deviation
1. For such a case, the PDF becomes:

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad -\infty < x < \infty$$
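
The normal density and two of the listed characteristics (symmetry about the mean and total area 1) can be checked with a short sketch. The function name `normal_pdf`, the integration range [-8, 8], and the step size are assumptions made here for illustration.

```python
from math import exp, pi, sqrt


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma) at x; defaults give the standard normal."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


# Height of the standard normal curve at its center, 1 / sqrt(2*pi)
peak = normal_pdf(0.0)

# Simple Riemann-sum approximation of the area under the standard normal
# curve on [-8, 8]; the tails beyond this range are negligible.
step = 0.001
area = sum(normal_pdf(-8 + i * step) * step for i in range(16000))

# The curve is symmetric about x = mu
left, right = normal_pdf(3.0, mu=5.0, sigma=2.0), normal_pdf(7.0, mu=5.0, sigma=2.0)

print(peak, area, left, right)
```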

5. Poisson Distribution
Suppose you work at a call center. Approximately how many calls do you get in a day? It can be
any number. The total number of calls at a call center in a day is modeled by the Poisson
distribution. Some more examples are

The number of emergency calls recorded at a hospital in a day.

The number of thefts reported in an area on a day.

The number of customers arriving at a salon in an hour.

The number of suicides reported in a particular city.

The number of printing errors on each page of a book.

You can now think of many examples following the same course. Poisson Distribution is
applicable in situations where events occur at random points of time and space wherein our
interest lies only in the number of occurrences of the event.

A distribution is called a Poisson distribution when the following assumptions are valid:

1. Any successful event should not influence the outcome of another successful event.
2. The probability of success over an interval depends only on the length of the interval, so the
rate of occurrence is the same over a short interval as over a longer one.
3. The probability of success in an interval approaches zero as the interval becomes smaller.

Now, if a distribution satisfies the above assumptions, then it is a Poisson distribution. Some
notations used in Poisson distribution are:

λ is the rate at which an event occurs,

t is the length of a time interval,

And X is the number of events in that time interval.

Here, X is called a Poisson Random Variable and the probability distribution of X is called
Poisson distribution.

Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.

The PMF of X following a Poisson distribution is given by:

$$P(X = x) = \frac{e^{-\mu} \mu^x}{x!}, \quad x = 0, 1, 2, \dots$$

The mean µ is the parameter of this distribution. µ is also defined as λ times the length of the
interval.

[Figure: graph of a Poisson distribution]

[Figure: shift in the Poisson curve due to an increase in the mean]

CS- 503 (A) Data Analytics Notes By –ARUN KUMAR JHAPATE, CSE, SIRT, BHOPAL, MP

As the mean increases, the curve shifts to the right.

The mean and variance of X following a Poisson distribution:

Mean -> E(X) = µ

Variance -> Var(X) = µ
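
A short sketch can confirm that the mean and the variance of a Poisson variable are both equal to µ. The rate of 2 events per unit time and the interval length of 3 are assumed values, not taken from the notes, and the sum is truncated at 60 terms because the remaining tail is negligible for this µ.

```python
from math import exp, factorial


def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for a Poisson variable with mean mu."""
    return exp(-mu) * mu**x / factorial(x)


rate = 2.0     # assumed rate (lambda): events per unit of time
t = 3.0        # assumed length of the time interval
mu = rate * t  # mean number of events in the interval, mu = lambda * t

support = range(60)  # truncated support; the tail beyond 60 is negligible here
total = sum(poisson_pmf(x, mu) for x in support)
mean = sum(x * poisson_pmf(x, mu) for x in support)
variance = sum((x - mean) ** 2 * poisson_pmf(x, mu) for x in support)

print(total, mean, variance)
```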

6. Exponential Distribution
Let’s consider the call center example one more time. What about the interval of time between
the calls? Here, the exponential distribution comes to our rescue: it models
the interval of time between the calls.

Other examples are:

1. Length of time between metro arrivals
2. Length of time between arrivals at a gas station
3. The life of an air conditioner

The exponential distribution is widely used for survival analysis, from modeling the expected life
of a machine to the expected life of a human.

A random variable X is said to have an exponential distribution with PDF:

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & \text{otherwise} \end{cases}$$

and parameter λ > 0, which is also called the rate.

For survival analysis, λ is called the failure rate of a device at any time t, given that it has
survived up to t.

Mean and Variance of a random variable X following an exponential distribution:

Mean -> E(X) = 1/λ

Variance -> Var(X) = (1/λ)²

Also, the greater the rate, the faster the curve drops, and the lower the rate, the flatter the curve.

[Figure: exponential density curves for different values of the rate λ]

To ease the computation, there are some formulas given below.

$P(X \le x) = 1 - e^{-\lambda x}$, which corresponds to the area under the density curve to the left of x.

$P(X > x) = e^{-\lambda x}$, which corresponds to the area under the density curve to the right of x.

$P(x_1 < X \le x_2) = e^{-\lambda x_1} - e^{-\lambda x_2}$, which corresponds to the area under the density curve between x1 and x2.
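
The three area formulas translate directly into code. This is a minimal sketch; the rate λ = 0.5 and the function names are assumptions chosen for illustration.

```python
from math import exp


def exp_cdf(x: float, lam: float) -> float:
    """P(X <= x): area under the density to the left of x."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0


def exp_sf(x: float, lam: float) -> float:
    """P(X > x): area under the density to the right of x."""
    return exp(-lam * x) if x >= 0 else 1.0


def exp_between(x1: float, x2: float, lam: float) -> float:
    """P(x1 < X <= x2): area under the density between x1 and x2."""
    return exp_cdf(x2, lam) - exp_cdf(x1, lam)


lam = 0.5  # assumed rate, e.g. 0.5 calls per minute

mean = 1 / lam             # E(X) = 1/lambda
variance = (1 / lam) ** 2  # Var(X) = (1/lambda)^2

print(exp_cdf(2, lam), exp_sf(2, lam), exp_between(1, 3, lam), mean, variance)
```

The left and right areas at any x add up to 1, which is a handy sanity check when computing these by hand.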


Mean, Median, Mode, and Range

Mean, median, and mode are three kinds of "averages". There are many "averages" in statistics,
but these are, I think, the three most common, and are certainly the three you are most likely to
encounter in your pre-statistics courses, if the topic comes up at all.

The "mean" is the "average" you're used to, where you add up all the numbers and then divide by
the number of numbers. The "median" is the "middle" value in the list of numbers. To find the
median, your numbers have to be listed in numerical order from smallest to largest, so you may
have to rewrite your list before you can find the median. The "mode" is the value that occurs
most often. If no number in the list is repeated, then there is no mode for the list.

Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13

The mean is the usual average, so I'll add and then divide:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Note that the mean, in this case, isn't a value from the original list. This is a common result. You
should not assume that your mean will be one of your original numbers.

The median is the middle value, so first I'll have to rewrite the list in numerical order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th
number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

The mode is the number that is repeated more often than any other, so 13 is the mode.

The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.

mean: 15
median: 14
mode: 13
range: 8
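
The same four values can be obtained with Python's standard `statistics` module, which is a convenient way to check hand calculations. The variable names are illustrative.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(values)          # sum of the values divided by their count
median = statistics.median(values)      # middle value of the sorted list
mode = statistics.mode(values)          # most frequently occurring value
value_range = max(values) - min(values) # largest value minus smallest value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("range:", value_range)
```

`statistics.median` sorts the data internally, so the list does not have to be rewritten in numerical order first.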


REGRESSION
A regression analysis is a statistical procedure that allows you to make a prediction about an
outcome (or criterion) variable based on knowledge of some predictor variable. To create a
regression model, you first need to collect (a lot of) data on both variables, similar to what you
would do if you were conducting a correlation. Then you would determine the contribution of the
predictor variable to the outcome variable. Once you have the regression model, you would be
able to input an individual’s score on the predictor variable to get a prediction of their score on
the outcome variable.

- Example: You want to try to predict whether a student will come back for a second year
based on how many on-campus activities s/he attended. You would have to collect data on how
many activities students attended and then whether or not those students returned for a second
year. If activity attendance and retention are significantly related to each other, then you can
generate a regression model where you could identify at-risk students (in terms of retention)
based on how many activities they have attended.

- Example: You want to try to identify students who are at risk of failing College Algebra
based on their scores on a math assessment so you can direct them to special services on campus.
You would administer the math assessment at the start of the semester and then match each
student’s score on the math assessment to their final grade in the course. Eventually, your data
may show that the math assessment is significantly correlated to their final grade, and you can
create a regression model to identify those at-risk students so you can direct them to tutors and
other resources on campus.

- Thus, use regression when:

You want to be able to make a prediction about an outcome given what you already know about
some related factor.

Another option with regression is to do a multiple regression, which allows you to make a
prediction about an outcome based on more than just one predictor variable. Many retention
models are essentially multiple regressions that consider factors such as GPA, level of
involvement, and attitude towards academics and learning.

The Linear Regression Equation

Linear regression is a way to model the relationship between two variables. You might also
recognize the equation as the slope formula. The equation has the form Y=a+bX, where Y is the
dependent variable (that’s the variable that goes on the Y axis), X is the independent variable
(i.e. it is plotted on the X axis), b is the slope of the line and a is the y-intercept.

Y’=a+bX


HOW TO FIND A LINEAR REGRESSION EQUATION: STEPS

Step 1: Make a chart of your data, filling in the columns in the same way as you would fill in the chart if you were
finding the Pearson’s Correlation Coefficient.

Subject   Age (x)   Glucose level (y)   xy      x²      y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Σ         247       486                 20485   11409   40022

From the above table, Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022. n is the sample size (6, in our
case).

Step 2: Use the following equations to find a and b.

Y’ = a + bX

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2}$$

$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$

For our data, a = 65.1416 and b = 0.385225.

Find a:
 ((486 × 11,409) – (247 × 20,485)) / (6(11,409) – 247²)
 (5,544,774 – 5,059,795) / (68,454 – 61,009)
 484,979 / 7,445
 = 65.14
Find b:
 (6(20,485) – (247 × 486)) / (6(11,409) – 247²)
 (122,910 – 120,042) / (68,454 – 61,009)
 2,868 / 7,445
 = 0.385225
Step 3: Insert the values into the equation.
y’ = a + bx
y’ = 65.14 + 0.385225x
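
The three steps can be reproduced in Python from the raw age and glucose data. This sketch uses the same column sums as the chart; the `predict` helper and the example age of 50 are hypothetical additions for illustration.

```python
ages = [43, 21, 25, 42, 57, 59]     # x values from the chart
glucose = [99, 65, 79, 75, 87, 81]  # y values from the chart

n = len(ages)
sum_x = sum(ages)
sum_y = sum(glucose)
sum_xy = sum(x * y for x, y in zip(ages, glucose))
sum_x2 = sum(x * x for x in ages)

# Both coefficients share the same denominator: n*Σx² - (Σx)²
denominator = n * sum_x2 - sum_x**2

a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # y-intercept
b = (n * sum_xy - sum_x * sum_y) / denominator       # slope


def predict(age: float) -> float:
    """Predicted glucose level y' = a + b*x for a given age."""
    return a + b * age


print(round(a, 4), round(b, 6))
print(predict(50))  # hypothetical query: predicted glucose level at age 50
```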


Pearson Correlation Coefficient

The Pearson correlation coefficient is a statistical measure of the strength and direction of the
linear relationship between two variables. In the field of statistics, it is often referred to as
Pearson's r (or the Pearson R test). When studying the relationship between two variables, it is a
good idea to compute the Pearson correlation coefficient to determine just how strong that
relationship is.

Formula-
In order to determine how strong the relationship is between two variables, a formula must be
followed to produce what is referred to as the coefficient value. The coefficient value can range
between -1.00 and 1.00. If the coefficient value is in the negative range, then that means the
relationship between the variables is negatively correlated, or as one value increases, the other
decreases. If the value is in the positive range, then that means the relationship between the
variables is positively correlated, or both values increase or decrease together. Let's look at the
formula for conducting the Pearson correlation coefficient value.

Step-1: Make a chart with your data for two variables, labeling the variables (x) and (y), and
add three more columns labeled (xy), (x^2), and (y^2). Complete the chart using basic multiplication
of the variable values. A simple data chart might look like this:

Person   Age (x)   Score (y)   (xy)   (x^2)   (y^2)
1        20        30          600    400     900
2        24        20          480    576     400
3        17        27          459    289     729

Step-2: After you have multiplied all the values to complete the chart, add up all of the columns
from top to bottom.

Person   Age (x)   Score (y)   (xy)   (x^2)   (y^2)
1        20        30          600    400     900
2        24        20          480    576     400
3        17        27          459    289     729
Total    61        77          1539   1265    2029

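Step-3: Use this formula to find the Pearson correlation coefficient value.

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2]\,[n\sum y^2 - (\sum y)^2]}}$$
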
Sample question: Find the value of the correlation coefficient from the following
table:

Subject   Age (x)   Glucose level (y)
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81

Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and
y².

Subject   Age (x)   Glucose level (y)   xy      x²      y²
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81

Step 2: Multiply x and y together to fill the xy column. For example, row 1 would
be 43 × 99 = 4,257.

Subject   Age (x)   Glucose level (y)   xy      x²      y²
1         43        99                  4257
2         21        65                  1365
3         25        79                  1975
4         42        75                  3150
5         57        87                  4959
6         59        81                  4779

Step 3: Take the square of the numbers in the x column, and put the result in the
x² column.

Subject   Age (x)   Glucose level (y)   xy      x²      y²
1         43        99                  4257    1849
2         21        65                  1365    441
3         25        79                  1975    625
4         42        75                  3150    1764
5         57        87                  4959    3249
6         59        81                  4779    3481

Step 4: Take the square of the numbers in the y column, and put the result in the
y² column.

Subject   Age (x)   Glucose level (y)   xy      x²      y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561

Step 5: Add up all of the numbers in the columns and put the result at the bottom
of the column. The Greek letter sigma (Σ) is a short way of saying “sum of.”

Subject   Age (x)   Glucose level (y)   xy      x²      y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Σ         247       486                 20485   11409   40022

Step 6: Use the following correlation coefficient formula.

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2]\,[n\sum y^2 - (\sum y)^2]}}$$

From our table:

 Σx = 247
 Σy = 486
 Σxy = 20,485
 Σx² = 11,409
 Σy² = 40,022
 n is the sample size, in our case = 6

The correlation coefficient =

 (6(20,485) – (247 × 486)) / √([6(11,409) – 247²] × [6(40,022) – 486²])

 = 2,868 / √(7,445 × 3,936)

 = 2,868 / 5,413.27

 = 0.5298

The range of the correlation coefficient is from -1 to 1. Our result is 0.5298,
which means the variables have a moderate positive correlation.
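
The whole procedure, from the raw columns to the coefficient, fits in one small function. This is a sketch using the column-sum form of the formula; `pearson_r` is an illustrative name, and the second call applies it to the three-person age and score chart from earlier in this section, for which the notes do not state a result.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient computed from the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
r = pearson_r(ages, glucose)
print(round(r, 4))

# The three-person age/score chart; a negative value here means the score
# tends to fall as age rises in that tiny sample.
r_small = pearson_r([20, 24, 17], [30, 20, 27])
print(round(r_small, 4))
```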

INFERENTIAL STATISTICS
What is the main purpose of inferential statistics?
The main purpose of inferential statistics is to:
a. Summarize data in a useful and informative manner.
b. Estimate a population characteristic based on a sample.
c. Determine if the data adequately represents the population.
Inferential statistics allows you to make inferences about the population from the sample data.

Population & Sample

A sample is a subset of a population, ideally a representative one. Conducting a census of the
whole population is ideal but impractical in most cases. Sampling is much more practical; however, it
is prone to sampling error. A sample that is not representative of the population is said to be
biased, and the flaw in the method that produces such a sample is called sampling bias. Convenience
bias, judgement bias, size bias, and response bias are the main types of sampling bias. The best
technique for reducing bias in sampling is randomization. Simple random sampling is the simplest of
the randomization techniques; cluster sampling and stratified sampling are other commonly used
sampling techniques.

Here are two main areas of inferential statistics:
1. Estimating parameters. This means taking a statistic from your sample data (for example
the sample mean) and using it to say something about a population parameter (i.e. the
population mean).
2. Hypothesis tests. This is where you can use sample data to answer research questions. For
example, you might be interested in knowing if a new cancer drug is effective. Or if
breakfast helps children perform better in schools.

When you have quantitative data, you can analyze it using either descriptive or inferential
statistics. Descriptive statistics do exactly what they sound like: they describe the data.
Descriptive statistics include measures of central tendency (mean, median, mode), measures of
variation (standard deviation, variance), and relative position (quartiles, percentiles). There are
times, however, when you want to draw conclusions about the data. This may include making
comparisons across time, comparing different groups, or trying to make predictions based on
data that has been collected. Inferential statistics are used when you want to move beyond simple
description or characterization of your data and draw conclusions based on your data. There are
several kinds of inferential statistics that you can calculate; here are a few of the more common
types:

INFERENTIAL STATISTICS THROUGH HYPOTHESIS

Q. What Is Hypothesis Testing?

Ans.

Hypothesis testing is a procedure in statistics whereby an analyst tests an assumption regarding
a population parameter. The methodology employed by the analyst depends on the nature
of the data used and the reason for the analysis. Hypothesis testing is used to assess the
plausibility of a hypothesis by using sample data drawn from a larger population.

Q. What is inferential hypothesis?

Ans.
Inferential Statistics arises from the sampling theory that makes inferences about
populations based upon samples taken from such populations. ... Null hypothesis is a
statement that indicates there is no difference between the sample and the population (or
the population means for the treatment and experimental groups).

Real World Example of Hypothesis Testing

If, for example, a person wants to test that a penny has exactly a 50% chance of landing
on heads, the null hypothesis would be that 50% is correct, and the alternative hypothesis would
be that 50% is not correct. Mathematically, the null hypothesis would be represented as Ho:
P = 0.5. The alternative hypothesis would be denoted as "Ha" and be identical to the null
hypothesis, except with the equal sign struck through (Ha: P ≠ 0.5), meaning that it does not equal
50%.

A random sample of 100 coin flips is taken from a random population of coin flippers,
and the null hypothesis is then tested. If the 100 coin flips come out as 40 heads and 60 tails,
the analyst checks how likely a result this far from 50 heads would be if the null hypothesis
were true. If that probability falls below the chosen significance level, the analyst rejects
the null hypothesis in favor of the alternative and concludes that a penny does not have a 50%
chance of landing on heads. Afterward, a new hypothesis could be tested, this time that a
penny has a 40% chance of landing on heads.
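
To make the coin example concrete, the sketch below runs an exact binomial test for 40 heads in 100 flips. The 5% significance level and the definition of the two-sided p-value (the total probability of all outcomes no more likely than the observed one) are assumptions made here; the notes do not specify either.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(heads: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided p-value under the null hypothesis P(heads) = p0."""
    observed = binom_pmf(heads, n, p0)
    # Sum the probabilities of every outcome that is no more likely than the
    # observed one; the small factor guards against floating-point ties.
    return sum(
        pk
        for pk in (binom_pmf(k, n, p0) for k in range(n + 1))
        if pk <= observed * (1 + 1e-9)
    )


alpha = 0.05  # assumed significance level
p_value = two_sided_p_value(40, 100)
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)
```

Under these assumptions the p-value lands slightly above 0.05, so 40 heads in 100 flips is borderline evidence against a fair coin rather than a decisive rejection; the conclusion depends on the significance level chosen in the analysis plan.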

Four Steps of Hypothesis Testing

All hypotheses are tested using a four-step process:

1. The first step is for the analyst to state the two hypotheses so that only one can be
right.
2. The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.
3. The third step is to carry out the plan and physically analyze the sample data.
4. The fourth and final step is to analyze the results and either reject or fail to reject the
null hypothesis.


ANOVA (ANALYSIS OF VARIANCE)

Analysis of Variance (ANOVA) is a parametric statistical technique used to compare
the means of datasets. This technique was invented by R.A. Fisher, and is thus often referred to as
Fisher’s ANOVA. It is similar in application to techniques such as the t-test and z-test,
in that it is used to compare means and the relative variance between them. However,
analysis of variance (ANOVA) is best applied where more than 2 populations or samples
are to be compared.


The use of this parametric statistical technique involves certain key assumptions, including
the following:

1. Independence of case: The independence of case assumption means that the cases of the
dependent variable should be independent of each other, or the sample should be selected randomly.
There should not be any pattern in the selection of the sample.

2. Normality: Distribution of each group should be normal. The Kolmogorov-Smirnov or
the Shapiro-Wilk test may be used to check the normality of each group.

3. Homogeneity: Homogeneity means the variance within each group should be the same across groups.
Levene’s test is used to test the homogeneity of variance between groups.

If the data satisfy the above assumptions, then analysis of variance (ANOVA) is
a suitable technique for comparing the means of two or more populations.

Analysis of variance (ANOVA) has three types:

One way analysis: When we are comparing three or more groups based on one factor
variable, then it is said to be one way analysis of variance (ANOVA). For example, if we want
to compare whether or not the mean output of three workers is the same based on the
working hours of the three workers.

Two way analysis: When there are two factor variables, then it is said to be two way
analysis of variance (ANOVA). For example, based on working condition and working
hours, we can compare whether or not the mean output of three workers is the same.

K-way analysis: When factor variables are k, then it is said to be the k-way analysis of
variance (ANOVA).
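
For the one way case, the F statistic is the ratio of the variation between group means to the variation within groups. The sketch below computes it by hand for three workers; the output figures are hypothetical data invented for illustration, and the function name `one_way_anova_f` is likewise an assumption.

```python
def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical daily output of three workers over three days
worker_output = [[8, 9, 10], [10, 11, 12], [12, 13, 14]]
f_stat, df_b, df_w = one_way_anova_f(worker_output)
print(f_stat, df_b, df_w)
```

A large F value suggests that the group means differ by more than the within-group spread would explain; the decision itself requires comparing F with the critical value of the F distribution for the two degrees of freedom.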

CS- 503 (A) Data Analytics Notes By –ARUN KUMAR JHAPATE, CSE, SIRT, BHOPAL, MP
