
Revision Notes On Probability and Regression Analysis Both Classes

The document discusses several topics related to probability, index numbers, and regression models. It first defines probability as the mathematical calculation of possibilities of phenomena occurring based on a value between 0 and 1. It then provides the formula for calculating probability by dividing favorable events by total possible events. The document also discusses types of probability like mathematical, frequency, objective, and conditional probability. Finally, it summarizes theories of probability including addition, multiplication, and binomial distribution rules. The document then discusses index numbers, providing definitions and discussing steps in their construction including selecting a base year and representative commodities, and collecting prices. It defines an index number as measuring changes in a variable over time relative to a base

Uploaded by

magdawaks46


REVISION NOTES

Calculating Probability Values, Index Numbers, Regression Model and Hypothesis

Meaning of probability:

The term probability refers to the mathematical calculation of how likely a
phenomenon is to occur under given random circumstances. Probability is
expressed as a value between 0 and 1: the closer the value is to 1, the more
certain the event is to occur; the closer it is to 0, the less certain the
result.

Formula for calculating probability:

To calculate probability, divide the number of favorable events by
the total number of possible events. The set of all possible events is the sample
space, and the calculation can be performed from the data obtained.

The result is a value between 0 and 1. To express it as a percent, multiply by 100:

Probability (%) = (Favorable cases / Possible cases) x 100.
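As an illustration of the division described above, here is a minimal Python sketch. The function name `probability` and the die example are assumptions made for this illustration; they do not come from the notes.

```python
from fractions import Fraction


def probability(favorable: int, possible: int) -> Fraction:
    """Return favorable / possible as an exact fraction between 0 and 1."""
    if possible <= 0 or not 0 <= favorable <= possible:
        raise ValueError("need possible > 0 and 0 <= favorable <= possible")
    return Fraction(favorable, possible)


# Hypothetical example: rolling an even number with one fair six-sided die.
# Favorable cases: 2, 4, 6 (three of them). Possible cases: six.
p_even = probability(3, 6)
percent_even = float(p_even) * 100

print(p_even)        # 1/2
print(percent_even)  # 50.0
```

Using `Fraction` keeps the result exact; converting to a percent is only a change of presentation.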

Types of probability:

• Mathematical: this follows the principles of formal, non-experimental logic,
calculating in figures the random events that may occur within a certain field.
• Frequency: based on experimentation; it determines the number of times
an event may occur by considering a specific number of opportunities.
• Objective: considers the frequency of the event in advance and only sheds
light on the probable cases when that event may occur.
• Subjective: this concept is the opposite of mathematical probability, as it
takes certain eventualities into account that allow inferring the probability of a
certain event, even without having certainty at the arithmetic level.
• Binomial: determines the success or failure of an event with only two
possible outcomes.
• Logical: raises the possibility of an event occurring based on inductive laws.
• Conditional: explains the probability of one event happening based on the
prior occurrence of another, so one is dependent on the other.
• Hypergeometric: probability obtained from sampling techniques, specifically
sampling without replacement from a finite population whose items are
classified into groups (for example, successes and failures). It gives the
probability of drawing a given number of items from each group.
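Two of the types listed above have simple closed forms in standard textbooks. The sketch below uses those standard formulas, which the notes themselves do not state; the function names and the die and card examples are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb


def conditional(p_a_and_b: Fraction, p_b: Fraction) -> Fraction:
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than zero")
    return p_a_and_b / p_b


def hypergeometric(k: int, population: int, successes: int, draws: int) -> Fraction:
    """P(exactly k successes in `draws` draws without replacement)."""
    ways = comb(successes, k) * comb(population - successes, draws - k)
    return Fraction(ways, comb(population, draws))


# Conditional: one fair die, A = "the roll is 2", B = "the roll is even".
p_two_given_even = conditional(Fraction(1, 6), Fraction(3, 6))
print(p_two_given_even)  # 1/3

# Hypergeometric: exactly 2 aces in a 5-card hand from a 52-card deck (4 aces).
p_two_aces = hypergeometric(2, 52, 4, 5)
print(float(p_two_aces))
```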

Theories explaining probability:

There are three methods for determining the probability of any event, and they are
based on the rules of:

1. Addition: states that the probability that any one of several events occurs
is equal to the sum of their individual probabilities, as long as the events
are mutually exclusive (they cannot occur at the same time).
2. Multiplication: posits that the probability of two or more independent events
all occurring is equal to the product of their individual probabilities.
3. Binomial distribution: applies to a series of trials that occur independently
of each other, each of which admits only two possible mutually exclusive
outcomes, success or failure, and gives the probability of a given combination
of those outcomes.

There is also Laplace’s rule, which states that, in a random experiment composed of
results that are equally probable, the probability of an event is the
number of favorable cases divided by the number of possible cases.
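The three rules can be checked on small cases. The following is a sketch only: the coin and die examples and the helper name `binomial` are assumptions introduced here, and the binomial formula is the standard textbook one, C(n, k) p^k (1 - p)^(n - k).

```python
from fractions import Fraction
from math import comb

# Addition rule (mutually exclusive events): rolling a 1 or a 2 with one die.
p_one_or_two = Fraction(1, 6) + Fraction(1, 6)

# Multiplication rule (independent events): two heads in two coin tosses.
p_two_heads = Fraction(1, 2) * Fraction(1, 2)


def binomial(k: int, n: int, p: Fraction) -> Fraction:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Binomial distribution: exactly 2 heads in 3 tosses of a fair coin.
p_two_of_three = binomial(2, 3, Fraction(1, 2))

print(p_one_or_two)    # 1/3
print(p_two_heads)     # 1/4
print(p_two_of_three)  # 3/8
```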

Situations under which probability can be used:

Some examples where probability is applied are:

1. Statistical analysis of business risk: drops in stock prices, investment
statements, etc. can be estimated through probabilistic formulas.
2. Insurance calculation: the processes used to study the reliability of an
insured party, making it possible to know whether it is profitable to insure
them and at what price and time span this should be done, arise from
probability calculations and strategies.
3. Behavioral analysis: in this type of application, probability is used to
evaluate certain behaviors of a population sample so that certain patterns of
opinions, behaviors, or thoughts can be predicted.
4. Medical research: the success of vaccines, as well as their side effects in a
population, is an example that is determined by probabilistic calculations.

Index Numbers:
Characteristics, Formula, Examples, Types, Importance and Limitations
This section discusses: 1. Meaning of Index Numbers 2. Features of Index
Numbers 3. Steps or Problems in the Construction 4. Construction of Price Index
Numbers (Formula and Examples) 5. Difficulties in Measuring Changes in Value of
Money 6. Types of Index Numbers 7. Importance 8. Limitations.

Meaning of Index Numbers:

The value of money does not remain constant over time. It rises or falls and is
inversely related to the changes in the price level. A rise in the price level means a
fall in the value of money and a fall in the price level means a rise in the value of
money. Thus, changes in the value of money are reflected by the changes in the
general level of prices over a period of time. Changes in the general level of prices
can be measured by a statistical device known as ‘index number.’

An index number is a technique of measuring changes in a variable or group of
variables with respect to time, geographical location or other characteristics. There
can be various types of index numbers, but, in the present context, we are
concerned with price index numbers, which measure changes in the general price
level (or in the value of money) over a period of time.

Price index number indicates the average of changes in the prices of representative
commodities at one time in comparison with that at some other time taken as the
base period. According to L.V. Lester, “An index number of prices is a figure
showing the height of average prices at one time relative to their height at some
other time which is taken as the base period.”

Features of Index Numbers:

The following are the main features of index numbers:

(i) Index numbers are a special type of average. Whereas mean, median and mode
measure the absolute changes and are used to compare only those series which are
expressed in the same units, the technique of index numbers is used to measure
the relative changes in the level of a phenomenon where the measurement of
absolute change is not possible and the series are expressed in different types of
items.

(ii) Index numbers are meant to study the changes in the effects of such factors
which cannot be measured directly. For example, the general price level is an
imaginary concept and is not capable of direct measurement. But, through the
technique of index numbers, it is possible to have an idea of relative changes in the
general level of prices by measuring relative changes in the price level of different
commodities.

(iii) The technique of index numbers measures changes in one variable or a group of
related variables. For example, one variable can be the price of wheat, and a group of
variables can be the price of sugar, the price of milk and the price of rice.

(iv) The technique of index numbers is used to compare the levels of a
phenomenon on a certain date with its level on some previous date (e.g., the price
level in 1980 as compared to that in 1960 taken as the base year) or the levels of a
phenomenon at different places on the same date (e.g., the price level in India in
1980 in comparison with that in other countries in 1980).

Steps or Problems in the Construction of Price Index Numbers:

The construction of the price index numbers involves the following steps
or problems:

1. Selection of Base Year:

The first step or the problem in preparing the index numbers is the selection of the
base year. The base year is defined as that year with reference to which the price
changes in other years are compared and expressed as percentages. The base year
should be a normal year.

In other words, it should be free from abnormal conditions like wars, famines,
floods, political instability, etc. The base year can be selected in two ways: (a) through the
fixed base method, in which the base year remains fixed; and (b) through the chain
base method in which the base year goes on changing, e.g., for 1980 the base year
will be 1979, for 1979 it will be 1978, and so on.
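The difference between the two ways of choosing a base year can be sketched in Python. The prices below are hypothetical figures for a single commodity, and both function names are assumptions made for this example.

```python
# Hypothetical prices of one commodity in four consecutive years.
prices = {1977: 50, 1978: 55, 1979: 66, 1980: 99}


def fixed_base_index(prices: dict, base_year: int) -> dict:
    """Every year is compared with the same, fixed base year."""
    base = prices[base_year]
    return {year: price * 100 / base for year, price in prices.items()}


def chain_base_index(prices: dict) -> dict:
    """Every year is compared with the year immediately before it."""
    years = sorted(prices)
    return {
        year: prices[year] * 100 / prices[previous]
        for previous, year in zip(years, years[1:])
    }


fixed = fixed_base_index(prices, base_year=1977)
chain = chain_base_index(prices)

print(fixed)  # 1980 is compared with 1977
print(chain)  # 1980 is compared with 1979; the first year has no entry
```

With these figures the fixed base index for 1980 is 198 (relative to 1977), while the chain base index for 1980 is 150 (relative to 1979).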

2. Selection of Commodities:

The second problem in the construction of index numbers is the selection of the
commodities. Since all commodities cannot be included, only representative
commodities should be selected keeping in view the purpose and type of the index
number.

In selecting items, the following points are to be kept in mind:

(a) The items should be representative of the tastes, habits and customs of the
people.

(b) Items should be recognizable.

(d) The economic and social importance of various items should be considered.

(e) The items should be fairly large in number.

(f) All those varieties of a commodity which are in common use and are stable in
character should be included.

3. Collection of Prices:

After selecting the commodities, the next problem concerns the
collection of their prices:

(a) From where the prices are to be collected;

(b) Whether to choose wholesale prices or retail prices;

(c) Whether to include taxes in the prices or not, etc.

While collecting prices, the following points are to be noted:

(a) Prices are to be collected from those places where a particular commodity is
traded in large quantities.

(b) Published information regarding the prices should also be utilised.

(c) In selecting individuals and institutions who would supply price quotations, care
should be taken that they are not biased.

(d) Selection of wholesale or retail prices depends upon the type of index number
to be prepared. Wholesale prices are used in the construction of general price index
and retail prices are used in the construction of cost-of-living index number.

(e) Prices collected from various places should be averaged.

4. Selection of Average:

Since index numbers are a specialized average, the fourth problem is to choose
a suitable average. Theoretically, the geometric mean is the best for this purpose. But,
in practice, the arithmetic mean is used because it is easier to follow.
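One commonly cited reason for the theoretical preference can be seen in a two-item case: if one price halves and another doubles, the geometric mean of the price relatives shows no overall change, while the arithmetic mean shows a rise. The figures below are hypothetical and the sketch is only an illustration of that point.

```python
from statistics import fmean, geometric_mean

# Hypothetical price relatives (base year = 100):
# one price has halved (50) and another has doubled (200).
relatives = [50.0, 200.0]

arithmetic = fmean(relatives)          # (50 + 200) / 2
geometric = geometric_mean(relatives)  # square root of 50 * 200

print(arithmetic)           # 125.0
print(round(geometric, 6))  # 100.0
```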

5. Selection of Weights:

Generally, all the commodities included in the construction of index numbers are
not of equal importance. Therefore, if the index numbers are to be representative,
proper weights should be assigned to the commodities according to their relative
importance.

For example, the prices of books will be given more weightage while preparing the
cost-of-living index for teachers than while preparing the cost-of-living index for the
workers. Weights should be unbiased and be rationally and not arbitrarily selected.

6. Purpose of Index Numbers:

The most important consideration in the construction of the index numbers is the
objective of the index numbers. All other problems or steps are to be viewed in the
light of the purpose for which a particular index number is to be prepared. Since
different index numbers are prepared with specific purposes and no single index
number is an ‘all-purpose’ index number, it is important to be clear about the purpose
of the index number before its construction.

7. Selection of Method:

The selection of a suitable method for the construction of index numbers is the final
step.

There are two methods of computing the index numbers:

(a) Simple index number and

(b) Weighted index number.

A simple index number can in turn be constructed either by (i) the simple aggregative
method, or by (ii) the simple average of price relatives method. Similarly, a weighted
index number can be constructed either by (i) the weighted aggregative method, or by
(ii) the weighted average of price relatives method. The choice of method depends
upon the availability of data, degree of accuracy required and the purpose of the
study.

Construction of Price Index Numbers (Formula and Examples):

Construction of price index numbers through various methods can be
understood with the help of the following examples:

1. Simple Aggregative Method:

In this method, the index number is equal to the sum of prices for the year for
which index number is to be found divided by the sum of actual prices for the base
year.

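The formula for finding the index number through this method is as
follows:

P01 = (ΣP1 / ΣP0) x 100

where ΣP1 is the sum of prices in the year for which the index number is to be
found and ΣP0 is the sum of prices in the base year.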
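The simple aggregative method described above can be sketched as follows. The prices are hypothetical, the function name is an assumption, and the result is expressed as a percentage of the base year.

```python
def simple_aggregative_index(base_prices, current_prices):
    """Sum of current-year prices divided by sum of base-year prices, times 100."""
    if len(base_prices) != len(current_prices):
        raise ValueError("both years must cover the same commodities")
    return sum(current_prices) * 100 / sum(base_prices)


# Hypothetical prices of three commodities.
base_prices = [10, 20, 50]     # base year, total 80
current_prices = [15, 22, 50]  # current year, total 87

index = simple_aggregative_index(base_prices, current_prices)
print(index)  # 108.75
```

An index of 108.75 would mean that, in aggregate, prices are 8.75 per cent above the base year.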

2. Simple Average of Price Relatives Method:

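In this method, the index number is equal to the sum of price relatives
divided by the number of items and is calculated by using the following
formula:

P01 = ΣR / N

where R = (P1 / P0) x 100 is the price relative of an item and N is the number
of items.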
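A minimal sketch of this method, assuming that each price relative is the current price as a percentage of the base price. The prices and function names are hypothetical.

```python
def price_relatives(base_prices, current_prices):
    """Price relative of each item: current price as a percentage of base price."""
    return [p1 * 100 / p0 for p0, p1 in zip(base_prices, current_prices)]


def simple_average_of_relatives(base_prices, current_prices):
    """Sum of the price relatives divided by the number of items."""
    relatives = price_relatives(base_prices, current_prices)
    return sum(relatives) / len(relatives)


# Hypothetical prices of three commodities.
base_prices = [10, 20, 50]
current_prices = [15, 22, 50]

relatives = price_relatives(base_prices, current_prices)
index = simple_average_of_relatives(base_prices, current_prices)

print(relatives)  # [150.0, 110.0, 100.0]
print(index)      # 120.0
```

Unlike a method based on price totals, each commodity contributes equally here, whatever its absolute price.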

3. Weighted Aggregative Method:

In this method, different weights are assigned to the items according to their
relative importance. Weights used are the quantity weights. Many formulae have
been developed to estimate index numbers on the basis of quantity weights.

Some of them are explained below:
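Three widely used quantity-weighted formulas are those of Laspeyres (base-year quantities as weights), Paasche (current-year quantities as weights) and Fisher (the geometric mean of the two). Whether these are the formulas the notes go on to explain is an assumption; the sketch below uses hypothetical prices and quantities.

```python
from math import sqrt


def weighted_sum(prices, quantities):
    """Sum of price times quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def laspeyres(p0, q0, p1):
    """Base-year quantities q0 are used as weights."""
    return weighted_sum(p1, q0) * 100 / weighted_sum(p0, q0)


def paasche(p0, p1, q1):
    """Current-year quantities q1 are used as weights."""
    return weighted_sum(p1, q1) * 100 / weighted_sum(p0, q1)


def fisher(p0, q0, p1, q1):
    """Geometric mean of the Laspeyres and Paasche indices."""
    return sqrt(laspeyres(p0, q0, p1) * paasche(p0, p1, q1))


# Hypothetical data for three commodities.
p0, q0 = [10, 20, 50], [5, 3, 2]  # base-year prices and quantities
p1, q1 = [15, 22, 50], [4, 3, 2]  # current-year prices and quantities

l_index = laspeyres(p0, q0, p1)   # 241 * 100 / 210
p_index = paasche(p0, p1, q1)     # 226 * 100 / 200
f_index = fisher(p0, q0, p1, q1)

print(round(l_index, 2), round(p_index, 2), round(f_index, 2))
```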

4. Weighted Average of Relatives Method:

In this method also different weights are used for the items according to their
relative importance.

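The price index number is found out with the help of the following
formula:

P01 = ΣRW / ΣW

where R = (P1 / P0) x 100 is the price relative of an item and W is the weight
assigned to it.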
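A sketch of the weighted average of relatives, assuming the standard form in which each price relative is multiplied by its weight and the total is divided by the sum of the weights. The relatives and weights below are hypothetical.

```python
def weighted_average_of_relatives(relatives, weights):
    """Sum of (relative x weight) divided by the sum of the weights."""
    if len(relatives) != len(weights):
        raise ValueError("one weight is needed per price relative")
    total = sum(r * w for r, w in zip(relatives, weights))
    return total / sum(weights)


# Hypothetical price relatives (base year = 100) and importance weights.
relatives = [150, 110, 100]
weights = [5, 3, 2]

index = weighted_average_of_relatives(relatives, weights)
print(index)  # 128.0
```

The heavily weighted first item pulls the index above the unweighted average of the same relatives, which is 120.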

Difficulties in Measuring Changes in Value of Money:

Measurement of changes in the value of money through price index number is not
an easy and reliable technique. There are a number of theoretical as well as
practical difficulties in the construction of price index numbers. Moreover, the index
number technique itself has many limitations.

(A) Conceptual Difficulties:

The following are the conceptual difficulties during the construction of
price index numbers:

1. Vague Concept of Value of Money:

The concept of the value of money is vague and abstract, and cannot be clearly defined. The value
of money is a relative concept which changes from person to person depending
upon the type of goods on which the money is spent.

2. Inaccurate Measurement:

Price index numbers do not measure the changes in the value of money accurately
and reliably. A rise or fall in the general level of prices as indicated by the price
index numbers does not mean that the price of every commodity has risen or fallen
to the same extent.

3. Reflect General Changes:

Price index numbers are averages and measure general changes in the value of
money on the average. Therefore, they are not of much significance for the
particular individuals who may be affected by the changes in the actual prices quite
differently from that indicated by the index numbers.

4. Limitations of Wholesale Price Index:

The wholesale price index numbers, which are generally used to measure
changes in the value of money, suffer from certain limitations:

(a) They do not reflect the changes in the cost of living because retail prices are
generally higher than the wholesale prices.

(b) They ignore some of the important items concerning the urban population, such
as, expenditure on education, transport, house rent, etc.

(B) Practical Difficulties:

The practical difficulties in the way of constructing price index numbers,
and therefore, in measuring changes in the value of money are as follows:

1. Selection of Base Year:

While preparing the index number, the first difficulty arises regarding the selection of the
base year. The base year should be a normal year. But it is very difficult to find
a fully normal year free from any unusual happening. There is every possibility that
the selected base year may be an abnormal year, or a distant year, or may be
selected by an inexperienced or biased person.

2. Selection of Items:

The selection of the representative commodities is the second difficulty in
the construction of index numbers:

(a) With the passage of time the quality of the product may change; if the quality
of a product changes in the year of enquiry from what it was in the base year, the
product becomes irrelevant.

(b) The relative importance of certain commodities may change due to a change in
the consumption pattern of the people in the course of time; for example, Vanaspati
Ghee was not an important item of consumption in India in the pre-war period, but
today it has become an item of necessity. Under such conditions, it is not easy to
select the appropriate commodities.

3. Collection of Prices:

It is also difficult to obtain correct, adequate and representative data regarding
prices. It is not an easy job to select representative places from which the
information about prices is to be collected and to select experienced and unbiased
individuals or institutions who will supply price quotations. Moreover, there is the
problem of deciding which prices (wholesale or retail) are to be taken into
consideration. It is comparatively easy to get information about wholesale prices;
retail prices vary considerably.

4. Assigning Weights:

Another important difficulty that arises in preparing the index numbers is that of
assigning proper weights to different items in order to arrive at correct and
unbiased conclusions. As there are no hard and fast rules for assigning weights to the
commodities according to their relative importance, there is every likelihood that the
weights are decided arbitrarily on the basis of personal judgement and involve
bias.

5. Selection of Averages:

Another major problem is deciding which average should be employed to combine the
price relatives. There are many types of averages such as the arithmetic mean,
geometric mean, median, mode, etc. The use of different averages gives
different results. Therefore, it is essential to select the method with great care. Dr.
Marshall has advocated the use of chain index number to solve the problem of
averaging and weighting.

6. Problem of Dynamic Changes:

In the dynamic world, the consumption pattern of the individuals and the number
and varieties of goods undergo continuous changes.

They create difficulties for preparing index numbers and making temporal
comparisons:

(a) Since, in the course of time, old commodities may disappear and many new
ones come into existence, the long-run comparison may become difficult.

(b) The quantity and quality of commodities may also change over the period of
time, thus making the choice of commodities for constructing index numbers
difficult.

(c) A number of factors, like income, education, fashion, etc., bring changes in the
consumption pattern of the people which render the index numbers incomparable.

Types of Index Numbers:

Index numbers are of different types.

Important types of index numbers are discussed below:

1. Wholesale Price Index Numbers:

Wholesale price index numbers are constructed on the basis of the wholesale prices
of certain important commodities. The commodities included in preparing these
index numbers are mainly raw materials and semi-finished goods. Only the most
important and most price-sensitive raw materials and semi-finished goods which are bought and
sold in the wholesale market are selected, and weights are assigned in accordance
with their relative importance.

The wholesale price index numbers are generally used to measure changes in the
value of money. The main problem with these index numbers is that they include
only the wholesale prices of raw materials and semi-finished goods and do not take
into consideration the retail prices of goods and services generally consumed by the
common man. Hence, the wholesale price index numbers do not reflect true and
accurate changes in the value of money.

2. Retail Price Index Numbers:

These index numbers are prepared to measure the changes in the value of money
on the basis of the retail prices of final consumption goods. The main difficulty with
this index number is that retail prices for the same goods over continuous
periods are often not available. Retail prices also show larger and more frequent
fluctuations than wholesale prices.

3. Cost-of-Living Index Numbers:

These index numbers are constructed with reference to the important goods and
services which are consumed by common people. Since the number of these goods
and services is very large, only representative items which form the consumption
pattern of the people are included. These index numbers are used to measure
changes in the cost of living of the general public.

4. Working Class Cost-of-Living Index Numbers:

The working class cost-of-living index numbers aim at measuring changes in the
cost of living of workers. These index numbers are constructed on the basis of only
those goods and services which are generally consumed by the working class. The
prices of these goods and the index numbers are of great importance to the workers
because their wages are adjusted according to these indices.

5. Wage Index Numbers:

The purpose of these index numbers is to measure changes in money
wages over time. These index numbers, when compared with the working class cost-of-living
index numbers, provide information regarding the changes in the real wages of the
workers.
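One common way to make this comparison is to divide the money wage index by the cost-of-living index and multiply by 100, which gives a real wage index. The notes do not state this formula, so treat the sketch below, its function name and its figures as assumptions.

```python
def real_wage_index(money_wage_index: float, cost_of_living_index: float) -> float:
    """Money wage index deflated by the cost-of-living index (base = 100)."""
    if cost_of_living_index <= 0:
        raise ValueError("cost-of-living index must be positive")
    return money_wage_index * 100 / cost_of_living_index


# Hypothetical figures: money wages are up 50% on the base year,
# while the working class cost of living is up 25%.
real = real_wage_index(150, 125)
print(real)  # 120.0
```

A value above 100 suggests that wages have risen faster than the cost of living; a value below 100 suggests the opposite.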

6. Industrial Index Numbers:

Industrial index numbers are constructed with the objective of measuring changes in
industrial production. The production data of various industries are included in
preparing these index numbers.

Importance of Index Numbers:

Index numbers are used to measure all types of quantitative changes in different
fields.

Various advantages of index numbers are given below:

1. General Importance:

In general, index numbers are very useful in a number of ways:

(a) They measure changes in one variable or in a group of variables.

(b) They are useful in making comparisons with respect to different places or
different periods of time.

(c) They are helpful in simplifying the complex facts.

(d) They are helpful in forecasting about the future.

(e) They are very useful in academic as well as practical research.

2. Measurement of Value of Money:

Index numbers are used to measure changes in the value of money or the price
level from time to time. Changes in the price level generally influence production
and employment of the country as well as various sections of the society. The price
index numbers also forewarn about the future inflationary tendencies and in this
way, enable the government to take appropriate anti-inflationary measures.

3. Changes in Cost of Living:

Index numbers highlight changes in the cost of living in the country. They indicate
whether the cost of living of the people is rising or falling. On the basis of this
information, the wages of the workers can be adjusted accordingly to save the
wage earners from the hardships of inflation.

4. Changes in Production:

Index numbers are also useful in providing information regarding production trends
in different sectors of the economy. They help in assessing the actual condition of
different industries, i.e., whether production in a particular industry is increasing or
decreasing or is constant.

5. Importance in Trade:

With the help of index numbers, knowledge about the trade
conditions and trade trends can be obtained. The import and export indices show
whether foreign trade of the country is increasing or decreasing and whether the
balance of trade is favourable or unfavourable.

6. Formation of Economic Policy:

Index numbers prove very useful to the government in formulating as well as
evaluating economic policies. Index numbers measure changes in the economic
conditions and, with this information, help the planners to formulate appropriate
economic policies. Further, whether a particular economic policy is good or bad is also
judged by index numbers.

7. Useful in All Fields:

Index numbers are useful in almost all fields. They are especially important in
the economic field.

Some of the specific uses of index numbers in the economic field are:

(a) They are useful in analyzing markets for specific commodities.

(b) In the share market, the index numbers can provide data about the trends in
the share prices.

(d) The bankers can get information about the changes in deposits by means of
index numbers.

Limitations of Index Numbers:

Index number technique itself has certain limitations which have greatly reduced its
usefulness:

(i) Because of the various practical difficulties involved in their computation, the
index numbers are never cent per cent correct.

(ii) There are no all-purpose index numbers. The index numbers prepared for one
purpose cannot be used for another purpose. For example, the cost-of-living index
numbers of factory workers cannot be used to measure changes in the value of
money of the middle income group.

(iii) Index numbers cannot be reliably used to make international comparisons.
Different countries include different items with different qualities and use different
base years in constructing index numbers.

(iv) Index numbers measure only average change and indicate only broad trends.
They do not provide accurate information.

(v) While preparing index numbers, quality of items is not considered. It may be
possible that a general rise in the index is due to an improvement in the quality of a
product and not because of a rise in its price.

The Simple Linear Regression Model:
We have worked hard to come up with formulas for the intercept b0 and the slope b1 of
the least squares regression line. But, we haven't yet discussed
what b0 and b1 estimate.
What do b0 and b1 estimate?
Let's investigate this question with another example. Below is a plot illustrating a
potential relationship between the predictor "high school grade point average (gpa)"
and the response "college entrance test score." Only four groups ("subpopulations") of
students are considered — those with a gpa of 1, those with a gpa of 2, ..., and those
with a gpa of 4.

Let's focus for now just on those students who have a gpa of 1. As you can see, there
are so many data points — each representing one student — that the data points run
together. That is, the data on the entire subpopulation of students with a gpa of 1 are
plotted. And, similarly, the data on the entire subpopulation of students with gpas of 2,
3, and 4 are plotted.

Now, take the average college entrance test score for students with a gpa of 1. And,
similarly, take the average college entrance test score for students with a gpa of 2, 3,
and 4. Connecting the dots — that is, the averages — you get a line, which we
summarize by the formula μY=E(Y)=β0+β1x. The line — which is called the
"population regression line" — summarizes the trend in the population between the
predictor x and the mean of the responses μY. We can also express the average college
entrance test score for the i-th student, E(Yi)=β0+β1xi. Of course, not every student's
college entrance test score will equal the average E(Yi). There will be some error. That
is, any student's response yi will be the linear trend β0+β1xi plus some error ϵi. So,
another way to write the simple linear regression model is yi=E(Yi)+ϵi=β0+β1xi+ϵi.
When looking to summarize the relationship between a predictor x and a response y,
we are interested in knowing the population regression line μY=E(Y)=β0+β1x. The only
way we could ever know it, though, is to be able to collect data on everybody in the
population — most often an impossible task. We have to rely on taking and using a
sample of data from the population to estimate the population regression line.
Let's take a sample of three students from each of the subpopulations — that is, three
students with a gpa of 1, three students with a gpa of 2, ..., and three students with a
gpa of 4 — for a total of 12 students. As the plot below suggests, the least squares
regression line y^=b0+b1x through the sample of 12 data points estimates the
population regression line μY=E(Y)=β0+β1x. That is, the sample intercept b0 estimates
the population intercept β0 and the sample slope b1 estimates the population slope β1.
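The estimation step can be sketched with the usual least squares formulas, b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1·x̄. The 12 data points below are hypothetical and were chosen so that the arithmetic comes out cleanly; a real sample would not fall so neatly around a line.

```python
def least_squares(xs, ys):
    """Return (b0, b1), the intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical sample: three students at each gpa of 1, 2, 3 and 4.
gpa = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
score = [5, 6, 7, 9, 10, 11, 13, 14, 15, 17, 18, 19]

b0, b1 = least_squares(gpa, score)
print(b0, b1)  # 2.0 4.0


def predict(x):
    """Fitted value on the least squares line for predictor value x."""
    return b0 + b1 * x
```

Here `b0` and `b1` play the role of the sample intercept and sample slope; a different sample of 12 students would generally give different values.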

The least squares regression line doesn't match the population regression line perfectly,
but it is a pretty good estimate. And, of course, we'd get a different least squares
regression line if we took another (different) sample of 12 such students. Ultimately, we
are going to want to use the sample slope b1 to learn about the parameter we care
about, the population slope β1. And, we will use the sample intercept b0 to learn about
the population intercept β0.
In order to draw any conclusions about the population parameters β0 and β1, we have
to make a few more assumptions about the behavior of the data in a regression setting.

We can get a pretty good feel for the assumptions by looking at our plot of gpa against
college entrance test scores.
First, notice that when we connected the averages of the college entrance test scores
for each of the subpopulations, it formed a line. Most often, we will not have the
population of data at our disposal as we pretend to do here. If we didn't, do you think it
would be reasonable to assume that the mean college entrance test scores are linearly
related to high school grade point averages?

Again, let's focus on just one subpopulation, those students who have a gpa of 1, say.
Notice that most of the college entrance scores for these students are clustered near
the mean of 6, but a few students did much better than the subpopulation's average
scoring around a 9, and a few students did a bit worse scoring about a 3. Do you get
the picture? Thinking instead about the errors, ϵi, most of the errors for these students
are clustered near the mean of 0, but a few are as high as 3 and a few are as low as
-3. If you could draw a probability curve for the errors above this subpopulation of data,
what kind of a curve do you think it would be? Does it seem reasonable to assume that
the errors for each subpopulation are normally distributed?
Looking at the plot again, notice that the spread of the college entrance test scores for
students whose gpa is 1 is similar to the spread of the college entrance test scores for
students whose gpa is 2, 3, and 4. Similarly, the spread of the errors is similar, no
matter the gpa. Does it seem reasonable to assume that the errors for each
subpopulation have equal variance?
Does it also seem reasonable to assume that the error for one student's college
entrance test score is independent of the error for another student's college entrance
test score? I'm sure you can come up with some scenarios — cheating students, for
example — for which this assumption would not hold, but if you take a random sample
from the population, it should be an assumption that is easily met.
We are now ready to summarize the four conditions or assumptions that underlie "the
simple linear regression model:"

 The mean of the response, E(Yi), at each value of the predictor, xi, is a Linear
function of the xi.
 The errors, ϵi, are Independent.
 The errors, ϵi, at each value of the predictor, xi, are Normally distributed.
 The errors, ϵi, at each value of the predictor, xi, have Equal variances (denoted σ²).
Do you notice what the capitalized first letters spell? "LINE." And, what
are we studying in this course? Lines! Get it? You might find this mnemonic a useful
way to remember the four conditions that make up what we call the "simple linear
regression model." Whenever you hear "simple linear regression model," think of these
four conditions!
An equivalent way to think of the first (linearity) condition is that the mean of the
error, E(ϵi), at each value of the predictor, xi, is zero. An alternative way to describe all
four assumptions is that the errors, ϵi, are independent normal random variables with
mean zero and constant variance, σ².
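The four conditions can be made concrete with a small simulation. This is only a sketch: the intercept 2, slope 4 and standard deviation 1 are hypothetical values, picked so that the mean response at a gpa of 1 is 6, as in the description above. The function name `simulate_responses` is also made up for this illustration.

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw one response per x from Y = beta0 + beta1 * x + error."""
    rng = random.Random(seed)
    ys = []
    for x in xs:
        # Independent, Normal errors with mean zero and Equal variance sigma^2.
        error = rng.gauss(0.0, sigma)
        # Linear mean function: E(Y) = beta0 + beta1 * x.
        ys.append(beta0 + beta1 * x + error)
    return ys

# Hypothetical population: 2500 students at each gpa of 1, 2, 3 and 4.
gpas = [1, 2, 3, 4] * 2500
scores = simulate_responses(gpas, beta0=2.0, beta1=4.0, sigma=1.0)

# Recover the errors by subtracting the true mean line from each response.
errors = [y - (2.0 + 4.0 * x) for x, y in zip(gpas, scores)]
mean_error = sum(errors) / len(errors)
print(round(mean_error, 3))  # close to zero
```

With many simulated students the average error should sit near zero, and a histogram of `errors` within each gpa group would look roughly bell-shaped with similar spread.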

Multiple Regression Analysis: Definition, Formula and Uses

In statistics, linear regression is a method for understanding how an independent
variable affects a dependent variable. In multiple regression, two or more independent
variables are used together to explain changes in the dependent variable.
Multiple regression analysis is the method that analysts and statisticians use to
fit such a model and draw conclusions from it.

In this article, we offer a multiple regression analysis definition, list the formula for
calculating multiple regression and explain how to calculate multiple regression with an
example to provide more insight into this type of statistical analysis.

 Regression analysis is a series of statistical modeling processes that helps
analysts estimate relationships between one, or multiple, independent variables
and a dependent variable.
 You can represent multiple regression analysis using the formula:
Y = b0 + b1X1 + b2X2 + ... + bpXp
 Multiple regression analysis has many applications, from business to marketing to
statistics.

Multiple regression analysis

Multiple regression analysis is a statistical evaluation tool. It's an extension of linear
regression, a process that predicts the value of one variable from the value of
another variable. The predicted variable is called the dependent variable
because its value depends on the other variable. In multiple regression, two or more
independent variables affect the value of the dependent variable. Multiple regression
analysis is simply a method for evaluating the information that comes from fitting a
regression to the data.

Multiple regression analysis formula

To perform a regression analysis, first calculate the multiple regression of your data.
You can use this formula:

Y = b0 + b1X1 + b2X2 + ... + bpXp

In this formula:

 Y stands for the predicted value, or dependent variable.
 The variables (X1), (X2) and so on through (Xp) represent the predictor values,
or independent variables, that explain changes in Y. It's important to note that each
X represents a distinct predictor.
 The variable (b0) is the intercept: the Y-value when all the independent variables (X1
through Xp) are equal to zero.
 The variables (b1) through (bp) represent the regression coefficients. Each one is the
change in Y for a one-unit increase in its X while the other X values stay fixed.
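The formula can be evaluated mechanically once the intercept and coefficients are known. The sketch below uses a hypothetical helper named `predict` and made-up numbers chosen only to show the arithmetic.

```python
def predict(intercept, coefficients, predictors):
    """Return Y = b0 + b1*X1 + ... + bp*Xp."""
    if len(coefficients) != len(predictors):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))

# Hypothetical values: b0 = 2.0, (b1, b2) = (0.5, -1.0), (X1, X2) = (4.0, 3.0).
y_hat = predict(2.0, [0.5, -1.0], [4.0, 3.0])
print(y_hat)  # 2.0 + 0.5*4.0 - 1.0*3.0 = 1.0

# When every X is zero, the prediction is just the intercept b0.
print(predict(5.0, [1.0, 1.0], [0.0, 0.0]))  # 5.0
```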

When to use multiple regression analysis

Multiple regression analysis is a useful tool in a wide range of applications. From
business, marketing and sales analytics to environmental, medical and technological
applications, multiple regression analysis helps professionals evaluate diverse data that
supports goals, processes and outcomes in many industries. Here are several ways
multiple regression analysis can benefit a business or organization:

Gives insight into predictive factors

Conducting a multiple regression analysis is useful for determining what factors are
affecting different aspects of a business' processes. For instance, revenue can be one
type of Y-value, where different independent variables like the number of sales and cost
of goods affect business revenue. With multiple regression analysis, analysts can
identify the individual activities that affect specific metrics they want to measure, giving
them better insight into how to improve efficiency and productivity.

Predicts factors affecting outcomes

When companies can analyze the factors that affect certain business operations,
management can better predict which independent variables influence the dependent
functions of the business. For example, a business analyst can predict which factors are
likely to affect an organization's future profitability, based on the results of a multiple
regression analysis.

In this case, the analyst may calculate the regression using the formula where profit is
the predicted (dependent) variable and factors like overhead, liabilities and total sales
revenue are the (X) values, each with its own (b) coefficient. When the analyst
understands how much these factors affect profits, they can better predict the variables
that may affect profits in the future.

Creates models for cause-and-effect analysis

Understanding the mathematical results that multiple regression analysis provides
allows professionals to present the information in a graph or chart. Displaying multiple
regression in this way, showing how changes in the independent variables are associated
with changes in a dependent variable, can help you explore possible cause-and-effect
relationships. A regression on its own shows association rather than proof of causation,
so treat such charts as a starting point. This can be especially useful for financial
activities like investing in stocks and securities, where traders can chart how economic
factors relate to current share prices.

Calculating multiple regression

To understand the calculations of multiple regression analysis, assume a financial
analyst wants to predict the price changes in a stock share of a major fuel company.
Using this example, follow the steps below to understand how the analyst calculates
multiple regression:

1. Determine all predictive variables

Using the example, the financial analyst must first determine all the factors that can
cause the share prices to fluctuate. While stock prices can have many influencing
factors, assume the predictive variables the analyst evaluates include interest rates,
crude oil prices and prices to move fuel resources. The analyst determines:

 The X1 variable is a 5% interest rate, or 0.05.

 The X2 variable is a current price of $50 per barrel of crude oil.
 The Xp variable is the current transport price of $25 per load of 100 barrels.

The analyst plugs these values into the formula:

Y = b0 + b1X1 + b2X2 + ... + bpXp = b0 + b1(0.05) + b2(50) + bp(25)

2. Determine the regression coefficient at time zero

Once the analyst knows the independent variables affecting share price, they can
identify the value of b0, the baseline value of Y at time zero. Time zero refers to the
value of the stock at the moment of evaluation. If the stock price is $50 when the
analyst begins their assessment, the b0 value is $50:

Y = b0 + b1X1 + b2X2 + ... + bpXp = (50) + b1(0.05) + b2(50) + bp(25)

3. Identify the regression coefficients for b variables

After determining the predictor values and the b0 value at time zero, the
analyst can find the regression coefficients for each X predictor. The regression
coefficient for the X1 variable represents the change in interest rates from time zero,
the regression coefficient for the X2 variable is the change in the price of crude oil and
the regression coefficient for the Xp variable is the change in transportation costs. The
regression coefficients, or change rates, the analyst calculates come from the
differences in prices between previous and current years. (In practice, statistical
software estimates the intercept and all coefficients together from historical data by
least squares; this example uses simplified values to show the arithmetic.) Assume the
analyst uses these values in the formula:

Y = (50) + b1(0.05) + b2(50) + bp(25) where b1 represents the change in interest
rates, b2 is the change in crude oil price and bp is the change in transportation costs
between the previous and current years. The analyst uses b1 = 0.015, b2 = 0.33 and
bp = 0.8 in the formula:

Y = (50) + (0.015)(0.05) + (0.33)(50) + (0.8)(25)

4. Sum these values

Once the analyst has all values in the formula, they can find the total sum, or the value
of Y. It looks like this:

Y = (50) + (0.015)(0.05) + (0.33)(50) + (0.8)(25)

= (50) + (0.00075) + (16.5) + (20) = 86.50075, or approximately 86.5
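The sum can be checked term by term with a few lines of Python. The sketch assumes b0 = 50, as stated in step 2, and the predictor values and coefficients listed in steps 1 and 3.

```python
import math

b0 = 50.0                           # share price at time zero
coefficients = [0.015, 0.33, 0.8]   # b1, b2, bp
predictors = [0.05, 50.0, 25.0]     # interest rate, crude oil price, transport price

# Each term is one coefficient multiplied by its predictor value.
terms = [b * x for b, x in zip(coefficients, predictors)]
y = b0 + sum(terms)

print([round(t, 5) for t in terms])  # [0.00075, 16.5, 20.0]
print(round(y, 1))                   # 86.5
```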

5. Evaluate the results

The result of the multiple regression formula is the predicted value of the dependent
variable, given the values of the independent variables. In the example of the financial
analyst evaluating the fuel company's stock, the value of Y is approximately 86.5,
expressed in the same units as the share price, so the model predicts a price of about
$86.50.

Y is a predicted value, not a probability, so it should not be read as an 86.5% chance
of the price changing. To judge how reliable the prediction is, or how volatile the
company's share price is, the analyst would look at measures such as the standard
errors of the coefficients and the spread of the residuals.

Regression models are used to describe relationships between variables by fitting a
line to the observed data. Regression allows you to estimate how a dependent variable
changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between two or
more independent variables and one dependent variable. You can use multiple
linear regression when you want to know:

1. How strong the relationship is between two or more independent variables and
one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer
added affect crop growth).
2. The value of the dependent variable at a certain value of the independent
variables (e.g. the expected yield of a crop at certain levels of rainfall,
temperature, and fertilizer addition).

Multiple linear regression example: You are a public health researcher interested in
social factors that influence heart disease. You survey 500 towns and gather data on
the percentage of people in each town who smoke, the percentage of people in each
town who bike to work, and the percentage of people in each town who have heart
disease.
Because you have two independent variables and one dependent variable, and all
your variables are quantitative, you can use multiple linear regression to analyze the
relationship between them.

Assumptions of multiple linear regression:

Multiple linear regression makes all of the same assumptions as simple linear
regression:

Homogeneity of variance (homoscedasticity): the size of the error in our
prediction doesn’t change significantly across the values of the independent variable.

Independence of observations: the observations in the dataset were collected using
statistically valid sampling methods, and there are no hidden relationships among
variables.

In multiple linear regression, it is possible that some of the independent variables are
actually correlated with one another, so it is important to check these before developing
the regression model. If two independent variables are too highly correlated (r² >
~0.6), then only one of them should be used in the regression model.

Normality: the residuals (errors) follow a normal distribution.

Linearity: the line of best fit through the data points is a straight line, rather than a
curve or some sort of grouping factor.

How to perform a multiple linear regression

Multiple linear regression formula

The formula for a multiple linear regression is:

$y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \epsilon$

 $y$ = the predicted value of the dependent variable
 $\beta_0$ = the y-intercept (value of y when all other parameters are set to 0)
 $\beta_1 X_1$ = the regression coefficient ($\beta_1$) of the first independent variable ($X_1$)
(a.k.a. the effect that increasing the value of the independent variable has on
the predicted y value)
 … = do the same for however many independent variables you are testing
 $\beta_n X_n$ = the regression coefficient of the last independent variable
 $\epsilon$ = model error (a.k.a. how much variation there is in our estimate of $y$)

To find the best-fit line for each independent variable, multiple linear regression
calculates three things:

 The regression coefficients that lead to the smallest overall model error.
 The t statistic of the overall model.
 The associated p value (how likely it is that the t statistic would have occurred
by chance if the null hypothesis of no relationship between the independent and
dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the
model.
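To show what "the regression coefficients that lead to the smallest overall model error" means in practice, here is a minimal sketch that finds them by solving the normal equations. This is one standard way to compute a least squares fit, not necessarily what any particular statistics package does internally. The function names and the small data set are hypothetical; the data are generated exactly from y = 1 + 2·x1 + 3·x2 so the expected answer is known.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so each row operation updates both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_least_squares(rows, ys):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data with no noise, so the fit should recover 1, 2 and 3.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefs = fit_least_squares(rows, ys)
print([round(c, 6) for c in coefs])
```

With real, noisy data the same function returns the coefficients with the smallest sum of squared errors; the t statistics and p values need the residual variance as well and are left to statistical software.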

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly
done via statistical software. We are going to use R for our examples because it is free,
powerful, and widely available. Download the sample dataset to try it yourself.

Load the heart.data dataset into your R environment and run the following code:

heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)

This code takes the data set heart.data and calculates the effect that the independent
variables biking and smoking have on the dependent variable heart disease using the
linear model function, lm().

Interpreting the results

To view the results of the model, you can use the summary() function:

summary(heart.disease.lm)
This function takes the most important parameters from the linear model and puts
them into a table that looks like this:

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’).
If the residuals are roughly centered around zero and with similar spread on either side,
as these do (median 0.03, and min and max around -2 and 2) then the model probably
fits the assumption of homoscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the
coefficients table is labeled (Intercept) – this is the y-intercept of the regression
equation. It’s helpful to know the estimated intercept in order to plug it into the
regression equation and predict values of the dependent variable:

heart disease = 15 + (-0.2*biking) + (0.178*smoking) ± e
The most important things to note in this output table are the next two rows – the
estimates for the independent variables.

The Estimate column is the estimated effect, also called the regression
coefficient. The estimates in the table tell us that for every one percent
increase in biking to work there is an associated 0.2 percent decrease in heart disease,
and that for every one percent increase in smoking there is an associated 0.178 percent
increase in heart disease.

The Std. Error column displays the standard error of the estimate. This number shows
how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic. Unless otherwise specified, the test
statistic used in linear regression is the t value from a two-sided t test. The larger the
test statistic, the less likely it is that the results occurred by chance.

The Pr(>|t|) column shows the p value. This shows how likely it is that the
calculated t value would have occurred by chance if the null hypothesis of no effect of
the parameter were true.

Because these values are so low (p < 0.001 in both cases), we can reject the null
hypothesis and conclude that both biking to work and smoking likely influence
rates of heart disease.
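The Estimate and t value columns can be tied together with a short calculation. The sketch below uses the rounded estimates quoted in these notes. The standard errors 0.0014 and 0.0035 are an assumption, taken from the ± values in the results paragraph of these notes, and the example town is hypothetical.

```python
# Rounded estimates quoted in these notes.
INTERCEPT, B_BIKING, B_SMOKING = 15.0, -0.2, 0.178

def predict_heart_disease(biking, smoking):
    """Predicted heart disease rate (%) for given biking and smoking rates (%)."""
    return INTERCEPT + B_BIKING * biking + B_SMOKING * smoking

def t_value(estimate, std_error):
    """Test statistic for one coefficient: the estimate divided by its standard error."""
    return estimate / std_error

# Hypothetical town: 30% bike to work, 10% smoke.
base = predict_heart_disease(30, 10)
print(round(predict_heart_disease(31, 10) - base, 3))  # -0.2
print(round(predict_heart_disease(30, 11) - base, 3))  # 0.178

# Standard errors assumed from the reported ± values.
print(round(t_value(B_BIKING, 0.0014), 1))   # -142.9
print(round(t_value(B_SMOKING, 0.0035), 1))  # 50.9
```

Raising one predictor by one point while holding the other fixed changes the prediction by exactly that predictor's coefficient, which is how the Estimate column is read. Large absolute t values like these correspond to very small p values.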

Presenting the results

When reporting your results, include the estimated effect (i.e. the regression
coefficient), the standard error of the estimate, and the p value. You should also
interpret your numbers to make it clear to your readers what the regression coefficient
means.

In our survey of 500 towns, we found significant relationships between the frequency of
biking to work and the frequency of heart disease and the frequency of smoking and
frequency of heart disease (p < 0.001 for each). Specifically we found a 0.2% decrease
(± 0.0014) in the frequency of heart disease for every 1% increase in biking, and a
0.178% increase (± 0.0035) in the frequency of heart disease for every 1% increase in
smoking.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is
somewhat more complicated than simple linear regression, because there are more
parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple
independent variables on the dependent variable, even though only one independent
variable can actually be plotted on the x-axis.

Here, we have calculated the predicted values of the dependent variable (heart disease)
across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the dependent variable, we calculated these
predicted values while holding smoking constant at the minimum, mean, and maximum
observed rates of smoking.
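The values behind such a plot can be sketched as one line of predictions per smoking level. The smoking levels and biking grid below are hypothetical stand-ins for the observed minimum, mean and maximum, which are not given in these notes; the coefficients are the rounded estimates quoted earlier.

```python
def predict_heart_disease(biking, smoking):
    """Fitted equation from these notes, using the rounded coefficients."""
    return 15 - 0.2 * biking + 0.178 * smoking

# Hypothetical smoking rates (%) standing in for the observed min, mean and max.
smoking_levels = {"min": 1.0, "mean": 15.0, "max": 30.0}
# Hypothetical grid of biking rates (%) for the x-axis.
biking_values = range(0, 80, 20)

# One line per smoking level: predictions across the biking grid.
lines = {
    label: [round(predict_heart_disease(b, s), 2) for b in biking_values]
    for label, s in smoking_levels.items()
}
for label, values in lines.items():
    print(label, values)
```

Each list is one line on the chart: the lines are parallel because the model has no interaction term, and they slope downward because the biking coefficient is negative.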

Multiple Regression Formula

Multiple regression formulas analyze the relationship between a dependent variable and
multiple independent variables. For example, in the equation Y = a + bX1 + cX2 + dX3 + E,
Y is the dependent variable, and X1, X2, and X3 are independent variables. Here a is the
intercept, b, c, and d are the slopes, and E is the residual value.

Multiple regressions are a very useful statistical method. Regression plays a very
important role in the world of finance. A lot of forecasting is done
using regression analysis. For example, one can predict the sales of a particular
segment in advance with the help of macroeconomic indicators that have a very good
correlation with that segment.

Key Takeaways

 Multiple regression formulas are used to analyze the relationship between a dependent
variable and multiple independent variables.
 This method uses two or more independent variables to forecast or predict the
dependent variable.
 The main objective is to identify and examine the relationship between the dependent
and independent variables. Based on this analysis, suitable independent variables are
selected to aid in predicting the dependent variable.
 Multiple regression is employed when linear regression alone cannot fulfill the intended
purpose, and it helps determine the effectiveness of the chosen predictor variables in
forecasting the dependent variable.

Multiple Regression Formula Explained:

The multiple regression model formula is a method to predict the dependent variable
with the help of two or more independent variables. While running this analysis, the
main purpose of the researcher is to find out the relationship between the dependent
and independent variables. Multiple independent variables are chosen that can
help predict the dependent variable. One may use it when simple linear regression,
with a single predictor, cannot serve the purpose. The regression analysis helps in
validating whether the predictor variables are good enough to help in
predicting the dependent variable.

y = m1x1 + m2x2 + m3x3 + b

Where,

 y = the dependent variable of the regression
 m1, m2, m3 = the slopes (regression coefficients) of the independent variables
 x1 = first independent variable of the regression
 x2 = second independent variable of the regression
 x3 = third independent variable of the regression
 b = constant (the intercept)

The main aim of the multiple regression method is to estimate
the coefficients that minimize the sum of squared differences between the observed
values of Y and the values that are predicted by the equation. Statistical software
packages can perform this analysis systematically because
they are designed to handle complex calculations quickly and
provide statistical evaluation of the accuracy.
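The quantity being minimized can be written as a small function. This is a sketch with hypothetical data: the three observations are generated from y = 1 + 2·x1 + 3·x2, so the true coefficients give a sum of squared differences of zero and any other choice gives a larger value.

```python
def sum_squared_errors(slopes, intercept, rows, ys):
    """Sum of squared differences between observed y and predicted y."""
    total = 0.0
    for row, y in zip(rows, ys):
        predicted = intercept + sum(m * x for m, x in zip(slopes, row))
        total += (y - predicted) ** 2
    return total

# Hypothetical observations generated from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4)]
ys = [9, 8, 19]

print(sum_squared_errors([2, 3], 1, rows, ys))  # 0.0 for the true coefficients
print(sum_squared_errors([2, 2], 1, rows, ys))  # 21.0 for a worse guess
```

Least squares fitting searches for the slopes and intercept that make this total as small as possible.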

Examples

Let us understand the concept of the multiple regression analysis formula with the
help of suitable examples.

Example #1

Let us find the relation between the distance covered by an Uber driver, the age of
the driver, and the driver's number of years of experience.

To calculate multiple regression, go to the “Data” tab in Excel and select the “Data
Analysis” option. For further procedure and calculation, refer to the Analysis ToolPak
in Excel article.

The regression formula for the above example will be

y = m1x1 + m2x2 + b
y = 604.17 × (−3.18) + 604.17 × (−4.06) + 0
y = −4377

In this particular example of the multiple regression analysis formula, we will see
which variable is the dependent variable and which variable is the independent variable.
The dependent variable in this regression equation is the distance covered by the Uber
driver, and the independent variables are the age of the driver and the number of
years of driving experience.

Example #2

Let us try and understand the concept of multiple regression analysis with
the help of another example. Let us try to find the relation between the GPA
of a class of students, the number of hours of study, and the student’s height.

Go to the “Data” tab in Excel and select the “Data Analysis” option for the calculation.

The regression equation for the above example will be

y = m1x1 + m2x2 + b

y = 1.08 × 0.03 + 1.08 × (−0.002) + 0

y = 0.0325

In this particular example, we will see which variable is the dependent variable and
which variable is the independent variable. The dependent variable in this regression is
the GPA, and the independent variables are study hours and the height of the students.

Example #3

Let us try and understand the concept of multiple regression analysis with
the help of another example. Now, let us find out the relation between the
salary of a group of employees in an organization, the number of years of
experience, and the age of the employees.

Go to the “Data” tab in Excel and select the “Data Analysis” option for the calculation.

The regression equation for the above example will be

y = m1x1 + m2x2 + b
y= 41308*.-71+41308*-824+0
y= -37019

In this particular example, we will see which variable is the dependent variable and
which variable is the independent variable. The dependent variable in this regression
equation is the salary, and the independent variables are the experience and age of the
employees.

Thus, the above examples successfully explain the formula and the concept by using
different case studies to highlight the various areas of study where it can be applied
and used to derive suitable results that can be easily interpreted.

Relevance and Uses:

Let us look at some of the uses of the concept.

 This concept is widely used to predict the values of the dependent variable from
the values of the independent variables. Examples include predicting share prices,
sales value and student performance over a period of
time. In this way it can also help in assessing the relation between multiple
variables.
 The multiple regression model equation can be used to isolate and identify any
particular factor that can impact one variable while the other variables are held constant.
 It can capture relationships between the dependent and
independent variables which are complex, not linear in nature and include more
than one predictor.
 Businesses can make decisions based on the outcome of this calculation related to
employee performance, sales figures, customer demand and satisfaction levels, etc.
 Any business is subject to a number of risks related to market movements, demand,
supply, prices, material availability and many more. In such cases this concept and
calculation can be used by finance and insurance companies to assess the return or
the claim that they may have to handle to cover such risks.
 Companies use the multiple regression model equation to assess the
extent to which the company’s marketing efforts are impacting revenue and profits,
which helps both the stakeholders and the management make crucial
decisions. The method also helps establish relationships between important variables like
GDP, employment, and inflation, which are essential factors that every country’s
government needs to look into for all-round development of the economy.
 It helps in quality control and also generates process improvement ideas that contribute
to upgrading the standard of the products and services of the organization.

Thus, we see that the concept has a number of uses in the financial as well as
statistical field. It uses complex datasets and helps businesses generate business
models or make complex financial and other types of decisions that guide the business
towards a smooth operational process. It is necessary to use the procedure in the
correct manner to get proper results.

1. Can multiple regression formulas use categorical variables?

Yes, the multiple regression formula can handle categorical variables. Techniques like
dummy coding or effect coding can be used to represent categorical variables as a set
of binary (dummy) variables. These transformed variables are then included in the
regression analysis to assess their impact on the dependent variable.
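A minimal sketch of dummy coding, assuming a hypothetical categorical column of regions. One level is left out as the reference category, so each remaining level becomes its own 0/1 column. The helper name `dummy_code` is made up for this illustration.

```python
def dummy_code(values, reference):
    """Encode a categorical column as 0/1 columns, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    encoded = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, encoded

regions = ["north", "south", "east", "north"]  # hypothetical data
levels, encoded = dummy_code(regions, reference="north")
print(levels)   # ['east', 'south']
print(encoded)  # [[0, 0], [0, 1], [1, 0], [0, 0]]
```

The reference level is encoded as all zeros, so each dummy coefficient in the regression is read as the difference from that reference category.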

2. What are the limitations of the multiple regression formula?

The limitations of multiple regression include the assumptions of linearity, independence
of observations, normality of errors, absence of multicollinearity, and homoscedasticity.
Violations of these assumptions can lead to biased or inefficient estimates and affect
the validity of the regression model’s predictions.

3. What are the benefits of the multiple regression formula?

Multiple regression offers several benefits. It allows for quantifying the relationships
between multiple independent variables and a dependent variable. It helps identify
significant predictors and control for confounding factors. It enables predictions and
forecasts based on the regression equation. Additionally, it provides insights into the
strength and direction of relationships between variables.

Simple Linear Regression

Definition: Simple linear regression aims to find a linear relationship that describes
the association between an independent variable and a possibly dependent variable. The
regression line can be used to predict or estimate missing values within the range of
the data; this is known as interpolation.

Least Squares Regression Line, LSRL

The calculation is based on the method of least squares. The idea behind it is to minimize
the sum of the squared vertical distances between all of the data points and the line of
best fit.

Consider these attempts at drawing the line of best fit. They all look like they could be a
fair line of best fit, but in fact Diagram 3 is the most accurate, as its regression line has
been calculated using the method of least squares.

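The equation of the least squares regression line is

$\hat{y} = a + bx$

where:

 $\hat{y}$ is the predicted value of $y$,
 $a = \bar{y} - b\bar{x}$,
 $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$,
 $\bar{x} = \dfrac{\sum x}{n}$,
 $\bar{y} = \dfrac{\sum y}{n}$.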

Note: The underlying statistical model here is that there is a linear relation between the
variables, say $y = a' + b'x$, and so we should regard the equation that we obtain using the
method above as resulting in an estimate for the true equation. For this reason many
authorities write $y = a + bx + \epsilon$ to emphasize this point. A further discussion on the
nature of the error $\epsilon$ is not appropriate here, but is covered in the references below.

Worked Examples

Example 1

Consider the example below where the mass, y (grams), of a chemical is related to the
time, x (seconds), for which the chemical reaction has been taking place according to the
table:

Time, x (seconds)    5     7     12    16    20

Mass, y (grams)      40    120   180   210   240

Find the equation of the regression line.

Solution

To work out the regression line the following values need to be
calculated: $a = \bar{y} - b\bar{x}$ and $b = \dfrac{S_{xy}}{S_{xx}}$. The easiest way of calculating them is by using a
table.

Start off by working out the mean of the independent and dependent variables.

$\bar{x} = \dfrac{\sum x}{n} = \dfrac{5+7+12+16+20}{5} = \dfrac{60}{5} = 12$, $\bar{y} = \dfrac{\sum y}{n} = \dfrac{40+120+180+210+240}{5} = \dfrac{790}{5} = 158$.

$x_i$    $y_i$    $x_i-\bar{x}$     $y_i-\bar{y}$       $(x_i-\bar{x})(y_i-\bar{y})$    $(x_i-\bar{x})^2$
5        40       5 − 12 = −7       40 − 158 = −118     −7 × −118 = 826                 (−7)² = 49
7        120      7 − 12 = −5       120 − 158 = −38     −5 × −38 = 190                  (−5)² = 25
12       180      12 − 12 = 0       180 − 158 = 22      0 × 22 = 0                      0² = 0
16       210      16 − 12 = 4       210 − 158 = 52      4 × 52 = 208                    4² = 16
20       240      20 − 12 = 8       240 − 158 = 82      8 × 82 = 656                    8² = 64

Totals: $\sum x = 60$, $\sum y = 790$, $\sum (x_i-\bar{x})(y_i-\bar{y}) = 1880$, $\sum (x_i-\bar{x})^2 = 154$.

Now calculate $b$

$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} = \dfrac{1880}{154} = 12.20779\ldots = 12.208$ (3 d.p.)

and calculate $a$, using the unrounded value of $b$

$a = \bar{y} - b\bar{x} = 158 - 12.20779\ldots \times 12 = 11.5064\ldots = 11.506$ (3 d.p.).

So the equation of the regression line is: $\hat{y} = a + bx = 11.506 + 12.208x$.
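The table method above can be reproduced in a few lines of Python as a check. This is a minimal sketch using the deviation form of the formulas; the function name `least_squares_line` is made up for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = Sxy / Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

time_s = [5, 7, 12, 16, 20]        # x, seconds
mass_g = [40, 120, 180, 210, 240]  # y, grams
a, b = least_squares_line(time_s, mass_g)
print(round(a, 3), round(b, 3))  # 11.506 12.208
```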

Example 2

To see how students' reaction skills have improved over a year, eight students took a
reactions test at the start of the year and at the end of the year. These are their scores:

Student          Liam   Felicity   Adian   Mel   Leroy   Vic   Lawrie   Louise

First Test, x    56     75         61      61    67      72    62       61
Second Test, y   21     39         34      21    32      24    29       24

Find the equation of the regression line given that:

$\sum x = 515$, $\sum y = 224$, $\sum x^2 = 33441$, $\sum y^2 = 6576$ and $\sum xy = 14590$.

Solution

We know that the equation of the least squares regression line is

$\hat{y} = a + bx$.

As we have been given some summed values, we are going to use the second form of the formula for $b$.

$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \dfrac{14590 - \frac{515 \times 224}{8}}{33441 - \frac{515^2}{8}} = \dfrac{170}{287.875} = 0.590534\ldots = 0.591$ (3 d.p.)

To find $a$ we need to first work out the mean of $x$ and $y$.

$\bar{x} = \dfrac{\sum x}{n} = \dfrac{515}{8} = 64.375$, $\bar{y} = \dfrac{\sum y}{n} = \dfrac{224}{8} = 28$

$a = \bar{y} - b\bar{x} = 28 - (0.590534\ldots \times 64.375) = -10.015631\ldots = -10.016$ (3 d.p.)

So the equation of our regression line is $\hat{y} = -10.016 + 0.591x$.
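When only the summed values are available, the same line can be computed from the sums directly. A minimal sketch, with a made-up function name `line_from_sums`:

```python
def line_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for y-hat = a + b*x from summary totals."""
    s_xy = sum_xy - sum_x * sum_y / n   # 14590 - 515*224/8 = 170
    s_xx = sum_x2 - sum_x ** 2 / n      # 33441 - 515^2/8 = 287.875
    b = s_xy / s_xx
    a = sum_y / n - b * sum_x / n       # y-bar minus b times x-bar
    return a, b

a, b = line_from_sums(n=8, sum_x=515, sum_y=224, sum_x2=33441, sum_xy=14590)
print(round(a, 3), round(b, 3))  # -10.016 0.591
```

Note that $\sum y^2$ is not needed for the regression line itself; it would only be used for quantities such as the correlation coefficient.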

Video Example

Alissa Grant-Walker presents a video on working out the linear regression line.

Interpreting the Regression Line

The simple linear regression line, $\hat{y} = a + bx$, can be interpreted as follows:

 $\hat{y}$ is the predicted value of $y$,
 $a$ is the intercept and predicts where the regression line will cross the y-axis,
 $b$ predicts the change in $y$ for every unit change in $x$.

We can also use the equation of the regression line for finding approximate values for
missing data.

Note: Using this to estimate outside the range of your data is unreliable.

Worked Example

Using the data from the last worked example about the mass of a chemical as time
increases, we worked out the equation of the regression line to be
$\hat{y} = 11.506 + 12.208x$. We can interpret this as: for every 1 second increase in time,
the mass of the chemical increases by 12.208 grams. The equation also tells us that
when no time has passed (when $x$ is zero), the initial mass of the chemical is
11.506 grams.

Example 1

What is the mass of the chemical after ten seconds has passed?

Solution 1

Take your equation and enter the value of time $x = 10$ and calculate $\hat{y}$.

$\hat{y} = 11.506 + 12.208 \times 10 = 133.586$.

This means that after 10 seconds of our experiment has passed, the mass of the chemical will
be 133.586 grams. Check this value against a scatter plot of our data to see if this
answer is reasonable.

Example 2

By how much does the chemical increase in weight in five seconds?

Solution 2

For every second increase in time the mass of the chemical increases
by 12.208 grams. Multiply 12.208 grams by 5 to find the increase in weight
of the chemical in 5 seconds.

$12.208 \times 5 = 61.04$ (grams).

Example 3

How much time does it take for the weight of the chemical to increase by 50 grams?

Solution 3

We know that for every second increase in time the mass of the chemical increases
by 12.208 grams. This also means it takes 1/12.208 seconds for the chemical to increase
by 1 gram. To find the time taken for the chemical to increase in weight by 50 grams
we need to multiply 1/12.208 by 50.

(1/12.208) × 50 = 4.096 seconds (3 d.p.).
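Examples 2 and 3 both use only the slope of the line, so they can be sketched together in Python. The names `SLOPE`, `mass_increase` and `time_for_increase` are illustrative assumptions, not part of the notes.

```python
SLOPE = 12.208  # grams gained per unit of time, from the fitted regression line

def mass_increase(elapsed):
    """Increase in mass (grams) over an elapsed time: slope * elapsed."""
    return SLOPE * elapsed

def time_for_increase(grams):
    """Time needed for a given increase in mass: grams / slope."""
    return grams / SLOPE

print(round(mass_increase(5), 2))       # 61.04
print(round(time_for_increase(50), 3))  # 4.096
```

The intercept plays no part here because it cancels when two predictions are subtracted.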

Hypothesis Testing Formula

The formula for one important test statistic, the z-statistic, is given below:

z = (x̄ − μ) / (σ / √n)

Here x̄ is the sample mean, μ is the population mean, σ is the population standard
deviation and n is the size of the sample.

Hypothesis Testing

Hypothesis testing is a statistical tool for assessing whether an assumption about a
whole population is supported by the evidence. It judges the validity of an inference by
evaluating sample data drawn from that population.

The approach rests on probability: it asks how likely the observed sample result would
be if the initial (null) hypothesis were true. It is widely applied in research, including
biology, criminal trials, marketing, and manufacturing.

Key Takeaways

• Hypothesis testing is a method of statistical inference that examines a sample to
judge whether a result holds for the population.
• The test weighs two explanations for the data: the null hypothesis and the
alternative hypothesis. If the sample mean is close enough to the hypothesized
population mean, the null hypothesis is not rejected; this does not prove it true.
• If instead the sample mean differs from the hypothesized population mean by more
than chance alone would plausibly explain, the null hypothesis is rejected in
favour of the alternate hypothesis.
• This method demands careful analysis and interpretation, so it is not easy to
apply well. Also, this method relies heavily on probability.

Hypothesis Testing in Statistics Explained

Hypothesis testing uses sample data to evaluate a research claim. Researchers speculate
on relationships between various factors. They then collect data to test those
relationships. Based on the data, researchers draw conclusions. In statistics, it is very
important to account for randomness: an apparent effect in the data may have been caused
by chance or a random factor. Hypothesis testing does not eliminate this uncertainty, but
it quantifies how plausible chance is as an explanation.

For every research experiment, there are mainly two explanations: the null
hypothesis and the alternative hypothesis. It is often difficult to prove a theory
directly; therefore, investigators test whether the data allow them to reject the null
hypothesis. When the null hypothesis is rejected, the alternate hypothesis is taken to be
supported.

For example, suppose we believe that the returns from the NASDAQ stock index are not
zero. Then the null hypothesis would state: ‘the return from the NASDAQ is zero.’
Tests can be conducted at different levels of statistical significance.

Hypothesis tests are prone to two errors—type 1 and type 2. If the null hypothesis is
rejected by the sample outcome despite being true—it is considered a type 1 error.
Similarly, if the sample data fails to reject the null hypothesis, despite the null
hypothesis being false, it is considered a type 2 error.
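A type 1 error can be illustrated by simulation. The sketch below repeatedly draws samples from a population in which the null hypothesis really is true and counts how often a two-tailed z-test at the 0.05 level rejects it anyway. All names and parameter values (`type_1_error_rate`, 10,000 trials, samples of 30, the fixed seed) are assumptions chosen for illustration.

```python
import random

def type_1_error_rate(trials=10_000, n=30, mu0=0.0, sigma=1.0, z_crit=1.96, seed=0):
    """Estimate how often a true null hypothesis is rejected by a two-tailed z-test."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # The sample is drawn with mean mu0, so the null hypothesis is true.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - mu0) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1  # rejecting a true null hypothesis is a type 1 error
    return rejections / trials

rate = type_1_error_rate()
print(rate)  # expected to be close to the significance level, 0.05
```

The estimated rate should sit near 0.05, which is what the significance level means: the long-run proportion of type 1 errors when the null hypothesis is true.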

Hypothesis Testing Types

Hypothesis tests are further categorized into sub-types:

1. Simple: In a simple hypothesis, the population parameter is stated as a specific value,
making the analysis easier.
2. Composite: In a composite hypothesis, the population parameter is stated as a range of
values, for example between a lower and an upper value, rather than as a single value.
3. One-tailed: In a one-tailed test, the alternative hypothesis specifies a direction: the
population parameter is either higher or lower than the hypothesized value. The
rejection region lies in one tail of the sampling distribution.
4. Two-tailed: In a two-tailed test, the alternative hypothesis is that the population
parameter differs from the hypothesized value in either direction. The rejection region
is split between both tails of the sampling distribution.
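The practical difference between one-tailed and two-tailed z-tests shows up in the critical value. The following sketch uses `statistics.NormalDist` from the Python standard library; the function name `critical_values` is an illustrative assumption.

```python
from statistics import NormalDist

def critical_values(alpha):
    """Return the (one-tailed, two-tailed) critical z-values for significance level alpha."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    one_tailed = std_normal.inv_cdf(1 - alpha)      # all of alpha in one tail
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # alpha split across both tails
    return one_tailed, two_tailed

one, two = critical_values(0.05)
print(round(one, 3), round(two, 3))  # 1.645 1.96
```

At the same significance level the two-tailed test needs a larger test statistic to reject, because only half of the rejection probability sits in each tail.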

Hypothesis Testing Steps

Hypothesis tests involve the following steps:

• Researchers first state the null hypothesis and the alternative hypothesis. For
example, the null hypothesis may assume that two variables are not correlated,
while the alternative hypothesis states that they are.
• Then they collect relevant sample data, which should closely represent the whole
population on which the test is to be performed.
• Next, researchers choose a statistical test that suits the collected data.
• Based on the test results and level of significance, they either reject or fail to
reject the null hypothesis.
• Finally, the statistical findings are compiled and summarized into a research
report.

Hypothesis Testing Formula

Researchers opt for different statistical tests like t-tests or z-tests. The z-test formula
is as follows:

Z = (x̅ – μ0) / (σ / √n)

• Here, x̅ is the sample mean,
• μ0 is the hypothesized population mean,
• σ is the population standard deviation,
• n is the sample size.
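The z-test formula translates directly into code. This is a minimal sketch; the function name `z_statistic` and the numbers in the usage line are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample_mean - mu0) / (sigma / sqrt(n))."""
    standard_error = sigma / math.sqrt(n)  # standard deviation of the sample mean
    return (sample_mean - mu0) / standard_error

# Hypothetical values: sample mean 52, hypothesized mean 50, sigma 10, n = 25.
print(z_statistic(52, 50, 10, 25))  # 1.0
```

The denominator, σ / √n, is the standard error of the mean, so Z measures how many standard errors the sample mean lies from μ0.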

Based on the Z-test result, the researcher draws a conclusion about the hypotheses,
which are stated as follows:

H0: μ = μ0

Ha: μ ≠ μ0

Here, H0 is the null hypothesis and Ha is the alternate hypothesis.

If the sample mean is close enough to μ0 that the Z value stays within the critical
value, the null hypothesis is not rejected. Otherwise, it is rejected in favour of the
alternate hypothesis.
Hypothesis Testing Calculation with Examples

A battery manufacturing company claims that the average life of its two-wheeler
batteries is 2.1 years. The quality inspector surveyed ten customers to know the lasting
period of their batteries. The following data was collected:

Customer No.    Battery Life (in years)
1               1.9
2               2.3
3               2.1
4               2.2
5               1.9
6               2.4
7               2.1
8               2.3
9               2.2
10              2.0

If the population standard deviation is 0.17 years and the significance level is 0.05,
conduct a hypothesis test of the company’s claim.

Solution:

Given:

μ0= 2.1 years

σ = 0.17

n = 10

Level of Significance = 0.05

Taking the company’s claim of an average battery life of 2.1 years as the null
hypothesis, we test:

H0: μ = μ0, against

Ha: μ ≠ μ0

Sample mean (x̅ ) = (1.9 + 2.3 + 2.1 + 2.2 + 1.9 + 2.4 + 2.1 + 2.3 + 2.2 + 2.0) / 10 =
2.14 years.

Applying the Z-test formula:

Z = (x̅ – μ0) / (σ /√n)

Z = (2.14 – 2.1) / (0.17 / √10) = 0.744

The level of significance is 0.05 and the alternate hypothesis is two-sided, so the
critical z-value is 1.96. (The value 1.645 applies to a one-tailed test at the same
level.) Let us now compare the Z-test result with it.

0.744 < 1.96; therefore, we fail to reject the null hypothesis.

Thus, the sample gives no evidence against the company’s claim that the average life of
its batteries is 2.1 years.
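The battery example can be reproduced end to end in Python. This is a sketch under one stated assumption: because Ha: μ≠μ0 is two-sided, it uses the two-tailed critical value for a 0.05 significance level. Variable names are illustrative.

```python
import math
from statistics import NormalDist, mean

# Battery life in years for the ten surveyed customers.
battery_life = [1.9, 2.3, 2.1, 2.2, 1.9, 2.4, 2.1, 2.3, 2.2, 2.0]
mu0, sigma, alpha = 2.1, 0.17, 0.05  # claimed mean, given std deviation, significance level

x_bar = mean(battery_life)
z = (x_bar - mu0) / (sigma / math.sqrt(len(battery_life)))
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # assumption: two-tailed critical value
reject_h0 = abs(z) > z_crit

print(round(x_bar, 2), round(z, 3), round(z_crit, 2), reject_h0)  # 2.14 0.744 1.96 False
```

Since `reject_h0` is `False`, the sample does not give grounds to reject the claimed mean of 2.1 years.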

Relevance and Use

Hypothesis testing evaluates a theory with the help of systematic statistical inference.
However, proving a theory directly is not easy in practice. Therefore, researchers try to
reject the null hypothesis in order to support the alternate explanation.

Hypothesis testing is widely applied in psychology, biology, medicine, finance,
production, marketing, advertising, and criminal trials.

Limitations

Hypothesis testing is all about assumptions and interpretations. It, therefore, requires
superior analytical abilities. As a result, it is inaccessible for most.

Also, this method heavily relies on mere probability. There can be errors in data. It
works better for large sample sizes. For smaller sample sets, this approach may not be
the most suitable.

P-value in hypothesis testing

The p-value is the probability, assuming the null hypothesis is true, of obtaining a
result at least as extreme as the one observed in the sample. It is compared with the
chosen significance level. A p-value above the significance level means the null
hypothesis is not rejected, while a p-value at or below it leads to rejection of the null
hypothesis in favour of the alternate hypothesis.
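For a z-test, the p-value can be computed from the standard normal distribution. The sketch below assumes a two-sided test and applies it to the Z value of 0.744 from the battery example; the function name `two_sided_p_value` is an illustrative assumption.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability, under the null hypothesis, of a |Z| at least as large as the one observed."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail  # both tails count in a two-sided test

p = two_sided_p_value(0.744)
print(round(p, 3))
print(p <= 0.05)  # False: the null hypothesis is not rejected at the 0.05 level
```

Comparing the p-value with the significance level leads to the same decision as comparing Z with the critical value.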

The null and alternative hypothesis

A null hypothesis is a statement that the population mean equals a hypothesized value,
so that any gap between the sample mean and that value is due to chance. An alternative
hypothesis is the opposite of the null hypothesis, i.e., it states that the population
mean differs from the hypothesized value.

Importance of hypothesis testing
It is a useful statistical tool for judging whether conclusions drawn from sample data
hold for the whole population. It is used in scientific research, medical research,
psychology, manufacturing, marketing, advertising, and criminal trials.
