Revision Notes on Probability and Regression Analysis: Both Classes
Meaning of probability:
Types of probability:
Conditional: explains the probability of one event happening based on the
prior occurrence of another, so one is dependent on the other.
There are three methods for determining the probability of any event, and they are
based on the rules of:
There is also Laplace’s rule, which states that, in a random experiment whose
outcomes are all equally probable, the probability of an event is the number of
favourable cases divided by the number of possible cases.
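As a minimal illustration of Laplace's rule, the sketch below counts favourable and possible outcomes for one roll of a fair six-sided die. The die example and the helper name `laplace_probability` are assumptions added here for illustration; they are not part of the original notes.

```python
from fractions import Fraction


def laplace_probability(favourable, possible):
    """Laplace's rule: favourable cases divided by possible cases.

    Only valid when every outcome in `possible` is equally probable.
    """
    favourable = set(favourable)
    possible = set(possible)
    if not possible:
        raise ValueError("the sample space must not be empty")
    if not favourable <= possible:
        raise ValueError("favourable cases must belong to the sample space")
    return Fraction(len(favourable), len(possible))


# Illustrative example (assumed): rolling an even number with a fair die.
die = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
p_even = laplace_probability(even, die)
print(p_even)  # 1/2
```

Using `Fraction` keeps the result exact, which makes the "cases divided by cases" reading of the rule visible in the output.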
3. Behavioral analysis: in this type of application, probability is used to
evaluate certain behaviors of a population sample so that certain patterns of
opinions, behaviors, or thoughts can be predicted.
4. Medical research: the success of vaccines, as well as their side effects in a
population, is an example that’s determined by probabilistic calculations.
Index Numbers:
Characteristics, Formula, Examples, Types, Importance and Limitations
This section discusses: 1. Meaning of Index Numbers 2. Features of Index
Numbers 3. Steps or Problems in the Construction 4. Construction of Price Index
Numbers (Formula and Examples) 5. Difficulties in Measuring Changes in Value of
Money 6. Types of Index Numbers 7. Importance 8. Limitations.
The value of money does not remain constant over time. It rises or falls and is
inversely related to the changes in the price level. A rise in the price level means a
fall in the value of money and a fall in the price level means a rise in the value of
money. Thus, changes in the value of money are reflected by the changes in the
general level of prices over a period of time. Changes in the general level of prices
can be measured by a statistical device known as ‘index number.’
A price index number indicates the average of changes in the prices of representative
commodities at one time in comparison with that at some other time taken as the
base period. According to L.V. Lester, “An index number of prices is a figure
showing the height of average prices at one time relative to their height at some
other time which is taken as the base period.”
(i) Index numbers are a special type of average. Whereas mean, median and mode
measure the absolute changes and are used to compare only those series which are
expressed in the same units, the technique of index numbers is used to measure
the relative changes in the level of a phenomenon where the measurement of
absolute change is not possible and the series are expressed in different types of
items.
(ii) Index numbers are meant to study the changes in the effects of such factors
which cannot be measured directly. For example, the general price level is an
imaginary concept and is not capable of direct measurement. But, through the
technique of index numbers, it is possible to have an idea of relative changes in the
general level of prices by measuring relative changes in the price level of different
commodities.
(iii) The technique of index numbers measures changes in one variable or group of
related variables. For example, one variable can be the price of wheat, and a group of
variables can be the price of sugar, the price of milk and the price of rice.
The construction of the price index numbers involves the following steps
or problems:
1. Selection of Base Year:
The first step or the problem in preparing the index numbers is the selection of the
base year. The base year is defined as that year with reference to which the price
changes in other years are compared and expressed as percentages. The base year
should be a normal year.
In other words, it should be free from abnormal conditions like wars, famines,
floods, political instability, etc. The base year can be selected in two ways: (a) through
fixed base method in which the base year remains fixed; and (b) through chain
base method in which the base year goes on changing, e.g., for 1980 the base year
will be 1979, for 1979 it will be 1978, and so on.
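The difference between the two base-year methods can be sketched in a few lines. The prices below are hypothetical figures for a single commodity, and the function names are illustrative only; the notes themselves give no worked example here.

```python
def fixed_base_index(prices, base_year):
    """Each year's price as a percentage of the price in one fixed base year."""
    base = prices[base_year]
    return {year: price / base * 100 for year, price in prices.items()}


def chain_base_index(prices):
    """Each year's price as a percentage of the previous year's price."""
    years = sorted(prices)
    return {
        year: prices[year] / prices[prev] * 100
        for prev, year in zip(years, years[1:])
    }


# Hypothetical prices of one commodity (assumed, not from the notes).
prices = {1978: 50.0, 1979: 55.0, 1980: 66.0}

fixed = fixed_base_index(prices, base_year=1978)
chain = chain_base_index(prices)

for year in sorted(prices):
    chain_text = f"{chain[year]:.1f}" if year in chain else "n/a"
    print(f"{year}: fixed base = {fixed[year]:.1f}, chain base = {chain_text}")
```

With a fixed base every year is compared with 1978; with a chain base, 1980 is compared with 1979 and 1979 with 1978, as described above.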
2. Selection of Commodities:
The second problem in the construction of index numbers is the selection of the
commodities. Since all commodities cannot be included, only representative
commodities should be selected keeping in view the purpose and type of the index
number.
(a) The items should be representative of the tastes, habits and customs of the
people.
(c) Items should be stable in quality over two different periods and places.
(d) The economic and social importance of various items should be considered.
(f) All those varieties of a commodity which are in common use and are stable in
character should be included.
3. Collection of Prices:
(a) Prices are to be collected from those places where a particular commodity is
traded in large quantities.
(c) In selecting individuals and institutions who would supply price quotations, care
should be taken that they are not biased.
(d) Selection of wholesale or retail prices depends upon the type of index number
to be prepared. Wholesale prices are used in the construction of general price index
and retail prices are used in the construction of cost-of-living index number.
4. Selection of Average:
Since index numbers are a specialized average, the fourth problem is to choose
a suitable average. Theoretically, geometric mean is the best for this purpose. But,
in practice, arithmetic mean is used because it is easier to follow.
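One way to see why the geometric mean is often preferred in theory is to average two price relatives that offset each other. The figures below are hypothetical: one price halves (relative 50) and another doubles (relative 200).

```python
from statistics import geometric_mean, mean

# Hypothetical price relatives (base year = 100): one price halved, one doubled.
price_relatives = [50.0, 200.0]

arithmetic = mean(price_relatives)
geometric = geometric_mean(price_relatives)

print(f"arithmetic mean of relatives: {arithmetic:.1f}")
print(f"geometric mean of relatives:  {geometric:.1f}")
```

The arithmetic mean reports a rise even though the two proportional changes cancel out, while the geometric mean stays at the base level. This is only a sketch of one argument for the geometric mean, not a full treatment.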
5. Selection of Weights:
Generally, all the commodities included in the construction of index numbers are
not of equal importance. Therefore, if the index numbers are to be representative,
proper weights should be assigned to the commodities according to their relative
importance.
For example, the prices of books will be given more weightage while preparing the
cost-of-living index for teachers than while preparing the cost-of-living index for the
workers. Weights should be unbiased and be rationally and not arbitrarily selected.
6. Purpose of Index Numbers:
The most important consideration in the construction of the index numbers is the
objective of the index numbers. All other problems or steps are to be viewed in the
light of the purpose for which a particular index number is to be prepared. Since
different index numbers are prepared for specific purposes and no single index
number is ‘all purpose’ index number, it is important to be clear about the purpose
of the index number before its construction.
7. Selection of Method:
The selection of a suitable method for the construction of index numbers is the final
step.
A simple index number can be constructed either by (i) the simple aggregative
method, or by (ii) the simple average of price relatives method. Similarly, a weighted
index number can be constructed either by (i) the weighted aggregative method, or by
(ii) the weighted average of price relatives method. The choice of method depends
upon the availability of data, degree of accuracy required and the purpose of the
study.
1. Simple Aggregative Method:
In this method, the index number is equal to the sum of prices for the year for
which index number is to be found divided by the sum of actual prices for the base
year.
The formula for finding the index number through this method is as
follows:
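The formula itself did not survive in these notes. Read from the description above, with the usual scaling to a base of 100, it is P01 = (ΣP1 / ΣP0) × 100, where P1 are the current-year prices and P0 the base-year prices. The sketch below applies it to hypothetical prices; the scaling by 100 and the data are assumptions.

```python
def simple_aggregative_index(base_prices, current_prices):
    """P01 = (sum of current-year prices / sum of base-year prices) * 100."""
    if len(base_prices) != len(current_prices):
        raise ValueError("both years must cover the same commodities")
    return sum(current_prices) / sum(base_prices) * 100


# Hypothetical prices of three commodities (assumed, not from the notes).
base_prices = [20.0, 50.0, 30.0]     # base year, sum = 100
current_prices = [25.0, 60.0, 40.0]  # current year, sum = 125

index = simple_aggregative_index(base_prices, current_prices)
print(index)  # 125.0
```

An index of 125 would mean that, on this unweighted measure, prices are 25 per cent above the base year.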
2. Simple Average of Price Relatives Method:
In this method, the index number is equal to the sum of price relatives
divided by the number of items and is calculated by using the following
formula:
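The formula is missing from these notes. Following the description above, it is taken here to be P01 = Σ(P1 / P0 × 100) / N, where each term is a price relative and N is the number of items. The data below are hypothetical.

```python
def simple_average_of_relatives(base_prices, current_prices):
    """P01 = sum of price relatives / number of items.

    A price relative is the current price as a percentage of the base price.
    """
    if len(base_prices) != len(current_prices):
        raise ValueError("both years must cover the same commodities")
    relatives = [p1 / p0 * 100 for p0, p1 in zip(base_prices, current_prices)]
    return sum(relatives) / len(relatives)


# Hypothetical prices of three commodities (assumed, not from the notes).
base_prices = [20.0, 50.0, 30.0]
current_prices = [25.0, 60.0, 40.0]

index = simple_average_of_relatives(base_prices, current_prices)
print(f"{index:.2f}")
```

Unlike the simple aggregative method, this average is not dominated by commodities with high absolute prices, because each item enters as a percentage change.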
3. Weighted Aggregative Method:
In this method, different weights are assigned to the items according to their
relative importance. Weights used are the quantity weights. Many formulae have
been developed to estimate index numbers on the basis of quantity weights.
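The specific formulae did not survive in these notes. As an assumption about which ones were meant, the sketch below shows three standard quantity-weighted formulae: Laspeyres (base-year quantities), Paasche (current-year quantities) and Fisher's ideal index (geometric mean of the two). All prices and quantities are hypothetical.

```python
from math import sqrt


def laspeyres(p0, q0, p1):
    """Laspeyres: sum(p1*q0) / sum(p0*q0) * 100, base-year quantity weights."""
    return sum(a * q for a, q in zip(p1, q0)) / sum(b * q for b, q in zip(p0, q0)) * 100


def paasche(p0, p1, q1):
    """Paasche: sum(p1*q1) / sum(p0*q1) * 100, current-year quantity weights."""
    return sum(a * q for a, q in zip(p1, q1)) / sum(b * q for b, q in zip(p0, q1)) * 100


def fisher(p0, q0, p1, q1):
    """Fisher's ideal index: geometric mean of Laspeyres and Paasche."""
    return sqrt(laspeyres(p0, q0, p1) * paasche(p0, p1, q1))


# Hypothetical data for three commodities (assumed, not from the notes).
p0, q0 = [2.0, 5.0, 4.0], [10.0, 4.0, 5.0]  # base-year prices and quantities
p1, q1 = [3.0, 6.0, 5.0], [8.0, 5.0, 4.0]   # current-year prices and quantities

L = laspeyres(p0, q0, p1)
P = paasche(p0, p1, q1)
F = fisher(p0, q0, p1, q1)
print(f"Laspeyres = {L:.2f}, Paasche = {P:.2f}, Fisher = {F:.2f}")
```

The three results differ only because of the choice of quantity weights, which is the point made in the paragraph above.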
4. Weighted Average of Relatives Method:
In this method also different weights are used for the items according to their
relative importance.
The price index number is found out with the help of the following
formula:
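The formula is missing here. A common textbook form, assumed in this sketch, is P01 = Σ(R × W) / ΣW, where R = P1 / P0 × 100 is the price relative of an item and W is its weight. Base-year expenditure (P0 × Q0) is used as the weight below; that choice and the data are assumptions.

```python
def weighted_average_of_relatives(p0, p1, weights):
    """P01 = sum(R * W) / sum(W), with R = p1 / p0 * 100 for each item."""
    relatives = [b / a * 100 for a, b in zip(p0, p1)]
    return sum(r * w for r, w in zip(relatives, weights)) / sum(weights)


# Hypothetical data for three commodities (assumed, not from the notes).
p0, q0 = [2.0, 5.0, 4.0], [10.0, 4.0, 5.0]  # base-year prices and quantities
p1 = [3.0, 6.0, 5.0]                        # current-year prices

# Assumed weights: base-year expenditure on each item.
weights = [price * qty for price, qty in zip(p0, q0)]

index = weighted_average_of_relatives(p0, p1, weights)
print(f"{index:.2f}")
```

With base-year expenditure weights this gives the same value as a base-weighted (Laspeyres) aggregative index on the same data; other weights would give a different result.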
Difficulties in Measuring Changes in Value of Money:
Measurement of changes in the value of money through price index number is not
an easy and reliable technique. There are a number of theoretical as well as
practical difficulties in the construction of price index numbers. Moreover, the index
number technique itself has many limitations.
The concept of money is vague, abstract and cannot be clearly defined. The value
of money is a relative concept which changes from person to person depending
upon the type of goods on which the money is spent.
2. Inaccurate Measurement:
Price index numbers do not measure the changes in the value of money accurately
and reliably. A rise or fall in the general level of prices as indicated by the price
index numbers does not mean that the price of every commodity has risen or fallen
to the same extent.
Price index numbers are averages and measure general changes in the value of
money on the average. Therefore, they are not of much significance for the
particular individuals who may be affected by the changes in the actual prices quite
differently from that indicated by the index numbers.
The wholesale price index numbers, which are generally used to measure
changes in the value of money, suffer from certain limitations:
(a) They do not reflect the changes in the cost of living because retail prices are
generally higher than the wholesale prices.
(b) They ignore some of the important items concerning the urban population, such
as, expenditure on education, transport, house rent, etc.
(c) They do not take into consideration the changes in the consumers’ preferences.
1. Selection of Base Year:
While preparing the index number, the first difficulty arises regarding the selection of
base year. The base year should be a normal year. But, it is very difficult to find out
a fully normal year free from any unusual happening. There is every possibility that
the selected base year may be an abnormal year, or a distant year, or may be
selected by an immature or biased person.
2. Selection of Items:
(a) With the passage of time the quality of the product may change; if the quality
of a product changes in the year of enquiry from what it was in the base year, the
product becomes irrelevant.
(b) The relative importance of certain commodities may change due to a change in
the consumption pattern of the people in the course of time; for example, Vanaspati
Ghee was not an important item of consumption in India in the pre-war period, but
today it has become an item of necessity. Under such conditions, it is not easy to
select the appropriate commodities.
3. Collection of Prices:
4. Assigning Weights:
Another important difficulty that arises in preparing the index numbers is that of
assigning proper weights to different items in order to arrive at correct and
unbiased conclusions. As there are no hard and fast rules for assigning weights to the
commodities according to their relative importance, there is every likelihood that the
weights are decided arbitrarily on the basis of personal judgement and involve
bias.
5. Selection of Averages:
Another major problem is deciding which average should be employed to find out the
price relatives. There are many types of averages, such as the arithmetic mean,
geometric mean, median, mode, etc. The use of different averages gives
different results. Therefore, it is essential to select the method with great care. Dr.
Marshall has advocated the use of chain index number to solve the problem of
averaging and weighting.
In the dynamic world, the consumption pattern of the individuals and the number
and varieties of goods undergo continuous changes.
They create difficulties for preparing index numbers and making temporal
comparisons:
(a) Since, in the course of time, old commodities may disappear and many new
ones come into existence, the long-run comparison may become difficult,
(b) The quantity and quality of commodities may also change over the period of
time, thus making the choice of commodities for constructing index numbers
difficult,
(c) A number of factors, like income, education, fashion, etc., bring changes in the
consumption pattern of the people which render the index numbers incomparable.
Wholesale price index numbers are constructed on the basis of the wholesale prices
of certain important commodities. The commodities included in preparing these
index numbers are mainly raw materials and semi-finished goods. Only the most
important and most price-sensitive raw materials and semi-finished goods which are bought and
sold in the wholesale market are selected and weights are assigned in accordance
with their relative importance.
The wholesale price index numbers are generally used to measure changes in the
value of money. The main problem with these index numbers is that they include
only the wholesale prices of raw materials and semi-finished goods and do not take
into consideration the retail prices of goods and services generally consumed by the
common man. Hence, the wholesale price index numbers do not reflect true and
accurate changes in the value of money.
These index numbers are prepared to measure the changes in the value of money
on the basis of the retail prices of final consumption goods. The main difficulty with
this index number is that the retail price for the same goods and for continuous
periods is not available. The retail prices represent larger and more frequent
fluctuations as compared to the wholesale prices.
These index numbers are constructed with reference to the important goods and
services which are consumed by common people. Since the number of these goods
and services is very large, only representative items which form the consumption
pattern of the people are included. These index numbers are used to measure
changes in the cost of living of the general public.
The working class cost-of-living index numbers aim at measuring changes in the
cost of living of workers. These index numbers are constructed on the basis of only
those goods and services which are generally consumed by the working class. These
index numbers are of great importance to the workers
because their wages are adjusted according to these indices.
The purpose of these index numbers is to measure time to time changes in money
wages. These index numbers, when compared with the working class cost-of-living
index numbers, provide information regarding the changes in the real wages of the
workers.
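The comparison described above is usually made by dividing the money wage index by the cost-of-living index. The sketch below assumes that standard definition, real wage index = (money wage index / cost-of-living index) × 100, and uses hypothetical index values.

```python
def real_wage_index(money_wage_index, cost_of_living_index):
    """Real wage index = money wage index / cost-of-living index * 100."""
    if cost_of_living_index <= 0:
        raise ValueError("the cost-of-living index must be positive")
    return money_wage_index / cost_of_living_index * 100


# Hypothetical indices, both with the same base year = 100 (assumed values).
money_wages = 150.0     # money wages are 50% above the base year
cost_of_living = 120.0  # the cost of living is 20% above the base year

real_wages = real_wage_index(money_wages, cost_of_living)
print(real_wages)  # 125.0
```

In this illustration money wages rose by 50 per cent but real wages by only 25 per cent, because part of the rise was absorbed by higher living costs.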
Index numbers are used to measure all types of quantitative changes in different
fields.
Various advantages of index numbers are given below:
1. General Importance:
(b) They are useful in making comparisons with respect to different places or
different periods of time,
Index numbers are used to measure changes in the value of money or the price
level from time to time. Changes in the price level generally influence production
and employment of the country as well as various sections of the society. The price
index numbers also forewarn about the future inflationary tendencies and in this
way, enable the government to take appropriate anti-inflationary measures.
Index numbers highlight changes in the cost of living in the country. They indicate
whether the cost of living of the people is rising or falling. On the basis of this
information, the wages of the workers can be adjusted accordingly to save the
wage earners from the hardships of inflation.
4. Changes in Production:
Index numbers are also useful in providing information regarding production trends
in different sectors of the economy. They help in assessing the actual condition of
different industries, i.e., whether production in a particular industry is increasing or
decreasing or is constant.
5. Importance in Trade:
With the help of index numbers, knowledge about the trade
conditions and trade trends can be obtained. The import and export indices show
whether foreign trade of the country is increasing or decreasing and whether the
balance of trade is favourable or unfavourable.
Index numbers are useful in almost all the fields. They are especially important in
the economic field.
Some of the specific uses of index numbers in the economic field are:
(b) In the share market, the index numbers can provide data about the trends in
the share prices,
(c) With the help of index numbers, the Railways can get information about the
changes in goods traffic.
(d) The bankers can get information about the changes in deposits by means of
index numbers.
Index number technique itself has certain limitations which have greatly reduced its
usefulness:
(i) Because of the various practical difficulties involved in their computation, the
index numbers are never cent per cent correct.
(ii) There are no all-purpose index numbers. The index numbers prepared for one
purpose cannot be used for another purpose. For example, the cost-of-living index
numbers of factory workers cannot be used to measure changes in the value of
money of the middle income group.
(iv) Index numbers measure only average change and indicate only broad trends.
They do not provide accurate information.
(v) While preparing index numbers, quality of items is not considered. It may be
possible that a general rise in the index is due to an improvement in the quality of a
product and not because of a rise in its price.
The Simple Linear Regression Model:
We have worked hard to come up with formulas for the intercept b0 and the slope b1 of
the least squares regression line. But, we haven't yet discussed
what b0 and b1 estimate.
What do b0 and b1 estimate?
Let's investigate this question with another example. Below is a plot illustrating a
potential relationship between the predictor "high school grade point average (gpa)"
and the response "college entrance test score." Only four groups ("subpopulations") of
students are considered — those with a gpa of 1, those with a gpa of 2, ..., and those
with a gpa of 4.
Let's focus for now just on those students who have a gpa of 1. As you can see, there
are so many data points — each representing one student — that the data points run
together. That is, the data on the entire subpopulation of students with a gpa of 1 are
plotted. And, similarly, the data on the entire subpopulation of students with gpas of 2,
3, and 4 are plotted.
Now, take the average college entrance test score for students with a gpa of 1. And,
similarly, take the average college entrance test score for students with a gpa of 2, 3,
and 4. Connecting the dots — that is, the averages — you get a line, which we
summarize by the formula μY=E(Y)=β0+β1x. The line — which is called the
"population regression line" — summarizes the trend in the population between the
predictor x and the mean of the responses μY. We can also express the average college
entrance test score for the i-th student, E(Yi)=β0+β1xi. Of course, not every student's
college entrance test score will equal the average E(Yi). There will be some error. That
is, any student's response yi will be the linear trend β0+β1xi plus some error ϵi. So,
another way to write the simple linear regression model is yi=E(Yi)+ϵi=β0+β1xi+ϵi.
When looking to summarize the relationship between a predictor x and a response y,
we are interested in knowing the population regression line μY=E(Y)=β0+β1x. The only
way we could ever know it, though, is to be able to collect data on everybody in the
population — most often an impossible task. We have to rely on taking and using a
sample of data from the population to estimate the population regression line.
Let's take a sample of three students from each of the subpopulations — that is, three
students with a gpa of 1, three students with a gpa of 2, ..., and three students with a
gpa of 4 — for a total of 12 students. As the plot below suggests, the least squares
regression line ŷ = b0 + b1x through the sample of 12 data points estimates the
population regression line μY=E(Y)=β0+β1x. That is, the sample intercept b0 estimates
the population intercept β0 and the sample slope b1 estimates the population slope β1.
The least squares regression line doesn't match the population regression line perfectly,
but it is a pretty good estimate. And, of course, we'd get a different least squares
regression line if we took another (different) sample of 12 such students. Ultimately, we
are going to want to use the sample slope b1 to learn about the parameter we care
about, the population slope β1. And, we will use the sample intercept b0 to learn about
the population intercept β0.
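The idea that b0 and b1 estimate β0 and β1 can be sketched with a small simulation. The population values below (β0 = 2, β1 = 4, error standard deviation 1) are assumptions chosen only so that the mean score at a gpa of 1 is 6; they are not taken from the plot in these notes. The slope and intercept use the usual least squares formulas.

```python
import random


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Assumed population regression line and error spread (illustrative only).
BETA0, BETA1, SIGMA = 2.0, 4.0, 1.0

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Three sampled students at each gpa of 1, 2, 3 and 4: 12 students in total.
gpas = [g for g in (1, 2, 3, 4) for _ in range(3)]
scores = [BETA0 + BETA1 * g + rng.gauss(0, SIGMA) for g in gpas]

b0, b1 = least_squares(gpas, scores)
print(f"b0 = {b0:.2f} estimates beta0 = {BETA0}")
print(f"b1 = {b1:.2f} estimates beta1 = {BETA1}")
```

Changing the seed draws a different sample of 12 students and gives a slightly different line, which is the sampling variability described above.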
In order to draw any conclusions about the population parameters β0 and β1, we have
to make a few more assumptions about the behavior of the data in a regression setting.
We can get a pretty good feel for the assumptions by looking at our plot of gpa against
college entrance test scores.
First, notice that when we connected the averages of the college entrance test scores
for each of the subpopulations, it formed a line. Most often, we will not have the
population of data at our disposal as we pretend to do here. If we didn't, do you think it
would be reasonable to assume that the mean college entrance test scores are linearly
related to high school grade point averages?
Again, let's focus on just one subpopulation, those students who have a gpa of 1, say.
Notice that most of the college entrance scores for these students are clustered near
the mean of 6, but a few students did much better than the subpopulation's average
scoring around a 9, and a few students did a bit worse scoring about a 3. Do you get
the picture? Thinking instead about the errors, ϵi, most of the errors for these students
are clustered near the mean of 0, but a few are as high as 3 and a few are as low as
-3. If you could draw a probability curve for the errors above this subpopulation of data,
what kind of a curve do you think it would be? Does it seem reasonable to assume that
the errors for each subpopulation are normally distributed?
Looking at the plot again, notice that the spread of the college entrance test scores for
students whose gpa is 1 is similar to the spread of the college entrance test scores for
students whose gpa is 2, 3, and 4. Similarly, the spread of the errors is similar, no
matter the gpa. Does it seem reasonable to assume that the errors for each
subpopulation have equal variance?
Does it also seem reasonable to assume that the error for one student's college
entrance test score is independent of the error for another student's college entrance
test score? I'm sure you can come up with some scenarios — cheating students, for
example — for which this assumption would not hold, but if you take a random sample
from the population, it should be an assumption that is easily met.
We are now ready to summarize the four conditions or assumptions that underlie "the
simple linear regression model:"
The mean of the response, E(Yi), at each value of the predictor, xi, is a Linear
function of the xi.
The errors, εi, are Independent.
The errors, εi, at each value of the predictor, xi, are Normally distributed.
The errors, εi, at each value of the predictor, xi, have Equal variances (denoted σ²).
Do you notice what the capitalized first letters spell? "LINE." And, what
are we studying in this course? Lines! Get it? You might find this mnemonic a useful
way to remember the four conditions that make up what we call the "simple linear
regression model." Whenever you hear "simple linear regression model," think of these
four conditions!
An equivalent way to think of the first (linearity) condition is that the mean of the
error, E(ϵi), at each value of the predictor, xi, is zero. An alternative way to describe all
four assumptions is that the errors, ϵi, are independent normal random variables with
mean zero and constant variance, σ².
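For reference, the four conditions can be written compactly in one statement. This is the standard notation for the model described above; the sample size n is a symbol introduced here.

```latex
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
\qquad
\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad i = 1, \dots, n
```

It follows that E(Yi) = β0 + β1xi (Linear mean), the errors are Independent and Normally distributed, and Var(Yi) = σ² at every xi (Equal variances).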
In this article, we offer a multiple regression analysis definition, list the formula for
calculating multiple regression and explain how to calculate multiple regression with an
example to provide more insight into this type of statistical analysis.
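To perform a regression analysis, first calculate the multiple regression of your data.
You can use this formula:
Y = b0 + b1X1 + b2X2 + ... + bpXp
In this formula:
Y is the dependent variable, the value being predicted.
X1, X2, ..., Xp are the independent variables.
b0 is the intercept, the value of Y when every independent variable equals zero.
b1, b2, ..., bp are the regression coefficients of the corresponding independent variables.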
Gives insight into predictive factors
Conducting a multiple regression analysis is useful for determining what factors are
affecting different aspects of a business' processes. For instance, revenue can be one
type of Y-value, where different independent variables like the number of sales and cost
of goods affect business revenue. With multiple regression analysis, analysts can
identify the individual activities that affect specific metrics they want to measure, giving
them better insight into how to improve efficiency and productivity.
When companies can analyze the factors that affect certain business operations,
management can better predict which independent variables influence the dependent
functions of the business. For example, a business analyst can predict which factors are
likely to affect an organization's future profitability, based on the results of a multiple
regression analysis.
In this case, the analyst may calculate the regression using the formula where profit is
the dependent variable (Y) and factors like overhead, liabilities and total sales revenue
are the independent variables (X), each with its own coefficient (b). When the analyst
understands how much these factors affect profits, they can better predict the variables
that may affect profits in the future.
Understanding the mathematical data that multiple regression analysis can provide
allows professionals to model the information in a graph or chart. Displaying multiple
regression—how external variables cause changes in a dependent variable—in this way
can help you model the cause-and-effect relationship to better see the changes taking
place in real time. This can be especially beneficial for financial activities like investing
in stocks and securities, where traders can see the cause-and-effect relationship in a
chart to understand how economic factors are influencing current market shares.
To understand the calculations of multiple regression analysis, assume a financial
analyst wants to predict the price changes in a stock share of a major fuel company.
Using this example, follow the steps below to understand how the analyst calculates
multiple regression:
Using the example, the financial analyst must first determine all the factors that can
cause the share prices to fluctuate. While stock prices can have many influencing
factors, assume the predictive variables the analyst evaluates include interest rates,
crude oil prices and prices to move fuel resources. The analyst determines:
Once the analyst knows the independent variables affecting share price, they can
identify the intercept b0, which is the value of Y when every independent variable
equals zero. In this example each X is measured as a change from time zero, the
moment of evaluation, so b0 is the value of the stock at that moment. If the stock
price is $50 when the analyst begins their assessment, the b0 value is $50:
After identifying the independent variables and the intercept, the analyst can find
the regression coefficient for each X variable. Here X1 is the change in interest
rates from time zero, X2 is the change in the price of crude oil and Xp is the change
in transportation costs. Each regression coefficient (b1, b2, ..., bp) is the expected
change in Y for a one-unit change in the corresponding X, with the other variables
held constant. In practice the coefficients are estimated from historical data, such as
the differences in prices between previous and current years. Assume the analyst uses
these values in the formula:
Once the analyst has all values in the formula, they can find the total sum, or the value
of Y. It looks like this:
The multiple regression sum is the predicted value of the dependent variable for the
chosen values of the independent variables. In the example of the financial analyst
evaluating the advantages of company stocks, the value of Y is approximately 86.5.
Because Y is measured in the same units as the stock price (b0 was $50), this is a
predicted share price of about $86.50, not a probability of 86.5%. The prediction does
not by itself say how likely a fluctuation is, but the gap between the predicted value
and the current price can give the analyst valuable insight into how strongly changes in
the external factors are expected to move the company stock price.
Multiple linear regression is used to estimate the relationship between two or
more independent variables and one dependent variable. You can use multiple
linear regression when you want to know:
1. How strong the relationship is between two or more independent variables and
one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer
added affect crop growth).
2. The value of the dependent variable at a certain value of the independent
variables (e.g. the expected yield of a crop at certain levels of rainfall,
temperature, and fertilizer addition).
Multiple linear regression example: You are a public health researcher interested in
social factors that influence heart disease. You survey 500 towns and gather data on
the percentage of people in each town who smoke, the percentage of people in each
town who bike to work, and the percentage of people in each town who have heart
disease.
Because you have two independent variables and one dependent variable, and all
your variables are quantitative, you can use multiple linear regression to analyze the
relationship between them.
In multiple linear regression, it is possible that some of the independent variables are
actually correlated with one another, so it is important to check these before developing
the regression model. If two independent variables are too highly correlated (r² >
~0.6), then only one of them should be used in the regression model.
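A quick way to run this check is to compute the Pearson correlation r between each pair of independent variables and compare r² with the rough 0.6 threshold mentioned above. The town percentages below are made up for illustration, and `pearson_r` is a hand-written helper rather than a library call.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical percentages for five towns (assumed, not the survey data).
biking = [10.0, 20.0, 30.0, 40.0, 50.0]
smoking = [12.0, 9.0, 15.0, 8.0, 14.0]

r = pearson_r(biking, smoking)
r_squared = r ** 2
too_correlated = r_squared > 0.6  # rough threshold quoted in the notes

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, drop one variable: {too_correlated}")
```

Here the two predictors are only weakly related, so both could stay in the model.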
Linearity: the line of best fit through the data points is a straight line, rather than a
curve or some sort of grouping factor.
How to perform a multiple linear regression
To find the best-fit line for each independent variable, multiple linear regression
calculates three things:
The regression coefficients that lead to the smallest overall model error.
The test statistic of the overall model (R reports this as an F statistic).
The associated p value (how likely it is that the test statistic would have occurred
by chance if the null hypothesis of no relationship between the independent and
dependent variables were true).
It then calculates the t statistic and p value for each regression coefficient in the
model.
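To make "the regression coefficients that lead to the smallest overall model error" concrete, the sketch below solves the least squares normal equations for a model with two independent variables. It computes only the coefficients, not the test statistics or p values. The data are synthetic and noise-free, generated from assumed coefficients so that the fit should recover them; they are not the heart.data survey.

```python
def fit_two_predictors(x1, x2, y):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    # Centred sums of squares and cross-products.
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("the two predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Synthetic, noise-free data from assumed coefficients (illustrative only).
biking = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
smoking = [12.0, 9.0, 15.0, 8.0, 14.0, 11.0]
heart = [15.0 - 0.2 * b + 0.178 * s for b, s in zip(biking, smoking)]

b0, b1, b2 = fit_two_predictors(biking, smoking, heart)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Statistical software such as R's lm() performs the same minimisation for any number of predictors and adds the standard errors, test statistics and p values.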
Load the heart.data dataset into your R environment and run the following code:

    heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
This code takes the data set heart.data and calculates the effect that the independent
variables biking and smoking have on the dependent variable heart disease using R's
linear model function, lm().
To view the results of the model, use the summary() function:

    summary(heart.disease.lm)

This function takes the most important parameters from the linear model and puts
them into a table that looks like this:
The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’).
If the residuals are roughly centered around zero and with similar spread on either side,
as these are (median 0.03, and min and max around -2 and 2), then the model probably
fits the assumption of homoscedasticity.
Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the
coefficients table is labeled (Intercept) – this is the y-intercept of the regression
equation. It’s helpful to know the estimated intercept in order to plug it into the
regression equation and predict values of the dependent variable:
heart disease = 15 + (-0.2*biking) + (0.178*smoking) ± e
The most important things to note in this output table are the next two rows: the
estimates for the independent variables.
The Estimate column is the estimated effect, also called the regression
coefficient or r2 value. The estimates in the table tell us that for every one percent
increase in biking to work there is an associated 0.2 percent decrease in heart disease,
and that for every one percent increase in smoking there is an associated .17 percent
increase in heart disease.
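As a rough illustration of how these estimates are used, the sketch below plugs the rounded coefficients quoted above (15, -0.2 and 0.178) into the regression equation. The function name and the example town (30% biking, 10% smoking) are hypothetical, chosen only to show the arithmetic.

```python
# Rounded coefficients as quoted in the text; the real fitted values have more decimals.
INTERCEPT = 15.0
B_BIKING = -0.2
B_SMOKING = 0.178


def predict_heart_disease(biking_pct, smoking_pct):
    """Predicted heart disease rate (%) from biking and smoking rates (%)."""
    return INTERCEPT + B_BIKING * biking_pct + B_SMOKING * smoking_pct


# Hypothetical town: 30% bike to work, 10% smoke.
baseline = predict_heart_disease(30, 10)
one_more_biking = predict_heart_disease(31, 10)

print(round(baseline, 3))                    # 10.78
print(round(one_more_biking - baseline, 3))  # -0.2
```

Raising biking by one unit while holding smoking fixed changes the prediction by exactly the biking coefficient, which is what "associated 0.2 percent decrease" means.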
The Std. Error column displays the standard error of the estimate. This number shows
how much variation there is around the estimates of the regression coefficient.
The t value column displays the test statistic. Unless otherwise specified, the test
statistic used in linear regression is the t value from a two-sided t test. The larger the
absolute value of the test statistic, the less likely it is that the results occurred by chance.
The Pr( > | t | ) column shows the p value. This shows how likely it is that a t value
at least as extreme as the calculated one would have occurred by chance if the null
hypothesis of no effect of the parameter were true.
Because these values are so low (p < 0.001 in both cases), we can reject the null
hypothesis and conclude that biking to work and smoking both likely influence
rates of heart disease.
In our survey of 500 towns, we found significant relationships between the frequency of
biking to work and the frequency of heart disease and the frequency of smoking and
frequency of heart disease (p < 0.001 for each). Specifically we found a 0.2% decrease
(± 0.0014) in the frequency of heart disease for every 1% increase in biking, and a
0.178% increase (± 0.0035) in the frequency of heart disease for every 1% increase in
smoking.
It can also be helpful to include a graph with your results. Multiple linear regression is
somewhat more complicated than simple linear regression, because there are more
parameters than will fit on a two-dimensional plot.
However, there are ways to display your results that include the effects of multiple
independent variables on the dependent variable, even though only one independent
variable can actually be plotted on the x-axis.
Here, we have calculated the predicted values of the dependent variable (heart disease)
across the full range of observed values for the percentage of people biking to work.
Multiple regression formulas analyze the relationship between a dependent variable and
multiple independent variables. For example, in the equation Y = a + bX1 + cX2 + dX3 + E,
Y is the dependent variable, and X1, X2, and X3 are independent variables. a is the
intercept, b, c, and d are the slopes, and E is the residual value.
Multiple regression is a very useful statistical method and plays an important role in
finance, where much forecasting is done using regression analysis. For example, one can
predict the sales of a particular segment in advance with the help of macroeconomic
indicators that correlate strongly with that segment.
Key Takeaways
Multiple regression formulas are used to analyze the relationship between a dependent
variable and multiple independent variables.
This method uses two or more independent variables to forecast or predict the
dependent variable.
The main objective is to identify and examine the relationship between the dependent
and independent variables. Based on this analysis, suitable independent variables are
selected to aid in predicting the dependent variable.
Multiple regression is employed when simple linear regression alone cannot fulfill the
intended purpose, and it helps determine the effectiveness of the chosen predictor
variables in forecasting the dependent variable.
The multiple regression model formula is a method to predict the dependent variable
with the help of two or more independent variables. While running this analysis, the
main purpose of the researcher is to find out the relationship between the dependent
and independent variables. The independent variables are chosen because they can
help predict the dependent variable. One may use it
when simple linear regression cannot serve the purpose. The regression analysis helps in
validating whether the predictor variables are good enough to help in
predicting the dependent variable.
The main aim of the multiple regression model formula is to estimate
the coefficients that minimize the sum of squared differences between the observed
values of Y and the values predicted by the equation. Various statistical software
packages can perform this analysis systematically because
they are designed to handle complex calculations quickly and
provide a statistical evaluation of the accuracy.
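To make the estimation goal above concrete, here is a minimal sketch of a least-squares fit with two predictors. It solves the normal equations by Gauss-Jordan elimination in plain Python. The function name and the small data set are hypothetical (the data were generated from y = 2 + 3*x1 - x2 so that the answer is known), and statistical packages use more numerically robust methods than this.

```python
def fit_multiple_regression(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk.

    rows: list of predictor tuples, y: list of observed responses.
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry into the pivot position.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [rv - factor * cv for rv, cv in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical data generated exactly from y = 2 + 3*x1 - 1*x2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3, 7, 6, 11, 9]
coefs = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coefs])
```

Because the example data contain no noise, the fitted coefficients recover the intercept 2 and the slopes 3 and -1 up to floating-point error; with real data the fit would only minimize, not eliminate, the squared differences.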
Examples
Let us understand the concept of multiple regression analysis formula with the
help of suitable examples.
Example #1
Let us try and understand the concept of multiple regression analysis with
the help of an example. Let us find the relation between
the distance covered by an Uber driver, the age of the driver, and the
number of years of experience of the driver.
To calculate multiple regression, go to the “Data” tab in Excel and select the “Data
Analysis” option. For the detailed procedure and calculation, refer to the Analysis ToolPak
in Excel article.
1. y = m1X1 + m2X2 + b
2. y = 604.17*-3.18 + 604.17*-4.06 + 0
3. y = -4377
Example #2
Let us try and understand the concept of multiple regression analysis with
the help of another example. Let us find the relation between the GPA
of a class of students, the number of hours of study, and the students’ height.
Go to the “Data” tab in Excel and select the “Data Analysis” option for the calculation.
The regression equation for the above example will be
y = m1X1 + m2X2 + b
y = 1.08*.03 + 1.08*-.002 + 0
y = .0325
In this example, the dependent variable is the GPA, and the independent variables are
the study hours and the height of the students.
Example #3
Let us try and understand the concept of multiple regression analysis with
the help of another example. Now, let us find the relation between the
salary of a group of employees in an organization, the number of years of
experience, and the age of the employees.
Go to the “Data” tab in Excel and select the “Data Analysis” option for the calculation.
y = m1X1 + m2X2 + b
y= 41308*.-71+41308*-824+0
y= -37019
In this example, the dependent variable is the salary, and the independent variables
are the experience and age of the employees.
Thus, the above examples illustrate the formula and the concept through different case
studies, highlighting areas of study where it can be applied to derive results that are
easy to interpret.
This concept is widely used to predict the values of a dependent variable from
the values of independent variables. Examples of such situations include
the prediction of share prices, sales value, and student performance over a period of
time. In this way it can also help in assessing the relation between multiple
variables.
The multiple regression model equation can be used to isolate and identify the effect of
any particular factor on one variable while the other variables are held constant.
It can capture complex relationships between the dependent and
independent variables that involve more than one predictor, although the basic model
assumes these relationships are linear.
Businesses can make decisions based on the outcome of this calculation related to
employee performance, sales figures, customer demand and satisfaction levels, etc.
Any business is subject to a number of risks related to market movements, demand,
supply, prices, material availability and many more. In such cases finance and
insurance companies can use this calculation to assess the return or
the claims that they may have to handle to cover such risks.
Companies use the multiple regression model equation to assess the
extent to which the company’s marketing efforts are impacting revenue and profits,
which helps both the stakeholders and the management make crucial
decisions. The method also helps establish relationships between important variables like
GDP, employment, and inflation, which are essential factors that every country’s
government needs to look into for all-round development of the economy.
It helps in quality control and also generates process improvement ideas that contribute
to raising the standard of the organization’s products and services.
Thus, we see that the concept has a number of uses in the financial as well as
statistical field. It uses complex datasets and helps businesses generate business
models or make complex financial and other types of decisions that guide the business
towards a smooth operational process. The procedure must be applied correctly to get
a proper result.
The limitations of multiple regression include the assumptions of linearity, independence
of observations, normality of errors, absence of multicollinearity, and homoscedasticity.
Violations of these assumptions can lead to biased or inefficient estimates and affect
the validity of the regression model’s predictions.
The calculation is based on the method of least squares. The idea behind it is to minimize
the sum of the squared vertical distances between all of the data points and the line of best fit.
Consider several attempts at drawing the line of best fit: they can all look like a
fair line of best fit, but in fact Diagram 3 is the most accurate, as its regression line
is the least squares line $\hat{y} = a + bx$,
where:
$b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$.
Note: The underlying statistical model here is that there is a linear relation between the
variables, say $y = a' + b'x$, and so we should regard the equation that we obtain using the
method above as an estimate of the true equation. For this reason many
authorities write $y = a + bx + \epsilon$ to emphasize this point. A further discussion on the nature
of the error $\epsilon$ is not appropriate here, but is covered in the references below.
Worked Examples
Example 1
Consider the example below where the mass, y (grams), of a chemical is related to the
time, x (seconds), for which the chemical reaction has been taking place according to the
table:
Time, x (seconds): 5, 7, 12, 16, 20
Mass, y (grams): 40, 120, 180, 210, 240
Solution
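Start off by working out the mean of the independent and dependent variables.
$$\bar{x} = \frac{\sum x}{n} = \frac{5+7+12+16+20}{5} = \frac{60}{5} = 12, \qquad \bar{y} = \frac{\sum y}{n} = \frac{40+120+180+210+240}{5} = \frac{790}{5} = 158.$$
Now calculate b
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{1880}{154} = 12.20779\ldots = 12.208 \text{ (3 d.p.)}$$
and calculate a
$$a = \bar{y} - b\bar{x} = 158 - (12.20779\ldots \times 12) = 11.50649\ldots = 11.506 \text{ (3 d.p.)}$$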
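The calculation in this worked example can be checked with a short sketch. The function name is an assumption made for illustration; the data are the five time and mass values that appear in the mean calculation above.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope: Sxy / Sxx
    a = y_bar - b * x_bar    # intercept: y_bar - b * x_bar
    return a, b


times = [5, 7, 12, 16, 20]          # x, seconds
masses = [40, 120, 180, 210, 240]   # y, grams
a, b = least_squares_line(times, masses)
print(round(a, 3), round(b, 3))  # 11.506 12.208
```

Keeping b unrounded until the end avoids the small drift that appears if the rounded slope 12.208 is used to compute a.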
Example 2
To see how students' reaction skills have improved over a year, eight students took a
reactions test at the start of the year and at the end of the year. These are their scores:
Solution
The equation of the regression line has the form $\hat{y} = a + bx$.
As we have been given some summed values, we are going to use
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{14590 - \frac{515 \times 224}{8}}{33441 - \frac{515^2}{8}} = 0.590534\ldots = 0.591 \text{ (3 d.p.)}$$
$$\bar{x} = \frac{\sum x}{n} = \frac{515}{8} = 64.375, \qquad \bar{y} = \frac{\sum y}{n} = \frac{224}{8} = 28$$
$$a = \bar{y} - b\bar{x} = 28 - (0.590534\ldots \times 64.375) = -10.015631\ldots = -10.016 \text{ (3 d.p.)}$$
So the equation of our regression line is $\hat{y} = -10.016 + 0.591x$.
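When only summed values are available, as in this example, the same line can be computed from the sums directly. The sketch below assumes a hypothetical helper name and uses the totals quoted in the solution (n = 8, Σx = 515, Σy = 224, Σxy = 14590, Σx² = 33441).

```python
def line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for y_hat = a + b*x using summary totals only."""
    s_xy = sum_xy - sum_x * sum_y / n   # Sxy = sum(xy) - sum(x)*sum(y)/n
    s_xx = sum_x2 - sum_x ** 2 / n      # Sxx = sum(x^2) - (sum(x))^2/n
    b = s_xy / s_xx
    a = sum_y / n - b * (sum_x / n)     # a = y_bar - b * x_bar
    return a, b


a2, b2 = line_from_sums(8, 515, 224, 14590, 33441)
print(round(b2, 6))                  # 0.590534
print(round(a2, 3), round(b2, 3))    # -10.016 0.591
```

The intercept is computed from the unrounded slope, which is why it comes out as -10.016 rather than the value obtained from a slope already rounded to three decimal places.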
Video Example
Alissa Grant-Walker presents a video on working out the linear regression line.
We can also use the equation of the regression line for finding approximate values for
missing data.
Note: Using this to estimate outside the range of your data is unreliable.
Worked Example
Using the data from the last worked example about the mass of a chemical as time
increases, we worked out the equation of the regression line to be $\hat{y} = 11.506 + 12.208x$.
We can interpret this as for every 1 second increase in time the
mass of the chemical increases by 12.208 grams. The equation also tells us that
when no time has passed (when x is zero), the initial mass of the chemical is
11.506 grams.
Example 1
What is the mass of the chemical after ten seconds have passed?
Solution 1
Take your equation, enter the value of time x = 10, and calculate $\hat{y}$:
$\hat{y} = 11.506 + 12.208 \times 10 = 133.586$ grams.
Example 2
By how much does the mass of the chemical increase in 5 seconds?
Solution 2
For every second increase in time the mass of the chemical increases
by 12.208 grams. Multiply 12.208 grams by 5 to find the increase in mass
of the chemical in 5 seconds: $12.208 \times 5 = 61.04$ grams.
Example 3
How much time does it take for the mass of the chemical to increase by 50 grams?
Solution 3
We know that for every second increase in time the mass of the chemical increases
by 12.208 grams; this also means it takes $\frac{1}{12.208}$ seconds for the mass to increase
by 1 gram. To find the time taken for the mass to increase by 50 grams
we need to multiply $\frac{1}{12.208}$ by 50: $\frac{50}{12.208} = 4.096$ seconds (3 d.p.).
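The three calculations in this worked example can be reproduced with a few lines. This is a sketch under the assumption that the rounded coefficients 11.506 and 12.208 are used, with time measured in seconds as in the data; the variable and function names are illustrative.

```python
A = 11.506   # intercept: estimated initial mass in grams
B = 12.208   # slope: grams gained per second


def predicted_mass(seconds):
    """Estimated mass (grams) after the given time, from y_hat = A + B*x."""
    return A + B * seconds


mass_at_10 = predicted_mass(10)   # Example 1: mass after ten seconds
increase_in_5 = B * 5             # Example 2: increase over five seconds
time_for_50 = 50 / B              # Example 3: seconds needed to gain 50 grams

print(round(mass_at_10, 3))     # 133.586
print(round(increase_in_5, 2))  # 61.04
print(round(time_for_50, 3))    # 4.096
```

All three inputs lie inside the observed range of 5 to 20 seconds or concern the rate itself, so they avoid the unreliable extrapolation warned about above.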
Hypothesis Testing Formula
The hypothesis testing formula for an important test statistic, the z statistic, is given below:
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $\mu$ is the population mean, $\sigma$
is the population standard deviation and $n$ is the size of the sample.
Hypothesis Testing
Hypothesis testing is a statistical tool that ascertains whether a particular assumption
holds for the whole population. It assesses the validity of an inference by evaluating
sample data drawn from that population.
Key Takeaways
For every research experiment, there are mainly two explanations: the null
hypothesis and the alternative hypothesis. It is often difficult to prove a theory directly;
therefore, investigators test whether they can reject the null hypothesis. When the null
hypothesis is rejected, the remaining alternative explanation is taken to be supported.
For example, if we believe that the returns from the NASDAQ stock index are not
zero, then the null hypothesis would state: ‘the return from the NASDAQ is zero.’
Tests are conducted for different levels of statistical significance.
Hypothesis tests are prone to two errors: type 1 and type 2. If the null hypothesis is
rejected by the sample outcome despite being true, it is considered a type 1 error.
Similarly, if the sample data fails to reject the null hypothesis despite the null
hypothesis being false, it is considered a type 2 error.
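The meaning of a type 1 error can be illustrated with a small simulation. The sketch below is an illustration only: it draws many samples from a population where the null hypothesis is actually true and counts how often a two-sided z-test at the 0.05 level rejects it anyway. The function name, sample size, trial count and seed are all arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def type1_error_rate(trials=5000, n=10, mu0=0.0, sigma=1.0, alpha=0.05, seed=1):
    """Fraction of samples that wrongly reject a true null hypothesis (mu = mu0)."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical z value
    rejections = 0
    for _ in range(trials):
        # The sample really does come from a population with mean mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) > critical:
            rejections += 1  # rejecting here is a type 1 error
    return rejections / trials


rate = type1_error_rate()
print(rate)  # expected to be close to the significance level 0.05
```

The rejection rate should hover near the chosen significance level, which is why the significance level is often described as the type 1 error rate the researcher is willing to accept.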
4. Two-tailed: The two-tailed hypothesis test applies when the critical region of the
sampling distribution is two-sided. Here the test checks whether the sample statistic is
either higher or lower than a range of given values.
Hypothesis Testing Steps
Researchers opt for different statistical tests like t-tests or z-tests. The z-test formula
is as follows:
Z = ( x̅ – μ0 ) / (σ /√n)
Based on the Z-test result, the researcher draws a conclusion in favor of either the
null hypothesis or its alternative. The hypotheses are stated as follows:
H0: μ=μ0
Ha: μ≠μ0
Here,
H0 = null hypothesis
Ha = alternate hypothesis
If the sample mean is not significantly different from the hypothesized population mean
μ0, the null hypothesis is not rejected. Otherwise, the alternate hypothesis is accepted.
Hypothesis Testing Calculation with Examples
A battery manufacturing company claims that the average life of its two-wheeler
batteries is 2.1 years. The quality inspector surveyed ten customers to know the lasting
period of their batteries. The following data was collected:
Customer   Battery life (years)
1          1.9
2          2.3
3          2.1
4          2.2
5          1.9
6          2.4
7          2.1
8          2.3
9          2.2
10         2.0
If the population standard deviation is 0.17 and the significance level is 0.05, conduct
a hypothesis test of the company’s claim.
Solution:
Given:
σ = 0.17
n = 10
Assuming that the company’s claim of average battery life being 2.1 years is true, μ0 = 2.1 and
H0: μ = 2.1
Ha: μ ≠ 2.1
Sample mean (x̅ ) = (1.9 + 2.3 + 2.1 + 2.2 + 1.9 + 2.4 + 2.1 + 2.3 + 2.2 + 2.0) / 10 =
2.14 years.
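Z = (x̅ – μ0) / (σ /√n) = (2.14 – 2.1) / (0.17 /√10) ≈ 0.744
The level of significance is 0.05, so the critical z-value for this two-tailed test is
1.96 (1.645 would apply to a one-tailed test). Let us now compare the Z-test result with it.
Since 0.744 is less than 1.96, we fail to reject the null hypothesis. Thus, the sample
data are consistent with the company’s claim that the average life of its batteries is
2.1 years.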
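The battery example can be checked numerically with the sketch below. It assumes a two-sided test, because the alternative hypothesis is μ ≠ μ0, so the critical value at the 0.05 level is about 1.96. The function name is hypothetical; NormalDist comes from Python's standard statistics module.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_two_sided(sample, mu0, sigma, alpha=0.05):
    """Return (z, critical value, p value, reject H0?) for H0: mu = mu0 vs Ha: mu != mu0."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, critical, p_value, abs(z) > critical


battery_life = [1.9, 2.3, 2.1, 2.2, 1.9, 2.4, 2.1, 2.3, 2.2, 2.0]  # years, from the survey
z, critical, p_value, reject = z_test_two_sided(battery_life, mu0=2.1, sigma=0.17)
print(round(z, 3), round(critical, 2), reject)  # 0.744 1.96 False
```

Because the test statistic is well inside the critical value, the null hypothesis is not rejected; this means the data do not contradict the claim, which is weaker than proving the claim true.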
Hypothesis testing validates a theory with the help of systematic statistical inference.
However, in practice, it is not easy. Therefore, researchers try to reject the null
hypothesis in order to validate the alternate explanation.
Limitations
Hypothesis testing is all about assumptions and interpretations. It, therefore, requires
superior analytical abilities. As a result, it is inaccessible for most.
Also, this method heavily relies on mere probability. There can be errors in data. It
works better for large sample sizes. For smaller sample sets, this approach may not be
the most suitable.
Importance of hypothesis testing
It is a useful statistical tool that interprets data-based conclusions—such that it stands
true for the whole population. It is implemented in scientific research, medical research,
psychology, manufacturing, marketing, advertising, and criminal trials.