Floxus Workshop
BUSINESS ANALYTICS WORKSHOP
DAY-1
FR
What is Statistics?
• The word refers to numerical information. Statistics
includes collecting, organizing and analysing data to
describe situations, often for the purpose of
decision making.
This is a typical statistics problem. The student has the data (marks) and needs
to apply statistical techniques to get the information he requires. This is a
function of descriptive statistics.
Example of Descriptive Statistics
There are a total of 42,796 miles of interstate highways in the US. The
interstate system represents only 1% of the nation's total roads but carries
more than 20% of the traffic. The longest is I-90, which stretches
from Boston to Seattle, a distance of 3,081 miles. The shortest is I-878
in New York City, which is 0.70 of a mile in length. Alaska does
not have any interstate highways, Texas has the most interstate
miles at 3,232, and New York has the most interstate routes with 28.
Example of Inferential Statistics
Gamous and Associates, a public accounting firm, is conducting an
audit of a Printing Company. To begin, the accounting firm selects a
random sample of 100 invoices and checks each invoice for
accuracy. There is at least one error on five of the invoices. Hence,
the accounting firm estimates that 5 percent of the population of
invoices contain at least one error.
Example of Inferential Statistics
When an election for political office takes place, the television networks cancel regular
programming to provide election coverage. For important offices such as president or
senator in large states, the networks actively compete to see which one will be the first
to predict a winner. This is done through exit polls in which a random sample of voters
who exit the polling booth are asked for whom they voted. From the data, the sample
proportion of voters supporting the candidates is computed. A statistical technique is
applied to determine whether there is enough evidence to infer that the leading
candidate will garner enough votes to win. Suppose that the exit poll results from the
state of Florida during the year 2000 elections were recorded. Although several
candidates were running for president, the exit pollsters recorded only the votes of the
two candidates who had any chance of winning: Republican George W. Bush and
Democrat Albert Gore. The results (765 people who voted for either Bush or Gore) were
stored in an XML file. The network analysts would like to know whether they can
What is Population?
• A population is the group of all items of interest to a statistics practitioner.
It is frequently very large and may, in fact, be infinitely large.
• Statisticians gather data from a sample. They use this information to make
inferences about the population that the sample represents.
• Discrete variables are numeric values that arise from a counting process.
Example: TV sets owned, children in a family, number of students in a section,
etc.
• Continuous variables are values that arise from a measuring process, and the
values depend on the precision of the measuring instrument.
Example: time spent in check-out lines, rainfall in a state, etc.
Example
i) Nominal-Level Data
ii) Ordinal-Level Data
iii) Interval-Level Data
iv) Ratio-Level Data
Nominal-Level Data
These are the weakest of all data measurements. Categorization is the main
purpose of this measurement. Numbers are used only to label an item or
characteristic.
OEDC 22.76
Russia 11.33
China 3.62
Others 12.35
• Interval-level data include all the characteristics of the ordinal level and, in
addition, the difference between values is a constant size.
• Higher-level data types may be treated as lower-level ones. For example, in universities and colleges,
we convert the marks in a course, which can be considered to be interval, to letter grades, which are
ordinal. Some graduate courses feature only a pass or fail designation. In this case, the interval data
are converted to nominal.
• It is important to point out that when we convert higher-level data as lower-level we lose
information. For example, a mark of 83 on an accounting course exam gives far more information
about the performance of that student than does a letter grade of A, which might be the letter grade
for marks between 80 and 90. As a result, we do not convert data unless it is necessary to do so.
• It is also important to note that we cannot treat lower-level data types as higher-level types. Interval
data may be treated as ordinal or nominal. Ordinal data may be treated as nominal but cannot be treated as
interval. Nominal data cannot be treated as ordinal or interval.
Data Collection/Sources
• The person who collects the statistical data is called the investigator, whereas the person
who provides raw data and facts to the investigator is called the respondent.
• Based on the collection issue, there can be two types of data-Primary and Secondary data.
Primary Data
• This refers to the data that the investigator collects for the very first time.
• This data has not been collected by anyone before.
• Primary data provide the investigator with the most reliable first-hand information
about the respondents.
• The investigator has a clear idea about the terminologies used, the statistical units
employed, the research methodologies and the size of the sample.
• Primary data may either be internal or external to the organization.
Methods of Collection of Primary Data
(i) Direct Personal Investigation
The investigator/researcher is responsible for personally approaching a respondent,
conducting the investigation and gathering further information.
• Secondary data refers to the data that the investigator collects from another source. Past
investigators or agents collected the data for their own studies.
• There may be problems in clarity and issues about the intricacies of the data.
• There may be ambiguity in terms of the sample size and sampling technique.
• There may also be unreliability with respect to the accuracy of the data.
Selling price ($)    Number of vehicles
15k up to 18k        8
18k up to 21k        23
21k up to 24k        17
24k up to 27k        18
27k up to 30k        8
30k up to 33k        4
33k up to 36k        2
Total                80
Observations from Frequency Distribution
• The selling prices ranged from about $15,000 up to about $36,000.
• The selling prices are concentrated between $18,000 and $27,000. A total
of 58, or 72.5 percent, of the vehicles sold within this range.
• From the frequency distribution alone we cannot tell that the actual selling
price of the least expensive vehicle was $15,546 and of the most expensive
$35,925. However, the lower limit of the first class and the upper limit of the
last class convey essentially the same meaning.
• Most likely, we will make the same judgement knowing that the lowest price is
around $15,000 as we would knowing that the exact price is $15,546.
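The observations above can be checked directly from the frequency table. The sketch below is only an illustration: the variable names are made up, and the class limits are entered in thousands of dollars.

```python
# (lower limit, upper limit, frequency); limits in thousands of dollars,
# frequencies taken from the frequency table.
classes = [(15, 18, 8), (18, 21, 23), (21, 24, 17), (24, 27, 18),
           (27, 30, 8), (30, 33, 4), (33, 36, 2)]

total = sum(freq for _, _, freq in classes)

# Relative frequency of each class as a percentage of all vehicles.
relative = {(lo, hi): 100 * freq / total for lo, hi, freq in classes}

# Vehicles sold between $18,000 and $27,000 (three adjacent classes).
in_range = sum(freq for lo, hi, freq in classes if lo >= 18 and hi <= 27)
share_in_range = 100 * in_range / total

for (lo, hi), pct in relative.items():
    print(f"{lo}k up to {hi}k: {pct:.2f}%")
print(f"18k up to 27k: {in_range} vehicles, {share_in_range:.1f}%")
```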
To complete the frequency polygon, midpoints of $13.5 and $37.5 (in thousands) are added to the X-axis to
"anchor" the polygon at zero frequencies.
Both the histogram and the frequency polygon allow us to get a quick picture of the main
characteristics of data ( high, low points of concentration).
Frequency Polygon
Advantages
The advantages of the histogram are:
(i) The rectangle clearly shows each class in the distribution.
(ii) The area of each rectangle, relative to all other rectangles, shows
the proportion of the total number of observations that occur in
that class.
Skewed Histogram-A skewed histogram is one with a long tail extending to either the right or
the left. Skewness, in statistics, is the degree of distortion from the symmetrical bell curve or normal
distribution. Many models assume normal distribution; i.e., data are symmetric about the mean. The
normal distribution has a skewness of zero. But in reality, data points may not be perfectly symmetric.
So, an understanding of the skewness of the dataset indicates whether deviations from the mean are
going to be positive or negative.
Skewness of Histogram
A mode is the observation that occurs with the greatest frequency. A modal class is the
class with the largest number of observations. A unimodal histogram is one with a single
peak. A special type of symmetric unimodal histogram is one that is bell shaped.
A bimodal histogram is one with two peaks, not necessarily equal in height. Bimodal
histograms often indicate that two different distributions are present.
Scenario 1
• A financial manager must be familiar with the main characteristics of the capital markets where long-
term financial assets such as stocks and bonds trade. A well-functioning capital market provides
managers with useful information concerning the appropriate prices and rates of return that are
required for a variety of financial securities with differing levels of risk. Statistical methods can be
used to analyze capital markets and summarize their characteristics, such as the shape of the
distribution of stock or bond returns.
• The return on an investment is calculated by dividing the gain (or loss) by the value of the
investment. For example, a $100 investment that is worth $106 after 1 year has a 6% rate of return. A
$100 investment that loses $20 has a –20% rate of return. For many investments, including individual
stocks and stock portfolios (combinations of various stocks), the rate of return is a variable. In other
words, the investor does not know in advance what the rate of return will be. It could be a positive
number, in which case the investor makes money—or negative, and the investor loses money.
• Investors are torn between two goals. The first is to maximize the rate of return on investment. The
second goal is to reduce risk. If we draw a histogram of the returns for a certain investment, the
location of the center of the histogram gives us some information about the return one might expect
from that investment. The spread or variation of the histogram provides us with guidance about the
risk. If there is little variation, an investor can be quite confident in predicting what his or her rate of
return will be. If there is a great deal of variation, the return becomes much less predictable and thus
riskier. Minimizing the risk becomes an important goal for investors and financial analysts.
Scenario 2
A business researcher measured the volume of stocks traded on Wall Street three times a
month for nine years resulting in a database of 324 observations. Suppose a financial
decision maker wants to use these data to reach some conclusions about the stock market.
The Figure shows a histogram of these data. What can we learn from this histogram?
Scenario 2
• Virtually all stock market volumes fall between zero and 1 billion shares. The
distribution takes on a shape that is high on the left end and tapered to the
right. The shape of this distribution is skewed toward the right end.
• In statistics, it is often useful to determine whether data are approximately
normally distributed (bell shaped curve). We can see by examining the
histogram that the stock market volume data are not normally distributed.
• Although the center of the histogram is located near 500 million shares, a large
portion of stock volume observations falls in the lower end of the data
somewhere between 100 million and 400 million shares.
• In addition, the histogram shows some outliers in the upper end of the
distribution. Outliers are data points that appear outside of the main body of
observations and may represent phenomena that differ from those
represented by other data points.
• By observing the histogram, we notice a few data observations near 1 billion.
One could conclude that on a few stock market days an unusually large
volume of shares are traded. These and other insights can be gleaned by
examining the histogram and show that histograms play an important role in
the initial analysis of data.
Scenario 3
• Suppose that you are facing a decision about where to invest that small
fortune that remains after you have deducted the anticipated expenses for
the next year from the earnings from your summer job. A friend has
suggested two types of investment, and to help make the decision you
acquire some rates of return from each type. You would like to know the
types of information, such as whether the rates are spread out over a wide
range (making the investment risky) or are grouped tightly together
(indicating relatively low risk).
• Draw histograms for each set of returns and report on your findings. Which
investment would you choose and why?
• Raw data for returns on Investment A and returns on Investment B are
given.
From    To     Investment 1    Investment 2
-45     -30    0               5
-30     -15    6               5
-15     0      10              2
0       15     17              16
15      30     7               8
30      45     6               8
45      60     2               3
60      75     2               3
Comparison Using Histogram
Comparison Using Frequency Polygon
Interpretation
• The center of the histogram of the returns of investment A is slightly
lower than that for investment B.
• The spread of returns for investment A is considerably less than that for
investment B.
• Both histograms are slightly positively skewed.
• These findings suggest that investment A is superior. Although the
returns for A are slightly less than those for B, the wider spread for B
makes it unappealing to most investors.
• Both investments allow for the possibility of a relatively large return.
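The comparison of center and spread can be approximated from the grouped returns table. This is a rough sketch, not the workshop's own calculation: it assumes the first frequency column is Investment A and the second is Investment B, and it uses class midpoints in place of the raw returns.

```python
import math

# (lower limit, upper limit, frequency A, frequency B), read from the returns table.
rows = [(-45, -30, 0, 5), (-30, -15, 6, 5), (-15, 0, 10, 2), (0, 15, 17, 16),
        (15, 30, 7, 8), (30, 45, 6, 8), (45, 60, 2, 3), (60, 75, 2, 3)]

def grouped_mean_sd(midpoints, freqs):
    """Approximate mean and sample standard deviation from grouped data."""
    n = sum(freqs)
    mean = sum(f * m for m, f in zip(midpoints, freqs)) / n
    ss = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
    return mean, math.sqrt(ss / (n - 1))

mids = [(lo + hi) / 2 for lo, hi, _, _ in rows]
mean_a, sd_a = grouped_mean_sd(mids, [r[2] for r in rows])
mean_b, sd_b = grouped_mean_sd(mids, [r[3] for r in rows])

print(f"Investment A: mean {mean_a:.1f}, standard deviation {sd_a:.1f}")
print(f"Investment B: mean {mean_b:.1f}, standard deviation {sd_b:.1f}")
```

A lower grouped mean together with a smaller standard deviation for A matches the interpretation that A gives slightly lower but less variable returns.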
Limitation
• One of the drawbacks of the histogram is that we lose potentially useful
information by classifying the observations.
• By classifying the observations we did acquire useful information.
However, the histogram focuses our attention on the frequency of each
class and by doing so sacrifices whatever information was contained in
the actual observations.
• The stem-and-leaf display is a method that to some extent overcomes this
loss.
Stem and Leaf Plot
• Below are the runs scored by a batsman X in last 27 innings
• 30,29,29,11,61,54,44,10,11,39,25,15,34,52,30,15,36,18,10,59,66,24,35,41,22,25,13
Stem    Leaf
1       0 0 1 1 3 5 5 8
2       2 4 5 5 9 9
3       0 0 4 5 6 9
4       1 4
5       2 4 9
6       1 6
Stem and Leaf Plot v/s Histogram
• 30,29,29,11,61,54,44,10,11,39,25,15,34,52,30,15,36,18,10,59,66,24,35,41,22,25,13
Stem    Leaf
1       0 0 1 1 3 5 5 8
2       2 4 5 5 9 9
3       0 0 4 5 6 9
4       1 4
5       2 4 9
6       1 6
► The length of each line represents the frequency in the class interval defined by the
stems.
► The advantage of the stem-and-leaf display over the histogram is that we can see the
actual observations.
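A stem-and-leaf display for the batsman's runs can be built in a few lines. This is a minimal sketch that assumes two-digit values, with the tens digit as the stem and the units digit as the leaf; the function name is illustrative.

```python
from collections import defaultdict

runs = [30, 29, 29, 11, 61, 54, 44, 10, 11, 39, 25, 15, 34, 52, 30, 15, 36,
        18, 10, 59, 66, 24, 35, 41, 22, 25, 13]

def stem_and_leaf(values):
    """Group values by tens digit (stem) with sorted units digits (leaves)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

plot = stem_and_leaf(runs)
for stem in sorted(plot):
    # The number of leaves on a line is the class frequency for that stem.
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```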
Note
Factors That Identify When to Use a Histogram,
Frequency Polygon, Ogive, or Stem-and-Leaf Display:
• Objective: Describe a single set of data
• Data type: Interval or Ratio level
Visualizing Data
Summary Table
• A summary table tallies the values as a frequency or percentage for each category. It
helps to see the differences among the categories by displaying the frequency or
percentage.
• The summary table below tallies responses to a recent survey that asked young adults
about the main reason that they shop online.
Reason Percentage
Better Price 37%
Avoiding holiday crowds or hassles 29%
Convenience 18%
Better Selection 13%
Ship Directly 3%
• From the table, you can conclude that 37% shop online mainly for better prices
and 29% shop online mainly to avoid holiday crowds or hassles.
Contingency Table
• It jointly tallies the values of two or more categorical variables, allowing you to study
patterns that exist between the variables.
• Tallies can be shown as frequency, a percentage of overall total, a percentage of row total
or column total.
• Each tally appears in its own cell and there is a cell for each joint response.
• For a sample of 316 retirement funds, a contingency table is constructed to show the pattern
between the fund type variable and the risk level variable.
Risk Level
Fund Type Low Average High Total
Growth 143 74 10 227
Value 69 17 3 89
Total 212 91 13 316
► Because the fund type variable has the categories Growth and Value and the risk level variable has
the categories Low, Average and High, there are six possible joint responses for the table.
► For the first fund tested in the sample, you would add to the tally in the cell that is the
intersection of the Growth row and the Low column. That cell (143 funds) is the most frequent joint response.
Contingency Table On Percentage of Overall Total
Risk Level
Fund Type Low Average High Total
Growth 143 74 10 227
Value 69 17 3 89
Total 212 91 13 316
The percentage is taken on the total number of funds. The table shows that 71.84% of the funds sampled are
growth funds, 28.16% are value funds and 45.25% are growth funds with low risk.
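The percentages of the overall total follow from dividing each count by the 316 funds sampled. The sketch below uses the counts from the contingency table; the dictionary layout and names are illustrative choices.

```python
# Joint counts of fund type by risk level, taken from the contingency table.
counts = {
    "Growth": {"Low": 143, "Average": 74, "High": 10},
    "Value": {"Low": 69, "Average": 17, "High": 3},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Each cell as a percentage of the overall total.
percent_of_total = {
    fund: {risk: round(100 * n / grand_total, 2) for risk, n in row.items()}
    for fund, row in counts.items()
}

# Each fund type (row total) as a percentage of the overall total.
row_share = {
    fund: round(100 * sum(row.values()) / grand_total, 2)
    for fund, row in counts.items()
}

print(percent_of_total)
print(row_share)
```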
The frequencies may not be given in descending order, in which case they need to be arranged first. From the
table, the percentage frequency and the cumulative percentage then need to be computed.
Table
Cause                        Frequency    Percentage    Cumulative Percentage
Card jammed                  365          50.41         50.41
Card unreadability           234          32.32         82.73
ATM malfunction              32           4.42          87.15
ATM out of cash              28           3.87          91.02
Invalid amount request       23           3.18          94.20
Wrong password               23           3.18          97.38
Lack of funds in accounts    19           2.62          100
Total                        724          100
Comments
• Because the categories in a Pareto Chart are ordered in a decreasing
frequency of occurrence, the team can quickly see which cause
contributes maximum to the problem of incomplete transactions.
These causes would be the “vital few” and figuring out ways to avoid
such cases would be presumably a starting point for improving the
user experience of ATMs.
• By following the cumulative percentage line, we can see that the
first two causes account for 82.73% of incomplete transactions.
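The percentage and cumulative percentage columns behind a Pareto chart can be derived from the raw frequencies. This is a small sketch using the ATM causes from the table; the variable names are illustrative.

```python
# Frequencies of causes of incomplete ATM transactions, from the table.
causes = {
    "Card jammed": 365,
    "Card unreadability": 234,
    "ATM malfunction": 32,
    "ATM out of cash": 28,
    "Invalid amount request": 23,
    "Wrong password": 23,
    "Lack of funds in accounts": 19,
}

total = sum(causes.values())

# A Pareto chart orders the categories by decreasing frequency.
ordered = sorted(causes.items(), key=lambda item: item[1], reverse=True)

pareto = []
running = 0
for cause, freq in ordered:
    running += freq
    pareto.append((cause, freq,
                   round(100 * freq / total, 2),      # percentage
                   round(100 * running / total, 2)))  # cumulative percentage

for row in pareto:
    print(row)
```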
Describing Data- Measure of Central Tendency
Introduction
• Although graphical techniques for organizing and displaying data allow the
researcher to make some general observations about the shape and
spread of the data, a more complete understanding of the data can be
attained by summarizing the data using statistics. This part deals with
measures of central tendency, measures of variability, and measures of
shape. The computation of these measures is different for ungrouped
and grouped data.
• Central tendency is the extent to which the values of a numerical
variable group around a typical or central value.
• Most variables show a distinct tendency to group around a central value.
• When people talk about an "average value", "middle value" or "most
frequent value", they are informally talking about the mean, median and
mode.
Arithmetic Mean
The arithmetic mean (typically referred to as the mean) is the most common
measure of central tendency.
The mean suggests a typical or central value and serves as the "balance point" or
fulcrum in a set of data.
Let X₁, X₂, ..., Xₙ be the set of n values, where n represents the size of the sample; then
the sample mean is given by
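The formula itself did not survive on the slide; the standard definition of the sample mean, written with the symbols defined above, is:

```latex
\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i
```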
Weighted Mean
Grade of labour        Hourly wages    Labour hrs per unit (Product 1)    Labour hrs per unit (Product 2)
Unskilled labour       $5.00           1                                  4
Semi-skilled labour    $7.00           2                                  3
Skilled                $9.00           5                                  3
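One plausible use of the labour table is to compute the average hourly labour cost of each product, weighting each wage by the hours that product needs. The sketch below assumes that reading; the weighted mean is the sum of weight times value divided by the sum of the weights.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

hourly_wages = [5.00, 7.00, 9.00]   # unskilled, semi-skilled, skilled
hours_product_1 = [1, 2, 5]         # labour hours per unit, Product 1
hours_product_2 = [4, 3, 3]         # labour hours per unit, Product 2

avg_cost_1 = weighted_mean(hourly_wages, hours_product_1)
avg_cost_2 = weighted_mean(hourly_wages, hours_product_2)

print(f"Product 1: ${avg_cost_1:.2f} per labour hour")
print(f"Product 2: ${avg_cost_2:.2f} per labour hour")
```

The unweighted mean of the three wages is $7.00 for both products, which hides the fact that Product 1 uses mostly skilled labour.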
Geometric Mean
• Geometric mean is the measure of the central tendency when data is
changing over time.
• Examples might be growth of investments, the inflation rate or the
change of gross national product.
• Consider the growth of an initial investment of $1000 in a savings
account that is deposited for a period of five years. The interest rate,
which is accumulated annually, is different for each year.
• The table gives the interest and the growth of the investment.
Year    Interest Rate (%)    Growth Factor    Value at end of year
1       6.0                  1.060            $1060
2       7.5                  1.075            $1139.50
3       8.2                  1.082            $1232.94
4       7.9                  1.079            $1330.34
5       5.1                  1.051            $1398.19
Geometric Mean
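For the savings account example, the geometric mean of the five growth factors gives the average growth factor per year. The sketch below assumes the usual definition, the n-th root of the product of n factors.

```python
import math

# Annual growth factors from the interest table (1 + rate / 100).
growth_factors = [1.060, 1.075, 1.082, 1.079, 1.051]

product = math.prod(growth_factors)
geometric_mean = product ** (1 / len(growth_factors))

# Value of the initial $1000 after five years.
final_value = 1000 * product

print(f"Average growth factor: {geometric_mean:.4f}")
print(f"Value after five years: ${final_value:.2f}")
```

Growing $1000 at the geometric mean rate for five years reproduces the same final value, which the arithmetic mean of the factors would not do exactly.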
Median
Ordered values:  29  31  35  39  39  40  43  44  44  52
Rank:            1   2   3   4   5   6   7   8   9   10
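With ten ordered values, the median is the average of the 5th and 6th ranked values. A short sketch using Python's standard library, shown next to the manual calculation:

```python
import statistics

values = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]

ordered = sorted(values)
n = len(ordered)

# For an even number of values, average the two middle observations.
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
library_median = statistics.median(values)

print(manual_median, library_median)
```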
Median for Grouped Data
Consider previous tables of average monthly balance
Class($) Frequency(f)
0 - 49.99 78
50 - 99.99 123
100 - 149.99 187
150 - 199.99 82
200 - 249.99 51
250 - 299.99 47
300 - 349.99 13
350 - 399.99 9
400 - 449.99 6
450 - 499.99 4
Median for Grouped Data
Consider previous tables of average monthly balance
Class($) Frequency(f) Cumulative frequency
0 - 49.99 78 78
50 - 99.99 123 201
100 - 149.99 187 388
150 - 199.99 82 470
200 - 249.99 51 521
250 - 299.99 47 568
300 - 349.99 13 581
350 - 399.99 9 590
400 - 449.99 6 596
450 - 499.99 4 600
Median for Grouped Data
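The slide's own formula is not available, so the sketch below assumes the common interpolation rule: median = L + ((n/2 - cf) / f) * w, where L is the lower limit of the median class, cf the cumulative frequency before it, f its frequency and w the class width. Some textbooks use (n + 1)/2 instead of n/2, which shifts the result slightly.

```python
# (lower class limit, frequency) for the average monthly balance table.
classes = [(0, 78), (50, 123), (100, 187), (150, 82), (200, 51),
           (250, 47), (300, 13), (350, 9), (400, 6), (450, 4)]
width = 50

n = sum(freq for _, freq in classes)
half = n / 2

cumulative = 0
for lower, freq in classes:
    if cumulative + freq >= half:
        # This is the median class: interpolate within it.
        median = lower + (half - cumulative) / freq * width
        break
    cumulative += freq

print(f"n = {n}, median class starts at {lower}, median is about {median:.2f}")
```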
Mode is another measure of central tendency and is that value that occurs
most frequently in a dataset.
Example: A system manager is in charge of company. His network keeps
track of number of server failures that occur in a day. Determine mode for
the following data which represents the number of server failures per day
for past1 two
3
values.
0 26 2 7 4 0 2 7 4 0 2 3 3 6 3
Solution:
0 0 1 2 2 2 3 3 3 3 4 6 7 26
Since 3 occurs four times, the mode is 3. Thus the system manager can say
that the most common occurrence is having three server failures in a day.
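The mode can be found by counting how often each value occurs. The sketch below uses the ordered array from the solution:

```python
from collections import Counter

# Ordered server failures per day, as listed in the solution.
failures = [0, 0, 1, 2, 2, 2, 3, 3, 3, 3, 4, 6, 7, 26]

# most_common(1) returns the (value, count) pair with the highest count.
mode_value, mode_count = Counter(failures).most_common(1)[0]

print(f"Mode = {mode_value} (occurs {mode_count} times)")
```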
Mode for Grouped Data
Class($) Frequency(f) Cumulative frequency
0 - 49.99 78 78
50 - 99.99 123 201
100 - 149.99 187 388
150 - 199.99 82 470
200 - 249.99 51 521
250 - 299.99 47 568
300 - 349.99 13 581
350 - 399.99 9 590
400 - 449.99 6 596
450 - 499.99 4 600
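The formula for the grouped mode is not shown here, so this sketch assumes the usual interpolation rule: mode = L + d1 / (d1 + d2) * w, where L is the lower limit of the modal class, d1 and d2 are the differences between the modal frequency and the frequencies of the classes before and after it, and w is the class width.

```python
# (lower class limit, frequency) for the average monthly balance table.
classes = [(0, 78), (50, 123), (100, 187), (150, 82), (200, 51),
           (250, 47), (300, 13), (350, 9), (400, 6), (450, 4)]
width = 50

freqs = [freq for _, freq in classes]
i = freqs.index(max(freqs))  # position of the modal class

# Neighbouring frequencies; treated as 0 if the modal class is at an end.
prev_f = freqs[i - 1] if i > 0 else 0
next_f = freqs[i + 1] if i < len(freqs) - 1 else 0

d1 = freqs[i] - prev_f
d2 = freqs[i] - next_f
mode = classes[i][0] + d1 / (d1 + d2) * width

print(f"Modal class starts at {classes[i][0]}, mode is about {mode:.2f}")
```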
Applying the Mean, Median and Mode
When we work with a statistical problem, we must decide whether to use the
mean, the median or the mode as a measure of central tendency. Symmetrical
distributions that contain only one mode always have the same value for the
mean, median and the mode. In that context, the choice is easy.
When the distributions are positively skewed or negatively skewed, the median
is often the best measure of location because it is always between the mean and
the mode. The median is not as highly influenced by the frequency of
occurrence of a single value as is the mode, nor is it affected by extreme values
Measures of Dispersion
Dispersion-Why is it important?
► The three curves have the same mean, but curve A has less spread or variability
than curve B.
► For any data, the central tendency helps to describe the characteristics
of the data. To increase our understanding of the pattern of the data, we must also
measure its dispersion: its spread or variability.
► It is important to know the amount of dispersion, variation or spread, as data
that are more dispersed or separated are less reliable for analytical purposes.
Dispersion
• Which of the
distributions of scores
has the larger
dispersion?
• The upper
distribution has more
dispersion because
the values are more
spread out.
Measures of Dispersion-The Range
• Simplest measure of dispersion
• Difference between the largest and the
smallest values
Range = Xlargest – Xsmallest
Example (values plotted on a number line from 0 to 14; the smallest value is 1 and the largest is 13):
Range = 13 - 1 = 12
Why The Range Can Be Misleading?
• Ignores the way in which data are distributed
7 8 9 10 11 12 7 8 9 10 11 12
Range = 12 - 7 = 5 Range = 12 - 7 = 5
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
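The sensitivity of the range to a single outlier is easy to reproduce with the two data sets above:

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

base = [1] * 11 + [2] * 8 + [3] * 4 + [4]

without_outlier = base + [5]
with_outlier = base + [120]   # only the last observation differs

print(value_range(without_outlier))
print(value_range(with_outlier))
```

Twenty-four of the twenty-five observations are identical in the two sets, yet the range changes from 4 to 119.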
Mean Deviation
• Mean Deviation is the arithmetic mean of the absolute values of the
deviations from the arithmetic mean.
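Written as a formula (a standard form that matches the definition above, with n observations and sample mean X̄):

```latex
\text{MD} = \frac{1}{n}\sum_{i=1}^{n} \left| X_i - \bar{X} \right|
```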
Variance and Standard Deviation
• The variance and standard deviations are also based on the deviations from the mean.
However, instead of the absolute value of the deviations, it squares the deviations.
• The larger the variance is, the more the scores deviate, on average, away from the
mean. The smaller the variance is, the less the scores deviate, on average, from the
mean.
• The variance provides us with only a rough idea about the amount of variation in the
data. However, this statistic is useful when comparing two or more sets of data of the
same type of variable. If the variance of one data set is larger than that of a second
data set, we interpret that to mean that the observations in the first set display more
variation than the observations in the second set.
• There is a variance and standard deviation both for a population and a sample.
• When the deviation scores are squared in the variance, their unit of measure is squared as
well. Example: if people's weights are measured in pounds, then the variance of the
weights would be expressed in pounds² (or squared pounds).
• Since squared units of measure are often awkward to deal with, the square root of
variance is often used instead. The standard deviation is the square root of the variance.
Population Variance
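The slide's formula is missing; the standard definition for a population of N values with mean μ is given here as a reference, with the population standard deviation as its square root:

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}
```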
Sample Variance
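The slide's formula is missing; the standard definition for a sample of n values with mean X̄ divides by n - 1 rather than n:

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad s = \sqrt{s^2}
```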
Example
• The standard deviation of the biweekly amounts invested in the Dupree Paint
Company profit-sharing plan is computed to be $7.51. Suppose these employees
are located in Georgia. If the standard deviation for a group of employees in Texas
is $10.47 and the means are about the same, it indicates that the amounts invested
by the Georgia employees are not dispersed as much as those in Texas. Since the
amounts invested by the Georgia employees are clustered more closely about the
mean, the mean for the Georgia employees is a more reliable measure than the
mean for the Texas group.
• Financial analysts are concerned about the dispersion of a firm's earnings. Widely
dispersed earnings, those varying from extremely high to low or even negative
levels, indicate a high risk to stockholders and creditors.
• Quality control experts analyze the dispersion of a product's quality levels. A drug
that is average in purity but ranges from very pure to highly impure may endanger lives.
Example
• Consistency is the hallmark of a good golfer. Golf equipment manufacturers
are constantly seeking ways to improve their products.
• Suppose that a recent innovation is designed to improve the consistency of
its users. As a test, a golfer was asked to hit 150 shots using a 7 iron, 75 of
which were hit with his current club and 75 with the new innovative 7 iron.
The distances were measured and recorded. Which 7 iron is more
consistent?
• To gauge the consistency, we must determine the standard deviations. The
standard deviation of the distances of the current 7 iron is 5.79 yards
whereas that of the innovative 7 iron is 3.09 yards. Based on this sample, the
innovative club is more consistent. Because the mean distances are similar it
would appear that the new club is indeed superior.
Problem
Item    Calories (X)    X − mean    (X − mean)²
1       80              -50         2500
5       130             0           0
6       190             60          3600
7       200             70          4900
∑ = 13200
In fact, 57.1% (four out of seven) of the items lie within this interval.
Variance for Grouped Data
Class Frequency(f) Mid Value(x) fx
• Because the histogram is not bell shaped, we cannot use the Empirical Rule. We
must employ Chebysheff’s Theorem instead. The intervals can be created by adding
and subtracting two and three standard deviations to and from the mean.
• At least 75% of the salaries lie between $22,000 [the mean minus two standard
deviations =28,000 -2(3,000)] and $34,000 [the mean plus two standard
deviations=28,000+2(3,000)].
• At least 88.9% of the salaries lie between $19,000 [the mean minus three standard
deviations =28,000 -3(3,000)] and $37,000 [the mean plus three standard
deviations=28,000 +3(3,000)].
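Chebysheff's Theorem says that at least 1 - 1/k² of the observations lie within k standard deviations of the mean, for any k greater than 1. The helper below is an illustrative sketch (the function name is made up) applied to the salary figures above.

```python
def chebyshev_interval(mean, sd, k):
    """Return (lower, upper, minimum proportion) for k standard deviations."""
    if k <= 1:
        raise ValueError("Chebysheff's Theorem requires k > 1")
    return mean - k * sd, mean + k * sd, 1 - 1 / k ** 2

for k in (2, 3):
    low, high, proportion = chebyshev_interval(28_000, 3_000, k)
    print(f"k = {k}: at least {proportion:.1%} of salaries lie "
          f"between ${low:,} and ${high:,}")
```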
Coefficient of Variation
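The slide's formula is missing; the usual definition expresses the sample standard deviation s as a percentage of the sample mean X̄, which makes dispersion comparable across data sets with different units or means:

```latex
CV = \frac{s}{\bar{X}} \times 100\%
```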
Standard Score
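The slide's formula is missing; the usual definition of the standard score (z-score) measures how many standard deviations a value X lies from the mean:

```latex
z = \frac{X - \mu}{\sigma} \quad \text{(population)}, \qquad z = \frac{X - \bar{X}}{s} \quad \text{(sample)}
```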
Fundamental
• Is there a relationship between x and y?
• What is the strength of this relationship?
CORRELATION
• A statistical technique used to determine the degree to
which two variables are related
• Finding the relationship between two numerical
variables without being able to infer causal relationships
• The correlation between two random variables X and Y
is a measure of the degree of association between the
two variables.
• It describes the degree to which one variable is linearly
related to another
Examples
• Whether the stocks of two airlines rise and fall in any related
manner.
• What is the degree of relatedness of the two stock prices over
time.
• In the transportation industry, is a correlation evident between
the price of transportation and the weight of the object being
shipped?
• How strong is the correlation between the producer price index
and the unemployment rate?
• In retail sales, are sales related to population density, number of
competitors, size of the store, amount of advertising, or other
variables?
Example
• Between 2002 and 2005, there was a decrease in movie attendance.
There are several reasons for this decline. One reason may be the
increase in DVD sales. The percentage of U.S. homes with DVD
players and the movie attendance (billions) in the United States for
the years 2000 to 2005 are shown next. Can we describe the
relationship between these variables?
Year                           2000    2001    2002    2003    2004    2005
DVD percentage                 12      23      37      42      59      74
Movie attendance (billions)    1.41    1.49    1.63    1.58    1.53    1.40
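One way to describe the relationship is to compute Pearson's correlation coefficient for the six years. The sketch below implements the textbook formula by hand; the function name is illustrative.

```python
import math

dvd_percentage = [12, 23, 37, 42, 59, 74]
movie_attendance = [1.41, 1.49, 1.63, 1.58, 1.53, 1.40]  # billions

def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations over the root of the
    product of the sums of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(dvd_percentage, movie_attendance)
print(f"r = {r:.3f}")
```

Over the full 2000 to 2005 period the coefficient is close to zero, because attendance rose until 2002 and fell afterwards; a single linear measure hides that change of direction.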
[Figure: scatter plots illustrating positive correlation, negative correlation and no correlation]
Positive Relationship
Negative Relationship
No Relation
Variance vs Covariance
• Do two variables change together?
Pearson's Correlation Coefficient r
Pearson Product-Moment Correlation
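The formulas on these slides are missing; the standard sample definitions are given here as a reference. Covariance measures whether two variables change together, and Pearson's r rescales it by the two standard deviations so that it always lies between -1 and 1:

```latex
\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
```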
Regression
• Correlation tells you if there is an association between x
and y but it does not allow you to predict one variable
from the other.
ŷ = predicted value
y_i = true value
ε = residual error
Least Squares Regression
• To find the best line we must minimise the
sum of the squares of the residuals (the
vertical distances from the data points to
our line)
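Model line: ŷ = ax + b, where a = slope and b = intercept
Residual (ε) = y − ŷ
Sum of squares of residuals = Σ (y − ŷ)²
The least-squares line passes through the point of means, so $\bar{y} = a\bar{x} + b$ and therefore $b = \bar{y} - a\bar{x}$.
■ We can put our equation for a (the slope, $a = r \, s_y / s_x$) into this, giving:
$b = \bar{y} - r \frac{s_y}{s_x} \bar{x}$
where r = correlation coefficient of x and y, $s_y$ = standard deviation of y and $s_x$ = standard deviation of x.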
• Although the least-squares method results in the line that fits the data
with the minimum amount of error, unless all the observed data points
fall on a straight line, the prediction line is not a perfect predictor. Just
as all data values cannot be expected to be exactly equal to their mean,
neither can they be expected to fall exactly on the prediction line. An
important statistic, called the standard error of the estimate, measures
the variability of the actual Y values from the predicted values of Y in
the same way that the standard deviation measures the variability
of each value around the sample mean.
• In other words, the standard error of the estimate is the standard
deviation around the prediction line, whereas the standard deviation
is the standard deviation around the sample mean. It is measured
by the square root of [SSE / (n-2)], where SSE is the sum of squares of the residuals.
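The least-squares line and the standard error of the estimate can be computed together. The sketch below uses small hypothetical data, not data from the workshop, and computes the slope as Sxy / Sxx, which is algebraically the same as r times the ratio of the standard deviations of y and x.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept, standard error of the estimate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # b = y-bar minus a times x-bar
    # SSE: sum of squared vertical distances from the points to the line.
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    std_error = math.sqrt(sse / (n - 2))
    return slope, intercept, std_error

# Hypothetical illustration data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, std_error = least_squares(xs, ys)
print(f"y-hat = {slope:.2f}x + {intercept:.2f}, standard error = {std_error:.3f}")
```

Predictions from such a line are only meaningful for x values inside the observed range, here 1 to 5.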
Predictions in Regression Analysis- Interpolation Versus Extrapolation
• When using a regression model for prediction purposes, you need to
consider only the relevant range of the independent variable in
making predictions. This relevant range includes all values from the
smallest to the largest X used in developing the regression model.
Hence, when predicting Y for a given value of X, you can interpolate
within this relevant range of the X values, but you should not
extrapolate beyond the range of X values.
Examples
• The human resource manager of a telemarketing firm is concerned
about the rapid turnover of the firm’s telemarketers. It appears that
many telemarketers do not work very long before quitting. There may
be a number of reasons, including relatively low pay, personal
unsuitability for the work, and the low probability of advancement.
Because of the high cost of hiring and training new workers, the
manager decided to examine the factors that influence workers to quit.
He reviewed the work history of a random sample of workers who
have quit in the last year and recorded the number of weeks on the job
before quitting and the age of each worker when originally hired.
• Use regression analysis to describe how the work period and age are
related and comment on the relationship.
Examples
• Millions of boats are registered in the United States. As is the case with
automobiles, there is an active used-boat market. Many of the boats
purchased require bank financing, and, as a result, it is important for
financial institutions to be capable of accurately estimating the price of
boats. One variable that affects the price is the number of hours the
engine has been run. To determine the effect of the hours on the price,
a financial analyst recorded the price (in $1,000s) of a sample of 2007
24-foot Sea Ray cruisers (one of the most popular boats) and the
number of hours they had been run.
• Determine the least squares line and explain what the coefficients tell
you.
Examples
• Fire damage in the United States amounts to billions of dollars, much
of it insured. The time taken to arrive at the fire is critical. This raises
the question, Should insurance companies lower premiums if the
home to be insured is close to a fire station? To help make a decision, a
study was undertaken wherein a number of fires were investigated.
The distance to the nearest fire station (in miles) and the percentage of
fire damage were recorded.
• Determine the least squares line and interpret the coefficients.
Examples
• A real estate agent specializing in commercial real estate wanted a
more precise method of judging the likely selling price (in $1,000s) of
apartment buildings. As a first effort, she recorded the price of a
number of apartment buildings sold recently and the number of
square feet (in 1,000s) in the building.
• Calculate the regression line. What do the coefficients tell you about
the relationship between price and square footage?
Examples
• An economist for the federal government is attempting to produce a
better measure of poverty than is currently in use. To help acquire
information, she recorded the annual household income (in $1,000s)
and the amount of money spent on food during one week for a random
sample of households.
• Determine the regression line and interpret the coefficients.
Odometer Reading and Prices of Used
Toyota Camrys
• Car dealers across North America use the so-called Blue Book to help
them determine the value of used cars that their customers trade in
when purchasing new cars. The book, which is published monthly, lists
the trade-in values for all basic models of cars. It provides alternative
values for each car model according to its condition and optional
features. The values are determined on the basis of the average paid at
recent used-car auctions, the source of supply for many used-car
dealers. However, the Blue Book does not indicate the value
determined by the odometer reading, despite the fact that a critical
factor for used-car buyers is how far the car has been driven. To
examine this issue, a used-car dealer randomly selected 100 3-year old
Toyota Camrys that were sold at auction during the past month. Each
car was in top condition and equipped with all the features that come
standard with this car. The dealer recorded the price ($1,000) and the
number of miles (thousands) on the odometer. The dealer wants to
Thank You