
BUSINESS ANALYTICS WORKSHOP
DAY-1
What is Statistics?
• The word statistics refers to numerical information. Statistics involves collecting, organizing and analysing data to describe situations, often for the purpose of decision making.

• Statistics comprises the methods that allow us to work with data effectively.

• Statistics can be defined as the methods that help transform data into useful information for decision makers.
Scenario-1
• A bank which has been steadily losing customers in the light of
intense competition wants to investigate the reasons for the loss of
customers on account of perceived service quality in critical
dimensions like response time, reliability, courtesy of the service
staff and credibility.

• The bank would like to conduct a comprehensive survey to measure the perceived service quality from the customer's angle on these dimensions, along with the competition. This would help the bank develop and implement effective strategies to woo its present customers back as well as to attract new customers.
Scenario-2
• A company has to decide whether to introduce a new product into
the market or not. The company will introduce the product into
the market if at least 25% of the target audience in the relevant
population will accept the product so that the risk of product
failure is minimized.

• Obviously, customer acceptance is paramount in decision making. To learn about customer acceptance in a reasonable manner, the company has done a "test marketing" exercise.

• In the test market, 26% of the sample target audience (based on a sample of 250 customers) indicate their acceptance of the product. Does the sample, at a 95% confidence level, suggest that 25% of the target audience in the population (the entire market) will accept the product?
Statistics: A Way of Thinking
To best understand that statistics is a way of thinking, we need a
general framework that organizes the set of tasks that form
statistics.
One such framework is the DCOVA framework. The tasks of the DCOVA framework are:
• Define the data that you want to study to solve a problem or meet an objective.
• Collect the data from appropriate sources.
• Organize the data collected by developing tables.
• Visualize the data by developing charts.
• Analyze the data collected to reach conclusions and present those results.
DCOVA Framework
The tasks Define, Collect, Organize, Visualize and Analyze help to apply statistics to business decision making. The DCOVA framework supports four categories of business activities:
• Summarize and visualize business data
• Reach conclusions from those data
• Make reliable predictions about business activities
• Improve business processes
What is meant by Data?
• In general, data is any set of characters that is gathered and
translated for some purpose, usually analysis. It can be any
character, including text and numbers, pictures, sound, video
etc. Data can be defined as a collection of facts or
information from which conclusions may be drawn.

• In statistics, data are "the values associated with a trait or property that help distinguish the occurrences of something."

• A trait or property of something with which values (data) are associated is called a variable. Examples: Bank Prime Loan Rate, Female Infant Mortality Rate of Goa.
Types of Statistics
In general, the study of statistics is usually divided into two
categories-

• Descriptive Statistics deals with methods of organizing, summarizing and presenting data in an informative way.

• Inferential Statistics deals with finding something about a population from a sample belonging to that population, i.e. it is the method used to estimate a property of a population on the basis of a sample.
Example of Descriptive Statistics
A student enrolled in a business program is attending his first class of the
required statistics course. The student is somewhat apprehensive because he
believes the myth that the course is difficult. To alleviate his anxiety, the
student asks the professor about last year’s marks. Because this professor is
friendly and helpful, like all other statistics professors, he obliges the student
and provides a list of the final marks, which are composed of term work plus
the final exam. What information can the student obtain from the list?

This is a typical statistics problem. The student has the data (marks) and needs
to apply statistical techniques to get the information he requires. This is a
function of descriptive statistics.
Example of Descriptive Statistics
There are a total of 42,796 miles of interstate highways in the US. The interstate system represents only 1% of the nation's total roads but carries more than 20% of the traffic. The longest is I-90, which stretches from Boston to Seattle, a distance of 3,081 miles. The shortest is I-878 in New York City, which is 0.70 of a mile in length. Alaska does not have any interstate highways, Texas has the most interstate miles at 3,232, and New York has the most interstate routes with 28.
Example of Inferential
Statistics
Gamous and Associates, a public accounting firm is conducting an
audit of a Printing Company. To begin, the accounting firm selects a
random sample of 100 invoices and checks each invoice for
accuracy. There is at least one error on five of the invoices. Hence,
the accounting firm estimates that 5 percent of the population of
invoices contain at least one error.
Example of Inferential Statistics
When an election for political office takes place, the television networks cancel regular
programming to provide election coverage. For important offices such as president or
senator in large states, the networks actively compete to see which one will be the first
to predict a winner. This is done through exit polls in which a random sample of voters
who exit the polling booth are asked for whom they voted. From the data, the sample
proportion of voters supporting the candidates is computed. A statistical technique is
applied to determine whether there is enough evidence to infer that the leading
candidate will garner enough votes to win. Suppose that the exit poll results from the
state of Florida during the year 2000 elections were recorded. Although several
candidates were running for president, the exit pollsters recorded only the votes of the
two candidates who had any chance of winning: Republican George W. Bush and
Democrat Albert Gore. The results (765 people who voted for either Bush or Gore) were
stored in an XML file. The network analysts would like to know whether they can use these data to project the winner of the election.
What is Population?
• A population is the group of all items of interest to a statistics practitioner.
It is frequently very large and may, in fact, be infinitely large.

• In the language of statistics, population does not necessarily refer to a


group of people. It may, for example, refer to the population of ball
bearings produced at a large plant, the population of 50,000 students on
campus, the population of the Floridians who voted for Bush or Gore.

• A descriptive measure of a population is called a parameter. The


parameter of interest can be the mean number of soft drinks consumed by
all the students at the university, the proportion of the 5 million Florida
voters who voted for Bush.
• In most applications of inferential statistics, the parameter represents
the information we need.
FR
What is Sample?
• A sample is a set of data drawn from the studied population.

• A descriptive measure of a sample is called a statistic.

• We use statistics to make inferences about parameters.

• Example-We compute the proportion of the sample of 765 Floridians who


voted for Bush. The sample statistic is then used to make inferences about
the population of all 5 million votes—that is, we predict the election results
even before the actual count.
FR
Difference between Sample
and Population

• Statisticians gather data from a sample. They use this information to make inferences about the population that the sample represents.

• A population is a whole and a sample is a fraction or segment of that


whole.

• We study a sample in order to be able to describe or conclude something about the population.
Example
The population the television networks wanted to make inferences about is the
approximately 5 million Floridians who voted for Bush or Gore for president. The
sample consisted of the 765 people randomly selected by the polling company who
voted for either of the two main candidates. The characteristic of the population that we
would like to know is the proportion of the Florida total electorate that voted for Bush.
Specifically, we would like to know whether more than 50% of the electorate voted for
Bush (counting only those who voted for either the Republican or Democratic
candidate). It must be made clear that we cannot predict the outcome with 100%
certainty because we will not ask all 5 million actual voters for whom they voted. This is
a fact that statistics practitioners must understand. A sample that is only a small
fraction of the size of the population can lead to correct inferences only a certain
percentage of the time. You will find that statistics practitioners can control that fraction
and usually set it between 90% and 99%.
FR
Defining Data
Identifying or defining correct data is of prime importance to support a
business objective. Business objectives can arise from any level of
management and can vary-

• A marketing analyst needs to assess the effectiveness of a new online


advertising campaign
• A pharmaceutical company needs to determine whether a new drug is
more effective than those currently in use
• An operations manager wants to improve a manufacturing or service process
• An auditor needs to review a company’s financial transactions to
determine whether the company is in compliance with generally accepted
accounting principles.
FR
Defining Variable
Defining the variable that we want to study to solve a problem or meet an
objective is a crucial step. While defining the variable, we classify variables as:
i) Categorical (qualitative) variables
ii) Numerical (quantitative) variables

• Categorical Variable: when the characteristic or trait studied is non-numeric or qualitative.
Example: gender of a person, state of birth, brand of shirts used, attendance in class etc.
It can also take the form of yes-or-no questions such as "Do you use Facebook?"

• Numerical Variable: when the characteristic or trait studied can be reported numerically.
Types of Numerical Variable
Numerical variables are either discrete or continuous-

• Discrete Variable are numeric values that arise from a counting process.
Example-TV sets owned, children in a family, number of students in a section
etc.

• Continuous Variable are values that arise from a measuring process and the
values depend on the precision of the measuring instrument.
Example-Time spent on check-out lines, rainfall in a state etc.
FR
Example

Question                                      Response                  Variable Type

Do you have a Facebook profile?               Yes/No                    Categorical

How many text messages have you sent          …………….                    Numerical (discrete)
in the past three days?

How long did the mobile app update            …… seconds/minutes        Numerical (continuous)
take to download?
Types of Data
Data can be classified according to levels of measurement. The level of
measurement of the data dictates the calculation that can be done to summarize
and present the data. It also determines the statistical tests that should be
performed. There are four levels of measurements –

i) Nominal-Level Data
ii) Ordinal-Level Data
iii) Interval-Level Data
iv) Ratio-Level Data
FR
Nominal-Level Data
These are the weakest of all data measurements. Categorization is the main
purpose of this measurement. Numbers are used to label an item or
characteristics.

Example: A business school designates subject specializations by numbers, such as MBA in Finance = 401 and MBA in Systems = 402. Various brands of toothpaste, savings bank account numbers and jersey numbers of football players are other examples.

Nominal-Level data have the following properties-


• Data categories are represented by labels or names.
• Even when the labels are numerically coded, the data categories have
no logical order.
FR
Nominal Scale Example
Categorical Variable Categories

Types of Investment Growth, Value, Others

Network Provider Reliance, Tata, Vodafone, Airtel

Brands of Shirts Raymonds, Turtle, Color Plus


FR
Example
Table-Source of World Oil Supply for 2004
Source Millions of Barrels per Day
OPEC 32.91

OECD 22.76
Russia 11.33
China 3.62

Others 12.35

► The variable of interest is the country or region.


► This is a nominal-level variable because we record the information by the source of the oil
supply.
► There is no particular or logical order to the categories.
FR
Ordinal-Level Data
Ordinal-level data or Rank data are used to rank objects or attributes. The
properties of ordinal level of data are –
i) Data classifications are represented by sets of labels or names (high, medium, low etc.) that have relative values.
ii) Because of the relative values, the data classified can be ranked or
ordered.
Example: Rating of a Finance Professor

Rating      Frequency
Superior         6
Good            28
Average         25
Poor            12
Inferior         3
FR
Ordinal Scale Example
Categorical Variable          Ordered Categories

Student class designation     Fresher, Junior, Senior

Product satisfaction          Excellent, Good, Average, Not Satisfied

Faculty status                Professor, Associate Professor, Assistant Professor
FR
Interval-Level Data
• An interval scale is an ordered scale in which the difference between
measurements is a meaningful quantity but the measurements do not have a
true zero point.

• It includes all the characteristics of the ordinal level, but, in addition, the
difference of values is a constant.

Example: A popular one is temperature in centigrade, where the distance between 94 degrees centigrade and 96 degrees centigrade is the same as the distance between 100 degrees centigrade and 102 degrees centigrade.

• In interval scale of measurement, the value of zero is assigned arbitrarily and


therefore we cannot take ratios of two measurements. We can take ratios of
intervals.

Example: We measure the time of day, which is on an interval scale. We cannot say that 10 A.M. is twice as late as 5 A.M., but we can say that the interval between 0:00 A.M. and 10:00 A.M. is twice as long as the interval between 0:00 A.M. and 5:00 A.M.
Ratio-Level Data
• Practically, all quantitative data is recorded on the ratio level of measurement.
• It is the highest level of measurement that has the requisite desirable properties and data
measured on a ratio scale has a fixed zero point.
• Examples include business data like cost, revenue, market share, price, wages.
• Equal differences in the characteristics are represented by equal differences in the
numbers assigned to the classification.
• It has all the characteristics of the interval level, but in addition, the zero point is
meaningful and the ratio of two numbers is meaningful. The zero point is the absence of
the characteristics.
• Say, money and weight can be good examples. Zero dollars and zero kilograms imply no
money or weight. Further, Jim earns $40000 per year selling insurance and Rob earns
$80000 per year selling cars, then Rob earns twice as much as Jim.
Hierarchy of Data
• At the top of the list, we place the ratio-level data type because virtually all computations are
allowed. The nominal data type is at the bottom because no calculations other than determining
frequencies are permitted. We are permitted to perform calculations using the frequencies of codes,
but this differs from performing calculations on the codes themselves.

• Higher-level data types may be treated as lower-level ones. For example, in universities and colleges,
we convert the marks in a course, which can be considered to be interval, to letter grades, which are
ordinal. Some graduate courses feature only a pass or fail designation. In this case, the interval data
are converted to nominal.

• It is important to point out that when we convert higher-level data to lower-level data we lose
information. For example, a mark of 83 on an accounting course exam gives far more information
about the performance of that student than does a letter grade of A, which might be the letter grade
for marks between 80 and 90. As a result, we do not convert data unless it is necessary to do so.

• It is also important to note that we cannot treat lower-level data types as higher-level types. Interval data may be treated as nominal or ordinal. Ordinal data may be treated as nominal but cannot be treated as
interval. Nominal data cannot be treated as ordinal or interval.
FR
Data Collection/Sources
• The person who collects the statistical data is called the investigator, whereas the person who provides raw data and facts to the investigator is called the respondent.
• Based on the collection issue, there can be two types of data-Primary and Secondary data.

Primary Data
• This refers to the data that the investigator collects for the very first time.
• This data has not been collected by anyone before.
• Primary data provide the investigator with the most reliable first-hand information about the respondents.
• The investigator has a clear idea about the terminologies used, the statistical units employed, the research methodologies and the size of the sample.
• Primary data may either be internal or external to the organization.
FR
Methods of Collection of Primary Data
(i) Direct Personal Investigation
The investigator/researcher personally approaches the respondents to investigate and gather the required information.

(ii) Indirect Oral Interview


The investigator approaches (either by telephonic interview) an indirect respondent
who possesses the appropriate information for the research.

(iii) Mailed Questionnaire


Consists of mailing a series of questions related to the research. The respondent answers the questionnaire and forwards it back to the investigator. It is a time-saving and cost-effective process. The limitation is that the researcher can only investigate those respondents who have access to the internet and an email account.
Methods of Collection of Primary Data
(iv) Schedules
This involves a face-to-face situation with respondents. The interviewer asks questions according to a form known as a schedule.

Note: a questionnaire is filled in personally by the respondent, and the interviewer may or may not be present. For schedules, the interviewer's physical presence is compulsory.

(v) Local Agencies


The interviewer hires or employs a local agency to work for him/her and help in
gathering appropriate information.
Secondary Data

• It refers to data that the investigator obtains from another source, i.e. data collected by past investigators or agencies for their own studies.
• There are problems in clarity and issues about the intricacies of the data.
• There may be ambiguity in terms of the sample size and sample technique.
• There may also be unreliability with respect to the accuracy of the data.

• Published Sources-There are many national organizations, international agencies,


official publications that collect various statistical data. They collect data related to
business, commerce, trade, prices, economy, productions, services, industries,
currency and foreign affairs. They also collect information related to various socio-
economic phenomenon and publish them.
• Unpublished Sources-Some statistical data are not always a part of publication. Such
data are stored by institutions and private firms. Researcher can make use of these
unpublished data.
Organizing
and
Visualizing Data
Table: Prices ($) of vehicles sold last month at Whitner Autoplex

23,197  23,372  20,454  $3,591  23,651  27,453  17,266
18,021  28,683  30,872  19,587  23,169  35,851  19,251
20,047  24,285  24,324  24,609  28,670  15,546  15,953
19,873  25,251  25,277  28,034  24,533  27,443  19,889
20,004  17,357  20,155  19,688  23,657  26,613  20,895
20,203  23,765  25,783  26,661  32,277  20,642  21,981
24,052  25,799  15,794  18,263  35,925  17,399  17,968
20,356  21,442  21,772  19,331  22,817  19,766  20,633
20,962  22,845  26,285  27,898  29,076  32,492  18,890
21,740  22,374  24,571  25,449  28,337  20,642  23,613
24,220  30,655  22,442  17,891  20,818  26,237  20,445
21,556  21,639  24,296

(Lowest price: $15,546; highest price: $35,925)
Problem
• Suppose we want to summarize last month's sales using the selling price, i.e. we want to determine the value around which the selling prices tend to cluster.

• The data presented in the table are raw because they are unprocessed by statistical methods. (Any information before it is arranged and analysed is called raw data.)

• For the purpose of organizing and summarization, we can start with a frequency distribution.

• A Frequency Distribution is a grouping of data into mutually exclusive classes showing the number of observations in each class.
FR
Table-Frequency Distribution of selling prices at Whitner Autoplex
(Last Month)
Selling prices Frequency
($ thousands)

15k up to 18k 8

18k up to 21k 23

21k up to 24k 17

24k up to 27k 18

27k up to 30k 8

30k up to 33k 4

33k up to 36k 2

Total 80
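The frequency distribution above can be reproduced with a few lines of pandas. This is an illustrative sketch, not part of the original slides: it assumes the 80 selling prices are available in a list (only a placeholder subset is shown here) and uses pandas.cut to form the mutually exclusive classes; the relative frequencies used on a later slide fall out of the same grouping.

```python
import pandas as pd

# Placeholder subset; in practice this would be all 80 selling prices from the table.
prices = pd.Series([23197, 18021, 20047, 19873, 15546, 35925, 24296, 30872])

edges = list(range(15000, 36001, 3000))                   # 15k, 18k, ..., 36k
labels = [f"{lo//1000}k up to {(lo + 3000)//1000}k" for lo in edges[:-1]]

# right=False makes the classes [15k, 18k), [18k, 21k), ... as in the table.
classes = pd.cut(prices, bins=edges, labels=labels, right=False)
freq = classes.value_counts().sort_index()
rel_freq = freq / freq.sum()                              # relative frequencies

print(pd.DataFrame({"Frequency": freq, "Relative Frequency": rel_freq.round(4)}))
```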
FR
Observations from Frequency
Distribution
• The selling prices ranged from about $15,000 up to about $36,000.

• The selling prices are concentrated between $18,000 and $27,000. A total of 58, or 72.5 percent, of the vehicles were sold within this range.

• The largest concentration, or highest frequency, is in the $18,000 up to $21,000 class. The midpoint of this class is $19,500, so we say the typical selling price is $19,500.
FR
Limitation
• By organizing the data into a frequency distribution, we cannot pinpoint an exact selling price, such as $23,197 or $26,237.

• Further, we cannot tell that the actual selling price for the least expensive vehicle was $15,546 and for the most expensive $35,925. However, the lower limit of the first class and the upper limit of the largest class convey essentially the same meaning.

• Likely, we will make the judgement that the lowest price is around $15,000 (the exact price is $15,546).

• The advantage of converting the data into a more understandable and organized form more than offsets this disadvantage.
FR
Relative
Frequency
Table -Relative Frequency Distribution of the Prices of Vehicles Sold Last Month at
Whitner Autoplex

Selling Price Frequency Relative Frequency


($ thousands)
15k up to 18k 8 0.1000 8/80
18k up to 21k 23 0.2875 23/80
21k up to 24k 17 0.2125 17/80
24k up to 27k 18 0.2250 18/80
27k up to 30k 8 0.1000 8/80
30k up to 33k 4 0.0500 4/80
33k up to 36k 2 0.0250 2/80
Total 80 1.0000
Histogram
Histogram is a graph in which the classes are marked on the horizontal axis and the class
frequencies on the vertical axis. The class frequencies are represented by the heights of the
bars and the bars are drawn adjacent to each other.
Frequency Polygon
Frequency Polygon consists of line segments connecting the points formed by the
intersection of the class midpoints and the class frequencies. Class midpoint is the value at
the center of the class.

To complete the frequency polygon, midpoints of $13.5 thousand and $37.5 thousand are added to the X-axis to "anchor" the polygon at zero frequency.
Both the histogram and the frequency polygon allow us to get a quick picture of the main characteristics of the data (highs, lows, points of concentration).
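As an illustration of how the two displays relate, the sketch below (an addition, not from the original slides) draws the histogram and the frequency polygon for the Whitner Autoplex distribution from its class midpoints and frequencies, anchoring the polygon at $13.5 and $37.5 thousand.

```python
import matplotlib.pyplot as plt

midpoints = [16.5, 19.5, 22.5, 25.5, 28.5, 31.5, 34.5]   # class midpoints, $ thousands
freqs     = [8, 23, 17, 18, 8, 4, 2]

# Histogram: adjacent bars whose width equals the class width (3 thousand).
plt.bar(midpoints, freqs, width=3, edgecolor="black")

# Frequency polygon: connect the class midpoints, anchored at zero frequency
# at $13.5 and $37.5 thousand.
poly_x = [13.5] + midpoints + [37.5]
poly_y = [0] + freqs + [0]
plt.plot(poly_x, poly_y, marker="o", color="red")

plt.xlabel("Selling price ($ thousands)")
plt.ylabel("Frequency")
plt.title("Histogram and frequency polygon of selling prices")
plt.show()
```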
Frequency Polygon
FR
Advantages
The advantages of the histogram are:
(i) The rectangles clearly show each class in the distribution.
(ii) The area of each rectangle, relative to all other rectangles, shows the proportion of the total number of observations that occur in that class.

The advantages of the frequency polygon are:
(i) It sketches an outline of the data pattern more clearly.
(ii) The polygon becomes increasingly smooth and curve-like as we increase the number of classes and the number of observations.
(iii) It allows us to compare directly two or more frequency distributions.
FR
Cumulative
Frequency
A cumulative frequency distribution enables us to see how many observations lie above or below certain values, rather than merely recording the number of items within each class.

Table -Cumulative Frequency Distribution of the Prices (less than type)

Selling Price Frequency Cumulative Frequency


($ thousands) (Less than type)
15k up to 18k 8 8 Less than 18k
18k up to 21k 23 31 (8+23) Less than 21k
21k up to 24k 17 48 (31+17) Less than 24k
24k up to 27k 18 66 (48+18)
27k up to 30k 8 74 (66+8)
30k up to 33k 4 78 (74+4)
33k up to 36k 2 80 (78+2) Less than 36k
Total 80
FR
Cumulative
Frequency
Table -Cumulative Frequency Distribution of the Prices (more than type)

Selling Price Frequency Cumulative Frequency


($ thousands) (More than type)
15k up to 18k 8 80 15k or more
18k up to 21k 23 72 (80-8) 18k or more
21k up to 24k 17 49 (72-23) 21k or more
24k up to 27k 18 32 (49-17)
27k up to 30k 8 14 (32-18)
30k up to 33k 4 6 (14-8)
33k up to 36k 2 2 (6-4) 33k or more
Total 80
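Both cumulative distributions can be obtained directly from the class frequencies with a cumulative sum. A minimal sketch, assuming numpy is available:

```python
import numpy as np

freqs = np.array([8, 23, 17, 18, 8, 4, 2])        # 15k-18k, 18k-21k, ..., 33k-36k

less_than = np.cumsum(freqs)                      # 8, 31, 48, 66, 74, 78, 80
more_than = np.cumsum(freqs[::-1])[::-1]          # 80, 72, 49, 32, 14, 6, 2

print("Less-than cumulative:", less_than)
print("More-than cumulative:", more_than)
```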
Ogives
FR
Shapes of
Histogram

Symmetric Histogram-A histogram is said to be symmetric if, when


we draw a vertical line down the center of the histogram, the two sides
are identical in shape and size.
FR
Shapes of
Histogram

Skewed Histogram-A skewed histogram is one with a long tail extending to either the right or
the left. Skewness, in statistics, is the degree of distortion from the symmetrical bell curve or normal
distribution. Many models assume normal distribution; i.e., data are symmetric about the mean. The
normal distribution has a skewness of zero. But in reality, data points may not be perfectly symmetric.
So, an understanding of the skewness of the dataset indicates whether deviations from the mean are
going to be positive or negative.
FR
Skewness of Histogram

Incomes of employees in large firms tend to be positively skewed because there


is a large number of relatively low-paid workers and a small number of well-
paid executives.
The time taken by students to write exams is frequently negatively skewed
because few students hand in their exams early; most prefer to reread their
papers and hand them in near the end of the scheduled test period.
Investors note skewness when judging a return distribution because it, like
kurtosis, considers the extremes of the data set rather than focusing solely on
the average of a set of data.
FR
Unimodal and Bimodal
Histogram

A mode is the observation that occurs with the greatest frequency. A modal class is the
class with the largest number of observations. A unimodal histogram is one with a single
peak. A special type of symmetric unimodal histogram is one that is bell shaped.
A bimodal histogram is one with two peaks, not necessarily equal in height. Bimodal
histograms often indicate that two different distributions are present.
Scenario 1
• A financial manager must be familiar with the main characteristics of the capital markets where long-
term financial assets such as stocks and bonds trade. A well-functioning capital market provides
managers with useful information concerning the appropriate prices and rates of return that are
required for a variety of financial securities with differing levels of risk. Statistical methods can be
used to analyze capital markets and summarize their characteristics, such as the shape of the
distribution of stock or bond returns.
• The return on an investment is calculated by dividing the gain (or loss) by the value of the
investment. For example, a $100 investment that is worth $106 after 1 year has a 6% rate of return. A
$100 investment that loses $20 has a –20% rate of return. For many investments, including individual
stocks and stock portfolios (combinations of various stocks), the rate of return is a variable. In other
words, the investor does not know in advance what the rate of return will be. It could be a positive
number, in which case the investor makes money—or negative, and the investor loses money.
• Investors are torn between two goals. The first is to maximize the rate of return on investment. The
second goal is to reduce risk. If we draw a histogram of the returns for a certain investment, the
location of the center of the histogram gives us some information about the return one might expect
from that investment. The spread or variation of the histogram provides us with guidance about the
risk. If there is little variation, an investor can be quite confident in predicting what his or her rate of
return will be. If there is a great deal of variation, the return becomes much less predictable and thus
riskier. Minimizing the risk becomes an important goal for investors and financial analysts.
FR
Scenario 2
A business researcher measured the volume of stocks traded on Wall Street three times a
month for nine years resulting in a database of 324 observations. Suppose a financial
decision maker wants to use these data to reach some conclusions about the stock market.
The Figure shows a histogram of these data. What can we learn from this histogram?
FR
Scenario 2
• Virtually all stock market volumes fall between zero and 1 billion shares. The
distribution takes on a shape that is high on the left end and tapered to the
right. The shape of this distribution is skewed toward the right end.
• In statistics, it is often useful to determine whether data are approximately
normally distributed (bell shaped curve). We can see by examining the
histogram that the stock market volume data are not normally distributed.
• Although the center of the histogram is located near 500 million shares, a large
portion of stock volume observations falls in the lower end of the data
somewhere between 100 million and 400 million shares.
• In addition, the histogram shows some outliers in the upper end of the
distribution. Outliers are data points that appear outside of the main body of
observations and may represent phenomena that differ from those
represented by other data points.
• By observing the histogram, we notice a few data observations near 1 billion.
One could conclude that on a few stock market days an unusually large
volume of shares are traded. These and other insights can be gleaned by
examining the histogram and show that histograms play an important role in
the initial analysis of data.
Scenario 3
• Suppose that you are facing a decision about where to invest that small
fortune that remains after you have deducted the anticipated expenses for
the next year from the earnings from your summer job. A friend has
suggested two types of investment, and to help make the decision you
acquire some rates of return from each type. You would like to know the
types of information, such as whether the rates are spread out over a wide
range (making the investment risky) or are grouped tightly together
(indicating relatively low risk).
• Draw histograms for each set of returns and report on your findings. Which investment would you choose and why?

• The returns on Investment A and the returns on Investment B are given, grouped into the classes below.

From    To     Investment A    Investment B
 -45   -30          0               5
 -30   -15          6               5
 -15     0         10               2
   0    15         17              16
  15    30          7               8
  30    45          6               8
  45    60          2               3
  60    75          2               3
Comparison Using Histogram
Comparison Using Frequency Polygon
FR
Interpretation
• The center of the histogram of the returns of investment A is slightly lower than that for investment B.
• The spread of returns for investment A is considerably less than that for
investment B.
• Both histograms are slightly positively skewed.
• These findings suggest that investment A is superior. Although the
returns for A are slightly less than those for B, the wider spread for B
makes it unappealing to most investors.
• Both investments allow for the possibility of a relatively large return.
FR
Limitation
• One of the drawbacks of the histogram is that we lose potentially useful
information by classifying the observations.
• By classifying the observations we did acquire useful information.
However, the histogram focuses our attention on the frequency of each
class and by doing so sacrifices whatever information was contained in
the actual observations.
• The stem-and-leaf display is a method that to some extent overcomes this
loss.
FR
Stem and Leaf Plot
• Below are the runs scored by a batsman X in last 27 innings
• 30,29,29,11,61,54,44,10,11,39,25,15,34,52,30,15,36,18,10,59,66,24,35,41,22,
25,13
Stem | Leaf
  1  | 0 0 1 1 3 5 5 8
  2  | 2 4 5 5 9 9
  3  | 0 0 4 5 6 9
  4  | 1 4
  5  | 2 4 9
  6  | 1 6
Stem and Leaf Plot v/s FR
Histogram
• 30,29,29,11,61,54,44,10,11,39,25,15,34,52,30,15,36,18,10,59,66,24,35,41,22,2
5,13

► Stem► Leaf
► 1 ► 0 0 1 1 3 5 5 8
► 2 ► 2 4 5 5 9 9
► 3 ► 0 0 4 5 6 9
► 4 ► 1 4
► 5 ► 2 4 9
► 6 ► 1 6

► The length of each line represents the frequency in the class interval defined by the
stems.
► The advantage of the stem-and-leaf display over the histogram is that we can see the
actual observations.
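A stem-and-leaf display for the batting data can be produced with a short script. The sketch below is illustrative; it simply uses the tens digit as the stem and the units digit as the leaf:

```python
from collections import defaultdict

runs = [30, 29, 29, 11, 61, 54, 44, 10, 11, 39, 25, 15, 34, 52, 30, 15, 36, 18,
        10, 59, 66, 24, 35, 41, 22, 25, 13]

plot = defaultdict(list)
for value in sorted(runs):
    plot[value // 10].append(value % 10)      # stem = tens digit, leaf = units digit

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```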
FR
Note
Factors That Identify When to Use a Histogram, Frequency Polygon, Ogive, or Stem-and-Leaf Display:
• Objective: Describe a single set of data
• Data type: Interval or Ratio level
Visualizing Data
FR
Summary Table
• A summary table tallies the values as frequencies or percentages for each category. It helps to see the differences among the categories by displaying the frequency or percentage of each.
• The below summary table tallies response to a recent survey that asked young adults
about main reason that they shop online.
Reason Percentage
Better Price 37%
Avoiding holiday crowds or hassles 29%
Convenience 18%
Better Selection 13%
Ship Directly 3%

• From the table, you can conclude that 37% shop online mainly for better prices and 29% shop online mainly to avoid holiday crowds or hassles.
FR
Contingency Table
• It tallies jointly the values of two or more categorical variables, allowing you to study
patterns that exist between the variable.
• Tallies can be shown as frequency, a percentage of overall total, a percentage of row total
or column total.
• Each tally appears in its own cell and there is a cell for each joint response.
• For a sample of 316 retirement funds, a contingency table is constructed to exhibit the relationship between the fund type variable and the risk level variable.
Risk Level
Fund Type Low Average High Total
Growth 143 74 10 227
Value 69 17 3 89
Total 212 91 13 316

► Because fund type variable has defined categories Growth and Value and the risk level has
categories Low, Average and High, there are six possible joint responses for the table.
► For the first fund tallied in the sample, you would add a tally to the cell at the intersection of the Growth row and the Low column. Growth with low risk is the most frequent joint response.
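A contingency table like the one above can be produced with pandas.crosstab. The sketch below is illustrative only: the `funds` DataFrame is a made-up five-row sample, not the actual 316 funds. The normalize options correspond to the three percentage views shown on the next slides.

```python
import pandas as pd

funds = pd.DataFrame({
    "Fund Type":  ["Growth", "Growth", "Value", "Growth", "Value"],
    "Risk Level": ["Low", "Average", "Low", "Low", "High"],
})

# Joint frequency counts with row/column totals.
counts = pd.crosstab(funds["Fund Type"], funds["Risk Level"], margins=True)
print(counts)

# Percentage of overall total, of row totals and of column totals.
print(pd.crosstab(funds["Fund Type"], funds["Risk Level"], normalize="all"))
print(pd.crosstab(funds["Fund Type"], funds["Risk Level"], normalize="index"))
print(pd.crosstab(funds["Fund Type"], funds["Risk Level"], normalize="columns"))
```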
Contingency Table on Percentage of Overall Total
Risk Level
Fund Type Low Average High Total
Growth 143 74 10 227
Value 69 17 3 89
Total 212 91 13 316

The percentages are taken of the total number of funds. The table shows that 71.84% of funds sampled are growth funds, 28.16% are value funds and 45.25% are growth funds with low risk.
Risk Level

Fund Type Low Average High Total

Growth 45.25% 23.42% 3.16% 71.84%

Value 21.84% 5.38% 0.95% 28.16%

Total 67.09% 28.80% 4.11% 100%


Contingency Table on Percentage of Row Total
Risk Level
Fund Type Low Average High Total
Growth 143 74 10 227
Value 69 17 3 89
Total 212 91 13 316
The percentages are taken of the row totals. The table shows that 63% of growth funds have low risk and
77.53% of value funds have low risk.
Risk Level

Fund Type Low Average High Total

Growth 63% 32.60% 4.41% 100%

Value 77.53% 19.10% 3.37% 100%

Total 67.09% 28.80% 4.11% 100%


Contingency Table on Percentage of Column Total
Risk Level
Fund Type Low Average High Total
Growth 143 74 10 227
Value 69 17 3 89
Total 212 91 13 316
The percentages are taken of the column totals. The table shows that, of the funds that have low risk, 67.45% are growth funds.
Risk Level

Fund Type Low Average High Total

Growth 67.45% 81.32% 76.92% 71.84%

Value 32.55% 18.68% 23.08% 28.16%

Total 100% 100% 100% 100%


FR
The Bar Graph
Bar Graph- A graph showing the differences in frequencies or percentages among categories of a
nominal or an ordinal variable. The categories are displayed as rectangles of equal width with
their height proportional to the frequency or percentage of the category.
Living Arrangements of Males (65 and Older) in the United States, 2000
We can display more information by splitting by gender.
What is the problem with this graph?
What is the problem with this graph?
Gantt Chart

In this case, a vertical display allows better comparison of calorie amounts.
Pie chart
Pie Chart- A graph showing the differences in frequencies or percentages among categories of a
nominal or an ordinal variable. The categories are displayed as segments of a circle whose pieces
add up to 100 percent of the total frequencies.
Too many categories can be messy
FR
Comparison of Bar Chart and
Pie Chart
• If a pie chart has too many wedges, the wedges are hard to contrast visually against each other, compared with the heights of the bars in a bar graph.
• Bar charts are easier to read when we are comparing categories or
looking at change over time.
• The only thing bar charts lack is the whole-part relationship that
makes pie charts unique.
FR
Pareto Chart
• Pareto Chart helps us to visually identify the “vital few” categories
from the “trivial many” categories so that you can focus on the
important categories.
• It is based on the Pareto principle, which states that in many data sets a few categories account for the majority of the data while the many other categories account for a relatively small or trivial amount of data.
• In a Pareto chart, the frequencies of the categories are plotted as vertical bars in descending order and are combined with a cumulative percentage line on the chart.
• The cumulative percentage line is plotted at the midpoint of each category.
Example
Consider a bank study team wants to enhance the user experience of ATMs. A survey generates the
following table-
Cause Frequency
Card jammed 365
Card unreadability 234
ATM malfunction 32
ATM out of cash 28
Invalid amount request 23
Wrong password 23
Lack of funds in accounts 19
Total 724

The frequencies may not be given in descending order, so they need to be arranged first. From the table, the percentage frequency and the cumulative percentage then need to be computed.
FR
Table
Cause                        Frequency   Percentage Frequency   Cumulative Percentage
Card jammed                     365            50.41                  50.41
Card unreadability              234            32.32                  82.73
ATM malfunction                  32             4.42                  87.15
ATM out of cash                  28             3.87                  91.02
Invalid amount request           23             3.18                  94.20
Wrong password                   23             3.18                  97.38
Lack of funds in account         19             2.62                 100.00
Total                           724           100.00
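The Pareto chart itself can be drawn by combining a bar chart of the ordered frequencies with a cumulative percentage line on a secondary axis. A minimal matplotlib sketch using the ATM data above:

```python
import matplotlib.pyplot as plt

causes = ["Card jammed", "Card unreadability", "ATM malfunction", "ATM out of cash",
          "Invalid amount request", "Wrong password", "Lack of funds in account"]
freqs  = [365, 234, 32, 28, 23, 23, 19]            # already in descending order

total = sum(freqs)
cum_pct, running = [], 0
for f in freqs:
    running += f
    cum_pct.append(100 * running / total)          # cumulative percentage

x = list(range(len(causes)))
fig, ax1 = plt.subplots()
ax1.bar(x, freqs)
ax1.set_xticks(x)
ax1.set_xticklabels(causes, rotation=45, ha="right")
ax1.set_ylabel("Frequency")

ax2 = ax1.twinx()                                   # secondary axis for cumulative %
ax2.plot(x, cum_pct, color="red", marker="o")
ax2.set_ylabel("Cumulative percentage")
ax2.set_ylim(0, 100)

plt.title("Pareto chart of incomplete ATM transactions")
plt.tight_layout()
plt.show()
```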
Comments
• Because the categories in a Pareto chart are ordered by decreasing frequency of occurrence, the team can quickly see which causes contribute most to the problem of incomplete transactions. These causes are the "vital few", and figuring out ways to avoid such cases would presumably be a starting point for improving the user experience of ATMs.
• By following the cumulative percentage line, we can see that the
first two causes account for 82.73% of incomplete transactions.
Describing Data: Measures of Central Tendency
FR
Introduction
• While graphical techniques for organizing and displaying data allow the researcher to make some general observations about the shape and spread of the data, a more complete understanding of the data can be attained by summarizing the data using statistics. This section deals with measures of central tendency, measures of variability and measures of shape. The computation of these measures differs for ungrouped and grouped data.
• Central tendency is the extent to which the values of a numerical variable group around a typical or central value.
• Most variables show a distinct tendency to group around a central value.
• When people talk about an "average value", a "middle value" or a "most frequent value", they are talking about the mean, the median and the mode respectively.
Arithmetic Mean
The arithmetic mean (typically referred to as the mean) is the most common
measure of central tendency.
The mean suggests a typical or central value and serve as the “balance point” or
fulcrum in a set of data.
Let X₁, X₂, ..., Xₙ be a set of n values, where n represents the size of the sample. The sample mean is given by

    Mean = (X₁ + X₂ + ... + Xₙ) / n = (Σ Xᵢ) / n

Advantage of the mean: as a single number, it represents the whole dataset. It is a unique measure because every dataset has one and only one mean. This is useful for statistical procedures such as comparing the means of several datasets.
Disadvantage: although the mean is reliable in the sense that it reflects all the values in the dataset, it may also be affected by extreme values that are not representative of the rest of the data.
Calculating the Mean from Grouped Data
If the frequency distribution consist of data that are grouped within
classes, each value of the observation falls somewhere in one of the
classes.
In that case, to find the arithmetic mean, we calculate the midpoint
of each class. Then we multiply the midpoint with the frequency of
the class, sum then and divide the sum by total number of
observations in the sample. The sample mean is given by

    Mean = (Σ f·x) / n

where  f = frequency of each class
       x = midpoint of each class
       n = total number of observations in the sample
FR
Table: Average Monthly Balances of 600 customers
Class($) Frequency(f)
0 - 49.99 78
50 - 99.99 123
100 - 149.99 187
150 - 199.99 82
200 - 249.99 51
250 - 299.99 47
300 - 349.99 13
350 - 399.99 9
400 - 449.99 6
450 - 499.99 4
FR
Table: Average Monthly Balances of 600 Customers

Class ($)        Frequency (f)   Midpoint (x)     f·x
0 - 49.99             78              25          1,950
50 - 99.99           123              75          9,225
100 - 149.99         187             125         23,375
150 - 199.99          82             175         14,350
200 - 249.99          51             225         11,475
250 - 299.99          47             275         12,925
300 - 349.99          13             325          4,225
350 - 399.99           9             375          3,375
400 - 449.99           6             425          2,550
450 - 499.99           4             475          1,900
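The grouped mean is then the sum of the f·x column divided by the number of observations. A small sketch confirming the calculation for the table above:

```python
# Class midpoints and frequencies from the table of average monthly balances.
midpoints   = [25, 75, 125, 175, 225, 275, 325, 375, 425, 475]
frequencies = [78, 123, 187, 82, 51, 47, 13, 9, 6, 4]

n = sum(frequencies)                                   # 600 customers
total = sum(f * x for f, x in zip(frequencies, midpoints))

mean = total / n
print(f"Sum of f*x = {total}, n = {n}, grouped mean = {mean:.2f}")   # about $142.25
```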
FR
Limitation of Mean for Grouped
Data
The mean calculated from grouped data is only an approximation because we assume that all the values in any class are equal to the midpoint of the class.
It cannot be computed for an open-ended class.
Class Interval Frequency
4.2 – 4.5 2
4.6 – 4.9 2
5.0 – 5.3 2
5.4 and above 1 (can you calculate midpoint of
this class?)
FR
Example
• According to Procter & Gamble, 35 billion loads of laundry are run in the
United States each year. Every second 1,100 loads are started. Statistics
show that one person in the United States generates a quarter of a ton of
dirty clothing each year. Americans appear to be spending more time
doing laundry than they did 40 years ago. Today, the average American
woman spends seven to nine hours a week on laundry.
• However, industry research shows that the result is dirtier laundry than
in other developed countries. Various companies market new and
improved versions of washers and detergents. Yet, Americans seem to be
resistant to manufacturers’ innovations in this area. In the United States,
the average washing machine uses about 16 gallons of water. In Europe,
the figure is about 4 gallons. The average wash cycle of an American
wash is about 35 minutes compared to 90 minutes in Europe. Americans
prefer top loading machines because they do not have to bend over, and
the top-loading machines are larger. Europeans use the smaller front-loading machines.
Weighted Mean

Grade of labour        Hourly wage   Labour hours per unit   Labour hours per unit
                                         of Product 1            of Product 2
Unskilled labour          $5.00              1                       4
Semi-skilled labour       $7.00              2                       3
Skilled labour            $9.00              5                       3
Weighted Mean

The weighted mean takes into account the relative importance (weight) of each value:

    Weighted mean = Σ(w · x) / Σ w

where w is the weight assigned to each observation x.
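As a worked illustration of the weighted mean, the sketch below computes the average hourly labour cost for each product from the table above, weighting each wage rate by the labour hours of that grade (the function name is ours, not from the slides):

```python
wages           = [5.00, 7.00, 9.00]        # unskilled, semi-skilled, skilled
hours_product_1 = [1, 2, 5]
hours_product_2 = [4, 3, 3]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(f"Product 1: ${weighted_mean(wages, hours_product_1):.2f} per labour hour")  # $8.00
print(f"Product 2: ${weighted_mean(wages, hours_product_2):.2f} per labour hour")  # $6.80
```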
Geometric Mean
• Geometric mean is the measure of the central tendency when data is
changing over time.
• Examples might be growth of investments, the inflation rate or the
change of gross national product.
• Consider the growth of an initial investment of $1000 in a saving
account that is deposited for a period of five years. The interest rate
which is accumulated annually is different for each year.
• The table gives the interest rate and the growth of the investment for each year.

Year   Interest Rate (%)   Growth Factor   Value at Year End
 1           6.0               1.060           $1,060.00
 2           7.5               1.075           $1,139.50
 3           8.2               1.082           $1,232.94
 4           7.9               1.079           $1,330.34
 5           5.1               1.051           $1,398.19
Geometric Mean

The geometric mean of n positive values is the nth root of their product:

    G.M. = (x₁ · x₂ · ... · xₙ)^(1/n)

Applied to the growth factors above, the geometric mean growth factor is (1.060 × 1.075 × 1.082 × 1.079 × 1.051)^(1/5) ≈ 1.0693, i.e. an average growth rate of about 6.93% per year.
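A short sketch applying the geometric mean to the growth factors in the table above:

```python
import math

growth_factors = [1.060, 1.075, 1.082, 1.079, 1.051]

product = math.prod(growth_factors)
gm = product ** (1 / len(growth_factors))          # geometric mean growth factor

print(f"Geometric mean growth factor = {gm:.4f}")              # about 1.0693
print(f"Average annual growth rate   = {(gm - 1) * 100:.2f}%")
print(f"Value of $1000 after 5 years = ${1000 * product:.2f}") # about $1398.19
```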
Median

The median is the middle value when the data are arranged in ascending order. For an even number of observations, it is the average of the two middle values.

Data:      29  31  35  39  39  40  43  44  44  52
Position:   1   2   3   4   5   6   7   8   9  10

With n = 10, the median is the average of the 5th and 6th values: (39 + 40) / 2 = 39.5.
FR
Median for Grouped Data
Consider previous tables of average monthly balance
Class($) Frequency(f)
0 - 49.99 78
50 - 99.99 123
100 - 149.99 187
150 - 199.99 82
200 - 249.99 51
250 - 299.99 47
300 - 349.99 13
350 - 399.99 9
400 - 449.99 6
450 - 499.99 4
FR
Median for Grouped Data
Consider previous tables of average monthly balance
Class($) Frequency(f) Cumulative frequency
0 - 49.99 78 78
50 - 99.99 123 201
100 - 149.99 187 388
150 - 199.99 82 470
200 - 249.99 51 521
250 - 299.99 47 568
300 - 349.99 13 581
350 - 399.99 9 590
400 - 449.99 6 596
450 - 499.99 4 600
FR
Median for Grouped Data

For grouped data, the median is estimated as

    Median = L + ((n/2 − CF) / f) × h

where L is the lower boundary of the median class, n the total number of observations, CF the cumulative frequency of the class preceding the median class, f the frequency of the median class and h the class width.
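A small sketch applying this formula to the average-monthly-balance table above (the variable names are illustrative):

```python
# Lower class boundaries and frequencies from the balance table; class width $50.
lower_bounds = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450]
frequencies  = [78, 123, 187, 82, 51, 47, 13, 9, 6, 4]
width = 50

n = sum(frequencies)          # 600
half = n / 2                  # 300

cum = 0
for L, f in zip(lower_bounds, frequencies):
    if cum + f >= half:       # this is the median class
        median = L + (half - cum) / f * width
        break
    cum += f

print(f"Median class starts at ${L}; median = ${median:.2f}")   # about $126.47
```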

Advantage and Disadvantage of Median

Median is not affected by extreme values. It can be calculated for


grouped data with open ended classes unless it falls in an open
ended class.
Because the median is the value at the middle position, we must array the data before we can perform any computation. This is time consuming for a dataset with a large number of elements. Therefore, if we want to use a sample statistic as an estimate of a population parameter, the mean is often preferred to the median.
FR
Comparison of Mean and Median
The mean summarizes all the information in the data. The mean is a single
point that can be viewed as the point where all the mass—the weight—of the
observations is concentrated.
The median, on the other hand, is an observation (or a point between two
observations) in the center of the data set. One-half of the data lie above this
observation, and one-half of the data lie below it. The median is resistant to
extreme observations.
Median is much more stable than the mean. Adding a new observation may
not change the median significantly.
However, the drawback of median is that it is not calculated using the entire
data like in the case of mean. We are simply looking for the midpoint instead
of using the actual values of the data.
The mean, however, does have strong advantages as a measure of central
tendency. The mean is based on information contained in all the observations
in the data set, rather than being an observation lying “in the middle” of the
set. The mean also has some desirable mathematical properties that make it
useful in many contexts of statistical inference.
In cases where we want to guard against the influence of a few outlying observations, however, the median is the better choice.
Mode

The mode is another measure of central tendency; it is the value that occurs most frequently in a dataset.

Example: A systems manager keeps track of the number of server failures that occur in a day. Determine the mode for the following data, which represent the number of server failures per day over the past two weeks (shown arranged in ascending order):

    0  0  1  2  2  2  3  3  3  3  4  6  7  26

Solution: since 3 occurs four times, the mode is 3. Thus the systems manager can say that the most common occurrence is three server failures in a day.
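The mode can be found directly with collections.Counter. A minimal sketch using the server-failure data above:

```python
from collections import Counter

failures = [0, 0, 1, 2, 2, 2, 3, 3, 3, 3, 4, 6, 7, 26]

counts = Counter(failures)
mode_value, mode_count = counts.most_common(1)[0]   # most frequent value and its count

print(f"Mode = {mode_value} (occurs {mode_count} times)")   # Mode = 3 (occurs 4 times)
```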
FR
Mode for Grouped Data
• Class($) Frequency(f) Cumulative
frequency
0 - 49.99 78 78
50 - 99.99 123 201
100 - 149.99 187 388
150 - 199.99 82 470
200 - 249.99 51 521
250 - 299.99 47 568
300 - 349.99 13 581
350 - 399.99 9 590
400 - 449.99 6 596
450 - 499.99 4 600

The modal class is the class with the highest frequency: $100 up to $149.99, with a frequency of 187.
Applying the Mean, Median and Mode

When we work with statistical problem, we must decide whether to use the
mean, the median or the mode as a measure of central tendency. Symmetrical
distributions that contain only one mode always have the same value for the
mean, median and the mode. In that context, the choice is easy.
When the distributions are positively skewed or negatively skewed, the median
is often the best measure of location because it is always between the mean and
the mode. The median is not as highly influenced by the frequency of
occurrence of a single value as is the mode, nor is it affected by extreme values as is the mean.
Measures of Dispersion
Dispersion-Why is it important?

► The three curves have the same mean, but curve A has less spread or variability than curve B.
► From any data set, the central tendency tells us about the typical value of the data. To increase our understanding of the pattern of the data, we must also measure its dispersion, i.e. its spread or variability.
► It is important to know the amount of dispersion, variation or spread because data that are more dispersed or spread out are less reliable for analytical purposes.
Dispersion
• Which of the
distributions of scores
has the larger
dispersion?

• The upper
distribution has more
dispersion because
the values are more
spread out.
Measures of Dispersion: The Range
• Simplest measure of dispersion
• Difference between the largest and the
smallest values
Range = Xlargest – Xsmallest

Example: for a data set whose smallest value is 1 and largest value is 13 (shown on a number line from 0 to 14),

    Range = 13 - 1 = 12
Why the Range Can Be Misleading
• Ignores the way in which data are distributed

7 8 9 10 11 12 7 8 9 10 11 12
Range = 12 - 7 = 5 Range = 12 - 7 = 5

• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
FR
Mean Deviation
• Mean Deviation is the arithmetic mean of the absolute values of the deviations from the arithmetic mean:

    M.D. = Σ |x − mean| / n
FR
Variance and Standard Deviation
• The variance and standard deviations are also based on the deviations from the mean.
However, instead of the absolute value of the deviations, it squares the deviations.
• The larger the variance is, the more the scores deviate, on average, away from the
mean. The smaller the variance is, the less the scores deviate, on average, from the
mean.
• The variance provides us with only a rough idea about the amount of variation in the
data. However, this statistic is useful when comparing two or more sets of data of the
same type of variable. If the variance of one data set is larger than that of a second
data set, we interpret that to mean that the observations in the first set display more
variation than the observations in the second set.
• There is a variance and standard deviation both for a population and a sample.
• When the deviate scores are squared in variance, their unit of measure is squared as
well. Example- If people’s weights are measured in pounds, then the variance of the
weights would be expressed in pounds2 (or squared pounds).
• Since squared units of measure are often awkward to deal with, the square root of the variance is often used instead. The standard deviation is the square root of the variance.
Population Variance

    σ² = Σ (x − μ)² / N,    and the population standard deviation is σ = √σ²

where μ is the population mean and N is the population size.
Sample Variance

    s² = Σ (x − x̄)² / (n − 1),    and the sample standard deviation is s = √s²

where x̄ is the sample mean and n is the sample size.
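In practice these quantities are rarely computed by hand. The sketch below uses numpy; the ddof argument switches between the population formula (divide by N) and the sample formula (divide by n − 1). The data are the seven calorie values used in a later example.

```python
import numpy as np

calories = np.array([80, 100, 100, 110, 130, 190, 200])

pop_var    = np.var(calories, ddof=0)     # population variance (divide by N)
pop_sd     = np.std(calories, ddof=0)
sample_var = np.var(calories, ddof=1)     # sample variance (divide by n - 1)
sample_sd  = np.std(calories, ddof=1)

print(f"Population variance = {pop_var:.1f}, population SD = {pop_sd:.1f}")
print(f"Sample variance     = {sample_var:.1f}, sample SD   = {sample_sd:.1f}")
```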
Example
• The standard deviation of the biweekly amounts invested in the Dupree Paint Company profit-sharing plan is computed to be $7.51. Suppose these employees are located in Georgia. If the standard deviation for a group of employees in Texas is $10.47 and the means are about the same, it indicates that the amounts invested by the Georgia employees are not dispersed as much as those in Texas. Since the amounts invested by the Georgia employees are clustered more closely about the mean, the mean for the Georgia employees is a more reliable measure than the mean for the Texas group.
• Financial analysts are concerned about the dispersion of a firm's earnings. Widely dispersed earnings, varying from extremely high to low or even negative levels, indicate a high risk to stockholders and creditors.
• Quality control experts analyze the dispersion of a product's quality levels. A drug whose purity ranges from very pure to highly impure may endanger lives.
FR
Example
• Consistency is the hallmark of a good golfer. Golf equipment manufacturers
are constantly seeking ways to improve their products.
• Suppose that a recent innovation is designed to improve the consistency of
its users. As a test, a golfer was asked to hit 150 shots using a 7 iron, 75 of
which were hit with his current club and 75 with the new innovative 7 iron.
The distances were measured and recorded. Which 7 iron is more
consistent?
• To gauge the consistency, we must determine the standard deviations. The
standard deviation of the distances of the current 7 iron is 5.79 yards
whereas that of the innovative 7 iron is 3.09 yards. Based on this sample, the
innovative club is more consistent. Because the mean distances are similar it
would appear that the new club is indeed superior.
FR
Problem
Item    Calories    (x − mean)    (x − mean)²

1          80          -50           2500
2         100          -30            900
3         100          -30            900
4         110          -20            400
5         130            0              0
6         190           60           3600
7         200           70           4900

                                  Σ = 13200

The mean is 910 / 7 = 130 calories, the (population) variance is 13200 / 7 ≈ 1885.7, and the standard deviation is about 43.4 calories. The interval mean ± one standard deviation therefore runs roughly from 86.6 to 173.4 calories.
In fact, 57.1% (four out of seven) of the items lie within this interval.
FR
Variance for Grouped Data
Class           Frequency (f)   Mid Value (x)     f·x

700 - 799             4              750          3,000
800 - 899             7              850          5,950
900 - 999             8              950          7,600
1000 - 1099          10             1050         10,500
1100 - 1199          12             1150         13,800
1200 - 1299          17             1250         21,250
1300 - 1399          13             1350         17,550
1400 - 1499          10             1450         14,500
1500 - 1599           9             1550         13,950
1600 - 1699           7             1650         11,550
1700 - 1799           2             1750          3,500
1800 - 1899           1             1850          1,850

Total               100                         125,000        (mean = 125,000 / 100 = 1,250)
Interpreting the Standard Deviation
• Knowing the mean and standard deviation allows the statistics
practitioner to extract useful bits of information. The information
depends on the shape of the histogram. If the histogram is bell shaped,
we can use the Empirical Rule.
• Approximately 68% of all observations fall within one standard
deviation of the mean.
• Approximately 95% of all observations fall within two standard
deviations of the mean.
• Approximately 99.7% of all observations fall within three standard
deviations of the mean.
FR
Example
• After an analysis of the returns on an investment, a statistics
practitioner discovered that the histogram is bell shaped and that the
mean and standard deviation are 10% and 8%, respectively. What can
you say about the way the returns are distributed?
• Because the histogram is bell shaped, we can apply the Empirical Rule:
• Approximately 68% of the returns lie between 2% [the mean minus one standard deviation, 10 − 8] and 18% [the mean plus one standard deviation, 10 + 8].
• Approximately 95% of the returns lie between −6% [the mean minus two standard deviations, 10 − 2(8)] and 26% [the mean plus two standard deviations, 10 + 2(8)].
• Approximately 99.7% of the returns lie between −14% [the mean minus three standard deviations, 10 − 3(8)] and 34% [the mean plus three standard deviations, 10 + 3(8)].
FR
Chebysheff's Theorem

For any set of data, regardless of the shape of its histogram, the proportion of observations that lie within k standard deviations of the mean is at least 1 − 1/k², for k > 1. For example, at least 75% of observations lie within 2 standard deviations of the mean (k = 2) and at least 88.9% lie within 3 standard deviations (k = 3).
FR
Example
• The annual salaries of the employees of a chain of computer stores produced a
positively skewed histogram. The mean and standard deviation are $28,000 and
$3,000, respectively. What can you say about the salaries at this chain?

• Because the histogram is not bell shaped, we cannot use the Empirical Rule. We
must employ Chebysheff’s Theorem instead. The intervals can be created by adding
and subtracting two and three standard deviations to and from the mean.

• At least 75% of the salaries lie between $22,000 [the mean minus two standard
deviations =28,000 -2(3,000)] and $34,000 [the mean plus two standard
deviations=28,000+2(3,000)].

• At least 88.9% of the salaries lie between $19,000 [the mean minus three standard
deviations =28,000 -3(3,000)] and $37,000 [the mean plus three standard
deviations=28,000 +3(3,000)].
Coefficient of Variation

The coefficient of variation expresses the standard deviation as a percentage of the mean, which makes it possible to compare the variability of data sets measured in different units or with very different means:

    CV = (standard deviation / mean) × 100%
Standard Score

The standard score (z-score) measures how many standard deviations an observation lies above or below the mean:

    z = (x − mean) / standard deviation
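A small sketch computing the coefficient of variation and the z-scores for an illustrative sample (the numbers are made up for demonstration):

```python
import numpy as np

returns = np.array([2.0, 6.0, 10.0, 14.0, 18.0])    # illustrative data, in percent

mean = returns.mean()
sd = returns.std(ddof=1)                            # sample standard deviation

cv = sd / mean * 100                                # coefficient of variation, in %
z_scores = (returns - mean) / sd                    # SDs each value lies from the mean

print(f"Mean = {mean:.2f}, SD = {sd:.2f}, CV = {cv:.1f}%")
print("z-scores:", np.round(z_scores, 2))
```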



Correlation
FR

Fundamental
• Is there a relationship between x and y?
• What is the strength of this relationship?
FR

CORRELATION
• A statistical technique used to determine the degree to
which two variables are related
• Finding the relationship between two numerical
variables without being able to infer causal relationships
• The correlation between two random variables X and Y
is a measure of the degree of association between the
two variables.
• It describes the degree to which one variable is linearly
related to another
Examples
• Whether the stocks of two airlines rise and fall in any related
manner.
• What is the degree of relatedness of the two stock prices over time?
• In the transportation industry, is a correlation evident between
the price of transportation and the weight of the object being
shipped?
• How strong is the correlation between the producer price index
and the unemployment rate?
• In retail sales, are sales related to population density, number of
competitors, size of the store, amount of advertising, or other
variables?
Example
• Between 2002 and 2005, there was a decrease in movie attendance.
There are several reasons for this decline. One reason may be the
increase in DVD sales. The percentage of U.S. homes with DVD
players and the movie attendance (billions) in the United States for
the years 2000 to 2005 are shown next. Can we describe the relationship between these variables?

Year                          2000   2001   2002   2003   2004   2005
DVD percentage                12     23     37     42     59     74
Movie attendance (billions)   1.41   1.49   1.63   1.58   1.53   1.40

• Sources: Northern Technology & Telecom Research and Motion Picture Association.
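One way to answer the question numerically is to compute the correlation coefficient for the six pairs of values. The sketch below uses NumPy's corrcoef; SciPy's pearsonr would work equally well.

```python
import numpy as np

dvd_pct = [12, 23, 37, 42, 59, 74]                   # % of U.S. homes with a DVD player
attendance = [1.41, 1.49, 1.63, 1.58, 1.53, 1.40]    # movie attendance (billions)

r = np.corrcoef(dvd_pct, attendance)[0, 1]
print(f"r = {r:.3f}")
# A value of r close to 0 would suggest little linear association between
# DVD ownership and movie attendance over these years.
```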
Scatter Diagram of Weight
and Systolic Blood Pressure
Scatter Plots
The pattern of data is indicative of the type of relationship
between your two variables:
• positive relationship
• negative relationship
• no relationship

(Panels, left to right: positive correlation, negative correlation, no correlation.)
Positive Relationship

Negative Relationship

No Relationship
Variance vs Covariance
• Do two variables change together?
► Variance gives information on the variability of a single variable.
► Covariance gives information on the degree to which two variables vary together.

Pearson's Correlation Coefficient r
• Pearson's r standardises the covariance value.
• It divides the covariance by the product of the standard deviations of X and Y:

  r = cov(X, Y) / (sX · sY)
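That relationship can be checked directly in code. The sketch below (made-up paired data) computes the covariance and the two standard deviations with NumPy and confirms that the ratio matches np.corrcoef.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.1, 4.2, 6.8, 7.9])      # made-up paired data

cov_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance of X and Y
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_builtin = np.corrcoef(x, y)[0, 1]

print(f"cov = {cov_xy:.3f}, r (manual) = {r_manual:.4f}, r (corrcoef) = {r_builtin:.4f}")
```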
Pearson’s Correlation FR
Coefficient r
Pearson Product-Moment Correlation
• Measures the degree and the direction of the linear relationship between two variables
• Identified by r

  r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)
    = (covariability of X and Y) / (variability of X and Y separately)
The Coefficient of Correlation
• Proposed by Karl Pearson, the coefficient of correlation
describes the strength of the relationship between two sets of
interval-scaled or ratio-scaled variables.
• It ranges from -1 up to and including +1.
• A value near +1 indicates a direct or positive association between the variables; a value near −1 indicates an inverse or negative association.
• A value equal to +1 or −1 implies that the two variables are perfectly related in a positive or negative linear sense.
Correlation r – Basic Assumptions
• No categorical or nominal variables.
• r does not change when we change the units of measurement, for example from kg to pounds for weight. Why?
• r does not measure or describe curved or non-linear association, no matter how strong.
• Like the mean and SD, r is not resistant to outliers: it is strongly affected by outlying observations.
The Coefficient of Correlation
• If r = 0 => no correlation.
• If 0 < |r| < 0.25 => weak correlation.
• If 0.25 ≤ |r| < 0.75 => intermediate correlation.
• If 0.75 ≤ |r| < 1 => strong correlation.
• If |r| = 1 => perfect correlation.
Coefficient of Determination
• The coefficient of determination, r², is the square of the correlation coefficient. It gives the proportion of the variation in one variable that is explained by its linear relationship with the other, and it ranges from 0 to 1.
Example
A sample of 6 children was selected, and data about their age in years and weight in kilograms were recorded as shown in the following table. It is required to find the correlation between age and weight.

Serial No   Age (years)   Weight (kg)
1           7             12
2           6             8
3           8             12
4           5             10
5           6             11
6           9             13
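For these six children the correlation can be obtained either from the definitional formula or from a library call. A minimal sketch with SciPy is shown below; scipy.stats.pearsonr returns the coefficient together with a p-value.

```python
from scipy.stats import pearsonr

age = [7, 6, 8, 5, 6, 9]          # years
weight = [12, 8, 12, 10, 11, 13]  # kg

r, p_value = pearsonr(age, weight)
print(f"r = {r:.2f}")   # a positive r indicates that weight tends to rise with age
```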
Spearman Rank Correlation
• Pearson correlation is used when the random variables involved are measured at the interval or ratio level. When both random variables are ordinal (ranked), we use the Spearman rank correlation.
• Procedure: rank the values of X from 1 to n, where n is the number of pairs of X and Y in the sample.
• Rank the values of Y from 1 to n.
• Compute the value of di for each pair of observations by subtracting the rank of Y from the rank of X.
• Square each di and compute ∑di², the sum of the squared differences.
• The Spearman rank correlation coefficient is then rs = 1 − 6∑di² / (n(n² − 1)).
Ranking of 12 countries under corruption and Gini Index

Country   Corruption Rank (X)   Gini Index Rank (Y)   di = X − Y   di²
1         1                     2                     −1           1
2         4                     3                      1           1
3         12                    9                      3           9
4         2                     5                     −3           9
5         5                     4                      1           1
6         8                     6                      2           4
7         11                    10                     1           1
8         7                     7                      0           0
9         10                    8                      2           4
10        3                     1                      2           4
11        6                     11                    −5           25
12        9                     12                    −3           9

∑di² = 68
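With ∑di² = 68 and n = 12 pairs, the Spearman coefficient follows from rs = 1 − 6∑di² / (n(n² − 1)). The sketch below computes it from the two rank columns and, as a cross-check, compares the result against scipy.stats.spearmanr.

```python
from scipy.stats import spearmanr

corruption_rank = [1, 4, 12, 2, 5, 8, 11, 7, 10, 3, 6, 9]   # X
gini_rank       = [2, 3, 9, 5, 4, 6, 10, 7, 8, 1, 11, 12]   # Y

n = len(corruption_rank)
sum_d2 = sum((x - y) ** 2 for x, y in zip(corruption_rank, gini_rank))  # 68
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"sum of d^2 = {sum_d2}, rs = {rs:.3f}")          # rs ≈ 0.762
print(f"scipy spearmanr: {spearmanr(corruption_rank, gini_rank).correlation:.3f}")
```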
Correlation does not imply Causation
► If two variables are linearly related, it does not mean that X causes Y. It may mean that another variable causes both X and Y, or that Y causes X. Remember:
► Correlation is not Causation
Regression

Regression
• Correlation tells you if there is an association between x
and y but it does not allow you to predict one variable
from the other.

• To do this we need REGRESSION!

• Regression is a technique that fits a straight line as closely as possible to the coordinates of two variables plotted on a two-dimensional graph, in order to summarize the relationship between them.
Best-fit Line
• The aim of linear regression is to fit a straight line, ŷ = ax + b, to the data that gives the best prediction of y for any value of x (a is the slope, b is the intercept).
• This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals, where
  ŷ = predicted value
  yi = true (observed) value
  ε = yi − ŷ = residual error
Least Squares Regression
• To find the best line we must minimise the
sum of the squares of the residuals (the
vertical distances from the data points to
our line)
Model line: ŷ = ax + b, where a = slope and b = intercept
Residual: ε = y − ŷ
Sum of squares of residuals: Σ(y − ŷ)²

■ We must find the values of a and b that minimise Σ(y − ŷ)²
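The least-squares criterion can be made concrete with a small numerical check. The sketch below (made-up data) fits a line with np.polyfit and verifies that nudging the fitted slope or intercept only increases Σ(y − ŷ)².

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])          # made-up data

a, b = np.polyfit(x, y, deg=1)                   # least-squares slope and intercept

def sse(slope, intercept):
    residuals = y - (slope * x + intercept)
    return np.sum(residuals ** 2)

print(f"fitted: a = {a:.3f}, b = {b:.3f}, SSE = {sse(a, b):.4f}")
# Any perturbation of a or b gives a larger sum of squared residuals:
print(sse(a + 0.1, b) > sse(a, b), sse(a, b - 0.1) > sse(a, b))
```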
Finding b
• First we find the value of b that gives the minimum sum of squares.
■ Trying different values of b is equivalent to shifting the line up and down while its slope stays the same.
Finding a
• Now we find the value of a that gives the minimum sum of squares.
■ Trying out different values of a is equivalent to changing the slope of the line, while b stays constant.
Minimising sums of squares
• We need to minimise Σ(y − ŷ)².
• Since ŷ = ax + b, we need to minimise the sum of squares S = Σ(y − ax − b)².
• If we plot the sum of squares S for all different values of a and b we get a parabola, because it is a squared term.
• So the minimum sum of squares is at the bottom of this curve, where the gradient = 0.
(Figure: S plotted against the values of a and b, with the minimum of S marked where the gradient = 0.)
Solution
• The min sum of squares is at the bottom of the curve where
the gradient = 0

• So we can find the a and b that give the minimum sum of squares by taking partial derivatives of Σ(y − ax − b)² with respect to a and b separately.
• Then we set these derivatives equal to 0 and solve, which gives the values of a and b that minimise the sum of squares.
The Solution
• Doing this gives the following equation for the slope a:

  a = r · (sy / sx)

  where r = correlation coefficient of x and y, sy = standard deviation of y, and sx = standard deviation of x.

■ From this you can see that:
  ▪ A low correlation coefficient gives a flatter slope (a small value of a).
  ▪ A large spread of y, i.e. a high standard deviation of y, results in a steeper slope (a larger value of a).
  ▪ A large spread of x, i.e. a high standard deviation of x, results in a flatter slope (a smaller value of a).
The solution cont.
• Our model equation is ŷ = ax + b.
• This line must pass through the point of means (x̄, ȳ), so:

  ȳ = a·x̄ + b, which gives b = ȳ − a·x̄

■ Substituting the equation for a into this gives:

  b = ȳ − r · (sy / sx) · x̄

  where r = correlation coefficient of x and y, sy = standard deviation of y, and sx = standard deviation of x.

■ The smaller the correlation, the closer the intercept is to the mean of y.
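These two formulas translate directly into code. The sketch below (made-up data) computes a and b from r, the standard deviations, and the means, and checks the result against np.polyfit.

```python
import numpy as np

x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([8.2, 9.1, 11.0, 11.8, 13.5])       # made-up data

r = np.corrcoef(x, y)[0, 1]
a = r * y.std(ddof=1) / x.std(ddof=1)            # slope: a = r * sy / sx
b = y.mean() - a * x.mean()                      # intercept: b = ȳ − a·x̄

a_np, b_np = np.polyfit(x, y, deg=1)
print(f"a = {a:.4f} (polyfit {a_np:.4f}), b = {b:.4f} (polyfit {b_np:.4f})")
```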
Interpretation of a and b
• The slope a implies that, for each increase of 1 unit in X, the mean value of Y is estimated to change by a units. The slope represents the portion of Y that is estimated to vary according to X.
• The Y intercept b represents the mean value of Y when X equals 0.
Example
• A statistics professor wants to use the number of hours a student studies for
a statistics final exam (X) to predict the final exam score (Y). A regression
model was fit based on data collected for a class during the previous
semester, with the following results-
Y = 35 + 3X
• What is the interpretation of the Y intercept, b and the slope, a?
• The Y intercept b = 35.0 indicates that when the student does not study for
the final exam, the mean final exam score is 35.0. The slope a= 3 indicates
that for each increase of one hour in studying time, the mean change in the
final exam score is predicted to be +3.0.
• In other words, the final exam score is predicted to increase by 3 points for
each one-hour increase in studying time.
Measures of Variation
• When using the least-squares method to determine the regression
coefficients for a set of data, we need to compute three important measures
of variation. The first measure, the total sum of squares (SST ), is a measure
of variation of the Y values around their mean.
• In a regression analysis, the total variation or total sum of squares is
subdivided into explained variation and unexplained variation. The
explained variation or regression sum of squares (SSR) is due to the
relationship between X and Y, and the unexplained variation, or error sum of
squares (SSE) is due to factors other than the relationship between X and Y.
• The total sum of squares is equal to the regression sum of squares plus the
error sum of squares.
• SST=SSR+SSE
Coefficient of Determination
• The coefficient of determination, r² = SSR / SST, measures the proportion of the variation in Y that is explained by the variation in X in the regression model. It ranges from 0 to 1, with values near 1 indicating that the regression line accounts for most of the variation in Y.
Standard Error of Estimate

• Although the least-squares method results in the line that fits the data with the minimum amount of error, unless all the observed data points fall on a straight line, the prediction line is not a perfect predictor. Just as all data values cannot be expected to be exactly equal to their mean, neither can they be expected to fall exactly on the prediction line. An important statistic, called the standard error of the estimate, measures the variability of the actual Y values from the predicted Y values, in the same way that the standard deviation measures the variability of each value around the sample mean.
• In other words, the standard error of the estimate is the standard deviation around the prediction line, whereas the standard deviation is the standard deviation around the sample mean. It is computed as the square root of SSE / (n − 2).
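The three sums of squares, the coefficient of determination, and the standard error of the estimate can all be computed in a few lines. The sketch below uses made-up data and the line fitted by np.polyfit.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([5.1, 6.8, 8.2, 10.4, 11.1, 13.6])   # made-up data

a, b = np.polyfit(x, y, deg=1)
y_hat = a * x + b

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
sse = np.sum((y - y_hat) ** 2)           # error (residual) sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares

r_squared = ssr / sst                    # coefficient of determination
std_error = np.sqrt(sse / (len(x) - 2))  # standard error of the estimate

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f} (SSR + SSE = {ssr + sse:.3f})")
print(f"r^2 = {r_squared:.3f}, standard error = {std_error:.3f}")
```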
Predictions in Regression Analysis: Interpolation Versus Extrapolation
• When using a regression model for prediction purposes, you need to
consider only the relevant range of the independent variable in
making predictions. This relevant range includes all values from the
smallest to the largest X used in developing the regression model.
Hence, when predicting Y for a given value of X, you can interpolate
within this relevant range of the X values, but you should not
extrapolate beyond the range of X values.
Examples
• The human resource manager of a telemarketing firm is concerned
about the rapid turnover of the firm’s telemarketers. It appears that
many telemarketers do not work very long before quitting. There may
be a number of reasons, including relatively low pay, personal
unsuitability for the work, and the low probability of advancement.
Because of the high cost of hiring and training new workers, the
manager decided to examine the factors that influence workers to quit.
He reviewed the work history of a random sample of workers who
have quit in the last year and recorded the number of weeks on the job
before quitting and the age of each worker when originally hired.
• Use regression analysis to describe how the work period and age are
related and comment on the relationship.
Examples
• Millions of boats are registered in the United States. As is the case with
automobiles, there is an active used-boat market. Many of the boats
purchased require bank financing, and, as a result, it is important for
financial institutions to be capable of accurately estimating the price of
boats. One variable that affects the price is the number of hours the
engine has been run. To determine the effect of the hours on the price,
a financial analyst recorded the price (in $1,000s) of a sample of 2007
24-foot Sea Ray cruisers (one of the most popular boats) and the
number of hours they had been run.
• Determine the least squares line and explain what the coefficients tell
you
Examples
• Fire damage in the United States amounts to billions of dollars, much
of it insured. The time taken to arrive at the fire is critical. This raises
the question, Should insurance companies lower premiums if the
home to be insured is close to a fire station? To help make a decision, a
study was undertaken wherein a number of fires were investigated.
The distance to the nearest fire station (in miles) and the percentage of
fire damage were recorded.
• Determine the least squares line and interpret the coefficients.
Examples
• A real estate agent specializing in commercial real estate wanted a
more precise method of judging the likely selling price (in $1,000s) of
apartment buildings. As a first effort, she recorded the price of a
number of apartment buildings sold recently and the number of
square feet (in 1,000s) in the building.
• Calculate the regression line. What do the coefficients tell you about
the relationship between price and square footage?
Examples
• An economist for the federal government is attempting to produce a
better measure of poverty than is currently in use. To help acquire
information, she recorded the annual household income (in $1,000s)
and the amount of money spent on food during one week for a random
sample of households.
• Determine the regression line and interpret the coefficients.
Odometer Reading and Prices of Used
Toyota Camrys
• Car dealers across North America use the so-called Blue Book to help
them determine the value of used cars that their customers trade in
when purchasing new cars. The book, which is published monthly, lists
the trade-in values for all basic models of cars. It provides alternative
values for each car model according to its condition and optional
features. The values are determined on the basis of the average paid at
recent used-car auctions, the source of supply for many used-car
dealers. However, the Blue Book does not indicate the value
determined by the odometer reading, despite the fact that a critical
factor for used-car buyers is how far the car has been driven. To
examine this issue, a used-car dealer randomly selected 100 3-year old
Toyota Camrys that were sold at auction during the past month. Each
car was in top condition and equipped with all the features that come
standard with this car. The dealer recorded the price (in $1,000s) and the number of miles (in thousands) on the odometer. The dealer wants to determine the regression line and use it to describe how the odometer reading affects the auction selling price.
BA
BUSINESS ANALYTICS

Thank You
