Business Statistics PDF
Prepared by
Ahmed Sabbir
Definition of Statistics
The word Statistics refers to a special discipline or a collection of procedures and principles useful
as an aid in gathering and analyzing numerical information for the purpose of drawing
conclusions and making decisions.
Statistics is the branch of mathematics that transforms data into information for decision makers.
The science of collecting, organizing, presenting, analyzing and interpreting data to assist in
making more effective decisions.
Statistics is the study of numerical data, facts, figures and measurement. Statistics is used to
convert raw numerical data into useful information for relevant users.
Data -> Statistics -> Information
In business, statistics has many important uses. Statistics gives managers more confidence
in dealing with uncertainty and making effective decisions. Statistical reports provide a summary
of business activities, which improves the capability of making more effective decisions regarding
future activities.
To summarize business data
To draw conclusions from the business data
To make reliable forecasts about business activities
To improve business processes.
Discussed below are certain activities of an organization where statistics plays an important role:
Marketing:
Before a product is launched, the market research team of an organization, through a survey, makes
use of various statistical techniques to analyze data on population, purchasing power, consumer
habits, competitors, pricing and a host of other aspects. Such studies reveal the possible market
potential for the product.
Analysis of sales volume in relation to purchasing power and concentration of population is
helpful in forming marketing strategies to improve sales.
Production:
Decisions regarding the quantity of production and the periodic purchasing of raw materials are
based on statistical data. Statistical methods are also used to improve the quality of existing
products and to set standards for new ones.
Finance:
Financial forecasts, break-even analysis and investment decisions under uncertainty involve the
application of relevant statistical methods for analysis.
A statistical study through correlation analysis of profit and dividends helps to predict and
decide probable dividends for future years.
Limitations of Statistics
Statistics does not study qualitative phenomena: Since statistics deals with
numerical data, it cannot be applied to problems which cannot be
stated and expressed quantitatively. For example, the statement that the export volume of
Bangladesh has increased considerably during the last few years cannot be analyzed
statistically unless it is expressed in figures.
Statistics does not study individuals: Statistics cannot consider any single or
isolated figure. Statistical laws are true on average. Statistics are aggregates of facts,
so a single observation is not a statistic; statistics deals with groups and aggregates only.
For example, when the average height of EMBA students is 6 ft, it shows the height
not of an individual but as found by the study of all individuals.
Statistics can be misused: Statistics deal with figures which are innocent in
themselves and can be easily manipulated or distorted by people for their selfish
motives. Therefore, it is a dangerous tool in the hands of a non-expert.
Features:
1. Statistics describes a numerical set of data by its-
Center
Variability
Shape
2. Statistics describes a categorical set of data by
Frequency, percentage or proportion of each category.
Inferential Statistics
Inferential Statistics includes statistical methods which facilitate estimating the characteristics of
a population or making decisions concerning a population on the basis of sample results. i.e.,
Statistical inference is the process of making an estimate, prediction, or decision about a population based
on a sample.
Inferential statistics starts with a sample and then generalizes to a population. The larger group of
units about which inferences are to be made is called the population, and a sample is a subset or
portion of that population.
Inferential statistics is used to examine the relationships between variables within a sample and
then make generalizations or predictions about how those variables will relate to a larger
population.
Estimation: e.g., estimate the population mean weight using the sample mean weight.
Hypothesis Testing: e.g., Test the claim that the population mean weight is 120 pounds.
Problem:
The Rathburn Manufacturing Company makes electric wiring, which it sells to contractors in the
construction industry. Approximately 900 electric contractors purchase wire from Rathburn
annually. Rathburn’s director of marketing wants to determine electric contractors’ satisfaction
with Rathburn’s wire. He developed a questionnaire that yields a satisfaction score between 10
and 50 for participant responses. A random sample of 35 of the 900 contractors is asked to
complete a satisfaction survey. The satisfaction scores for the 35 participants are averaged to
produce a mean satisfaction score.
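a. What is the population for this study? The 900 electric contractors who purchase wire from Rathburn.
b. What is the sample for this study? The 35 contractors randomly selected to complete the survey.
c. What is the statistic for this study? The sample mean satisfaction score ($\bar{x}$) of the 35 participants.
d. What would be a parameter for this study? The population mean satisfaction score ($\mu$) of all 900 contractors.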
Variables
  Categorical
    Examples: Marital Status, Political Party, Eye color
  Numerical
    Discrete
      Examples: Number of Children, Defects per hour
    Continuous
      Examples: Weight, Voltage
Level of measurement
Level of measurement or scale of measure is a classification that describes the nature of
information within the values assigned to variables.
A variable has one of four different levels of measurement: Nominal, Ordinal, Interval, or
Ratio. (Interval and Ratio levels of measurement are sometimes called Continuous or Scale).
Ordinal Scale:
In the ordinal level, the values are categorized to denote qualitative differences among
various categories and are also rank ordered in some meaningful way according to some preference.
The preferences would be ranked from best to worst, numbered 1, 2, and so on. So, in the ordinal
level, data are arranged in some order, but differences between data values either cannot be
determined or are meaningless.
Properties:
Data classifications are represented by sets of labels or names (high, medium, low) that have
relative values.
Interval Level:
A scale of measurement for a variable in which the interval between observations is expressed in
terms of a fixed standard unit of measurement.
The interval scale not only classifies individuals according to certain categories and determines
the order of these categories, it also measures the magnitude of the differences in preferences among
the individuals. In interval measurement the distance between attributes does have meaning.
For example, if anxiety in students is measured at the interval level, the difference between
scores of 10 and 11 is the same as the difference between scores of 40 and 41. Similarly, the
difference between a temperature of 100 degrees and 90 degrees is the same as the difference
between 90 degrees and 80 degrees.
Properties:
Ratio Level
A ratio scale is a scale of measurement for a variable that has intervals measurable in a
standard unit of measurement and a meaningful zero, i.e., the ratio of two values is meaningful.
It is the most powerful of the four scales because it has a unique zero origin. For example, a person
weighing 90 kg is twice as heavy as one who weighs 45 kg, a ratio of 2:1.
Examples of ratio variables are the following:
weight in kilograms or pounds
height in meters or feet
distance of school from home
amount of money spent during vacation
Properties:
Data classifications are ordered according to the amount of the characteristic they possess.
Exercise
State whether each of the following variables is qualitative or quantitative and indicate the
measurement scale that is appropriate for each:
i) Age
ii) Gender
iii) Class Rank
iv) Make of automobile
v) Annual sales
vi) Soft drinks size-small, medium large
vii) Earnings per share
viii) Method of payment (Cash, check, credit card)
Solution:
       Variable                                          Type           Measurement Scale
i)     Age                                               Quantitative   Ratio
ii)    Gender                                            Qualitative    Nominal
iii)   Class Rank                                        Qualitative    Ordinal
iv)    Make of automobile                                Qualitative    Nominal
v)     Annual sales                                      Quantitative   Ratio
vi)    Soft drinks size-small, medium, large             Qualitative    Ordinal
vii)   Earnings per share                                Quantitative   Ratio
viii)  Method of payment (Cash, check, credit card)      Qualitative    Nominal
The levels of data, from lowest to highest, are: nominal (groups), ordinal (ranks), interval (no
true zero) and ratio (absolute zero).
Classification of Data is the process of arranging data in groups/classes on the basis of certain
properties.
According to Secrist, “Classification is the process of arranging data into sequences and groups
according to their common characteristics”.
Classification means arranging the mass of data into different classes or groups on the basis of
their similarities and resemblances. All similar items of data are put in one class and all dissimilar
items of data are put in different classes. Statistical data is classified according to its characteristics.
For example, if we have collected data regarding the number of students admitted to a university
in a year, the students can be classified on the basis of sex. In this case, all male students will be
put in one class and all female students will be put in another class. The students can also be
classified on the basis of age, marks, marital status, height, etc.
Geographical Classification:
In geographical classification, data are classified on the basis of geographical location such as
cities, districts or villages. This type of classification is also known as areal or spatial classification.
Chronological Classification:
In chronological classification, data are classified or arranged by their time of occurrence, such as
years, months, weeks, days, etc. Such classifications are also called time series.
For example, Sales figure of a company in different years, Population of Bangladesh in different
years.
Year 1990 2000 2010
Population 11.1 12.0 14
Qualitative Classification:
In qualitative classification, data are classified on the basis of the descriptive characteristics or on
the basis of attributes like sex, literacy, religion, caste, or education, which cannot be quantified.
This can be done in two ways:
a) Simple Classification: Each class is subdivided into two sub classes and only one attribute
is studied, such as male and female; blind and not blind etc.
b) Manifold: Each class is subdivided into more than two sub classes and only one attribute
is studied further.
Quantitative Classification
In quantitative classification, data are classified on the basis of some characteristics which can be
measured such as height, weight, income, expenditure or sales etc.
Sources of Data
Organization of Data
The best way to examine a large set of numerical data is first to organize it in an appropriate format.
The data can be organized by using-
1. Data Array
2. Tabulation.
Frequency distribution:
Frequency distribution is a tabular summary of data showing the number of observations
(frequency) in each of several non-overlapping class intervals.
A frequency distribution is a listing of classes and their frequencies. It divides the
observations in the data set into conveniently established, numerically ordered classes (groups or
categories). The number of observations in each class is referred to as its frequency, denoted as f.
Table-1 presents the total number of overtime hours worked for 30 consecutive weeks by
machinists in a machine shop.
94 89 88 89 90 94 92 88 87 85
88 93 94 93 94 95 92 88 94 90
95 84 93 84 91 93 85 91 89 95
The data displayed here are in raw form, that is, the numerical observations are not arranged in any
particular order or sequence.
These raw data do not highlight any characteristic and do not easily reveal any significant
trend regarding the nature and pattern of variations therein. Moreover, as the number of observations
gets large, it becomes more difficult to focus on specific features in a set of data. Thus we need to
organize the observations so that we can better understand the information that the data reveal.
Ordered Array:
84 84 85 85 87 88 88 88 88 89
89 89 90 90 91 91 92 92 93 93
93 93 94 94 94 94 94 95 95 95
Problem-1
The following set of numbers represents mutual fund prices reported at the end of a week
for 40 selected nationally sold funds.
10 17 15 22 11 16 19 24 29 18
25 26 32 14 17 20 23 27 30 12
15 18 24 36 18 15 21 28 33 38
34 13 10 16 20 22 29 29 23 31
Arrange these prices into a frequency distribution having a suitable number of classes.
Solution
Since the number of observations is 40, it seems reasonable to choose 6 classes ($2^6 = 64 > 40$).
Class interval $= \dfrac{38 - 10}{6} = 4.67$, or approximately 5.
Frequency Distribution
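A minimal Python sketch of this construction is given below. It assumes inclusive class limits starting at the smallest price (10-14, 15-19, and so on), which is one reasonable reading of the solution; the helper names `choose_class_count` and `frequency_distribution` are illustrative, not part of the notes.

```python
import math

prices = [
    10, 17, 15, 22, 11, 16, 19, 24, 29, 18,
    25, 26, 32, 14, 17, 20, 23, 27, 30, 12,
    15, 18, 24, 36, 18, 15, 21, 28, 33, 38,
    34, 13, 10, 16, 20, 22, 29, 29, 23, 31,
]

def choose_class_count(n):
    """Smallest k with 2**k >= n (the 'two to the k' rule used above)."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k

def frequency_distribution(data, k, width):
    """Count observations in k inclusive classes of equal width (whole-number data)."""
    lowest = min(data)
    table = []
    for i in range(k):
        lower = lowest + i * width
        upper = lower + width - 1          # inclusive upper limit
        count = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, count))
    return table

k = choose_class_count(len(prices))                      # number of classes
width = math.ceil((max(prices) - min(prices)) / k)       # class interval, rounded up
table = frequency_distribution(prices, k, width)

for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
print("Total:", sum(count for _, _, count in table))
```

A different starting point or an exclusive class convention (10-15, 15-20, ...) would give slightly different counts, so the class limits should always be stated with the table.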
Problem-2
A computer company received a rush order for as many home computers as could be shipped
during a six week period. The company provides the following daily shipments:
22 65 65 67 55 50 65
77 75 30 62 54 48 65
79 60 63 45 51 68 79
83 33 41 49 28 55 61
65 75 55 75 39 87 45
50 66 65 59 25 35 53
Group these daily shipment figures into a frequency distribution having a suitable number of classes.
Solution:
Since the number of observations is 42, it seems reasonable to choose 6 classes ($2^6 = 64 > 42$).
Class interval $= \dfrac{87 - 22}{6} = 10.83$, or approximately 11.
Frequency Distribution
Class      Frequency
22-32      4
33-43      4
44-54      9
55-65      14
66-76      6
77-87      5
Total      42
Note:
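If a continuous variable is classified according to the inclusive method, then a certain adjustment is
needed to obtain continuity.
To ensure continuity, first calculate the correction factor:
Correction factor = (lower limit of the second class - upper limit of the first class) / 2
Then subtract the correction factor from every lower class limit and add it to every upper class limit.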
Write the marginal distribution of x and y and the conditional distribution of x when y lies between
15 and 20.
Problem-3
The following data give the points scored in a tennis match by two players X & Y at the end
of twenty games:
(10,12), (7,11), (7,9), (15,10), (17,21), (12,8), (16,10), (14,14), (22,18), (16,7), (15,16), (22,20),
(19,15), (7,18), (11,11), (12,18), (10,10), (5,13), (11,7), (10,10)
Taking class intervals as 5-9, 10-14, 15-19 and 20-24 for X and Y:
i) Construct a bivariate frequency table.
ii) Find the conditional frequency distribution for Y given X>15.
Solution
(i) Bivariate frequency table (entries are numbers of games, counted from the pairs listed above)
Y \ X                      5-9    10-14   15-19   20-24   Marginal frequencies fy
5-9                         1       2       1       0        4
10-14                       2       5       2       0        9
15-19                       1       1       2       1        5
20-24                       0       0       1       1        2
Marginal frequencies fx     4       8       6       2       20
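The bivariate counts, the marginal frequencies and the conditional distribution asked for in part (ii) can be tallied with a short Python sketch. It assumes the twenty (X, Y) pairs exactly as listed in the problem and the four classes 5-9, 10-14, 15-19, 20-24; the function and variable names are illustrative.

```python
from collections import Counter

# (X, Y) points for the twenty games, as listed in the problem.
scores = [
    (10, 12), (7, 11), (7, 9), (15, 10), (17, 21), (12, 8), (16, 10),
    (14, 14), (22, 18), (16, 7), (15, 16), (22, 20), (19, 15), (7, 18),
    (11, 11), (12, 18), (10, 10), (5, 13), (11, 7), (10, 10),
]

classes = [(5, 9), (10, 14), (15, 19), (20, 24)]

def class_of(value):
    """Return the (lower, upper) class that contains value."""
    for lower, upper in classes:
        if lower <= value <= upper:
            return (lower, upper)
    raise ValueError(f"{value} is outside all classes")

# Joint frequencies keyed by (Y class, X class), plus the two marginals.
joint = Counter((class_of(y), class_of(x)) for x, y in scores)
fx = Counter(class_of(x) for x, _ in scores)
fy = Counter(class_of(y) for _, y in scores)

# Conditional distribution of Y for the games in which X scored more than 15.
y_given_x_gt_15 = Counter(class_of(y) for x, y in scores if x > 15)

for y_class in classes:
    row = [joint[(y_class, x_class)] for x_class in classes]
    print(y_class, row, "fy =", fy[y_class])
print("fx =", [fx[c] for c in classes])
print("Y | X > 15:", {c: y_given_x_gt_15[c] for c in classes})
```

Each marginal total is the sum of a row or a column of the joint table, and both sets of marginals add up to the 20 games.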
Problem
The data below shows the mass of 40 students in a class. The measurement is to the nearest kg.
55 70 57 73 55 59 64 72
60 48 58 54 69 51 63 78
75 64 65 57 71 78 76 62
49 66 62 76 61 63 63 76
52 76 71 61 53 56 67 71
Relative frequency $= \dfrac{\text{frequency}}{\text{total frequency}}$
Exercise
Tabulation of Data
Tabulation of data is another way of summarizing and presenting the given data in a
systematic form in rows and columns. Such presentation facilitates comparison by
bringing related information close to each other and helps in further statistical analysis
and interpretation.
Tabulation is defined as the process of classifying the data in a systematic form which
facilitates comparative studies of data sets.
Objectives of tabulation:
In general, a statistical table consists of the following eight parts. They are as follows:
(i) Table Number: Each table must be given a number. Table number helps in distinguishing one
table from other tables.
ii) Title of the Table: Every table should have a suitable title. It should be short & clear. Title
should be such that one can know the nature of the data contained in the table as well as where and
when such data were collected. It is either placed just below the table number or at its right.
(v) Body: This is the most important part of a table. It contains a number of cells. Cells are formed
by the intersection of rows and columns. Data are entered in these cells.
(vi) Head Note: The head-note (or prefatory note) contains the unit of measurement of data. It is
usually placed just below the title or at the right hand top corner of the table.
(vii) Foot Note: A foot note is given at the bottom of a table. It helps in clarifying the point which
is not clear in the table. A foot note may be keyed to the title or to any column or to any row
heading. It is identified by symbols such as *,+,@,£ etc.
(viii) Source Note: The source note shows the source of the data presented in the table. Reliability
and accuracy of data can be tested to some extent from the source note.
Exercise:
A survey of 370 students from the Commerce Faculty and 130 students from the Science Faculty
revealed that 180 students were studying for only C.A. examinations, 140 for only Costing
examinations and 80 for both C.A. and Costing examinations. The rest had opted for part-time
Management Courses. Of those studying Costing only, 13 were girls and 90 boys belonged to the
Commerce Faculty. Out of 80 students studying for both C.A. and Costing, 72 were from the
Commerce Faculty amongst which 70 were boys. Amongst those who opted for part-time
Management Courses, 50 boys were from the Science Faculty and 30 boys and 10 girls from the
Commerce Faculty. Of those studying C.A. only, 158 belong to the Commerce Faculty, of whom 150
are boys, and 6 girls belong to the Science Faculty. In all there were 110 boys in the Science Faculty.
(i) Present the above information in a tabular form.
(ii) Find the number of students from the Science Faculty studying for part-time Management
Courses.
Solution:
In a sample study about coffee habit in two towns, the following information was received:
Town A: Females were 40%; Total coffee drinkers were 45% and male non-coffee drinkers were
20%.
Town B: Males were 55%. Male non-coffee drinkers were 30% and Female coffee drinkers were
15%.
Represent the above data in a tabular form.
Solution:
Problem-2
In 2003, out of total 1950 workers of a factory, 1400 were members of a trade union. The
number of women employed was 400 of which 275 did not belong to a trade union. In
2008, the number of union workers increased to 1780 of which 1490 were men. On the
other hand, the number of non-union workers fell to 408 of which 280 were men.
In the year 2013, there were 2000 employees who belonged to a trade union and 250 who did
not belong to a trade union. Of all employees in 2013, 500 were women, of whom only 208
did not belong to a trade union.
Problem-4
A survey of 1500 workers in a factory gave the following results. Tabulate the information.
One third of the workers were females; 80% of the female workers were below 40 while the
percentage of male workers below 40 was 50. 80% of male workers below 40 were skilled and
the remaining unskilled. 40% of the male workers above 40 were skilled and the remaining
unskilled. There was no skilled female worker above 40 while 50 percent of the female workers
below 40 were skilled.
SOLUTION:
DETAILS OF WORKERS IN A FACTORY
              Male                           Female                         Total
Age           Skilled  Unskilled  Total      Skilled  Unskilled  Total      Skilled  Unskilled  Total
Below 40      400      100        500        200      200        400        600      300        900
Above 40      200      300        500        0        100        100        200      400        600
Total         600      400        1000       200      300        500        800      700        1500
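The cell values in this table follow directly from the percentages in the problem. A minimal Python sketch of the arithmetic, using whole-number division throughout, is shown below; the dictionary layout is only one possible way to hold the table.

```python
total_workers = 1500

# One third of the workers are female; the rest are male.
female = total_workers // 3
male = total_workers - female

# 50% of males and 80% of females are below 40.
below_40 = {"Male": male * 50 // 100, "Female": female * 80 // 100}
above_40 = {"Male": male - below_40["Male"], "Female": female - below_40["Female"]}

# Percentage of skilled workers in each sex/age group, as stated in the problem.
skilled_pct = {
    ("Male", "Below 40"): 80,
    ("Male", "Above 40"): 40,
    ("Female", "Below 40"): 50,
    ("Female", "Above 40"): 0,
}

# cells[(sex, age)] = (skilled, unskilled)
cells = {}
for (sex, age), pct in skilled_pct.items():
    group_total = below_40[sex] if age == "Below 40" else above_40[sex]
    skilled = group_total * pct // 100
    cells[(sex, age)] = (skilled, group_total - skilled)

total_skilled = sum(skilled for skilled, _ in cells.values())
total_unskilled = sum(unskilled for _, unskilled in cells.values())

for key, (skilled, unskilled) in cells.items():
    print(key, "skilled:", skilled, "unskilled:", unskilled)
print("All workers: skilled", total_skilled, "unskilled", total_unskilled)
```

Checking that the grand totals add back to 1500 is a quick way to catch a misread percentage before the table is drawn.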
Limitations:
Time series
Plot
Question: For the data given below, construct a less than cumulative frequency table and plot
its ogive.
Marks 0 - 10 10 - 20 20 - 30 30 - 40 40 - 50 50 - 60 60 - 70 70 - 80 80 - 90 90 -100
Frequency 3 5 6 7 8 9 10 12 6 4
Solution:
Marks       Frequency   Less than cumulative frequency
0 - 10      3           3
10 - 20     5           8
20 - 30     6           14
30 - 40     7           21
40 - 50     8           29
50 - 60     9           38
60 - 70     10          48
70 - 80     12          60
80 - 90     6           66
90 - 100    4           70
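The less-than cumulative frequencies are running totals of the class frequencies, and the ogive is plotted through the points (upper class limit, cumulative frequency). A minimal Python sketch, assuming the ten classes of the question:

```python
from itertools import accumulate

class_limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
                (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
frequencies = [3, 5, 6, 7, 8, 9, 10, 12, 6, 4]

# Running total of the frequencies gives the "less than" cumulative frequency.
less_than_cf = list(accumulate(frequencies))

# Points for the less-than ogive: (upper class limit, cumulative frequency).
ogive_points = [(upper, cf) for (_, upper), cf in zip(class_limits, less_than_cf)]

for (lower, upper), f, cf in zip(class_limits, frequencies, less_than_cf):
    print(f"{lower} - {upper}: frequency {f}, less than {upper}: {cf}")
```

The last cumulative frequency must equal the total number of observations, here 70.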
Although frequency distributions and the corresponding graphical representations make raw data more
meaningful, they fail to identify three major properties that describe a set of quantitative data.
These three major properties are:
1. The numerical value of an observation around which most of the other observations in the
data set show a tendency to cluster or group, called central tendency.
2. The extent to which numerical values are dispersed around the central value, called variation.
3. The extent of departure of numerical values from a symmetrical (normal) distribution around
the central value, called skewness.
These three properties (central tendency, variation and shape of the frequency distribution) may be
used to extract and summarize major features of a data set by the application of certain statistical
methods called descriptive measures or summary measures.
If the descriptive summary measures are computed using data from a sample, they are called
sample statistics or simply statistics; if they are computed using data from the
population, they are called parameters.
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. It indicates the extent to which the data
values group around a typical or central value. As such, measures of central tendency are sometimes
called measures of central location. The mean (often called the average) is the most familiar
measure of central tendency.
Objective of Averaging
A single value which can represent the whole set of data is called an average. A few of the
objectives of calculating a typical central value or average in order to describe the entire data set
are given below:
It is useful to extract and summarize the characteristics of the entire data set in a precise form.
The various measures of central tendency or average commonly used can be classified as:
1. Mathematical Average
(a) Arithmetic mean, commonly known as mean or average.
(i) Simple
(ii) Weighted
(b) Geometric mean
(c) Harmonic mean
2. Average of Position
(a) Median
(b) Quartiles
(c) Deciles
(d) Percentiles
(e) Mode.
Arithmetic Mean
Ungrouped data which is also known as raw data is data that has not been placed in any
group or category after collection.
Grouped (or classified) data is the type of data which is classified into groups after collection.
There are two methods for calculating arithmetic mean for ungrouped and unclassified data:
i) Direct method
ii) Indirect or Short-cut method.
In the direct method, the arithmetic mean is calculated by adding all the observations and dividing
the total by the number of observations, i.e.,
Population mean:
$$\mu = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Sample mean:
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
If the values occur with frequencies, then
$$\bar{x} = \frac{1}{n}\sum_{i} f_i x_i, \qquad n = \sum_{i} f_i$$
where $f_i$ represents the frequency with which the value $x_i$ occurs in the given data set.
Exercise: The number of new orders received by a company over the last 25 days were recorded
as follows: 3,0,1,4,4,4,2,5,3,6,4,5,1,4,2,3,0,2,0,5,4,2,3,3,1. Calculate the arithmetic mean for the
number of orders received over all similar working days.
Solution:
Arithmetic mean:
$$\bar{x} = \frac{1}{n}\sum_{i} f_i x_i = \frac{71}{25} = 2.84 \approx 3$$
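As a check on this exercise, the mean can be computed both directly from the 25 raw values and from their frequency table. The following is a minimal Python sketch under the assumption that the orders are entered exactly as listed in the exercise.

```python
from collections import Counter

orders = [3, 0, 1, 4, 4, 4, 2, 5, 3, 6, 4, 5, 1, 4, 2, 3, 0, 2, 0, 5, 4, 2, 3, 3, 1]

n = len(orders)

# Direct method on the raw values: sum of observations divided by their number.
mean_direct = sum(orders) / n

# Same mean from the frequency table: sum of f_i * x_i divided by n.
frequency = Counter(orders)
mean_from_frequencies = sum(x * f for x, f in frequency.items()) / n

print("n =", n, "sum =", sum(orders))
print("mean =", mean_direct)
```

Both routes give the same value because multiplying each distinct value by its frequency is just a shorter way of adding the repeated observations.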
In the indirect or short-cut method, an arbitrary assumed mean is used as a basis for calculating
deviations of the individual values in the data set. Let A be the arbitrary assumed arithmetic mean
and let
$$d_i = x_i - A \quad \text{or} \quad x_i = A + d_i$$
Now,
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{i=1}^{n} (A + d_i) = A + \frac{1}{n}\sum_{i=1}^{n} d_i$$
If frequencies of the numerical values are also taken into consideration, then
$$\bar{x} = A + \frac{1}{n}\sum_{i} f_i d_i$$
Solution: The calculations of average daily earnings for employees are shown below:
Let A = 160.
Daily Earnings (Tk.), x_i   Number of employees, f_i   d_i = x_i - 160      f_i d_i
100                         3                          100 - 160 = -60      -180
120                         6                          -40                  -240
140                         10                         -20                  -200
160                         15                         0                    0
180                         24                         20                   480
200                         42                         40                   1680
220                         75                         60                   4500
Total                       175                                             6040
Arithmetic mean:
$$\bar{x} = A + \frac{1}{n}\sum_{i} f_i d_i = 160 + \frac{6040}{175} = 194.514$$
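The short-cut calculation can be reproduced and compared with the direct method in a few lines of Python. This is a minimal sketch using the earnings table of the solution and the assumed mean A = 160; variable names are illustrative.

```python
daily_earnings = [100, 120, 140, 160, 180, 200, 220]   # x_i in Tk.
employees = [3, 6, 10, 15, 24, 42, 75]                  # f_i

A = 160                                                 # assumed mean
n = sum(employees)

# Deviations from the assumed mean and the sum of f_i * d_i.
deviations = [x - A for x in daily_earnings]
sum_fd = sum(f * d for f, d in zip(employees, deviations))

mean_shortcut = A + sum_fd / n

# Direct method for comparison: sum of f_i * x_i divided by n.
mean_direct = sum(f * x for f, x in zip(employees, daily_earnings)) / n

print("n =", n, "sum f*d =", sum_fd)
print("short-cut mean =", round(mean_shortcut, 3))
print("direct mean    =", round(mean_direct, 3))
```

The choice of A does not affect the result; a value near the centre of the data simply keeps the deviations small.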
For calculating arithmetic mean for a grouped data set, the following assumptions are made:
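Direct method:
$$\bar{x} = \frac{1}{n}\sum_{i} f_i m_i$$
where $m_i$ is the mid-point of the i-th class, $f_i$ is its frequency and $n = \sum_i f_i$.
Indirect (short-cut) method:
$$\bar{x} = A + \frac{\sum_{i} f_i d_i}{n} \times h$$
where A = assumed arithmetic mean, h = width of the class interval, and
$d_i$ = deviation from the assumed mean in units of h, i.e. $d_i = \dfrac{m_i - A}{h}$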
Exercise:
A company is planning to improve plant safety. For this, accident data for the last 50 weeks was
compiled. These data are grouped into frequency distribution as shown below:
Number of Accidents : 0-4 5-9 10-14 15-19 20-24
Number of weeks : 5 22 13 8 2
i) Calculate the arithmetic mean of the number of accidents per week by both the direct and
the indirect method.
Solution:
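Number of    Mid-point,   No. of weeks,   f_i m_i   d_i = (m_i - A)/h = (m_i - 12)/5   f_i d_i
accidents    m_i          f_i
0-4          2            5               10        (2 - 12)/5 = -2                    -10
5-9          7            22              154       (7 - 12)/5 = -1                    -22
10-14        12           13              156       (12 - 12)/5 = 0                    0
15-19        17           8               136       (17 - 12)/5 = 1                    8
20-24        22           2               44        (22 - 12)/5 = 2                    4
Total                     50              500                                          -20
Direct method:
$$\bar{x} = \frac{1}{n}\sum_{i} f_i m_i = \frac{500}{50} = 10$$
Indirect method (A = 12, h = 5):
$$\bar{x} = A + \frac{\sum_{i} f_i d_i}{n} \times h = 12 + \frac{-20}{50} \times 5 = 12 - 2 = 10$$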
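Both grouped-data methods can be verified with a minimal Python sketch. It assumes the mid-point of each class represents all observations in that class, and uses A = 12 and h = 5 as in the solution.

```python
classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24)]   # number of accidents
weeks = [5, 22, 13, 8, 2]                                   # f_i

# Mid-point of each class: (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]
n = sum(weeks)

# Direct method: sum of f_i * m_i divided by n.
mean_direct = sum(f * m for f, m in zip(weeks, midpoints)) / n

# Indirect (step-deviation) method with assumed mean A and class width h.
A, h = 12, 5
step_deviations = [(m - A) / h for m in midpoints]
sum_fd = sum(f * d for f, d in zip(weeks, step_deviations))
mean_indirect = A + (sum_fd / n) * h

print("mid-points:", midpoints)
print("direct mean   =", mean_direct)
print("indirect mean =", mean_indirect)
```

The two results agree because the step-deviation formula is only an algebraic rearrangement of the direct formula.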
Advantages
1. The calculation of the arithmetic mean is simple and it is unique, that is, every data set has
one and only one mean.
2. The calculation of the arithmetic mean is based on all values in the data set.
3. The arithmetic mean is a reliable single value that reflects all values in the data set.
4. It is least affected by fluctuations of sampling.
5. It is readily put to algebraic treatment.
Disadvantages
The arithmetic mean is highly affected by extreme values. Imagine a data set of 4, 5, 6, 7,
and 8,578. The sum of the five numbers is 8,600 and the mean is 1,720, which doesn’t
tell us anything useful about the level of the individual numbers.
It cannot average the ratios and percentages properly.
It is not an appropriate average for highly skewed distributions.
It cannot be computed accurately if any item is missing.
The mean sometimes does not coincide with any of the observed value.
The mean cannot be calculated for qualitative characteristics such as intelligence, beauty
or loyalty.
The mean cannot be calculated exactly for open-ended class intervals, since the mid-points of such classes are unknown.
When calculating the arithmetic mean, the importance of all the items is considered to be equal.
However, there may be situations in which the items under consideration are not of equal
importance. For example, we may want to find the average marks per student in
different subjects like mathematics, statistics, physics and biology, and these subjects do not have
equal importance. In such cases the arithmetic mean is computed by considering the relative
importance of each item.
So, the weighted arithmetic mean is a measure of central tendency of a set of quantitative
observations when not all the observations have the same importance. We must assign a weight
to each observation depending on its importance relative to other observations.
$$\bar{x}_w = \frac{\sum_{i} x_i w_i}{\sum_{i} w_i}$$
where $w_i$ is the weight assigned to observation $x_i$.
Weighted mean gives the result equal to the simple mean if the weights assigned to each of the
variant values are equal.
When the importance of all the numerical values in the given data set is not equal.
When frequencies of various classes are widely varying.
Where there is a change either in the population of numerical values or in the proportion
of their frequencies.
When ratios, percentages or rates are being averaged.
Exercise: An examination was held to decide the awarding of a scholarship. The weights of the various
subjects were different. The marks obtained by 3 candidates (out of 100) are given below:
Calculate the weighted Arithmetic Mean to award the scholarship and make a comparison with
simple arithmetic means to take the decision.
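$$\bar{x}_w = \frac{\sum_{i} x_i w_i}{\sum_{i} w_i}$$
For A:
$$\bar{x}_w = \frac{\sum_{i} x_i w_i}{\sum_{i} w_i} = \frac{603}{10} = 60.3, \qquad \bar{x} = \frac{\sum_{i} x_i}{n} = \frac{244}{4} = 61$$
For B:
$$\bar{x}_w = \frac{\sum_{i} x_i w_i}{\sum_{i} w_i} = \frac{594}{10} = 59.4, \qquad \bar{x} = \frac{\sum_{i} x_i}{n} = \frac{248}{4} = 62$$
For C:
$$\bar{x}_w = \frac{\sum_{i} x_i w_i}{\sum_{i} w_i} = \frac{618}{10} = 61.8, \qquad \bar{x} = \frac{\sum_{i} x_i}{n} = \frac{238}{4} = 59.5$$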
From the above calculation, it may be noted that student B should get the scholarship as per the
simple arithmetic mean values, but according to the weighted arithmetic mean student C should get
the scholarship, because the subjects of the examination are not of equal importance.
Problem: An appliances manufacturing company is forecasting regional sales for the next year.
The Chittagong branch, with current yearly sales of Tk.387.6 million, is expected to achieve a sales
growth of 7.25%; the Sylhet branch, with current sales of Tk.158.6 million, is expected to grow by
8.20%; and the Barishal branch, with sales of Tk.115 million, is expected to increase sales by 7.15
percent. What is the average rate of growth forecasted for the next year?
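Here the growth rates are the values $x_i$ and the current sales are the weights $w_i$:
$$\bar{x}_w = \frac{\sum_{i} x_i w_i}{\sum_{i} w_i} = \frac{387.6 \times 7.25 + 158.6 \times 8.20 + 115 \times 7.15}{387.6 + 158.6 + 115} = \frac{2810.10 + 1300.52 + 822.25}{661.20} = \frac{4932.87}{661.20} = 7.46\%$$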
Solution:
The data given in the problem are as below:
Now,
$$\bar{x}_w = \frac{\sum_{i} x_i w_i}{\sum_{i} w_i}$$
However, firm should cite this average rate for the clients who use four professional categories.
Problem
According to a utility company, utility plant expenditures per employee were approximately $50,845,
$43,690, $47,098, $56,121, and $49,369 for the years 2005 through 2009. Employees at the end of each
year numbered 4738, 4637, 4540, 4397, and 4026, respectively. Using the annual number of employees
as weights, what is the weighted mean for the annual utility plant investment per employee during this
period?
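One way to work this problem is sketched below in Python. It applies the weighted mean formula with expenditure per employee as the values and the year-end employee counts as the weights; the helper name `weighted_mean` is illustrative, and the simple mean is shown only for comparison.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Utility plant expenditure per employee ($), 2005 through 2009.
expenditure_per_employee = [50845, 43690, 47098, 56121, 49369]

# Number of employees at the end of each year, used as weights.
employees = [4738, 4637, 4540, 4397, 4026]

weighted = weighted_mean(expenditure_per_employee, employees)
simple = sum(expenditure_per_employee) / len(expenditure_per_employee)

print(f"Weighted mean: ${weighted:,.2f}")
print(f"Simple mean:   ${simple:,.2f}")
```

The weighted and simple means are close here because the yearly employee counts are of similar size; the gap would widen if the weights differed more.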
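Using logarithms:
$$\text{G.M.} = \text{Antilog}\left(\frac{\sum_{i} \log x_i}{n}\right)$$
If observations occur with frequencies, then
$$\text{G.M.} = \text{Antilog}\left(\frac{\sum_{i} f_i \log x_i}{n}\right)$$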
Problem
The rate of increase in population of a country during last three decades is 5%, 8% and 12%. Find
the average rate of growth during the last three decades.
Solution:
Decade   Rate of increase (%)   Population at the end of the decade (start of decade = 100)
1        5                      105
2        8                      108
3        12                     112
$$\text{G.M.} = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n} = (105 \times 108 \times 112)^{1/3} = 108.295$$
Hence, the average rate of increase in population over the last three decades is
108.295 - 100 = 8.295, or about 8.3 percent.
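The geometric mean in this solution can be reproduced with a minimal Python sketch, computed once as the n-th root of the product and once through logarithms. The index values 105, 108 and 112 are those of the solution table.

```python
import math

# Population at the end of each decade, taking the start of that decade as 100.
index_values = [105, 108, 112]
n = len(index_values)

# Geometric mean as the n-th root of the product of the values.
gm_product = math.prod(index_values) ** (1 / n)

# Same result through logarithms: antilog of the mean of the logs.
gm_logs = 10 ** (sum(math.log10(x) for x in index_values) / n)

average_growth_rate = gm_product - 100   # percent per decade

print("G.M. =", round(gm_product, 3))
print("Average growth per decade (%) =", round(average_growth_rate, 3))
```

The logarithmic form is the one used with log tables; on a computer both forms give the same value up to rounding.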
Advantage:
Fluctuations of sampling affect the geometric mean relatively little.
It is less affected by extreme values than the arithmetic mean.
Problem:
A given machine is assumed to depreciate 40 percent in value in the first year, 25 percent in
the second year, and 10 percent per year for the next three years, each percentage being
calculated on diminishing value. What is the average depreciation recorded on the
diminishing value for the period of five years?
Solution:
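Rate of depreciation, x   No. of years, f   log x     f log x
40                        1                 1.6021    1.6021
25                        1                 1.3979    1.3979
10                        3                 1.0000    3.0000
Total                     5                           6.0000
We have,
$$\text{G.M.} = \text{Antilog}\left(\frac{\sum_{i} f_i \log x_i}{n}\right) = \text{Antilog}\left(\frac{6}{5}\right) = \text{Antilog}(1.2) = 15.848$$
Hence, the average rate of depreciation for the first five years is 15.85%.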
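The logarithmic calculation above can be reproduced as follows. This minimal Python sketch only mirrors the steps of the solution as written, that is, the frequency-weighted geometric mean of the depreciation rates themselves; it is not meant as a statement about which averaging approach is preferable.

```python
import math

depreciation_rates = [40, 25, 10]   # x, percent per year
years = [1, 1, 3]                   # f, number of years at each rate

n = sum(years)

# Sum of f * log10(x), as in the last column of the table.
sum_f_log_x = sum(f * math.log10(x) for f, x in zip(years, depreciation_rates))

# Antilog of the frequency-weighted mean of the logs.
geometric_mean = 10 ** (sum_f_log_x / n)

print("sum f log x =", round(sum_f_log_x, 4))
print("G.M. =", round(geometric_mean, 3))
```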
Harmonic Mean
A simple way to define the harmonic mean is to call it the reciprocal of the arithmetic mean of the
reciprocals of the observations. The most important criterion for it is that none of the observations
should be zero.
A harmonic mean is used in averaging ratios. The most common examples of ratios are those of
distance and time (speed), cost and units of material, work and time, etc.
$$\text{H.M.} = \frac{n}{\sum_{i=1}^{n} \dfrac{1}{x_i}} \quad \text{(for ungrouped data)}$$
Problem:
Find the harmonic mean of the following distribution of data:
Dividend yield (percent) 2-6 6-10 10-14
Number of companies 10 12 18
Solution:
Dividend yield   Mid value, m_i   No. of companies (frequency, f_i)   Reciprocal of mid value, 1/m_i   f_i(1/m_i)
2-6              4                10                                   1/4                              2.5
6-10             8                12                                   1/8                              1.5
10-14            12               18                                   1/12                             1.5
Total                             40                                                                    5.5
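The table stops at the column totals. A minimal Python sketch of the remaining step is given below, assuming the usual grouped-data form of the harmonic mean, H.M. = n / Σ f_i(1/m_i), which is the frequency version of the ungrouped formula stated earlier.

```python
classes = [(2, 6), (6, 10), (10, 14)]   # dividend yield (percent)
companies = [10, 12, 18]                # f_i

# Mid value of each class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

n = sum(companies)

# Sum of f_i * (1 / m_i), the last column of the table.
sum_f_over_m = sum(f / m for f, m in zip(companies, midpoints))

harmonic_mean = n / sum_f_over_m

print("n =", n, "sum f/m =", sum_f_over_m)
print("H.M. =", round(harmonic_mean, 2), "percent")
```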
Averages of Position
The term ‘position’ refers to the place of the value of an observation in the data set. Sometimes we
need a typical value for characteristics of a data set that are hard to measure numerically, such as
honesty, consumer acceptance, and so on. In such cases other measures of central tendency are used, namely,
1. Median
2. Quartiles
3. Deciles
4. Percentile
5. Mode
1. Median: The median may be defined as the middle value in the data set when its elements are
arranged in a sequential order.
Ungrouped data:
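In this case, the data are first arranged in either ascending or descending order of
magnitude.
(i) If the number of observations n is odd, then
Med = value of the $\left(\dfrac{n+1}{2}\right)$th observation in the data set.
(ii) If n is even, then
Med = mean of the values of the $\left(\dfrac{n}{2}\right)$th and $\left(\dfrac{n}{2}+1\right)$th observations in the data set.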
Grouped data: To find the median of grouped data, first identify the class interval which contains
the median value, that is, the $\left(\dfrac{n}{2}\right)$th observation of the data set. To identify this
class interval, find the cumulative frequency of each class. Then
$$\text{Med} = l + \frac{\dfrac{n}{2} - cf}{f} \times h$$
where
l=lower limit of the median class interval.
cf=cumulative frequency of the class interval prior to the median class interval.
h=width of the class interval.
f=frequency of median class
n=total number of observation.
Example: In a factory employing 3000 persons, 5 percent earn less than Tk.150 per day, 580 earn
Tk.151 to Tk.200 per day, 30% earn from Tk.201 to Tk.250 per day, 500 earn from Tk.251 to
Tk.300, 20 percent earn Tk.301 to Tk.350 per day, and the rest earn Tk.351 or more per day. What
is the median value?
Calculation of Median
Earning (Tk)     Percent of workers   Number of persons   Cumulative frequency
Less than 150    5                    150                 150
151-200          -                    580                 730
201-250          30                   900                 1630
251-300          -                    500                 2130
Median observation = $\left(\dfrac{n}{2}\right)$th = 3000/2 = 1500th observation. This observation lies in
the class interval 201-250.
$$\text{Med} = l + \frac{\dfrac{n}{2} - cf}{f} \times h = 201 + \frac{1500 - 730}{900} \times 50 = 201 + 42.78 = \text{Tk. } 243.78$$
The measures of central tendency which are used for dividing into several equal parts are called
partition values-such as quartiles, deciles and percentiles.
2. Quartiles: The values which divide an ordered data set into 4 equal parts. The first quartile
divides a distribution in such a way that 25 percent (= n/4) of the observations have values less than Q1.
The second quartile is the median of the data set, which divides the data set in half.
The formula is:
$$Q_i = l + \frac{i\,\frac{n}{4} - cf}{f} \times h, \quad i = 1, 2, 3$$
3. Deciles: The deciles are the nine values of the variable that divide an ordered data set into
ten equal parts. The deciles determine the values for 10%, 20%, ... and 90% of the data. The formula is:
$$D_i = l + \frac{i\,\frac{n}{10} - cf}{f} \times h, \quad i = 1, 2, \ldots, 9$$
4. Percentile: In common use, the percentile usually indicates that a certain percentage falls
below that percentile. For example, if you score in the 25th percentile, then 25% of test takers
are below your score, 75 percent at or above your score. The formula is:
$$P_i = l + \frac{i\,\frac{n}{100} - cf}{f} \times h$$
5. Mode: The Mode is that value of an observation which occurs most frequently in the data set,
that is, the point (or class mark) with highest frequencies.
It is always preferable to calculate mode from grouped data set.
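$$M_0 = l + \frac{f_m - f_{m-1}}{2f_m - f_{m-1} - f_{m+1}} \times h$$
where l is the lower limit of the modal class, f_m is its frequency, f_{m-1} and f_{m+1} are the
frequencies of the classes before and after it, and h is the class width.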
Problem:
You are working for a transport manager of a call center, which hires cars for staffs. You are
interested in the weekly distances covered by these cars. Kilometers recorded for a sample of hired
cars during a given week yielded the following data:
Kilometers covered   Number of cars
100-110              4
110-120              0
120-130              3
130-140              7
140-150              11
150-160              8
160-170              5
170-180              0
180-190              2
Total                40
Solution:
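(i) Median:
We have,
$$\text{Med} = l + \frac{\frac{n}{2} - cf}{f} \times h$$
The median observation in the data set is the (n/2)th = (40/2) = 20th observation. This
observation lies in the class interval 140-150. Now we have,
$$\text{Med} = 140 + \frac{20 - 14}{11} \times 10 = 140 + 5.45 = 145.45$$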
(ii) Quartiles:
We have
$$Q_i = l + \frac{i\,\frac{n}{4} - cf}{f} \times h$$
Since there are 40 observations in the data set, we find the 1st quartile at the 40/4 = 10th observation. The
observation lies in the class interval 130-140.
$$Q_1 = 130 + \frac{1 \times \frac{40}{4} - 7}{7} \times 10 = 130 + 4.29 = 134.29$$
We find the 3rd quartile at the 3(40/4) = 30th observation. The observation lies in the class interval
150-160.
$$Q_3 = 150 + \frac{3 \times \frac{40}{4} - 25}{8} \times 10 = 150 + 6.25 = 156.25$$
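(iii) Percentiles:
We find P67 at the 67(40/100)th = 26.8th, i.e. approximately the 27th, observation. The observation
lies in the class interval 150-160.
$$P_{67} = l + \frac{67 \times \frac{n}{100} - cf}{f} \times h = 150 + \frac{27 - 25}{8} \times 10 = 152.50$$
P75 is at the 75(40/100)th = 30th observation, which lies in the class interval 150-160, so it
coincides with the third quartile, 156.25.
We find P87 at the 87(40/100)th = 34.8th, i.e. approximately the 35th, observation. The observation
lies in the class interval 160-170.
$$P_{87} = l + \frac{87 \times \frac{n}{100} - cf}{f} \times h = 160 + \frac{35 - 33}{5} \times 10 = 164$$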
(iv) Deciles:
$$D_7 = 150 + \frac{7 \times \frac{40}{10} - 25}{8} \times 10 = 150 + 3.75 = 153.75$$
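The median, quartile, decile and percentile formulas share one interpolation step. The following is a minimal Python sketch of that step applied to the hired-car data; `grouped_partition_value` is a hypothetical helper name, and the sketch assumes equal class widths and uses the exact position i·n/k rather than a rounded observation number.

```python
def grouped_partition_value(lower_limits, width, freqs, position):
    """Interpolate the value at an observation position (e.g. n/2 for the median)."""
    cf = 0  # cumulative frequency of the classes before the current one
    for l, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= position:
            return l + (position - cf) / f * width
        cf += f
    raise ValueError("position exceeds the total frequency")

lowers = list(range(100, 190, 10))      # lower limits 100, 110, ..., 180
freqs = [4, 0, 3, 7, 11, 8, 5, 0, 2]    # number of cars per class
n = sum(freqs)

median = grouped_partition_value(lowers, 10, freqs, n / 2)
q1 = grouped_partition_value(lowers, 10, freqs, n / 4)
q3 = grouped_partition_value(lowers, 10, freqs, 3 * n / 4)
d7 = grouped_partition_value(lowers, 10, freqs, 7 * n / 10)
print(round(median, 2), round(q1, 2), q3, d7)  # 145.45 134.29 156.25 153.75
```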
Problem-2
The following distribution gives the pattern of overtime work per week done by 100 employees of
a company. Calculate Median, first quartile and seventh decile. Also calculate P60 and the mode of
the overtime work distribution.
Solution:
Overtime Hours Number of employees Cumulative frequency
10-15 11 11
15-20 20 31
20-25 35 66
25-30 20 86
30-35 8 94
35-40 6 100
100
Since the number of observations in the data set is 100, the median value is the (n/2)th = (100/2) = 50th
observation. This observation lies in the class interval 20-25.
$$\text{Med} = 20 + \frac{\frac{100}{2} - 31}{35} \times 5 = 20 + \frac{50 - 31}{35} \times 5 = 20 + 2.714 = 22.714 \text{ hours}$$
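First Quartile:
We have,
$$Q_i = l + \frac{i\,\frac{n}{4} - cf}{f} \times h$$
Now, the first quartile is the value of the (n/4)th observation = (100/4) = 25th observation, which lies in the
class interval 15-20.
$$Q_1 = 15 + \frac{1 \times \frac{100}{4} - 11}{20} \times 5 = 15 + \frac{25 - 11}{20} \times 5 = 15 + 3.5 = 18.5 \text{ hours}$$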
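Seventh Decile:
$$D_i = l + \frac{i\,\frac{n}{10} - cf}{f} \times h$$
The seventh decile is the value of the 7(n/10)th observation = 7(100/10) = 70th observation, which lies in the
class interval 25-30.
Thus,
$$D_7 = 25 + \frac{70 - 66}{20} \times 5 = 25 + 1 = 26 \text{ hours}$$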
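Percentile Calculation
$$P_i = l + \frac{i\,\frac{n}{100} - cf}{f} \times h$$
P60 = value of the 60(n/100)th observation = 60(100/100) = 60th observation, which lies in the class
interval 20-25.
Thus,
$$P_{60} = 20 + \frac{60 \times \frac{100}{100} - 31}{35} \times 5 = 20 + 4.14 = 24.14 \text{ hours}$$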
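Mode Calculation:
We have,
$$M_0 = l + \frac{f_m - f_{m-1}}{2f_m - f_{m-1} - f_{m+1}} \times h$$
The modal class is 20-25, which has the highest frequency (35).
$$M_0 = 20 + \frac{35 - 20}{2 \times 35 - 20 - 20} \times 5 = 20 + 2.5 = 22.50 \text{ hours}$$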
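The mode calculation for the overtime distribution can be reproduced with the sketch below. `grouped_mode` is a hypothetical helper; it assumes equal class widths and a single modal class, and treats a missing neighbouring class as having frequency 0.

```python
def grouped_mode(lower_limits, width, freqs):
    """Mode of grouped data by interpolation within the modal class."""
    m = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal class
    f_m = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0                # frequency of the class before
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0   # frequency of the class after
    return lower_limits[m] + (f_m - f_prev) / (2 * f_m - f_prev - f_next) * width

overtime_lowers = [10, 15, 20, 25, 30, 35]
overtime_freqs = [11, 20, 35, 20, 8, 6]
mode = grouped_mode(overtime_lowers, 5, overtime_freqs)
print(mode)  # 22.5
```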
Problem-3
The following are the profit figures earned by 50 companies in the country.
Problem-4
The following is the data on profit margin (in percent) of three products and their corresponding
sales (inTk.) during a particular period.
The measures of central tendency describe how the major part of the values in a data set tends to
concentrate (cluster) around a central value called the average, with the remaining values scattered on
either side of that value. But these measures do not reveal how the values are dispersed (spread
or scattered) on each side of the central value. The dispersion of values is indicated by the extent
to which they tend to spread over an interval rather than cluster closely around an average.
So, variation is a way to show how data is dispersed, or spread out.
Classification of Dispersion
Measures of dispersion
  Algebraic (Absolute or Relative)
    Range and its coefficient
    Interquartile range or deviation and its coefficient
    Mean absolute deviation and its coefficient
    Standard deviation and its coefficient
  Graphic
    Lorenz Curve
Distance Measure
The distance measures describe the spread or dispersion of values of a variable in terms of
difference among values of data set. The average deviation measures describe the average
deviation for a given measure of central tendency.
Two distance measures are-
(i) Range
(ii) Interquartile deviation
For a grouped frequency distribution, the range is the difference between the
upper class limit of the last class and the lower class limit of the first class.
Coefficient of Range
$$\text{Coefficient of Range} = \frac{H - L}{H + L} = \frac{\text{Range}}{H + L}$$
where H and L are the highest and lowest values in the data set.
Exercise: The following are the sales figures of a firm for the last 12 months.
Months             1   2   3   4   5   6   7   8   9   10  11  12
Sales (Tk. '000)   80  82  82  84  84  86  86  88  88  90  90  92
Range = H - L = 92 - 80 = 12
$$\text{and Coefficient of Range} = \frac{H - L}{H + L} = \frac{12}{92 + 80} = 0.0698$$
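A short Python sketch of the range and its coefficient for the monthly sales figures; the variable names are illustrative only.

```python
sales = [80, 82, 82, 84, 84, 86, 86, 88, 88, 90, 90, 92]  # Tk. '000

highest, lowest = max(sales), min(sales)
value_range = highest - lowest                        # H - L
coeff_range = value_range / (highest + lowest)        # (H - L) / (H + L)
print(value_range, round(coeff_range, 4))  # 12 0.0698
```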
Merits of Range
It is the simplest of the measure of dispersion
Easy to calculate
Easy to understand
It is independent of measure of central tendency.
It is quite useful in cases where the purpose is only to find out the extent of extreme
variation, such as industrial quality control, temperature, rainfall and so on.
Disadvantages
It is based on two extreme observations.
It is largely influenced by two extreme values and completely independent of the other
values.
Application of Range
Fluctuation in share prices
Quality control
Weather forecast
Half the distance between the third quartile and the first quartile is called the semi-interquartile
range or the quartile deviation: $QD = \frac{Q_3 - Q_1}{2}$.
Exercise:
Use an appropriate measure to evaluate the variation in the following data:
Solution
Since the frequency distribution has open-end class intervals on the two extreme sides,
QD is the appropriate measure of variation. The computation of QD is shown in the
table below:
$$Q_i = l + \frac{i\,\frac{n}{4} - cf}{f} \times h$$
Now, the first quartile is the value of the (n/4)th observation = (2010/4) = 502.5th observation, which lies in
the class interval 41-80.
$$Q_1 = 41 + \frac{\frac{2010}{4} - 394}{461} \times 40 = 41 + 9.41 = 50.41 \text{ acres}$$
Third Quartile, Q3
$$Q_3 = l + \frac{3 \times \frac{n}{4} - cf}{f} \times h$$
Problem-1
You are given the data pertaining to kilowatt hours of electricity consumed by 100 persons in a
city.
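$$MAD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|, \text{ for a sample}$$
$$MAD = \frac{1}{N}\sum_{i=1}^{N} |x_i - \mu|, \text{ for a population}$$
For a frequency distribution,
$$MAD = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{\sum f_i}$$
Coefficient of MAD
$$\text{Coefficient of MAD} = \frac{MAD}{\bar{x}} \text{ or } \frac{MAD}{Me} \times 100$$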
Exercise:
The number of patient seen in the emergency ward of a hospital for a sample of 5 days in the last
month were: 153,147,151,156, and 153. Determine the mean deviation and interpret.
Solution:
$$MAD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}| = \frac{12}{5} = 2.4 \text{ patients}$$
The mean absolute deviation is 2.4 patients per day. On average, the daily number of patients
deviates by about 2.4 from the mean of 152 patients per day.
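The mean and mean absolute deviation for the five-day patient sample can be verified with the following sketch; `mean_absolute_deviation` is a hypothetical helper name.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average absolute deviation of the values from their arithmetic mean."""
    centre = mean(values)
    return sum(abs(x - centre) for x in values) / len(values)

patients = [153, 147, 151, 156, 153]
patient_mean = mean(patients)
patient_mad = mean_absolute_deviation(patients)
print(patient_mean, patient_mad)  # 152 2.4
```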
Solution:
We have
$$MAD = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{\sum f_i}$$
$$\bar{x} = A + \frac{\sum f_i d_i}{n} \times h = 175 + \frac{11}{112} \times 50 = 179.91 \text{ per day}$$
The mean absolute deviation is
$$MAD = \frac{5266.04}{112} = 47.02$$
Coefficient of MAD
$$\text{Coefficient of MAD} = \frac{MAD}{\bar{x}} = \frac{47.02}{179.91} = 0.261 = 26.1\%$$
Thus, the average sales is Tk. 179.91 thousand per day, the mean absolute deviation of sales is Tk.
47.02 thousand per day and the relative measure of MAD is 26.1%.
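Product A                 Product B
Sales     |x - Me|        Sales     |x - Me|
23        15              31        5
29        9               36        0
38        0               36        0
41        3               39        3
53        15              47        11
n = 5     42              n = 5     19
For Product A,
$$MAD = \frac{1}{n}\sum_{i=1}^{n} |x_i - Me| = \frac{42}{5} = 8.4$$
Coefficient of MAD
$$\text{Coefficient of MAD} = \frac{MAD}{Me} = \frac{8.4}{38} = 0.221$$
For Product B,
$$MAD = \frac{1}{n}\sum_{i=1}^{n} |x_i - Me| = \frac{19}{5} = 3.8$$
$$\text{Coefficient of MAD} = \frac{MAD}{Me} = \frac{3.8}{36} = 0.106$$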
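Variance: A measure of variability based on the squared deviations of the observed values in the
data set about the mean value.
Population Variance,
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
Sample Variance,
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Standard Deviation
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}, \quad d = x - A$$
$$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2}{n - 1} - \frac{n\bar{x}^2}{n - 1}}$$
For a frequency distribution,
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \mu)^2} = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2} \times h$$
$$\text{Where, } d = \frac{m - A}{h}$$
Sample standard deviation
$$s = \sqrt{\frac{\sum f x^2}{n - 1} - \frac{(\sum f x)^2}{n(n - 1)}}$$
Coefficient of Variation
$$\text{Coefficient of Variation} = \frac{SD}{\text{Mean}} \times 100$$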
Disadvantage
It is difficult to calculate compared to other measures of variation.
While calculating SD, more weight is given to extreme values and less to those near the mean,
because squaring makes large deviations proportionately larger than small
deviations.
It cannot be used for comparing the dispersion of two, or more series given in different
units.
Solution:
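Observation, x   x - x̄   (x - x̄)²
240              -20      400
260              0        0
270              10       100
245              -15      225
255              -5       25
286              26       676
264              4        16
Total: 1820               1442
$$\text{Mean, } \bar{x} = \frac{1}{N}\sum_{i=1}^{n} x_i = \frac{1820}{7} = 260$$
$$\text{Variance, } \sigma^2 = \frac{1}{N}\sum (x_i - \bar{x})^2 = \frac{1442}{7} = 206$$
$$\text{Standard deviation, } \sigma = \sqrt{\sigma^2} = \sqrt{206} = 14.352$$
$$\text{Coefficient of Variation} = \frac{14.352}{260} = 0.0552$$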
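The worked solution divides by N, i.e. it uses the population formulas. A minimal Python check with the standard library `statistics` module (`pvariance` and `pstdev` are its population versions) is sketched below.

```python
from statistics import mean, pstdev, pvariance

observations = [240, 260, 270, 245, 255, 286, 264]

obs_mean = mean(observations)         # 1820 / 7
obs_var = pvariance(observations)     # divides by N, as in the solution above
obs_sd = pstdev(observations)
obs_cv = obs_sd / obs_mean            # coefficient of variation as a ratio
print(obs_mean, obs_var, round(obs_sd, 3), round(obs_cv, 4))
```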
Problem
A study of 100 engineering companies gives the following information.
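Profit (Tk. in crore):   0-10   10-20   20-30   30-40   40-50   50-60
Number of companies:     8      12      20      30      20      10
We have,
$$\sigma = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2} \times h = \sqrt{\frac{200}{100} - \left(\frac{28}{100}\right)^2} \times 10 = \sqrt{2 - 0.0784} \times 10 = 13.86$$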
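The step-deviation calculation can be sketched in Python as follows. The assumed mean A = 35 (the mid value of the 30-40 class) is an assumption made here because it reproduces the sums 200 and 28 used above; the result does not depend on the choice of A.

```python
from math import sqrt

mids = [5, 15, 25, 35, 45, 55]       # class mid values of the profit classes
freqs = [8, 12, 20, 30, 20, 10]      # number of companies
A, h = 35, 10                        # assumed mean (an assumption) and class width

d = [(m - A) / h for m in mids]      # step deviations
N = sum(freqs)
sum_fd = sum(f * di for f, di in zip(freqs, d))
sum_fd2 = sum(f * di ** 2 for f, di in zip(freqs, d))
sigma = sqrt(sum_fd2 / N - (sum_fd / N) ** 2) * h
print(sum_fd, sum_fd2, round(sigma, 2))  # -28.0 200.0 13.86
```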
Problem
Mr. Shad, a retired government servant, is considering investing his money in two proposals. He wants to
choose the one that has higher average net present value and lower standard deviation. The relevant
data are given below. Can you help him in choosing the proposal?
Since expected NPV in both the cases is same, he would like to choose the less risky proposal.
So we need to calculate Standard Deviation in the both cases.
For Proposal A:
NPV (x)   Expected NPV (x̄)   x - x̄    f      f(x - x̄)²
1559      5485                -3926    0.30   4624042.8
5662      5485                177      0.40   12531.6
9175      5485                3690     0.30   4084830.0
Total                                  1.0    8721404.4
$$s_A = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}} = \sqrt{8721404.40} = \text{Tk. } 2953.20$$
For Proposal B:
NPV (x)   Expected NPV (x̄)   x - x̄    f      f(x - x̄)²
-10050    5485                -15535   0.30   72400867.50
5812      5485                327      0.40   42771.6
20584     5485                15099    0.30   68393940
Total                                  1.0    140837579
$$s_B = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}} = \sqrt{140837579} = \text{Tk. } 11867.50$$
Since s_A < s_B, proposal A has the more uniform net present value, i.e. it is less risky. Thus, proposal A may be chosen.
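The comparison of the two proposals can be reproduced with a short sketch that treats f as the probability of each NPV outcome; `expected_value_and_sd` is a hypothetical helper name.

```python
from math import sqrt

def expected_value_and_sd(outcomes, probs):
    """Probability-weighted mean and standard deviation of a discrete distribution."""
    total = sum(probs)
    mean = sum(p * x for x, p in zip(outcomes, probs)) / total
    variance = sum(p * (x - mean) ** 2 for x, p in zip(outcomes, probs)) / total
    return mean, sqrt(variance)

probs = [0.30, 0.40, 0.30]
mean_a, sd_a = expected_value_and_sd([1559, 5662, 9175], probs)
mean_b, sd_b = expected_value_and_sd([-10050, 5812, 20584], probs)
print(round(mean_a), round(sd_a, 1))  # 5485 2953.2
print(round(mean_b), round(sd_b, 1))  # 5485 11867.5
```

Both proposals have the same expected NPV, so the smaller standard deviation identifies the less risky one.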
The ages of 25 persons who secured the pension are as given below:
74 62 84 72 61 83 72 81 64 71 63 61
60 67 74 64 79 73 75 76 69 68 78 66
67
Calculate the monthly average pension payable per person and the standard deviation.
Solution
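Age group   f    Pension (Tk per month), x   fx    fx²
60-65       7    20                          140   2800
65-70       5    25                          125   3125
70-75       6    30                          180   5400
75-80       4    35                          140   4900
80-85       3    40                          120   4800
Total       25                               705   21025
$$\text{Mean, } \bar{x} = \frac{\sum f x}{\sum f} = \frac{705}{25} = \text{Tk. } 28.20$$
Standard Deviation:
$$\sigma = \sqrt{\frac{\sum f x^2}{n} - \bar{x}^2} = \sqrt{\frac{21025}{25} - (28.20)^2} = \sqrt{45.76} = \text{Tk. } 6.76$$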
Problem
The weekly sales of two products A and B were recorded as below:
Product A 59 75 27 63 27 28 56
Product B 150 200 125 310 330 250 225
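For Product A: Let A = 56 be the assumed mean of sales for product A.
$$\bar{x} = A + \frac{\sum f d}{\sum f} = 56 + \frac{-57}{7} = 47.86$$
$$s_A^2 = \frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2 = \frac{2885}{7} - \left(\frac{-57}{7}\right)^2 = 412.143 - 66.306 = 345.84$$
$$S_A = \sqrt{345.84} = 18.60$$
$$CV_A = \frac{S_A}{\bar{x}} \times 100 = \frac{18.60}{47.86} \times 100 = 38.86\%$$
For Product B: Let A = 225 be the assumed mean of sales for product B.
Sales, x   f   d = x - A   fd     fd²
125        1   -100        -100   10000
150        1   -75         -75    5625
200        1   -25         -25    625
225        1   0           0      0
250        1   25          25     625
310        1   85          85     7225
330        1   105         105    11025
Total      7               15     35125
$$\bar{x} = A + \frac{\sum f d}{\sum f} = 225 + \frac{15}{7} = 227.14$$
$$s_B^2 = \frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2 = \frac{35125}{7} - \left(\frac{15}{7}\right)^2 = 5017.857 - 4.592 = 5013.27$$
$$S_B = \sqrt{5013.27} = 70.80$$
$$CV_B = \frac{S_B}{\bar{x}} \times 100 = \frac{70.80}{227.14} \times 100 = 31.17\%$$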
Since coefficient of Variation for product A is more than the product B, therefore the sales fluctuation
in case of Product A is higher.
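The comparison can be checked directly from the raw weekly sales. The sketch below follows the solution in dividing by n (population standard deviation); `coefficient_of_variation` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation (dividing by n) as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100

product_a = [59, 75, 27, 63, 27, 28, 56]
product_b = [150, 200, 125, 310, 330, 250, 225]

cv_a = coefficient_of_variation(product_a)
cv_b = coefficient_of_variation(product_b)
print(round(cv_a, 2), round(cv_b, 2))  # 38.86 31.17
```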
Problem:
The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard
deviation to the mean (average). For example, the expression “The standard deviation is 15% of
the mean” is a CV.
Mathematically,
The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean,
i.e., $CV = \frac{SD}{\text{Mean}}$.
The CV is particularly useful when you want to compare results from two different surveys or tests
that have different measures or values. For example, if you are comparing the results from two
tests that have different scoring mechanisms. If sample A has a CV of 12% and sample B has a
CV of 25%, you would say that sample B has more variation, relative to its mean.
The coefficient of variation shows the extent of variability of data in sample in relation to the mean
of the population. In finance, the coefficient of variation allows investors to determine how much
volatility, or risk, is assumed in comparison to the amount of return expected from investments.
The lower the ratio of standard deviation to mean return, the better risk-return trade-off.
A statistical technique that is used to analyze the strength and direction of the relationship between
two quantitative variables is called correlations analysis.
The measure of correlation is called the correlation coefficient.
The degree of relationship is expressed by a coefficient which ranges from -1 to +1 (-1 ≤ r ≤ +1).
The coefficient of correlation is a number that indicates the strength and direction of statistical
relationship between two variables.
The Strength of relationship is determined by the closeness of the points to a straight line
when a pair of values of two variables are plotted on a graph. A straight-line is used as the
frame of reference for evaluating the relationship.
The direction is determined by whether one variable generally increases or decreases when
the other variable increases.
Correlation is a bivariate analysis that measures the strength of association between two variables
and the direction of the relationship. In terms of the strength of relationship, the value of the
correlation coefficient varies between +1 and -1. A value of ± 1 indicates a perfect degree of
association between the two variables. As the correlation coefficient value goes towards 0, the
relationship between the two variables will be weaker. The direction of the relationship is
indicated by the sign of the coefficient; a + sign indicates a positive relationship and a – sign
indicates a negative relationship.
Figure shows how strength of association between two variables is represented by coefficient of
correlation
Karl Pearson’s correlation coefficient measures quantitatively the extent to which two variables x
and y are correlated.
Correlation coefficient is a mathematical and most popular method of calculating correlation.
Arithmetic mean and standard deviation are the basis for its calculation.
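$$r = \frac{Cov(x, y)}{\sigma_x \sigma_y}$$
$$\text{Where, } Cov(x, y) = \frac{1}{n}\sum (x - \bar{x})(y - \bar{y}), \quad \sigma_x = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2}, \quad \sigma_y = \sqrt{\frac{1}{n}\sum (y - \bar{y})^2}$$
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$
Using deviations d_x and d_y taken from assumed means,
$$r = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$$
Grouped Data:
$$r = \frac{n\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{\sqrt{n\sum f d_x^2 - (\sum f d_x)^2}\,\sqrt{n\sum f d_y^2 - (\sum f d_y)^2}}$$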
Problem:
The following table gives the distribution of items of production and also relatively defective
items among them, according to size groups. Find the correlation coefficient between size and
defect in quality.
Size-group   Mid value, m   dx = (m - 17.5)/h   dx²   Percent of defective items, y   dy = y - 50   dy²   dx·dy
15-16        15.5           -2                  4     75                              25            625   -50
16-17        16.5           -1                  1     60                              10            100   -10
17-18        17.5           0                   0     50                              0             0     0
18-19        18.5           1                   1     50                              0             0     0
19-20        19.5           2                   4     45                              -5            25    -10
20-21        20.5           3                   9     38                              -12           144   -36
Total                       3                   19                                    18            894   -106
Now,
$$r = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}} = \frac{6 \times (-106) - 3 \times 18}{\sqrt{6 \times 19 - (3)^2}\,\sqrt{6 \times 894 - (18)^2}} = -0.949$$
Since the value of r is negative and close to -1, the statistical association between x (size
group) and y (percent of defective items) is strong and negative. We conclude that when the size
group increases, the percentage of defective items decreases and vice versa.
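As a cross-check, r can be computed from the original mid values and percentages instead of the deviations, since the coefficient is unaffected by a change of origin or scale. `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient from raw paired values."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

size_mid = [15.5, 16.5, 17.5, 18.5, 19.5, 20.5]
pct_defective = [75, 60, 50, 50, 45, 38]
r_size = pearson_r(size_mid, pct_defective)
print(round(r_size, 3))  # -0.949
```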
Problem: The following table gives the frequency distribution, according to age, of the marks obtained by 67 students
in an intelligence test. Measure the degree of relationship between age and marks.
Let the age of students and the marks obtained by them be represented by variables x and y respectively.
Calculations of the correlation coefficient for this bivariate data are shown below:
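Each cell shows the frequency f, with f·dx·dy in parentheses.
Marks, y   Mid value, m   dy   x = 18     x = 19    x = 20    x = 21     f        f·dy   f·dy²   f·dx·dy
                               dx = -1    dx = 0    dx = 1    dx = 2
200-250    225            -1   4 (4)      4 (0)     2 (-2)    1 (-2)     11       -11    11      0
250-300    275            0    3 (0)      5 (0)     4 (0)     2 (0)      14       0      0       0
300-350    325            1    2 (-2)     6 (0)     8 (8)     5 (10)     21       21     21      16
350-400    375            2    1 (-2)     4 (0)     6 (12)    10 (40)    21       42     84      50
Total, f                       10         19        20        18         n = 67   52     116     66
f·dx                           -10        0         20        36         Σf·dx = 46
f·dx²                          10         0         20        72         Σf·dx² = 102
f·dx·dy                        0          0         18        48         Σf·dx·dy = 66
Here dy = (m - A)/h, for example dy = (225 - 275)/50 = -1, and dx = x - 19.
$$r = \frac{n\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{\sqrt{n\sum f d_x^2 - (\sum f d_x)^2}\,\sqrt{n\sum f d_y^2 - (\sum f d_y)^2}} = \frac{67 \times 66 - 46 \times 52}{\sqrt{67 \times 102 - (46)^2}\,\sqrt{67 \times 116 - (52)^2}} = 0.415$$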
Interpretation: Since the value of r is positive, therefore age of the students and marks obtained
in an intelligence test are positively correlated to the extent of 0.415. Hence, we conclude that as
the age of the student increases, score of marks in intelligence test also increases.
Calculate the coefficient of correlation from the following bivariate frequency distribution:
Solution: Let advertising expenditure and sales revenue be represented by variables x and y
respectively. The calculation for correlation coefficient are shown below:
Advertising Expenditure
x 5-10 10-15 15-20 20-25
Revenue Mid
y value (m) dy
175-225 200 0 1 0 3 0 4 0 2 0 10 10 0 0
225-275 250 1 1 -1 1 0 3 3 4 8 9 9 9 10
Total, f     13    11    9    7     n = 40
f·dx         -13   0     9    14    Σf·dx = 10
f·dx²        13    0     9    28    Σf·dx² = 50
f·dx·dy      14    0     1    6     Σf·dx·dy = 21
Also, Σf·dy = -7 and Σf·dy² = 45.
$$r = \frac{40 \times 21 - (10)(-7)}{\sqrt{40 \times 50 - (10)^2}\,\sqrt{40 \times 45 - (-7)^2}} = \frac{910}{\sqrt{1900}\,\sqrt{1751}} = 0.499$$
Interpretation: Since the value of r is positive, advertising expenditure and sales
revenue are positively correlated to the extent of 0.499. Hence, we conclude that as the expenditure
on advertising increases, the sales revenue also increases.
This method is applied to measure the association between two variables when only ordinal (or
rank) data are available. It is applied in situations where a quantitative measure of certain
qualitative factors, such as judgment, brands, TV programs, color, taste etc., cannot be fixed.
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
The number ‘6’ is placed in the formula as a scaling device; it ensures that the possible range of R
is from -1 to 1.
When ranks are tied,
$$R = 1 - \frac{6\left\{\sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right\}}{n(n^2 - 1)}$$
Where mi (i=1,2,3…) stands for the number of times an observation is repeated in the data set for
both variables.
Problem:
A financial analyst wanted to find out whether inventory turnover influences company’s earnings
per share (in per cent). A random sample of 7 companies listed in a stock exchange were selected
and the following data was recorded for each.
Solution:
Let us start ranking from lowest value for both variables. Since there are tied ranks, the sum of the
tied ranks are averaged and assigned to each of the tied observations as shown below.
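Inventory turnover, x   Rank, R1   Earnings per share, y   Rank, R2   Difference, d = R1 - R2   d²
4                       2          11                      5          -3.0                      9.00
5                       3.5        9                       4          -0.5                      0.25
7                       6          13                      6.5        -0.5                      0.25
8                       7          7                       1          6.0                       36.00
6                       5          13                      6.5        -1.5                      2.25
3                       1          8                       2.5        -1.5                      2.25
5                       3.5        8                       2.5        1.0                       1.00
Σd² = 51
There are three tied groups, each of size 2: the value 5 in x, and the values 8 and 13 in y.
$$R = 1 - \frac{6\left\{\sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3)\right\}}{n(n^2 - 1)}$$
$$R = 1 - \frac{6\left\{51 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)\right\}}{7(7^2 - 1)} = 1 - \frac{6 \times 52.5}{336} = 0.0625$$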
Interpretation: As R is positive but value is less than 0.20, so there is a very weak positive
association between two variables x and y, i.e., inventory turnover and earnings per share.
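The ranking with averaged tied ranks and the tie correction can be sketched in Python as below. The helper names `average_ranks` and `spearman_with_ties` are hypothetical; the sketch ranks from the lowest value and applies the correction term (m³ - m)/12 once for every tied group in either variable.

```python
from collections import Counter

def average_ranks(values):
    """Rank from the lowest value; tied values share the average of their ranks."""
    first_position = {}
    for pos, v in enumerate(sorted(values), start=1):
        first_position.setdefault(v, pos)
    counts = Counter(values)
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]

def spearman_with_ties(x, y):
    """Rank correlation coefficient with the tie-correction terms added to sum of d^2."""
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    correction = sum((m ** 3 - m) / 12
                     for data in (x, y)
                     for m in Counter(data).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

turnover = [4, 5, 7, 8, 6, 3, 5]
eps = [11, 9, 13, 7, 13, 8, 8]
r_rank = spearman_with_ties(turnover, eps)
print(r_rank)  # 0.0625
```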
Problem
Obtain the rank correlation coefficient between the variables x and y from the following pairs of
observed values.
x 50 55 65 50 55 60 50 65 70 75
y 110 110 115 125 140 115 130 120 115 160
Solution:
Let us start ranking from the lowest value for both variables. Since certain observations in both
data sets are repeated, tied observations are given the average of the ranks they occupy, as shown below:
50: (1+2+3)/3 = 2        110: (1+2)/2 = 1.5
55: (4+5)/2 = 4.5        115: (3+4+5)/3 = 4
65: (7+8)/2 = 7.5
EMBA Program, PSTU | 79
In the data set, for variable x, 50 is repeated thrice, m1=3, 55 is repeated twice, m2=2 and 65 is
repeated twice, m3=2 and for variable y, 110 is repeated twice, m4=2, and 115 is thrice, m5=3.
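$$R = 1 - \frac{6\left\{\sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \frac{1}{12}(m_4^3 - m_4) + \frac{1}{12}(m_5^3 - m_5)\right\}}{n(n^2 - 1)}$$
$$R = 1 - \frac{6\left\{134 + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3)\right\}}{10(10^2 - 1)} = 1 - \frac{6 \times 139.5}{990} = 0.1545$$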
Problem
Use the method of rank correlation to determine the relationship between preference prices and
debentures prices.
R=0.125,
Hence, there is a very low degree of positive correlation, probably no correlation, between
preference share prices and debenture prices.
The statistical technique that express the relationship between two or more variables in the form
of an equation to estimate the value of a variable, based on given value of another variable, is
called regression analysis.
The variable whose value is estimated using algebraic equation is called dependent variable and
the variable whose value is used to estimate this value is independent variable.
The linear algebraic equation used for expressing a dependent variable in terms of independent
variable is called linear regression equation.
It plays a significant role in many human activities, as it is a powerful and flexible tool which is used
to estimate past, present or future events on the basis of past or present events. For instance:
On the basis of past records, a business’s future profit can be estimated.
The Regression Coefficient is the constant ‘b’ in the regression equation that tells about the
change in the value of dependent variable corresponding to the unit change in the independent
variable.
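One of the popular methods to determine the parameters of a fitted regression equation is the Least
Squares method.
Let $\hat{y} = a + bx$ be the least squares line of y on x, where $\hat{y}$ is the estimated average value of
the dependent variable y. The line that minimizes the sum of squares of the deviations of the observed
values of y from those predicted is the best fitting line. The constants a and b are found from the normal equations:
$$\sum y = na + b\sum x$$
$$\sum xy = a\sum x + b\sum x^2$$
$$\text{Or, } b = \frac{S_{xy}}{S_{xx}}, \quad S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{\sum x \sum y}{n}, \quad S_{xx} = \sum (x - \bar{x})^2$$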
Problem:
Use least squares regression line to estimate the increase in sales revenue expected from the
increase of 7.5 percent in advertising expenditure.
Solution:
Assume sales revenue (y) is dependent on advertising expenditure (x). Calculation for
regression line using following normal equations are shown in below:
$$\sum y = na + b\sum x \quad \text{(i)}$$
$$\sum xy = a\sum x + b\sum x^2 \quad \text{(ii)}$$
From equation (i),
40 = 8a + 56b ..........(iii)
From equation (ii),
373 = 56a + 524b ..........(iv)
(iv)/7 - (iii):
53.285 - 40 = 8a + 74.857b - 8a - 56b
13.285 = 18.857b
b = 0.704
From (iii),
40 = 8a + 56 × 0.704
8a = 0.576
a = 0.072
Substituting the values in the regression equation $\hat{y} = a + bx$:
$$\hat{y} = 0.072 + 0.704x$$
For x = 7.5% or 0.075 increase in advertising expenditure, the estimated increase in sales revenue is
$$\hat{y} = 0.072 + 0.704(0.075) = 0.1248 \approx 12.48\%$$
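The two normal equations can also be solved in closed form from the sums. The sketch below uses the sums implied by equations (iii) and (iv), namely n = 8, Σx = 56, Σy = 40, Σxy = 373 and Σx² = 524; `least_squares_from_sums` is a hypothetical helper name. Carrying full precision gives a slightly smaller intercept than the hand calculation, which rounds b to 0.704 before solving for a.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Solve the normal equations of y = a + b*x for the intercept a and slope b."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = least_squares_from_sums(n=8, sum_x=56, sum_y=40, sum_xy=373, sum_x2=524)
print(round(a, 3), round(b, 3))  # 0.068 0.705
```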
Solution:
Assume, blood pressure y as the dependent variable and age (x) as the independent variable. Calculation
for regression equation of blood pressure on age are shown in the table below:
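a) The coefficient of correlation between age and blood pressure is given by
$$r = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}} = \frac{10 \times 1115 - (32)(-33)}{\sqrt{10(1202) - (32)^2}\,\sqrt{10 \times 1813 - (-33)^2}} = \frac{12206}{13689} = 0.892$$
We may conclude that there is a high degree of positive correlation between age and blood
pressure.
b) The regression equation of blood pressure on age is obtained from the normal equations
$$\sum y = na + b\sum x: \quad 1417 = 10a + 522b \quad (1)$$
$$\sum xy = a\sum x + b\sum x^2$$
which give
$$\hat{y} = 83.758 + 1.11x$$
c) For women whose age is 45, the estimated average blood pressure will be
$$\hat{y} = 83.758 + 1.11(45) = 133.71$$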
An index number is a statistical device for comparing the general level of magnitude (scale, size)
of a group related variables in two or more situation.
For example, if the price level of 2018 is compared with what it was in 2000.
An index number can be defined as relative measures describing the average changes in any
quantity over time. In other words, an index number measures the changing value of prices,
quantities or values over a period of time in relation to its value at some fixed point in time, called
the base period. These numbers are stated as a percentage of a base figure.
Mathematically,
$$\text{Index Number} = \frac{\text{Current Period Value}}{\text{Base Period Value}} \times 100$$
Index numbers are classified as un-weighted and weighted.
Example
Calculate index number from the following data by simple aggregate method taking prices of
2000 as base.
Commodity Price per Unit (in Tk)
2000 2004
A 80 95
B 50 60
C 90 100
D 30 45
Solution
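$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100 = \frac{300}{250} \times 100 = 120$$
It means the prices in 2004 were 20% higher than in the base year.
$$\text{Simple average of price relatives} = \frac{\sum \left(\frac{P_1}{P_0} \times 100\right)}{n}, \text{ when the arithmetic mean is used}$$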
Example:2
From the following data, construct an index for 2018 taking 2017 as base by the average price of
relative using (a) arithmetic mean and (b) Geometric mean.
Commodity Price per Unit (in Tk)
2017 2018
A 50 70
B 40 60
C 80 100
D 20 30
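(a) The price relatives (P1/P0 × 100) are 140, 150, 125 and 150, so with the arithmetic mean
$$\text{Simple average of price relatives} = \frac{140 + 150 + 125 + 150}{4} = 141.25$$
(b) With the geometric mean,
$$\text{Simple average of price relatives} = \text{Antilog}\left[\frac{\sum \log\left(\frac{P_1}{P_0} \times 100\right)}{n}\right] = \text{Antilog}\left(\frac{8.5952}{4}\right) = 140.863$$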
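Both averages of price relatives can be checked with the following sketch for the 2017 and 2018 prices. The variable names are illustrative; the geometric mean is taken as the n-th root of the product, which matches the log and antilog procedure up to rounding of the log tables.

```python
from math import prod

prices_2017 = [50, 40, 80, 20]    # base year prices of commodities A to D
prices_2018 = [70, 60, 100, 30]   # current year prices

relatives = [p1 / p0 * 100 for p0, p1 in zip(prices_2017, prices_2018)]
index_am = sum(relatives) / len(relatives)        # arithmetic mean of price relatives
index_gm = prod(relatives) ** (1 / len(relatives))  # geometric mean of price relatives
print(round(index_am, 2), round(index_gm, 2))
```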
Problem
From the data given below, construct the index of price relatives for the year 2002 taking 2001 as
base year using a) arithmetic mean and (b) geometric mean.
Expense on Food Rent Clothing Education Misc
Price in 2001 1800 1000 700 400 700
Price in 2002 2000 1200 900 500 1000
Answers: a) 125.508; b) 125.08
Problem
In 2016, for working class people, wheat was selling at an average price of Tk.160 per 10kg, cloth
at Tk.40 per meter, house rent Tk.10,000 per house, and other items at Tk.100 per unit. By 2017
the cost of wheat rose by Tk.40 per 10 kg, house rent by Tk.1500 per house, and other items
doubled in price. The working class cost of living index for the year 2017 (with 2016 as base) was
160. By how much did the cloth price rise during the period 2016-2017?
There are various method of assigning weights and consequently constructing index number, here
we discuss only three of the methods.
1. Laspeyre’s method
2. Paasche’s method
3. Fisher’s Ideal method.
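Laspeyre’s Method:
The Laspeyres price index is a weighted aggregate price index, where the weights are determined by
quantities in the base period. It is given by
$$P_{01}(L) = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$
Paasche’s method:
Here the weights are determined by the quantities in the current year.
$$P_{01}(P) = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$$
Fisher’s Ideal method:
Fisher’s ideal index is the geometric mean of the Laspeyres and Paasche indexes.
$$P_{01}(F) = \sqrt{L \times P} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$$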
Problem:
For the following data, calculate the price index number of 2001 with 2000 as the base year, using
a) Laspeyere’s method
b) Paasche’s method
c) Fisher’s Ideal Method.
Solution
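$$\sum p_0 q_0 = 172, \quad \sum p_0 q_1 = 148, \quad \sum p_1 q_0 = 251, \quad \sum p_1 q_1 = 220$$
$$\text{Fisher's ideal index number, } P_{01}(F) = \sqrt{L \times P} = \sqrt{\frac{251}{172} \times \frac{220}{148}} \times 100 = 147.28$$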
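The three weighted indexes can be sketched in Python as below. Because the data table for this problem is not reproduced here, the prices and quantities in the sketch are hypothetical values chosen only to illustrate the formulas, and `weighted_price_indexes` is a hypothetical helper name.

```python
from math import sqrt

def weighted_price_indexes(p0, q0, p1, q1):
    """Return the Laspeyres, Paasche and Fisher price indexes (base period = 100)."""
    total = lambda prices, qtys: sum(p * q for p, q in zip(prices, qtys))
    laspeyres = total(p1, q0) / total(p0, q0) * 100   # base-period quantities as weights
    paasche = total(p1, q1) / total(p0, q1) * 100     # current-period quantities as weights
    fisher = sqrt(laspeyres * paasche)                # geometric mean of the two
    return laspeyres, paasche, fisher

# Hypothetical two-commodity data, for illustration only
p0, q0 = [2, 5], [10, 4]
p1, q1 = [3, 6], [12, 5]
L, P, F = weighted_price_indexes(p0, q0, p1, q1)
print(round(L, 2), round(P, 2), round(F, 2))
```

Fisher's index always lies between the Laspeyres and Paasche values, which gives a quick sanity check on a hand calculation.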
Problem
The Arapaho Valley Pediatrics Clinic has been in business for 18 years. The office manager noticed that
prices of clinic materials and office suppliers fluctuate over time. To get a handle on the price trends for
running the clinic, the office manager examined prices of six items the clinic uses as part of its operations.
Shown here are the items, their prices, and the quantities for the years 2005 and 2006. Use these data to
develop unweighted aggregate price indexes for 2006 with a base year of 2005. Compute the Laspeyres
price index for the year 2006 using 2005 as the base year. Compute the Paasche index number for 2006
using 2005 as the base year.
Hypothesis testing is a step-by-step methodology that allows you to make inferences about a
population parameter by analyzing differences between the results observed (the sample statistic)
and the results that can be expected if some underlying hypothesis is actually true.
The most common alpha value is 0.05 or 5%. Other popular choices are 0.01 (1%) and 0.1 (10%).
[Figure: acceptance region and level of significance, α]
Step-4 Select a suitable test statistic and determine the appropriate technique.
Sample size is an important factor in choosing an appropriate test statistic.
Hypothesis testing for μ: if σ is known, use the Z test; if σ is unknown, use the t test.
Step-6 Make a test decision about the null hypothesis and interpret.
Decide, based on a comparison of the calculated value of the test statistic and the critical
value of the test, whether to reject the null hypothesis in favor of the alternative.
Once we have found the p-value or rejection region, we make a statistical decision about
the null hypothesis (i.e. we reject the null or fail to reject the null). Following this
decision, we summarize our results into an overall conclusion for our test.
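The decision step can be illustrated with a two-sided Z test for a mean when σ is known. This is only a sketch: the sample figures (sample mean 52, hypothesized mean 50, σ = 6, n = 36) are hypothetical, and `z_test_two_sided` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided Z test for a population mean when sigma is known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))       # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-tailed p-value
    return z, p_value, p_value < alpha                # True means reject H0

# Hypothetical sample figures, for illustration only
z, p_value, reject_null = z_test_two_sided(sample_mean=52, mu0=50, sigma=6, n=36)
print(round(z, 2), round(p_value, 4), reject_null)
```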
Probability:
A probability is a numerical measure of the likelihood or chance of occurrence of an uncertain
event.
For example, when starting a new business there are three possible outcomes: profit,
loss or break even. When tossing a coin, there are 2 possible outcomes.
Random Experiment
A random experiment is a process by which we observe something uncertain. After the experiment,
the result of the random experiment is known. An outcome/event is a result of a random
experiment.
A random experiment (also called an act or trial) is an activity that leads to the occurrence of one and
only one of several possible outcomes, which is not known until its completion; that
is, the outcome is not perfectly predictable.
Examples: measuring the blood pressure of a group of individuals, tossing a coin and observing the
face that appears, etc.
Sample Space
The set of all possible outcomes (events) for a random experiment is called sample space.
(i) No two or more of these outcomes occur simultaneously.
(ii) Exactly one of the outcomes must occur, whenever the experiment is performed.
It may be denoted by the capital letter S.
For example,
Consider the experiment of tossing two coins. The four possible outcomes are HH, HT,
TH, TT. The sample space is S= {HH, HT, TH, TT}
Random experiment: roll a die; sample space: S={1,2,3,4,5,6}.
Random experiment: observe the number of iPhones sold by an Apple store in Boston in
2015; sample space: S={0,1,2,3,⋯}.
Event
An event is a set of outcomes from an experiment (a subset of the sample space).
Each experiment may result in one or more outcomes, which are called events.
For example, conducting an experiment by tossing a coin. The outcome of this experiment is the
coin landing ‘heads’ or ‘tails’. These can be said to be the events connected with the experiment.
So when the coin lands tails, an event can be said to have occurred.
Event types
Mutually Exclusive events: If two or more events cannot occur simultaneously in a single trial
of an experiment, then such events are called mutually exclusive. They are also called disjoint events.
For example, the numbers 2 and 3 cannot occur simultaneously on the roll of a die.
Collectively Exhaustive:
A list of events is said to be collectively exhaustive when all possible events that can occur from
an experiment includes every possible outcome. That is, two or more events are said to be
collectively exhaustive if one of the events must occur. Two events A and B are known as
exhaustive events if the union of A and B gives the sample space.
If you are rolling a six-sided die, the set of events {1, 2, 3, 4, 5, 6} is collectively exhaustive. Any
roll must be represented by one of the set. Another example of an event that is both collectively
exhaustive and mutually exclusive is tossing a coin. The outcome must be either heads or tails, or
p (heads or tails) = 1, so the outcomes are collectively exhaustive. When heads occurs, tails can't
occur, or p (heads and tails) = 0, so the outcomes are also mutually exclusive.
Compound event: When two or more events occur in connection with each other, then their
simultaneous occurrence is called compound event. These event may be independent and
dependent.
For example, when we throw a die, the possibility of an even number appearing is a compound
event, as there is more than one possibility, there are three possibilities i.e. E = {2,4,6}
Dependent and independent Event: Two events are independent if the result of the second
event is not affected by the result of the first event. That is, two events A and B are independent if the
fact that A occurs does not affect the probability of B occurring.
For example, landing on heads after tossing a coin and rolling a 5 on a single six-sided die are independent events.
Two events are dependent if the result of the first event affects the outcome of the second event
so that its probability is changed. For example, when two marbles are drawn from a bag one after the other and the first marble is not replaced, the
sample space for the second event changes and so the events are dependent.
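Independence can be tested by enumeration: A and B are independent exactly when P(A and B) = P(A) × P(B). The sketch below applies that test to the coin-and-die example and to a marble draw without replacement. The bag contents (3 red, 2 blue) and the helper names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from itertools import permutations, product

def prob(space, event):
    """Share of equally likely outcomes in `space` that satisfy `event`."""
    return Fraction(sum(1 for outcome in space if event(outcome)), len(space))

def independent(space, a, b):
    """A and B are independent when P(A and B) equals P(A) * P(B)."""
    p_both = prob(space, lambda o: a(o) and b(o))
    return p_both == prob(space, a) * prob(space, b)

# Coin toss combined with a die roll: 2 * 6 = 12 equally likely outcomes.
coin_and_die = list(product("HT", range(1, 7)))
heads = lambda o: o[0] == "H"
rolls_five = lambda o: o[1] == 5
print(independent(coin_and_die, heads, rolls_five))   # True

# Hypothetical bag of 3 red and 2 blue marbles, two draws without replacement.
bag = ["R", "R", "R", "B", "B"]
two_draws = list(permutations(bag, 2))                # 20 ordered pairs
first_red = lambda o: o[0] == "R"
second_red = lambda o: o[1] == "R"
print(independent(two_draws, first_red, second_red))  # False
```

In the marble case P(both red) = 3/10, while P(first red) × P(second red) = 9/25, so the events are dependent.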
Equally likely event: Two or more events are said to be equally likely if each has an equal
chance to occur. In other words, equally likely events are events that have the same theoretical
probability (or likelihood) of occurring.
If we toss a fair coin, there are equal chances of getting a head or a tail. Hence, getting a head and getting a tail
are equally likely events. If a die is rolled, then getting an odd number and
getting an even number are also equally likely events.
Complementary event: Two events are complementary if they are mutually exclusive and together
cover all possible outcomes, so that exactly one of them must occur. For example, when flipping a coin,
getting heads and getting tails are complementary events because there are no other options.
The complement of A, denoted by A′, Ā, or A^c, consists of all outcomes in which the event A does
not occur, so P(A^c) = 1 − P(A).
There are three ways to approach probability: the classical approach, the relative frequency approach,
and the subjective approach.
Classical Approach
The classical approach describes probability in terms of the proportion of times that an event can be
theoretically expected to occur. This approach is based on the assumption that all possible outcomes
(finite in number) of an experiment are mutually exclusive and equally likely.
For example, most people know that if you toss a fair coin, there is a 50/50 chance of heads or tails.
All outcomes are equally likely since neither head nor tail has a better chance of occurring.
Head can occur 50% of the time and tail can occur 50% of the time as well.
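Under the classical approach, a probability is the number of favourable outcomes divided by the total number of equally likely outcomes. The sketch below counts outcomes for one coin, two coins, and two dice; the function name classical_probability is an illustrative assumption.

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, event):
    """Favourable outcomes / total outcomes, assuming equally likely outcomes."""
    outcomes = list(sample_space)
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

# One toss of a fair coin.
p_head = classical_probability(["H", "T"], lambda o: o == "H")

# Two tosses: sample space is HH, HT, TH, TT.
p_two_heads = classical_probability(product("HT", repeat=2),
                                    lambda o: o == ("H", "H"))

# Two dice: 36 equally likely ordered pairs.
p_two_sixes = classical_probability(product(range(1, 7), repeat=2),
                                    lambda o: o == (6, 6))

print(p_head, p_two_heads, p_two_sixes)   # 1/2 1/4 1/36
```

No experiment needs to be run: the values follow from counting alone, which is what separates the classical approach from the other two.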
Relative Frequency Approach
The probability of an event A is the ratio of the number of times that A has occurred to the total
number of trials n of an experiment.
This approach is based on the assumption that a random experiment can be repeated a large number
of times under identical conditions, where trials are independent of each other. While conducting a
random experiment we may or may not observe the desired event. But as the experiment is
repeated many times, that event occurs in some proportion of the trials.
For example,
Experiment: Administering a Taste Test for a New Soup
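As a separate illustration of the relative frequency approach, a coin toss can be simulated and the proportion of heads observed as the number of trials grows. This is only a sketch: the simulated fair coin, the seed, and the function name relative_frequency are assumptions made for the example.

```python
import random

def relative_frequency(trials, seed=0):
    """Proportion of heads observed in `trials` simulated fair-coin tosses."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

# The observed proportion tends to settle near 0.5 as trials increase.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With few trials the proportion can be far from 0.5; with many trials it stays close to it, which is the sense in which relative frequency estimates probability.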
Subjective approach
This approach is used when sufficient data are not available or when sources of information
give differing results. The probability is then based on personal judgment, experience, or belief.
For example, you think you have a 50/50 chance of getting the job you applied for, because the
other applicant is also very qualified.
Problem: Classify each of the following probabilities as classical, relative frequency, or subjective.
(a) The probability the Cubs will win the World Series this year is 0.175.
(b) The probability tuition will increase next year is 0.95.
(c) The probability that you will win the lottery is 0.00062.
(d) The probability a randomly selected flight will arrive on time is 0.875.
(e) The probability of tossing a coin twice and observing two heads is 0.25.
(f) The probability that your car will start on a very cold day is 0.97.
(g) The probability of scoring on a penalty shot in ice hockey is 0.47.
(h) The probability that the current mayor will resign is 0.85.
(i) The probability of rolling two sixes with two dice is 1⁄36.
(j) The probability that a president elected in a year ending in zero will die in office is 7⁄10.
(k) The probability that you will go to Europe this year is 0.14.
Solution:
a) Subjective
b) Relative frequency
c) Classical
d) Relative frequency
e) Classical
f) Relative frequency
g) Relative frequency
h) Subjective
i) Classical
j) Relative frequency
k) Subjective
Southern Bell is considering the distribution of funds for a campaign to increase long-distance
calls within North Carolina. The following table lists the markets that the company considers
worthy of focused promotions:
(a) Are the market segments listed in the table collectively exhaustive? Are they mutually
exclusive?
(b) Make a collectively exhaustive and mutually exclusive list of the possible events of
the spending decision.
(c) Suppose the company has decided to spend the entire $800,000 on
special campaigns. Does this change your answer to part (b)? If so, what is your new answer?
Presentation of Events
A contingency table is used to classify sample observations according to two or more identifiable
characteristics.
Decision Trees:
Hints: Mutually exclusive events cannot happen at the same time. For example: when tossing a
coin, the result can either be heads or tails but cannot be both. Events are independent if the
occurrence of one event does not influence (and is not influenced by) the occurrence of the other(s).
By definition, two events are mutually exclusive (disjoint) if it is impossible for them to
occur together. Given two events A and B, they are mutually exclusive if P(A ∩ B) = 0. If two
mutually exclusive events each have non-zero probability, they cannot be independent.
To see why, consider two events A and B that are both mutually exclusive and
independent.
Since they are mutually exclusive,
⇒ P(A ∩ B) = 0
Since they are independent,
⇒ P(A ∩ B) = P(A) · P(B)
Combining the two, P(A) · P(B) = 0, so P(A) = 0 or P(B) = 0.
Thus, two events can be both mutually exclusive and independent only if at least one of them
has zero probability, i.e. is guaranteed not to occur.
----------------
Problem-1
A research agency administers a demographic survey to 90 telemarketing companies to determine the size
of their operations. When asked to report how many employees now work in their telemarketing operation,
the companies gave responses ranging from 1 to 100. The agency’s analyst organizes the figures into a
frequency distribution.
Number of Employees Working in Telemarketing     Number of Companies
0–under 20                                       32
20–under 40                                      16
40–under 60                                      13
60–under 80                                      10
80–under 100                                     19
a. Compute the mean, median, and mode for this distribution.
b. Compute the sample standard deviation for these data.
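One way to approach Problem-1 is with the usual grouped-data formulas, using class midpoints in place of the unknown raw values. The sketch below is not the notes' own solution. It assumes the mode is taken as the midpoint of the modal class and the median is interpolated within the median class; other textbooks use slightly different conventions.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each class in the table above.
classes = [(0, 20, 32), (20, 40, 16), (40, 60, 13), (60, 80, 10), (80, 100, 19)]

n = sum(f for _, _, f in classes)
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

# Grouped mean: sum(f * M) / n, where M is the class midpoint.
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Grouped median: L + ((n/2 - cf) / f_med) * w for the class containing n/2.
cumulative = 0
for lo, hi, f in classes:
    if cumulative + f >= n / 2:
        median = lo + (n / 2 - cumulative) / f * (hi - lo)
        break
    cumulative += f

# Mode (assumed convention): midpoint of the class with the highest frequency.
modal_lo, modal_hi, _ = max(classes, key=lambda c: c[2])
mode = (modal_lo + modal_hi) / 2

# Sample standard deviation: sqrt(sum(f * (M - mean)^2) / (n - 1)).
variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
std_dev = sqrt(variance)

print(round(mean, 2), median, mode, round(std_dev, 2))   # 42.89 36.25 10.0 31.35
```

Because the raw responses are not available, these values are approximations based on the class midpoints.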
Problem-2
The client company data from the Decision Dilemma reveal that 155 employees worked one of four types
of positions. Shown here again is the raw values matrix (also called a contingency table) with the frequency
counts for each category and for subtotals and totals containing a breakdown of these employees by type
of position and by sex.
i) If an employee of the company is selected randomly, what is the probability that the employee is
female or a professional worker?
ii) If an employee of the company is selected randomly, what is the probability that the
employee is a female worker?
iii) If a female employee of the company is selected randomly, what is the probability that the
employee is technical?
Problem-4
The data below represent the overall miles per gallon of 2019 SUVs.
24 23 22 21 22 22 18 19 19 19 21 21 21 18
19 21 17 22 18 18 22 16 16
You are required to compute and comment on the first quartile, the third quartile, and the
interquartile range.
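The quartiles for the 23 values above can be checked with Python's standard library. This is a sketch, not the notes' own solution: it assumes the default "exclusive" method of statistics.quantiles, which places Q1 and Q3 at positions (n + 1)/4 and 3(n + 1)/4 in the sorted data. Other quartile conventions can give different results on other data sets.

```python
from statistics import quantiles

mpg = [24, 23, 22, 21, 22, 22, 18, 19, 19, 19, 21, 21, 21, 18,
       19, 21, 17, 22, 18, 18, 22, 16, 16]

# quantiles() sorts the data itself; n=4 returns the three quartile cut points.
q1, q2, q3 = quantiles(mpg, n=4)
iqr = q3 - q1

print(q1, q3, iqr)   # Q1 = 18, Q3 = 22, IQR = 4
```

With n = 23 the positions are the 6th and 18th sorted values, so no interpolation is needed. The middle 50% of the SUVs fall between 18 and 22 miles per gallon, a spread of 4.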
Problem-5
A corporation owns several companies. The strategic planner for the corporation believes dollars
spent on advertising can to some extent be a predictor of total sales dollars. As an aid in long-
term planning, she gathers the following sales and advertising information from several of the
companies for 2009 ($ millions).
Advertising cost     Sales
12                   148
4                    55
222                  338
60                   994
38                   541
6                    89
17                   126
41                   379
Based on the relationship developed in (i) above, what would the sales figure be if the advertising
cost is $111.50?
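Part (i) is not shown in these notes. Assuming it asks for a simple linear regression of sales on advertising cost, the least-squares line and the requested prediction can be sketched as below. The function name fit_line is illustrative, and the data are used exactly as printed in the table.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for the line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Values as printed in the table ($ millions).
advertising = [12, 4, 222, 60, 38, 6, 17, 41]
sales = [148, 55, 338, 994, 541, 89, 126, 379]

slope, intercept = fit_line(advertising, sales)
predicted_sales = intercept + slope * 111.50

print(f"sales = {intercept:.2f} + {slope:.2f} * advertising")
print(f"predicted sales at 111.50: {predicted_sales:.2f}")
```

The prediction is only as reliable as the fitted line, so any unusual point in the table should be reviewed before relying on the result.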