Ups Quant Notes
DEFINITION OF STATISTICS
HISTORY OF STATISTICS
The word "statistics" has been derived from the Latin word "Status" or the Italian word "Statista"; the meaning of these words is "Political State".
Inferential statistics can be defined as a field of statistics that uses analytical tools for drawing
conclusions about a population by examining random samples. The goal of inferential statistics is to
make generalizations about a population. In inferential statistics, a statistic computed from the sample data (e.g., the sample mean) is used to make inferences about the population parameter (e.g., the population mean).
IMPORTANCE OF STATISTICS IN DIFFERENT FIELDS
Statistics plays a vital role in every field of human activity. It has an important role in determining the existing position of per capita income, unemployment, population growth rate, housing, schooling, medical facilities, etc. in a country. Statistics now holds a central position in almost every field, including Industry, Commerce, Trade, Physics, Chemistry, Economics, Mathematics, Biology, Botany, Psychology and Astronomy, so the application of statistics is very wide. Below we discuss some important fields in which statistics is commonly applied.
1. Business:
Statistics plays an important role in business. A successful businessman must be very quick and accurate in decision making. He must know what his customers want, and should therefore know what to produce and sell, and in what quantities. Statistics helps the businessman to plan production according to the tastes of the customers, and the quality of the products can also be checked more efficiently by using statistical methods. All the activities of the businessman are thus based on statistical information. He can make correct decisions about the location of the business, marketing of the products, financial resources, etc.
2. In Economics:
Statistics plays an important role in economics; economics largely depends upon statistics. National income accounts are multipurpose indicators for economists and administrators, and statistical methods are used for the preparation of these accounts. In economic research, statistical methods are used for collecting and analysing data and for testing hypotheses. The relationship between supply and demand is studied by statistical methods; imports and exports, the inflation rate and per capita income are all problems that require a good knowledge of statistics.
3. In Mathematics:
Statistics plays a central role in almost all natural and social sciences. The methods of the natural sciences are the most reliable, but the conclusions drawn from them are only probable, because they are based on incomplete evidence. Statistics helps in describing these measurements more precisely. Statistics is a branch of applied mathematics: a large number of statistical methods such as probability, averages, dispersion and estimation are used in mathematics, while techniques of pure mathematics such as integration, differentiation and algebra are used in statistics.
4. In Banking:
Statistics plays an important role in banking. Banks make use of statistics for a number of purposes. They work on the principle that not all the people who deposit their money with the bank withdraw it at the same time; the bank earns profits out of these deposits by lending to others on interest. Bankers use statistical approaches based on probability to estimate the number of depositors and their claims on a given day.
5. In Astronomy:
Astronomy is one of the oldest branches of statistical study; it deals with the measurement of distances, sizes, masses and densities of heavenly bodies by means of observations. During these measurements errors are unavoidable, so the most probable measurements are found by using statistical methods.
DIAGRAMMATIC AND GRAPHICAL REPRESENTATION OF DATA
Although tabulation is a very good technique to present data, diagrams are an advanced technique to represent data.
Types of Diagrams
(a) Line Diagrams
In these diagrams only a line is drawn to represent one variable. These lines may be vertical or horizontal, and are drawn so that their lengths are in proportion to the values of the terms or items, allowing easy comparison.
(b) Simple Bar Diagrams
Like line diagrams, these figures are also used where only a single dimension, i.e. length, can present the data. The procedure is almost the same, except that the lines have a measurable thickness. Bars can be drawn either vertically or horizontally. The breadth of these bars should be equal, and similarly the distance between the bars should be equal; the breadth and spacing should be chosen according to the space available on the paper.
(c) Multiple Bar Diagrams
MEASURES OF CENTRAL TENDENCY
Arithmetic Mean
Ungrouped Data
For ungrouped data, the mean is the sum of the observations divided by their number: x̄ = Σx / n.
Example 1
Calculate the mean for the pH levels of soil 6.8, 6.6, 5.2, 5.6, 5.8.
Solution
x̄ = (6.8 + 6.6 + 5.2 + 5.6 + 5.8)/5 = 30.0/5 = 6.0
Grouped Data
The mean for grouped data is obtained from the following (shortcut) formula:
x̄ = A + (Σfd / n) × c, where d = (x − A)/c
Where
A = any value in x (the assumed mean)
n = total frequency
c = width of the class interval
Example 3
For the frequency distribution of seed yield of plots given in the table, calculate the mean yield per plot.

Yield per plot (in g): 64.5-84.5 | 84.5-104.5 | 104.5-124.5 | 124.5-144.5
No. of plots:                  3 |          5 |           7 |          20

Solution (shortcut method)

Yield (in g)    No. of plots (f)   Mid x    d    fd
64.5-84.5              3            74.5   -1    -3
84.5-104.5             5            94.5    0     0
104.5-124.5            7           114.5    1     7
124.5-144.5           20           134.5    2    40
Total                 35                         44

Here A = 94.5 and c = 20, so
Mean = A + (Σfd/n) × c = 94.5 + (44/35) × 20 = 94.5 + 25.14 = 119.64 g
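The shortcut-method calculation can be checked with a short Python sketch (not part of the original notes; the mid-points, frequencies and A = 94.5 are taken from Example 3):

# Grouped mean by the assumed-mean (shortcut) method: mean = A + (sum(f*d)/n) * c
mids  = [74.5, 94.5, 114.5, 134.5]   # class mid-points
freqs = [3, 5, 7, 20]                # frequencies
A, c = 94.5, 20                      # assumed mean and class width
d = [(x - A) / c for x in mids]      # coded deviations: -1, 0, 1, 2
n = sum(freqs)
mean = A + sum(f * di for f, di in zip(freqs, d)) / n * c
print(round(mean, 2))                # 119.64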
Geometric mean
The geometric mean of a series containing n observations is the nth root of the product of the values. If x1, x2, …, xn are the observations, then
G.M. = (x1 × x2 × … × xn)^(1/n)
log GM = (1/n) Σ log xi
GM = Antilog[(1/n) Σ log xi]
Grouped Data
For grouped data, GM = Antilog(Σ f log x / n).
Example 12
Find the Geometric mean for the following
Weight of sorghum (x) No. of ear head(f)
50 4
65 6
75 16
80 8
95 7
100 4
Solution
Weight of sorghum (x)   No. of ear heads (f)   log x     f log x
50                               4             1.6990     6.7959
65                               6             1.8129    10.8775
75                              16             1.8751    30.0011
80                               8             1.9031    15.2247
95                               7             1.9777    13.8441
100                              4             2.0000     8.0000
Total                           45                       84.7433
Here n = 45
GM = Antilog(84.7433/45) = Antilog(1.8832) = 76.43 g
Continuous distribution
Example 13
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the geometric mean.
Weights of ear heads (in g)   No. of ear heads (f)
60-80                                 22
80-100                                38
100-120                               45
120-140                               35
140-160                               20
Total                                160
Solution
Weights of ear heads (in g)   No. of ear heads (f)   Mid x   log x    f log x
60-80                                 22                70    1.8451    40.59
80-100                                38                90    1.9542    74.26
100-120                               45               110    2.0414    91.86
120-140                               35               130    2.1139    73.99
140-160                               20               150    2.1761    43.52
Total                                160                               324.22
Here n = 160
GM = Antilog(324.22/160) = Antilog(2.0264) = 106.23 g
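As a quick check, here is a small Python sketch of the grouped geometric-mean formula GM = Antilog(Σ f log x / n), re-using the mid-points and frequencies of Example 13 (illustrative only, not part of the original notes):

import math

# GM for grouped data: antilog of the frequency-weighted mean of log x
mids  = [70, 90, 110, 130, 150]
freqs = [22, 38, 45, 35, 20]
n = sum(freqs)
log_sum = sum(f * math.log10(x) for f, x in zip(freqs, mids))
gm = 10 ** (log_sum / n)             # "antilog" means 10**value
print(round(gm, 2))                  # about 106.27 (106.23 with 4-figure log tables)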
Harmonic Mean
The harmonic mean is the reciprocal of the mean of the reciprocals: for raw data, H.M. = n / Σ(1/x).
Example 13
From the given data 5, 10, 17, 24, 30 calculate the H.M.
Solution
x       1/x
5       0.2000
10      0.1000
17      0.0588
24      0.0417
30      0.0333
Total   0.4338
H.M. = n / Σ(1/x) = 5 / 0.4338 = 11.526
Example 14
Number of tomatoes per plant are given below. Calculate the harmonic mean.
Number of tomatoes per plant 20 21 22 23 24 25
Number of plants 4 2 7 1 3 1
Solution
Number of tomatoes per plant (x)   No. of plants (f)   1/x      f/x
20                                         4           0.0500   0.2000
21                                         2           0.0476   0.0952
22                                         7           0.0454   0.3178
23                                         1           0.0435   0.0435
24                                         3           0.0417   0.1251
25                                         1           0.0400   0.0400
Total                                     18                    0.8216
H.M. = n / Σ(f/x) = 18 / 0.8216 = 21.91
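A short Python sketch of the grouped harmonic-mean formula H.M. = n / Σ(f/x), re-computing Example 14 (not from the original notes):

# Harmonic mean of a frequency distribution: HM = n / sum(f/x)
x = [20, 21, 22, 23, 24, 25]
f = [4, 2, 7, 1, 3, 1]
n = sum(f)
hm = n / sum(fi / xi for fi, xi in zip(f, x))
print(round(hm, 2))                  # about 21.9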
Median
The median is the middle most item that divides the group into two equal parts, one part
comprising all values greater, and the other, all values less than that item.
Ungrouped or Raw data
Arrange the given values in ascending order. If the number of values is odd, the median is the middle value. If the number of values is even, the median is the mean of the middle two values.
By formula: Median = size of the ((n+1)/2)th item.
Example 4
If the weights of sorghum ear heads are 45, 60,48,100,65 gms, calculate the median
Solution
Here n = 5
First arrange it in ascending order
45, 48, 60, 65, 100
Median = ((5+1)/2)th item = 3rd item = 60 gms
Example 5
If the sorghum ear-head weights are 5, 48, 60, 65, 65, 100 gms, calculate the median.
Solution
Here n = 6. Since n is even, the median is the mean of the middle two values:
Median = (60 + 65)/2 = 62.5 gms
Grouped data
In a grouped distribution, values are associated with frequencies. Grouping can be in the
form of a discrete frequency distribution or a continuous frequency distribution. Whatever may be
the type of distribution, cumulative frequencies have to be calculated to know the total number of
items.
Cumulative frequency (cf)
The cumulative frequency of each class is the sum of the frequency of that class and the frequencies of the previous classes, i.e. the frequencies are added successively, so that the last cumulative frequency gives the total number of items.
Discrete Series
Step 1: Find the cumulative frequencies.
Step 2: Find (n+1)/2.
Step 3: See, in the cumulative frequencies, the value just greater than (n+1)/2; the corresponding value of x is the median.
Example 6
The following data pertain to the number of insects per plant. Find the median number of insects per plant.
Number of insects per plant (x) 1 2 3 4 5 6 7 8 9 10 11 12
No. of plants(f) 1 3 5 6 10 13 9 5 3 2 2 1
Solution
Form the cumulative frequency table
x f cf
1 1 1
2 3 4
3 5 9
4 6 15
5 10 25
6 13 38
7 9 47
8 5 52
9 3 55
10 2 57
11 2 59
12 1 60
60
Median = size of the (n/2)th and (n/2 + 1)th items.
Here the number of observations is even, so the median is the average of the 30th and 31st items. Both the 30th and the 31st items fall within the cumulative frequency 38, for which x = 6.
Median = 6 insects per plant.
Continuous Series
Step 1: Find the cumulative frequencies.
Step 2: Find n/2.
Step 3: See, in the cumulative frequencies, the value first greater than n/2; the corresponding class interval is called the median class. Then apply the formula
Median = l + ((n/2 − m)/f) × c
where l = lower limit of the median class, m = cumulative frequency preceding the median class, f = frequency of the median class and c = class interval.
Example: here n = 164, so n/2 = 82. It lies between 60 and 105. Corresponding to 60 the less-than class is 100, and corresponding to 105 the less-than class is 120. Therefore the median class is 100-120 and its lower limit is 100.
Here l = 100, n = 164, f = 45, c = 20, m = 60
Median = 100 + ((82 − 60)/45) × 20 = 100 + 9.78 = 109.78
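The median-class formula translates directly into Python; the sketch below simply plugs in the figures from the fragment above (l = 100, n = 164, f = 45, m = 60, c = 20):

# Median for a continuous frequency distribution:
# median = l + ((n/2 - m) / f) * c
l, n, f, m, c = 100, 164, 45, 60, 20
median = l + ((n / 2 - m) / f) * c
print(round(median, 2))              # 109.78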
Mode
The mode refers to the value in a distribution that occurs most frequently. It is an actual value, which has the highest concentration of items in and around it. It shows the centre of concentration of the frequency in and around a given value; therefore, where the purpose is to know the point of highest concentration, it is preferred. It is, thus, a positional measure.
Its importance is very great in agriculture, for example in finding the typical height of a crop variety, the main source of irrigation in a region, or the most disease-prone paddy variety. The mode is thus an important measure in the case of qualitative data.
Grouped Data
For a discrete distribution, find the highest frequency; the corresponding value of x is the mode.
Example:
Find the mode for the following
Weight of sorghum in No. of ear head(f)
gms (x)
50 4
65 6
75 16
80 8
95 7
100 4
Solution
The maximum frequency is 16. The corresponding x value is 75.
mode = 75 gms.
Continuous distribution
Locate the highest frequency; the class corresponding to that frequency is called the modal class. Then apply the formula
Mode = l + (f2/(f0 + f2)) × c
where l = lower limit of the modal class, f0 = frequency of the class preceding the modal class, f2 = frequency of the class succeeding the modal class, and c = class interval.
Example 10
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the mode.
Weights of ear heads (g)   No. of ear heads (f)
60-80                              22
80-100                             38
100-120                            45  (modal class)
120-140                            35
140-160                            20
Total                             160
Solution
The highest frequency is 45, so the modal class is 100-120. Here l = 100, f0 = 38, f2 = 35 and c = 20.
Mode = 100 + (35/(38 + 35)) × 20 = 100 + 9.589 = 109.589 g
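A minimal Python sketch of the modal-class formula used in Example 10 (the values are those of the example; not part of the original notes):

# Mode for a continuous frequency distribution (formula used in these notes):
# mode = l + (f2 / (f0 + f2)) * c
l, f0, f2, c = 100, 38, 35, 20       # modal class is 100-120
mode = l + (f2 / (f0 + f2)) * c
print(round(mode, 3))                # 109.589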
Percentiles
The percentile values divide the distribution into 100 parts each containing 1 percent of the
cases. The xth percentile is that value below which x percent of values in the distribution fall. It
may be noted that the median is the 50th percentile.
For raw data, first arrange the n observations in increasing order. Then the xth percentile is the (x(n + 1)/100)th item.
For grouped data, the xth percentile is given by
Px = l + ((xn/100 − m)/f) × c
Where
l = lower limit of the percentile class which contains the xth percentile value (x·n/100)
m = cumulative frequency up to the percentile class
f = frequency of the percentile class
c = class interval
n = total number of observations
Percentile for Raw Data or Ungrouped Data
Example 15
The following are the paddy yields (kg/plot) from 14 plots:
30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62 and 65 (after arranging in ascending order). The computation of the 25th percentile (Q1) and 75th percentile (Q3) is given below:
P25 = (25 × 15/100)th item = 3.75th item = 3rd item + 0.75 × (4th item − 3rd item)
    = 35 + 0.75 × (38 − 35)
    = 35 + 2.25 = 37.25 kg
P75 = (75 × 15/100)th item = 11.25th item = 11th item + 0.25 × (12th item − 11th item)
    = 55 + 0.25 × (58 − 55)
    = 55 + 0.75 = 55.75 kg
Example 16
The frequency distribution of weights of 190 sorghum ear-heads are given below. Compute 25th
percentile and 75th percentile.
Weight of ear-heads (in g)   No. of ear heads
40-60                                6
60-80                               28
80-100                              35
100-120                             55
120-140                             30
140-160                             15
160-180                             12
180-200                              9
Total                              190
Solution
Weight of ear-heads (in g)   No. of ear heads   Less than class   Cumulative frequency
40-60                                6               < 60                 6
60-80                               28               < 80                34
80-100                              35               <100                69
100-120                             55               <120               124
120-140                             30               <140               154
140-160                             15               <160               169
160-180                             12               <180               181
180-200                              9               <200               190
Total                              190
For P25, first find 25n/100 = 25 × 190/100 = 47.5, and for P75 find 75n/100 = 142.5, and proceed as in the case of the median.
The value 47.5 lies between 34 and 69. Therefore, the percentile class is 80-100. Hence,
P25 = 80 + ((47.5 − 34)/35) × 20 = 80 + 7.71 = 87.71 g.
The value 142.5 lies between 124 and 154, so the percentile class for P75 is 120-140. Hence,
P75 = 120 + ((142.5 − 124)/30) × 20 = 120 + 12.33 = 132.33 g.
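The same grouped-percentile computation can be wrapped in a small Python helper; the two calls reproduce the Example 16 figures (a sketch, not part of the original notes):

def grouped_percentile(x, l, n, m, f, c):
    """xth percentile of a grouped distribution: l + ((x*n/100 - m)/f) * c."""
    return l + ((x * n / 100 - m) / f) * c

# P25 and P75 for the sorghum ear-head data (n = 190)
print(round(grouped_percentile(25, 80, 190, 34, 35, 20), 2))    # 87.71
print(round(grouped_percentile(75, 120, 190, 124, 30, 20), 2))  # 132.33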
Quartiles
The quartiles divide the distribution into four parts. There are three quartiles. The second quartile divides the distribution into two halves and is therefore the same as the median. The first (lower) quartile (Q1) marks off the first one-fourth, and the third (upper) quartile (Q3) marks off three-fourths. It may be noted that the second quartile is the value of the median and the 50th percentile.
Example 18
Compute quartiles for the data given below (grains/panicles) 25, 18, 30, 8, 15, 5, 10, 35, 40, 45
Solution
5, 8, 10, 15, 18, 25, 30, 35, 40, 45
Q1 = ((n + 1)/4)th item = (11/4)th item = (2.75)th item
   = 2nd item + 0.75 × (3rd item − 2nd item)
   = 8 + 0.75 × (10 − 8)
   = 8 + 1.5
   = 9.5
Q3 = (3(n + 1)/4)th item = 3 × (2.75)th item = (8.25)th item
   = 8th item + 0.25 × (9th item − 8th item)
   = 35 + 0.25 × (40 − 35)
   = 35 + 1.25
   = 36.25
Discrete Series
Step1: Find cumulative frequencies.
Step 2: Find (n + 1)/4.
Step 3: See, in the cumulative frequencies, the value just greater than (n + 1)/4; the corresponding value of x is Q1.
Step 4: Find 3(n + 1)/4.
Step 5: See, in the cumulative frequencies, the value just greater than 3(n + 1)/4; the corresponding value of x is Q3.
Example 19
Compute quartiles for the data given bellow (insects/plant).
X 5 8 12 15 19 24 30
f 4 3 2 4 5 2 4
Solution
x    f    cf
5    4     4
8    3     7
12   2     9
15   4    13
19   5    18
24   2    20
30   4    24
Here n = 24. (n + 1)/4 = 6.25; the cumulative frequency just greater than 6.25 is 7, so Q1 = 8 insects.
3(n + 1)/4 = 18.75; the cumulative frequency just greater than 18.75 is 20, so Q3 = 24 insects.
Continuous series
Step1: Find cumulative frequencies
Step 2: Find n/4. The class whose cumulative frequency is just greater than n/4 is the first quartile class, and
Q1 = l + ((n/4 − m)/f) × c.
Step 3: Find 3n/4. The class whose cumulative frequency is just greater than 3n/4 is the third quartile class, and
Q3 = l + ((3n/4 − m)/f) × c,
with l, m, f and c defined as for the median.
MEASURES OF VARIATION:
RANGE
The difference between the lowest and highest values.
In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9, so the range is 9 − 3 = 6.
Range can also mean all the output values of a function.
Quartile Deviation:
In a distribution, half the difference between the upper quartile and the lower quartile is known as the quartile deviation: Q.D. = (Q3 − Q1)/2. Quartile deviation is therefore often called the semi-interquartile range.
Mean Deviation:
Three steps:
1. Find the mean of all values.
2. Find the distance of each value from that mean (subtract the mean from each value, ignoring minus signs).
3. Then find the mean of those distances.
Example: the Mean Deviation of 3, 6, 6, 7, 8, 11, 15, 16
Step 1: Find the mean:
Mean = (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16)/8 = 72/8 = 9
Step 2: Find the distance of each value from that mean:
Value Distance from 9
3 6
6 3
6 3
7 2
8 1
11 2
15 6
16 7
Step 3: Find the mean of those distances:
Mean Deviation = (6 + 3 + 3 + 2 + 1 + 2 + 6 + 7)/8 = 30/8 = 3.75
In that example the values are, on average, 3.75 away from the middle.
For "deviation" just think "distance".
Formula
The formula is:
Mean Deviation = Σ|x - μ| / N
Let's learn more about those symbols!
Firstly:
μ is the mean (in our example μ = 9)
x is each value (such as 3 or 16)
N is the number of values (in our example N = 8)
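The three steps above map one-to-one onto a short Python sketch (the values are those of the worked example; not part of the original notes):

# Mean deviation: average absolute distance from the mean
values = [3, 6, 6, 7, 8, 11, 15, 16]
mu = sum(values) / len(values)                   # step 1: mean = 9
distances = [abs(v - mu) for v in values]        # step 2: |x - mu|
mean_deviation = sum(distances) / len(values)    # step 3: average distance
print(mu, mean_deviation)                        # 9.0 3.75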
Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the greek letter sigma)
The formula is easy: it is the square root of the variance. So now you ask, "What is the variance?"
Variance:
The variance is defined as the average of the squared differences from the mean. For a frequency distribution, Variance = Σf(x − x̄)²/n.
Direct method:
Class   f     Mid x   fx     (x − x̄)²   f(x − x̄)²
1−3     40      2      80        4          160
3−5     30      4     120        0            0
5−7     20      6     120        4           80
7−9     10      8      80       16          160
Total  100            400                   400
Mean x̄ = 400/100 = 4; Variance = 400/100 = 4; Standard Deviation σ = √4 = 2.
Shortcut method (assumed mean A = 2, d = x − A):
Class   f     Mid x   d    fd    fd²
1−3     40      2     0     0      0
3−5     30      4     2    60    120
5−7     20      6     4    80    320
7−9     10      8     6    60    360
Total  100                200    800
Variance = Σfd²/n − (Σfd/n)² = 800/100 − (200/100)² = 8 − 4 = 4; σ = 2.
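Both computations can be verified with a few lines of Python; this sketch uses the mid-points and frequencies from the tables above (illustrative only):

import math

# Variance of a grouped distribution, direct method:
# var = sum(f * (x - mean)**2) / n
mids  = [2, 4, 6, 8]
freqs = [40, 30, 20, 10]
n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, mids)) / n           # 400/100 = 4
var = sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids)) / n
print(mean, var, math.sqrt(var))                             # 4.0 4.0 2.0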
SKEWNESS
Measure of Skewness:
The difference between the mean and the mode gives an absolute measure of skewness. If we divide this difference by the standard deviation we obtain a relative measure of skewness, known as the coefficient of skewness and denoted by SK.
SK = (Mean − Mode)/S.D.
Sometimes the mode is difficult to find, so we use another formula:
SK = 3(Mean − Median)/S.D.
Bowley's coefficient of skewness:
SK = (Q1 + Q3 − 2 Median)/(Q3 − Q1)
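A short Python sketch of the three skewness coefficients; the summary statistics below are made-up illustrative values, not figures from the notes:

# Coefficients of skewness from summary statistics (illustrative values)
mean, median, mode_, sd = 50.0, 48.0, 45.0, 10.0
q1, q3 = 42.0, 60.0

sk_pearson1 = (mean - mode_) / sd             # (mean - mode) / s.d.
sk_pearson2 = 3 * (mean - median) / sd        # 3(mean - median) / s.d.
sk_bowley   = (q1 + q3 - 2 * median) / (q3 - q1)
print(sk_pearson1, sk_pearson2, sk_bowley)    # 0.5 0.6 0.333...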
Introduction to Correlation and Regression Analysis
In this section we will first discuss correlation analysis, which is used to quantify the association
between two continuous variables (e.g., between an independent and a dependent variable or
between two independent variables). Regression analysis is a related technique to assess the
relationship between an outcome variable and one or more risk factors or confounding variables.
The outcome variable is also called the response or dependent variable and the risk factors and
confounders are called the predictors, or explanatory or independent variables. In regression
analysis, the dependent variable is denoted "y" and the independent variables are denoted by "x".
Correlation Analysis
In correlation analysis, we estimate a sample correlation coefficient, more specifically the
Pearson Product Moment correlation coefficient. The sample correlation coefficient, denoted r,
ranges between -1 and +1 and quantifies the direction and strength of the linear association
between the two variables. The correlation between two variables can be positive (i.e., higher
levels of one variable are associated with higher levels of the other) or negative (i.e., higher
levels of one variable are associated with lower levels of the other).
The sign of the correlation coefficient indicates the direction of the association. The magnitude
of the correlation coefficient indicates the strength of the association.
For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.
It is important to note that there may be a non-linear association between two continuous
variables, but computation of a correlation coefficient does not detect this. Therefore, it is always
important to evaluate the data carefully before computing a correlation coefficient. Graphical
displays are particularly useful to explore associations between variables.
The figure below shows four hypothetical scenarios in which one continuous variable is plotted
along the X-axis and the other along the Y-axis.
Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see
for the correlation between infant birth weight and birth length.
Scenario 2 depicts a weaker association (r = 0.2) that we might expect to see between age and body mass index (which tends to increase with age).
Scenario 3 might depict the lack of association (r approximately 0) between the extent of
media exposure in adolescence and age at which adolescents initiate sexual activity.
Scenario 4 might depict the strong negative association (r= -0.9) generally observed
between the number of hours of aerobic exercise per week and percent body fat.
The variances of x and y measure the variability of the x scores and y scores around their respective sample means (considered separately). The covariance measures the variability of the (x, y) pairs around the mean of x and the mean of y, considered simultaneously.
To compute the sample correlation coefficient, we need to compute the variance of gestational
age, the variance of birth weight and also the covariance of gestational age and birth weight.
We first summarize the gestational age data. The mean gestational age is:
To compute the variance of gestational age, we need to sum the squared deviations (or
differences) between each observed gestational age and the mean gestational age. The
computations are summarized below.
Next, we summarize the birth weight data. The mean birth weight is:
The variance of birth weight is computed just as we did for gestational age as shown in the table
below.
The variance of birth weight is:
To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, and then sum these products.
The computations are summarized below. Notice that we simply copy the deviations from the
mean gestational age and birth weight from the two tables above into the table below and
multiply.
The covariance of gestational age and birth weight is:
The sample correlation coefficient is then the covariance divided by the product of the two standard deviations:
r = Cov(x, y) / (sx × sy)
Spearman's rank correlation coefficient is an alternative based on ranks.
Formula:
R = 1 - ((6 ∑d²) / (n³ - n)), where d is the difference between the ranks and n is the number of observations
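The two correlation formulas can be sketched in Python as follows; the x and y vectors are made-up illustrative data (the gestational-age raw values are not reproduced in these notes), and the rank helper assumes no ties:

# Pearson r = cov(x, y) / (sd_x * sd_y); Spearman R = 1 - 6*sum(d^2)/(n^3 - n)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5
print(cov / (sx * sy))                        # Pearson r, close to 1 for this data

rank = lambda v: [sorted(v).index(e) + 1 for e in v]   # simple ranks (no ties)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank(x), rank(y)))
print(1 - 6 * d2 / (n ** 3 - n))              # Spearman R = 1.0 for this data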
Partial correlation analysis involves studying the linear relationship between two
variables after excluding the effect of one or more independent factors.
Simple correlation does not prove to be an all-encompassing technique especially under the
above circumstances. In order to get a correct picture of the relationship between two
variables, we should first eliminate the influence of other variables.
For example, study of partial correlation between price and demand would involve studying the
relationship between price and demand excluding the effect of money supply, exports, etc.
UNIT-V
REGRESSION ANALYSIS
Regression Analysis
Introduction
A study measuring the relationship between associated variables, wherein one variable is dependent on another, independent, variable, is called regression.
Regression analysis is a statistical tool to study the nature and extent of functional relationship
between two or more variables and to estimate (or predict) the unknown values of dependent
variable from the known values of independent variable. The variable that forms the basis for
predicting another variable is known as the Independent Variable and the variable that is
predicted is known as dependent variable. For example, if we know that two variables price (X)
and demand (Y) are closely related we can find out the most probable value of X for a given
value of Y or the most probable value of Y for a given value of X. Similarly, if we know that the
amount of tax and the rise in the price of a commodity are closely related, we can find out the
expected price for a certain amount of tax levy.
Regression Lines and Regression Equation: Regression lines and regression equations are
used synonymously. Regression equations are algebraic expression of the regression lines. Let
us consider two variables: X & Y. If y depends on x, then the result comes in the form of simple
regression. If we take the case of two variable X and Y, we shall have two regression lines as
the regression line of X on Y and regression line of Y on X. The regression line of Y on X gives
the most probable value of Y for given value of X and the regression line of X on Y given the
most probable value of X for given value of Y. Thus, we have two regression lines. However,
when there is either perfect positive or perfect negative correlation between the two variables, the two regression lines will coincide, i.e. we will have one line. If the variables are independent, r is zero and the lines of regression are at right angles, i.e. parallel to the X-axis and Y-axis.
Therefore, with the help of simple linear regression model we have the following two regression
lines
1. Regression line of Y on X: This line gives the probable value of Y (dependent variable) for any given value of X (independent variable). Regression line of Y on X: Y − Ȳ = byx (X − X̄)
OR: Y = a + bX
2. Regression line of X on Y: This line gives the probable value of X (dependent variable) for any given value of Y (independent variable). Regression line of X on Y: X − X̄ = bxy (Y − Ȳ)
OR: X = a + bY
In the above two regression lines or regression equations, there are two regression parameters,
which are “a” and “b”. Here “a” is unknown constant and “b” which is also denoted as “byx” or
“bxy”, is also another unknown constant popularly called as regression coefficient. Hence, these
“a” and “b” are two unknown constants (fixed numerical values) which determine the position
of the line completely. If the value of either or both of them is changed, another line is
determined. The parameter “a” determines the level of the fitted line (i.e. the distance of the line
directly above or below the origin). The parameter “b” determines the slope of the line (i.e. the
change in Y for unit change in X).
This above method is popularly known as direct method, which becomes quite cumbersome
when the values of X and Y are large. This work can be simplified if instead of dealing with
actual values of X and Y, we take the deviations of X and Y series from their respective means.
In that case: the regression equation of Y on X, Y = a + bX, changes to (Y − Ȳ) = byx (X − X̄), and the regression equation of X on Y, X = a + bY, changes to (X − X̄) = bxy (Y − Ȳ). In this new form of the regression equation we need to compute only one parameter, "b". This "b", denoted either "byx" or "bxy", is called the regression coefficient.
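In the deviation form, the least-squares coefficient is byx = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − byx·X̄. A minimal Python sketch of fitting the regression line of Y on X (the data are invented for illustration):

# Least-squares regression line of Y on X:  Y = a + b*X
# b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2),  a = Ybar - b*Xbar
X = [10, 12, 15, 23, 20]
Y = [14, 17, 23, 25, 21]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
    sum((x - xbar) ** 2 for x in X)
a = ybar - b * xbar
print(f"Y = {a:.2f} + {b:.2f} X")    # fitted regression line of Y on X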
BUSINESS FORECASTING:
Business forecasting is an essential tool for businesses to anticipate future trends and conditions that may impact their
operations. It involves using statistical methods and other analytical techniques to make predictions about future sales,
revenues, costs, and profits.
The purpose of business forecasting is to provide decision-makers with information that can help them plan and make
informed decisions. By anticipating future trends and conditions, businesses can adjust their strategies and operations to
take advantage of opportunities and mitigate risks.
The methods of forecasting can be classified into two broad categories: (i) quantitative or objective and (ii) qualitative or
subjective.
Qualitative methods consist of collecting the opinions and judgments of individuals who are expected to have the best knowledge of current activities or future plans of the organization. An important advantage of qualitative methods is that they are easily understood. Another advantage is that they can incorporate subjective experience as input along with objective data. Moreover, the cost involved in forecasting is quite low.
Qualitative or subjective:
Personal opinion: Here an individual forecasts the future on the basis of his or her personal judgment, without using a formal quantitative model; such a forecast can be relatively reliable and accurate. This approach is usually recommended when conditions in the past are not likely to hold in the future.
Panel Consensus: To reduce the prejudices and ignorance that may arise in the individual judgment, it is possible to
develop consensus among group of individuals. Such a panel of individuals is encouraged to share information, opinions,
and assumptions (if any) to predict future value of some variable under study.
Delphi Method: This method is very similar to the Panel Consensus method. In the Delphi method, a group of experts
who may be stationed at different locations and who do not interact with each other is constituted. After this, a
questionnaire is sent to each expert to seek his opinion about the matter under investigation. A summary is prepared on the
basis of the returned questionnaire. On the basis of this summary, a few more questions are to be included in the
questionnaire and this modified questionnaire is again sent back to each expert. This process, which generally keeps them
informed of each others’ forecasts, is repeated until the desirable consensus is reached.
Market Research: The marketing research method is introduced in order to collect data and accordingly a well-designed
questionnaire is prepared and distributed among respondents. On the basis of the response obtained, a summary is prepared
and the survey result is developed.
Historical Comparison: Once the data are arranged chronologically, the time series approach facilitates comparison
between one time period and the next. It provides a scientific basis to make comparisons by studying and isolating the
effects of various influencing factors on the patterns of variable values. It also enables in making regional comparison
amongst data collected on the basis of time.
QUANTITATIVE FORECASTING METHODS
The quantitative methods of forecasting are further classified into two broad categories, namely: (1) Causal or Explanatory
Methods and (2) Time Series forecasting
The time series methods are concerned with taking some observed historical pattern for some variable and projecting this
pattern into the future using a mathematical formula. These methods fail to describe the cause effect relationship between
variables. This limitation of the time series approach can be overcome by using the causal method. Quantitative methods
of time series forecasting include: free hand method, smoothing method, Exponential smoothing method, trend projection
method, autoregressive model, Box-Jenkins method
Free hand method: The free hand method is the simplest method of determining trend. It is obtained by plotting the values yt against time t. Smoothing the time series with a freehand curve eliminates the seasonal and irregular components, and the forecast can be obtained simply by extending the trend line.
Smoothing method: The main objective of the smoothing method is to “smooth out” the random variations due to the
irregular fluctuation in the time series data. Various methods are available to smooth out the random variations due to
irregular fluctuations, so that the resulting series may have a better overall impression of the pattern of movement in the
data over a specified period. We have already discussed different methods of smoothing namely moving averages method,
weighted moving averages and semi-averages method.
Trend Projection: A trend is the long-run general direction (upward, downward or constant) of a business climate over a
period of several years. It is best represented by a straight line
Exponential smoothing method: Exponential smoothing is another technique used to "smooth" a time series of its sharp variations. It is a type of moving-average forecasting technique which consists of a series of exponentially weighted moving averages. The exponential smoothing method weighs data from previous time periods with exponentially decreasing importance in the forecast. This method has a relative advantage over the methods of moving averages. First, it focuses upon the most recent data. Second, during forecasting, it takes into account all the observed values, because each smoothed value is based upon the values observed previously. In this way, the most recently observed value receives the highest weight, the previously observed value the second highest weight, and so on. This method is used for forecasting when there is no apparent trend or seasonal variation in the given values of a variable, and it is useful mostly for short-term forecasting, such as forecasting of sales, inventory, price, etc.
In the exponential smoothing method, the forecast is made up of the actual value for the present time period, Xt, multiplied by a value called the exponential smoothing constant α (between 0 and 1). The resulting value αXt is added to the product of the present time period's forecast Ft and (1 − α). Mathematically, the formula can be represented by the following functional relationship:
Ft+1 = αXt + (1 − α)Ft
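The recursion Ft+1 = αXt + (1 − α)Ft is easy to sketch in Python; the demand series and α below are made up for illustration:

# Simple exponential smoothing: F[t+1] = alpha*X[t] + (1 - alpha)*F[t]
def exp_smooth(series, alpha, f0=None):
    forecast = f0 if f0 is not None else series[0]   # seed with first observation
    forecasts = [forecast]
    for x in series:
        forecast = alpha * x + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts     # forecasts[t] is the forecast for period t

demand = [100, 104, 101, 99, 107, 110]
print(exp_smooth(demand, alpha=0.3))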
Box-Jenkins method: A more sophisticated approach to forecasting is the Box-Jenkins method. Here, first of all, the analyst identifies a tentative model based on the nature of the past data. This tentative model, along with the data, is entered into the computer. The Box-Jenkins programme then gives the values of the parameters included in the model. A diagnostic check is then performed to determine whether the model produces an adequate description of the data. If the model satisfies the analyst in this respect, it is used to forecast. If the model is not satisfactory, the computer outputs diagnostic information, which the analyst uses to revise the model. This process is continued until the analyst obtains an appropriate model, which is then used for making forecasts. A limitation of the method is that it requires at least 45 observations in the time series.
The components of a time series are the building blocks that make up the data and can help to identify patterns and trends.
The four main components of a time series are:
1. Trend: The trend component represents the long-term direction or movement of the data over time. It is the
overall pattern of increase or decrease in the data over an extended period. A trend can be either upward,
downward or remain constant over time.
2. Seasonality: The seasonality component represents the regular pattern of fluctuations in the data that repeats at
fixed intervals. It is a variation that occurs regularly over time and may be caused by factors such as holidays,
weather, or other cyclical events. Seasonality may be daily, weekly, monthly, or yearly.
3. Cyclical: The cyclical component represents the periodic but irregular variations in the data that occur over an
extended period, often beyond one year. This component is characterized by fluctuations that are not regular, and
may be influenced by the economic conditions or other external factors.
4. Random or residual: The random or residual component represents the unpredictable fluctuations in the data that
cannot be explained by the other three components. It reflects the randomness or unpredictability in the data that
may arise due to factors such as measurement errors or unexpected events.
INDEX NUMBERS:
Index numbers are meant to study changes in the effects of factors which cannot be measured directly. According to Bowley, "Index numbers are used to measure the changes in some quantity which we cannot observe directly." For example, changes in business activity in a country are not capable of direct measurement, but it is possible to study relative changes in business activity by studying the variations in the values of factors which affect business activity and which are capable of direct measurement. Index numbers are a commonly used statistical device for measuring the combined fluctuations in a group of related variables.
Index numbers may be classified in terms of the variables that they are intended to measure. In business, different groups
of variables in the measurement of which index number techniques are commonly used are (i) price, (ii) quantity, (iii)
value and (iv) business activity. Thus, we have index of wholesale prices, index of consumer prices, index of industrial
output, index of value of exports and index of business activity, etc.
The present period is called the current period and some period in the past is called the base period.
Consumer Price Index (CPI): It describes the changes in prices from one period to another for a market basket of goods and services.
CPI Uses:
It allows consumers to determine the effect of price increase on their purchasing power.
It is a yardstick for revising wages, pensions etc.
It is an economic indicator of the rate of inflation.
It is used to compute Real Income = (Money Income/CPI) × 100.
Concepts of Probability:
Probability is a measure of the likelihood or chance of an event occurring. It is a numerical value
that ranges from 0 to 1, where 0 represents impossibility, and 1 represents certainty. The
concept of probability is used extensively in many fields, including mathematics, science,
engineering, finance, and social sciences.
There are two types of probability: theoretical probability and experimental probability.
Theoretical probability is calculated based on the assumption that all outcomes are equally
likely, while experimental probability is determined by conducting experiments and collecting
data.
Addition Law:
The addition law of probability is used to find the probability of the union of two or more events.
It states that the probability of the union of two events A and B is equal to the sum of the
probability of A and the probability of B, minus the probability of their intersection.
Mathematically, it can be represented as:
P(A or B) = P(A) + P(B) − P(A and B)
Where P(A) and P(B) are the probabilities of events A and B, and P(A and B) is the probability of both occurring together.
For example, let's say we roll a fair six-sided die twice and want the probability of rolling a 3 on the first throw or a 4 on the second throw. The probability of rolling a 3 on the first throw is 1/6, and the probability of rolling a 4 on the second throw is also 1/6. However, these events are not mutually exclusive, since it is possible to roll a 3 on the first throw and a 4 on the second. The probability of both happening together is 1/36 (one of the 36 equally likely outcomes of the two throws). Using the addition law of probability, the probability of rolling a 3 on the first throw or a 4 on the second is:
P(3 or 4) = 1/6 + 1/6 − 1/36 = 11/36
Multiplication Law:
The Multiplication Law of Probability is a fundamental concept in probability theory that states
that the probability of the joint occurrence of two or more independent events is the product of
their individual probabilities.
If we have two events A and B, then the probability of both events occurring is given by:
P(A and B) = P(A) × P(B|A)
where P(A) is the probability of event A occurring and P(B|A) is the conditional probability of event B occurring given that event A has occurred.
If we have three events A, B, and C, then the probability of all three events occurring is given by:
P(A and B and C) = P(A) × P(B|A) × P(C|A and B)
where P(A) is the probability of event A occurring, P(B|A) is the conditional probability of event B occurring given that event A has occurred, and P(C|A and B) is the conditional probability of event C occurring given that events A and B have occurred.
In general, for any number of independent events, the probability of all events occurring is the product of their individual probabilities:
P(A1 and A2 and … and An) = P(A1) × P(A2) × … × P(An)
For example, suppose a bag contains 4 red balls out of 7 balls in all, and two balls are drawn one after the other without replacement. Using the multiplication law of probability, we can find the probability of drawing a red ball on the first attempt, and then the probability of drawing another red ball on the second attempt, given that the first ball was not replaced.
The probability of drawing a red ball on the first attempt is 4/7, since there are 4 red balls out of a total of 7 balls in the bag.
After drawing a red ball on the first attempt, there are now only 3 red balls and 6 total balls left in
the bag. Therefore, the probability of drawing another red ball on the second attempt is 3/6, or
1/2.
To find the probability of both events happening (i.e., drawing two red balls in a row), we multiply the probabilities of each individual event:
P(both red) = (4/7) × (3/6) = 12/42 = 2/7
So the probability of drawing two red balls in a row is 2/7, or approximately 0.29.
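The draw-without-replacement computation can be checked exactly with Python's fractions module (a sketch of the multiplication law, using the 4-red-out-of-7 example above):

from fractions import Fraction

# P(red1 and red2) = P(red1) * P(red2 | red1), drawing without replacement
p_first  = Fraction(4, 7)    # 4 red balls out of 7
p_second = Fraction(3, 6)    # 3 red balls left out of 6
p_both = p_first * p_second
print(p_both, float(p_both))     # 2/7, about 0.2857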
Conditional Probability:
Conditional probability refers to the probability of an event occurring given that another event
has already occurred. In other words, it is the probability of event B occurring, given that event A
has occurred. It is denoted by P(B|A), where P denotes the probability and the vertical bar (|) represents "given that". The formula is:
P(B|A) = P(A and B) / P(A)
Where P(A and B) is the probability of both A and B occurring together, and P(A) is the probability of event A occurring.
For example, let's say we have two events: A and B. The probability of event A occurring is 0.6, and the probability of event B occurring is 0.4. We also know that the probability of A and B occurring together is 0.2. Then
P(B|A) = 0.2/0.6 = 1/3
This means that the probability of event B occurring given that event A has occurred is 1/3, or about 0.33.
Bayes Theorem:
Bayes' theorem is a mathematical formula that is used to calculate the conditional probability of
an event based on prior knowledge of related conditions. It is named after the English
mathematician Thomas Bayes, who developed the formula in the 18th century. The formula is expressed as:
P(A|B) = P(B|A) × P(A) / P(B)
Where P(A|B) is the probability of A given B, P(B|A) is the probability of B given A, and P(A) and P(B) are the probabilities of A and B on their own.
Suppose there is a medical test to detect a certain disease, and the test has a 95% accuracy
rate. This means that the test will correctly identify 95% of people who have the disease and 5%
of people who do not have the disease will be falsely identified as having it.
Suppose the prevalence of the disease in the population is 1%. If a person tests positive for the
disease, what is the probability that they actually have the disease?
Using the law of total probability, we can calculate the prior probability of testing positive:
P(positive) = P(positive|disease) × P(disease) + P(positive|no disease) × P(no disease)
            = 0.95 × 0.01 + 0.05 × 0.99 = 0.0095 + 0.0495 = 0.059
Now, using Bayes' theorem, we can calculate the conditional probability of having the disease given that the person tests positive:
P(disease|positive) = P(positive|disease) × P(disease) / P(positive) = 0.0095/0.059 ≈ 0.161
Therefore, the probability that a person who tests positive actually has the disease is about 16.1%.
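A small Python sketch of the same screening-test calculation (sensitivity 0.95, false-positive rate 0.05, prevalence 0.01, all taken from the example above):

# Bayes' theorem: P(disease | positive) =
#   P(positive | disease) * P(disease) / P(positive)
sens, fpr, prev = 0.95, 0.05, 0.01
p_positive = sens * prev + fpr * (1 - prev)       # law of total probability
posterior = sens * prev / p_positive
print(round(p_positive, 3), round(posterior, 3))  # 0.059 0.161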
Proof (Additional):
Bayes' theorem can be mathematically proven using conditional probability:
P(A|B) = P(A and B) / P(B)
where P(A and B) is the probability of both A and B occurring, P(A|B) is the probability of A given that B has occurred, and P(B) is the probability of B occurring. Applying the definition of conditional probability once more, P(A and B) = P(B|A) × P(A); substituting this into the numerator gives Bayes' theorem.
Binomial Distribution:
The probability of getting exactly k successes in n independent trials, each with success probability p, can be calculated using the following formula:
P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)
where C(n, k) = n!/(k!(n − k)!) is the number of ways of choosing k successes from n trials. For example, for 3 heads in 10 tosses of a fair coin, P(X = 3) = C(10, 3) × (0.5)^10 = 120/1024 ≈ 0.1172.
Therefore, the probability of getting exactly 3 heads in 10 coin tosses is 0.1172, or about 11.72%.
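The binomial probability can be evaluated directly with the standard library; the sketch below reproduces the 3-heads-in-10-tosses figure (requires Python 3.8+ for math.comb):

from math import comb

# Binomial: P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)
def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(round(binom_pmf(3, 10, 0.5), 4))   # 0.1172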
Poisson Distribution:
Poisson Distribution is a statistical distribution that is used to model the probability of a certain
number of events occurring in a fixed period of time or space. It is used when the number of
occurrences is rare, and the occurrences are independent of each other. The probability of k events is:
P(X = k) = (λ^k × e^(−λ)) / k!
Where:
P(X = k) is the probability of k events occurring
e is the mathematical constant e (approximately equal to 2.71828)
λ is the expected number of events
k is the actual number of events that occur
k! is the factorial of k (i.e., k x (k-1) x (k-2) x ... x 2 x 1)
The Poisson distribution is often used in manufacturing, quality control, and insurance to predict
the likelihood of a certain number of defects or accidents occurring in a given period of time.
For example, suppose a manufacturing plant produces an average of 3 defective products per day. The probability of producing exactly 2 defective products in a given day can be calculated using the Poisson distribution formula as follows:
P(X = 2) = (3² × e^(−3)) / 2! = (9 × 0.0498)/2 = 0.2241
Therefore, the probability of producing exactly 2 defective products in a given day is 0.2241, or approximately 22.4%.
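Likewise the Poisson probability (λ = 3 defectives per day, k = 2, as in the example) in Python:

from math import exp, factorial

# Poisson: P(X = k) = lambda**k * e**(-lambda) / k!
def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

print(round(poisson_pmf(2, 3), 4))   # about 0.2240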
Normal Distribution:
The Normal Distribution is defined by the probability density function for a continuous random
variable in a system. Let us say, f(x) is the probability density function and X is the random
variable. Hence, it defines a function which, integrated over the range or interval (x to x + dx), gives the probability of the random variable X taking values between x and x + dx:
f(x) = (1/(σ√(2π))) × e^(−(x − μ)²/(2σ²)), with f(x) ≥ 0 ∀ x ϵ (−∞, +∞)
where μ is the mean and σ is the standard deviation of the distribution.
The graph of normal distribution is symmetric about the mean, which is denoted by the peak of
the curve. The standard deviation measures the spread of the data around the mean. The larger
the standard deviation, the more spread out the data.
The normal distribution can be standardized to a standard normal distribution, which has a
mean of 0 and a standard deviation of 1. This standardization process is done using the
following formula:
z = (x - μ) / σ
where z is the standardized value, x is the raw value, μ is the mean, and σ is the standard deviation.
The standard normal distribution is useful because it can be used to calculate probabilities for
any normal distribution by converting the data to a standard form.
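The standardization, and a normal probability lookup, can be sketched with the standard library only (no SciPy assumed; the μ, σ and x values are illustrative):

from math import erf, sqrt

# Standard-normal CDF via the error function: Phi(z) = (1 + erf(z/sqrt(2))) / 2
def phi(z):
    return (1 + erf(z / sqrt(2))) / 2

mu, sigma = 100, 15
x = 120
z = (x - mu) / sigma                   # standardize: z = (x - mu) / sigma
print(round(z, 3), round(phi(z), 4))   # 1.333, P(X <= 120) ≈ 0.9088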
There are two main types of sampling: probability sampling and non-probability sampling.
Probability sampling: Probability sampling refers to sampling methods where each member of the population has a known, non-zero chance of being selected for the sample. This type of sampling ensures that the sample is representative of the population and that the estimates based on the sample are unbiased. Probability sampling can be further divided into the following types:
Simple random sampling: In simple random sampling, each member of the population is
selected at random, and every member of the population has an equal chance of being
selected.
Stratified random sampling: Stratified random sampling involves dividing the population into
subgroups or strata based on some characteristic (such as age or gender) and then selecting a
random sample from each stratum.
Cluster sampling: In cluster sampling, the population is divided into clusters or groups, and
then a random sample of clusters is selected. All members of the selected clusters are then
included in the sample.
Systematic sampling: In systematic sampling, the population is ordered in some way, and then
a random starting point is selected. Members of the population are then selected at regular
intervals until the desired sample size is reached.
Non-probability sampling: Non-probability sampling refers to the sampling method where the
probability of any member of the population being selected for the sample is not known. This
type of sampling may introduce bias into the sample, and the estimates based on the sample
may be less accurate. Non-probability sampling can be further divided into the following types:
Convenience sampling: Convenience sampling involves selecting the individuals or objects that
are most readily available. This type of sampling is easy to implement but may not be
representative of the population.
Quota sampling: Quota sampling involves selecting a sample that matches some
predetermined characteristics of the population (such as age or gender), but the selection of
individuals within those characteristics is not random.
Judgmental sampling: Judgmental sampling involves selecting the sample based on the
judgment of the researcher or some other expert.
Point Estimation:
Point estimation is a method used to estimate an unknown parameter of a population based on
a single value called a point estimate. The point estimate is calculated using a sample statistic,
such as the sample mean or sample proportion, and is used to estimate the true value of the
population parameter.
For example, suppose we want to estimate the population mean weight of apples in a particular
orchard. We randomly select 50 apples from the orchard and calculate the sample mean weight
to be 150 grams. The point estimate for the population mean weight of apples would be 150
grams.
Interval Estimation:
Interval estimation is a method used to estimate an unknown population parameter by
specifying an interval or range of values likely to contain the true value of the population
parameter. The interval is called a confidence interval, and the level of confidence is typically set
to 95% or 99%.
The margin of error depends on the level of confidence, sample size, and variability in the
sample data.
For example, suppose we want to estimate the population proportion of voters in a city who will
vote for a particular candidate in an upcoming election. We randomly select a sample of 500
voters and find that 250 of them plan to vote for the candidate. Using the sample proportion as
the point estimate, we can calculate a 95% confidence interval as follows:
p̂ = 250/500 = 0.50
CI = p̂ ± 1.96 × √(p̂(1 − p̂)/n) = 0.50 ± 1.96 × √(0.50 × 0.50/500) = 0.50 ± 0.044
Therefore, we can say with 95% confidence that the true proportion of voters in the city who will vote for the candidate is likely to be between 0.456 and 0.544.
The formula for calculating the confidence limits for the population mean is:
Confidence limits = x̄ ± z × (σ/√n)
z-score is the critical value of the standard normal distribution based on the desired level of
confidence. For example, if the desired confidence level is 95%, the z-score will be 1.96.
standard error of the mean is the standard deviation of the sampling distribution of the mean,
which is calculated as the population standard deviation divided by the square root of the
sample size.
For example, if a sample of size 100 has a mean of 50 and a standard deviation of 10, and we want to estimate the population mean with 95% confidence, we can calculate the confidence limits as follows:
Confidence limits = 50 ± 1.96 × (10/√100) = 50 ± 1.96
Therefore, we can say with 95% confidence that the true population mean lies between 48.04 and 51.96.
The formula for calculating confidence limits for the population proportion is as follows:
p̂ ± zα/2 × √(p̂(1 − p̂)/n)
where:
p̂ = sample proportion
n = sample size
zα/2 = z-value corresponding to the desired level of confidence (e.g., 1.96 for 95% confidence)
√ = square root
For example, suppose a random sample of 500 people was taken from a population, and 280 of
them had a particular characteristic of interest. The sample proportion is p̂ = 280/500 = 0.56. To
calculate the 95% confidence interval, we can use the formula:
Lower limit = 0.56 - 1.96 × √(0.56 × 0.44/500) = 0.56 - 0.044 = 0.516
Upper limit = 0.56 + 1.96 × √(0.56 × 0.44/500) = 0.56 + 0.044 = 0.604
Thus, we can say with 95% confidence that the true proportion of the population with the characteristic lies between 0.516 and 0.604.
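Both confidence-interval formulas can be wrapped in small Python helpers; the two calls reproduce the examples above (mean 50, sd 10, n = 100; and p̂ = 0.56, n = 500):

from math import sqrt

def ci_mean(xbar, sigma, n, z=1.96):
    """Confidence limits for a population mean: xbar +/- z*sigma/sqrt(n)."""
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

def ci_proportion(phat, n, z=1.96):
    """Confidence limits for a population proportion."""
    margin = z * sqrt(phat * (1 - phat) / n)
    return phat - margin, phat + margin

print(ci_mean(50, 10, 100))       # (48.04, 51.96)
print(ci_proportion(0.56, 500))   # about (0.516, 0.604)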
Difference of Means:
The difference of means is calculated using the following formula:
x̄1 − x̄2, where x̄1 = Σx1/n1 and x̄2 = Σx2/n2
where,
x̄1 = mean of group 1
x̄2 = mean of group 2
Σx1 = sum of observations in group 1
Σx2 = sum of observations in group 2
n1 = number of observations in group 1
n2 = number of observations in group 2
For example, if we want to compare the average income of two groups, we can calculate the
difference of means by calculating the mean income of both groups and then subtracting one
from the other.
Difference of Proportions:
The difference of proportions is calculated using the following formula:
p̂1 − p̂2
where,
p̂1 = sample proportion in group 1 (an estimate of p1, the population proportion of group 1)
p̂2 = sample proportion in group 2 (an estimate of p2, the population proportion of group 2)
For example, if we want to compare the proportion of smokers in two different groups, we can
calculate the difference of proportions by calculating the proportion of smokers in each group
and then subtracting one from the other.
Central Limit:
Central Limit Theorem is a statistical concept that states that as the sample size of a data set
increases, the distribution of the sample means approaches a normal distribution, regardless of
the shape of the original population distribution. It is an important concept in inferential statistics
because it enables us to make statistical inferences about a population based on a sample.
z = (x̄ - μ) / (σ / √n)
where,
z = the standard normal distribution value
x̄ = the sample mean
μ = the population mean
σ = the population standard deviation
n = the sample size
The sampling distribution of the mean has
μx̄ = μ
σx̄ = σ/√n
where μx̄ is the mean of the sample means and σx̄ is the standard deviation of the sample means (the standard error).
The sampling distribution of the mean is an important concept in statistics because it helps us to
estimate the characteristics of a population from a sample. It states that if we take multiple
random samples of the same size from a population and calculate the mean of each sample,
the distribution of those means will be approximately normal, regardless of the shape of the
original population distribution.
The central limit theorem is the basis for the sampling distribution of the mean. It states that as
the sample size increases, the distribution of sample means will approach a normal distribution,
even if the original population distribution is not normal. The mean of the sampling distribution of
the mean will always be equal to the population mean, while the standard deviation of the
sampling distribution of the mean will decrease as the sample size increases.
Using the formula for the sampling distribution of the mean, we can calculate the probability of
obtaining a sample mean within a certain range. This is useful in hypothesis testing, where we
compare the sample mean to the population mean to determine whether there is a statistically
significant difference.
For the sampling distribution of the proportion, the standard error and z-score are:
σ_p = √(p(1-p)/n)
z = (p̂ - p) / √(p(1-p)/n)
Where:
μ_x̄ = mean of sample means
μ_p = mean of sample proportions
σ_x̄ = standard deviation of sample means
σ_p = standard deviation of sample proportions
n = sample size
x̄ = sample mean
p̂ = sample proportion
p = population proportion
μ = population mean
σ = population standard deviation
z = z-score, used for finding probabilities from the standard normal distribution.
For large sample sizes, the sampling distribution of proportion is approximately normal, and we
can use the normal distribution to make statistical inferences about the population proportion
based on the sample proportion.
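A quick simulation illustrates the central limit theorem: sample means from a very non-normal (uniform) population pile up around μ with spread σ/√n. The population, seed and sample sizes below are purely illustrative:

import random
import statistics

# Draw many samples from a non-normal population and look at the sample means
random.seed(42)
population = lambda: random.uniform(0, 1)    # mean 0.5, sd about 0.2887
n = 30                                       # sample size
means = [statistics.mean(population() for _ in range(n)) for _ in range(2000)]
print(round(statistics.mean(means), 3))      # close to mu = 0.5
print(round(statistics.stdev(means), 3))     # close to sigma/sqrt(n) ≈ 0.053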
Chi-Square Test:
χ² = ∑(O-E)²/E
Where,
χ² = chi-square value
O = observed frequency
E = expected frequency
The chi-square test is used to compare the observed data with the expected data. If the
observed data differs significantly from the expected data, then we reject the null hypothesis and
accept the alternative hypothesis.
The null hypothesis is that there is no relationship between gender and smoking habits, and the
alternative hypothesis is that there is a significant relationship between them.
Now, we need to calculate the expected frequency for each cell using the formula:
E = (row total × column total) / grand total
and then compute
χ² = ∑(O-E)²/E
Suppose the calculated value is χ² = 60. We can use a chi-square distribution table to find the critical value for a 95% confidence level with 1 degree of freedom, which is 3.84. Since our calculated chi-square value is greater than the critical value, we reject the null hypothesis. Therefore, we can conclude that there is a significant relationship between gender and smoking habits in the population.
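The χ² statistic itself is a one-line sum in Python; the observed and expected counts below are hypothetical stand-ins for the (omitted) gender-by-smoking table:

# Chi-square statistic: sum over cells of (O - E)^2 / E
observed = [60, 40, 30, 70]     # hypothetical 2x2 table, flattened row by row
expected = [45, 55, 45, 55]     # from (row total * column total) / grand total
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))           # 18.18 here; compare with 3.84 (df = 1, 5% level)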
Analysis of Variances:
Analysis of variance (ANOVA) is a statistical method used to analyze the differences among
group means and the variation within groups. ANOVA is a technique that compares the means
of two or more groups to determine if there is a significant difference between them.
The basic idea behind ANOVA is to partition the total variance into two components, one due to
the differences between groups (also called "explained variance") and the other due to
differences within groups (also called "unexplained variance"). The ratio of these two variances
is used to test the hypothesis that the group means are equal.
There are different types of ANOVA, including one-way ANOVA, two-way ANOVA, and repeated
measures ANOVA. The choice of ANOVA depends on the number of independent variables and
the design of the experiment.
$F = \frac{MS_{Between}}{MS_{Within}}$
where $F$ is the test statistic, $MS_{Between}$ is the mean square due to differences between
groups, and $MS_{Within}$ is the mean square due to differences within groups.
$df_{Between} = k - 1$
$df_{Within} = N - k$
where $k$ is the number of groups and $N$ is the total sample size.
As an example, suppose we want to test if there is a difference in the mean weight of apples
among three different orchards. We collect samples of 10 apples from each orchard and
measure their weights. The data is shown below:
Orchard 1: 50, 51, 52, 49, 53, 48, 50, 52, 51, 50
Orchard 2: 55, 54, 53, 52, 56, 57, 54, 55, 53, 54
Orchard 3: 58, 60, 57, 59, 56, 55, 60, 58, 59, 57
The first step is to calculate the group means and the overall mean:
$\bar{X}_{1} = 50.6$, $\bar{X}_{2} = 54.3$, $\bar{X}_{3} = 57.9$, overall mean $\bar{X} = 54.27$
$df_{Between} = 3 - 1 = 2$, $df_{Within} = 30 - 3 = 27$
$SS_{Between} = 10[(50.6 - 54.27)^2 + (54.3 - 54.27)^2 + (57.9 - 54.27)^2] = 266.5$
$SS_{Within} = (50 - 50.6)^2 + (51 - 50.6)^2 + \dots = 65.4$
$MS_{Between} = 266.5/2 = 133.2$ and $MS_{Within} = 65.4/27 = 2.42$, so
$F = 133.2/2.42 \approx 55$
Since this far exceeds the 5% critical value $F(2, 27) \approx 3.35$, we conclude that the mean apple weights differ among the three orchards.
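The whole one-way ANOVA can be reproduced in a few lines of Python; this sketch recomputes F for the three orchards from the raw data above:

# One-way ANOVA: F = MS_between / MS_within
groups = [
    [50, 51, 52, 49, 53, 48, 50, 52, 51, 50],   # orchard 1
    [55, 54, 53, 52, 56, 57, 54, 55, 53, 54],   # orchard 2
    [58, 60, 57, 59, 56, 55, 60, 58, 59, 57],   # orchard 3
]
k = len(groups)
N = sum(len(g) for g in groups)
grand = sum(sum(g) for g in groups) / N
means = [sum(g) / len(g) for g in groups]
ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
ss_within  = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(round(F, 1))    # about 55.0, far above the 5% critical value of ~3.35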
Quality control charts are used to monitor the quality of a process and identify any deviations
from a standard. There are different types of quality control charts, but the most commonly used
are the control chart for variables and the control chart for attributes.
Formula:
For the X̄ chart, the control limits are calculated as follows:
UCL = X̄ + A2 × R̄
LCL = X̄ − A2 × R̄
Where:
X̄ = the average of the quality characteristic (the grand mean of the subgroup means)
A2 = a constant from a statistical table based on the subgroup size
R̄ = the average range of the subgroups
Diagram:
The control chart for variables typically has three lines: the average line (X̄), the upper control
limit (UCL), and the lower control limit (LCL). The data points are plotted on the chart and the
lines are drawn based on the calculated values of X̄, UCL, and LCL.
The control charts of variables can be classified based on the statistics of subgroup summary
plotted on the chart.
X̄ chart
R chart
S chart
The X̄ chart describes the subgroup averages or means, the R chart displays the subgroup ranges, and the S chart shows the subgroup standard deviations. For a quality characteristic measured on a continuous scale, such an analysis makes both the process mean and its variability apparent, with the mean chart aligned over its corresponding S- or R-chart.
Formula:
The control limits for the control chart for attributes (p chart) are calculated using the following formulas, where σp = √(p(1 − p)/n) and z is usually taken as 3:
UCL = p + z × σp
LCL = p − z × σp
Diagram:
The control chart for attributes typically has two lines: the upper control limit (UCL) and the lower
control limit (LCL). The data points are plotted on the chart and the lines are drawn based on
the calculated values of p, UCL, LCL, z, and σp.
In both types of control charts, if any data points fall outside the control limits, it indicates that
the process is out of control and needs to be investigated and corrected.
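Finally, the X̄-chart limits are a direct transcription of the formula; the subgroup measurements below are invented for illustration, and A2 = 0.577 is the standard table constant for subgroups of size 5:

# X-bar chart control limits: UCL/LCL = Xbarbar +/- A2 * Rbar
subgroups = [
    [5.1, 5.0, 4.9, 5.2, 5.0],
    [5.0, 5.1, 5.0, 4.8, 5.1],
    [4.9, 5.0, 5.2, 5.1, 5.0],
]
xbars  = [sum(s) / len(s) for s in subgroups]    # subgroup means
ranges = [max(s) - min(s) for s in subgroups]    # subgroup ranges
xbarbar = sum(xbars) / len(xbars)                # grand mean (centre line)
rbar    = sum(ranges) / len(ranges)              # average range
A2 = 0.577                                       # table constant for n = 5
ucl, lcl = xbarbar + A2 * rbar, xbarbar - A2 * rbar
print(round(lcl, 3), round(xbarbar, 3), round(ucl, 3))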