MDU MBA 1st Semester Business Statistics and Analytics Notes 1
BUSINESS STATISTICS
AND ANALYTICS
SECTION-I
Introduction:
Definition
Role and Application
Measures of Central Tendencies and their
application
Measure of Dispersion
Range
Quartile Deviation
Standard Deviation
Coefficient of Variation and Mean Deviation
Skewness and Kurtosis
SECTION-II
Correlation:
Meaning and Types of Correlation
Positive Correlation
Negative Correlation
SECTION-III
Time series:
Introduction
Objective and identification of trends
Variation in time
Secular variation
Cyclical variation
Seasonal and irregular variation
SECTION-IV
Sampling:
Meaning and basic sampling concept
Sampling and non-sampling errors
Hypothesis testing
Formulation and procedure for testing a
hypothesis
Large and small sample tests
Z, t, F tests and ANOVA (one-way)
Non-parametric tests:
Chi Square test
Sign Test
Kruskal Wallis Test
Business Analytics:
Meaning
Types
Application of business analytics
SECTION-I
INTRODUCTION
Definition:
Business Statistics
Types
Let us understand the different types of this method in detail.
#1 – Descriptive Statistics
This method involves summarizing substantial data into different bits
of information in a meaningful and useful manner. It uses different
statistical tools, such as tables, charts, and graphs, to describe a
specific phenomenon or make generalizations.
This method looks into what happened and clarifies the reason
behind it. Managers can use historical information to check the
mistakes and achievements in the past. The use of descriptive
statistics is common in operations, finance, and marketing.
#2 – Inferential Statistics
Not every generalization made using descriptive statistics needs to be
true. Hence, individuals utilize this method to test whether the
generalizations are valid. It involves assessing the validity and
estimating facts and figures to make business decisions.
Example #1
Suppose a software company, ABC, looks at their customers’ mean
spending on the mobile-based application offered by them, the mode
of the products purchased, and the median spending for each
customer. Although, at first glance, these might appear to overlap, the three figures individually show a different aspect of the organization.
Example #2
Importance
One can understand the importance of this concept by going through
the following points.
Central Tendencies in Statistics are the numerical values that are used to represent the mid-value or central value of a large collection of numerical
data. These obtained numerical values are called central or average
values in Statistics. A central or average value of any statistical data or
series is the value of that variable that is representative of the entire
data or its associated frequency distribution. Such a value is of great
significance because it depicts the nature or characteristics of the
entire data, which is otherwise very difficult to observe.
The measures of central tendency are: Mean, Median, and Mode.
Mean
Mean in general terms is used for the arithmetic mean of the data,
but other than the arithmetic mean there are geometric mean and
harmonic mean as well that are calculated using different formulas.
Here we will discuss the arithmetic mean. For ungrouped data,
\bar{x}=\frac{x_1+x_2+...+x_n}{n}=\frac{\sum{x_i}}{n}
Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the mean (\bar{x}) is given by
\bar{x} = (27 + 11 + 17 + 19 + 21) ÷ 5
⇒ \bar{x} = 95 ÷ 5
⇒ \bar{x} = 19
Mean (\bar{x} ) is defined for the grouped data as the sum of the
product of observations (xi) and their corresponding frequencies (fi)
divided by the sum of all the frequencies (fi).
Example: For grouped data with \sum{f_ix_i} = 360 and \sum{f_i} = 40,
\bar{x} = \frac{\sum{f_ix_i}}{\sum{f_i}} = 360 ÷ 40 = 9
Arithmetic Mean: The formula for the Arithmetic Mean is given by
\bar{x}=\frac{\sum{x_i}}{n} (for ungrouped data)
OR
\bar{x}=\frac{\sum{f_ix_i}}{\sum{f_i}} (for grouped data)
Where, x_i are the observations, f_i their corresponding frequencies, and n the number of observations.
Properties of Arithmetic Mean:
The algebraic sum of deviations from the arithmetic mean is zero, i.e., \sum{(x_i - \bar{x})} = 0.
If \bar{x} is the arithmetic mean of the observations and a is added to each of the observations, then the new arithmetic mean is \bar{x'} = \bar{x} + a.
If \bar{x} is the arithmetic mean of the observations and a is subtracted from each of the observations, then the new arithmetic mean is \bar{x'} = \bar{x} - a.
If \bar{x} is the arithmetic mean of the observations and each of the observations is multiplied by a, then the new arithmetic mean is \bar{x'} = \bar{x} \times a.
If \bar{x} is the arithmetic mean of the observations and each of the observations is divided by a, then the new arithmetic mean is \bar{x'} = \bar{x} \div a.
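A minimal Python sketch (plain Python, no external libraries) of these ideas is given below; it reuses the ungrouped example above and checks two of the listed properties, with illustrative variable names.

```python
# Minimal sketch, plain Python (no external libraries). It reuses the
# ungrouped example above and checks two properties of the arithmetic mean.
observations = [27, 11, 17, 19, 21]

mean = sum(observations) / len(observations)
print(mean)                       # 19.0

# Property: the algebraic sum of deviations from the mean is zero
deviations = [x - mean for x in observations]
print(sum(deviations))            # 0.0 (up to floating-point rounding)

# Property: adding a constant a to every observation shifts the mean by a
a = 5
shifted_mean = sum(x + a for x in observations) / len(observations)
print(shifted_mean)               # 24.0 = 19 + 5
```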
Disadvantage of Mean as Measure of Central Tendency
Median
The Median of any distribution is that value that divides the
distribution into two equal parts such that the number of
observations above it is equal to the number of observations below it.
Thus, the median is called the central value of any given data either
grouped or ungrouped.
Case 1: N is Odd — the median is the value of the ((N + 1)/2)th observation (after arranging the data in ascending order).
Case 2: N is Even — the median is the mean of the (N/2)th and (N/2 + 1)th observations.
Example 1: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20,
32 then the Median is given by
Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 32,
36, 38
⇒ Median = (26 + 28) ÷ 2 = 27
Example 2: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20
then the Median is given by
Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 36,
38
⇒ Median = 5th observation = 26
For grouped (continuous) data, the median is given by
Median = l + \frac{\frac{N}{2} - cf}{f} \times h
Where,
l = lower limit of the median class, N = total frequency, cf = cumulative frequency of the class preceding the median class, f = frequency of the median class, and h = class width.
Example: Find the median for the following frequency distribution.
Class: 10 – 20, 20 – 30, 30 – 40, 40 – 50, 50 – 60
Frequency: 5, 10, 12, 8, 5
Solution:
Class | Frequency | Cumulative Frequency
10 – 20 | 5 | 5
20 – 30 | 10 | 15
30 – 40 | 12 | 27
40 – 50 | 8 | 35
50 – 60 | 5 | 40
N = 40, so N/2 = 20. The cumulative frequency just greater than 20 is 27, so the median class is 30 – 40, with l = 30, cf = 15, f = 12, and h = 10.
⇒ Median = 30 + (5/12) × 10
⇒ Median = 30 + 4.17
⇒ Median = 34.17
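The same grouped-median calculation can be sketched in Python as below; the class frequencies follow the cumulative frequencies shown in the solution, and the script is an illustrative sketch rather than a general-purpose routine.

```python
# Minimal sketch of the grouped-data median formula
# Median = l + ((N/2 - cf) / f) * h, using the frequency distribution above.
classes = [(10, 20, 5), (20, 30, 10), (30, 40, 12), (40, 50, 8), (50, 60, 5)]

N = sum(f for _, _, f in classes)            # total frequency = 40
half = N / 2                                 # 20

cf = 0                                       # cumulative frequency before the current class
for lower, upper, f in classes:
    if cf + f >= half:                       # first class whose cumulative frequency reaches N/2
        l, h = lower, upper - lower          # lower limit and width of the median class
        median = l + ((half - cf) / f) * h
        break
    cf += f

print(round(median, 2))                      # 34.17
```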
Mode
The Mode is the value of the observation that has the maximum frequency corresponding to it. In other words, it is the observation that occurs the maximum number of times in a dataset.
The mode of the data set is the highest-frequency term in the data set.
For grouped data, the mode is given by
Mode = l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h
Where,
l = lower limit of the modal class, h = class width, f1 = frequency of the modal class, f0 = frequency of the class preceding the modal class, and f2 = frequency of the class succeeding the modal class.
Example:
Solution:
The class interval with the highest frequency is 40 – 50, with a frequency of 16. Thus, 40 – 50 is the modal class.
Thus, l = 40, h = 10, f1 = 16, f0 = 12, f2 = 10
⇒Mode = 40 + (4/10)×10
⇒Mode = 40 + 4
⇒Mode = 44
Measure of Dispersion:
Standard Deviation
Mean Deviation
Quartile Deviation
Variance
Range, etc
Dispersion in the general sense is the state of scattering. Suppose we have to study data containing thousands of values; we then need parameters that represent the crux of the given data set. These parameters are called measures of dispersion.
Example: Suppose the marks of ten students (out of 20) are …, 9, 16, 19, and 20, totalling 135. Then the average value scored by a student in the class is 135/10 = 13.5.
Range: Range is defined as the difference between the largest and the
smallest value in the distribution.
R = L – S
where L is the largest value and S is the smallest value in the distribution.
Example: Find the range of the data set 10, 20, 15, 0, 100.
Solution:
R = 100 – 0
R = 100
Example: Find out the range for the following observations, 20, 24,
31, 17, 45, 39, 51, 61.
Solution:
Largest Value = 61
Smallest Value = 17
Thus, the range of the data set is
Range = 61 – 17 = 44
Example: Find out the range for the following frequency distribution
table for the marks scored by class 10 students.
Solution:
Range = 40
Mean Deviation
Range as a measure of dispersion only depends on the highest and
the lowest values in the data. Mean deviation on the other hand
measures the deviation of the observations from the mean of the
distribution. Since the average is the central value of the data, some deviations will be positive and some negative. If they are added as they are, their sum will not reveal much, as they tend to cancel each other's effect. For example, for the data 2, 4, 6, 8, 10 (mean 6) the deviations −4, −2, 0, +2, +4 sum to zero; hence the absolute values of the deviations are used.
Step 1: Calculate the arithmetic mean (or another chosen average) of all the values of the dataset.
Step 2: Find the absolute deviation |d| of each value from that average.
Step 3: Take the mean of these absolute deviations:
M.D = \frac{\sum|d|}{n}
Example: Calculate the mean deviation for the given ungrouped data,
2, 4, 6, 8, 10
Solution:
Mean(μ) = (2+4+6+8+10)/(5)
μ=6
M.D = \frac{\sum|d|}{n}
⇒ M.D = \frac{|(2 - 6)| + |(4 - 6)| + |(6 - 6)| + |(8 - 6)| + |(10 - 6)|}{5}
⇒ M.D = (4 + 2 + 0 + 2 + 4)/5
⇒ M.D = 12/5 = 2.4
Related Formulas
Range
Range = H – S
where H is the highest value and S is the smallest value.
Population Variance (σ²)
σ² = Σ(xᵢ − μ)² / n
Sample Variance (S²)
S² = Σ(xᵢ − x̄)² / (n − 1)
where,
μ (or x̄) is the mean
n is the number of observations
Standard Deviation: S.D. = √(σ²)
Mean Deviation
M.D. = Σ|xᵢ − a| / n
where a is the central value (mean, median, or mode) about which the deviations are taken.
Quartile Deviation
Q.D. = (Q₃ − Q₁) / 2
where,
Q3 = Third Quartile
Q1 = First Quartile
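As a rough illustration of these formulas, the following Python sketch (using only the standard-library statistics module) evaluates them for the data set 2, 4, 6, 8, 10 used in the mean-deviation example above.

```python
# Minimal sketch evaluating the dispersion formulas for the data 2, 4, 6, 8, 10;
# only the standard-library `statistics` module is used.
import statistics

data = [2, 4, 6, 8, 10]

rng = max(data) - min(data)                              # Range = H - S
mean = statistics.mean(data)
pop_var = statistics.pvariance(data)                     # sigma^2 = sum((x - mu)^2) / n
samp_var = statistics.variance(data)                     # S^2 = sum((x - x_bar)^2) / (n - 1)
sd = statistics.pstdev(data)                             # sqrt(sigma^2)
mean_dev = sum(abs(x - mean) for x in data) / len(data)  # mean deviation about the mean
cv_percent = (sd / mean) * 100                           # coefficient of variation

print(rng, pop_var, samp_var, round(sd, 2), mean_dev, round(cv_percent, 1))
# 8 8 10 2.83 2.4 47.1
```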
The related measures are summarised below.
Central Tendency: Mean, Median, Mode
Measures of Dispersion: Range, Variance, Standard Deviation, Mean Deviation, Quartile Deviation, Coefficient of Variation
CV% = (SD / Xbar) × 100
Alternate formulae
The lessons on Basic QC Practices cover these same terms (see
QC - The data calculations), but use a different form of the
equation for calculating cumulative or lot-to-date means and SDs.
Guidelines in the literature recommend that cumulative means
and SDs be used in calculating control limits [2-4], therefore it is
important to be able to perform these calculations.
This equation looks quite different from the prior equation in this lesson, but in reality, it is equivalent. The cumulative standard deviation formula is derived from an SD formula called the Raw Score Formula. Instead of first calculating the mean or Xbar, the Raw Score Formula works with the raw sums inside the square root sign:
SD = \sqrt{\frac{\sum{x^2}-\frac{(\sum{x})^2}{n}}{n-1}}
For grouped data, the mean deviation is calculated as
M.D. = ∑f|d| / N
where d is the deviation of each class mid-value from the chosen average, f the corresponding frequency, and N = ∑f.
Skewness and Kurtosis
Skewness measures the lack of symmetry in a distribution.
Types of Skewness: positive (right-skewed), negative (left-skewed), and zero (symmetrical).
Kurtosis measures the peakedness or flatness of a distribution relative to the normal curve.
Types of Kurtosis: leptokurtic (more peaked), mesokurtic (normal), and platykurtic (flatter).
SECTION-II
CORRELATION
Meaning Of Correlation:
What is correlation?
Correlation refers to the statistical relationship between two entities.
In other words, it's how two variables move in relation to one
another. Correlation can be used for various data sets, as well. In
some cases, you might have predicted how things will correlate, while
in others, the relationship will be a surprise to you. It's important to
understand that correlation does not mean the relationship is causal.
6. Divide
Divide the numerator (the value you determined in step 4) by the
denominator (the value you determined in step 5). This will result in
the correlation coefficient.
Positive correlations
Here are some examples of positive correlations:
1. The more time you spend on a project, the more effort you'll have put in.
2. The more money you make, the more taxes you will owe.
3. The nicer you are to employees, the more they'll respect you.
4. The more overtime you work, the more money you'll earn.
Negative correlations
Here are some examples of negative correlations:
1. The more payments you make on a loan, the less money you'll owe.
2. The more you work in the office, the less time you'll spend at home.
3. The more employees you hire, the fewer funds you'll have.
4. The more time you spend on a project, the less time you'll have.
No correlation
Here are some examples of entities with zero correlation:
1. The nicer you treat your employees, the higher their pay will be.
2. The earlier you arrive at work, the more your need for supplies increases.
3. The more funds you invest in your business, the more employees will leave work early.
Linear Correlation:
For example, let's consider the relationship between the hours spent
studying and the grades obtained in a class. If the relationship is
linear, we can expect that the more time a student spends studying,
the better grades they will get. This can be represented by a straight
line on a scatter plot.
Nonlinear Correlation:
In a nonlinear (curvilinear) correlation, the two variables do not change at a constant rate, so the relationship appears as a curve rather than a straight line on a scatter plot.
Scatter Diagram:
Consider the following (x, y) pairs: (50, 3), (65, 18), (70, 54), (85, 75), (100, 98).
Now that the points have been created, they can be plotted to see what the scatter plot looks like. The independent variable will go along the x-axis and the dependent variable will go along the y-axis.
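A minimal sketch of how this scatter diagram could be drawn in Python follows; it assumes the matplotlib library is available, and the axis labels are illustrative.

```python
# Minimal sketch (assumes the matplotlib package is installed) that draws the
# five (x, y) pairs above as a scatter diagram.
import matplotlib.pyplot as plt

points = [(50, 3), (65, 18), (70, 54), (85, 75), (100, 98)]
x_values = [x for x, _ in points]   # independent variable -> x-axis
y_values = [y for _, y in points]   # dependent variable -> y-axis

plt.scatter(x_values, y_values)
plt.xlabel("Independent variable (x)")
plt.ylabel("Dependent variable (y)")
plt.title("Scatter diagram")
plt.show()
```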
He brings out his lab notebook and starts to record the data. Tom
counts the number of tomatoes on each plant. He also records the
number of hours of sun each tomato plant gets during the day. Tom
now takes the data back indoors and wonders how to make sense of
it. Is there a connection between the two things that he measured?
That's where the scatter diagrams come in. Just like it sounds, a
scatter diagram, or scatter plot, is a graph of your data. Scatter
diagrams are types of graphs that help you find out if two things are
connected. In math, we like to call those things variables. How do you
know if there's a connection or a relationship between two variables?
We measure the two variables and graph them on an (x, y) coordinate
system.
No Correlation
Scatter plots may also end up showing relationships that are not
linear. Examples of these are relationships that may be exponential or
quadratic.
Exponential Correlation
Quadratic Correlation
In looking at this scatter plot it does not appear that there is any
potential positive slope or negative slope. Additionally, the data does
not seem to be showing signs of a linear pattern, exponential pattern,
or quadratic pattern. Therefore, this data has no correlation. This is to
be expected. Shoe size does not dictate how many books someone
reads, nor does the number of books someone reads dictate their
shoe size.
Real-Life Example 2
In an Algebra I class a student decided to survey their friends
regarding information that interested them about the pandemic. They
asked their classmates how many times they had to quarantine in the
year 2021 and how many tv shows they binge-watched. The student
then organized the data into a table and made a scatter plot with the
results.
Times Quarantined | TV Shows Binged
1 | 5
1 | 6
2 | 10
3 | 15
4 | 22
2 | 12
1 | 4
2 | 11
3 | 18
Karl Pearson's Coefficient of Correlation (r) = (Sum of products of deviations from their respective means) ÷ (Number of pairs × Standard deviations of both series)
Or
r=\frac{\sum{xy}}{N\times{\sigma_x}\times{\sigma_y}}
Where,
r = Coefficient of Correlation
x, y = deviations of the X and Y series from their respective means
N = number of pairs of observations
σx, σy = standard deviations of the X and Y series
Methods of Calculating Karl Pearson’s Coefficient of Correlation
Actual Mean Method
Direct Method
Short-Cut Method/Assumed Mean Method/Indirect Method
Step-Deviation Method
1. Actual Mean Method
The steps involved in the calculation of coefficient of correlation by
using Actual Mean Method are:
The first step is to calculate the mean of the given two series (say X
and Y).
Now, take the deviation of X series from \bar{X} and denote the
deviations by x.
Square the deviations of x and obtain the total; i.e., \sum{x^2}
Take the deviation of Y series from \bar{Y} and denote the
deviations by y.
Square the deviations of y and obtain the total; i.e., \sum{y^2}
Multiply the respective deviations of Series X and Y and obtain the
total; i.e., \sum{xy} .
Now, use the following formula to determine the Coefficient of
Correlation:
r=\frac{\sum{xy}}{\sqrt{\sum{x^2}\times{\sum{y^2}}}}
Example:
Use Actual Mean Method and determine the coefficient of correlation
for the following data:
Data Table
Solution:
Coefficient of Correlation
\bar{X}=\frac{\sum{X}}{N}=\frac{168}{7}=24
\bar{Y}=\frac{\sum{Y}}{N}=\frac{105}{7}=15
r=\frac{\sum{xy}}{\sqrt{\sum{x^2}\times{\sum{y^2}}}}
r=\frac{336}{\sqrt{448\times252}}=\frac{336}{\sqrt{1,12,896}}=\frac{336}{336}=1
Coefficient of Correlation = 1
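The Actual Mean Method can be sketched in Python as below. The X and Y series are hypothetical placeholders (the original data table is not reproduced in these notes), so the printed value is for illustration only.

```python
# Minimal sketch of the Actual Mean Method in plain Python.
def pearson_r(X, Y):
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    x = [xi - mean_x for xi in X]                 # deviations of X from its mean
    y = [yi - mean_y for yi in Y]                 # deviations of Y from its mean
    sum_xy = sum(a * b for a, b in zip(x, y))     # sum of products of deviations
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return sum_xy / (sum_x2 * sum_y2) ** 0.5

X = [10, 20, 30, 40, 50]     # hypothetical series
Y = [5, 10, 15, 20, 25]
print(pearson_r(X, Y))       # 1.0 for a perfectly linear relationship
```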
2. Direct Method
The steps involved in the calculation of coefficient of correlation by
using Direct Method are:
Example:
Use Direct Method and determine the coefficient of correlation for
the following data:
Data Table
Solution:
Coefficient of Correlation
r=\frac{N\sum{XY}-\sum{X}.\sum{Y}}{\sqrt{N\sum{X^2}-(\sum{X})^2}{\sqrt{N\sum{Y^2}-(\sum{Y})^2}}}
=\frac{(7\times2,856)-(168\times105)}{\sqrt{(7\times4,480)-(168)^2}\times{\sqrt{(7\times1,827)-(105)^2}}}
=\frac{19,992-17,640}{\sqrt{31,360-28,224}\times{\sqrt{12,789-11,025}}}
=\frac{2,352}{\sqrt{3,136}\times{\sqrt{1,764}}}=\frac{2,352}{56\times42}
=\frac{2,352}{2,352}=1
Coefficient of Correlation = 1
3. Short-Cut Method/Assumed Mean Method
The steps involved are:
First of all, take the deviations of X Series from the assumed mean and denote the values by dx. Calculate their total; i.e., ∑dx.
Now, square the deviations of X series and calculate their total; i.e., ∑dx².
Take the deviations of Y Series from the assumed mean and denote the values by dy. Calculate their total; i.e., ∑dy.
Square the deviations of Y series and calculate their total; i.e., ∑dy².
Multiply dx and dy and calculate their total; i.e., ∑dxdy.
Now, use the following formula to determine Coefficient of Correlation:
r=\frac{N\sum{dxdy}-\sum{dx}.\sum{dy}}{\sqrt{N\sum{dx^2}-(\sum{dx})^2}{\sqrt{N\sum{dy^2}-(\sum{dy})^2}}}
Where,
4. Step Deviation Method
Example:
Use Step Deviation Method and determine the coefficient of
correlation for the following data:
Data Table
Solution:
Coefficient of Correlation under Step Deviation Method
r=\frac{N\sum{dx^\prime{dy^\prime}}-\sum{dx^\prime}.\sum{dy^\prime}}{\sqrt{N\sum{dx^\prime{^2}}-(\sum{dx^\prime})^2}{\sqrt{N\sum{dy^\prime{^2}}-(\sum{dy^\prime})^2}}}
=\frac{(7\times35)-(7\times7)}{\sqrt{(7\times35)-(7)^2}\times{\sqrt{(7\times35)-(7)^2}}}
=\frac{245-49}{\sqrt{245-49}\times{\sqrt{245-49}}}
=\frac{196}{\sqrt{196}\times{\sqrt{196}}}=\frac{196}{14\times14}
=\frac{196}{196}=1
Coefficient of Correlation = 1
Change of Scale and Origin
Example:
Data Table
Solution:
As the coefficient of correlation is not affected by the change in scale
and origin of the variables, we will multiply the X Series by 10 and
divide the Y series by 100.
Coefficient of Correlation
r=\frac{N\sum{dxdy}-\sum{dx}.\sum{dy}}{\sqrt{N\sum{dx^2}-(\sum{dx})^2}{\sqrt{N\sum{dy^2}-(\sum{dy})^2}}}
=\frac{(8\times156)-[(-24)\times(-4)]}{\sqrt{(8\times1,584)-(-24)^2}\times{\sqrt{(8\times44)-(-4)^2}}}
=\frac{1,248-96}{\sqrt{12,672-576}\times{\sqrt{352-16}}}
=\frac{1,152}{\sqrt{12,096}\times{\sqrt{336}}}=\frac{1,152}{110\times18.3}
=\frac{1,152}{2,013}=0.57
The value of the coefficient of correlation always lies between −1 and +1, i.e., −1 ≤ r ≤ +1, or |r| ≤ 1.
This property reveals that if we subtract any constant from all the
values of X and Y, it will not affect the coefficient of correlation.
Plot the sets of values, i.e., (8, 70), (16, 58), (24, 50), (31, 32), (42, 26), (50, 12), on the graph paper and join these points. The result is the scatter diagram. This data shows a high degree of negative correlation.
In other words, the probable error (P.E.) is the value which is added or
subtracted from the coefficient of correlation (r) to get the upper
limit and the lower limit respectively, within which the value of the
correlation expectedly lies.
By adding and subtracting the value of P.E. from the value of 'r', we get the upper limit and the lower limit, respectively, within which the coefficient of correlation is expected to lie. Symbolically, it can be expressed as
P.E. = 0.6745 × (1 − r²) / √N
where N is the number of pairs of observations, and rho (the correlation in the population) is expected to lie between r − P.E. and r + P.E.
The probable Error can be used only when the following three
conditions are fulfilled:
2.3 Interpretation
The partial correlation coefficient measures the strength and
direction of the relationship between two variables while controlling
for the effects of one or more other variables. A positive partial
correlation coefficient indicates a positive relationship between the
two variables, while a negative partial correlation coefficient indicates
a negative relationship between the two variables. A partial
correlation coefficient of 0 indicates no relationship between the two
variables.
3.3 Interpretation
The multiple correlation coefficient measures the strength and
direction of the relationship between a dependent variable and two
or more independent variables. A multiple correlation coefficient of 1 indicates a perfect relationship between the dependent variable and the independent variables taken together, while a coefficient of 0 indicates no relationship.
5.2 Limitations
Partial and multiple correlation analysis also have some limitations.
They assume that the relationship between variables is linear and that
there is no interaction between variables. They also assume that the
variables are normally distributed and that there are no outliers or
influential observations in the data.
REGRESSION
What Is a Regression?
Regression is a statistical method used in finance, investing, and other
disciplines that attempts to determine the strength and character of
the relationship between one dependent variable (usually denoted by
Y) and a series of other variables (known as independent variables).
Understanding Regression
Regression captures the correlation between variables observed in a
data set and quantifies whether those correlations are statistically
significant or not.
The two basic types of regression are simple linear regression and
multiple linear regression, although there are non-linear regression
methods for more complicated data and analysis. Simple linear
regression uses one independent variable to explain or predict the
outcome of the dependent variable Y, while multiple linear regression
uses two or more independent variables to predict the outcome
(while holding all others constant).
Calculating Regression
Linear regression models often use a least-squares approach to
determine the line of best fit. The least-squares technique works by minimizing the sum of the squared differences between the observed values and the values predicted by the line.
Once this process has been completed (usually done today with
software), a regression model is constructed. The general form of
each type of regression model is:
Simple linear regression:
Y = a + bX + u
Multiple linear regression:
Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
where:
Y = the dependent variable you are trying to predict or explain
X = the explanatory (independent) variable(s) you are using to predict or associate with Y
a = the y-intercept
b = the slope (beta coefficient) of each explanatory variable
u = the regression residual or error term
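As a rough illustration, the following Python sketch (assuming NumPy is installed) fits the simple model Y = a + bX + u by ordinary least squares; the data points and variable names are made up for illustration only.

```python
# Minimal sketch (assumes NumPy is installed) of fitting Y = a + bX + u
# by ordinary least squares to illustrative data.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # independent variable
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])        # dependent variable

b, a = np.polyfit(X, Y, deg=1)                 # slope b and intercept a of the best-fit line
residuals = Y - (a + b * X)                    # u, the part the line does not explain

print(f"a (intercept) = {a:.3f}, b (slope) = {b:.3f}")
print("sum of residuals =", round(residuals.sum(), 10))   # ~0 for an OLS fit with an intercept
```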
What Are the Assumptions That Must Hold for Regression Models?
In order to properly interpret the output of a regression model, the following main assumptions about the underlying data process of what you are analyzing must hold:
1. Linear Regression
The most extensively used modelling technique is linear regression, which assumes a linear connection between a dependent variable (Y) and an independent variable (X). It employs a
regression line, also known as a best-fit line. The linear connection is
defined as Y = c+m*X + e, where ‘c’ denotes the intercept, ‘m’
denotes the slope of the line, and ‘e’ is the error term.
The linear regression model can be simple (with one dependent and one independent variable) or multiple (with one dependent variable and more than one independent variable).
Linear Regression
2. Logistic Regression
When the dependent variable is discrete, the logistic regression
technique is applicable. In other words, this technique is used to
compute the probability of mutually exclusive occurrences such as
pass/fail, true/false, 0/1, and so forth. Thus, the target variable can
take on only one of two values, and a sigmoid curve represents its
connection to the independent variable, and probability has a value
between 0 and 1.
3. Polynomial Regression
When the relationship between the dependent and independent variables is curvilinear rather than strictly linear, the polynomial regression technique fits a polynomial (curved) equation to the data instead of a straight line.
4. Ridge Regression
The ridge regression technique is applied when data exhibits multicollinearity, that is, when the independent variables are highly correlated. While least squares estimates are unbiased under multicollinearity, their variances are large, which makes the estimates unstable and far from the true values.
The lambda (λ) variable in the ridge regression equation resolves the
multicollinearity problem.
Ridge Regression (figure: coefficients plotted against log lambda)
5. Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) regression, like ridge regression, penalizes large regression coefficients; unlike ridge, its penalty can shrink some coefficients exactly to zero, so it also performs variable selection.
6. Quantile Regression
The quantile regression approach is a subset of the linear regression
technique. It is employed when the linear regression requirements are not met or when the data contains outliers. In statistics and econometrics, quantile regression estimates conditional quantiles of the response (such as the median) rather than the conditional mean, which makes it more robust to outliers.
Quantile Regression
7. Bayesian Linear Regression
Bayesian linear regression is a form of regression analysis technique
used in machine learning that uses Bayes’ theorem to calculate the
regression coefficients’ values. Rather than determining the least-
squares, this technique determines the features’ posterior
distribution. As a result, the approach outperforms ordinary linear
regression in terms of stability.
Bayesian Linear Regression (figure)
Elastic Net (figure)
Multiple Regression:
On Slide 2 you can see in the red circle, the test statistics are
significant. The F-statistic examines the overall significance of the
model, and shows if your predictors as a group provide a better fit to
the data than no predictor variables, which they do in this example.
The R2 values are shown in the green circle. The R2 value shows the
total amount of variance accounted for in the criterion by the
predictors, and the adjusted R2 is the estimated value of R2 in the
population.
In these Venn Diagrams, you can see why it is best for the predictors
to be strongly correlated with the dependent variable but
uncorrelated with the other independent variables. This reduces the amount of shared variance between the independent variables.
Regression line:
Key Takeaways
The regression line establishes a linear relationship between two
sets of variables. The change in one variable is dependent on the
changes to the other (independent variable).
The Least Squares Regression Line (LSRL) is plotted nearest to the
data points (x, y) on a regression graph.
Regression is widely used in financial models like CAPM and
investing measures like Beta to determine the feasibility of a
project. It is also used for creating projections of investments and
financial returns.
Using regression, the company can determine the appropriate asset price with respect to the cost of capital. In the stock market, it is used for determining the impact of stock price changes on the price of underlying commodities.
Formula
The formula to determine the Least Squares Regression Line
(LSRL) of Y on X is as follows:
Y = a + bX + ɛ
Here, Y is the dependent variable, X is the independent variable, a is the intercept, b is the slope of the regression line, and ɛ is the error term.
And,
b = (N∑XY – ∑X∑Y) / (N∑X² – (∑X)²)
a = (∑Y – b ∑X) / N
Example
Let us look at a hypothetical example to understand real-world
applications of the theory.
We assume there is no error. The price and sales volume for the
previous five years are as follows:
Solution:
Given:
Y = Sales Volume
X = Price
N = 5
ɛ = 0
Year | Price (in $) (X) | Sales Volume (Y) | X² | XY
2017 | 2100 | 15000 | 4410000 | 31500000
2018 | 2050 | 16500 | 4202500 | 33825000
2019 | 2000 | 21000 | 4000000 | 42000000
2020 | 2200 | 19000 | 4840000 | 41800000
2021 | 2050 | 20000 | 4202500 | 41000000
Total | 10400 | 91500 | 21655000 | 190125000
Y = a + bX + ɛ
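The slope and intercept for this example can be computed with a short Python sketch using the standard normal-equation formulas quoted in the Formula section above; the script and its printed values are illustrative and not part of the original solution.

```python
# Minimal sketch computing the least-squares regression line Y = a + bX for the
# price/sales-volume table above, using the normal-equation formulas
# b = (N*sum(XY) - sum(X)*sum(Y)) / (N*sum(X^2) - (sum(X))^2) and a = (sum(Y) - b*sum(X)) / N.
X = [2100, 2050, 2000, 2200, 2050]        # price
Y = [15000, 16500, 21000, 19000, 20000]   # sales volume
N = len(X)

sum_x = sum(X)                                  # 10,400
sum_y = sum(Y)                                  # 91,500
sum_x2 = sum(x * x for x in X)                  # 21,655,000
sum_xy = sum(x * y for x, y in zip(X, Y))       # 190,125,000

b = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / N

print(f"b = {b:.2f}, a = {a:.2f}")              # roughly b = -8.48, a = 35934.78
```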
Properties of Regression:
SECTION-III
TIME SERIES
Introduction:
KEY TAKEAWAYS
A time series is a data set that tracks a sample over time.
In particular, a time series allows one to see what factors influence
certain variables from period to period.
Time series analysis can be useful to see how a given asset, security,
or economic variable changes over time.
Forecasting methods using time series are used in both fundamental
and technical analysis.
Although cross-sectional data is seen as the opposite of time series,
the two are often used together in practice.
Understanding Time Series
A time series can be taken on any variable that changes over time. In
investing, it is common to use a time series to track the price of a
security over time. This can be tracked over the short term, such as
the price of a security on the hour over the course of a business day,
or the long term, such as the price of a security at close on the last
day of every month over the course of five years.
Time series analysis can be useful to see how a given asset, security,
or economic variable changes over time. It also can be used to
examine how the changes associated with the chosen data point
compare to shifts in other variables over the same time period.
A time series graph of the population of the United States from 1900
to 2000.
Delving a bit deeper, you might analyze time series data with
technical analysis tools to know whether the stock’s time series shows
any seasonality. This will help to determine if the stock goes through
peaks and troughs at regular times each year. Analysis in this area
would require taking the observed prices and correlating them to a
chosen season. This can include traditional calendar seasons, such as
summer and winter, or retail seasons, such as holiday seasons.
One potential issue with time series data is that since each variable is
dependent on its prior state or value, there can be a great deal of
autocorrelation, which can bias results.
Time Series Forecasting
Time series forecasting uses information regarding historical values
and associated patterns to predict future activity. Most often, this
relates to trend analysis, cyclical fluctuation analysis, and issues of
seasonality. As with all forecasting methods, success is not
guaranteed.
Description
Explanation
Prediction
Control
The description of the objectives of time series analysis are as follows:
Description
The first step in the analysis is to plot the data and obtain simple
descriptive measures (such as plotting data, looking for trends,
seasonal fluctuations and so on) of the main properties of the series.
In the above figure, there is a regular seasonal pattern of price change
although this price pattern is not consistent. The graph enables one to look for "wild" observations or outliers (values that do not appear to be consistent with the rest of the data). Graphing the time series also makes it possible to detect turning points, where an upward trend suddenly changes to a downward trend. If there is a turning point, different models may have to be fitted to the two parts of the series.
Explanation
Prediction
Given an observed time series, one may want to predict the future
values of the series. This is an important task in sales forecasting and in the analysis of economic and industrial time series. Prediction and forecasting are often used interchangeably.
Control
When a time series is generated to measure the quality of a manufacturing process, the aim may be to control the process.
Control procedures are of several different kinds. In quality control,
the observations are plotted on a control chart and the controller
takes action as a result of studying the charts. A stochastic model is
fitted to the series. Future values of the series are predicted and then
the input process variables are adjusted so as to keep the process on
target.
Identification of trends:
financial performance
competitor movement and growth
manufacturing efficiency
new or emerging technologies
customer complaints
staff performance reviews and key performance indicators (KPIs).
For example, ensuring you or your bookkeeper retain all data, that it
is kept up to date and entered accurately, will mean you can run
regular reports on past performance giving you insights into where
the business is going.
Business intelligence (BI) software was once only affordable for large businesses but is now available as software as a service (SaaS) at a low monthly or yearly cost.
You can also access data and analytics on your website and social
media platforms.
When would a trend become worrying and require your action? For
example, decreasing purchases in a retail location over the past 1 to 2
quarters may be explained by increasing domestic costs, but over the
past year the demographics in your location may have changed. You
may need to review your products and services.
What will be your critical decision points? Can you, for instance, apply
a threshold that is an acceptable variation for your business (e.g. 10%
over or under)?
What opportunity might improve your business over another? For
example, if your information technology (IT) system is experiencing
interruptions and it is a continuing trend, would outsourcing your
system be preferable to purchasing a new system? The cost of
outsourcing may be better than purchasing a new system.
The Pareto Principle (80% of consequences result from 20% of causes) also shows the importance of working on the business. The time you commit to trend analysis can yield valuable improvements across your entire business.
The parts of which a time series is composed are called the components of time series data. There are four basic components of time series data: secular trend, cyclical variation, seasonal variation, and irregular variation.
The moving average method irons out the fluctuations of the data by taking means. It measures the trend by eliminating the changes or variations by means of a moving average. The simplest mean used for the measurement of a trend is the arithmetic mean (average).
Moving Average
The moving average of period (extent) m is a series of successive averages of m terms at a time, obtained by moving the averaging window through the data one observation at a time.
In other words, the first average is the mean of the first m terms. The
second average is the mean of the m terms starting from the second
data up to (m + 1)th term. Similarly, the third average is the mean of
the m terms from the third to (m + 2) th term and so on.
If the extent or the period, m is odd i.e., m is of the form (2k + 1), the
moving average is placed against the mid-value of the time interval it
covers, i.e., t = k + 1. On the other hand, if m is even i.e., m = 2k, it is
placed between the two middle values of the time interval it covers,
i.e., t = k and t = k + 1.
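A minimal Python sketch of an m-period moving average is given below; the sales series is made up purely for illustration.

```python
# Minimal sketch of an m-period moving average in plain Python. With m odd,
# each average is placed against the middle term of the window it covers.
def moving_average(series, m):
    return [sum(series[i:i + m]) / m for i in range(len(series) - m + 1)]

sales = [10, 12, 11, 13, 15, 14, 16, 18]   # illustrative time series
print(moving_average(sales, 3))
# [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
```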
The least squares method is a statistical procedure to find the best fit
for a set of data points.
The method works by minimizing the sum of the offsets or residuals
of points from the plotted curve.
Least squares regression is used to predict the behavior of dependent
variables.
The least squares method provides the overall rationale for the
placement of the line of best fit among the data points being studied.
Traders and analysts can use the least squares method to identify
trading opportunities and economic or financial trends.
Understanding the Least Squares Method
The least squares method is a form of regression analysis that
provides the overall rationale for the placement of the line of best fit
among the data points being studied. It begins with a set of data
points using two variables, which are plotted on a graph along the x-
and y-axis. Traders and analysts can use this as a tool to pinpoint
bullish and bearish trends in the market along with potential trading
opportunities.
Advantages
One of the main benefits of using this method is that it is easy to
apply and understand. That’s because it only uses two variables (one
that is shown along the x-axis and the other on the y-axis) while
highlighting the best relationship between them.
Investors and analysts can use the least square method by analyzing
past performance and making predictions about future trends in the
economy and stock markets. As such, it can be used as a decision-
making tool.
Disadvantages
The primary disadvantage of the least square method lies in the data
used. It can only highlight the relationship between two variables. As
such, it doesn’t take any others into account. And if there are any
outliers, the results become skewed.
Another problem with this method is that the data must be evenly
distributed. If this isn’t the case, the results may not be reliable.
Pros
Easy to apply and understand
Cons
Only highlights relationship between two variables
INDEX NUMBER
Definition:
Types
Although there are different types of index numbers, most of their
primary objective is to simplify data to make comparison easier. One
often uses this method in public and private sectors to make well-
informed decisions regarding policies, prices, and investments. Let us
look at some of the popular types of this statistical tool:
Simple Aggregative Method: P01 = (ΣP1 ÷ ΣP0) × 100
Where:
ΣP1 is the sum of all prices in the year for which one has to compute the index number, and ΣP0 is the sum of all prices in the base year.
Simple Average of Price Relatives Method: P01 = ΣR ÷ N
Where:
R is the price relative of each commodity (current-year price divided by base-year price, multiplied by 100) and N is the number of commodities.
Importance
Some uses of index numbers are as follows:
1. General Importance
Generally, this tool helps in many ways. Some of them are as follows:
This tool measures the difference in the price levels or the value of
money. Additionally, it warns about inflationary tendencies, enabling
a government to take effective anti-inflationary measures.
Besides these, the tools have specific uses in economics. They are as
follows:
Limitations
The statistical measure has the following limitations:
Here, rational weights mean the weights which are perfectly rational
for one investigation. However, this weight might be unsuitable for
other investigations. In fact, the purpose of the index number and the
nature of the data concerned with it helps in deciding the rational
weights.
There are two methods through which Weighted Index Numbers can
be constructed; viz., Weighted Aggregative Method and Weighted
Average of Price Relatives Method.
Laspeyre’s Method
Paasche’s Method
Fisher’s Ideal Method
Drobish and Bowley’s Method
Marshall Edgeworth Method
Walsch’s Method
Kelly’s Method
Note: Only Laspeyre's, Paasche's, and Fisher's Methods are discussed here.
i) Laspeyre's Method
Laspeyre's~Price~Index~(P_{01})=\frac{\sum{p_1q_0}}{\sum{p_0q_0}}\times{100}
Here, p1 = current-year price, p0 = base-year price, and q0 = base-year quantity.
ii) Paasche's Method
Paasche's~Index~Number~(P_{01})=\frac{\sum{p_1q_1}}{\sum{p_0q_1}}\times{100}
Here, q1 = current-year quantity.
iii) Fisher's Ideal Method
Fisher's~Price~Index~(P_{01})=\sqrt{\frac{\sum{p_1q_0}}{\sum{p_0q_0}}\times{\frac{\sum{p_1q_1}}{\sum{p_0q_1}}}}\times{100}
Here, Fisher's index is the geometric mean of Laspeyre's and Paasche's indices, using the same symbols as above.
Weighted Average of Price Relatives Method
The steps are:
First of all, divide the price of each commodity in the current year by its price in the base year, multiply by 100, and denote the value calculated as R (the price relative).
Now, multiply the price of commodities in the base year (p0) with their respective quantities (q0), and denote the value weights by W.
After that, multiply the price relatives (R) with value weights (W) and
obtain their total; i.e., ∑RW.
Determine the total of value weights; i.e., ∑W.
Use the following formula to determine Index Number:
P_{01}=\frac{\sum{RW}}{\sum{W}}
Example:
Use Weighted Relatives Method and determine the index number
from the following data for the year 2021 with 2010 as the base year.
Information Table
Solution:
Weighted Index Number Table
P_{01}=\frac{\sum{RW}}{\sum{W}}=\frac{1,02,182}{790}=129.34
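The three weighted index-number formulas can be sketched in Python as follows; the base-year and current-year prices and quantities are hypothetical, so the printed index values are purely illustrative.

```python
# Minimal sketch of Laspeyre's, Paasche's and Fisher's price index numbers.
# The base-year (p0, q0) and current-year (p1, q1) figures are hypothetical.
def laspeyres(p0, q0, p1, q1):
    return sum(a * b for a, b in zip(p1, q0)) / sum(a * b for a, b in zip(p0, q0)) * 100

def paasche(p0, q0, p1, q1):
    return sum(a * b for a, b in zip(p1, q1)) / sum(a * b for a, b in zip(p0, q1)) * 100

def fisher(p0, q0, p1, q1):
    # Fisher's ideal index is the geometric mean of Laspeyre's and Paasche's indices
    return (laspeyres(p0, q0, p1, q1) * paasche(p0, q0, p1, q1)) ** 0.5

p0, q0 = [10, 20, 30], [5, 4, 2]   # hypothetical base-year prices and quantities
p1, q1 = [12, 22, 33], [6, 4, 3]   # hypothetical current-year prices and quantities

print(round(laspeyres(p0, q0, p1, q1), 2),
      round(paasche(p0, q0, p1, q1), 2),
      round(fisher(p0, q0, p1, q1), 2))
```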
SECTION-IV
SAMPLING
Meaning and Basic Sampling Concepts:
What Is Sampling?
Sampling is a process in statistical analysis where researchers take a
predetermined number of observations from a larger population.
Sampling allows researchers to conduct studies about a large group
by using a small portion of the population. The method of sampling
depends on the type of analysis being performed, but it may include
simple random sampling or systematic sampling. Sampling is
commonly done in statistics, psychology, and the financial industry.
KEY TAKEAWAYS
Sampling allows researchers to use a small group from a larger
population to make observations and determinations.
Random Sampling
With random sampling, every item within a population has an equal
probability of being chosen. It is the furthest removed from any
potential bias because there is no human judgement involved in
selecting the sample.
Judgment Sampling
Auditor judgment may be used to select the sample from the full
population. An auditor may only be concerned about transactions of a
material nature. For example, assume the auditor sets the threshold
for materiality for accounts payable transactions at $10,000. If the
client provides a complete list of 15 transactions over $10,000, the
auditor may just choose to review all transactions due to the small
population size.
The auditor may alternatively identify all general ledger accounts with
a variance greater than 10% from the prior period. In this case, the
auditor is limiting the population from which the sample selection is
being derived. Unfortunately, human judgment used in sampling
always comes with the potential for bias, whether explicit or implicit.
Block Sampling
Block sampling takes a consecutive series of items within the
population to use as the sample. For example, a list of all sales
transactions in an accounting period could be sorted in various ways,
including by date or by dollar amount.
Systematic Sampling
Systematic sampling begins at a random starting point within the
population and uses a fixed, periodic interval to select items for a
sample. The sampling interval is calculated as the population size
divided by the sample size. Despite the sample population being
selected in advance, systematic sampling is still considered random if
the periodic interval is determined beforehand and the starting point
is random.
Therefore, the auditor selects every fifth check for testing. Assuming
no errors are found in the sampling test work, the statistical analysis
gives the auditor a 95% confidence rate that the check procedure was
performed correctly. The auditor tests the sample of 60 checks and
finds no errors, so he concludes that the internal control over cash is
working properly.
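A minimal Python sketch of systematic sampling (a random starting point, then every k-th item) follows; the population of 500 numbered items and the sample size of 50 are assumptions chosen for illustration.

```python
# Minimal sketch of systematic sampling: a random starting point within the
# first interval, then every k-th item.
import random

population = list(range(1, 501))        # e.g. 500 numbered invoices
sample_size = 50
k = len(population) // sample_size      # sampling interval = population size / sample size = 10

start = random.randint(0, k - 1)        # random starting point
sample = population[start::k]           # every k-th item from the starting point

print(len(sample), sample[:5])          # 50 items drawn at the fixed interval
```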
Example of Sampling
Market Sampling
Businesses aim to sell their products and/or services to target
markets. Before presenting products to the market, companies
generally identify the needs and wants of their target audience. To do
so, they may employ sampling of the target market population to gain
a better understanding of those needs to later create a product
and/or service that meets those needs. In this case, gathering the
opinions of the sample helps to identify the needs of the whole.
Audit Sampling
During a financial audit, a certified public accountant (CPA) may use
sampling to determine the accuracy and completeness of account
balances in their client’s financial statements. This is called audit
sampling.
Audit sampling is necessary when the population (the account
transaction information) is large.
But there’s more to doing sampling well than just getting the right
sample size. For this reason, it is important to understand both
sampling errors and non-sampling errors so you can prevent them
from causing problems in your research.
A classic example is the 1936 US presidential election poll between Roosevelt of the Democratic party and Landon of the Republican party. The sample frame was drawn from car registrations and telephone directories. In 1936, many Americans did not own cars or telephones, and those who did were largely Republicans. The results wrongly predicted a Republican victory.
The error here lies in the way a sample has been selected. Bias has
been unconsciously introduced because the researchers didn’t
anticipate that only certain kinds of people would show up in their list
of respondents, and parts of the population of interest have been
excluded. A modern equivalent might be using mobile phone
numbers, and therefore inadvertently missing out on adults who
don’t own a mobile phone, such as older people or those with severe
learning disabilities.
Frame errors can also happen when respondents from outside the
population of interest are incorrectly included. For example, say a
researcher is doing a national study. Their list might be drawn from a
geographical map area that accidentally includes a small corner of a
foreign territory – and therefore includes respondents who are not
relevant to the scope of the study.
5. Sampling errors
As described previously, sampling errors occur because of variation in
the number or representativeness of the sample that responds.
Sampling errors can be controlled and reduced by (1) careful sample
designs, (2) large enough samples (check out our online sample size
calculator), and (3) multiple contacts to ensure a representative
response.
Hypothesis Testing
Formulation and Procedure for testing a hypothesis:
KEY TAKEAWAYS
Hypothesis testing is used to assess the plausibility of a
hypothesis by using sample data.
The test provides evidence concerning the plausibility of the
hypothesis, given the data.
Statistical analysts test a hypothesis by measuring and
examining a random sample of the population being analyzed.
The four steps of hypothesis testing include stating the
hypotheses, formulating an analysis plan, analyzing the
sample data, and analyzing the result.
If, on the other hand, there were 48 heads and 52 tails, then it
is plausible that the coin could be fair and still produce such a
result. In cases such as this where the null hypothesis is
“accepted,” the analyst states that the difference between
the expected results (50 heads and 50 tails) and the observed
results (48 heads and 52 tails) is “explainable by chance
alone.”
First Case: The person is innocent, and the judge identifies the
person as innocent
Second Case: The person is innocent, and the judge identifies
the person as guilty
Third Case: The person is guilty, and the judge identifies the
person as innocent
Fourth Case: The person is guilty, and the judge identifies the
person as guilty
Outcome possibilities of Hypothesis Testing [z test and t test]
As you can clearly see, there can be two types of error in the
judgment – Type 1 error, when the verdict is against the
person while he was innocent, and Type 2 error, when the
verdict is in favor of the person while he was guilty.
P-value has the benefit that we only need one value to make a
decision about the hypothesis. We don’t need to compute
two different values such as critical value and test scores.
Another benefit of using the p-value is that we can test at any
desired level of significance by comparing this directly with
the significance level.
Directional Hypothesis
In the Directional Hypothesis, the null hypothesis is rejected if
the test score is too large (for right-tailed) or too small (for
left-tailed). Thus, the rejection region for such a test consists
of one part, which is on the right side for a right-tailed test; or
the rejection region is on the left side from the center in the
case of a left-tailed test.
Examples of Z Test
One-Sample Z-Test
We perform the One-Sample z-Test when we want to
compare a sample mean with the population mean.
Z-score formula: z = (x̄ − μ) / (σ / √n)
Here’s an Example to Understand a One Sample z-Test
In this example:
Since the P-value is less than 0.05, we can reject the null
hypothesis and conclude based on our result that Girls on
average scored higher than 600.
Two-Sample Z-Test
We perform a Two Sample z-test when we want to compare
the mean of two samples.
Z-score formula for two samples: z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
Here’s an Example to Understand a Two Sample Z-Test
In this example:
Two-Sample T-Test
We perform a Two-Sample t-test when we want to compare
the mean of two samples.
In this example:
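As a rough illustration, the Python sketch below runs a hand-computed one-sample z-test and a two-sample t-test using SciPy; all scores and the assumed population parameters are made up for illustration.

```python
# Minimal sketch (assumes SciPy is installed) of a one-sample z-test computed
# by hand and a two-sample t-test via scipy.stats.
import math
from scipy import stats

# One-sample z-test: is the sample mean larger than a claimed population mean?
sample = [612, 598, 625, 580, 615, 640, 605, 620]
pop_mean, pop_sd = 600, 30                       # assumed population parameters
n = len(sample)
z = (sum(sample) / n - pop_mean) / (pop_sd / math.sqrt(n))
p_right = 1 - stats.norm.cdf(z)                  # right-tailed p-value
print(f"z = {z:.3f}, p = {p_right:.3f}")

# Two-sample t-test: do two independent groups have the same mean?
group_a = [23, 25, 28, 30, 26, 27]
group_b = [20, 22, 24, 25, 23, 21]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")    # reject H0 if p < 0.05
```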
F-Test:
F Test Definition
The F test can be defined as a test that uses the F test statistic to check whether the variances of two samples (or populations) are equal. To conduct an F test, the populations should be normally distributed and the samples must be independent of each other. On conducting the hypothesis test, if the results of the F test are statistically significant, then the null hypothesis can be rejected; otherwise it cannot be rejected.
F Test Formula
The f test is used to check the equality of variances using
hypothesis testing. The f test formula for different hypothesis
tests is given as follows:
Left-tailed test:
Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² < σ2²
Decision Criteria: If the f statistic < f critical value, then reject the null hypothesis.
Right-tailed test:
Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² > σ2²
Decision Criteria: If the f test statistic > f test critical value, then reject the null hypothesis.
Two-tailed test:
Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² ≠ σ2²
Decision Criteria: If the f test statistic > f test critical value, then the null hypothesis is rejected.
F Statistic
The f test statistic or simply the f statistic is a value that is
compared with the critical value to check if the null
hypothesis should be rejected or not. The f test statistic
formula is given below:
F = σ1² / σ2²
where σ1² is the variance of the first population and σ2² is the variance of the second population.
The steps to find the f test critical value are given below:
Determine the degrees of freedom of the first sample by subtracting 1 from the sample size. This gives x = n1 − 1.
Determine the degrees of freedom of the second sample by subtracting 1 from the sample size. This gives y = n2 − 1.
If it is a right-tailed test, then α is the significance level. For a left-tailed test, 1 − α is the alpha level. However, if it is a two-tailed test, then the significance level is given by α/2.
The F table is used to find the critical value at the required
alpha level.
The intersection of the x column and the y row in the f table
will give the f test critical value.
ANOVA F Test
The one-way ANOVA is an example of an f test. ANOVA stands
for analysis of variance. It is used to check the variability of
group means and the associated variability in observations
within that group. The F test statistic is used to conduct the
ANOVA test. The hypothesis is given as follows:
H0: The means of all groups are equal.
H1: The means of all groups are not equal (i.e., at least one group mean differs).
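A minimal Python sketch of a one-way ANOVA F-test (assuming SciPy is installed) is shown below; the three groups of scores are illustrative only.

```python
# Minimal sketch (assumes SciPy is installed) of a one-way ANOVA F-test on
# three illustrative groups. H0: all group means are equal.
from scipy import stats

group1 = [85, 86, 88, 75, 78, 94]
group2 = [91, 92, 93, 85, 87, 84]
group3 = [79, 78, 88, 94, 92, 85]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# Reject H0 at the 5% level only if p < 0.05
```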
Non-Parametric Tests
Whenever a few assumptions about the given population are uncertain, we use non-parametric tests, which are regarded as the distribution-free counterparts of parametric tests. When data are not distributed normally, or when they are on an ordinal level of measurement, we have to use non-parametric tests for analysis. The basic rule is to use a parametric t-test for normally distributed data and a non-parametric test for skewed data.
Sign Test
Test statistic: The test statistic of the sign test is the smaller of
the number of positive or negative signs.
The advantages of the non-parametric test are:
Easily understandable
Short calculations
Assumption of distribution is not required
Applicable to all types of data
The disadvantages of the non-parametric test are:
The world is constantly curious about the Chi-Square test’s
application in machine learning and how it makes a
difference. Feature selection is a critical topic in machine
learning, as you will have multiple features in line and must
choose the best ones to build the model. By examining the
relationship between the elements, the chi-square test aids in
the solution of feature selection problems. In this tutorial, you
will learn about the chi-square test and its application.
A chi-square test helps determine whether an observed difference between categorical variables has arisen:
As a result of chance, or
Because of a relationship between the variables.
Formula For Chi-Square Test
\chi^2_c=\sum{\frac{(O_i-E_i)^2}{E_i}}
Where
c = Degrees of freedom
O = Observed Value
E = Expected Value
Independence
Goodness-of-Fit
Independence
The Chi-Square Test of Independence is an inferential statistical test which examines whether two sets of categorical variables are likely to be related to each other or not. This test is used when we have counts of values for two nominal or categorical variables, and it is considered a non-parametric test. A relatively large sample size and independence of observations are the required criteria for conducting this test.
For Example-
Goodness-Of-Fit
In statistical hypothesis testing, the Chi-Square Goodness-of-
Fit test determines whether a variable is likely to come from a
given distribution or not. We must have a set of data values
and the idea of the distribution of this data. We can use this
test when we have value counts for categorical variables. This
test demonstrates a way of deciding if the data values have a
“ good enough” fit for our idea or if it is a representative
sample data of the entire population.
For Example-
Example
Let’s say you want to know if gender has anything to do with
political party preference. You poll 440 voters in a simple random sample to find out which political party they prefer. The results of the survey are shown in the table below.
The expected value for each cell is calculated as E = (row total × column total) ÷ grand total.
Similarly, you can calculate the expected value for each of the cells.
Now you will calculate the (O − E)² / E for each cell in the table.
Where
O = Observed Value
E = Expected Value
Adding these values over all the cells gives the chi-square statistic:
χ² = 9.837
Before you can conclude, you must first determine the critical
statistic, which requires determining our degrees of freedom.
The degrees of freedom in this case are equal to the table’s
number of columns minus one multiplied by the table’s
number of rows minus one, or (r-1) (c-1). We have (3-1)(2-1) =
2.
The calculated chi-square value (9.837) is then compared with the critical value from the chi-square distribution table. For 2 degrees of freedom at the 0.05 significance level, the critical value is 5.991; since 9.837 > 5.991, the null hypothesis is rejected and gender and political party preference are concluded to be related.
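A chi-square test of independence of this kind can be sketched in Python with SciPy as below; the contingency-table counts are hypothetical and are not the survey data from the example above.

```python
# Minimal sketch (assumes SciPy is installed) of a chi-square test of
# independence on a 2 x 3 contingency table with hypothetical counts.
from scipy import stats

observed = [[60, 90, 70],     # e.g. men:   party A / party B / party C
            [80, 70, 70]]     # e.g. women: party A / party B / party C
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
# dof = (rows - 1) * (columns - 1) = (2 - 1) * (3 - 1) = 2
```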
BUSINESS ANALYTICS
What is Business Analytics?
From manual effort to machines, there has been no looking
back for humans. In came the digital age and out went the last
iota of doubt anyone had regarding the future of mankind.
Business Analytics, Machine Learning, AI, Deep Learning, Robotics, and the Cloud have revolutionized the way we look at and use data.
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
1. Descriptive Analytics
It summarizes an organisation’s existing data to understand
what has happened in the past or is happening currently.
Descriptive Analytics is the simplest form of analytics as it
employs data aggregation and mining techniques. It makes
data more accessible to members of an organisation such as
the investors, shareholders, marketing executives, and sales
managers.
2. Diagnostic Analytics
This type of Analytics helps shift focus from past performance
to the current events and determine which factors are
influencing trends. To uncover the root cause of events,
techniques such as data discovery, data mining and drill-down
are employed. Diagnostic analytics makes use of probabilities,
and likelihoods to understand why events may occur.
Techniques such as sensitivity analysis, and training
algorithms are employed for classification and regression.
3. Predictive Analytics
This type of Analytics is used to forecast the possibility of a
future event with the help of statistical models and ML
techniques. It builds on the result of descriptive analytics to
devise models to extrapolate the likelihood of items. To run
predictive analysis, Machine Learning experts are employed.
They can achieve a higher level of accuracy than by business
intelligence alone.
4. Prescriptive Analytics
Going a step beyond predictive analytics, it provides
recommendations for the next best action to be taken. It
suggests all favorable outcomes according to a specific course
of action and also recommends the specific actions needed to
deliver the most desired result. It mainly relies on two things,
Business Analytics Tools
Some commonly used business analytics tools are:
Python
SAS
R
Tableau
Python – Python has a very regular syntax and stands out for its general-purpose character. It has a relatively gradual and low learning curve because it focuses on simplicity and readability. Python is very flexible and can also be used in web scripting. It is mainly applied when the analysed data has to be integrated with a web application or when the statistics are to be used in a production database. The IPython notebook environment further supports interactive analysis.
Many statistical tools are used at the next stage to analyze the
collected market metrics. The analyst synthesizes the research
data to define the new product and to determine its features.
The analyst also uses advanced analytics and statistical tools to determine hidden patterns and trends.