Assignment On Statistics For Management
By Rahul Gupta
Question 1: What do you mean by a sample survey? What are the different sampling methods? Briefly describe them.
Answer
Introduction:
In statistics, survey sampling describes the process of selecting a sample of
elements from a target population in order to conduct a survey.
A survey may refer to many different types or techniques of observation, but in the
context of survey sampling it most often refers to a questionnaire used to measure
the characteristics and/or attitudes of people. The purpose of sampling is to reduce
the cost and/or the amount of work that it would take to survey the entire target
population. A survey that measures the entire target population is called a census.
Probability Sampling:
In a probability sample (also called "scientific" or "random" sample) each member
of the target population has a known and non-zero probability of inclusion in the
sample. A survey based on a probability sample can in theory produce statistical
measurements of the target population that are:
• unbiased: the expected value of the sample mean equals the population mean, E(ȳ) = μ, and
• have a measurable sampling error, which can be expressed as a confidence interval or margin of error.
Non-Probability Sampling:
Many surveys are not based on probability samples, but rather on finding a suitable collection of respondents to complete the survey. Common examples of non-probability sampling are quota sampling, convenience sampling, judgment sampling and snowball sampling.
In non-probability samples the relationship between the target population and the
survey sample is immeasurable and potential bias is unknowable. Sophisticated
users of non-probability survey samples tend to view the survey as an experimental
condition, rather than a tool for population measurement, and examine the results for
internally consistent relationships.
Sampling Methods:
Random sampling is the purest form of probability sampling: each member of the population has an equal and known chance of being selected. With very large populations, however, it can be difficult or impossible to identify every member of the population, so the pool of available subjects becomes biased.
Correlation:
[Figure: several sets of (x, y) points, with the correlation coefficient of x and y for each set. The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle row), nor many aspects of nonlinear relationships (bottom row); the figure in the centre of the middle row has a slope of 0, but there the correlation coefficient is undefined because the variance of Y is zero.]
In
statistics, correlation (often measured as a correlation coefficient, ρ) indicates the
strength and direction of a relationship between two random variables. The
commonest use refers to a linear relationship, but the concept of nonlinear
correlation is also used. In general statistical usage, correlation or co-relation refers
to the departure of two random variables from independence. In this broad sense
there are several coefficients, measuring the degree of correlation, adapted to the
nature of the data.
Pearson's product-moment coefficient:
A number of different coefficients are used for different situations. The best known
is the Pearson product-moment correlation coefficient, which is obtained by
dividing the covariance of the two variables by the product of their standard
deviations. Karl Pearson developed the coefficient from a similar but slightly
different idea by Francis Galton.
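As a hedged illustration only (the data and variable names below are invented, not taken from the text), the following sketch computes the Pearson coefficient exactly as described above, by dividing the sample covariance of the two variables by the product of their standard deviations:

    import numpy as np

    # Invented example data
    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    y = np.array([3.1, 5.2, 6.8, 9.1, 10.9])

    # Pearson r = cov(x, y) / (sd(x) * sd(y))
    cov_xy = np.cov(x, y, ddof=1)[0, 1]                    # sample covariance
    r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

    print(round(r, 4))                                      # same value as np.corrcoef(x, y)[0, 1]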
Regression analysis:
In statistics, regression analysis includes any techniques for modeling and
analyzing several variables, when the focus is on the relationship between a
dependent variable and one or more independent variables. More specifically,
regression analysis helps us understand how the typical value of the dependent
variable changes when any one of the independent variables is varied, while the
other independent variables are held fixed. Most commonly, regression analysis
estimates the conditional expectation of the dependent variable given the
independent variables — that is, the average value of the dependent variable when
the independent variables are held fixed. Less commonly, the focus is on a quantile,
or other location parameter of the conditional distribution of the dependent variable
given the independent variables. In all cases, the estimation target is a function of
the independent variables called the regression function. In regression analysis, it is
also of interest to characterize the variation of the dependent variable around the
regression function, which can be described by a probability distribution.
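As an illustrative sketch of the idea (the data are invented and numpy's polyfit is used only as one convenient fitting routine), the following estimates the regression function, i.e. the conditional expectation of the dependent variable given the independent variable, and the variation around it:

    import numpy as np

    # Invented data: independent variable x, dependent variable y
    x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
    y = np.array([2.1, 3.9, 6.2, 9.8, 16.1])

    # Least-squares estimates of slope b and intercept a in E[Y | X = x] ≈ a + b*x
    slope, intercept = np.polyfit(x, y, deg=1)

    y_hat = intercept + slope * x      # fitted (average) values of y at each x
    residuals = y - y_hat              # variation of y around the regression function
    print(intercept, slope, residuals)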
Mathematical properties:
The correlation coefficient ρ_{X,Y} between two random variables X and Y with expected values μ_X and μ_Y and standard deviations σ_X and σ_Y is defined as

ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y),

where E is the expected value operator and cov means covariance. A widely used alternative notation is corr(X, Y).

Since μ_X = E(X), σ_X² = E[(X − E(X))²] = E(X²) − E²(X), and likewise for Y, and since E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y), we may also write

ρ_{X,Y} = [E(XY) − E(X)E(Y)] / [√(E(X²) − E²(X)) · √(E(Y²) − E²(Y))].
The correlation is defined only if both of the standard deviations are finite and both
of them are nonzero. It is a corollary of the Cauchy–Schwarz inequality that the
correlation cannot exceed 1 in absolute value.
If the variables are independent then the correlation is 0, but the converse is not true
because the correlation coefficient detects only linear dependencies between two
variables. Here is an example: Suppose the random variable X is uniformly
distributed on the interval from −1 to 1, and Y = X2. Then Y is completely
determined by X, so that X and Y are dependent, but their correlation is zero; they
are uncorrelated. However, in the special case when X and Y are jointly normal,
uncorrelatedness is equivalent to independence.
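The point that zero correlation does not imply independence can be checked numerically; the following is only a hedged sketch (the sample size and random seed are arbitrary choices, not from the text):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=100_000)   # X uniform on (-1, 1)
    y = x ** 2                                  # Y is completely determined by X

    # Sample correlation is approximately zero even though Y depends on X
    print(np.corrcoef(x, y)[0, 1])              # close to 0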
Sample correlation:
If we have a series of n measurements of X and Y written as xi and yi where i = 1,
2, ..., n, then the Pearson product-moment correlation coefficient can be used to
estimate the correlation of X and Y . The Pearson coefficient is also known as the
"sample correlation coefficient". The Pearson correlation coefficient is then the best
estimate of the correlation of X and Y. The Pearson correlation coefficient is written

r = [Σ (x_i − x̄)(y_i − ȳ)] / [(n − 1) s_x s_y],

where x̄ and ȳ are the sample means of X and Y, s_x and s_y are the sample standard deviations of X and Y, and the sum runs from i = 1 to n. As with the population correlation, we may rewrite this as

r = [n Σ x_i y_i − Σ x_i Σ y_i] / [√(n Σ x_i² − (Σ x_i)²) · √(n Σ y_i² − (Σ y_i)²)].
Again, as is true with the population correlation, the absolute value of the sample
correlation must be less than or equal to 1. The above formula conveniently suggests
a single-pass algorithm for calculating sample correlations, but, depending on the
numbers involved, it can sometimes be numerically unstable.
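A rough sketch of that single-pass computation is shown below (the function and variable names are my own, not from the text); it accumulates the five running sums the formula needs in one loop and, as noted above, this form can be numerically unstable for some inputs:

    import math

    def single_pass_correlation(pairs):
        """Sample correlation from one pass over (x, y) pairs, using
        r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
        n = 0
        sx = sy = sxx = syy = sxy = 0.0
        for x, y in pairs:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        num = n * sxy - sx * sy
        den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
        return num / den

    # The perfectly correlated GNP/poverty data used later gives r = 1
    print(single_pass_correlation([(1, 0.11), (2, 0.12), (3, 0.13), (5, 0.15), (8, 0.18)]))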
The square of the sample correlation coefficient, which is also known as the coefficient of determination, is the fraction of the variance in y_i that is accounted for by a linear fit of x_i to y_i. This is written

r² = 1 − s_{y|x}² / s_y²,

where s_{y|x}² is the mean squared error of the linear regression of y_i on x_i by the equation y = a + bx,

s_{y|x}² = (1/n) Σ (y_i − a − b x_i)²,

and s_y² is the variance of y. Note that since the sample correlation coefficient is symmetric in x_i and y_i, we will get the same value for a fit of x_i on y_i:

r² = 1 − s_{x|y}² / s_x².
This equation also gives an intuitive idea of the correlation coefficient for higher
dimensions. Just as the sample correlation coefficient described above is the fraction of variance accounted for by the fit of a 1-dimensional linear submanifold to a set of 2-dimensional vectors (x_i, y_i), so we can define a correlation coefficient for a fit of an m-dimensional linear submanifold to a set of n-dimensional vectors. For example, if we fit a plane z = a + bx + cy to a set of data (x_i, y_i, z_i), then the correlation coefficient of z to x and y is

r² = 1 − s_{z|xy}² / s_z².
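A hedged numerical sketch of that higher-dimensional case (the data points are invented, and np.linalg.lstsq is used only as one convenient way to fit the plane):

    import numpy as np

    # Invented data points (x_i, y_i, z_i)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    z = np.array([5.1, 4.2, 9.3, 8.1, 13.2, 12.4])

    # Least-squares fit of the plane z = a + b*x + c*y
    A = np.column_stack([np.ones_like(x), x, y])
    (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)

    z_hat = a + b * x + c * y
    r_squared = 1.0 - np.var(z - z_hat) / np.var(z)   # r^2 = 1 - s_{z|xy}^2 / s_z^2
    print(r_squared)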
Geometric interpretation:
For centered data (i.e., data which have been shifted by the sample mean so as to
have an average of zero), the correlation coefficient can also be viewed as the cosine
of the angle between the two vectors of samples drawn from the two random
variables.
As an example, suppose five countries are found to have gross national products of
1, 2, 3, 5, and 8 billion dollars, respectively. Suppose these same five countries (in
the same order) are found to have 11%, 12%, 13%, 15%, and 18% poverty. Then let
x and y be ordered 5-element vectors containing the above data: x = (1, 2, 3, 5, 8)
and y = (0.11, 0.12, 0.13, 0.15, 0.18).
By the usual procedure for finding the angle between two vectors (see dot product),
the uncentered correlation coefficient is

cos θ = (x · y) / (|x| |y|) = 2.93 / (√103 · √0.0983) ≈ 0.921.

Note that the above data were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01x. The Pearson correlation coefficient must therefore be exactly one. Centering the data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) yields x = (−2.8, −1.8, −0.8, 1.2, 4.2) and y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which

cos θ = (x · y) / (|x| |y|) = 0.308 / (√30.8 · √0.00308) = 1,

as expected.
Then, the equation of the least-squares line for these data can be derived to be y = 0.10 + 0.01x, exactly the relationship used to construct them.
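The cosine interpretation above can be verified numerically; this is only an illustrative sketch using the same five-country figures (numpy assumed available):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])        # GNP, billions of dollars
    y = np.array([0.11, 0.12, 0.13, 0.15, 0.18])   # poverty rate

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(x, y))                              # uncentered: about 0.921
    print(cosine(x - x.mean(), y - y.mean()))        # centered: 1.0, the Pearson correlation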
In another example, as we go from each pair to the next, x increases, and so does y. This relationship
is perfect, in the sense that an increase in x is always accompanied by an increase
in y. This means that we have a perfect rank correlation, and both Spearman's and
Kendall's correlation coefficients are 1, whereas in this example Pearson's product
moment correlation coefficient is 0.456, indicating that the points are far from lying
on a straight line. In the same way, if y always decreases when x increases, the rank correlation coefficients will be −1, while the product-moment correlation coefficient may or may not be close to −1, depending on how close the points are to a straight line. Although in the extreme cases of perfect rank correlation the two coefficients are both equal (being either both +1 or both −1), this is not in general so, and values of the
two coefficients cannot meaningfully be compared. For example, for the three pairs
(1, 1) (2, 3) (3, 2) Spearman's coefficient is 1/2, while Kendall's coefficient is 1/3.
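The figures for those three pairs can be reproduced with a short sketch (scipy is assumed to be installed):

    from scipy.stats import spearmanr, kendalltau

    x = [1, 2, 3]
    y = [1, 3, 2]

    print(spearmanr(x, y).correlation)    # 0.5, Spearman's coefficient
    print(kendalltau(x, y).correlation)   # 0.333..., Kendall's coefficient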
Scatter plots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe, illustrate this point. The four y variables have
the same mean (7.5), standard deviation (4.12), correlation (0.816) and regression
line (y = 3 + 0.5x). However, as can be seen on the plots, the distribution of the
variables is very different. The first one (top left) seems to be distributed normally,
and corresponds to what one would expect when considering two variables
correlated and following the assumption of normality. The second one (top right) is
not distributed normally; while an obvious relationship between the two variables
can be observed, it is not linear, and the Pearson correlation coefficient is not
relevant. In the third case (bottom left), the linear relationship is perfect, except for
one outlier which exerts enough influence to lower the correlation coefficient from 1
to 0.816. Finally, the fourth example (bottom right) shows another case in which
one outlier is enough to produce a high correlation coefficient, even though the
relationship between the two variables is not linear.
The population correlation can thus be written ρ_{X,Y} = E[(X − EX)(Y − EY)] / (σ_X σ_Y), where EX and EY are the expected values of X and Y, respectively, and σ_X and σ_Y are the standard deviations of X and Y, respectively.

Question: Obtain the regression equations of Y on X and of X on Y for the following data:
X: 12  15  18  20  27  34  28  48
Y: 123 150 158 170 180 184 176 130
Answer:
Assumed mean of X is 26; assumed mean of Y is 158.
With dx = X − 26 and dy = Y − 158, the totals are N = 8, Σdx = −6, Σdy = 7, Σdx² = 970, Σdy² = 3701 and Σdxdy = 156. The actual means are X̄ = 202/8 = 25.25 and Ȳ = 1271/8 = 158.875.
Regression equation of Y on X
Y − 158.875 = byx (X − 25.25), where byx = (N Σdxdy − Σdx Σdy) / (N Σdx² − (Σdx)²)
byx = (8 × 156 − (−6)(7)) / (8 × 970 − (−6)²)
byx = (1248 + 42) / (7760 − 36)
byx = 1290 / 7724
byx ≈ 0.167
Y − 158.875 = 0.167(X − 25.25)
Y = 0.167X + 154.66
Regression equation of X on Y
X − 25.25 = bxy (Y − 158.875), where bxy = (N Σdxdy − Σdx Σdy) / (N Σdy² − (Σdy)²)
bxy = (8 × 156 − (−6)(7)) / (8 × 3701 − 7²)
bxy = 1290 / (29608 − 49)
bxy = 1290 / 29559
bxy ≈ 0.0436
X − 25.25 = 0.0436(Y − 158.875)
X = 0.0436Y + 18.32
• Regression equation of Y on X: Y = 0.167X + 154.66
• Regression equation of X on Y: X = 0.0436Y + 18.32
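The coefficients can be double-checked with a short script (a sketch only; numpy is assumed to be available):

    import numpy as np

    X = np.array([12, 15, 18, 20, 27, 34, 28, 48], dtype=float)
    Y = np.array([123, 150, 158, 170, 180, 184, 176, 130], dtype=float)

    byx = np.polyfit(X, Y, 1)[0]   # slope of the Y-on-X regression, about 0.167
    bxy = np.polyfit(Y, X, 1)[0]   # slope of the X-on-Y regression, about 0.0436

    print(byx, Y.mean() - byx * X.mean())   # Y ≈ 0.167·X + 154.66
    print(bxy, X.mean() - bxy * Y.mean())   # X ≈ 0.0436·Y + 18.32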
Introduction:
Business forecasting has always been one component of running an enterprise.
However, forecasting traditionally was based less on concrete and comprehensive
data than on face-to-face meetings and common sense. In recent years, business
forecasting has developed into a much more scientific endeavor, with a host of
theories, methods, and techniques designed for forecasting certain types of data. The
development of information technologies and the Internet propelled this
development into overdrive, as companies not only adopted such technologies into
their business practices, but into forecasting schemes as well. In the 2000s,
projecting the optimal levels of goods to buy or products to produce involved
sophisticated software and electronic networks that incorporate mounds of data and
advanced mathematical algorithms tailored to a company's particular market
conditions and line of business. Business forecasting involves a wide range of tools,
including simple electronic spreadsheets, enterprise resource planning (ERP) and electronic data interchange (EDI) networks, advanced supply chain management systems, and other Web-enabled technologies. The practice attempts to pinpoint key
factors in business production and extrapolate from given data sets to produce
accurate projections for future costs, revenues, and opportunities. This normally is
done with an eye toward adjusting current and near-future business practices to take
maximum advantage of expectations.
In the Internet age, the field of business forecasting was propelled by three
interrelated phenomena. First, the Internet provided a new series of tools to aid the
science of business forecasting. Second, business forecasting had to take the Internet
itself into account in trying to construct viable models and make predictions.
Finally, the Internet fostered vastly accelerated transformations in all areas of
business that made the job of business forecasters that much more exacting. By the
2000s, as the Internet and its myriad functions highlighted the central importance of
information in economic activity, more and more companies came to recognize the
value, and often the necessity, of business forecasting techniques and systems.
Business forecasting is indeed big business, with companies investing tremendous
resources in systems, time, and employees aimed at bringing useful projections into
the planning process. According to a survey by the Hudson, Ohio-based Answer
Think Consulting Group, which specializes in studies of business planning, the
average U.S. company spends more than 25,000 person-days on business
forecasting and related activities for every billion dollars of revenue.
Forecasting systems draw on several sources for their forecasting input, including
databases, e-mails, documents, and Web sites. After processing data from various
sources, sophisticated forecasting systems integrate all the necessary data into a
single spreadsheet, which the company can then manipulate by entering in various
projections—such as different estimates of future sales—that the system will
incorporate into a new readout.
The third primary forecasting model is known as the judgmental model. In this case,
one attempts to produce a forecast where there is no useful historical data. A
company might choose to use the judgmental model when it attempts to project
sales for a brand new product, or when market conditions have qualitatively
changed, rendering previous data obsolete. In addition, according to the Journal of
Business Forecasting Methods & Systems, this model is useful when the bulk of
sales derive only from a relative handful of customers. To proceed in the absence of
historical data, alternative data is collected by way of experts in the field,
prospective customers, trade groups, business partners, or any other relevant source
of information. Business forecasting systems often work hand-in-hand with supply
chain management systems. In such systems, all partners in the supply chain can
electronically oversee all movement of components within that supply chain and
gear the chain toward maximum efficiency.
The Internet has proven to be a panacea in this field, and business forecasting
systems allow partners to project the optimal flow of components into the future so
that companies can try to meet optimal levels rather than continually catch up to
them.
Judgmental methods:
Judgmental forecasting methods incorporate intuitive judgments, opinions and
subjective probability estimates.
• Composite forecasts
• Surveys
• Delphi method
• Scenario building
• Technology forecasting
• Forecast by analogy
Other methods:
• Simulation
• Prediction market
• Probabilistic forecasting and Ensemble forecasting
• Reference class forecasting
Forecasting accuracy:
The forecast error is the difference between the actual value and the forecast value
for the corresponding period:

E_t = Y_t − F_t,

where E_t is the forecast error at period t, Y_t is the actual value at period t, and F_t is the forecast for period t.
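A small sketch of this calculation (the demand figures are invented for illustration; the mean absolute deviation shown is one common summary of forecast accuracy, not something defined in the text):

    actual   = [102, 110,  98, 120, 115]   # Y_t, observed values
    forecast = [100, 105, 104, 112, 118]   # F_t, forecasts for the same periods

    errors = [y - f for y, f in zip(actual, forecast)]    # E_t = Y_t - F_t
    mad = sum(abs(e) for e in errors) / len(errors)       # mean absolute deviation

    print(errors, mad)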
One of the most essential elements of being a high-performing manager is the ability
to lead effectively one's own life, then to model those leadership skills for
employees in the organization. This section comprehensively covers the theory and practice of most topics in forecasting and economics; such a comprehensive approach is necessary to fully understand the subject. A central objective is to unify the various business topics, to link them closely to each other and to the supporting fields of statistics and economics. Nevertheless, the topics and
coverage do reflect choices about what is important to understand for business
decision making. Almost all managerial decisions are based on forecasts. Every
decision becomes operational at some point in the future, so it should be based on
forecasts of future conditions. Forecasts are needed throughout an organization --
and they should certainly not be produced by an isolated group of forecasters.
Neither is forecasting ever "finished". Forecasts are needed continually, and as time
moves on, the impact of the forecasts on actual performance is measured; original
forecasts are updated; and decisions are modified, and so on.
For example, many inventory systems cater for uncertain demand. The inventory
parameters in these systems require estimates of the demand and forecast error
distributions. The two stages of these systems, forecasting and inventory control, are
often examined independently. Most studies tend to look at demand forecasting as if
this were an end in itself or at stock control models as if there were no preceding
stages of computation. Nevertheless, it is important to understand the interaction
between demand forecasting and inventory control since this influences the
performance of the inventory system; the two stages form a single, integrated process.
There may also be sets of constraints which apply to each of these components; therefore, they do not need to be treated separately.
Actions: the action taken is the ultimate decision, and it is the best course of strategy to achieve the desired goal.
The forecast for time period t + 1 is the forecast for all future time periods.
However, this forecast is revised only when new data becomes available. You may
like using Forecasting by Smoothing JavaScript, and then performing some
numerical experimentation for a deeper understanding of these concepts.
The three-period weighted moving-average forecast is

F_{t+1} = w1·A_t + w2·A_{t−1} + w3·A_{t−2},

where the weights are any positive numbers such that w1 + w2 + w3 = 1. Typical weights for this example are w1 = 3/(1 + 2 + 3) = 3/6, w2 = 2/6, and w3 = 1/6.
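A quick sketch of that weighted forecast (the demand series below is invented):

    def weighted_ma_forecast(series, weights=(3/6, 2/6, 1/6)):
        """Forecast the next period as w1*A_t + w2*A_(t-1) + w3*A_(t-2)."""
        w1, w2, w3 = weights
        return w1 * series[-1] + w2 * series[-2] + w3 * series[-3]

    demand = [120, 124, 130, 128, 135]    # invented actuals A_1..A_5
    print(weighted_ma_forecast(demand))   # forecast for period 6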
Moving Averages with Trends: Any method of time series analysis involves a
different degree of model complexity and presumes a different level of
comprehension about the underlying trend of the time series. In many business time
series, the trend in the series smoothed with the usual moving-average method indicates that the series level is evolving in a highly nonlinear way.
In order to capture the trend, we may use the Moving-Average with Trend (MAT)
method. The MAT method uses an adaptive linearization of the trend by means of
incorporating a combination of the local slopes of both the original and the
smoothed time series.
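The exact MAT formulas are not reproduced in the text, so the following is only a generic sketch of the underlying idea (not the exact MAT method): estimate a local slope from the smoothed series and add it to the plain moving average so that the forecast follows an evolving trend.

    def moving_average_with_trend(series, window=3):
        """Illustrative only: forecast = latest moving average
        + local slope of the moving-average series."""
        if len(series) < window + 1:
            raise ValueError("need at least window + 1 observations")
        ma = [sum(series[i - window:i]) / window for i in range(window, len(series) + 1)]
        slope = ma[-1] - ma[-2]    # local slope of the smoothed series
        return ma[-1] + slope      # trend-adjusted one-step-ahead forecast

    demand = [100, 104, 109, 115, 122, 130]   # invented, steadily trending series
    print(moving_average_with_trend(demand))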
For such a forecasting model to be adequate, the forecast error must be a random variable distributed normally with mean close to zero and a constant variance across time.
For a computer implementation of the Moving Average with Trend (MAT) method, one may use the forecasting (FC) module of WinQSB, which is commercial-grade stand-alone software. WinQSB's approach is to first select the model and then enter the parameters and the data. With the Help features in WinQSB there is virtually no learning curve; one needs only a few minutes to master its useful features.
Introduction:
Statistics is considered by some to be a mathematical science pertaining to the
collection, analysis, interpretation or explanation, and presentation of data, while
others consider it to be a branch of mathematics concerned with collecting and
interpreting data. Statisticians improve the quality of data with the design of
experiments and survey sampling. Statistics also provides tools for prediction and
forecasting using data and statistical models. Statistics is applicable to a wide variety
of academic disciplines, including natural and social sciences, government, and
business.
Levels of measurement:
There are four types of measurements or levels of measurement or measurement
scales used in statistics:
• Nominal.
• Ordinal.
• Interval.
• Ratio.
Characteristics of Statistics:
Some of its important characteristics are given below:
(2) Statistics helps in the proper and efficient planning of a statistical inquiry in any field of study.
(6) Statistics helps in drawing valid inference, along with a measure of their
reliability about the population parameters from the sample data.
Limitations of Statistics:
The important limitations of statistics are:
(1) Statistical laws are true only on average. Statistics are aggregates of facts, so a single observation is not a statistic; statistics deals with groups and aggregates only.
(4) If sufficient care is not exercised in collecting, analyzing and interpreting the data, statistical results might be misleading.
(5) Only a person who has an expert knowledge of statistics can handle statistical
data efficiently.
(6) Some errors are possible in statistical decisions. Inferential statistics in particular involves certain errors, and we do not know whether an error has been committed or not.
Introduction:
Statistical surveys are used to collect quantitative information about items in a
population. Surveys of human populations and institutions are common in political
polling and government, health, social science and marketing research. A survey
may focus on opinions or factual information depending on its purpose, and many
surveys involve administering questions to individuals. When the questions are
administered by a researcher, the survey is called a structured interview or a
researcher-administered survey. When the questions are administered by the
respondent, the survey is referred to as a questionnaire or a self-administered survey.
Serial surveys:
Serial surveys are those which repeat the same questions at different points in time,
producing time-series data. They typically fall into two types:
• Cross-sectional surveys which draw a new sample each time. In a sense any
one-off survey will also be cross-sectional.
• Longitudinal surveys where the sample from the initial survey is re-
contacted at a later date to be asked the same questions.
Advantages:
• It is an efficient way of collecting information from a large number of
respondents. Very large samples are possible. Statistical techniques can be
used to determine validity, reliability, and statistical significance.
• Surveys are flexible in the sense that a wide range of information can be
collected. They can be used to study attitudes, values, beliefs, and past
behaviors.
• Because they are standardized, they are relatively free from several types of
errors.
• They are relatively easy to administer.
• There is an economy in data collection due to the focus provided by
standardized questions. Only questions of interest to the researcher are
asked, recorded, codified, and analyzed. Time and money is not spent on
tangential questions.
• Cheaper to run.
Disadvantages:
• They depend on subjects’ motivation, honesty, memory, and ability to
respond. Subjects may not be aware of their reasons for any given action.
They may have forgotten their reasons. They may not be motivated to give
accurate answers; in fact, they may be motivated to give answers that present
themselves in a favorable light.
• Structured surveys, particularly those with closed ended questions, may have
low validity when researching affective variables.
• Although the chosen survey individuals are often a random sample, errors
due to non-response may exist. That is, people who choose to respond to the
survey may be different from those who do not respond, thus biasing the
estimates.
• Survey question answer-choices could lead to vague data sets because at
times they are relative only to a personal abstract notion concerning
"strength of choice". For instance the choice "moderately agree" may mean
different things to different subjects, and to anyone interpreting the data for
correlation. Even yes or no answers are problematic because subjects may
for instance put "no" if the choice "only once" is not available.
Telephone:
• Use of interviewers encourages sample persons to respond, leading to higher
response rates.
Mail:
• The questionnaire may be handed to the respondents or mailed to them, but
in all cases they are returned to the researcher via mail.
• Cost is very low, since bulk postage is cheap in most countries.
• Long time delays, often several months, before the surveys are returned and
statistical analysis can begin.
• Not suitable for issues that may require clarification.
• Respondents can answer at their own convenience (allowing them to break
up long surveys; also useful if they need to check records to answer a
question).
• No interviewer bias introduced.
• Large amount of information can be obtained: some mail surveys are as long
as 50 pages.
• Response rates can be improved by using mail panels:
o Members of the panel have agreed to participate.
o Panels can be used in longitudinal designs where the same
respondents are surveyed several times.
Online surveys:
• Can use web or e-mail.
• Web is preferred over e-mail because interactive HTML forms can be used.
• Often inexpensive to administer.
• Very fast results.
• Easy to modify.
• Response rates can be improved by using online panels - members of the
panel have agreed to participate.
• If not password-protected, easy to manipulate by completing multiple times
to skew results.
• Data creation, manipulation and reporting can be automated and/or the data can be easily exported into a format which can be read by PSPP, DAP or other statistical analysis software.
• Data sets created in real time.
• Some are incentive based (such as Survey Vault or Yoga).
• May skew sample towards a younger demographic compared with CATI.
• Often difficult to determine/control selection probabilities, hindering
quantitative analysis of data.
• Used in large-scale industries.
Mall intercept surveys:
• Shoppers at malls are intercepted: they are interviewed on the spot, taken to a room and interviewed, or taken to a room and given a self-administered questionnaire.
Types of classification:
The very important types are geographical, chronological, qualitative and quantitative classification.
Methods of Classification:
• Classification done according to a single attribute or variable is known as one-way classification.
• Classification done according to two attributes or variables is known as two-way classification.
• Classification done according to more than two attributes or variables is known as manifold classification.
Examples:
A classifier can be written as y = f(x; θ), where the feature vector input is x and the function f is typically parameterized by some parameters θ. In the Bayesian approach to this problem, instead of choosing a single parameter vector θ, the result is integrated over all possible values of θ, with each θ weighted by how likely it is given the training data D:

P(class | x, D) = ∫ P(class | x, θ) P(θ | D) dθ
• The third problem is related to the second, but the problem is to estimate the class-conditional probabilities P(x | class) and then use Bayes' rule to produce the class probability P(class | x), as in the second problem.
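A minimal sketch of that last step, applying Bayes' rule to invented class-conditional probabilities and priors (all numbers are illustrative only, and the class names are made up):

    # Invented class-conditional probabilities P(x | class) for one observed x
    likelihood = {"class_A": 0.30, "class_B": 0.05}
    prior      = {"class_A": 0.40, "class_B": 0.60}   # P(class)

    # Bayes' rule: P(class | x) = P(x | class) * P(class) / P(x)
    evidence = sum(likelihood[c] * prior[c] for c in likelihood)   # P(x)
    posterior = {c: likelihood[c] * prior[c] / evidence for c in likelihood}

    print(posterior)   # here: {'class_A': 0.8, 'class_B': 0.2}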
Table:
In relational databases and flat file databases, a table is a set of data elements
(values) that is organized using a model of vertical columns (which are identified by
their name) and horizontal rows. A table has a specified number of columns, but can
have any number of rows. Each row is identified by the values appearing in a
particular column subset which has been identified as a candidate key. Table is
another term for relations; although there is the difference in that a table is usually a
multi-set (bag) of rows whereas a relation is a set and does not allow duplicates.
Besides the actual data rows, tables generally have associated with them some meta-
information, such as constraints on the table or on the values within particular
columns. The data in a table does not have to be physically stored in the database.
Views are also relational tables, but their data are calculated at query time. Another
example is nicknames, which represent a pointer to a table in another database.
Unlike a spreadsheet, the data type of a field is ordinarily defined by the schema
describing the table. Some relational systems are less strict about field data type
definitions.
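To make the idea concrete, here is a small hedged sketch using Python's built-in sqlite3 module (the table and column names are invented): it defines a table with a fixed set of typed columns, a candidate key and a constraint on column values, then adds rows.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE employee (
            emp_id   INTEGER PRIMARY KEY,          -- candidate key identifying each row
            name     TEXT    NOT NULL,
            salary   REAL    CHECK (salary >= 0)   -- constraint on values in a column
        )
    """)
    conn.execute("INSERT INTO employee VALUES (1, 'Asha', 52000.0)")
    conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 48000.0)")

    for row in conn.execute("SELECT * FROM employee"):
        print(row)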
Tabulation:
Tabulation follows classification. It is a logical listing of related data in rows and
columns. Objectives of tabulation are:
• To simplify complex data.
• To highlight important characteristics.
• To present data in minimum space.
• To facilitate comparison.
• To bring out trends and tendencies.
• To facilitate further analysis.