ERM 4a Final


UNIT-IV

DATA COLLECTION
&

DATA ANALYSIS
Why a Manager Needs to Know About Statistics

 To Know How to Properly Present Information

 To Know How to Draw Conclusions about Populations Based on Sample Information

 To Know How to Improve Processes

 To Know How to Obtain Reliable Forecasts


Why We Need Data
 To Provide Input to Survey
 To Provide Input to Study
 To Measure Performance of Ongoing Service or
Production Process
 To Evaluate Conformance to Standards
 To Assist in Formulating Alternative Courses of
Action
 To Satisfy Curiosity
Exploring the Data
 The task of data collection begins after a research problem has been defined and the research
design/plan chalked out.
 The collection of data is an important task in research methodology. Before choosing among the
methods of data collection, the researcher should understand the need of the study and decide the type
of data required.

Source of Data:
The Researcher should keep in mind two types of data:
1. Primary
2. Secondary
Primary data: Those which are collected afresh and for the first time, and thus
happen to be original in character.
Secondary data: Those which have already been collected by someone else and
which have already been passed through the statistical process.
The distinction between primary and secondary data can be made clearer on the
basis of documents:
1. Primary data : Documented as record
2. Secondary data : Documented as report
Exploring the Data Contd….
 The researcher has to decide which type of data he would like
to use for his study and accordingly he will have to select a
particular type of database.

 The data collection methods differ for each type of data: primary
data are to be collected by the researcher personally, whereas for
secondary data the work is only compilation of data already collected.

 We describe the different methods of data collection, with the
pros and cons of each method.
COLLECTION OF PRIMARY DATA
We collect primary data during the course of doing experiments in
experimental research. In research of the descriptive type, where we perform
surveys (whether sample surveys or census surveys), we can obtain primary data
either through observation or through direct communication with respondents in
one form or another, such as personal interviews.

In an experiment the investigator measures the effects of an experiment which he
conducts intentionally.
In a survey, the investigator examines those phenomena which exist in the
universe independent of his action. The difference between an experiment and a
survey can be depicted as under:
For example, a survey to find a mental fitness index or a happiness index.
There are several methods of collecting primary data, particularly in surveys and
descriptive researches. Important ones are:
(i) Observation method,
(ii) Interview method,
(iii) Through questionnaires,
(iv) Through schedules, and
(v) Other methods, listed below.
Collection of Primary Data
Other Methods:
(a) Warranty cards;
(b) Distributor audits;
(c) Pantry audits;
(d) Consumer panels;
(e) Using mechanical devices;
(f) Through projective techniques;
(g) Depth interviews, and
(h) Content analysis.
We briefly take up each method separately.
Observation Method
Good and Hatt : Science begins with observation and must ultimately return to
observation for its final validation.
Moses and Kalton: Observation implies the use of eyes rather than of ears and the
voice.
Definition of Observation: Observation is systematic viewing, coupled with
consideration of the seen phenomena, in which the main consideration must be
given to the larger unit of activity by which the specific observed phenomena occurred.
Observing natural phenomena, aided by systematic classification and measurement, led to the
development of theories and laws of nature’s forces.

Components of Observation: Observation involves three processes:

1. Sensation: It is gained through the sense organs and depends upon the
physical alertness of the observer. It reports the facts as observed.
2. Attention: It is largely a matter of habit.
3. Perception: It involves the interpretation of sensory reports.
It enables the mind to recognize the facts.
Observation Method
 Observation becomes a scientific tool and the method of data collection for the
researcher, when it serves a formulated research purpose, is systematically
planned and recorded and is subjected to checks and controls on validity and
reliability. Under the observation method, the information is sought by way of the
investigator’s own direct observation without asking the respondent.
 For instance, in a study relating to consumer behaviour, the investigator instead
of asking the brand of wrist watch used by the respondent, may himself look at
the watch.
 The main advantage of this method is that subjective bias is eliminated, if
observation is done accurately.
 The information obtained under this method relates to what is currently
happening; it is not complicated by either the past behaviour or future
intentions or attitudes
 This method is independent of respondents’ willingness to respond and as such
is relatively less demanding of active cooperation on the part of respondents as
happens to be the case in the interview or the questionnaire method.
Characteristics of Observation
1. Observation is at once a physical as well as mental activity. The use of sense organs
is involved as in observation one has to see or hear something.
2. Observation is selective because one observes only those things which
fall within the scope of the observation.
3. Observation is purposive. Observation is limited to those facts and details which
help in achieving the specified objectives of research.
4. Observation has to be efficient. Mere observation is not enough. There should be scientific
thinking. Further, these observations should be based on tools of research which
have been properly standardized.
5. In observation the researcher makes a direct study. It is a classical scientific method
for the collection of primary and dependable data.
6. Through observation, it is possible to establish cause-effect relationships in social
phenomena. The investigator first of all observes things and then collects data.

Aids of Observation: Diaries, note-books, schedules, photographs and maps are the
commonly used devices for observation.
Observation method has various limitations
 It is an expensive method.
 The information provided by this method is very limited.
 Sometimes unforeseen factors may interfere with the observational task.
 At times, the fact that some people are rarely accessible to direct observation
creates an obstacle to collecting data effectively by this method.

The researcher should keep in mind things like:

 What should be observed?
 How should the observations be recorded?
 How can the accuracy of observation be ensured?
 In case the observation is characterized by a careful definition of the units to be observed,
the style of recording the observed information, standardized conditions of observation and the
selection of pertinent data of observation, then the observation is called structured observation.

 Generally, controlled observation takes place in various experiments that are carried out in a
laboratory or under controlled conditions,
 whereas uncontrolled observation is resorted to in case of exploratory research.
Interview Method
 The interview method is one of the important methods of primary data collection.
 It is a conversation between the interviewer and the respondent. It consists of oral-verbal questions and
corresponding oral-verbal responses to the queries made.

Definition of interviews:
PV Young : The interview may be regarded as a systematic method by which one person enters
more or less legitimately into the inner life of another who is generally a stranger to him.
Hsin Pao Yang: The interview is a technique of field work which is used to watch the behaviour of
an individual or individuals, to record statements, to observe the concrete results of social or
group interactions.
CA Master : In a formal interview pre-determined questions are asked and the answers are
collected in a certain way.
The interviews can be conducted personally or through telephones.
The concept of interview, usually understood as face -to- face encounter, can be extended to
include telephone interviews and in today’s context, video interviews.
Interview Method
 The interview method of collecting data involves presentation of oral-verbal
stimuli and reply in terms of oral-verbal responses.
 This method can be used through personal interviews and, if possible, through
telephone interviews.
Personal interviews: Personal interview method requires a person known as the
interviewer asking questions generally in a face-to-face contact to the other
person or persons.
 At times the interviewee may also ask certain questions and the interviewer
responds to these, but usually the interviewer initiates the interview and collects
the information.
 This sort of interview may be in the form of direct personal investigation or it
may be indirect oral investigation.
 Direct personal investigation: The interviewer has to be on the spot and has to meet the people
from whom data have to be collected.
 This method is particularly suitable for intensive investigations.
Interview Method
Indirect oral investigation can be conducted, under which the interviewer has to
cross-examine other persons who are supposed to have knowledge about the
problem under investigation, and the information obtained is recorded.
Most of the commissions and committees appointed by government to carry on
investigations make use of this method.
Major advantages of personal interviews:
1. More information and that too in greater depth can be obtained.
2. There is greater flexibility under this method as the opportunity to restructure
questions is always there, especially in case of unstructured interviews.
3. Observation method can as well be applied to recording verbal answers to
various questions.
4. Personal information can as well be obtained easily under this method.
5. The interviewer can collect supplementary information about the respondent’s
personal characteristics and environment which is often of great value in
interpreting results.
Interview Method
Weaknesses of personal interviews:
1. It is a very expensive method, especially when a large and widely spread
geographical sample is taken.
2. There remains the possibility of the bias of interviewer as well as that of the
respondent; there also remains the headache of supervision and control of
interviewers.
3. Certain types of respondents such as important officials or executives or people
in high income groups may not be easily approachable under this method and to
that extent the data may prove inadequate.
4. The presence of the interviewer on the spot may over-stimulate the respondent,
sometimes even to the extent that he may give imaginary information just to
make the interview interesting.
5. Under the interview method the organization required for selecting, training and
supervising the field-staff is more complex with formidable problems.
6. Interviewing at times may also introduce systematic errors.
Interview Method
Telephone interviews: This method of collecting information consists in
contacting respondents on the telephone itself. It is not a very widely used
method, but plays an important part in industrial surveys, particularly in
developed regions.
The chief merits of such a system are:
1. It is more flexible in comparison to mailing method.
2. It is faster than other methods i.e., a quick way of obtaining information.
3. It is cheaper than personal interviewing method; here the cost per response is relatively low.
4. Recall is easy; callbacks are simple and economical.
5. There is a higher rate of response than what we have in mailing method; the non-response is
generally very low.
6. Replies can be recorded without causing embarrassment to respondents.
7. Interviewer can explain requirements more easily.
8. At times, access can be gained to respondents who otherwise cannot be contacted for one reason
or the other.
9. No field staff is required.
10. Representative and wider distribution of sample is possible
Interview Method Contd….
Telephone interviews
Demerits of collecting information through telephone interviews are:
1. Little time is given to respondents for considered answers; interview
period is not likely to exceed five minutes in most cases.
2. Surveys are restricted to respondents who have telephone facilities.
3. Extensive geographical coverage may get restricted by cost considerations.
4. It is not suitable for intensive surveys where comprehensive answers are
required to various questions.
5. Possibility of the bias of the interviewer is relatively more.
6. Questions have to be short and to the point; probes are difficult to handle.
COLLECTION OF DATA THROUGH QUESTIONNAIRES
 This method of data collection is quite popular, particularly in case of big
enquiries. It is being adopted by private individuals, research workers, private
and public organisations and even by governments.
 In this method a questionnaire is sent (usually by post) to the persons
concerned with a request to answer the questions and return the questionnaire.
A questionnaire consists of a number of questions printed or typed in a
definite order on a form or set of forms.
 The questionnaire is mailed to respondents who are expected to read and
understand the questions and write down the reply in the space meant for the
purpose in the questionnaire itself. The respondents have to answer the
questions on their own.
 The method of collecting data by mailing the questionnaires to respondents is
most extensively employed in various economic and business surveys.
COLLECTION OF DATA THROUGH QUESTIONNAIRES Contd…
 The opening questions should be such as to arouse human
interest. The following type of questions should generally
be avoided as opening questions in a questionnaire:
1. Questions that put too great a strain on the memory or
intellect of the respondent;
2. Questions of a personal character;
3. Questions related to personal wealth, etc.
Questionnaire Design
General Considerations
The first rule is to design the questionnaire to fit the medium.
Examples:
Multiple Choice
1. Where do you live?
 North
 South
 East
 West
Numeric Open End
2. How much did you spend on groceries this week? ……………..
Questionnaire Design
Text Open End
3. How can our company improve its working conditions?

Rating Scales and Agreement Scales are two types of questions that some researchers treat
as multiple choice questions and others treat as numeric open end questions.
Rating Scales
4. How would you rate this product?
 Excellent
 Good
 Fair
 Poor
5. On a scale where “10” means you have a great amount of interest in a subject and “1” means you
have none at all, how would you rate your interest in each of the following topics?
Domestic politics …
Foreign Affairs …
Science and Health …
Business …
Questionnaire Design

Agreement Scale
6. How much do you agree with each of the following statements?
S. No | Particulars | Strongly agree | Agree | Disagree | Strongly disagree
1 | My manager provides constructive criticism | | | |
2 | Our medical plan provides adequate coverage | | | |
3 | I would prefer to work longer hours on fewer days | | | |
A Sample Questionnaire
A study for telephone services company to find the expectations of customers using telephone booths at
Hyderabad and their profiles. The format of the questionnaire used in this study is presented below:
Questionnaire
Study on customer expectations and profiles of PCO booths at Hyderabad
Address of Telephone Booth:
Customer’s personal profile
1. Name :
2. Age :
a. Up to 17 years b. 18-24 years
c. 25-40 years d. 41-50 years
e. 51-60 years f. More than 60 years
3. Gender
a. Male …… b. Female …..
4. Monthly household income
a. Less than Rs. 10,000 b. Rs. 10,000 – 20,000
c. Rs. 20,000 – 30,000 d. Rs. 30,000 – 50,000 e. More than Rs. 50,000
5. Occupation
a. Service sector b. Government c. Public d. Private
e. Business f. Student / housewife g. Others (specify) ………
SOME OTHER METHODS OF DATA COLLECTION
Particularly used by big business houses in modern times.
1. Warranty cards: Warranty cards are usually postal-sized cards which are used by dealers of consumer durables
to collect information regarding their products. The consumer is asked to fill in the card and post it back to the dealer.
2. Distributor or store audits: These are performed by distributors as well as manufacturers through their salesmen at regular
intervals, to estimate market size, market share, seasonal purchasing pattern and so on. The data are obtained in
such audits not by questioning but by observation.
3. Pantry audit technique: It is used to estimate consumption of the basket of goods at the consumer level. The aim is to
find out what types of consumers buy certain products and certain brands, the assumption being that the contents
of the pantry accurately portray the consumer’s preferences.

4. Consumer panel: An extension of the pantry audit approach on a regular basis is known as a ‘consumer panel’,
where a set of consumers agree to maintain detailed daily records of their
consumption and to make them available to the investigator on demand.
5. Use of mechanical devices: Mechanical devices have been widely used to collect information by
indirect means. The eye camera, pupilometric camera, psychogalvanometer, motion picture camera and
audiometer are the principal devices so far developed and commonly used by modern big business houses, mostly
in the developed world, for the purpose of collecting the required information.
6. Projective techniques: Projective techniques (sometimes called indirect interviewing techniques)
for the collection of data play an important role in motivational research or in attitude surveys.
7. Depth interviews: Depth interviews are held to explore the needs, desires and feelings of respondents. Unless the
researcher has specialized training, depth interviewing should not be attempted.
8. Content analysis: Content analysis consists of analysing the contents of documentary materials such as books,
magazines, newspapers and the contents of all other verbal materials.
COLLECTION OF SECONDARY DATA
Secondary data means data that are already available i.e., they refer to the data which have
already been collected and analyzed by someone else.
When the researcher utilizes secondary data, then he has to look into various sources from
where he can obtain them.
Secondary data may either be published data or unpublished data.
Usually published data are available in:
a.Various publications of the central, state or local governments;
b.Various publications of foreign governments or of international bodies and their subsidiary
organizations;
c.Technical and trade journals;
d.Books, magazines and newspapers;
e.Reports and publications of various associations connected with business and industry,
banks, stock exchanges, etc.;
f.Reports prepared by research scholars, universities, economists, etc. in different fields;
g.Public records and statistics, historical documents, and other sources of published
information.
COLLECTION OF SECONDARY DATA Contd….
The sources of unpublished data are many: It may be found in diaries, letters, unpublished
biographies and autobiographies and also may be available with scholars and research workers,
trade associations, labour bureaus and other public/private individuals and organisations.
The researcher must be very careful in using secondary data. By way of caution, the researcher,
before using secondary data, must see that they possess the following characteristics:
1.Reliability of data: Reliability can be tested by finding out
(a) Who collected the data? (b) What were the sources of data?
(c) Were they collected by using proper methods? (d) At what time were they collected?
(e) Was there any bias of the compiler? (f) What level of accuracy was desired? Was it
achieved?
2.Suitability of data: The data that are suitable for one enquiry may not necessarily be found suitable
in another enquiry.
3.Adequacy of data: If the level of accuracy achieved in data is found inadequate for the purpose of
the present enquiry, they will be considered as inadequate and should not be used by the researcher.
From all this we can say that it is very risky to use the already available data. The already
available data should be used by the researcher only when he finds them reliable, suitable and
adequate.
Description and analysis of Data
 Technically speaking, description implies editing, coding, classification and
tabulation of collected data so that they are amenable to analysis.
 The term analysis refers to the computation of certain measures along with
searching for patterns of relationship that exist among data-groups.
 Thus, “in the process of analysis, relationships or differences supporting or
conflicting with original or new hypotheses should be subjected to statistical
tests of significance to determine with what validity data can be said to indicate
any conclusions”.
Editing: Checking the filled questionnaires. It is routine work, but
it has to be carried out with utmost care and devotion.
Coding: Reducing the mass of data into manageable proportions. It is an operation which requires
judgment and skill, particularly for developing the coding frame.
Classification and tabulation: Summarizing data into tabular form. Tabulation of data is a common tool,
used for summarizing the data so that they are amenable to interpretation.
Description Operations
Editing: Editing of data is a process of examining the collected raw data
(especially in surveys) to detect errors and omissions and to
correct these when possible. It involves a careful scrutiny of the
completed questionnaires and/or schedules.
 Field editing:
• Consists in the review of the reporting forms by the investigator for
completing (translating or rewriting) what was written in abbreviated
or illegible form at the time of recording the responses.
• This type of editing is necessary in view of the fact that individual
writing styles often can be difficult for others to decipher.

 Central editing:
• It should take place when all forms or schedules have been
completed and returned to the office. This type of editing implies
that all forms should get a thorough editing by a single editor in a
small study and by a team of editors in case of a large inquiry.
Description Operations Contd…..
Coding:
•Coding refers to the process of assigning numerals or other symbols to
answers so that responses can be put into a limited number of categories or
classes.
•Coding is necessary for efficient analysis and through it the several replies
may be reduced to a small number of classes which contain the critical
information required for analysis.
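
As a minimal sketch of coding, the Python snippet below assigns numerals to the answers of a rating question. The coding frame (4 = Excellent down to 1 = Poor, 0 for a blank or unrecognized answer) and the responses are hypothetical and are chosen only for illustration.

```python
# Hypothetical coding frame for a four-point rating question.
CODE_FRAME = {"Excellent": 4, "Good": 3, "Fair": 2, "Poor": 1}


def code_responses(responses, frame=CODE_FRAME, missing=0):
    """Replace each answer by its numeral; blank or unknown answers get `missing`."""
    return [frame.get(answer.strip().title(), missing) for answer in responses]


raw = ["Good", "excellent", "Poor", "", "fair"]   # hypothetical raw answers
coded = code_responses(raw)
print(coded)  # [3, 4, 1, 0, 2]
```

Keeping the coding frame in one place makes it easy to check that every reply falls into exactly one class.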

Classification:
•Most research studies result in a large volume of raw data which must be
reduced into homogeneous groups if we are to get meaningful relationships.
1.Classification according to attributes: Data are classified on the basis of common
characteristics which can either be descriptive (such as literacy, sex, honesty, etc.) or numerical
(such as weight, height, income, etc.).

2.Classification according to class-intervals: The numerical characteristics refer to quantitative
phenomena which can be measured through some statistical units. Data relating to income,
production, age, weight, etc. come under this category.
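
Classification according to class-intervals can be sketched in Python as follows. The income classes and the six income values are hypothetical; the lower bound of each class is treated as inclusive, which is one possible convention.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical class intervals for monthly income (Rs.), given by their lower bounds.
LOWER_BOUNDS = [0, 10_000, 20_000, 30_000, 50_000]
LABELS = ["below 10,000", "10,000-20,000", "20,000-30,000",
          "30,000-50,000", "50,000 and above"]


def classify(value):
    """Return the label of the class interval containing `value`."""
    if value < 0:
        raise ValueError("income cannot be negative")
    return LABELS[bisect_right(LOWER_BOUNDS, value) - 1]


incomes = [8_500, 12_000, 20_000, 47_000, 65_000, 15_500]   # hypothetical raw data
frequency = Counter(classify(x) for x in incomes)
for label in LABELS:
    print(f"{label:>18}: {frequency[label]}")
```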
Description Operations Contd…..
Tabulation: When a mass of data has been assembled, it becomes
necessary for the researcher to arrange the same in some kind
of concise and logical order. This procedure is referred to as
tabulation.

Tabulation is essential because of the following reasons:

1. It conserves space and reduces explanatory and descriptive
statement to a minimum.
2. It facilitates the process of comparison.
3. It facilitates the summation of items and the detection of
errors and omissions.
4. It provides a basis for various statistical computations.
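
A small two-way table illustrates tabulation. The sketch below counts hypothetical (gender, rating) records and arranges them in rows and columns with row totals; the records and category names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical coded records: (gender, rating) for each respondent.
records = [("Male", "Good"), ("Female", "Good"), ("Male", "Poor"),
           ("Female", "Excellent"), ("Female", "Good"), ("Male", "Good")]

cells = Counter(records)                    # frequency of each (row, column) pair
rows = sorted({gender for gender, _ in records})
cols = ["Excellent", "Good", "Poor"]


def tabulate(cells, rows, cols):
    """Build the table as a list of rows, with the row total in the last column."""
    table = []
    for r in rows:
        counts = [cells[(r, c)] for c in cols]
        table.append([r, *counts, sum(counts)])
    return table


table = tabulate(cells, rows, cols)
print(f"{'':8}" + "".join(f"{c:>10}" for c in cols + ["Total"]))
for r, *counts in table:
    print(f"{r:8}" + "".join(f"{n:>10}" for n in counts))
```

The row totals make it easy to compare groups and to spot omissions, since they must add up to the number of records.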
Need for Sampling
Sampling is used in practice for a variety of reasons such as:
1. Sampling can save time and money. A sample study is usually less
expensive than a census study and produces results at a relatively faster
speed.
2. Sampling may enable more accurate measurements, for a sample study is
generally conducted by trained and experienced investigators.
3. Sampling remains the only way when population contains infinitely many
members.
4. Sampling remains the only choice when a test involves the destruction of
the item under study.
5. Sampling usually enables us to estimate the sampling errors and, thus, assists
in obtaining information concerning some characteristic of the population.
Sample Design
The following are to be considered for a sample design:
i. Nature of universe: Universe may be either homogeneous or heterogeneous in
nature. If the items of the universe are homogeneous, a small sample can serve
the purpose. But if the items are heterogeneous, a large sample would be
required. Technically, this can be termed as the dispersion factor.
ii. Number of classes proposed: If many class-groups (groups and sub-groups)
are to be formed, a large sample would be required because a small sample
might not be able to give a reasonable number of items in each class-group.
iii. Nature of study: If items are to be intensively and continuously studied, the
sample should be small. For a general survey the size of the sample should
be large, but a small sample is considered appropriate in technical surveys.
iv. Type of sampling: Sampling technique plays an important part in
determining the size of the sample. A small random sample is apt to be much
superior to a larger but badly selected sample.
Sample Design Contd…
v. Standard of accuracy and acceptable confidence level: If the standard of
accuracy or the level of precision is to be kept high, we shall require a
relatively larger sample. For doubling the accuracy for a fixed significance
level, the sample size has to be increased fourfold.

vi. Availability of finance: In practice, size of the sample depends upon the
amount of money available for the study purposes. This factor should be
kept in view while determining the size of sample, for large samples
increase the cost of sampling estimates.

vii. Other considerations: Nature of units, size of the population, size of
questionnaire, availability of trained investigators, the conditions under
which the sample is being conducted, the time available for completion of
the study are a few other considerations to which a researcher must pay
attention while selecting the size of the sample.
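
The fourfold rule in item v can be checked numerically if accuracy is measured by the standard error of the sample mean, σ/√n. This is an assumption, since the slide does not name the measure, and it also assumes simple random sampling. The values of sigma and n below are hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of the sample mean for population SD `sigma` and sample size `n`."""
    return sigma / math.sqrt(n)


sigma = 12.0   # hypothetical population standard deviation
n = 100        # hypothetical initial sample size

se_n = standard_error(sigma, n)        # error with the original sample
se_4n = standard_error(sigma, 4 * n)   # error after a fourfold increase
print(se_n, se_4n, se_n / se_4n)       # the ratio shows the error is halved
```

Because the standard error falls with the square root of n, halving it requires four times as many items, and a tenfold gain would require a hundredfold sample.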
Role of Statistics for Data Analysis
 The role of statistics in research is to function as a tool in designing research, analysing its data
and drawing conclusions therefrom. Most research studies result in a large
volume of raw data which must be suitably reduced so that the same can be
read easily and can be used for further analysis. Clearly the science of
statistics cannot be ignored by any research worker.
 The important statistical measures that are used to summarize the
survey/research data are:
1. Measures of central tendency or statistical averages
2. Measures of dispersion
3. Measures of asymmetry (skewness)
4. Measures of relationship
Some Important Definitions
 A Population (Universe) is the whole collection of things under
consideration

 A Sample is a Portion of the population selected for analysis

 A Parameter is a Summary measure computed to describe the
characteristic of a population

 A Statistic is a Summary measure computed to describe the characteristic
of a sample
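
The four definitions above can be illustrated with a short Python sketch. The population of 1,000 ages, the sample size of 50 and the fixed random seed are all hypothetical choices made only so that the example runs repeatably.

```python
import random
import statistics

random.seed(42)  # fixed seed so the draw is repeatable

# Hypothetical population: ages of 1,000 customers.
population = [random.randint(18, 60) for _ in range(1000)]
sample = random.sample(population, 50)    # simple random sample, without replacement

parameter = statistics.mean(population)   # summary measure of the population
statistic = statistics.mean(sample)       # summary measure of the sample
print(f"population mean = {parameter:.2f}, sample mean = {statistic:.2f}")
```

The sample mean is used to draw an inference about the population mean, which in practice is usually unknown.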
Population and Sample

Population: Use parameters to summarize features
Sample: Use statistics to summarize features

Inference on the population from the sample


Types of Data

Data
 Categorical (Qualitative)
 Numerical (Quantitative)
  Discrete
  Continuous
Summary Measures

Central Tendency: Mean, Median, Mode, Geometric Mean
Quartile
Variation: Range, Variance, Standard Deviation, Coefficient of Variation
IMPORTANT STATISTICAL MEASURES
 Measures of Central Tendency(Statistical averages)
 Mean, Median, Mode, Geometric Mean, Harmonic Mean

 Quartiles
 Measure of Variation/dispersion
 Range, Semi Inter-quartile Range, Mean Deviation, Variance, Standard
Deviation and Coefficient of Variation
 Measures of Skewness / Shape (Measure Asymmetry)
 Symmetric, Skewed

 Measures of Kurtosis/Peakedness
 Leptokurtic / Platykurtic / Mesokurtic
Points of Central Tendency
 Measures of central tendency (or statistical averages) tell us the point about which
items have a tendency to cluster. Such a measure is considered as the most
representative figure for the entire mass of data. Measure of central tendency is also
known as statistical average. Mean, median and mode are the most popular averages.
Mean, also known as arithmetic average:

X̄ = ƩX / n

 Where X̄ = The symbol we use for mean (pronounced as X bar)
 Ʃ = Symbol for summation
 X = value of each item
 n = total number of items
 Median (M) is the value of the middle item of series when it is arranged in
ascending or descending order of magnitude.

 Mode is the most commonly or frequently occurring value in a series. The
mode in a distribution is that item around which there is maximum
concentration.
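
The three averages can be computed with Python's standard `statistics` module. The five values below are hypothetical and are picked so that the results are easy to verify by hand.

```python
import statistics

data = [4, 7, 7, 9, 13]             # hypothetical series, already in ascending order

mean = statistics.mean(data)        # (4 + 7 + 7 + 9 + 13) / 5 = 40 / 5 = 8
median = statistics.median(data)    # middle (third) item of the ordered series = 7
mode = statistics.mode(data)        # most frequently occurring value = 7

print(mean, median, mode)
```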
Dispersion/Variation is a measure of how spread out the data is around the
center of the data.

The Variation of the Data


Measures of variation are statistics of how far away the
values in the observations (data points) are from each other.
There are different measures of variation. The most
commonly used are:

•Range
•Standard Deviation
•Mean deviation

Measures of variation combined with an average (measure of
center) give a good picture of the distribution of the data.

Note: These measures of variation can only be calculated for
numerical data.
What is Dispersion in Statistics?
Dispersion is the state of getting dispersed or spread.
Statistical dispersion means the extent to which numerical
data is likely to vary about an average value. In other words,
dispersion helps to understand the distribution of the data.

Measures of Dispersion
In statistics, the measures of dispersion help to interpret the variability of data,
i.e. to know how homogeneous or heterogeneous the data are. In simple
terms, they show how squeezed or scattered the variable is.
The types of absolute measures of dispersion are:

1.Range: It is simply the difference between the maximum value and the
minimum value given in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6

2.Variance: Deduct the mean from each value in the set, square each difference,
add the squares and finally divide the sum by the total number of values in the data set to
get the variance. Variance (σ²) = ∑(X − μ)² / N

3.Standard Deviation: The square root of the variance is known as the standard
deviation, i.e. S.D. = σ = √(σ²).

4.Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.
5.Mean and Mean Deviation: The average of numbers is known as the mean and
the arithmetic mean of the absolute deviations of the observations from a measure
of central tendency is known as the mean deviation (also called mean absolute
deviation).
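
The five measures can be sketched in Python for the data set 1, 3, 5, 6, 7 used in the range example. This is only an illustration: the mean deviation is taken about the mean, and `statistics.quantiles` (Python 3.8+) uses one of several quartile conventions, so textbook quartile values may differ slightly.

```python
import math
import statistics

data = [1, 3, 5, 6, 7]                        # data set from the range example

data_range = max(data) - min(data)            # 7 - 1 = 6
mean = statistics.mean(data)                  # 22 / 5 = 4.4
variance = statistics.pvariance(data)         # population variance, divides by N
std_dev = math.sqrt(variance)                 # standard deviation = sqrt(variance)
q1, _, q3 = statistics.quantiles(data, n=4)   # first and third quartiles
quartile_deviation = (q3 - q1) / 2            # half the inter-quartile distance
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

print(data_range, variance, std_dev, quartile_deviation, mean_deviation)
```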
What is Variance?
Variance is a measure of dispersion. A measure of dispersion is a quantity that is
used to check the variability of data about an average value. Data can be of two
types - grouped and ungrouped. When data is expressed in the form of class
intervals it is known as grouped data. On the other hand, if data consists of
individual data points, it is called ungrouped data. The sample and population
variance can be determined for both kinds of data.
Variance Definition
Population Variance - All the members of a group are known as the population.
When we want to find how each data point in a given population varies or is
spread out then we use the population variance. It is used to give the squared
distance of each data point from the population mean.
Sample Variance - If the size of the population is too large then it is difficult to
take each data point into consideration. In such a case, a select number of data
points are picked up from the population to form the sample that can describe the
entire group. Thus, the sample variance can be defined as the average of the
squared distances from the sample mean, where the sum of squares is conventionally
divided by n − 1 rather than n so that it does not underestimate the population variance.
A general definition of variance is that it is the expected value of the squared
differences from the mean.
Variance Example
Suppose we have the data set {3, 5, 8, 1} and we want to find the population
variance. The mean is given as (3 + 5 + 8 + 1) / 4 = 4.25. Then by using the
definition of variance we get [(3 − 4.25)² + (5 − 4.25)² + (8 − 4.25)² + (1 − 4.25)²] /
4 = 26.75 / 4 ≈ 6.69. Thus, variance ≈ 6.69.
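As a quick check of this example, the sketch below computes both the population variance (divide by n) and the sample variance (divide by n − 1) for the same four values. It relies only on Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

data = [3, 5, 8, 1]

# Population variance: sum of squared deviations divided by n.
population_variance = statistics.pvariance(data)

# Sample variance: the same sum of squares divided by n - 1.
sample_variance = statistics.variance(data)

print(population_variance)        # 26.75 / 4
print(round(sample_variance, 4))  # 26.75 / 3
```

The sample variance is larger because the same sum of squares, 26.75, is divided by 3 instead of 4.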
Standard Deviation
Standard deviation is the positive square root of the variance. It is one of the
basic methods of statistical analysis. Standard deviation is commonly
abbreviated as SD and denoted by the symbol 'σ'. It tells how much the
data values deviate from the mean value. A low standard deviation
means that the values tend to be close to the mean, whereas a high standard
deviation tells us that the values are far from the mean value.

In descriptive statistics, standard deviation is the degree of dispersion or
scatter of the data points relative to their mean. It tells how the values are
spread across the data sample and is a measure of the variation of the
data points from the mean. The standard deviation of a data set, sample,
statistical population, random variable, or probability distribution is the
square root of its variance.
Standard Deviation of Ungrouped Data
The calculation of standard deviation differs with the kind of data. Standard
deviation measures how far the data deviate from their mean or average position.
There are three methods to find the standard deviation.
•Actual mean method
•Assumed mean method
•Step deviation method
Standard Deviation by The Actual Mean Method
In this method, we first compute the mean of the data values (x̄) and then
compute the deviations of each data value from the mean. Then we use the
following standard deviation formula by actual mean method:
σ = √(∑(x − x̄)²/n), where n = total number of observations.
Consider the data observations 3, 2, 5, 6. Here the mean of these data points is
(3 + 2 + 5 + 6)/4 = 16/4 = 4.
The sum of the squared differences from the mean = (3−4)² + (2−4)² + (5−4)² +
(6−4)² = 10
Variance = Sum of squared differences from the mean / number of data points = 10/4 = 2.5
Standard deviation = √2.5 ≈ 1.58
Standard deviation by Assumed Mean Method
When the x values are large, an arbitrary value (A) is chosen as the assumed mean
(as the computation of the actual mean is tedious in this case). The deviation from
this assumed mean is calculated as d = x − A. Then the standard deviation formula
by assumed mean method is:
σ = √[(∑d²/n) − (∑d/n)²]
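A short sketch can confirm that the assumed mean method gives the same answer as the actual mean method. It reuses the observations 3, 2, 5, 6 from the earlier example; the function names and the choice A = 5 are illustrative assumptions, and any other value of A should give the same result.

```python
import math

def sd_actual_mean(values):
    """Population SD: square root of the mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def sd_assumed_mean(values, assumed_mean):
    """Population SD from deviations d = x - A about an arbitrary value A."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)

observations = [3, 2, 5, 6]
sd_actual = sd_actual_mean(observations)
sd_assumed = sd_assumed_mean(observations, assumed_mean=5)  # A = 5 is arbitrary

print(round(sd_actual, 2), round(sd_assumed, 2))
```

With A = 5 the deviations are −2, −3, 0, 1, so ∑d² / n = 3.5 and (∑d / n)² = 1, which again gives a variance of 2.5.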
Standard Deviation by Step Deviation Method

The standard deviation of ungrouped data can also be calculated by the "step
deviation method". In this method also, some arbitrary data value is
chosen as the assumed mean, A. Then we calculate the deviations of all
data values by using d = x − A. The next step is to calculate the step
deviations (d') using d' = d/i where 'i' is a common factor of all 'd' values
(choose any common factor in case of multiple factors). Now, the
standard deviation by step deviation method is found
by the formula:
σ = √[(∑d'²/n) − (∑d'/n)²] × i, where 'n' is the total number of data
values.
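The step deviation formula can be sketched as follows. The data set 10, 20, 30, 40, 50, the assumed mean A = 30 and the common factor i = 10 are hypothetical values chosen for illustration; the result is compared with `statistics.pstdev`, which computes the population standard deviation directly.

```python
import math
import statistics

def sd_step_deviation(values, assumed_mean, step):
    """Population SD via step deviations d' = (x - A) / i, scaled back by i."""
    n = len(values)
    d_prime = [(x - assumed_mean) / step for x in values]
    mean_of_squares = sum(v * v for v in d_prime) / n
    square_of_mean = (sum(d_prime) / n) ** 2
    return math.sqrt(mean_of_squares - square_of_mean) * step

values = [10, 20, 30, 40, 50]
sigma_step = sd_step_deviation(values, assumed_mean=30, step=10)
sigma_direct = statistics.pstdev(values)

print(round(sigma_step, 4), round(sigma_direct, 4))
```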
Mean Deviation Definition

The mean deviation is defined as a statistical measure that is used to
calculate the average absolute deviation from the mean value of the given data set.
The mean deviation of the data values can be easily calculated using the
procedure below.
Step 1: Find the mean value for the given data values
Step 2: Now, subtract the mean value from each of the data values given
(Note: Ignore the minus symbol)
Step 3: Now, find the mean of those values obtained in step 2.
Mean Deviation Examples
Example 1:
Determine the mean deviation for the data values 5, 3,7, 8, 4, 9.
Solution:
Given data values are 5, 3, 7, 8, 4, 9.
The procedure to calculate the mean deviation is as follows.
First, find the mean for the given data:
Mean, µ = (5+3+7+8+4+9)/6
µ = 36/6
µ = 6
Therefore, the mean value is 6.
Now, subtract the mean from each data value, and ignore the minus sign if any
(i.e., take the absolute value):
|5 – 6| = 1
|3 – 6| = 3
|7 – 6| = 1
|8 – 6| = 2
|4 – 6| = 2
|9 – 6| = 3
Now, the obtained data set is 1, 3, 1, 2, 2, 3.
Finally, find the mean value for the obtained data set.
Therefore, the mean deviation is
= (1+3+1+2+2+3)/6
= 12/6
= 2
Hence, the mean deviation for 5, 3, 7, 8, 4, 9 is 2.
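The three steps of the procedure translate directly into code. The function name `mean_deviation` is an illustrative choice, not a standard library function.

```python
def mean_deviation(values):
    """Mean of the absolute deviations of the values from their mean."""
    mean = sum(values) / len(values)                   # Step 1: find the mean
    abs_deviations = [abs(x - mean) for x in values]   # Step 2: ignore the sign
    return sum(abs_deviations) / len(abs_deviations)   # Step 3: average them

result = mean_deviation([5, 3, 7, 8, 4, 9])
print(result)  # 2.0
```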
Skewness is a measure of the asymmetry of a distribution. A distribution is
asymmetrical when its left and right side are not mirror images.
A distribution can have right (or positive), left (or negative), or zero skewness. A
right-skewed distribution is longer on the right side of its peak, and a left-skewed
distribution is longer on the left side of its peak.
You might want to calculate the skewness of a distribution to:
•Describe the distribution of a variable alongside other descriptive statistics
•Determine if a variable is normally distributed. A normal distribution has zero
skew and is an assumption of many statistical procedures.
Shape of a Distribution
Describe How Data are Distributed
Measures of Shape: Symmetric or skewed

Left-Skewed:  Mean < Median < Mode
Symmetric:    Mean = Median = Mode
Right-Skewed: Mode < Median < Mean
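The ordering of mean, median and mode can be checked on a small example, together with a numerical skewness value. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment), which is one common definition among several; the data set and function name are hypothetical.

```python
import statistics

def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: a long tail towards large values.
right_skewed = [1, 2, 2, 2, 3, 4, 5, 9]

mean = statistics.fmean(right_skewed)
median = statistics.median(right_skewed)
mode = statistics.mode(right_skewed)
skew = moment_skewness(right_skewed)

print(mode, median, mean, round(skew, 3))  # mode < median < mean, skew > 0
```

For a perfectly symmetric data set such as 1, 2, 3, 4, 5 the same function returns zero.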
Estimates of Population
•In most statistical research studies, population parameters are usually unknown
and have to be estimated from a sample.

•The estimate of a population parameter may be one single value or it could be
a range of values. In the former case it is referred to as a point estimate, whereas
in the latter case it is termed an interval estimate.

•The random variables (such as X̄ and S²) used to estimate population
parameters (such as µ and σ²) are conventionally called ‘estimators’,
while specific values of these (such as X̄ = 105 or S² = 21.44) are referred to as
‘estimates’ of the population parameters.
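The difference between a point estimate and an interval estimate can be sketched as follows. The sample here is simulated from a hypothetical population (mean 105, standard deviation 5) purely for illustration, and the interval uses the large-sample normal approximation at the 95% level; for small samples a t-based interval would be the usual choice.

```python
import math
import random
import statistics

random.seed(0)
# Hypothetical sample; in a real study these would be the collected observations.
sample = [random.gauss(105, 5) for _ in range(100)]

# Point estimates: single values for the unknown mu and sigma^2.
x_bar = statistics.fmean(sample)
s_squared = statistics.variance(sample)

# Interval estimate: a range of plausible values for mu
# (95% level, large-sample normal approximation).
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * math.sqrt(s_squared / len(sample))
interval = (x_bar - margin, x_bar + margin)

print(round(x_bar, 2), round(s_squared, 2))
print(tuple(round(v, 2) for v in interval))
```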
Parametric and Non-Parametric Methods
•Statisticians have developed several tests of hypotheses (also known
as tests of significance), which can be classified as:
1. Parametric tests or Standard tests of hypotheses
2. Non-parametric tests or Distribution-free tests of hypotheses

•Parametric tests usually assume certain properties of the parent population
from which we draw samples. Assumptions such as that the observations come from a
normal population, that the sample size is large, and assumptions about population
parameters like the mean, variance, etc., must hold good before parametric tests
can be used.

•The important parametric tests are:
(1) z-test; (2) t-test; (3) χ²-test, and (4) F-test.
Parametric and Non-Parametric Methods Contd..
When the researcher cannot or does not want to
make such assumptions, we use
statistical methods for testing hypotheses which are
called non-parametric tests, because such tests do
not depend on any assumption about the
parameters of the parent population.
IMPORTANT NONPARAMETRIC OR DISTRIBUTION-FREE TESTS

Tests of hypotheses with ‘Order statistics’ or ‘Non-Parametric statistics’ or
‘Distribution-free statistics’ are known as nonparametric or distribution-free tests.

The following distribution-free tests are important and generally used:

i.Test of a hypothesis concerning some single value for the given data (such as one-
sample sign test).
ii.Test of a hypothesis concerning no difference among two or more sets of data (such as
two-sample sign test, Fisher-Irwin test, Rank sum test, etc.).
iii.Test of a hypothesis of a relationship between variables.
iv.Test of a hypothesis concerning variation in the given data i.e., test analogous to
ANOVA .
v.Tests of randomness of a sample based on the theory of runs viz., one sample runs
test.
vi.Test of hypothesis to determine if categorical data shows dependency or if two
classifications are independent viz., the chi-square test. The chi-square test can as well
be used to make comparison between theoretical populations and actual data when
categories are used.
Types of Parametric Tests for Hypothesis Testing
1. T-Test
1. It is a parametric test of hypothesis testing based on Student’s t-distribution.
2. It essentially tests the significance of the difference between mean values
when the sample size is small (i.e., less than 30) and when the population
standard deviation is not available.
3. Assumptions of this test:
•Population distribution is normal, and

•Samples are random and independent

•The sample size is small.

•Population standard deviation is not known.

4. Mann-Whitney ‘U’ test is a non-parametric counterpart of the T-test.


KEY TAKEAWAYS
•A t-test is an inferential statistic used to determine if there is a statistically
significant difference between the means of two groups.
•The t-test is a test used for hypothesis testing in statistics.
•Calculating a t-test requires three fundamental data values including the
difference between the mean values from each data set, the standard deviation
of each group, and the number of data values.
•T-tests can be dependent or independent.
Using a T-Test
Consider that a drug manufacturer tests a new medicine. Following standard
procedure, the drug is given to one group of patients and a placebo to
another group called the control group. The placebo is a substance with no
therapeutic value and serves as a benchmark to measure how the other
group, administered the actual drug, responds.
After the drug trial, the members of the placebo-fed control group
reported an increase in average life expectancy of three years, while the
members of the group who are prescribed the new drug reported an
increase in average life expectancy of four years.
Initial observation indicates that the drug is working. However, it is also
possible that the observation may be due to chance. A t-test can be used to
determine whether the observed difference is statistically significant and can be
generalized to the entire population.
Four assumptions are made while using a t-test: the data collected
follow a continuous or ordinal scale, such as the scores for an IQ test; the
data are collected from a randomly selected portion of the total population;
the data result in a normal, bell-shaped distribution; and the variances are
equal (homogeneous), which holds when the standard deviations of the samples are equal.
T-Test Formula
A T-test can be a:
One Sample T-test: To compare a sample mean with that of the population mean.
t = (x̄ − μ) / (s / √n)
where,
x̄ is the sample mean
s is the sample standard deviation
n is the sample size
μ is the population mean
Two-Sample T-test: To compare the means of two different samples.
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
where,
x̄1 is the sample mean of the first group
x̄2 is the sample mean of the second group
s1 is the sample-1 standard deviation
s2 is the sample-2 standard deviation
n1 and n2 are the sample sizes (both equal to n when the samples are the same size)
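A minimal sketch of the two t-statistics follows, assuming the usual forms t = (x̄ − μ)/(s/√n) for one sample and t = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2) for two samples with separate variances. The function names and data are hypothetical, and the sketch stops at the test statistic; the comparison with a t-table value is left out because the standard library has no t-distribution.

```python
import math
import statistics

def one_sample_t(sample, population_mean):
    """t = (x_bar - mu) / (s / sqrt(n))."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample SD, n - 1 in the denominator
    return (x_bar - population_mean) / (s / math.sqrt(n))

def two_sample_t(sample_1, sample_2):
    """t = (x_bar1 - x_bar2) / sqrt(s1^2/n1 + s2^2/n2), variances not pooled."""
    var_1 = statistics.variance(sample_1)
    var_2 = statistics.variance(sample_2)
    standard_error = math.sqrt(var_1 / len(sample_1) + var_2 / len(sample_2))
    return (statistics.fmean(sample_1) - statistics.fmean(sample_2)) / standard_error

# Hypothetical measurements for two groups.
treated = [2, 4, 4, 4, 5, 5, 7, 9]
control = [1, 3, 3, 3, 4, 4, 6, 8]

t_one = one_sample_t(treated, population_mean=4)
t_two = two_sample_t(treated, control)
print(round(t_one, 4), round(t_two, 4))
```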
Conclusion:

This calculated t-value is then compared against a value obtained from a
critical value table called the T-distribution table.

•If the value of the test statistic is greater than the table value -> Reject the null
hypothesis.
•If the value of the test statistic is less than the table value -> Do not reject the
null hypothesis.
2. Z-Test
1. It is a parametric test of hypothesis testing.
2. It is used to determine whether the means are different when the population
variance is known and the sample size is large (i.e., greater than 30).
3. Assumptions of this test:
•Population distribution is normal

•Samples are random and independent.

•The sample size is large.

•Population standard deviation is known.


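A Z-test can be:
One Sample Z-test: To compare a sample mean with that of the population mean.
z = (x̄ − μ) / (σ / √n)
where,
x̄ is the sample mean
μ is the population mean
σ is the population standard deviation
n is the sample size
Two Sample Z-test: To compare the means of two different samples.
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
where,
x̄1 is the sample mean of 1st group
x̄2 is the sample mean of 2nd group
σ1 is the population-1 standard deviation
σ2 is the population-2 standard deviation
n1 and n2 are the sample sizes (both equal to n when the samples are the same size)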
3. F-Test
1. It is a parametric test of hypothesis testing based on Snedecor's F-distribution.
2. It is a test for the null hypothesis that two normal populations have the same
variance.
3. An F-test is regarded as a test of the equality of variances, based on a
comparison of sample variances.
4. The F-statistic is simply a ratio of two variances.
5. It is calculated as:
F = s1²/s2²
6. By changing the variances in the ratio, the F-test becomes a very flexible test.
It can then be used to:
•Test the overall significance of a regression model,
•Compare the fits of different models, and
•Test the equality of means.
7. Assumptions of this test:
•Population distribution is normal, and
•Samples are drawn randomly and independently.
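The variance ratio itself is simple to compute. The sketch below forms F = s1²/s2² from two hypothetical samples using sample variances; it stops at the statistic, since judging significance requires an F-distribution table with (n1 − 1, n2 − 1) degrees of freedom, which is not part of the standard library.

```python
import statistics

def f_statistic(sample_1, sample_2):
    """Ratio of sample variances, F = s1^2 / s2^2, with its degrees of freedom."""
    var_1 = statistics.variance(sample_1)
    var_2 = statistics.variance(sample_2)
    return var_1 / var_2, len(sample_1) - 1, len(sample_2) - 1

# Hypothetical samples from two populations.
group_a = [1, 2, 3, 4, 5, 6, 7, 8]
group_b = [2, 4, 4, 4, 5, 5, 7, 9]

f_value, df_1, df_2 = f_statistic(group_a, group_b)
print(f_value, df_1, df_2)
```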


4. ANOVA
1. Also called as Analysis of variance, it is a parametric test of hypothesis testing.
2. It is an extension of the T-Test and Z-test.
3. It is used to test the significance of the differences in the mean values among
more than two sample groups.
4. It uses F-test to statistically test the equality of means and the relative variance
between them.
5. Assumptions of this test:
•Population distribution is normal,
•Samples are random and independent, and
•Homogeneity of sample variance.
6. One-way ANOVA and Two-way ANOVA are its types.
7. F-statistic = variance between the sample means / variance within the samples
•Analysis of variance, or ANOVA, is a statistical method that separates
observed variance data into different components to use for additional
tests.
•A one-way ANOVA is used for three or more groups of data, to gain
information about the relationship between the dependent and
independent variables.
•If no true variance exists between the groups, the ANOVA's F-ratio
should be close to 1.

The Formula for ANOVA is:
F = MST / MSE
where:
F=ANOVA coefficient
MST=Mean sum of squares due to treatment
MSE=Mean sum of squares due to error
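A one-way ANOVA F-ratio can be sketched from these definitions: MST is the between-group sum of squares divided by (k − 1), and MSE is the within-group sum of squares divided by (N − k), where k is the number of groups and N the total number of observations. The function name and the three groups below are hypothetical.

```python
def one_way_anova_f(groups):
    """F = MST / MSE for a one-way layout given as a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Sum of squares due to treatment (between groups).
    ss_treatment = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Sum of squares due to error (within groups).
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    mst = ss_treatment / (k - 1)
    mse = ss_error / (n_total - k)
    return mst / mse

# Hypothetical observations for three treatments.
f_ratio = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_ratio, 4))
```

Here the group means are 2, 3 and 6, giving MST = 26/2 = 13 and MSE = 6/6 = 1, so the F-ratio is far above 1.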
SIGNIFICANCE OF ANOVA

ANOVA tells you whether the group means are significantly different from each
other.

ANOVA works by partitioning the total variation in a data set into two segments:
variation between groups and variation within groups.
If the between-groups variation is significantly larger than expected by chance,
then there are actual differences in the group means.

ANOVA is useful for assessing the effects of different treatments, studying the
impact of factors, and examining contrasts between groups. It provides insights that
simple comparisons of means cannot.

The results of an ANOVA tell you whether there are any statistically significant
differences between groups but not exactly where or why the differences exist. Post
hoc tests are needed to determine precisely which group means differ.
ADVANTAGES OF ANOVA

•It allows comparisons of more than two group means simultaneously. Ordinary
t-tests can only compare two group means at a time. ANOVA can compare 3,4,5,
and more group means together.

•It is an efficient method of analysis. ANOVA compares all the group means in
a single test rather than through multiple t-tests. This makes the analysis
quicker and easier to perform.

•ANOVA results are easy to interpret. The F-statistic tells us if significant
differences exist between group means, and the p-value indicates the probability of
obtaining differences at least this large by chance if the group means were truly equal.

•It is a robust technique that works well even if assumptions are slightly violated.
Types of Non-parametric Tests

1. Chi-Square Test
1. It is a non-parametric test of hypothesis testing.
2. As a non-parametric test, chi-square can be used:
•as a test of goodness of fit.
•as a test of independence of two variables.
3. It helps in assessing the goodness of fit between a set of observed values and
those expected theoretically.
4. It makes a comparison between the expected frequencies and the observed
frequencies.
5. The greater the difference, the greater the value of chi-square.
6. If there is no difference between the expected and observed frequencies,
then the value of chi-square is equal to zero.
7. It is also known as the “Goodness of fit test” which determines whether a
particular distribution fits the observed data or not.
8. Chi-square is also used to test the independence of two variables.

Conditions for chi-square test:

•Observations are collected and recorded randomly.

•All the items in the sample must be independent.

•No group should contain very few items, say fewer than 10.

•The overall number of items must be reasonably large. Normally, it should be at
least 50, however small the number of groups may be.

9. Chi-square as a parametric test is used as a test for population variance based on sample variance.

10. If we take each one of a collection of sample variances, divide them by the known population
variance and multiply these quotients by (n−1), where n means the number of items in the sample,
we get the values of chi-square, i.e. χ² = (n − 1)s²/σ².
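Both uses of chi-square as a non-parametric test rest on the same statistic, the sum of (observed − expected)² / expected over all categories. The sketch below computes it for a goodness-of-fit case and for a test of independence; the die-roll counts and the 2×2 table are hypothetical, and the function names are illustrative.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over matching categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_under_independence(table):
    """Expected cell counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Goodness of fit: 120 hypothetical die rolls against a fair-die expectation of 20 each.
die_counts = [18, 22, 19, 21, 20, 20]
gof_stat = chi_square(die_counts, [20] * 6)

# Independence: hypothetical 2x2 contingency table of two classifications.
table = [[20, 30], [30, 20]]
expected = expected_under_independence(table)
independence_stat = sum(
    chi_square(row, exp_row) for row, exp_row in zip(table, expected)
)

print(round(gof_stat, 4), round(independence_stat, 4))
```

If observed and expected frequencies were identical, both statistics would be exactly zero, as item 6 states.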
Measures of Relationship
•So far we have dealt with those statistical measures that we use in the context of a
univariate population, i.e., a population consisting of measurements of
only one variable. When data are available on two or more variables, we often
want to know whether the variables are related.

For example: Whether the number of hours students devote to studies is
somehow related to their family income, to age, to gender or to similar
other factors.
There are several methods of determining the relationship between
variables, but no method can tell us for certain that a correlation is
indicative of a causal relationship.
Descriptive Statistics
Collect Data
E.g., Survey

Present Data
E.g., Tables and graphs

Characterize Data
E.g., Sample Mean X̄ = ∑Xi / n
Inferential Statistics
Analysis, particularly in case of survey or experimental data,
involves estimating the values of unknown parameters of the
population and testing of hypotheses for drawing inferences.

Descriptive analysis: Descriptive analysis is largely the study of distributions
of one variable. This study provides us with profiles of companies, work groups,
persons and other subjects on any of a multitude of characteristics such as size,
composition, efficiency, preferences.

Inferential analysis: Inferential analysis is often known as statistical analysis.
It is concerned with the various tests of significance for testing hypotheses in order to
determine with what validity data can be said to indicate some conclusion or conclusions.
It is also concerned with the estimation of population values.
Inferential Statistics
Estimation
E.g., Estimate the population mean
weight using the sample
mean weight
Hypothesis Testing
E.g., Test the claim that the
population mean weight is
120 pounds

Drawing conclusions and/or making decisions
concerning a population based on sample results.
What is a Hypothesis?
•A Hypothesis is a Claim (Assumption) about the Population Parameter
E.g., "I claim the mean GPA of this class is μ = 3.5!"
•Examples of parameters are population mean or proportion
•The parameter must be identified before analysis

The Null Hypothesis, H0
States the Assumption (Numerical) to be Tested
E.g., The mean GPA is 3.5 (H0: μ = 3.5)

Null Hypothesis is Always about a Population Parameter (H0: μ = 3.5),
Not about a Sample Statistic (H0: X̄ = 3.5)
Is the Hypothesis a Researcher Tries to Reject
The Null Hypothesis, H0 (continued)

Begin with the Assumption that the Null Hypothesis is True
Similar to the notion of innocent until proven guilty
Refer to the Status Quo
Always Contains the “=” Sign
The Null Hypothesis May or May Not be Rejected
The Alternative Hypothesis, H1
Is the Opposite of the Null Hypothesis
E.g., The mean GPA is NOT 3.5 (H1: μ ≠ 3.5)
Challenges the Status Quo
Never Contains the “=” Sign
The Alternative Hypothesis May or May Not Be
Accepted (i.e., The Null Hypothesis May or May
Not Be Rejected)
Is Generally the Hypothesis that the Researcher
Claims
Hypothesis Testing Process
1. Identify the population and assume the population mean GPA is 3.5 (H0: μ = 3.5).
2. Take a sample and compute its mean (X̄ = 2.4).
3. Ask: is X̄ = 2.4 likely if μ = 3.5? No, not likely!
4. Therefore, REJECT the null hypothesis.
Type I and Type II errors are defined in relation to the null hypothesis.
In a type I (type-1) error, the null hypothesis is rejected though it is
true, whereas in a type II (type-2) error, the null hypothesis is not rejected even
when the alternative hypothesis is true. A type I error is also known as a
“false positive” and a type II error as a “false negative”. A lot of statistical
theory revolves around the reduction of one or both of these errors; still, the total
elimination of both is a statistical impossibility.
Type I Error
A type I error occurs when the null hypothesis (H0) of an experiment is true,
but is still rejected. It asserts something that is not present: a false hit.
A type I error is often called a false positive (a result that shows that a given
condition is present when it is absent). In folk-tale terms, a person
may see the bear when there is none (raising a false alarm), where the null
hypothesis (H0) is the statement: “There is no bear”.
The type I error rate, or significance level, is the probability of rejecting the
null hypothesis given that it is true. It is represented by the Greek letter α (alpha)
and is also known as the alpha level. Usually, the significance level, i.e. the
probability of a type I error, is set to 0.05 (5%), meaning that a 5% probability of
incorrectly rejecting the null hypothesis is considered acceptable.
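The meaning of α as a long-run error rate can be illustrated by simulation: if the null hypothesis is true and a test at the 5% level is repeated many times, about 5% of the tests should reject it. The sketch below assumes a two-sided z-test with known σ; the population values (μ = 3.5, σ = 1), sample size and number of trials are hypothetical.

```python
import math
import random
import statistics

random.seed(1)

ALPHA = 0.05
MU_0, SIGMA, N = 3.5, 1.0, 30  # hypothetical population and sample size
critical = statistics.NormalDist().inv_cdf(1 - ALPHA / 2)

def rejects_h0(sample):
    """Two-sided z-test of H0: mu = MU_0 with known SIGMA."""
    z = (statistics.fmean(sample) - MU_0) / (SIGMA / math.sqrt(N))
    return abs(z) > critical

# Every sample is drawn with H0 true, so each rejection is a type I error.
trials = 2000
false_alarms = sum(
    rejects_h0([random.gauss(MU_0, SIGMA) for _ in range(N)])
    for _ in range(trials)
)
type_1_rate = false_alarms / trials
print(type_1_rate)  # expected to be close to ALPHA
```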
Type II Error
A type II error occurs when the null hypothesis is false but mistakenly fails to
be rejected. It fails to state what is present: a miss. A type II error is also
known as a false negative (where a real hit was rejected by the test and is
observed as a miss), in an experiment checking for a condition with a final
outcome of true or false.
A type II error is committed when a true alternative hypothesis is not
acknowledged. In other words, an examiner may miss discovering the bear
when in fact a bear is present (and hence fails to raise the alarm). Again, H0, the
null hypothesis, is the statement “There is no bear”; if a
bear is indeed present and goes undetected, that is a type II error on the part of
the investigator. Here, the bear either exists or does not exist in the given
circumstances; the question is whether it is correctly identified, the two possible
errors being failing to detect it when it is present and identifying it when it is
not present.
The rate of the type II error is represented by the Greek letter β (beta) and is
linked to the power of a test (which equals 1−β).
Type I and Type II Errors Example
The following real-life examples illustrate type I and type II
errors with respect to the null hypothesis.

Example 1: Let us consider a null hypothesis – A man is not guilty of a crime.
Then in this case:

Type I error (False Positive): He is convicted of the crime though he did not
commit it.
Type II error (False Negative): He is found not guilty though he actually did
commit the crime, so the guilty one goes free.

Example 2: Null hypothesis – A patient’s symptoms after treatment A are
the same as after a placebo.

Type I error (False Positive): Treatment A is concluded to be more effective
than the placebo even though it truly is not.
Type II error (False Negative): Treatment A is concluded to be no more effective
than the placebo even though it truly is more effective.
Reason for Rejecting H0
[Figure: sampling distribution of X̄ if H0 is true, centered at μ = 3.5, with the
observed sample mean 2.4 in the lower tail.]
It is unlikely that we would get a sample mean of this value (2.4) if in fact this
(μ = 3.5) were the population mean. Therefore, we reject the null hypothesis
that μ = 3.5.
General Steps in Hypothesis Testing

E.g., Test the Assumption that the True Mean # of TV Sets in
U.S. Homes is at Least 3 (σ Known)
1. State the H0: H0: μ ≥ 3
2. State the H1: H1: μ < 3
3. Choose α: α = .05
4. Choose n: n = 100
5. Choose Test: Z test
General Steps in Hypothesis Testing Contd…

6. Set up critical value(s): Reject H0 if Z < −1.645
7. Collect data: 100 households surveyed
8. Compute test statistic and p-value: Computed test stat = −2, p-value = .0228
9. Make statistical decision: Reject null hypothesis
10. Express conclusion: The true mean # of TV sets is less than 3
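The ten steps can be traced in code for the TV-set example. The slides give only the test statistic (−2) and the p-value, not the underlying sample mean or σ, so the values x̄ = 2.8 and σ = 1 below are assumptions chosen to reproduce a test statistic of −2 with n = 100.

```python
import math
from statistics import NormalDist

# Steps 1-5: H0: mu >= 3, H1: mu < 3, alpha = .05, n = 100, Z test.
mu_0 = 3
alpha = 0.05
n = 100

# Step 6: lower-tail critical value (about -1.645).
critical_value = NormalDist().inv_cdf(alpha)

# Step 7: collected data, summarised by assumed values (not given in the slides).
sample_mean = 2.8  # hypothetical
sigma = 1.0        # hypothetical known population standard deviation

# Step 8: test statistic and lower-tail p-value.
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
p_value = NormalDist().cdf(z)

# Steps 9-10: decision and conclusion.
reject_h0 = z < critical_value
print(round(z, 2), round(p_value, 4), reject_h0)
```

Rejecting because z falls below the critical value and rejecting because the p-value is below α are the same decision expressed in two ways.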
Level of Significance, α
•Defines Unlikely Values of Sample Statistic if Null
Hypothesis is True
Called rejection region of the sampling distribution
•Designated by α (level of significance)
Typical values are .01, .05, .10
•Selected by the Researcher at the Beginning
•Controls the Probability of Committing a Type I Error
•Provides the Critical Value(s) of the Test
Error in Making Decisions Contd…
Type II Error
Fail to reject a false null hypothesis
Probability of Type II Error is β
The power of the test is (1 − β)

Probability of Not Making Type I Error
(1 − α)
Called the Confidence Coefficient
Result Probabilities
H0: Innocent

Jury Trial
                 The Truth
Verdict          Innocent     Guilty
Innocent         Correct      Error
Guilty           Error        Correct

Hypothesis Test
                      The Truth
Decision              H0 True               H0 False
Do Not Reject H0      1 − α                 Type II Error (β)
Reject H0             Type I Error (α)      Power (1 − β)
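Level of Significance and the Rejection Region

H0: μ ≥ 3.5, H1: μ < 3.5: rejection region of area α in the lower tail, below the
critical value.
H0: μ ≤ 3.5, H1: μ > 3.5: rejection region of area α in the upper tail, above the
critical value.
H0: μ = 3.5, H1: μ ≠ 3.5: rejection regions of area α/2 in each tail, beyond the two
critical values.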
Type I & II Errors Have an Inverse Relationship

Reduce the probability of one error and the other one goes up, holding
everything else unchanged.


Factors Affecting Type II Error

•True Value of Population Parameter: β increases when the difference between
the hypothesized parameter and its true value decreases
•Significance Level α: β increases when α decreases
•Population Standard Deviation σ: β increases when σ increases
•Sample Size n: β increases when n decreases
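These four effects can be checked numerically for a lower-tail z-test (H0: μ ≥ μ0 against H1: μ < μ0) with known σ. In that setting β = 1 − Φ(z_α + (μ0 − μ_true)/(σ/√n)), where Φ is the standard normal CDF and z_α its α-quantile. The function name and all the numbers below are hypothetical.

```python
import math
from statistics import NormalDist

def type_2_error(mu_0, mu_true, sigma, n, alpha):
    """Beta for a lower-tail z-test of H0: mu >= mu_0 when the true mean is mu_true."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(alpha)  # negative for a lower-tail test
    shift = (mu_0 - mu_true) / (sigma / math.sqrt(n))
    power = std_normal.cdf(z_critical + shift)  # P(reject H0 | mu_true)
    return 1 - power

base = type_2_error(mu_0=3, mu_true=2.8, sigma=1, n=100, alpha=0.05)

# Change one factor at a time; beta should go up in every case.
closer_truth = type_2_error(3, 2.9, 1, 100, 0.05)   # smaller true difference
smaller_alpha = type_2_error(3, 2.8, 1, 100, 0.01)  # stricter significance level
larger_sigma = type_2_error(3, 2.8, 2, 100, 0.05)   # more population spread
smaller_n = type_2_error(3, 2.8, 1, 50, 0.05)       # fewer observations

print(round(base, 3), round(closer_truth, 3), round(smaller_alpha, 3),
      round(larger_sigma, 3), round(smaller_n, 3))
```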
How to Choose between Type I and Type II
Errors
Choice Depends on the Cost of the Errors
Choose Smaller Type I Error When the Cost of
Rejecting the Maintained Hypothesis is High
A criminal trial: convicting an innocent person
The Exxon Valdez: causing an oil tanker to sink
Choose Larger Type I Error When You Have an
Interest in Changing the Status Quo
A decision in a startup company about a new piece of software
A decision about unequal pay for a covered group
Less Variability
Standard Error (Standard Deviation) of the
Sampling Distribution of the Mean (σX̄) is Less Than the
Standard Error of Other Unbiased Estimators

[Figure: sampling distributions f(X̄) of the mean and of the median, both centered
at μ; the sampling distribution of the mean is narrower.]