Submitted To: Government College, Sector – 9, Gurugram
Submitted By: Anil Bisht
Class: B.Com (Hons), 4th Semester
College Roll No.: 1924
Registration No.: 17GU301062
University Roll No.: -
ACKNOWLEDGEMENT
Anil Bisht
STATISTICAL ANALYSIS
Whether you are working with large data volumes or running multiple permutations of
your calculations, statistical computing has become essential for today’s statistician.
Popular statistical computing practices include the following.
The first thing to do with any data is to summarise it, which means to
present it in a way that best tells the story.
The starting point is usually to group the raw data into categories, and/or to
visualise it. For example, if you think you may be interested in differences
by age, the first thing to do is probably to group your data in age categories,
perhaps ten- or five-year chunks.
Measures of Location: The Mean
When most people say average, they are talking about the mean. It
has the advantage that it uses all the data values obtained and can be
used for further statistical analysis. However, it can be skewed by
‘outliers’, values which are atypically large or small.
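As a made-up illustration, suppose the values 2, 3, 3, 4 and 38 are entered in
cells A2:A6 (a hypothetical range). The single outlier, 38, pulls the mean far
above the typical value:
=AVERAGE(A2:A6) returns 10, while =MEDIAN(A2:A6) returns 3.
Excel also offers =TRIMMEAN(A2:A6,0.4), which discards the top and bottom
20% of values before averaging and is therefore less sensitive to outliers.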
Measures of Spread:
Range, Variance and Standard Deviation
Researchers often want to look at the spread of the data, that is, how
widely the data are spread across the whole possible measurement
scale.
There are three measures which are often used for this:
The range is the difference between the largest and smallest values.
Researchers often quote the interquartile range, which is the range of
the middle half of the data, from 25%, the lower quartile, up to 75%,
the upper quartile, of the values (the median is the 50% value). To find
the quartiles, use the same procedure as for the median, but take the
quarter- and three-quarter-point instead of the mid-point.
The standard deviation measures the average spread around the mean,
and therefore gives a sense of the ‘typical’ distance from the mean.
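All three measures can be obtained directly with worksheet functions;
a sketch assuming the data sit in A2:A11 (a hypothetical range; the .S and
.INC forms are for Excel 2010 and later):
=MAX(A2:A11)-MIN(A2:A11) for the range
=QUARTILE.INC(A2:A11,3)-QUARTILE.INC(A2:A11,1) for the interquartile range
=VAR.S(A2:A11) for the sample variance
=STDEV.S(A2:A11) for the sample standard deviation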
Skew - The skew measures how symmetrical the data set is, or
whether it has more high values, or more low values. A sample with
more low values is described as negatively skewed and a sample with
more high values as positively skewed.
Generally speaking, the more skewed the sample, the less the mean,
median and mode will coincide.
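Excel provides a worksheet function for this as well; assuming the same
hypothetical range:
=SKEW(A2:A11)
A negative result indicates negative skew (more low values) and a positive
result indicates positive skew.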
MICROSOFT – EXCEL
Microsoft Excel is a spreadsheet developed by Microsoft for
Windows, macOS, Android and iOS. It features calculation,
graphing tools, pivot tables, and a macro programming language
called Visual Basic for Applications. It has been a very widely
applied spreadsheet for these platforms, especially since version
5 in 1993, and it has replaced Lotus 1-2-3 as the industry
standard for spreadsheets. Excel forms part of Microsoft Office.
Since Excel 2007, one worksheet contains 16,384 columns by 1,048,576 rows,
giving 17,179,869,184 cells. The last cell is named XFD1048576.
Microsoft Excel has the basic features of all spreadsheets, using a grid of
cells arranged in numbered rows and letter-named columns to organize data
manipulations like arithmetic operations. It has a battery of supplied
functions to answer statistical, engineering and financial needs. In addition,
it can display data as line graphs, histograms and charts, and with a very
limited three-dimensional graphical display.
It allows sectioning of data to view its dependencies on various factors for
different perspectives (using pivot tables and the scenario manager). It has
a programming aspect, Visual Basic for Applications, allowing the user to
employ a wide variety of numerical methods, for example, for solving
differential equations of mathematical physics, and then reporting the results
back to the spreadsheet.
It also has a variety of interactive features allowing user interfaces that can
completely hide the spreadsheet from the user, so the spreadsheet presents
itself as a so-called application, or decision support system (DSS), via a
custom-designed user interface, for example, a stock analyzer, or in
general, as a design tool that asks the user questions and provides answers
and reports.
In a more elaborate realization, an Excel application can automatically poll
external databases and measuring instruments using an update schedule,
analyze the results, make a Word report or PowerPoint slide show, and e-
mail these presentations on a regular basis to a list of participants. Excel
was not designed to be used as a database.
History - From its first version Excel supported end user programming
of macros (automation of repetitive tasks) and user defined functions
(extension of Excel's built-in function library). In early versions of Excel
these programs were written in a macro language whose statements had
formula syntax and resided in the cells of special purpose macro sheets
(stored with the file extension .XLM in Windows). XLM was the default macro
language for Excel through Excel 4.0. Beginning with version 5.0, Excel
recorded macros in VBA by default, but in version 5.0 XLM recording was
still allowed as an option. After version 5.0 that option was discontinued. All
versions of Excel, including Excel 2010, are capable of running an XLM
macro, though Microsoft discourages their use.
Excel is convenient for data entry, and for quickly manipulating rows and
columns prior to statistical analysis. However, when you are ready to do
the statistical analysis, we recommend the use of a statistical package
such as SAS, SPSS, Stata, Systat or Minitab.
To present the results, we will use a small example. The data for this
example is fictitious. It was chosen to have two categorical and two
continuous variables, so that we could test a variety of basic statistical
techniques. Since almost all real data sets have at least a few missing
data points, and since the ability to deal with missing data correctly is
one of the features that we take for granted in a statistical analysis
package, we introduced two empty cells in the data.
We used this data to do some simple analyses and compared the results with a
standard statistical package. The comparison considered the accuracy of the results as
well as the ease with which the interface could be used for bigger data sets - i.e. more
columns. We used SPSS as the standard, though any of the statistical packages OIT
supports would do equally well for this purpose. In this article when we say "a statistical
package," we mean SPSS, SAS, STATA, SYSTAT, or Minitab.
Most of Excel's statistical procedures are part of the Data Analysis ToolPak, which is in
the Tools menu. It includes a variety of choices including simple descriptive statistics, t-
tests, correlations, 1 or 2-way analysis of variance, regression, etc. If you do not have a
Data Analysis item on the Tools menu, you need to install the Data Analysis
ToolPak. Search in Help for "Data Analysis Tools" for instructions on loading the
ToolPak.
Two other Excel features are useful for certain analyses, but the Data Analysis ToolPak
is the only one that provides reasonably complete tests of statistical significance. Pivot
Table in the Data menu can be used to generate summary tables of means, standard
deviations, counts, etc. Also, you could use functions to generate some statistical
measures, such as a correlation coefficient. Functions generate a single number, so
using functions you will likely have to combine bits and pieces to get what you want.
Even so, you may not be able to generate all the parts you need for a complete
analysis.
Unless otherwise stated, all statistical tests using Excel were done with the Data
Analysis ToolPak. In order to check a variety of statistical tests, we chose the following
tasks:
Get means and standard deviations of X and Y for the entire group, and for each
treatment group.
Do a two sample t-test to test whether the two treatment groups differ on X and Y.
Do a paired t-test to test whether X and Y are statistically different from each other.
Compare the number of subjects with each outcome by treatment group, using a chi-
squared test.
All of these tasks are routine for a data set of this nature, and all of them could be easily
done using any of the above listed statistical packages; a function-based sketch of the
chi-squared test follows.
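For the chi-squared task, Excel also exposes a worksheet function outside the
ToolPak; a minimal sketch, assuming observed counts in B2:C3 and the
corresponding expected counts in E2:F3 (hypothetical ranges):
=CHISQ.TEST(B2:C3,E2:F3)
This returns the p-value of the test (in versions before Excel 2010 the
equivalent function is CHITEST).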
General Issues
Enable the Analysis ToolPak - The Data Analysis ToolPak is not installed with the
standard Excel setup. Look in the Tools menu. If you do not have a Data Analysis
item, you will need to install the Data Analysis tools. Search Help for "Data Analysis
Tools" for instructions.
Missing Values - A blank cell is the only way for Excel to deal with missing data. If
you have any other missing value codes, you will need to change them to blanks.
Data Arrangement - Different analyses require the data to be arranged in various
ways. If you plan on a variety of different tests, there may not be a single
arrangement that will work. You will probably need to rearrange the data several
ways to get everything you need.
Output location - The output from each analysis can go to a new sheet
within your current Excel file (this is the default), or you can place it
within the current sheet by specifying the upper left corner cell where
you want it placed. Either way is a bit of a nuisance. If each output is in
a new sheet, you end up with lots of sheets, each with a small bit of
output. If you place them in the current sheet, you need to place them
appropriately; leave room for adding comments and labels; changes you
need to make to format one output properly may affect another output
adversely. Example: Output from Descriptives has a column of labels
such as Standard Deviation, Standard Error, etc. You will want to make
this column wide in order to be able to read the labels. But if a simple
Frequency output is right underneath, then the column displaying the
values being counted, which may just contain small integers, will also be
wide.
Results of Analyses
Descriptive Statistics - The quickest way to get means and standard deviations
for an entire group is using Descriptives in the Data Analysis tools. We can choose
several adjacent columns for the Input Range (in this case the X and Y columns),
and each column is analyzed separately. The labels in the first row are used to label
the output, and the empty cells are ignored. If you have more, non-adjacent columns
you need to analyze, you will have to repeat the process for each group of
contiguous columns. The procedure is straightforward, can manage many columns
reasonably efficiently, and empty cells are treated properly.
Correlations - Using the Data Analysis tools, the dialog for correlations is much
like the one for descriptives - you can choose several contiguous columns, and get
an output matrix of all pairs of correlations. Empty cells are ignored appropriately.
The output does NOT include the number of pairs of data points used to compute
each correlation (which can vary, depending on where you have missing data), and
does not indicate whether any of the correlations are statistically significant. If you
want correlations on non-contiguous columns, you would either have to include the
intervening columns, or copy the desired columns to a contiguous location.
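A single correlation can also be computed as a function, which is handy when
only one pair of columns is of interest; a sketch assuming X in B2:B11 and Y in
C2:C11 (hypothetical ranges):
=CORREL(B2:B11,C2:C11)
The number of complete pairs, which the Correlation tool does not report,
would still have to be counted separately.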
Two-Sample T-test - This test can be used to check whether the two treatment
groups differ on the values of either X or Y. In order to do the test you need to enter
a cell range for each group. Since the data were not entered by treatment group, we
first need to sort the rows by treatment. Be sure to take all the other columns
along with treatment, so that the data for each subject remains intact. After the
data is sorted, you can enter the range of cells containing the X measurements for
each treatment. Do not include the row with the labels, because the second group
does not have a label row. Therefore your output will not be labeled to indicate that
this output is for X. If you want the output labeled, you have to copy the cells
corresponding to the second group to a separate column, and enter a row with a
label for the second group. If you also want to do the t-test for the Y measurements,
you'll need to repeat the process. The empty cells are ignored, and other than the
problems with labeling the output, the results are correct.
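The same test is available as a worksheet function, which avoids the sorting
and labelling issues; a sketch assuming the X values of group 1 sit in B2:B6
and those of group 2 in B7:B11 (hypothetical ranges):
=T.TEST(B2:B6,B7:B11,2,2)
The third argument requests a two-tailed test and the fourth a two-sample,
equal-variance test (use 3 for unequal variances); the result is the p-value.
In versions before Excel 2010 the function is called TTEST.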
Paired t-test - The paired t-test is a method for testing whether the difference
between two measurements on the same subject is significantly different from 0. In
this example, we wish to test the difference between X and Y measured on the same
subject. The important feature of this test is that it compares the measurements
within each subject. If you scan the X and Y columns separately, they do not look
obviously different. But if you look at each X-Y pair, you will notice that in every case,
X is greater than Y. The paired t-test should be sensitive to this difference. In the two
cases where either X or Y is missing, it is not possible to compare the two measures
on a subject. Hence, only 8 rows are usable for the paired t-test.
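The function form is again a possible alternative; a sketch assuming the
paired X and Y values sit in B2:B11 and C2:C11 (hypothetical ranges):
=T.TEST(B2:B11,C2:C11,2,1)
Here the fourth argument, 1, requests the paired test, and the result is the
two-tailed p-value. Note that T.TEST may return a #N/A error if the two ranges
contain different numbers of values, so rows with a missing X or Y would
first have to be removed.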
Additional Analyses
The remaining analyses were not done on this data set, but some comments about
them are included for completeness.
Simple Frequencies –
We can use Pivot Tables to get simple frequencies. (see
Crosstabulations for more about how to get Pivot Tables.) Using Pivot Tables, each
column is considered a separate variable, and labels in row 1 will appear on the
output. You can only do one variable at a time.
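Simple counts can also be produced with a function rather than a Pivot Table;
a sketch assuming gender codes are stored in C2:C31 (a hypothetical range):
=COUNTIF(C2:C31,"female")
This returns the number of cells in the range equal to the criterion; one
formula is needed per value being counted.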
Linear Regression - Since regression is one of the more frequently used statistical
analyses, we tried it out even though we did not do a regression analysis for this
example. The Regression procedure in the Data Analysis tools lets you choose one
column as the dependent variable, and a set of contiguous columns for the
independents. However, it does not tolerate any empty cells anywhere in the input
ranges, and you are limited to 16 independent variables. Therefore, if you have any
empty cells, you will need to copy all the columns involved in the regression to new
columns, and delete any rows that contain any empty cells. Large models, with more
than 16 predictors, cannot be done at all.
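The worksheet function LINEST is a partial alternative; a sketch assuming the
dependent variable is in B2:B31 and two predictors in C2:D31 (hypothetical
ranges), entered as an array formula (Ctrl-Shift-Enter in older versions of
Excel):
=LINEST(B2:B31,C2:D31,TRUE,TRUE)
With the fourth argument TRUE it returns the coefficients together with their
standard errors and R-squared, but it shares the Regression tool's intolerance
of empty cells in the input ranges.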
Analysis of Variance
In general, Excel's ANOVA features are limited to a few special cases rarely
found outside textbooks, and require lots of data re-arrangements.
One-way ANOVA - Data must be arranged in separate and adjacent columns (or
rows) for each group. Clearly, this is not conducive to doing 1-ways on more than
one grouping. If you have labels in row 1, the output will use the labels.
Two-Factor ANOVA Without Replication - This only does the case with one
observation per cell (i.e. no Within Cell error term). The input range is a
rectangular arrangement of cells, with rows representing levels of one factor,
columns the levels of the other factor, and the cell contents the one value in that
cell.
Using a statistical program, the data would normally be arranged with the rows
representing the subjects, and the columns representing variables (as they are in our
sample data). With this arrangement you can do any of the analyses discussed here,
and many others as well, without having to sort or rearrange your data in any way.
Only much more complex analyses, beyond the capabilities of Excel and the scope
of this article, would require data rearrangement.
Working with Many Columns - What if your data had not 4, but
40 columns, with a mix of categorical and continuous measures? How
easily do the above procedures scale to a larger problem?
There are some shortcuts to move within the current sheet:
Home - moves to the first column in the current row
End - Right Arrow - moves to the last filled cell in the current row
End - Down Arrow - moves to the last filled cell in the current column
Ctrl-Home - moves to cell A1
Ctrl-End - moves to the last cell in your document (not the last cell of the
current sheet)
Ctrl-Shift-End - selects everything between the active cell and the last cell
in the document
Entering data - We can type anything in a cell; in general, you can enter
text (or labels), numbers, formulas (starting with the = sign), and logical
values (TRUE or FALSE).
Click on a cell and start typing; once you finish typing, press Enter (to move
to the next cell below) or Tab (to move to the next cell to the right).
We can write a long sentence in a single cell, but it may be only partially
visible, depending on the column width (and whether the adjacent
column is full). To adjust the width of a column, go to Format - Column -
Width, or select AutoFit Selection.
Numbers are assumed to be positive; if we need to enter a negative value,
use the minus sign (-) or enclose the number in parentheses, as in (number).
If we need to enter percentages, a dollar sign, or any other symbol to identify
the number, just add the % or $. We can also enter the number and change
its format using the menu: Format - Cell, then select the Number tab, which
has all the different formats.
Dates are automatically stored as mm/dd/yyyy (or the default format if
changed), but there is some flexibility here. Enter a month and a day, and
Excel will enter the date in the default format. If we press Ctrl and ; (Ctrl-;),
Excel will enter the current date.
Time is also entered in a default format. Enter 5 pm, and Excel will write 5:00
PM. To enter the current time, press Ctrl and : (Ctrl-:).
To practice, enter the following table (these data are made up, not real).
Each column has a list of items. Column A has IDs, column B has last
names of students and so on.
Let's say, for example, we do not want capital letters in the columns Last
Name and First Name: we do not want SMITH, we want Smith. There are two
options: we can re-type all the names, or we can use the following
formula (IMPORTANT: all formulas start with the equal (=) sign):
=PROPER(cell with the text you want to change)
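For example, assuming SMITH sits in cell B2 of the hypothetical table above,
entering the following in an empty column returns Smith, and the formula can
then be filled down for the rest of the names:
=PROPER(B2)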
The full table should look like this. It is a made-up table, just a
collection of random information and data.
Exploring data in Excel
Descriptive statistics (using Excel's Data Analysis tool)
Generally, one of the first things to do with new data is to get to know it by
asking some general questions, such as (but not limited to) the following:
What variables are included? What information are we getting?
After looking at the data you may want to know
How many males/females?
What is the average SAT score? Is it the same for graduates and
undergraduates?
Who reads the newspaper more frequently: men or women?
We can start answering some of these questions by looking directly at the
table, for some other questions we may have to do some calculations by
obtaining a set of descriptive statistics. These statistics are a collection of
measurements of two things: location and variability. Location tells you the
central value (the mean is the most common measure of this) of your
variables. Variability refers to the spread of the data from the center value
(i.e. variance, standard deviation). Statistics is basically the study of what
causes such variability.
Location        Variability
Mean            Variance
Mode            Standard deviation
Median          Range
Let's check this window:
Input Range: This is to select the data you want to analyze.
Once you click in the input range you need to select the cells you want to
analyze.
Back to the window
Since we included the labels in the first row, make sure to check that option. For
the output option, which is the place where Excel will enter the results, select
O1, or you can select a new worksheet or even a new workbook.
Check Summary statistics and then press OK. We will get the following:
While the whole descriptive-statistics range is selected, go to Format - Cells
to change all numbers to show one decimal place. In the Format
Cells window, select the following:
Click OK. All numbers should now have one decimal as follows:
Now we know something about our data.
The average student in this sample is 25.2 years old, has a SAT score of
1848.9, got a grade of 80.4, is 66.4 inches tall and reads the newspaper
4.9 times a week. We know this by looking at the mean value on each
variable.
The mean is the sum of the observations divided by the total number of
observations. It is the most common indicator of central tendency of a
variable. If we look at the last two rows, Sum and Count, we can compute
the mean by dividing Sum by Count (sum/count). We can also calculate the
mean using the function below (IMPORTANT: all functions start with the equal =
sign):
=AVERAGE(range of cells with the values of interest)
For age: =AVERAGE(J2:J31)
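The same value can be reproduced from the Sum and Count rows directly,
which is a useful cross-check:
=SUM(J2:J31)/COUNT(J2:J31)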
Sum refers to the sum of all the values in a range. For age, it means
the sum of the ages of all students. The Excel function for sum is:
=SUM(range of cells with the values of interest)
Count refers to the number of cells that contain values (numbers). The function
is:
=COUNT(range of cells with the values of interest)
Min is the lowest value in an array of values. The function is:
=MIN(range of cells with the values of interest)
Max is the largest value in an array of values. The function is:
=MAX(range of cells with the values of interest)
The pivot wizard will walk you through the process; this is the first window.
Press Next. In step 2, select the range covering all the values, as in the
following picture:
In step 3, select New worksheet and press Layout.
This is where you make the pivot table:
On the right side of the wizard layout we can see the list of all variables in
the data. Click and drag Gender into the ROW area. Click and drag
Major into the COLUMN area, and click and drag Sat score into the
DATA area. The wizard layout should look like this:
In the DATA area, double-click on Sum of Sat score; a new window will pop
up. Select Average and click OK.
The wizard layout should look like this. Click OK; in the wizard window, step
3, click Finish.
In a new worksheet you will see the following (the pivot table window was
moved to save some space).
This is a crosstabulation between gender and major. Each cell represents
the average SAT score for a student according to gender and major. For
example, a female student with an econ major has an average SAT score of
1952 (cell B5 in the picture), while a male student, also with an econ major,
has 1743 (B6). Overall, econ major students have an average SAT score of
1806 (B7). In general, female students in this sample have an average SAT
score of 1871.8 (E5), while male students have 1826 (E6).
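The individual cells of this crosstabulation can be reproduced with a function
as a cross-check; a sketch assuming SAT scores in D2:D31, gender in C2:C31
and major in E2:E31 (hypothetical ranges):
=AVERAGEIFS(D2:D31,C2:C31,"female",E2:E31,"econ")
This should return the same 1952 shown in cell B5 of the pivot table.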
STATISTICAL INFERENCE
SAMPLING
Sampling is a process used in statistical analysis in which a predetermined
number of observations are taken from a larger population. The methodology
used to sample from a larger population depends on the type of analysis being
performed but may include simple random sampling or systematic sampling.
METHODS OF SAMPLING
It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the
cost and workload, and may make it easier to obtain high quality information, but this
has to be balanced against having a large enough sample size with enough power to
detect a true association. (Calculation of sample size is addressed in section 1B
(statistics) of the Part A syllabus.)
If a sample is to be used, by whatever method it is chosen, it is important that the
individuals selected are representative of the whole population. This may involve
specifically targeting hard to reach groups. For example, if the electoral roll for a town
was used to identify participants, some people, such as the homeless, would not be
registered and therefore excluded from the study by default.
There are several different sampling techniques available, and they can be subdivided
into two groups: probability sampling and non-probability sampling. In probability
(random) sampling, you start with a complete sampling frame of all eligible individuals
from which you select your sample. In this way, all eligible individuals have a chance of
being chosen for the sample, and you will be more able to generalise the results from
your study. Probability sampling methods tend to be more time-consuming and
expensive than non-probability sampling. In non-probability (non-random) sampling, you
do not start with a complete sampling frame, so some individuals have no chance of
being selected. Consequently, you cannot estimate the effect of sampling error and there
is a significant risk of ending up with a non-representative sample which produces non-
generalisable results. However, non-probability sampling methods tend to be cheaper
and more convenient, and they are useful for exploratory research and hypothesis
generation.
1. Simple random sampling
In this case each individual is chosen entirely by chance and each member of the
population has an equal chance, or probability, of being selected. One way of obtaining a
random sample is to give each individual in a population a number, and then use a table
of random numbers to decide which individuals to include.1 For example, if you have a
sampling frame of 1000 individuals, labelled 0 to 999, use groups of three digits from
the random number table to pick your sample. So, if the first three numbers from the
random number table were 094, select the individual labelled “94”, and so on.
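A minimal sketch of this procedure in Excel, assuming the 1000 individuals
are labelled 0 to 999: entering the formula below in a cell draws one random
label, and it can be filled down to draw more.
=RANDBETWEEN(0,999)
Because RANDBETWEEN can repeat a value, any duplicate labels must be
discarded and redrawn so that no individual is selected twice.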
As with all probability sampling methods, simple random sampling allows the sampling
error to be calculated and reduces selection bias. A specific advantage is that it is the
most straightforward method of probability sampling. A disadvantage of simple random
sampling is that you may not select enough individuals with your characteristic of
interest, especially if that characteristic is uncommon. It may also be difficult to define a
complete sampling frame, and inconvenient to contact the selected individuals, especially if different forms
of contact are required (email, phone, post) and your sample units are scattered over a
wide geographical area.
2. Systematic sampling
Individuals are selected at regular intervals from the sampling frame. The intervals are
chosen to ensure an adequate sample size. If you need a sample size n from a
population of size x, you should select every x/nth individual for the sample. For
example, if you wanted a sample size of 100 from a population of 1000, select every
1000/100 = 10th member of the sampling frame.
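A minimal Excel sketch of this example, assuming the sampling frame is listed
in A2:A1001 and cell D1 holds a random starting point =RANDBETWEEN(1,10):
entering the formula below and filling it down 100 rows picks out every 10th
member, starting from that random offset.
=INDEX($A$2:$A$1001,10*(ROW(A1)-1)+$D$1)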
Systematic sampling is often more convenient than simple random sampling, and it is
easy to administer. However, it may also lead to bias, for example if there are
underlying patterns in the order of the individuals in the sampling frame, such that the
sampling technique coincides with the periodicity of the underlying pattern. As a
hypothetical example, if a group of students were being sampled to gain their opinions
on college facilities, but the Student Record Department’s central list of all students was
arranged such that the sex of students alternated between male and female, choosing
an even interval (e.g. every 20th student) would result in a sample of all males or all
females. Whilst in this example the bias is obvious and should be easily corrected, this
may not always be the case.
3. Stratified sampling
In this method, the population is first divided into subgroups (or strata) who all share a
similar characteristic. It is used when we might reasonably expect the measurement of
interest to vary between the different subgroups, and we want to ensure representation
from all the subgroups. For example, in a study of stroke outcomes, we may stratify the
population by sex, to ensure equal representation of men and women. The study sample
is then obtained by taking equal sample sizes from each stratum. In stratified sampling,
it may also be appropriate to choose non-equal sample sizes from each stratum. For
example, in a study of the health outcomes of nursing staff in a county, if there are
three hospitals each with different numbers of nursing staff (hospital A has 500 nurses,
hospital B has 1000 and hospital C has 2000), then it would be appropriate to choose
the sample numbers from each hospital proportionally (e.g. 10 from hospital A, 20 from
hospital B and 40 from hospital C). This ensures a more realistic and accurate estimation
of the health outcomes of nurses across the county, whereas simple random sampling
would over-represent nurses from hospitals A and B. The fact that the sample was
stratified should be taken into account at the analysis stage.
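The proportional allocation in this example is easy to reproduce; a sketch
assuming the three staff counts 500, 1000 and 2000 are entered in B2:B4 and
that a total sample of 70 is wanted (both hypothetical choices):
=ROUND(70*B2/SUM($B$2:$B$4),0)
Entered beside B2 and filled down to the row for hospital C, this returns 10,
20 and 40, matching the figures above.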
Stratified sampling improves the accuracy and representativeness of the results by
reducing sampling bias. However, it requires knowledge of the appropriate
characteristics of the sampling frame (the details of which are not always available), and
it can be difficult to decide which characteristic(s) to stratify by.
4. Clustered sampling
In a clustered sample, subgroups of the population are used as the sampling unit, rather
than individuals. The population is divided into subgroups, known as clusters, which are
randomly selected to be included in the study. Clusters are usually already defined, for
example individual GP practices or towns could be identified as clusters. In single-stage
cluster sampling, all members of the chosen clusters are then included in the study. In
two-stage cluster sampling, a selection of individuals from each cluster is then randomly
selected for inclusion. Clustering should be taken into account in the analysis. The
General Household survey, which is undertaken annually in England, is a good example
of a (one-stage) cluster sample. All members of the selected households (clusters) are
included in the survey.1
Cluster sampling can be more efficient than simple random sampling, especially where a
study takes place over a wide geographical region. For instance, it is easier to contact
lots of individuals in a few GP practices than a few individuals in many different GP
practices. Disadvantages include an increased risk of bias, if the chosen clusters are not
representative of the population, resulting in an increased sampling error.
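Random selection of clusters can also be sketched in Excel, assuming cluster
names in A2:A51 and that 5 clusters are wanted (hypothetical figures): put
=RAND() in B2 and fill down to B51, then fill the formula below down five
rows; the clusters paired with the five smallest random numbers form the
sample.
=INDEX($A$2:$A$51,MATCH(SMALL($B$2:$B$51,ROW(A1)),$B$2:$B$51,0))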
1. Convenience sampling
2. Quota sampling
This method of sampling is often used by market researchers. Interviewers are given a
quota of subjects of a specified type to attempt to recruit. For example, an interviewer
might be told to go out and select 20 adult men, 20 adult women, 10 teenage girls and
10 teenage boys so that they could interview them about their television viewing. Ideally
the quotas chosen would proportionally represent the characteristics of the underlying
population.
Whilst this has the advantage of being relatively straightforward and potentially
representative, the chosen sample may not be representative of other characteristics
that weren't considered (a consequence of the non-random nature of sampling).2
3. Judgement (or purposive) sampling
Also known as selective, or subjective, sampling, this technique relies on the judgement
of the researcher when choosing who to ask to participate. Researchers may thus
implicitly choose a "representative" sample to suit their needs, or specifically approach
individuals with certain characteristics. This approach is often used by the media when
canvassing the public for opinions, and in qualitative research.
4. Snowball sampling
Bias in sampling
There are five important potential sources of bias that should be considered when
selecting a sample, irrespective of the method used.1
Advantages of sampling
Sampling ensures convenience, collection of intensive and
exhaustive data, suitability where resources are limited, and better rapport.
In addition, sampling has the following advantages.
5. Organization of convenience
Organizational problems involved in sampling are very few. Since a
sample is of a small size, vast facilities are not required. Sampling is
therefore economical in respect of resources. Study of samples
involves less space and equipment.
8. Better rapport
An effective research study requires a good rapport between the
researcher and the respondents. When the population of the study
is large, the problem of rapport arises. But manageable samples
permit the researcher to establish adequate rapport with the
respondents.
Disadvantages of sampling
The reliability of the sample depends upon the appropriateness of
the sampling method used. The purpose of sampling theory is to
make sampling more efficient. But the real difficulties lie in
selection, estimation and administration of samples.
Chances of bias
Difficulties in selecting a truly representative sample
Need for subject-specific knowledge
Changeability of sampling units
Impossibility of sampling
1. Chances of bias
The serious limitation of the sampling method is that it involves
biased selection and thereby leads us to draw erroneous
conclusions. Bias arises when the method of selection of sample
employed is faulty. Relatively small samples properly selected may be
much more reliable than large samples poorly selected.
4. Changeability of sampling units
Some of the cases in the sample may not cooperate with the researcher,
and some others may be inaccessible. Because of these problems, all
the cases may not be taken up. The selected cases may have to be
replaced by other cases. Changeability of units stands in the way of the
results of the study.
5. Impossibility of sampling
Deriving a representative sample is difficult when the universe is
too small or too heterogeneous. In this case, a census study is the only
alternative. Moreover, in studies requiring a very high standard of
accuracy, the sampling method may be unsuitable. There will be
chances of errors even if samples are drawn most carefully.
SAMPLING VS NON-SAMPLING ERROR
Sampling error is one which occurs due to the unrepresentativeness of the
sample selected for observation. Conversely, non-sampling error is an
error that arises from human error, such as an error in problem identification,
in the method, or in the procedure used.
An ideal research design seeks to control various types of error, but there are
some potential sources which may affect it. In sampling theory, total error can
be defined as the variation between the mean value of population parameter
and the observed mean value obtained in the research. The total error can be
classified into two categories, i.e. sampling error and non-sampling error.
The important differences between
sampling and non-sampling error are detailed below.
Comparison Chart
BASIS FOR COMPARISON    SAMPLING ERROR    NON-SAMPLING ERROR
Definition of Sampling Error
The main reason behind sampling error is that the sampler draws various
sampling units from the same population but, the units may have individual
variances. Moreover, they can also arise out of defective sample design, faulty
demarcation of units, wrong choice of statistic, or substitution of a sampling unit
by the enumerator for their convenience. Therefore, it is considered
the deviation between the true mean value of the original sample and that of
the population.
Definition of Non-Sampling Error
o Researcher Error: Surrogate Error, Sampling Frame Error, Measurement
Error, Data Analysis Error, Population Definition Error
o Respondent Error: Inability Error, Unwillingness Error
o Interviewer Error: Questioning Error, Recording Error, Respondent
Selection Error, Cheating Error
o Non-Response Error: error arising because some respondents who are a
part of the sample do not respond.
To end this discussion, it is fair to say that sampling error is closely
related to the sampling design and can be reduced by expanding
the sample size. Conversely, non-sampling error is a basket that covers all the
errors other than the sampling error and so is unavoidable by nature, as it is
not possible to remove it completely.