UGC Net Statistics
We may define statistics either in a singular sense or in a plural sense. Statistics, when used as
a plural noun, may be defined as data, qualitative as well as quantitative, that are collected,
usually with a view to statistical analysis. However, statistics, when used as a singular
noun, may be defined as the scientific method that is employed for collecting, analyzing and
presenting data, leading finally to statistical inferences about some important
characteristics; in this sense it is the 'science of counting' or the 'science of averages'.
According to Croxton and Cowden, "Statistics may be defined as the collection, presentation,
analysis and interpretation of numerical data."
❖ Data:
Data are the values of subjects with respect to qualitative or quantitative variables. We may
define 'data' as quantitative information about some particular characteristic(s) under
consideration. Although a distinction can be made between a qualitative and a quantitative
characteristic, so far as statistical analysis is concerned, we need to convert qualitative
information into quantitative information by giving the characteristic a numeric description.
❖ Data analysis:
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the
goal of discovering useful information, informing conclusions, and supporting decision-making.
Data analysis has multiple facets and approaches, encompassing diverse techniques under a
variety of names, while being used in different business, science, and social science domains.
❖ Data type:
Understanding the different data types, also called measurement scales, is a crucial
prerequisite for doing Exploratory Data Analysis (EDA), since certain statistical measurements
can be used only for specific data types. Nominal, Ordinal, Interval and Ratio are the four
fundamental levels of measurement scales that are used to capture data.
❖ Nominal Scale:
Nominal Scale is used for labeling variables into distinct classifications and doesn't involve
a quantitative value or order. The nominal scale is the most fundamental research scale.
Nominal scale is often used in research surveys and questionnaires where only variable
labels hold significance. For instance, a customer survey asking “Which brand of
smartphones do you prefer?” Options: “Apple”- 1, “Samsung”-2, “OnePlus”-3.
❖ Ordinal Scale:
Ordinal Scale is defined as a variable measurement scale used to simply depict the order of
variables and not the difference between each of the variables. These scales are generally
used to depict non-mathematical ideas such as frequency, satisfaction, happiness, a degree
of pain etc. Example - How satisfied are you with our services? 1- Very Unsatisfied 2-
Unsatisfied 3- Neutral 4- Satisfied 5- Very Satisfied.
❖ Interval Scale:
Interval Scale is defined as a numerical scale where the order of the variables is known as
well as the difference between these variables. Interval scale contains all the properties of
ordinal scale, in addition to which, it offers a calculation of the difference between
variables. For example - temperature scale or time scale.
❖ Ratio Scale:
Ratio scales are the most powerful of the measurement scales and provide a wealth of
possibilities when it comes to statistical analysis. These variables can be meaningfully
added, subtracted, multiplied and divided (ratios). In addition, everything above
about interval data applies to ratio scales. Good examples of ratio variables include height
and weight.
Collection of data plays a very important role in any statistical analysis. The data which are
collected for the first time by an investigator or agency are known as primary data, whereas
the data are known as secondary if, having already been collected, they are used by a
different person or agency. Example - if Mr. C collects the data on the height of every student
in his class, then these would be primary data for him. If, however, another person, say, Mr. D
uses the data, as collected by Mr. C, for finding the average height of the students belonging
to that class, then the data would be secondary for Mr. D.
The following methods are employed for the collection of primary data:
(i) Mailed questionnaire method – A wide area can be covered using the mailed questionnaire
method, but the number of non-responses is also likely to be highest with this method.
(ii) Observation method – Data on height, weight etc. can be collected by direct observation.
(iii) Questionnaires filled and sent by enumerators – Enumerators collect information directly
by interviewing the persons having the information; questions are explained and the data are
then recorded.
❖ Census:
Census method is that method of statistical enumeration where all members of the
population are studied. A population refers to the set of all observations under concern. For
example, if you want to carry out a survey to find out students' feedback about the facilities of
your school, all the students of your school would form the 'population' for your
study.
❖ Sampling:
Sampling is the process of selecting a fraction of the population so as to represent the
characteristics of the larger group. This method is used for statistical testing where it is not
possible to consider all members or observations, as the population size is very large.
The units which constitute the sample are called 'sampling units'. The complete list
containing all sampling units is called the 'sampling frame'.
❖ Random Sampling:
Random sampling is a technique in which every item in the population has an equal chance
and likelihood of being selected in the sample. Here the selection of items depends entirely
on chance or probability, and therefore this technique is also sometimes known as a
method of chances.
❖ Cluster Sampling:
Cluster sampling is a sampling plan used when mutually homogeneous yet internally
heterogeneous groupings are evident in a statistical population. It is a method where the
researchers divide the entire population into sections or clusters that represent a
population. Clusters are identified and included in a sample on the basis of defining
demographic parameters such as age, location, sex etc.
❖ Systematic Sampling:
Using the systematic sampling method, members of a sample are chosen at regular intervals
from a population. Example - To select a sample of 500 people out of 5000, every 10th individual
may be selected.
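As a rough illustration, systematic selection can be sketched in a few lines of Python; the population list and the random starting point below are hypothetical:

```python
import random

def systematic_sample(population, k):
    """Pick every k-th member after a random start in [0, k)."""
    start = random.randrange(k)
    return population[start::k]

# 500 out of 5000 means a sampling interval of k = 5000 // 500 = 10
people = list(range(5000))            # stand-in for 5000 individuals
sample = systematic_sample(people, 10)
print(len(sample))                    # 500
```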
❖ Stratified Random Sampling:
Stratified random sampling is a method of sampling that involves dividing a population
into multiple non-overlapping, homogeneous, smaller groups known as strata and randomly
choosing final members from the various strata for research, which reduces cost and improves
efficiency. Members in each of these groups should be distinct so that every member of every
group gets an equal opportunity of being selected using simple probability. Stratified random
sampling is also called proportional random sampling or quota random sampling. For example,
a researcher looking to analyze the characteristics of people belonging to different annual
income divisions, will create strata (groups) according to annual family income.
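A minimal sketch of proportional stratified sampling, assuming pandas is available; the income_band column and the 10% sampling fraction are made up for illustration:

```python
import pandas as pd

# Hypothetical population: annual family income bands act as the strata
df = pd.DataFrame({
    "person_id": range(1, 1001),
    "income_band": ["low"] * 500 + ["middle"] * 300 + ["high"] * 200,
})

# Draw the same fraction (10%) from every stratum
sample = df.groupby("income_band", group_keys=False).sample(frac=0.10, random_state=42)
print(sample["income_band"].value_counts())   # 50 low, 30 middle, 20 high
```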
❖ Convenience sampling:
This method is dependent on the ease of access to subjects such as surveying customers at
a mall or passers-by on a busy street. It is termed convenience sampling because it is
carried out on the basis of how easy it is for a researcher to get in touch with the subjects.
❖ Judgmental or Purposive Sampling:
In judgmental or purposive sampling, the sample is formed by the discretion of the judge
purely considering the purpose of study along with the understanding of target audience.
❖ Sampling Error:
A sampling error is a statistical error that occurs when an analyst does not select a sample that
represents the entire population of data and the results found in the sample do not represent
the results that would be obtained from the entire population. Sampling error can be
reduced by increasing the sample size and by ensuring that the sample adequately
represents the entire population.
❖ Non-Sampling Error:
Non-sampling errors are those which creep in due to human factors, which vary from
one investigator to another. In other words, a non-sampling error is an error that results
during data collection, causing the data to differ from the true values. Non-sampling errors
may be present in both samples and censuses in which an entire population is surveyed and
may be random or systematic. While increasing sample size will help minimize sampling error,
it will not have any effect on reducing non-sampling error. Unfortunately, non-sampling errors
are often difficult to detect, and it is virtually impossible to eliminate them entirely.
❖ Data Presentation:
Data can be presented in textual, tabular or diagrammatic form. The most popular form of
data presentation is the diagrammatic form, as it can be understood by both the educated
and uneducated sections of society. Furthermore, any hidden trend present in the given data
can be noticed readily in this mode of representation.
❖ Line Diagram: A line graph, also known as a line chart, is a type of chart used to visualize
the value of something over time. For example, a finance department may plot the change
in the amount of cash the company has on hand over time. The line graph consists of a
horizontal x-axis and a vertical y-axis.
❖ Bar Chart:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars
with heights or lengths proportional to the values that they represent. A bar graph may run
horizontally or vertically.
❖ Pie Chart: A pie chart (or a circle chart) is a circular statistical graphic, which is divided into
slices to illustrate numerical proportion. It shows proportions and percentages between
categories, by dividing a circle into proportional segments. In a pie chart, the arc length of
each slice (and consequently its central angle and area), is proportional to the quantity it
represents.
❖ Frequency Polygon:
Another type of graph that can be drawn to represent the same set of data as a histogram
represents is a frequency polygon. A frequency polygon is a graph constructed by using lines to
join the midpoints of each interval.
❖ Ogive:
By plotting cumulative frequency against the respective class boundary, we get ogives. There
are two ogives – the less-than type ogive, obtained by taking less-than cumulative
frequency on the vertical axis, and the more-than type ogive, obtained by plotting more-than
cumulative frequency on the vertical axis – in each case joining the plotted points.
❖ Dispersion:
Dispersion is the tendency of data to be scattered over a range and is an important
feature of a frequency distribution. It is also called spread or variation. Range,
variance and standard deviation are all measures of dispersion.
❖ Standard Error:
The standard error is the approximate standard deviation of a statistic computed from a
sample. In statistics, a sample mean deviates from the actual mean of the population; the
standard error of the mean measures the typical size of this deviation. It is computed as the
sample standard deviation divided by the square root of the sample size.
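For instance, the standard error of the mean can be computed as s/√n; the height values below are hypothetical:

```python
import math
import statistics

heights = [160, 165, 170, 172, 168, 175, 162, 169]   # hypothetical sample

s = statistics.stdev(heights)      # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(len(heights))   # standard error of the mean
print(round(se, 2))
```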
❖ Kurtosis:
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. Data sets with high kurtosis tend to have heavy tails, or outliers; data sets
with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution is the
extreme case of light tails.
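A quick check with scipy (which reports excess kurtosis by default, so a normal distribution scores about 0); the simulated samples are illustrative only:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

# scipy's default is excess kurtosis: normal distribution -> ~0
print(kurtosis(rng.normal(size=100_000)))            # ~ 0
print(kurtosis(rng.uniform(size=100_000)))           # ~ -1.2, light tails (uniform)
print(kurtosis(rng.standard_t(df=4, size=100_000)))  # > 0, heavy tails
```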
Two important discrete probability distributions are (a) Binomial Distribution and (b)
Poisson distribution.
❖ Binomial Distribution:
It is derived from a particular type of random experiment known as Bernoulli process named
after the famous mathematician Bernoulli. A binomial distribution can be thought of as simply
the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated
multiple times. The binomial is a type of distribution that has two possible outcomes (the
prefix “bi” means two, or twice). For example, a coin toss has only two possible outcomes:
heads or tails, and taking a test could have two possible outcomes: pass or fail. A binomial
distribution rests on the following conditions:
(i) Each trial is associated with two mutually exclusive and exhaustive outcomes, the
occurrence of one of which is known as a 'success' and its non-occurrence as a 'failure'.
(ii) The trials are independent.
(iii) The probability of a success, usually denoted by p, and hence that of a failure, usually
denoted by q = 1–p, remain unchanged throughout the process.
(iv) Binomial distribution is known as biparametric distribution as it is characterised by two
parameters n and p. This means that if the values of n and p are known, then the distribution
is known completely.
(v) Binomial distribution has mean = np and variance = np (1-p)
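A short sketch using scipy.stats.binom; the choice of n = 10 fair coin tosses is just an example:

```python
from scipy.stats import binom

n, p = 10, 0.5                 # e.g. 10 tosses of a fair coin
print(binom.pmf(6, n, p))      # P(exactly 6 heads) ≈ 0.2051
print(binom.mean(n, p))        # np = 5.0
print(binom.var(n, p))         # np(1 - p) = 2.5
```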
❖ Poisson Distribution:
Poisson distribution is the discrete probability distribution of the number of events occurring
in a given time period, given the average number of times the event occurs over that time
period. Poisson distribution is applied in situations where there are a large number of
independent Bernoulli trials with a very small probability of success in any trial, say p. A very
commonly encountered Poisson situation is the number of aircraft or road accidents in any
time interval.
Note: Mean, or average in n tries will be equal to np. If μ is the average number of successes
occurring in a given time interval or region in the Poisson distribution, then the mean and the
variance of the Poisson distribution are both equal to μ.
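A sketch with scipy.stats.poisson, taking a made-up average of 2 accidents per interval:

```python
from scipy.stats import poisson

mu = 2.0                                   # hypothetical average per interval
print(poisson.pmf(0, mu))                  # P(no accident) = e^-2 ≈ 0.1353
print(poisson.pmf(3, mu))                  # P(exactly 3 accidents) ≈ 0.1804
print(poisson.mean(mu), poisson.var(mu))   # both equal mu
```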
❖ Normal Distribution:
A random variable X is said to be normally distributed with mean μ and standard
deviation σ if its probability density function is given by:
f(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²)), for −∞ < x < ∞
Examples of normal distribution: body temperature for healthy humans, heights and weights
of adults, thickness and dimension of a product, IQ and standardized test score, quality control
test results, errors in measurements etc.
The area under the curve to the left of the mean and the area to the right of the mean are
each 0.5. Further, 68.27% of the scores lie within 1 standard deviation of the mean, 95.45%
of the scores lie within 2 standard deviations of the mean, and 99.73% of the scores lie
within 3 standard deviations of the mean.
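These three percentages can be verified numerically with the standard normal CDF in scipy:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations
    print(f"within {k} sd: {prob:.4%}")
# within 1 sd: 68.2689%, within 2 sd: 95.4500%, within 3 sd: 99.7300%
```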
❖ Hypothesis :
Hypothesis testing is an objective method of making decisions or inferences from sample data
(evidence). Sample data is used to choose between two choices i.e. hypotheses or statements
about a population. A statistical hypothesis is an assumption about a population parameter.
This assumption may or may not be true. It refers to the formal procedures used by
statisticians to accept or reject statistical hypotheses. Typically this is carried out by comparing
what we have observed to what we expected if one of the statements (Null Hypothesis) was
true.
❖ Alternative Hypothesis:
A statistical hypothesis used in hypothesis testing which states that there is a significant
difference between the set of variables. It is the hypothesis other than the null hypothesis,
often denoted by H1 (H-one). It is what the researcher seeks to prove in an indirect way by
using the test, which is carried out on a sample statistic, e.g., x̄, s or p.
The acceptance of the alternative hypothesis depends on the rejection of the null hypothesis,
i.e. unless the null hypothesis is rejected, the alternative hypothesis cannot be accepted.
❖ Type I error.
A Type I error occurs when the researcher rejects a null hypothesis when it is true. The
probability of committing a Type I error is called the significance level. This probability is also
called alpha, and is often denoted by α.
❖ Type II error.
A Type II error occurs when the researcher fails to reject a null hypothesis that is false. The
probability of committing a Type II error is called Beta, and is often denoted by β. The
probability of not committing a Type II error is called the Power of the test.
❖ P-value:
P values evaluate how well the sample data support the devil’s advocate argument that the
null hypothesis is true. It measures how compatible your data are with the null hypothesis.
How likely is the effect observed in your sample data if the null hypothesis is true?
A low P value suggests that your sample provides enough evidence that you can reject the null
hypothesis for the entire population. P value is the probability of obtaining an effect at least as
extreme as the one in your sample data, assuming the truth of the null hypothesis. For
example, suppose that a vaccine study produced a P value of 0.04. This P value indicates that if
the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due
to random sampling error
❖ Region Of Rejection:
The set of values outside the region of acceptance is called the region of rejection. If the test
statistic falls within the region of rejection, the null hypothesis is rejected. In such cases, we
say that the hypothesis has been rejected at the α level of significance.
❖ One-Tailed Test:
A test of a statistical hypothesis, where the region of rejection is on only one side of
the sampling distribution, is called a one-tailed test. For example, suppose the null hypothesis
states that the mean is less than or equal to 10. The alternative hypothesis would be that the
mean is greater than 10. The region of rejection would consist of a range of numbers located
on the right side of sampling distribution; that is, a set of numbers greater than 10.
❖ Two-Tailed Test:
A test of a statistical hypothesis, where the region of rejection is on both sides of the sampling
distribution, is called a two-tailed test. For example, suppose the null hypothesis states that
the mean is equal to 10. The alternative hypothesis would be that the mean is less than 10 or
greater than 10. The region of rejection would consist of a range of numbers located on both
sides of sampling distribution; that is, the region of rejection would consist partly of numbers
that were less than 10 and partly of numbers that were greater than 10.
❖ Right-Tailed:
The critical value for conducting the right-tailed test H0: μ = 3 versus HA: μ > 3 is the t-value,
denoted t(α, n−1), such that the probability to the right of it is α.
❖ Left-Tailed:
The critical value for conducting the left-tailed test H0: μ = 3 versus HA: μ < 3 is the t-value,
denoted −t(α, n−1), such that the probability to the left of it is α.
❖ Critical Value:
In hypothesis testing, a critical value is a point on the test distribution that is compared to the
test statistic to determine whether to reject the null hypothesis. If the absolute value of your
test statistic is greater than the critical value, you can declare statistical significance and reject
the null hypothesis. Critical values correspond to α, so their values become fixed when you
choose the test's α.
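Critical t-values can be looked up with the inverse CDF (percent-point function) in scipy; α = 0.05 and n = 15 are example choices:

```python
from scipy.stats import t

alpha, n = 0.05, 15
df = n - 1

print(t.ppf(1 - alpha, df))       # right-tailed critical value, ≈ 1.761
print(t.ppf(alpha, df))           # left-tailed critical value, ≈ -1.761
print(t.ppf(1 - alpha / 2, df))   # two-tailed critical value, ≈ 2.145
```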
❖ Level of significance:
This refers to the degree of significance with which we accept or reject a particular hypothesis.
The significance level, also denoted as alpha or α, is the probability of rejecting the null
hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of
concluding that a difference exists when there is no actual difference.
❖ Causation:
Causation indicates that one event is the result of the occurrence of the other event; i.e. there
is a causal relationship between the two events. This is also referred to as cause and effect.
❖ Correlation Coefficient:
The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0
to +1.0. The closer r is to +1 or −1, the more closely the two variables are related. If r is close to
0, it means there is no linear relationship between the variables. If r is positive, it means that as one
variable gets larger the other gets larger. If r is negative it means that as one gets larger, the
other gets smaller (often called an "inverse" correlation).
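A minimal computation of r with scipy; the x and y values are invented so that the relationship is clearly positive:

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 6, 7]
r, p_value = pearsonr(x, y)   # r close to +1: y tends to grow with x
print(round(r, 3))
```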
❖ Scatter Diagram:
The scatter diagram is known by many names, such as scatter plot, scatter graph, and
correlation chart. This diagram is drawn with two variables, usually the first variable is
independent and the second variable is dependent on the first variable.
❖ Rank Correlation:
When we need to find the correlation between two qualitative characteristics, say, beauty and
intelligence, we take recourse to the rank correlation coefficient. In order to find the
correlation, the characteristics are first assigned ranks. The Spearman correlation
coefficient, rs, can take values from +1 to −1. In formula terms, it is given by:
rs = 1 − (6Σd²) / (n(n² − 1)), where d is the difference between the ranks of corresponding
values and n is the number of pairs.
Note: For the coefficient of concurrent deviations, r = ±√((2c − m)/m), if (2c − m) > 0 we take
the positive sign both inside and outside the radical sign, and if (2c − m) < 0 we take the
negative sign both inside and outside the radical sign.
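A quick check of the Spearman formula with scipy; the two sets of ranks are hypothetical:

```python
from scipy.stats import spearmanr

judge_a = [1, 2, 3, 4, 5]       # ranks given by judge A (hypothetical)
judge_b = [2, 1, 4, 3, 5]       # ranks given by judge B
rs, p_value = spearmanr(judge_a, judge_b)
print(rs)   # 0.8, matching 1 - 6*sum(d^2)/(n*(n^2 - 1)) with sum(d^2) = 4, n = 5
```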
❖ Regression analysis:
When there are two variables x and y, and y is influenced by x, i.e. y depends on x, we get
a simple linear regression or simple regression. y is known as the dependent, regressed or
explained variable, and x is known as the independent variable, predictor or explanatory
variable. In the case of a simple regression model, if y depends on x, then the regression line
of y on x is given by:
y = a + bx. Here a and b are two constants, and they are also known as regression parameters.
Furthermore, b is also known as the regression coefficient of y on x and is also denoted by byx.
❖ Regression Coefficient:
The regression line is the line that best fits the data, such that the overall distance from the
line to the points (variable values) plotted on a graph is the smallest. The regression
coefficients have the following properties (a short verification sketch follows the list):
(i) The regression coefficients remain unchanged due to a shift of origin but change due
to a shift of scale.
(ii) The two lines of regression intersect at the point (mean of "x", mean of "y"), where x
and y are the variables under consideration.
(iii) The coefficient of correlation between two variables x and y is the simple geometric
mean of the two regression coefficients. The sign of the correlation coefficient would
be the common sign of the two regression coefficients.
(iv) The two lines of regression coincide i.e. become identical when r = –1 or 1.
(v) The two lines of regression are perpendicular to each other when r = 0.
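A sketch, on invented data, that fits the regression line of y on x with numpy and verifies property (iii):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])      # invented observations

b, a = np.polyfit(x, y, deg=1)                # least-squares fit of y = a + bx
print(f"y = {a:.3f} + {b:.3f}x")

# Property (iii): r is the geometric mean of byx and bxy
cov = np.cov(x, y, ddof=1)[0, 1]
byx = cov / np.var(x, ddof=1)                 # regression coefficient of y on x
bxy = cov / np.var(y, ddof=1)                 # regression coefficient of x on y
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r, np.sign(byx) * np.sqrt(byx * bxy)))   # True
```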
❖ Nonparametric Statistics:
Nonparametric statistics refers to statistical methods in which the data are not required to fit a
normal distribution. Nonparametric statistics often uses ordinal data, meaning it does
not rely on numbers as such, but rather on a ranking or order. For example, a survey recording
consumer preferences ranging from like to dislike would be considered ordinal data.
❖ Parametric Test:
A parametric test is one which makes use of information about the population parameter. The
parametric test is a hypothesis test which provides generalisations for making statements
about the mean of the parent population. The t-test, based on Student's t-statistic, is often
used in this regard.
❖ t-test:
The t-test is a small-sample test. It was developed by William Sealy Gosset in 1908, who
published it under the pen name 'Student'; hence it is known as Student's t-test. For applying
the t-test, the value of the t-statistic is computed using the following formula:
t = (deviation from the population parameter) / (standard error of the sample statistic)
A t-test is a type of inferential statistic which is used to determine if there is a significant
difference between the means of two groups which may be related in certain features. It is
mostly used when the data sets, like the set of outcomes recorded from flipping a coin
100 times, would follow a normal distribution and may have unknown variances. The t-test is
used as a hypothesis testing tool, which allows testing of an assumption applicable to a
population.
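A two-sample illustration with scipy; the group scores are hypothetical, and equal variances are assumed (scipy's default):

```python
from scipy import stats

group_a = [23, 25, 28, 30, 26, 27, 24]   # hypothetical scores, group A
group_b = [20, 22, 25, 23, 21, 24, 22]   # hypothetical scores, group B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # small p-value suggests the group means differ
```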
❖ Chi-square test:
The chi-square test is applied when you have two categorical variables from a single population. It is used
to determine whether there is a significant association between the two variables. For
example, in an election survey, voters might be classified by gender (male or female) and
voting preference (Democrat, Republican, or Independent). We could use a chi-square test for
independence to determine whether gender is related to voting preference.
Chi-squared test is used to determine whether there is a significant difference between the
expected frequencies and the observed frequencies in one or more categories. The value of
chi-square is compared with pre-determined level of significance and degrees of Freedom.
When the computed chi-square statistic exceeds the critical value in the table for a given
significance level, then we can reject the null hypothesis.
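A sketch of the gender-by-voting-preference test with scipy; all the counts in the contingency table are invented:

```python
from scipy.stats import chi2_contingency

observed = [[120,  90, 40],    # male:   Democrat, Republican, Independent
            [110, 100, 50]]    # female
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)      # dof = (2 - 1) * (3 - 1) = 2

if p_value < 0.05:
    print("Reject H0: gender and voting preference are related")
else:
    print("Fail to reject H0: no significant association found")
```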
Data processing refers to the operation performed on data in order to derive new information
according to a given set of rules. Data processing may involve various processes, including data
validation, summarization, aggregation, analysis and reporting.
The three basic structural elements of data processing system are -files, flows, and processes.
Files are collections of permanent records in the system, flows are data interfaces between
the system and the environment, and processes are functionally defined logical manipulations
of the data.
❖ Data entry:
Direct input of data in the appropriate data fields of a database, through the use of a human
data-input device such as a keyboard, mouse, stylus, or touch screen, or through speech
recognition software.
❖ Computer:
'Computer' is basically derived from the word 'compute', which means to calculate
something. A computer is a fast electronic device that processes the input data according to
the instructions given by the programmer/user and provides the desired information as output.
Speed: A computer is a very fast device. It can perform a large amount of work in a few seconds.
❖ Hardware: - Hardware refers to all the physical parts and components of the computer.
❖ INPUT DEVICES: - In a computerized system, before any processing takes place, the data
and instructions must be fed in. This is achieved through the input devices, which provide a
communication medium between the user and the machine. The most common input device
is the keyboard, which resembles a typewriter; with the help of a keyboard, the user types
data and instructions.
1) Text Input Devices: The keyboard is the mainly used text input device.
2) Cursor Control Devices: Cursor control devices include the mouse and joystick; the
scanner, by contrast, is an image input device.
❖ Operating System:
An operating system is a program that acts as an interface between the user and the computer
hardware and controls the execution of all kinds of programs. It is the most important
program in the computer system. It is the one program that executes the whole time the
computer is operational and exits only when the computer is shut down. The OS is the program
that makes the computer work, hence the name operating system. It takes instructions in the
form of commands from the user and translates them into machine-understandable
instructions, gets the instructions executed by the CPU, and translates the result back into
user-understandable form.
❖ Uses of Computers:
1. Fine Arts:
• To draw something
• To make changes in photographs
• To scan images
2. Business/Commerce:
3. Banks:
4. Education:
5. Entertainment:
6. Libraries:
• Connecting data
• Interpreting and getting output
8. Publishing:
• Computers do desktop publishing
• Editing becomes very easy
Marketing professionals use computer technology to plan, manage and monitor campaigns. By
analyzing and manipulating data on computers, they can increase the precision of marketing
campaigns, personalize customer and prospect communications, and improve customer
relationship management. Computer technology also makes it easier for marketing
professionals to collaborate with colleagues, agencies and suppliers.
❖ Statistical software:
SPSS: The Statistical Package for the Social Sciences (SPSS) is a software package used in
statistical analysis of data. It was developed by SPSS Inc. and acquired by IBM in 2009. In 2014,
the software was officially renamed IBM SPSS Statistics. The software was originally meant for
the social sciences, but has become popular in other fields such as health sciences and
especially in marketing, market research and data mining.
SAS stands for Statistical Analysis System. It was developed at the North Carolina State
University in 1966, so is contemporary with SPSS.
Stata is a more recent statistical package, with Version 1 released in 1985. Since then, it
has become increasingly popular in the areas of epidemiology and economics, and probably
now rivals SPSS and SAS in its user base. We are now on Version 14.