UGC NET Commerce: Business Statistics & Research
According to A.L. Bowley “Statistics are numerical statements of facts in any department of
enquiry placed in relation to each other.”
According to Croxton and Cowden, “Statistics may be defined as the collection, presentation,
analysis, and interpretation of numerical data.”
Functions of Statistics
The functions of statistics may be enumerated as follows:
Averages provide us the gist and give a bird’s eye view of the huge mass of unwieldy
numerical data.
Averages are the typical values around which other items of the distribution congregate.
This value lies between the two extreme observations of the distribution and gives us an
idea about the concentration of the values in the central part of the distribution.
And so they are called the measures of central tendency.
Averages are also called measures of location since they enable us to locate the
position or place of the distribution in question.
1. It must be rigidly defined and not left to the mere estimation of the observer. If the definition
is rigid, the value of the average computed by different persons will be the same.
2. The average must be based upon all values given in the distribution.
3. It should be easily understood. The average should possess simple and obvious properties. It
should not be too abstract for the common person.
The following are the main types of averages that are commonly used:
1. Mean
2. Median
3. Mode
1. Arithmetic Mean:
The mean is the arithmetic average, and it is probably the measure of central tendency
that you are most familiar with.
Calculating the mean is very simple.
Add up all of the values and divide by the number of observations in the dataset.
The calculation of the mean incorporates all values in the data. If you change any value,
the mean changes.
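As a quick check of this rule, here is a minimal Python sketch (the data values are hypothetical):

```python
# Arithmetic mean: sum all values, divide by the number of observations.
data = [4, 8, 15, 16, 23, 42]  # hypothetical observations

mean = sum(data) / len(data)
print(mean)  # 18.0
```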
1. The sum of the deviation of a given set of individual observations from the arithmetic
mean is always zero.
2. The sum of squares of deviations of a set of observations is the minimum when
deviations are taken from the arithmetic average.
3. If each value of a variable X is increased or decreased or multiplied by a constant k, the
arithmetic mean also increases or decreases or multiplies by the same constant.
4. If we are given the arithmetic means and numbers of items of two or more groups, we can
compute the combined average of these groups by applying the following formula:
Combined mean = (n1 X̄1 + n2 X̄2) / (n1 + n2)
For each type of series (discrete series, continuous series), the mean can be computed by the
Direct Method, the Shortcut Method, or the Step Deviation Method.
2. Weighted Average
A weighted average is a type of average where each observation in the data set is
multiplied by a predetermined weight before calculation.
In calculating a simple average (arithmetic mean) all observations are treated equally and
assigned equal weight.
A weighted average assigns weights that determine the relative importance of each data
point.
Takes into account relative importance of data points when calculating an average
thereby making it more descriptive than a simple average.
Often used in finance to calculate cost basis of stock portfolios, inventory accounting
and valuation.
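A minimal Python sketch of the idea, with hypothetical scores and weights:

```python
# Weighted average: each observation is multiplied by its weight,
# then the products are summed and divided by the sum of the weights.
values  = [70, 80, 90]     # hypothetical exam scores
weights = [0.2, 0.3, 0.5]  # relative importance of each score

weighted_avg = sum(v * w for v, w in zip(values, weights)) / sum(weights)
print(weighted_avg)  # 83.0
```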
3. Geometric Mean:
A geometric mean is a mean or average which shows the central tendency of a set of
numbers by using the product of their values.
For a set of n observations, a geometric mean is the nth root of their product.
The geometric mean G.M. for a set of numbers x1, x2, …, xn is given as
G.M. = (x1 × x2 × … × xn)^(1/n)
The geometric mean of two numbers, say x and y, is the square root of their product, √(x × y).
For three numbers, it is the cube root of their product, i.e. (xyz)^(1/3).
In order to make our calculation easy and less time-consuming, we use logarithms in the
calculation of geometric means:
log G.M. = (1/N)(f1 log x1 + f2 log x2 + … + fn log xn) = (1/N) Σ fi log xi
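A short Python sketch of the logarithm route, using two hypothetical values:

```python
import math

# Geometric mean via logarithms: log G.M. is the arithmetic mean
# of the log values, so G.M. = exp(mean of logs).
data = [2, 8]  # hypothetical values

log_mean = sum(math.log(x) for x in data) / len(data)
gm = math.exp(log_mean)
print(round(gm, 6))  # 4.0, i.e. the square root of 2 * 8
```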
1. The logarithm of geometric mean is the arithmetic mean of the logarithms of given
values
2. If all the observations assumed by a variable are constants, say K >0, then the G.M. of the
observation is also K
3. The geometric mean of the ratio of two variables is the ratio of the geometric means of
the two variables
4. The geometric mean of the product of two variables is the product of their geometric
means
The geometric mean has certain specific uses: averaging ratios and percentages, computing average rates of change over time, and constructing index numbers.
4. Harmonic Mean
A simple way to define a harmonic mean is to call it the reciprocal of the arithmetic mean
of the reciprocals of the observations.
The most important criterion for its use is that none of the observations should be zero.
The most common examples of ratios are that of speed and time, cost and unit of material,
work and time etc.
1. If all the observations taken by a variable are constants, say k, then the harmonic mean of
the observations is also k
2. The harmonic mean has the least value when compared to the geometric mean and the
arithmetic mean
Relationship between Arithmetic Mean, Harmonic Mean, and Geometric Mean of Two
Numbers
If AM = arithmetic mean, GM = geometric mean, and HM = harmonic mean of two numbers, then:
AM × HM = GM²
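A quick numeric check of this relation in Python, on two hypothetical numbers:

```python
import math

# For two numbers, AM * HM = GM^2. A numeric check on hypothetical values:
x, y = 4, 16

am = (x + y) / 2          # arithmetic mean = 10.0
gm = math.sqrt(x * y)     # geometric mean  = 8.0
hm = 2 / (1 / x + 1 / y)  # harmonic mean = reciprocal of the mean of reciprocals

print(hm)              # 6.4 (also the least of the three means, as stated above)
print(am * hm, gm**2)  # 64.0 64.0 -> AM * HM = GM^2
```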
(II) Median
Median is the middle value of the series when arranged in order of the magnitude.
When a series is divided into more than two parts, the dividing values are called Partition
values.
The very first thing to be done with raw data is to arrange them in ascending or descending
order.
In layman’s terms: suppose the data, arranged in order, are, say, 3, 5, 8, 10, 12.
As we have 5 numbers, the middle number will be the 3rd, which can also be
calculated as (5 + 1)/2 = 3.
So the median is 8.
For an even number of items: there is then no value exactly in the middle of the series. In
such a situation the median is taken to be halfway between the two
middle items.
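A small Python sketch covering both cases (the data are hypothetical, chosen so the odd case gives a median of 8 as in the example above):

```python
# Median: middle value after sorting; for an even count, the average
# of the two middle values.
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd count: exact middle item
    return (s[mid - 1] + s[mid]) / 2  # even count: halfway between middle items

print(median([12, 5, 8, 3, 10]))  # 8   (3rd of 5 ordered values)
print(median([12, 5, 8, 3]))      # 6.5 (halfway between 5 and 8)
```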
Similarly there are certain other measures which divide the series into certain equal parts
Quartiles:
Quartiles are the measures which divide the data into four equal parts; each portion contains
an equal number of observations.
1. The lower half of a data set is the set of all values that are to the left of the median value
when the data has been put into increasing order.
2. The upper half of a data set is the set of all values that are to the right of the median value
when the data has been put into increasing order.
1. The first quartile, denoted by Q1, is the median of the lower half of the data
set. This means that about 25% of the numbers in the data set lie below Q1 and
about 75% lie above Q1.
2. The second quartile, also called the median and denoted by Q2, has 50% of the
items below it and 50% of the items above it.
3. The third quartile, denoted by Q3, is the median of the upper half of the
data set. This means that about 75% of the numbers in the data set lie below Q3
and about 25% lie above Q3.
First Quartile: Q1 = size of the (N + 1)/4 th item.
Third Quartile: Q3 = size of the 3(N + 1)/4 th item.
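A Python sketch using the standard library's quantiles helper (the series is hypothetical; note that different interpolation methods give slightly different quartile values):

```python
import statistics

# Quartiles split the ordered data into four equal parts.
data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
print(q1, q2, q3)  # 6.5 11.0 15.5
```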
Deciles
Deciles: Deciles divide the series into ten equal parts and are generally expressed as D.
There are nine deciles, expressed as D1, D2, …, D9, which are called the first decile, second
decile, and so on.
Percentiles
Percentiles: Percentiles divide the series into a hundred equal parts and are generally expressed as P. There are ninety-nine percentiles, expressed as P1, P2, …, P99.
Merits of Median:
(i) It is a simple measure of central tendency.
(ii) It is not affected by extreme observations.
(iii) It is possible to compute even when data is incomplete.
(iv) Median can be determined by graphic presentation of data.
(v) It has a definite value.
Demerits of Median:
(i) It is not based on all the items in the series, as it indicates only the value of the middle items.
(ii) It is not suitable for algebraic treatment.
(iii) Arranging the data in ascending order takes much time.
(iv) It is affected by fluctuations of items.
(v) It cannot be computed exactly when the number of items in the series is even.
Mode
Mode is that value of the variable which occurs or repeats itself the maximum number of
times.
The mode is the most “fashionable” size in the sense that it is the most common and typical,
and is defined by Zizek as “the value occurring most frequently in a series of items and around
which the other items are distributed most densely.”
In the words of Croxton and Cowden, the mode of a distribution is the value at the point
where the items tend to be most heavily concentrated.
According to A.M. Tuttle, mode is the value which has the greatest frequency density in
its immediate neighborhood.
In the case of individual observations, the mode is that value which is repeated the
maximum number of times in the series. The value of the mode can also be denoted by the
letter ‘z’.
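A minimal Python sketch for individual observations, using a hypothetical series:

```python
from collections import Counter

# Mode: the value that occurs the maximum number of times.
data = [3, 7, 7, 2, 9, 7, 2]

counts = Counter(data)
mode, freq = counts.most_common(1)[0]
print(mode, freq)  # 7 3 -> the value 7 occurs 3 times
```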
DISPERSION
➢ According to Dr. Bowley, “dispersion is the measure of the variation between items.”
Objectives of Dispersion
a) To determine the reliability of an average.
Measures of dispersion fall into two groups: absolute measures and relative measures.
Range:
It is the simplest method of studying dispersion. Range is the difference between the
smallest value and the largest value of a series.
While computing range, we do not take into account the frequencies of the different groups.
Range = X max − X min; its relative measure, the coefficient of range, is
(X max − X min) / (X max + X min).
Quartile Deviation
✓ The concept of quartile deviation takes into account only the values of the upper
quartile (Q3) and the lower quartile (Q1).
✓ The inter-quartile range is the difference between the upper quartile (Q3) and the lower
quartile (Q1):
Inter-quartile Range = Q3 − Q1
Quartile Deviation = (Q3 − Q1) / 2
Mean Deviation:
Average deviation is defined as the value obtained by taking the average of the
deviations of the various items from a measure of central tendency (mean, median, or mode),
ignoring negative signs.
Standard Deviation:
Standard deviation is calculated as the square root of the average of squared deviations taken
from the actual mean:
σ = √[ Σ(X − X̄)² / N ]
(Some authors use ‘x’ for the deviation of an individual score from the mean, i.e. x = X − X̄,
so that σ = √(Σx² / N).)
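A small Python sketch computing the range, standard deviation, and coefficient of variation for a hypothetical series:

```python
import math

# Absolute and relative measures of dispersion for a hypothetical series.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)                               # 5.0
variance = sum((x - mean) ** 2 for x in data) / len(data)  # mean of squared deviations
sd = math.sqrt(variance)                                   # population standard deviation

print(max(data) - min(data))  # 7    (range)
print(sd)                     # 2.0  (standard deviation)
print(sd / mean * 100)        # 40.0 (coefficient of variation, C.V.)
```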
Standard deviation is preferred:
(i) When the most accurate, reliable and stable measure of variability is wanted.
(ii) When more weight is to be given to extreme deviations from the mean.
(iii) When the coefficient of correlation and other statistics are subsequently computed.
(iv) When scores are to be properly interpreted with reference to the normal curve.
(v) When we want to test the significance of the difference between two statistics.
Coefficient of Dispersion
Whenever we want to compare the variability of two series which differ widely in
their averages, we need to calculate the coefficients of dispersion along with the measures of
dispersion.
The coefficient of variation (C.V.) is 100 times the coefficient of dispersion based on the
standard deviation: C.V. = (σ / X̄) × 100.
It is used, for example, to compare groups when the means are unequal but the units of the
scale are the same.
Types of Distributions
Bernoulli Distribution
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0
(failure), and a single trial.
So the random variable X which has a Bernoulli distribution can take value 1 with the
probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.
Probability of getting a head = Probability of getting a tail = 0.5 since there are only two
possible outcomes.
P(X = x) = p^x (1 − p)^(1−x), where x = 0 or 1
✓ For instance, the result of a fight between me and Undertaker. He is pretty much certain
to win. So in this case probability of my success is 0.15 while my failure is 0.85
E(X) = p
Var(X) = p(1-p)
Binomial Distribution
A distribution where only two outcomes are possible, such as success or failure, gain or
loss, win or lose and where the probability of success and failure is same for all the trials is
called a Binomial Distribution.
1. There is a fixed number of trials, n.
2. There are only two possible outcomes in a trial: either a success or a failure.
3. The trials are independent of one another.
4. The probability of success and failure is the same for all trials. (Trials are identical.)
• A counter-example: drawing 5 cards from a deck for a poker hand is done without
replacement, so the trials are not independent and the binomial model does not apply.
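A minimal Python sketch of the binomial probability formula P(X = x) = C(n, x) p^x (1−p)^(n−x), for a hypothetical coin-tossing case:

```python
from math import comb

# Binomial probability: P(X = x) = C(n, x) * p**x * (1-p)**(n-x).
# Hypothetical case: probability of exactly 3 heads in 5 fair coin tosses.
n, p, x = 5, 0.5, 3

prob = comb(n, x) * p**x * (1 - p)**(n - x)
print(prob)  # 0.3125
```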
Normal distribution :
Normal distribution represents the behavior of most of the situations in the universe.
The sum of many (small) random variables often turns out to be normally distributed,
which contributes to its widespread application.
Any distribution is known as Normal distribution if it has the following characteristics:
1. The mean, median and mode of the distribution coincide.
2. The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
3. The total area under the curve is 1.
4. The mean divides the curve into 2 equal parts
5. Its quartile deviation is Q.D. = (2/3)σ (approximately 0.6745σ).
Limits Area %
µ± σ 68.2
µ± 1.96σ 95
µ± 2σ 95.4
µ± 3σ 99.7
Mean= E(X) = µ
Variance = Var(X) = σ²
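The area table above can be reproduced numerically; a short Python sketch using the error function (note the µ ± σ area computes to about 68.3%):

```python
from math import erf, sqrt

# Area under the standard normal curve within mu +/- k*sigma:
# P(|Z| <= k) = erf(k / sqrt(2)).
for k in (1, 1.96, 2, 3):
    area = erf(k / sqrt(2))
    print(k, round(area * 100, 1))
# 1 68.3, 1.96 95.0, 2 95.4, 3 99.7
```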
Poisson distribution
The Poisson distribution is a discrete distribution with a single parameter (usually written m
or λ).
A Poisson process is obtained when a binomial experiment is conducted a large number of
times. If the probability of success “p” is small and the number of trials “n” is large, the
binomial distribution is approximated by the Poisson distribution.
A distribution is called Poisson distribution when the following assumptions are valid:
1. Any successful event should not influence the outcome of another successful event.
2. The probability of success in an interval is proportional to the length of the
interval.
3. The probability of success in an interval approaches zero as the interval becomes smaller.
Now, if any distribution validates the above assumptions, it is a Poisson distribution. Its
probability mass function is
P(X = x) = e^(−λ) λ^x / x!, for x = 0, 1, 2, …
Here, X is called a Poisson random variable and the probability distribution of X is called the
Poisson distribution.
Mean = E(X) = µ = λ
Variance = Var(X) = µ = λ
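A minimal Python sketch of the Poisson probability mass function, with a hypothetical rate of λ = 2 calls per minute; the last line checks numerically that the mean equals λ:

```python
from math import exp, factorial

# Poisson probability: P(X = x) = e**(-lam) * lam**x / x!
lam = 2  # hypothetical: on average 2 calls per minute

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

print(poisson_pmf(3, lam))                               # ~0.180, P(exactly 3 calls)
print(sum(x * poisson_pmf(x, lam) for x in range(50)))   # ~2.0 -> mean equals lam
```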
Exponential Distribution
Consider the call center example one more time. What about the interval of time between the
calls ?
Here, exponential distribution comes to our rescue. Exponential distribution models the
interval of time between the calls.
The probability density function is f(x) = λ e^(−λx) for x ≥ 0. Here λ > 0 is the parameter
of the distribution, often called the rate parameter.
For survival analysis, λ is called the failure rate of a device at any time t, given that it has
survived up to t.
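A short Python sketch under the same hypothetical call-center assumption (λ = 2 calls per minute):

```python
from math import exp

# Exponential model for the time between calls: P(T > t) = e**(-lam * t).
lam = 2    # hypothetical rate: 2 calls per minute

t = 1.5    # minutes
print(exp(-lam * t))  # ~0.0498 -> chance the next call takes more than 1.5 min
print(1 / lam)        # 0.5     -> mean waiting time between calls = 1/lam
```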
Data collection:
Collection of data is the first and most important stage in any statistical survey. The
method for collection of data depends upon various factors such as objective, scope, nature of
investigation and availability of resources.
1. Statistical sources refer to data that are collected for some official purposes and include
censuses and officially conducted surveys.
2. Non-statistical sources refer to the data that are collected for other administrative purposes
or for the private sector.
➢ Census: Unlike a sample survey, a census is based on all items of the population, and
then the data are analyzed. Data collection happens for a specific reference period. For
example, the Census of India is conducted every 10 years. Other censuses are conducted
roughly every 5-10 years. Data is collected using questionnaires that may be mailed to the
respondents. Responses can also be collected over other modes of communication like the
telephone.
➢ Register: Registers are basically storehouses of statistical information from which data can
be collected and analysis can be made. Registers tend to be detailed and extensive. It is
beneficial to use data from here as it is reliable. Two or more registers can be linked
together based on common information for even more relevant data collection.
Types of Data
There are two types of data – primary data and secondary data.
1. Primary data is the data collected for the first time keeping in view the objective of the
survey. Interviews, questionnaires, and telephone/mail surveys are all methods of collecting primary data.
2. Secondary data is any information, used for the current investigation but is obtained
from data, which has been collected and used by some other agency or person in a separate
investigation, or survey.
Primary data
Primary data is the one, which is collected by the investigator for the purpose of a
specific inquiry or study.
Such data is original in character and is generated by a survey conducted by individuals
or a research institution or any organization.
They are likely to be more reliable. However, cost of collection of such data is much
higher.
Primary data is collected by either a census method or a sampling method
1. Direct personal observation: In the direct personal observation method, the investigator
collects data by having direct contact with the units of investigation. The accuracy of data
depends upon the ability, training and attitude of the investigator.
Secondary data:
Any information, that is used for the current investigation but is obtained from some data,
which has been collected and used by some other agency or person in a separate
investigation, or survey, is known as secondary data.
They are available in a published or unpublished form.
Though the use of secondary data is economical in terms of expense, time, and manpower,
the researcher must be careful in choosing such data. The points to consider are:
1. Reliability of data
2. Suitability of data
3. Adequacy of data
Meaning: Primary data refers to data collected first-hand by the researcher; secondary data means data already collected and used by some other agency.
Questionnaire design
Questionnaire design is the process of designing the format and questions in the survey
instrument that will be used to collect data about a particular phenomenon.
In designing a questionnaire, all the various stages of survey design and implementation should
be considered.
2. Question content
i. Relevance of a question
3. Question phrasing
v. Discourage guessing
vi. Do not take anything for granted on the part of the respondents
4. Types of questions
i. Dichotomous (two-option) questions
ii. Completely unstructured (open-ended) questions
5. Question sequence
i. Logical order
6. Pre-testing of the questionnaire, to:
i. Uncover faults
ii. Catch misprints
1. The questionnaire should begin with an effort to awaken the respondents’ interest.
Important target questions should be asked in the middle of the opinion survey.
2. The respondent should not take much time in completing the questionnaire. It should be
short and simple.
5. Questions should be unbiased. Questions that disturb the privacy of the respondents should be
avoided.
9. All the questions related to personal information (name, income, phone, address etc) of
the respondents should be either optional or asked in the last section of the questionnaires.
10. A pilot test should be conducted to detect the weakness in the questionnaires designed.
Surveys differ from each other with regard to their purpose, field of study, scope, and the source
of information. The standard tools for any statistical study are:
• relevance
• timeliness
A statistical survey has two broad stages: 1. Planning and 2. Execution.
Planning: A properly planned investigation can lead to the best results with least cost and
time. There are five steps involved in planning the survey.
Sampling:
The process of selecting a number of individuals for a study in such a way that the individuals
represent the larger group from which they were selected.
A key planning step is specifying the sampling method for selecting items or events from the frame.
Types of Sample:
Systematic sample:
It involves a random start and then proceeds with the selection of every kth element from
then onwards. In this case, k=(population size/sample size).
It is important that the starting point is not automatically the first in the list, but is instead
randomly chosen from within the first to the kth element in the list.
In a systematic sample, after you decide the sample size, arrange the elements of the
population in some order and select terms at regular intervals from the list.
A simple example would be to select every 10th name from the telephone directory (an
'every 10th' sample, also referred to as 'sampling with a skip of 10').
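A minimal Python sketch of this procedure, on a hypothetical frame of 100 units:

```python
import random

# Systematic sample: random start within the first k elements,
# then every k-th element, where k = population size / sample size.
population = list(range(1, 101))    # hypothetical frame of 100 units
sample_size = 10
k = len(population) // sample_size  # skip of 10

start = random.randrange(k)         # random start within the first k elements
sample = population[start::k]
print(sample)                       # e.g. [4, 14, 24, ..., 94]
```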
Cluster sample:
The process of randomly selecting intact groups, not individuals, within the defined
population sharing similar characteristics.
Clusters are locations within which an intact group of members of the population can be
found
Selection process
1. One-stage sampling: All of the elements within selected clusters are included in the
sample.
2. Two-stage sampling: A subset of elements within selected clusters is randomly selected
for inclusion in the sample.
5. Multi-Stage Sampling
1. Convenience Sampling
The process of including whoever happens to be available at the time, that is, readily
available and convenient.
The researcher using such a sample cannot scientifically make generalizations about the
total population from this sample because it would not be representative enough.
For example, if the interviewer were to conduct a survey at a shopping center early in the
morning on a given day, the people he/she could interview would be limited to those present
there at that time. This would not represent the views of other members of society in the area
as well as it would if the survey were conducted at different times of day and several times per
week.
In social science research, snowball sampling is a similar technique, where existing study
subjects are used to recruit more subjects into the sample.
2. Purposive sample:
The researcher chooses the sample based on who they think would be appropriate for the
study.
3. Quota Sampling
It starts with characterizing the population based on certain desired features and assigns a
quota to each subset of the population.
The population is first segmented into mutually exclusive sub-groups, just as in stratified
sampling.
Then judgment is used to select subjects or units from each segment based on a specified
proportion.
For example, an interviewer may be told to sample 200 females and 300 males between
the age of 45 and 60.
It is this second step which makes the technique one of non-probability sampling.
4. Snowball Sampling
Just as the snowball rolls and gathers mass, the sample constructed in this way will grow
in size as you move through the process of conducting a survey.
In this technique, you rely on your initial respondents to refer you to the next respondents
whom you may connect with for the purpose of your survey.
Snowball sampling can be useful when you need the sample to reflect certain features
that are difficult to find.
To conduct a survey of people who go jogging in a certain park every morning, for
example, snowball sampling would be a quick, accurate way to create the sample.
Advantage: The costs associated with this method are significantly lower, and you will end up
with a sample that is very relevant to your study.
Disadvantage: The clear downside of this approach is that you may restrict yourself to only a
small, largely homogeneous section of the population.
Hypothesis Testing
A hypothesis is an assumption which is tested to check whether the inference drawn from a
sample of data stands true for the entire population or not.
The null hypothesis, denoted by H0, states that there is no difference between the assumed and
actual value of the parameter.
The alternative hypothesis denoted by H1 is the other hypothesis about the population,
which stands true if the null hypothesis is rejected.
Null hypothesis, H0:
• States the hypothesized value of the parameter before sampling.
• The assumption we wish to test (or the assumption we are trying to reject).
• E.g. population mean µ = 20; “there is no difference between Coke and Diet Coke.”
Alternative hypothesis, HA:
• All possible alternatives other than the null hypothesis.
• E.g. µ ≠ 20, µ > 20, µ < 20; “there is a difference between Coke and Diet Coke.”
Once the hypothesis about the population is constructed the researcher has to decide the
level of significance, i.e. a confidence level with which the null hypothesis is accepted or
rejected.
The significance level is denoted by ‘α’ and is usually defined before the samples are
drawn such that results obtained do not influence the choice.
In practice, we either take 5% or 1% level of significance.
After the hypothesis is constructed, and the significance level is decided upon, the next
step is to determine a suitable test statistic and its distribution.
Most test statistics assume the following form:
Test statistic = (Sample statistic − Hypothesized parameter) / Standard error of the statistic
Before the samples are drawn, it must be decided which values of the test
statistic will lead to the acceptance of H0 and which will lead to its rejection.
The values that lead to rejection of H0 are called the critical region.
5. Performing Computations:
Once the critical region is identified, we compute several values for the random sample
of size ‘n.’
Then we apply the formula of the test statistic, as shown in step (3), to check whether
the sample result falls in the acceptance region or the rejection region.
6. Decision-making:
Once all the steps are performed, the statistical conclusions can be drawn, and the
management can take decisions.
The decision involves either accepting the null hypothesis or rejecting it.
Thus, to test the hypothesis, it is necessary to follow these steps systematically so that the
results obtained are accurate and do not suffer from either of the statistical errors, viz. Type-I
error and Type-II error.
Type I error
Type I error refers to the situation when we reject the null hypothesis when it is actually true.
For example, with H0: there is no difference between the two drugs on average, a Type I error
occurs if we conclude that the two drugs differ when they actually do not.
Prob(Type I error) = α
Type II error
Type II error refers to the situation when we accept the null hypothesis when it is false.
H0: there is no difference between the two drugs on average. Type II error will occur if
we conclude that the two drugs produce the same effect when actually there is a difference.
Prob(Type II error) = ß
One Tail Test: An upper-tailed test will reject the null hypothesis if the sample mean is
significantly higher than the hypothesized mean.
Appropriate when H0: µ = µ0 and HA: µ > µ0
T test:
The t-statistic was introduced by W.S. Gosset under the pen name “Student”.
He developed the t-test around 1905 for dealing with small samples in brewing quality control,
and it was published in 1908.
The t-test is used to compare two samples to determine if they came from the same
population.
Application of T Test
1. Test of Hypothesis about population(One sample t-Test )
The larger the degrees of freedom, the more closely the t-distribution approximates the normal
distribution. The curve never touches the X axis.
Tests of means include: the Z-test, the one-sample t-test, the independent-samples t-test, and
the dependent (paired) samples t-test.
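As an illustration, a minimal sketch of an independent-samples t-test; it assumes SciPy is available, and the two groups of scores are hypothetical:

```python
from scipy import stats  # assumes SciPy is installed

# Independent-samples t-test on two hypothetical groups of scores.
group_a = [23, 25, 28, 30, 32]
group_b = [20, 22, 24, 25, 27]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # reject H0 (equal means) at the 5% level if p_value < 0.05
```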
The Z-test is applied to compare sample and population means to know if there’s a
significant difference between them.
The Z test is also called the Standard Normal Deviate Test, Standard Normal Test,
Approximate Test, or Large Sample Test.
Conditions of Z Test
1. Data points should be independent from each other
Application of Z Test
1. Test of significance for single mean
1. If the Table value > Calculated value, we accept the Null Hypothesis
2. If the Table value < Calculated value, we Reject the Null Hypothesis
Table Values: at the 5% level of significance the critical value is Z = 1.96 (two-tailed) or 1.645 (one-tailed); at the 1% level it is Z = 2.58 (two-tailed) or 2.33 (one-tailed).
T-test is more adaptable than Z-test since Z-test will often require certain conditions to be
reliable.
Additionally, T-test has many methods that will suit any need. T-tests are more commonly
used than Z-tests.
Z-tests are preferred to t-tests when standard deviations are known.
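A minimal Python sketch of a one-sample Z test for a single mean, with hypothetical figures (H0: µ = 100, n = 64, sample mean 103, known σ = 12):

```python
from math import sqrt, erf

# One-sample Z test for a single mean (sigma known); all figures hypothetical.
mu0, xbar, sigma, n = 100, 103, 12, 64

z = (xbar - mu0) / (sigma / sqrt(n))  # test statistic = 2.0
p_two_tailed = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(z, p_two_tailed)  # 2.0, ~0.0455 -> calculated 2.0 > table 1.96, reject H0
```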
ANOVA Test:
Analysis of Variance (ANOVA) is a parametric statistical technique used to compare
datasets.
This technique was invented by R.A. Fisher, in 1920 and is thus often referred to as Fisher’s
ANOVA, as well.
It is similar in application to techniques such as t-test and z-test, in that it is used to compare
means and the relative variance between them.
However, analysis of variance (ANOVA) is best applied where more than 2 populations
or samples are meant to be compared.
The F statistic can never be negative, because it is a ratio of sums of squares; by convention
the larger variance estimate is placed in the numerator.
F tests mainly arise when models have been fitted to the data by least squares.
Types of ANOVA
One way analysis: When we are comparing three or more groups based on one
factor variable, it is said to be a one-way analysis of variance (ANOVA).
For example, if we want to compare whether or not the mean output of three workers
is the same based on the working hours of the three workers.
Two way analysis: When there are two factor variables, it is said to be a two-way
analysis of variance (ANOVA).
For example, based on working condition and working hours, we can compare
whether or not the mean output of three workers is the same.
Steps in ANOVA
State alpha (the level of significance)
If we reject the null hypothesis, all we know is that there is a difference somewhere among
the groups; we do not know where the differences are.
Additional tests called Post Hoc tests can be done to determine where differences lie.
It may be between first and second or second and third or may be between all of them.
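A minimal one-way ANOVA sketch for the three-workers example; the output figures are hypothetical and SciPy is assumed to be available:

```python
from scipy import stats  # assumes SciPy is installed

# One-way ANOVA: hypothetical daily output of three workers.
worker1 = [20, 22, 21, 23, 24]
worker2 = [25, 27, 26, 28, 26]
worker3 = [20, 21, 22, 20, 23]

f_stat, p_value = stats.f_oneway(worker1, worker2, worker3)
print(f_stat, p_value)  # if p_value < 0.05, at least one mean output differs
```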
Chi-Square Test
✓ The chi-square test is an important test amongst the several tests of significance
developed by the statistician Karl Pearson in 1900.
✓ The distributions are positively skewed. The research hypothesis for the chi-square is
always a one-tailed test.
4. The frequency data must have a precise numerical value and must be organized into
categories or groups.
6. No group should contain very few items, say less than 10.
df = n-1
In a contingency table
df = (r – 1)(c – 1)
Chi-square tests fall into two groups. Parametric: the test of comparing variance.
Non-parametric: the test of goodness of fit, the test of independence, and the test of
homogeneity.
Test of Comparing Variance: A chi-square test (Snedecor and Cochran, 1983) can be used to
test if the variance of a population is equal to a specified value. This test can be either a
two-sided test or a one-sided test.
Goodness of Fit: In the chi-square goodness of fit test, the term goodness of fit is used to
compare the observed sample distribution with the expected probability distribution.
Test of Homogeneity: used to test whether two or more independent samples come from
populations having the same distribution.
Yates's Correction Factor: When the degree of freedom is 1, i.e. in a 2×2 contingency table,
and N < 50, adjust χ² by Yates's correction factor.
Decision rule: reject H0 if the calculated χ² exceeds the table value of χ² for the given degrees of freedom at the chosen level of significance.
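A minimal goodness-of-fit sketch (SciPy assumed available; the observed die-roll counts are hypothetical):

```python
from scipy import stats  # assumes SciPy is installed

# Chi-square goodness of fit: is a die fair? Hypothetical counts from 120 rolls.
observed = [25, 17, 15, 23, 24, 16]
expected = [20] * 6                  # fair die -> 120 / 6 rolls per face

chi2, p_value = stats.chisquare(observed, expected)
print(chi2, p_value)  # df = 6 - 1 = 5; reject H0 (fair die) if p_value < 0.05
```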
Correlation:
The degree of relationship between the variables under consideration is measured through
correlation analysis.
The measure of correlation is called the correlation coefficient.
The degree of relationship is expressed by a coefficient which ranges from −1 to +1
(−1 ≤ r ≤ +1).
The direction of change is indicated by a sign.
The correlation analysis enables us to have an idea about the degree and direction of the
relationship between the two variables under study.
Correlation is a statistical tool that helps to measure and analyze the degree of
relationship between two variables.
Correlation analysis deals with the association between two or more variables.
1. Positive Correlation: The correlation is said to be positive when the values of the two
variables move in the same direction. Ex. height & weight.
2. Negative Correlation: The correlation is said to be negative when the values of the
variables change in opposite directions: as X increases, Y decreases, and as X decreases,
Y increases. Ex. price & quantity demanded.
3. No correlation: There might be the case when there is no change in a variable with any
change in another variable. In this case, it is defined as no correlation between the two.
Correlation may be classified on three bases:
• On the basis of degree of correlation: positive correlation, negative correlation.
• On the basis of number of variables: simple, multiple, and partial correlation.
• On the basis of linearity: linear and non-linear correlation.
1. Linear correlation: The correlation is said to be linear when the amount of change in one
variable bears a constant ratio to the amount of change in the other. For example, from the
values of the two variables given below, it is clear that the ratio of change between the
variables is the same:
X: 10 20 30 40 50
Y: 20 40 60 80 100
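Computing Pearson's r for exactly this data confirms a perfect positive linear correlation; a minimal Python sketch:

```python
# Pearson's r for the series above: a perfect linear relation gives r = +1.
X = [10, 20, 30, 40, 50]
Y = [20, 40, 60, 80, 100]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sx = sum((x - mx) ** 2 for x in X) ** 0.5
sy = sum((y - my) ** 2 for y in Y) ** 0.5
print(round(cov / (sx * sy), 6))  # 1.0
```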
2. Non – Linear correlation: The correlation is called as non-linear or curvilinear when the
amount of change in one variable does not bear a constant ratio to the amount of change in the
other variable. For example, if the amount of fertilizers is doubled the yield of wheat would not
be necessarily being doubled.
Scatter Diagram: A scatter diagram plots the paired values of the two variables as points on a graph; the direction and closeness of the scatter indicate the direction and degree of correlation.
Probable Error:
The probable error of the correlation coefficient can be obtained by applying the following
formula:
P.E. = 0.6745 × (1 − r²) / √N
The conditions for its use are:
1. The data must approximate to the bell-shaped curve, i.e. a normal frequency curve.
2. The Probable error computed from the statistical measure must have been taken from the
sample.
3. The sample items must be selected in an unbiased manner and must be independent of
each other.
Thus, the probable error is calculated to check the reliability of the value of coefficient
calculated from the random sampling.
Types of problems:
An individual must follow the following steps to calculate the correlation coefficient:
Regression
➢ Regression analysis is the scientific technique for making such predictions.
➢ M.M. Blair described regression analysis as a mathematical measure of the average
relationship between two or more variables in terms of the original units of the data.
➢ Regression Analysis: The Regression Analysis is a statistical tool used to determine the
probable change in one variable for the given amount of change in another. It is used to
get the measure of the error involved while using the regression line as a basis for
estimation
➢ It estimates the values of dependent variables from the values of the independent
variable. This means, the value of the unknown variable can be estimated from the known
value of another variable.
➢ Regression Line:
➢ The degree to which the variables are correlated to each other depends on the Regression
Line.
➢ The regression line is a single line that best fits the data, i.e. all the points plotted are
connected via a line in the manner that the distance from the line to the points is the
smallest.
Regression Coefficient
The constant ‘b’ in the regression equation (Ye = a + bX) is called the regression
coefficient.
It determines the slope of the line, i.e. the change in the value of Y corresponding to a
unit change in X, and therefore it is also called the “slope coefficient.”
The correlation coefficient is the geometric mean of two regression coefficients.
r² = byx × bxy
r = √(byx × bxy)
byx × bxy ≤ 1
The sign of both the regression coefficients will be same, i.e. they will be either positive
or negative.
It is an absolute measure
The average value of the two regression coefficients will be greater than or equal to the value
of the correlation coefficient (since the arithmetic mean of byx and bxy is at least their
geometric mean, which is r).
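A small Python sketch on hypothetical paired data, verifying that r is the geometric mean of the two regression coefficients:

```python
# Regression coefficients byx and bxy for hypothetical paired data,
# verifying r = sqrt(byx * bxy).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # 6.0
sxx = sum((x - mx) ** 2 for x in X)                   # 10.0
syy = sum((y - my) ** 2 for y in Y)                   # 6.0

byx, bxy = sxy / sxx, sxy / syy  # slopes of Y on X (0.6) and X on Y (1.0)
print(byx * bxy)                 # 0.6  = r squared (and <= 1, as stated above)
print((byx * bxy) ** 0.5)        # ~0.775 = correlation coefficient r
```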
Meaning of Probability
A probability is a measure of the likelihood that an event in the future will happen. It can only
assume a value between 0 and 1. A value near zero means the event is not likely to happen; a
value near one means it is likely. There are three ways of assigning probability: classical,
empirical, and subjective.
2. The sum of the simple probabilities for all possible outcomes of an activity must equal 1.
3. The probability ‘p’ of the happening of an event is also known as the probability of success,
and ‘q’, the probability of the non-happening of the event, as the probability of failure.
Simple Definitions:
❖ Trial & Event: Consider an experiment which, though repeated under essentially identical
conditions, does not give unique results but may result in any one of several possible
outcomes. The experiment is known as a trial and the outcomes are known as events or cases.
Example: throwing a die is a trial and getting 1 (or 2, 3, …, 6) is an event. Tossing a coin is a
trial and getting a head (H) or tail (T) is an event.
▪ Basic assumption of classical approach is that the outcomes of a random experiment are
“equally likely”.
▪ According to Laplace, a French mathematician: “Probability is the ratio of the number of
favourable cases to the total number of equally likely cases.”
▪ If the probability of occurrence of A is denoted by p(A), then by this definition we have:
p(A) = (number of favourable cases) / (total number of equally likely cases)
1. Classical probability is often called a priori probability because if one keeps using orderly
examples of unbiased dice, fair coins, etc., one can state the answer in advance (a priori)
without rolling a die, tossing a coin, etc.
2. Classical definition of probability is not very satisfactory because of the following reasons:
• It fails when the number of possible outcomes of the experiment is infinite.
• It is based on the cases which are “equally likely” and as such cannot be applied to
experiments where the outcomes are not equally likely.
P(E) = n(E) / n(S), where P(E) is the probability that an event E will occur, n(E) is the number
of equally likely outcomes of E, and n(S) is the number of equally likely outcomes of the
sample space S.
✓ The Empirical probability P(A) defined earlier can never be obtained in practice and we
can only attempt at a close estimate of P(A) by making N sufficiently large.
✓ The experimental conditions may not remain essentially homogeneous and identical in a
large number of repetitions of the experiment.
✓ The relative frequency m/N may not attain a unique value, no matter how large N
may be.
Subjective Probability: a probability assigned on the basis of personal judgment, experience, or belief rather than formal calculation.
Bayes’ Theorem:
➢ The Bayes Theorem was developed and named after English statistician Thomas
Bayes(1702-1761), who discovered the formula in 1763.
➢ Show the Relation between one conditional probability and its inverse.
➢ Provide a mathematical rule for revising an estimate or forecast in light of experience and
observation.
➢ It is considered the foundation of the special statistical inference approach called the
Bayes’ inference.
➢ The Bayes’ theorem describes the probability of an event based on prior knowledge of
the conditions that might be relevant to the event.
Explanation: Bayes' theorem thus gives the probability of an event based on new information
that is, or may be related, to that event. The formula can also be used to see how the probability
of an event occurring is affected by hypothetical new information, supposing the new
information will turn out to be true.
In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule)
describes the probability of an event, based on conditions that might be related to the event.
➢ a theorem about conditional probabilities:
➢ the probability that an event A occurs given that another event B has already occurred is
equal to the probability that the event B occurs given that A has already occurred multiplied by
the probability of occurrence of event A and divided by the probability of occurrence of event B
P(A∣B) = [P(B∣A) × P(A)] / P(B) = P(A⋂B) / P(B), where:
P(A) is the probability of A occurring;
P(B) is the probability of B occurring;
P(A∣B) is the probability of A given B;
P(B∣A) is the probability of B given A; and
P(A⋂B) is the probability of both A and B occurring.
When to Apply Bayes' Theorem:
Bayes' theorem should be considered when the following conditions exist:
➢ Within the sample space, there exists an event B, for which P(B) > 0.
➢ The analytical goal is to compute a conditional probability of the form: P( Ak | B ).
➢ At least one of the two sets of probabilities described below should be known:
• P( Ak ∩ B ) for each Ak, or
• P( Ak ) and P( B | Ak ) for each Ak
➢ When it actually rains, the weatherman correctly forecasts rain 90% of the time.
➢ When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the
probability that it will rain on the day of Marie's wedding?
➢ The sample space is defined by two mutually-exclusive events – it rains or it does not
rain.
✓ P( A1 ) = 5/365 = 0.0136986 [It rains 5 days out of the year.]
✓ P( A2 ) = 360/365 = 0.9863014 [It does not rain 360 days out of the year.]
✓ P( B | A1 ) = 0.9 [When it rains, the weatherman predicts rain 90% of the time.]
✓ P( B | A2 ) = 0.1 [When it does not rain, the weatherman predicts rain 10% of the time.]
We want to know P( A1 | B ), the probability it will rain on the day of Marie's wedding, given a
forecast for rain by the weatherman. The answer can be determined from Bayes' theorem, as
shown below.
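The worked computation, as a minimal Python sketch of the figures above:

```python
# Bayes' theorem for the wedding example:
# P(A1|B) = P(A1) P(B|A1) / (P(A1) P(B|A1) + P(A2) P(B|A2))
p_a1, p_a2 = 5 / 365, 360 / 365  # rain / no rain on a random day
p_b_a1, p_b_a2 = 0.9, 0.1        # forecast behaviour given rain / no rain

p_a1_b = (p_a1 * p_b_a1) / (p_a1 * p_b_a1 + p_a2 * p_b_a2)
print(round(p_a1_b, 3))  # 0.111 -> only about an 11% chance of rain despite the forecast
```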
• The null and alternative hypotheses for the Kruskal-Wallis test are as follows: H0, the population distributions (medians) are all equal; HA, at least one population differs from the others.
4. Your two variables should be measured on an ordinal scale or a continuous scale (i.e., an
interval or ratio scale).
If these conditions are met, the test is approximated by a chi-square distribution with k – 1
degrees of freedom where k is the number of samples.
Given three or more independent samples, the test statistic H for the Kruskal-Wallis test is
H = [12 / (N(N + 1))] × Σ (Ri² / ni) − 3(N + 1)
where N is the total number of observations, ni is the size of the i-th sample, and Ri is the sum of the ranks in the i-th sample.
➢ An advantage with this test is that the two samples under consideration do not necessarily
need to have the same number of observations or instances.
1. The first assumption is that the two samples are independent of each other.
2. The second is that the observations are numeric or ordinal (i.e. can be arranged in ranks/orders).
U1 = n1 n2 + n1(n1 + 1)/2 − R1
U2 = n1 n2 + n2(n2 + 1)/2 − R2
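A minimal Python sketch applying these two formulas to hypothetical samples (no tied values, so ranking is straightforward):

```python
# Mann-Whitney U from the formulas above, for two small hypothetical samples.
sample1 = [3, 4, 2, 6]
sample2 = [9, 7, 5, 10]

combined = sorted(sample1 + sample2)
rank = {v: i + 1 for i, v in enumerate(combined)}  # ranks 1..8, no ties here

n1, n2 = len(sample1), len(sample2)
r1 = sum(rank[v] for v in sample1)  # rank sum of sample 1 = 11
r2 = sum(rank[v] for v in sample2)  # rank sum of sample 2 = 25

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
print(u1, u2)  # 15.0 1.0; note u1 + u2 = n1 * n2 = 16
```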
Concept of Research:
Characteristics of Research
1. Research is a Systematic and critical Investigation, into a Phenomenon.
Types of Research
• Qualitative research, on the other hand, is based on words, feelings, emotions, sounds
and other non-numerical and unquantifiable elements. It has been noted that “information
is considered qualitative in nature if it cannot be analysed by means of mathematical
techniques. This characteristic may also mean that an incident does not take place often
enough to allow reliable data to be collected”
• Descriptive research usually involves surveys and studies that aim to identify the facts.
In other words, descriptive research mainly deals with the “description of the state of
affairs as it is at present” and there is no control over variables in descriptive research.
Applied research is also referred to as action research, and fundamental research is
sometimes called basic or pure research.
The table below summarizes the main differences between applied research and fundamental
research. Similarities between applied and fundamental (basic) research relate to the adoption of
a systematic and scientific procedure to conduct the study.
Applied Research Fundamental Research
Table below illustrates the main differences between exploratory and conclusive research
designs:
Research Design
▪ Research design is a pre-planned sketch for the solution of a problem. It is the first
step to be taken before undertaking the whole research.
▪ RESEARCH DESIGN refers to the plan, structure, and strategy of research: the
blueprint that will guide the research process.
▪ The study will be conducted on the basis of this research design.
▪ It gives us a clue as to how the further process will take place and how the research study
will be carried through classification, interpretation, and suggestions.
▪ This is a guideline for the whole work.
There are different types of research design depending on the nature of the problem and the
objectives of the study. Following are the four types of research design.
• The researcher looks for unexplored situations and brings them to the notice of the people.
• The data collected are mostly qualitative in nature: the knowledge, attitudes, beliefs, and
opinions of the people. Examples of such designs are newspaper articles, films,
dramas, documentaries, etc.
➢ A written statement prepared for the benefit of others describing what has happened or a
state of affairs normally based on investigation.
➢ A report is a piece of factual writing, usually based on some kind of research or real-life
experience.
✓ It is a reliable document