
Consu Bit (Numerical and Statistical) 1

1. Regression analysis measures the relationship between two variables and uses this relationship to predict unknown values. It involves identifying an independent variable used for prediction and a dependent variable whose values are being predicted. 2. Correlation analysis measures the strength and direction of the relationship between two variables. A high positive correlation means the variables increase together, while a high negative correlation means they change in opposite directions. Measures of dispersion quantify how spread out a data set is.

Uploaded by

Sreejith

REGRESSION ANALYSIS

Regression measures the nature of the relationship between two variables. It provides a functional relationship between two variables x and y in the form of an equation. Regression analysis is used for prediction or estimation: once the functional relationship between x and y is known, we can predict the unknown value of one variable from the known value of the other. In regression analysis we come across two types of variables, the independent variable and the dependent variable. The variable whose value is to be predicted is called the dependent variable, and the variable used for prediction is called the independent variable.

Linear regression
When the points of a bivariate data set are plotted in a scatter diagram and they concentrate around a straight line, we say that there is a linear regression between x and y. The straight line around which the points concentrate is called the regression line or line of best fit.

Properties of regression coefficients
• $b_{yx} \neq b_{xy}$ in general (regression coefficients are not symmetric).
• $b_{yx}$ is the coefficient of x in the regression equation of y on x.
• $b_{xy}$ is the coefficient of y in the regression equation of x on y.
• $b_{yx} \cdot b_{xy} \le 1$
• Correlation coefficient $r = \pm\sqrt{b_{yx} \cdot b_{xy}}$. That is, the correlation coefficient is the geometric mean of the regression coefficients, taken with their common sign.
• $r$, $b_{yx}$ and $b_{xy}$ have the same sign. Either all three coefficients are positive or all three are negative.
• If the correlation coefficient r = 0, the two regression lines are perpendicular to each other.
• If the correlation coefficient r = ±1, the two regression lines coincide.
• The point of intersection of the two regression lines is $(\bar{x}, \bar{y})$. So by solving the equations of the two regression lines we can find the means $(\bar{x}, \bar{y})$.

Identifying the two regression lines
Suppose we are given the equations of two regression lines. To identify which one is the regression line of y on x and which one is the regression line of x on y, we first assume that one of them is the regression line of y on x and the other is the regression line of x on y. From the assumed regression line of y on x we calculate the regression coefficient $b_{yx}$, and from the assumed regression line of x on y we calculate $b_{xy}$. Then we find the product $b_{yx} \cdot b_{xy}$. If this product is less than or equal to 1, our assumption is true. Otherwise our assumption is false and we have to swap the two lines: the line assumed to be y on x is actually x on y, and the line assumed to be x on y is actually y on x.

PROBABILITY

The word probability means chance. In statistics, probability refers to a real number between 0 and 1 which represents the chance of occurrence of an event.

Random experiment : A random experiment is an experiment having several possible outcomes, where we cannot predict which outcome will turn up in a particular trial.
Example :- Tossing a coin, throwing a die, selecting a card from a pack of cards etc.

Sample space : Sample space is the set of all possible outcomes of a random experiment. It is denoted by the letter S or Ω.
Example :-
• In the coin tossing experiment, sample space S = { H, T }
• In the die throwing experiment, sample space S = { 1, 2, 3, 4, 5, 6 }
• In the random experiment of tossing two coins, sample space S = { HH, HT, TH, TT }

Sample points : The elements of the sample space are called sample points.

Trial : A trial is an attempt to produce an outcome of the random experiment.

Events : Out of all the outcomes in a sample space, certain outcomes satisfy a particular condition. Sets of such outcomes are called events. Events are subsets of the sample space. They are denoted by the letters A, B, C …

Equally likely events : Two or more events are said to be equally likely if they have the same chance of occurrence in a trial.

Mutually exclusive events : Two or more events are said to be mutually exclusive if they cannot occur simultaneously in the same trial.

Favourable cases of an event : The outcomes which result in the happening of the event are called the favourable cases of the event.

Exhaustive cases of a random experiment : The totality of outcomes of a random experiment is called its exhaustive cases.

Factorial notation, Permutations and Combinations
Factorial : The factorial of a number is the product of the integers from 1 to that number.
$n! = 1 \times 2 \times 3 \times \dots \times n$
Eg : $5! = 1 \times 2 \times 3 \times 4 \times 5$

Permutations : The number of arrangements of n things taken r at a time is given by
$^nP_r = n(n-1)(n-2)\dots$ [r terms]
Eg : $^7P_3 = 7 \times 6 \times 5$
The number of arrangements of n things taken n at a time is given by
$^nP_n = n(n-1)(n-2)\dots 3 \cdot 2 \cdot 1 = n!$
Note : In probability, permutation is used in problems of arranging the letters of a word, arranging people on seats etc.

Combinations : The number of ways in which n things can be combined taking r at a time is given by
$^nC_r = \dfrac{n(n-1)(n-2)\dots \text{[r terms]}}{1 \cdot 2 \cdot 3 \dots r}$

Limitations of classical definition of probability
• If the events of a random experiment are not equally likely, the classical definition cannot be applied.
• If the total number of outcomes of a random experiment becomes infinite, the classical definition fails to give a measure of probability.
• The classical definition of probability can be applied mainly to games of chance like tossing a coin, throwing a die etc.

Frequency definition of probability
Consider a random experiment which is repeated n times. Let an event A occur in f out of the n repetitions. Then f is called the frequency and $f/n$ is called the frequency ratio or relative frequency of the event A. When n becomes very large, the frequency ratio becomes more regular and approaches a constant. This constant value is called the probability of event A, and the process of the frequency ratio approaching a constant when n becomes very large is called statistical regularity. That is, by the frequency definition,
$P(A) = \dfrac{f}{n}$, when n is very large.

Limitations of frequency definition of probability
• If a random experiment is repeated a large number of times, the experimental conditions may not remain identical.
• The limit may not attain a unique value, however large n may be.

Algebra of events
For an event A
1) Occurrence of the event A → $A$
2) Non-occurrence of the event A (not A) → $A^c$ or $A'$
For any two events A and B
1) Occurrence of both (A and B) → $A \cap B$
2) Occurrence of at least one of A, B (A or B) → $A \cup B$
3) Occurrence of only A (A and not B) → $A \cap B^c$
4) Occurrence of only B (not A and B) → $A^c \cap B$
5) Occurrence of exactly one → $(A \cap B^c) \cup (A^c \cap B)$
6) Occurrence of none (not A and not B) → $A^c \cap B^c$

Addition theorem of probability
For any two events A and B, P(A or B) = P(A) + P(B) – P(A and B), that is,
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
In particular, if A and B are two mutually exclusive events then $P(A \cup B) = P(A) + P(B)$.

FITTING OF CURVES / CURVE FITTING

Fitting a curve to a given bivariate data set refers to finding the most appropriate curve which fits the given data. Such a curve is called the curve of best fit. The curve of best fit is obtained using the principle of least squares.
The principle of least squares states that the sum of squares of errors between the observed values and the expected values should be minimum.
Consider a bivariate data set $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. We are interested in finding the equation of a curve y = f(x) which shows the approximate relation between the x and y values so that the sum of squares of errors is minimum. Here x is the independent variable and y is the dependent variable. In the data, $y_i$ represents the observed value of the variable y corresponding to the value $x_i$ of x, i = 1, 2, …, n. Let $y_e$ be the expected value of y. Then the error is $e_i = y_i - y_e$, i = 1, 2, …, n. The sum of squares of errors is given by
$S = \sum (y_i - y_e)^2$
By the method of calculus, S is minimum if its first derivative is 0 and the second derivative is greater than 0. By equating the first derivative to 0, we can find the unknown constants in the equation y = f(x) and hence find the equation of the curve of best fit.

Fitting of straight line
Consider the problem of fitting a straight line of the form y = ax + b to a given bivariate data set. Here we have to find the constants a and b. The two normal equations for finding the unknown constants a and b are obtained using the principle of least squares. By equating $\partial S/\partial a = 0$ and $\partial S/\partial b = 0$ we get the two normal equations:
$\sum y_i = a \sum x_i + nb$
$\sum x_i y_i = a \sum x_i^2 + b \sum x_i$
By solving these two normal equations we can find a and b and hence the line of best fit to the given data.

CORRELATION ANALYSIS

• Statistical data : Statistical data refers to a set of numbers collected for a predetermined purpose.
• Univariate data : Statistical data providing information about a single characteristic or variable (x) is called univariate data. Eg : The marks of students in a class.
• Bivariate data : Statistical data providing information about two characteristics or variables (x, y) is called bivariate data. Eg : The heights and weights of students in a class; price and demand of different commodities.

CORRELATION
Correlation is a statistical device which measures the degree or strength of relationship between two variables. A high correlation means there is a strong relationship between the variables and a low correlation means the relationship between the variables is weak.

TYPES OF CORRELATION
• Positive correlation : If, in a bivariate data set, the values of the two variables move in the same direction, the correlation is said to be positive. That is, when the value of x increases the value of y also increases, or when x decreases y also decreases.
• Negative correlation : If, in a bivariate data set, the values of the two variables move in opposite directions, the correlation is said to be negative. That is, when the value of x increases the value of y decreases, or when x decreases y increases.
• Zero correlation or no correlation : When there is no association between the two variables x and y, there is no correlation or zero correlation between x and y.
• Perfect correlation : If the values of one variable are proportional to the values of the other variable, the correlation is said to be perfect. If the values are directly proportional, the correlation is perfectly positive.

KARL PEARSON'S COEFFICIENT OF CORRELATION
Karl Pearson's coefficient of correlation is a mathematical method for studying correlation. It is a real number lying between -1 and +1 which tells us the degree or strength of relationship between two variables. It is denoted by the letter 'r' or '$r_{xy}$'.
$r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{V(x)}\,\sqrt{V(y)}} = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$
On simplification, the formula for Karl Pearson's correlation coefficient can be written as
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$

Spearman's rank correlation coefficient measures the correlation between two sets of ranks. Usually qualities like beauty, intelligence, sincerity etc. cannot be measured directly. Instead they can be given ranks. In such cases we can use Spearman's rank correlation coefficient to measure their degree of relationship.

RANDOM VARIABLES

A random variable is a real valued function defined over the sample space of a random experiment. Random variables are usually denoted by X, Y, Z etc. The domain of the function is the sample space and the range is the set of real numbers.

Discrete random variables : A random variable is said to be discrete if it takes a finite or countably infinite number of values.

Probability mass function (pmf)
Let X be a discrete random variable. Then the function f(x) = P(X = x) is called the probability mass function or pmf of X if it satisfies the following conditions:
1) $f(x) \ge 0$, for every x
2) $\sum_x f(x) = 1$ (Total pmf = 1)

Cumulative distribution function (cdf) / Distribution function
Let X be a discrete random variable having the pmf f(x). Then the cdf of X is given by
$F(x) = P(X \le x) = \sum_{t \le x} f(t)$

Properties of cdf
• $F(x) \ge 0$
• $F(-\infty) = 0$
• $F(+\infty) = 1$
• The graph of the cdf F(x) is a step function.
• $P(a < X \le b) = F(b) - F(a)$

Continuous random variables
A random variable is said to be continuous if it can take an uncountably infinite number of values.
Examples of continuous random variables
• Life time of an electric bulb.
• Temperature of a place.
• Time taken to finish a running race.
• Length of a film.

Probability density function (pdf)
The probability of a continuous random variable X is represented by a function f(x). This function f(x) is called the probability density function or pdf of X if it satisfies the following conditions:
1) $f(x) \ge 0$, for every x
2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (Total pdf = 1)

MEASURES OF DISPERSION

Measures of dispersion are the statistical devices used to measure the scatteredness or variation of observations in a set of data. They tell us the extent to which the values of a data set differ from each other or from their average.

Purpose of measuring variation
• To test the reliability of an average.
• To compare the variability of two or more sets of data.
• To exercise control over variability.

Desirable properties of an ideal measure of dispersion
• It should be rigidly defined.
• It should be simple to understand and easy to calculate.
• It should be based on all observations.
• It should be capable of further mathematical treatment.
• It should have sampling stability.

Types of measures of dispersion
Measures of dispersion are classified as absolute measures of dispersion and relative measures of dispersion. Absolute measures of dispersion are used to find the variation within a single set of data. Relative measures of dispersion are used for comparing two or more sets of data for their variability.
Important absolute measures of dispersion are
• Range
• Quartile Deviation (QD)
• Mean Deviation (MD)
• Standard Deviation (SD)
Corresponding relative measures of dispersion are
• Coefficient of range
• Coefficient of quartile deviation
• Coefficient of mean deviation
• Coefficient of standard deviation and Coefficient of variation
Note : A relative measure of dispersion is the ratio of an absolute measure of dispersion to an appropriate average from which the deviations are measured.

RANGE
Range is the difference between the highest and lowest values in a data set.
Range = H – L
Coefficient of range = $\dfrac{H - L}{H + L}$
where H = highest value in the data and L = lowest value in the data.
Merits of range
• It is simple to understand and easy to calculate.
• Range is a popular measure in the fields of medicine and weather forecasting.
Demerits of range
• It is not based on all observations.
• It does not have sampling stability.
• It cannot be calculated for open end data.

QUARTILE DEVIATION (QD)
Quartile deviation is defined as half the difference between the third and first quartiles.
$QD = \dfrac{Q_3 - Q_1}{2}$
The only measure of dispersion that can be applied to open end data is quartile deviation.
Merits of QD
• It is rigidly defined.
• It is simple to understand and easy to calculate.
• It is not unduly affected by extreme values.
• It can be calculated for open end data.
Demerits of QD
• It is not based on all observations.
• It is not capable of further mathematical treatment.
• It doesn't have sampling stability.

MEAN DEVIATION (MD)
Mean deviation is defined as the arithmetic mean of the absolute values of deviations of observations from an average. The average can be the mean, median or mode.
Merits of mean deviation
• Mean deviation is rigidly defined.
• It is based on all observations.
• It is less affected by extreme values.
Demerits of mean deviation
• Mean deviation suffers from inaccuracy because the signs of the deviations are ignored.
• It is not capable of further mathematical treatment.
• It cannot be calculated for open end data.

STANDARD DEVIATION (SD)
Standard deviation is defined as the square root of the arithmetic mean of squares of deviations of observations from the arithmetic mean. It is denoted by the letter σ. The square of the standard deviation is called variance. It is denoted by σ².

Standard deviation in raw data
SD, $\sigma = \sqrt{\dfrac{1}{n}\sum x^2 - \bar{x}^2}$ or $\sigma = \sqrt{\dfrac{1}{n}\sum (x - \bar{x})^2}$
Variance, $\sigma^2 = \dfrac{1}{n}\sum (x - \bar{x})^2$
Coefficient of standard deviation = $\dfrac{SD}{\text{mean}} = \dfrac{\sigma}{\bar{x}}$
Coefficient of variation (CV) = $\dfrac{\sigma}{\bar{x}} \times 100$

Standard deviation in discrete series
SD, $\sigma = \sqrt{\dfrac{1}{N}\sum f(x - \bar{x})^2}$, where N = Σf
Variance, $\sigma^2 = \dfrac{1}{N}\sum f(x - \bar{x})^2$
Coefficient of standard deviation = $\dfrac{SD}{\text{mean}} = \dfrac{\sigma}{\bar{x}}$
Coefficient of variation (CV) = $\dfrac{\sigma}{\bar{x}} \times 100$

Standard deviation in continuous series
SD, $\sigma = \sqrt{\dfrac{1}{N}\sum f(x - \bar{x})^2}$, where N = Σf and x = mid value of classes
Variance, $\sigma^2 = \dfrac{1}{N}\sum f(x - \bar{x})^2$
Coefficient of standard deviation = $\dfrac{SD}{\text{mean}} = \dfrac{\sigma}{\bar{x}}$
Coefficient of variation (CV) = $\dfrac{\sigma}{\bar{x}} \times 100$
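
As a rough illustration of the standard deviation and coefficient of variation formulas for raw data and discrete series, here is a minimal Python sketch. The function names and the sample marks are hypothetical and chosen only for illustration; the sketch uses the divide-by-n (or divide-by-N) form given in these notes.

```python
import math


def sd_raw(xs):
    """SD of raw data: square root of (1/n) * sum of (x - mean)^2."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    return math.sqrt(variance)


def sd_discrete(xs, fs):
    """SD of a discrete series with values xs and frequencies fs (N = sum of f)."""
    total = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / total
    variance = sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / total
    return math.sqrt(variance)


def coefficient_of_variation(sd, mean):
    """CV = (sigma / mean) * 100."""
    return sd / mean * 100


# Hypothetical marks of eight students (raw data).
marks = [2, 4, 4, 4, 5, 5, 7, 9]
mean_marks = sum(marks) / len(marks)
sigma_raw = sd_raw(marks)

# The same data written as a discrete series (value, frequency).
values = [2, 4, 5, 7, 9]
freqs = [1, 3, 2, 1, 1]
sigma_discrete = sd_discrete(values, freqs)

cv = coefficient_of_variation(sigma_raw, mean_marks)
print(sigma_raw, sigma_discrete, cv)
```

Both forms describe the same eight observations, so the two functions should agree on σ; the discrete-series version only groups repeated values with their frequencies.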
Merits of SD
• It is rigidly defined.
• It is based on all observations.
• It is capable of further mathematical treatment.
• It has sampling stability.
Demerits of SD
• It is difficult to understand and calculate.
• It cannot be calculated for open end data.

Definition of statistics as statistical data :
According to Professor Horace Secrist, "Statistics are aggregates of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation with each other." In short, statistics are sets of numbers collected for a predetermined purpose.

Definition of statistics as statistical methods :
According to Croxton and Cowden, "Statistics may be defined as the science of collection, presentation, analysis and interpretation of numerical data."

Statistical data
Statistical data may be classified into three types:
• Raw data.
• Discrete series / discrete frequency distribution.
• Continuous series / continuous frequency distribution.
Raw data is unarranged data. Here all the numbers in the data are simply listed. In a discrete series the data values (x) are written along with their frequencies (f). In a continuous series the data is divided into different classes and each class is written along with its frequency.

MEASURES OF CENTRAL TENDENCY (AVERAGES)

Measures of central tendency or averages are single representative values which represent a large number of numerical values. An average can be considered as a central value around which all other values in the data cluster. Important measures of central tendency / averages are
• Arithmetic mean (AM)
• Geometric mean (GM)
• Harmonic mean (HM)
• Median
• Mode
Here arithmetic mean, geometric mean and harmonic mean are called mathematical averages, while median and mode are called positional averages.

ARITHMETIC MEAN (AM)
Arithmetic mean is defined as the sum of observations divided by the number of observations.

Arithmetic mean in raw data
If there are n observations in the data then their arithmetic mean is given by
$\bar{x} = \dfrac{\sum x}{n}$

Arithmetic mean in discrete series
If x denotes the data values and f denotes their frequencies then their arithmetic mean is given by
$\bar{x} = \dfrac{\sum fx}{N}$, where N = Σf

Arithmetic mean in continuous series
If x denotes the mid values of the classes and f denotes their frequencies, then their arithmetic mean is given by
$\bar{x} = \dfrac{\sum fx}{N}$, where N = Σf

Merits of arithmetic mean
• It is simple to understand and easy to calculate.
• It is rigidly defined.
• It is based on all observations.
• It is capable of further mathematical treatment.
• It has sampling stability.
Demerits of arithmetic mean
• It is affected by extreme values.
• It cannot be calculated for open end data.
• It cannot be determined graphically.

GEOMETRIC MEAN (GM)
Geometric mean of n observations is defined as the nth root of the product of the observations.

Geometric mean in raw data
$GM = \mathrm{Antilog}\left(\dfrac{\sum \log x}{n}\right)$

Geometric mean in discrete series
$GM = \mathrm{Antilog}\left(\dfrac{\sum f \log x}{N}\right)$
where N = Σf. Here x represents the observations and f represents their frequencies.

Geometric mean in continuous series
$GM = \mathrm{Antilog}\left(\dfrac{\sum f \log x}{N}\right)$
where N = Σf. Here x represents the mid values of the classes and f represents their frequencies.

Note : Geometric mean is used to find the averages of ratios and percentages.

Merits of geometric mean
• It is rigidly defined.
• It is based on all observations.
• It is capable of further mathematical treatment.
• It has sampling stability.
Demerits of geometric mean
• It is not simple to understand and is difficult to calculate.
• If one or more of the values are zero or negative then the geometric mean cannot be calculated.
• It cannot be calculated for open end data.
• It cannot be determined graphically.

HARMONIC MEAN (HM)
Harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the observations.
Note : Harmonic mean is used to find the average of speeds.

Harmonic mean in raw data
If there are n observations in the data then their harmonic mean is
$HM = \dfrac{n}{\sum (1/x)}$

Harmonic mean in discrete series
If x denotes the observations in the data and f represents their frequencies then
$HM = \dfrac{N}{\sum (f/x)}$
where N = total frequency.

Harmonic mean in continuous series
If x denotes the mid values of the classes and f represents their frequencies then
$HM = \dfrac{N}{\sum (f/x)}$
where N = total frequency.

Merits of harmonic mean
• It is rigidly defined.
• It is simple to understand and easy to calculate.
• It is based on all observations.
• It is capable of further mathematical treatment.
• It has sampling stability.
Demerits of harmonic mean
• It is affected by extreme values.
• If one of the values is zero, harmonic mean cannot be determined.
• It cannot be determined graphically.
• It cannot be calculated for open end data.

Relation between AM, GM and HM
• AM ≥ GM ≥ HM
• $GM^2 = AM \times HM$ or $GM = \sqrt{AM \times HM}$

MEDIAN
Median is the middle-most observation in a data set which is arranged in ascending or descending order.

Median in raw data
Median = $\left(\dfrac{n+1}{2}\right)$th observation, when the observations are arranged in ascending or descending order.

Median in discrete frequency distribution
Median = $\left(\dfrac{N+1}{2}\right)$th observation, where N = Σf
Steps for finding median in discrete frequency distribution
1. Calculate (N+1)/2.
2. Find the cumulative frequencies.
3. Identify the cumulative frequency just greater than or equal to (N+1)/2.
4. The observation corresponding to that cumulative frequency will be the median.

Median in continuous frequency distribution
Median = $l + \dfrac{\left(\frac{N}{2} - m\right)}{f} \times c$
where
l = lower limit of the median class
Median class = class containing the $\left(\dfrac{N+1}{2}\right)$th observation
N = Σf, total frequency
m = cumulative frequency of the class preceding the median class
c = class width of the median class
f = frequency of the median class

Merits of median
• It is rigidly defined.
• It is simple to understand and easy to calculate.
• It is not much affected by extreme values.
• It can be calculated for open end data.
• It can be determined graphically.
Demerits of median
• It is not based on all observations.
• It is not capable of further mathematical treatment.
• It does not have sampling stability.

MODE
Mode is the most frequently occurring observation in a data set.
Note : In some cases mode is ill defined. In such cases mode can be obtained using the empirical relation between mean, median and mode, which is
Mean – Mode = 3(Mean – Median)
or Mode = 3 Median – 2 Mean.

Mode in raw data
Mode = the observation which repeats the greatest number of times.

Mode in discrete series
Mode = the observation having the highest frequency.

Mode in continuous series
Mode = $l + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times c$
where
l = lower limit of the modal class
Modal class = class having the highest frequency
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
Δ1 = f1 - f0
Δ2 = f1 - f2
c = class width of the modal class

AVERAGES (summary)
Average         | Raw data                | Discrete series           | Continuous series
Arithmetic mean | x̄ = Σx/n               | x̄ = Σfx/N                | x̄ = Σfx/N
Geometric mean  | GM = Antilog(Σlog x /n) | GM = Antilog(Σf log x /N) | GM = Antilog(Σf log x /N)
Harmonic mean   | HM = n/Σ(1/x)           | HM = N/Σ(f/x)             | HM = N/Σ(f/x)

Merits of mode
• Mode is simple to understand and easy to calculate.
• It is not much affected by extreme values.
• It can be calculated for open end data.
• It can be determined graphically.
• It is the most typical or representative value in a data set since it has the greatest frequency.
Demerits of mode
• It is not rigidly defined.
• It is not based on all observations.
• It is not capable of further mathematical treatment.
• It does not have sampling stability.
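
The formula for the mode of a continuous series, Mode = l + Δ1/(Δ1 + Δ2) × c, can be sketched in Python as below. This is only an illustration: the function name `grouped_mode`, the equal-width classes and the sample frequencies are hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption that these notes do not state.

```python
def grouped_mode(lower_limits, class_width, freqs):
    """Mode of a continuous series with equal class widths.

    Uses Mode = l + d1 / (d1 + d2) * c, where d1 = f1 - f0 and d2 = f1 - f2.
    """
    # Modal class: the (first) class with the highest frequency.
    i = max(range(len(freqs)), key=lambda k: freqs[k])
    f1 = freqs[i]
    # Assumption: a missing neighbouring class is treated as frequency 0.
    f0 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = f1 - f0
    d2 = f1 - f2
    if d1 + d2 == 0:
        # All three frequencies are equal, so the mode is ill defined here.
        raise ValueError("mode is ill defined for this distribution")
    return lower_limits[i] + d1 / (d1 + d2) * class_width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 7, 3]
mode = grouped_mode(lower_limits, 10, freqs)
print(mode)
```

Here the modal class is 20-30, so l = 20, Δ1 = 12 - 8 = 4 and Δ2 = 12 - 7 = 5, giving 20 + (4/9) × 10. When the function raises an error, the empirical relation Mode = 3 Median – 2 Mean is the fallback these notes suggest.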
