Unit 3
Data: A set of values recorded on one or more observational units; in other words, factual
information collected during research studies.
Qualitative data: Variables that yield observations by which individuals can be categorized
according to certain characteristics or qualities are referred to as qualitative variables or data, e.g.
gender, occupation, marital status, and educational level.
Quantitative data: The variables that yield observations that can be measured are considered
quantitative data, e.g. height, weight, blood pressure, serum cholesterol, body temperature.
Quantitative data are further divided into discrete and continuous data.
Discrete data: Data recorded as whole numbers are called discrete data, such as the number of
children in a family, pulse rate, ESR, blood sugar, blood pressure, etc. This can be understood with
the following example. The pulse rates recorded for ten people are as follows:
80, 72, 75, 82, 77, 83, 86, 74, 78, 88
Continuous data: Data that can take fractional values, such as height, weight, body
temperature, chest circumference, etc., are called continuous data. This can be understood with the
following example:
45.7, 50.2, 48.9, 48.4, 56.5, 44.5, 47.8, 47.8, 45.5, 46.3
Frequency Curve:
A frequency curve is a smooth, graphical representation of a frequency distribution, which shows
how often different values occur in a dataset. The curve is formed by connecting the midpoints of
the top edges of a histogram's bars with a freehand, smooth curve.
X-axis:
It represents the variable being measured (e.g., height, weight, scores).
Y-axis:
It represents the frequency, or how many times each value (or range of values) occurs.
Shape:
The curve's shape can reveal information about the data distribution.
Common shapes include:
1) Normal distribution (bell curve): Symmetrical, with the highest frequency in the center and
tails tapering off on both sides.
2) Skewed distributions: Asymmetrical, with a longer tail on one side, indicating a higher
concentration of values towards one end.
3) U-shaped curve: Has a low frequency in the center and higher frequencies at the extremes.
4) J-shaped curve: Has its peak at one extreme and slopes steadily away from it, with no second peak.
5) Mixed curve: A combination of different shapes.
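The points through which a frequency curve is drawn can be worked out by grouping the data into class intervals and pairing each class midpoint with its frequency. The sketch below does this for the continuous values listed earlier. The class width of 3 and the starting value of 44 are illustrative assumptions, not part of the notes, and the function name `frequency_points` is hypothetical.

```python
# Continuous values taken from the example earlier in these notes.
values = [45.7, 50.2, 48.9, 48.4, 56.5, 44.5, 47.8, 47.8, 45.5, 46.3]

# Assumed class intervals: width 3, starting at 44 (an illustrative choice).
start, width, n_classes = 44, 3, 5


def frequency_points(data, start, width, n_classes):
    """Return (class midpoint, frequency) pairs, one per class interval."""
    counts = [0] * n_classes
    for v in data:
        index = int((v - start) // width)  # which class the value falls in
        if 0 <= index < n_classes:
            counts[index] += 1
    midpoints = [start + width * (i + 0.5) for i in range(n_classes)]
    return list(zip(midpoints, counts))


points = frequency_points(values, start, width, n_classes)

# Crude text plot: X-axis value (midpoint), then the frequency as stars.
for midpoint, freq in points:
    print(f"{midpoint:5.1f}  {freq}  {'*' * freq}")
```

Joining the (midpoint, frequency) points with a smooth line gives the frequency curve. With these ten values the longer tail is on the right, which suggests a skewed shape, although ten observations are too few to judge the shape reliably.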
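VARIANCE:
Variance is a statistical measure that shows how much the values in a dataset deviate from the mean
(average). It gives a sense of how spread out or concentrated the data are.
Calculation of variance:
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{N}$$
where x is a data value, x̄ is the mean and N is the number of data values. When the data are a
sample rather than the whole population, the sum is usually divided by N − 1 instead.
The standard deviation σ is the square root of the variance:
$$\sigma = \sqrt{\text{Variance}}$$
or
Variance = σ²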
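A minimal sketch of the variance calculation, applied to the ten pulse rates listed earlier. It assumes the sum of squared deviations is divided by the number of values N (population variance). The function name `population_variance` is hypothetical; `statistics.pvariance` and `statistics.variance` are the standard library's population and sample versions.

```python
import statistics

# Pulse rates of ten people, from the discrete data example.
pulse = [80, 72, 75, 82, 77, 83, 86, 74, 78, 88]


def population_variance(data):
    """Mean of the squared deviations from the mean (divides by N)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


variance = population_variance(pulse)
std_dev = variance ** 0.5  # standard deviation is the square root of variance

print("mean:", sum(pulse) / len(pulse))
print("variance:", variance)
print("standard deviation:", std_dev)

# Cross-check against the standard library.
print("pvariance (divides by N):", statistics.pvariance(pulse))
print("variance (divides by N - 1):", statistics.variance(pulse))
```

For these values the mean is 79.5 and the squared deviations add up to 248.5, so the population variance is 248.5 / 10 = 24.85.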
Covariance of Data:
Covariance is a measure of the relationship between two random variables: it describes the extent to
which they change together, that is, whether a change in one variable tends to be accompanied by a
change in the other, and in which direction. Covariance is measured in units that are obtained by
multiplying the units of the two variables.
Types of Covariance
Covariance can have both positive and negative values. Based on this, it has two types:
• Positive Covariance
• Negative Covariance
Positive Covariance
If the covariance of two variables is positive, the variables tend to move in the same
direction and show similar behaviour. That is, greater values of one variable correspond to greater
values of the other, and lesser values correspond to lesser values.
Negative Covariance
If the covariance of two variables is negative, the variables tend to move in
opposite directions. It is the opposite case of positive covariance, where greater values of
one variable correspond to lesser values of another variable and vice-versa.
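The covariance is calculated as:
$$\mathrm{cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N}$$
Where,
xi = data value of x
yi = data value of y
x̄ = mean of x
ȳ = mean of y
N = number of data values.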
If cov(X, Y) is greater than zero, the covariance is positive and the two variables
move in the same direction.
If cov(X, Y) is less than zero, the covariance is negative and the two variables
move in opposite directions.
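The sign rule can be checked with a short sketch. It assumes the covariance is the sum of the products of the paired deviations from the means, divided by the number of data values N. The three data lists are hypothetical values made up for illustration, and the function name `covariance` is likewise an assumption.

```python
def covariance(xs, ys):
    """Sum of (xi - mean_x) * (yi - mean_y), divided by N."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]    # increases as x increases
y_falling = [10, 8, 6, 4, 2]   # decreases as x increases

cov_pos = covariance(x, y_rising)
cov_neg = covariance(x, y_falling)

print("cov(x, y_rising) =", cov_pos)    # greater than zero
print("cov(x, y_falling) =", cov_neg)   # less than zero
```

Here the first pair moves in the same direction, so the covariance comes out as 4.0, while the second pair moves in opposite directions and gives -4.0.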
Quartile:
Quartiles are a set of three values that divide an ordered data set into four equal parts.
The middle quartile is the median: it marks the central point of the distribution. The lower
quartile marks the middle of the half of the data that falls below the median, and the upper
quartile marks the middle of the remaining half, which falls above the median.
So, there are three quartiles, first, second and third, represented by Q1, Q2 and Q3, respectively.
Q2 is the median; because it is defined by the position of an item in the ordered list, it is a
positional average. To find the quartiles of a group of data, first arrange the data in ascending
order. Then, with n as the number of items:
Q1 = [(n+1)/4]th item
Q2 = [(n+1)/2]th item
Q3 = [3(n+1)/4]th item
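The position formulas can be sketched in code using the ten pulse rates from the discrete data example. One assumption is added here: when a position is not a whole number, the value is found by linear interpolation between the two neighbouring items. The function names `value_at_position` and `quartiles` are hypothetical.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional position in ordered data.

    Assumption: fractional positions are resolved by linear interpolation,
    and positions outside the data are clamped to the first or last item.
    """
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part of the position
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    low_value = ordered[lower - 1]
    high_value = ordered[lower]
    return low_value + fraction * (high_value - low_value)


def quartiles(data):
    """Return (Q1, Q2, Q3) using the k(n+1)/4 position rule."""
    ordered = sorted(data)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


pulse = [80, 72, 75, 82, 77, 83, 86, 74, 78, 88]
q1, q2, q3 = quartiles(pulse)
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3)
```

With n = 10 the positions are 2.75, 5.5 and 8.25, so under the interpolation assumption Q1 = 74.75, Q2 = 79 and Q3 = 83.75.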
Percentile:
A percentile is a statistical measure that indicates the relative standing of a value within a dataset.
For example, if a student scores in the 90th percentile on a test, they have scored better than 90% of
the other students who took the test. A percentile is a measure used to indicate the value below which
a given percentage of observations in a group of observations fall.
Formula of Percentile
After arranging the data in ascending order, we need to calculate the rank. The formula for rank is given as
$$R = \frac{P}{100} \times (N + 1)$$
where P is the desired percentile and N is the number of values in the dataset.
Case 1: If the rank is a whole number, the value at that position in the ordered dataset is the desired
percentile.
Case 2: If the rank is a decimal, interpolate between the values at the two neighbouring whole-number
positions to find the percentile value.
Let's assume we have the following data set: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100. To find the 70th
percentile:
$$R = \frac{70}{100} \times (10 + 1) = 7.7$$
The rank 7.7 lies between the 7th and 8th values, so the 70th percentile is a value between
70 and 80. Interpolating linearly gives 70 + 0.7 × (80 − 70) = 77.
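The worked example can be reproduced with a short sketch of the rank method. It assumes that a decimal rank is resolved by linear interpolation between the two neighbouring values, and that ranks falling outside the data are clamped to the smallest or largest value. The function name `percentile` is hypothetical.

```python
def percentile(data, p):
    """p-th percentile using the rank R = p/100 * (N + 1).

    Assumptions: linear interpolation for decimal ranks; ranks below 1 or
    at or above N are clamped to the first or last ordered value.
    """
    ordered = sorted(data)
    n = len(ordered)
    rank = p * (n + 1) / 100
    lower = int(rank)          # whole part of the rank
    fraction = rank - lower    # decimal part of the rank
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    low_value = ordered[lower - 1]
    high_value = ordered[lower]
    return low_value + fraction * (high_value - low_value)


scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
p70 = percentile(scores, 70)
print("70th percentile is approximately", round(p70, 2))
```

For the 70th percentile the rank is 7.7, so the result lies seven tenths of the way from the 7th value (70) to the 8th value (80).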