UNIT 1
Introduction: Meaning of statistics, Classification of statistics – descriptive vs
inferential, parametric vs non-parametric. Levels of Measurement. Measures of central
tendency – Mean, Median, Mode. Measures of variability – Interquartile Range,
Quartile Deviation, Standard Deviation. Normal Distribution – Meaning, importance,
properties.
MEANING OF STATISTICS
The words ‘Statistics’ and ‘Statistical’ are derived from the Latin word Status, meaning
a political state.
Statistics is concerned with scientific methods for collecting, organising,
summarising, presenting and analysing data as well as deriving valid conclusions and
making reasonable decisions on the basis of this analysis. Statistics is concerned with
the systematic collection of numerical data and its interpretation.
DESCRIPTIVE VS INFERENTIAL
Inferential statistics go beyond describing data and aim to make inferences or draw
conclusions about populations based on sample data. These techniques allow
researchers to generalize their findings to larger populations and assess the likelihood
that observed differences or relationships are not due to random chance.
For example, if a psychologist wants to know if a new therapy is more effective than
an existing one, they might use inferential statistics to analyze data from a sample of
patients and determine whether the observed improvement is likely to be a real effect
or just due to chance.
Descriptive statistics, by contrast, summarize and describe the main features of a dataset.
Measures of central tendency (mean, median, mode), measures of variability (range, variance,
standard deviation), frequency distributions, histograms, scatterplots, and box plots are all
descriptive statistics.
Descriptive Statistics:
Pros:
Cons:
1. Limited Inference: Descriptive statistics don't allow for generalizing findings beyond
the dataset itself; they lack the ability to make broader conclusions.
2. Lack of Context: Descriptive statistics may not reveal underlying relationships or
factors that might impact the data.
3. No Significance Testing: Descriptive statistics don't assess whether observed
differences are statistically significant or due to chance.
Inferential Statistics:
Pros:
1. Generalization: Inferential statistics enable researchers to draw conclusions about
populations beyond the sampled data.
2. Significance Testing: They allow researchers to test hypotheses and determine
whether observed differences are statistically significant or likely due to chance.
3. Complex Relationships: Inferential statistics help identify and understand complex
relationships among variables through techniques like regression and ANOVA.
Cons:
PARAMETRIC VS NON-PARAMETRIC
Parametric Tests:
Parametric tests assume that the data follows a specific distribution, usually the normal
distribution. These tests require certain assumptions to be met, such as homogeneity of
variance and normality of data. Parametric tests are not suitable for data that deviate
significantly from the assumed distribution. Non-parametric tests, by contrast, can be used
with data that is not normally distributed or when sample sizes are small, although they are
generally less powerful than parametric tests when the assumptions of parametric tests are met.
Advantages:
1. Statistical Power: Parametric tests tend to have higher statistical power when the
assumptions are met. This means they are more likely to detect true differences or
relationships if they exist in the data.
2. Efficiency: When the assumptions are satisfied, parametric tests are often more
efficient, meaning they require smaller sample sizes to achieve the same level of
confidence compared to non-parametric tests.
3. More Information: They provide more detailed information about the data, including
estimates of population parameters (like means and variances), which can be useful in
drawing meaningful conclusions.
Disadvantages:
1. Limited Applicability: Parametric tests might not be suitable for all types of data. If
the data doesn't follow the assumed distribution, using parametric tests can lead to
inaccurate results.
2. Less Robust: Parametric tests are less robust when assumptions are violated
compared to non-parametric tests, which don't rely on these assumptions.
NON-PARAMETRIC TEST
MERITS
1. Robustness: Non-parametric tests are robust against deviations from assumptions like
normality and equal variances. They can handle data with outliers and non-normal
distributions without affecting the results significantly.
2. Flexibility: Non-parametric tests can be applied to a wide range of data types,
including ordinal and categorical data, which may not have a clear numerical
interpretation.
3. Small Sample Sizes: Non-parametric tests can be used effectively with small sample
sizes when parametric assumptions are not met.
DEMERITS
1. Reduced Power: Non-parametric tests generally have lower statistical power than
parametric tests when the data meets the assumptions of the latter. This means they
might be less likely to detect true effects.
2. Less Precise Estimation: Non-parametric tests tend to provide less precise estimates
of population parameters compared to parametric tests when the data follows the
parametric assumptions.
3. Limited Test Options: There are fewer types of non-parametric tests available
compared to parametric tests. This can limit the ability to test specific research
questions.
4. Limited for Complex Analyses: Non-parametric tests might not be suitable for more
complex analyses involving multiple variables, interactions, and covariates.
5. Loss of Information: Non-parametric tests often involve converting data into ranks,
which can lead to a loss of information from the original data.
6. Less Widely Understood: While some non-parametric tests are widely known and
used, they might be less familiar to researchers and practitioners than their parametric
counterparts.
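To make the contrast concrete, here is a hedged Python sketch comparing a parametric and a non-parametric test on the same two groups, assuming SciPy is available (the group scores are simulated, purely hypothetical data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=10, size=30)   # hypothetical scores, group A
group_b = rng.normal(loc=55, scale=10, size=30)   # hypothetical scores, group B

# Parametric: independent-samples t-test (assumes normality and equal variances).
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric: Mann-Whitney U test (rank-based, no normality assumption).
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

print(t_p, u_p)  # both p-values address the same question under different assumptions

When the normality assumption holds, the t-test will usually give the smaller p-value for a true difference, illustrating the power advantage described above.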
LEVELS OF MEASUREMENT
Nominal Scale:
The nominal scale is the simplest level of measurement. Data at this level are
categorical and are used to categorize items into distinct groups or categories. In this scale,
data points are assigned to different categories based on some shared characteristic, but the
categories themselves have no inherent order or numerical value. Examples of nominal data
include gender, ethnicity, and religious affiliation.
Ordinal Scale:
An ordinal scale is a type of measurement scale used in statistics that involves ordering
or ranking data based on some characteristic, without assigning specific numerical values that
represent the magnitude of the differences between the categories. Examples include
educational levels (e.g., high school, bachelor's degree, master's degree) or customer
satisfaction ratings (e.g., "very satisfied," "satisfied," "neutral," "dissatisfied," "very
dissatisfied")
Interval Scale:
Interval data is a type of measurement scale used in statistics that has ordered categories
and consistent intervals between them. In contrast to nominal and ordinal data, interval data
allows for meaningful comparisons of the differences between values because the intervals
are equal and have a consistent meaning. However, interval data doesn't have a true zero
point, meaning that ratios between values are not meaningful. Arithmetic operations like
addition and subtraction can still be performed on interval data. Examples of data at the
interval scale include temperatures measured in Celsius or Fahrenheit.
Ratio Scale:
The ratio scale is the most advanced level of measurement. It includes an ordered
scale, meaningful differences between values, a true zero point that indicates the absence of
the attribute, and consistent measurement units. Examples of ratio data include height,
weight, age, and income.
MEAN
The mean is one of the most commonly used measures of central tendency. It provides a way
to summarize a dataset by calculating the arithmetic average of all the values within it.
Mathematically, the mean is calculated by summing up all the values in the dataset and then
dividing the sum by the total number of values. The mean provides insight into the general
level or balance of the data, helping to understand its central location.
The purpose of computing the mean is to obtain a single value that describes the characteristics of the entire group.
Mean formula: Mean = (sum of all values) / (number of values)
Example: suppose the ages of four people are 25, 28, 24, and 26.
1. Add Up the Ages: Add together all the ages. Sum of ages = 25 + 28 + 24 + 26 = 103
2. Count the Ages: Determine the total number of ages. Total number of ages = 4
3. Calculate the Mean: Divide the sum of ages by the total number of ages. Mean = Sum
of ages / Total number of ages = 103 / 4 = 25.75
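The same calculation can be written as a short Python sketch (a minimal illustration using the example ages above; the function name compute_mean is just an illustrative choice):

# Minimal sketch: arithmetic mean of the example ages above.
ages = [25, 28, 24, 26]

def compute_mean(values):
    # Sum all the values and divide by how many there are.
    return sum(values) / len(values)

print(compute_mean(ages))  # 25.75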
Properties of Mean
1. Mean is the most reliable measure of central tendency since it takes into account every
item in the set of data.
2. The sum of the differences between individual observations and the mean is zero.
3. The sum of squares of deviations of a set of values about its mean is minimum.
The mean is used when both of the following conditions are met:
1. The data are scaled, i.e., measured with equal intervals, like speed, weight, height, temperature, etc.
2. The distribution is approximately normal. Because the mean is sensitive to outliers found in
skewed distributions, you should only use the mean when the distribution is more or less normal.
MERITS
1. It can be easily calculated and easily understood, which is the reason it is the most used
measure of central tendency.
2. When repeated samples are gathered from the same population, fluctuations are minimal
for this measure of central tendency.
3. Unlike other measures such as the mode and median, it can be subjected to algebraic treatment.
4. The A.M. has the advantage of being a calculated quantity that does not depend on the order
of terms in a series.
DEMERITS
1. It is useful only if the frequencies are regularly distributed; if the skewness is greater,
the results will be misleading.
2. In the case of open-end class intervals, we must assume the intervals’ boundaries, so a
small fluctuation in the computed mean is possible. This is not the case with the median and
mode, as the open-end intervals are not used in their calculations.
Weighted mean: some values contribute more to the mean than others.
Geometric mean: values are multiplied rather than summed up.
Harmonic mean: reciprocals of values are used instead of the values themselves.
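As a brief illustration of these three variants, here is a hedged Python sketch using the standard library's statistics module (the values and weights are made-up example data):

import statistics

values = [2, 4, 8]    # hypothetical data
weights = [1, 1, 2]   # hypothetical weights

# Weighted mean: each value contributes in proportion to its weight.
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Geometric mean: the values are multiplied and the n-th root is taken.
geometric_mean = statistics.geometric_mean(values)

# Harmonic mean: based on the reciprocals of the values.
harmonic_mean = statistics.harmonic_mean(values)

print(weighted_mean, geometric_mean, harmonic_mean)  # 5.5, 4.0, ~3.43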
MEDIAN
The median is another measure of central tendency that represents the middle value in a
dataset when the values are arranged in order. It's the value that separates the data into two
equal halves: half of the values are greater than or equal to the median, and half are less than
or equal to it. The median is less sensitive to outliers and skewed distributions compared to
the mean.
PROPERTIES
In statistics, the properties of the median are explained in the following points.
The median is used when either of two conditions is met.
FORMULA
For an ordered series of n values, the median is the ((n + 1)/2)th value when n is odd; when n is
even, it is the arithmetic mean of the (n/2)th and the (n/2 + 1)th values.
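This rule can be sketched in Python as follows (the sample list reuses the example from the merits below; the built-in statistics.median gives the same result):

import statistics

def median(values):
    # Sort the data and pick the middle value(s).
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                 # odd number of terms: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: mean of the two middle values

print(median([4, 7, 12, 18, 19]))             # 12
print(statistics.median([4, 7, 12, 18, 19]))  # 12 (library equivalent)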
MERITS
1. Even if the value of an extreme item is much different from the other values, the median is
not much affected by it. For example, the median of 4, 7, 12, 18, 19 is 12, and if we add two
values, 450 and 10000, the new median is 18.
2. It can also be used for qualities that cannot be given an A.M., as in the case of
intelligence, etc.; such items can be arranged in order and the middle value located. For such
cases it is the best measure.
3. It is also suitable for open-end intervals, since whatever values the open intervals take,
the value of the median remains the same.
4. The median is also used for other statistical devices such as mean deviation and skewness.
5. The extreme items need not be available to obtain the median; it can be found as long as the
number of terms is known.
DEMERITS
1. Because even very large extreme values do not affect it much, the median sometimes fails to
remain representative of the series.
2. The median cannot be used for further algebraic treatment. Unlike the mean, we can neither
find the total of the terms, as in the case of the A.M., nor the median of several groups when
combined.
3. In a continuous series it has to be interpolated. We can find its true value only if the
frequencies are uniformly spread over the whole class interval in which the median lies.
4. If the number of terms in the series is even, we can only make an estimate of the median, as
the A.M. of the two middle terms is taken as the median.
MODE
The mode is a statistical measure that represents the value that appears most frequently in a
dataset. In other words, the mode is the value that occurs with the highest frequency among
all the values in the dataset. Unlike the mean and median, which focus on the central
tendency of the data, the mode highlights the most common value(s).
Properties of Mode
1. The mode is not always unique. A data set can have more than one mode, or the mode may
not exist for a data set.
2. The mode can be used when the data are nominal or categorical, such as religious
preference, gender, or political affiliation.
When to use the mode? The Mode is used when you want to know the most frequent
response, number or observation in a distribution.
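As a small illustration, the most frequent value(s) can be found with Python's statistics module (the responses list is hypothetical survey data; multimode returns every mode when there is a tie):

import statistics

responses = ["agree", "agree", "neutral", "disagree", "agree"]  # hypothetical nominal data

# The mode is the value that occurs most often; multimode lists all ties.
print(statistics.mode(responses))       # 'agree'
print(statistics.multimode(responses))  # ['agree']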
Merits of mode:
(1) Simple and popular: - Mode is a very simple measure of central tendency. Sometimes, just
a glance at the series is enough to locate the modal value. Because of its simplicity, it is a very
popular measure of central tendency.
(2) Less effect of marginal values: - Compared to the mean, mode is less affected by marginal
values in the series. Mode is determined only by the value with the highest frequency.
(3) Graphic presentation: - Mode can be located graphically, with the help of a histogram.
(4) Best representative: - Mode is that value which occurs most frequently in the series.
Accordingly, mode is the best representative value of the series.
(5) No need of knowing all the items or frequencies: - The calculation of mode does not
require knowledge of all the items and frequencies of a distribution. In simple series, it is
enough if one knows the items with highest frequencies in the distribution.
Demerits of mode:
(1) Uncertain and vague: - Mode is an uncertain and vague measure of the central tendency.
(2) Not capable of algebraic treatment: - Unlike mean, mode is not capable of further
algebraic treatment.
(3) Difficult: - When the frequencies of all items are identical, it is difficult to identify the
modal value.
Advantages:
1. Useful for Summary: These measures provide a single value that represents the
center or average of the data, which can be helpful for summarizing large datasets and
making comparisons.
2. Easy Interpretation: The concept of central tendency is intuitive. It's easy to grasp
that the mean, median, or mode represents a "typical" value in the dataset.
3. Basis for Further Analysis: Measures of central tendency are often used as starting
points for more advanced statistical analyses and inferential procedures.
4. Data Reduction: When working with large datasets, these measures allow you to
condense a vast amount of information into a single value, simplifying analysis.
Disadvantages:
1. Lack of Information: Measures of central tendency don't provide insights into the
full distribution of the data. They might hide important details about how data is
spread out.
2. Unrepresentative for Skewed Data: If the data distribution is heavily skewed or not
symmetric, the mean might not accurately represent the center of the data.
3. Mode Ambiguity: A dataset might have multiple modes or no clear mode, making the
mode less informative in some cases.
4. Dependence on Sample: The mean and mode can be influenced by the specific
sample you have. If you took a different sample from the same population, you might
get slightly different values.
5. Misleading for Bimodal Data: In cases where the data has two distinct peaks, the
mean might fall between the peaks and not represent either peak well.
NORMAL DISTRIBUTION
A normal distribution, also known as the Gaussian distribution, is a probability distribution in
which the values of a random variable are distributed symmetrically. These values are equally
distributed on the left and the right side of the central tendency, so a bell-shaped curve is
formed.
IMPORTANCE
1. Central Limit Theorem: The normal distribution is closely tied to the Central Limit
Theorem (CLT), which states that the sum (or average) of a large number of
independent and identically distributed random variables tends to follow a normal
distribution, even if the individual variables are not normally distributed themselves.
This property is crucial in statistics as it allows us to make inferences about
population parameters based on sample data.
The central limit theorem states that when sufficiently large samples are drawn from a
population with finite variance, the sample means will be approximately normally distributed,
and the mean of the sample means will be approximately equal to the mean of the whole
population. As the sample size gets bigger, the mean of each sample gets closer to the actual
population mean. If the sample size is small, the actual distribution of the data may or may not
be normal, but as the sample size gets bigger, the distribution of the sample means can be
approximated by a normal distribution. (A small simulation illustrating this appears after this
list.)
2. Statistical Inference: Many statistical tests and methods, such as t-tests, ANOVA, and
regression analysis, rely on assumptions of normality. When data are approximately
normally distributed, these tests tend to perform well and yield reliable results.
Deviations from normality can affect the validity of these tests.
3. Parameter Estimation: The normal distribution has only two parameters, the mean
(μ) and the standard deviation (σ), which are easy to interpret and estimate. This
makes it a convenient choice for modeling various phenomena in real-world
situations.
4. Predictive Modeling: In fields like finance and risk analysis, the normal distribution
is used to model asset prices and returns. It forms the foundation for various risk
assessment and portfolio management techniques.
5. Psychological Testing: Many psychological tests, such as IQ tests and aptitude tests,
are designed to have a normal distribution of scores in the general population. This
design allows for the identification of individuals who fall above or below a
certain threshold, aiding in diagnostic and decision-making processes.
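As promised under point 1, here is a minimal NumPy sketch of the Central Limit Theorem (the exponential population, the sample size of 50, and the 10,000 repetitions are arbitrary choices for the illustration):

import numpy as np

rng = np.random.default_rng(0)

# Draw many samples from a clearly non-normal (exponential) population
# and record each sample's mean.
sample_means = [rng.exponential(scale=2.0, size=50).mean() for _ in range(10_000)]

# The distribution of these sample means is approximately normal, centred
# near the population mean (2.0), even though the population itself is skewed.
print(np.mean(sample_means))  # close to 2.0
print(np.std(sample_means))   # close to 2.0 / sqrt(50)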
The probability density function of the normal distribution is
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
where μ is the mean and σ is the standard deviation.
PROPERTIES
Empirical Rule (68-95-99.7 Rule): This rule states that approximately 68% of the data falls
within one standard deviation of the mean, about 95% within two standard deviations, and
approximately 99.7% within three standard deviations.
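These percentages can be checked numerically, assuming SciPy is available (norm.cdf gives the cumulative probability of the standard normal distribution):

from scipy.stats import norm

# Probability of falling within k standard deviations of the mean.
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {prob:.4f}")  # ~0.6827, 0.9545, 0.9973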
MEASURES OF VARIABILITY
Measures of variability, also known as measures of dispersion, are statistical metrics that
provide information about the spread or dispersion of a dataset. They help you understand
how the data points are spread out from the central tendency (mean, median, mode) of the
dataset.
INTERQUARTILE RANGE
Quartiles are the values that divide an ordered list of numerical data into four equal parts. There
are three quartiles, first, second and third, denoted by Q1, Q2 and Q3. Here, Q2 is nothing but
the median of the given data.
In statistics, the range is the simplest of all the measures of dispersion. It is the difference
between the two extreme observations of the distribution; in other words, the range is the
difference between the maximum and the minimum observation of the distribution.
It is defined by
Range = Xmax – Xmin
where Xmax is the largest observation and Xmin is the smallest observation of the variable
values.
The interquartile range (IQR) measures the spread of the middle half of your data. It is defined
as the difference between the third and the first quartile. Quartiles are the partitioned values
that divide the whole series into 4 equal parts, so there are 3 quartiles. The first quartile,
denoted by Q1, is known as the lower quartile; the second quartile is denoted by Q2; and the
third quartile, denoted by Q3, is known as the upper quartile. Therefore, the interquartile range
is equal to the upper quartile minus the lower quartile:
IQR = Q3 – Q1
where Q1 is the first quartile and Q3 is the third quartile of the series.
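A short NumPy sketch of these quantities (the data list is hypothetical, and np.percentile's default interpolation is only one of several common conventions for computing quartiles):

import numpy as np

data = [3, 7, 8, 12, 13, 14, 18, 21, 23, 27]   # hypothetical data

data_range = max(data) - min(data)        # Range = Xmax - Xmin
q1, q3 = np.percentile(data, [25, 75])    # first and third quartiles
iqr = q3 - q1                             # IQR = Q3 - Q1

print(data_range, q1, q3, iqr)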
Merits:
1. Robustness to Outliers: One of the most significant advantages of the IQR is its
resistance to outliers. Since it's based on quartiles (percentiles), extreme values have
less influence on its calculation compared to other measures like the range or standard
deviation.
2. Focus on Middle Data: The IQR concentrates on the middle 50% of the data,
providing insights into the variability of the central portion of the dataset. This is
particularly useful when the extreme values are not the primary concern.
3. Descriptive of Spread: The IQR gives a clear sense of how data points are distributed
within the middle range of the dataset. It describes the spread of the data that's more
representative of the majority of observations.
4. Non-Parametric Nature: The IQR doesn't make assumptions about the distribution
of the data, making it suitable for both symmetric and skewed datasets.
5. Useful in Comparisons: When comparing different datasets, the IQR can help you
assess differences in the spread of the middle portion of the data, independent of
differences in central tendency.
Demerits:
1. Limited Information: While the IQR is useful for understanding the spread within
the middle 50% of the data, it doesn't provide a comprehensive view of the entire
dataset. It can't tell you about the distribution of the data beyond the first and third
quartiles.
2. Lack of Balance: The IQR might not be the best choice if you're interested in a
measure that considers both the center and the spread of the data. In such cases, a
combination of mean and standard deviation might be more suitable.
3. Less Precise than Standard Deviation: The standard deviation provides more
detailed information about the spread of the data and is widely used in statistical
analyses. While the IQR has its own strengths, it lacks the precision offered by the
standard deviation.
QUARTILE DEVIATION
The Quartile Deviation (also called the semi-interquartile range) is half of the interquartile range:
Quartile Deviation = (Q3 – Q1) / 2
where Q1 is the first quartile and Q3 is the third quartile.
The Quartile Deviation gives you an idea of the spread of data within the central 50% of the
dataset, similar to the IQR. It is often used as a measure of dispersion when you're interested
in understanding the spread of the middle portion of the data while being less sensitive to
outliers.
Like the IQR, the Quartile Deviation is a robust measure of dispersion that is less affected by
extreme values compared to the standard deviation or range
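Continuing the sketch from the IQR section, the quartile deviation is simply half of the IQR (same hypothetical data and quartile convention as before):

import numpy as np

data = [3, 7, 8, 12, 13, 14, 18, 21, 23, 27]   # same hypothetical data
q1, q3 = np.percentile(data, [25, 75])

quartile_deviation = (q3 - q1) / 2   # semi-interquartile range
print(quartile_deviation)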
Merits:
1. Suitable for Skewed Data: The Quartile Deviation doesn't assume a normal
distribution and is appropriate for datasets with asymmetric or skewed distributions.
Demerits:
1. Limited Information: Like the Interquartile Range, the Quartile Deviation provides
information only about the spread of the middle 50% of the data. It doesn't take into
account the full range of data variability, including the outer 25% of the data.
2. Neglects Data Points: Since it focuses only on the quartiles, the Quartile Deviation
doesn't provide insights into the actual data points themselves. It may not give you a
clear picture of how individual values are distributed within the range.
3. Less Precision: While the Quartile Deviation is less affected by outliers, it might not
provide the same level of precision in measuring variability as the standard deviation
or the range.
4. Less Commonly Used: The Quartile Deviation is not as commonly used as other
measures like the IQR or standard deviation. This could mean that it might be less
familiar to those interpreting your results.
STANDARD DEVIATION
Standard deviation is a measure which shows how much variation (spread or dispersion)
from the mean exists. The standard deviation indicates a “typical” deviation from the mean.
Standard deviation calculates the extent to which the values differ from the average. Standard
deviation, the most widely used measure of dispersion, is based on all values. Therefore a
change in even one value affects the value of the standard deviation.
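A minimal Python sketch of the sample and population forms of the standard deviation, plus a z-score (the data list is hypothetical; statistics.stdev divides by n − 1, statistics.pstdev by n):

import statistics

data = [4, 8, 6, 5, 3, 7]                # hypothetical data
mean = statistics.fmean(data)

sample_sd = statistics.stdev(data)       # sample standard deviation (n - 1)
population_sd = statistics.pstdev(data)  # population standard deviation (n)

# z-score: how many standard deviations a value lies from the mean.
z = (8 - mean) / sample_sd

print(mean, sample_sd, population_sd, z)

The difference between the two formulas is exactly the population-vs-sample point raised in the demerits below.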
Merits:
1. Predictive Power: It allows for making predictions and estimates. If you know the
mean (average) and the standard deviation (a measure of how spread out the data is),
you can make educated guesses about where most of the data points will fall.
2. Basis for Z-Scores: Z-scores are calculated using the standard deviation, providing a
standardized way to measure how far a data point is from the mean. This is useful for
identifying outliers.
Demerits:
1. Sensitivity to Outliers: The standard deviation is highly affected by extreme values
(outliers) in the dataset. A single outlier can greatly increase the standard deviation
and potentially distort its interpretation.
2. Not Robust for Skewed Data: In asymmetric distributions, such as highly skewed
data, the standard deviation might not provide an accurate reflection of data spread, as
it's influenced by extreme values.
3. Population vs. Sample: There's a distinction between the population and sample
standard deviations. Using the wrong formula (population vs. sample) can lead to
incorrect results.
Advantages:
1. Sensitive to Data Variation: These measures capture the differences between data
points, giving you insights into the degree of variability. They help identify whether
the data points are tightly clustered or widely spread.
2. Identifying Outliers: High variability measures can signal the presence of outliers,
extreme values that may skew the interpretation of the data. This helps in identifying
data points that might need further investigation.
3. Assessment of Data Quality: In fields like quality control, variability measures help
in monitoring consistency and identifying potential issues in manufacturing processes.
Disadvantages:
1. Limited Information: Measures of variability focus solely on the spread of data and
don't provide information about the shape or pattern of the distribution itself.
2. Dependence on Scale: Some measures, like the standard deviation, are influenced by
the scale of the data. If the data is measured in different units or on different scales,
direct comparisons of variability can be misleading.
3. Lack of Context: Measures of variability alone might not give a full picture without
considering the context of the data.