Unit 1 mean And SD
Meaning of Statistics
The word, “Statistics”, in general connotes the following:
1. Statistics refers to numerical facts such as those concerning birth and death, school attendance
etc.
2. Statistics also signifies the methods of dealing with numerical facts.
Defining Statistics
Statistics (plural) is a science, a specialization within the field of mathematics. Statistics is the
science of classifying, organizing, and analyzing data. Applied statistics is used by investigators
who need to know statistics to appreciate reports of findings in their professional fields or who
must apply statistical treatment in their own work. Among them are biologists, educators,
psychologists, engineers, sociologists, medical researchers, and business executives.
The term statistic (singular) has another meaning as contrasted to statistics (plural). A statistic is
a descriptive index of a sample. The same index, if descriptive of a population, is called a
parameter.
● Descriptive statistics is the branch of statistics whose purpose is
to organize and to summarize observations so that they are easier to comprehend.
Statistics allow psychologists to organize data: when dealing with an enormous amount
of information, it is all too easy to become overwhelmed. Statistics allow psychologists to
present data in ways that are easier to comprehend. Example: When a researcher tries to
describe a demographic characteristic (e.g. age) of her population, she may use the mean
and SD to describe it. Example: A researcher wants to see which group exhibits
a higher variability in performance. Solution: To describe the variability in performance,
the variance is used.
● Inferential statistics is the branch of statistics in which we make inferences about a
population (the complete set of data) from samples (only a part of it). We try to estimate a
population parameter from a sample statistic. Two kinds of procedures are typically
involved:
1. Estimation: sample results provide only estimates of the values of the population
characteristics
2. Hypothesis testing: This helps the researcher to determine if the difference
between two means is due to chance error or not. E.g. Is it possible that a certain
drug has an effect on the speed of learning? It is impossible to administer the drug
to everyone in the population, so the investigator selects at random two samples
of 25 subjects each. The investigator then administers the drug to one of the
groups and a placebo to the other group. She then measures the learning of both
groups on a task. She finds that the average learning scores of the two groups
differ by 5 points. We would expect some difference between the groups even if
both received the drug because of chance factors involved in the random selection
of the groups. Hypothesis testing will help the experimenter find out whether the
obtained difference of 5 points is larger than can be accounted for by chance
variation. If yes then she will infer that the difference is due to the drug.
What is Random sampling? Discuss the difference between random sampling with and
without replacement.
Random sampling is a procedure that ensures that all samples of a particular size
have an equal and independent chance of being selected and thus eliminates any
bias when we draw a sample.
● Sampling without replacement: a subset of the observations is
selected randomly from the population, and once an observation is
selected it is not put back in the population. Hence, it cannot be selected
again.
● Sampling with replacement: a subset of observations is selected
randomly from the population, and each observation is returned to the population after it is selected.
Thus, an observation may be selected more than once.
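The two sampling schemes can be sketched in Python. This is only an illustration: the population of ten numbered observations and the fixed seed are assumptions made for the example, not part of the notes.

```python
import random

population = list(range(1, 11))   # hypothetical population of 10 observations
rng = random.Random(42)           # fixed seed, only so the sketch is repeatable

# Without replacement: a selected observation is not put back,
# so it can appear at most once in the sample.
without_replacement = rng.sample(population, k=5)

# With replacement: each selected observation is returned to the
# population, so it may be selected again.
with_replacement = rng.choices(population, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

A sample drawn without replacement never contains repeats, whereas a sample drawn with replacement may.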
Levels of measurement
What do you understand by the term ‘measurement’? Describe the various scales of
measurement.
Before you can use statistics to analyze a problem, you must convert information about the problem into
data. That is, you must establish or adopt a system of assigning values, most often numbers, to the objects
or concepts that are central to the problem in question. Data need not be inherently numerical to be
useful in an analysis. For instance, the categories male and female are commonly used in both science and
everyday life to classify people, and there is nothing inherently numeric about these two categories.
Similarly, we often speak of the colors of objects in broad classes such as red and blue, and there is
nothing inherently numeric about these categories either.
Levels of measurement
Variables can be measured at four different levels (nominal, ordinal, interval, and ratio) that
communicate increasing amounts of quantitative information (Stevens, 1946, 1951, 1961).
The level of measurement affects the kinds of statistics you can use and conclusions you can
draw from your data.
Nominal Data
“Nominal” scales could simply be called “labels.”
The most basic level of measurement is the nominal scale. With nominal scales, numbers are assigned to
categories of information that differ qualitatively. Numbers in this scale serve simply as labels and do
not have numeric meaning.
For instance, you might create a variable for gender, which takes the value 1 if the person is male and 0 if
the person is female (called binary data). The 0 and 1 have no numeric meaning and are simply used as
labels. Another example is the number marked on a football, basketball, or baseball player's uniform.
Again, such numbers are labels used for identification. They do not imply that one player is superior to
another.
Nominal measures must have categories that are mutually exclusive and collectively exhaustive. For
example, if we were measuring marital status, we might use the following categories: 1 = married, 2 =
separated or divorced, 3 = widowed. Each subject must be classifiable into one and only one category.
The requirement for collective exhaustiveness would not be met if, for example, there were subjects in the
sample who had never been married.
However, the main limitation is that higher mathematical calculations cannot be done, i.e. we cannot
add, subtract, multiply, or divide nominal scores. Neither the mean nor the median can be calculated; only
the mode can be determined. Usually researchers report nominal data in terms of frequencies in each category. For
example, a psychologist might report how many individuals are assigned a 1 (males) and how many a 0 (females).
Ordinal Data
Ordinal scales are the second level of measurement. Ordinal scales have all of the qualities of nominal
scores, but also the numbers indicate an order or rank. If a teacher asks children to line up in order of their
height, placing the shortest child first and the tallest child last, the teacher can then assign numbers based
on the children's height. The shortest child may be assigned a 1, the next-shortest child a 2, and so on. If
there are 20 children, the tallest child would be assigned a 20.
However, there is no metric way to quantify how great the distance between categories is. You could rank
countries of the world in order of their population, creating a meaningful order without saying anything
about whether, say, the difference between the 30th and 31st countries was similar to that between the
31st and 32nd countries. Also, because we cannot add, subtract, multiply, or divide ordinal scores (nor
can we compute means or standard deviations), ordinal scales are limited in their usefulness. It is
appropriate to calculate the median (central value) of ordinal data but not the mean.
In fact, most psychological scales produce ordinal data. Nevertheless, test developers often make the
assumption that these instruments produce equal interval data.
Interval Data
Equal interval scales have all of the qualities of the previous scales, but in addition equal intervals
between objects represent equal differences on the scale. The interval differences are meaningful.
However, an interval scale does not have an absolute zero (total absence of the property), only an arbitrary zero.
Hence ratios cannot be found.
One example of the interval level of measurement is temperature on the Fahrenheit or the Celsius scale. A
10-degree difference has the same meaning anywhere along the scale. For example, the difference
between 10 and 20 degrees is the same as between 80 and 90 degrees. Thus, intervals are equal and
meaningful. But we cannot say that 80 degrees is twice as hot as 40 degrees. Hence the ratio between numbers
cannot be calculated. This is because there is no true zero, only an arbitrary zero. On the centigrade
temperature scale, the zero value is arbitrarily taken as the point at which water freezes and the 100°C
value as the point at which water boils. Temperatures below 0°C are designated negative numbers. So the arbitrary 0°C does not mean 'no
temperature'. But when expressed on the kelvin scale, a ratio scale, a measure of 0 K (equivalent to about -273°C)
does indeed mean no temperature!
Calendar years are another example of an interval measurement. An arbitrary 0 was assigned when Christ
was born; time before this is labelled BC and time after it AD. For instance, we know
that there are exactly five years between 1975 and 1980, and we also know that the amount of time
between 1975 and 1980 is the same amount of time as between 1980 and 1985. The intervals are equal
but the ratio between numbers cannot be taken. We cannot say that someone born in AD 400 is twice as old as
someone born in AD 800.
Scores on intelligence tests, for example, are usually assumed to be arranged this way. Someone with an
IQ of 120 is believed to be more intelligent than someone else with an IQ of 110. Furthermore, and this
is the defining feature of an interval scale, the difference in intelligence between people with IQs of
120 and 110 is assumed to be the same quantity as the difference between people with IQs of 110 and
100. In other words, the intervals are assumed to be equal. Note the word "assumed," however; some
psychologists consider IQ (and other psychological tests) to be an example of an ordinal scale, arguing
that it is difficult, if not impossible, to be sure about the equal-interval assumption in this case. Most
accept the inclusion of IQ as an example of an interval scale, though, partly for a practical reason:
Psychologists prefer to use interval and ratio scales generally because data on those scales allow more
sophisticated statistical analyses and a wider range of them.
We assume that most psychological constructs fall in the interval scale although it is controversial (as
strictly speaking they should belong to the ordinal scale). Although we can measure an individual's level
of anxiety, intelligence, or mechanical aptitude, it is difficult to establish the point at which an individual
totally lacks anxiety, intelligence, or mechanical aptitude.
Addition and subtraction are appropriate with interval scales because a difference of 10 degrees
represents the same amount of change in temperature over the entire scale. Multiplication and division are
not appropriate with interval data: there is no mathematical sense in the statement that 80 degrees is twice
as hot as 40 degrees, for instance.
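The point about differences and ratios on an interval scale can be checked numerically. The sketch below is only an illustration; the helper names are hypothetical, and the conversion formulas are the standard Celsius to Fahrenheit and Celsius to kelvin conversions.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    return c + 273.15

# Differences are meaningful on an interval scale: both gaps are 10 degrees.
gap_low = 20 - 10
gap_high = 90 - 80

# Ratios are not: the "ratio" of the same two temperatures changes
# with the arbitrary zero of the scale used.
ratio_celsius = 80 / 40
ratio_fahrenheit = celsius_to_fahrenheit(80) / celsius_to_fahrenheit(40)

# On the kelvin scale (a ratio scale with a true zero) the ratio is meaningful.
ratio_kelvin = celsius_to_kelvin(80) / celsius_to_kelvin(40)

print(gap_low, gap_high)
print(ratio_celsius, round(ratio_fahrenheit, 3), round(ratio_kelvin, 3))
```

The same pair of temperatures gives a ratio of 2 in Celsius but a different ratio in Fahrenheit, which is why "twice as hot" has no meaning on either scale.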
Ratio Data
This is the highest level of measurement which is reached by natural sciences but has not been reached by
social sciences including Psychology. Ratio scales have all of the qualities of the previous scales,
and they also have a point that represents an absolute absence of the property being measured—
that point is called zero. Because ratio scales have an absolute zero, all arithmetic operations are
permissible. One can meaningfully add, subtract, multiply, and divide numbers on a ratio scale.
We can also find out ratios between the numbers. Ratio measures provide information
concerning the ordering of objects on the critical attribute, the intervals between objects, and the
absolute magnitude of the attribute.
Many physical measurements are ratio data: for instance, height, weight, and age all qualify. So does
income: you can certainly earn 0 dollars in a year or have 0 rupees in your bank account, and this signifies
an absence of money. With ratio-level data, it is appropriate to multiply and divide as well as add and
subtract; it makes sense to say that someone with Rs.100 has twice as much money as someone with Rs.
50.
Why “Sample” the Population? Why not study the whole population?
• Saves time
• Saves money
• Analysis of a sample is less cumbersome and more practical than an analysis of the entire
population.
• Errors in sampling can be controlled more easily
• The population may not be reachable
• The experiment may require destruction of the item
Basics
1. Ungrouped and grouped Data
All formulas for a discontinuous distribution
● Class limits: 140 and 144 (class interval 140-144)
• Lower Limit (LL): LHS figure (140)
• Upper Limit (UL): RHS figure (144)
• Range = Highest score - Lowest score in distribution: 199 - 140 = 59
• i (width) = (UL - LL) + 1
= (144 - 140) + 1 = 5
● Midpoint of class = (UL + LL) / 2
or = LL + (UL - LL) / 2
= 140 + (144 - 140) / 2 = 142
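The class width and midpoint rules for a discontinuous distribution can be written as two small functions. This is a minimal sketch; the function names are hypothetical and the class 140-144 is the one used in the notes.

```python
def class_width(lower_limit, upper_limit):
    # i = (UL - LL) + 1 for integer score limits
    return (upper_limit - lower_limit) + 1

def class_midpoint(lower_limit, upper_limit):
    # Midpoint = (UL + LL) / 2
    return (upper_limit + lower_limit) / 2

print(class_width(140, 144))     # width of the class 140-144
print(class_midpoint(140, 144))  # midpoint of the class 140-144
```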
Measures of central tendency
Measures of central tendency (or location) summarize what the data in a sample are typically
like. It is a central tendency that refers to a central (‘average’) value that is representative of the
whole data. Measures of central tendencies are also referred to as ‘average’ though the layperson
uses it for the mean. However technically it is used to refer to mean, median or mode.
Significance/Uses:
● First, it is an "average" which represents all of the scores made by the group, and as such
gives a concise description of the performance of the group as a whole and
● Second, it enables us to compare two or more groups in terms of typical performance.
Mean (calculations)
Symbols
• General Symbol: M
• Mean of Sample: x̄
• Mean of Population: μ
1. Ungrouped Data:
Solution
∑x = 15+25+18+15+20+25+18+18+20+25 = 199
N = 10
x̅ = ∑x / N = 199/10 = 19.9
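The same calculation can be checked in a few lines of Python. This is just a sketch of the arithmetic above, using the ten scores from the solution.

```python
scores = [15, 25, 18, 15, 20, 25, 18, 18, 20, 25]

total = sum(scores)   # sum of all scores
n = len(scores)       # number of scores
mean = total / n      # mean = total / n

print(total, n, mean)
```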
2. Grouped Data
Find mean of following
Solution:
ans =60.76
(ans=Mean = 106.00)
Median
The median of a distribution is the point along the scale of possible scores that has half the scores
below it and half the scores above it.
The median is another name for the 50th percentile. The 50th percentile is written as P50.
Concept of percentile
● A percentile is the point below which a certain percentage of the actual scores fall.
● The 25th percentile is written as P25.
● If P25 = 66.9, then 66.9 is the percentile score and 25 is the percentile rank.
● Percentile ranks may take values only between 0 and 100, whereas a percentile (point)
may have any value that scores may have.
Calculation of Mdn
Ungrouped Data
• For raw scores, think of the median as the middle score of a distribution based on score
frequency.
• Step 1: To find the median, we first put the scores in rank order from lowest to highest.
• Step 2a: If n (or N) is odd, the median is the (n+1)/2th score, i.e. the score that has an equal number of
scores below and above it. For example, for the following scores:
0, 7, 8, 11, 15, 16, 20 the median is the (7+1)/2 = 4th score, which is 11.
• Step 2b: If n is an even number of scores, then the median is the average of the two scores that are in
the middle position. For example, for the group of scores: 12, 14, 15, 18, 19, 20 the
median is (15 + 18)/2 = 16.5.
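The steps for ungrouped data can be sketched as a short function. The function name is hypothetical; the two test series are the ones used in the examples above.

```python
def median_ungrouped(scores):
    ordered = sorted(scores)      # Step 1: rank order, lowest to highest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the single middle score
        return ordered[middle]
    # Even n: average of the two middle scores
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_ungrouped([0, 7, 8, 11, 15, 16, 20]))
print(median_ungrouped([12, 14, 15, 18, 19, 20]))
```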
● Concept of percentile rank and percentile score: The median corresponds to the 50th
percentile rank. Percentile rank is the percentage of scores that fall below a given score. It
varies between 0 and 100. For example, if the percentile rank is 50, it means that 50% of the scores fall
below the given score. The score itself is referred to as the percentile score and can take
any value. The median is P50 or the 50th percentile. For example, P50 = 109 means that the score
corresponding to the 50th percentile is 109.
● Calculations of percentiles
Find the value of P25.
Class Limits f
96-98 1
93-95 1
90-92 3
87-89 3
84-86 4
81-83 7
78-80 8
75-77 9
72-74 12
69-71 6
66-68 11
63-65 7
60-62 2
57-59 3
54-56 3
N=80
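Steps
Step 1: Find the cumulative frequency (cf) of each class, adding the frequencies from the lowest class upward.
Class Limits f cf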
96-98 1 80
93-95 1 79
90-92 3 78
87-89 3 75
84-86 4 72
81-83 7 68
78-80 8 61
75-77 9 53
72-74 12 44
69-71 6 32
66-68 11 26
63-65 7 15
60-62 2 8
57-59 3 6
54-56 3 3
N=80
Step 2: Find the class interval in which P25 falls. P25 is the score point below which 25% of the
cases fall.
a) Find 25% of N: (25/100) × 80 = 20.
b) Identify the critical class: the 20th case (from the bottom) will fall in the class
interval 66-68 (see cf column).
Step 3: Apply the formula
P = LL + i × [(R/100)N - cf below] / f critical class
● LL: exact lower limit of the critical class
● (R/100)N: R is the given percentile rank
● i: width of the class interval
● cf below: cf below the critical class
● f critical class: frequency of scores in the critical class
P25 = 65.5 + 3 × (20 - 15) / 11 = 66.86
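The percentile calculation for grouped data can be sketched as follows. The formula used, P = LL + i × ((R/100)N - cf below) / f, is an assumption reconstructed from the terms listed above and from the median formula given later in these notes. The function name is hypothetical, and exact lower limits are taken as 0.5 below the stated integer limit.

```python
def percentile_point(classes, rank):
    """classes: list of (lower_limit, upper_limit, f), lowest class first.
    rank: percentile rank R between 0 and 100."""
    n = sum(f for _, _, f in classes)
    target = rank / 100 * n          # (R/100)N
    cf_below = 0                     # cumulative frequency below the current class
    for lower, upper, f in classes:
        if f > 0 and cf_below + f >= target:
            exact_lower = lower - 0.5    # LL, exact lower limit
            width = upper - lower + 1    # i, class width
            return exact_lower + width * (target - cf_below) / f
        cf_below += f
    raise ValueError("rank must be between 0 and 100")

# Frequency distribution from the notes (N = 80), lowest class first.
classes = [
    (54, 56, 3), (57, 59, 3), (60, 62, 2), (63, 65, 7), (66, 68, 11),
    (69, 71, 6), (72, 74, 12), (75, 77, 9), (78, 80, 8), (81, 83, 7),
    (84, 86, 4), (87, 89, 3), (90, 92, 3), (93, 95, 1), (96, 98, 1),
]

print(round(percentile_point(classes, 25), 2))  # P25
print(round(percentile_point(classes, 50), 2))  # P50, the median
```

For P25 the critical class is 66-68, so the function evaluates 65.5 + 3 × (20 - 15) / 11, which agrees with the value P25 = 66.9 quoted earlier after rounding.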
Find P30, P60, P90, P50
Ans:
Group A: P30 = 45.81, P60 = 55.77, P90 = 73.64, P50 = 52.19
Group B: P30 = 48.68, P60 = 59.8, P90 = 74.8, P50 = 56.14
Special case 1: If the required cf, (R/100)N, coincides with the cf of a class, then take the
class interval above it as the critical class. Find P20, P60 for the above data (Table 1).
Special case 2: Computation of the percentile when there are gaps in the distribution
● Difficulty arises when it becomes necessary to calculate the median from a distribution in
which there are gaps or zero frequency upon one or more intervals.
● Since N = 10, and N /2 = 5, we count up the frequency column 5 scores through 6-7.
Ordinarily, this would put the median at 7.5, the exact lower limit of interval 8-9.
● In order to have the median come out at the same point, whether computed from the top
or the bottom of the frequency distribution, the procedure usually followed in cases like
this is to have interval 6-7 include 8-9, thus becoming 6-9; and to have interval 12-13
include 10-11, becoming 10-13.
● Lengthening these intervals from two to four units eliminates the zero frequency on the
adjacent intervals by spreading the numerical frequency over them.
N/2 = 10/2 = 5th score
Remove gaps:
f Cf
20- 21 2 10
18- 19 1 8
16- 17 0 7
14 -15 0 7
10- 13 2 7
6- 9 2 5
4- 5 1 3
2-3 1 2
0 -1 1 1
N/2 = 16/2 = 8th score
First step
Class Limits f cf
20- 21 2 16
18-19 2 14
16-17 4 12
14-15 0 8
12-13 4 8
10-11 0 4
8-9 4 4
Second step
Class Limits f cf
20- 21 2 16
18-19 2 14
15-17 4 12
12-14 4 8
10-11 0 4
8-9 4 4
Mdn = LL + i × (N/2 - cf below) / f median class
Mdn = 14.5 + 3 × (8 - 8) / 4 = 14.5
Mode
Ungrouped data
In a simple ungrouped series of measures the "crude" or "empirical" mode is that single measure
or score which occurs most frequently. For example, in the series
10, 11, 11, 12, 12, 13, 13, 13, 14, 14,
the most often recurring measure, namely, 13, is the crude or empirical mode.
Grouped data
Mode = 3 Mdn - 2 Mean
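Both ideas can be sketched in Python. The crude mode is found by counting, and the relation Mode = 3 Mdn - 2 Mean is applied to the same small series purely for illustration (the notes list it under grouped data, so using it on raw scores here is an assumption).

```python
from collections import Counter
import statistics

series = [10, 11, 11, 12, 12, 13, 13, 13, 14, 14]

# Crude (empirical) mode: the score that occurs most frequently.
counts = Counter(series)
crude_mode = counts.most_common(1)[0][0]

# Estimated mode from the median and the mean.
mean = statistics.mean(series)
median = statistics.median(series)
estimated_mode = 3 * median - 2 * mean

print(crude_mode, mean, median, round(estimated_mode, 2))
```

The estimate will not in general equal the crude mode exactly; it is an approximation based on the mean and median.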
Properties of mean, median, mode
Mean (properties)
1. It is a measure that best reflects the total of the score.
2. The mean is sensitive to the exact value of all the scores in the distribution. To calculate
the mean you have to add all the scores, so a change in any of the scores will cause a
change in the mean. This is not true of the median or the mode.
4. The sum of the deviations about the mean equals zero. Written algebraically, this
property becomes Σ (Xi - x̅ ) = 0. This property says that if the mean is subtracted from
each score, the sum of the differences will equal zero.
5. The mean is the balance point of the distribution: it is the point about which the
sum of the negative deviations equals, in magnitude, the sum of the positive deviations. The mean can be thought of as
the fulcrum of a seesaw, to use a mechanical analogy. The analogy is shown in Figure
4.1, using the scores of Table 4.1. When the scores are distributed along the seesaw
according to their values, the mean of the distribution occupies the position where the
scores are in balance.
6. The sum of the squared deviations when taken from the mean (as compared to other points)
is a minimum. Stated algebraically, Σ (Xi - x̅)² is a minimum.
7. The mean is least subject to sampling variation. Sampling fluctuation refers to the extent
to which a statistic takes on different values with different samples. That is, it refers to
how much the value fluctuates from sample to sample. A statistic whose value fluctuates
greatly from sample to sample is highly subject to sampling fluctuation. If we were
repeatedly to take samples from a population on a random basis, the mean, median and
the mode all would vary but the mean varies least compared to other measures of central
tendency.
The above property is very important in inferential statistics and is a major reason why
the mean is used in inferential statistics whenever possible.
8. Mean, mode and median as indicators of skewness. If the distribution is normal the mean,
median, and mode will all be equal. When the distribution is skewed, the mean and
median will not be equal. With a negatively skewed distribution, the mean will be lower
than the median. With a positively skewed curve, the mean will be larger than the
median.
(measures of central tendency in symmetrical and asymmetrical distribution)
9. Score transformations: A score transformation is a process that changes every score in a
distribution to one on a different scale. Scores may be transformed by adding,
subtracting, multiplying or dividing by a constant. Adding a constant to (or subtracting it from)
each score in a distribution adds the same constant to (or subtracts it from) the mean;
multiplying or dividing each score by a constant multiplies or divides the mean by that constant.
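Three of the properties listed above (deviations summing to zero, the least-squares property, and the effect of score transformations) can be checked numerically. This is a sketch on the ten scores used earlier for the ungrouped mean; the comparison points and the constants 10 and 2 are arbitrary choices made for illustration.

```python
scores = [15, 25, 18, 15, 20, 25, 18, 18, 20, 25]
mean = sum(scores) / len(scores)

# The sum of the deviations about the mean is zero
# (up to floating-point rounding).
deviation_sum = sum(x - mean for x in scores)

def sum_of_squares(values, point):
    """Sum of squared deviations of the values about a given point."""
    return sum((x - point) ** 2 for x in values)

# The sum of squared deviations is smallest about the mean.
ss_about_mean = sum_of_squares(scores, mean)
ss_about_other_points = [sum_of_squares(scores, mean + d) for d in (-1, 0.5, 2)]

# Score transformations: the mean changes by the same constant.
shifted_mean = sum(x + 10 for x in scores) / len(scores)
scaled_mean = sum(x * 2 for x in scores) / len(scores)

print(round(deviation_sum, 10), ss_about_mean, ss_about_other_points)
print(mean, shifted_mean, scaled_mean)
```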
Properties of median
1. The median is the point that divides the lower half of scores from the upper half. The
median is also referred to as the score at the 50th percentile in the distribution.
2. The median is responsive to the number of scores above or below its value, but not to
their exact locations.
4. The median is more subject to sampling fluctuation than the mean but less subject to
sampling fluctuation than the mode.
5. Because the median is usually less stable than the mean from sample to sample, it is not
as useful in inferential statistics.
Properties of mode
1. The mode is defined as the most frequent score in the distribution.
2. Most affected by the choice of class interval
3. sometimes not a unique point in the distribution
4. subject to maximum sampling fluctuation
5. used for preliminary work
6. easy to obtain
7. little use beyond descriptive level
8. only measure suitable for nominal data
9. Score transformations: Adding a constant to (or subtracting it from) each score in a
distribution, or multiplying or dividing each score by a constant, changes
the mode in the same way.
Use the mode
(1) When a quick and approximate measure of central tendency is all that is wanted.
(2) When the measure of central tendency should be the most typical value. When we describe
the style of dress or shoes worn by the "average woman," for instance, the modal or most popular
fashion is usually meant.
Uses
The mean is the most generally useful measure of central tendency, whereas the mode is the least
useful. However, there are circumstances in which the mean is less useful, such as when there
are a few extreme scores in the data (i.e. data is skewed) in which case median should be used as
it is less affected by extreme scores. Median should be used when the data is open ended and
mode should be used when the most typical value is needed or when the data is nominal.
Measures of Variability/Dispersion
Central tendency alone is not a good way of describing a sample or population. Along with this
aggregate number, it is critical to describe the measure of variability or dispersion. Measures of
dispersion or variability can be thought of as the scatter or spread in the data. In other words, it is
a measure of how individual values cluster around the central tendency.
Suppose we have two distributions with the same mean. However, Curve A is less spread or less
scattered than curve B. They have the same mean and cannot be distinguished from one another
unless the dispersions are also known. A narrow dispersion means that individual scores are
quite similar to one another and the distribution is fairly homogeneous. A wide dispersion
indicates that the data diverge considerably from the mean score, which consequently is less
representative of the distribution. Thus it is important to have some measure of how the data vary
within the set.
Standard deviation
The standard deviation or SD is the most stable index of variability and is customarily employed
in experimental work and in research studies. The conventional symbol for the population SD is
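the Greek letter sigma (σx). For the sample SD it is simply written as Sx.
Sum of squares: ∑x² or SSx = ∑(X - x̄)²
SD = √(SSx / N)
Find SD
X: 6, 2, 3, 1
Solution:
Step 1: Find x̄ or M
x̄ = ∑X / n = (6+2+3+1)/4 = 12/4 = 3
Step 2: Find the sum of squared deviations
SSx = (6-3)² + (2-3)² + (3-3)² + (1-3)² = 9 + 1 + 0 + 4 = 14
Step 3: Find SD
SD = √(14/4) = √3.5 = 1.87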
1. Raw Score Formula (ungrouped data)
● Distribution X is 5.1, 8.7, 3.5, 5.4, and 7.9. Using the raw-score and deviation score
method find the standard deviation. (ans:1.91 )
● Distribution X is: 15, 14, 11, 11, 9, and 6. Using the raw-score and deviation-score
method, find the standard deviation. (ans: 3)
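The two practice problems can be checked with a short sketch of both methods. Two assumptions are made here: the sum of squares is divided by N (this reproduces the answers 1.91 and 3 given above), and the raw-score form is taken to be SS = ∑X² - (∑X)²/N, the usual algebraic equivalent of the deviation form. The function names are hypothetical.

```python
import math

def sd_deviation_method(scores):
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)   # SS = sum of (X - mean)^2
    return math.sqrt(ss / n)

def sd_raw_score_method(scores):
    n = len(scores)
    # SS = sum of X^2 minus (sum of X)^2 / N
    ss = sum(x * x for x in scores) - sum(scores) ** 2 / n
    return math.sqrt(ss / n)

first = [5.1, 8.7, 3.5, 5.4, 7.9]
second = [15, 14, 11, 11, 9, 6]

print(round(sd_deviation_method(first), 2), round(sd_raw_score_method(first), 2))
print(round(sd_deviation_method(second), 2), round(sd_raw_score_method(second), 2))
```

Both methods give the same value because they compute the same sum of squares in two different ways.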
Practice questions
Solution:
Find x̄ or M
Find SD
b. Raw Score Formula (grouped data)
Practice questions
Calculation of variance
The symbol for the variance of a population is σ² (the lowercase Greek letter sigma, squared) and
that for a sample is S².
Calculation of Quartile deviation or semi Interquartile Range (Q)
The semi-interquartile range, denoted by Q, is defined as one-half the distance between the first
quartile point, Q1 and the third quartile point, Q3. The quartile points are the three score points
that divide the distribution into four parts, each containing an equal number of cases. These
points, symbolized by Q1, Q2, and Q3, are therefore also referred to as P25, P50 (the median), and P75,
respectively.
These points and the median are shown below.
Q1 is a point below which 25% of the scores fall; Q3 is a point below which 75% fall; the median
is sometimes referred to as Q2 because it divides the distribution into two equal halves; Q is half
the distance from Q3 to Q1.
The formula for Q is:
Q = (Q3 - Q1) / 2
Ungrouped Data
For ungrouped data, quartiles can be obtained using the following formulas, but before that
arrange the data points in increasing order.
Q1 = [(n+1)/4]th item,
Q3 = [3(n+1)/4]th item
Grouped Data
Q1= P25
Q3=P75
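For ungrouped data, the position rules and Q = (Q3 - Q1) / 2 can be sketched as below. The helper names are hypothetical, and the linear interpolation used when a position is not a whole number is an assumption, since the notes do not say how to handle that case. The sketch assumes at least three scores.

```python
def item_at(ordered, position):
    """Value at a 1-based position in an ordered list.
    Fractional positions are linearly interpolated (an assumption)."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

def quartile_deviation(scores):
    ordered = sorted(scores)                    # arrange in increasing order
    n = len(ordered)
    q1 = item_at(ordered, (n + 1) / 4)          # [(n+1)/4]th item
    q3 = item_at(ordered, 3 * (n + 1) / 4)      # [3(n+1)/4]th item
    return q1, q3, (q3 - q1) / 2                # Q = (Q3 - Q1) / 2

q1, q3, q = quartile_deviation([0, 7, 8, 11, 15, 16, 20])
print(q1, q3, q)
```

With seven scores the positions are the 2nd and 6th items, so no interpolation is needed in this example.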
Range
Range is the simplest measure of variability or dispersion. It is calculated by subtracting the
lowest score in the series from the highest. But it is a very rough measure of the variability of a
series.
Deviational Score
Question: On the mid-term exam in their sociology course, Shweta’s deviation score was +5 and
Nita’s deviation score was -13. If the mean for the class was 75, what were the raw scores
obtained by Shweta and Nita?
● Shweta: X1 - 75 = +5,
X1 = 75 + 5 = 80
● Nita: X2 - 75 = -13,
X2 = 75 - 13 = 62
The usefulness of a statistic which provides a measure of variability can be seen from a simple
example. Suppose a test of arithmetic reasoning has been administered to a group of 50 boys and
to a group of 50 girls. The mean scores are, boys, 34.6, and girls, 34.5. So far as the means go
there is no difference in the performance of the two groups. But suppose the boys' scores are
found to range from 15 to 51 and the girls' scores from 19 to 45. This difference in range shows
that in a general sense the boys are more variable than the girls; and this greater variability may
be of more interest than the lack of a difference in the means.
Four measures have been devised to indicate the variability or dispersion within a set of
measures. These are (1) the range, (2) the quartile deviation or Q, (3) the standard deviation or
SD and (4) the variance.
Properties of SD
Similar to mean in its properties
3. The standard deviation, like the mean, is responsive to the exact value of every score in
the distribution. Because it is calculated by taking deviations from the mean, if a score is
shifted to a position away from the mean, the standard deviation will increase. If the shift
is closer to the mean, the standard deviation decreases.
5. It is smallest when the deviations are taken from the mean as compared to any other
point. Written algebraically, this property becomes: Σ (X - x̅)² is a minimum.
6. The SD is least subject to sampling variation. Sampling fluctuation refers to the extent to
which a statistic takes on different values with different samples. That is, it refers to how
much the value fluctuates from sample to sample. A statistic whose value fluctuates
greatly from sample to sample is highly subject to sampling fluctuation. If we were
repeatedly to take samples from a population on a random basis, all measures of
dispersion would vary but the SD varies least compared to other measures. In repeated
random samples drawn from populations, the numerical value of the standard deviation
tends to jump about less than would that of other measures computed on the same
samples.
7. The above property is very important in inferential statistics and is a major reason why
the SD is used in inferential statistics whenever possible, i.e. it is least subject to
sampling fluctuation.
8. Score transformations: A score transformation is a process that changes every score in a
distribution to one on a different scale. Scores may be transformed by adding,
subtracting, multiplying or dividing by a constant. Adding, or subtracting a constant to
each score in the distribution has no effect on SD i.e. it remains the same. Multiplying or
dividing each score in a distribution by a constant results in multiplying and dividing the
SD by the same amount. The effect of addition, subtraction, multiplication or division by
a constant holds true also for the other measures of dispersion.
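The effect of score transformations on the SD can be checked numerically. This sketch uses `statistics.pstdev` from the Python standard library, which divides by N; the scores and the constants 10 and 3 are arbitrary choices made for illustration.

```python
import statistics

scores = [6, 2, 3, 1]
sd = statistics.pstdev(scores)

# Adding a constant to every score leaves the SD unchanged.
sd_after_adding = statistics.pstdev([x + 10 for x in scores])

# Multiplying every score by a constant multiplies the SD by that constant.
sd_after_multiplying = statistics.pstdev([x * 3 for x in scores])

print(round(sd, 2), round(sd_after_adding, 2), round(sd_after_multiplying, 2))
```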
The standard deviation, which is typically reported with the mean, is the most important and
most widely used measure of dispersion for quantitative variables whose distributions are
relatively symmetrical. Its popularity is due largely to its superior sampling stability. However, the
standard deviation is neither a preferred nor an appropriate measure of dispersion when a
distribution is very skewed.
1. The semi-interquartile range, denoted by Q, is defined as one-half the distance between the
first quartile point, Q1 and the third quartile point, Q3. The quartile points are the three score
points that divide the distribution into four parts, each containing an equal number of cases.
These points, symbolized by Q1, Q2, and Q3, are therefore also referred to as P25, P50 (the median),
and P75, respectively.
Formula for Q is: Q = (Q3 - Q1) / 2
2. It is responsive to the number of scores above or below its value, but not to their exact
locations.
4. Q is more subject to sampling fluctuation than the SD but less subject to sampling
fluctuation than the other measures.
5. Because Q is usually less stable than the SD from sample to sample, it is not as useful
in inferential statistics.
Range
Similar to mode in its properties and often reported with the mode.
● The Range is the difference between the lowest and highest values.
● A great deal of information is ignored when computing the range, since only the largest
and smallest data values are considered.
● The range can sometimes be misleading when there are extremely high or low values.
Example: In {8, 11, 5, 9, 7, 6, 3616}, the lowest value is 5 and the highest is 3616, so the range is 3616 - 5 = 3611.
The single value of 3616 makes the range large, even though most values are around 10.
Variance
What is the difference between standard deviation and variance? Explain why the standard
deviation is more often used as a statistic than the variance.
● Square of the SD
● The variance is frequently used in inferential statistics but not much in
descriptive statistics. The reason why it is not used in descriptive statistics is that its
calculated value is expressed in squared units of measurement. (If the scores are weights
in pounds, the variance will be a certain quantity of squared pounds.) Consequently, it is
of little use in descriptive statistics. However, this is easily remedied by taking the
square root of the variance, i.e. by using an index of variability called the standard
deviation.
● Properties similar to SD
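The relationship between the variance and the SD can be illustrated with a short sketch. The weights are hypothetical, and the population forms (dividing by N) are assumed.

```python
import statistics

weights_lb = [150, 160, 170, 180, 190]       # hypothetical weights in pounds

variance = statistics.pvariance(weights_lb)  # expressed in squared pounds
sd = statistics.pstdev(weights_lb)           # square root of the variance, in pounds

print(variance, sd)
```

The SD is back in the original unit (pounds), which is why it is preferred for description.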
Comparison
● Skewed data: if the data is skewed or has a few extreme scores, then Q is most preferable,
followed by the SD and then the range (Q, SD, and Range).
● Sampling fluctuation: SD is most resistant to sampling fluctuation followed by Q and
then Range.
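The sensitivity of each measure to an extreme score can be sketched as follows. The helper name spread and the "typical" data set are hypothetical; the "extreme" data set reuses the range example. Quartiles come from statistics.quantiles with its default method, which is an assumption about how Q is computed.

```python
import statistics

def spread(scores):
    """Return (range, SD, Q) for a list of scores."""
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    return (
        max(scores) - min(scores),   # range
        statistics.pstdev(scores),   # population SD
        (q3 - q1) / 2,               # semi-interquartile range Q
    )

typical = [8, 11, 5, 9, 7, 6, 10]    # hypothetical: no extreme score
extreme = [8, 11, 5, 9, 7, 6, 3616]  # one extreme score

for name, scores in (("typical", typical), ("extreme", extreme)):
    r, sd, q = spread(scores)
    print(f"{name}: range={r}, SD={sd:.2f}, Q={q}")
```

A single extreme score inflates the range and the SD enormously, while Q barely moves.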
Uses
The SD is the most generally useful measure of variability, whereas the range is the least
useful. However, there are circumstances in which the SD is less useful, such as when there are a
few extreme scores in the data (i.e. the data is skewed), in which case Q should be used as it is less
affected by extreme scores. Q should also be used when the data is open-ended, and the range should be
used when only a rough measure is required.
2. Use the Q
(1) when the median is the measure of central tendency
(2) when there are scattered or extreme scores which would influence the SD disproportionately
(3) when the concentration around the median—the middle 50% of cases—is of primary interest.
4. Use the SD
(1) when the statistic having the greatest stability is sought
(2) when coefficients of correlation and other statistics are subsequently to be computed.
The mean is the most important measure, as it is least subject to sampling fluctuation and is
therefore the most stable. Hence, we can use it in inferential statistics.
9. For a distribution of scores the mean is 70 and the median is 80. Describe its shape.
It will be negatively skewed, since the mean is less than the median.
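The link between skew and the order of the mean and median can be sketched with a small hypothetical data set in which one very low score pulls the mean down. The scores are illustrative only and do not reproduce the distribution in the question.

```python
import statistics

# Hypothetical scores with one extreme low value (a long tail to the left)
scores = [20, 75, 78, 80, 80, 82, 85]

mean = statistics.mean(scores)
median = statistics.median(scores)

# In a negatively skewed distribution the mean lies below the median
shape = "negatively skewed" if mean < median else "not negatively skewed"
print(round(mean, 2), median, shape)
```

The extreme low score drags the mean below the median, which stays at the middle score.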
11. If the data is badly skewed which measure should not be used?
The mean should not be used. It is the most sensitive to extreme scores (called outliers)
of all the measures of central tendency. Because of this, the mean may not be the best choice
when the distribution contains a few very extreme scores (outliers) or when the distribution is
badly skewed.
12. With the help of a rough sketch, illustrate which measure of central tendency has the
highest value in a positively skewed distribution.
12. A school psychologist determines the IQ score for every student in her school. The
school nurse measures the current height of every student. Are the two studying the same
population? Explain.
The two populations are different: one is a population of IQ scores and the other is a population of
heights. The two populations will differ in their measures of central tendency and measures
of variability.
13. A researcher is interested in assessing intelligence of DU students. She selects 5% of the
students to test their intelligence level. Identify the population, sample, parameter, and
statistic in this context. 2
The median is an appropriate measure, as there is an outlier (10) that will affect the mean but not
the median.
15. The mean of a set of 5 scores is 50. Four of the scores are 52, 54, 30 and 44. Find
the fifth score.
M = (X1 + X2 + X3 + X4 + X5) / 5
50 = (52 + 54 + 30 + 44 + X5) / 5
50 = (180 + X5) / 5
X5 = (50 x 5) - 180 = 70
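The same calculation can be checked with a short sketch. The variable names are chosen for illustration only.

```python
import statistics

known_scores = [52, 54, 30, 44]   # the four given scores
target_mean = 50                  # the stated mean of all five scores
n = 5                             # total number of scores

# The sum of all scores must equal mean * n, so the missing score is the remainder
fifth_score = target_mean * n - sum(known_scores)

# Check: the mean of the completed set should equal the stated mean
check_mean = statistics.mean(known_scores + [fifth_score])
print(fifth_score, check_mean)
```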
16. If you calculated the variance of a distribution and obtained a value of -25, is your
answer correct? Why?
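One way to approach this question is to recall that the variance is the mean of squared deviations from the mean, and a square is never negative. The sketch below assumes the population form (dividing by N); the function name and scores are hypothetical.

```python
def population_variance(scores):
    """Mean of the squared deviations from the mean."""
    mean = sum(scores) / len(scores)
    # Each squared deviation is >= 0, so their mean is also >= 0
    return sum((x - mean) ** 2 for x in scores) / len(scores)

examples = [[1, 2, 3, 4], [5, 5, 5], [-10, 0, 10]]   # hypothetical score sets
variances = [population_variance(s) for s in examples]
print(variances)
```

Whatever the scores, the result cannot fall below zero, so a value of -25 points to a computational error.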