Week 7

The document discusses scientific measurement techniques, particularly focusing on small data sizes where the t-distribution is applicable instead of the normal distribution. It provides an example of calculating the t-value to determine if a particle's mass is less than a specified value using limited measurements. Additionally, it explains the construction of box-and-whisker plots to visualize data spread and identify outliers.

Uploaded by

rp21ms106

Elements of Scientific Measurement

(continued)

1 When the data size is small


The above procedures apply to situations where the data size is reasonably
large (at least 25), without which the Central Limit Theorem would not be
applicable. But there are situations in which it is difficult (or very expensive)
to obtain many data points. What to do in such cases?
We have seen earlier that if the number of samples is sufficiently large,
the sampling distribution of the mean follows a normal distribution. In that
case, we defined a quantity
\[ z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}, \]

which followed a normal distribution. Then we used the z-table to obtain
the probability of getting a z value at least that large (or that small).
When the data size n is small, the sampling distribution of the mean
does not follow a normal distribution. Instead it follows a different
distribution, called the t-distribution, whose characteristics can be used to
derive meaningful results. In this case, the quantity t is defined the same
way:
\[ t = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \]
Since σ is usually unknown for a small data set, it is approximated by the
sample standard deviation s. One can then use the t-table to derive similar
conclusions.


Let us illustrate this with an example.
Example 1: A scientist could take only 9 measurements on the mass of a
particle, and the measured values were 16.2, 19.7, 21.8, 15.6, 19.0, 18.7, 16.9,
21.7 and 20.2 (in suitable units). Do the data provide sufficient evidence to
say that the mass of the particle is less than 21? Here “sufficient evidence”
implies that the probability that the statement is wrong is less than 0.01 or
1%.
Solution: From the data, we find that the sample mean is x̄ = 18.87, and
the sample standard deviation is s = 2.2583. The prediction we have to test
is that the population mean µ < 21. This is the same as checking the odds
of getting x̄ = 18.87 or below if the value of µ were 21. So our approach will
be to assume µ = 21 and to check the probability of getting x̄ = 18.87 or
below. If the probability is less than 0.01, there will be less than 1% chance
of making an error.
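
As a check on the numbers quoted above, the sample mean and standard deviation can be worked out explicitly. The sketch below assumes the usual n − 1 definition of the sample standard deviation, which reproduces the quoted value s = 2.2583; the symbol x_i for the individual measurements is introduced here only for illustration.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{169.8}{9} \approx 18.87, \\
s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
   = \sqrt{\frac{40.8}{8}} \approx 2.2583 .
\end{align*}
```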

Using the data, we get
\[ t = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{18.87 - 21}{2.2583/\sqrt{9}} = -2.83 \]

We need to look at the t-table in Table 1 to locate the threshold value
of t that has a significance level of 1%. The columns are arranged according
to the “significance level” (which is the area under the t-distribution curve
beyond that value of t). In this case, we are looking for a 1% significance
level. The rows are arranged according to the degree of freedom (DoF), which
is one less than the number of data points, i.e., n − 1. Here the number of
data points is 9. Therefore the degree of freedom is n − 1 = 8. For this
degree of freedom and a 1% significance level (area in one tail), the t-table
gives t = 2.896. Therefore the probability of getting a t value higher than
2.896 is 1%. Since the t-distribution is symmetrical about zero, the
probability of getting a t value below −2.896 is also 1%.
The value of t we got in our case is −2.83, which is above −2.896. This
implies that if the mean is µ = 21, the probability of getting x̄ = 18.87 or
lower is more than 1%. Thus, from the data, if we state that the population
mean µ < 21, there will be more than 1% chance of committing an error. Hence
the data do not provide sufficient evidence, at the 1% level, that the mass
is less than 21. □
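
The example above can also be checked numerically without a printed table. The following is a minimal sketch, not part of the original notes: it uses only the Python standard library and integrates the t-distribution density with Simpson's rule to estimate the tail probability. The helper names `t_pdf` and `lower_tail_prob` and the step count are illustrative choices.

```python
import math
import statistics

def t_pdf(x, dof):
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    coeff = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return coeff * (1 + x * x / dof) ** (-(dof + 1) / 2)

def lower_tail_prob(t, dof, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|] plus the symmetry of the curve."""
    a = abs(t)
    h = a / steps  # `steps` must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(a, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central = total * h / 3  # area under the curve between 0 and |t|
    return 0.5 - central if t < 0 else 0.5 + central

masses = [16.2, 19.7, 21.8, 15.6, 19.0, 18.7, 16.9, 21.7, 20.2]
mu0 = 21.0    # hypothesised population mean
alpha = 0.01  # significance level used in the example

n = len(masses)
x_bar = statistics.mean(masses)
s = statistics.stdev(masses)  # sample standard deviation (n - 1 in the denominator)
t = (x_bar - mu0) / (s / math.sqrt(n))
p = lower_tail_prob(t, n - 1)  # probability of a t value this low or lower

print(f"mean = {x_bar:.2f}, s = {s:.4f}, t = {t:.2f}, p = {p:.4f}")
print("sufficient evidence" if p < alpha else "not sufficient evidence")
```

The computed tail probability comes out slightly above 1%, which agrees with the conclusion of the example. A statistics library would normally supply this probability directly; the hand-written integration is used here only to keep the sketch self-contained.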

2 Box and whisker plots


You may notice that a plot of the experimental results showing the error bars
does not convey the spread of the data obtained. In some
applications where such information is essential, one prefers a different way
of presenting the results, called a ‘box-and-whisker’ plot. A typical
representation is shown in Fig. 1.
Figure 1: A typical box and whisker plot (axes labelled Variable and Parameter)

In producing such a plot, the data are first arranged in ascending order.
The minimum and maximum values thus obtained give the
extremities of the ‘whiskers’ of the plot. Then one has to obtain the median,
which is the middle value. If the number of data points is odd,
the middle number is easy to identify. If there is an even number of data
points, two numbers will appear in the middle, and one has to take the mean
of these two numbers. This median gives the mid-point of the plot, called
the second quartile, or Q2 (see Fig. 2).
Figure 2: The ranges in a box and whisker plot (from left to right: Minimum,
Q1, Q2 (median), Q3, Maximum; the interquartile range spans Q1 to Q3)

Then one has to obtain the median of the data points below Q2. That
gives another value, called the first quartile or Q1. Similarly, one obtains
the median of the data points above Q2, which gives the third quartile, or
Q3. The range between Q1 and Q3 is called the interquartile range (IQR),
which is plotted as a box. The range between the minimum and Q1, and
that between Q3 and the maximum, are each plotted as a ‘whisker’. Therefore, the
representation of a typical data set would look like Fig. 2. One characteristic
feature of such a plot is that roughly 25% of the data lie in each of the four
ranges shown in the plot.
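
The procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function name `five_number_summary` and the sample data are hypothetical, and the median is excluded from both halves when the number of points is odd. It needs at least two data points.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # data points below Q2
    upper = ordered[half + (n % 2):]  # data points above Q2 (skip the median if n is odd)
    return (
        ordered[0],
        statistics.median(lower),     # Q1: median of the lower half
        statistics.median(ordered),   # Q2: median of the whole set
        statistics.median(upper),     # Q3: median of the upper half
        ordered[-1],
    )

# Hypothetical data, chosen only to illustrate the even-n case.
print(five_number_summary([7, 1, 3, 9, 5, 11, 13, 15]))
```

Library quantile routines often use other interpolation conventions, so their quartiles may differ slightly from the values this method gives.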
Sometimes one gets some data points that lie far outside the natural
range of the data. These are called ‘outliers’. The box plot also enables
one to identify and present the outliers. The usual convention is that data
points lying more than 1.5 times the interquartile range beyond either end of
the box are called outliers. Therefore, one can identify the ‘reasonable’ range
of the data as that between (Q1 − 1.5 × IQR) and (Q3 + 1.5 × IQR), and any data
point falling outside this range may be suspected to be an outlier.¹
Such outliers may result from experimental or observational errors but
may also result from some phenomenon not yet discovered. That is why one
cannot simply ignore an outlier or delete it from a data set. Outliers have to
be faithfully presented in a research paper, though you may ignore these in
further analysis of the data.
¹ Before we conclude that such a point is indeed an outlier, some more tests would be
required.

Example 1:
Consider the following data set:
17.2, 15.9, 16.7, 18.3, 15.0, 19.3, 20.2, 16.3, 17.9, 15.3, 10.1, 19.1, 18.2
Obtain the box and whisker plot.
Solution:
Arranging the data in ascending order, we get
10.1, 15.0, 15.3, 15.9, 16.3, 16.7, 17.2, 17.9, 18.2, 18.3, 19.1, 19.3, 20.2
It has 13 data points, which is an odd number. Hence, the 7th data point,
17.2, is the median.
There are 6 data points below the median and 6 above it, which is an even
number. So we get Q1 as the mean of the 3rd and 4th entries,
Q1 = (15.3 + 15.9)/2 = 15.6. Similarly, we get Q3 as the mean of the 10th and
11th entries, Q3 = (18.3 + 19.1)/2 = 18.7.

Figure 3: The box and whisker plot for the whole data set given in the
Example (minimum 10.1, Q1 15.6, median 17.2, Q3 18.7, maximum 20.2; axis
from 10 to 20)

The resulting box and whisker plot is shown in Fig. 3.


Now let us see if any data point can be identified as an outlier. The IQR is
18.7 − 15.6 = 3.1, so 1.5 × IQR = 4.65. Going below Q1 by 1.5 × IQR gives
15.6 − 4.65 = 10.95. There is one data point, 10.1, below that value. Therefore
we can suspect that this point is an outlier and set the end of the whisker at
the lowest data point above 10.95, which is 15.0. Going above Q3 by 1.5 × IQR
gives 18.7 + 4.65 = 23.35. This is above the highest point of the data set.
Thus there is no outlier on the higher side. The resulting plot, excluding the
outlier, is shown in Fig. 4.
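
The arithmetic of this example can be reproduced with a short script. The sketch below is illustrative only: the variable names are hypothetical, the quartiles are taken as medians of the lower and upper halves as described earlier, and the 1.5 × IQR rule is used to flag suspected outliers.

```python
import statistics

data = [17.2, 15.9, 16.7, 18.3, 15.0, 19.3, 20.2, 16.3, 17.9, 15.3, 10.1, 19.1, 18.2]

ordered = sorted(data)
n = len(ordered)
q1 = statistics.median(ordered[: n // 2])          # median of the points below Q2
q2 = statistics.median(ordered)                    # overall median
q3 = statistics.median(ordered[n // 2 + n % 2 :])  # median of the points above Q2
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # points below this are suspected outliers
upper_fence = q3 + 1.5 * iqr  # points above this are suspected outliers

outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
inside = [x for x in ordered if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]  # whisker ends after excluding outliers

print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"fences: {lower_fence:.2f} to {upper_fence:.2f}")
print(f"suspected outliers: {outliers}")
print(f"whiskers: {whisker_low} to {whisker_high}")
```

A point flagged this way is only a suspected outlier; it should still be reported alongside the plot rather than silently dropped.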

Figure 4: The box and whisker plot excluding the outlier (whisker end 15.0,
Q1 15.6, median 17.2, Q3 18.7, maximum 20.2; the outlier 10.1 is marked
separately; axis from 10 to 20)

Table 1: The t-table.
