Transforming Data
Introduction
In this investigation, I will examine how translations and enlargements of data affect statistical parameters such as the mean, the median, the quartiles and the standard deviation. To do this, I will add, subtract and multiply every score in the data set by a constant value and observe how each operation changes these parameters. The investigation will first test adding, subtracting and multiplying by a constant to find out how these changes affect the mean and standard deviation, and to identify the pattern. It will then consider how the quartiles and the interquartile range (IQR) are influenced.
Throughout, the constant value is denoted by the letter "a", and I will analyze the connection between the range of "a" and the pattern of changes in the mean and standard deviation. For instance, I will compare the cases a > 0, a < 1 and a < 0. Through this I will be able to develop a pattern relating the cumulative frequency graph and its graphical characteristics to the interquartile range. Moreover, the investigation will deepen the understanding of the quartiles, including the median, of the interquartile range, and of the methods for calculating those values.
In this investigation, all values are rounded to 3 d.p. for concision. However, when calculating differences, the full decimal values are used to give more accurate results.
| Parameters | Equations or Symbol | Functions |
| --- | --- | --- |
| Median | The middle value of a data set arranged in order of size. Ex. 2, 3, 4: the median is 3 | To find the center value. |
| Standard Deviation | $\sqrt{\frac{\sum (x - \bar{x})^2}{n}}$ | A measure of the amount of variation of a random variable expected about its mean. |
| IQR | Q3 - Q1 | IQR stands for interquartile range. It is the distance from the first quartile (Q1) to the third quartile (Q3) in a data set. |
I. Mean
To calculate the mean of the given data, first sum all the values and then divide the sum by the total number of values.
Standard deviation of the original data:
$$\sqrt{\frac{\sum (x - 152.8)^2}{60}} = \sqrt{\frac{(130 - 152.8)^2 + (130 - 152.8)^2 + \cdots + (172 - 152.8)^2 + (179 - 152.8)^2}{60}} = 17.09542236 \approx 17.095$$
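The two calculations above can be sketched in a few lines of Python. This is only an illustration: the function names `mean` and `population_sd` and the list `sample_heights` are made up for this sketch, and the seven heights are a hypothetical sample, not the 60 heights used in this investigation.

```python
import math


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def population_sd(values):
    """Population standard deviation: square root of the mean squared deviation."""
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))


# Hypothetical sample of heights in cm (NOT the data set of this investigation).
sample_heights = [130, 142, 150, 155, 161, 172, 179]

print(round(mean(sample_heights), 3))
print(round(population_sd(sample_heights), 3))
```

Dividing by n (here `len(values)`) rather than n - 1 matches the formula in the parameter table, which treats the data as a whole population.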
Graph 1
Graph 1 is the dot plot of the given height data, with frequency on the y-axis and height on the x-axis. A dot is plotted for each height between 130 cm and 180 cm in steps of 1 cm. Where a dot lies exactly on the x-axis (y = 0), no one in the data set has that height.
Investigating how the parameters, mean and standard deviation, are impacted when there is a change to the data set
Mean: When 5 cm is added to each height in the data set, the sum of the data set becomes 9468, and the calculation for the mean can be shown as:
9468 (new sum) ÷ 60 = 157.8
| Default | When 5 is added to the data | Comparison with original default value |
| --- | --- | --- |
| 152.8 | 157.8 | +5 |
As you can see, the mean has increased by 5 when 5 cm is added to each score in the data set. You can conclude that when a constant value is added to every value in a data set, the mean changes by that same value.
Standard deviation: When 5 cm is added to each height in the data set, the calculation for the standard deviation can be shown as:
$$\sqrt{\frac{\sum \{(x + 5) - (\bar{x} + 5)\}^2}{60}} = \sqrt{\frac{\sum (x + 5 - 157.8)^2}{60}}$$
I have added 5 on the side of the x value as well, since all the values are equally increased by 5.
$$\sqrt{\frac{(130 + 5 - 157.8)^2 + \cdots + (172 + 5 - 157.8)^2 + (179 + 5 - 157.8)^2}{60}} = 17.09542236 \approx 17.095$$
| Default | When 5 is added to the data | Comparison with original default value |
| --- | --- | --- |
| 17.095 | 17.095 | 0 |
Graph 2
Graph 2 compares the dot plots before and after 5 is added to each score. As you can observe, the dots have shifted parallel to the x-axis by 5 units. The blue dots represent the original values and the red dots represent the data when 5 is added to each value. This parallel shift shows that the entire data set moves by the same amount when a constant value is added (a > 0). Furthermore, it is also possible to observe that the range of the data does not change when the data is translated.
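The effect of a translation can also be checked numerically. The sketch below is an illustration only: `translate` is a helper name introduced here, and `sample` is a hypothetical set of heights rather than the data of this investigation. It relies on `fmean` and `pstdev` from Python's standard `statistics` module (population standard deviation).

```python
import statistics


def translate(values, a):
    """Add the constant a to every value (use a negative a to subtract)."""
    return [x + a for x in values]


# Hypothetical sample of heights in cm (NOT the data set of this investigation).
sample = [130, 142, 150, 155, 161, 172, 179]
shifted = translate(sample, 5)

# Change in each parameter after the translation.
mean_change = statistics.fmean(shifted) - statistics.fmean(sample)
sd_change = statistics.pstdev(shifted) - statistics.pstdev(sample)
range_change = (max(shifted) - min(shifted)) - (max(sample) - min(sample))

print(mean_change, sd_change, range_change)
```

Only the mean moves: the standard deviation and the range depend on the distances between scores, which a translation leaves untouched.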
[Subtraction] When 12 cm is subtracted from each value in the data set (a < 0)
Mean: When 12 cm is subtracted from each height in the data set, the sum of the data set becomes 8448 (9168 − 60 × 12), and the calculation for the mean can be shown as:
8448 (new sum) ÷ 60 = 140.8
| Default | When 12 is subtracted from the data | Comparison with original default value |
| --- | --- | --- |
| 152.8 | 140.8 | −12 |
Standard deviation: When 12 cm is subtracted from each height in the data set, the calculation for the standard deviation can be shown as:
$$\sqrt{\frac{\sum \{(x - 12) - (\bar{x} - 12)\}^2}{60}} = \sqrt{\frac{\sum (x - 12 - 140.8)^2}{60}}$$
I have subtracted 12 on the side of the x value as well, since all the values are equally decreased by 12.
$$\sqrt{\frac{(130 - 12 - 140.8)^2 + \cdots + (172 - 12 - 140.8)^2 + (179 - 12 - 140.8)^2}{60}} = 17.09542236 \approx 17.095$$
| Default | When 12 is subtracted from the data | Comparison with original default value |
| --- | --- | --- |
| 17.095 | 17.095 | 0 |
Mean: When each height in the data set is multiplied by 5, the sum of the data set becomes 45840, and the calculation for the mean can be shown as:
45840 (new sum) ÷ 60 = 764
| Default | When the data are multiplied by 5 | Comparison with original default value |
| --- | --- | --- |
| 152.8 | 764 | ×5 |
As you can see, the mean has increased to 5 times the default value (152.8 × 5 = 764) when each of the scores in the data set is multiplied by 5. You can conclude that when every value in a data set is multiplied by a constant, the mean is multiplied by the same constant.
Standard deviation: When each height in the data set is multiplied by 5, the calculation for the standard deviation can be shown as:
$$\sqrt{\frac{\sum \{5x - 5\bar{x}\}^2}{60}} = \sqrt{\frac{\sum (5x - 764)^2}{60}}$$
$$= \sqrt{\frac{(650 - 764)^2 + \cdots + (860 - 764)^2 + (895 - 764)^2}{60}} = 85.478712 \approx 85.479$$
| Default | When the data are multiplied by 5 | Comparison with original default value |
| --- | --- | --- |
| 17.095 | 85.479 | ×5 |
As you can observe, the standard deviation has been multiplied by 5 (17.09542236 × 5 ≈ 85.48). This shows that when the data is multiplied by a constant that increases the mean, the standard deviation also increases, and it increases by the same constant factor that was applied to the data set. Therefore, when the number (a) by which the values are multiplied is greater than 1 (a > 1), the standard deviation increases.
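A quick numerical check of this scaling pattern is sketched below. It is an illustration under stated assumptions: `scaling_ratios` is a helper name introduced here, `sample` is a hypothetical set of heights (not the data of this investigation), and the function assumes the original mean and standard deviation are not zero, since it divides by them.

```python
import statistics


def scaling_ratios(values, a):
    """Return (mean ratio, SD ratio) of the scaled data relative to the original.

    Assumes the original mean and standard deviation are both non-zero.
    """
    scaled = [a * x for x in values]
    mean_ratio = statistics.fmean(scaled) / statistics.fmean(values)
    sd_ratio = statistics.pstdev(scaled) / statistics.pstdev(values)
    return mean_ratio, sd_ratio


# Hypothetical sample of heights in cm (NOT the data set of this investigation).
sample = [130, 142, 150, 155, 161, 172, 179]

for a in (5, 0.2):
    mean_ratio, sd_ratio = scaling_ratios(sample, a)
    print(a, round(mean_ratio, 6), round(sd_ratio, 6))
```

For a positive constant, both ratios come out equal to the constant itself, which is the pattern described in the text.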
Graph 4.
Graph 4 compares the dot plots (cumulative frequency against height) before and after each score is multiplied by 5. I have used a different type of dot plot from the one used when investigating addition and subtraction. This is because the previous type of graph loses its readability under multiplication: the graph becomes significantly wider than it does under addition or subtraction. Therefore, to keep the investigation concise and communicate more clearly, I have changed the format of the graph. The green dots represent the original values and the red dots represent the data when each value is multiplied by 5. Not only is the graph shifted up along the y-axis, but the distance in the y direction between successive scores has also increased. From this you can observe that multiplying a data set by a constant changes its standard deviation, since the standard deviation measures the spread of the data about the mean.
[Multiplication] When each value in the data set is multiplied by 0.2 (0 < a < 1)
0.2 is a decimal number which can be converted to the fraction 1/5, and multiplying by 1/5 is equal to dividing by 5. Therefore this section investigates how the mean and standard deviation are impacted when the values are divided by a constant.
As you can see, the mean has decreased to 0.2 times the default value (152.8 × 0.2 = 30.56) when each of the scores in the data set is multiplied by 0.2. You can conclude that when every value in a data set is multiplied by a constant, the mean is multiplied by the same constant.
Standard deviation: When each height in the data set is multiplied by 0.2, the calculation for the standard deviation can be shown as:
$$\sqrt{\frac{\sum \{0.2x - 0.2\bar{x}\}^2}{60}} = \sqrt{\frac{\sum (0.2x - 30.56)^2}{60}}$$
$$= \sqrt{\frac{(26 - 30.56)^2 + \cdots + (34.4 - 30.56)^2 + (35.8 - 30.56)^2}{60}} = 3.3905977357192 \approx 3.391$$
| Default | When the data are multiplied by 0.2 | Comparison with original default value |
| --- | --- | --- |
| 17.095 | 3.391 | approximately ×0.2 |
As you can observe, the standard deviation has decreased, being multiplied by 0.2. This shows that when the data is multiplied by a constant that decreases the mean, the standard deviation also decreases, by the same constant factor that was applied to the data set. Therefore, when the number (a) by which the values are multiplied is between 0 and 1 (0 < a < 1), the standard deviation decreases.
Graph 5.
Graph 5 compares the dot plots (cumulative frequency against height) before and after each score is multiplied by 0.2. The green dots represent the original values and the red dots represent the data when each value is multiplied by 0.2. Not only is the graph shifted down along the y-axis, but the distance in the y direction between successive scores has also decreased. From this you can observe that multiplying a data set by a constant changes its standard deviation, since the standard deviation measures the spread of the data about the mean.
[Multiplication] When each value in the data set is multiplied by a value lower than 0 (a < 0)
When -1 is multiplied
| Default | When the data are multiplied by -1 | Comparison with original default value |
| --- | --- | --- |
| 152.8 | -152.8 | ×(-1) |
As you can observe, when a < 0 the values become negative, although in the context of height measurement a negative value is impossible. However, it is still possible to calculate the mean, as shown in the table above. You can further observe that the new mean, when each score is multiplied by -1, is the original mean multiplied by -1, which shows that it follows the same rule as when a > 0.
Standard deviation:
$$\sqrt{\frac{\sum \{(-1)x - (-1)\bar{x}\}^2}{60}} = \sqrt{\frac{\sum (-x + 152.8)^2}{60}}$$
$$= \sqrt{\frac{(-130 + 152.8)^2 + \cdots + (-172 + 152.8)^2 + (-179 + 152.8)^2}{60}} = 17.09542236 \approx 17.095$$
| Default | When the data are multiplied by -1 | Comparison with original default value |
| --- | --- | --- |
| 17.095 | 17.095 | ×1 |
When -2 is multiplied
Mean :
-18336(new sum) ÷ 60 = -305.6
| Default | When the data are multiplied by -2 | Comparison with original default value |
| --- | --- | --- |
| 152.8 | -305.6 | ×(-2) |
As you can observe from the comparison between the default mean and the changed mean, the mean has changed by a factor of -2. Therefore, when each of the scores is multiplied by -2, the mean is multiplied by -2 as well.
Standard deviation: When each height in the data set is multiplied by -2, the calculation for the standard deviation can be shown as:
$$\sqrt{\frac{\sum \{-2x - (-2)\bar{x}\}^2}{60}} = \sqrt{\frac{\sum (-2x + 305.6)^2}{60}}$$
$$= \sqrt{\frac{(-260 + 305.6)^2 + \cdots + (-344 + 305.6)^2 + (-358 + 305.6)^2}{60}} = 34.19084472 \approx 34.191$$
As you can observe, even though the data is multiplied by a negative number, -2, the standard deviation stays
positive.
| Default | When the data are multiplied by -2 | Comparison with original default value |
| --- | --- | --- |
| 17.095 | 34.191 | ×2 |
As you can see from table 15, even when the data set is multiplied by -2, the standard deviation is 2 times the original standard deviation, and this factor of 2 is a positive number.
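The pattern seen for a = -1 and a = -2 can also be checked algebraically. The following is a short sketch, assuming the population standard deviation formula used throughout this investigation; the symbols σ_x (original standard deviation) and σ_ax (standard deviation after every score is multiplied by a) are notation introduced here for illustration.

```latex
\begin{aligned}
\sigma_{ax} &= \sqrt{\frac{\sum (a x - a\bar{x})^2}{n}}
             = \sqrt{\frac{a^2 \sum (x - \bar{x})^2}{n}}
             = |a|\,\sigma_x \\
a = -2:\quad \sigma_{-2x} &= |-2| \times 17.09542236 = 34.19084472
\end{aligned}
```

Because the deviations are squared before they are summed, the sign of a is lost, which is why the standard deviation is multiplied by |a| rather than by a.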
| a > 0 | When 0 is multiplied | 0 | 0 | 0 | 0 |
Graph 6
Graph 6 shows the cumulative frequency graph of the height data from table N. From this graph you can read off the first, second and third quartiles of the data. The first quartile, Q1, is also known as the lower quartile; it marks the 25th percentile, so the lowest 25% of the data lies below this point. The second quartile, Q2, is the median, the middle of the entire data set, so the lowest 50% of the data lies below it (the 50th percentile). The third quartile, also known as Q3, marks the 75th percentile, so the lowest 75% of the data lies below this value.
Finding Median Q2
To find the median, we can use the median equation ((n + 1)/2)th. In this equation, n represents the number of values in the data set, so n = 60. This equation tells us the position of the median in the ordered data set.
$$\frac{60 + 1}{2} = 30.5$$
To find Q1, we can use the equation (1/4)(n + 1)th, where n again represents the number of values in the data set, so n = 60. This equation tells us the position of Q1 in the ordered data set.
$$\frac{1}{4}(60 + 1) = 15.25$$
To find Q3, we can use the equation (3/4)(n + 1)th, where n again represents the number of values in the data set, so n = 60. This equation tells us the position of Q3 in the ordered data set.
$$\frac{3}{4}(60 + 1) = 45.75$$
Q2 (median) = 148.5
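The position formulas above can be turned into a small calculation. The sketch below is one possible reading of the method, not necessarily the exact procedure used for the tables in this investigation: it assumes that a fractional position such as 15.25 is resolved by linear interpolation between the two neighbouring scores. The names `quartile_positions`, `value_at_position` and `quartiles` are introduced here for illustration.

```python
def quartile_positions(n):
    """Positions of Q1, Q2 and Q3 in an ordered data set of n values,
    using the (n + 1)/4, (n + 1)/2 and 3(n + 1)/4 rules."""
    return tuple(k * (n + 1) / 4 for k in (1, 2, 3))


def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position.

    Assumption: fractional positions are resolved by linear interpolation
    between the two neighbouring values.
    """
    lower = int(position)        # whole-number part of the position
    fraction = position - lower  # how far to move toward the next value
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """Return (Q1, Q2, Q3) of the data using the position rules above."""
    data = sorted(values)
    return tuple(value_at_position(data, p) for p in quartile_positions(len(data)))


print(quartile_positions(60))
print(quartiles([1, 2, 3, 4, 5, 6, 7]))
```

For n = 60 the positions are 15.25, 30.5 and 45.75, so under this assumption the median is the value halfway between the 30th and 31st scores.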
Investigating how changes to the values of the data set impact the median and IQR
Graph 7 compares two cumulative frequency graphs: the green one is the original graph and the red one is the graph when 5 is added to each score in the data. The graph is shifted parallel to the x-axis, 5 units to the right. The quartiles are shifted in the same way, so the new values of Q1, Q2 and Q3 are the original values increased by 5. The interquartile range, on the other hand, stays the same, because adding a constant to all data points shifts the entire data set but does not change the spacing between the quartiles.
The new Q3: Q3 + 5
The new Q1: Q1 + 5
The new IQR = (Q3 + 5) - (Q1 + 5)
= Q3 - Q1
= 32
Graph 8
Graph 8 compares two cumulative frequency graphs: the green one is the original graph and the red one is the graph when 12 is subtracted from each score in the data. The graph is shifted parallel to the x-axis, 12 units to the left. The quartiles are shifted in the same way, so the new values of Q1, Q2 and Q3 are the original values decreased by 12. The interquartile range, on the other hand, stays the same, because subtracting a constant from all data points shifts the entire data set but does not change the spacing between the quartiles.
The new Q3: Q3 + a
The new Q1: Q1 + a
The new IQR = (Q3 + a) - (Q1 + a)
= Q3 - Q1 (a disappears)
= 32
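The shift of the quartiles can be checked with Python's standard `statistics` module. This is a sketch on a hypothetical sample, not the data of this investigation; `quartiles_and_iqr`, `sample`, `original` and `results` are names introduced here. `statistics.quantiles` with its default "exclusive" method places the quartiles at the (n + 1)-based positions used above.

```python
import statistics


def quartiles_and_iqr(values):
    """Return (Q1, Q2, Q3, IQR).

    The default 'exclusive' method of statistics.quantiles places the
    quartiles at the (n + 1)/4, (n + 1)/2 and 3(n + 1)/4 positions.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q1, q2, q3, q3 - q1


# Hypothetical sample of heights in cm (NOT the data set of this investigation).
sample = [130, 142, 150, 155, 161, 172, 179]
original = quartiles_and_iqr(sample)

results = {}
for a in (5, -12):
    # Adding a negative a is the same as subtracting.
    results[a] = quartiles_and_iqr([x + a for x in sample])
    print(a, results[a])
```

In each case Q1, Q2 and Q3 move by a, while the last entry (the IQR) is the same as for the original sample.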
| When 1 is subtracted | 136 | -1 | 32 | 0 |
| When 2 is subtracted | 135 | -2 | 32 | 0 |
| When 3 is subtracted | 134 | -3 | 32 | 0 |
[Multiplication] When each value in the data set is multiplied by 5 (a > 0)
Graph 9
Graph 9 shows the cumulative frequency graph when each score in the data set is multiplied by 5. The graph moves in the positive x direction, as the initial value has changed from 130 to 650. The quartiles Q1, Q2 and Q3 take new values, and the interquartile range has also changed.
As you can see from the changes in the quartiles, Q1, Q2, Q3 and the IQR have all been multiplied by 5 when each of the scores in the data set is multiplied by 5.
The new IQR = 5Q3 - 5Q1
= 5(Q3 - Q1)
= 5(32)
= 160
[Multiplication] When each value in the data set is multiplied by 0.2 (0 < a < 1)
Graph 10
Graph 10 shows the cumulative frequency graph when each score in the data set is multiplied by 0.2. The graph moves in the negative x direction, toward the origin, as the initial value has changed from 130 to 26 (130 × 0.2). The quartiles Q1, Q2 and Q3 take new values, and the interquartile range has also changed.
Graph 11
Graph 11 shows the cumulative frequency graph when each score in the data set is multiplied by -0.3. The graph moves in the negative x direction and is reflected, as the initial value has changed from 130 to -39. The quartiles Q1, Q2 and Q3 take new values, and the interquartile range has also changed.
| When 0 is multiplied | 0 | ×0 | 0 | ×0 |
| When 1 is multiplied | 137 | ×1 | 32 | ×1 |
| When 2 is multiplied | 274 | ×2 | 64 | ×2 |
| When 3 is multiplied | 411 | ×3 | 96 | ×3 |
When every score in the data set is multiplied by a constant, the quartiles are scaled by that same constant. For example, when the data was multiplied by 5, Q1, Q3 and the IQR were each multiplied by 5, and so was the median. From this you can be sure that when the data set is multiplied by a > 1, the quartile values increase proportionally. However, when 0 < a < 1, the quartile values decrease, as shown by the experiment with 0.2: Q1, Q3, the IQR and the median were each multiplied by 0.2. Lastly, when the data is multiplied by a < 0, the quartile values are again scaled by the constant. For example, when the data was multiplied by -0.3, the values of Q1, Q3 and the median were each multiplied by -0.3; the effect on the IQR is discussed under "Changes on IQR".
Changes on IQR.
As you can observe from tables 21 and 23, when the data set was multiplied by a constant, the IQR followed the change. For example, when the data set was multiplied by 5 (a > 1), the IQR was also multiplied by 5. When it was multiplied by 0.2 (0 < a < 1), the IQR was also multiplied by 0.2. So we can be sure that when the data set is multiplied by a > 0, the IQR increases or decreases proportionally.
However, when the data set is multiplied by a < 0, a difference appears while calculating the IQR. In a normal IQR calculation, Q3 must be greater than or equal to Q1, but if the original Q1 and Q3 are simply multiplied by a < 0, the scaled Q3 becomes smaller than the scaled Q1, and Q3 - Q1 comes out negative. According to research ('Why can’t IQR be negative' Study, 2022), the IQR is always non-negative because it represents the range of the middle 50% of the data, reflecting the spread of the data, not its absolute values. Therefore a negative IQR indicates an error in the data processing or calculation: multiplying by a negative number reverses the order of the data, so the quartiles must be found again from the re-ordered data. The IQR, as a measure of statistical dispersion, is inherently non-negative.
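The difference between the two ways of handling a negative constant can be illustrated numerically. The sketch below uses a hypothetical sample rather than the data of this investigation, and the names `iqr` and `naive_scaled_iqr` are introduced here. It compares multiplying the original Q1 and Q3 by a with recomputing the quartiles after the scaled data has been re-ordered.

```python
import statistics


def iqr(values):
    """IQR with the quartiles recomputed from the data (sorted internally)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def naive_scaled_iqr(values, a):
    """Multiply the ORIGINAL Q1 and Q3 by a without re-ordering the data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return a * q3 - a * q1


# Hypothetical sample of heights in cm (NOT the data set of this investigation).
sample = [130, 142, 150, 155, 161, 172, 179]
a = -0.3
scaled = [a * x for x in sample]

print(naive_scaled_iqr(sample, a))  # negative, because a * Q3 < a * Q1
print(iqr(scaled))                  # non-negative, quartiles found again
print(abs(a) * iqr(sample))         # |a| times the original IQR
```

On this sample the recomputed IQR agrees with |a| times the original IQR, while the naive calculation gives the same size with a negative sign.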
[Conclusion]
In conclusion, this investigation shows that when a constant value a (a > 0 or a < 0) is added to each data point in a data set, both the mean and the median are shifted by a. However, the standard deviation and the IQR remain unchanged by the addition of a constant, because they are measures of spread or dispersion in the data, and adding a constant to each data point does not alter the distances between the data points. Thus, while the measures of central tendency (mean and median) change with the constant shift, the measures of variability (standard deviation and IQR) do not.
When each data point is multiplied by a constant value a, the mean and median change proportionally. For example, if a > 1, the mean and median increase by a factor of a. Conversely, if 0 < a < 1, the mean and median decrease by a factor of a. Moreover, if a < 0, the mean and median also change sign. Similarly, the standard deviation and IQR change when each data point is multiplied by a constant. Multiplying by a > 1 increases the standard deviation and IQR proportionally, and multiplying by 0 < a < 1 decreases both values proportionally. This is because the standard deviation and IQR measure the spread of the data points, and multiplying by a constant scales the distances between data points. Multiplying by a negative value changes the standard deviation by the same magnitude, as the standard deviation is inherently non-negative.
Further Investigation
To convert the mean of a data set to 0, the mean of the original data set, which was calculated earlier in this investigation, should be subtracted from each score. This works because the dot plot is shifted parallel to the x-axis when a constant value a is subtracted from the data set. For example, in this investigation, when 12 was subtracted from the data set, the mean decreased by 12. Therefore, subtracting 152.8, the mean of the original data set, gives a mean of 0.
To calculate the mean, the sum of the data is divided by the number of values in the data set, so the new mean is (9168 - 60 × 152.8) ÷ 60 = 0.
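The same result holds for any data set, not only for this one. A brief sketch of the general argument, using the usual notation of x̄ for the mean and n for the number of values (both as defined earlier in this investigation):

```latex
\frac{\sum (x - \bar{x})}{n}
  = \frac{\sum x}{n} - \frac{n\bar{x}}{n}
  = \bar{x} - \bar{x}
  = 0
```

Subtracting the mean from every score removes exactly n copies of x̄ from the sum, which is why the new mean is 0.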
To convert the standard deviation of a data set to 1, each score should be divided by the standard deviation. When each data point is divided by the standard deviation, the distribution is standardized, so that the spread of the data is normalized to a standard deviation of 1. This standardization makes it simpler to compare and analyze different data sets, because their variability is measured on the same scale.
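$$\sqrt{\frac{\sum_{n=1}^{60} \left\{ (x_n \div 17.1) - \left( \frac{\sum_{n=1}^{60} (x_n \div 17.1)}{60} \right) \right\}^2}{60}}$$
$$= \sqrt{\frac{\{(7.6) - (8.94)\}^2 + \{(7.6) - (8.94)\}^2 + \cdots + \{(10.24) - (8.94)\}^2 + \{(10.24) - (8.94)\}^2}{60}} = \sqrt{\frac{60}{60}} = 1$$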
3. Transform the given set of data so that it has a mean of 0 and a standard deviation of 1.
To transform the data set so that it has a mean of 0 and a standard deviation of 1, we can combine the two methods used in parts one and two: subtract 152.8 from each score and then divide the result by 17.1.
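$$\sqrt{\frac{\sum_{n=1}^{60} \left\{ ((x_n - 152.8) \div 17.1) - \left( \frac{\sum_{n=1}^{60} ((x_n - 152.8) \div 17.1)}{60} \right) \right\}^2}{60}}$$
$$= \sqrt{\frac{(-1.3)^2 + (-1.3)^2 + \cdots + (1.4)^2 + (1.4)^2}{60}} = \sqrt{\frac{60}{60}} = 1$$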
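The combined transformation can be sketched in Python as follows. This is an illustration on a hypothetical sample, not the data of this investigation; `standardize`, `sample` and `z_scores` are names introduced here. Unlike the hand calculation, which uses the rounded values 152.8 and 17.1, the sketch uses the unrounded mean and population standard deviation, and it assumes the standard deviation is not zero.

```python
import statistics


def standardize(values):
    """Subtract the mean from every value, then divide by the population SD.

    Assumes the values are not all equal (the SD must be non-zero).
    """
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(x - m) / sd for x in values]


# Hypothetical sample of heights in cm (NOT the data set of this investigation).
sample = [130, 142, 150, 155, 161, 172, 179]
z_scores = standardize(sample)

print(round(statistics.fmean(z_scores), 9))
print(round(statistics.pstdev(z_scores), 9))
```

Up to floating-point rounding, the transformed sample has a mean of 0 and a standard deviation of 1.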