Statistics Lecture 1
In January 1986, the space shuttle Challenger broke apart shortly after liftoff. The
accident was caused by a part that was not designed to fly at the unusually cold
temperature of 29 °F at launch.
Here are the launch temperatures of the first 25 shuttle missions (in degrees F):
[Figure: histogram of launch temperatures. Horizontal axis: temperature bins [29, 42], (42, 55], (55, 68], (68, 81]. Vertical axis: ticks from 0 to 16.]
The two most important functions of descriptive statistics are to communicate information
and to support reasoning about data.
When exploring large data sets, it becomes essential to use summaries.
It is best to use a graphical summary to communicate information, because people
prefer to look at pictures rather than at numbers. There are many ways to visualize data. The
nature of the data and the goal of the visualization determine which method to choose.
The dot plot makes it easier to compare the frequencies of various categories, while the pie
chart makes it easier to eyeball what fraction of the total a category corresponds to.
Bar graph
When the data are quantitative (i.e. numbers), they should be put on a number
line. This is because the ordering of the numbers and the distances between them convey
important information. The bar graph is essentially a dot plot put on its side.
The Histogram
The histogram allows blocks of different widths. The key point is that the areas of
the blocks are proportional to frequency.
So the percentage falling into a block can be read off without a vertical scale, since the
total area equals 100%. But it is helpful to have a vertical scale (density scale). Its unit is '%
per unit', so in the above example the vertical unit is '% per year'.
1. Density (crowding): The height of the bar tells how many subjects there are for one
unit on the horizontal scale. For example, the highest density is around age 19, where
the height is 0.04: about 4% of all subjects are age 19. In contrast, only about 0.7% of
subjects fall into each one-year range for ages 60–80.
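The relationship between block area and block height can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own computation: the function name `density_heights`, the bin edges, and the counts are all made up for the example. It divides each bin's percentage by its width to get the height in '% per unit', so that the areas add up to 100%.

```python
def density_heights(counts, edges):
    """Return the height of each histogram block in '% per unit'.

    counts[i] is the number of observations in the bin that runs
    from edges[i] to edges[i + 1]. Bins may have different widths.
    """
    total = sum(counts)
    heights = []
    for i, count in enumerate(counts):
        width = edges[i + 1] - edges[i]
        percent = 100.0 * count / total   # area of the block
        heights.append(percent / width)   # height = area / width
    return heights


# Hypothetical age data (not from the lecture): three bins of unequal width.
edges = [0, 20, 40, 80]
counts = [30, 50, 20]

heights = density_heights(counts, edges)
for i, h in enumerate(heights):
    print(f"ages {edges[i]}-{edges[i + 1]}: {h} % per year")
```

Although the last bin holds 20% of the subjects, its height is the lowest because those subjects are spread over 40 years rather than 20.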
The boxplot
The boxplot depicts five key numbers of the data. The boxplot conveys less information than
a histogram, but it takes up less space and so is well suited to comparing several datasets.
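The notes do not list the five numbers. Under the usual convention they are the minimum, first quartile, median, third quartile, and maximum, and the sketch below assumes that convention. The function name `five_number_summary` and the sample data are hypothetical, and the quartiles use the `inclusive` method of Python's `statistics.quantiles`, which is only one of several common quartile definitions.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    # quantiles(..., n=4) returns the three cut points that split
    # the data into four parts: Q1, the median, and Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, median, q3, max(data))


# Hypothetical data set, chosen only for illustration.
sample = [9, 1, 5, 3, 7, 2, 8, 4, 6]
print(five_number_summary(sample))
```

A boxplot would draw the box from the first to the third quartile, a line at the median, and whiskers out toward the minimum and maximum.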
The Scatterplot
The scatterplot is used to depict data that come as pairs. The scatterplot visualizes the
relationship between the two variables.
Numerical summary measures
For summarizing data with one number, use the mean (average) or the median.
The median is the number that is larger than half the data and smaller than the other
half.
Mean and median are the same when the histogram is symmetric.
If the median sales price of 10 homes is $1 million, then we know that at least 5 homes sold for
$1 million or more. If we are told that the average sales price is $1 million, then we can't
draw such a conclusion.
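A small numerical check makes the difference concrete. The two price lists below are invented for illustration, and `homes_at_or_above` is a hypothetical helper. In the first list the average is $1 million but only one home reached that price. In the second list the median is $1 million, and at least half of the homes did.

```python
import statistics


def homes_at_or_above(prices, threshold):
    """Count how many sale prices are at or above the threshold."""
    return sum(1 for p in prices if p >= threshold)


MILLION = 1_000_000

# Hypothetical sales: nine modest homes and one very expensive one.
skewed_prices = [500_000] * 9 + [5_500_000]

# Hypothetical sales whose median is exactly $1 million.
typical_prices = [600_000, 700_000, 800_000, 900_000, 1_000_000,
                  1_000_000, 1_200_000, 1_500_000, 2_000_000, 3_000_000]

print("mean of skewed prices:   ", statistics.mean(skewed_prices))
print("median of skewed prices: ", statistics.median(skewed_prices))
print("homes >= $1M (skewed):   ", homes_at_or_above(skewed_prices, MILLION))
print("median of typical prices:", statistics.median(typical_prices))
print("homes >= $1M (typical):  ", homes_at_or_above(typical_prices, MILLION))
```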
Percentiles
The interquartile range = 3rd quartile − 1st quartile. It measures how spread out the data are.
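As a rough sketch, the quartiles can be read off as the 25th and 75th percentiles. The helper names `percentile` and `interquartile_range` and the data are hypothetical. Different textbooks and software interpolate percentiles slightly differently, and this example assumes the `inclusive` method of `statistics.quantiles`.

```python
import statistics


def percentile(data, p):
    """Return the p-th percentile of data, for an integer p from 1 to 99."""
    # quantiles(..., n=100) returns 99 cut points; the p-th one
    # sits at index p - 1.
    cut_points = statistics.quantiles(data, n=100, method="inclusive")
    return cut_points[p - 1]


def interquartile_range(data):
    """Return 3rd quartile minus 1st quartile."""
    return percentile(data, 75) - percentile(data, 25)


# Hypothetical data: the whole numbers 1 through 101.
values = list(range(1, 102))
print("1st quartile:", percentile(values, 25))
print("median:      ", percentile(values, 50))
print("3rd quartile:", percentile(values, 75))
print("IQR:         ", interquartile_range(values))
```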
$$
s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\quad \text{or} \quad
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$
The two numbers $\bar{x}$ and $s$ are often used to summarize data. Both are sensitive to a
few very large or very small values. If that is a concern, use the median and the interquartile range.
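The sensitivity to a single extreme value can be checked numerically. The sketch below is illustrative only: `standard_deviation` and `iqr` are hypothetical helpers, the two data sets are made up, and the quartiles again assume the `inclusive` method of `statistics.quantiles`. The two data sets differ in one value.

```python
import math
import statistics


def standard_deviation(data, use_n_minus_1=False):
    """Root of the average squared deviation from the mean.

    Divides by n by default, or by n - 1 if use_n_minus_1 is True.
    """
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    divisor = n - 1 if use_n_minus_1 else n
    return math.sqrt(squared_deviations / divisor)


def iqr(data):
    """3rd quartile minus 1st quartile (inclusive method)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data; the second list replaces 9 with the extreme value 90.
original = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = [2, 4, 4, 4, 5, 5, 7, 90]

for name, data in [("original", original), ("with outlier", with_outlier)]:
    print(name)
    print("  mean:  ", statistics.mean(data))
    print("  s:     ", standard_deviation(data))
    print("  median:", statistics.median(data))
    print("  IQR:   ", iqr(data))
```

The mean and standard deviation change substantially between the two lists, while the median and interquartile range stay the same.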