StatisticsLecture1

The document discusses the importance of descriptive statistics in communicating information and reasoning about data, using the Challenger disaster as a case study. It outlines various methods for visualizing data, including dot plots, pie charts, bar graphs, histograms, boxplots, and scatterplots, each serving different purposes. Additionally, it explains numerical summary measures such as mean, median, percentiles, and standard deviation to summarize data effectively.

Uploaded by

thekonan726
Copyright
© All Rights Reserved

Introduction to Statistics | Lecture 1

Why are descriptive statistics important?

In January 1986, the space shuttle Challenger broke apart shortly after liftoff. The
accident was caused by a part that was not designed to fly at the unusually cold
temperature of 29° F at launch.
Here are the launch temperatures of the first 25 shuttle missions (in degrees F):

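[Figure: histogram of launch temperatures. Horizontal axis: temperature bins [29, 42], (42, 55], (55, 68], (68, 81] in degrees F. Vertical axis: counts from 0 to 16.]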

The two most important functions of descriptive statistics are to communicate information
and to support reasoning about data.
When exploring large data sets, it becomes essential to use summaries.
A graphical summary is usually the best way to communicate information, because people
prefer to look at pictures rather than at numbers. There are many ways to visualize data. The
nature of the data and the goal of the visualization determine which method to choose.

Pie chart and Dot plot

The dot plot makes it easier to compare the frequencies of various categories, while the pie
chart makes it easier to eyeball what fraction of the total a category represents.

Bar graph
When the data are quantitative (i.e. numbers), they should be put on a number
line. This is because the ordering of the numbers and the distances between them convey
important information. The bar graph is essentially a dot plot put on its side.

The Histogram
The histogram allows blocks with different widths. The key point is that the areas of
the blocks are proportional to frequency.

So, the percentage falling into a block can be figured out without a vertical scale, since the
total area equals 100%. But it is helpful to have a vertical scale (density scale). Its unit is '%
per unit', so in the above example the vertical unit is '% per year'.

The histogram gives two kinds of information about the data:

1. Density (crowding): The height of the bar tells how many subjects there are for one
unit on the horizontal scale. For example, the highest density is around age 19, as
0.04 = 4% of all subjects are age 19. In contrast, only about 0.7% of subjects fall into
each one-year range for ages 60–80.

2. Percentages (relative frequencies): Those are given by

area = height x width.
For example, about 14% of all subjects fall into the age range 60–80, because the
corresponding area is (20 years) x (0.7% per year) = 14%. Alternatively, you can find
this answer by eyeballing that this area makes up roughly 1/7 of the total area of the
histogram, so roughly 1/7 ≈ 14% of all subjects fall in that range.
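
The relation area = height x width can be checked with a short sketch. The bin edges and counts below are hypothetical numbers chosen for illustration (they are not the data behind the lecture's histogram), and `density_heights` is a helper name introduced only for this example.

```python
def density_heights(counts, edges):
    """Return block heights in '% per unit' so that each block's area
    (height x width) equals the percentage of observations in that block."""
    total = sum(counts)
    heights = []
    for count, left, right in zip(counts, edges[:-1], edges[1:]):
        percent = 100.0 * count / total            # area of the block, in %
        heights.append(percent / (right - left))   # height = area / width
    return heights


# Hypothetical age bins (in years) and subject counts, for illustration only.
edges = [0, 20, 40, 60, 80]
counts = [30, 36, 20, 14]

heights = density_heights(counts, edges)

# Recover the percentage in the 60-80 block from area = height x width.
width_60_80 = edges[4] - edges[3]
percent_60_80 = heights[3] * width_60_80

# The areas of all blocks add up to 100%.
total_area = sum(h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:]))

print("heights (% per year):", heights)
print("percent in 60-80:", percent_60_80)
print("total area (%):", total_area)
```

With these made-up counts the last block has a height of about 0.7% per year and a width of 20 years, which mirrors the 14% calculation in the text.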

The boxplot
The boxplot depicts five key numbers of the data. The boxplot conveys less information than
a histogram, but it takes up less space and so is well suited to comparing several datasets.

The Scatterplot
The scatterplot is used to depict data that come as pairs. The scatterplot visualizes the
relationship between the two variables.

Numerical summary measures
For summarizing data with one number, use the mean (average) or the median.
The median is the number that is larger than half the data and smaller than the other
half.

Differences between mean and median:

1. Symmetric data – data sets whose values are evenly spread around the centre.
2. Skewed data – data sets that aren't symmetric.

Mean and median are the same when the histogram is symmetric.

If the median sales price of 10 homes is $1 million, then we know that at least 5 homes sold for
$1 million or more. If we are told that the average sale price is $1 million, then we can't
draw such a conclusion.
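
A small sketch makes the home-price point concrete. The ten prices below are hypothetical, chosen so that the average is exactly $1 million while only one home actually sold for that much or more.

```python
from statistics import mean, median

# Hypothetical sale prices of 10 homes, in thousands of dollars
# (illustration only, not data from the lecture).
prices = [500, 550, 600, 650, 700, 750, 800, 850, 900, 3700]

mean_price = mean(prices)      # pulled upward by the single expensive home
median_price = median(prices)  # average of the 5th and 6th sorted values

# Count how many homes sold at or above each summary value.
homes_at_or_above_mean = sum(p >= mean_price for p in prices)
homes_at_or_above_median = sum(p >= median_price for p in prices)

print("mean:", mean_price, "homes at or above:", homes_at_or_above_mean)
print("median:", median_price, "homes at or above:", homes_at_or_above_median)
```

In this skewed example half of the homes sold at or above the median, but knowing only the average says little about how many homes reached it.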
Percentiles

The 90th percentile of incomes is $135,000: 90% of households report an income of
$135,000 or less, and 10% report more.
The 75th percentile is called the 3rd quartile: $85,000.
The 50th percentile is the median: $50,000.
The 25th percentile is called the 1st quartile.

Recall that the boxplot gives a five-number summary of the data:

the smallest number, 1st quartile, median, 3rd quartile, largest number.

The interquartile range = 3rd quartile − 1st quartile. It measures how spread out the data are.
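
The five-number summary and the interquartile range can be sketched as follows. The data are hypothetical, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`; other quartile conventions exist and can give slightly different values, so this is one reasonable choice rather than the lecture's prescribed method.

```python
from statistics import quantiles

# Hypothetical data set, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points:
# 1st quartile, median, 3rd quartile.
q1, med, q3 = quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, med, q3, max(data))
iqr = q3 - q1  # interquartile range = 3rd quartile - 1st quartile

print("five-number summary:", five_number)
print("interquartile range:", iqr)
```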

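The standard deviation

A more commonly used measure of spread is the standard deviation.
$\bar{x}$ stands for the average of the numbers $x_1, x_2, \ldots, x_n$.
The standard deviation of these numbers is computed by dividing the sum of squared deviations either by $n$ or by $n-1$:

\[
s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\quad \text{or} \quad
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\]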

The two numbers $\bar{x}$ and $s$ are often used to summarize data. Both are sensitive to a
few very large or very small values. If that is a concern, use the median and the interquartile range.
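
As a minimal sketch, the standard deviation can be computed directly from its definition. The function name `standard_deviation`, its `sample` flag, and the data are assumptions made for this example; the flag switches between dividing by n and dividing by n − 1.

```python
from math import sqrt


def standard_deviation(xs, sample=False):
    """Square root of the averaged squared deviations from the mean.

    Divides by n by default, or by n - 1 when sample=True.
    """
    n = len(xs)
    x_bar = sum(xs) / n                              # the average
    squared_devs = sum((x - x_bar) ** 2 for x in xs)
    divisor = n - 1 if sample else n
    return sqrt(squared_devs / divisor)


# Hypothetical data, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s_n = standard_deviation(data)                        # divides by n
s_n_minus_1 = standard_deviation(data, sample=True)   # divides by n - 1

print("dividing by n:", s_n)
print("dividing by n - 1:", s_n_minus_1)
```

The two versions differ noticeably only for small n; Python's standard library offers the same two choices as `statistics.pstdev` and `statistics.stdev`.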
