
Lec 2

The document discusses methods for visualizing and summarizing variation in both numerical and categorical data, including tools like dotplots, histograms, and bar charts. It emphasizes the importance of understanding the distribution's shape, center, and variability, as well as the need for appropriate graphing techniques to avoid misleading representations. Additionally, it outlines how to describe categorical distributions using mode and variability.

Uploaded by

slenderwather
Copyright
© All Rights Reserved

Picturing Variation with

Graphs
Topics

• Visualizing variation in numerical and categorical data
• Summarizing important features in numerical and categorical distributions
SECTION 2.1 VISUALIZING
VARIATION IN NUMERICAL DATA

• Exploring a
Distribution of
Numerical Data
• Dotplots, Frequency
/ Relative Frequency
Histograms, and
Stemplots
What Is a Distribution?

Recall: Any collection of data will have variation within the data.
The most important tool for organizing the
variation in data is called the distribution of the
sample.
Three Main Components of a Numerical
Distribution:
1. Shape (What does it look like visually?)
2. Center (typical value)
3. Variability (horizontal spread)
Distribution: Example
Here are some raw data from the National
Collegiate Athletic Association (NCAA), available
online. This set of data shows the number of goals
scored by first-year NCAA female soccer players in
Division III in the 2012 season.
9,11,11,11,11,12,13,13,13,13,13,14,14,14,15,
15,16,16,16,16,18,18,19,19,20,20,21,35
Note that:
• Distributions record all observed data values.
• Seeing patterns in the raw data may be difficult.
Frequency Tables
The distribution of data can be organized in a frequency table.
(This is the data from the previous example.)
Value Frequency
9 1
11 4
12 1
13 5
14 3
15 2
16 4
18 2
19 2
20 2
21 1
35 1
Keep in mind:
• A frequency table lists all data values with their counts.
• Patterns may still be difficult to see.
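As an illustration, the frequency table can be rebuilt from the raw data in a few lines. This is a minimal sketch in Python (the slides themselves name StatCrunch and the TI-84, so the language and the variable names `goals` and `frequency` are assumptions made for the example).

```python
from collections import Counter

# Goals scored by first-year NCAA Division III female soccer players (2012),
# copied from the raw data in the example.
goals = [9, 11, 11, 11, 11, 12, 13, 13, 13, 13, 13, 14, 14, 14, 15,
         15, 16, 16, 16, 16, 18, 18, 19, 19, 20, 20, 21, 35]

# Count how many times each distinct value occurs.
frequency = Counter(goals)

print("Value Frequency")
for value in sorted(frequency):
    print(value, frequency[value])
```

The counts add up to the 28 observations in the raw data, which is a quick check that no value was dropped.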
Examining a Distribution (1 of 2)

When examining distributions, use a two-step process:
1. Visualize the data.
– Use a graph that effectively summarizes the
data visually.
– Using a picture to display the data will help us
see patterns.
– Graphs for numerical data include:
Dotplots, histograms, and
stemplots
Examining a Distribution (2 of 2)

2. Summarize the data.
– Shape: Is there symmetry?
– Center: Is there a most common value?
– Spread: Are any data values far from the rest
of the data?
Visualizing Data: Goal

Keep in mind:
• Using a picture to display the data will help to
see patterns.
• Different visual representations capture different
aspects in the data.
• The picture must:
– Record the data values.
– Indicate the frequency (count) of the data
values.
Visualizing Data: Dotplots
Dotplot
Record data values on a number line with a dot above the number line for each data value observed.
Example: Here is the dotplot for our example data.
Value Frequency
9 1
11 4
12 1
13 5
14 3
15 2
16 4
18 2
19 2
20 2
21 1
35 1
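A rough text version of a dotplot can be sketched as follows. This is only an approximation under stated assumptions: the number line runs down the page instead of across it, each `o` stands for one observation, and the helper name `dotplot_rows` is made up for the example.

```python
from collections import Counter

goals = [9, 11, 11, 11, 11, 12, 13, 13, 13, 13, 13, 14, 14, 14, 15,
         15, 16, 16, 16, 16, 18, 18, 19, 19, 20, 20, 21, 35]

def dotplot_rows(data):
    """Return one text row per integer on the number line, one 'o' per observation."""
    counts = Counter(data)
    rows = []
    # Walk every integer from the smallest to the largest value so gaps stay visible.
    for value in range(min(data), max(data) + 1):
        rows.append(f"{value:>3} | " + "o" * counts.get(value, 0))
    return rows

for row in dotplot_rows(goals):
    print(row)
```

Because empty positions are printed too, the long gap between 21 and 35 shows up clearly, which is the kind of pattern a dotplot is meant to reveal.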
Dotplot: Example (1 of 2)

• How many textbooks cost $150 or more?


• What percent of the textbooks cost $50 or less?
• Are there any unusually expensive or
inexpensive texts?
Dotplot: Example (2 of 2)

• 4 textbooks cost $150 or more.
• (8/23) × 100% = 34.8% of the textbooks cost $50 or less.
• The text that costs close to $300 may be unusually expensive.
Dotplots: Advantages and
Disadvantages
• Advantages
– Shows individual data values
– Helps investigate the shape of the distribution
• Disadvantages
– Not as common as histograms and other
graphs
– Not great for data with too many individual
values
Visualizing Data: Histogram
Histogram
• Group data into intervals, called bins (width of
the interval = bin width).
• Count how many data values fall into each bin.
• Each rectangle has the following properties:
– Consecutive bins touch
– First value in each bin is recorded on the
horizontal axis
– The height of each rectangle corresponds to
the count.
Histogram: Example
Here is the histogram for our example data.
Value Frequency
9 1
11 4
12 1
13 5
14 3
15 2
16 4
18 2
19 2
20 2
21 1
35 1

Note: Vertical axis can show frequency or relative frequency
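The binning steps can be sketched in code. The bin width of 5 and the starting value of 5 are assumptions chosen for illustration (the slides do not state which bins the pictured histogram uses), the function name `histogram_counts` is hypothetical, and the sketch assumes integer data.

```python
goals = [9, 11, 11, 11, 11, 12, 13, 13, 13, 13, 13, 14, 14, 14, 15,
         15, 16, 16, 16, 16, 18, 18, 19, 19, 20, 20, 21, 35]

def histogram_counts(data, start, width):
    """Count values per bin; bin k covers start + k*width up to (not including) the next bin."""
    n_bins = (max(data) - start) // width + 1
    counts = [0] * n_bins
    for x in data:
        counts[(x - start) // width] += 1
    return counts

START, WIDTH = 5, 5  # assumed bin layout, not taken from the slides
counts = histogram_counts(goals, START, WIDTH)

for k, count in enumerate(counts):
    left = START + k * WIDTH
    relative = count / len(goals)  # relative frequency = count / total
    print(f"[{left}, {left + WIDTH}): frequency {count}, relative frequency {relative:.3f}")
```

Frequencies and relative frequencies differ only by the constant factor 1/28 here, so either choice for the vertical axis gives a histogram with the same shape.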


Histogram: Changing Bin Widths

Changing the bin width changes the shape.


Notes About Bin Width

• A width that is:
– Too narrow shows too much detail.
– Too wide hides detail.
• Most technology (StatCrunch and TI-84)
chooses bin widths for an initial look at the data,
but one should always experiment with adjusting
bin width to see if anything interesting appears.
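To see the effect of bin width without graphing software, the same data can be binned at several widths. This is a sketch under assumptions: bins start at multiples of the width, empty bins are simply not listed, and the widths 1, 5, and 20 are arbitrary choices for the example.

```python
from collections import Counter

goals = [9, 11, 11, 11, 11, 12, 13, 13, 13, 13, 13, 14, 14, 14, 15,
         15, 16, 16, 16, 16, 18, 18, 19, 19, 20, 20, 21, 35]

def bin_counts(data, width):
    """Map each bin's left edge (a multiple of width) to the number of values in that bin."""
    return Counter((x // width) * width for x in data)

for width in (1, 5, 20):
    counts = bin_counts(goals, width)
    summary = "  ".join(f"{left}:{counts[left]}" for left in sorted(counts))
    print(f"width {width:>2} -> {summary}")
```

With width 1 every distinct value gets its own bar, while width 20 collapses the data into two bars and hides both the mound near 13 and the isolated value at 35.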
Histogram: Advantages and
Disadvantages
• Advantages
– Good for large data sets
– Helps focus on the general shape of the data
– Easy to spot outliers
• Disadvantages
– Individual data values are not visible (lost)
– Distribution shape affected by change in bin
width
Visualizing Data: Stemplot

Stemplots
• Also called stem-and-leaf plots
• Like dotplots, show all individual data values
• Useful when technology is not available or when
the data set is not too large
Stemplot: Example
A collection of college students who said that they drink
alcohol were asked how many alcoholic drinks they had
consumed in the last seven days. Their answers were:
1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 6, 6, 6, 8, 10, 10, 15,
17, 20, 25, 30, 30, 40

The stemplot for this data is:


Stem Leaves
0 111112223333345556668
1 0057
2 05
3 00
4 0
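The stemplot above can be reproduced with a short sketch. It assumes non-negative whole numbers, uses the tens digit as the stem and the ones digit as the leaf, and the function name `stemplot` is made up for the example.

```python
from collections import defaultdict

drinks = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 6, 6, 6, 8, 10, 10, 15,
          17, 20, 25, 30, 30, 40]

def stemplot(data):
    """Return {stem: leaves}, where stem = value // 10 and each leaf is the ones digit."""
    leaves = defaultdict(str)
    for x in sorted(data):
        leaves[x // 10] += str(x % 10)
    # List every stem between the smallest and largest, even if it has no leaves.
    return {stem: leaves.get(stem, "") for stem in range(min(leaves), max(leaves) + 1)}

print("Stem Leaves")
for stem, leaf_string in stemplot(drinks).items():
    print(stem, leaf_string)
```

Each row's length equals the number of observations on that stem, so the printed rows double as a sideways histogram with bin width 10.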
SECTION 2.2 SUMMARIZING
IMPORTANT FEATURES OF A
NUMERICAL DISTRIBUTION
• Shape, Center, and
Spread
• Outliers
Shape

Three Basic Characteristics of Shape:
– Is the distribution symmetric or skewed?
– How many mounds appear?
– Are unusually large or small values present?
Shape: Symmetric

Symmetric
Left and right side roughly the same
Shape: Skewed

Skewed
Most of the data is on one side with a long tail
(or skew) on the other side
Shape: Mounds (1 of 2)

Classify data by how many mounds are present:
• Unimodal - One main mound
Shape: Mounds (2 of 2)

Classify data by how many mounds are present:
• Bimodal - Two main mounds
• Multimodal - More than two main mounds
Mounds: Keep in Mind
• Mounds can be different heights.
• Bimodal and multimodal data may indicate
existence of different groups within the data.
• In this case, it may be preferable to separate the
data into two groups and provide separate
graphs for each group.
• Examples:
– Men's and women's heights
– Afternoon and evening sales at a restaurant
Shape: Examples

What shape would you expect to see in a histogram of the following data sets?
• GPA of college students
• SAT scores
• Last digit of Social Security numbers for a
random sample of students
• Income of USA residents
Shape: Example
What shape would you expect to see in a histogram of the following data sets?
• GPA of college students: Skewed left
• SAT scores: Symmetric (Unimodal)
• Last digit of Social Security numbers for a random sample of students: Symmetric (Uniform)
• Income of USA residents: Skewed right
Shape: Extreme Values (1 of 2)
Outliers
– Extremely large or small values
– Data values that don’t fit the pattern of the rest
of the data
– Not precisely defined (subject to opinion)
Shape: Extreme Values (2 of 2)

When you see extremely large or small values:
• Report the values.
• Realize they could be sources of error (typos,
etc.).
• Genuine outliers are unusually interesting data
values!
Center
Center - The typical data value
Example: The histograms for Division III first-year women's and men's soccer players in 2012 are given below.
– The typical scores:
Women: 16 goals
Men: 13 goals
– It would seem that the typical male soccer player scores fewer goals in a season than the typical female player.
Variability (1 of 2)
Look at the horizontal spread in the
histogram or dotplot:
• If all data values are similar: Narrow graph
• If data values are different: Wider graph
Describing Numerical Distributions:
Summary
Always remember to describe a numerical
distribution using these three components:
1. Shape (symmetric, skewed left/right, modes)
2. Center (typical value)
3. Variability (horizontal spread)
SECTION 2.3 VISUALIZING
VARIATION IN CATEGORICAL
VARIABLES
• Bar Charts
• Pie Graph
Visualizing Data: Bar Chart (1 of 2)
Note: We treat categorical variables similarly to numerical variables.

Bar Chart
Similar to a histogram, record data
categories along the horizontal axis with the
height corresponding to the frequency of the
data
Visualizing Data: Bar Chart (2 of 2)
Example:
Here is the bar chart for the class standing of
students interested in a UCLA statistics class.

Class Frequency
Unknown 7
First-year student 0
Sophomore 3
Junior 4
Senior 5
Graduate 1
Total 20
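A text version of this bar chart can be sketched from the frequency table. The bars are drawn sideways here for convenience (the slide places categories along the horizontal axis), and the names `class_counts` and `bar_chart_rows` are assumptions made for the example.

```python
# Class standing counts copied from the table.
class_counts = {
    "Unknown": 7,
    "First-year student": 0,
    "Sophomore": 3,
    "Junior": 4,
    "Senior": 5,
    "Graduate": 1,
}

def bar_chart_rows(counts):
    """Return one text row per category; the bar length equals the frequency."""
    label_width = max(len(category) for category in counts)
    return [f"{category:<{label_width}} | {'#' * n} {n}" for category, n in counts.items()]

for row in bar_chart_rows(class_counts):
    print(row)
```

Each category gets its own separate row, which mirrors the point that bars in a bar chart do not touch.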
Bar Chart vs. Histogram
Note: Bar charts and histograms are different!
Key differences:
                    Histogram                  Bar Chart
Bars:               May touch                  Do not touch
Bar width:          Corresponds to bin width   Can be any desired width (all the same)
Horizontal labels:  Numerical order            No inherent order

Note: A Pareto chart is a bar graph in which the bars are arranged from tallest to shortest. (This cannot necessarily be done in a histogram!)
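The Pareto ordering amounts to sorting the categories by frequency. A minimal sketch, using the class standing counts from the bar chart example (the variable names are assumptions):

```python
# Class standing counts from the bar chart example.
class_counts = {
    "Unknown": 7,
    "First-year student": 0,
    "Sophomore": 3,
    "Junior": 4,
    "Senior": 5,
    "Graduate": 1,
}

# Sort categories from the highest frequency to the lowest.
pareto_order = sorted(class_counts.items(), key=lambda item: item[1], reverse=True)

for category, count in pareto_order:
    print(category, count)
```

Reordering is allowed here because the categories have no inherent order. Histogram bins sit on a number line, so their order is fixed.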
Visualizing Data: Pie Chart (1 of 2)

Pie Chart
A circle divided into pieces (the area of each
piece is proportional to the relative frequency, or
percent, of the data in that piece)
Visualizing Data: Pie Chart (2 of 2)
Example:
Here is the pie chart for the class standing of
students interested in a UCLA statistics class.
Class Frequency
Unknown 7
First-year student 0
Sophomore 3
Junior 4
Senior 5
Graduate 1
Total 20
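Since the area of each piece is proportional to its relative frequency, the central angle of each slice is 360° times that relative frequency. A small sketch of the computation, with variable names assumed for the example:

```python
# Class standing counts copied from the table.
class_counts = {
    "Unknown": 7,
    "First-year student": 0,
    "Sophomore": 3,
    "Junior": 4,
    "Senior": 5,
    "Graduate": 1,
}

total = sum(class_counts.values())

# Relative frequency and central angle (in degrees) for each slice.
slices = {
    category: (n / total, 360 * n / total)
    for category, n in class_counts.items()
}

for category, (relative, angle) in slices.items():
    print(f"{category}: {relative:.0%} of the data, {angle:.0f} degrees")
```

A category with frequency 0, such as first-year students here, gets no slice at all, so a pie chart cannot show that the category exists.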
SECTION 2.4 SUMMARIZING
CATEGORICAL DISTRIBUTIONS

• Mode and Variability in Categorical Distributions
• Describing Categorical Distributions
Describing a Categorical Distribution

Recall:
To describe a numerical distribution, we record
shape, center, and spread.
Since categorical data has no inherent order, these
measures do not make sense for categorical data.
Two Main Components of a Categorical
Distribution:
1. Mode (typical, or most frequent, outcome)
2. Variability (or diversity in outcomes)
Mode
Mode: The category that occurs most frequently
Key difference in the mode for categorical and
numerical data:
– Numerical data: Modes do not need to be the same height.

– Categorical data: Modes must be roughly the same height.


We use the same wording as before:
– Unimodal: One distinct mode
– Bimodal: Two modes with same (or very close) frequency
– Multimodal: More than two modes with the same (or very close) frequency
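Finding the mode of a categorical variable can be sketched as below. The helper name `modes` is hypothetical, and the sketch only detects exact ties. Deciding whether two frequencies are "very close" remains a judgment call that the code does not attempt.

```python
def modes(counts):
    """Return every category whose frequency equals the largest frequency."""
    top = max(counts.values())
    return [category for category, n in counts.items() if n == top]

# Class standing counts from the bar chart example.
class_counts = {
    "Unknown": 7,
    "First-year student": 0,
    "Sophomore": 3,
    "Junior": 4,
    "Senior": 5,
    "Graduate": 1,
}

print(modes(class_counts))  # one category, so this distribution is unimodal
```

A list with two entries would indicate a bimodal distribution, and more than two a multimodal one.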
Mode: Example
In 2012, the Pew survey asked a new group of
2508 Americans which economic class they
identified with.

The mode is the middle class.
Variability (2 of 2)
Variability
Think of this as diversity in the data values
What to look for:
– High variation: Each value is represented with about
the same frequency (many observations in many
different categories).
– Low variation: A small number of values appear a
large number of times (many observations fall into a
few categories).
Caution! Variability here refers to observations being spread across many different values, not to the frequencies being different from one another.
Variability: Example
The bar charts below show the ethnic composition
of two schools in the Los Angeles City School
System.

School A has the greater variability in ethnicity.


SECTION 2.5 INTERPRETING
GRAPHS

• Making Appropriate
Graphs
• Misleading Graphs
Appropriate Graphs
Recall:
The type of data you are dealing with
determines the type of graph you use!

Numerical Data   Categorical Data
Dotplot          Pie Chart
Histogram        Bar Graph
Stemplot
Appropriate Measures
Recall:
The type of data you are dealing with
determines how you describe the distribution of
data!
Misleading Graphs
Well-designed graphs help us see patterns, but misleading graphs play tricks with our eyes and lead to wrong conclusions!
Watch For                                                   Example
Inappropriate scaling (starting at a value other than 0)    Figure 2.34
Using icons of different sizes rather than bars             Figure 2.35
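The effect of starting the vertical axis at a value other than 0 can be put into numbers. The values 50 and 55 and the axis start of 45 are hypothetical (they are not taken from Figure 2.34), and the function name `drawn_height_ratio` is made up for the example.

```python
def drawn_height_ratio(a, b, axis_start):
    """Ratio of the drawn height of bar b to bar a when the axis begins at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Hypothetical values: b is only 10% larger than a.
a, b = 50, 55

honest = drawn_height_ratio(a, b, axis_start=0)      # axis starts at 0
truncated = drawn_height_ratio(a, b, axis_start=45)  # axis starts at 45

print(f"Axis from 0:  bar b is drawn {honest:.1f} times as tall as bar a")
print(f"Axis from 45: bar b is drawn {truncated:.1f} times as tall as bar a")
```

With the truncated axis the second bar is drawn twice as tall even though the underlying value is only 10% larger, which is exactly the kind of visual trick to watch for.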
Case Study (1 of 2)
Question: Are private 4-year schools better than
public 4-year schools?
Note: Better can mean different things!

– Measure: Student-to-teacher ratio (one of many choices)


– Data type: Numerical (see table 2.1 in book)
– When: The 2010-2011 academic year
– Who: 89 private colleges and 49 public colleges
Case Study (2 of 2)
• Analysis of data:
– What can you say about this data?
– Can you answer the question of interest?
(See book for answer)
Section 2.2 Question 1
The dotplot shows the number of siblings for a
sample of 29 statistics students.

What percent of students had more than three siblings?
A. 2%
B. 5%
C. 6.7% ((2/30) × 100%)
D. 13.3%
Section 2.2 Question 2

The histograms show the distribution of monthly rents for a sample of studio apartments in two cities.

What is the shape of the distribution of rents for City B?

A. Skewed left B. Skewed right
C. Uniform D. Multi-modal
Section 2.2 Question 3
The histograms show the distribution of monthly
rents for a sample of studio apartments in two
cities.

Which city has a higher typical rent for a studio apartment?
A. City A B. City B
C. It cannot be determined from the histograms.
Section 2.2 Question 4

Data was collected on the ages of students at a community college. What shape would you expect the distribution of ages to have?
A. Bell-shaped
B. Uniform
C. Skewed left
D. Skewed right (more younger people, fewer older people)
Section 2.3 Question 1

Which of the following is a graph that can be used to display categorical data?
A. Histogram
B. Dot plot
C. Bar graph
D. All of the above
Section 2.2 Question 5

Which of the following graphs cannot be used to summarize numerical data?
A. Pie chart (used for categorical data)
B. Dot plot
C. Histogram
D. All of these can be used to summarize
numerical data.
Section 2.4 Question 1

Which of the following are used to describe categorical data?
A. The shape of the distribution
B. The mode of the distribution
C. The center of the distribution
D. All of these can be used
