
Session 2

The document discusses frequency and relative frequency tables for categorical data, emphasizing the use of bar charts, pie charts, and Pareto charts for visualization. It outlines best practices for displaying data, including respecting the area principle and avoiding misleading representations. Additionally, it provides examples of frequency distributions and their application in analyzing data sets.

Uploaded by

kinwad123

Frequency and Relative Frequency Tables

 The distribution of a categorical variable is a list of its values, each with its associated count (frequency)
 A frequency table summarizes the distribution of a categorical variable
 A relative frequency table shows the proportion (or percentage) in each category
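
As a rough illustration of these definitions, the sketch below builds a frequency and relative frequency table in Python. The host names and the ten sample visits are hypothetical, and `frequency_table` is an illustrative helper name, not something taken from the slides.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (count, proportion of all observations)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (count, count / total) for category, count in counts.items()}

# Hypothetical sample: the referring host for each of ten visits.
hosts = ["msn.com", "yahoo.com", "msn.com", "google.com", "msn.com",
         "yahoo.com", "google.com", "msn.com", "aol.com", "yahoo.com"]

table = frequency_table(hosts)
for category, (count, proportion) in table.items():
    # One row per category: name, frequency, relative frequency as a percentage.
    print(f"{category:12s} {count:3d} {proportion:6.1%}")
```

The proportions always sum to 1, which is a quick check that no observation was dropped.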
Visualizing Categorical Data

Bar Charts

Pie Charts

Pareto Charts
Bar Chart
 Uses horizontal or vertical bars to show the distribution of
a categorical variable.
 The bars have lengths proportional to the values that they
represent.
 The length of the bar indicates the size of the group defined
by the column label.
 A bar chart can display a count or other value for each category, whether the
underlying values are discrete or continuous.
 Bar charts resemble histograms, and the two are often mistaken for each other:
a bar chart shows categories, while a histogram shows classes of numerical data.
Which hosts send the most visitors to
Amazon’s Web site?

 Data set consists of 188,996 visits

 Host is a categorical variable

 To answer this question we must describe the variation in Host
Bar Chart (Horizontal) of Top 10 Hosts
Bar Chart (Vertical) of Top 10
Hosts
The Pie Chart
 Uses wedges of a circle to show the distribution of a
categorical variable.
 The arc length of each segment is proportional to the
quantity it represents.
 Commonly chosen to illustrate market shares or sources of
revenue for a company
 Less useful than bar charts if we want to compare actual
counts (easier to compare bars than angles of wedges).
 Statisticians generally consider pie charts as a poor method
of displaying information, and they are uncommon in
scientific literature.
Pie Chart of Top 10 Hosts
The Pareto Chart
 Used to portray categorical data (nominal scale)
 A vertical bar chart, where categories are shown in
descending order of frequency
 A cumulative polygon is shown in the same graph
 Used to separate the “vital few” from the “trivial many”
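
One possible way to prepare the numbers behind a Pareto chart is sketched below: sort the categories in descending order of frequency, then accumulate the percentages for the cumulative polygon. The visit counts are hypothetical and `pareto_table` is an illustrative name.

```python
def pareto_table(counts):
    """Return (category, count, cumulative percent) rows, largest count first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    rows, running = [], 0
    for category, count in ordered:
        running += count  # running total feeds the cumulative polygon
        rows.append((category, count, 100 * running / total))
    return rows

# Hypothetical visit counts per referring host (not the slide's data).
visits = {"aol.com": 12, "msn.com": 45, "ebay.com": 8, "yahoo.com": 30, "other": 5}

pareto_rows = pareto_table(visits)
for category, count, cumulative in pareto_rows:
    print(f"{category:10s} {count:4d} {cumulative:6.1f}%")
```

In this made-up sample the first two hosts already account for 75% of visits, which is the kind of "vital few" pattern the chart is meant to expose.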
Pareto Chart of Top 10 Hosts

[Figure: Pareto chart of the top 10 hosts. Bars show visits per host (axis from 0 to 8,000); a line shows the cumulative percentage (axis from 0% to 120%).]
Example: ROLLING OVER
 Question:
 Are certain types of vehicles more prone to roll-over
accidents than others?

 Method:
 Data gathered from the Fatality Analysis Reporting System (FARS) for
roll-over accidents on interstate highways.
 Cases that make up the rows are accidents resulting in roll-overs in 2000.
 The column of interest is model of the car involved.
Frequency table
Bar Graph
Inference

 Ford Broncos were involved in more than twice as many roll-over accidents as the next-closest model.
Example: Selling Smartphones to Businesses

 Question:
Apple, Google and Research in Motion (RIM) aggressively
compete to sell their smartphones to businesses. RIM has
dominated with its Blackberry line, but has that success held
up to the intense competition from Apple and Google?
Example: Selling Smartphones to Businesses

 Inference

 Corporate customers are purchasing more iPhones and Android phones for managers.
 From 2010 to 2011, Blackberry sales grew less than sales of
iPhones and Android phones.
 While RIM still had the largest share of the market in 2011,
it had decreased to less than 50%.
The Area Principle
 The Fundamental Rule for Data Displays

 The area occupied by a part of the graph/chart that displays data should be
proportional to the amount of data it represents

 Charts decorated to attract attention often violate the area principle
An Example Violating the Area Principle
The Same Example Respecting the Area
Principle
Best Practices
 Use a bar chart to show the frequencies of a categorical
variable.

 Use a pie chart to show the proportions of a categorical variable.

 Keep the baseline of a bar chart at zero.

 Preserve the ordering of an ordinal variable.


Best Practices (Continued)

 Respect the area principle.

 Show the best plots to answer the motivating question.

 Label your chart to show the categories and indicate whether some have been
combined or omitted.
Pitfalls
 Avoid elaborate plots that may be deceptive.

 Do not show too many categories.

 Do not put ordinal data in a pie chart.

 Do not carelessly round data.


Tables Used For Organizing Numerical Data

Ordered Array

Frequency Distributions

Cumulative Distributions
Stacked Or Unstacked Format
 This is an issue when you have a categorical variable that
may be used to group your numerical variable for analysis.

 Stacked format is when your numerical variable is in one column and a second
column identifies the value of the categorical variable.

 Unstacked format is when the values of the numerical variable in each group
(unique value of the categorical variable) are in different columns.
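
The difference between the two layouts can be sketched in Python as follows. The rows are a small hypothetical sample, and `unstack` and `stack` are illustrative helper names rather than functions from any particular package.

```python
from collections import defaultdict

def unstack(rows):
    """Stacked (group, value) rows -> one list of values per group."""
    columns = defaultdict(list)
    for group, value in rows:
        columns[group].append(value)
    return dict(columns)

def stack(columns):
    """One list per group -> stacked (group, value) rows."""
    return [(group, value) for group, values in columns.items() for value in values]

# Hypothetical stacked data: the group label sits in one column, the age in another.
stacked = [("Day", 16), ("Night", 18), ("Day", 17), ("Night", 19), ("Day", 18)]

unstacked = unstack(stacked)
print(unstacked)         # one "column" of ages per group
print(stack(unstacked))  # back to stacked rows, grouped by label
```

Converting back with `stack` keeps every observation but orders the rows by group, so the original row order is not preserved.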
Ordered Array
 An ordered array is a sequence of data, in rank order,
from the smallest value to the largest value.
 Shows range (minimum value to maximum value).
 May help identify outliers (unusual observations).

Example
Age of Surveyed College Students

Day Students
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42

Night Students
18 18 19 19 20 21
23 28 32 33 41 45
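
A minimal sketch of building an ordered array, using the night students' ages from this example. The shuffled input order is an assumption made only to show the sorting step.

```python
# Night students' ages from the example, deliberately shuffled for illustration.
night_ages = [23, 18, 45, 19, 32, 18, 41, 20, 33, 19, 28, 21]

ordered = sorted(night_ages)                # the ordered array, smallest to largest
data_range = ordered[-1] - ordered[0]       # maximum minus minimum
gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]

print(ordered)
print("range:", data_range)
print("largest gap between successive values:", max(gaps))
```

A large gap between successive values (here between 33 and 41) is one simple hint that the upper values may deserve a closer look as possible outliers.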
Arranging Data Using Data Array

 Advantages:

Lowest and highest values can be identified easily

Data can be divided in sections

Repetition of any observation can be recognized

Distance between two successive data points can be observed
Organizing Numerical Data: Frequency
Distribution
 The frequency distribution is a summary table in which the data are arranged
into numerically ordered classes.

 You must give attention to selecting the appropriate number of class groupings
for the table, determining a suitable width of a class grouping, and establishing
the boundaries of each class grouping to avoid overlapping.

 The number of classes depends on the number of values in the data. With a
larger number of values, typically there are more classes. In general, a
frequency distribution should have at least 5 but no more than 15 classes.

 To determine the width of a class interval, you divide the range (Highest value–
Lowest value) of the data by the number of class groupings desired.
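
The width rule can be sketched as below. Rounding the result up to a whole number is an assumption made here for convenience, and `class_width` is an illustrative name.

```python
import math

def class_width(values, num_classes):
    """Range (highest minus lowest) divided by the number of classes, rounded up."""
    data_range = max(values) - min(values)
    return math.ceil(data_range / num_classes)

# Only the lowest and highest values matter for the range.
print(class_width([52, 109], 6))  # range 57 over 6 classes
print(class_width([12, 58], 5))   # range 46 over 5 classes
```

When the range divides evenly by the number of classes, the highest value lands exactly on the last boundary, so an extra class or a slightly wider width may be needed.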
Why Use a Frequency Distribution?

 It condenses the raw data into a more useful form


 It allows for a quick visual interpretation of the data
 It enables the determination of the major
characteristics of the data set including where the
data are concentrated / clustered
Frequency Distributions: Some Tips
 Different class boundaries may provide different pictures for the
same data (especially for smaller data sets)

 Shifts in data concentration may show up when different class
boundaries are chosen

 As the size of the data set increases, the impact of alterations in
the selection of class boundaries is greatly reduced

 When comparing two or more groups with different sample sizes,
you must use either a relative frequency or a percentage
distribution
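
To illustrate the last tip, the sketch below compares the 18 day students and 12 night students from the earlier age example. Grouping ages by decade (10-19, 20-29, and so on) is an assumption made for this sketch, and `percent_by_decade` is an illustrative name.

```python
from collections import Counter

def percent_by_decade(ages):
    """Percentage of observations in each ten-year class, keyed by class start."""
    counts = Counter(age // 10 * 10 for age in ages)
    return {start: 100 * counts[start] / len(ages) for start in sorted(counts)}

day = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22, 22, 25, 27, 32, 38, 42]
night = [18, 18, 19, 19, 20, 21, 23, 28, 32, 33, 41, 45]

day_pct = percent_by_decade(day)
night_pct = percent_by_decade(night)
for start in day_pct:
    print(f"{start}-{start + 9}: day {day_pct[start]:5.1f}%  night {night_pct[start]:5.1f}%")
```

Raw counts would suggest more day students than night students in the 10-19 class simply because the day sample is larger; the percentages put both groups on the same scale.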
Example: Hudson Auto Repair
The manager of Hudson Auto
would like to have a better
understanding of the cost
of parts used in the engine
tune-ups performed in the
shop. She examines 50
customer invoices for tune-ups. The costs of parts,
rounded to the nearest dollar, are listed on the next
slide.
Example: Hudson Auto Repair

Sample of Parts Cost for 50 Tune-ups

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Tabular Summary:
Frequency and Percent Frequency

Parts Cost ($)    Frequency    Percent Frequency
50-59                 2                4
60-69                13               26
70-79                16               32
80-89                 7               14
90-99                 7               14
100-109               5               10
Total                50              100

Percent frequency = (frequency/50) × 100; for example, (2/50)100 = 4 for the 50-59 class.
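
The tabular summary can be reproduced from the 50 parts costs with a short tally, sketched below. The function name `frequency_distribution` and its parameters are illustrative choices; the class limits (50-59 through 100-109) follow the table.

```python
def frequency_distribution(values, lower, width, num_classes):
    """Count how many values fall in each class of equal width, starting at lower."""
    counts = [0] * num_classes
    for value in values:
        counts[(value - lower) // width] += 1
    return counts

# Parts costs for the 50 tune-ups, rounded to the nearest dollar.
costs = [91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
         71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
         104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
         85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
         62, 82, 98, 101, 79, 105, 79, 69, 62, 73]

counts = frequency_distribution(costs, lower=50, width=10, num_classes=6)
percents = [100 * count / len(costs) for count in counts]

for i, (count, percent) in enumerate(zip(counts, percents)):
    low = 50 + 10 * i
    print(f"{low}-{low + 9}: {count:2d}  {percent:4.0f}%")
```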
Given here are the marks scored by students in a
Mathematics test. Prepare a frequency distribution with
class width 10. Also draw a histogram, ogive and
frequency polygon for the data.

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
 Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
 Find range: 58 - 12 = 46
 Select number of classes: 5 (usually between 5 and 15)
 Compute class interval (width): 10 (46/5 then round up)
 Determine class boundaries (limits):
 Class 1: 10 to under 20
 Class 2: 20 to under 30
 Class 3: 30 to under 40
 Class 4: 40 to under 50
 Class 5: 50 to under 60
 Compute class midpoints: 15, 25, 35, 45, 55
 Count observations & assign to classes
Frequency Distribution of Marks

Marks    Frequency    Relative Frequency    Percent Frequency    Cumulative Percent Frequency
10-19        3              0.15                   15                       15%
20-29        6              0.30                   30                       45%
30-39        5              0.25                   25                       70%
40-49        4              0.20                   20                       90%
50-59        2              0.10                   10                      100%
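
The whole worked example can be checked with a few lines of Python, sketched below under the assumption that each class runs from its lower limit up to, but not including, the next one. The variable names are illustrative.

```python
from itertools import accumulate

marks = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
lower, width, num_classes = 10, 10, 5

# Tally each mark into its class: 10 to under 20, 20 to under 30, and so on.
counts = [0] * num_classes
for mark in marks:
    counts[(mark - lower) // width] += 1

relative = [count / len(marks) for count in counts]
cumulative_percent = [100 * running / len(marks) for running in accumulate(counts)]
midpoints = [lower + width * i + width / 2 for i in range(num_classes)]

for i in range(num_classes):
    print(f"midpoint {midpoints[i]:4.0f}: frequency {counts[i]}, "
          f"relative {relative[i]:.2f}, cumulative {cumulative_percent[i]:.0f}%")
```

The cumulative percentages are the heights of the ogive, and the frequencies plotted at the class midpoints give the frequency polygon.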
