Live Session
Module #1
DS510
STATISTICS FOR DATA SCIENCE
Instructor Name
Module #1 Learning Outcomes
1. Discuss the data analysis function in an
organization.
2. Plan the solution of a business problem using
analytics.
3. Examine the audiences, reporting needs, and
dissemination channels for the findings from analyses
in supporting data-driven decision making.
4. Perform univariate distribution analysis of a
numerical variable.
Chapter 1 The Role of Statistics and the Data Analysis Process
Copyright © Cengage Learning. All rights reserved.
Section 1.1 Why Study Statistics?
Why Study Statistics? (1 of 5)
There is an old saying that “without data, you are just
another person with an opinion.”
It is challenging to function in today’s world without a basic
understanding of statistics. For example, how many people
does it take to build an airplane? The article “Boeing
Delivers Records” (The Wall Street Journal, January 10,
2018) looked at the ways in which Boeing has been able to
increase airplane production and introduce new airplane
models.
Why Study Statistics? (2 of 5)
The article states that “Boeing has boosted output by two-
thirds over the past seven years but cut the average number
of employees needed to build each plane.”
This statement was supported by a graph that showed the
number of employees per jet airplane produced over time. In
2017, this number was at its lowest, with 94 employees per
jet produced.
Why Study Statistics? (3 of 5)
To be an informed consumer of reports such as those
described above, you must be able to do the following:
1. Extract information from tables, charts, and graphs.
2. Follow numerical arguments.
3. Understand the basics of how data should be gathered,
summarized, and analyzed in order to draw statistical
conclusions.
Your statistics course will help prepare you to perform these
tasks.
Why Study Statistics? (4 of 5)
Studying statistics will also enable you to collect data in a
sensible way and then use the data to answer questions of
interest. In addition, studying statistics will allow you to
critically evaluate the work of others by providing you with
the tools you need to make informed judgments.
Throughout your personal and professional life, you will
need to understand and use data to make decisions. To do
this, you must be able to
1. Decide whether existing data are adequate or whether
additional information is required.
Why Study Statistics? (5 of 5)
2. If necessary, collect more information in a reasonable
and thoughtful way.
3. Summarize the available data in a useful and informative
manner.
4. Analyze the available data.
5. Draw conclusions, make decisions, and assess the risk
of an incorrect decision.
These are the steps in the data analysis process.
Section 1.2 The Nature and Role of Variability
The Nature and Role of Variability (1 of 2)
Statistical methods allow us to collect, describe, analyze,
and draw conclusions from data.
We need to understand variability to be able to collect,
describe, analyze, and draw conclusions from data in a
sensible way.
Example 1.2 – Monitoring Water Quality (1 of 5)
As part of its regular water quality monitoring efforts, an
environmental control board selects five containers of water
from a particular well each day.
The concentration of contaminants in parts per million (ppm)
is measured for each of the five containers, and then the
average of the five measurements is calculated.
Example 1.2 – Monitoring Water Quality (2 of 5)
The histogram in the following figure summarizes the
average contamination values for 200 days.
Figure 1.2: Average contamination concentration (in ppm) measured each day for 200 days.
Example 1.2 – Monitoring Water Quality (3 of 5)
Now suppose that a chemical spill has occurred at a
manufacturing plant 1 mile from the well. It is not known
whether a spill of this nature would contaminate
groundwater in the area of the spill and, if so, whether a spill
this distance from the well would affect the quality of well
water.
One month after the spill, five containers of water are
collected from the well, and the average contamination is
15.5 ppm.
Example 1.2 – Monitoring Water Quality (4 of 5)
Considering the variability before the spill, should we interpret
this as evidence that the well water was affected by the spill?
What if the calculated average was 17.4 ppm? 22.0 ppm?
How is the reasoning related to the histogram in Figure 1.2?
Before the spill, the average contaminant concentration varied
from day to day. An average of 15.5 ppm would not have
been an unusual value, so seeing an average of 15.5 ppm
after the spill isn’t necessarily an indication that contamination
has increased.
Example 1.2 – Monitoring Water Quality (5 of 5)
On the other hand, an average as large as 17.4 ppm is less
common, and an average as large as 22.0 ppm is not at all
typical of the pre-spill values. In this case, we would
probably conclude that the well contamination level has
increased.
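The reasoning in this example can be sketched in a few lines of Python. The 200 daily averages behind Figure 1.2 are not reproduced on these slides, so the list below holds made-up placeholder values chosen only for illustration, and the helper name share_at_least is likewise hypothetical. The idea is simply to ask what fraction of pre-spill averages were at least as large as the new measurement.

```python
# Hypothetical pre-spill daily averages in ppm. These are made-up placeholder
# values for illustration, NOT the 200 values summarized in Figure 1.2.
pre_spill_averages = [
    12.1, 13.4, 14.0, 14.8, 15.2, 15.5, 15.9, 16.1, 16.4, 17.0,
    13.9, 14.5, 15.0, 15.3, 15.7, 16.0, 16.6, 14.2, 15.8, 17.6,
]


def share_at_least(history, value):
    """Return the fraction of historical values that are >= value."""
    return sum(1 for x in history if x >= value) / len(history)


# The three post-spill averages discussed in the example.
for observed in (15.5, 17.4, 22.0):
    share = share_at_least(pre_spill_averages, observed)
    print(f"{observed} ppm: {share:.0%} of pre-spill averages were at least this large")
```

A large share means the new value is ordinary given the usual day-to-day variability; a share near zero marks the value as unusual and suggests that something may have changed.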
The Nature and Role of Variability (2 of 2)
Understanding variability allows us to distinguish between
common and unusual values.
The ability to recognize unusual values in the presence of
variability is an important aspect of most statistical
procedures. It also enables us to quantify the chance of
being incorrect when a conclusion is based on data.
Section 1.3 Statistics and the Data Analysis Process
Statistics and the Data Analysis Process (1 of 7)
The data analysis process can be viewed as a sequence of
steps that lead from planning to data collection to making
informed conclusions based on the resulting data. The
process can be organized into the six steps described
below.
The Data Analysis Process
1. Understanding the nature of the problem. Effective
data analysis requires an understanding of the research
problem. We must know the goal of the research and
what questions we hope to answer.
Statistics and the Data Analysis Process (2 of 7)
It is important to have a clear direction before gathering
data to ensure that we will be able to answer the
questions of interest using the data collected.
2. Deciding what to measure and how to measure it.
The next step in the process is deciding what information
is needed to answer the questions of interest. In some
cases, the choice is obvious.
Statistics and the Data Analysis Process (3 of 7)
3. Data collection. The data collection step is very
important. The researcher must first decide whether an
existing data source is adequate or whether new data
must be collected.
If a decision is made to use existing data, it is important
to understand how the data were collected and for what
purpose, so that any resulting limitations are also fully
understood.
Statistics and the Data Analysis Process (4 of 7)
If new data are to be collected, a careful plan must be
developed, because the type of analysis that is
appropriate and the conclusions that can be drawn
depend on how the data are collected.
4. Data summarization and preliminary analysis. After
the data are collected, the next step is usually a
preliminary analysis that includes summarizing the data
graphically and numerically.
Statistics and the Data Analysis Process (5 of 7)
This initial analysis provides insight into important
characteristics of the data and provides guidance in
selecting appropriate methods for further analysis.
5. Formal data analysis. The data analysis step requires
the researcher to select appropriate statistical methods.
6. Interpretation of results. The interpretation step often
leads to the formulation of new research questions.
These new questions lead back to the first step. In this
way, good data analysis is often an iterative process.
Statistics and the Data Analysis Process (6 of 7)
DEFINITIONS
Population: The entire collection of individuals or objects
about which information is desired is called the population
of interest.
Sample: A sample is a subset of the population, selected
for study.
Statistics and the Data Analysis Process (7 of 7)
DEFINITIONS
Descriptive statistics: The branch of statistics that
includes methods for organizing and summarizing data.
Inferential statistics: The branch of statistics that involves
generalizing from a sample to the population from which
the sample was selected and assessing the reliability of
such generalizations.
Example 1.3 – Chew More, Eat Less? (1 of 4)
The article “Increasing the Number of Chews before
Swallowing Reduces Meal Size in Normal-Weight,
Overweight, and Obese Adults” ( Journal of the Academy
of Nutrition and Dietetics [2014]: 926–931) describes a
study that investigated whether chewing each bite of food
more before swallowing would result in people eating less.
Participants in the study were adults between the ages of 18
and 45 years. At the beginning of the study, each participant
was observed eating five pizza rolls, and the number of chews
made before swallowing was recorded in order to determine
a baseline for that participant.
Example 1.3 – Chew More, Eat Less? (2 of 4)
Participants were then invited back for a second session on
a different day. They were asked to eat their usual breakfast
on that day and to not eat anything after breakfast. At the
second session, all participants were provided with a platter
of pizza rolls and were told to eat until they were
comfortably full.
They were also told they could request more pizza rolls if
they wanted more. Each participant was also told how many
times to chew each pizza roll before swallowing. Then, each
participant was assigned to one of three groups.
Example 1.3 – Chew More, Eat Less? (3 of 4)
The participants in group 1 were given a number of chews
equal to their baseline. The participants in group 2 were
given a number of chews that was 150% of (one and a half
times as large as) their baseline. The participants in group 3
were assigned a number of chews that was 200% of (twice
as large as) their baseline.
After analyzing data from this study, the researcher
concluded that people ate about 10% less when they
increased the number of chews by 50% (group 2) and about
15% less when they doubled the number of chews.
Example 1.3 – Chew More, Eat Less? (4 of 4)
This study illustrates the nature of the data analysis
process. A clearly defined research question and an
appropriate choice of how to measure the variables of
interest (the number of chews and how much people ate)
preceded the data collection.
Section 1.4 Types of Data and Some Simple Graphical Displays
Types of Data
Types of Data (1 of 5)
The individuals or objects in any particular population might
possess many characteristics that could be studied.
Consider a group of students currently enrolled in a
statistics class. One characteristic of the students in the
population is the brand of calculator they use (Casio,
Hewlett-Packard, Sharp, Texas Instruments, and so on).
Another characteristic is the number of textbooks purchased
that semester, and yet another is the distance from the
college to each student’s home.
Types of Data (2 of 5)
DEFINITIONS
Variable: A characteristic whose value may change from
one observation to another.
Data: A collection of observations on one or more
variables.
A univariate data set consists of observations on a single
variable made on individuals in a sample or population. There
are two types of univariate data sets: categorical and
numerical.
Types of Data (3 of 5)
DEFINITIONS
Categorical data set: A univariate data set is categorical
(or qualitative) if the individual observations are
categorical responses.
Numerical data set: A univariate data set is numerical (or
quantitative) if each observation is a number.
Example 1.4 – College Choice Do-Over? (1 of 2)
The Higher Education Research Institute at UCLA surveys
over 20,000 college seniors each year. One question on the
survey asks seniors the following question: If you could make
your college choice over, would you still choose to enroll at
your current college?
Possible responses are definitely yes (DY), probably yes
(PY), probably no (PN), and definitely no (DN). Responses
for 20 students were:
DY PN DN DY PY PY PN PY PY DY
DY PY DY DY PY PY DY DY PN DY
Example 1.4 – College Choice Do-Over? (2 of 2)
Because the response to the question about college choice
is categorical, this is a univariate categorical data set.
Types of Data (4 of 5)
In the previous example, the data set consists of
observations on a single variable (college choice response),
so this is a univariate data set. In some studies, we are
interested in two different characteristics.
For example, both height (in inches) and weight (in pounds)
might be recorded for each person on a basketball team.
The resulting data set consists of pairs of numbers, such as
(74, 185). This is called a bivariate data set.
Types of Data (5 of 5)
Multivariate data result from recording a value for each of
two or more attributes (so bivariate data are a special case
of multivariate data).
For example, multivariate data would result from
determining height, weight, pulse rate, and blood pressure
for each person on a basketball team.
Two Types of Numerical Data
Two Types of Numerical Data (1 of 2)
There are two different types of numerical data: discrete
and continuous.
DEFINITION
Discrete numerical variable: A numerical variable results
in discrete data if the possible values of the variable
correspond to isolated points on the number line.
Continuous numerical variable: A numerical variable
results in continuous data if the set of possible values
forms an entire interval on the number line.
Two Types of Numerical Data (2 of 2)
Discrete data usually arise when observations are determined
by counting (for example, the number of roommates a student
has or the number of petals on a flower).
Example 1.4 – Do U Txt? (1 of 2)
The number of text messages sent on a particular day is
recorded for each of 12 students. The resulting data set is
23 0 14 13 15 0 60 82 0 40 41 22
Possible values for the variable number of text messages
sent are 0, 1, 2, 3,… . These are isolated points on the
number line, so this data set consists of discrete numerical
data.
Example 1.4 – Do U Txt? (2 of 2)
Suppose that instead of the number of text messages sent,
the time spent texting had been recorded. Even though time
spent may have been reported rounded to the nearest
minute, the actual time spent could have been 6 minutes,
6.2 minutes, 6.28 minutes, or any other value in an entire
interval.
So, recording values of time spent texting would result in
continuous data.
Frequency Distributions and Bar Charts
for Categorical Data
Frequency Distributions and Bar Charts for Categorical Data (1 of 4)
When a data set is categorical, a common way to present
the data is in the form of a table, called a frequency
distribution.
Frequency Distributions and Bar Charts for Categorical Data (2 of 4)
DEFINITION
Frequency distribution for categorical data: A table that
displays the possible categories along with the associated
frequencies and/or relative frequencies.
Frequency: The frequency for a particular category is the
number of times the category appears in the data set.
Relative frequency: The relative frequency for a
particular category is calculated as
relative frequency = frequency / number of observations in the data set
Frequency Distributions and Bar Charts for Categorical Data (3 of 4)
DEFINITION
The relative frequency for a particular category is the
proportion of the observations that belong to that category.
Relative frequency distribution: A frequency distribution
that includes relative frequencies.
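As a minimal sketch of these definitions, the following Python snippet builds a frequency distribution and a relative frequency distribution for the 20 college-choice responses listed in the College Choice Do-Over example. The variable names are illustrative choices, not part of the course material.

```python
from collections import Counter

# The 20 survey responses from the College Choice Do-Over example.
responses = (
    "DY PN DN DY PY PY PN PY PY DY "
    "DY PY DY DY PY PY DY DY PN DY"
).split()

# Frequency: how many times each category appears in the data set.
frequency = Counter(responses)

# Relative frequency: frequency divided by the number of observations.
n = len(responses)
relative_frequency = {category: count / n for category, count in frequency.items()}

print("Category  Frequency  Relative frequency")
for category in ("DY", "PY", "PN", "DN"):
    print(f"{category:<8}  {frequency[category]:>9}  {relative_frequency[category]:>18.2f}")
```

The relative frequencies always sum to 1 (up to rounding), which is a quick check that no observation was dropped or counted twice.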
Example 1.4 – Motorcycle Helmets—Can You See Those Ears? (1 of 4)
The U.S. Department of Transportation establishes
standards for motorcycle helmets. To ensure a certain
degree of safety, helmets should reach the bottom of the
motorcyclist’s ears.
The report “Motorcycle Helmet Use in 2014—Overall
Results” (National Highway Traffic Safety
Administration, January 2015) summarized data collected
in 2014 by observing 806 motorcyclists nationwide at
selected roadway locations.
Example 1.4 – Motorcycle Helmets—Can You See Those Ears? (2 of 4)
Each time a motorcyclist passed by, the observer noted
whether the rider was wearing no helmet, a noncompliant
helmet, or a compliant helmet. The following coding was
used:
N = no helmet
NH = noncompliant helmet
CH = compliant helmet
A few of the observations were
CH N CH NH N CH CH CH N N
Example 1.4 – Motorcycle Helmets—Can You See Those Ears? (3 of 4)
There were also 796 additional observations, which we didn’t
reproduce here. In total, there were 250 riders who wore no
helmet, 40 who wore a noncompliant helmet, and 516 who
wore a compliant helmet.
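The corresponding frequency distribution is given in Table 1.1.
Table 1.1 Frequency Distribution for Helmet Use
Helmet Use Category      Frequency   Relative Frequency
No helmet                      250                0.310
Noncompliant helmet             40                0.050
Compliant helmet               516                0.640
Total                          806                1.000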
Example 1.4 – Motorcycle Helmets—Can You See Those Ears? (4 of 4)
From the frequency distribution, we can see that a large
percentage of riders (31%) were not wearing a helmet, but
most of those who wore a helmet were wearing one that met
the Department of Transportation safety standard.
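The percentages quoted here follow directly from the three counts given in the example (250, 40, and 516). A short Python check, with illustrative variable names, might look like this:

```python
# Counts reported in the helmet-use example.
helmet_counts = {
    "No helmet": 250,
    "Noncompliant helmet": 40,
    "Compliant helmet": 516,
}

total = sum(helmet_counts.values())

# Relative frequency of each category among all observed riders.
relative = {category: count / total for category, count in helmet_counts.items()}

# Among riders who wore some helmet, the share wearing a compliant one.
helmeted = helmet_counts["Noncompliant helmet"] + helmet_counts["Compliant helmet"]
compliant_share_of_helmeted = helmet_counts["Compliant helmet"] / helmeted

print(f"Riders observed: {total}")
for category, value in relative.items():
    print(f"{category}: {value:.1%}")
print(f"Compliant among helmet wearers: {compliant_share_of_helmeted:.1%}")
```

The first figure reproduces the 31% of riders with no helmet; the last shows that the large majority of helmet wearers had a compliant helmet.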
Frequency Distributions and Bar Charts for Categorical Data (4 of 4)
A frequency distribution summarizes a data set in a table. It
is also common to display categorical data graphically. A bar
chart is one of the most widely used types of graphical
displays for categorical data.
Bar Charts
Bar Charts (1 of 3)
A bar chart is a graph of a frequency distribution for
categorical data.
Each category in the frequency distribution is represented
by a bar or rectangle, and the picture is constructed in such
a way that the area of each bar is proportional to the
corresponding frequency or relative frequency.
Bar Charts (2 of 3)
Bar Charts
When to Use: Categorical data.
How to Construct
1. Draw a horizontal axis, and write the category names or
labels below the line at equally spaced intervals.
2. Draw a vertical axis, and label the scale using either
frequency or relative frequency.
Bar Charts (3 of 3)
3. Place a rectangular bar above each category label. The
height is determined by the category’s frequency or
relative frequency, and all bars should have the same
width. With the same width, both the height and the area
of the bar are proportional to frequency and relative
frequency.
What to Look For
• Frequently and infrequently occurring categories.
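The construction steps can be sketched with a text-only bar chart in Python, using the helmet-use counts from the motorcycle helmet example. This is a rough illustration rather than a plotting recipe: the bars are drawn horizontally instead of vertically, the function name text_bar_chart is hypothetical, and the scale of 10 riders per symbol is an arbitrary choice. Every bar occupies one text row, so all bars have the same width and their lengths are proportional to frequency.

```python
# Counts from the motorcycle helmet example.
helmet_counts = {
    "No helmet": 250,
    "Noncompliant helmet": 40,
    "Compliant helmet": 516,
}


def text_bar_chart(counts, per_symbol=10):
    """Return one text line per category; each '#' stands for per_symbol observations."""
    label_width = max(len(label) for label in counts)
    lines = []
    for label, count in counts.items():
        bar = "#" * round(count / per_symbol)
        lines.append(f"{label:<{label_width}} | {bar} ({count})")
    return lines


chart = text_bar_chart(helmet_counts)
print("\n".join(chart))
```

Reading the output, the frequently occurring category (compliant helmet) and the infrequently occurring one (noncompliant helmet) stand out immediately.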
Example 1.4 – Revisiting Motorcycle Helmets (1 of 3)
Example 1.7 used data on helmet use from a sample of 806
motorcyclists to construct a frequency distribution.
Table 1.1 Frequency Distribution for Helmet Use
Example 1.4 – Revisiting Motorcycle Helmets (2 of 3)
Figure 1.5 shows the bar chart corresponding to this frequency
distribution.
Figure 1.5: Bar chart of helmet use.
Example 1.4 – Revisiting Motorcycle Helmets (3 of 3)
The bar chart provides a visual representation of the
information in the frequency distribution. From the bar
chart, it is easy to see that the compliant helmet use
category occurred most often in the data set.
The bar for compliant helmets is about twice as tall as the
bar for no helmet because approximately twice as many
motorcyclists wore compliant helmets as wore no
helmet.
Dotplots for Numerical Data
Dotplots for Numerical Data (1 of 3)
A dotplot is a simple way to display numerical data when the
data set is reasonably small. Each observation is
represented by a dot above the location corresponding to its
value on a horizontal measurement scale.
When a value occurs more than once, there is a dot for
each occurrence and these dots are stacked vertically.
Dotplots for Numerical Data (2 of 3)
Dotplots
When to Use: Small numerical data sets.
How to Construct
1. Draw a horizontal line and mark it with an appropriate
measurement scale.
2. Locate each value in the data set along the
measurement scale, and represent it by a dot. If a value
occurs more than once, stack the dots vertically.
Dotplots for Numerical Data (3 of 3)
Dotplots convey information about:
• A representative or typical value in the data set.
• The extent to which the data values vary.
• The shape of the distribution of values along the number
line.
• The presence of unusual values in the data set.
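To make the construction concrete, here is a minimal text-only dotplot in Python for the 12 text-message counts from the Do U Txt? example. The function name text_dotplot and the one-character-per-integer scale are assumptions made for this sketch; statistical software such as Minitab would normally draw the plot.

```python
from collections import Counter

# Number of text messages sent by each of 12 students (Do U Txt? example).
texts_sent = [23, 0, 14, 13, 15, 0, 60, 82, 0, 40, 41, 22]


def text_dotplot(values):
    """Return text rows: stacked dots, then an axis line, then the end labels."""
    counts = Counter(values)
    low, high = min(values), max(values)
    width = high - low + 1  # one character column per integer value
    rows = []
    # Build rows from the tallest stack down to the bottom row of dots.
    for level in range(max(counts.values()), 0, -1):
        rows.append(
            "".join("o" if counts[v] >= level else " " for v in range(low, high + 1))
        )
    axis = "-" * width
    labels = f"{low:<{width - len(str(high))}}{high}"
    return rows + [axis, labels]


plot = text_dotplot(texts_sent)
print("\n".join(plot))
```

In the output, the stack of three dots at 0 shows the repeated value, the spread from 0 to 82 shows how much the values vary, and the isolated dots at 60 and 82 stand apart from the rest.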
Example 1.4 – Making It to Graduation . . . (1 of 11)
The article “Keeping Score When It Counts: Graduation
Success and Academic Progress Rates for the 2016
NCAA Men’s Division I Basketball Tournament Teams”
(The Institute for Diversity and Ethics in Sport,
University of Central Florida, March 2016) compared
graduation rates of basketball players to those of all student
athletes for the universities and colleges that sent teams to
the 2016 Division I playoffs. The graduation rates in the
accompanying table are the percentages of athletes who
started college in 2005, 2006, 2007, or 2008 and
graduated within 6 years.
Example 1.4 – Making It to Graduation . . . (2 of 11)
Minitab, a computer software package for statistical
analysis, was used to construct a dotplot of the graduation
rates for basketball players.
Figure 1.6: Minitab dotplot of graduation rates for basketball players.
Example 1.4 – Making It to Graduation . . . (3 of 11)
From this dotplot, we see that basketball graduation rates
varied a great deal from school to school, ranging from a
low of 20% to a high of 100%.
We can also see that the graduation rates seem to cluster in
several groups, denoted by the colored ovals that have
been added to the dotplot. There are 11 schools with
graduation rates of 100% (excellent!).
Example 1.4 – Making It to Graduation . . . (4 of 11)
The majority of schools are in the large cluster with
graduation rates from about 62% to about 96%.
And then there is a group of 12 schools with low graduation
rates for basketball players and one school with an
unusually low graduation rate of 20%.
Example 1.4 – Making It to Graduation . . . (5 of 11)
School ALL BB Difference School ALL BB Difference
1 79 93 −14 17 82 78 4
2 88 91 −3 18 91 70 21
3 88 100 −12 19 84 85 −1
4 71 73 −2 20 92 88 4
5 74 57 17 21 90 70 20
6 98 100 −2 22 83 80 3
7 98 100 −2 23 60 42 18
8 71 77 −6 24 60 39 21
9 69 53 16 25 81 91 −10
10 97 88 9 26 90 55 35
11 67 58 9 27 85 82 3
12 87 67 20 28 78 54 24
13 90 92 −2 29 79 92 −13
14 80 75 5 30 78 80 −2
15 87 63 24 31 83 92 −9
16 87 100 −13 32 78 80 −2
Example 1.4 – Making It to Graduation . . . (6 of 11)
School ALL BB Difference School ALL BB Difference
33 79 55 24 51 80 50 30
34 79 36 43 52 82 62 20
35 86 83 3 53 81 82 −1
36 85 20 65 54 70 50 20
37 95 100 −5 55 85 100 −15
38 78 62 16 56 87 83 4
39 89 100 −11 57 83 90 −7
40 84 100 −16 58 86 64 22
41 81 90 −9 59 92 91 1
42 85 91 −6 60 85 67 18
43 89 93 −4 61 93 83 10
44 89 89 0 62 94 100 −6
45 85 90 −5 63 76 83 −7
46 85 80 5 64 69 100 −31
47 81 46 35 65 82 83 −1
48 80 75 5 66 80 63 17
49 98 100 −2 67 94 91 3
50 84 71 13 68 98 95 3
Example 1.4 – Making It to Graduation . . . (7 of 11)
Figure 1.7 shows two dotplots of graduation rates, one for
basketball players and one for all student athletes.
Figure 1.7: Minitab dotplots of graduation rates for basketball players and for all
athletes.
Example 1.4 – Making It to Graduation . . . (8 of 11)
The graduation rates for all student athletes tend to be
higher and to vary less from school to school than the
graduation rates for only basketball players.
The data given here are an example of paired data. Each
basketball graduation rate is paired with a graduation rate
for all student athletes from the same school.
Example 1.4 – Making It to Graduation . . . (9 of 11)
In this case, it is useful to look at the difference between the
graduation rate for all student athletes and the graduation rate
for basketball players at each school. These differences
(All – Basketball) are also shown in the data table.
Figure 1.8 gives a dotplot of the differences.
Figure 1.8: Dotplot of graduation rate differences (all athletes – basketball players).
Example 1.4 – Making It to Graduation . . . (10 of 11)
Notice that one difference is equal to 0. This corresponds to a
school for which the basketball graduation rate is equal to the
graduation rate of all student athletes. There are 30 schools
for which the difference is negative. Negative differences
correspond to schools that have a higher graduation rate for
basketball players than for all student athletes.
The most interesting feature of the difference dotplot is the
variability in the positive differences. Positive differences
correspond to schools that have a lower graduation rate for
basketball players.
Example 1.4 – Making It to Graduation . . . (11 of 11)
The positive differences range from 1 percentage point all the
way up to 65 percentage points. There were five schools that had
a graduation rate for all athletes that was 30 percentage points
or more greater than the graduation rate for basketball players.
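The counts quoted in this example can be reproduced from the data table. The Python sketch below re-enters the 68 (ALL, BB) pairs from the table in school order and recomputes the differences; the variable names are illustrative.

```python
# (ALL, BB) graduation rates for schools 1-68, copied from the data table.
rates = [
    (79, 93), (88, 91), (88, 100), (71, 73), (74, 57), (98, 100), (98, 100), (71, 77),
    (69, 53), (97, 88), (67, 58), (87, 67), (90, 92), (80, 75), (87, 63), (87, 100),
    (82, 78), (91, 70), (84, 85), (92, 88), (90, 70), (83, 80), (60, 42), (60, 39),
    (81, 91), (90, 55), (85, 82), (78, 54), (79, 92), (78, 80), (83, 92), (78, 80),
    (79, 55), (79, 36), (86, 83), (85, 20), (95, 100), (78, 62), (89, 100), (84, 100),
    (81, 90), (85, 91), (89, 93), (89, 89), (85, 90), (85, 80), (81, 46), (80, 75),
    (98, 100), (84, 71), (80, 50), (82, 62), (81, 82), (70, 50), (85, 100), (87, 83),
    (83, 90), (86, 64), (92, 91), (85, 67), (93, 83), (94, 100), (76, 83), (69, 100),
    (82, 83), (80, 63), (94, 91), (98, 95),
]

# Paired differences: all student athletes minus basketball players.
differences = [all_rate - bb_rate for all_rate, bb_rate in rates]

negative = sum(d < 0 for d in differences)       # basketball rate is higher
zero = sum(d == 0 for d in differences)          # the two rates are equal
positive = [d for d in differences if d > 0]     # basketball rate is lower
large_gaps = sum(d >= 30 for d in differences)   # gap of 30 points or more

bb_rates = [bb_rate for _, bb_rate in rates]
perfect_bb = sum(rate == 100 for rate in bb_rates)

print(f"Negative: {negative}, zero: {zero}, positive: {len(positive)}")
print(f"Positive differences range from {min(positive)} to {max(positive)}")
print(f"Schools with a gap of 30 points or more: {large_gaps}")
print(f"Basketball rates range from {min(bb_rates)} to {max(bb_rates)}; "
      f"{perfect_bb} schools at 100")
```

Working with the paired differences, rather than the two sets of rates separately, is what makes the school-by-school comparison visible.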
Module #1 Assignment Requirements
❑ Discussion Board
Discuss the steps of the Data Science process flow. Which
step is the most important? What types of questions can
data science answer?
For this discussion choose a KSA company that you are
familiar with and identify a business problem that can be
informed by data analysis. Propose a plan for the process
of answering the question to reasonably ensure that an
appropriate data product is constructed and that it
supports the resolution of the business problem. Discuss
how your proposal answers the question and how you will
disseminate the findings from your analyses.
This concludes our live session.
Thank you for your
attendance!
Questions
Take advantage of this opportunity
to seek further clarification.
Next Live Session
• <Insert date for next Live Session.>
• <Insert time for next Live Session.>