Lecture 2: Data
Lecture Outline: Data, Summaries, and Visuals
• Data Exploration
• Descriptive Statistics
• Visualizations
• An Example
Communicate/Visualize the Results
Today we will begin introducing the data collection and data
exploration steps.
Simple or atomic data types:
• Numeric: integers, floats
• Boolean: binary (true/false) values
• Strings: sequences of symbols
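As a quick illustration, each atomic type above maps onto a Python built-in type. This is a minimal sketch; the variable names and values are made up for the example.

```python
# Hypothetical values, chosen only to illustrate the atomic types.
n_siblings = 2        # numeric: integer
height_m = 1.75       # numeric: float
has_pet = True        # boolean: true/false value
pet_kind = "cat"      # string: sequence of symbols

atomic_types = [type(v).__name__ for v in (n_siblings, height_m, has_pet, pet_kind)]
print(atomic_types)  # ['int', 'float', 'bool', 'str']
```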
We’ll see later that it’s important to distinguish between classes of variables
or attributes based on the type of values they can take on.
• Quantitative variable: numerical, and either:
  • discrete: a finite number of values are possible in any bounded
    interval. For example, “Number of siblings” is a discrete variable.
  • continuous: an infinite number of values are possible in any
    bounded interval. For example, “Height” is a continuous variable.
• Categorical variable: no inherent order among the values. For example,
  “What kind of pet you have” is a categorical variable.
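One rough way to make the distinction concrete is to guess a variable's class from the Python type of its values. The function `variable_class` and the survey answers below are hypothetical, and the rule is only a heuristic: integer codes such as zip codes would be misread as discrete quantitative values even though they are really categorical.

```python
def variable_class(values):
    """Heuristic guess at a variable's class from the types of its values."""
    # bool is a subclass of int in Python, so it is excluded from the numeric checks.
    if all(isinstance(v, (str, bool)) for v in values):
        return "categorical"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "quantitative (discrete)"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "quantitative (continuous)"
    return "mixed"

# Hypothetical survey answers.
siblings = [0, 2, 1, 3]
heights_cm = [162.5, 180.1, 171.0]
pets = ["dog", "cat", "none"]

for name, column in [("siblings", siblings), ("height", heights_cm), ("pet", pets)]:
    print(name, "->", variable_class(column))
```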
The mean describes what a “typical” sample value looks like, or where the
“center” of the distribution of the data is.
Key theme: there is always uncertainty involved when calculating a sample
mean to estimate a population mean.
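The uncertainty can be seen in a small simulation. The sketch below assumes a made-up "population" of simulated heights (not real data) and draws several samples from it; each sample mean lands near the population mean, but none matches it exactly and they differ from one another.

```python
import random
import statistics

random.seed(109)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm, for illustration only.
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw five samples of size 25 and compute the mean of each.
sample_means = [
    statistics.mean(random.sample(population, 25))
    for _ in range(5)
]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean:   {m:.2f}")
```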
CS109A, PROTOPAPAS, RADER, TANNER 22
Sample median
The (sample) variance, denoted $s^2$, measures how much on average the
sample values deviate from the mean:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} |x_i - \bar{x}|^2$$
Note: the term $|x_i - \bar{x}|$ measures the amount by which each $x_i$ deviates from the
mean $\bar{x}$. Squaring these deviations means that $s^2$ is sensitive to extreme
values (outliers).
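A minimal sketch of the sample variance, using the $n-1$ denominator, shows the outlier sensitivity directly. The helper `sample_variance` and the data are hypothetical; Python's standard `statistics.variance` computes the same quantity.

```python
import statistics

def sample_variance(xs):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

values = [2, 4, 4, 5, 5]        # hypothetical data
with_outlier = values + [40]    # same data plus one extreme value

print(sample_variance(values))         # 1.5
print(sample_variance(with_outlier))   # far larger, driven by the single outlier
print(statistics.variance(values))     # the standard library agrees: 1.5
```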
Note: $s^2$ doesn’t have the same units as the $x_i$ :(
What does a variance of 1,008 mean? Or 0.0001?
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} |x_i - \bar{x}|^2}$$
Note: $s$ does have the same units as the $x_i$. Phew!
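The units point can be checked with a short sketch. The heights below are hypothetical; `statistics.variance` and `statistics.stdev` are the standard-library functions for $s^2$ and $s$.

```python
import math
import statistics

heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]  # hypothetical heights in cm

s2 = statistics.variance(heights_cm)  # units: cm^2
s = statistics.stdev(heights_cm)      # units: cm, same as the data

print(s2)  # 62.5
print(s)   # square root of 62.5, roughly 7.9 cm
```

A standard deviation of about 8 cm is directly comparable to the heights themselves, whereas 62.5 cm² is not.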
The following four data sets make up Anscombe’s Quartet; all four
sets of data have nearly identical simple summary statistics.
Summary statistics clearly don’t tell the story of how they differ. But a
picture can be worth a thousand words:
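The claim about the summary statistics can be verified numerically. The sketch below uses the values published for Anscombe's quartet (Anscombe, 1973) and only the standard library; the variable names are my own.

```python
import statistics

# Anscombe's quartet. Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    summary[name] = {
        "mean_x": statistics.mean(x),
        "var_x": statistics.variance(x),
        "mean_y": statistics.mean(y),
        "var_y": statistics.variance(y),
    }
    print(name, {k: round(v, 2) for k, v in summary[name].items()})
```

The four rows printed agree to within rounding, even though plotting y against x for each set gives four very different pictures.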
If I tell you that the average score for Homework 0 last year was
7.64/15 = 50.9%, what does that suggest?
Visualizations help us to analyze and explore the data. They help to:
Pie charts are often frowned upon (and bar charts are used instead). Why?
When the data is high-dimensional, a scatter plot of all data attributes can
be impossible or unhelpful.
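One way to see the problem, assuming the attributes are plotted two at a time, is to count the panels needed: d attributes give d(d-1)/2 pairs. The helper `pairwise_panels` and the attribute names below are hypothetical.

```python
from itertools import combinations

def pairwise_panels(attributes):
    """All unordered attribute pairs, one per scatter-plot panel."""
    return list(combinations(attributes, 2))

# Hypothetical attribute names.
attrs = ["duration", "age", "temperature", "hour"]

print(len(pairwise_panels(attrs)))      # 6
print(len(pairwise_panels(range(50))))  # 1225
```

Six panels are easy to scan; more than a thousand are not.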
Communicate/Visualize the Results
Note: This process is by no means linear!
Analyzing Hubway Data
By 2016, Hubway operated 185 stations and 1,750 bicycles, with 5 million rides
since launching in 2011.
The Data: In April 2017, Hubway held a Data Visualization Challenge at the
Microsoft NERD Center in Cambridge, releasing 5 years of trip data.
The Question: What does the data tell us about the ride share program?
Our original question: ‘What does the data tell us about the ride share program?’
is a reasonable slogan to promote a hackathon. It is not good for guiding scientific
investigation.
Sometimes the feature you want to explore doesn’t exist in the data, and must
be engineered!
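As a sketch of what engineering a feature can look like, the example below derives trip duration, start hour, and a weekend flag from raw timestamps. The records, field names, and timestamp format are assumptions made for illustration; they are not taken from the actual Hubway files.

```python
from datetime import datetime

# Hypothetical trip records; field names and values are assumed for this sketch.
trips = [
    {"start_time": "2015-07-04 08:15:00", "end_time": "2015-07-04 08:40:00"},
    {"start_time": "2015-07-06 17:30:00", "end_time": "2015-07-06 17:42:00"},
]

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def engineer_features(trip):
    """Derive features that are not stored directly in the raw record."""
    start = datetime.strptime(trip["start_time"], FMT)
    end = datetime.strptime(trip["end_time"], FMT)
    return {
        "duration_min": (end - start).total_seconds() / 60,
        "start_hour": start.hour,
        "is_weekend": start.weekday() >= 5,  # Monday is 0, so 5 and 6 are Sat/Sun
    }

features = [engineer_features(t) for t in trips]
for f in features:
    print(f)
```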
• How do user demographics affect how long the bikes are used, or where
they are checked out?
• How do weather or traffic conditions affect bike usage?
• How do the characteristics of the station location affect the number of bikes being
checked out?
http://hubwaydatachallenge.org