Principles of Data Literacy - Introduction To Data Cheatsheet - Codecademy
Introduction to Data
Data Gaps
The ability to separate good, mediocre, and poor quality
data is a crucial data literacy skill. Data-driven
conclusions are only as strong, robust, and well-
supported as the data behind them. This is also often
referred to with the phrase “garbage in, garbage out.”
Addressing Bias
Bias in data collection leads to poorer quality data.
Recognizing bias in data is a crucial data literacy skill.
Some key questions about bias include “Who made the
data?”, “Who participated in the data?” and “Who is left
out of the data?”
What is Statistics?
Statistics helps to measure whether an event happens by
chance or by a systemic factor or factors. For example,
it’s statistically more likely to see traffic during peak rush
hour than outside of peak rush hour times.
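As a rough illustration of separating chance from a systematic factor, the sketch below computes how likely a fair coin is to land heads at least 9 times in 10 flips. The coin scenario and the function name are hypothetical and not part of the original cheatsheet. A very small probability suggests that something other than chance may be at work.

```python
from math import comb


def chance_of_at_least(successes, trials, p=0.5):
    """Probability of at least `successes` hits in `trials` independent tries."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical scenario: a coin shows heads 9 or more times in 10 flips.
probability = chance_of_at_least(9, 10)
print(f"Chance under a fair coin: {probability:.4f}")
```

An outcome this unlikely under pure chance is a reason to look for a systematic explanation, such as a weighted coin.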
Statistics at work
Statistics can reveal systemic patterns in a data set rather
than relying on individual experiences. This is important in
legal cases including those addressing discrimination or
class-action lawsuits.
Garbage In, Garbage Out
The quality of the predictions made during a predictive
analysis is deeply dependent on the quality of the data
used to generate the predictions.
For example, if a model is trained with mislabeled data, it
will produce inaccurate predictions no matter how good
the actual algorithm is. This is commonly referred to as
“garbage in, garbage out.”
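The effect of mislabeled training data can be sketched with a toy example. The code below is a minimal illustration, assuming a hypothetical one-dimensional dataset and a simple nearest-neighbor classifier; none of the names or values come from the original cheatsheet. The same algorithm is trained twice, once on clean labels and once on partly flipped labels.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)


def true_label(x):
    """Hypothetical ground truth: class 1 at or above 5, otherwise class 0."""
    return 1 if x >= 5 else 0


clean_train = [(x, true_label(x)) for x in range(10)]

# Flip the labels of four training points to simulate mislabeled data.
mislabeled_points = {2, 3, 6, 7}
noisy_train = [
    (x, 1 - y if x in mislabeled_points else y) for x, y in clean_train
]

test = [(x + 0.25, true_label(x + 0.25)) for x in range(10)]

clean_accuracy = accuracy(clean_train, test)
noisy_accuracy = accuracy(noisy_train, test)
print(clean_accuracy, noisy_accuracy)
```

The algorithm is identical in both runs; only the quality of the training labels changes, and the accuracy drops with it.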
Categorical Variables
Categorical variables consist of data that can be grouped
into distinct categories. They are either ordinal or nominal.
Ordinal categorical variables have categories with an
inherent ranking, such as ratings of plays or responses to
a survey question with a point scale (e.g., “On a scale from
1-7, how happy are you right now?”). Nominal categorical
variables have categories without an inherent order, such
as species of ants or people’s hair color.
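The practical difference between the two kinds can be sketched in code. In this minimal example the rating scale and the sample values are hypothetical. Ordinal values can be sorted by rank, so a middle rating is meaningful. Nominal values can only be counted.

```python
from collections import Counter

# Ordinal: the categories have an inherent ranking (assumed scale).
RATING_ORDER = ["poor", "fair", "good", "excellent"]
ratings = ["good", "poor", "excellent", "fair", "good"]

sorted_ratings = sorted(ratings, key=RATING_ORDER.index)
median_rating = sorted_ratings[len(sorted_ratings) // 2]

# Nominal: no inherent order, so counting is the natural summary.
hair_colors = ["brown", "black", "brown", "red", "blond"]
hair_counts = Counter(hair_colors)
most_common_hair = hair_counts.most_common(1)[0][0]

print(sorted_ratings)
print(median_rating, most_common_hair)
```

Sorting hair colors alphabetically would work mechanically, but the resulting order would carry no meaning, which is what makes the variable nominal.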
Messy Data
Messy data is data that violates one of the tidy dataset
rules (1. Each variable forms a column; 2. Each
observation forms a row; 3. Each type of observational
unit forms a table).
Below is an example of messy data:
[Table: messy data example; surviving cells: 1, Brown, F, B; B smith; Saito, K; 3, A, 90]
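One common way data breaks the first tidy rule is by storing values of a variable in column headers. The sketch below assumes a hypothetical quiz-score table (the column names and scores are invented for illustration) and reshapes it so that each variable forms a column and each observation forms a row.

```python
# Messy: "quiz_1" and "quiz_2" are values of a "quiz" variable,
# but they appear here as column names.
messy_rows = [
    {"student": "A", "quiz_1": 90, "quiz_2": 85},
    {"student": "B", "quiz_1": 70, "quiz_2": 95},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: value,
                }
            )
    return tidy


tidy_rows = to_tidy(messy_rows, "student", "quiz", "score")
for record in tidy_rows:
    print(record)
```

After reshaping, each row describes one student taking one quiz, and the three columns hold the three variables: student, quiz, and score.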
Tabular Data
Tabular data is organized into rows, or observations, along
the vertical axis, and columns, also referred to as
variables or features, along the horizontal axis.
Row #   Variable 1    Variable 2    Variable 3
1       Observation   Observation   Observation
1   Gnasher   ACD                M
2   Cassie    Collie             F   1   3
3   Pepper    French Bulldog     F   4   2
4   Jed       Golden Retriever   M
5   Henry     Spaniel            M
6   Ruby      ACD                F   1   6
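The row and column view of tabular data can be sketched in code. The example below uses the first three values of each dog record from the table. The column names "name", "breed", and "sex" are assumptions, since the original headers are not shown.

```python
# Column names are assumed; the source table does not label them.
columns = ["name", "breed", "sex"]

rows = [
    ("Gnasher", "ACD", "M"),
    ("Cassie", "Collie", "F"),
    ("Pepper", "French Bulldog", "F"),
    ("Jed", "Golden Retriever", "M"),
    ("Henry", "Spaniel", "M"),
    ("Ruby", "ACD", "F"),
]

# Each row becomes one observation keyed by variable name.
table = [dict(zip(columns, row)) for row in rows]

# A row is a single observation: everything known about one dog.
first_observation = table[0]

# A column is a single variable: one attribute across all dogs.
breed_variable = [record["breed"] for record in table]

print(first_observation)
print(breed_variable)
```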
2 Stool NaN
5 Stool NaN
8 Stool NaN
10 Stool NaN
Data Missing Completely at Random
Data that is Missing Completely at Random (MCAR) has no
detectable underlying reason causing the values to be
missing.
The table below has MCAR data. The # of fruits is missing
for some plants, but the missing fruit data seems
unrelated to the height of the plant. Short and tall plants
are both missing fruit data. In addition, we are missing the
height for one of our plants!
1 65 10
2 87
3 987
4 44
5 105 35
6 547 74
7 876
8 55
9 875 95
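One informal way to check whether missingness looks unrelated to another variable is to compare that variable across the rows with and without missing values. The sketch below uses hypothetical plant data (not the values from the table above) and compares the mean height of plants with and without a recorded fruit count. Similar means are consistent with MCAR but do not prove it.

```python
from statistics import mean

# Hypothetical data; None marks a missing fruit count.
plants = [
    {"height": 20, "fruits": 4},
    {"height": 30, "fruits": None},
    {"height": 50, "fruits": 9},
    {"height": 55, "fruits": None},
    {"height": 80, "fruits": 15},
    {"height": 65, "fruits": None},
]

heights_fruit_missing = [p["height"] for p in plants if p["fruits"] is None]
heights_fruit_present = [p["height"] for p in plants if p["fruits"] is not None]

mean_missing = mean(heights_fruit_missing)
mean_present = mean(heights_fruit_present)
height_gap = abs(mean_missing - mean_present)

print(mean_missing, mean_present, height_gap)
```

A large gap would hint that short or tall plants are more likely to have missing fruit counts, which would argue against the data being MCAR.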