Data Analyst
[Figure: Data Types diagram, with Quantitative data split into Continuous and Discrete]
Another Look
To break down our data types, there are two main blocks: categorical and quantitative.
You should now have mastered which types of data in the world
around us fall into each of these four buckets: Discrete,
Continuous, Nominal, and Ordinal. In the next sections, we will work
through the numeric summaries that relate specifically to
quantitative variables.
Final Words
In this section, we looked at the different data types we might work
with in the world around us. When we work with data in the real
world, it might not be very clean - sometimes there are typos or
missing values. When this is the case, simply having some expertise
regarding the data and knowing the data type can assist in our
ability to ‘clean’ this data. Understanding data types can also assist
in our ability to build visuals to best explain the data. But more on
this very soon!
Calculating the 5 Number Summary
The five-number summary consists of five values:
1. Minimum
2. Q1 (the first quartile)
3. Q2 (the median)
4. Q3 (the third quartile)
5. Maximum
Range
The range is then calculated as the difference between
the maximum and the minimum.
IQR
The interquartile range is calculated as the difference
between Q3 and Q1.
In the upcoming sections, you will practice this with Katie and on
your own.
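The calculations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `five_number_summary` and the sample values are hypothetical, and the quartiles use the "median of each half" convention, which is one of several conventions in use (other tools may return slightly different quartiles). The sketch also assumes at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Return min, Q1, median, Q3, and max using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:]  # values above the median position
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }

# Hypothetical sample values chosen for illustration
summary = five_number_summary([10, 14, 10, 6])
data_range = summary["max"] - summary["min"]  # maximum minus minimum
iqr = summary["q3"] - summary["q1"]           # Q3 minus Q1
print(summary, data_range, iqr)
```

For these four values the sorted data is 6, 10, 10, 14, so the lower half is 6 and 10 and the upper half is 10 and 14.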
Example: Calculating the Standard Deviation
The dataset for the example is 10, 14, 10, 6.
1. First, calculate the mean:
$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$
2. Next, calculate the distance of each observation from the mean and
square the value:
$(x_i - \bar{x})^2$:
$(10 - 10)^2 = 0^2 = 0$
$(14 - 10)^2 = 4^2 = 16$
$(10 - 10)^2 = 0^2 = 0$
$(6 - 10)^2 = (-4)^2 = 16$
3. Then calculate the variance, the average squared difference of
each observation from the mean:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$
4. Finally, calculate the standard deviation, the square root of the
variance:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$
The standard deviation describes, roughly, how far a typical point in our
dataset is from the mean.
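The four steps of this example can be reproduced in Python as a check on the arithmetic. This is a minimal sketch that follows the worked example's dataset and its divide-by-n variance; the variable names are illustrative, and `statistics.pstdev` is used only as a cross-check.

```python
import math
from statistics import pstdev

data = [10, 14, 10, 6]

# Step 1: the mean
mean = sum(data) / len(data)

# Step 2: squared distance of each observation from the mean
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: the variance, the average of the squared distances (dividing by n)
variance = sum(squared_diffs) / len(data)

# Step 4: the standard deviation, the square root of the variance
std_dev = math.sqrt(variance)

print(mean, squared_diffs, variance, round(std_dev, 2))
print(math.isclose(std_dev, pstdev(data)))  # compare with the standard library
```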
Recap
Variable Types
We have covered a lot up to this point! We started with identifying data
types as either categorical or quantitative. We then learned we could
identify quantitative variables as either continuous or discrete. We also
found we could identify categorical variables as either ordinal or nominal.
Categorical Variables
When analyzing categorical variables, we commonly just look at the count or
percent of a group that falls into each level of a category. For example, if we
had two levels of a dog category, lab and not lab, we might say that 32% of
the dogs were labs (percent), or that 32 of the 100 dogs we saw were labs
(count).
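As a small illustration, the count and percent for each level can be computed with Python's `collections.Counter`. The list below is constructed to match the example (32 labs out of 100 dogs, so 68 that are not labs); the variable names are hypothetical.

```python
from collections import Counter

# 100 dogs, 32 of which are labs, as in the example above
dogs = ["lab"] * 32 + ["not lab"] * 68

counts = Counter(dogs)  # count per level
percents = {level: 100 * count / len(dogs) for level, count in counts.items()}

for level in counts:
    print(f"{level}: {counts[level]} dogs ({percents[level]}%)")
```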
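Quantitative Variables
Then we learned there are four main aspects used to describe quantitative
variables:
1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers
Measures of Center
We looked at calculating measures of center:
1. Means
2. Medians
3. Modes
Measures of Spread
We also looked at calculating measures of spread:
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance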
Calculating Variance
We saw that we could calculate the variance as:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
You will also see:
$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The reason for this is beyond the scope of what we have covered thus far,
but you can find an explanation here.
You can commonly find answers to your questions with a quick Google
search. Now is a great time to get started with this
practice! This answer should make more sense at the completion of this
lesson.
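To see how the two formulas differ in practice, here is a short sketch using Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The dataset is the one from the standard deviation example; the variable names are illustrative.

```python
from statistics import pvariance, variance

data = [10, 14, 10, 6]

population_variance = pvariance(data)  # divides the sum of squares by n
sample_variance = variance(data)       # divides the sum of squares by n - 1

print(population_variance, sample_variance)
```

The sum of squared differences is 32 in both cases, so the two results are 32/4 and 32/3. The n - 1 version is always the larger of the two.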
What Next?
In the next sections, we will be looking at the last two aspects of quantitative
variables: shape and outliers. What we know about measures of center and
measures of spread will assist in your understanding of these final two
aspects.
Histograms
We learned how to build a histogram in this video, as this is the
most popular visual for quantitative data.
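The idea behind a histogram can be sketched without any plotting library: group the values into bins and count how many fall into each one. The sketch below assumes equal-width bins; the helper name `histogram_counts`, the bin width, and the data are all hypothetical choices for illustration.

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall into each equal-width bin, keyed by the bin's left edge."""
    counts = {}
    for value in values:
        index = int((value - start) // bin_width)  # which bin the value belongs to
        left_edge = start + index * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical data
ages = [1, 2, 2, 3, 7, 8, 12, 13, 14, 19]
counts = histogram_counts(ages, bin_width=5)

# Print a simple text histogram, one row per bin
for left_edge, count in counts.items():
    print(f"[{left_edge}, {left_edge + 5}): {'#' * count}")
```

Each bar's height is the number of values in that bin, which is what the bars of a plotted histogram represent.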
Shape
From a histogram, we can quickly identify the shape of our data,
which influences all of the measures we learned in the previous
concepts. We learned that the distribution of our data is frequently
associated with one of the three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric
Summary
[Table: Shape, Mean vs. Median, Real-World Applications]
When working with data, building a plot lets you quickly see
the shape of your data.
Common Techniques
When outliers are present we should consider the following points.
3. Understanding why they exist, and the impact on questions we are trying to answer about our
data.
4. Reporting the 5 number summary values is often a better indication than measures like the
mean and standard deviation when we have outliers.
The IQR is the distance between the first and third quartiles, which are the
edges of the box. They are about 4.8 for the first quartile and 5.2 for the
third, so the IQR is about 0.4.
To match the appropriate Iris type to the statements regarding Sepal
Length, refer to the characteristics of the three Iris species: Setosa,
Versicolor, and Virginica.
Pay attention to the scale of these two graphs. The first deals
with much higher numbers.
The average factors in all the numbers, so outliers pull the
average toward them.
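A quick way to see this effect is to compare the mean and median of a small dataset before and after adding one extreme value. The numbers below are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical salaries in thousands
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [500]  # add one extreme value

mean_shift = mean(with_outlier) - mean(salaries)
median_shift = median(with_outlier) - median(salaries)

print(mean(salaries), median(salaries))
print(mean(with_outlier), median(with_outlier))
print(mean_shift, median_shift)
```

The median moves only from 45 to 46, while the mean moves by far more, because every value, including the outlier, enters the mean's sum.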
Left-skewed is when the graph starts with a low frequency and then
slopes up, leaving a long tail on the left. Right-skewed is when the graph
starts with a high frequency and slopes down, leaving a long tail on the right.
Recap
Variable Types
We have covered a lot up to this point! We started with identifying
data types as either categorical or quantitative. We then learned
we could identify quantitative variables as
either continuous or discrete. We also found we could identify
categorical variables as either ordinal or nominal.
Categorical Variables
When analyzing categorical variables, we commonly just look at the
count or percent of a group that falls into each level of a category.
For example, if we had two levels of a dog category, lab and not
lab, we might say that 32% of the dogs were labs (percent), or that
32 of the 100 dogs we saw were labs (count).
Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:
1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers
Measures of Center
We looked at calculating measures of center:
1. Means
2. Medians
3. Modes
Measures of Spread
We also looked at calculating measures of spread:
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
Shape
We learned that the distribution of our data is frequently associated
with one of the three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric
Outliers
We learned that outliers have a larger influence on measures like
the mean than on measures like the median. We learned that we
should handle outliers on a case-by-case basis. Common
techniques include:
What Next?
Up to this point, we have only looked at Descriptive Statistics,
because we are describing our collected data. In the final sections of
this lesson, we will be looking at the difference
between Descriptive Statistics and Inferential Statistics.
In this section, we learned about how Inferential Statistics differs
from Descriptive Statistics.
Descriptive Statistics
Descriptive statistics is about describing our collected data.
Inferential Statistics
Inferential Statistics is about using our collected data to
draw conclusions about a larger population.
Descriptive Statistics
Descriptive statistics is about describing our collected
data using the measures discussed throughout this lesson:
measures of center, measures of spread, the shape of our
distribution, and outliers. We can also use plots of our data to gain a
better understanding.
Inferential Statistics
Inferential Statistics is about using our collected data to draw
conclusions about a larger population. Performing inferential
statistics well requires that we take a sample that accurately
represents our population of interest.
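The distinction can be illustrated with a small simulation. Everything here is a hypothetical setup: a made-up population of heights is generated, a random sample is drawn from it, and the sample mean (a descriptive statistic of the collected data) is used as an estimate of the population mean (the inferential step). In practice the full population is usually not available, which is why the sample must represent it well.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 heights in cm
population = [random.gauss(170, 10) for _ in range(10_000)]

# Collected data: a simple random sample of 200 individuals
sample = random.sample(population, 200)

sample_mean = mean(sample)          # describes the collected data
population_mean = mean(population)  # the quantity we want to draw conclusions about

print(round(sample_mean, 1), round(population_mean, 1))
```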
Looking Ahead
Though we will not be diving deep into inferential statistics within
this course, you are now aware of the difference between these two
branches of statistics. If you have ever conducted a hypothesis test
or built a confidence interval, you have performed inferential
statistics. The way we perform inferential statistics is changing as
technology evolves. Many career paths involving Machine
Learning and Artificial Intelligence are aimed at using collected
data to draw conclusions about entire populations at an individual
level. It is an exciting time to be a part of this space, and you are
now well on your way to joining the other practitioners!
In this lesson, you'll learn to manipulate data in your spreadsheet, such as:
Pro Tip: Enter a formula by typing directly into a selected cell, beginning
with "=". This way is more direct and faster than the formula bar!
FIND and LEFT can be used to extract text. FIND can be given a substring
and a cell to return the position in a string where the substring was
found. LEFT can then be used to extract a certain number of characters
from a cell, starting from the left side.
RIGHT therefore extracts from the right side, while MID can extract from
some starting point in the middle of a cell.
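For readers who think in code, the behavior of these text functions can be approximated in Python. These helpers are hypothetical analogs written for illustration, not the spreadsheet functions themselves; like FIND, positions here are counted from 1, and the sample name is an assumption.

```python
def find(substring, text):
    """Like FIND: 1-based position of substring in text (raises ValueError if absent)."""
    return text.index(substring) + 1

def left(text, count):
    """Like LEFT: the first `count` characters."""
    return text[:count]

def right(text, count):
    """Like RIGHT: the last `count` characters."""
    return text[-count:] if count > 0 else ""

def mid(text, start, count):
    """Like MID: `count` characters beginning at 1-based position `start`."""
    return text[start - 1:start - 1 + count]

full_name = "Ada Lovelace"  # hypothetical cell contents
space_at = find(" ", full_name)
first_name = left(full_name, space_at - 1)
last_name = right(full_name, len(full_name) - space_at)
middle_part = mid(full_name, 5, 4)

print(space_at, first_name, last_name, middle_part)
```

Combining the position from `find` with `left` or `right` is the same pattern as nesting FIND inside LEFT or RIGHT in a spreadsheet formula.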
PROPER sets the first letter of each word to upper case, with the rest
lowercase.
UPPER sets all letters to upper case, while LOWER sets all letters to
lowercase.
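Python's built-in string methods behave in a roughly comparable way and can serve as a quick mental model: `str.title()` for PROPER, `str.upper()` for UPPER, and `str.lower()` for LOWER. The sample text is hypothetical, and edge cases may differ between a spreadsheet and Python.

```python
messy = "jOHN smith"  # hypothetical cell contents

proper_case = messy.title()  # first letter of each word upper case, rest lower
upper_case = messy.upper()   # all letters upper case
lower_case = messy.lower()   # all letters lower case

print(proper_case, upper_case, lower_case)
```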
Pivot tables summarize and aggregate data in a single step. In Excel, if you select all the
relevant data, you can use Insert -> Pivot Table. You then can click on the
desired fields to include in the pivot table, or drag them to the relevant area
(such as column or row) you want to use the field within.
In the example in the video, we first chose to make the teams the rows and
the positions the columns of the pivot table, with the names as the values.
This defaulted to a count of names. After switching to salaries, we switched
the aggregation to the sum of the salaries instead.
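What the pivot table does in this example can be sketched in Python: group the rows by team and position, then apply an aggregation to each group. The player data and the `pivot` helper below are hypothetical stand-ins, since the video's actual spreadsheet is not reproduced here.

```python
from collections import defaultdict

# Hypothetical rows: (name, team, position, salary)
players = [
    ("Ann", "Hawks", "Guard", 100),
    ("Bob", "Hawks", "Guard", 150),
    ("Cy", "Hawks", "Center", 200),
    ("Di", "Lions", "Guard", 120),
]

def pivot(rows, aggregate):
    """Group rows by team (rows) and position (columns), then aggregate each cell."""
    groups = defaultdict(lambda: defaultdict(list))
    for name, team, position, salary in rows:
        groups[team][position].append((name, salary))
    return {
        team: {position: aggregate(cells) for position, cells in positions.items()}
        for team, positions in groups.items()
    }

# Names as values: the aggregation is a count
name_counts = pivot(players, len)

# Salaries as values: the aggregation is a sum
salary_sums = pivot(players, lambda cells: sum(salary for _, salary in cells))

print(name_counts)
print(salary_sums)
```

Switching the aggregation from a count to a sum changes only the function applied to each cell, which mirrors changing the value field settings in the pivot table.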
In Google Sheets, pivot tables can be found in the Insert menu.
In Excel, you can name a cell by selecting the relevant cell, and then in the "Formulas" tab, select
"Define Name". You can then change the name (if desired) or update which cells the name
applies to.
If you want to name a range, select the range of data (including labels), and instead select
"Create From Selection".
This lets you make the addresses in a formula clearer: J2 could be
replaced by apple_price, for example.
Note: If you are using Google Sheets, you can find Named Ranges under the Data menu,
similar to Pivot Tables.