Data Analyst

The document outlines various data types, categorizing them into quantitative (continuous and discrete) and categorical (ordinal and nominal) variables. It discusses methods for analyzing these data types, including measures of center, spread, and the impact of outliers. Additionally, it contrasts descriptive statistics, which summarize collected data, with inferential statistics, which draw conclusions about larger populations.

Uploaded by

dessalegnayalew

Summary of Video

The table below summarizes our data types. To expand on the
information in the table, you can look through the text that follows.

Data Types

Quantitative:   Continuous                     Discrete
                Height, Age, Income            Pages in a Book, Trees in Yard, Dogs at a Coffee Shop

Categorical:    Ordinal                        Nominal
                Letter Grade, Survey Rating    Gender, Marital Status, Breakfast Items

Below is a little more detail of the information shared in the above
table.

Another Look
To break down our data types, there are two main blocks:

Quantitative and Categorical

Quantitative can be further divided into Continuous or Discrete.

Categorical data can be divided into Ordinal or Nominal.

You should have now mastered what types of data in the world
around us fall into each of these four buckets: Discrete,
Continuous, Nominal, and Ordinal. In the next sections, we will work
through the numeric summaries that relate specifically to
quantitative variables.

Quantitative vs. Categorical


Some of these can be a bit tricky - notice even though zip codes are
a number, they aren’t really a quantitative variable. If we add two
zip codes together, we do not obtain any useful information from
this new value. Therefore, this is a categorical variable.

Height, Age, the Number of Pages in a Book, and Annual
Income all take on values that we can add, subtract and perform
other operations with to gain useful insight. Hence, these
are quantitative.

Gender, Letter Grade, Breakfast Type, Marital Status, and Zip
Code can be thought of as labels for a group of items or individuals.
Hence, these are categorical.

Continuous vs. Discrete


To consider if we have continuous or discrete data, we should see if
we can split our data into smaller and smaller units. Consider time -
we could measure an event in years, months, days, hours, minutes,
or seconds, and even at seconds we know there are smaller units we
could measure time in. Therefore, we know this data type is
continuous. Height, age, and income are all examples
of continuous data. Alternatively, the number of pages in a
book, dogs I count outside a coffee shop, or trees in a
yard are discrete data. We would not want to split our dogs in
half.

Ordinal vs. Nominal


In looking at categorical variables, we found Gender, Marital
Status, Zip Code, and your Breakfast items are nominal
variables, where there is no rank ordering associated with this type
of data. Whether you ate cereal, toast, eggs, or only coffee for
breakfast, there is no rank ordering associated with your breakfast.

Alternatively, Letter Grades and Survey Ratings have a rank
ordering associated with them, which makes them ordinal data. If you
receive an A, this is higher than an A-. An A- is ranked higher than
a B+, and so on. Ordinal variables frequently occur on rating scales
from very poor to very good. In many cases, we turn these ordinal
variables into numbers, as we can more easily analyze them, but more
on this later!

Final Words
In this section, we looked at the different data types we might work
with in the world around us. When we work with data in the real
world, it might not be very clean - sometimes there are typos or
missing values. When this is the case, simply having some expertise
regarding the data and knowing the data type can assist in our
ability to ‘clean’ this data. Understanding data types can also assist
in our ability to build visuals to best explain the data. But more on
this very soon!
Calculating the 5 Number Summary
The five-number summary consists of 5 values:

1. Minimum: The smallest number in the dataset.
2. Q1: The value such that 25% of the data fall below it.
3. Q2: The value such that 50% of the data fall below it (the median).
4. Q3: The value such that 75% of the data fall below it.
5. Maximum: The largest value in the dataset.

In the above video, we saw that calculating each of these values
was essentially just finding the median of a bunch of different
datasets. Because we are essentially calculating a bunch of
medians, the calculation depends on whether we have an odd or
even number of values.

Range
The range is then calculated as the difference between
the maximum and the minimum.

IQR
The interquartile range is calculated as the difference
between Q3 and Q1.

In the upcoming sections, you will practice this with Katie and on
your own.
Example: Calculating the Standard Deviation
The dataset for the example is 10, 14, 10, 6.
1. First, calculate the mean:

$$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$$

2. Next, calculate the distance of each observation from the mean and
square the value, $(x_i - \bar{x})^2$:

$$(10 - 10)^2 = 0^2 = 0$$
$$(14 - 10)^2 = 4^2 = 16$$
$$(10 - 10)^2 = 0^2 = 0$$
$$(6 - 10)^2 = (-4)^2 = 16$$

3. Then calculate the variance, the average squared difference of
each observation from the mean:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$$

4. Finally, calculate the standard deviation, the square root of the
variance:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$$
The standard deviation is, on average, how far each point in our
dataset is from the mean.
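
The same four steps can be checked with a few lines of Python. This is only a sketch of the worked example above, using the example's dataset and dividing by n; the variable names are chosen for illustration.

```python
import math

data = [10, 14, 10, 6]  # dataset from the worked example
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Step 2: squared distance of each observation from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: the variance, the average of the squared distances (dividing by n).
variance = sum(squared_diffs) / n

# Step 4: the standard deviation, the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, squared_diffs, variance, round(std_dev, 2))
```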

Recap

Variable Types
We have covered a lot up to this point! We started with identifying data
types as either categorical or quantitative. We then learned we could
identify quantitative variables as either continuous or discrete. We also
found we could identify categorical variables as either ordinal or nominal.

Categorical Variables
When analyzing categorical variables, we commonly just look at the count or
percent of a group that falls into each level of a category. For example,
suppose we had two levels of a dog category: lab and not lab. We might say
32% of the dogs were labs (percent), or we might say 32 of the 100 dogs I saw
were labs (count).

However, the 4 aspects associated with describing quantitative variables are
not used to describe categorical variables.
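
A minimal sketch of the count and percent summaries for a categorical variable, assuming a hypothetical list of 100 observations that matches the lab example (32 labs, 68 non-labs):

```python
from collections import Counter

# Hypothetical observations matching the example: 32 labs out of 100 dogs.
dogs = ["lab"] * 32 + ["not lab"] * 68

counts = Counter(dogs)        # count in each level of the category
total = sum(counts.values())  # total number of observations
percents = {level: 100 * count / total for level, count in counts.items()}

print(counts["lab"], "of", total, "dogs were labs")
print(percents["lab"], "percent of the dogs were labs")
```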
Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:

1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers

We looked at calculating measures of Center

1. Means
2. Medians
3. Modes

We also looked at calculating measures of Spread

1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance

Calculating Variance
We saw that we could calculate the variance as:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

You will also see:

$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The reason for this is beyond the scope of what we have covered thus far,
but you can find an explanation here.

You can commonly find answers to your questions with a quick Google
search. Now is a great time to get started with this
practice! This answer should make more sense at the completion of this
lesson.
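
To see the two formulas side by side, here is a small sketch using Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The dataset is the 10, 14, 10, 6 example used earlier in these notes.

```python
import statistics

data = [10, 14, 10, 6]

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1

print(population_variance)
print(sample_variance)
```

The n - 1 version is always a little larger for the same data, and the gap shrinks as the number of observations grows.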

Standard Deviation vs. Variance


The standard deviation is the square root of the variance. In practice, you
usually use the standard deviation rather than the variance. This is because
the standard deviation shares the same units as our original data, while the
variance has squared units.

What Next?
In the next sections, we will be looking at the last two aspects of quantitative
variables: shape and outliers. What we know about measures of center and
measures of spread will assist in your understanding of these final two
aspects.

Histograms
We learned how to build a histogram in this video, as this is the
most popular visual for quantitative data.
Shape
From a histogram, we can quickly identify the shape of our data,
which influences all of the measures we learned in the previous
concepts. We learned that the distribution of our data is frequently
associated with one of three shapes:

1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)

Summary

Shape                 Mean vs. Median             Real-World Applications
Symmetric (Normal)    Mean equals Median          Height, Weight, Errors, Precipitation
Right-skewed          Mean greater than Median    Amount of drug remaining in a bloodstream, Time between phone calls at a call center, Time until light bulb dies
Left-skewed           Mean less than Median       Grades as a percentage in many universities, Age of death, Asset price changes

The mode of a distribution is essentially the tallest bar in a
histogram. There may be multiple modes depending on the number
of peaks in our histogram.

When working with data, building a quick plot lets you quickly see
the shape of your data.
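
The mean-versus-median pattern from the summary table can be checked numerically. The three samples below are made up purely to illustrate each shape; they are not taken from the lesson.

```python
from statistics import mean, median

# Made-up samples chosen only to illustrate each shape.
symmetric = [2, 4, 5, 6, 8]
right_skewed = [1, 2, 2, 3, 3, 4, 20]       # long tail toward large values
left_skewed = [1, 17, 18, 18, 19, 19, 20]   # long tail toward small values

for name, sample in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    print(name, "mean:", mean(sample), "median:", median(sample))
```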

Distribution Shape    Types of Data
Bell Shaped           Heights, Weight, Scores
Left Skewed           GPA, Age of Death, Price
Right Skewed          Distribution of Wealth, Athletic Abilities

Common Techniques
When outliers are present, we should consider the following points.

1. Note that they exist and their impact on summary statistics.

2. If an outlier is a typo, remove or fix it.

3. Understand why they exist, and their impact on the questions we are trying to answer about our
data.

4. Reporting the 5 number summary values is often a better indication than measures like the
mean and standard deviation when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.


Image Summary
In the image below, we have three box plots. Each box plot is for a different
Iris flower: setosa, versicolor, or virginica. On the y-axis, we are given the
sepal length. Notice that virginica has an outlier towards the bottom of the
plot. Therefore, the minimum is not given by the bottom line here; rather, it
is provided by this point.

Box Plots of Sepal length for 3 Iris Flower Species


Quick Refresher: The measures of center and spread we can determine
from a Box Plot are as follows. Let's use Setosa for these examples.

Median is the centerline inside the box and is about 5.

IQR is the distance between the first and third quartiles, which are the edges
of the box. They are about 4.8 for the first quartile and 5.2 for the third,
so the IQR is about 0.4.

To match the appropriate Iris type to the statements regarding their Sepal
Length, you can refer to the characteristics of the three Iris species: Setosa,
Versicolor, and Virginica. Here's how they typically compare:

1. The largest Range: Virginica
2. The smallest Interquartile Range: Setosa
3. Median is approximately 5: Setosa
4. Third quartile is approximately 6.3: Versicolor
5. Approximately Symmetric: All (this can refer to the distribution of sepal
lengths across all species)
6. The largest sepals on average: Virginica

These matches are based on the typical statistical characteristics of the Iris
dataset.

Pay attention to the scale of these two graphs. The first is dealing
with much higher numbers.

The median is the middle number and is largely unaffected by outliers.

The average factors in all the numbers, so outliers will pull the
average towards them.
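
A small sketch of the difference, using made-up values: adding a single extreme point moves the mean a long way, while the median shifts only slightly.

```python
from statistics import mean, median

values = [4, 5, 5, 6, 7]      # made-up values for illustration
with_outlier = values + [60]  # the same values plus one extreme point

print("mean:  ", mean(values), "->", mean(with_outlier))
print("median:", median(values), "->", median(with_outlier))
```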

Left Skewed is when the graph starts with a low frequency and then
slopes up, so the long tail is on the left. Right Skewed is when the
graph starts with a high frequency and slopes down, so the long tail
is on the right.
Recap

Variable Types
We have covered a lot up to this point! We started with identifying
data types as either categorical or quantitative. We then learned
we could identify quantitative variables as
either continuous or discrete. We also found we could identify
categorical variables as either ordinal or nominal.

Categorical Variables
When analyzing categorical variables, we commonly just look at the
count or percent of a group that falls into each level of a category.
For example, suppose we had two levels of a dog category: lab and not
lab. We might say 32% of the dogs were labs (percent), or we
might say 32 of the 100 dogs I saw were labs (count).

However, the 4 aspects associated with describing quantitative
variables are not used to describe categorical variables.

Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:

1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers
Measures of Center
We looked at calculating measures of Center

1. Means
2. Medians
3. Modes

Measures of Spread
We also looked at calculating measures of Spread

1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance

Shape
We learned that the distribution of our data is frequently associated
with one of the three shapes:

1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)


Depending on the shape associated with our dataset, certain
measures of center or spread may be better for summarizing our
dataset.

When we have data that follows a normal distribution, we can
completely understand our dataset using the mean and standard
deviation.

However, if our dataset is skewed, the 5 number summary (and
measures of center associated with it) might be better to summarize
our dataset.

Outliers
We learned that outliers have a larger influence on measures like
the mean than on measures like the median. We learned that we
should work with outliers on a case-by-case basis. Common
techniques include:

1. At least note they exist and their impact on summary statistics.

2. If an outlier is a typo, remove or fix it.

3. Understand why they exist, and their impact on the questions we are
trying to answer about our data.

4. Reporting the 5 number summary values is often a better
indication than measures like the mean and standard deviation
when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.

Histograms and Box Plots


We also looked at histograms and box plots to visualize our
quantitative data. Identifying outliers and the shape of the
distribution of our data is easier with a visual than with
summary statistics alone.

What Next?
Up to this point, we have only looked at Descriptive Statistics,
because we are describing our collected data. In the final sections of
this lesson, we will be looking at the difference
between Descriptive Statistics and Inferential Statistics.

Descriptive vs. Inferential Statistics


In this section, we learned about how Inferential Statistics differs
from Descriptive Statistics.

Descriptive Statistics
Descriptive statistics is about describing our collected
data using the measures discussed throughout this lesson:
measures of center, measures of spread, the shape of our
distribution, and outliers. We can also use plots of our data to gain a
better understanding.

Inferential Statistics
Inferential Statistics is about using our collected data to draw
conclusions about a larger population. Performing inferential
statistics well requires that we take a sample that accurately
represents our population of interest.

A common way to collect data is via a survey. However, surveys
may be extremely biased depending on the types of questions that
are asked, and the way the questions are asked. This is a topic you
should think about when tackling the first project.

We looked at specific examples that allowed us to identify the following:

1. Population - our entire group of interest
2. Parameter - numeric summary about a population
3. Sample - a subset of the population
4. Statistic - numeric summary about a sample
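
The four terms can be illustrated with a short simulation. Everything here is hypothetical: the population is 1,000 randomly generated heights, and a simple random sample of 50 stands in for the data we would actually collect.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 made-up heights in centimeters.
population = [random.gauss(170, 10) for _ in range(1000)]
parameter = mean(population)  # numeric summary about the population

sample = random.sample(population, 50)  # a subset of the population
statistic = mean(sample)                # numeric summary about the sample

print("parameter (population mean):", round(parameter, 2))
print("statistic (sample mean):    ", round(statistic, 2))
```

In practice the parameter is usually unknown, and the statistic computed from the sample is what we use to draw conclusions about it.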

Looking Ahead
Though we will not be diving deep into inferential statistics within
this course, you are now aware of the difference between these two
branches of statistics. If you have ever conducted a hypothesis test
or built a confidence interval, you have performed inferential
statistics. The way we perform inferential statistics is changing as
technology evolves. Many career paths involving Machine
Learning and Artificial Intelligence are aimed at using collected
data to draw conclusions about entire populations at an individual
level. It is an exciting time to be a part of this space, and you are
now well on your way to joining the other practitioners!

In this lesson, you'll learn to manipulate data in your spreadsheet, such as:

• Working with cell formulas
• Text data
• Math operations
• Statistical functions
• Performing data operations at the table level
• Duplicating rows
• Splitting columns
• Sorting data

Text String: String of letters, numbers, and punctuation that is not
treated numerically.

While the SUBSTITUTE function sounds similar to find/replace, it is
used for different purposes. Find/replace gets rid of the old data,
while SUBSTITUTE will not change the original cell, instead showing
the transformed data in a new cell.

SUBSTITUTE uses the syntax SUBSTITUTE({text}, {old_text},
{new_text}), where {text} is the cell to change, {old_text} is the
string sequence to be replaced, and {new_text} is the new string in
place of the old one.
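
For readers who think in code, a loose Python analogy may help. The `substitute` helper below is hypothetical and only mirrors the three-argument form described above; it is not how the spreadsheet implements the function. Like SUBSTITUTE, it returns a new value and leaves the original untouched.

```python
def substitute(text, old_text, new_text):
    """Rough analogue of SUBSTITUTE({text}, {old_text}, {new_text})."""
    # str.replace builds a new string; the original string is not modified.
    return text.replace(old_text, new_text)


original = "2023/01/15"                       # made-up cell contents
transformed = substitute(original, "/", "-")  # result shown "in a new cell"

print(original)
print(transformed)
```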

Pro Tip: Enter a formula by typing directly into a selected cell, beginning
with "=". This way is more direct and faster than the formula bar!

FIND and LEFT can be used to extract text. FIND can be given a substring
and a cell to return the position in a string where the substring was
found. LEFT can then be used to extract a certain number of characters
from a cell, starting from the left side.

RIGHT therefore extracts from the right side, while MID can extract from
some starting point in the middle of a cell.
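
The way these functions combine can be sketched in Python. The helpers below are hypothetical stand-ins written for illustration, not the spreadsheet functions themselves; positions are counted from 1, as they are in a spreadsheet, and the sample name is made up.

```python
def find(substring, text):
    """1-based position of substring in text; raises ValueError if absent."""
    return text.index(substring) + 1


def left(text, count):
    """First count characters of text."""
    return text[:count]


def right(text, count):
    """Last count characters of text."""
    return text[-count:] if count > 0 else ""


def mid(text, start, count):
    """count characters beginning at 1-based position start."""
    return text[start - 1:start - 1 + count]


full_name = "Ada Lovelace"                          # made-up cell contents
space_at = find(" ", full_name)                     # position of the space
first = left(full_name, space_at - 1)               # text before the space
last = right(full_name, len(full_name) - space_at)  # text after the space
middle = mid(full_name, 5, 4)                       # 4 characters from position 5

print(space_at, first, last, middle)
```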

Pro Tip: To display the formula of one cell in another, use
the FORMULATEXT function.

CONCATENATE will join together two or more strings. It's important to
note that this will not automatically add spaces between them, so
make sure to add spaces as formula parameters if you need them.

TRIM will help to remove excess whitespace from a string.

PROPER sets the first letter of each word to upper case, with the rest
lowercase.

UPPER sets all letters to upper case, while LOWER sets all letters to
lowercase.
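
As a loose analogy only, the same cleanup steps look like this in Python. The `concatenate` and `trim` helpers are hypothetical, and Python's built-in `title`, `upper`, and `lower` methods are used as approximate counterparts of PROPER, UPPER, and LOWER; edge cases may differ from a spreadsheet.

```python
def concatenate(*parts):
    """Join strings end to end; no spaces are added automatically."""
    return "".join(parts)


def trim(text):
    """Drop leading/trailing whitespace and collapse repeated whitespace."""
    return " ".join(text.split())


first, last = "ada", "LOVELACE"             # made-up cell contents

no_space = concatenate(first, last)         # spaces are not added for you
with_space = concatenate(first, " ", last)  # pass the space as its own part
cleaned = trim("  ada   lovelace ")

print(no_space)
print(with_space.title())  # approximate counterpart of PROPER
print(with_space.upper())  # approximate counterpart of UPPER
print(with_space.lower())  # approximate counterpart of LOWER
print(cleaned)
```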

Pivot tables sum and aggregate in a single step. In Excel, if you select all the
relevant data, you can use Insert -> Pivot Table. You then can click on the
desired fields to include in the pivot table, or drag them to the relevant area
(such as column or row) you want to use the field within.

In the example in the video, we first chose to make the teams the rows and
the positions the columns of the pivot table, with the names as the values.
This defaulted to a count of names. After switching to salaries, we switched
the aggregation to the sum of the salaries instead.
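
To make the grouping-then-aggregating idea concrete, here is a rough Python sketch of what the pivot table in that example does. The roster rows are entirely made up; only the layout (teams as rows, positions as columns, a count of names or a sum of salaries as values) follows the description above.

```python
from collections import defaultdict

# Hypothetical roster rows: (name, team, position, salary). All values are made up.
players = [
    ("Avery", "Hawks", "Guard", 100),
    ("Blake", "Hawks", "Guard", 150),
    ("Casey", "Hawks", "Center", 200),
    ("Devon", "Owls", "Guard", 120),
]

count_pivot = defaultdict(int)  # count of names per (team, position)
sum_pivot = defaultdict(int)    # sum of salaries per (team, position)

for name, team, position, salary in players:
    count_pivot[(team, position)] += 1
    sum_pivot[(team, position)] += salary

for (team, position), total in sorted(sum_pivot.items()):
    print(team, position, "count:", count_pivot[(team, position)], "sum:", total)
```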

In Google Sheets, pivot tables can be found in the Insert menu:

1. Select the data
2. Click Insert and then Pivot table
3. Choose New sheet or Existing sheet and click Create
4. Select values for Rows, Columns, and Values

In Excel, you can name a cell by selecting the relevant cell, and then in the "Formulas" tab, select
"Define Name". You can then change the name (if desired) or update which cells the name
applies to.

If you want to name a range, select the range of data (including labels), and instead select
"Create From Selection".

This lets you make the addresses in a formula clearer, where J2 could be
replaced by apple_price, for example.

Note: If you are using Google Sheets, you can find Named Ranges under the Data menu,
similar to Pivot Tables.
