Lesson 2 Notes
Diploma in Data Analysis
Exploring Data
DATA ANALYSIS
Contents
Lesson objectives
Introduction
Lesson 1 recap
Data types
Descriptive statistics
Visualisations
Conclusion
Lesson Objectives
• Understand the different data types: categorical vs numerical
• Learn about some basic descriptive statistics and how to use them
Lesson Introduction
Exploring data can be likened to detective work. You need
to understand each variable before you can start visualising
and modelling the data. What do the observations represent?
What do the variables represent? Is all the data useful for our
analysis? Let’s continue our journey into the world of data
analysis to answer these questions.
Did you know that bad data costs US businesses alone $50
billion annually? This shows us the importance of cleaning and
investigating data properly before analysing it. It is important
to assess the factors that will have an impact on the quality of
data. These factors can include study design, errors, sample
size, reliability, and validity of the data, to name a few. In this
course, we will learn, among other things, how to make sure that
our data is of the highest quality.
Lesson 1 Recap
What did we learn about data in the previous lesson? Data is observations or
measurements (unprocessed or processed) represented as text, numbers, or
multimedia. It can be classified as qualitative or quantitative.
Data types
Data types are an important concept because each statistical method can only be used with certain data types. You must
analyse continuous data differently from categorical data, otherwise the analysis will be wrong. Knowing which types
of data you are dealing with therefore enables you to choose the correct method of analysis.
Having a good understanding of the different data types, also called measurement scales, is a crucial prerequisite for
doing Exploratory Data Analysis (EDA), since certain statistical measurements can only be used for specific data types.
You also need to know which data type you are dealing with to choose the right visualisation method. Think of data
types as a way to categorise different types of variables. We will discuss the main types of variables and look at an
example for each.
[Diagram: Data types, split into Categorical and Numerical]
Categorical data
Categorical variables represent types of data which may be divided into groups. Examples include race, sex, age group,
and educational level.
Nominal data
Nominal data is used for naming or labelling variables, without any quantitative value. It is qualitative (non-numeric)
and its categories are mutually exclusive. It always contains at least two categories, and these categories are not
ordered or ranked. You can code categories (for example, assign each one a number) to make analysis easier and use
visual representations like a pie chart or bar graph to explain your data (more on this later). Examples include gender
and marital status.
Ordinal data
Ordinal data is like nominal data in that it is also qualitative, but its categories are ordered or ranked: each item is
greater than or less than its neighbour. This information can be visually represented by a box plot (once again, more
on this later). Examples include marathon rankings for athletes.
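The difference between nominal and ordinal data can be sketched in a few lines of Python. This is only an illustration: the sample values, the `Education` class and the numeric codes are made up for the example. The codes given to the nominal labels are arbitrary identifiers, while the codes given to the ordinal levels carry a real order.

```python
from collections import Counter
from enum import IntEnum

# Nominal data: labels with no order. The codes below are arbitrary identifiers,
# so comparing them with < or > would be meaningless.
marital_status = ["single", "married", "married", "divorced", "single", "married"]
codes = {label: code for code, label in enumerate(sorted(set(marital_status)))}
coded = [codes[label] for label in marital_status]
counts = Counter(marital_status)  # counting categories is a valid summary

# Ordinal data: labels with a meaningful order, so comparisons make sense.
class Education(IntEnum):  # hypothetical levels for illustration
    PRIMARY = 1
    SECONDARY = 2
    TERTIARY = 3

levels = [Education.TERTIARY, Education.PRIMARY, Education.SECONDARY]
ranked = sorted(levels)  # sorting is only meaningful because the levels are ordered

print(codes)
print(coded)
print(counts.most_common(1))
print([level.name for level in ranked])
```

Counting how often each category occurs works for both kinds of data, but sorting or ranking only makes sense for the ordinal one.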
Numerical data
Discrete data
• Can only assume certain values
• Example: The number of people taking this course can only be 0, 1, 2, etc.
• Can only be a whole number, not a fraction (i.e. you can’t have half a person taking this class)
Continuous data
• Can assume any value in a specified interval
• Example: Height, weight, or any set of measurements
• Can be fractional (i.e. your height is not either 1m or 2m, it can be 1.65m)
Interval data
• Quantitative (numeric)
• Has unit of measure
• Has no meaningful zero point (absolute zero)
• Includes greater than or smaller than rankings
• Example: Temperature or time
• On the Celsius scale, for example, zero degrees does not mean the complete absence of heat; it is simply a
convenient starting point.
Ratio data
• Quantitative (numeric)
• Includes greater than or less than measurements
• Has unit of measurement
• Has an absolute zero point, i.e. a value of zero means that the thing being measured is absent
• Example: Age, height, weight
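A short Python sketch can show why the zero point matters. The numbers are made up, and the Kelvin conversion is brought in only for illustration (Kelvin is a temperature scale that does have an absolute zero). On an interval scale, differences are meaningful but ratios are not; on a ratio scale, both are.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful.
celsius_a, celsius_b = 10.0, 20.0

def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin (a scale with an absolute zero)."""
    return celsius + 273.15

difference = celsius_b - celsius_a   # differences are meaningful: 10 degrees warmer
naive_ratio = celsius_b / celsius_a  # 2.0, but 20 degrees is NOT "twice as hot"
kelvin_ratio = celsius_to_kelvin(celsius_b) / celsius_to_kelvin(celsius_a)

# Ratio scale: weight has a true zero, so ratios are meaningful.
weight_a, weight_b = 40.0, 80.0      # kilograms
weight_ratio = weight_b / weight_a   # 80 kg really is twice as heavy as 40 kg

print(difference, naive_ratio, round(kelvin_ratio, 3), weight_ratio)
```

Measured from absolute zero, the second temperature is only about 3.5% higher than the first, which is why "twice as hot" is not a valid statement for Celsius values.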
Descriptive statistics
Why do you need to know more about descriptive statistics?
Descriptive statistics give a simpler summary of the data than the raw data itself. Often this summary is more
meaningful to the reader. Descriptive statistics can also highlight relationships between variables (e.g. the
correlation between variables, but more on that later in Module 2!).
What are the different ways that you can describe the data?
Measures of location, or central tendency, tell us where the centre of the data lies.
Mean
The mean is the most widely used measure of location or central tendency. There are different types of mean. The
most common is the arithmetic mean (also known as the ‘average’), which is equal to the sum of all observations
divided by the number of observations. Other types include the geometric mean and the harmonic mean, which are
used in areas such as finance and investing. The mean is affected by extreme outliers, and this can lead to a
misrepresentation of the data. It can also be used to assess the average of a series over time.
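As a rough illustration, the different means can be computed with Python's built-in `statistics` module. All values below are hypothetical, chosen only to show how a single outlier pulls the arithmetic mean away from the bulk of the data.

```python
import statistics

# Hypothetical salaries, in thousands
salaries = [30, 32, 35, 38, 40]

# Arithmetic mean: sum of all observations divided by the number of observations
arithmetic_mean = sum(salaries) / len(salaries)

# One extreme outlier drags the mean far above the typical salary
with_outlier = salaries + [400]
mean_with_outlier = sum(with_outlier) / len(with_outlier)

# Geometric mean: suited to growth factors (e.g. +10%, +20%, -10% per year)
growth_factors = [1.10, 1.20, 0.90]
geometric = statistics.geometric_mean(growth_factors)

# Harmonic mean: suited to rates (e.g. 40 km/h out and 60 km/h back over the same distance)
speeds = [40, 60]
harmonic = statistics.harmonic_mean(speeds)

print(arithmetic_mean, round(mean_with_outlier, 2), round(geometric, 4), round(harmonic, 2))
```

With the outlier included, the mean is higher than five of the six salaries, which is the kind of misrepresentation described above.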
Median
The median is the middle number in a list of numbers sorted in ascending or descending order, and it can be more
descriptive of the data set than the mean. If the list has an even number of values, the median is the average of the
two middle values. The median is sometimes used instead of the mean when there are outliers in the sequence that
might skew the average of the values.
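The following sketch, using made-up salary figures and Python's `statistics` module, compares how the median and the mean react when one extreme value is added.

```python
import statistics

# Hypothetical salaries, in thousands
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]

median_plain = statistics.median(salaries)             # middle of 5 sorted values
median_with_outlier = statistics.median(with_outlier)  # average of the 2 middle values

mean_plain = statistics.mean(salaries)
mean_with_outlier = statistics.mean(with_outlier)

print(median_plain, median_with_outlier)
print(mean_plain, round(mean_with_outlier, 2))
```

The median barely moves when the outlier is added, while the mean more than doubles.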
Mode
The mode is the most commonly occurring observation, that is, the observation with the greatest frequency. It is also
known as the “modal observation”. If we made a histogram (to visually represent our data), the mode would be the
value with the most observations. Data can have one mode, more than one mode, or none at all. The mode is not
affected by extreme outliers and is extremely useful for qualitative information. Unlike the mean, it is usually not
based on all the data.
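A minimal sketch of finding the mode with Python's `statistics` module is shown here. The sample lists are invented for the example. Note that the mode works for qualitative data, where a mean or median could not be calculated.

```python
import statistics

# Qualitative data: the mode is the most frequent category
transport = ["car", "bus", "car", "train", "car", "bus"]
most_common = statistics.mode(transport)

# Data with two modes: multimode returns every value tied for the highest frequency
shoe_sizes = [5, 6, 6, 7, 8, 8, 9]
modes = statistics.multimode(shoe_sizes)

# When every value appears exactly once, all values tie;
# such data is often described as having no mode at all
all_different = [1, 2, 3]
tied = statistics.multimode(all_different)

print(most_common, modes, tied)
```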
Range
• Range is the difference between the maximum and minimum value of the data set.
• Example: Difference between the highest and lowest marks for a quiz.
Variance
• Measure that shows the spread between numbers in a data set and tells us how far each number in the set is from
the mean (i.e. it measures variability from the mean/average)
• Cannot be a negative value, because it is calculated from squared differences
• Variance of zero indicates that all values in the set are identical
• Also takes outliers into account (which is not always good)
• Treats all variations from the mean the same regardless of direction (negative or positive)
• Difficult to interpret because it is expressed in squared units, so we usually use the standard deviation instead
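The range and variance described above can be worked through by hand in Python. This is a sketch under one assumption: it uses the population variance (dividing by the number of observations), whereas sample data is often divided by one less than that. The quiz marks are made up. The standard deviation is taken as the square root of the variance.

```python
import math
import statistics

marks = [4, 6, 8, 10, 12]  # hypothetical quiz marks

# Range: difference between the maximum and minimum value
data_range = max(marks) - min(marks)

# Variance: average of the squared distances from the mean
mean = sum(marks) / len(marks)
squared_differences = [(mark - mean) ** 2 for mark in marks]
variance = sum(squared_differences) / len(marks)

# Standard deviation: square root of the variance, in the same units as the data
std_dev = math.sqrt(variance)

# Identical values give a variance of zero
identical_variance = statistics.pvariance([5, 5, 5])

print(data_range, variance, round(std_dev, 3), identical_variance)
```

Because every difference is squared, values below and above the mean contribute in the same way and the result can never be negative.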
Standard deviation
Skewness
Visualisations
Many people find it difficult (or even ‘boring’) to look at raw data (i.e. data that has
not been organised numerically). Raw data is not very informative or interesting on its
own, so creating visualisations helps us to sketch a better picture of the data through
graphical representations. This makes the information easier to understand and helps us
to tell the data’s story. Visualisation highlights trends and patterns (e.g. the
distribution of the data) as well as other useful information (like outliers).
Please note: There are many different types of visualisations, and we will only be introducing a few types in this lesson.
Go ahead and do some research on your own and tell us what you have found!
Pie chart
Pie charts can be used to show the differences between groups based on one variable (e.g. as percentages or
proportions). The data must add up to 100%, or the whole of the sample total. No more than 5 categories should be
displayed in a pie chart, as more than this will overwhelm the reader. A limitation of a pie chart is that it does not
show change over time.
Example: The proportion of students who have achieved different levels or grades on an assessment.
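Before drawing a pie chart, the counts in each category need to be turned into shares of the whole. The sketch below does this for a made-up list of grades and checks the two rules mentioned above (the shares add up to 100% and there are no more than 5 categories). A plotting tool would then draw one slice per category.

```python
from collections import Counter

# Hypothetical grades for ten students
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]

counts = Counter(grades)
total = sum(counts.values())

# Share of the whole for each grade, as a percentage
percentages = {grade: count * 100 / total for grade, count in sorted(counts.items())}

adds_up = abs(sum(percentages.values()) - 100) < 1e-9  # slices must cover the whole pie
few_enough = len(percentages) <= 5                     # readability guideline from the text

for grade, share in percentages.items():
    print(f"{grade}: {share:.0f}%")
```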
Bar chart
A bar chart is a graphical display of categorical (or qualitative) variables. The bar’s height indicates frequency.
Examples: Bar graphs can be good for plotting data that spans a length of time (e.g. for comparing achievement
between the beginning and the end of the year) or they can be used for comparing different items in a related
category (e.g. achievement results for different classes).
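The idea that a bar's size indicates frequency can be sketched without any plotting library. In this example (the data and the `text_bar_chart` helper are invented for illustration) each bar is drawn horizontally, so its length rather than its height shows the frequency.

```python
from collections import Counter

# Hypothetical survey answers: how students travel to class
transport = ["bus", "car", "car", "walk", "bus", "car", "bike", "car"]
frequencies = Counter(transport)

def text_bar_chart(freqs):
    """Return one text line per category, longest bar first."""
    width = max(len(label) for label in freqs)
    lines = []
    for label, count in sorted(freqs.items(), key=lambda item: -item[1]):
        lines.append(f"{label.ljust(width)} | {'#' * count} {count}")
    return lines

chart = text_bar_chart(frequencies)
print("\n".join(chart))
```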
Box plot
A box plot is a form of visualisation that graphically displays measures of central tendency and spread. Box plots
give a good indication of the variability or dispersion of data (i.e. they tell us how the data is spread) and they
don’t take up as much space as a histogram (which makes them useful when comparing many groups). Box plots also
make outliers easy for the reader to see.
Five-number summary
A five-number summary is especially useful in descriptive analyses or during the preliminary investigation of a large
data set. A summary consists of five values: the most extreme values in the data set (the maximum and minimum
values), the lower (Q1) and upper quartiles (Q3), and the median.
• Median (also called Q2 or the 50th percentile) = the middle value of the data set.
• First quartile (also called Q1 or the 25th percentile) = the middle number between the smallest number (not the
“minimum”) and the median of the data set.
• Third quartile (also called Q3 or the 75th percentile) = the middle value between the median and the highest value
(not the “maximum”) of the data set.
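One way to compute a five-number summary is sketched below. It follows the description above by taking Q1 as the median of the lower half of the sorted data and Q3 as the median of the upper half. This is an assumption about the method: several quartile conventions exist, and software packages may give slightly different values. The function name `five_number_summary` is made up for the example, and it needs at least two values.

```python
import statistics

def five_number_summary(values):
    """Return min, Q1, median, Q3 and max using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd number of values, the median itself is left out of both halves
    upper = data[half + 1:] if n % 2 else data[half:]
    return {
        "min": data[0],
        "q1": statistics.median(lower),
        "median": statistics.median(data),
        "q3": statistics.median(upper),
        "max": data[-1],
    }

summary = five_number_summary([13, 1, 7, 3, 11, 5, 9])
print(summary)
```

These five values are the ones a box plot draws: the box runs from Q1 to Q3 with a line at the median.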
Histogram
A histogram, also called a frequency histogram, graphically represents a frequency distribution. It is useful for
large data sets involving quantitative variables and, because the data is continuous, there are no gaps between the
bars (unlike a bar chart, whose bars are separated by gaps).
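The frequency distribution behind a histogram can be built by grouping values into equal-width intervals (bins) and counting how many fall into each. The heights and the bin width of 10 cm below are arbitrary choices for illustration. Empty bins are kept so that the intervals cover the whole range without gaps.

```python
from collections import Counter

# Hypothetical heights in centimetres
heights_cm = [150, 152, 158, 161, 165, 165, 170, 172, 178, 181, 185, 203]
bin_width = 10

# Assign each height to the start of its bin, e.g. 158 -> 150
counts = Counter((height // bin_width) * bin_width for height in heights_cm)

# Keep every bin between the first and the last, including empty ones
first_bin, last_bin = min(counts), max(counts)
histogram = {
    start: counts.get(start, 0)
    for start in range(first_bin, last_bin + bin_width, bin_width)
}

for start, count in histogram.items():
    print(f"{start}-{start + bin_width - 1} | {'#' * count}")
```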
Deceiving visualisations
Some types of visualisation do not represent the data accurately. This is covered in more detail in the final demo in
your lesson, so do pop on over to your recording to watch that section again.
Conclusion
There are many forms of data, and the type of data you collect and how you visually represent it will depend on
what kind of outcomes you are looking to achieve. All data is useful and learning how to best represent this data is
an important part of being a good data analyst. If readers are unable to understand the data, the data loses its value.