02a EDA and Data Visualization

The document provides an outline and overview of an exploratory data analysis and data visualization training held by the Indonesian Ministry of Finance's Directorate of Information Systems and Technology in July 2022. It includes sections on exploratory data analysis, descriptive statistics, data visualization principles and tools. Specific topics covered include central tendency, measures of variation, outliers, correlation, skewness, kurtosis, and examples of effective and ineffective data visualization.

Uploaded by

Wildan A Yahya

Direktorat SITP DJP Kemenkeu RI

Data Science Training, July 2022

EDA & Data Visualization

Instructor: Siti Aminah, M.Kom


Faculty of Computer Science, Universitas Indonesia
Outline

Exploratory Data Analysis

Descriptive Statistics

Data Visualization

Data Visualization Principles

Basic Visualization Tools

Specialized Visualization Tools

Advanced Visualization Tools

2
CRISP-DM: Cross-industry standard process for data mining

Kenneth Jensen, CC BY-SA 3.0, via Wikimedia Commons

3
Exploratory Data Analysis

4
EDA: Take a peek at data
• EDA is a term for an initial analysis done with datasets.
• It's basically taking a peek at the data to understand more about what it represents and how to use it.
• It's often a precursor to more advanced data analytics techniques.

5
Exploratory data analysis (EDA)

Exploratory data analysis (EDA) is an approach:

• to analyzing datasets
• by summarizing their main characteristics
• often with visual methods.

The term EDA was coined by John W. Tukey in the book "Exploratory Data Analysis" in 1977.

Why EDA?
• We need to familiarize ourselves with a new dataset: what does it look like?
• How many attributes, and of what kind?
• Are there any missing values?
• How are the values distributed?
• Is our dataset imbalanced? (If left untreated, our model can become biased.)

• Hunting for something interesting: what catches your eye?
• Are there any outliers?
• Are there any correlations between attributes?
• How do the distributions compare between different samples?

7
EDA approach

• Descriptive statistics
  • Central tendency
  • Measure of variation
  • Skewness & kurtosis
  • Correlations

• Data visualizations
  • Single attribute (univariate analysis): bar charts, histograms, pie charts, donut charts
  • Multiple attributes (multivariate analysis): scatter plots, bubble charts, line charts, heat maps

8
Descriptive Statistics

9
Know Your Data
• Categorical vs. Numerical

Basic Artificial Intelligence and Data Science | Even Semester 2021/2022 | Faculty of Computer Science, Universitas Indonesia 10
Know Your Data
• Nominal, Ordinal, Interval, Ratio

11
Central Tendency

• Mean
• Median
• Mode
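
As a minimal sketch, the three measures of central tendency can be computed with Python's standard `statistics` module. The sample values are borrowed from the IQR exercise in this deck and are used here only for illustration.

```python
import statistics

# Sample values borrowed from the IQR exercise in this deck.
data = [5, 7, 4, 4, 6, 2, 8]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

The mean uses every value and is therefore sensitive to extreme observations, while the median and mode are not.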

12
Measure of Variation

• Range
• IQR
• Variance, standard deviation

13
Measure of Variation
Range

Range = max - min

The simplest measure of variation, often reported by stating the smallest and largest values separately.

14
Measure of Variation
Inter-Quartile Range (IQR)

Divides a dataset into quartiles:

• Q1 (lower quartile): 25th percentile
  - Median of the lower half
• Q2 (median): 50th percentile
• Q3 (upper quartile): 75th percentile
  - Median of the upper half

IQR = Q3-Q1
15
Measure of Variation
Inter-Quartile Range (IQR)

• From the data (n = 7): 5, 7, 4, 4, 6, 2, 8

• Q1 = ?
• Q2 = ?
• Q3 = ?

• IQR = ?
• Range = ?
16
Measure of Variation
Inter-Quartile Range (IQR)

• From the data (n = 7): 5, 7, 4, 4, 6, 2, 8
• Sorted: 2, 4, 4, 5, 6, 7, 8

• Q1 = median of lower half = 4
• Q2 = 5
• Q3 = median of upper half = 7

• IQR = Q3 - Q1 = 3
• Range = max - min = 6
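
The worked example above can be reproduced with a short sketch. The helper `quartiles` is a hypothetical name; it implements the median-of-halves method used on the slide, where the overall median is left out of both halves when n is odd.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values before the median position
    upper = ordered[(n + 1) // 2 :]  # values after the median position
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

data = [5, 7, 4, 4, 6, 2, 8]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
value_range = max(data) - min(data)
print(q1, q2, q3, iqr, value_range)  # 4 5 7 3 6
```

Library routines for quantiles often interpolate between data points by default, so they can return slightly different quartiles for small samples like this one.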

17
Outliers
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

Before abnormal observations can be singled out, it is necessary to characterize normal observations.

Outliers, according to the IQR rule, are data points whose values are:
• less than Q1 - 1.5*IQR, or
• more than Q3 + 1.5*IQR
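
A minimal sketch of the IQR rule follows. The function name `iqr_outliers` and the sample (the earlier exercise data plus one extreme value) are assumptions made for illustration; the quartiles use the same median-of-halves method as the IQR slides.

```python
import statistics

def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = statistics.median(ordered[: n // 2])
    q3 = statistics.median(ordered[(n + 1) // 2 :])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical sample: the exercise data plus one extreme value.
sample = [5, 7, 4, 4, 6, 2, 8, 30]
print(iqr_outliers(sample))  # [30]
```

Here Q1 = 4 and Q3 = 7.5, so the fences are -1.25 and 12.75, and only 30 falls outside them.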

18
Variance & Standard Deviation

Variance = average of the squared deviations of the observations from the mean

Standard deviation s = square root of the variance
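
As a hedged illustration, Python's `statistics` module offers both conventions for these quantities: the population form divides by n (the plain "average" in the definition above), while the sample form divides by n - 1. The data values are hypothetical and were chosen so that the population results are round numbers.

```python
import statistics

# Hypothetical sample with mean 5, chosen for round results.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(data)  # divides by n
population_stdev = statistics.pstdev(data)        # square root of the above
sample_variance = statistics.variance(data)       # divides by n - 1
sample_stdev = statistics.stdev(data)             # square root of the above

print("population:", population_variance, population_stdev)
print("sample:", sample_variance, sample_stdev)
```

The squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7.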

19
Skewness & Kurtosis
• Skewness
A measure of asymmetry

• Kurtosis
A measure of how heavy the tails are, and therefore of how prone the data are to outliers

20
Skewness

• Skewness is a measure of asymmetry of the data around the mean.

• When a distribution is skewed, the mode remains the most commonly occurring value, the median remains the middle value in the distribution, but the mean is generally 'pulled' in the direction of the longer tail.
21
Skewness

• where xᵢ is each data point, x̄ is the arithmetic mean, n is the size of the data, and s is the standard deviation.

• The skewness for a normal distribution is zero, and any symmetric data should have skewness near zero. Negative values for the skewness indicate data that are skewed left, and positive values indicate data that are skewed right.
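
A minimal sketch of a moment-based skewness, using the symbols defined above. It assumes the common form in which the mean cubed deviation is divided by s cubed, with s computed by dividing by n; the formula on the slide may use a different normalisation. The sample lists are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by s**3.

    Assumption: s is the population standard deviation (divide by n).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
left_skewed = [-x for x in right_skewed]

print(skewness(symmetric))     # zero for symmetric data
print(skewness(right_skewed))  # positive: long tail to the right
print(skewness(left_skewed))   # negative: long tail to the left
```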

22
Kurtosis

• High kurtosis indicates the presence of outliers!

https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/

23
Kurtosis

• where xᵢ is each data point, x̄ is the arithmetic mean, n is the size of the data, and s is the standard deviation.

• A normal distribution has kurtosis exactly 3 (mesokurtic).
• A distribution with kurtosis < 3 is called platykurtic.
• A distribution with kurtosis > 3 is called leptokurtic.
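
A minimal sketch of a moment-based kurtosis and the three labels above. It assumes the non-excess form (fourth central moment divided by the squared variance, both dividing by n), which is the convention under which a normal distribution scores 3. The function names and sample lists are hypothetical.

```python
def kurtosis(values):
    """Non-excess kurtosis: fourth central moment divided by variance squared."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance (divide by n)
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

def classify(k):
    """Label a kurtosis value relative to the normal distribution's 3."""
    if k > 3:
        return "leptokurtic"
    if k < 3:
        return "platykurtic"
    return "mesokurtic"

# Hypothetical samples for illustration.
flat = [-1, 1, -1, 1]          # no tails at all
with_outlier = [0] * 9 + [10]  # one extreme value

print(kurtosis(flat), classify(kurtosis(flat)))
print(kurtosis(with_outlier), classify(kurtosis(with_outlier)))
```

Some libraries report excess kurtosis (this value minus 3) by default, so it is worth checking which convention a given tool uses before comparing against 3.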

24
(Pearson) Correlation
• It is a technique to investigate the relationship between two variables: it measures the strength of the linear association between the two variables.
• Pearson's correlation coefficient (r) is one type of correlation coefficient.
• The correlation coefficient returns a value between -1 and 1:
  • -1 denotes the strongest negative correlation
  • 0 denotes no linear correlation
  • 1 denotes the strongest positive correlation
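
A minimal sketch of Pearson's r, computed as the covariance divided by the product of the standard deviations. The function name `pearson_r` and the data are hypothetical; the third case shows that r can be 0 even when the variables are clearly related, because r only captures linear association.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
increasing = [2 * x + 1 for x in xs]   # perfect positive linear relationship
decreasing = [-x for x in xs]          # perfect negative linear relationship
parabola = [(x - 3) ** 2 for x in xs]  # related, but not linearly

print(pearson_r(xs, increasing))
print(pearson_r(xs, decreasing))
print(pearson_r(xs, parabola))
```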

25
(Pearson) Correlation

What is the correlation value (Pearson's r) for each of these plots?

26
(Pearson) Correlation

27
Data Visualization

28
29
More Examples

• The famous Gapminder video, Hans Rosling: 200 Countries, 200 Years, 4 Minutes:
https://www.youtube.com/watch?feature=player_embedded&v=jbkSRLYSojo

• NY Times Interactive Visualizations (e.g., 2013 Federal Budget)
http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html

30
Why Data Visualization?
01 BUILD VISUALS: Enables exploratory data analysis
02 BUILD VISUALS: Communicate data clearly
03 BUILD VISUALS: Share unbiased representation of data
04 BUILD VISUALS: Support recommendation to different stakeholders

31
Three Key Points of Build Visuals
by DARKHORSE ANALYTICS

LESS IS more effective: any feature or design you incorporate in your plot to make it more attractive or pleasing should support the message that the plot is meant to get across and not distract from it.

32
Look at this figure

33
• Features such as the blue background or the 3D orientation are not meant to convey anything.
• In fact, these additional unnecessary features distract from the main message and can be confusing to the audience.

34
• The message here is that people are most likely to choose bacon over other types of pig meat, so let's get rid of everything that can distract from this core message.
• The result is simpler, cleaner, less distracting, and much easier to read.

35
36
• The proportion of each pie is wrong.
• Unnecessary sky background.

37
38
• Are you sure the internet users are only 1/3rd of the total population?

39
BASIC VISUALIZATION TOOLS

• LINE PLOTS
• AREA PLOTS
• HISTOGRAMS
• BAR CHARTS

40
LINE PLOTS

• A line plot is a plot in the form of a series of data points connected by straight line segments.
• The best use case for a line plot is when you have a continuous dataset and you're interested in visualizing the data over a period of time.

41
LINE PLOTS

• For example, say we're interested in the trend of immigrants from Haiti to Canada.
• We can generate a line plot, and the resulting figure will depict the trend of Haitian immigrants to Canada from 1980 to 2013.

• Based on the line plot, we can then look for explanations of obvious anomalies or changes.
• From the previous plot, we see that there is a spike in immigration from Haiti to Canada in 2010.
• A quick Google search for major events in Haiti in 2010 returns the tragic earthquake that took place that year, so this influx of immigration to Canada was mainly due to that earthquake.

42
AREA PLOTS

• An area plot (also known as an area chart or area graph) depicts accumulated totals using numbers or percentages over time.
• It is based on the line plot and is commonly used when trying to compare two or more quantities.

43
AREA PLOTS

44
HISTOGRAMS

• A histogram is a way of representing the frequency distribution of a numeric dataset.
• It takes as input one numerical variable.
• The variable is cut into several bins, and the number of observations per bin is represented by the height of the bar.
• To construct a histogram, the first step is to "bin" the range of values (that is, divide the entire range of values into a series of intervals) and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
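
The binning step described above can be sketched in a few lines. The function name `histogram_counts` and the measurements are hypothetical, and the sketch assumes equal-width bins and data that are not all identical. The counts are printed as rows of `#` characters, a crude text version of the bars.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins  # assumes high > low
    counts = [0] * num_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements, for illustration only.
heights = [150, 152, 155, 158, 160, 161, 163, 165, 165, 168, 170, 172, 175, 180]

for count in histogram_counts(heights, 3):
    print("#" * count)
```

Every value lands in exactly one bin, so the counts always add up to the number of observations.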

45
HISTOGRAMS

https://id.wikipedia.org/wiki/Histogram#/media/Berkas:Black_cherry_tree_histogram.svg
46
HISTOGRAMS
Histograms with different bin sizes

https://chartio.com/learn/charts/histogram-complete-guide/
47
HISTOGRAMS

The number of bins needs to be:
• large enough to reveal interesting features;
• small enough not to be too noisy.

Choice of bin size has an inverse relationship with the number of bins.
• The larger the bin size, the fewer bins there will be to cover the whole range of data.
• The smaller the bin size, the more bins there will need to be.
• It is worth taking some time to test out different bin sizes to see how the distribution looks in each one, then choose the plot that represents the data best.
https://chartio.com/learn/charts/histogram-complete-guide/
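
The inverse relationship between bin size and number of bins can be made concrete with a small sketch. The helper `number_of_bins` and the measurements are hypothetical; the helper simply divides the data range by the bin width and rounds up.

```python
import math

def number_of_bins(values, bin_width):
    """Number of equal-width bins needed to cover the range of the data."""
    data_range = max(values) - min(values)
    return max(1, math.ceil(data_range / bin_width))

# Hypothetical measurements, for illustration only.
heights = [150, 152, 155, 158, 160, 161, 163, 165, 165, 168, 170, 172, 175, 180]

for width in (10, 5, 2):
    print("bin width", width, "->", number_of_bins(heights, width), "bins")
```

With a range of 30, widths of 10, 5, and 2 give 3, 6, and 15 bins respectively.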
48
HISTOGRAMS
Use a zero-valued baseline

https://chartio.com/learn/charts/histogram-complete-guide/
49
HISTOGRAMS

Updating a Histogram with Colors

https://matplotlib.org/3.3.4/gallery/statistics/hist.html
50
BAR CHARTS

• Unlike a histogram, a bar chart (also known as a bar graph) is a type of plot where the length of each bar is proportional to the value of the item that it represents.
• It is commonly used to compare the values of a variable at a given point in time.

51
52
BAR CHARTS

Single Bar Chart

https://www.open.edu/openlearn/mod/oucontent/view.php?id=90853&extra=thumbnailfigure_idm333
53
BAR CHARTS

Dual Bar Chart

https://www.open.edu/openlearn/mod/oucontent/view.php?id=90853&extra=thumbnailfigure_idm333
54
BAR CHARTS

Stacked Bar Chart

https://www.open.edu/openlearn/mod/oucontent/view.php?id=90853&extra=thumbnailfigure_idm333
55
BAR CHARTS

Horizontal Bar Chart

What are its advantages?

https://matplotlib.org/3.4.0/gallery/statistics/barchart_demo.html#sphx-glr-gallery-statistics-barchart-demo-py
56
BAR CHARTS

Percentiles as Horizontal Bar Chart

https://matplotlib.org/3.4.0/gallery/statistics/barchart_demo.html#sphx-glr-gallery-statistics-barchart-demo-py
57
Histogram vs Bar Chart

58
SPECIALIZED VISUALIZATION TOOLS

• PIE CHARTS
• BOX PLOTS
• SCATTER PLOTS
• HEAT MAPS

59
PIE CHARTS

• A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportion.
• The input data you must provide is an array of numbers, where each number is mapped to one slice of the pie.
Source: https://www.python-graph-gallery.com/pie-plot-matplotlib-basic

60
PIE CHARTS

Multiple pie charts to show changes in parts-to-whole relationship

Source: https://www.jmp.com/en_us/statistics-knowledge-portal/exploratory-data-analysis/pie-chart.html
61
PIE CHARTS

• Some people advise against using pie charts.
• Graphs of data should tell us about the quantities involved and help us to make accurate comparisons between these quantities. The quantities in each category should be easy to estimate and the category labels should be clear.
• Pies and doughnuts fail because:
  • Quantity is represented by slices; humans aren't particularly good at estimating quantity from angles, which is the skill needed.
  • Matching the labels and the slices can be hard work.
  • Small percentages (which might be important) are tricky to show.

Source: https://www.jmp.com/en_us/statistics-knowledge-portal/exploratory-data-analysis/pie-chart.html
62
PIE CHARTS

• You need to add the percentage to every slice.
• You need to directly label every slice.
• You have run out of colors for the slices.
• You decide to explode the chart to solve your first three problems.

https://scc.ms.unimelb.edu.au/resources/data-visualisation-and-exploration/no_pie-charts
63
PIE CHARTS

https://scc.ms.unimelb.edu.au/resources/data-visualisation-and-exploration/no_pie-charts
64
BOX PLOTS

• A box plot is a way of statistically representing the distribution of given data through five main dimensions:
  • The first dimension is the minimum of the data.
  • The second dimension is the first quartile.
  • The third dimension is the median.
  • The fourth dimension is the third quartile.
  • And the final dimension is the maximum of the data.
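
The five dimensions listed above are often called the five-number summary. A minimal sketch follows; the function name `five_number_summary` is hypothetical, and the quartiles use the median-of-halves method from the IQR slides, applied to the same exercise data.

```python
import statistics

def five_number_summary(values):
    """Return the five values a box plot is drawn from."""
    ordered = sorted(values)
    n = len(ordered)
    return {
        "minimum": ordered[0],
        "first quartile": statistics.median(ordered[: n // 2]),
        "median": statistics.median(ordered),
        "third quartile": statistics.median(ordered[(n + 1) // 2 :]),
        "maximum": ordered[-1],
    }

summary = five_number_summary([5, 7, 4, 4, 6, 2, 8])
for name, value in summary.items():
    print(name, "=", value)
```

The box spans the first to the third quartile with a line at the median. Plotting libraries commonly end the whiskers at the most extreme points within 1.5*IQR of the box and draw anything beyond as individual outlier points, rather than always extending to the minimum and maximum.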

65
BOX PLOTS

66
SCATTER PLOTS

• A scatter plot is a type of plot that displays the values of (typically) two variables against each other.
• Usually a dependent variable is plotted against an independent variable in order to determine whether any correlation exists between the two variables.

https://www.data-to-viz.com/graph/scatter.html

67
SCATTER PLOTS

https://www.data-to-viz.com/graph/scatter.html

68
HEAT MAPS

• Heatmaps visualise data through variations in colouring.
• When applied to a tabular format, heatmaps are useful for cross-examining multivariate data, by placing variables in the rows and columns and colouring the cells within the table.
• Heatmaps are good for showing variance across multiple variables, revealing any patterns, displaying whether any variables are similar to each other, and detecting whether any correlations exist between them.

69
HEAT MAPS

https://datavizcatalogue.com/methods/heatmap.html
70
ADVANCED VISUALIZATION TOOLS

• WAFFLE CHARTS
• WORD CLOUDS
• BUBBLE PLOTS

71
WAFFLE CHARTS

• A waffle chart is an interesting visualization that is normally created to display progress towards goals.
• As its name suggests, it usually consists of small squares arranged in an M-by-N layout.
• The squares are colored according to the proportions you are aiming to visualize, similar to how you would color different slices of a pie chart.
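
The core computation behind a waffle chart is deciding how many of the M-by-N squares each category receives. A minimal sketch follows; the function name `waffle_tiles`, the category labels, and the counts are all hypothetical. Rounding uses a largest-remainder rule (an assumption, not something the slides prescribe) so that the squares always add up to exactly M*N.

```python
def waffle_tiles(values, rows, cols):
    """Return the number of squares per category for a rows-by-cols waffle chart."""
    total_tiles = rows * cols
    total = sum(values.values())
    exact = {k: v * total_tiles / total for k, v in values.items()}
    tiles = {k: int(x) for k, x in exact.items()}  # round down first
    # Hand the leftover squares to the largest fractional parts.
    leftover = total_tiles - sum(tiles.values())
    by_fraction = sorted(exact, key=lambda k: exact[k] - tiles[k], reverse=True)
    for k in by_fraction[:leftover]:
        tiles[k] += 1
    return tiles

# Hypothetical category counts, labelled with single letters.
counts = {"A": 45, "B": 30, "C": 12}
tiles = waffle_tiles(counts, rows=10, cols=10)

# Print the grid row by row, one letter per square.
flat = [label for label, n in tiles.items() for _ in range(n)]
for r in range(10):
    print("".join(flat[r * 10 : (r + 1) * 10]))
```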

72
WAFFLE CHARTS

https://datascience.stackexchange.com/questions/57603/how-this-visualisation-was-made
73
WORD CLOUDS

• A word cloud is simply a depiction of the importance of different words in a body of text.
• A word cloud works in a simple way: the more often a specific word appears in a source of textual data, the bigger and bolder it appears in the word cloud.
• Assuming that we don't know anything about the content of a set of documents, a word cloud can be very useful for assigning a topic to unknown textual data.
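
The data behind a word cloud is a table of word frequencies, with each word's display size scaled by its count. A minimal sketch follows; the function names, the stopword set, the font-size limits, and the sample sentence are all assumptions for illustration, and the drawing itself is left to a plotting tool.

```python
import re
from collections import Counter

def word_frequencies(text, stopwords=frozenset()):
    """Count lowercase words, skipping any listed stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

def font_sizes(freq, min_size=10, max_size=40):
    """Scale each word's size linearly with its count (most frequent = max_size)."""
    top = max(freq.values())
    return {w: min_size + (max_size - min_size) * c / top for w, c in freq.items()}

# Hypothetical text, for illustration only.
text = "Data visualization helps explore data. Good visualization makes data clear."
freq = word_frequencies(text, stopwords={"helps", "makes"})
sizes = font_sizes(freq)

for word, count in freq.most_common(3):
    print(word, count, sizes[word])
```

Removing very common words (stopwords) before counting keeps words such as "the" or "and" from dominating the cloud.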

74
WORD CLOUDS

75
BUBBLE PLOTS

A bubble plot is a scatterplot where a third dimension is added: the value of an additional variable is
represented through the size of the dots. You need 3 numerical variables as input: one is represented by
the X axis, one by the Y axis, and one by the size.

76
BUBBLE PLOTS

Bubble plots over maps

https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Map_of_earthquakes_in_Indonesia_1900-2019.svg/1280px-Map_of_earthquakes_in_Indonesia_1900-2019.svg.png
77
References & Credits
• Chirag Shah, A Hands-On Introduction to Data Science, Cambridge University Press, 2020
• Data Visualization from IBM Data Science Training Materials and cognitiveclass.ai
• Siti Aminah & Dhimas Arief Darmawan, Data Visualization, lecture slides for the Data Science course, even semester 2020/2021, Faculty of Computer Science, Universitas Indonesia
• Fariz Darari, EDA & Visualization, lecture slides for the Data Science course, odd semester 2020/2021, Faculty of Computer Science, Universitas Indonesia

• Images and screenshots are used for explanatory purposes only.
• Copyright remains with the original owners.

78
Wish You Success


79
