02a EDA and Data Visualization
Descriptive Statistics
Data Visualization
CRISP-DM: Cross-industry standard process for data mining
Exploratory Data Analysis
EDA: Take a peek at data
• EDA is the initial analysis performed on a dataset.
• It amounts to taking a first look at the data to understand what it represents and how it can be used.
• It is often a precursor to more advanced data analytics techniques.
Exploratory data analysis (EDA)
EDA approach
Descriptive Statistics
Know Your Data
• Categorical vs Numerical
• Nominal, Ordinal, Interval, Ratio
Kecerdasan Artifisial dan Sains Data Dasar (Basic Artificial Intelligence and Data Science) | Even Semester 2021/2022 | Fakultas Ilmu Komputer – Universitas Indonesia
Central Tendency
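The three usual measures of central tendency are the mean, the median, and the mode. A minimal sketch using Python's standard `statistics` module follows; the `scores` list is hypothetical data invented for illustration.

```python
import statistics

# Hypothetical sample: exam scores of nine students (not from the slides).
scores = [55, 70, 70, 75, 80, 85, 90, 95, 100]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
```

The mean uses every value and is therefore sensitive to extreme observations, while the median depends only on the ordering of the data.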
Measure of Variation
Range
Inter-Quartile Range (IQR)
IQR = Q3-Q1
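As a rough illustration of the range and of IQR = Q3 - Q1, the sketch below computes both for a small hypothetical dataset. It assumes the "inclusive" quartile convention of `statistics.quantiles`; other conventions exist and can give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical data, sorted here only to make the quartiles easy to read.
data = sorted([7, 2, 9, 4, 12, 5, 4, 11, 8])
# data is now [2, 4, 4, 5, 7, 8, 9, 11, 12]

# n=4 splits the data into quartiles and returns the cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                        # spread of the middle 50% of the data
value_range = max(data) - min(data)  # spread of the whole dataset

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("Range:", value_range)
```

Because the IQR ignores the lowest and highest quarters of the data, it is much less affected by extreme values than the range.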
• Q1 = ?
• Q2 = ?
• Q3 = ?
• IQR = ? Range = ?
Outliers
An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population.
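The definition above does not say how far is "abnormal". One common convention, assumed here and not stated in the slides, flags values lying more than 1.5 × IQR below Q1 or above Q3. The function name `iqr_outliers`, the factor `k`, and the sample data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sample with one value far away from the rest.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 40]
outliers = iqr_outliers(sample)
print("outliers:", outliers)
```

Whether a flagged value is an error or a genuine extreme observation still has to be judged from the context of the data.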
Variance & Standard Deviation
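Variance is the average squared deviation from the mean, and the standard deviation is its square root. The sketch below, on hypothetical data, shows both the population versions (divide by n) and the sample versions (divide by n - 1) from Python's `statistics` module.

```python
import statistics

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)    # population variance, divides by n
pop_std = statistics.pstdev(data)       # population standard deviation
sample_var = statistics.variance(data)  # sample variance, divides by n - 1
sample_std = statistics.stdev(data)     # sample standard deviation

print("population variance / std:", pop_var, pop_std)
print("sample variance / std:", sample_var, sample_std)
```

The standard deviation is expressed in the same unit as the data, which usually makes it easier to interpret than the variance.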
Skewness & Kurtosis
• Skewness
A measure of the asymmetry of a distribution
• Kurtosis
A measure of tail heaviness, i.e. how prone a distribution is to producing outliers
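Both quantities can be computed from central moments. The sketch below assumes the moment-based population definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3); statistical packages may apply bias corrections and report slightly different numbers. All function names and data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    # Zero for symmetric data, positive when the right tail is longer.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Subtracting 3 makes a normal distribution score 0.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", skewness(right_skewed))
print("excess kurtosis (symmetric):", excess_kurtosis(symmetric))
```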
Skewness
Kurtosis
https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/
(Pearson) Correlation
• Correlation analysis investigates the relationship between two variables by measuring the strength of the association between them
• Pearson's correlation coefficient (r) is one type of correlation coefficient; it measures linear association
• r takes a value between -1 and 1
• -1 denotes a perfect negative linear correlation
• 0 denotes no linear correlation
• 1 denotes a perfect positive linear correlation
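A minimal sketch of Pearson's r, written from its textbook formula (the covariance of the two variables divided by the product of their spreads). The function name `pearson_r` and all data are hypothetical. The last example shows that r can be 0 even when the two variables are clearly related, because r only captures linear association.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]    # y = 2x + 1
falling = [10, 8, 6, 4, 2]   # y = 12 - 2x
u_shaped = [4, 1, 0, 1, 4]   # y = (x - 3)^2: related to x, but not linearly

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_u_shaped = pearson_r(x, u_shaped)
print(r_rising, r_falling, r_u_shaped)
```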
Data Visualization
More Examples
Why Data Visualization?
Three Key Points of Build Visuals
by DARKHORSE ANALYTICS
LESS IS more effective
Any feature or design you incorporate in your plot to make it more attractive or pleasing should support the message that the plot is meant to get across and not distract from it.
Look at this figure
• Features such as the blue background or the 3D orientation do not appear to convey anything.
• In fact, these unnecessary features distract from the main message and can confuse the audience.
• The message here is that people are most likely to choose bacon over other types of pig meat, so everything that distracts from this core message should be removed.
• The result is simpler, cleaner, less distracting, and much easier to read.
• The proportions of the pie slices are wrong.
• The sky background is unnecessary.
• Are you sure internet users make up only one third of the total population?
LINE PLOTS
BAR CHARTS
LINE PLOTS
AREA PLOTS
HISTOGRAMS
• To construct a histogram, first “bin” the range of values: divide the entire range into a series of intervals, then count how many values fall into each interval.
• The bins are usually specified as consecutive, non-overlapping intervals of the variable.
• The height of each bar represents the number of observations in its bin.
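The binning step can be sketched in plain Python. The function below assumes equal-width bins in which each bin includes its left edge and only the last bin also includes its right edge; the function name, parameters, and data are hypothetical.

```python
def histogram_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v < low or v > high:
            continue  # values outside the covered range are ignored
        # Index of the bin; the maximum value is placed in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical heights in centimetres.
heights_cm = [150, 152, 155, 158, 160, 161, 163, 165, 168, 170, 171, 175, 180]
counts = histogram_counts(heights_cm, low=150, high=180, n_bins=3)

# A text rendering: each bar's length is the number of observations in the bin.
for i, count in enumerate(counts):
    left = 150 + i * 10
    print(f"{left}-{left + 10}: {'#' * count}")
```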
https://id.wikipedia.org/wiki/Histogram#/media/Berkas:Black_cherry_tree_histogram.svg
Histograms with different bin sizes
https://chartio.com/learn/charts/histogram-complete-guide/
Bin size has an inverse relationship with the number of bins.
• The larger the bin size, the fewer bins are needed to cover the whole range of the data.
• The smaller the bin size, the more bins are needed.
• It is worth taking some time to try different bin sizes, see how the distribution looks with each, and then choose the plot that represents the data best.
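The inverse relationship can be made concrete: for equal-width bins, the number of bins is the data range divided by the bin size, rounded up. A small sketch, with a hypothetical function name and a hypothetical 0 to 100 range:

```python
import math

def number_of_bins(low, high, bin_size):
    """Bins of width bin_size needed to cover the range from low to high."""
    return math.ceil((high - low) / bin_size)

# Hypothetical variable ranging from 0 to 100.
for bin_size in (5, 10, 25):
    print(f"bin size {bin_size:>2}: {number_of_bins(0, 100, bin_size)} bins")
```

Very small bins tend to produce a noisy plot with many near-empty bars, while very large bins can hide the shape of the distribution.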
https://chartio.com/learn/charts/histogram-complete-guide/
Use a zero-valued baseline
https://chartio.com/learn/charts/histogram-complete-guide/
https://matplotlib.org/3.3.4/gallery/statistics/hist.html
BAR CHARTS
• A bar chart, also known as a bar graph, is a type of plot in which the length of each bar is proportional to the value of the item it represents. Unlike a histogram, its bars stand for distinct categories rather than bins of a numerical range.
• It is commonly used to compare the values of a variable at a given point in time.
https://www.open.edu/openlearn/mod/oucontent/view.php?id=90853&extra=thumbnailfigure_idm333
What are its advantages?
Percentiles as Horizontal Bar Chart
https://matplotlib.org/3.4.0/gallery/statistics/barchart_demo.html#sphx-glr-gallery-statistics-barchart-demo-py
Histogram vs Bar Chart
SPECIALIZED VISUALIZATION TOOLS
PIE CHARTS
BOX PLOTS
SCATTER PLOTS
HEAT MAPS
PIE CHARTS
• A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions.
• The input data is an array of numbers, each of which is mapped to one slice of the pie.
Source: https://www.python-graph-gallery.com/pie-plot-matplotlib-basic
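To show how an array of numbers maps to pie slices, the sketch below converts each value into its share of the total and the corresponding angle out of 360 degrees. The function name and data are hypothetical; a plotting function such as matplotlib's `pyplot.pie` accepts this kind of array and performs the normalization itself.

```python
def pie_slices(values):
    """Return a (share, angle in degrees) pair for each input value."""
    total = sum(values)
    return [(v / total, 360.0 * v / total) for v in values]

# Hypothetical counts for four categories.
values = [10, 20, 30, 40]
slices = pie_slices(values)

for value, (share, angle) in zip(values, slices):
    print(f"value {value}: {share:.0%} of the pie, {angle:.0f} degrees")
```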
Source: https://www.jmp.com/en_us/statistics-knowledge-portal/exploratory-data-analysis/pie-chart.html
https://scc.ms.unimelb.edu.au/resources/data-visualisation-and-exploration/no_pie-charts
BOX PLOTS
SCATTER PLOTS
https://www.data-to-viz.com/graph/scatter.html
HEAT MAPS
https://datavizcatalogue.com/methods/heatmap.html
ADVANCED VISUALIZATION TOOLS
WAFFLE CHARTS
WORD CLOUDS
BUBBLE PLOTS
WAFFLE CHARTS
https://datascience.stackexchange.com/questions/57603/how-this-visualisation-was-made
WORD CLOUDS
BUBBLE PLOTS
A bubble plot is a scatter plot with a third dimension added: the value of an additional variable is represented by the size of the dots. It needs three numerical variables as input: one for the X axis, one for the Y axis, and one for the dot size.
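A minimal sketch of a bubble plot with matplotlib, assuming the library is installed. The three variables and the scaling factor of 5 are hypothetical; the `s` argument of `Axes.scatter` sets the marker area in points squared, so the third variable is passed there.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: three numerical variables for four countries.
gdp_per_capita = [1.0, 2.5, 4.0, 6.5]     # X axis
life_expectancy = [58, 66, 72, 79]        # Y axis
population_millions = [10, 50, 120, 30]   # bubble size

# Scale the third variable into marker areas (factor chosen for legibility).
bubble_areas = [p * 5 for p in population_millions]

fig, ax = plt.subplots()
bubbles = ax.scatter(gdp_per_capita, life_expectancy, s=bubble_areas, alpha=0.5)
ax.set_xlabel("GDP per capita (hypothetical units)")
ax.set_ylabel("Life expectancy (years)")
ax.set_title("Bubble size = population")
# fig.savefig("bubble_plot.png")  # uncomment to write the figure to disk
```

Semi-transparent markers (`alpha=0.5`) keep overlapping bubbles readable.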
https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Map_of_earthquakes_in_Indonesia_1900-2019.svg/1280px-Map_of_earthquakes_in_Indonesia_1900-2019.svg.png
References & Credits
• Chirag Shah, A Hands-On Introduction to Data Science, Cambridge University Press, 2020
• Data Visualization from IBM Data Science Training Materials and cognitiveclass.ai
• Siti Aminah & Dhimas Arief Darmawan, Data Visualization, Data Science course slides, Even Semester 2020/2021, Fakultas Ilmu Komputer, Universitas Indonesia
• Fariz Darari, EDA & Visualization, Data Science course slides, Odd Semester 2020/2021, Fakultas Ilmu Komputer, Universitas Indonesia
Wish You Success