
01 - Lesson - Visualization - Jupyter Notebook

Learning Data Science: Visualization

Uploaded by

almamalik
Copyright
© © All Rights Reserved

Data Visualization
Key skill today

“The ability to take data - to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it - that’s going to be a hugely important skill
in the next decades.”

Hal Varian (Google’s Chief Economist) (https://en.wikipedia.org/wiki/Hal_Varian)

Data Visualization for a Data Scientist


1. Data Quality: Explore data quality including identifying outliers
2. Data Exploration: Understand the data by visualizing it
3. Data Presentation: Present results

The power of Data Visualization

Consider the following data


What is the connection between x and y?
See any patterns?

In [2]: import pandas as pd

In [8]: sample = pd.read_csv('files/sample_corr.csv')


In [9]: sample

Out[9]: x y

0 1.105722 1.320945

1 1.158193 1.480131

2 1.068022 1.173479

3 1.131291 1.294706

4 1.125997 1.293024

5 1.037332 0.977393

6 1.051670 1.040798

7 0.971699 0.977604

8 1.102914 1.127956

9 1.164161 1.431070

10 1.161464 1.344481

11 1.080161 1.191159

12 0.996044 0.997308

13 1.143305 1.412850

14 1.062949 1.139761

15 1.149252 1.455886

16 1.190105 1.489407

17 1.026498 1.153031

18 1.110015 1.329586

19 1.077741 1.277995


Visualizing the same data


Let's try to visualize the data

Matplotlib (https://matplotlib.org) is an easy-to-use visualization library for Python.

In notebooks, you get started with:

import matplotlib.pyplot as plt


%matplotlib inline
In [12]: import matplotlib.pyplot as plt
%matplotlib inline

In [13]: sample.plot.scatter(x='x', y='y')

Out[13]: <AxesSubplot:xlabel='x', ylabel='y'>
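
The file files/sample_corr.csv is not reproduced here, so the following is only a sketch of the same idea with synthetic data. The slope, offset and noise level are assumptions chosen to give a visible trend, not values taken from the lesson.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for files/sample_corr.csv (assumption: y rises with x, plus noise)
rng = np.random.default_rng(seed=0)
x = rng.uniform(0.95, 1.20, size=20)
y = 2.5 * x - 1.5 + rng.normal(0, 0.05, size=20)
sample = pd.DataFrame({"x": x, "y": y})

# In the table the relationship is hard to see; in the scatter plot it is obvious
ax = sample.plot.scatter(x="x", y="y", title="Synthetic sample")
```

In a notebook the figure is displayed below the cell; the returned `ax` can be used for further adjustments.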


What Data Visualization gives


Absorb information quickly
Improve insights
Make faster decisions

Data Quality

Is the data quality good enough to use?


Consider the dataset: files/sample_height.csv

Check for missing values

isna() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) .any()
(https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html): Check for any missing
values - returns True for each column that contains missing values

data.isna().any()

Visualize data
Notice: you need to know something about the data
We know that it is heights of humans in centimeters
This could be checked with a histogram
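
As a small illustration of the missing-value check, here is a sketch on a hand-made DataFrame. The values are invented for the example; only `isna()`, `any()` and `sum()` are the pandas calls of interest.

```python
import numpy as np
import pandas as pd

# Invented heights with two missing entries
heights = pd.DataFrame({"height": [171.2, np.nan, 165.0, 180.4, np.nan]})

has_missing = heights.isna().any()  # one True/False per column
n_missing = heights.isna().sum()    # number of missing values per column

print(has_missing)
print(n_missing)
```

`any()` only answers whether something is missing; `sum()` on the same boolean frame tells you how much.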

In [14]: data = pd.read_csv('files/sample_height.csv')

In [15]: data.head()

Out[15]: height

0 129.150282

1 163.277930

2 173.965641

3 168.933825

4 171.075462

In [17]: data.isna().any()

Out[17]: height False


dtype: bool

In [18]: data.plot.hist()

Out[18]: <AxesSubplot:ylabel='Frequency'>
In [19]: data[data['height'] < 50]

Out[19]: height

17 1.913196

22 1.629159

23 1.753424

27 1.854795

50 1.914587

60 1.642295

73 1.804588

82 1.573621

91 1.550227

94 1.660700

97 1.675962

98 1.712382
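
The rows below 50 all lie between roughly 1.5 and 1.9, which suggests they were recorded in meters rather than centimeters. That is an assumption, not something the dataset states. Under that assumption, a possible repair could look like this sketch (the sample values and the threshold of 3 are made up for the example):

```python
import pandas as pd

# Invented sample mixing centimeters and (presumably) meters
data = pd.DataFrame({"height": [129.15, 163.28, 1.91, 173.97, 1.63]})

# Assumption: any value below 3 was entered in meters
in_meters = data["height"] < 3

# Convert only the suspicious rows to centimeters
data.loc[in_meters, "height"] = data.loc[in_meters, "height"] * 100

print(data)
```

Whether to convert, drop, or go back to the data source depends on how sure you are about the cause of the error.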


Identifying outliers
Consider the dataset: files/sample_age.csv

Visualize with a histogram

This gives fast insights

Describe the data

describe() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html):
Generates descriptive statistics (count, mean, std, min, quartiles, max) of the DataFrame

data.describe()

In [20]: data = pd.read_csv('files/sample_age.csv')


In [21]: data.head()

Out[21]: age

0 30.175921

1 32.002551

2 44.518393

3 56.247751

4 33.111986

In [22]: data.describe()

Out[22]: age

count 100.000000

mean 42.305997

std 29.229478

min 18.273781

25% 31.871113

50% 39.376896

75% 47.779303

max 314.000000

In [23]: data.plot.hist()

Out[23]: <AxesSubplot:ylabel='Frequency'>
In [24]: data[data['age'] > 150]

Out[24]: age

31 314.0
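
The notebook finds the outlier with a hand-picked threshold (age > 150). A common alternative, which is not what the lesson uses, is the 1.5 x IQR rule. The sketch below applies it to a few invented ages:

```python
import pandas as pd

# Invented ages with one obviously wrong entry
ages = pd.Series([30.2, 32.0, 44.5, 56.2, 33.1, 39.4, 47.8, 314.0], name="age")

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(outliers)
```

The rule is a heuristic; a flagged value still needs a human decision on whether it is an error.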

Data Exploration

Data Visualization
Absorb information quickly
Improve insights
Make faster decisions

World Bank
The World Bank (https://www.worldbank.org/en/home) is a great source of datasets

CO2 per capita

Let's explore this dataset: EN.ATM.CO2E.PC
(https://data.worldbank.org/indicator/EN.ATM.CO2E.PC)
Already available here: files/WorldBank-ATM.CO2E.PC_DS2.csv

Explore typical Data Visualizations

Simple plot
Set title
Set labels
Adjust axis

Read the data


In [26]: data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data.head()

Out[26]: ABW AFE AFG AFW AGO ALB AND ARB ARE

Year

1960 204.631696 0.906060 0.046057 0.090880 0.100835 1.258195 NaN 0.609268 0.119037

1961 208.837879 0.922474 0.053589 0.095283 0.082204 1.374186 NaN 0.662618 0.109136

1962 226.081890 0.930816 0.073721 0.096612 0.210533 1.439956 NaN 0.727117 0.163542

1963 214.785217 0.940570 0.074161 0.112376 0.202739 1.181681 NaN 0.853116 0.175833

1964 207.626699 0.996033 0.086174 0.133258 0.213562 1.111742 NaN 0.972381 0.132815

5 rows × 266 columns


Simple plot

.plot() Creates a simple plot of data


This gives you an idea of the data

In [27]: data['USA'].plot()

Out[27]: <AxesSubplot:xlabel='Year'>


Adding title and labels


Arguments

title='Title' adds the title
xlabel='X label' adds or changes the X-label
ylabel='Y label' adds or changes the Y-label

In [20]: data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')

Out[20]: <AxesSubplot:title={'center':'CO2 per capita in USA'}, xlabel='Year', ylabel='CO2 per capita'>
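
To try the title and label arguments without the World Bank file, here is a sketch on a small invented Series. The numbers are made up, and the `xlabel`/`ylabel` keywords assume a reasonably recent pandas version.

```python
import pandas as pd

# Invented values, not World Bank data
co2 = pd.Series(
    [16.0, 20.1, 19.3, 17.4, 14.8],
    index=pd.Index([1960, 1975, 1990, 2005, 2017], name="Year"),
)

ax = co2.plot(
    title="CO2 per capita (invented values)",
    xlabel="Year",
    ylabel="CO2 per capita",
)
```

The returned Axes object carries the settings, so they can also be inspected or changed afterwards, for example with `ax.set_title(...)`.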


Adding axis range

xlim=(min, max) or xlim=min Sets the x-axis range


ylim=(min, max) or ylim=min Sets the y-axis range
In [21]: data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)

Out[21]: <AxesSubplot:title={'center':'CO2 per capita in USA'}, xlabel='Year', ylabel='CO2 per capita'>
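
The effect of the axis range can be checked on the returned Axes. This sketch uses invented values and the tuple form `ylim=(min, max)`; the upper bound of 25 is an arbitrary choice for the example.

```python
import pandas as pd

# Invented values, not World Bank data
co2 = pd.Series(
    [16.0, 20.1, 19.3, 17.4, 14.8],
    index=pd.Index([1960, 1975, 1990, 2005, 2017], name="Year"),
)

# Start the y-axis at 0 so differences are not visually exaggerated
ax = co2.plot(ylim=(0, 25))

y_low, y_high = ax.get_ylim()
print(y_low, y_high)
```

Starting the axis at zero is a presentation choice: it makes the absolute level visible, at the cost of flattening small changes.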


Comparing data
Explore USA and WLD
In [25]: data[['USA', 'WLD']].plot(ylim=0)

Out[25]: <AxesSubplot:xlabel='Year'>


Set the figure size

figsize=(width, height) in inches


In [27]: data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))

Out[27]: <AxesSubplot:xlabel='Year'>


Bar plot
.plot.bar() Create a bar plot

In [28]: data['USA'].plot.bar(figsize=(20,6))

Out[28]: <AxesSubplot:xlabel='Year'>
In [29]: data[['USA', 'WLD']].plot.bar(figsize=(20,6))

Out[29]: <AxesSubplot:xlabel='Year'>

Plot a range of data


.loc[from:to] Apply this to the DataFrame to select a range of rows by index label (both ends inclusive)

In [30]: data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))

Out[30]: <AxesSubplot:xlabel='Year'>
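
Unlike ordinary Python slices, label-based slicing with `.loc` includes both end points. A minimal sketch with an invented year index:

```python
import pandas as pd

# Invented values with a year index
df = pd.DataFrame(
    {"USA": [20.2, 19.6, 17.4, 16.3, 14.8], "WLD": [4.1, 4.4, 4.6, 4.5, 4.4]},
    index=pd.Index([1998, 2000, 2005, 2010, 2017], name="Year"),
)

subset = df.loc[2000:2010]  # both 2000 and 2010 are included
open_ended = df.loc[2005:]  # from 2005 to the last row

print(subset)
print(open_ended)
```

The slice works on index labels (the years), not on row positions; for positions you would use `.iloc`, which excludes the end point.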


Histogram
.plot.hist() Create a histogram
bins=<number of bins> Specify the number of bins in the histogram.
In [34]: data['USA'].plot.hist(figsize=(20,6), bins=7)

Out[34]: <AxesSubplot:ylabel='Frequency'>
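
To see what `bins` does, it can help to look at the counts behind the bars. The sketch below uses invented values and `numpy.histogram`, which splits the value range into equal-width bins in the same way the plot is assumed to.

```python
import numpy as np
import pandas as pd

# Invented values
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 10])

# Three equal-width bins between the minimum (1) and the maximum (10)
counts, edges = np.histogram(values, bins=3)
print(edges)   # bin boundaries
print(counts)  # number of values in each bin

ax = values.plot.hist(bins=3)
```

Few bins hide detail, many bins make the plot noisy; it is worth trying a couple of settings.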


Pie chart
.plot.pie() Creates a Pie Chart

In [35]: df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])


df.plot.pie()

Out[35]: <AxesSubplot:ylabel='None'>


Value counts and pie charts


A simple chart of values above/below a threshold
.value_counts() Counts occurrences of values in a Series (or DataFrame column)
A few arguments to .plot.pie()
colors=<list of colors>
labels=<list of labels>
title='<title>'
ylabel='<label>'
autopct='%1.1f%%' sets percentages on chart

In [43]: (data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5',

Out[43]: <AxesSubplot:title={'center':'CO2 per capita'}, ylabel='USA'>
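
One thing to watch with hard-coded `labels`: `value_counts()` orders its result by frequency, so the label list has to match that order. A sketch of an alternative, using invented values, is to turn the True/False result into text labels before counting, so every slice is labelled from the data itself:

```python
import pandas as pd

# Invented values, not World Bank data
usa = pd.Series([16.0, 18.5, 20.1, 19.3, 17.4, 14.8, 19.8], name="USA")

# Map the boolean comparison to readable labels, then count
groups = (usa < 17.5).map({True: "< 17.5", False: ">= 17.5"})
counts = groups.value_counts()
print(counts)

# The index of counts is used as slice labels automatically
ax = counts.plot.pie(autopct="%1.1f%%", title="CO2 per capita")
```

This is one possible approach, not the one in the notebook cell above; passing `labels=` explicitly works as well, as long as the order is checked.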


Scatter plot
Assume we want to investigate if GDP per capita and CO2 per capita are correlated
Data available in 'files/co2_gdp_per_capita.csv'
.plot.scatter(x=<label>, y=<label>) Create a scatter plot
.corr() Compute pairwise correlation of columns (docs:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)

In [44]: data = pd.read_csv('files/co2_gdp_per_capita.csv', index_col=0)


data.head()

Out[44]: CO2 per capita GDP per capita

AFE 0.933541 1507.861055

AFG 0.200151 568.827927

AFW 0.515544 1834.366604

AGO 0.887380 3595.106667

ALB 1.939732 4433.741739


In [46]: data.plot.scatter(x='CO2 per capita', y='GDP per capita')

Out[46]: <AxesSubplot:xlabel='CO2 per capita', ylabel='GDP per capita'>

In [47]: data.corr()

Out[47]: CO2 per capita GDP per capita

CO2 per capita 1.000000 0.633178

GDP per capita 0.633178 1.000000
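
The correlation matrix is symmetric with 1.0 on the diagonal, and the off-diagonal entry is the (by default Pearson) correlation between the two columns. A sketch with invented numbers shows the relationship between `DataFrame.corr()` and `Series.corr()`:

```python
import pandas as pd

# Invented values, loosely shaped like the lesson's columns
df = pd.DataFrame({
    "CO2 per capita": [0.2, 0.9, 1.9, 4.5, 8.0, 15.0],
    "GDP per capita": [570, 3600, 4400, 12000, 30000, 55000],
})

corr = df.corr()  # full pairwise matrix
r = df["CO2 per capita"].corr(df["GDP per capita"])  # single coefficient

print(corr)
print(r)
```

A coefficient close to 1 means the points lie near a rising straight line; it says nothing about which variable, if any, drives the other.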


Data Presentation
This is about making data easy to digest

The message
Assume we want to give a picture of how US CO2 per capita compares to the rest of the world

Preparation

Let's take 2017 (as more recent data is incomplete)


What are the mean, max, and min CO2 per capita in the world?

In [54]: data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)


In [55]: data.head()

Out[55]: ABW AFE AFG AFW AGO ALB AND ARB ARE

Year

1960 204.631696 0.906060 0.046057 0.090880 0.100835 1.258195 NaN 0.609268 0.119037 2.38

1961 208.837879 0.922474 0.053589 0.095283 0.082204 1.374186 NaN 0.662618 0.109136 2.45

1962 226.081890 0.930816 0.073721 0.096612 0.210533 1.439956 NaN 0.727117 0.163542 2.53

1963 214.785217 0.940570 0.074161 0.112376 0.202739 1.181681 NaN 0.853116 0.175833 2.33

1964 207.626699 0.996033 0.086174 0.133258 0.213562 1.111742 NaN 0.972381 0.132815 2.55

5 rows × 266 columns


In [56]: year = 2017
data.loc[year].describe()

Out[56]: count 239.000000


mean 4.154185
std 4.575980
min 0.028010
25% 0.851900
50% 2.667119
75% 6.158644
max 32.179371
Name: 2017, dtype: float64


And in the US?

In [57]: data.loc[year]['USA']

Out[57]: 14.8058824221278


How can we tell a story?

US is above the mean


US is not the max
It is above the 75th percentile (the 75% row of describe())
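
The three statements above can be checked directly in code. The sketch below uses a handful of invented values; the codes AAA to EEE are hypothetical placeholders, and only the USA figure is rounded from the notebook output.

```python
import pandas as pd

# Invented values for one year; AAA..EEE are placeholder codes
world = pd.Series(
    {"AAA": 0.3, "BBB": 0.9, "CCC": 2.7, "DDD": 6.2, "USA": 14.8, "EEE": 32.2}
)
usa = world["USA"]

above_mean = usa > world.mean()
is_max = usa == world.max()
above_q3 = usa > world.quantile(0.75)

# Share of entries with a lower value than the USA
share_below = (world < usa).mean()

print(above_mean, is_max, above_q3, share_below)
```

Turning the story into explicit checks like these makes it easy to rerun them for another year or another country.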
Some more advanced matplotlib

In [58]: ax = data.loc[year].plot.hist(bins=15, facecolor='green')



ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
ax.annotate("USA", xy=(15, 5), xytext=(15, 30),
arrowprops=dict(arrowstyle="->",
connectionstyle="arc3"))

Out[58]: Text(15, 30, 'USA')
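
In the cell above the arrow position (15, 5) is typed in by hand. As a sketch of an alternative, the arrow tip can be placed from the data by looking up the height of the bin that contains the value. The data here is synthetic (a gamma distribution chosen only to give a skewed shape), and the plot is assumed to use the same equal-width bins as `numpy.histogram`.

```python
import numpy as np
import pandas as pd

# Synthetic, skewed stand-in for one year of CO2 per capita values
rng = np.random.default_rng(seed=1)
world = pd.Series(rng.gamma(shape=1.0, scale=4.0, size=200))
usa_value = 14.8

ax = world.plot.hist(bins=15, facecolor="green")
ax.set_xlabel("CO2 per capita")
ax.set_ylabel("Number of countries")

# Find the bin that contains usa_value and its bar height
counts, edges = np.histogram(world, bins=15)
i = int(np.clip(np.searchsorted(edges, usa_value, side="right") - 1, 0, len(counts) - 1))

# Arrow tip on top of that bar, text placed higher up
ax.annotate(
    "USA",
    xy=(usa_value, counts[i]),
    xytext=(usa_value, counts.max() * 0.6),
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),
)
```

Computing the position keeps the annotation correct when the data or the number of bins changes.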

Creative story telling with data visualization


Check out this video: https://www.youtube.com/watch?v=jbkSRLYSojo

In [60]: from IPython.display import YouTubeVideo



YouTubeVideo('jbkSRLYSojo')

Out[60]:
