01 - Lesson - Visualization - Jupyter Notebook
Data Visualization
Key skill today
"The ability to take data - to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it - that's going to be a hugely important skill
in the next decades."
Out[9]: x y
0 1.105722 1.320945
1 1.158193 1.480131
2 1.068022 1.173479
3 1.131291 1.294706
4 1.125997 1.293024
5 1.037332 0.977393
6 1.051670 1.040798
7 0.971699 0.977604
8 1.102914 1.127956
9 1.164161 1.431070
10 1.161464 1.344481
11 1.080161 1.191159
12 0.996044 0.997308
13 1.143305 1.412850
14 1.062949 1.139761
15 1.149252 1.455886
16 1.190105 1.489407
17 1.026498 1.153031
18 1.110015 1.329586
19 1.077741 1.277995
Data Quality
data.isna().any()
Visualize data
Notice: you need to know something about the data
We know that the data contains human heights in centimeters
This could be checked with a histogram
In [15]: data.head()
Out[15]: height
0 129.150282
1 163.277930
2 173.965641
3 168.933825
4 171.075462
In [17]: data.isna().any()
In [18]: data.plot.hist()
Out[18]: <AxesSubplot:ylabel='Frequency'>
In [19]: data[data['height'] < 50]
Out[19]: height
17 1.913196
22 1.629159
23 1.753424
27 1.854795
50 1.914587
60 1.642295
73 1.804588
82 1.573621
91 1.550227
94 1.660700
97 1.675962
98 1.712382
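The checks above can be sketched end to end on a small stand-in dataset. The values below are hypothetical (not the lesson's file), and the repair step at the end is only one possible assumption: that heights below 50 were recorded in meters rather than centimeters.

```python
import pandas as pd

# Hypothetical sample standing in for the lesson's height data (centimeters),
# with two values accidentally recorded in meters and one missing value.
data = pd.DataFrame({'height': [129.2, 163.3, 1.91, 174.0, None, 1.63, 168.9]})

# Check each column for missing values.
print(data.isna().any())

# Heights below 50 cm are implausible for humans and hint at a unit mix-up.
suspect = data[data['height'] < 50]
print(suspect)

# One possible repair (an assumption, not the lesson's method):
# treat the suspect values as meters and convert them to centimeters.
cleaned = data.copy()
cleaned.loc[cleaned['height'] < 50, 'height'] *= 100
print(cleaned)
```

Whether converting, dropping, or flagging the suspect rows is appropriate depends on what is known about how the data was collected.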
Identifying outliers
Consider the dataset: files/sample_age.csv
describe() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html):
Computes summary statistics of the DataFrame
data.describe()
Out[21]: age
0 30.175921
1 32.002551
2 44.518393
3 56.247751
4 33.111986
In [22]: data.describe()
Out[22]: age
count 100.000000
mean 42.305997
std 29.229478
min 18.273781
25% 31.871113
50% 39.376896
75% 47.779303
max 314.000000
In [23]: data.plot.hist()
Out[23]: <AxesSubplot:ylabel='Frequency'>
In [24]: data[data['age'] > 150]
Out[24]: age
31 314.0
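A minimal sketch of the same outlier check on stand-in data. The ages below are hypothetical, and the threshold of 150 follows the filter used above.

```python
import pandas as pd

# Hypothetical ages standing in for files/sample_age.csv, with one bad entry.
data = pd.DataFrame({'age': [30.2, 32.0, 44.5, 56.2, 33.1, 314.0]})

# Summary statistics: a max far above the 75% quartile hints at an outlier.
stats = data.describe()
print(stats)

# Select rows above a domain-based threshold (no human age exceeds 150).
outliers = data[data['age'] > 150]
print(outliers)

# Keep only the plausible rows, if removal is appropriate for the task.
cleaned = data[data['age'] <= 150]
print(cleaned.describe())
```

Comparing the mean and standard deviation before and after the removal shows how strongly a single bad value can distort the summary.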
Data Exploration
Data Visualization
Absorb information quickly
Improve insights
Make faster decisions
World Bank
The World Bank (https://www.worldbank.org/en/home) is a great source of datasets
Simple plot
Set title
Set labels
Adjust axis
Out[26]: ABW AFE AFG AFW AGO ALB AND ARB ARE
Year
1960 204.631696 0.906060 0.046057 0.090880 0.100835 1.258195 NaN 0.609268 0.119037
1961 208.837879 0.922474 0.053589 0.095283 0.082204 1.374186 NaN 0.662618 0.109136
1962 226.081890 0.930816 0.073721 0.096612 0.210533 1.439956 NaN 0.727117 0.163542
1963 214.785217 0.940570 0.074161 0.112376 0.202739 1.181681 NaN 0.853116 0.175833
1964 207.626699 0.996033 0.086174 0.133258 0.213562 1.111742 NaN 0.972381 0.132815
Simple plot
In [27]: data['USA'].plot()
Out[27]: <AxesSubplot:xlabel='Year'>
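The title, labels, and axis adjustments listed for the simple plot could look like the sketch below. The values and the label texts are hypothetical placeholders, not the World Bank data.

```python
import pandas as pd

# Hypothetical CO2-per-capita values indexed by year (placeholder numbers).
data = pd.DataFrame(
    {'USA': [15.9, 15.7, 16.0, 16.5, 17.0]},
    index=pd.Index([1960, 1961, 1962, 1963, 1964], name='Year'),
)

# title= sets the plot title; ylim=0 makes the y-axis start at zero.
ax = data['USA'].plot(title='CO2 per capita in USA', ylim=0)

# Axis labels are set on the returned matplotlib Axes object.
ax.set_xlabel('Year')
ax.set_ylabel('CO2 per capita')
```

Starting the y-axis at zero avoids exaggerating small year-to-year changes.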
Comparing data
Explore USA and WLD
In [25]: data[['USA', 'WLD']].plot(ylim=0)
Out[25]: <AxesSubplot:xlabel='Year'>
Bar plot
.plot.bar() Create a bar plot
In [28]: data['USA'].plot.bar(figsize=(20,6))
Out[28]: <AxesSubplot:xlabel='Year'>
In [29]: data[['USA', 'WLD']].plot.bar(figsize=(20,6))
Out[29]: <AxesSubplot:xlabel='Year'>
Out[30]: <AxesSubplot:xlabel='Year'>
Histogram
.plot.hist() Create a histogram
bins=<number of bins> Specify the number of bins in the histogram.
In [34]: data['USA'].plot.hist(figsize=(20,6), bins=7)
Out[34]: <AxesSubplot:ylabel='Frequency'>
Pie chart
.plot.pie() Create a pie chart
Out[35]: <AxesSubplot:ylabel='None'>
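The cell that produced this output is not shown, so the following is only a sketch of one way to use .plot.pie(): counting how many years fall below or above a threshold. The values, the threshold of 17.5, and the category labels are all hypothetical.

```python
import pandas as pd

# Hypothetical CO2-per-capita values for six years (placeholder numbers).
usa = pd.Series([15.9, 16.5, 19.2, 20.1, 17.0, 14.8], name='USA')

# Count the years below the (assumed) threshold and the remaining years.
below = int((usa < 17.5).sum())
counts = pd.Series({'below 17.5': below, '17.5 or above': len(usa) - below})

# Each slice is proportional to its count; autopct prints the percentages.
ax = counts.plot.pie(autopct='%1.0f%%')
ax.set_ylabel('')  # remove the default y-label
```

A pie chart works best with a handful of categories that sum to a meaningful whole, as in this two-slice example.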
Scatter plot
Assume we want to investigate if GDP per capita and CO2 per capita are correlated
Data available in 'files/co2_gdp_per_capita.csv'
.plot.scatter(x=<label>, y=<label>) Create a scatter plot
.corr() Compute pairwise correlation of columns (docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)
In [47]: data.corr()
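A minimal sketch of the scatter plot and correlation steps. The column names and values are assumptions standing in for 'files/co2_gdp_per_capita.csv', whose actual layout is not shown here.

```python
import pandas as pd

# Hypothetical stand-in data; column names are assumed, not from the file.
data = pd.DataFrame({
    'CO2 per capita': [0.5, 2.0, 4.5, 8.0, 15.0],
    'GDP per capita': [1000, 5000, 12000, 30000, 60000],
})

# Scatter plot: one point per row, x and y chosen by column label.
ax = data.plot.scatter(x='CO2 per capita', y='GDP per capita')

# Pairwise Pearson correlation between all numeric columns.
corr = data.corr()
print(corr)
```

A coefficient close to 1 indicates a strong positive linear relationship; it does not by itself show that one quantity causes the other.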
Data Presentation
This is about making data easy to digest
The message
Assume we want to give a picture of how US CO2 per capita compares to the rest of the world
Preparation
Out[55]: ABW AFE AFG AFW AGO ALB AND ARB ARE
Year
1960 204.631696 0.906060 0.046057 0.090880 0.100835 1.258195 NaN 0.609268 0.119037 2.38
1961 208.837879 0.922474 0.053589 0.095283 0.082204 1.374186 NaN 0.662618 0.109136 2.45
1962 226.081890 0.930816 0.073721 0.096612 0.210533 1.439956 NaN 0.727117 0.163542 2.53
1963 214.785217 0.940570 0.074161 0.112376 0.202739 1.181681 NaN 0.853116 0.175833 2.33
1964 207.626699 0.996033 0.086174 0.133258 0.213562 1.111742 NaN 0.972381 0.132815 2.55
In [56]: data.loc[year].describe()
In [57]: data.loc[year]['USA']
Out[57]: 14.8058824221278
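One way to turn these two lookups into a picture is sketched below: a histogram of all values for one year with the USA value marked. This is an assumption about a possible presentation, not the lesson's own figure; the country codes other than USA, the values, and the chosen year are hypothetical.

```python
import pandas as pd

# Hypothetical data: years as index, one column per country code.
data = pd.DataFrame(
    {'AAA': [1.0, 1.2], 'BBB': [4.0, 4.5], 'CCC': [0.3, 0.4], 'USA': [15.0, 14.8]},
    index=pd.Index([2016, 2017], name='Year'),
)
year = 2017  # assumed year of interest

# All values for the chosen year, and the USA value among them.
row = data.loc[year]
print(row.describe())
usa = row['USA']

# Share of entries with a lower value than the USA.
share_below = (row < usa).mean()
print(share_below)

# Histogram of all entries, with a vertical line marking the USA.
ax = row.plot.hist(bins=5, title=f'CO2 per capita in {year}')
ax.axvline(usa, color='red', label='USA')
ax.legend()
```

Marking a single value against the full distribution keeps the message to one comparison, which is usually easier to read than a table of summary statistics.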