Content from Jose Portilla's Udemy course Learning Python for Data Analysis and Visualization
https://www.udemy.com/learning-python-for-data-analysis-and-visualization/
Notes by Michael Brothers, available on http://github.com/mikebrothers/data-science/
Table of Contents
MATPLOTLIB
Scatter plot
Bar plot with errorbars
3D Graphical Analysis
Histograms
SEABORN LIBRARIES
Rug Plots
Histograms using factorplot
Combined Plots (kde, hist, rug) using distplot
Box & Whisker Plots
Violin Plots
Joint Plots
Regression Plots
Heatmaps
Clustered Matrices
OTHER USEFUL TOOLS
How to Save a DataFrame as a Figure (.png file)
How to open a webpage inside Jupyter
Note: except where noted, code & output are Python v2.7 on Jupyter Notebooks
MATPLOTLIB
Scatter plot with wine data downloaded from the UC Irvine Machine Learning Repository:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/'
# the file 'winequality-red.csv' was saved to the Jupyter notebook directory
dframe_wine = pd.read_csv('winequality-red.csv',sep=';')   # note the separator
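Why the separator matters can be seen on a tiny in-memory sample. This is only a sketch: the three column names are assumed to resemble the wine file's layout, and the two data rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a semicolon-delimited layout
# (values are made up for illustration, not taken from the real file).
sample = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    '7.0;3.5;5\n'
    '8.1;3.2;6\n'
)

# With the default sep=',' each row would be read as one long column;
# sep=';' splits it into the three expected columns.
dframe_sample = pd.read_csv(sample, sep=';')
print(dframe_sample.columns.tolist())
print(dframe_sample.shape)
```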
# you can add data labels above each bar (I chose not to in the plot below):
def autolabel(rects):
    # attach some text labels
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
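The function above relies on ax, rects1 and rects2 already existing. A minimal self-contained sketch of how they might be created follows; the means, standard deviations, colors and bar width are hypothetical values chosen for illustration, and ax is passed in explicitly here rather than read from the enclosing scope.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical group means and standard deviations, made up for illustration.
means_a, std_a = [20, 35, 30], [2, 3, 4]
means_b, std_b = [25, 32, 34], [3, 5, 2]

ind = np.arange(len(means_a))  # x locations of the groups
width = 0.35                   # width of each bar

fig, ax = plt.subplots()
# yerr draws the error bars on top of each bar
rects1 = ax.bar(ind, means_a, width, color='r', yerr=std_a)
rects2 = ax.bar(ind + width, means_b, width, color='y', yerr=std_b)

def autolabel(rects, ax):
    # write each bar's height as text just above the bar
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width() / 2., 1.05 * height,
                '%d' % int(height), ha='center', va='bottom')

autolabel(rects1, ax)
autolabel(rects2, ax)
```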
from numpy.random import randn
import matplotlib.pyplot as plt
dataset1 = randn(100)
plt.hist(dataset1)
(array([ 1., 1., 5., 9., 11., 23., 23., 13., 10., 4.]),
array([-3.03051447, -2.5119968 , -1.99347913, -1.47496147, -0.9564438 ,
-0.43792613, 0.08059154, 0.5991092 , 1.11762687, 1.63614454,
2.1546622 ]),
<a list of 10 Patch objects>)
dataset2 = randn(80)
plt.hist(dataset2,color='indianred')
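# overlay both datasets as normalized, semi-transparent histograms
# (newer matplotlib releases replace normed=True with density=True)
plt.hist(dataset1,normed=True,alpha=0.5,bins=20)
plt.hist(dataset2,normed=True,color='indianred',alpha=0.5,bins=20)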
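The tuple printed by plt.hist above holds the counts per bin, the bin edges, and the drawn patches. The first two are the same quantities np.histogram computes, so they can be inspected without plotting. The sketch below uses an arbitrary fixed seed, so its numbers will not match the output shown in these notes.

```python
import numpy as np

rng = np.random.RandomState(0)      # fixed seed, chosen arbitrarily
dataset1 = rng.randn(100)

# counts per bin and the 11 edges that bound the 10 default bins
counts, edges = np.histogram(dataset1, bins=10)

# density=True rescales the bars so that their areas sum to 1
density, edges20 = np.histogram(dataset1, bins=20, density=True)
area = np.sum(density * np.diff(edges20))
print(counts.sum(), len(edges), area)
```

Normalizing is what makes two samples of different size (100 vs. 80 points) comparable on the same axes.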
Rug Plots
dataset = randn(25)
sns.rugplot(dataset)   # plots a simple row of tick marks along the x-axis
Histograms using factorplot
Note: Histograms are already part of matplotlib: plt.hist(dataset)
Seaborn's factorplot lets you choose between histograms, point plots, violin plots, etc.
(Seaborn 0.9 renamed factorplot to catplot, and its size argument to height.)
Also, the "hue" argument makes it easy to compare multiple variables simultaneously.
Unfortunately, sorting columns appropriately can be a challenge.
The following example makes use of the Iris flower data set included in Seaborn:
xorder = np.apply_along_axis(sorted, 0, iris['Petal Length'].unique())
sns.factorplot('Petal Length', data=iris, order=xorder, size=8, hue='Species',
               kind='count');   # note: without size=8, the x-axis labels overlap
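The ordering step can be checked on its own, without Seaborn. This sketch uses a small made-up Series as a stand-in for iris['Petal Length'] and np.sort as a shorter alternative to apply_along_axis; it is not the approach used in the course notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris['Petal Length'] (values made up).
petal_length = pd.Series([4.7, 1.4, 5.1, 1.4, 4.7, 1.3])

# np.sort on the unique values gives the ascending order for the x-axis
xorder = np.sort(petal_length.unique())

# counts per value, listed in that same order
counts = petal_length.value_counts().reindex(xorder)
print(xorder)
print(counts.tolist())
```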
Source: https://en.wikipedia.org/wiki/Kernel_density_estimation
KDE Plots, cont'd
By hand:
from scipy import stats                  # needed for stats.norm below
dataset = randn(25)                      # take a random (normal distribution) sample set
sns.rugplot(dataset)                     # make a rug plot
x_min = dataset.min() - 2                # set up the x-axis for the plot
x_max = dataset.max() + 2
x_axis = np.linspace(x_min,x_max,100)    # set 100 equally spaced points from x_min to x_max
bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2   # Silverman's rule of thumb
kernel_list = []                         # create an empty kernel list
for data_point in dataset:               # plot each basis function
    # create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)
    kernel_list.append(kernel)
    # scale for plotting
    kernel = kernel / kernel.max()
    kernel = kernel * .4
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)
plt.ylim(0,1)                            # see plot below left
sum_of_kde = np.sum(kernel_list,axis=0)
fig = plt.plot(x_axis,sum_of_kde,color='indianred')
sns.rugplot(dataset)
plt.suptitle("Sum of the Basis Functions")   # see plot above right
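The summed curve above is not yet a probability density, because each of the n kernels has area 1 and so the sum has area of roughly n. The sketch below divides by n to get a proper density estimate and checks that the area under it is close to 1. It assumes an arbitrary seed, a finer 400-point grid, and a Gaussian density written out in plain NumPy in place of stats.norm.

```python
import numpy as np

rng = np.random.RandomState(1)           # arbitrary fixed seed
dataset = rng.randn(25)
n = len(dataset)

# Silverman's rule of thumb, as in the notes
bandwidth = ((4 * dataset.std() ** 5) / (3 * n)) ** 0.2

x_axis = np.linspace(dataset.min() - 2, dataset.max() + 2, 400)

def gaussian_pdf(x, mean, sd):
    # normal density written out with numpy
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# one kernel per data point: shape (25, 400)
kernels = np.array([gaussian_pdf(x_axis, p, bandwidth) for p in dataset])

# averaging (sum divided by n) turns the sum into a density estimate
kde = kernels.sum(axis=0) / n

# trapezoid-rule area under the curve
dx = x_axis[1] - x_axis[0]
area = np.sum((kde[1:] + kde[:-1]) / 2.0) * dx
print(round(area, 3))
```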
Using Seaborn:
sns.kdeplot(dataset)                     # density estimate
sns.kdeplot(dataset,cumulative=True)     # cumulative distribution
Seaborn allows you to quickly change bandwidth, kernels, orientation, and a number of other parameters.
Seaborn also supports multivariate density estimation. See the Jupyter notebook for more info.
Combined Plots (kde, hist, rug) using distplot
sns.distplot(dataset,bins=25)               # by default, a KDE over a histogram
sns.distplot(dataset,rug=True,hist=False)   # here a rug and a KDE
To control specific plots in distplot, use a [plot]_kws argument with dictionaries:
sns.distplot(dataset,bins=25,
             kde_kws={'color':'indianred','label':'KDE PLOT'},
             hist_kws={'color':'blue','label':"HISTOGRAM"})
sns.jointplot(data1,data2)
sns.jointplot(data1,data2,kind='hex')
Regression Plots
tips = sns.load_dataset("tips")        # load a Seaborn sample dataset
# general form: sns.lmplot(x,y,data)
sns.lmplot("total_bill","tip",tips);   # scatter plot with linear regression line & confidence interval
sns.lmplot("total_bill","tip",tips,order=4, scatter_kws={"color":"indianred"},
           line_kws={"linewidth": 1, "color": "blue"})   # 4th-order polynomial fit (plot above right)
Refer to the online documentation & jupyter notebook for more on adjusting the confidence interval, plotting discrete
variables, jittering, removing the regression line, and using hue & markers to define subsets along a column.
Seaborn even supports local regression (LOESS) with the argument lowess=True.
For lower level regression plots, use sns.regplot(x,y,data). These can be tied to other plots.
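To see what the order argument means numerically, the fit can be sketched with np.polyfit, which the Seaborn documentation names as the estimator used when order is greater than 1. The data below are hypothetical, noise-free points on the line y = 2x + 1, chosen so the result is easy to verify by eye.

```python
import numpy as np

# Hypothetical noise-free data on the line y = 2x + 1
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0

# order=1 corresponds to a straight-line fit: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# a higher order (here 4) fits a polynomial with order + 1 coefficients
coeffs4 = np.polyfit(x, y, 4)
fitted = np.polyval(coeffs4, x)
print(round(slope, 6), round(intercept, 6), len(coeffs4))
```

On real data such as the tips set, a high order can follow noise rather than trend, so it is worth comparing against the plain linear fit.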
Heatmaps
flight_dframe = sns.load_dataset('flights')   # load a Seaborn sample dataset
Pivot the data to make it more usable (index=month, columns=year, fill=passengers):
flight_dframe = flight_dframe.pivot("month","year","passengers")
Note: unlike the lecture notebook, dframe now sorts months in date order, not alphabetically.
sns.heatmap(flight_dframe);
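The pivot step can be tried on a small long-form frame. This sketch uses made-up passenger counts and passes index, columns and values by keyword, which recent pandas releases require (the positional form above is from the older version used in these notes). Because the months here are plain strings, this toy example sorts them alphabetically.

```python
import pandas as pd

# Hypothetical long-form data (passenger counts made up for illustration)
long_form = pd.DataFrame({
    'year':       [1949, 1949, 1950, 1950],
    'month':      ['January', 'February', 'January', 'February'],
    'passengers': [100, 110, 120, 130],
})

# keyword arguments make the roles explicit:
# one row per month, one column per year, cells filled with passengers
wide = long_form.pivot(index='month', columns='year', values='passengers')
print(wide)

# summing down each column gives one total per year
yearly = wide.sum()
print(yearly)
```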
heatmap() can be drawn onto a subplot axis (via its ax argument) to create more informative figures:
f,(axis1,axis2) = plt.subplots(2,1)   # figure "f" will have two rows, one column
Since yearly_flights (computed below) is a Series indexed by year, grab the years and the totals as separate Series, then put each in a DataFrame:
yearly_flights = flight_dframe.sum()
years = pd.Series(yearly_flights.index.values)
years = pd.DataFrame(years)
flights = pd.Series(yearly_flights.values)
flights = pd.DataFrame(flights)