Course3 Notes
hist(),
some_data.plot(kind='type_plot', …), and some_data.plot.type_plot(…). The (…) is the frequency.
-We can also plot multiple histograms on the same plot. For example,
The result of this code does not look right, so the dataframe must first be transposed:
df_t.head()
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')
plt.show()
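The transposition step can be sketched as follows. This is only an illustration: the small dataframe and its numbers are made up, and the layout (one row per country, one column per year) is an assumption about how the data looked before transposing.

```python
import pandas as pd

# Hypothetical immigration counts: one row per country, one column per year.
df = pd.DataFrame(
    {"1980": [272, 116, 281], "1981": [293, 77, 308], "1982": [299, 106, 222]},
    index=["Denmark", "Norway", "Sweden"],
)

# plot(kind='hist') draws one histogram per column. With the layout above,
# each year would become its own histogram, so we swap rows and columns
# to make each country a column instead.
df_t = df.transpose()

print(df_t.head())
```

After the transpose, each country is a column, so a histogram plot would compare countries rather than years.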
-Now, let’s make some modifications to improve the visualization: increase the number of bins to 15 with the
bins parameter, set the transparency to 60% with the alpha parameter, label the x-axis by passing in the x-label
parameter, and change the colors of the plots with the color parameter.
plt.show()
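To see what 15 bins means in practice, here is a small sketch using NumPy's histogram function. The counts are made-up numbers, not the real dataset; bin_edges is the same name used for the xticks argument further down.

```python
import numpy as np

# Hypothetical yearly immigrant counts for one country (not real data).
counts = np.array([272, 293, 299, 106, 93, 73, 93, 109, 129, 129,
                   118, 111, 158, 186, 93, 111, 70, 83, 63, 81])

# Split the value range into 15 equal-width bins.
# frequency[i] is the number of years whose count falls in bin i, and
# bin_edges holds the 16 boundaries of those 15 bins.
frequency, bin_edges = np.histogram(counts, bins=15)

print(frequency)
print(bin_edges)
```

The bin_edges array could then be passed as xticks, together with bins=15 and alpha=0.6, so that the tick marks line up with the bars.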
Tip: for a full listing of colors available in Matplotlib, run the following code in the Python shell:
import matplotlib
for name, hex_code in matplotlib.colors.cnames.items():
    print(name, hex_code)
-We can also stack them using the stacked parameter, and adjust the minimum and maximum of the x-axis by passing a
tuple to the xlim parameter:
#stacked histogram
df_t.plot(kind='hist',
          figsize=(10,6),
          bins=15,
          xticks=bin_edges,
          color=['coral','darkslateblue','mediumseagreen'],
          stacked=True,
          xlim=(xmin, xmax))
plt.title('…….')
plt.ylabel('…….')
plt.xlabel('…….')
plt.show()
To create a bar plot, we can pass one of two arguments via the kind parameter in plot(): kind='bar'
creates a vertical bar plot, and kind='barh' creates a horizontal bar plot.
# plot data
df_iceland.plot(kind='bar', figsize=(10,6))
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.title('Icelandic immigrants to Canada from 1980 to 2013')
plt.show()
To annotate this on the plot, we use the annotate method of the scripting layer (also called the pyplot
interface). We will pass in the following parameters:
# Annotate arrow
xy=(32, 70), # place head of the arrow at point (year 2012 , pop 70)
xytext=(28, 20), # place base of the arrow at point (year 2008 , pop 20)
xycoords='data', # will use the coordinate system of the object being annotated
# Annotate Text
xy=(28, 30), # start the text at point (year 2008 , pop 30)
plt.show()
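A minimal sketch of how the annotate parameters above fit together. The bar heights and the label text are made up for illustration, and the arrow styling in arrowprops is an assumption, not necessarily what the original lab used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical bar heights for 34 years (positions 0..33), not real data.
heights = [2 * i for i in range(34)]
plt.figure(figsize=(10, 6))
plt.bar(range(34), heights)

# Arrow: empty text, head at xy, base at xytext, both in data coordinates.
arrow = plt.annotate(
    "",
    xy=(32, 70),
    xytext=(28, 20),
    xycoords="data",
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3", color="blue", lw=2),
)

# Text: a hypothetical label starting at point (28, 30).
label = plt.annotate("example label", xy=(28, 30))
```

In a notebook, calling plt.show() afterwards would display the bars with the arrow and the label on top.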
Another example:
df_top15=df_can['Total'].head(15)
df_top15
df_top15.plot(kind='barh', figsize=(10,6))
plt.xlabel('Number of Immigrants')
plt.ylabel('Country')
plt.title('LALALALA')
plt.show()
Let’s use a pie chart to explore the proportion (percentage) of new immigrants grouped by
continents for the entire time period from 1980 to 2013.
Example:
df_continents.head()
Step 2: Plot the data. We will pass in the kind='pie' keyword, along with the following additional
parameters:
autopct - a string or function used to label the wedges (slices) with their numeric value. The
label will be placed inside the wedge. If it is a format string, the label will be fmt%pct.
startangle - rotates the start of the pie chart by angle degrees counterclockwise from the x-axis.
shadow - draws a shadow beneath the pie (to give a 3D feel).
#The 1.1 highlighted in red means 1 digit after the decimal point; 1.2 means 2 digits after the decimal
point, 1.3 means 3 digits, and so on.
plt.show()
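The effect of the digits in an autopct format string can be checked with plain Python string formatting. This is just an illustration with an arbitrary number; the three format strings are examples, not necessarily the ones used in the plot.

```python
# '%1.1f%%' -> a number with 1 digit after the decimal point, then a literal '%'.
value = 12.3456
formats = ["%1.1f%%", "%1.2f%%", "%1.3f%%"]
labels = [fmt % value for fmt in formats]

for fmt, text in zip(formats, labels):
    print(fmt, "->", text)
```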
The above visual is not very clear: the numbers and text overlap in some instances. Let’s make a few
modifications to improve the visuals:
legend or plt.legend places the legend anywhere we want
pctdistance sets how far the percentage labels are drawn from the center of the pie
colors changes the color of each wedge
explode emphasizes wedges by offsetting them (in this case, the lowest three continents, which are
Africa, North America, and Latin America and the Caribbean)
plt.show()
Box Plots
A box plot is a way of statistically representing the distribution of the data through five main
dimensions:
minimum: smallest number in the dataset
first quartile: middle number between the minimum and the median
second quartile (median): middle number of the (sorted) dataset
third quartile: middle number between median and maximum
maximum: highest number in the dataset
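The five values can be computed directly with NumPy. The numbers below are made up for illustration (nine sorted values, chosen so that the quartiles land exactly on data points); when a quartile falls between two values, np.percentile interpolates by default, which can differ slightly from the "middle number" wording above.

```python
import numpy as np

# Hypothetical yearly counts, already sorted.
data = np.array([200, 350, 500, 700, 900, 1000, 1100, 1200, 1300])

# 0th, 25th, 50th, 75th and 100th percentiles give the five-number summary.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```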
To make a box plot, pass kind='box' to plot().
df_japan.plot(kind='box', figsize=(8,6))
plt.show()
We can immediately make a few key observations from the plot above:
1. The minimum number of immigrants is around 200 (min), the maximum is around 1300 (max), and the median is around
900.
2. 25% of the years for period 1980-2013 had an annual immigrant count of ~500 or fewer (first
quartile).
3. 75% of the years for period 1980-2013 had an annual immigrant count of ~1100 or fewer (third
quartile)
To make a horizontal box plot, we can set the vert parameter of the plot function to False.
For example, if the distributions of both China and India are analysed using the dataframe df_CI, the
code is as follows:
plt.show()
Subplots
Often times we might want to plot multiple plots and put them in the same figure.
To visualize multiple plots together, we can create a figure (overall canvas) and divide it into
subplots, each containing a plot. With subplots, we usually work with the artist layer instead of the
scripting layer.
Typical syntax:
Example:
plt.show()
Additional info: subplot(211) == subplot(2,1,1)
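A small sketch of creating a figure with two subplots through the artist layer. The plotted values and titles are made up; the point is only to show add_subplot with both the (nrows, ncols, plot_number) form and the three-digit shorthand.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 5))   # overall canvas

ax0 = fig.add_subplot(1, 2, 1)      # 1 row, 2 columns, first plot
ax1 = fig.add_subplot(122)          # same grid, second plot (shorthand form)

# Each subplot is drawn on its own Axes object.
ax0.plot([1, 2, 3], [1, 4, 9])
ax0.set_title("Left subplot")
ax1.plot([1, 2, 3], [9, 4, 1])
ax1.set_title("Right subplot")
```

With pandas, a plot can be directed to one of these subplots by passing ax=ax0 or ax=ax1 to plot().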
Scatter Plot
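Step 1: Get the dataset
#we can use the sum() method to get the total population per year
df_tot=pd.DataFrame(df_can[years].sum(axis=0))
#change the years to type int
df_tot.index=list(map(int, df_tot.index))
#reset the index to put the years back in as a column
df_tot.reset_index(inplace=True)
#rename columns
df_tot.columns=['year','total']
Step 2: Plot the data. In matplotlib, a scatter plot is created with kind='scatter' along with specifying the
x and y (not automated)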
plt.show()
Now, let’s try to plot a linear line of best fit, and use it to predict number of immigrants in 2015.
Step 1: Get the equation of line of best fit. We will use Numpy’s polyfit() method by passing in the
following
x = x-coordinates of the data
y = y-coordinates of the data
deg = Degree of fitting polynomial. 1=Linear, 2=quadratic, and so on.
x=df_tot['year']
y=df_tot['total']
fit=np.polyfit(x,y,deg=1)
fit
In this case the slope is 5.56e+03, at position 0 of fit, and the intercept is -1.0926e+07, at position 1.
plt.show()
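Here is a sketch of the fit-and-predict step on made-up data. The y values are generated from a known straight line (an assumption for illustration, not the real totals), so it is easy to check that polyfit recovers the slope and intercept.

```python
import numpy as np

# Hypothetical data lying exactly on a line, not the real dataset.
x = np.arange(1980, 2014)          # years 1980..2013
y = 5000.0 * x - 9_800_000.0       # made-up yearly totals

# deg=1 fits a straight line; fit[0] is the slope, fit[1] the intercept.
fit = np.polyfit(x, y, deg=1)
slope, intercept = fit[0], fit[1]

# Evaluate the fitted line at 2015 to get the prediction.
predicted_2015 = slope * 2015 + intercept

print(slope, intercept, predicted_2015)
```

The same expression, fit[0] * 2015 + fit[1], would give the 2015 estimate for the real df_tot data.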
Bubble Plot
Analyzing Argentina’s great depression and comparing it with Brazil
Step 1: Get data for Brazil and Argentina. Like in the previous example, we will convert the Years to
type int and bring it in the dataframe.
# let's label the index. This will automatically be the column name when we reset the index
df_can_t.index.name = 'Year'
There are several methods of normalization in statistics, each with its own use. In this case, we will
use feature scaling to bring all values into the range [0,1]. The general formula is:
X' = (X - Xmin) / (Xmax - Xmin)
Therefore:
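A minimal sketch of feature scaling to the range [0,1]. The counts are made up, and min_max_scale is a hypothetical helper name; norm_brazil and norm_argentina are the names used for the bubble sizes in the plotting code.

```python
import numpy as np

# Hypothetical yearly counts for the two countries (not the real dataset).
brazil = np.array([211.0, 220.0, 192.0, 139.0, 145.0])
argentina = np.array([368.0, 426.0, 626.0, 241.0, 237.0])

def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    return (values - values.min()) / (values.max() - values.min())

norm_brazil = min_max_scale(brazil)
norm_argentina = min_max_scale(argentina)

print(norm_brazil)
print(norm_argentina)
```

Multiplying by 2000 and adding 10, as in the plotting code, then gives bubble sizes between 10 and 2010.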
#Brazil
ax0=df_can_t.plot(kind='scatter', x='Year', y='Brazil', figsize=(14,8), alpha=0.5, color='green',
                  s=norm_brazil*2000+10, xlim=(1975,2015))
#Argentina
ax1=df_can_t.plot(kind='scatter', x='Year', y='Argentina', alpha=0.5, color='blue',
                  s=norm_argentina*2000+10, ax=ax0)
ax0.set_ylabel('Number of Immigrants')
ax0.set_title('Immigration from Brazil and Argentina from 1980-2013')
ax0.legend(['Brazil','Argentina'], loc='upper left', fontsize='x-large')
This code follows the pandas and NumPy imports and the data preprocessing.
-Import Matplotlib
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches #needed for waffle Charts
-revisit the previous case study about Denmark, Norway, and Sweden
-Unfortunately, unlike in R, waffle charts are not built into any of the Python visualization libraries.
Therefore, we will learn how to create them from scratch.
Step 1. The first step into creating a waffle chart is determining the proportion of each category with
respect to the total.
Step 2. The second step is defining the overall size of the waffle chart
Step 3. The third step is using the proportion of each category to determine its respective number of
tiles
Based on the calculated proportions, Denmark will occupy 129 tiles of the waffle chart, Norway will
occupy 77 tiles, and Sweden will occupy 194 tiles.
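Steps 1 to 3 can be sketched in plain Python. The country totals and the 40 x 10 chart size below are assumptions, chosen so that the result reproduces the tile counts quoted above; they are not taken from these notes.

```python
# Hypothetical totals per country (assumed values for illustration).
totals = {"Denmark": 3901, "Norway": 2327, "Sweden": 5866}

# Step 1: proportion of each category with respect to the total.
grand_total = sum(totals.values())
proportions = {name: value / grand_total for name, value in totals.items()}

# Step 2: overall size of the waffle chart (assumed 40 x 10 = 400 tiles).
width, height = 40, 10
total_tiles = width * height

# Step 3: number of tiles per category, rounded to whole tiles.
tiles_per_category = [round(p * total_tiles) for p in proportions.values()]

print(tiles_per_category)
```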
Step 4. The fourth step is creating a matrix that resembles the waffle chart and populating it.
#if the number of tiles populated for the current category is equal to its corresponding allocated tiles…
if tile_index > sum(tiles_per_category[0:category_index]):
    #...proceed to the next category
    category_index +=1
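For context, here is a sketch of the full loop that the fragment above belongs to, written with plain Python lists instead of a NumPy matrix. The 40 x 10 chart size and the column-by-column fill order are assumptions; the tile counts are the ones quoted in Step 3.

```python
width, height = 40, 10                 # assumed chart size (400 tiles)
tiles_per_category = [129, 77, 194]    # Denmark, Norway, Sweden

# Empty chart: `height` rows by `width` columns.
waffle_chart = [[0] * width for _ in range(height)]

category_index = 0
tile_index = 0

# Fill the chart column by column, top to bottom.
for col in range(width):
    for row in range(height):
        tile_index += 1
        # Once the tiles allotted to the categories so far are used up,
        # proceed to the next category.
        if tile_index > sum(tiles_per_category[0:category_index]):
            category_index += 1
        # Categories are stored as the integers 1, 2, 3.
        waffle_chart[row][col] = category_index

# Count how many tiles each category ended up with.
counts = [sum(row.count(c) for row in waffle_chart) for c in (1, 2, 3)]
print(counts)
```

Each cell ends up holding the number of its category, which is what lets the chart be drawn with one color per category.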
plt.xticks([])
plt.yticks([])
Step 7. Create a legend and add it to the chart
A Python package already exists for generating word clouds. The package, called
word_cloud, was developed by Andreas Mueller.
#install wordcloud
!conda install -c conda-forge wordcloud==1.4.1 --yes
Word clouds are commonly used to perform high-level analysis and visualization of text data. Now,
let’s digress from the immigration to Canada data and analyse a short novel written by Lewis
Carroll titled Alice’s Adventures in Wonderland.
-Next, let’s use the stopwords that we imported from word_cloud. We use the function set to
remove any redundant stopwords.
stopwords = set(STOPWORDS)
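To illustrate what the set does and how stopwords affect word frequencies, here is a small sketch using only the standard library. The stopword list and the sample sentence are made up; they stand in for STOPWORDS and the novel text.

```python
from collections import Counter

# Hypothetical stopword list with a repeated entry; set() removes the repeat.
stopword_list = ["the", "and", "a", "the", "to"]
stopwords = set(stopword_list)

# Made-up sample text standing in for the novel.
text = "the rabbit and the queen went to the garden and the rabbit ran"

# Keep only the words that are not stopwords, then count them.
words = [w for w in text.split() if w not in stopwords]
word_counts = Counter(words)

print(sorted(stopwords))
print(word_counts.most_common(2))
```

Because stopwords is a set, another uninformative word can be added later with stopwords.add(...).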
Create a word cloud object and generate a word cloud. For simplicity, let’s generate a word cloud
using only the first 2000 words in the novel.
The bigger a word appears, the more common it presumably is within those 2000 words. Now, resize
the cloud so that we can see the less frequent words a little better.
fig=plt.figure()
fig.set_figwidth(14) #set width
fig.set_figheight(18) #set height
‘said’ isn’t really an informative word, so let’s add it to our stopwords and re-generate the cloud.
plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()
word_cloud also lets us superimpose the words onto a mask of any shape. For
example, using a mask of Alice and her rabbit:
plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()
Regression Plot
-Install seaborn
#install seaborn
!conda install -c anaconda seaborn --yes
#import library
import seaborn as sns
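-Create a new dataframe that stores the total number of landed immigrants to Canada per year
from 1980 to 2013.
#using the sum() method to get the total population per year
df_tot = pd.DataFrame(df_can[years].sum(axis=0))
#change the years to a numeric type and move them from the index into a column
df_tot.index = list(map(int, df_tot.index))
df_tot.reset_index(inplace=True)
#rename columns
df_tot.columns = ['year', 'total']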
-customize color
-blow up the plot a little bit so that it is more appealing to the sight
plt.figure(figsize=(15,10))
ax=sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')
-increase the size of markers so they match the new size of the figure, and add a title and x- and y-
labels.
plt.figure(figsize=(15,10))
ax=sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s':200})
-increase the font size of the tickmark labels, the title, and the x- and y-labels.
plt.figure(figsize=(15,10))
sns.set(font_scale=1.5)
plt.figure(figsize=(15,10))
sns.set(font_scale=1.5)
sns.set_style('ticks') # change background to white background
plt.figure(figsize=(15,10))
sns.set(font_scale=1.5)
sns.set_style('whitegrid')
-Install Folium
print('Folium installed and imported!') #if this is printed, Folium was installed successfully
-Generating the world map is straightforward in Folium. You simply create a Folium Map object and
then you display it. What is attractive about Folium maps is that they are interactive, so you can
zoom into any region of interest regardless of the initial zoom level.
All locations on a map are defined by their respective latitude and longitude values. So you can
create a map and pass in a center of latitude and longitude values of [0,0].
For a defined center, you can also define the initial zoom level into that location when the map is
rendered. The higher the zoom level, the more the map is zoomed into the center.
#define the world map centered around Canada with a low zoom level
world_map=folium.Map(location=[56.130, -106.35], zoom_start=4)
# define the world map centered around Canada with a higher zoom level
world_map=folium.Map(location=[56.130, -106.35], zoom_start=8) #the first value is the latitude and the second is the longitude
#display map
world_map
These are maps that feature hill shading and natural vegetation colors. They showcase advanced
labelling and linework generalization of dual-carriageway roads.
#display map
world_map
C. Mapbox Bright Maps
These are maps that are quite similar to the default style, except that the borders are not visible at a
low zoom level.
Let’s download and import the data on police incidents using the pandas read_csv() method
df_incidents = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Police_Department_Incidents_-_Previous_Year__2016_.csv')
df_incidents.head()
df_incidents.shape
(150500,13)
-So the dataframe consists of 150,500 crimes, which took place in the year 2016. In order to reduce
computational cost, let’s just work with the first 100 incidents in this dataset.
df_incidents.shape
(100,13)
Now that we have reduced the data a little bit, let’s visualize where these crimes took place in the city of
San Francisco. We will use the default style and we will initialize the zoom level to 12.
The zip() function returns a zip object, which is an iterator of tuples where the first items of the
passed iterables are paired together, then the second items are paired
together, and so on.
Example:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica")
x=zip(a,b)
print(tuple(x))
Results:
(('John', 'Jenny'), ('Charles', 'Christy'), ('Mike', 'Monica'))
-Now let’s superimpose the locations of the crimes onto the map. The way to do that in Folium is to
create a feature group with its own features and style and then add it to the sanfran map.
#loop through the 100 crimes and add each to the incidents feature group
for lat, lng in zip(df_incidents.Y, df_incidents.X):
    incidents.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=5, #define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6))
#to recap, Y, X, and Category are three columns in the dataframe
1. The simpler solution is to remove these location markers and just add the text to the circle
markers themselves as follows:
#loop through the 100 crimes and add each to the map
for lat, lng, label in zip(df_incidents.Y, df_incidents.X, df_incidents.Category):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, #define how big you want the circle markers to be
        color='yellow',
        fill=True,
        popup=label,
        fill_color='blue',
        fill_opacity=0.6).add_to(sanfran_map)
#show map
sanfran_map
2. The second, more proper way is to group the markers into different clusters. Each cluster
is then represented by the number of crimes in each neighbourhood. These clusters can be thought
of as pockets of San Francisco which you can then analyse separately.
To implement this, we start off by instantiating a MarkerCluster object and adding all the data points
in the dataframe to this object
#loop through the dataframe and add each data point to the marker cluster
for lat, lng, label in zip(df_incidents.Y, df_incidents.X, df_incidents.Category):
    folium.Marker(
        location=[lat,lng],
        icon=None,
        popup=label).add_to(incidents)
#display map
sanfran_map
When you zoom out all the way, all markers are grouped into one cluster.
Choropleth Maps
-Download the dataset and read it into a pandas dataframe (N.B. if xlrd is not installed, install it
first by running !conda install -c anaconda xlrd --yes):
df_can=pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx', sheet_name='Canada by Citizenship',
skiprows=range(20), skipfooter=2)
#for sake of consistency, let’s also make all column labels of type string
df_can.columns=list(map(str, df_can.columns))
df_can.head()
-In order to create a Choropleth map, we need a GeoJSON file that defines the areas/boundaries of
the state, county, or country that we are interested in. In our case, since we are endeavouring to
create a world map, we want a GeoJSON that defines the boundaries of all world countries. For our
convenience, the developer has provided such a file, which can be downloaded. Let’s name it
world_countries.json.
-Now that we have the GeoJSON file, let’s create a world map, centered around [0,0] latitude and longitude
values, with an initial zoom level of 2, and using the Mapbox Bright style.
-And now to create a Choropleth map, we will use the choropleth method with the following main
parameters:
#generate choropleth map using the total immigration of each country to Canada from 1980 to 2013
world_map.choropleth(
    geo_data=world_geo,
    data=df_can,
    columns=['Country','Total'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Immigration to Canada')
#display map
world_map
Notice how the legend is displaying a negative boundary or threshold. Let’s fix that by defining our
own thresholds and starting with 0 instead of -6,918!
world_geo=r'world_countries.json'
#create a numpy array of length 6 with linear spacing from the minimum total immigration to the
#maximum total immigration
threshold_scale=np.linspace(df_can['Total'].min(), df_can['Total'].max(), 6, dtype=int)
threshold_scale=threshold_scale.tolist() #change the numpy array to a list
threshold_scale[-1]=threshold_scale[-1]+1 #make sure that the last value of the list is greater than the maximum immigration
world_map