Course3 Notes

The document discusses various methods for creating different types of plots in Python using Matplotlib and Pandas. It describes how to create histograms, bar charts, pie charts, box plots, and stacked histograms from dataframe data. Methods include transposing dataframes, customizing colors, labels, and other visual properties. Box plots and pie charts are used to analyze and visualize immigration data grouped by country and continent.

Uploaded by

Stefano Pentury

-To make a histogram, we can use any of three equivalent forms: df_can['2013'].plot.hist(), some_data.plot(kind='type_plot', ...), and some_data.plot.type_plot(...). The (...) stands for any further keyword arguments; the height of each bar in the histogram is the frequency.

-We can also plot multiple histograms on the same plot. For example, select the three countries:

df_can.loc[['Denmark', 'Norway', 'Sweden'], years]

-To generate the histogram

df_can.loc[['Denmark', 'Norway', 'Sweden'], years].plot.hist()

The result does not look right: pandas draws one histogram per column, and here the columns are the years rather than the countries. The dataframe must therefore be transposed first:

df_t = df_can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()

df_t.head()
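A small illustration of why the transpose matters, using a made-up three-country frame (the numbers are invented for illustration and are not from the immigration dataset). pandas treats each column as one series, so the countries need to be the columns:

```python
import pandas as pd

# Hypothetical toy data: rows are countries, columns are years (values are invented).
years_toy = ['1980', '1981', '1982']
df_toy = pd.DataFrame(
    [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
    index=['Denmark', 'Norway', 'Sweden'],
    columns=years_toy,
)

# Before transposing, each column (one plotted series) is a year.
cols_before = list(df_toy.columns)

# After transposing, each column is a country, which is what the histogram needs.
df_toy_t = df_toy.transpose()
cols_after = list(df_toy_t.columns)

print(cols_before)
print(cols_after)
```

The values themselves are unchanged; only rows and columns swap places.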

-To generate the histogram

df_t.plot(kind='hist', figsize=(10, 6))

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden')

plt.ylabel('Number of Years')

plt.xlabel('Number of Immigrants')

plt.show()

-Now, let's make some modifications to improve the visualization: increase the number of bins to 15 with the bins parameter, set the transparency to 60% with the alpha parameter, label the x-axis with plt.xlabel, and change the colors of the plots with the color parameter.

count, bin_edges = np.histogram(df_t, 15)

df_t.plot(kind='hist', figsize=(10, 6), bins=15, alpha=0.6, xticks=bin_edges,
    color=['coral', 'darkslateblue', 'mediumseagreen'])

plt.title('……………..')
plt.ylabel('………….')
plt.xlabel('…………..')

plt.show()
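The count, bin_edges pair used above comes from NumPy's histogram function (assuming the usual import numpy as np). As a minimal sketch with invented sample values, it returns one count per bin and one more edge than there are bins, which is why bin_edges can be passed directly as the x-axis ticks:

```python
import numpy as np

# Invented sample values, only to show the shape of what np.histogram returns.
data = np.array([31, 45, 60, 75, 90, 120, 150, 200, 250, 308])

count, bin_edges = np.histogram(data, 15)

print(len(count))      # number of bins
print(len(bin_edges))  # number of edges, one more than the number of bins
print(bin_edges[0], bin_edges[-1])  # smallest and largest value in the data
```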

Tip: for a full listing of colors available in Matplotlib, run the following code in the Python shell:
import matplotlib
for name, hex in matplotlib.colors.cnames.items():
    print(name, hex)

-We can also stack them using the stacked parameter, and adjust the minimum and maximum of the x-axis by passing a tuple to the xlim parameter:

count, bin_edges = np.histogram(df_t, 15)

xmin = bin_edges[0] - 10 # first bin value is 31.0; subtract a buffer of 10 for aesthetic purposes
xmax = bin_edges[-1] + 10 # last bin value is 308.0; add a buffer of 10 for aesthetic purposes

# stacked histogram
df_t.plot(kind='hist',
    figsize=(10, 6),
    bins=15,
    xticks=bin_edges,
    color=['coral', 'darkslateblue', 'mediumseagreen'],
    stacked=True,
    xlim=(xmin, xmax))

plt.title('…….')
plt.ylabel('…….')
plt.xlabel('…….')

plt.show()

2.4 Bar Charts (Dataframe)

To create a bar plot, we pass one of two values to the kind parameter of plot(): kind='bar' creates a vertical bar plot, and kind='barh' creates a horizontal bar plot.

-get the data

df_iceland = df_can.loc['Iceland', years]
df_iceland.head()

# plot data
df_iceland.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.title('Icelandic Immigrants to Canada from 1980 to 2013')

plt.show()

We can mark a feature of the plot (here, the 2008 - 2011 financial crisis) using the annotate method of the scripting layer (the pyplot interface). We will pass in the following parameters:

s: str, the text of the annotation (named text in newer Matplotlib releases)

xy: tuple specifying the (x, y) point to annotate (in this case, the end point of the arrow)
xytext: tuple specifying the (x, y) point to place the text (in this case, the start point of the arrow)
xycoords: the coordinate system that xy is given in; 'data' uses the coordinate system of the object being annotated (default).
arrowprops: takes a dictionary of properties to draw the arrow:
arrowstyle: specifies the arrow style, '->' is a standard arrow.
connectionstyle: specifies the connection type. arc3 is a straight line.
color: specifies the color of the arrow
lw: specifies the line width.
additional parameters:
rotation: rotation angle of the text in degrees (counterclockwise)
va: vertical alignment of the text ['center'|'top'|'bottom'|'baseline']
ha: horizontal alignment of the text ['center'|'right'|'left']

df_iceland.plot(kind='bar', figsize=(10, 6), rot=90)

plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.title('Icelandic Immigrants to Canada from 1980 to 2013')

# Annotate arrow
plt.annotate('', # s: str. will leave it blank for no text
    xy=(32, 70), # place head of the arrow at point (year 2012, pop 70)
    xytext=(28, 20), # place base of the arrow at point (year 2008, pop 20)
    xycoords='data', # will use the coordinate system of the object being annotated
    arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2))

# Annotate Text
plt.annotate('2008 - 2011 Financial Crisis', # text to display
    xy=(28, 30), # start the text at point (year 2008, pop 30)
    rotation=72.5, # based on trial and error to match the arrow
    va='bottom', # want the text to be vertically 'bottom' aligned
    ha='left') # want the text to be horizontally 'left' aligned

plt.show()
Another example:

# Get the data pertaining to the top 15 countries

df_can.sort_values(['Total'], ascending=False, inplace=True)

df_top15 = df_can['Total'].head(15)

df_top15

# Plot data using horizontal bar chart

df_top15.plot(kind='barh', figsize=(10, 6))

plt.xlabel('Number of Immigrants')

plt.ylabel('Country')

plt.title('LALALALA')

# label each bar with its value, written with thousands separators
for index, value in enumerate(df_top15):
    label = format(int(value), ',')
    plt.annotate(label, xy=(value - 47000, index - 0.10), color='white')

plt.show()
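The bar labels above rely on Python's built-in format() with the ',' format specifier, which inserts thousands separators. A minimal sketch with invented values (not taken from the dataset):

```python
# Invented values, only to show what format(value, ',') produces.
values = [1234567, 98765, 4321]

labels = []
for index, value in enumerate(values):
    label = format(int(value), ',')  # insert thousands separators
    labels.append(label)
    print(index, label)
```

In the plot, index gives the vertical position of the bar and the label is placed slightly left of the bar's end so that it sits inside the bar.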

Pie charts, box plots, scatter plots, and bubble plots

Pie Chart

Uses kind=pie keyword

Let’s use a pie chart to explore the proportion (percentage) of new immigrants grouped by
continents for the entire time period from 1980 to 2013.

Step 1: Gather data

We will use the pandas groupby method to summarize the immigration data by continent. The general process of groupby involves the following steps:
1. Split: Splitting the data into groups based on some criteria.
2. Apply: Applying a function to each group independently:
.sum()
.count()
.mean()
.std()
.aggregate()
.apply()
.etc..
3. Combine: Combining the results into a data structure.
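The three steps can be sketched in plain Python, without pandas, to show what groupby followed by sum() does conceptually. The rows below are invented and are not the real immigration figures:

```python
from collections import defaultdict

# Invented rows of (continent, total); not the real immigration figures.
rows = [
    ('Asia', 100),
    ('Europe', 40),
    ('Asia', 60),
    ('Africa', 25),
    ('Europe', 10),
]

# 1. Split: collect the values of each group under its key.
groups = defaultdict(list)
for continent, total in rows:
    groups[continent].append(total)

# 2. Apply: run a function (here sum) on each group independently.
# 3. Combine: gather the per-group results into one structure.
totals = {continent: sum(values) for continent, values in groups.items()}

print(totals)
```

pandas does the same thing column by column and returns the combined result as a dataframe indexed by the group key.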

Example:

# group countries by continents and apply sum() function

df_continents = df_can.groupby('Continent', axis=0).sum()

# note: the output of the groupby method is a 'groupby' object.
# we cannot use it further until we apply a function (e.g. sum())
print(type(df_can.groupby('Continent', axis=0)))

df_continents.head()
Step 2: Plot the data. We will pass in the kind='pie' keyword, along with the following additional parameters:
autopct - a string or function used to label the wedges (slices) with their numeric value. The label will be placed inside the wedge. If it is a format string, the label will be fmt%pct.
startangle - rotates the start of the pie chart by angle degrees counterclockwise from the x-axis.
shadow - draws a shadow beneath the pie (to give a 3D feel).

df_continents['Total'].plot(kind='pie', figsize=(5, 6), autopct='%1.1f%%', startangle=90, shadow=True)

# the 1.1 in autopct means one digit after the decimal point; 1.2 gives two digits, 1.3 gives three digits, and so on.

plt.title('Immigration to Canada by Continent [1980-2013]')

plt.axis('equal') # sets the pie chart to look like a circle

plt.show()
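The autopct string is an ordinary printf-style format, so its effect can be checked in the shell with the % operator. A minimal sketch, using an arbitrary percentage chosen for illustration:

```python
# The % operator applies the same kind of format string that autopct receives.
pct = 12.3456  # an arbitrary percentage, for illustration only

one_decimal = '%1.1f%%' % pct   # one digit after the decimal point
two_decimals = '%1.2f%%' % pct  # two digits after the decimal point

print(one_decimal)
print(two_decimals)
```

The doubled %% at the end is how a literal percent sign is written in this kind of format string.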

The above visual is not very clear: the numbers and text overlap in some instances. Let's make a few modifications to improve the visuals:
legend (plt.legend) places a legend wherever we want
pctdistance sets how far from the center the percentages are drawn
colors sets the color of every wedge
explode emphasizes selected wedges by offsetting them (in this case, the lowest three continents, which are Africa, North America, and Latin America and Caribbean)

colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']

explode_list = [0.1, 0, 0, 0, 0.1, 0.1] # ratio for each continent with which to offset each wedge

df_continents['Total'].plot(kind='pie', figsize=(15, 6), autopct='%1.1f%%', startangle=90, shadow=True,
    labels=None, pctdistance=1.12, colors=colors_list, explode=explode_list)

plt.title('Immigration to Canada by Continent [1980-2013]', y=1.12) # y=1.12 moves the title up along the y-axis

plt.axis('equal')
plt.legend(labels=df_continents.index, loc='upper left')

plt.show()

To make the pie chart for 2013 only, type:

df_continents['2013'].plot(kind='pie',……..) # the rest is the same as the previous one

Box Plots

A box plot is a way of statistically representing the distribution of the data through five main
dimensions:
minimum: smallest number in the dataset
first quartile: middle number between the minimum and the median
second quartile (median): middle number of the (sorted) dataset
third quartile: middle number between median and maximum
maximum: highest number in the dataset
To make a box plot, pass kind='box' to plot().
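The five numbers can be computed directly with Python's standard library, as a minimal sketch on invented yearly counts (not the Japan data). The choice of method='inclusive' is an assumption made here; it interpolates linearly between data points:

```python
import statistics

# Invented yearly counts, used only to illustrate the five numbers.
data = [200, 500, 700, 900, 1000, 1100, 1300]

# Split the sorted data into four equal parts; returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')

summary = {
    'min': min(data),
    'q1': q1,
    'median': median,
    'q3': q3,
    'max': max(data),
}
print(summary)
```

Note that Matplotlib's box plot by default extends the whiskers only to the furthest points within 1.5 times the interquartile range and draws anything beyond as individual outlier points, so the whisker ends equal the minimum and maximum only when there are no outliers.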

Step 1: Get the dataset.

# to get a dataframe, place extra square brackets around 'Japan'.

df_japan = df_can.loc[['Japan'], years].transpose()

df_japan.head()

Step 2: Plot by passing in kind='box'

df_japan.plot(kind='box', figsize=(8, 6))

plt.title('Box plot of Japanese Immigrants from 1980-2013')

plt.ylabel('Number of Immigrants')

plt.show()

We can immediately make a few key observations from the plot above:
1. The minimum number of immigrants is around 200, the maximum is around 1300, and the median is around 900.
2. 25% of the years for the period 1980-2013 had an annual immigrant count of ~500 or fewer (first quartile).
3. 75% of the years for the period 1980-2013 had an annual immigrant count of ~1100 or fewer (third quartile).

To check the actual numbers, we can use describe():

df_japan.describe()

To make horizontal box plots, we can pass the vert parameter to the plot function and set it to False. For example, if the distributions of both China and India are held in a dataframe df_CI, the code is as follows:

df_CI.plot(kind='box', figsize=(10, 7), color='blue', vert=False)

plt.title('Box plots of Immigrants from China and India (1980-2013)')

plt.xlabel('Number of Immigrants')

plt.show()

Subplots

Often we want to draw multiple plots and put them in the same figure.

To visualize multiple plots together, we can create a figure (overall canvas) and divide it into subplots, each containing a plot. With subplots, we usually work with the artist layer instead of the scripting layer.

Typical syntax:

fig = plt.figure() # create figure

ax = fig.add_subplot(nrows, ncols, plot_number) # create subplots
Where:
nrows and ncols are used to notionally split the figure into (nrows*ncols) sub-axes,
plot_number is used to identify the particular subplot (first, second, third, and so on)

Example:

fig = plt.figure() # create figure

ax0 = fig.add_subplot(1, 2, 1) # add subplot 1 (1 row, 2 columns, first plot)

ax1 = fig.add_subplot(1, 2, 2) # add subplot 2 (1 row, 2 columns, second plot)

# Subplot 1: Box plot

df_CI.plot(kind='box', color='blue', vert=False, figsize=(20, 6), ax=ax0) # add to subplot 1
ax0.set_title('Box Plots of Immigrants from China and India (1980 - 2013)')
ax0.set_xlabel('Number of Immigrants')
ax0.set_ylabel('Countries')

# Subplot 2: Line plot

df_CI.plot(kind='line', figsize=(20, 6), ax=ax1) # add to subplot 2
ax1.set_title('Line Plots of Immigrants from China and India (1980 - 2013)')
ax1.set_ylabel('Number of Immigrants')
ax1.set_xlabel('Years')

plt.show()
Additional info: subplot(211) == subplot(2,1,1)

Scatter Plot
Step 1: Get the dataset

# we can use the sum() method to get the total population per year
df_tot = pd.DataFrame(df_can[years].sum(axis=0))

# change the years to type int (useful for regression later on)

df_tot.index = map(int, df_tot.index)

# reset the index to put it back in as a column in the df_tot dataframe

df_tot.reset_index(inplace=True)

# rename columns
df_tot.columns = ['year', 'total']

# view the final dataframe

df_tot.head()

Step 2: Plot the data. With the pandas plot function, a scatter plot is created with kind='scatter', and the x and y columns must be specified explicitly (they are not inferred).

df_tot.plot(kind='scatter', x='year', y='total', figsize=(10, 6), color='darkblue')

plt.title('Total Immigration to Canada from 1980-2013')

plt.xlabel('Year')
plt.ylabel('Number of Immigrants')

plt.show()
Now, let's try to plot a linear line of best fit, and use it to predict the number of immigrants in 2015.

Step 1: Get the equation of the line of best fit. We will use NumPy's polyfit() method by passing in the following:
x = x-coordinates of the data
y = y-coordinates of the data
deg = degree of the fitting polynomial. 1 = linear, 2 = quadratic, and so on.

x = df_tot['year']
y = df_tot['total']
fit = np.polyfit(x, y, deg=1)
fit

In this case the slope is about 5.567e+03 (position 0 of fit) and the intercept is about -1.0926e+07 (position 1 of fit).

Step 2: Plot the regression line on the scatter plot.

df_tot.plot(kind='scatter', x='year', y='total', figsize=(10, 6), color='darkblue')

plt.title('Total Immigration to Canada from 1980-2013')

plt.xlabel('Year')
plt.ylabel('Number of Immigrants')

plt.plot(x, fit[0]*x + fit[1], color='red') # recall that x is the years

plt.annotate('y={0:.0f} x + {1:.0f}'.format(fit[0], fit[1]), xy=(2000, 150000))

plt.show()

# print out the line of best fit

'No. Immigrants = {0:.0f}*Year + {1:.0f}'.format(fit[0], fit[1])

Now, we can predict the number of immigrants in 2015:

No. Immigrants = 5567*Year - 10926195
No. Immigrants = 5567*2015 - 10926195
No. Immigrants = 291,310
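For a degree-1 fit, the slope and intercept are the ordinary least-squares line, which also has a closed form. The sketch below is a plain-Python illustration of that formula on invented points (fit_line is a hypothetical helper written for this note, not part of NumPy), followed by the same arithmetic as the 2015 prediction:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
prediction = slope * 10 + intercept  # evaluate the fitted line at x = 10

# The same arithmetic as the 2015 prediction in the notes.
notes_prediction = 5567 * 2015 - 10926195
print(slope, intercept, prediction, notes_prediction)
```

Predicting for a new year is just evaluating slope * year + intercept, exactly as done by hand above.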
Another example: create a scatter plot of the total immigration from Denmark, Norway, and Sweden to Canada from 1980 to 2013.

# create df_countries dataframe

df_countries = df_can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()

# create df_total by summing across three countries for each year

df_total = pd.DataFrame(df_countries.sum(axis=1))

# reset index in place

df_total.reset_index(inplace=True)

# rename columns
df_total.columns = ['year', 'total']

# change column year from string to int to create scatter plot

df_total['year'] = df_total['year'].astype(int)

# show resulting dataframe

df_total.head()

# plot the scatter plot

df_total.plot(kind='scatter', x='year', y='total', figsize=(10, 6), color='darkblue')

plt.title('Immigration from Denmark, Norway, and Sweden to Canada from 1980-2013')

plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.show()

Bubble Plot
Analyzing Argentina's great depression and comparing it with Brazil

Step 1: Get data for Brazil and Argentina. As in the previous example, we will convert the years to type int and bring them into the dataframe as a column.

df_can_t = df_can[years].transpose() # transposed dataframe

# cast the Years (the index) to type int
df_can_t.index = map(int, df_can_t.index)

# let's label the index. This will automatically be the column name when we reset the index
df_can_t.index.name = 'Year'

# reset index to bring the Year in as a column

df_can_t.reset_index(inplace=True)

# view the changes

df_can_t.head()

Step 2: Create the normalized weights.

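There are several methods of normalization in statistics, each with its own use. In this case, we will use feature scaling to bring all values into the range [0, 1]. The general formula is:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where X is an original value and X' is the normalized value. Therefore: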

# normalize Brazil data

norm_brazil = (df_can_t['Brazil'] - df_can_t['Brazil'].min()) / (df_can_t['Brazil'].max() -
df_can_t['Brazil'].min())

# normalize Argentina data

norm_argentina = (df_can_t['Argentina'] - df_can_t['Argentina'].min()) / (df_can_t['Argentina'].max()
- df_can_t['Argentina'].min())

Step 3: Plot the data.

-To plot two different scatter plots in one plot, we can draw the second plot on the axes of the first by passing those axes via the ax parameter.
-We will also pass in the weights using the s parameter. Given that the normalized weights are between 0 and 1, the bubbles would be too small to see, therefore:
-multiply the weights by 2000 to scale them up on the graph, and,
-add 10 to compensate for the minimum value (which has a weight of 0 and would therefore still have size 0 after multiplying by 2000).
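The normalization and the bubble-size scaling can be checked on a few invented counts (not the Brazil or Argentina data). This is a minimal sketch; min_max_scale is a hypothetical helper name used only here:

```python
def min_max_scale(values):
    """Feature scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

counts = [50, 100, 150, 250]  # invented yearly counts
weights = min_max_scale(counts)

# Scale up for visibility and add 10 so the smallest bubble is still drawn.
sizes = [w * 2000 + 10 for w in weights]

print(weights)
print(sizes)
```

The year with the smallest count gets size 10 rather than 0, and the year with the largest count gets size 2010.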

# Brazil
ax0 = df_can_t.plot(kind='scatter', x='Year', y='Brazil', figsize=(14, 8), alpha=0.5, color='green',
    s=norm_brazil*2000+10, xlim=(1975, 2015))

# Argentina
ax1 = df_can_t.plot(kind='scatter', x='Year', y='Argentina', alpha=0.5, color='blue',
    s=norm_argentina*2000+10, ax=ax0)

ax0.set_ylabel('Number of Immigrants')
ax0.set_title('Immigration from Brazil and Argentina from 1980-2013')
ax0.legend(['Brazil', 'Argentina'], loc='upper left', fontsize='x-large')

WEEK 3-Advanced Visualizations and Geospatial Data

WAFFLE CHARTS, WORD CLOUDS, and REGRESSION PLOTS

This code follows the pandas and NumPy imports and the data preprocessing.

-Import Matplotlib

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches # needed for waffle charts

mpl.style.use('ggplot') # optional: for ggplot-like style

# check for latest version of Matplotlib

print('Matplotlib version:', mpl.__version__) # >= 2.0.0
WAFFLE CHART

-revisit the previous case study about Denmark, Norway, and Sweden

# create a new dataframe for these three countries

df_dsn = df_can.loc[['Denmark', 'Norway', 'Sweden'], :]

# let's take a look at our dataframe

df_dsn

-Unfortunately, unlike R, waffle charts are not built into any of the Python visualization libraries. Therefore, we will learn how to create them from scratch.

Step 1. The first step in creating a waffle chart is determining the proportion of each category with respect to the total.

# compute the proportion of each category with respect to the total

total_values = sum(df_dsn['Total'])
category_proportions = [(float(value) / total_values) for value in df_dsn['Total']]

# print out proportions

for i, proportion in enumerate(category_proportions):
    print(df_dsn.index.values[i] + ': ' + str(proportion))

Step 2. The second step is defining the overall size of the waffle chart.

width = 40 # width of chart

height = 10 # height of chart

total_num_tiles = width * height # total number of tiles

print('Total number of tiles is', total_num_tiles)

Step 3. The third step is using the proportion of each category to determine its respective number of tiles.

# compute the number of tiles for each category

tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]
# print out number of tiles per category
for i, tiles in enumerate(tiles_per_category):
    print(df_dsn.index.values[i] + ': ' + str(tiles))

Based on the calculated proportions, Denmark will occupy 129 tiles of the waffle chart, Norway will occupy 77 tiles, and Sweden will occupy 194 tiles.
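Steps 1 to 3 can be tried on a tiny invented example (three hypothetical categories on a 10 x 4 grid) to see how the proportions turn into tile counts:

```python
# Invented category totals; the names and numbers are for illustration only.
values = {'A': 50, 'B': 30, 'C': 20}

width_demo = 10   # width of the demo chart
height_demo = 4   # height of the demo chart
total_tiles_demo = width_demo * height_demo

total_demo = sum(values.values())
tiles_demo = {
    name: round(value / total_demo * total_tiles_demo)
    for name, value in values.items()
}

print(tiles_demo)
print(sum(tiles_demo.values()))
```

Because each count is rounded separately, the counts are not guaranteed to add up to the total number of tiles for every input; here they do (and 129 + 77 + 194 = 400 in the notes' example as well).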

Step 4. The fourth step is creating a matrix that resembles the waffle chart and populating it.

# initialize the waffle chart as an empty matrix

waffle_chart = np.zeros((height, width))

# define indices to loop through waffle chart

category_index = 0
tile_index = 0

# populate the waffle chart

for col in range(width):
    for row in range(height):
        tile_index += 1

        # if the number of tiles populated for the current category is equal to its corresponding allocated tiles...
        if tile_index > sum(tiles_per_category[0:category_index]):
            # ...proceed to the next category
            category_index += 1

        # set the class value to an integer, which increases with class
        waffle_chart[row, col] = category_index

print('Waffle chart populated!')

waffle_chart # to see what the matrix looks like

Step 5. Map the waffle chart matrix into a visual.

# instantiate a new figure object

fig = plt.figure()

# use matshow to display the waffle chart

colormap = plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()

Step 6. Prettify the chart.

# instantiate a new figure object

fig = plt.figure()

# use matshow to display the waffle chart

colormap = plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()

# get the axis

ax = plt.gca()

# set minor ticks

ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
ax.set_yticks(np.arange(-.5, (height), 1), minor=True)

# add gridlines based on minor ticks

ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

plt.xticks([])
plt.yticks([])
Step 7. Create a legend and add it to the chart.

Now it would be very inefficient to repeat these seven steps every time we wish to create a waffle chart. So let's combine all seven steps into one function called create_waffle_chart. This function would take the following parameters as input:
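The parameter list itself is not preserved in these notes. As a hedged sketch of the core of such a function (steps 1 to 4 only, with no plotting or legend), the helper below uses hypothetical parameters values, height, and width, and plain lists instead of a NumPy array. It also adds a small guard, not present in the steps above, so that the category index never runs past the last category when rounding leaves a few tiles over:

```python
def waffle_matrix(values, height, width):
    """Return a height x width grid of category numbers (1, 2, 3, ...)."""
    total = sum(values)
    total_tiles = height * width
    tiles_per_category = [round(v / total * total_tiles) for v in values]

    matrix = [[0] * width for _ in range(height)]
    category_index = 0
    tile_index = 0

    # Fill column by column, as in the loop from Step 4.
    for col in range(width):
        for row in range(height):
            tile_index += 1
            # Move on once the current category has used up its tiles
            # (guarded so the index stops at the last category).
            while (category_index < len(values)
                   and tile_index > sum(tiles_per_category[:category_index])):
                category_index += 1
            matrix[row][col] = category_index
    return matrix

# Invented totals for three hypothetical categories.
grid = waffle_matrix([50, 30, 20], height=4, width=10)
flat = [cell for grid_row in grid for cell in grid_row]
print(flat.count(1), flat.count(2), flat.count(3))
```

A full create_waffle_chart would pass this grid to plt.matshow and then add the gridlines and legend from steps 5 to 7.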
Word Clouds

A Python package already exists for generating word clouds. The package, called word_cloud, was developed by Andreas Mueller.

-First, let's install the package.

# install wordcloud
!conda install -c conda-forge wordcloud==1.4.1 --yes

# import package and its set of stopwords

from wordcloud import WordCloud, STOPWORDS

print('Wordcloud is installed and imported!')

Word clouds are commonly used to perform high-level analysis and visualization of text data. Now, let's digress from the immigration to Canada data and analyse a short novel written by Lewis Carroll titled Alice's Adventures in Wonderland.

-First, download a .txt file of the novel.

# download file and save as alice_novel.txt

!wget --quiet [Link] .txt

# open the file and read it into a variable alice_novel

alice_novel = open('alice_novel.txt', 'r').read()

print('File downloaded and saved!')

-Next, let's use the stopwords that we imported from word_cloud. We use the function set to remove any redundant stopwords.

stopwords = set(STOPWORDS)
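To see what the stopword set is for, here is a minimal plain-Python sketch of the idea behind a word cloud: count how often each word occurs, skipping the stopwords. The sentence and the small stopword list are invented, and the tokenization is a simplification, not the word_cloud package's own processing:

```python
from collections import Counter
import re

# Invented sample text and a hypothetical small stopword list.
text = "Alice said the Rabbit said Alice was late and the Queen said nothing"
stop = set(['the', 'and', 'was', 'said', 'the'])  # set() drops the repeated 'the'

# Lowercase the text and pull out the words.
words = re.findall(r"[a-z']+", text.lower())

# Count every word that is not a stopword.
freq = Counter(w for w in words if w not in stop)

print(len(stop))
print(freq.most_common(1))
```

In a word cloud, the words with the highest counts are the ones drawn largest.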
Create a word cloud object and generate a word cloud. For simplicity, let's cap the cloud at 2000 words (max_words sets the maximum number of words shown in the cloud).

# instantiate a word cloud object

alice_wc = WordCloud(
    background_color='white',
    max_words=2000,
    stopwords=stopwords)

# generate the word cloud

alice_wc.generate(alice_novel)

# display the word cloud

plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The bigger a word appears, the more often it occurs in the text. Now, resize the cloud so that we can see the less frequent words a little better.

fig = plt.figure()
fig.set_figwidth(14) # set width
fig.set_figheight(18) # set height

# display the cloud

plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

"said" isn't really an informative word. So let's add it to our stopwords and re-generate the cloud.

stopwords.add('said') # add the word said to stopwords

# re-generate the word cloud
alice_wc.generate(alice_novel)

# display the cloud

fig = plt.figure()
fig.set_figwidth(14) # set width
fig.set_figheight(18) # set height

plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The word_cloud package can also superimpose the words onto a mask of any shape. For example, using a mask of Alice and her rabbit:

# download image in png

!wget --quiet [Link]
data/CognitiveClass/DV0101EN/labs/Images/alice_mask.png

# save mask to alice_mask

from PIL import Image # needed to read the mask image

alice_mask = np.array(Image.open('alice_mask.png'))

# show the png

fig = plt.figure()
fig.set_figwidth(14)
fig.set_figheight(18)

plt.imshow(alice_mask, cmap=plt.cm.gray, interpolation='bilinear')

plt.axis('off')
plt.show()
Shaping the word cloud

# instantiate a word cloud object

alice_wc = WordCloud(background_color='white', max_words=2000, mask=alice_mask,
    stopwords=stopwords)

# generate the word cloud

alice_wc.generate(alice_novel)

# display the word cloud

fig = plt.figure()
fig.set_figwidth(14) # set width
fig.set_figheight(18) # set height

plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Regression Plot

-Install seaborn

# install seaborn
!conda install -c anaconda seaborn --yes
# import library
import seaborn as sns

print('Seaborn installed and imported!')

-Create a new dataframe that stores the total number of landed immigrants to Canada per year from 1980 to 2013.

# using the sum() method to get the total population per year
df_tot = pd.DataFrame(df_can[years].sum(axis=0))

# change the years to type float (useful for regression later on)

df_tot.index = map(float, df_tot.index)

# reset the index to put it back in as a column in the df_tot dataframe

df_tot.reset_index(inplace=True)

# rename columns
df_tot.columns = ['year', 'total']

# view the final dataframe

df_tot.head()

-generate the regression plot

import seaborn as sns

ax = sns.regplot(x='year', y='total', data=df_tot)

-customize the color

import seaborn as sns

ax = sns.regplot(x='year', y='total', data=df_tot, color='green')
-customize the marker shape, so instead of circular markers, let's use '+'.

import seaborn as sns

ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')

-enlarge the plot a little so that it is easier to read

plt.figure(figsize=(15, 10))
ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')
-increase the size of the markers so they match the new size of the figure, and add a title and x- and y-labels.

plt.figure(figsize=(15, 10))
ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})

ax.set(xlabel='Year', ylabel='Total Immigration') # add x- and y-labels

ax.set_title('Total Immigration to Canada from 1980-2013') # add title

-increase the font size of the tickmark labels, the title, and the x- and y-labels.

plt.figure(figsize=(15, 10))

sns.set(font_scale=1.5)

ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})

ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980-2013')
-change the background to a plain white background

plt.figure(figsize=(15, 10))

sns.set(font_scale=1.5)
sns.set_style('ticks') # change background to white background

ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})

ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980-2013')

-or to a white background with gridlines.

plt.figure(figsize=(15, 10))

sns.set(font_scale=1.5)
sns.set_style('whitegrid')

ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})

ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980-2013')
Another example, using seaborn to create a scatter plot with a regression line to visualize the total
immigration from Denmark, Sweden and Norway to Canada from 1980 to 2013.
WEEK 3-2 GENERATING MAPS WITH PYTHON
In this session, we will learn how to create maps for different objectives. We will use a Python visualization library named Folium instead of Matplotlib. Folium was developed for the sole purpose of visualizing geospatial data. Other libraries are available to visualize geospatial data, such as Plotly, but they might have a cap on how many API calls you can make within a defined time frame. Folium, on the other hand, is completely free.

Two datasets used:

-San Francisco Police Department Incidents for the year 2016
-Immigration to Canada from 1980 to 2013

-Install Folium

!conda install -c conda-forge folium=0.5.0 --yes

import folium

print('Folium installed and imported!') # if this is printed, Folium was installed successfully

-Generating the world map is straightforward in Folium. You simply create a Folium Map object and then display it. What is attractive about Folium maps is that they are interactive, so you can zoom into any region of interest regardless of the initial zoom level.

# define the world map

world_map = folium.Map()

# display world map

world_map

All locations on a map are defined by their respective latitude and longitude values. So you can create a map and pass in a center as latitude and longitude values, for example [0, 0].

For a defined center, you can also define the initial zoom level into that location when the map is rendered. The higher the zoom level, the more the map is zoomed into the center.
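To give a feel for what a zoom level means: in the standard tiled web maps that Folium shows by default (OpenStreetMap-style tiles), each zoom step doubles the number of tiles along each side of the world map. A minimal sketch of that arithmetic (the helper names are made up for this note):

```python
def tiles_per_side(zoom):
    """Number of map tiles along one side of the world at a zoom level."""
    return 2 ** zoom

def total_tiles(zoom):
    """Total number of tiles covering the world at a zoom level."""
    return tiles_per_side(zoom) ** 2

for zoom in (0, 4, 8):
    print(zoom, tiles_per_side(zoom), total_tiles(zoom))
```

So zoom_start=8 shows the area around the center at 16 times the linear detail of zoom_start=4.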

# define the world map centered around Canada with a low zoom level
world_map = folium.Map(location=[56.130, -106.35], zoom_start=4)

# display world map

world_map
-let's create the map again with a higher zoom level

# define the world map centered around Canada with a higher zoom level
world_map = folium.Map(location=[56.130, -106.35], zoom_start=8) # the first value is the latitude, the second is the longitude

# display world map

world_map

A. Stamen Toner Maps

These are high-contrast B+W (black and white) maps. They are perfect for data mashups and exploring river meanders and coastal zones.

# create a Stamen Toner map of the world centered around Canada

world_map = folium.Map(location=[56.130, -106.35], zoom_start=4, tiles='Stamen Toner')

# display map
world_map

B. Stamen Terrain Maps

These are maps that feature hill shading and natural vegetation colors. They showcase advanced labelling and linework generalization of dual-carriageway roads.

# create a Stamen Terrain map of the world centered around Canada

world_map = folium.Map(location=[56.130, -106.35], zoom_start=4, tiles='Stamen Terrain')

# display map
world_map
C. Mapbox Bright Maps

These are maps that are quite similar to the default style, except that the borders are not visible at a low zoom level.

# create a world map with a Mapbox Bright style.

world_map = folium.Map(tiles='Mapbox Bright')

# display the map

world_map
MAPS WITH MARKERS

Let's download and import the data on police incidents using the pandas read_csv() method.

-download the dataset and read it into a pandas dataframe:

df_incidents = pd.read_csv('[Link]
data/CognitiveClass/DV0101EN/labs/Data_Files/Police_Department_Incidents_-
_Previous_Year__2016_.csv')

print('Dataset downloaded and read into a pandas dataframe!')

-take a look at the first five items in our dataset.

df_incidents.head()

Each row consists of 13 features:

df_incidents.shape
(150500, 13)

-So the dataframe consists of 150,500 crimes, which took place in the year 2016. In order to reduce computational cost, let's just work with the first 100 incidents in this dataset.

# get the first 100 crimes in the df_incidents dataframe

limit = 100
df_incidents = df_incidents.iloc[0:limit, :]

df_incidents.shape

(100, 13)

Now that we have reduced the data a little, let's visualize where these crimes took place in the city of San Francisco. We will use the default style and initialize the zoom level to 12.

# San Francisco latitude and longitude values

latitude = 37.77
longitude = -122.42

# create the map

sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# display the map of San Francisco

sanfran_map

*Additional: zip in Python

The zip() function returns a zip object, which is an iterator of tuples where the first items of the passed iterables are paired together, then the second items are paired together, and so on.
Example:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica")

x = zip(a, b)

print(tuple(x))

Result:

(('John', 'Jenny'), ('Charles', 'Christy'), ('Mike', 'Monica'))

-Now let's superimpose the locations of the crimes onto the map. The way to do that in Folium is to create a feature group with its own features and style and then add it to sanfran_map.

# instantiate a feature group for the incidents in the dataframe

incidents = folium.FeatureGroup()

# loop through the 100 crimes and add each to the incidents feature group
for lat, lng in zip(df_incidents.Y, df_incidents.X):
    incidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6))

# add incidents to map

sanfran_map.add_child(incidents)
-You can also add some pop-up text that is displayed when you click a marker. Let's make each marker display the category of the crime.

# instantiate a feature group for the incidents in the dataframe

incidents = folium.FeatureGroup()

# loop through the 100 crimes and add each to the incidents feature group
for lat, lng in zip(df_incidents.Y, df_incidents.X):
    incidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6))

# add pop-up text to each marker on the map

latitudes = list(df_incidents.Y)
longitudes = list(df_incidents.X)
labels = list(df_incidents.Category)

# to recap, Y, X, and Category are three columns in the dataframe

for lat, lng, label in zip(latitudes, longitudes, labels):
    folium.Marker([lat, lng], popup=label).add_to(sanfran_map)

# add incidents to map

sanfran_map.add_child(incidents)
The map may now look quite congested. There are two remedies for this problem.

1. The simpler solution is to remove the location markers and add the pop-up text to the circle
markers themselves, as follows:

#create map and display it

sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)

#loop through the 100 crimes and add each to the map
for lat, lng, label in zip(df_incidents.Y, df_incidents.X, df_incidents.Category):
    folium.CircleMarker(
        [lat, lng],
        radius=5, #define how big you want the circle markers to be
        color='yellow',
        fill=True,
        popup=label,
        fill_color='blue',
        fill_opacity=0.6).add_to(sanfran_map)

#show map
sanfran_map
2. The second, more appropriate way is to group the markers into clusters. Each cluster
is then represented by the number of crimes in each neighbourhood. These clusters can be thought
of as pockets of San Francisco which you can then analyse separately.

To implement this, we start by instantiating a MarkerCluster object and adding all the data points
in the dataframe to this object.

from folium import plugins

#starting again with a clean copy of the map of San Francisco

sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)

#instantiate a marker cluster object for the incidents in the dataframe

incidents = plugins.MarkerCluster().add_to(sanfran_map)

#loop through the dataframe and add each data point to the marker cluster
for lat, lng, label in zip(df_incidents.Y, df_incidents.X, df_incidents.Category):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label).add_to(incidents)

#display map
sanfran_map
When you zoom out all the way, all markers are grouped into one cluster.
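
To make the idea of a cluster count concrete, here is a small pure-Python sketch that buckets points into grid cells and counts them. It only illustrates grouping nearby markers and is not how MarkerCluster works internally. The cluster_counts helper, its cell_size parameter, and the sample coordinates are assumptions made up for this example.

```python
from collections import Counter
import math

def cluster_counts(points, cell_size):
    """Count points per square grid cell of side `cell_size` (in degrees)."""
    counts = Counter()
    for lat, lng in points:
        # Identify the cell by the integer grid index of each coordinate.
        cell = (math.floor(lat / cell_size), math.floor(lng / cell_size))
        counts[cell] += 1
    return counts

# Made-up coordinates near San Francisco.
points = [(37.771, -122.421), (37.772, -122.422), (37.779, -122.429), (37.801, -122.401)]

fine = cluster_counts(points, cell_size=0.01)   # small cells, like a zoomed-in view
coarse = cluster_counts(points, cell_size=1.0)  # one large cell, like a zoomed-out view

print(sorted(fine.values(), reverse=True))
print(list(coarse.values()))
```

With small cells the points split into a group of three and a single point; with one large cell all four fall into the same group, which mirrors what happens on the map when zooming out.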

Choropleth Maps

-Download the dataset and read it into a pandas dataframe. (N.B. if xlrd is not installed, install it
first by running: !conda install -c anaconda xlrd --yes)

df_can = pd.read_excel('[Link]
data/CognitiveClass/DV0101EN/labs/Data_Files/[Link]', sheet_name='Canada by Citizenship',
skiprows=range(20), skipfooter=2)

print('Data downloaded and read into a dataframe!')

-Take a look at the first five items in our dataset.

df_can.head()

-Clean up / pre-process the data

#clean up the dataset to remove unnecessary columns (e.g. REG)

df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)
#let's rename the columns so that they make sense
df_can.rename(columns={'OdName': 'Country', 'AreaName': 'Continent', 'RegName': 'Region'},
              inplace=True)

#for sake of consistency, let's also make all column labels of type string
df_can.columns = list(map(str, df_can.columns))

#add total column (numeric_only=True skips text columns such as Country)

df_can['Total'] = df_can.sum(axis=1, numeric_only=True)

#years that we will be using in this lesson - useful for plotting later on

years = list(map(str, range(1980, 2014)))
print('data dimensions:', df_can.shape)

-Take a look at the first five rows of the cleaned dataframe.

df_can.head()

-In order to create a Choropleth map, we need a GeoJSON file that defines the areas/boundaries of
the state, county, or country that we are interested in. In our case, since we are endeavouring to
create a world map, we want a GeoJSON that defines the boundaries of all world countries. For our
convenience, the course developer has provided this file for download. Let’s name it
world_countries.json.
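
For orientation, a GeoJSON file of this kind is a FeatureCollection whose features each carry a properties object and a geometry. The sketch below parses a tiny hand-written example with Python's json module and lists the country names. The two features and their simplified coordinates are made up, and the assumption that each feature stores its name under properties.name should be checked against the actual world_countries.json.

```python
import json

# A tiny hand-written FeatureCollection; real files hold one feature per country
# with far more detailed coordinates.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "Denmark"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[8.0, 55.0], [12.0, 55.0], [12.0, 57.0], [8.0, 55.0]]]}},
    {"type": "Feature",
     "properties": {"name": "Norway"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[5.0, 58.0], [11.0, 58.0], [11.0, 63.0], [5.0, 58.0]]]}}
  ]
}
"""

world = json.loads(geojson_text)

# In this sample, each feature's name lives under feature["properties"]["name"].
names = [feature["properties"]["name"] for feature in world["features"]]
print(names)
```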

#download countries geojson file

!wget --quiet [Link]
/DV0101EN/labs/Data_Files/world_countries.json -O world_countries.json

print('GeoJSON file downloaded!')

-Now that we have the GeoJSON file, let’s create a world map, centered at [0,0] latitude and longitude
values, with an initial zoom level of 2, and using the Mapbox Bright style.

world_geo = r'world_countries.json' #geojson file

#create a plain world map

world_map = folium.Map(location=[0, 0], zoom_start=2, tiles='Mapbox Bright')

-And now to create a Choropleth map, we will use the choropleth method with the following main
parameters:

#generate choropleth map using the total immigration of each country to Canada from 1980 to 2013

world_map.choropleth(
    geo_data=world_geo,
    data=df_can,
    columns=['Country', 'Total'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Immigration to Canada')

#display map
world_map
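
The key_on string tells the map where to find the joining key inside each GeoJSON feature, and the first entry of columns names the matching column in the data. As a rough illustration of that join (not Folium's actual implementation), the sketch below resolves a dotted path such as 'feature.properties.name' on made-up features and reports data rows with no matching feature. The resolve helper, the sample names, and the totals are assumptions for this example.

```python
def resolve(feature, dotted_path):
    """Follow a path such as 'feature.properties.name' into a feature dict."""
    value = feature
    # The leading 'feature' refers to the feature itself, so skip it.
    for key in dotted_path.split(".")[1:]:
        value = value[key]
    return value

# Made-up features and data rows for illustration.
features = [
    {"type": "Feature", "properties": {"name": "Denmark"}},
    {"type": "Feature", "properties": {"name": "Norway"}},
]
totals = {"Denmark": 100, "Norway": 200, "Atlantis": 10}

geo_names = {resolve(f, "feature.properties.name") for f in features}

# Rows whose key has no counterpart in the GeoJSON cannot be matched to a shape.
unmatched = sorted(name for name in totals if name not in geo_names)
print(unmatched)
```

A check like this can help spot spelling differences between the Country column and the names used in the GeoJSON file before drawing the map.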

Notice how the legend is displaying a negative boundary or threshold. Let’s fix that by defining our
own thresholds and starting with 0 instead of -6,918!

world_geo = r'world_countries.json'

#create a numpy array of length 6 with linear spacing from the minimum total immigration to the
#maximum total immigration
threshold_scale = np.linspace(df_can['Total'].min(), df_can['Total'].max(), 6, dtype=int)
threshold_scale = threshold_scale.tolist() #change the numpy array to a list
#make sure that the last value of the list is greater than the maximum immigration
threshold_scale[-1] = threshold_scale[-1] + 1

#create the map again and pass in our own threshold scale

world_map = folium.Map(location=[0, 0], zoom_start=2, tiles='Mapbox Bright')
world_map.choropleth(
    geo_data=world_geo,
    data=df_can,
    columns=['Country', 'Total'],
    key_on='feature.properties.name',
    threshold_scale=threshold_scale,
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Immigration to Canada',
    reset=True)

world_map
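
The threshold computation can be sketched without NumPy to show what the six values look like. The totals below are made up, the linear_thresholds helper is hypothetical, and passing 0 as the lower bound is one way to read "starting with 0" in the text above.

```python
def linear_thresholds(low, high, count):
    """Return `count` evenly spaced integer thresholds from low to high."""
    step = (high - low) / (count - 1)
    scale = [int(low + i * step) for i in range(count)]
    # Nudge the top edge above the maximum so the largest value falls inside the last bin.
    scale[-1] = high + 1
    return scale

# Made-up totals for illustration.
totals = [150, 2000, 48000, 500000]

scale = linear_thresholds(0, max(totals), 6)
print(scale)
```

The last threshold ends up one above the largest total, which matches the adjustment described in the comment on the threshold list.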
