Lab 05_Data analysis and Visualization
Parts:
1. Python Packages.
2. Load Data.
3. Prepare Data.
4. Analyze Data.
5. Visualize Data.
© 2022 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page 1 of 16
Lab no 05 – Data Analysis and Visualization
In [27]:
# Code cell 1
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
Step 1: Load the San Francisco Crime data into a data frame.
In this step, you will import the San Francisco crime data from a comma separated values (csv) file into a
data frame.
In [28]:
# code cell 2
# This should be a local path
dataset_path = './Data/Map-Crime_Incidents-Previous_Three_Months.csv'
# read the original dataset (in comma separated values format) into a DataFrame
SF = pd.read_csv(dataset_path, sep=",")
print(SF)
IncidntNum Category Descript \
0 NaN LARCENY/THEFT GRAND THEFT FROM UNLOCKED AUTO
1 NaN LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO
2 NaN LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO
3 NaN DRUG/NARCOTIC POSSESSION OF METH-AMPHETAMINE
4 NaN DRUG/NARCOTIC POSSESSION OF COCAINE
... ... ... ...
30755 NaN LARCENY/THEFT PETTY THEFT SHOPLIFTING
30756 NaN OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED
30757 NaN ASSAULT BATTERY
30758 NaN ASSAULT ASSAULT WITH CAUSTIC CHEMICALS
30759 NaN OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED
Resolution Address X Y \
0 NONE HYDE ST / CALIFORNIA ST -122.417393 37.790974
1 NONE COLUMBUS AV / JACKSON ST -122.404418 37.796302
2 NONE SUTTER ST / STOCKTON ST -122.406959 37.789435
3 ARREST, BOOKED 16TH ST / MISSION ST -122.419672 37.765050
4 ARREST, BOOKED LARKIN ST / OFARRELL ST -122.417904 37.785167
... ... ... ... ...
30755 ARREST, BOOKED 900.0 Block of MARKET ST -122.408052 37.783957
30756 ARREST, CITED POLK ST / MCALLISTER ST -122.418601 37.780261
30757 ARREST, CITED 0.0 Block of JONES ST -122.412122 37.781379
30758 NONE 200.0 Block of GEARY ST -122.407434 37.787494
30759 ARREST, CITED MISSION ST / BOSWORTH ST -122.426391 37.733675
Location
0 (37.7909741243888, -122.417392830334)
1 (37.7963018736036, -122.404417620748)
2 (37.7894347630337, -122.406958660602)
3 (37.7650501214965, -122.419671780296)
4 (37.7851670875814, -122.417903977564)
... ...
30755 (37.7839574642528, -122.408051765969)
30756 (37.7802607511488, -122.418600974625)
30757 (37.7813786419025, -122.412121608136)
30758 (37.7874944447786, -122.407434204569)
30759 (37.7336749150401, -122.426391018521)
To view the first five lines of the CSV file, use the Linux command head.
In [29]:
# code cell 3
!head -n 5 ./Data/Map-Crime_Incidents-Previous_Three_Months.csv
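The head command is only available in Unix-like shells. As a portable alternative, the sketch below previews a file with the nrows parameter of pd.read_csv. The in-memory sample_csv text is a made-up stand-in for the lab's CSV file, so the rows shown are illustrative only.

```python
import io
import pandas as pd

# Made-up stand-in for ./Data/Map-Crime_Incidents-Previous_Three_Months.csv
sample_csv = io.StringIO(
    "Category,PdDistrict,X,Y\n"
    "LARCENY/THEFT,CENTRAL,-122.417393,37.790974\n"
    "DRUG/NARCOTIC,MISSION,-122.419672,37.765050\n"
    "ASSAULT,TENDERLOIN,-122.412122,37.781379\n"
)

# nrows limits how many data rows are parsed (the header line is not counted)
preview = pd.read_csv(sample_csv, nrows=2)
print(preview)
```

With the real file, the same call would take dataset_path instead of sample_csv.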
In [30]:
# Code cell 4
pd.set_option('display.max_rows', 10) #Visualize 10 rows
SF
Out[30]:
   IncidntNum  Category       Descript                        DayOfWeek  \
0  NaN         LARCENY/THEFT  GRAND THEFT FROM UNLOCKED AUTO  Sunday
1  NaN         LARCENY/THEFT  GRAND THEFT FROM LOCKED AUTO    Sunday
2  NaN         LARCENY/THEFT  GRAND THEFT FROM LOCKED AUTO    Sunday
3  NaN         DRUG/NARCOTIC  POSSESSION OF METH-AMPHETAMINE  Sunday
4  NaN         DRUG/NARCOTIC  POSSESSION OF COCAINE           Sunday

   Date                          Time   PdDistrict  Resolution      \
0  08/31/2014 07:00:00 AM +0000  20:30  CENTRAL     NONE
1  08/31/2014 07:00:00 AM +0000  14:30  CENTRAL     NONE
2  08/31/2014 07:00:00 AM +0000  11:30  CENTRAL     NONE
3  08/31/2014 07:00:00 AM +0000  17:49  MISSION     ARREST, BOOKED
4  08/31/2014 07:00:00 AM +0000  18:05  NORTHERN    ARREST, BOOKED

   Address                   X            Y          \
0  HYDE ST / CALIFORNIA ST   -122.417393  37.790974
1  COLUMBUS AV / JACKSON ST  -122.404418  37.796302
2  SUTTER ST / STOCKTON ST   -122.406959  37.789435
3  16TH ST / MISSION ST      -122.419672  37.765050
4  LARKIN ST / OFARRELL ST   -122.417904  37.785167

   Location
0  (37.7909741243888, -122.417392830334)
1  (37.7963018736036, -122.404417620748)
2  (37.7894347630337, -122.406958660602)
3  (37.7650501214965, -122.419671780296)
4  (37.7851670875814, -122.417903977564)
b) Use the columns attribute to view the names of the variables in the DataFrame.
In [31]:
# Code cell 5
SF.columns
Out[31]:
Index(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time',
'PdDistrict', 'Resolution', 'Address', 'X', 'Y', 'Location'],
dtype='object')
How many variables are contained in the SF data frame (ignore the Index)?
c) Use the function len to determine the number of rows in the dataset.
In [32]:
# Code cell 6
len(SF)
Out[32]:
30760
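Both counts can also be read at once from the shape attribute. The sketch below uses a made-up three-row frame named df in place of SF, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical three-row frame standing in for SF
df = pd.DataFrame({
    "Category": ["ASSAULT", "BURGLARY", "ASSAULT"],
    "PdDistrict": ["CENTRAL", "MISSION", "PARK"],
})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```

Here n_rows matches len(df) and n_cols matches len(df.columns).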
Step 1: Extract the month and day from the Date field.
lambda is a Python keyword for defining so-called anonymous functions. It lets you specify a
function in one line of code, without using def and without giving the function a name. The syntax of a
lambda expression is:
lambda parameters : expression
In the following code, a lambda expression creates an inline function that selects only the month digits
from the Date variable and uses int to convert that string into an integer. The pandas
method apply then applies this function to the entire column (in practice, apply implicitly runs a for
loop and passes the column values to the lambda function one at a time). The same procedure is used for the
Day.
In [33]:
# Code cell 7
SF['Month'] = SF['Date'].apply(lambda row: int(row[0:2]))
SF['Day'] = SF['Date'].apply(lambda row: int(row[3:5]))
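To make the lambda idea concrete, the sketch below compares it with an equivalent named function and with a vectorized string slice. The dates Series holds two made-up values in the same layout as the Date column, and month_of is a hypothetical helper name.

```python
import pandas as pd

# Two made-up dates in the same MM/DD/YYYY layout as the Date column
dates = pd.Series(["08/31/2014 07:00:00 AM +0000",
                   "06/01/2014 07:00:00 AM +0000"])

# A named function equivalent to the lambda used in code cell 7
def month_of(text):
    return int(text[0:2])

by_def = dates.apply(month_of)
by_lambda = dates.apply(lambda text: int(text[0:2]))

# Vectorized alternative: slice every string at once, then convert to int
by_str = dates.str[0:2].astype(int)
days = dates.str[3:5].astype(int)
print(by_str.tolist(), days.tolist())
```

All three approaches return the same month values; the .str version avoids the explicit per-row function call.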
To verify that these two variables were added to the SF data frame, use the print function to print some
values from these columns, and use type to check that the new columns do contain numerical values.
In [34]:
# Code cell 8
print(SF['Month'][0:2])
print(SF['Day'][0:2])
0 8
1 8
Name: Month, dtype: int64
0 31
1 31
Name: Day, dtype: int64
In [35]:
# Code cell 9
print(type(SF['Month'][0]))
<class 'numpy.int64'>
a) The column IncidntNum contains many cells with NaN, which means the data is missing. Furthermore,
IncidntNum does not add any value to the analysis, so the column can be dropped from the data frame.
One way to remove unwanted variables from a data frame is the del statement.
In [36]:
# Code cell 10
del SF['IncidntNum']
b) Similarly, the Location attribute will not be used in this analysis and can be dropped from the data frame.
As an alternative to del, you can use the drop method of the data frame, specifying that the axis is 1 for
columns (0 for rows) and that the change is made in place, so no assignment is needed to store the result (inplace=True).
In [37]:
# Code cell 11
SF.drop('Location', axis=1, inplace=True )
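When inplace is not set, drop returns a new data frame and leaves the original unchanged. The sketch below shows this on a made-up frame named df (not the lab data) and removes two columns in one call with the columns parameter.

```python
import pandas as pd

# Hypothetical frame with two columns that are not needed for the analysis
df = pd.DataFrame({
    "IncidntNum": [None, None],
    "Category": ["ASSAULT", "BURGLARY"],
    "Location": ["(37.79, -122.41)", "(37.76, -122.42)"],
})

# drop returns a new DataFrame by default; df itself is left untouched
trimmed = df.drop(columns=["IncidntNum", "Location"])
print(list(trimmed.columns))
print(list(df.columns))
```

Keeping the original frame intact can be handy when you want to rerun a cell without reloading the CSV file.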
In [38]:
SF
Out[38]:
       Category        Descript                               DayOfWeek  \
0      LARCENY/THEFT   GRAND THEFT FROM UNLOCKED AUTO         Sunday
1      LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO           Sunday
2      LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO           Sunday
3      DRUG/NARCOTIC   POSSESSION OF METH-AMPHETAMINE         Sunday
4      DRUG/NARCOTIC   POSSESSION OF COCAINE                  Sunday
...    ...             ...                                    ...
30755  LARCENY/THEFT   PETTY THEFT SHOPLIFTING                Sunday
30756  OTHER OFFENSES  DRIVERS LICENSE, SUSPENDED OR REVOKED  Sunday
30757  ASSAULT         BATTERY                                Sunday

       Date                          Time   PdDistrict  Resolution      \
0      08/31/2014 07:00:00 AM +0000  20:30  CENTRAL     NONE
1      08/31/2014 07:00:00 AM +0000  14:30  CENTRAL     NONE
2      08/31/2014 07:00:00 AM +0000  11:30  CENTRAL     NONE
3      08/31/2014 07:00:00 AM +0000  17:49  MISSION     ARREST, BOOKED
4      08/31/2014 07:00:00 AM +0000  18:05  NORTHERN    ARREST, BOOKED
...    ...                           ...    ...         ...
30755  06/01/2014 07:00:00 AM +0000  15:30  SOUTHERN    ARREST, BOOKED
30756  06/01/2014 07:00:00 AM +0000  16:00  NORTHERN    ARREST, CITED
30757  06/01/2014 07:00:00 AM +0000  15:00  TENDERLOIN  ARREST, CITED

       Address                   X            Y          Month  Day
0      HYDE ST / CALIFORNIA ST   -122.417393  37.790974  8      31
1      COLUMBUS AV / JACKSON ST  -122.404418  37.796302  8      31
2      SUTTER ST / STOCKTON ST   -122.406959  37.789435  8      31
3      16TH ST / MISSION ST      -122.419672  37.765050  8      31
4      LARKIN ST / OFARRELL ST   -122.417904  37.785167  8      31
...    ...                       ...          ...        ...    ...
30755  900.0 Block of MARKET ST  -122.408052  37.783957  6      1
30756  POLK ST / MCALLISTER ST   -122.418601  37.780261  6      1
30757  0.0 Block of JONES ST     -122.412122  37.781379  6      1
In [39]:
# Code cell 12
SF.columns
Out[39]:
Index(['Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'PdDistrict',
'Resolution', 'Address', 'X', 'Y', 'Month', 'Day'],
dtype='object')
In [40]:
# Code cell 13
CountCategory = SF['Category'].value_counts()
print(CountCategory)
LARCENY/THEFT 8205
OTHER OFFENSES 4004
NON-CRIMINAL 3653
ASSAULT 2518
VEHICLE THEFT 1885
...
LOITERING 5
BAD CHECKS 3
PORNOGRAPHY/OBSCENE MAT 1
BRIBERY 1
GAMBLING 1
Name: Category, Length: 36, dtype: int64
b) By default, the counts are ordered in descending order. The value of the optional parameter ascending
can be set to True to reverse this behavior.
In [41]:
# Code cell 14
SF['Category'].value_counts(ascending=True)
Out[41]:
GAMBLING 1
BRIBERY 1
PORNOGRAPHY/OBSCENE MAT 1
BAD CHECKS 3
LOITERING 5
...
VEHICLE THEFT 1885
ASSAULT 2518
NON-CRIMINAL 3653
OTHER OFFENSES 4004
LARCENY/THEFT 8205
Name: Category, Length: 36, dtype: int64
c) By nesting the two functions into one command, you can accomplish the same result with one line of
code.
In [42]:
# Code cell 15
print(SF['Category'].value_counts(ascending=True))
GAMBLING 1
BRIBERY 1
PORNOGRAPHY/OBSCENE MAT 1
BAD CHECKS 3
LOITERING 5
...
VEHICLE THEFT 1885
ASSAULT 2518
NON-CRIMINAL 3653
OTHER OFFENSES 4004
LARCENY/THEFT 8205
Name: Category, Length: 36, dtype: int64
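Besides ascending, value_counts also accepts normalize=True, which reports each category as a fraction of all rows instead of a raw count. The sketch below uses a made-up four-value Series named categories, so the figures are illustrative only.

```python
import pandas as pd

# Made-up category values, not taken from the lab data
categories = pd.Series(["LARCENY/THEFT", "ASSAULT",
                        "LARCENY/THEFT", "LARCENY/THEFT"])

counts = categories.value_counts()                # raw counts, largest first
shares = categories.value_counts(normalize=True)  # fractions that sum to 1
print(counts)
print(shares)
```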
Challenge Question: Which PdDistrict had the most incidents of reported crime? Provide the Python
command(s) used to support your answer.
In [43]:
# code cell 16
# Possible code for the challenge question
print(SF['PdDistrict'].value_counts(ascending=True))
RICHMOND 1622
PARK 1800
TARAVAL 2038
TENDERLOIN 2449
INGLESIDE 2613
BAYVIEW 2970
NORTHERN 3205
CENTRAL 3867
MISSION 4011
SOUTHERN 6185
Name: PdDistrict, dtype: int64
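Instead of reading the answer off the sorted listing, the label with the highest count can be extracted directly with idxmax. This is a sketch on a made-up Series named districts, not the lab data.

```python
import pandas as pd

# Made-up district values for illustration
districts = pd.Series(["SOUTHERN", "MISSION", "SOUTHERN",
                       "PARK", "SOUTHERN", "MISSION"])

counts = districts.value_counts()
busiest = counts.idxmax()   # index label of the largest count
print(busiest, counts.max())
```

Applied to SF['PdDistrict'], the same two calls would return the district name and its incident count.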
a) Logical indexing can be used to select only the rows for which a given condition is satisfied. For example,
the following code extracts only the crimes committed in August, and stores the result in a new DataFrame.
In [44]:
# Code cell 17
AugustCrimes = SF[SF['Month'] == 8]
AugustCrimes
Out[44]:
      Category        Descript                        DayOfWeek  \
0     LARCENY/THEFT   GRAND THEFT FROM UNLOCKED AUTO  Sunday
1     LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO    Sunday
2     LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO    Sunday
3     DRUG/NARCOTIC   POSSESSION OF METH-AMPHETAMINE  Sunday
4     DRUG/NARCOTIC   POSSESSION OF COCAINE           Sunday
...   ...             ...                             ...
9715  NON-CRIMINAL    AIDED CASE, MENTAL DISTURBED    Friday
9716  OTHER OFFENSES  MISCELLANEOUS INVESTIGATION     Friday
9717  ASSAULT         THREATS AGAINST LIFE            Friday

      Date                          Time   PdDistrict  Resolution      \
0     08/31/2014 07:00:00 AM +0000  20:30  CENTRAL     NONE
1     08/31/2014 07:00:00 AM +0000  14:30  CENTRAL     NONE
2     08/31/2014 07:00:00 AM +0000  11:30  CENTRAL     NONE
3     08/31/2014 07:00:00 AM +0000  17:49  MISSION     ARREST, BOOKED
4     08/31/2014 07:00:00 AM +0000  18:05  NORTHERN    ARREST, BOOKED
...   ...                           ...    ...         ...
9715  08/01/2014 07:00:00 AM +0000  19:55  MISSION     NONE
9716  08/01/2014 07:00:00 AM +0000  22:47  RICHMOND    NONE
9717  08/01/2014 07:00:00 AM +0000  23:55  BAYVIEW     NONE

      Address                       X            Y          Month  Day
0     HYDE ST / CALIFORNIA ST       -122.417393  37.790974  8      31
1     COLUMBUS AV / JACKSON ST      -122.404418  37.796302  8      31
2     SUTTER ST / STOCKTON ST       -122.406959  37.789435  8      31
3     16TH ST / MISSION ST          -122.419672  37.765050  8      31
4     LARKIN ST / OFARRELL ST       -122.417904  37.785167  8      31
...   ...                           ...          ...        ...    ...
9715  1100.0 Block of POTRERO AV    -122.406497  37.754279  8      1
9716  1500.0 Block of BRODERICK ST  -122.441458  37.784427  8      1
9717  400.0 Block of TUNNEL AV      -122.401364  37.709748  8      1
How many crime incidents were there for the month of August?
In [45]:
# code cell 18
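# Possible code for the question: How many burglaries were reported in the month of August?
AugustCrimes = SF[SF['Month'] == 8]
# Note: the filter below is applied to the full SF frame, so the count covers
# burglaries in all months. To count August only, filter AugustCrimes instead of SF.
AugustCrimesB = SF[SF['Category'] == 'BURGLARY']
len(AugustCrimesB)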
Out[45]:
1257
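Two conditions can be combined in a single logical index with the & operator. The sketch below uses a made-up four-row frame named df to show the difference between counting all burglaries and counting only those in August; the names is_august and is_burglary are illustrative.

```python
import pandas as pd

# Made-up rows for illustration, not the lab data
df = pd.DataFrame({
    "Category": ["BURGLARY", "ASSAULT", "BURGLARY", "BURGLARY"],
    "Month": [8, 8, 7, 8],
})

is_august = df["Month"] == 8
is_burglary = df["Category"] == "BURGLARY"

all_burglaries = df[is_burglary]
august_burglaries = df[is_august & is_burglary]  # both conditions must hold
print(len(all_burglaries), len(august_burglaries))
```

When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because & binds more tightly than ==.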
b) To create a subset of the SF data frame for a specific day, use the query method to compare
Month and Day at the same time.
In [46]:
# Code cell 19
Crime0704 = SF.query('Month == 7 and Day == 4')
Crime0704
Out[46]:
       Category       Descript                                          DayOfWeek  \
19087  LARCENY/THEFT  GRAND THEFT FROM LOCKED AUTO                      Friday
19088  LARCENY/THEFT  GRAND THEFT FROM LOCKED AUTO                      Friday
19089  BURGLARY       BURGLARY,RESIDENCE UNDER CONSTRT, FORCIBLE ENTRY  Friday

       Date                          Time   PdDistrict  Resolution  \
19087  07/04/2014 07:00:00 AM +0000  22:30  SOUTHERN    NONE
19088  07/04/2014 07:00:00 AM +0000  18:15  SOUTHERN    NONE
19089  07/04/2014 07:00:00 AM +0000  00:50  TARAVAL     NONE

       Address                  X            Y          Month  Day
19087  8TH ST / MISSION ST      -122.413161  37.777457  7      4
19088  CLEMENTINA ST / 9TH ST   -122.412174  37.774201  7      4
19089  0.0 Block of MENDOSA AV  -122.466414  37.748011  7      4
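The query string can also refer to Python variables by prefixing them with @, which avoids hard-coding the date. The sketch below runs on a made-up three-row frame named df and checks that query selects the same rows as the equivalent logical index; month and day are illustrative variable names.

```python
import pandas as pd

# Made-up rows for illustration, not the lab data
df = pd.DataFrame({
    "Category": ["LARCENY/THEFT", "BURGLARY", "ASSAULT"],
    "Month": [7, 7, 8],
    "Day": [4, 5, 4],
})

month, day = 7, 4

# @name inserts the value of a local Python variable into the query
by_query = df.query("Month == @month and Day == @day")

# The same selection written as a logical index
by_mask = df[(df["Month"] == month) & (df["Day"] == day)]
print(by_query)
```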
In [47]:
# Code cell 20
SF.columns
Out[47]:
Index(['Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'PdDistrict',
'Resolution', 'Address', 'X', 'Y', 'Month', 'Day'],
dtype='object')
Step 1: Plot a graph of the SF data frame using the X and Y variables.
a) Use the plot() function to plot the SF data frame. Pass the optional format string 'ro' to draw the
points in red (r) with a circle marker (o).
In [48]:
# Code cell 21
plt.plot(SF['X'],SF['Y'], 'ro')
plt.show()
b) Identify the police department districts, then build the dictionary pd_districts_levels to associate each
district name with an integer.
In [49]:
# Code cell 22
pd_districts = np.unique(SF['PdDistrict'])
pd_districts_levels = dict(zip(pd_districts, range(len(pd_districts))))
pd_districts_levels
Out[49]:
{'BAYVIEW': 0,
'CENTRAL': 1,
'INGLESIDE': 2,
'MISSION': 3,
'NORTHERN': 4,
'PARK': 5,
'RICHMOND': 6,
'SOUTHERN': 7,
'TARAVAL': 8,
'TENDERLOIN': 9}
c) Use apply and lambda to add the police department integer id to a new column of the DataFrame.
In [50]:
# Code cell 23
SF['PdDistrictCode'] = SF['PdDistrict'].apply(lambda row: pd_districts_levels[row])
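The same label-to-integer encoding can be written without a lambda. The sketch below, on a made-up Series named districts, shows two alternatives: passing the dictionary straight to map, and letting pd.factorize build the codes. The name levels is illustrative and plays the role of pd_districts_levels.

```python
import pandas as pd

# Made-up district values for illustration
districts = pd.Series(["MISSION", "CENTRAL", "MISSION", "BAYVIEW"])

# Build the name-to-integer dictionary from the sorted unique names
names = sorted(districts.unique())
levels = {name: code for code, name in enumerate(names)}

# map looks each value up in the dictionary, with no lambda needed
codes_map = districts.map(levels)

# factorize with sort=True assigns codes in sorted label order
codes_fact, uniques = pd.factorize(districts, sort=True)
print(codes_map.tolist(), list(codes_fact))
```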
In [51]:
# Code cell 24
plt.scatter(SF['X'], SF['Y'], c=SF['PdDistrictCode'])
plt.show()
In Step 1, you created a simple plot that displays where crime incidents took place in SF County. This plot is
useful, but folium provides additional functions that allow you to overlay this plot onto an OpenStreetMap
map.
a) Folium requires the color of the marker to be specified as a hexadecimal value. For this reason, we
use the colors package from matplotlib and select the necessary colors.
In [52]:
# Code cell 25
from matplotlib import colors
districts = np.unique(SF['PdDistrict'])
print(list(colors.cnames.values())[0:len(districts)])
['#9932CC', '#FAEBD7', '#778899', '#00FF7F', '#C71585', '#3CB371', '#00FFFF', '#556B2F',
'#808080', '#FFA07A']
In [53]:
# Code cell 26
color_dict = dict(zip(districts, list(colors.cnames.values())[0:-1:len(districts)]))
color_dict
Out[53]:
{'BAYVIEW': '#9932CC',
'CENTRAL': '#FFA500',
'INGLESIDE': '#FFF8DC',
'MISSION': '#FF7F50',
'NORTHERN': '#A0522D',
'PARK': '#FFE4B5',
'RICHMOND': '#FFB6C1',
'SOUTHERN': '#5F9EA0',
'TARAVAL': '#C0C0C0',
'TENDERLOIN': '#191970'}
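Picking colors from a list of named colors can give shades that are hard to tell apart. As a possible alternative, the sketch below builds hexadecimal color strings with evenly spaced hues using only the standard library. The function distinct_hex_colors is a hypothetical helper, and the shortened districts list is illustrative.

```python
import colorsys

def distinct_hex_colors(n):
    """Return n hex color strings with evenly spaced hues."""
    result = []
    for i in range(n):
        # Full saturation and brightness; only the hue changes
        r, g, b = colorsys.hsv_to_rgb(i / n, 1.0, 1.0)
        result.append("#{:02X}{:02X}{:02X}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return result

districts = ["BAYVIEW", "CENTRAL", "INGLESIDE"]  # shortened, illustrative list
color_by_district = dict(zip(districts, distinct_hex_colors(len(districts))))
print(color_by_district)
```

The resulting dictionary has the same shape as color_dict, with one hexadecimal string per district.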
c) Create the map, using the mean of the coordinates in the SF data to center it. To reduce
the computation time, plotEvery is used to limit the amount of plotted data. Set this value to 1 to plot all the rows
(the map might then take a long time to render).
In [54]:
# Code cell 27
# Create map
map_osm = folium.Map(location=[SF['Y'].mean(), SF['X'].mean()], zoom_start = 12)
plotEvery = 50
obs = list(zip( SF['Y'], SF['X'], SF['PdDistrict']))
for el in obs[0:-1:plotEvery]:
    # el is (latitude, longitude, district); use the district's hex color for outline and fill
    folium.CircleMarker(el[0:2], color=color_dict[el[2]],
                        fill_color=color_dict[el[2]], radius=10).add_to(map_osm)
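The slice used in the loop is worth a closer look: a stop value of -1 excludes the last element, so even a step of 1 leaves one row unplotted. The sketch below demonstrates this on a plain list named rows, a made-up stand-in for obs.

```python
# Made-up stand-in for the obs list: 101 entries numbered 0..100
rows = list(range(101))
plot_every = 50

as_in_lab = rows[0:-1:plot_every]   # stops before the last element
full_range = rows[::plot_every]     # runs to the end of the list

print(as_in_lab)
print(full_range)
print(len(rows[0:-1:1]), len(rows[::1]))
```

If every row should be eligible for plotting, a slice without an explicit stop value covers the whole list.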
In [55]:
# Code cell 28
map_osm
Out[55]:
© 2017 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.