Lab 05_Data analysis and Visualization

This document outlines Lab 05 for the CS433 course on Internet of Things, focusing on data analysis and visualization using the San Francisco Crime dataset. It details the objectives, required resources, and steps for importing Python packages, loading, preparing, analyzing, and visualizing the data. The lab emphasizes the use of Python and Jupyter Notebook to demonstrate the Data Analysis Lifecycle.

Uploaded by

hoanganhyen0901
Faculty of Computers and Artificial Intelligence

CS433: Internet of Things (IoT)


---------------------------------------------------------------------------------------------
Lab no 05 – Data Analysis and Visualization

This lab provides an introduction to data analysis and visualization.


In this lab, our data source is the San Francisco Crime data.

Parts:

1. Python Packages.
2. Load Data.
3. Prepare Data.
4. Analyze Data.
5. Visualize Data.

© 2022 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.

Lab - San Francisco Crime


Objectives
Demonstrate your knowledge of the Data Analysis Lifecycle using a given set of data and two tools: Python
and Jupyter Notebook.
Part 1: Import the Python Packages
Part 2: Load the Data
Part 3: Prepare the Data
Part 4: Analyze the Data
Part 5: Visualize the Data
Background / Scenario
In this lab, you will import some Python packages required to analyze a data set containing San Francisco
crime information. You will then use Python and Jupyter Notebook to prepare this data for analysis, analyze
it, graph it, and communicate your findings.
Required Resources

• 1 PC with Internet access


• Raspberry Pi version 2 or higher
• Python libraries: pandas, numpy, matplotlib, folium, datetime, and csv
• Datafiles: Map-Crime_Incidents-Previous_Three_Months.csv

Part 1: Import the Python Packages


In this part, you will import the following Python packages necessary for the rest of this lab.
numpy
NumPy is the fundamental package for scientific computing with Python. It contains among other things: a
powerful N-dimensional array object and sophisticated (broadcasting) functions.
pandas
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures
and data analysis tools for the Python programming language.
matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical mathematics
extension NumPy.
folium
Folium is a library for creating interactive maps.

In [27]:
# Code cell 1
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium


Part 2: Load the Data


In this part, you will load the San Francisco Crime Dataset and the Python packages necessary to analyze
and visualize it.

Step 1: Load the San Francisco Crime data into a data frame.
In this step, you will import the San Francisco crime data from a comma separated values (csv) file into a
data frame.

In [28]:
# code cell 2
# This should be a local path
dataset_path = './Data/Map-Crime_Incidents-Previous_Three_Months.csv'

# read the original dataset (in comma separated values format) into a DataFrame
SF = pd.read_csv(dataset_path, sep=",")
print(SF)
IncidntNum Category Descript \
0 NaN LARCENY/THEFT GRAND THEFT FROM UNLOCKED AUTO
1 NaN LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO
2 NaN LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO
3 NaN DRUG/NARCOTIC POSSESSION OF METH-AMPHETAMINE
4 NaN DRUG/NARCOTIC POSSESSION OF COCAINE
... ... ... ...
30755 NaN LARCENY/THEFT PETTY THEFT SHOPLIFTING
30756 NaN OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED
30757 NaN ASSAULT BATTERY
30758 NaN ASSAULT ASSAULT WITH CAUSTIC CHEMICALS
30759 NaN OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED

DayOfWeek Date Time PdDistrict \


0 Sunday 08/31/2014 07:00:00 AM +0000 20:30 CENTRAL
1 Sunday 08/31/2014 07:00:00 AM +0000 14:30 CENTRAL
2 Sunday 08/31/2014 07:00:00 AM +0000 11:30 CENTRAL
3 Sunday 08/31/2014 07:00:00 AM +0000 17:49 MISSION
4 Sunday 08/31/2014 07:00:00 AM +0000 18:05 NORTHERN
... ... ... ... ...
30755 Sunday 06/01/2014 07:00:00 AM +0000 15:30 SOUTHERN
30756 Sunday 06/01/2014 07:00:00 AM +0000 16:00 NORTHERN
30757 Sunday 06/01/2014 07:00:00 AM +0000 15:00 TENDERLOIN
30758 Sunday 06/01/2014 07:00:00 AM +0000 15:20 CENTRAL
30759 Sunday 06/01/2014 07:00:00 AM +0000 13:15 INGLESIDE

Resolution Address X Y \
0 NONE HYDE ST / CALIFORNIA ST -122.417393 37.790974
1 NONE COLUMBUS AV / JACKSON ST -122.404418 37.796302
2 NONE SUTTER ST / STOCKTON ST -122.406959 37.789435
3 ARREST, BOOKED 16TH ST / MISSION ST -122.419672 37.765050
4 ARREST, BOOKED LARKIN ST / OFARRELL ST -122.417904 37.785167
... ... ... ... ...
30755 ARREST, BOOKED 900.0 Block of MARKET ST -122.408052 37.783957
30756 ARREST, CITED POLK ST / MCALLISTER ST -122.418601 37.780261
30757 ARREST, CITED 0.0 Block of JONES ST -122.412122 37.781379
30758 NONE 200.0 Block of GEARY ST -122.407434 37.787494
30759 ARREST, CITED MISSION ST / BOSWORTH ST -122.426391 37.733675

Location
0 (37.7909741243888, -122.417392830334)
1 (37.7963018736036, -122.404417620748)
2 (37.7894347630337, -122.406958660602)
3 (37.7650501214965, -122.419671780296)
4 (37.7851670875814, -122.417903977564)
... ...
30755 (37.7839574642528, -122.408051765969)
30756 (37.7802607511488, -122.418600974625)
30757 (37.7813786419025, -122.412121608136)
30758 (37.7874944447786, -122.407434204569)
30759 (37.7336749150401, -122.426391018521)

[30760 rows x 12 columns]

To view the first five lines of the csv file, the Linux command head is used.

In [29]:
# code cell 3
!head -n 5 ./Data/Map-Crime_Incidents-Previous_Three_Months.csv
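
The head command is only available on Unix-like systems. As a hedged alternative, pandas can parse just the first few rows through the nrows parameter of read_csv. The sketch below uses a tiny in-memory CSV (sample_csv, a made-up stand-in for the real file) so that it runs without the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real CSV file (assumption for illustration)
sample_csv = io.StringIO(
    "Category,DayOfWeek,PdDistrict\n"
    "LARCENY/THEFT,Sunday,CENTRAL\n"
    "DRUG/NARCOTIC,Sunday,MISSION\n"
    "ASSAULT,Sunday,TENDERLOIN\n"
)

# nrows limits how many data rows are parsed (the header line is not counted)
preview = pd.read_csv(sample_csv, nrows=2)
print(preview)
```

With the real file, the same call would take dataset_path in place of sample_csv.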

Step 2: View the imported data.


a) By typing the name of the data frame variable into a cell, you can visualize the top and bottom rows in a
structured way.

In [30]:
# Code cell 4
pd.set_option('display.max_rows', 10) #Visualize 10 rows
SF
Out[30]:
       IncidntNum  Category        Descript                               DayOfWeek  Date                          Time   PdDistrict  Resolution      Address                   X            Y          Location
0      NaN         LARCENY/THEFT   GRAND THEFT FROM UNLOCKED AUTO         Sunday     08/31/2014 07:00:00 AM +0000  20:30  CENTRAL     NONE            HYDE ST / CALIFORNIA ST   -122.417393  37.790974  (37.7909741243888, -122.417392830334)
1      NaN         LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO           Sunday     08/31/2014 07:00:00 AM +0000  14:30  CENTRAL     NONE            COLUMBUS AV / JACKSON ST  -122.404418  37.796302  (37.7963018736036, -122.404417620748)
2      NaN         LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO           Sunday     08/31/2014 07:00:00 AM +0000  11:30  CENTRAL     NONE            SUTTER ST / STOCKTON ST   -122.406959  37.789435  (37.7894347630337, -122.406958660602)
3      NaN         DRUG/NARCOTIC   POSSESSION OF METH-AMPHETAMINE         Sunday     08/31/2014 07:00:00 AM +0000  17:49  MISSION     ARREST, BOOKED  16TH ST / MISSION ST      -122.419672  37.765050  (37.7650501214965, -122.419671780296)
4      NaN         DRUG/NARCOTIC   POSSESSION OF COCAINE                  Sunday     08/31/2014 07:00:00 AM +0000  18:05  NORTHERN    ARREST, BOOKED  LARKIN ST / OFARRELL ST   -122.417904  37.785167  (37.7851670875814, -122.417903977564)
...    ...         ...             ...                                    ...        ...                           ...    ...         ...             ...                       ...          ...        ...
30755  NaN         LARCENY/THEFT   PETTY THEFT SHOPLIFTING                Sunday     06/01/2014 07:00:00 AM +0000  15:30  SOUTHERN    ARREST, BOOKED  900.0 Block of MARKET ST  -122.408052  37.783957  (37.7839574642528, -122.408051765969)
30756  NaN         OTHER OFFENSES  DRIVERS LICENSE, SUSPENDED OR REVOKED  Sunday     06/01/2014 07:00:00 AM +0000  16:00  NORTHERN    ARREST, CITED   POLK ST / MCALLISTER ST   -122.418601  37.780261  (37.7802607511488, -122.418600974625)
30757  NaN         ASSAULT         BATTERY                                Sunday     06/01/2014 07:00:00 AM +0000  15:00  TENDERLOIN  ARREST, CITED   0.0 Block of JONES ST     -122.412122  37.781379  (37.7813786419025, -122.412121608136)
30758  NaN         ASSAULT         ASSAULT WITH CAUSTIC CHEMICALS         Sunday     06/01/2014 07:00:00 AM +0000  15:20  CENTRAL     NONE            200.0 Block of GEARY ST   -122.407434  37.787494  (37.7874944447786, -122.407434204569)
30759  NaN         OTHER OFFENSES  DRIVERS LICENSE, SUSPENDED OR REVOKED  Sunday     06/01/2014 07:00:00 AM +0000  13:15  INGLESIDE   ARREST, CITED   MISSION ST / BOSWORTH ST  -122.426391  37.733675  (37.7336749150401, -122.426391018521)

30760 rows × 12 columns

b) Use the columns attribute to view the names of the variables in the DataFrame.

In [31]:
# Code cell 5
SF.columns
Out[31]:
Index(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time',
'PdDistrict', 'Resolution', 'Address', 'X', 'Y', 'Location'],
dtype='object')

How many variables are contained in the SF data frame (ignore the Index)?

c) Use the function len to determine the number of rows in the dataset.

In [32]:
# Code cell 6
len(SF)
Out[32]:
30760
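
As a complement to columns and len, the shape attribute of a DataFrame reports the number of rows and columns in one step. The following is a minimal sketch on a made-up frame (demo is a hypothetical stand-in for SF).

```python
import pandas as pd

# Small made-up frame standing in for SF (assumption for illustration)
demo = pd.DataFrame({
    "Category": ["ASSAULT", "BURGLARY", "ASSAULT"],
    "PdDistrict": ["CENTRAL", "MISSION", "PARK"],
})

# shape is an attribute (no parentheses) holding (number of rows, number of columns)
n_rows, n_cols = demo.shape
print(n_rows, n_cols)
```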


Part 3: Prepare the Data


Now that you have the data loaded into the work environment and determined the analysis you want to
perform, it is time to prepare the data for analysis.

Step 1: Extract the month and day from the Date field.

lambda is a Python keyword to define so-called anonymous functions. lambda allows you to specify a
function in one line of code, without using def and without defining a specific name for it. The syntax for a
lambda expression is:
lambda parameters : expression.
In the following, the lambda function is used to create an inline function that selects only the month digits
from the Date variable, and int to transform a string representation into an integer. Then, the pandas
function apply is used to apply this function to an entire column (in practice, apply implicitly defines a for
loop and passes one by one the rows to the lambda function). The same procedure can be done for the
Day.

In [33]:
# Code cell 7
SF['Month'] = SF['Date'].apply(lambda row: int(row[0:2]))
SF['Day'] = SF['Date'].apply(lambda row: int(row[3:5]))
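
The apply-with-lambda approach is one of several ways to pull the month and day out of the Date strings. The sketch below compares it with two alternatives on made-up data: vectorized string slicing through the .str accessor, and parsing the whole timestamp with pd.to_datetime. The demo frame and the format string are assumptions based on the date layout shown in the printed data, not part of the lab.

```python
import pandas as pd

# Made-up dates in the same layout as the Date column (assumption for illustration)
demo = pd.DataFrame({"Date": ["08/31/2014 07:00:00 AM +0000",
                              "06/01/2014 07:00:00 AM +0000"]})

# Approach used in the lab: apply a lambda to each value
month_apply = demo["Date"].apply(lambda row: int(row[0:2]))

# Vectorized string slicing, no explicit lambda
month_slice = demo["Date"].str[0:2].astype(int)

# Parse the full timestamp, then read the month and day components
parsed = pd.to_datetime(demo["Date"], format="%m/%d/%Y %I:%M:%S %p %z")
month_dt = parsed.dt.month
day_dt = parsed.dt.day

print(month_apply.tolist(), month_slice.tolist(), month_dt.tolist(), day_dt.tolist())
```

Parsing the timestamp is more work up front, but it fails loudly on malformed dates, whereas slicing silently depends on every string having the same layout.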

To verify that these two variables were added to the SF data frame, use the print function to print some
values from these columns, and use type to check that the new columns do contain numerical values.

In [34]:
# Code cell 8
print(SF['Month'][0:2])
print(SF['Day'][0:2])
0 8
1 8
Name: Month, dtype: int64
0 31
1 31
Name: Day, dtype: int64

In [35]:
# Code cell 9
print(type(SF['Month'][0]))
<class 'numpy.int64'>

Step 2: Remove variables from the SF data frame.

a) The column IncidntNum contains many cells with NaN, meaning that the data is missing. Furthermore,
IncidntNum does not provide any value to the analysis, so the column can be dropped from the data frame.
One way to remove unwanted variables from a data frame is the del statement.

In [36]:
# Code cell 10
del SF['IncidntNum']


b) Similarly, the Location attribute will not be used in this analysis and can be dropped from the data frame.
This time, use the drop function on the data frame, specifying axis=1 to drop a column (axis=0 drops rows)
and inplace=True so that the data frame is modified directly and the result does not need to be assigned.

In [37]:
# Code cell 11
SF.drop('Location', axis=1, inplace=True )
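
For comparison, drop can also be called without inplace=True, in which case it returns a new data frame and leaves the original untouched. A minimal sketch, assuming a made-up frame (demo) with the two columns the lab removes:

```python
import pandas as pd

# Made-up frame with the two columns the lab removes (assumption for illustration)
demo = pd.DataFrame({
    "IncidntNum": [None, None],
    "Category": ["ASSAULT", "BURGLARY"],
    "Location": ["(37.79, -122.41)", "(37.76, -122.42)"],
})

# The columns keyword is equivalent to passing axis=1;
# drop returns a new DataFrame by default, so demo itself is unchanged
trimmed = demo.drop(columns=["IncidntNum", "Location"])
print(list(trimmed.columns))
print(list(demo.columns))
```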

In [38]:
SF
Out[38]:
       Category        Descript                               DayOfWeek  Date                          Time   PdDistrict  Resolution      Address                   X            Y          Month  Day
0      LARCENY/THEFT   GRAND THEFT FROM UNLOCKED AUTO         Sunday     08/31/2014 07:00:00 AM +0000  20:30  CENTRAL     NONE            HYDE ST / CALIFORNIA ST   -122.417393  37.790974  8      31
1      LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO           Sunday     08/31/2014 07:00:00 AM +0000  14:30  CENTRAL     NONE            COLUMBUS AV / JACKSON ST  -122.404418  37.796302  8      31
2      LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO           Sunday     08/31/2014 07:00:00 AM +0000  11:30  CENTRAL     NONE            SUTTER ST / STOCKTON ST   -122.406959  37.789435  8      31
3      DRUG/NARCOTIC   POSSESSION OF METH-AMPHETAMINE         Sunday     08/31/2014 07:00:00 AM +0000  17:49  MISSION     ARREST, BOOKED  16TH ST / MISSION ST      -122.419672  37.765050  8      31
4      DRUG/NARCOTIC   POSSESSION OF COCAINE                  Sunday     08/31/2014 07:00:00 AM +0000  18:05  NORTHERN    ARREST, BOOKED  LARKIN ST / OFARRELL ST   -122.417904  37.785167  8      31
...    ...             ...                                    ...        ...                           ...    ...         ...             ...                       ...          ...        ...    ...
30755  LARCENY/THEFT   PETTY THEFT SHOPLIFTING                Sunday     06/01/2014 07:00:00 AM +0000  15:30  SOUTHERN    ARREST, BOOKED  900.0 Block of MARKET ST  -122.408052  37.783957  6      1
30756  OTHER OFFENSES  DRIVERS LICENSE, SUSPENDED OR REVOKED  Sunday     06/01/2014 07:00:00 AM +0000  16:00  NORTHERN    ARREST, CITED   POLK ST / MCALLISTER ST   -122.418601  37.780261  6      1
30757  ASSAULT         BATTERY                                Sunday     06/01/2014 07:00:00 AM +0000  15:00  TENDERLOIN  ARREST, CITED   0.0 Block of JONES ST     -122.412122  37.781379  6      1
30758  ASSAULT         ASSAULT WITH CAUSTIC CHEMICALS         Sunday     06/01/2014 07:00:00 AM +0000  15:20  CENTRAL     NONE            200.0 Block of GEARY ST   -122.407434  37.787494  6      1
30759  OTHER OFFENSES  DRIVERS LICENSE, SUSPENDED OR REVOKED  Sunday     06/01/2014 07:00:00 AM +0000  13:15  INGLESIDE   ARREST, CITED   MISSION ST / BOSWORTH ST  -122.426391  37.733675  6      1

30760 rows × 12 columns

c) Check that the columns have been removed.

In [39]:
# Code cell 12
SF.columns
Out[39]:
Index(['Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'PdDistrict',
'Resolution', 'Address', 'X', 'Y', 'Month', 'Day'],
dtype='object')

Part 4: Analyze the Data


Now that the data frame has been prepared with the data, it is time to analyze the data.

Step 1: Summarize variables to obtain statistical information.


a) Use the function value_counts to summarize the number of crimes committed by type, then print to
display the contents of the CountCategory variable.

In [40]:
# Code cell 13
CountCategory = SF['Category'].value_counts()
print(CountCategory)
LARCENY/THEFT 8205
OTHER OFFENSES 4004
NON-CRIMINAL 3653
ASSAULT 2518
VEHICLE THEFT 1885
...
LOITERING 5
BAD CHECKS 3
PORNOGRAPHY/OBSCENE MAT 1
BRIBERY 1
GAMBLING 1
Name: Category, Length: 36, dtype: int64

b) By default, the counts are ordered in descending order. The value of the optional parameter ascending
can be set to True to reverse this behavior.


In [41]:
# Code cell 14
SF['Category'].value_counts(ascending=True)
Out[41]:
GAMBLING 1
BRIBERY 1
PORNOGRAPHY/OBSCENE MAT 1
BAD CHECKS 3
LOITERING 5
...
VEHICLE THEFT 1885
ASSAULT 2518
NON-CRIMINAL 3653
OTHER OFFENSES 4004
LARCENY/THEFT 8205
Name: Category, Length: 36, dtype: int64

What type of crime was committed the most?

c) By nesting the two functions into one command, you can accomplish the same result with one line of
code.

In [42]:
# Code cell 15
print(SF['Category'].value_counts(ascending=True))
GAMBLING 1
BRIBERY 1
PORNOGRAPHY/OBSCENE MAT 1
BAD CHECKS 3
LOITERING 5
...
VEHICLE THEFT 1885
ASSAULT 2518
NON-CRIMINAL 3653
OTHER OFFENSES 4004
LARCENY/THEFT 8205
Name: Category, Length: 36, dtype: int64

Challenge Question: Which PdDistrict had the most incidents of reported crime? Provide the Python
command(s) used to support your answer.

In [43]:
# code cell 16
# Possible code for the challenge question
print(SF['PdDistrict'].value_counts(ascending=True))
RICHMOND 1622
PARK 1800
TARAVAL 2038
TENDERLOIN 2449
INGLESIDE 2613
BAYVIEW 2970
NORTHERN 3205
CENTRAL 3867
MISSION 4011
SOUTHERN 6185
Name: PdDistrict, dtype: int64
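
Instead of reading the answer off the sorted listing, the top label can be retrieved programmatically. The sketch below is one possible approach on made-up labels (the districts Series is hypothetical): idxmax returns the label with the highest count, and normalize=True turns the counts into fractions.

```python
import pandas as pd

# Made-up district labels standing in for SF['PdDistrict'] (assumption for illustration)
districts = pd.Series(["SOUTHERN", "MISSION", "SOUTHERN",
                       "CENTRAL", "SOUTHERN", "MISSION"])

counts = districts.value_counts()
busiest = counts.idxmax()                        # label with the highest count
share = districts.value_counts(normalize=True)   # fractions instead of raw counts

print(busiest, counts[busiest], round(share[busiest], 2))
```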


Step 2: Subset the data into smaller data frames.

a) Logical indexing can be used to select only the rows for which a given condition is satisfied. For example,
the following code extracts only the crimes committed in August, and stores the result in a new DataFrame.

In [44]:
# Code cell 17
AugustCrimes = SF[SF['Month'] == 8]
AugustCrimes
Out[44]:
      Category                     Descript                                      DayOfWeek  Date                          Time   PdDistrict  Resolution      Address                       X            Y          Month  Day
0     LARCENY/THEFT                GRAND THEFT FROM UNLOCKED AUTO                Sunday     08/31/2014 07:00:00 AM +0000  20:30  CENTRAL     NONE            HYDE ST / CALIFORNIA ST       -122.417393  37.790974  8      31
1     LARCENY/THEFT                GRAND THEFT FROM LOCKED AUTO                  Sunday     08/31/2014 07:00:00 AM +0000  14:30  CENTRAL     NONE            COLUMBUS AV / JACKSON ST      -122.404418  37.796302  8      31
2     LARCENY/THEFT                GRAND THEFT FROM LOCKED AUTO                  Sunday     08/31/2014 07:00:00 AM +0000  11:30  CENTRAL     NONE            SUTTER ST / STOCKTON ST       -122.406959  37.789435  8      31
3     DRUG/NARCOTIC                POSSESSION OF METH-AMPHETAMINE                Sunday     08/31/2014 07:00:00 AM +0000  17:49  MISSION     ARREST, BOOKED  16TH ST / MISSION ST          -122.419672  37.765050  8      31
4     DRUG/NARCOTIC                POSSESSION OF COCAINE                         Sunday     08/31/2014 07:00:00 AM +0000  18:05  NORTHERN    ARREST, BOOKED  LARKIN ST / OFARRELL ST       -122.417904  37.785167  8      31
...   ...                          ...                                           ...        ...                           ...    ...         ...             ...                           ...          ...        ...    ...
9715  NON-CRIMINAL                 AIDED CASE, MENTAL DISTURBED                  Friday     08/01/2014 07:00:00 AM +0000  19:55  MISSION     NONE            1100.0 Block of POTRERO AV    -122.406497  37.754279  8      1
9716  OTHER OFFENSES               MISCELLANEOUS INVESTIGATION                   Friday     08/01/2014 07:00:00 AM +0000  22:47  RICHMOND    NONE            1500.0 Block of BRODERICK ST  -122.441458  37.784427  8      1
9717  ASSAULT                      THREATS AGAINST LIFE                          Friday     08/01/2014 07:00:00 AM +0000  23:55  BAYVIEW     NONE            400.0 Block of TUNNEL AV      -122.401364  37.709748  8      1
9718  DRIVING UNDER THE INFLUENCE  DRIVING WHILE UNDER THE INFLUENCE OF ALCOHOL  Friday     08/01/2014 07:00:00 AM +0000  23:38  NORTHERN    ARREST, BOOKED  OAK ST / LAGUNA ST            -122.425892  37.774599  8      1
9719  SEX OFFENSES, FORCIBLE       ASSAULT TO RAPE WITH BODILY FORCE             Friday     08/01/2014 07:00:00 AM +0000  00:01  MISSION     NONE            1000.0 Block of POTRERO AV    -122.406657  37.756826  8      1

9720 rows × 12 columns

How many crime incidents were there for the month of August?

How many burglaries were reported in the month of August?

In [45]:
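# code cell 18
# Possible code for the question: How many burglaries were reported in the month of August?
AugustCrimes = SF[SF['Month'] == 8]
# Filter the August subset; filtering SF itself would count burglaries in all three months
# (that unrestricted count is 1257)
AugustCrimesB = AugustCrimes[AugustCrimes['Category'] == 'BURGLARY']
len(AugustCrimesB)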
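
When a question involves two conditions at once, such as a month and a category, the boolean masks can be combined in a single indexing step. The following is a minimal sketch on made-up rows (demo is a hypothetical stand-in for SF):

```python
import pandas as pd

# Made-up rows standing in for SF (assumption for illustration)
demo = pd.DataFrame({
    "Category": ["BURGLARY", "BURGLARY", "ASSAULT", "BURGLARY"],
    "Month":    [8,          7,          8,         8],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
august_burglaries = demo[(demo["Month"] == 8) & (demo["Category"] == "BURGLARY")]
print(len(august_burglaries))
```

Only rows that satisfy both conditions survive, so the July burglary and the August assault are excluded.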

b) To create a subset of the SF data frame for a specific day, use the query function to compare
Month and Day at the same time.

In [46]:
# Code cell 19
Crime0704 = SF.query('Month == 7 and Day == 4')
Crime0704
Out[46]:
       Category        Descript                                          DayOfWeek  Date                          Time   PdDistrict  Resolution  Address                                X            Y          Month  Day
19087  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO                      Friday     07/04/2014 07:00:00 AM +0000  22:30  SOUTHERN    NONE        8TH ST / MISSION ST                    -122.413161  37.777457  7      4
19088  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO                      Friday     07/04/2014 07:00:00 AM +0000  18:15  SOUTHERN    NONE        CLEMENTINA ST / 9TH ST                 -122.412174  37.774201  7      4
19089  BURGLARY        BURGLARY,RESIDENCE UNDER CONSTRT, FORCIBLE ENTRY  Friday     07/04/2014 07:00:00 AM +0000  00:50  TARAVAL     NONE        0.0 Block of MENDOSA AV                -122.466414  37.748011  7      4
19090  NON-CRIMINAL    LOST PROPERTY                                     Friday     07/04/2014 07:00:00 AM +0000  19:00  PARK        NONE        CASTRO ST / 16TH ST                    -122.435318  37.764102  7      4
19091  ASSAULT         BATTERY                                           Friday     07/04/2014 07:00:00 AM +0000  21:00  NORTHERN    NONE        1000.0 Block of POLK ST                -122.419783  37.785894  7      4
...    ...             ...                                               ...        ...                           ...    ...         ...         ...                                    ...          ...        ...    ...
19423  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO                      Friday     07/04/2014 07:00:00 AM +0000  19:25  SOUTHERN    NONE        THE EMBARCADEROSOUTH ST / BRYANT ST    -122.388007  37.787103  7      4
19424  OTHER OFFENSES  LOST/STOLEN LICENSE PLATE                         Friday     07/04/2014 07:00:00 AM +0000  11:00  INGLESIDE   NONE        0.0 Block of FRATESSA CT               -122.399762  37.716129  7      4
19425  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO                      Friday     07/04/2014 07:00:00 AM +0000  20:30  SOUTHERN    NONE        THE EMBARCADEROSOUTH ST / HARRISON ST  -122.388486  37.789573  7      4
19426  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO                      Friday     07/04/2014 07:00:00 AM +0000  08:00  SOUTHERN    NONE        11TH ST / HARRISON ST                  -122.412483  37.770631  7      4
19427  LARCENY/THEFT   PETTY THEFT OF PROPERTY                           Friday     07/04/2014 07:00:00 AM +0000  15:30  RICHMOND    NONE        3900.0 Block of GEARY BL               -122.461295  37.781181  7      4

341 rows × 12 columns
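
The values in a query string do not have to be hard-coded. As a hedged sketch on made-up rows, the example below passes Python variables into the query with the @ prefix and checks that the result matches the boolean-mask form. The names demo, month and day are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for SF (assumption for illustration)
demo = pd.DataFrame({
    "Month": [7, 7, 8, 7],
    "Day":   [4, 5, 4, 4],
})

month, day = 7, 4   # hypothetical values chosen for this example

# Inside a query string, @name refers to a Python variable in the calling scope
by_query = demo.query("Month == @month and Day == @day")

# Equivalent boolean-mask form
by_mask = demo[(demo["Month"] == month) & (demo["Day"] == day)]

print(by_query.index.tolist())
```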

In [47]:
# Code cell 20
SF.columns
Out[47]:
Index(['Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'PdDistrict',
'Resolution', 'Address', 'X', 'Y', 'Month', 'Day'],
dtype='object')

Part 5: Visualize the Data


Visualization and presentation of the data provides an instant overview that might not be apparent by simply
looking at the raw data. The SF data frame contains longitude and latitude coordinates that can be used to
plot the data.


Step 1: Plot a graph of the SF data frame using the X and Y variables.

a) Use the plot() function to plot the SF data frame. The optional format string 'ro' plots the points in
red (r) with circle markers (o).

In [48]:
# Code cell 21
plt.plot(SF['X'],SF['Y'], 'ro')
plt.show()

b) Identify the police department districts, then build the dictionary pd_districts_levels to associate
each district name with an integer.

In [49]:
# Code cell 22
pd_districts = np.unique(SF['PdDistrict'])
pd_districts_levels = dict(zip(pd_districts, range(len(pd_districts))))
pd_districts_levels
Out[49]:
{'BAYVIEW': 0,
'CENTRAL': 1,
'INGLESIDE': 2,
'MISSION': 3,
'NORTHERN': 4,
'PARK': 5,
'RICHMOND': 6,
'SOUTHERN': 7,
'TARAVAL': 8,
'TENDERLOIN': 9}

c) Use apply and lambda to add the police department integer id to a new column of the DataFrame.

In [50]:
# Code cell 23
SF['PdDistrictCode'] = SF['PdDistrict'].apply(lambda row: pd_districts_levels[row])
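
The dictionary-plus-apply pattern can be written more compactly. The sketch below, on made-up labels (demo is hypothetical), checks that map with the same dictionary and pandas categorical codes give the same integers as the lab's approach.

```python
import numpy as np
import pandas as pd

# Made-up district labels standing in for SF['PdDistrict'] (assumption for illustration)
demo = pd.DataFrame({"PdDistrict": ["MISSION", "CENTRAL", "MISSION", "BAYVIEW"]})

# Approach used in the lab: sorted unique labels -> dictionary -> apply
labels = np.unique(demo["PdDistrict"])
levels = dict(zip(labels, range(len(labels))))
code_apply = demo["PdDistrict"].apply(lambda row: levels[row])

# map accepts the dictionary directly
code_map = demo["PdDistrict"].map(levels)

# Categorical codes are assigned in sorted label order for string data
code_cat = demo["PdDistrict"].astype("category").cat.codes

print(code_apply.tolist())
```

The categorical route skips building the dictionary, but the dictionary remains useful when the mapping itself has to be inspected or reused.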

d) Use the newly created PdDistrictCode column to set the color of each point automatically.

In [51]:
# Code cell 24
plt.scatter(SF['X'], SF['Y'], c=SF['PdDistrictCode'])
plt.show()

Step 2: Add Map packages to enhance the plot.

In Step 1, you created a simple plot that displays where crime incidents took place in SF County. This plot is
useful, but folium provides additional functions that will allow you to overlay this plot onto an
OpenStreetMap map.

a) Folium requires the color of the marker to be specified as a hexadecimal value. For this reason, we
use the colors package from matplotlib and select the necessary colors.

In [52]:
# Code cell 25
from matplotlib import colors
districts = np.unique(SF['PdDistrict'])
print(list(colors.cnames.values())[0:len(districts)])
['#9932CC', '#FAEBD7', '#778899', '#00FF7F', '#C71585', '#3CB371', '#00FFFF', '#556B2F', '#808080', '#FFA07A']

b) Create a color dictionary for each police department district.

In [53]:
# Code cell 26
color_dict = dict(zip(districts, list(colors.cnames.values())[0:-1:len(districts)]))
color_dict
Out[53]:
{'BAYVIEW': '#9932CC',
'CENTRAL': '#FFA500',
'INGLESIDE': '#FFF8DC',
'MISSION': '#FF7F50',
'NORTHERN': '#A0522D',
'PARK': '#FFE4B5',
'RICHMOND': '#FFB6C1',
'SOUTHERN': '#5F9EA0',
'TARAVAL': '#C0C0C0',
'TENDERLOIN': '#191970'}

c) Create the map, using the mean of the coordinates in the SF data to center it. To reduce the
computation time, plotEvery limits the amount of plotted data: only every plotEvery-th row is drawn.
Set this value to 1 to plot all the rows (the map might take a long time to render).

In [54]:
# Code cell 27
# Create map
map_osm = folium.Map(location=[SF['Y'].mean(), SF['X'].mean()], zoom_start = 12)
plotEvery = 50
obs = list(zip( SF['Y'], SF['X'], SF['PdDistrict']))

for el in obs[0:-1:plotEvery]:
    # color and fill_color both take the hexadecimal value from color_dict
    folium.CircleMarker(el[0:2], color=color_dict[el[2]],
                        fill_color=color_dict[el[2]], radius=10).add_to(map_osm)
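
One detail of the subsampling slice is easy to miss: a stop value of -1 always leaves out the final observation. The sketch below illustrates this on made-up coordinates and shows a possible alternative that slices the data frame with iloc. The names demo and plot_every are hypothetical.

```python
import pandas as pd

# Made-up coordinates standing in for SF (assumption for illustration)
demo = pd.DataFrame({
    "Y": [37.70, 37.71, 37.72, 37.73, 37.74],
    "X": [-122.40, -122.41, -122.42, -122.43, -122.44],
})
plot_every = 2   # hypothetical step, playing the role of plotEvery

obs = list(zip(demo["Y"], demo["X"]))

# A stop of -1 excludes the final element, even when the step would land on it
sliced_list = obs[0:-1:plot_every]

# Slicing the frame with iloc keeps every plot_every-th row, including the last if reached
sliced_frame = demo.iloc[::plot_every]

print(len(sliced_list), len(sliced_frame))
```

For a map with thousands of markers, losing one point is harmless, but the difference matters when plot_every is set to 1 and every row is expected to appear.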

In [55]:
# Code cell 28
map_osm
Out[55]:


© 2017 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.
