
Chicago Crime: .Shape .Dropna .Info Pandas DF - Convert Dtypes Date Updated On Datetime64 (NS) To Datetime

Uploaded by

swartwba
Homework (26 pts) MA384 Data Mining (Shibberu) Winter 2024-25

1. (26 pts) Chicago Crime


Note: If your computer has only 8 GB of memory, you may find it challenging to complete this problem. The
data file used is ChicagoCrime2017-12-08.csv (1.5 GB). The file contains data on police crime reports in the
city of Chicago from 2001 to 2017. It took 18 seconds to load this file on my computer workstation.
(a) (2 pts) How many arrest records are in the data set and how many attributes does each arrest record
have? Hint: .shape
(b) (2 pts) What percent of the arrest records have missing values? Hint: .dropna()
(c) (2 pts) How much memory (in GB) is used to hold the Chicago crime data set? Hint: .info()
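A minimal sketch of the three hints in parts (a) to (c), run on a small made-up data frame. The name toy and all of its values are invented for illustration and are not taken from the Chicago file. Dividing by 1e9 assumes decimal gigabytes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime data; every value here is invented.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Ward": [10.0, np.nan, 3.0, 7.0],
    "Location Description": ["STREET", "APARTMENT", None, "STREET"],
})

# (a) .shape is a (rows, columns) tuple: records and attributes.
n_rows, n_cols = toy.shape

# (b) dropna() keeps only rows with no missing value in any column,
# so the rows it drops are the ones with at least one missing value.
n_complete = len(toy.dropna())
pct_missing = 100 * (n_rows - n_complete) / n_rows

# (c) memory_usage(deep=True) also counts the contents of string columns;
# toy.info(memory_usage="deep") prints a summary with a comparable total.
mem_gb = toy.memory_usage(deep=True).sum() / 1e9

print(n_rows, n_cols, pct_missing, mem_gb)
```

On the real file, the same three expressions apply to df unchanged.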
(d) (4 pts) Improve the functionality of Pandas by changing each column to its appropriate data type using
df.convert_dtypes(). Manually convert the Date and Updated On columns to datetime64[ns] using
pd.to_datetime(). More information on each attribute can be found at the link Chicago Crime Dataset (see footnote 1).
In your web browser, click on the vertical dots in the column header to get a description of the column
attribute.
df.dtypes

ID                      Int64
Case Number             string
Date                    datetime64[ns]
Block                   string
IUCR                    string
Primary Type            string
Description             string
Location Description    string
Arrest                  boolean
Domestic                boolean
Beat                    Int64
District                Int64
Ward                    Int64
Community Area          Int64
FBI Code                string
X Coordinate            Int64
Y Coordinate            Int64
Year                    Int64
Updated On              datetime64[ns]
Latitude                Float64
Longitude               Float64
Location                string
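A small sketch of part (d) on invented rows. The date format string below is an assumption about how the dates appear in the file (month/day/year with a 12-hour clock); check a few raw values before relying on it.

```python
import pandas as pd

# Invented rows; the column names mirror a few of those listed above.
toy = pd.DataFrame({
    "ID": [101, 102],
    "Primary Type": ["THEFT", "BATTERY"],
    "Arrest": [True, False],
    "Date": ["01/01/2017 12:30:00 PM", "12/31/2016 11:59:00 PM"],
})

# convert_dtypes() picks nullable extension types (Int64, string, boolean).
toy = toy.convert_dtypes()

# Dates are still text after convert_dtypes(), so convert them explicitly.
# The format is an assumption; an explicit format is usually much faster
# than letting pandas infer one on a large column.
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y %I:%M:%S %p")

print(toy.dtypes)
```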

(e) (2 pts) Writing a data frame to a .csv (comma-separated values) file does not save information on column
data types. Use the parquet binary format instead. Not only will the file size be smaller, but data types
are preserved as well. Hint: df.to_parquet()
(f) (2 pts) Reload df from the saved parquet file and check that data types have been preserved. Note the
.parquet file is significantly smaller than the .csv file.
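A sketch of the round trip in parts (e) and (f), using in-memory buffers and an invented data frame rather than the real file. Writing parquet requires an engine such as pyarrow to be installed, so that step is wrapped in a try block here; on your own machine you would write to a file path instead.

```python
import io
import pandas as pd

# Invented data with the same kinds of dtypes as the converted crime data.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Arrest": [True, False, True],
    "Date": pd.to_datetime(["2017-01-01", "2017-06-15", "2017-12-31"]),
}).convert_dtypes()

def dtype_names(frame):
    """Map each column name to the name of its dtype."""
    return {col: str(dtype) for col, dtype in frame.dtypes.items()}

# CSV round trip: the text file carries no dtype information.
csv_buf = io.StringIO()
toy.to_csv(csv_buf, index=False)
csv_buf.seek(0)
from_csv = pd.read_csv(csv_buf)
csv_preserved = dtype_names(from_csv) == dtype_names(toy)

# Parquet round trip: dtypes are stored alongside the data.
try:
    pq_buf = io.BytesIO()
    toy.to_parquet(pq_buf)
    pq_buf.seek(0)
    from_parquet = pd.read_parquet(pq_buf)
    parquet_preserved = dtype_names(from_parquet) == dtype_names(toy)
except ImportError:
    parquet_preserved = None  # no parquet engine available

print("CSV kept dtypes:", csv_preserved)
print("Parquet kept dtypes:", parquet_preserved)
```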
(g) (2 pts) Construct a bar chart of the number of police crime reports each year from 2001 to 2017. Do you
observe a trend in the number of reported crimes? Explain.
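One way to get the counts behind the bar chart in part (g), sketched on invented rows. It assumes the Year column is used for the grouping; the plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Invented reports; only the Year column matters for this count.
toy = pd.DataFrame({"Year": [2001, 2001, 2002, 2003, 2003, 2003]})

# value_counts() orders by frequency, so sort by year for a chart.
reports_per_year = toy["Year"].value_counts().sort_index()
print(reports_per_year)

# With matplotlib installed, the bar chart itself would be:
# reports_per_year.plot.bar()
```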
(h) (2 pts) The commands below create a plot of the total number of crimes reported on each day of the year.
Explain the spikes at the beginning and end of the plot. Hint: Could it have something to do with New
Year’s Eve and also leap years?
import matplotlib.pyplot as plt
grouping_day = df.groupby(df['Date'].dt.dayofyear)
grouping_day['ID'].count().plot()
plt.title('Total number of crimes reported on each day for all years.')

1 https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data

[Figure: line plot titled "Total number of crimes reported on each day for all years." Horizontal axis: Date (day of year, ticks from 0 to 350). Vertical axis: number of reports (ticks from 5000 to 25000).]
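When thinking about the leap-year part of the hint in part (h), it may help to see how .dt.dayofyear numbers a few specific dates. The dates below were chosen for illustration only.

```python
import pandas as pd

# A few hand-picked dates, including a leap day and two year ends.
dates = pd.Series(pd.to_datetime([
    "2015-12-31", "2016-02-29", "2016-12-31", "2017-01-01", "2017-12-31",
]))

# Day-of-year number for each date (1 = January 1).
day_numbers = dates.dt.dayofyear
print(pd.DataFrame({"date": dates, "dayofyear": day_numbers}))

# Years in the range 2001 to 2017 that contain a day number 366.
leap_years = [
    year for year in range(2001, 2018)
    if pd.Timestamp(year=year, month=1, day=1).is_leap_year
]
print(leap_years)
```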

(i) (4 pts) How many unique FBI codes are listed in the data set? Hint: .unique() Construct a rough
description of the FBI codes by performing a cross tabulation with the Primary Type description of
each arrest record. Use the most common Primary Type description for each FBI code. Hint: I found
.idxmax() to be helpful. What does an FBI code of 03 correspond to?
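A sketch of the cross tabulation and .idxmax() idea from part (i). The codes and their pairings with Primary Type below are invented for the example and should not be read as the real mapping.

```python
import pandas as pd

# Invented pairs of FBI Code and Primary Type.
toy = pd.DataFrame({
    "FBI Code": ["06", "06", "06", "08B", "08B", "26"],
    "Primary Type": ["THEFT", "THEFT", "BURGLARY",
                     "BATTERY", "BATTERY", "OTHER OFFENSE"],
})

# Number of distinct codes.
n_codes = len(toy["FBI Code"].unique())

# Rows are FBI codes, columns are Primary Type values, cells are counts.
table = pd.crosstab(toy["FBI Code"], toy["Primary Type"])

# For each row, idxmax(axis=1) returns the column label with the largest
# count, i.e. the most common Primary Type for that code.
code_to_type = table.idxmax(axis=1)
print(n_codes)
print(code_to_type)
```

The resulting Series is indexed by FBI code, so code_to_type.to_dict() gives a lookup table that can be reused later.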
(j) (4 pts) What types of crime are increasing and what types of crime are decreasing with time? Perform a
cross tabulation of FBI Code with Year. (FBI Code should be the index.) Then substitute the FBI code
with the Primary Type description determined above. Normalize the rows so that the rows are proportions
of each type of crime committed each year. The rows should add to 1. Finally, plot a heat map of the
data, sorting the rows by the proportions of crime committed in the first year, 2001. Draw conclusions
from the heat map. The commands below increase the size of the figure and plot the heat map.
Hint: .div
plt.subplots(figsize=(20,10))
sns.heatmap(prob_crimes)

[Figure: heat map of prob_crimes. Rows: Primary Type description for each FBI code. Columns: Year, 2001 to 2017. Color scale: 0.000 to 0.200.]
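A sketch of how the prob_crimes table in part (j) might be built, using invented counts. The dictionary code_to_type is a hypothetical stand-in for the mapping found in part (i), and sorting in descending order is one possible choice. The heat map call is left as a comment because it needs seaborn and matplotlib.

```python
import pandas as pd

# Invented reports: one row per report, with its FBI code and year.
toy = pd.DataFrame({
    "FBI Code": ["06", "06", "06", "06", "18", "18", "18", "18"],
    "Year":     [2001, 2001, 2002, 2003, 2001, 2002, 2002, 2003],
})

# Hypothetical code-to-description mapping, standing in for part (i).
code_to_type = {"06": "THEFT", "18": "NARCOTICS"}

# Counts with FBI Code as the index and Year as the columns.
counts = pd.crosstab(toy["FBI Code"], toy["Year"])

# Substitute the description for each code.
counts = counts.rename(index=code_to_type)

# Divide every row by its own total so that each row sums to 1.
prob_crimes = counts.div(counts.sum(axis=1), axis=0)

# Order the rows by their proportion in the first year, 2001.
prob_crimes = prob_crimes.sort_values(by=2001, ascending=False)
print(prob_crimes)

# With seaborn and matplotlib installed:
# plt.subplots(figsize=(20, 10))
# sns.heatmap(prob_crimes)
```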
