Final Project
Homework #2
Instructions
Upload a zipped file (.zip) containing three folders, namely “Data”, “Figures”, and “Results”. Your parent
folder should include one Python notebook and one R Markdown (or notebook) file at the same level as
the folders mentioned above. The folder structure is provided to you with some “sample.xxx” files that
you should delete (they might not open properly).
Project Description
The aim of this project is to provide you with practical experience working with a real-world dataset, and
to help you develop your data cleaning, modeling, and visualization skills using Python and R. You are
provided with a sample of the IPUMS USA 2021 dataset (https://usa.ipums.org/) to work with
(“./Data/psyc8990_raw_data.csv.gz”). The primary objective is to predict whether employed individuals
over 18 years old have an income greater than $70,000 (INCTOT) based on the following variables:
• Location (STATEFIP)
• Sociodemographic variables (SEX, AGE, RACE, EDUC)
• Occupational category (OCC)
Note: Pandas can read a .csv.gz file directly with the read_csv() function. If that fails, unzip the file first.
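As a minimal sketch of the note above, the compressed file can be loaded in one call. The helper name load_raw is hypothetical, and reading PERNUM as text is an assumption taken from the recoding scheme (PERNUM: character); adjust both to your own notebook.

```python
import pandas as pd


def load_raw(path="./Data/psyc8990_raw_data.csv.gz"):
    """Read the gzip-compressed CSV into a DataFrame.

    compression="infer" is the pandas default and picks gzip from the
    ".gz" extension; it is spelled out here only for clarity.
    PERNUM is read as a string (assumption based on the recoding scheme).
    """
    return pd.read_csv(path, compression="infer", dtype={"PERNUM": str})


# Example use inside the notebook (path taken from the project description):
# df_raw = load_raw()
# df_raw.head()
```

If read_csv raises an error on your system, unzipping the file and pointing the same function at the resulting .csv should behave identically.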
Step 1: Data cleaning and recoding (Python) [20 points]
Cleaning and recoding data are often performed together based on the initial coding of the data. Refer to
the attached dictionary or https://usa.ipums.org/ for more details.
Your task is to prepare a dataset using the recoding scheme outlined in this document. Set any non-
applicable or missing data to NA (i.e., np.nan in Python). After recoding, create a dataframe that:
• Only has data for Employed people (see the EMPSTAT column).
• Only contains the columns listed above.
• Does not have any rows with missing data (remove any row with even a single missing value).
• Only has unique PERNUM values (remove duplicates and keep the first occurrence).
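The four requirements above can be sketched as a single filtering function. This is one possible ordering, not a prescribed one: the function name build_analysis_frame and the keep_columns argument are hypothetical, and employed_code=1 is an assumption about how EMPSTAT is coded that should be checked against the attached dictionary.

```python
import pandas as pd


def build_analysis_frame(df, keep_columns, employed_code=1):
    """Apply the four filtering rules to an already recoded DataFrame.

    employed_code=1 is an ASSUMPTION about the EMPSTAT coding; verify it
    in the data dictionary before relying on it.
    """
    # 1. Keep employed people only.
    out = df[df["EMPSTAT"] == employed_code]
    # 2. Drop rows with a missing value in any column that will be kept.
    out = out.dropna(subset=keep_columns, how="any")
    # 3. Keep the first row for each PERNUM value.
    out = out.drop_duplicates(subset="PERNUM", keep="first")
    # 4. Keep only the requested columns.
    return out[keep_columns].reset_index(drop=True)
```

Dropping missing rows before removing duplicates means a complete later row can survive when the first row for a PERNUM is incomplete; swap steps 2 and 3 if you read the instructions the other way.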
Hint: Install the ‘openpyxl’ package to write Excel files from Python. Pandas dataframes have a .to_excel() method that uses it.
Hint: You can create a multi-level dictionary that maps coded values to string labels for each column, and
write a function that recodes the values of each column. For example:
dict_recoding = {
    'INCTOT': {
        1: 'Less than $70,000',
        2: '$70,000 or more'
    },
    'SEX': {
        1: 'Male',
        2: 'Female'
    },
    …
}
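A recoding function of the kind the hint describes could look like the sketch below. The function name recode_columns is hypothetical, and the small dictionary only repeats the two entries shown in the hint. It relies on Series.map, which turns any value missing from the mapping into NaN, matching the instruction to set non-applicable data to NA.

```python
import pandas as pd

# Only the two columns shown in the hint; extend with the remaining columns.
dict_recoding = {
    "INCTOT": {1: "Less than $70,000", 2: "$70,000 or more"},
    "SEX": {1: "Male", 2: "Female"},
}


def recode_columns(df, recoding):
    """Return a copy of df with each listed column mapped to its labels.

    Values that do not appear in a column's mapping become NaN.
    Columns named in the dictionary but absent from df are skipped.
    """
    out = df.copy()
    for column, mapping in recoding.items():
        if column in out.columns:
            out[column] = out[column].map(mapping)
    return out
```

Working on a copy keeps the raw dataframe intact, which makes it easier to rerun the recoding while you refine the dictionary.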
Recoding Scheme
• PERNUM: character (not numeric)
• INCTOT
o 1 – Less than $70,000
o 2 – $70,000 or more
• SEX
o 1 – Male
o 2 – Female
• AGE: numeric (in years), exclude people below 18 years old
• RACE
o 1 – White
o 2 – Black/African American
o 3 – Other
• HISPAN
o 1 – Hispanic
o 2 – Not Hispanic
• EDUC
o 1 – High school or less
o 2 – Some college or bachelor’s degree
o 3 – Greater than bachelor’s degree
• OCC (see https://usa.ipums.org/usa/volii/occ2018.shtml)
o 1 – Management, business, science, and arts occupations
o 2 – Service occupations
o 3 – Sales and office occupations
o 4 – Natural resources, construction, and maintenance occupations
o 5 – Production, transportation, and material moving occupations
• STATEFIP (see https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf)
o 1 – Northeast
o 2 – Midwest
o 3 – South
o 4 – West
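Three of the recodes in the scheme above (INCTOT, AGE, STATEFIP) can be sketched as follows. The function names are hypothetical. The value 9999999 as the INCTOT not-applicable code is an assumption to confirm in the data dictionary, and the FIPS-to-region table should be checked against the Census reference map linked above. The income cut-off follows the scheme's wording ("$70,000 or more"), so exactly $70,000 is coded 2.

```python
import numpy as np
import pandas as pd

# Census regions (codes 1-4 as in the scheme) and their state FIPS codes.
REGION_FIPS = {
    1: [9, 23, 25, 33, 34, 36, 42, 44, 50],                                  # Northeast
    2: [17, 18, 19, 20, 26, 27, 29, 31, 38, 39, 46, 55],                     # Midwest
    3: [1, 5, 10, 11, 12, 13, 21, 22, 24, 28, 37, 40, 45, 47, 48, 51, 54],   # South
    4: [2, 4, 6, 8, 15, 16, 30, 32, 35, 41, 49, 53, 56],                     # West
}
FIPS_TO_REGION = {fips: region for region, codes in REGION_FIPS.items() for fips in codes}


def recode_income(inctot, threshold=70_000, na_code=9_999_999):
    """Code income as 1 (below threshold) or 2 (threshold or more).

    na_code is an ASSUMED not-applicable marker; such values become NaN.
    """
    inctot = pd.to_numeric(inctot, errors="coerce")
    inctot = inctot.where(inctot != na_code)
    codes = pd.Series(np.where(inctot >= threshold, 2.0, 1.0), index=inctot.index)
    return codes.where(inctot.notna())


def recode_age(age, minimum=18):
    """Keep age in years; ages below the minimum become NaN."""
    age = pd.to_numeric(age, errors="coerce")
    return age.where(age >= minimum)


def recode_region(statefip):
    """Map state FIPS codes to region codes 1-4; unknown codes become NaN."""
    return pd.to_numeric(statefip, errors="coerce").map(FIPS_TO_REGION)
```

Because excluded values are turned into NaN rather than dropped on the spot, the later "no missing data" filter removes those rows in one place.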