Final Project


Spring 2023, PSYC 8110

Homework #2

Posted on: Wednesday, April 12, 2023


Due date: Monday, May 1, 2023, 11:59 PM
Maximum points: 40
Topics covered: Python and R

Instructions
Upload a zipped file (.zip) containing three folders, namely “Data”, “Figures”, and “Results”. Your parent
folder should include one Python notebook and one R Markdown (or notebook) file at the same level as
the folders mentioned above. The folder structure is provided to you with some “sample.xxx” files that
you should delete (they might not open properly).

Project Description
The aim of this project is to give you practical experience working with a real-world dataset and to
help you develop your data cleaning, modeling, and visualization skills using Python and R. You are
provided with a sample of the IPUMS USA 2021 dataset (https://usa.ipums.org/) to work with
(“./Data/psyc8990_raw_data.csv.gz”). The primary objective is to predict whether employed individuals
aged 18 or older have a total income (INCTOT) of $70,000 or more, based on the following variables:
• Location (STATEFIP)
• Sociodemographic variables (SEX, AGE, RACE, HISPAN, EDUC)
• Occupational category (OCC)

Note: pandas can read a .csv.gz file directly with pd.read_csv(), which infers the compression from the
file extension. If that does not work on your system, decompress the file first.
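A minimal sketch of the note above, assuming pandas is installed. The three sample rows and the column subset are made up, and the demo writes its own tiny .csv.gz to a temporary folder so that it runs without the course data.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny stand-in for the course file (values are made up).
sample_csv = "PERNUM,AGE,SEX\n1,34,1\n2,17,2\n3,52,2\n"
sample_path = Path(tempfile.mkdtemp()) / "sample_raw_data.csv.gz"
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample_csv)

# compression="infer" is the default, so the ".gz" suffix is enough.
# dtype keeps PERNUM as text, as the recoding scheme asks.
raw_data = pd.read_csv(sample_path, dtype={"PERNUM": str})
print(raw_data.shape)
```

For the assignment itself, the same call would point at "./Data/psyc8990_raw_data.csv.gz" instead of the temporary file.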
Step 1: Data cleaning and recoding (Python) [20 points]
Cleaning and recoding data are often performed together based on the initial coding of the data. Refer to
the attached dictionary or https://usa.ipums.org/ for more details.
Your task is to prepare a dataset using the recoding scheme outlined in this document. Set any non-
applicable or missing data to NA (i.e., np.nan in Python). After recoding, create a dataframe that:
• Only has data for employed people (see the EMPSTAT column).
• Only contains the columns listed above.
• Does not have any rows with missing data (remove any row that has at least one missing value).
• Only has unique PERNUM values (remove duplicates and keep the first occurrence).

Save this dataframe in a file named “psyc8990_final_data.csv” in the “Data” folder.
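The four filtering requirements above could be sketched as a single function. This is one possible approach, not the required solution: the function name and toy data are hypothetical, and the EMPSTAT code for "Employed" is an assumption to verify against the attached dictionary.

```python
import numpy as np
import pandas as pd

EMPLOYED_CODE = 1  # assumption: IPUMS codes EMPSTAT 1 as "Employed"; check the dictionary


def build_final_dataframe(recoded, keep_columns):
    """Return employed, complete, de-duplicated rows restricted to keep_columns."""
    employed = recoded[recoded["EMPSTAT"] == EMPLOYED_CODE]
    selected = employed[keep_columns]
    complete = selected.dropna(how="any")  # drop rows with at least one missing value
    unique = complete.drop_duplicates(subset="PERNUM", keep="first")
    return unique.reset_index(drop=True)


# Toy data (made up): one unemployed row, one duplicate PERNUM, one missing value.
toy = pd.DataFrame({
    "PERNUM": ["1", "2", "3", "3", "4"],
    "EMPSTAT": [1, 2, 1, 1, 1],
    "AGE": [34, 45, 52, 52, 29],
    "SEX": [1, 2, 2, 2, np.nan],
})
final_data = build_final_dataframe(toy, ["PERNUM", "AGE", "SEX"])
print(final_data)
```

Saving the result would then be a call such as final_data.to_csv("./Data/psyc8990_final_data.csv", index=False).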

Step 2: Descriptive statistics (Python) [5 points]


• Write a function that saves value counts by INCTOT for each column (except PERNUM) in a single
Excel file named “psyc8990_stats.xlsx” in the “Results” folder, using the corresponding column name in
lower case as the sheet name (e.g., HISPAN should be “hispan”).
• For the AGE column, use the following bins: 18-20, 21-30, 31-40, …, 91-100, 101+.
• Use value names rather than recoded values for the value counts. For example, the “sex” Excel sheet
should look something like:

SEX       Less than $70,000    $70,000 or more
Male      100                  200
Female    400                  500
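The AGE bins and the cross-tabulated layout shown above could be produced along these lines. This is a sketch with made-up data; the helper names are hypothetical, and the bin edges are chosen so that each interval is right-inclusive.

```python
import numpy as np
import pandas as pd

# Right-inclusive edges: (17, 20] -> "18-20", (20, 30] -> "21-30", ..., (100, inf) -> "101+"
AGE_EDGES = [17, 20, 30, 40, 50, 60, 70, 80, 90, 100, np.inf]
AGE_LABELS = ["18-20", "21-30", "31-40", "41-50", "51-60",
              "61-70", "71-80", "81-90", "91-100", "101+"]


def bin_age(age):
    """Cut a numeric AGE series into the assignment's age groups."""
    return pd.cut(age, bins=AGE_EDGES, labels=AGE_LABELS, right=True)


def counts_by_income(values, income):
    """Cross-tabulate one column against INCTOT (rows: values, columns: income)."""
    return pd.crosstab(values, income)


# Toy data (made up) that exercises the bin boundaries.
toy = pd.DataFrame({
    "AGE": [18, 20, 21, 100, 101],
    "INCTOT": ["Less than $70,000", "Less than $70,000", "$70,000 or more",
               "$70,000 or more", "Less than $70,000"],
})
age_groups = bin_age(toy["AGE"])
age_table = counts_by_income(age_groups, toy["INCTOT"])
print(age_table)
```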

Hint: Install the ‘openpyxl’ package so that pandas can write Excel files from Python. Pandas dataframes
have a .to_excel() method, and pd.ExcelWriter lets you write several sheets to one file.
Hint: You can create a multi-level dictionary with coded values and string values for each column and
write a function that recodes value for each column. For example:
dict_recoding = {
    'INCTOT': {
        1: 'Less than $70,000',
        2: '$70,000 or more',
    },
    'SEX': {
        1: 'Male',
        2: 'Female',
    },
    # ... one entry per remaining column
}
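A sketch of how such a dictionary might be applied and the resulting tables written to one workbook. The helper names are hypothetical, only the two columns from the hint are included, and writing .xlsx files assumes openpyxl is installed.

```python
import pandas as pd

# Excerpt of the recoding dictionary from the hint above (two columns only).
dict_recoding = {
    "INCTOT": {1: "Less than $70,000", 2: "$70,000 or more"},
    "SEX": {1: "Male", 2: "Female"},
}


def label_columns(data, recoding):
    """Return a copy of data with coded values replaced by their value names."""
    labeled = data.copy()
    for column, code_to_label in recoding.items():
        if column in labeled.columns:
            labeled[column] = labeled[column].map(code_to_label)
    return labeled


def save_count_tables(tables, excel_path):
    """Write each table to its own sheet, named after the lower-cased column."""
    # Needs the openpyxl package for .xlsx output.
    with pd.ExcelWriter(excel_path) as writer:
        for column, table in tables.items():
            table.to_excel(writer, sheet_name=column.lower())


# Toy data (made up), using the coded values.
toy = pd.DataFrame({"SEX": [1, 2, 2, 1], "INCTOT": [1, 2, 1, 2]})
labeled = label_columns(toy, dict_recoding)
tables = {"SEX": pd.crosstab(labeled["SEX"], labeled["INCTOT"])}
print(tables["SEX"])
```

Calling save_count_tables(tables, "./Results/psyc8990_stats.xlsx") would then produce one sheet per entry in the tables dictionary.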

Step 3: Descriptive plots (Python or R) [5 points]


Create count plots for each variable (except PERNUM; use the binned values for AGE). Save these plots
in the “Figures” folder.
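One way to draw and save a count plot in Python, sketched with plain matplotlib (seaborn's countplot would be an alternative). The function name and the file-name pattern are hypothetical, and the demo saves to a temporary folder rather than to "Figures".

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def save_countplot(values, column_name, figures_dir):
    """Draw a bar chart of value counts for one column and save it as a PNG."""
    counts = values.value_counts().sort_index()
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(counts.index.astype(str).tolist(), counts.to_numpy())
    ax.set_xlabel(column_name)
    ax.set_ylabel("Number of people")
    ax.set_title(f"Counts by {column_name}")
    fig.tight_layout()
    output_path = Path(figures_dir) / f"{column_name.lower()}_countplot.png"
    fig.savefig(output_path, dpi=150)
    plt.close(fig)  # free the figure when plotting many columns in a loop
    return output_path


# Toy data (made up); the assignment would pass the labeled columns instead.
toy_sex = pd.Series(["Male", "Female", "Female"], name="SEX")
saved_path = save_countplot(toy_sex, "SEX", tempfile.mkdtemp())
print(saved_path.name)
```

Looping over the column names and passing "./Figures" as the folder would cover every variable with the same function.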

Step 4: Statistical modeling (R) [5 points]


Using the clean dataframe, fit a logistic regression model with the glm function (family = “binomial”)
to estimate the effects of AGE, SEX, RACE, HISPAN, EDUC, OCC, and STATEFIP on
INCTOT. Make sure to treat the recoded variables as factors rather than continuous variables (except AGE,
which should be used as a continuous variable).
Save the result summary in a text file named “model_results.txt” in the “Results” folder.

Step 5: Results visualization (R) [5 points]


Think of a way to visualize the results in a single plot (it can contain multiple panels/facets). This is an
open-ended question. If you are running out of ideas, you can try plotting the odds ratios (i.e., the
exponentiated coefficients of the logistic regression model) with their confidence intervals, like the plot
below (or something similar).
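To make the odds-ratio idea concrete, the arithmetic is sketched here in Python, although the step itself is to be done in R. It uses a Wald-type 95% interval, exp(coefficient ± 1.96 × standard error); R's confint on a glm fit may use a different method, so values can differ slightly. The coefficients below are hypothetical, not results from the course data.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(coefficient, standard_error, z=Z_95):
    """Return (odds ratio, lower bound, upper bound) from a log-odds coefficient."""
    lower = math.exp(coefficient - z * standard_error)
    upper = math.exp(coefficient + z * standard_error)
    return math.exp(coefficient), lower, upper


# Hypothetical estimates, NOT results from the course data.
example_terms = {"SEX: Female": (-0.40, 0.05), "AGE": (0.02, 0.001)}
for term, (beta, se) in example_terms.items():
    odds_ratio, lower, upper = odds_ratio_with_ci(beta, se)
    print(f"{term}: OR={odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to a coefficient whose Wald interval excludes 0, which is why such plots usually include a reference line at an odds ratio of 1.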
Grading Remarks
• Ensure each step is in a separate section within the corresponding Python/R file, utilizing Markdown in
Jupyter or RStudio.
• Code should be organized and free from unnecessary code that could cause issues. Relative paths should
be used for filenames to allow for seamless code execution.
• The code should be able to run on a different dataset with the same structure without any modifications,
except for the path to the source data file.
• Points will be deducted for complicated or unclear variable names, redundant or superfluous code, and
a lack of comments, docstrings in functions (for Python), plot labels, and overall clarity of code.
• Additional points may be awarded for well-written functions and overall clarity of code.
• Should you encounter any difficulties during any step, proceed with the available data or outcomes,
even if they are incorrect. The correctness of your code will be prioritized over everything else.

Recoding Scheme
• PERNUM: character (not numeric)
• INCTOT
o 1 – Less than $70,000
o 2 – $70,000 or more
• SEX
o 1 – Male
o 2 – Female
• AGE: numeric (in years), exclude people below 18 years old
• RACE
o 1 – White
o 2 – Black/African American
o 3 – Other
• HISPAN
o 1 – Hispanic
o 2 – Not Hispanic
• EDUC
o 1 – High school or less
o 2 – Some college or bachelor’s degree
o 3 – Greater than bachelor’s degree
• OCC (see https://usa.ipums.org/usa/volii/occ2018.shtml)
o 1 – Management, business, science, and arts occupations
o 2 – Service occupations
o 3 – Sales and office occupations
o 4 – Natural resources, construction, and maintenance occupations
o 5 – Production, transportation, and material moving occupations
• STATEFIP (see https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf)
o 1 – Northeast
o 2 – Midwest
o 3 – South
o 4 – West
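As an illustration of the scheme, two of the recodes might be sketched as follows. The state groupings follow the four Census regions referenced above; the INCTOT "not applicable" code is an assumption to confirm in the attached dictionary, and the function names are hypothetical.

```python
import numpy as np
import pandas as pd

# State FIPS codes grouped by Census region (1 Northeast, 2 Midwest, 3 South, 4 West).
REGION_TO_FIPS = {
    1: [9, 23, 25, 33, 34, 36, 42, 44, 50],
    2: [17, 18, 19, 20, 26, 27, 29, 31, 38, 39, 46, 55],
    3: [1, 5, 10, 11, 12, 13, 21, 22, 24, 28, 37, 40, 45, 47, 48, 51, 54],
    4: [2, 4, 6, 8, 15, 16, 30, 32, 35, 41, 49, 53, 56],
}
FIPS_TO_REGION = {fips: region
                  for region, codes in REGION_TO_FIPS.items()
                  for fips in codes}

INCOME_THRESHOLD = 70_000
INCTOT_NOT_APPLICABLE = 9999999  # assumption: IPUMS N/A code; confirm in the dictionary


def recode_statefip(statefip):
    """Map state FIPS codes to region codes; unmapped codes become NaN."""
    return statefip.map(FIPS_TO_REGION)


def recode_inctot(inctot):
    """Return 1 below the threshold, 2 at or above it, NaN for missing or N/A."""
    recoded = pd.Series(np.where(inctot >= INCOME_THRESHOLD, 2.0, 1.0),
                        index=inctot.index)
    return recoded.mask(inctot.isna() | (inctot == INCTOT_NOT_APPLICABLE))


# Toy data (made up): NY, IL, TX, CA, and one code that is not a state.
toy = pd.DataFrame({"STATEFIP": [36, 17, 48, 6, 99],
                    "INCTOT": [50000, 70000, 9999999, 120000, 0]})
toy["STATEFIP"] = recode_statefip(toy["STATEFIP"])
toy["INCTOT"] = recode_inctot(toy["INCTOT"])
print(toy)
```

Because unmapped or not-applicable values become NaN, the later step that drops rows with missing data removes them without extra handling.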
