Final Project

Spring 2023, PSYC 8110

Homework #2

Posted on: Wednesday, April 12, 2023

Due date: Monday, May 1, 2023, 11:59 PM
Maximum points: 40
Topics covered: Python and R

Instructions
Upload a zipped file (.zip) containing three folders, namely “Data”, “Figures”, and “Results”. Your parent
folder should include one Python notebook and one R Markdown (or notebook) file at the same level as
the folders mentioned above. The folder structure is provided to you with some “sample.xxx” files that
you should delete (they might not open properly).

Project Description
The aim of this project is to provide you with practical experience working with a real-world dataset, and
to help you develop your data cleaning, modeling, and visualization skills using Python and R. You are
provided with a sample of the IPUMS USA 2021 dataset (https://usa.ipums.org/) to work with
(“./Data/psyc8990_raw_data.csv.gz”). The primary objective is to predict whether employed individuals
aged 18 or older have a total income (INCTOT) greater than $70,000 based on the following variables:
• Location (STATEFIP)
• Sociodemographic variables (SEX, AGE, RACE, HISPAN, EDUC)
• Occupational category (OCC)

Note: Pandas can directly read a .csv.gz file using the .read_csv() method. If not, unzip the file.
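
As a minimal sketch of the note above, assuming pandas is installed: the snippet writes a tiny gzip-compressed CSV to a temporary folder (a hypothetical stand-in for the course file, with made-up values) and reads it back. By default, pd.read_csv infers gzip compression from the ".gz" extension. Passing dtype={"PERNUM": str} is an optional choice that keeps PERNUM as text, in line with the recoding scheme.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for "./Data/psyc8990_raw_data.csv.gz".
    sample_path = Path(tmp_dir) / "sample_raw_data.csv.gz"
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write("PERNUM,AGE,SEX\n01,34,1\n02,17,2\n")

    # Compression is inferred from the file extension; PERNUM is read as text.
    raw = pd.read_csv(sample_path, dtype={"PERNUM": str})

print(raw.dtypes)
```

With the real file, only the path passed to pd.read_csv would change.
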

Step 1: Data cleaning and recoding (Python) [20 points]
Cleaning and recoding data are often performed together based on the initial coding of the data. Refer to
the attached dictionary or https://usa.ipums.org/ for more details.
Your task is to prepare a dataset using the recoding scheme outlined in this document. Set any non-
applicable or missing data to NA (i.e., np.nan in Python). After recoding, create a dataframe that:
• Only has data for Employed people (see the EMPSTAT column).
• Only contains the columns listed above
• Does not have any rows with missing data (remove rows with even a single missing column)
• Only has unique PERNUM values (remove duplicates and keep the first item)

Save this dataframe in a file named “psyc8990_final_data.csv” in the “Data” folder.
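
The four requirements above could be sketched as follows. This is one possible ordering on toy data, not a reference solution. It assumes that EMPSTAT == 1 means "Employed" (check the data dictionary) and that recoding has already set non-applicable values to np.nan. The function name build_final_dataframe and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

EMPLOYED_CODE = 1  # assumption: EMPSTAT == 1 means "Employed"; confirm in the data dictionary


def build_final_dataframe(recoded, columns):
    """Return employed people only, restricted to `columns`, complete and unique by PERNUM."""
    employed = recoded[recoded["EMPSTAT"] == EMPLOYED_CODE]
    complete = employed[columns].dropna(how="any")  # drop rows with any missing value
    unique = complete.drop_duplicates(subset="PERNUM", keep="first")
    return unique.reset_index(drop=True)


# Toy data standing in for the recoded dataset (values are made up).
toy = pd.DataFrame(
    {
        "PERNUM": ["1", "1", "2", "3", "4"],
        "EMPSTAT": [1, 1, 3, 1, 1],
        "SEX": [1, 1, 2, np.nan, 2],
        "AGE": [34, 34, 51, 29, 45],
        "INCTOT": [1, 1, 2, 1, 2],
    }
)

final = build_final_dataframe(toy, ["PERNUM", "INCTOT", "SEX", "AGE"])
print(final)
# To save: final.to_csv("./Data/psyc8990_final_data.csv", index=False)
```

Because rows are filtered before duplicates are dropped, the first complete, employed row for each PERNUM is the one that survives. A different order of operations can keep different rows.
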

Step 2: Descriptive statistics (Python) [5 points]

• Write a function that saves value counts by INCTOT for each column (except PERNUM) in a single
Excel file named “psyc8990_stats.xlsx” in the “Results” folder, using the corresponding column name in
lower case as the sheet name (e.g., HISPAN should be “hispan”).
• For the AGE column, use the following bins: 18-20, 21-30, 31-40, …, 91-100, 101+.
• Use value names rather than recoded values for the value counts. For example, the “sex” Excel sheet
should look something like:

SEX       Less than $70,000    $70,000 or more
Male      100                  200
Female    400                  500
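
A hedged sketch of how such tables might be built with pandas: pd.cut bins AGE into the ranges listed above and pd.crosstab counts each value by income group. The helper names (bin_age, value_counts_by_income, save_value_counts) and the toy data are hypothetical. Skipping INCTOT itself when writing sheets is a simplifying choice of this sketch, and save_value_counts is only defined here, not run, because it needs the openpyxl package.

```python
import numpy as np
import pandas as pd

# Bin edges are right-inclusive: (17, 20] is "18-20", (20, 30] is "21-30", and so on.
AGE_EDGES = [17, 20, 30, 40, 50, 60, 70, 80, 90, 100, np.inf]
AGE_LABELS = ["18-20", "21-30", "31-40", "41-50", "51-60",
              "61-70", "71-80", "81-90", "91-100", "101+"]


def bin_age(age):
    """Bin ages into the labelled ranges; each bin includes its upper edge."""
    return pd.cut(age, bins=AGE_EDGES, labels=AGE_LABELS)


def value_counts_by_income(data, column):
    """Cross-tabulate one column against INCTOT (rows: values, columns: income groups)."""
    values = bin_age(data[column]) if column == "AGE" else data[column]
    return pd.crosstab(values, data["INCTOT"])


def save_value_counts(data, path):
    """Write one sheet per column (lower-case name) to a single Excel file; needs openpyxl."""
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for column in data.columns:
            if column not in ("PERNUM", "INCTOT"):
                table = value_counts_by_income(data, column)
                table.to_excel(writer, sheet_name=column.lower())


# Toy data with value names already applied (values are made up).
toy = pd.DataFrame(
    {
        "SEX": ["Male", "Male", "Female", "Female", "Female"],
        "AGE": [19, 30, 31, 67, 102],
        "INCTOT": ["Less than $70,000", "$70,000 or more", "Less than $70,000",
                   "$70,000 or more", "$70,000 or more"],
    }
)
sex_table = value_counts_by_income(toy, "SEX")
age_table = value_counts_by_income(toy, "AGE")
print(sex_table)
```

The crosstab columns come out in sorted order, so they may need reordering (for example with reindex) to match the layout shown above.
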

Hint: Install the ‘openpyxl’ package to write Excel files from Python. Pandas dataframes have a .to_excel() method.
Hint: You can create a multi-level dictionary with coded values and string values for each column and
write a function that recodes the values of each column. For example:
dict_recoding = {
    'INCTOT': {
        1: 'Less than $70,000',
        2: '$70,000 or more'
    },
    'SEX': {
        1: 'Male',
        2: 'Female'
    },
    …
}
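
One way such a recoding function might look, as a sketch: it applies each column's mapping with Series.map and leaves unmapped columns untouched. The function name apply_value_names and the toy data are hypothetical, and the dictionary holds only the two entries given in the hint.

```python
import pandas as pd

# Value labels copied from the hint above; a full version would cover every recoded column.
dict_recoding = {
    "INCTOT": {1: "Less than $70,000", 2: "$70,000 or more"},
    "SEX": {1: "Male", 2: "Female"},
}


def apply_value_names(data, recoding):
    """Return a copy in which coded values are replaced by names for every mapped column."""
    named = data.copy()
    for column, mapping in recoding.items():
        if column in named.columns:
            named[column] = named[column].map(mapping)
    return named


# Toy coded data (values are made up).
coded = pd.DataFrame({"INCTOT": [1, 2, 2], "SEX": [2, 1, 2], "AGE": [25, 40, 63]})
named = apply_value_names(coded, dict_recoding)
print(named)
```

Series.map turns any code that is missing from the mapping into NaN, which makes gaps in the dictionary easy to spot.
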

Step 3: Descriptive plots (Python or R) [5 points]

Create countplots for each variable (except PERNUM; use the binned values for AGE). Save these plots
in a “Figures” folder.
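
A minimal sketch of saving one count plot per column, assuming matplotlib and pandas are installed. It draws plain bar charts from value_counts as a stand-in for a dedicated countplot function (such as the one in seaborn). The helper name save_countplot, the file-name pattern, and the toy data are hypothetical, and the demo writes to a temporary folder rather than to the real "Figures" folder.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def save_countplot(values, out_dir):
    """Save a bar chart of category counts for one column and return the image path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = values.value_counts().sort_index()
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_xlabel(values.name)
    ax.set_ylabel("Count")
    ax.set_title(f"Count of observations by {values.name}")
    fig.tight_layout()
    out_path = out_dir / f"{str(values.name).lower()}_countplot.png"
    fig.savefig(out_path)
    plt.close(fig)  # free the figure so many plots can be saved in a loop
    return out_path


# Toy data (values are made up); the demo saves into a temporary "Figures" folder.
toy = pd.DataFrame({"SEX": ["Male", "Female", "Female"], "EDUC": [1, 3, 3]})
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = [save_countplot(toy[column], Path(tmp_dir) / "Figures") for column in toy.columns]
    all_saved = all(path.exists() for path in saved)
print(all_saved)
```

In the project itself, out_dir would be the relative path "Figures" and the loop would skip PERNUM and use the binned AGE values.
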

Step 4: Statistical modeling (R) [5 points]

Using the clean dataframe, fit a logistic regression model with the glm function and the
family = “binomial” argument to estimate the effects of AGE, SEX, RACE, HISPAN, EDUC, OCC, and STATEFIP on
INCTOT. Make sure to treat the recoded variables as factors and not as continuous variables (except AGE,
which should be used as a continuous variable).
Save the result summary in a text file named “model_results.txt” in the “Results” folder.
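
This step must be done in R, so the following Python snippet is only an illustration of what treating a coded variable as a factor means, not part of a solution. Assuming pandas is installed, pd.get_dummies turns each category into its own indicator column and drops the first level as the reference category, while AGE stays a single numeric column. The toy data are made up.

```python
import pandas as pd

# Toy coded data (values are made up).
toy = pd.DataFrame({"AGE": [25, 40, 63, 31], "SEX": [1, 2, 2, 1], "RACE": [1, 3, 2, 1]})

# Treat SEX and RACE as categories: one indicator column per level, dropping the first
# level of each so that it serves as the reference category. AGE stays numeric.
design = pd.get_dummies(toy, columns=["SEX", "RACE"], drop_first=True)
print(list(design.columns))
```

If RACE were left as a number instead, the model would estimate a single slope across codes 1, 2, and 3, which imposes an ordering the categories do not have.
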

Step 5: Results visualization (R) [5 points]

Think of a way to visualize the results in a single plot (it can contain multiple panels/facets). This is an
open-ended question. If you are running out of ideas, you can try creating a plot of odds ratios (i.e., the
exponentiated coefficients of the logistic regression model) and their confidence intervals like the one below (or
something similar).
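
The odds-ratio arithmetic can be sketched in a few lines. The plot itself is expected in R, so this Python snippet only illustrates the calculation. It assumes a Wald-type 95% interval (coefficient plus or minus 1.96 standard errors, then exponentiated). The function name odds_ratio_with_ci and the input numbers are hypothetical.

```python
import math


def odds_ratio_with_ci(coefficient, std_error, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a logistic regression coefficient."""
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z * std_error)
    upper = math.exp(coefficient + z * std_error)
    return odds_ratio, lower, upper


# Made-up coefficient and standard error, for illustration only.
estimate, low, high = odds_ratio_with_ci(0.5, 0.1)
print(f"OR = {estimate:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

An odds ratio of 1 corresponds to a coefficient of 0, so an interval that excludes 1 indicates an effect that differs from zero at the chosen level.
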

Grading Remarks
• Ensure each step is in a separate section within the corresponding Python/R file, utilizing Markdown in
Jupyter or RStudio.
• Code should be organized and free from unnecessary code that could cause issues. Relative paths should
be used for filenames to allow for seamless code execution.
• The code should be able to run on a different dataset with the same structure without any modifications,
except for the path to the source data file.
• Points will be deducted for complicated or unclear variable names, redundant or superfluous code, and
a lack of comments, docstrings in functions (for Python), plot labels, and overall clarity of code.
• Additional points may be awarded for well-written functions and overall clarity of code.
• Should you encounter any difficulties during any step, proceed with the available data or outcomes,
even if they are incorrect. The correctness of your code will be prioritized over everything else.
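
One possible way to meet the relative-path and portability remarks is to define every path once at the top of the notebook, so that only the source file line changes for a new dataset. This is a sketch using pathlib from the standard library; the constant names are hypothetical, while the file and folder names come from this handout.

```python
from pathlib import Path

# The only line expected to change for a different dataset with the same structure.
RAW_DATA_FILE = Path("Data") / "psyc8990_raw_data.csv.gz"

# Output locations, all relative to the folder that contains the notebook.
DATA_DIR = Path("Data")
FIGURES_DIR = Path("Figures")
RESULTS_DIR = Path("Results")

FINAL_DATA_FILE = DATA_DIR / "psyc8990_final_data.csv"
STATS_FILE = RESULTS_DIR / "psyc8990_stats.xlsx"

print(FINAL_DATA_FILE.as_posix())
```
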

Recoding Scheme
• PERNUM: character (not numeric)
• INCTOT
  o 1 – Less than $70,000
  o 2 – $70,000 or more
• SEX
  o 1 – Male
  o 2 – Female
• AGE: numeric (in years), exclude people below 18 years old
• RACE
  o 1 – White
  o 2 – Black/African American
  o 3 – Other
• HISPAN
  o 1 – Hispanic
  o 2 – Not Hispanic
• EDUC
  o 1 – High school or less
  o 2 – Some college or bachelor’s degree
  o 3 – Greater than bachelor’s degree
• OCC (see https://usa.ipums.org/usa/volii/occ2018.shtml)
  o 1 – Management, business, science, and arts occupations
  o 2 – Service occupations
  o 3 – Sales and office occupations
  o 4 – Natural resources, construction, and maintenance occupations
  o 5 – Production, transportation, and material moving occupations
• STATEFIP (see https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf)
  o 1 – Northeast
  o 2 – Midwest
  o 3 – South
  o 4 – West
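
Parts of this scheme could be sketched as follows, on toy data and under stated assumptions. The INCTOT N/A code (9999999) is an assumption to confirm in the data dictionary. The state-to-region map is deliberately partial (New York, Illinois, Texas, California), and the full mapping should be built from the Census reference linked above. The threshold follows the scheme's "$70,000 or more" wording. All helper names are hypothetical.

```python
import numpy as np
import pandas as pd

INCTOT_NA_CODE = 9999999  # assumption: N/A code for INCTOT; confirm in the data dictionary
# Partial, illustrative state FIPS -> region map: New York, Illinois, Texas, California.
STATE_TO_REGION = {36: 1, 17: 2, 48: 3, 6: 4}


def recode_inctot(inctot):
    """Recode income to 1 (less than $70,000) or 2 ($70,000 or more); N/A becomes NaN."""
    income = inctot.where(inctot != INCTOT_NA_CODE)  # N/A code -> NaN
    recoded = pd.Series(np.where(income >= 70000, 2.0, 1.0), index=inctot.index)
    return recoded.where(income.notna())  # restore NaN where income was missing


def recode_age(age):
    """Keep ages of 18 and above; younger ages become NaN."""
    return age.where(age >= 18)


# Toy raw data (values are made up).
raw = pd.DataFrame({"INCTOT": [25000, 70000, 9999999, 120000],
                    "AGE": [17, 18, 44, 60],
                    "STATEFIP": [36, 48, 6, 99]})
recoded = pd.DataFrame({
    "INCTOT": recode_inctot(raw["INCTOT"]),
    "AGE": recode_age(raw["AGE"]),
    "STATEFIP": raw["STATEFIP"].map(STATE_TO_REGION),  # unmapped codes become NaN
})
print(recoded)
```

Rows that end up with NaN in any column are the ones the cleaning step later removes.
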
