A1 Exploratory and Descriptive Data Analysis
Instructions:
Questions:
1. Using the City of Los Angeles public data sources, test the hypothesis that statistics on
“affordable housing projects” (government housing for low-income people) in a ZIP code are related to
the health inspection scores of the restaurants in that ZIP code.
a) Download csv files from:
i. https://catalog.data.gov/dataset/restaurant-and-market-health-inspections
ii. https://catalog.data.gov/dataset/hcidla-affordable-housing-projects-list-2003-to-present
b) Perform EDA on the two files: [4]
i. Check if the data types are as expected, else convert them
ii. Check for missing values, then decide whether to remove those rows or fill in an imputed value
iii. Check for unexpected entries in certain columns. Correct them if necessary and feasible.
iv. Plot some graphs to understand the data
c) Summarize each file by ZIP code using SQL: [2]
i. Ensure the right type of summarization (sum, mean, max etc.) for the other variables
d) Join the files using SQL by ZIP code: [2]
i. Ensure that the ZIP codes are in compatible formats and lengths
ii. For each ZIP, get the predictor variable from the housing projects file, and potential predicted
variables from the health inspections file
e) Formulate and test the hypothesis: [2]
i. Formulate a reasonable alternative hypothesis
ii. Formulate a null hypothesis
iii. Select an appropriate test and significance level
iv. Perform the test and decide whether the null hypothesis should be rejected in favor of the
alternative hypothesis
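
Steps c) to e) can be sketched end to end as follows. This is only an illustration under stated assumptions, not the expected solution: the rows are made up, the column names `units` and `score` are hypothetical stand-ins for whatever the real CSV files contain, and the choice of a Pearson correlation with a permutation test at a 0.05 significance level is one possible option among several. Only the Python standard library is used (sqlite3 for the SQL steps).

```python
import random
import re
import sqlite3

def normalize_zip(raw):
    """Return the leading 5 digits as a string, or None if there are none."""
    match = re.match(r"\s*(\d{5})", str(raw))
    return match.group(1) if match else None

# Made-up rows standing in for the two CSV files (hypothetical columns).
inspections = [("90012", 95), ("90012-1234", 91), ("90013", 88),
               (" 90013", 84), ("90014", 97), ("90015", 90)]
housing = [("90012", 40), ("90012", 60), ("90013", 120),
           ("90014", 25), (90015.0, 80)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (zip TEXT, score REAL)")
conn.execute("CREATE TABLE housing (zip TEXT, units INTEGER)")
conn.executemany("INSERT INTO inspections VALUES (?, ?)",
                 [(normalize_zip(z), s) for z, s in inspections])
conn.executemany("INSERT INTO housing VALUES (?, ?)",
                 [(normalize_zip(z), u) for z, u in housing])

# Summarize each table by ZIP (sum of units, mean score), then join on ZIP.
query = """
WITH h AS (SELECT zip, SUM(units) AS total_units
           FROM housing WHERE zip IS NOT NULL GROUP BY zip),
     i AS (SELECT zip, AVG(score) AS mean_score
           FROM inspections WHERE zip IS NOT NULL GROUP BY zip)
SELECT h.zip, h.total_units, i.mean_score
FROM h JOIN i ON h.zip = i.zip
ORDER BY h.zip
"""
rows = conn.execute(query).fetchall()

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided p-value: how often shuffled data is at least as correlated."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# H0: no linear relation between units and mean score; H1: some relation.
units = [r[1] for r in rows]
scores = [r[2] for r in rows]
r = pearson(units, scores)
p_value = permutation_p_value(units, scores)
alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(rows, round(r, 3), round(p_value, 3), reject_null)
```

With only four made-up ZIP codes the test has almost no power; the real files yield many more ZIP codes, and a rank-based measure such as Spearman's correlation may suit skewed counts better.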
2. Open-ended: Find some interesting data from the Indian government data portal https://www.data.gov.in
and perform EDA, derive some insights using graphs, and perform a statistical test for an interesting
hypothesis. No need to use multiple files for this question, unless you want to do the extra work for your
own learning. [4]
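
The basic EDA checks (data types, missing values, unexpected entries) might be sketched as below for any single CSV file. The inline CSV text and the column names `region`, `year`, and `value` are made up for illustration and do not come from any real dataset; median imputation is just one possible choice. The sketch uses only the standard library, though pandas would be the more usual tool for this work.

```python
import csv
import io
from collections import Counter
from statistics import median

# Made-up CSV text standing in for a downloaded file (hypothetical columns).
raw_csv = """region,year,value
North,2011,10.0
north ,2011,
South,2011,20.0
East,2011,30.0
east,2011,n/a
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

def to_float(text):
    """Convert a cell to float; blanks and non-numeric text become None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

# Convert types and tidy inconsistent category labels.
for row in rows:
    row["region"] = row["region"].strip().title()
    row["year"] = int(row["year"])
    row["value"] = to_float(row["value"])

# Count missing values, then impute them with the median of the known ones.
missing = sum(1 for row in rows if row["value"] is None)
known = [row["value"] for row in rows if row["value"] is not None]
fill = median(known)
for row in rows:
    if row["value"] is None:
        row["value"] = fill

# Frequency table of the cleaned labels, useful for spotting odd entries.
region_counts = Counter(row["region"] for row in rows)
print(missing, fill, dict(region_counts))
```

Whether to impute or drop rows depends on how many values are missing and why; here two of five values are missing, which on real data would usually argue for caution with any imputation.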