CSCA08H Assignment 3
Background
A commonly-held belief is that an individual's health is largely influenced by the
choices they make. However, there is lots of evidence that health is affected by
systemic factors.
In this assignment, you will write code to assist with analysing data on the
relationship between hypertension (also known as high blood pressure) and income
levels in Toronto neighbourhoods. The data you will work with is real, although
we have simplified it somewhat to make this assignment clearer for you.
The data analysis that your code will do will include some statistical analysis that we
have not talked about in the course. You do NOT need to understand the underlying
statistics to complete this assignment. The code you write will do some simple
mathematical operations, like adding up some numbers, or finding ratios using
division. We will use Pearson correlation for the more advanced analysis and you will
use existing functions that we have imported for you.
You will need to take a look at the examples of these functions in order to figure out
what arguments you need to pass to them, and what types of data they return, but
you do not need to understand how they work in any detail.
Correlation is a single coefficient expressing the tendency of one set of data to grow
linearly, in the same or opposite direction, with another set of data. This is done by
comparing whether points that have been paired between the two sets are similarly
greater or less than their set's respective averages.
For example, if we wanted to check whether age is correlated with height for
students in the class, we would have two sets of data: ages (which we could
express as, say, number of weeks old for finer granularity) and heights.
Numbers from each set are ordered in the same way so that each height value
corresponds to the age value for the same student. What is nice about the correlation
https://www.utsc.utoronto.ca/~atafliovich/csca08/assignments/a3/index.html 1/12
11/29/22, 11:12 AM CSCA08H Assignment 3
metric we are using is that it is normalised to be between -1 and 1, with these values
giving us a nice human interpretation. A value of 1 means that the paired points lie
on a straight line with a positive slope. In our example, this means, for some increase in age, we have a
consistent increase in height. Similarly, a value of -1 is the same relationship but with
a flip of direction, where older students would be shorter than younger ones. Finally, a
value of 0 would say that there is no consistent increase or decrease in height for a
change in age. We will use this to investigate the relationship between low income
rates and hypertension, for any tendency to increase or decrease together.
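To make the idea concrete, here is a minimal sketch of how a Pearson correlation coefficient can be computed by hand. It is for illustration only: the function name pearson and the sample values are made up, and in the assignment you should use the imported functions provided in the starter code rather than anything like this.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Return the Pearson correlation coefficient of the paired values
    in xs and ys. Assumes both lists have the same length and that
    neither list is constant.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of each pair's deviations from its set's average.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


ages_in_weeks = [900, 950, 1000, 1050]         # made-up values
heights_in_cm = [160.0, 165.0, 170.0, 175.0]   # made-up values
print(pearson(ages_in_weeks, heights_in_cm))
```

Because these made-up heights grow by the same amount for each step in age, the printed value is very close to 1. Reversing one of the lists gives a value very close to -1.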
If you are a statistics person, keep in mind that the learning goals of the assignment
are about writing code using what we've learned in the course, not about doing a
proper statistical analysis.
Dataset descriptions
This assignment uses data files related to one of the two variables of interest (i.e.,
hypertension data or income data). The files are CSV (comma separated values) files,
where each column in a line is separated by a comma. You can assume there are no
commas anywhere else in the files, other than to separate columns, and that any file
given is in the correct format. The two file types are described below.
The first row in a neighbourhood hypertension file contains header information, and the
remaining rows each contain data relating to hypertension prevalence in a particular
Toronto neighbourhood.
Here is a description of the different columns of the dataset. Notice the use of
constants and carefully study the starter file constants.py.
Column index      Description
HT_ID_COL         An ID that uniquely identifies each neighbourhood.
HT_NBH_NAME_COL   The name of the neighbourhood. Neighbourhood names are unique.
HT_20_44_COL      The number of people aged 20 to 44 with hypertension in the neighbourhood.
NBH_20_44_COL     The total number of people aged 20 to 44 in the neighbourhood.
HT_45_64_COL      The number of people aged 45 to 64 with hypertension in the neighbourhood.
NBH_45_64_COL     The total number of people aged 45 to 64 in the neighbourhood.
HT_65_UP_COL      The number of people aged 65 and older with hypertension in the neighbourhood.
NBH_65_UP_COL     The total number of people aged 65 and older in the neighbourhood.
Neighbourhood hypertension dataset
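As a rough illustration of how one data row maps onto these columns, the sketch below splits a line on commas and picks out each field by its column constant. The column numbers and the sample line are assumptions made for illustration (the line is assembled from the Elms-Old Rexdale values quoted elsewhere in this handout). The real indices are defined in constants.py, and parse_hypertension_row is a hypothetical helper, not part of the starter code.

```python
# Assumed column positions; the real values are defined in constants.py.
HT_ID_COL = 0
HT_NBH_NAME_COL = 1
HT_20_44_COL = 2
NBH_20_44_COL = 3
HT_45_64_COL = 4
NBH_45_64_COL = 5
HT_65_UP_COL = 6
NBH_65_UP_COL = 7

COUNT_COLS = (HT_20_44_COL, NBH_20_44_COL, HT_45_64_COL,
              NBH_45_64_COL, HT_65_UP_COL, NBH_65_UP_COL)


def parse_hypertension_row(line: str) -> tuple[str, int, list[int]]:
    """Return (name, id, counts) for one data line of a hypertension file."""
    # The handout guarantees commas only ever separate columns.
    fields = line.strip().split(',')
    counts = [int(fields[col]) for col in COUNT_COLS]
    return fields[HT_NBH_NAME_COL], int(fields[HT_ID_COL]), counts


# Hypothetical line, built from the Elms-Old Rexdale sample values.
row = '5,Elms-Old Rexdale,176,3353,1040,2842,948,1322\n'
print(parse_hypertension_row(row))
```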
The first row in a neighbourhood income data file contains header information, and
the remaining rows each contain data about low income status.
Here is a description of the different columns of the dataset. Notice the use of
constants and carefully study the starter file constants.py.
Column index      Description
LI_ID_COL         An ID that uniquely identifies each neighbourhood.
LI_NBH_NAME_COL   The name of the neighbourhood. Neighbourhood names are unique.
POP_COL           The total population in the neighbourhood.
LI_POP_COL        The number of people in the neighbourhood with low income status.
Neighbourhood income dataset
Neighbourhood names and ids are the same between our hypertension data files and
our low income data files. However, the total population of a neighbourhood can be
different between the two data files, as they were collected at different times.
A dictionary that is a value in a dictionary of type CityData has the following key/value
pairs. Notice the use of constants and carefully study the starter file constants.py.
The sample CityData dictionary above represents hypertension and low income data for
five neighbourhoods: West Humber-Clairville, Mount Olive-Silverstone-Jamestown,
Thistletown-Beaumond Heights, Rexdale-Kipling, and Elms-Old Rexdale.
Let's take a closer look at the data for Elms-Old Rexdale. This neighbourhood is
represented by the key/value pair where the key is 'Elms-Old Rexdale'. The id of this
neighbourhood is 5. The hypertension data for this neighbourhood is as follows: 3353
people are between the ages of 20 and 44, 176 of whom have hypertension. There
are 2842 people between the ages of 45 and 64, 1040 of whom have hypertension,
and there are 1322 people aged 65 and up, 948 of whom have hypertension. The low
income data for this neighbourhood is that 2315 people are classified as low income,
from a total population of 9460 people.
Note that the totals do not match between the low income and the hypertension data
— this is because the low income data was collected before the hypertension data,
and the size of the neighbourhoods changed. For the purposes of this assignment, we
will assume the collection of these two datasets is close enough in time to compare
them to each other. You do not need to do anything about these differing totals, other
than to make sure you are using the correct total when computing rates, as described
later.
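The sketch below shows what "using the correct total" could look like for Elms-Old Rexdale. The key names are copied from the sample output printed in this handout; in a3.py you would use the constants from constants.py instead of string literals. Treating the raw hypertension rate as total cases divided by total population across the three age groups is an assumption made for illustration.

```python
# Entry for Elms-Old Rexdale, using the values described above.
elms_old_rexdale = {
    'id': 5,
    'hypertension': [176, 3353, 1040, 2842, 948, 1322],
    'total': 9460,
    'low_income': 2315,
}

# Hypertension rate: the denominator comes from the hypertension data.
# The list alternates (cases, population) for 20-44, 45-64 and 65+.
ht = elms_old_rexdale['hypertension']
ht_cases = ht[0] + ht[2] + ht[4]
ht_population = ht[1] + ht[3] + ht[5]
ht_rate = ht_cases / ht_population

# Low income rate: the denominator comes from the low income data.
li_rate = elms_old_rexdale['low_income'] / elms_old_rexdale['total']

print(ht_population, elms_old_rexdale['total'])
```

The two populations printed at the end differ, which is exactly the mismatch the note above describes.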
Age standardisation
This section describes the process of age standardisation that we will use in this
assignment to perform a more accurate analysis. Note that we have given you a
function that computes the age standardised rate from the raw rate
(described in Task 3). This section is for your information only; we have
already implemented this for you.
Our dataset will let us calculate the rate of hypertension in each Toronto
neighbourhood. One complicating factor is that different neighbourhoods have
different age demographics. For example, the Henry Farm neighbourhood has a
significantly lower proportion of 65+ residents than Hillcrest Village. And because
people aged 65+ have a higher overall rate of hypertension, this demographic
difference alone would cause us to expect to see a difference in the overall
hypertension rate between these neighbourhoods.
Because we care about the impact of low income status on hypertension rates, we
want to remove the impact of different age demographics between the
neighbourhoods. To do so, we will use a process called age standardisation to
calculate an adjusted hypertension rate that ignores differences in ages. This process
involves the following steps for each neighbourhood:
1. First, we'll calculate the hypertension rate within each of the following age
groups: 20-44, 45-64, and 65+. We'll report these rates as percentages, which
you can think of as being the number of cases of hypertension per 100 people
aged 20-44 / 45-64 / 65 and up.
2. Then, we'll pick one standard population with certain numbers of people in these
age groups. For the purpose of this assignment, we'll use the total Canadian
population from the 1991 census:
Age Group      Population
20-44          11,199,830
45-64           5,365,865
65+             3,169,970
Total (20+)    19,735,665
Population by age group data
3. Then, we'll use the neighbourhood rates to calculate the hypothetical number of
people in the standard population who would have hypertension. For example, if
the rates for neighbourhood X were 20% of 20-44, 30% of 45-64, and 66% of
65+, the total number of people with hypertension in the standard population
would be 2,239,966 + 1,609,760 + 2,092,180 = 5,941,906.
4. Finally, divide this number of people with hypertension by the total size of the
standard population, yielding a final percentage of 5,941,906 / 19,735,665 × 100, or
approximately 30%. This percentage is the age standardised rate for the
neighbourhood.
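The four steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, using the worked example for neighbourhood X. The function age_standardised_rate and its parameter are hypothetical names and are not the function provided in the starter code.

```python
# 1991 Canadian census population for ages 20-44, 45-64 and 65+,
# taken from the table above.
STANDARD_POPULATION = [11_199_830, 5_365_865, 3_169_970]


def age_standardised_rate(group_rates: list[float]) -> float:
    """Return the age standardised rate, as a percentage, given the
    percentage rates for the 20-44, 45-64 and 65+ age groups.
    """
    # Step 3: hypothetical number of cases in the standard population.
    expected_cases = sum(
        rate / 100 * population
        for rate, population in zip(group_rates, STANDARD_POPULATION)
    )
    # Step 4: divide by the size of the standard population.
    return expected_cases / sum(STANDARD_POPULATION) * 100


# Neighbourhood X from the example: 20%, 30% and 66%.
print(age_standardised_rate([20, 30, 66]))
```

The printed value is a little over 30, matching the "approximately 30%" in step 4.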
If you are interested, you can read more about age standardised rates here.
Required Functions
In the starter code file a3.py, follow the Function Design Recipe to complete the
functions described below.
You will need helper functions (i.e., functions you define yourself to be called in other
functions) for some of the required functions, but likely not for all of them. Helper
functions also require complete docstrings with doctests. We strongly recommend you
also follow any suggestions about helper functions in the table below; we give you
these hints to make your programming task easier.
Some indicators that you should consider writing a new helper function, or using
something you've already written as a helper are:
- Rewriting code to solve a task you have already solved in another function
- Getting a warning from the checker that your function is too long
- Getting a warning from the checker that your function has too many nested
  blocks or too many branches
- Realising that your function can be broken down into smaller sub-problems (with
  a helper function for each)
For each of the functions below, other than the file reading functions in Task 1, write
at least two examples in the docstring. You can use the provided SAMPLE_DATA
dictionary, and you should also create another small CityData dictionary for examples
and testing. If your helper function takes an open file as an argument, you do NOT
need to write any examples in that function's docstring. Otherwise, for any helper
functions you add, write at least two examples in the docstring.
Your functions should not mutate their arguments, unless the description says
that is what they do.
Assumptions
- All neighbourhood ids and names are unique, and will appear the same in all data
  files. That is, no neighbourhood will have a different id between files, or a
  different name.
- In all tasks except Task 1, the dictionary argument will have both hypertension
  and low income data for every neighbourhood. That is, it will be a valid CityData
  dictionary.
- All float values should be left as is; do not round any of them.
Using Constants
The starter code contains constants in the file constants.py that you should use in your
solution for the list indices and key identifiers for the CityData dictionary as well as the
column numbers for the input files. You may add other constants if you wish, but DO
NOT place them in the file constants.py: instead put them in the a3.py file.
The functions in Task 1, get_hypertension_data and get_low_income_data, will be used
to build a CityData dictionary. However, the dictionary that is passed to these
functions may not yet contain all of the data.
To illustrate this, we have provided two small data files. After passing the same
dictionary to both functions with each of those small files, the dictionary should be a
CityData dictionary that contains the same information as the provided SAMPLE_DATA
dictionary. Using the small hypertension file and an empty dictionary as arguments to
get_hypertension_data, the result should be that the dictionary now contains the
hypertension data as in SAMPLE_DATA, but not the low income data.
{'West Humber-Clairville':
{'id': 1, 'hypertension': [703, 13291, 3741, 9663, 3959, 5176]},
'Mount Olive-Silverstone-Jamestown':
{'id': 2, 'hypertension': [789, 12906, 3578, 8815, 2927, 3902]},
'Thistletown-Beaumond Heights':
{'id': 3, 'hypertension': [220, 3631, 1047, 2829, 1349, 1767]},
'Rexdale-Kipling':
{'id': 4, 'hypertension': [201, 3669, 1134, 3229, 1393, 1854]},
'Elms-Old Rexdale':
{'id': 5, 'hypertension': [176, 3353, 1040, 2842, 948, 1322]}}
Similarly, using the small low income file and an empty dictionary as arguments to
get_low_income_data, the result should be that the dictionary now contains the low income
data as in SAMPLE_DATA, but not the hypertension data.
{'West Humber-Clairville':
{'id': 1, 'total': 33230, 'low_income': 5950},
'Mount Olive-Silverstone-Jamestown':
{'id': 2, 'total': 32940, 'low_income': 9690},
'Thistletown-Beaumond Heights':
{'id': 3, 'total': 10365, 'low_income': 2005},
'Rexdale-Kipling':
{'id': 4, 'total': 10540, 'low_income': 2140},
'Elms-Old Rexdale':
{'id': 5, 'total': 9460, 'low_income': 2315}}
A dictionary is a complete CityData dictionary only once it has been passed to both functions. See the sample
usage at the end of the starter code file for an example of how both functions are
used to build a CityData dictionary.
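The pattern of filling in one dictionary from an already open file can be sketched as follows. This is a simplified illustration and not the required solution: add_low_income_rows is a hypothetical name, the column numbers and the header line are assumptions (the real indices are in constants.py), and io.StringIO stands in for an open data file.

```python
import io
from typing import TextIO

# Assumed column positions; the real values are defined in constants.py.
LI_ID_COL = 0
LI_NBH_NAME_COL = 1
POP_COL = 2
LI_POP_COL = 3


def add_low_income_rows(data: dict, low_income_file: TextIO) -> None:
    """Mutate data so that it also holds the low income rows in the
    open file low_income_file.
    """
    low_income_file.readline()  # skip the header row
    for line in low_income_file:
        fields = line.strip().split(',')
        # Reuse the neighbourhood's entry if it exists, else create one.
        entry = data.setdefault(fields[LI_NBH_NAME_COL], {})
        entry['id'] = int(fields[LI_ID_COL])
        entry['total'] = int(fields[POP_COL])
        entry['low_income'] = int(fields[LI_POP_COL])


# The dictionary already holds hypertension data for one neighbourhood.
city_data = {'Elms-Old Rexdale':
             {'id': 5, 'hypertension': [176, 3353, 1040, 2842, 948, 1322]}}
# Hypothetical file contents; the header text is made up.
sample_file = io.StringIO('ID,Name,Total,Low income\n'
                          '5,Elms-Old Rexdale,9460,2315\n')
add_low_income_rows(city_data, sample_file)
print(city_data)
```

Using dict.setdefault means the same code works whether the dictionary is empty or already contains the neighbourhood, so the two files can be loaded in either order.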
Note: While this is the first task, it is not necessarily the easiest. If you are stuck
while working on this task, we suggest moving on to other tasks and coming back to
this later.
Recall that TextIO as the parameter type means the file is already open.
Function name: (Parameter types) -> Return type
    Full Description (paraphrase to get a proper docstring description)

get_hypertension_data: (dict, TextIO) -> None
    The first parameter is a dictionary representing hypertension and/or low
    income data for a neighbourhood and the second parameter is a hypertension
    data file that is open for reading. This function should modify the
    dictionary so that it contains the hypertension data in the file.
Functions: Task 1
Functions: Task 2
Functions: Task 3
There are multiple ways to solve this problem. You may choose
to solve this problem by writing your own sorting code, but you
do not have to do this. You can also use list.sort as part of your
solution, if you choose.
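If you do use list.sort, a key function lets you order names by an associated value. The sketch below is a generic illustration with made-up data. The dictionary, the descending order, and the alphabetical tie-break are all assumptions and are not taken from the task description.

```python
# Hypothetical rates keyed by neighbourhood name (made-up values).
rates = {'A': 0.25, 'B': 0.10, 'C': 0.25}

names = list(rates)
# Sort by rate, highest first; break ties alphabetically by name.
names.sort(key=lambda name: (-rates[name], name))
print(names)  # prints ['A', 'C', 'B']
```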
Functions: Task 4
Files to Download
Download a3.zip which contains starter code (a3.py and test_a3.py), the checker
(a3_checker.py together with the helper file checker.py and folder pyta), and two sizes of
each type of data file.
Marking
These are the aspects of your work that will be marked for Assignment 3:
What to Hand In
The very last thing you do before submitting should be to run the checker
program one last time.
Otherwise, you could make a small error in your final changes before submitting that
causes your code to receive zero for correctness.
Submit a3.py and test_a3.py on MarkUs by following the instructions on the course
website. Remember that spelling of filenames, including case, counts: your files must
be named exactly as above.