A Guide to Health Data Science Using Python

2023

Probir Kumar Ghosh

Preface
Probir Kumar Ghosh holds an MSc in Statistics from Rajshahi University and has been working as a
seasoned statistician at the International Centre for Diarrhoeal Disease Research, Bangladesh
(icddr,b) for 12 years. Mr. Ghosh has an extensive background in designing and
analyzing large-scale public health scientific studies. He has authored and co-authored
over 25 publications, in which he has utilized his statistical expertise to provide insightful
analyses of public health data. His expertise in statistics and data analysis has been invaluable
in advancing our understanding of various public health issues.
In "A Guide to Health Data Science Using Python", Mr. Ghosh shares his knowledge and
experience in using Python programming for data analysis in public health research. This guide
is a valuable resource for public health researchers who want to learn how to analyze their data
using Python programming. The guide provides step-by-step instructions for analyzing
different types of public health data, from epidemiological studies to clinical trials, using
Python. It also includes numerous examples and case studies that illustrate the practical
application of Python programming in public health research.
Mr. Ghosh's extensive experience in data analysis and his expertise in using Python
programming make this guide an essential tool for anyone involved in public health research.
I highly recommend this guide to public health researchers who want to learn how to use Python
for data analysis and to anyone who wants to gain a deeper understanding of public health
research.

I hope that this guide will be helpful to you.

Probir Kumar Ghosh, MSc.


Statistician,
International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b)


Table of Contents
Introduction ......................................................................................................................................4
Chapter 1 ..........................................................................................................................................1
Python built-in Data types .............................................................................................................1
1.1 Data types ......................................................................................................................1
1.2 Type conversion.............................................................................................................2
1.3 Python Strings................................................................................................................4
1.3.1 Assign String to a Variable ........................................................................................4
1.3.2 Strings are Arrays ......................................................................4
1.3.3 String Length.............................................................................................................4
1.3.4 Check String ..............................................................................................................5
1.3.5 Slicing ........................................................................................................................5
1.3.6 Modify Strings ...........................................................................................................5
1.3.7 Remove Whitespace ...................................................................................................6
1.3.8 Replace String ...........................................................................................................6
1.3.9 Split String .................................................................................................................6
1.3.10 String Concatenation ................................................................................................6
1.3.11 String Format ............................................................................................................6
1.4 Python list ......................................................................................................................7
1.4.1 List Items indexing .....................................................................................................7
1.4.2 List Length ..................................................................................................................8
1.4.3 Range of Indexes .......................................................................................................8
1.4.4 Change Item Value ....................................................................................................8
1.4.5 Change a Range of Item Values ................................................................................8
1.4.6 Insert Items ................................................................................................................9
1.4.7 Append Items.............................................................................................................9
1.4.8 Extend List ..................................................................................................................9
1.4.9 Remove Specified Item ..............................................................................................9
1.4.10 Remove Specified Index ..........................................................................................10
1.4.11 Clear the List ............................................................................................................10
1.4.12 Loop Through the Index Numbers...........................................................................10
1.4.13 Sort List Alphanumerically ......................................................................................11
1.4.14 Reverse Order ..........................................................................................................11

1.4.15 Join Two Lists .........................................................................................................12


1.5 Python Tuples ..............................................................................................................12
1.5.1 Tuple Length ............................................................................................................12
1.5.2 Access Tuple Items...................................................................................................12
1.5.3 Change Tuple Values ...............................................................................................12
1.5.4 Add Items .................................................................................................................13
1.5.5 Loop Through the Index Numbers...........................................................................14
1.5.6 Join Two Tuples .......................................................................................................14
1.6 Dictionary ....................................................................................................................14
1.6.1 Dictionary Length ....................................................................................................15
1.6.2 Accessing Items .......................................................................................................15
1.6.3 Get Keys ...................................................................................................................15
1.6.4 Get Values ................................................................................................................15
1.6.5 Removing Items .......................................................................................................16
1.6.6 Nested Dictionaries .................................................................................................16
1.6.7 Access Items in Nested Dictionaries........................................................................17
Chapter 2 ........................................................................................................................................18
Python DataFrame .......................................................................................................................18
2.1 Pandas package ..........................................................................................................18
2.2 Installation of Pandas package .................................................................................18
2.3 Import Pandas package and create dataframe .......................................................19
2.4 Locate Row .................................................................................................................19
2.5 Named Indexes ...........................................................................................................20
2.6 Locate Named Indexes ...............................................................................................21
2.7 Import Files into a DataFrame .................................................................................21
2.8 Export Files Into a DataFrame .................................................................................21
2.9 Import a location shape file using geopandas ..........................................................22
Chapter 3 ........................................................................................................................................23
Data Management .......................................................................................................................23
3.1 Remove Rows ..............................................................................................................24
3.2 Replace Empty Values .................................................................................................24
3.2.1 Replace Only for Specified Columns........................................................................25
3.3 Date of Wrong Format ................................................................................................27


3.4 Removing Rows using condition .................................................................................28


3.5 Discovering Duplicates ................................................................................................29
3.6 Remove Duplicates ......................................................................................................29
3.7 Discovering duplicates in a column and remove duplicates ......................................30
3.8 Discovering duplicates in a column and remove first duplicates ..............................31
3.9 Merging DataFrames ...................................................................................................31
3.10 Append Datasets .........................................................................................................33
Chapter 4 .........................................................................................................................................34
Descriptive statistics ....................................................................................................................34
4.10 Tabulate ......................................................................................................................34
4.1.1 One way Tabulate ..................................................................................................34
4.1.2 Cross-Tabulate .......................................................................................................35
4.2 Cross tabulation heatmap .........................................................................................38
4.3 Descriptive statistics for continuous variable ............................................................39
Chapter 5 ........................................................................................................................................41
Data visualization .........................................................................................................................41
Introduction .................................................................................................................................41
5.1 Pie charts.....................................................................................................................41
5.2 Bar charts ........................................................................................................................43
5.3 Stacked bar ......................................................................................................................47
5.4 Bar heatmap ....................................................................................................................49
5.5 Waffle charts ....................................................................................................................50
5.6 Histogram .........................................................................................................................54
5.7 Boxplot .............................................................................................................................59
5.8 Scatter plot ......................................................................................................................61
5.10 Swarm plot......................................................................................................................66
5.11 Timeseries plot ..............................................................................................................69
5.13 Survival curve ................................................................................................................72
5.14 Sankey plot ....................................................................................................................76
5.15 Forest Plot ......................................................................................................................78
5.16 Violin Plot.......................................................................................................................82
Chapter 6 ........................................................................................................................................87
Spatial data Mapping ...................................................................................................................87


6.1 Create Study site map .....................................................................................................87


6.2 Select Local areas in Bangladesh ...................................................................................89
6.3 Distance Calculation .......................................................................................................91
6.4 Migration Mapping with distance .................................................................................93
6.5 Highlighting study districts ............................................................................................95
6.7 Create Heatmap (Choropleth map)...............................................................................97
6.8 World mapping with COVID- 19 cases........................................................................99
6.9 Generate Random coordinate points ..........................................................................101
Chapter 7 .......................................................................................................................................105
Measures of association: Crude analysis ...................................................................................105
6.1 Mean estimate...............................................................................................................105
6.2 Two Means Comparison ..............................................................................................106
6.2.1 Independent two means comparison ...................................................................106
6.2.2 Paired Two means comparison .............................................................................107
6.3 Correlation coefficient estimate ....................................................................................108
6.3.1 Pearson’s correlation coefficient ..........................................................................108
6.3.2 Spearman’s correlation coefficient.......................................................................109
6.4 Prevalence estimate.......................................................................................................109
6.5 Chi Square test ...............................................................................................................111
6.6 Risk difference estimate ................................................................................................112
6.7 Risk ratio estimate .........................................................................................................113
6.8 Odds ratio estimate .......................................................................................................114
6.9 Incidence Rate Ratio ......................................................................................................115
6.10 Diagnostic Test ...............................................................................................................116
Chapter 8 .......................................................................................................................................118
Regression analysis for adjusting variables and clustering effect .............................................118
8.2 Multiple linear regression .......................................................................................120
8.4 Multiple linear regression with categorical exposure ...........................................122
8.5 Poisson regression ....................................................................................................124
8.6 Logistic regression....................................................................................................127
8.7 Conditional logistic regression ................................................................................128
8.8 GEE model ................................................................................................................130
8.9 Mixed effect model ...................................................................................................132


Introduction
“A Guide to Health Data Science Using Python" is an essential guide for public health
researchers and students who want to learn how to use Python programming to analyze
public health data. This guide is a comprehensive guide to using Python programming for
data management, computation, descriptive statistics, visualization, spatial analysis,
regression analysis and modeling for identifying risk factors.
The guide is divided into six parts, each focusing on a specific aspect of Python
programming in public health data analysis. The first part covers data management,
including techniques for importing, cleaning, and organizing large datasets. The second part
focuses on computation, introducing readers to important Python libraries such as NumPy,
Pandas, Matplotlib, seaborn, scipy, statsmodels, and geopandas, and providing practical
examples of how these libraries can be used for importing data, exporting it into other
formats, and manipulating it.
The third part of the guide delves into descriptive statistics, providing a thorough
introduction to statistical analysis using Python programming. The fourth part covers data
visualization, offering insights into best practices for presenting data in an understandable
and effective manner.
The fifth part of the guide focuses on geospatial analysis, providing a comprehensive
introduction to geospatial data analysis using Python programming. Finally, the sixth part
focuses on using analytical modelling for identifying risk factors for public health data,
covering important concepts such as regression analysis, machine learning and statistical
modelling.
What makes this guide particularly unique is that it is written for public health researchers
and students with no prior experience in Python programming. Python is an object-oriented
language that uses objects to represent data and methods to manipulate that data. This
approach supports code organization, reusability, and maintainability. The guide takes
a step-by-step approach, providing clear explanations and examples that allow readers to
build their understanding of Python programming from the ground up. One of the guide's
unique features is its use of attractive and informative graphs such as Waffle, Sankey and
Swarm plots to help readers understand complex public health data. These graphs are
carefully designed to provide clear and concise visual representations of data that are easy
to interpret and analyze. This approach is particularly useful for health data science, where
large amounts of data can be difficult to interpret without the assistance of graphical
representations. Another unique feature of the guide is its focus on geospatial mapping. By
using geographic information systems (GIS) and other spatial data analysis tools, the guide
shows readers how to analyze and interpret health data in the context of geography. This is
an important skill for public health researchers, as it enables them to identify and visualize
spatial patterns in health data, which can help to inform public health interventions and
policies. Additionally, the guide's focus on statistical modeling is also a unique feature. The
book covers a range of statistical models commonly used in health data science, including
logistic regression, Poisson regression, survival analysis, cox proportional hazard, and
hierarchical models. By providing a detailed introduction to these models, the guide
prepares readers with the skills and knowledge needed to analyze health data using statistical
techniques.
In "A guide to health data science using python", the example data is simulated data which
is used to provide examples and test python codes. These data reflect the different real-
world scenarios. While example data may not reflect all the nuances of real-world data, it
can provide valuable insights and help researchers to identify potential issues before
deploying their models on real data.
Overall, "A Guide to Health Data Science Using Python" is an invaluable resource for public
health researchers and students who want to learn how to use Python programming to
manage, represent, analyze and interpret large datasets. The guide provides clear and
concise explanations of important concepts and techniques, accompanied by practical
examples and code snippets that readers can use to deepen their understanding and apply
the skills learned to real-world public health scenarios.


Chapter 1
Python built-in Data types
Introduction
In Python, there are several built-in data types, including integers, floating-point numbers,
strings, lists, tuples, sets, and dictionaries. Each data type has its own unique characteristics
and functions, which make them suitable for different data science tasks. Integers and
floating-point numbers are used for numerical computations, while strings are used to
represent text data. Lists, tuples, and sets store collections of data, and
dictionaries store related key-value pairs. Python is dynamically typed, so users can create
variables and objects without explicitly declaring their data types, which makes it easy to
compute, manage, and analyze data without worrying about type declarations. Understanding the
different data types is crucial for data science using Python. This chapter covers common
data types and best practices in Python data science.

1.1 Data types


In Python, data types are one of the most important features for data analysts. Python is
dynamically typed: the type of a variable is determined by the value assigned to it, and the
same variable can hold values of different types over time. Python has the following default
built-in data types, in these categories:

Name of types Python types Setting the specific data type

Text str str(x)
Numeric int, float, complex int(x), float(x), complex(x)
Sequence list, tuple, range list(('x1', 'x2', 'x3'))
tuple(('x1', 'x2', 'x3'))
range(x)
Mapping dict dict(name='John', age=40)
Boolean bool bool(x)
Binary bytes bytes(5)
None NoneType None


1.1.1 Setting the data types

In Python, the data type is set when you assign a value to a variable (x), as the examples
below show.

Data types Example


str x="Hello world"
int x=20
float x=20.0
complex x=20j
list x=['apple', 'orange', 'banana', 'cherry']
tuple x=('apple', 'orange', 'banana', 'cherry')
range x=range(5)
dict x={"Name": "John", "age": 40}
set x={'apple', 'orange', 'banana', 'cherry'}
bool x=True
NoneType x=None

You can get the data type of any object by using the type() function.
x=5
print(type(x))

Return will be : <class 'int'>

1.2 Type conversion


Data analysts sometimes need type conversion when computing, managing, and analyzing
data. Converting to the integer type drops the decimal part, while the floating-point type
keeps the decimal points. Python converts from one type to another using the constructor
function of the target type, for example int(), float(), and complex().
x = 1 # int
y = 2.8 # float
z = 1j # complex


• Convert from int to float:


x=5
a = float(x)
print(a)
Return "a" will be 5.0

• Convert from float to int:


y=2.8
b = int(y)
print(b)
Return “b” will be 2

• Convert from int to complex:


z=1
c=complex(z)
print(c)
Return “c” will be (1+0j)

• Convert from string to list:


d="Hello"
print(list(d))

Return will be : ['H', 'e', 'l', 'l', 'o']

• Convert from list to tuple:


f=['H', 'e', 'l', 'l', 'o']
print(tuple(f))
Return will be : ('H', 'e', 'l', 'l', 'o')

• Convert from tuple to list:


f=('H', 'e', 'l', 'l', 'o')
print(list(f))
Return will be : ['H', 'e', 'l', 'l', 'o']


1.3 Python Strings


A string contains alphabetic characters, numbers, or symbols. In Python, a string is
surrounded by single or double quotation marks.
A ="Hello"
print(type(A))
Return will be : <class 'str'>

1.3.1 Assign String to a Variable

A string variable is assigned with the variable name followed by an equal sign and a string:

A ="Hello"
print(A)
Return will be: Hello

1.3.2 Strings are Arrays

Like in many other programming languages, Python strings behave like arrays of characters.
Python does not have a separate character type; a single character is simply a string with a
length of 1. Square brackets can be used to access elements of a string, and the first
character has position 0.

A='Hello, World'
print(A)
print(A[1])
Return will be:
Hello, World
e

1.3.3 String Length

The string length is the total number of characters it contains. To get the length of a string,
use the len() function.

A='Hello, World'
l=len(A)
print(l)
Return will be: 12


1.3.4 Check String


You can use the in keyword to check whether a specific phrase or character is present in a string.
Check if "animal" is present in the following text:
txt = "The elephant is the largest animal in the world!"
print("animal" in txt)
Return will be: True

1.3.5 Slicing

You can return a range of characters by using slicing. You must specify the start and end
indexes, separated by a colon, to produce a part of the string. Get the characters from
position 4 to position 12 (the character at end index 12 is not included):

txt = "The elephant is the largest animal in the world!"

print(txt[4:12])
Return will be: elephant

You can also use negative indexes to slice from the end of the 'txt' string:

txt = "The elephant is the largest animal in the world!"

print(txt[-6:-1])
Return will be: world

1.3.6 Modify Strings


Upper Case

The upper() method returns the string in upper case:

txt = " The elephant is the largest animal in the world!"


print(txt.upper())
Return will be: THE ELEPHANT IS THE LARGEST ANIMAL IN THE WORLD!

Lower Case

The lower() method returns the string in lower case:

print(txt.lower())
Return will be : the elephant is the largest animal in the world!


1.3.7 Remove Whitespace


The strip() method removes whitespace from the beginning and the end of the text:
txt1 = " The elephant is the largest animal in the world! "
print(txt1.strip())
Return will be : The elephant is the largest animal in the world!

1.3.8 Replace String

The replace() method replaces a string with another string:

txt1 = " The elephant is the largest animal in the world!"


print(txt.replace('animal', 'mammal'))
Return will be: The elephant is the largest mammal in the world!

1.3.9 Split String

The split() method splits the text at the specified separator and returns the pieces as
list items.

txt1 = "The elephant is the largest animal in the world!"

print(txt1.split(" "))
Return will be: ['The', 'elephant', 'is', 'the', 'largest', 'animal', 'in', 'the', 'world!']

1.3.10 String Concatenation


To concatenate, or combine, two strings you can use the + operator. Merge variable a with
variable b into variable txt:

a="The elephant is the largest animal"

b="in the world!"
txt=a+" "+b
print(txt)
Return will be: The elephant is the largest animal in the world!

1.3.11 String Format

You can combine strings and numbers like this:

age=36
name="Mr. John"


state="My name is "+name+ ", "+"I am " + str(age)


print(state)

Return will be: My name is Mr. John, I am 36

The format() method takes the passed arguments and places them into the string where the
{} placeholders are:

state="My name is {}, I am {}"

print(state.format(name, age))

Return will be: My name is Mr. John, I am 36

You can use index numbers inside the {} to place the arguments in the right placeholders:

state="My name is {1}, I am {0}"

print(state.format(age, name))

Return will be: My name is Mr. John, I am 36
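Newer Python versions (3.6 and later) also provide f-strings for combining strings and values. The following is a brief illustrative sketch, not part of the original example set:

age=36
name="Mr. John"
# An f-string embeds the variables directly inside the string literal
state=f"My name is {name}, I am {age}"
print(state)

Return will be: My name is Mr. John, I am 36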

1.4 Python list


Python lists are used to store values in a variable. A list can store multiple items in a single
variable, and square brackets are used to keep all items in a list.

Create a List:

mystring="mango", "banana", "apple"


mylist=list(mystring)
print(mylist)

Return will be: ['mango', 'banana', 'apple']

1.4.1 List Items indexing

List items are ordered, changeable, and allow duplicate values. List items are indexed.

print(mylist[1])
Return will be: banana


1.4.2 List Length

You can determine how many items a list has by using the len() function:

print(len(mylist))
Return will be: 3

1.4.3 Range of Indexes

You can specify a range of indexes by determining where to start and where to end the
range. When defining a range, the return value will be a new list with the specified items.

print(mylist[0:2])

Return will be: ['mango', 'banana']

1.4.4 Change Item Value

You can use index number to change the value of a specific item.

mylist[1]="blackcurrant"
print(mylist)
Return will be:
['mango', 'blackcurrant', 'apple']

1.4.5 Change a Range of Item Values

Change the values "banana" and "apple" with the value’s "blackcurrant" and "watermelon":

mylist[1:3]=["blackcurrant","watermelon"]
print(mylist)
Return will be: [mango, 'blackcurrant', 'watermelon']

If you insert more items than you replace, the new items will be inserted where you
specified, and the remaining items will move accordingly:

mylist=["mango", "banana", "apple"]
mylist[1:2]=["blackcurrant","watermelon"]
print(mylist)
Return will be: ['mango', 'blackcurrant', 'watermelon', 'apple']

Change the second and third values by replacing them with a single value:

mylist=["mango", "banana", "apple"]
mylist[1:3]=["blackcurrant"]
print(mylist)
Return will be: ['mango', 'blackcurrant']

1.4.6 Insert Items


You can use insert() to insert a new list item without replacing any of the existing values.
The insert() method inserts an item at the specified index. Insert "watermelon" as the third
item:
mylist=['mango', 'banana', 'apple']
mylist.insert(2,'watermelon')
print(mylist)
Return will be: ['mango', 'banana', 'watermelon', 'apple']

1.4.7 Append Items

To add an item to the end of the list, use the append() method:

mylist=["mango", "banana", " apple"]


mylist.append("cherry")
print(mylist)
Return will be: ['mango', 'banana', ' apple', 'cherry']

1.4.8 Extend List


To append elements from another list to the current list, use the extend() method.
Add the elements of thatlist to mylist:

mylist=["apple", "banana", "cherry"]


thatlist = ["mango", "pineapple", "papaya"]
mylist.extend(thatlist)
print(mylist)
Return will be: ['apple', 'banana', 'cherry', 'mango', 'pineapple', 'papaya']

1.4.9 Remove Specified Item

The remove() method removes the specified item.


mylist=["apple", "banana", "cherry"]


mylist.remove("banana")
print(mylist)
Return will be: ["apple", "cherry"]

1.4.10 Remove Specified Index

The pop() method removes the specified index.

mylist=["apple", "banana", "cherry"]


mylist.pop(1)
print(mylist)
Return will be: ["apple", "cherry"]

The del keyword also removes the item at a specified index. Remove the first item:

mylist=["apple", "banana", "cherry"]

del mylist[0]
print(mylist)
Return will be: ['banana', 'cherry']

1.4.11 Clear the List

The clear() method empties the list. The list still remains, but it has no content.

mylist=["apple", "banana", "cherry"]


mylist.clear()
print(mylist)
Return will be: []

1.4.12 Loop Through the Index Numbers

You can also loop through the list items by referring to their index number. Print all items
by referring to their index number:

mylist=["apple", "banana", "cherry"]


for i in range(len(mylist)):
print(mylist[i])


Return will be:

apple
banana
cherry

A short hand for loop that will print all items in a list:

[print(x) for x in mylist]


Return will be:

apple
banana
cherry

With list comprehension you can do all that with only one line of code:

newlist=[x for x in mylist]

print(newlist)
Return will be: ['apple', 'banana', 'cherry']
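A list comprehension can also filter items with a condition. The following is a small illustrative sketch:

mylist=["apple", "banana", "cherry"]
# Keep only the items that contain the letter "a"
newlist=[x for x in mylist if "a" in x]
print(newlist)

Return will be: ['apple', 'banana']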

1.4.13 Sort List Alphanumerically

To sort list items in ascending order, use the sort() method. Sort the list alphabetically:

mylist.sort(reverse=False)
print(mylist)
Return will be : ['apple', 'banana', 'cherry']

Sort Descending: To sort in descending order, use the keyword argument reverse=True:

mylist.sort(reverse=True)
print(mylist)
Return will be: ['cherry', 'banana', 'apple']

1.4.14 Reverse Order

The reverse() method reverses the current sorting order of the elements.

mylist=["apple", "banana", "cherry", "backcurrage"]


mylist.reverse()
print(mylist)
Return will be: ['backcurrage', 'cherry', 'banana', 'apple']


1.4.15 Join Two Lists

There are several ways to join, or concatenate, two or more lists in Python.
One of the easiest ways is by using the + operator.

mylist=["apple", "banana", "cherry", "blackcurrant"]

thislist=["mango","strawberry"]
nowlist=mylist+thislist
print(nowlist)
Return will be: ['apple', 'banana', 'cherry', 'blackcurrant', 'mango', 'strawberry']

1.5 Python Tuples

A tuple is a collection which contains multiple items. It is a built-in data type in
Python that is created with round brackets.

mytupe=("apple", "banana", "cherry")


print(mytupe)
Return will be: ('apple', 'banana', 'cherry')

1.5.1 Tuple Length

You can use the len() function to print the number of items in the tuple:

print(len(mytupe))
Return will be: 3

1.5.2 Access Tuple Items

You can access tuple items by referring to the index number, inside square brackets.
print(mytupe[1])
Return will be: banana

1.5.3 Change Tuple Values

Once a tuple is created, you cannot change its values. To change tuple values, you must
convert the tuple into a list, change the list, and convert the list back into a tuple. Convert
the tuple into a list to be able to change it:


x=list(mytupe)
x[1]="kiwi"
mytupe=tuple(x)
print(mytupe)
Return will be: ("apple", “kiwi”, "cherry")

1.5.4 Add Items


Since tuples are immutable, they do not have a built-in append() method, but there are
other ways to add items to a tuple.
Convert into a list: Just like the workaround for changing a tuple, you can convert it into
a list, add your item(s), and convert it back into a tuple.

Convert the tuple into a list, add "kiwi", and convert it back into a tuple:

mytupe=("apple", "banana", "cherry")

y=list(mytupe)
y.append("kiwi")
mytupe=tuple(y)
print(mytupe)
Return will be: ('apple', 'banana', 'cherry', 'kiwi')

Remove Items

Note: You cannot remove items in a tuple. Tuples are unchangeable, so you cannot
remove items from it, but you can use the same workaround as we used for changing and
adding tuple items:

y.remove("kiwi")
mytupe=tuple(y)
print(mytupe)
Return will be: ('apple', 'banana', 'cherry')

mytupe=("apple", "banana", "cherry")


thislist=["mango","strawberry"]
a=list(mytupe)
b=list(thislist)
c=a+b


d=tuple(c)
print(d)
Return will be: ('apple', 'banana', 'cherry', 'mango', 'strawberry')
1.5.5 Loop Through the Index Numbers

You can also loop through the tuple items by referring to their index number. Use the
range() and len() functions to create a suitable iterable. Print all items by referring to their
index number:

for i in range(len(mytupe)):
print(mytupe[i])
Return will be:

apple
banana
cherry

1.5.6 Join Two Tuples

To join two or more tuples you can use the + operator: Join two tuples:

tuple1 = ("a", "b", "c")


tuple2 = (1, 2, 3)
tuple3 = tuple1 + tuple2
print(tuple3)

Return will be: ('a', 'b', 'c', 1, 2, 3)

1.6 Dictionary
Dictionaries contain data values in key:value pairs. A dictionary is a collection which
is ordered (as of Python 3.7), changeable, and does not allow duplicate keys. Dictionaries
are written with curly brackets and have keys and values.
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964

}
print(car)
Return will be: {'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
1.6.1 Dictionary Length

You use the len() function to determine how many items a dictionary has.

print(len(car))
Return will be: 3
1.6.2 Accessing Items

You can access the items of a dictionary by referring to its key name, inside square
brackets:

x=car["model"]
print(x)
Return will be: Mustang
1.6.3 Get Keys

The keys() method will return a list of all the keys in the dictionary.

key=car.keys()
print(key)
Return will be: dict_keys(['brand', 'model', 'year'])

1.6.4 Get Values

The values() method will return a list of all the values in the dictionary.

val=car.values()
print(val)
Return will be: dict_values(['Ford', 'Mustang', 1964])

Make a change in the original dictionary, and see that the values list gets updated as well:

car["year"]=2020
print(car)
before
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}

after
car = {
"brand": "Ford",
"model": "Mustang",
"year": 2020
}
Add a new item to the original dictionary, and see that the values list gets updated as
well:
car["color"]="red"
print(car)
Before
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
After
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964,
"color": “red”
}

1.6.5 Removing Items

There are several methods to remove items from a dictionary:

car.pop("model")
print(car)
car = {
"brand": "Ford",
"year": 1964,
"color": “red”
}
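Other common ways to remove items are shown in the brief sketch below, using a fresh car dictionary for illustration:

car = {"brand": "Ford", "model": "Mustang", "year": 1964}
del car["model"]   # remove the item with the key "model"
car.popitem()      # remove the last inserted item (here "year")
car.clear()        # empty the dictionary
print(car)

Return will be: {}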
1.6.6 Nested Dictionaries

A dictionary can contain other dictionaries; this is called a nested dictionary. Create a
dictionary that contains three dictionaries:


myfamily = {
"child1" : {
"name" : "Emil",
"year" : 2004
},
"child2" : {
"name" : "Tobias",
"year" : 2007
},
"child3" : {
"name" : "Linus",
"year" : 2011
}
}

1.6.7 Access Items in Nested Dictionaries

To access items from a nested dictionary, you use the name of the dictionaries, starting
with the outer dictionary:

myfamily = {
"child1" : {
"name" : "Emil",
"year" : 2004
},
"child2" : {
"name" : "Tobias",
"year" : 2007
},
"child3" : {
"name" : "Linus",
"year" : 2011
}
}
print(myfamily["child2"]["name"])
Return will be: Tobias


Chapter 2
Python DataFrame
Introduction
Dataframes are a crucial tool for any data scientist or analyst working with tabular data. A
dataframe is a two-dimensional data structure that is used to keep data in rows and columns.
Dataframes are highly flexible and can be used for a wide range of data analysis tasks, including
cleaning, filtering, aggregating, and visualizing data. The Python dataframes are typically
created by using the pandas library, which provides a tool for working with tabular data. With
pandas, data analysts can easily read data from a variety of file formats, such as CSV, Excel,
SQL databases, SPSS, STATA, and SAS databases, and compute and manipulate the data using
a wide range of functions and methods. Data analysts can also easily export data to a variety of file
formats. This chapter provides an overview of the basics of working with Pandas in Python,
including how to create and manipulate dataframes, perform common data analysis tasks, and
visualize data, while GeoPandas dataframes provide the basics of working with spatial data.
This chapter will help you take your data analysis skills to the next level.

2.1 Pandas package

Pandas is a Python library used for working with tabular dataframes. It has functions for
analyzing, cleaning, exploring, and manipulating data. Pandas allows us to analyze large datasets and
draw conclusions based on statistical theory. It can be used to clean messy data and make
it readable and relevant.

2.2 Installation of Pandas package

If you have Python and pip already installed on a machine, then installation of Pandas is very
easy. An alternative way to install Pandas is within a PyCharm project. Given a PyCharm project, you can
easily install the pandas library in the project's virtual environment step by step.

• Open File > Settings > Project from the PyCharm menu.
• Select your current project.
• Click the Python Interpreter tab within your project tab.
• Click the small + symbol to add a new library to the project.


• Now type in the library to be installed, in our example Pandas, and click Install
Package.
• Wait for the installation to terminate and close all popup windows.

Note: You can install all the other libraries used in this guide by following the same steps.
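For reference, installing with pip from a terminal or command prompt typically looks like the minimal sketch below; the exact command may vary by system:

pip install pandas
# or, if several Python versions are installed:
python -m pip install pandas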

2.3 Import Pandas package and create dataframe

Once Pandas is installed, import it in your applications by adding the import keyword:

# import packages
import pandas as pd
# Create dataframe
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
mydf=pd.DataFrame(mydataset)
# Print dataframe
print(mydf)
Return will be:

cars passings
0 BMW 3
1 Volvo 7
2 Ford 2

2.4 Locate Row

Pandas uses the loc attribute to return one or more specified row(s):

# import packages
import pandas as pd
# Create dataframe
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]


}
df=pd.DataFrame(data)
# Print first row
print(df.loc[0])
Return will be :
calories 420
duration 50
Name: 0, dtype: int64

Return row 0 and 1:

# Print first and second rows

print(df.loc[0:1])

or
print(df.loc[[0,1]])

Return will be :
calories duration
0 420 50
1 380 40

2.5 Named Indexes

You can easily add row names by using the index argument of the pandas DataFrame constructor.

# import packages
import pandas as pd
# Create dataframe
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df=pd.DataFrame(data, index=["Month1","Month2","Month3"])
# Print dataframe
print(df)
Return will be:

calories duration
Month1 420 50
Month2 380 40
Month3 390 45

2.6 Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

print(df.loc["Month3"])
Return will be:
calories 390
duration 45

2.7 Import Files into a DataFrame

Pandas can easily load datasets in many file formats into a DataFrame. For example, load a
comma separated file (CSV file) into a DataFrame after importing pandas:

import pandas as pd

df = pd.read_csv('~location path/data.csv')
Alternatively, you can first change the current working directory and then import the CSV file:

import os
os.chdir('~Python notes')
df = pd.read_csv('data.csv')

Data frame Format Pandas import command

CSV (.csv) pd.read_csv("file name.csv")
Excel (.xlsx) pd.read_excel("file name.xlsx")
STATA (.dta) pd.read_stata("file name.dta")
SPSS (.sav) pd.read_spss("file name.sav")
SAS (.sas7bdat) pd.read_sas("file name.sas7bdat")
JSON (.json) pd.read_json("file name.json")

2.8 Export a DataFrame into Files

df.to_csv('data.csv') ## automatically saves this file into the current directory.

Data frame Format Pandas export command

CSV (.csv) df.to_csv("file name.csv")
Excel (.xlsx) df.to_excel("file name.xlsx")
STATA (.dta) df.to_stata("file name.dta")
JSON (.json) df.to_json("file name.json")

Note: Pandas has no built-in to_spss() or to_sas() writer; exporting to SPSS or SAS formats requires a third-party package (for example, pyreadstat).
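For example, the minimal sketch below exports a small DataFrame to CSV and Excel; the file names are illustrative, and to_excel() requires the openpyxl package to be installed:

import pandas as pd
df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})
df.to_csv("data_out.csv", index=False)     # index=False drops the row index column
df.to_excel("data_out.xlsx", index=False)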

Viewing the Data


One of the most used method for getting a quick overview of the DataFrame, is the head()
method. The head() method returns the headers and a specified number of rows, starting from
the top. For instance, you get a quick overview by printing the first 10 rows of the DataFrame:

print(df.head(10))
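A few related overview methods are also worth knowing; a brief sketch:

print(df.tail(5))      # the last 5 rows
df.info()              # column names, non-null counts, and data types
print(df.describe())   # summary statistics for the numeric columns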

2.9 Import a location shape file using geopandas

GeoPandas can easily load a geospatial shapefile into a GeoDataFrame. For example, load the
Bangladesh geospatial shapefile into a GeoDataFrame after importing geopandas:

# Import packages
import geopandas as gpd
import os
# Change directory
os.chdir('~Python note')
# Load Bangladesh shape file
Bdg= gpd.read_file('~\bgd_admbnda_adm0_bbs_20201113.shp')
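Once loaded, the GeoDataFrame behaves like a regular DataFrame with an extra geometry column. The following brief sketch inspects and plots it, assuming matplotlib is installed:

# Inspect the attribute table and draw a quick boundary map
import matplotlib.pyplot as plt
print(Bdg.head())
Bdg.plot()
plt.show()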


Chapter 3
Data Management
Introduction
Data management is a crucial component of health data science that involves organizing,
cleaning, and imputing data. The quality of health data management can have a significant
impact on the accuracy and reliability of public health findings. Messy data can lead to errors
and biases in the analysis, which can undermine the validity of the study results. This process in
public health data science involves several key steps. The first step is the collection of data,
which may involve the use of various instruments such as surveys, medical records, and
laboratory tests. Once the data are collected, they must be carefully checked to ensure validity and
completeness. This process may involve identifying and correcting errors and discrepancies in the
data, as well as identifying missing or incomplete data. Data cleaning is another important step
in data management, which involves identifying and correcting errors, inconsistencies, and
missing data. The next step in data management is the organization and storage of the data. This
typically involves creating a database or spreadsheet that includes all of the collected data in a
structured and standardized format. The data must be properly labeled and formatted to facilitate
analysis and minimize errors. This may involve using statistical methods to identify outliers or
other data points that are not consistent with the overall pattern of the data.
Finally, data management means fixing incorrect data and preparing datasets for
analysis. This chapter covers:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
• Merging
• Appending


3.1 Remove Rows


One way to deal with empty cells is to remove the rows that contain them. If you want to
remove rows from a DataFrame that contain at least one empty cell and change the original
DataFrame, use the inplace=True argument:

# import packages
import pandas as pd
import numpy as np
# Create dataframe
data = {
"calories": [420, 380, 390,400],
"duration": [50, 40, 45,np.nan]
}
df=pd.DataFrame(data, index=["Month1","Month2","Month3","Month4"])
# Drop rows with NaN values
df.dropna(inplace=True)
# Print dataframe
print(df.head())
Return will be:
calories duration
Month1 420 50
Month2 380 40
Month3 390 45

3.2 Replace Empty Values


The fillna() method allows us to replace empty cells with a value. This way of dealing with
empty cells inserts a new value instead of removing an entire row just because of empty cells.
Replace NULL values with the number 130:
# import packages
import pandas as pd
import numpy as np
# Create dataframe
data = {
"calories": [420, 380, 390,400],
"duration": [50, 40, 45,np.nan]
}
df=pd.DataFrame(data,
index=["Month1","Month2","Month3","Month4"])
# Change NaN values to 130
df.fillna(130,inplace=True)
# Print dataframe
print(df)
Return will be :
calories duration
Month1 420 50.0
Month2 380 40.0
Month3 390 45.0
Month4 400 130.0

3.2.1 Replace Only for Specified Columns

You can specify a column name to replace empty values only in that column.
Replace NULL values in the "duration" column with the number 130:
# import packages
import pandas as pd
import numpy as np
# Create dataframe
data = {
"calories": [420, 380, 390,400],
"duration": [50, 40, 45,np.nan]
}
df=pd.DataFrame(data,
index=["Month1","Month2","Month3","Month4"])
# Change NaN values to 130
df["duration"].fillna(130,inplace=True)
# Print dataframe
print(df)
Return will be:
calories duration
Month1 420 50.0
Month2 380 40.0
Month3 390 45.0


Month4 400 130.0

3.2.2 Imputation in csv data format

You can easily impute a value for all the empty cells in a column by using the fillna() method.
Replace NULL values in the "hdlc" column with the number 50.0:

# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('data.csv')
# Change NaN to 50 in specific column
df["hdlc"].fillna(50.0,inplace=True)
# Print dataframe
print(df)

randid death angina hospmi ... prevstrk prevhyp hdlc ldlc


0 2448 0 0 1 ... 0.0 0.0 31.0 178.0
1 6238 0 0 0 ... 0.0 0.0 54.0 141.0
2 9428 0 0 0 ... NaN NaN 50.0 NaN
3 10552 1 0 0 ... NaN NaN 50.0 NaN
4 11252 0 0 0 ... 0.0 1.0 50.0 NaN

3.2.3 Replace Using Mean, Median, or Mode

A common way to replace empty cells is to calculate the mean, median or mode value of the
column. Pandas uses the mean(), median(), and mode() methods to calculate the respective values
for a specified column. Calculate the MEAN, and replace any empty values with it:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')

# Load data
df=pd.read_csv('data.csv')
# Create mean
mean_hdlc=df["hdlc"].mean()
# Change NaN to mean value
df["hdlc"].fillna(mean_hdlc,inplace=True)
# Print dataframe
print(df["hdlc"].head(5))
Before mean imputation
0 31.0
1 54.0
2 NaN
3 NaN
4 NaN
After mean imputation
0 31.000000
1 54.000000
2 49.364718
3 49.364718
4 49.364718
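The median or mode can be used in the same way. A brief sketch on the same column:

# Median imputation
median_hdlc=df["hdlc"].median()
df["hdlc"].fillna(median_hdlc,inplace=True)
# Mode imputation: mode() returns a Series, so take the first value
mode_hdlc=df["hdlc"].mode()[0]
df["hdlc"].fillna(mode_hdlc,inplace=True)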

3.3 Date of Wrong Format

Cells with data of wrong format can make it difficult, or even impossible to analyze data.
To fix it, you have two options: remove the rows, or convert all cells in the columns into the
same format.
Convert Into a Correct Format
In our DataFrame, we have two cells with the wrong format. Check out rows 22 and 26: the 'Date'
column should be a string that represents a date. Let's try to convert all cells in the 'Date' column
into dates. Pandas has a to_datetime() method for this:
Convert to date:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
df["Date"]=pd.to_datetime(df["Date"])
# Print dataframe
print(df)
Before correction
20 '2020/12/20'
21 '2020/12/21'
22 NaN
23 '2020/12/23'

After correction
20 2020-12-20
21 2020-12-21
23 2020-12-23
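Row 22 could not be converted and becomes NaT (Not a Time), which pandas treats as a missing value. To obtain the cleaned output shown above, the unconvertible row can then be removed; a minimal sketch:
# Remove rows where the Date could not be converted
df.dropna(subset=['Date'], inplace=True)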
3.4 Removing Rows Using a Condition

Another way of handling wrong data is to remove the rows that contain it. That way you do not
need to decide what to replace the wrong values with. For example, to delete rows where "Duration" is
higher than 120:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
# Drop rows where Duration exceeds 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace=True)
# Print dataframe
print(df.head(10))
Return will be:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
3.5 Discovering Duplicates

Duplicate rows are rows that have been registered more than once. To discover duplicates,
we can use the duplicated() method, which returns a Boolean value for each
row: True for every row that is a duplicate of an earlier row, otherwise False:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
# Print duplicate indicator for each row
print(df.duplicated())
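To inspect the duplicated rows themselves, or simply count them, the Boolean result can be used as a filter; a minimal sketch on the same DataFrame:
# Show only the rows flagged as duplicates
print(df[df.duplicated()])
# Count the number of duplicated rows
print(df.duplicated().sum())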
3.6 Remove Duplicates

To remove duplicates, use the drop_duplicates() method. This method keeps the first occurrence by
default. Remove all duplicates:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
# Drop duplicate rows
df.drop_duplicates(inplace=True)
Output of df.duplicated() showing the duplicated row:
0 False
11 False
12 True
13 False
14 False
15 False
Dataset without row 12
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
3.7 Discovering and removing duplicates in a column

To remove duplicates based on a single column, use the drop_duplicates() method with the subset argument. Remove all duplicates
except the first occurrence:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
# Drop rows with a duplicated Date value
df.drop_duplicates(subset=['Date'], inplace=True)
Output of df.duplicated() showing the duplicated row:
0 False
11 False
12 True
13 False
14 False
15 False
Dataset without row 12
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
3.8 Keeping the last occurrence when removing duplicates in a column

By default, drop_duplicates() keeps the first occurrence and removes the later duplicate rows. To keep
the last occurrence instead (i.e., remove the earlier duplicates), use drop_duplicates() with subset and keep='last'. Remove all
duplicate rows except the last occurrence:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
# Drop duplicated Date values, keeping the last occurrence
df.drop_duplicates(subset=['Date'], keep='last', inplace=True)
Output of df.duplicated() showing the duplicated row:
0 False
11 False
12 True
13 False
14 False
15 False
Dataset without row 11
10 60 '2020/12/11' 103 147 329.3
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
3.9 Merging DataFrames

You can combine two or more datasets into a single dataset using the merge() method of a
pandas DataFrame. Keeping the joined variables in one DataFrame is usually easier for
analysis than working with several separate DataFrames. The merge() function performs standard
database-style joins between DataFrames. Its structure is:
pd.merge(left, right, how="inner", on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False, suffixes=("_x", "_y"),
         copy=True, indicator=False, validate=None)
left: A DataFrame.
right: Another DataFrame.
on: Column or index level names to join on. Must be found in both the left and right DataFrame.
If not passed and left_index and right_index are False, the intersection of the columns in the
DataFrames will be inferred to be the join keys.
left_on: Columns or index levels from the left DataFrame to use as keys.
right_on: Columns or index levels from the right DataFrame to use as keys.
left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In that
case, the number of levels must match the number of join keys from the right DataFrame.
right_index: If True, use the index (row labels) from the right DataFrame as its join key(s). In
that case, the number of levels must match the number of join keys from the left DataFrame.
how: One of 'left', 'right', 'outer', 'inner', 'cross'. Defaults to inner.
sort: Sort the result DataFrame by the join keys in lexicographical order.
suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x', '_y').
copy: Always copy data (default True) from the passed DataFrame.
indicator: Add a column to the output DataFrame called _merge with information on the
source of each row. _merge is Categorical-type and takes on a value of left_only for
observations whose merge key only appears in 'left' DataFrame, right_only for observations
whose merge key only appears in 'right' DataFrame, and both if the observation’s merge key is
found in both.
validate: string, default None. Checks whether the merge keys are duplicated, according to one of the types below.
“one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: checks if merge keys are unique in left dataset.
“many_to_one” or “m:1”: checks if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df_x=pd.read_csv("x.csv")
df_y=pd.read_csv("y.csv")
# Merge dataframes on the common id column
data=pd.merge(df_x,df_y,on="id",validate="1:1",indicator=True)
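Because indicator=True is passed, the merged DataFrame contains a _merge column; a quick, illustrative check of how many rows matched in both files:
# Tabulate the merge indicator (both, left_only, right_only)
print(data['_merge'].value_counts())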
3.10 Append Datasets

You can combine two or more datasets into a single dataset by appending them in a pandas
DataFrame. Appending rows one DataFrame at a time can be more computationally intensive than a single
concatenate. The DataFrame.append() function appends the rows of another
dataset to the end of the given dataset. If ignore_index=True, the original index labels are not used.
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df_x=pd.read_csv("x1.csv")
df_y=pd.read_csv("y1.csv")
# Append dataframes
data=df_x.append(df_y,ignore_index=True)
After appending, total row = 8868 (4434+4434)
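Note that DataFrame.append() is deprecated and has been removed in pandas 2.0; the equivalent operation with pd.concat() is:
# Equivalent append using concat (recommended for recent pandas versions)
data=pd.concat([df_x,df_y],ignore_index=True)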
Chapter 4
Descriptive statistics
Introduction

Descriptive statistics is a branch of statistics that involves summarizing, organizing, and presenting
data in a meaningful way. It involves the use of various statistical measures and tools to describe
the central tendency, variability, and distribution of data. It is an essential tool for researchers,
analysts, and decision-makers who need to understand and communicate data effectively.
The purpose of descriptive statistics is to provide a clear and concise summary of data, which can
help to identify patterns, trends, and relationships among variables that may be useful in developing
hypotheses or models. It allows us to describe the features of a variable, such as its mean, median,
mode, standard deviation, range, percentiles, and interquartile range.
Public health researchers use descriptive statistics to summarize and describe the characteristics of
a population or sample, such as the frequency and distribution of a particular disease or risk factors.
By providing a comprehensive and quantitative picture of the data, descriptive statistics allow
researchers to identify patterns and trends, explore relationships between variables, and draw
meaningful conclusions from their analyses. Descriptive statistics can also be used to compare
different populations or subgroups, evaluate the effectiveness of public health interventions, and
inform policy decisions aimed at improving the health of a population. Overall, descriptive statistics
are an important tool for researchers in the study of the distribution and determinants of health-
related events and conditions [1–26].

4.1 Tabulate
Discovering relationships between variables is the fundamental goal of data analysis. Frequency
tables are a basic tool you can use to explore data and get an idea of the relationships between
variables. A frequency table is just a data table that shows the counts of one or more categorical
variables.
4.1.1 One way Tabulate
Create frequency tables in pandas using the pd.crosstab() function. The function takes one or
more columns of a categorical variable and then constructs a new DataFrame of variable counts
based on the supplied arrays.
# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
# Create one way cross-tab of counts
my_tab=pd.crosstab(index=datf['death'],columns='count',
margins=True, dropna=False)
# Save results
my_tab.to_csv('One way table for count.csv')
# Create one way cross-tab of proportions
my_tab=pd.crosstab(index=datf['death'],columns='Proportion',
margins=True,dropna=False, normalize='columns')
# Calculate proportion
prop1=my_tab/my_tab.sum()
# Save proportion
prop1.to_csv('One way table for percent.csv')
death count All
0 2884 2884
1 1550 1550
All 4434 4434

death Proportion
0 0.650429
1 0.349571
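For a single categorical variable, the value_counts() method gives the same counts and proportions more directly; a minimal sketch on the same column:
# Counts per category
print(datf['death'].value_counts())
# Proportions per category
print(datf['death'].value_counts(normalize=True))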

4.1.2 Cross-Tabulate
A contingency table, also called a two-way table, has two dimensions (rows and
columns), each corresponding to a different variable. A two-way table shows the relationship between
two variables. To create a two-way table, pass two variables to the pd.crosstab() function:
# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
### Create cross-tab
my_tab=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True,dropna=False)
# Change columns name
my_tab.columns=['Female','male','Total']
# Change rows name
my_tab.index=['Alive','Death','Total']
# Save cross-tab
my_tab.to_csv('Two way table for count.csv')
# Calculate column proportion
prop1=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True,dropna=False,normalize='columns')
# Change columns name
prop1.columns=['Female','male','Total']
# Change rows name
prop1.index=['Alive','Death']
# Save cross-tab
prop1.to_csv('Two way table for percent.csv')
Female Male Total
Alive 1101 1783 2884
Death 843 707 1550
Total 1944 2490 4434

Female male Total
Alive 0.566358 0.716064 0.650429
Death 0.433642 0.283936 0.349571
Row percentages

# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
prop2=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True, dropna=False, normalize='index')
# Change columns name
prop2.columns=['Female','male']
# Change rows name
prop2.index=['Alive','Death','Total']
# save cross-tab
prop2.to_csv('Two way table for row percent.csv')
Female Male
Alive 0.381761 0.618239
Death 0.543871 0.456129
Total 0.43843 0.56157
Total percentages

# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
prop3=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True,dropna=False,normalize=True)
# Change columns name
prop3.columns=['Female','male','Total']
# Change rows name
prop3.index=['Alive','Death','Total']

# Save cross-tab
prop3.to_csv('Two way table for total percent.csv')
Female male Total
Alive 0.248309 0.40212 0.650429
Death 0.190122 0.15945 0.349571
Total 0.43843 0.56157 1
4.2 Cross tabulation heatmap

A cross tabulation heatmap is a way to compare multiple categories using colors.

# Import relevant packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
# Create cross-tab with column proportion
prop1=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True,dropna=False, normalize='columns')
# Create heatmap figure
plt.figure(figsize=(12,8))
sns.heatmap(prop1,cmap='YlGnBu',annot=True)
# Save figure
plt.savefig('Heatmap.png',bbox_inches='tight', dpi=600)
4.3 Descriptive statistics for continuous variables

Continuous variables are those that can take on any value within a given range, such as height,
weight, blood pressure, and cholesterol levels. Descriptive statistics for continuous data summarize the basic
characteristics of the data sample. Their main purpose in public
health data is to summarize the distribution of continuous variables, including measures
of central tendency (such as mean, median, and mode) and measures of variability (such as range,
variance, and standard deviation).
# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
# Print summary statistics of BMI
print("N={}\nMean={}\nSD={}\nMedian={}\nP25={}\nP75={}\nMin={}\nMax={}\n".format(
    datf['bmi'].count(), datf['bmi'].mean(), datf['bmi'].std(),
    datf['bmi'].median(), datf['bmi'].quantile(0.25),
    datf['bmi'].quantile(0.75), datf['bmi'].min(), datf['bmi'].max()))
# Using describe
print(datf['bmi'].describe())
# Calculate statistics by group
print(datf['bmi'].groupby(datf['sex']).describe())
N= 4415
Mean = 25.85
SD =4.10
Median = 25.45
P25 = 23.09
P75 = 28.09
Min = 15.54
Max = 56.8

Group
count mean std min 25% 50% 75% max
Male 1939.0 26.169582 3.407115 15.54 23.97 26.08 28.32 40.38
Female 2476.0 25.592884 4.557443 15.96 22.54 24.83 27.82 56.80
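If only selected statistics are needed by group, groupby() combined with agg() keeps the output compact; a minimal sketch:
# Selected statistics of BMI by sex
print(datf.groupby('sex')['bmi'].agg(['count','mean','std','median']))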
Chapter 5
Data visualization
Introduction

Data visualization is a crucial aspect of health data science, as it allows public health researchers to
present complex data in a clear and concise way. The use of visual representations such as graphs and
charts can help to convey critical information about the distribution of diseases, risk
factors, and health outcomes within populations. This chapter will provide an overview of the
different types of visualization techniques that are commonly used in health data science, as well
as their applications and limitations. The chapter also covers the different
types of data that can be visualized, including categorical data and continuous data, and describes
the most appropriate visual representation for each type of data. Additionally, the
chapter discusses the different types of graphs and charts that are commonly used in recent health
data science, such as pie charts, waffle charts, swarm plots, heatmaps, violin plots, forest plots,
timeseries graphs, and scatterplots. Finally, the chapter provides some best practices for
visualizing health data, such as choosing colors, fonts, and labels, and avoiding misleading
or confusing visualizations.

5.1 Pie charts


A pie chart is a circular statistical graphic for a single categorical variable. The chart divides the categories
into slices to show their numerical proportion; the slices show the relative size of each category. Each
slice forms a specific portion of the total, and the whole pie corresponds to 360 degrees. Many argue that
pie charts fail to display data accurately with any consistency and are hard to read when there are too many
categories [27, 28].

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
# Change directory
import os
os.chdir('~\Python notes')
# Create DataFrame
df=pd.DataFrame({
"Nutrition":["Underweight","Normal","Overweight","Obesity"],
"Number":[57,1226,2453,664]
})
# Create a pie graph
plt.pie(df["Number"], labels=df["Nutrition"], autopct='%1.2f%%',
colors=['red','green','orange','blue'])
# Insert title
plt.title("Nutrition status in a city")
# Save figure
plt.savefig('Pie chart.png',bbox_inches='tight', dpi=600)
Slices exploding Pie chart

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Create dataframe
df=pd.DataFrame({
"Nutrition":["Underweight","Normal","Overweight","Obesity"],
"Number":[57,1226,2453,664]
})
# Create figure
plt.pie(df["Number"],labels=df["Nutrition"],autopct='%1.2f%%',
colors=['red','green','orange','blue'],
explode=[0.02,0.012,0.01,0.1])
# Insert title
plt.title("Nutrition status in a city")
# Save figure
plt.savefig('Pie chart.png',bbox_inches='tight', dpi=600)

5.2 Bar charts

Bar charts are commonly used in public health data science to display categorical
data in vertical or horizontal form. Each category is represented as a rectangular bar with a length proportional
to its value. This makes it easy to compare the frequency or proportion of each category. Bar charts can also be
used to compare the distribution of a categorical variable across multiple groups or to display
changes in the distribution of a variable over time [8, 10, 12].
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Create dataframe
df=pd.DataFrame({
"Nutrition":["Underweight","Normal","Overweight","Obesity"],
"Number":[57,1226,2453,664]
})
# Create figure
plt.bar(df["Nutrition"],df["Number"],color='maroon')
# Put value label on the bars
for i, v in enumerate(df["Number"]):
    plt.text(i, v, str(v), color='blue')
# Create label of x-axis
plt.xlabel("nutritional status")
# Create label of y-axis
plt.ylabel("Number of participants")
# Put ticks in y-axis
plt.yticks([500,1000,1500,2000,2500])
# Insert title
plt.title("Nutrition status in a city")
# Save figure
plt.savefig('Bar chart_1.png',bbox_inches='tight', dpi=600)
Horizontal bar chart

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Create horizontal bar chart (uses the df created in the previous example)
plt.barh(df["Nutrition"],df["Number"],color='maroon')
# Put value label next to the bars
for i, v in enumerate(df["Number"]):
    plt.text(v, i, str(v), color='blue')
# Insert label of y-axis
plt.ylabel("Nutritional status")
# Insert label of x-axis
plt.xlabel("Number of participants")
# Insert ticks of x-axis
plt.xticks([500,1000,1500,2000,2500])
# Create title
plt.title("Nutrition status in a city")
# Save figure
plt.savefig('Barh chart_2.png',bbox_inches='tight', dpi=600)
Multiple bar chart

A multiple (grouped) bar chart draws several bars for each category.

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df = pd.read_csv('data.csv',index_col='slno')
# Create figure
ax=df.plot(kind='bar',stacked=False,width=.8,figsize=(10,6))
# Insert value label on top of the bars
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    if height > 0:
        ax.text(x + width/2,
                y + height,
                '{:.0f}%'.format(height),
                horizontalalignment='center',
                verticalalignment='bottom',
                fontweight='bold')
# Create x-axis label
plt.xlabel('\n Antibiotics')
# Create y-axis label
plt.ylabel('Percentage (%)')
# Create y-axis ticks with rotation and bold font
plt.yticks(np.arange(0,101,10))
plt.xticks(fontsize=9.5,rotation=360,fontweight='bold')
# Create legend in the graph
plt.legend(bbox_to_anchor=(.5, 0.9), ncol=2, loc='lower left')
# Save figure
plt.savefig('Multiple bar.png', bbox_inches="tight", dpi=600)

5.3 Stacked bar


A stacked bar chart is a visual representation of categorical data to demonstrate the magnitude of
different categories, and how they contribute to a total. It uses horizontal or vertical bars in a graph.
Each bar is divided into segments that represent the proportion of each category. Stacked bar charts
are commonly used in health data science to compare the proportions of multiple categories across
different variables or dimensions. It can also be clustered to display multiple categorical variables
simultaneously [1, 2, 12].
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df = pd.read_csv('data.csv',index_col='id')
# Create figure
ax=df.plot(kind='bar',stacked=True,width=.8,figsize=(10,6))
# Insert value label on top of the bars
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    if height > 0:
        ax.text(x + width/2,
                y + height/2,
                '{:.0f}%'.format(height),
                horizontalalignment='center',
                verticalalignment='center')
# Create title
plt.title("Sex of customers\n")
# Create x-axis label
plt.xlabel('\n Name of Antibiotics')
# Create y-axis label
plt.ylabel('Percentage (%)')
# Create y-axis ticks with rotation and bold font
plt.yticks(np.arange(0,201,10))
plt.xticks(fontsize=10,rotation=360)
# Create legend in the graph
plt.legend(bbox_to_anchor=(.5, 0.9), ncol=2, loc='lower left')
# Save figure
plt.savefig('Stacked bar.png', bbox_inches="tight", dpi=600)
5.4 Bar heatmap

A heatmap shows values for a main variable of interest across two axis (row and column) variables
as a grid of colored squares. Each square's color encodes the value of the main variable for the
corresponding cell. A heatmap provides an immediate visual summary of the
information and allows users to understand a complex set of data [29, 30].

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df = pd.read_csv('death.csv')
# Calculate count of death for each division and month
ct_counts=df.pivot(index='Division', columns='Month',
values='count')
# Create figure
plt.figure(figsize=(12, 8))
ax=sns.heatmap(ct_counts, annot=True, cmap='rainbow',
annot_kws={'size':12})
# Create title
plt.title('Division wise death distribution', fontsize=12)
# Create x-axis ticks
plt.xticks(rotation=0)
# Remove x-axis
plt.xlabel('')
# Create y-axis label
plt.ylabel('% of death from COVID-19',fontsize=12)
plt.yticks(fontsize=10,rotation=45)
# Save figure
plt.savefig("Division wise death distribution.jpg")
5.5 Waffle charts

A waffle chart is an advanced visualization tool. It is a great way to visualize data in
relation to a whole, or to highlight progress against a given threshold, when we are interested in
visualizing the contribution of each item to the total. For a given predefined height and width, the
contribution of each item is transformed into a number of tiles proportional to the item's
contribution to the total. More tiles therefore indicate a larger contribution, and the combined grid
resembles a waffle [31, 32].
Example 1
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
from pywaffle import Waffle
import os
# Change directory
os.chdir('~\Python notes')
# Create dataframe
data=pd.DataFrame({
"Nutrition":["Underweight","Normal","Overweight","Obesity"],
"Number":[57,1226,2453,664]
})
# Create figure
fig = plt.figure(
FigureClass=Waffle,
rows=10,
columns=44,
values=data.Number,
colors=['darkorange','cyan','green', 'red',],
title={'label': 'One square tile= 10 respondents', 'loc':
'left', 'fontsize':8},
labels=list(data.Nutrition),
legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1),
'ncol': 1, 'framealpha': 5,'fontsize':9},
starting_location='NW',
block_arranging_style='snake',
font_size=12,
icon_legend=True)
# Create figure resize
fig.set_size_inches(10,4)
# Save figure
plt.savefig('Waffle Chart.png',bbox_inches='tight', dpi=600)

Waffle chart with colorbar

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# Load dataset
df=pd.read_csv('~death.csv')
# Calculate total number of death
total=sum(df['Death'])
# Calculate proportion of death
prop=[(float(value)/total) for value in df['Death']]
# Set number of tiles in a row
width=40
# Set number of tiles in a column
height=10
total=width*height
tiles_per_cat=[round(proportion*total) for proportion in prop]
# Create figure
waffle = np.zeros((height, width))
category_index = 0
tile_index = 0
for col in range(width):
    for row in range(height):
        tile_index += 1
        if tile_index > sum(tiles_per_cat[0:category_index]):
            category_index += 1
        waffle[row, col] = category_index
colormap = plt.cm.coolwarm
plt.matshow(waffle, cmap=colormap)
ax = plt.gca()
# Create ticks for color bar
ax.set_xticks(np.arange(-0.5, (width), 1), minor=True)
ax.set_yticks(np.arange(-0.5, (height), 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=1.5)
# Remove x-axis ticks
plt.xticks([])
# Remove y-axis ticks
plt.yticks([])
# Create legend
values_num = df['Death']
values=[round(pr*100,1) for pr in prop]
categories = df['Month']
values_cumsum = np.cumsum(values)
total_values = values_cumsum[len(values_cumsum) - 1]
legend_handles = []
for i, category in enumerate(categories):
    # Legend label shows the count and the percentage of total deaths
    label_str = category + ' (' + str(values_num[i]) + ', ' + str(values[i]) + '%)'
    color_val = colormap(float(values_cumsum[i]) / total_values)
    legend_handles.append(mpatches.Patch(color=color_val,
                                         label=label_str))
# Insert legend
plt.legend(handles=legend_handles, loc='lower center',ncol=
round(len(categories)/4),
bbox_to_anchor=(0, -0.6,.95, 0.1))
# Insert title
plt.title('Deaths from Covid-19 in Bangladesh')
# Insert color bar
plt.clim(1,14)
plt.colorbar(ticks=np.arange(1,15,1)).ax.set_yticklabels(
categories)
# Save figure
plt.savefig('Waffle Chart.png',bbox_inches='tight', dpi=600)
5.6 Histogram
Histograms are a commonly used visualization tool in health data science to display the distribution
of a continuous variable. A histogram is a visual representation of the frequency or probability of observations
falling within different ranges (bins) of a continuous variable. Its x-axis represents the variable
being measured, while the y-axis represents the frequency or probability of observations within
each bin. Histograms are useful for identifying patterns in the data, such as whether the distribution is
symmetrical or skewed, and can help to identify potential outliers or data anomalies.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("data.csv")
# Create figure
fig,ax=plt.subplots(figsize=(10,7))
ax.hist(df_can['totchol'],bins=40,color="darkorange")
# Insert title
plt.title('Histogram of total cholesterol level')
# Put x-axis ticks
plt.xticks(rotation=90,fontsize=8)
# Insert y-axis label
plt.ylabel('Frequencies')
# Save figure
plt.savefig("Histogram.png",bbox_inches="tight", dpi=600)
Histogram with separate bins

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("data.csv")
# Create figure
fig,ax=plt.subplots(figsize=(10,7))
ax.hist(df_can['totchol'],bins=40,color="darkorange",
rwidth=0.85)
# Insert title
plt.title('Histogram of total cholesterol level')
# Put x-axis ticks
plt.xticks(rotation=90,fontsize=8)
# Insert y-axis label
plt.ylabel('Frequencies')
# Save figure
plt.savefig("Histogram.png",bbox_inches="tight", dpi=600)
Histogram with density plot

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("data.csv")
# Create figure
fig,ax=plt.subplots(figsize=(10,7))
ax.hist(df_can['totchol'],bins=40,color="darkorange",
rwidth=0.85, density=True)
# Insert title
plt.title('Histogram of total cholesterol level')
# Put x-axis ticks
plt.xticks(rotation=90,fontsize=8)
# Insert y-axis label
plt.ylabel('Probability')
# Save figure
plt.savefig("Histogram.png",bbox_inches="tight", dpi=600)
Histogram with kernel density line

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("data.csv")
# Create figure
fig,ax=plt.subplots(figsize=(10,7))
ax.hist(df_can['totchol'],bins=40,color="darkgreen", rwidth=0.85,
density=True)
# Create kernel density curve on the same axes
df_can['totchol'].plot.kde(ax=ax, color="red")
# Insert title
plt.title('Histogram of total cholesterol level')
# Put x-axis ticks
plt.xticks(rotation=90,fontsize=8)
# Insert y-axis label
plt.ylabel('Probability')
# Save figure
plt.savefig("Histogram.png",bbox_inches="tight", dpi=600)
5.7 Boxplot
Box plots are a popular tool for visualizing the distribution of continuous data in health data science.
A box plot, also known as a box-and-whisker plot, displays the distribution of data through five
summary statistics: the minimum value, the first quartile (25th percentile), the median (50th
percentile), the third quartile (75th percentile), and the maximum value. The rectangular box
represents the interquartile range (IQR), which is the range between the first and third quartiles.
The median is represented by a vertical line inside the box. The whiskers extend from the box to
the minimum and maximum values, respectively. Outliers may also be represented by individual
data points or circles beyond the whiskers. They can also be used to compare the distribution of a
continuous variable across multiple groups or to display changes in the distribution of a variable
over time [33].
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("~data.csv")
# Create figure
sns.boxplot(x='Month',y='cases',data=df_can,palette='rainbow')
# Insert title
plt.title("Monthwise cases")
# Put y-axis label
plt.ylabel('Number of positive cases')
# Put x-axis ticks
plt.xticks(fontsize=8,rotation=45)
# Put y-axis ticks
plt.yticks(fontsize=8)
# Save figure
plt.savefig('Box plot.png',bbox_inches="tight", dpi=600)
Horizontal boxplot

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("~data.csv")
# Create figure (swap x and y to draw the boxes horizontally)
sns.boxplot(x='cases',y='Month',data=df_can,palette='rainbow',
orient="h")
# Insert title
plt.title("Monthwise cases")
# Put x-axis label
plt.xlabel('Number of positive cases')
# Put x-axis ticks
plt.xticks(fontsize=8,rotation=45)
# Put y-axis ticks
plt.yticks(fontsize=8)
# Save figure
plt.savefig('Box plot.png',bbox_inches="tight", dpi=600)
5.8 Scatter plot

Scatter plots are a common tool for visualizing the relationship between two continuous variables
in health data science. A scatter plot is a graphical representation in which each observation is represented
by a point; the position of each point corresponds to the values of the two variables for
that observation. Scatter plots are useful in health data science as they can reveal patterns and the kind of
relationship between variables, such as whether the relationship is linear, nonlinear, or non-
existent. They can also be used to identify potential outliers or influential observations, as well as
to explore potential confounding variables.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("~data.csv")
# Create figure
sns.scatterplot(x='cases',y='Death',data=df_can,color='blue')
# Put y-axis ticks
plt.yticks(np.arange(0,121,20))
# Save figure
plt.savefig('scatter plot.png',bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df=pd.read_csv("~data.csv")
# Create age categories
df.loc[df["age"]<40,"AGE"]="Below 40 years"
df.loc[(df["age"]>=40) & (df["age"]<60),"AGE"]="40-60 years"
df.loc[(df["age"]>=60),"AGE"]="Above 60 years"
# Create figure
sns.scatterplot(x='totchol',y='sysbp',data=df,hue="AGE",
palette=["black","blue","red"])
# Insert x-axis label
plt.xlabel("Total cholesterol level")
# Insert y-axis label
plt.ylabel("Total systolic blood pressure")
# Put y-axis ticks
plt.yticks(np.arange(0,301,50))
# Save figure
plt.savefig('scatter plot.png',bbox_inches="tight", dpi=600)
5.9 Strip plot

Strip plots are a type of data visualization that allows us to display the distribution of a continuous
variable across categories or groups. Strip plots are a valuable tool in health data
science, and they are mainly used to compare multiple groups or categories within a continuous
variable.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('data.csv',index_col="id")
# Create BMI categories
bins=[0,18.5,23.5,29.5,np.inf]
names=["Underweight","Normal","Overweight","Obese"]
df['new_nutrition']=pd.cut(df["bmi"],bins,labels=names)
# Create figure
fig=plt.figure(figsize=(10,6))
sns.stripplot(data=df,x="sysbp", y="new_nutrition",palette='magma')
sns.set_theme(style='whitegrid')
# Insert labels
plt.ylabel('Nutritional status')
plt.xlabel('Systolic blood pressure')
# Save figure
plt.savefig("Strip plot.png",bbox_inches='tight', dpi=600)
To identify whether age influences the relationship between nutritional status and blood pressure,
you can use hue="Age" in the stripplot() method.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('data.csv',index_col="id")
# Create BMI categories
bins=[0,18.5,23.5,29.5,np.inf]
names=["Underweight","Normal","Overweight","Obese"]
df['new_nutrition']=pd.cut(df["bmi"],bins,labels=names)
# Create figure
fig=plt.figure(figsize=(10,6))
sns.stripplot(data=df,x="sysbp", y="new_nutrition",hue='Age',
palette='magma')
sns.set_theme(style='whitegrid')
# Insert labels
plt.ylabel('Nutritional status')
plt.xlabel('Systolic blood pressure')
# Save figure
plt.savefig("Strip plot.png",bbox_inches='tight', dpi=600)
5.10 Swarm plot

A swarm plot is a modern visualization tool in health data science that can be combined with box and strip plots.
In a swarm plot the points are adjusted so that they do not overlap each other, which gives a better
representation of the distribution of values. This type of plot is sometimes known as a
"beeswarm" plot.
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
# Change directory
os.chdir('~\Python notes')
# Load data
df_covid=pd.read_csv('~data.csv')
# Create figure
fig=plt.figure(figsize=(10,6))
sns.boxplot(x='Month',y='cases',data=df_covid,palette='rainbow')
# Change background color
sns.set_theme(style='whitegrid')
# Create swarm plot
sns.swarmplot(x='Month',y='cases',data=df_covid,color='k',
size=3)
# Insert title
plt.title('Combined box and swarm plot of COVID-19 cases in Bangladesh')
# Save figure
plt.ioff()
plt.savefig("Swarm plot.png",bbox_inches='tight', dpi=600)
Watermark and wave selection

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
# Change directory
os.chdir('~\Python notes')
# Load data
df_covid=pd.read_csv('~data.csv')
# Load watermark image (path is illustrative)
im=plt.imread('watermark.png')
# Create figure
fig=plt.figure(figsize=(10,6))
sns.boxplot(x='Month',y='cases',data=df_covid,palette='rainbow')
# Change background color
sns.set_theme(style='whitegrid')
# Create swarm plot
sns.swarmplot(x='Month',y='cases',data=df_covid,color='k',
size=3)
# Insert title
plt.title('Monthly COVID-19 positive cases in Bangladesh')
# Insert y-axis label
plt.ylabel('Number of positive cases')
# Put y-axis ticks
plt.yticks(np.arange(0,20001,1000))
plt.xticks(rotation=80, fontsize=8)
plt.ioff()
# Put horizontal lines
plt.hlines(y=5000, xmin=1.5, xmax=6.5, linewidth=1, color='gray',
linestyles='--')
plt.hlines(y=7800, xmin=11.5,xmax=13.5, linewidth=1,
color='gray',linestyles='--')
plt.hlines(y=16300,xmin=14.5,xmax=17.5, linewidth=1,
color='gray',linestyles='--')
# Put vertical lines
plt.vlines(x=[1.5,6.5],ymin=[0,0],ymax=[5000,5000], linewidth=1,
color='gray',linestyles='--')
plt.vlines(x=[11.5,13.5],ymin=[0,0],ymax=[7800,7800],
linewidth=1, color='gray',linestyles='--')
plt.vlines(x=[14.5,17.5],ymin=[0,0],ymax=[16300,16300],
linewidth=1, color='gray',linestyles='--')
# Insert texts
plt.text(3,5000,'First wave',color='red')
plt.text(11.6,7800,'Second wave',color='red')
plt.text(15,16300,'Third wave',color='red')
# Insert watermark image on the current axes
ax2=plt.gca()
ax2.imshow(im,aspect='auto',extent=(-1,20,0,20000), alpha=0.2,
cmap='gray',zorder=1,origin='lower')
# Save figure
plt.savefig('Monthly.png', dpi=600)
5.11 Timeseries plot

A time series plot is a graphical representation of a variable measured at different points in time.
The x-axis represents time, while the y-axis represents the value of the variable. Time series plots are useful in
public health data science as they can reveal trends and patterns over time, such as seasonal
fluctuations or long-term trends. They can also be used to identify potential waves or changes in the
distribution of diseases over time [10, 12, 13].

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~Study maps\Map')
# Load data
df=pd.read_csv('~data.csv',index_col='Date',parse_dates=True)
# Create figure
fig,ax=plt.subplots(figsize=(10,6))
ins1=ax.plot(df['cases'],label='Case',color='green')
ax2=plt.twinx()
ins2=ax2.plot(df['Death'],label='Death',color='red')
ins=ins1+ins2
labs=[i.get_label() for i in ins]
# Insert legend
ax.legend(ins,labs,loc=0)
# Put y-axis label
ax.set_ylabel('Number of COVID-19 cases')
ax2.set_ylabel('Number of death from COVID-19')
ax.set_ylim(0,8000)
ax2.set_ylim(0,100)
# Save figure
plt.savefig("Timeseries plot.png",bbox_inches='tight', dpi=600)
Watermark with wave selection

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~Study maps\Map')
# Load data
df_case=pd.read_csv('data.csv')
# Load watermark image (path is illustrative)
im=plt.imread('watermark.png')
# Create figure
fig, ax=plt.subplots(figsize=(10,6))
sns.lineplot(x=df_case['Date'],y=df_case['Death'],data=df_case,
color='red',linewidth=1)
# Put x-axis label
plt.xlabel('Days passed')
# Put y-axis label
plt.ylabel('Number of death from COVID-19')
# Put y-axis tick
plt.yticks(np.arange(0,301,20))
# Put x-axis tick
plt.xticks(np.arange(0,560,7))
plt.xticks(rotation=80, fontsize=6)
# Insert title
plt.title('Daily number of death from COVID-19 in Bangladesh')
# Put horizontal line
plt.hlines(y=70, xmin='7-May-20', xmax='14-Oct-20', linewidth=1,
color='gray',linestyles='--')
plt.hlines(y=113, xmin='19-Mar-21',xmax='14-May-21', linewidth=1,
color='gray',linestyles='--')
plt.hlines(y=265, xmin='10-Jun-21',xmax='1-Sep-21', linewidth=1,
color='gray',linestyles='--')
# Put vertical line
plt.vlines(x=['19-Mar-21','14-May-21'], ymin=[0,0],
ymax=[113,113], linewidth=1, color='gray',linestyles='--')
plt.vlines(x=['10-Jun-21','1-Sep-21'], ymin=[0,0],
ymax=[265,265], linewidth=1, color='gray',linestyles='--')
plt.vlines(x=['7-May-20','14-Oct-20'],ymin=[0,0],ymax=[70,70],
linewidth=1, color='gray',linestyles='--')
# Insert text
plt.text(110,72,'First wave',color='purple')
plt.text(380,114,'Second wave',color='purple')
plt.text(480,266,'Third wave',color='purple')
# No grid lines
plt.grid(False)
plt.ioff()
# Insert image
ax.imshow(im,aspect='auto',extent=(1,560,0,300), alpha=0.2,
cmap='gray',zorder=1,origin='lower')
# Save figure
plt.tight_layout()
plt.savefig('Number of death.png',dpi=600)
5.13 Survival curve

Survival curves give a visual representation of life tables. The horizontal axis shows the
time to event, and the vertical axis shows the probability of survival. Survival curves are commonly used
to compare survival between groups in health data science [11, 15].

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from lifelines import KaplanMeierFitter
from lifelines import NelsonAalenFitter
from lifelines.statistics import logrank_test
from lifelines import CoxPHFitter
from lifelines.plotting import add_at_risk_counts
# Change directory
os.chdir('~Python note')
# Load data
data=pd.read_csv('data.csv')
data.loc[data.status==1,'Dead']=0
data.loc[data.status==2,'Dead']=1
# Fit survival model
kmf=KaplanMeierFitter()
kmf.fit(durations=data['time'],event_observed=data['Dead'])
# Print results
print(kmf.event_table)
# Print predictions at selected time points
print(kmf.predict([0,1,11,12,15]))
# Print results
print(kmf.survival_function_)
print(kmf.median_survival_time_)
print(kmf.confidence_interval_)
# Probability of dying (cumulative density)
print(kmf.cumulative_density_)
# Create cumulative density plot
kmf.plot_cumulative_density()
# Fit Nelson Aalen model
naf=NelsonAalenFitter()
naf.fit(durations=data['time'],event_observed=data['Dead'])
print(naf.cumulative_hazard_)
naf.plot_cumulative_hazard()
print(naf.event_table)
print(naf.predict([0,1,11,12,15]))
print(naf.confidence_interval_)
#data separation (Male and Female)
kmf_m=KaplanMeierFitter()
kmf_f=KaplanMeierFitter()
male=data.query('sex==1')
Female=data.query('sex==2')
# Create plot
ax=plt.subplot(111)
kmf_m.fit(durations=male['time'],event_observed=male['Dead'],
label='Male')
kmf_f.fit(durations=Female['time'],event_observed=Female['Dead'],
label='Female')
ax=kmf_m.plot_survival_function(ax=ax)
ax=kmf_f.plot_survival_function(ax=ax)
add_at_risk_counts(kmf_m,kmf_f,ax=ax,xticks=np.arange(0,1001,100)
)
# Insert title
plt.title('Kaplan Meier estimate')
# Put y-axis label
ax.set_ylabel('Probability of survival')
# Put a-axis label
ax.set_xlabel('Days passed')
# Put x-axis ticks
ax.set_xticks(np.arange(0,1001,100))
# Put y-axis ticks
ax.set_yticks(np.arange(0,1.1,.1))
# Save figure
plt.savefig('Survival.png', bbox_inches="tight",dpi=600)
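The logrank_test imported above can be used to formally compare the two survival curves; a minimal sketch using the same male and Female subsets:
# Log-rank test comparing survival between sexes
results=logrank_test(male['time'],Female['time'],
                     event_observed_A=male['Dead'],
                     event_observed_B=Female['Dead'])
print(results.test_statistic)
print(results.p_value)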
Cox proportional hazards model

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from lifelines import CoxPHFitter
# Change directory
os.chdir('~Python note')
# Load data
data=pd.read_csv('data.csv')
# Fit Cox proportional hazard model
cph=CoxPHFitter()
# Create the set of variables used in the model
data=data[['time','sex','age','ph.ecog','ph.karno','meal.cal',
'wt.loss','Dead']]
# Drop rows with missing values
data=data.dropna(subset=['time','sex','age','ph.ecog','ph.karno',
'meal.cal','wt.loss','Dead'])
cph.fit(data,duration_col='time',event_col='Dead')
cph.print_summary()
# Create plot
cph.plot()
# Insert title
plt.title('Hazard ratio plot for variables')
# Save figure
plt.savefig("HR.png", bbox_inches="tight", dpi=600)
5.14 Sankey plot

In public health research studies, researchers often use visualization techniques to display complex
data in a clear and concise manner. One such visualization tool is the Sankey plot, a type
of diagram that displays the flow or movement of data between areas. It consists of a series of
rectangles or boxes, with the width of each box proportional to the size of the
data. The boxes are connected by lines or arrows, with the width of the lines proportional to the
amount of data flowing between the areas. Sankey plots are particularly useful in health data science
as they can help to visualize the progression of disease within a population [34].
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pySankey.sankey import sankey
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('Sanky_data.csv')
# Create figure
sankey(left=df["Infected Area"], right=df["Treatment Area"],
leftWeight=df["Total"], rightWeight=df["Total"],
aspect=20, fontsize=12)
# Reset figure size
fig1=plt.gcf()
fig1.set_size_inches(8,6)
# Change color
fig1.set_facecolor("w")
# Put top texts
fig1.text(x=0.1,y=0.85,s='Infected Area', fontsize=10,
fontweight='bold')
fig1.text(x=0.8,y=0.85,s='Treatment Area', fontsize=10,
fontweight='bold')
# Save figure
fig1.savefig("Sankey_data.png", bbox_inches="tight", dpi=600)
5.15 Forest Plot


In health data science, forest plots are commonly used to display and compare the results of multiple
studies or risk factors on a particular research question. A forest plot is a graphical representation
of the results of each individual study, along with a summary estimate of the effect size across all
studies. It typically includes a vertical line representing the null effect, which is the line of no
difference between the groups being compared. Each study is represented by a square or diamond,
with the size of the square or diamond representing the weight of the study, typically based on the
sample size or precision of the estimate. The position of the square or diamond represents the point
estimate of the effect size, and the horizontal line extending from the square or diamond represents
the confidence interval for the estimate [1, 8, 11, 15].

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from zepid.graphics import EffectMeasurePlot
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('~database_forest.csv')
# Create variables
labs = df['Authors']
measure = df['Proportion']
lower = df['lu']
upper = df['up']
# Create figure
p = EffectMeasurePlot(label=labs, effect_measure=measure,
lcl=lower, ucl=upper)
# Create labels
p.labels(effectmeasure='Proportion(%)',fontsize=8)
# Change color
p.colors(pointshape="D",pointcolor='red')
# Adjust figure
ax=p.plot(figsize=(12,8), t_adjuster=0.01, max_value=100,
min_value=0)
# Draw vertical line
ax.vlines(x=50,ymax=60,ymin=0,linestyles='--',colors='gray')
# Put x-axis ticks
x=np.arange(0,101,10)
labds=['0','10','20','30','40','50','60','70','80','90','100']
ax.set_xticks(x)
ax.set_xticklabels(labds)
# Create spines
ax.spines['top'].set_visible(True)
ax.spines['right'].set_visible(True)
ax.spines['bottom'].set_visible(True)
ax.spines['left'].set_visible(False)
# Save figure
plt.tight_layout()
plt.savefig("Forestplot.png",bbox_inches="tight", dpi=600)
Forestplot for risk factors

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from zepid.graphics import EffectMeasurePlot
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('data.csv')
# Define variables
labs = df['Authors']
measure = df['RR']
lower = df['lu']
upper = df['up']
# Create figure
p = EffectMeasurePlot(label=labs, effect_measure=measure,
lcl=lower, ucl=upper)
# Insert label
p.labels(effectmeasure='Risk ratio (RR)',fontsize=8)
# Change color
p.colors(pointshape="D",pointcolor='red')
ax=p.plot(figsize=(11,4), t_adjuster=0.05, max_value=100,
min_value=0)
# Put x-axis tick
x=np.arange(-10,101,10)
labds=['-10','0','10','20','30','40','50','60','70','80','90','100']
ax.set_xticks(x)
ax.set_xticklabels(labds)
# Create spines
ax.spines['top'].set_visible(True)
ax.spines['right'].set_visible(True)
ax.spines['bottom'].set_visible(True)
ax.spines['left'].set_visible(False)
# Save figure
plt.savefig("Forestplot.png",bbox_inches="tight", dpi=600)
5.16 Violin Plot

A violin plot is a type of plot used to display the distribution of continuous data. It is
similar to a box plot, which is commonly used in public health research studies. However, instead
of showing a box and whiskers, the violin plot shows the distribution of the data using a kernel
density plot, where the width of the plot at a particular point represents the density of data points at
that point. It is particularly useful in public health research studies as it can reveal important
information about the distribution of a variable, such as the presence of bimodal distributions,
skewness, and outliers. This information can be used to identify subgroups within the data set, to
assess the normality of the data, and to evaluate the appropriateness of statistical tests [10].

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('data.csv',index_col='id')
# Create figure
fig=plt.figure(figsize=(10,6))
sns.violinplot(data=df,x='totchol',color="darkorange")
# Background change
sns.set_theme(style='whitegrid')
plt.ioff()
# Insert x-axis label
plt.xlabel("Total cholesterol level of respondent")
# Save figure
plt.savefig("Violin plot.png",bbox_inches='tight', dpi=600)
Note: white dot in the violin indicates median, box shows interquartile range, curve shows the
kernel density of the variable.

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('data.csv',index_col='id')
# Create figure
fig=plt.figure(figsize=(10,6))
sns.violinplot(data=df,x='totchol',y="sex",color="darkorange")
# Change background
sns.set_theme(style='whitegrid')
plt.ioff()
# Insert x-axis label
plt.xlabel("Total cholesterol level of respondent")
# Change y-axis label
plt.ylabel("Sex of respondents")
# Save figure
plt.savefig("Violin plot.png",bbox_inches='tight', dpi=600)
Swarm violin plot

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
df=pd.read_csv('data.csv',index_col="id")
# Create categories
bins=[0,18.5,23.5,29.5,np.inf]
names=["Underweight","Normal","Overweight","Obese"]
df['new_nutrition']=pd.cut(df["bmi"],bins,labels=names)
# Create figures
fig=plt.figure(figsize=(10,6))
sns.violinplot(data=df,x="sysbp",y="new_nutrition",
palette='rainbow')
sns.set_theme(style='whitegrid')
sns.swarmplot(data=df,x="sysbp",y="new_nutrition",color='green',
alpha=0.3)
plt.ylabel('Nutritional status')
plt.xlabel('Systolic blood pressure')
# Save figure
plt.savefig("Swarm violin plot.png",bbox_inches='tight', dpi=600)
To identify whether age influences the relationship between nutritional status and blood pressure,
you can use hue="Age" in the swarmplot() method.

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
df=pd.read_csv('data.csv',index_col="id")
# Create categories
bins=[0,18.5,23.5,29.5,np.inf]
names=["Underweight","Normal","Overweight","Obese"]
df['new_nutrition']=pd.cut(df["bmi"],bins,labels=names)
# Create figures
fig=plt.figure(figsize=(10,6))
sns.violinplot(data=df,x="sysbp",y="new_nutrition",
palette='rainbow')
sns.set_theme(style='whitegrid')
sns.swarmplot(data=df,x="sysbp",y="new_nutrition",hue='Age',
color='green', alpha=0.3)
plt.ylabel('Nutritional status')
plt.xlabel('Systolic blood pressure')
# Save figure
plt.savefig("Swarm violin plot.png",bbox_inches='tight', dpi=600)


Chapter 6
Spatial data Mapping
Introduction

Spatial data mapping is an important visualization tool in health data science for visualizing disease
patterns and distributions within populations at local and global scales. It involves the use of
geographical information systems (GIS) and other spatial analysis techniques to map and analyze
disease incidence, prevalence, and mortality rates, as well as risk factors and environmental
exposures that may contribute to disease outcomes. The chapter will begin with a guide to mapping
geodata and the benefits of different mapping approaches for disease surveillance and public health
decision-making. The chapter will also discuss the different types of spatial data that can be
analyzed, including point data, polygon data, and raster data. The chapter will then focus on the
different types of maps that can be created to visualize spatial data, including choropleth maps,
point maps, and migration maps [1, 8, 13, 16, 21].

6.1 Create Study site map


One key feature of spatial mapping is the ability to effectively display study sites or sampling
locations, which can provide critical information on the distribution of disease and potential risk
factors [1, 2].

# Import packages
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.cbook import get_sample_data
from shapely.geometry import Point
import os
# Change directory
os.chdir('~Python note')
# Load compass
im = plt.imread(get_sample_data('~compass.jpg'))
# Load Bangladesh shape file
Bdg_d =gpd.read_file('~\bgd_admbnda_adm1_bbs_20201113.shp')
# Load data

df = pd.read_csv("Long_Lats.csv", delimiter=',', skiprows=0,
                 low_memory=False)
# Create geometric points
geometry = [Point(xy) for xy in zip(df['Lon'], df['Lat'])]
gdf = GeoDataFrame(df, geometry=geometry)
# Create figure
fig,ax4=plt.subplots(figsize=(12,8))
# Set color
cmap_csub=['darkgreen','darkorange']
color_mapsub=ListedColormap(cmap_csub)
# Plot Bangladesh shape file
Bdg_d.plot(color='grey',ax=ax4,edgecolor='k',linewidth=0.8,
alpha=0.1)
# Plot geodata file
gdf.plot(column='Area',ax=ax4,cax=None, categorical=True,
legend=True,
marker='o',
markersize=45,
cmap=color_mapsub)
# Create name of areas
for i in range(len(gdf)):
    txt=ax4.text(float(gdf.Lon[i]+0.05), float(gdf.Lat[i]),
                 "{}".format(gdf.District[i]), size=8, color='k',
                 horizontalalignment='left')
# Put label for both x-axis and y-axis
ax4.set(xlabel="Longitude(Degrees)", ylabel="Latitude(Degrees)")
# Create inset axes for the compass image
newax = fig.add_axes([0.55, 0.8, 0.10, 0.10], anchor='C',
zorder=0)
# Insert compass
newax.imshow(im)
newax.axis('off')


# Save figure
plt.savefig('~map.png',dpi=600)

6.2 Select Local areas in Bangladesh


Suppose your study will be conducted in a few areas of a country where samples will be collected.
It is easy to display these study areas on a map, which is helpful for showing the study location.

# Import packages
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from pyproj import Proj, CRS, transform
from shapely.geometry import Point
import os
# Change directory
os.chdir('~Python note')
# Load data


df_dis=pd.read_csv('data.csv',index_col="Area")
# Load Bangladesh district shape file
Bdg_dis = gpd.read_file('~\bgd_admbnda_adm2_bbs_20201113.shp')
# Load data
df_long=pd.read_csv('Districts.csv', delimiter=',', skiprows=0,
low_memory=False)
# Create geodata
geom = [Point(xy) for xy in zip(df_long['lon'],df_long['lat'])]
gdf_long = GeoDataFrame(df_long, geometry=geom)
# Create figure
fig,ax8=plt.subplots(figsize=(12,8))
# Create color panel
cmap_csub_f=['darkgreen','darkorange','purple','blue','black',
             'brown','red']
color_mapsub_f=ListedColormap(cmap_csub_f)
# Create maps
Bdg_dis[Bdg_dis.ADM2_EN=='Faridpur'].plot(color='w',ax=ax8,
edgecolor= 'black',alpha=0.3)
Bdg_dis[Bdg_dis.ADM2_EN=='Magura'].plot(color='w',ax=ax8,
edgecolor='black',alpha=0.3)
Bdg_dis[Bdg_dis.ADM2_EN=='Rajbari'].plot(color='w',ax=ax8,
edgecolor='black',alpha=0.3)
gdf_long.plot(column='Area',ax=ax8,cax=None,categorical=True,
legend=True,
marker='o',
markersize=45,
cmap=color_mapsub_f)
for i in range(len(gdf_long)):
    if gdf_long.District[i]=='Faridpur':
        txt=ax8.text(float(gdf_long.lon[i]+0.3),
                     float(gdf_long.lat[i]-0.05),
                     "{}".format(gdf_long.District[i]), size=10,
                     color='k', horizontalalignment='left')
    if gdf_long.District[i]=='Magura':
        txt=ax8.text(float(gdf_long.lon[i]-0.06),
                     float(gdf_long.lat[i]-0.2),
                     "{}".format(gdf_long.District[i]), size=10,
                     color='k', horizontalalignment='left')
    if gdf_long.District[i]=='Rajbari':
        txt=ax8.text(float(gdf_long.lon[i]-0.05),
                     float(gdf_long.lat[i]+0.05),
                     "{}".format(gdf_long.District[i]), size=10,
                     color='k', horizontalalignment='left')
# Save map
plt.savefig('Map.png',dpi=600)

6.3 Distance Calculation


Distance calculation is an important tool used in spatial mapping for public health research studies
to measure the distance between different geographic locations, such as study sites, health facilities,
or other relevant landmarks. In Python, a common approach is to project the coordinates to a
planar coordinate system and then compute the Euclidean distance between the two points.
Accurate distance calculation is crucial for understanding the geographical
relationships between different geographic locations and identifying potential sources of disease
transmission [16].
# Import packages


import pandas as pd
import numpy as np
import geopandas as gpd
from geopandas import GeoDataFrame
import matplotlib.pyplot as plt
from pyproj import Proj, CRS, transform
from shapely.geometry import Point
import os
# Change directory
os.chdir('~Python note')
# Load data
df_dis=pd.read_csv('Dataset.csv',index_col="slno")
# Load Bangladesh district shape file
Bdg_dis = gpd.read_file('~\\bgd_admbnda_adm2_bbs_20201113.shp')
# Create geodata
geometry_1 = [Point(xy) for xy in zip(df_dis['lon1'],
df_dis['lat1'])]
gdf_1 = GeoDataFrame(df_dis,
geometry=geometry_1,crs=CRS('EPSG:4326'))
gdf_1.to_crs(epsg=5234,inplace=True)
geometry_2 = [Point(xy) for xy in zip(df_dis['lon2'],
df_dis['lat2'])]
gdf_2 = GeoDataFrame(df_dis,
geometry=geometry_2,crs=CRS('EPSG:4326'))
gdf_2.to_crs(epsg=5234,inplace=True)
# Calculate distance in km
df_dis["Distance"]=np.round((gdf_1.distance(gdf_2))/1000,
decimals=2)
# Save file
df_dis.to_csv('Distant_calculation.csv')
District_City Lat1 Lon1 district Lon2 lat2 Distance
Dhaka 23.79691 90.40901 Rajshahi 88.65 24.5639 217.37
Gazipur 23.99639 90.42093 Rajshahi 88.85 24.375 172.68
Kishoreganj 24.38262 90.95009 Rajshahi 88.8083 24.5639 203.35
Madaripur 23.20803 90.15302 Rajshahi 88.85 24.375 189.51
Narayanganj 23.62043 90.50067 Joypurhat 89.18333 25.05833 210.11
Munshiganj 23.53782 90.52981 Pabna 89.2372 24.00644 137.1
Narshingdi 23.97086 90.74489 Borga 89.0333 24.8167 181.44
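If you prefer to work directly with latitude and longitude instead of a projected coordinate system, the
haversine formula gives the great-circle distance between two points. The sketch below is a minimal
alternative; it reuses the lat1/lon1/lat2/lon2 column names and Dataset.csv from the code above, while
the haversine_km helper and the Distance_hav column are illustrative names. Great-circle values may
differ slightly from the projected distances shown in the table.
# A minimal haversine sketch (alternative to the projected-CRS distance above)
import pandas as pd
import numpy as np
def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between points given in decimal degrees
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2-lat1)/2)**2
         + np.cos(lat1)*np.cos(lat2)*np.sin((lon2-lon1)/2)**2)
    return 2*6371.0*np.arcsin(np.sqrt(a))
df_dis=pd.read_csv('Dataset.csv',index_col="slno")
df_dis["Distance_hav"]=np.round(haversine_km(df_dis['lat1'], df_dis['lon1'],
                                             df_dis['lat2'], df_dis['lon2']),
                                decimals=2)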


6.4 Migration Mapping with distance


Migration mapping is a powerful tool for visualizing the direction of disease transmission and
spread in public health research studies. Understanding the direction of migration can help
researchers to identify potential risk factors and develop effective prevention and control
strategies [16].
# Import packages
import pandas as pd
import numpy as np
import geopandas as gpd
from geopandas import GeoDataFrame
import matplotlib.pyplot as plt
from shapely.geometry import Point
import os
# Change directory
os.chdir('~Python note')
# Load coordinate points data
df_long=pd.read_csv('~data.csv', delimiter=',', skiprows=0,
                    low_memory=False)
# Load Bangladesh district shape file
Bdg_dis = gpd.read_file('~bgd_admbnda_adm2_bbs_20201113.shp')
# Create geodata
geom = [Point(xy) for xy in zip(df_long['lon1'],df_long['lat1'])]
gdf_long = GeoDataFrame(df_long, geometry=geom)
# Create figure
fig,ax8=plt.subplots(figsize=(12,15))
# Create map
Bdg_dis.plot(color='w',ax=ax8,edgecolor='black',alpha=0.1)
gdf_long.plot(column='location',ax=ax8,cax=None,categorical=True,
legend=False,marker='o',markersize=45,cmap="BuGn_r")
# Create latitudes and longitudes
Bdg_dis['coords'] =Bdg_dis['geometry'].apply(lambda x:
x.representative_point().coords[:])
Bdg_dis['coords'] = [coords[0] for coords in Bdg_dis['coords']]
Bdg_dis['longs']=[longs[0] for longs in Bdg_dis['coords']]
Bdg_dis['lats']=[lat[1] for lat in Bdg_dis['coords']]
# Insert district names

for i in range(len(Bdg_dis)):
    txt=ax8.text(float(Bdg_dis.longs[i]), float(Bdg_dis.lats[i]),
                 "{}".format(Bdg_dis.ADM2_EN[i]), size=8, color='k',
                 horizontalalignment='left')
# Load distance data
df_distant=pd.read_csv('~distance.csv', delimiter=',', skiprows=0,
                       low_memory=False)
for slat, dlat, slon, dlon, distant in zip(df_distant['lat1'],
        df_distant['lat2'], df_distant['lon1'],
        df_distant['lon2'], df_distant['Distance']):
    # Draw a line between the two coordinate points
    plt.plot([slon, dlon], [slat, dlat], linewidth=1,
             linestyle='dashdot', color='blue', alpha=0.5)
    # Insert the distance in km
    if distant > 0:
        plt.text(dlon-0.1, dlat+0.05,
                 str(np.round(distant, decimals=1)) + ' Km',
                 color='brown', size=8, alpha=0.8)
# Save map
plt.savefig("~ mapping.png",dpi=600)


6.5 Highlighting study districts


Spatial mapping is a powerful tool used in public health research studies to visualize and analyze
patterns of disease incidence, prevalence, and mortality rates in areas. One important aspect of
spatial mapping is the ability to highlight specific study areas, such as neighborhoods or regions of
interest, to better understand disease patterns and potential risk factors [12].

# Import packages
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.patheffects as PathEffects
from matplotlib.colors import ListedColormap
import os
# Change directory
os.chdir('~Python note')
# Load coordinate points data
df_dis=pd.read_csv('District.csv')
# Load Bangladesh district shape file
Bdg_d = gpd.read_file('~\\bgd_admbnda_adm2_bbs_20201113.shp')
# Merge shape file and coordinate data
BDG_d=pd.merge(Bdg_d,df_dis,on='ADM2_PCODE',how='left')
# Change NaN to Bangladesh
BDG_d['Hosp'].fillna("Bangladesh",inplace=True)
# Create map
fig,ax4=plt.subplots(figsize=(6,8))
cmap_c=['w','orange']
BDG_d.plot(column='Hosp',ax=ax4, categorical=True, edgecolor='k',
figsize=(6,8),
alpha=0.8,
linewidth=0.5,
cmap=ListedColormap(cmap_c),
legend=True)
# Create latitudes and longitudes
BDG_d['coords'] = BDG_d['geometry'].apply(lambda x:


x.representative_point().coords[:])
BDG_d['coords'] = [coords[0] for coords in BDG_d['coords']]
BDG_d['longs']=[longs[0] for longs in BDG_d['coords']]
BDG_d['lat']=[lat[1] for lat in BDG_d['coords']]
# Insert Study district names
for i in range(len(BDG_d)):
    if BDG_d.Hosp[i]=='H':
        txt=ax4.text(float(BDG_d.longs[i])-0.08, float(BDG_d.lat[i]),
                     "{}\n{}".format(BDG_d.Hosp[i], BDG_d.ADM2_EN[i]),
                     size=6, color='k', fontweight='bold', wrap=True)
        txt.set_path_effects([PathEffects.withStroke(linewidth=2,
                                                     foreground='w')])
# Format legend
legenelement=[mpatches.Patch(edgecolor='k', facecolor='w',
                             label="Bangladesh", alpha=0.5),
              mpatches.Patch(color='orange', label="Selected district",
                             alpha=0.8),
              mpatches.Patch(color='w', label="H: Hospital")]
ax4.legend(handles=legenelement)
leg=ax4.get_legend()
leg.set_bbox_to_anchor((.35,.15))
# Insert compass
ax4.text(x=92, y=26, s='N', fontsize=20)
ax4.arrow(92.12, 25.80, 0, 0.18, length_includes_head=True,
head_width=0.2, head_length=0.3, overhang=.2,
facecolor='k')
# Save map
plt.tight_layout()
plt.savefig('~map.png',dpi=600)


6.7 Create Heatmap (Choropleth map)

Spatial heatmap analysis is a powerful tool used in public health research studies to visualize and
analyze patterns of disease incidence, prevalence, and mortality rates across geographic areas.
Heatmaps are maps that use color-coded shading to represent the density or intensity of a
particular variable at a specific location, making it easy to identify spatial patterns and trends in
disease occurrence. They help to effectively identify and communicate spatial patterns and trends
in disease occurrence and contribute to the advancement of public health [8, 16].

# Import packages
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.cbook import get_sample_data
from mpl_toolkits.axes_grid1 import make_axes_locatable
import os
# Change directory

os.chdir('~Study maps\Map')
# Load coordinate points data
df_dis=pd.read_csv('~\Divisions.csv')
# Load Bangladesh shape file
Bdg_d = gpd.read_file('~\\bgd_admbnda_adm1_bbs_20201113.shp')
# Load compass
im = plt.imread(get_sample_data('~\compass.jpg'))
# Merge coordinate and shape file
BDG_d=pd.merge(Bdg_d,df_dis,on='ADM1_PCODE',how='left')
# Create figure
fig1,ax5=plt.subplots(figsize=(6,8))
# Create color bar
divider5 = make_axes_locatable(ax5)
cax5 = divider5.append_axes("right", size="2%", pad=0.2)
# Create Bangladesh map
BDG_d.plot(column='Access',ax=ax5, cax=cax5,categorical=False,
edgecolor='k',figsize=(6,8),
vmin=0,
vmax=10,
alpha=0.8,
linewidth=0.5,
legend_kwds={'label':"% of patients",
'orientation':'vertical', 'ticks':np.arange(0,10.1,1)},
cmap=plt.get_cmap('OrRd',10),
legend=True)
# Separate coordinate points
BDG_d['coords'] = BDG_d['geometry'].apply(lambda x:
x.representative_point().coords[:])
BDG_d['coords'] = [coords[0] for coords in BDG_d['coords']]
BDG_d['longs']=[longs[0] for longs in BDG_d['coords']]
BDG_d['lat']=[lat[1] for lat in BDG_d['coords']]
# Insert names
for i in range(len(BDG_d)):


    txt=ax5.text(float(BDG_d.longs[i]), float(BDG_d.lat[i]),
                 "{}\n{}%".format(BDG_d.Divisions[i], BDG_d.Access[i]),
                 size=8, color='k', horizontalalignment='center')
# Insert legend
leg=ax5.get_legend()
plt.tight_layout()
# Insert compass
newax1 = fig1.add_axes([0.55, 0.7, 0.15, 0.15], anchor='C',
zorder=0)
newax1.imshow(im)
newax1.axis('off')
# Save map
plt.savefig('~ map_1.png',dpi=600)

6.8 World mapping with COVID- 19 cases

# Import packages


import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from mpl_toolkits.axes_grid1 import make_axes_locatable
import os
# Change directory
os.chdir('~Python note')
# Load death data
df_death=pd.read_csv('world_covid death.csv')
# Load World shape file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Rename the columns so that we can merge with our data
world.columns=['pop_est', 'continent', 'name', 'CODE',
'gdp_md_est', 'geometry']
# Merge shape file and death dataset
COVID_death=pd.merge(world,df_death,on='CODE',how='left')
# Fill NaN values to zeros
COVID_death['COVID_Death'].fillna(value=0,inplace=True)
# Calculate deaths per million for each country
COVID_death["d_per_1m"]=1000000*(COVID_death['COVID_Death']
                                 /COVID_death['pop_est'])
# Create map
fig1, ax1=plt.subplots()
# Create color bar
divider1 = make_axes_locatable(ax1)
cax1 = divider1.append_axes("top", size="2%", pad=0.1)
# Create map
COVID_death.plot(column='d_per_1m', ax=ax1,cax=cax1,
figsize=(12,10),
alpha=0.5,
edgecolor='k',
legend=True,


legend_kwds={'orientation':'horizontal'},
cmap='Reds')
# Insert title
plt.title('Death/1Million population from COVID infection in '
          'Different Countries', fontsize=14)
# Save map
plt.tight_layout()
plt.savefig('~ map_1.png',dpi=600)

6.9 Generate Random coordinate points


Random sampling of coordinate points is a fundamental aspect of public health research studies, as it
helps to ensure that study samples are representative of the larger population being studied when
a sampling frame of households is not available. In household-based studies, one common approach
is to generate random coordinate points within a defined study area to select households or sampling
units. This section will provide step-by-step guidance on how to define a study area, generate
random coordinate points within that area, and use those points to select households for inclusion
in the study [22].
# Import packages


import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from shapely.geometry import Point
import random
import os
# Change directory
os.chdir('~Python note')
# Load Bangladesh shape file
Bdg= gpd.read_file('~\bgd_admbnda_adm0_bbs_20201113.shp')
# Load Bangladesh district shape file
Bdg_dis = gpd.read_file('~bgd_admbnda_adm2_bbs_20201113.shp')
# Create function to randomly select points within a polygon
def random_points_in_polygon(number, polygon):
    points = []
    min_x, min_y, max_x, max_y = polygon.bounds
    i = 0
    while i < number:
        point = Point(random.uniform(min_x, max_x),
                      random.uniform(min_y, max_y))
        if polygon.contains(point):
            points.append(point)
            i += 1
    return points
# Create dataframe with selected points
crs = {'init': 'epsg:4326'}
points_result = pd.DataFrame(random_points_in_polygon(10,
Bdg.iloc[0].geometry))
points_result.columns=['geometry']
# Separate longitudes and latitudes
points_result['coords'] = points_result['geometry'].apply(lambda
x: x.representative_point().coords[:])
points_result['coords'] = [coords[0] for coords in
points_result['coords']]


points_result['lon']=[longs[0] for longs in points_result['coords']]
points_result['lat']=[lat[1] for lat in points_result['coords']]
points_final=points_result.drop(columns=["geometry","coords"])
# Save selected points into csv
points_final.to_csv('~selected points.csv')
# Create map
fig1, ax4 = plt.subplots(1,1,figsize=(8,6))
Bdg_dis.plot(ax=ax4,color='white',linewidth=1,edgecolor='gray',
alpha=0.4)
points_result=gpd.GeoDataFrame(points_result,crs=crs,geometry=
points_result['geometry'])
# Create plot
points_result.plot(ax=ax4,color='brown',markersize=10)
# Save map
plt.savefig('~Select_point map.png',dpi=600)

Selected 10 points

Slno Lon lat


0 89.54288 25.69371
1 91.41746 22.75663
2 90.65333 23.54728
3 90.65129 25.03515
4 89.52333 25.90696
5 90.03716 24.39989
6 90.33238 22.73423
7 88.99034 24.31096
8 89.6321 26.18273
9 90.84406 23.01838


Chapter 7
Measures of association: Crude analysis
Introduction
In health data science, crude analyses are statistical methods that are used to investigate the
relationship between two variables. The crude measures are essential for understanding the
distribution of diseases and health outcomes in populations and for identifying risk factors that
contribute to the occurrence of these outcomes. They are one approach to measuring association
that involves examining the overall relationship between an exposure and an outcome without
accounting for any other factors that may influence this relationship. The appropriate measure
depends on the data type, the study design, and the research question [35, 36].

7.1 Mean estimate

Mean estimation is a form of statistical inference: a sample is used to estimate the unknown
population mean. We assume that the continuous variable is normally distributed in the
population [37].
# Import packages
import pandas as pd
import scipy.stats as stat
# load data
df=pd.read_csv("data.csv",index_col="id")
# calculate mean
var=df["bmi"]
Mean=round(var.mean(),3)
# calculate standard error
se=round(var.sem(),3)
# calculate 95% confidence interval
ci_95=stat.norm.interval(alpha=0.95,loc=Mean,scale=se)
# Reduce decimal
ci_lu=round(ci_95[0],3)
ci_up=round(ci_95[1],3)


ci_95=((ci_lu,ci_up))
# Format Results
txt=("Mean={}\nStandard Error={}\n95% confidence interval={}")
# Print Results
print(txt.format(Mean,se,ci_95))
The output will be:
Mean =25.846
Standard Error= 0.062
95% confidence interval= (25.725, 25.967)
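For smaller samples, a t-based interval that also uses the sample size is a common alternative; the
sketch below is a minimal example and gives almost the same interval as the normal approximation
above when the sample is large.
# t-based 95% confidence interval (minimal sketch)
import pandas as pd
import scipy.stats as stat
df=pd.read_csv("data.csv",index_col="id")
var=df["bmi"]
ci_95_t=stat.t.interval(0.95, var.count()-1, loc=var.mean(), scale=var.sem())
print(tuple(round(v,3) for v in ci_95_t))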

7.2 Two Means Comparison

A two-means comparison compares the mean of a variable between two groups. The t-test is used
for this comparison; the null hypothesis is that there is no difference between the two means in the
population.
7.2.1 Independent two-means comparison
You assume that the observations in the two groups are independent.
# Import packages
import pandas as pd
from scipy.stats import ttest_ind
import scipy.stats as stat
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
df.dropna(subset=["totchol"],inplace=True)
# Classify data
x=df[df["sex"]==1]
y=df[df["sex"]==2]
# Calculate mean
mean_x=x["totchol"].mean()
mean_y=y["totchol"].mean()
# Calculate standard error
se_x=x["totchol"].sem()
se_y=y["totchol"].sem()

# Calculate 95% confidence interval


ci_95_x=stat.norm.interval(alpha=0.95,loc=mean_x,scale=se_x)
ci_95_y=stat.norm.interval(alpha=0.95,loc=mean_y,scale=se_y)
# Calculate mean difference
mean_diff=mean_x-mean_y
# Calculate T-test
tt=ttest_ind(x["totchol"],y["totchol"],equal_var=False)
# Format results
txt="Mean of x={} 95%CI={}\nMean of y={} 95%CI={}\n Mean
difference = {}"
# Print results
print(txt.format(mean_x,ci_95_x,mean_y,ci_95_y,mean_diff),tt)
Mean of x= 233.57976251935983 95%CI=(231.69351851842183, 235.46600652029784)
Mean of y=239.68139059304704 95%CI=(237.84934819211185, 241.51343299398224)
Mean difference = -6.101628073687209
Ttest_indResult(statistic=-4.54799360507726, pvalue=5.565456324180336e-06)

7.2.2 Paired two-means comparison

You assume that the observations in the two groups are paired.

# Import packages
import pandas as pd
from scipy.stats import ttest_rel
import scipy.stats as stat
# Data load
df=pd.read_csv("data.csv",index_col="id")
# 'totchol1' and 'totchol2' below are placeholder names for the two paired
# measurements (for example, cholesterol at the first and second examination);
# replace them with the paired columns in your own data
# Drop NaN
df.dropna(subset=["totchol1","totchol2"],inplace=True)
# Calculate mean of each paired measurement
mean_x=df["totchol1"].mean()
mean_y=df["totchol2"].mean()
# Calculate standard error

se_x=df["totchol1"].sem()
se_y=df["totchol2"].sem()
# Calculate 95% confidence interval
ci_95_x=stat.norm.interval(alpha=0.95,loc=mean_x,scale=se_x)
ci_95_y=stat.norm.interval(alpha=0.95,loc=mean_y,scale=se_y)
# Calculate mean difference
mean_diff=mean_x-mean_y
# Calculate paired t-test statistic
tt=ttest_rel(df["totchol1"],df["totchol2"])
# Format results
txt="Mean of x= {} 95%CI={}\nMean of y={} 95%CI={}\nMean difference = {}"
# Print results
print(txt.format(mean_x,ci_95_x,mean_y,ci_95_y,mean_diff),tt)

Mean of x= 236.3761638733706 95%CI=(234.9593250581809, 237.79300268856028)


Mean of y=249.54189944134077 95%CI=(248.08095757516355, 251.002841307518)
Mean difference = -13.165
Ttest_relResult(statistic=-24.235842456123002, pvalue=1.0929767627175546e-120)

7.3 Correlation coefficient estimate

The correlation coefficient between two continuous variables indicates the strength of their
relationship. A coefficient greater than zero indicates a positive relationship, a coefficient less than
zero indicates a negative relationship, and a value equal or close to zero indicates no relationship
between the two variables.
7.3.1 Pearson's correlation coefficient
Pearson's correlation coefficient measures the linear relationship between two continuous
variables and assumes that both variables are normally distributed. It measures the strength and
direction of the relationship between the two variables.
# Import packages
import pandas as pd
from scipy.stats import pearsonr
# Data load
df=pd.read_csv("data.csv",index_col="id")


# Drop NaN
df.dropna(subset=["totchol","age"],inplace=True)
# Calculate correlation coefficient and p-value
corr, p = pearsonr(df["age"], df["totchol"])
# Format results
txt="Pearson Correlation coefficient= {:0.4f}\np-value= {:0.4f}"
# Print results
print(txt.format(corr,p))

Pearson Correlation coefficient= 0.2493


p-value=0.0

7.3.2 Spearman's correlation coefficient

Spearman's correlation coefficient is a nonparametric measure of the relationship between two
variables. It is based on the ranks of the two variables.
# Import packages
import pandas as pd
from scipy.stats import spearmanr
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
subset=["totchol","heartrte"]
df.dropna(subset= subset, inplace=True)
# Calculate correlation coefficient and p-value
corr, p = spearmanr(df["totchol"], df["heartrte"])
# Format results
txt="Spearman's Correlation coefficient={:0.4f}\n p-
value={:0.4f}"
# Print results
print(txt.format(corr,p))

Spearman's Correlation coefficient= 0.0915


p-value=0.0

7.4 Prevalence estimate


In public health cross-sectional studies, prevalence refers to the proportion of individuals in a
population with a particular disease or condition at a given time. It is an important measure in
public health research as it provides an estimate of the burden of disease in a population.
Prevalence estimates are typically derived from cross-sectional studies, where data on multiple
outcomes and exposures are collected simultaneously at a single point in time from a representative
sample of the population. These studies can provide valuable information about the distribution of
disease in a population, as well as identify associated factors that may influence the occurrence
of diseases or events [2, 3, 7].

# Import packages
import pandas as pd
import scipy.stats as stat
# load data
df=pd.read_csv("data.csv",index_col="id")
# Calculate proportion
var=df["hyperten"]
Prop=round(var.mean(),3)
# Calculate standard error
se=round(var.sem(),3)
# Calculate 95% confidence interval
ci_95=stat.norm.interval(alpha=0.95,loc=Prop,scale=se)
# Round results
ci_lu=round(ci_95[0],3)
ci_up=round(ci_95[1],3)
ci_95=((ci_lu,ci_up))
# Format results
txt=("Prevalence ={}\nStandard Error= {}\n95% confidence interval= {}")
# Print results
print(txt.format(Prop,se,ci_95))
The output will be:
Prevalence =0.733
Standard Error= 0.007

95% confidence interval= (0.72, 0.746)
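An alternative to the normal approximation is to compute a binomial confidence interval directly with
statsmodels; the sketch below uses the Wilson method and assumes that hyperten is coded 0/1.
# Wilson confidence interval for the prevalence (minimal sketch)
import pandas as pd
from statsmodels.stats.proportion import proportion_confint
df=pd.read_csv("data.csv",index_col="id")
var=df["hyperten"].dropna()
low, upp = proportion_confint(count=var.sum(), nobs=var.count(),
                              alpha=0.05, method='wilson')
print(round(low,3), round(upp,3))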

7.5 Chi Square test

Chi-square test is a statistical test used to examine the association between two categorical
variables. It is one of the most commonly used tests in health data science. The test measures the
difference between the observed and expected frequencies of the variables in a contingency table.
The chi-square test can help to determine whether the observed differences between the groups are
due to chance or whether they are statistically significant [1, 2, 4].
The H0 (Null Hypothesis): There is no relationship between two variables.
The H1 (Alternative Hypothesis): There is a relationship between two variables.
# Import packages
import pandas as pd
from scipy.stats import chi2_contingency
# Load data
datf=pd.read_csv('data.csv',index_col='id')
# Calculate cross tabulation
my_tab=pd.crosstab(index=datf['death'],columns=datf['sex'])
# Calculate chi square test statistics
c, p, dof, expected = chi2_contingency(my_tab)
# Print results
print(f"Chi2 value={c}\n p-value= {p} \n Degrees of freedom= {dof} \n")
Chi2 value= 107.6078869510524
p-value= 2.355353265828955e-22
Degrees of freedom= 4
If the expected count in any cell is not greater than 5, the chi-square approximation is not reliable.
In that case, the Fisher exact test is the correct test to use.
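To check this assumption, you can inspect the expected counts that chi2_contingency already
returned above; a minimal check looks like this. The Fisher exact test itself is then run as shown
below.
# Inspect the expected cell counts from the chi-square test above
import numpy as np
print(np.round(expected, 2))
print("Any expected count < 5:", (expected < 5).any())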
# Import packages
import pandas as pd
from scipy.stats import fisher_exact
# Load data
df=pd.read_csv('data.csv',index_col='id')
# Calculate cross tabulation
my_tab=pd.crosstab(index=df['death'],columns=df['sex'])

# Calculate fisher exact test statistics


fisher=fisher_exact(my_tab,alternative='two-sided')
p_value=round(fisher[1],4)
# Print results
print("P-value= ", p_value )
P-value= 0.0

7.6 Risk difference estimate


One of the key objectives of public health research studies is to understand the association between
an exposure and a health outcome. Risk difference is a commonly used measure in public health
research to quantify the absolute difference in the risk of a health outcome between two groups.
This measure is particularly useful in assessing the public health impact of an exposure or
intervention, as it provides an estimate of the number of cases that could be prevented by
eliminating the exposure or implementing the intervention. Risk difference (RD) is calculated by
subtracting the risk of the outcome in the unexposed group from the risk of the outcome in the
exposed group. The result is expressed as the absolute risk difference, which represents the
difference in the risk of the outcome between the two groups in terms of the number of cases per
unit of population. It is particularly useful in cohort and intervention studies, where researchers
can measure exposure and follow participants over time to determine the incidence of the outcome.
RD can also be used in cross-sectional studies, where it is known as the prevalence difference (PD)
and exposure and outcome data are collected at the same point in time [35, 38, 39].

# Import packages
import pandas as pd
from zepid import RiskDifference
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Calculate risk difference
rd=RiskDifference()
rd.fit(df,exposure="hyperten",outcome="death")
# Print summary results
print(rd.summary())


======================================================================
Risk Difference
======================================================================
Risk SD(Risk) Risk_LCL Risk_UCL
Ref:0 0.298 0.013 0.272 0.324
1 0.368 0.008 0.352 0.385
----------------------------------------------------------------------
RiskDifference SD(RD) RD_LCL RD_UCL
Ref:0 0.000 NaN NaN NaN
1 0.071 0.016 0.04 0.101
----------------------------------------------------------------------

7.7 Risk ratio estimate

One important aspect of public health research is the quantification of the association between an
exposure and a health outcome. Risk ratio (RR) is a widely used in public health research to
estimate the strength of the association between an exposure and a health outcome. RR is defined
as the ratio of the risk of the outcome in the exposed group to the risk of the outcome in the
unexposed group. It provides an estimate of the relative risk of developing the outcome in the
exposed group compared to the unexposed group. RR values greater than 1 indicate an increased
risk of the outcome in the exposed group, while RR values less than 1 indicate a decreased risk.
RR is a useful measure in public health research because it allows researchers to compare the risk
of an outcome between groups with different levels of exposure. It is particularly useful in cohort
and intervention studies, where researchers can measure exposure and follow participants over
time to determine the incidence of the outcome. RR can also be used in cross-sectional studies,
where it is known as the prevalence ratio (PR) and exposure and outcome data are collected
simultaneously [6, 7, 40–42].

# Import packages
import pandas as pd
from zepid import RiskRatio
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Calculate Risk Ratio
rr=RiskRatio()
rr.fit(df,exposure="hyperten",outcome="death")


# Print results
print(rr.summary())

Comparison:0 to 1
+-----+-------+-------+
| | D=1 | D=0 |
+=====+=======+=======+
| E=1 | 1198 | 2054 |
+-----+-------+-------+
| E=0 | 352 | 830 |
+-----+-------+-------+

====================================================================
Risk Ratio
===================================================================
Risk SD(Risk) Risk_LCL Risk_UCL
Ref:0 0.298 0.013 0.272 0.324
1 0.368 0.008 0.352 0.385
----------------------------------------------------------------------
RiskRatio SD(RR) RR_LCL RR_UCL
Ref:0 1.000 NaN NaN NaN
1 1.237 0.05 1.121 1.365
----------------------------------------------------------------------
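As a quick hand check, the risk ratio can be computed directly from the 2x2 table printed above;
the small script below reproduces the two risks and the RR.
# Hand computation of the risks and risk ratio from the 2x2 table above
a, b = 1198, 2054   # exposed: events, non-events
c, d = 352, 830     # unexposed: events, non-events
risk_exposed = a/(a+b)
risk_unexposed = c/(c+d)
print(round(risk_exposed,3), round(risk_unexposed,3),
      round(risk_exposed/risk_unexposed,3))
# ~0.368, 0.298 and RR ~1.237, matching the zepid output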

7.8 Odds ratio estimate

Odds ratio (OR) is a widely used measure in public health research to estimate the strength of the
association between an exposure and a health outcome. OR is defined as the odds of exposure
among individuals with the outcome divided by the odds of exposure among individuals without
the outcome. It provides an estimate of the relative odds of developing the outcome in the exposed
group compared to the unexposed group. OR values greater than 1 indicate increased odds of
the outcome in the exposed group, while OR values less than 1 indicate decreased odds. OR is a
useful measure in public health research because it allows researchers to compare the odds of an
outcome between groups with different levels of exposure. It is particularly useful in case-control
studies, where exposure and outcome data are collected retrospectively. OR can also be used in
cross-sectional, cohort and intervention studies [43–46].
# Import packages
import pandas as pd
from zepid import OddsRatio
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Calculate odds ratio (OR)


OR=OddsRatio()
OR.fit(df,exposure="hyperten",outcome="death")
# Print result summary
print(OR.summary())
Comparison:0 to 1
+-----+-------+-------+
| | D=1 | D=0 |
+=====+=======+=======+
| E=1 | 1198 | 2054 |
+-----+-------+-------+
| E=0 | 352 | 830 |
+-----+-------+-------+
======================================================================
Odds Ratio
======================================================================
OddsRatio SD(OR) OR_LCL OR_UCL
Ref:0 1.000 NaN NaN NaN
1 1.375 0.073 1.191 1.588
----------------------------------------------------------------------
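As with the risk ratio, the odds ratio can be checked by hand from the 2x2 table printed above.
# Hand computation of the odds ratio from the 2x2 table above
a, b = 1198, 2054   # exposed: events, non-events
c, d = 352, 830     # unexposed: events, non-events
print(round((a*d)/(b*c), 3))   # ~1.375, matching the zepid output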

7.9 Incidence Rate Ratio


Incidence rate ratio (IRR) is a commonly used measure in public health research to estimate the
strength of the association between an exposure and the incidence of a health outcome over time.
IRR is defined as the ratio of the incidence rate of the outcome in the exposed group to the
incidence rate of the outcome in the unexposed group. It provides an estimate of the relative risk
of developing the outcome in the exposed group compared to the unexposed group over a specific
period of time. IRR values greater than 1 indicate an increased incidence of the outcome in the
exposed group, while IRR values less than 1 indicate a decreased incidence. It is a useful measure
in public health research because it allows researchers to account for differences in the duration of
follow-up and the time-at-risk between groups. It is particularly useful in cohort and intervention
studies, where researchers can measure exposure and follow participants over time to determine
the incidence of the outcome [5, 47, 48].

# Import packages
import pandas as pd
from zepid import IncidenceRateRatio
# Data load
df=pd.read_csv("Incidencerate.csv",index_col="id")
# Calculate incidence rate ratio

IRR=IncidenceRateRatio()
IRR.fit(df,exposure="sex",outcome="total_case",time="ptime")
# Print result summary
print(IRR.summary())

Comparison:0 to 1
+-----+-------+---------------+
| | D=1 | Person-time |
+=====+=======+===============+
| E=1 | 176 | 3502.94 |
+-----+-------+---------------+
| E=0 | 486 | 10467.8 |
+-----+-------+---------------+

======================================================================
Incidence Rate Ratio
======================================================================
IncRate SD(IncRate) IncRate_LCL IncRate_UCL
Ref:0 0.046 0.002 0.042 0.051
1 0.050 0.004 0.043 0.058
----------------------------------------------------------------------
IncRateRatio SD(IRR) IRR_LCL IRR_UCL
Ref:0 1.000 NaN NaN NaN
1 1.082 0.088 0.911 1.286
----------------------------------------------------------------------
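The incidence rate ratio can also be verified by hand from the event counts and person-time printed
above.
# Hand computation of the incidence rates and their ratio from the table above
rate_exposed = 176/3502.94
rate_unexposed = 486/10467.8
print(round(rate_exposed,3), round(rate_unexposed,3),
      round(rate_exposed/rate_unexposed,3))
# ~0.05, 0.046 and IRR ~1.082, matching the zepid output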

7.10 Diagnostic Test


In public health research studies, diagnostic tests are used to detect the presence or absence of a
particular disease or condition in individuals or populations. These tests play a critical role in
disease screening, diagnosis, and surveillance. Diagnostic tests are often used in combination with
other public health measures, such as incidence and prevalence, to provide a comprehensive
understanding of the burden of disease in a population. It can be classified into two categories:
screening tests and confirmatory tests. Screening tests are used to identify individuals who may
have the disease and require further diagnostic evaluation. Confirmatory tests, on the other hand,
are used to confirm or rule out the presence of a disease after a positive screening test. The accuracy
of a diagnostic test is typically assessed using measures such as sensitivity, specificity, positive
predictive value, negative predictive value, and likelihood ratios. Sensitivity refers to the
proportion of individuals with the disease who test positive on the diagnostic test, while specificity
refers to the proportion of individuals without the disease who test negative on the test. Positive
predictive value and negative predictive value are used to estimate the probability of disease given
a positive or negative test result, respectively [38, 49, 50].

# Import packages
import pandas as pd
from zepid import Diagnostics
# Data load
df=pd.read_csv("Testdata.csv",index_col="id")
# Calculate sensitivity and specificity
Dia=Diagnostics()
Dia.fit(df,test="art", disease="dead")
# Print result summary
print(Dia.summary())

+----+------+------+
| | D+ | D- |
+====+======+======+
| T+ | 10 | 67 |
+----+------+------+
| T- | 77 | 363 |
+----+------+------+
======================================================================
Diagnostics
======================================================================
Sensitivity SD(Se) Se_LCL Se_UCL
0 0.13 0.038 0.055 0.205
Specificity SD(Sp) Sp_LCL Sp_UCL
0 0.825 0.018 0.789 0.861
======================================================================


Chapter 8
Regression analysis for adjusting variables and
clustering effect
Introduction
Public health research studies aim to identify associations between exposures and health outcomes.
However, these associations can be confounded by other variables that are associated with both
the exposure and the outcome, and which may misrepresent the true effect of the exposure.
Confounding can lead to biased estimates of the association and wrong conclusions. One approach
to addressing confounding in health data science is through multivariate regression analysis.
Multivariate regression analysis allows us to adjust for the effects of potential confounding
variables and estimate the effect of the exposure of interest on the outcome while controlling for
the influence of confounders. Additionally, public health research studies often include the
investigation of outcomes in populations or groups of individuals. In many cases, the data collected
from these studies show a clustering effect, where individuals within the same cluster or group are
more similar to each other than they are to individuals in other clusters. This clustering effect can
lead to biased estimates of the effects of risk factors or interventions if it is not properly accounted
for in the analysis. Adjusting for the clustering effect is essential for obtaining accurate and precise
estimates of the effects of risk factors on health outcomes. This can be achieved through the use
of appropriate regression models that account for the correlation among individuals within the
same cluster.
This chapter will focus on multivariate regression analysis as a tool for adjusting for confounding
and clustering effects in health data science. It will provide an overview of the regression models
commonly used in health data science, including linear regression, logistic regression, Poisson
regression, and Cox proportional hazards regression, as well as models used to account for
clustering, including generalized estimating equations (GEE), random effects models, and mixed
effects models. Overall, this chapter aims to provide a comprehensive introduction to regression
analysis for adjusting covariates and clustering effects in health data science.

8.1 Simple Linear Regression


Linear regression analysis is a statistical technique in health data science to investigate the
relationship between an exposure and an outcome. The method involves fitting a linear equation
to the data, with the aim of estimating the effect of exposure on the outcome as a crude estimate.

# Import packages
import pandas as pd
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
subset=["totchol","bmi"]
df.dropna(subset= subset ,inplace=True)
# Set variables
y=df["totchol"]
x=df["bmi"]
# Add constant in the regression
x=sm.add_constant(x)
# Fit linear regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Print result summary
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: totchol R-squared: 0.015
Model: OLS Adj. R-squared: 0.015
Method: Least Squares F-statistic: 66.65
Date: Mon, 20 Mar 2023 Prob (F-statistic): 4.20e-16
Time: 10:32:23 Log-Likelihood: -22737.
No. Observations: 4364 AIC: 4.548e+04
Df Residuals: 4362 BIC: 4.549e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------------------
const 202.4903 4.284 47.262 0.000 194.091 210.890
bmi 1.3371 0.164 8.164 0.000 1.016 1.658


8.2 Multiple linear regression


Multiple Linear regression can be used to examine the association between risk factors and
outcome of interest, to identify potential confounding factors, and to estimate the magnitude of
effect of an exposure or intervention on an outcome.

# Import packages
import pandas as pd
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
subset=["totchol","bmi","age","cigpday"]
df.dropna(subset= subset, inplace=True)
# Set variables
y=df["totchol"]
x=df[["bmi","age","cigpday"]]
# Add constant
x=sm.add_constant(x)
# Fit regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Print result summary
print(res.summary())

==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------------------------------
const 146.6426 5.497 26.676 0.000 135.865 157.420
bmi 1.0281 0.161 6.370 0.000 0.712 1.345
age 1.2601 0.077 16.296 0.000 1.108 1.412
cigpday 0.1033 0.056 1.846 0.065 -0.006 0.213
==============================================================================

8.3 Linear regression with continuous and categorical variables


# Import packages


import pandas as pd
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
# Set column names for sex and diabetes
sex_dummy.columns=["Male","Female"]
diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy],axis=1)
# Drop NaN
subset=["totchol","bmi","age","cigpday","Male","Female",
        "nodiabetes","Diabetes"]
df1.dropna(subset= subset,inplace=True)
# Set x and y variables
y=df1["totchol"]
x=df1[["bmi","age","cigpday","Female","Diabetes"]]
# Add constant
x=sm.add_constant(x)
# Fit OLS regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Print results
print(res.summary())

==============================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------
const 138.0041 5.715 24.149 0.000 126.801 149.208
bmi 1.1181 0.162 6.897 0.000 0.800 1.436
age 1.2741 0.077 16.453 0.000 1.122 1.426
cigpday 0.2175 0.059 3.682 0.000 0.102 0.333
Female 8.1163 1.393 5.828 0.000 5.386 10.846
Diabetes 2.4708 4.036 0.612 0.540 -5.442 10.384
==============================================================================

8.4 Multiple linear regression with categorical exposure


Linear regression with categorical exposure is a widely used statistical method in public health
research studies to estimate the effect size of the exposure or investigate the relationship between
an outcome and exposure variable considering confounders. These models allow for the estimation
of the mean difference associated with the continuous outcome and categorical exposure variable,
and for the estimation of the risk (or prevalence for cross-sectional studies) difference associated
with the binary outcome and categorical exposure [7, 10].
Adjusted mean difference estimate

# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
# Create columns name
sex_dummy.columns=["Male","Female"]
diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy],axis=1)
# Drop NaN
subset=["totchol","bmi","age","cigpday","Female","Diabetes"]
df1.dropna(subset= subset, inplace=True)
# Set x and y variables
y=df1["totchol"]
x=df1[["Female","bmi", "age","Diabetes"]]
# Add constant
x=sm.add_constant(x)
# Fit regression model
mod=sm.OLS(y,x)
res=mod.fit()


# Format results
res_means=np.round(res.params,3)
res_pvalues=np.round(res.pvalues,3)
res_ci=np.round(res.conf_int(),3)
column=["Mean Difference"]
model_Means=pd.DataFrame(res_means,columns=column)
model_Means["p-values"]=res_pvalues
model_Means[["2.5% ", " 97.5%"]]=res_ci
# Print results
print(model_Means)

Mean Difference p-values 2.5% 97.5%


const 145.048 0.000 134.476 155.621
Female 6.423 0.000 3.842 9.003
bmi 1.057 0.000 0.741 1.374
age 1.222 0.000 1.073 1.372
Diabetes 2.151 0.595 -5.772 10.074

Adjusted risk or prevalence difference estimate

# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
bins=[0,200,np.inf]
names=["Low","High"]
df['new_chol']=pd.cut(df["totchol"],bins,labels=names)
chol_dummy=pd.get_dummies(df["new_chol"])
# Create column names
sex_dummy.columns=["Male","Female"]


diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy,chol_dummy],axis=1)
# Drop NaN
subset=["prevchd","High","bmi","age","Male","Diabetes"]
df1.dropna(subset= subset, inplace=True)
y=df1["prevchd"]
x=df1[["High","Male","bmi", "age","Diabetes"]]
# Add constant
x=sm.add_constant(x)
# Fit regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Format results
res_rd=np.round(res.params,3)
res_pvalues=np.round(res.pvalues,3)
res_ci=np.round(res.conf_int(),3)
columns=["Prevalence Difference"]
model_Rd=pd.DataFrame(res_rd,columns=columns)
model_Rd["p-values"]=res_pvalues
model_Rd[["2.5% ", " 97.5%"]]=res_ci
# Print results
print(model_Rd)

Prevalence Difference p-values 2.5% 97.5%


const -0.206 0.000 -0.254 -0.158
High chol -0.021 0.005 -0.035 -0.006
Male 0.036 0.000 0.024 0.048
bmi 0.001 0.204 -0.001 0.002
age 0.005 0.000 0.004 0.005
Diabetes 0.031 0.102 -0.006 0.067

8.5 Poisson regression


Poisson regression is a statistical method used to model count data, which is a type of data where
the response variable represents the number of occurrences of a particular event or phenomenon.


It is commonly used in public health research studies, where the data of interest involves the
frequency of occurrences of a certain event or outcome. This regression model allows us to
investigate the relationship between the outcome variable, and an exposure, while taking into
account the potential confounders effects. The model assumes that the logarithm of the expected
count is a linear function of the variables [3, 22, 25].
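In symbols, the Poisson model can be written as log E[Y | X] = β0 + β1X1 + … + βkXk, so exp(βj) is
interpreted as the rate ratio (RR) for a one-unit increase in Xj; this is why the code below
exponentiates the fitted coefficients with np.exp().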

# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load dataset
df=pd.read_csv("London.csv",index_col="slno")
# Drop NaN values
subset=["temperature","relative_humidity","numdeaths","ozone10"]
df.dropna(subset= subset, inplace=True)
# Set variables
y=df["numdeaths"]
x=df[["temperature","relative_humidity","ozone10"]]
# Add constant
x=sm.add_constant(x)
# Fit Poisson regression model
mod=sm.Poisson(y,x)
res=mod.fit()
# Format results
res_rr=np.exp(res.params)
res_pvalues=res.pvalues
res_ci=np.exp(res.conf_int())
model_RR=pd.DataFrame(res_rr,columns=["RR"])
model_RR["p-values"]=res_pvalues
model_RR[["2.5%","97.5%"]]= res_ci
# Print results
print(model_RR.head())

RR p-values 2.5% 97.5%


const 152.884933 0.000000e+00 148.276959 157.636108
temperature 0.989865 1.271853e-146 0.989099 0.990631

relative_humidity 1.000946 3.128194e-08 1.000611 1.001281


ozone10 1.008685 1.648941e-11 1.006150 1.011227

Poisson regression model for binary outcome


# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
# Create column names
sex_dummy.columns=["Male","Female"]
diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy],axis=1)
# Drop NaN values
subset=["prevstrk","totchol","bmi","age","cigpday","Female",
"Diabetes"]
df1.dropna(subset= subset, inplace=True)
# Set variables
y=df1["prevstrk"]
x=df1[["Diabetes","totchol","bmi","age","cigpday","Female"]]
# Add constant
x=sm.add_constant(x)
# Fit Poisson regression model
mod=sm.Poisson(y,x)
res=mod.fit()
# Calculate exponential values
res_rr=np.exp(res.params)
res_pvalues=res.pvalues
res_ci=np.exp(res.conf_int())
model_RR=pd.DataFrame(res_rr,columns=["RR"])

model_RR["p-values"]=res_pvalues
model_RR[["2.5%","97.5%"]]=res_ci
# Print results
print(model_RR.head(10))

RR p-values 2.5% 97.5%


const 0.000093 8.368115e-07 0.000002 0.003737
Diabetes 1.593784 5.311900e-01 0.370535 6.855344
totchol 0.997813 6.235219e-01 0.989127 1.006575
bmi 1.038388 3.554151e-01 0.958656 1.124752
age 1.078790 1.407633e-03 1.029721 1.130197
cigpday 0.976701 2.719767e-01 0.936471 1.018659
Female 0.866610 7.170002e-01 0.399598 1.879423

8.6 Logistic regression


Logistic regression is often used in public health research studies to investigate the relationship
between a binary or dichotomous outcome variable and an exposure. The multiple logistic
regression model allows us to adjust for one or more continuous or categorical confounding
variables and interaction variables. This model is widely used in public health research to estimate
the odds ratio (OR) investigating the relationship between risk factors and outcome [1, 51].
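In symbols, the logistic model can be written as log(p/(1−p)) = β0 + β1X1 + … + βkXk, where p is the
probability of the outcome, so exp(βj) is interpreted as the odds ratio (OR) for a one-unit increase
in Xj; this is why the code below exponentiates the fitted coefficients with np.exp().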

# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
# Create column names
sex_dummy.columns=["Male","Female"]
diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy],axis=1)
# Drop NaN value
subset=["prevstrk","totchol","bmi","age","cigpday","Female",
"Diabetes"]

df1.dropna(subset= subset, inplace=True)


# Set variables
y=df1["prevstrk"]
x=df1[["totchol","bmi","age","cigpday","Female","Diabetes"]]
# Fit logistic regression model
mod=sm.Logit(y,x)
model=mod.fit()
# Calculate exponential values
model_or=np.exp(model.params)
model_pvalues=model.pvalues
model_ci=np.exp(model.conf_int())
model_OR=pd.DataFrame(model_or,columns=["OR"])
model_OR["p-values"]=model_pvalues
model_OR[["2.5%","97.5%"]]=model_ci
# Print results
print(model_OR.head(10))

OR p-values 2.5% 97.5%


Diabetes 2.551817 0.210711 0.588495 11.065122
totchol 0.988493 0.010714 0.979745 0.997319
bmi 0.899265 0.006292 0.833314 0.970436
age 1.019309 0.308558 0.982468 1.057531
cigpday 0.947082 0.012960 0.907326 0.988581
Female 0.616844 0.181453 0.303684 1.252935

8.7 Conditional logistic regression


Conditional logistic regression is widely used to analyze matched case-control studies, where the
controls are selected based on matching criteria. This method is commonly used in public health
research studies to investigate the relationship between a binary outcome variable, an exposure
and one or more confounding variables, which estimates the odds ratio for the matched groups.
This allows for a more precise estimation of the effect size of the predictor variables, as the
matching criteria reduce the variability in the data [52].

# Import packages
import pandas as pd
import numpy as np


import statsmodels.api as sm
from statsmodels.discrete.conditional_models import ConditionalLogit
# Load data
df=pd.read_csv("paireddata.csv",index_col="slno")
# Drop NaN values
subset=["idcode","year", "age","grade", "not_smsa",
"south","union","black"]
df.dropna(subset= subset, inplace=True)
# Set variables
y=df["union"]
x=df[["age","grade","not_smsa","south","black"]]
group=df["idcode"]
# Fit conditional logistic regression model
mod=ConditionalLogit(y,x,groups=group)
model=mod.fit()
# Calculate exponential values
model_or=np.exp(model.params)
model_pvalues=model.pvalues
model_ci=np.exp(model.conf_int())
model_OR=pd.DataFrame(model_or,columns=["OR"])
model_OR["p-vlaues"]=model_pvalues
model_OR[["2.5%","97.5%"]]=model_ci
# Print results
print(model_OR.head())
                OR      p-values      2.5%     97.5%
age       1.017171  4.018569e-05  1.008939  1.025470
grade     1.089231  4.125536e-02  1.003397  1.182406
not_smsa  1.009544  9.328905e-01  0.809308  1.259320
south     0.473415  2.314974e-09  0.370420  0.605047
black     1.000000  1.000000e+00  0.000000       inf

Note that black does not vary within the matched groups defined by idcode, so its effect cannot be estimated from the conditional likelihood; this is why its odds ratio is reported as 1 with an unbounded confidence interval.
8.8 GEE model
A generalized estimating equations (GEE) model is a type of regression often used in public health research to analyze repeated-measures and longitudinal data. Unlike traditional regression models, a GEE model takes into account the correlation between repeated measurements on the same individual, or between observations within the same cluster, and therefore allows the analysis of complex data structures such as clustered data.
GEE models can accommodate a wide range of outcome variables, including binary, categorical, and continuous variables. The method is particularly useful when the outcome variable is not normally distributed, for example count or time-to-event data [24, 53].
# Import packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
# Load data
df=pd.read_csv("Geedata.csv",index_col="slno")
# Drop NaN values
subset=["id","numvisit","age","educ","married","badh","loginc",
        "reform","summer"]
df.dropna(subset=subset, inplace=True)
# Set working covariance structure (e.g. independence, exchangeable or autoregressive)
ind=sm.cov_struct.Exchangeable()
# Set regression family
family=sm.families.Poisson()
# Fit GEE model
mod=smf.gee("numvisit ~ reform+age+educ+married+badh+loginc+summer",
            "id", df, cov_struct=ind, family=family)
model=mod.fit()
# Calculate exponential values (rate ratios) and 95% CIs
model_rr=np.exp(model.params)
model_pvalues=model.pvalues
model_ci=np.exp(model.conf_int())
model_RR=pd.DataFrame(model_rr,columns=["RR"])
model_RR["p-values"]=model_pvalues
model_RR[["2.5%","97.5%"]]=model_ci
# Print summary and results
print(model.summary())
print(model_RR.head(10))

GEE Regression Results


===================================================================================
Dep. Variable: numvisit No. Observations: 2227
Model: GEE No. clusters: 1518
Method: Generalized Min. cluster size: 1
Estimating Equations Max. cluster size: 2
Family: Poisson Mean cluster size: 1.5
Dependence structure: Exchangeable Num. iterations: 7
Date: Mon, 20 Mar 2023 Scale: 1.000
Covariance type: robust Time: 14:42:21
==============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------------------
Intercept -0.3825 0.576 -0.664 0.506 -1.511 0.746
reform -0.1220 0.053 -2.311 0.021 -0.226 -0.019
age 0.0052 0.003 1.572 0.116 -0.001 0.012
educ -0.0093 0.012 -0.791 0.429 -0.032 0.014
married 0.0375 0.070 0.534 0.593 -0.100 0.175
badh 1.1051 0.087 12.642 0.000 0.934 1.276
loginc 0.1398 0.079 1.759 0.079 -0.016 0.296
summer -0.0265 0.088 -0.299 0.765 -0.200 0.147
==============================================================================
Skew: 4.5617 Kurtosis: 40.9168
Centered skew: 0.0000 Centered kurtosis: 14.1065
                RR      p-values      2.5%     97.5%
Intercept  0.682156  5.064052e-01  0.220739  2.108088
reform     0.885120  2.081513e-02  0.798106  0.981620
age        1.005260  1.159082e-01  0.998707  1.011855
educ       0.990702  4.291000e-01  0.968027  1.013909
married    1.038257  5.930670e-01  0.904701  1.191528
badh       3.019655  1.232504e-36  2.544186  3.583980
loginc     1.150084  7.856368e-02  0.984158  1.343985
summer     0.973872  7.647633e-01  0.818821  1.158283
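The working correlation structure is chosen through the cov_struct argument. Besides Exchangeable, statsmodels also provides other structures such as Independence (no within-cluster correlation) and Autoregressive (correlation that decays with the time ordering, which requires suitably ordered or time-indexed data). The following is a minimal sketch, assuming the same Geedata.csv preparation as above, of refitting the model with an independence working structure:

# Minimal sketch (assumption): same GEE model with an independence working structure
import statsmodels.api as sm
import statsmodels.formula.api as smf
ind_struct=sm.cov_struct.Independence()
mod_ind=smf.gee("numvisit ~ reform+age+educ+married+badh+loginc+summer",
                "id", df, cov_struct=ind_struct,
                family=sm.families.Poisson())
model_ind=mod_ind.fit()
print(model_ind.summary())

Because GEE uses robust (sandwich) standard errors, estimates and inference are often similar across reasonable working structures, although the choice can affect efficiency.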
8.9 Mixed effect model
Mixed effects models, also known as hierarchical or multilevel models, are commonly used in public health research to analyze clustered and longitudinal data. These models account for the correlation among observations within the same cluster or individual and for the variability between clusters or individuals. They allow the analysis of complex data structures in which the outcome is influenced by both individual-level and group-level factors, and can accommodate a wide range of outcome variables, including binary, categorical, and continuous variables [19].
# Import packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Load data
df=pd.read_csv("mixeddata.csv")
# Drop NaN values
subset=["kid","mom","cluster","immun","kid2p","mom25p","order23",
        "order46","order7p","indNoSpa","indSpa","momEdPri","momEdSec",
        "husEdPri","husEdSec","husEdDK","momWork","rural","pcInd81"]
df.dropna(subset=subset, inplace=True)
# Fit model with one cluster term
# (see the note at the end of this section on how the (1|cluster) term is interpreted)
mod=smf.glm("immun ~ kid2p+mom25p+order23+order46+order7p+indNoSpa+"
            "indSpa+momEdPri+momEdSec+husEdPri+husEdSec+husEdDK+"
            "momWork+rural+pcInd81+(1|cluster)", df,
            family=sm.families.Binomial())
model=mod.fit()
# Calculate exponential coefficients (OR) and 95% CIs
model_or=np.exp(model.params)
model_pvalues=model.pvalues
model_ci=np.exp(model.conf_int())
model_OR=pd.DataFrame(model_or,columns=["OR"])
model_OR["p-values"]=model_pvalues
model_OR[["2.5%","97.5%"]]=model_ci
# Print results and summary
print(model.summary())
print(model_OR.head(20))
                    OR      p-values      2.5%     97.5%
Intercept     0.405693  3.628524e-04  0.247079  0.666129
kid2p         2.583842  1.052072e-16  2.064899  3.233205
mom25p        0.921962  5.034879e-01  0.726665  1.169746
order23       0.922585  5.488956e-01  0.708899  1.200682
order46       1.111649  5.080353e-01  0.812553  1.520841
order7p       1.180221  4.012233e-01  0.801567  1.737749
indNoSpa      1.332124  1.499632e-01  0.901548  1.968341
indSpa        1.295360  1.196573e-01  0.935063  1.794485
momEdPri      1.297698  1.400330e-02  1.054146  1.597521
momEdSec      1.329634  2.319628e-01  0.833390  2.121367
husEdPri      1.360955  5.455990e-03  1.095066  1.691405
husEdSec      1.269819  2.324957e-01  0.857910  1.879498
husEdDK       1.040264  8.237447e-01  0.734996  1.472320
momWork       1.310848  5.022884e-03  1.085001  1.583705
rural         0.592969  6.597525e-06  0.472404  0.744303
pcInd81       0.447531  1.026133e-04  0.298287  0.671447
1 | cluster   1.001133  1.701459e-01  0.999515  1.002754
Multilevel logistic regression modelling with two cluster variables
# Import packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Load data
df=pd.read_csv("mixeddata.csv")
# Drop NaN values
subset=["kid","mom","cluster","immun","kid2p","mom25p","order23",
        "order46","order7p","indNoSpa","indSpa","momEdPri","momEdSec",
        "husEdPri","husEdSec","husEdDK","momWork","rural","pcInd81"]
df.dropna(subset=subset, inplace=True)
# Fit model with two nested cluster terms
# (see the note below on how the (1|cluster/mom) term is interpreted)
mod=smf.glm("immun ~ kid2p+mom25p+order23+order46+order7p+indNoSpa+"
            "indSpa+momEdPri+momEdSec+husEdPri+husEdSec+husEdDK+"
            "momWork+rural+pcInd81+(1|cluster/mom)", df,
            family=sm.families.Binomial())
model=mod.fit()
# Calculate exponential coefficients (OR) and 95% CIs
model_or=np.exp(model.params)
model_pvalues=model.pvalues
model_ci=np.exp(model.conf_int())
model_OR=pd.DataFrame(model_or,columns=["OR"])
model_OR["p-values"]=model_pvalues
model_OR[["2.5%","97.5%"]]=model_ci
# Print results
print(model_OR.head(20))
                      OR      p-values      2.5%     97.5%
Intercept       0.250969  7.255013e-05  0.126780  0.496811
kid2p           2.587454  1.021801e-16  2.067297  3.238488
mom25p          0.928174  5.400648e-01  0.731278  1.178085
order23         0.925053  5.625228e-01  0.710651  1.204138
order46         1.112853  5.041626e-01  0.813161  1.522998
order7p         1.173857  4.173833e-01  0.796834  1.729270
indNoSpa        1.401172  9.328863e-02  0.944963  2.077630
indSpa          1.341996  7.862690e-02  0.966899  1.862608
momEdPri        1.312157  1.060141e-02  1.065364  1.616121
momEdSec        1.314432  2.528989e-01  0.822605  2.100315
husEdPri        1.362355  5.345458e-03  1.095967  1.693491
husEdSec        1.283147  2.139163e-01  0.866020  1.901188
husEdDK         1.051469  7.771636e-01  0.742765  1.488473
momWork         1.283140  1.023703e-02  1.060785  1.552104
rural           0.578759  2.877296e-06  0.460277  0.727739
pcInd81         0.368574  1.295487e-05  0.235344  0.577228
1 | cluster     1.009496  2.511188e-02  1.001181  1.017880
1 | cluster:mom 0.999998  4.429169e-02  0.999995  1.000000
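A caution on the formula syntax used above: statsmodels' GLM does not understand lme4/R-style random-effect terms, so (1|cluster) and (1|cluster/mom) are not fitted as random intercepts; the resulting terms (labelled 1 | cluster and 1 | cluster:mom in the output) enter the model as ordinary fixed covariates. One way to fit a genuine random-intercept logistic model in statsmodels is the Bayesian mixed GLM interface. The following is a minimal sketch under that assumption, reusing the df prepared from mixeddata.csv above.

# Minimal sketch (assumption): random-intercept logistic model via a Bayesian mixed GLM
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM
# One variance component: a random intercept for each community cluster
vc_formulas={"cluster":"0 + C(cluster)"}
mixed=BinomialBayesMixedGLM.from_formula(
    "immun ~ kid2p+mom25p+order23+order46+order7p+indNoSpa+indSpa+"
    "momEdPri+momEdSec+husEdPri+husEdSec+husEdDK+momWork+rural+pcInd81",
    vc_formulas, df)
result=mixed.fit_vb()    # variational Bayes fit
print(result.summary())

Fixed-effect coefficients from this model are on the log-odds scale and can be exponentiated to odds ratios in the same way as in the earlier examples.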
References
1. Rashid MM, Akhtar Z, Chowdhury S, Islam MA, Parveen S, Ghosh PK, Rahman A, Khan
ZH, Islam K, Debnath N (2022) Pattern of antibiotic use among hospitalized patients
according to WHO access, watch, reserve (AWaRe) classification: Findings from a point
prevalence survey in Bangladesh. Antibiotics 11:810

2. Islam MA, Akhtar Z, Hassan MZ, Chowdhury S, Rashid MM, Aleem MA, Ghosh PK, Mah-
E-Muneer S, Parveen S, Ahmmed MK (2022) Pattern of antibiotic dispensing at pharmacies
according to the WHO Access, Watch, Reserve (AWaRe) classification in Bangladesh.
Antibiotics 11:247

3. Ghosh PK, Das P, Goswam DR, Islam A, Chowdhury S, Mollah MM, Harun GD, Akhtar Z,
Chowdhury F (2021) Maternal Characteristics Mediating the Impact of Household Poverty
on the Nutritional Status of Children Under 5 Years of Age in Bangladesh. Food Nutr Bull
42:389–398

4. Chowdhury F, Shahid ASMSB, Ghosh PK, Rahman M, Hassan MZ, Akhtar Z, Muneer SM-
E-, Shahrin L, Ahmed T, Chisti MJ (2020) Viral etiology of pneumonia among severely
malnourished under-five children in an urban hospital, Bangladesh. PloS One 15:e0228329

5. Biswas D, Ahmed M, Roguski K, Ghosh PK, Parveen S, Nizame FA, Rahman MZ,
Chowdhury F, Rahman M, Luby SP (2019) Effectiveness of a behavior change intervention
with hand sanitizer use and respiratory hygiene in reducing laboratory-confirmed influenza
among schoolchildren in Bangladesh: a cluster randomized controlled trial. Am J Trop Med
Hyg 101:1446–1455

6. Halder AK, Luby SP, Akhter S, Ghosh PK, Johnston RB, Unicomb L (2017) Incidences and
costs of illness for diarrhea and acute respiratory infections for children <5 years of age in
rural Bangladesh. Am J Trop Med Hyg 96:953

7. Alam M-U, Luby SP, Halder AK, Islam K, Opel A, Shoab AK, Ghosh PK, Rahman M,
Mahon T, Unicomb L (2017) Menstrual hygiene management among Bangladeshi
adolescent schoolgirls and risk factors affecting school absence: results from a cross-
sectional survey. BMJ Open 7:e015508

8. Chowdhury S, Barai L, Afroze SR, Ghosh PK, Afroz F, Rahman H, Ghosh S, Hossain MB,
Rahman MZ, Das P (2022) The epidemiology of melioidosis and its association with
diabetes mellitus: a systematic review and meta-analysis. Pathogens 11:149

9. Sultana R, Luby SP, Gurley ES, Rimi NA, Swarna ST, Khan JA, Nahar N, Ghosh PK,
Howlader SR, Kabir H (2021) Cost of illness for severe and non-severe diarrhea borne by
households in a low-income urban community of Bangladesh: A cross-sectional study. PLoS
Negl Trop Dis 15:e0009439

10. Islam A, McKee C, Ghosh PK, Abedin J, Epstein JH, Daszak P, Luby SP, Khan SU, Gurley
ES (2021) Seasonality of date palm sap feeding behavior by bats in Bangladesh. EcoHealth
18:359–371

11. Chowdhury F, Shahid ASMSB, Tabassum M, Parvin I, Ghosh PK, Hossain MI, Alam NH,
Faruque A, Huq S, Shahrin L (2021) Vitamin D supplementation among Bangladeshi
children under-five years of age hospitalised for severe pneumonia: A randomised placebo
controlled trial. Plos One 16:e0246460

12. Akhtar Z, Islam MA, Aleem MA, Mah-E-Muneer S, Ahmmed MK, Ghosh PK, Rahman M,
Rahman MZ, Sumiya MK, Rahman MM (2021) SARS-CoV-2 and influenza virus
coinfection among patients with severe acute respiratory infection during the first wave of
COVID-19 pandemic in Bangladesh: a hospital-based descriptive study. BMJ Open
11:e053768

13. Akhtar Z, Chowdhury F, Rahman M, Ghosh PK, Ahmmed MK, Islam MA, Mott JA, Davis
W (2021) Seasonal influenza during the COVID-19 pandemic in Bangladesh. PLoS One
16:e0255646

14. Akhtar Z, Chowdhury F, Aleem MA, Ghosh PK, Rahman M, Rahman M, Hossain ME,
Sumiya MK, Islam AM, Uddin MJ (2021) Undiagnosed SARS-CoV-2 infection and
outcome in patients with acute MI and no COVID-19 symptoms. Open Heart 8:e001617

15. Akhtar Z, Aleem MA, Ghosh PK, Islam AM, Chowdhury F, MacIntyre CR, Fröbert O
(2021) In-hospital and 30-day major adverse cardiac events in patients referred for ST-
segment elevation myocardial infarction in Dhaka, Bangladesh. BMC Cardiovasc Disord
21:1–9

16. Ghosh P, Mollah MM (2020) The risk of public mobility from hotspots of COVID-19 during
travel restriction in Bangladesh. J Infect Dev Ctries 14:732–736

17. Ghosh PK, Mollah MMH, Chowdhury AA, Alam N, Harun GD (2020) Hypertension and
sex related differences in mortality of COVID-19 infection: A systematic review and Meta-
analysis

18. Ghosh P (2020) The Dissimilarity of Attack Rate (AR) of SARS-CoV-2 Virus and Infection
Fatality Risk (IFR) Across Different Divisions of Bangladesh. J Trop Dis 8:356

19. Chowdhury S, Azziz-Baumgartner E, Kile JC, Hoque MA, Rahman MZ, Hossain ME,
Ghosh PK, Ahmed SS, Kennedy ED, Sturm-Ramirez K (2020) Association of biosecurity
and hygiene practices with environmental contamination with influenza a viruses in live bird
markets, Bangladesh. Emerg Infect Dis 26:2087

20. Chowdhury F, Shahid ASMSB, Ghosh PK, Rahman M, Hassan MZ, Akhtar Z, Muneer SM-
E-, Shahrin L, Ahmed T, Chisti MJ (2020) Viral etiology of pneumonia among severely
malnourished under-five children in an urban hospital, Bangladesh. PloS One 15:e0228329

21. Chowdhury S, Hossain ME, Ghosh PK, Ghosh S, Hossain MB, Beard C, Rahman M,
Rahman MZ (2019) The pattern of highly pathogenic avian influenza H5N1 outbreaks in
South Asia. Trop Med Infect Dis 4:138

22. Biswas D, Ahmed M, Roguski K, Ghosh PK, Parveen S, Nizame FA, Rahman MZ,
Chowdhury F, Rahman M, Luby SP (2019) Effectiveness of a behavior change intervention
with hand sanitizer use and respiratory hygiene in reducing laboratory-confirmed influenza
among schoolchildren in Bangladesh: a cluster randomized controlled trial. Am J Trop Med
Hyg 101:1446–1455

23. Chowdhury F, Ghosh PK, Shahunja K, Shahid AS, Shahrin L, Sarmin M, Sharifuzzaman,
Afroze F, Chisti MJ (2018) Hyperkalemia was an independent risk factor for death while
under mechanical ventilation among children hospitalized with diarrhea in Bangladesh. Glob
Pediatr Health 5:2333794X17754005

24. Parvez SM, Kwong L, Rahman MJ, Ercumen A, Pickering AJ, Ghosh PK, Rahman MZ, Das
KK, Luby SP, Unicomb L (2017) Escherichia coli contamination of child complementary
foods and association with domestic hygiene in rural Bangladesh. Trop Med Int Health
22:547–557

25. Halder AK, Luby SP, Akhter S, Ghosh PK, Johnston RB, Unicomb L (2017) Incidences and
costs of illness for diarrhea and acute respiratory infections for children <5 years of age in
rural Bangladesh. Am J Trop Med Hyg 96:953

26. Horng L, Unicomb L, Alam M-U, Halder A, Ghosh P, Luby S (2015) Health Worker and
Family Caregiver Hand Hygiene in Bangladesh Healthcare Facilities: Results From a
Nationally Representative Survey. Infectious Diseases Society of America, p 1621

27. Chen X (2022) Quantitative Descriptive Epidemiology. In: Quantitative Epidemiology.
Springer, pp 61–89

28. Aquino JEAP de, Cruz Filho NA, Aquino JNP de (2011) Epidemiology of middle ear and
mastoid cholesteatomas: study of 1146 cases. Braz J Otorhinolaryngol 77:341–347

29. Auman JT, Boorman GA, Wilson RE, Travlos GS, Paules RS (2007) Heat map visualization
of high-density clinical chemistry data. Physiol Genomics 31:352–356

30. Bougioukas KI, Vounzoulaki E, Mantsiou CD, Savvides ED, Karakosta C, Diakonidis T,
Tsapas A, Haidich A-B (2021) Methods for depicting overlap in overviews of systematic
reviews: An introduction to static tabular and graphical displays. J Clin Epidemiol 132:34–
45

31. Gadi N, Saleh S, Johnson J-A, Trinidade A (2022) The impact of the COVID-19 pandemic
on the lifestyle and behaviours, mental health and education of students studying healthcare-
related courses at a British university. BMC Med Educ 22:115

32. Robinson JM, Jorgensen A, Cameron R, Brindley P (2020) Let nature be thy medicine: a
socioecological exploration of green prescribing in the UK. Int J Environ Res Public Health
17:3460

33. Varraso R, Garcia-Aymerich J, Monier F, Le Moual N, De Batlle J, Miranda G, Pison C,
Romieu I, Kauffmann F, Maccario J (2012) Assessment of dietary patterns in nutritional
epidemiology: principal component analysis compared with confirmatory factor analysis.
Am J Clin Nutr 96:1079–1092

34. Lewis SJ, Gardner M, Higgins J, Holly JM, Gaunt TR, Perks CM, Turner SD, Rinaldi S,
Thomas S, Harrison S (2017) Developing the WCRF International/University of Bristol
Methodology for Identifying and Carrying Out Systematic Reviews of Mechanisms of
Exposure–Cancer Associations. Cancer Epidemiol Biomarkers Prev 26:1667–1675

35. Rothman KJ, Greenland S, Lash TL (2008) Measures of effect and measures of association.

36. Bertollini R, Lebowitz MD, Savitz DA, Saracci R (1995) Environmental epidemiology:
exposure and disease. CRC Press

37. Hoy D, Brooks P, Blyth F, Buchbinder R (2010) The epidemiology of low back pain. Best
Pract Res Clin Rheumatol 24:769–781

38. Thrusfield M (2018) Veterinary epidemiology. John Wiley & Sons

39. Kok BC, Herrell RK, Thomas JL, Hoge CW (2012) Posttraumatic stress disorder associated
with combat service in Iraq or Afghanistan: reconciling prevalence differences between
studies. J Nerv Ment Dis 200:444–450

40. Parvez SM, Kwong L, Rahman MJ, Ercumen A, Pickering AJ, Ghosh PK, Rahman MZ, Das
KK, Luby SP, Unicomb L (2017) Escherichia coli contamination of child complementary
foods and association with domestic hygiene in rural Bangladesh. Trop Med Int Health
22:547–557

41. Robbins AS, Chao SY, Fonseca VP (2002) What’s the relative risk? A method to directly
estimate risk ratios in cohort studies of common outcomes. Ann Epidemiol 12:452–454

42. Katz D, Baptista J, Azen S, Pike M (1978) Obtaining confidence intervals for the risk ratio in
cohort studies. Biometrics 469–474

43. Chen H, Cohen P, Chen S (2010) How big is a big odds ratio? Interpreting the magnitudes of
odds ratios in epidemiological studies. Commun Stat Comput 39:860–864

44. VanderWeele TJ, Vansteelandt S (2010) Odds ratios for mediation analysis for a
dichotomous outcome. Am J Epidemiol 172:1339–1348

45. Vandenbroucke JP, Pearce N (2012) Case–control studies: basic concepts. Int J Epidemiol
41:1480–1489

46. Plant JD, Lund EM, Yang M (2011) A case–control study of the risk factors for canine
juvenile‐onset generalized demodicosis in the USA. Vet Dermatol 22:95–99

47. Zacchilli MA, Owens BD (2010) Epidemiology of shoulder dislocations presenting to
emergency departments in the United States. JBJS 92:542–549

48. Waterman BR, Owens BD, Davey S, Zacchilli MA, Belmont Jr PJ (2010) The epidemiology
of ankle sprains in the United States. JBJS 92:2279–2284

49. Desakorn V, Wuthiekanun V, Thanachartwet V, Sahassananda D, Chierakul W,
Apiwattanaporn A, Day NP, Limmathurotsakul D, Peacock SJ (2012) Accuracy of a
commercial IgM ELISA for the diagnosis of human leptospirosis in Thailand. Am J Trop
Med Hyg 86:524

50. Haynes RB (2012) Clinical epidemiology: how to do clinical practice research. Lippincott
Williams & Wilkins

51. Chowdhury F, Shahid ASMSB, Ghosh PK, Rahman M, Hassan MZ, Akhtar Z, Muneer SM-
E-, Shahrin L, Ahmed T, Chisti MJ (2020) Viral etiology of pneumonia among severely
malnourished under-five children in an urban hospital, Bangladesh. PloS One 15:e0228329

52. Shakerkhatibi M, Dianat I, Asghari Jafarabadi M, Azak R, Kousha A (2015) Air pollution
and hospital admissions for cardiorespiratory diseases in Iran: artificial neural network
versus conditional logistic regression. Int J Environ Sci Technol 12:3433–3442

53. Unicomb L, Horng L, Alam M-U, Halder AK, Shoab AK, Ghosh PK, Islam MK, Opel A,
Luby SP (2018) Health-care facility water, sanitation, and health-care waste management
basic service levels in Bangladesh: results from a nation-wide survey. Am J Trop Med Hyg
99:916
