A Guide To Health Data Science Using Python
2023
PK Ghosh
1|Page Probir Kumar Ghosh
01741124916
[email protected]
A Guide to Health Data Science Using Python
Preface
Probir Kumar Ghosh holds an MSc in Statistics from Rajshahi University and has been working
as a statistician at the International Centre for Diarrhoeal Disease Research, Bangladesh
(icddr,b) for 12 years. Mr. Ghosh has an extensive background in designing and analyzing
large-scale public health scientific studies. He has authored and co-authored over 25
publications, in which he has utilized his statistical expertise to provide insightful
analyses of public health data. His expertise in statistics and data analysis has been
invaluable in advancing our understanding of various public health issues.
In "A Guide to Health Data Science Using Python", Mr. Ghosh shares his knowledge and
experience in using Python programming for data analysis in public health research. This guide
is a valuable resource for public health researchers who want to learn how to analyze their data
using Python programming. The guide provides step-by-step instructions for analyzing
different types of public health data, from epidemiological studies to clinical trials, using
Python. It also includes numerous examples and case studies that illustrate the practical
application of Python programming in public health research.
Mr. Ghosh's extensive experience in data analysis and his expertise in using Python
programming make this guide an essential tool for anyone involved in public health research.
I highly recommend this guide to public health researchers who want to learn how to use Python
for data analysis and to anyone who wants to gain a deeper understanding of public health
research.
Table of Contents
Introduction ......................................................................................................................................4
Chapter 1 ..........................................................................................................................................1
Python built-in Data types .............................................................................................................1
1.1 Data types ......................................................................................................................1
1.2 Type conversion.............................................................................................................2
1.3 Python Strings................................................................................................................4
1.3.1 Assign String to a Variable ........................................................................................4
1.3.2 Strings are Arrays ......................................................................................4
1.3.3 String Length.............................................................................................................4
1.3.4 Check String ..............................................................................................................5
1.3.5 Slicing ........................................................................................................................5
1.3.6 Modify Strings ...........................................................................................................5
1.3.7 Remove Whitespace ...................................................................................................6
1.3.8 Replace String ...........................................................................................................6
1.3.9 Split String .................................................................................................................6
1.3.10 String Concatenation ................................................................................................6
1.3.11 String Format ............................................................................................................6
1.4 Python list ......................................................................................................................7
1.4.1 List Items indexing .....................................................................................................7
1.4.2 List Length ..................................................................................................................8
1.4.3 Range of Indexes .......................................................................................................8
1.4.4 Change Item Value ....................................................................................................8
1.4.5 Change a Range of Item Values ................................................................................8
1.4.6 Insert Items ................................................................................................................9
1.4.7 Append Items.............................................................................................................9
1.4.8 Extend List ..................................................................................................................9
1.4.9 Remove Specified Item ..............................................................................................9
1.4.10 Remove Specified Index ..........................................................................................10
1.4.11 Clear the List ............................................................................................................10
1.4.12 Loop Through the Index Numbers...........................................................................10
1.4.13 Sort List Alphanumerically ......................................................................................11
1.4.14 Reverse Order ..........................................................................................................11
Introduction
“A Guide to Health Data Science Using Python" is an essential resource for public health
researchers and students who want to learn how to use Python programming to analyze
public health data. It is a comprehensive guide to using Python programming for
data management, computation, descriptive statistics, visualization, spatial analysis,
and regression analysis and modeling for identifying risk factors.
The guide is divided into six parts, each focusing on a specific aspect of Python
programming in public health data analysis. The first part covers data management,
including techniques for importing, cleaning, and organizing large datasets. The second part
focuses on computation, introducing readers to important Python libraries such as NumPy,
Pandas, Matplotlib, seaborn, scipy, statsmodels, and geopandas and providing practical
examples of how these libraries can be used for data importation, exportation into another
data format and manipulation.
The third part of the guide delves into descriptive statistics, providing a thorough
introduction to statistical analysis using Python programming. The fourth part covers data
visualization, offering insights into best practices for presenting data in an understandable
and effective manner.
The fifth part of the guide focuses on geospatial analysis, providing a comprehensive
introduction to geospatial data analysis using Python programming. Finally, the sixth part
focuses on using analytical modelling for identifying risk factors for public health data,
covering important concepts such as regression analysis, machine learning and statistical
modelling.
What makes this guide particularly unique is that it is written for public health researchers
and students with no prior experience in Python programming. Python is an object-oriented
language that uses objects to represent data and methods to manipulate that data, an
approach that promotes code organization, reusability, and maintainability. The guide takes
a step-by-step approach, providing clear explanations and examples that allow readers to
build their understanding of Python programming from the ground up. One of the guide's
unique features is its use of attractive and informative graphs such as Waffle, Sankey and
Swarm plots to help readers understand complex public health data. These graphs are
carefully designed to provide clear and concise visual representations of data that are easy
to interpret and analyze. This approach is particularly useful for health data science, where
large amounts of data can be difficult to interpret without the assistance of graphical
representations. Another unique feature of the guide is its focus on geospatial mapping. By
using geographic information systems (GIS) and other spatial data analysis tools, the guide
shows readers how to analyze and interpret health data in the context of geography. This is
an important skill for public health researchers, as it enables them to identify and visualize
spatial patterns in health data, which can help to inform public health interventions and
policies. Additionally, the guide's focus on statistical modeling is also a unique feature. The
book covers a range of statistical models commonly used in health data science, including
logistic regression, Poisson regression, survival analysis, Cox proportional hazards, and
hierarchical models. By providing a detailed introduction to these models, the guide
prepares readers with the skills and knowledge needed to analyze health data using statistical
techniques.
In "A Guide to Health Data Science Using Python", the example data are simulated and are
used to illustrate and test the Python code. These data reflect different real-world
scenarios. While simulated data may not capture all the nuances of real-world data, they
can provide valuable insights and help researchers identify potential issues before
deploying their models on real data.
Overall, "A Guide to Health Data Science Using Python" is an invaluable resource for public
health researchers and students who want to learn how to use Python programming to
manage, represent, analyze and interpret large datasets. The guide provides clear and
concise explanations of important concepts and techniques, accompanied by practical
examples and code snippets that readers can use to deepen their understanding and apply
the skills learned to real-world public health scenarios.
Chapter 1
Python built-in Data types
Introduction
In Python, there are several built-in data types, including integers, floating-point numbers,
strings, lists, tuples, sets, and dictionaries. Each data type has its own unique characteristics
and functions, which make them suitable for different data science tasks. Integers and
floating-point numbers are used for numerical computations, while strings are used to
represent text data. Lists, tuples, and sets are used to store collections of data, and
dictionaries store related key-value pairs. Python allows users to create variables
and objects without explicitly declaring their data types, which makes it easy to compute,
manage, and analyze data without worrying about type declarations. Understanding the
different data types is crucial for data science using Python. This guide covers common
data types and best practices in Python data science.
In Python, the data type is set when you assign a value to a variable (here x).
You can get the data type of any object by using the type() function:
x=5
print(type(x))
Return will be: <class 'int'>
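As a small sketch of how types and conversions behave (the values below are illustrative, not from the book's datasets), the built-in constructors int(), float(), and str() convert between types:

```python
# Each literal carries its own type
print(type(5))        # <class 'int'>
print(type(2.5))      # <class 'float'>
print(type("5"))      # <class 'str'>

# Convert between types with int(), float(), and str()
print(float(5))       # 5.0
print(int("3") + 1)   # 4
```

Note that int("3") parses the string into a number, so arithmetic works on the result.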
A string variable is assigned with the variable name followed by an equal sign and a string:
A ="Hello"
print(A)
Return will be: Hello
Python strings are arrays of bytes representing Unicode characters. In Python, a single
character is simply a string with a length of 1. Square brackets can be used to access
elements of a string; the first character has position 0.
A='Hello, World'
print(A)
print(A[1])
Return will be:
Hello, World
e
String length is the total number of containing characters. To get the length of a string, use
the len() function.
A='Hello, World'
l=len(A)
print(l)
Return will be: 12
1.3.5 Slicing
You can return a range of characters by using slice syntax. Specify the start and end
indexes, separated by a colon, to get a part of the string; the character at the end index
is not included.
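A minimal slicing sketch, reusing the 'Hello, World' string from the earlier examples:

```python
A = 'Hello, World'
# Characters from position 2 up to, but not including, position 5
print(A[2:5])  # llo
```

Positions 2, 3, and 4 hold 'l', 'l', and 'o', so the slice returns llo.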
Lower Case
The lower() method returns the string in lower case:
txt="The Elephant is the largest animal in the world!"
print(txt.lower())
Return will be: the elephant is the largest animal in the world!
The split() method returns a list where the text between the specified separator becomes
the list items.
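A short sketch of split(); the string here is illustrative:

```python
txt = "apple,banana,cherry"
# Split on the comma separator; each piece becomes a list item
items = txt.split(",")
print(items)  # ['apple', 'banana', 'cherry']
```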
The format() method takes the passed arguments, formats them, and places them in the
string where the {} placeholders are:
age=36
name="Mr. John"
You can use index numbers inside the {} to place the arguments in the right placeholders.
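A sketch using the age and name variables above; the sentence text itself is illustrative:

```python
age = 36
name = "Mr. John"

# Arguments fill the {} placeholders in order
txt = "My name is {} and I am {} years old".format(name, age)
print(txt)  # My name is Mr. John and I am 36 years old

# Index numbers in the {} control which argument goes where
txt2 = "I am {1} years old and my name is {0}".format(name, age)
print(txt2)  # I am 36 years old and my name is Mr. John
```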
Create a List:
mylist=["mango","banana","apple"]
List items are ordered, changeable, and allow duplicate values. List items are indexed;
the first item has index [0].
print(mylist[1])
Return will be: banana
You can determine how many items a list has by using the len() function:
print(len(mylist))
Return will be: 3
You can specify a range of indexes by determining where to start and where to end the
range. When defining a range, the return value will be a new list with the specified items.
print(mylist[0:2])
You can use index number to change the value of a specific item.
mylist[1]="blackcurrant"
print(mylist)
Return will be:
['mango', 'blackcurrant', 'apple']
Change the values "banana" and "apple" with the value’s "blackcurrant" and "watermelon":
mylist[1:3]=["blackcurrant","watermelon"]
print(mylist)
Return will be: ['mango', 'blackcurrant', 'watermelon']
If you insert more items than you replace, the new items will be inserted where you
specified, and the remaining items will move accordingly:
mylist[1:2]=["blackcurrant","watermelon"]
print(mylist)
Return will be: ['mango', 'blackcurrant', 'watermelon', 'apple']
Change the second and third values by replacing them with one value:
mylist=["apple","banana","cherry"]
mylist[1:3]=["blackcurrant"]
print(mylist)
Return will be: ['apple', 'blackcurrant']
To add an item to the end of the list, use the append() method:
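For example, with an illustrative fruit list:

```python
mylist = ["apple", "banana", "cherry"]
# append() adds a single item to the end of the list
mylist.append("orange")
print(mylist)  # ['apple', 'banana', 'cherry', 'orange']
```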
The del keyword also removes the specified index. Remove the first item:
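A sketch of del on an illustrative list:

```python
mylist = ["apple", "banana", "cherry"]
# del removes the item at the specified index (here, the first item)
del mylist[0]
print(mylist)  # ['banana', 'cherry']
```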
The clear() method empties the list. The list still remains, but it has no content.
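For example:

```python
mylist = ["apple", "banana", "cherry"]
# clear() empties the list; the list object itself remains
mylist.clear()
print(mylist)  # []
```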
You can also loop through the list items by referring to their index number. Use the
range() and len() functions to create a suitable iterable. Print all items by referring to
their index number:
mylist=["apple","banana","cherry"]
for i in range(len(mylist)):
    print(mylist[i])
Return will be:
apple
banana
cherry
A short hand for loop that will print all items in a list:
mylist=["apple","banana","cherry"]
[print(x) for x in mylist]
Return will be:
apple
banana
cherry
With list comprehension you can do all that with only one line of code:
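For example, a comprehension that builds a filtered list in one line (the fruit list is illustrative):

```python
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
# New list of the fruits whose name contains "a", in a single line
newlist = [x for x in fruits if "a" in x]
print(newlist)  # ['apple', 'banana', 'mango']
```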
To sort list items ascending, use the sort() method. Sort the list alphabetically:
mylist=["banana","cherry","apple"]
mylist.sort(reverse=False)
print(mylist)
Return will be: ['apple', 'banana', 'cherry']
Sort Descending
To sort descending, use the keyword argument reverse=True:
mylist=["banana","cherry","apple"]
mylist.sort(reverse=True)
print(mylist)
Return will be: ['cherry', 'banana', 'apple']
The reverse() method reverses the current sorting order of the elements.
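For example, with an illustrative list:

```python
mylist = ["banana", "Orange", "Kiwi", "cherry"]
# reverse() flips the current order in place, without sorting
mylist.reverse()
print(mylist)  # ['cherry', 'Kiwi', 'Orange', 'banana']
```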
There are several ways to join, or concatenate, two or more lists in Python data science.
One of the easiest ways is to use the + operator.
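A sketch with two illustrative lists:

```python
list1 = ["a", "b", "c"]
list2 = [1, 2, 3]
# The + operator concatenates the two lists into a new list
list3 = list1 + list2
print(list3)  # ['a', 'b', 'c', 1, 2, 3]
```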
A tuple is a collection which contains multiple items. It is a built-in data type in
Python that is created with round brackets; tuple items are ordered and unchangeable.
mytupe=("apple","banana","cherry")
You can use the len() function to print the number of items in the tuple:
print(len(mytupe))
Return will be: 3
You can access tuple items by referring to the index number, inside square brackets.
print(mytupe[1])
Return will be: banana
Once a tuple is created, you cannot change its values. To change tuple values, you must
convert the tuple into a list, change the list, and convert the list back into a tuple. Convert
the tuple into a list to be able to change it:
x=list(mytupe)
x[1]="kiwi"
mytupe=tuple(x)
print(mytupe)
Return will be: ('apple', 'kiwi', 'cherry')
Convert the tuple into a list, add "orange", and convert it back into a tuple:
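A sketch of the described workaround:

```python
mytupe = ("apple", "banana", "cherry")
# Convert to a list, which is changeable
y = list(mytupe)
y.append("orange")
# Convert back to a tuple
mytupe = tuple(y)
print(mytupe)  # ('apple', 'banana', 'cherry', 'orange')
```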
Note: You cannot remove items from a tuple. Tuples are unchangeable, but you can use the
same workaround as we used for changing and adding tuple items:
mytupe=("apple","banana","cherry","kiwi")
y=list(mytupe)
y.remove("kiwi")
mytupe=tuple(y)
print(mytupe)
Return will be: ('apple', 'banana', 'cherry')
a=('apple','banana','cherry')
b=('mango','strawberry')
c=list(a)+list(b)
d=tuple(c)
print(d)
Return will be: ('apple', 'banana', 'cherry', 'mango', 'strawberry')
1.5.5 Loop Through the Index Numbers
You can also loop through the tuple items by referring to their index number. Use the
range() and len() functions to create a suitable iterable. Print all items by referring to their
index number:
mytupe=("apple","banana","cherry")
for i in range(len(mytupe)):
    print(mytupe[i])
Return will be:
apple
banana
cherry
To join two or more tuples, you can use the + operator. Join two tuples:
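A sketch with two illustrative tuples:

```python
tuple1 = ("a", "b", "c")
tuple2 = (1, 2, 3)
# The + operator concatenates the tuples into a new tuple
tuple3 = tuple1 + tuple2
print(tuple3)  # ('a', 'b', 'c', 1, 2, 3)
```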
1.6 Dictionary
Dictionaries store data values in key:value pairs. A dictionary is a collection which
is ordered (as of Python 3.7), changeable, and does not allow duplicate keys. Dictionaries
are written with curly brackets and contain keys and values.
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(car)
Return will be: {'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
1.6.1 Dictionary Length
You use the len() function to determine how many items a dictionary has.
print(len(car))
Return will be: 3
1.6.2 Accessing Items
You can access the items of a dictionary by referring to its key name, inside square
brackets:
x=car["model"]
print(x)
Return will be: Mustang
1.6.3 Get Keys
The keys() method returns a view object containing all the keys in the dictionary.
key=car.keys()
print(key)
Return will be: dict_keys(['brand', 'model', 'year'])
The values() method returns a view object containing all the values in the dictionary.
val=car.values()
print(val)
Return will be: dict_values(['Ford', 'Mustang', 1964])
Make a change in the original dictionary, and see that the values list gets updated as well:
car["year"]=2020
print(car)
Before
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
After
car = {
"brand": "Ford",
"model": "Mustang",
"year": 2020
}
Add a new item to the original dictionary, and see that the values list gets updated as
well:
car["color"]="red"
print(car)
Before
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
After
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964,
"color": "red"
}
The pop() method removes the item with the specified key name. Remove "model":
car.pop("model")
print(car)
car = {
"brand": "Ford",
"year": 1964,
"color": "red"
}
1.6.6 Nested Dictionaries
A dictionary can contain dictionaries; this is called a nested dictionary. Create a
dictionary that contains three dictionaries:
myfamily = {
"child1" : {
"name" : "Emil",
"year" : 2004
},
"child2" : {
"name" : "Tobias",
"year" : 2007
},
"child3" : {
"name" : "Linus",
"year" : 2011
}
}
To access items from a nested dictionary, you use the name of the dictionaries, starting
with the outer dictionary:
myfamily = {
"child1" : {
"name" : "Emil",
"year" : 2004
},
"child2" : {
"name" : "Tobias",
"year" : 2007
},
"child3" : {
"name" : "Linus",
"year" : 2011
}
}
print(myfamily["child2"]["name"])
Return will be: Tobias
Chapter 2
Python DataFrame
Introduction
Dataframes are a crucial tool for any data scientist or analyst working with tabular data. A
dataframe is a two-dimensional data structure that stores data in rows and columns.
Dataframes are highly flexible and can be used for a wide range of data analysis tasks, including
cleaning, filtering, aggregating, and visualizing data. Python dataframes are typically
created with the pandas library, which provides tools for working with tabular data. With
pandas, data analysts can easily read data from a variety of file formats, such as CSV, Excel,
SQL databases, SPSS, Stata, and SAS, and compute on and manipulate the data using a wide
range of functions and methods. Analysts can just as easily export data to a variety of file
formats. This guide provides an overview of the basics of working with pandas in Python,
including how to create and manipulate dataframes, perform common data analysis tasks, and
visualize data, while GeoPandas dataframes cover the basics of working with spatial data.
This guide will help take your data analysis skills to the next level.
Pandas is a Python library used for working with tabular dataframes. It has functions for
analyzing, cleaning, exploring, and manipulating data. Pandas allows us to analyze big data and
make conclusions based on statistical theories. It can be used to clean messy data, and make
them readable and relevant.
If you have Python and PIP already installed on a machine, installing Pandas is very easy:
pip install pandas
Alternatively, you can install Pandas within a PyCharm project. Given a PyCharm project, you
can install the pandas library in a virtual environment step by step:
• Open File > Settings > Project from the PyCharm menu.
• Select your current project.
• Click the Python Interpreter tab within your project tab.
• Click the + icon, search for pandas, and click Install Package.
Once Pandas is installed, import it in your applications by adding the import keyword:
# import packages
import pandas as pd
# Create dataframe
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
mydf=pd.DataFrame(mydataset)
# Print dataframe
print(mydf)
Return will be:
cars passings
0 BMW 3
1 Volvo 7
2 Ford 2
Pandas uses the loc attribute to return one or more specified row(s):
# import packages
import pandas as pd
# Create dataframe
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df=pd.DataFrame(data)
# Print rows 0 and 1
print(df.loc[0:1])
or
print(df.loc[[0,1]])
Return will be :
calories duration
0 420 50
1 380 40
You can easily add row name using Pandas DataFrame with index argument.
# import packages
import pandas as pd
# Create dataframe
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df=pd.DataFrame(data, index=["Month1","Month2","Month3"])
# Print dataframe
print(df)
Return will be:
calories duration
Month1 420 50
Month2 380 40
Month3 390 45
Use the named index in the loc attribute to return the specified row(s).
print(df.loc["Month3"])
Return will be:
calories 390
duration 45
Pandas can easily load datasets in other formats into a DataFrame. For example, load a
comma-separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('~location path/data.csv')
Alternatively, change the current working directory first and then import the CSV file:
import os
os.chdir('~Python notes')
df = pd.read_csv('data.csv')
print(df.head(10))
GeoPandas can easily load a geospatial shape file into a DataFrame. For example, load the
Bangladesh geospatial shape file using geopandas:
# Import packages
import geopandas as gpd
import os
# Change directory
os.chdir('~Python note')
# Load Bangladesh shape file
Bdg= gpd.read_file('~\bgd_admbnda_adm0_bbs_20201113.shp')
Chapter 3
Data Management
Introduction
Data management is a crucial component of health data science that involves the organization,
cleaning, and imputation of data. The quality of health data management can have a significant
impact on the accuracy and reliability of public health findings. Messy data can lead to errors
and biases in the analysis, which can undermine the validity of the study results. Data
management in public health data science involves several key steps. The first step is the
collection of data, which may involve the use of various instruments such as surveys, medical
records, and laboratory tests. Once the data are collected, they must be carefully checked to
ensure validity and completeness. This may involve identifying and correcting errors and
discrepancies in the data, as well as identifying missing or incomplete data. Data cleaning is
another important step in data management, which involves identifying and correcting errors,
inconsistencies, and missing data. The next step is the organization and storage of the data.
This typically involves creating a database or spreadsheet that includes all of the collected
data in a structured and standardized format. The data must be properly labeled and formatted
to facilitate analysis and minimize errors. This may involve using statistical methods to
identify outliers or other data points that are not consistent with the overall pattern of the
data.
Finally, data management means fixing incorrect data and managing datasets for preparing to
analyze. This chapter includes:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
• Merging
• Appending
The dropna() method removes rows that contain empty cells. Remove all rows with NaN values:
# import packages
import pandas as pd
import numpy as np
# Create dataframe
data = {
"calories": [420, 380, 390,400],
"duration": [50, 40, 45,np.nan]
}
df=pd.DataFrame(data,
index=["Month1","Month2","Month3","Month4"])
# Drop NaN values
df.dropna(inplace=True)
# Print dataframe
print(df.head())
Return will be:
calories duration
Month1 420 50
Month2 380 40
Month3 390 45
The fillna() method allows you to replace empty cells with a value. Using the same data
dictionary as above, replace all NaN values with the number 130:
df=pd.DataFrame(data,
index=["Month1","Month2","Month3","Month4"])
# Change NaN values to 130
df.fillna(130,inplace=True)
# Print dataframe
print(df)
Return will be :
calories duration
Month1 420 50.0
Month2 380 40.0
Month3 390 45.0
Month4 400 130.0
To replace empty values in a single column only, specify the column name for the DataFrame.
Replace NULL values in the "duration" column with the number 130:
# import packages
import pandas as pd
import numpy as np
# Create dataframe
data = {
"calories": [420, 380, 390,400],
"duration": [50, 40, 45,np.nan]
}
df=pd.DataFrame(data,
index=["Month1","Month2","Month3","Month4"])
# Change NaN values to 130
df["duration"].fillna(130,inplace=True)
# Print dataframe
print(df)
Return will be:
calories duration
Month1 420 50.0
Month2 380 40.0
Month3 390 45.0
Month4 400 130.0
You can easily impute a value in a column, replacing all of its empty cells, by using the
fillna() method. Replace NULL values in the "hdlc" column with the number 50.0:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('data.csv')
# Change NaN to 50 in specific column
df["hdlc"].fillna(50.0,inplace=True)
# Print dataframe
print(df)
A common way to replace empty cells is to use the mean, median, or mode value of the
column. Pandas provides the mean(), median(), and mode() methods to calculate the respective
values for a specified column. Calculate the MEAN and replace any empty values with it:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('data.csv')
# Create mean
mean_hdlc=df["hdlc"].mean()
# Change NaN to mean value
df["hdlc"].fillna(mean_hdlc,inplace=True)
# Print dataframe
print(df["hdlc"].head(5))
Before mean imputation
0 31.0
1 54.0
2 NaN
3 NaN
4 NaN
After mean imputation
0 31.000000
1 54.000000
2 49.364718
3 49.364718
4 49.364718
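The median() and mode() variants follow the same pattern. This sketch uses a small illustrative hdlc column rather than the data.csv file, and assigns the result back instead of using inplace=True:

```python
import pandas as pd
import numpy as np

# Small illustrative column with missing values
df = pd.DataFrame({"hdlc": [31.0, 54.0, np.nan, 54.0, np.nan]})

# Median imputation
median_hdlc = df["hdlc"].median()        # 54.0
df["hdlc_med"] = df["hdlc"].fillna(median_hdlc)

# Mode imputation (mode() can return several values; take the first)
mode_hdlc = df["hdlc"].mode()[0]         # 54.0
df["hdlc_mode"] = df["hdlc"].fillna(mode_hdlc)

print(df)
```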
Cells with data in the wrong format can make it difficult, or even impossible, to analyze the
data. To fix this, you have two options: remove the rows, or convert all cells in the column
into the same format.
Convert Into a Correct Format
In our DataFrame, we have two cells with the wrong format. Check out rows 22 and 26: the 'Date'
column should be a string that represents a date. Let's try to convert all cells in the 'Date'
column into dates. Pandas provides the to_datetime() function for this:
Convert to date:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
df["Date"]=pd.to_datetime(df["Date"])
# Remove rows with an empty date (NaT)
df.dropna(subset=['Date'], inplace=True)
# Print dataframe
print(df)
Before correction
20 '2020/12/20'
21 '2020/12/21'
22 NaN
23 '2020/12/23'
After correction
20 2020-12-20
21 2020-12-21
23 2020-12-23
To remove duplicates, use the drop_duplicates() method. This method keeps the first occurrence
by default. Remove all duplicates:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
# Identify duplicate rows
print(df.duplicated())
# Drop duplicate values
df.drop_duplicates(inplace=True)
Return will be:
0 False
11 False
12 True
13 False
14 False
15 False
To remove duplicate values in a specific column only, use the subset argument:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
# Identify duplicate dates
print(df.duplicated(subset=['Date']))
# Drop duplicate values, keeping the first occurrence
df.drop_duplicates(subset=['Date'], inplace=True)
Return will be:
0 False
11 False
12 True
13 False
14 False
15 False
To keep the last occurrence instead of the first, use the keyword argument keep='last':
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('datetime.csv')
# Identify duplicate dates
print(df.duplicated(subset=['Date'], keep='last'))
# Drop duplicate values, keeping the last occurrence
df.drop_duplicates(subset=['Date'], keep='last', inplace=True)
Return will be:
0 False
11 False
12 True
13 False
14 False
15 False
You can combine two or more datasets into a single dataset using the merge() method in a
pandas DataFrame. Joining variables into one DataFrame can make analysis computationally
easier than working across separate DataFrames. Pandas provides a merge() function that joins
standard tabular data structures such as pandas DataFrames. The structure of the pandas
merge() function is:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df_x=pd.read_csv("x.csv")
df_y=pd.read_csv("y.csv")
# Merge dataframes
data=pd.merge(df_x,df_y,on="id",validate="1:1",indicator=True)
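Since x.csv and y.csv are not included here, the following sketch applies the same merge() call to two small in-memory DataFrames; the column names are illustrative:

```python
import pandas as pd

# Two illustrative datasets sharing an "id" key
df_x = pd.DataFrame({"id": [1, 2, 3], "age": [30, 45, 60]})
df_y = pd.DataFrame({"id": [1, 2, 3], "hdlc": [31.0, 54.0, 49.0]})

# One-to-one merge on id; indicator=True adds a _merge column
data = pd.merge(df_x, df_y, on="id", validate="1:1", indicator=True)
print(data)
```

validate="1:1" raises an error if id is not unique on both sides, and the _merge column records whether each row came from the left input, the right input, or both.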
You can stack the rows of one dataset below another (appending) as follows:
# import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load data
df_x=pd.read_csv("x1.csv")
df_y=pd.read_csv("y1.csv")
# Append dataframes
data=df_x.append(df_y,ignore_index=True)
Note: DataFrame.append() was removed in pandas 2.0; in newer pandas versions use
pd.concat([df_x, df_y], ignore_index=True) instead.
After appending, total rows = 8868 (4434 + 4434)
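Because append() is unavailable in recent pandas releases, here is a concat-based sketch using small in-memory frames in place of x1.csv and y1.csv:

```python
import pandas as pd

df_x = pd.DataFrame({"id": [1, 2], "v": [10, 20]})
df_y = pd.DataFrame({"id": [3, 4], "v": [30, 40]})

# Stack the rows of df_y below df_x; ignore_index renumbers rows 0..3
data = pd.concat([df_x, df_y], ignore_index=True)
print(len(data))  # 4
```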
Chapter 4
Descriptive statistics
Introduction
Descriptive statistics is a branch of statistics that involves summarizing, organizing, and presenting
data in a meaningful way. It involves the use of various statistical measures and tools to describe
the central tendency, variability, and distribution of data. It is an essential tool for researchers,
analysts, and decision-makers who need to understand and communicate data effectively.
The purpose of descriptive statistics is to provide a clear and concise summary of data, which
can help to identify patterns, trends, and relationships among variables that may be useful in
developing hypotheses or models. It allows us to describe the features of variables, such as
their mean, median, mode, standard deviation, range, percentiles, and interquartile range.
Public health researchers use descriptive statistics to summarize and describe the characteristics of
a population or sample, such as the frequency and distribution of a particular disease or risk factors.
By providing a comprehensive and quantitative picture of the data, descriptive statistics allow
researchers to identify patterns and trends, explore relationships between variables, and draw
meaningful conclusions from their analyses. Descriptive statistics can also be used to compare
different populations or subgroups, evaluate the effectiveness of public health interventions, and
inform policy decisions aimed at improving the health of a population. Overall, descriptive statistics
are an important tool for researchers in the study of the distribution and determinants of health-
related events and conditions [1–26].
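Most of the measures listed above can be computed with a single pandas call. A minimal sketch with hypothetical systolic blood pressure readings (toy data):

```python
import pandas as pd

# Hypothetical systolic blood pressure readings (toy data)
df = pd.DataFrame({"sysbp": [110, 125, 118, 140, 132, 121, 150, 115]})

# describe() reports count, mean, std, min, quartiles, and max in one call
print(df["sysbp"].describe())

# Individual measures are also available directly
print(df["sysbp"].median())
print(df["sysbp"].quantile(0.75) - df["sysbp"].quantile(0.25))  # IQR
```

The same methods (mean(), std(), mode(), quantile()) work on any numeric column of a DataFrame.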
4.1 Tabulate
Discovering relationships between variables is the fundamental goal of data analysis. Frequency
tables are a basic tool you can use to explore data and get an idea of the relationships between
variables. A frequency table is just a data table that shows the counts of one or more categorical
variables.
4.1.1 One way Tabulate
You can create frequency tables in pandas with the pd.crosstab() function. The function takes one
or more categorical columns and constructs a new DataFrame of counts based on the supplied
arrays.
# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
# Create one way cross-tab
my_tab=pd.crosstab(index=datf['death'],columns='count',
margins=True, dropna=False)
# Save results
my_tab.to_csv('One way table for count.csv')
# Create one way cross-tab of proportions
my_tab=pd.crosstab(index=datf['death'],columns='Proportion',
dropna=False, normalize='columns')
# Save proportions
my_tab.to_csv('One way table for percent.csv')
death count All
0 2884 2884
1 1550 1550
All 4434 4434
death Proportion
0 0.650429
1 0.349571
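For a single categorical column, value_counts() is a lighter-weight alternative to pd.crosstab(). A minimal sketch with a hypothetical outcome column (0 = alive, 1 = death):

```python
import pandas as pd

# Hypothetical outcome column: 0 = alive, 1 = death (toy data)
death = pd.Series([0, 0, 1, 0, 1, 0])

print(death.value_counts())                # counts per category
print(death.value_counts(normalize=True))  # proportions per category
```

normalize=True returns proportions directly, matching the second crosstab above.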
4.1.2 Cross-Tabulate
A contingency table called as two-way table, which includes two different dimensions (rows and
columns). Each dimension is a different variable. A two-way table shows the relationship between
two variables. To create a two way table, insert two variables to the pd.crosstab() function:
# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
### Create cross-tab
my_tab=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True,dropna=False)
# Change columns name
my_tab.columns=['Female','Male','Total']
# Change rows name
my_tab.index=['Alive','Death','Total']
# Save cross-tab
my_tab.to_csv('Two way table for count.csv')
# Calculate column proportion
prop1=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True,dropna=False,normalize='columns')
# Change columns name
prop1.columns=['Female','Male','Total']
# Change rows name
prop1.index=['Alive','Death']
# Save cross-tab
prop1.to_csv('Two way table for percent.csv')
Female Male Total
Alive 1101 1783 2884
Death 843 707 1550
Total 1944 2490 4434
Row percentages
# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
prop2=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True, dropna=False, normalize='index')
# Change columns name
prop2.columns=['Female','Male']
# Change rows name
prop2.index=['Alive','Death','Total']
# save cross-tab
prop2.to_csv('Two way table for row percent.csv')
Female Male
Alive 0.381761 0.618239
Death 0.543871 0.456129
Total 0.43843 0.56157
Total percentages
# Import packages
import pandas as pd
import os
# Change local directory
os.chdir('~Python notes')
# Load dataset
datf=pd.read_csv('data.csv',index_col='id')
prop3=pd.crosstab(index=datf['death'],columns=datf['sex1'],
margins=True,dropna=False,normalize=True)
# Change columns name
prop3.columns=['Female','Male','Total']
# Change rows name
prop3.index=['Alive','Death','Total']
# Save cross-tab
prop3.to_csv('Two way table for total percent.csv')
Female Male Total
Alive 0.248309 0.40212 0.650429
Death 0.190122 0.15945 0.349571
Total 0.43843 0.56157 1
4.2 Cross tabulation heatmap
Summary statistics (overall): N = 4415, Mean = 25.85, SD = 4.10, Median = 25.45,
P25 = 23.09, P75 = 28.09, Min = 15.54, Max = 56.80
Summary statistics by group:
Group count mean std min 25% 50% 75% max
Male 1939.0 26.169582 3.407115 15.54 23.97 26.08 28.32 40.38
Female 2476.0 25.592884 4.557443 15.96 22.54 24.83 27.82 56.80
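A grouped summary table like the one above can be produced with pandas groupby().describe(). A minimal sketch with hypothetical BMI values (toy data, not the study dataset):

```python
import pandas as pd

# Hypothetical BMI values by sex (toy data, not the study dataset)
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Female"],
    "bmi": [26.1, 24.3, 22.8, 27.5, 25.0],
})

# One row of count, mean, std, min, quartiles, and max per group
summary = df.groupby("sex")["bmi"].describe()
print(summary)
```

The same pattern works for any numeric column grouped by any categorical column.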
Chapter 5
Data visualization
Introduction
Data visualization is a crucial aspect of health data science, as it allows public health researchers to
present complex data in a clear and concise way. The use of visual representations such as
graphs, and charts can help to convey critical information about the distribution of diseases, risk
factors, and health outcomes within populations. This chapter will provide an overview of the
different types of visualization techniques that are commonly used in health data science, as well
as their applications and limitations. The chapter will also provide an overview of the different
types of data that can be visualized, including categorical and continuous data, and describe
the most appropriate visual representation for each type. Additionally, the
chapter will discuss the different types of graphs and charts commonly used in health
data science, such as pie charts, waffle charts, swarm plots, heatmaps, violin plots, forest plots,
time-series graphs, and scatterplots. Finally, the chapter will offer some best practices for
visualizing health data, such as choosing appropriate colors, fonts, and labels, and avoiding misleading
or confusing visualizations.
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
# Change directory
import os
os.chdir('~\Python notes')
# Create DataFrame
df=pd.DataFrame({
"Nutrition":["Underweight","Normal","Overweight","Obesity"],
"Number":[57,1226,2453,664]
})
# Create a pie graph
plt.pie(df["Number"], labels=df["Nutrition"], autopct='%1.2f%%',
colors=['red','green','orange','blue'])
# Insert title
plt.title("Nutrition status in a city")
# Save figure
plt.savefig('Pie chart.png',bbox_inches='tight', dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Create dataframe
df=pd.DataFrame({
"Nutrition":["Underweight","Normal","Overweight","Obesity"],
"Number":[57,1226,2453,664]
})
# Create figure
plt.pie(df["Number"],labels=df["Nutrition"],autopct='%1.2f%%',
colors=['red','green','orange','blue'],
explode=[0.02,0.012,0.01,0.1])
# Insert title
plt.title("Nutrition status in a city")
# Save figure
plt.savefig('Pie chart.png',bbox_inches='tight', dpi=600)
Bar charts are commonly used in public health data science to display categorical data with
vertical or horizontal bars. Each category is represented as a rectangular bar whose length is
proportional to its value, making it easy to compare the frequency or proportion of each category.
Bar charts can also be used to compare the distribution of a categorical variable across multiple
groups or to display changes in the distribution of a variable over time [8, 10, 12].
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Create dataframe
df=pd.DataFrame({
"Nutrition":["Underweight","Normal","Overweight","Obesity"],
"Number":[57,1226,2453,664]
})
# Create figure
plt.bar(df["Nutrition"],df["Number"],color='maroon')
# Put value label on the bars
for i, v in enumerate(df["Number"]):
    plt.text(i, v, str(v), color='blue')
# Create label of x-axis
plt.xlabel("nutritional status")
# Create label of y-axis
plt.ylabel("Number of participants")
# Put ticks in y-axis
plt.yticks([500,1000,1500,2000,2500])
# Insert title
plt.title("Nutrition status in a city")
# Save figure
plt.savefig('Bar chart_1.png',bbox_inches='tight', dpi=600)
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df = pd.read_csv('~\data.csv',index_col='slno')
# Create figure
ax=df.plot(kind='bar',stacked=False,width=.8,figsize=(10,6))
# Insert value label on top of the bars
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    if height > 0:
        ax.text(x + width/2,
                y + height,
                '{:.0f}%'.format(height),
                horizontalalignment='center',
                verticalalignment='bottom',
                fontweight='bold')
# Create x-axis label
plt.xlabel('\n Antibiotics')
# Create y-axis label
plt.ylabel('Percentage (%)')
# Create y-axis ticks with rotation and bold font
plt.yticks(np.arange(0,101,10))
plt.xticks(fontsize=9.5,rotation=0,fontweight='bold')
# Create legend in the graph
plt.legend(bbox_to_anchor=(.5, 0.9), ncol=2, loc='lower left')
# Save figure
plt.savefig('Multiple bar.png', bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df = pd.read_csv('~\data.csv',index_col='id')
# Create figure
ax=df.plot(kind='bar',stacked=True,width=.8,figsize=(10,6))
# Insert value label on top of the bars
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    if height > 0:
        ax.text(x + width/2, y + height/2,
                '{:.0f}%'.format(height),
                horizontalalignment='center',
                verticalalignment='center',
                fontweight='bold')
A heatmap shows values of a main variable of interest across two axis variables (rows and columns)
as a grid of colored squares. The color of each square encodes the value of the main variable in the
corresponding cell. A heatmap provides an immediate visual summary of information and allows
users to make sense of a complex set of data [29, 30].
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df = pd.read_csv('death.csv')
# Calculate count of death for each division and month
ct_counts=df.pivot(index='Division', columns='Month',
values='count')
# Create figure
plt.figure(figsize=(12, 8))
ax=sns.heatmap(ct_counts, annot=True, cmap='rainbow',
annot_kws={'size':12})
# Create title
plt.title('Division wise death distribution', fontsize=12)
# Create x-axis ticks
plt.xticks(rotation=0)
# Remove x-axis
plt.xlabel('')
# Create y-axis label
plt.ylabel('% of death from COVID-19',fontsize=12)
plt.yticks(fontsize=10,rotation=45)
# Save figure
plt.savefig("Division wise death distribution.jpg")
A waffle chart is an advanced visualization tool. It is a great way to visualize data in
relation to a whole or to highlight progress against a given threshold. We are interested in
visualizing the contribution of each item to the total. For a given height and width, the
contribution of each item is transformed into a number of tiles proportional to the item's
contribution to the total. More tiles therefore indicate a greater contribution, and the combined
grid resembles a waffle [31, 32].
Example 1
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
from pywaffle import Waffle
import os
# Change directory
os.chdir('~\Python notes')
# Create dataframe
data=pd.DataFrame({
"Nutrition":["Underweight","Normal","Overweight","Obesity"],
"Number":[57,1226,2453,664]
})
# Create figure
fig = plt.figure(
FigureClass=Waffle,
rows=10,
columns=44,
values=data.Number,
colors=['darkorange','cyan','green', 'red',],
title={'label': 'One square tile= 10 respondents', 'loc':
'left', 'fontsize':8},
labels=list(data.Nutrition),
legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1),
'ncol': 1, 'framealpha': 5,'fontsize':9},
starting_location='NW',
block_arranging_style='snake',
font_size=12,
icon_legend=True)
# Create figure resize
fig.set_size_inches(10,4)
# Save figure
plt.savefig('Waffle Chart.png',bbox_inches='tight', dpi=600)
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# Load dataset
df=pd.read_csv('~death.csv')
# Calculate total number of death
total=sum(df['Death'])
# Calculate proportion of death
prop=[(float(value)/total) for value in df['Death']]
# Set number of tiles in a row
width=40
# Set number of tiles in a column
height=10
total=width*height
tiles_per_cat=[round(proportion*total) for proportion in prop]
# Create figure
waffle = np.zeros((height, width))
category_index = 0
tile_index = 0
for col in range(width):
    for row in range(height):
        tile_index += 1
        if tile_index > sum(tiles_per_cat[0:category_index]):
            category_index += 1
        waffle[row, col] = category_index
colormap = plt.cm.coolwarm
plt.matshow(waffle, cmap=colormap)
ax = plt.gca()
# Create ticks for color bar
ax.set_xticks(np.arange(-0.5, (width), 1), minor=True)
ax.set_yticks(np.arange(-0.5, (height), 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=1.5)
# Remove x-axis ticks
plt.xticks([])
# Remove y-axis ticks
plt.yticks([])
# Create legend
values_num = df['Death']
values=[round(pr*100,1) for pr in prop]
categories = df['Month']
values_cumsum = np.cumsum(values)
total_values = values_cumsum[len(values_cumsum) - 1]
legend_handles = []
for i, category in enumerate(categories):
    # Build legend label as "Month (count (percent%))"
    label_str = category + ' (' + str(values_num[i]) + ' (' + str(values[i]) + '%))'
    color_val = colormap(float(values_cumsum[i]) / total_values)
    legend_handles.append(mpatches.Patch(color=color_val,
                                         label=label_str))
# Insert legend
plt.legend(handles=legend_handles, loc='lower center',ncol=
round(len(categories)/4),
bbox_to_anchor=(0, -0.6,.95, 0.1))
# Insert title
plt.title('Deaths from Covid-19 in Bangladesh')
# Insert color bar
plt.clim(1,14)
plt.colorbar(ticks=np.arange(1,15,1)).ax.set_yticklabels(
categories)
# Save figure
plt.savefig('Waffle Chart.png',bbox_inches='tight', dpi=600)
5.6 Histogram
Histograms are a commonly used visualization tool in health data science to display the distribution
of a continuous variable. A histogram is a visual representation of the frequency or probability of
observations falling within different ranges (bins) of a continuous variable. Its x-axis represents the
variable being measured, while the y-axis represents the frequency or probability of observations
within each bin. Histograms are useful for identifying patterns in the data, such as whether the
distribution is symmetrical or skewed, and can help to identify potential outliers or data anomalies.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("data.csv")
# Create figure
fig,ax=plt.subplots(figsize=(10,7))
ax.hist(df_can['totchol'],bins=40,color="darkorange")
# Insert title
plt.title('Histogram of total cholesterol level')
# Put x-axis ticks
plt.xticks(rotation=90,fontsize=8)
# Insert y-axis label
plt.ylabel('Frequencies')
# Save figure
plt.savefig("Histogram.png",bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("data.csv")
# Create figure
fig,ax=plt.subplots(figsize=(10,7))
ax.hist(df_can['totchol'],bins=40,color="darkorange",
rwidth=0.85)
# Insert title
plt.title('Histogram of total cholesterol level')
# Put x-axis ticks
plt.xticks(rotation=90,fontsize=8)
# Insert y-axis label
plt.ylabel('Frequencies')
# Save figure
plt.savefig("Histogram.png",bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("data.csv")
# Create figure
fig,ax=plt.subplots(figsize=(10,7))
ax.hist(df_can['totchol1'],bins=40,color="darkorange",
rwidth=0.85, density=True)
# Insert title
plt.title('Histogram of total cholesterol level')
# Put x-axis ticks
plt.xticks(rotation=90,fontsize=8)
# Insert y-axis label
plt.ylabel('Probability')
# Save figure
plt.savefig("Histogram.png",bbox_inches="tight", dpi=600)
5.7 Boxplot
Box plots are a popular tool for visualizing the distribution of continuous data in health data science.
A box plot, also known as a box-and-whisker plot, displays the distribution of data through five
summary statistics: the minimum value, the first quartile (25th percentile), the median (50th
percentile), the third quartile (75th percentile), and the maximum value. The rectangular box
represents the interquartile range (IQR), which is the range between the first and third quartiles.
The median is represented by a line inside the box. The whiskers extend from the box to
the minimum and maximum values, respectively. Outliers may also be represented by individual
data points or circles beyond the whiskers. They can also be used to compare the distribution of a
continuous variable across multiple groups or to display changes in the distribution of a variable
over time [33].
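The five summary statistics behind a box plot can be computed directly with numpy. A minimal sketch with hypothetical monthly case counts, using Tukey's 1.5*IQR rule (the convention most plotting libraries use to flag outliers beyond the whiskers):

```python
import numpy as np

# Hypothetical monthly case counts (toy data)
cases = np.array([12, 18, 25, 31, 22, 40, 15, 95])

# Five-number summary: min, Q1, median, Q3, max
q1, median, q3 = np.percentile(cases, [25, 50, 75])
iqr = q3 - q1

# Tukey's rule: points beyond 1.5*IQR from the quartiles are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = cases[(cases < lower) | (cases > upper)]
print(q1, median, q3, outliers)
```

Here the extreme value 95 falls above the upper fence, so it would be drawn as an individual point beyond the whisker.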
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("~data.csv")
# Create figure
sns.boxplot(x='Month',y='cases',data=df_can,palette='rainbow')
# Insert title
plt.title("Monthwise cases")
# Put y-axis label
plt.ylabel('Number of positive cases')
# Put x-axis ticks
plt.xticks(fontsize=8,rotation=45)
# Put y-axis ticks
plt.yticks(fontsize=8)
# Save figure
plt.savefig('Box plot.png',bbox_inches="tight", dpi=600)
Horizontal boxplot
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("~data.csv")
# Create figure (x and y are swapped for a horizontal orientation)
sns.boxplot(x='cases',y='Month',data=df_can,palette='rainbow',
orient="h")
# Insert title
plt.title("Monthwise cases")
# Put x-axis label
plt.xlabel('Number of positive cases')
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df_can=pd.read_csv("~data.csv")
# Create figure
sns.scatterplot(x='cases',y='Death',data=df_can,color='blue')
# Put y-axis ticks
plt.yticks(np.arange(0,121,20))
# Save figure
plt.savefig('scatter plot.png',bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~\Python notes')
# Load data
df=pd.read_csv("~data.csv")
# Create categories
df.loc[df["age"]<40,"AGE"]="Below 40 years"
df.loc[df["age"]>=40,"AGE"]="40 years or above"
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('data.csv',index_col="id")
# Create categories
bins=[0,18.5,23.5,29.5,np.inf]
names=["Underweight","Normal","Overweight","Obese"]
df['new_nutrition']=pd.cut(df["bmi"],bins,labels=names)
# Create figure
fig=plt.figure(figsize=(10,6))
sns.stripplot(data=df,x="sysbp", y="new_nutrition",palette='magma')
sns.set_theme(style='whitegrid')
# Insert labels
plt.ylabel('Nutritional status')
plt.xlabel('Systolic blood pressure')
# Save figure
plt.savefig("Strip plot.png",bbox_inches='tight', dpi=600)
To identify whether age influences the relationship between nutritional status and blood pressure,
you can use hue="Age" in the stripplot() method.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~Python notes')
# Load data
df=pd.read_csv('data.csv',index_col="id")
# Create categories
bins=[0,18.5,23.5,29.5,np.inf]
names=["Underweight","Normal","Overweight","Obese"]
df['new_nutrition']=pd.cut(df["bmi"],bins,labels=names)
# Create figure
fig=plt.figure(figsize=(10,6))
sns.stripplot(data=df,x="sysbp", y="new_nutrition",hue='Age',
palette='magma')
sns.set_theme(style='whitegrid')
# Insert labels
plt.ylabel('Nutritional status')
plt.xlabel('Systolic blood pressure')
# Save figure
plt.savefig("Strip plot.png",bbox_inches='tight', dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
# Change directory
os.chdir('~\Python notes')
# Load data
df_covid=pd.read_csv('~data.csv')
# Create figure
fig=plt.figure(figsize=(10,6))
sns.boxplot(x='Month',y='cases',data=df_covid,palette='rainbow')
# Change background color
sns.set_theme(style='whitegrid')
# Create swarm plot
sns.swarmplot(x='Month',y='cases',data=df_covid,color='k',
size=3)
# Insert title
plt.title('Combined box and swarm plot of COVID-19 cases in Bangladesh')
# Save figure
plt.savefig("Swarm plot.png",bbox_inches='tight', dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
# Change directory
os.chdir('~\Python notes')
# Load data
df_covid=pd.read_csv('~data.csv')
# Create figure
fig=plt.figure(figsize=(10,6))
sns.boxplot(x='Month',y='cases',data=df_covid,palette='rainbow')
# Change background color
sns.set_theme(style='whitegrid')
# Create swarm plot
sns.swarmplot(x='Month',y='cases',data=df_covid,color='k',
size=3)
# Insert title
plt.title('Monthly COVID-19 positive cases in Bangladesh')
# Insert y-axis label
plt.ylabel('Number of positive cases')
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os
# Change directory
os.chdir('~Study maps\Map')
# Load data
df=pd.read_csv('~data.csv',index_col='Date',parse_dates=True)
# Create figure
fig,ax=plt.subplots(figsize=(10,6))
ins1=ax.plot(df['cases'],label='Case',color='green')
ax2=plt.twinx()
ins2=ax2.plot(df['Death'],label='Death',color='red')
ins=ins1+ins2
labs=[i.get_label() for i in ins]
# Insert legend
ax.legend(ins,labs,loc=0)
# Put y-axis label
ax.set_ylabel('Number of COVID-19 cases')
ax2.set_ylabel('Number of death from COVID-19')
ax.set_ylim(0,8000)
ax2.set_ylim(0,100)
# Save figure
plt.savefig("Timeseries plot.png",bbox_inches='tight', dpi=600)
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Change directory
os.chdir('~Study maps\Map')
# Load data
df_case=pd.read_csv('data.csv')
# Create figure
fig, ax=plt.subplots(figsize=(10,6))
sns.lineplot(x=df_case['Date'],y=df_case['Death'],data=df_case,
color='red',linewidth=1)
# Put x-axis label
plt.xlabel('Days passed')
# Put y-axis label
plt.ylabel('Number of death from COVID-19')
# Put y-axis tick
plt.yticks(np.arange(0,301,20))
# Put x-axis tick
plt.xticks(np.arange(0,560,7))
plt.xticks(rotation=80, fontsize=6)
# Insert title
plt.title('Daily number of death from COVID-19 in Bangladesh')
# Put horizontal line
plt.hlines(y=70, xmin='7-May-20', xmax='14-Oct-20', linewidth=1,
color='gray',linestyles='--')
plt.hlines(y=113, xmin='19-Mar-21',xmax='14-May-21', linewidth=1,
color='gray',linestyles='--')
plt.hlines(y=265, xmin='10-Jun-21',xmax='1-Sep-21', linewidth=1,
color='gray',linestyles='--')
# Put vertical line
plt.vlines(x=['19-Mar-21','14-May-21'], ymin=[0,0],
ymax=[113,113], linewidth=1, color='gray',linestyles='--')
plt.vlines(x=['10-Jun-21','1-Sep-21'], ymin=[0,0],
ymax=[265,265], linewidth=1, color='gray',linestyles='--')
plt.vlines(x=['7-May-20','14-Oct-20'],ymin=[0,0],ymax=[70,70],
linewidth=1, color='gray',linestyles='--')
# Insert text
plt.text(110,72,'First wave',color='purple')
plt.text(380,114,'Second wave',color='purple')
plt.text(480,266,'Third wave',color='purple')
# No grid lines
plt.grid(False)
plt.ioff()
# Insert a faint background image behind the lines
# (assumes an image file; replace 'background.png' with your own)
im=plt.imread('background.png')
ax.imshow(im,aspect='auto',extent=(1,560,0,300), alpha=0.2,
cmap='gray',zorder=1,origin='lower')
# Save figure
plt.tight_layout()
plt.savefig('Number of death.png',dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from lifelines import KaplanMeierFitter
from lifelines import NelsonAalenFitter
from lifelines.statistics import logrank_test
from lifelines import CoxPHFitter
from lifelines.plotting import add_at_risk_counts
# Change directory
os.chdir('~Python note')
# Load data
data=pd.read_csv('data.csv')
data.loc[data.status==1,'Dead']=0
data.loc[data.status==2,'Dead']=1
# Fit survival model
kmf=KaplanMeierFitter()
kmf.fit(durations=data['time'],event_observed=data['Dead'])
# Print results
print(kmf.event_table)
# Print predict at theshold
print(kmf.predict([0,1,11,12,15]))
# Print results
print(kmf.survival_function_)
print(kmf.median_survival_time_)
print(kmf.confidence_interval_)
# probability of die
print(kmf.cumulative_density_)
# Create cumulative density plot
kmf.plot_cumulative_density()
# Fit Nelson Aalen model
naf=NelsonAalenFitter()
naf.fit(durations=data['time'],event_observed=data['Dead'])
print(naf.cumulative_hazard_)
naf.plot_cumulative_hazard()
print(naf.event_table)
print(naf.predict([0,1,11,12,15]))
print(naf.confidence_interval_)
#data separation (Male and Female)
kmf_m=KaplanMeierFitter()
kmf_f=KaplanMeierFitter()
male=data.query('sex==1')
Female=data.query('sex==2')
# Create plot
ax=plt.subplot(111)
kmf_m.fit(durations=male['time'],event_observed=male['Dead'],
label='Male')
kmf_f.fit(durations=Female['time'],event_observed=Female['Dead'],
label='Female')
ax=kmf_m.plot_survival_function(ax=ax)
ax=kmf_f.plot_survival_function(ax=ax)
add_at_risk_counts(kmf_m,kmf_f,ax=ax,xticks=np.arange(0,1001,100))
# Insert title
plt.title('Kaplan Meier estimate')
# Put y-axis label
ax.set_ylabel('Probability of survival')
# Put x-axis label
ax.set_xlabel('Days passed')
# Put x-axis ticks
ax.set_xticks(np.arange(0,1001,100))
# Put y-axis ticks
ax.set_yticks(np.arange(0,1.1,.1))
# Save figure
plt.savefig('Survival.png', bbox_inches="tight",dpi=600)
# Fit Cox proportional hazards model on the same data
# (the covariate columns shown here are illustrative)
cph=CoxPHFitter()
cph.fit(data[['time','Dead','sex']],duration_col='time',event_col='Dead')
cph.print_summary()
# Create plot
cph.plot()
# Insert title
plt.title('Hazard ratio plot for variables')
# Save figure
plt.savefig("HR.png", bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pySankey.sankey import sankey
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('Sanky_data.csv')
# Create figure
sankey(left=df["Infected Area"], right=df["Treatment Area"],
leftWeight=df["Total"], rightWeight=df["Total"],
aspect=20, fontsize=12)
# Reset figure size
fig1=plt.gcf()
fig1.set_size_inches(8,6)
# Change color
fig1.set_facecolor("w")
# Put top texts
fig1.text(x=0.1,y=0.85,s='Infected Area', fontsize=10,
fontweight='bold')
fig1.text(x=0.8,y=0.85,s='Treatment Area', fontsize=10,
fontweight='bold')
# Save figure
fig1.savefig("Sankey_data.png", bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from zepid.graphics import EffectMeasurePlot
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('~database_forest.csv')
# Create variables
labs = df['Authors']
measure = df['Proportion']
lower = df['lu']
upper = df['up']
# Create figure
p = EffectMeasurePlot(label=labs, effect_measure=measure,
lcl=lower, ucl=upper)
# Create labels
p.labels(effectmeasure='Proportion(%)',fontsize=8)
# Change color
p.colors(pointshape="D",pointcolor='red')
# Adjust figure
ax=p.plot(figsize=(12,8), t_adjuster=0.01, max_value=100,
min_value=0)
# Draw vertical line
ax.vlines(x=50,ymax=60,ymin=0,linestyles='--',colors='gray')
# Put x-axis ticks
x=np.arange(0,101,10)
labds=['0','10','20','30','40','50','60','70','80','90','100']
ax.set_xticks(x)
ax.set_xticklabels(labds)
# Create spines
ax.spines['top'].set_visible(True)
ax.spines['right'].set_visible(True)
ax.spines['bottom'].set_visible(True)
ax.spines['left'].set_visible(False)
# Save figure
plt.tight_layout()
plt.savefig("Forestplot.png",bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from zepid.graphics import EffectMeasurePlot
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('data.csv')
# Define variables
labs = df['Authors']
measure = df['RR']
lower = df['lu']
upper = df['up']
# Create figure
p = EffectMeasurePlot(label=labs, effect_measure=measure,
lcl=lower, ucl=upper)
# Insert label
p.labels(effectmeasure='Risk ratio (RR)',fontsize=8)
# Change color
p.colors(pointshape="D",pointcolor='red')
ax=p.plot(figsize=(11,4), t_adjuster=0.05, max_value=100,
min_value=0)
# Put x-axis tick
x=np.arange(-10,101,10)
labds=['-10','0','10','20','30','40','50','60','70','80','90','100']
ax.set_xticks(x)
ax.set_xticklabels(labds)
# Create spines
ax.spines['top'].set_visible(True)
ax.spines['right'].set_visible(True)
ax.spines['bottom'].set_visible(True)
ax.spines['left'].set_visible(False)
# Save figure
plt.savefig("Forestplot.png",bbox_inches="tight", dpi=600)
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('data.csv',index_col='id')
# Create figure
fig=plt.figure(figsize=(10,6))
sns.violinplot(data=df,x='totchol',color="darkorange")
# Background change
sns.set_theme(style='whitegrid')
plt.ioff()
# Insert x-axis label
plt.xlabel("Total cholesterol level of respondent")
# Save figure
plt.savefig("Violin plot.png",bbox_inches='tight', dpi=600)
Note: the white dot in the violin indicates the median, the box shows the interquartile range, and
the curve shows the kernel density of the variable.
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
# Change directory
os.chdir('~Python note')
# Load data
df=pd.read_csv('data.csv',index_col='id')
# Create figure
fig=plt.figure(figsize=(10,6))
sns.violinplot(data=df,x='totchol',y="sex",color="darkorange")
# Change background
sns.set_theme(style='whitegrid')
plt.ioff()
# Insert x-axis label
plt.xlabel("Total cholesterol level of respondent")
# Change y-axis label
plt.ylabel("Sex of respondents")
# Save figure
plt.savefig("Violin plot.png",bbox_inches='tight', dpi=600)
To identify whether age influences the relationship between nutritional status and blood pressure,
you can use hue="Age" in the swarmplot() method.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
df=pd.read_csv('data.csv',index_col="id")
# Create categories
bins=[0,18.5,23.5,29.5,np.inf]
names=["Underweight","Normal","Overweight","Obese"]
df['new_nutrition']=pd.cut(df["bmi"],bins,labels=names)
# Create figures
fig=plt.figure(figsize=(10,6))
sns.violinplot(data=df,x="sysbp",y="new_nutrition",
palette='rainbow')
sns.set_theme(style='whitegrid')
sns.swarmplot(data=df,x="sysbp",y="new_nutrition",hue='Age',
color='green', alpha=0.3)
plt.ylabel('Nutritional status')
plt.xlabel('Systolic blood pressure')
# Save figure
plt.savefig("Swarm violin plot.png",bbox_inches='tight', dpi=600)
Chapter 6
Spatial data Mapping
Introduction
Spatial data mapping is an important visualization tool in health data science for visualizing disease
patterns and distributions within populations at local and global scales. It involves the use of
geographical information systems (GIS) and other spatial analysis techniques to map and analyze
disease incidence, prevalence, and mortality rates, as well as risk factors and environmental
exposures that may contribute to disease outcomes. The chapter will begin by introducing the
mapping of geodata and the benefits of different kinds of maps for disease surveillance and public
health decision-making. The chapter will also discuss the different types of spatial data that can
be analyzed, including point data, polygon data, and raster data. The chapter will then focus on the
different types of maps that can be created to visualize spatial data, including choropleth maps,
point maps, and migration maps [1, 8, 13, 16, 21].
# Import packages
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.cbook import get_sample_data
import os
# Change directory
os.chdir('~Python note')
# Load compass
im = plt.imread(get_sample_data('~compass.jpg'))
# Load Bangladesh shape file
Bdg_d =gpd.read_file('~\bgd_admbnda_adm1_bbs_20201113.shp')
# Load data
# Save figure
plt.savefig('~map.png',dpi=600)
# Import packages
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
from shapely.geometry import Point
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from pyproj import Proj, CRS, transform
import os
# Change directory
os.chdir('~Python note')
# Load data
df_dis=pd.read_csv('data.csv',index_col="Area")
# Load Bangladesh district shape file
Bdg_dis = gpd.read_file('~\bgd_admbnda_adm2_bbs_20201113.shp')
# Load data
df_long=pd.read_csv('Districts.csv', delimiter=',', skiprows=0,
low_memory=False)
# Create geodata
geom = [Point(xy) for xy in zip(df_long['lon'],df_long['lat'])]
gdf_long = GeoDataFrame(df_long, geometry=geom)
# Create figure
fig,ax8=plt.subplots(figsize=(12,8))
# Create color panel
cmap_csub_f=['darkgreen','darkorange','purple','blue','black',
             'brown','red']
color_mapsub_f=ListedColormap(cmap_csub_f)
# Create maps
Bdg_dis[Bdg_dis.ADM2_EN=='Faridpur'].plot(color='w',ax=ax8,
edgecolor= 'black',alpha=0.3)
Bdg_dis[Bdg_dis.ADM2_EN=='Magura'].plot(color='w',ax=ax8,
edgecolor='black',alpha=0.3)
Bdg_dis[Bdg_dis.ADM2_EN=='Rajbari'].plot(color='w',ax=ax8,
edgecolor='black',alpha=0.3)
gdf_long.plot(column='Area',ax=ax8,cax=None,categorical=True,
legend=True,
marker='o',
markersize=45,
cmap=color_mapsub_f)
for i in range(len(gdf_long)):
    if gdf_long.District[i]=='Faridpur':
        txt=ax8.text(float(gdf_long.lon[i]+0.3),
            float(gdf_long.lat[i]-0.05),
            "{}".format(gdf_long.District[i]),size=10,
            color='k',horizontalalignment='left')
    if gdf_long.District[i]=='Magura':
        txt=ax8.text(float(gdf_long.lon[i]-0.06),
            float(gdf_long.lat[i]-0.2),
            "{}".format(gdf_long.District[i]),size=10,
            color='k',horizontalalignment='left')
    if gdf_long.District[i]=='Rajbari':
        txt=ax8.text(float(gdf_long.lon[i]-0.05),
            float(gdf_long.lat[i]+0.05),
            "{}".format(gdf_long.District[i]),size=10,
            color='k',horizontalalignment='left')
# Save map
plt.savefig('Map.png',dpi=600)
# Import packages
import pandas as pd
import numpy as np
import geopandas as gpd
from geopandas import GeoDataFrame
from shapely.geometry import Point
import matplotlib.pyplot as plt
from pyproj import Proj, CRS, transform
import os
# Change directory
os.chdir('~Python note')
# Load data
df_dis=pd.read_csv('Dataset.csv',index_col="slno")
# Load Bangladesh district shape file
Bdg_dis = gpd.read_file('~\\bgd_admbnda_adm2_bbs_20201113.shp')
# Create geodata
geometry_1 = [Point(xy) for xy in zip(df_dis['lon1'],
df_dis['lat1'])]
gdf_1 = GeoDataFrame(df_dis,
geometry=geometry_1,crs=CRS('EPSG:4326'))
gdf_1.to_crs(epsg=5234,inplace=True)
geometry_2 = [Point(xy) for xy in zip(df_dis['lon2'],
df_dis['lat2'])]
gdf_2 = GeoDataFrame(df_dis,
geometry=geometry_2,crs=CRS('EPSG:4326'))
gdf_2.to_crs(epsg=5234,inplace=True)
# Calculate distance in km
df_dis["Distance"]=np.round((gdf_1.distance(gdf_2))/1000,
decimals=2)
# Save file
df_dis.to_csv('Distant_calculation.csv')
District_City Lat1 Lon1 District Lon2 Lat2 Distance
Dhaka 23.79691 90.40901 Rajshahi 88.65 24.5639 217.37
Gazipur 23.99639 90.42093 Rajshahi 88.85 24.375 172.68
Kishoreganj 24.38262 90.95009 Rajshahi 88.8083 24.5639 203.35
Madaripur 23.20803 90.15302 Rajshahi 88.85 24.375 189.51
Narayanganj 23.62043 90.50067 Joypurhat 89.18333 25.05833 210.11
Munshiganj 23.53782 90.52981 Pabna 89.2372 24.00644 137.1
Narshingdi 23.97086 90.74489 Borga 89.0333 24.8167 181.44
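The projected-CRS distances above can be sanity-checked against the great-circle (haversine) distance, which needs only the raw latitudes and longitudes and the standard library. This is a sketch for cross-checking, not a replacement for a properly projected CRS; results can differ somewhat from projected distances depending on the projection used.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere of radius 6371 km
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Dhaka to Rajshahi, using the coordinates from the table above
d = haversine_km(23.79691, 90.40901, 24.5639, 88.65)
print(round(d, 2))
```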
for i in range(len(Bdg_dis)):
    txt=ax8.text(float(Bdg_dis.longs[i]),float(Bdg_dis.lats[i]),
        "{}".format(Bdg_dis.ADM2_EN[i]),size=8, color='k',
        horizontalalignment='left')
# Load distance data
df_distant=pd.read_csv('~distance.csv', delimiter=',', skiprows=0,
                       low_memory=False)
for slat,dlat,slon,dlon,distant in zip(df_distant['lat1'],
        df_distant['lat2'],df_distant['lon1'],
        df_distant['lon2'],df_distant['Distance']):
    # Draw line between two coordinate points
    plt.plot([slon,dlon],[slat,dlat],linewidth=1,
        linestyle='dashdot',color='blue', alpha=0.5)
    # Insert distance in km
    if distant>0:
        plt.text(dlon-0.1,dlat+0.05,
            str(np.round(distant,decimals=1))+' Km',
            color='brown', size=8, alpha=0.8)
# Save map
plt.savefig("~mapping.png",dpi=600)
# Import packages
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.patheffects as PathEffects
from matplotlib.colors import ListedColormap
import os
# Change directory
os.chdir('~Python note')
# Load coordinate points data
df_dis=pd.read_csv('District.csv')
# Load Bangladesh district shape file
Bdg_d = gpd.read_file('~\\bgd_admbnda_adm2_bbs_20201113.shp')
# Merge shape file and coordinate data
BDG_d=pd.merge(Bdg_d,df_dis,on='ADM2_PCODE',how='left')
# Change NaN to Bangladesh
BDG_d['Hosp'].fillna("Bangladesh",inplace=True)
# Create map
fig,ax4=plt.subplots(figsize=(6,8))
cmap_c=['w','orange']
BDG_d.plot(column='Hosp',ax=ax4, categorical=True, edgecolor='k',
figsize=(6,8),
alpha=0.8,
linewidth=0.5,
cmap=ListedColormap(cmap_c),
legend=True)
# Create latitudes and longitudes
BDG_d['coords'] = BDG_d['geometry'].apply(lambda x:
x.representative_point().coords[:])
BDG_d['coords'] = [coords[0] for coords in BDG_d['coords']]
BDG_d['longs']=[longs[0] for longs in BDG_d['coords']]
BDG_d['lat']=[lat[1] for lat in BDG_d['coords']]
# Insert Study district names
for i in range(len(BDG_d)):
    if BDG_d.Hosp[i]=='H':
        txt=ax4.text(float(BDG_d.longs[i])-0.08,
            float(BDG_d.lat[i]),
            "{}\n{}".format(BDG_d.Hosp[i],BDG_d.ADM2_EN[i]),
            size=6,color='k',fontweight='bold',wrap=True)
        txt.set_path_effects([PathEffects.withStroke(linewidth=2,
            foreground='w')])
# Format legend
legenelement=[mpatches.Patch(edgecolor='k',facecolor='w',
                  label="Bangladesh",alpha=0.5),
              mpatches.Patch(color='orange',
                  label="Selected district",alpha=0.8),
              mpatches.Patch(color='w',label="H: Hospital")]
ax4.legend(handles=legenelement)
leg=ax4.get_legend()
leg.set_bbox_to_anchor((.35,.15))
# Insert compass
ax4.text(x=92, y=26, s='N', fontsize=20)
ax4.arrow(92.12, 25.80, 0, 0.18, length_includes_head=True,
head_width=0.2, head_length=0.3, overhang=.2,
facecolor='k')
# Save map
plt.tight_layout()
plt.savefig('~map.png',dpi=600)
Spatial heatmap analysis is a powerful tool in public health research for visualizing and
analyzing color-coded patterns of disease incidence, prevalence, and mortality rates across
geographic areas. Heatmaps use color-coded shading to represent the density or intensity of a
particular variable at a specific location, making it easy to identify spatial patterns and trends
in disease occurrence. They help researchers effectively identify and communicate spatial patterns
and trends in disease occurrence and contribute to the advancement of public health [8, 16].
# Import packages
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.cbook import get_sample_data
from mpl_toolkits.axes_grid1 import make_axes_locatable
import os
# Change directory
os.chdir('~Study maps\Map')
# Load coordinate points data
df_dis=pd.read_csv('~\Divisions.csv')
# Load Bangladesh shape file
Bdg_d = gpd.read_file('~\\bgd_admbnda_adm1_bbs_20201113.shp')
# Load compass
im = plt.imread(get_sample_data('~\compass.jpg'))
# Merge coordinate and shape file
BDG_d=pd.merge(Bdg_d,df_dis,on='ADM1_PCODE',how='left')
# Create figure
fig1,ax5=plt.subplots(figsize=(6,8))
# Create color bar
divider5 = make_axes_locatable(ax5)
cax5 = divider5.append_axes("right", size="2%", pad=0.2)
# Create Bangladesh map
BDG_d.plot(column='Access',ax=ax5, cax=cax5,categorical=False,
edgecolor='k',figsize=(6,8),
vmin=0,
vmax=10,
alpha=0.8,
linewidth=0.5,
legend_kwds={'label':"% of patients",
'orientation':'vertical', 'ticks':np.arange(0,10.1,1)},
cmap=plt.get_cmap('OrRd',10),
legend=True)
# Separate coordinate points
BDG_d['coords'] = BDG_d['geometry'].apply(lambda x:
x.representative_point().coords[:])
BDG_d['coords'] = [coords[0] for coords in BDG_d['coords']]
BDG_d['longs']=[longs[0] for longs in BDG_d['coords']]
BDG_d['lat']=[lat[1] for lat in BDG_d['coords']]
# Insert names
for i in range(len(BDG_d)):
    txt=ax5.text(float(BDG_d.longs[i]),float(BDG_d.lat[i]),
        "{}\n{}%".format(BDG_d.Divisions[i],BDG_d.Access[i]),
        size=8,color='k',horizontalalignment='center')
# Insert legend
leg=ax5.get_legend()
plt.tight_layout()
# Insert compass
newax1 = fig1.add_axes([0.55, 0.7, 0.15, 0.15], anchor='C',
zorder=0)
newax1.imshow(im)
newax1.axis('off')
# Save map
plt.savefig('~ map_1.png',dpi=600)
# Import packages
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from mpl_toolkits.axes_grid1 import make_axes_locatable
import os
# Change directory
os.chdir('~Python note')
# Load death data
df_death=pd.read_csv('world_covid death.csv')
# Load World shape file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Rename the columns so that we can merge with our data
world.columns=['pop_est', 'continent', 'name', 'CODE',
'gdp_md_est', 'geometry']
# Merge shape file and death dataset
COVID_death=pd.merge(world,df_death,on='CODE',how='left')
# Fill NaN values to zeros
COVID_death['COVID_Death'].fillna(value=0,inplace=True)
# Calculate death per million for each country
COVID_death["d_per_1m"]=1000000*(COVID_death['COVID_Death']
                                 /COVID_death['pop_est'])
# Create map
fig1, ax1=plt.subplots()
# Create color bar
divider1 = make_axes_locatable(ax1)
cax1 = divider1.append_axes("top", size="2%", pad=0.1)
# Create map
COVID_death.plot(column='d_per_1m', ax=ax1,cax=cax1,
figsize=(12,10),
alpha=0.5,
edgecolor='k',
legend=True,
legend_kwds={'orientation':'horizontal'},
cmap='Reds')
# Insert title
plt.title('Death/1Million population from COVID infection in
Different Countries', fontsize=14)
# Save map
plt.tight_layout()
plt.savefig('~ map_1.png',dpi=600)
# Import packages
import pandas as pd
import random
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import os
# Change directory
os.chdir('~Python note')
# Load Bangladesh shape file
Bdg= gpd.read_file('~\bgd_admbnda_adm0_bbs_20201113.shp')
# Load Bangladesh district shape file
Bdg_dis = gpd.read_file('~bgd_admbnda_adm2_bbs_20201113.shp')
# Create function to randomly select points within a polygon
def random_points_in_polygon(number, polygon):
    points = []
    min_x, min_y, max_x, max_y = polygon.bounds
    i = 0
    while i < number:
        point = Point(random.uniform(min_x, max_x),
                      random.uniform(min_y, max_y))
        if polygon.contains(point):
            points.append(point)
            i += 1
    return points
# Create dataframe with selected points
crs = {'init': 'epsg:4326'}
points_result = pd.DataFrame(random_points_in_polygon(10,
Bdg.iloc[0].geometry))
points_result.columns=['geometry']
# Separate longitudes and latitudes
points_result['coords'] = points_result['geometry'].apply(lambda
x: x.representative_point().coords[:])
points_result['coords'] = [coords[0] for coords in
points_result['coords']]
The result is 10 randomly selected points within Bangladesh.
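The rejection-sampling logic of random_points_in_polygon() can be exercised without a shapefile by substituting a minimal stand-in object that mimics a polygon's bounds and contains() interface. BoxPolygon below is a hypothetical test double, not part of geopandas or shapely, and the bounding box is only a rough approximation of Bangladesh.

```python
import random

class BoxPolygon:
    """Minimal stand-in for a polygon object (hypothetical test double)."""
    def __init__(self, min_x, min_y, max_x, max_y):
        self.bounds = (min_x, min_y, max_x, max_y)
    def contains(self, point):
        x, y = point
        min_x, min_y, max_x, max_y = self.bounds
        return min_x <= x <= max_x and min_y <= y <= max_y

def random_points_in_polygon(number, polygon):
    # Rejection sampling: draw from the bounding box, keep points inside
    points = []
    min_x, min_y, max_x, max_y = polygon.bounds
    while len(points) < number:
        point = (random.uniform(min_x, max_x),
                 random.uniform(min_y, max_y))
        if polygon.contains(point):
            points.append(point)
    return points

poly = BoxPolygon(88.0, 20.5, 92.7, 26.6)  # rough bounding box of Bangladesh
pts = random_points_in_polygon(10, poly)
```

With a real shapely polygon the contains() check rejects points that fall inside the bounding box but outside the country border, which is why the loop keeps drawing until it has enough accepted points.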
Chapter 7
Measures of association: Crude analysis
Introduction
In health data science, crude analyses are statistical methods that are used to investigate the
relationship between two variables. The crude measures are essential for understanding the
distribution of diseases and health outcomes in populations and for identifying risk factors that
contribute to the occurrence of these outcomes. They are one approach to measuring association
that involves examining the overall relationship between an exposure and an outcome without
accounting for any other factors that may influence this relationship. These measures depend on
data type and the research study design and research question [35, 36].
Mean estimation is a form of statistical inference: a sample mean is used to estimate the unknown
population mean. We assume that the continuous variable is normally distributed in the unknown
population [37].
# Import packages
import pandas as pd
import scipy.stats as stat
# load data
df=pd.read_csv("data.csv",index_col="id")
# calculate mean
var=df["bmi"]
Mean=round(var.mean(),3)
# calculate standard error
se=round(var.sem(),3)
# calculate 95% confidence interval
ci_95=stat.norm.interval(alpha=0.95,loc=Mean,scale=se)
# Reduce decimal
ci_lu=round(ci_95[0],3)
ci_up=round(ci_95[1],3)
ci_95=((ci_lu,ci_up))
# Format Results
txt=("Mean={}\nStandard Error={}\n95% confidence interval={}")
# Print Results
print(txt.format(Mean,se,ci_95))
The output will be:
Mean =25.846
Standard Error= 0.062
95% confidence interval= (25.725, 25.967)
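The interval above follows the familiar formula mean ± 1.96 × SE, with SE = SD/√n. A stdlib-only sketch with toy numbers (the small dataset here is illustrative, not from the book's data file):

```python
from math import sqrt

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n = len(data)
mean = sum(data) / n
# Sample standard deviation (ddof=1) and standard error
sd = sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
se = sd / sqrt(n)
# 95% confidence interval using the normal critical value 1.96
ci_95 = (mean - 1.96 * se, mean + 1.96 * se)
print(round(mean, 3), round(se, 3),
      tuple(round(v, 3) for v in ci_95))
```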
A two-means comparison compares the means of two groups. The two-sample t-test is used to
compare the means between two groups. The null hypothesis is that there is no difference between
the two population means.
6.2.1 Independent two means comparison
You assume that the observations of the two groups are independent.
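The test statistic is the difference in group means divided by its standard error. A hand-rolled stdlib sketch of the unequal-variance (Welch) form with illustrative toy groups, before handing the work to scipy:

```python
from math import sqrt

def welch_t(x, y):
    # t = (mean_x - mean_y) / sqrt(s_x^2/n_x + s_y^2/n_y)
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((a - m) ** 2 for a in v) / (len(v) - 1)
    return (mean(x) - mean(y)) / sqrt(var(x) / len(x) + var(y) / len(y))

t = welch_t([2, 4, 6], [1, 3, 5])
print(round(t, 4))
```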
# Import packages
import pandas as pd
from scipy.stats import ttest_ind
import scipy.stats as stat
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
df.dropna(subset=["totchol"],inplace=True)
# Classify data
x=df[df["sex"]==1]
y=df[df["sex"]==2]
# Calculate mean
mean_x=x["totchol"].mean()
mean_y=y["totchol"].mean()
# Calculate standard error
se_x=x["totchol"].sem()
se_y=y["totchol"].sem()
# Calculate independent two-sample t-test statistics
tt=ttest_ind(x["totchol"],y["totchol"])
# Print results
print(tt)
# Import packages
import pandas as pd
from scipy.stats import ttest_rel
import scipy.stats as stat
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
df.dropna(subset=["totchol"],inplace=True)
# Calculate Mean
mean_x=df["totchol"].mean()
mean_y=df["totchol"].mean()
# Calculate standard error
se_x=df["totchol"].sem()
se_y=df["totchol"].sem()
# Calculate 95% confidence interval
ci_95_x=stat.norm.interval(alpha=0.95,loc=mean_x,scale=se_x)
ci_95_y=stat.norm.interval(alpha=0.95,loc=mean_y,scale=se_y)
# Calculate means difference
mean_diff=mean_x-mean_y
# Calculate Paired T test statistics
tt=ttest_rel(df["totchol"],df["totchol"])
# Format results
txt=("Mean of x={} 95%CI={}\nMean of y={} 95%CI={}\n"
     "Mean difference={}")
# Print results
print(txt.format(mean_x,ci_95_x,mean_y,ci_95_y,mean_diff),tt)
The correlation coefficient between two continuous variables indicates the strength of their
relationship. A coefficient greater than zero indicates a positive relationship, a coefficient less
than zero signifies a negative relationship, and a coefficient equal or close to zero indicates no
relationship between the two variables.
6.3.1 Pearson’s correlation coefficient
Pearson’s correlation coefficient measures a linear relationship between two continuous
variables, both of which are assumed to be normally distributed. It measures the strength and
direction of the relationship between the two variables.
# Import packages
import pandas as pd
from scipy.stats import pearsonr
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
df.dropna(subset=["totchol","age"],inplace=True)
# Calculate correlation coefficient and p-value
corr, p = pearsonr(df["age"], df["totchol"])
# Format results
txt="Pearson Correlation coefficient= {:0.4f}\np-value= {:0.4f}"
# Print results
print(txt.format(corr,p))
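When the normality assumption is doubtful, Spearman's rank correlation is a common alternative (scipy.stats.spearmanr provides it directly). A stdlib sketch of the rank-difference formula, assuming no tied values:

```python
def spearman_rho(x, y):
    # Rank each series (no ties assumed), then apply
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rho = spearman_rho([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
print(rho)
```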
# Import packages
import pandas as pd
import scipy.stats as stat
# load data
df=pd.read_csv("data.csv",index_col="id")
# Calculate proportion
var=df["hyperten"]
Prop=round(var.mean(),3)
# Calculate standard error
se=round(var.sem(),3)
# Calculate 95% confidence interval
ci_95=stat.norm.interval(alpha=0.95,loc=Prop,scale=se)
# Rounding results
ci_lu=round(ci_95[0],3)
ci_up=round(ci_95[1],3)
ci_95=((ci_lu,ci_up))
# Format results
txt=("Prevalence={}\nStandard Error={}\n"
     "95% confidence interval={}")
# Print results
print(txt.format(Prop,se,ci_95))
The output will be:
Prevalence =0.733
Standard Error= 0.007
Chi-square test is a statistical test used to examine the association between two categorical
variables. It is one of the most commonly used tests in health data science. The test measures the
difference between the observed and expected frequencies of the variables in a contingency table.
The chi-square test can help to determine whether the observed differences between the groups are
due to chance or whether they are statistically significant [1, 2, 4].
The H0 (Null Hypothesis): There is no relationship between two variables.
The H1 (Alternative Hypothesis): There is a relationship between two variables.
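The expected frequency for each cell is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. A stdlib sketch with an illustrative 2x2 table of made-up counts:

```python
def chi_square(table):
    # table is a list of rows of observed counts
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n  # expected count
            chi2 += (obs - exp) ** 2 / exp
    return chi2

chi2 = chi_square([[10, 20], [30, 40]])
print(round(chi2, 4))
```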
# Import packages
import pandas as pd
from scipy.stats import chi2_contingency
# Load data
datf=pd.read_csv('data.csv',index_col='id')
# Calculate cross tabulation
my_tab=pd.crosstab(index=datf['death'],columns=datf['sex'])
# Calculate chi square test statistics
c, p, dof, expected = chi2_contingency(my_tab)
# Print results
print(f"Chi2 value={c}\np-value={p}\nDegrees of freedom={dof}")
Chi2 value= 107.6078869510524
p-value= 2.355353265828955e-22
Degrees of freedom= 4
When the expected count in any cell is not greater than 5, the chi-square approximation is
unreliable. In that case, the Fisher exact test is the correct test to use.
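For intuition, the two-sided exact p-value for a 2x2 table can be computed from the hypergeometric distribution with only the standard library; scipy.stats.fisher_exact does this for you. A sketch using the usual "sum of all tables at least as extreme as the observed one" rule:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    r1, c1 = a + b, a + c  # first row and first column totals
    def pmf(k):
        # Hypergeometric probability of k in the top-left cell
        return comb(r1, k) * comb(n - r1, c1 - k) / comb(n, c1)
    p_obs = pmf(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    # Sum probabilities of all tables as or more extreme than observed
    return sum(pmf(k) for k in range(lo, hi + 1)
               if pmf(k) <= p_obs + 1e-12)

# Fisher's classic "lady tasting tea" table
p = fisher_exact_2x2(3, 1, 1, 3)
print(round(p, 4))
```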
# Import packages
import pandas as pd
from scipy.stats import fisher_exact
# Load data
df=pd.read_csv('data.csv',index_col='id')
# Calculate cross tabulation
my_tab=pd.crosstab(index=df['death'],columns=df['sex'])
# Import packages
import pandas as pd
from zepid import RiskDifference
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Calculate risk difference
rd=RiskDifference()
rd.fit(df,exposure="hyperten",outcome="death")
# Print summary results
print(rd.summary())
======================================================================
Risk Difference
======================================================================
Risk SD(Risk) Risk_LCL Risk_UCL
Ref:0 0.298 0.013 0.272 0.324
1 0.368 0.008 0.352 0.385
----------------------------------------------------------------------
RiskDifference SD(RD) RD_LCL RD_UCL
Ref:0 0.000 NaN NaN NaN
1 0.071 0.016 0.04 0.101
----------------------------------------------------------------------
One important aspect of public health research is the quantification of the association between an
exposure and a health outcome. Risk ratio (RR) is widely used in public health research to
estimate the strength of the association between an exposure and a health outcome. RR is defined
as the ratio of the risk of the outcome in the exposed group to the risk of the outcome in the
unexposed group. It provides an estimate of the relative risk of developing the outcome in the
exposed group compared to the unexposed group. RR values greater than 1 indicate an increased
risk of the outcome in the exposed group, while RR values less than 1 indicate a decreased risk.
RR is a useful measure in public health research because it allows researchers to compare the risk
of an outcome between groups with different levels of exposure. It is particularly useful in cohort
and intervention studies, where researchers can measure exposure and follow participants over
time to determine the incidence of the outcome. RR can also be used in cross-sectional studies,
where exposure and outcome data are collected simultaneously and it is known as the prevalence
ratio (PR) [6, 7, 40–42].
# Import packages
import pandas as pd
from zepid import RiskRatio
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Calculate Risk Ratio
rr=RiskRatio()
rr.fit(df,exposure="hyperten",outcome="death")
# Print results
print(rr.summary())
Comparison:0 to 1
+-----+-------+-------+
| | D=1 | D=0 |
+=====+=======+=======+
| E=1 | 1198 | 2054 |
+-----+-------+-------+
| E=0 | 352 | 830 |
+-----+-------+-------+
====================================================================
Risk Ratio
===================================================================
Risk SD(Risk) Risk_LCL Risk_UCL
Ref:0 0.298 0.013 0.272 0.324
1 0.368 0.008 0.352 0.385
----------------------------------------------------------------------
RiskRatio SD(RR) RR_LCL RR_UCL
Ref:0 1.000 NaN NaN NaN
1 1.237 0.05 1.121 1.365
----------------------------------------------------------------------
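The printed ratio can be reproduced directly from the 2x2 counts in the summary above: each group's risk is its deaths divided by its total, and RR is the ratio of the two risks.

```python
# Counts from the contingency table printed above
d1_e1, d0_e1 = 1198, 2054  # exposed: deaths, survivors
d1_e0, d0_e0 = 352, 830    # unexposed: deaths, survivors

risk_exposed = d1_e1 / (d1_e1 + d0_e1)
risk_unexposed = d1_e0 / (d1_e0 + d0_e0)
rr = risk_exposed / risk_unexposed
print(round(rr, 3))  # 1.237, matching zepid's output
```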
Odds ratio (OR) is a widely used measure in public health research to estimate the strength of the
association between an exposure and a health outcome. OR is defined as the odds of exposure
among individuals with the outcome divided by the odds of exposure among individuals without
the outcome. It provides an estimate of the relative odds of developing the outcome in the exposed
group compared to the unexposed group. OR values greater than 1 indicate increased odds of
the outcome in the exposed group, while OR values less than 1 indicate decreased odds. OR is a
useful measure in public health research because it allows researchers to compare the odds of an
outcome between groups with different levels of exposure. It is particularly useful in case-control
studies, where exposure and outcome data are collected retrospectively. OR can also be used in
cross-sectional, cohort and intervention studies [43–46].
# Import packages
import pandas as pd
from zepid import OddsRatio
# Data load
df=pd.read_csv("data.csv",index_col="id")
# Calculate odds ratio (OR)
OR=OddsRatio()
OR.fit(df,exposure="hyperten",outcome="death")
# Print result summary
print(OR.summary())
Comparison:0 to 1
+-----+-------+-------+
| | D=1 | D=0 |
+=====+=======+=======+
| E=1 | 1198 | 2054 |
+-----+-------+-------+
| E=0 | 352 | 830 |
+-----+-------+-------+
======================================================================
Odds Ratio
======================================================================
OddsRatio SD(OR) OR_LCL OR_UCL
Ref:0 1.000 NaN NaN NaN
1 1.375 0.073 1.191 1.588
----------------------------------------------------------------------
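As with the risk ratio, the odds ratio can be reproduced by hand from the 2x2 counts above via the cross-product (a×d)/(b×c).

```python
# Counts from the contingency table printed above
a, b = 1198, 2054  # exposed: deaths, survivors
c, d = 352, 830    # unexposed: deaths, survivors

odds_ratio = (a * d) / (b * c)
print(round(odds_ratio, 3))  # 1.375, matching zepid's output
```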
# Import packages
import pandas as pd
from zepid import IncidenceRateRatio
# Data load
df=pd.read_csv("Incidencerate.csv",index_col="id")
# Calculate incidence rate ratio
IRR=IncidenceRateRatio()
IRR.fit(df,exposure="sex",outcome="total_case",time="ptime")
# Print result summary
print(IRR.summary())
Comparison:0 to 1
+-----+-------+---------------+
| | D=1 | Person-time |
+=====+=======+===============+
| E=1 | 176 | 3502.94 |
+-----+-------+---------------+
| E=0 | 486 | 10467.8 |
+-----+-------+---------------+
======================================================================
Incidence Rate Ratio
======================================================================
IncRate SD(IncRate) IncRate_LCL IncRate_UCL
Ref:0 0.046 0.002 0.042 0.051
1 0.050 0.004 0.043 0.058
----------------------------------------------------------------------
IncRateRatio SD(IRR) IRR_LCL IRR_UCL
Ref:0 1.000 NaN NaN NaN
1 1.082 0.088 0.911 1.286
----------------------------------------------------------------------
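The incidence rate ratio in the summary can likewise be checked by hand: each rate is events divided by person-time, and the IRR is the ratio of the two rates.

```python
# Events and person-time from the summary table printed above
events_e, ptime_e = 176, 3502.94
events_u, ptime_u = 486, 10467.8

rate_exposed = events_e / ptime_e
rate_unexposed = events_u / ptime_u
irr = rate_exposed / rate_unexposed
print(round(irr, 3))  # 1.082, matching zepid's output
```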
# Import packages
import pandas as pd
from zepid import Diagnostics
# Data load
df=pd.read_csv("Testdata.csv",index_col="id")
# Calculate sensitivity and specificity
Dia=Diagnostics()
Dia.fit(df,test="art", disease="dead")
# Print result summary
print(Dia.summary())
+----+------+------+
| | D+ | D- |
+====+======+======+
| T+ | 10 | 67 |
+----+------+------+
| T- | 77 | 363 |
+----+------+------+
======================================================================
Diagnostics
======================================================================
Sensitivity SD(Se) Se_LCL Se_UCL
0 0.13 0.038 0.055 0.205
Specificity SD(Sp) Sp_LCL Sp_UCL
0 0.825 0.018 0.789 0.861
======================================================================
Chapter 8
Regression analysis for adjusting variables and
clustering effect
Introduction
Public health research studies aim to identify associations between exposures and health outcomes.
However, these associations can be confounded by other variables that are associated with both
the exposure and the outcome, and which may misrepresent the true effect of the exposure.
Confounding can lead to biased estimates of the association and wrong conclusions. One approach
to addressing confounding in health data science is through multivariate regression analysis.
Multivariate regression analysis allows us to adjust for the effects of potential confounding
variables and estimate the effect of the exposure of interest on the outcome while controlling for
the influence of confounders. Additionally, public health research studies often include the
investigation of outcomes in populations or groups of individuals. In many cases, the data collected
from these studies show a clustering effect, where individuals within the same cluster or group are
more similar to each other than they are to individuals in other clusters. This clustering effect can
lead to biased estimates of the effects of risk factors or interventions if it is not properly accounted
for in the analysis. Adjusting for the clustering effect is essential for obtaining accurate and precise
estimates of the effects of risk factors on health outcomes. This can be achieved through the use
of appropriate regression models that account for the correlation among individuals within the
same cluster.
This chapter will focus on multivariate regression analysis as a tool for adjusting for confounding
and clustering effects in health data science. This chapter will also provide an overview of regressions
commonly used in health data science, including linear regression, logistic regression, Poisson
regression and Cox proportional hazards regression models, and will be used to account for
clustering, including Generalized Estimating Equations (GEE), random effects model, and Mixed
effect model. Overall, this chapter aims to provide a comprehensive introduction to regression
analysis for adjusting covariates and clustering effects in health data science.
Linear regression analysis is a statistical technique in health data science to investigate the
relationship between an exposure and an outcome. The method involves fitting a linear equation
to the data, with the aim of estimating the effect of exposure on the outcome as a crude estimate.
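For a single predictor, the fitted line has a closed form: slope = cov(x, y) / var(x), with the intercept recovered from the means. A stdlib sketch with exact toy data (illustrative numbers, not the book's data file):

```python
def ols_fit(x, y):
    # slope = cov(x, y) / var(x); intercept from the means
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    slope = cov / var
    return my - slope * mx, slope

intercept, slope = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])  # data on y = 2x + 1
print(intercept, slope)
```

statsmodels' OLS, used in the examples that follow, solves the same least-squares problem for any number of predictors.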
# Import packages
import pandas as pd
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
subset=["totchol","bmi"]
df.dropna(subset= subset ,inplace=True)
# Set variables
y=df["totchol"]
x=df["bmi"]
# Add constant in the regression
x=sm.add_constant(x)
# Fit linear regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Print result summary
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: totchol R-squared: 0.015
Model: OLS Adj. R-squared: 0.015
Method: Least Squares F-statistic: 66.65
Date: Mon, 20 Mar 2023 Prob (F-statistic): 4.20e-16
Time: 10:32:23 Log-Likelihood: -22737.
No. Observations: 4364 AIC: 4.548e+04
Df Residuals: 4362 BIC: 4.549e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------------------
const 202.4903 4.284 47.262 0.000 194.091 210.890
bmi 1.3371 0.164 8.164 0.000 1.016 1.658
# Import packages
import pandas as pd
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Drop NaN
subset=["totchol","bmi","age","cigpday"]
df.dropna(subset= subset, inplace=True)
# Set variables
y=df["totchol"]
x=df[["bmi","age","cigpday"]]
# Add constant
x=sm.add_constant(x)
# Fit regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Print result summary
print(res.summary())
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------------------------------
const 146.6426 5.497 26.676 0.000 135.865 157.420
bmi 1.0281 0.161 6.370 0.000 0.712 1.345
age 1.2601 0.077 16.296 0.000 1.108 1.412
cigpday 0.1033 0.056 1.846 0.065 -0.006 0.213
==============================================================================
import pandas as pd
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
# Set column names
sex_dummy.columns=["Male","Female"]
diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy],axis=1)
# Drop NaN
subset=["totchol","bmi","age","cigpday","Male","Female",
        "nodiabetes","Diabetes"]
df1.dropna(subset= subset,inplace=True)
# Set x and y variables
y=df1["totchol"]
x=df1[["bmi","age","cigpday","Female","Diabetes"]]
# Add constant
x=sm.add_constant(x)
# Fit OLS regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Print results
print(res.summary())
==============================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------
const 138.0041 5.715 24.149 0.000 126.801 149.208
bmi 1.1181 0.162 6.897 0.000 0.800 1.436
age 1.2741 0.077 16.453 0.000 1.122 1.426
cigpday 0.2175 0.059 3.682 0.000 0.102 0.333
Female 8.1163 1.393 5.828 0.000 5.386 10.846
Diabetes 2.4708 4.036 0.612 0.540 -5.442 10.384
==============================================================================
# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
# Create columns name
sex_dummy.columns=["Male","Female"]
diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy],axis=1)
# Drop NaN
subset=["totchol","bmi","age","cigpday","Female","Diabetes"]
df1.dropna(subset= subset, inplace=True)
# Set x and y variables
y=df1["totchol"]
x=df1[["Female","bmi", "age","Diabetes"]]
# Add constant
x=sm.add_constant(x)
# Fit regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Format Results
res_means=np.round(res.params,3)
res_pvalues=np.round(res.pvalues,3)
res_ci=np.round(res.conf_int(),3)
column=["Mean Difference"]
model_Means=pd.DataFrame(res_means,columns=column)
model_Means["p-values"]=res_pvalues
model_Means[["2.5%","97.5%"]]=res_ci
# Print results
print(model_Means)
# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
bins=[0,200,np.inf]
names=["Low","High"]
df['new_chol']=pd.cut(df["totchol"],bins,labels=names)
chol_dummy=pd.get_dummies(df["new_chol"])
# Create column names
sex_dummy.columns=["Male","Female"]
diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy,chol_dummy],axis=1)
# Drop NaN
subset=["prevchd","High","bmi","age","Male","Diabetes"]
df1.dropna(subset= subset, inplace=True)
# Set x and y variables
y=df1["prevchd"]
x=df1[["High","Male","bmi", "age","Diabetes"]]
# Add constant
x=sm.add_constant(x)
# Fit regression model
mod=sm.OLS(y,x)
res=mod.fit()
# Format results
res_rd=np.round(res.params,3)
res_pvalues=np.round(res.pvalues,3)
res_ci=np.round(res.conf_int(),3)
columns=["Prevalence Difference"]
model_Rd=pd.DataFrame(res_rd,columns=columns)
model_Rd["p-values"]=res_pvalues
model_Rd[["2.5% ", " 97.5%"]]=res_ci
# Print results
print(model_Rd)
Poisson regression is commonly used in public health research studies where the data of
interest involve the frequency of occurrence of a certain event or outcome. This regression
model allows us to investigate the relationship between the outcome variable and an exposure
while taking potential confounders into account. The model assumes that the logarithm of the
expected count is a linear function of the covariates [3, 22, 25].
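Concretely, for an expected count and covariates x_1, ..., x_p the model is

```latex
\log \mathbb{E}[Y \mid x_1,\dots,x_p] \;=\; \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p ,
```

so that exp(beta_j) is the rate ratio associated with a one-unit increase in x_j, holding the other covariates fixed.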
# Import packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Load dataset
df=pd.read_csv("London.csv",index_col="slno")
# Drop NaN values
subset=["temperature","relative_humidity","numdeaths","ozone10"]
df.dropna(subset= subset, inplace=True)
# Set variables
y=df["numdeaths"]
x=df[["temperature","relative_humidity","ozone10"]]
# Add constant
x=sm.add_constant(x)
# Fit Poisson regression model
mod=sm.Poisson(y,x)
res=mod.fit()
# Format results
res_rr=np.exp(res.params)
res_pvalues=res.pvalues
res_ci=np.exp(res.conf_int())
model_RR=pd.DataFrame(res_rr,columns=["RR"])
model_RR["p-values"]=res_pvalues
model_RR[["2.5%","97.5%"]]=res_ci
# Print results
print(model_RR)
# Import packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Load data
df=pd.read_csv("data.csv",index_col="id")
# Create dummy variables
sex_dummy=pd.get_dummies(df["sex"])
diabetes_dummy=pd.get_dummies(df["diabetes"])
# Create column names
sex_dummy.columns=["Male","Female"]
diabetes_dummy.columns=["nodiabetes","Diabetes"]
# Join data
df1=pd.concat([df,sex_dummy,diabetes_dummy],axis=1)
# Drop NaN value
subset=["prevstrk","totchol","bmi","age","cigpday","Female","Diabetes"]
df1.dropna(subset=subset, inplace=True)
# Set x and y variables
y=df1["prevstrk"]
x=df1[["totchol","bmi","age","cigpday","Female","Diabetes"]]
# Add constant
x=sm.add_constant(x)
# Fit logistic regression model
mod=sm.Logit(y,x)
res=mod.fit()
# Format results as odds ratios
model_OR=pd.DataFrame(np.exp(res.params),columns=["OR"])
model_OR["p-values"]=res.pvalues
model_OR[["2.5%","97.5%"]]=np.exp(res.conf_int())
# Print results
print(model_OR)
# Import packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.conditional_models import ConditionalLogit
# Load data
df=pd.read_csv("paireddata.csv",index_col="slno")
# Drop NaN values
subset=["idcode","year", "age","grade", "not_smsa",
"south","union","black"]
df.dropna(subset= subset, inplace=True)
# Set variables
y=df["union"]
x=df[["age","grade","not_smsa","south","black"]]
group=df["idcode"]
# Fit conditional logistic regression model
mod=ConditionalLogit(y,x,groups=group)
model=mod.fit()
# Calculate exponential values
model_or=np.exp(model.params)
model_pvalues=model.pvalues
model_ci=np.exp(model.conf_int())
model_OR=pd.DataFrame(model_or,columns=["OR"])
model_OR["p-values"]=model_pvalues
model_OR[["2.5%","97.5%"]]=model_ci
# Print results
print(model_OR.head())
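ConditionalLogit can be exercised on simulated matched sets; all numbers below are illustrative. Each set shares an unobserved intercept that the conditional likelihood eliminates, so the fitted coefficient should recover the simulated log-odds ratio of 0.8:

```python
import numpy as np
import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

# Simulate 300 matched sets of 4, each with a shared set-level intercept
rng = np.random.default_rng(2)
groups = np.repeat(np.arange(300), 4)
x1 = rng.normal(size=groups.size)
u = rng.normal(size=300)[groups]           # set-level effects, conditioned away
p = 1.0 / (1.0 + np.exp(-(u + 0.8 * x1)))
y = rng.binomial(1, p)

# Fit conditional logistic regression within matched sets
mod = ConditionalLogit(y, pd.DataFrame({"x1": x1}), groups=groups)
res = mod.fit()
beta = float(np.asarray(res.params)[0])
print(np.exp(beta))                        # odds ratio for x1
```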
# Import packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
# Load data
df=pd.read_csv("Geedata.csv",index_col="slno")
# Drop NaN values
subset=["id","numvisit","age","educ","married","badh","loginc",
"reform","summer"]
df.dropna(subset= subset, inplace=True)
# Set covariance structure (independence, exchangeable, or autoregressive)
ind=sm.cov_struct.Exchangeable()
# Set regression family
family=sm.families.Poisson()
# Fit GEE model
mod=smf.gee("numvisit ~ reform+age+educ+married+badh+loginc+summer",
            "id", df, cov_struct=ind, family=family)
model=mod.fit()
# Calculate exponential values
model_rr=np.exp(model.params)
model_pvalues=model.pvalues
model_ci=np.exp(model.conf_int())
model_RR=pd.DataFrame(model_rr,columns=["RR"])
model_RR["p-values"]=model_pvalues
model_RR[["2.5%","97.5%"]]=model_ci
# Print results
print(model_RR.head())
Mixed effects models, also known as hierarchical or multilevel models, are statistical models
commonly used in public health research to analyze clustered and longitudinal data.
These models account for the correlation among observations within the same cluster or individual,
and the variability between different clusters or individuals. Mixed effects models allow for the
analysis of complex data structures, where the outcome variable is influenced by both individual-
level and group-level factors. These models can accommodate a wide range of outcome variables,
including binary, categorical, and continuous variables [19].
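For a binary outcome y_ij on individual i in cluster j, a logistic mixed model with a random intercept can be written as

```latex
\operatorname{logit} P(y_{ij} = 1) \;=\; \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j ,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
```

where the cluster-level intercept u_j absorbs the correlation among observations from the same cluster.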
# Import packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Load data
df=pd.read_csv("mixeddata.csv")
# Drop NaN
subset=["kid","mom","cluster","immun","kid2p","mom25p","order23",
"order46","order7p","indNoSpa","indSpa","momEdPri","momEdSec",
"husEdPri","husEdSec","husEdDK","momWork","rural","pcInd81"]
df.dropna(subset=subset, inplace=True)
# Fit mixed effect model with a random intercept per cluster.
# statsmodels has no lme4-style "(1|cluster)" formula syntax; its
# random-intercept logistic model is BinomialBayesMixedGLM, where the
# random effect is given as a variance-component formula.
formula=("immun ~ kid2p+mom25p+order23+order46+order7p+indNoSpa+"
"indSpa+momEdPri+momEdSec+husEdPri+husEdSec+husEdDK+"
"momWork+rural+pcInd81")
mod=sm.BinomialBayesMixedGLM.from_formula(
    formula, {"cluster": "0 + C(cluster)"}, df)
model=mod.fit_vb()
# Calculate exponential coefficient (OR) from posterior means
model_OR=pd.DataFrame({"OR": np.round(np.exp(model.fe_mean),3)},
                      index=mod.fep_names)
# Print results and summary
print(model.summary())
print(model_OR)
subset=["kid","mom","cluster","immun","kid2p","mom25p","order23",
"order46","order7p","indNoSpa","indSpa","momEdPri","momEdSec",
"husEdPri","husEdSec","husEdDK","momWork","rural","pcInd81"]
df.dropna(subset=subset, inplace=True)
# Fit multilevel model: random intercepts for cluster and for mom
# (mothers nested within clusters). The lme4-style "(1|cluster/mom)"
# syntax is written here as two variance-component formulas, assuming
# mom identifiers are unique across clusters.
mod=sm.BinomialBayesMixedGLM.from_formula(
    "immun ~ kid2p+mom25p+order23+order46+order7p+indNoSpa+"
    "indSpa+momEdPri+momEdSec+husEdPri+husEdSec+husEdDK+"
    "momWork+rural+pcInd81",
    {"cluster": "0 + C(cluster)", "mom": "0 + C(mom)"}, df)
model=mod.fit_vb()
# Calculate exponential coefficient (OR) from posterior means
model_OR=pd.DataFrame({"OR": np.round(np.exp(model.fe_mean),3)},
                      index=mod.fep_names)
# Print results
print(model_OR)
References
1. Rashid MM, Akhtar Z, Chowdhury S, Islam MA, Parveen S, Ghosh PK, Rahman A, Khan
ZH, Islam K, Debnath N (2022) Pattern of antibiotic use among hospitalized patients
according to WHO access, watch, reserve (AWaRe) classification: Findings from a point
prevalence survey in Bangladesh. Antibiotics 11:810
2. Islam MA, Akhtar Z, Hassan MZ, Chowdhury S, Rashid MM, Aleem MA, Ghosh PK, Mah-
E-Muneer S, Parveen S, Ahmmed MK (2022) Pattern of antibiotic dispensing at pharmacies
according to the WHO Access, Watch, Reserve (AWaRe) classification in Bangladesh.
Antibiotics 11:247
3. Ghosh PK, Das P, Goswam DR, Islam A, Chowdhury S, Mollah MM, Harun GD, Akhtar Z,
Chowdhury F (2021) Maternal Characteristics Mediating the Impact of Household Poverty
on the Nutritional Status of Children Under 5 Years of Age in Bangladesh. Food Nutr Bull
42:389–398
4. Chowdhury F, Shahid ASMSB, Ghosh PK, Rahman M, Hassan MZ, Akhtar Z, Muneer SM-
E-, Shahrin L, Ahmed T, Chisti MJ (2020) Viral etiology of pneumonia among severely
malnourished under-five children in an urban hospital, Bangladesh. PloS One 15:e0228329
5. Biswas D, Ahmed M, Roguski K, Ghosh PK, Parveen S, Nizame FA, Rahman MZ,
Chowdhury F, Rahman M, Luby SP (2019) Effectiveness of a behavior change intervention
with hand sanitizer use and respiratory hygiene in reducing laboratory-confirmed influenza
among schoolchildren in Bangladesh: a cluster randomized controlled trial. Am J Trop Med
Hyg 101:1446–1455
6. Halder AK, Luby SP, Akhter S, Ghosh PK, Johnston RB, Unicomb L (2017) Incidences and
costs of illness for diarrhea and acute respiratory infections for children <5 years of age in
rural Bangladesh. Am J Trop Med Hyg 96:953
7. Alam M-U, Luby SP, Halder AK, Islam K, Opel A, Shoab AK, Ghosh PK, Rahman M,
Mahon T, Unicomb L (2017) Menstrual hygiene management among Bangladeshi
adolescent schoolgirls and risk factors affecting school absence: results from a cross-
sectional survey. BMJ Open 7:e015508
8. Chowdhury S, Barai L, Afroze SR, Ghosh PK, Afroz F, Rahman H, Ghosh S, Hossain MB,
Rahman MZ, Das P (2022) The epidemiology of melioidosis and its association with
diabetes mellitus: a systematic review and meta-analysis. Pathogens 11:149
9. Sultana R, Luby SP, Gurley ES, Rimi NA, Swarna ST, Khan JA, Nahar N, Ghosh PK,
Howlader SR, Kabir H (2021) Cost of illness for severe and non-severe diarrhea borne by
households in a low-income urban community of Bangladesh: A cross-sectional study. PLoS
Negl Trop Dis 15:e0009439
10. Islam A, McKee C, Ghosh PK, Abedin J, Epstein JH, Daszak P, Luby SP, Khan SU, Gurley
ES (2021) Seasonality of date palm sap feeding behavior by bats in Bangladesh. EcoHealth
18:359–371
11. Chowdhury F, Shahid ASMSB, Tabassum M, Parvin I, Ghosh PK, Hossain MI, Alam NH,
Faruque A, Huq S, Shahrin L (2021) Vitamin D supplementation among Bangladeshi
children under-five years of age hospitalised for severe pneumonia: A randomised placebo
controlled trial. Plos One 16:e0246460
12. Akhtar Z, Islam MA, Aleem MA, Mah-E-Muneer S, Ahmmed MK, Ghosh PK, Rahman M,
Rahman MZ, Sumiya MK, Rahman MM (2021) SARS-CoV-2 and influenza virus
coinfection among patients with severe acute respiratory infection during the first wave of
COVID-19 pandemic in Bangladesh: a hospital-based descriptive study. BMJ Open
11:e053768
13. Akhtar Z, Chowdhury F, Rahman M, Ghosh PK, Ahmmed MK, Islam MA, Mott JA, Davis
W (2021) Seasonal influenza during the COVID-19 pandemic in Bangladesh. PLoS One
16:e0255646
14. Akhtar Z, Chowdhury F, Aleem MA, Ghosh PK, Rahman M, Rahman M, Hossain ME,
Sumiya MK, Islam AM, Uddin MJ (2021) Undiagnosed SARS-CoV-2 infection and
outcome in patients with acute MI and no COVID-19 symptoms. Open Heart 8:e001617
15. Akhtar Z, Aleem MA, Ghosh PK, Islam AM, Chowdhury F, MacIntyre CR, Fröbert O
(2021) In-hospital and 30-day major adverse cardiac events in patients referred for ST-
segment elevation myocardial infarction in Dhaka, Bangladesh. BMC Cardiovasc Disord
21:1–9
16. Ghosh P, Mollah MM (2020) The risk of public mobility from hotspots of COVID-19 during
travel restriction in Bangladesh. J Infect Dev Ctries 14:732–736
17. Ghosh PK, Mollah MMH, Chowdhury AA, Alam N, Harun GD (2020) Hypertension and
sex related differences in mortality of COVID-19 infection: A systematic review and Meta-
analysis
18. Ghosh P (2020) The Dissimilarity of Attack Rate (AR) of SARS-CoV-2 Virus and Infection
Fatality Risk (IFR) Across Different Divisions of Bangladesh. J Trop Dis 8:356
19. Chowdhury S, Azziz-Baumgartner E, Kile JC, Hoque MA, Rahman MZ, Hossain ME,
Ghosh PK, Ahmed SS, Kennedy ED, Sturm-Ramirez K (2020) Association of biosecurity
and hygiene practices with environmental contamination with influenza a viruses in live bird
markets, Bangladesh. Emerg Infect Dis 26:2087
20. Chowdhury F, Shahid ASMSB, Ghosh PK, Rahman M, Hassan MZ, Akhtar Z, Muneer SM-
E-, Shahrin L, Ahmed T, Chisti MJ (2020) Viral etiology of pneumonia among severely
malnourished under-five children in an urban hospital, Bangladesh. PloS One 15:e0228329
21. Chowdhury S, Hossain ME, Ghosh PK, Ghosh S, Hossain MB, Beard C, Rahman M,
Rahman MZ (2019) The pattern of highly pathogenic avian influenza H5N1 outbreaks in
South Asia. Trop Med Infect Dis 4:138
22. Biswas D, Ahmed M, Roguski K, Ghosh PK, Parveen S, Nizame FA, Rahman MZ,
Chowdhury F, Rahman M, Luby SP (2019) Effectiveness of a behavior change intervention
with hand sanitizer use and respiratory hygiene in reducing laboratory-confirmed influenza
among schoolchildren in Bangladesh: a cluster randomized controlled trial. Am J Trop Med
Hyg 101:1446–1455
23. Chowdhury F, Ghosh PK, Shahunja K, Shahid AS, Shahrin L, Sarmin M, Sharifuzzaman,
Afroze F, Chisti MJ (2018) Hyperkalemia was an independent risk factor for death while
under mechanical ventilation among children hospitalized with diarrhea in Bangladesh. Glob
Pediatr Health 5:2333794X17754005
24. Parvez SM, Kwong L, Rahman MJ, Ercumen A, Pickering AJ, Ghosh PK, Rahman MZ, Das
KK, Luby SP, Unicomb L (2017) Escherichia coli contamination of child complementary
foods and association with domestic hygiene in rural Bangladesh. Trop Med Int Health
22:547–557
25. Halder AK, Luby SP, Akhter S, Ghosh PK, Johnston RB, Unicomb L (2017) Incidences and
costs of illness for diarrhea and acute respiratory infections for children <5 years of age in
rural Bangladesh. Am J Trop Med Hyg 96:953
26. Horng L, Unicomb L, Alam M-U, Halder A, Ghosh P, Luby S (2015) Health Worker and
Family Caregiver Hand Hygiene in Bangladesh Healthcare Facilities: Results From a
Nationally Representative Survey. Infectious Diseases Society of America, p 1621
28. Aquino JEAP de, Cruz Filho NA, Aquino JNP de (2011) Epidemiology of middle ear and
mastoid cholesteatomas: study of 1146 cases. Braz J Otorhinolaryngol 77:341–347
29. Auman JT, Boorman GA, Wilson RE, Travlos GS, Paules RS (2007) Heat map visualization
of high-density clinical chemistry data. Physiol Genomics 31:352–356
30. Bougioukas KI, Vounzoulaki E, Mantsiou CD, Savvides ED, Karakosta C, Diakonidis T,
Tsapas A, Haidich A-B (2021) Methods for depicting overlap in overviews of systematic
reviews: An introduction to static tabular and graphical displays. J Clin Epidemiol 132:34–
45
31. Gadi N, Saleh S, Johnson J-A, Trinidade A (2022) The impact of the COVID-19 pandemic
on the lifestyle and behaviours, mental health and education of students studying healthcare-
related courses at a British university. BMC Med Educ 22:115
32. Robinson JM, Jorgensen A, Cameron R, Brindley P (2020) Let nature be thy medicine: a
socioecological exploration of green prescribing in the UK. Int J Environ Res Public Health
17:3460
34. Lewis SJ, Gardner M, Higgins J, Holly JM, Gaunt TR, Perks CM, Turner SD, Rinaldi S,
Thomas S, Harrison S (2017) Developing the WCRF International/University of Bristol
35. Rothman KJ, Greenland S, Lash TL (2008) Measures of effect and measures of association.
na
36. Bertollini R, Lebowitz MD, Savitz DA, Saracci R (1995) Environmental epidemiology:
exposure and disease. CRC Press
37. Hoy D, Brooks P, Blyth F, Buchbinder R (2010) The epidemiology of low back pain. Best
Pract Res Clin Rheumatol 24:769–781
39. Kok BC, Herrell RK, Thomas JL, Hoge CW (2012) Posttraumatic stress disorder associated
with combat service in Iraq or Afghanistan: reconciling prevalence differences between
studies. J Nerv Ment Dis 200:444–450
40. Parvez SM, Kwong L, Rahman MJ, Ercumen A, Pickering AJ, Ghosh PK, Rahman MZ, Das
KK, Luby SP, Unicomb L (2017) Escherichia coli contamination of child complementary
foods and association with domestic hygiene in rural Bangladesh. Trop Med Int Health
22:547–557
41. Robbins AS, Chao SY, Fonseca VP (2002) What’s the relative risk? A method to directly
estimate risk ratios in cohort studies of common outcomes. Ann Epidemiol 12:452–454
42. Katz D, Baptista J, Azen S, Pike M (1978) Obtaining confidence intervals for the risk ratio in
cohort studies. Biometrics 469–474
43. Chen H, Cohen P, Chen S (2010) How big is a big odds ratio? Interpreting the magnitudes of
odds ratios in epidemiological studies. Commun Stat Comput 39:860–864
44. VanderWeele TJ, Vansteelandt S (2010) Odds ratios for mediation analysis for a
dichotomous outcome. Am J Epidemiol 172:1339–1348
45. Vandenbroucke JP, Pearce N (2012) Case–control studies: basic concepts. Int J Epidemiol
41:1480–1489
46. Plant JD, Lund EM, Yang M (2011) A case–control study of the risk factors for canine
juvenile‐onset generalized demodicosis in the USA. Vet Dermatol 22:95–99
48. Waterman BR, Owens BD, Davey S, Zacchilli MA, Belmont Jr PJ (2010) The epidemiology
of ankle sprains in the United States. JBJS 92:2279–2284
commercial IgM ELISA for the diagnosis of human leptospirosis in Thailand. Am J Trop
Med Hyg 86:524
50. Haynes RB (2012) Clinical epidemiology: how to do clinical practice research. Lippincott
williams & wilkins
51. Chowdhury F, Shahid ASMSB, Ghosh PK, Rahman M, Hassan MZ, Akhtar Z, Muneer SM-
E-, Shahrin L, Ahmed T, Chisti MJ (2020) Viral etiology of pneumonia among severely
malnourished under-five children in an urban hospital, Bangladesh. PloS One 15:e0228329
52. Shakerkhatibi M, Dianat I, Asghari Jafarabadi M, Azak R, Kousha A (2015) Air pollution
and hospital admissions for cardiorespiratory diseases in Iran: artificial neural network
versus conditional logistic regression. Int J Environ Sci Technol 12:3433–3442
53. Unicomb L, Horng L, Alam M-U, Halder AK, Shoab AK, Ghosh PK, Islam MK, Opel A,
Luby SP (2018) Health-care facility water, sanitation, and health-care waste management
basic service levels in Bangladesh: results from a nation-wide survey. Am J Trop Med Hyg
99:916