Bda 22 - Merged
Bachelor of Engineering
in
Computer Engineering
by
Kartikey Singh (21102096)
Suraj Yadav (21102120)
Parth Vora (21102007)
for
Big Data Analytics Lab (CSL7012)
Table of Contents
1. Introduction
2. Objectives
3. Scope
4. Summarizing the Dataset
6. Algorithms Details
7. Result
8. References
1. Introduction
MIMIC-III (Medical Information Mart for Intensive Care III) is one of the
most comprehensive large-scale databases available for healthcare analytics. It
contains health-related data for over 40,000 critical care patients
admitted to the Beth Israel Deaconess Medical Center between 2001 and 2012. The
database is a valuable resource not only for health professionals but also for
researchers and data scientists seeking insights into patient care, treatment
patterns, and outcomes in critical care units. Given the richness and depth of
its data, MIMIC-III serves as a benchmark for numerous research studies in
health informatics, machine learning, and clinical decision-making.
In this project, we use Python and PySpark to analyze the MIMIC-III dataset.
The analysis extracts insights from several tables of the dataset, focusing on
patients' diagnosis history, treatment procedures, and clinical notes. Through
this project, we aim to demonstrate how data-driven insights can enhance
healthcare strategies, improve patient care, and contribute to better
decision-making in the medical field.
2. Objectives:
To explore and analyze the MIMIC-III dataset to understand patient care
patterns in critical care units.
To extract and visualize key insights from the dataset using Big Data
analytics tools such as PySpark and Python.
To identify the most common diagnoses and procedures and analyze their
relationships using the ICD-9 coding system.
To create visualizations that help in understanding trends and
patterns in the healthcare data.
To demonstrate how big data analytics can contribute to improving
healthcare delivery, decision-making, and patient outcomes.
3. Scope:
Data Ingestion: Extracting data from structured CSV files stored on Google
Drive and loading them into PySpark DataFrames for efficient processing.
Data Preprocessing: Cleaning, transforming, and integrating data from
multiple tables (e.g., noteevents, diagnosis_icd, procedures_icd) to create a
comprehensive dataset suitable for analysis.
Data Analysis: Analyzing the data using PySpark and Python to identify
patterns, trends, and relationships among diagnoses, procedures, and patient
outcomes.
Data Visualization: Using visualization tools such as Matplotlib and Seaborn
to present the insights derived from the analysis in a clear and concise manner.
Outcome Prediction: Providing actionable insights that can help healthcare
professionals make informed decisions and improve patient care strategies.
4. Summarizing the Dataset:
1. noteevents Table
Columns Present: ROW_ID, SUBJECT_ID, HADM_ID, CHARTDATE,
CHARTTIME, STORETIME, CATEGORY, DESCRIPTION, CGID,
ISERROR, TEXT.
Columns Selected: SUBJECT_ID, HADM_ID, TEXT.
Description: This table contains over 2 million rows with clinical notes,
divided into sections such as admission date, discharge summary, history of
present illness, medications, allergies, and laboratory studies.
Purpose: Extract patient-related information, understand the patient's
medical journey, and analyze common trends in disease progression.
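As an illustration of how the sectioned note text could be pulled apart, the following minimal Python sketch splits one note into sections with a regular expression. The note text, the header list, and the split_sections helper are hypothetical and are not taken from the project code; real notes vary in layout and would likely need a more tolerant pattern.

```python
import re

# Hypothetical note text; real TEXT values are longer and less regular.
note = (
    "Admission Date: 2101-01-01\n"
    "History of Present Illness:\n"
    "Patient with hypertension and diabetes.\n"
    "Medications:\n"
    "Aspirin\n"
    "Allergies:\n"
    "None known\n"
)

# Assumed section headers, based on the sections named in the description.
HEADERS = ["Admission Date", "History of Present Illness", "Medications", "Allergies"]
HEADER_RE = re.compile(
    r"^(%s):" % "|".join(re.escape(h) for h in HEADERS), re.MULTILINE
)


def split_sections(text):
    """Return a dict mapping each header found to the text that follows it."""
    matches = list(HEADER_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next header, or to the end of the note.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections


sections = split_sections(note)
```

Each value in sections can then be analyzed separately, for example by searching only the medications section for drug names.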
2. diagnosis_icd Table
Columns Present: ROW_ID, SUBJECT_ID, HADM_ID, SEQ_NUM,
ICD9_CODE.
Columns Selected: SUBJECT_ID, HADM_ID, ICD9_CODE.
Description: Contains around 651,000 rows with 6,984 unique diagnosis codes.
For each SUBJECT_ID and HADM_ID combination, a patient can have
between 1 and 38 diagnoses, with SEQ_NUM giving their priority order.
Purpose: Identify the most frequent diagnoses and study the relationships
between different diseases using the ICD-9 codes.
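The per-admission structure described above can be illustrated in plain Python. This is a minimal sketch on hypothetical rows, not the project's code: it groups codes by (SUBJECT_ID, HADM_ID), counts the diagnoses of each admission, and picks the code with the lowest SEQ_NUM.

```python
from collections import defaultdict

# Hypothetical rows in the shape (SUBJECT_ID, HADM_ID, SEQ_NUM, ICD9_CODE).
rows = [
    (10, 100, 1, "4019"),
    (10, 100, 2, "4280"),
    (10, 100, 3, "25000"),
    (11, 101, 1, "4280"),
]

# Group the codes of each admission, keeping SEQ_NUM so they can be ordered.
by_admission = defaultdict(list)
for subject_id, hadm_id, seq_num, code in rows:
    by_admission[(subject_id, hadm_id)].append((seq_num, code))

# Number of diagnoses per admission.
diagnosis_counts = {key: len(codes) for key, codes in by_admission.items()}

# Code ranked first in each admission (the tuple with the smallest SEQ_NUM).
first_ranked = {key: min(codes)[1] for key, codes in by_admission.items()}
```

ICD-9 codes are kept as strings here so that any leading zeros are preserved.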
3. procedures_icd Table
Columns Present: ROW_ID, SUBJECT_ID, HADM_ID, SEQ_NUM,
ICD9_CODE.
Columns Selected: SUBJECT_ID, HADM_ID, ICD9_CODE.
Description: Contains around 240,000 rows with 2,009 unique procedure
codes.
6. Algorithms Details
Data Ingestion: Data from the MIMIC-III dataset was imported into
PySpark DataFrames using spark.read.csv.
Data Preprocessing:
Handled missing values with dropna().
Removed duplicate records using dropDuplicates().
Converted data types using cast().
Merged tables using PySpark's join() function.
Data Analysis:
Aggregated data using groupBy() and count() functions to identify top
diagnoses and procedures.
Generated insights on patient demographics, diagnoses, and treatments.
Data Visualization: Used Seaborn and Matplotlib to create bar plots,
heatmaps, and word clouds.
7. Result:
1. The most frequent diagnoses were related to cardiovascular and respiratory
conditions.
2. Common procedures included respiratory intubation and mechanical
ventilation.
3. The heatmap showed strong correlations between certain diagnoses and
procedures, indicating typical treatment pathways.
4. Text analysis of clinical notes identified common themes such as
"hypertension," "diabetes," and "heart failure."
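A heatmap of diagnoses against procedures needs a table of co-occurrence counts as its input. The following minimal sketch shows one way such a table could be built in plain Python. The (diagnosis, procedure) pairs are hypothetical, not project results, and the report does not state how its heatmap input was computed.

```python
from collections import Counter

# Hypothetical (diagnosis code, procedure code) pairs, one per joined row.
pairs = [
    ("4019", "9604"),
    ("4280", "9604"),
    ("4019", "9671"),
    ("4280", "9604"),
]

# Count how often each diagnosis appears together with each procedure.
counts = Counter(pairs)

# Fixed row and column order for the matrix.
diag_codes = sorted({diag for diag, _ in pairs})
proc_codes = sorted({proc for _, proc in pairs})

# Rows are diagnoses, columns are procedures; missing pairs count as 0.
matrix = [[counts[(diag, proc)] for proc in proc_codes] for diag in diag_codes]
```

A nested list like matrix can be passed to a plotting function such as seaborn.heatmap, with diag_codes and proc_codes as the axis tick labels.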
8. References