A Mini Project Report on

Health Care Analysis System

Submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Engineering
in
Computer Engineering

by
Kartikey Singh (21102096)
Suraj Yadav (21102120)
Parth Vora (21102007)

Under the Guidance of
Prof. Shamika Mule

for
Big Data Analytics Lab (CSL7012)

Department of Computer Engineering
A.P. Shah Institute of Technology
G.B. Road, Kasarvadavli, Thane (W), Mumbai-400615
UNIVERSITY OF MUMBAI

Academic Year 2024-2025


Index

Sr. No.   Table of Contents
1         Introduction
2         Objectives
3         Scope
4         Summarizing the Dataset
5         Visualizing the Dataset
6         Algorithm Details
7         Result
8         References

1. Introduction

MIMIC-III (Medical Information Mart for Intensive Care III) is one of the most
comprehensive large-scale databases available for healthcare analytics. It
contains health-related data for over 40,000 critical care patients admitted to
the Beth Israel Deaconess Medical Center between 2001 and 2012. The database is
a valuable resource for health professionals, researchers, and data scientists
seeking insight into patient care, treatment patterns, and outcomes in critical
care units. Given the richness and depth of its data, MIMIC-III serves as a
benchmark for numerous research studies in health informatics, machine learning,
and clinical decision-making.

The data in MIMIC-III includes detailed patient information such as
demographics, vital signs, laboratory test results, procedures, medications, and
textual clinical notes. It covers patient care from admission details,
diagnoses, and treatment procedures through to discharge summaries, which makes
it suitable for studying disease progression, treatment effectiveness, and
patient outcomes. The granularity of the data enables healthcare researchers to
perform deep analyses that can drive advances in medical research, predict
patient outcomes, and identify areas for improving healthcare delivery.

In this project, we use Big Data analytics with Python and PySpark to analyze
the MIMIC-III dataset. The analysis extracts insights from several tables of
the dataset, focusing on patients' diagnosis history, treatment procedures, and
clinical notes. Through this project, we aim to demonstrate how data-driven
insights can enhance healthcare strategies, improve patient care, and support
better decision-making in the medical field.

2. Objectives:
- To explore and analyze the MIMIC-III dataset to understand patient care
  patterns in critical care units.
- To extract and visualize key insights from the dataset using Big Data
  analytics tools such as PySpark and Python.
- To identify the most common diagnoses and procedures and analyze their
  relationships using the ICD-9 coding system.
- To create meaningful visualizations that help in understanding trends and
  patterns in the healthcare data.
- To demonstrate how big data analytics can contribute to improving
  healthcare delivery, decision-making, and patient outcomes.

3. Scope:
- Data Ingestion: Extracting data from structured CSV files stored on Google
  Drive and loading them into PySpark DataFrames for efficient processing.
- Data Preprocessing: Cleaning, transforming, and integrating data from
  multiple tables (e.g., noteevents, diagnosis_icd, procedures_icd) to create a
  comprehensive dataset suitable for analysis.
- Data Analysis: Analyzing the data using PySpark and Python to identify
  patterns, trends, and relationships among diagnoses, procedures, and patient
  outcomes.
- Data Visualization: Using visualization tools such as Matplotlib and Seaborn
  to present the insights derived from the analysis in a clear and concise manner.
- Outcome Prediction: Providing actionable insights that can help healthcare
  professionals make informed decisions and improve patient care strategies.

4. Summarizing the Dataset:

1. noteevents Table
Columns Present: ROW_ID, SUBJECT_ID, HADM_ID, CHARTDATE,
CHARTTIME, STORETIME, CATEGORY, DESCRIPTION, CGID,
ISERROR, TEXT.
Columns Selected: SUBJECT_ID, HADM_ID, TEXT.
Description: This table contains over 2 million rows of clinical notes. The
note text is divided into sections such as admission date, discharge summary,
history of present illness, medications, allergies, and laboratory studies.
Purpose: Extract patient-related information, understand the patient's
medical journey, and analyze common trends in disease progression.

2. diagnosis_icd Table
Columns Present: ROW_ID, SUBJECT_ID, HADM_ID, SEQ_NUM,
ICD9_CODE.
Columns Selected: SUBJECT_ID, HADM_ID, ICD9_CODE.
Description: Contains around 651,000 rows with 6,984 unique diagnoses.
Each SUBJECT_ID and HADM_ID combination has between 1 and 38 diagnoses,
with SEQ_NUM giving their priority order.
Purpose: Identify the most frequent diagnoses and study the relationships
between different diseases using the ICD-9 codes.

3. procedures_icd Table
Columns Present: ROW_ID, SUBJECT_ID, HADM_ID, SEQ_NUM,
ICD9_CODE.
Columns Selected: SUBJECT_ID, HADM_ID, ICD9_CODE.
Description: Contains around 240,000 rows with 2,009 unique procedure
codes.

6. Algorithm Details

- Data Ingestion: Data from the MIMIC-III dataset was imported into
  PySpark DataFrames using spark.read.csv.
- Data Preprocessing:
  - Handled missing values with dropna().
  - Removed duplicate records using dropDuplicates().
  - Converted data types using cast().
  - Merged tables using PySpark's join() function.
- Data Analysis:
  - Aggregated data using groupBy() and count() to identify the top
    diagnoses and procedures.
  - Generated insights on patient demographics, diagnoses, and treatments.
- Data Visualization: Used Seaborn and Matplotlib to create bar plots,
  heatmaps, and word clouds.

7. Result:
1. The most frequent diagnoses were related to cardiovascular and respiratory
conditions.
2. Common procedures included respiratory intubation and mechanical
ventilation.
3. The heatmap showed strong correlations between certain diagnoses and
procedures, indicating typical treatment pathways.
4. Text analysis of clinical notes identified common themes such as
"hypertension," "diabetes," and "heart failure."

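The report does not state how the text analysis of the clinical notes was performed. One simple way to surface recurring themes is to count how many notes mention each phrase, as in the plain-Python sketch below. The four notes, the `THEMES` tuple, and the function name are all hypothetical stand-ins for the TEXT column of noteevents.

```python
import re
from collections import Counter

# Phrases to look for; taken from the themes named in the results.
THEMES = ("hypertension", "diabetes", "heart failure")

# Made-up notes standing in for the TEXT column (not real patient data).
notes = [
    "History of present illness: hypertension and diabetes.",
    "Discharge summary: admitted with congestive heart failure.",
    "Medications reviewed for hypertension. No known allergies.",
    "Past medical history: Diabetes mellitus, heart failure, hypertension.",
]


def count_notes_mentioning(texts, themes):
    """Count, for each theme, the number of notes that mention it at least once."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for theme in themes:
            # Word boundaries stop a theme from matching inside a longer word.
            if re.search(r"\b" + re.escape(theme) + r"\b", lowered):
                counts[theme] += 1
    return counts


theme_counts = count_notes_mentioning(notes, THEMES)
for theme, n in theme_counts.most_common():
    print(f"{theme}: {n} notes")
```

Counting notes rather than raw occurrences keeps one long note that repeats a term from dominating the totals.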
8. References

MIMIC-III Clinical Database. Available on PhysioNet (doi: 10.13026/C2XW26).
Choi et al. (2017). Medical Data Analysis with Big Data Technologies.
Xiao, Choi, & Sun (2018). Healthcare Data Analytics: Challenges and
Opportunities.
