
Electronic Health Records EHR Data Analysis Using Hadoop and Spark

This project analyzes Electronic Health Records (EHR) data using Apache Hadoop and Spark to process large-scale healthcare data, focusing on patient demographics, hospital admissions, and stay durations. It involves setting up the environment, uploading data to HDFS, and performing data analysis and visualizations with Python. Future improvements include implementing machine learning models, optimizing performance, and integrating real-time data streaming.

Uploaded by

reaperz0704

Electronic Health Records (EHR) Data Analysis using Hadoop and Spark

This project focuses on analyzing Electronic Health Records (EHR)
data using Apache Hadoop and Apache Spark to efficiently process
large-scale healthcare data. The analysis involves data preprocessing,
exploratory data analysis (EDA), and visualizations to extract
meaningful insights regarding patient demographics, hospital
admissions, and stay durations.
by Siddharth Panda
Project Setup and Prerequisites

Apache Hadoop
Configured and running for distributed data processing.

Apache Spark
Used for fast data analysis and processing.

Python
With Pandas, NumPy, Matplotlib, and Seaborn libraries.

Ensure Apache Hadoop and Spark are installed and configured.
Python is required with Pandas, NumPy, Matplotlib, and Seaborn
libraries. A Jupyter Notebook or any Python IDE is needed. HDFS must
be configured and running.
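
The prerequisites above can be checked from Python before any analysis starts. The following is a minimal sketch, not part of the original project: it assumes the libraries are importable under their usual names and that the hdfs and spark-submit commands are on PATH after installation.

```python
import importlib.util
import shutil

# Python libraries named in the prerequisites (import names).
REQUIRED_LIBRARIES = ["pandas", "numpy", "matplotlib", "seaborn"]

# Command-line tools assumed to be on PATH after a standard install.
REQUIRED_COMMANDS = ["hdfs", "spark-submit"]


def missing_libraries(names):
    """Return the library names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_commands(names):
    """Return the command names that are not available on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing Python libraries:", missing_libraries(REQUIRED_LIBRARIES))
    print("Missing commands:", missing_commands(REQUIRED_COMMANDS))
```

Two empty lists suggest the Python side and the command-line tools are in place. Whether HDFS is actually running is a separate check.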
Dataset Overview
Patient Demographics
Age, gender, and other relevant patient information.

Hospital Admission Sources
Details on how patients were admitted to the hospital.

Length of Hospital Stays
Duration of patient stays in the hospital.

Discharge Statuses
Information on patient discharge outcomes.

The dataset, ehr_data.csv, contains structured data related to hospital records. It
includes fields such as patient demographics (age, gender, etc.), hospital
admission sources, length of hospital stays, and discharge statuses.
Running Hadoop Services

Start DFS
Run start-dfs.sh to start Hadoop Distributed File System.

Start YARN
Run start-yarn.sh to start YARN services.

Verify Services
Use the jps command to verify that NameNode and DataNode are running.

To start Hadoop services, run the start-dfs.sh and start-yarn.sh
commands. Verify that NameNode and DataNode are running using the jps
command. This ensures the Hadoop environment is properly set up for data
processing.
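
The jps check described above can be automated. The sketch below is illustrative rather than the project's own script: it relies on jps printing one "<pid> <name>" pair per line, and the function names are hypothetical.

```python
import subprocess

# Daemons the slides ask to verify. ResourceManager and NodeManager could be
# added to this set to also check the YARN services.
EXPECTED_DAEMONS = {"NameNode", "DataNode"}


def running_daemons(jps_output):
    """Extract process names from jps output, where each line is '<pid> <name>'."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in the jps output."""
    return set(expected) - running_daemons(jps_output)


def check_live_cluster():
    """Run jps on this machine and report missing daemons (needs a JDK on PATH)."""
    result = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    return missing_daemons(result.stdout)
```

Calling check_live_cluster() on the machine that hosts the daemons should return an empty set when both are up. Anything it returns names a service that still needs to be started.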
Data Upload to HDFS
1. Create Directories
Create /EHR_project and /EHR_project/input in HDFS.

2. Upload Dataset
Copy ehr_data.csv to /EHR_project/input/.

3. Verify Upload
Use hdfs dfs -ls to confirm the file is in HDFS.

Create directories in HDFS for storing the dataset using hdfs dfs -mkdir /EHR_project and hdfs dfs -mkdir
/EHR_project/input. Copy the dataset into HDFS using hdfs dfs -put, which leaves the local file in place. Verify
the upload using hdfs dfs -ls /EHR_project/input/.
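
The same three steps can be scripted. This is a minimal sketch under stated assumptions: ehr_data.csv is in the current working directory, the hdfs command is on PATH, and a single -mkdir -p call replaces the two separate -mkdir calls. The function names are hypothetical.

```python
import subprocess

HDFS_INPUT_DIR = "/EHR_project/input"  # directory named in the slides
LOCAL_DATASET = "ehr_data.csv"         # assumed to be in the working directory


def upload_commands(local_path=LOCAL_DATASET, hdfs_dir=HDFS_INPUT_DIR):
    """Build the hdfs dfs commands for creating the directory, uploading, and listing."""
    return [
        # -p creates parent directories as needed and does not fail if they exist.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # -put copies the local file into HDFS; the local copy is left in place.
        ["hdfs", "dfs", "-put", local_path, hdfs_dir + "/"],
        # Listing the directory confirms that the file arrived.
        ["hdfs", "dfs", "-ls", hdfs_dir + "/"],
    ]


def run_upload(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

With HDFS running, run_upload(upload_commands()) performs the upload. Building the commands separately from running them makes it easy to print and inspect them first.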
EHR Data Analysis with Python
Load Data
Load EHR data from HDFS using Pandas.

Data Cleaning
Handle missing values using forward fill.

Spark DataFrame
Convert Pandas DataFrame to Spark DataFrame.

Data Visualization
Create visualizations using Matplotlib and Seaborn.

Use a Python script to perform analysis on the dataset. Load data from HDFS, handle missing values, convert to Spark
DataFrame, and create visualizations. Key visualizations include gender distribution, age distribution, hospital admission
sources, hospital stay length, and patient discharge status.
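
The cleaning and conversion steps might look like the sketch below. The slides do not give the real schema or say how the file is read from HDFS, so the column names and sample rows are assumptions made for illustration, and the data is read from an in-memory string instead of HDFS.

```python
import io

import pandas as pd

# Illustrative sample only; the column names are assumptions, not the real schema.
SAMPLE_CSV = """patient_id,age,gender,admission_source,length_of_stay
1,64,M,Emergency,3
2,,F,,5
3,71,,Referral,
"""


def load_and_clean(source):
    """Read a CSV into Pandas and forward-fill missing values from the previous row."""
    df = pd.read_csv(source)
    # ffill leaves a value missing if there is no earlier row to copy from.
    return df.ffill()


def to_spark(pandas_df, app_name="EHRAnalysis"):
    """Convert a Pandas DataFrame to a Spark DataFrame (requires pyspark)."""
    # Imported lazily so the Pandas part runs without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark.createDataFrame(pandas_df)


cleaned = load_and_clean(io.StringIO(SAMPLE_CSV))
print(cleaned)
```

Forward fill copies the previous row's value into each gap, so one patient's age or gender can end up on the next patient's record. That is simple to apply, but it may be worth reviewing per column on real data.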
Key Insights from Visualizations
Gender Distribution: 52%
Identifies the proportion of male patients.

Average Age: 65
Displays the average age of patients.

Admission Source: Emergency
Highlights the most common admission source.

The visualizations provide insights into gender distribution, age distribution, hospital admission sources, hospital stay
length, and patient discharge status. These insights help identify trends and patterns in the data.
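
Headline figures of this kind can be computed with a few Pandas calls. The sketch below uses hypothetical records and assumed column names, so its numbers are not the project's results.

```python
import pandas as pd

# Hypothetical records for illustration; column names and values are assumptions.
records = pd.DataFrame(
    {
        "gender": ["M", "F", "M", "F", "M"],
        "age": [70, 58, 66, 61, 75],
        "admission_source": [
            "Emergency", "Referral", "Emergency", "Transfer", "Emergency",
        ],
    }
)


def summarize(df):
    """Compute the share of male patients, the mean age, and the top admission source."""
    return {
        # The mean of a boolean column is the fraction of True values.
        "male_share": float((df["gender"] == "M").mean()),
        "average_age": float(df["age"].mean()),
        # value_counts sorts by frequency; idxmax returns the most frequent label.
        "top_admission_source": df["admission_source"].value_counts().idxmax(),
    }


summary = summarize(records)
print(summary)
```

The same summarize function could be applied to the cleaned EHR DataFrame once its actual column names are substituted.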
Conclusion and Future Improvements
Scalable Analysis
Hadoop and Spark enable scalable data analysis for large healthcare datasets.

Machine Learning
Implement Machine Learning models to predict patient outcomes.

Performance Optimization
Optimize performance using Spark SQL and Parquet format.

Real-time Data
Integrate real-time data streaming with Kafka and Spark Streaming.

This project demonstrates how Hadoop and Spark enable scalable data analysis and
visualization for large healthcare datasets. Possible future improvements include
implementing Machine Learning models to predict patient outcomes, performance
optimization using Spark SQL and Parquet format, and real-time data streaming
integration with Kafka and Spark Streaming.
Thank You
Thank you for attending this presentation.

We have covered the key aspects of using Hadoop and Spark for EHR
data analysis.
