Electronic Health Records EHR Data Analysis Using Hadoop and Spark
Discharge Statuses
Information on patient discharge outcomes.
Start DFS
Run start-dfs.sh to start the Hadoop Distributed File System (HDFS).
Start YARN
Run start-yarn.sh to start YARN services.
Verify Services
Run the jps command to verify that the NameNode and DataNode are running.
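The verification step can also be scripted. The sketch below is an illustration, not the presenter's own tooling: it parses jps output and reports which expected daemons are missing. The names missing_daemons and check_cluster are hypothetical, and the list of required daemons assumes a single-node setup where both start scripts ran on this machine.

```python
import subprocess

# Daemons expected after start-dfs.sh and start-yarn.sh on a single-node
# setup (an assumption; a multi-node cluster spreads these across hosts).
REQUIRED_DAEMONS = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}


def missing_daemons(jps_output, required=REQUIRED_DAEMONS):
    """Return the required daemon names that do not appear in jps output.

    Each jps line looks like "<pid> <ClassName>"; lines without a class
    name are ignored.
    """
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            running.add(parts[1])
    return sorted(set(required) - running)


def check_cluster():
    """Run jps (shipped with the JDK) and return the missing daemons."""
    result = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    return missing_daemons(result.stdout)
```

Calling check_cluster() on the Hadoop host returns an empty list when every expected daemon is up. To check only the two daemons named on this slide, pass required={"NameNode", "DataNode"} to missing_daemons.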
Upload Dataset
Move ehr_data.csv to /EHR_project/input/.
Verify Upload
Use hdfs dfs -ls to confirm the file is in HDFS.
Create directories in HDFS for the dataset with hdfs dfs -mkdir /EHR_project and
hdfs dfs -mkdir /EHR_project/input. Copy the dataset into HDFS with hdfs dfs -put, then verify the upload with
hdfs dfs -ls /EHR_project/input/.
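The create, upload, and verify sequence can be wrapped in a small script. This is a minimal sketch under stated assumptions: ehr_data.csv is assumed to sit in the current working directory, the helper names upload_commands and run_upload are hypothetical, and a single hdfs dfs -mkdir -p call stands in for the two separate -mkdir commands.

```python
import subprocess

HDFS_DIR = "/EHR_project/input"
LOCAL_FILE = "ehr_data.csv"  # assumed to be in the current working directory


def upload_commands(local_file=LOCAL_FILE, hdfs_dir=HDFS_DIR):
    """Build the HDFS commands for the create, upload, verify sequence."""
    return [
        # -p creates the parent /EHR_project as well and does not fail
        # if the directory already exists.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", local_file, hdfs_dir + "/"],
        ["hdfs", "dfs", "-ls", hdfs_dir + "/"],
    ]


def run_upload(local_file=LOCAL_FILE, hdfs_dir=HDFS_DIR):
    """Run each command in order, stopping at the first failure."""
    for cmd in upload_commands(local_file, hdfs_dir):
        subprocess.run(cmd, check=True)
```

Keeping the command list separate from the code that runs it lets the sequence be inspected or logged before anything touches the cluster.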
EHR Data Analysis with Python
Load Data -> Data Cleaning -> Spark DataFrame -> Data Visualization
A Python script performs the analysis on the dataset: it loads the data from HDFS, handles missing values, converts the
result to a Spark DataFrame, and creates visualizations. Key visualizations include gender distribution, age distribution,
hospital admission sources, hospital stay length, and patient discharge status.
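One way such a script might look is sketched below. It is not the script used in this project. The column names (gender, age, admission_source, length_of_stay, discharge_status) are hypothetical and must be replaced with the real headers in ehr_data.csv, and the cleaning policy (drop rows missing a numeric field, label missing categories "Unknown") is an assumption. The sketch also reads the CSV directly with Spark rather than converting from another structure, and it needs pyspark, pandas, and matplotlib installed.

```python
HDFS_CSV = "hdfs:///EHR_project/input/ehr_data.csv"

# Hypothetical column names; replace with the headers in ehr_data.csv.
CATEGORICAL_COLS = ["gender", "admission_source", "discharge_status"]
NUMERIC_COLS = ["age", "length_of_stay"]


def build_fill_map(categorical_cols=CATEGORICAL_COLS):
    """Map each categorical column to the label used for missing values."""
    return {col: "Unknown" for col in categorical_cols}


def load_and_clean(spark, path=HDFS_CSV):
    """Read the CSV from HDFS into a Spark DataFrame and handle nulls."""
    df = spark.read.csv(path, header=True, inferSchema=True)
    df = df.dropna(subset=NUMERIC_COLS)   # drop rows missing a numeric field
    return df.fillna(build_fill_map())    # label missing categories


def plot_category_counts(df, column, out_file):
    """Save a bar chart of patient counts per value of one column."""
    import matplotlib
    matplotlib.use("Agg")                 # render to a file, no display needed
    import matplotlib.pyplot as plt

    # Aggregate in Spark, then bring only the small result to pandas.
    counts = df.groupBy(column).count().toPandas()
    fig, ax = plt.subplots()
    ax.bar(counts[column].astype(str), counts["count"])
    ax.set_xlabel(column)
    ax.set_ylabel("patients")
    fig.savefig(out_file)
    plt.close(fig)


def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("EHRAnalysis").getOrCreate()
    df = load_and_clean(spark)
    for col in CATEGORICAL_COLS:
        plot_category_counts(df, col, col + ".png")
    spark.stop()
```

Calling main() covers the categorical charts (gender, admission source, discharge status). The age and stay-length distributions would need a histogram instead of a bar chart of counts.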
Key Insights from Visualizations
Gender Distribution: 52%
Identifies the proportion of male patients.
Average Age: 65
Displays the average age of patients.
Admission Source: Emergency
Highlights the most common admission source.
The visualizations provide insights into gender distribution, age distribution, hospital admission sources, hospital stay
length, and patient discharge status. These insights help identify trends and patterns in the data.
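Headline figures of this kind (share of male patients, average age, most common admission source) are simple aggregates. The sketch below computes them on a handful of in-memory records to show the arithmetic. The records are made up for illustration and are not taken from ehr_data.csv, and the keys gender, age, and admission_source are hypothetical. On a Spark DataFrame the same numbers would come from groupBy and avg aggregations.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Compute three headline figures from a list of patient dicts.

    Assumes hypothetical keys "gender", "age", and "admission_source";
    records with a missing value are skipped for that figure only.
    """
    genders = [r["gender"] for r in records if r.get("gender")]
    ages = [r["age"] for r in records if r.get("age") is not None]
    sources = Counter(
        r["admission_source"] for r in records if r.get("admission_source")
    )
    return {
        "male_share": sum(g == "Male" for g in genders) / len(genders),
        "average_age": mean(ages),
        "top_admission_source": sources.most_common(1)[0][0],
    }


# Made-up records for illustration only.
sample = [
    {"gender": "Male", "age": 70, "admission_source": "Emergency"},
    {"gender": "Female", "age": 60, "admission_source": "Referral"},
    {"gender": "Male", "age": None, "admission_source": "Emergency"},
    {"gender": "Female", "age": 50, "admission_source": "Emergency"},
]
summary = summarize(sample)
```

For these four records the male share is 0.5, the average age is 60 (the record with a missing age is skipped), and the top admission source is "Emergency".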
Conclusion and Future Improvements
Scalable Analysis
Hadoop and Spark enable scalable data analysis for large healthcare datasets.
Machine Learning
Implement Machine Learning models to predict patient outcomes.
Performance Optimization
Optimize performance using Spark SQL and Parquet format.
Real-time Data
Integrate real-time data streaming with Kafka and Spark Streaming.
This project demonstrates how Hadoop and Spark enable scalable data analysis and
visualization for large healthcare datasets. Possible future improvements include
implementing machine learning models to predict patient outcomes, optimizing
performance with Spark SQL and the Parquet format, and integrating real-time data
streaming with Kafka and Spark Streaming.
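As a rough idea of what the Spark SQL and Parquet improvement could involve, the sketch below rewrites the CSV as Parquet once and then queries the columnar copy with Spark SQL. Everything specific here is an assumption: the output path /EHR_project/parquet/ehr_data, the view name ehr, and the columns admission_source and length_of_stay are hypothetical, and the function expects an existing SparkSession.

```python
CSV_PATH = "hdfs:///EHR_project/input/ehr_data.csv"
PARQUET_PATH = "hdfs:///EHR_project/parquet/ehr_data"  # hypothetical location


def avg_stay_query(view="ehr", group_col="admission_source",
                   stay_col="length_of_stay"):
    """Build a Spark SQL query for the average stay length per group."""
    return (
        f"SELECT {group_col}, AVG({stay_col}) AS avg_stay "
        f"FROM {view} GROUP BY {group_col} ORDER BY avg_stay DESC"
    )


def convert_and_query(spark, csv_path=CSV_PATH, parquet_path=PARQUET_PATH):
    """Write the CSV out as Parquet, then run the query on the Parquet copy."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)

    # Parquet stores the schema and is columnar, so later reads skip both
    # schema inference and the columns a query does not touch.
    parquet_df = spark.read.parquet(parquet_path)
    parquet_df.createOrReplaceTempView("ehr")
    return spark.sql(avg_stay_query())
```

With a session created by SparkSession.builder.appName("EHRAnalysis").getOrCreate(), calling convert_and_query(spark).show() would print the grouped averages.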
Thank You
Thank you for attending this presentation.
We have covered the key aspects of using Hadoop and Spark for EHR
data analysis.