Data Science Projects

This repository contains a collection of Data Science projects that demonstrate various skills in data analysis, modeling, and deployment.

Projects

1. Centers for Medicare & Medicaid Services (CMS) Claims Data Analysis with PySpark and FPGrowth

  • Description: This project processes CMS synthetic inpatient claims data (2008–2010) to uncover hidden patterns in ICD-9 diagnosis and procedure codes using the FPGrowth frequent pattern mining algorithm. The pipeline cleans the raw claims data, standardizes the codes, and extracts frequent item sets and association rules to reveal significant healthcare trends.
  • Tools Used: Python, PySpark, Pandas, NumPy
  • Key Features:
    • Preprocesses CMS claims data by filtering out non-ICD-9 columns and labeling diagnosis and procedure codes.
    • Leverages PySpark DataFrame operations to efficiently process large-scale healthcare datasets.
    • Uses FPGrowth to mine frequent item sets and extract meaningful association rules (see the sketch after this list).
    • Ranks association rules by support, confidence, and lift to highlight significant patterns in patient care.
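  A minimal sketch of the mining step is shown below. It assumes the claims have already been cleaned into one list of labeled ICD-9 codes per row; the column names, sample rows, and thresholds are illustrative and not taken from this repository.

    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth

    spark = SparkSession.builder.appName("cms-fpgrowth").getOrCreate()

    # Toy rows standing in for preprocessed inpatient claims: each claim is a
    # list of labeled ICD-9 diagnosis (DGN_) and procedure (PRC_) codes.
    claims = spark.createDataFrame(
        [(0, ["DGN_4019", "DGN_25000", "PRC_3893"]),
         (1, ["DGN_4019", "DGN_41401"]),
         (2, ["DGN_25000", "DGN_4019", "PRC_3995"])],
        ["claim_id", "codes"],
    )

    # Mine frequent item sets and association rules from the code lists.
    fpgrowth = FPGrowth(itemsCol="codes", minSupport=0.3, minConfidence=0.5)
    model = fpgrowth.fit(claims)

    # Rank item sets by frequency and rules by lift to surface the strongest
    # co-occurrence patterns.
    model.freqItemsets.orderBy("freq", ascending=False).show(truncate=False)
    model.associationRules.orderBy("lift", ascending=False).show(truncate=False)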

2. Exoplanet ESI Analysis

  • Description: This project analyzes exoplanetary data from the NASA Exoplanet Archive to compute the Earth Similarity Index (ESI), a quantitative measure of how similar an exoplanet is to Earth based on its stellar flux and planetary radius. The analysis ranks exoplanets by their ESI score, identifying the most potentially habitable candidates.
  • Tools Used: Python, Pandas, NumPy, Seaborn, Matplotlib
  • Key Features:
    • Preprocesses exoplanet data by handling missing values and removing duplicate observations.
    • Implements a computational model to evaluate planetary habitability via the ESI (see the sketch after this list).
    • Identifies and ranks the top 10 exoplanets with the highest Earth similarity.
    • Visualizes the ESI distribution to highlight trends in planetary characteristics.
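  A minimal sketch of the ESI calculation is shown below. It assumes a CSV export of the NASA Exoplanet Archive with the standard "pl_name", "pl_insol" (incident flux in Earth units), and "pl_rade" (radius in Earth radii) columns; the file name and weight exponents are illustrative assumptions, not values taken from this repository.

    import numpy as np
    import pandas as pd

    def esi_term(value, weight, reference=1.0):
        # One similarity term of the ESI: 1 at the Earth reference value,
        # falling toward 0 as the parameter departs from it.
        return (1.0 - np.abs((value - reference) / (value + reference))) ** weight

    # Hypothetical export of the archive's planetary systems table.
    planets = pd.read_csv("exoplanets.csv")
    planets = planets.dropna(subset=["pl_insol", "pl_rade"])
    planets = planets.drop_duplicates(subset=["pl_name"])

    # Combine the flux and radius terms into one score (illustrative weights).
    planets["esi"] = (esi_term(planets["pl_insol"], weight=0.5)
                      * esi_term(planets["pl_rade"], weight=0.5))

    # Ten most Earth-like candidates by ESI score.
    print(planets.nlargest(10, "esi")[["pl_name", "pl_insol", "pl_rade", "esi"]])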

3. Near-Earth Object (NEO) ETL Pipeline

  • Description: This project extracts, transforms, and loads (ETL) data from NASA's Near-Earth Object (NEO) API to analyze asteroid close approaches. The pipeline collects asteroid data, cleans and formats it, and stores it in a structured format for further analysis.
  • Tools Used: Python, Pandas, PySpark, Requests
  • Key Features:
    • Data Extraction – Fetches asteroid close-approach data from NASA's API.
    • Data Transformation – Converts the JSON response into a structured DataFrame (see the sketch after this list).
    • Data Storage – Stores the final dataset in a PySpark DataFrame for analysis.
    • Scalability – Uses PySpark for efficient big data processing.
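  A minimal sketch of the extract, transform, and load steps is shown below. It uses NASA's public NeoWs feed endpoint; the date range and the DEMO_KEY key are placeholders, and the selected fields are an illustrative subset of the API response, not necessarily the ones kept by this pipeline.

    import requests
    import pandas as pd
    from pyspark.sql import SparkSession

    # Extract: fetch one week of close-approach data from the NeoWs API.
    url = "https://api.nasa.gov/neo/rest/v1/feed"
    params = {"start_date": "2024-01-01", "end_date": "2024-01-07", "api_key": "DEMO_KEY"}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    payload = response.json()

    # Transform: flatten the nested JSON into one record per close approach.
    records = []
    for date, asteroids in payload["near_earth_objects"].items():
        for neo in asteroids:
            for approach in neo["close_approach_data"]:
                records.append({
                    "name": neo["name"],
                    "approach_date": approach["close_approach_date"],
                    "miss_distance_km": float(approach["miss_distance"]["kilometers"]),
                    "velocity_kph": float(approach["relative_velocity"]["kilometers_per_hour"]),
                    "hazardous": neo["is_potentially_hazardous_asteroid"],
                })

    # Load: store the flattened records in a PySpark DataFrame for analysis.
    spark = SparkSession.builder.appName("neo-etl").getOrCreate()
    neo_df = spark.createDataFrame(pd.DataFrame(records))
    neo_df.show(5)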

How to Use

  1. Clone the repository:
    git clone https://github.com/skoc3r/Data-Science-Projects.git
    cd Data-Science-Projects
