CIS 5500 Database
Section I - Motivation
We chose this project because we saw a need for a streamlined way to find healthcare
providers by specific characteristics: proximity, specialty, cost, and ratings. Currently, a
Google search for healthcare providers near you returns a page cluttered with ads and
sponsored websites, so you do not always find the best or most appropriate provider for
your specific needs. That is why we decided to build a website dedicated to this kind of
search. Additionally, since the search already requires this information from the user, we
thought it would be beneficial to also tell people which conditions they might be
predisposed to and educate them on what can be done. We see this as a real problem:
people often do not know which conditions they are predisposed to or what steps they
need to take to avoid potentially life-altering conditions.
User Table
CREATE TABLE Users (
    UserID INT PRIMARY KEY,
    DemographicInfo VARCHAR(255),
    GeographicInformation VARCHAR(255)
);
Insurance Table
CREATE TABLE Insurance (
    InsuranceID INT PRIMARY KEY,
    GeographicArea VARCHAR(255),
    PlanName VARCHAR(255),
    PlanBenefits TEXT,
    APILink VARCHAR(255)
);
1. Handling Missing Values: For essential fields that cannot be imputed (e.g., NPI,
Provider Last Name, Provider First Name), we will consider removing rows with missing
values. For non-essential fields, we can instead fill missing values with a placeholder
(e.g., "Unknown" for categorical data, or the column's median for numerical data).
2. Deduplication: We will identify and remove duplicate entries to avoid redundancy. This
is particularly important for providers of healthcare insurance services that are listed
multiple times with slight variations in their address or other details, since such
duplicates could produce incorrect search results.
3. Standardization: Standardize the formatting of key fields such as names, addresses, and
phone numbers to ensure consistency. This might include converting text to title case,
removing extraneous characters from phone numbers, and standardizing address
formats.
4. Data Type Conversions: We will need to ensure that each column is of the appropriate
data type. For example, ZIP codes should be treated as strings to preserve leading
zeros, and graduation years should be integers.
5. Normalization: We also plan to normalize the dataset to ensure that similar data points
are represented uniformly. This might involve unifying similar specialty names or
grouping them into broader categories to facilitate easier analysis and matching. This
matters especially for the dataset of insurance services per region: to match on it,
records need to be grouped efficiently by distance and proximity.
6. Feature Engineering: We will create new features that could be useful for our application.
For example, we can extract or compute the provider's years of experience from the
graduation year, or create flags indicating if the provider offers telehealth services based
on the Telehealth field.
7. Handling Specialties: The dataset contains multiple columns for specialties (pri_spec,
sec_spec_1, sec_spec_2, etc.). We can aggregate these into a single column or a
structured format (like a list) associated with each provider to simplify querying and
analysis.
8. Geographical Data: For fields like City/Town, State, and ZIP Code, we will ensure these
are correctly formatted and consider creating a combined location field. This is
especially important because the application's geolocation features depend on it.
9. Binary/Indicator Fields: For fields like Telehlth, ind_assgn, and grp_assgn, we will ensure
they are consistently coded (e.g., Y/N converted to True/False) to facilitate analysis and
filtering.
10. Validation: Lastly, to make sure our data is consistent and correct, we will run validity
checks on all geographic information and specialties.
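Several of the steps above (missing values, deduplication, standardization, type conversion, feature engineering, specialty aggregation, and indicator fields) can be sketched in plain Python. This is only an illustrative sketch, not our final pipeline: the dictionary keys npi, last_name, first_name, phone, zip, and grad_year are assumed names for the corresponding columns, and treating two rows with the same NPI and ZIP code as duplicates is an assumption made for the example.

```python
import re
from datetime import date

# Assumed key names for the essential fields (NPI, provider last/first name).
REQUIRED = ("npi", "last_name", "first_name")
SPECIALTY_COLS = ("pri_spec", "sec_spec_1", "sec_spec_2")


def clean_record(row, current_year=None):
    """Return a cleaned copy of one provider row, or None if an essential field is missing."""
    current_year = current_year or date.today().year
    # Treat None as empty and trim whitespace on every value.
    row = {k: ("" if v is None else str(v).strip()) for k, v in row.items()}

    # Step 1: drop rows that lack an essential field.
    if any(not row.get(col) for col in REQUIRED):
        return None

    # Step 4: graduation year becomes an integer when it is parseable.
    grad = row.get("grad_year", "")
    grad_year = int(grad) if grad.isdigit() else None

    # Step 7: gather the specialty columns into one list, skipping blanks and repeats.
    specialties = []
    for col in SPECIALTY_COLS:
        spec = row.get(col, "")
        if spec and spec not in specialties:
            specialties.append(spec)

    zip_raw = row.get("zip", "")
    return {
        "npi": row["npi"],
        # Step 3: title-case names and keep only digits in phone numbers.
        "last_name": row["last_name"].title(),
        "first_name": row["first_name"].title(),
        "phone": re.sub(r"\D", "", row.get("phone", "")) or "Unknown",
        # Step 4: ZIP codes stay strings; restore leading zeros, drop any +4 suffix.
        "zip": zip_raw.split("-")[0].zfill(5) if zip_raw else "Unknown",
        "grad_year": grad_year,
        # Step 6: derive years of experience from the graduation year.
        "years_experience": current_year - grad_year if grad_year is not None else None,
        "specialties": specialties,
        # Step 9: Y/N indicator converted to True/False (anything but "Y" is False).
        "telehealth": row.get("Telehlth", "").upper() == "Y",
    }


def clean_dataset(rows, current_year=None):
    """Clean every row and drop duplicates (step 2), keyed on NPI plus ZIP code."""
    seen = set()
    cleaned = []
    for row in rows:
        record = clean_record(row, current_year)
        if record is None:
            continue
        key = (record["npi"], record["zip"])  # assumed duplicate key for this sketch
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

Calling clean_dataset on a list of dictionaries (for example, rows read with csv.DictReader) returns the surviving rows in their original order. The duplicate key would likely need to be revisited once we see how often a provider legitimately appears at several addresses within one ZIP code.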