
Using machine learning to predict the total cost of a medical service


Abstract
The UK is currently seeing rapid growth in private healthcare services, accompanied by increased demand: according to The Guardian, consumer expenditure on private healthcare ranges from 3,200 to 15,075 pounds (Campbell, 2023). While this growth can help meet the increasing demand, it is imperative that private medical practitioners maintain financial transparency with their patients. Private medical services are deemed significantly costlier than public healthcare, so to assure patients that private institutes are not exploiting their healthcare expenditure, providers should give each patient an estimated total cost of the medical service beforehand. Patients can then decide, based on the urgency of their medical situation, when and where to get treated. The total cost of a medical service can be predicted by analyzing the data of a patient's medical history. Given that this prediction requires the analysis and processing of data, it can be argued that the most efficient method of prediction is a machine learning model. Hence, this paper compares the performance of a ridge regression machine learning model, a polynomial regression model, and a random forest regressor. The dataset used for this comparison, taken from Kaggle, is the same for all three models. To measure and compare the models, they are tested on their accuracy, mean absolute error, and root mean squared error. After analyzing the collected data, this paper concludes that the best-performing model was the random forest regressor, which attained the lowest mean absolute error, the lowest root mean squared error, and the highest accuracy at 87.5%. These results suggest that machine learning models could be the solution to financial transparency in private medical institutes, and that a random forest regressor seems to be the best fit for this task.

Keywords: machine learning, private healthcare, financial transparency.

Introduction
It can be argued that the most important factor that will steer modern healthcare to evolve is
financial transparency. In fact, that is the exact opinion of the McKinsey Health Systems and
Services Practice. Their study revolves around the idea of financial transparency being the key to
improving medical services in the near future. With more information about the service available
to the consumers, specifically the financial aspect behind it, they will be able to make informed
decisions, which are pivotal when it comes to healthcare (Henke, 2011). Accordingly, with more information available to consumers about their healthcare services, providers will be less able to inflate the prices of their services and exploit their patients. With less product information, healthcare providers could seize this market power and raise the cost of healthcare. As healthcare is an essential necessity that should be accessible to every individual, keeping this market dominance in check is imperative (Chown et al., 2019). Hence, financial transparency between patients and medical services reduces the ability of these
services to exploit patients for their money. Furthermore, as per a survey cited by The Healthcare Imperative, only around 4% of consumers buying a drug were informed and aware of its cost (Young et al., 2010). This percentage underscores how common it is for medical institutes to charge for their services or products without providing any evidence or context for a product's origin and price. Thus, it is important to curb such practices; the Health Care Financing Review exemplifies how consumer awareness of healthcare products leads to patients making informed decisions (Sangl & Wolf, 1996). It can be especially hard for the average patient to understand the remedies and procedures that medical institutes prescribe, because comprehending these prescriptions requires a certain background in the foundational concepts of biology. As a result, patients often trust whatever remedies these institutes prescribe and blindly accept their prices as well (G et al., 2017). Consequently, financial transparency around these medical services and remedies lets consumers make informed decisions and restricts medical institutes from exploiting patients' lack of product information.

Before we dive into exploring potential solutions to this problem, we must understand briefly
what financial transparency is and its impact on the healthcare sector. Financial transparency
with regard to the healthcare sector is essentially just being upfront and truthful with the costs
between the medical institute and the patient. Financial transparency encapsulates the idea of complete financial knowledge for the consumer, meaning the patient has full disclosure of the precise details of each medical service or product issued by these medical practices (Saver, 2021). Moreover, financial transparency cultivates a sense of trust between patient and doctor that benefits both parties: trust increases consumer loyalty, securing more business for the medical practice, and leaves the patient more confident in their recovery.

Artificial intelligence is a rapidly growing phenomenon that continues to surprise through its consistent and complex evolution. To address this issue, we explored a specific field of AI commonly used to manipulate and process data: machine learning. Machine learning models can identify trends and relationships in data faster and more efficiently than humans can (Moubayed et al., 2018). Additionally, despite the challenging precision required for medical data analytics, machine learning has already been implemented in healthcare to optimize and improve services. For example, machine learning has been applied in genomics. Genomics, the study of genomes, is notoriously regarded as a complicated field in biology that requires the utmost precision, yet machine learning is now used to optimize and further improve that precision (Habehh & Gohel, 2021). Given that machine learning can process such complicated and important data with acceptable outcomes, it can also be used to forecast the total cost of a medical service.

Accordingly, we compared and scored the random forest regressor (RFR), the ridge regression model (RRM), and the polynomial regression model (PRM). We tested these models on a dataset from Kaggle consisting of medical information and the total cost of each medical procedure. After analyzing the results, we show that the random forest regressor performed the best of the three models, with the highest accuracy of 87.5%. This percentage accentuates the potential for machine learning to be integrated into achieving financial transparency in medical practices.

The goal of this research paper is to evaluate the performance of 3 different machine learning
models to highlight the best fit for accomplishing financial transparency in the healthcare sector.
We score the three models (an RFR, an RRM, and a PRM) based on their accuracy in predicting the total cost of a medical service from the relationship with a patient's medical information (e.g., BMI, sex, smoker status). The findings of this paper are applicable to every medical institute: as mentioned before, integrating financial transparency into healthcare services is integral, and every healthcare service can apply the best-performing machine learning model showcased in this paper to predict the total cost of a medical service for a patient. As a result, they can disclose this cost to the patient to ensure financial integrity between the patient and the medical service.

Understanding the Dataset


I. Charges dataset
The dataset we used was obtained from Kaggle, a well-known platform popular for its vast collection of publicly available datasets. The dataset contained medical information about 1,338 patients. There were three binary variables in the dataset: sex, smoker, and insurance claim, though the insurance claim was not relevant to our research objective. For patient sex, [0] represented female and [1] represented male. A smoker was represented by [1] and a non-smoker by [0]. The patient's residential area was represented by [0] for Northeast, [1] for Northwest, [2] for Southeast, and [3] for Southwest. The rest of the medical information consisted of numerical values: patient age, BMI (body mass index), average walking steps per day, and number of children.

Figure 1 - Snippet of dataset (source made by candidate)

As shown in figure 1, this dataset was in CSV format and was originally designed to predict the patient's insurance claim. However, the data is flexible and can be used to predict the total charges, which is why, during preprocessing, we first eliminated the 'insuranceclaim' column. In the resulting dataset, every column besides charges became an x-variable, and the charges column was assigned as the y-variable, since we are predicting the total cost of a medical service. Subsequently, we observed the data spread: before operating on a dataset, it must be cleaned and organized to ensure the model can be optimized. The spread of each individual variable is visible in figure 2, where an apparent variance can be noticed, specifically in the y-variable 'charges' and the x-variable columns 'steps' and 'smoker'.
Figure 2 - Dataset variable spread (source made by candidate)

In addition, the obvious variance in the dataset introduced a bias in the data, which is shown by the heatmap in figure 3 (a strong correlation between smoker and charges). Moreover, the biggest variation was present in the y-variable, which had a great range in its spread, as shown by the box plot in figure 3. Therefore, in our methodology we had to clean this data by eliminating outliers. With reduced variance, the models can detect a stronger correlation between the x-variables and the y-variable, which leads to greater model performance.

Figure 3 - Heatmap and Dataset box plot (source made by candidate)

II. Train-test split


Firstly, the three machine learning models evaluated on this dataset are supervised models; a supervised model uses a training dataset to find a correlation, then performs on a testing set, after which its performance can be evaluated against the actual outputs. Accordingly, the best way for these models to operate is the train-test split method (Pawluszek-Filipiak & Borkowski, 2020). The train-test split method is commonly used to maximize the performance of supervised machine learning models. In this method, you create one subset of the main dataset for training and another for testing. You then feed the training subset to the model, which tries to fit a correlation between the variables and the output. Next, the model performs on the testing subset, applying the correlation developed from the training set to predict the output values. Finally, various scoring metrics can be used to evaluate how close the model's predictions were to the actual test-set output values.

A study on the optimal ratio between training and testing sets concluded that around 80% of the dataset should be used for training and around 20% for testing (Bichri et al., 2024). Accordingly, we split the charges dataset into 80% for training and 20% for testing. This ratio ensures that the model has enough data to form a valid correlation: the more training data, the stronger the model's correlation. However, it is also important to keep a reasonable amount of testing data so that the performance metrics are fair and calculated over a reasonable range of outputs, which is why 20% for the testing subset fits well.

Now that the train-test split has been established, it provides a fair testing range for our scoring metrics. Moreover, the train-test split allows the models to produce a stronger fit of the x-variables onto the y-variable. In consequence, the RMSE, MAE, and accuracy we attain from each model will improve.

Performance Metrics
I. Accuracy
Accuracy is one of the most common and effective scoring metrics used for machine learning models. It measures the percentage of predictions that the model got correct; precisely, it is the number of correctly predicted values divided by the total number of y-values. The generalized method of calculating accuracy for classification is to divide the sum of true positive (Tp) and true negative (Tn) values by the sum of Tp, Tn, false negative (Fn), and false positive (Fp) values. The equation is shown below.

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)
However, in our case the y-variable is not a Boolean datatype. Therefore, the accuracy of the models is calculated by the following equation, where C represents the number of correct predictions and T represents the total number of results.

Accuracy = C / T
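Because the target here is continuous, the C/T definition needs an explicit notion of a "correct" prediction, which the text leaves implicit. One possible interpretation, sketched below, counts a prediction as correct when it falls within a chosen relative tolerance of the actual charge; the 10% tolerance and the helper name are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def tolerance_accuracy(y_true, y_pred, rel_tol=0.10):
    """Fraction of predictions within rel_tol of the actual value.

    Assumption for illustration: a continuous prediction counts as
    'correct' when |prediction - actual| <= rel_tol * |actual|.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    correct = np.abs(y_pred - y_true) <= rel_tol * np.abs(y_true)
    return correct.mean()

# 105 is within 10% of 100, 295 is within 10% of 300, 260 is not within 10% of 200
acc = tolerance_accuracy([100.0, 200.0, 300.0], [105.0, 260.0, 295.0])
```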
II. Mean Absolute Error
The mean absolute error (MAE) is another common scoring metric used to evaluate regression
models. The mean absolute error is a very useful metric because it disregards the direction of the
error in the prediction.
"
1
𝑀𝑒𝑎𝑛 𝐴𝑏𝑠𝑜𝑙𝑢𝑡𝑒 𝐸𝑟𝑟𝑜𝑟 = 8 |𝑦! − ŷ! |
𝑛
!#$
As shown by the equation above, the MAE takes the absolute difference between the actual value y_i and the predicted value ŷ_i. Accordingly, whether the predicted value is greater or less than the actual value does not sway the score, as the positive value of the difference is taken and summed. Finally, that sum is divided by the total number of observations n to give the mean error.
III. Root Mean Squared Error
The root mean squared error (RMSE) is the final metric being used to score the 3 models. The
RMSE like the MAE disregards the direction of the error. So, if the actual value is less than or
greater than the predicted value it will not matter because all error values (the differences in
actual and predicted values) will be squared in this case as shown in the equation below.
"
1
𝑅𝑜𝑜𝑡 𝑚𝑒𝑎𝑛 𝑠𝑞𝑢𝑎𝑟𝑒𝑑 𝑒𝑟𝑟𝑜𝑟 = @ 8(𝑦! − ŷ! )%
𝑛
!#$

From this equation, we can see that once the difference between actual and predicted values is squared, the value becomes positive regardless of direction. However, because of this squaring, the metric can be heavily influenced by outliers, although the outliers in our dataset are taken care of in the methodology. Once the error values are squared, they are summed and divided by the total number of observations, which again gives a mean. To cancel out the squaring, you take the square root of this mean, producing a reliable score on the same scale as your dataset values.
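Both error metrics can be computed directly with scikit-learn; a small sketch on toy values (the numbers are arbitrary stand-ins for actual and predicted charges):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 20.0, 30.0])   # actual values (toy data)
y_pred = np.array([12.0, 18.0, 33.0])   # model predictions (toy data)

mae = mean_absolute_error(y_true, y_pred)           # (2 + 2 + 3) / 3
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt((4 + 4 + 9) / 3)
```

Taking the square root of the mean squared error keeps the RMSE on the same scale as the target values themselves.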

Research Methodology
I. Preprocessing the data
Firstly, import the dataset as a CSV file. Then drop the 'insuranceclaim' column and split the data into x-variables (all columns besides the charges column) and the y-variable (charges). Then split the dataset into 80% training and 20% testing: assign x-train and y-train to be the x-variables and charges from the 80% portion, and x-test and y-test to be the x-variables and charges from the remaining 20%.
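These preprocessing steps can be sketched with pandas and scikit-learn. Since the Kaggle CSV is not bundled here, the snippet builds a synthetic stand-in DataFrame using the column names described in the dataset section; the exact headers, value ranges, and the implied `read_csv` file name are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle CSV described in the paper.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.integers(0, 2, n),          # 0 = female, 1 = male
    "bmi": rng.normal(28, 5, n),
    "children": rng.integers(0, 5, n),
    "smoker": rng.integers(0, 2, n),       # 0 = non-smoker, 1 = smoker
    "region": rng.integers(0, 4, n),       # 0..3 as described in the text
    "steps": rng.integers(3000, 10000, n),
    "charges": rng.normal(13000, 4000, n),
    "insuranceclaim": rng.integers(0, 2, n),
})

df = df.drop(columns=["insuranceclaim"])   # not needed for this task
X = df.drop(columns=["charges"])           # x-variables
y = df["charges"]                          # y-variable: total cost

# 80% training / 20% testing, as recommended by Bichri et al. (2024)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```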

As we observed when understanding the data, there was great variance in the spread of each individual variable, which caused a bias in the data. Referring to figure 4, we can see obvious outliers present in the data. Thus, we eliminated outliers using the following calculations.

Lower bound = Q1 − (1.5 × IQR)
Upper bound = Q3 + (1.5 × IQR)
IsNonOutlier(x_i) = True if Lower bound ≤ x_i ≤ Upper bound, False otherwise
mask = { IsNonOutlier(x_i) | ∀ x_i ∈ x }
x_train = { x_i ∈ x | mask(i) = True }
y_train = { y_i ∈ y | mask(i) = True }
Q1 represents the 25th percentile while Q3 represents the 75th percentile, and the IQR is calculated as Q3 − Q1. We calculated the lower and upper bounds of the data and created a Boolean property indicating whether x_i lies within the specified bounds. With these Boolean values, we then created a mask variable that flags each value that is a non-outlier. Finally, to specify the new range of values to operate with, the x-train values retained only the data flagged by the mask as non-outliers; the same was done for the y-train values.
Figure 4 - Data variance (source made by candidate)
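The IQR rule above translates directly into numpy; a minimal sketch, in which the helper name is ours and the toy vector stands in for a column of the dataset:

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Boolean mask: True where a value lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

y = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0])  # 500 is an outlier
mask = iqr_mask(y)
y_clean = y[mask]   # keep only the non-outlier rows
```

In the paper's pipeline, the same mask computed on the training target would be applied to both the x-train and y-train rows so they stay aligned.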

Additionally, to further regulate the data variance, we applied the logarithmic conversion available from the numpy library, which allowed us to stabilize each training and testing value into a scalable value to operate with. This is done by taking the natural log of 1 plus the data value, as shown below.

Transformed x = ln(1 + x)
x_transformed = { ln(1 + x_1), ln(1 + x_2), …, ln(1 + x_n) }
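numpy provides this conversion directly as `np.log1p`, with `np.expm1` as its inverse for mapping log-scale predictions back to the original charge scale; a minimal sketch:

```python
import numpy as np

x = np.array([0.0, 9.0, 99.0, 9999.0])

x_transformed = np.log1p(x)       # ln(1 + x): compresses large spreads
x_restored = np.expm1(x_transformed)  # exact inverse: recovers the originals
```

Using log1p rather than a plain log keeps zero-valued entries well defined, since ln(1 + 0) = 0.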
II. Hyperparameter tuning and model implementation
Import the required libraries and the models for the random forest regressor and ridge regression. For the polynomial regression, import linear regression and polynomial features, then set the degree of the model to 2. Train each model with the transformed x-variable set defined above. Then import grid search cross validation and k-fold cross validation and identify the parameters that yield each model's best performance. Finally, implement these parameters for each model.
III. Model evaluation
Now, to evaluate the models, calculate the accuracy of each model and plot those values graphically. Additionally, to gain an enhanced understanding of model performance, calculate the RMSE and MAE for each model and plot those as well. Compare each value for each model and validate their effectiveness with respect to the data spread. Given the high average of the y-variable, high numerical values for RMSE and MAE are expected, lying within the range of the y-variable set.
IV. Model scoring and analysis
After collecting each relevant graph showing the model scores against each other, compare the results and conclude the best-performing model. Accuracy ranges from 0 to 100% and is the fundamental score indicating better performance. To gain assurance of optimal performance, also compare the RMSE and MAE values from the graph against the data spread shown in figure 3. Finally, draw conclusions from the results collected.

Zoom in on Model training


I. Ridge regression model
The ridge regression model is a machine learning model that implements the base linear regression ideology with L2 regularization (Arashi et al., 2021). L2 regularization is a modification of linear regression that helps prevent overfitting by penalizing the size of the coefficients. Accordingly, the coefficients are shrunk by the L2 regularization, leading to decreased variance. However, this process introduces a slight amount of bias into the model.
Hyperparameters used: alpha ∈ [0.1, 1, 10, 100, 1000]

Figure 5 - Ridge regression model performance (source made by candidate)

As shown by the data table in figure 5, the ridge regression model was only able to adapt and cultivate a relationship between the variables in the dataset to an extent, achieving an accuracy of 79%; the testing accuracy underscores the model's performance on unseen data, hence it is the accuracy we are after. In figure 5, the graph of the model's train and test RMSE and MAE shows that the training RMSE and MAE were greater than the testing values. This can be explained by the L2 regularization effect that the ridge regression model implements to penalize large coefficients.

To employ the ridge regression model, import the model into the project. Assuming the preprocessing of the data has been completed as above, begin hyperparameter tuning by implementing grid search cross validation and k-fold cross validation. The k-fold cross validation was set to 5 splits, meaning that in each round the data was divided into 4 training folds and 1 validation fold to check which parameters worked best. Apply grid search cross validation to the model after training it with the transformed x and y variables, then apply the best alpha value (the alpha value utilized in this paper was 1.0) to your model and collect the results of the model's performance on the testing set.
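A minimal sketch of this tuning procedure with scikit-learn, using synthetic data in place of the transformed charges variables (the synthetic coefficients and noise level are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Toy data standing in for the log-transformed training set.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

# 5-fold CV with the alpha grid described in the paper.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1, 10, 100, 1000]},
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]  # refit model is search.best_estimator_
```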
II. Polynomial regression model
Polynomial regression, like ridge regression, can be considered another extension of linear
regression. In ridge regression, the linear regression model was introduced to an L2
regularization which makes it ridge regression. On the other hand, polynomial regression
introduces the idea of a polynomial relationship between the variables (Cheng et al., 2019). Not
all relationships can be mapped out by a linear trend line, which is why this polynomial model
tries to identify which polynomial equation can map out the relationship between the variables
the best.
Hyperparameters used: degree ∈ [1, 2, 3, 4, 5]

Figure 6 - Polynomial regression performance (source made by candidate)

Figure 6 depicts the performance of the polynomial regression model. As shown by the data table, the polynomial regression model achieved a higher accuracy than ridge regression, with 86.6% on its testing set. Furthermore, the polynomial regression model achieved lower RMSE and MAE values than ridge regression, as shown by the table. However, the graph shows the same trend that was apparent in the ridge model: the train RMSE and MAE are noticeably greater than the testing MAE and RMSE. This suggests the model generalized to the testing set rather than overfitting the training data.
To set up the polynomial regression model, first import linear regression and polynomial features. Then, assuming preprocessing has been executed as described in the research methodology, fit the model using the transformed variables. Next, configure the hyperparameters by importing grid search cross validation and k-fold cross validation. Set up the k-fold cross validation with a split value of 5 and use the degree parameters specified above for the grid search. Output the best model and use that degree value (the degree used in this paper was 2) for training your model and collecting results.
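This setup maps naturally onto a scikit-learn `Pipeline`, which lets the grid search tune the polynomial degree directly; a sketch on synthetic, deliberately quadratic data (the data-generating function is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a known degree-2 relationship plus small noise.
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(100, 1))
y = 1.0 + 3.0 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=100)

pipe = Pipeline([
    ("poly", PolynomialFeatures()),   # expands features to the chosen degree
    ("lin", LinearRegression()),
])
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    pipe,
    {"poly__degree": [1, 2, 3, 4, 5]},
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best_degree = search.best_params_["poly__degree"]
```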
III. Random forest regressor
A random forest regressor completely deviates from the methodology of the previous two
models. The previous two models had an extension on the linear regression model, whereas
random forest regressors use decision trees. A decision tree is a flowchart-like structure used for
making decisions, where each internal node represents a test on a feature, each branch represents
the outcome of that test, and each leaf node signifies a predicted outcome or class label. This
method recursively partitions the data, allowing for both classification and regression tasks by
capturing complex relationships within the dataset. The random forest regressor is an ensemble
learning method which builds multiple decision trees when training. Random forest is also
compatible with complex data sets (Segal, 2004).
Hyperparameters used:
estimators ∈ [50, 100, 200]
maximum depth ∈ [None, 4, 6, 8]
minimum split ∈ [2, 5, 10]
minimum leaf ∈ [1, 2, 4]

Figure 7 - Random forest regressor performance (source made by candidate)


Figure 7 portrays the results gathered by the random forest regressor. From the data table it is already apparent that the random forest regressor performed the best, with a testing accuracy of 86.7%. However, as shown by the diagram, the training RMSE and MAE are significantly greater than the testing RMSE and MAE, a trend common to the previous two models. This again suggests the model generalized to the testing set rather than overfitting the training data.

In order to employ the random forest regressor, first import the model from the ensemble module. Assuming the data has been preprocessed as per the research methodology, train the random forest regressor using the transformed variables, then import and implement grid search cross validation and k-fold cross validation. Execute the two cross validations using the hyperparameters suggested above: estimators represents the number of decision trees, maximum depth defines the depth of each tree, minimum split determines the minimum number of samples required to split a node, and minimum leaf specifies the minimum number of samples required in a leaf node. After finding the best parameters for the model (this paper used 100 estimators, a maximum depth of 5, a minimum split of 2, and a minimum leaf of 4), train the model with the best parameters. Finally, collect the results of the model's performance.
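A sketch of this procedure with scikit-learn; the grid below is a reduced version of the paper's grid so the example runs quickly, and the synthetic data is an arbitrary stand-in for the transformed training set:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data with one linear and one threshold effect.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = 5.0 * X[:, 0] + 10.0 * (X[:, 1] > 0) + rng.normal(scale=0.5, size=120)

# Reduced grid for speed; the paper's full grid also includes
# 200 estimators and maximum depths of 6 and 8.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 4],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=cv)
search.fit(X, y)
best = search.best_params_   # dict of the best value for each hyperparameter
```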

Results

Figure 8 - Model performance (source made by candidate)

As shown by the data table in figure 8, the random forest regressor performed the best of all three models. Although the polynomial regression model came close with a testing accuracy of 86.6% versus the random forest's 86.7%, the random forest model also had the greatest training accuracy of the three at 87.5%. Furthermore, the random forest not only led in accuracy but also had lower testing RMSE and MAE values than the polynomial and ridge regressions, which indicates lower error in the random forest regressor. In the accuracy diagram its lead is visually clear, and in the error diagram the random forest has the smallest error bars.

Conclusions
I. Best performance
This study aimed to compare and identify the best-performing machine learning model to assist healthcare services in predicting the total cost of a patient's procedure or visit. This was done by scoring and evaluating the random forest regressor, the polynomial regressor, and the ridge regressor on the charges dataset. After methodical and extensive experimentation on the three ML models, this study concluded that the random forest regressor performed the best.
II. Understanding the error scores
Referring back to the graph in figure 3 and recalling the huge range in the charges, the values of the random forest regressor's RMSE and MAE make sense. As this paper tries to identify the best-performing model for the medical industry's application, note that medical charges can vary drastically (Grover et al., 2022). Therefore, with a high range, a high error is expected that should conform to the dataset's mean. In the case of the random forest regressor, the RMSE and MAE are the lowest, showing low error; with the values lying in the thousands this is reasonable, as medical practices are certain to have great variance in their prices as well.
III. Relation to existing studies
This paper not only serves as an informative recommendation of which machine learning model to use when predicting medical costs; it also acts as an extension of existing work. An investigation conducted at the Hindustan Institute of Technology and Science explored the use of the linear regression model for predicting medical prices (KUMAR et al., 2022). The linear regression model achieved an accuracy of around 75%, whereas all three models in this paper achieved higher accuracy, two of them being simple extensions of linear regression. This extension of the research provides an alternative perspective that also concludes that using machine learning models to predict medical prices is effective and beneficial.
IV. Future research work
As this paper operated with a dataset of 1,338 rows, big hospitals and medical services may have considerably larger datasets. As a result, it is important to have assurance that a model can operate effectively and remain compatible with such large data. Although the random forest regressor (concluded here as the best-performing of the three) is deemed capable of handling large datasets, this should be verified. Hence, future studies should explore which machine learning models can effectively predict medical prices on reasonably larger datasets.
V. Societal contribution
In conclusion, this paper has shown that machine learning models are an effective and plausible tool for predicting medical prices. Accordingly, financial transparency can be achieved among medical services, resulting in a boost to societal welfare (Pollack, 2022). Finally, the findings of this paper can help enable financial transparency in the medical industry, allowing policy makers to use machine learning to assist them in determining regulations and policies that promote equitable pricing, leading to improved societal welfare.

Bibliography

Campbell, D. (2023) Record rise in people using private healthcare amid NHS frustration, The
Guardian. Available at: https://www.theguardian.com/society/2023/may/24/record-rise-in-people-
using-private-healthcare-amid-nhs-frustration (Accessed: 18 November 2024).

Henke, N. (2011) McKinsey, Transparency - The most powerful driver of health care improvement.
Available at:
https://www.mckinsey.com/~/media/mckinsey/dotcom/client_service/Healthcare%20Systems%20a
nd%20Services/Health%20International/Issue%2011%20new%20PDFs/HI11_64%20Transparency
_noprint.ashx (Accessed: 11 November 2024).

Chown, J. et al. (2019) NBER Working Paper Series, THE OPPORTUNITIES AND LIMITATIONS
OF MONOPSONY POWER IN HEALTHCARE: EVIDENCE FROM THE UNITED STATES AND
CANADA. Available at: https://www.kellogg.northwestern.edu/faculty/garthwaite/htm/w26122.pdf
(Accessed: 12 November 2024).

Young, P.L., Saunders, R.S. and Olsen, L. (2010) National Library of Medicine, The Healthcare
Imperative: Lowering Costs and Improving Outcomes: Workshop Series Summary. Available at:
https://www.ncbi.nlm.nih.gov/books/NBK53921/ (Accessed: 08 November 2024).

Sangl, J.A. and Wolf, L.F. (1996) National Library of Medicine, Role of Consumer Information in
Today’s Health Care System. Available at: https://pmc.ncbi.nlm.nih.gov/articles/PMC4193620/
(Accessed: 18 November 2024).

G, M., NR, A. and SJ, N. (2017) National Library of Medicine, Making Medicines Affordable: A
National Imperative. Available at: https://www.ncbi.nlm.nih.gov/books/NBK493099/ (Accessed:
18 November 2024).

Saver, R.S. (2021) Deciphering the sunshine act: American Journal of Law & Medicine,
Cambridge Core. Available at: https://www.cambridge.org/core/journals/american-journal-of-law-
and-medicine/article/abs/deciphering-the-sunshine-
act/0D357AC2CB03EC021F0B1ADEB87D8CCC (Accessed: 18 November 2024).

Moubayed, A. et al. (2018) E-Learning: Challenges and Research Opportunities Using Machine
Learning & Data Analytics, IEEE Xplore. Available at:
https://bibliotecnica.upc.edu/sites/default/files/pagines_generals/investigadors/ieee_xplore.pdf
(Accessed: 18 November 2024).

Habehh, H. and Gohel, S. (2021) Machine learning in Healthcare, Current genomics. Available at:
https://pmc.ncbi.nlm.nih.gov/articles/PMC8822225/ (Accessed: 4 November 2024).

Pawluszek-Filipiak, K. and Borkowski, A. (2020) On the importance of train–test split ratio of datasets in automatic landslide detection by supervised classification, MDPI. Available at: https://www.mdpi.com/2072-4292/12/18/3054 (Accessed: 02 November 2024).

Bichri, H., Chergui, A. and Hain, M. (2024) Investigating the impact of train / test split ratio on the
performance of pre-trained models with custom datasets, International Journal of Advanced
Computer Science and Applications (IJACSA). Available at:
https://thesai.org/Publications/ViewPaper?Volume=15&Issue=2&Code=IJACSA&SerialNo=35
(Accessed: 04 November 2024).

Arashi, M. et al. (2021) Ridge regression and its applications in genetic studies, PLOS ONE.
Available at: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0245376
(Accessed: 24 October 2024).

Cheng, X. et al. (2019) Polynomial regression as an alternative to neural nets. Available at:
https://arxiv.org/pdf/1806.06850.pdf (Accessed: 24 October 2024).
Segal, M.R. (2004) Machine learning benchmarks and random forest regression, eScholarship,
University of California. Available at: https://escholarship.org/uc/item/35x3v9t4 (Accessed: 28
October 2024).

Grover, A., Orgera, K. and Pincus, L. (2022) Health care costs: What’s the problem?, Research
and Action Institute. Available at: https://www.aamcresearchinstitute.org/our-work/issue-
brief/health-care-costs-what-s-problem (Accessed: 05 November 2024).

KUMAR, P. et al. (2022) Medical expense prediction using machine learning, MEDICAL
EXPENSE PREDICTION USING MACHINE LEARNING. Available at:
https://hindustanuniv.ac.in/assets/naac/CA/1_3_4/2591_M_RAJA_VARDHAN.pdf (Accessed: 16
November 2024).

Pollack, H.A. (2022) Necessity for and limitations of Price Transparency in American Health Care,
Journal of Ethics | American Medical Association. Available at: https://journalofethics.ama-
assn.org/article/necessity-and-limitations-price-transparency-american-health-care/2022-11
(Accessed: 17 November 2024).
