Esther Attar

Professor Lance Galletti

Data Science Tools and Applications

Tuesday, March 28, 2023

Midterm Report

Machine learning (ML) algorithms are widely used in various fields, including finance,

health, and engineering. In the context of regression analysis, the XGBoost model is a powerful

tool for predicting the values of a target variable. This report describes one approach to training an XGBoost model for regression analysis and explains why it is a reasonable choice.

Before training the model, we first preprocess the data. The training set, which includes the newly engineered features, is loaded into a Pandas DataFrame, and unnecessary columns such as Id, ProductId, UserId, Text, and Summary are dropped; the remaining columns are the features used to train the model. To ensure that the features are on the same scale, we normalize the data using the StandardScaler method, which rescales each feature to have a mean of zero and a standard deviation of one. The normalized data is then split into training and testing sets using the train_test_split function with a 75:25 ratio: the training set is used to fit the model, while the testing set is used to evaluate its performance. The XGBoost algorithm is then used to learn the model, with hyperparameters such as the number of estimators, learning rate, and maximum depth set to specific values. Finally, the model is evaluated on the testing set using the accuracy score and the root mean squared error (RMSE), confusion matrices are plotted for both the training and testing sets, and the model is pickled so it can be saved for later use.
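
A minimal sketch of this preprocessing and splitting step, written in Python, is shown below. The input file name and the Score target column are assumptions made for illustration; the dropped identifier columns are the ones named above.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Load the training set with the engineered features
    # (the file name is assumed for illustration).
    df = pd.read_csv("train_with_features.csv")

    # Drop identifier and raw-text columns; the remaining columns are the
    # features. "Score" as the target column name is an assumption.
    X = df.drop(columns=["Id", "ProductId", "UserId", "Text", "Summary", "Score"])
    y = df["Score"]

    # Split 75:25 into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    # Fit the scaler on the training split only, then apply it to both splits,
    # so each feature ends up with a mean of zero and a standard deviation of one.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)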

Expanding a little more on the model I used, the XGBoost model is

trained using the XGBRegressor class, which is part of the xgboost library. The class takes

several hyperparameters that need to be tuned to achieve the best performance. In this example,

we use n_estimators = 600, learning_rate = 0.1, and max_depth = 5. The n_estimators parameter

specifies the number of trees in the model, the learning_rate parameter controls the step size

during the optimization process, and the max_depth parameter sets the maximum depth of each

tree. These hyperparameters were chosen based on prior knowledge and experimentation.
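
Continuing from the preprocessing sketch above, the training step with these hyperparameters might look like the following (the pickle file name is an assumption):

    import pickle
    from xgboost import XGBRegressor

    # Hyperparameters from the report: 600 trees, a 0.1 learning rate,
    # and a maximum tree depth of 5.
    model = XGBRegressor(n_estimators=600, learning_rate=0.1, max_depth=5)
    model.fit(X_train, y_train)

    # Pickle the fitted model so it can be saved for later use.
    with open("xgb_model.pkl", "wb") as f:
        pickle.dump(model, f)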

Once the model is trained, it is evaluated on the testing set. The accuracy_score function is used to calculate the accuracy of the model, while the root mean squared error (RMSE) is obtained by taking the square root of the value returned by the mean_squared_error function. The accuracy score is the proportion of correct predictions, while the RMSE is a measure of the average deviation of the predictions from the actual values. In this example, the accuracy on the testing set is 0.373, which is not very high, and the RMSE is 1.03. These metrics suggest that the model captures some of the structure in the data but leaves considerable room for improvement.
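
A sketch of this evaluation step follows. Because the regressor outputs continuous values, the predictions are rounded to the nearest valid score before computing accuracy; the report does not state exactly how the predictions were discretized, so that rounding is an assumption.

    import numpy as np
    from sklearn.metrics import accuracy_score, mean_squared_error

    preds = model.predict(X_test)

    # RMSE is the square root of the mean squared error.
    rmse = np.sqrt(mean_squared_error(y_test, preds))

    # Round the continuous predictions to the nearest valid score (1-5)
    # before computing accuracy; this discretization is an assumption.
    rounded = np.rint(preds).clip(1, 5).astype(int)
    acc = accuracy_score(y_test, rounded)

    print(f"Accuracy: {acc:.3f}, RMSE: {rmse:.2f}")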

Along the way I encountered many challenges that I had to overcome to improve my model. One challenge I faced when training the model was overfitting, which occurs when the model is too

complex and fits the training data too closely, resulting in poor performance on new, unseen

data. To address this, it is important to perform feature selection and regularization to prevent the

model from becoming too complex. In this case, certain columns are dropped from the dataset

during preprocessing, which helps to simplify the model and reduce the risk of overfitting.
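
Beyond dropping columns, XGBoost also exposes regularization parameters of its own. The values below are purely illustrative, not the ones used in this report:

    from xgboost import XGBRegressor

    # Illustrative regularization settings (not the report's values).
    regularized_model = XGBRegressor(
        n_estimators=600,
        learning_rate=0.1,
        max_depth=5,
        reg_alpha=0.5,   # L1 penalty on leaf weights
        reg_lambda=2.0,  # L2 penalty on leaf weights
        subsample=0.8,   # each tree trains on a random 80% of the rows
    )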

Another challenge that can arise is dealing with missing or incomplete data. It is

important to carefully consider how to handle missing data, as it can have a significant impact on
the performance of the model. In this case, no missing data was encountered in the dataset,

which made it easier to work with. Still, missing data is very common in practice, and it is worth knowing how to handle it when it does come up.
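
As a sketch of one common way to handle missing values if they do appear, median imputation can be fitted on the training split only, so test-set statistics do not leak in (this is illustrative; the dataset here needed no imputation):

    from sklearn.impute import SimpleImputer

    # Fill missing values in each column with that column's training-set median.
    imputer = SimpleImputer(strategy="median")
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)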

During the process of training the model, some interesting findings were discovered

about the dataset. For example, it was observed that the scores span the full range from 1 to 5, with a relatively even distribution among the different values. This

suggests that the dataset is well-balanced and contains a diverse range of reviews. Additionally,

it was observed that certain features in the dataset, such as the length of the review and the

number of exclamation marks used, were highly correlated with the score given by the reviewer.

This suggests that these features may be important predictors of the score, and could be used to

improve the performance of the model.
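
A brief sketch of how such features could be derived from the raw Text column and checked against the score (the engineered column names are hypothetical; Text and Score follow the report):

    # Engineer the two features noted above from the raw review text.
    df["review_length"] = df["Text"].str.len()
    df["exclamation_count"] = df["Text"].str.count("!")

    # Check how strongly each engineered feature correlates with the score.
    print(df[["review_length", "exclamation_count", "Score"]].corr())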

To further evaluate the model, I also plotted confusion matrices for both the testing and

training sets. The confusion matrix is a way to visualize the performance of a classification

model. It shows the number of true positives, false positives, true negatives, and false negatives.

The confusion matrices show that the model performs better on the training set than on the

testing set. This suggests that the model may be overfitting the training data and may not

generalize well to new data.
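
A sketch of those confusion-matrix plots, again rounding the regressor's continuous output to integer scores (the rounding is an assumption, as above):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    for name, X_split, y_split in [("training", X_train, y_train),
                                   ("testing", X_test, y_test)]:
        # Round continuous predictions to the nearest valid score.
        preds = np.rint(model.predict(X_split)).clip(1, 5).astype(int)
        ConfusionMatrixDisplay.from_predictions(y_split, preds)
        plt.title(f"Confusion matrix ({name} set)")
    plt.show()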

In conclusion, the approach taken to train this model is a good one, as it involves several

important steps such as feature selection, normalization, and evaluation on a testing set. The

XGBoost algorithm is an effective choice for this type of problem, as it is able to handle a wide

range of features and can prevent overfitting through regularization. During the training process,

it is important to be aware of potential challenges such as overfitting and missing data, and to

carefully preprocess the data to address these issues. Finally, it is important to analyze the dataset
during the training process to identify any key findings that can help to improve the performance

of the model.
