Midterm Report
Machine learning (ML) algorithms are widely used in various fields, including finance,
health, and engineering. In the context of regression analysis, the XGBoost model is a powerful
tool for predicting the values of a target variable. This report discusses a possible approach to
train the XGBoost model for regression analysis and why it is a good approach.
Before training the model, we first preprocess the data by removing unnecessary columns
such as Id, ProductId, UserId, Text, and Summary. The remaining columns represent the features
that will be used to train the model. To ensure that the features are on the same scale, we
normalize the data using the StandardScaler method. This method scales the data so that it has a
mean of zero and a standard deviation of one. The normalized data is then split into training and
testing sets using the train_test_split function. The training set is used to train the model, while
the testing set is used to evaluate its performance. In summary, the approach involves several key steps. First, the training set with new features is loaded into a Pandas DataFrame and split into training and testing sets using a 75:25 ratio. The training data is then preprocessed by dropping the columns listed above and normalizing the remaining features with a StandardScaler. Next, the XGBoost algorithm is used to learn the model, with hyperparameters such as the number of estimators, learning rate, and maximum depth set to specific values. Finally, the model is evaluated on the testing set using accuracy score and root mean squared error (RMSE), and confusion matrices are plotted for both the training and testing sets.
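A minimal sketch of this preprocessing pipeline is shown below. The file name train_with_features.csv and the target column name Score are assumptions made for illustration and may differ from the exact names used in the project.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the training set with engineered features (file name is assumed for illustration).
df = pd.read_csv("train_with_features.csv")

# Drop identifier and raw-text columns that are not used as model features.
X = df.drop(columns=["Id", "ProductId", "UserId", "Text", "Summary", "Score"])
y = df["Score"]  # target column name assumed here

# Split into training and testing sets using a 75:25 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)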
Going back to the model I used and expanding a little more on it, the XGBoost model is
trained using the XGBRegressor class, which is part of the xgboost library. The class takes
several hyperparameters that need to be tuned to achieve the best performance. In this example,
we use n_estimators = 600, learning_rate = 0.1, and max_depth = 5. The n_estimators parameter
specifies the number of trees in the model, the learning_rate parameter controls the step size
during the optimization process, and the max_depth parameter sets the maximum depth of each
tree. These hyperparameters were chosen based on prior knowledge and experimentation.
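As a rough sketch, the model can be instantiated and fit as follows; the variable names X_train and y_train refer to the scaled splits from the preprocessing step above.

from xgboost import XGBRegressor

# Instantiate the regressor with the hyperparameters described above.
model = XGBRegressor(
    n_estimators=600,   # number of boosted trees
    learning_rate=0.1,  # step size shrinkage applied at each boosting round
    max_depth=5,        # maximum depth of each tree
)

# Fit on the scaled training features and target scores.
model.fit(X_train, y_train)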
Once the model is trained, it is evaluated on the testing set. The accuracy_score function
is used to calculate the accuracy of the model, while the mean_squared_error function is used to
calculate the root mean squared error (RMSE). The accuracy score is the proportion of correct
predictions, while the RMSE is a measure of the average deviation of the predictions from the
actual values. In this example, the accuracy on the testing set is 0.373, which is not very high, and the RMSE is 1.03. These metrics indicate that the model captures some of the signal in the data but leaves considerable room for improvement.
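A sketch of how these metrics can be computed is given below. Because accuracy_score expects discrete labels, the continuous regression outputs are rounded and clipped to the 1-5 score range here; this discretization is an assumption about how the reported accuracy was obtained.

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Predict continuous scores for the testing set.
y_pred = model.predict(X_test)

# Round and clip predictions to the valid 1-5 score classes before computing accuracy
# (an assumption about how the reported accuracy was computed).
y_pred_class = np.clip(np.round(y_pred), 1, 5).astype(int)
accuracy = accuracy_score(y_test, y_pred_class)

# RMSE is the square root of the mean squared error of the raw predictions.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Accuracy: {accuracy:.3f}, RMSE: {rmse:.2f}")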
Along the way I encountered several challenges that I had to overcome to improve my model. One challenge I faced when training the model was overfitting, which occurs when the model is too
complex and fits the training data too closely, resulting in poor performance on new, unseen
data. To address this, it is important to perform feature selection and regularization to prevent the
model from becoming too complex. In this case, certain columns are dropped from the dataset
during preprocessing, which helps to simplify the model and reduce the risk of overfitting.
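Beyond dropping columns, XGBoost itself exposes regularization hyperparameters that can limit model complexity. The values below are illustrative only and were not necessarily the settings used for the reported results.

from xgboost import XGBRegressor

# Illustrative regularized configuration (example values, not tuned for this dataset).
regularized_model = XGBRegressor(
    n_estimators=600,
    learning_rate=0.1,
    max_depth=5,
    reg_alpha=0.1,         # L1 regularization on leaf weights
    reg_lambda=1.0,        # L2 regularization on leaf weights
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # feature subsampling per tree
)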
Another challenge that can arise is dealing with missing or incomplete data. It is
important to carefully consider how to handle missing data, as it can have a significant impact on
the performance of the model. In this case, no missing data was encountered in the dataset,
which made it easier to work with. However, missing data is very common in practice, so it is important to know how to handle it when it does arise.
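Had missing values been present, a simple way to check for and handle them with pandas is sketched below; median imputation is just one common strategy among several, and the DataFrame df is the one loaded earlier.

# Count missing values per column to see whether imputation is needed.
print(df.isnull().sum())

# One common strategy: fill missing numeric values with the column median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())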
During the process of training the model, some interesting findings were discovered
about the dataset. For example, it was observed that there was a wide range of scores in the
dataset, ranging from 1 to 5, with a relatively even distribution among the different scores. This
suggests that the dataset is well-balanced and contains a diverse range of reviews. Additionally,
it was observed that certain features in the dataset, such as the length of the review and the
number of exclamation marks used, were highly correlated with the score given by the reviewer.
This suggests that these features may be important predictors of the score and could be used to improve the model's predictions.
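For illustration, features like these can be derived directly from the raw Text column before it is dropped; the exact feature definitions and column names (Text, Score) are assumptions, and the project's actual feature engineering may differ.

# Derive simple text-based features before dropping the raw Text column.
df["review_length"] = df["Text"].fillna("").str.len()
df["exclamation_count"] = df["Text"].fillna("").str.count("!")

# Inspect how strongly these features correlate with the score.
print(df[["review_length", "exclamation_count", "Score"]].corr())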
To further evaluate the model, I also plotted confusion matrices for both the testing and
training sets. The confusion matrix is a way to visualize the performance of a classification
model. It shows the number of true positives, false positives, true negatives, and false negatives.
The confusion matrices show that the model performs better on the training set than on the
testing set. This suggests that the model may be overfitting the training data and may not generalize well to new, unseen data.
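A sketch of how such confusion matrices can be produced with scikit-learn is shown below; as in the evaluation step, the continuous predictions are rounded to the 1-5 score classes first, which is an assumption about how the plots were generated.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Round predictions to discrete score classes for both splits.
train_pred = np.clip(np.round(model.predict(X_train)), 1, 5).astype(int)
test_pred = np.clip(np.round(model.predict(X_test)), 1, 5).astype(int)

# Plot one confusion matrix per split, side by side.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay.from_predictions(y_train, train_pred, ax=axes[0])
axes[0].set_title("Training set")
ConfusionMatrixDisplay.from_predictions(y_test, test_pred, ax=axes[1])
axes[1].set_title("Testing set")
plt.tight_layout()
plt.show()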
In conclusion, the approach taken to train this model is a good one, as it involves several
important steps such as feature selection, normalization, and evaluation on a testing set. The
XGBoost algorithm is an effective choice for this type of problem, as it is able to handle a wide
range of features and can prevent overfitting through regularization. During the training process,
it is important to be aware of potential challenges such as overfitting and missing data, and to
carefully preprocess the data to address these issues. Finally, it is important to analyze the dataset
during the training process to identify any key findings that can help to improve the performance
of the model.