Midterm Report
Machine learning (ML) algorithms are widely used in various fields, including finance,
health, and engineering. In the context of regression analysis, the XGBoost model is a powerful
tool for predicting the values of a target variable. This report discusses a possible approach to
train the XGBoost model for regression analysis and why it is a good approach.
Before training the model, we first preprocess the data by removing unnecessary columns
such as Id, ProductId, UserId, Text, and Summary. The remaining columns represent the features
that will be used to train the model. To ensure that the features are on the same scale, we
normalize the data using the StandardScaler method. This method scales the data so that it has a
mean of zero and a standard deviation of one. The normalized data is then split into training and
testing sets using the train_test_split function. The training set is used to train the model, while
the testing set is used to evaluate its performance. In summary, the approach involves several key steps. First, the training set with new features is loaded into a Pandas DataFrame and split into training and testing sets using a 75:25 ratio. The training data is then preprocessed by dropping the columns listed above and normalizing the remaining features with a StandardScaler. Next, the XGBoost algorithm is used to learn the model, with hyperparameters such as the number of estimators, learning rate, and maximum depth set to specific values. Finally, the model is evaluated on the testing set using accuracy score and root mean squared error (RMSE), and confusion matrices are plotted for both the training and testing sets.
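A minimal sketch of this preprocessing pipeline is shown below. The file name train_with_features.csv and the target column name Score are assumptions made for illustration and may differ from the exact names used in the project.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the training set with engineered features (file name is assumed for illustration).
df = pd.read_csv("train_with_features.csv")

# Drop identifier and raw-text columns that are not used as model features.
X = df.drop(columns=["Id", "ProductId", "UserId", "Text", "Summary", "Score"])
y = df["Score"]  # target column name assumed here

# Split into training and testing sets using a 75:25 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)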
Going back to the model I used and expanding a little more on it, the XGBoost model is
trained using the XGBRegressor class, which is part of the xgboost library. The class takes
several hyperparameters that need to be tuned to achieve the best performance. In this example,
we use n_estimators = 600, learning_rate = 0.1, and max_depth = 5. The n_estimators parameter
specifies the number of trees in the model, the learning_rate parameter controls the step size
during the optimization process, and the max_depth parameter sets the maximum depth of each
tree. These hyperparameters were chosen based on prior knowledge and experimentation.
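As a rough sketch, the model can be instantiated and fit as follows; the variable names X_train and y_train refer to the scaled splits from the preprocessing step above.

from xgboost import XGBRegressor

# Instantiate the regressor with the hyperparameters described above.
model = XGBRegressor(
    n_estimators=600,   # number of boosted trees
    learning_rate=0.1,  # step size shrinkage applied at each boosting round
    max_depth=5,        # maximum depth of each tree
)

# Fit on the scaled training features and target scores.
model.fit(X_train, y_train)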
Once the model is trained, it is evaluated on the testing set. The accuracy_score function
is used to calculate the accuracy of the model, while the mean_squared_error function is used to
calculate the root mean squared error (RMSE). The accuracy score is the proportion of correct
predictions, while the RMSE is a measure of the average deviation of the predictions from the
actual values. In this example, the accuracy on the testing set is 0.373, which is not very high, and the RMSE is 1.03. These metrics indicate that the model captures some of the signal in the data but leaves considerable room for improvement.
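A sketch of how these metrics can be computed is given below. Because accuracy_score expects discrete labels, the continuous regression outputs are rounded and clipped to the 1-5 score range here; this discretization is an assumption about how the reported accuracy was obtained.

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Predict continuous scores for the testing set.
y_pred = model.predict(X_test)

# Round and clip predictions to the valid 1-5 score classes before computing accuracy
# (an assumption about how the reported accuracy was computed).
y_pred_class = np.clip(np.round(y_pred), 1, 5).astype(int)
accuracy = accuracy_score(y_test, y_pred_class)

# RMSE is the square root of the mean squared error of the raw predictions.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Accuracy: {accuracy:.3f}, RMSE: {rmse:.2f}")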
Along the way I encountered several challenges that I had to overcome to improve my model. One challenge I faced when training the model was overfitting, which occurs when the model is too
complex and fits the training data too closely, resulting in poor performance on new, unseen
data. To address this, it is important to perform feature selection and regularization to prevent the
model from becoming too complex. In this case, certain columns are dropped from the dataset
during preprocessing, which helps to simplify the model and reduce the risk of overfitting.
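Beyond dropping columns, XGBoost itself exposes regularization hyperparameters that can limit model complexity. The values below are illustrative only and were not necessarily the settings used for the reported results.

from xgboost import XGBRegressor

# Illustrative regularized configuration (example values, not tuned for this dataset).
regularized_model = XGBRegressor(
    n_estimators=600,
    learning_rate=0.1,
    max_depth=5,
    reg_alpha=0.1,         # L1 regularization on leaf weights
    reg_lambda=1.0,        # L2 regularization on leaf weights
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # feature subsampling per tree
)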
Another challenge that can arise is dealing with missing or incomplete data. It is
important to carefully consider how to handle missing data, as it can have a significant impact on
the performance of the model. In this case, no missing data was encountered in the dataset,
which made it easier to work with. However, missing data is very common in practice, so it is important to know how to handle it when it does arise.
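Had missing values been present, a simple way to check for and handle them with pandas is sketched below; median imputation is just one common strategy among several, and the DataFrame df is the one loaded earlier.

# Count missing values per column to see whether imputation is needed.
print(df.isnull().sum())

# One common strategy: fill missing numeric values with the column median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())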
During the process of training the model, some interesting findings were discovered
about the dataset. For example, it was observed that there was a wide range of scores in the
dataset, ranging from 1 to 5, with a relatively even distribution among the different scores. This
suggests that the dataset is well-balanced and contains a diverse range of reviews. Additionally,
it was observed that certain features in the dataset, such as the length of the review and the
number of exclamation marks used, were highly correlated with the score given by the reviewer.
This suggests that these features may be important predictors of the score and could be used to improve the model's predictions.
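For illustration, features like these can be derived directly from the raw Text column before it is dropped; the exact feature definitions and column names (Text, Score) are assumptions, and the project's actual feature engineering may differ.

# Derive simple text-based features before dropping the raw Text column.
df["review_length"] = df["Text"].fillna("").str.len()
df["exclamation_count"] = df["Text"].fillna("").str.count("!")

# Inspect how strongly these features correlate with the score.
print(df[["review_length", "exclamation_count", "Score"]].corr())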
To further evaluate the model, I also plotted confusion matrices for both the testing and
training sets. The confusion matrix is a way to visualize the performance of a classification
model. It shows the number of true positives, false positives, true negatives, and false negatives.
The confusion matrices show that the model performs better on the training set than on the
testing set. This suggests that the model may be overfitting the training data and may not generalize well to new, unseen data.
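A sketch of how such confusion matrices can be produced with scikit-learn is shown below; as in the evaluation step, the continuous predictions are rounded to the 1-5 score classes first, which is an assumption about how the plots were generated.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Round predictions to discrete score classes for both splits.
train_pred = np.clip(np.round(model.predict(X_train)), 1, 5).astype(int)
test_pred = np.clip(np.round(model.predict(X_test)), 1, 5).astype(int)

# Plot one confusion matrix per split, side by side.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay.from_predictions(y_train, train_pred, ax=axes[0])
axes[0].set_title("Training set")
ConfusionMatrixDisplay.from_predictions(y_test, test_pred, ax=axes[1])
axes[1].set_title("Testing set")
plt.tight_layout()
plt.show()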
In conclusion, the approach taken to train this model is a good one, as it involves several
important steps such as feature selection, normalization, and evaluation on a testing set. The
XGBoost algorithm is an effective choice for this type of problem, as it is able to handle a wide
range of features and can prevent overfitting through regularization. During the training process,
it is important to be aware of potential challenges such as overfitting and missing data, and to
carefully preprocess the data to address these issues. Finally, it is important to analyze the dataset
during the training process to identify any key findings that can help to improve the performance
of the model.