The document outlines the process of executing an end-to-end machine learning project, including data acquisition, model selection, training, and evaluation. It discusses various loss functions, particularly Mean Squared Error and Mean Absolute Error, and their implications on model performance. Additionally, it covers techniques for model fine-tuning such as Grid Search and Randomized Search, as well as the importance of cross-validation in assessing model accuracy.
Lecture 5
Machine Learning
Siraj
Lecture Content
• End-to-End ML Project
End-to-End ML Project
• Look at the big picture.
• Get the data.
• Discover and visualize the data to gain insights.
• Prepare the data for Machine Learning algorithms.
• Select a model and train it.
• Fine-tune your model.
• Present your solution.
• Launch, monitor, and maintain your system.
Select and Train a Model
• At last!
• You framed the problem.
• You got the data and explored it.
• You sampled a training set and a test set.
• You wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically.
• You are now ready to select and train a Machine Learning model.
Training and Evaluating on the Training Set • Let's train a Linear Regression model.
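A minimal training sketch, assuming the prepared feature matrix and labels from the earlier pipeline are available under the hypothetical names housing_prepared and housing_labels (the slide does not show the code):

from sklearn.linear_model import LinearRegression

# housing_prepared / housing_labels are assumed outputs of the earlier
# data-preparation pipeline (names not shown on the slide).
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)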
• Done! You now have a working Linear Regression model.
Let’s try it out on a few instances from the training set:
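A sketch of that spot check, assuming the raw training features, labels, and preparation pipeline are available as housing, housing_labels, and full_pipeline (hypothetical names):

# Take a handful of training instances and compare predictions with labels.
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)  # reuse the fitted pipeline

print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))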
• It works, although the predictions are not exactly accurate (e.g., the second prediction is off by more than 50%!).
• Let’s measure this regression model’s RMSE on the whole training set using Scikit-Learn’s mean_squared_error function:
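A sketch of that measurement; the RMSE is simply the square root of the MSE returned by mean_squared_error:

from sklearn.metrics import mean_squared_error
import numpy as np

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # about $68,628 on this data, as discussed below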
Loss Function
• Machines learn by means of a loss function.
• It is a method of evaluating how well a specific algorithm models the given data.
• If predictions deviate too much from the actual results, the loss function produces a very large number.
• Gradually, with the help of an optimization function, the model learns to reduce the prediction error.
Which Loss Function to Choose?
• There is no one-size-fits-all loss function in machine learning.
• Several factors are involved in choosing a loss function for a specific problem, such as:
o the type of machine learning algorithm chosen
o the ease of calculating the derivatives
o and, to some degree, the percentage of outliers in the data set.
Categories of Loss Functions
• Broadly, loss functions can be classified into two major categories depending on the type of learning task: Regression losses and Classification losses.
• Regression deals with predicting a continuous value; for example, given the floor area, number of rooms, and room sizes, predict the price of the house.
• Classification tries to predict an output from a finite set of categorical values; for example, given a large data set of images of handwritten digits, categorize each one as a digit from 0 to 9.

Regression Losses
• Mean Square Error / Quadratic Loss / L2 Loss
• Mean Absolute Error/L1 Loss
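For reference, the two losses (and their per-sample gradients, which the next slides discuss) can be written in the standard form; these formulas are not taken from the slides themselves:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2,
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial \hat{y}_i} = \frac{2}{n}\bigl(\hat{y}_i - y_i\bigr)

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\bigl|y_i - \hat{y}_i\bigr|,
\qquad
\frac{\partial\,\mathrm{MAE}}{\partial \hat{y}_i} = \frac{1}{n}\,\operatorname{sign}\bigl(\hat{y}_i - y_i\bigr)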
Mean Squared Error (MSE)
• What it is: the average of the squares of the differences between predicted and actual values.
• Gradient behavior: because of the square, the loss grows faster as the prediction moves further from the target. That means:
o Small errors get smaller gradients.
o Large errors get very big gradients, which makes the model learn faster from big mistakes.
MSE
• As the name suggests, Mean Squared Error is measured as the average of the squared differences between predictions and actual observations.
• It is only concerned with the average magnitude of the errors, irrespective of their direction.
• However, due to the squaring, predictions that are far away from the actual values are penalized heavily compared to less deviated predictions.
• MSE also has nice mathematical properties, which makes it easier to calculate gradients.

Mean Absolute Error (MAE)
• What it is: the average of the absolute values of the differences between predicted and actual values.
• Gradient behavior: the gradient is constant; it is either +1 or -1 depending on the direction of the error.
o All errors, big or small, have the same influence on the update.
o This makes learning more stable and robust to outliers, but it can be slower to converge.

MAE
• MAE is measured as the average of the sum of absolute differences between predictions and actual observations.
• Like MSE, it measures the magnitude of the errors without considering their direction.
• Unlike MSE, MAE needs more complicated tools, such as linear programming, to compute the gradients.
• MAE is also more robust to outliers, since it does not square the errors.
In short: MSE gradient gets bigger for bigger errors, so it reacts strongly to big mistakes. MAE gradient stays constant, so it treats all errors more equally.
Building Intuition
• This is better than nothing but clearly not a great score: most districts’ median_housing_values range between $120,000 and $265,000, so a typical prediction error of $68,628 is not very satisfying.
• This is an example of a model underfitting the training data.
• When this happens, it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough.
Overcome Underfitting
• The main ways to fix underfitting are to select a more powerful model, to feed the training algorithm better features, or to reduce the constraints on the model.
• You could try to add more features (e.g., the log of the population), but first let’s try a more complex model to see how it does.
• Let’s train a DecisionTreeRegressor.
• This is a powerful model, capable of finding complex nonlinear relationships in the data.
• The code should look familiar by now. Once the model is trained, let’s evaluate it on the training set:
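A sketch of both steps, using the same assumed housing_prepared and housing_labels as before:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

# Evaluate on the training set itself.
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)  # comes out as 0.0, which looks suspiciously perfect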
• No error at all? Could this model really be absolutely perfect?
• Of course, it is much more likely that the model has badly overfit the data.

Better Evaluation Using Cross-Validation
• One way to evaluate the Decision Tree model would be to use the train_test_split function to split the training set into a smaller training set and a validation set.
• Then train your models against the smaller training set and evaluate them against the validation set.
• It’s a bit of work, but nothing too difficult, and it would work fairly well.
• An alternative is to use Scikit-Learn’s cross-validation feature.
• K-fold cross-validation randomly splits the training set into 10 distinct subsets called folds.
• It then trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds.
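A sketch of 10-fold cross-validation with cross_val_score. Note that Scikit-Learn’s cross-validation expects a utility function (greater is better) rather than a cost function, so the scoring is the negative MSE and the sign is flipped before taking the square root:

from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)  # scores are negative MSEs, so negate first

print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())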
• Now the Decision Tree doesn’t look as good as it did earlier. In fact, it seems to perform worse than the Linear Regression model!
• Notice that cross-validation allows you to get not only an estimate of the performance of your model, but also a measure of how precise this estimate is (i.e., its standard deviation).
• The Decision Tree has a score of approximately 71,200, generally ±3,200.
• You would not have this information if you just used one validation set. But cross-validation comes at the cost of training the model several times, so it is not always possible.
Let’s compute the same scores for the Linear Regression model just to be sure:
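Same pattern as above, applied to the Linear Regression model:

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)

print("Mean:", lin_rmse_scores.mean())
print("Standard deviation:", lin_rmse_scores.std())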
Note: The Decision Tree model is overfitting so badly that it performs worse than the Linear Regression model.
RandomForestRegressor
• Let’s try another model:
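A sketch of training and cross-validating a Random Forest, following the same pattern as before:

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
print("Mean:", forest_rmse_scores.mean())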
Fine-Tune Your Model
• Grid Search
• Randomized Search
• Ensemble Methods
• Analyze the Best Models and Their Errors
• Evaluate Your System on the Test Set
Grid Search
• One way to fine-tune would be to fiddle with the hyperparameters manually until you find a great combination of hyperparameter values, which is of course very tedious!
• Instead, you should get Scikit-Learn’s GridSearchCV to search for you.
• All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:
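A sketch of such a grid search; the exact hyperparameter values are illustrative, but they are chosen to match the 3 × 4 and 2 × 3 combination counts described on the next slide:

from sklearn.model_selection import GridSearchCV

param_grid = [
    # First dict: 3 x 4 = 12 combinations of n_estimators and max_features.
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    # Second dict: 2 x 3 = 6 combinations, with bootstrap set to False.
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)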
• This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of the n_estimators and max_features hyperparameter values specified in the first dict.
• Then it tries all 2 × 3 = 6 combinations of hyperparameter values in the second dict.
• This time, however, the bootstrap hyperparameter is set to False instead of True (the default value for this hyperparameter).
• All in all, the grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values, and it will train each model five times (since we are using five-fold cross-validation).
• In other words, there will be 18 × 5 = 90 rounds of training!
• It may take quite a long time, but when it is done you can get the best combination of parameters like this:
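For example (the sample output shown is the combination reported on the take-away slide below):

grid_search.best_params_
# e.g. {'max_features': 6, 'n_estimators': 30}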
You can also get the best estimator:
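For example:

grid_search.best_estimator_  # with the default refit=True, this estimator was refit on the whole training set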
Take Away
• In this example, we obtain the best solution by setting the max_features hyperparameter to 6 and the n_estimators hyperparameter to 30.
• The RMSE score for this combination is 49,959, which is slightly better than the score you got earlier using the default hyperparameter values (which was 52,634).
• Congratulations, you have successfully fine-tuned your best model!
Randomized Search
• The grid search approach is fine when you are exploring relatively few combinations, like in the previous example.
• When the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead.
• This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration (a usage sketch follows the list below).
• This approach has two main benefits:
• If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
• You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
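A sketch of a randomized search over the same Random Forest; the distributions and iteration count here are illustrative choices, not values from the slides:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    "n_estimators": randint(low=1, high=200),  # sample n_estimators uniformly in [1, 200)
    "max_features": randint(low=1, high=8),    # sample max_features uniformly in [1, 8)
}

rnd_search = RandomizedSearchCV(RandomForestRegressor(),
                                param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)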
Ensemble Methods
• Another way to fine-tune your system is to try to combine the models that perform best.
• The group (or “ensemble”) will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors.
Analyze the Best Models and Their Errors
• You will often gain good insights on the problem by inspecting the best models.
• For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions:
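A sketch of how these importances can be read off the best model and displayed next to their attribute names; attributes is a hypothetical list of the column names produced by the preparation pipeline:

feature_importances = grid_search.best_estimator_.feature_importances_

# Pair each importance score with its attribute name, highest first.
sorted(zip(feature_importances, attributes), reverse=True)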
• These importance scores can be displayed next to their corresponding attribute names (as in the sketch above).
• With this information, you may want to try dropping some of the less useful features (e.g., apparently only one ocean_proximity category is really useful, so you could try dropping the others).
• You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or, on the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).
Evaluate Your System on the Test Set
• After tweaking your models for a while, you eventually have a system that performs sufficiently well.
• Now is the time to evaluate the final model on the test set.
• There is nothing special about this process; just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform()!), and evaluate the final model on the test set:
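A sketch of that final evaluation; strat_test_set, full_pipeline, and the "median_house_value" label column are assumed from the earlier data-preparation steps rather than shown on the slides:

from sklearn.metrics import mean_squared_error
import numpy as np

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)  # transform(), not fit_transform()!
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)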
• The performance will usually be slightly worse than what you measured using cross-validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data and will likely not perform as well on unknown datasets).

Try It Out!