VTU ML
Introduction to Scikit-learn
Scikit-learn is one of the most widely-used open-source machine learning libraries in Python. It
was developed as a part of the SciPy ecosystem, designed to offer simple and efficient tools for
data analysis and machine learning. Scikit-learn’s appeal lies in its easy-to-understand API,
extensive algorithm options, and strong community support, making it an excellent choice for both
beginners and experienced data scientists.
Scikit-learn supports the following learning paradigms:
1. Supervised Learning:
In supervised learning, the model learns from labeled data, meaning the dataset includes both input
data (features) and the expected output (labels). Common tasks include:
Classification: Predicting discrete labels (e.g., spam vs. not spam, cancerous vs. non-cancerous).
Regression: Predicting continuous values (e.g., predicting house prices, stock values).
2. Unsupervised Learning:
In unsupervised learning, the model is not given labeled data. Instead, it finds patterns or structures within the dataset, such as groups of similar samples (clustering) or compact low-dimensional representations of the features (dimensionality reduction).
3. Semi-Supervised Learning:
Semi-supervised learning involves a small amount of labeled data and a large amount of unlabeled
data. It is useful when labeling data is expensive or time-consuming, but there is an abundance of
unlabeled data.
4. Reinforcement Learning:
Reinforcement learning is a paradigm in which agents learn to make decisions by interacting with an environment to maximize cumulative rewards over time (e.g., training a bot to play a game). Scikit-learn does not natively support it.
Key Features of Scikit-learn
1. Data Preprocessing:
Preprocessing is an essential step in machine learning pipelines. Before training a model, it’s
important to clean and standardize the data. Scikit-learn offers several modules for this, including:
Standardization: Features are rescaled so that they have zero mean and unit variance.
Normalization: Scaling features to a range, such as [0, 1].
Encoding: Converting categorical variables into numerical ones (e.g., using one-hot encoding).
Example:
Python
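# A minimal sketch; the input data is an assumption chosen to reproduce
# the output shown below (three equally spaced samples per feature)
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 2], [2, 3], [3, 4]])

# Rescale each feature to zero mean and unit variance
scaler = StandardScaler()
print(scaler.fit_transform(data))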
Output:
[[-1.22474487 -1.22474487]
[ 0. 0. ]
[ 1.22474487 1.22474487]]
2. Machine Learning Models:
Scikit-learn provides a variety of machine learning algorithms for both classification and regression. Some of the most popular models include:
Linear Regression: For predicting continuous values (e.g., predicting housing prices).
Logistic Regression: For binary classification tasks (e.g., predicting whether an email is spam or
not).
Decision Trees: For both classification and regression tasks.
Random Forest: An ensemble method using multiple decision trees to improve accuracy.
Example:
Python
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
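# A sketch of the remaining steps; the choice of RandomForestClassifier is
# illustrative (any model listed above shares the same fit/predict API)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))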
3. Model Evaluation:
Scikit-learn provides various metrics and tools to evaluate model performance, which helps to fine-tune and optimize models. Some of the most commonly used evaluation metrics include accuracy, precision, recall, F1 score, and the confusion matrix.
Example:
Python
from sklearn.metrics import accuracy_score, confusion_matrix

# Example labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)
# Confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", conf_matrix)
4. Hyperparameter Tuning:
Hyperparameter tuning is essential for optimizing machine learning models. Scikit-learn provides
tools like GridSearchCV and RandomizedSearchCV to automatically search for the best
hyperparameters.
Example:
Python
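# A minimal sketch; the estimator and parameter grid are assumptions
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3, 5]}

# Exhaustively search the grid with 5-fold cross-validation
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X, y)
print("Best Parameters:", grid_search.best_params_)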
5. Cross-Validation:
Cross-validation is a technique to assess the model's performance more accurately by splitting the
dataset into training and validation sets multiple times. Scikit-learn offers different methods for
cross-validation, such as K-Fold and StratifiedKFold.
Example:
Python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y, cv=5)
print("Cross-validation scores:", scores)
6. Unsupervised Learning:
For tasks like clustering and dimensionality reduction, Scikit-learn provides popular algorithms
such as K-Means, DBSCAN, and PCA (Principal Component Analysis).
import numpy as np

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
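# A sketch: cluster the sample points with K-Means (two clusters is an
# assumption matching the two obvious groups in the data)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("Cluster labels:", labels)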
Applications of Scikit-learn
1. Healthcare:
o Disease Prediction: Logistic regression and decision trees can be used for predicting
diseases based on patient history, symptoms, and test results.
o Medical Image Analysis: Techniques like K-Means and PCA can be used for segmenting
and analyzing medical images (e.g., tumor detection).
2. Finance:
o Fraud Detection: Classification algorithms like Random Forest and SVM help detect
fraudulent transactions based on past transaction patterns.
o Stock Market Prediction: Regression models are widely used to predict stock prices and
market trends based on historical data.
3. Retail:
o Customer Segmentation: Retailers use clustering techniques like K-Means to group
customers based on buying behavior for personalized marketing.
o Sales Forecasting: Regression models help predict future sales by analyzing historical
data, seasonality, and market trends.
4. Natural Language Processing (NLP):
o Text Classification: Algorithms like Naive Bayes are used to classify emails as spam or
not spam, and to categorize news articles based on topics.
5. Recommender Systems:
o Scikit-learn can be used to implement recommendation algorithms for suggesting products,
movies, or songs based on user preferences and historical interactions.
Conclusion
Scikit-learn is a versatile and powerful library that provides a wide range of machine learning
algorithms and tools. Its ease of use, scalability, and comprehensive set of features make it a
preferred choice for data analysis and machine learning tasks across a wide variety of domains,
from healthcare and finance to e-commerce and natural language processing. Whether you're
building a basic machine learning model or fine-tuning a complex algorithm, Scikit-learn offers
the right tools to get the job done efficiently.
Scikit-learn Packages
Scikit-learn offers a comprehensive collection of packages (also called modules) that provide
functionality for different stages of the machine learning workflow, from data preprocessing to
model evaluation. Below is an elaboration of the main packages in Scikit-learn, along with
examples to demonstrate their use.
1. sklearn.preprocessing: Data Preprocessing
The preprocessing module contains functions to transform raw data into a format more suitable for
machine learning algorithms. These transformations include scaling, normalizing, encoding
categorical variables, and imputing missing values.
Key Functions:
StandardScaler: Standardizes features by removing the mean and scaling to unit variance.
MinMaxScaler: Scales features to a fixed range, typically [0, 1].
OneHotEncoder: Converts categorical variables into a one-hot numeric array.
LabelEncoder: Encodes target labels with values between 0 and n_classes - 1.
SimpleImputer: Fills missing values in the dataset.
Example: Standardization
Python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Example: One-Hot Encoding
Python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
data = np.array([['male'], ['female'], ['female'], ['male']])

# sparse_output=False returns a dense array (use sparse=False on older versions)
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data)
print(encoded_data)
2. sklearn.model_selection: Model Selection and Evaluation
The model_selection module provides utilities for splitting datasets, cross-validating models, and searching over hyperparameters.
Key Functions:
train_test_split: Splits arrays into random training and test subsets.
cross_val_score: Evaluates a model with cross-validation.
KFold / StratifiedKFold: Cross-validation iterators.
GridSearchCV / RandomizedSearchCV: Hyperparameter search over a grid or random samples.
Example: Train/Test Split
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Example: GridSearchCV for Hyperparameter Tuning
Python
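# A minimal sketch; the estimator (k-NN) and grid values are assumptions
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': [3, 5, 7, 9]}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)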
3. sklearn.metrics: Model Evaluation
The metrics module provides various tools for evaluating the performance of machine learning
models. These include accuracy, precision, recall, F1 score, ROC curves, and regression metrics.
Key Functions:
accuracy_score: Fraction of correctly predicted labels.
confusion_matrix: Tabulates true versus predicted labels.
precision_score, recall_score, f1_score: Classification metrics.
mean_absolute_error, mean_squared_error, r2_score: Regression metrics.
Example:
Python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
# Confusion matrix
print(confusion_matrix(y_true, y_pred))
# Accuracy score
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)
4. sklearn.ensemble: Ensemble Methods
The ensemble module provides a collection of algorithms that combine the predictions of multiple
models to improve performance. These include popular ensemble techniques like bagging and
boosting.
Key Algorithms:
RandomForestClassifier / RandomForestRegressor: Ensembles of decision trees trained with bagging.
GradientBoostingClassifier / GradientBoostingRegressor: Boosted tree ensembles.
AdaBoostClassifier: Adaptive boosting.
Example:
Python
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
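# A sketch of training a random forest on a held-out split (parameters assumed)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))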
5. sklearn.linear_model: Linear Models
This module contains a wide variety of linear models for regression and classification, such as linear regression, logistic regression, ridge regression, and more.
Key Algorithms:
LinearRegression: Ordinary least-squares regression.
LogisticRegression: Linear model for classification.
Ridge / Lasso: Linear regression with L2 / L1 regularization.
Example:
Python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data (X: features, y: target variable)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 3, 3, 5])
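# A sketch of fitting the model and inspecting the learned line
model = LinearRegression()
model.fit(X, y)
print("Coefficient:", model.coef_, "Intercept:", model.intercept_)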
6. sklearn.tree: Decision Trees
The tree module provides tools for building decision tree models, both for classification and regression tasks.
Key Algorithms:
DecisionTreeClassifier: Decision tree for classification.
DecisionTreeRegressor: Decision tree for regression.
Example:
Python
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
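# A sketch of fitting a decision tree classifier (default parameters)
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)
print("Training accuracy:", tree.score(X, y))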
7. sklearn.svm: Support Vector Machines
The svm module implements support vector machines for classification and regression.
Key Algorithms:
SVC: Support Vector Classification.
SVR: Support Vector Regression.
Example:
Python
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
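# A sketch of fitting a support vector classifier (kernel choice assumed)
from sklearn.svm import SVC

svc = SVC(kernel='rbf')
svc.fit(X, y)
print("Training accuracy:", svc.score(X, y))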
8. sklearn.cluster: Clustering
The cluster module provides tools for performing unsupervised learning tasks like clustering, which
groups similar data points together.
Key Algorithms:
KMeans: Partitions data into k clusters by minimizing within-cluster variance.
DBSCAN: Density-based clustering that also identifies outliers.
AgglomerativeClustering: Hierarchical clustering.
Example:
Python
import numpy as np

# Example data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
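# A sketch of clustering with K-Means (two clusters assumed for this toy data)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
print("Cluster labels:", kmeans.fit_predict(X))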
9. sklearn.decomposition: Dimensionality Reduction
This module includes algorithms for reducing the number of features in a dataset while retaining
as much information as possible. Dimensionality reduction is useful for data visualization and
speeding up machine learning algorithms.
Key Algorithms:
PCA: Principal Component Analysis.
TruncatedSVD: Dimensionality reduction via truncated singular value decomposition.
Example:
Python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load dataset
iris = load_iris()
X = iris.data

# Project the four features onto two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced)
Conclusion
Scikit-learn is packed with powerful packages that allow users to implement a wide range of
machine learning algorithms. Each package serves a critical role in the machine learning
pipeline—from preprocessing data to fine-tuning models and making predictions. The consistency
of its API makes it a highly usable library for both beginners and professionals.
How to Install Scikit-learn
Scikit-learn can be installed easily using Python's package management tools like pip or conda (for
Anaconda users). Follow the steps below to install Scikit-learn.
If you are using the default Python installation, you can install Scikit-learn via pip.
Bash
pip install scikit-learn
This will install Scikit-learn along with its dependencies like NumPy, SciPy, and joblib.
If you're using the Anaconda distribution of Python, you can install Scikit-learn using conda:
Bash
conda install scikit-learn
This will install the version of Scikit-learn that is compatible with Anaconda and any necessary
dependencies.
Verifying Installation
Once the installation is complete, you can verify it by checking the installed version:
Python
import sklearn
print(sklearn.__version__)
1. Loading Datasets
Scikit-learn provides several built-in datasets that are useful for practice and experimentation. You
can load these datasets using the sklearn.datasets module.
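For example, the Iris dataset can be loaded as follows (a minimal sketch):
Python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)   # (150, 4): 150 samples, 4 features
print(iris.target[:5])   # the first five class labels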
2. Data Preprocessing
Before feeding data into machine learning algorithms, it's common to preprocess it. Scikit-learn
provides several preprocessing tools.
Example: Feature Scaling
Python
from sklearn.preprocessing import StandardScaler

# Example data
data = [[1, 2], [3, 4], [5, 6]]

# The specific scaler is an assumption; MinMaxScaler works the same way
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Example: Encoding Categorical Data
Python
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
data = [['male'], ['female'], ['female'], ['male']]

# sparse_output=False returns a dense array (use sparse=False on older versions)
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data)
print(encoded_data)
3. Splitting Data into Training and Testing Sets
A common practice is to split your dataset into training and testing subsets. Scikit-learn provides
train_test_split to easily perform this split.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
4. Training a Model
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
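# A sketch of the missing steps: split the data and fit a k-NN classifier
# (test_size and n_neighbors are assumptions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)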
# Make predictions
y_pred = knn.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
5. Evaluating the Model
Once you build a model, you can evaluate its performance using various metrics like accuracy,
precision, recall, F1 score, and confusion matrix.
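A minimal sketch, continuing from the k-NN example above:
Python
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))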
6. Making Predictions
Once you’ve trained your model, you can use it to make predictions on new, unseen data.
Python
import numpy as np

# A new, unseen sample (the feature values are assumed)
new_data = np.array([[5.1, 3.5, 1.4, 0.2]])

prediction = knn.predict(new_data)
print("Predicted class:", prediction)
7. Hyperparameter Tuning
You can use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for your model.
Python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Candidate hyperparameter values (the grid is an assumption)
param_grid = {'n_neighbors': [3, 5, 7, 9]}

# Initialize GridSearchCV
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)
Conclusion
Scikit-learn is an easy-to-use library that simplifies the process of building and evaluating machine
learning models. By following these steps, you can quickly get started with Scikit-learn, from
installation to training a model and evaluating its performance.
MACHINE LEARNING MODEL
1. Training and Learning
Training: The training process involves feeding the model a set of input data along with
corresponding output labels (in supervised learning). The model uses this data to adjust its internal
parameters, striving to learn the relationship between the inputs and outputs.
Learning: During the training phase, the model iteratively updates its parameters to minimize a
loss function, which quantifies the difference between predicted outputs and actual outputs. This
process is crucial for improving the accuracy of the model’s predictions.
2. Types of Learning
Supervised Learning:
o In this paradigm, the model learns from a labeled dataset, where each input is associated
with the correct output. The objective is to learn a mapping from inputs to outputs.
o Example: Consider an email classification system where emails are labeled as "spam" or
"not spam." The model is trained on historical emails, learning to recognize features that
are indicative of spam (such as certain keywords, frequency of links, etc.) and
distinguishing them from non-spam emails.
Unsupervised Learning:
o Here, the model is exposed to data without explicit labels. It attempts to learn the
underlying structure or patterns within the data.
o Example: Customer segmentation in marketing can utilize unsupervised learning. A model
may analyze purchasing behavior data to group customers into segments based on
similarities (e.g., frequent buyers vs. occasional buyers), allowing marketers to tailor
strategies for different groups.
Reinforcement Learning:
o This learning type involves training agents to make sequences of decisions. The agent
learns to achieve a goal in an uncertain environment by receiving feedback in the form of
rewards or penalties based on its actions.
o Example: A model designed to play chess learns strategies through reinforcement learning.
It plays numerous games, receiving rewards for winning and penalties for losing, thereby
optimizing its strategies over time.
3. Types of ML Models
Classification Models:
o These models predict categorical outcomes. They are widely used for tasks such as spam
detection, image recognition, and medical diagnosis.
o Examples: Logistic Regression, Decision Trees, Random Forests, Support Vector
Machines (SVM).
Regression Models:
o Regression models predict continuous values and are commonly applied in scenarios like
predicting stock prices, house prices, and temperature forecasts.
o Examples: Linear Regression, Polynomial Regression, Support Vector Regression (SVR).
Clustering Models:
o Clustering models group data points into clusters based on similarity, making them useful
for exploratory data analysis.
o Examples: K-Means Clustering, Hierarchical Clustering, DBSCAN (Density-Based
Spatial Clustering of Applications with Noise).
Dimensionality Reduction Models:
o These models reduce the number of features in a dataset while retaining as much important
information as possible. They help visualize high-dimensional data and mitigate the curse
of dimensionality.
o Examples: Principal Component Analysis (PCA), t-SNE (t-distributed Stochastic
Neighbor Embedding).
4. Evaluating Model Performance
After training a model, it's crucial to evaluate its performance using metrics suited to the task, such as accuracy, precision, recall, and F1 score for classification, or MAE, MSE, and R² for regression.
5. The Machine Learning Workflow
The workflow for developing and deploying an ML model typically includes the following steps:
1. Data Collection: Gather a dataset that is relevant to the problem you aim to solve. Data
can come from various sources, such as databases, APIs, and web scraping.
2. Data Preprocessing: Clean and prepare the data by:
o Handling missing values (e.g., imputation or removal).
o Encoding categorical variables (e.g., one-hot encoding).
o Normalizing or scaling features to ensure that they are on a similar scale.
3. Model Selection: Choose a suitable machine learning algorithm based on the type of
learning (supervised, unsupervised, or reinforcement) and the problem domain
(classification, regression, clustering).
4. Training: Fit the selected model to the training data, allowing it to learn patterns and
relationships within the dataset.
5. Validation and Testing: Evaluate the model using a separate validation or test set to
ensure that it performs well on unseen data. This step helps assess generalization ability.
6. Hyperparameter Tuning: Optimize the model's performance by tuning hyperparameters
using techniques like Grid Search or Random Search.
7. Deployment: Once validated, the model can be deployed in a real-world application. This
could involve integrating the model into an existing system, providing predictions via an
API, or even embedding it into a mobile app.
Let’s illustrate the entire process using an example of predicting house prices:
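A compact sketch of that workflow; the data is synthetic and every number is illustrative:
Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: house size (sq. ft.) vs. price (assumed linear trend plus noise)
rng = np.random.default_rng(42)
size = rng.uniform(500, 3500, size=(200, 1))
price = 50_000 + 120 * size[:, 0] + rng.normal(0, 20_000, size=200)

# Split, train, and evaluate
X_train, X_test, y_train, y_test = train_test_split(size, price, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))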
Conclusion
In summary, a machine learning model is a sophisticated tool that harnesses the power of data and
algorithms to identify patterns, make predictions, and facilitate decision-making in diverse
applications. By understanding the various types of learning and models, as well as the entire
workflow from data collection to deployment, we can better leverage machine learning to solve
real-world problems effectively. The versatility of machine learning models ensures their
applicability across numerous fields, from healthcare and finance to marketing and entertainment,
making them an integral part of modern technology.
Supervised learning is a foundational approach in machine learning, where models are trained
using labeled datasets. This means that each training example is paired with an output label,
guiding the learning process. In this section, we'll delve deeper into two main categories:
classification models and regression models.
A. Classification Models
Classification models are used to predict categorical outcomes. Here are some commonly used
algorithms:
1. Logistic Regression
Description: Logistic regression is primarily used for binary classification tasks. It estimates the
probability of a binary response based on one or more predictor variables. The probabilities are
modeled using the logistic (sigmoid) function, which constrains the output to the range [0, 1].
Example Code:
Python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data[:100] # Use only two classes for binary classification
y = iris.target[:100]
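# A sketch of the missing steps: hold out a test set and fit the model
# (the split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)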
# Make predictions
predictions = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
Output Explanation:
Accuracy: 1.0
This indicates that the model made accurate predictions for all test samples. The accuracy score is
a common evaluation metric for classification models, measuring the proportion of correctly
predicted instances.
2. Decision Trees
Description: Decision trees use a tree-like model of decisions, where each internal node
represents a feature, each branch represents a decision rule, and each leaf node represents an
outcome.
Example Code:
Python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
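# A sketch of the missing steps (split ratio assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)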
# Make predictions
predictions = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
Output Explanation:
Accuracy: 1.0
Again, this perfect accuracy score signifies that the model made correct predictions for every
instance in the test set.
3. Random Forest
Description: Random forests are an ensemble method that combines multiple decision trees to
improve the overall accuracy and reduce the risk of overfitting.
Example Code:
Python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
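# A sketch of the missing steps (parameters assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)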
# Make predictions
predictions = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
Output Explanation:
Accuracy: 1.0
The random forest model also achieved perfect accuracy, demonstrating the effectiveness of
ensemble methods.
4. Support Vector Machines (SVM)
Description: Support vector machines find the hyperplane that separates classes with the maximum margin, often performing well on high-dimensional data.
Use Case: Handwritten digit recognition.
Example Code:
Python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data[:100] # Use only two classes for binary classification
y = iris.target[:100]
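# A sketch of the missing steps (kernel and split ratio assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = SVC(kernel='linear')
model.fit(X_train, y_train)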
# Make predictions
predictions = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
Output Explanation:
Accuracy: 1.0
The SVM model also demonstrates perfect accuracy, reinforcing its effectiveness in binary
classification tasks.
5. Naive Bayes
Description: Naive Bayes classifiers are based on Bayes' theorem and assume the independence
of features given the class label.
Example Code:
Python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
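# A sketch of the missing steps; GaussianNB is the usual choice for
# continuous features (the split ratio is assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = GaussianNB()
model.fit(X_train, y_train)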
# Make predictions
predictions = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
Output Explanation:
Accuracy: 1.0
Naive Bayes also achieves perfect accuracy, showcasing its utility in classification tasks.
B. Regression Models
Regression models are used for predicting continuous outcomes. Here are a few popular regression
algorithms:
1. Linear Regression
Description: Linear regression models the relationship between a dependent variable and one or
more independent variables using a linear equation.
Use Case: Predicting housing prices based on features like size and location.
Example Code:
Python
import numpy as np
from sklearn.linear_model import LinearRegression

# Example dataset
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 3, 4, 5]) # Linear relationship
# Create and train the model
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
print("Predictions:", predictions)
Output Explanation:
The model perfectly predicts the output, demonstrating the effectiveness of linear regression for
simple linear relationships.
2. Ridge Regression
Description: Ridge regression adds an L2 regularization term to linear regression to prevent
overfitting.
Example Code:
Python
import numpy as np
from sklearn.linear_model import Ridge

# Same example data as above (assumed)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 3, 4, 5])

# Create and train the model with L2 regularization (alpha is an assumption)
model = Ridge(alpha=1.0)
model.fit(X, y)

# Make predictions
predictions = model.predict(X)
print("Predictions:", predictions)
Output Explanation:
The Ridge regression model also predicts accurately, maintaining linear relationships while
controlling for complexity.
3. Lasso Regression
Description: Similar to Ridge regression, Lasso includes an L1 regularization term that can set
some coefficients to zero, effectively performing feature selection.
Example Code:
Python
import numpy as np
from sklearn.linear_model import Lasso

# Same example data as above (assumed)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 3, 4, 5])

# Create and train the model with L1 regularization (alpha is an assumption)
model = Lasso(alpha=0.1)
model.fit(X, y)

# Make predictions
predictions = model.predict(X)
print("Predictions:", predictions)
Output Explanation:
Lasso regression also achieves accurate predictions while potentially eliminating less important
features.
4. Polynomial Regression
Description: Polynomial regression extends linear regression by adding polynomial terms,
allowing the model to fit non-linear relationships.
Example Code:
Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Example data with a non-linear (quadratic) relationship (assumed)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])

# Expand the features with polynomial terms, then fit a linear model
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)

# Make predictions
predictions = model.predict(X_poly)
print("Predictions:", predictions)
Output Explanation:
The polynomial model fits the quadratic relationship closely, something a plain linear regression could not capture.
Unsupervised learning models are trained on datasets without labeled responses. The goal is to
identify patterns or structures within the data. Here are some common unsupervised learning
models:
A. Clustering Models
1. K-Means Clustering
K-means is a popular clustering algorithm that partitions the dataset into k clusters by minimizing
the variance within each cluster.
Example Code:
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Example data: two well-separated groups (assumed)
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Fit K-Means with two clusters
model = KMeans(n_clusters=2, n_init=10, random_state=42)
model.fit(X)

# Make predictions
labels = model.predict(X)

# Plot the clusters described in the output explanation below
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()
Output Explanation:
The output will be a scatter plot showing clusters identified by the K-Means algorithm. Each
color represents a different cluster, illustrating how the algorithm has grouped similar data points
together.
2. Hierarchical Clustering
Hierarchical clustering builds a tree of clusters, which can be cut at different levels to achieve
various cluster sizes.
Example Code:
Python
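# A minimal sketch using AgglomerativeClustering; the data and the number
# of clusters are assumptions
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()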
Output Explanation:
The resulting scatter plot illustrates the clusters formed by the hierarchical clustering algorithm,
showcasing the grouping of data points.
3. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that lie in dense regions and labels isolated points as outliers, without requiring the number of clusters in advance.
Example Code:
Python
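# A minimal sketch; eps and min_samples are assumptions tuned to this toy data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

model = DBSCAN(eps=3, min_samples=2)
labels = model.fit_predict(X)  # label -1 marks outliers

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()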
Output Explanation:
The scatter plot displays the clusters identified by the DBSCAN algorithm, highlighting the dense
areas and outliers.
B. Dimensionality Reduction Models
1. PCA (Principal Component Analysis)
PCA reduces the dimensionality of the dataset while preserving variance, transforming the data
into a new set of variables known as principal components.
Example Code:
Python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load dataset
iris = load_iris()
X = iris.data

# Apply PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Plot the two components, colored by species (see output explanation below)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=iris.target)
plt.show()
Output Explanation:
The output is a scatter plot representing the reduced-dimensional representation of the Iris
dataset. Each color corresponds to a different species of iris, demonstrating how PCA captures
variance in fewer dimensions.
2. t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) is particularly well-suited for visualizing
high-dimensional datasets by embedding them into a lower-dimensional space.
Example Code:
Python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Load dataset
iris = load_iris()
X = iris.data

# Apply t-SNE
tsne = TSNE(n_components=2)
X_embedded = tsne.fit_transform(X)

# Plot the embedding, colored by species (see output explanation below)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=iris.target)
plt.show()
Output Explanation:
The scatter plot visualizes the clusters in the Iris dataset using t-SNE. Each point represents an iris
flower, and the clusters indicate the natural groupings of the data.
Conclusion
Machine learning models can be categorized into supervised and unsupervised learning, each with
a variety of algorithms tailored for specific tasks. The examples provided illustrate how to
implement and evaluate these models using the Scikit-learn library. From classification to
clustering and dimensionality reduction, Scikit-learn serves as a robust toolkit for diverse machine
learning applications, empowering practitioners to extract meaningful insights from their data.
Each algorithm discussed has its strengths and is suited for different types of data and objectives,
making it crucial to choose the right model based on the task at hand.
DATA SPLITTING AND MODEL EVALUATION METRICS
Data splitting is a fundamental step in the machine learning workflow. It involves dividing the
available dataset into distinct subsets for training, validation, and testing. This process is crucial
for several reasons:
1. Preventing Overfitting:
o Overfitting occurs when a model learns not only the underlying patterns in the training data
but also the noise and outliers. This results in a model that performs exceptionally well on
training data but poorly on unseen data.
o By splitting the data, we can train the model on a subset (training set) and then evaluate its
performance on a different subset (test set). This helps to gauge how well the model
generalizes to new data.
2. Ensuring Model Generalization:
o Generalization refers to a model's ability to perform well on unseen data. A model that
generalizes effectively can make accurate predictions beyond the specific examples it was
trained on.
o A well-defined split allows for proper validation of model performance, ensuring that it
can handle variations in data that it has not encountered before.
3. Hyperparameter Tuning:
o The validation set, which is separate from the training and test sets, is essential for tuning
hyperparameters of the model. Hyperparameters are configuration settings used to
optimize the model's performance, such as learning rate or the number of hidden layers.
o By evaluating model performance on the validation set, we can adjust these
hyperparameters to improve generalization without peeking at the test set.
4. Performance Evaluation:
o To objectively assess a model's performance, we need a dataset that it has never seen during
training. The test set provides this capability, allowing us to measure the model’s accuracy,
precision, recall, and other metrics.
o Performance metrics derived from the test set offer insights into how well the model is
likely to perform in real-world applications.
5. Reducing Bias:
o Data splitting helps reduce bias in model evaluation. If we only evaluate a model using the
same data it was trained on, we risk inflating performance metrics. Using a separate test
set mitigates this issue.
Understanding the types of data splits is essential for effective model training and evaluation. The
three primary subsets are:
1. Training Set:
o This subset is used to train the model. The model learns the relationships and patterns in
the data by adjusting its internal parameters based on this training data.
o A typical split might allocate around 70-80% of the total dataset to the training set.
2. Validation Set:
o The validation set is used during the training process to tune model hyperparameters and
assess how well the model is learning.
o It acts as a feedback mechanism, allowing us to make adjustments without using the test
data. The validation set usually constitutes about 10-15% of the dataset.
3. Test Set:
o The test set is used to evaluate the final model after training and validation. It provides an
unbiased evaluation of the model's performance on unseen data.
o Generally, 10-20% of the dataset is reserved for testing purposes. The model should not
have access to this data during the training phase.
To illustrate the concept of data splitting, consider a dataset containing 1,000 samples. A common approach might involve the following splits: 700 samples (70%) for training, 150 samples (15%) for validation, and 150 samples (15%) for testing.
In Python, using Scikit-learn, you can easily split a dataset like this:
Python
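# A sketch reconstructed from the code breakdown that follows; the feature
# matrix and labels are assumed here so the example runs standalone
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)             # 1,000 samples, 5 features (assumed)
y = np.random.randint(0, 2, size=1000)  # binary target labels (assumed)

# First split: 70% training, 30% temporary (to become validation + test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Second split: divide the temporary set evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training set size:", X_train.shape[0])
print("Validation set size:", X_val.shape[0])
print("Test set size:", X_test.shape[0])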
Code Breakdown
1. Importing the Function:
Python
from sklearn.model_selection import train_test_split
o This line imports the train_test_split function from the sklearn.model_selection module, which
is essential for splitting datasets in a way that maintains the distribution of classes in
classification tasks.
2. Defining the Dataset:
Python
X = ...  # feature data (a NumPy array or DataFrame)
y = ...  # target labels
o Here, X represents the feature data, which can be a NumPy array or a DataFrame containing
input variables.
o y represents the target labels corresponding to the feature data (the output we want to
predict).
o In a complete program, X and y would need to be defined (e.g., X = data.drop('target', axis=1)
and y = data['target']).
3. Splitting the Data:
Python
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
o This line splits the original dataset into two parts: the training set and a temporary set
(denoted as X_temp and y_temp).
o The test_size=0.3 parameter indicates that 30% of the data will be reserved for the temporary
set (which will later be divided into validation and test sets), while 70% will be used for
training.
o The random_state=42 ensures reproducibility of the results; using the same seed will
produce the same split every time the code is run.
Python
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
o This line takes the temporary set (30% of the original dataset) and splits it into the
validation set and the test set.
o The test_size=0.5 parameter here means that 50% of the temporary set will be used for the
validation set, and the other 50% will be for the test set.
o Thus, both the validation and test sets will each consist of 15% of the original dataset (0.5
* 30% = 15%).
4. Printing Sizes of the Splits:
Python
print("Training set size:", X_train.shape[0])
print("Validation set size:", X_val.shape[0])
print("Test set size:", X_test.shape[0])
o These lines print the sizes of each of the resulting splits using the shape attribute of NumPy
arrays or DataFrames, which provides the dimensions of the data. shape[0] gives the number
of samples in each set.
Expected Output
Assuming that the original dataset contained 1,000 samples, the expected output would be:
Training set size: 700
Validation set size: 150
Test set size: 150
Conclusion
Data splitting is a critical step in the machine learning pipeline that safeguards against overfitting,
ensures model generalization, facilitates hyperparameter tuning, and allows for unbiased
performance evaluation. Understanding the different types of data splits—training, validation, and
test sets—enables practitioners to build robust models that can perform effectively in real-world
scenarios.
1. Data Splitting Techniques
In machine learning, effectively splitting datasets is crucial for building robust models. Proper data
splitting helps ensure that models generalize well to unseen data and reduces the risk of overfitting.
The three primary techniques we will explore are:
Technique               Description
Random Splitting        Divides the dataset into subsets at random.
Stratified Splitting    Preserves the original class distribution in each subset.
Time Series Splitting   Respects the chronological order of observations in time series data.
1.1 Random Splitting
Explanation:
Random Splitting is the most straightforward approach to dividing a dataset. It involves randomly
selecting a certain percentage of the data to use for training, validation, and testing. This
randomness allows for unbiased training and testing of the model.
For example, if you have a dataset of 1,000 samples, you might randomly choose 700 samples for
training and 300 for testing. The random nature of this split helps in ensuring that the model's
performance is evaluated on unseen data.
Advantages:
Prevents Selection Bias: Randomly splitting the data helps to mitigate any bias that may arise from
systematic sampling.
Generalization: It encourages the model to learn general patterns rather than memorizing specific
data points.
Simplicity: The method is easy to implement and understand, making it a popular choice among
data scientists.
Python Example:
Python
import numpy as np
from sklearn.model_selection import train_test_split
# Example dataset
X = np.arange(100).reshape(-1, 1) # Feature data: 100 samples, single feature
y = np.random.randint(0, 2, size=(100,)) # Binary target labels (0 or 1)
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
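# Report the split sizes described in the output note below
print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])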
Output
The training set consists of 70 samples (70% of the total 100), and the test set consists of 30 samples
(30%). This split is done randomly, ensuring that both sets have a diverse representation of the data.
Visual Representation: You can imagine the dataset as a large box filled with various colored
balls. Randomly selecting some balls for training means you might pick red, blue, and green balls,
ensuring a mix that represents the whole box, which would help your model learn effectively.
1.2 Stratified Splitting
Explanation:
Stratified Splitting is particularly valuable when dealing with imbalanced datasets, where some
classes are significantly more prevalent than others. This method ensures that each subset maintains
the original distribution of classes.
For instance, in a dataset where 90% of the samples belong to class 0 and only 10% belong to class
1, a stratified split would ensure that both the training and testing sets retain this class distribution.
Advantages:
Maintains Class Distribution: Ensures that the model is trained and tested on a representative
sample of each class, which is crucial for fair evaluation.
Reduces Bias: By representing all classes in both training and testing sets, stratified splitting
reduces the chances of bias that could arise from class imbalances.
Better Performance Metrics: Models trained on stratified samples are often better at predicting
minority classes, leading to more reliable performance metrics.
Python Example:
Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Stratified splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
Output
The training set maintains approximately 90% of class 0 and 10% of class 1, mirroring the original
dataset's distribution. This allows the model to learn from both classes effectively and reduces the
risk of misclassification during testing.
Visual Representation: Consider a pie chart representing the class distribution of your dataset.
Stratified splitting ensures that each segment of the pie (representing each class) is proportionally
represented in both the training and testing datasets. This is crucial for models that need to
recognize patterns across all classes.
1.3 Time Series Splitting
Explanation:
Time Series Splitting addresses the unique challenges presented by time-dependent data. Unlike
other datasets, the order of observations in time series is critical. Randomly shuffling this data can
lead to future values being used to predict past values, a situation known as "data leakage."
In this context, Time Series Splitting allows you to create train-test splits that respect the
chronological order of the data, ensuring that predictions for future time points are only based on
past observations.
Advantages:
Preserves Temporal Structure: Ensures that the model learns from past data to predict future
values, which is essential for any time-dependent application.
Avoids Data Leakage: Prevents the model from having access to future information during
training, leading to more realistic evaluations.
Supports Time-Dependent Validation: Facilitates validation techniques that account for trends
and seasonality in time series data.
Python Example:
Python
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
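# A sketch consistent with the output below; the series values are random,
# so the exact numbers will differ between runs
dates = pd.date_range("2020-01-01", periods=12, freq="D")
df = pd.DataFrame({"Value": np.random.randn(12)}, index=dates)

# Each split trains on all samples up to a point in time and tests on the next ones
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(df):
    print("TRAIN data:\n", df.iloc[train_index])
    print("TEST data:\n", df.iloc[test_index])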
Output
TEST data:
Value
2020-01-10 -0.362647
2020-01-11 0.288111
Each iteration of TimeSeriesSplit produces a different train-test split where the training set contains
all samples up to a certain point in time, while the test set consists of the subsequent time points.
This output illustrates the first split: the first 10 samples are used for training, and the following 2
samples are reserved for testing. This method of splitting preserves the temporal nature of the data
and prevents leakage.
Visual Representation: Think of a movie trailer. You cannot see the end of the movie before it is
released. Similarly, in time series analysis, your model can only learn from past data (the trailer)
to make predictions about future events (the full movie). This split method mimics that
chronological storytelling.
Conclusion
Understanding the appropriate data splitting techniques is crucial for building effective machine
learning models. Each method serves a specific purpose depending on the nature of the dataset:
Random Splitting is a great starting point for balanced datasets and provides a solid foundation
for understanding how models generalize.
Stratified Splitting is essential for datasets with class imbalances, ensuring fair representation and
evaluation of all classes.
Time Series Splitting is critical for any dataset where time plays a significant role, allowing for
the correct modeling of trends and avoiding leakage from future data.
By mastering these techniques, you can significantly enhance the robustness and reliability of your
machine learning models, leading to better performance and more accurate predictions.
2. Cross-Validation
Cross-validation is a statistical method used to estimate the skill of machine learning models. It is
crucial for determining how the results of a statistical analysis will generalize to an independent
dataset. By using cross-validation, we can effectively detect overfitting and ensure that the model
has the ability to generalize to new, unseen data.
K-Fold Cross-Validation
Concept:
K-Fold Cross-Validation divides the dataset into 'k' equal (or nearly equal) parts, known as folds.
The process involves the following steps:
1. Shuffling: The dataset is shuffled randomly to ensure that each fold is representative of the entire
dataset.
2. Splitting: The data is then split into 'k' parts.
3. Training and Validation: For each fold:
o The model is trained on 'k-1' folds and validated on the remaining fold.
o This is repeated 'k' times, each time using a different fold as the validation set.
4. Performance Metric Calculation: After completing the folds, the performance metrics (like
accuracy, precision, etc.) are averaged to provide a more robust estimate of the model's
performance.
Python Example:
Python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Load a dataset
X, y = load_iris(return_X_y=True)

# The model choice is an assumption; any classifier works here
model = LogisticRegression(max_iter=200)

# K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print("K-Fold scores:", scores)
print("Mean score:", scores.mean())
Explanation:
In this example, the Iris dataset is used, which consists of 150 samples with 4 features. The dataset
is split into 5 folds.
The cross_val_score function automatically handles the training and validation for each fold. The
individual scores (accuracy in this case) for each fold are printed, along with the mean score, which
gives an overall performance measure of the model.
Output Explanation:
The output would display an array of accuracy scores corresponding to each of the 5 folds and the
mean accuracy score, which reflects the model's ability to generalize across different subsets of
data.
Leave-One-Out Cross-Validation (LOOCV)
Concept:
Leave-One-Out Cross-Validation (LOOCV) is an extreme case of k-fold cross-validation where
the number of folds equals the number of samples in the dataset. This means that for each iteration,
one sample is used as the validation set while the rest are used for training. This method ensures
that the model is trained on nearly the entire dataset.
Advantages:
Maximum Training Data: Each model is trained on almost all available data, which can be
advantageous, especially in small datasets.
Comprehensive Evaluation: Provides a robust measure of model performance as each sample is
used for validation.
Disadvantages:
Computational Cost: As the number of samples increases, the number of iterations increases
linearly, leading to high computational costs.
Variance: Since each fold consists of just one sample, the variance of the estimate may be higher
compared to k-fold cross-validation.
Python Example:
Python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Load a dataset
X, y = load_wine(return_X_y=True)

model = LogisticRegression(max_iter=200)

# Leave-One-Out Cross-Validation
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print("Total LOOCV scores:", scores.sum())
print("Mean score:", scores.mean())
Explanation:
In this example, the Wine dataset (with 178 samples) is evaluated using LOOCV. The LeaveOneOut
object generates indices for training and validation where each sample serves as a validation case
once.
The model's performance is measured by the accuracy for each left-out sample, and the mean score
provides a comprehensive evaluation of the model.
Output Explanation:
The output will show the total accuracy over all iterations (which should equal the number of
samples if they all predict correctly) and the mean accuracy, giving insight into how well the model
performs across all single-sample validations.
Explanation:
Total LOOCV Scores: This value counts the samples predicted correctly when each of the 178 samples was left out once for validation; a total close to 178 means the model was right for nearly every sample in the dataset.
Mean Score: The mean accuracy of approximately 0.994 indicates that the model achieved
about 99.4% accuracy across all iterations. LOOCV, while computationally expensive,
provides a nearly unbiased estimate of the model's performance, especially useful when
working with small datasets.
Stratified K-Fold Cross-Validation
Concept:
Stratified K-Fold Cross-Validation ensures that each fold has approximately the same proportion
of class labels as the complete dataset. This is particularly important when dealing with imbalanced
datasets, as it ensures that each class is well represented in both training and validation sets.
Advantages:
Class Distribution Maintenance: Each fold retains the class distribution, preventing bias in model
evaluation that might occur in standard k-fold.
Improved Generalization: By preserving the distribution, the model is trained and validated under
conditions more reflective of the actual dataset.
Python Example:
Python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic dataset with a 90:10 class imbalance
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

model = LogisticRegression(max_iter=200)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
print("Stratified K-Fold Scores:", scores)
print("Mean Score:", sum(scores) / len(scores))
Explanation:
Here, a synthetic dataset is generated with a 90:10 class imbalance, reflecting a common scenario
in real-world datasets.
StratifiedKFold ensures that each of the 5 folds maintains the same 90:10 ratio. The model is trained
on the training folds and validated on the test folds.
Output Explanation:
The output will show individual scores for each fold, demonstrating how the model performs across
balanced class distributions. The mean score gives an overall sense of model effectiveness while
respecting class distributions.
Output:
Stratified K-Fold Scores: [0.96, 0.96, 0.95, 0.93, 0.92]
Mean Score: 0.944
Explanation:
Stratified K-Fold Scores: This array represents the accuracy scores from each of the five
folds. The scores indicate how well the model performed on the test data for each fold,
considering the class distribution:
o Fold 1: 0.96
o Fold 2: 0.96
o Fold 3: 0.95
o Fold 4: 0.93
o Fold 5: 0.92
Mean Score: The mean accuracy score of approximately 0.94 reflects the model's
performance, indicating that it correctly classified about 94% of the samples across all folds
while maintaining the original class distribution. This is particularly beneficial in scenarios
where certain classes are underrepresented, ensuring that each class is adequately
represented in training and validation sets.
Summary of Cross-Validation Techniques
Technique: K-Fold Cross-Validation
Description: Divides the dataset into 'k' folds, with each fold used once as a validation set while the rest are used for training. The model is trained and validated 'k' times, and the scores are averaged to provide a robust estimate of model performance.
Use Case: General-purpose model evaluation.
Technique: Stratified K-Fold
Description: Maintains the same class distribution in each fold as in the entire dataset, ensuring that each fold is representative of the class distribution. This is particularly beneficial for imbalanced datasets.
Use Case: Imbalanced datasets where class representation is crucial.
Final Remarks
Choosing among these strategies depends on the data: K-Fold suits most problems, Stratified K-Fold is preferred when classes are imbalanced, and LOOCV is reserved for small datasets where its computational cost is acceptable.
1. Model Evaluation Metrics
Model evaluation metrics are vital tools that allow practitioners to assess the effectiveness of their
machine learning models. They provide quantitative measures that inform decisions regarding
model selection, tuning, and improvement. Understanding these metrics is crucial for ensuring that
models generalize well and provide reliable predictions.
Assessing Model Performance: Evaluation metrics help quantify how well a model
performs, allowing for comparisons across different algorithms. For instance, when
experimenting with various models (like decision trees, logistic regression, and SVM),
metrics help identify which model performs best under the given circumstances. An
effective evaluation strategy ensures that practitioners can make informed choices rather
than relying on intuition.
Model Generalization: These metrics provide insights into how well the model will
perform on unseen data. For example, if a model has high accuracy on the training set but
low accuracy on the validation set, it may be overfitting, meaning it learns noise and details
in the training data instead of the underlying pattern. This is critical in real-world
applications where the model is deployed in dynamic environments with new, unseen data.
Guiding Improvements: By pinpointing strengths and weaknesses, evaluation metrics
inform the feature engineering process, hyperparameter tuning, and overall model design.
For instance, if precision is low, it may indicate the need for better feature selection or a
different model that focuses on reducing false positives. Evaluation metrics not only help
diagnose problems but also guide the iterative process of improving models through
experimentation.
2. Classification Metrics
Classification metrics are designed for evaluating models that predict categorical outcomes. These
metrics are crucial in applications such as spam detection, disease diagnosis, and image
classification.
2.1 Accuracy
Accuracy: This metric calculates the ratio of correctly predicted instances to the total instances in the dataset:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example:
Suppose you have a dataset of 100 samples, where 90 are labeled as "No Disease" (negative) and 10 as "Disease" (positive). A model predicts 85 of the negatives correctly and misclassifies 5 of the positives. On one consistent reading of these figures (TN = 85, FP = 5, FN = 5, TP = 5), the confusion matrix would look like this:
                  Predicted Negative   Predicted Positive
Actual Negative   TN = 85              FP = 5
Actual Positive   FN = 5               TP = 5
o Calculating Accuracy:
Accuracy = (TP + TN) / Total = (5 + 85) / 100 = 0.90
Despite 90% accuracy, the model misses half of the actual positive cases.
A high misclassification rate indicates that the model may require improvement,
particularly in classifying the minority class.
2.2 Precision and Recall
Precision: This metric indicates how many of the predicted positive instances are actually positive. High precision means that the model is reliable in its positive predictions:
Precision = TP / (TP + FP)
Example:
A precision of 100% suggests that whenever the model predicts "Disease," it is always
correct. This is particularly important in scenarios such as disease detection, where false
positives could lead to unnecessary anxiety or treatments.
Recall: Also known as sensitivity or true positive rate, recall measures how many of the actual positive instances were correctly predicted. A high recall means the model captures most of the positive instances:
Recall = TP / (TP + FN)
Example:
A recall of 50% suggests that the model missed half of the actual positive cases, which
could be unacceptable in certain medical scenarios where detecting all positive cases is
critical.
F1 Score: This metric provides a balance between precision and recall, especially useful for imbalanced classes. It is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Example:
The F1 score is particularly valuable in applications like fraud detection, where it’s crucial
to minimize both false positives and false negatives.
2.3 ROC Curve and AUC
ROC Curve: The Receiver Operating Characteristic (ROC) curve illustrates the trade-off between true positive rate (sensitivity) and false positive rate at various threshold settings.
o True Positive Rate (TPR): The proportion of actual positives correctly identified by the model: TPR = TP / (TP + FN)
o False Positive Rate (FPR): The proportion of actual negatives that are incorrectly identified as positives: FPR = FP / (FP + TN)
AUC: The area under the ROC curve (AUC) quantifies the overall ability of the model to
discriminate between positive and negative classes.
o An AUC of 1 indicates a perfect model, while an AUC of 0.5 suggests no discriminative
ability (random guessing).
o An AUC value above 0.8 is generally considered excellent, while an AUC below 0.6
indicates a poor model. This metric is useful because it provides an aggregate measure of
performance across all classification thresholds, giving a more comprehensive picture of
model performance than accuracy alone.
Example:
Suppose a model's ROC curve plots points at various threshold levels, producing an AUC
of 0.85. This indicates a good ability to distinguish between the classes, making it a reliable
choice for deployment.
3. Regression Metrics
Regression metrics evaluate models used for predicting continuous outcomes, such as house
prices, stock prices, or temperature forecasts. These metrics provide insights into how well the
model predicts the actual values.
3.1 Mean Absolute Error (MAE)
Definition: Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values. It provides a straightforward interpretation of prediction accuracy:
MAE = (1/n) × Σ |y_i − ŷ_i|
Example: For actual values [3, 3, 4] and predictions [2, 3, 5] (illustrative numbers), MAE = (1 + 0 + 1) / 3 ≈ 0.67.
A lower MAE indicates a better fit of the model to the data, which is crucial in applications
like predictive maintenance, where accurate predictions can lead to cost savings and
efficiency improvements.
3.2 Mean Squared Error (MSE)
Definition: Mean Squared Error (MSE) quantifies the average squared difference between predicted and actual values, thus emphasizing larger errors:
MSE = (1/n) × Σ (y_i − ŷ_i)²
Example: For the same actual values [3, 3, 4] and predictions [2, 3, 6] (illustrative numbers), the errors are 1, 0, and 2, so MSE = (1 + 0 + 4) / 3 ≈ 1.67 while MAE is 1.0; the squared term amplifies the larger error.
The MSE is particularly sensitive to outliers due to the squaring of the errors, making it a
crucial consideration in scenarios where larger errors are significantly more problematic
than smaller ones, such as in financial forecasting.
3.3 R² and Adjusted R²
R² Score: The R-squared score indicates the proportion of variance in the dependent variable that is predictable from the independent variables. It typically ranges from 0 to 1, with higher values indicating a better fit:
R² = 1 − (SS_res / SS_tot)
Example: If the total sum of squares (SS_tot) is 100 and the residual sum of squares (SS_res) is 20:
R² = 1 − (20 / 100) = 0.8
This indicates that 80% of the variance in the dependent variable can be explained by the
model.
Adjusted R²: Unlike R², which can increase with the addition of predictors, adjusted R² accounts for the number of predictors in the model and provides a more accurate measure when comparing models with different numbers of predictors:
Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1)
where n is the number of samples and p is the number of predictors.
The adjusted R² can decrease if the new predictor does not contribute to improving the
model's explanatory power, which helps prevent overfitting by penalizing excessive use of
irrelevant predictors.
Summary
Understanding model evaluation metrics is crucial in machine learning, as they provide insights
into model performance, generalization, and areas for improvement. Different metrics are
applicable depending on whether the problem is classification or regression, and they serve various
purposes in evaluating model effectiveness. By effectively leveraging these metrics, practitioners
can ensure that their models are robust, accurate, and suitable for deployment in real-world
scenarios.
Choosing the Right Metrics
In the realm of machine learning, selecting the appropriate evaluation metric is critical for
accurately assessing model performance. Different scenarios demand different metrics to ensure
that the model aligns with business objectives and addresses the nuances of the data effectively.
Choosing the right metric often hinges on the characteristics of the dataset and the specific goals
of the model. Here are some common scenarios:
Imbalanced Classes: In classification problems where classes are imbalanced (e.g., fraud
detection, disease diagnosis), accuracy may not be a reliable measure. For example, if 95%
of a dataset consists of "No Fraud" instances, a model that predicts every instance as "No
Fraud" would achieve 95% accuracy. Instead, metrics like precision, recall, and the F1
score provide better insights into model performance:
o Precision: Important when the cost of false positives is high (e.g., identifying fraud where
wrongful accusations can lead to reputational damage).
o Recall: Crucial when the cost of false negatives is high (e.g., in medical diagnoses where
missing a positive case can lead to severe consequences).
Regression Tasks: When evaluating regression models, the choice between metrics like
Mean Absolute Error (MAE) and Mean Squared Error (MSE) depends on the nature of the
data and the importance of outlier management:
o MAE: Useful when all errors are equally important, as it provides a straightforward
interpretation of average error.
o MSE: More suitable when larger errors are more significant, as it squares the error terms,
heavily penalizing larger discrepancies.
Time-Sensitive Applications: In scenarios where predictions need to be made in real-time
(e.g., stock price predictions), the speed of evaluation may also play a role. Metrics that are
quick to compute can be prioritized in such cases.
Business Objectives: Always align the chosen metrics with business goals. For instance,
in a recommendation system, optimizing for user engagement metrics (like click-through
rates) may be more important than overall accuracy.
Understanding the trade-offs between different metrics is crucial for fine-tuning model
performance. Here are some common examples:
Precision vs. Recall: There is often a trade-off between precision and recall. Increasing
one may decrease the other. This trade-off can be visualized using the Precision-Recall
curve.
o Scenario: In spam detection, if a model is tuned to maximize precision, it may classify
fewer emails as spam, potentially allowing some spam emails to reach the inbox (lower
recall). Conversely, maximizing recall may result in many legitimate emails being
misclassified as spam (lower precision).
MSE vs. MAE: MSE and MAE present a trade-off in terms of error sensitivity. While
MSE can highlight models that perform poorly on outliers (due to squaring), MAE treats
all errors equally.
o Scenario: In house price prediction, if large price discrepancies are particularly
problematic (e.g., significant losses in revenue), MSE might be preferable. However, if all
errors should be treated uniformly (e.g., predicting customer purchases), MAE may be
more suitable.
ROC Curve and AUC: The trade-off between true positive rate (sensitivity) and false
positive rate can also be visualized through the ROC curve. A higher area under the curve
(AUC) indicates a better-performing model, but the chosen threshold for classification may
significantly impact precision and recall.
Practical Application
The following example shows how to split a dataset into training, validation, and test sets:
Python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# Sample dataset
data = {
'feature1': np.random.rand(100),
'feature2': np.random.rand(100),
'target': np.random.choice([0, 1], size=100) # Binary target
}
df = pd.DataFrame(data)
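# A sketch of the split described in the output note below: 70% train,
# then the remaining 30% divided evenly into validation and test sets
X = df.drop(columns='target')
y = df['target']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training set size:", X_train.shape[0])
print("Validation set size:", X_val.shape[0])
print("Test set size:", X_test.shape[0])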
Output Explanation:
The output will display the sizes of the training, validation, and test sets. For example:
Training set size: 70
Validation set size: 15
Test set size: 15
This indicates that the dataset has been split into 70% training data, 15% validation data, and 15% test data.
Once the model is trained, we can evaluate it using various metrics. Here’s how to calculate
common classification metrics:
Python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Train a simple model on the training split (the classifier choice is illustrative)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_val)
# Evaluation metrics
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred))
print("\nClassification Report:\n", classification_report(y_val, y_pred))
Output Explanation: The output will consist of a confusion matrix and a classification report
detailing precision, recall, F1 score, and support for each class.
Confusion Matrix:
[[TN FP]
[FN TP]]
Classification Report:
              precision    recall  f1-score   support
    accuracy                           0.78        15
   macro avg       0.78      0.75      0.76        15
weighted avg       0.79      0.78      0.77        15
This report provides a comprehensive view of the model's performance, indicating that it has a
decent balance between precision and recall for both classes.
To illustrate the entire process from data splitting to model evaluation, let’s work through a case
study of predicting customer churn in a telecommunications dataset.
Python
import pandas as pd

df = pd.read_csv('customer_churn.csv')
print(df.head())
Output (Example):
Explanation: The output displays the first five rows of the dataset, showing columns such as
CustomerID, Tenure, MonthlyCharges, TotalCharges, and the target variable Churn (where 0 indicates
the customer did not churn, and 1 indicates they did).
X = df.drop(columns='Churn')
y = df['Churn']

# Split 70/15/15 into train, validation, and test sets (same pattern as above)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training set size:", X_train.shape[0])
print("Validation set size:", X_val.shape[0])
print("Test set size:", X_test.shape[0])
Output (Example):
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
Output: This step does not produce console output but indicates that the model has been trained
on the training dataset.
y_val_pred = model.predict(X_val)
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred))
print("Classification Report:\n", classification_report(y_val, y_val_pred))
Output (Example):
Confusion Matrix:
[[85 5]
[10 50]]
Classification Report:
              precision    recall  f1-score   support
           0       0.89      0.94      0.92        90
           1       0.91      0.83      0.87        60
    accuracy                           0.90       150
Explanation:
o Confusion Matrix:
True Negatives (TN): 85 (correctly predicted as No Churn)
False Positives (FP): 5 (incorrectly predicted as Churn)
False Negatives (FN): 10 (incorrectly predicted as No Churn)
True Positives (TP): 50 (correctly predicted as Churn)
o Classification Report:
Precision:
Class 0 (No Churn): 89%
Class 1 (Churn): 91%
Recall:
Class 0 (No Churn): 94%
Class 1 (Churn): 83%
F1-Score:
Balances precision and recall, indicating how well the model is
performing.
Overall Accuracy: 90% (135 of 150 validation samples) suggests that the model is generally effective.
y_test_pred = model.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))
print("Classification Report:\n", classification_report(y_test, y_test_pred))
Output (Example):
Confusion Matrix:
[[42  8]
 [ 4 46]]
Explanation:
o Confusion Matrix:
True Negatives (TN): 42 (correctly predicted as No Churn)
False Positives (FP): 8 (incorrectly predicted as Churn)
False Negatives (FN): 4 (incorrectly predicted as No Churn)
True Positives (TP): 46 (correctly predicted as Churn)
o Classification Report:
Precision:
Class 0 (No Churn): 91%
Class 1 (Churn): 85%
Recall:
Class 0 (No Churn): 84%
Class 1 (Churn): 92%
F1-Score: Close to 88% for both classes, indicating a balanced performance.
Overall Accuracy: 88% on the test set suggests good generalization to unseen
data.
Summary of Outputs
Model Training: Successfully trained on the training set.
Validation Metrics: Provide insights into model tuning and possible overfitting.
Test Metrics: Validate model performance on unseen data, ensuring reliability in real-world
applications.
The outputs of each step offer valuable information about model performance, allowing data
scientists to iterate and improve their models effectively.