HHS ML Assignment
TABLE OF CONTENTS
TABLE OF FIGURES
P3 Investigate a range of machine learning algorithms and how these algorithms solve the learning
problems
P4 Demonstrate the efficiency of these algorithms by implementing them using an appropriate programming
language or machine learning tool
P5 Prepare training and test data sets in order to implement a machine learning solution for an appropriate
learning problem
P6 Implement a machine learning solution with a suitable machine learning algorithm and demonstrate the
outcome
P8 Analyse the result of the application to determine the effectiveness of the algorithm
REFERENCES
Table of figures
FIGURE 1: SUPERVISED LEARNING
FIGURE 2: UNSUPERVISED LEARNING
FIGURE 3: SEMI-SUPERVISED LEARNING
FIGURE 4: REINFORCEMENT LEARNING
FIGURE 5: SUPERVISED LEARNING: LINEAR REGRESSION
FIGURE 6: UNSUPERVISED LEARNING: K-MEANS CLUSTERING
FIGURE 7: REINFORCEMENT LEARNING: Q LEARNING
FIGURE 8: REINFORCEMENT LEARNING: Q LEARNING
FIGURE 9: LEARNING CURVE
LO1 Analyse the theoretical foundation of machine
learning to determine how an intelligent machine
works
P1 Analyse the types of learning problems.
Supervised Learning Problems
“Supervised learning is the process of learning a mapping from input data to
output labels using a labeled dataset.
Examples include classification tasks in which the goal is to predict a category
label from given input data (e.g., spam detection, image classification). Regression
tasks entail forecasting a continuous variable (for example, property prices or
market prices)” (S.Gillis, 2024).
Overfitting (the model learns noise from the data), underfitting (the model is too
simplistic to grasp the underlying patterns), and data imbalance (where one class
dominates the dataset) are all examples of supervised learning challenges.
https://www.geeksforgeeks.org/supervised-unsupervised-learning/
Unsupervised Learning Problems
“Unsupervised learning works to identify underlying patterns or structures in
unlabeled data.
Examples include clustering algorithms, which group comparable data points
together without the use of predefined labels. Dimensionality reduction strategies
seek to reduce the number of features while retaining important information in
the data (e.g., Principal Component Analysis)” (S.Gillis, 2024).
The challenges include choosing a suitable number of clusters, dealing with
high-dimensional data, and interpreting the meaning of the resulting clusters or
reduced dimensions.
Figure 2: Unsupervised Learning
https://databasetown.com/unsupervised-learning-types-applications/
Semi-supervised Learning Problems
“Semi-supervised learning uses labeled and unlabeled data to improve model
performance.
Examples include anomaly detection, in which the majority of the data is
unlabeled but anomalies (such as fraud) are tagged. Semi-supervised learning is
also effective in situations when labeled data is rare but unlabeled data is
abundant.
Challenges include maximizing the use of available labeled and unlabeled data,
guaranteeing consistency between labeled and unlabeled samples, and preventing
the model from overfitting to labeled data” (Bergmann, 2023).
https://medium.datadriveninvestor.com/the-ultimate-beginner-guide-of-semi-supervised-learning-3bd11cb19835
Reinforcement Learning Problems
Reinforcement learning is the process of learning to make successive decisions
through interactions with the environment.
Examples include playing games (e.g., AlphaGo), controlling robotics, and driving
autonomously.
The challenges include the exploration-exploitation trade-off (balancing the
discovery of new actions against the exploitation of actions already known to pay
off), reward shaping (designing appropriate reward functions), and dealing with
sparse rewards.
Regression Algorithms: These algorithms are used when the target variable is
continuous. Examples include:
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. Polynomial Regression
5. Support Vector Regression (SVR)
6. Decision Tree Regression
Clustering Algorithms: These algorithms are used to partition data into clusters
based on similarity. Examples include:
1. K-Means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
4. Gaussian Mixture Models (GMM)
Dimensionality Reduction Algorithms: These algorithms reduce the number of
features in the data while preserving essential information. Examples include:
1. Principal Component Analysis (PCA)
2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
3. Linear Discriminant Analysis (LDA)
4. Autoencoders
Policy-based Methods: These algorithms directly learn the optimal policy without
explicitly estimating value functions. Examples include:
1. Policy Gradient Methods
2. Proximal Policy Optimization (PPO)
Model-based Methods: These algorithms learn a model of the environment and use
it for planning and decision-making. Examples include:
1. Model Predictive Control (MPC)
2. Dyna-Q
Principal Component Analysis (PCA) reduces the dimensionality of data by
identifying orthogonal linear combinations of features that capture the most
variance. Its applications include feature extraction and data visualization.
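As an illustration of this idea, the following is a minimal sketch using scikit-learn's PCA class. The synthetic data, its shape (200 samples, 5 features), and the choice of two components are assumptions made for the example and do not come from the assignment's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 200 samples with 5 features,
# built so that most of the variance lies along two latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Keep the two orthogonal directions that capture the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The explained variance ratio shows how much of the original variance each retained component accounts for, which is one way to judge how many components are worth keeping.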
This code generates synthetic data points in a two-dimensional space and applies
K-Means clustering to partition the data into three clusters. It then prints the
coordinates of the centroids of the clusters, which represent the center of each
cluster.
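A minimal sketch consistent with this description, assuming scikit-learn's make_blobs and KMeans, might look as follows. The sample count, cluster spread, and random seed are illustrative choices rather than values taken from the original listing.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic 2-D points grouped around three centers
# (sample count, spread, and seed are assumptions for illustration).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2,
                  cluster_std=0.6, random_state=42)

# Partition the points into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Each row is the (x, y) coordinate of one cluster centroid.
print("Cluster centroids:")
print(kmeans.cluster_centers_)
```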
Reinforcement Learning: Q-Learning
Figure 8: Reinforcement learning: Q learning
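For readers without access to the figure, the following is a small sketch of tabular Q-learning. The corridor environment, its reward, and the hyperparameters (learning rate, discount factor, exploration rate, episode count) are all hypothetical choices for illustration and are not taken from the original implementation.

```python
import numpy as np

# Hypothetical toy environment: a corridor of 5 states (0..4).
# Actions: 0 = move left, 1 = move right.
# Reaching state 4 yields a reward of 1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def train_q_table(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy selection: explore with probability epsilon,
            # otherwise pick a best-known action (ties broken at random).
            if rng.random() < epsilon:
                action = int(rng.integers(N_ACTIONS))
            else:
                best = np.flatnonzero(q[state] == q[state].max())
                action = int(rng.choice(best))
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward the reward plus the
            # discounted value of the best action in the next state.
            future = 0.0 if done else gamma * q[next_state].max()
            q[state, action] += alpha * (reward + future - q[state, action])
            state = next_state
    return q

q_table = train_q_table()
greedy_policy = q_table.argmax(axis=1)
print("Greedy action per non-terminal state:", greedy_policy[:GOAL])
```

The epsilon-greedy rule in the sketch is one simple way of handling the exploration-exploitation trade-off mentioned earlier.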
After initializing a Random Forest classifier with 100 decision trees, we train it on
the training data. The Random Forest technique was chosen for its capacity to
handle high-dimensional data and its strong performance on classification tasks.
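A minimal sketch of this step is shown below. So that it runs on its own, synthetic stand-in data replaces the letter dataset; the data generator settings, the 80/20 split, and the random seed are assumptions for illustration. The name rf_classifier mirrors the evaluation code in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption): 1,000 samples with 16 features and 4 classes.
X, y = make_classification(n_samples=1000, n_features=16, n_informative=10,
                           n_classes=4, random_state=42)

# Hold out 20% of the samples for testing (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Initialize a Random Forest with 100 decision trees and train it.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

print("Number of trees:", len(rf_classifier.estimators_))
```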
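from sklearn.metrics import accuracy_score

# Predict labels for the held-out test set
y_pred = rf_classifier.predict(X_test)
# Calculate accuracy as the fraction of correct predictions
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)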
To evaluate the model, we generate predictions on the test set and compare them to
the true labels. The accuracy score indicates how well the model generalizes to
previously unseen data.
To gain insights into the model’s performance and potential for improvement, we
define a function to plot learning curves. These curves depict the model’s accuracy
on both the training and cross-validation sets across different training set sizes. By
visualizing the learning curves, we can assess whether the model suffers from
issues like overfitting or underfitting.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curves(estimator, title, X, y, ylim=None, cv=None,
                         n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure(figsize=(12, 6))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Accuracy")
    # Score the estimator with cross-validation at each training-set size
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    # Average over the cross-validation folds and measure the spread
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shaded bands show one standard deviation around each mean curve
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
Data Preparation:
The first step is to load the dataset and prepare it for modeling. In this case, the
dataset consists of 20,000 letters, each described by 16 features.
The dataset is loaded into a pandas DataFrame, and appropriate column names are
assigned.
The data is then split into features (X) and target labels (y), where X contains the
features describing each letter, and y contains the corresponding letter labels.
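The preparation steps above can be sketched as follows. The column names (letter, feature_1 to feature_16), the helper load_letter_data, the four sample rows, and the 75/25 split are hypothetical and used only to keep the example self-contained; in practice the function would be given the path to the real data file.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical column names: one label column followed by 16 feature columns.
COLUMNS = ["letter"] + [f"feature_{i}" for i in range(1, 17)]

def load_letter_data(source):
    """Load a header-less CSV of letters into a DataFrame with named columns."""
    return pd.read_csv(source, header=None, names=COLUMNS)

# Made-up rows in the same layout (label followed by 16 integer features);
# these are illustrative values, not records from the real dataset.
sample_csv = io.StringIO(
    "A,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0\n"
    "B,4,7,5,5,4,6,7,9,7,7,6,6,2,8,7,10\n"
    "C,3,5,4,7,2,5,8,6,8,12,9,13,1,9,4,9\n"
    "D,2,4,3,3,2,7,7,5,6,7,6,8,2,8,3,7\n"
)
df = load_letter_data(sample_csv)

# Split into features (X) and target labels (y).
X = df.drop(columns="letter")
y = df["letter"]

# Hold out part of the data for testing (the 25% ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X.shape, y.tolist())
```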
Model Selection:
For this letter recognition challenge, an appropriate machine learning algorithm is
selected. In this situation, a Random Forest classifier is used.
Random Forest is a powerful ensemble learning method that performs well in
classification tasks, including those with high-dimensional data. It combines
numerous decision trees, each trained on a random sample of the data, to improve
generalization while reducing overfitting.
Model Training:
The Random Forest classifier is initialized with hyperparameters such as the
number of decision trees (100 trees in this case) and a random seed for
reproducibility.
The classifier is then trained on the training data, where it learns patterns and
relationships between the input features and the target labels (letters).
Model Evaluation:
Following training, the classifier’s performance is assessed using the test set.
The trained classifier is applied to the test set to make predictions.
The model’s accuracy is calculated by comparing the predicted labels to the actual
labels from the test set.
Outcome Demonstration:
The outcome of the machine learning solution is demonstrated through various
means, such as printing the test accuracy and visualizing learning curves.
The test accuracy provides an overall measure of how well the model performs on
unseen data.
Learning curves visualize the model’s performance on both the training and cross-
validation sets across different training set sizes. They help assess whether the
model suffers from issues like overfitting or underfitting and provide insights into
its generalization capabilities.
By following this approach and utilizing the Random Forest classifier, we can
effectively implement a machine-learning solution for the letter recognition
problem. The outcome is demonstrated through quantitative metrics like accuracy
and visualizations like learning curves, providing insights into the model's
performance and behavior.
LO4 Evaluate the outcome or the result of the
application to determine the effectiveness of the
learning algorithm used in the application.
P7 Discuss whether the result is balanced, underfitting or
overfitting.
The result figure reports the accuracy of the Random Forest classifier on both the
training set and the cross-validation set. Accuracy is the fraction of the model's
predictions that are correct. In this case, the training score and
cross-validation score are both 1.000, which is the maximum accuracy. This
indicates that the model accurately predicts the target variable for both the
training examples and the cross-validation examples.
The learning curve further supports the conclusion that the model is well fitted.
It shows how the model performs as the number of training samples grows. In this
case, performance improves as the number of training samples grows, and the
difference between the training and cross-validation scores is minimal, showing
that the model is neither overfitting nor underfitting the data.
Overall, the training score and cross-validation score are both high, indicating
that the model performs well and is able to generalize to new data.
The learning curve also supports this conclusion, since it shows that the model's
performance improves as the number of training examples increases and the gap
between the training score and the cross-validation score is small.
The effectiveness of the Random Forest algorithm for the letter recognition
problem can be analyzed based on the following aspects:
Accuracy: The Random Forest classifier’s test accuracy is a key sign of its
usefulness. A higher accuracy indicates that the model performs better at
recognizing letters based on the features provided.
Learning Curves: Visualizing the learning curves provides insight into the
model’s performance as the training set size grows. The key observations from the
learning curves are:
Training Score: The accuracy of the model on the training set. A high training
score means that the model fits the training data well.
Cross-Validation Score: The accuracy of the model on the cross-validation set. It
provides an estimate of the model’s performance on unseen data.
The gap between the two scores indicates the quality of the fit: a small gap with
high scores suggests a good fit, a large gap with a high training score suggests
overfitting, and two low scores suggest underfitting.
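The comparison of training and cross-validation scores can also be made numerically. The sketch below uses scikit-learn's learning_curve on synthetic stand-in data; the dataset, the 50-tree forest, the 5-fold cross-validation, and the training sizes are assumptions for illustration, not the settings used for the results in this report.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Stand-in data (an assumption): 600 samples with 16 features.
X, y = make_classification(n_samples=600, n_features=16, n_informative=8,
                           random_state=0)

# Score the model at four training-set sizes using 5-fold cross-validation.
sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 4))

# Average over the folds, then report the gap between the two curves.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
for n, tr, cv in zip(sizes, train_mean, cv_mean):
    print(f"{int(n):4d} samples: train={tr:.3f}  cv={cv:.3f}  gap={tr - cv:.3f}")
```

A gap that shrinks as the training set grows points toward a model that generalizes; a gap that stays large points toward overfitting.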
Based on these aspects, we can analyze the effectiveness of the Random Forest
algorithm for letter recognition:
High Accuracy: A high test accuracy suggests that the Random Forest model
effectively captures the underlying patterns in the data and can generalize to
previously unseen letters.
Converging Learning Curves: If the training and cross-validation scores are
converging, it indicates that the model is not overfitting and can effectively
generalize to new data.
Consistent Prediction Performance: If the model correctly predicts letters of varied
forms and sizes in the test set, it illustrates the Random Forest algorithm's
robustness and usefulness.
By considering these factors and analyzing the results obtained from the
application of the Random Forest algorithm to the letter recognition problem, we
can determine the overall effectiveness of the algorithm for this specific task.
References
Bergmann, D., 2023. IBM. [Online]
Available at: https://www.ibm.com/topics/semi-supervised-learning
[Accessed 20 April 2024].
S.Gillis, A., 2024. TechTarget. [Online]
Available at:
https://www.techtarget.com/searchenterpriseai/definition/supervised-learning
[Accessed 20 April 2024].
S.Gillis, A., 2024. TechTarget. [Online]
Available at:
https://www.techtarget.com/searchenterpriseai/definition/unsupervised-learning
[Accessed April 2024].