HHS ML Assignment

Uploaded by

kimmyat2003
Copyright
© All Rights Reserved

Table of contents

TABLE OF CONTENTS

TABLE OF FIGURES

LO1 ANALYSE THE THEORETICAL FOUNDATION OF MACHINE LEARNING TO DETERMINE HOW AN INTELLIGENT MACHINE WORKS

P1 Analyse the types of learning problems

P2 Demonstrate the taxonomy of machine learning algorithms

LO2 INVESTIGATE THE MOST POPULAR AND EFFICIENT MACHINE LEARNING ALGORITHMS USED IN INDUSTRY

P3 Investigate a range of machine learning algorithms and how these algorithms solve the learning problems

P4 Demonstrate the efficiency of these algorithms by implementing them using an appropriate programming language or machine learning tool

LO3 DEVELOP A MACHINE LEARNING APPLICATION USING AN APPROPRIATE PROGRAMMING LANGUAGE OR MACHINE LEARNING TOOL FOR SOLVING A REAL-WORLD PROBLEM

P5 Prepare training and test data sets in order to implement a machine learning solution for an appropriate learning problem

P6 Implement a machine learning solution with a suitable machine learning algorithm and demonstrate the outcome

LO4 EVALUATE THE OUTCOME OR THE RESULT OF THE APPLICATION TO DETERMINE THE EFFECTIVENESS OF THE LEARNING ALGORITHM USED IN THE APPLICATION

P7 Discuss whether the result is balanced, underfitting or overfitting

P8 Analyse the result of the application to determine the effectiveness of the algorithm

REFERENCES

Table of figures
FIGURE 1: SUPERVISED LEARNING
FIGURE 2: UNSUPERVISED LEARNING
FIGURE 3: SEMI-SUPERVISED LEARNING
FIGURE 4: REINFORCEMENT LEARNING
FIGURE 5: SUPERVISED LEARNING: LINEAR REGRESSION
FIGURE 6: UNSUPERVISED LEARNING: K-MEANS CLUSTERING
FIGURE 7: REINFORCEMENT LEARNING: Q LEARNING
FIGURE 8: REINFORCEMENT LEARNING: Q LEARNING
FIGURE 9: LEARNING CURVE
LO1 Analyse the theoretical foundation of machine
learning to determine how an intelligent machine
works
P1 Analyse the types of learning problems.
• Supervised Learning Problems
“Supervised learning is the process of learning a mapping from input data to
output labels using a labeled dataset.
Examples include classification tasks in which the goal is to predict a category
label from given input data (e.g., spam detection, image classification). Regression
tasks entail forecasting a continuous variable (for example, property prices or
market prices)” (S.Gillis, 2024).
Overfitting (the model learns noise from the data), underfitting (the model is too
simplistic to grasp the underlying patterns), and data imbalance (where one class
dominates the dataset) are all examples of supervised learning challenges.

Figure 1: Supervised Learning

https://www.geeksforgeeks.org/supervised-unsupervised-learning/
• Unsupervised Learning Problems
“Unsupervised learning works to identify underlying patterns or structures in
unlabeled data.
Examples include clustering algorithms, which group comparable data points
together without the use of predefined labels. Dimensionality reduction strategies
seek to reduce the number of features while retaining important information in
the data (e.g., Principal Component Analysis)” (S.Gillis, 2024).
The challenges include choosing the ideal number of clusters, dealing with
high-dimensional data, and interpreting the significance of clusters or reduced
dimensions.

Figure 2: Unsupervised Learning

https://databasetown.com/unsupervised-learning-types-applications/
• Semi-supervised Learning Problems
“Semi-supervised learning uses labeled and unlabeled data to improve model
performance.
Examples include anomaly detection, in which the majority of the data is
unlabeled but anomalies (such as fraud) are tagged. Semi-supervised learning is
also effective in situations when labeled data is rare but unlabeled data is
abundant.
Challenges include maximizing the use of available labeled and unlabeled data,
guaranteeing consistency between labeled and unlabeled samples, and preventing
the model from overfitting to labeled data” (Bergmann, 2023).

Figure 3: Semi-supervised learning

https://medium.datadriveninvestor.com/the-ultimate-beginner-guide-of-semi-supervised-learning-3bd11cb19835
• Reinforcement Learning Problems
Reinforcement learning is the process of learning to make sequential decisions
through interactions with an environment.
Examples include playing games (e.g., AlphaGo), controlling robots, and driving
autonomously.
The challenges include the exploration-exploitation trade-off (balancing the
discovery of new actions against the exploitation of actions already known to
work), reward shaping (designing appropriate reward functions), and dealing with
sparse rewards.
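The exploration-exploitation trade-off described above is often handled with an epsilon-greedy rule. The following is a minimal sketch of that idea, not code from this project; the function name `epsilon_greedy` and the example action values are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value
    # (the first such action if several are tied)
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical value estimates for three actions
estimated_values = [0.1, 0.5, 0.2]
greedy_choice = epsilon_greedy(estimated_values, epsilon=0.0)   # never explores
mixed_choice = epsilon_greedy(estimated_values, epsilon=0.3)    # explores 30% of the time
print(greedy_choice, mixed_choice)
```

With `epsilon=0.0` the rule always exploits, so it picks the action with the largest estimate; larger values of `epsilon` trade some immediate reward for more exploration.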

Figure 4: Reinforcement Learning

P2 Demonstrate the taxonomy of machine learning algorithms.
Machine learning algorithms can be grouped by their learning approach and the
types of problems they are designed to solve:

• Supervised Learning Algorithms


In classification problems, algorithms predict categorical labels, such as whether
an email is spam or not. Common classification algorithms include Logistic
Regression, Support Vector Machines (SVM), Decision Trees, and Naive Bayes. On
the other hand, regression algorithms are used when the target variable is
continuous, aiming to predict values like house prices or stock prices. Linear
Regression, Ridge Regression, and Random Forest are examples of regression
algorithms.
Classification Algorithms: These algorithms are used when the target variable is
categorical. Examples include:
1. Logistic Regression
2. Support Vector Machines (SVM)
3. Decision Trees
4. Random Forest
5. k-Nearest Neighbors (k-NN)
6. Naive Bayes

Regression Algorithms: These algorithms are used when the target variable is
continuous. Examples include:
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. Polynomial Regression
5. Support Vector Regression (SVR)
6. Decision Tree Regression

• Unsupervised Learning Algorithms


Clustering techniques divide data into groups of similar instances, such as
customer segments or image clusters. Popular clustering algorithms include K-means
clustering, hierarchical clustering, and DBSCAN. Dimensionality reduction
techniques reduce the number of features while retaining critical information.
They include principal component analysis (PCA), t-Distributed Stochastic
Neighbor Embedding (t-SNE), and autoencoders.

Clustering Algorithms: These algorithms are used to partition data into clusters
based on similarity. Examples include:
1. K-Means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
4. Gaussian Mixture Models (GMM)
Dimensionality Reduction Algorithms: These algorithms reduce the number of
features in the data while preserving essential information. Examples include:
1. Principal Component Analysis (PCA)
2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
3. Linear Discriminant Analysis (LDA) (note: LDA uses class labels, so strictly it is a supervised technique)
4. Autoencoders

• Semi-supervised Learning Algorithms


Semi-supervised Learning Algorithms utilize both labeled and unlabeled data to
enhance model performance, with approaches like self-training and co-training.
1. Self-training Algorithms: These algorithms iteratively improve a model’s
performance by pseudo-labeling unlabeled data based on the model’s
predictions.
2. Co-training Algorithms: These algorithms train multiple models on different
subsets of features or data and exchange information between them to
improve performance.

• Reinforcement Learning Algorithms


Reinforcement Learning Algorithms make sequential decisions based on rewards or
penalties received from interacting with an environment. Value-based, policy-
based, and model-based methods are common approaches.
Value-based Methods: These algorithms estimate the value of taking different
actions in a given state. Examples include:
1. Q-Learning
2. Deep Q-Networks (DQN)

Policy-based Methods: These algorithms directly learn the optimal policy without
explicitly estimating value functions. Examples include:
1. Policy Gradient Methods
2. Proximal Policy Optimization (PPO)
Model-based Methods: These algorithms learn a model of the environment and use
it for planning and decision-making. Examples include:
1. Model Predictive Control (MPC)
2. Dyna-Q

• Transfer Learning Algorithms


Transfer Learning Algorithms transfer knowledge from one domain or task to
another, reducing the need for extensive labeled data. Fine-tuning pre-trained
models and domain adaptation methods are examples.
Fine-tuning Pre-trained Models: This approach involves taking a pre-trained model
on a source task and fine-tuning it on a related target task.
Domain Adaptation Methods: These methods aim to adapt a model trained on a
source domain to perform well on a different target domain. Examples include:
1. Adversarial Domain Adaptation
2. Instance Weighting
3. Transfer Component Analysis (TCA)

LO2 Investigate the most popular and efficient
machine learning algorithms used in industry
P3 Investigate a range of machine learning algorithms and
how these algorithms solve the learning problems.
Supervised Learning Algorithms
Logistic regression is a binary classification algorithm that models the
probability of a binary outcome based on one or more predictor variables. It
estimates the probability that a given input belongs to a specific class.
Decision trees make decisions by recursively splitting the data based on feature
values. They are intuitive and simple to interpret, and they can be used for both
classification and regression problems.
Support Vector Machines (SVM) identify the hyperplane that best separates classes
in a high-dimensional space. They are suitable for both linear and non-linear
classification tasks and perform well on small to medium-sized datasets.
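To make the comparison of these three classifiers concrete, the sketch below fits each of them to the same synthetic binary classification problem using scikit-learn. The dataset, the hyperparameters (`max_iter`, `max_depth`, the RBF kernel), and the variable names are assumptions chosen for illustration; they are not taken from this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

# Fit every model on the same training split and score it on the same test split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Using one shared train/test split keeps the comparison fair; which model scores best depends on the data, so the printed numbers should not be read as a general ranking.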

Unsupervised Learning Algorithms


K-Means Clustering partitions data into k clusters by minimizing the sum of
squared distances between each point and the centroid of its cluster. It is
commonly used for clustering tasks and scales well to large datasets.
Principal Component Analysis (PCA) reduces the dimensionality of data by
identifying orthogonal linear combinations of features that capture the most
variance. Its applications include feature extraction and data visualization.
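The PCA description above can be illustrated with a short sketch, assuming scikit-learn and NumPy are available. The synthetic data (five features generated from two hidden factors) and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 samples with 5 features that are driven by only 2 hidden factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Keep the two directions that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```

Because the data were built from two factors, two components retain almost all of the variance; on real data the explained variance ratio is what guides the choice of how many components to keep.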

Semi-supervised Learning Algorithms


Self-training is an iterative process that improves a model's performance by
pseudo-labeling unlabeled data using the model's own predictions. It is useful
when there is little labeled data but plenty of unlabeled data.
Co-training involves training multiple models on separate subsets of features or
data and exchanging information between them to improve performance. It works
best when the data naturally splits into distinct feature sets.
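Self-training as described above can be sketched with scikit-learn's `SelfTrainingClassifier`, which treats samples labeled `-1` as unlabeled. The synthetic dataset, the share of hidden labels (90%), the confidence threshold, and the choice of logistic regression as the base model are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hide about 90% of the labels; scikit-learn marks unlabeled samples with -1
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9
y_partial = y.copy()
y_partial[unlabeled] = -1

# The base model is retrained repeatedly; predictions with probability >= 0.9
# are added to the training set as pseudo-labels
self_trainer = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                                      threshold=0.9)
self_trainer.fit(X, y_partial)

# Check the predictions against the labels that were hidden during training
accuracy_on_hidden = (self_trainer.predict(X[unlabeled]) == y[unlabeled]).mean()
print("Accuracy on originally unlabeled samples:", accuracy_on_hidden)
```

The threshold controls how cautious the pseudo-labeling is: a higher value adds fewer but more reliable pseudo-labels in each round.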

Reinforcement Learning Algorithms


Q-Learning is a value-based reinforcement learning algorithm that iteratively
learns an action-value function through exploration and exploitation. It is used
in settings with discrete state and action spaces.
Deep Q Networks (DQN): DQN builds on Q-Learning by using deep neural
networks to approximate the action-value function. It performs well in
high-dimensional state spaces, such as image-based settings.
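For reference, the standard tabular Q-Learning update can be written as follows. The symbols are not defined in the text above and are introduced here as assumptions: $s$ and $a$ are the current state and action, $r$ is the reward received, $s'$ is the next state, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

DQN keeps the same target, $r + \gamma \max_{a'} Q(s', a')$, but replaces the table with a neural network trained to predict it.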

Transfer Learning Algorithms


Fine-tuning Pre-trained Models: Fine-tuning is the process of taking a model
pre-trained on one task and continuing to train it on a related target task. It
works well when the target task has only a small amount of labeled data.
Domain Adaptation: The goal of domain adaptation approaches is to get a model
trained on one domain to perform well on another. The techniques include
adversarial domain adaptation and instance weighting.

P4 Demonstrate the efficiency of these algorithms by
implementing them using an appropriate programming
language or machine learning tool.
• Supervised Learning: Linear Regression

Figure 5: Supervised learning: linear regression


The code in Figure 5 generates synthetic data representing a linear relationship
between a single feature and a target variable with added noise. It then fits a
line to the data using scikit-learn's LinearRegression class and displays the
intercept and coefficient of the fitted line.
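The listing in Figure 5 is an image and is not reproduced in this text. The following is a minimal sketch along the lines described, not the original code; the true intercept (4), slope (3), noise level, and sample size are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data: y = 4 + 3x plus Gaussian noise (assumed values)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Fit a straight line to the data
model = LinearRegression()
model.fit(X, y)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])
```

The fitted intercept and coefficient should land close to the values used to generate the data, with the remaining difference caused by the added noise.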
• Unsupervised Learning: K-Means Clustering

Figure 6: Unsupervised learning: K-Means Clustering

The code in Figure 6 generates synthetic data points in a two-dimensional space
and applies K-Means clustering to partition the data into three clusters. It then
prints the coordinates of the cluster centroids, which represent the center of
each cluster.
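The listing in Figure 6 is an image and is not reproduced in this text. A minimal sketch of comparable code is shown here; the three blob centers, the spread of the points, and the `n_init` setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated groups of 2-D points (assumed centers)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([center + rng.normal(scale=0.5, size=(50, 2))
               for center in true_centers])

# Partition the points into three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Centroids:")
print(kmeans.cluster_centers_)
```

The order of the printed centroids is arbitrary, so they should be matched to the groups by position rather than by index.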
• Reinforcement Learning: Q-Learning

Figure 7: Reinforcement learning: Q learning

Figure 8: Reinforcement learning: Q learning

The code in Figures 7 and 8 implements Q-Learning for a simplified environment
represented as a 3x3 grid, where the agent learns to navigate to a goal state. It
initializes a Q-table with random values, updates the Q-values based on actions
taken and rewards received, and extracts the optimal policy, which represents the
best action to take in each state according to the learned Q-values.
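The listings in Figures 7 and 8 are images and are not reproduced in this text. The sketch below shows one way such a 3x3 grid agent could be written; it is not the original code. The goal position (bottom-right cell), the reward values, the hyperparameters `alpha`, `gamma`, and `epsilon`, and the episode limits are all assumptions.

```python
import numpy as np

N = 3                                          # 3x3 grid, states numbered 0..8 row by row
GOAL = N * N - 1                               # assumed goal: bottom-right cell
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def step(state, action):
    """Apply an action; moving into a wall leaves the agent where it is."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    next_state = row * N + col
    reward = 1.0 if next_state == GOAL else -0.01  # small cost per move
    return next_state, reward


rng = np.random.default_rng(0)
Q = rng.random((N * N, len(ACTIONS))) * 0.01   # Q-table with small random values
Q[GOAL] = 0.0                                  # no future reward after the goal
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(500):
    state = 0                                  # start in the top-left cell
    for _ in range(50):                        # cap the episode length
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(len(ACTIONS)))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward = step(state, action)
        # Q-Learning update toward reward plus discounted best next value
        target = reward + gamma * np.max(Q[next_state])
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        if state == GOAL:
            break

# Greedy policy: best action index for each state
policy = np.argmax(Q, axis=1)
print(policy.reshape(N, N))
```

The printed grid holds one action index per cell (0 = up, 1 = down, 2 = left, 3 = right); the entry for the goal cell itself carries no meaning because the episode ends there.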

LO3 Develop a machine learning application using an
appropriate programming language or machine
learning tool for solving a real-world problem.
P5 Prepare training and test data sets in order to
implement a machine learning solution for an appropriate
learning problem.
The objective of this project is to construct a machine learning solution for letter
recognition using the Random Forest classifier. The dataset consists of 20,000
letters, each of which is defined by 16 attributes including size and shape. Our
responsibilities include preparing the data, dividing it into training and test sets,
training the Random Forest classifier, evaluating its performance, and showing the
learning curves.
import pandas as pd

# Column names: the target letter followed by the 16 numeric attributes
attributes = ["lettr", "x-box", "y-box", "width", "high", "onpix",
              "x-bar", "y-bar", "x2bar", "y2bar", "xybar", "x2ybr", "xy2br",
              "x-ege", "xegvy", "y-ege", "yegvx"]

# Load the dataset; the raw file has no header row, so header=None keeps
# the first record as data instead of consuming it as column names
file_path = "letter-recognition.data"
data = pd.read_csv(file_path, header=None, names=attributes)

# Split data into features (X) and target labels (y)
X = data.drop('lettr', axis=1)
y = data['lettr']

We begin by loading the dataset from the “letter-recognition.data” file into a
pandas DataFrame. The dataset contains 16 features and a target variable
indicating the letter. We assign appropriate column names and split the data into
features (X) and target labels (y).
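from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out a test set. The split is not shown in the original listing, so the
# 80/20 ratio and the stratification used here are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest classifier
rf_classifier.fit(X_train, y_train)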

After initializing a Random Forest classifier with 100 decision trees, we train it on
the training data. The Random Forest technique was chosen for its capacity to
handle high-dimensional data and its strong performance on classification tasks.
from sklearn.metrics import accuracy_score

# Predictions
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

To evaluate the model, we generate predictions on the test set and compare them to
the true labels. The accuracy score indicates how well the model generalizes to
previously unseen data.

To gain insights into the model’s performance and potential for improvement, we
define a function to plot learning curves. These curves depict the model’s accuracy
on both the training and cross-validation sets across different training set sizes. By
visualizing the learning curves, we can assess whether the model suffers from
issues like overfitting or underfitting.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve


def plot_learning_curves(estimator, title, X, y, ylim=None, cv=None,
                         n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """Plot mean training and cross-validation accuracy against training set size."""
    plt.figure(figsize=(12, 6))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Accuracy")

    # Compute scores for each training set size (one column per CV fold)
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Shaded bands show one standard deviation around each mean
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


# Plot learning curves
title = "Learning Curve of Random Forest Classifier"
plot_learning_curves(rf_classifier, title, X_train, y_train, cv=5, n_jobs=-1)
plt.show()

P6 Implement a machine learning solution with a suitable
machine learning algorithm and demonstrate the outcome.
Implementing a machine learning solution involves several essential steps: data
preparation, model selection, training, evaluation, and outcome demonstration.
Each step is described below with reference to the project's code:

Data Preparation:
The first step is to load the dataset and prepare it for modeling. In this case, the
dataset consists of 20,000 letters, each described by 16 features.
The dataset is loaded into a pandas DataFrame, and appropriate column names are
assigned.
The data is then split into features (X) and target labels (y), where X contains the
features describing each letter, and y contains the corresponding letter labels.
Model Selection:
For this letter recognition task, an appropriate machine learning algorithm must
be selected. Here, a Random Forest classifier is used.
Random Forest is a powerful ensemble learning method that performs well in
classification tasks, particularly when working with high-dimensional data and a
large number of features. It combines many decision trees to improve
generalization while reducing overfitting.
Model Training:
The Random Forest classifier is initialized with hyperparameters such as the
number of decision trees (100 trees in this case) and a random seed for
reproducibility.
The classifier is then trained on the training data, where it learns patterns and
relationships between the input features and the target labels (letters).
Model Evaluation:
Following training, the classifier’s performance is assessed using the test set.
The trained classifier is applied to the test set to make predictions.
The model’s accuracy is calculated by comparing the predicted labels to the actual
labels from the test set.
Outcome Demonstration:
The outcome of the machine learning solution is demonstrated through various
means, such as printing the test accuracy and visualizing learning curves.
The test accuracy provides an overall measure of how well the model performs on
unseen data.
Learning curves visualize the model’s performance on both the training and cross-
validation sets across different training set sizes. They help assess whether the
model suffers from issues like overfitting or underfitting and provide insights into
its generalization capabilities.
By following this approach and utilizing the Random Forest classifier, we can
effectively implement a machine-learning solution for the letter recognition
problem. The outcome is demonstrated through quantitative metrics like accuracy
and visualizations like learning curves, providing insights into the model's
performance and behavior.

LO4 Evaluate the outcome or the result of the
application to determine the effectiveness of the
learning algorithm used in the application.
P7 Discuss whether the result is balanced, underfitting or
overfitting.

Figure 9: Learning curve

In machine learning, it is critical to assess the performance of a model to ensure
that it is not overfitting or underfitting the data. Overfitting happens when a
model is overly complex and learns noise from the training data, resulting in poor
performance on new, unseen data. Underfitting happens when a model is overly
simplistic and fails to capture the underlying patterns in the data, resulting in
poor performance on both training and new data.

Figure 9 shows the accuracy of the Random Forest classifier on both the training
set and the cross-validation set. Accuracy measures how often the model's
predictions are correct. In this case, the training score and cross-validation
score are both 1.000, which is the maximum accuracy. This indicates that the model
accurately predicts the target variable for both the training examples and the
cross-validation examples.

The learning curve further supports the conclusion that the model is well fitted.
It depicts how the model performs as the number of training samples grows. Here,
performance improves as the training set grows, and the gap between the training
and cross-validation scores stays small. Both scores are high, which indicates
that the model is neither overfitting nor underfitting and is able to generalize
to new data.

Therefore, the results are balanced and there is no evidence of underfitting or
overfitting. The Random Forest Classifier is performing well and is able to
accurately predict the target variable for both the training data and new, unseen
data.
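The reasoning used in this section (both scores low means underfitting, a large gap means overfitting, otherwise balanced) can be written down as a small rule. This is only a sketch: the function name `diagnose_fit` and the two thresholds are illustrative assumptions, not standard values, and the scores in the usage lines other than the 1.000 pair reported above are hypothetical.

```python
def diagnose_fit(train_score, cv_score, low_score=0.7, max_gap=0.05):
    """Label a (training, cross-validation) accuracy pair.

    The thresholds are illustrative assumptions, not standard values.
    """
    if train_score < low_score and cv_score < low_score:
        return "underfitting"   # poor on both training and held-out data
    if train_score - cv_score > max_gap:
        return "overfitting"    # good on training data, clearly worse on held-out data
    return "balanced"


print(diagnose_fit(1.00, 1.00))   # scores reported in this section
print(diagnose_fit(0.99, 0.80))   # hypothetical: large gap
print(diagnose_fit(0.55, 0.52))   # hypothetical: both scores low
```

A rule like this only summarizes the final point of the learning curve; the shape of the curve across training set sizes still needs to be inspected by eye.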

P8 Analyse the result of the application to determine the
effectiveness of the algorithm.

The effectiveness of the Random Forest algorithm for the letter recognition
problem can be analyzed based on the following aspects:

Accuracy: The Random Forest classifier’s test accuracy is a key indicator of its
effectiveness. A higher accuracy indicates that the model is better at
recognizing letters from the features provided.
Learning Curves: Visualizing the learning curves gives insight into the
model’s performance as the training set size grows. The key observations from the
learning curves are:
Training Score: The accuracy of the model on the training set. A high training
score means that the model fits the training data well.
Cross-Validation Score: The accuracy of the model on the cross-validation set. It
provides an estimate of the model’s performance on unseen data.
Convergence: How closely the training and cross-validation scores approach each
other indicates whether the model is overfitting or underfitting. A small gap
between the two scores implies strong generalization, whereas a large gap
indicates potential overfitting.
Prediction Performance: Examining which letters from the test set the model
predicts correctly provides qualitative information about its effectiveness. It
is important to determine whether the model accurately identifies different
letters in various shapes and sizes.

Based on these aspects, we can analyze the effectiveness of the Random Forest
algorithm for letter recognition:
High Accuracy: A high test accuracy suggests that the Random Forest model
efficiently captures the underlying patterns in the data and can generalize to
previously unseen letters.
Converging Learning Curves: If the training and cross-validation scores are
converging, it indicates that the model is not overfitting and can effectively
generalize to new data.

Consistent Prediction Performance: If the model correctly predicts letters of varied
forms and sizes in the test set, it illustrates the Random Forest algorithm's
robustness and usefulness.
By considering these factors and analyzing the results obtained from the
application of the Random Forest algorithm to the letter recognition problem, we
can determine the overall effectiveness of the algorithm for this specific task.
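One way to examine prediction performance letter by letter, rather than through a single accuracy figure, is to compute the accuracy for each class separately. The sketch below is a plain-Python illustration; the function name `per_class_accuracy` is an assumption, and the labels in the usage lines are hypothetical values, not results from this project.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical labels for illustration only (not results from the project)
example_true = ["A", "A", "B", "B", "B", "C"]
example_pred = ["A", "B", "B", "B", "B", "C"]
print(per_class_accuracy(example_true, example_pred))
```

In the project itself, the same function could be called with the test labels and the classifier's predictions to see whether any particular letters are recognized less reliably than others.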

References
Bergmann, D., 2023. IBM. [Online]
Available at: https://www.ibm.com/topics/semi-supervised-learning#:~:text=Contributors%3A%20Dave%20Bergmann-,What%20is%20semi%2Dsupervised%20learning%3F,for%20classification%20and%20regression%20tasks.
[Accessed 20 April 2024].
S.Gillis, A., 2024. TechTarget. [Online]
Available at: https://www.techtarget.com/searchenterpriseai/definition/supervised-learning
[Accessed 20 April 2024].
S.Gillis, A., 2024. TechTarget. [Online]
Available at: https://www.techtarget.com/searchenterpriseai/definition/unsupervised-learning
[Accessed April 2024].
