
MURANG’A UNIVERSITY OF TECHNOLOGY

SCHOOL OF COMPUTING AND INFORMATION TECHNOLOGY


NAME: SIMON MAINA
REG NO: SC212/1012/2019
DEPARTMENT OF COMPUTER SCIENCE
SCS 414: MACHINE LEARNING.
ASSIGNMENT 2 SC212/1012/2019

a) What is a Support Vector Machine? (2 Marks)


A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression analysis. SVMs analyze labeled training data and recognize patterns, then use these patterns to make predictions on new, unseen data.

b) State the mathematical formulation of the SVM problem. Give an outline of the
method for solving the problem.
(6 Marks)

The SVM problem can be formulated as an optimization problem that aims to find the hyperplane (or set of hyperplanes) that maximally separates the classes of data.

Assuming a binary classification problem with training data ${(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)}$, where $x_i$ represents the $i$-th input feature vector and $y_i \in {-1, 1}$ represents the corresponding class label, the SVM problem can be formulated as follows:

Minimize
$\frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i$
Subject to:
$y_i(w\cdot x_i+b) \geq 1 - \xi_i$, for $i=1,\dots,n$
$\xi_i\geq 0$, for $i=1, \dots,n$

Where $w$ represents the weight vector of the hyperplane, $b$ represents the bias term, $C$ is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error, and $\xi_i$ are slack variables that allow some training examples to violate the margin or be misclassified.

The optimization problem can be solved using various methods such as quadratic programming, gradient descent, or sequential minimal optimization (SMO). One popular method for solving the SVM problem is the SMO algorithm, which breaks down the optimization problem into smaller subproblems that can be solved analytically. The SMO algorithm starts by selecting two examples that violate the KKT (Karush-Kuhn-Tucker) conditions, which are necessary conditions for the optimal solution. It then updates the weight vector and bias term to move closer to the optimal solution. This process is repeated until convergence or until a maximum number of iterations is reached.

Once the optimal weight vector and bias term are obtained, the SVM can be used to predict the class label of new data points by computing the sign of the decision function $f(x) = w \cdot x + b$. If $f(x) > 0$, the data point belongs to one class, and if $f(x) < 0$, it belongs to the other class. The distance between the decision boundary and the closest data points is called the margin, and SVMs aim to maximize this margin to achieve better generalization performance.
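
As an illustration only (the question does not require code), a minimal sketch using scikit-learn's SVC with a linear kernel shows how the learned $w$ and $b$ give the decision function above; the toy data values are assumptions for the example.

import numpy as np
from sklearn.svm import SVC

# Toy binary data: two small clusters labelled -1 and +1 (illustrative values only).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# C is the regularization parameter from the formulation above.
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

w = clf.coef_[0]         # weight vector of the separating hyperplane
b = clf.intercept_[0]    # bias term

# Classification of a new point: the sign of f(x) = w . x + b gives the class.
x_new = np.array([5.0, 5.0])
print('f(x_new) =', x_new @ w + b)
print('predicted label:', clf.predict([x_new])[0])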

c) Explain how Support Vector Machine can be used for classification of linearly
separable data. (6 Marks)

Support Vector Machines (SVMs) can be used to classify linearly separable data by finding the
hyperplane that maximally separates the two classes of data. In other words, the goal is to find a
hyperplane that separates the data into two regions in such a way that all the data points of one
class are on one side of the hyperplane, and all the data points of the other class are on the other
side.

Assuming a binary classification problem with training data ${(x_1, y_1), (x_2, y_2), \dots,
(x_n, y_n)}$, where $x_i$ represents the $i$-th input feature vector and $y_i \in {-1, 1}$
represents the corresponding class label, SVMs aim to find the hyperplane that maximizes the
margin between the two classes. The margin is defined as the distance between the hyperplane
and the closest data points from either class.

To find the hyperplane, SVMs solve the following optimization problem:

Minimize:

$\frac{1}{2}||w||^2$

Subject to:

$y_i(w \cdot x_i + b) \geq 1$, for $i=1,\dots,n$

where $w$ represents the weight vector of the hyperplane, $b$ represents the bias term, and $y_i$ is the class label of the $i$-th training example.

This optimization problem can be solved using various methods such as quadratic programming
or gradient descent. Once the optimal weight vector and bias term are obtained, the hyperplane
can be represented as $w \cdot x + b = 0$, where $x$ represents a new input feature vector to be
classified.

To classify a new data point, SVMs compute the sign of the decision function $f(x) = w \cdot x +
b$. If $f(x) > 0$, the data point belongs to one class, and if $f(x) < 0$, it belongs to the other
class. If the hyperplane does not perfectly separate the two classes, SVMs can use slack variables
to allow some training examples to violate the margin or be misclassified.

In summary, SVMs can be used for classification of linearly separable data by finding the
hyperplane that maximally separates the two classes. The optimization problem can be solved
using various methods, and the optimal hyperplane can be represented as a decision boundary
that can be used to classify new data points.
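
A hedged sketch of the linearly separable (hard-margin) case: with separable toy data (values assumed for illustration) and a very large $C$, every training point should satisfy the constraint $y_i(w \cdot x_i + b) \geq 1$, and the geometric margin equals $1/||w||$.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [3.0, 3.0], [4.0, 3.5], [3.5, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C approximates the hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Every training example should satisfy y_i (w . x_i + b) >= 1 (support vectors sit at 1).
print('y_i (w.x_i + b):', np.round(y * (X @ w + b), 3))
print('geometric margin:', 1.0 / np.linalg.norm(w))
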
d) Define Hidden Markov Model. What is meant by evaluation problem and how is this
solved? (6 Marks)

A Hidden Markov Model (HMM) is a statistical model that can be used to model sequential data
where the underlying process is assumed to be a Markov process with hidden states. In an HMM,
the observed data is generated by an unknown process that transitions between a set of hidden
states, where each state emits an observation with a certain probability.

Formally, an HMM can be defined as a tuple $(S, V, A, B, \pi)$, where:

$S$ is a set of hidden states.

$V$ is a set of possible observations.

$A$ is a state transition probability matrix, where $a_{ij}$ represents the probability of
transitioning from state $i$ to state $j$.

$B$ is an observation probability matrix, where $b_j(v_k)$ represents the probability of observing $v_k$ when in state $j$.

$\pi$ is an initial state distribution, where $\pi_i$ represents the probability of starting in state
$i$.

Given an HMM and a sequence of observations, the evaluation problem in HMMs is to compute
the probability of observing the given sequence, which is also known as the likelihood of the
sequence. Formally, given an HMM $(S, V, A, B, \pi)$ and an observation sequence $O = (o_1,
o_2, \dots, o_T)$, the evaluation problem is to compute $P(O|\lambda)$, the probability of
observing the sequence $O$ given the HMM $\lambda = (S, V, A, B, \pi)$.

The evaluation problem in HMMs can be solved using the forward algorithm, which is a
dynamic programming algorithm that computes the probability of being in a certain state at a
certain time step and emitting a certain observation. The forward algorithm computes the
probability of observing the given sequence by summing over all possible paths that could have
generated the sequence, taking into account the transition probabilities and the emission
probabilities.
The forward algorithm starts by initializing the probability of being in each state at the first time
step, given the first observation. It then recursively computes the probability of being in each
state at each subsequent time step, given the previous observations and the transition and
emission probabilities. The probability of observing the entire sequence is then obtained by
summing over the probabilities of being in each state at the final time step.
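
A minimal sketch of the forward algorithm in NumPy; the two-state HMM parameters and the observation sequence are assumed purely for illustration.

import numpy as np

# Assumed two-state HMM parameters (illustrative values only).
A = np.array([[0.7, 0.3],      # state transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # emission probabilities b_j(v_k)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])      # initial state distribution

O = [0, 1, 0]                  # observation sequence, as indices into V

# Initialization: alpha_1(i) = pi_i * b_i(o_1)
alpha = pi * B[:, O[0]]

# Recursion: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
for t in range(1, len(O)):
    alpha = (alpha @ A) * B[:, O[t]]

# Termination: P(O | lambda) = sum_i alpha_T(i)
print('P(O | lambda) =', alpha.sum())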

e) Illustrate K means clustering algorithm with an example. (8 Marks)

The K-means clustering algorithm computes centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-means.

In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and their centroids is minimized. Less variation within a cluster means the data points within that cluster are more similar to each other.

Working of K-Means Algorithm

We can understand the working of K-Means clustering algorithm with the help of following
steps −

Step 1 − First, we need to specify the number of clusters, K, that need to be generated by this algorithm.

Step 2 − Next, randomly select K data points to act as the initial centroids and assign each data point to one of the K clusters.

Step 3 − Now it will compute the cluster centroids.

Step 4 − Next, keep iterating the following steps until the optimal centroids are found, i.e., until the assignment of data points to clusters no longer changes:

4.1 − First, compute the sum of squared distances between the data points and the centroids.

4.2 − Now, assign each data point to the cluster whose centroid is closest to it.

4.3 − At last, compute the centroids of the clusters by taking the average of all data points in each cluster.

K-means follows an Expectation-Maximization approach to solve the problem. The Expectation step is used for assigning the data points to the closest cluster, and the Maximization step is used for computing the centroid of each cluster.

Due to the iterative nature of K-means and the random initialization of centroids, the algorithm may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to try several different initializations of the centroids.
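
Before the library-based example below, a minimal from-scratch sketch of Steps 1-4 may help; it is illustrative only (empty clusters are handled with a simple guard, and repeated initializations are not attempted).

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 4.1 and 4.2 (Expectation step): assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4.3 (Maximization step): recompute each centroid as the mean of its points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage: centroids, labels = kmeans(np.random.rand(100, 2), k=3)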

Example

In this example, we are going to first generate a 2D dataset containing 4 different blobs and then apply the k-means algorithm to see the result.

First, we will start by importing the necessary packages −

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans

The following code will generate the 2D dataset containing four blobs −

from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples = 400, centers = 4, cluster_std = 0.60, random_state = 0)

Next, the following code will help us to visualize the dataset −

plt.scatter(X[:, 0], X[:, 1], s = 20)
plt.show()

Next, make an object of KMeans, providing the number of clusters, train the model and do the prediction as follows −

kmeans = KMeans(n_clusters = 4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

Finally, with the help of the following code we can plot and visualize the cluster centers picked by the k-means estimator −

plt.scatter(X[:, 0], X[:, 1], c = y_kmeans, s = 20, cmap = 'summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c = 'blue', s = 100, alpha = 0.9)
plt.show()


f) Discuss the issues involved in decision tree learning. (6 Marks)

Decision tree learning is a popular machine learning technique used for classification and
regression analysis. It involves constructing a decision tree that recursively partitions the data
into subsets based on the values of the input features, and then predicting the target variable
based on the majority class or the mean value in each leaf node. While decision trees are easy to
interpret and can handle both categorical and continuous input features, there are several issues
that need to be considered when using this algorithm:

i. Overfitting: Decision trees are prone to overfitting when the tree is too complex and captures noise in the data. This can result in poor generalization performance on new data. One solution is to use pruning techniques to simplify the tree or to use an ensemble of decision trees (a minimal pruning sketch follows this list).
ii. Bias-variance tradeoff: Decision trees have a high variance because small changes in
the training data can lead to different trees. This can be addressed by aggregating the
predictions of multiple trees using bagging, boosting, or random forests.
iii. Feature selection: Decision trees rely on the quality of the input features, and
selecting the right set of features is crucial for the accuracy of the model. One way to
address this is to use feature selection techniques to choose the most informative
features.
iv. Missing values: Decision trees cannot handle missing values in the input data, and
this can lead to bias or errors in the predictions. One approach is to use imputation
techniques to fill in the missing values or to use algorithms that are robust to missing
data.
v. Handling continuous variables: Classic decision tree algorithms such as ID3 were designed for categorical variables, and discretizing continuous variables can lead to loss of information. One way to handle continuous variables is to use binary splits based on thresholds (as CART does) or to use regression trees that predict continuous values.
vi. Class imbalance: Decision trees may not perform well when the classes are
imbalanced, and the tree may be biased towards the majority class. This can be
addressed by using techniques such as class weighting, cost-sensitive learning, or
sampling methods to balance the classes.
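
As an illustration of point (i), a minimal sketch shows how pruning, here via scikit-learn's cost-complexity parameter ccp_alpha, can reduce overfitting; the synthetic dataset and parameter values are assumptions for the example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real problem (an assumption for the example).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unpruned tree tends to fit the training data almost perfectly (overfitting).
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Cost-complexity pruning (ccp_alpha), or limits such as max_depth, simplifies the tree.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

print('full tree   - train:', full.score(X_tr, y_tr), 'test:', full.score(X_te, y_te))
print('pruned tree - train:', pruned.score(X_tr, y_tr), 'test:', pruned.score(X_te, y_te))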

g) What is deep learning? Discuss its importance. (6 Marks)

Deep learning is a subset of machine learning that involves training artificial neural networks
with multiple layers to learn complex patterns and relationships from large datasets. It is inspired
by the structure and function of the human brain, where neurons are connected in layers to
process and transmit information.

Importance of deep learning

i. High accuracy: Deep learning models can achieve state-of-the-art accuracy on many
tasks such as image and speech recognition, natural language processing, and
recommendation systems.
ii. Large-scale data processing: Deep learning models can process and learn from vast amounts of data, making them well suited for big data applications.
iii. Automated feature engineering: Deep learning models can learn relevant features
directly from raw data, eliminating the need for manual feature engineering.
iv. Transfer learning: Deep learning models can transfer knowledge learned from one
task to another related task, reducing the need for large amounts of labeled data and
training time.
v. Real-time processing: Deep learning models can process data in real time, making them useful for applications such as autonomous vehicles, fraud detection, and predictive maintenance.
vi. Improved user experience: Deep learning can improve user experience in areas such
as speech recognition, natural language processing, and computer vision, leading to
better and more intuitive interfaces.
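
As a small, hedged illustration of the stacked-layer idea described above (not a full deep learning setup, which would typically use TensorFlow or PyTorch), a multi-layer network can be trained with scikit-learn's MLPClassifier on a standard digits dataset; the layer sizes are assumptions, not a tuned architecture.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two hidden layers of 64 units each (illustrative sizes only).
net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X_tr, y_tr)
print('test accuracy:', net.score(X_te, y_te))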

h) Determine if the following scenarios use supervised or unsupervised learning.


Explain your answers.
i) Facebook recognizes your friend from an album of 30 photographs. (5 Marks)

Facebook recognizes your friend from an album of 30 photographs: Supervised learning is a machine learning technique that involves training an algorithm on labeled data. Labeled data is a
dataset where each data point has an associated label or target value. In this scenario, Facebook
has labeled data of photos and their corresponding labels (friends' names), and it trains a
classification model to recognize a friend's face in a new photo. The algorithm uses the labeled
data to learn the patterns and features that distinguish one friend from another. During the
training process, the algorithm learns a decision boundary that separates the different classes of
friends, and this decision boundary is used to predict the label of new, unseen data points.
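
For illustration only, a supervised classifier trained on labelled face-embedding vectors might look like the sketch below; the embeddings, friend names, and model choice are all hypothetical, since the actual pipeline is not public.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 128))                    # 30 photos as 128-d face embeddings (assumed)
labels = rng.choice(['alice', 'bob', 'carol'], size=30)    # friends' names as labels (assumed)

# Supervised training on the labeled embeddings, then prediction for a new photo.
clf = SVC(kernel='linear').fit(embeddings, labels)
new_photo = rng.normal(size=(1, 128))
print('predicted friend:', clf.predict(new_photo)[0])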

ii) Netflix recommends new movies based on past movie choices. (5 Marks)

Netflix recommends new movies based on past movie choices: Collaborative filtering is an
unsupervised learning technique that involves finding patterns and similarities between users and
items in a dataset. In this scenario, Netflix does not have labeled data for each movie, but it has
data on users' past movie choices and ratings. The algorithm uses this data to find patterns and
similarities between users and movies, and recommends new movies to users based on their past
preferences and other users' preferences. Collaborative filtering can be either user-based or item-based. User-based collaborative filtering finds similar users and recommends items that
those users have enjoyed, while item-based collaborative filtering finds similar items and
recommends those items to users who have enjoyed similar items.
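
A minimal item-based collaborative filtering sketch, using cosine similarity between the movie columns of a small ratings matrix; the matrix values are assumed, and a production recommender would be far more elaborate.

import numpy as np

# Rows = users, columns = movies, 0 = not yet rated (values are assumed).
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Cosine similarity between movie columns.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Predict user 0's rating for movie 2 from the movies that user has already rated.
user, movie = 0, 2
rated = R[user] > 0
pred = (sim[movie, rated] @ R[user, rated]) / sim[movie, rated].sum()
print('predicted rating:', round(pred, 2))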

iii) Analysing bank data for suspicious-looking transactions and flagging the fraudulent transactions. (5 Marks)

Analysing bank data for suspicious-looking transactions and flagging the fraudulent transactions: This
scenario can use both supervised and unsupervised learning. In supervised learning, the
algorithm is trained on labeled data of past fraudulent and non-fraudulent transactions to learn
the patterns and features that distinguish them. The labeled data is used to train a classification
model that can predict whether a new transaction is fraudulent or not. In unsupervised learning,
the algorithm is used to find anomalies and outliers in the data that do not conform to typical
patterns of legitimate transactions, and flag them as potentially fraudulent. This technique is
known as anomaly detection, and it is useful when labeled data is scarce or non-existent.
Anomaly detection can use statistical methods, clustering methods, or neural network-based
methods to identify outliers in the data.
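
A hedged sketch of the unsupervised (anomaly detection) route using scikit-learn's IsolationForest; the simulated transaction features and the contamination rate are assumptions for illustration.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_tx = rng.normal(loc=50, scale=10, size=(500, 2))   # typical transactions (simulated)
fraud_tx = rng.normal(loc=200, scale=30, size=(5, 2))     # a few unusual ones (simulated)
X = np.vstack([normal_tx, fraud_tx])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)                               # -1 = anomaly, +1 = normal
print('flagged transaction indices:', np.where(flags == -1)[0])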

i) How is machine learning possible in today’s era? (5 Marks)

Machine learning is possible in today's era due to several factors:

i. Data availability: With the increasing use of digital technologies and the internet,
massive amounts of data are being generated every day. This data provides a valuable
resource for machine learning algorithms to learn from and improve their
performance.
ii. Computing power: The development of high-performance computing systems, such
as GPUs and cloud computing platforms, has made it possible to train and deploy
complex machine learning models in a reasonable amount of time.
iii. Algorithmic advancements: The development of new algorithms and techniques, such as deep learning, reinforcement learning, and transfer learning, has significantly improved the performance of machine learning models on a wide range of tasks.
iv. Open-source tools and libraries: The availability of open-source tools and libraries, such as TensorFlow, PyTorch, and scikit-learn, has made it easier for developers and researchers to build, train, and deploy machine learning models.
v. Industry adoption: Many industries, such as healthcare, finance, and e-commerce,
have started adopting machine learning to improve their operations and provide better
services to their customers. This has created a demand for skilled machine learning
professionals and has led to the development of specialized courses and training
programs.
