
ML ALGORITHMS
REGRESSION

A statistical technique that relates a dependent variable to one or more independent (explanatory) variables. A regression model can show whether changes observed in the dependent variable are associated with changes in one or more of the explanatory variables.
CLASSIFICATION

Statistical classification is the broad supervised learning approach that trains a program to categorize new information based on its relevance to known, labeled data. The algorithms that sort new data into labeled classes, or categories of information, are called classifiers.
Multiple Linear Regression

• Multiple linear regression attempts to model the relationship between two or more independent variables and a dependent variable by fitting a linear equation of the form y = b0 + b1·x1 + … + bn·xn to the observed data. Every value of an independent variable x is associated with a value of the dependent variable y.
Multiple Linear Regression

• Good
- Simple to implement and efficient to train.
- Overfitting can be reduced by regularization.
- Performs well when the relationship between the variables is approximately linear.
• Bad
- Assumes that the observations are independent, which is rare in real life.
- Prone to noise and overfitting.
- Sensitive to outliers.
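A minimal sketch of the idea in Python with scikit-learn (the library choice and the data values below are assumptions for illustration, not from the slides):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic data: two independent variables and one dependent variable
    X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
    y = np.array([5.0, 6.1, 11.2, 12.0, 17.1])

    model = LinearRegression()
    model.fit(X, y)                          # fits y = b0 + b1*x1 + b2*x2

    print(model.coef_, model.intercept_)     # fitted coefficients b1, b2 and b0
    print(model.predict([[6, 5]]))           # prediction for a new observation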
Logistic Regression

• Logistic regression is a classification algorithm used to predict a binary outcome. It is often used in cases where the outcome is either yes or no. Logistic regression uses the sigmoid function σ(z) = 1 / (1 + e^(−z)) to map input variables to a probability of the output variable.
Logistic Regression

• Good
- Less prone to overfitting, though it can still overfit on high-dimensional datasets.
- Efficient when the dataset has features that are linearly separable.
- Easy to implement and efficient to train.
• Bad
- Should not be used when the number of observations is smaller than the number of features.
- Assumes linearity, which is rare in practice.
- Can only be used to predict discrete outcomes.
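A minimal sketch with scikit-learn; the hours-studied data below is made up for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic data: hours studied vs. pass (1) / fail (0)
    X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    clf = LogisticRegression()
    clf.fit(X, y)

    print(clf.predict([[1.8]]))          # predicted class: 0 or 1
    print(clf.predict_proba([[1.8]]))    # sigmoid-mapped class probabilities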
Decision Trees

• A decision tree is a simple and intuitive model used for both regression and classification problems. It is a tree-like structure where each internal node represents a feature or attribute, each branch represents a decision rule, and each leaf represents an outcome. Decision trees are particularly useful when the data has multiple variables and is non-linear.
Decision Trees

• Good
- Can solve non-linear problems.
- Can work on high-dimensional data with excellent accuracy.
- Easy to visualize and explain.
• Bad
- Prone to overfitting, which can be mitigated by random forests.
- A small change in the data can lead to a large change in the structure of the optimal decision tree.
- Calculations can get very complex.
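A minimal sketch with scikit-learn, using the bundled iris dataset; max_depth=2 is an assumed cap that limits overfitting:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()

    # Limiting depth is one simple guard against overfitting
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)

    # The learned decision rules print as readable if/else branches
    print(export_text(tree, feature_names=iris.feature_names))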
Random Forests

• Random forests are a powerful and popular ensemble learning technique used for classification, regression, and anomaly detection. They extend decision trees: a large number of decision trees are trained on random subsets of the data, and the final prediction is made by averaging the individual tree predictions (regression) or taking a majority vote (classification).
Random Forests

• Good
- Can perform both regression and classification tasks.
- Handles large datasets efficiently.
- Higher level of accuracy in predictions.
• Bad
- Can still overfit on noisy data.
- The models can be quite large, making pruning necessary.
- Calculations can become complex when there are many class variables.
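A minimal sketch with scikit-learn (the iris data and the choice of 100 trees are assumed example values):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    # Each of the 100 trees sees a bootstrap sample of the data;
    # the forest combines them by majority vote for classification
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X, y)

    print(rf.predict(X[:5]))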
K Nearest Neighbour

• The K in the name of this classifier represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this classifier implements learning based on the k nearest neighbors: a new point is assigned the majority class among its k closest training points. The choice of the value of k depends on the data.
K Nearest Neighbour

• Good
- Can make predictions without a training phase.
- Prediction time complexity is O(n) per query.
- Can be used for both classification and regression.
• Bad
- Does not work well with large datasets.
- Sensitive to noisy data, missing values and outliers.
- Needs feature scaling.
- Requires choosing the right value of k.
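A minimal sketch with scikit-learn; k=5 is an assumed value, and the scaling step reflects the note above that KNN needs feature scaling:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # Distances drive the vote, so the features are scaled first
    X_scaled = StandardScaler().fit_transform(X)

    knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, tuned per dataset
    knn.fit(X_scaled, y)    # "training" just stores the points

    print(knn.predict(X_scaled[:5]))   # majority vote of the 5 nearest points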
Support Vector Machine

Support vector machines (SVM) are a supervised learning algorithm used for classification and regression analysis. SVM tries to find the hyperplane that separates the data points into different classes with the maximum margin. SVM can also handle non-linearly separable data by transforming the data into a higher-dimensional space (the kernel trick).
Support Vector Machine

• Good
- Effective on high-dimensional data.
- Can work on small datasets.
- Can solve non-linear problems.
• Bad
- Inefficient on large datasets.
- Requires picking the right kernel and hyperparameters.
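A minimal sketch with scikit-learn; the RBF kernel and C=1.0 are assumed choices, illustrating the kernel and hyperparameters that must be picked:

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # The RBF kernel implicitly maps the data to a higher-dimensional
    # space, letting SVM separate points that are not linearly separable
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X, y)

    print(clf.predict(X[:5]))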
Naive Bayes

• Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem, which gives the probability of A happening given that B has occurred: P(A|B) = P(B|A) · P(A) / P(B). The "naive" assumption is that every pair of features being classified is independent of each other.
• We divide the dataset into two parts:
• 1. The feature matrix, which contains all the vectors (rows) of the dataset, where each vector consists of the values of the dependent features.
• 2. The response vector, which contains the value of the class variable (prediction or output) for each row of the feature matrix.
Naive Bayes

• Good
- Training time is short.
- Better suited to categorical inputs.
- Easy to implement.
• Bad
- Assumes that all features are independent, which rarely holds in real life.
- The zero-frequency problem: a category value never seen in training gets zero probability (usually fixed with smoothing).
- Probability estimates can be unreliable in some cases.
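A minimal sketch with scikit-learn, using the Gaussian variant (an assumed choice); X is the feature matrix and y the response vector described above:

    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)   # feature matrix, response vector

    nb = GaussianNB()
    nb.fit(X, y)    # training is fast: per-class means and variances only

    print(nb.predict(X[:5]))
    print(nb.score(X, y))    # accuracy on the training data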
K Means(UNSUPERVISED LEARNING)

• K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process.
• For example, if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.
• It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training labels. It is a centroid-based algorithm, where each cluster is associated with a centroid.
K Means

• Good
- Simple to implement.
- Scales to large datasets.
- Guarantees convergence.
- Easily adapts to new examples.
- Generalizes to clusters of different shapes and sizes.
• Bad
- Sensitive to outliers.
- Choosing the k value manually is tough.
- Dependent on the initial values.
- Scalability decreases as the number of dimensions grows.
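A minimal sketch with scikit-learn on synthetic unlabeled data (the three blobs and K=3 are assumed example values):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic unlabeled data: 200 points around 3 centers
    X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

    km = KMeans(n_clusters=3, n_init=10, random_state=42)   # K = 3
    labels = km.fit_predict(X)

    print(km.cluster_centers_)   # one centroid per cluster
    print(labels[:10])           # cluster assignment of the first points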
TRAIN AND TEST
• Machine learning is about learning some properties of a data set and then testing those properties against another data set.
• A common practice in machine learning is to evaluate an algorithm by splitting a data set into two.
• We call one of those sets the training set, on which we learn some properties; we call the other the testing set, on which we test the learned properties.
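A minimal sketch of the split with scikit-learn (the 75/25 split and the logistic-regression model are assumed example choices):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Learn on the training set, evaluate on the held-out testing set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))   # accuracy on unseen data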