Unit 3

Machine learning is a subfield of artificial intelligence that enables computers to learn from data and improve performance over time through algorithms that identify patterns and make predictions. It encompasses various types, including supervised, unsupervised, semi-supervised, and reinforcement learning, each with distinct applications and algorithms. While machine learning offers numerous advantages such as improved efficiency and personalized experiences, it also presents challenges related to data quality, ethical concerns, and the need for robust governance.

What is machine learning?

Machine learning is a subfield of artificial intelligence that focuses on enabling computers to learn from data and improve their performance over time without being explicitly programmed. It involves developing algorithms that can analyze data, identify patterns, and make predictions or decisions, often without human intervention.

●​ A Decision Process: In general, machine learning algorithms are used to make a prediction or classification. Based on some input data, which can be labeled or unlabeled, your algorithm will produce an estimate about a pattern in the data.
●​ An Error Function: An error function evaluates the prediction of the model. If there are
known examples, an error function can make a comparison to assess the accuracy of the
model.
●​ A Model Optimization Process: If the model can fit better to the data points in the
training set, then weights are adjusted to reduce the discrepancy between the known
example and the model estimate. The algorithm will repeat this iterative “evaluate and
optimize” process, updating weights autonomously until a threshold of accuracy has been
met.

Here's a more detailed explanation:

●​ Learning from data: Machine learning algorithms are trained on datasets, allowing them to identify patterns and relationships within the data.
●​ Automated learning: Unlike traditional programming, where rules are explicitly defined, machine learning algorithms learn from examples and adapt to new data.
●​ Improved performance: As machine learning models are exposed to more data, they can improve their accuracy and efficiency over time.
●​ Various applications: Machine learning is used in diverse fields, including image recognition, natural language processing, fraud detection, and more.
Machine learning models fall into four primary categories.
Supervised learning
Supervised learning, also known as supervised machine learning, is defined by its use of labeled
datasets to train algorithms to classify data or predict outcomes accurately. As input data is fed
into the model, the model adjusts its weights until it has been fitted appropriately. This occurs as
part of the cross validation process to ensure that the model avoids overfitting or underfitting.
Supervised learning helps organizations solve a variety of real-world problems at scale, such as
classifying spam in a separate folder from your inbox. Some methods used in supervised learning
include neural networks, Naïve Bayes, linear regression, logistic regression, random forest, and
support vector machine (SVM).
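
To make this concrete, here is a minimal supervised-learning sketch in Python. It assumes scikit-learn is available (the text above does not prescribe a library) and uses one of its bundled labeled datasets; any labeled dataset would work the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Labeled dataset: each row of X has a known class label in y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Supervised learner: fit the model's weights on the labeled training data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate how well the fitted model generalizes to unseen data.
print("test accuracy:", model.score(X_test, y_test))
```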
Unsupervised learning
Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets into groups called clusters. These algorithms discover hidden patterns or data groupings without the need for human intervention.

Unsupervised learning’s ability to discover similarities and differences in information makes it ideal for exploratory data analysis, cross-selling strategies, customer segmentation, and image and pattern recognition. It’s also used to reduce the number of features in a model through the
process of dimensionality reduction. Principal component analysis (PCA) and singular value
decomposition (SVD) are two common approaches for this. Other algorithms used in
unsupervised learning include neural networks, k-means clustering, and probabilistic clustering
methods.
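
As an illustration of the two uses mentioned above, the following sketch (again assuming scikit-learn) clusters an unlabeled dataset with k-means and reduces its dimensionality with PCA; the dataset and parameter choices are arbitrary examples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Unlabeled data: the class labels are deliberately ignored.
X, _ = load_iris(return_X_y=True)

# Clustering: group the samples into 3 clusters without using any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)
print("first ten cluster assignments:", clusters[:10])

# Dimensionality reduction: compress 4 features down to 2 with PCA.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```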
Semi-supervised learning
Semi-supervised learning offers a happy medium between supervised and unsupervised learning.
During training, it uses a smaller labeled data set to guide classification and feature extraction
from a larger, unlabeled data set. Semi-supervised learning can solve the problem of not having
enough labeled data for a supervised learning algorithm. It also helps if it’s too costly to label
enough data.
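
A minimal sketch of this idea, assuming scikit-learn's self-training wrapper: most labels are deliberately hidden, and the small labeled portion guides learning on the larger unlabeled portion.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Pretend that roughly 90% of the labels are unknown (-1 marks "unlabeled").
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.9] = -1

# The small labeled portion guides pseudo-labeling of the unlabeled portion.
model = SelfTrainingClassifier(SVC(probability=True, gamma="scale"))
model.fit(X, y_partial)
print("accuracy against the full set of true labels:", round(model.score(X, y), 3))
```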

Reinforcement learning
Reinforcement learning is a machine learning model that is similar to supervised learning, but
the algorithm isn’t trained using sample data. This model learns as it goes by using trial and
error. A sequence of successful outcomes will be reinforced to develop the best recommendation
or policy for a given problem.
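
A tiny trial-and-error sketch in Python may help: the "environment" below is a hypothetical 3-armed bandit with made-up payout probabilities, and actions that produce reward are reinforced by updating their estimated values.

```python
import random

payout_prob = [0.2, 0.5, 0.8]   # hypothetical reward probability per action
value = [0.0, 0.0, 0.0]         # learned estimate of each action's value
counts = [0, 0, 0]
epsilon = 0.1                   # exploration rate

random.seed(0)
for step in range(5000):
    if random.random() < epsilon:                      # explore a random action
        action = random.randrange(3)
    else:                                              # exploit the best estimate
        action = max(range(3), key=lambda a: value[a])
    reward = 1.0 if random.random() < payout_prob[action] else 0.0
    counts[action] += 1
    # Incremental average: actions that yielded reward are reinforced.
    value[action] += (reward - value[action]) / counts[action]

print("learned action values:", [round(v, 2) for v in value])
```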

Common machine learning algorithms


A number of machine learning algorithms are commonly used. These include:

●​ Neural networks: Neural networks simulate the way the human brain works, with a
huge number of linked processing nodes. Neural networks are good at recognizing
patterns and play an important role in applications including natural language translation,
image recognition, speech recognition, and image creation.
●​ Linear regression: This algorithm is used to predict numerical values, based on a
linear relationship between different values. For example, the technique could be used to
predict house prices based on historical data for the area.
●​ Logistic regression: This supervised learning algorithm makes predictions for
categorical response variables, such as “yes/no” answers to questions. It can be used for
applications such as classifying spam and quality control on a production line.
●​ Clustering: Using unsupervised learning, clustering algorithms can identify patterns in
data so that it can be grouped. Computers can help data scientists by identifying
differences between data items that humans have overlooked.
●​ Decision trees: Decision trees can be used for both predicting numerical values
(regression) and classifying data into categories. Decision trees use a branching sequence
of linked decisions that can be represented with a tree diagram. One of the advantages of
decision trees is that they are easy to validate and audit, unlike the black box of the neural
network.
●​ Random forests: In a random forest, the machine learning algorithm predicts a value or
category by combining the results from a number of decision trees.
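
To illustrate the last point, here is a brief sketch (assuming scikit-learn) that compares a single decision tree with a random forest that combines many trees; the dataset and settings are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One tree versus an ensemble of 200 trees whose predictions are combined.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("single tree  :", tree.score(X_te, y_te))
print("random forest:", forest.score(X_te, y_te))
```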

Advantages and disadvantages of machine learning algorithms


Depending on your budget and your need for speed and precision, each algorithm type (supervised, unsupervised, semi-supervised, or reinforcement) has its own advantages and disadvantages.

For example, decision tree algorithms are used for both predicting numerical values (regression
problems) and classifying data into categories. Decision trees use a branching sequence of linked
decisions that may be represented with a tree diagram. A prime advantage of decision trees is
that they are easier to validate and audit than a neural network. The bad news is that they can be
more unstable than other decision predictors.​

Overall, there are many advantages to machine learning that businesses can leverage for new
efficiencies. These include machine learning identifying patterns and trends in massive volumes
of data that humans might not spot at all. And this analysis requires little human intervention:
just feed in the dataset of interest and let the machine learning system assemble and refine its
own algorithms, which will continually improve with more data input over time. Customers and
users can enjoy a more personalized experience as the model learns more with every experience
with that person.​

On the downside, machine learning requires large training datasets that are accurate and
unbiased. GIGO is the operative factor: garbage in / garbage out. Gathering sufficient data and
having a system robust enough to run it might also be a drain on resources. Machine learning can
also be prone to error, depending on the input. With too small a sample, the system could
produce a perfectly logical algorithm that is completely wrong or misleading. To avoid wasting
budget or displeasing customers, organizations should act on the answers only when there is high
confidence in the output.

Real-world machine learning use cases


Here are just a few examples of machine learning you might encounter every day:

●​ Speech recognition: Also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, this is a capability that uses natural language processing (NLP) to translate human speech into a written format. Many mobile devices incorporate speech recognition into their systems to conduct voice search (e.g., Siri) or to improve accessibility for texting.
●​ Customer service: Online chatbots are replacing human agents along the customer
journey, changing the way we think about customer engagement across websites and
social media platforms. Chatbots answer frequently asked questions (FAQs) about topics
such as shipping, or provide personalized advice, cross-selling products or suggesting
sizes for users. Examples include virtual agents on e-commerce sites; messaging bots,
using Slack and Facebook Messenger; and tasks usually done by virtual assistants and
voice assistants.
●​ Computer vision: This AI technology enables computers to derive meaningful
information from digital images, videos, and other visual inputs, and then take the
appropriate action. Powered by convolutional neural networks, computer vision has
applications in photo tagging on social media, radiology imaging in healthcare, and
self-driving cars in the automotive industry.
●​ Recommendation engines: Using past consumption behavior data, AI algorithms can help
to discover data trends that can be used to develop more effective cross-selling strategies.
Recommendation engines are used by online retailers to make relevant product
recommendations to customers during the checkout process.
●​ Robotic process automation (RPA): Also known as software robotics, RPA uses
intelligent automation technologies to perform repetitive manual tasks.
●​ Automated stock trading: Designed to optimize stock portfolios, AI-driven
high-frequency trading platforms make thousands or even millions of trades per day
without human intervention.
●​ Fraud detection: Banks and other financial institutions can use machine learning to spot
suspicious transactions. Supervised learning can train a model using information about
known fraudulent transactions. Anomaly detection can identify transactions that look
atypical and deserve further investigation.

Challenges of machine learning


As machine learning technology has developed, it has certainly made our lives easier. However,
implementing machine learning in businesses has also raised a number of ethical concerns about
AI technologies. Some of these include:

Technological singularity
While this topic garners a lot of public attention, many researchers are not concerned with the
idea of AI surpassing human intelligence in the near future. Technological singularity is also
referred to as strong AI or superintelligence. Philosopher Nick Bostrom defines superintelligence
as “any intellect that vastly outperforms the best human brains in practically every field,
including scientific creativity, general wisdom, and social skills.”

Despite the fact that superintelligence is not imminent in society, the idea of it raises some
interesting questions as we consider the use of autonomous systems, like self-driving cars. It’s
unrealistic to think that a driverless car would never have an accident, but who is responsible and
liable under those circumstances? Should we still develop autonomous vehicles, or do we limit
this technology to semi-autonomous vehicles which help people drive safely? The jury is still out
on this, but these are the types of ethical debates that are occurring as new, innovative AI
technology develops.
AI impact on jobs
While a lot of public perception of artificial intelligence centers around job losses, this concern
should probably be reframed. With every disruptive, new technology, we see that the market
demand for specific job roles shifts. For example, when we look at the automotive industry,
many manufacturers, like GM, are shifting to focus on electric vehicle production to align with
green initiatives. The energy industry isn’t going away, but the source of energy is shifting from
a fuel economy to an electric one.

In a similar way, artificial intelligence will shift the demand for jobs to other areas. There will
need to be individuals to help manage AI systems. There will still need to be people to address
more complex problems within the industries that are most likely to be affected by job demand
shifts, such as customer service. The biggest challenge with artificial intelligence and its effect
on the job market will be helping people to transition to new roles that are in demand.

Privacy

Privacy tends to be discussed in the context of data privacy, data protection, and data security.
These concerns have helped policymakers make more strides in recent years. For example, in
2016, GDPR legislation was created to protect the personal data of people in the European Union
and European Economic Area, giving individuals more control of their data. In the United States,
individual states are developing policies, such as the California Consumer Privacy Act (CCPA),
which was introduced in 2018 and requires businesses to inform consumers about the collection
of their data. Legislation such as this has forced companies to rethink how they store and use
personally identifiable information (PII). As a result, investments in security have become an
increasing priority for businesses as they seek to eliminate any vulnerabilities and opportunities
for surveillance, hacking, and cyberattacks.

Bias and discrimination

Instances of bias and discrimination across a number of machine learning systems have raised
many ethical questions regarding the use of artificial intelligence. How can we safeguard against
bias and discrimination when the training data itself may be generated by biased human
processes? Bias and discrimination aren’t limited to the human resources function either; they
can be found in a number of applications from facial recognition software to social media
algorithms.

As businesses become more aware of the risks with AI, they’ve also become more active in this
discussion around AI ethics and values. For example, IBM has sunset its general purpose facial
recognition and analysis products. IBM CEO Arvind Krishna wrote: “IBM firmly opposes and
will not condone uses of any technology, including facial recognition technology offered by
other vendors, for mass surveillance, racial profiling, violations of basic human rights and
freedoms, or any purpose which is not consistent with our values and Principles of Trust and
Transparency.”
Accountability

Since there isn’t significant legislation to regulate AI practices, there is no real enforcement
mechanism to ensure that ethical AI is practiced. The current incentives for companies to be
ethical are the negative repercussions of an unethical AI system on the bottom line. To fill the
gap, ethical frameworks have emerged as part of a collaboration between ethicists and
researchers to govern the construction and distribution of AI models within society. However, at
the moment, these only serve to guide. Some research shows that the combination of distributed responsibility and a lack of foresight into potential consequences isn't conducive to preventing harm to society.
How to choose the right AI platform for machine learning

Selecting a platform can be a challenging process, as the wrong system can drive up costs, or
limit the use of other valuable tools or technologies. When reviewing multiple vendors to select
an AI platform, there is often a tendency to think that more features = a better system. Maybe so,
but reviewers should start by thinking through what the AI platform will be doing for their
organization. What machine learning capabilities need to be delivered and what features are
important to accomplish them? One missing feature might doom the usefulness of an entire
system. Here are some features to consider.

MLOps capabilities. Does the system have:

●​ a unified interface for ease of management?


●​ automated machine learning tools for faster model creation with low-code and no-code
functionality?
●​ decision optimization to streamline the selection and deployment of optimization models?
●​ visual modeling to combine visual data science with open-source libraries and
notebook-based interfaces on a unified data and AI studio?
●​ automated development for beginners to get started quickly and more advanced data
scientists to experiment?
●​ synthetic data generator as an alternative or supplement to real-world data when
real-world data is not readily available?

Generative AI capabilities. Does the system have:

●​ a content generator that can generate text, images and other content based on the data it
was trained on?
●​ automated classification to read and classify written input, such as evaluating and sorting
customer complaints or reviewing customer feedback sentiment?
●​ a summary generator that can transform dense text into a high-quality summary, capture
key points from financial reports, and generate meeting transcriptions?
●​ a data extraction capability to sort through complex details and quickly pull the necessary
information from large documents?

Regression analysis

Regression is a statistical method that allows modeling relationships between a dependent variable and one or more independent variables. So a regression analysis makes it possible to infer or predict one variable based on one or more other variables. For example, you might be interested in what influences a person's salary. In order to find out, you could take the level of education, the weekly working hours, and the age of a person.

Further, you could now investigate whether these three variables have an influence on a
person's salary. If so, you can predict a person's salary by using the highest education
level, the weekly working hours and the age of a person.

What are dependent and independent variables?


The variable to be inferred is called the dependent variable (criterion). The variables used
for prediction are called independent variables (predictors).

Thus, in the example above, salary is the dependent variable and highest educational
attainment, weekly hours worked, and age are the independent variables.

When do we use a regression analysis?


By performing a regression analysis two goals can be pursued. On the one hand, the
influence of one or more variables on another variable can be measured, and on the other
hand, the regression can be used to predict a variable by one or more other variables. For
example:

1) Measurement of the influence of one or more variables on another variable


■​ What influences children's ability to concentrate?
■​ Do the educational level of the parents and the place of residence affect the future
educational attainments of children?

2) Prediction of a variable by one or more other variables


■​ How long does a patient stay in the hospital?
■​ What product is a person most likely to buy from an online store?

The regression analysis thus provides information about how the value of the dependent
variable changes if one of the independent variables is changed.

Types of regression analysis


Regression analyses are divided into simple linear regression, multiple linear regression
and logistic regression. The type of regression analysis that should be used depends on
the number of independent variables and the scale of measurement. If you only want to
use one variable for prediction, a simple regression is used. If you use more than one
variable, you need to perform a multiple regression. If the dependent variable is
nominally scaled, a logistic regression must be calculated. If the dependent variable is
metrically scaled, a linear regression is used. Whether a linear or a non-linear regression
is used depends on the relationship itself. In order to perform a linear regression, a linear
relationship between the independent variables and the dependent variable is necessary.

Linear Regression
What is a linear regression analysis?
Linear regression analysis is used to create a model that describes the relationship
between a dependent variable and one or more independent variables. Depending on
whether there are one or more independent variables, a distinction is made between
simple and multiple linear regression analysis.

In the case of a simple linear regression, the aim is to examine the influence of an
independent variable on one dependent variable. In the second case, a multiple linear
regression, the influence of several independent variables on one dependent variable is
analyzed. Example: Does the height have an influence on the weight of a person?

The goal of a simple linear regression is to predict the value of a dependent variable
based on an independent variable. The greater the linear relationship between the
independent variable and the dependent variable, the more accurate is the prediction. This
goes along with the fact that the greater the proportion of the dependent variable's
variance that can be explained by the independent variable is, the more accurate is the
prediction. Visually, the relationship between the variables can be shown in a scatter plot.
The greater the linear relationship between the dependent and independent variables, the
more the data points lie on a straight line.

The task of simple linear regression is to exactly determine the straight line which best
describes the linear relationship between the dependent and independent variable. In
linear regression analysis, a straight line is drawn in the scatter plot. To determine this
straight line, linear regression uses the method of least squares.

The regression line can be described by the following equation:

ŷ = a + b·x

Definition of "Regression coefficients":

■​ a : point of intersection with the y-axis
■​ b : gradient of the straight line

ŷ is the respective estimate of the y-value. This means that for each x-value the
corresponding y-value is estimated. In our example, this means that the height of people
is used to estimate their weight.
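
A small numerical sketch of simple linear regression in Python: the height and weight values below are invented purely for illustration, and the least-squares line ŷ = a + b·x is fitted with NumPy.

```python
import numpy as np

height = np.array([160, 165, 170, 175, 180, 185])   # x (independent variable)
weight = np.array([58, 63, 66, 72, 77, 82])         # y (dependent variable)

# Least squares: np.polyfit returns the slope b and the intercept a.
b, a = np.polyfit(height, weight, deg=1)

# Regression line ŷ = a + b·x estimates weight from height.
print(f"ŷ = {a:.2f} + {b:.2f}·x")
print("predicted weight at 172 cm:", round(a + b * 172, 1))
```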

Multiple Linear Regression


Unlike simple linear regression, multiple linear regression allows two or more independent variables to be considered. The goal is to estimate a variable based on
several other variables. The variable to be estimated is called the dependent variable
(criterion). The variables that are used for the prediction are called independent variables
(predictors).

Multiple linear regression is frequently used in empirical social research as well as in market research. In both areas it is of interest to find out what influence different factors
have on a variable. For example, what determinants influence a person's health or
purchasing behavior?

Marketing example:
For a video streaming service you should predict how many times a month a person
streams videos. For this you get a record of user's data (age, income, gender, ...).

Medical example:
You want to find out which factors have an influence on the cholesterol level of patients.
For this purpose, you analyze a patient data set with cholesterol level, age, hours of sport
per week and so on.

The equation necessary for the calculation of a multiple regression is obtained with k independent variables as:

ŷ = a + b1·x1 + b2·x2 + … + bk·xk

The coefficients can now be interpreted similarly to the simple linear regression equation. If all independent variables are 0, the resulting value is a. If an independent variable changes by one unit, the associated coefficient indicates by how much the dependent variable changes. So if the independent variable xi increases by one unit, the dependent variable y increases by bi.
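
As a sketch of multiple linear regression (assuming scikit-learn), the made-up data below mimics the medical example: cholesterol level is estimated from several predictors, and the fitted intercept a and coefficients b1…b3 are printed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: age, hours of sport per week, weight in kg (illustrative values only).
X = np.array([[45, 2, 80], [52, 1, 90], [38, 4, 70],
              [60, 0, 95], [29, 5, 65], [48, 3, 85]])
y = np.array([210, 240, 180, 260, 165, 220])   # cholesterol level

model = LinearRegression().fit(X, y)
print("intercept a        :", round(model.intercept_, 1))
print("coefficients b1..b3:", np.round(model.coef_, 2))
print("prediction for [50, 2, 88]:", round(model.predict([[50, 2, 88]])[0], 1))
```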

Logistic Regression
Logistic regression is a special case of regression analysis and is used when the
dependent variable is nominally scaled. This is the case, for example, with the variable “purchase decision” and its two values “buys a product” and “does not buy a product”.

Logistic regression analysis is thus the counterpart of linear regression, in which the
dependent variable of the regression model must at least be interval-scaled.

With logistic regression, it is now possible to explain the dependent variable or estimate
the probability of occurrence of the categories of the variable.

Business example:
For an online retailer, you need to predict which product a particular customer is most
likely to buy. For this, you receive a data set with past visitors and their purchases from
the online retailer.

Medical example:
You want to investigate whether a person is susceptible to a certain disease or not. For
this purpose, you receive a data set with diseased and non-diseased persons as well as
other medical parameters.

Political example:
Would a person vote for party A if there were elections next weekend?

What is a logistic regression?


In the basic form of logistic regression, dichotomous variables (0 or 1) can be predicted.
For this purpose, the probability of the occurrence of value 1 (characteristic present) is
estimated.

In medicine, for example, a frequent application is to find out which variables have an
influence on a disease. In this case, 0 could stand for not diseased and 1 for diseased.
Subsequently, the influence of age, gender and smoking status (smoker or not) on this
particular disease could be examined.
Logistic regression and probabilities
In linear regression, the independent variables (e.g., age and gender) are used to estimate
the specific value of the dependent variable (e.g., body weight).

In logistic regression, on the other hand, the dependent variable is dichotomous (0 or 1) and the probability that value 1 occurs is estimated. Returning to the example above,
this means: How likely is it that the disease is present if the person under consideration
has a certain age, sex and smoking status.

Calculate logistic regression


To build a logistic regression model, the linear regression equation is used as the starting point; however, this linear term can take any value between plus and minus infinity. The goal of logistic regression is to estimate a probability of occurrence, not the value of the variable itself. Therefore, the equation must still be transformed.

To do this, it is necessary to restrict the value range for the prediction to the range
between 0 and 1. To ensure that only values between 0 and 1 are possible, the logistic
function f is used.

Logistic (Sigmoid) function
The logistic model is based on the logistic function. The special thing about the logistic function is that for values between minus and plus infinity, it always assumes only values between 0 and 1:

f(z) = 1 / (1 + e^(-z))

So the logistic function is perfect to describe the probability P(y=1). If the logistic function is now applied to the regression equation above, the result is:

P(y=1) = 1 / (1 + e^(-(a + b1·x1 + b2·x2 + … + bk·xk)))

This now ensures that no matter in which range the x values are located, only values between 0 and 1 will come out.
The probability that, for given values of the independent variables, the dichotomous dependent variable y equals 1 is given by this expression; the probability that y equals 0 is correspondingly 1 − P(y=1).

To calculate the probability of a person being sick or not using the logistic regression for the example above, the model parameters b1, b2, b3 and a must first be determined. Once these have been determined, they can be inserted into the equation above to compute the probability for any given patient.
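
A minimal sketch of this calculation in Python: the sigmoid turns the linear term into a probability, and the coefficient values a, b1, b2, b3 below are hypothetical, not fitted to real data.

```python
import math

def sigmoid(z):
    # Maps any real value to the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted parameters for: age, sex (0/1), smoker (0/1).
a, b1, b2, b3 = -7.0, 0.08, 0.5, 1.2

def p_disease(age, sex, smoker):
    z = a + b1 * age + b2 * sex + b3 * smoker   # linear term
    return sigmoid(z)                            # probability of y = 1 (diseased)

print("P(diseased | 60-year-old male smoker)      :", round(p_disease(60, 1, 1), 3))
print("P(diseased | 30-year-old female non-smoker):", round(p_disease(30, 0, 0), 3))
```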

Evaluation Metrics

Evaluation metrics play an important role in obtaining the best possible classifier during classification training. Performance metrics for model evaluation shed light on:

●​ Model efficiency
●​ Production readiness of the model
●​ Potential performance enhancement with more training data
●​ Overfitting or underfitting tendencies of the model

A high-performing model should be:

●​ Accurate
●​ Consistent
●​ Reliable on new data

Why Evaluation Metrics Matter


When building a machine learning model, accuracy alone is not enough to assess how
well the model performs. Especially in imbalanced datasets (e.g., fraud detection,
medical diagnosis), accuracy can be misleading.

Example:

A model that predicts “no fraud” 99% of the time in a dataset with 1% fraud will have
99% accuracy — but it’s useless for detecting fraud.

Hence, we need detailed metrics to understand:

●​ What the model is doing right and wrong.​

●​ How it performs across different types of errors.​

●​ Whether the performance is generalizable to unseen data.​

Confusion Matrix

The confusion matrix is a table that shows the performance of a classification model:

                     Predicted: Positive     Predicted: Negative
Actual: Positive     True Positive (TP)      False Negative (FN)
Actual: Negative     False Positive (FP)     True Negative (TN)

Example:

Imagine a model that predicts whether a person has COVID based on symptoms:

●​ TP = Model correctly predicts a sick person as sick.​

●​ FP = Model incorrectly predicts a healthy person as sick.​

●​ TN = Model correctly predicts a healthy person as healthy.​

●​ FN = Model incorrectly predicts a sick person as healthy.​

Evaluation Metrics Derived from the Confusion Matrix

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision and Recall

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)
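
These quantities can be computed directly (assuming scikit-learn) from a set of hypothetical predictions, as in the sketch below.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = sick, 0 = healthy (made-up labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # made-up model predictions

print(confusion_matrix(y_true, y_pred))    # rows: actual class, columns: predicted class
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1 score :", f1_score(y_true, y_pred))
```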

Cross-Validation

Rather than relying on a single train/test split, cross-validation helps us estimate how
well our model performs on unseen data.

K-Fold Cross-Validation:

●​ The dataset is split into K subsets (folds).


●​ The model is trained on K-1 folds and tested on the remaining fold.
●​ This process is repeated K times, each time with a different fold as the test set.
●​ The final score is the average of all K scores.

Why Cross-Validation Matters:


●​ Reduces bias and variance.
●​ Ensures the model is not just performing well on one specific subset of data.
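
A short sketch of K-fold cross-validation with scikit-learn (assumed), using K = 5 folds on a bundled dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves once as the test set; scores are then averaged.
scores = cross_val_score(model, X, y, cv=5)
print("fold scores:", scores.round(3))
print("mean score :", scores.mean().round(3))
```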

Examples of Metric Use

Scenario                  Metric to Prioritize    Reason
Cancer Detection          Recall                  Missing a positive case is dangerous.
Email Spam Filter         Precision               Don't want to block important emails.
Search Results Ranking    F1-Score                Need a balance of precision and recall.

Regression Metrics

Mean Absolute Error (MAE)

Definition:​
The average of the absolute differences between predicted and actual values. It measures
the average magnitude of errors in predictions without considering their direction.

Formula:

MAE = (1/n) × Σ |yᵢ − ŷᵢ|

When to Use:

●​ When you want a simple, interpretable measure of error.


●​ Useful when all errors are equally important.
●​ Less sensitive to outliers than MSE.

Mean Squared Error (MSE)

Definition:​
The average of the squared differences between predicted and actual values. It penalizes
larger errors more than smaller ones.

Formula:

MSE = (1/n) × Σ (yᵢ − ŷᵢ)²

When to Use:

●​ When you want to penalize large errors more heavily.


●​ Often used in optimization algorithms like gradient descent.

R² Score (Coefficient of Determination)

Definition:​
Indicates how well the model explains the variability of the target variable. R² = 1 means
perfect prediction; R² = 0 means no explanatory power.
Formula:

R² = 1 − [ Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)² ]

When to Use:

●​ When you want to understand how well your model fits the data.
●​ Good for comparing different regression models on the same dataset.
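
A quick sketch (assuming scikit-learn) that computes MAE, MSE, and R² for a handful of hypothetical predictions:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0, 4.5]   # actual values (made up)
y_pred = [2.8, 5.4, 2.0, 6.5, 5.0]   # model predictions (made up)

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R² :", r2_score(y_true, y_pred))
```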

When to Use What

Metric       Type             Use When
MAE          Regression       Need interpretable average error
MSE          Regression       Want to penalize large errors
R² Score     Regression       Want to explain variance in data
Accuracy     Classification   Classes are balanced
Precision    Classification   False positives are costly
Recall       Classification   False negatives are risky
F1-Score     Classification   Need balance in imbalanced datasets


What is a Classification Algorithm?

A classification algorithm is a supervised learning technique in machine learning that:

●​ Learns from labeled data (input-output pairs),
●​ Predicts a category or class label for new, unseen data.

Goal: Assign the correct class (like "spam" or "not spam", "cat" or "dog", "disease" or
"no disease").

How It Works (Simple Steps)

1.​ Training Phase:​


The model studies the data where inputs are linked to known labels.
2.​ Learning Phase:​
It finds patterns or rules that connect input features to output classes.
3.​ Prediction Phase:​
For a new input, it predicts which class it belongs to.​
Examples of Classification Problems
●​ Email filtering: Spam or Not Spam
●​ Medical diagnosis: Disease or No Disease
●​ Image recognition: Cat, Dog, Horse, etc.
●​ Credit scoring: Default or No Default

Popular Classification Algorithms

Algorithm                        Notes
Logistic Regression              Good for binary outcomes (Yes/No)
Decision Tree                    Tree-based, easy to interpret
Random Forest                    Ensemble of trees, powerful
Support Vector Machine (SVM)     Best for high-dimensional data
K-Nearest Neighbors (KNN)        Based on proximity
Naive Bayes                      Probabilistic, fast
Neural Networks                  Good for complex datasets

Decision trees

A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical tree structure, which consists of a root node, branches, internal nodes, and leaf nodes.
A decision tree starts with a root node, which does not
have any incoming branches. The outgoing branches from the root node then feed into the
internal nodes, also known as decision nodes. Based on the available features, both node types
conduct evaluations to form homogenous subsets, which are denoted by leaf nodes, or terminal
nodes. The leaf nodes represent all the possible outcomes within the dataset.
Here's a breakdown of key aspects:

Structure and Components:

●​ Root Node: The starting point of the tree, representing the entire dataset.
●​ Internal Nodes: Represent decision points or conditions based on features.
●​ Branches: Links between nodes, representing the possible outcomes of a decision.
●​ Leaf Nodes: Terminal nodes that represent the final predictions or classifications.

As an example, imagine that you are trying to decide whether or not you should go surfing; you could express that choice as a sequence of simple decision rules.
Decision tree learning employs a divide and conquer strategy by conducting a greedy search to
identify the optimal split points within a tree. This process of splitting is then repeated in a
top-down, recursive manner until all, or the majority of records have been classified under
specific class labels.

Whether or not all data points are classified as homogenous sets is largely dependent on the
complexity of the decision tree. Smaller trees are more easily able to attain pure leaf nodes—i.e.
data points in a single class. However, as a tree grows in size, it becomes increasingly difficult to
maintain this purity, and it usually results in too little data falling within a given subtree. When
this occurs, it is known as data fragmentation, and it can often lead to overfitting.

As a result, decision trees have a preference for small trees, which is consistent with the principle
of parsimony in Occam’s Razor; that is, “entities should not be multiplied beyond necessity.”
Said differently, decision trees should add complexity only if necessary, as the simplest
explanation is often the best. To reduce complexity and prevent overfitting, pruning is usually
employed; this is a process, which removes branches that split on features with low importance.
The model’s fit can then be evaluated through the process of cross-validation.
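
The following sketch (assuming scikit-learn) illustrates these ideas: limiting the tree depth acts as a simple form of pre-pruning, and cross-validation evaluates the fit of the full versus the pruned tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # simple pre-pruning

# Cross-validation compares how well each tree generalizes.
print("full tree  :", cross_val_score(full_tree, X, y, cv=5).mean().round(3))
print("pruned tree:", cross_val_score(pruned_tree, X, y, cv=5).mean().round(3))

# The pruned tree is small enough to read as if-then rules.
print(export_text(pruned_tree.fit(X, y), max_depth=2))
```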
Types of decision trees
Hunt’s algorithm, which was developed in the 1960s to model human learning in Psychology,
forms the foundation of many popular decision tree algorithms, such as the following:

- ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative
Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate
candidate splits.

- C4.5: This algorithm is considered a later iteration of ID3, which was also developed by
Quinlan. It can use information gain or gain ratios to evaluate split points within the decision
trees.

- CART: The term, CART, is an abbreviation for “classification and regression trees” and was
introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal
attribute to split on. Gini impurity measures how often a randomly chosen data point would be misclassified if it were labeled according to the class distribution of the dataset. When evaluating splits using Gini impurity, a lower value is more ideal.

Key Concepts:

●​ Inductive Bias: A set of assumptions that the algorithm uses to make predictions.
●​ Information Gain: A measure of how much a feature helps reduce uncertainty in the
data.
●​ Gini Index/Gini Impurity: A measure of the impurity or disorder in a dataset, used to
determine the best split.
●​ Overfitting: When a tree becomes too complex and learns the training data too well,
leading to poor performance on new data.
●​ Pruning: A technique to remove unnecessary branches from the tree to prevent
overfitting.

Algorithm Steps:

●​ Feature Selection: Identifying the most relevant features for splitting the data.
●​ Splitting: Dividing the data based on the chosen feature, creating sub-trees.
●​ Recursion: Applying the same splitting process to each sub-tree until a stopping
condition is met (e.g., maximum depth, minimum number of samples per leaf).

How to choose the best attribute at each node


While there are multiple ways to select the best attribute at each node, two methods, information gain and Gini impurity, act as popular splitting criteria for decision tree models. They help to
evaluate the quality of each test condition and how well it will be able to classify samples into a
class.
Entropy and information gain
It’s difficult to explain information gain without first discussing entropy. Entropy is a concept that stems from information theory, which measures the impurity of the sample values. If the samples in a dataset S fall into classes with proportions pᵢ, entropy is defined as:

Entropy(S) = − Σ pᵢ × log2(pᵢ)

Entropy values can fall between 0 and 1. If all samples in data set, S, belong to one class, then entropy will equal zero. If half of the samples are classified as one class and the other half are in another class, entropy will be at its highest at 1. In order to select the best feature to split on and find the optimal decision tree, the attribute with the smallest amount of entropy should be used.

Information gain represents the difference in entropy before and after a split on a given attribute. The attribute with the highest information gain will produce the best split as it’s doing the best job at classifying the training data according to its target classification. Information gain is usually represented with the following formula, where the sum runs over every value v of the attribute A:

Gain(S, A) = Entropy(S) − Σ ( |Sv| / |S| ) × Entropy(Sv)

where

●​ A represents a specific attribute or class label
●​ Entropy(S) is the entropy of dataset, S
●​ Sv is the subset of S for which attribute A takes the value v
●​ |Sv|/|S| represents the proportion of the values in Sv to the number of values in dataset, S

Example: Imagine that we have a dataset of 14 days recording weather attributes such as outlook and humidity, together with whether “Play Tennis” is “Yes” or “No”.

For this dataset, the entropy is 0.94. This can be calculated by finding the proportion of days
where “Play Tennis” is “Yes”, which is 9/14, and the proportion of days where “Play Tennis” is
“No”, which is 5/14. Then, these values can be plugged into the entropy formula above.

Entropy (Tennis) = -(9/14) log2(9/14) – (5/14) log2 (5/14) = 0.94

We can then compute the information gain for each of the attributes individually. For example,
the information gain for the attribute, “Humidity” would be the following:
Gain (Tennis, Humidity) = (0.94)-(7/14)*(0.985) – (7/14)*(0.592) = 0.151

- 7/14 represents the proportion of values where humidity equals “high” to the total number of
humidity values. In this case, the number of values where humidity equals “high” is the same as
the number of values where humidity equals “normal”.

- 0.985 is the entropy when Humidity = “high”

- 0.592 is the entropy when Humidity = “normal”

Then, repeat the calculation for information gain for each attribute in the table above, and select
the attribute with the highest information gain to be the first split point in the decision tree. In
this case, outlook produces the highest information gain. From there, the process is repeated for
each subtree.
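
The worked numbers above can be reproduced with a short Python sketch; the per-value class counts (3 “Yes”/4 “No” for high humidity, 6 “Yes”/1 “No” for normal) are the ones consistent with the entropies 0.985 and 0.592 quoted above.

```python
import math

def entropy(counts):
    # Entropy of a class distribution given raw class counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

entropy_s = entropy([9, 5])                          # 9 "Yes" and 5 "No" days
print("Entropy(Tennis):", round(entropy_s, 3))       # ≈ 0.940

e_high = entropy([3, 4])                             # Humidity = high   → ≈ 0.985
e_normal = entropy([6, 1])                           # Humidity = normal → ≈ 0.592
gain = entropy_s - (7 / 14) * e_high - (7 / 14) * e_normal
print("Gain(Tennis, Humidity):", round(gain, 3))     # ≈ 0.151
```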

Advantages and disadvantages of decision trees


While decision trees can be used in a variety of use cases, other algorithms typically outperform
decision tree algorithms. That said, decision trees are particularly useful for data mining and
knowledge discovery tasks. Let’s explore the key benefits and challenges of utilizing decision
trees more below:
Advantages
●​ Easy to interpret: The Boolean logic and visual representations of decision trees make
them easier to understand and consume. The hierarchical nature of a decision tree also
makes it easy to see which attributes are most important, which isn’t always clear with
other algorithms, like neural networks.
●​ Little to no data preparation required: Decision trees have a number of characteristics that make them more flexible than other classifiers. They can handle various data types (i.e., discrete or continuous values), and continuous values can be converted into categorical values through the use of thresholds. Additionally, they can also handle missing values, which can be problematic for other classifiers, like Naïve Bayes.
●​ More flexible: Decision trees can be leveraged for both classification and regression tasks, making them more flexible than some other algorithms. They are also insensitive to underlying relationships between attributes; this means that if two variables are highly correlated, the algorithm will only choose one of the features to split on.

Disadvantages
●​ Prone to overfitting: Complex decision trees tend to overfit and do not generalize well to
new data. This scenario can be avoided through the processes of pre-pruning or
post-pruning. Pre-pruning halts tree growth when there is insufficient data while
post-pruning removes subtrees with inadequate data after tree construction.
●​ High variance estimators: Small variations within data can produce a very different
decision tree. Bagging, or the averaging of estimates, can be a method of reducing
variance of decision trees. However, this approach is limited as it can lead to highly
correlated predictors.
●​ More costly: Given that decision trees take a greedy search approach during construction,
they can be more expensive to train compared to other algorithms.
