Module 1 - Basics of ML

What is Machine Learning?

“Learning is any process by which a system improves performance from experience.”
– Herbert Simon

Definition by Tom Mitchell (1998):


Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P, T, E>.
So What Is Machine Learning?
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!

Traditional programming: Data + Program → Computer → Output
Machine learning: Data + Output → Computer → Program
Magic?
No, more like gardening

• Seeds = Algorithms
• Nutrients = Data
• Gardener = You
• Plants = Programs
When Do We Use Machine Learning?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)

Learning isn’t always useful:


• There is no need to “learn” to calculate payroll
A classic example of a task that requires machine learning: it is very hard to say what makes a handwritten digit a “2”.
Some more examples of tasks that are best solved by using a learning algorithm
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power
plant
• Prediction:
– Future stock prices or currency exchange rates
Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Debugging software
• [Your favorite area]

ML in a Nutshell
• Tens of thousands of machine learning algorithms
• Hundreds of new ones every year
• Every machine learning algorithm has three
components:
– Representation
– Evaluation
– Optimization
Representation
• Decision trees
• Sets of rules / logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles, etc.
Various Function
Representations
• Numerical functions
– Linear regression
– Neural networks
– Support vector machines
• Symbolic functions
– Decision trees
– Rules in propositional logic
– Rules in first-order predicate logic
• Instance-based functions
– Nearest-neighbor
– Case-based
• Probabilistic Graphical Models
– Naïve Bayes
– Bayesian networks
– Hidden-Markov Models (HMMs)
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / utility
• Margin
• Entropy
• K-L divergence, etc.
Optimization
• Combinatorial optimization
– E.g., greedy search
• Convex optimization
– E.g., gradient descent
• Constrained optimization
– E.g., linear programming
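As a concrete illustration of the convex case, here is a minimal gradient-descent sketch in plain Python (the loss function and learning rate are illustrative choices, not from the slides):

# Minimal gradient descent on a simple convex loss, f(w) = (w - 3)^2.
def loss_gradient(w):
    return 2 * (w - 3)       # derivative of (w - 3)^2

w = 0.0                      # initial guess
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * loss_gradient(w)

print(w)                     # converges toward the minimizer w = 3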
Various Search/Optimization Algorithms
• Gradient descent
– Perceptron
– Backpropagation
• Dynamic Programming
– HMM Learning
– PCFG Learning
• Divide and Conquer
– Decision tree induction
– Rule learning
• Evolutionary Computation
– Genetic Algorithms (GAs)
– Genetic Programming (GP)
– Neuro-evolution
“Machine Learning: Field of study that gives
computers the ability to learn without being
explicitly programmed.” -Arthur Samuel (1959)

Samuel’s Checkers-Player

Defining the Learning Task
Improve on task T, with respect to performance metric P, based on experience E.

T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself

T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while observing a human driver

T: Categorizing email messages as spam or legitimate
P: Percentage of email messages correctly classified
E: Database of emails, some with human-given labels


State of the Art Applications of Machine Learning
Autonomous Cars
• Nevada made it legal for autonomous cars to drive on roads in June 2011
• As of 2013, four states (Nevada, Florida, California, and Michigan) have legalized autonomous cars
Penn’s Autonomous Car (Ben Franklin Racing Team)
Autonomous Car Sensors
Autonomous Car Technology
• Path planning
• Laser terrain mapping
• Learning from human drivers
• Adaptive vision
[Figure: Sebastian Thrun and the autonomous car Stanley. Images and movies taken from Sebastian Thrun’s multimedia.]
Deep Learning in the Headlines
Deep Belief Net on Face Images
The learned feature hierarchy, from bottom to top:
• pixels
• edges
• object parts (combinations of edges)
• object models
Learning of Object Parts
Training on Multiple Objects

• Trained on 4 classes (cars, faces, motorbikes, airplanes)
• Second layer: shared features and object-specific features
Scene Labeling via Deep Learning
[Farabet et al., ICML 2012; PAMI 2013]


Inference from Deep Learned Models
Generating posterior samples from faces by “filling in” experiments (cf. Lee and Mumford, 2003). Combine bottom-up and top-down inference.
[Figure panels: input images; samples from feedforward inference (control); samples from full posterior inference.]
Machine Learning in Automatic Speech Recognition
A typical speech recognition system: ML is used to predict phone states from the sound spectrogram.

Deep learning has state-of-the-art results:

# Hidden Layers     1      2      4      8      10     12
Word Error Rate %   16.0   12.8   11.4   10.9   11.0   11.1

Baseline GMM performance = 15.4%

[Zeiler et al., “On rectified linear units for speech recognition”, ICASSP 2013]
Impact of Deep Learning in Speech Technology
Types of Learning
Based on the methods and way of learning, machine learning is divided into four main types:
• Supervised (inductive) learning
– Given: training data + desired outputs (labels)
• Unsupervised learning
– Given: training data (without desired outputs)
• Semi-supervised learning
– Given: training data + a few desired outputs
• Reinforcement learning
– Rewards from a sequence of actions
SUPERVISED LEARNING
• We train the machines using a "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output.
• First, we train the machine with the input and corresponding output, and then we ask the machine to predict the output on the test dataset.
• E.g., suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images: the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), etc.
• After training, we input the picture of a cat and ask the machine to identify the object and predict the output. The machine is now well trained, so it checks all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., finds that it is a cat, and puts it in the Cat category.
• The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc.
Categories of Supervised Machine Learning
Supervised machine learning can be
classified into two types of problems, which
are given below:
• Classification
• Regression
Classification
• Classification algorithms are used to solve classification problems in which the output variable is categorical, such as “Yes” or “No”, “Male” or “Female”, “Red” or “Blue”, etc. Classification algorithms predict the categories present in the dataset. Some real-world examples of classification are spam detection, email filtering, etc.
Some popular classification algorithms are
given below:
• Random Forest Algorithm
• Decision Tree Algorithm
• Logistic Regression Algorithm
• Support Vector Machine Algorithm
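As a small illustration, here is a sketch of training one such classifier with scikit-learn (assuming it is installed; the iris dataset and settings are illustrative choices, not from the slides):

# Train a classifier on a labelled dataset, then predict classes for new data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # features and categorical labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                    # learn the input -> class mapping
print(clf.predict(X_test[:5]))               # predicted classes for unseen data
print(clf.score(X_test, y_test))             # accuracy on the test set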
Classification
• A multiclass classifier assigns each input to one of more than two classes. A typical use of multiclass classification is text classification: for example, classifying news articles, tweets, or scientific papers.
Regression
• Regression algorithms are used to solve regression problems in which the output variable is continuous and there is some relationship between the input and output variables. They are used to predict continuous output variables, such as market trends, weather, etc.
Regression: predict a number, not a class
• Don’t just predict whether a stock will go up or down in the present circumstance – predict by how much!
• Better, predict the probabilities that it will go up and down by different amounts
Some popular Regression algorithms are
given below:
• Simple Linear Regression Algorithm
• Multivariate Regression Algorithm
• Decision Tree Algorithm
• Lasso Regression
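A minimal sketch of simple linear regression with scikit-learn (the toy numbers are invented for illustration):

# Fit a simple linear regression and predict a continuous output.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one input feature
y = np.array([2.1, 4.0, 6.2, 7.9])           # continuous target values

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)             # learned slope and intercept
print(reg.predict([[5.0]]))                  # predicted value for a new input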
Advantages:
• Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
• These algorithms are not able to solve complex tasks.
• They may predict the wrong output if the test data is different from the training data.
• They require a lot of computational time for training.
Applications of Supervised Learning
Image segmentation:
• Supervised learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.
Medical diagnosis:
• Supervised algorithms are also used in the medical field for diagnosis purposes, using medical images and past data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
• Fraud Detection - Supervised Learning classification
algorithms are used for identifying fraud
transactions, fraud customers, etc. It is done by
using historic data to identify the patterns that can
lead to possible fraud.
• Spam detection - In spam detection & filtering,
classification algorithms are used. These algorithms
classify an email as spam or not spam. The spam
emails are sent to the spam folder.
• Speech Recognition - Supervised learning algorithms
are also used in speech recognition. The algorithm is
trained with voice data, and various identifications
can be done using the same, such as voice-activated
passwords, voice commands, etc.
Unsupervised Machine Learning
• There is no need for supervision.
• The machine is trained using an unlabeled dataset and predicts the output without any supervision.
• The models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision.
• The main aim of an unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
• E.g., suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.
• The machine will discover patterns and differences, such as colour difference and shape difference, and predict the output when it is tested with the test dataset.
Unlabeled input data is fed to the machine learning model in order to train it. The model first interprets the raw data to find the hidden patterns and then applies suitable algorithms such as k-means clustering, DBSCAN, etc.
Types of Unsupervised
Learning Algorithm:
1) Clustering
• The clustering technique is used when we
want to find the inherent groups from the
data. It is a way to group the objects into a
cluster such that the objects with the most
similarities remain in one group and have
fewer or no similarities with the objects of
other groups. An example of the clustering
algorithm is grouping the customers by
their purchasing behaviour.
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
Some of the popular clustering algorithms
are given below:
• K-Means Clustering algorithm
• Mean-shift algorithm
• DBSCAN Algorithm
• Principal Component Analysis
• Independent Component Analysis
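A small sketch of k-means with scikit-learn (the toy points are invented for illustration):

# Group unlabelled points into clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])    # unlabelled data

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # learned cluster centres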
2) Association
• Association rule learning finds interesting relations among variables within a large dataset.
• It finds the dependency of one data item on another data item and maps those variables accordingly so that it can generate maximum profit. It is mainly applied in market basket analysis, web usage mining, continuous production, etc.
• Some popular association rule learning algorithms are the Apriori algorithm, Eclat, and the FP-growth algorithm.
Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategy more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is market basket analysis.
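A minimal sketch of the support and confidence computations behind such rules, in plain Python with invented toy transactions (a real system would run Apriori or FP-growth over a large database):

# Toy market-basket data; each transaction is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Confidence of the rule bread -> butter, i.e. P(butter | bread).
conf = support({"bread", "butter"}) / support({"bread"})
print(support({"bread"}), conf)   # support 0.75, confidence ~0.67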
Advantages:
• These algorithms can be used for more complicated tasks than the supervised ones, because they work on unlabeled datasets.
• Unsupervised algorithms are preferable for many tasks, because an unlabeled dataset is easier to obtain than a labelled one.
Disadvantages:
• The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output in advance.
• Working with unsupervised learning is more difficult, as it works with unlabelled datasets that do not map to an output.
Applications of Unsupervised Learning
• Network Analysis: Unsupervised learning is
used for identifying plagiarism and copyright in
document network analysis of text data for
scholarly articles.
• Recommendation
Systems: Recommendation systems widely
use unsupervised learning techniques for
building recommendation applications for
different web applications and e-commerce
websites.
• Anomaly Detection: Anomaly detection is
a popular application of unsupervised
learning, which can identify unusual data
points within the dataset. It is used to
discover fraudulent transactions.
• Singular Value Decomposition: Singular
Value Decomposition or SVD is used to
extract particular information from the
database. For example, extracting
information of each user located at a
particular location.
Unsupervised Learning
Example applications: organizing computing clusters, social network analysis, market segmentation, astronomical data analysis.
(Image credit: NASA/JPL-Caltech/E. Churchwell, Univ. of Wisconsin, Madison)


Difference Between Supervised and Unsupervised Learning

Supervised Learning                              Unsupervised Learning
It uses known and labeled data as input          It uses unlabeled data as input
It has a feedback mechanism                      It has no feedback mechanism
Most commonly used algorithms:                   Most commonly used algorithms:
  decision tree, logistic regression,              k-means clustering, hierarchical
  support vector machine                           clustering, Apriori algorithm
Semi-Supervised Learning
• Lies between supervised and unsupervised machine learning.
• Represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data), and uses a combination of labelled and unlabeled datasets during the training period.
• Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, its data mostly consists of unlabeled examples.
• Labels are costly, but for corporate purposes a few labels may be available.
• It differs from supervised and unsupervised learning, which are defined by the presence or absence of labels.
• The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data as in supervised learning.
• Initially, similar data is clustered with an unsupervised learning algorithm, which then helps label the unlabeled data. This is done because labelled data is a comparatively more expensive acquisition than unlabeled data.
Example
• Supervised learning is where a student is under the supervision of an instructor at home and college.
• If that student analyses the same concept without any help from the instructor, it comes under unsupervised learning.
• Under semi-supervised learning, the student revises on their own after analysing the same concept under the guidance of an instructor at college.
Advantages:
• The algorithms are simple and easy to understand.
• It is highly efficient.
• It addresses drawbacks of supervised and unsupervised learning algorithms.
Disadvantages:
• Iteration results may not be stable.
• These algorithms cannot be applied to network-level data.
• Accuracy is low.
Reinforcement Learning
• Reinforcement learning works on a feedback-based process in which an AI agent (a software component) automatically explores its surroundings by hit and trial: taking actions, learning from experience, and improving its performance.
• The agent gets rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.
• In reinforcement learning there is no labelled data as in supervised learning; agents learn from their experience only.
• Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions.
• For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
• It is employed by various software and machines to find the best possible behaviour.
• The process is similar to how a human being learns; for example, a child learns various things from experience in day-to-day life.
• An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments and rewards.
• Due to its way of working, reinforcement learning is employed in fields such as game theory, operations research, information theory, and multi-agent systems.
Reinforcement learning is a type of machine learning method where an intelligent agent (a computer program) interacts with the environment and learns to act within it. It must learn by itself what the best strategy, called a policy, is to get the most reward over time. A policy defines what action the agent should choose.
Consider an agent at the first block of a maze. The maze contains an S6 block, which is a wall; S8, a fire pit; and S4, a diamond block.
• Environment: it can be anything, such as a room, a maze, a football ground, etc.
• Agent: an intelligent agent, such as an AI robot.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right. A sketch of how such rewards can drive learning follows below.
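Here is a minimal tabular Q-learning sketch of learning from such rewards (the state names, hyperparameters, and epsilon-greedy policy are illustrative assumptions, not from the slides):

import random
from collections import defaultdict

actions = ["up", "down", "left", "right"]
Q = defaultdict(float)                  # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # illustrative hyperparameters

def choose_action(state):
    # Epsilon-greedy policy: mostly exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# e.g. the agent stepped onto the diamond block S4 and received +1:
update("S3", "right", +1.0, "S4")
print(Q[("S3", "right")])   # 0.1 after one update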
Categories of Reinforcement Learning
• Positive reinforcement learning: increases the tendency that the required behaviour will occur again by adding something. It enhances the strength of the agent's behaviour and positively impacts it.
• Negative reinforcement learning: works exactly opposite to positive RL. It increases the tendency that the specific behaviour will occur again by removing an unfavourable condition.
Real-world Use Cases of Reinforcement Learning
Video games:
• RL algorithms are very popular in gaming applications and are used to attain super-human performance. Famous systems that use RL algorithms are AlphaGo and AlphaGo Zero.
Resource management:
• RL can automatically learn to schedule resources across waiting jobs in order to minimize average job slowdown.
Robotics:
• RL is widely used in robotics applications. Robots are used in industrial and manufacturing areas, and these robots are made more capable with reinforcement learning. Different industries have a vision of building intelligent robots using AI and machine learning technology.
Text mining:
• Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning (e.g., by Salesforce).
Advantages
• It helps in solving complex real-world problems that are difficult to solve with general techniques.
• The learning model of RL is similar to human learning; hence highly accurate results can be found.
• It helps in achieving long-term results.
Disadvantages
• RL algorithms are not preferred for simple problems.
• RL algorithms require huge amounts of data and computation.
• Too much reinforcement learning can lead to an overload of states, which can weaken the results.
Features
• Machine learning models are trained using data that can be represented as raw features (same as the data) or derived features (derived from the data).
• One of the most important aspects of a machine learning model is identifying the features that will help create a great model: one that performs well on unseen data.
• A model for predicting the risk of cardiac disease may have features such as the following:
– Age
– Gender
– Weight
– Whether the person smokes
– Whether the person suffers from diabetes, etc.
• A model for predicting whether a person is suitable for a job may have features such as educational qualification, number of years of experience, experience working in the field, etc.
• A model for predicting the size of a shirt for a person may have features such as age, gender, height, and weight.
• Features are nothing but the independent variables in machine learning models.
• What is required to be learned in any specific machine learning problem is a set of these features (independent variables), the coefficients of these features, and the parameters for coming up with appropriate functions or models.
Feature Selection
Feature selection is selecting a subset of the original features in order to reduce model complexity, enhance the computational efficiency of the models, and reduce the generalization error introduced by noise from irrelevant features.
Feature extraction
• Feature extraction is extracting/deriving information from the original feature set to create a new feature subspace.
• The primary idea behind feature extraction is to compress the data with the goal of maintaining most of the relevant information.
Feature extraction
•Feature extraction refers to the process
of transforming raw data into
numerical features that can be
processed while preserving the
information in the original data set. It
yields better results than applying
machine learning directly to the raw data.
Feature extraction
• The key difference between the feature selection and feature extraction techniques used for dimensionality reduction is that the original features are maintained in the case of feature selection algorithms, whereas feature extraction algorithms transform the data onto a new feature space.
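A small sketch of that difference, assuming scikit-learn is available (the iris data and k = 2 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                              # 4 original features

X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)   # keeps 2 original columns
X_extracted = PCA(n_components=2).fit_transform(X)             # builds 2 new combined features

# Both are (150, 2), but X_selected keeps original features while
# X_extracted lives in a new feature space built from all 4 originals.
print(X_selected.shape, X_extracted.shape)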
[Figure: spectrogram of a signal computed with the short-time Fourier transform; the spectrogram shows the variation of frequency content over time.]
Dataset
• Training dataset: the sample of data used to train the model.
• Validation dataset: the sample of data used to provide an unbiased evaluation of a model trained on the training dataset. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
• Test dataset: the sample of data used to provide an unbiased evaluation of a final model trained on the training dataset.
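A minimal sketch of carving one dataset into these three splits, assuming scikit-learn is installed (the 60/20/20 proportions are an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the test set, then 25% of the remainder as validation,
# leaving a 60/20/20 train/validation/test split.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 90 30 30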
Framing a Learning Problem
Designing a Learning System
• Choose the training experience
• Choose exactly what is to be learned
– i.e. the target function
• Choose how to represent the target function
• Choose a learning algorithm to infer the
target function from the experience
[Diagram: the environment/experience supplies training data to the learner; the learner produces knowledge used by the performance element, which is evaluated on testing data.]
Training vs. Test Distribution
• We generally assume that the training and test examples are independently drawn from the same overall distribution of data
– We call this “i.i.d.”, which stands for “independent and identically distributed”
• If examples are not independent, collective classification is required
• If the test distribution is different, transfer learning is required
Target vector
• Target: the target is the output, i.e., the final output you are trying to predict, also known as y. It can be categorical (sick vs. non-sick) or continuous (price of a house).
• It could be the individual classes that the input variables may be mapped to in the case of a classification problem, or the output value range in a regression problem.
• If the training set is considered, then the target is the training output values.
Main Challenges of Machine Learning
1. Insufficient quantity of training data
2. Nonrepresentative training data
• It is crucial to use a training set that is representative of the cases you want to generalize to. If the sample is too small, you will have sampling noise (i.e., nonrepresentative data).
3. Poor-quality data
• If your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will be harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data.
4. Irrelevant features
5. Underfitting
• A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, i.e., it performs poorly even on the training data and consequently also on the test data.
• Underfitting destroys the accuracy of our machine learning model.
Reasons for underfitting:
• The size of the training dataset used is not enough.
• The model is too simple.
• The training data is not cleaned and contains noise.
Techniques to reduce underfitting:
• Increase model complexity
• Increase the number of features by performing feature engineering
• Remove noise from the data
6. Overfitting the data
• Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset.
• Because of this, the model starts capturing noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model.
Techniques to reduce overfitting:
• Increase training data.
• Reduce model complexity.
• Stop early during training.
Good fit in a statistical model: ideally, a model that makes predictions with 0 error is said to have a good fit.
CURSE OF DIMENSIONALITY
• The curse of dimensionality refers to the difficulties a machine learning algorithm faces when working with data in higher dimensions that did not exist in lower dimensions. This happens because when you add dimensions (features), the minimum data requirements also increase rapidly.
• This means that as the number of features (columns) increases, you need an exponentially growing number of samples (rows) to have all combinations of feature values well represented in your sample.
• With the increase in data dimensions, your model:
– would also increase in complexity;
– would become increasingly dependent on the data it is being trained on.
• This leads to overfitting of the model: even though the model performs really well on training data, it fails drastically on any real data. The small sketch below shows how quickly the number of feature-value combinations grows.
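A quick way to see this growth (a tiny illustrative computation; the choice of 10 bins per feature is arbitrary):

# If each feature is discretized into 10 bins, the number of distinct
# feature-value combinations that data must cover grows as 10**d.
bins = 10
for d in (1, 2, 3, 5, 10):
    print(d, bins ** d)   # 10, 100, 1000, 100000, 10000000000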
DIMENSIONALITY REDUCTION
• Higher-dimensional data is often dominated by a rather small number of features. If we can find a subset of these “superfeatures” that represents the information just as well as the original dataset, we can remove the curse of dimensionality!
• This is what dimensionality reduction is: a process of reducing the dimension of your data to a few principal features.
Why is dimensionality reduction necessary?
• Avoids overfitting – the fewer assumptions a model makes, the simpler it will be.
• Easier computation – the fewer the dimensions, the faster the model trains.
• Improved model performance – it removes redundant features and noise; less misleading data improves model accuracy.
• Lower-dimensional data requires less storage space.
• Lower-dimensional data can work with algorithms that were unfit for larger dimensions.
How is Dimensionality Reduction done?
• Several techniques can be employed for
dimensionality reduction depending on the
problem and the data. These techniques
are divided into two broad categories:
• Feature Selection: Choosing the most
important features from the data
• Feature Extraction: Combining features to
create new superfeatures.
Curse of dimensionality (video):
https://www.youtube.com/watch?v=FJ6Z_-HCeg4
Probability Theory
• Probability theory is at the foundation of many machine learning algorithms.
• Probability is all about the possibility of various outcomes.
• The set of all possible outcomes is called the sample space.
• The sample space for a coin flip is {heads, tails}. The sample space for the temperature of water is all values between the freezing and boiling points.
• Only one outcome in the sample space is possible at a time, and the sample space must contain all possible values.
• The sample space is often denoted Ω (capital omega) and a specific outcome ω (lowercase omega).
We write the probability of an event ω as P(ω). The two basic axioms of probability are:
• 0 ≤ P(ω) ≤ 1 – the probability of any event has to be between 0 (impossible) and 1 (certain);
• Σω P(ω) = 1 – the sum of the probabilities of all events must be 1.
The second axiom follows from the fact that the sample space must contain all possible outcomes; therefore, we are certain (probability 1) that one of the possible outcomes will occur.
• A random variable x is a variable which randomly takes on values from a sample space.
• We often indicate a specific value x can take with italics. For example, if x represents the outcome of a coin flip, we may represent a specific outcome as x = heads.
• Random variables can be either discrete, like the coin flip, or continuous (able to take on an uncountably infinite number of possible values).
• To describe the likelihood of each possible value of a random variable x, we specify a probability distribution.
• We write x ~ P(x) to indicate that x is a random variable drawn from a probability distribution P(x).
• Probability distributions are described differently depending on whether the random variable is discrete or continuous.
Discrete Distributions
• Discrete random variables are described with a probability mass function (PMF).
• A PMF maps each value in the variable’s sample space to a probability.
• One such PMF is the uniform distribution over n possible outcomes: P(x = x) = 1/n. This reads as “the probability of x taking on the value x is 1 divided by the number of possible values”. It is called the uniform distribution because each outcome is equally likely (the likelihood is spread uniformly over all possible values).
• Another common discrete distribution is the Bernoulli.
• A Bernoulli distribution specifies the probability for a random variable which can take on one of two values (1/0, heads/tails, true/false, rain/no rain, etc.).
• The PMF of a Bernoulli distribution is P(x) = p if x = 1, and 1 − p if x = 0.
• Therefore, we can specify the entire distribution with a single parameter p, the probability of the positive outcome.
• For a fair coin we have p = 0.5, and so heads or tails is equally likely.
• Alternatively, if we say that the probability of rain tomorrow is p = 0.2, then we can infer that the probability of no rain is 0.8.
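A small simulation sketch in plain Python (the sample size is an arbitrary choice) that recovers the Bernoulli parameter p from random draws:

import random

p = 0.2  # e.g. the probability of rain tomorrow
samples = [1 if random.random() < p else 0 for _ in range(100_000)]
print(sum(samples) / len(samples))   # close to 0.2, so P(x = 0) is close to 0.8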
Continuous Distributions
• Continuous random variables are described by probability density functions (PDF).
• We write the PDF for a random variable x as f(x).
1. The Gaussian distribution (colloquially called the bell curve) can be used to model several natural phenomena.
• The Gaussian distribution is parameterized by two values: the mean μ and the variance σ².
• The mean specifies the center of the distribution, and the variance specifies the width of the distribution.
• The standard deviation σ is just the square root of the variance. To indicate that x is a random variable drawn from a Gaussian with mean μ and variance σ², we write x ~ N(μ, σ²).
• The total area under the PDF is 1, as it represents the total probability of the Gaussian.
Cumulative distribution function (CDF)
• For a given value x, we take the integral of the PDF from negative infinity to that value. So F(x) gives us the area under the PDF over the interval from negative infinity to x; that is, F(x) gives us P(x ≤ x).
• We can use the CDF to determine the probability of any given range [a, b] by noticing that P(a ≤ x ≤ b) = F(b) − F(a). This answers: what is the probability that x will be between a and b?
• For the standard Gaussian, for example, the probability of x taking on values less than −2.5 is nearly 0.
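A short sketch of these CDF facts, assuming SciPy is available (norm defaults to the standard Gaussian with mean 0 and variance 1):

from scipy.stats import norm

# P(a <= x <= b) = F(b) - F(a) for the standard normal.
a, b = -1.0, 1.0
print(norm.cdf(b) - norm.cdf(a))   # ~0.6827: probability of falling within one sigma
print(norm.cdf(-2.5))              # ~0.0062: P(x <= -2.5) is indeed nearly 0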
Joint Probability Distributions
• A distribution over multiple random variables is called a joint probability distribution.
• We can write a collection of random variables as a vector x. A joint distribution over x specifies the probability of any particular setting of all the random variables contained in x.
• Example: consider the roll of a fair die, and let A = 1 if the number is even (i.e., 2, 4, or 6) and A = 0 otherwise. Furthermore, let B = 1 if the number is prime (i.e., 2, 3, or 5) and B = 0 otherwise.
Conditional Probability Distributions
Probability of an event given that another event has already been observed.
Let x and y be events. Then the conditional probability that x occurs given that y has occurred, denoted P(x∣y), is defined as

P(x∣y) = P(x∩y) / P(y)

A conditional probability distribution is the likelihood of one condition being true if another condition is known to be true. This forms the foundation of Bayes’ theorem and Bayesian networks.
Two fair dice are rolled. What is the conditional probability that the first one lands on 6, given that the dice land on different numbers?
• There are 30 equally likely outcomes: 36 in total (6 × 6) including doubles, minus the 6 doubles (1,1), (2,2), (3,3), (4,4), (5,5), (6,6).
• Of these, there are 5 outcomes that begin with 6: (6,1), (6,2), (6,3), (6,4), (6,5).
• So the probability is 5/30 = 1/6 ≈ 16.67%.
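A Monte Carlo check of this answer (a sketch in plain Python; the number of trials is arbitrary):

import random

different = first_is_six = 0
for _ in range(1_000_000):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    if d1 != d2:                 # condition: the dice land on different numbers
        different += 1
        if d1 == 6:              # event: the first die is a 6
            first_is_six += 1

print(first_is_six / different)  # close to 5/30 = 1/6 ≈ 0.1667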
Bayes' Theorem and Conditional Probability
• Named after the 18th-century British mathematician Thomas Bayes, Bayes' theorem is a mathematical formula for determining conditional probability.
• The theorem provides a way to revise existing predictions or theories (update probabilities) given new or additional evidence.
• Bayes' theorem is well suited to and widely used in machine learning.
Bayes' Theorem
• Bayes' theorem states that the conditional probability of an event, based on the occurrence of another event, is equal to the probability of the second event given the first event, multiplied by the probability of the first event, divided by the probability of the second event.
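In symbols, using the same notation as the conditional-probability definition above:

P(A∣B) = P(B∣A) · P(A) / P(B)

Here P(A∣B) is the posterior, P(B∣A) the likelihood, P(A) the prior, and P(B) the evidence.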
Model
• A model can be viewed as any one of a process, relationship, equation, or approximation to understand and describe the data.
• A graph of the data could itself be a model, as it gives us a “description” of what our data look like: we can see a relationship between the features x and y, i.e., the variation of the features with respect to each other.
• Now, if we try to approximate it with a mathematical equation, we get:
y = 1.3055 · x − 5.703
This equation is also a model, as it gives a more concrete description of the data (more specifically, of the relationship between the features). A lot of statistics is dedicated to determining whether a model is a good or bad approximation of the data.
• Whenever we calculate the probability of an event in a stochastic process, it depends on the parameters of the model we are using to describe the process.
• Say O is an observed outcome of the process and θ is the parameter describing the underlying model.
• Then the probability we are interested in calculating is denoted by P(O|θ), i.e., “the probability of a particular outcome given the parameter of the model used to describe the process”.
• Example 1: Consider a jar with 7 red marbles, 3 green marbles, and 4 blue marbles. What is the probability of randomly selecting a non-blue marble from the jar?
• Number of red marbles = 7, number of green marbles = 3, number of blue marbles = 4.
• So the total number of possible outcomes in this case is 7 + 3 + 4 = 14.
• The number of non-blue marbles is 7 + 3 = 10.
• According to the formula for theoretical probability, P(non-blue) = 10/14 = 5/7.
• Hence, the theoretical probability of selecting a non-blue marble is 5/7.
• Example 2: Consider two players, Naveena and Isha, playing a table tennis match. The probability of Naveena winning the match is 0.62. What is the probability of Isha winning?
• Let N and I represent the events that Naveena wins and Isha wins, respectively.
• The probability of Naveena winning is P(N) = 0.62; the probability of Isha winning, P(I), is unknown.
• Winning the match is a mutually exclusive event, since only one of them can win.
• P(N) + P(I) = 1, so P(I) = 1 − P(N) = 1 − 0.62 = 0.38.
• Thus, the probability of Isha winning the match is 0.38.
• In a school, the total number of students is 300: 95 students like chicken only, 120 like fish only, 80 like mutton only, and 5 students do not like any of the above. If one student is chosen at random, find the probability that the student:
– likes mutton;
– likes either chicken or mutton;
– likes neither fish nor mutton.
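A worked solution, following the same counting approach as the marble example above:
• P(likes mutton) = 80/300 = 4/15
• P(likes either chicken or mutton) = (95 + 80)/300 = 175/300 = 7/12
• P(likes neither fish nor mutton) = (95 + 5)/300 = 100/300 = 1/3 (the chicken-only students plus those who like none of the three)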
A Brief History of Machine Learning
History of Machine Learning
• 1950s
– Samuel’s checker player
– Selfridge’s Pandemonium
• 1960s:
– Neural networks: Perceptron
– Pattern recognition
– Learning in the limit theory
– Minsky and Papert prove limitations of Perceptron
• 1970s:
– Symbolic concept induction
– Winston’s arch learner
– Expert systems and the knowledge acquisition bottleneck
– Quinlan’s ID3
– Michalski’s AQ and soybean diagnosis
– Scientific discovery with BACON
– Mathematical discovery with AM
History of Machine Learning
• 1980s:
– Advanced decision tree and rule learning
– Explanation-based Learning (EBL)
– Learning and planning and problem solving
– Utility problem
– Analogy
– Cognitive architectures
– Resurgence of neural networks (connectionism, backpropagation)
– Valiant’s PAC Learning Theory
– Focus on experimental methodology
• 1990s
– Data mining
– Adaptive software agents and web applications
– Text learning
– Reinforcement learning (RL)
– Inductive Logic Programming (ILP)
– Ensembles: Bagging, Boosting, and Stacking
History of Machine Learning
• 2000s
– Support vector machines & kernel methods
– Graphical models
– Statistical relational learning
– Transfer learning
– Sequence labeling
– Collective classification and structured outputs
– Computer Systems Applications (Compilers, Debugging, Graphics, Security)
– E-mail management
– Personalized assistants that learn
– Learning in robotics and vision
• PRESENT
– Deep learning systems
– Learning for big data
– Bayesian methods
– Multi-task & lifelong learning