1-Introduction To Machine Learning

This document provides an introduction to machine learning, covering key concepts such as its definition, types, modeling flow, and performance measures. It discusses the importance of data, training and testing phases, and various algorithms used in machine learning. Additionally, it highlights the advantages, limitations, and applications of machine learning in different fields.

Uploaded by

yashw609
Copyright
© © All Rights Reserved


Supervised Learning
Introduction to Machine Learning

LEARNING OBJECTIVES

At the end of this session, you will be able to understand:
• Introduction to Machine Learning
• Machine Learning Modelling Flow
• Parametric and Non-parametric Algorithms
• Types of Machine Learning
• Performance Measures
• Bias-Variance Tradeoff
• Data Inconsistencies

Introduction to Machine Learning
Machine Learning is…

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

-- Ethem Alpaydin

The goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the
uncovered patterns to predict future data or other outcomes of interest.

-- Kevin P. Murphy

The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of
computer algorithms and with the use of these regularities to take actions.

-- Christopher M. Bishop
Machine Learning is…

Machine learning is about predicting the future based on the past.


-- Hal Daume III
What is This Image?

[Image: Humayun's Tomb, located in Delhi, India]

What is this image?

[Image not recoverable]
Machine Learning is…

Machine learning is about predicting the future based on the past.

-- Hal Daume III

[Figure: past: Training Data → model/predictor; future: Testing Data → model/predictor]

Relationship of AI, ML and DL


● Artificial Intelligence (AI) covers any man-made intelligence exhibited by machines.
● Machine Learning (ML) is an approach to achieve AI.
● Deep Learning (DL) is one technique to implement ML.

[Figure: nested circles, with Deep Learning inside Machine Learning inside Artificial Intelligence]
WHAT IS MACHINE LEARNING?

Machine learning is a subset of artificial intelligence that often uses statistical techniques to give computers the ability to "learn" from data

Considerations when defining a problem:

• Problem definition
• Define the data requirements and the data source
• Decide whether the whole dataset is needed or a subset will do

Data and its division strategies

Data everywhere! Big Data Statistics 2023: How Much Data is in The World?

• Global big data analytics market annual revenue is estimated to reach $68.09 billion by 2025.
• There were 79 zettabytes of data generated worldwide in 2021.
• 90% of the data in the global datasphere is replicated data.
• In 2020, every person generated 1.7 megabytes of data per second.
• Global IoT connections already generated 13.6 zettabytes of data in 2019 alone.
• By 2025, more than 150 zettabytes of big data will need analysis.
• In 2021, software accounted for 24% of big data revenue, hardware for 16%, and services for another 24%.
• The COVID-19 pandemic increased the rate of data breaches by more than 400%.
• By 2027, the use of big data application database solutions and analytics is predicted to grow to $12 billion.
• 97.2% of organizations are investing in big data and AI.
• Using big data, Netflix saves $1 billion per year on customer retention.
Data types
Data comes in different sizes and different flavors (types):

• Texts
• Numbers
• Clickstreams
• Graphs
• Tables
• Images
• Transactions
• Videos
• Some or all the above!

We are 'DATAFIED'!

• Wherever we go, we are "datafied".
• Smartphones are tracking our locations.
• We leave a data trail in our web browsing.
• We leave traces through our interactions in social networks.
• Privacy is an important issue in Data Science.


Training and testing

[Figure: through data acquisition, a training set (observed) is drawn from the universal set (unobserved); in practical usage the model faces a testing set (unobserved) drawn from the same universal set]


Training and testing


• Training is the process of making the system able to learn.

• No free lunch rule:
   • The training set and the testing set must come from the same distribution
   • We need to make some assumptions (a bias) about the problem

Cross-validation: A better way to choose meta parameters

Divide the total dataset into three subsets:

• Training data is used for learning the parameters of the model.
• Validation data is not used for learning but is used for deciding what settings of the meta parameters work best.
• Test data is used to get a final, unbiased estimate of how well the ML model works. We expect this estimate to be worse than on the validation data.

We could divide the total dataset into one final test set and N other subsets and train on all but one of those subsets to get N different estimates of the validation error rate.
This is called N-fold cross-validation.
The N estimates are not independent.
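
The three-way division described above can be sketched in a few lines of Python. This is only an illustration: the function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are assumptions made for the example, not values taken from the slides.

```python
import random

def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle `samples` and cut them into training, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # shuffle a copy, reproducibly
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # final, untouched estimate
    val = shuffled[n_test:n_test + n_val]    # used to pick meta parameters
    train = shuffled[n_test + n_val:]        # used to learn model parameters
    return train, val, test

data = list(range(100))  # stand-in for 100 examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Every example lands in exactly one subset, so the test data never influences either the learned parameters or the chosen settings.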
k-Fold Cross-Validation

• Using k-fold cross-validation for hyper-parameter tuning is common when the size of the training data is small
   – It also leads to a better and less noisy estimate of the model performance by averaging the results across several folds
• E.g., 5-fold cross-validation (see the figure on the next slide)
   1. Split the training data into 5 equal folds
   2. First use folds 2-5 for training and fold 1 for validation
   3. Repeat by using fold 2 for validation, then fold 3, fold 4, and fold 5
   4. Average the results over the 5 runs (for reporting purposes)
   5. Once the best hyper-parameters are determined, evaluate the model on the test data
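
The five steps above can be sketched as follows. The slides do not name a model, so a tiny 1-D k-nearest-neighbour classifier stands in here as an assumption, with its neighbour count `k` playing the role of the hyper-parameter. The data is synthetic, and all function names are hypothetical.

```python
import random

def k_fold_indices(n_samples, n_folds=5):
    """Return a list of (train_idx, val_idx) pairs, one per fold."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds  # assumes n_samples is divisible by n_folds
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        val_idx = folds[i]  # fold i is held out for validation
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, val_idx))
    return splits

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x (1-D inputs)."""
    nearest = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))[:k]
    votes = sum(train_y[j] for j in nearest)
    return 1 if 2 * votes > k else 0

def cv_accuracy(xs, ys, k, n_folds=5):
    """Average validation accuracy of k-NN over all folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), n_folds):
        tx = [xs[j] for j in train_idx]
        ty = [ys[j] for j in train_idx]
        correct = sum(knn_predict(tx, ty, xs[j], k) == ys[j] for j in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)

# Synthetic training data: label is 1 when a noisy copy of x exceeds 0.5.
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(50)]
ys = [1 if x + rng.gauss(0, 0.1) > 0.5 else 0 for x in xs]

results = {k: cv_accuracy(xs, ys, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)  # hyper-parameter with the best average
print(results, best_k)
```

Only after `best_k` is fixed would the model be evaluated once on the separate test data, which this sketch leaves out.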

k-Fold Cross-Validation

• Illustration of a 5-fold cross-validation

Picture from: https://scikit-learn.org/stable/modules/cross_validation.html
Do validation strategies matter?

SDE vs SIE


Difference between Scene Dependent Evaluation (SDE) and Scene Independent Evaluation (SIE) data
division schemes. In the SDE setup, training and testing video frames share the same background, leading to
high similarity between them. In the SIE setup, however, evaluation is performed on completely unseen videos.
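
A minimal sketch of the two division schemes, assuming a hypothetical dataset of four videos with five frames each. The function names are illustrative and not taken from any library.

```python
import random

# Hypothetical dataset: 4 videos with 5 frames each, stored as (video_id, frame_no).
frames = [(video, frame) for video in ("v1", "v2", "v3", "v4") for frame in range(5)]

def scene_dependent_split(frames, test_fraction=0.25, seed=0):
    """SDE-style split: frames are shuffled, so one video can land on both sides."""
    shuffled = list(frames)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def scene_independent_split(frames, test_videos):
    """SIE-style split: every frame of a held-out video goes to the test set."""
    train = [f for f in frames if f[0] not in test_videos]
    test = [f for f in frames if f[0] in test_videos]
    return train, test

def shared_videos(train, test):
    """Videos that contribute frames to both the training and the test set."""
    return {v for v, _ in train} & {v for v, _ in test}

sde_train, sde_test = scene_dependent_split(frames)
sie_train, sie_test = scene_independent_split(frames, test_videos={"v4"})

print("SDE shared videos:", sorted(shared_videos(sde_train, sde_test)))
print("SIE shared videos:", sorted(shared_videos(sie_train, sie_test)))
```

With a frame-level shuffle the two sets will typically share videos, which is the similarity the SDE description warns about. Holding out whole videos guarantees that the shared set is empty.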
PHASES OF MACHINE LEARNING

The figure shows how learning can be applied to predict the behavior
Sample Questions

1) Machine learning is_________________

A) Machine Learning is a type of artificial intelligence that allows a system to learn from data.

B) The science of getting computers to operate without being explicitly programmed is known as machine learning.

C) A&B

D) None of the above

2) What are the phases of machine learning?

A) Training Phase

B) Validation Phase

C) Testing Phase

D) All of the above


Sample Questions

What is the role of training data in machine learning?

A) To evaluate the final performance of the model.

B) To fine-tune the model's hyperparameters.

C) To learn the model parameters.

D) To estimate the validation error rate

Why is validation data important in the machine learning process?

A) It is used to train the model initially.

B) It helps in deciding the best settings for the model's hyperparameters.

C) It provides the final estimate of the model's performance.


Sample Questions

What is the purpose of test data in machine learning?

A. To learn the parameters of the model.
B. To optimize the hyperparameters.
C. To provide an unbiased performance estimate of the model after training.
D. To continuously improve the model post-deployment.
ADVANTAGES OF MACHINE LEARNING

• Easily identifies trends and patterns

• No human intervention needed

• Cheap and flexible — can apply to any learning task

• Automatic method to search for hypotheses explaining data

• Continuous Improvement
LIMITATIONS OF MACHINE LEARNING

• Needs a massive amount of data to train

• Error prone: it is usually impossible to get perfect accuracy

Performance of ML

• There are several factors affecting the performance:


• Types of training provided
• The form and extent of any initial background knowledge
• The type of feedback provided
• The learning algorithms used
• Type of data provided

• Two important factors:


• Modeling
• Optimization
Algorithms

• The success of a machine learning system also depends on the algorithms.

• The algorithms control the search to find and build the knowledge structures.

• The learning algorithms should extract useful information from training examples.
COMPLEMENTING FIELDS OF MACHINE LEARNING
APPLICATIONS OF MACHINE LEARNING

• Spam Detection

• Speech Recognition

• Language translation

• Fraud detection

• Product Recommendation

• Classification

• Machine Translation: point your camera at the menu and the restaurant’s selections will magically appear in English via the Google Translate app

• Image Captioning


Every day people are finding more and more applications of machine learning.
Some more applications of machine learning:
▪ Driverless vehicles
▪ Email Spam and Malware Filtering
▪ Online Customer Support
▪ Product Recommendations
▪ Search Engine Result Refining
▪ Online Fraud Detection
▪ Sentiment Analysis
▪ Action Recognition
▪ Anomaly Detection
▪ Intelligent Video Surveillance
▪ Depression Analysis
▪ Traffic Prediction (varieties of prediction tasks)

Did You Know?
ML is a subset of artificial intelligence that automates data mining:
Machine learning can be described as a more automated and continuous version of data mining. Data mining can often detect patterns in data sets that no human would be able to find. Machine learning is capable of generalizing information from large and dynamically changing data sets, and then detecting and extrapolating patterns in order to apply that information to new solutions and actions.

Machine Learning Modeling Flow

MODELING FLOW

• Get data: Gather data from different sources

• Clean, prepare and manipulate data: Check for null values and outliers and clean them

• Train model: Build a model using the training data

• Evaluation: Assess the model on held-out data; tune it on validation data and keep the test data for the final estimate

• Improve: Optimize the model to increase its accuracy
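
The flow above can be walked through on a toy problem. Everything in this sketch is an assumption made for illustration: the hours-versus-score records, the valid score range of 0 to 100, the alternating split, and the choice of a straight-line model.

```python
import statistics

# 1. Get data: hypothetical (hours_studied, exam_score) records; None marks a missing value.
raw = [(1, 52), (2, 55), (3, None), (4, 64), (5, 70),
       (6, 400), (7, 79), (8, 83), (9, 88), (10, 94)]

# 2. Clean: drop rows with missing values, then drop outliers outside the assumed valid range.
rows = [(x, y) for x, y in raw if y is not None]
rows = [(x, y) for x, y in rows if 0 <= y <= 100]

# Hold out part of the data (simple alternating split, for illustration only).
train, test = rows[::2], rows[1::2]

# 3. Train: fit y = slope * x + intercept by least squares on the training rows.
def fit_line(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

# 4. Evaluate: mean absolute error on the held-out rows.
mae = statistics.mean([abs(slope * x + intercept - y) for x, y in test])
print(f"slope={slope:.2f} intercept={intercept:.2f} held-out MAE={mae:.2f}")

# 5. Improve: with more data one could compare other models on a validation set.
```

The cleaning step matters here: leaving the score of 400 in the data would pull the fitted line far away from the other points.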


DATA IN MACHINE LEARNING

• Data forms the main source of learning in Machine Learning

• The data that is being referenced here can be in any format, can be received at any frequency and can be of any size

• Data, in the ML context, can either be labeled or unlabeled

LABELED DATA

Labeled Data takes a set of unlabeled data and augments each piece of that
unlabeled data with some sort of meaningful "label"

For example, labels for the unlabeled data might be whether this photo contains
a cat or a dog
UNLABELED DATA

Unlabeled Data consists of samples of natural or human-created artifacts that you can
obtain relatively easily from the world

Some examples of unlabeled data are photos, audio recordings, videos, news articles, tweets, etc.
DATA TERMINOLOGY

• Column: Describes data of a single type. For example, column of weights or heights or prices. All the data in one column will have the same scale

• Row: A row describes a single entity or observation whose properties are described by columns.

• Cell: A cell is a single value in a row and column. It is represented as
Cell = dataset[ row ][ column ]
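
As a small illustration of this terminology, the sketch below stores a dataset as a list of rows. The column meanings and values are made up for the example.

```python
# Hypothetical dataset: each inner list is a row (one person),
# and the columns are [height_cm, weight_kg, age_years].
dataset = [
    [170.0, 65.0, 29],
    [158.5, 52.3, 41],
    [181.2, 80.1, 35],
]

row, column = 1, 2                  # second row, third column (indices start at 0)
cell = dataset[row][column]         # a single value
heights = [r[0] for r in dataset]   # one whole column: every row's first entry
print(cell, heights)                # prints: 41 [170.0, 158.5, 181.2]
```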
Some Notation/Nomenclature/Convention

Supervised Learning requires training data given as a set of input-output pairs $\{(x_n, y_n)\}_{n=1}^{N}$
Unsupervised Learning requires training data given as a set of inputs $\{x_n\}_{n=1}^{N}$
Each input $x_n$ is (usually) a vector containing the values of the features or attributes or covariates that encode properties of the data it represents, e.g.,
Representing a 7 × 7 image: $x_n$ can be a 49 × 1 vector of pixel intensities

Note: Good features can also be learned from data (feature learning) or extracted using hand-crafted rules defined by a domain expert. Having a good set of features is half the battle won!
Each $y_n$ is the output or response or label associated with input $x_n$
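
To make the image example concrete, the sketch below flattens a 7 × 7 grid of pixel intensities into a 49-element feature vector and pairs it with a label. The pixel values and the label string are placeholders chosen for the illustration.

```python
# A made-up 7 x 7 grayscale image: image[r][c] is the intensity at row r, column c.
image = [[r * 7 + c for c in range(7)] for r in range(7)]

# Flatten row by row into a single feature vector x_n with 49 entries.
x_n = [pixel for row_pixels in image for pixel in row_pixels]

# In supervised learning each input comes with an output; this label is hypothetical.
y_n = "tomb"
training_pair = (x_n, y_n)

print(len(x_n))   # prints: 49
```

Dropping `y_n` and keeping only `x_n` gives the kind of input used in unsupervised learning.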
Some Notation/Nomenclature/Convention

We will assume each input $x_n$ to be a $D \times 1$ column vector (its transpose $x_n^T$ will be a row vector)
$x_{nd}$ will denote the d-th feature of the n-th input
We will use $X$ ($N \times D$ feature matrix) to collectively denote all the N inputs
We will use $y$ ($N \times 1$ output/response/label vector) to collectively denote all the N outputs

[Figure: the $N \times D$ feature matrix $X$, whose n-th row is $x_n^T = (x_{n1}, x_{n2}, \ldots, x_{nD})$ and whose columns are features, shown next to the $N \times 1$ output vector $y$, whose n-th entry $y_n$ is the output for input n]

VISUALIZATION DATA TERMINOLOGY

        Column 1   Column 2   Column 3
Row 1   Cell11     Cell12     Cell13
Row 2   Cell21     Cell22     Cell23
Row 3   Cell31     Cell32     Cell33

STATISTICAL LEARNING PERSPECTIVE

The statistical perspective frames data in the context of a hypothetical function (f) that
the machine learning algorithm is trying to learn

The columns that are the inputs are referred to as input variables

The column of data we would like to predict is called the output variable or response variable

• The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for new X

• This is called predictive modeling or predictive analytics, and the goal is to make the most accurate predictions possible

• If there is more than one input variable, they are collectively referred to as the input vector
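
As a sketch of learning a mapping Y = f(X) from an input vector, the example below fits a linear function to data by gradient descent. The "true" function, the grid of inputs, the learning rate, and the step count are all assumptions chosen so that the toy problem converges. Real data would be noisy and f would be unknown.

```python
# Training data generated from an assumed "true" f: y = 2*x1 + 3*x2 + 1
grid = (0.0, 0.5, 1.0)
X = [(x1, x2) for x1 in grid for x2 in grid]   # 9 input vectors
Y = [2 * x1 + 3 * x2 + 1 for x1, x2 in X]      # matching outputs

def predict(weights, bias, x):
    """Linear model: weighted sum of the input vector plus a bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def fit_linear(X, Y, learning_rate=0.1, steps=2000):
    """Minimise mean squared error with plain batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(steps):
        errors = [predict(weights, bias, x) - y for x, y in zip(X, Y)]
        grad_w = [2 / n * sum(e * x[d] for e, x in zip(errors, X))
                  for d in range(len(weights))]
        grad_b = 2 / n * sum(errors)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

weights, bias = fit_linear(X, Y)
new_x = (0.2, 0.8)                      # an input that was not in the training data
y_hat = predict(weights, bias, new_x)   # prediction of Y for the new X
print(weights, bias, y_hat)
```

Because the data is noise-free and exactly linear, the learned weights approach 2 and 3 and the bias approaches 1, so the prediction for the new input is close to 3.8.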

• For example, a statistics text may talk about the input variables as independent
variables and the output variable as the dependent variable.
A model gives its best results when tested on the same data on which it was trained, but those results are overly optimistic and do not show how it handles new data. If you don’t have much data, you should stick to simple models.
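
The gap between the two kinds of score can be seen with a model that simply memorises its training data. The sketch below uses a 1-nearest-neighbour classifier on synthetic data with 20% label noise. The data generator and all names are assumptions made for the illustration.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Inputs in [0, 1]; the true rule is x > 0.5, with 20% of labels flipped."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [(x > 0.5) != (rng.random() < 0.2) for x in xs]
    return xs, ys

def predict_1nn(train_x, train_y, x):
    """Return the label of the single closest training point."""
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the memorising model gets right."""
    correct = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return correct / len(xs)

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

train_acc = accuracy(train_x, train_y, train_x, train_y)  # tested on its own training data
test_acc = accuracy(train_x, train_y, test_x, test_y)     # tested on unseen data
print(train_acc, test_acc)
```

On its own training data the model finds every point as its own nearest neighbour, so it scores perfectly even though a fifth of the labels are noise. The score on unseen data is the one that reflects how well it generalises.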
Sample Question
Why is it not advisable to test a model on the same data used for training?
A. It can lead to underfitting.
B. It does not provide a true measure of the model's performance.
C. It increases the computational cost unnecessarily.
D. It reduces the model's ability to generalize to new data

What is a potential risk when a model is trained and tested on the same dataset?
A. The model may not perform well on unseen data due to lack of exposure.
B. The model will require more data to be validated accurately.
C. The computational time for training the model will increase.
D. The model will be too complex to understand.
Next Class: Parametric and Non-Parametric Machine Learning
