
CMPE 442: INTRODUCTION TO MACHINE LEARNING
MACHINE LEARNING
 Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)
 A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom Mitchell, 1997)
MACHINE LEARNING
 Example: Spam filter: given examples of spam e-mails and examples of ham (non-spam) e-mails, it learns to flag spam.
 Training set: the examples that the system uses to learn.
 T (task): flag spam for new e-mails
 E (experience): the training data
 P (performance): needs to be defined
 Ex: the ratio of correctly classified e-mails → accuracy
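The accuracy measure above can be computed directly from the predictions; a minimal sketch in Python (the labels and data are made up for illustration):

```python
def accuracy(predicted, actual):
    """Performance measure P: ratio of correctly classified e-mails."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# 1 = spam, 0 = ham
predicted = [1, 0, 1, 1, 0]
actual    = [1, 0, 0, 1, 0]
print(accuracy(predicted, actual))  # 4 of 5 correct -> 0.8
```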
EVALUATING PERFORMANCE ON A TASK
 Machine learning problems don't have a single "correct" answer.
 Consider the sorting problem:
 Many sorting algorithms are available: bubble sort, quick sort, insertion sort, ...
 Their performance is measured in terms of how fast they are and how much data they can handle.
 Would we compare sorting algorithms with respect to the correctness of the result? No:
 An algorithm that isn't guaranteed to produce a sorted list every time is useless as a sorting algorithm.
EVALUATING PERFORMANCE ON A TASK
 There is no perfect solution in machine learning:
 A perfect e-mail spam filter does not exist!
 In many cases the data is "noisy":
 Examples may be mislabelled
 Features may contain errors
 This is why performance evaluation of learning algorithms is so important in machine learning.
WHY USE MACHINE LEARNING?
 Let's write a spam filter using traditional programming techniques:
 1) Study spam e-mails and identify patterns and the most frequently occurring words.
 2) Write a detection algorithm.
 3) Test, and repeat steps 1 and 2 until it is good enough.
WHY USE MACHINE LEARNING?
 Traditional approach: Study the problem → Write rules → Evaluate → Launch! (with an "Analyze errors" loop back to studying the problem)
WHY USE MACHINE LEARNING?
 Machine learning approach: Study the problem → Train an ML algorithm on data → Evaluate → Launch! (with an "Analyze errors" loop back to studying the problem)
WHY USE MACHINE LEARNING?
 Consider the example of recognizing handwritten digits.
 Each digit corresponds to a 28x28 pixel image, and so can be represented by a vector x comprising 784 real numbers.
 Goal: build a machine that will take such a vector x as input and produce the identity of the digit 0, …, 9 as the output.
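The flattening step described above can be sketched in a few lines of Python (an all-zero image stands in for real pixel data):

```python
# A digit image as a 28x28 grid of pixel intensities (zeros here for brevity).
image = [[0.0] * 28 for _ in range(28)]

# Flatten row by row into a single vector x of 784 real numbers.
x = [pixel for row in image for pixel in row]
print(len(x))  # 784
```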
WHY USE MACHINE LEARNING?
 Better to use a machine learning approach, where a large set of N digits, called the training set, is used to tune the parameters of an adaptive model.
 The categories of the digits in the training set are known in advance → target vector t.
 The goal is to determine the function y(x), which takes a new digit image x as input and generates an output vector y → the learning (training) phase.
 Once the model is trained, we can run it on the test set.
 The ability to correctly categorize new examples that differ from those used for training is known as generalization.
WHY USE MACHINE LEARNING?
 For problems that are too complex for the traditional approach.
 For problems that have no known algorithmic solution.
 Ex.: speech recognition
 Helps humans learn: applying ML techniques to large amounts of data can reveal patterns that were not immediately apparent → data mining.
SOME ML PROBLEMS
 Speech Recognition
 Document Classification
 Face Detection and Recognition
 ...
TYPES OF MACHINE LEARNING SYSTEMS
 Whether or not they are trained with human supervision: supervised, unsupervised, semi-supervised, reinforcement learning.
 Instance-based versus model-based learning.
SUPERVISED LEARNING
 The training data includes the desired solutions, called labels.
 Spam filter → classification
SPAM FILTERING AS A CLASSIFICATION TASK
MACHINE LEARNING FOR SPAM FILTERING
SUPERVISED LEARNING
 The training data includes the desired solutions, called labels.
 House price prediction → regression
SUPERVISED LEARNING
 Some of the most important supervised algorithms:
 K-Nearest Neighbours
 Linear Regression
 Naïve Bayes
 Logistic Regression
 Support Vector Machines
 Decision Trees and Random Forests
 Neural Networks
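To make one entry in this list concrete, a minimal k-nearest-neighbours classifier can be sketched in plain Python (the toy points and labels below are made up):

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    k_labels = [label for _, label in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (2, 2)))  # "a"
print(knn_predict(points, labels, (8, 7)))  # "b"
```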
UNSUPERVISED LEARNING
 The training data is unlabelled.
 The system tries to learn without anyone's
guidance.
UNSUPERVISED LEARNING
 Some of the most important unsupervised algorithms:
 Clustering
   K-Means
   Hierarchical Cluster Analysis (HCA)
   Expectation Maximization
 Visualization and Dimensionality Reduction
   Principal Component Analysis (PCA)
   Locally-Linear Embedding (LLE)
   t-Distributed Stochastic Neighbour Embedding (t-SNE)
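To make the first entry concrete, here is a minimal k-means sketch in plain Python (toy 2-D points and a fixed seed; a real implementation would also check for convergence):

```python
import math
import random

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster (keep old one if empty).
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(sorted(kmeans(points, 2)))  # two centroids, one per cluster
```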
SUPERVISED/UNSUPERVISED LEARNING
INSTANCE-BASED VS. MODEL-BASED LEARNING
 Most ML problems are about making predictions:
 Given training examples, the system needs to be able to generalize to examples it has never seen before.
 The true goal is to perform well on new instances.
 Two main generalization approaches:
 Instance-based: the system learns the examples by heart, then generalizes to new cases using a similarity measure.
 Model-based: the system generalizes from a set of examples by building a model of these examples, then uses that model to make predictions.
INSTANCE-BASED LEARNING
MODEL-BASED LEARNING
REGRESSION PROBLEM
LINEAR REGRESSION
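Model-based learning can be illustrated with the simplest possible model, a least-squares line fit; the data below is made up so that the fit is exact:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1*x (a minimal model-based learner)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w1 * mean_x
    return w0, w1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 1 + 2x
w0, w1 = fit_line(xs, ys)
print(w0, w1)              # 1.0 2.0
print(w0 + w1 * 10)        # prediction for a new case: 21.0
```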
PROJECT PHASES
 Study the data
 Select a learning algorithm
 Train it on the training data
 Apply the model to make predictions on new cases
MAIN CHALLENGES IN MACHINE LEARNING
 Two things that can go wrong:
 Bad data
 Bad algorithm
BAD DATA
 Insufficient quantity of training data
 It takes a lot of data for most ML algorithms to work
properly.
 Non-representative training data
 It is crucial that your training data is representative of the
new cases you want to generalize to.
 Poor-quality data
 It is better to spend time cleaning up the training data:
decide about outliers and missing features.
 Irrelevant features
 Feature engineering involves:
 Feature selection: selecting the most useful features to train on
among existing features
 Feature extraction: combining existing features to produce a
more useful one
 Creating new features by gathering new data
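Feature extraction, as described above, can be sketched in a couple of lines; the housing fields below are hypothetical:

```python
# Feature extraction: combine two raw features into one more useful feature.
# Hypothetical housing records with area in square metres and sale price.
records = [
    {"area_m2": 100, "price": 250_000},
    {"area_m2": 80,  "price": 240_000},
]

for r in records:
    # A derived feature that is often more informative than either raw one.
    r["price_per_m2"] = r["price"] / r["area_m2"]

print([r["price_per_m2"] for r in records])  # [2500.0, 3000.0]
```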
BAD ALGORITHM
 Overfitting the training data:
 Happens when the model performs well on the training data but does not generalize well.
 Underfitting the training data:
 Happens when the model is too simple to learn the underlying structure of the data.
BAD ALGORITHM: EXAMPLE
 Simple regression problem: suppose we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t.
 The data for this example is generated from the function sin(2πx), with random noise included in the target values.
 Suppose we are given a training set containing N observations of x, written x = (x_1, ..., x_N), together with the corresponding observations of t, written t = (t_1, ..., t_N).
BAD ALGORITHM: EXAMPLE
 N = 10; the input data set x is generated by choosing values of x_n, for n = 1, ..., N, spaced uniformly in the range [0, 1].
 The target data set t is obtained by computing sin(2πx_n) for the corresponding x values and adding a small level of noise having a Gaussian distribution.
 Goal: exploit the training set in order to make predictions of the value of the target variable for some new value of the input variable.
 In other words, we are trying to discover the underlying function sin(2πx).
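Assuming, as in the classic version of this example, that the underlying function is sin(2πx), the training set can be generated as follows (the noise level and seed are illustrative choices):

```python
import math
import random

def make_dataset(n=10, noise_std=0.3, seed=1):
    """N inputs spaced uniformly in [0, 1]; targets are sin(2*pi*x) plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    ts = [math.sin(2 * math.pi * x) + rng.gauss(0, noise_std) for x in xs]
    return xs, ts

xs, ts = make_dataset()
print(len(xs), len(ts))  # 10 10
```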
POLYNOMIAL CURVE FITTING
 Fit the data using a polynomial function of the form:
 y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M
 M: order of the polynomial
 w = (w_0, ..., w_M): coefficients
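The polynomial y(x, w) can be evaluated efficiently with Horner's scheme; a short sketch (the coefficients are made up):

```python
def poly(x, w):
    """Evaluate y(x, w) = w[0] + w[1]*x + ... + w[M]*x**M via Horner's scheme."""
    result = 0.0
    for coeff in reversed(w):
        result = result * x + coeff
    return result

w = [1.0, -2.0, 3.0]       # M = 2: y = 1 - 2x + 3x^2
print(poly(2.0, w))        # 1 - 4 + 12 = 9.0
```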
CURVE FITTING
TESTING AND VALIDATING
 Once you have a trained model, evaluate it and fine-tune it.
 Split your data into two sets: the training set and the test set.
 Generalization error: the error rate on new cases, estimated by evaluating the model on the test set.
 If the training error is low (the model makes few mistakes on the training set) but the generalization error is high, the model is overfitting the training set.
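The overfitting check described above can be sketched with a deliberately bad model that memorizes its training set (all data here is made up):

```python
def error_rate(model, examples):
    """Fraction of labelled examples the model misclassifies."""
    wrong = sum(model(x) != y for x, y in examples)
    return wrong / len(examples)

# A model that memorized its training set (illustrative overfitting):
train = [(1, "spam"), (2, "ham"), (3, "spam")]
test  = [(4, "ham"), (5, "spam")]
memorized = dict(train)
model = lambda x: memorized.get(x, "spam")  # guesses "spam" for unseen inputs

print(error_rate(model, train))  # 0.0 -> low training error
print(error_rate(model, test))   # 0.5 -> high generalization error: overfitting
```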
HOW DOES ML HELP TO SOLVE A TASK?
