Machine Learning and Data Mining
Introduction to Artificial Intelligence
CBAS
School of Physical and Mathematical Sciences
2020/2021 – 2022/2023
Subsets of Artificial Intelligence
Lesson Goals:
• Understand the basic concepts of the learning problem and why/how machine
learning methods are used to learn from data to find underlying patterns for
prediction and decision-making.
• Understand the basic concepts of assessing model accuracy and the bias-variance
trade-off.
Introduction
• Machine learning is making great strides
– Large, good data sets
– Compute power
– Progress in algorithms
• Many interesting applications
– commercial
– scientific
• Links with artificial intelligence
– However, AI ≠ machine learning: machine learning is one subset of AI
Big Data is Everywhere
• We are in the era of big data!
– 40 billion indexed web pages
– 100 hours of video are uploaded
to YouTube every minute
• The deluge of data calls for
automated methods of data
analysis, which is what
machine learning provides!
What is Machine Learning?
• Machine learning is a set of methods that can
automatically detect patterns in data.
• Communication Systems
– Speech recognition, image analysis
The Learning Problem
• Learning from data is used in situations where we
don’t have any analytic solution, but we do have data
that we can use to construct an empirical solution
[Figure: training data, y plotted against x]
The Learning Problem: Example (cont.)
• First, we observe a set of n training data points:
$\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$
• Second, we use the training data and a machine learning
method to estimate f.
– Parametric or non-parametric methods
Parametric Methods
• Step 1:
– We make an assumption about the functional form of f. A common
assumption is that f is linear in the predictors:
$f(X_i) \approx \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$
– This reduces the learning problem of estimating the target
function f down to a problem of estimating a set of parameters.
– In this course, we will examine far more complicated and
flexible models for f.
Parametric Methods (cont.)
• Step 2:
– We use the training data to fit the model (i.e. estimate the
unknown parameters of f).
– The most common approach for estimating the parameters in a
linear model is via ordinary least squares (OLS) linear
regression.
– However, there are superior approaches, as we will see in this
course.
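As a minimal sketch of Step 2 (the synthetic data, variable names, and use of NumPy's least-squares solver are illustrative, not from the slides), OLS estimates the parameters by minimizing the residual sum of squares:

```python
import numpy as np

# Synthetic training data: n observations, p = 2 predictors (illustrative only)
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])              # intercept + two slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# OLS: append a column of ones for the intercept, then solve least squares
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)  # estimates should be close to beta_true
```

The same computation is wrapped by higher-level tools such as scikit-learn's LinearRegression; the closed-form solve above is just the most transparent version.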
Example: Income vs. Education and Seniority
Example: OLS Regression Estimate
• Even if the error standard deviation is low, we will still get a
poor estimate if we assume the incorrect model form.
Non-Parametric Methods
• As opposed to parametric methods, these do not make
explicit assumptions about the functional form of f.
• Advantages:
– They can accurately fit a wider range of possible shapes of f.
• Disadvantages:
– They require a very large number of observations to obtain an
accurate estimate of f.
Example: Thin-Plate Spline Estimate
• Non-linear regression
methods are more flexible
and can potentially provide
more accurate estimates.
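A hedged sketch of a thin-plate spline fit (synthetic data; scipy's RBFInterpolator is one implementation of this idea, not necessarily the one behind the slide's figure):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Synthetic stand-in for (Education, Seniority) -> Income (illustrative)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 2))                 # two predictors
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Thin-plate spline fit; `smoothing` controls flexibility (0 = interpolate exactly)
tps = RBFInterpolator(X, y, kernel="thin_plate_spline", smoothing=1.0)

# Predict at new input points
X_new = rng.uniform(0, 1, size=(5, 2))
print(tps(X_new))
```

Note that no functional form for f is assumed; flexibility is governed by the smoothing parameter rather than by a fixed set of coefficients.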
Predictive Accuracy vs. Interpretability
• Conceptual Question:
– Why not just use a more flexible method if it is more realistic?
• Reason 1:
– A simple method (such as OLS regression) produces a model
that is easier to interpret (especially for inference purposes).
Predictive Accuracy vs. Interpretability (cont.)
• Reason 2:
– Even if the primary
purpose of learning from
the data is for prediction, it
is often possible to get
more accurate predictions
with a simple rather than a
complicated model.
Learning Algorithm Trade-off
• There are always two aspects to consider when designing
a learning algorithm:
– Try to fit the data well
– Be as robust as possible
Supervised vs. Unsupervised Learning
• Supervised Learning:
– All the predictors, Xi, and the response, Yi, are observed.
• Many regression and classification methods
• Unsupervised Learning:
– Here, only the Xi’s are observed (not Yi’s).
– We need to use the Xi's to guess what Y would have been, and
then build a model from there.
• Clustering and principal components analysis
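As a minimal unsupervised-learning sketch (synthetic, unlabeled data; scikit-learn's KMeans is used here purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two blobs in 2-D (synthetic, for illustration)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(3, 0.5, size=(50, 2))])

# K-means looks for structure in the X's alone -- no Y is used anywhere
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])        # cluster assignment for the first 5 points
print(km.cluster_centers_)   # estimated cluster centers
```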
Terminology
• Notation
– Input X: feature, predictor, or independent variable
– Output Y: response, dependent variable
• Categorization
– Supervised learning vs. unsupervised learning
• Key question: Is Y available in the training data?
– Regression vs. Classification
• Key question: Is Y quantitative or qualitative?
Terminology (cont.)
• Regression
– Covers situations where Y is quantitative: measurements or
counts, recorded as numerical values (e.g. height, temperature, etc.)
• Classification
– Covers situations where Y is categorical (qualitative)
– E.g. Will the Dow be up or down in 6 months? Is this email spam
or not?
Supervised Learning: Examples
• Email Spam:
– Predict whether an email is a junk email (i.e. spam)
• Handwritten Digit Recognition:
– Identify single digits 0–9 based on images
• Face Detection/Recognition:
– Identify human faces
• Speech Recognition:
– Identify words spoken according to speech signals
– E.g. automatic voice recognition systems used by airline
companies, automatic stock price reporting, etc.
Supervised Learning: Methods
• Linear Regression
• Linear/Quadratic Discriminant Analysis
• Logistic Regression
• K Nearest Neighbors
• Decision Trees / CART
• Support Vector Machines
Unsupervised Learning
• The training data does not contain any output information
at all (i.e. unlabeled data).
Overfitting
Different Levels of Flexibility
[Figures: training vs. test MSE for models with different levels of flexibility]
Bias-Variance Trade-off
• The previous graphs of test versus training MSE
illustrate a very important trade-off that governs the
choice of machine learning methods.
• Note that the expected test MSE can never lie below the
irreducible error, Var(ε).
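For a test observation $x_0$, this is the standard bias-variance decomposition of the expected test MSE:

$$\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \operatorname{Var}\left(\hat{f}(x_0)\right) + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \operatorname{Var}(\varepsilon)$$

Since the variance and squared bias terms are both non-negative, the expected test MSE is bounded below by $\operatorname{Var}(\varepsilon)$, the irreducible error.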
Test MSE, Bias and Variance (cont.)
The Classification Setting
• For a classification problem, we can use the
misclassification error rate to assess the accuracy of the
machine learning method.
$$\text{Error Rate} = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i),$$
which represents the fraction of misclassifications.
• On test data, no classifier can get lower error rates than the
Bayes error rate.
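In code, this error rate is simply the mean of the indicator of disagreement (a minimal sketch; the label arrays are illustrative):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])   # illustrative true labels
y_pred = np.array([0, 1, 0, 0, 1])   # illustrative predictions

# Fraction of misclassified observations: mean of the indicator I(y_i != yhat_i)
error_rate = np.mean(y_true != y_pred)
print(error_rate)  # 0.2
```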
• For the K-nearest neighbors (KNN) classifier, the smaller K is,
the more flexible the method will be.
[Figure: KNN decision boundary for K = 10 versus the Bayes decision boundary]
[Figure: KNN decision boundaries for K = 1 and K = 100]
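A closing sketch of KNN flexibility (synthetic two-class data; scikit-learn's KNeighborsClassifier, with K values matching the figures above): K = 1 fits the training data perfectly, while large K gives a smoother, less flexible fit.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class training data (illustrative)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(2, 1, size=(100, 2))])
y = np.repeat([0, 1], 100)

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    train_err = np.mean(knn.predict(X) != y)
    # K = 1 gives zero training error (most flexible); large K smooths the boundary
    print(f"K={k:>3}  training error = {train_err:.3f}")
```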