
Introduction to Machine Learning

Theory and Practice


David R. Pugh
Instructional Assistant Professor, KAUST
Director, SDAIA-KAUST AI

• 5+ years teaching applied machine learning and deep learning at KAUST.
• 2+ years as the director of SDAIA-KAUST AI, where I work to match applied AI problems of interest to SDAIA with AI solutions developed at KAUST.
• 15+ years of experience with the core data science Python stack: NumPy, SciPy, Pandas, Matplotlib, NetworkX, Jupyter, Scikit-Learn, PyTorch, etc.

Agenda
Introduction to Machine Learning: Theory and Practice

09:00 - 09:05 Welcome and Opening Remarks Prof. David Pugh

09:05 - 10:30 The Machine Learning Landscape Prof. David Pugh

10:30 - 10:45 Break

10:45 - 12:00 Classification and Regression Prof. David Pugh

12:00 - 13:00 Lunch

13:00 - 14:30 Linear Regression with NumPy Prof. David Pugh + TAs

14:30 - 14:45 Break

14:45 - 16:00 Introduction to Scikit-Learn Prof. David Pugh + TAs

References

• Slides closely follow Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
• Another great reference is Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka.
• The official Scikit-Learn documentation is also fantastic.



The ML Landscape

What is the difference between AI and ML?



What is ML?

• ML is the science (and art) of programming computers so they can learn from data (Géron, 2019).
• [ML is the] field of study that gives computers the ability to learn without being explicitly programmed (Samuel, 1959).
• A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E (Mitchell, 1997).



Why is ML so popular right now?

Stanford's Coursera machine learning course had more than 100,000 students express interest in its first year.

1. The field has matured both in terms of identity and in terms of methods and tools.
2. There is an abundance of data available.
3. There is an abundance of computation to run methods.
4. There have been impressive results, increasing acceptance, respect, and competition.

Resources + Ingredients + Tools + Desire = Popularity

Based on: http://machinelearningmastery.com/machine-learning-is-popular/?__s=yq1qzcnf67sfiuzmnvjf


The traditional approach is model/rules-based...



...the ML approach is data-driven!



ML adapts to change!



ML can help humans learn!



Types of ML systems

• Supervised vs unsupervised
• Semi-supervised vs self-supervised
• Batch (offline) vs incremental (online)
• Instance-based vs model-based



Supervised learning

Classification | Regression



Other forms of supervised learning
Semi-supervised learning | Self-supervised learning



Unsupervised learning
Clustering | Data visualization



Reinforcement Learning



Batch (offline) vs incremental (online) learning

Batch (offline) learning | Incremental (online) learning



Out-of-core learning



Instance-based vs model-based learning

Instance-based learning | Model-based learning
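
For intuition, here is a minimal sketch in scikit-learn (the toy data and model choices are illustrative assumptions, not from the slides): a k-nearest neighbors regressor predicts by comparing new points to stored training instances, while linear regression generalizes by learning parameters from the data.

```python
# Minimal sketch: instance-based vs model-based learning (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=50)

# Instance-based: "memorizes" the training set, predicts from nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Model-based: learns a parametric model (here, a slope and intercept).
lin = LinearRegression().fit(X, y)

print(knn.predict([[5.0]]), lin.predict([[5.0]]))  # both near 10.0
print("learned slope:", lin.coef_)  # only the model-based learner has parameters
```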



Main Challenges of Applying ML


• Insufficient quantity of training data
• Non-representative training data
• Poor quality data
• Irrelevant features
• Overfitting the training data
• Underfitting the training data



Insufficient quantity of training data

• The more data for training, the better!
• It can take a lot of data for most ML algorithms to work.
• "Simple" problems often require O(10k) samples.
• "Complex" problems often require O(1m) samples.
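
To make this concrete, here is a minimal sketch (the dataset and model choices are illustrative assumptions) that uses scikit-learn's learning_curve to show how validation accuracy typically improves as more training samples are used:

```python
# Minimal sketch: accuracy as a function of training set size (illustrative).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)  # ~1800 samples of a "simple" problem
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% ... 100% of the training folds
    cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> mean CV accuracy {score:.3f}")
```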



Non-representative training data

• Need training data to be representative of new data for generalization.
• Sampling noise: not enough data => training data not representative by chance.
• Sampling bias: poor sampling technique => training data not representative (biased).



Poor quality training data

• Data can be full of errors, outliers, and noise (e.g., due to poor-quality measurements).
• Dirty data => hard for any algorithm to detect patterns.
• Significant amount of your time will be spent cleaning data.
• Data types? Do you have numeric features? Ordinal features? Categorical features?
• Look for outliers in your data: Remove? Fix manually?
• Look for missing data: Remove? Impute values?
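
A minimal sketch of these cleaning steps with Pandas (the toy DataFrame below is an assumption, just to make the steps concrete):

```python
# Minimal sketch: inspecting dtypes, imputing missing values, flagging outliers.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 350],  # 350 looks like a data-entry error
    "income": [40_000, 52_000, 61_000, np.nan, 45_000, 48_000],
    "city": ["Jeddah", "Riyadh", None, "Thuwal", "Riyadh", "Jeddah"],
})

print(df.dtypes)  # numeric vs categorical features?

# Missing data: impute values (alternatively, drop rows with df.dropna()).
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna("unknown")

# Outliers: flag values outside 1.5 * IQR, then decide (remove? fix manually?).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)])
```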



Irrelevant features

Garbage in => garbage out!

• Learning requires sufficient relevant features (and not too many irrelevant ones!).
• Developing a good set of features for training is a critical part of an ML project.
• Significant amount of your time will be spent doing feature engineering.

Feature engineering is often critical to success.

• Feature selection: selecting the "best" subset of features for training.
• Feature extraction: combining existing features to produce new ones.
• Creating new features from new data.
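
A minimal sketch of both ideas in scikit-learn (the dataset and parameter choices are illustrative assumptions):

```python
# Minimal sketch: feature selection vs feature extraction (illustrative dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

# Feature selection: keep the 10 features most associated with the target.
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Feature extraction: combine all 30 features into 10 new components.
X_extracted = PCA(n_components=10).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (569, 30) (569, 10) (569, 10)
```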
Overfitting the training data

What is overfitting?

• Overfitting is when a model performs well on training data but poorly on new data.
• If the model is complex or the training data is limited, the model will detect spurious patterns.
• Constraining a complex model to make it simpler is called regularization.
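
A minimal sketch of both effects (the synthetic data and model settings are illustrative assumptions): a high-degree polynomial fit to a small noisy sample scores well on its own training data but poorly on fresh data, while a Ridge penalty (regularization) constrains the model and narrows the gap.

```python
# Minimal sketch: overfitting a small dataset, then regularizing (illustrative).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)


def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = 0.5 * X.ravel() ** 2 + rng.normal(scale=1.0, size=n)  # noisy quadratic
    return X, y


X_train, y_train = make_data(30)  # limited training data
X_new, y_new = make_data(200)     # "new" data the model has never seen

for name, reg in [("unregularized", LinearRegression()),
                  ("ridge (regularized)", Ridge(alpha=1.0))]:
    # Degree-20 polynomial: far more flexible than the data justifies.
    model = make_pipeline(PolynomialFeatures(degree=20), StandardScaler(), reg)
    model.fit(X_train, y_train)
    print(f"{name}: train R^2 = {model.score(X_train, y_train):.3f}, "
          f"new-data R^2 = {model.score(X_new, y_new):.3f}")
```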
Underfitting the training data

What is underfitting?

• Underfitting is when a model is too simple to learn the underlying structure of the data.
• Linear models will often underfit (but are often a good place to start).

How to reduce underfitting?

• Select a more complex (more parameters) model.
• Feed better features to the model (feature engineering).
• Reduce the constraints on the model (reduce regularization).
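
A complementary minimal sketch (the synthetic data is an illustrative assumption): a straight line badly underfits quadratic data, and simply feeding the model a better feature (x squared) fixes it.

```python
# Minimal sketch: an underfit linear model, fixed by feature engineering.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=200)  # quadratic data

line = LinearRegression().fit(X, y)
print("straight line, train R^2:", round(line.score(X, y), 3))  # low: underfit

X_better = np.hstack([X, X ** 2])  # feature engineering: add an x^2 column
quad = LinearRegression().fit(X_better, y)
print("with x^2 feature, train R^2:", round(quad.score(X_better, y), 3))  # high
```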



Validation and Testing

Why measure generalization error?

• The only way to know if your model is good is to measure performance on new data!
• Split your data into train and test sets: error on the test set is an estimate of generalization error.
• Low training error, high generalization error => overfitting!

Some train-test split heuristics:

• For datasets smaller than O(100k) samples, take 80% for train and hold out 20% for test.
• For larger datasets, O(1m) samples, hold out 1-10% of the dataset for test.
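
A minimal sketch of the 80/20 heuristic with scikit-learn's train_test_split (the dataset and model are illustrative assumptions):

```python
# Minimal sketch: estimating generalization error with a held-out test set.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train, 20% held-out test

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", round(model.score(X_train, y_train), 3))
print("test accuracy:", round(model.score(X_test, y_test), 3))  # generalization estimate
```

A test score far below the train score is the overfitting signature described above.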



Model Selection

• Often need to tune hyperparameters to find a good model within a particular class of models.
• How? Split the training data into a training set and a validation set.
• Validation set too small => might select a "bad" model by mistake.
• Validation set too large => training set too small!
• Cross validation: create lots of small validation sets, evaluate the model on each validation set, and measure average performance across validation sets.
• Always compare tuned models using the test set!
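
A minimal sketch tying these steps together (the model and hyperparameter grid are illustrative assumptions): GridSearchCV performs the cross validation described above, and the held-out test set is used only for the final comparison.

```python
# Minimal sketch: hyperparameter tuning with cross validation (illustrative).
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# cv=5: each hyperparameter candidate is scored on 5 small validation sets,
# and the average performance across them decides the winner.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
).fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))  # final check
```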
Model selection process



Thanks!

