
Week 1: ML Intro; Linear Models

MBusA Machine Learning 2022

Copyright: University of Melbourne


Who?
• Lecturer
* James Bailey ([email protected])

2
Why Machine Learning

3
Motivation
• “We are drowning in information,
but we are starved for knowledge”
- John Naisbitt, Megatrends

• Data = raw information


• Knowledge = patterns or models behind the data

4
Solution: Machine Learning
• Hypothesis: pre-existing data repositories contain a lot of
potentially valuable knowledge

• Mission of learning: find it

• One definition of ML:


(semi-)automatic extraction of valid, novel, useful and
comprehensible knowledge – in the form of rules, regularities,
patterns, constraints or models – from arbitrary sets of data

5
Applications: Widespread
• Online ad selection and placement
• Risk management in finance, insurance, security

• High-frequency trading
• Medical diagnosis
• Mining and natural resources
• Malware analysis

• Drug discovery
• Search engines
• Education
• Sport

• …
6
Draws on Many Disciplines
• Artificial Intelligence
• Statistics
• Continuous optimisation
• Databases
• Information Retrieval
• Communications/information theory
• Signal Processing
• Computer Science Theory
• Philosophy
• Psychology and neurobiology
• Linguistics

7
Data Science / BusA Landscape

• Computing
  - Data wrangling
  - Machine learning
  - Data mining
  - Databases
  - Distributed computing
  - AI

• Statistics
  - Robust models and methods
  - Sampling
  - Hypothesis testing
  - …

• Domain expertise
  - Health
  - Business
  - Social sciences
  - …

9
AI, Machine Learning, Big Data
[Figure: overlapping fields. Artificial Intelligence ("intelligent machines and software"): planning, reasoning, decision making, logic. Statistics / ML. Big Data / data processing.]
10
“AI is the new electricity” – Andrew Ng

Data-driven, intelligent systems

[Screenshots: Amazon’s “Customers Who Bought This Item Also Bought” book recommendations; Netflix’s “Top Picks for Kids” page.]
11
Jobs
Numerous companies
across all industries hire
ML experts:

Data Scientist
Analytics Expert
Business Analyst
Statistician
Software Engineer
Researcher

12
Companies Employing our Students in ML Roles

Telstra, Citibank, Danske Bank, Deutsche Bank, NAB, ANZ,


Veda, Tencent, LexisNexis Risk Solutions, GE Capital, Deloitte,
PwC, Accenture, IBM Research, IBM, Sportsbet,
OpenBet, CrowdsourceHire, Hugo, Flipkart, Rome2rio,
Breadtrip, SAP, Salesforce, Hitachi, Oracle, Google, Apple,
Microsoft, Amazon, Groupon, Nokia, CSIRO, MongoDB, DST
Group, Data61, Evernote, Teradata, Kepler Analytics, Business
Predictions, Thales, Tata, LinkedIn, Ford, Huawei, KPMG,
northraine, Woolworths, jet.com, Microsoft Research, SAS,
Peter MacCallum Cancer Centre, Commonwealth Bank,
Computershare, Blackmagic Design, Baker IDI, AIG, ….

13
Discussion
Share an example of how machine
learning can help in either your
workplace or in your daily life.
About this Subject

15
Teaching Staff
• Lecturer (James)

• Tutors
* Curtis (Hanxun) Huang
  [email protected]
  PhD candidate, School of Computing and Information Systems
* Edmund Lau
  [email protected]
  PhD candidate, School of Mathematics and Statistics
* Yuning Zhou
  [email protected]
  PhD candidate, School of Computing and Information Systems

16
Getting Help
• Machine learning subject on Canvas is operational
* Please check for announcements, lecture and workshop
materials, discussion forums, ….

• Ask questions during and after lecture


• Post questions to Canvas (before emailing if possible)
• Ask questions to your tutor during afternoon session
• Consultation by appointment – send an email
* If emailing, please start subject line with “BUSA90501”

17
Timetable
• 9am-10:20am Part A
* 9:00-9:15 of Part A reserved for quizzes (Weeks 3-7)

• 10:20-10:40 Break

• 10:40-12:00 Part B

• 2:00-5:00 Workshop.

18
Relation to Other Subjects
• Machine learning (aka “Statistical Learning II”) versus
* Statistical Learning
* Predictive Analytics
* Text and Web Analytics

19
Versus “Statistical Learning”
• Complementary, with some overlap on regression

• “Statistical Learning” more emphasis on


* Statistical flavour; Frequentist stats (MLE)
* (Generalised) linear models
* Statistical validation techniques (e.g. tests on model coefficients)

• This subject
* More computer science (CS) in flavour
* More: Regularisation, nonlinear & computational perspectives
* Covers variety of learning tasks beyond regression

20
Versus “Predictive Analytics”
• Complementary, again with some overlap, mostly in early material

• “Predictive Analytics” more emphasis on


* Econometrics flavour
* Time series forecasting, and model selection

• This subject
* Less time series; more CS-style approaches to prediction
* Drills further into the algorithms and their implementations;
scratches the surface of the theory

21
Versus “Text and Web Analytics”
• Also complementary, with some overlap

• “Text and Web analytics” emphasises


* Web data, info. retrieval, natural language processing
* Using machine learning algorithms
• Naïve Bayes, logistic regression, neural networks, clustering

• This subject “Machine Learning”


* Variety of approaches to prediction (not only for text)
* More time to go into the how and why

22
Assumed Knowledge
• Programming
* Familiarity with computer programming
* Load data, perform simple manipulations, call ML libraries,
inspect & plot results

• Maths
* Comfort with formal notation (“mathematical maturity”)
* Familiarity with probability (e.g. Bayes rule, multivariate
distributions)
* Exposure to optimisation (and some calculus, linear algebra)

• Masters level subject

23
Textbooks (Optional)
• No set textbook for the subject. You can find the required information in lecture
notes, supplemented with easily found information on the Web. Some independent
research is reasonable for a masters-level subject. However, there are a number of
good books that can be used as references:

• Hastie, Tibshirani, and Friedman (2009), The Elements of Statistical Learning: Data
Mining, Inference and Prediction
* This is a seminal book on machine learning that covers major machine learning
tools in depth with great rigour.

• Bishop (2007), Pattern Recognition and Machine Learning


* The content of this book overlaps with the previous one, but the author
often explains the concepts in a different way that is sometimes more
accessible. Having a second perspective can also be beneficial.

26
Textbooks (Optional)
• Russell and Norvig (2002), Artificial Intelligence: A
Modern Approach
* A very broad (but less deep) overview of the whole field of
artificial intelligence, including machine learning
• Data mining resources are also useful
* Data Mining, Fourth Edition: Practical Machine Learning
Tools and Techniques (Morgan Kaufmann Series in Data
Management Systems), 4th edition, Witten, Frank, Hall and
Pal.
* Introduction to Data Mining, 1st edition, Tan, Steinbach and
Kumar.

27
Materials

• Lectures and workshop content will be posted to


Canvas.
• Where possible these will be posted several days in
advance.
• Any updates will be flagged in Canvas

28
Software – Python Stack
• We will be using Python 3 as the primary language in
workshops
• Get a copy of Python for your machine
* The Anaconda distribution is particularly convenient
• https://www.continuum.io/downloads
* Jupyter used extensively in workshops (and industry!)
* See Software Guide published to Canvas

• You are welcome to use other languages (e.g. R) as well, but not
instead of Python.

29
Assessment
• 25% individual short in-lecture quizzes (Weeks 3 to 8)
* 5 quizzes (each 13 min, plus 2 min reading time)
* Worth 5% each

• 25% syndicate assignment with report


(Released ~Week 3, due in ~Week 6)
* Hands-on machine learning experience

• 50% final individual exam, ~3 hours (Week 9)

30
Syndicate assignment
We are planning to use a property dataset from
ANZ (the CoreLogic dataset)

You will be requested to sign a confidentiality
agreement, indicating that you will keep the dataset
confidential, not distribute it, etc.

Signatures for this confidentiality agreement will be
collected via a Canvas assignment (you will upload
a signed personal confidentiality deed)
Subject Plan (Preliminary)
• Week 1:
* Introduction, performance evaluation, ML approaches, linear models

• Week 2:
* Feature selection and decision trees. Ensemble methods, bagging and
boosting

• Week 3:
* Regularization, support vector machines

• Week 4:
* Neural networks and optimisation
32
Subject Plan (cont.)
• Week 5:
* Boosting: gradient tree boosting (XGBoost) and AdaBoost

• Week 6:
* Unsupervised learning: Clustering
• Week 7:
* Unsupervised learning: Network analysis, community detection
and semi-supervised learning

• Week 8:
* Revision

33
Machine Learning – A Dizzying Array
• We will be looking at a range of machine learning
techniques
* Regression, naïve Bayes, decision trees, random forests,
gradient tree boosting, neural networks, clustering,
community detection

• It can seem like a bag of tricks, without strong


connection between techniques …..

34
Machine Learning – Common Themes

1. Supervised (today) versus unsupervised (weeks 6-7)

2. Types of approaches to ML (today)

3. How varying loss functions lead to different learners

4. Use of regularisation to control model complexity, to
avoid overfitting

5. Input to training the model (matrix versus graph versus
text)

6. Single versus (ensemble of) multiple models: (weighted)
averages vs compositions

7. Important role of optimization in learning

35
ML Setup

Focus on evaluation first

36
Terminology
• Input to a machine learning system can consist of
* Instance (aka object): measurements about individual
entities/objects, e.g. a loan application
* Attribute (aka feature): a component of the instances,
e.g. the applicant’s salary, number of dependents, etc.
* (Class) label: an outcome that is categorical, numeric, etc.,
e.g. forfeit vs. paid off
* Example: an instance coupled with its label,
e.g. <(100k, 3), “forfeit”>
* Model: discovered relationship between attributes and/or label

37
Terminology
Height Weight Age Gender
1.8 80 22 Male
1.53 82 23 Male
1.6 62 18 Female

• The 4 columns (height, weight, age, gender) are features or


attributes
• The data items (3 rows) are called instances or objects or
samples
• Height, Weight and Age are continuous features
• Gender is a categorical or discrete feature

38
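As a small illustration, a table like the one above is commonly held as a pandas DataFrame, one row per instance and one column per feature. A minimal sketch, assuming pandas is installed (it ships with the Anaconda distribution used in workshops):

```python
# The table above as a pandas DataFrame (sketch; assumes pandas is installed).
import pandas as pd

df = pd.DataFrame({
    "Height": [1.80, 1.53, 1.60],          # continuous feature
    "Weight": [80, 82, 62],                # continuous feature
    "Age":    [22, 23, 18],                # continuous feature
    "Gender": ["Male", "Male", "Female"],  # categorical feature
})
print(df.shape)  # (3, 4): 3 instances, 4 features
```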
Supervised vs Unsupervised Learning
• Supervised learning
  - Data: labelled
  - Model used for: predicting labels on (typically) new instances –
    encompasses classification, regression, ordinal regression/ranking, etc.

• Unsupervised learning
  - Data: unlabelled
  - Model used for: clustering related instances; understanding attribute
    relationships; visualisation; etc.

39
Architecture of a Supervised Learner

[Diagram: examples (instances + labels) form the train data fed to the Learner, which outputs a Model (more soon); the instances of the test data are fed to the Model, and its predicted labels are compared against the true labels in Evaluation.]
40
Evaluation (Supervised Learners)
• How you measure quality depends on your problem!
• Typical process
* Pick an evaluation metric comparing label vs prediction
* Procure an independent, labelled test set
* “Average” the evaluation metric over the test set

• Example evaluation metrics


* Accuracy, Precision-Recall, Root Mean Squared Error

41
Training and Testing: If Sufficient Data
• Divide data into:
* Training set (e.g. 2/3)
* Test set (e.g. 1/3)

• Learn model (e.g. logistic


regression) using the training
set
• Evaluate performance of
model on the test set

Workshop will
cover cross
validation
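A minimal sketch of this train/test protocol, assuming scikit-learn and substituting one of its built-in datasets for illustration:

```python
# Train/test split sketch (assumes scikit-learn; the dataset is a stand-in).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Divide the data: ~2/3 training, ~1/3 test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Learn a model (e.g. logistic regression) on the training set only
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Evaluate performance of the model on the held-out test set
print("test accuracy:", model.score(X_test, y_test))

# Cross validation (see workshop) averages over several such splits
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```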
Why Evaluate on “Independent” Data?

43
Why Evaluate on “Independent” Data?

44
Why Evaluate on “Independent” Data?

45
Metrics for Performance Evaluation
• Can be summarised in a confusion matrix (contingency table)
  - Actual class: {yes, no, yes, yes, …}
  - Predicted class: {no, yes, yes, no, …}

• For binary classification:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

a: TP (true positive), b: FN (false negative),
c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

* Actual class: {yes, no, yes, yes, no, yes, no, no}
* Predicted:    {no, yes, yes, no, yes, no, no, yes}

                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   a = 1 (TP)   b = 3 (FN)
CLASS    Class=No    c = 3 (FP)   d = 1 (TN)
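A quick plain-Python check of the worked example above (the two label sequences are taken from the slide):

```python
# Confusion-matrix counts and accuracy for the example above (plain Python).
actual    = ["yes", "no", "yes", "yes", "no", "yes", "no", "no"]
predicted = ["no", "yes", "yes", "no", "yes", "no", "no", "yes"]

pairs = list(zip(actual, predicted))
tp = sum(a == "yes" and p == "yes" for a, p in pairs)  # a = 1
fn = sum(a == "yes" and p == "no"  for a, p in pairs)  # b = 3
fp = sum(a == "no"  and p == "yes" for a, p in pairs)  # c = 3
tn = sum(a == "no"  and p == "no"  for a, p in pairs)  # d = 1

print((tp + tn) / (tp + tn + fp + fn))  # accuracy = 2/8 = 0.25
```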
Limitations of Accuracy
• Consider a 2-class problem
* Number of Class 0 examples = 9990
* Number of Class 1 examples = 10

• If model predicts everything to be class 0,


accuracy is 9990/10000 = 99.9 %

Question: Is accuracy useful here? Why?


Exercise
Suppose you are designing a system that
predicts presence of brain cancer from MRI
data. What is more important, precision or
recall?

Would your answer change if instead the


system was predicting COVID-19 infection,
based on audio of coughing?
More Metrics
Suppose there are P positive instances and N negative instances

• True positive rate (aka sensitivity, recall): TP/P

• True negative rate (aka specificity): TN/N

• False positive rate: FP/N=1-specificity

• Precision: TP/(TP+FP)

• F-measure (F1-score): 2TP/(2TP+FP+FN)

  = 2 / (1/recall + 1/precision)
Example

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   TP = 60     FN = 40
CLASS    Class=No    FP = 140    TN = 9760

What is accuracy? What is precision?
What is recall? What is F1-score?
(See the sketch below.)

Accuracy = (60 + 9760)/10000 = 0.982
Precision = 60/200 = 6/20 = 0.3
Recall = 60/100 = 0.6
F-measure = 2/(20/6 + 10/6) = 2/(30/6) = 0.4
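A minimal sketch collecting the metrics above into one function and checking it against this example (plain Python; the function name is illustrative):

```python
# Metrics from confusion-matrix counts, checked against the worked example.
def metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn          # P positive and N negative instances
    return {
        "accuracy":    (tp + tn) / (p + n),
        "recall":      tp / p,                 # true positive rate / sensitivity
        "specificity": tn / n,                 # true negative rate
        "fpr":         fp / n,                 # = 1 - specificity
        "precision":   tp / (tp + fp),
        "f1":          2 * tp / (2 * tp + fp + fn),
    }

print(metrics(tp=60, fn=40, fp=140, tn=9760))
# accuracy = 0.982, precision = 0.3, recall = 0.6, f1 = 0.4
```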
ROC Curves
• Many classification algorithms output not only a
classification for each test instance but also some
“rating” of classification accuracy:
• naive Bayes, logistic regression, ... support vector machines,
neural networks
• Often in machine learning tasks, we can afford the
luxury of “skimming off” a subset of the instances with
higher classification plausibility
• Also, we are often more interested in how reliably we
can predict a small subset of positive instances than
the vast majority of negative instances
• Is this a good classifier?
Receiver Operating Characteristic (ROC) Curves

• Reflects trade-off between


• TPR= TP/(TP+FN)
• FPR=FP/(FP+TN)

• Many models output “score” of classification confidence:


* naive Bayes, logistic regression, neural networks, etc.
• The ROC curve is formed by thresholding this score
• Convenient graphic tool for:
* visualising the ability of a classifier to classify positive instances;
* visually comparing classifiers over a given test set;
* arriving at a single-figure classifier evaluation metric (cf. F-score)
* selecting an operating threshold based on resulting trade-off
ROC Curve Example

[Figure: an example ROC curve. A classifier that is perfect on all test points sits at the top-left corner; predicting all test points as positive gives the top-right corner; predicting all test points as negative gives the bottom-left corner; predicting by a random coin traces the diagonal.]
Generating ROC Curves
Area Under the ROC Curve (AUC)
• Scalar “figure of merit” for a given classifier based on the
ROC curve by calculating the area under the curve (AUC):
* AUC = 1: perfect classifier
* AUC = 0.5: random baseline classifier

• Advantages
* Can compare classifiers inc. relative to a baseline
* Unbiased estimate of: probability that a randomly chosen
positive instance will be ranked higher than a randomly chosen
negative instance. Why is this an advantage in practice?
Generating ROC Curves Example

If score > 0.95 then predict + else predict - (final column)
If score >= 0.95 then predict + else predict - (2nd last column)
If score >= 0.93 then predict + else predict - (3rd last column)
…
If score >= 0.85 then predict + else predict - (2nd column)
Generating ROC Curves Example

Area under curve=1 - (1/3)*(1/2)=5/6
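A minimal sketch of building a ROC curve by thresholding scores, assuming scikit-learn; the labels and scores below are hypothetical, chosen so that the AUC matches the 5/6 above:

```python
# ROC curve by thresholding scores (assumes scikit-learn; data is hypothetical).
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([1, 0, 1, 0, 0])                 # 2 positives, 3 negatives
y_score = np.array([0.95, 0.93, 0.87, 0.85, 0.50])  # classifier confidence scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_score))               # 0.8333... = 5/6
```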


Approaches to Learning
With example learners on linear classifiers

61
Major Frameworks in Statistical ML

62
Major Types of Supervised Models
• Given instance x, wish to predict response y
• Recall conditional probability: Pr(y|x) = Pr(x,y) / Pr(x)
• Example: wish to distinguish between Swedish and Russian speech

• Discriminative models
* Model only Pr(y|x)
* E.g. logistic regression (also linear regression, SVMs, …)
* Identify characteristics that differentiate the languages; use their
presence/absence to compute Pr(Swedish|speech) and Pr(Russian|speech)

• Generative models
* Model the full joint Pr(x,y) = Pr(y|x) Pr(x)
* E.g. naïve Bayes
* Learn to speak Russian and Swedish, then classify the
speech with your knowledge of each language
Linear Models
Discriminative approaches, mostly as a refresher

You have seen some of these methods already in the “Statistical
Learning” module.

64
Bayes Rule

Pr(y|x) = Pr(x|y) Pr(y) / Pr(x)

Bayes rule in action
Naïve Bayes (NB) Classifiers

Simplifying assumption: attributes are conditionally independent
given the class, Pr(x1, …, xm | y) = Pr(x1|y) × … × Pr(xm|y)

The final NB formulation: predict the class y maximising
Pr(y) × Pr(x1|y) × … × Pr(xm|y)

Estimating the probabilities (1)
Estimating the probabilities (2)
Naïve Bayes in Action
Marginals

Q: Why not sum to one?

76
Naïve Bayes: Summary
• A simple linear classifier with a generative model
• Frequentist: Its probabilistic model is fit by MLE
• Bayesian? Bayes rule, but not necessarily Bayesian!
• Naïve? It models strong independence assumptions
• Easy to implement, fast, scalable; good baseline
• Can handle continuous features (use Gaussians)
• Can handle missing data (just ignore; v simple!)
• Scores not always great; Feature correlations ignored

77
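A minimal sketch of naïve Bayes in practice, assuming scikit-learn; GaussianNB handles continuous features with per-class Gaussians fit by MLE, as noted above (the iris dataset is a stand-in):

```python
# Gaussian naive Bayes sketch (assumes scikit-learn; dataset is a stand-in).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)  # class priors + per-feature Gaussians, by MLE
print(nb.score(X_test, y_test))          # accuracy on held-out data
print(nb.predict_proba(X_test[:3]))      # per-class posteriors Pr(y|x) via Bayes rule
```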
Linear Regression

78
Example: Predict Humidity from Temperature

79
Method of Least Squares

Question: decision theoretic as


written, how equivalent to frequentist?

80
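Least squares fits the line that minimises the sum of squared errors, Σi (yi − (a + b·xi))². A minimal sketch with NumPy; the temperature/humidity values are made up for illustration:

```python
# Least-squares fit sketch (NumPy; the data values are made up).
import numpy as np

temp     = np.array([14.0, 18.0, 22.0, 26.0, 30.0])  # x: temperature
humidity = np.array([80.0, 74.0, 66.0, 57.0, 50.0])  # y: humidity

X = np.column_stack([np.ones_like(temp), temp])      # design matrix with intercept column
coef, *_ = np.linalg.lstsq(X, humidity, rcond=None)  # minimises sum of squared errors
print(coef)                                          # [intercept a, slope b]
```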
Regression for classification

• Any regression technique can be used for classification


• Training: perform a regression for each class, setting the output
to 1 for training instances that belong to class, and 0 for those
that don’t
• Prediction: predict class corresponding to model with largest
output value (membership value)
• For linear regression this method is also known as multi-response
linear regression
• Problem: membership values are not in the [0,1] range,
so they cannot be considered proper probability
estimates
• In practice, they are often simply clipped into the [0,1]
range and normalized to sum to 1
81
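A minimal sketch of multi-response linear regression as described above, assuming scikit-learn: one 0/1 regression target per class, then predict the class with the largest output (the iris dataset is a stand-in):

```python
# Multi-response linear regression for classification (assumes scikit-learn).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)
Y = np.eye(3)[y]                     # one 0/1 target column per class

reg = LinearRegression().fit(X, Y)   # effectively one regression per column
scores = reg.predict(X)              # "membership values": not proper probabilities
pred = scores.argmax(axis=1)         # predict the class with the largest output
print((pred == y).mean())            # training accuracy
```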
Linear models: logistic regression
• Can we do better than using linear regression for classification?
• Yes, we can, by applying logistic regression
• Logistic regression builds a linear model for a transformed target
variable
• Assume we have two classes
• Logistic regression replaces the target Pr(y = 1|x)

by this target: log( Pr(y = 1|x) / (1 − Pr(y = 1|x)) )

• This logit transformation maps [0,1] to (−∞, +∞), i.e., the new target
values are no longer restricted to the [0,1] interval

82
Logistic Regression Model
Logistic function

[Figure: the logistic (sigmoid) function, mapping real-valued inputs (“reals”, horizontal axis) to probabilities in (0, 1) (vertical axis).]
83
Logistic Regression Model
[Figure: logistic regression of T2D (type-2 diabetes) on BMI; the fitted probability curve crosses 0.5 at the decision threshold, predicting “no” to its left and “yes” to its right.]

Note: here we do not use sum of squared errors for fitting.
84
Logistic Regression: Linearity, Training

85
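A minimal sketch of fitting a logistic regression classifier, assuming scikit-learn; the BMI/T2D numbers are made up to mirror the figure above:

```python
# Logistic regression sketch (assumes scikit-learn; BMI/T2D data is made up).
import numpy as np
from sklearn.linear_model import LogisticRegression

bmi = np.array([[19.0], [21.0], [24.0], [27.0], [30.0], [33.0]])
t2d = np.array([0, 0, 0, 1, 1, 1])        # 1 = has type-2 diabetes

clf = LogisticRegression().fit(bmi, t2d)  # fit by maximum likelihood, not squared error
print(clf.predict_proba([[26.0]]))        # [Pr(no), Pr(yes)] from the logistic function
print(clf.predict([[26.0]]))              # predict "yes" iff Pr(yes) > 0.5
```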
Decision boundary example
http://www.kdnuggets.com/2016/08/role-activation-function-neural-network.html
Decision boundary example
http://www.kdnuggets.com/2016/08/role-activation-function-neural-network.html
Exercise
How can you use linear regression (or logistic
regression) to model non-linear functions on your
data?
Summary
• Subject intro and logistics
• Performance evaluation metrics
* Accuracy, AUC, and a veritable zoo

• Approaches to ML
* Frequentist vs Bayesian vs Decision Theoretic
* Supervised models: Generative vs Discriminative

• Linear approaches
* Naïve Bayes
* Linear regression
* Logistic regression

89
