CS3244 (2120) - Project Discussion 1 - Overview

This document discusses designing a machine learning application for a class project. It covers choosing objectives and performance measures, collecting and preparing data, constructing predictive models, and evaluating models. Key topics include bias-variance tradeoffs, overfitting and underfitting, validation techniques like cross-validation, and ensemble methods to improve accuracy. Students are instructed to determine a benchmark for evaluation and conduct a hypothesis test to demonstrate their model performs significantly better.


Project Discussion

CS3244: Machine Learning


03 March 2022
CS3244 Project
Assessment Components

1. Application design & problem formulation (8%)
2. Model design & construction (6%)
3. Evaluation (6%)
4. Novelty (4%)
5. Instructions (1%)
Content Overview

1. The Application
2. The Model
3. The Evaluation

Designing a Machine Learning (ML) Application
The Machine Learning Application

▪ Is it just about a dataset?

Designing Applications

▪ SDLC
– Planning → Analysis → Design → Implementation → Maintenance → Planning → ...

▪ Planning and Analysis
– Is there a need for ML in a particular system?
▪ User ML needs / needs analysis
▪ Use cases incorporating ML solutions
Main Issues

▪ Assume supervised learning

1. What objectives?
– Model accuracy?
– Performance measures
▪ Quantifying the objectives

2. What data?
– Use an existing dataset or collect new data?
– What do I know about the domain?
▪ Features?
▪ Hypothesis representation/space?

Objectives Apart from Accuracy

▪ What are other possible performance measures?

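Common alternatives include precision, recall, F1, and ROC AUC. A minimal sketch computing them with scikit-learn; the labels and scores below are made-up illustrative values, not outputs of any real model:

# Hypothetical binary-classification outputs: y_true are gold labels,
# y_pred are hard predictions, y_score are predicted probabilities.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1, 0.8, 0.3]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
print("ROC AUC  :", roc_auc_score(y_true, y_score))   # threshold-free ranking quality
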
Gathering Data

▪ What do you intend to do about data?

Constructing a Good Predictor

Consistency with Training Data Versus Generalisation

▪ Consistency with training data is just the beginning ...

▪ Real problems → massive instance spaces
– Example: the UCI Mushroom dataset, with an instance space of size 1,638,333,457,367,040
https://archive.ics.uci.edu/ml/datasets/Mushroom

▪ ... generalisation is more important (usually)
Consequence of Generalisation Objective

▪ Data is not enough: picking the right inductive bias is essential

▪ Example: D contains 1,000,000 instances over 100 Boolean variables
– 2^100 - 10^6 instances remain unlabelled

▪ Inductive bias:
– Hypothesis representation
– Hypothesis preference

▪ If all target functions are equally likely, no hypothesis can do better than random guessing on the unseen instances (No-Free-Lunch theorems)

▪ Thankfully, real-world problems are not drawn uniformly from the set of all possible functions

Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390.
Bias-Variance Decomposition

▪ Generalisation error may be divided into three parts:
– Noise
▪ Error inherent within the data
▪ Typically cannot reduce this
– Bias
▪ Error from assumptions about target function
– Appropriateness of hypothesis representation
– Relevance of features / sufficient features

– Variance
▪ Error from sensitivity to small fluctuations in the training data
– Simpler (less expressive) hypotheses have lower variance
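
For squared loss, this decomposition has a standard closed form, sketched here for reference (the slide gives no formula, so the symbols are introduced by this note). Assume y = f(x) + ε with zero-mean noise of variance σ², and let ĥ be the hypothesis learned from a random training set; taking expectations over training sets and noise:

\mathbb{E}\big[(y - \hat{h}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\big(\mathbb{E}[\hat{h}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{h}(x) - \mathbb{E}[\hat{h}(x)])^2\big]}_{\text{variance}}
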
Overfitting & Underfitting

▪ High bias ⇒ underfitting
– Model not expressive enough, or not appropriately expressive, to capture the target concept c
– Higher training error

▪ High variance ⇒ overfitting
– Model too expressive and sensitive to small changes in the training sample
– Requires more data to converge
– Lower training error, higher testing error

▪ Examples (see the sketch below)
– Decision trees
▪ Larger/deeper tree ⇒ lower bias; higher variance
– Neural networks
▪ More hidden units ⇒ lower bias; higher variance
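
A minimal sketch of the decision-tree example: sweeping the maximum depth from shallow (high bias, underfits) to unbounded (high variance, overfits) and comparing training versus test accuracy. The synthetic dataset and the particular depths are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification task (flip_y injects label noise).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 5, 10, None):  # None = grow tree until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"depth={depth}: train acc={tree.score(X_tr, y_tr):.3f}, "
          f"test acc={tree.score(X_te, y_te):.3f}")

Expect training accuracy to climb towards 1.0 as depth grows, while test accuracy peaks at an intermediate depth and then degrades.
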
Simple Ideas to Improve Generalisation Performance
Feature Selection

▪ Determine the “right” attributes to use
– Remove redundant / irrelevant attributes

– Filter approach: use a heuristic (see the sketch below)
▪ Pearson correlation coefficient
▪ Information gain

– Wrapper approach
▪ ML algorithm used to assess the value of attribute sets

– Embedded approach
▪ Feature selection is part of the ML algorithm
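
A minimal sketch of the filter approach, scoring each attribute independently with mutual information (an information-gain style heuristic) and keeping the top k; the synthetic dataset and k = 10 are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=6,
                           random_state=0)

# Score features one at a time against the label, keep the 10 best.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)  # (500, 10)
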
Validation Using Cross-Validation

▪ k-Fold Cross-Validation
– Divide training set S into k folds s_1, ..., s_k
– For each fold s_i:
▪ Train model using S \ s_i
▪ Test model using s_i
– Take the mean performance across folds

▪ Wrapper-based approach (see the sketch below)
– Selecting hyperparameters
– Selecting attributes
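
A minimal sketch of k-fold cross-validation used as a wrapper to select a hyperparameter; the model (logistic regression), the grid of C values, and the data are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

best_C, best_mean = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    # 5-fold CV: train on 4 folds, test on the held-out fold, average.
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5)
    if scores.mean() > best_mean:
        best_C, best_mean = C, scores.mean()

print(f"selected C={best_C}, mean CV accuracy={best_mean:.3f}")
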
Ensembles

▪ General idea
– Aggregate predictions of multiple hypotheses to generate an overall classification that is more accurate

▪ General motivation
– Assume k independent (i.e., uncorrelated) hypotheses
– Assume generalisation performance > 0.5

[Figure: majority-vote success probability, obtained by summing the binomial terms with 51+ successes; success probability becomes much higher as p tends to 1]
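
A minimal sketch of this motivation: the probability that a majority vote of k independent hypotheses is correct, each correct with probability p. Choosing k = 101 (an illustrative assumption consistent with the "51+ successes" in the figure) makes a majority exactly 51 or more correct votes:

from math import comb

def majority_correct(k: int, p: float) -> float:
    # Probability that at least a strict majority of k independent
    # classifiers, each correct with probability p, are correct.
    need = k // 2 + 1  # smallest winning majority (51 when k = 101)
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(need, k + 1))

for p in (0.5, 0.55, 0.6, 0.7):
    print(f"p={p}: majority correct with probability {majority_correct(101, p):.4f}")
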
Ensemble Framework

Example: Random Forest (see the sketch below)
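
A minimal sketch of the random forest example: an ensemble of decision trees, each trained on a bootstrap sample with random feature subsets, aggregated by majority vote. The dataset and hyperparameters are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees; bagging plus random feature subsets decorrelate the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
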
Evaluation

Model Evaluation

▪ How do you know you have succeeded?


▪ Determine a benchmark
– Competing model
– Threshold

▪ Form a hypothesis test to check whether your model is significantly better than the benchmark (see the sketch below)
– m × k-fold cross-validation performance
– Each value is a mean (so the central limit theorem applies)
– Apply a t-test
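
A minimal sketch of this recipe: run k-fold cross-validation m times, treat each repetition's mean score as one observation, and apply a paired t-test against the benchmark. The two models, the data, and m = 5, k = 10 are illustrative assumptions:

from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, random_state=0)
ours = LogisticRegression(max_iter=1000)        # candidate model
bench = DecisionTreeClassifier(random_state=0)  # benchmark model

m, k = 5, 10
ours_means, bench_means = [], []
for rep in range(m):
    # Same folds for both models in each repetition, so the test is paired.
    cv = KFold(n_splits=k, shuffle=True, random_state=rep)
    ours_means.append(cross_val_score(ours, X, y, cv=cv).mean())
    bench_means.append(cross_val_score(bench, X, y, cv=cv).mean())

t_stat, p_value = ttest_rel(ours_means, bench_means)  # paired t-test on the m means
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
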
Experimental Setup for Empirical Evaluation

▪ Example Walkthrough

Summary

▪ Determine appropriate user ML need

▪ Determine the important performance measures
– For prediction, prioritise generalisation performance

▪ Determine your sources of data

▪ Apply domain knowledge and feature selection

▪ Consider performing validation to choose hyperparameters

▪ Consider ensemble methods

▪ Evaluate model against a benchmark via a valid hypothesis test

Questions?

