An Introduction To Machine Learning
Machine Learning
Pierre Geurts
Outline
Introduction
Supervised Learning
http://www.darpa.mil/grandchallenge/
http://www.netflixprize.com
Data:
Applications
...
Related fields
Problem definition → Data generation → Raw data → Preprocessing (normalization, missing values, feature selection/extraction, ...) → Preprocessed data → Machine learning → Hypothesis → Validation → Knowledge / predictive model
Glossary
            VAR 1  VAR 2  VAR 3  VAR 4  VAR 5  VAR 6  ...  VAR 11  ...
Object 1      0      1      2      0      1      1    ...    0     ...
Object 2      2      1      2      0      1      1    ...    2     ...
Object 3      0      0      1      0      1      1    ...    2     ...
Object 4      1      1      2      2      0      0    ...    1     ...
Object 5      0      1      0      2      1      0    ...    1     ...
Object 6      0      1      2      1      1      1    ...    1     ...
Object 7      2      1      0      1      1      2    ...    1     ...
Object 8      2      2      1      0      0      0    ...    2     ...
Object 9      1      1      0      1      0      0    ...    1     ...
Object 10     1      2      2      0      1      0    ...    1     ...
...
Dimension=number of variables
Size=number of objects
Outline
Introduction
Supervised Learning
Introduction
Supervised learning

Learning sample (inputs A1, A2, A3, A4; output Y):

A1     A2     A3   A4     Y
-0.69  -0.72  Y    0.47   Healthy
-2.3   -1.2   N    0.15   Disease
0.32   -0.9   N    -0.76  Healthy
0.37   -1     Y    -0.59  Disease
-0.67  -0.53  N    0.33   Healthy
0.51   -0.09  Y    -0.05  Disease

From the learning sample, supervised learning builds a model (hypothesis) Y = h(A1,A2,A3,A4).
Predictive: make predictions for a new sample described by its attributes.

A1     A2     A3   A4     Y
0.83   -0.54  T    0.68   Healthy
-2.3   -1.2   F    -0.83  Disease
0.08   0.63   F    0.76   Healthy
0.06   -0.29  T    -0.57  Disease
-0.98  -0.18  F    -0.38  Healthy
-0.68  0.82   T    -0.95  Disease
0.92   -0.33  F    -0.48  ?
Informative: help to understand the relationship between the inputs and the output, e.g.
Y=disease if A3=F and A2<0.3
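A minimal sketch of the informative use: the rule above, taken literally, applied to the rows of the table (the dictionaries and the predict helper are illustrative, not part of the slides):

```python
# Minimal sketch: apply the rule "Y = Disease if A3 = F and A2 < 0.3".
# The samples below are the rows of the table above; the last one is the new sample (Y = ?).
samples = [
    {"A1": 0.83,  "A2": -0.54, "A3": "T", "A4": 0.68},
    {"A1": -2.3,  "A2": -1.2,  "A3": "F", "A4": -0.83},
    {"A1": 0.08,  "A2": 0.63,  "A3": "F", "A4": 0.76},
    {"A1": 0.06,  "A2": -0.29, "A3": "T", "A4": -0.57},
    {"A1": -0.98, "A2": -0.18, "A3": "F", "A4": -0.38},
    {"A1": -0.68, "A2": 0.82,  "A3": "T", "A4": -0.95},
    {"A1": 0.92,  "A2": -0.33, "A3": "F", "A4": -0.48},  # the new sample
]

def predict(sample):
    """Return the class predicted by the rule."""
    if sample["A3"] == "F" and sample["A2"] < 0.3:
        return "Disease"
    return "Healthy"

for i, s in enumerate(samples, start=1):
    print(i, predict(s))
```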
Example of applications

Patients:

A1     A2     ...  A4     Y
-0.61  0.23   ...  0.49   Healthy
-2.3   -1.2   ...  -0.11  Disease
-0.82  -0.41  ...  0.24   Healthy
-0.74  -0.1   ...  -0.15  Disease
-0.14  0.98   ...  -0.13  Healthy
-0.37  0.27   ...  -0.67  Disease
Outline
Introduction
Supervised Learning
Introduction
Illustrative problem
Medical diagnosis from two measurements (e.g., weight and temperature):

M1    M2    Y
0.52  0.18  Healthy
0.44  0.29  Disease
0.89  0.88  Healthy
0.99  0.37  Disease
...   ...   ...
0.95  0.47  Disease
0.29  0.09  Healthy

Figure: the learning sample plotted in the (M1, M2) plane.
Learning algorithm: applies an optimization strategy to the learning sample and returns a model obtained by supervised learning.
Figure: the learning sample in the (G1, G2) plane.
Linear model
h(M1,M2) = Disease if w0 + w1*M1 + w2*M2 > 0, Normal otherwise
Figure: the corresponding linear decision boundary in the (M1, M2) plane.
Quadratic model
h(M1,M2) = Disease if w0 + w1*M1 + w2*M2 + w3*M1² + w4*M2² > 0, Normal otherwise
Figure: the corresponding quadratic decision boundary in the (M1, M2) plane.
Figure: decision boundaries obtained on the same data with a quadratic model and a neural network.
Learning-sample (LS) and test-sample (TS) errors:
quadratic model: LS error = 3.4%, TS error = 3.5%
neural network: LS error = 1.0%, TS error = 1.5%
a third model: LS error = 0%, TS error = 3.5% (perfect fit on the learning sample, but no better generalization)
Upside: very simple, computationally efficient.
Downside:
Upside:
Downside: high variance.
Cross-validation: split the learning sample into several subsets; for each subset, learn the model on the objects that are not in the subset and test it on that subset; average the resulting errors (see the sketch after this list).
Test set: hold part of the data out, learn the model on the remainder, and estimate its error on the held-out objects.
Leave-one-out: cross-validation where each subset contains a single object.
Rule of thumb:
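A minimal sketch of k-fold cross-validation under these definitions; the train and predict arguments are hypothetical placeholders for any supervised learning method:

```python
import numpy as np

def cross_validation_error(X, y, train, predict, k=10, seed=0):
    """Estimate the error of a learning algorithm by k-fold cross-validation.

    train(X, y) must return a model; predict(model, X) must return predictions.
    Both are hypothetical placeholders for any supervised learning method.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle the objects
    folds = np.array_split(indices, k)     # k (almost) equal-sized subsets
    errors = []
    for fold in folds:
        test_mask = np.zeros(len(y), dtype=bool)
        test_mask[fold] = True
        # learn the model on the objects that are NOT in the subset...
        model = train(X[~test_mask], y[~test_mask])
        # ...and test it on the subset
        y_pred = predict(model, X[test_mask])
        errors.append(np.mean(y_pred != y[test_mask]))
    return float(np.mean(errors))
```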
Figure: LS error and CV error as a function of model complexity; the CV error is minimal at the optimal complexity, beyond which the model over-fits.
Figure: CV error as a function of complexity for several algorithms (Algo 1 to Algo 4).
Example  True class  Model 1   Model 2
1        Negative    Positive  Negative
2        Negative    Negative  Negative
3        Negative    Positive  Positive
4        Negative    Positive  Negative
5        Negative    Negative  Negative
6        Negative    Negative  Negative
7        Negative    Negative  Positive
8        Negative    Negative  Negative
9        Negative    Negative  Negative
10       Positive    Positive  Positive
11       Positive    Positive  Negative
12       Positive    Positive  Positive
13       Positive    Positive  Positive
14       Positive    Negative  Negative
15       Positive    Positive  Negative
Various criteria
Error rate = (FP+FN)/(N+P)
Accuracy = (TP+TN)/(N+P) = 1 - Error rate
Sensitivity = TP/P (aka recall)
Specificity = TN/(TN+FP)
Precision = TP/(TP+FP) (aka PPV)
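A small sketch applying these criteria to Model 1 and Model 2 from the table above (plain Python, no library assumed):

```python
# True classes and predictions for the 15 examples of the table above.
truth  = ["Negative"] * 9 + ["Positive"] * 6
model1 = ["Positive", "Negative", "Positive", "Positive", "Negative", "Negative",
          "Negative", "Negative", "Negative", "Positive", "Positive", "Positive",
          "Positive", "Negative", "Positive"]
model2 = ["Negative", "Negative", "Positive", "Negative", "Negative", "Negative",
          "Positive", "Negative", "Negative", "Positive", "Negative", "Positive",
          "Positive", "Negative", "Negative"]

def criteria(truth, pred):
    tp = sum(t == "Positive" and p == "Positive" for t, p in zip(truth, pred))
    tn = sum(t == "Negative" and p == "Negative" for t, p in zip(truth, pred))
    fp = sum(t == "Negative" and p == "Positive" for t, p in zip(truth, pred))
    fn = sum(t == "Positive" and p == "Negative" for t, p in zip(truth, pred))
    p, n = tp + fn, tn + fp
    return {
        "error rate":  (fp + fn) / (p + n),
        "accuracy":    (tp + tn) / (p + n),
        "sensitivity": tp / p,          # aka recall
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),  # aka PPV
    }

print("Model 1:", criteria(truth, model1))
print("Model 2:", criteria(truth, model2))
```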
Figure: illustration of precision versus recall (sensitivity).
Outline
Introduction
k-NN
Linear methods
Decision trees
Ensemble methods
Accuracy:
Efficiency:
Interpretability:
Object  M1    M2    Y
1       0.32  0.81  Healthy
2       0.15  0.38  Disease
3       0.39  0.34  Healthy
4       0.62  0.11  Disease
5       0.92  0.43  ?
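A minimal 1-nearest-neighbour sketch on the table above, assuming Euclidean distance and k = 1 for the illustration:

```python
import math

# Learning sample (objects 1-4) and the object to classify (object 5).
learning_sample = [
    ((0.32, 0.81), "Healthy"),
    ((0.15, 0.38), "Disease"),
    ((0.39, 0.34), "Healthy"),
    ((0.62, 0.11), "Disease"),
]
query = (0.92, 0.43)  # object 5, class unknown

def one_nn(query, learning_sample):
    """Predict the class of the nearest object (Euclidean distance)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    _, label = min(learning_sample, key=lambda item: dist(item[0], query))
    return label

# Here the nearest neighbour of object 5 is object 4.
print(one_nn(query, learning_sample))
```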
Figure: LS error and CV error as a function of k; the optimal k is the one minimizing the CV error (too small a k over-fits).
Small exercise
Andrew Moore
k-NN
Advantages:
very simple
Drawbacks:
Linear methods
Fit a linear model to the learning sample by minimizing the regularized squared error Σ_i (y_i − wᵀx_i)² + λ‖w‖².
Closed-form solution: w = (XᵀX + λI)⁻¹ Xᵀy, where X is the input matrix and y is the output vector.
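A minimal NumPy sketch of this closed-form solution, assuming the regularized least-squares reading of the formula above (lam = 0 would give ordinary least squares):

```python
import numpy as np

def fit_linear(X, y, lam=0.1):
    """Closed-form solution w = (X'X + lam*I)^-1 X'y of regularized least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy example: y is (approximately) a linear function of two inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=50)
w = fit_linear(X, y)
print(w)        # close to [2, -1]
print(X @ w)    # predictions on the learning sample
```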
Example: perceptron
Linear methods
Advantages:
simple
Drawbacks:
Non-linear extensions
Generalized (basis-function) models: y = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ... + w_n φ_n(x)
Artificial neural networks: y = g( Σ_j W_j g( Σ_i w_{i,j} x_i ) )
Kernel methods: y = Σ_i w_i φ_i(x), rewritten as y = Σ_j α_j k(x_j, x)
Figure: a single artificial neuron. The inputs A1, ..., AN are multiplied by weights w1, ..., wN, summed together with a bias term w0, and passed through a tanh activation, whose output lies between -1 and +1:
Y = tanh(w1*A1 + w2*A2 + ... + wN*AN + w0)
Hypothesis space: multi-layer perceptron (neurons organized in successive layers, up to the output layer).
Learning:
Choose a structure.
Tune the value of the parameters (connections between neurons) so as to minimize the learning sample error (a minimal sketch follows).
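A minimal sketch of the model such a network computes, with one hidden layer of tanh neurons; the weights below are random placeholders, whereas in practice they are tuned to minimize the learning sample error (typically by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Structure: 4 inputs -> 3 hidden tanh neurons -> 1 tanh output neuron.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # hidden-layer connections (placeholders)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output-layer connections (placeholders)

def mlp(x):
    """Forward pass of a one-hidden-layer perceptron."""
    hidden = np.tanh(W1 @ x + b1)
    return np.tanh(W2 @ hidden + b2)

x = np.array([0.83, -0.54, 1.0, 0.68])  # an input vector (A1, A2, A3, A4)
print(mlp(x))                           # output in [-1, 1]
```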
Illustrative example
Figure: decision boundaries in the (G1, G2) plane obtained with 1 hidden neuron, 2+2 hidden neurons, and 10+10 hidden neurons.
Advantages:
Universal approximators
Drawbacks:
Scalability
Linear classifier
Mathematically
Decision function: y = +1 if wᵀx + b ≥ 0, y = -1 otherwise
Non-linear boundary
Solution: map the inputs into a higher-dimensional feature space, e.g. (x1, x2) → (x1², x2²), where a linear separation becomes possible.
Intuitively:
Mathematically:
Mathematically
Minimize (1/2)‖w‖² subject to y_i (⟨w, x_i⟩ + b) ≥ 1, i = 1, ..., N.
Dual form: minimize (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ − Σ_i α_i subject to α_i ≥ 0 and Σ_i α_i y_i = 0, with w = Σ_i α_i y_i x_i.
Decision function: y = +1 if ⟨w, x⟩ = Σ_i α_i y_i ⟨x_i, x⟩ = Σ_i α_i y_i k(x_i, x) ≥ 0, y = -1 otherwise.
The same SVM algorithm can be applied to any type of data for which a kernel (a pairwise similarity between objects) can be defined: it only uses the kernel matrix and the class labels.

Numerical data:
   G1    G2     Y
1  0.21  0.64   C1
2  0.57  0.26   C2
3  0.21  0.68   C2
4  0.69  0.52   C1
5  0.83  0.96   C1
6  0.48  -0.52  C2

String data:
   G1           Y
1  ACGCTCTATAG  C1
2  ACTCGCTTAGA  C2
3  GTCTCTGAGAG  C2
4  CGCTAGCGTCG  C1
5  CGATCAGCAGC  C1
6  GCTCGCGCTCG  C2

Kernel matrix:
     1     2     3     4     5      6
1    1     0.14  0.96  0.17  0.01   0.24
2    0.14  1     0.02  0.7   0.22   0.67
3    0.96  0.02  1     0.15  0.27   0.07
4    0.17  0.7   0.15  1     0.37   0.55
5    0.01  0.22  0.27  0.37  1     -0.25
6    0.24  0.67  0.07  0.55  -0.25  1

Kernel matrix + class labels → SVM algorithm → classification model.
Examples of kernels
Linear kernel: k(x,x') = ⟨x,x'⟩
Polynomial kernel: k(x,x') = (⟨x,x'⟩ + 1)^d
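A small NumPy sketch computing these two kernels on the numerical objects of the earlier table; note that the kernel used on that slide is not specified, so the values will differ from its kernel matrix:

```python
import numpy as np

# Numerical objects (G1, G2) from the table above.
X = np.array([[0.21, 0.64],
              [0.57, 0.26],
              [0.21, 0.68],
              [0.69, 0.52],
              [0.83, 0.96],
              [0.48, -0.52]])

def linear_kernel(X):
    """k(x, x') = <x, x'> for every pair of objects."""
    return X @ X.T

def polynomial_kernel(X, d=2):
    """k(x, x') = (<x, x'> + 1)^d for every pair of objects."""
    return (X @ X.T + 1.0) ** d

K = polynomial_kernel(X, d=2)
print(K.shape)   # (6, 6) symmetric kernel matrix
```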
h(x1,x2,...,xK) = C1 if w0 + w1*x1 + w2*x2 + ... + wK*xK > 0, C2 otherwise
Figure: |w| plotted for each of the variables.
SVM parameters
Kernel's parameters:
Advantages:
Drawbacks:
Decision trees
Figure: a decision tree. The root tests attribute A1 (branches a11, a12, a13), internal nodes test A2 (a21, a22) and A3 (a31, a32), and each leaf carries a class label (c1 or c2).
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         Normal    Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         High      Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Hot          Normal    Weak    Yes
D10  Rain      Mild         Normal    Strong  Yes
D11  Sunny     Cool         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
The learned decision tree:
Outlook = Sunny:    Humidity = High → No;  Humidity = Normal → Yes
Outlook = Overcast: Yes
Outlook = Rain:     Wind = Strong → No;  Wind = Weak → Yes

Classify a new day with the tree:
Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak → Play Tennis = ?
Figure: the learning sample split according to the value of Outlook.

Outlook = Sunny:
Day  Temp.  Humidity  Wind    Play
D1   Hot    High      Weak    No
D2   Hot    High      Strong  No
D8   Mild   High      Weak    No
D9   Hot    Normal    Weak    Yes
D11  Cool   Normal    Strong  Yes

Outlook = Overcast:
Day  Temp.  Humidity  Wind    Play
D3   Hot    High      Weak    Yes
D7   Cool   High      Strong  Yes
D12  Mild   High      Strong  Yes
D13  Hot    Normal    Weak    Yes

Outlook = Rain:
Day  Temp.  Humidity  Wind    Play
D4   Mild   Normal    Weak    Yes
D5   Cool   Normal    Weak    Yes
D6   Cool   Normal    Strong  No
D10  Mild   Normal    Strong  Yes
D14  Mild   High      Strong  No
Which attribute is the best split? Figure: a node containing [29+,35-] objects can be split by one attribute into [21+,5-] and [8+,30-], or by A2 into [18+,33-] and [11+,2-].
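A small sketch scoring the two candidate splits above; the slides do not spell out the splitting criterion at this point, so Shannon entropy / information gain is assumed:

```python
import math

def entropy(pos, neg):
    """Shannon entropy (in bits) of a node containing pos '+' and neg '-' objects."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = sum(p + q for p, q in children)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder

parent = (29, 35)
print(information_gain(parent, [(21, 5), (8, 30)]))   # first candidate split
print(information_gain(parent, [(18, 33), (11, 2)]))  # split on A2
```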
Figure: LS error and CV error as a function of the number of nodes in the tree; the CV error is minimal at the optimal complexity, beyond which the tree over-fits.
Post-pruning
Figure: 1. tree growing drives the LS error down (towards over-fitting); 2. tree pruning cuts the tree back to the optimal complexity, where the CV error is minimal (smaller trees under-fit).
Numerical variables are handled with threshold tests of the form X > 65.4? (yes/no branches).
Illustrative example
Figure: a decision tree grown on the two-measurement medical data. The root test is M2 < 0.33? (yes → Healthy); further tests (M1 < 0.91?, M2 < 0.91?, M1 < 0.23?, M2 < 0.75?, M2 < 0.49?, M2 < 0.65?) lead to leaves labeled Healthy or Sick, corresponding to an axis-parallel partition of the (M1, M2) plane.
Regression trees
Trees for regression problems: exactly the same model, but with a number in each leaf instead of a class.
Figure: an example regression tree on the tennis data, with tests on Outlook, Humidity, Wind and Temperature (< 71 / > 71) and numerical predictions in the leaves (45.6, 22.3, 1.2, 64.4, 7.4, 3.4).
Interpretability
Attribute selection / attribute importance.
Figure: importance ranking of the attributes (Outlook, Humidity, Wind, Temperature).
Advantages:
Drawbacks:
Ensemble methods
Figure: several models are learned from the data; their individual predictions (e.g., Sick, Healthy, Sick, Sick) are aggregated, for instance by majority vote, into a single prediction.
Bagging: motivation
Figure: bagging learns several models from perturbed (bootstrap) versions of the learning sample and combines their predictions (Sick, Healthy, Sick, Sick, ...).
Bootstrap sampling: draw N objects from the learning sample of size N, uniformly at random and with replacement.

Original learning sample:
#   G1     G2     Y
1   0.74   0.68   Healthy
2   0.78   0.45   Disease
3   0.86   0.09   Healthy
4   0.2    0.61   Disease
5   0.2    -5.6   Healthy
6   0.32   0.6    Disease
7   -0.34  -0.45  Healthy
8   0.89   -0.34  Disease
9   0.1    0.3    Healthy
10  -0.34  -0.65  Healthy

Drawn indices: 3, 7, 2, 9, 3, 10, 1, 8, 6, 10

Bootstrap sample:
#   G1     G2     Y
3   0.86   0.09   Healthy
7   -0.34  -0.45  Healthy
2   0.78   0.45   Disease
9   0.1    0.3    Healthy
3   0.86   0.09   Healthy
10  -0.34  -0.65  Healthy
1   0.74   0.68   Healthy
8   0.89   -0.34  Disease
6   0.32   0.6    Disease
10  -0.34  -0.65  Healthy
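A minimal NumPy sketch of bootstrap sampling and of bagging built on top of it; train and predict are hypothetical placeholders for any base learner, and the drawn indices depend on the random seed, so they will not match the 3, 7, 2, 9, ... of the slide:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw N objects with replacement from a learning sample of size N."""
    n = len(y)
    idx = rng.integers(0, n, size=n)   # indices drawn uniformly, with replacement
    return X[idx], y[idx]

def bagging_predict(X, y, x_new, train, predict, T=100, seed=0):
    """Bagging: learn T models on T bootstrap samples and aggregate by majority vote.

    train(X, y) returns a model; predict(model, x) returns its prediction.
    Both are hypothetical placeholders for any base learner.
    """
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(T):
        Xb, yb = bootstrap_sample(X, y, rng)
        model = train(Xb, yb)
        votes.append(predict(model, x_new))
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]   # majority vote
```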
Boosting
Adaboost:
Boosting
Figure: models are built sequentially on re-weighted versions of the learning sample (LS1, LS2, ..., LST); each model receives a weight (w1, w2, ..., wT) and the weighted predictions (Healthy, Sick, Healthy, ...) are combined into a final prediction.
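A minimal Adaboost sketch using scikit-learn (assumed available); the dataset is synthetic, just to keep the snippet self-contained:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class problem (stand-in for the medical data of the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Adaboost re-weights the learning sample at each step so that the next base
# model (by default a depth-1 decision tree) focuses on the objects the
# previous models misclassified; the final prediction is a weighted vote.
model = AdaBoostClassifier(n_estimators=50).fit(X, y)
print("learning-sample accuracy:", model.score(X, y))
```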
Figure: error comparison between a single decision tree and ensemble methods (random forests, k=85, T=500) on a 72-sample dataset; the reported errors are 22.2% (16/72), 9.7% (7/72), 5.5% (4/72) and 1.4% (1/72). A second panel shows the importance of the variables.
Method comparison
kNN, decision trees (DT), linear methods, ensemble methods, ANN and SVM are each rated from + to +++ on accuracy, efficiency and the other criteria above.
Note:
Outline
Introduction
Supervised Learning
Introduction
Graph predictions
Sequence labeling
Image segmentation
Decomposition:
Outline
Introduction
Supervised learning
Semi-supervised learning
Transductive learning
Active learning
Reinforcement learning
Unsupervised learning
Examples:
Biomedical domain
Speech analysis
Image categorization/segmentation
Network measurement
Semi-supervised learning
Learning sample in which only a few objects are labeled:

A1     A2     A3   A4     Y
0.01   0.37   T    0.54   Healthy
-2.3   -1.2   F    0.37   Disease
0.69   -0.78  F    0.63   Healthy
-0.56  -0.89  T    -0.42
-0.85  0.62   F    -0.05
-0.17  0.09   T    0.29
-0.09  0.3    F    0.17
Some approaches
Self-training
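A minimal self-training sketch: learn on the labeled objects, label the unlabeled objects the model is most confident about, and re-learn on the enlarged sample. The train and predict_proba arguments are hypothetical placeholders for any base classifier:

```python
import numpy as np

def self_training(X_lab, y_lab, X_unlab, train, predict_proba,
                  n_iter=10, threshold=0.95):
    """Iteratively add confidently self-labeled objects to the learning sample.

    train(X, y) returns a model; predict_proba(model, X) returns an array of
    class probabilities (both are placeholders for any supervised method).
    """
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    for _ in range(n_iter):
        model = train(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = predict_proba(model, X_unlab)
        conf = proba.max(axis=1)        # confidence of the predicted class
        keep = conf >= threshold        # only the most confident predictions
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
        X_unlab = X_unlab[~keep]
    return train(X_lab, y_lab)
```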
Some approaches
Graph-based algorithms
Transductive learning
Active learning
Goal:
Reinforcement learning
Learning from interactions: the agent observes state s0, takes action a0 and receives reward r0, then observes s1, takes a1 and receives r1, then s2, a2, r2, and so on.
RL approaches
Examples of applications
Unsupervised learning
The data consist only of input variables (A2, A3, ..., A19, ...) measured on a set of objects; there is no output to predict.
Dimensionality reduction: project the data from a high-dimensional space down to a small number of dimensions.
Clustering
Clustering
Clustering rows: grouping similar objects (clusters of objects).
Clustering columns: grouping similar variables (clusters of variables).
Bi-clustering / two-way clustering: grouping objects that are similar across a subset of variables (a bi-cluster).
Clustering
Clustering algorithms
hierarchical clustering
K-means
Hierarchical clustering
Agglomerative clustering (a sketch follows):
1. Each object is assigned to its own cluster.
2. Iteratively, the two closest clusters are merged, until a single cluster remains.
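A minimal sketch of agglomerative clustering using SciPy (assumed available): linkage performs the iterative merging and fcluster cuts the resulting hierarchy into a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of objects.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(5, 2)),
               rng.normal(1.0, 0.1, size=(5, 2))])

# Agglomerative clustering: start with one cluster per object and iteratively
# merge the two closest clusters ("average" linkage = mean pairwise distance).
Z = linkage(X, method="average")

# Cut the hierarchy into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```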
Hierarchical clustering
(wikipedia)
Dendrogram
Hierarchical clustering
Strengths
Limitations
k-Means clustering
Find k cluster centers c_1, ..., c_k minimizing Σ_{j=1..k} Σ_{o ∈ Cluster_j} d²(o, c_j).
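A minimal NumPy sketch of the k-means algorithm, which alternates between assigning each object to its closest center and moving each center to the mean of its cluster, thereby decreasing the criterion above:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means: returns the cluster centers and the assignment of each object."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(n_iter):
        # assign every object o to the closest center c_j (squared distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each center to the mean of the objects assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```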
k-Means clustering
Strengths: simple, understandable.
Limitations: sensitive to outliers.
Figure: k-means can converge to a suboptimal clustering.
Example: a data matrix with many variables (A2, A3, ..., A10, ...) is reduced to two new variables, the scores of each object on the first two principal components PC1 and PC2.
Objectives of PCA
to identify outliers
Basic idea
Figure: the first component is the direction of largest variance of the data; the second component is orthogonal to the first.
Figure: PCA applied to a data matrix. Each sample gets a score on each principal component, e.g.:

PC1    PC2
0.62   -0.33
-2.3   -1.2
0.88   0.31
-0.18  -0.05
-0.39  -0.01
-0.61  0.53
PC1 = 0.2*A1 + 3.4*A2 - 4.5*A3    VAR(PC1) = 4.5 → 45%
PC2 = 0.4*A4 + 5.6*A5 + 2.3*A7    VAR(PC2) = 3.3 → 33%
...
Loading of a variable: gives an idea of its importance in the component; can be used for selecting biomarkers.
...
For each component, we have a measure of the percentage of the variance of the initial data that it contains.
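A minimal NumPy sketch of PCA: center the data, take a singular value decomposition, and read off the scores, the loadings and the percentage of variance carried by each component:

```python
import numpy as np

def pca(X):
    """Return (scores, loadings, explained_variance_ratio) of the data matrix X."""
    Xc = X - X.mean(axis=0)                 # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # coordinates of each sample on the PCs
    loadings = Vt                           # weight of each variable in each PC
    explained = s ** 2 / np.sum(s ** 2)     # fraction of the total variance per PC
    return scores, loadings, explained

# Toy data: 6 samples described by 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
scores, loadings, explained = pca(X)
print(scores[:, :2])    # PC1 and PC2 score of each sample
print(loadings[0])      # loadings of the variables on PC1
print(explained[:2])    # percentage of variance of PC1 and PC2
```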
Books
Other textbooks
Books
Software
Pepito: www.pepite.be
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
The Spider (Matlab toolbox): http://www.kyb.mpg.de/bs/people/spider/
BNT, Bayes Net Toolbox (Matlab): http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
Journals
Machine Learning
Neural Computation
Annals of Statistics
...
Conferences
...