Linear Regression
Logistic Regression
NLP’s practical applications
Supervised Classification
- features of N observations (i.e. words)
- class of each of N observations
GOAL: Produce a model that outputs the most likely class yi, given features xi: some function or rules from X to Y, as close as possible.

i  x     y
0  0.00  0
1  0.50  0
2  1.00  1
3  0.25  0
4  0.75  1
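A minimal sketch of this setup, assuming Python with NumPy (the 0.7 threshold is picked by eye purely for illustration, not learned): the toy table above becomes arrays, and a hand-written rule plays the role of the function from X to Y.

import numpy as np

# Toy data from the table above: one feature x per observation, class y.
x = np.array([0.0, 0.5, 1.0, 0.25, 0.75])
y = np.array([0, 0, 1, 0, 1])

def predict(x):
    # A hand-picked threshold rule (illustrative assumption, not a fit).
    return (x >= 0.7).astype(int)

print(predict(x))                  # [0 0 1 0 1]
print((predict(x) == y).mean())    # 1.0: matches every label in the toy table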
Logistic Regression
Binary classification goal: Build a “model” that can estimate P(Y=1|X=?)
In machine learning, the tradition is to use Y for the variable being predicted and X for the features used to make the prediction.
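A minimal sketch of the model itself, assuming Python with NumPy (the b0, b1 values below are placeholders, not fitted coefficients): logistic regression pushes a linear score through the logistic (sigmoid) function so the output lands in (0, 1) and can be read as P(Y=1|X=x).

import numpy as np

def p_y1(x, b0, b1):
    """P(Y=1 | X=x) under logistic regression with coefficients b0, b1."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

# Placeholder coefficients, chosen only to show probabilities in (0, 1).
print(p_y1(np.array([0.0, 1.0, 6.0]), b0=-1.0, b1=0.8))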
Logistic Regression
Example: Y: 1 if target is a part of a proper noun, 0 otherwise;
X: number of capital letters in target and surrounding words.

x  y
2  1
1  0
0  0
6  1
2  1

Adding one more observation (x=1, y=1):

x  y
2  1
1  0
0  0
6  1
2  1
1  1

The optimal b_0, b_1 changed!
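A sketch of that refit, assuming Python with scikit-learn (which L2-regularizes by default, so the exact numbers are a regularized fit): training before and after the extra (x=1, y=1) row shows b_0 and b_1 move.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit(xs, ys):
    # Returns (b_0, b_1) for a one-feature logistic regression.
    model = LogisticRegression().fit(np.array(xs).reshape(-1, 1), ys)
    return model.intercept_[0], model.coef_[0][0]

print(fit([2, 1, 0, 6, 2], [1, 0, 0, 1, 1]))        # before the new row
print(fit([2, 1, 0, 6, 2, 1], [1, 0, 0, 1, 1, 1]))  # after: b_0, b_1 changed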
Logistic Regression on a single feature (x)

P(Y=1|X=x) = 1 / (1 + e^-(b_0 + b_1·x)), where b_0 and b_1 are learned.

“best fit”: whatever maximizes the likelihood function:

L(b_0, b_1) = ∏i P(Y=yi|X=xi) = ∏i p(xi)^yi · (1 − p(xi))^(1 − yi), with p(xi) = P(Y=1|X=xi)
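That likelihood as code, assuming Python with NumPy (the coefficient values passed in are arbitrary illustrations): fitting searches for the (b_0, b_1) that makes this product largest on the training data.

import numpy as np

x = np.array([2, 1, 0, 6, 2, 1], dtype=float)
y = np.array([1, 0, 0, 1, 1, 1])

def likelihood(b0, b1):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))      # p(xi) = P(Y=1|X=xi)
    return np.prod(np.where(y == 1, p, 1.0 - p))  # ∏ p^yi (1-p)^(1-yi)

print(likelihood(0.0, 0.0))   # baseline: every p(xi) = 0.5
print(likelihood(-0.5, 1.0))  # higher value = a better fit to these labels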
X can be multiple features
Often we want to make a classification based on multiple features:
(https://www.linkedin.com/pulse/predicting-outcomes-probabilities-logistic-regression-konstantinidis/)
Logistic Regression
Yi ∊ {0, 1}; X can be anything numeric.
Logistic Regression
Example: Y: 1 if target is a part of a proper noun, 0 otherwise;
X1: number of capital letters in target and surrounding words.
Let’s add a feature! X2: does the target word start with a capital letter?
x2 x1 y
1 2 1
0 1 0
0 0 0
1 6 1
1 2 1
1 1 1
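A sketch with both features, assuming Python with scikit-learn (columns follow the table’s x2, x1 order): the model now learns one weight per feature plus an intercept.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns follow the table: x2 (starts with a capital?), x1 (capital count).
X = np.array([[1, 2], [0, 1], [0, 0], [1, 6], [1, 2], [1, 1]])
y = np.array([1, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)        # b_0 and one weight per feature
print(model.predict_proba([[1, 3]])[:, 1])  # P(Y=1) for a new observation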
Machine Learning: How to set up data

“Corpus” (raw data: sequences of characters) ➔ Feature Extraction ➔ Data ➔ training ➔ Model

i  x1    x2  y
0  0.00  0   0
1  0.50  1   0
2  1.00  1   1
3  0.25  0   0
4  0.75  0   1
…  …     …   …
N  0.35  1   0
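A sketch of that pipeline in Python (the two extracted features mirror the running example and are an illustrative choice, not a fixed recipe): raw character sequences pass through feature extraction, and the resulting rows form the Data matrix used for training.

def extract_features(text):
    """Turn a raw string into numeric features (illustrative choices)."""
    words = text.split()
    n_caps = sum(ch.isupper() for ch in text)                # capital count
    starts_cap = int(bool(words) and words[0][0].isupper())  # first word capitalized?
    return [n_caps, starts_cap]

corpus = ["John met Mary", "the book was dull"]
data = [extract_features(doc) for doc in corpus]
print(data)   # [[2, 1], [0, 0]]: one feature row per document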
Machine Learning: How to set up data
Feature Extraction
Multi-hot Encoding
● Each word gets an index in the vector
● 1 if present; 0 if not
Example features:
➔ number of capital letters
➔ whether “I” was mentioned or not
➔ k features indicating whether k words were mentioned or not
Feature example: is word present in document?
[0, 1, 1, 0, 1, …, 1, 0, 1, 1, 0, 1, …, 1]_k
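A minimal multi-hot encoder in Python (the five-word vocabulary is an illustrative stand-in for the k words): give every vocabulary word an index, then set position i to 1 when word i appears in the document.

vocab = ["book", "dull", "interesting", "the", "was"]   # k = 5 illustrative words
index = {word: i for i, word in enumerate(vocab)}

def multi_hot(document):
    # k features: vec[i] = 1 if vocab word i appears in the document, else 0.
    vec = [0] * len(vocab)
    for word in document.split():
        if word in index:
            vec[index[word]] = 1
    return vec

print(multi_hot("the book was dull"))   # [1, 1, 0, 1, 1]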
Machine Learning: How to set up data
Feature Extraction
Multi-hot Encoding
● Each word gets an index in the vector
● 1 if present; 0 if not
Feature example: is previous word “the”? (raw data: “book”)
[0, 1, 1, 0, 1, …, 1, 0, 1, 1, 0, 1, …, 1]_k
Machine Learning: How to set up data
Feature Extraction
One-hot Encoding
● Each word gets an index in the vector
● All indices 0 except present word
Feature example: is previous word “the”? (raw data: “book”)
[0, 1, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, …, 0]_k
Machine Learning: How to set up data
Feature Extraction
One-hot Encoding
● Each word gets an index in the vector
● All indices 0 except present word
Feature example: which is previous word?
raw data “was” ➔ [0, 1, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, …, 0]_k
raw data “interesting” ➔ [0, 0, 1, 0, 0, …, 0, 0, 0, 0, 0, 0, …, 0]_k
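A one-hot encoder under the same illustrative vocabulary: exactly one index is 1, identifying which word appeared (here, as the previous word).

vocab = ["book", "dull", "interesting", "the", "was"]   # k = 5 illustrative words
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # k features: all 0 except the index of the given word.
    vec = [0] * len(vocab)
    if word in index:
        vec[index[word]] = 1
    return vec

print(one_hot("was"))          # [0, 0, 0, 0, 1]
print(one_hot("interesting"))  # [0, 0, 1, 0, 0]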
Machine Learning: How to set up data
Feature Extraction
Two one-hot features concatenated:
[0, 0, 0, 0, 1, 0, …, 0, 0, …, 0, 1, 0, …, 0]_2k
Plus a numeric feature:
[0, 0, 0, 0, 1, 0, …, 0, 0, …, 0, 1, 0, …, 0, 0.09]_2k+1
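A sketch of that concatenation (0.09 stands in for any extra numeric feature, as on the slide): two k-dimensional one-hot vectors plus one number give a 2k+1-dimensional feature row.

vocab = ["book", "dull", "interesting", "the", "was"]   # k = 5 illustrative words

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

# e.g. previous word "the", current word "book", plus one numeric feature.
row = one_hot("the") + one_hot("book") + [0.09]
print(len(row), row)   # 11 values = 2k+1 features for one observation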
Machine Learning: How to set up data

Data ➔ Model: does the model hold up?
Machine Learning Goal: Generalize to new data
Training Data
Testing Data
[Figure: example fits that underfit vs. overfit the training data]
L2 Regularization - “Ridge”
Shrinks feature weights by adding a penalty that keeps the model from perfectly fitting the training data: the “best fit” now maximizes the likelihood minus a penalty term, λ·Σj bj^2.
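A sketch of the effect, assuming Python with scikit-learn (whose LogisticRegression applies an L2 penalty of strength 1/C): shrinking C strengthens the penalty, which pulls the learned weight toward 0.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2], [1], [0], [6], [2], [1]], dtype=float)
y = np.array([1, 0, 0, 1, 1, 1])

for C in [100.0, 1.0, 0.01]:   # C = 1/λ: smaller C means a stronger penalty
    model = LogisticRegression(penalty="l2", C=C).fit(X, y)
    print(C, model.coef_[0][0])   # the weight b_1 shrinks as C decreases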
80% Training Data set ➔ Model
10% Development set: does the model hold up?
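A sketch of that split, assuming Python with scikit-learn (the 80%/10% portions follow the slide; treating the remaining 10% as a test set is an assumption here): the model trains on one portion and is checked on data it never saw.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # stand-in data: 100 observations
y = np.arange(100) % 2              # stand-in labels

# 80% training, then split the remainder into development and test halves.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.8, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, train_size=0.5, random_state=0)
print(len(X_train), len(X_dev), len(X_test))   # 80 10 10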