
Supervised Classification:

Logistic Regression
NLP’s practical applications

● Machine translation
● Automatic speech recognition
  ○ Personalized assistants
  ○ Auto customer service
● Information Retrieval
  ○ Web Search
  ○ Question Answering
● Sentiment Analysis
● Computational Social Science
● Growing day by day

How?
● Machine learning:
  ○ Logistic regression
  ○ Probabilistic modeling
  ○ Recurrent Neural Networks
  ○ Transformers
● Algorithms, e.g.:
  ○ Graph analytics
  ○ Dynamic programming
● Data science
  ○ Hypothesis testing
Topics we will cover
● Supervised Classification
● Goal of logistic regression
● The “loss function” -- what logistic regression tries to optimize
● Adding Multiple Features
● Training and Test Sets
● Overfitting; Role of Regularization
Supervised Classification
X: features of N observations (e.g., words)
Y: class of each of N observations

GOAL: Produce a model that outputs the most likely class yi, given features xi.
The model is some function or rules from X to Y, matching the data as closely as possible.

i   x     y
0   0.00  0
1   0.50  0
2   1.00  1
3   0.25  0
4   0.75  1
Logistic Regression
Binary classification goal: Build a “model” that can estimate P(Y=1 | X=?),

i.e. given X, yield (or “predict”) the probability that Y=1.

In machine learning, the tradition is to use Y for the variable being predicted and X for the features used to make the prediction.

Example: Y: 1 if the target is a verb, 0 otherwise;
X: 1 if “was” occurs before the target, 0 otherwise.
Logistic Regression
Example: Y: 1 if the target is part of a proper noun, 0 otherwise;
X: number of capital letters in the target and surrounding words.

x   y
2   1
1   0
0   0
6   1
2   1
1   1

Adding the last observation (x=1, y=1) changes the optimal b_0, b_1.
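A minimal sketch (assuming scikit-learn is available; not from the original slides) of fitting b_0 and b_1 to this toy table and reading off the learned model:

# Minimal sketch, assuming scikit-learn: fit logistic regression to the toy
# "number of capital letters" data above and inspect the learned b_0, b_1.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2], [1], [0], [6], [2], [1]])  # x: number of capital letters
y = np.array([1, 0, 0, 1, 1, 1])              # y: 1 = part of a proper noun

# Note: scikit-learn applies L2 regularization by default (C=1.0).
model = LogisticRegression().fit(X, y)

print("b_0 (intercept):", model.intercept_[0])
print("b_1 (slope):", model.coef_[0, 0])
print("P(Y=1 | x=3):", model.predict_proba([[3]])[0, 1])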
Logistic Regression on a single feature (x)

Yi ∊ {0, 1}; X is a single value and can be anything numeric.

HOW? Essentially, try different values of b_0 and b_1 until the learned curve is the “best fit” to the training data.

“best fit”: whatever maximizes the likelihood function:
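Written out (a sketch of the standard form, where p_i is the model’s predicted probability that y_i = 1):

L(b_0, b_1) = \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i},
where p_i = \sigma(b_0 + b_1 x_i) and \sigma(z) = \frac{1}{1 + e^{-z}}.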
X can be multiple features
Often we want to make a classification based on multiple features:

● Number of capital letters surrounding: integer
● Begins with capital letter: {0, 1}
● Preceded by “the”? {0, 1}

(In the earlier plots the y-axis was Y, i.e. 1 or 0. To make room for multiple Xs, we drop the y-axis and instead show the decision point.)

We’re learning a linear (i.e. flat) separating hyperplane, but fitting it to a logit outcome.

(https://www.linkedin.com/pulse/predicting-outcomes-probabilities-logistic-regression-konstantinidis/)
Logistic Regression
Yi ∊ {0, 1}; X can be anything numeric.

The decision boundary is where b_0 + b_1x_1 + … + b_kx_k = 0.

We’re still learning a linear separating hyperplane, but fitting it to a logit outcome.

(https://www.linkedin.com/pulse/predicting-outcomes-probabilities-logistic-regression-konstantinidis/)
Logistic Regression
Example: Y: 1 if the target is part of a proper noun, 0 otherwise;
X1: number of capital letters in the target and surrounding words.
Let’s add a feature! X2: does the target word start with a capital letter?

x2   x1   y
1    2    1
0    1    0
0    0    0
1    6    1
1    2    1
1    1    1
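A minimal sketch (again assuming scikit-learn; not from the slides) of fitting both features; the model now learns one weight per feature:

# Minimal sketch, assuming scikit-learn: logistic regression on two features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature columns: x1 = number of capital letters, x2 = starts with a capital
X = np.array([[2, 1], [1, 0], [0, 0], [6, 1], [2, 1], [1, 1]])
y = np.array([1, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print("intercept b_0:", model.intercept_[0])
print("weights b_1, b_2:", model.coef_[0])                 # one per feature
print("P(Y=1 | x1=3, x2=1):", model.predict_proba([[3, 1]])[0, 1])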
Machine Learning: How to setup data

“Corpus” (raw data: sequences of characters) → Feature Extraction → Data → training → Model

Feature extraction turns the corpus into rows of observations (e.g.: words, sentences, documents, users), each with its features and class:

i   x1    x2   y
0   0.00  0    0
1   0.50  1    0
2   1.00  1    1
3   0.25  0    0
4   0.75  0    1
…   …     …    …
N   0.35  1    0

A row of features might include, e.g.:
➔ number of capital letters
➔ whether “I” was mentioned or not
➔ k features indicating whether k words were mentioned or not

Multi-hot Encoding
● Each word gets an index in the vector
● 1 if present; 0 if not

Feature example: is each word present in the document?
[0, 1, 1, 0, 1, …, 1, 0, 1, 1, 0, 1, …, 1]_k

A related single feature: is the previous word “the”? {0, 1}

One-hot Encoding
● Each word gets an index in the vector
● All indices 0 except the present word

Feature example: which is the previous word?
previous word “was”:          [0, 1, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, …, 0]_k
previous word “interesting”:  [0, 0, 1, 0, 0, …, 0, 0, 0, 0, 0, 0, …, 0]_k

Multiple One-hot encodings for one observation
(1) word before; (2) word after; (3) percent capitals (e.g. for “Interesting”, 0.09)

[0, 0, 0, 0, 1, 0, …, 0]_k   [0, …, 0, 1, 0, …, 0]_k
= [0, 0, 0, 0, 1, 0, …, 0, 0, …, 0, 1, 0, …, 0]_2k
with the percent-capitals feature appended:
[0, 0, 0, 0, 1, 0, …, 0, 0, …, 0, 1, 0, …, 0, 0.09]_2k+1
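A minimal sketch (plain Python, with a hypothetical toy vocabulary; not from the slides) of building these encodings:

# Minimal sketch (hypothetical toy vocabulary) of one-hot / multi-hot features.
vocab = ["the", "was", "interesting", "book", "read"]   # k = 5
index = {w: i for i, w in enumerate(vocab)}
k = len(vocab)

def one_hot(word):
    # All indices 0 except the one for `word` (if it is in the vocabulary).
    vec = [0] * k
    if word.lower() in index:
        vec[index[word.lower()]] = 1
    return vec

def multi_hot(words):
    # 1 at the index of every word that appears; 0 otherwise.
    vec = [0] * k
    for w in words:
        if w.lower() in index:
            vec[index[w.lower()]] = 1
    return vec

tokens = ["The", "book", "was", "interesting"]
t = 1                                         # target token: "book"
percent_capitals = sum(c.isupper() for c in tokens[t]) / len(tokens[t])

# One observation: one-hot(word before) + one-hot(word after) + percent capitals,
# giving a vector of length 2k + 1.
row = one_hot(tokens[t - 1]) + one_hot(tokens[t + 1]) + [percent_capitals]
print(row)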
Machine Learning Goal: Generalize to new data

80% Training Data → used to train the Model
20% Testing Data → used to check: does the model hold up?
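A minimal sketch (assuming scikit-learn; not from the slides) of an 80/20 split; evaluating only on held-out data is what tests generalization:

# Minimal sketch, assuming scikit-learn: hold out 20% of the data for testing.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # toy features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # toy labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))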


Logistic Regression - Regularization

x1    x2   x3    x4  x5   x6    Y
0.5   0    0.6   1   0    0.25  1
0     0.5  0.3   0   0    0     1
0     0    1     1   1    0.5   0
0     0    0     0   1    1     0
0.25  1    1.25  1   0.1  2     1

1.2 + -63*x1 + 179*x2 + 71*x3 + 18*x4 + -59*x5 + 19*x6 = logit(Y)

With six features and only five observations, these extreme coefficients fit the training data perfectly: “overfitting”.
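A minimal sketch (assuming scikit-learn; not from the slides) illustrating the effect: with 6 features and only 5 rows, a nearly unregularized fit (huge C) produces extreme weights, while the default L2 penalty keeps them small.

# Minimal sketch, assuming scikit-learn: compare a nearly unregularized fit
# (large C) with the default L2-regularized fit on the tiny table above.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5, 0, 0.6, 1, 0, 0.25],
              [0, 0.5, 0.3, 0, 0, 0],
              [0, 0, 1, 1, 1, 0.5],
              [0, 0, 0, 0, 1, 1],
              [0.25, 1, 1.25, 1, 0.1, 2]])
y = np.array([1, 1, 0, 0, 1])

for C in [1e6, 1.0]:                          # large C ≈ no regularization
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    print(f"C={C}: intercept={model.intercept_[0]:.2f}, coefs={np.round(model.coef_[0], 2)}")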


Python Example
Overfitting (1-d non-linear example)

Underfit vs. Overfit

(image credit: Scikit-learn; in practice data are rarely this clear)
Logistic Regression - Regularization

What if only 2 predictors?

x1    x2   Y
0.5   0    1
0     0.5  1
0     0    0
0     0    0
0.25  1    1

A: better fit
0 + 2*x1 + 2*x2 = logit(Y)
Logistic Regression - Regularization

L1 Regularization - “The Lasso”
Zeros out features by adding a penalty that keeps the model from perfectly fitting the training data.
Set the betas that maximize the penalized likelihood L. Sometimes written as:

L2 Regularization - “Ridge”
Shrinks features by adding a penalty that keeps the model from perfectly fitting the training data.
Set the betas that maximize the penalized likelihood L.
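In one common notation (a sketch of the usual penalized objectives, with \lambda controlling the penalty strength and \log L the log-likelihood):

\hat{\beta}_{L1} = \arg\max_{\beta} \; \log L(\beta) - \lambda \sum_{j} |\beta_j|

\hat{\beta}_{L2} = \arg\max_{\beta} \; \log L(\beta) - \lambda \sum_{j} \beta_j^2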


Machine Learning Goal: Generalize to new data

80% Training Data → train the Model
10% Development Data → set the penalty
10% Testing Data → does the model hold up?
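A minimal sketch (assuming scikit-learn, where C is the inverse of the penalty strength; not from the slides) of using the development split to pick the L1 penalty:

# Minimal sketch, assuming scikit-learn: choose the L1 penalty strength on a
# development set, then evaluate once on the held-out test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                    # toy features
y = (X[:, 0] - X[:, 1] > 0).astype(int)          # toy labels

# 80% train, 10% development, 10% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:                 # smaller C = stronger penalty
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X_train, y_train)
    acc = model.score(X_dev, y_dev)
    if acc > best_acc:
        best_C, best_acc = C, acc

final = LogisticRegression(penalty="l1", solver="liblinear", C=best_C)
final.fit(X_train, y_train)
print("chosen C:", best_C, "| test accuracy:", final.score(X_test, y_test))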


Logistic Regression - Review
● Classification: P(Y | X)
● Learn logistic curve based on example data
○ training + development + testing data
● Set betas based on maximizing the likelihood
○ “shifts” and “twists” the logistic curve
● Multivariate features: One-hot encodings
● Separation represented by hyperplane
● Overfitting
● Regularization
Example
See notebook on website.
Extra Material
One approach to finding the parameters which maximize the likelihood function...

Logistic Regression on a single feature (x)

Yi ∊ {0, 1}; X can be anything numeric.

“best fit”: whatever maximizes the likelihood function. Essentially, try different values until they best fit the training data. To estimate the betas, one can use reweighted least squares (Wasserman, 2005; Li, 2010).

This is just one way of finding the betas that maximize the likelihood function. In practice, we will use existing libraries that are fast and support additional useful steps like regularization.
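For intuition, a minimal sketch in plain Python (simple gradient ascent on the log-likelihood, not the reweighted least squares update cited above) on the toy single-feature data:

# Minimal sketch: gradient ascent on the log-likelihood for a single feature.
# (For intuition only; not the reweighted least squares method cited above.)
import math

xs = [2, 1, 0, 6, 2, 1]        # number of capital letters
ys = [1, 0, 0, 1, 1, 1]        # 1 = part of a proper noun

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1, lr = 0.0, 0.0, 0.05
for _ in range(20000):
    # Gradient of sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0 += lr * g0
    b1 += lr * g1

print("b_0:", b0, "b_1:", b1)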
