
Language Representations

Heng Ji
[email protected]

Acknowledgement: Some slides are from Dave Touretzky's and Luke Zettlemoyer's courses and from Ryan Cotterell et al.'s ACL 2023 tutorial slides
Project Sign-Up
[Discussions on Sept 8]
▪ https://docs.google.com/spreadsheets/d/1SPim1TTmzoStvJ20A4xbfTKLqH3maojIjiwhy-g3mx0/edit?usp=sharing
The Steps in NLP

Discourse
Pragmatics
Semantics
Syntax
Morphology

**we can go up, down and up and down and combine steps too!!
**every step is equally complex
Major Topics

1. Words
2. Syntax
3. Meaning
4. Discourse
5. Applications exploiting each
Simple Applications
▪ Word counters (wc in UNIX)
▪ Spell Checkers, grammar checkers
▪ Predictive Text on mobile handsets
Bigger Applications
▪ Intelligent computer systems
▪ NLU interfaces to databases
▪ Computer aided instruction
▪ Information retrieval
▪ Intelligent Web searching
▪ Data mining
▪ Machine translation
▪ Speech recognition
▪ Natural language generation
▪ Question answering
▪ Image Caption Generation
Part-of-Speech Tagging and Syntactic Parsing

Sentence: School of Theatre and Dance presents Wonderful Town

(S (NP (NN School) (IN of) (NP (NP (NN Theatre)) (CC and) (NP (NN Dance))))
   (VP (VBZ presents) (NP (JJ Wonderful) (NN Town))))
Semantic Role Labeling: Adding Semantics into Trees

Sentence: The Supreme Court gave states working leeway

(S (NP/ARG0 (DT The) (JJ Supreme) (NN Court))
   (VP (VBD gave)
       (NP/ARG2 (NNS states))
       (NP/ARG1 (VBG working) (NN Leeway/ARG0))))

Predicates: gave, working
Core Arguments

▪ Arg0 = agent
▪ Arg1 = direct object / theme / patient
▪ Arg2 = indirect object / benefactive / instrument /
attribute / end state
▪ Arg3 = start point / benefactive / instrument / attribute
▪ Arg4 = end point
Dependency Parsing

They hid the letter on the shelf
(figure: arcs between words labeled with dependency relations)
Name Tagging

▪ Since its inception in 2001, <name ID=“1” type=“organization”>Red</name> has caused a stir in <name ID=“2” type=“location”>Northeast Ohio</name> by stretching the boundaries of classical by adding multi-media elements to performances and looking beyond the expected canon of composers.

▪ Under the baton of <name ID=“3” type=“organization”>Red</name> Artistic Director <name ID=“4” type=“person”>Jonathan Sheffer</name>, <name ID=“5” type=“organization”>Red</name> makes its debut appearance at <name ID=“6” type=“organization”>Kent State University</name> on March 7 at 7:30 p.m.
Coreference
▪ But the little prince could not restrain his admiration:
▪ "Oh! How beautiful you are!"
▪ "Am I not?" the flower responded, sweetly. "And I was born at the same moment as the sun . . ."
▪ The little prince could guess easily enough that she was not any too modest--but how moving--and exciting--she was!
▪ "I think it is time for breakfast," she added an instant later. "If you would have the kindness to think of my needs--"
▪ And the little prince, completely abashed, went to look for a sprinkling-can of fresh water. So, he tended the flower.
Relation Extraction

relation: a semantic relationship between two entities

ACE relation type          example
Agent-Artifact             Rubin Military Design, the makers of the Kursk
Discourse                  each of whom
Employment/Membership      Mr. Smith, a senior programmer at Microsoft
Place-Affiliation          Salzburg Red Cross officials
Person-Social              relatives of the dead
Physical                   a town some 50 miles south of Salzburg
Other-Affiliation          Republican senators
Temporal Information Extraction

▪ In 1975, after being fired from Columbia amid allegations that he used company funds to pay for his son's bar mitzvah, Davis founded Arista
  ▪ Is ‘1975’ related to the employee_of relation between Davis and Arista?
  ▪ If so, does it indicate START, END, HOLDS… ?
▪ Each classification instance represents a temporal expression in the context of the entity and slot value.
▪ We consider the following classes:
  ▪ START   Rob joined Microsoft in 1999.
  ▪ END     Rob left Microsoft in 1999.
  ▪ HOLDS   In 1999 Rob was still working for Microsoft.
  ▪ RANGE   Rob has worked for Microsoft for the last ten years.
  ▪ NONE    Last Sunday Rob’s friend joined Microsoft.
Event Extraction
▪ An event is a specific occurrence that implies a change of states
▪ event trigger: the main word which most clearly expresses an event occurrence
▪ event arguments: the mentions that are involved in an event (participants)
▪ event mention: a phrase or sentence within which an event is described, including trigger and arguments
▪ ACE defined 8 types of events, with 33 subtypes

ACE event type/subtype      Event Mention Example
Life/Die                    Kurt Schork died in Sierra Leone yesterday (trigger: died; argument "Kurt Schork", role=victim)
Transaction/Transfer        GM sold the company in Nov 1998 to LLC
Movement/Transport          Homeless people have been moved to schools
Business/Start-Org          Schweitzer founded a hospital in 1913
Conflict/Attack             the attack on Gaza killed 13
Contact/Meet                Arafat’s cabinet met for 4 hours
Personnel/Start-Position    She later recruited the nursing student
Justice/Arrest              Faison was wrongly arrested on suspicion of murder
Recap: Language Modeling

▪ p(I read a book about dogs)
▪ How to assign a probability to a sentence?
▪ Another view: distribution over the next word:
  ▪ p(dogs | I read a book about _____ )
We will study basic neural NLP models

Building Blocks: How to Represent Language?
▪ Neural Networks
How to Represent a Word?
Word embeddings
▪ Idea: learn an embedding from words into vectors
▪ Need to have a function W(word) that returns a vector encoding that word.
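A minimal sketch of such a W, assuming the embedding matrix has already been learned (the vocabulary and 4-dimensional vectors here are made up for illustration):

```python
import numpy as np

# Hypothetical learned embedding matrix: one row per vocabulary word.
vocab = {"hotel": 0, "motel": 1, "dog": 2}
E = np.array([[0.8, 0.1, 0.0, 0.3],
              [0.7, 0.2, 0.1, 0.3],
              [0.0, 0.9, 0.4, 0.1]])

def W(word):
    """Return the vector encoding of a word (a row of the embedding matrix)."""
    return E[vocab[word]]

print(W("hotel"))  # -> [0.8 0.1 0.  0.3]
```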
Word2Vec

Things you need to know:

Dot product:
a · b = ||a|| ||b|| cos(θab)
      = a1b1 + a2b2 + … + anbn

From this one can derive cosine similarity:
cos θab = (a · b) / (||a|| ||b||)

Softmax function:
If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is
[0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175].
The softmax function highlights the largest values and suppresses the other values, so that they are all positive and sum to 1.
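A minimal NumPy sketch of these three operations (the vectors a and b are made up for illustration); it reproduces the softmax values quoted above:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = np.dot(a, b)                                      # a1*b1 + a2*b2 + a3*b3
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cos of the angle between a and b

x = np.array([1, 2, 3, 4, 1, 2, 3], dtype=float)
print(np.round(softmax(x), 3))  # -> [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
```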
Word2Vec

Traditional representations of a word's meaning:

1. Dictionary (or PDB): not too useful in computational linguistics research.
2. WordNet: a graph of words, with relationships like “is-a” and synonym sets.
   Problems: depends on human labeling, hence misses a lot; hard to automate this process.
3. These all use atomic symbols: hotel, motel, equivalent to a one-hot vector:
   Hotel: [0,0,0,0,0,0,0,0,1,0,0,0,0,0]
   Motel: [0,0,0,0,1,0,0,0,0,0,0,0,0,0]
   These are called one-hot representations. Very long: 13M dimensions (Google crawl vocabulary).
   Example: an inverted index.
▪ Quiz 1: Assign coordinates to each word
▪ Quiz 2: How to represent Adult, Child, Infant, Grandfather?
▪ Quiz 3: How to represent Grandmother, Grandparent, Teenager?
▪ Quiz 4: How to represent King, Queen, Prince and Princess?
▪ Quiz 5: How to represent King - Man, King - Man + Woman?
▪ Quiz 6: How to represent Cucumber, Smiled, Honesty, Rescue?
▪ Quiz 7: How to represent “reserved”, “McVeigh”?
▪ Solution: Look at what other words they co-occur with
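A toy sketch of the co-occurrence idea (the tiny corpus and window size are made up for illustration): each word is represented by how often it appears near every other word.

```python
from collections import Counter

# Toy corpus, made up for illustration.
sentences = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the child plays in the garden".split(),
]

vocab = sorted({w for s in sentences for w in s})

# Count co-occurrences within a +/-2 word window.
counts = Counter()
window = 2
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[(w, sent[j])] += 1

# Each row of this "matrix" is a sparse, count-based word vector.
vectors = {w: [counts[(w, c)] for c in vocab] for w in vocab}
print(vectors["king"])
print(vectors["queen"])   # similar context words -> similar vectors
```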
Markov Assumption

P(t1, ..., tn) = P(t1) · P(t2 | t1) · P(t3 | t1 t2) · ... · P(tn | t1, ..., tn-1)
(each factor conditions on the full history)

▪ Make a Markov assumption to shorten the history and solve the data scarcity problem
▪ P(T) is a product of the probabilities of the N-grams that make it up

P(t1, ..., tn) = P(t1) · Π_{i=2..n} P(ti | ti-1)
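A minimal sketch of a bigram model under this assumption (the tiny training corpus is made up for illustration): estimate P(ti | ti-1) from counts and multiply the factors.

```python
from collections import Counter

# Toy corpus, made up for illustration.
corpus = ["i read a book", "i read a paper", "you read a book"]
tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("i read a book"))  # product of the bigram probabilities
```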
▪ Quiz 8: Come up with some words that cannot be represented by such vectors
  ▪ Action knowledge: “open”, “close”, “sit”… (need vision and simulation)
  ▪ Top-employee (needs embedding composition)
  ▪ Numbers, time
  ▪ Chemical entities

Recap: Neuron

Activation functions are applied element-wise (e.g., f(x) = [f(x1), …, f(xn)])

https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092

Neural Network
▪ Multiple sets of weights in sequence
▪ Multiple hidden layers
▪ Activation functions
  ▪ Add nonlinearity

xn = Φ(W·xn-1 + b)
• W is the weight matrix
• b is the bias vector
• xn-1 is the response from the previous layer
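A minimal NumPy sketch of one such layer applied twice (the layer sizes, random weights, and tanh activation are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    # One layer: affine transform followed by an element-wise nonlinearity.
    return np.tanh(W @ x + b)

# Arbitrary sizes: 4-dimensional input, hidden layers of size 5 and 3.
x0 = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

x1 = layer(x0, W1, b1)   # x1 = Φ(W1·x0 + b1)
x2 = layer(x1, W2, b2)   # x2 = Φ(W2·x1 + b2)
print(x2.shape)          # (3,)
```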

Recurrent Neural Network

● Neural network with loops (output of the previous step is input to the next)
● Maintains a hidden state
  ○ Tracks previous inputs
  ○ Handles variable-length input (sentences, video frames, ...)
● Single differentiable function from input words to output probabilities
● ht = fW(ht-1, xt)
● ht = tanh(Wh·ht-1 + Wx·xt)
● yt = Wy·ht

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
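A minimal NumPy sketch of this recurrence (dimensions are arbitrary; a trained model would learn Wh, Wx, Wy rather than use random values):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, emb, vocab = 8, 6, 10        # arbitrary sizes for illustration

Wh = rng.normal(size=(hidden, hidden))
Wx = rng.normal(size=(hidden, emb))
Wy = rng.normal(size=(vocab, hidden))

def rnn_step(h_prev, x_t):
    # h_t = tanh(Wh·h_{t-1} + Wx·x_t);  y_t = Wy·h_t
    h_t = np.tanh(Wh @ h_prev + Wx @ x_t)
    y_t = Wy @ h_t
    return h_t, y_t

h = np.zeros(hidden)
for x in rng.normal(size=(5, emb)):  # a sequence of 5 (random) input vectors
    h, y = rnn_step(h, x)
print(y.shape)  # (10,) -- scores over the vocabulary at the last step
```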

Language Modeling with RNN

▪ Embedding or hidden state can serve as the word representation
  ▪ Embedding = general representation
  ▪ Hidden state = context-specific representation
▪ Input is a one-hot encoding of words
  ▪ Multiply by the embedding matrix
▪ Output predicts a probability distribution over the vocabulary
  ▪ P(next word | previous words)
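Building on the recurrence sketched above, a hypothetical forward step of an RNN language model (the softmax helper, sizes, and random weights are illustrative only; multiplying a one-hot vector by the embedding matrix is the same as looking up a row):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb, hidden = 10, 6, 8

E  = rng.normal(size=(vocab, emb))      # embedding matrix
Wh = rng.normal(size=(hidden, hidden))
Wx = rng.normal(size=(hidden, emb))
Wy = rng.normal(size=(vocab, hidden))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lm_step(h_prev, word_id):
    x = E[word_id]                       # one-hot input times E == row lookup
    h = np.tanh(Wh @ h_prev + Wx @ x)
    p = softmax(Wy @ h)                  # P(next word | previous words)
    return h, p

h = np.zeros(hidden)
for w in [3, 1, 7]:                      # made-up word ids
    h, p = lm_step(h, w)
print(p.sum())                           # ~1.0
```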

RNN to Train a Character Language Model

▪ Vocabulary: four possible letters: h, e, l, o
▪ Train the RNN on the sequence “hello”

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
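A small sketch of the training data this setup implies: at each step the input is the current character and the target is the next one.

```python
# Character language model training pairs for the sequence "hello"
vocab = ["h", "e", "l", "o"]
char_to_id = {c: i for i, c in enumerate(vocab)}

text = "hello"
inputs  = [char_to_id[c] for c in text[:-1]]  # h, e, l, l
targets = [char_to_id[c] for c in text[1:]]   # e, l, l, o

for i, t in zip(inputs, targets):
    print(f"input {vocab[i]!r} -> target {vocab[t]!r}")
```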


Decoding Algorithm: Greedy Decoding

▪ At each step, take the most probable word (argmax)
▪ Use that as the next word; feed it as input at the next step
▪ Keep going until you produce the end token
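A minimal sketch of that loop, assuming a hypothetical model function step(h, word_id) that returns a new hidden state and a probability vector over the vocabulary:

```python
import numpy as np

def greedy_decode(step, h0, start_id, end_id, max_len=50):
    """Greedy decoding: at each step pick the argmax word and feed it back in."""
    h, word = h0, start_id
    output = []
    for _ in range(max_len):
        h, probs = step(h, word)        # probs: distribution over the vocabulary
        word = int(np.argmax(probs))    # take the most probable word
        if word == end_id:              # stop at the end token
            break
        output.append(word)
    return output
```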

Decoding Algorithm: Beam Search
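This slide is a figure in the original deck; as a rough sketch of the standard algorithm (the step function, beam_width, and token ids are assumptions, as above), beam search keeps the k most probable partial sequences at every step instead of only the single argmax:

```python
import numpy as np

def beam_search(step, h0, start_id, end_id, beam_width=4, max_len=50):
    """Keep the beam_width most probable partial sequences at every step."""
    # Each hypothesis: (log probability, token list, hidden state, finished?)
    beams = [(0.0, [start_id], h0, False)]
    for _ in range(max_len):
        candidates = []
        for logp, tokens, h, done in beams:
            if done:
                candidates.append((logp, tokens, h, True))
                continue
            h_new, probs = step(h, tokens[-1])      # distribution over vocabulary
            for w in np.argsort(probs)[-beam_width:]:
                candidates.append((logp + np.log(probs[w]),
                                   tokens + [int(w)], h_new, w == end_id))
        # Prune back down to the beam_width best hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(done for _, _, _, done in beams):
            break
    return beams[0][1]  # tokens of the best hypothesis
```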


Recursive Neural Networks
▪ Combining Vectors: multi-layer RNN

Training

Generation

Conditioned Generation
Seq2Seq

Seq2Seq with Attention
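These slides are figures in the original deck. As a rough sketch of one common formulation (dot-product attention; the sizes and random values are made up), attention lets the decoder form a weighted sum of the encoder hidden states at every output step:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(decoder_state, encoder_states):
    """Dot-product attention: weight encoder states by similarity to the decoder state."""
    scores = encoder_states @ decoder_state          # one score per source position
    weights = softmax(scores)                        # attention distribution
    context = weights @ encoder_states               # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))   # 6 source positions, hidden size 8 (arbitrary)
dec = rng.normal(size=8)
context, weights = attention(dec, enc)
print(weights.round(3), context.shape)
```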
