Module 1 Lecture 5-1
CSE 243:
Natural Language Processing
Recap from Previous Lecture
• Sentiment Analysis
Contents
• Named Entity Recognition
Named Entities
• A named entity is anything that can be referred to with a proper
name.
• NER is a multi-class problem; the set of classes depends on the task:
• 3 classes – PER (person), LOC (location), ORG (organization)
• 4 classes – PER (person), LOC (location), ORG (organization), GPE (geo-political
entity)
• More classes – PER (person), LOC (location), ORG (organization), GPE (geo-
political entity) + classes for dates, times, numbers, prices, etc.
• Entities are often multi-word phrases.
Examples of Named Entities
Class                   Example
Person                  Sandeep Mathias
Location                Bengaluru
Organization            Presidency University
Geo-political Entity    India
Named Entity Tagging
• The task of Named Entity Recognition (NER):
• Find spans of text that constitute a named entity.
• Tag the entity with the proper NER class.
NER Input
• Citing high fuel prices, United Airlines said Friday it has increased
fares by $6 per round trip on flights to some cities also served by
lower-cost carriers.
• American Airlines, a unit of AMR Corp., immediately matched the
move, spokesman Tim Wagner said.
• United, a unit of UAL Corp., said the increase took effect Thursday and
applies to most routes where it competes against discount carriers,
such as Chicago to Dallas and Denver to San Francisco.
NER – Finding NER Spans
• Citing high fuel prices, [United Airlines] said [Friday] it has increased
fares by [$6] per round trip on flights to some cities also served by
lower-cost carriers.
• [American Airlines], a unit of [AMR Corp.], immediately matched the
move, spokesman [Tim Wagner] said.
• [United], a unit of [UAL Corp.], said the increase took effect
[Thursday] and applies to most routes where it competes against
discount carriers, such as [Chicago] to [Dallas] and [Denver] to [San
Francisco].
NER Output
• Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has
increased fares by [MONEY $6] per round trip on flights to some cities
also served by lower-cost carriers.
• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately
matched the move, spokesman [PER Tim Wagner] said.
• [ORG United], a unit of [ORG UAL Corp.], said the increase took effect
[TIME Thursday] and applies to most routes where it competes against
discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver]
to [LOC San Francisco].
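The bracketed output above can be read back into a list of (class, text) pairs. The following is a minimal sketch, assuming annotations always have the form `[LABEL entity text]` with an upper-case label and no nested brackets; the name `extract_entities` is made up for this illustration.

```python
import re

# Matches annotations such as "[ORG United Airlines]":
# group 1 is the class label, group 2 is the entity text.
ENTITY_PATTERN = re.compile(r"\[([A-Z]+) ([^\]]+)\]")

def extract_entities(annotated):
    """Return (label, text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ENTITY_PATTERN.finditer(annotated)]

sentence = (
    "[ORG American Airlines], a unit of [ORG AMR Corp.], immediately "
    "matched the move, spokesman [PER Tim Wagner] said."
)
entities = extract_entities(sentence)
print(entities)
# [('ORG', 'American Airlines'), ('ORG', 'AMR Corp.'), ('PER', 'Tim Wagner')]
```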
Why NER?
• Sentiment analysis: find a consumer’s sentiment toward a particular
company or person.
• Question answering: answer questions about an entity.
• Information extraction: extract facts about entities from text.
Why NER is not so easy
• Segmentation
• In PoS tagging there is no segmentation problem, since each word gets exactly 1 tag.
• In NER, we have to find the span before adding the tags!
• Type Ambiguity
• The same span can map to different entity types, depending on context.
• [Washington] was born into slavery on the farm of James Burroughs.
• [Washington] went up 2 games to 1 in the four-game series.
• Blair arrived in [Washington] for what may well be his last state visit.
• In June, [Washington] legislators passed a primary seatbelt law.
• The same sentences with the entity type of each span resolved:
• [PER Washington] was born into slavery on the farm of James Burroughs.
• [ORG Washington] went up 2 games to 1 in the four-game series.
• Blair arrived in [LOC Washington] for what may well be his last state visit.
• In June, [GPE Washington] legislators passed a primary seatbelt law.
BIO-Tagging
• BIO tagging converts NER, where 1 label can cover multiple words, into a
sequence labeling problem like PoS tagging, with 1 tag per word.
• Consider the sentence: “[PER Jane Villanueva] of [ORG United] , a
unit of [ORG United Airlines Holding] , said the fare applies to the
[LOC Chicago] route.”
• Instead of just marking the spans, we also mark whether each word is at the
beginning (B) of a span or inside (I) it. Words outside any span
are tagged as outside (O).
BIO Tagging
• The sentence: “[PER Jane Villanueva] of [ORG United], a unit of [ORG
United Airlines Holding] , said the fare applies to the [LOC Chicago]
route.”
• Becomes:
• “Jane_B-PER Villanueva_I-PER of_O United_B-ORG ,_O a_O unit_O
of_O United_B-ORG Airlines_I-ORG Holding_I-ORG ,_O said_O the_O
fare_O applies_O to_O the_O Chicago_B-LOC route_O ._O”
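• Total number of tags = 2n + 1, where n is the number of entity classes:
a B- tag and an I- tag for each class, plus a single O tag.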
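The conversion from spans to per-token BIO tags can be sketched as follows. This is an illustration only: it assumes spans are given as (start, end, label) token indices with an exclusive end, and the helper names `spans_to_bio` and `bio_tag_set` are made up for this example.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end, label) token spans (end exclusive) into one BIO tag per token."""
    tags = ["O"] * len(tokens)          # every token starts as outside
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the span
    return tags

def bio_tag_set(labels):
    """All tags for n entity classes: B- and I- per class, plus O (2n + 1 in total)."""
    return ["O"] + [f"{prefix}-{label}" for label in labels for prefix in ("B", "I")]

tokens = ("Jane Villanueva of United , a unit of United Airlines Holding , "
          "said the fare applies to the Chicago route .").split()
spans = [(0, 2, "PER"), (3, 4, "ORG"), (8, 11, "ORG"), (18, 19, "LOC")]

tags = spans_to_bio(tokens, spans)
print(" ".join(f"{tok}_{tag}" for tok, tag in zip(tokens, tags)))
print(len(bio_tag_set(["PER", "LOC", "ORG"])))  # 7, i.e. 2 * 3 + 1
```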
Other BIO Tagging variants
• IO tagging – I marks a word inside a span, O a word outside any span.
• BIO tagging – B marks the beginning of a span, I a word inside the span, O a
word outside any span.
• BIOES tagging – B marks the beginning of a span, I a word inside the span, O a
word outside any span, E the end of the span, and S a span consisting of a
single word.
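The three schemes carry the same span information in different amounts of detail, so BIO tags can be rewritten mechanically into the other two. A minimal sketch, assuming the input is a well-formed BIO sequence; the function names `bio_to_io` and `bio_to_bioes` are made up for this illustration.

```python
def bio_to_io(tags):
    """Drop the B/I distinction: every entity token becomes I-<label>."""
    return [tag if tag == "O" else "I-" + tag[2:] for tag in tags]

def bio_to_bioes(tags):
    """Add explicit end (E) and single-word (S) markers to a BIO sequence."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"   # does the span go on after this word?
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out

bio = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
print(bio_to_io(bio))
print(bio_to_bioes(bio))
# ['B-PER', 'E-PER', 'O', 'S-ORG', 'O', 'B-ORG', 'I-ORG', 'E-ORG', 'O', 'S-LOC']
```

One consequence of the IO scheme is that two adjacent entities of the same class cannot be told apart, since nothing marks where the second one begins.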
Standard Algorithms for NER
• Many supervised sequence labeling models can be used.
• Hidden Markov Models (HMM)
• Conditional Random Fields (CRF)
• Maximum Entropy Markov Models (MEMM)
• Neural Sequence Models
• Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), etc.
• Pre-trained Language Models – e.g., BERT
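Whichever of these models is used, it emits one tag per token, and the entity spans still have to be read back from that tag sequence. The sketch below shows one way to do this for BIO tags. It is not tied to any particular model; the function name `bio_to_spans` and the choice to ignore an I- tag that does not follow a matching B- or I- tag are assumptions made for this illustration.

```python
def bio_to_spans(tokens, tags):
    """Recover (label, text) entity spans from per-token BIO tags."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        # Close the open span unless this tag continues it.
        if start is not None and tag != f"I-{label}":
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Assumption: an I- tag with no open span of the same class is ignored.
    return spans

tokens = ["spokesman", "Tim", "Wagner", "said", "."]
predicted = ["O", "B-PER", "I-PER", "O", "O"]   # hypothetical model output
decoded = bio_to_spans(tokens, predicted)
print(decoded)
# [('PER', 'Tim Wagner')]
```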