
Course: Artificial Intelligence (COMP6065)

Unofficial Slides

Natural Language Processing

Session 23

Revised by Williem, S. Kom., Ph.D.


1
Learning Outcomes
At the end of this session, students will be able to:

• LO 6: Apply techniques for processing natural language and other perceptual signs so that an agent can interact intelligently with the world

2
Outline
1. Natural Language Processing

2. Language Models

3. Text Classification

4. Information Retrieval

5. Information Extraction

6. Summary

3
Natural Language Processing
• An agent that wants to acquire information needs to understand (at least partially) human language (natural language)

– To communicate with humans

– To acquire information from written language

• There are three ways to acquire information from text:

– Text Classification

– Information Retrieval

– Information Extraction

4
Language Models
• One common component across these information-seeking tasks is the language model

• Formal languages also have rules that define the meaning, or semantics, of a program

– For example: the rules say that the "meaning" of "2 + 2" is 4, and the meaning of "1 / 0" is that an error is signaled

• Natural languages are ambiguous and difficult to deal with (they are large and constantly changing)

5
Language Models
• N-gram character models (the units can also be words or other symbols)

– We estimate the probability of each sequence of n characters

– E.g. in one Web collection, P("the") = 0.027 and P("zgq") = 0.000000002

– A model of the probability distribution of n-letter sequences

– Such a model is also a Markov chain of order n-1
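
• A minimal sketch of a character n-gram count model in Python; the sample string and the choice of n are placeholders, not from the slides:

    from collections import Counter

    def ngram_probs(text, n=3):
        # Count every n-character window and normalize to a probability distribution
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(counts.values())
        return {gram: c / total for gram, c in counts.items()}

    probs = ngram_probs("the quick brown fox jumps over the lazy dog", n=3)
    print(probs.get("the", 0.0))  # relative frequency of the trigram "the"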

6
Language Models
• N-gram models for language identification

– Given a text, determine which natural language it is written in

– E.g. "Hello, World" (English) vs. "Halo, dunia" (Indonesian)

• How?

– We build the trigram character model of each language

– We combine these with the prior probability of each language and pick the most likely language, as sketched below
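
• A hedged sketch of trigram-based language identification; the training snippets below are toy placeholders, and a real system would train on large corpora and include the language priors:

    import math
    from collections import Counter

    def train_trigram_model(text, k=1.0, alphabet_size=27):
        # Character trigram counts with add-k smoothing, so unseen trigrams get nonzero probability
        counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
        total = sum(counts.values())
        return lambda gram: (counts[gram] + k) / (total + k * alphabet_size ** 3)

    def log_likelihood(model, text):
        return sum(math.log(model(text[i:i + 3])) for i in range(len(text) - 2))

    # Toy training data; equal priors are assumed for the two languages
    models = {
        "English": train_trigram_model("hello world this is a short english sample"),
        "Indonesian": train_trigram_model("halo dunia ini adalah contoh teks pendek"),
    }
    query = "hello there world"
    print(max(models, key=lambda lang: log_likelihood(models[lang], query)))  # expected: English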

7
Language Models
• Smoothing n-gram models

– An n-gram model only estimates the true probability distribution (common sequences get high probability, e.g. P(" th") = 1.5%)

– What about uncommon sequences, e.g. P(" ht")?

• No dictionary word starts with "ht"

• Yet text such as "The program issues an http request" does occur, so the probability should not be zero

– Smoothing is the process of adjusting the probability estimates for low-frequency (and zero) counts

8
Language Models
• Smoothing n-gram models

– Backoff model: we start by estimating n-gram counts, but for any particular sequence that has a low count, we back off to (n-1)-grams

– Linear interpolation smoothing is a backoff model that combines the trigram, bigram, and unigram estimates (a sketch follows below)
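
• A minimal sketch of linear interpolation smoothing; the weights are illustrative values (in practice they are tuned on a validation corpus), and the three input probabilities are assumed to come from separately trained models:

    def interpolated_prob(trigram_p, bigram_p, unigram_p, lambdas=(0.7, 0.2, 0.1)):
        # Weighted combination of trigram, bigram, and unigram estimates; weights sum to 1
        l3, l2, l1 = lambdas
        return l3 * trigram_p + l2 * bigram_p + l1 * unigram_p

    # A sequence with a zero trigram count still gets probability mass from the lower-order models
    print(interpolated_prob(0.0, 0.004, 0.02))  # 0.0028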

9
Language Models
• Model evaluation

– To choose which model to use

– Perform cross-validation

• Split the data into a training corpus and a validation corpus

– Evaluation metric:

• Perplexity: the reciprocal of the sequence probability, normalized by sequence length (lower is better); see the sketch below
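
• A short sketch of the metric, assuming the model supplies one probability per character of the sequence:

    import math

    def perplexity(char_probs):
        # Reciprocal of the sequence probability, normalized by sequence length
        log_p = sum(math.log(p) for p in char_probs)
        return math.exp(-log_p / len(char_probs))

    print(perplexity([0.2, 0.1, 0.05]))  # lower perplexity means a better model fit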

10
Text Classification
• Given a text of some kind, decide which of a predefined set of classes it belongs to (also called categorization)

– E.g. spam detection (classes: spam and ham)

• Training data: labeled examples of each class (a minimal sketch follows below)
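
• A minimal naive Bayes sketch with word unigrams; the two training messages are toy placeholders, not real training data:

    import math
    from collections import Counter, defaultdict

    def train_naive_bayes(labeled_docs, k=1.0):
        # labeled_docs: list of (label, text) pairs; word-unigram model with add-k smoothing
        word_counts, label_counts = defaultdict(Counter), Counter()
        for label, text in labeled_docs:
            label_counts[label] += 1
            word_counts[label].update(text.lower().split())
        vocab_size = len({w for c in word_counts.values() for w in c})

        def classify(text):
            def score(label):
                total = sum(word_counts[label].values())
                prior = math.log(label_counts[label] / sum(label_counts.values()))
                return prior + sum(
                    math.log((word_counts[label][w] + k) / (total + k * vocab_size))
                    for w in text.lower().split())
            return max(label_counts, key=score)

        return classify

    classify = train_naive_bayes([
        ("spam", "win money now claim your free prize"),
        ("ham", "meeting notes for the project review tomorrow"),
    ])
    print(classify("claim your free prize now"))  # expected: spam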

11
Text Classification
• Another way to think about classification is as a problem in data
compression. A lossless compression algorithm takes a sequence
of symbols, detects repeated patterns in it, and writes a
description of the sequence that is more compact than the
original.

• For example, the text "0.142857142857142857" might be compressed to "0.[142857]*3". Compression algorithms work by building dictionaries of subsequences of the text and then referring to entries in the dictionary.

• The connection to classification: a new text can be assigned to the class whose training texts compress it best, because it shares the most repeated patterns with them.

12
Information Retrieval
• Information retrieval (Googling) is the task of finding
documents that are relevant to a user’s need for information

• An information retrieval (IR) system can be characterized by

– A corpus of documents

– Queries posed in a query language

– A result set

– A presentation of the result set

• The earliest IR systems worked on a Boolean keyword model

13
Information Retrieval
• IR scoring functions

– Instead of the Boolean model, most IR systems use models based on statistics of word counts

• E.g. the BM25 scoring function

– A scoring function takes a document and a query and returns a numeric score (a relevancy score)

– In the BM25 function, the score is a linear weighted combination of scores for each of the query words

14
Information Retrieval
• BM25 function

– Three factors affect the weight of a query term

• The frequency with which the query term appears in the document (TF, term frequency)

• The inverse document frequency of the term (IDF)

– Based on the document frequency of the term (DF)

• The length of the document

15
Information Retrieval
• BM25 function

– The parameters are b and k (typical defaults are sketched below)

– L is the average document length in the corpus
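
• A hedged sketch of the BM25 score; the defaults k = 2.0 and b = 0.75 and the toy corpus statistics are illustrative assumptions:

    import math

    def bm25_score(query_terms, doc_terms, df, num_docs, avg_len, k=2.0, b=0.75):
        # Linear weighted combination of per-term scores for a single document
        score, doc_len = 0.0, len(doc_terms)
        for term in query_terms:
            tf = doc_terms.count(term)  # term frequency in the document
            idf = math.log((num_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))
            score += idf * tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_len))
        return score

    doc = "ibm announces a new ibm server line".split()
    print(bm25_score(["ibm", "server"], doc,
                     df={"ibm": 50, "server": 120}, num_docs=1000, avg_len=40.0))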

16
Information Retrieval
• IR system evaluation

– Two measures are used in the evaluation

• Recall: the proportion of all relevant documents in the collection that are in the result set

• Precision: the proportion of documents in the result set that are actually relevant

– Example: an IR system evaluated over a collection of 100 documents

                   In Result    Not In Result
    Relevant           30             20
    Not Relevant       10             40

17
Information Retrieval
• IR system evaluation
                   In Result    Not In Result
    Relevant           30             20
    Not Relevant       10             40

– Recall = 30 / (30 + 20) = 0.60

– Precision = 30 / (30 + 10) = 0.75

18
Information Retrieval
• PageRank algorithm

– It was one of the two original ideas that set Google’s search
apart from other Web search engines (1997)

– Given the query [IBM], how do we ensure that the IBM home page (ibm.com) comes first in the list of results, even if other pages mention the word "IBM" more frequently?

– The idea is that ibm.com has many in-links (links pointing to ibm.com), so it deserves to be ranked near the top of the results.
19
Information Retrieval
• PageRank algorithm

– PageRank is designed to weight links from high-quality sites more heavily

– It can be computed by an iterative procedure: start with all pages having PR(p) = 1, and iterate the algorithm until convergence (a sketch follows below)
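
• A minimal sketch of the iterative computation; the tiny link graph and the damping factor d = 0.85 are illustrative assumptions, not from the slides:

    def pagerank(links, d=0.85, iterations=50):
        # links: dict mapping each page to the list of pages it links to
        pages = list(links)
        n = len(pages)
        pr = {p: 1.0 for p in pages}  # start with PR(p) = 1 for every page
        for _ in range(iterations):
            pr = {p: (1 - d) / n + d * sum(pr[q] / len(links[q])
                                           for q in pages if p in links[q])
                  for p in pages}
        return pr

    graph = {"ibm.com": ["news.com"],
             "news.com": ["ibm.com", "blog.com"],
             "blog.com": ["ibm.com"]}
    print(pagerank(graph))  # ibm.com ends up with the highest rank: it has the most in-links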

20
Information Retrieval
• Question answering

– Question answering is a somewhat different task, in which the query really is a question, and the answer is not a ranked list of documents but rather a short response

– Based on the premise that the question may be answered on many web pages, question answering is treated as a problem of precision (accuracy), not recall (completeness)

• We only have to find the answer


21
Information Extraction
• Information extraction is the process of acquiring knowledge by skimming a text and looking for occurrences of a particular class of objects and for relationships among objects

– E.g. extracting instances of addresses from web pages

• In a limited domain, it can be done with high accuracy

• In a general domain, more complex linguistic models and learning techniques are necessary

22
Information Extraction
• The simplest type of information extraction system is an attribute-based extraction system

– Assumes that the entire text refers to a single object

– E.g. the problem of extracting from the text "IBM ThinkBook 970. Our price: $399.00" the attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}

• We can address the problem by defining a template

– A template is defined by a finite state automaton or regex (regular expression)
23
Information Extraction
• The regex template for prices in dollars (the slide's figure is reconstructed as a sketch below):
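
• A sketch along the lines of the textbook's dollar-amount template; the exact pattern shown on the slide is not reproduced here, so this is a reconstruction:

    import re

    # A dollar sign, one or more digits, and an optional two-digit cents part
    price_pattern = re.compile(r"[$][0-9]+([.][0-9][0-9])?")

    text = "IBM ThinkBook 970. Our price: $399.00"
    match = price_pattern.search(text)
    print(match.group(0))  # -> $399.00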

• A step up from attribute-based extraction systems are relational extraction systems

– E.g. FASTUS, which handles news stories about corporate mergers and acquisitions

24
Information Extraction

• FASTUS consists of five stages:


– Tokenization
– Complex-word handling
– Basic-group handling
– Complex-phrase handling
– Structure merging
25
Information Extraction
• A different application of extraction technology is building a
large knowledge base of facts from a corpus

• This is different in three ways:


– First, it is open-ended: we want to acquire facts about all types of domains, not just one specific domain

– Second, with a large corpus, this task is dominated by precision, not recall

– Third, the results can be statistical aggregates gathered from multiple sources

26
Information Extraction
• Machine reading

– A machine that behaves more like a human reader, learning from the text itself

– A representative machine-reading system is TEXTRUNNER (Banko and Etzioni, 2008)

• E.g. from the parse of the sentence "Einstein received the Nobel Prize in 1921," TEXTRUNNER is able to extract the relation ("Einstein", "received", "Nobel Prize")
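
• A toy illustration of pulling out a (subject, verb, object) triple with a regex; this is not TEXTRUNNER's actual method, which relies on parsing and a learned extractor over a large Web corpus:

    import re

    sentence = "Einstein received the Nobel Prize in 1921"
    # Toy pattern: a one-word subject, the verb "received", an object, and an optional year
    m = re.match(r"(\w+) (received) (?:the )?(.+?)(?: in \d{4})?$", sentence)
    if m:
        print((m.group(1), m.group(2), m.group(3)))  # ('Einstein', 'received', 'Nobel Prize')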

27
Information Extraction

• Eight general templates cover about 95% of the relations expressed in English

• TEXTRUNNER achieves a precision of 88% and a recall of 45% (F1 of 60%) on a large Web corpus

28
Summary
• Text classification can be done with naïve Bayes n-gram models or with other standard classification algorithms

• Information retrieval systems use a very simple language model

• Information extraction systems use a more complex model that includes limited notions of syntax and semantics

29
References
• Stuart Russell and Peter Norvig. 2010. Artificial Intelligence: A Modern Approach. Pearson Education, New Jersey. ISBN: 9780132071482

• http://aima.cs.berkeley.edu

30
