Natural Language Processing Notes
Rada Mihalcea
Fall 2011
Yahoo, Google, Microsoft: Information Retrieval
Monster.com, HotJobs.com (job finders): Information Extraction + Information Retrieval
Systran (powers Babelfish): Machine Translation
Ask Jeeves: Question Answering
Myspace, Facebook, Blogspot: Processing of User-Generated Content
Tools for business intelligence
All Big Guys have (several) strong NLP research labs:
IBM, Microsoft, AT&T, Xerox, Sun, etc.
Information extraction
Extract useful information from resumes
Automatic summarization
Condense 1 book into 1 page
Natural?
Natural Language?
Refers to the language spoken by people, e.g. English, Japanese, Swahili, as opposed to artificial languages, like C++, Java, etc.
[Computational Linguistics: doing linguistics on computers. More on the linguistic side than NLP, but closely related.]
Computers have:
No common sense knowledge
No reasoning capacity
Search
Language Analysis
Semantics
Parsing
Issues in Syntax
the dog ate my homework - Who did what?
1. Identify the part of speech (POS)
dog = noun; ate = verb; homework = noun
English POS tagging: 95% accuracy
2. Identify collocations
mother in law, hot dog
Compositional versus non-compositional collocates
Issues in Syntax
Shallow parsing:
the dog chased the bear
subject - predicate
Identify basic structures
NP-[the dog] VP-[chased the bear]
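The shallow-parsing step above can be illustrated with a very small sketch. It assumes a hand-written toy lexicon (the `LEXICON` table is hypothetical and covers only the example sentence) and a single rule: everything before the first verb is the subject NP, and the verb plus everything after it is the VP. Real taggers and chunkers are trained on annotated data and handle far more structure than this.

```python
# Toy lexicon (hypothetical): covers only the words in the example sentence.
LEXICON = {
    "the": "DET",
    "dog": "NOUN",
    "bear": "NOUN",
    "chased": "VERB",
}

def tag(tokens):
    """Look up each token's part of speech; unknown words get 'UNK'."""
    return [(tok, LEXICON.get(tok.lower(), "UNK")) for tok in tokens]

def chunk(tagged):
    """Split a tagged sentence into a subject NP and a VP.

    Assumes a simple 'NP VERB ...' sentence: the words before the first
    verb form the NP, the verb and everything after it form the VP.
    """
    for i, (_, pos) in enumerate(tagged):
        if pos == "VERB":
            np = [word for word, _ in tagged[:i]]
            vp = [word for word, _ in tagged[i:]]
            return np, vp
    # No verb found: treat the whole sentence as one NP.
    return [word for word, _ in tagged], []

def shallow_parse(sentence):
    """Return the bracketed NP/VP structure as a string."""
    np, vp = chunk(tag(sentence.split()))
    return "NP-[{}] VP-[{}]".format(" ".join(np), " ".join(vp))

print(shallow_parse("the dog chased the bear"))
# NP-[the dog] VP-[chased the bear]
```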
Issues in Syntax
Full parsing: John loves Mary
Help figuring out (automatically) questions like: Who did what and when?
Issues in Semantics
Understand language! How?
plant = industrial plant
plant = living organism
Words are ambiguous
Importance of semantics?
Machine Translation: wrong translations
Information Retrieval: wrong information
Anaphora Resolution: wrong referents
Why Semantics?
English (input): The sea is home to millions of plants and animals
English -> French [commercial MT system]: Le mer est a la maison de billion des usines et des animaux
French -> English (back-translation): The sea is at the home for billions factories and animals
Issues in Semantics
How to learn the meaning of words? From dictionaries:
plant, works, industrial plant -- (buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles")
plant, flora, plant life -- (a living organism lacking the power of locomotion)
They are producing about 1,000 automobiles in the new plant
The sea flora consists in 1,000 different plant species
The plant was close to the farm of animals.
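One way to use dictionary entries like these is to count the words a sentence shares with each definition and pick the sense with the larger overlap. The sketch below is a simplified overlap heuristic in the spirit of the Lesk algorithm, not a method stated in these notes; the stopword list and the sense labels are assumptions made for illustration. It resolves the first two example sentences and returns `None` for the third, where the context shares no content word with either definition.

```python
import re

# The two dictionary entries from the notes, keyed by an illustrative label.
SENSES = {
    "industrial plant": (
        "plant, works, industrial plant -- (buildings for carrying on "
        'industrial labor; "they built a large plant to manufacture automobiles")'
    ),
    "living organism": (
        "plant, flora, plant life -- "
        "(a living organism lacking the power of locomotion)"
    ),
}

# Small hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "the", "of", "to", "in", "on", "for", "they", "are", "was", "about"}

def content_words(text, target="plant"):
    """Lowercase alphabetic tokens, minus stopwords and the target word."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS and w != target}

def disambiguate(sentence):
    """Return the sense whose entry overlaps most with the sentence.

    Returns None when no sense has any overlap or when there is a tie.
    """
    context = content_words(sentence)
    scores = {
        sense: len(context & content_words(entry))
        for sense, entry in SENSES.items()
    }
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]

print(disambiguate("They are producing about 1,000 automobiles in the new plant"))
# industrial plant
print(disambiguate("The sea flora consists in 1,000 different plant species"))
# living organism
print(disambiguate("The plant was close to the farm of animals."))
# None
```

The third sentence shows the limit of pure dictionary overlap: nothing in its context appears in either entry, so the heuristic cannot decide.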
Issues in Semantics
Learn from annotated examples:
Assume 100 examples containing "plant", previously tagged by a human
Train a learning algorithm
How to choose the learning algorithm?
How to obtain the 100 tagged examples?
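As one possible answer to "train a learning algorithm", the sketch below trains a Naive Bayes classifier with add-one smoothing on tagged examples. Naive Bayes is chosen here only as a common baseline, not because the notes prescribe it. The four training sentences and their labels are invented for illustration and stand in for the 100 human-tagged examples.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged examples standing in for the 100 tagged ones.
EXAMPLES = [
    ("the plant manufactures automobiles", "industrial"),
    ("workers at the plant went on strike", "industrial"),
    ("the plant needs water and sunlight", "living"),
    ("a plant grows from a seed", "living"),
]

def train(examples):
    """Count words per sense label and collect the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Pick the label with the highest log-probability.

    Uses add-one (Laplace) smoothing; words never seen in training
    are skipped.
    """
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(EXAMPLES)
print(classify("the plant produces automobiles", model))  # industrial
print(classify("the plant needs water", model))           # living
```

With so few examples the classifier only reacts to words it has already seen, which is why obtaining enough tagged data is the harder part of the problem.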
Task: find documents that are relevant to the given query
How? Create an index, like the index in a book
More:
Vector-space models
Boolean models
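The "index in a book" idea can be sketched as an inverted index: a table that maps each term to the set of documents containing it. The example below pairs it with the simplest Boolean model, an AND query that returns only documents containing every query term. The three documents are illustrative sentences reused from these notes, and the function names are assumptions.

```python
from collections import defaultdict

# Tiny illustrative collection: document id -> text.
DOCS = {
    1: "the dog ate my homework",
    2: "the dog chased the bear",
    3: "the sea is home to plants and animals",
}

def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(DOCS)
print(boolean_and(index, "dog bear"))    # {2}
print(boolean_and(index, "the dog"))     # {1, 2}
print(boolean_and(index, "plants dog"))  # set()
```

A Boolean model only says whether a document matches; a vector-space model would additionally rank the matching documents by how similar they are to the query.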
Need parallel corpora
French-English, Chinese-English have the Hansards
Reasonable translations
Chinese-Hindi: no such tools available today!
Even More
Discourse
Summarization
Subjectivity and sentiment analysis
Text generation, dialog [pass the Turing test for some million dollars]: Loebner prize
Knowledge acquisition [how to get that common sense knowledge]
Speech processing
Morphology
N-grams
Also multi-word expressions
Administrivia
Instructor: Rada Mihalcea, F228, [email protected]
Class meetings: TTh 11-12:20pm
Office hours: TTh 4:00-5:00pm
TA: TBA
Textbook: Speech and Language Processing, by Jurafsky and Martin (2nd edition)
Recommended: Foundations of Statistical Natural Language Processing, by Manning and Schütze
Other readings (papers) may be assigned throughout the semester
Grading: Assignments, 2 exams, term project
Late submission policy for assignments: can submit up to three days late, with a 10% penalty per day