Intro To NLP and Text Mining
Free text: grammatical errors, ambiguity, complexity, slang words, …
Semi-structured: XML, JSON (Angelino, 2012)
Structured: database (Dzerovski, 1996)
Data Mining vs Text Mining
• “Data Mining is essentially concerned with
information extraction from structured
databases.”
• Information Retrieval
• Text Summarization
http://autosummarizer.com/
• Text Classification
NLP Applications
• Machine Translation
http://translate.google.com
• Question Answering
http://start.csail.mit.edu
• Sentiment Analysis
Approaches to Solving NLP Problems
• Rule Based (Symbolic)
– Develop hand-coded rules
• Statistics Based (Empirical)
– Annotate data using standard tagsets, then
train a model with machine learning
• Hybrid systems
– Often blend rule-based pre- and post-processing with
an ML core
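The three approaches above can be contrasted on a toy sentiment task. This is a minimal sketch, not a reference implementation: the word lists, the four training sentences, and all function names are assumptions made up for illustration, and the statistical core is a plain Naive Bayes classifier with add-one smoothing chosen only because it is short.

```python
import math
from collections import Counter, defaultdict

# Rule-based (symbolic): hand-coded word lists (illustrative assumption).
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def rule_based_sentiment(text):
    """Label text by counting hits against the hand-coded word lists."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"

# Statistics-based (empirical): learn word statistics from annotated data.
def train_naive_bayes(labeled_texts):
    """Collect per-label word counts and label counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict_naive_bayes(model, text):
    """Pick the label with the highest add-one-smoothed log probability."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hybrid: rule-based pre-processing in front of the learned core.
def hybrid_sentiment(model, text):
    """Strip punctuation with a hand-written rule, then call the ML core."""
    cleaned = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return predict_naive_bayes(model, cleaned)

# Tiny annotated data set (made up for this sketch).
TRAIN = [
    ("i love this movie", "pos"),
    ("what a great film", "pos"),
    ("i hate this movie", "neg"),
    ("what an awful film", "neg"),
]
model = train_naive_bayes(TRAIN)

print(rule_based_sentiment("a great film"))
print(predict_naive_bayes(model, "i hate this film"))
print(hybrid_sentiment(model, "I love this film!"))
```

The rule-based function only knows the words someone typed into the lists, while the learned model picks up whatever words the annotated examples contain. The hybrid wrapper shows the usual division of labour: cheap hand-written clean-up around a trained core.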
(Effective) NLP Cycle
• Pick a problem (usually some disambiguation).
• Get a lot of data (hopefully labeled, but often
unlabeled).
• Build the simplest thing that could possibly work.
• Repeat:
– Examine what the most common errors are.
– Figure out what information a human might use to avoid
them.
– Modify the system to exploit that information
• Feature engineering
• Representation redesign
• Different machine learning methods
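The "examine the most common errors" step of the cycle above can be sketched as a simple tally of confusions between gold labels and system output. The helper name `most_common_errors` and the part-of-speech tags below are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

def most_common_errors(gold, predicted, n=3):
    """Count (gold, predicted) label pairs where the system was wrong."""
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return errors.most_common(n)

# Hypothetical gold tags and system output for six tokens.
gold      = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
predicted = ["NOUN", "NOUN", "VERB", "ADJ", "VERB", "VERB"]

for (gold_tag, predicted_tag), count in most_common_errors(gold, predicted):
    print(f"gold={gold_tag} predicted={predicted_tag}: {count}")
```

Sorting errors by frequency points at the confusion worth fixing first. In this toy data, nouns tagged as verbs dominate, which would suggest looking for information (features, representation, or a different learner) that separates those two classes.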
THANK YOU