Text Analysis
Text is Everywhere
• Medical Records
• Consumer Complaint Logs
• Product Inquiries
• Social Media Posts (Twitter feed, emails, Facebook status, Reddit comments, etc.)
• Personal Webpages
[Figure: raw text (“blah, blah, blah, …”) → Identify and Build Features → feature table (columns f1, f2, …; rows val11 val12 …, val21 val22 …, …) → Explore, Explain, Predict]
Text Data is Difficult to Analyze
“all data mining involves the use of machine learning but not all
machine learning requires data mining”
Advantages
• A very simple representation
• Inexpensive to generate
• Works in many settings
• Often works surprisingly well!
Technical reports, prescriptions,…
• “a duck walked up to a lemonade stand”
• “a horse walked up to a lemonade stand”
• “The Duck walks near the Lemonade Stand”
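As a minimal sketch of how these three sentences turn into bag-of-words features, the Python below counts whitespace-separated tokens exactly as written. The whitespace split and the names `bag_of_words`, `vocabulary`, and `matrix` are assumptions for illustration, not something the slides specify.

```python
from collections import Counter

sentences = [
    "a duck walked up to a lemonade stand",
    "a horse walked up to a lemonade stand",
    "The Duck walks near the Lemonade Stand",
]

def bag_of_words(text):
    """Count each whitespace-separated token exactly as written (no cleaning)."""
    return Counter(text.split())

# One bag (token -> count) per sentence; word order is discarded.
bags = [bag_of_words(s) for s in sentences]

# Shared vocabulary: every distinct token seen in any sentence.
vocabulary = sorted(set().union(*bags))

# One row per sentence, one column per vocabulary word.
# A Counter returns 0 for tokens the sentence does not contain.
matrix = [[bag[word] for word in vocabulary] for bag in bags]

for sentence, row in zip(sentences, matrix):
    print(row, "<-", sentence)
```

Because no cleaning is applied, “Duck” and “duck” land in different columns, so the third sentence shares no columns with the first two.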
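The bag of words features:
• [“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
• [“The”, “Duck”, “walks”, “near”, “the”, “Lemonade”, “Stand”]
According to bag of words, both sentences are just collections of words, BUT these two collections are not similar: taken as written, “duck” and “Duck”, or “walked” and “walks”, count as different words.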
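To make “not similar” concrete, here is a small sketch that compares the two token lists above. The slides do not name a similarity measure, so Jaccard overlap of the token sets is an assumption chosen for simplicity, and the helper name `jaccard` is hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that the two lists have in common."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

duck_walked = ["a", "duck", "walked", "up", "to", "a", "lemonade", "stand"]
duck_walks = ["The", "Duck", "walks", "near", "the", "Lemonade", "Stand"]

# Tokens as written: capitalization alone prevents any match.
raw = jaccard(duck_walked, duck_walks)

# Lowercasing is one simple cleaning step; "walked" vs "walks" still differ.
lowered = jaccard(
    [t.lower() for t in duck_walked],
    [t.lower() for t in duck_walks],
)

print(f"raw tokens:        {raw:.2f}")
print(f"lowercased tokens: {lowered:.2f}")
```

On the raw tokens the overlap is zero. Lowercasing recovers “duck”, “lemonade”, and “stand” as shared words, while “walked” and “walks” remain distinct.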
Cleaning the Text