Text Mining Using Python
What to do next?
We’ve extracted all the tweets related to consumer opinions of the iPhone.
Here’s a sample tweet on which we’ll perform data cleaning:
TWEET
“I luv my &lt;3 iphone &amp; you’re awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”
import html  # Python 3; HTMLParser.unescape was removed in Python 3.9
# Convert HTML entities such as &lt; and &amp; back to plain characters
tweet = html.unescape(original_tweet)
Output
>> “I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”
Decoding Data
STEP
02
Code
# Python 3: drop every character that ASCII cannot represent
tweet = original_tweet.encode("ascii", "ignore").decode("ascii")
Output
>> “I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome,
sooo happppppy :) http://www.apple.com”
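import re
APOSTROPHES = {"'s": " is", "'re": " are"}  # sample entries only; need a huge dictionary
# Expand a known suffix wherever it ends a word, e.g. "you're" -> "you are"
pattern = re.compile("(" + "|".join(map(re.escape, APOSTROPHES)) + r")\b")
reformed = pattern.sub(lambda match: APOSTROPHES[match.group(1)], tweet)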
Outcome
>> “I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”
Removal of Stop-words
STEP
04
When the analysis is driven at the word level, commonly occurring
words (stop-words) should be removed.
One can either build a long list of stop-words or use a
predefined, language-specific library.
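A minimal sketch of the first option, assuming a tiny hand-written stop-word set. The names STOP_WORDS and remove_stop_words are illustrative and not part of the original; a real list would hold hundreds of words.

```python
# Tiny illustrative stop-word set (an assumption); real lists are far longer.
STOP_WORDS = {"i", "my", "you", "are", "is", "the", "a", "an", "and"}


def remove_stop_words(text):
    """Drop whitespace-separated tokens whose lowercase form is a stop-word."""
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("I love my iphone and you are awesome"))
# love iphone awesome
```

For the second option, a library list such as NLTK's stopwords corpus could replace STOP_WORDS without changing the function.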
Removal of Expressions
STEP
06
Textual data (usually speech transcripts) may contain human
expressions like [laughing], [Crying], [Audience paused]. These
expressions are usually not relevant to the content of the speech and
hence need to be removed.
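One way to strip such expressions is a regular expression over square-bracketed spans. This is a sketch that assumes expressions are always written in square brackets and never nested; the names EXPRESSION and remove_expressions are illustrative.

```python
import re

# One bracketed expression such as "[laughing]", plus any spaces before it.
EXPRESSION = re.compile(r"\s*\[[^\[\]]*\]")


def remove_expressions(text):
    """Delete bracketed expressions and trim leftover edge whitespace."""
    return EXPRESSION.sub("", text).strip()


print(remove_expressions("Thank you all [laughing] for coming [Audience paused] tonight."))
# Thank you all for coming tonight.
```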
Outcome (attached words such as DisplayIsAwesome split apart)
>> “I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo
happppppy :) http://www.apple.com”
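The change from DisplayIsAwesome to Display Is Awesome in this outcome can be reproduced by inserting a space wherever a lowercase letter or digit is followed by an uppercase letter. The sketch below is one possible approach, not necessarily the one used to produce the outcome; CAMEL_BOUNDARY and split_attached_words are illustrative names. Note that it would also split brand spellings such as "iPhone".

```python
import re

# Zero-width position between a lowercase letter or digit and an uppercase letter.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def split_attached_words(text):
    """Insert a space at each lowercase-to-uppercase transition."""
    return CAMEL_BOUNDARY.sub(" ", text)


print(split_attached_words("awsm apple. DisplayIsAwesome, sooo"))
# awsm apple. Display Is Awesome, sooo
```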
Slang Lookup
STEP
08
Code
tweet = _slang_lookup(tweet)  # helper that maps slang to standard words; its definition is not shown here
Outcome
>> “I love my <3 iphone & you are awesome apple. Display Is
Awesome, sooo happppppy :) http://www.apple.com”
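The lookup helper itself is not shown. A hypothetical stand-in could be a dictionary applied word by word, as sketched below; slang_lookup and the two SLANG entries are assumptions chosen to match the sample tweet.

```python
import re

# Illustrative entries only (an assumption); a usable table needs many more.
SLANG = {"luv": "love", "awsm": "awesome"}


def slang_lookup(text):
    """Replace each alphabetic word found in SLANG; leave everything else as is."""
    def replace(match):
        word = match.group(0)
        return SLANG.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", replace, text)


print(slang_lookup("I luv my iphone & you are awsm apple."))
# I love my iphone & you are awesome apple.
```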
Outcome (elongated words such as sooo and happppppy standardized)
>> “I love my <3 iphone & you are awesome apple. Display Is
Awesome, so happy :) http://www.apple.com”
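Elongated words like "sooo" and "happppppy" can be shortened by capping every run of a repeated character. The sketch below caps runs at two characters, which is an assumption; squeeze_repeats is an illustrative name. It turns "happppppy" into "happy" but leaves "sooo" as "soo", so reaching "so" still needs a dictionary or spelling step. It should also be applied to words only, since it would shorten the "www" in a URL.

```python
from itertools import groupby


def squeeze_repeats(text, max_run=2):
    """Truncate every run of one repeated character to at most max_run copies."""
    return "".join("".join(list(run)[:max_run]) for _, run in groupby(text))


print(squeeze_repeats("sooo happppppy"))
# soo happy
```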
Removal of URLs
STEP
10
URLs and hyperlinks in text data like comments, reviews, and tweets
should be removed.
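A simple regular expression covers most cases. This sketch assumes a URL starts with http://, https:// or www. and runs to the next whitespace; URL and remove_urls are illustrative names. Punctuation glued to the end of a URL is removed with it, and runs of whitespace are collapsed to single spaces.

```python
import re

# Anything starting with http://, https:// or www. up to the next whitespace.
URL = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text):
    """Delete URLs and collapse the whitespace they leave behind."""
    return " ".join(URL.sub("", text).split())


print(remove_urls("so happy :) https://example.com/page"))
# so happy :)
```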
>> “I love my iphone & you are awesome apple. Display Is Awesome, so
happy!” , <3 , :)
Grammar checking
Grammar checking is mostly learning-based: models are trained on
large amounts of well-formed text.
Many online tools are available for grammar correction.
Spelling correction
Natural-language text often contains misspellings.
One can use algorithms such as Levenshtein distance and
dictionary lookup, or existing modules and packages, to fix these
errors.
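The two techniques can be combined: compute the Levenshtein distance from a misspelled word to every word in a dictionary and pick the closest. The sketch below assumes a five-word VOCABULARY and a cut-off of two edits, both chosen only for illustration; levenshtein and correct are illustrative names.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Illustrative vocabulary (an assumption); a real one would be a full word list.
VOCABULARY = ["love", "awesome", "happy", "display", "apple"]


def correct(word, max_distance=2):
    """Return the closest vocabulary word within max_distance, else the word itself."""
    best = min(VOCABULARY, key=lambda known: levenshtein(word, known))
    return best if levenshtein(word, best) <= max_distance else word


print(correct("hapy"))    # happy
print(correct("iphone"))  # iphone (nothing in the vocabulary is close enough)
```

Scanning the whole vocabulary for every word is slow for large dictionaries, which is one reason dedicated spelling packages are usually preferred in practice.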
www.analyticsvidhya.com