0% found this document useful (0 votes)
29 views

1.2 - Handling Text in Python

Uploaded by

yiliawanghk
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
29 views

1.2 - Handling Text in Python

Uploaded by

yiliawanghk
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

2 Line Segment Title APPLIED TEXT MINING

IN PYTHON

Applied Text Mining in Python

Handling Text in Python


2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Primitive constructs in Text

• Sentences / input strings


• Words or Tokens
• Characters
• Document, larger files

And their properties …


2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Let’s try it out!


>>> text1 = "Ethics are built right into the ideals and objectives
of the United Nations "
>>> len(text1)
76
>>> text2 = text1.split(' ')
>>> len(text2)
13
>>> text2
['Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and',
'objectives', 'of', 'the', 'United', 'Nations', '']
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Finding specific words


• Long words: Words that are most than 3 letters long
>>> [w for w in text2 if len(w) > 3]
['Ethics', 'built', 'right', 'into', 'ideals', 'objectives', 'United',
'Nations']

• Capitalized words
>>> [w for w in text2 if w.istitle()]
['Ethics', 'United', 'Nations']

• Words that end with s


>>> [w for w in text2 if w.endswith(’s')]
['Ethics', 'ideals', 'objectives', 'Nations']
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Finding unique words: using set()


>>> text3 = 'To be or not to be'
>>> text4 = text3.split(' ')
>>> len(text4)
6
>>> len(set(text4))
5
>>> set(text4)
set(['not', 'To', 'or', 'to', 'be'])
>>> len(set([w.lower() for w in text4]))
4
>>> set([w.lower() for w in text4])
set(['not', 'to', 'or', 'be']
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Some word comparison functions …

• s.startswith(t)
• s.endswith(t)
• t in s
• s.isupper(); s.islower(); s.istitle()
• s.isalpha(); s.isdigit(); s.isalnum()
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

String Operations

• s.lower(); s.upper(); s.titlecase()


• s.split(t)
• s.splitlines()
• s.join(t)
• s.strip(); s.rstrip()
• s.find(t); s.rfind(t)
• s.replace(u, v)
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

From words to characters


>>> text5.split('')
>>> text5 = 'ouagadougou'
Traceback (most recent call last):
>>> text6 = text5.split('ou')
File "<stdin>", line 1, in
>>> text6 <module>
['', 'agad', 'g', ''] ValueError: empty separator
>>> 'ou'.join(text6) >>> list(text5)
'ouagadougou’ ['o', 'u', 'a', 'g', 'a', 'd',
'o', 'u', 'g', 'o', 'u']
>>> [c for c in text5]
['o', 'u', 'a', 'g', 'a', 'd',
'o', 'u', 'g', 'o', 'u']
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Cleaning Text
>>> text8 = ' A quick brown fox jumped over the lazy dog. '
>>> text8.split(' ')
['', '', '\t', 'A', 'quick', 'brown', 'fox', 'jumped', 'over',
'the', 'lazy', 'dog.', '']
>>> text9 = text8.strip()
>>> text9.split(' ')
['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy',
'dog.']
APPLIED TEXT MINING
IN PYTHON

Changing Text
• Find and replace
>>> text9
'A quick brown fox jumped over the lazy dog.'
>>> text9.find('o')
10
>>> text9.rfind('o')
40
>>> text9.replace('o', 'O')
'A quick brOwn fOx jumped Over the lazy dOg.'
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Handling Larger Texts


• Reading files line by line
>>> f = open('UNDHR.txt', 'r')
>>> f.readline()
'Universal Declaration of Human Rights\n’

• Reading the full file


>>> f.seek(0)
>>> text12 = f.read()
>>> len(text12)
10891
>>> text13 = text12.splitlines()
>>> len(text13)
158
>>> text13[0]
'Universal Declaration of Human Rights'
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

File Operations
• f = open(filename, mode)
• f.readline(); f.read(); f.read(n)
• for line in f: doSomething(line)
• f.seek(n)
• f.write(message)
• f.close()
• f.closed
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Issues with reading text files


>>> f = open('UNDHR.txt', 'r')
>>> text14 = f.readline()
'Universal Declaration of Human Rights\n'

• How do you remove the last newline character?


>>> text14.rstrip()
'Universal Declaration of Human Rights’

– Works also for DOS newlines (^M) that shows up as '\r' or '\r\n'
2 Line Segment Title APPLIED TEXT MINING
2 Line Segment Title IN PYTHON

Take Home Concepts

• Handling text sentences


• Splitting sentences into words, words into characters
• Finding unique words
• Handling text from documents

You might also like