Unit 5
Text Analysis
2.String Operations:
• Concatenation: Combine two strings using the + operator.
• Repetition: Repeat a string multiple times using the * operator.
• Membership Test: Check if a character or substring exists in a string using the in and
not in operators.
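The three operations above can be sketched as follows. The variable names and sample strings are illustrative only.

```python
# Concatenation: + builds a new string from its operands.
greeting = "Text" + " " + "Analysis"
print(greeting)            # Text Analysis

# Repetition: * repeats the string the given number of times.
line = "ab" * 3
print(line)                # ababab

# Membership: in / not in test for a character or a longer substring.
has_sub = "Ana" in greeting
lacks_sub = "xyz" not in greeting
print(has_sub, lacks_sub)  # True True
```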
3.String Comparisons:
• isalpha(): Check if the string contains only alphabetic characters (letters).
• isalnum(): Check if the string contains only letters and digits.
• isdigit(): Check if the string contains only digits.
• isdecimal(): Check if the string contains only decimal characters (the digits used to write base-10 numbers).
• islower(): Check if all cased characters in the string are lowercase (at least one cased character is required).
• isupper(): Check if all cased characters in the string are uppercase (at least one cased character is required).
• isnumeric(): Check if the string contains only numeric characters, which include digits as well as characters such as fractions.
• startswith(): Check if the string starts with a specified substring.
• endswith(): Check if the string ends with a specified substring.
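A short sketch of these checks is given below. The sample strings and the helper name digit_kinds are made up for illustration. The last part shows how isdecimal(), isdigit() and isnumeric() differ on a superscript two and a fraction character.

```python
# Each method returns True or False; comments show the returned value.
checks = {
    "isalpha": "Python".isalpha(),            # True: letters only
    "isalpha_mixed": "Python3".isalpha(),     # False: contains a digit
    "isalnum": "Python3".isalnum(),           # True: letters and digits
    "isdigit": "2024".isdigit(),              # True
    "islower": "hello world 42".islower(),    # True: every cased character is lowercase
    "isupper": "HELLO".isupper(),             # True
    "startswith": "report.txt".startswith("rep"),  # True
    "endswith": "report.txt".endswith(".txt"),     # True
}
print(checks)


def digit_kinds(ch):
    """Return (isdecimal, isdigit, isnumeric) for a string (hypothetical helper)."""
    return (ch.isdecimal(), ch.isdigit(), ch.isnumeric())


print(digit_kinds("5"))       # (True, True, True)
print(digit_kinds("\u00b2"))  # superscript two: (False, True, True)
print(digit_kinds("\u00bd"))  # one-half fraction: (False, False, True)
```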
4.String Conversions:
• capitalize(): Convert the first character to uppercase and the rest to lowercase.
• title(): Convert the first character of each word to uppercase and the rest to lowercase.
• lower(): Convert all characters to lowercase.
• upper(): Convert all characters to uppercase.
• swapcase(): Swap the case of each character.
• casefold(): Perform case folding, a more aggressive lowercasing for comparisons.
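The conversions above can be tried on a small sample, as in this sketch. The sample strings are illustrative. The German word "Straße" is used only to show where casefold() and lower() differ.

```python
s = "hello WORLD"

capitalized = s.capitalize()  # 'Hello world'
titled = s.title()            # 'Hello World'
lowered = s.lower()           # 'hello world'
uppered = s.upper()           # 'HELLO WORLD'
swapped = s.swapcase()        # 'HELLO world'
print(capitalized, titled, lowered, uppered, swapped, sep=" | ")

# casefold() goes further than lower(): the German sharp s becomes 'ss',
# so the two spellings compare equal after case folding.
word = "Stra\u00dfe"          # 'Straße'
print(word.lower())           # straße
print(word.casefold())        # strasse
same = word.casefold() == "STRASSE".casefold()
print(same)                   # True
```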
5.String Manipulations:
• count(): Count the occurrences of a substring in the string.
• replace(): Replace all occurrences of a substring with a new one.
• find(): Find the index of the first occurrence of a substring.
• rfind(): Find the index of the last occurrence of a substring.
• join(): Join strings in a sequence with a specified separator.
• splitlines(): Split the string into separate lines.
• lstrip(): Remove leading whitespace or the specified leading characters.
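The manipulation methods above behave as in the following sketch. The sample strings are illustrative. Note that find() and rfind() return -1 when the substring is not present.

```python
text = "banana"

n_an = text.count("an")             # 2 non-overlapping occurrences
replaced = text.replace("a", "o")   # 'bonono'
first = text.find("an")             # 1: index of the first match
last = text.rfind("an")             # 3: index of the last match
missing = text.find("x")            # -1: substring not found
print(n_an, replaced, first, last, missing)

# join() is called on the separator and takes the sequence of strings.
joined = "-".join(["a", "b", "c"])  # 'a-b-c'

# splitlines() breaks the string at line boundaries.
lines = "line one\nline two".splitlines()  # ['line one', 'line two']

# lstrip() trims only the left side.
stripped = "  padded ".lstrip()     # 'padded '
stripped_x = "xxhixx".lstrip("x")   # 'hixx'
print(joined, lines, repr(stripped), stripped_x)
```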
Regular Expression.
1.Introduction:
1. A regular expression is a powerful tool, available in most languages, for matching text patterns.
2. Python also supports regular expressions.
3. Python regular expression operations are provided by the re module.
4. To use regular expressions, we first need to import the 're' module.
import re
5. To perform a regular expression search, we use this format:
matchset = re.search(pattern,text)
6. Here, pattern is the rule we wrote for matching, and text is the string in which
we want to perform the search.
7. If the search succeeds, a match object is returned; otherwise None is returned.
8. Example:
text2 = "This news article is published on month:Jan"
matchResult = re.search(r'month:\w\w\w', text2)
if matchResult:
    print('Pattern exists ', matchResult.group())
else:
    print('Pattern not exists')
9. In the above example, we search for month followed by : and three word characters (\w matches a letter, digit, or underscore).
10. If the text contains such a match, it is stored in the matchResult object.
11. We can print the matched text by using the matchResult.group() method.
12. The r prefix at the beginning of the pattern makes it a raw string, so backslashes such as the one in \w are passed to the re module unchanged.
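The points about raw strings and the None result can be checked with a small sketch. It reuses the sample text from the example. The capturing group (\w{3}) and the year pattern are additions for illustration and are not part of the original example.

```python
import re

# A raw string keeps the backslash as a literal character.
print(len('\n'), len(r'\n'))   # 1 2

text = "This news article is published on month:Jan"

# Parentheses capture the three word characters so they can be read alone.
match = re.search(r'month:(\w{3})', text)
whole = match.group() if match else None    # 'month:Jan'
month = match.group(1) if match else None   # 'Jan'
print(whole, month)

# A pattern that does not occur in the text gives None, not an error.
missing = re.search(r'year:\d{4}', text)
print(missing)                 # None
```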
2.3. Tokenization:
1. Tokenization means splitting text into words (tokens). It can be done with the split() method available in Python.
2. For more accurate results, we can use nltk tokenization.
3. Let's understand this with the following code:
import nltk  # word_tokenize also needs NLTK's tokenizer data (see nltk.download())

text2 = "Why are you so intelligent?"
words = text2.split(' ')
print(words)
print(nltk.word_tokenize(text2))
4. Output:
['Why', 'are', 'you', 'so', 'intelligent?']
['Why', 'are', 'you', 'so', 'intelligent', '?']
5. The split() method keeps '?' attached to the previous word, but nltk returns it as a
separate token.
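If nltk is not installed, the same separation of punctuation can be roughly approximated with the re module. This is only a simplified stand-in and not how nltk.word_tokenize works internally. The function name simple_tokenize and its pattern are assumptions made for illustration.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper: a rough approximation, not a replacement for nltk.
    """
    # \w+ matches a whole word; [^\w\s] matches one character that is
    # neither a word character nor whitespace (e.g. '?').
    return re.findall(r"\w+|[^\w\s]", text)


text2 = "Why are you so intelligent?"
print(text2.split(' '))        # ['Why', 'are', 'you', 'so', 'intelligent?']
print(simple_tokenize(text2))  # ['Why', 'are', 'you', 'so', 'intelligent', '?']
```

For this sentence the result matches the nltk output shown above. On harder input (contractions, abbreviations, and so on) the two may differ.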