Unit 5

Machine learning text analysis notes

Uploaded by

2902snehashinde

5. Text Analysis

Basic Text Processing with Python.

1. Text Primitives:
• Document: A large collection of text.
• Sentences: Subdivisions of a document, formed by collections of words.
• Words: Units of meaning within sentences.
• Characters: The smallest units of text, forming words.

2. String Operations:
• Concatenation: Combine two strings using the + operator.
• Repetition: Repeat a string multiple times using the * operator.
• Membership Test: Check whether a character or substring occurs in a string using the in and
not in operators.
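
The three operations above can be tried directly on short strings. This is a minimal sketch; the variable names are made up for illustration.

```python
# Concatenation, repetition and membership tests on plain str objects.
first = "text"
second = "mining"

joined = first + " " + second   # concatenation with +
echo = "ab" * 3                 # repetition with *
has_min = "min" in joined       # membership works for substrings, not only single characters
no_digit = "7" not in joined    # negated membership

print(joined)             # text mining
print(echo)               # ababab
print(has_min, no_digit)  # True True
```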

3. String Comparisons:
• isalpha(): Check if the string contains only alphabetic characters.
• isalnum(): Check if the string contains only alphabetic and numeric characters.
• isdigit(): Check if the string contains only digits (including digit forms such as superscripts).
• isdecimal(): Check if the string contains only decimal characters (the digits used to write base-10 numbers).
• islower(): Check if all cased characters in the string are lowercase.
• isupper(): Check if all cased characters in the string are uppercase.
• isnumeric(): Check if the string contains only numeric characters (including fractions such as ½).
• startswith(): Check if the string starts with a specified substring.
• endswith(): Check if the string ends with a specified substring.
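
The differences between these checks are easiest to see side by side. The sketch below uses a few hand-picked sample strings (chosen here for illustration) and prints each result.

```python
print("abc".isalpha())       # True
print("abc123".isalnum())    # True
print("abc 123".isalnum())   # False: the space is neither a letter nor a digit

# The three numeric checks only differ on non-ASCII characters.
print("²".isdecimal(), "²".isdigit(), "²".isnumeric())   # False True True
print("½".isdecimal(), "½".isdigit(), "½".isnumeric())   # False False True

print("abc1".islower())      # True: only cased characters are considered
print("".isalpha())          # False: an empty string fails every is...() check

filename = "report_2024.txt"  # hypothetical file name
print(filename.startswith("report"), filename.endswith(".pdf"))  # True False
```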

4. String Conversions:
• capitalize(): Convert the first character to uppercase and the rest to lowercase.
• title(): Convert the first character of each word to uppercase and the rest to lowercase.
• lower(): Convert all characters to lowercase.
• upper(): Convert all characters to uppercase.
• swapcase(): Swap the case of each character.
• casefold(): Perform case folding, a more aggressive lowercasing for comparisons.
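
As a quick illustration, the conversions above can be applied to one mixed-case string. The German word used for casefold() is an example picked here to show where it differs from lower().

```python
s = "hELLO wORLD"
print(s.capitalize())  # Hello world
print(s.title())       # Hello World
print(s.lower())       # hello world
print(s.upper())       # HELLO WORLD
print(s.swapcase())    # Hello World

# casefold() also normalizes characters that lower() leaves alone.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
print("Straße".casefold() == "STRASSE".casefold())  # True
```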

5. String Manipulations:
• count(): Count the non-overlapping occurrences of a substring in the string.
• replace(): Replace all occurrences of a substring with a new one.
• find(): Find the index of the first occurrence of a substring (-1 if it is absent).
• rfind(): Find the index of the last occurrence of a substring (-1 if it is absent).
• join(): Join the strings in a sequence, using the string it is called on as the separator.
• splitlines(): Split the string into a list of lines.
• lstrip(): Remove leading whitespace or specified leading characters.
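
A short sketch of the manipulation methods listed above, run on a sample sentence chosen for illustration:

```python
line = "  to be or not to be  "

print(line.count("be"))           # 2
print(line.find("be"))            # 5
print(line.rfind("be"))           # 18
print(line.find("xyz"))           # -1 (substring not present)
print(line.replace("be", "exist"))
print(repr(line.lstrip()))        # 'to be or not to be  ' (trailing spaces stay)
print("xxhixx".lstrip("x"))       # hixx

print("-".join(["a", "b", "c"]))          # a-b-c
print("one\ntwo\r\nthree".splitlines())   # ['one', 'two', 'three']
```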

Regular Expression.

1. Introduction:
1. A regular expression is a powerful tool, available in most languages, for matching patterns in text.
2. Python supports regular expressions as well.
3. Regular expression operations in Python are provided by the re module.
4. To use regular expressions, we first import the re module:
    import re
5. To perform a regular expression search, we use this format:
    matchset = re.search(pattern, text)
6. Here, pattern is the rule we wrote for matching, and text is the string in which
we want to search.
7. If the search succeeds, a match object is returned; otherwise None is returned.
8. Example:
    text2 = "This news article is published on month:Jan"
    matchResult = re.search(r'month:\w\w\w', text2)
    if matchResult:
        print('Pattern exists ', matchResult.group())
    else:
        print('Pattern does not exist')
9. In the above example, we search for month followed by : and three word characters.
10. If the text contains the pattern, the match is stored in the matchResult object.
11. We can print the matched text using the matchResult.group() method.
12. The r prefix at the beginning of the pattern makes it a raw string, so Python passes
backslashes such as \w to the re module unchanged.
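
To see why the raw-string prefix matters, the sketch below compares the same pattern written with and without r. It uses \b (word boundary), which is not covered in these notes but is a standard re escape; in an ordinary string Python turns '\b' into a backspace character before re ever sees it.

```python
import re

# '\b' in a normal string is one backspace character;
# r'\b' is a backslash followed by 'b', which re reads as a word boundary.
print(len('\b'), len(r'\b'))   # 1 2

text = "This news article is published on month:Jan"
raw_hit = re.search(r'\bJan\b', text)    # word-boundary pattern
plain_hit = re.search('\bJan\b', text)   # looks for backspace + Jan + backspace

print(raw_hit is not None)    # True
print(plain_hit is not None)  # False
```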

2. Some general rules used to make patterns:

1. Ordinary characters can be used as they appear; they match themselves directly (e.g.
a, B, 4).
2. . (dot) matches any single character except a newline.
3. \w matches any word character [a-zA-Z0-9_]. \W matches any non-word
character.
4. \s matches a single whitespace character (space, newline, tab). \S matches a
single non-whitespace character.
5. \d matches a single digit [0-9].
6. ^ matches the start of the string.
7. $ matches the end of the string.
8. + matches one or more occurrences of the preceding pattern.
9. * matches zero or more occurrences of the preceding pattern.
10. ? matches zero or one occurrence of the preceding pattern.
11. Square brackets match any one character from a set, e.g. [aeiou] or [a-z].
12. findall() - this function finds all non-overlapping occurrences of a pattern.
Example:
    tweet = 'I am learning #datascience and it is awesome. #python #machinelearning'
    hashtags = re.findall(r'#\w+', tweet)
    for tag in hashtags:
        print(tag)
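
The rules above can be checked one at a time with re.findall() and re.search(). This is a small sketch; the sample strings are invented for illustration.

```python
import re

log = "Order 66 shipped on 2024-03-15 to box 7A"

print(re.findall(r'\d+', log))                   # ['66', '2024', '03', '15', '7']
print(re.findall(r'\d\d\d\d-\d\d-\d\d', log))    # ['2024-03-15']
print(re.search(r'^Order', log) is not None)     # True: ^ anchors at the start
print(re.search(r'7A$', log) is not None)        # True: $ anchors at the end

print(re.findall(r'b.t', "bat bit b\nt but"))    # ['bat', 'bit', 'but'] (dot skips the newline)
print(re.findall(r'colou?r', "color colour"))    # ['color', 'colour']
print(re.findall(r'ab*', "a ab abbb"))           # ['a', 'ab', 'abbb']
print(re.findall(r'[aeiou]', "regex"))           # ['e', 'e']
```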

Natural Language Processing.

1. Introduction:
1. A natural language is any language used for communication between humans, e.g.
Marathi, Hindi, English, German, etc.
2. Many kinds of manipulation can be performed on natural language text.
3. Counting words and their frequencies, part-of-speech tagging, parsing sentences,
identifying entities, and finding relationships between entities are among the tasks covered in
natural language processing.
4. To perform natural language processing with Python, we use nltk (Natural Language
Toolkit).
5. It is an open-source library written in Python. It supports most natural language
tasks.
6. We start by importing the library:
    import nltk
7. nltk comes with a good number of text corpora.
8. First, we need to download them before processing:
    nltk.download()
9. To load a set of sample texts, we can use the following statement:
    from nltk.book import *
10. This statement prints the list of texts it loads. We can then display any of them:
    text3
11. Output: <Text: The Book of Genesis>
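
Counting words and their frequencies, the first task named in point 3, can be sketched without any corpus download. The example below uses only the standard library (re and collections.Counter) as a stand-in for NLTK's own tools, on a made-up sentence.

```python
import re
from collections import Counter

text = "the cat sat on the mat and the dog sat on the rug"

# Lowercase first so that "The" and "the" would be counted together.
words = re.findall(r'\w+', text.lower())
freq = Counter(words)

print(len(words))           # 13 tokens in total
print(freq["the"])          # 4
print(freq.most_common(3))  # [('the', 4), ('sat', 2), ('on', 2)]
```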

2. Stemming, Lemmatization & Tokenization:

2.1. Stemming:
1. Stemming is the process of reducing words to a common root (stem), so that different
forms of a word are treated as the same term.
2. For example, words like fishes, fishing and fisher all have the root word fish.
3. We can use the Porter stemmer available in the nltk library:
    fishwords = ['fish', 'Fishing', 'Fishes']
    prt = nltk.PorterStemmer()
    ls = [prt.stem(i) for i in fishwords]
    print(ls)
4. Output: ['fish', 'fish', 'fish']
5. Whether to stem depends on the task. Sometimes it makes sense and sometimes it does
not, so the decision has to be based on the problem statement.
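
To get a feel for what a stemmer does, and why it is not always appropriate, here is a deliberately naive suffix stripper. It is a toy written for these notes, not the Porter algorithm; the function name toy_stem and its suffix list are assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "es", "s")):
    """Lowercase the word and strip the first matching suffix (toy rule set)."""
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so very short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["fish", "Fishing", "Fishes"]])  # ['fish', 'fish', 'fish']

# Blind suffix stripping also damages words that only look inflected.
print(toy_stem("news"))  # new
print(toy_stem("this"))  # thi
```

The last two lines show the kind of over-stemming that makes stemming a per-problem decision.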

2.2. Lemmatization:
1. Lemmatization is the process of converting words into their actual dictionary form.
2. nltk provides WordNetLemmatizer() for this task.
3. We can understand the process through the following code:
    fishwords = ['fishes', 'Fishings', 'Fishes']
    WNlemma = nltk.WordNetLemmatizer()
    ls = [WNlemma.lemmatize(i) for i in fishwords]
    print(ls)
4. Output: ['fish', 'Fishings', 'Fishes']
5. Only the lowercase word fishes is reduced to fish here; the capitalized words are
returned unchanged.
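
Lemmatization can be pictured as a dictionary lookup rather than suffix stripping. The sketch below uses a hypothetical three-entry lemma table (a real lemmatizer such as WordNetLemmatizer consults a full lexical database) to show why irregular forms are handled well and why letter case can matter.

```python
# Hypothetical miniature lemma table, for illustration only.
LEMMAS = {"fishes": "fish", "mice": "mouse", "geese": "goose"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    return LEMMAS.get(word, word)

words = ["fishes", "Fishes", "mice"]
print([toy_lemmatize(w) for w in words])          # ['fish', 'Fishes', 'mouse']

# Lowercasing before lookup lets the capitalized form match as well.
print([toy_lemmatize(w.lower()) for w in words])  # ['fish', 'fish', 'mouse']
```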

2.3. Tokenization:
1. Tokenization splits text into individual tokens such as words. The simplest way to do
it is the split() function available in Python.
2. For cleaner results, we can use nltk tokenization.
3. Let's understand this through the following code:
    text2 = "Why are you so intelligent?"
    words = text2.split(' ')
    print(words)
    print(nltk.word_tokenize(text2))
4. Output:
    ['Why', 'are', 'you', 'so', 'intelligent?']
    ['Why', 'are', 'you', 'so', 'intelligent', '?']
5. The split() call keeps '?' attached to the previous word, whereas nltk's tokenizer
returns it as a separate token.
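
A rough middle ground between split() and nltk is a regular expression that separates words from punctuation. This is a sketch using only the re module; simple_tokenize is a name made up here, and it is not how nltk.word_tokenize is implemented.

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Why are you so intelligent?"))
# ['Why', 'are', 'you', 'so', 'intelligent', '?']

# Contractions show the limits of this rule: the apostrophe becomes its own token.
print(simple_tokenize("Don't stop."))
# ['Don', "'", 't', 'stop', '.']
```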
