String and Text Processing

The document outlines the String and Text Processing section of a course on Business Analytics and Text Mining using Python, covering both basic and advanced techniques for handling textual data. Key topics include built-in string methods, regular expressions for pattern matching, and vectorized string operations in Pandas for efficient text processing. Mastery of these techniques is essential for tasks in text mining, natural language processing, and business analytics.

Uploaded by

Dare Devil
Copyright
© © All Rights Reserved

String and Text Processing

Week 7:
String and Text Processing

String and Text Processing (Week 7) - Summary

The String and Text Processing section in the course Business Analytics and Text Mining Modeling
Using Python covers fundamental and advanced techniques for handling textual data in Python. The
content is divided into two lectures:

Lecture 33: String and Text Processing - Part I

1. Introduction to String Processing

Python is popular for text operations due to its ease of use and built-in functionalities.

String operations can be done using built-in methods or regular expressions for complex
tasks.
2. Basic String Operations

String objects provide built-in methods for common text manipulation.

Example: Creating a string variable (`str1`) and performing operations like splitting into a list
using `split()`.
3. Whitespace Handling
Methods such as `.strip()` can be combined with `.split()` to remove unnecessary spaces.

4. String Modification
Methods such as `.replace()` allow text transformation.

Example: Replace commas with colons.


5. Introduction to Regular Expressions (Regex)

Regex provides advanced pattern matching capabilities for text manipulation.


Python's `re` module supports regex operations like pattern matching, substitution, and
splitting.
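The operations named in this lecture summary can be sketched together as follows. This is only an illustration: the sample value assigned to `str1` is an assumption, since the lecture's own string is not reproduced in these notes.

```python
import re

# Hypothetical sample value; the lecture's actual `str1` is not shown in these notes.
str1 = "apple, banana ,  cherry"

# split() alone leaves stray spaces around each piece.
raw_parts = str1.split(",")

# Combining strip() with split() removes the unnecessary whitespace.
clean_parts = [part.strip() for part in raw_parts]

# replace() swaps every comma for a colon.
with_colons = str1.replace(",", ":")

# re.split() handles the separator and the surrounding whitespace in one pattern.
regex_parts = re.split(r"\s*,\s*", str1)

print(raw_parts)
print(clean_parts)
print(with_colons)
print(regex_parts)
```

The built-in methods are enough when the separator is fixed; the regex version becomes useful once the spacing around the separator varies.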

Lecture 34: String and Text Processing - Part II

1. Advanced Regex Operations


Regex objects can be compiled and reused for efficiency.

Printed using ChatGPT to PDF, powered by PDFCrowd HTML to PDF API. 1/8
Example: Finding email addresses in a text and segmenting them into username, domain,
and domain suffix.
2. Regex Methods
`findall()`: Extracts all matches in a string.

`search()`: Finds the first match and returns a match object.


`match()`: Checks for matches at the start of a string.

`sub()`: Replaces all occurrences of a pattern with a substitute string.


3. Vectorized String Operations in Pandas

Pandas Series and DataFrames support vectorized string operations for efficient text
processing.
Example: Checking for missing values, applying transformations across multiple rows.
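A minimal sketch of a compiled pattern and the four regex methods listed above. The sample text, the placeholder addresses (on the reserved example.com and example.org domains), and the exact pattern are assumptions for illustration, not the lecture's own code.

```python
import re

# Hypothetical sample text with placeholder addresses.
text = "Write to alice@example.com or bob.smith@example.org for details."

# Compile once, reuse many times. Three groups: username, domain, domain suffix.
email_re = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9-]+)\.([A-Za-z]{2,})")

all_parts = email_re.findall(text)      # list of (username, domain, suffix) tuples
first = email_re.search(text)           # match object for the first address
at_start = email_re.match(text)         # None: the text does not begin with an address
masked = email_re.sub("<email>", text)  # every address replaced

print(all_parts)
print(first.group(1), first.group(2), first.group(3))
print(at_start)
print(masked)
```

Because the pattern contains groups, `findall()` returns tuples of the group contents rather than the full matched strings.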

Key Takeaways

Python’s built-in string methods handle basic text processing tasks.

Regular expressions (regex) provide powerful capabilities for pattern matching and substitution.
Pandas supports efficient text processing with vectorized string operations.

These techniques are essential for text mining, NLP, and business analytics.

Let me know if you need explanations for specific sections! 🚀


https://www.youtube.com/watch?v=xoUGYPUgfmg&list=PLLy_2iUCG87C6Ni9RROAcJ7dKIcNMOtI5&index=34&pp=iAQB

The video you've linked is Lecture 34: String and Text Processing - Part II from the course Business
Analytics and Text Mining Modeling Using Python. This lecture delves into advanced string
manipulation techniques, focusing on regular expressions (regex) and their practical applications in
Python.

Key Topics Covered:

1. Advanced Regular Expressions:

Compiling Regex Patterns: Enhances efficiency by allowing pattern reuse.

Email Extraction Example: Demonstrates extracting email components like username,
domain, and suffix using regex groups.

2. Regex Methods:
`findall()`: Retrieves all matches in a string.

`search()`: Finds the first match and returns a match object.

`match()`: Checks for matches at the beginning of a string.

`sub()`: Replaces occurrences of a pattern with a specified string.


3. Vectorized String Operations with Pandas:

Utilizes Pandas Series and DataFrames for efficient, vectorized text processing.
Examples include handling missing values and applying transformations across datasets.
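The vectorized operations described here might look like the sketch below. The column contents are hypothetical; the calls used (`isna()`, `.str.strip()`, `.str.title()`, `.str.contains()`) are standard Pandas methods.

```python
import pandas as pd

# Hypothetical column with inconsistent spacing, casing, and one missing entry.
names = pd.Series([" alice ", None, "BOB"])

# Flag the missing entry.
missing = names.isna()

# Each .str method is applied to every row; the missing value stays missing.
cleaned = names.str.strip().str.title()

# na=False treats missing values as "no match" instead of propagating them.
has_li = cleaned.str.contains("li", na=False)

print(missing.tolist())
print(cleaned.tolist())
print(has_li.tolist())
```

The `.str` accessor skips missing values rather than raising an error, which is why no explicit loop or `if` check is needed.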

For a comprehensive understanding, you can access the lecture notes and additional resources on the
NPTEL website. These materials provide in-depth explanations and examples to complement the video
content.

Mastering these advanced string and text processing techniques is crucial for tasks in text mining,
natural language processing (NLP), and business analytics. They enable efficient data cleaning,
extraction, and transformation, which are foundational steps in any data analysis pipeline.
Feel free to reach out if you have specific questions or need further clarification on any of the topics
covered in this lecture!

String

String Processing in Python

Strings are a fundamental data type in Python used for storing and manipulating text. Python provides
powerful built-in methods and libraries for efficient string operations.

1. Creating Strings

Strings in Python can be created using single (`'`), double (`"`), or triple quotes (`'''` or `"""`, used for
multiline strings).

python

str1 = 'Hello'
str2 = "World"
str3 = '''This is
a multiline string'''

2. String Indexing and Slicing

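Individual characters are read by index (`[i]`) and substrings by slice (`[start:stop]`). Python strings are immutable, so indexing and slicing read characters but cannot change them in place.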

python

text = "Python"
print(text[0]) # P
print(text[-1]) # n
print(text[1:4]) # yth
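As a short illustration of what immutability means in practice, the sketch below extends the same `text` value with stepped slices and an attempted in-place assignment. The name `new_text` is introduced here only for the example.

```python
text = "Python"

# Slices accept an optional step; a step of -1 walks the string backwards.
print(text[::2])    # Pto
print(text[::-1])   # nohtyP

# Strings are immutable: assigning to an index raises TypeError.
try:
    text[0] = "J"
except TypeError as err:
    print("Cannot modify in place:", err)

# Build a new string instead.
new_text = "J" + text[1:]
print(new_text)     # Jython
```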

3. Common String Methods

Python provides several built-in methods to manipulate strings:

Method            Description

`lower()`         Converts to lowercase
`upper()`         Converts to uppercase
`strip()`         Removes whitespace from both ends
`replace(a, b)`   Replaces `a` with `b`
`split(delim)`    Splits a string into a list using a delimiter
`join(list)`      Joins elements of a list into a string
`find(sub)`       Finds the first occurrence of a substring
`count(sub)`      Counts occurrences of a substring

Example:

python

text = " Hello, Python! "


print(text.strip()) # Removes spaces -> "Hello, Python!"
print(text.upper()) # " HELLO, PYTHON! "
print(text.replace("Python", "World")) # " Hello, World! "
print(text.split(",")) # [' Hello', ' Python! ']
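The table also lists `lower()`, `find()`, `count()`, and `join()`, which the example above does not exercise. A brief sketch on the same sample string (the name `cleaned` is introduced only for this illustration):

```python
text = " Hello, Python! "

# Work on the stripped copy so the positions are easy to read.
cleaned = text.strip()

print(cleaned.lower())            # hello, python!
print(cleaned.find("Python"))     # 7 (index where the substring starts)
print(cleaned.find("Java"))       # -1 (substring not found)
print(cleaned.count("l"))         # 2
print("-".join(["a", "b", "c"]))  # a-b-c
```

Every one of these methods returns a new value; `text` itself is never changed.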

4. String Formatting

Python supports formatted string literals (`f-strings`) and other formatting methods.

Using f-strings:

python

name = "Alice"
age = 25
print(f"My name is {name} and I am {age} years old.")

Using `.format()`:

python

print("My name is {} and I am {} years old.".format(name, age))
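Both f-strings and `.format()` accept format specifications after a colon, which is often what business reporting needs. The figures below are hypothetical values chosen only to show the syntax.

```python
name = "Alice"
revenue = 1234567.891   # hypothetical figure for illustration
share = 0.4567          # hypothetical figure for illustration

aligned = f"{name:<10}|"       # left-align in a 10-character field
money = f"{revenue:,.2f}"      # thousands separator, two decimals
percent = f"{share:.1%}"       # percentage with one decimal

print(aligned)   # Alice     |
print(money)     # 1,234,567.89
print(percent)   # 45.7%
```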

5. Regular Expressions (Regex)

Python’s `re` module provides powerful pattern-matching capabilities.

Example: Extracting Email IDs

python

import re

text = "Contact us at [email protected] or [email protected]"


# Inside a character class "|" is a literal pipe, so the suffix class is [A-Za-z].
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(emails) # ['[email protected]', '[email protected]']
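The addresses in the example above are obscured in this copy of the document, so it cannot be run as shown. The sketch below is a runnable variant that assumes placeholder addresses on the reserved example.com and example.org domains; it also shows why `|` does not belong inside a character class.

```python
import re

# Placeholder addresses on reserved example domains (an assumption, not the original text).
text = "Contact us at info@example.com or support@example.org"

# Suffix class written as [A-Za-z]; no "|" is needed inside square brackets.
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
emails = re.findall(pattern, text)
print(emails)  # ['info@example.com', 'support@example.org']

# Inside [...], "|" is an ordinary character rather than "or".
pipe_demo = re.findall(r"[A-Z|a-z]+", "a|b")
print(pipe_demo)  # ['a|b']
```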

6. String Operations in Pandas (Vectorized)

When working with large datasets, Pandas provides vectorized string operations for efficiency.

Example:

python

import pandas as pd

df = pd.DataFrame({"Names": ["Alice", "Bob", "Charlie"]})


df["Uppercase"] = df["Names"].str.upper()
print(df)

Key Takeaways

Strings are immutable sequences of characters.

Python provides built-in methods for text processing.


Regex allows advanced pattern matching.

Pandas supports efficient vectorized string operations.

Let me know if you need more details on any specific topic! 🚀

Join

Using `join()` in Python

The `.join()` method is used to concatenate elements of an iterable (like a list or tuple) into a single
string, using a specified separator.

1. Basic Syntax

python

separator.join(iterable)

`separator`: A string used as a separator between elements.


`iterable`: A list, tuple, or other iterable containing strings.

2. Joining a List into a String

python

words = ["Python", "is", "awesome"]


sentence = " ".join(words) # Joining with space
print(sentence) # Output: "Python is awesome"

3. Joining with Different Separators

python

fruits = ["apple", "banana", "cherry"]


comma_separated = ", ".join(fruits) # Using comma and space
print(comma_separated) # Output: "apple, banana, cherry"

dash_separated = "-".join(fruits) # Using hyphen


print(dash_separated) # Output: "apple-banana-cherry"

4. Joining Characters to Form a String

python

chars = ['H', 'e', 'l', 'l', 'o']


word = "".join(chars) # No separator
print(word) # Output: "Hello"

5. Joining a Tuple

python

tuple_example = ("Data", "Science", "Python")


result = " | ".join(tuple_example)
print(result) # Output: "Data | Science | Python"

6. Joining Keys from a Dictionary

python

data = {"name": "Alice", "age": "25", "city": "New York"}


keys_joined = ", ".join(data.keys()) # Joining dictionary keys
print(keys_joined) # Output: "name, age, city"

7. Joining Numbers (Convert to Strings First)

python

numbers = [1, 2, 3, 4]
num_str = "-".join(map(str, numbers)) # Convert each number to string
print(num_str) # Output: "1-2-3-4"
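To see why the conversion is needed, the sketch below calls `join()` on the raw integers and catches the resulting error, then shows a generator expression as an alternative to `map(str, ...)`.

```python
numbers = [1, 2, 3, 4]

# join() only accepts strings; passing integers raises TypeError.
try:
    "-".join(numbers)
    failed = False
except TypeError as err:
    failed = True
    print("join failed:", err)

# A generator expression is an alternative to map(str, numbers).
num_str = "-".join(str(n) for n in numbers)
print(num_str)  # 1-2-3-4
```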

Key Takeaways

`.join()` is used to concatenate strings from an iterable.

The separator string is placed between each element.

Convert non-string elements (like numbers) to strings before joining.

Let me know if you need more details! 🚀

counting

Counting in Python

Python provides several ways to count occurrences of elements in strings, lists, tuples, and dictionaries.

1. Counting Substrings in a String


The `.count()` method counts occurrences of a substring in a string.

python

text = "Python is fun. Python is powerful."


count_python = text.count("Python")
print(count_python) # Output: 2

🔹 Case-Sensitive: `"Python"` and `"python"` are treated as different words.

🔹 Counting a Specific Character:

python

sentence = "Hello, how are you?"


count_o = sentence.count("o")
print(count_o) # Output: 3
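Two details of `.count()` are worth seeing in code: it is case-sensitive, and it counts non-overlapping occurrences. The sample sentence is hypothetical.

```python
text = "Python is fun. python is powerful. PYTHON is popular."  # hypothetical sample

exact = text.count("Python")               # exact case only
any_case = text.lower().count("python")    # lowercase first for case-insensitive counting
non_overlapping = "aaaa".count("aa")       # matches do not overlap

print(exact)            # 1
print(any_case)         # 3
print(non_overlapping)  # 2
```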

2. Counting Elements in a List


The `.count()` method also works with lists.

python

numbers = [1, 2, 3, 4, 2, 2, 5]
count_twos = numbers.count(2)
print(count_twos) # Output: 3

3. Counting Elements with `Counter` (from `collections`)
The `Counter` class provides an efficient way to count elements in an iterable.

python

from collections import Counter

words = ["apple", "banana", "apple", "orange", "banana", "apple"]


word_count = Counter(words)
print(word_count)
# Output: Counter({'apple': 3, 'banana': 2, 'orange': 1})

# Count of a specific word


print(word_count["apple"]) # Output: 3

🔹 For Strings:
python

char_count = Counter("mississippi")
print(char_count)
# Output: Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
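Beyond printing the whole tally, a `Counter` can be queried and updated. A short sketch using the same `"mississippi"` string:

```python
from collections import Counter

char_count = Counter("mississippi")

top_two = char_count.most_common(2)   # the two most frequent characters
missing = char_count["z"]             # missing keys count as zero, no KeyError

char_count.update("sip")              # add more observations to the tally

print(top_two)          # [('i', 4), ('s', 4)]
print(missing)          # 0
print(char_count["s"])  # 5
```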

4. Counting in a Dictionary
You can count how often a value occurs in a dictionary by converting its values to a list and calling `.count()`.

python

data = {"Alice": 3, "Bob": 5, "Charlie": 3, "David": 5}


count_3 = list(data.values()).count(3)
print(count_3) # Output: 2
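Two other ways to get the same count, sketched on the same `data` dictionary: an explicit loop written as a generator expression, and a `Counter` that tallies every value at once. The name `value_counts` is introduced only for this illustration.

```python
from collections import Counter

data = {"Alice": 3, "Bob": 5, "Charlie": 3, "David": 5}

# Explicit loop over the values, written as a generator expression.
count_3 = sum(1 for value in data.values() if value == 3)
print(count_3)  # 2

# Tally every value in one pass.
value_counts = Counter(data.values())
print(value_counts)  # Counter({3: 2, 5: 2})
```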

5. Counting Words in a Sentence


python

from collections import Counter

sentence = "Python is great and Python is easy"

words = sentence.split() # Split into words
word_count = Counter(words)
print(word_count)
# Output: Counter({'Python': 2, 'is': 2, 'great': 1, 'and': 1, 'easy': 1})
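Real text usually mixes case and punctuation, which `split()` alone does not handle. The sketch below is one possible normalization, assuming that words consist of letters and apostrophes only; the sample sentence is hypothetical.

```python
import re
from collections import Counter

sentence = "Python is great, and python is easy. PYTHON!"  # hypothetical sample

# Lowercase first, then keep runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", sentence.lower())
word_count = Counter(words)

print(words)
print(word_count.most_common(2))  # [('python', 3), ('is', 2)]
```

Lowercasing before counting merges `Python`, `python`, and `PYTHON` into one entry, and the regex drops the commas, periods, and exclamation mark that would otherwise stick to the words.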

Key Takeaways

✅ `.count()` works for strings and lists.

✅ `Counter` is useful for large datasets.

✅ Convert strings to lowercase if case-insensitive counting is needed.

Let me know if you need more details! 🚀
