DSP Unit 2
File Operations
Regular Expressions
For example, main.py is a file that is always used to store Python code.
- Each line of a text file ends with a special End of Line (EOL) character,
usually the newline character {\n}.
1. Opening a File
In Python, we need to open a file before performing any operations on it.
Common modes :
file = open("example.txt") - the default mode is read ("r") if no mode is given when
opening a file.
2. Writing to a File
with open("example.txt", "w") as file:
    file.write("Hello, World!\n")
    file.write("This is a test file.")
print("Data written successfully!")
Output:
Contents of example.txt:
Hello, World!
This is a test file.
with open("example.txt", "r") as file:
    content = file.read()
print(content)
Output:
Hello, World!
This is a test file.
Output:
Hello, World!
This is a test file.
4. Appending to a File
with open("example.txt", "a") as file:
    file.write("\nAppending new line.")
Output:
Contents of example.txt:
Hello, World!
This is a test file.
Appending new line.
import os
if os.path.exists("example.txt"):
    print("File exists")
else:
    print("File does not exist")
Output:
File exists
7. Deleting a File
import os
if os.path.exists("example.txt"):
    os.remove("example.txt")
    print("File deleted successfully!")
else:
    print("File does not exist.")
Output:
Python provides efficient file handling through various modes and functions. Using
the with statement ensures safe and automatic file closure.
File operations are fundamental for data storage, logging, and configuration
management in applications.
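The operations summarized above can be tried together in one short sketch. It is only an illustration: the temporary directory and the variable names (path, lines, closed_after_with, exists_after_delete) are assumptions introduced here so that no real file is touched.

```python
import os
import tempfile

# Work inside a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file location

    # "w" creates the file (or truncates an existing one).
    with open(path, "w") as file:
        file.write("Hello, World!\n")

    # "a" appends to the end without erasing existing content.
    with open(path, "a") as file:
        file.write("Appending new line.\n")

    # "r" is the default mode; iterating yields one line at a time.
    with open(path) as file:
        lines = [line.rstrip("\n") for line in file]

    # The file object is already closed once its with block has ended.
    closed_after_with = file.closed

    # Delete the file and confirm that it is gone.
    os.remove(path)
    exists_after_delete = os.path.exists(path)

print(lines)
print(closed_after_with, exists_after_delete)
```

Using a temporary directory keeps the sketch repeatable: every run starts from an empty location and cleans up after itself.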
Examples
Read the first five characters of stored data and return it as a string:
Let's see how to create a file and how write mode works:
file = open('geek.txt','w')
file.write("This is the write command")
file.write("It allows us to write in a particular file")
file.close()
# Python code to illustrate append() mode
When a file is opened using 'with', it is closed automatically once the block is
done, so cleanup happens without an explicit close() call.
seek() method
In Python, seek() function is used to change the position of the File Handle to a given
specific position. File handle is like a cursor, which defines from where the data has
to be read or written in the file.
Syntax:
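f.seek(offset, from_what), where f is the file object
Parameters:
offset: Number of positions to move (may be negative when from_what is 1 or 2).
from_what: The point of reference: 0 (the default) is the beginning of the file, 1 is the
current position, and 2 is the end of the file.
Returns: The new absolute position.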
Change the current file position to 4 from the beginning and return the rest of the file:
f = open("kk.txt", "r")
f.seek(4)  # same as f.seek(4, 0)
print(f.read())
f.close()
Change the current file position to 4 bytes before the end and return the rest of the file:
f = open("kk.txt", "rb")  # seeking relative to the end requires binary mode
f.seek(-4, 2)
print(f.read())
f.close()
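The following sketch shows seek() together with tell(), which reports the current position. It is a minimal illustration, not part of the original notes: the file content and the variable names are assumptions, and the file is opened in binary mode because text-mode files do not allow non-zero seeks relative to the end.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "kk.txt")

    # Write 13 bytes of sample data (hypothetical content).
    with open(path, "wb") as f:
        f.write(b"Hello, World!")

    with open(path, "rb") as f:
        f.seek(7)                  # from_what defaults to 0 (start of file)
        from_start = f.read()      # everything from byte 7 onwards

        new_pos = f.seek(-6, 2)    # 6 bytes before the end of the file
        from_end = f.read()

        f.seek(0)                  # back to the beginning
        f.read(5)                  # reading moves the position forward
        position_after_read = f.tell()

print(from_start, from_end)            # b'World!' b'World!'
print(new_pos, position_after_read)    # 7 5
```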
Common list of methods in txt mode
Regular Expressions:
import re
#Check if the string starts with "The" and ends with "Spain":
The above code gives the starting index and the ending index of the string
portal.
Note: Here the r prefix (r'portal') stands for raw, not regex. A raw string
is slightly different from a regular string: it does not interpret the \ character
as an escape character. This matters because the regular expression engine uses
the \ character for its own escaping purpose.
Before starting with the Python regex module let’s see how to actually
write regex using metacharacters or special sequences.
RegEx Functions:
The re module offers a set of functions that allows us to search a string for a match:
Function Description
findall() - Returns a list containing all matches
search() - Returns a Match object if there is a match anywhere in the string
match() - Returns a Match object only if the match starts at the beginning of the string (index 0)
split() - Returns a list where the string has been split at each match
sub() - Replaces one or many matches with a string
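The five functions in the table can be compared side by side on one sample string. This is a small sketch; the variable names are chosen here for illustration only.

```python
import re

txt = "The rain in Spain"

all_ai = re.findall(r"ai", txt)          # every non-overlapping match, as a list
first_space = re.search(r"\s", txt)      # first match anywhere in the string
at_start = re.match(r"The", txt)         # matches only at the beginning
not_at_start = re.match(r"rain", txt)    # None: "rain" is not at index 0
words = re.split(r"\s", txt)             # split at every whitespace character
replaced = re.sub(r"\s", "9", txt)       # replace every whitespace character

print(all_ai)                  # ['ai', 'ai']
print(first_space.start())     # 3
print(at_start is not None)    # True
print(not_at_start)            # None
print(words)                   # ['The', 'rain', 'in', 'Spain']
print(replaced)                # The9rain9in9Spain
```

Note the difference between search() and match(): both return a Match object or None, but match() only succeeds when the pattern is found at the very start of the string.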
import re
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)
The split() function returns a list where the string has been split at each match:
import re
#Split the string at every white-space character:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
import re
#Split the string at the first white-space character:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)
The sub() function replaces the matches with the text of your choice:
import re
#Replace all white-space characters with the digit "9":
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)
Match Object:
A Match Object is an object containing information about the search and the result.
import re
#The search() function returns a Match object:
txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)
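A Match object exposes the details of the match through its methods and attributes. The sketch below uses span(), group(), start(), end(), and the string attribute, all of which are part of the standard re module; the pattern and variable names are illustrative assumptions.

```python
import re

txt = "The rain in Spain"

# Look for a word that starts with an upper-case "S".
x = re.search(r"\bS\w+", txt)

matched_text = x.group()     # the part of the string that matched
position = x.span()          # (start, end) indexes of the match
original = x.string          # the string that was passed to search()

print(matched_text)          # Spain
print(position)              # (12, 17)
print(x.start(), x.end())    # 12 17
print(original)              # The rain in Spain
```

Because search() returns None when nothing matches, real code should check the result before calling these methods.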
Special Sequence:
A special sequence is a \ followed by one of the characters in the list below, and has
a special meaning:
Character Description
\A - Returns a match if the specified characters are at the beginning of the string
Example : "\AThe"
\b - Returns a match where the specified characters are at the beginning or at the
end of a word (the "r" in the beginning is making sure that the string is being treated
as a "raw string")
Example: r"\bain" r"ain\b"
\B - Returns a match where the specified characters are present, but NOT
at the beginning (or at the end) of a word (the "r" in the beginning is making sure that
the string is being treated as a "raw string")
Example: r"\Bain" r"ain\B"
\d - Returns a match where the string contains digits (numbers from 0-9)
Example: "\d"
\D - Returns a match where the string DOES NOT contain digits
Example: "\D"
\s - Returns a match where the string contains a white space character
Example: "\s"
\S - Returns a match where the string DOES NOT contain a white space
character
Example: "\S"
\w - Returns a match where the string contains any word characters (characters
from a to Z, digits from 0-9, and the underscore _ character)
Example: "\w"
\W - Returns a match where the string DOES NOT contain any word characters
Example: "\W"
\Z - Returns a match if the specified characters are at the end of the string
Example: "Spain\Z"
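Several of the special sequences in the table can be checked against a single string. This is a minimal sketch; the sample text (with a year added so that \d has something to match) and the variable names are assumptions made for illustration.

```python
import re

txt = "The rain in Spain 2024"

starts_with_the = re.findall(r"\AThe", txt)   # \A anchors at the start of the string
ends_with_year = re.findall(r"2024\Z", txt)   # \Z anchors at the end of the string
word_start_ain = re.findall(r"\bain", txt)    # "ain" at the beginning of a word
word_end_ain = re.findall(r"ain\b", txt)      # "ain" at the end of a word
digits = re.findall(r"\d", txt)               # every single digit
spaces = re.findall(r"\s", txt)               # every whitespace character

print(starts_with_the)   # ['The']
print(ends_with_year)    # ['2024']
print(word_start_ain)    # []
print(word_end_ain)      # ['ain', 'ain']
print(digits)            # ['2', '0', '2', '4']
print(len(spaces))       # 4
```

All patterns are written as raw strings (r"...") so that Python passes the backslash through to the regex engine unchanged.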
Example1:
import re
txt = "The rain in Spain"
Example 2:
import re
txt = "The rain in Spain"
#Check if "ain" is present at the beginning of a WORD:
x = re.findall(r"\bain", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Example 3:
import re
txt = "The rain in Spain"
#Check if "ain" is present, but NOT at the beginning of a word:
x = re.findall(r"\Bain", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Example 4:
import re
txt = "The rain in Spain"
#Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Example 5:
import re
txt = "The rain in Spain"
#Return a match at every word character (characters from a to Z, digits from 0-9, and the underscore _ character):
x = re.findall(r"\w", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Metacharacters:
Metacharacters are characters with a special meaning:
\ – Backslash
The backslash (\) makes sure that the character following it is not treated in a special
way. This can be considered a way of escaping metacharacters. For example, if you want to
search for the dot (.) in a string, the dot will be treated as a special character because
it is one of the metacharacters (as shown in the above table). In this case, we use the
backslash (\) just before the dot (.) so that it loses its special meaning.
import re
s = ' A computer science portal for Data science using python'
# without using \
match = re.search(r'.', s)
print(match)
# using \
match = re.search(r'\.', s)
print(match)
[] – Square Brackets
We can also specify a range of characters using a hyphen (-) inside the square brackets.
For example,
•[0-3] is the same as [0123]
•[a-c] is the same as [abc]
We can also invert the character class using the caret (^) symbol.
For example,
•[^0-3] means any character except 0, 1, 2, or 3
•[^a-c] means any character except a, b, or c
^ – Caret
Caret (^) symbol matches the beginning of the string i.e. checks whether the string
starts with the given character(s) or not.
•^g will check if the string starts with g such as geeks, globe, girl, g, etc.
•^gr will check if the string starts with gr such as greets, greetingsforgreetings, etc.
$ – Dollar
Dollar($) symbol matches the end of the string i.e checks whether the
string ends with the given character(s) or not. For example –
•s$ will check for the string that ends with s such as geeks, ends, s, etc.
•ks$ will check for the string that ends with ks such as geeks,
geeksforgeeks, ks, etc.
. – Dot
Dot(.) symbol matches only a single character except for the newline character (\n).
For example –
•a.b will check for the string that contains any character at the place of the dot
such as acb, acbd, abbb, etc
•.. will check if the string contains at least 2 characters
| – Or
Or symbol works as the or operator meaning it checks whether the pattern before or
after the or symbol is present in the string or not.
For example –
•a|b will match any string that contains a or b such as acd, bcd, abcd, etc.
? – Question Mark
Question mark (?) checks whether the regex before the question mark occurs zero times
or exactly once.
For example –
•ab?c will be matched for the strings ac, acb, dabc but will not be matched for
abbc because there are two b's. Similarly, it will not be matched for abdc because b is
not followed by c.
* – Star
Star (*) symbol matches zero or more occurrences of the regex preceding the *
symbol.
For example –
•ab*c will be matched for the string ac, abc, abbbc, dabc, etc.
but will not be matched for abdc because b is not followed by c.
+ – Plus
Plus (+) symbol matches one or more occurrences of the regex preceding
the + symbol.
For example –
•ab+c will be matched for the string abc, abbc, dabc, but will not be
matched for ac, abdc because there is no b in ac and b is not followed by c in abdc.
{m,n} – Braces
Braces match from m to n repetitions (both inclusive) of the preceding regex. Write the
bounds without a space: in Python, a{2, 4} is treated as literal text, not as a repetition.
For example –
•a{2,4} will be matched for the strings aaab, baaaac, gaad, but will not
be matched for strings like abc, bc because there is only one a or no a in both the
cases.
(<regex>) – Group
Group symbol is used to group sub-patterns.
For example –
•(a|b)cd will match for strings like acd, abcd, gacd, etc.
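The repetition metacharacters described above (?, *, +, and braces) can be compared on the same set of strings. This sketch uses re.fullmatch() so that the whole string has to fit the pattern; the helper function matches() and the candidate list are hypothetical names introduced only for this illustration.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["ac", "abc", "abbc", "abbbc"]

optional_b = matches(r"ab?c", candidates)        # zero or one b
star_b = matches(r"ab*c", candidates)            # zero or more b
plus_b = matches(r"ab+c", candidates)            # one or more b
two_to_three = matches(r"ab{2,3}c", candidates)  # two or three b
with_space = matches(r"ab{2, 3}c", candidates)   # space inside braces: literal text

print(optional_b)     # ['ac', 'abc']
print(star_b)         # ['ac', 'abc', 'abbc', 'abbbc']
print(plus_b)         # ['abc', 'abbc', 'abbbc']
print(two_to_three)   # ['abbc', 'abbbc']
print(with_space)     # []
```

The last line shows why the bounds inside braces should be written without a space: with the space, Python's re module no longer reads the braces as a repetition count.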
Example 1:
import re
txt = "The rain in Spain"
#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)
print(x)
Example 2
import re
txt = "hell planet"
#Search for a sequence that starts with "he", followed by 1 or more (any) characters, and an "o":
x = re.findall("he.+o", txt)
print(x)
if x:
    print("match found")
else:
    print("no match")
Example 3:
import re
txt = "hell planet"
#Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":
x = re.findall("he.{2}o", txt)
print(x)
if x:
    print("match found")
else:
    print("no match")
Example 4
import re
txt = "hello planet"
#Check if the string starts with 'hello':
x = re.findall("^hello", txt)
if x:
    print("Yes, the string starts with 'hello'")
else:
    print("No match")
Example 5
import re
txt = "The rain in Spain falls mainly in the plain!"
#Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Sets:
A set is a set of characters inside a pair of square brackets [] with a special meaning:
Example:
import re
txt = "The rain in Spain"
#Check if the string has any a, r, or n characters:
x = re.findall("[arn]", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Example:
import re
txt = "8 times before 11:45 AM"
#Check if the string has any digits:
x = re.findall("[0-9]", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Example:
import re
txt = "8 times before 11:45 AM"
#Check if the string has any two-digit numbers, from 00 to 59:
x = re.findall("[0-5][0-9]", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Pandas in Python for Data Science
Introduction to Pandas
Pandas is an open-source Python library for data manipulation and analysis. It
provides fast, flexible, and expressive data structures designed to work with structured
(tables, time-series) and semi-structured data.
Pandas can also delete rows that are not relevant or that contain wrong values,
such as empty or NULL values. This is called cleaning the data.
Pandas is widely used in Data Science, Machine Learning, and Data Engineering
to clean, transform, and analyze data efficiently.
Installation
Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:
import pandas as pd
import pandas as pd
print(pd.__version__)
import pandas as pd
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Custom Indexing
Output:
A 10
B 20
C 30
D 40
E 50
dtype: int64
Accessing Elements
print(series['C'])
# Output: 30
print(series.iloc[2])  # access by position; plain series[2] is deprecated for labelled Series
# Output: 30
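The outputs shown in this part can be reproduced with a sketch like the one below. The list of values and the labels A to E are taken from the printed outputs; the variable names are assumptions.

```python
import pandas as pd

data = [10, 20, 30, 40, 50]

# Default index: 0, 1, 2, ...
default_series = pd.Series(data)

# Custom index: one label per value.
series = pd.Series(data, index=["A", "B", "C", "D", "E"])

by_label = series["C"]         # access by index label
by_position = series.iloc[2]   # access by integer position

print(default_series)
print(series)
print(by_label, by_position)   # 30 30
```

Using .iloc for positions and plain brackets (or .loc) for labels keeps the two kinds of access clearly separate.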
import pandas as pd
series = pd.Series(data)
print(series)
Output:
A 10
B 20
C 30
D 40
E 50
dtype: int64
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
Output:
df = pd.read_csv("data.csv")
print(df.head())
print(df.head())      # First 5 rows
print(df.tail())      # Last 5 rows
print(df.shape)       # (rows, columns): number of rows and columns
df.info()             # Summary of the DataFrame: column names, non-null counts, and data types
print(df.describe())  # Summary statistics: count, mean, std, min, 25%, 50%, 75%, and max
print(df['Name'])
Using Conditions
Deleting a Column
df.drop(columns=['Experience'], inplace=True)
print(df)
Grouping by a Column
Applying Aggregations
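As a rough sketch of filtering with a condition, grouping, aggregating, and sorting, consider the small DataFrame below. The Department column, the fourth row, and all variable names are hypothetical additions made only so that there is something to group by.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Department": ["HR", "IT", "IT", "HR"],   # hypothetical column for grouping
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 60000, 70000, 90000],
})

# Using a condition: keep only the rows where Age is above 28.
older = df[df["Age"] > 28]

# Grouping by a column: mean salary per department.
mean_salary = df.groupby("Department")["Salary"].mean()

# Applying several aggregations at once with named outputs.
summary = df.groupby("Department").agg(
    avg_age=("Age", "mean"),
    total_salary=("Salary", "sum"),
)

# Sorting: highest salary first.
top_paid = df.sort_values("Salary", ascending=False)

print(older)
print(mean_salary)
print(summary)
print(top_paid)
```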
5. Sorting Data
6. Exporting Data
Save to CSV
df.to_csv("output.csv", index=False)
Save to Excel
df.to_excel("output.xlsx", index=False)
chunk_size = 1000
for chunk in pd.read_csv("large_data.csv", chunksize=chunk_size):
process(chunk) # Perform operations on each chunk
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df_resampled = df.resample('M').mean()
Conclusion
Pandas is an essential library for data analysis, data manipulation, and data
science. It simplifies working with structured data, making tasks like data cleaning,
transformation, and aggregation easy.
Key Points:
The dataset contains information about customer purchases, including date, branch,
city, product category, unit price, quantity, total sales, payment method, and
customer type.
Steps:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("sales_data.csv")
df.dropna(inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
Basic Statistics
top_products = df['Product'].value_counts().head(5)
print(top_products)
5. Sales Analysis
Total Revenue
total_revenue = df['Total'].sum()
print(f"Total Revenue: ${total_revenue}")
revenue_per_city = df.groupby("City")["Total"].sum()
print(revenue_per_city)
revenue_per_category = df.groupby("Category")["Total"].sum()
print(revenue_per_category)
6. Data Visualization
df.groupby(df['Date'].dt.month)['Total'].sum().plot(kind='line', marker='o',
figsize=(10,5), title="Sales Trend Over Months")
plt.xlabel("Month")
plt.ylabel("Total Sales")
plt.show()
plt.figure(figsize=(8,5))
sns.countplot(x='Payment Method', data=df, palette='pastel')
plt.title("Most Used Payment Methods")
plt.show()
plt.figure(figsize=(10,5))
df['Product'].value_counts().head(10).plot(kind='bar', color='skyblue')
plt.title("Top 10 Best-Selling Products")
plt.xlabel("Product Name")
plt.ylabel("Number of Sales")
plt.xticks(rotation=45)
plt.show()
4. Revenue by City
plt.figure(figsize=(8,5))
sns.barplot(x=revenue_per_city.index, y=revenue_per_city.values, palette="viridis")
plt.title("Revenue by City")
plt.ylabel("Total Sales")
plt.show()
Further Improvements
Introduction to NumPy
Installing NumPy
Importing NumPy
import numpy as np
b) Multi-Dimensional Array
Random Arrays
Accessing Elements
# Output: 30
Accessing 2D Arrays
Slicing Arrays
Element-wise Operations
Reshaping Arrays
Transposing Arrays
np.concatenate((arr1, arr2))      # Combine arrays
np.vstack((arr1, arr2))           # Stack vertically
np.hstack((arr1, arr2))           # Stack horizontally
X = np.random.rand(100, 3)        # 100 samples, 3 features
y = np.random.randint(0, 2, 100)  # Binary labels (0 or 1)
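The calls above refer to arr1 and arr2 without defining them. The sketch below supplies two small one-dimensional arrays (an assumption made for illustration) and shows the shape each operation produces; the seed is fixed only to make the random data repeatable.

```python
import numpy as np

arr1 = np.array([1, 2, 3])   # hypothetical sample arrays
arr2 = np.array([4, 5, 6])

combined = np.concatenate((arr1, arr2))   # one array of 6 elements
stacked_v = np.vstack((arr1, arr2))       # 2 rows, 3 columns
stacked_h = np.hstack((arr1, arr2))       # same as concatenate for 1-D input

np.random.seed(0)                  # fixed seed so the run is repeatable
X = np.random.rand(100, 3)         # 100 samples, 3 features, values in [0, 1)
y = np.random.randint(0, 2, 100)   # binary labels: upper bound 2 is exclusive

print(combined, combined.shape)    # [1 2 3 4 5 6] (6,)
print(stacked_v.shape)             # (2, 3)
print(stacked_h.shape)             # (6,)
print(X.shape, y.shape)            # (100, 3) (100,)
```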
Simulate a Dataset
Key points:
Objective:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("ecommerce_sales.csv")
print(df.head())  # View first 5 rows
df.info()         # Check data types and missing values
df.drop_duplicates(inplace=True)
df["Price"] = df["Price"].astype(float)
df["Quantity"] = df["Quantity"].astype(int)
df["Total Sales"] = df["Price"] * df["Quantity"]
Basic Statistics
5. Data Visualization
df["Order Date"] = pd.to_datetime(df["Order Date"])  # .dt requires a datetime column
df.groupby(df["Order Date"].dt.month)["Total Sales"].sum().plot(kind="line",
    marker="o", figsize=(10,5), title="Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Total Sales")
plt.show()
plt.figure(figsize=(10,5))
df["Category"].value_counts().plot(kind="bar", color="skyblue")
plt.title("Top Selling Product Categories")
plt.xlabel("Category")
plt.ylabel("Number of Sales")
plt.xticks(rotation=45)
plt.show()
revenue_by_category = df.groupby("Category")["Total Sales"].sum()
plt.figure(figsize=(8,5))
sns.barplot(x=revenue_by_category.index, y=revenue_by_category.values,
    palette="viridis")
plt.title("Revenue by Product Category")
plt.ylabel("Total Revenue")
plt.xticks(rotation=45)
plt.show()
4. Most Popular Payment Methods
plt.figure(figsize=(8,5))
sns.countplot(x="Payment Method", data=df, palette="pastel")
plt.title("Most Used Payment Methods")
plt.show()
We can use NumPy and Pandas to prepare data for predictive modeling using
Scikit-Learn.
For example, we can predict future sales trends using linear regression.
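from sklearn.linear_model import LinearRegression
# X_train, y_train, and X_test are assumed to come from an earlier train/test split
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)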
Next Steps:
Regular Expressions
Regular Expressions (regex) are powerful tools used to search, match, and
manipulate text using patterns. They are widely used in data validation, text
processing, web scraping, and data cleaning.
import re
Function Description
re.search() Searches for the first occurrence of a pattern in a string.
re.match() Checks if a pattern matches the beginning of a string.
re.findall() Returns all occurrences of a pattern in a string.
re.finditer() Returns an iterator with match objects for all occurrences.
re.sub() Replaces occurrences of a pattern with a replacement string.
re.split() Splits a string based on a pattern.
3. Regex Metacharacters
import re
text = "Python is a great programming language"
match = re.search(r"Python", text)
if match:
    print("Match found!")
html = "<html><head><title>My Website</title></head><body>Content</body></html>"
tags = re.findall(r"<.*?>", html)
print(tags)
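The ? after .* in the pattern above makes the match non-greedy, which is what allows each tag to be found separately. The sketch below contrasts the two behaviours on a shorter, made-up snippet.

```python
import re

snippet = "<title>My Website</title>"   # hypothetical HTML fragment

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = re.findall(r"<.*>", snippet)

# Non-greedy: .*? grabs as little as possible, so each tag is matched on its own.
lazy = re.findall(r"<.*?>", snippet)

print(greedy)   # ['<title>My Website</title>']
print(lazy)     # ['<title>', '</title>']
```

Regular expressions are fine for quick checks like this one, but a proper HTML parser is usually the more reliable choice for real pages.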
password = "P@ssw0rd123"
pattern = r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
if re.fullmatch(pattern, password):
    print("Strong password!") # ✅ Output: Strong password!
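Each (?=...) group in the pattern is a lookahead that tests one requirement without consuming characters. One way to see what each part contributes is to run the requirements separately, as in the sketch below. The function failed_checks() and the dictionary of named checks are hypothetical helpers written for this illustration, not part of the original pattern.

```python
import re

# One small pattern per requirement that the lookaheads enforce.
checks = {
    "uppercase": r"[A-Z]",
    "lowercase": r"[a-z]",
    "digit": r"\d",
    "special": r"[@$!%*?&]",
}

def failed_checks(password):
    """Return the names of the requirements that the password does not meet."""
    failed = [name for name, pat in checks.items() if not re.search(pat, password)]
    # Final part of the pattern: at least 8 characters, all from the allowed set.
    if not re.fullmatch(r"[A-Za-z\d@$!%*?&]{8,}", password):
        failed.append("length or disallowed character")
    return failed

print(failed_checks("P@ssw0rd123"))   # []
print(failed_checks("password"))      # ['uppercase', 'digit', 'special']
print(failed_checks("Ab1@"))          # ['length or disallowed character']
```

Splitting the checks this way also makes it possible to tell a user which rule failed, which the single combined pattern cannot do.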
Explanation:
6. Conclusion
Web Scraping is the process of extracting data from websites using automated
scripts. It is widely used in data science, market research, competitive analysis,
and automation.
1. Prerequisites
Web pages are built using HTML and structured with CSS. Web scraping involves:
Sending a request to the website
Extracting the HTML content
Parsing the HTML to find relevant data
Saving the data for further analysis
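The parsing step can be sketched without any network access by feeding a hard-coded page to Python's built-in html.parser module. This is a stand-in for illustration only: it swaps the standard library in for a third-party parser, and the class name TitleAndLinkParser and the sample page are assumptions.

```python
from html.parser import HTMLParser

class TitleAndLinkParser(HTMLParser):
    """Collects the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text is only collected while inside the <title> element.
        if self.in_title:
            self.title += data

# Hard-coded HTML standing in for the content a request would return.
page = (
    "<html><head><title>My Website</title></head>"
    '<body><a href="/about">About</a><a href="/contact">Contact</a></body></html>'
)

parser = TitleAndLinkParser()
parser.feed(page)

print(parser.title)   # My Website
print(parser.links)   # ['/about', '/contact']
```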
import requests
url = "https://example.com"
response = requests.get(url)
if response.status_code == 200:
    print(response.text[:500]) # Print first 500 characters of the page
requests.get(url): Sends an HTTP request
.text: Returns the HTML content of the page
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
title = soup.title.text
paragraphs = [p.text for p in soup.find_all("p")]
print("Title:", title)
print("Paragraphs:", paragraphs)
2. Extracting Links
3. Extracting Images
Some websites load content dynamically using JavaScript, which requests cannot
handle. Selenium helps interact with such websites.
driver.quit()
Save to CSV
import pandas as pd
df.to_csv("scraped_data.csv", index=False)
print("Data saved to CSV")
Save to Excel
df.to_excel("scraped_data.xlsx", index=False)
such as:
Blocking frequent requests
Requiring JavaScript execution
CAPTCHAs and login requirements
9. Ethical Considerations
Example:
https://example.com/robots.txt
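Python's standard library includes urllib.robotparser for checking robots.txt rules before scraping. The sketch below parses a small made-up rule set directly, so it runs without any network access; the rules and the URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would load it from the site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(robots_lines)

allowed_home = rp.can_fetch("*", "https://example.com/index.html")
allowed_private = rp.can_fetch("*", "https://example.com/private/data.html")

print(allowed_home)      # True
print(allowed_private)   # False
```

For a live site, set_url() followed by read() loads the real robots.txt file instead of the hard-coded lines.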