Module 3 Notes
pandas provides a number of parsing functions meant to convert text data into a DataFrame. The options for these
functions fall into a few categories:
• Indexing: can treat one or more columns as the index of the returned DataFrame, and whether to get column
names from the file, from the user, or not at all.
• Type inference and data conversion: includes user-defined value conversions and a custom
list of missing value markers.
• Datetime parsing: includes combining capability, such as combining date and time information spread over
multiple columns into a single column in the result.
• Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data
with thousands separators (commas).
read_csv and read_table are the most commonly used of these functions.
Before using any methods in the pandas library, import the library with the following statement:

import pandas as pd
Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If a file contains
any other delimiter, read_table can be used by specifying the delimiter.
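A minimal sketch of both calls, assuming a comma-delimited file ex1.csv in the working directory:

import pandas as pd

# read_csv assumes a comma delimiter by default
df = pd.read_csv('ex1.csv')

# read_table works for any delimiter via the sep argument
df = pd.read_table('ex1.csv', sep=',')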
Suppose we wanted the message column to be the index of the returned DataFrame. We can either
indicate we want the column at index 4 or the column named 'message' using the index_col argument:
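A sketch of both forms; the file name ex2.csv and the column names are assumptions for illustration:

names = ['a', 'b', 'c', 'd', 'message']
df = pd.read_csv('ex2.csv', names=names, index_col=4)          # by position
df = pd.read_csv('ex2.csv', names=names, index_col='message')  # by name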
To form a hierarchical index from multiple columns, just pass a list of column numbers or names:
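A sketch, assuming a file csv_mindex.csv whose first two columns hold the keys:

parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])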
The parser functions have many additional arguments to help you handle the wide variety of exception file
formats that occur. For example, you can skip the first, third, and fourth rows of a file with skiprows:
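For example (the file name ex4.csv is illustrative; skiprows takes zero-based row positions):

df = pd.read_csv('ex4.csv', skiprows=[0, 2, 3])  # skip rows 1, 3 and 4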
The na_values option can take either a list or set of strings to consider as missing values:
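A sketch of both usages; the file name ex5.csv and the column names are assumptions:

# treat 'NULL' and 'NA' as missing everywhere
result = pd.read_csv('ex5.csv', na_values=['NULL', 'NA'])

# different sentinels per column, supplied as a dict
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv('ex5.csv', na_values=sentinels)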
There are many different database systems used for a wide variety of purposes,
including Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite.
Although Python can work with data in SQLite database files directly, many operations can be done more
conveniently using software called the Database Browser for SQLite, which is freely
available from:
http://sqlitebrowser.org/
Using the browser you can easily create tables, insert data, edit data, or run simple SQL
queries on the data in the database.
Database concepts
At first glance, a database looks like a spreadsheet consisting of multiple sheets. The
primary data structures in a database are tables, rows, and columns.
In relational database terminology, tables, rows, and columns are referred to as relations,
tuples, and attributes respectively. The typical structure of a database table is shown in
Table 3.1.
Each table may consist of n attributes and m tuples (or records).
Every tuple gives the information about one individual. Every cell (i, j) in the table holds the
value of the jth attribute for the ith tuple.
Table 3.1: Typical Relational database table
Consider the problem of storing details of students in a database table. The format may look
like –
            Roll No   Name     DOB          Marks
Student 1   1         Akshay   22/10/2001   82.5
Student 2   2         Arun     20/12/2000   81.3
...         ...       ...      ...          ...
Student m   ...       ...      ...          ...
Thus, the table columns indicate the type of information to be stored, and the table rows give
the record pertaining to each student. We can create one more table, say department, consisting
of attributes like dept_id, homephno, and City. The Rollno stored in the student table can then
be related to the dept_id stored in the department table. Thus, there is a relationship between
the two tables in a single database. Software that maintains proper relationships between
multiple tables in a single database is known as a Relational Database Management System
(RDBMS).
The connect operation makes a "connection" to the database stored in the file music.db in
the current directory. If the file does not exist, it will be created.
A cursor is like a file handle that we can use to perform operations on the data stored in the
database. Calling cursor() is very similar conceptually to calling open() when dealing with
text files.
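A minimal sketch of these two steps (the file name music.db follows the text):

import sqlite3

conn = sqlite3.connect('music.db')  # opens music.db, creating it if absent
cur = conn.cursor()                 # conceptually similar to open() for files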
Once we have the cursor, we can begin to execute commands on the contents of the
database using the execute() method, as shown in the figure below.
Figure: Database Cursor
Example 1: Write a Python program to create a student table in a college database (with attributes of student
like Name, USN, Marks). Perform operations like insert, delete, and retrieve records from the student
table.
import sqlite3
conn = sqlite3.connect('college.db')
cur = conn.cursor()
print("Opened database successfully")
cur.execute('CREATE TABLE student (name TEXT, usn TEXT, marks INTEGER)')
print("Table created successfully")
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?, ?, ?)',
            ('akshay', '1rn16mca16', 30))
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?, ?, ?)',
            ('arun', '1rn16mca17', 65))
print('student')
cur.execute('SELECT name, usn, marks FROM student')
for row in cur:
    print(row)
cur.execute('DELETE FROM student WHERE marks < 40')
cur.execute('SELECT name, usn, marks FROM student')
conn.commit()
cur.close()
Output:
Opened database successfully
Table created successfully
student
('akshay', '1rn16mca16', 30)
('arun', '1rn16mca17', 65)
Example 2: Write a Python program to create a database file (music.sqlite) and a table named Tracks with
two columns, title and plays. Also insert, display, and delete the contents of the table.
import sqlite3
conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
cur.execute("INSERT INTO Tracks (title, plays) VALUES ('Thunderstruck', 200)")
cur.execute("INSERT INTO Tracks (title, plays) VALUES (?, ?)", ('My Way', 15))
conn.commit()
print('Tracks:')
cur.execute('DELETE FROM Tracks WHERE plays < 100')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
    print(row)
cur.close()
Output
Tracks:
('Thunderstruck', 200)
Cleansing Data
To strip out extraneous information in a Pandas DataFrame, you can use various methods and
functions provided by the library. Some commonly used methods include:

dropna(): This method removes rows with missing values (NaN or None) from the DataFrame.
You can specify the axis (0 for rows and 1 for columns) along which the rows or columns with
missing values should be dropped.
Example:
df = df.dropna()
#This will remove all rows that contain at least one missing value.
drop( ): The drop() method in Pandas is used to remove columns from a DataFrame. It can be
used to drop a single column or multiple columns at once.
df.drop(columns, axis=1, inplace=False)
Ex:
cars2 = cars_data.drop(['Doors','Weight'],axis='columns')
drop_duplicates(): This method removes duplicate rows from the DataFrame. You can specify the
columns based on which the duplicates should be checked.
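A short sketch; the column name 'Name' is an assumption for illustration:

df = df.drop_duplicates()                 # drop fully duplicated rows
df = df.drop_duplicates(subset=['Name'])  # check duplicates on 'Name' only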
loc[ ] and iloc[ ]: These indexing methods allow you to select specific rows and columns from the
DataFrame. They are used to select only the relevant data and exclude the unwanted information.
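The original worked examples here were screenshots; a minimal sketch of both selectors, with illustrative data, might look like:

import pandas as pd

df = pd.DataFrame({'Name': ['Anitha', 'Barathi', 'Charlie'],
                   'Age': [25, 30, 35]})
print(df.loc[0:1, ['Name', 'Age']])  # loc: label-based selection (inclusive)
print(df.iloc[0:2, 0:1])             # iloc: position-based selection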
Filtering: conditional statements can be used to filter the DataFrame and select only the rows that
meet certain criteria. This allows you to remove unwanted data based on specific conditions.
Example:
import pandas as pd

data = {'Name': ['Anitha', 'Barathi', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
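A condition such as the following then keeps only the matching rows (the threshold is illustrative):

filtered = df[df['Age'] > 30]
print(filtered)  # rows for Charlie and David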
Data normalization is the process of transforming data into a consistent format to facilitate
comparison and analysis. This may involve converting data to a common unit of measurement,
formatting dates and times consistently, or standardizing data formats. Normalization ensures that
data is comparable and can be easily processed and analysed.
Normalization is a crucial step in data preprocessing for machine learning tasks. It involves
transforming numerical features to have a mean of 0 and a standard deviation of 1. This process
ensures that all features are on the same scale, enabling efficient and accurate learning by machine
learning algorithms.
Method 1: Normalize the whole dataset using sklearn

We import all the required libraries, NumPy and sklearn. You can see that we import preprocessing from
sklearn itself; that is why this is called the sklearn normalization method. We create a NumPy array with
some unequal integer values. We call the normalize method from preprocessing and pass the numpy_array we
just created as a parameter. We can see from the results that all the integer data are now normalized between 0 and 1.
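The original code cell was an image; a minimal sketch under these assumptions (the array values are illustrative):

import numpy as np
from sklearn import preprocessing

numpy_array = np.array([[2, 4, 6, 8], [1, 3, 5, 7]])
normalized = preprocessing.normalize(numpy_array)  # each row scaled to unit norm
print(normalized)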
Method 2: Normalize a particular column in a dataset using sklearn

We can also normalize a particular column of a dataset, as discussed next.
We import the libraries pandas and sklearn. We created a dummy CSV file, and we now load
that CSV file with the help of the pandas read_csv function. We print the CSV file which we
just loaded. We read a particular column of the CSV file using np.array and store the
result in value_array. We call the normalize method from preprocessing and pass the
value_array as the parameter.
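A sketch of Method 2; the file name dummy.csv and the column name Marks are assumptions:

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('dummy.csv')                       # assumed file name
print(df)
value_array = np.array(df['Marks']).reshape(-1, 1)  # assumed column, as a 2-D array
normalized = preprocessing.normalize(value_array, axis=0)
print(normalized)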
Method 3: Normalize the whole dataset without converting columns to arrays (using sklearn)

In Method 2 above, we discussed how to normalize a particular CSV file column. But sometimes we need
to normalize the whole dataset; then we can use the method below, where we normalize the whole
dataset column-wise (axis=0). If we mention axis=1, it will normalize row-wise; axis=1 is the
default value.
Now we pass the whole CSV file along with one extra parameter, axis=0, which tells the
library that the user wants to normalize the whole dataset column-wise.
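A sketch of Method 3 under the same assumptions (dummy.csv containing only numeric columns):

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('dummy.csv')
normalized = preprocessing.normalize(df, axis=0)  # column-wise normalization
print(normalized)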
We call MinMaxScaler from the preprocessing module and create an object (min_max_scaler) for it.
We did not pass any parameters because we need to normalize the data between 0 and 1. But if you
want, you can supply your own values, which will be seen in the next method.
We first read all the names of the columns for later use in displaying results. Then we call
fit_transform on the created min_max_scaler object and pass the CSV file into it. We get the
normalized results, which are between 0 and 1.
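A sketch of the MinMaxScaler steps described above; dummy.csv is again an assumed file name:

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('dummy.csv')
min_max_scaler = preprocessing.MinMaxScaler()  # default feature_range is (0, 1)
column_names = df.columns                      # keep the names for display
scaled = min_max_scaler.fit_transform(df)
print(pd.DataFrame(scaled, columns=column_names))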
sklearn also provides the option to change the normalized range to whatever you want. By default,
it normalizes the values between 0 and 1. But there is a parameter called feature_range, which
can set the normalized range according to our requirements.
Here, we call MinMaxScaler from the preprocessing module and create an object
(min_max_scaler) for it. But we also pass another parameter to MinMaxScaler
(feature_range), whose value we set to 0 to 2. So now the MinMaxScaler will
normalize the data values between 0 and 2. We first read all the names of the columns for later
use in displaying results. Then we call fit_transform on the created min_max_scaler object and pass
the CSV file into it. We get the normalized results, which are between 0 and 2.
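Continuing the sketch above, with the feature_range parameter:

min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 2))
scaled = min_max_scaler.fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns))  # values now lie between 0 and 2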
We can also normalize the data using pandas; these features are also very popular for
normalizing data. Maximum absolute scaling normalizes values between 0 and 1. We apply
.max() and .abs() as shown below:
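A sketch of maximum absolute scaling with plain pandas, reusing the df loaded above:

for column in df.columns:
    df[column] = df[column] / df[column].abs().max()  # divide by max absolute value
print(df)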
We call each column and then divide the column values by .abs().max(). We print the
result, and from the result we confirm that our data are normalized between 0 and 1.
The next method we are going to discuss is the z-score method. This method converts the
data to a standard distribution: it calculates the mean of each column, subtracts it from
each column and, at last, divides by the standard deviation. This centres the data around 0,
with most values falling between -1 and 1.
We calculate the column’s mean and subtract it from the column. Then we divide the
column values by the standard deviation. We print the normalized data.
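A sketch of the z-score computation described above:

for column in df.columns:
    df[column] = (df[column] - df[column].mean()) / df[column].std()
print(df)  # each column now has mean 0 and unit standard deviation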
One popular library is Scikit-Learn, which offers the StandardScaler class for normalization. Here's
an example of how to use StandardScaler to normalize a dataset:
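A minimal sketch with toy data:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy data
scaler = StandardScaler()
normalized = scaler.fit_transform(data)  # each column: mean 0, std 1
print(normalized)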
Formatting Data:
Formatting data in Pandas involves transforming and presenting data in a structured and
readable manner. Pandas, a popular Python library for data analysis, offers various methods
and techniques to format data effectively.
One of the key features of Pandas is its ability to handle different data types and structures. It
provides specific formatting options for each data type, ensuring that data is displayed in a
consistent and meaningful way. For example, numeric data can be formatted with specific
number of decimal places, currency symbols, or percentage signs. Date and time data can be
formatted in various formats, such as "dd/mm/yyyy" or "hh:mm:ss".
Pandas also allows users to align data within columns, making it easier to read and compare
values. This can be achieved using the "justify" parameter, which takes values such as "left",
"right", or "center". Additionally, Pandas provides options to control the width of columns,
ensuring that data is presented in a visually appealing manner.
Furthermore, Pandas offers methods to format entire dataframes, applying consistent formatting
rules to all columns. This can be done using the "style" attribute, which allows users to specify
formatting options for different aspects of the dataframe, such as font, background color, and
borders.
By leveraging the formatting capabilities of Pandas, users can effectively communicate insights
and patterns in their data, making it easier to analyze and interpret. Overall, formatting data in
Pandas is a crucial skill for data analysts and scientists to present their findings in a clear and
professional manner.
Ex 1 : Formatting Numeric Data
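The original example was a screenshot; a small sketch with an assumed Price column:

import pandas as pd

df = pd.DataFrame({'Price': [10.5, 20.25, 30.125]})
df['Price'] = df['Price'].map('{:.2f}'.format)  # format to two decimal places
print(df)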
Ex 2: Formatting Date and Time Data
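Again a sketch with assumed data, using the "dd/mm/yyyy" format mentioned above:

df = pd.DataFrame({'Date': pd.to_datetime(['2023-01-15', '2023-06-30'])})
df['Date'] = df['Date'].dt.strftime('%d/%m/%Y')  # dd/mm/yyyy
print(df)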
Ex 3: Aligning Data in Columns
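A sketch using to_string with the justify parameter mentioned earlier:

df = pd.DataFrame({'Name': ['Anitha', 'Barathi'], 'Salary': [50000, 60000]})
print(df.to_string(justify='center'))  # center the column headers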
Combining and merging data sets
Data contained in pandas objects can be combined together in a number of built-in ways:
• pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to
users of SQL or other relational databases, as it implements database join operations.
• combine_first instance method enables splicing together overlapping data to fill in missing values in
one object with values from another.
import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df1
df2
The examples below show a many-to-one merge situation: the data in df1 has multiple rows
labeled a and b, whereas df2 has only one row for each value in the key column.
pd.merge(df1, df2)               # performs an inner join
pd.merge(df1, df2, how='outer')  # performs an outer join
Observe that the 'c' and 'd' values and associated data are missing from the result. By default
merge does an 'inner' join; the keys in the result are the intersection. The outer join takes the
union of the keys, combining the effect of applying both left and right joins.
Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in the left
DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method only affects the
distinct key values appearing in the result:
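The frames below are assumptions consistent with the counts in the text (3 'b' rows on the left, 2 on the right):

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
print(pd.merge(df1, df2, on='key', how='left'))  # 6 'b' rows appear in the result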
Merging on Index
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you can pass
left_index=True or right_index=True (or both) to indicate that the index should be used as the merge
key:
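A minimal sketch; the frames are illustrative assumptions:

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
print(pd.merge(left1, right1, left_on='key', right_index=True))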
Concatenating Along an Axis
Another kind of data combination operation is alternatively referred to as concatenation, binding, or
stacking. NumPy has a concatenate function for doing this with raw NumPy arrays:
By default concat works along axis=0, producing another Series. If you pass axis=1, the result
will instead be a DataFrame (axis=1 is the columns):
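The original code cells here were images; a sketch of both forms with illustrative data:

import numpy as np
import pandas as pd

arr = np.arange(12).reshape((3, 4))
print(np.concatenate([arr, arr], axis=1))  # raw NumPy concatenation

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
print(pd.concat([s1, s2]))          # axis=0: a longer Series
print(pd.concat([s1, s2], axis=1))  # axis=1: a DataFrame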
There are a number of fundamental operations for rearranging tabular data. These are alternately
referred to as reshape or pivot operations.
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two
primary actions:
• stack: this pivots the columns in the data into the rows.
• unstack: this pivots the rows into the columns.
Using the stack method on this data pivots the columns into the rows, producing a
Series.
From a hierarchically-indexed Series, we can rearrange the data back into a DataFrame with
unstack.
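A sketch of the round trip, with an assumed two-row frame:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()    # pivot the columns into the rows -> Series
print(result)
print(result.unstack())  # pivot the rows back into the columns -> DataFrame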
By default the innermost level is unstacked (same with stack). You can unstack a different level
by passing a level number or name:
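Continuing the sketch above:

print(result.unstack(0))        # unstack the outermost ('state') level
print(result.unstack('state'))  # the same level, selected by name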
Unstacking might introduce missing data if all of the values in the level aren’t found in each of the
subgroups:
Pivoting “long” to “wide” Format
A common way to store multiple time series in databases and CSV is in so-called long or stacked
format:
Data is frequently stored this way in relational databases like MySQL as a fixed schema allows the
number of distinct values in the item column to increase or decrease as data is added or deleted in
the table.
The data may not be easy to work with in long format; it is preferred to have a DataFrame
containing one column per distinct item value indexed by timestamps in the date column.
DataFrame’s pivot method performs exactly this transformation.
The pivot() function is used to reshape a given DataFrame organized by given index / column
values. This function does not support data aggregation, multiple values will result in a MultiIndex
in the columns.
Syntax:
DataFrame.pivot(self, index=None, columns=None, values=None)
Example:
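The original example was an image; a hedged sketch in the long-format spirit of the text:

import pandas as pd

ldata = pd.DataFrame({'date': ['2000-01-01', '2000-01-01',
                               '2000-01-02', '2000-01-02'],
                      'item': ['realgdp', 'infl', 'realgdp', 'infl'],
                      'value': [2710.3, 0.0, 2778.8, 2.34]})
pivoted = ldata.pivot(index='date', columns='item', values='value')
print(pivoted)  # one column per distinct item, indexed by date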
Suppose you had two value columns that you wanted to reshape simultaneously:
By omitting the last argument, you obtain a DataFrame with hierarchical columns:
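Continuing the sketch, with an assumed second value column:

ldata['value2'] = [1.1, 2.2, 3.3, 4.4]            # illustrative second column
print(ldata.pivot(index='date', columns='item'))  # hierarchical columns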
Data transformation

Data transformation is the process of converting raw data into a format that is
suitable for analysis and modeling. It's an essential step in data science and analytics workflows,
helping to unlock valuable insights and make informed decisions.
A few of the data transformation mechanisms are:
• Removing Duplicates
• Replacing Values
• Renaming Axis Indexes
• Discretization and Binning
• Detecting and Filtering Outliers
• Permutation and Random Sampling
i) Removing duplicates
Duplicate rows may be found in a DataFrame using the duplicated method, which returns a
boolean Series indicating whether each row is a duplicate or not. Relatedly, drop_duplicates
returns a DataFrame filtering out the rows where the duplicated array is True.
import pandas as pd
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
data.duplicated()
data.drop_duplicates()
Suppose we wanted to find values in one of the columns exceeding one in magnitude:
To select all rows having a value exceeding 1 or -1, we can use the any method on a boolean
DataFrame:
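A sketch with random normal data (the threshold 1 follows the text):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(1000, 4))
col = data[2]
print(col[np.abs(col) > 1])                  # values in one column beyond ±1
print(data[(np.abs(data) > 1).any(axis=1)])  # rows with any value beyond ±1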
Sometimes it is necessary to replace specific values with other values or with NaN.
This can be done using the replace method. Let’s consider this Series:
data = Series([1., -999., 2., -999., -1000., 3.])
data
The -999 values might be sentinel values for missing data. To replace these with NA values that
pandas understands, we can use replace, producing a new Series:
data.replace(-999, np.nan)
If we want to replace multiple values at once, we instead pass a list and then the substitute
value:
data.replace([-999, -1000], np.nan)
To use a different replacement for each value, pass a list of substitutes:

data.replace([-999, -1000], [np.nan, 0])
To create a transformed version of a data set without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)
rename can be used in conjunction with a dict-like object providing new values for a subset of
the axis labels:
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
v. Discretization and binning
Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose we
have data about a group of people in a study, and we want to group them into discrete age
buckets:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do so,
we have to use cut, a function in pandas:
import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
The object pandas returns is a special Categorical object. We can treat it like an array of strings
indicating the bin name; internally it contains a levels array indicating the distinct category
names along with a labeling for the ages data in the labels attribute (in recent pandas versions
these are exposed as cats.codes and cats.categories):
cats.labels
cats.levels
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)
pd.value_counts(cats)
Consistent with mathematical notation for intervals, a parenthesis means that the side is open
while the square bracket means it is closed (inclusive).
vi. Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the
numpy.random.permutation function. Calling permutation with the length of the axis you want to
permute produces an array of integers indicating the new ordering:
df = DataFrame(np.arange(5 * 4).reshape(5, 4))
sampler = np.random.permutation(5)
df
sampler
That array can then be used in ix-based indexing or the take function:
df.take(sampler)
Another common transformation is converting a categorical variable into a “dummy” or indicator
matrix, which pandas provides through the get_dummies function:

pd.get_dummies(df['key'])
In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which
can then be merged with the other data. get_dummies has a prefix argument for doing just
this:
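A sketch assuming df has 'key' and 'data1' columns, as in the earlier merge example; the dummies frame built here is what the next two lines join:

import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
dummies = pd.get_dummies(df['key'], prefix='key')  # columns key_a, key_b, key_c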
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
String Manipulation
Python has long been a popular data munging language in part due to its ease-of-use for string
and text processing. Most text operations are made simple with the string object’s built-in
methods. For more complex pattern matching and text manipulations, regular expressions may
be needed. pandas adds to the mix by enabling you to apply string and regular expressions
concisely on whole arrays of data, additionally handling the annoyance of missing data.
RegEx Functions
The re module offers a set of functions that allow us to search a string for a match. Using
these functions we can search for a required pattern. They are as follows:
match(): re.match() determines if the RE matches at the beginning of the string. The method
returns a match object if the search is successful. If not, it returns None.
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output:
Search successful.
search(): The search( ) function searches the string for a match, and returns a Match object if
there is a match. If there is more than one match found, only the first occurrence of the
match will be returned.
import re

pattern = 'Tutorials'
line = 'Python Tutorials'
result = re.search(pattern, line)
print(result)
print(result.group())

Output:
<re.Match object; span=(7, 16), match='Tutorials'>
Tutorials
findall(): Finds all substrings where the RE matches and returns them as a list. It searches
from the start to the end of the given string and returns all occurrences of the pattern. While
searching for a pattern, it is recommended to use re.findall(); it works like both re.search()
and re.match().
import re

str = "The rain in Spain"
x = re.findall("ai", str)
print(x)

Output:
['ai', 'ai']
Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's
a list of metacharacters:
[] "[a-m]"
A set of characters "[^a-m]"
[^…]
Matches any single character NOT in
brackets
characters)
\
Signals a special sequence (can "\d"
also be used to escape special
. "he..o" ^ "^hello" $
Any character (except newline character) Starts with
"world$"
Ends with
* "aix*" + "aix+"
Zero or more occurrences One or more occurrences "al{2}"
{}
Exactly the specified number of occurrences
| "falls|stays"
Either or
() Capture and group
Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a
special meaning:
"\d"
\d Returns a match where the string contains
digits (numbers from 0-9)
"\D"
\D Returns a match where the string DOES NOT
contain digits
"\s"
\s Returns a match where the string contains a
white space character
"\S"
\S Returns a match where the string DOES NOT
contain a white space character
"\w"
\w Returns a match where the string contains any word characters (characters from
a to Z,
digits from 0-9, and the underscore _
character)
"\W"
\W Returns a match where the string DOES NOT
contain any word characters
"\AThe"
\A Returns a match if the specified characters
are at the beginning of the string
r"\bain"
\b Returns a match where the specified
r"ain\b"
characters are at the beginning or at the end
of a word
r"\Bain"
\B Returns a match where the specified
r"ain\B"
characters are present, but NOT at the beginning (or at the end) of a word
\Z Returns a match if the specified characters "Spain\Z"
are at the end of the string
[a-n]  Returns a match for any lower case character, alphabetically between a and n:

>>> str = "The rain in Spain"
>>> re.findall("[a-n]", str)
['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
>>> re.findall("[0-5][0-9]", str)
[]
Few more examples for searching the pattern in files:
Let us consider a text file pattern.txt
#pattern.txt
From: Bengaluru^560098
From:<[email protected]>
ravi
rohan
Mysore^56788
From:Karnataka
From: <[email protected]>
EX:1 Search for lines that start with 'F', followed by 2 characters, followed by 'm:'
import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)

Output:
From: Bengaluru^560098
From:<[email protected]>
From: <[email protected]>
The regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:”
since the period characters in the regular expression match any character.
Ex:2 Search for lines that start with From and have an at sign
import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)

Output:
From:<[email protected]>
From: <[email protected]>
The search string '^From:.+@' will successfully match lines that start with “From:”, followed by
one or more characters (.+), followed by an at-sign.
Ex:1 Extract anything that looks like an email address from the line.

import re

s = 'A message from [email protected] to [email protected] about meeting @2PM'
lst = re.findall('\S+@\S+', s)
print(lst)

Output: ['[email protected]', '[email protected]']
In the above example:
— The findall() method searches the string in the second argument and returns a list of all of the
strings that look like email addresses.
— Translating the regular expression, we are looking for substrings that have at least one
non-whitespace character, followed by an at-sign, followed by at least one more
non-whitespace character.
— The “\S+” matches as many non-whitespace characters as possible.

Ex:2 Search for lines that have an at sign between characters
import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('\S+@\S+', line)
    if len(x) > 0:
        print(x)

Output:
['From:<[email protected]>']
['<[email protected]>']
We read each line and then extract all the substrings that match our regular expression.
Since findall() returns a list, we simply check if the number of elements in our returned list is
more than zero to print only lines where we found at least one substring that looks like an
email address.
Observe the above output, email addresses have incorrect characters like “<” or “>” at the
beginning or end. To eliminate those characters, refer to the below example program.
Ex:3 Search for lines that have an at sign between characters .The characters must be a
letter or number
import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)

Output:
['From:[email protected]']
['[email protected]']
Here, we are looking for substrings that start with a single lowercase letter, uppercase letter,
or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters (\S*), followed by an
at-sign, followed by zero or more non-blank characters (\S*), followed by an uppercase or
lowercase letter. Note that we switched from + to * to indicate zero or
more non-blank characters since [a-zA-Z0-9] is already one non-blank character. Remember
that the * or + applies to the single character immediately to the left of the plus or asterisk.
Combining searching and extracting
• Sometimes we may want to extract the lines from the file that match a specific pattern, say
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
We can use the following regular expression to select the lines: ^X-.*: [0-9.]+
Let us consider a sample file called file.txt
#file.txt
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM done with the file content
Ex:1 Search for lines that start with 'X' followed by any non whitespace characters and
':' followed by a space and any number. The number can include a decimal.
import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.search('^X-.*: ([0-9.]+)', line)
    if x:
        print(x.group())

Output:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
— We are looking for lines that start with 'X-',
— followed by zero or more characters (.*),
— followed by a colon (:) and then a space.
— After the space we are looking for one or more characters that are either a digit (0-9) or
a period: [0-9.]+.
— Note that inside the square brackets, the period matches an actual period (i.e., it is
not a metacharacter between the square brackets).
• But if we want only the numbers in the above output, we can use the split() function on the
extracted string. However, it is better to refine the regular expression. To do so, we need the
help of parentheses.
When we add parentheses to a regular expression, they are ignored when matching the
string (with search()). But when we are using findall(), parentheses indicate that while we
want the whole expression to match, we are only interested in extracting a portion of the
substring that matches the regular expression.
Ex:2 Search for lines that start with 'X' followed by any non whitespace characters and ':'
followed by a space and any number. The number can include a decimal. Then print the
number if it is greater than zero.
import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^X.*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

Output:
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
Let us consider another example; assume that the file contains lines of the form:

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

If we wanted to extract all of the revision numbers (the integer number at the end of these
lines) using the same technique as above, we could write the following program:
Ex:3 Search for lines that start with 'Details: rev=' followed by numbers and '.'. Then print the
number if it is greater than zero.

import re

str = "Details:http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772"
x = re.findall('^Details:.*rev=([0-9]+)', str)
if len(x) > 0:
    print(x)

Output:
['39772']
In the above example, we are looking for lines that start with Details:, followed by any number
of characters (.*), followed by rev=, and then by one or more digits. We want to find lines that
match the entire expression, but we only want to extract the integer number at the end of the
line, so we surround [0-9]+ with parentheses. Note that the expression [0-9]+ is greedy: it
keeps grabbing digits until it finds any character other than a digit, so it matches as large a
number as possible.
Consider another example: we may be interested in knowing the time of day of each email.
The file may have lines like

From [email protected] Sat Jan 5 09:14:16 2008

Here, we would like to extract only the hour, 09. That is, we would like only the two digits
representing the hour. This can be done with the following code:
line="From [email protected] Sat Jan 5 09:14:16
2008"
x = re.findall('^From .* ([0-9][0-9]):', line)
if len(x) > 0:
print(x)
Output:
['09']
Escape character
Characters like dot, plus, question mark, asterisk, dollar, etc. are metacharacters in regular
expressions. Sometimes we need these characters themselves as part of the matching string.
Then we need to escape them using a backslash. For example,
import re

x = 'We just received $10.00 for cookies.'
y = re.search('\$[0-9.]+', x)
print("matched string:", y.group())

Output:
matched string: $10.00
Here, we want to extract only the price, $10.00. As the $ symbol is a metacharacter, we need
to use \ before it, so that $ is treated as part of the matching string rather than as a
metacharacter.
Question Bank