
Module 3
Data Pre-processing and Data Wrangling

Loading from CSV files


pandas features a number of functions for reading tabular data as a DataFrame object. These
functions convert text data into a DataFrame, and their options fall into a few categories:

• Indexing: can treat one or more columns as the index of the returned DataFrame, and whether to
get column names from the file, from the user, or not at all.

• Type inference and data conversion: this includes user-defined value conversions and custom
lists of missing value markers.

• Datetime parsing: includes combining capability, such as combining date and time information spread over
multiple columns into a single column in the result.

• Iterating: support for iterating over chunks of very large files.

• Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data
with thousands separated by commas.

read_csv and read_table are the most used functions.
Before using any methods in the pandas library, import the library with the following statement:

import pandas as pd

Let’s start with a small comma-separated (CSV) text file, ex1.csv:

df = pd.read_csv('ex1.csv')
pd.read_table('ex1.csv', sep=',')

Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If a file contains
any other delimiter, then read_table can be used by specifying that delimiter.

pandas allows you to assign column names by specifying the names argument.

Suppose we wanted the message column to be the index of the returned DataFrame. We can either
indicate we want the column at index 4 or the one named 'message' using the index_col argument:
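A minimal sketch of both options, assuming a hypothetical ex1.csv with no header row whose last
column holds messages:

import pandas as pd

# hypothetical contents of ex1.csv:
# 1,2,3,4,hello
# 5,6,7,8,world
names = ['a', 'b', 'c', 'd', 'message']
df = pd.read_csv('ex1.csv', names=names)                       # user-supplied column names
df = pd.read_csv('ex1.csv', names=names, index_col='message')  # 'message' becomes the index
df = pd.read_csv('ex1.csv', names=names, index_col=4)          # same, by column position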

To form a hierarchical index from multiple columns, just pass a list of column numbers or names.
The parser functions have many additional arguments to help you handle the wide variety of
exception file formats that occur. For example, you can skip the first, third, and fourth rows of a
file with skiprows.
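A short sketch of both, using hypothetical files csv_mindex.csv and ex4.csv:

# hierarchical index built from the key1 and key2 columns
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])

# skip the first, third, and fourth rows (row numbers 0, 2 and 3)
result = pd.read_csv('ex4.csv', skiprows=[0, 2, 3])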

Handling missing values


 Handling missing values is an important and frequently nuanced part of the file parsing
process. Missing data is usually either not present (empty string) or marked by some sentinel
value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND,
and NULL:


 The na_values option can take either a list or a set of strings to consider as missing values.

 Different NA sentinels can be specified for each column in a dict:
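A sketch of both forms, assuming a hypothetical ex5.csv with columns named something and message:

result = pd.read_csv('ex5.csv', na_values=['NULL'])     # extra strings treated as NA

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}   # per-column sentinels
result = pd.read_csv('ex5.csv', na_values=sentinels)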


Accessing SQL databases


 A database is a file that is organized for storing data. Most databases are organized like a
dictionary in the sense that they map from keys to values. The biggest difference is that
the database is on disk (or other permanent storage), so it persists after the program
ends. Because a database is stored on permanent storage, it can store far more data
than a dictionary, which is limited to the size of the memory in the computer.
 Like a dictionary, database software is designed to keep the inserting and accessing of
data very fast, even for large amounts of data. Database software maintains its
performance by building indexes as data is added to the database to allow the computer to
jump quickly to a particular entry.

 There are many different database systems which are used for a wide variety of purposes
including: Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite.

 Python has built-in support for the SQLite database. SQLite is designed to be embedded into other
applications to provide database support within the application.

 While Python can work with data in SQLite database files directly, many operations can be done more
conveniently using software called the Database Browser for SQLite, which is freely
available from:
http://sqlitebrowser.org/

 Using the browser you can easily create tables, insert data, edit data, or run simple SQL
queries on the data in the database

Database concepts

 At first look, a database seems to be a spreadsheet consisting of multiple sheets. The
primary data structures in a database are tables, rows and columns.

 In relational database terminology, tables, rows and columns are referred to as relation,
tuple and attribute respectively. The typical structure of a database table is shown below in
Table 3.1.

 Each table may consist of n number of attributes and m number of tuples (or records).

 Every tuple gives the information about one individual. Every cell (i, j) in the table indicates the
value of the jth attribute for the ith tuple.

Table 3.1: Typical Relational database table

Consider the problem of storing details of students in a database table. The format may look
like –

            Roll No   Name     DOB          Marks
Student 1   1         Akshay   22/10/2001   82.5
Student 2   2         Arun     20/12/2000   81.3
...         ...       ...      ...          ...
Student m   ...       ...      ...          ...
Thus, table columns indicate the type of information to be stored, and table rows give the
record pertaining to each student. We can create one more table, say department, consisting
of attributes like dept_id, homephno and City. This table can be related to the student table
through the Rollno stored in student and the dept_id stored in department. Thus, there is a
relationship between two tables in a single database. Software that can maintain proper
relationships between multiple tables in a single database is known as a Relational Database
Management System (RDBMS).

Creating a database table


The code to create a database named music.db and a table named Tracks with two columns in
the database is as follows:

import sqlite3
conn = sqlite3.connect('music.db')
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
conn.close()

 The connect operation makes a “connection” to the database stored in the file music.db in
the current directory. If the file does not exist, it will be created.

 A cursor is like a file handle that we can use to perform operations on the data stored in the
database. Calling cursor() is very similar conceptually to calling open() when dealing with
text files.

 Once we have the cursor, we can begin to execute commands on the contents of the
database using the execute() method, as shown in the figure below.
Figure: Database Cursor

cur.execute("INSERT INTO Tracks (title, plays) VALUES ('My Way', 15)")


This command inserts one record into the table Tracks where the values of the attributes title and plays
are 'My Way' and 15 respectively.

cur.execute('SELECT * FROM Tracks')

Retrieves all the records from the table Tracks.

cur.execute("SELECT * FROM Tracks WHERE title = 'My Way'")

Retrieves the records from the table Tracks having the value of attribute title as 'My Way'.

cur.execute("UPDATE Tracks SET plays = 16 WHERE title = 'My Way'")

Whenever we would like to modify the value of any particular attribute in the table, we can use the UPDATE
command. Here, the value of attribute plays is assigned a new value for the record having the value of
title as 'My Way'.

cur.execute("DELETE FROM Tracks WHERE title = 'My Way'")

A particular record can be deleted from the table using the DELETE command. Here, the record with the
value of attribute title as 'My Way' is deleted from the table Tracks.

cur.execute('DROP TABLE IF EXISTS Tracks')

This command deletes the entire table, including all of its contents.

Example 1: Write a Python program to create a student table in a college database (with attributes
Name, USN, Marks). Perform operations like insert, delete and retrieve records from the student
table.


import sqlite3
conn = sqlite3.connect('college.db')
cur = conn.cursor()
print("Opened database successfully")
cur.execute('CREATE TABLE student (name TEXT, usn TEXT, marks INTEGER)')
print("Table created successfully")
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?,?,?)', ('akshay', '1rn16mca16', 30))
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?,?,?)', ('arun', '1rn16mca17', 65))
print('student')
cur.execute('SELECT name, usn, marks FROM student')
for row in cur:
    print(row)
cur.execute('DELETE FROM student WHERE marks < 40')
cur.execute('SELECT name, usn, marks FROM student')
conn.commit()
cur.close()
Output:
Opened database successfully
Table created successfully
student
('akshay', '1rn16mca16', 30)
('arun', '1rn16mca17', 65)
Example 2: Write a Python program to create a database file (music.sqlite) and a table named Tracks
with two columns, title and plays. Also insert, display and delete the contents of the table.

import sqlite3
conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
cur.execute("INSERT INTO Tracks (title, plays) VALUES ('Thunderstruck', 200)")
cur.execute("INSERT INTO Tracks (title, plays) VALUES (?, ?)", ('My Way', 15))
conn.commit()
print('Tracks:')
cur.execute('DELETE FROM Tracks WHERE plays < 100')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
    print(row)
cur.close()

Output
Tracks:
('Thunderstruck', 200)

Cleansing Data with Python: Stripping out extraneous information


Extraneous information refers to irrelevant or unnecessary data that can clutter a dataset and
make it difficult to analyze. This could include duplicate entries, empty fields, or irrelevant
columns. Stripping out this information involves removing it from the dataset, resulting in a
more concise and manageable dataset.

To strip out extraneous information in a Pandas DataFrame, you can use various methods and
functions provided by the library. Some commonly used methods include:

 dropna(): This method removes rows with missing values (NaN or None) from the DataFrame.
You can specify the axis (0 for rows and 1 for columns) along which the rows or columns with
missing values should be dropped.
Example:
df = df.dropna()
# This will remove all rows that contain at least one missing value.

 drop(): The drop() method in Pandas is used to remove rows or columns from a DataFrame. It can be
used to drop a single column or multiple columns at once.
df.drop(columns, axis=1, inplace=False)
Ex:
cars2 = cars_data.drop(['Doors', 'Weight'], axis='columns')

 drop_duplicates(): This method removes duplicate rows. You can specify the columns
based on which the duplicates should be checked.
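A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Name': ['Anitha', 'Anitha', 'Barathi'],
                   'Marks': [82, 82, 75]})
df = df.drop_duplicates()                  # drops the exact-duplicate second row
df = df.drop_duplicates(subset=['Name'])   # duplicates judged on 'Name' alone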

 loc[ ] and iloc[ ]: These indexing methods allow you to select specific rows and columns from the
DataFrame. They are used to select only the relevant data and exclude the unwanted information,
as sketched below.
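A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Name': ['Anitha', 'Barathi', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

subset = df.loc[0:1, ['Name', 'Age']]   # loc selects by label: rows 0-1, two columns
subset2 = df.iloc[0:2, 0:2]             # iloc selects by integer position: same rows/columns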

 Filtering: conditional statements can be used to filter the DataFrame and select only the rows that
meet certain criteria. This allows you to remove unwanted data based on specific conditions.
Example:

import pandas as pd

data = {'Name': ['Anitha', 'Barathi', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Example of filtering based on a condition
filtered_df = df[df['Age'] > 30]

# Display the filtered DataFrame
print(filtered_df)

Normalizing Data and Formatting Data

Normalizing Data:

Data normalization is the process of transforming data into a consistent format to facilitate
comparison and analysis. This may involve converting data to a common unit of measurement,
formatting dates and times consistently, or standardizing data formats. Normalization ensures that
data is comparable and can be easily processed and analysed.

Normalization is a crucial step in data preprocessing for machine learning tasks. It involves
transforming numerical features to have a mean of 0 and a standard deviation of 1. This process
ensures that all features are on the same scale, enabling efficient and accurate learning by machine
learning algorithms.

In Python, several libraries provide functions for data normalization.

Method 1: Using sklearn


The sklearn method is a very famous method to normalize the data.


We import all the required libraries, NumPy and sklearn. Note that we import preprocessing from
sklearn itself; that is why this is called the sklearn normalization method. We create a NumPy array
with some integer values that are not all the same, then call the normalize method from preprocessing
and pass the array we just created as a parameter. We can see from the results that all integer data
are now normalized between 0 and 1.
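A minimal sketch of what this describes, with made-up integer data:

import numpy as np
from sklearn import preprocessing

numpy_array = np.array([[2.0, 8.0, 4.0],
                        [6.0, 3.0, 9.0]])
normalized = preprocessing.normalize(numpy_array)   # unit-norm scaling, row-wise by default
print(normalized)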

Method 2: Normalize a particular column in a dataset using sklearn

We can also normalize a particular dataset column. In this method, we discuss how.

We import the libraries pandas and sklearn. We create a dummy CSV file and load it with the help of
the pandas read_csv function, then print the CSV file which we just loaded. We read the particular
column of the CSV file using np.array and store the result in value_array. We call the normalize
method from preprocessing and pass value_array as the parameter.
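A sketch, assuming a hypothetical dummy CSV file data.csv with a numeric column named Marks
(both names are made up):

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('data.csv')
print(df)
value_array = np.array(df['Marks']).reshape(1, -1)   # normalize expects a 2-D array
normalized = preprocessing.normalize(value_array)
print(normalized)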

Method 3: Normalize the whole dataset without converting columns to arrays (using sklearn)

In the previous method 2, we discussed how to normalize a particular CSV file column. But sometimes
we need to normalize the whole dataset; then we can use the method below, where we normalize the
whole dataset column-wise (axis = 0). If we mention axis = 1, it will normalize row-wise. axis = 1 is
the default value.
Now, we pass the whole CSV file along with one extra parameter, axis=0, which tells the
library that the user wants to normalize the whole dataset column-wise.
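A sketch, again assuming a hypothetical all-numeric data.csv:

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('data.csv')
normalized = preprocessing.normalize(df, axis=0)   # column-wise normalization
print(normalized)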

Method 4: Using MinMaxScaler()

sklearn also provides another method of normalization, called MinMaxScaler. This
is also a very popular method because it is easy to use.


We call MinMaxScaler from the preprocessing module and create an object (min_max_scaler) for it.
We do not pass any parameters because we need to normalize the data between 0 and 1. But if you
want, you can pass your own values, which will be seen in the next method.
We first read all the names of the columns for further use in displaying results. Then we call
fit_transform on the created min_max_scaler object and pass the CSV file into it. We get the
normalized results, which lie between 0 and 1.
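A sketch, assuming a hypothetical all-numeric data.csv:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('data.csv')
columns = df.columns                       # keep the column names for the result
min_max_scaler = MinMaxScaler()            # default feature range is 0 to 1
scaled = min_max_scaler.fit_transform(df)
print(pd.DataFrame(scaled, columns=columns))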

Method 5: Using MinMaxScaler(feature_range=(x, y))

sklearn also provides the option to change the normalized range to whatever you want. By default,
it normalizes the values between 0 and 1. But there is a parameter called feature_range which can
set the normalized range according to our requirements.

Here, we call MinMaxScaler from the preprocessing module and create an object (min_max_scaler)
for it. But we also pass another parameter, feature_range, to MinMaxScaler. We set that parameter
to the range 0 to 2, so now MinMaxScaler will normalize the data values between 0 and 2. We first
read all the names of the columns for further use in displaying results. Then we call fit_transform
on the created min_max_scaler object and pass the CSV file into it. We get the normalized results,
which lie between 0 and 2.
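The same sketch with a custom range:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('data.csv')                          # hypothetical all-numeric CSV
min_max_scaler = MinMaxScaler(feature_range=(0, 2))   # normalize between 0 and 2
scaled = min_max_scaler.fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns))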


Method 6: Using the maximum absolute scaling

We can also normalize the data using pandas; this approach is also very popular. The
maximum absolute scaling normalizes values between 0 and 1, applying .max() and .abs() as
shown below:
We take each column and then divide its values by the column's .abs().max(). We print the
result, and from the result we confirm that our data is normalized between 0 and 1.
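A pandas-only sketch (hypothetical all-numeric data.csv):

import pandas as pd

df = pd.read_csv('data.csv')
for column in df.columns:
    df[column] = df[column] / df[column].abs().max()
print(df)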

Method 7: Using the z-score method

The next method which we are going to discuss is the z-score method. This method converts the
data to a standard distribution: it calculates the mean of each column, subtracts it from the
column and, at last, divides by the standard deviation. This leaves each column with a mean of 0
and a standard deviation of 1.
We calculate each column’s mean and subtract it from the column. Then we divide the column
values by the standard deviation, and print the standardized data.
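A pandas-only sketch (hypothetical all-numeric data.csv):

import pandas as pd

df = pd.read_csv('data.csv')
for column in df.columns:
    df[column] = (df[column] - df[column].mean()) / df[column].std()
print(df)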

One popular library is Scikit-Learn, which offers the StandardScaler class for normalization. Here's
an example of how to use StandardScaler to normalize a dataset:
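A minimal sketch with made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [30.0, 600.0]])
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
print(standardized)   # each column now has mean 0 and standard deviation 1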

Formatting Data:

 Formatting data in Pandas involves transforming and presenting data in a structured and
readable manner. Pandas, a popular Python library for data analysis, offers various methods
and techniques to format data effectively.

 One of the key features of Pandas is its ability to handle different data types and structures. It
provides specific formatting options for each data type, ensuring that data is displayed in a
consistent and meaningful way. For example, numeric data can be formatted with a specific
number of decimal places, currency symbols, or percentage signs. Date and time data can be
formatted in various formats, such as "dd/mm/yyyy" or "hh:mm:ss".

 Pandas also allows users to align data within columns, making it easier to read and compare
values. This can be achieved using the "justify" parameter, which takes values such as "left",
"right", or "center". Additionally, Pandas provides options to control the width of columns,
ensuring that data is presented in a visually appealing manner.

 Furthermore, Pandas offers methods to format entire dataframes, applying consistent formatting
rules to all columns. This can be done using the "style" attribute, which allows users to specify
formatting options for different aspects of the dataframe, such as font, background color, and
borders.

 By leveraging the formatting capabilities of Pandas, users can effectively communicate insights
and patterns in their data, making it easier to analyze and interpret. Overall, formatting data in
Pandas is a crucial skill for data analysts and scientists to present their findings in a clear and
professional manner.
Ex 1 : Formatting Numeric Data
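A minimal sketch (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({'Price': [49.5, 129.999, 8.25],
                   'Growth': [0.051, 0.123, 0.087]})
print(df['Price'].map('${:,.2f}'.format))   # currency with two decimal places
print(df['Growth'].map('{:.1%}'.format))    # formatted as percentages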

Ex 2: Formatting Date and Time Data
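A minimal sketch (the dates are made up):

import pandas as pd

df = pd.DataFrame({'JoinDate': pd.to_datetime(['2023-01-15 09:30:00',
                                               '2023-06-30 17:05:45'])})
print(df['JoinDate'].dt.strftime('%d/%m/%Y'))   # dd/mm/yyyy
print(df['JoinDate'].dt.strftime('%H:%M:%S'))   # hh:mm:ss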
Ex 3: Aligning Data in Columns
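A minimal sketch using to_string with the justify parameter mentioned above:

import pandas as pd

df = pd.DataFrame({'Name': ['Anitha', 'Barathi', 'Charlie'],
                   'Marks': [82.5, 81.3, 76.0]})
print(df.to_string(justify='center', index=False))   # centre the column headers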


Combining and merging data sets
Data contained in pandas objects can be combined together in a number of built-in ways:
• pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to
users of SQL or other relational databases, as it implements database join operations.

• pandas.concat glues or stacks together objects along an axis.

• combine_first instance method enables splicing together overlapping data to fill in missing values in
one object with values from another.

Database-style DataFrame Merges


Merge or join operations combine data sets by linking rows using one or more keys. The merge
function in pandas is used to combine datasets.

Let’s start with a simple example:

import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df1
df2

 The below example shows a many-to-one merge situation; the data in df1 has multiple rows
labeled a and b, whereas df2 has only one row for each value in the key column.

pd.merge(df1, df2, on='key')       # performs an inner join (the default)
pd.merge(df1, df2, how='outer')    # performs an outer join



Observe that the 'c' and 'd' values and associated data are missing from the result. By default
merge does an 'inner' join; the keys in the result are the intersection. The outer join takes the
union of the keys, combining the effect of applying both left and right joins.

 The below examples shows Many-to-many merges:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],'data2': range(5)})

df1
df2

Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in the left
DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method only affects the
distinct key values appearing in the result:

pd.merge(df1, df2, how='inner')
pd.merge(df1, df2, on='key', how='left')

Merging on Index
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you
can pass left_index=True or right_index=True (or both) to indicate that the index should be used
as the merge key.
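A small sketch with made-up frames:

import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# right_index=True tells merge to use right1's index as its join key
pd.merge(left1, right1, left_on='key', right_index=True)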
Concatenating Along an Axis
Another kind of data combination operation is alternatively referred to as concatenation, binding, or
stacking. NumPy has a concatenate function for doing this with raw NumPy arrays:

By default concat works along axis=0, producing another Series. If you pass axis=1, the result
will instead be a DataFrame (axis=1 is the columns):
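A short sketch of both the NumPy and pandas forms:

import numpy as np
import pandas as pd

arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=1)    # raw NumPy concatenation

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
pd.concat([s1, s2])                   # along axis=0: produces another Series
pd.concat([s1, s2], axis=1)           # along axis=1: produces a DataFrame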

There are a number of fundamental operations for rearranging tabular data. These are alternatively
referred to as reshape or pivot operations.

Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two
primary actions:
• stack: this changes from the columns in the data to the rows.
• unstack: this changes from the rows into the columns.

 Using the stack method on this data pivots the columns into the rows, producing a
Series.
From a hierarchically-indexed Series, we can rearrange the data back into a DataFrame with
unstack.
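A minimal sketch with made-up data:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()     # pivots the columns into the rows, producing a Series
result.unstack()          # rearranges back into a DataFrame
result.unstack('state')   # unstack a different level, by name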

 By default the innermost level is unstacked (same with stack). You can unstack a different level
by passing a level number or name:

 Unstacking might introduce missing data if all of the values in the level aren’t found in eachof the
subgroups:
Pivoting “long” to “wide” Format
 A common way to store multiple time series in databases and CSV files is in so-called long or
stacked format:

 Data is frequently stored this way in relational databases like MySQL as a fixed schema allows the
number of distinct values in the item column to increase or decrease as data is added or deleted in
the table.

 The data may not be easy to work with in long format; it is preferred to have a DataFrame
containing one column per distinct item value indexed by timestamps in the date column.
DataFrame’s pivot method performs exactly this transformation.

The pivot() function is used to reshape a given DataFrame organized by given index / column
values. This function does not support data aggregation; multiple values will result in a MultiIndex
in the columns.

Syntax:
DataFrame.pivot(index=None, columns=None, values=None)
Example:
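A minimal sketch with hypothetical long-format data:

import pandas as pd

ldata = pd.DataFrame({'date': ['2000-03-31', '2000-03-31', '2000-06-30', '2000-06-30'],
                      'item': ['realgdp', 'infl', 'realgdp', 'infl'],
                      'value': [2710.349, 0.0, 2778.801, 2.34]})
pivoted = ldata.pivot(index='date', columns='item', values='value')
print(pivoted)   # one column per distinct item value, indexed by date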


Suppose you had two value columns that you wanted to reshape simultaneously.
 By omitting the values argument, you obtain a DataFrame with hierarchical columns, as sketched
below:
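Continuing the sketch above with a second, made-up value column:

ldata['value2'] = [1.1, 1.2, 1.3, 1.4]
pivoted = ldata.pivot(index='date', columns='item')   # values argument omitted
print(pivoted)   # hierarchical columns: (value, item) and (value2, item)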


Data transformation

Data transformation is the process of converting raw data into a format that is suitable for
analysis and modeling. It's an essential step in data science and analytics workflows, helping to
unlock valuable insights and make informed decisions.
A few of the data transformation mechanisms are:
• Removing Duplicates
• Replacing Values
• Renaming Axis Indexes
• Discretization and Binning
• Detecting and Filtering Outliers
• Permutation and Random Sampling

i) Removing duplicates
Duplicate rows may be found in a DataFrame using the method duplicated, which returns a
boolean Series indicating whether each row is a duplicate or not. Relatedly, drop_duplicates
returns a DataFrame keeping only the rows where the duplicated array is False.

import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
data.duplicated()
data.drop_duplicates()   # rows 1, 4 and 6 are dropped

ii) Filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations.


Consider a DataFrame with some normally distributed data. (Note : while writing answers, write
your own random numbers between 0 and 1)


 Suppose we wanted to find values in one of the columns exceeding one in magnitude:
 To select all rows having a value exceeding 1 or -1, we can use the any method on a boolean
DataFrame:
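A sketch of both selections, with made-up normally distributed data:

import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

col = data[3]
col[np.abs(col) > 1]                    # values in column 3 exceeding 1 in magnitude

data[(np.abs(data) > 1).any(axis=1)]    # rows where any value exceeds 1 or -1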

iii) Replacing Values

 Sometimes it is necessary to replace specific sentinel values with other values or with NaN.
This can be done using the replace method. Let’s consider this Series:

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

 The -999 values might be sentinel values for missing data. To replace these with NA values that
pandas understands, we can use replace, producing a new Series:

data.replace(-999, np.nan)
 If we want to replace multiple values at once, you instead pass a list and then the substitute
value:
data.replace([-999, -1000], np.nan)

 To use a different replacement for each value, pass a list of substitutes, or a dict:
data.replace([-999, -1000], [np.nan, 0])
data.replace({-999: np.nan, -1000: 0})


iv. Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to
produce new, differently labeled objects. The axes can also be modified in place without creating a new data
structure.

 We can assign to index, modifying the DataFrame in place:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data.index = data.index.map(str.upper)
data

 To create a transformed version of a data set without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)


 rename can be used in conjunction with a dict-like object providing new values for a subset of
the axis labels:
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
v. Discretization and binning

 Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose we
have data about a group of people in a study, and we want to group them into discrete age
buckets:

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do so,
we have to use cut, a function in pandas:

import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

 The object pandas returns is a special Categorical object. We can treat it like an array of strings
indicating the bin name; internally it contains a categories array indicating the distinct category
names along with a labeling for the ages data in the codes attribute (older pandas versions called
these levels and labels):

cats.codes

cats.categories
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)
pd.value_counts(cats)

Consistent with mathematical notation for intervals, a parenthesis means that the side is open
while the square bracket means it is closed (inclusive).

vi. Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the
numpy.random.permutation function. Calling permutation with the length of the axis you want to
permute produces an array of integers indicating the new ordering:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
df
sampler = np.random.permutation(5)
sampler            # e.g. array([1, 0, 2, 3, 4])

That array can then be used in iloc-based indexing or with the take function:
df.take(sampler)

vii. Computing Indicator/Dummy Variables


Another type of transformation for statistical modeling or machine learning applications is
converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a
DataFrame has k distinct values, you would derive a matrix or DataFrame containing k
columns of 1’s and 0’s. pandas has a get_dummies function for doing this, though
devising one yourself is not difficult. Let’s return to an earlier example DataFrame:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

pd.get_dummies(df['key'])

In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which
can then be merged with the other data. get_dummies has a prefix argument for doing just
this:

dummies = pd.get_dummies(df['key'], prefix='key')

df_with_dummy = df[['data1']].join(dummies)

df_with_dummy

String Manipulation
Python has long been a popular data munging language in part due to its ease-of-use for string
and text processing. Most text operations are made simple with the string object’s built in
methods. For more complex pattern matching and text manipulations, regular expressions may
be needed. pandas adds to the mix by enabling you to apply string and regular expressions
concisely on whole arrays of data, additionally handling the annoyance of missing data.

String Object Methods


 In many string munging and scripting applications, built-in string methods are sufficient. Examples:

val = 'a,b, guido'

# A comma-separated string can be broken into pieces with split:
val.split(',')                            # ['a', 'b', ' guido']

# split is often combined with strip to trim whitespace (including newlines):
pieces = [x.strip() for x in val.split(',')]
pieces                                    # ['a', 'b', 'guido']

# The substrings could be concatenated together with a two-colon delimiter using addition:
first, second, third = pieces
first + '::' + second + '::' + third      # 'a::b::guido'

# A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':
'::'.join(pieces)                         # 'a::b::guido'

Note: refer to module 2 for the remaining string methods.

 A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

 RegEx can be used to check if a string contains the specified search pattern.

 Python has a built-in package called re, which can be used to work with Regular Expressions.

RegEx Functions
The re module offers a set of functions that allow us to search a string for a match. Using
these functions, we can search for a required pattern. They are as follows:

 match(): re.match() determines if the RE matches at the beginning of the string. The method
returns a match object if the search is successful. If not, it returns None.

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output:
Search successful.

 search(): The search( ) function searches the string for a match, and returns a Match object if
there is a match. If there is more than one match found, only the first occurrence of the
match will be returned.

import re

pattern = 'Tutorials'
line = 'Python Tutorials'
result = re.search(pattern, line)
print(result)
print(result.group())   # group() returns the matched string

Output:
<re.Match object; span=(7, 16), match='Tutorials'>
Tutorials

 findall(): Finds all substrings where the RE matches, and returns them as a list. It scans the
whole string and returns all occurrences of the pattern. While searching for a pattern, it is often
convenient to use re.findall(), since it covers the use cases of both re.search() and re.match().

import re

str = "The rain in Spain"
x = re.findall("ai", str)
print(x)

Output:
['ai', 'ai']

Character matching in regular expressions


 Python provides a list of meta-characters to match search strings.

 Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's
a list of metacharacters:

[] "[a-m]"
A set of characters "[^a-m]"

[^…]
Matches any single character NOT in
brackets
characters)
\
Signals a special sequence (can "\d"
also be used to escape special

. "he..o" ^ "^hello" $
Any character (except newline character) Starts with

"world$"
Ends with

* "aix*" + "aix+"
Zero or more occurrences One or more occurrences "al{2}"

{}
Exactly the specified number of occurrences
| "falls|stays"
Either or
() Capture and group

When parentheses are added to a


regular expression, they are ignored for
the purpose of matching, but allow you to
extract a particular subset of the
matched string rather than the whole
string when using findall()

Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a
special meaning:
"\d"
\d Returns a match where the string contains
digits (numbers from 0-9)
"\D"
\D Returns a match where the string DOES NOT
contain digits
"\s"
\s Returns a match where the string contains a
white space character
"\S"
\S Returns a match where the string DOES NOT
contain a white space character

"\w"
\w Returns a match where the string contains any word characters (characters from
a to Z,
digits from 0-9, and the underscore _
character)
"\W"
\W Returns a match where the string DOES NOT
contain any word characters
"\AThe"
\A Returns a match if the specified characters
are at the beginning of the string
r"\bain"
\b Returns a match where the specified
r"ain\b"
characters are at the beginning or at the end
of a word
r"\Bain"
\B Returns a match where the specified
r"ain\B"
characters are present, but NOT at the beginning (or at the end) of a word
\Z Returns a match if the specified characters "Spain\Z"
are at the end of the string

Few examples on set of characters for pattern matching are as follows:

Set         Description                                   Examples

[arn]       Returns a match where one of the specified    >>> str = "The rain in Spain"
            characters (a, r, or n) is present            >>> re.findall("[arn]", str)
                                                          ['r', 'a', 'n', 'n', 'a', 'n']

[a-n]       Returns a match for any lower case            >>> str = "The rain in Spain"
            character, alphabetically between a and n     >>> re.findall("[a-n]", str)
                                                          ['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']

[^arn]      Returns a match for any character             >>> str = "The rain in Spain"
            EXCEPT a, r, and n                            >>> re.findall("[^arn]", str)
                                                          ['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']

[0123]      Returns a match where any of the specified    >>> str = "The rain in Spain"
            digits (0, 1, 2, or 3) are present            >>> re.findall("[0123]", str)
                                                          []

[0-9]       Returns a match for any digit                 >>> str = "8 times before 11:45 AM"
            between 0 and 9                               >>> re.findall("[0-9]", str)
                                                          ['8', '1', '1', '4', '5']

[0-5][0-9]  Returns a match for any two-digit number      >>> str = "8 times before 11:45 AM"
            from 00 to 59                                 >>> re.findall("[0-5][0-9]", str)
                                                          ['11', '45']

[a-zA-Z]    Returns a match for any character             >>> str = "8 times before 11:45 AM"
            alphabetically between a and z,               >>> re.findall("[a-zA-Z]", str)
            lower case OR upper case                      ['t', 'i', 'm', 'e', 's', 'b', 'e',
                                                          'f', 'o', 'r', 'e', 'A', 'M']

[+]         In sets, +, *, ., |, (), $, {} have no        >>> str = "8 times before 11:45 AM"
            special meaning, so [+] returns a match       >>> re.findall("[+]", str)
            for any + character in the string             []
Few more examples for searching patterns in files:
Let us consider a text file, pattern.txt:

# pattern.txt
From: Bengaluru^560098
From:<[email protected]>
ravi
rohan
Mysore^56788
From:Karnataka
From: <[email protected]>

Ex:1 Search for lines that start with 'F', followed by 2 characters, followed by 'm:'

import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)

Output:
From: Bengaluru^560098
From:<[email protected]>
From: <[email protected]>

The regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:”
since the period characters in the regular expression match any character.

Ex:2 Search for lines that start with From and have an at sign

import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)

Output:
From:<[email protected]>
From: <[email protected]>

The search string ^From:.+@ will successfully match lines that start with “From:”, followed by
one or more characters (.+), followed by an at-sign.

Extracting data using regular expressions


If we want to extract data from a string in Python we can use the findall() method to
extract all of the substrings which match a regular expression.

Ex:1 Extract anything that looks like an email address from the line.

import re

s = 'A message from [email protected] to [email protected] about meeting @2PM'
lst = re.findall('\S+@\S+', s)
print(lst)

Output: ['[email protected]', '[email protected]']

In the above example:
— The findall() method searches the string in the second argument and returns a list of all of the
strings that look like email addresses.
— Translating the regular expression, we are looking for substrings that have at least one
non-whitespace character, followed by an at-sign, followed by at least one more non-whitespace
character.
— The “\S+” matches as many non-whitespace characters as possible.

Ex:2 Search for lines that have an at sign between characters

import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('\S+@\S+', line)
    if len(x) > 0:
        print(x)

Output:
['From:<[email protected]>']
['<[email protected]>']

We read each line and then extract all the substrings that match our regular expression.
Since findall() returns a list, we simply check if the number of elements in our returned list is
more than zero to print only lines where we found at least one substring that looks like an
email address.
Observe the above output: the email addresses have extra characters like “<” or “>” at the
beginning or end. To eliminate those characters, refer to the example program below.

Ex:3 Search for lines that have an at sign between characters. The characters must be a
letter or number.

import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)

Output:
['From:[email protected]']
['[email protected]']

Here, we are looking for substrings that start with a single lowercase letter, uppercase letter,
or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters (\S*), followed by an
at-sign, followed by zero or more non-blank characters (\S*), followed by an uppercase or
lowercase letter. Note that we switched from + to * to indicate zero or
more non-blank characters since [a-zA-Z0-9] is already one non-blank character. Remember
that the * or + applies to the single character immediately to the left of the plus or asterisk.

Combining searching and extracting
• Sometimes we may want to extract the lines from the file that match with specific pattern, let
say
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

We can use the following regular expression to select the lines: ^X-.*: [0-9.]+
Let us consider a sample file called file.txt
file.txt

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961

Ex:1 Search for lines that start with 'X' followed by any non whitespace characters and
':' followed by a space and any number. The number can include a decimal.

import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.search('^X-.*: ([0-9.]+)', line)
    if x:
        print(x.group())

Output:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961

Here, it selects the lines that
— start with X-,
— followed by zero or more characters (.*),
— followed by a colon (:) and then a space.
— After the space we are looking for one or more characters that are either a digit (0- 9) or
a period [0-9.]+.
— Note that inside the square brackets, the period matches an actual period (i.e., it is
not a meta character between the square brackets).

• But what if we want only the numbers in the above output? We could use the split() function on
the extracted string. However, it is better to refine the regular expression. To do so, we need
the help of parentheses.

When we add parentheses to a regular expression, they are ignored when matching the
string (with search()). But when we are using findall(), parentheses indicate that while we
want the whole expression to match, we are only interested in extracting a portion of the
substring that matches the regular expression.

Ex:2 Search for lines that start with 'X' followed by any non whitespace characters and ':'
followed by a space and any number. The number can include a decimal. Then print the
number if it is greater than zero.
import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^X.*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

Output:
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']

 Let us consider another example; assume that the file contains lines of the form:

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

If we wanted to extract all of the revision numbers (the integer number at the end of these
lines) using the same technique as above, we could write the following program:
Ex:3 Search for lines that start with 'Details: rev=' followed by numbers and '.' Then print the
number if it is greater than zero.
import re

str = "Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772"
x = re.findall('^Details:.*rev=([0-9]+)', str)
if len(x) > 0:
    print(x)
Output:
['39772']

In the above example, we are looking for lines that start with Details:, followed by any number
of characters (.*), followed by rev=, and then by one or more digits. We want to find lines that
match the entire expression, but we only want to extract the integer number at the end of the
line, so we surround [0-9]+ with parentheses. Note that the expression [0-9]+ is greedy: it
keeps grabbing digits until it finds any character other than a digit, so it can match a very
large number.

 Consider another example – we may be interested in knowing the time of day of each email.
The file may have lines like –
From [email protected] Sat Jan 5 09:14:16 2008
Here, we would like to extract only the hour, 09. That is, we would like only the two digits
representing the hour. This can be done with the following code:

line = "From [email protected] Sat Jan 5 09:14:16 2008"
x = re.findall('^From .* ([0-9][0-9]):', line)
if len(x) > 0:
    print(x)

Output:
['09']

Escape character
Character like dot, plus, question mark, asterisk, dollar etc. are meta characters in regular
expressions. Sometimes, we need these characters themselves as a part of matching string.
Then, we need to escape them using a backslash. For example,
import re

x = 'We just received $10.00 for cookies.'
y = re.search('\$[0-9.]+', x)
print("matched string:", y.group())

Output:
matched string: $10.00

Here, we want to extract only the price, $10.00. As the $ symbol is a metacharacter, we need
to use \ before it, so that $ is treated as part of the matching string rather than as a
metacharacter.

Question Bank

1 Explain merge methods with examples demonstrating the following joins:
  i) Outer ii) Left iii) Right
2 Discuss various techniques for stripping out extraneous information in a dataset.
3 What is data normalization? Explain with an example.
4 Illustrate with examples how to handle missing data while reading a CSV file.
5 Describe reshaping with hierarchical indexing with suitable examples.
6 Write a short note on string manipulation.
7 Write a short note on the pivoting mechanism.
8 List and describe the different functions used for pattern matching in the re module, with
  examples.
9 Discuss the data transformation mechanisms with examples.
10 Briefly discuss discretization and binning.
11 Implement a Python program to demonstrate
   (i) Importing datasets
   (ii) Cleaning the data
   (iii) DataFrame manipulation using NumPy
12 What is the need of regular expressions in programming? Explain.
13 Discuss any 5 meta characters used in regular expressions with suitable examples.
14 Discuss the match(), search() and findall() functions of the re module.
15 What is the need of escape characters in regular expressions? Give a suitable code snippet.
16 Write a Python program to search for lines that start with the word 'From' and a character,
   followed by a two digit number between 00 and 99, followed by ':'. Print the number if it is
   greater than zero. Assume any input file.
17 How to extract a substring from the selected lines of a file?
18 Explain the intention/meaning of the following regular expressions:
   1. ^From .* ([0-9][0-9]):
   2. ^Details:.*rev=([0-9.]+)
   3. ^X\S*: ([0-9.]+)

