
Module 3
Data Pre-processing and Data Wrangling

Loading from CSV files


pandas features a number of functions for reading tabular data as a DataFrame object. These
functions convert text data into a DataFrame, and their options fall into a few categories:

• Indexing: can treat one or more columns as the index of the returned DataFrame, and whether to
get column names from the file, from the user, or not at all.

• Type inference and data conversion: this includes user-defined value conversions and custom
lists of missing value markers.

• Datetime parsing: includes combining capability, such as combining date and time information spread over
multiple columns into a single column in the result.

• Iterating: support for iterating over chunks of very large files.

• Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data
with thousands separated by commas.

read_csv and read_table are the most used functions.
Before using any methods in the pandas library, import the library with the following statement:

import pandas as pd

Let’s start with a small comma-separated (CSV) text file, ex1.csv:

df = pd.read_csv('ex1.csv')
pd.read_table('ex1.csv', sep=',')

Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If a file contains
any other delimiter, then read_table can be used by specifying that delimiter.

pandas allows you to assign column names by specifying the names argument.

Suppose we wanted the message column to be the index of the returned DataFrame. We can either
indicate we want the column at index 4 or the one named 'message' using the index_col argument:
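A minimal sketch of both options, assuming a hypothetical ex1.csv with no header row whose last
column holds messages:

import pandas as pd

# hypothetical contents of ex1.csv:
# 1,2,3,4,hello
# 5,6,7,8,world
names = ['a', 'b', 'c', 'd', 'message']
df = pd.read_csv('ex1.csv', names=names)                       # user-supplied column names
df = pd.read_csv('ex1.csv', names=names, index_col='message')  # 'message' becomes the index
df = pd.read_csv('ex1.csv', names=names, index_col=4)          # same, by column position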

To form a hierarchical index from multiple columns, just pass a list of column numbers or names.
The parser functions have many additional arguments to help you handle the wide variety of
exception file formats that occur. For example, you can skip the first, third, and fourth rows of a
file with skiprows.
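A short sketch of both, using hypothetical files csv_mindex.csv and ex4.csv:

# hierarchical index built from the key1 and key2 columns
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])

# skip the first, third, and fourth rows (row numbers 0, 2 and 3)
result = pd.read_csv('ex4.csv', skiprows=[0, 2, 3])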

Handling missing values


 Handling missing values is an important and frequently nuanced part of the file parsing
process. Missing data is usually either not present (empty string) or marked by some sentinel
value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND,
and NULL:


 The na_values option can take either a list or a set of strings to consider as missing values.

 Different NA sentinels can be specified for each column in a dict:
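A sketch of both forms, assuming a hypothetical ex5.csv with columns named something and message:

result = pd.read_csv('ex5.csv', na_values=['NULL'])     # extra strings treated as NA

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}   # per-column sentinels
result = pd.read_csv('ex5.csv', na_values=sentinels)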


Accessing SQL databases


 A database is a file that is organized for storing data. Most databases are organized like a
dictionary in the sense that they map from keys to values. The biggest difference is that
the database is on disk (or other permanent storage), so it persists after the program
ends. Because a database is stored on permanent storage, it can store far more data
than a dictionary, which is limited to the size of the memory in the computer.
 Like a dictionary, database software is designed to keep the inserting and accessing of
data very fast, even for large amounts of data. Database software maintains its
performance by building indexes as data is added to the database to allow the computer to
jump quickly to a particular entry.

 There are many different database systems which are used for a wide variety of purposes
including: Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite.

 Python has built-in support for the SQLite database. SQLite is designed to be embedded into other
applications to provide database support within the application.

 While Python can work with data in SQLite database files directly, many operations can be done more
conveniently using software called the Database Browser for SQLite, which is freely
available from:
http://sqlitebrowser.org/

 Using the browser you can easily create tables, insert data, edit data, or run simple SQL
queries on the data in the database

Database concepts

 At first look, a database seems to be a spreadsheet consisting of multiple sheets. The
primary data structures in a database are tables, rows and columns.

 In relational database terminology, tables, rows and columns are referred to as relation,
tuple and attribute respectively. The typical structure of a database table is shown below in
Table 3.1.

 Each table may consist of n number of attributes and m number of tuples (or records).

 Every tuple gives the information about one individual. Every cell (i, j) in the table indicates the
value of the jth attribute for the ith tuple.

Table 3.1: Typical Relational database table

Consider the problem of storing details of students in a database table. The format may look
like –

            Roll No   Name     DOB          Marks
Student 1   1         Akshay   22/10/2001   82.5
Student 2   2         Arun     20/12/2000   81.3
...         ...       ...      ...          ...
Student m   ...       ...      ...          ...
Thus, table columns indicate the type of information to be stored, and table rows give the
record pertaining to each student. We can create one more table, say department, consisting
of attributes like dept_id, homephno and City. This table can be related to the student table
through the Rollno stored in student and the dept_id stored in department. Thus, there is a
relationship between two tables in a single database. Software that can maintain proper
relationships between multiple tables in a single database is known as a Relational Database
Management System (RDBMS).

Creating a database table


The code to create a database named music.db and a table named Tracks with two columns in
the database is as follows:

import sqlite3
conn = sqlite3.connect('music.db')
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
conn.close()

 The connect operation makes a “connection” to the database stored in the file music.db in
the current directory. If the file does not exist, it will be created.

 A cursor is like a file handle that we can use to perform operations on the data stored in the
database. Calling cursor() is very similar conceptually to calling open() when dealing with
text files.

 Once we have the cursor, we can begin to execute commands on the contents of the
database using the execute() method, as shown in the figure below.
Figure: Database Cursor

cur.execute("INSERT INTO Tracks (title, plays) VALUES ('My Way', 15)")


This command inserts one record into the table Tracks where the values of the attributes title and plays
are 'My Way' and 15 respectively.

cur.execute('SELECT * FROM Tracks')

Retrieves all the records from the table Tracks.

cur.execute("SELECT * FROM Tracks WHERE title = 'My Way'")

Retrieves the records from the table Tracks having the value of attribute title as 'My Way'.

cur.execute("UPDATE Tracks SET plays = 16 WHERE title = 'My Way'")

Whenever we would like to modify the value of any particular attribute in the table, we can use the UPDATE
command. Here, the value of attribute plays is assigned a new value for the record having the value of
title as 'My Way'.

cur.execute("DELETE FROM Tracks WHERE title = 'My Way'")

A particular record can be deleted from the table using the DELETE command. Here, the record with the
value of attribute title as 'My Way' is deleted from the table Tracks.

cur.execute('DROP TABLE IF EXISTS Tracks')

This command deletes the entire table, including all of its contents.

Example 1: Write a Python program to create a student table in a college database (with attributes
Name, USN, Marks). Perform operations like insert, delete and retrieve records from the student
table.


import sqlite3
conn = sqlite3.connect('college.db')
cur = conn.cursor()
print("Opened database successfully")
cur.execute('CREATE TABLE student (name TEXT, usn TEXT, marks INTEGER)')
print("Table created successfully")
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?,?,?)', ('akshay', '1rn16mca16', 30))
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?,?,?)', ('arun', '1rn16mca17', 65))
print('student')
cur.execute('SELECT name, usn, marks FROM student')
for row in cur:
    print(row)
cur.execute('DELETE FROM student WHERE marks < 40')
cur.execute('SELECT name, usn, marks FROM student')
conn.commit()
cur.close()
Output:
Opened database successfully
Table created successfully
student
('akshay', '1rn16mca16', 30)
('arun', '1rn16mca17', 65)
Example 2: Write a Python program to create a database file (music.sqlite) and a table named Tracks
with two columns, title and plays. Also insert, display and delete the contents of the table.

import sqlite3
conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
cur.execute("INSERT INTO Tracks (title, plays) VALUES ('Thunderstruck', 200)")
cur.execute("INSERT INTO Tracks (title, plays) VALUES (?, ?)", ('My Way', 15))
conn.commit()
print('Tracks:')
cur.execute('DELETE FROM Tracks WHERE plays < 100')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
    print(row)
cur.close()

Output
Tracks:
('Thunderstruck', 200)

Cleansing Data with Python: Stripping out extraneous information


Extraneous information refers to irrelevant or unnecessary data that can clutter a dataset and
make it difficult to analyze. This could include duplicate entries, empty fields, or irrelevant
columns. Stripping out this information involves removing it from the dataset, resulting in a
more concise and manageable dataset.

To strip out extraneous information in a Pandas DataFrame, you can use various methods and
functions provided by the library. Some commonly used methods include:

 dropna(): This method removes rows with missing values (NaN or None) from the DataFrame.
You can specify the axis (0 for rows and 1 for columns) along which the rows or columns with
missing values should be dropped.
Example:
df = df.dropna()
# This will remove all rows that contain at least one missing value.

 drop(): The drop() method in Pandas is used to remove rows or columns from a DataFrame. It can be
used to drop a single column or multiple columns at once.
df.drop(columns, axis=1, inplace=False)
Ex:
cars2 = cars_data.drop(['Doors', 'Weight'], axis='columns')

 drop_duplicates(): This method removes duplicate rows. You can specify the columns
based on which the duplicates should be checked.
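A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Name': ['Anitha', 'Anitha', 'Barathi'],
                   'Marks': [82, 82, 75]})
df = df.drop_duplicates()                  # drops the exact-duplicate second row
df = df.drop_duplicates(subset=['Name'])   # duplicates judged on 'Name' alone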

 loc[ ] and iloc[ ]: These indexing methods allow you to select specific rows and columns from the
DataFrame. They are used to select only the relevant data and exclude the unwanted information,
as sketched below.
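A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Name': ['Anitha', 'Barathi', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

subset = df.loc[0:1, ['Name', 'Age']]   # loc selects by label: rows 0-1, two columns
subset2 = df.iloc[0:2, 0:2]             # iloc selects by integer position: same rows/columns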

 Filtering: conditional statements can be used to filter the DataFrame and select only the rows that
meet certain criteria. This allows you to remove unwanted data based on specific conditions.
Example:

import pandas as pd

data = {'Name': ['Anitha', 'Barathi', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Example of filtering based on a condition
filtered_df = df[df['Age'] > 30]

# Display the filtered DataFrame
print(filtered_df)

Normalizing Data and Formatting Data

Normalizing Data:

Data normalization is the process of transforming data into a consistent format to facilitate
comparison and analysis. This may involve converting data to a common unit of measurement,
formatting dates and times consistently, or standardizing data formats. Normalization ensures that
data is comparable and can be easily processed and analysed.

Normalization is a crucial step in data preprocessing for machine learning tasks. It involves
transforming numerical features to have a mean of 0 and a standard deviation of 1. This process
ensures that all features are on the same scale, enabling efficient and accurate learning by machine
learning algorithms.

In Python, several libraries provide functions for data normalization.

Method 1: Using sklearn


The sklearn method is a very famous method to normalize the data.


We import all the required libraries, NumPy and sklearn. Note that we import preprocessing from
sklearn itself; that is why this is called the sklearn normalization method. We create a NumPy array
with some integer values that are not all the same, then call the normalize method from preprocessing
and pass the array we just created as a parameter. We can see from the results that all integer data
are now normalized between 0 and 1.
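A minimal sketch of what this describes, with made-up integer data:

import numpy as np
from sklearn import preprocessing

numpy_array = np.array([[2.0, 8.0, 4.0],
                        [6.0, 3.0, 9.0]])
normalized = preprocessing.normalize(numpy_array)   # unit-norm scaling, row-wise by default
print(normalized)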

Method 2: Normalize a particular column in a dataset using sklearn

We can also normalize a particular dataset column. In this method, we discuss how.

We import the libraries pandas and sklearn. We create a dummy CSV file and load it with the help of
the pandas read_csv function, then print the CSV file which we just loaded. We read the particular
column of the CSV file using np.array and store the result in value_array. We call the normalize
method from preprocessing and pass value_array as the parameter.
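A sketch, assuming a hypothetical dummy CSV file data.csv with a numeric column named Marks
(both names are made up):

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('data.csv')
print(df)
value_array = np.array(df['Marks']).reshape(1, -1)   # normalize expects a 2-D array
normalized = preprocessing.normalize(value_array)
print(normalized)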

Method 3: Normalize the whole dataset without converting columns to arrays (using sklearn)

In the previous method 2, we discussed how to normalize a particular CSV file column. But sometimes
we need to normalize the whole dataset; then we can use the method below, where we normalize the
whole dataset column-wise (axis = 0). If we mention axis = 1, it will normalize row-wise. axis = 1 is
the default value.
Now, we pass the whole CSV file along with one extra parameter, axis=0, which tells the
library that the user wants to normalize the whole dataset column-wise.
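A sketch, again assuming a hypothetical all-numeric data.csv:

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('data.csv')
normalized = preprocessing.normalize(df, axis=0)   # column-wise normalization
print(normalized)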

Method 4: Using MinMaxScaler()

sklearn also provides another method of normalization, called MinMaxScaler. This
is also a very popular method because it is easy to use.


We call MinMaxScaler from the preprocessing module and create an object (min_max_scaler) for it.
We do not pass any parameters because we need to normalize the data between 0 and 1. But if you
want, you can pass your own values, which will be seen in the next method.
We first read all the names of the columns for further use in displaying results. Then we call
fit_transform on the created min_max_scaler object and pass the CSV file into it. We get the
normalized results, which lie between 0 and 1.
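A sketch, assuming a hypothetical all-numeric data.csv:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('data.csv')
columns = df.columns                       # keep the column names for the result
min_max_scaler = MinMaxScaler()            # default feature range is 0 to 1
scaled = min_max_scaler.fit_transform(df)
print(pd.DataFrame(scaled, columns=columns))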

Method 5: Using MinMaxScaler(feature_range=(x, y))

sklearn also provides the option to change the normalized range to whatever you want. By default,
it normalizes the values between 0 and 1. But there is a parameter called feature_range which can
set the normalized range according to our requirements.

Here, we call MinMaxScaler from the preprocessing module and create an object (min_max_scaler)
for it. But we also pass another parameter, feature_range, to MinMaxScaler. We set that parameter
to the range 0 to 2, so now MinMaxScaler will normalize the data values between 0 and 2. We first
read all the names of the columns for further use in displaying results. Then we call fit_transform
on the created min_max_scaler object and pass the CSV file into it. We get the normalized results,
which lie between 0 and 2.
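The same sketch with a custom range:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('data.csv')                          # hypothetical all-numeric CSV
min_max_scaler = MinMaxScaler(feature_range=(0, 2))   # normalize between 0 and 2
scaled = min_max_scaler.fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns))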


Method 6: Using the maximum absolute scaling

We can also normalize the data using pandas; this approach is also very popular. The
maximum absolute scaling normalizes values between 0 and 1, applying .max() and .abs() as
shown below:
We take each column and then divide its values by the column's .abs().max(). We print the
result, and from the result we confirm that our data is normalized between 0 and 1.
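A pandas-only sketch (hypothetical all-numeric data.csv):

import pandas as pd

df = pd.read_csv('data.csv')
for column in df.columns:
    df[column] = df[column] / df[column].abs().max()
print(df)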

Method 7: Using the z-score method

The next method which we are going to discuss is the z-score method. This method converts the
data to a standard distribution: it calculates the mean of each column, subtracts it from the
column and, at last, divides by the standard deviation. This leaves each column with a mean of 0
and a standard deviation of 1.
We calculate each column’s mean and subtract it from the column. Then we divide the column
values by the standard deviation, and print the standardized data.
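A pandas-only sketch (hypothetical all-numeric data.csv):

import pandas as pd

df = pd.read_csv('data.csv')
for column in df.columns:
    df[column] = (df[column] - df[column].mean()) / df[column].std()
print(df)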

One popular library is Scikit-Learn, which offers the StandardScaler class for normalization. Here's
an example of how to use StandardScaler to normalize a dataset:
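A minimal sketch with made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [30.0, 600.0]])
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
print(standardized)   # each column now has mean 0 and standard deviation 1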

Formatting Data:

 Formatting data in Pandas involves transforming and presenting data in a structured and
readable manner. Pandas, a popular Python library for data analysis, offers various methods
and techniques to format data effectively.

 One of the key features of Pandas is its ability to handle different data types and structures. It
provides specific formatting options for each data type, ensuring that data is displayed in a
consistent and meaningful way. For example, numeric data can be formatted with a specific
number of decimal places, currency symbols, or percentage signs. Date and time data can be
formatted in various formats, such as "dd/mm/yyyy" or "hh:mm:ss".

 Pandas also allows users to align data within columns, making it easier to read and compare
values. This can be achieved using the "justify" parameter, which takes values such as "left",
"right", or "center". Additionally, Pandas provides options to control the width of columns,
ensuring that data is presented in a visually appealing manner.

 Furthermore, Pandas offers methods to format entire dataframes, applying consistent formatting
rules to all columns. This can be done using the "style" attribute, which allows users to specify
formatting options for different aspects of the dataframe, such as font, background color, and
borders.

 By leveraging the formatting capabilities of Pandas, users can effectively communicate insights
and patterns in their data, making it easier to analyze and interpret. Overall, formatting data in
Pandas is a crucial skill for data analysts and scientists to present their findings in a clear and
professional manner.
Ex 1 : Formatting Numeric Data
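A minimal sketch (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({'Price': [49.5, 129.999, 8.25],
                   'Growth': [0.051, 0.123, 0.087]})
print(df['Price'].map('${:,.2f}'.format))   # currency with two decimal places
print(df['Growth'].map('{:.1%}'.format))    # formatted as percentages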

Ex 2: Formatting Date and Time Data
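A minimal sketch (the dates are made up):

import pandas as pd

df = pd.DataFrame({'JoinDate': pd.to_datetime(['2023-01-15 09:30:00',
                                               '2023-06-30 17:05:45'])})
print(df['JoinDate'].dt.strftime('%d/%m/%Y'))   # dd/mm/yyyy
print(df['JoinDate'].dt.strftime('%H:%M:%S'))   # hh:mm:ss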
Ex 3: Aligning Data in Columns
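A minimal sketch using to_string with the justify parameter mentioned above:

import pandas as pd

df = pd.DataFrame({'Name': ['Anitha', 'Barathi', 'Charlie'],
                   'Marks': [82.5, 81.3, 76.0]})
print(df.to_string(justify='center', index=False))   # centre the column headers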


Combining and merging data sets
Data contained in pandas objects can be combined together in a number of built-in ways:
• pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to
users of SQL or other relational databases, as it implements database join operations.

• pandas.concat glues or stacks together objects along an axis.

• combine_first instance method enables splicing together overlapping data to fill in missing values in
one object with values from another.

Database-style DataFrame Merges


Merge or join operations combine data sets by linking rows using one or more keys. The merge
function in pandas is used to combine datasets.

Let’s start with a simple example:

import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df1
df2

 The below example shows a many-to-one merge situation; the data in df1 has multiple rows
labeled a and b, whereas df2 has only one row for each value in the key column.

pd.merge(df1, df2, on='key')       # performs an inner join (the default)
pd.merge(df1, df2, how='outer')    # performs an outer join



Observe that the 'c' and 'd' values and associated data are missing from the result. By default
merge does an 'inner' join; the keys in the result are the intersection. The outer join takes the
union of the keys, combining the effect of applying both left and right joins.

 The below examples shows Many-to-many merges:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],'data2': range(5)})

df1
df2

Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in the left
DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method only affects the
distinct key values appearing in the result:

pd.merge(df1, df2, how='inner')
pd.merge(df1, df2, on='key', how='left')

Merging on Index
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you
can pass left_index=True or right_index=True (or both) to indicate that the index should be used
as the merge key.
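A small sketch with made-up frames:

import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# right_index=True tells merge to use right1's index as its join key
pd.merge(left1, right1, left_on='key', right_index=True)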
Concatenating Along an Axis
Another kind of data combination operation is alternatively referred to as concatenation, binding, or
stacking. NumPy has a concatenate function for doing this with raw NumPy arrays:

By default concat works along axis=0, producing another Series. If you pass axis=1, the result
will instead be a DataFrame (axis=1 is the columns):
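A short sketch of both the NumPy and pandas forms:

import numpy as np
import pandas as pd

arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=1)    # raw NumPy concatenation

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
pd.concat([s1, s2])                   # along axis=0: produces another Series
pd.concat([s1, s2], axis=1)           # along axis=1: produces a DataFrame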

There are a number of fundamental operations for rearranging tabular data. These are alternatively
referred to as reshape or pivot operations.

Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two
primary actions:
• stack: this changes from the columns in the data to the rows.
• unstack: this changes from the rows into the columns.

 Using the stack method on this data pivots the columns into the rows, producing a
Series.
From a hierarchically-indexed Series, we can rearrange the data back into a DataFrame with
unstack.
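A minimal sketch with made-up data:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()     # pivots the columns into the rows, producing a Series
result.unstack()          # rearranges back into a DataFrame
result.unstack('state')   # unstack a different level, by name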

 By default the innermost level is unstacked (same with stack). You can unstack a different level
by passing a level number or name:

 Unstacking might introduce missing data if all of the values in the level aren’t found in eachof the
subgroups:
Pivoting “long” to “wide” Format
 A common way to store multiple time series in databases and CSV files is in so-called long or
stacked format:

 Data is frequently stored this way in relational databases like MySQL as a fixed schema allows the
number of distinct values in the item column to increase or decrease as data is added or deleted in
the table.

 The data may not be easy to work with in long format; it is preferred to have a DataFrame
containing one column per distinct item value indexed by timestamps in the date column.
DataFrame’s pivot method performs exactly this transformation.

The pivot() function is used to reshape a given DataFrame organized by given index / column
values. This function does not support data aggregation; multiple values will result in a MultiIndex
in the columns.

Syntax:
DataFrame.pivot(index=None, columns=None, values=None)
Example:
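A minimal sketch with hypothetical long-format data:

import pandas as pd

ldata = pd.DataFrame({'date': ['2000-03-31', '2000-03-31', '2000-06-30', '2000-06-30'],
                      'item': ['realgdp', 'infl', 'realgdp', 'infl'],
                      'value': [2710.349, 0.0, 2778.801, 2.34]})
pivoted = ldata.pivot(index='date', columns='item', values='value')
print(pivoted)   # one column per distinct item value, indexed by date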


Suppose you had two value columns that you wanted to reshape simultaneously.
 By omitting the values argument, you obtain a DataFrame with hierarchical columns, as sketched
below:
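Continuing the sketch above with a second, made-up value column:

ldata['value2'] = [1.1, 1.2, 1.3, 1.4]
pivoted = ldata.pivot(index='date', columns='item')   # values argument omitted
print(pivoted)   # hierarchical columns: (value, item) and (value2, item)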


Data transformation

Data transformation is the process of converting raw data into a format that is suitable for
analysis and modeling. It's an essential step in data science and analytics workflows, helping to
unlock valuable insights and make informed decisions.
A few of the data transformation mechanisms are:
• Removing Duplicates
• Replacing Values
• Renaming Axis Indexes
• Discretization and Binning
• Detecting and Filtering Outliers
• Permutation and Random Sampling

i) Removing duplicates
Duplicate rows may be found in a DataFrame using the method duplicated, which returns a
boolean Series indicating whether each row is a duplicate or not. Relatedly, drop_duplicates
returns a DataFrame keeping only the rows where the duplicated array is False.

import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
data.duplicated()
data.drop_duplicates()   # rows 1, 4 and 6 are dropped

ii) Filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations.


Consider a DataFrame with some normally distributed data. (Note : while writing answers, write
your own random numbers between 0 and 1)


 Suppose we wanted to find values in one of the columns exceeding one in magnitude:
 To select all rows having a value exceeding 1 or -1, we can use the any method on a boolean
DataFrame:
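A sketch of both selections, with made-up normally distributed data:

import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

col = data[3]
col[np.abs(col) > 1]                    # values in column 3 exceeding 1 in magnitude

data[(np.abs(data) > 1).any(axis=1)]    # rows where any value exceeds 1 or -1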

iii) Replacing Values

 Sometimes it is necessary to replace specific sentinel values with other values or with NaN.
This can be done using the replace method. Let’s consider this Series:

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

 The -999 values might be sentinel values for missing data. To replace these with NA values that
pandas understands, we can use replace, producing a new Series:

data.replace(-999, np.nan)
 If we want to replace multiple values at once, you instead pass a list and then the substitute
value:
data.replace([-999, -1000], np.nan)

 To use a different replacement for each value, pass a list of substitutes, or a dict:
data.replace([-999, -1000], [np.nan, 0])
data.replace({-999: np.nan, -1000: 0})


iv. Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to
produce new, differently labeled objects. The axes can also be modified in place without creating a new data
structure.

 We can assign to index, modifying the DataFrame in place:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data.index = data.index.map(str.upper)
data

 To create a transformed version of a data set without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)


 rename can be used in conjunction with a dict-like object providing new values for a subset of
the axis labels:
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
v. Discretization and binning

 Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose we
have data about a group of people in a study, and we want to group them into discrete age
buckets:

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do so,
we have to use cut, a function in pandas:

import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

 The object pandas returns is a special Categorical object. We can treat it like an array of strings
indicating the bin name; internally it contains a categories array indicating the distinct category
names along with a labeling for the ages data in the codes attribute (older pandas versions called
these levels and labels):

cats.codes

cats.categories
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)
pd.value_counts(cats)

Consistent with mathematical notation for intervals, a parenthesis means that the side is open
while the square bracket means it is closed (inclusive).

vi. Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the
numpy.random.permutation function. Calling permutation with the length of the axis you want to
permute produces an array of integers indicating the new ordering:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
df
sampler = np.random.permutation(5)
sampler            # e.g. array([1, 0, 2, 3, 4])

That array can then be used in iloc-based indexing or with the take function:
df.take(sampler)

vii. Computing Indicator/Dummy Variables


Another type of transformation for statistical modeling or machine learning applications is
converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a
DataFrame has k distinct values, you would derive a matrix or DataFrame containing k
columns of 1’s and 0’s. pandas has a get_dummies function for doing this, though
devising one yourself is not difficult. Let’s return to an earlier example DataFrame:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

pd.get_dummies(df['key'])

In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which
can then be merged with the other data. get_dummies has a prefix argument for doing just
this:

dummies = pd.get_dummies(df['key'], prefix='key')

df_with_dummy = df[['data1']].join(dummies)

df_with_dummy

String Manipulation
Python has long been a popular data munging language in part due to its ease-of-use for string
and text processing. Most text operations are made simple with the string object’s built in
methods. For more complex pattern matching and text manipulations, regular expressions may
be needed. pandas adds to the mix by enabling you to apply string and regular expressions
concisely on whole arrays of data, additionally handling the annoyance of missing data.

String Object Methods


 In many string munging and scripting applications, built-in string methods are sufficient. Examples:

val = 'a,b, guido'

# A comma-separated string can be broken into pieces with split:
val.split(',')                            # ['a', 'b', ' guido']

# split is often combined with strip to trim whitespace (including newlines):
pieces = [x.strip() for x in val.split(',')]
pieces                                    # ['a', 'b', 'guido']

# The substrings could be concatenated together with a two-colon delimiter using addition:
first, second, third = pieces
first + '::' + second + '::' + third      # 'a::b::guido'

# A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':
'::'.join(pieces)                         # 'a::b::guido'

Note: refer to module 2 for the remaining string methods.

 A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

 RegEx can be used to check if a string contains the specified search pattern.

 Python has a built-in package called re, which can be used to work with Regular Expressions.

RegEx Functions
The re module offers a set of functions that allow us to search a string for a match. Using
these functions, we can search for a required pattern. They are as follows:

 match(): re.match() determines if the RE matches at the beginning of the string. The method
returns a match object if the search is successful. If not, it returns None.

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output:
Search successful.

 search(): The search( ) function searches the string for a match, and returns a Match object if
there is a match. If there is more than one match found, only the first occurrence of the
match will be returned.

import re

pattern = 'Tutorials'
line = 'Python Tutorials'
result = re.search(pattern, line)
print(result)
print(result.group())   # group() returns the matched string

Output:
<re.Match object; span=(7, 16), match='Tutorials'>
Tutorials

 findall(): Finds all substrings where the RE matches, and returns them as a list. It scans the
whole string and returns all occurrences of the pattern. While searching for a pattern, it is often
convenient to use re.findall(), since it covers the use cases of both re.search() and re.match().

import re

str = "The rain in Spain"
x = re.findall("ai", str)
print(x)

Output:
['ai', 'ai']

Character matching in regular expressions


 Python provides a list of meta-characters to match search strings.

 Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's
a list of metacharacters:

[] "[a-m]"
A set of characters "[^a-m]"

[^…]
Matches any single character NOT in
brackets
characters)
\
Signals a special sequence (can "\d"
also be used to escape special

. "he..o" ^ "^hello" $
Any character (except newline character) Starts with

"world$"
Ends with

* "aix*" + "aix+"
Zero or more occurrences One or more occurrences "al{2}"

{}
Exactly the specified number of occurrences
| "falls|stays"
Either or
() Capture and group

When parentheses are added to a


regular expression, they are ignored for
the purpose of matching, but allow you to
extract a particular subset of the
matched string rather than the whole
string when using findall()

Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a
special meaning:
"\d"
\d Returns a match where the string contains
digits (numbers from 0-9)
"\D"
\D Returns a match where the string DOES NOT
contain digits
"\s"
\s Returns a match where the string contains a
white space character
"\S"
\S Returns a match where the string DOES NOT
contain a white space character

"\w"
\w Returns a match where the string contains any word characters (characters from
a to Z,
digits from 0-9, and the underscore _
character)
"\W"
\W Returns a match where the string DOES NOT
contain any word characters
"\AThe"
\A Returns a match if the specified characters
are at the beginning of the string
r"\bain"
\b Returns a match where the specified
r"ain\b"
characters are at the beginning or at the end
of a word
r"\Bain"
\B Returns a match where the specified
r"ain\B"
characters are present, but NOT at the beginning (or at the end) of a word
\Z Returns a match if the specified characters "Spain\Z"
are at the end of the string

Few examples on set of characters for pattern matching are as follows:

Set         Description                                   Examples

[arn]       Returns a match where one of the specified    >>> str = "The rain in Spain"
            characters (a, r, or n) is present            >>> re.findall("[arn]", str)
                                                          ['r', 'a', 'n', 'n', 'a', 'n']

[a-n]       Returns a match for any lower case            >>> str = "The rain in Spain"
            character, alphabetically between a and n     >>> re.findall("[a-n]", str)
                                                          ['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']

[^arn]      Returns a match for any character             >>> str = "The rain in Spain"
            EXCEPT a, r, and n                            >>> re.findall("[^arn]", str)
                                                          ['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']

[0123]      Returns a match where any of the specified    >>> str = "The rain in Spain"
            digits (0, 1, 2, or 3) are present            >>> re.findall("[0123]", str)
                                                          []

[0-9]       Returns a match for any digit                 >>> str = "8 times before 11:45 AM"
            between 0 and 9                               >>> re.findall("[0-9]", str)
                                                          ['8', '1', '1', '4', '5']

[0-5][0-9]  Returns a match for any two-digit number      >>> str = "8 times before 11:45 AM"
            from 00 to 59                                 >>> re.findall("[0-5][0-9]", str)
                                                          ['11', '45']

[a-zA-Z]    Returns a match for any character             >>> str = "8 times before 11:45 AM"
            alphabetically between a and z,               >>> re.findall("[a-zA-Z]", str)
            lower case OR upper case                      ['t', 'i', 'm', 'e', 's', 'b', 'e',
                                                          'f', 'o', 'r', 'e', 'A', 'M']

[+]         In sets, +, *, ., |, (), $, {} have no        >>> str = "8 times before 11:45 AM"
            special meaning, so [+] returns a match       >>> re.findall("[+]", str)
            for any + character in the string             []
Few more examples for searching patterns in files:
Let us consider a text file, pattern.txt:

# pattern.txt
From: Bengaluru^560098
From:<[email protected]>
ravi
rohan
Mysore^56788
From:Karnataka
From: <[email protected]>

Ex:1 Search for lines that start with 'F', followed by 2 characters, followed by 'm:'

import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)

Output:
From: Bengaluru^560098
From:<[email protected]>
From: <[email protected]>

The regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:”
since the period characters in the regular expression match any character.

Ex:2 Search for lines that start with From and have an at sign

import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)

Output:
From:<[email protected]>
From: <[email protected]>

The search string ^From:.+@ will successfully match lines that start with “From:”, followed by
one or more characters (.+), followed by an at-sign.

Extracting data using regular expressions


If we want to extract data from a string in Python we can use the findall() method to
extract all of the substrings which match a regular expression.

Ex:1 Extract anything that looks like an email address from the line.

import re

s = 'A message from [email protected] to [email protected] about meeting @2PM'
lst = re.findall('\S+@\S+', s)
print(lst)

Output: ['[email protected]', '[email protected]']

In the above example:
— The findall() method searches the string in the second argument and returns a list of all of the
strings that look like email addresses.
— Translating the regular expression, we are looking for substrings that have at least one
non-whitespace character, followed by an at-sign, followed by at least one more non-whitespace
character.
— The “\S+” matches as many non-whitespace characters as possible.

Ex:2 Search for lines that have an at sign between characters

import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('\S+@\S+', line)
    if len(x) > 0:
        print(x)

Output:
['From:<[email protected]>']
['<[email protected]>']

We read each line and then extract all the substrings that match our regular expression.
Since findall() returns a list, we simply check if the number of elements in our returned list is
more than zero to print only lines where we found at least one substring that looks like an
email address.
Observe the above output: the email addresses have extra characters like “<” or “>” at the
beginning or end. To eliminate those characters, refer to the example program below.

Ex:3 Search for lines that have an at sign between characters. The characters must be a
letter or number.

import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)

Output:
['From:[email protected]']
['[email protected]']

Here, we are looking for substrings that start with a single lowercase letter, uppercase letter,
or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters (\S*), followed by an
at-sign, followed by zero or more non-blank characters (\S*), followed by an uppercase or
lowercase letter. Note that we switched from + to * to indicate zero or
more non-blank characters since [a-zA-Z0-9] is already one non-blank character. Remember
that the * or + applies to the single character immediately to the left of the plus or asterisk.

Combining searching and extracting
• Sometimes we may want to extract the lines from the file that match with specific pattern, let
say
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

We can use the following regular expression to select the lines: ^X-.*: [0-9.]+
Let us consider a sample file called file.txt
file.txt

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961

Ex:1 Search for lines that start with 'X' followed by any non whitespace characters and
':' followed by a space and any number. The number can include a decimal.

import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.search('^X-.*: ([0-9.]+)', line)
    if x:
        print(x.group())

Output:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961

Here, it selects the lines that
— start with X-,
— followed by zero or more characters (.*),
— followed by a colon (:) and then a space.
— After the space we are looking for one or more characters that are either a digit (0- 9) or
a period [0-9.]+.
— Note that inside the square brackets, the period matches an actual period (i.e., it is
not a meta character between the square brackets).

• But what if we want only the numbers in the above output? We could use the split() function on
the extracted string. However, it is better to refine the regular expression. To do so, we need
the help of parentheses.

When we add parentheses to a regular expression, they are ignored when matching the
string (with search()). But when we are using findall(), parentheses indicate that while we
want the whole expression to match, we are only interested in extracting a portion of the
substring that matches the regular expression.

Ex:2 Search for lines that start with 'X' followed by any non whitespace characters and ':'
followed by a space and any number. The number can include a decimal. Then print the
number if it is greater than zero.
import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^X.*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

Output:
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']

 Let us consider another example; assume that the file contains lines of the form:

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

If we wanted to extract all of the revision numbers (the integer number at the end of these
lines) using the same technique as above, we could write the following program:
Ex:3 Search for lines that start with 'Details: rev=' followed by numbers and '.' Then print the
number if it is greater than zero.
import re

str = "Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772"
x = re.findall('^Details:.*rev=([0-9]+)', str)
if len(x) > 0:
    print(x)
Output:
['39772']

In the above example, we are looking for lines that start with Details:, followed by any number
of characters (.*), followed by rev=, and then by one or more digits. We want to find lines that
match the entire expression, but we only want to extract the integer number at the end of the
line, so we surround [0-9]+ with parentheses. Note that the expression [0-9]+ is greedy: it
keeps grabbing digits until it finds any character other than a digit, so it can match a very
large number.

 Consider another example – we may be interested in knowing the time of day of each email.
The file may have lines like –
From [email protected] Sat Jan 5 09:14:16 2008
Here, we would like to extract only the hour, 09. That is, we would like only the two digits
representing the hour. This can be done with the following code:

line = "From [email protected] Sat Jan 5 09:14:16 2008"
x = re.findall('^From .* ([0-9][0-9]):', line)
if len(x) > 0:
    print(x)

Output:
['09']

Escape character
Character like dot, plus, question mark, asterisk, dollar etc. are meta characters in regular
expressions. Sometimes, we need these characters themselves as a part of matching string.
Then, we need to escape them using a backslash. For example,
import re

x = 'We just received $10.00 for cookies.'
y = re.search('\$[0-9.]+', x)
print("matched string:", y.group())

Output:
matched string: $10.00

Here, we want to extract only the price, $10.00. As the $ symbol is a metacharacter, we need
to use \ before it, so that $ is treated as part of the matching string rather than as a
metacharacter.

Question Bank

1 Explain merge methods with examples demonstrating the following joins:
  i) Outer ii) Left iii) Right
2 Discuss various techniques for stripping out extraneous information in a dataset.
3 What is data normalization? Explain with an example.
4 Illustrate with examples how to handle missing data while reading a CSV file.
5 Describe reshaping with hierarchical indexing with suitable examples.
6 Write a short note on string manipulation.
7 Write a short note on the pivoting mechanism.
8 List and describe the different functions used for pattern matching in the re module, with
  examples.
9 Discuss the data transformation mechanisms with examples.
10 Briefly discuss discretization and binning.
11 Implement a Python program to demonstrate
   (i) Importing datasets
   (ii) Cleaning the data
   (iii) DataFrame manipulation using NumPy
12 What is the need of regular expressions in programming? Explain.
13 Discuss any 5 meta characters used in regular expressions with suitable examples.
14 Discuss the match(), search() and findall() functions of the re module.
15 What is the need of escape characters in regular expressions? Give a suitable code snippet.
16 Write a Python program to search for lines that start with the word 'From' and a character,
   followed by a two digit number between 00 and 99, followed by ':'. Print the number if it is
   greater than zero. Assume any input file.
17 How to extract a substring from the selected lines of a file?
18 Explain the intention/meaning of the following regular expressions:
   1. ^From .* ([0-9][0-9]):
   2. ^Details:.*rev=([0-9.]+)
   3. ^X\S*: ([0-9.]+)

