MODULE 4
These functions are meant to convert text data into a DataFrame. Their optional arguments fall into a few categories:
Indexing: can treat one or more columns as the returned DataFrame's index, and can take column names from the file, from the user, or use none at all.
Type inference and data conversion: includes user-defined value conversions and custom lists of missing-value markers.
Datetime parsing: includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
Iterating: support for iterating over chunks of very large files.
Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
Some of these functions, like pandas.read_csv, perform type inference, because the column data types
are not part of the data format. That means you don’t necessarily have to specify which columns are
numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, have the data
types stored in the format. Handling dates and other custom types can require extra effort.
A file will not always have a header row. Consider this file:
To read this file, we have a couple of options. You can allow pandas to assign default column names, or
you can specify names yourself:
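A minimal sketch of both options (the filename ex2.csv and its five-column layout are assumptions consistent with the discussion that follows):
>>> pd.read_csv('ex2.csv', header=None)                            # pandas assigns default names 0, 1, 2, ...
>>> pd.read_csv('ex2.csv', names=['a', 'b', 'c', 'd', 'message'])  # supply your own column names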
Suppose we wanted the message column to be the index of the returned DataFrame. You can either indicate that you want the column at index 4 or the column named 'message' using the index_col argument:
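For example (again assuming the ex2.csv layout above):
>>> names = ['a', 'b', 'c', 'd', 'message']
>>> pd.read_csv('ex2.csv', names=names, index_col='message')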
In the event that you want to form a hierarchical index from multiple columns, pass a list of column
numbers or names:
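A sketch, assuming a file csv_mindex.csv whose first two columns are named key1 and key2:
>>> parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])
>>> parsed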
In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to
separate fields. Consider a text file that looks like this:
Here the fields are separated by a variable amount of whitespace. In these cases, you can pass a regular expression as a delimiter for read_table. The delimiter can be expressed by the regular expression \s+, so we then have:
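For example (the filename ex3.txt is an assumption):
>>> result = pd.read_csv('ex3.txt', sep=r'\s+')
>>> result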
The parser functions have many additional arguments to help you handle the wide variety of exceptional file formats that occur.
For example, you can skip the first, third, and fourth rows of a file with skiprows:
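A sketch (the filename ex4.csv is an assumption):
>>> pd.read_csv('ex4.csv', skiprows=[0, 2, 3])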
Handling missing values is an important part of the file reading process. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:
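The na_values option can take a list of strings to treat as missing, or a dict giving different sentinels per column. A sketch (the filename and column names are assumptions):
>>> result = pd.read_csv('ex5.csv', na_values=['NULL'])
>>> sentinels = {'message': ['foo', 'NA'], 'something': ['two']}   # assumed column names
>>> pd.read_csv('ex5.csv', na_values=sentinels)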
When processing very large files, or figuring out the right set of arguments to correctly process a large file, you may want to read in only a small piece of the file or iterate through smaller chunks of it.
Before we look at a large file, we make the pandas display settings more compact:
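One way to do this, as a sketch using the display options:
>>> pd.options.display.max_rows = 10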
If you want to only read a small number of rows (avoiding reading the entire file), specify that with
nrows:
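For example (the filename ex6.csv is an assumption):
>>> pd.read_csv('ex6.csv', nrows=5)
To read the file in pieces instead, specify a chunksize as a number of rows and iterate over the result:
>>> chunker = pd.read_csv('ex6.csv', chunksize=1000)
>>> for piece in chunker:
....:     print(len(piece))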
Data can also be exported to a delimited format. Let’s consider one of the CSV files read before:
Using DataFrame’s to_csv method, we can write the data out to a comma-separated file:
Missing values appear as empty strings in the output. You might want to denote them by some other
sentinel value:
With no other options specified, both the row and column labels are written. Both of these can be
disabled:
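A sketch covering the three variants described above, writing to sys.stdout so the text result is simply printed (assuming the frame read earlier is bound to the name data):
>>> import sys
>>> data.to_csv('out.csv')                               # write to a file
>>> data.to_csv(sys.stdout, na_rep='NULL')               # denote missing values by 'NULL'
>>> data.to_csv(sys.stdout, index=False, header=False)   # drop row and column labels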
For any file with a single-character delimiter, you can use Python’s built-in csv
module. To use it, pass any open file or file-like object to csv.reader:
Iterating through the reader like a file yields lists of values with any quote characters removed:
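A minimal sketch (the filename ex7.csv is an assumption):
>>> import csv
>>> f = open('ex7.csv')
>>> reader = csv.reader(f)
>>> for line in reader:
....:     print(line)
>>> f.close()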
CSV files come in many different flavors. To define a new format with a different
delimiter, string quoting convention, or line terminator, we define a simple subclass of
csv.Dialect:
We can also give individual CSV dialect parameters as keywords to csv.reader without
having to define a subclass:
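A sketch of both approaches, reusing an open file object f (the particular dialect options shown are assumptions):
>>> class my_dialect(csv.Dialect):
....:     lineterminator = '\n'
....:     delimiter = ';'
....:     quotechar = '"'
....:     quoting = csv.QUOTE_MINIMAL
>>> reader = csv.reader(f, dialect=my_dialect)
>>> reader = csv.reader(f, delimiter='|')   # individual options as keywords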
To write delimited files manually, you can use csv.writer. It accepts an open, writable
file object and the same dialect and format options as csv.reader:
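For example (the filename and row contents are assumptions):
>>> with open('mydata.csv', 'w') as f:
....:     writer = csv.writer(f, dialect=my_dialect)
....:     writer.writerow(('one', 'two', 'three'))
....:     writer.writerow(('1', '2', '3'))
....:     writer.writerow(('4', '5', '6'))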
The basic JSON types are objects (dicts), arrays (lists), strings, numbers, booleans,
and nulls. All of the keys in an object must be strings. There are several Python
libraries for reading and writing JSON data. To convert a JSON string to
Python form, use json.loads:
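For example (the JSON string here is a made-up sketch):
>>> import json
>>> obj = '{"name": "Wes", "cities_lived": ["Akron", "Nashville"]}'
>>> result = json.loads(obj)
>>> result
{'name': 'Wes', 'cities_lived': ['Akron', 'Nashville']}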
The default options for pandas.read_json assume that each object in the JSON
array is a row in the table:
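A sketch, assuming a file example.json containing a JSON array of records:
>>> data = pd.read_json('example.json')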
Because failures has many columns, pandas inserts a line break character \.
We proceed to do some data cleaning and analysis, like computing the number of bank failures
by year:
Using lxml.objectify, we parse the file and get a reference to the root node of
the XML file with getroot:
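A sketch, assuming an XML document stored at an assumed path 'Performance_MNR.xml':
>>> from lxml import objectify
>>> parsed = objectify.parse(open('Performance_MNR.xml'))
>>> root = parsed.getroot()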
You can read any “pickled” object stored in a file by using the built-in pickle
directly, or even more conveniently using pandas.read_pickle:
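For example (frame is any DataFrame; the filename is an assumption):
>>> frame.to_pickle('frame_pickle')
>>> pd.read_pickle('frame_pickle')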
pandas has built-in support for two more binary data formats: HDF5 and MessagePack.
4.3.1 Using HDF5 Format
HDF5 is a well-regarded file format intended for storing large quantities of
scientific array data. It is available as a C library, and it has interfaces
available in many other languages, including Java, Julia, MATLAB, and
Python. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file
can store multiple datasets and supporting metadata. Compared with simpler
formats, HDF5 supports on-the-fly compression with a variety of compression
modes, enabling data with repeated patterns to be stored more efficiently.
HDF5 can be a good choice for working with very large datasets that don’t fit
into memory, as you can efficiently read and write small sections of much
larger arrays.
While it’s possible to directly access HDF5 files using either the PyTables or
h5py libraries, pandas provides a high-level interface that simplifies storing
Series and DataFrame objects. The HDFStore class works like a dict and
handles the low-level details:
Objects contained in the HDF5 file can then be retrieved with the same dict-
like API:
HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is
generally slower, but it supports query operations using a special syntax:
put is an explicit version of the store['obj2'] = frame assignment, but it allows us to set other options, like the storage format.
The pandas.read_hdf function gives you a shortcut to these tools:
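A condensed sketch of the workflow described above (requires the PyTables package; frame is any DataFrame, and the filename, keys, and query conditions are assumptions):
>>> store = pd.HDFStore('mydata.h5')
>>> store['obj1'] = frame                        # dict-like storage
>>> store['obj1']                                # dict-like retrieval
>>> store.put('obj2', frame, format='table')     # explicit put with the 'table' schema
>>> store.select('obj2', where=['index < 10'])   # query syntax works only with 'table'
>>> store.close()
>>> pd.read_hdf('mydata.h5', 'obj2', where=['index < 5'])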
If you are reading multiple sheets in a file, then it is faster to create the
ExcelFile, but you can also simply pass the filename to pandas.read_excel:
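A sketch of both approaches (the filename and sheet name are assumptions; reading Excel files also requires the openpyxl or xlrd package):
>>> xlsx = pd.ExcelFile('ex1.xlsx')
>>> frame = pd.read_excel(xlsx, 'Sheet1')
>>> frame = pd.read_excel('ex1.xlsx', 'Sheet1')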
Many websites have public APIs providing data feeds via JSON or some other
format.
There are a number of ways to access these APIs from Python; one easy-to-use
method that I recommend is the requests package.
To find the last 30 GitHub issues for pandas on GitHub, we can make a GET
HTTP request using the add-on requests library:
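A sketch using the public GitHub issues endpoint (the selection of fields is an assumption):
>>> import requests
>>> url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
>>> resp = requests.get(url)
>>> data = resp.json()                       # list of dicts, one per issue
>>> issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])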
Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return
a list of tuples when selecting data from a table:
You can pass the list of tuples to the DataFrame constructor, but you also need
the column names, contained in the cursor’s description attribute:
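A minimal sketch using the built-in sqlite3 driver (the database file and table name are assumptions):
>>> import sqlite3
>>> con = sqlite3.connect('mydata.sqlite')
>>> cursor = con.execute('select * from test')
>>> rows = cursor.fetchall()
>>> pd.DataFrame(rows, columns=[x[0] for x in cursor.description])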
>>> df1
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
>>> df2
   data2 key
0      0   a
1      1   b
2      2   d
Note that the above example didn't specify which column to join on. If not specified, merge uses the overlapping column names as the keys. It's good practice to specify the key explicitly, though:
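For example, using the df1 and df2 shown above:
>>> pd.merge(df1, df2, on='key')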
If the column names are different in each object, we can specify them
separately:
>>> df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
....: 'data1': range(7)})
>>> df4 = DataFrame({'rkey': ['a', 'b', 'd'],
....: 'data2': range(3)})
>>> pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Notice that in the above example the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in
the result are the intersection. Other possible options are 'left', 'right', and
'outer'. The outer join takes the union of the keys, combining the effect of
applying both left and right joins:
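For example:
>>> pd.merge(df1, df2, how='outer')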
>>> df1
>>> df2
Many-to-many joins form the Cartesian product of the rows. Since there were
3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in
the result. The join method only affects the distinct key values appearing in the
result:
>>> pd.merge(df1, df2, how='inner')
By default concat works along axis=0, producing another Series. If you pass
axis=1, the result will instead be a DataFrame (axis=1 is the columns):
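A minimal sketch with three Series having non-overlapping indexes (the data here is an assumption):
>>> s1 = Series([0, 1], index=['a', 'b'])
>>> s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
>>> s3 = Series([5, 6], index=['f', 'g'])
>>> pd.concat([s1, s2, s3])           # default axis=0: one longer Series
>>> pd.concat([s1, s2, s3], axis=1)   # axis=1: a DataFrame with one column per input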
Example:
>>> data = DataFrame(np.arange(6).reshape((2, 3)),
....: index=pd.Index(['Ohio', 'Colorado'], name='state'),
....: columns=pd.Index(['one', 'two', 'three'], name='number'))
>>> data
Using the stack method on this data pivots the columns into the rows,
producing a Series:
>>> result = data.stack()
>>> result
From a hierarchically-indexed Series, you can rearrange the data back into a
DataFrame with unstack:
>>> result.unstack()
By default the innermost level is unstacked (same with stack). You can unstack
a different level by passing a level number or name:
>>> result.unstack(0)
>>> result.unstack('state')
Unstacking might introduce missing data if all of the values in the level aren’t
found in each of the subgroups:
>>> s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
>>> s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
>>> data2 = pd.concat([s1, s2], keys=['one', 'two'])
>>> data2.unstack()
>>> data2.unstack().stack(dropna=False)
In short, it combines the year and quarter columns to create a kind of time
interval type.
Now, ldata looks like:
This is the so-called long format for multiple time series, or other
observational data with two or more keys (here, our keys are date and item).
Each row in the table represents a single observation.
Data is frequently stored this way in relational databases like MySQL, as a
fixed schema (column names and data types) allows the number of distinct
values in the item column to change as data is added to the table. In the
previous example, date and item would usually be the primary keys (in
relational database parlance), offering both relational integrity and easier joins.
In some cases, the data may be more difficult to work with in this format; you
might prefer to have a DataFrame containing one column per distinct item value, indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation:
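A sketch of that call, assuming ldata has columns date, item, and value (the value column name is an assumption):
>>> pivoted = ldata.pivot(index='date', columns='item', values='value')
>>> pivoted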
The 'key' column may be a group indicator, and the other columns are data
values.
When using pandas.melt, we must indicate which columns (if any) are group
indicators.
Let’s use 'key' as the only group indicator here:
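A sketch, assuming a small DataFrame with a key column plus a couple of value columns (the frame itself is an assumption):
>>> df = DataFrame({'key': ['foo', 'bar', 'baz'],
....:               'A': [1, 2, 3],
....:               'B': [4, 5, 6]})
>>> melted = pd.melt(df, ['key'])
>>> melted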
Since the result of pivot creates an index from the column used as the row
labels, we may want to use reset_index to move the data back into a column:
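For example, continuing the sketch above by reshaping melted back and then moving the key values into a column:
>>> reshaped = melted.pivot(index='key', columns='variable', values='value')
>>> reshaped.reset_index()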
Both of these methods by default consider all of the columns; alternatively, you
can specify any subset of them to detect duplicates. Suppose we had an
additional column of values and wanted to filter duplicates only based on the
'k1' column:
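A sketch, assuming the DataFrame data has a k1 column and we add a v1 column of values:
>>> data['v1'] = range(len(data))
>>> data.drop_duplicates(['k1'])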
Suppose you wanted to add a column indicating the type of animal that each
food came from. Let’s write down a mapping of each distinct meat type to the
kind of animal:
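A sketch, assuming the DataFrame has a food column of meat names (the exact mapping is an assumption):
>>> meat_to_animal = {
....:     'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
....:     'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'
....: }
>>> data['animal'] = data['food'].str.lower().map(meat_to_animal)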
The -999 values might be sentinel values for missing data. To replace these
with NA values that pandas understands, we can use replace, producing a new
Series:
>>> data.replace(-999, np.nan)
If you want to replace multiple values at once, you instead pass a list and then the substitute value:
>>> data.replace([-999, -1000], np.nan)
Example:
>>> data = DataFrame(np.arange(12).reshape((3, 4)),
.....: index=['Ohio', 'Colorado', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])
If you want to create a transformed version of a data set without modifying the
original, a useful method is rename:
>>> data.rename(index=str.title, columns=str.upper)
rename can be used in conjunction with a dict-like object providing new values
for a subset of the axis labels:
>>> data.rename(index={'OHIO': 'INDIANA'},
.....: columns={'three': 'peekaboo'})
>>> ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60
and older. To do so, you have to use cut, a function in pandas:
>>> bins = [18, 25, 35, 60, 100]
>>> cats = pd.cut(ages, bins)
>>> cats
The object pandas returns is a special Categorical object. You can treat it like
an array of strings indicating the bin name; internally it contains a levels array
indicating the distinct category names along with a labeling for the ages data in
the labels attribute:
>>> cats.labels
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1])
>>> cats.levels
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)
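In current pandas releases the same information is exposed under different attribute names; a minimal equivalent, assuming pandas 1.0 or later:
>>> cats.codes          # integer codes, equivalent to the labels shown above
>>> cats.categories     # the distinct bins, equivalent to the levels shown above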
>>> pd.value_counts(cats)
You can also pass your own bin names by passing a list or array to the labels
option:
>>> group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
>>> pd.cut(ages, bins, labels=group_names)
Categorical:
array([Youth, Youth, Youth, YoungAdult, Youth, Youth, MiddleAged,
YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult], dtype=object)
Levels (4): Index([Youth, YoungAdult, MiddleAged, Senior], dtype=object)
Suppose you wanted to find values in one of the columns exceeding three in
magnitude:
>>> col = data[3]
>>> col[np.abs(col) > 3]
To select all rows having a value exceeding 3 or -3, you can use the any method
on a boolean DataFrame:
>>> data[(np.abs(data) > 3).any(1)]
>>> sampler = np.random.permutation(len(df))
>>> df.take(sampler)
To select a random subset without replacement, one way is to slice off the first
k elements of the array returned by permutation, where k is the desired subset
size. There are much more efficient sampling-without-replacement algorithms.
>>> df.take(np.random.permutation(len(df))[:3])
In many string munging and scripting applications, built-in string methods are
sufficient.
As an example, a comma-separated string can be broken into pieces with split:
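A minimal setup used in the examples that follow (the exact string is an assumption consistent with the outputs shown):
>>> val = 'a,b, guido'
>>> pieces = [x.strip() for x in val.split(',')]
>>> pieces
['a', 'b', 'guido']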
Joining the pieces back together with the + operator isn't a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':
>>> '::'.join(pieces)
'a::b::guido'
Other methods are concerned with locating substrings. Using Python’s in
keyword is the best way to detect a substring, though index and find can also
be used:
Note the difference between find and index is that index raises an exception if
the string isn’t found (versus returning -1):
>>> val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
----> 1 val.index(':')
ValueError: substring not found
>>> val.count(',')
2
replace will substitute occurrences of one pattern for another. This is
commonly used to delete patterns, too, by passing an empty string:
>>> val.replace(',', '::')
'a::b:: guido'
>>> val.replace(',', '')
'ab guido'
Regular expressions can also be used with many of these operations like below.
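The examples below split a short string on runs of whitespace (the exact string is an assumption consistent with the results shown):
>>> import re
>>> text = 'foo bar\t baz \tqux'
>>> re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']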
When you call re.split('\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object:
>>> regex = re.compile('\s+')
>>> regex.split(text)
['foo', 'bar', 'baz', 'qux']
If, instead, you wanted to get a list of all patterns matching the regex, you can use the findall method:
>>> regex.findall(text)
[' ', '\t ', ' \t']
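The following examples scan a small block of text containing names and email addresses (the exact text is an assumption consistent with the results shown):
>>> text = """Dave dave@google.com
....: Steve steve@gmail.com
....: Rob rob@gmail.com
....: Ryan ryan@yahoo.com"""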
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
>>> regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
search returns a special match object for the first email address in the text.
For the above regex, the match object can only tell us the start and end
position of the pattern in the string:
>>> m = regex.search(text)
>>> m
>>> text[m.start():m.end()]
'dave@google.com'
regex.match returns None, as it only will match if the pattern occurs at the
start of the string:
Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:
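Continuing with the text and regex above, a sketch:
>>> regex.sub('REDACTED', text)
'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED'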
Suppose you wanted to find email addresses and simultaneously segment each
address into its 3 components:
username, domain name, and domain suffix. To do this, put parentheses
around the parts of the pattern to segment:
>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
A match object produced by this modified regex returns a tuple of the pattern components with its groups method:
>>> m = regex.match('wesm@bright.net')
>>> m.groups()
('wesm', 'bright', 'net')
findall returns a list of tuples when the pattern has groups:
>>> regex.findall(text)
[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan',
'yahoo', 'com')]
sub also has access to groups in each match using special symbols like \1, \2,
etc.:
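For example, using the grouped regex and text above:
>>> print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com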
String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but they will fail on the
NA (null) values. To cope with this, Series has array-oriented methods for string
operations that skip NA values. These are accessed through Series’s str
attribute; for example, we could check whether each email address has 'gmail'
in it with str.contains:
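A sketch, assuming a small Series of email addresses with one missing value (the data is an assumption):
>>> data = Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
....:              'Rob': 'rob@gmail.com', 'Wes': np.nan})
>>> data.str.contains('gmail')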
Regular expressions can be used, too, along with any re options like
IGNORECASE:
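For example, reusing the grouped email pattern from above:
>>> data.str.findall(pattern, flags=re.IGNORECASE)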
There are a couple of ways to do vectorized element retrieval. Either use str.get
or index into the str attribute:
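A sketch of both, continuing with the same Series:
>>> data.str.get(0)    # first character of each address
>>> data.str[:5]       # first five characters of each address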
When we execute code like the above, note that pandas automatically creates a zero-based integer index for the result whenever no explicit index is supplied.