
Data Analytics Using Python 20MCA31

MODULE 4

Data Preprocessing and Data Wrangling

4.1 Reading and Writing Data in Text Format

pandas provides a number of functions (such as read_csv and read_table) that convert text data into a DataFrame. The optional arguments for these functions fall into a few categories:


Indexing: Can treat one or more columns as the index of the returned DataFrame, and controls whether to get column names from the file, from the user, or not at all.
Type inference and data conversion: Includes user-defined value conversions and custom lists of missing value markers.
Datetime parsing: Includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
Iterating: Support for iterating over chunks of very large files.
Unclean data issues: Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

Some of these functions, like pandas.read_csv, perform type inference, because the column data types
are not part of the data format. That means you don’t necessarily have to specify which columns are
numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, have the data
types stored in the format. Handling dates and other custom types can require extra effort.

Comma-separated (CSV) text file:


In [8]: !cat examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Here we used the Unix cat shell command to print the raw contents of the file to the screen. On Windows, use type instead of cat to achieve the same effect.
Since this is comma-delimited, we can use read_csv to read it into a DataFrame:

We could also have used read_table and specified the delimiter:
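A minimal sketch of both calls, assuming pandas is imported as pd and the file is examples/ex1.csv as listed above:

>>> import pandas as pd
>>> df = pd.read_csv('examples/ex1.csv')               # columns a, b, c, d, message
>>> df2 = pd.read_table('examples/ex1.csv', sep=',')   # same result, explicit delimiter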


A file will not always have a header row. Consider this file:

To read this file, we have a couple of options. You can allow pandas to assign default column names, or
you can specify names yourself:

Suppose we wanted the message column to be the index of the returned DataFrame. You can either indicate you want the column at index 4 or the one named 'message' using the index_col argument:
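A sketch of these options, assuming a headerless file examples/ex2.csv with the same five columns as above:

>>> pd.read_csv('examples/ex2.csv', header=None)               # pandas assigns default names 0..4
>>> names = ['a', 'b', 'c', 'd', 'message']
>>> pd.read_csv('examples/ex2.csv', names=names)               # user-supplied names
>>> pd.read_csv('examples/ex2.csv', names=names, index_col='message')   # 'message' as index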


In the event that you want to form a hierarchical index from multiple columns, pass a list of column
numbers or names:
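For example, with a hypothetical file examples/csv_mindex.csv containing columns key1, key2, value1, and value2:

>>> parsed = pd.read_csv('examples/csv_mindex.csv', index_col=['key1', 'key2'])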

In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to
separate fields. Consider a text file that looks like this:


Here the fields are separated by a variable amount of whitespace. In these cases, you can pass a regular expression as a delimiter for read_table. This can be expressed by the regular expression \s+, so we then have:
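A sketch, assuming the whitespace-separated file is examples/ex3.txt:

>>> result = pd.read_table('examples/ex3.txt', sep='\s+')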

The parser functions have many additional arguments to help you handle the wide variety of exceptional file formats that occur.
For example, you can skip the first, third, and fourth rows of a file with skiprows:
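A sketch, assuming a file examples/ex4.csv whose rows 0, 2, and 3 contain comment lines:

>>> pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])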


Handling missing values is an important and frequently nuanced part of the file-reading process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:
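A sketch of the na_values option, assuming a file examples/ex5.csv; the column names used in the per-column dict are illustrative:

>>> result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
>>> sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
>>> result = pd.read_csv('examples/ex5.csv', na_values=sentinels)   # per-column sentinels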


4.2 Reading Text Files in Pieces



When processing very large files, or when figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file. Before we look at a large file, we make the pandas display settings more compact:

If you want to only read a small number of rows (avoiding reading the entire file), specify that with
nrows:

To read a file in pieces, specify a chunksize as a number of rows:


In [37]: chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
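A sketch of both approaches, assuming a large file examples/ex6.csv with a key column; the chunker object is an iterator that yields one DataFrame per chunk:

>>> pd.options.display.max_rows = 10                        # more compact display
>>> pd.read_csv('examples/ex6.csv', nrows=5)                # read only the first 5 rows
>>> chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
>>> tot = pd.Series([], dtype='float64')
>>> for piece in chunker:
...     tot = tot.add(piece['key'].value_counts(), fill_value=0)
>>> tot = tot.sort_values(ascending=False)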

4.2.1 Writing Data to Text Format


Data can also be exported to a delimited format. Let’s consider one of the CSV files read before:

Using DataFrame’s to_csv method, we can write the data out to a comma-separated file:

Other delimiters can be used:

Writing to sys.stdout prints the text result to the console:


Missing values appear as empty strings in the output. You might want to denote them by some other
sentinel value:

With no other options specified, both the row and column labels are written. Both of these can be
disabled:
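A sketch of these to_csv options, assuming data is a DataFrame read from one of the CSV files above and that sys has been imported:

>>> import sys
>>> data.to_csv('examples/out.csv')                        # comma-separated output file
>>> data.to_csv(sys.stdout, sep='|')                       # another delimiter
>>> data.to_csv(sys.stdout, na_rep='NULL')                 # denote missing values explicitly
>>> data.to_csv(sys.stdout, index=False, header=False)     # suppress row and column labels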

4.2.3 Working with Delimited Formats


It’s possible to load most forms of tabular data from disk using functions like
pandas.read_table. In some cases, however, some manual processing may be
necessary. It’s not uncommon to receive a file with one or more malformed lines that
trip up read_table. To illustrate the basic tools, consider a small CSV file:

For any file with a single-character delimiter, you can use Python’s built-in csv
module. To use it, pass any open file or file-like object to csv.reader:


Iterating through the reader like a file yields tuples of values with any quote
characters removed:
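A sketch, assuming the small CSV file is examples/ex7.csv:

>>> import csv
>>> f = open('examples/ex7.csv')
>>> reader = csv.reader(f)
>>> for line in reader:
...     print(line)             # each line is a list of strings, with quote characters removed
>>> f.close()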

CSV files come in many different flavors. To define a new format with a different
delimiter, string quoting convention, or line terminator, we define a simple subclass of
csv.Dialect:

We can also give individual CSV dialect parameters as keywords to csv.reader without
having to define a subclass:


To write delimited files manually, you can use csv.writer. It accepts an open, writable
file object and the same dialect and format options as csv.reader:
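A sketch of a custom dialect, keyword overrides, and csv.writer; the class name my_dialect and the output file mydata.csv are illustrative:

>>> class my_dialect(csv.Dialect):
...     lineterminator = '\n'
...     delimiter = ';'
...     quotechar = '"'
...     quoting = csv.QUOTE_MINIMAL
>>> reader = csv.reader(open('examples/ex7.csv'), dialect=my_dialect)
>>> # individual parameters can also be given as keywords without defining a subclass:
>>> reader = csv.reader(open('examples/ex7.csv'), delimiter='|')
>>> with open('mydata.csv', 'w') as fout:
...     writer = csv.writer(fout, dialect=my_dialect)
...     writer.writerow(('one', 'two', 'three'))
...     writer.writerow(('1', '2', '3'))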

4.2.4 JSON Data


JSON (short for JavaScript Object Notation) has become one of the standard formats
for sending data by HTTP request between web browsers and other applications.


The basic types are objects (dicts), arrays (lists), strings, numbers, booleans,
and nulls. All of the keys in an object must be strings. There are several Python
libraries for reading and writing JSON data. To convert a JSON string to
Python form, use json.loads:

json.dumps, on the other hand, converts a Python object back to JSON:

pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. For example:

The default options for pandas.read_json assume that each object in the JSON
array is a row in the table:
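A sketch of the round trip, with an illustrative JSON string and an assumed file examples/example.json holding a JSON array of objects:

>>> import json
>>> obj = '{"name": "Wes", "places_lived": ["United States", "Spain"], "pet": null}'
>>> result = json.loads(obj)        # JSON string -> Python dict
>>> asjson = json.dumps(result)     # Python object -> JSON string
>>> data = pd.read_json('examples/example.json')   # each object in the array becomes a row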


4.2.5 XML and HTML: Web Scraping


pandas has a built-in function, read_html, which uses libraries like lxml and
Beautiful Soup to automatically parse tables out of HTML files as DataFrame
objects.
The pandas.read_html function has a number of options, but by default it
searches for and attempts to parse all tabular data contained within <table>
tags. The result is a list of DataFrame objects:

Because the resulting DataFrame (called failures here) has many columns, pandas inserts a line break character \ when printing it.

We proceed to do some data cleaning and analysis, like computing the number of bank failures
by year:
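A sketch based on the FDIC failed-bank listing used in the pandas book; the file name fdic_failed_bank_list.html and the Closing Date column are assumptions about that dataset:

>>> tables = pd.read_html('examples/fdic_failed_bank_list.html')
>>> len(tables)                                      # read_html returns a list of DataFrames
>>> failures = tables[0]
>>> close_timestamps = pd.to_datetime(failures['Closing Date'])
>>> close_timestamps.dt.year.value_counts()          # number of bank failures by year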


4.2.5.1 Parsing XML with lxml.objectify


XML (eXtensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata.
We saw above the pandas.read_html function, which uses either lxml or Beautiful Soup under the hood to parse data from HTML. XML and HTML are structurally similar, but XML is more general.
Here is an example of how to use lxml to parse data from a more general XML format.

The New York Metropolitan Transportation Authority (MTA) publishes a number of data series about its bus and train services, contained in a set of XML files. Each train or bus service has a different file (like Performance_MNR.xml for the Metro-North Railroad) containing monthly data as a series of XML records that look like this:


Using lxml.objectify, we parse the file and get a reference to the root node of
the XML file with getroot:

root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values (excluding a few tags):

Lastly, convert this list of dicts into a DataFrame:
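A sketch following the Performance_MNR.xml example; the file path and the list of skipped tag names are assumptions about that dataset:

>>> from lxml import objectify
>>> parsed = objectify.parse(open('datasets/mta_perf/Performance_MNR.xml'))
>>> root = parsed.getroot()
>>> skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']
>>> data = []
>>> for elt in root.INDICATOR:
...     el_data = {}
...     for child in elt.getchildren():
...         if child.tag in skip_fields:
...             continue
...         el_data[child.tag] = child.pyval
...     data.append(el_data)
>>> perf = pd.DataFrame(data)        # list of dicts -> DataFrame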


4.3. Binary Data Formats:


One of the easiest ways to store data (also known as serialization) efficiently in
binary format is using Python’s built-in pickle serialization. pandas objects all
have a to_pickle method that writes the data to disk in pickle format:

You can read any “pickled” object stored in a file by using the built-in pickle
directly, or even more conveniently using pandas.read_pickle:
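A minimal sketch of the pickle round trip, reusing the ex1.csv file from above:

>>> frame = pd.read_csv('examples/ex1.csv')
>>> frame.to_pickle('examples/frame_pickle')     # write binary pickle to disk
>>> pd.read_pickle('examples/frame_pickle')      # read it back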

pandas has built-in support for two more binary data formats: HDF5 and MessagePack.
4.3.1 Using HDF5 Format
HDF5 is a well-regarded file format intended for storing large quantities of
scientific array data. It is available as a C library, and it has interfaces
available in many other languages, including Java, Julia, MATLAB, and
Python. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file
can store multiple datasets and supporting metadata. Compared with simpler
formats, HDF5 supports on-the-fly compression with a variety of compression
modes, enabling data with repeated patterns to be stored more efficiently.
HDF5 can be a good choice for working with very large datasets that don’t fit


into memory, as you can efficiently read and write small sections of much
larger arrays.
While it’s possible to directly access HDF5 files using either the PyTables or
h5py libraries, pandas provides a high-level interface that simplifies storing
Series and DataFrame objects. The HDFStore class works like a dict and
handles the low-level details:

Objects contained in the HDF5 file can then be retrieved with the same dict-
like API:


HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is
generally slower, but it supports query operations using a special syntax:

put is an explicit version of the store['obj2'] = frame assignment but allows us to set other options like the storage format.
The pandas.read_hdf function gives you a shortcut to these tools:
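A sketch of the HDFStore workflow; it assumes frame is a DataFrame with a column a and requires the PyTables package (tables) to be installed:

>>> store = pd.HDFStore('mydata.h5')
>>> store['obj1'] = frame                              # dict-like assignment
>>> store['obj1_col'] = frame['a']
>>> store['obj1']                                      # dict-like retrieval
>>> store.put('obj2', frame, format='table')           # explicit put with 'table' format
>>> store.select('obj2', where=['index >= 10 and index <= 15'])
>>> store.close()
>>> frame.to_hdf('mydata.h5', key='obj3', format='table')
>>> pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])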


4.3.2 Reading Microsoft Excel Files


Pandas also supports reading tabular data stored in Excel 2003 (and higher)
files using either the ExcelFile class or pandas.read_excel function. Internally
these tools use the add-on packages xlrd and openpyxl to read XLS and
XLSX files, respectively. You may need to install these manually with pip or
conda.
To use ExcelFile, create an instance by passing a path to an xls or xlsx file:

If you are reading multiple sheets in a file, then it is faster to create the
ExcelFile, but you can also simply pass the filename to pandas.read_excel:
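A sketch, assuming a workbook examples/ex1.xlsx containing a sheet named Sheet1:

>>> xlsx = pd.ExcelFile('examples/ex1.xlsx')
>>> pd.read_excel(xlsx, 'Sheet1')                            # parse a sheet from the ExcelFile
>>> frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')     # or pass the filename directly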


4.4. Interacting with Web APIs

Many websites have public APIs providing data feeds via JSON or some other
format.
There are a number of ways to access these APIs from Python; one easy-to-use
method that I recommend is the requests package.
To find the last 30 GitHub issues for pandas on GitHub, we can make a GET
HTTP request using the add-on requests library:

Each element in data is a dictionary containing all of the data found on a GitHub issue page (except for the comments). We can pass data directly to DataFrame and extract fields of interest:
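A sketch using the requests package; the URL and field names follow the public GitHub issues API:

>>> import requests
>>> url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
>>> resp = requests.get(url)
>>> data = resp.json()                       # list of dicts, one per issue
>>> data[0]['title']
>>> issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])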


4.5 Interacting with Databases


In a business setting, most data may not be stored in text or Excel files. SQL-
based relational databases (such as SQL Server, PostgreSQL, and MySQL) are
in wide use, and many alternative databases have become quite popular. The
choice of database is usually dependent on the performance, data integrity,
and scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward, and pandas
has some functions to simplify the process. As an example, create a SQLite
database using Python’s built-in sqlite3 driver:


Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return
a list of tuples when selecting data from a table:

You can pass the list of tuples to the DataFrame constructor, but you also need
the column names, contained in the cursor’s description attribute:
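A sketch of the full round trip with sqlite3; the table layout (columns a, b, c, d) and the sample rows are illustrative:

>>> import sqlite3
>>> query = """
... CREATE TABLE test
... (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER);"""
>>> con = sqlite3.connect('mydata.sqlite')
>>> con.execute(query)
>>> con.commit()
>>> data = [('Atlanta', 'Georgia', 1.25, 6),
...         ('Tallahassee', 'Florida', 2.6, 3),
...         ('Sacramento', 'California', 1.7, 5)]
>>> stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
>>> con.executemany(stmt, data)
>>> con.commit()
>>> cursor = con.execute('SELECT * FROM test')
>>> rows = cursor.fetchall()                               # list of tuples
>>> pd.DataFrame(rows, columns=[x[0] for x in cursor.description])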


4.6 Combining and Merging Data Sets:


Data contained in pandas objects can be combined together in a number of
built-in ways:
• pandas.merge connects rows in DataFrames based on one or more keys. This
will be familiar to users of SQL or other relational databases, as it implements
database join operations.
• pandas.concat glues or stacks together objects along an axis.
• combine_first instance method enables splicing together overlapping data to
fill in missing values in one object with values from another.

4.6.1 Database-style DataFrame Merges


Merge or join operations combine data sets by linking rows using one or more
keys. These operations are central to relational databases. The merge function
in pandas is the main entry point for using these algorithms on your data.
Example:
>>> df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
....: 'data1': range(7)})
>>> df2 = DataFrame({'key': ['a', 'b', 'd'],
....: 'data2': range(3)})
>>> df1
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
>>> df2
   data2 key
0      0   a
1      1   b
2      2   d

Following is an example of a many-to-one merge situation; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects, we obtain:

>>> pd.merge(df1, df2)

Note that the above example didn't specify which column to join on. If not specified, merge uses the overlapping column names as the keys. It's good practice to specify the key explicitly, though:

>>> pd.merge(df1, df2, on='key')


If the column names are different in each object, we can specify them
separately:
>>> df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
....: 'data1': range(7)})
>>> df4 = DataFrame({'rkey': ['a', 'b', 'd'],
....: 'data2': range(3)})
>>> pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Notice that in the above example the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in
the result are the intersection. Other possible options are 'left', 'right', and
'outer'. The outer join takes the union of the keys, combining the effect of
applying both left and right joins:


>>> pd.merge(df1, df2, how='outer')

Different join types with how argument
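Briefly, and consistent with the pandas documentation: 'inner' uses only the key combinations observed in both tables; 'left' uses all key combinations found in the left table; 'right' uses all key combinations found in the right table; 'outer' uses the union of key combinations from both tables.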

Many-to-many merges have well-defined though not necessarily intuitive behavior. Here’s an example:

>>> df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
....:                'data1': range(6)})
>>> df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
....:                'data2': range(5)})

>>> df1


>>> df2

>>> pd.merge(df1, df2, on='key', how='left')

Many-to-many joins form the Cartesian product of the rows. Since there were
3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in
the result. The join method only affects the distinct key values appearing in the
result:
>>> pd.merge(df1, df2, how='inner')


To merge with multiple keys, pass a list of column names:


>>> left = DataFrame({'key1': ['foo', 'foo', 'bar'],
....: 'key2': ['one', 'two', 'one'],
....: 'lval': [1, 2, 3]})
>>> right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
....: 'key2': ['one', 'one', 'one', 'two'],
....: 'rval': [4, 5, 6, 7]})

>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')

Merge function arguments


4.6.1.1 Merging on Index


In some cases, the merge key or keys in a DataFrame will be found in its index.
In this case, you can pass left_index=True or right_index=True (or both) to
indicate that the index should be used as the merge key:

>>> left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
....:                  'value': range(6)})
>>> right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
>>> left1

>>> right1

>>> pd.merge(left1, right1, left_on='key', right_index=True)


4.6.2 Concatenating Along an Axis:


Another kind of data combination operation is alternatively referred to as
concatenation, binding, or stacking. NumPy has a concatenate function for
doing this with raw NumPy arrays:

>>> arr = np.arange(12).reshape((3, 4))
>>> arr

>>> np.concatenate([arr, arr], axis=1)

Suppose we have three Series with no index overlap:

>>> s1 = Series([0, 1], index=['a', 'b'])
>>> s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
>>> s3 = Series([5, 6], index=['f', 'g'])

Calling concat with these objects in a list glues together the values and indexes:

>>> pd.concat([s1, s2, s3])


By default concat works along axis=0, producing another Series. If you pass
axis=1, the result will instead be a DataFrame (axis=1 is the columns):

>>> pd.concat([s1, s2, s3], axis=1)

Concat function arguments


4.7 Reshaping and Pivoting:

There are a number of fundamental operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.

4.7.1 Reshaping with Hierarchical Indexing:
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
• stack: this “rotates” or pivots from the columns in the data to the rows
• unstack: this pivots from the rows into the columns

Example:
>>> data = DataFrame(np.arange(6).reshape((2, 3)),
....: index=pd.Index(['Ohio', 'Colorado'], name='state'),
....: columns=pd.Index(['one', 'two', 'three'], name='number'))
>>> data

Using the stack method on this data pivots the columns into the rows,
producing a Series:
>>> result = data.stack()
>>> result


From a hierarchically-indexed Series, you can rearrange the data back into a
DataFrame with unstack:
>>> result.unstack()

By default the innermost level is unstacked (same with stack). You can unstack
a different level by passing a level number or name:
>>> result.unstack(0)

>>> result.unstack('state')

Unstacking might introduce missing data if all of the values in the level aren’t
found in each of the subgroups:
>>> s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
>>> s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
>>> data2 = pd.concat([s1, s2], keys=['one', 'two'])
>>> data2.unstack()


Stacking filters out missing data by default, so the operation is easily invertible:

>>> data2.unstack().stack()

>>> data2.unstack().stack(dropna=False)


4.7.2 Pivoting “Long” to “Wide” Format


A common way to store multiple time series in databases and CSV is in so-
called long or stacked format. Let’s load some example data and do a small
amount of time series wrangling and other data cleaning:

In short, it combines the year and quarter columns to create a kind of time
interval type.
Now, ldata looks like:


This is the so-called long format for multiple time series, or other
observational data with two or more keys (here, our keys are date and item).
Each row in the table represents a single observation.
Data is frequently stored this way in relational databases like MySQL, as a
fixed schema (column names and data types) allows the number of distinct
values in the item column to change as data is added to the table. In the
previous example, date and item would usually be the primary keys (in
relational database parlance), offering both relational integrity and easier joins.
In some cases, the data may be more difficult to work with in this format; you
might prefer to have a DataFrame containing one column per distinct item
value indexed by timestamps in the date column. DataFrame’s pivot method
performs exactly this transformation:
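A small self-contained sketch of the long-to-wide transformation (the data here is illustrative, not the macroeconomic dataset used in the original example):

>>> ldata = pd.DataFrame({'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
...                       'item': ['realgdp', 'infl', 'unemp'] * 2,
...                       'value': [2710.3, 0.0, 5.8, 2778.8, 2.34, 5.1]})
>>> pivoted = ldata.pivot(index='date', columns='item', values='value')
>>> pivoted            # one column per distinct item value, indexed by date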


4.7.3 Pivoting “Wide” to “Long” Format


An inverse operation to pivot for DataFrames is pandas.melt. Rather than
transforming one column into many in a new DataFrame, it merges multiple
columns into one, producing a DataFrame that is longer than the input. Let’s
look at an example:

The 'key' column may be a group indicator, and the other columns are data
values.
When using pandas.melt, we must indicate which columns (if any) are group
indicators.
Let’s use 'key' as the only group indicator here:

Using pivot, we can reshape back to the original layout:


Since the result of pivot creates an index from the column used as the row
labels, we may want to use reset_index to move the data back into a column:

You can also specify a subset of columns to use as value columns:

pandas.melt can be used without any group identifiers, too:


4.8 Data Transformation


4.8.1 Removing Duplicates
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate or not:

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:
Both of these methods by default consider all of the columns; alternatively, you
can specify any subset of them to detect duplicates. Suppose we had an


additional column of values and wanted to filter duplicates only based on the
'k1' column:

duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:
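A sketch of these methods with an illustrative frame containing columns k1 and k2:

>>> data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
...                      'k2': [1, 1, 2, 3, 3, 4, 4]})
>>> data.duplicated()                    # boolean Series, True for duplicate rows
>>> data.drop_duplicates()               # keep rows where duplicated() is False
>>> data['v1'] = range(7)                # add an extra column of values
>>> data.drop_duplicates(['k1'])         # consider only column 'k1'
>>> data.drop_duplicates(['k1', 'k2'], keep='last')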

4.8.2 Transforming Data Using a Function or Mapping


For many datasets, you may wish to perform some transformation based on the
values in an array, Series, or column in a DataFrame. Consider the following
hypothetical data collected about various kinds of meat:


Suppose you wanted to add a column indicating the type of animal that each
food came from. Let’s write down a mapping of each distinct meat type to the
kind of animal:

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method:
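A sketch following the meat example from the pandas book; the specific foods and the meat_to_animal mapping are illustrative:

>>> data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
...                               'corned beef', 'Bacon', 'pastrami', 'honey ham',
...                               'nova lox'],
...                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
>>> meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
...                   'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}
>>> lowercased = data['food'].str.lower()
>>> data['animal'] = lowercased.map(meat_to_animal)
>>> # or do it all in one pass with a function:
>>> data['food'].map(lambda x: meat_to_animal[x.lower()])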


4.8.3 Replacing Values


Filling in missing data with the fillna method can be thought of as a special
case of more general value replacement. While map, as you’ve seen above, can
be used to modify a subset of values in an object, replace provides a simpler
and more flexible way to do so.
Example:
>>> data = Series([1., -999., 2., -999., -1000., 3.])
>>> data

The -999 values might be sentinel values for missing data. To replace these
with NA values that pandas understands, we can use replace, producing a new
Series:
>>> data.replace(-999, np.nan)


If you want to replace multiple values at once, you instead pass a list and then the substitute value:
>>> data.replace([-999, -1000], np.nan)

To use a different replacement for each value, pass a list of substitutes:


>>> data.replace([-999, -1000], [np.nan, 0])

The argument passed can also be a dict:


>>> data.replace({-999: np.nan, -1000: 0})

4.8.4 Renaming Axis Indexes


Like values in a Series, axis labels can be similarly transformed by a function
or mapping of some form to produce new, differently labeled objects. The axes
can also be modified in place without creating a new data structure.


Example:
>>> data = DataFrame(np.arange(12).reshape((3, 4)),
.....: index=['Ohio', 'Colorado', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])

Like a Series, the axis indexes have a map method:

>>> data.index.map(str.upper)
array([OHIO, COLORADO, NEW YORK], dtype=object)
You can assign to index, modifying the DataFrame in place:
>>> data.index = data.index.map(str.upper)
>>> data

If you want to create a transformed version of a data set without modifying the
original, a useful method is rename:
>>> data.rename(index=str.title, columns=str.upper)

rename can be used in conjunction with a dict-like object providing new values
for a subset of the axis labels:
>>> data.rename(index={'OHIO': 'INDIANA'},
.....: columns={'three': 'peekaboo'})


4.8.5. Discretization and Binning:


Continuous data is often discretized or otherwise separated into “bins” for
analysis. Suppose you have data about a group of people in a study, and you
want to group them into discrete age buckets:

>>> ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60
and older. To do so, you have to use cut, a function in pandas:
>>> bins = [18, 25, 35, 60, 100]
>>> cats = pd.cut(ages, bins)
>>> cats

The object pandas returns is a special Categorical object. You can treat it like
an array of strings indicating the bin name; internally it contains a levels array
indicating the distinct category names along with a labeling for the ages data in
the labels attribute:

>>> cats.labels
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1])

>>> cats.levels
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object

>>> pd.value_counts(cats)


Consistent with mathematical notation for intervals, a parenthesis means that the side is open while the square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:

>>> pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Categorical:
array([[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), [18, 26), [36, 61), [26, 36),
       [61, 100), [36, 61), [36, 61), [26, 36)], dtype=object)
Levels (4): Index([[18, 26), [26, 36), [36, 61), [61, 100)], dtype=object)

You can also pass your own bin names by passing a list or array to the labels
option:
>>> group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
>>> pd.cut(ages, bins, labels=group_names)
Categorical:
array([Youth, Youth, Youth, YoungAdult, Youth, Youth, MiddleAged,
YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult], dtype=object)
Levels (4): Index([Youth, YoungAdult, MiddleAged, Senior], dtype=object)
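Note that in current pandas versions the Categorical attributes have been renamed: the integer bin labels are available as codes and the distinct bins as categories. A sketch of the equivalent calls:

>>> cats = pd.cut(ages, bins)
>>> cats.codes           # integer bin labels, e.g. array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1])
>>> cats.categories      # IntervalIndex of the distinct bins
>>> pd.value_counts(cats)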

4.8.6 Detecting and Filtering Outliers:


Filtering or transforming outliers is largely a matter of applying array
operations. Consider a DataFrame with some normally distributed data:
>>> np.random.seed(12345)
>>> data = DataFrame(np.random.randn(1000, 4))
>>> data.describe()


Suppose you wanted to find values in one of the columns exceeding three in
magnitude:
>>> col = data[3]
>>> col[np.abs(col) > 3]

To select all rows having a value exceeding 3 or -3, you can use the any method
on a boolean DataFrame:
>>> data[(np.abs(data) > 3).any(1)]
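A sketch of the selection above plus the common follow-up of capping values outside the interval [-3, 3]; np.sign returns 1 or -1 depending on the sign, and axis=1 is the keyword form of the any(1) call:

>>> import numpy as np
>>> data[(np.abs(data) > 3).any(axis=1)]        # rows containing any |value| > 3
>>> data[np.abs(data) > 3] = np.sign(data) * 3  # cap values to the interval [-3, 3]
>>> data.describe()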

4.8.7 Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
>>> df = DataFrame(np.arange(5 * 4).reshape(5, 4))
>>> sampler = np.random.permutation(5)
>>> sampler
array([1, 0, 2, 3, 4])
That array can then be used in ix-based indexing or the take function:
>>> df


>>> df.take(sampler)

To select a random subset without replacement, one way is to slice off the first
k elements of the array returned by permutation, where k is the desired subset
size. There are much more efficient sampling-without-replacement algorithms.
>>> df.take(np.random.permutation(len(df))[:3])

To generate a sample with replacement, the fastest way is to use np.random.randint to draw random integers:
>>> bag = np.array([5, 7, -1, 6, 4])
>>> sampler = np.random.randint(0, len(bag), size=10)
>>> sampler
array([4, 4, 2, 2, 2, 0, 3, 0, 4, 1])
>>> draws = bag.take(sampler)
>>> draws
array([ 4, 4, -1, -1, -1, 5, 6, 5, 4, 7])
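In current pandas, the sample method offers a more direct route; a sketch (n and replace are standard arguments of sample):

>>> df.sample(n=3)                        # random subset without replacement
>>> choices = pd.Series([5, 7, -1, 6, 4])
>>> choices.sample(n=10, replace=True)    # sample with replacement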

4.9. String Manipulation:


Python has long been a popular data munging language in part due to its ease-
of-use for string and text processing. Most text operations are made simple
with the string object’s built-in methods. For more complex pattern matching
and text manipulations, regular expressions may be needed.
4.9.1 String Object Methods


In many string munging and scripting applications, built-in string methods are
sufficient.
As an example, a comma-separated string can be broken into pieces with split:

>>> val = 'a,b, guido'


>>> val.split(',')
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including newlines):

>>> pieces = [x.strip() for x in val.split(',')]


>>> pieces
['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:
>>> first, second, third = pieces
>>> first + '::' + second + '::' + third
'a::b::guido'

But, this isn’t a practical generic method. A faster and more Pythonic way is to
pass a list or tuple to the join method on the string '::':

>>> '::'.join(pieces)
'a::b::guido'
Other methods are concerned with locating substrings. Using Python’s in
keyword is the best way to detect a substring, though index and find can also
be used:


>>> 'guido' in val
True
>>> val.index(',')
1
>>> val.find(':')
-1

Note the difference between find and index is that index raises an exception if
the string isn’t found (versus returning -1):

>>> val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
----> 1 val.index(':')
ValueError: substring not found

Relatedly, count returns the number of occurrences of a particular substring:

>>> val.count(',')
2

replace will substitute occurrences of one pattern for another. This is commonly used to delete patterns, too, by passing an empty string:
>>> val.replace(',', '::')
'a::b:: guido'
>>> val.replace(',', '')
'ab guido'

Regular expressions can also be used with many of these operations like below.


Python built-in string methods

4.9.2 Regular expressions


Regular expressions provide a flexible way to search or match string patterns
in text. A single expression, commonly called a regex, is a string formed
according to the regular expression language. Python’s built-in re module is
responsible for applying regular expressions to strings.
The re module functions fall into three categories:
1. Pattern matching
2. Substitution
3. Splitting


Example: I wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:
>>> import re
>>> text = "foo bar\t baz \tqux"
>>> re.split('\s+', text)
['foo', 'bar', 'baz', 'qux']

When you call re.split('\s+', text), the regular expression is first compiled, then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object:
>>> regex = re.compile('\s+')
>>> regex.split(text)
['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can use the findall method:
>>> regex.findall(text)
[' ', '\t ', ' \t']

Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.
match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More rigidly, match only matches at the beginning of the string.
Let’s consider a block of text and a regular expression capable of identifying most email addresses:


text = """Dave [email protected]


Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

Using findall on the text produces a list of the e-mail addresses:

>>> regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
search returns a special match object for the first email address in the text.
For the above regex, the match object can only tell us the start and end
position of the pattern in the string:

>>> m = regex.search(text)
>>> m

>>> text[m.start():m.end()]
'dave@google.com'
regex.match returns None, as it only will match if the pattern occurs at the
start of the string:


>>> print(regex.match(text))
None

Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:

>>> print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED

Suppose you wanted to find email addresses and simultaneously segment each
address into its 3 components:
username, domain name, and domain suffix. To do this, put parentheses
around the parts of the pattern to segment:
>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
A match object produced by this modified regex returns a tuple of the pattern components with its groups method:
>>> m = regex.match('wesm@bright.net')
>>> m.groups()
('wesm', 'bright', 'net')
findall returns a list of tuples when the pattern has groups:
>>> regex.findall(text)
[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan',
'yahoo', 'com')]


sub also has access to groups in each match using special symbols like \1, \2,
etc.:

>>> print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com

Regular expression methods


4.9.3 Vectorized String Functions in pandas


Cleaning up a messy dataset for analysis often requires a lot of string munging
and regularization. To complicate matters, a column containing strings will
sometimes have missing data:

String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but this will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series’s str attribute; for example, we could check whether each email address has 'gmail' in it with str.contains:


Regular expressions can be used, too, along with any re options like
IGNORECASE:

There are a couple of ways to do vectorized element retrieval. Either use str.get
or index into the str attribute:
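A sketch pulling these pieces together; the data dict and the reuse of the email regex from above are illustrative:

>>> import numpy as np, re
>>> data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
...                   'Rob': 'rob@gmail.com', 'Wes': np.nan})
>>> data.isnull()
>>> data.str.contains('gmail')                          # NA values are skipped
>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> data.str.findall(pattern, flags=re.IGNORECASE)      # regex with re options
>>> data.str[:5]                                        # vectorized element retrieval by slicing
>>> data.str.get(0)                                     # or retrieve a single element per value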


Partial listing of vectorized string methods:

- END of 4th MODULE-


Acquiring Data with Python


1. Loading from CSV files
Reading data from CSV (comma-separated values) files is a fundamental necessity in Data Science. The pandas library provides features with which we can read a CSV file in full, as well as in parts for only a selected group of columns and rows.

1.1 Input as CSV File


A CSV file is a text file in which the values in the columns are separated by a comma.
Example: Consider the following data present in a file named input.csv. We can create this file using Notepad and save it as input.csv.

1.2 Reading a CSV File


The read_csv function of the pandas library is used to read the content of a CSV file into the Python environment as a pandas DataFrame. The function can read files from the operating system by using the proper path to the file.
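A sketch, assuming input.csv has columns such as id, name, salary, start_date, and dept (the exact columns of the original example file are not shown here):

>>> import pandas as pd
>>> data = pd.read_csv('path/input.csv')
>>> print(data)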


When we execute the above code, it produces the following result. Please note how an additional index column starting with zero has been created by the function.

1.3 Reading Specific Rows


The read_csv function of the pandas library can also be used to read some specific rows for a given column. We slice the result from the read_csv function using the code shown below to get the first 5 rows of the column named salary.
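A sketch, assuming the salary column exists in input.csv as described above:

>>> data = pd.read_csv('path/input.csv')
>>> # slice the first 5 rows of the 'salary' column
>>> print(data[0:5]['salary'])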

When we execute the above code, it produces the following result.


1.4 Reading Specific Columns:


The read_csv function of the pandas library can also be used to read some specific columns. We use the multi-axes indexer .loc for this purpose. We choose to display the salary and name columns for all the rows.
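A sketch, using the column names assumed above:

>>> data = pd.read_csv('path/input.csv')
>>> # use .loc to select all rows for the 'salary' and 'name' columns
>>> print(data.loc[:, ['salary', 'name']])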

When we execute the above code, it produces the following result.

1.5 Reading Specific Columns and Rows


The read_csv function of the pandas library can also be used to read some specific columns and specific rows. We use the multi-axes indexer .loc for this purpose. We choose to display the salary and name columns for some of the rows.
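A sketch, selecting a few row labels (the specific labels 1, 3, and 5 are illustrative):

>>> data = pd.read_csv('path/input.csv')
>>> # select rows with labels 1, 3 and 5 for the 'salary' and 'name' columns
>>> print(data.loc[[1, 3, 5], ['salary', 'name']])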

When we execute the above code, it produces the following result.
