MODULE 4
These functions are meant to convert text data into a DataFrame. Their optional arguments fall into a few categories:
Indexing: can treat one or more columns as the returned DataFrame's index, and can take column names from the file, from the user, or use none at all.
Type inference and data conversion: includes user-defined value conversions and custom lists of missing-value markers.
Datetime parsing: includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
Iterating: support for iterating over chunks of very large files.
Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
Some of these functions, like pandas.read_csv, perform type inference, because the column data types
are not part of the data format. That means you don’t necessarily have to specify which columns are
numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, have the data
types stored in the format. Handling dates and other custom types can require extra effort.
A file will not always have a header row. Consider this file:
To read this file, we have a couple of options. You can allow pandas to assign default column names, or
you can specify names yourself:
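A minimal sketch of both options (the filename ex2.csv and its five-column layout are assumptions consistent with the discussion that follows):
>>> pd.read_csv('ex2.csv', header=None)                            # pandas assigns default names 0, 1, 2, ...
>>> pd.read_csv('ex2.csv', names=['a', 'b', 'c', 'd', 'message'])  # supply your own column names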
Suppose we wanted the message column to be the index of the returned DataFrame. You can either indicate that you want the column at index 4 or the column named 'message' using the index_col argument:
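For example (again assuming the ex2.csv layout above):
>>> names = ['a', 'b', 'c', 'd', 'message']
>>> pd.read_csv('ex2.csv', names=names, index_col='message')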
In the event that you want to form a hierarchical index from multiple columns, pass a list of column
numbers or names:
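A sketch, assuming a file csv_mindex.csv whose first two columns are named key1 and key2:
>>> parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])
>>> parsed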
In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to
separate fields. Consider a text file that looks like this:
Here the fields are separated by a variable amount of whitespace. In these cases, you can pass a regular expression as a delimiter for read_table. The delimiter can be expressed by the regular expression \s+, so we then have:
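For example (the filename ex3.txt is an assumption):
>>> result = pd.read_csv('ex3.txt', sep=r'\s+')
>>> result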
The parser functions have many additional arguments to help you handle the wide variety of exceptional file formats that occur.
For example, you can skip the first, third, and fourth rows of a file with skiprows:
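A sketch (the filename ex4.csv is an assumption):
>>> pd.read_csv('ex4.csv', skiprows=[0, 2, 3])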
Handling missing values is an important part of the file reading process. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:
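The na_values option can take a list of strings to treat as missing, or a dict giving different sentinels per column. A sketch (the filename and column names are assumptions):
>>> result = pd.read_csv('ex5.csv', na_values=['NULL'])
>>> sentinels = {'message': ['foo', 'NA'], 'something': ['two']}   # assumed column names
>>> pd.read_csv('ex5.csv', na_values=sentinels)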
When processing very large files, or figuring out the right set of arguments to correctly process a large file, you may want to read in only a small piece of the file or iterate through smaller chunks of it.
Before we look at a large file, we make the pandas display settings more compact:
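One way to do this, as a sketch using the display options:
>>> pd.options.display.max_rows = 10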
If you want to only read a small number of rows (avoiding reading the entire file), specify that with
nrows:
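For example (the filename ex6.csv is an assumption):
>>> pd.read_csv('ex6.csv', nrows=5)
To read the file in pieces instead, specify a chunksize as a number of rows and iterate over the result:
>>> chunker = pd.read_csv('ex6.csv', chunksize=1000)
>>> for piece in chunker:
....:     print(len(piece))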
Data can also be exported to a delimited format. Let’s consider one of the CSV files read before:
Using DataFrame’s to_csv method, we can write the data out to a comma-separated file:
Missing values appear as empty strings in the output. You might want to denote them by some other
sentinel value:
With no other options specified, both the row and column labels are written. Both of these can be
disabled:
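A sketch covering the three variants described above, writing to sys.stdout so the text result is simply printed (assuming the frame read earlier is bound to the name data):
>>> import sys
>>> data.to_csv('out.csv')                               # write to a file
>>> data.to_csv(sys.stdout, na_rep='NULL')               # denote missing values by 'NULL'
>>> data.to_csv(sys.stdout, index=False, header=False)   # drop row and column labels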
For any file with a single-character delimiter, you can use Python’s built-in csv
module. To use it, pass any open file or file-like object to csv.reader:
Iterating through the reader like a file yields lists of values with any quote characters removed:
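A minimal sketch (the filename ex7.csv is an assumption):
>>> import csv
>>> f = open('ex7.csv')
>>> reader = csv.reader(f)
>>> for line in reader:
....:     print(line)
>>> f.close()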
CSV files come in many different flavors. To define a new format with a different
delimiter, string quoting convention, or line terminator, we define a simple subclass of
csv.Dialect:
We can also give individual CSV dialect parameters as keywords to csv.reader without
having to define a subclass:
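A sketch of both approaches, reusing an open file object f (the particular dialect options shown are assumptions):
>>> class my_dialect(csv.Dialect):
....:     lineterminator = '\n'
....:     delimiter = ';'
....:     quotechar = '"'
....:     quoting = csv.QUOTE_MINIMAL
>>> reader = csv.reader(f, dialect=my_dialect)
>>> reader = csv.reader(f, delimiter='|')   # individual options as keywords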
To write delimited files manually, you can use csv.writer. It accepts an open, writable
file object and the same dialect and format options as csv.reader:
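For example (the filename and row contents are assumptions):
>>> with open('mydata.csv', 'w') as f:
....:     writer = csv.writer(f, dialect=my_dialect)
....:     writer.writerow(('one', 'two', 'three'))
....:     writer.writerow(('1', '2', '3'))
....:     writer.writerow(('4', '5', '6'))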
The basic JSON types are objects (dicts), arrays (lists), strings, numbers, booleans,
and nulls. All of the keys in an object must be strings. There are several Python
libraries for reading and writing JSON data. To convert a JSON string to
Python form, use json.loads:
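For example (the JSON string here is a made-up sketch):
>>> import json
>>> obj = '{"name": "Wes", "cities_lived": ["Akron", "Nashville"]}'
>>> result = json.loads(obj)
>>> result
{'name': 'Wes', 'cities_lived': ['Akron', 'Nashville']}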
The default options for pandas.read_json assume that each object in the JSON
array is a row in the table:
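A sketch, assuming a file example.json containing a JSON array of records:
>>> data = pd.read_json('example.json')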
Because failures has many columns, pandas inserts a line break character \.
We proceed to do some data cleaning and analysis, like computing the number of bank failures
by year:
Using lxml.objectify, we parse the file and get a reference to the root node of
the XML file with getroot:
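A sketch, assuming an XML document stored at an assumed path 'Performance_MNR.xml':
>>> from lxml import objectify
>>> parsed = objectify.parse(open('Performance_MNR.xml'))
>>> root = parsed.getroot()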
You can read any “pickled” object stored in a file by using the built-in pickle
directly, or even more conveniently using pandas.read_pickle:
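For example (frame is any DataFrame; the filename is an assumption):
>>> frame.to_pickle('frame_pickle')
>>> pd.read_pickle('frame_pickle')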
pandas has built-in support for two more binary data formats: HDF5 and MessagePack.
4.3.1 Using HDF5 Format
HDF5 is a well-regarded file format intended for storing large quantities of
scientific array data. It is available as a C library, and it has interfaces
available in many other languages, including Java, Julia, MATLAB, and
Python. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file
can store multiple datasets and supporting metadata. Compared with simpler
formats, HDF5 supports on-the-fly compression with a variety of compression
modes, enabling data with repeated patterns to be stored more efficiently.
HDF5 can be a good choice for working with very large datasets that don’t fit
into memory, as you can efficiently read and write small sections of much
larger arrays.
While it’s possible to directly access HDF5 files using either the PyTables or
h5py libraries, pandas provides a high-level interface that simplifies storing
Series and DataFrame objects. The HDFStore class works like a dict and
handles the low-level details:
Objects contained in the HDF5 file can then be retrieved with the same dict-
like API:
HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is
generally slower, but it supports query operations using a special syntax:
put is an explicit version of the store['obj2'] = frame assignment, but it allows us to set other options, like the storage format.
The pandas.read_hdf function gives you a shortcut to these tools:
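A condensed sketch of the workflow described above (requires the PyTables package; frame is any DataFrame, and the filename, keys, and query conditions are assumptions):
>>> store = pd.HDFStore('mydata.h5')
>>> store['obj1'] = frame                        # dict-like storage
>>> store['obj1']                                # dict-like retrieval
>>> store.put('obj2', frame, format='table')     # explicit put with the 'table' schema
>>> store.select('obj2', where=['index < 10'])   # query syntax works only with 'table'
>>> store.close()
>>> pd.read_hdf('mydata.h5', 'obj2', where=['index < 5'])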
If you are reading multiple sheets in a file, then it is faster to create the
ExcelFile, but you can also simply pass the filename to pandas.read_excel:
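A sketch of both approaches (the filename and sheet name are assumptions; reading Excel files also requires the openpyxl or xlrd package):
>>> xlsx = pd.ExcelFile('ex1.xlsx')
>>> frame = pd.read_excel(xlsx, 'Sheet1')
>>> frame = pd.read_excel('ex1.xlsx', 'Sheet1')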
Many websites have public APIs providing data feeds via JSON or some other
format.
There are a number of ways to access these APIs from Python; one easy-to-use
method that I recommend is the requests package.
To find the last 30 GitHub issues for pandas on GitHub, we can make a GET
HTTP request using the add-on requests library:
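A sketch using the public GitHub issues endpoint (the selection of fields is an assumption):
>>> import requests
>>> url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
>>> resp = requests.get(url)
>>> data = resp.json()                       # list of dicts, one per issue
>>> issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])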
Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return
a list of tuples when selecting data from a table:
You can pass the list of tuples to the DataFrame constructor, but you also need
the column names, contained in the cursor’s description attribute:
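A minimal sketch using the built-in sqlite3 driver (the database file and table name are assumptions):
>>> import sqlite3
>>> con = sqlite3.connect('mydata.sqlite')
>>> cursor = con.execute('select * from test')
>>> rows = cursor.fetchall()
>>> pd.DataFrame(rows, columns=[x[0] for x in cursor.description])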
>>> df1
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
>>> df2
   data2 key
0      0   a
1      1   b
2      2   d
Note that the above example didn't specify which column to join on. If not specified, merge uses the overlapping column names as the keys. It's good practice to specify the key explicitly, though:
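For example, using the df1 and df2 shown above:
>>> pd.merge(df1, df2, on='key')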
If the column names are different in each object, we can specify them
separately:
>>> df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
....: 'data1': range(7)})
>>> df4 = DataFrame({'rkey': ['a', 'b', 'd'],
....: 'data2': range(3)})
>>> pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Notice that in the above example the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in
the result are the intersection. Other possible options are 'left', 'right', and
'outer'. The outer join takes the union of the keys, combining the effect of
applying both left and right joins:
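For example:
>>> pd.merge(df1, df2, how='outer')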
>>> df1
>>> df2
Many-to-many joins form the Cartesian product of the rows. Since there were
3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in
the result. The join method only affects the distinct key values appearing in the
result:
>>> pd.merge(df1, df2, how='inner')
By default concat works along axis=0, producing another Series. If you pass
axis=1, the result will instead be a DataFrame (axis=1 is the columns):
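A minimal sketch with three Series having non-overlapping indexes (the data here is an assumption):
>>> s1 = Series([0, 1], index=['a', 'b'])
>>> s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
>>> s3 = Series([5, 6], index=['f', 'g'])
>>> pd.concat([s1, s2, s3])           # default axis=0: one longer Series
>>> pd.concat([s1, s2, s3], axis=1)   # axis=1: a DataFrame with one column per input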
Example:
>>> data = DataFrame(np.arange(6).reshape((2, 3)),
....: index=pd.Index(['Ohio', 'Colorado'], name='state'),
....: columns=pd.Index(['one', 'two', 'three'], name='number'))
>>> data
Using the stack method on this data pivots the columns into the rows,
producing a Series:
>>> result = data.stack()
>>> result
From a hierarchically-indexed Series, you can rearrange the data back into a
DataFrame with unstack:
>>> result.unstack()
By default the innermost level is unstacked (same with stack). You can unstack
a different level by passing a level number or name:
>>> result.unstack(0)
>>> result.unstack('state')
Unstacking might introduce missing data if all of the values in the level aren’t
found in each of the subgroups:
>>> s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
>>> s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
>>> data2 = pd.concat([s1, s2], keys=['one', 'two'])
>>> data2.unstack()
>>> data2.unstack().stack(dropna=False)
In short, it combines the year and quarter columns to create a kind of time
interval type.
Now, ldata looks like:
This is the so-called long format for multiple time series, or other
observational data with two or more keys (here, our keys are date and item).
Each row in the table represents a single observation.
Data is frequently stored this way in relational databases like MySQL, as a
fixed schema (column names and data types) allows the number of distinct
values in the item column to change as data is added to the table. In the
previous example, date and item would usually be the primary keys (in
relational database parlance), offering both relational integrity and easier joins.
In some cases, the data may be more difficult to work with in this format; you
might prefer to have a DataFrame containing one column per distinct item value, indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation:
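A sketch of that call, assuming ldata has columns date, item, and value (the value column name is an assumption):
>>> pivoted = ldata.pivot(index='date', columns='item', values='value')
>>> pivoted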
The 'key' column may be a group indicator, and the other columns are data
values.
When using pandas.melt, we must indicate which columns (if any) are group
indicators.
Let’s use 'key' as the only group indicator here:
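A sketch, assuming a small DataFrame with a key column plus a couple of value columns (the frame itself is an assumption):
>>> df = DataFrame({'key': ['foo', 'bar', 'baz'],
....:               'A': [1, 2, 3],
....:               'B': [4, 5, 6]})
>>> melted = pd.melt(df, ['key'])
>>> melted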
Since the result of pivot creates an index from the column used as the row
labels, we may want to use reset_index to move the data back into a column:
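For example, continuing the sketch above by reshaping melted back and then moving the key values into a column:
>>> reshaped = melted.pivot(index='key', columns='variable', values='value')
>>> reshaped.reset_index()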
Both of these methods by default consider all of the columns; alternatively, you
can specify any subset of them to detect duplicates. Suppose we had an
additional column of values and wanted to filter duplicates only based on the
'k1' column:
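A sketch, assuming the DataFrame data has a k1 column and we add a v1 column of values:
>>> data['v1'] = range(len(data))
>>> data.drop_duplicates(['k1'])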
Suppose you wanted to add a column indicating the type of animal that each
food came from. Let’s write down a mapping of each distinct meat type to the
kind of animal:
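A sketch, assuming the DataFrame has a food column of meat names (the exact mapping is an assumption):
>>> meat_to_animal = {
....:     'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
....:     'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'
....: }
>>> data['animal'] = data['food'].str.lower().map(meat_to_animal)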
The -999 values might be sentinel values for missing data. To replace these
with NA values that pandas understands, we can use replace, producing a new
Series:
>>> data.replace(-999, np.nan)
If you want to replace multiple values at once, you instead pass a list and then the substitute value:
>>> data.replace([-999, -1000], np.nan)
Example:
>>> data = DataFrame(np.arange(12).reshape((3, 4)),
.....: index=['Ohio', 'Colorado', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])
If you want to create a transformed version of a data set without modifying the
original, a useful method is rename:
>>> data.rename(index=str.title, columns=str.upper)
rename can be used in conjunction with a dict-like object providing new values
for a subset of the axis labels:
>>> data.rename(index={'OHIO': 'INDIANA'},
.....: columns={'three': 'peekaboo'})
>>> ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60
and older. To do so, you have to use cut, a function in pandas:
>>> bins = [18, 25, 35, 60, 100]
>>> cats = pd.cut(ages, bins)
>>> cats
The object pandas returns is a special Categorical object. You can treat it like
an array of strings indicating the bin name; internally it contains a levels array
indicating the distinct category names along with a labeling for the ages data in
the labels attribute:
>>> cats.labels
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1])
>>> cats.levels
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)
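In current pandas releases the same information is exposed under different attribute names; a minimal equivalent, assuming pandas 1.0 or later:
>>> cats.codes          # integer codes, equivalent to the labels shown above
>>> cats.categories     # the distinct bins, equivalent to the levels shown above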
>>> pd.value_counts(cats)
You can also pass your own bin names by passing a list or array to the labels
option:
>>> group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
>>> pd.cut(ages, bins, labels=group_names)
Categorical:
array([Youth, Youth, Youth, YoungAdult, Youth, Youth, MiddleAged,
YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult], dtype=object)
Levels (4): Index([Youth, YoungAdult, MiddleAged, Senior], dtype=object)
Suppose you wanted to find values in one of the columns exceeding three in
magnitude:
>>> col = data[3]
>>> col[np.abs(col) > 3]
To select all rows having a value exceeding 3 or -3, you can use the any method
on a boolean DataFrame:
>>> data[(np.abs(data) > 3).any(1)]
>>> sampler = np.random.permutation(len(df))
>>> df.take(sampler)
To select a random subset without replacement, one way is to slice off the first
k elements of the array returned by permutation, where k is the desired subset
size. There are much more efficient sampling-without-replacement algorithms.
>>> df.take(np.random.permutation(len(df))[:3])
In many string munging and scripting applications, built-in string methods are
sufficient.
As an example, a comma-separated string can be broken into pieces with split:
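A minimal setup used in the examples that follow (the exact string is an assumption consistent with the outputs shown):
>>> val = 'a,b, guido'
>>> pieces = [x.strip() for x in val.split(',')]
>>> pieces
['a', 'b', 'guido']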
Joining the pieces back together with the + operator isn't a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':
>>> '::'.join(pieces)
'a::b::guido'
Other methods are concerned with locating substrings. Using Python’s in
keyword is the best way to detect a substring, though index and find can also
be used:
Note the difference between find and index is that index raises an exception if
the string isn’t found (versus returning -1):
>>> val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
----> 1 val.index(':')
ValueError: substring not found
>>> val.count(',')
2
replace will substitute occurrences of one pattern for another. This is
commonly used to delete patterns, too, by passing an empty string:
>>> val.replace(',', '::')
'a::b:: guido'
>>> val.replace(',', '')
'ab guido'
Regular expressions can also be used with many of these operations like below.
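The examples below split a short string on runs of whitespace (the exact string is an assumption consistent with the results shown):
>>> import re
>>> text = 'foo bar\t baz \tqux'
>>> re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']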
When you call re.split('\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object:
>>> regex = re.compile('\s+')
>>> regex.split(text)
['foo', 'bar', 'baz', 'qux']
If, instead, you wanted to get a list of all patterns matching the regex, you can use the findall method:
>>> regex.findall(text)
[' ', '\t ', ' \t']
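The following examples scan a small block of text containing names and email addresses (the exact text is an assumption consistent with the results shown):
>>> text = """Dave dave@google.com
....: Steve steve@gmail.com
....: Rob rob@gmail.com
....: Ryan ryan@yahoo.com"""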
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
>>> regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
search returns a special match object for the first email address in the text.
For the above regex, the match object can only tell us the start and end
position of the pattern in the string:
>>> m = regex.search(text)
>>> m
>>> text[m.start():m.end()]
'dave@google.com'
regex.match returns None, as it only will match if the pattern occurs at the
start of the string:
Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:
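Continuing with the text and regex above, a sketch:
>>> regex.sub('REDACTED', text)
'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED'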
Suppose you wanted to find email addresses and simultaneously segment each
address into its 3 components:
username, domain name, and domain suffix. To do this, put parentheses
around the parts of the pattern to segment:
>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
A match object produced by this modified regex returns a tuple of the pattern components with its groups method:
>>> m = regex.match('wesm@bright.net')
>>> m.groups()
('wesm', 'bright', 'net')
findall returns a list of tuples when the pattern has groups:
>>> regex.findall(text)
[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan',
'yahoo', 'com')]
sub also has access to groups in each match using special symbols like \1, \2,
etc.:
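For example, using the grouped regex and text above:
>>> print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com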
String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but they will fail on the
NA (null) values. To cope with this, Series has array-oriented methods for string
operations that skip NA values. These are accessed through Series’s str
attribute; for example, we could check whether each email address has 'gmail'
in it with str.contains:
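A sketch, assuming a small Series of email addresses with one missing value (the data is an assumption):
>>> data = Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
....:              'Rob': 'rob@gmail.com', 'Wes': np.nan})
>>> data.str.contains('gmail')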
Regular expressions can be used, too, along with any re options like
IGNORECASE:
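For example, reusing the grouped email pattern from above:
>>> data.str.findall(pattern, flags=re.IGNORECASE)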
There are a couple of ways to do vectorized element retrieval. Either use str.get
or index into the str attribute:
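A sketch of both, continuing with the same Series:
>>> data.str.get(0)    # first character of each address
>>> data.str[:5]       # first five characters of each address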
When we execute code like the above, note that pandas automatically creates a zero-based integer index for the result whenever no explicit index is supplied.