Unit 4

This document covers the fundamentals of data wrangling using Python libraries such as NumPy and Pandas. It includes topics like array manipulation, data aggregation, handling missing data, and hierarchical indexing, as well as advanced techniques like Boolean masking and pivot tables. The document serves as a comprehensive guide for performing data analysis and manipulation efficiently.


UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING

Basics of NumPy arrays – aggregations – computations on arrays – comparisons, masks, boolean logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection – operating on data – missing data – Hierarchical indexing – combining datasets – aggregation and grouping – pivot tables

I. Basics of Numpy arrays:

This section covers NumPy array manipulation: accessing data and subarrays, and splitting, reshaping, and joining arrays.

Basic array manipulations:


Attributes of arrays: determining the size, shape, memory consumption, and data types of arrays
Indexing of arrays: getting and setting the values of individual array elements
Slicing of arrays: getting and setting smaller subarrays within a larger array
Reshaping of arrays: changing the shape of a given array
Joining and splitting of arrays: combining multiple arrays into one, and splitting one array into many

NumPy Array Attributes

1. Creating numpy arrays:

We will create three random arrays: a one-dimensional, a two-dimensional, and a three-dimensional array.


We’ll use NumPy’s random number generator, which we will seed with a set value in order to ensure that the
same random arrays are generated each time this code is run:
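A minimal sketch of this setup; the fixed seed guarantees that the "random" values are the same on every run:

```python
import numpy as np

np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
```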

2. Attributes:

Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the
total size of the array):
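For example, for a seeded three-dimensional array:

```python
import numpy as np

np.random.seed(0)
x3 = np.random.randint(10, size=(3, 4, 5))

print("ndim:  ", x3.ndim)    # number of dimensions -> 3
print("shape: ", x3.shape)   # size of each dimension -> (3, 4, 5)
print("size:  ", x3.size)    # total number of elements -> 60
print("dtype: ", x3.dtype)   # data type of the elements
print("nbytes:", x3.nbytes)  # total memory consumption in bytes
```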
3. Array Indexing: Accessing Single Elements

In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired index in
square brackets, just as with Python lists:
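A short sketch, using small arrays chosen here for illustration:

```python
import numpy as np

x = np.array([5, 0, 3, 3, 7, 9])
print(x[0])     # first element -> 5
print(x[-1])    # negative indices count from the end -> 9

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])
print(x2[0, 0])  # multidimensional access with a comma-separated tuple -> 3
x2[0, 0] = 12    # values can also be set using this index notation
```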
4. Array Slicing: Accessing Subarrays

We can also use square brackets to access subarrays, with the slice notation marked by the colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
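A quick sketch of one-dimensional slicing:

```python
import numpy as np

x = np.arange(10)
print(x[:5])    # first five elements   -> [0 1 2 3 4]
print(x[5:])    # elements after index 5
print(x[4:7])   # middle subarray       -> [4 5 6]
print(x[::2])   # every other element   -> [0 2 4 6 8]
print(x[::-1])  # all elements, reversed (a negative step swaps start/stop)
```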

Multidimensional subarrays

Multidimensional slices work in the same way, with multiple slices separated by commas.
For example:
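A sketch with a small two-dimensional array chosen here for illustration:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
print(x2[:2, :3])     # first two rows, first three columns
print(x2[:3, ::2])    # all rows, every other column
print(x2[::-1, ::-1]) # rows and columns reversed together
```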
5. Accessing array rows and columns

One commonly needed routine is accessing single rows or columns of an array.


We can do this by combining indexing and slicing, using an empty slice marked by a single colon (:):
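For example, with the same small array:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
print(x2[:, 0])  # first column of x2 -> [12 7 1]
print(x2[0, :])  # first row of x2    -> [12 5 2 4]
print(x2[0])     # for rows, the empty slice can be omitted: same as x2[0, :]
```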

6. Subarrays as no-copy views

NumPy array slicing differs from Python list slicing: list slices are copies, whereas NumPy array slices return views into the original array data.
Consider our two-dimensional array from before:
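A minimal sketch showing that modifying a slice modifies the original:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
x2_sub = x2[:2, :2]  # a 2x2 slice -- this is a *view*, not a copy
x2_sub[0, 0] = 99    # modifying the view...
print(x2[0, 0])      # ...changes the original array as well: 99
```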
7. Creating copies of arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an
array or a subarray.
This can be most easily done with the copy() method:
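For example:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
x2_sub_copy = x2[:2, :2].copy()  # an explicit copy of the subarray
x2_sub_copy[0, 0] = 42           # modifying the copy...
print(x2[0, 0])                  # ...leaves the original untouched: 12
```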

8. Reshaping of Arrays

The most flexible way of doing this is with the reshape() method.
For example, if we want to put the numbers 1 through 9 in a 3×3 grid, we can do the following:
Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or
column matrix.
We can do this with the reshape method, or more easily by making use of the newaxis keyword within a slice
operation:
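A sketch of both reshaping patterns:

```python
import numpy as np

grid = np.arange(1, 10).reshape((3, 3))  # the numbers 1 through 9 in a 3x3 grid
print(grid)

x = np.array([1, 2, 3])
row  = x.reshape((1, 3))    # row vector via reshape
row2 = x[np.newaxis, :]     # row vector via the newaxis keyword
col  = x.reshape((3, 1))    # column vector via reshape
col2 = x[:, np.newaxis]     # column vector via newaxis
```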

9. Array Concatenation and Splitting

It’s also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.

Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument.
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and
np.vsplit. For each of these, we can pass a list of indices giving the split points:
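A brief sketch of both operations, with toy arrays chosen here:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(np.concatenate([x, y]))   # -> [1 2 3 3 2 1]

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(np.vstack([x, grid]))     # stack vertically -> shape (3, 3)
print(np.hstack([grid, grid]))  # stack horizontally -> shape (2, 6)

# np.split with a list of indices giving the split points
x1, x2, x3 = np.split(np.arange(8), [3, 5])
print(x1, x2, x3)               # -> [0 1 2] [3 4] [5 6 7]
```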
II. Aggregations: Min, Max, and Everything in Between

1. Summing the Values in an Array


Be careful, though: the sum function and the np.sum function are not identical, which can sometimes lead to
confusion! In particular, their optional arguments have different meanings, and np.sum is aware of multiple
array dimensions, as we will see in the following section.
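A quick sketch of the two functions side by side:

```python
import numpy as np

big = np.arange(1000.0)
print(sum(big))      # Python's built-in sum works on arrays, but is slow
print(np.sum(big))   # np.sum is much faster and understands array dimensions

M = np.arange(6).reshape((2, 3))
print(M.sum())            # 15: by default, sums over the entire array
print(np.sum(M, axis=0))  # np.sum takes an axis argument; built-in sum does not
```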

2. Minimum and Maximum

3. Multidimensional aggregates
Other aggregation functions
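These aggregates can be sketched together on a toy matrix chosen here; the axis keyword specifies the dimension that is collapsed:

```python
import numpy as np

M = np.array([[3, 8, 2, 4],
              [2, 6, 4, 8],
              [6, 1, 3, 8]])
print(M.min(), M.max())  # aggregate over the whole array -> 1 8
print(M.min(axis=0))     # minimum within each column -> [2 1 2 4]
print(M.max(axis=1))     # maximum within each row    -> [8 8 8]
# Most aggregates also have NaN-safe counterparts, e.g. np.nansum, np.nanmin
```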
Example: What Is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let’s consider the heights of all US presidents.
This data is available in the file president_heights.csv, which is a simple comma-separated list of
labels and values:
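The file itself is not reproduced in these notes, so the sketch below first writes a tiny three-row stand-in (heights for the first three presidents, matching the real dataset); with the actual president_heights.csv you would skip the writing step and just read the file:

```python
import numpy as np
import pandas as pd

# Stand-in for president_heights.csv; with the real file, start at read_csv
with open('president_heights.csv', 'w') as f:
    f.write("order,name,height(cm)\n"
            "1,George Washington,189\n"
            "2,John Adams,170\n"
            "3,Thomas Jefferson,189\n")

data = pd.read_csv('president_heights.csv')
heights = np.array(data['height(cm)'])

print("Mean height:   ", heights.mean())
print("Minimum height:", heights.min())
print("Maximum height:", heights.max())
print("Median:        ", np.median(heights))
```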
III. Computation on Arrays: Broadcasting

Another means of vectorizing operations is to use NumPy’s broadcasting functionality. Broadcasting is simply
a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on
arrays of different sizes.

Introducing Broadcasting
Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:
Broadcasting allows these types of binary operations to be performed on arrays of different sizes—for example,
we can just as easily add a scalar (think of it as a zero dimensional array) to an array:
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:
• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is
padded with ones on its leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that
dimension is stretched to match the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
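The three rules can be sketched with small arrays chosen here:

```python
import numpy as np

a = np.arange(3)                 # shape (3,)
M = np.ones((3, 3))              # shape (3, 3)
print(M + a)                     # Rules 1+2: a is padded to (1, 3), then
                                 # stretched to (3, 3) to match M

b = np.arange(3)[:, np.newaxis]  # shape (3, 1)
print(a + b)                     # both arrays are stretched -> shape (3, 3)

# Rule 3: shapes (3, 2) and (3,) are incompatible, so an error is raised
try:
    np.ones((3, 2)) + np.arange(3)
except ValueError as e:
    print("broadcast error:", e)
```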
IV. Comparisons, Masks, and Boolean Logic

This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays. Masking
comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some
criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all
outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
One approach would be to answer these questions by hand: loop through the data, incrementing a counter each time we see a value in the desired range.
For reasons discussed throughout this chapter, such an approach is very inefficient, both in the time spent writing code and in the time spent computing the result.

Comparison Operators as ufuncs


NumPy also implements comparison operators such as < (less than) and > (greater than) as element-wise
ufuncs.
The result of these comparison operators is always an array with a Boolean data type.
All six of the standard comparison operations are available:
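For example, on a small array chosen here:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
print(x < 3)    # [ True  True False False False]
print(x > 3)    # element-wise greater-than
print(x <= 3)   # less-than-or-equal
print(x >= 3)   # greater-than-or-equal
print(x != 3)   # not-equal
print(x == 3)   # equal

print(np.count_nonzero(x < 3))  # how many values are less than 3? -> 2
print(np.sum(x < 3))            # same answer: True is counted as 1
```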
Boolean operators
NumPy overloads these as ufuncs that work element-wise on (usually Boolean) arrays.
Boolean Arrays as Masks:

In the previous section we looked at aggregates computed directly on Boolean arrays.


A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to our x array from before, suppose we want an array of all values in the array that are less than, say,
5:
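A sketch of the masking operation, with the Boolean operators &, |, and ~ combining conditions element-wise (note these are the bitwise operators, not Python's and/or):

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
print(x[x < 5])              # 1-D array of all values less than 5
print(x[(x > 2) & (x < 7)])  # conditions combined with the & operator
```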

V. Fancy Indexing:


We’ll look at another style of array indexing, known as fancy indexing.
Fancy indexing is like the simple indexing we’ve already seen, but we pass arrays of indices in place of single
scalars.
This allows us to very quickly access and modify complicated subsets of an array’s values.

Exploring Fancy Indexing
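A minimal sketch, with a toy array chosen here; note that the shape of the result reflects the shape of the index array, not the shape of the array being indexed:

```python
import numpy as np

x = np.array([51, 92, 14, 71, 60, 20, 82, 86])
ind = [3, 7, 4]
print(x[ind])           # pass an array of indices -> [71 86 60]

ind2 = np.array([[3, 7],
                 [4, 5]])
print(x[ind2])          # result has the shape of the index array: (2, 2)

x[[0, 1]] = 0           # fancy indexing also works for assignment
print(x)
```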


Example: Binning Data
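A sketch of hand-binning data into a histogram using fancy indexing; np.searchsorted finds each point's bin and np.add.at accumulates counts (np.add.at is used rather than counts[i] += 1, which would not handle repeated indices correctly):

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(100)       # 100 draws from a standard normal distribution

bins = np.linspace(-5, 5, 20)  # bin edges
counts = np.zeros_like(bins)

i = np.searchsorted(bins, x)   # find the appropriate bin for each x
np.add.at(counts, i, 1)        # add 1 to the count of each of these bins

print(counts)                  # counts now form a histogram of x
```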
VI. Structured Data: NumPy’s Structured Arrays

This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient
storage for compound, heterogeneous data.

Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d
like to store these values for use in a Python program. It would be possible to store these in three separate
arrays:

In[2]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
       age = [25, 45, 37, 19]
       weight = [55.0, 85.5, 68.0, 61.5]
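The three separate lists give no indication that the values are related; a single structured array keeps each person's fields together. A minimal sketch using a compound dtype:

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# One structured array instead of three parallel lists:
# U10 = unicode string of max length 10, i4 = 4-byte int, f8 = 8-byte float
data = np.zeros(4, dtype={'names':   ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight

print(data['name'])                    # all names
print(data[data['age'] < 30]['name'])  # names of people under 30
```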
VII. Data Manipulation with Pandas

Data Indexing and Selection

Data Selection in Series


A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard
Python dictionary.

Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
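For example, with a small Series chosen here:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data['b'])     # dictionary-style key access -> 0.5
print('a' in data)   # membership tests the index, like a dict -> True
print(data.keys())   # the index, analogous to dict.keys()
data['e'] = 1.25     # assignment with a new key extends the Series
```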

Series as one-dimensional array


A Series builds on this dictionary-like interface and provides array-style item selection via the same basic
mechanisms as NumPy arrays—that is, slices, masking, and fancy indexing. Examples of these are as follows:
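A sketch of these array-style selection mechanisms (note that slicing by the explicit index includes the final label, while slicing by implicit integer position excludes it):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

print(data['a':'c'])   # slicing by explicit index: 'c' IS included
print(data[0:2])       # slicing by implicit integer index: 2 is excluded
print(data[(data > 0.3) & (data < 0.8)])  # masking
print(data[['a', 'd']])                   # fancy indexing with a label list
```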
Indexers: loc, iloc, and ix
These conventions can cause confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index. To avoid this ambiguity, Pandas provides special indexer attributes: loc always references the explicit index, and iloc always references the implicit, Python-style positional index.
A third indexing attribute, ix, was a hybrid of the two; it has since been deprecated and removed from modern Pandas in favor of loc and iloc.
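A short sketch of the loc and iloc indexers (ix is omitted here, as it is no longer available in modern Pandas):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit index
print(data.loc[1])     # -> 'a'
print(data.loc[1:3])   # explicit slice: both endpoints included

# iloc always uses the implicit, Python-style positional index
print(data.iloc[1])    # -> 'b'
print(data.iloc[1:3])  # positional slice: endpoint excluded
```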

Data Selection in DataFrame:

DataFrame as a dictionary
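A DataFrame can be thought of as a dictionary of Series objects sharing the same index. A minimal sketch, with state figures as used in the book's example:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127})
states = pd.DataFrame({'area': area, 'pop': pop})

print(states['area'])       # dictionary-style access to a column Series
states['density'] = states['pop'] / states['area']  # add a new column
print(states.loc['Texas'])  # a single row, accessed via loc
```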
Operating on Data in Pandas:

Pandas inherits NumPy's ufunc behavior: for unary operations like negation and trigonometric functions, the ufuncs preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas automatically aligns indices when passing the objects to the ufunc.
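A minimal sketch of index alignment, with toy Series chosen here:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

aligned = A + B                  # indices are aligned; unmatched labels get NaN
print(aligned)

filled = A.add(B, fill_value=0)  # object methods let you fill missing entries
print(filled)
```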
Ufuncs: Operations Between DataFrame and Series

When you are performing operations between a DataFrame and a Series, the index and column alignment is
similarly maintained.
Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array.

In[15]: A = rng.randint(10, size=(3, 4))  # rng: a seeded np.random.RandomState
        A
Out[15]: array([[3, 8, 2, 4],
                [2, 6, 4, 8],
                [6, 1, 3, 8]])
In[16]: A - A[0]
Out[16]: array([[ 0,  0,  0,  0],
                [-1, -2,  2,  4],
                [ 3, -7,  1,  4]])

subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:


In[17]: df = pd.DataFrame(A, columns=list('QRST'))
        df - df.iloc[0]
Out[17]:    Q  R  S  T
         0  0  0  0  0
         1 -1 -2  2  4
         2  3 -7  1  4
If you would instead like to operate column-wise, you can use the object methods
mentioned earlier, while specifying the axis keyword:
In[18]: df.subtract(df['R'], axis=0)
Out[18]:    Q  R  S  T
         0 -5  0 -6 -4
         1 -4  0 -2  2
         2  5  0  2  7

VIII. Handling Missing Data

A challenge of working in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets have some amount of data missing.

Trade-Offs in Missing Data Conventions


A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame.
Two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a
missing entry.

In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation
of one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention, for example the special NaN value defined by the IEEE floating-point specification.

Missing Data in Pandas


The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.
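A sketch of the main tools for detecting, removing, and filling null values (both None and NaN are treated as missing):

```python
import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 'hello', None])
print(data.isnull())         # Boolean mask marking missing entries
print(data[data.notnull()])  # select only the valid entries
print(data.dropna())         # drop missing values
print(data.fillna(0))        # replace missing values with 0
```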
IX. Hierarchical Indexing:

Hierarchical indexing (also known as multi-indexing) incorporates multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

A Multiply Indexed Series
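A minimal sketch, using state/year population pairs as in the book's example:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010)])
pop = pd.Series([33871648, 37253956, 18976457, 19378102], index=index)

print(pop['California'])  # partial indexing on the first level
print(pop[:, 2010])       # all states for the year 2010
print(pop.unstack())      # turn the inner level into DataFrame columns
```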


Rearranging Multi-Indices

Sorted and unsorted indices


Many of the MultiIndex slicing operations will fail if the index is not sorted.
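A sketch of the failure and the fix, using a deliberately unsorted toy index:

```python
import pandas as pd
import numpy as np

# The first level ('a', 'c', 'b') is deliberately out of lexicographic order
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.arange(6.0), index=index)

try:
    data['a':'b']          # partial slicing of an unsorted MultiIndex fails
    failed = False
except pd.errors.UnsortedIndexError as e:
    failed = True
    print("slicing failed:", e)

data = data.sort_index()   # sort the index first...
print(data['a':'b'])       # ...then slicing works
```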

Stacking and unstacking indices


X. Combining Datasets: Concat and Append
When we concatenate objects whose indices overlap, the result contains repeated index values. While this is valid within DataFrames, the outcome is often undesirable, and pd.concat() gives us a few ways to handle it.

Catching the repeats as an error.


Ignoring the index.
Adding MultiIndex keys
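The three options above can be sketched with two toy DataFrames chosen here:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1']})
y = pd.DataFrame({'A': ['A2', 'A3']})

print(pd.concat([x, y]))                      # indices 0, 1 repeat in the result

try:                                          # catch the repeats as an error
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

print(pd.concat([x, y], ignore_index=True))   # ignore the index: fresh 0..3
print(pd.concat([x, y], keys=['x', 'y']))     # add MultiIndex keys per source
```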

XI. Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(),
mean(), median(), min(), and max()

Simple Aggregation in Pandas
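For a DataFrame, these aggregates operate column-wise by default; a quick sketch with toy data chosen here:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0]})
print(df.mean())        # column-wise mean (the default)
print(df.mean(axis=1))  # row-wise mean via the axis argument
print(df.describe())    # count, mean, std, min, quartiles, max in one call
```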


GroupBy: Split, Apply, Combine

A canonical example of this split-apply-combine operation, where the “apply” is a summation aggregation, is
illustrated in Figure 3-1.
Figure 3-1 makes clear what the GroupBy accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on the
value of the specified key.
• The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
• The combine step merges the results of these operations into an output array.

Here it’s important to realize that the intermediate splits do not need to be explicitly instantiated.
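The whole split-apply-combine pipeline can be sketched in one line on a toy DataFrame chosen here:

```python
import pandas as pd

df = pd.DataFrame({'key':  ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15]})

# split on 'key', apply a sum within each group, combine into one result
sums = df.groupby('key').sum()
print(sums)
```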
The GroupBy object
The GroupBy object is a very flexible abstraction.

Column indexing. The GroupBy object supports column indexing in the same way as
the DataFrame, and returns a modified GroupBy object. For example:
In[14]: planets.groupby('method')
Out[14]: <pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8>
In[15]: planets.groupby('method')['orbital_period']
Out[15]: <pandas.core.groupby.SeriesGroupBy object at 0x117272da0>

Iteration over groups. The GroupBy object supports direct iteration over the groups,
returning each group as a Series or DataFrame:
In[17]: for (method, group) in planets.groupby('method'):
            print("{0:30s} shape={1}".format(method, group.shape))

Dispatch methods. Through some Python class magic, any method not explicitly
implemented by the GroupBy object will be passed through and called on the groups,
whether they are DataFrame or Series objects. For example, you can use the
describe() method of DataFrames to perform a set of aggregations that describe each
group in the data:
In[18]: planets.groupby('method')['year'].describe().unstack()
XII. Pivot Tables

We have seen how the GroupBy abstraction lets us explore relationships within a dataset.
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on
tabular data.
The pivot table takes simple columnwise data as input, and groups the entries into a two-dimensional table that
provides a multidimensional summarization of the data.
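A minimal sketch using a hypothetical four-row miniature of the Titanic-style data often used to illustrate pivot tables: one grouping variable goes down the rows, another across the columns, and each cell holds the aggregate of the chosen value:

```python
import pandas as pd

df = pd.DataFrame({'sex':      ['male', 'female', 'male', 'female'],
                   'class':    ['First', 'First', 'Second', 'Second'],
                   'survived': [0, 1, 1, 1]})

# group by sex down the rows and class across the columns,
# taking the mean of 'survived' within each cell
table = df.pivot_table('survived', index='sex', columns='class', aggfunc='mean')
print(table)
```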
