Unit 4

This document covers the fundamentals of data wrangling using Python libraries such as NumPy and Pandas. It includes topics like array manipulation, data aggregation, handling missing data, and hierarchical indexing, as well as advanced techniques like Boolean masking and pivot tables. The document serves as a comprehensive guide for performing data analysis and manipulation efficiently.


UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING

Basics of NumPy arrays – aggregations – computations on arrays – comparisons, masks, boolean logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection – operating on data – missing data – Hierarchical indexing – combining datasets – aggregation and grouping – pivot tables

I. Basics of Numpy arrays:

This section covers NumPy array manipulation: accessing data and subarrays, and splitting, reshaping, and joining arrays.

Basic array manipulations:


Attributes of arrays: determining the size, shape, memory consumption, and data types of arrays
Indexing of arrays: getting and setting the values of individual array elements
Slicing of arrays: getting and setting smaller subarrays within a larger array
Reshaping of arrays: changing the shape of a given array
Joining and splitting of arrays: combining multiple arrays into one, and splitting one array into many

NumPy Array Attributes

1. Creating numpy arrays:

We will create three random arrays: a one-dimensional, a two-dimensional, and a three-dimensional array.


We’ll use NumPy’s random number generator, which we will seed with a set value in order to ensure that the
same random arrays are generated each time this code is run:
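A minimal sketch of this setup; the fixed seed guarantees that the "random" values are the same on every run:

```python
import numpy as np

np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
```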

2. Attributes:

Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the
total size of the array):
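For example, for a seeded three-dimensional array:

```python
import numpy as np

np.random.seed(0)
x3 = np.random.randint(10, size=(3, 4, 5))

print("ndim:  ", x3.ndim)    # number of dimensions -> 3
print("shape: ", x3.shape)   # size of each dimension -> (3, 4, 5)
print("size:  ", x3.size)    # total number of elements -> 60
print("dtype: ", x3.dtype)   # data type of the elements
print("nbytes:", x3.nbytes)  # total memory consumption in bytes
```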
3. Array Indexing: Accessing Single Elements

In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired index in
square brackets, just as with Python lists:
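A short sketch, using small arrays chosen here for illustration:

```python
import numpy as np

x = np.array([5, 0, 3, 3, 7, 9])
print(x[0])     # first element -> 5
print(x[-1])    # negative indices count from the end -> 9

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])
print(x2[0, 0])  # multidimensional access with a comma-separated tuple -> 3
x2[0, 0] = 12    # values can also be set using this index notation
```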
4. Array Slicing: Accessing Subarrays

We can also use square brackets to access subarrays, with the slice notation marked by the colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
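A quick sketch of one-dimensional slicing:

```python
import numpy as np

x = np.arange(10)
print(x[:5])    # first five elements   -> [0 1 2 3 4]
print(x[5:])    # elements after index 5
print(x[4:7])   # middle subarray       -> [4 5 6]
print(x[::2])   # every other element   -> [0 2 4 6 8]
print(x[::-1])  # all elements, reversed (a negative step swaps start/stop)
```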

Multidimensional subarrays

Multidimensional slices work in the same way, with multiple slices separated by commas.
For example:
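A sketch with a small two-dimensional array chosen here for illustration:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
print(x2[:2, :3])     # first two rows, first three columns
print(x2[:3, ::2])    # all rows, every other column
print(x2[::-1, ::-1]) # rows and columns reversed together
```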
5. Accessing array rows and columns

One commonly needed routine is accessing single rows or columns of an array.


We can do this by combining indexing and slicing, using an empty slice marked by a single colon (:):
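For example, with the same small array:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
print(x2[:, 0])  # first column of x2 -> [12 7 1]
print(x2[0, :])  # first row of x2    -> [12 5 2 4]
print(x2[0])     # for rows, the empty slice can be omitted: same as x2[0, :]
```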

6. Subarrays as no-copy views

NumPy array slicing differs from Python list slicing: list slices are copies, whereas NumPy array slices return views into the original array data.
Consider our two-dimensional array from before:
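A minimal sketch showing that modifying a slice modifies the original:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
x2_sub = x2[:2, :2]  # a 2x2 slice -- this is a *view*, not a copy
x2_sub[0, 0] = 99    # modifying the view...
print(x2[0, 0])      # ...changes the original array as well: 99
```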
7. Creating copies of arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an
array or a subarray.
This can be most easily done with the copy() method:
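For example:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
x2_sub_copy = x2[:2, :2].copy()  # an explicit copy of the subarray
x2_sub_copy[0, 0] = 42           # modifying the copy...
print(x2[0, 0])                  # ...leaves the original untouched: 12
```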

8. Reshaping of Arrays

The most flexible way of doing this is with the reshape() method.
For example, if we want to put the numbers 1 through 9 in a 3×3 grid, we can do the following:
Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or
column matrix.
We can do this with the reshape method, or more easily by making use of the newaxis keyword within a slice
operation:
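A sketch of both reshaping patterns:

```python
import numpy as np

grid = np.arange(1, 10).reshape((3, 3))  # the numbers 1 through 9 in a 3x3 grid
print(grid)

x = np.array([1, 2, 3])
row  = x.reshape((1, 3))    # row vector via reshape
row2 = x[np.newaxis, :]     # row vector via the newaxis keyword
col  = x.reshape((3, 1))    # column vector via reshape
col2 = x[:, np.newaxis]     # column vector via newaxis
```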

9. Array Concatenation and Splitting

It’s also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.

Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument.
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and
np.vsplit. For each of these, we can pass a list of indices giving the split points:
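A brief sketch of both operations, with toy arrays chosen here:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(np.concatenate([x, y]))   # -> [1 2 3 3 2 1]

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(np.vstack([x, grid]))     # stack vertically -> shape (3, 3)
print(np.hstack([grid, grid]))  # stack horizontally -> shape (2, 6)

# np.split with a list of indices giving the split points
x1, x2, x3 = np.split(np.arange(8), [3, 5])
print(x1, x2, x3)               # -> [0 1 2] [3 4] [5 6 7]
```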
II. Aggregations: Min, Max, and Everything in Between

1. Summing the Values in an Array


Be careful, though: the sum function and the np.sum function are not identical, which can sometimes lead to
confusion! In particular, their optional arguments have different meanings, and np.sum is aware of multiple
array dimensions, as we will see in the following section.
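A quick sketch of the two functions side by side:

```python
import numpy as np

big = np.arange(1000.0)
print(sum(big))      # Python's built-in sum works on arrays, but is slow
print(np.sum(big))   # np.sum is much faster and understands array dimensions

M = np.arange(6).reshape((2, 3))
print(M.sum())            # 15: by default, sums over the entire array
print(np.sum(M, axis=0))  # np.sum takes an axis argument; built-in sum does not
```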

2. Minimum and Maximum

3. Multidimensional aggregates
Other aggregation functions
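These aggregates can be sketched together on a toy matrix chosen here; the axis keyword specifies the dimension that is collapsed:

```python
import numpy as np

M = np.array([[3, 8, 2, 4],
              [2, 6, 4, 8],
              [6, 1, 3, 8]])
print(M.min(), M.max())  # aggregate over the whole array -> 1 8
print(M.min(axis=0))     # minimum within each column -> [2 1 2 4]
print(M.max(axis=1))     # maximum within each row    -> [8 8 8]
# Most aggregates also have NaN-safe counterparts, e.g. np.nansum, np.nanmin
```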
Example: What Is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let’s consider the heights of all US presidents.
This data is available in the file president_heights.csv, which is a simple comma-separated list of
labels and values:
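The file itself is not reproduced in these notes, so the sketch below first writes a tiny three-row stand-in (heights for the first three presidents, matching the real dataset); with the actual president_heights.csv you would skip the writing step and just read the file:

```python
import numpy as np
import pandas as pd

# Stand-in for president_heights.csv; with the real file, start at read_csv
with open('president_heights.csv', 'w') as f:
    f.write("order,name,height(cm)\n"
            "1,George Washington,189\n"
            "2,John Adams,170\n"
            "3,Thomas Jefferson,189\n")

data = pd.read_csv('president_heights.csv')
heights = np.array(data['height(cm)'])

print("Mean height:   ", heights.mean())
print("Minimum height:", heights.min())
print("Maximum height:", heights.max())
print("Median:        ", np.median(heights))
```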
III. Computation on Arrays: Broadcasting

Another means of vectorizing operations is to use NumPy’s broadcasting functionality. Broadcasting is simply
a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on
arrays of different sizes.

Introducing Broadcasting
Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:
Broadcasting allows these types of binary operations to be performed on arrays of different sizes—for example,
we can just as easily add a scalar (think of it as a zero dimensional array) to an array:
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:
• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is
padded with ones on its leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that
dimension is stretched to match the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
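The three rules can be sketched with small arrays chosen here:

```python
import numpy as np

a = np.arange(3)                 # shape (3,)
M = np.ones((3, 3))              # shape (3, 3)
print(M + a)                     # Rules 1+2: a is padded to (1, 3), then
                                 # stretched to (3, 3) to match M

b = np.arange(3)[:, np.newaxis]  # shape (3, 1)
print(a + b)                     # both arrays are stretched -> shape (3, 3)

# Rule 3: shapes (3, 2) and (3,) are incompatible, so an error is raised
try:
    np.ones((3, 2)) + np.arange(3)
except ValueError as e:
    print("broadcast error:", e)
```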
IV. Comparisons, Masks, and Boolean Logic

This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays. Masking
comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some
criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all
outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
One approach would be to answer these questions by hand: loop through the data, incrementing a counter each time we see a value in the desired range.
For reasons discussed throughout this chapter, such an approach is very inefficient, both in the time spent writing code and in the time spent computing the result.

Comparison Operators as ufuncs


NumPy also implements comparison operators such as < (less than) and > (greater than) as element-wise
ufuncs.
The result of these comparison operators is always an array with a Boolean data type.
All six of the standard comparison operations are available:
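For example, on a small array chosen here:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
print(x < 3)    # [ True  True False False False]
print(x > 3)    # element-wise greater-than
print(x <= 3)   # less-than-or-equal
print(x >= 3)   # greater-than-or-equal
print(x != 3)   # not-equal
print(x == 3)   # equal

print(np.count_nonzero(x < 3))  # how many values are less than 3? -> 2
print(np.sum(x < 3))            # same answer: True is counted as 1
```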
Boolean operators
NumPy overloads these as ufuncs that work element-wise on (usually Boolean) arrays.
Boolean Arrays as Masks:

In the previous section we looked at aggregates computed directly on Boolean arrays.


A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to our x array from before, suppose we want an array of all values in the array that are less than, say,
5:
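A sketch of the masking operation, with the Boolean operators &, |, and ~ combining conditions element-wise (note these are the bitwise operators, not Python's and/or):

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
print(x[x < 5])              # 1-D array of all values less than 5
print(x[(x > 2) & (x < 7)])  # conditions combined with the & operator
```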

V. Fancy Indexing:


We’ll look at another style of array indexing, known as fancy indexing.
Fancy indexing is like the simple indexing we’ve already seen, but we pass arrays of indices in place of single
scalars.
This allows us to very quickly access and modify complicated subsets of an array’s values.

Exploring Fancy Indexing
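A minimal sketch, with a toy array chosen here; note that the shape of the result reflects the shape of the index array, not the shape of the array being indexed:

```python
import numpy as np

x = np.array([51, 92, 14, 71, 60, 20, 82, 86])
ind = [3, 7, 4]
print(x[ind])           # pass an array of indices -> [71 86 60]

ind2 = np.array([[3, 7],
                 [4, 5]])
print(x[ind2])          # result has the shape of the index array: (2, 2)

x[[0, 1]] = 0           # fancy indexing also works for assignment
print(x)
```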


Example: Binning Data
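A sketch of hand-binning data into a histogram using fancy indexing; np.searchsorted finds each point's bin and np.add.at accumulates counts (np.add.at is used rather than counts[i] += 1, which would not handle repeated indices correctly):

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(100)       # 100 draws from a standard normal distribution

bins = np.linspace(-5, 5, 20)  # bin edges
counts = np.zeros_like(bins)

i = np.searchsorted(bins, x)   # find the appropriate bin for each x
np.add.at(counts, i, 1)        # add 1 to the count of each of these bins

print(counts)                  # counts now form a histogram of x
```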
VI. Structured Data: NumPy’s Structured Arrays

This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient
storage for compound, heterogeneous data.

Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d
like to store these values for use in a Python program. It would be possible to store these in three separate
arrays:

In[2]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
       age = [25, 45, 37, 19]
       weight = [55.0, 85.5, 68.0, 61.5]
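The three separate lists give no indication that the values are related; a single structured array keeps each person's fields together. A minimal sketch using a compound dtype:

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# One structured array instead of three parallel lists:
# U10 = unicode string of max length 10, i4 = 4-byte int, f8 = 8-byte float
data = np.zeros(4, dtype={'names':   ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight

print(data['name'])                    # all names
print(data[data['age'] < 30]['name'])  # names of people under 30
```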
VII. Data Manipulation with Pandas

Data Indexing and Selection

Data Selection in Series


A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard
Python dictionary.

Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
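For example, with a small Series chosen here:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data['b'])     # dictionary-style key access -> 0.5
print('a' in data)   # membership tests the index, like a dict -> True
print(data.keys())   # the index, analogous to dict.keys()
data['e'] = 1.25     # assignment with a new key extends the Series
```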

Series as one-dimensional array


A Series builds on this dictionary-like interface and provides array-style item selection via the same basic
mechanisms as NumPy arrays—that is, slices, masking, and fancy indexing. Examples of these are as follows:
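A sketch of these array-style selection mechanisms (note that slicing by the explicit index includes the final label, while slicing by implicit integer position excludes it):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

print(data['a':'c'])   # slicing by explicit index: 'c' IS included
print(data[0:2])       # slicing by implicit integer index: 2 is excluded
print(data[(data > 0.3) & (data < 0.8)])  # masking
print(data[['a', 'd']])                   # fancy indexing with a label list
```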
Indexers: loc, iloc, and ix
These conventions can cause confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index. To avoid this ambiguity, Pandas provides special indexer attributes: loc always references the explicit index, and iloc always references the implicit, Python-style positional index.
A third indexing attribute, ix, was a hybrid of the two; it has since been deprecated and removed from modern Pandas in favor of loc and iloc.
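A short sketch of the loc and iloc indexers (ix is omitted here, as it is no longer available in modern Pandas):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit index
print(data.loc[1])     # -> 'a'
print(data.loc[1:3])   # explicit slice: both endpoints included

# iloc always uses the implicit, Python-style positional index
print(data.iloc[1])    # -> 'b'
print(data.iloc[1:3])  # positional slice: endpoint excluded
```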

Data Selection in DataFrame:

DataFrame as a dictionary
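A DataFrame can be thought of as a dictionary of Series objects sharing the same index. A minimal sketch, with state figures as used in the book's example:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127})
states = pd.DataFrame({'area': area, 'pop': pop})

print(states['area'])       # dictionary-style access to a column Series
states['density'] = states['pop'] / states['area']  # add a new column
print(states.loc['Texas'])  # a single row, accessed via loc
```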
Operating on Data in Pandas:

Pandas inherits NumPy's ufunc behavior: for unary operations like negation and trigonometric functions, the ufuncs preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas automatically aligns indices when passing the objects to the ufunc.
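A minimal sketch of index alignment, with toy Series chosen here:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

aligned = A + B                  # indices are aligned; unmatched labels get NaN
print(aligned)

filled = A.add(B, fill_value=0)  # object methods let you fill missing entries
print(filled)
```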
Ufuncs: Operations Between DataFrame and Series

When you are performing operations between a DataFrame and a Series, the index and column alignment is
similarly maintained.
Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array.

In[15]: A = rng.randint(10, size=(3, 4))  # rng: a seeded np.random.RandomState
        A
Out[15]: array([[3, 8, 2, 4],
                [2, 6, 4, 8],
                [6, 1, 3, 8]])
In[16]: A - A[0]
Out[16]: array([[ 0,  0,  0,  0],
                [-1, -2,  2,  4],
                [ 3, -7,  1,  4]])

subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:


In[17]: df = pd.DataFrame(A, columns=list('QRST'))
        df - df.iloc[0]
Out[17]:    Q  R  S  T
         0  0  0  0  0
         1 -1 -2  2  4
         2  3 -7  1  4
If you would instead like to operate column-wise, you can use the object methods
mentioned earlier, while specifying the axis keyword:
In[18]: df.subtract(df['R'], axis=0)
Out[18]:    Q  R  S  T
         0 -5  0 -6 -4
         1 -4  0 -2  2
         2  5  0  2  7

VIII. Handling Missing Data

A challenge of working in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets have some amount of data missing.

Trade-Offs in Missing Data Conventions


A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame.
Two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a
missing entry.

In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation
of one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention, for example the special NaN value defined by the IEEE floating-point specification.

Missing Data in Pandas


The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.
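A sketch of the main tools for detecting, removing, and filling null values (both None and NaN are treated as missing):

```python
import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 'hello', None])
print(data.isnull())         # Boolean mask marking missing entries
print(data[data.notnull()])  # select only the valid entries
print(data.dropna())         # drop missing values
print(data.fillna(0))        # replace missing values with 0
```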
IX. Hierarchical Indexing:

Hierarchical indexing (also known as multi-indexing) incorporates multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

A Multiply Indexed Series
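A minimal sketch, using state/year population pairs as in the book's example:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010)])
pop = pd.Series([33871648, 37253956, 18976457, 19378102], index=index)

print(pop['California'])  # partial indexing on the first level
print(pop[:, 2010])       # all states for the year 2010
print(pop.unstack())      # turn the inner level into DataFrame columns
```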


Rearranging Multi-Indices

Sorted and unsorted indices


Many of the MultiIndex slicing operations will fail if the index is not sorted.
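A sketch of the failure and the fix, using a deliberately unsorted toy index:

```python
import pandas as pd
import numpy as np

# The first level ('a', 'c', 'b') is deliberately out of lexicographic order
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.arange(6.0), index=index)

try:
    data['a':'b']          # partial slicing of an unsorted MultiIndex fails
    failed = False
except pd.errors.UnsortedIndexError as e:
    failed = True
    print("slicing failed:", e)

data = data.sort_index()   # sort the index first...
print(data['a':'b'])       # ...then slicing works
```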

Stacking and unstacking indices


X. Combining Datasets: Concat and Append
When we concatenate objects whose indices overlap, the result contains repeated index values. While this is valid within DataFrames, the outcome is often undesirable, and pd.concat() gives us a few ways to handle it.

Catching the repeats as an error.


Ignoring the index.
Adding MultiIndex keys
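The three options above can be sketched with two toy DataFrames chosen here:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1']})
y = pd.DataFrame({'A': ['A2', 'A3']})

print(pd.concat([x, y]))                      # indices 0, 1 repeat in the result

try:                                          # catch the repeats as an error
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

print(pd.concat([x, y], ignore_index=True))   # ignore the index: fresh 0..3
print(pd.concat([x, y], keys=['x', 'y']))     # add MultiIndex keys per source
```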

XI. Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(),
mean(), median(), min(), and max()

Simple Aggregation in Pandas
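For a DataFrame, these aggregates operate column-wise by default; a quick sketch with toy data chosen here:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0]})
print(df.mean())        # column-wise mean (the default)
print(df.mean(axis=1))  # row-wise mean via the axis argument
print(df.describe())    # count, mean, std, min, quartiles, max in one call
```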


GroupBy: Split, Apply, Combine

A canonical example of this split-apply-combine operation, where the “apply” is a summation aggregation, is
illustrated in Figure 3-1.
Figure 3-1 makes clear what the GroupBy accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on the
value of the specified key.
• The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
• The combine step merges the results of these operations into an output array.

Here it’s important to realize that the intermediate splits do not need to be explicitly instantiated.
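The whole split-apply-combine pipeline can be sketched in one line on a toy DataFrame chosen here:

```python
import pandas as pd

df = pd.DataFrame({'key':  ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15]})

# split on 'key', apply a sum within each group, combine into one result
sums = df.groupby('key').sum()
print(sums)
```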
The GroupBy object
The GroupBy object is a very flexible abstraction.

Column indexing. The GroupBy object supports column indexing in the same way as
the DataFrame, and returns a modified GroupBy object. For example:
In[14]: planets.groupby('method')
Out[14]: <pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8>
In[15]: planets.groupby('method')['orbital_period']
Out[15]: <pandas.core.groupby.SeriesGroupBy object at 0x117272da0>

Iteration over groups. The GroupBy object supports direct iteration over the groups,
returning each group as a Series or DataFrame:
In[17]: for (method, group) in planets.groupby('method'):
            print("{0:30s} shape={1}".format(method, group.shape))

Dispatch methods. Through some Python class magic, any method not explicitly
implemented by the GroupBy object will be passed through and called on the groups,
whether they are DataFrame or Series objects. For example, you can use the
describe() method of DataFrames to perform a set of aggregations that describe each
group in the data:
In[18]: planets.groupby('method')['year'].describe().unstack()
XII. Pivot Tables

We have seen how the GroupBy abstraction lets us explore relationships within a dataset.
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on
tabular data.
The pivot table takes simple columnwise data as input, and groups the entries into a two-dimensional table that
provides a multidimensional summarization of the data.
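A minimal sketch using a hypothetical four-row miniature of the Titanic-style data often used to illustrate pivot tables: one grouping variable goes down the rows, another across the columns, and each cell holds the aggregate of the chosen value:

```python
import pandas as pd

df = pd.DataFrame({'sex':      ['male', 'female', 'male', 'female'],
                   'class':    ['First', 'First', 'Second', 'Second'],
                   'survived': [0, 1, 1, 1]})

# group by sex down the rows and class across the columns,
# taking the mean of 'survived' within each cell
table = df.pivot_table('survived', index='sex', columns='class', aggfunc='mean')
print(table)
```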
