
UNIT-II PYTHON LIBRARIES FOR DATA WRANGLING

Syllabus

Basics of Numpy arrays - aggregations - computations on arrays - comparisons, masks, boolean logic - fancy indexing - structured arrays - Data manipulation with Pandas - data indexing and selection - operating on data - missing data - Hierarchical indexing - combining datasets - aggregation and grouping - pivot tables.

2.1 Data Wrangling


• Data wrangling is the process of transforming data from its original "raw" form into a more digestible format and organizing data sets from various sources into a single coherent whole for further processing.
• Data wrangling is also called data munging.
• The primary purpose of data wrangling is to get data into coherent shape; in other words, to make raw data usable so that it can support further processing.
• Data wrangling covers the following processes:
1. Gathering data from various sources into one place.
2. Piecing the data together according to the determined setting.
3. Cleaning the data of noise and of erroneous or missing elements.
• Data wrangling is the process of cleaning, structuring and enriching raw data into a
desired format for better decision making in less time.
• There are typically six iterative steps that make up the data wrangling process:
1. Discovering: Before you can dive deeply, you must better understand what is in
your data, which will inform how you want to analyze it. How you wrangle customer
data, for example, may be informed by where they are located, what they bought, or
what promotions they received.
2. Structuring: This means organizing the data, which is necessary because raw data comes in many different shapes and sizes. A single column may turn into several rows for easier analysis; one column may become two. Data is rearranged to make computation and analysis easier.
3. Cleaning: What happens when errors and outliers skew your data? You clean the
data. What happens when state data is entered as AP or Andhra Pradesh or Arunachal
Pradesh? You clean the data. Null values are changed and standard formatting
implemented, ultimately increasing data quality.
4. Enriching: Here you take stock of your data and strategize about how other additional data might augment it. Questions asked during this data wrangling step might be: what new types of data can I derive from what I already have, or what other information would better inform my decision making about this current data?
5. Validating: Validation rules are repetitive programming sequences that verify data consistency, quality, and security. Examples of validation include ensuring uniform distribution of attributes that should be distributed normally (e.g. birth dates) or confirming the accuracy of fields through a check across data.
6. Publishing: Analysts prepare the wrangled data for use downstream, whether by a particular user or software, and document any particular steps taken or logic used to wrangle said data. Data wrangling gurus understand that implementation of insights relies upon the ease with which the wrangled data can be accessed and utilized by others.

2.2 Introduction to Python


• Python is a high-level scripting language which can be used for a wide variety of text
processing, system administration and internet-related tasks.
• Python is a true object-oriented language, and is available on a wide variety of
platforms.
• Python was developed in the early 1990s by Guido van Rossum, then at CWI in Amsterdam, and later at CNRI in Virginia.
• Python 3.0 was released in 2008.
• Python statements do not need to end with a special character.
• Python relies on modules, that is, self-contained programs which define a variety of
functions and data types.
• A module is a file containing Python definitions and statements. The file name is the
module name with the suffix .py appended.
• Within a module, the module's name (as a string) is available as the value of the global variable __name__.
• If a module is executed directly, however, the value of the global variable __name__ will be "__main__".
• Modules can contain executable statements aside from definitions. These are
executed only the first time the module name is encountered in an import statement as
well as if the file is executed as a script.
• Integrated Development Environment (IDE) is the basic interpreter and editor
environment that you can use along with Python. This typically includes an editor for
creating and modifying programs, a translator for executing programs, and a program
debugger. A debugger provides a means of taking control of the execution of a program
to aid in finding program errors.
• Python is most commonly translated by use of an interpreter. It provides the very
useful ability to execute in interactive mode. The window that provides this interaction
is referred to as the Python shell.
• Python supports two basic modes: normal mode and interactive mode.
• Normal mode: The normal mode is the mode where finished .py script files are run in the Python interpreter. This mode is also called script mode.
• Interactive mode is a command line shell which gives immediate feedback for each
statement, while running previously fed statements in active memory.
• Start the Python interactive interpreter by typing python with no arguments at the
command line.
• To access the Python shell, open the terminal of your operating system and then type
"python". Press the enter key and the Python shell will appear.
C:\Windows\system32>python
Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit
(AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information.
>>>
• The >>> indicates that the Python shell is ready to execute and send your commands
to the Python interpreter. The result is immediately displayed on the Python shell as
soon as the Python interpreter interprets the command.
• For example, to print the text "Hello World", we can type the following:
>>> print("Hello World")
Hello World
>>>
• In script mode, a file must be created and saved before executing the code to get
results. In interactive mode, the result is returned immediately after pressing the enter
key.
• In script mode, you are provided with a direct way of editing your code. This is not
possible in interactive mode.
Features of Python Programming
1. Python is a high-level, interpreted, interactive and object-oriented scripting
language.
2. It is simple and easy to learn.
3. It is portable.
4. Python is free and open source programming language.
5. Python can perform complex tasks using a few lines of code.
6. Python can run equally on different platforms such as Windows, Linux, UNIX, Macintosh, etc.
7. It provides a vast range of libraries for various fields such as machine learning, web development, and scripting.
Advantages and Disadvantages of Python
Advantages of Python
• Ease of programming
• Minimizes the time to develop and maintain code
• Modular and object-oriented
• Large community of users
• A large standard and user-contributed library
Disadvantages of Python
• Interpreted and therefore slower than compiled languages
• Decentralized package ecosystem

2.2 Numpy
• NumPy, short for Numerical Python, is the core library for scientific computing in
Python. It has been designed specifically for performing basic and advanced array
operations. It primarily supports multi-dimensional arrays and vectors for complex
arithmetic operations.
• A library is a collection of files (called modules) that contains functions for use by other programs. A Python library is a reusable chunk of code that you may want to include in your programs.
• Many popular Python libraries are NumPy, SciPy, Pandas and Scikit-Learn. Python
visualization libraries are matplotlib and Seaborn.
• NumPy has risen to become one of the most popular Python scientific computing libraries.
• NumPy's multidimensional array can perform very large calculations much more
easily and efficiently than using the Python standard data types.
• To get started, NumPy provides many resources on its website, including documentation and tutorials.
• NumPy (Numerical Python) is a perfect tool for scientific computing and performing
basic and advanced array operations.
• The library offers many handy features for performing operations on n-dimensional arrays and matrices in Python. It helps to process arrays that store values of the same data type and makes performing math operations on arrays easier. In fact, the vectorization of mathematical operations on the NumPy array type increases performance and accelerates execution time.
• Numpy is the core library for scientific computing in Python. It provides a high
performance multidimensional array object and tools for working with these arrays.
• NumPy is the fundamental package needed for scientific computing with Python. It
contains:
a) A powerful N-dimensional array object
b) Basic linear algebra functions
c) Basic Fourier transforms
d) Sophisticated random number capabilities
e) Tools for integrating Fortran code
f) Tools for integrating C/C++ code.
• NumPy is an extension package to Python for array programming. It provides "closer
to the hardware" optimization, which in Python means C implementation.
2.3 Basics of Numpy Arrays
• A NumPy array is a powerful N-dimensional array object organized in the form of rows and columns. We can initialize NumPy arrays from nested Python lists and access their elements. A NumPy array is a collection of elements that have the same data type.
• A one-dimensional NumPy array can be thought of as a vector, a two-dimensional
array as a matrix (i.e., a set of vectors), and a three-dimensional array as a tensor (i.e.,
a set of matrices).
• To define an array manually, we can use the np.array() function.
• Basic array manipulations are as follows:
1. Attributes of arrays: Determining the size, shape, memory consumption, and data types of arrays.
2. Indexing of arrays: Getting and setting the values of individual array elements.
3. Slicing of arrays: Getting and setting smaller subarrays within a larger array.
4. Reshaping of arrays: Changing the shape of a given array.
5. Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array into many.
a) Attributes of array
• In Python, arrays from the NumPy library, called N-dimensional arrays or the
ndarray, are used as the primary data structure for representing data.
• The main data structure in NumPy is the ndarray, which is a shorthand name for N-
dimensional array. When working with NumPy, data in an ndarray is simply referred to
as an array. It is a fixed-sized array in memory that contains data of the same type,
such as integers or floating point values.
• The data type supported by an array can be accessed via the "dtype" attribute of the array. The dimensions of an array can be accessed via the "shape" attribute, which returns a tuple describing the length of each dimension.
• Array attributes are essential to find out the shape, dimension, item size etc.
• ndarray.shape: This attribute gives the array dimensions; assigning to it can also be used to resize the array. Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the total size of the array).
• ndarray.size: The total number of elements of the array. This is equal to the product of the elements of the array's shape.
• ndarray.dtype: An object describing the data type of the elements in the array. Recall that NumPy's ND-arrays are homogeneous: they can only possess elements of a uniform data type.
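• A short sketch of these attributes in use (the array values here are purely illustrative):

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.ndim)    # 2, the number of dimensions
print(arr.shape)   # (2, 3), the size of each dimension
print(arr.size)    # 6, the total number of elements
print(arr.dtype)   # int64 (platform dependent), the element data type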
b) Indexing of arrays
• Array indexing always refers to the use of square brackets ("[ ]') to index the elements
of the array. In order to access a single element of an array we can refer to its index.
• Fig. 4.4.1 shows the indexing of an ndarray mono-dimensional.

>>> P = np.arange(25, 31)
>>> P
array([25, 26, 27, 28, 29, 30])
>>> P[3]
28
• NumPy arrays also accept negative indexes, which count from the end of the array: -1 is the last element, -2 the second to last, and so on.
>>> P[-1]
30
>>> P[-6]
25
• In a multidimensional array, we can access items using a comma-separated tuple of indices. To select multiple items at once, we can pass an array of indexes within the square brackets.
>>> P[[1, 3, 4]]
array([26, 28, 29])
• Moving on to the two-dimensional case, namely, the matrices, they are represented
as rectangular arrays consisting of rows and columns, defined by two axes, where axis
0 is represented by the rows and axis 1 is represented by the columns. Thus, indexing
in this case is represented by a pair of values : the first value is the index of the row
and the second is the index of the column.
• The following example shows indexing of a two-dimensional array.

>>> A = np.arange(10, 19).reshape((3, 3))


>>> A
array([[10, 11, 12],
[13, 14, 15],
[16, 17, 18]])
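• A single element is selected with a pair of indexes, the first for the row and the second for the column. For example:
>>> A[1, 2]
15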
c) Slicing of arrays
• Slicing is the operation which allows to extract portions of an array to generate new
ones. Whereas using the Python lists the arrays obtained by slicing are copies, in
NumPy, arrays are views onto the same underlying buffer.
• Slicing of array in Python means to access sub-parts of an array. These sub-parts
can be stored in other variables and further modified.
• Depending on the portion of the array to extract or view, we make use of the slice syntax; that is, a sequence of numbers separated by colons (':') within the square brackets.
• Syntax: arr[start:stop:step] or arr[slice(start, stop, step)]
• The start parameter represents the starting index, stop is the ending index (exclusive), and step is the number of items that are "stepped" over. If any of these are unspecified, they default to start=0, stop=size of dimension, step=1.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1:3:2])
print(arr[:3])
print(arr[::2])
Output:
[2]
[1 2 3]
[1 3]
Multidimensional sub-arrays:
• Multidimensional slices work in the same way, with multiple slices separated by commas. For example, given a previously defined two-dimensional array x2:
In[24]: x2
Out[24]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[25]: x2[:2, :3] # two rows, three columns
Out[25]: array([[12, 5, 2],
[ 7, 6, 8]])
In[26]: x2[:3, ::2] # all rows, every other column
Out[26]: array([[12, 2],
[7, 8],
[ 1, 7]])
• Let us create an array using the package Numpy and access its columns.
# Creating an array
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
• Now let us access the elements column-wise. In order to access elements in a column-wise manner, the colon (:) symbol is used. Let us see that with an example.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a[:, 1])
Output:
[2 5 8]
d) Reshaping of array
•The numpy.reshape() function is used to reshape a numpy array without changing the
data in the array.
• Syntax:
numpy.reshape(a, newshape, order='C')
Where order: {'C', 'F', 'A'}, optional Read the elements of a using this index order, and
place the elements into the reshaped array using this index order.
Step 1: Create a numpy array of shape (8,)
num_array = np.array([1,2,3,4,5,6,7,8])
num_array
Output:
array([1, 2, 3, 4, 5, 6, 7, 8])
Step 2: Use np.reshape() function with new shape as (4,2)
np.reshape(num_array,(4,2))
array([[1,2],
[3,4],
[5,6],
[7,8]])
• The shape of the input array has been changed to a (4,2). This is a 2-D array and
contains the same data present in the original input 1-D array.
e) Array concatenation and splitting
• The np.concatenate() function is used to concatenate or join two or more arrays into one. The only required argument is a list or tuple of arrays.
#first, import numpy
import numpy as np
# making two arrays to concatenate
arr1 = np.arange(1,4)
arr2 = np.arange(4,7)
print("Arrays to concatenate:")
print(arr1); print(arr2)
print("After concatenation:")
print(np.concatenate([arr1,arr2]))
Arrays to concatenate:
[1 2 3]
[4 5 6]
After concatenation:
[1 2 3 4 5 6]
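• Splitting is the opposite of concatenation. As a minimal sketch (array values illustrative), np.split() divides an array into sub-arrays at the given split positions:

import numpy as np
arr = np.arange(1, 7)             # array([1, 2, 3, 4, 5, 6])
left, right = np.split(arr, [3])  # split before index 3
print(left)                       # [1 2 3]
print(right)                      # [4 5 6]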
2.4 Aggregations
• An aggregation function is one which takes multiple individual values and returns a summary. In the majority of cases, this summary is a single value. The most common aggregation functions are a simple average or summation of values.
• Let us consider following example:
>>> import numpy as np
>>> arr1 = np.array([10, 20, 30, 40, 50])
>>> arr1
array([10, 20, 30, 40, 50])
>>> arr2 = np.array([[0, 10, 20], [30, 40, 50], [60, 70, 80]])
>>> arr2
array([[ 0, 10, 20],
       [30, 40, 50],
       [60, 70, 80]])
>>> arr3 = np.array([[14, 6, 9, -12, 19, 72], [-9, 8, 22, 0, 99, -11]])
>>> arr3
array([[ 14,   6,   9, -12,  19,  72],
       [ -9,   8,  22,   0,  99, -11]])
• Python numpy sum function calculates the sum of values in an array.
arr1.sum()
arr2.sum()
arr3.sum()
• The Python numpy sum function accepts an optional argument called axis, which calculates the sum along the given axis. For example, axis = 0 returns the sum of each column in a NumPy array.
arr2.sum(axis = 0)
arr3.sum(axis = 0)
• axis = 1 returns the sum of each row in an array.
arr2.sum(axis = 1)
arr3.sum(axis = 1)
>>> arr1.sum()
150
>>> arr2.sum()
360
>>> arr3.sum()
217
>>> arr2.sum(axis = 0)
array([90, 120, 150])
>>> arr3.sum(axis=0)
array([5, 14, 31, -12, 118, 61])
>>> arr2.sum(axis=1)
array([30, 120, 210])
>>> arr3.sum(axis=1)
array([108, 109])
• Python has built-in min() and max() functions which return the smallest and the largest value of any given array or list, respectively. min() can also be used to find the smaller of two variables or lists, and max() the larger.
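• NumPy provides its own min() and max() as array methods (and as np.min()/np.max()), which also accept the optional axis argument. A short sketch using arr2 from above:
>>> arr2.min()
0
>>> arr2.max()
80
>>> arr2.min(axis=0)
array([ 0, 10, 20])
>>> arr2.max(axis=1)
array([20, 50, 80])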

2.5 Computations on Arrays


• Computation on NumPy arrays can be very fast, or it can be very slow. Using vectorized operations, fast computation is possible; these are implemented through NumPy's universal functions (ufuncs).
• A universal function (ufunc) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. A ufunc is a "vectorized" wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.
• Functions that work on both scalars and arrays are known as ufuncs. For arrays, ufuncs apply the function in an element-wise fashion. Use of ufuncs is an essential aspect of vectorization and typically much more computationally efficient than an explicit loop over each element.
NumPy's Ufuncs :
• Ufuncs are of two types: unary ufuncs and binary ufuncs.
• Unary ufuncs operate on a single input and binary ufuncs, which operate on two
inputs.
• Arithmetic operators implemented in NumPy are as follows:
a) + : np.add (addition)
b) - : np.subtract (subtraction)
c) - (unary) : np.negative (negation)
d) * : np.multiply (multiplication)
e) / : np.divide (division)
f) // : np.floor_divide (floor division)
g) ** : np.power (exponentiation)
h) % : np.mod (modulus/remainder)
• Example of arithmetic operators in plain Python code:


# Taking input
num1 = input('Enter first number: ')
num2 = input('Enter second number: ')
# Addition
add = float(num1) + float(num2)
# Subtraction
sub = float(num1) - float(num2)
# Multiplication
mul = float(num1) * float(num2)
# Division
div = float(num1) / float(num2)
# Modulus
mod = float(num1) % float(num2)
# Exponentiation
exp = float(num1) ** float(num2)
# Floor division
floordiv = float(num1) // float(num2)
print('The sum of {0} and {1} is {2}'.format(num1, num2, add))
print('The subtraction of {0} and {1} is {2}'.format(num1, num2, sub))
print('The multiplication of {0} and {1} is {2}'.format(num1, num2, mul))
print('The division of {0} and {1} is {2}'.format(num1, num2, div))
print('The modulus of {0} and {1} is {2}'.format(num1, num2, mod))
print('The exponentiation of {0} and {1} is {2}'.format(num1, num2, exp))
print('The floor division of {0} and {1} is {2}'.format(num1, num2, floordiv))
Absolute value :
• Just as NumPy understands Python's built-in arithmetic operators, it also understands Python's built-in absolute value function. The abs() function returns the absolute magnitude or value of the input passed to it as an argument; it returns the value of the input without taking the sign into consideration.
• The abs() function accepts only a single argument, which has to be a number, and it returns the absolute magnitude of the number. If the input is of type integer or float, abs() returns the absolute magnitude/value. If the input is a complex number, abs() returns only the magnitude portion of the number.
Syntax: abs(number)
Where the number can be of integer type, floating point type or a complex number.
• Example:
num = -25.79
print("Absolute value:", abs(num))
• Output:
Absolute value: 25.79
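• The built-in abs() also works element-wise on NumPy arrays, and the corresponding ufunc is np.absolute() (aliased as np.abs()). For example:

import numpy as np
x = np.array([-2, -1, 0, 1, 2])
print(abs(x))     # [2 1 0 1 2]
print(np.abs(x))  # [2 1 0 1 2]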
Trigonometric functions:
• The numpy package provides trigonometric functions which can be used to calculate
trigonometric ratios for a given angle in radians.
Example:
import numpy as np
Arr = np.array([0, 30, 60, 90])
#converting the angles in radians
Arr = Arr*np.pi/180
print("\nThe sin value of angles:")
print(np.sin(Arr))
print("\nThe cos value of angles:")
print(np.cos(Arr))
print("\nThe tan value of angles:")
print(np.tan(Arr))

2.6 Comparisons, Masks and Boolean Logic


• Masking means to extract, modify, count or otherwise manipulate values in an array
based on some criterion.
• Boolean masking, also called boolean indexing, is a feature in Python NumPy that
allows for the filtering of values in numpy arrays. There are two main ways to carry out
boolean masking:
a) Method one: Returning the result array.
b) Method two: Returning a boolean array.
Comparison operators as ufuncs
• The result of these comparison operators is always an array with a Boolean data
type. All six of the standard comparison operations are available. For example, we
might wish to count all values greater than a certain value, or perhaps remove all
outliers that are above some threshold. In NumPy, Boolean masking is often the most
efficient way to accomplish these types of tasks.
x = np.array([1,2,3,4,5])
print(x<3) # less than
print(x>3) # greater than
print(x<=3) # less than or equal
print(x>=3) #greater than or equal
print(x!=3) #not equal
print(x==3) #equal
• Comparison operators and their equivalent ufuncs:
a) == : np.equal
b) != : np.not_equal
c) < : np.less
d) <= : np.less_equal
e) > : np.greater
f) >= : np.greater_equal
Boolean array:
• A boolean array is a numpy array with boolean (True/False) values. Such array can
be obtained by applying a logical operator to another numpy array:
import numpy as np
a = np.reshape(np.arange(16), (4, 4))  # create a 4x4 array of integers
print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
large_values = (a > 10)  # test which elements of a are greater than 10
print(large_values)
[[False False False False]
 [False False False False]
 [False False False  True]
 [ True  True  True  True]]
even_values = (a % 2 == 0)  # test which elements of a are even
print(even_values)
[[ True False  True False]
 [ True False  True False]
 [ True False  True False]
 [ True False  True False]]
Logical operations on boolean arrays
• Boolean arrays can be combined using logical (bitwise) operators:
a) & : np.bitwise_and
b) | : np.bitwise_or
c) ^ : np.bitwise_xor
d) ~ : np.bitwise_not
b = ~(a%3 == 0) # test which elements of a are not divisible by 3


print('array a:\n{}\n'.format(a))
print('array b:\n{}'.format(b))
array a:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
array b:
[[False  True  True False]
 [ True  True False  True]
 [ True False  True  True]
 [False  True  True False]]
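• A boolean array can also be used directly as a mask to select values: indexing a with a boolean array of the same shape returns a one-dimensional array of the elements where the mask is True. Continuing with the arrays defined above:

print(a[even_values])   # [ 0  2  4  6  8 10 12 14]
print(a[a > 10])        # [11 12 13 14 15]
print(np.sum(a > 10))   # 5, the count of elements greater than 10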
2.7 Fancy Indexing
• With NumPy array fancy indexing, an array can be indexed with another NumPy
array, a Python list, or a sequence of integers, whose values select elements in the
indexed array.
• Example: We first create a NumPy array with 11 floating-point numbers and then
index the array with another NumPy array and Python list, to extract element numbers
0, 2 and 4 from the original array :
import numpy as np
A = np.linspace(0, 1, 11)
print(A)
print(A[np.array([0, 2, 4])])
# The same thing can be accomplished by indexing with a Python list
print(A[[0, 2, 4]])
Output:
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
[0. 0.2 0.4]
[0. 0.2 0.4]
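• Fancy indexing also works in multiple dimensions. As a minimal sketch (the array values here are illustrative), passing one array of row indices and one of column indices selects the corresponding (row, column) pairs:

import numpy as np
M = np.arange(12).reshape((3, 4))
rows = np.array([0, 1, 2])
cols = np.array([2, 1, 3])
# Selects elements (0, 2), (1, 1) and (2, 3)
print(M[rows, cols])
Output:
[ 2  5 11]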

2.8 Structured Arrays


• A structured NumPy array is an array of structures. Since NumPy arrays are homogeneous, i.e. they can contain data of only one type, instead of creating an array of int or float we can create an array of homogeneous structures.
• First of all import numpy module i.e.
import numpy as np
• Now, to create a structured numpy array we can pass a list of tuples containing the structure elements, i.e.
[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]
• But as elements of a numpy array are homogeneous, how will the size and type of the structure be decided? For that we need to pass the type of the above structure, i.e. its schema, in the dtype parameter.
• Let's create a dtype for above structure i.e.
# Creating the type of a structure
dtype = [('Name', (np.str_, 10)), ('Marks', np.float64), ('GradeLevel', np.int32)]
• Let's create a numpy array based on this dtype i.e.
# Creating a structured numpy array
structuredArr= np.array([('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6),
('Iresh', 99.9, 7)], dtype=dtype)
• It will create a structured numpy array and its contents will be,
[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]
• Let's check the data type of the above created numpy array:
print(structuredArr.dtype)
Output:
[('Name', '<U10'), ('Marks', '<f8'), ('GradeLevel', '<i4')]
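• Individual fields of the structured array can be accessed, and filtered, by field name. Continuing with structuredArr created above:

print(structuredArr['Name'])
# ['Ram' 'Rutu' 'Rupu' 'Iresh']
print(structuredArr['Marks'])
# [22.2 39.4 55.5 99.9]
# Boolean masking works on fields too:
print(structuredArr[structuredArr['Marks'] > 50]['Name'])
# ['Rupu' 'Iresh']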
Creating structured arrays:
• Structured array data types can be specified in a number of ways.
1. Dictionary method :
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ('U10', 'i4', 'f8')})
Output: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
2. Numerical types can be specified with Python types or NumPy dtypes instead:
np.dtype({'names': ('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
Output: dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])
3. A compound type can also be specified as a list of tuples :
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
Output: dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
NumPy data types:
• Below is a listing of all data types available in NumPy and the characters that
represent them.
1) i - integer
2) b - boolean
3) u - unsigned integer
4) f - float
5) c - complex float
6) m - timedelta
7) M - datetime
8) O - object
9) S - string
10) U - unicode string
11) V - fixed chunk of memory for another type (void)

2.9 Data Manipulation with Pandas


• Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built
on the Numpy package and its key data structure is called the DataFrame.
• DataFrames allow you to store and manipulate tabular data in rows of observations
and columns of variables.
• Pandas is built on top of the NumPy package, meaning a lot of the structure of
NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical
analysis in SciPy, plotting functions from Matplotlib and machine learning algorithms
in Scikit-learn.
• Pandas is the library for data manipulation and analysis. Usually, it is the starting point for your data science tasks. It allows you to read and write data from and to multiple sources, process missing data, align your data, reshape it, merge and join it with other data, search it, group it and slice it.
Create DataFrame with Duplicate Data
• Duplicate data creates problems for a data science project. If the database is large, processing duplicate data wastes time.
• Finding duplicates is important because it saves time and space and avoids false results. Duplicate data can be removed easily and efficiently using the drop_duplicates() function in pandas.
• Create Dataframe with Duplicate data
import pandas as pd
raw_data = {'first_name': ['rupali', 'rupali', 'rakshita', 'sangeeta', 'mahesh', 'vilas'],
            'last_name': ['dhotre', 'dhotre', 'dhotre', 'Auti', 'jadhav', 'bagad'],
            'RNo': [12, 12, 1111111, 36, 24, 73],
            'TestScore1': [4, 4, 4, 31, 2, 3],
            'TestScore2': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'RNo', 'TestScore1', 'TestScore2'])
df

Drop duplicates
df.drop_duplicates()
• Drop duplicates in the first name column, but take the last observation in the
duplicated set
df.drop_duplicates(['first_name'], keep='last')
Creating a Data Map and Data Plan
• An overview of the dataset is given by a data map. A data map is used for finding potential problems in data, such as redundant variables, possible errors, missing values and variable transformations.
• Try creating a Python script that converts a Python dictionary into a Pandas
DataFrame, then print the DataFrame to screen.
import pandas as pd
scottish_hills={'Ben Nevis': (1345, 56.79685, -5.003508),
'Ben Macdui': (1309, 57.070453, -3.668262),
'Braeriach': (1296, 57.078628, -3.728024),
'Cairn Toul': (1291, 57.054611, -3.71042),
'Sgòr an Lochain Uaine': (1258, 57.057999, -3.725416)}
dataframe = pd.DataFrame(scottish_hills)
print(dataframe)
Manipulating and Creating Categorical Variables
• Categorical variable is one that has a specific value from a limited selection of values.
The number of values is usually fixed.
• Categorical features can only take on a limited, and usually fixed, number of possible
values. For example, if a dataset is about information related to users, then you will
typically find features like country, gender, age group, etc. Alternatively, if the data you
are working with is related to products, you will find features like product type,
manufacturer, seller and so on.
• Method for creating a categorical variable and then using it to check whether some
data falls within the specified limits.
import pandas as pd
cycle_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')
cycle_data = pd.Series(pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
                                      categories=cycle_colors, ordered=False))
find_entries = pd.isnull(cycle_data)
print(cycle_colors)
print(cycle_data)
print(find_entries[find_entries == True])
• Here cycle_colors is a categorical variable. It contains the values Blue, Red and Green as colors.
Renaming Levels and Combining Levels
• Data frame variable names are typically used many times when wrangling data. Good
names for these variables make it easier to write and read wrangling programs.
• Categorical data has a categories and an ordered property, which list the possible values and whether the ordering matters or not.
• Renaming categories is done by assigning new values to the Series.cat.categories
property or by using the Categorical.rename_categories() method :
In [41]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [42]: s
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
In [43]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]
In [44]: s
Out[44]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
In [45]: s.cat.rename_categories([1, 2, 3])
Out[45]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]
Dealing with Dates and Times Values
• Dates are often provided in different formats and must be converted into single-format datetime objects before analysis.
• Python provides two methods of formatting date and time:
1. str(): turns a datetime value into a string without any formatting.
2. strftime(): defines how the user wants the datetime value to appear after conversion.
1. Using pandas.to_datetime() with a date
import pandas as pd
# input in dd.mm.yyyy format
date = ['21.07.2020']
# output in yyyy-mm-dd format
print(pd.to_datetime(date))
2. Using pandas.to_datetime() with a date and time
import pandas as pd
# date (dd.mm.yyyy) and time (H:MM:SS)
date = ['21.07.2020 11:31:01 AM']
# output in yyyy-mm-dd HH:MM:SS
print(pd.to_datetime(date))
• We can convert a string to datetime using strptime() function. This function is
available in datetime and time modules to parse a string to datetime and time objects
respectively.
• Python strptime() is a class method in datetime class. Its syntax is :
datetime.strptime(date_string, format)
• Both the arguments are mandatory and should be strings.
import datetime
format = "%a %b %d %H:%M:%S %Y"
today = datetime.datetime.today()
print('ISO:', today)
s = today.strftime(format)
print('strftime:', s)
d = datetime.datetime.strptime(s, format)
print('strptime:', d.strftime(format))
$ python datetime_datetime_strptime.py
ISO : 2013-02-21 06:35:45.707450
strftime: Thu Feb 21 06:35:45 2013
strptime: Thu Feb 21 06:35:45 2013
• Time Zones: Within datetime, time zones are represented by subclasses of tzinfo.
Since tzinfo is an abstract base class, you need to define a subclass and provide
appropriate implementations for a few methods to make it useful.
2.11 Missing Data
• Data can have missing values for a number of reasons such as observations that
were not recorded and data corruption. Handling missing data is important as many
machine learning algorithms do not support data with missing values.
• You can load the dataset as a Pandas DataFrame and print summary statistics on
each attribute.
# load and summarize the dataset
from pandas import read_csv
# load the dataset
dataset = read_csv('csv file name', header=None)
# summarize the dataset
print(dataset.describe())
• In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as
NaN. Values with a NaN value are ignored from operations like sum, count, etc.
• Use the isnull() method to detect missing values. The Pandas DataFrame function isnull() returns a new dataframe of the same size as the calling dataframe, containing only True and False values: True where the original dataframe holds NaN and False elsewhere.
Encoding missingness:
• The fillna() function is used to fill NA/NaN values using the specified method.
• Syntax :
DataFrame.fillna(value=None, method=None, axis=None, inplace=False,
limit=None, downcast=None, **kwargs)
Where
1. value: It is a value that is used to fill the null values.
2. method: A method that is used to fill the null values.
3. axis: It takes int or string value for rows/columns.
4. inplace: If True, the dataframe is modified in place.
5. limit: It is an integer value that specifies the maximum number of consecutive
forward/backward NaN value fills.
6. downcast: It takes a dict that specifies what to downcast like Float64 to int64.
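• A minimal sketch combining isnull() and fillna() (the column names and values here are illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [np.nan, 5, 6]})
print(df.isnull())         # True wherever a value is NaN
print(df.isnull().sum())   # number of missing values per column
print(df.fillna(0))        # replace every NaN with 0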
2.12 Hierarchical Indexing
• Hierarchical indexing is a method of creating structured group relationships in data.
• A MultiIndex or Hierarchical index comes in when our DataFrame has more than two
dimensions. As we already know, a Series is a one-dimensional labelled NumPy array
and a DataFrame is usually a two-dimensional table whose columns are Series. In
some instances, in order to carry out some sophisticated data analysis and
manipulation, our data is presented in higher dimensions.
• A MultiIndex adds at least one more dimension to the data. A Hierarchical Index as
the name suggests is ordering more than one item in terms of their ranking.
• To create a DataFrame with player ratings of a few players from the FIFA 19 dataset:
In [1]: import pandas as pd
In [2]: data = {'Position': ['GK', 'GK', 'GK', 'DF', 'DF', 'DF',
                'MF', 'MF', 'MF', 'CF', 'CF', 'CF'],
        'Name': ['De Gea', 'Courtois', 'Allison', 'Van Dijk',
                'Ramos', 'Godin', 'Hazard', 'Kante', 'De Bruyne', 'Ronaldo',
                'Messi', 'Neymar'],
        'Overall': ['91', '88', '89', '89', '91', '90', '91', '90', '92', '94', '93', '92'],
        'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd', '2nd', '3rd', '1st', '1st', '2nd', '3rd']}
In [3]: fifa19 = pd.DataFrame(data, columns=['Position', 'Name', 'Overall', 'Rank'])
In [4]: fifa19
Out[4]:
• From above Dataframe, we notice that the index is the default Pandas index; the
columns 'Position' and 'Rank' both have values or objects that are repeated. This could
sometimes pose a problem for us when we want to analyse the data. What we would
like to do is to use meaningful indexes that uniquely identify each row and makes it
easier to get a sense of the data we are working with. This is where MultiIndex or
Hierarchical Indexing comes in.
• We do this by using the set_index() method. For hierarchical indexing, we pass set_index() a list representing how we want the rows to be identified uniquely.
In [5]: fifa19.set_index(['Position', 'Rank'], drop=False)
In [6]: fifa19
Out[6]:
• We can see from the code above that we have set our new indexes to 'Position' and 'Rank', but there is a replication of these columns. This is because we passed drop=False, which keeps the columns where they are. The default, however, is drop=True, so without indicating drop=False the two columns will be set as the indexes and deleted from the columns automatically.
In [7]: fifa19.set_index(['Position', 'Rank'])
Out[7]:                  Name  Overall
Position Rank
GK       1st         De Gea        91
         3rd       Courtois        88
         2nd        Allison        89
DF       3rd       Van Dijk        89
         1st          Ramos        91
         2nd          Godin        90
MF       2nd         Hazard        91
         3rd          Kante        90
         1st      De Bruyne        92
CF       1st        Ronaldo        94
         2nd          Messi        93
         3rd         Neymar        92
• We use set_index() with an ordered list of column labels to make the new indexes. To
verify that we have indeed set our DataFrame to a hierarchical index, we call the .index
attribute.
In [8]: fifa19 = fifa19.set_index(['Position', 'Rank'])
In [9]: fifa19.index
Out[9]: MultiIndex(levels=[['CF', 'DF', 'GK', 'MF'],
                           ['1st', '2nd', '3rd']],
                   codes=[[2, 2, 2, 1, 1, 1, 3, 3, 3, 0, 0, 0],
                          [0, 2, 1, 2, 0, 1, 1, 2, 0, 0, 1, 2]],
                   names=['Position', 'Rank'])
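• With the hierarchical index in place, rows can be selected with .loc by the outer level alone or by an (outer, inner) tuple. For example, continuing with the fifa19 DataFrame above:
In [10]: fifa19.loc['GK']            # all rows for goalkeepers
In [11]: fifa19.loc[('GK', '1st')]   # the 1st-ranked goalkeeper only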

2.13 Combining Datasets


• Whether it is to concatenate several datasets from different csv files or to merge sets
of aggregated data from different google analytics accounts, combining data from
various sources is critical to drawing the right conclusions and extracting optimal
value from data analytics.
• When using pandas, data scientists often have to concatenate multiple pandas
DataFrame; either vertically (adding lines) or horizontally (adding columns).
DataFrame.append
• This method allows you to append another dataframe to an existing one. While columns with matching names are concatenated together, columns with different labels are filled with NA.
>>> df1
   ints  bools
0     0   True
1     1  False
2     2   True
>>> df2
   ints  floats
0     3     1.5
1     4     2.5
2     5     3.5
>>> df1.append(df2)
   ints  bools  floats
0     0   True     NaN
1     1  False     NaN
2     2   True     NaN
0     3    NaN     1.5
1     4    NaN     2.5
2     5    NaN     3.5
• In addition to this, DataFrame.append provides other flexibilities such as resetting the resulting index, sorting the resulting data or raising an error when the resulting index includes duplicate records. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pandas.concat is the recommended replacement.
Pandas.concat
• We can concatenate dataframes both vertically (axis=0) and horizontally (axis=1) by using the pandas.concat function. Unlike DataFrame.append, pandas.concat is not a method but a function that takes a list of objects as input. As with DataFrame.append, columns with different labels are filled with NA values.
>>> df3
   bools  floats
0  False     4.5
1   True     5.5
2  False     6.5
>>> pd.concat([df1, df2, df3])
   ints  bools  floats
0   0.0   True     NaN
1   1.0  False     NaN
2   2.0   True     NaN
0   3.0    NaN     1.5
1   4.0    NaN     2.5
2   5.0    NaN     3.5
0   NaN  False     4.5
1   NaN   True     5.5
2   NaN  False     6.5
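• Passing axis=1 concatenates horizontally instead, aligning rows by index and placing the columns side by side. A short sketch with the same df1 and df3:
>>> pd.concat([df1, df3], axis=1)
   ints  bools  bools  floats
0     0   True  False     4.5
1     1  False   True     5.5
2     2   True  False     6.5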

2.14 Aggregation and Grouping


• Pandas aggregation methods are as follows:
a) count() Total number of items
b) first(), last(): First and last item
c) mean(), median(): Mean and median
d) min(), max(): Minimum and maximum
e) std(), var(): Standard deviation and variance
f) mad(): Mean absolute deviation
g) prod(): Product of all items
h) sum(): Sum of all items.
• The examples below use a sample CSV file, phone_data.csv, containing one row per phone-usage record with columns including date, duration, item, month and network.
• The date column can be parsed using the extremely handy dateutil library.
import pandas as pd
import dateutil
# Load data from the csv file
data = pd.read_csv('phone_data.csv')
# Convert date from string to datetime
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
• Once the data has been loaded into Python, Pandas makes the calculation of different
statistics very simple. For example, mean, max, min, standard deviations and more for
columns are easily calculable:
# How many rows are in the dataset?
data['item'].count()
Out[38]: 830
# What was the longest phone call / data entry?
data['duration'].max()
Out[39]: 10528.0
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out[40]: 92321.0
# How many entries are there for each month?
data['month'].value_counts()
Out[41]:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
dtype: int64
# Number of non-null unique network entries
data['network'].nunique()
Out[42]: 9
groupby() function :
• groupby essentially splits the data into different groups depending on a variable of
user choice.
• The groupby() function returns a GroupBy object, which essentially describes how the rows of the original data set have been split. The GroupBy object's groups attribute is a dictionary whose keys are the computed unique groups, with the corresponding values being the axis labels belonging to each group.
• Functions like max(), min(), mean(), first(), last() can be quickly applied to the
GroupBy object to obtain summary statistics for each group.
• The GroupBy object supports column indexing in the same way as the DataFrame
and returns a modified GroupBy object.
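• As a minimal sketch (the column names and values here are illustrative), grouping a small DataFrame by one column and applying aggregations per group:

import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'score': [10, 20, 30, 40]})
grouped = df.groupby('team')      # split the rows by the 'team' column
print(grouped['score'].mean())    # A -> 15.0, B -> 35.0
print(grouped['score'].max())     # A -> 20, B -> 40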

2.15 Pivot Tables


• A pivot table is a similar operation that is commonly seen in spreadsheets and other
programs that operate on tabular data. The pivot table takes simple column-wise data
as input, and groups the entries into a two-dimensional table that provides a
multidimensional summarization of the data.
• A pivot table is a table of statistics that helps summarize the data of a larger table by
"pivoting" that data. Pandas gives access to creating pivot tables in Python using the
.pivot_table() function.
• The syntax of the .pivot_table() function:
import pandas as pd
pd.pivot_table(
    data,
    values=None,
    index=None,
    columns=None,
    aggfunc='mean',
    fill_value=None,
    margins=False,
    dropna=True,
    margins_name='All',
    observed=False,
    sort=True
)
• To use the pivot_table method in Pandas, we need to specify three parameters:
1. Index: Which column should be used to identify and order the rows vertically.
2. Columns: Which column should be used to create the new columns in reshaped
DataFrame. Each unique value in the column stated here will create a column in new
DataFrame.
3. Values: Which column(s) should be used to fill the values in the cells of DataFrame.
• Import modules:
import pandas as pd
Create dataframe :
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
                         'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons',
                         'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'TestScore'])
df

• Create a pivot table of group means, by company and regiment


pd.pivot_table(df,index=['regiment','company'], aggfunc='mean')

• Create a pivot table of group score counts, by company and regiment


df.pivot_table(index=['regiment', 'company'], aggfunc='count')
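• The values and columns parameters can be combined as well. As a hedged sketch using the same df, the following spreads the companies across the columns and shows the mean TestScore in each cell:

pd.pivot_table(df, values='TestScore', index='regiment',
               columns='company', aggfunc='mean')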
Two Marks Questions with Answers
Q.1 Define data wrangling ?
Ans. Data wrangling is the process of transforming data from its original "raw" form
into a more digestible format and organizing sets from various sources into a singular
coherent whole for further processing.

Q.2 What is Python?


Ans. Python is a high-level scripting language which can be used for a wide variety of
text processing, system administration and internet-related tasks. Python is a true
object-oriented language and is available on a wide variety of platforms.

Q.3 What is NumPy ?


Ans. NumPy, short for Numerical Python, is the core library for scientific computing in
Python. It has been designed specifically for performing basic and advanced array
operations. It primarily supports multi-dimensional arrays and vectors for complex
arithmetic operations.

Q.4 What is an aggregation function ?


Ans. An aggregation function is one which takes multiple individual values and
returns a summary. In the majority of the cases, this summary is a single value. The
most common aggregation functions are a simple average or summation of values.
Q.5 What is Structured Arrays?
Ans. A structured NumPy array is an array of structures. Since NumPy arrays are homogeneous, i.e. they can contain data of only one type, instead of creating an array of int or float we can create an array of homogeneous structures.

Q.6 Describe Pandas.


Ans. Pandas is a high-level data manipulation tool developed by Wes McKinney. It is
built on the Numpy package and its key data structure is called the DataFrame.
DataFrames allow you to store and manipulate tabular data in rows of observations
and columns of variables. Pandas is built on top of the NumPy package, meaning a lot
of the structure of NumPy is used or replicated in Pandas.

Q.7 How are categorical variables created and manipulated?


Ans. A categorical variable is one that has a specific value from a limited selection of values. The number of values is usually fixed. Categorical features can only take on a limited, and usually fixed, number of possible values. For example, if a dataset contains information related to users, you will typically find features like country, gender, age group, etc. Alternatively, if the data is related to products, you will find features like product type, manufacturer, seller and so on.

Q.8 Explain Hierarchical Indexing.


Ans. Hierarchical indexing is a method of creating structured group relationships in
data. A MultiIndex or Hierarchical index comes in when our DataFrame has more than
two dimensions. As we already know, a Series is a one-dimensional labelled NumPy
array and a DataFrame is usually a two-dimensional table whose columns are Series.
In some instances, in order to carry out some sophisticated data analysis and
manipulation, our data is presented in higher dimensions.

Q.9 What is Pivot Tables?


Ans. A pivot table is an operation that is commonly seen in spreadsheets and
other programs that operate on tabular data. The pivot table takes simple column-wise
data as input and groups the entries into a two-dimensional table that provides a
multidimensional summarization of the data.
