
PP&DS 3

NumPy is a Python library designed for efficient array manipulation, providing functions for linear algebra and matrices. It offers a faster alternative to traditional Python lists through its ndarray object, which is optimized for performance and memory usage. The document also covers installation, importing, creating arrays, indexing, slicing, and data types in NumPy.

Uploaded by

shatviklakshman

NumPy

What is NumPy?

➢ NumPy is a Python library used for working with arrays.


➢ It also provides functions for linear algebra, matrices, and related tasks.
➢ NumPy stands for Numerical Python.

Why Use NumPy?

➢ In Python we have lists that serve the purpose of arrays, but they are
slow to process.
➢ NumPy aims to provide an array object that is up to 50 times faster than
traditional Python lists.
➢ The array object in NumPy is called ndarray; it provides a lot of
supporting functions that make working with ndarray very easy.
➢ Arrays are very frequently used in data science, where speed and
resources are very important.

Why is NumPy Faster Than Lists?


➢ NumPy arrays are stored in one contiguous block of memory, unlike lists,
so processes can access and manipulate them very efficiently.
➢ This behavior is called locality of reference in computer science.
➢ This is the main reason why NumPy is faster than lists. It is also optimized
to work with the latest CPU architectures.
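
The speed difference described above can be measured with a rough sketch like the following. The array size `n` and the repeat count are arbitrary values chosen for illustration, and the actual timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000  # hypothetical problem size, chosen only for illustration
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# Add 1 to every element: a Python-level loop versus one array operation.
list_time = timeit.timeit(lambda: [v + 1 for v in py_list], number=5)
array_time = timeit.timeit(lambda: np_arr + 1, number=5)

print(f"list comprehension: {list_time:.4f} s")
print(f"ndarray operation:  {array_time:.4f} s")

# The array's elements sit in one contiguous block of memory.
print(np_arr.flags['C_CONTIGUOUS'])
```

On most machines the ndarray operation should be noticeably faster, although the exact ratio depends on the hardware and the operation being timed.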
Installation of NumPy

If you have Python and PIP already installed on a system, then installation
of NumPy is very easy.

Install it using this command:

C:\Users\Your Name>pip install numpy

If this command fails, then use a Python distribution that already has NumPy
installed, such as Anaconda.
Import NumPy

Once NumPy is installed, import it in your applications by adding
the import keyword:

import numpy

Now NumPy is imported and ready to use.

Example:
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
Output: [1 2 3 4 5]
NumPy as np
• NumPy is usually imported under the np alias.
• alias: In Python, an alias is an alternate name for referring to the same
thing.
• Create an alias with the as keyword while importing:
import numpy as np

Now the NumPy package can be referred to as np instead of numpy.

Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output: [1 2 3 4 5]

Checking NumPy Version

The version string is stored under the __version__ attribute.

Example:
import numpy as np
print(np.__version__)
Output: 1.19.2
NumPy Creating Arrays
➢ Create a NumPy ndarray Object
➢ NumPy is used to work with arrays. The array object in NumPy is
called ndarray.
➢ We can create a NumPy ndarray object by using
the array() function.
Example:

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))

Output:

[1 2 3 4 5]
<class 'numpy.ndarray'>

To create an ndarray, we can pass a list, tuple or any array-like object
into the array() function, and it will be converted into an ndarray:

Example: Use a tuple to create a NumPy array:

import numpy as np
arr = np.array((1, 2, 3, 4, 5))
print(arr)

Output: [1 2 3 4 5]
Dimensions in Arrays

Arrays can have the following dimensions:

• 0-D Arrays
• 1-D Arrays
• 2-D Arrays
• 3-D Arrays……etc
0-D Arrays

0-D arrays, or scalars, are the elements in an array. Each value in an array is a 0-D array.

Example: Create a 0-D array with value 42

import numpy as np
arr = np.array(42)
print(arr)

Output: 42

1-D Arrays

An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D
array.

Example: Create a 1-D array containing the values 1,2,3,4,5:

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

Output: [1 2 3 4 5]

2-D Arrays

An array that has 1-D arrays as its elements is called a 2-D array.

These are often used to represent matrices.

Example: Create a 2-D array containing two arrays with the values 1,2,3 and
4,5,6:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)

Output:

[[1 2 3]
[4 5 6]]
3-D arrays

An array that has 2-D arrays (matrices) as its elements is called a 3-D array.

Example:
#Create a 3-D array with two 2-D arrays, both containing two arrays with the
values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Output:

[[[1 2 3]
[4 5 6]]

[[1 2 3]
[4 5 6]]]
Check Number of Dimensions?

NumPy arrays provide the ndim attribute, which returns an integer that tells us
how many dimensions the array has.

Example: Check how many dimensions the arrays have:

import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)

Output:

0
1
2
3
Higher Dimensional Arrays

An array can have any number of dimensions.

When the array is created, you can define the number of dimensions by using
the ndmin argument.

Example: Create an array with 5 dimensions and verify that it has 5
dimensions:

import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('number of dimensions :', arr.ndim)

Output:

[[[[[1 2 3 4]]]]]
number of dimensions : 5
NumPy Array Indexing

Access Array Elements

Array indexing is the same as accessing an element in a list.

You can access an array element by referring to its index number.

The indexes in NumPy arrays start with 0, meaning that the first element
has index 0, and the second has index 1 etc.
Example: Get the first element from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])

Output: 1

Example: Get the second element from the following array.

import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1])

Output: 2

Example: Get third and fourth elements from the following array and add them.

import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])

Output: 7

Access 2-D Arrays

To access elements from 2-D arrays we can use comma separated integers
representing the dimension and the index of the element.

Example: Access the 2nd element on 1st dim:

import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st dim: ', arr[0, 1])

Output: 2nd element on 1st dim: 2

Example: Access the 5th element on 2nd dim:

import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('5th element on 2nd dim: ', arr[1, 4])

Output: 5th element on 2nd dim: 10


Access 3-D Arrays

To access elements from 3-D arrays we can use comma separated integers
representing the dimensions and the index of the element.

Example: Access the third element of the second array of the first array:

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])

Output: 6

o In the above example, arr[0, 1, 2] prints the value 6 because:

o The first number represents the first dimension, which contains two
arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]

o The second number represents the second dimension, which also
contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]

o The third number represents the third dimension, which contains three
values:
4
5
6
Since we selected 2, we end up with the third value:
6
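
The walk-through above can be reproduced one step at a time. This is a small sketch in which the intermediate names `first`, `second` and `value` are introduced only for illustration.

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

first = arr[0]      # first dimension, index 0: [[1 2 3] [4 5 6]]
second = first[1]   # second dimension, index 1: [4 5 6]
value = second[2]   # third dimension, index 2: 6

print(value)
print(value == arr[0, 1, 2])  # the comma-separated form selects the same element
```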
Negative Indexing

Use negative indexing to access an array from the end.

Example: Print the last element from the 2nd dim:

import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])

Output: Last element from 2nd dim: 10


NumPy Array Slicing
Slicing arrays
o Slicing in Python means taking elements from one given index to another
given index.
o We pass a slice instead of an index, like this: [start:end]. The element at
index end is not included in the result.
o We can also define the step, like this: [start:end:step].
o If we don't pass start, it's considered 0.
o If we don't pass end, it's considered the length of the array in that dimension.
o If we don't pass step, it's considered 1.
Example: Slice elements from index 1 to index 5 from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])

Output: [2 3 4 5]

Example: Slice elements from index 4 to the end of the array:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])

Output: [5 6 7]

Example: Slice elements from the beginning to index 4 (not included):


import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[:4])

Output: [1 2 3 4]

Negative Slicing

Use the minus operator to refer to an index from the end

Example: Slice from the index 3 from the end to index 1 from the end:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])

Output: [5 6]

STEP

Use the step value to determine the step of the slicing:

Example: Return every other element from index 1 to index 5:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])

Output: [2 4]

Example: Return every other element from the entire array:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[::2])

Output: [1 3 5 7]
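
The default values for start, end and step listed under "Slicing arrays" can be checked directly. The following is a minimal sketch; the name `full` is introduced only for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Omitted start, end and step fall back to 0, len(arr) and 1.
full = arr[0:len(arr):1]
print(np.array_equal(arr[:], full))
print(np.array_equal(arr[:4], arr[0:4]))

# The end index is exclusive: index 5 (value 6) is not part of the result.
print(arr[1:5])
```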
Slicing 2-D Arrays
Example: From the second element, slice elements from index 1 to index 4 (not included):
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])

Output : [7 8 9]

Example: From both elements, return index 2:

import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 2])

Output: [3 8]

Example: From both elements, slice index 1 to index 4 (not included). This will
return a 2-D array:

import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 1:4])

Output:

[[2 3 4]
[7 8 9]]
NumPy Data Types
Data Types in Python

By default Python has these data types:

• strings - used to represent text data, the text is given under quote
marks. e.g. "ABCD"
• integer - used to represent integer numbers. e.g. -1, -2, -3
• float - used to represent real numbers. e.g. 1.2, 42.42
• boolean - used to represent True or False.
• complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j
Data Types in NumPy

NumPy has some extra data types, and refers to data types with one
character, like i for integers, u for unsigned integers etc.

Below is a list of all data types in NumPy and the characters used to represent
them.

• i - integer
• b - boolean
• u - unsigned integer
• f - float
• c - complex float
• m - timedelta
• M - datetime
• O - object
• S - string
• U - unicode string
• V - fixed chunk of memory for other type ( void )
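
The one-character codes in this list can be read back from an array through `dtype.kind`. The sketch below uses a few sample arrays (the dictionary names are made up for illustration) and prints the kind character NumPy reports for each.

```python
import numpy as np

samples = {
    "integers": np.array([1, 2, 3]),
    "floats": np.array([1.5, 2.5]),
    "booleans": np.array([True, False]),
    "complex": np.array([1.0 + 2.0j]),
    "text": np.array(["apple", "banana"]),
}

# dtype.kind holds the single character that identifies the general type.
kinds = {name: a.dtype.kind for name, a in samples.items()}
for name, kind in kinds.items():
    print(name, "->", kind)
```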

Checking the Data Type of an Array

The NumPy array object has a property called dtype that returns the data type
of the array:

Example: Get the data type of an array object:

import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr.dtype)

Output: int32 (the default integer size depends on the platform and NumPy version, so this may be int64 instead)

Example: Get the data type of an array containing strings:

import numpy as np
arr = np.array(['apple', 'banana', 'cherry'])
print(arr.dtype)

Output: <U6
Creating Arrays With a Defined Data Type

We use the array() function to create arrays, this function can take an
optional argument: dtype that allows us to define the expected data type of
the array elements:

Example: Create an array with data type string:

import numpy as np
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)

Output:

[b'1' b'2' b'3' b'4']
|S1

For i, u, f, S and U we can define size as well.

Example: Create an array with data type 4 bytes integer:

import numpy as np
arr = np.array([1, 2, 3, 4], dtype='i4')
print(arr)
print(arr.dtype)

Output:

[1 2 3 4]
int32
What if a Value Can Not Be Converted?

If a type is given to which the elements cannot be cast, NumPy will raise a
ValueError.

ValueError: In Python, ValueError is raised when an argument passed to a
function has the right type but an inappropriate value.

Example: A non integer string like 'a' can not be converted to integer (will raise
an error):
import numpy as np
arr = np.array(['a', '2', '3'], dtype='i')

Output:

ValueError Traceback (most recent call last)
<ipython-input-43-2942102936bc> in <module>
1 import numpy as np
----> 2 arr = np.array(['a', '2', '3'], dtype='i')

ValueError: invalid literal for int() with base 10: 'a'
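
In a script, the failed conversion can be caught instead of stopping the program. This is a minimal sketch; the names `failed` and `ok` are introduced only for illustration.

```python
import numpy as np

failed = False
try:
    arr = np.array(['a', '2', '3'], dtype='i')
except ValueError as err:
    # 'a' has no integer value, so the conversion is rejected.
    failed = True
    print("conversion failed:", err)

# Strings that do look like integers convert without an error.
ok = np.array(['1', '2', '3'], dtype='i')
print(ok)
```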


Converting Data Type on Existing Arrays
o The best way to change the data type of an existing array, is to make a
copy of the array with the astype() method.
o The astype() function creates a copy of the array, and allows you to
specify the data type as a parameter.
o The data type can be specified using a string, like 'f' for float, 'i' for
integer etc. or you can use the data type directly like float for float
and int for integer.
Example: Change data type from float to integer by using 'i'/int as
parameter value:
import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)

Output:

[1 2 3]
int32
Example: Change data type from integer to boolean:
import numpy as np
arr = np.array([1, 0, 3])
newarr = arr.astype(bool)
print(newarr)
print(newarr.dtype)

Output:
[ True False True]
bool
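
Because astype() makes a copy, changing the converted array leaves the original untouched. The sketch below illustrates this with the Python type `float` as the parameter; the name `as_float` is made up for the example.

```python
import numpy as np

arr = np.array([1, 2, 3])
as_float = arr.astype(float)  # a Python type instead of a character code
as_float[0] = 9.5

print(arr)             # the original integers are unchanged
print(as_float)
print(as_float.dtype)
```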

NumPy Copy vs View

➢ The main difference between a copy and a view of an array is that the
copy is a new array, and the view is just a view of the original array.

➢ The copy owns the data and any changes made to the copy will not affect
original array, and any changes made to the original array will not affect
the copy.

➢ The view does not own the data and any changes made to the view will
affect the original array, and any changes made to the original array will
affect the view.

COPY:
Example: Make a copy, change the original array, and display both arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
arr[0] = 42
print(arr)
print(x)
Output:
[42 2 3 4 5]
[1 2 3 4 5]
VIEW:
Example: Make a view, change the original array, and display both arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42
print(arr)
print(x)
Output:
[42 2 3 4 5]
[42 2 3 4 5]

The view SHOULD be affected by the changes made to the original array.

Make Changes in the VIEW:

Example: Make a view, change the view, and display both arrays

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
x[0] = 31
print(arr)
print(x)
Output:
[31 2 3 4 5]
[31 2 3 4 5]
The original array SHOULD be affected by the changes made to the view.

Check if Array Owns its Data

As mentioned above, copies own the data, and views do not own the data,
but how can we check this?

Every NumPy array has the attribute base that returns None if the array owns
the data.

Otherwise, the base attribute refers to the original object.


Example: Print the value of the base attribute to check if an array owns its
data or not:

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
y = arr.view()
print(x.base)
print(y.base)

Output:
None
[1 2 3 4 5]
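
Instead of printing base, it can be compared with the original array. The sketch below also includes a slice `z`, which is an addition for illustration: basic slicing returns a view as well.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
y = arr.view()
z = arr[1:4]  # basic slicing also produces a view

print(x.base is None)            # the copy owns its data
print(y.base is arr)             # the view refers back to arr
print(z.base is arr)
print(np.shares_memory(arr, z))  # the slice uses arr's memory
```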
NumPy Array Shape
Shape of an Array

The shape of an array is the number of elements in each dimension.

Get the Shape of an Array


NumPy arrays have an attribute called shape that returns a tuple with each
index having the number of corresponding elements.

Example: Print the shape of a 2-D array:


import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
Output: (2, 4)

The above example returns (2, 4), which means that the array has 2
dimensions, where the first dimension has 2 elements and the second has 4.
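
The relationship between shape, ndim and the total number of elements can be checked with a short sketch. The names `rows` and `cols` are introduced only for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

rows, cols = arr.shape  # one entry per dimension
print(rows, cols)
print(len(arr.shape) == arr.ndim)  # the tuple length is the number of dimensions
print(arr.size == rows * cols)     # the entries multiply to the element count
```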

Example: Create an array with 5 dimensions using ndmin using a vector with
values 1,2,3,4 and verify that last dimension has value 4:

import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)
Output:
[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)
NumPy Array Reshaping
Reshaping arrays
➢ Reshaping means changing the shape of an array.
➢ The shape of an array is the number of elements in each dimension.
➢ By reshaping we can add or remove dimensions or change number of
elements in each dimension.
Reshape From 1-D to 2-D
Example: Convert the following 1-D array with 12 elements into a 2-D array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
Output:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
Reshape From 1-D to 3-D
Example: Convert the following 1-D array with 12 elements into a 3-D array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)

Output:
[[[ 1 2]
[ 3 4]
[ 5 6]]

[[ 7 8]
[ 9 10]
[11 12]]]
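
Reshaping only works when the new shape holds exactly the same number of elements as the original array. The following sketch shows two valid shapes for 12 elements and one invalid shape; the flag `mismatch` is a name made up for the example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

print(arr.reshape(4, 3).shape)     # 4 * 3 = 12 elements
print(arr.reshape(2, 3, 2).shape)  # 2 * 3 * 2 = 12 elements

mismatch = False
try:
    arr.reshape(5, 3)  # 5 * 3 = 15, but the array has only 12 elements
except ValueError as err:
    mismatch = True
    print("cannot reshape:", err)
```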
NumPy Array Iterating
Iterating Arrays
➢ Iterating means going through elements one by one.
➢ As we deal with multi-dimensional arrays in numpy, we can do this using
basic for loop of python.
➢ If we iterate on a 1-D array it will go through each element one by one.
Example: Iterate the elements for the following 1-D array:
import numpy as np
arr = np.array([1, 2, 3])
for x in arr:
    print(x)
Output:
1
2
3
Iterating 2-D Arrays
In a 2-D array it will go through all the rows.
Example: Iterate on the elements of the following 2-D array:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    print(x)
Output:
[1 2 3]
[4 5 6]
If we iterate on an n-D array, it will go through the (n-1)-D sub-arrays one by one.

To return the actual values, we have to iterate the arrays in each dimension.

Example: Iterate on each scalar element of the 2-D array

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    for y in x:
        print(y)

Output:
1
2
3
4
5
6

Iterating 3-D Arrays

In a 3-D array it will go through all the 2-D arrays.

Example: Iterate on the elements of the following 3-D array:

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    print(x)

Output:
[[1 2 3]
[4 5 6]]
[[ 7 8 9]
[10 11 12]]

Example: Iterate down to the scalars

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    for y in x:
        for z in y:
            print(z)

Output:
1
2
3
4
5
6
7
8
9
10
11
12
Iterating Arrays Using nditer()

The function nditer() is a helper function that can be used for anything from very
basic to very advanced iterations. It solves some basic issues which we face in
iteration; let's go through it with examples.

Iterating on Each Scalar Element

With basic for loops, iterating through each scalar of an array requires
n nested for loops, which can be difficult to write for arrays with very high
dimensionality.

Example: Iterate through the following 3-D array:

import numpy as np
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
for x in np.nditer(arr):
    print(x)

Output:
1
2
3
4
5
6
7
8
Iterating With Different Step Size
We can use slicing followed by iteration.
Example: Iterate through every scalar element of the 2D array skipping 1
element:
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for x in np.nditer(arr[:, ::2]):
    print(x)
Output:
1
3
5
7
Enumerated Iteration Using ndenumerate()

Enumeration means adding sequence numbers to the elements of an array, one by one.

Sometimes we need the corresponding index of an element while iterating;
the ndenumerate() method can be used for those use cases.

Example: Enumerate the elements of the following 1-D array:

import numpy as np
arr = np.array([1, 2, 3])
for idx, x in np.ndenumerate(arr):
    print(idx, x)

Output:
(0,) 1
(1,) 2
(2,) 3
Example: Enumerate the elements of the following 2-D array:
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for idx, x in np.ndenumerate(arr):
    print(idx, x)

Output:
(0, 0) 1
(0, 1) 2
(0, 2) 3
(0, 3) 4
(1, 0) 5
(1, 1) 6
(1, 2) 7
(1, 3) 8
NumPy Joining Array
Joining NumPy Arrays
➢ Joining means combining of two or more arrays as a single array.
➢ In SQL we join tables based on a key, whereas in NumPy we join arrays by
axes.
➢ We pass a sequence of arrays that we want to join to
the concatenate() function, along with the axis.
➢ If axis is not explicitly passed, it is taken as 0.
Example: Join two arrays
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)

Output: [1 2 3 4 5 6]

Example: Join two 2-D arrays along rows (axis=1):


import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr)
Output:
[[1 2 5 6]
[3 4 7 8]]
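
To see what the axis argument changes, the same two 2-D arrays can be joined along both axes and the resulting shapes compared. This is a small sketch; the names `down` and `across` are introduced only for illustration.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

down = np.concatenate((arr1, arr2), axis=0)    # arr2 is placed below arr1
across = np.concatenate((arr1, arr2), axis=1)  # arr2 is placed beside arr1

print(down.shape)
print(across.shape)
```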
Joining Arrays Using Stack Functions
➢ Stacking is same as concatenation, the only difference is that stacking is
done along a new axis.
➢ We can concatenate two 1-D arrays along the second axis which would
result in putting them one over the other, ie. stacking.
➢ We pass a sequence of arrays that we want to join to the stack() method
along with the axis.
➢ If axis is not explicitly passed it is taken as 0.
Example:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2), axis=1)
print(arr)

Output:
[[1 4]
[2 5]
[3 6]]
Stacking Along Rows

NumPy provides a helper function: hstack() to stack along rows.

Example:

import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.hstack((arr1, arr2))
print(arr)

Output: [1 2 3 4 5 6]
Stacking Along Columns

NumPy provides a helper function: vstack() to stack along columns.

Example:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.vstack((arr1, arr2))
print(arr)
Output:
[[1 2 3]
[4 5 6]]

Stacking Along Height (depth)

NumPy provides a helper function: dstack() to stack along height, which is
the same as depth.

Example:

import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.dstack((arr1, arr2))
print(arr)

Output:
[[[1 4]
[2 5]
[3 6]]]
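
The stacking helpers differ mainly in the shape of the result. The sketch below applies each of them to the same two 1-D arrays and prints the shapes side by side; the dictionary `results` is a name made up for the example.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

results = {
    "stack axis=0": np.stack((arr1, arr2), axis=0),
    "stack axis=1": np.stack((arr1, arr2), axis=1),
    "hstack": np.hstack((arr1, arr2)),
    "vstack": np.vstack((arr1, arr2)),
    "dstack": np.dstack((arr1, arr2)),
}

for name, joined in results.items():
    print(name, joined.shape)
```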
NumPy Splitting Array
Splitting NumPy Arrays
➢ Splitting is reverse operation of Joining.
➢ Joining merges multiple arrays into one and Splitting breaks one array
into multiple.
➢ We use array_split() for splitting arrays, we pass it the array we
want to split and the number of splits.
Example: Split the array in 3 parts:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)

Output:
[array([1, 2]), array([3, 4]), array([5, 6])]

If the array cannot be divided into equal parts, array_split() adjusts from
the end, making the later parts one element shorter.

Example: Split the array in 4 parts:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 4)
print(newarr)

Output:
[array([1, 2]), array([3, 4]), array([5]), array([6])]
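
The sizes of the parts can be inspected directly, and joining the parts again gives back the original array. This is a minimal sketch; the names `parts`, `sizes` and `rejoined` are introduced only for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
parts = np.array_split(arr, 4)

sizes = [len(p) for p in parts]
print(sizes)        # the later parts are one element shorter
print(type(parts))  # the result is a Python list of arrays

# Splitting is the reverse of joining: concatenating the parts restores arr.
rejoined = np.concatenate(parts)
print(np.array_equal(rejoined, arr))
```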

Split Into Arrays


➢ The return value of the array_split() method is a list containing
each of the splits as an array.
➢ If you split an array into 3 arrays, you can access them from the result just
like any list element.
Example: Access the split arrays:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr[0])
print(newarr[1])
print(newarr[2])

Output:
[1 2]
[3 4]
[5 6]
Splitting 2-D Arrays

Use the array_split() method, passing in the array you want to split and the
number of splits you want to do.

Example: Split the 2-D array into three 2-D arrays.

import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
newarr = np.array_split(arr, 3)
print(newarr)

Output:

[array([[1, 2], [3, 4]]),
array([[5, 6], [7, 8]]),
array([[ 9, 10], [11, 12]])]

Example: Split the 2-D array into three 2-D arrays.


import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.array_split(arr, 3)
print(newarr)

Output:

[array([[1, 2, 3], [4, 5, 6]]),
array([[ 7, 8, 9], [10, 11, 12]]),
array([[13, 14, 15], [16, 17, 18]])]

The example above returns three 2-D arrays.

In addition, you can specify which axis you want to do the split around.

The example below also returns three 2-D arrays, but they are split along the
row (axis=1).

Example: Split the 2-D array into three 2-D arrays along rows.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.array_split(arr, 3, axis=1)
print(newarr)

Output:

[array([[ 1], [ 4], [ 7], [10], [13], [16]]),
array([[ 2], [ 5], [ 8], [11], [14], [17]]),
array([[ 3], [ 6], [ 9], [12], [15], [18]])]

Example: Use the hsplit() method to split the 2-D array into three 2-D arrays
along rows.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.hsplit(arr, 3)
print(newarr)

Output:

[array([[ 1], [ 4], [ 7], [10], [13], [16]]),
array([[ 2], [ 5], [ 8], [11], [14], [17]]),
array([[ 3], [ 6], [ 9], [12], [15], [18]])]
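
The last two examples produce the same output, which suggests that hsplit() behaves like array_split() with axis=1 when the columns divide evenly. The following sketch checks this on a smaller array; the names `by_axis`, `by_hsplit` and `same` are made up for the example.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

by_axis = np.array_split(arr, 3, axis=1)
by_hsplit = np.hsplit(arr, 3)

# Compare the two results piece by piece.
same = all(np.array_equal(p, q) for p, q in zip(by_axis, by_hsplit))
print(same)
print(by_hsplit[0].shape)  # each piece is still a 2-D array with one column
```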

NumPy Searching Arrays
Searching Arrays
➢ You can search an array for a certain value, and return the indexes that it
matches.
➢ To search an array, use the where() method.
Example :Find the indexes where the value is 4:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
Output: (array([3, 5, 6], dtype=int64),)

Which means that the value 4 is present at index 3, 5, and 6.

Example: Find the indexes where the values are even


import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 0)
print(x)
Output: (array([1, 3, 5, 7], dtype=int64),)

Example: Find the indexes where the values are odd


import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 1)
print(x)
Output: (array([0, 2, 4, 6], dtype=int64),)
Search Sorted
➢ There is a method called searchsorted() which performs a binary search
in the array
➢ It returns the index where the specified value would be inserted to
maintain the sorted order.
➢ The searchsorted() method is assumed to be used on sorted arrays.
Example: Find the indexes where the value 7 should be inserted
import numpy as np
arr = np.array([6, 7, 8, 9])
x = np.searchsorted(arr, 7)
print(x)
Output: 1

Search From the Right Side

By default the leftmost index is returned, but we can pass side='right' to return the rightmost index instead.


Example: Find the indexes where the value 7 should be inserted, starting from
the right:
import numpy as np
arr = np.array([6, 7, 8, 9])
x = np.searchsorted(arr, 7, side='right')
print(x)
Output: 2
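
The difference between the two sides is easiest to see when the value already occurs more than once. The sketch below uses an array with a repeated 7 (an array chosen for illustration, not taken from the examples above) and then inserts at the returned index with np.insert() to show that the order is kept.

```python
import numpy as np

arr = np.array([6, 7, 7, 8, 9])

left = np.searchsorted(arr, 7)                 # before the existing 7s
right = np.searchsorted(arr, 7, side='right')  # after the existing 7s
print(left, right)

# Inserting at the returned index keeps the array sorted.
inserted = np.insert(arr, left, 7)
print(inserted)
```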

NumPy Sorting Arrays


Sorting Arrays
➢ Sorting means putting elements in an ordered sequence.
➢ Ordered sequence is any sequence that has an order corresponding to
elements, like numeric or alphabetical, ascending or descending.
➢ NumPy has a function called sort() that returns a sorted copy of a
specified array, leaving the original array unchanged.
Example: Sort the array
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
Output: [0 1 2 3]

Example: Sort the array alphabetically


import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))
Output: ['apple' 'banana' 'cherry']
Example: Sort a boolean array
import numpy as np
arr = np.array([True, False, True])
print(np.sort(arr))
Output: [False True True]

Sorting a 2-D Array


If you use sort() on a 2-D array, each inner array (row) is sorted separately.
Example: Sort a 2-D array:
import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))
Output:
[[2 3 4]
[0 1 5]]
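
Note that np.sort() returns a new array, while the ndarray's own sort() method reorders the array in place. A minimal sketch of the difference, with the name `sorted_copy` introduced only for illustration:

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

sorted_copy = np.sort(arr)
print(sorted_copy)
print(arr)  # the original order is unchanged

arr.sort()  # sorts arr itself and returns None
print(arr)
```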
NumPy Filter Array
Filtering Arrays
➢ Getting some elements from an existing array and creating a new array
out of them is called filtering.
➢ In NumPy, you filter an array using a boolean index list.
➢ If the value at an index is True that element is contained in the filtered
array, if the value at that index is False that element is excluded from the
filtered array.
Example: Create an array from the elements on index 0 and 2
import numpy as np
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)
Output: [41 43]
The example above will return [41, 43]
Because the new array contains only the values where the filter array had the
value True, in this case, index 0 and 2.
Creating the Filter Array
In the above example we use True and False values, but most commonly we
create a filter array based on conditions.
Example: Create a filter array that will return only values higher than 42:
import numpy as np
arr = np.array([41, 42, 43, 44])
# Create an empty list
filter_arr = []
# go through each element in arr
for element in arr:
    # if the element is higher than 42, set the value to True, otherwise False:
    if element > 42:
        filter_arr.append(True)
    else:
        filter_arr.append(False)
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Output:
[False, False, True, True]
[43 44]

Example: Create a filter array that will return only even elements from the
original array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
# Create an empty list
filter_arr = []
# go through each element in arr
for element in arr:
    # if the element is completely divisible by 2, set the value to True, otherwise False
    if element % 2 == 0:
        filter_arr.append(True)
    else:
        filter_arr.append(False)
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Output:
[False, True, False, True, False, True, False]
[2 4 6]
Creating Filter Directly From Array

The above example is quite a common task in NumPy and NumPy provides a
way to tackle it.

We can directly substitute the array instead of the iterable variable in our
condition and it will work just as we expect it to.

Example: Create a filter array that will return only values higher than 42:
import numpy as np
arr = np.array([41, 42, 43, 44])
filter_arr = arr > 42
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Output:
[False False True True]
[43 44]

Example: Create a filter array that will return only even elements from the
original array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
filter_arr = arr % 2 == 0
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Output:
[False True False True False True False]
[2 4 6]
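
The loop-based filter and the direct comparison build the same boolean mask. The sketch below checks this for the "higher than 42" condition; the names `loop_mask` and `direct_mask` are made up for the example.

```python
import numpy as np

arr = np.array([41, 42, 43, 44])

# Element-by-element version, written as a list comprehension.
loop_mask = [bool(element > 42) for element in arr]

# The same condition applied to the whole array at once.
direct_mask = arr > 42

print(direct_mask.dtype)
print(np.array_equal(loop_mask, direct_mask))
print(arr[direct_mask])
```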
Random Numbers in NumPy
Generate Random Number
NumPy offers the random module to work with random numbers.
Example: Generate a random integer from 0 to 99 (randint(100) excludes the upper bound 100):
from numpy import random
x = random.randint(100)
print(x)
Output: 28
Generate Random Float

The random module's rand() method returns a random float between 0 (inclusive)
and 1 (exclusive).

Example: Generate a random float from 0 to 1:


from numpy import random
x = random.rand()
print(x)
Output: 0.18228552030082645

Generate Random Array


In NumPy we work with arrays, and we can use the two methods from the
below examples to make random arrays.

Integers

The randint() method takes a size parameter where you can specify the
shape of an array.
Example: Generate a 1-D array containing 5 random integers from 0 to 99
from numpy import random
x=random.randint(100, size=(5))
print(x)
Output: [67 23 57 94 40]

Example: Generate a 2-D array with 3 rows, each row containing 5 random integers
from 0 to 99
from numpy import random
x = random.randint(100, size=(3, 5))
print(x)
Output:
[[19 29 52 85 86]
[ 6 31 29 94 29]
[59 52 31 83 14]]

Floats

The rand() method also allows you to specify the shape of the array.
Example:Generate a 1-D array containing 5 random floats
from numpy import random
x = random.rand(5)
print(x)
Output: [0.40059825 0.57669527 0.61470883 0.71033653 0.58024434]

Example: Generate a 2-D array with 3 rows, each row containing 5 random numbers
from numpy import random
x = random.rand(3, 5)
print(x)
Output:
[[0.26077556 0.90322224 0.5462452 0.62142255 0.24189894]
[0.0610846 0.29801495 0.6495516 0.12850847 0.00344236]
[0.85568781 0.77171421 0.4696878 0.16071512 0.47611551]]

Generate Random Number From Array


➢ The choice() method allows you to generate a random value based on
an array of values.
➢ The choice() method takes an array as a parameter and randomly
returns one of the values.
Example: Return one of the values in an array
from numpy import random
x = random.choice([3, 5, 7, 9])
print(x)
Output: 3
The choice() method also allows you to return an array of values.
Add a size parameter to specify the shape of the array.

Example: Generate a 2-D array that consists of the values in the array parameter (3, 5, 7, and 9):
from numpy import random
x = random.choice([3, 5, 7, 9], size=(3, 5))
print(x)
Output:
[[7 7 5 5 3]
[7 3 9 9 3]
[3 7 9 3 9]]

NumPy ufuncs
➢ ufuncs stands for "Universal Functions"; they are NumPy functions
that operate element-wise on the ndarray object.
➢ ufuncs are used to implement vectorization in NumPy, which is much faster
than iterating over elements.
➢ ufuncs also take additional arguments, like:
o where: a boolean array or condition defining where the operations should
take place.
o dtype: the return type of the elements.
o out: an output array into which the return value is copied.
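The dtype and out arguments listed above can be tried with the add() ufunc. This is a small sketch; the array values and the names as_float and buffer are chosen only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([4, 5, 6, 7])

# dtype: request floating-point results instead of integers
as_float = np.add(x, y, dtype=np.float64)
print(as_float.dtype)
print(as_float)

# out: write the result into an existing array instead of creating a new one
buffer = np.zeros(4, dtype=np.int64)
np.add(x, y, out=buffer)
print(buffer)
```

Reusing an output array this way can save memory when the same operation is repeated many times.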
What is Vectorization?
➢ Converting iterative statements into a vector based operation is called
vectorization.
➢ It is faster and modern CPUs are optimized for such operations.

Add the Elements of Two Lists

list 1: [1, 2, 3, 4]
list 2: [4, 5, 6, 7]

One way of doing it is to iterate over both lists and add each pair of
elements.
Example: Without ufunc, we can use Python's built-in zip() function

x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = []
for i, j in zip(x, y):
    z.append(i + j)
print(z)
Output: [5, 7, 9, 11]

NumPy has a ufunc for this, called add(x, y) that will produce the same
result.
Example: With ufunc, we can use the add() function:
import numpy as np
x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = np.add(x, y)
print(z)
Output: [ 5  7  9 11]

Create Your Own ufunc


To create your own ufunc, define a function as you would any normal
function in Python, then convert it to a NumPy ufunc with
the frompyfunc() method.
The frompyfunc() method takes the following arguments:
1. function - the name of the function.
2. inputs - the number of input arguments (arrays).
3. outputs - the number of output arrays.
Example: Create your own ufunc for addition
import numpy as np
def myadd(x, y):
    return x+y
myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))
Output: [6 8 10 12]
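One detail worth knowing is that a ufunc built with frompyfunc() returns an array of Python objects rather than a numeric array. The sketch below shows this and one possible way to convert the result; the name myadd_ufunc and the int64 target type are assumptions made for the example.

```python
import numpy as np

def myadd(x, y):
    return x + y

# Wrap the Python function as a ufunc taking 2 inputs and giving 1 output
myadd_ufunc = np.frompyfunc(myadd, 2, 1)

result = myadd_ufunc([1, 2, 3, 4], [5, 6, 7, 8])
print(type(myadd_ufunc))
print(result.dtype)   # the elements are generic Python objects

# Convert to a numeric dtype if further numeric work is needed
numeric = result.astype(np.int64)
print(numeric.dtype)
```

Because the wrapped function still runs as ordinary Python for every element, such a ufunc is convenient but not as fast as the built-in ones.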
How to Check if a Function is a ufunc or not
Example: Check if a function is a ufunc or Not
import numpy as np
print(type(np.add))
Output: <class 'numpy.ufunc'>

If it is not a ufunc, type() will return another type, as with this built-in NumPy function
for joining two or more arrays (the exact type name printed varies between NumPy versions):
Example: Check the type of another function: concatenate():
import numpy as np
print(type(np.concatenate))
Output: <class 'function'>

Simple Arithmetic
➢ You can use the arithmetic operators + - * / directly between NumPy arrays.
➢ This section discusses an extension of the same idea, where we have
functions that can take any array-like objects (e.g. lists, tuples) and
perform arithmetic conditionally.
Arithmetic Conditionally
➢ It defines conditions where the arithmetic operation should happen.
➢ All of the discussed arithmetic functions take a where parameter in
which we can specify that condition.
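A conditional addition might look like the following sketch. The condition (add only where arr1 is even) and the choice to start from a copy of arr1 are assumptions made for illustration; positions where the condition is False keep whatever value the out array already holds.

```python
import numpy as np

arr1 = np.array([10, 11, 12, 13, 14, 15])
arr2 = np.array([20, 21, 22, 23, 24, 25])

# Condition: only add where the value in arr1 is even
condition = arr1 % 2 == 0

# Start from a copy of arr1 so that positions where the condition
# is False keep their original arr1 value
result = arr1.copy()
np.add(arr1, arr2, out=result, where=condition)
print(result)
```

Passing out together with where is advisable, because without it the skipped positions of the result are left uninitialized.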
Addition
The add() function sums the contents of two arrays, and returns the results in a
new array.
Example: Add the values in arr1 to the values in arr2
import numpy as np
arr1 = np.array([10, 11, 12, 13, 14, 15])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.add(arr1, arr2)
print(newarr)
Output: [30 32 34 36 38 40]
The example above will return [30 32 34 36 38 40], which is the sums of 10+20,
11+21, 12+22 etc.
Subtraction
The subtract() function subtracts the values of one array from the values
of another array, and returns the results in a new array.
Example: Subtract the values in arr2 from the values in arr1
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.subtract(arr1, arr2)
print(newarr)
Output: [-10 -1 8 17 26 35]
The example above will return [-10 -1 8 17 26 35], which is the result of 10-20,
20-21, 30-22 etc.

Multiplication
The multiply() function multiplies the values from one array with the
values from another array, and returns the results in a new array.
Example: Multiply the values in arr1 with the values in arr2
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.multiply(arr1, arr2)
print(newarr)
Output: [ 200 420 660 920 1200 1500]
The example above will return [200 420 660 920 1200 1500] which is the result
of 10*20, 20*21, 30*22 etc.

Division
The divide() function divides the values from one array by the values from
another array, and returns the results in a new array.
Example: Divide the values in arr1 by the values in arr2
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 10, 8, 2, 33])
newarr = np.divide(arr1, arr2)
print(newarr)
Output: [ 3.33333333 4. 3. 5. 25. 1.81818182]
The example above will return [3.33333333 4. 3. 5. 25. 1.81818182] which is
the result of 10/3, 20/5, 30/10 etc.

Power
The power() function raises the values from the first array to the power of the
values of the second array, and returns the results in a new array.
Example: Raise the values in arr1 to the power of the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 6, 8, 2, 33])
newarr = np.power(arr1, arr2)
print(newarr)
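Output: [ 1000 3200000 729000000 -520093696 2500 0]
The example above returns the results of 10*10*10, 20*20*20*20*20,
30*30*30*30*30*30 etc. The values -520093696 and 0 are not mathematically
correct: this output comes from a 32-bit integer array, and 40 to the power 8
and 60 to the power 33 are too large to fit, so they overflow silently.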
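The sketch below repeats the power() example with explicit 64-bit integers and with floats, to show how the choice of dtype affects very large results. The dtype arguments and variable names are additions for illustration and are not part of the original example.

```python
import numpy as np

arr1 = np.array([10, 20, 30, 40, 50, 60], dtype=np.int64)
arr2 = np.array([3, 5, 6, 8, 2, 33], dtype=np.int64)

# With 64-bit integers 40**8 fits, but 60**33 is still too large
as_int64 = np.power(arr1, arr2)
print(as_int64)

# Floating-point arithmetic avoids the wrap-around, at the cost of
# exactness for very large results
as_float = np.power(arr1.astype(np.float64), arr2)
print(as_float)
```

When the results may become very large, working with floats (or checking the value range beforehand) is the safer choice.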

Remainder
Both the mod() and the remainder() functions compute the remainder of dividing
the values in the first array by the corresponding values in the second array, and
return the results in a new array.
Example: Return the remainders
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.mod(arr1, arr2)
print(newarr)
Output: [ 1 6 3 0 0 27]
The example above will return [1 6 3 0 0 27], which is the remainders when you
divide 10 by 3 (10%3), 20 by 7 (20%7), 30 by 9 (30%9) etc.
You get the same result when using the remainder() function.
Example: Return the remainders
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.remainder(arr1, arr2)
print(newarr)
Output: [ 1 6 3 0 0 27]

Quotient and Mod


The divmod() function returns both the quotient and the mod. The return
value is two arrays: the first array contains the quotients and the second array
contains the mods.
Example: Return the quotient and mod:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.divmod(arr1, arr2)
print(newarr)
Output: (array([ 3, 2, 3, 5, 25, 1], dtype=int32), array([ 1, 6, 3, 0, 0, 27], dtype=int32))
The example above will return: (array([3, 2, 3, 5, 25, 1]), array([1, 6, 3, 0, 0, 27]))

The first array represents the quotients (the integer value when you divide 10
by 3, 20 by 7, 30 by 9 etc.).

The second array represents the remainders of the same divisions.
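The relationship between the two returned arrays can be checked directly: quotient times divisor plus remainder should give back the original values. The following is a minimal sketch; the names quotient, remainder and rebuilt are chosen for illustration.

```python
import numpy as np

arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])

# divmod returns a tuple, so it can be unpacked directly
quotient, remainder = np.divmod(arr1, arr2)

# Rebuilding the dividends from the two parts should give arr1 back
rebuilt = quotient * arr2 + remainder
print(rebuilt)
print(np.array_equal(rebuilt, arr1))
```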

Absolute Values
Both the absolute() and the abs() functions perform the same element-wise
absolute-value operation, but we should use absolute() to avoid
confusion with Python's built-in abs() function.

Example: Return the absolute values:


import numpy as np
arr = np.array([-1, -2, 1, 2, 3, -4])
newarr = np.absolute(arr)
print(newarr)
Output: [1 2 1 2 3 4]
PANDAS
➢ Pandas is an open-source library that provides high-performance data
analysis tools.

➢ It works well with large data sets.

➢ It supports files in many different formats, which makes it flexible.

➢ It represents data in a tabular way (rows and columns).

➢ Pandas is used for indexing, slicing and subsetting large data sets.

➢ In Pandas we can merge and join two data sets easily.

➢ In Pandas we can reshape data sets.

➢ Pandas allows us to analyze big data and draw conclusions based on
statistical theories.

➢ The name "Pandas" refers to both "Panel Data" and "Python
Data Analysis". The library was created by Wes McKinney in 2008.

➢ Several tools are available for fast data processing, such
as NumPy, SciPy, Cython, and Pandas.
➢ We prefer Pandas because working with it is fast, simple and
more expressive than with the other tools.

Pandas Data Structure

Pandas deals with the following three data structures −

• Series
• DataFrame
• Panel

Data Structure   Dimensions   Description

Series           1            1D labeled homogeneous array, size-immutable (cannot be changed).

DataFrame        2            General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.

Panel            3            General 3D labeled, size-mutable (can be changed) array.

Python Pandas Series

➢ The Pandas Series can be defined as a one-dimensional array that is
capable of storing various data types.
➢ We can easily convert a list, tuple, or dictionary into a Series using
the Series() constructor.
➢ The row labels of series are called the index.
➢ A Series cannot contain multiple columns.

The Series() constructor has the following parameters:

o data: It can be any list, dictionary, or scalar value.
o index: The values of the index should be hashable (and preferably unique). It must be of
the same length as data. If we do not pass any index, the
default np.arange(n) will be used.
o dtype: It refers to the data type of the Series.
o copy: It is used for copying the data.
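The parameters above can be combined in a single call. This is a small sketch with made-up values and labels, showing data, index and dtype together and the default index that is generated when none is passed.

```python
import pandas as pd

# data: a list; index: one label per value; dtype: force floating point
s = pd.Series(data=[10, 20, 30], index=['a', 'b', 'c'], dtype='float64')
print(s)

# Without an index, the labels 0..n-1 are generated automatically
t = pd.Series(data=[10, 20, 30])
print(list(t.index))
```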

Creating a Series:

We can create a Series in two ways:

1. Create an empty Series


2. Create a Series using inputs.

Create an Empty Series:

We can easily create an empty series in Pandas which means it will not have any
value.
The syntax that is used for creating an Empty Series:

series object = pandas.Series()

The example below creates an empty Series object that has no values. Its
default dtype depends on the pandas version: float64 in older releases (as in the
output shown here) and object from pandas 2.0 onwards.

Example:

import pandas as pd
x = pd.Series()
print (x)
Output:
Series([], dtype: float64)

Creating a Series using inputs:

We can create Series by using various inputs:

o Array
o Dictionary
o Scalar value

Creating Series from Array:

➢ Before creating a Series from an array, we have to import the numpy module and
then use the array() function in the program.
➢ If the data is an ndarray, then the passed index must be of the same length.
➢ If we do not pass an index, then by default an index of range(n) is
used, where n is the length of the array, i.e.,
[0, 1, 2, ..., len(array)-1].

Example:
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output:
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object

Create a Series from dictionary

➢ We can also create a Series from dictionary.


➢ If a dictionary object is passed as input and the index is not
specified, then the dictionary keys are used to construct the index. Recent
pandas versions keep the keys in insertion order; older versions sorted them.
➢ If an index is passed, then the values corresponding to each label in the index
will be extracted from the dictionary.
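The second point can be illustrated with a short sketch. The label 'w' is a hypothetical label that is deliberately missing from the dictionary, to show what happens in that case.

```python
import pandas as pd
import numpy as np

info = {'x': 0., 'y': 1., 'z': 2.}

# Labels are looked up in the dictionary; 'w' is not a key, so it gets NaN
a = pd.Series(info, index=['y', 'z', 'w', 'x'])
print(a)
```

The resulting Series follows the order of the passed index, not the order of the dictionary.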

Example:
#import the pandas library
import pandas as pd
import numpy as np
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print (a)
Output:
x 0.0
y 1.0
z 2.0
dtype: float64

Create a Series using Scalar:

➢ If we take the scalar values, then the index must be provided.


➢ The scalar value will be repeated for matching the length of the index.

Example:
#import pandas library
import pandas as pd
import numpy as np
x = pd.Series(4, index=[0, 1, 2, 3])
print (x)
Output:
0 4
1 4
2 4
3 4
dtype: int64
Accessing data from series with Position:

Once you create the Series type object, you can access its indexes, data, and
even individual elements.

The data in the Series can be accessed similar to that in the ndarray.

Example:
import pandas as pd
x = pd.Series([1,2,3],index = ['a','b','c'])
#retrieve the first element by position (x[0] on a labeled index is deprecated in recent pandas)
print (x.iloc[0])
Output : 1
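When a Series has its own labels, it helps to be explicit about whether an element is addressed by position or by label. The following sketch contrasts the two; the variable names are chosen for illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

first_by_position = x.iloc[0]   # position 0
first_by_label = x.loc['a']     # label 'a'
last_two = x.iloc[-2:]          # slice by position, like a list

print(first_by_position, first_by_label)
print(last_two)
```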

Series object attributes

The Series attribute is defined as any information related to the Series
object such as size, datatype, etc.

Below are some of the attributes that you can use to get the information about
the Series object

Attributes        Description

Series.index      Defines the index of the Series.

Series.shape      It returns a tuple of the shape of the data.

Series.dtype      It returns the data type of the data.

Series.size       It returns the size of the data.

Series.empty      It returns True if the Series object is empty, otherwise returns False.

Series.hasnans    It returns True if there are any NaN (empty) values, otherwise returns False.

Series.nbytes     It returns the number of bytes in the data.

Series.ndim       It returns the number of dimensions in the data.

Series.itemsize   It returns the size of the datatype of an item (removed in pandas 1.0; use Series.dtype.itemsize instead).

Retrieving Index array and data array of a series object

We can retrieve the index array and data array of an existing Series object
by using the attributes index and values.

Example:
import numpy as np
import pandas as pd
x=pd.Series(data=[2,4,6,8])
y=pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])
print(x.index)
print(x.values)
print(y.index)
print(y.values)
Output:
RangeIndex(start=0, stop=4, step=1)
[2 4 6 8]
Index(['a', 'b', 'c'], dtype='object')
[11.2 18.6 22.5]

Retrieving Types (dtype) and Size of Type (itemsize)

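You can use the dtype attribute of a Series object, as <objectname>.dtype, to
retrieve the data type of its elements. The itemsize attribute showed the number of
bytes allocated to each data item; in recent pandas versions the same value is
available as <objectname>.dtype.itemsize. The example below prints the dtype and the
number of elements (size) of two Series.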
Example:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4,5])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.dtype)
print(a.size)
print(b.dtype)
print(b.size)
Output:
int64
5
float64
3

Retrieving Shape

The shape attribute of a Series object returns a tuple giving the total number of
elements, including missing or empty values (NaN).

Example:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.shape)
print(b.shape)
Output:
(4,)
(3,)

Retrieving Dimension, Size and Number of bytes

Example:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Output:
1 1
4 3
32 24

Checking Emptiness and Presence of NaNs

➢ To check whether a Series object is empty, you can use the empty attribute.
➢ Similarly, to check if a Series object contains some NaN values we can use
the hasnans attribute.

Example:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,np.nan])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
c=pd.Series()
print(a.empty,b.empty,c.empty)
print(a.hasnans,b.hasnans,c.hasnans)
print(len(a),len(b))
print(a.count( ),b.count( ))
Output:
False False True
True False False
4 3
3 3
Series Functions

There are some functions used in Series which are as follows

Functions                      Description

Pandas Series.map()            Maps each value of a Series to another value using a function, dictionary, or Series.

Pandas Series.std()            Calculates the standard deviation of the values in the Series.

Pandas Series.to_frame()       Converts the Series object to a DataFrame.

Pandas Series.value_counts()   Returns a Series that contains counts of unique values.

Pandas Series.map()

➢ The main task of map() is to substitute each value in a Series with another
value, which may be derived from a function, a dictionary, or another Series.
➢ When the mapping is itself a Series, the values of the calling Series are
looked up in the index of the mapping Series, so that index should be
unique.

Syntax

Series.map(arg, na_action=None)

Parameters

o arg: function, dictionary, or Series.
It refers to the mapping correspondence.
o na_action: {None, 'ignore'}, default value None. If 'ignore', NaN values are
propagated to the result without being passed to the mapping correspondence.

Returns

It returns a Pandas Series with the same index as the caller.

Example:

import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map({'Java': 'Core'})
Output:
0 Core
1 NaN
2 NaN
3 NaN
dtype: object

Example:

import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map('I like {}'.format, na_action='ignore')
Output:
0 I like Java
1 I like C
2 I like C++
3 NaN
dtype: object
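Besides a dictionary or a function, the mapping can also be another Series, whose index supplies the lookup keys. The sketch below uses made-up category names purely for illustration.

```python
import pandas as pd

languages = pd.Series(['Java', 'C', 'C++'])

# The mapping Series: its index holds the lookup keys
kinds = pd.Series({'Java': 'object-oriented', 'C': 'procedural'})

# Values without a matching key in the mapping become NaN
mapped = languages.map(kinds)
print(mapped)
```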

Pandas Series.std()

➢ The Pandas std() function calculates the standard deviation of a given
set of numbers: a Series, or the rows or columns of a DataFrame.
➢ No extra package needs to be imported to calculate the standard deviation;
std() is a method of the Series and DataFrame objects.
➢ The standard deviation is normalized by N-1 by default; this can be changed
using the ddof argument.

Syntax:

Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)

Parameters:

o axis: {index (0), columns (1)}
o skipna: It excludes all the NA/null values. If NA is present in an entire
row/column, the result will be NA.
o level: If the axis is a MultiIndex (hierarchical), it computes along a
particular level, collapsing into a scalar.
o ddof: Delta Degrees of Freedom. The divisor used in calculations is N -
ddof, where N represents the number of elements.
o numeric_only: boolean, default value None. It includes only float, int and
boolean columns. If it is None, it will attempt to use everything, then use
only numeric data. It is not implemented for a Series.

Returns:

It returns a scalar, or a Series if the level is specified.

Example:

import pandas as pd
# calculate standard deviation with NumPy (np.std divides by N, i.e. ddof=0)
import numpy as np
print(np.std([4,7,2,1,6,3]))
print(np.std([6,9,15,2,-17,15,4]))
Output:
2.1147629234082532
10.077252622027656
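The example above uses np.std(), which divides by N, whereas Series.std() divides by N-1 by default. The following sketch compares the two on the same numbers; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

values = [4, 7, 2, 1, 6, 3]
s = pd.Series(values)

sample_std = s.std()             # pandas default: ddof=1 (divide by N-1)
population_std = s.std(ddof=0)   # divide by N, like np.std
numpy_std = np.std(values)       # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

The sample value is always the larger of the two, and the difference shrinks as the number of elements grows.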

Example:

import pandas as pd
import numpy as np
#Create a DataFrame
info = { 'Name':['Parker','Smith','John','William'], 'sub1_Marks':[52,38,42,37],
'sub2_Marks':[41,35,29,36]}
data = pd.DataFrame(info)
#data
# standard deviation of the numeric columns (numeric_only=True skips the text column Name)
data.std(numeric_only=True)
Output:
sub1_Marks 6.849574
sub2_Marks 4.924429
dtype: float64
Pandas Series.to_frame()

➢ A Series is a one-dimensional labeled array that can hold integer, string,
floating-point values, etc.
➢ By default its index runs from 0 to n-1, where n represents the number of
values in the Series.
➢ The main difference between a Series and a DataFrame is that a Series
contains only a single column of values with a particular index,
➢ whereas a DataFrame is a combination of one or more Series that share
an index.
➢ The Pandas Series.to_frame() function is used to convert a Series object
to a DataFrame.

Syntax

Series.to_frame(name=None)

Parameters

name: The column label of the resulting DataFrame. Its default value is None. If a
name is passed, it is substituted for the Series name.

Returns

It returns DataFrame representation of Series.

Example:

import pandas as pd
s = pd.Series(["a", "b", "c"], name="Python")
s.to_frame()
Output:
  Python
0      a
1      b
2      c
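The name parameter described above can be used to choose a different column label. A minimal sketch follows; the label "Language" is an arbitrary example value.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], name="Python")

# Passing name overrides the Series name as the column label
df = s.to_frame(name="Language")
print(df)
print(list(df.columns))
```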
Pandas Series.value_counts()

➢ The value_counts() function returns a Series that contains counts of unique
values.
➢ It returns an object in descending order, so that its first element
is the most frequently occurring element.

Syntax

Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

Parameters

o normalize: If it is True, then the returned object will contain the relative
frequencies of the unique values.
o sort: It sorts by the counts.
o ascending: It sorts in ascending order.
o bins: Rather than counting the values, it groups them into half-open
bins (a convenience for pd.cut), which only works with
numeric data.
o dropna: If it is True (the default), counts of NaN are not included.

Returns

It returns the counted series.

Example:

import pandas as pd
import numpy as np
index = pd.Index([2, 1, 1, np.nan, 3])
index.value_counts()
Output:
1.0 2
3.0 1
2.0 1
dtype: int64

Normalize
With normalize set to True, it returns the relative frequencies by dividing all
counts by the total number of (non-NaN) values.

Example:

s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts(normalize=True)
Output:
3.0 0.4
4.0 0.2
2.0 0.2
1.0 0.2
dtype: float64
dropna

With dropna set to False we can also see NaN index values.

Example:
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts(dropna=False)
Output:
3.0 2
NaN 1
4.0 1
2.0 1
1.0 1
dtype: int64

PANDAS
DataFrame
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a
tabular fashion in rows and columns.

Features of DataFrame

• Columns can potentially be of different types
• Size-mutable (the size can be changed)
• Labeled axes (rows and columns)
• Can perform arithmetic operations on rows and columns

Structure

Let us assume that we are creating a data frame with student’s data.
We can think of it as an SQL table or a spreadsheet data representation.

pandas.DataFrame

A pandas DataFrame can be created using the following constructor −


pandas.DataFrame( data, index, columns, dtype, copy)
The parameters of the constructor are as follows −

S.No   Parameter & Description

1      data: data takes various forms like ndarray, Series, map, lists, dict, constants and
       also another DataFrame.

2      index: For the row labels, the index to be used for the resulting frame. Optional;
       the default is np.arange(n) if no index is passed.

3      columns: For the column labels. Optional; the default is np.arange(n) if no
       column labels are passed.

4      dtype: Data type of each column.

5      copy: This parameter is used for copying the data.

Create DataFrame

A pandas DataFrame can be created using various inputs like −


• Lists
• dict
• Series
• Numpy ndarrays

Create an Empty DataFrame

The most basic DataFrame that can be created is an empty DataFrame.


Example:
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Output:
Empty DataFrame
Columns: []
Index: []

Create a DataFrame from Lists

The DataFrame can be created using a single list or a list of lists.


Example:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
Output:
0
0 1
1 2
2 3
3 4
4 5

Example:
import pandas as pd
data = [['Anand',10],['Beema',12],['Cat',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Output:
Name Age
0 Anand 10
1 Beema 12
2 Cat 13

Example:
import pandas as pd
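data = [['Abc',10],['Xyz',12],['Pqr',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
# convert only the Age column to float; passing dtype=float to the constructor would
# also be applied to the text column Name, which recent pandas versions reject
df['Age'] = df['Age'].astype(float)
print(df)
Output:
Name Age
0 Abc 10.0
1 Xyz 12.0
2 Pqr 13.0
Note − Observe, astype(float) changes the type of the Age column to floating
point.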

Create a DataFrame from Dictionary of ndarrays / Lists

▪ All the ndarrays must be of same length.


▪ If an index is passed, then the length of the index should be equal to the length
of the arrays.
▪ If no index is passed, then by default, index will be range(n), where n is
the array length.
▪ The dictionary keys are by default taken as column names.
Example:
import pandas as pd
data = {'Name':['Ant', 'Cat', 'Rat', 'Parrot'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Output:
Name Age
0 Ant 28
1 Cat 34
2 Rat 29
3 Parrot 42
Note − Observe the values 0,1,2,3. They are the default index assigned to each
row using the function range(n).
Example: Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Output:
Name Age
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42
Note − Observe, the index parameter assigns an index to each row.

Create a DataFrame from List of Dicts

List of Dictionaries can be passed as input data to create a DataFrame.


The dictionary keys are by default taken as column names.
Example: To create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
Output:
a b c
0 1 2 NaN
1 5 10 20.0
Note − Observe, NaN (Not a Number) is appended in missing areas.
Example: To create a DataFrame by passing a list of dictionaries and the row
indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
Output:
a b c
first 1 2 NaN
second 5 10 20.0
Example: To create a DataFrame with a list of dictionaries, row indices, and
column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)
Output:
a b
first 1 2
second 5 10

a b1
first 1 NaN
second 5 NaN
Note − Observe, the df2 DataFrame is created with a column index other than a
dictionary key; thus, NaNs are filled in for that column. df1, in contrast, is created with
column indices that are the same as the dictionary keys, so no NaNs are appended.

Create a DataFrame from Dictionary of Series

Dictionary of Series can be passed to form a DataFrame. The resultant


index is the union of all the series indexes passed.
Example:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
Output:
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Note − Observe, for the series one, there is no label ‘d’ passed, but in the result,
the value for the d label is filled with NaN.

Column Selection, Addition, and Deletion

We will now understand column selection, addition, and deletion with examples

Column Selection

We will understand this by selecting a column from the DataFrame.


Example:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df ['one'])
Output:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64

Column Addition

We will understand this by adding a new column to an existing data frame.


Example:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Adding a new column to an existing DataFrame object with column label by
# passing new series
print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
print(df)
Output:
Adding a new column by passing as Series:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN

Adding a new column using the existing columns in DataFrame:


one two three four
a 1.0 1 10.0 11.0
b 2.0 2 20.0 22.0
c 3.0 3 30.0 33.0
d NaN 4 NaN NaN

Column Deletion

Columns can be deleted or popped


Example:
# Using the previous DataFrame, we will delete a column
# using del function
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)
# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print(df)
# using pop function
print ("Deleting another column using POP function:")
df.pop('two')
print(df)
Output:
Our dataframe is:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN

Deleting the first column using DEL function:
two three
a 1 10.0
b 2 20.0
c 3 30.0
d 4 NaN

Deleting another column using POP function:
three
a 10.0
b 20.0
c 30.0
d NaN

Row Selection, Addition, and Deletion

We will now understand row selection, addition and deletion through examples.
Let us begin with the concept of selection.

Selection by Label

Rows can be selected by passing a row label to the loc indexer.


Example:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
Output:
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And, the
Name of the series is the label with which it is retrieved.

Selection by integer location

Rows can be selected by passing an integer location to the iloc indexer.


Example:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
Output:
one 3.0
two 3.0
Name: c, dtype: float64

Slice Rows

Multiple rows can be selected using ‘:’ operator.


Example:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
Output:
one two
c 3.0 3
d NaN 4

Addition of Rows
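Add new rows to a DataFrame by concatenating it with another DataFrame using the
pd.concat() function. The new rows are appended at the end. (Older pandas versions
offered a DataFrame.append() method for this; it was removed in pandas 2.0.)
Example:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])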
print(df)
Output:
a b
0 1 2
1 3 4
0 5 6
1 7 8
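In the output above the row labels 0 and 1 each appear twice. If fresh labels are preferred, one possible approach is to concatenate with ignore_index=True, as in this sketch; the name combined is chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```

With unique labels, dropping a label later removes exactly one row.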

Deletion of Rows

Use index label to delete or drop rows from a DataFrame.


If a label is duplicated, then multiple rows will be dropped.
Observe that in the above example the labels are duplicated.
Let us drop a label and see how many rows get dropped.
Example:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
# Drop rows with label 0
df = df.drop(0)
print(df)
Output:
a b
1 3 4
1 7 8

PANDAS
PANEL

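A panel is a 3D container of data. The term Panel data is derived from
econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s.
Note that the Panel class was deprecated in pandas 0.20 and removed in pandas 0.25,
so the examples in this section run only on older pandas versions.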
The names for the 3 axes are intended to give some semantic meaning to
describing operations involving panel data. They are −
• items − axis 0, each item corresponds to a DataFrame contained inside.
• major_axis − axis 1, it is the index (rows) of each of the DataFrames.
• minor_axis − axis 2, it is the columns of each of the DataFrames.
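On pandas versions that no longer include Panel, the same three axes can be approximated with a DataFrame that has a two-level row index. The sketch below is one possible approach, not the Panel API itself; the level names 'item' and 'row' and the random data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Two "items", each a DataFrame with 4 rows (major axis) and 3 columns (minor axis)
frames = {
    'Item1': pd.DataFrame(np.random.randn(4, 3)),
    'Item2': pd.DataFrame(np.random.randn(4, 3)),
}

# Stack the frames; the dict keys become the outer level of a MultiIndex
stacked = pd.concat(frames, names=['item', 'row'])
print(stacked)

# Selecting one item gives back an ordinary DataFrame
item1 = stacked.loc['Item1']
print(item1.shape)
```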

pandas.Panel()

A Panel can be created using the following constructor −


pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
The parameters of the constructor are as follows −

Parameter     Description

data          Data takes various forms like ndarray, series, map, lists, dict, constants and also
              another DataFrame

items         axis=0

major_axis    axis=1

minor_axis    axis=2

dtype         Data type of each column

copy          Copy data.

Create Panel

A Panel can be created using multiple ways like −

• From ndarrays
• From dict of DataFrames

From 3D ndarray
Example:
# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np
data = np.random.rand(2,4,5)
p = pd.Panel(data)
print(p)
Output:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
Note − Observe the dimensions of the empty panel and the above panel, all the
objects are different.

From dict of DataFrame Objects

Example:
# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)
Output:
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

Create an Empty Panel

An empty panel can be created using the Panel constructor as follows –


Example:
#creating an empty panel
import pandas as pd
p = pd.Panel()
print(p)
Output:
<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None

Selecting the Data from Panel

Select the data from the panel using −

• Items
• Major_axis
• Minor_axis

Using Items

Example:
# creating a panel and selecting one item
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])
Output:
0 1 2
0 0.488224 -0.128637 0.930817
1 0.417497 0.896681 0.576657
2 -2.775266 0.571668 0.290082
3 -0.400538 -0.144234 1.110535
We have two items, and we retrieved item1. The result is a DataFrame with 4
rows and 3 columns, which are the Major_axis and Minor_axis dimensions.

Using major_axis

Data can be accessed using the method panel.major_xs(index).

Example:
# selecting data along the major axis
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(3, 3)),
'Item2' : pd.DataFrame(np.random.randn(2, 2))}
p = pd.Panel(data)
print(p.major_xs(1))
Output:
Item1 Item2
0 0.610564 -0.777748
1 -1.719858 0.103176
2 0.360732 NaN

Using minor_axis

Data can be accessed using the method panel.minor_xs(index).

Example:
# selecting data along the minor axis
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 4)),
'Item2' : pd.DataFrame(np.random.randn(3, 3))}
p = pd.Panel(data)
print(p.minor_xs(1))
Output:
Item1 Item2
0 -0.118659 -0.423997
1 0.418010 -2.138602
2 -0.383507 0.786874
3 -1.586257 NaN
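
Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the Panel examples in this section only run on older installations. On a current pandas, a rough equivalent (a sketch, not a drop-in replacement) is to stack the DataFrames into one DataFrame with a two-level index. The level names item and major below are illustrative choices, not pandas keywords.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frames = {
    "Item1": pd.DataFrame(rng.standard_normal((4, 3))),
    "Item2": pd.DataFrame(rng.standard_normal((4, 2))),
}

# The dict keys become the outer index level; the row labels become the inner one
stacked = pd.concat(frames, names=["item", "major"])

item1 = stacked.loc["Item1"]            # roughly p['Item1']
row1 = stacked.xs(1, level="major")     # roughly p.major_xs(1), with items as rows
col1 = stacked[1].unstack("item")       # roughly p.minor_xs(1)

print(item1.shape, row1.shape, col1.shape)
```

Because Item2 has only two columns, its entries in column 2 are NaN after stacking, just as the Panel examples pad the smaller DataFrame with NaN.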

Pandas Panel.add()
The Panel.add() function is used for element-wise addition of a Panel with a
Series, DataFrame, or another Panel.
Example:
# importing pandas module
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': ['cse', 'For', 'Computer', 'for', 'Siddartha'],
'b': [11, 1.025, 333, 114.48, 1333]})

data = {'item1':df1, 'item2':df1}


# creating Panel
panel = pd.Panel.from_dict(data, orient ='minor')
print("panel['b'] is - \n\n", panel['b'], '\n')
print("\nAdding panel['b'] with df1['b'] using add() method - \n")
print("\n", panel['b'].add(df1['b'], axis = 0))
Output:

Pandas Panel.mul()

The Panel.mul() function is used for element-wise multiplication of a Panel
with a Series, DataFrame, or another Panel.
Example:
# importing pandas module
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': ['abc', 'For', 'alpha', 'real'],
'b': [111, 123, 425, 1333]})
df2 = pd.DataFrame({'a': ['I', 'am', 'dataframe', 'two'],
'b': [100, 100, 100, 100]})
data = {'item1':df1, 'item2':df2}
# creating Panel
panel = pd.Panel.from_dict(data, orient ='minor')
print("panel['b'] is - \n\n", panel['b'])
print("\nMultiplying panel['b'] with df2['b'] using mul() method - \n")
print("\n", panel['b'].mul(df2['b'], axis = 0))
Output:

Pandas Panel.sum()

Panel.sum() function is used to return the sum of the values for the
requested axis.

Example:
# importing pandas module
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': ['Ant', 'cat', 'rat', 'bird', 'animal'],
'b': [11, 1.025, 333, 114.48, 1333]})
data = {'item1':df1, 'item2':df1}
# creating Panel
panel = pd.Panel.from_dict(data, orient ='minor')
print(panel['b'], '\n')
print("\n", panel['b'].sum(axis = 0))
Output:

Pandas Panel.size

Panel.size returns the number of elements in the object. For a DataFrame, this is
the number of rows multiplied by the number of columns.


Example:
# importing pandas module
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': ['ram', 'mouse', 'cpu', 'key', 'smps'],
'b': [11, 1.025, 333, 114.48, 1333]})
print("df1 is - \n\n", df1)
# Create a 10 * 2 dataframe
df2 = pd.DataFrame(np.random.rand(10, 2), columns =['a', 'b'])
print("df2 is - \n\n", df2)

Output:

('df1 is - \n\n', a b
0 ram 11.000
1 mouse 1.025
2 cpu 333.000
3 key 114.480
4 smps 1333.000)
('df2 is - \n\n', a b
0 0.120637 0.893705
1 0.741454 0.082920)
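
The example above builds the DataFrames but never reads the size attribute. A minimal sketch of how size relates to shape, reusing the same df1 (DataFrame.size is the total number of cells):

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['ram', 'mouse', 'cpu', 'key', 'smps'],
                    'b': [11, 1.025, 333, 114.48, 1333]})

rows, cols = df1.shape
print(df1.shape)  # (5, 2)
print(df1.size)   # 10, i.e. rows * cols
```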

Matplotlib
➢ Matplotlib is a multi-platform data visualization library for Python, built
on NumPy arrays.
➢ It can be used in Python scripts, the shell, web applications, and other
graphical user interface toolkits.
➢ Various toolkits are available that enhance the functionality of
matplotlib.

They are

o Basemap: It is a map plotting toolkit with several map projections,
coastlines, and political boundaries.
o Cartopy: It is a mapping library consisting of object-oriented map
projection definitions, and arbitrary point, line, polygon, and image
transformation abilities.
o Excel tools: Matplotlib provides utilities for exchanging data
with Microsoft Excel.
o Mplot3d: It is used for 3D plots.
o Natgrid: It is an interface to the Natgrid library for gridding irregularly
spaced data.

Matplotlib Architecture

There are three different layers in the architecture of the matplotlib which are
the following:
o Backend Layer
o Artist layer
o Scripting layer

Backend layer

The backend layer is the bottom layer of the architecture. It consists of the
implementation of the various functions that are necessary for plotting.

There are three essential classes in the backend layer:

o FigureCanvas (the surface on which the figure is drawn)
o Renderer (the class that takes care of the drawing on the surface)
o Event (it handles mouse and keyboard events)

Artist Layer

The artist layer is the second layer in the architecture.

It is responsible for the various plotting elements, such as the axis, and
coordinates how the renderer is used on the figure canvas.

Scripting layer

The scripting layer is the topmost layer on which most of our code will run.

The methods in the scripting layer take care of the other layers almost
automatically.

The General Concept of Matplotlib

A Matplotlib figure can be categorized into various parts as below:


Figure: It is a whole figure which may hold one or more axes (plots). We can think
of a Figure as a canvas that holds plots.

Figsize: figsize is a parameter (not a method) used to set the dimensions of the
matplotlib figure, in inches. By default, a figure is 6.4×4.8 inches. Passing
figsize=(width, height) lets you change it to any size.

Axes: A Figure can contain several Axes. Each Axes consists of two (or three, in
the case of 3D) Axis objects, which are responsible for generating the graph
limits. Each Axes also has a title, an x-label, and a y-label.
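
To make the relationship between Figure, figsize and Axes concrete, here is a minimal sketch. The size (8, 3) and the label texts are arbitrary values chosen for illustration.

```python
from matplotlib import pyplot as plt

# figsize is (width, height) in inches
fig, ax = plt.subplots(figsize=(8, 3))

# The Axes object holds the plot itself plus its title and axis labels
ax.plot([1, 2, 3], [4, 5, 1])
ax.set_title("Wide figure")
ax.set_xlabel("X axis")
ax.set_ylabel("Y axis")

width, height = fig.get_size_inches()
print(width, height)  # 8.0 3.0
```

Call plt.show() afterwards to display the figure.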

Basic Example of plotting Graph


Example: Generating a simple graph
from matplotlib import pyplot as plt
# plotting on our canvas
plt.plot([1,2,3],[4,5,1])
#display the graph
plt.show()

Output:
Example: We can add titles, labels to the chart which are created by Python
matplotlib library

from matplotlib import pyplot as plt


x = [5, 2, 7]
y = [1, 10, 4]
plt.plot(x, y)
plt.title('Line graph')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()

Output:

This graph is easier to understand than the previous one.

Working with Pyplot

➢ Each pyplot function makes some change to a figure: for example, it
creates a figure, creates a plotting area in a figure, plots some lines in a
plotting area, or decorates the plot with labels.
➢ The pyplot module provides the plot() function, which is frequently used to
plot a graph.

Example:

from matplotlib import pyplot as plt


plt.plot([1,2,3,4,5])
plt.ylabel("y axis")
plt.xlabel('x axis')
plt.show()

Output:

The above program plots a graph whose x-axis ranges from 0 to 4 and whose
y-axis ranges from 1 to 5.

If we provide a single list to plot(), matplotlib assumes it is a sequence of y
values and automatically generates the x values.

Since Python indexing starts at 0, the default x vector has the same
length as y but starts at 0. Hence the x data are [0, 1, 2, 3, 4].
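
The generated x values can be checked directly. This short sketch reads them back from the Line2D object that plot() returns:

```python
from matplotlib import pyplot as plt

y = [1, 2, 3, 4, 5]
(line,) = plt.plot(y)          # only y is given

# x values that matplotlib generated on our behalf
x_used = [int(v) for v in line.get_xdata()]
print(x_used)  # [0, 1, 2, 3, 4]
```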

Example: plot() accepts an arbitrary number of arguments. Here we plot x versus y.

from matplotlib import pyplot as plt


plt.plot([1,2,3,4,5],[1,4,9,16,25])
plt.ylabel("y axis")
plt.xlabel('x axis')
plt.show()

Output:

Formatting the style of the plot

➢ plot() accepts an optional format string that indicates the color and line
type of the plot.
➢ The default format string is 'b-', which is a solid blue line, as you can
observe in the graph plotted above.

Example: The graph with the red circles.

from matplotlib import pyplot as plt


plt.plot([1, 2, 3, 4,5], [1, 4, 9, 16,25], 'ro')
plt.axis([0, 6, 0, 30])  # [xmin, xmax, ymin, ymax]; ymax must exceed 25 to show the last point
plt.show()

Output:
Format String   Description

'b'             Blue markers with the default shape
'ro'            Red circles
'-g'            Green solid line
'--'            A dashed line with the default color
'^k:'           Black triangle-up markers connected by a dotted line

Matplotlib supports the following color abbreviations:

Character   Color

'b'         Blue
'g'         Green
'r'         Red
'c'         Cyan
'm'         Magenta
'y'         Yellow
'k'         Black
'w'         White
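
As a quick illustration of how a format string is decoded, the sketch below draws three lines using strings from the table and reads the resulting marker and line style back from each line. The y offsets are arbitrary and only keep the lines apart.

```python
from matplotlib import pyplot as plt
from matplotlib import colors as mcolors

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

(red_circles,) = plt.plot(x, y, 'ro')                       # red circle markers
(green_line,) = plt.plot(x, [v + 2 for v in y], '-g')       # green solid line
(black_dotted,) = plt.plot(x, [v + 4 for v in y], '^k:')    # black triangles, dotted line

for line in (red_circles, green_line, black_dotted):
    print(line.get_color(), repr(line.get_marker()), repr(line.get_linestyle()))
```

Call plt.show() afterwards to display the three lines together.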
Plotting with categorical variables

Matplotlib allows us to pass categorical variables directly to many plotting
functions:

Example:

from matplotlib import pyplot as plt


names = ['Anirudh', 'Anand', 'Arjun']
marks= [87,50,98]
plt.figure(figsize=(9,3))
# create 3 axes at positions 1, 2 and 3, in a grid of 1 row and 3 columns
plt.subplot(131)
plt.bar(names, marks)
plt.subplot(132)
plt.scatter(names, marks)
plt.subplot(133)
plt.plot(names, marks)
plt.suptitle('Categorical Plotting')
plt.show()
Output:

In the above program, we have plotted the categorical graphs using
the subplot() function. Let's have a look at the subplot() function.

What is subplot()

➢ The Matplotlib subplot() function is used to draw two or more plots in one
figure.
➢ We can use it to separate two graphs that would otherwise be plotted on the
same axes. Matplotlib supports all kinds of subplot layouts, including 2x1
vertical, 1x2 horizontal, or a 2x2 grid.
➢ It accepts three arguments: nrows, ncols, and index. They denote the number
of rows, the number of columns, and the index of the current plot.
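
A minimal sketch of the three arguments in use, on a 2x2 grid. The plotted values are placeholders. The call plt.subplot(131) used earlier is shorthand for plt.subplot(1, 3, 1).

```python
from matplotlib import pyplot as plt

fig = plt.figure(figsize=(6, 4))
nrows, ncols = 2, 2

# index counts from 1, left to right and then top to bottom
for index in range(1, nrows * ncols + 1):
    ax = plt.subplot(nrows, ncols, index)
    ax.plot([1, 2, 3], [index, 2 * index, 3 * index])
    ax.set_title(f"index={index}")

plt.tight_layout()   # keep titles and tick labels from overlapping
```

Call plt.show() afterwards to display the grid.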

Creating different Types of graphs

1. Line graph

➢ The line graph is a chart that shows information as a series of data points
connected by lines.
➢ The graph is plotted by the plot() function.
➢ The line graph is simple to plot.

Example:

from matplotlib import pyplot as plt


x = [4,8,9]
y = [10,12,15]
plt.plot(x,y)
plt.title("Line graph")
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()

Output:
We can customize the graph by importing the style module.

The style module is included with a matplotlib installation.

It contains various predefined styles that make the plot more attractive.

In the program below, we use the style module.

Example:

from matplotlib import pyplot as plt


from matplotlib import style
style.use('ggplot')
x = [16, 8, 10]
y = [8, 16, 6]
x2 = [8, 15, 11]
y2 = [6, 15, 7]
plt.plot(x, y, 'r', label='line one', linewidth=5)
plt.plot(x2, y2, 'm', label='line two', linewidth=5)
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.legend()
plt.grid(True, color='k')
plt.show()

Output:
2. Bar graphs

Bar graphs are one of the most common types of graphs and are used to show
data associated with the categorical variables.

Matplotlib provides the bar() function to make bar graphs. It accepts arguments
such as the categorical variables, their values, and the bar color.

Example:

from matplotlib import pyplot as plt


players = ['Ram','Abhi','Bhav','Devdut']
runs = [51,87,45,67]
plt.bar(players,runs,color = 'red')
plt.title('Score ')
plt.xlabel('Players')
plt.ylabel('Runs')
plt.show()

Output:
barh()

The barh() function is used to make horizontal bar graphs.

It accepts xerr (or yerr, in the case of vertical graphs) as an argument to
depict the variance in our data. A basic horizontal bar graph is drawn as follows:

Example:

from matplotlib import pyplot as plt


players = ['Virat','Rohit','Shikhar','Hardik']
runs = [51,87,45,67]
plt.barh(players,runs, color = 'blue')
plt.title('Score ')
# with barh(), the values run along the x-axis and the categories along the y-axis
plt.xlabel('Runs')
plt.ylabel('Players')
plt.show()

Output:

style
Example: using the style module

from matplotlib import pyplot as plt


from matplotlib import style
style.use('ggplot')
x = [5,8,10]
y = [12,16,6]
x2 = [6,9,11]
y2 = [7,15,7]
plt.bar(x, y, color = 'y', align='center')
plt.bar(x2, y2, color='c', align='center')
plt.title('Information')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()

Output:

Similarly, bar graphs can be stacked vertically by using the bottom argument,
which specifies the values of the bars on top of which the new bars are drawn.

Example:

from matplotlib import pyplot as plt


import numpy as np
countries = ['USA', 'India', 'China', 'Russia', 'Germany']
bronzes = np.array([38, 17, 26, 19, 15])
silvers = np.array([37, 23, 18, 18, 10])
golds = np.array([46, 27, 26, 19, 17])
ind = [x for x, _ in enumerate(countries)]
plt.bar(ind, golds, width=0.5, label='golds', color='gold', bottom=silvers+bronzes)
plt.bar(ind, silvers, width=0.5, label='silvers', color='silver', bottom=bronzes)
plt.bar(ind, bronzes, width=0.5, label='bronzes', color='#CD853F')
plt.xticks(ind, countries)
plt.ylabel("Medals")
plt.xlabel("Countries")
plt.legend(loc="upper right")
plt.title("2019 Olympics Top Scorers")
plt.show()

Output:

3. Pie Chart

➢ A pie chart is a circular graph that is divided into segments, or slices
of the pie.
➢ It is generally used to represent percentage or proportional data, where
each slice of the pie represents a particular category.

Example:

from matplotlib import pyplot as plt


# Pie chart, where the slices will be ordered and plotted counter-clockwise:
Players = 'Rohit', 'Virat', 'Shikhar', 'Yuvraj'
Runs = [45, 30, 15, 10]
explode = (0.1, 0, 0, 0) # it "explode" the 1st slice
fig1, ax1 = plt.subplots()
ax1.pie(Runs, explode=explode, labels=Players, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Output:

4. Histogram

First, we need to understand the difference between a bar graph and a
histogram.

A histogram is used to show a distribution, whereas a bar chart is used to compare
different entities.

A histogram is a type of bar plot that shows how many values fall into each of
a set of value ranges.

Example: Suppose we take the ages of a group of people and plot a
histogram with respect to bins. The range of values is divided into a series of
intervals, and each interval is a bin. Bins are generally of the same size.

from matplotlib import pyplot as plt


population_age = [21,53,60,49,25,27,30,42,40,1,2,102,95,8,15,105,70,65,55,70,
75,60,52,44,43,42,45]
bins = [0,10,20,30,40,50,60,70,80,90,100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()

Output:
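
As a cross-check on the bar heights, the per-bin counts for the same data can be computed with numpy.histogram. Note that the ages 102 and 105 lie beyond the last bin edge (100), so they are not counted in any bin.

```python
import numpy as np

population_age = [21, 53, 60, 49, 25, 27, 30, 42, 40, 1, 2, 102, 95, 8, 15,
                  105, 70, 65, 55, 70, 75, 60, 52, 44, 43, 42, 45]
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

counts, edges = np.histogram(population_age, bins=bins)
counts = counts.tolist()
edges = edges.tolist()

# Each bin is [left, right), except the last, which also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```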

5. Scatter plot

➢ Scatter plots are mostly used for comparing variables when we need
to determine how much one variable is affected by another variable.
➢ The data is displayed as a collection of points.
➢ Each point has the value of one variable, which defines its position on the
horizontal axis, and the value of the other variable, which defines its
position on the vertical axis.

Let's consider the following simple example:

Example:

from matplotlib import pyplot as plt


from matplotlib import style
style.use('ggplot')
x = [5,7,10]
y = [18,10,6]
x2 = [6,9,11]
y2 = [7,14,17]
plt.scatter(x, y)
plt.scatter(x2, y2, color='g')
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()

Output:

6. 3D graph plot

Matplotlib was initially developed for two-dimensional plotting only.

With the 1.0 release, some three-dimensional plotting utilities were built on
top of the two-dimensional display, and the result is a convenient set of tools
for 3D data visualization.

Three-dimensional plots can be created by importing the mplot3d toolkit, included
with the main Matplotlib installation:

from mpl_toolkits import mplot3d

When this module is imported in the program, three-dimensional axes can be
created by passing the keyword projection='3d' to any of the normal axes
creation routines:

Let's see the simple 3D plot

Example:

from mpl_toolkits import mplot3d  # needed on older matplotlib versions to register the 3D projection
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')
plt.show()

Output:
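
The example above draws only empty 3D axes. As a sketch of plotting data on them, the following draws a helix. The curve and its parameter t are arbitrary choices made for illustration.

```python
from mpl_toolkits import mplot3d  # registers the 3D projection on older matplotlib versions
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# A helix: x and y trace a circle while z rises steadily
t = np.linspace(0, 4 * np.pi, 200)
ax.plot(np.cos(t), np.sin(t), t)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
```

Call plt.show() afterwards to display the plot.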

Data Science

Data science is the study of massive amounts of data. It involves extracting
insights from structured and unstructured data that is processed using the
scientific method.

It is a multidisciplinary field that uses tools and techniques to manipulate the data
so that we can find something new and meaningful.

In short, we can say that data science is all about:

o Asking the correct questions and analyzing the raw data.


o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final
result.

Example:
Suppose we want to travel from station A to station B by car.

Now, we need to make some decisions, such as which route will be the fastest,
which route will have no traffic jams, and which will be the most cost-effective.

All these decision factors act as input data, and we get an appropriate
answer from them. This analysis of data is called data analysis,
which is a part of data science.

Need for Data Science:

Some years ago, there was less data, and it was mostly available in a structured
form, which could be easily stored in Excel sheets and processed using tools.

But in today's world, data is becoming vast: approximately 2.5 quintillion
bytes of data are generated every day, which has led to a data explosion. It is
estimated that by 2030, 100 MB of data will be created
every single second by a single person on earth. Every company requires data to
work, grow, and improve its business.

Now, handling such a huge amount of data is a challenging task for every
organization.
So, to handle, process, and analyze it, we require complex, powerful,
and efficient algorithms and technology, and that technology came into existence
as data science.

Following are some main reasons for using data science technology:

o With the help of data science technology, we can convert the massive
amount of raw and unstructured data into meaningful insights.
o Data science technology is being adopted by various companies, whether
big brands or startups. Google, Amazon, Netflix, etc., which handle huge
amounts of data, use data science algorithms for a better customer
experience.
o Data science is helping to automate transportation, for example by creating
self-driving cars, which are the future of transportation.
o Data science can help with different predictions, such as surveys,
elections, flight ticket confirmation, etc.

Exploratory Data Analysis (EDA)


EDA in Python uses data visualization to draw meaningful patterns. It also
involves the preparation of data sets for analysis by removing irregularities in the
data.
1. Data Sourcing
2. Data Cleaning
3. Univariate analysis
4. Multivariate analysis

1. Data Sourcing
Data Sourcing is the process of finding and loading the data into our
system. There are two ways in which we can find data.
• Private Data
• Public Data
Private Data
▪ As the name suggests, private data is given by private organizations.
▪ There are some security and privacy concerns attached to it.
▪ This type of data is mainly used for an organization's internal analysis.
Public Data
▪ This type of Data is available to everyone.
▪ We can find this in government websites and public organizations etc.
▪ Anyone can access this data; we do not need any special permissions or
approval.
We can get public data on the following sites.
• https://data.gov
• https://data.gov.in
The very first step of EDA is Data Sourcing: it is how we access data and load
it into our system. The next step is to clean the data.
2. Data Cleaning
▪ After completing the Data Sourcing, the next step in the process of EDA
is Data Cleaning.
▪ It is very important to get rid of the irregularities and clean the data after
sourcing it into our system.
Irregularities of data.
Missing Values
• Missing data is always a problem in real life scenarios.
• Areas like machine learning and data mining face severe issues in the
accuracy of their model predictions because of poor quality of data caused
by missing values.
• In these areas, missing value treatment is a major point of focus to make
their models more accurate and valid.
When and Why Is Data Missed?
• Let us consider an online survey for a product.
• Many times, people do not share all the information related to them.
• Some people share their experience, but not how long they have been using the
product;
• some people share how long they have been using the product and their
experience, but not their contact information.
• Thus, in one way or another, a part of the data is always missing, and this is
very common in real-world data.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.
Example:
# import the pandas library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e',
'f','h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df)
Output:
one two three
a 0.077988 0.476149 0.965836
b NaN NaN NaN
c -0.390208 -0.551605 -2.301950
d NaN NaN NaN
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g NaN NaN NaN
h 0.085100 0.532791 0.887415
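
Before cleaning, it helps to see where the values are missing. A minimal sketch using isna(), with fixed placeholder numbers instead of random ones so that the result is reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Number of missing values in each column
missing_per_column = df.isna().sum()
# Labels of the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)].index.tolist()

print(missing_per_column)
print(rows_with_missing)  # ['b', 'd', 'g']
```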
Cleaning / Filling Missing Data
▪ Pandas provides various methods for cleaning the missing values.
▪ The fillna function can “fill in” NA values with non-null data in a couple of
ways, which are illustrated in the following sections.
Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c',
'e'],columns=['one','two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print ("NaN replaced with '0':")
print (df.fillna(0))
Output:
one two three
a -0.576991 -0.741695 0.553172
b NaN NaN NaN
c 0.744328 -1.735166 1.749580

NaN replaced with '0':


one two three
a -0.576991 -0.741695 0.553172
b 0.000000 0.000000 0.000000
c 0.744328 -1.735166 1.749580
Here, we are filling with value zero; instead we can also fill with any other value.
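
For instance, a common alternative to zero is each column's own mean. The sketch below uses a small hand-made DataFrame (the values are placeholders) so the filled numbers are easy to verify.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [10.0, np.nan, 30.0]},
                  index=['a', 'b', 'c'])

# df.mean() skips NaN, so the column means are 2.0 and 20.0
filled = df.fillna(df.mean())
print(filled)
```

Row b becomes 2.0 and 20.0, while the rows that were already complete are left unchanged.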
Drop Missing Values
▪ If you want to simply exclude the missing values, then use
the dropna function along with the axis argument.
▪ By default, axis=0, i.e., along row, which means that if any value within a
row is NA then the whole row is excluded.

Example:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e',
'f','h'],columns=['one', 'two','three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df.dropna())

Output:
one two three
a 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415

Replace Missing (or) Generic Values


▪ Many times, we have to replace a generic value with some specific value.
▪ We can achieve this by applying the replace method.
▪ Replacing NA with a scalar value using replace is equivalent to using
the fillna() function.

Example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],'two':[1000,0,30,40,50,60]})
# replace every 1000 with 10 and every 2000 with 60
print (df.replace({1000:10,2000:60}))

Output:
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
Handling Outliers
▪ An outlier is something which is separate or different from the crowd.
▪ Outliers can be a result of a mistake during data collection or it can be
just an indication of variance in your data.
Let’s have a look at an example.
Suppose you have been asked to observe the performance of the Indian cricket
team, i.e., the runs made by each player, and collect the data.

▪ In the collected data, all the players scored 300+ except Player3, who
scored 10.
▪ This figure can be just a typing mistake, or it can show the variance in your
data and indicate that Player3 is performing very badly and needs
improvement.
▪ Now that we know outliers can either be a mistake or just variance,
we need to know the ways to identify them before we decide whether to
ignore them or not.
There are two types of outliers:
1. Univariate outliers: Univariate outliers are the data points whose values lie
beyond the range of expected values based on one variable.

2. Multivariate outliers: While plotting data, some values of one variable may
not lie beyond the expected range, but when you plot the data with some
other variable, these values may lie far from the expected value.

So, after understanding the causes of these outliers, we can handle them by
dropping those records or imputing with the values or leaving them as is, if it
makes more sense.
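
One widely used way to identify univariate outliers is the interquartile range (IQR) rule, sketched below. This rule is offered as an illustration rather than as the method these notes prescribe, and the run totals are hypothetical values invented to match the description of the cricket example.

```python
import pandas as pd

# Hypothetical run totals: everyone scores 300+ except Player3
runs = pd.Series([320, 345, 10, 310, 360, 330],
                 index=[f"Player{i}" for i in range(1, 7)])

q1 = runs.quantile(0.25)
q3 = runs.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = runs[(runs < lower) | (runs > upper)]

print(outliers)
```

Here only Player3 is flagged. Whether to drop, impute, or keep that record still depends on what caused it.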

3. Univariate analysis

If we analyze data over a single variable/column from a dataset, it is known
as Univariate Analysis.

4. Multivariate Analysis
If we analyze data by taking more than two variables/columns into
consideration from a dataset, it is known as Multivariate Analysis.

Once Exploratory Data Analysis is complete and insights are drawn, the resulting
features can be used for machine learning modeling.
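
As a small sketch of the two kinds of analysis in pandas: describe() summarizes a single column, and corr() relates several columns to each other. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with three numeric columns
df = pd.DataFrame({'age':    [23, 35, 31, 48, 52],
                   'income': [28, 45, 40, 62, 70],
                   'spend':  [12, 20, 19, 25, 31]})

# Univariate: summary statistics of one column
age_summary = df['age'].describe()

# Multivariate: pairwise correlations across all three columns
correlations = df.corr()

print(age_summary)
print(correlations)
```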

Data Science Lifecycle


The data science life cycle is iterative. After completing all the phases, the
data scientist can go back to the first phase.
The main phases of the data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right
questions. When you start any data science project, you need to determine
the basic requirements, priorities, and project budget. In this phase, we need
to determine all the requirements of the project, such as the number of people,
technology, time, data, and the end goal, and then we can frame the business
problem at the first hypothesis level.

2. Data preparation: Data preparation is also known as Data Munging. In this
phase, we need to perform the following tasks:

o Data cleaning
o Data reduction
o Data integration
o Data transformation

After performing all the above tasks, we can easily use this data for our further
processes.

3. Model Planning: In this phase, we need to determine the various methods and
techniques to establish the relations between input variables. We will apply
exploratory data analysis (EDA), using various statistical formulas and
visualization tools, to understand the relations between variables and to see
what the data can tell us. Common tools used for model planning are:

o SQL Analysis Services


o R
o SAS
o Python

4. Model-building: In this phase, the process of model building starts. We will
create datasets for training and testing purposes. We will apply different
techniques such as association, classification, and clustering to build the model.

Following are some common model building tools:

o SAS Enterprise Miner
o WEKA
o SPSS Modeler
o MATLAB

5. Operationalize: In this phase, we will deliver the final reports of the project,
along with briefings, code, and technical documents. This phase provides you a
clear overview of complete project performance and other components on a
small scale before the full deployment.

6. Communicate results: In this phase, we will check whether we have reached the
goal that we set in the initial phase. We will communicate the findings and the
final result to the business team.
Descriptive Statistics
• Descriptive statistics is the type of statistics used to summarize
and describe a dataset.
• It is used to describe the characteristics of data.
• Descriptive statistics are generally used to describe the sample at hand.
• They are displayed through tables, charts, and frequency distributions, and
are generally reported as measures of central tendency.
• Descriptive statistics include the following details about the data:
Central Tendency
o Mean – also known as the average
o Median – the centre most value of the given dataset
o Mode – The value which appears most frequently in the given
dataset
Example: To compute the Measures of Central Tendency
Consider the following data points.
17, 16, 21, 18, 15, 17, 21, 19, 11, 23
• Mean - Mean is calculated as the sum of the observations divided by the
number of observations:
Mean = (17 + 16 + 21 + 18 + 15 + 17 + 21 + 19 + 11 + 23) / 10 = 178 / 10 = 17.8
• Median - To calculate the median, let's arrange the data in ascending order.
11, 15, 16, 17, 17, 18, 19, 21, 21, 23
Since the number of observations is even (10), the median is given by the
average of the two middle observations (5th and 6th here):
Median = (17 + 18) / 2 = 17.5

• Mode - Mode is given by the number that occurs the maximum number of times.
Here, 17 and 21 both occur twice. Hence, this is bimodal data and the modes
are 17 and 21.
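
These three measures can be checked with Python's built-in statistics module. A minimal sketch on the same ten data points (multimode requires Python 3.8 or later):

```python
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

mean = statistics.mean(data)         # sum of the values / number of values
median = statistics.median(data)     # average of the two middle values for an even count
modes = statistics.multimode(data)   # every value tied for the highest frequency

print(mean, median, modes)  # 17.8 17.5 [17, 21]
```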

Statistical Dispersion
o Range – Range gives us an understanding of how spread out the
given data is
o Variance – It gives us an understanding of how far the
measurements are from the mean.
o Standard deviation – The square root of the variance is the standard
deviation; it also measures how far the data deviate from
the mean
Range - Range is the difference between the Maximum value and the Minimum
value in the data set. It is given as
Range = Maximum value − Minimum value

Variance - Variance measures how far the data points are spread out from the mean.
A high variance indicates that data points are spread widely and a small variance
indicates that the data points are closer to the mean of the data set. It is
calculated as the average of the squared deviations from the mean:
Variance = Σ (xᵢ − mean)² / N
(for a sample rather than a whole population, the divisor N − 1 is commonly used
instead of N)

Standard Deviation - The square root of the Variance is called the Standard
Deviation. It is calculated as
Standard Deviation = √Variance
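
As a sketch, the dispersion measures can be computed for the data points used in the central tendency example. It is an assumption here that the population formulas (divisor N) are the ones intended, so the sample variance (divisor N − 1) is printed as well for comparison.

```python
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

data_range = max(data) - min(data)           # maximum - minimum
pop_variance = statistics.pvariance(data)    # divisor N
pop_std = statistics.pstdev(data)            # square root of the population variance
sample_variance = statistics.variance(data)  # divisor N - 1

print(data_range)                 # 12
print(round(pop_variance, 2))     # 10.76
print(round(pop_std, 2))          # 3.28
print(round(sample_variance, 2))  # 11.96
```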
The Bell Curve
It is a graph of the normal distribution of a variable; it is called a bell curve
because of its shape.

Skewness — Skewness is the measure of asymmetry in a probability
distribution. It can be positive, negative, zero, or undefined.

• Positive Skew — This is the case when the tail on the right
side of the curve is longer than that on the left side. For these
distributions, the mean is greater than the mode.

• Negative Skew — This is the case when the tail on the left
side of the curve is longer than that on the right side. For
these distributions, the mean is smaller than the mode.

The most commonly used method of calculating Skewness is

If the skewness is zero, the distribution is symmetrical. If it is
negative, the distribution is Negatively Skewed and if it is
positive, it is Positively Skewed.

Kurtosis — Kurtosis describes whether the data is light-tailed
(lack of outliers) or heavy-tailed (outliers present) when
compared to a Normal distribution. There are three kinds of
Kurtosis:

• Mesokurtic — This is the case when the excess kurtosis is zero,
as for the normal distribution.

• Leptokurtic — This is when the tail of the distribution is
heavy (outliers present) and the kurtosis is higher than that of the
normal distribution.

• Platykurtic — This is when the tail of the distribution is light
(no outliers) and the kurtosis is lower than that of the normal
distribution.
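
To make skewness and kurtosis concrete, here is a sketch in plain Python. It assumes the moment-based definitions: skewness is the third central moment divided by the standard deviation cubed, and excess kurtosis is the fourth central moment divided by the variance squared, minus 3. Other definitions exist, so this is one common choice rather than necessarily the formula these notes had in mind. The helper names and sample lists are made up for illustration.

```python
def central_moment(values, order):
    """Average of (value - mean) ** order over all values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    # third central moment divided by the standard deviation cubed
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # fourth central moment divided by the variance squared, minus 3,
    # so that a normal distribution scores 0
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]

print(skewness(symmetric))          # 0.0
print(skewness(right_tailed) > 0)   # True
print(excess_kurtosis(symmetric))   # negative: flatter than a normal curve
```

A negative excess kurtosis, as for the evenly spread list, corresponds to the platykurtic case described above.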
EDA Tools
Python and R language are the two most commonly used data science tools to
create an EDA.

Python:

• EDA can be done using Python, for example to identify the missing values in a
data set.
• Other tasks that can be performed include describing the data, handling
outliers, and getting insights through plots.
• Its high-level, built-in data structures and its dynamic typing and
binding make it an attractive tool for EDA.
• Python provides certain open-source modules that can automate the
whole process of EDA and help in saving time.

R:

• The R language is widely used by data scientists and statisticians for
developing statistical observations and data analysis.
• R is an open-source programming language that provides a free
software environment for statistical computing and graphics, and is
supported by the R Foundation for Statistical Computing.

Wrapping Up

Apart from the functions described above, EDA can also:

• Perform k-means clustering. It is an unsupervised learning algorithm in
which the data points are assigned to clusters, also known as k-groups.
K-means clustering is commonly used in market segmentation, image
compression, and pattern recognition.
• Be used in predictive models such as linear regression, where it
is used to predict outcomes.
• Be used in univariate, bivariate, and multivariate visualization for
summary statistics, establishing relationships between variables, and
understanding how different fields in the data interact with each
other.

Data Visualization

Data visualization is the graphical representation of information and
data. By using visual elements like charts, graphs, and maps, data visualization
tools provide an accessible way to see and understand trends, outliers, and
patterns in data.

Introduction to Data Visualization


• Data visualization is the process of creating interactive visuals to
understand trends, variations, and derive meaningful insights from the
data.
• Data visualization is used mainly for data checking and cleaning,
exploration and discovery, and communicating results to business
stakeholders.
Scatter plot
• A scatter plot is a data visualization that displays the values of two
different variables as points.
• The data for each point is represented by its horizontal (x) and vertical (y)
position on the visualization.
• Additional variables can be encoded by labels, markers, color,
transparency, size (bubbles), and creating 'small multiples' of scatter
plots. Scatter plots are also known as scatterplots, scatter graphs, scatter
charts, scattergrams, and scatter diagrams.
To understand the importance of visualization, let's take a look at an example
in Figures 1 and 2 below.
Figure 1. Pairs of X and Y can have different values yet have similar
central tendency and correlation values.
The same data points, when represented using visualization in Figure 2 below,
depict a different trend altogether.

Figure 2. Four datasets that look nearly identical when examined using simple
summary statistics vary considerably when graphed.
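
The effect described for Figures 1 and 2 can be checked numerically. The sketch below uses Anscombe's quartet, a classic published set of four small datasets; that the figures show this particular data is an assumption, since the images are not reproduced here. The helper names `mean` and `pearson_r` are illustrative.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Summary statistics per dataset: (mean of x, mean of y, correlation).
summary = {
    name: (mean(x), mean(y), pearson_r(x, y))
    for name, (x, y) in datasets.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

All four datasets report almost the same means and correlation, even though the individual values differ. Only a scatter plot of each dataset reveals how different their shapes are.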

Bar Chart
• A bar chart displays categorical data with rectangular bars whose length
or height corresponds to the value of each data point.

• Bar charts can be visualized using vertical or horizontal bars.

• Bar charts are best used to compare a single category of data or several.

• When comparing more than one category of data, the bars can be
grouped together to create a grouped bar chart.
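
The idea that bar length corresponds to the value of each category can be sketched without a plotting library. The example below counts hypothetical survey answers and draws a horizontal text bar per category; the data and the function name `text_bar_chart` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical categorical data: how each respondent travels to work.
responses = ["bus", "car", "bus", "bike", "car", "bus", "car", "car"]

counts = Counter(responses)


def text_bar_chart(counts, symbol="#"):
    """Return one line per category, with the bar length equal to the count."""
    width = max(len(label) for label in counts)
    return [
        f"{label.ljust(width)} | {symbol * value} {value}"
        for label, value in counts.most_common()  # largest category first
    ]


lines = text_bar_chart(counts)
print("\n".join(lines))
```

A plotting library such as Matplotlib draws the same information graphically, for example with `matplotlib.pyplot.bar` for vertical bars or `matplotlib.pyplot.barh` for horizontal bars.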

Example (figure): Types of Languages Spoken at Home in New York


Histogram
• A histogram is used to summarize discrete or continuous data.
• In other words, it provides a visual interpretation of numerical data by
showing the number of data points that fall within a specified range of
values (called “bins”).
• It is similar to a vertical bar graph.
• However, a histogram, unlike a vertical bar graph, shows no gaps
between the bars.
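
The binning step described above can be sketched as follows. The values are hypothetical whole-number measurements, not data from this document, and the function name `histogram_counts` is illustrative. Each value is placed in a fixed-width bin, and empty bins are kept with a count of zero so that the bins stay contiguous.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count integer values per bin [start, start + bin_width)."""
    # Map each value to the start of its bin, e.g. 12 -> 10 when bin_width is 5.
    counts = Counter((v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    # Include empty bins so there are no gaps between neighbouring bars.
    return {
        start: counts.get(start, 0)
        for start in range(first, last + bin_width, bin_width)
    }


# Hypothetical measurements in seconds.
values = [3, 7, 8, 12, 14, 14, 19, 21, 22, 28]

bins = histogram_counts(values, bin_width=5)

for start, count in bins.items():
    print(f"{start:>2} to {start + 4:>2}: {'#' * count}")
```

With NumPy, `numpy.histogram` performs the same counting, and `matplotlib.pyplot.hist` both bins and draws the data.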

Example:

Ram is the branch manager at a local bank. Recently, Ram has been receiving
customer feedback saying that the wait times for a client to be served by a
customer service representative are too long. Ram decides to observe and write
down the time each customer spends waiting. Here are his findings from
observing the wait times of 20 customers:
The corresponding histogram with 5-second bins (5-second intervals) would
look as follows:

Introduction to Heat Maps


We can quickly grasp the state and impact of a large number of variables at one time
by displaying our data with a heat map visualization.

A heat map visualization is a combination of nested, colored rectangles, each
representing an attribute element.

Heat Maps are often used in the financial services industry to review the status of a
portfolio.
The rectangles contain a wide variety and many shadings of colors, which
emphasize the weight of the various components.
