PP&DS 3
What is NumPy?
➢ In Python we have lists that serve the purpose of arrays, but they are
slow to process.
➢ NumPy aims to provide an array object that is up to 50 times faster than
traditional Python lists.
➢ The array object in NumPy is called ndarray. It comes with many
supporting functions that make working with it very easy.
➢ Arrays are very frequently used in data science, where speed and
resources are very important.
If you have Python and pip already installed on a system, installing NumPy
is very easy. Install it using this command:
pip install numpy
If this command fails, use a Python distribution that already has NumPy
installed, such as Anaconda or Spyder.
Import NumPy
import numpy
Example:
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
Output: [1 2 3 4 5]
NumPy as np
• NumPy is usually imported under the np alias.
• alias: In Python, an alias is an alternate name for referring to the same
thing.
• Create an alias with the as keyword while importing:
import numpy as np
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output: [1 2 3 4 5]
Example: Check the installed NumPy version:
import numpy as np
print(np.__version__)
Output: 1.19.2 (the version on your system may differ)
NumPy Creating Arrays
➢ Create a NumPy ndarray Object
➢ NumPy is used to work with arrays. The array object in NumPy is
called ndarray.
➢ We can create a NumPy ndarray object by using
the array() function.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
Output:
[1 2 3 4 5]
<class 'numpy.ndarray'>
import numpy as np
arr = np.array((1, 2, 3, 4, 5))
print(arr)
Output: [1 2 3 4 5]
Dimensions in Arrays
0-D arrays are the elements in an array. Each value in an array is a 0-D array.
import numpy as np
arr = np.array(42)
print(arr)
Output: 42
1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D
array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output: [1 2 3 4 5]
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
Output:
[[1 2 3]
[4 5 6]]
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called 3-D array.
Example: Create a 3-D array with two 2-D arrays, both containing two arrays
with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Output:
[[[1 2 3]
[4 5 6]]
[[1 2 3]
[4 5 6]]]
Check Number of Dimensions
NumPy arrays provide the ndim attribute, which returns an integer that tells us
how many dimensions the array has.
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
Output:
0
1
2
3
Higher Dimensional Arrays
When the array is created, you can define the minimum number of dimensions
by using the ndmin argument.
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('number of dimensions :', arr.ndim)
Output:
[[[[[1 2 3 4]]]]]
number of dimensions : 5
NumPy Array Indexing
Access Array Elements
• Array indexing is the same as accessing an element in a list.
• You can access an array element by referring to its index number.
• The indexes in NumPy arrays start with 0, meaning that the first element
has index 0, the second has index 1, and so on.
Example: Get the first element from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
Output: 1
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1])
Output: 2
Example: Get third and fourth elements from the following array and add them.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Output: 7
To access elements from 2-D arrays we can use comma separated integers
representing the dimension and the index of the element.
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st dim: ', arr[0, 1])
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('5th element on 2nd dim: ', arr[1, 4])
To access elements from 3-D arrays we can use comma separated integers
representing the dimensions and the index of the element.
Example: Access the third element of the second array of the first array:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
Output: 6
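o The first number represents the first dimension, which contains two
arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]
o The second number represents the second dimension, which also contains
two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]
o The third number represents the third dimension, which contains three
values:
4
5
6
Since we selected 2, we end up with the third value:
6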
Negative Indexing
Use negative indexing to access an array from the end.
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])
Output: Last element from 2nd dim:  10
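Slicing Arrays
Slicing means taking elements from one given index to another given index,
written as [start:end] or [start:end:step]. The end index is not included.
Example: Slice elements from index 4 to the end of the array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])
Output: [5 6 7]
Example: Slice elements from the beginning to index 4 (not included):
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[:4])
Output: [1 2 3 4]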
Negative Slicing
Example: Slice from the index 3 from the end to index 1 from the end:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])
Output: [5 6]
STEP
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
Output: [2 4]
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[::2])
Output: [1 3 5 7]
Slicing 2-D Arrays
Example: From the second element, slice elements from index 1 to index 4
(not included):
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])
Output: [7 8 9]
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 2])
Output: [3 8]
Example: From both elements, slice index 1 to index 4 (not included). This
will return a 2-D array:
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 1:4])
Output:
[[2 3 4]
[7 8 9]]
NumPy Data Types
Data Types in Python
• strings - used to represent text data, the text is given under quote
marks. e.g. "ABCD"
• integer - used to represent integer numbers. e.g. -1, -2, -3
• float - used to represent real numbers. e.g. 1.2, 42.42
• boolean - used to represent True or False.
• complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j
Data Types in NumPy
NumPy has some extra data types, and refers to data types with one
character, like i for integers, u for unsigned integers, etc.
Below is a list of the data types in NumPy and the characters used to represent
them.
• i - integer
• b - boolean
• u - unsigned integer
• f - float
• c - complex float
• m - timedelta
• M - datetime
• O - object
• S - string
• U - unicode string
• V - fixed chunk of memory for other type ( void )
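The one-character codes in the list above can be read back from any array through dtype.kind. The snippet below is a small sketch of that; the sample arrays and the names samples and kinds are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Illustrative sample arrays, one per kind of data (assumed for this sketch).
samples = {
    "integer": np.array([1, 2, 3]),
    "float": np.array([1.5, 2.5]),
    "boolean": np.array([True, False]),
    "complex": np.array([1 + 2j]),
    "unicode string": np.array(["apple"]),
}

# dtype.kind is the single character that identifies the general kind of data.
kinds = {name: arr.dtype.kind for name, arr in samples.items()}
for name, kind in kinds.items():
    print(name, "->", kind)
```

The printed characters should match the i, f, b, c and U entries in the list.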
The NumPy array object has a property called dtype that returns the data type
of the array:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr.dtype)
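Output: int64 (int32 on platforms where the default integer is 32 bits, such as
Windows with NumPy versions before 2.0)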
import numpy as np
arr = np.array(['apple', 'banana', 'cherry'])
print(arr.dtype)
Output: <U6
Creating Arrays With a Defined Data Type
We use the array() function to create arrays. This function can take an
optional argument, dtype, that allows us to define the expected data type of
the array elements:
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)
Output:
[b'1' b'2' b'3' b'4']
|S1
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='i4')
print(arr)
print(arr.dtype)
Output:
[1 2 3 4]
int32
What if a Value Can Not Be Converted?
If a type is given to which the elements cannot be cast, NumPy will raise a
ValueError.
Example: A non-integer string like 'a' cannot be converted to an integer (this
will raise an error):
import numpy as np
arr = np.array(['a', '2', '3'], dtype='i')
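Output:
ValueError: invalid literal for int() with base 10: 'a'
Converting the Data Type of Existing Arrays
To change the data type of an existing array, make a copy of it with the
astype() method, which takes the target data type as its argument.
Example: Change data type from float to integer:
import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)
Output:
[1 2 3]
int32
Example: Change data type from integer to boolean: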
import numpy as np
arr = np.array([1, 0, 3])
newarr = arr.astype(bool)
print(newarr)
print(newarr.dtype)
Output:
[ True False True]
bool
NumPy Copy vs View
➢ The main difference between a copy and a view of an array is that the
copy is a new array, and the view is just a view of the original array.
➢ The copy owns the data and any changes made to the copy will not affect
original array, and any changes made to the original array will not affect
the copy.
➢ The view does not own the data and any changes made to the view will
affect the original array, and any changes made to the original array will
affect the view.
COPY:
Example: Make a copy, change the original array, and display both arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
arr[0] = 42
print(arr)
print(x)
Output:
[42 2 3 4 5]
[1 2 3 4 5]
VIEW:
Example: Make a view, change the original array, and display both arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42
print(arr)
print(x)
Output:
[42 2 3 4 5]
[42 2 3 4 5]
The view SHOULD be affected by the changes made to the original array.
Example: Make a view, change the view, and display both arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
x[0] = 31
print(arr)
print(x)
Output:
[31 2 3 4 5]
[31 2 3 4 5]
The original array SHOULD be affected by the changes made to the view.
As mentioned above, copies own the data and views do not own the data,
but how can we check this?
Every NumPy array has the attribute base, which returns None if the array owns
the data. Otherwise, base refers to the original array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
y = arr.view()
print(x.base)
print(y.base)
Output:
None
[1 2 3 4 5]
NumPy Array Shape
Shape of an Array
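NumPy arrays have an attribute called shape that returns a tuple giving the
number of elements in each dimension. For example, a shape of (2, 4) means that
the array has 2 dimensions: the first has 2 elements and the second has 4.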
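A minimal sketch of reading the shape attribute. The 2 x 4 sample array is an assumption, chosen so that it produces the (2, 4) shape discussed above.

```python
import numpy as np

# A 2-D array with 2 rows and 4 columns (sample data assumed for this sketch).
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)   # (2, 4): 2 elements along the first axis, 4 along the second
print(arr.ndim)    # 2: the length of the shape tuple is the number of dimensions
```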
Example: Create an array with 5 dimensions using ndmin, from a vector with
values 1,2,3,4, and verify that the last dimension has value 4:
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)
Output:
[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)
NumPy Array Reshaping
Reshaping arrays
➢ Reshaping means changing the shape of an array.
➢ The shape of an array is the number of elements in each dimension.
➢ By reshaping we can add or remove dimensions or change number of
elements in each dimension.
Reshape From 1-D to 2-D
Example: Convert the following 1-D array with 12 elements into a 2-D array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
Output:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
Reshape From 1-D to 3-D
Example: Convert the following 1-D array with 12 elements into a 3-D array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)
Output:
[[[ 1 2]
[ 3 4]
[ 5 6]]
[[ 7 8]
[ 9 10]
[11 12]]]
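Both reshapes above work because 4 x 3 and 2 x 3 x 2 each account for exactly 12 elements. The sketch below illustrates that rule with two cases the notes do not show: letting NumPy infer one dimension with -1, and a target shape that does not fit. The variable names are assumptions.

```python
import numpy as np

arr = np.arange(1, 13)          # 12 elements: 1..12

# -1 asks NumPy to infer that dimension from the total element count.
inferred = arr.reshape(2, 2, -1)
print(inferred.shape)           # (2, 2, 3)

# 12 elements cannot fill a 5 x 3 grid, so reshape raises ValueError.
try:
    arr.reshape(5, 3)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print("cannot reshape:", exc)
```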
NumPy Array Iterating
Iterating Arrays
➢ Iterating means going through elements one by one.
➢ As we deal with multi-dimensional arrays in NumPy, we can do this using
Python's basic for loop.
➢ If we iterate on a 1-D array it will go through each element one by one.
Example: Iterate the elements for the following 1-D array:
import numpy as np
arr = np.array([1, 2, 3])
for x in arr:
    print(x)
Output:
1
2
3
Iterating 2-D Arrays
In a 2-D array it will go through all the rows.
Example: Iterate on the elements of the following 2-D array:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    print(x)
Output:
[1 2 3]
[4 5 6]
If we iterate on an n-D array, it will go through the (n-1)-D sub-arrays one by one.
To return the actual values, we have to iterate the arrays in each dimension.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    for y in x:
        print(y)
Output:
1
2
3
4
5
6
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    print(x)
Output:
[[1 2 3]
[4 5 6]]
[[ 7 8 9]
[10 11 12]]
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    for y in x:
        for z in y:
            print(z)
Output:
1
2
3
4
5
6
7
8
9
10
11
12
Iterating Arrays Using nditer()
The function nditer() is a helper function that can be used for anything from
very basic to very advanced iteration. It solves some basic issues that we face
in iteration; let's go through it with examples.
import numpy as np
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
for x in np.nditer(arr):
    print(x)
Output:
1
2
3
4
5
6
7
8
Iterating With Different Step Size
We can use slicing followed by iteration.
Example: Iterate through every scalar element of the 2-D array, skipping 1
element:
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for x in np.nditer(arr[:, ::2]):
    print(x)
Output:
1
3
5
7
Enumerated Iteration Using ndenumerate()
import numpy as np
arr = np.array([1, 2, 3])
for idx, x in np.ndenumerate(arr):
    print(idx, x)
Output:
(0,) 1
(1,) 2
(2,) 3
Example: Enumerate the elements of the following 2-D array:
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for idx, x in np.ndenumerate(arr):
    print(idx, x)
Output:
(0, 0) 1
(0, 1) 2
(0, 2) 3
(0, 3) 4
(1, 0) 5
(1, 1) 6
(1, 2) 7
(1, 3) 8
NumPy Joining Array
Joining NumPy Arrays
➢ Joining means combining the contents of two or more arrays into a single array.
➢ In SQL we join tables based on a key, whereas in NumPy we join arrays by
axes.
➢ We pass a sequence of arrays that we want to join to
the concatenate() function, along with the axis.
➢ If the axis is not explicitly passed, it is taken as 0.
Example: Join two arrays
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
Output: [1 2 3 4 5 6]
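The axis argument only becomes visible with arrays of two or more dimensions. The following sketch, with sample 2 x 2 arrays assumed for illustration, contrasts the default axis 0 with axis=1.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

# Default axis=0: the rows of arr2 are placed below the rows of arr1.
rows = np.concatenate((arr1, arr2))

# axis=1: the columns of arr2 are placed to the right of the columns of arr1.
cols = np.concatenate((arr1, arr2), axis=1)

print(rows)
print(cols)
```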
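Joining Arrays Using stack()
Stacking is the same as concatenation, except that stacking is done along a
new axis.
Example:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2), axis=1)
print(arr)
Output:
[[1 4]
 [2 5]
 [3 6]]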
Stacking Along Rows
Example:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.hstack((arr1, arr2))
print(arr)
Output: [1 2 3 4 5 6]
Stacking Along Columns
Example:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.vstack((arr1, arr2))
print(arr)
Output:
[[1 2 3]
[4 5 6]]
Stacking Along Height (Depth)
Example:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.dstack((arr1, arr2))
print(arr)
Output:
[[[1 4]
[2 5]
[3 6]]]
NumPy Splitting Array
Splitting NumPy Arrays
➢ Splitting is the reverse operation of joining.
➢ Joining merges multiple arrays into one, and splitting breaks one array
into multiple.
➢ We use array_split() for splitting arrays. We pass it the array we
want to split and the number of splits.
Example: Split the array in 3 parts:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
Output:
[array([1, 2]), array([3, 4]), array([5, 6])]
If the array has fewer elements than required, it will adjust from the end
accordingly.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 4)
print(newarr)
Output:
[array([1, 2]), array([3, 4]), array([5]), array([6])]
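The return value is a list containing each of the splits as an array, so each
split can be accessed by its index:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr[0])
print(newarr[1])
print(newarr[2])
Output:
[1 2]
[3 4]
[5 6]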
Splitting 2-D Arrays
Use the array_split() method, passing in the array you want to split and the
number of splits you want to do.
import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
newarr = np.array_split(arr, 3)
print(newarr)
Output:
[array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([[ 9, 10],
       [11, 12]])]
In addition, you can specify which axis you want to do the split around.
The example below also returns three 2-D arrays, but they are split along
axis=1, so the columns are divided into three groups and each resulting array
holds one column.
Example: Split the 2-D array into three 2-D arrays along axis 1.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.array_split(arr, 3, axis=1)
print(newarr)
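Output:
[array([[ 1],
       [ 4],
       [ 7],
       [10],
       [13],
       [16]]), array([[ 2],
       [ 5],
       [ 8],
       [11],
       [14],
       [17]]), array([[ 3],
       [ 6],
       [ 9],
       [12],
       [15],
       [18]])]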
Example: Use the hsplit() method to split the 2-D array into three 2-D arrays
along axis 1 (column-wise).
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.hsplit(arr, 3)
print(newarr)
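Output: a list of three arrays of shape (6, 1), the same result that
array_split(arr, 3, axis=1) gives. The arrays hold the columns
1 4 7 10 13 16, then 2 5 8 11 14 17, then 3 6 9 12 15 18.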
NumPy Searching Arrays
Searching Arrays
➢ You can search an array for a certain value, and return the indexes that it
matches.
➢ To search an array, use the where() method.
Example: Find the indexes where the value is 4:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
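Output: (array([3, 5, 6]),)
This means that the value 4 is present at indexes 3, 5, and 6. (On some
platforms the output also shows dtype=int64.)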
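Filtering an array means getting some elements out of an existing array and
creating a new array from them. NumPy filters an array using a boolean index
list: where the value at an index is True, the element is included.
Example: Create a filter array that will return only even elements from the
original array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
# Create an empty list
filter_arr = []
# go through each element in arr
for element in arr:
    # if the element is evenly divisible by 2, append True, otherwise False
    if element % 2 == 0:
        filter_arr.append(True)
    else:
        filter_arr.append(False)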
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Output:
[False, True, False, True, False, True, False]
[2 4 6]
Creating Filter Directly From Array
The above example is quite a common task in NumPy and NumPy provides a
way to tackle it.
We can directly substitute the array for the iterable variable in our
condition, and it will work just as we expect.
Example: Create a filter array that will return only values higher than 42:
import numpy as np
arr = np.array([41, 42, 43, 44])
filter_arr = arr > 42
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Output:
[False False True True]
[43 44]
Example: Create a filter array that will return only even elements from the
original array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
filter_arr = arr % 2 == 0
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Output:
[False True False True False True False]
[2 4 6]
Random Numbers in NumPy
Generate Random Number
NumPy offers the random module to work with random numbers.
Example: Generate a random integer from 0 to 99 (the upper bound, 100, is
excluded):
from numpy import random
x = random.randint(100)
print(x)
Output: 28 (the value differs from run to run)
Generate Random Float
The random module's rand() method returns a random float in the half-open
interval from 0 (included) to 1 (excluded).
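A minimal sketch of calling rand() with no arguments, which returns a single float. The variable name x follows the other examples in these notes; the printed value differs on every run.

```python
from numpy import random

# With no arguments, rand() returns one float drawn uniformly from [0, 1).
x = random.rand()
print(x)
```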
Integers
The randint() method takes a size parameter where you can specify the
shape of an array.
Example: Generate a 1-D array containing 5 random integers from 0 to 99:
from numpy import random
x = random.randint(100, size=5)
print(x)
Output: [67 23 57 94 40]
Example: Generate a 2-D array with 3 rows, each row containing 5 random
integers from 0 to 99:
from numpy import random
x = random.randint(100, size=(3, 5))
print(x)
Output:
[[19 29 52 85 86]
[ 6 31 29 94 29]
[59 52 31 83 14]]
Floats
The rand() method also allows you to specify the shape of the array.
Example: Generate a 1-D array containing 5 random floats:
from numpy import random
x = random.rand(5)
print(x)
Output: [0.40059825 0.57669527 0.61470883 0.71033653 0.58024434]
Example: Generate a 2-D array with 3 rows, each row containing 5 random
numbers:
from numpy import random
x = random.rand(3, 5)
print(x)
Output:
[[0.26077556 0.90322224 0.5462452 0.62142255 0.24189894]
[0.0610846 0.29801495 0.6495516 0.12850847 0.00344236]
[0.85568781 0.77171421 0.4696878 0.16071512 0.47611551]]
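The choice() method takes an array as a parameter and randomly returns values
from it. Add a size parameter to specify the shape of the returned array.
Example: Generate a 2-D array that consists of the values in the array
parameter (3, 5, 7, and 9):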
from numpy import random
x = random.choice([3, 5, 7, 9], size=(3, 5))
print(x)
Output:
[[7 7 5 5 3]
[7 3 9 9 3]
[3 7 9 3 9]]
NumPy ufuncs
➢ ufuncs stands for "Universal Functions"; they are NumPy functions
that operate on the ndarray object.
➢ ufuncs are used to implement vectorization in NumPy, which is much faster
than iterating over elements.
➢ ufuncs also take additional arguments, like:
o where: a boolean array or condition defining where the operations should
take place.
o dtype: defines the return type of the elements.
o out: the output array into which the return value should be copied.
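As a small sketch of two of these optional arguments, the snippet below passes dtype and out to np.add. The input values and the names as_float and buffer are assumptions made for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([4, 5, 6, 7])

# dtype: ask for the result to be computed and returned as float64.
as_float = np.add(x, y, dtype=np.float64)

# out: write the result into an existing array instead of allocating a new one.
buffer = np.zeros(4, dtype=np.int64)
np.add(x, y, out=buffer)

print(as_float, as_float.dtype)
print(buffer)
```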
What is Vectorization?
➢ Converting iterative statements into a vector based operation is called
vectorization.
➢ It is faster and modern CPUs are optimized for such operations.
Example: Add the elements of the following two lists:
list 1: [1, 2, 3, 4]
list 2: [4, 5, 6, 7]
One way of doing it is to iterate over both of the lists and sum each pair of
elements.
Example: Without ufunc, we can use Python's built-in zip() function
x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = []
for i, j in zip(x, y):
    z.append(i + j)
print(z)
Output: [5, 7, 9, 11]
NumPy has a ufunc for this, called add(x, y) that will produce the same
result.
Example: With ufunc, we can use the add() function:
import numpy as np
x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = np.add(x, y)
print(z)
Output: [ 5  7  9 11]
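To check whether a function is a ufunc, check its type: a ufunc returns
<class 'numpy.ufunc'>. If it is not a ufunc, it will return another type, like
this built-in NumPy function for joining two or more arrays:
Example: Check the type of another function, concatenate():
import numpy as np
print(type(np.concatenate))
Output: <class 'function'> (the exact class name depends on the NumPy version,
but it is never numpy.ufunc)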
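Because the printed class name can vary, a test with isinstance is a more robust way to make the same check. The helper name is_ufunc below is hypothetical and not part of NumPy.

```python
import numpy as np

def is_ufunc(func):
    """Return True if func is a NumPy universal function (hypothetical helper)."""
    return isinstance(func, np.ufunc)

print(is_ufunc(np.add))          # add is a ufunc
print(is_ufunc(np.concatenate))  # concatenate is an ordinary function
```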
Simple Arithmetic
➢ You can use the arithmetic operators + - * / directly between NumPy arrays.
➢ This section discusses an extension of the same, where we have
functions that can take any array-like objects, e.g. lists, tuples, etc., and
perform arithmetic conditionally.
Arithmetic Conditionally
➢ It defines conditions where the arithmetic operation should happen.
➢ All of the discussed arithmetic functions take a where parameter in
which we can specify that condition.
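A minimal sketch of conditional arithmetic with the where parameter. The condition (add only where arr1 is even) and the variable names are assumptions for illustration; out is supplied so that positions where the condition is False hold a defined value.

```python
import numpy as np

arr1 = np.array([10, 11, 12, 13, 14, 15])
arr2 = np.array([20, 21, 22, 23, 24, 25])

# Add only where arr1 is even; other positions keep the value already in `out`,
# which here is a copy of arr1.
mask = arr1 % 2 == 0
result = np.add(arr1, arr2, where=mask, out=arr1.copy())

print(result)   # [30 11 34 13 38 15]
```

Without an out array, the positions where the condition is False are left uninitialized, so passing out is the safer habit.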
Addition
The add() function sums the content of two arrays and returns the results in a
new array.
Example: Add the values in arr1 to the values in arr2
import numpy as np
arr1 = np.array([10, 11, 12, 13, 14, 15])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.add(arr1, arr2)
print(newarr)
Output: [30 32 34 36 38 40]
The example above will return [30 32 34 36 38 40], which is the sums of 10+20,
11+21, 12+22, etc.
Subtraction
The subtract() function subtracts the values of one array from the values
of another array and returns the results in a new array.
Example: Subtract the values in arr2 from the values in arr1
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.subtract(arr1, arr2)
print(newarr)
Output: [-10 -1 8 17 26 35]
The example above will return [-10 -1 8 17 26 35], which is the result of 10-20,
20-21, 30-22, etc.
Multiplication
The multiply() function multiplies the values from one array with the
values from another array and returns the results in a new array.
Example: Multiply the values in arr1 with the values in arr2
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.multiply(arr1, arr2)
print(newarr)
Output: [ 200 420 660 920 1200 1500]
The example above will return [200 420 660 920 1200 1500] which is the result
of 10*20, 20*21, 30*22 etc.
Division
The divide() function divides the values of one array by the values of
another array and returns the results in a new array.
Example: Divide the values in arr1 by the values in arr2
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 10, 8, 2, 33])
newarr = np.divide(arr1, arr2)
print(newarr)
Output: [ 3.33333333 4. 3. 5. 25. 1.81818182]
The example above will return [3.33333333 4. 3. 5. 25. 1.81818182] which is
the result of 10/3, 20/5, 30/10 etc.
Power
The power() function raises the values from the first array to the power of the
values of the second array and returns the results in a new array.
Example: Raise the values in arr1 to the power of values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 6, 8, 2, 33])
newarr = np.power(arr1, arr2)
print(newarr)
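Output: [ 1000 3200000 729000000 -520093696 2500 0]
The example above will return [1000 3200000 729000000 -520093696 2500 0],
which is the result of 10*10*10, 20*20*20*20*20, 30*30*30*30*30*30, etc.
The fourth and sixth values are wrong because of integer overflow: this output
comes from 32-bit integers, which cannot hold 40 to the power 8 or 60 to the
power 33. With 64-bit integers the fourth value is 6553600000000, while the last
one still overflows to 0.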
Remainder
Both the mod() and the remainder() functions return the remainder of the
values in the first array corresponding to the values in the second array, and
return the results in a new array.
Example: Return the remainders
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.mod(arr1, arr2)
print(newarr)
Output: [ 1 6 3 0 0 27]
The example above will return [1 6 3 0 0 27], which is the remainders when
you divide 10 by 3 (10%3), 20 by 7 (20%7), 30 by 9 (30%9), etc.
You get the same result when using the remainder() function.
Example: Return the remainders
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.remainder(arr1, arr2)
print(newarr)
Output: [ 1 6 3 0 0 27]
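To get both the quotient and the remainder, use the divmod() function. It
returns two arrays: the first array represents the quotients (the integer value
when you divide 10 by 3, 20 by 7, 30 by 9, etc.) and the second array represents
the remainders.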
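The description of a first array holding quotients matches what np.divmod returns, although treating that as the function the notes refer to is an assumption. A sketch using the same input arrays as the remainder examples:

```python
import numpy as np

arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])

# divmod returns a pair of arrays: integer quotients and remainders.
quotients, remainders = np.divmod(arr1, arr2)

print(quotients)    # [ 3  2  3  5 25  1]
print(remainders)   # [ 1  6  3  0  0 27]
```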
Absolute Values
Both the absolute() and the abs() functions do the same
absolute operation element-wise, but we should use absolute() to avoid
confusion with Python's built-in abs() function.
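A short sketch of the element-wise absolute value; the sample values are assumed for illustration.

```python
import numpy as np

arr = np.array([-1, -2, 1, 2, 3, -4])

# absolute() removes the sign of every element and returns a new array.
newarr = np.absolute(arr)

print(newarr)   # [1 2 1 2 3 4]
```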
➢ Pandas is used for indexing, slicing, and subsetting large data sets.
➢ The name "Pandas" refers to both "Panel Data" and "Python
Data Analysis"; the library was created by Wes McKinney in 2008.
➢ Several tools are available for fast data processing, such
as NumPy, SciPy, Cython, and Pandas.
➢ We prefer Pandas because working with it is fast, simple, and
more expressive than other tools.
Pandas data structures:
• Series
• DataFrame
• Panel (removed in pandas 0.25; current versions provide only Series and
DataFrame)
Creating a Series:
We can easily create an empty Series in Pandas, which means it will not have any
values.
The syntax that is used for creating an empty Series:
<series object> = pandas.Series()
This creates an empty Series object that has no values. Its default data type
depends on the pandas version (float64 in older releases, object in pandas 2.x),
so pass dtype explicitly if it matters.
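A minimal sketch of creating an empty Series. Passing dtype explicitly is a choice made here so the result does not depend on the installed pandas version; the variable name empty is an assumption.

```python
import pandas as pd

# An explicit dtype avoids version-dependent defaults for an empty Series.
empty = pd.Series(dtype="float64")

print(empty)
print(empty.empty)   # True: the Series holds no values
```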
We can create a Series from several kinds of input:
o Array
o Dictionary
o Scalar value
➢ To create a Series from an array, first import the numpy module and
then use the array() function in the program.
➢ If the data is an ndarray, then the passed index must be of the same length.
➢ If we do not pass an index, then by default an index of range(n) is
used, where n is the length of the array, i.e.,
[0, 1, 2, ..., len(array) - 1].
Example:
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output:
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
Example:
#import the pandas library
import pandas as pd
import numpy as np
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print(a)
Output:
x    0.0
y    1.0
z    2.0
dtype: float64
Example:
#import pandas library
import pandas as pd
import numpy as np
x = pd.Series(4, index=[0, 1, 2, 3])
print (x)
Output:
0 4
1 4
2 4
3 4
dtype: int64
Accessing data from series with Position:
Once you create the Series object, you can access its index, data, and
even individual elements.
The data in the Series can be accessed similarly to that in the ndarray.
Example:
import pandas as pd
x = pd.Series([1,2,3],index = ['a','b','c'])
#retrieve the first element by position (x[0] is deprecated for a label-indexed Series)
print(x.iloc[0])
Output: 1
Below are some of the attributes that you can use to get information about
the Series object:
Attribute         Description
Series.hasnans    Returns True if there are any NaN (missing) values; otherwise
                  returns False.
We can retrieve the index array and data array of an existing Series object
by using the attributes index and values.
Example:
import numpy as np
import pandas as pd
x=pd.Series(data=[2,4,6,8])
y=pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])
print(x.index)
print(x.values)
print(y.index)
print(y.values)
Output:
RangeIndex(start=0, stop=4, step=1)
[2 4 6 8]
Index(['a', 'b', 'c'], dtype='object')
[11.2 18.6 22.5]
You can use the dtype attribute of a Series object, as <objectname>.dtype, to
retrieve the data type of its elements, and the size attribute to retrieve the
number of elements. The itemsize attribute, which showed the number of bytes
allocated to each data item, was removed from Series in pandas 1.0; use
<objectname>.dtype.itemsize instead.
Example:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4,5])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.dtype)
print(a.size)
print(b.dtype)
print(b.size)
Output:
int64
5
float64
3
Retrieving Shape
The shape of the Series object defines total number of elements including
missing or empty values(NaN).
Example:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.shape)
print(b.shape)
Output:
(4,)
(3,)
Example:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Output:
1 1
4 3
32 24
➢ To check whether a Series object is empty, you can use the empty attribute.
➢ Similarly, to check whether a Series object contains NaN values, we can use
the hasnans attribute.
Example:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,np.nan])  # np.NaN was removed in NumPy 2.0; use np.nan
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
c=pd.Series()
print(a.empty,b.empty,c.empty)
print(a.hasnans,b.hasnans,c.hasnans)
print(len(a),len(b))
print(a.count( ),b.count( ))
Output:
False False True
True False False
4 3
3 3
Series Functions
Function        Description
Series.map()    Substitutes each value of a Series using a function, a dictionary,
                or another Series.
Series.std()    Calculates the standard deviation of the given set of numbers;
                DataFrame.std() does the same for columns or rows.
Pandas Series.map()
➢ map() substitutes each value in a Series with another value taken from a
function, a dictionary, or another Series.
➢ When mapping with another Series, the values of the first Series are looked
up in the index of the second Series, so that index should be unique. Values
with no match become NaN.
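A small sketch of mapping one Series through another. The sample data and the names langs, kinds and mapped are assumptions for illustration; the lookup goes from the values of langs to the index of kinds.

```python
import pandas as pd

langs = pd.Series(["Java", "C", "C++"])

# The index of this Series plays the role of the lookup key.
kinds = pd.Series({"Java": "Core", "C": "Procedural"})

# Each value of langs is replaced by the entry of kinds with the same index label.
mapped = langs.map(kinds)
print(mapped)
```

"C++" has no entry in kinds, so its mapped value is NaN.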
Syntax
Series.map(arg, na_action=None)
Parameters
o arg: function, dictionary, or Series.
It refers to the mapping correspondence.
o na_action: {None, 'ignore'}, default None. If 'ignore', NaN values are
propagated to the result without being passed to the mapping correspondence.
Returns
A Series with the same index as the caller.
Example:
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map({'Java': 'Core'})
Output:
0 Core
1 NaN
2 NaN
3 NaN
dtype: object
Example: Pass a function and skip NaN values with na_action='ignore':
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map('I like {}'.format, na_action='ignore')
Output:
0 I like Java
1 I like C
2 I like C++
3 NaN
dtype: object
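Pandas Series.std()
Syntax:
Series.std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)
(signature as in pandas 2.x; older versions differ slightly)
Parameters:
o skipna: exclude NA/null values when computing the result (default True).
o ddof: delta degrees of freedom (default 1). The divisor used in the
calculation is N - ddof, where N is the number of elements.
o numeric_only: include only float, int, and boolean data.
Returns:
A scalar for a Series, or a Series for a DataFrame.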
Example:
import pandas as pd
# calculate standard deviation
import numpy as np
print(np.std([4,7,2,1,6,3]))
print(np.std([6,9,15,2,-17,15,4]))
Output:
2.1147629234082532
10.077252622027656
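Note that np.std and the pandas std() method use different defaults: NumPy divides by n (ddof=0), while pandas divides by n - 1 (ddof=1). The sketch below compares them on the first list from the example above; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

values = [4, 7, 2, 1, 6, 3]

population_std = np.std(values)         # ddof=0: divide by n
sample_std = pd.Series(values).std()    # ddof=1: divide by n - 1
matched = np.std(values, ddof=1)        # NumPy asked to use the pandas default

print(population_std)
print(sample_std)
print(matched)
```

This explains why the same data can give slightly different standard deviations depending on which library computes it.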
Example:
import pandas as pd
import numpy as np
#Create a DataFrame
info = { 'Name':['Parker','Smith','John','William'], 'sub1_Marks':[52,38,42,37],
'sub2_Marks':[41,35,29,36]}
data = pd.DataFrame(info)
#data
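# standard deviation of the numeric columns; numeric_only=True skips the
# 'Name' strings, which recent pandas versions would otherwise reject
data.std(numeric_only=True)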
Output:
sub1_Marks 6.849574
sub2_Marks 4.924429
dtype: float64
Pandas Series.to_frame()
➢ A Series is a one-dimensional labeled structure, similar to a list, that can
hold integers, strings, floating-point values, etc.
➢ By default its index runs from 0 to n-1, where n is the number of values in
the Series.
➢ The main difference between a Series and a DataFrame is that a Series can
only contain a single list with a particular index, whereas a DataFrame is a
combination of more than one Series.
➢ The Pandas Series.to_frame() function is used to convert a Series object
to a DataFrame.
Syntax
Series.to_frame(name=None)
Parameters
name: object, default None. If a name is passed, it is used as the column name
in place of the Series name.
Returns
A DataFrame representation of the Series.
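A minimal sketch of to_frame(), with and without the name argument. The sample Series and the names df and renamed are assumptions for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], name="vals")

# Without arguments, the Series name becomes the column name.
df = s.to_frame()

# A passed name replaces the Series name as the column label.
renamed = s.to_frame(name="letters")

print(df)
print(renamed)
```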
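Pandas value_counts()
The value_counts() function returns a Series containing the counts of unique
values, sorted so that the most frequent value comes first. NaN values are
excluded by default. It is available on both Index and Series objects.
Syntax
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
Returns
A Series of counts.
Example: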
import pandas as pd
import numpy as np
index = pd.Index([2, 1, 1, np.nan, 3])
index.value_counts()
Output:
1.0 2
3.0 1
2.0 1
dtype: int64
Normalize
With normalize set to True, returns the relative frequency by dividing all
values by the sum of values.
Example:
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts(normalize=True)
Output:
3.0 0.4
4.0 0.2
2.0 0.2
1.0 0.2
dtype: float64
dropna
With dropna set to False we can also see NaN index values.
Example:
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts(dropna=False)
Output:
3.0 2
NaN 1
4.0 1
2.0 1
1.0 1
dtype: int64
PANDAS
DataFrame
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a
tabular fashion in rows and columns.
Features of DataFrame
Structure
Let us assume that we are creating a data frame with student’s data.
We can think of it as an SQL table or a spreadsheet data representation.
A pandas DataFrame can be created using the following constructor:
pandas.DataFrame(data, index, columns, dtype, copy)
No.  Parameter  Description
1    data       Takes various forms such as ndarray, Series, map, lists, dict,
                constants, and another DataFrame.
2    index      Row labels for the resulting frame. Optional; defaults to
                np.arange(n) if no index is passed.
3    columns    Column labels. Optional; defaults to np.arange(n) if no column
                labels are passed.
Create DataFrame
Example:
import pandas as pd
data = [['Anand',10],['Beema',12],['Cat',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Output:
Name Age
0 Anand 10
1 Beema 12
2 Cat 13
Example:
import pandas as pd
data = [['Abc',10],['Xyz',12],['Pqr',13]]
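df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
print(df)
Output:
Name Age
0 Abc 10.0
1 Xyz 12.0
2 Pqr 13.0
Note − Observe, astype changes the type of the Age column to floating point.
Passing dtype=float to the DataFrame constructor would apply to every column,
and recent pandas versions raise an error because Name holds strings.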
Output:
Name Age
0 Ant 28
1 Cat 34
2 Rat 29
3 Parrot 42
Note − Observe the values 0,1,2,3. They are the default index assigned to each
row using the function range(n).
Example: Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Ant', 'Cat', 'Rat', 'Parrot'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Output:
Name Age
rank1 Ant 28
rank2 Cat 34
rank3 Rat 29
rank4 Parrot 42
Note − Observe, the index parameter assigns an index to each row.
a b1
first 1 NaN
second 5 NaN
Note − Observe, df2 DataFrame is created with a column index other than the
dictionary key; thus, NaN's are filled in for the missing values. Whereas, df1 is
created with column indices same as dictionary keys, so no NaN's are added.
We will now understand column selection, addition, and deletion with examples
Column Selection
Column Addition
Column Deletion
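A minimal sketch of the three column operations, assuming a small hypothetical student DataFrame (the column name 'AgeNextYear' is illustrative only):

```python
import pandas as pd

# Hypothetical student data used only for illustration
df = pd.DataFrame({'Name': ['Anand', 'Beema', 'Cat'], 'Age': [10, 12, 13]})

# Column selection: indexing with a column label returns a Series
ages = df['Age']

# Column addition: assigning to a new label creates the column
df['AgeNextYear'] = df['Age'] + 1

# Column deletion: drop returns a new DataFrame without the column
df_without_age = df.drop(columns=['Age'])

print(ages)
print(df)
print(df_without_age)
```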
We will now understand row selection, addition and deletion through examples.
Let us begin with the concept of selection.
Selection by Label
Slice Rows
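Selection by label and row slicing can be sketched as follows, assuming the rank-indexed DataFrame from the earlier example; loc selects by index label and iloc slices by position:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ant', 'Cat', 'Rat', 'Parrot'], 'Age': [28, 34, 29, 42]},
                  index=['rank1', 'rank2', 'rank3', 'rank4'])

# Selection by label: loc returns the row with the given index label as a Series
row = df.loc['rank2']

# Slice rows: positions 1 and 2 are kept, the end position 3 is excluded
middle = df.iloc[1:3]

print(row)
print(middle)
```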
Addition of Rows
Add new rows to a DataFrame using the pd.concat function (the older
DataFrame.append method was removed in pandas 2.0). The new rows are placed
at the end.
Example:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
print(df)
Output:
a b
0 1 2
1 3 4
0 5 6
1 7 8
Deletion of Rows
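A minimal sketch of row deletion with drop(), assuming the two small frames from the row-addition example. Because the combined frame has duplicate index labels, dropping one label removes more than one row:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = pd.concat([df, df2])   # index labels are now 0, 1, 0, 1

# drop removes every row whose index label matches, so label 0 removes two rows
remaining = df.drop(0)
print(remaining)
```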
PANDAS
PANEL
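pandas.Panel()
Note − Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the
Panel examples in this section run only on older pandas versions.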
Parameter Description
data Data takes various forms like ndarray, series, map, lists, dict, constants and also
another DataFrame
items axis=0
major_axis axis=1
minor_axis axis=2
Create Panel
• From ndarrays
• From dict of DataFrames
From 3D ndarray
Example:
# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np
data = np.random.rand(2,4,5)
p = pd.Panel(data)
print(p)
Output:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
Note − Observe the dimensions of the empty panel and the above panel, all the
objects are different.
Example:
# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)
Output:
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
• Items
• Major_axis
• Minor_axis
Using Items
Example:
# creating a panel from a dict of DataFrames and selecting one item
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])
Output:
0 1 2
0 0.488224 -0.128637 0.930817
1 0.417497 0.896681 0.576657
2 -2.775266 0.571668 0.290082
3 -0.400538 -0.144234 1.110535
We have two items, and we retrieved item1. The result is a DataFrame with 4
rows and 3 columns, which are the Major_axis and Minor_axis dimensions.
Using major_axis
Using minor_axis
Pandas Panel.add()
The Pandas Panel.add() function is used for element-wise addition of a Series
and a DataFrame/Panel.
Example:
# importing pandas module
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': ['cse', 'For', 'Computer', 'for', 'Siddartha'],
'b': [11, 1.025, 333, 114.48, 1333]})
Pandas Panel.mul()
Pandas Panel.sum()
Panel.sum() function is used to return the sum of the values for the
requested axis.
Example:
# importing pandas module
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': ['Ant', 'cat', 'rat', 'bird', 'animal'],
'b': [11, 1.025, 333, 114.48, 1333]})
data = {'item1':df1, 'item2':df1}
# creating Panel
panel = pd.Panel.from_dict(data, orient ='minor')
print(panel['b'], '\n')
print("\n", panel['b'].sum(axis = 0))
Output:
Pandas Panel.size
Output:
('df1 is - \n\n', a b
0 ram 11.000
1 mouse 1.025
2 cpu 333.000
3 key 114.480
4 smps 1333.000)
('df2 is - \n\n', a b
0 0.120637 0.893705
1 0.741454 0.082920)
Matplotlib
➢ Matplotlib is a Python library which is defined as a multi-platform data
visualization library built on NumPy arrays.
➢ It can be used in Python scripts, the shell, web applications, and other
graphical user interface toolkits.
➢ There are various toolkits available that are used to enhance the
functionality of matplotlib.
They are
Matplotlib Architecture
There are three different layers in the architecture of the matplotlib which are
the following:
o Backend Layer
o Artist layer
o Scripting layer
Backend layer
The backend layer is the bottom layer of the figure, which consists of the
implementation of the various functions that are necessary for plotting.
Artist Layer
The artist layer is responsible for the various plotting elements, such as the axis,
and coordinates how the renderer is used on the figure canvas.
Scripting layer
The scripting layer is the topmost layer on which most of our code will run.
The methods in the scripting layer almost automatically take care of the other
layers.
Axes: A Figure can contain several Axes. Each Axes consists of two or three (in the
case of 3D) Axis objects and has a title, an x-label, and a y-label. The Axis objects
are responsible for generating the graph limits.
Output:
Example: We can add a title and axis labels to a chart created with the Python
matplotlib library
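A minimal sketch of adding a title and axis labels with pyplot, assuming hypothetical data points and label texts:

```python
import matplotlib.pyplot as plt

# Hypothetical data points used only for illustration
x = [1, 2, 3, 4]
y = [2, 4, 1, 3]

plt.figure()
plt.plot(x, y)
plt.title('First graph')   # title shown above the plotting area
plt.xlabel('x axis')       # label for the horizontal axis
plt.ylabel('y axis')       # label for the vertical axis
# plt.show()  # uncomment to display the figure
```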
Output:
➢ Each pyplot function makes some change to a figure: for example, it
creates a figure, creates a plotting area in a figure, plots some lines in a
plotting area, or decorates the plot with labels.
➢ The pyplot module provides the plot() function, which is frequently used to
plot a graph.
Example:
Output:
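In the above program, the x-axis of the graph ranges from 0-4 and the y-axis
from 1-5.
If we provide a single list to plot(), matplotlib assumes it is a sequence of y
values and generates the x values automatically. Since Python indexing starts at 0,
the default x vector has the same length as y but starts at 0. Hence the x data
are [0, 1, 2, 3, 4].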
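The default x values described above can be checked with a short sketch (the y values are the ones implied by the 1-5 range in the text):

```python
import matplotlib.pyplot as plt

plt.figure()
# Only y values are given, so matplotlib generates the x values itself
line, = plt.plot([1, 2, 3, 4, 5])
plt.ylabel('y axis')

# The x data that matplotlib generated for the line
print(list(line.get_xdata()))
# plt.show()  # uncomment to display the figure
```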
Example: We can pass an arbitrary number of arguments to plot(), for example to plot x versus y
Output:
➢ plot() also accepts an optional format string that indicates the color and
line type of the plot.
➢ The default format string is 'b-', which is a solid blue line, as you can
observe in the above plotted graph.
Output:
Format String
Character Color
'b' Blue
'g' Green
'r' Red
'c' Cyan
'm' Magenta
'y' Yellow
'k' Black
'w' White
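A minimal sketch of format strings, assuming hypothetical data. Each string combines a color character from the table with a marker or line style:

```python
import matplotlib.pyplot as plt

# Hypothetical data used only for illustration
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.figure()
# 'ro' = red ('r') circle markers ('o'), no connecting line
red_points, = plt.plot(x, y, 'ro')
# 'g--' = green ('g') dashed line ('--')
green_line, = plt.plot(x, [v / 2 for v in y], 'g--')
# plt.show()  # uncomment to display the figure
```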
Plotting with categorical variables
Example:
What is subplot()
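subplot() places several plotting areas in one figure; its arguments are the number of rows, the number of columns, and the index of the area to draw in. A minimal sketch, assuming hypothetical data and titles:

```python
import matplotlib.pyplot as plt

fig = plt.figure()

# subplot(rows, columns, index): a 1 x 2 grid, first cell
plt.subplot(1, 2, 1)
plt.plot([1, 2, 3], [1, 4, 9])
plt.title('left')

# same grid, second cell
plt.subplot(1, 2, 2)
plt.plot([1, 2, 3], [9, 4, 1])
plt.title('right')
# plt.show()  # uncomment to display the figure
```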
1. Line graph
➢ The line graph is a chart which shows information as a series of data
points connected by lines.
➢ The graph is plotted by the plot() function.
➢ The line graph is simple to plot
Example:
Output:
We can customize the graph by importing the style module.
Example:
Output:
2. Bar graphs
Bar graphs are one of the most common types of graphs and are used to show
data associated with the categorical variables.
Matplotlib provides a bar() function to make bar graphs, which accepts arguments
such as the categorical variables, their values, and a color.
Example:
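A minimal sketch of bar(), assuming hypothetical categories and values:

```python
import matplotlib.pyplot as plt

# Hypothetical categories and values used only for illustration
names = ['A', 'B', 'C']
values = [5, 7, 3]

plt.figure()
# One bar per category; color sets the fill color of all bars
bars = plt.bar(names, values, color='green')
plt.xlabel('category')
plt.ylabel('value')
# plt.show()  # uncomment to display the figure
```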
Output:
barh()
The barh() function makes horizontal bar graphs. It accepts xerr as an argument
(yerr in the case of vertical graphs made with bar()) to depict the variance in
our data as follows:
Example:
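A minimal sketch of barh() with error bars, assuming hypothetical values and spreads:

```python
import matplotlib.pyplot as plt

# Hypothetical categories, values, and spread used only for illustration
names = ['A', 'B', 'C']
values = [5, 7, 3]
spread = [0.5, 1.0, 0.3]

plt.figure()
# Horizontal bars extend along x, so the error bars are given with xerr
bars = plt.barh(names, values, xerr=spread)
# plt.show()  # uncomment to display the figure
```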
Output:
style()
Example: using the style() function
Output:
Bar graphs can also be stacked vertically: pass the bottom argument to bar() to
name the values of the bar graph that the new bars should be stacked on top of.
Example:
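A minimal sketch of a stacked bar graph using the bottom argument, assuming hypothetical values:

```python
import matplotlib.pyplot as plt

# Hypothetical categories and values used only for illustration
names = ['A', 'B', 'C']
lower = [5, 7, 3]
upper = [2, 1, 4]

plt.figure()
plt.bar(names, lower, label='lower')
# bottom=lower starts each new bar where the corresponding lower bar ends
top = plt.bar(names, upper, bottom=lower, label='upper')
plt.legend()
# plt.show()  # uncomment to display the figure
```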
Output:
3. Pie Chart
➢ A pie chart is a circular graph that is divided into segments, or slices
of pie.
➢ It is generally used to represent the percentage or proportional data where
each slice of pie represents a particular category.
Example:
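A minimal sketch of pie(), assuming hypothetical categories whose sizes add up to 100:

```python
import matplotlib.pyplot as plt

# Hypothetical categories and sizes used only for illustration
labels = ['A', 'B', 'C', 'D']
sizes = [40, 30, 20, 10]

plt.figure()
# autopct prints each slice's share of the total as a percentage
wedges, texts, autotexts = plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.axis('equal')   # equal aspect ratio so the pie is drawn as a circle
# plt.show()  # uncomment to display the figure
```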
Output:
4. Histogram
First, we need to understand the difference between the bar graph and
histogram.
A histogram is used for the distribution, whereas a bar chart is used to compare
different entities.
A histogram is a type of bar plot that shows how many values fall within each of
a set of value ranges.
Example: Suppose we take the ages of a group of people and plot a histogram
with respect to the bins. A bin represents a range of values, obtained by dividing
the whole range into a series of intervals. Bins are generally created with the
same size.
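A minimal sketch of hist(), assuming hypothetical ages and equal-width bins of 20 years:

```python
import matplotlib.pyplot as plt

# Hypothetical ages used only for illustration
ages = [3, 12, 18, 22, 25, 31, 35, 38, 44, 47, 52, 58, 63, 67, 71]
bins = [0, 20, 40, 60, 80]   # bin edges: four bins of equal width

plt.figure()
# counts holds the number of ages that fall into each bin
counts, edges, patches = plt.hist(ages, bins=bins, edgecolor='black')
plt.xlabel('age')
plt.ylabel('number of people')
print(counts)
# plt.show()  # uncomment to display the figure
```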
Output:
5. Scatter plot
➢ Scatter plots are mostly used for comparing variables, when we need
to see how much one variable is affected by another variable.
➢ The data is displayed as a collection of points.
➢ Each point has the value of one variable, which defines its position on the
horizontal axis, and the value of the other variable, which defines its
position on the vertical axis.
Example:
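A minimal sketch of scatter(), assuming two hypothetical variables (hours studied and test score):

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations used only for illustration
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 71, 80]

plt.figure()
# Each (hours, scores) pair becomes one point
points = plt.scatter(hours, scores)
plt.xlabel('hours studied')
plt.ylabel('score')
# plt.show()  # uncomment to display the figure
```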
Output:
6. 3D graph plot
Around its 1.0 release, Matplotlib gained some three-dimensional plotting utilities
built on top of its two-dimensional display, and the result is a convenient set of
tools for 3D data visualization.
Example:
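A minimal sketch of a 3D line plot, assuming a hypothetical spiral as data; projection='3d' asks matplotlib for three-dimensional axes:

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # three-dimensional axes

# Hypothetical spiral: x and y circle around while z increases
z = np.linspace(0, 15, 100)
ax.plot(np.sin(z), np.cos(z), z)
ax.set_zlabel('z')
# plt.show()  # uncomment to display the figure
```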
Output:
Data Science
Data science is the study of massive amounts of data, which involves extracting
meaningful insights from structured and unstructured data that is processed using
the scientific method.
It is a multidisciplinary field that uses tools and techniques to manipulate the data
so that we can find something new and meaningful.
Example:
Suppose we want to travel from station A to station B by car.
Now, we need to take some decisions such as which route will be the best route
to reach faster at the location, in which route there will be no traffic jam, and
which will be cost-effective.
All these decision factors will act as input data, and we will get an appropriate
answer from these decisions, so this analysis of data is called the data analysis,
which is a part of data science.
Some years ago, there was less data, and it was mostly available in a structured
form, which could be easily stored in Excel sheets and processed using tools.
But in today's world, data is becoming so vast, i.e., approximately 2.5 quintillion
bytes of data are generated every day, which has led to a data explosion. It is
estimated, as per research, that by 2030, 100 MB of data will be created
every second by every single person on earth. Every company requires data to
work, grow, and improve its business.
Now, handling such a huge amount of data is a challenging task for every
organization.
So to handle, process, and analyze it, we require complex, powerful,
and efficient algorithms and technology, and that technology came into existence
as data science.
Following are some main reasons for using data science technology:
o With the help of data science technology, we can convert the massive
amount of raw and unstructured data into meaningful insights.
o Data science technology is being adopted by various companies, whether
big brands or startups. Google, Amazon, Netflix, etc., which handle huge
amounts of data, are using data science algorithms for better customer
experience.
o Data science is working on automating transportation, such as creating a
self-driving car, which is the future of transportation.
o Data science can help with different predictions, such as various surveys,
elections, flight ticket confirmation, etc.
1. Data Sourcing
Data Sourcing is the process of finding and loading the data into our
system. There are two ways in which we can find data.
• Private Data
• Public Data
Private Data
▪ As the name suggests, private data is given by private organizations.
▪ There are some security and privacy concerns attached to it.
▪ This type of data is mainly used for an organization's internal analysis.
Public Data
▪ This type of Data is available to everyone.
▪ We can find this in government websites and public organizations etc.
▪ Anyone can access this data, we do not need any special permissions or
approval.
We can get public data on the following sites.
• https://data.gov
• https://data.gov.in
The very first step of EDA is Data Sourcing: it is how we access data and load it
into our system. The next step is to clean the data.
2. Data Cleaning
▪ After completing the Data Sourcing, the next step in the process of EDA
is Data Cleaning.
▪ It is very important to get rid of the irregularities and clean the data after
sourcing it into our system.
Irregularities of data.
Missing Values
• Missing data is always a problem in real life scenarios.
• Areas like machine learning and data mining face severe issues in the
accuracy of their model predictions because of poor quality of data caused
by missing values.
• In these areas, missing value treatment is a major point of focus to make
their models more accurate and valid.
When and Why Is Data Missed?
• Let us consider an online survey for a product.
• Many times, people do not share all the information related to them.
• Some people share their experience, but not how long they have been using the
product;
• others share how long they have been using the product and their experience,
but not their contact information.
• Thus, in one way or another, a part of the data is always missing, and this is
very common in real-world data.
Let us now see how we can handle missing values (say NA or NaN) using Pandas.
Example:
# import the pandas library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e',
'f','h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df)
Output:
one two three
a 0.077988 0.476149 0.965836
b NaN NaN NaN
c -0.390208 -0.551605 -2.301950
d NaN NaN NaN
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g NaN NaN NaN
h 0.085100 0.532791 0.887415
Cleaning / Filling Missing Data
▪ Pandas provides various methods for cleaning the missing values.
▪ The fillna function can “fill in” NA values with non-null data in a couple of
ways, which we have illustrated in the following sections.
▪ Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c',
'e'],columns=['one','two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print ("NaN replaced with '0':")
print (df.fillna(0))
Output:
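one two three
a -0.576991 -0.741695 0.553172
b NaN NaN NaN
c 0.744328 -1.735166 1.749580
NaN replaced with '0':
one two three
a -0.576991 -0.741695 0.553172
b 0.000000 0.000000 0.000000
c 0.744328 -1.735166 1.749580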
Example:
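import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# dropna removes every row that contains a missing value
print (df.dropna())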
Output:
one two three
a 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],'two':[1000,0,30,40,50,60]})
print (df.replace({1000:10,2000:60}))
Output:
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
Handling Outliers
▪ An outlier is something which is separate or different from the crowd.
▪ Outliers can be a result of a mistake during data collection or it can be
just an indication of variance in your data.
Let’s have a look at some examples.
Suppose you have been asked to observe the performance of the Indian cricket
team, i.e., the runs made by each player, and collect the data.
▪ As you can see from the above collected data, all other players scored
300+ except Player3, who scored 10.
▪ This figure can be just a typing mistake, or it can show the variance in your
data and indicate that Player3 is performing very badly and needs
improvement.
▪ Now that we know outliers can either be a mistake or just variance,
we need to know the ways to identify them before we decide whether to
ignore them or not.
There are two types of outliers:
1. Univariate outliers: Univariate outliers are the data points whose values lie
beyond the range of expected values based on one variable.
2. Multivariate outliers: While plotting data, some values of one variable may
not lie beyond the expected range, but when you plot the data with some
other variable, these values may lie far from the expected value.
So, after understanding the causes of these outliers, we can handle them by
dropping those records or imputing with the values or leaving them as is, if it
makes more sense.
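One common way to flag univariate outliers is the interquartile range (IQR) rule. These notes do not name a detection method, so the rule, the 1.5 multiplier, and all runs except Player3's score of 10 are assumptions made only for this sketch:

```python
import pandas as pd

# Hypothetical runs; only Player3's score of 10 comes from the text above
runs = pd.Series([320, 345, 10, 310, 360, 335],
                 index=['Player1', 'Player2', 'Player3',
                        'Player4', 'Player5', 'Player6'])

q1, q3 = runs.quantile(0.25), runs.quantile(0.75)
iqr = q3 - q1

# IQR rule (an assumption here): flag values more than 1.5 * IQR outside the quartiles
outliers = runs[(runs < q1 - 1.5 * iqr) | (runs > q3 + 1.5 * iqr)]
print(outliers)

# One way to handle them: drop the flagged records
cleaned = runs.drop(outliers.index)
```

Whether to drop, impute, or keep the flagged record still depends on whether it is a mistake or genuine variance.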
3.Univariate analysis
4. Multivariate Analysis
If we analyze data by taking more than two variables/columns into
consideration from a dataset, it is known as Multivariate Analysis.
Once Exploratory Data Analysis is complete and insights are drawn, the resulting
features can be used for machine learning modeling.
1. Discovery: The first phase is discovery, which involves asking the right
questions. When you start any data science project, you need to determine what
are the basic requirements, priorities, and project budget. In this phase, we need
to determine all the requirements of the project such as the number of people,
technology, time, data, an end goal, and then we can frame the business problem
on first hypothesis level.
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further
processes.
3. Model Planning: In this phase, we need to determine the various methods and
techniques to establish the relation between input variables. We will apply
exploratory data analysis (EDA), using various statistical formulas and
visualization tools, to understand the relations between variables and to see what
the data can tell us. Common tools used for model planning are:
5. Operationalize: In this phase, we will deliver the final reports of the project,
along with briefings, code, and technical documents. This phase provides you a
clear overview of complete project performance and other components on a
small scale before the full deployment.
6. Communicate results: In this phase, we will check if we reach the goal, which
we have set on the initial phase. We will communicate the findings and final
result with the business team.
Descriptive Statistics
• Descriptive statistics is the type of statistics which is used to summarize
and describe the dataset.
• It is used to describe the characteristics of data.
• Descriptive statistics are generally used to determine the sample.
• It is displayed through tables, charts, frequency distributions and is
generally reported as a measure of central tendency.
• Descriptive statistics include the following details about the data
Central Tendency
o Mean – also known as the average
o Median – the middle value of the given dataset
o Mode – The value which appears most frequently in the given
dataset
Example: To compute the Measures of Central Tendency
Consider the following data points.
17, 16, 21, 18, 15, 17, 21, 19, 11, 23
• Mean - Mean is calculated as the sum of the data points divided by the
number of data points:
Mean = (17 + 16 + 21 + 18 + 15 + 17 + 21 + 19 + 11 + 23) / 10 = 178 / 10 = 17.8
• Median - To calculate the median, let's arrange the data in ascending order.
11, 15, 16, 17, 17, 18, 19, 21, 21, 23
Since the number of observations is even (10), the median is given by the average
of the two middle observations (5th and 6th here):
Median = (17 + 18) / 2 = 17.5
• Mode - Mode is given by the number that occurs the maximum number of times.
Here, 17 and 21 both occur twice. Hence, this is bimodal data and the modes
are 17 and 21.
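The three measures can be checked with Python's standard statistics module. This is a minimal sketch using the data points above; multimode() is used because the data has more than one mode:

```python
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

mean = statistics.mean(data)        # sum of the values divided by their count
median = statistics.median(data)    # average of the two middle values for an even count
modes = statistics.multimode(data)  # every value tied for the highest frequency

print(mean, median, modes)
```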
Statistical Dispersion
o Range – Range gives us the understanding of how spread out the
given data is
o Variance – It gives us the understanding of how far the
measurements are from the mean.
o Standard deviation – Square root of the variance is standard
deviation, also the measurement of how far the data deviate from
the mean
Range - Range is the difference between the Maximum value and the Minimum
value in the data set. It is given as Range = Maximum value - Minimum value.
Variance - Variance measures how far the data points are spread out from the mean.
A high variance indicates that the data points are spread widely, and a small
variance indicates that the data points are closer to the mean of the data set. It
is calculated as the average of the squared differences between each data point
and the mean.
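A minimal sketch of the dispersion measures, reusing the data points from the central tendency example as an assumption. The notes do not say whether the population or the sample variance is meant, so both are shown:

```python
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

value_range = max(data) - min(data)          # maximum value minus minimum value
pop_variance = statistics.pvariance(data)    # squared deviations averaged over n
sample_variance = statistics.variance(data)  # squared deviations divided by n - 1
pop_std = statistics.pstdev(data)            # square root of the population variance

print(value_range, pop_variance, sample_variance, pop_std)
```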
• Positive Skew — This is the case when the tail on the right
side of the curve is bigger than that on the left side. For these
distributions, mean is greater than the mode.
• Negative Skew — This is the case when the tail on the left
side of the curve is bigger than that on the right side. For
these distributions, mean is smaller than the mode.
Python:
• EDA can be done using python for identifying the missing value in a data
set.
• Other functions that can be performed are –
• The description of data, handling outliers, getting insights through the
plots.
• Its high-level built-in data structures, together with dynamic typing and
binding, make it an attractive tool for EDA.
• Python provides certain open-source modules that can automate the
whole process of EDA and help in saving time.
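A minimal sketch of two of the EDA steps listed above, identifying missing values and describing the data, assuming a small hypothetical dataset with the column names 'age' and 'score':

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value per column
df = pd.DataFrame({'age': [25, 32, np.nan, 41],
                   'score': [88.0, np.nan, 79.5, 92.0]})

missing_per_column = df.isnull().sum()   # number of missing values in each column
summary = df.describe()                  # count, mean, std, min, quartiles, max

print(missing_per_column)
print(summary)
```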
R:
Wrapping Up
Data Visualization
Figure 2 illustrates how four datasets that look nearly identical when examined
using simple summary statistics vary considerably when graphed.
Bar Chart
• A bar chart displays categorical data with rectangular bars whose length
or height corresponds to the value of each data point.
• Bar charts are best used to compare a single category of data or several.
• When comparing more than one category of data, the bars can be
grouped together to create a grouped bar chart.
Examples
Example:
Ram is the branch manager at a local bank. Recently, Ram has been receiving
customer feedback saying that the wait times for a client to be served by a
customer service representative are too long. Ram decides to observe and write
down the time each customer spends waiting. Here are his findings from
observing the wait times of 20 customers:
The corresponding histogram with 5-second bins (5-second intervals) would
look as follows:
Heat Maps are often used in the financial services industry to review the status of a
portfolio.
The rectangles contain a wide variety and many shadings of colors, which
emphasize the weight of the various components.