
Iml Unit 2

The document provides an introduction to NumPy, a Python library essential for machine learning, focusing on its array object 'ndarray' which offers significant performance advantages over traditional Python lists. It covers array creation, dimensions, accessing elements, and various operations such as stacking, splitting, and mathematical functions. Additionally, it highlights statistical functions available in NumPy for data analysis, including mean, median, and standard deviation.

Uploaded by

Devanshi Dave

Introduction to Machine Learning (4350702) PROF. DEVANSHI DAVE

Unit- II Python libraries suitable for Machine Learning

2.1 NumPy:

What is NumPy?

NumPy stands for Numerical Python.

NumPy is a Python library used for working with arrays.
It also provides functions for linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use
it freely.

Why Use NumPy?

Python lists can serve the purpose of arrays, but they are slow to process.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray. It comes with many supporting functions that
make working with ndarray very easy.
Arrays are used very frequently in data science, where speed and resources are very
important.
Unlike lists, NumPy arrays are stored in one contiguous block of memory, so processes can
access and manipulate them very efficiently.
This behavior is called locality of reference in computer science.
It is the main reason why NumPy is faster than lists. NumPy is also optimized to work with
the latest CPU architectures.
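
As a rough illustration of the points above, the sketch below builds the same data as a list and as an ndarray, checks that the array is stored contiguously, and compares a Python loop with the equivalent single array operation. The variable names are illustrative, and no timings are shown because the actual speed-up depends on the machine and the array size.

```python
import numpy as np

# The same one million values as a Python list and as an ndarray
values = list(range(1_000_000))
arr = np.array(values)

# An ndarray created this way occupies one contiguous block of memory
print(arr.flags['C_CONTIGUOUS'])   # True

# Doubling every element: an explicit Python loop vs. one array operation
doubled_list = [v * 2 for v in values]
doubled_arr = arr * 2

# Both approaches produce the same numbers
print(doubled_arr.tolist() == doubled_list)   # True
```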

Creating Array: array()


NumPy is used to work with arrays. The array object in NumPy is called ndarray.
We can create a NumPy ndarray object by using the array() function.

Example:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)
print(type(arr))

Output:

[1 2 3 4 5]
<class 'numpy.ndarray'>

type(): This built-in Python function tells us the type of the object passed to it. In the code
above, it shows that arr is of type numpy.ndarray.
To create an ndarray, we can pass a list, tuple or any array-like object into the array() function,
and it will be converted into an ndarray.
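
A minimal sketch of that conversion, assuming nothing beyond numpy itself (the variable names are only for illustration): a list, a tuple and a range all produce the same ndarray.

```python
import numpy as np

from_list = np.array([1, 2, 3])       # built from a list
from_tuple = np.array((1, 2, 3))      # built from a tuple
from_range = np.array(range(1, 4))    # built from another sequence-like object

print(type(from_tuple))                          # <class 'numpy.ndarray'>
print(np.array_equal(from_list, from_tuple))     # True
print(np.array_equal(from_list, from_range))     # True
```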

Dimensions in Arrays: A dimension in arrays is one level of array depth (nested arrays).

► Nested arrays: arrays that have arrays as their elements.


► 0-D Arrays
• 0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D
array.
• Example: Create a 0-D array with value 42

import numpy as np
arr = np.array(42)
print(arr)

Output:
42

► 1-D Arrays
• An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
• These are the most common and basic arrays.
• Example: Create a 1-D array containing the values 1,2,3,4,5

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

Output:
[1 2 3 4 5]

► 2-D Arrays
• An array that has 1-D arrays as its elements is called a 2-D array.
• These are often used to represent matrices or 2nd order tensors.
• NumPy has a whole submodule dedicated to linear algebra operations on matrices, called numpy.linalg.

• Example: Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)

Output:

[[1 2 3]
 [4 5 6]]

► 3-D arrays
• An array that has 2-D arrays (matrices) as its elements is called 3-D array.
• These are often used to represent a 3rd order tensor.

Example: Create a 3-D array with two 2-D arrays, both containing two arrays with the
values 1,2,3 and 4,5,6

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)

Output:

[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]
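
To check how many dimensions an array has, the ndim and shape attributes of ndarray can be used. The sketch below, which reuses the four example arrays of this section, is only an illustration of that check.

```python
import numpy as np

a = np.array(42)                               # 0-D
b = np.array([1, 2, 3, 4, 5])                  # 1-D
c = np.array([[1, 2, 3], [4, 5, 6]])           # 2-D
d = np.array([[[1, 2, 3], [4, 5, 6]],
              [[1, 2, 3], [4, 5, 6]]])         # 3-D

# ndim is the number of dimensions, shape is the size along each dimension
for arr in (a, b, c, d):
    print(arr.ndim, arr.shape)
```

The loop prints 0 (), 1 (5,), 2 (2, 3) and 3 (2, 2, 3).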

Accessing Array: by referring to its index number:


► Access Array Elements

Array indexing is the same as accessing an array element.


You can access an array element by referring to its index number.
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.

Example: Get the second element from the following array.

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[1])

Output:
2

► Access 2-D Arrays

• To access elements from 2-D arrays we can use comma separated integers representing
the dimension and the index of the element.
• Think of 2-D arrays like a table with rows and columns, where the dimension represents
the row and the index represents the column.
• Example: Access the element on the first row, second column:

import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print('2nd element on 1st row: ', arr[0, 1])

Output:
2nd element on 1st row:  2

► Access 3-D Arrays

• To access elements from 3-D arrays we can use comma separated integers
representing the dimensions and the index of the element.
• Example: Access the third element of the second array of the first array:

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print(arr[0, 1, 2])

Output:
6

► Negative Indexing

• Use negative indexing to access an array from the end.

• Example: Print the last element from the 2nd dim:

import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print('Last element from 2nd dim: ', arr[1, -1])

Output:

Last element from 2nd dim:  10
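
The comma-separated index can be read one dimension at a time. The following sketch (the intermediate variable names are only illustrative) unpacks arr[0, 1, 2] step by step and then combines it with negative indexing.

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

# arr[0, 1, 2] selects step by step: first block, then second row, then third value
first_block = arr[0]          # [[1 2 3], [4 5 6]]
second_row = first_block[1]   # [4 5 6]
print(second_row[2])          # 6
print(arr[0, 1, 2])           # 6

# Negative indexes count from the end of each dimension
print(arr[-1, -1, -1])        # 12
```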

Stacking & Splitting: stack(), array_split()


► stack()
• stack() is used for joining multiple NumPy arrays. Unlike concatenate(), it joins
arrays along a new axis. It returns a NumPy array.
• To join arrays with stack(), they must all have the same shape (e.g. both (2,3) -> 2
rows, 3 columns).
• stack() creates a new array which has 1 more dimension than the input arrays. If
we stack two 1-D arrays, the resultant array will have 2 dimensions.

Syntax: numpy.stack(arrays, axis=0, out=None)

Where,

• arrays: Sequence of input arrays (required)


• axis: The axis in the result array along which the input arrays are stacked. Possible values are
0 to (n-1) for an n-dimensional output array.
• out: Optional destination array in which to place the result.

► Example: stacking two 1-D arrays

import numpy as np

# input arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Stacking two 1-D arrays
c = np.stack((a, b), axis=0)
print(c)

Output:
[[1 2 3]
 [4 5 6]]
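
To see the effect of the axis argument and the difference from concatenate(), the sketch below stacks the same two arrays along axis=0 and axis=1 and compares the resulting shapes. The names rows, cols and flat are only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

rows = np.stack((a, b), axis=0)     # inputs become rows    -> shape (2, 3)
cols = np.stack((a, b), axis=1)     # inputs become columns -> shape (3, 2)
flat = np.concatenate((a, b))       # no new axis           -> shape (6,)

print(rows.shape, cols.shape, flat.shape)
print(cols)
```

Here cols is [[1 4], [2 5], [3 6]], while concatenate() simply places the six values one after another in a 1-D array.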

► array_split()
• Splitting is the reverse operation of joining.
• Joining merges multiple arrays into one, and splitting breaks one array into multiple.
• We use array_split() for splitting arrays. We pass it the array we want to split and
the number of splits.

Example: Split the array into 3 parts

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr)

Output:
[array([1, 2]), array([3, 4]), array([5, 6])]
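
When the array cannot be divided evenly, array_split() still works and makes the leading parts one element longer. A small sketch of that behavior, with an illustrative variable name parts:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# Six elements cannot be divided evenly into four parts;
# array_split() makes the first parts one element longer.
parts = np.array_split(arr, 4)
print(parts)   # [array([1, 2]), array([3, 4]), array([5]), array([6])]

# The result is a list of arrays, so each part can be accessed by index
print(parts[0])
```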

► Splitting 2-D Arrays

Example: Split the 2-D array into three 2-D arrays.

import numpy as np

arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])

newarr = np.array_split(arr, 3)

print(newarr)

Output:
[array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([[ 9, 10],
       [11, 12]])]

• In addition, you can specify which axis you want to do the split around.

• The example below also returns three 2-D arrays, but they are split along axis=1, so each
piece holds one column of the original array.

Example: Split the 2-D array into three 2-D arrays along axis=1.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])

newarr = np.array_split(arr, 3, axis=1)

print(newarr)

Output:
[array([[ 1],
[ 4],
[ 7],
[10],
[13],
[16] ]), array([[ 2],
[ 5],
[ 8],
[11],
[14],
[17] ]), array([[ 3],
[ 6],
[ 9],
[12],
[15],
[18] ])]

• An alternative is hsplit(), the opposite of hstack().

Example: Use the hsplit() function to split the 2-D array into three 2-D arrays along
axis=1 (column by column).

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])

newarr = np.hsplit(arr, 3)

print(newarr)

Output:
[array([[ 1],
       [ 4],
       [ 7],
       [10],
       [13],
       [16]]), array([[ 2],
       [ 5],
       [ 8],
       [11],
       [14],
       [17]]), array([[ 3],
       [ 6],
       [ 9],
       [12],
       [15],
       [18]])]

Maths Functions: add(), subtract(), multiply(), divide(), power(), mod()

• NumPy provides a wide range of arithmetic functions to perform on arrays.


• Here's a list of various arithmetic functions along with their associated operators:

Operation        Arithmetic Function    Operator

Addition         add()                  +
Subtraction      subtract()             -
Multiplication   multiply()             *
Division         divide()               /
Exponentiation   power()                **
Modulus          mod()                  %

Example 1:
import numpy as np

first_array = np.array([1, 3, 5, 7])
second_array = np.array([2, 4, 6, 8])

# using the add() function
result2 = np.add(first_array, second_array)
print("Using the add() function:", result2)

output:

Using the add() function: [ 3  7 11 15]

Example 2:
import numpy as np
print("Add:")
print(np.add(1.0, 4.0))
print("Subtract:")
print(np.subtract(1.0, 4.0))
print("Multiply:")
print(np.multiply(1.0, 4.0))
print("Divide:")
print(np.divide(1.0, 4.0))

output:
Add:
5.0
Subtract:
-3.0
Multiply:
4.0
Divide:
0.25
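
The remaining functions from the table, power() and mod(), work the same way on arrays. The sketch below reuses the arrays from Example 1 and shows that each function gives the same element-wise result as its operator.

```python
import numpy as np

first_array = np.array([1, 3, 5, 7])
second_array = np.array([2, 4, 6, 8])

# Each function has an operator that gives the same element-wise result
print(np.subtract(second_array, first_array))   # same as second_array - first_array
print(np.multiply(first_array, second_array))   # same as first_array * second_array
print(np.power(first_array, 2))                 # same as first_array ** 2 -> [ 1  9 25 49]
print(np.mod(second_array, 3))                  # same as second_array % 3 -> [2 1 0 2]
```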

Statistics Functions: amin(), amax(), mean(), median(), std(), var(), average(), ptp()
• The NumPy package contains a number of statistical functions that provide the
functionality required for common statistical operations.
• These include finding the mean, median, average, standard deviation, variance, percentile,
etc. of the elements of a given array. The most frequently used statistical functions are
listed below:

Function       Description

mean()         Computes the arithmetic mean along the specified axis.
median()       Computes the median along the specified axis.
average()      Computes the weighted average along the specified axis.
std()          Computes the standard deviation along the specified axis.
var()          Computes the variance along the specified axis.
amax()         Returns the maximum of an array or the maximum along an axis.
amin()         Returns the minimum of an array or the minimum along an axis.
ptp()          Returns the range of values (maximum - minimum) of an array or along an axis.
percentile()   Computes the specified percentile of the data along the specified axis.

• numpy.mean() function
The numpy.mean() function is used to compute the arithmetic mean along the specified axis.
The mean is calculated over the flattened array by default, otherwise over the specified axis.

Syntax:
numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)

a          Required. Specify an array containing numbers whose mean is desired. If
           a is not an array, a conversion is attempted.
axis       Optional. Specify axis or axes along which the means are computed. The
           default is to compute the mean of the flattened array.
dtype      Optional. Specify the data type for computing the mean. For integer
           inputs, the default is float64. For floating point inputs, it is the same as the
           input dtype.
out        Optional. Specify output array for the result. The default is None.
           If provided, it must have the same shape as the expected output.
keepdims   Optional. If this is set to True, the reduced axes are left in the result as
           dimensions with size one. With this option, the result will broadcast
           correctly against the input array. With the default value, keepdims will not
           be passed through to the mean method of sub-classes of ndarray, but
           any non-default value will be. If the sub-class method does not
           implement keepdims, an exception will be raised.

Example:
import numpy as np
Arr = np.array([[10, 20, 30], [70, 80, 90]])

print("Array is:")
print(Arr)

# mean of all values
print("\nMean of all values:", np.mean(Arr))

# mean along axis=0
print("\nMean along axis=0")
print(np.mean(Arr, axis=0))

# mean along axis=1
print("\nMean along axis=1")
print(np.mean(Arr, axis=1))

output:
Array is:
[[10 20 30]
 [70 80 90]]

Mean of all values: 50.0

Mean along axis=0
[40. 50. 60.]

Mean along axis=1
[20. 80.]
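
The keepdims parameter described above is easiest to understand from the shapes it produces. The following sketch, with illustrative variable names, computes the row means with and without keepdims and uses the kept dimension to subtract each row's mean from that row.

```python
import numpy as np

Arr = np.array([[10, 20, 30], [70, 80, 90]])

row_means = np.mean(Arr, axis=1)                      # shape (2,)
row_means_kept = np.mean(Arr, axis=1, keepdims=True)  # shape (2, 1)
print(row_means.shape, row_means_kept.shape)

# The (2, 1) result broadcasts against the (2, 3) input,
# so each row can be centered on its own mean.
centered = Arr - row_means_kept
print(centered)
```

Both rows of centered come out as -10, 0, 10, because each row has been shifted by its own mean (20 and 80).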

• numpy.median() function

The numpy.median() function is used to compute the median along the specified axis. The median
is calculated over the flattened array by default, otherwise over the specified axis.

Syntax
numpy.median(a, axis=None, out=None, overwrite_input=False, keepdims=False)

Parameters

a                 Required. Specify an array (array_like) containing numbers whose
                  median is desired.
axis              Optional. Specify axis or axes along which the medians are
                  computed. The default is to compute the median of the flattened
                  array.
out               Optional. Specify output array for the result. The default is
                  None. If provided, it must have the same shape as the expected output.
overwrite_input   Optional. If True, the input array will be modified. If
                  overwrite_input is True and a is not already an ndarray, an error
                  will be raised. Default is False.
keepdims          Optional. If this is set to True, the reduced axes are left in the result
                  as dimensions with size one. With this option, the result will
                  broadcast correctly against the input array.

Example:
In the example below, the median() function is used to calculate the median of all values present in
the array. When the axis parameter is provided, the median is calculated over the specified axis.

import numpy as np
Arr = np.array([[10, 20, 500], [30, 40, 400], [100, 200, 300]])

print("Array is:")
print(Arr)

# median of all values
print("\nMedian of values:", np.median(Arr))

# median along axis=0
print("\nMedian along axis=0")
print(np.median(Arr, axis=0))

# median along axis=1
print("\nMedian along axis=1")
print(np.median(Arr, axis=1))

output:

Array is:
[[ 10  20 500]
 [ 30  40 400]
 [100 200 300]]

Median of values: 100.0

Median along axis=0
[ 30.  40. 400.]

Median along axis=1
[ 20.  40. 200.]

• numpy.average() function

The numpy.average() function is used to compute the weighted average along the specified axis.
The syntax for using this function is given below:

Syntax
numpy.average(a, axis=None, weights=None, returned=False)

Parameters

a          Required. Specify an array containing data to be averaged. If a is not an array,
           a conversion is attempted.
axis       Optional. Specify axis or axes along which to average a. The default,
           axis=None, will average over all of the elements of the input array. If axis is
           negative it counts from the last to the first axis.
weights    Optional. Specify an array of weights associated with the values in a. The
           weights array can either be 1-D (in which case its length must be the size of a
           along the given axis) or of the same shape as a. If weights=None, then all data
           in a are assumed to have a weight equal to one.
returned   Optional. Default is False. If True, the tuple (average, sum_of_weights) is
           returned, otherwise only the average is returned.

Example:
In the example below, the average() function is used to calculate the average of all values present
in the array. When the axis parameter is provided, averaging is performed over the specified
axis.

import numpy as np
Arr = np.array([[10, 20, 30], [70, 80, 90]])

print("Array is:")
print(Arr)

# average of all values
print("\nAverage of values:", np.average(Arr))

# averaging along axis=0
print("\nAverage along axis=0")
print(np.average(Arr, axis=0))

# averaging along axis=1
print("\nAverage along axis=1")
print(np.average(Arr, axis=1))

output:

Array is:
[[10 20 30]
 [70 80 90]]

Average of values: 50.0

Average along axis=0
[40. 50. 60.]

Average along axis=1
[20. 80.]
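
The weights and returned parameters are what distinguish average() from mean(). The sketch below uses made-up scores and weights to show a weighted average; the data and variable names are assumptions chosen only for illustration.

```python
import numpy as np

scores = np.array([10, 20, 30])
weights = np.array([1, 2, 3])      # hypothetical weights, one per score

# Weighted average = sum(scores * weights) / sum(weights) = 140 / 6
avg, weight_sum = np.average(scores, weights=weights, returned=True)
print(avg, weight_sum)

# Without weights, average() matches mean()
print(np.average(scores) == np.mean(scores))   # True
```

With these numbers the weighted average is 140 / 6, roughly 23.33, which is pulled above the plain mean of 20 because the largest score carries the largest weight.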

• numpy.std() function

The numpy.std() function is used to compute the standard deviation along the specified axis.
The standard deviation is defined as the square root of the average of the squared deviations
from the mean. Mathematically, it can be represented as:

std= sqrt(mean(abs(x - x.mean())**2))

Syntax
numpy.std(a, axis=None, dtype=None, out=None, keepdims=<no value>)

Parameters

a          Required. Specify the input array.
axis       Optional. Specify axis or axes along which the standard deviation is
           calculated. The default, axis=None, computes the standard deviation of the
           flattened array.
dtype      Optional. Specify the type to use in computing the standard deviation. For
           arrays of integer type the default is float64, for arrays of float types it is the
           same as the array type.
out        Optional. Specify the output array in which to place the result. It must have
           the same shape as the expected output.
keepdims   Optional. If this is set to True, the reduced axes are left in the result as
           dimensions with size one. With this option, the result will broadcast
           correctly against the input array.

Example:
Here, the std() function is used to calculate the standard deviation of all values present in the array.
When the axis parameter is provided, the standard deviation is calculated over the specified
axis, as shown in the example below.

import numpy as np
Arr = np.array([[10, 20, 30], [70, 80, 90]])

print("Array is:")
print(Arr)

# standard deviation of all values
print("\nStandard deviation of all values:", np.std(Arr))

# standard deviation along axis=0
print("\nStandard deviation along axis=0")
print(np.std(Arr, axis=0))

# standard deviation along axis=1
print("\nStandard deviation along axis=1")
print(np.std(Arr, axis=1))

output:

Array is:
[[10 20 30]
 [70 80 90]]

Standard deviation of all values: 31.09126351029605

Standard deviation along axis=0
[30. 30. 30.]

Standard deviation along axis=1
[8.16496581 8.16496581]
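
The formula std = sqrt(mean(abs(x - x.mean())**2)) can be checked directly against np.std(). The sketch below does that and also shows ddof, a parameter of numpy.std() that is not listed in the syntax above; with ddof=1 the sum of squared deviations is divided by N - 1 instead of N.

```python
import numpy as np

Arr = np.array([[10, 20, 30], [70, 80, 90]])

# Standard deviation written out from its definition
manual_std = np.sqrt(np.mean(np.abs(Arr - Arr.mean()) ** 2))
print(manual_std, np.std(Arr))

# The variance is the square of the standard deviation
print(np.isclose(np.var(Arr), np.std(Arr) ** 2))   # True

# ddof=1 divides by (N - 1) instead of N (sample standard deviation)
sample_std = np.std(Arr, ddof=1)
print(sample_std)
```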

• numpy.var() function

The numpy.var() function is used to compute the variance along the specified axis. The variance is
a measure of the spread of a distribution. The variance is computed for the flattened array by
default, otherwise over the specified axis.

Syntax
numpy.var(a, axis=None, dtype=None, out=None, keepdims=<no value>)

Parameters

a          Required. Specify the input array.
axis       Optional. Specify axis or axes along which the variance is calculated. The
           default, axis=None, computes the variance of the flattened array.
dtype      Optional. Specify the type to use in computing the variance. For arrays of
           integer type the default is float64, for arrays of float types it is the same as the
           array type.
out        Optional. Specify the output array in which to place the result. It must have the
           same shape as the expected output.
keepdims   Optional. If this is set to True, the reduced axes are left in the result as
           dimensions with size one. With this option, the result will broadcast correctly
           against the input array.

Example:
In the example below, the var() function is used to calculate the variance of all values present in the
array. When the axis parameter is provided, the variance is calculated over the specified axis.

import numpy as np
Arr = np.array([[10, 20, 30], [70, 80, 90]])

print("Array is:")
print(Arr)

# variance of all values
print("\nVariance of all values:", np.var(Arr))

# variance along axis=0
print("\nVariance along axis=0")
print(np.var(Arr, axis=0))

# variance along axis=1
print("\nVariance along axis=1")
print(np.var(Arr, axis=1))

output:

Array is:
[[10 20 30]
 [70 80 90]]

Variance of all values: 966.6666666666666

Variance along axis=0
[900. 900. 900.]

Variance along axis=1
[66.66666667 66.66666667]

• numpy.amax() function


The NumPy amax() function returns the maximum of an array or maximum along the specified axis.

Syntax:
numpy.amax(a, axis=None, out=None, keepdims=<no value>)
Parameters

a          Required. Specify the input array.
axis       Optional. Specify axis or axes along which to operate. With the default, axis=None,
           the operation is performed on the flattened array.
out        Optional. Specify the output array in which to place the result. It must have the same
           shape as the expected output.
keepdims   Optional. If this is set to True, the reduced axes are left in the result as dimensions
           with size one. With this option, the result will broadcast correctly against the input
           array.

Example:
In the example below, the amax() function is used to calculate the maximum of an array. When the axis
parameter is provided, the maximum is calculated over the specified axis.

import numpy as np
Arr = np.array([[10, 20, 30], [70, 80, 90]])

print("Array is:")
print(Arr)

# maximum of all values
print("\nMaximum of all values:", np.amax(Arr))

# maximum along axis=0
print("\nMaximum along axis=0")
print(np.amax(Arr, axis=0))

# maximum along axis=1
print("\nMaximum along axis=1")
print(np.amax(Arr, axis=1))

output:

Array is:
[[10 20 30]
 [70 80 90]]

Maximum of all values: 90

Maximum along axis=0
[70 80 90]

Maximum along axis=1
[30 90]

• numpy.amin() function

The NumPy amin() function returns the minimum of an array or minimum along the specified axis.

Syntax
numpy.amin(a, axis=None, out=None, keepdims=<no value>)

Parameters

a          Required. Specify the input array.
axis       Optional. Specify axis or axes along which to operate. With the default,
           axis=None, the operation is performed on the flattened array.
out        Optional. Specify the output array in which to place the result. It must
           have the same shape as the expected output.
keepdims   Optional. If this is set to True, the reduced axes are left in the result as
           dimensions with size one. With this option, the result will broadcast
           correctly against the input array.

Example:
In the example below, the amin() function is used to calculate the minimum of an array. When the axis
parameter is provided, the minimum is calculated over the specified axis.

import numpy as np
Arr = np.array([[10, 20, 30], [70, 80, 90]])

print("Array is:")
print(Arr)

# minimum of all values
print("\nMinimum of all values:", np.amin(Arr))

# minimum along axis=0
print("\nMinimum along axis=0")
print(np.amin(Arr, axis=0))

# minimum along axis=1
print("\nMinimum along axis=1")
print(np.amin(Arr, axis=1))

output:

Array is:
[[10 20 30]
 [70 80 90]]

Minimum of all values: 10

Minimum along axis=0
[10 20 30]

Minimum along axis=1
[10 70]

• numpy.ptp() function

The NumPy ptp() function returns the range of values (maximum - minimum) of an array, or the range
of values along the specified axis.

The name of the function comes from the acronym for peak to peak.

Syntax:
numpy.ptp(a, axis=None, out=None, keepdims=<no value>)

Parameters

a          Required. Specify the input array.
axis       Optional. Specify axis or axes along which to operate. With the default,
           axis=None, the operation is performed on the flattened array.
out        Optional. Specify the output array in which to place the result. It must have
           the same shape as the expected output.
keepdims   Optional. If this is set to True, the reduced axes are left in the result
           as dimensions with size one. With this option, the result will
           broadcast correctly against the input array.

Example:
In the example below, the ptp() function is used to calculate the range of values present in the
array. When the axis parameter is provided, it is calculated over the specified axis.

import numpy as np
Arr = np.array([[10, 20, 30], [70, 80, 90]])

print("Array is:")
print(Arr)

# range of values
print("\nRange of values:", np.ptp(Arr))

# range of values along axis=0
print("\nRange of values along axis=0")
print(np.ptp(Arr, axis=0))

# range of values along axis=1
print("\nRange of values along axis=1")
print(np.ptp(Arr, axis=1))

The output of the above code will be:

Array is:
[[10 20 30]
 [70 80 90]]

Range of values: 80

Range of values along axis=0
[60 60 60]

Range of values along axis=1
[20 20]

• numpy.percentile() function


The NumPy percentile() function returns the q-th percentile of the array elements, or the q-th percentile of
the data along the specified axis.

Syntax:
numpy.percentile(a, q, axis=None, out=None, interpolation='linear', keepdims=False)
Parameters

a               Required. Specify the input array (array_like).
q               Required. Specify percentile or sequence of percentiles to compute,
                which must be between 0 and 100 inclusive (array_like of float).
axis            Optional. Specify axis or axes along which to operate. With the default,
                axis=None, the operation is performed on the flattened array.
out             Optional. Specify the output array in which to place the result. It must
                have the same shape as the expected output.
interpolation   Optional. Specify the interpolation method to use when the desired
                percentile lies between two data points. It can take a value from {'linear',
                'lower', 'higher', 'midpoint', 'nearest'}. In NumPy 1.22 and later this
                parameter is named method, and interpolation is deprecated.
keepdims        Optional. If this is set to True, the reduced axes are left in the result as
                dimensions with size one. With this option, the result will broadcast
                correctly against the input array.

Example:
In the example below, the percentile() function is used to compute percentiles of the values present in the
array. When the axis parameter is provided, they are calculated over the specified axis.

import numpy as np
Arr = np.array([[10, 20, 30], [40, 50, 60]])

print("Array is:")
print(Arr)

print()
# calculating 50th percentile point
print("50th percentile:", np.percentile(Arr, 50))

print()
# calculating (25, 50, 75) percentile points
print("[25, 50, 75] percentile:\n",
      np.percentile(Arr, (25, 50, 75)))

print()
# calculating 50th percentile point along axis=0
print("50th percentile (axis=0):",
      np.percentile(Arr, 50, axis=0))

# calculating 50th percentile point along axis=1
print("50th percentile (axis=1):",
      np.percentile(Arr, 50, axis=1))

output:
Array is:
[[10 20 30]
 [40 50 60]]

50th percentile: 35.0

[25, 50, 75] percentile:
 [22.5 35.  47.5]

50th percentile (axis=0): [25. 35. 45.]
50th percentile (axis=1): [20. 50.]

2.2 Pandas
• History of development
• In 2008, pandas development began at AQR Capital Management. By the end of 2009 it
had been open sourced, and it is actively supported today by a community of like-minded
individuals around the world who contribute their time and energy to help make
open source pandas possible.
• Since 2015, pandas has been a NumFOCUS sponsored project, which helps ensure the success of
the development of pandas as a world-class open-source project.

• Timeline
2008: Development of pandas started
2009: pandas becomes open source
2012: First edition of Python for Data Analysis is published
2015: pandas becomes a NumFOCUS sponsored project
2018: First in-person core developer sprint

• Library Highlights

• A fast and efficient DataFrame object for data manipulation with integrated indexing;
• Tools for reading and writing data between in-memory data structures and different
formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
• Intelligent data alignment and integrated handling of missing data: gain automatic
label-based alignment in computations and easily manipulate messy data into an orderly form;
• Flexible reshaping and pivoting of data sets;
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
• Columns can be inserted and deleted from data structures for size mutability;
• Aggregating or transforming data with a powerful group by engine allowing
split-apply-combine operations on data sets;
• High performance merging and joining of data sets;
• Hierarchical axis indexing provides an intuitive way of working with high-dimensional
data in a lower-dimensional data structure;
• Time series functionality: date range generation and frequency conversion, moving
window statistics, date shifting and lagging. Even create domain-specific time offsets
and join time series without losing data;
• Highly optimized for performance, with critical code paths written in Cython or C.

Python with pandas is in use in a wide variety of academic and commercial domains,
including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics,
and more.

Series: Series()
• The Pandas Series can be defined as a one-dimensional array that is capable of storing
various data types.
• We can easily convert a list, tuple, or dictionary into a Series using the Series() constructor. The
row labels of a Series are called the index.
• A Series cannot contain multiple columns. It has the following parameters:

data: It can be any list, dictionary, or scalar value.
index: The values of the index should be unique and hashable. It must be of the same length
as data. If we do not pass any index, the default np.arange(n) will be used.
dtype: It refers to the data type of the Series.
copy: It is used for copying the data.

► Creating a Series:
We can create a Series in two ways:

1. Create an empty Series

2. Create a Series using inputs.

► Create an Empty Series:


We can easily create an empty series in Pandas which means it will not have any value.
The syntax that is used for creating an Empty Series:

<series object> = pandas.Series()

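The example below creates an empty Series object that has no values. The dtype is passed
explicitly because the default dtype of an empty Series depends on the pandas version
(float64 in older releases, object since pandas 2.0).

Example:

import pandas as pd

x = pd.Series(dtype='float64')

print(x)

output:

Series([], dtype: float64)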

► Creating a Series using inputs:


We can create Series by using various inputs:

o Array

o Dict

o Scalar value


► Creating Series from Array:

Before creating a Series from an array, we first have to import the numpy module and then use the array() function
in the program. If the data is an ndarray, then the passed index must be of the same length.

If we do not pass an index, then by default an index of range(n) is used, where n is the
length of the array, i.e., [0, 1, 2, ..., len(array)-1].

Example:

import pandas as pd

import numpy as np

info= np.array(['P','a','n','d','a','s'])

a= pd.Series(info)

print(a)

output:
0    P
1    a
2    n
3    d
4    a
5    s
dtype: object
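
The rule that a passed index must have the same length as the array can be illustrated as follows. This is a minimal sketch; the labels 'u' to 'z' are hypothetical and chosen only for the example.

```python
import pandas as pd
import numpy as np

info = np.array(['P', 'a', 'n', 'd', 'a', 's'])

# The index must have the same length as the data (6 labels for 6 values)
labels = ['u', 'v', 'w', 'x', 'y', 'z']      # hypothetical labels
a = pd.Series(info, index=labels)

print(a['w'])          # value stored under label 'w' -> n
print(list(a.index))   # ['u', 'v', 'w', 'x', 'y', 'z']
```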

► Create a Series from dict

We can also create a Series from a dict. If a dictionary object is passed as input and the
index is not specified, then the dictionary keys are used to construct the index. Recent
pandas versions keep the keys in insertion order; older versions sorted them.

If an index is passed, then the values corresponding to each label in the index will be extracted from
the dictionary.

Example:

#import the pandas library

import pandas as pd

import numpy as np


info = {'x' : 0., 'y' : 1., 'z' : 2.}

a = pd.Series(info)

print (a)

Output:

x    0.0
y    1.0
z    2.0
dtype: float64
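
The two remaining cases described above, a dict with an explicit index and a scalar value, could be sketched as follows. The label 'w' and the scalar 5 are illustrative choices, not part of the original example:

```python
import pandas as pd

info = {"x": 0.0, "y": 1.0, "z": 2.0}

# Only the labels listed in index are looked up in the dict.
# 'w' is not a key of info, so its value becomes NaN.
b = pd.Series(info, index=["y", "z", "w"])
print(b)

# A scalar value is repeated once for every label in the index.
c = pd.Series(5, index=["p", "q", "r"])
print(c)
```

Note that 'x' is dropped from b because it is not listed in the index.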

► Series Functions
There are some functions used in Series which are as follows:

Functions                      Description

Pandas Series.map()            Map the values of a Series to new values using another Series,
                               a dictionary, or a function.

Pandas Series.std()            Calculate the standard deviation of the values in the Series.

Pandas Series.to_frame()       Convert the Series object to a DataFrame.

Pandas Series.value_counts()   Return a Series that contains counts of unique values.
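
A short sketch of the four functions listed above, using made-up data (the animal names and numbers are purely illustrative):

```python
import pandas as pd

s = pd.Series(["cat", "dog", "cat", "bird"])

# map(): substitute each value using a dictionary.
# 'bird' has no entry in the dictionary, so it maps to NaN.
sounds = s.map({"cat": "meow", "dog": "woof"})

# value_counts(): how often each distinct value occurs.
counts = s.value_counts()

# std(): sample standard deviation (pandas uses ddof=1 by default).
spread = pd.Series([2, 4, 4, 6]).std()

# to_frame(): turn the Series into a one-column DataFrame.
frame = s.to_frame(name="animal")

print(sounds)
print(counts)
print(spread)
print(frame)
```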

DataFrames: DataFrame()

• Pandas DataFrame is a widely used data structure which works with a two-dimensional
array with labeled axes (rows and columns).
• DataFrame is defined as a standard way to store data that has two different indexes, i.e., row
index and column index. It consists of the following properties:

• The columns can be heterogeneous types like int, bool, and so on.


• It can be seen as a dictionary of Series structure where both the rows and columns are
indexed. It is denoted as "columns" in case of columns and "index" in case of rows.

Parameter & Description:

data: It can take different forms such as ndarray, Series, map (dict), constants, and lists.

index: The row labels. The default 0, 1, ..., n-1 (equivalent to np.arange(n)) is used if no
index is passed.

columns: The column labels. The default 0, 1, ..., n-1 is used if no column labels are
passed.

dtype: It refers to the data type of each column.

copy: It is used for copying the data.

[Figure: a DataFrame with its rows and columns labeled]

Regd. No   Name      Percentage of Marks

100        John      74.5

101        Smith     87.2

102        Parker    92

103        Jones     70.6

104        William   87.5
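
The table above can be built with the data, index and columns parameters. This is a minimal sketch; using the registration number as the row index is an assumption made for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    data=[["John", 74.5], ["Smith", 87.2], ["Parker", 92.0],
          ["Jones", 70.6], ["William", 87.5]],
    index=[100, 101, 102, 103, 104],           # row labels (Regd. No)
    columns=["Name", "Percentage of Marks"],   # column labels
)
df.index.name = "Regd. No"

print(df)
print(df.dtypes)   # each column keeps its own data type
```

A row is then selected by its label, for example df.loc[102].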

• Create a DataFrame

We can create a DataFrame using following ways:

o dict

o Lists

o NumPy ndarrays

o Series

• Create an empty DataFrame

Example:

# importing the pandas library


import pandas as pd
df = pd.DataFrame()
print (df)

Output

Empty DataFrame
Columns: []
Index: []

• Create a DataFrame using List:

We can easily create a DataFrame in Pandas using list.

Example:

# importing the pandas library

import pandas as pd
# a list of strings
x = ['Python', 'Pandas']

# Calling DataFrame constructor on list


df = pd.DataFrame(x)
print(df)

Output

        0
0  Python
1  Pandas

• Create a DataFrame from Dict of ndarrays/Lists

Example:

# importing the pandas library


import pandas as pd
info= {'ID' :[101, 102, 103],'Department' :['B.Sc','B.Tech','M.Tech',]}
df = pd.DataFrame(info)
print (df)

Output

ID Department
0 101 B.Sc
1 102 B.Tech
2 103 M.Tech

• Create a DataFrame from Dict of Series

Example:

# importing the pandas library


import pandas as pd

info = {'one': pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
        'two': pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}

d1 = pd.DataFrame(info)
print(d1)

Output:

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  4.0    4
e  5.0    5
f  6.0    6
g  NaN    7
h  NaN    8

Read CSV File: read_csv()

To read a CSV file as a pandas.DataFrame, use the pandas function read_csv() or read_table().

The two functions are almost identical and share the same underlying parser. The only difference
is the default delimiter:

• read_csv() uses a comma (,) as the delimiter

• read_table() uses a tab (\t) as the delimiter

You can export a spreadsheet to a CSV file from any modern office suite, including Google Sheets.

Use the following csv data as an example.

Example:

name,age,state,point
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70
Dave,68,TX,70
Ellen,24,CA,88
Frank,30,NY,57

You can load the csv like this:

# Load pandas
import pandas as pd

# Read CSV file into DataFrame df


df = pd.read_csv('sample.csv', index_col=0)

# Show dataframe


print(df)

Output:

         age state  point
name
Alice     24    NY     64
Bob       42    CA     92
Charlie   18    CA     70
Dave      68    TX     70
Ellen     24    CA     88
Frank     30    NY     57
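
To try the delimiter behaviour without creating a file, the text can be wrapped in io.StringIO. This is a sketch using two rows of the sample data; the variable names are illustrative:

```python
import io
import pandas as pd

csv_text = "name,age,state,point\nAlice,24,NY,64\nBob,42,CA,92\n"
tsv_text = csv_text.replace(",", "\t")   # the same data, tab-separated

# Comma-separated text: the default separator of read_csv().
df_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Tab-separated text: pass sep="\t", which is the default of read_table().
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

print(df_csv)
print(df_csv.equals(df_tsv))   # both calls produce the same DataFrame
```

Here index_col=0 makes the first column (name) the row index, as in the example above.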

Cleaning Empty Cells: dropna()

Empty cells can potentially give you a wrong result when you analyze data.

• Remove Rows

One way to deal with empty cells is to remove rows that contain empty cells.

This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.

Example: Return a new DataFrame with no empty cells:

import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())
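
Since data.csv is not included here, the following self-contained sketch uses a small made-up DataFrame to show what dropna() does. The column names and values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical workout log; NaN marks an empty cell.
df = pd.DataFrame({
    "Duration": [60, 45, np.nan, 30],
    "Calories": [409.1, np.nan, 340.0, 195.1],
})

new_df = df.dropna()                        # drop rows with any empty cell
only_cal = df.dropna(subset=["Calories"])   # check the Calories column only

print(len(df), len(new_df), len(only_cal))
```

dropna() returns a new DataFrame and leaves the original df unchanged, unless inplace=True is passed.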

Cleaning Wrong Data: drop()

• Removing Rows

Another way of handling wrong data is to remove the rows that contain it.

This way you do not have to find out what to replace the values with, and there is a good chance
you do not need those rows for your analysis.

Example: Delete rows where "Duration" is higher than 120:

for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace=True)
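
The same clean-up can also be written without a loop. The sketch below, on a made-up DataFrame, shows a boolean mask and an equivalent single call to drop():

```python
import pandas as pd

# Hypothetical data; 450 and 130 stand in for wrong Duration values.
df = pd.DataFrame({"Duration": [60, 450, 45, 130],
                   "Pulse": [110, 104, 109, 98]})

# Keep only the rows whose Duration is at most 120.
cleaned = df[df["Duration"] <= 120]

# Equivalent with drop(): collect the offending row labels, then drop them at once.
bad_rows = df.index[df["Duration"] > 120]
dropped = df.drop(bad_rows)

print(cleaned)
```

Both forms return a new DataFrame, which is usually faster and easier to read than dropping rows one by one.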

Removing Duplicates: duplicated()

The duplicated() method returns a Series of True and False values that indicate which rows in
the DataFrame are duplicates.

Use the subset parameter to specify which columns to include when looking for duplicates. By
default all columns are included.

By default, the first occurrence of two or more duplicates will be set to False.

Set the keep parameter to False to also set the first occurrence to True.

Syntax

dataframe.duplicated(subset, keep)

Parameter   Value             Description

subset      column label(s)   Optional. A string, or a list, of the column names to include
                              when looking for duplicates. Default subset=None
                              (meaning no subset is specified, and all columns are included).

keep        'first'           Optional, default 'first'. Specifies how to deal with duplicates:
            'last'            'first' means set the first occurrence to False, the rest to True.
            False             'last' means set the last occurrence to False, the rest to True.
                              False means set all occurrences to True.

Example:

Only include the columns "name" and "age":

s = df.duplicated(subset=["name", "age"])

print(s)

output:

0 False
1 False
2 True
3 False
4 True
dtype: bool
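
The effect of the keep parameter can be seen on a small made-up DataFrame. This is a sketch; drop_duplicates() is added here as the usual follow-up step for actually removing the flagged rows:

```python
import pandas as pd

# Hypothetical data: rows 1 and 3 are identical.
df = pd.DataFrame({
    "name": ["Sally", "Mary", "John", "Mary"],
    "age": [50, 40, 30, 40],
})

first = df.duplicated()              # keep='first': only the later copy is True
last = df.duplicated(keep="last")    # only the earlier copy is True
every = df.duplicated(keep=False)    # all copies are True

unique_rows = df.drop_duplicates()   # remove the rows flagged by duplicated()

print(first.tolist())
print(last.tolist())
print(every.tolist())
print(unique_rows)
```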

Pandas Plotting: plot()

Pandas uses the plot() method to create diagrams.

We can use Pyplot, a submodule of the Matplotlib library to visualize the diagram on the screen.

Example:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot()

plt.show()


[Figure: line plot of the Duration, Pulse, Maxpulse and Calories columns against the row index]

2.3 Matplotlib
What is Matplotlib?

Matplotlib is a low-level graph plotting library in Python that serves as a visualization
utility.

Matplotlib was created by John D. Hunter.

Matplotlib is open source and we can use it freely.

Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and
JavaScript for platform compatibility.

Pyplot.plot: plot() and Show: show()

Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported under
the plt alias:

import matplotlib.pyplot as plt

Now the Pyplot package can be referred to as plt.

Example:

Draw a line in a diagram from position (0,0) to position (6,250):

import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([0, 6])
ypoints = np.array([0, 250])

plt.plot(xpoints, ypoints)
plt.show()

[Figure: a straight line from (0, 0) to (6, 250)]


Labels: xlabel(), ylabel()


With Pyplot, you can use the xlabel() and ylabel() functions to set a label for the x- and y-axis.

Example:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])

plt.plot(x, y)

plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")

plt.show()

[Figure: line plot with x-axis "Average Pulse" and y-axis "Calorie Burnage"]

Grid: grid()
With Pyplot, you can use the grid() function to add grid lines to the plot.

Example:

Add grid lines to the plot:


import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid()
plt.show()

Result:

[Figure: line plot titled "Sports Watch Data" with grid lines, x-axis "Average Pulse", y-axis "Calorie Burnage"]

Bars: bar()
With Pyplot, you can use the bar() function to draw bar graphs:

Example:

Draw 4 bars:

import matplotlib.pyplot as plt


import numpy as np

x = np.array(["A", "B", "C", "D"])


y = np.array([3, 8, 1, 10])

plt.bar(x,y)
plt.show()

[Figure: bar chart with four bars labeled A, B, C and D]

Histogram: hist()

A histogram is a graph showing frequency distributions.

It is a graph showing the number of observations within each given interval.

Example: Say you ask for the height of 250 people, you might end up with a histogram like this:

In Matplotlib, we use the hist() function to create histograms.

The hist() function will use an array of numbers to create a histogram, the array is sent into the
function as an argument.

Example

import matplotlib.pyplot as plt
import numpy as np

# 250 random values from a normal distribution with mean 170 and standard deviation 10
x = np.random.normal(170, 10, 250)

plt.hist(x)
plt.show()

[Figure: histogram of the 250 values; x-axis from about 140 to 190, y-axis (count) from 0 to about 50]
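
To see the numbers behind such a histogram, the counts per interval can be computed with NumPy's histogram() function. This is a sketch; the seed and the choice of 10 bins (the default number of bins in plt.hist()) are assumptions made so that the run is repeatable:

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the run is repeatable
heights = rng.normal(170, 10, 250)    # 250 simulated heights

# counts[i] is the number of observations between edges[i] and edges[i + 1].
counts, edges = np.histogram(heights, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {n}")
```

Each printed line corresponds to one bar of the histogram: the interval on the x-axis and the height of the bar.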

Subplot: subplot()
With the subplot() function you can draw multiple plots in one figure:

Example: Draw 2 plots


import matplotlib.pyplot as plt
import numpy as np
# plot 1: first cell of a figure with 1 row and 2 columns
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x, y)
# plot 2: second cell of the same figure
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x, y)
plt.show()

Result:

[Figure: two line plots side by side in one figure, each with x values from 0 to 3]

Pie chart: pie()
With Pyplot, you can use the pie() function to draw pie charts:

Example:

A simple pie chart:

import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])

plt.pie(y)
plt.show()

Result:

Save the plotted images into pdf: savefig()

Using the plt.savefig("myImagePDF.pdf", format="pdf", bbox_inches="tight") method, we can save a
figure in PDF format.
Steps
• Create a dictionary with Column 1 and Column 2 as the keys and i and i*i as the values,
where i runs from 0 to 9.
• Create a data frame using pd.DataFrame(d), with d created in step 1.
• Plot the data frame with 'o' and 'rx' style.
• To save the file in PDF format, use the savefig() method, where the file name is
myImagePDF.pdf and format="pdf".
• To show the image, use the plt.show() method.
Example

import pandas as pd
from matplotlib import pyplot as plt
d = {'Column 1': [i for i in range(10)], 'Column 2': [i * i for i in range(10)]}
df = pd.DataFrame(d)
df.plot(style=['o', 'rx'])
# save before show(), otherwise the saved figure may be empty
plt.savefig("myImagePDF.pdf", format="pdf", bbox_inches="tight")
plt.show()

output:

[Figure: Column 1 plotted as dots and Column 2 plotted as red x markers against the row index]


2.4 sklearn

What is Sklearn?

• Scikit-learn (sklearn) is an open-source Python package for implementing machine learning
models.
• This library supports widely used algorithms like KNN, random forest, gradient boosting, and
SVC. (XGBoost is a separate library and is not part of scikit-learn.)
• It aids in various processes of model building, like model selection (including parameter
tuning), regression, classification, clustering, and dimensionality reduction.

Key concepts and features


• Supervised Learning algorithms - Almost all the popular supervised learning
algorithms, like Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are
part of scikit-learn.
• Unsupervised Learning algorithms - It also has the popular
unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal
Component Analysis) to unsupervised neural networks.
• Clustering - It is used for grouping unlabeled data.
• Cross Validation - It is used to check the accuracy of supervised models on unseen data.
• Dimensionality Reduction - It is used for reducing the number of attributes in data, which
can be further used for summarisation, visualisation and feature selection.
• Ensemble methods - As the name suggests, they are used for combining the predictions of
multiple supervised models.
• Feature extraction - It is used to extract features from data to define the attributes in
image and text data.
• Feature selection - It is used to identify useful attributes to create supervised models.
• Open Source - It is an open-source library and is also commercially usable under the BSD license.
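
As an illustration of the cross validation feature listed above, the following sketch scores a KNN classifier on the iris dataset with 5-fold cross validation. The choice of 5 folds and 3 neighbors is an assumption made for the example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# The data is split into 5 parts; each part is used once as the test set
# while the model is trained on the other 4 parts.
scores = cross_val_score(knn, X, y, cv=5)

print(scores)                            # one accuracy value per fold
print("mean accuracy:", scores.mean())
```

Averaging over several folds gives a steadier estimate of accuracy on unseen data than a single train/test split.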

Steps to Build a Model in Sklearn:

Step 1: Load a dataset

Step 2: Splitting the dataset

Step 3: Training the model

Example: using KNN (K nearest neighbors) classifier.

# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# training the model on training set
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# making predictions on the testing set
y_pred = knn.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))

# making prediction for out of sample data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [str(iris.target_names[p]) for p in preds]
print("Predictions:", pred_species)

# saving the model
# (sklearn.externals.joblib was removed from scikit-learn; import joblib directly)
import joblib
joblib.dump(knn, 'iris_knn.pkl')

Output:
kNN model accuracy: 0.983333333333
Predictions: ['versicolor', 'virginica']
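
A saved model can later be loaded back with joblib.load(). The sketch below is self-contained: it trains a small KNN model, saves it to a temporary folder (an assumption made so that no file is left behind), reloads it, and checks that both models predict the same labels:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iris_knn.pkl")
    joblib.dump(knn, path)         # save the trained model to disk
    restored = joblib.load(path)   # load it back as a new object

# The restored model should behave exactly like the original one.
same = bool((restored.predict(X) == knn.predict(X)).all())
print("identical predictions:", same)
```

Only load model files from sources you trust, because loading a pickle file can execute arbitrary code.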
