
CS3305 - PYTHON PROGRAMMING

UNIT III FUNCTIONS, MODULES AND PACKAGES


Basics of NumPy arrays – aggregations – computations on arrays – comparisons, masks, boolean
logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and
selection. Scientific computing and numerical simulations with SciPy and SimPy. Large-scale
data analysis and machine learning with Pandas, Scikit-learn, and TensorFlow/PyTorch

Basics of Numpy arrays:


NumPy is a Python library used for working with arrays. It also has functions for working in
the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by
Travis Oliphant. It is an open-source project and you can use it freely. NumPy stands for
Numerical Python.
NumPy is a Python library and is written partially in Python, but most of the parts that require
fast computation are written in C or C++.
NumPy provides standard trigonometric functions, functions for arithmetic operations, handling
complex numbers, etc.
 Trigonometric Functions
 Hyperbolic Functions
 Functions for Rounding
 Arithmetic Functions
 Complex number Function
 Special functions

NumPy stands for Numerical Python. It is a Python library used for working with arrays. In
Python, lists can serve as arrays, but they are slow to process. The NumPy array is a powerful
N-dimensional array object used in linear algebra, Fourier transforms, and random number
capabilities. It provides an array object much faster than traditional Python lists.
Types of Array:
1. One Dimensional Array
2. Multi-Dimensional Array
One Dimensional Array:
A one-dimensional array is a type of linear array.

One Dimensional Array


Example:
# importing numpy module
import numpy as np

# creating a list (named list_1 so as not to shadow the built-in list type)
list_1 = [1, 2, 3, 4]

# creating a numpy array from the list
sample_array = np.array(list_1)

print("List in python : ", list_1)
print("Numpy Array in python :", sample_array)

Output:
List in python :  [1, 2, 3, 4]
Numpy Array in python : [1 2 3 4]

print(type(list_1))
print(type(sample_array))

Output:
<class 'list'>
<class 'numpy.ndarray'>

Pandas is most commonly used for data wrangling and data manipulation purposes, and NumPy
objects are primarily used to create the arrays or matrices that are fed to DL or ML models.
Whereas Pandas is used for creating heterogeneous, two-dimensional data objects, NumPy makes
N-dimensional homogeneous objects.
Multi-Dimensional Array:
Data in multidimensional arrays are stored in tabular form.

Two Dimensional Array


# importing numpy module
import numpy as np

# creating lists
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]

# creating numpy array
sample_array = np.array([list_1, list_2, list_3])

print("Numpy multi dimensional array in python\n", sample_array)

Output:
Numpy multi dimensional array in python
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]

Reference : https://www.geeksforgeeks.org/basics-of-numpy-arrays/
Anatomy of an array :
1. Axis: An axis is one dimension of the array. Axes are numbered from 0, so a one-dimensional
array has only axis 0, a two-dimensional array has axes 0 and 1 (rows and columns), and a
three-dimensional array has axes 0, 1 and 2.
2. Shape: The number of elements along each axis, given as a tuple.
# importing numpy module
import numpy as np

# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# creating numpy array
sample_array = np.array([list_1, list_2, list_3])
print("Numpy array :")
print(sample_array)

# print shape of the array
print("Shape of the array :", sample_array.shape)
Output:
Numpy array :
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
Shape of the array : (3, 4)

Rank: The rank of an array is simply the number of axes (or dimensions) it has.
The one-dimensional array has rank 1.

Rank 1

The two-dimensional array has rank 2.


Rank 2

Data type objects (dtype): A data type object (dtype) is an instance of the numpy.dtype class. It
describes how the bytes in the fixed-size block of memory corresponding to an array item
should be interpreted.

# Import module
import numpy as np
# Creating the array
sample_array_1 = np.array([[0, 4, 2]])

sample_array_2 = np.array([0.2, 0.4, 2.4])


# display data type
print("Data type of the array 1 :", sample_array_1.dtype)
print("Data type of array 2 :", sample_array_2.dtype)

Output:
Data type of the array 1 : int32
Data type of array 2 : float64

(The default integer type is platform-dependent: int32 on Windows, int64 on most Linux/macOS builds.)
numpy.arange(): This is an inbuilt NumPy function that returns evenly spaced values within a
given interval.
Syntax: numpy.arange([start, ]stop, [step, ]dtype=None)

import numpy as np

np.arange(1, 20, 2, dtype=np.float32)

Output:
array([ 1., 3., 5., 7., 9., 11., 13., 15., 17., 19.], dtype=float32)

numpy.empty(): This function creates a new array of given shape and type, without initializing
values.
Syntax: numpy.empty(shape, dtype=float, order='C')

import numpy as np

np.empty([4, 3], dtype=np.int32, order='f')

Output (the contents are uninitialized, so the actual values will vary from run to run):
array([[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11],
[ 4, 8, 12]])

numpy.ones(): This function is used to get a new array of given shape and type, filled with
ones (1).
Syntax: numpy.ones(shape, dtype=None, order='C')

import numpy as np

np.ones([4, 3], dtype=np.int32, order='f')

Output:
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])

numpy.zeros(): This function is used to get a new array of given shape and type, filled with
zeros (0).
Syntax: numpy.zeros(shape, dtype=float, order='C')

import numpy as np

np.zeros([4, 3], dtype=np.int32, order='f')

Output:
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])

Array Attributes:

Below are the attributes used to inspect an array:

 .shape: Array dimensions.
 .dtype: Data type of the array elements.
 .ndim: Number of array dimensions.
 .size: Total number of elements.

import numpy as np

np_array = np.array([1, 2, 3, 4, 5])
print(np_array.shape)   # (5,)
print(np_array.dtype)   # int64 (platform-dependent)
print(np_array.ndim)    # 1
print(np_array.size)    # 5

Numpy Aggregations:
NumPy aggregation functions:
 numpy.sum: Computes the sum of array elements.
 numpy.mean: Computes the mean (average) of array elements.
 numpy.min and numpy.max: Compute the minimum and maximum values of an array.
 numpy.median: Computes the median of array elements.

In Pandas, the aggregate() function is used to apply some aggregation across one or more columns,
using a callable, string, dict, or list of strings/callables. The most frequently used aggregations are
sum (return the sum of the values for the requested axis) and min (return the minimum of the values
for the requested axis).

numpy.sum() - Computes the sum of array elements.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Sum of all elements in the array
total_sum = np.sum(arr)
print("Total Sum:", total_sum)

# Sum along a specific axis (axis=0 for columns, axis=1 for rows)
column_sum = np.sum(arr, axis=0)
row_sum = np.sum(arr, axis=1)

print("Column Sum:", column_sum)
print("Row Sum:", row_sum)
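
Output:
Total Sum: 21
Column Sum: [5 7 9]
Row Sum: [ 6 15]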

numpy.mean() - Computes the mean (average) of array elements.

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Mean of all elements in the array
average = np.mean(arr)
print("Mean:", average)

# Mean along a specific axis
column_mean = np.mean(arr, axis=0)
row_mean = np.mean(arr, axis=1)

print("Column Mean:", column_mean)
print("Row Mean:", row_mean)

numpy.min() and numpy.max() - Compute the minimum and maximum values of an array

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Minimum and maximum values in the array
min_value = np.min(arr)
max_value = np.max(arr)

print("Minimum Value:", min_value)
print("Maximum Value:", max_value)

# Minimum and maximum along a specific axis
min_col = np.min(arr, axis=0)
max_row = np.max(arr, axis=1)

print("Minimum Value Along Columns:", min_col)
print("Maximum Value Along Rows:", max_row)

numpy.median() - Computes the median of array elements.

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Median of all elements in the array
median = np.median(arr)
print("Median:", median)

# Median along a specific axis
column_median = np.median(arr, axis=0)
row_median = np.median(arr, axis=1)

print("Column Median:", column_median)
print("Row Median:", row_median)

Operators used to perform basic mathematical computations using NumPy Arrays:


 Arithmetic Operations: +, -, *, /, ** (element-wise operations).
 Aggregation Functions: np.sum(), np.mean(), np.min(), np.max().
 Reshaping Arrays: np_array.reshape(new_shape).
 Transpose: np_array.T.

Arithmetic Operations:

array_sum = np_array + np_array

array_diff = np_array - np_array

array_product = np_array * np_array

array_division = np_array / np_array

array_power = np_array ** 2

Aggregation Functions:

array_sum = np.sum(np_array)

array_mean = np.mean(np_array)

array_min = np.min(np_array)

array_max = np.max(np_array)

Reshaping and Transpose:

reshaped_array = np_array.reshape((5, 1))

transposed_array = reshaped_array.T

Matrix Multiplication:

matrix1 = np.array([[1, 2], [3, 4]])

matrix2 = np.array([[5, 6], [7, 8]])

matrix_product = np.matmul(matrix1, matrix2)

NumPy Comparison Functions:


Functions Descriptions

greater() returns element-wise True if the first value is greater than the second
greater_equal() returns element-wise True if the first value is greater than or equal to the second
less() returns element-wise True if the first value is less than the second
less_equal() returns element-wise True if the first value is less than or equal to the second
equal() returns element-wise True if two values are equal

import numpy as np

a = np.array([101, 99, 87])
b = np.array([897, 97, 111])

print("Array a: ", a)
print("Array b: ", b)

print("a > b")
print(np.greater(a, b))

print("a >= b")
print(np.greater_equal(a, b))

print("a < b")
print(np.less(a, b))

print("a <= b")
print(np.less_equal(a, b))

Output:
Array a:  [101  99  87]
Array b:  [897  97 111]
a > b
[False  True False]
a >= b
[False  True False]
a < b
[ True False  True]
a <= b
[ True False  True]

Reference : https://www.geeksforgeeks.org/how-to-compare-two-numpy-arrays/

Masks in numpy:
A mask is either nomask , indicating that no value of the associated array is invalid, or an array
of booleans that determines for each element of the associated array whether the value is valid
or not.
Masking comes up when you want to extract, modify, count, or otherwise manipulate values in
an array based on some criterion: for example, you might wish to count all values greater than
a certain value, or perhaps remove all outliers that are above some threshold.
A masked array is an array that may have invalid or missing entries. Using masking, we can
easily handle missing, invalid, or unwanted entries in our array or dataset/dataframe.

Difference between array and masked array:


The difference resides in the data held by the two structures. Using a regular array with np.nan,
there is no data behind invalid values. Using a masked array, you can initialize a full array,
and then apply a mask over it so that certain values appear invalid.
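
For instance, a minimal sketch that combines a boolean mask with the numpy.ma module (the -999
sentinel for invalid readings is an assumption for illustration):

import numpy as np

# a hypothetical reading where -999 marks an invalid entry
data = np.array([1.0, 2.0, -999.0, 4.0, 5.0])

# boolean masking: extract values meeting a criterion
print(data[data > 2])            # [4. 5.]

# masked array: hide invalid entries instead of removing them
masked = np.ma.masked_equal(data, -999.0)
print(masked)                    # [1.0 2.0 -- 4.0 5.0]
print(masked.mean())             # 3.0, computed over valid entries only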
NumPy Array – Logical Operations:
Logical operations are used to find the logical relation between two arrays or lists or variables.
We can perform logical operations using NumPy between two data.
Logical AND :
The numpy module supports the logical_and operator, which relates two variables (or the
corresponding elements of two arrays): the output is 1 (True) only when both values are 1; if
either value is 0, the output is 0 (False).

Syntax:
numpy.logical_and(var1,var2)
Where, var1 and var2 are a single variable or a list/array.
Return type: Boolean value (True or False)

# importing numpy module
import numpy as np

# list 1 represents an array with boolean values
list1 = [True, False, True, False]

# list 2 represents an array with boolean values
list2 = [True, True, False, True]

# logical operations between boolean values
print('Operation between two lists = ',
      np.logical_and(list1, list2))

Output:
Operation between two lists =  [ True False False False]

Logical OR:
The NumPy module supports the logical_or operator. It is also used to relate two variables: if
both variables are 0 the output is 0, and if either variable is 1 the output is 1.
Syntax:
numpy.logical_or(var1,var2)
Where, var1 and var2 are a single variable or a list/array.
Return type: Boolean value (True or False)

Logical NOT:
The logical_not operation takes one value and inverts it. If the value is 0 the output is True (1);
any value greater than or equal to 1 gives False (0).
Syntax:
numpy.logical_not(var1)
Where, var1 is a single variable or a list/array.
Return type: Boolean value (True or False)

Logical XOR:
The logical_xor performs the XOR operation between two variables or lists: if the two values
are the same it returns False (0), otherwise True (1).
Syntax:
numpy.logical_xor(var1,var2)
Where, var1 and var2 are a single variable or a list/array.
Return type: Boolean value (True or False)
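
A combined sketch of the remaining operators, reusing the two boolean lists from the logical AND
example above:

import numpy as np

list1 = [True, False, True, False]
list2 = [True, True, False, True]

print('OR : ', np.logical_or(list1, list2))    # [ True  True  True  True]
print('NOT: ', np.logical_not(list1))          # [False  True False  True]
print('XOR: ', np.logical_xor(list1, list2))   # [False  True  True  True]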

Reference : https://www.geeksforgeeks.org/numpy-array-logical-operations/

Logic functions:

Truth value testing:
all(a[, axis, out, keepdims, where]) Test whether all array elements along a given axis evaluate to True.
any(a[, axis, out, keepdims, where]) Test whether any array element along a given axis evaluates to True.

Array contents:
isfinite(x, /[, out, where, casting, order, ...]) Test element-wise for finiteness (not infinity and not Not a Number).
isinf(x, /[, out, where, casting, order, ...]) Test element-wise for positive or negative infinity.
isnan(x, /[, out, where, casting, order, ...]) Test element-wise for NaN and return result as a boolean array.
isnat(x, /[, out, where, casting, order, ...]) Test element-wise for NaT (not a time) and return result as a boolean array.
isneginf(x[, out]) Test element-wise for negative infinity, return result as bool array.
isposinf(x[, out]) Test element-wise for positive infinity, return result as bool array.

Array type testing:
iscomplex(x) Returns a bool array, where True if input element is complex.
iscomplexobj(x) Check for a complex type or an array of complex numbers.
isfortran(a) Check if the array is Fortran contiguous but not C contiguous.
isreal(x) Returns a bool array, where True if input element is real.
isrealobj(x) Return True if x is not a complex type or an array of complex numbers.
isscalar(element) Returns True if the type of element is a scalar type.

Logical operations:
logical_and(x1, x2, /[, out, where, ...]) Compute the truth value of x1 AND x2 element-wise.
logical_or(x1, x2, /[, out, where, casting, ...]) Compute the truth value of x1 OR x2 element-wise.
logical_not(x, /[, out, where, casting, ...]) Compute the truth value of NOT x element-wise.
logical_xor(x1, x2, /[, out, where, ...]) Compute the truth value of x1 XOR x2, element-wise.

Comparison:
allclose(a, b[, rtol, atol, equal_nan]) Returns True if two arrays are element-wise equal within a tolerance.
isclose(a, b[, rtol, atol, equal_nan]) Returns a boolean array where two arrays are element-wise equal within a tolerance.
array_equal(a1, a2[, equal_nan]) True if two arrays have the same shape and elements, False otherwise.
array_equiv(a1, a2) Returns True if input arrays are shape consistent and all elements equal.
greater(x1, x2, /[, out, where, casting, ...]) Return the truth value of (x1 > x2) element-wise.
greater_equal(x1, x2, /[, out, where, ...]) Return the truth value of (x1 >= x2) element-wise.
less(x1, x2, /[, out, where, casting, ...]) Return the truth value of (x1 < x2) element-wise.
less_equal(x1, x2, /[, out, where, casting, ...]) Return the truth value of (x1 <= x2) element-wise.
equal(x1, x2, /[, out, where, casting, ...]) Return (x1 == x2) element-wise.
not_equal(x1, x2, /[, out, where, casting, ...]) Return (x1 != x2) element-wise.
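
For instance, a short sketch of a few of these routines (the sample arrays are illustrative):

import numpy as np

a = np.array([1.0, 2.0, np.nan])
b = np.array([1.0 + 1e-9, 2.0, np.nan])

print(np.isnan(a))                        # [False False  True]
print(np.isclose(a, b, equal_nan=True))   # [ True  True  True]
print(np.allclose(a, b, equal_nan=True))  # True
print(np.array_equal([1, 2], [1, 2]))     # True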

Reference : https://numpy.org/doc/stable/reference/routines.logic.html

Fancy indexing numpy:

Fancy indexing is conceptually simple: it means passing an array of indices to access multiple
array elements at once; we pass the array of indices in brackets. Fancy indexing can perform
more advanced and efficient array operations, including conditional filtering, sorting, and so on
(see the sketch after the examples below).

# Select Multiple Elements Using NumPy Fancy Indexing

import numpy as np

# create a numpy array

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# select elements at index 1, 2, 5, 7

select_elements = array1[[1, 2, 5, 7]]

print(select_elements)

Output:

[2 3 6 8]

Reference : https://www.programiz.com/python-programming/numpy/fancy-indexing

Example: NumPy Fancy Indexing

import numpy as np

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# select a single element

simple_indexing = array1[3]

print("Simple Indexing:",simple_indexing) # 4

# select multiple elements

fancy_indexing = array1[[1, 2, 5, 7]]

print("Fancy Indexing:",fancy_indexing) # [2 3 6 8]
Output:

Simple Indexing: 4

Fancy Indexing: [2 3 6 8]
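
The conditional filtering and sorting mentioned earlier can also be expressed with index arrays;
a brief sketch (the sample arrays are illustrative):

import numpy as np

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# conditional filtering: a boolean mask selects matching elements
print(array1[array1 % 2 == 0])   # [2 4 6 8]

# sorting via fancy indexing: argsort() returns an index array
scores = np.array([40, 10, 30, 20])
order = np.argsort(scores)
print(order)                     # [1 3 2 0]
print(scores[order])             # [10 20 30 40]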

NumPy’s Structured Array | Create, Use and Manipulate Array :


Numpy’s Structured Array is similar to the Struct in C. It is used for grouping data of different
data types and sizes.
Structured array uses data containers called fields. Each data field can contain data of any data
type and size.
Array elements can be accessed with the help of dot notation. For example, if you have a
structured array “Student”, you can access the ‘class’ field by calling Student[‘class’].
Properties of Structured Array
 All structs in the array have the same number of fields.
 All structs have the same field names.

Example : student (name, year, marks)

Every record in the student array has the same structure of fields (name, year, marks), so the
array behaves much like an array of C structs; a new record must supply a value for every field.
Creating Structured Array in Python NumPy:
We can create a structured array in Python using the NumPy module.

Follow the steps below to create a structured array:


Step 1: Import NumPy library
Step 2: Define the data type of structured array by creating a list of tuples, where each tuple
contains the name of the field and its data type.
Step 3: You can now create the structured array using NumPy.array() method and set the dtype
argument to the data type you defined in the previous step.

Example: Creating Structured Array in NumPy Python

import numpy as np

dt = np.dtype([('name', (np.str_, 10)), ('age', np.int32), ('weight', np.float64)])

a = np.array([('Sana', 2, 21.0), ('Mansi', 7, 29.0)], dtype=dt)

print(a)

Output
[('Sana', 2, 21.0) ('Mansi', 7, 29.0)]

Structured Array Operations:


Python offers many operations that you can perform on the structured array as a whole. These
operations allow us to manipulate the entire structured array without worrying about individual
fields.

Sorting Structured Array:

A structured array can be sorted using the np.sort() method, passing the field name by which to
sort via the order parameter.

Example:

import numpy as np

a = np.array([('Sana', 2, 21.0), ('Mansi', 7, 29.0)],
             dtype=[('name', (np.str_, 10)), ('age', np.int32), ('weight', np.float64)])

# Sorting according to the name
b = np.sort(a, order='name')
print('Sorting according to the name', b)

# Sorting according to the age
b = np.sort(a, order='age')
print('\nSorting according to the age', b)

Output
Sorting according to the name [('Mansi', 7, 29.0) ('Sana', 2, 21.0)]

Sorting according to the age [('Sana', 2, 21.0) ('Mansi', 7, 29.0)]

Finding Min and Max in Structured Array:

We can find the minimum and maximum of a structured array using the np.min() and np.max()
functions, passing the field of interest to the function.

Example:

import numpy as np

a = np.array([('Sana', 2, 21.0), ('Mansi', 7, 29.0)],
             dtype=[('name', (np.str_, 10)), ('age', np.int32), ('weight', np.float64)])

max_age = np.max(a['age'])

min_age = np.min(a['age'])

print("Max age = ",max_age)

print("Min age = ", min_age)

Output
Max age = 7
Min age = 2
Concatenating Structured Array:

We can use the np.concatenate() function to concatenate two structured arrays. Look at the
example below showing the concatenation of two structured arrays.

Example:

import numpy as np

a = np.array([('Sana', 2, 21.0), ('Mansi', 7, 29.0)],
             dtype=[('name', (np.str_, 10)), ('age', np.int32), ('weight', np.float64)])

b = np.array([('Ayushi', 5, 30.0)], dtype=a.dtype)

c = np.concatenate((a, b))

print(c)

Output:
[('Sana', 2, 21.) ('Mansi', 7, 29.) ('Ayushi', 5, 30.)]

Reshaping a Structured Array:

We can reshape a structured array by using the NumPy.reshape() function.


Note: The total size of the structured array will remain the same.

Example:

import numpy as np

a = np.array([('Sana', 2, 21.0), ('Mansi', 7, 29.0)],
             dtype=[('name', (np.str_, 10)), ('age', np.int32), ('weight', np.float64)])

reshaped_a = np.reshape(a, (2, 1))

print(reshaped_a)
Output:
[[('Sana', 2, 21.)]
[('Mansi', 7, 29.)]]

Uses of Structured Array:


Structured arrays in NumPy allow us to work with arrays that contain elements of different
data types. They are very useful in the case of tabular and structured data. Here are some uses
of structured arrays:

1) Grouping Data

NumPy’s structured arrays allow us to group data of different data types and sizes. Each field
in a structured array can contain data of any data type, making it a versatile tool for data
grouping.

2) Tabular Data

Structured arrays can be a great tool when dealing with tabular data. They allow us to store and
manipulate complex data structures with multiple fields, similar to a table or a spreadsheet.

3) Data Analysis

Structured arrays are very useful for data analysis. They provide efficient, flexible data
containers that allow us to perform operations on entire datasets at once.

4) Memory efficiency

Structured arrays are memory-efficient. They allow us to store complex, heterogeneous data in
a compact format, which can be important when working with large datasets.

5) Integrating with other libraries

Many Python libraries, such as Pandas and Scikit-learn, are built on top of NumPy and can
work directly with structured arrays. This makes structured arrays a good choice when you
need to integrate your code with other libraries.

Use Cases for Structured Arrays:

Structured arrays are particularly useful in scenarios involving tabular or structured data. Some
common use cases include:
1) Data Import/Export

When working with structured data from external sources like CSV files or databases, we can
use structured arrays to read, manipulate, and process the data efficiently.

2) Data Analysis

Structured arrays provide a convenient way to perform various data analysis tasks. We can use
them to filter, sort, group, and aggregate data based on different fields, enabling us to gain
insights and extract meaningful information from the data.

3) Simulation and Modeling

In scientific simulations or modeling tasks, structured arrays can be used to represent different
variables or parameters. This allows us to organize and manipulate the data efficiently,
facilitating complex calculations and simulations.

4) Record-keeping and Databases

Structured arrays are useful for record-keeping applications or when working with small
databases. They provide an organized and efficient way to store, query, and modify records with
multiple fields.

Reference : https://www.tutorialspoint.com/structured-array-in-numpy

Data manipulation with Pandas:

Pandas data manipulation is the process of cleaning, transforming, and aggregating data using
the Pandas library. Pandas provides a variety of functions for performing these tasks, making it a
powerful and versatile tool for data analysis.

In machine learning, the model requires a dataset to operate, i.e. to train and test. But data
doesn't come fully prepared and ready to use. There are discrepancies like NaN/Null/NA values
in many rows and columns. Sometimes the dataset also contains rows and columns that are not
even required for the operation of our model. In such conditions, the dataset requires proper
cleaning and modification to make it an efficient input for our model. We achieve that by
practicing data wrangling before giving the data as input to the model.

Benefits of pandas in data manipulation:


 Pandas is easy to use and only requires a few skills.
 Pandas' data structures are efficient.
 Pandas' tools are incredibly powerful.
 Data merging is simple in various situations.
 Pandas can cope with missing values (see the sketch below).
 Column and row operations are simple in Pandas.
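
As a quick sketch of how Pandas copes with missing values (the small frame below is illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Abhijit', 'Smriti', 'Akash'],
                   'Age': [20, np.nan, 20]})

print(df.isna().sum())                        # missing values per column
print(df.fillna({'Age': df['Age'].mean()}))   # fill NaN with the column mean
print(df.dropna())                            # or drop rows containing NaN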

Installing Pandas:
pip install pandas

Creating DataFrame:

# Importing the pandas library


import pandas as pd
# creating a dataframe object
student_register = pd.DataFrame()
# assigning values to the
# rows and columns of the dataframe
student_register['Name'] = ['Abhijit', 'Smriti', 'Akash', 'Roshni']
student_register['Age'] = [20, 19, 20, 14]
student_register['Student'] = [False, True, True, False]
print(student_register)

Output:
Name Age Student
0 Abhijit 20 False
1 Smriti 19 True
2 Akash 20 True
3 Roshni 14 False
Adding data in DataFrame using Append Function:

# creating a new pandas
# series object
new_person = pd.Series(['Mansi', 19, True], index=['Name', 'Age', 'Student'])

# using the .append() function to add that row to the dataframe
# (DataFrame.append() was deprecated and removed in pandas 2.0;
# use pd.concat() there instead)
student_register = student_register.append(new_person, ignore_index=True)
print(student_register)

Output:

Name Age Student
0 Abhijit 20 False
1 Smriti 19 True
2 Akash 20 True
3 Roshni 14 False
4 Mansi 19 True

Data Manipulation on Dataset:

There are three support members, the .shape attribute and the .info() and .corr() methods, which
output the shape of the table, information on rows and columns, and the correlation between
numerical columns respectively.
# dimension of the dataframe
print('Shape: ')
print(student_register.shape)
print('--------------------------------------')
# showing info about the data
print('Info: ')
print(student_register.info())
print('--------------------------------------')
# correlation between columns
# (in pandas >= 2.0, call student_register.corr(numeric_only=True)
#  so that the non-numeric Name column is skipped)
print('Correlation: ')
print(student_register.corr())

Output:

Shape:
(4, 3)
--------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 4 non-null object
1 Age 4 non-null int64
2 Student 4 non-null bool
dtypes: bool(1), int64(1), object(1)
memory usage: 196.0+ bytes
None
--------------------------------------
Correlation:
Age Student
Age 1.000000 0.502519
Student 0.502519 1.000000
Getting Statistical Analysis of Data:
Before processing and wrangling any data you need to get a total overview of it, which includes
statistical conclusions like the standard deviation (std), the mean, and its quartile distribution.
# for showing the statistical
# info of the dataframe
print('Describe')
print(student_register.describe())

Output:
Describe
Age
count 4.000000
mean 18.250000
std 2.872281
min 14.000000
25% 17.750000
50% 19.500000
75% 20.000000
max 20.000000

The description of the output given by .describe() method is as follows:

1. count is the number of rows in the dataframe.


2. mean is the mean value of all the entries in the “Age” column.
3. std is the standard deviation of the corresponding column.
4. min and max are the minimum and maximum entry in the column respectively.
5. 25%, 50% and 75% are the First Quartiles, Second Quartile(Median) and Third
Quartile respectively, which gives us important info on the distribution of the dataset and
makes it simpler to apply an ML model.

Dropping Columns from Data:


To drop a column from the data, use the drop function from pandas with axis=1 for columns.
students = student_register.drop('Age', axis=1)

print(students.head())

Output:

Name Student

0 Abhijit False

1 Smriti True

2 Akash True

3 Roshni False

Dropping Rows from Data:

To drop a row from the data, use the drop function from pandas with axis=0 for rows.

students = students.drop(2, axis=0)

print(students.head())

Output:

Name Student

0 Abhijit False

1 Smriti True

3 Roshni False

Reference : https://www.geeksforgeeks.org/data-manipulattion-in-python-using-pandas/
Indexing and Selecting Data with Pandas:

Indexing in Pandas :
Indexing in pandas means simply selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the
rows and all of the columns, or some of each of the rows and columns. Indexing can also be
known as Subset Selection.
Pandas Indexing using [ ], .loc[], .iloc[ ], .ix[ ]
There are a lot of ways to pull the elements, rows, and columns from a DataFrame. There are
some indexing method in Pandas which help in getting an element from a DataFrame. These
indexing methods appear very similar but behave very differently. Pandas support four types of
Multi-axes indexing they are:
 Dataframe[ ] : This function is also known as the indexing operator.
 Dataframe.loc[ ] : This function is used for labels.
 Dataframe.iloc[ ] : This function is used for positions (integer based).
 Dataframe.ix[ ] : This function was used for both label and integer based selection (now removed from pandas).
Collectively, they are called the indexers. These are by far the most common ways to index
data. These are four function which help in getting the elements, rows, and columns from a
DataFrame.

Indexing a Dataframe using indexing operator [] :

The indexing operator refers to the square brackets following an object.

The .loc and .iloc indexers also use the indexing operator to make selections. In this section,
the plain indexing operator refers to df[].

Selecting a single column:


# importing pandas package
import pandas as pd

# making data frame from csv file


data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving columns by indexing operator


first = data["Age"]
print(first)

Output:
Selecting multiple columns:
# importing pandas package
import pandas as pd

# making data frame from csv file


data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving multiple columns by indexing operator


first = data[["Age", "College", "Salary"]]

first

Output:
Indexing a DataFrame using .loc[ ] :

This function selects data by the label of the rows and columns. The df.loc indexer selects data
in a different way than just the indexing operator. It can select subsets of rows or columns. It
can also simultaneously select subsets of rows and columns.

Selecting a single row:

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving row by loc method

first = data.loc["Avery Bradley"]


second = data.loc["R.J. Hunter"]

print(first, "\n\n\n", second)

Output:

Selecting multiple rows:

import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving multiple rows by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
print(first)

Output:

Selecting two rows and three columns:

Dataframe.loc[["row1", "row2"], ["column1", "column2", "column3"]]


import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving two rows and three columns by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"], ["Team", "Number", "Position"]]
print(first)
Output:

Selecting all of the rows and some columns:

Dataframe.loc[:, ["column1", "column2", "column3"]]

import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving all rows and some columns by loc method
first = data.loc[:, ["Team", "Number", "Position"]]
print(first)

Output:
Indexing a DataFrame using .iloc[ ] :
This function allows us to retrieve rows and columns by position. In order to do that, we’ll
need to specify the positions of the rows that we want, and the positions of the columns that we
want as well. The df.iloc indexer is very similar to df.loc but only uses integer locations to
make its selections.
Selecting a single row:

import pandas as pd

# making data frame from csv file


data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving rows by iloc method


row2 = data.iloc[3]

print(row2)

Output:
Selecting multiple rows:

row2 = data.iloc[[3, 5, 7]]

Selecting two rows and two columns:

row2 = data.iloc[[3, 4], [1, 2]]

Selecting all the rows and some columns:

row2 = data.iloc[:, [1, 2]]

Indexing using Dataframe.ix[ ] :

This indexer was capable of selecting both by label and by integer location. While it was
versatile, it caused lots of confusion because it is not explicit: integers can sometimes also be
labels for rows or columns, so there were instances where it was ambiguous. Generally, .ix is
label based and acts just like the .loc indexer. However, .ix also supports integer-type selections
(as in .iloc) where passed an integer; this only works where the index of the DataFrame is not
integer based. .ix will accept any of the inputs of .loc and .iloc.

Note: The .ix indexer was deprecated and has since been removed from pandas; the examples
below require an old pandas version, and .loc or .iloc should be used instead.

Selecting a single row using .ix[] as .loc[] :

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving row by ix method


first = data.ix["Avery Bradley"]

print(first)

Output:

Selecting a single row using .ix[] as .iloc[]:

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving row by ix method

first = data.ix[1]

print(first)

Output:
Methods for indexing in DataFrame:

Function Description

Dataframe.head() Return the top n rows of a data frame.
Dataframe.tail() Return the bottom n rows of a data frame.
Dataframe.at[] Access a single value for a row/column label pair.
Dataframe.iat[] Access a single value for a row/column pair by integer position.
Dataframe.iloc[] Purely integer-location based indexing for selection by position.
DataFrame.lookup() Label-based "fancy indexing" function for DataFrame.
DataFrame.pop() Return item and drop from frame.
DataFrame.xs() Returns a cross-section (row(s) or column(s)) from the DataFrame.
DataFrame.get() Get item from object for given key (DataFrame column, Panel slice, etc.).
DataFrame.isin() Return boolean DataFrame showing whether each element in the DataFrame is contained in values.
DataFrame.where() Return an object of same shape as self whose corresponding entries are from self where cond is True and otherwise are from other.
DataFrame.mask() Return an object of same shape as self whose corresponding entries are from self where cond is False and otherwise are from other.
DataFrame.query() Query the columns of a frame with a boolean expression.
DataFrame.insert() Insert column into DataFrame at specified location.

Reference : https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/

Another Reference : https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html

Another Reference : Python Pandas - Indexing and Selecting Data (tutorialspoint.com)

Scientific computing and numerical simulations with SciPy and SimPy:

SciPy:

SciPy is a library of numerical routines for the Python programming language that provides
fundamental building blocks for modeling and solving scientific problems.

It is a Python library useful for solving many mathematical equations and algorithms. It is
designed on top of the NumPy library and extends it with scientific mathematical routines like
matrix rank, inverse, polynomial equations, LU decomposition, etc.

It is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific
Python. It provides more utility functions for optimization, stats and signal processing.

It is a library used by scientists, analysts, and engineers doing scientific computing and
technical computing. It contains modules for optimization, linear algebra, integration,
interpolation, special functions, FFT, signal and image processing, ODE solvers, and other
tasks common in science and engineering.

It is an open-source library built on top of the foundational library NumPy (Numerical Python). It
extends the capabilities of NumPy by adding a vast collection of high-level functions and routines
that are essential for scientific computing, data analysis, and engineering.
It covers a broad spectrum of domains including linear algebra, optimization, signal processing,
statistics, integration, interpolation, and more.

Key Features and Modules:

1. Linear Algebra
The `scipy.linalg` module provides functions for performing linear algebra operations, such as
solving linear systems, computing eigenvalues and eigenvectors, and matrix factorizations. These
operations are fundamental to various scientific and engineering applications, including data
analysis, machine learning, and simulations.

2. Optimization
The `scipy.optimize` module offers a range of optimization algorithms for finding the minimum
or maximum of functions. These algorithms are crucial for parameter estimation, model fitting,
and solving optimization problems across different fields. From simple gradient-based methods to
more advanced global optimization techniques, SciPy has you covered.
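
As a quick illustration, a minimal sketch using `scipy.optimize.minimize` (the quadratic objective
and starting point are illustrative):

from scipy import optimize

# minimize f(x) = (x - 3)^2, whose minimum is at x = 3
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0.0)

print(result.x)   # approximately [3.]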

3. Signal and Image Processing

With the `scipy.signal` and `scipy.ndimage` modules, you can perform tasks such as signal
filtering, convolution, image manipulation, and feature extraction. These tools are vital for
processing and analyzing signals, images, and multidimensional data.
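
A brief sketch of signal filtering with `scipy.signal` (the spike-removal example is illustrative):

import numpy as np
from scipy import signal

# median-filter a noisy signal to suppress a single spike
x = np.array([1.0, 1.0, 10.0, 1.0, 1.0])
print(signal.medfilt(x, kernel_size=3))   # [1. 1. 1. 1. 1.]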

4. Statistics
The `scipy.stats` module provides a comprehensive suite of statistical functions for probability
distributions, hypothesis testing, descriptive statistics, and more. Researchers and data analysts
can leverage these tools to gain insights from data and make informed decisions.
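
A short sketch of `scipy.stats` (the sample data is illustrative):

from scipy import stats

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(stats.describe(data))   # count, mean, variance, skewness, kurtosis
print(stats.norm.cdf(0))      # 0.5 for the standard normal distribution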

5. Integration and Interpolation

Integration and interpolation are common tasks in scientific computing. SciPy's `scipy.integrate`
module offers methods for numerical integration, while the `scipy.interpolate` module provides
interpolation techniques to estimate values between data points.
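
A minimal sketch combining `scipy.integrate` and `scipy.interpolate` (the integrand and data
points are illustrative):

import numpy as np
from scipy import integrate, interpolate

# numerically integrate sin(x) from 0 to pi (exact value: 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)   # 2.0 (to within the reported error estimate)

# linear interpolation between known data points
f = interpolate.interp1d([0, 1, 2], [0, 2, 4])
print(f(1.5))  # 3.0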

6. Special Functions
Scientific and mathematical computations often involve special functions like Bessel functions,
gamma functions, and hypergeometric functions. The `scipy.special` module offers a collection of
these functions, enabling researchers to solve complex mathematical problems.
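
A small sketch of `scipy.special` (the chosen functions are illustrative):

from scipy import special

print(special.gamma(5))    # 24.0, since gamma(n) = (n-1)!
print(special.jv(0, 0.0))  # 1.0, Bessel function of the first kind at x = 0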

pip install scipy

import numpy as np
from scipy import linalg

# Coefficient matrix
A = np.array([[2, 1], [1, 3]])

# Right-hand side vector
b = np.array([5, 8])

# Solve the linear system A x = b
x = linalg.solve(A, b)

print("Solution:", x)

Output:
Solution: [1.4 2.2]

In this example, the `linalg.solve` function from SciPy's `linalg` module is used to solve the
system of equations represented by the matrix `A` and vector `b`.

SimPy:

SimPy is a process-based discrete-event simulation framework based on standard Python.


Processes in SimPy are defined by Python generator functions and can, for example, be used to
model active components like customers, vehicles or agents. SimPy also provides various types
of shared resources to model limited capacity congestion points (like servers, checkout counters
and tunnels).
Simulations can be performed “as fast as possible”, in real time (wall clock time) or by manually
stepping through the events.

$ python -m pip install simpy

Example : A clock process that prints the current simulation time at each step

>>> import simpy


>>>
>>> def clock(env, name, tick):
... while True:
... print(name, env.now)
... yield env.timeout(tick)
...
>>> env = simpy.Environment()
>>> env.process(clock(env, 'fast', 0.5))
<Process(clock) object at 0x...>
>>> env.process(clock(env, 'slow', 1))
<Process(clock) object at 0x...>
>>> env.run(until=2)
fast 0
slow 0
fast 0.5
slow 1
fast 1.0
fast 1.5
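
The shared resources mentioned above can be modelled in the same style; a minimal sketch of a
single checkout counter (the process name customer and the timings are assumptions for
illustration):

import simpy

def customer(env, name, counter):
    arrive = env.now
    with counter.request() as req:   # queue for the shared counter
        yield req                    # wait until it is our turn
        print(name, 'waited', env.now - arrive)
        yield env.timeout(2)         # service takes 2 time units

env = simpy.Environment()
counter = simpy.Resource(env, capacity=1)
for i in range(3):
    env.process(customer(env, 'Customer %d' % i, counter))
env.run()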

Reference : https://pypi.org/project/simpy/

Large-scale data analysis and machine learning with Pandas:

Pandas is the most popular Python library used for data analysis. It provides highly optimized
performance, with back-end source code written purely in C or Python.
We can analyze data in Pandas with:
 Pandas Series
 Pandas DataFrames
Pandas Series
A Series in Pandas is a one-dimensional (1-D) array defined in pandas that can be used to store
data of any type.

# Program to create series

# Import Pandas library
import pandas as pd

# Create series with Data and Index
a = pd.Series(Data, index=Index)

Data can be:

1. A scalar value, such as an integer or a string
2. A Python dictionary, i.e. key/value pairs
3. An ndarray
Note: The index by default is 0, 1, 2, …, (n-1), where n is the length of the data.

Create Series from List:


Creating series with predefined index values.
# Numeric data
Data = [1, 3, 4, 5, 6, 2, 9]

# Creating series with default index values


s = pd.Series(Data)

# predefined index values


Index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

si = pd.Series(Data, Index)

Output:
Create Pandas Series from Dictionary:

dictionary = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

# Creating series of Dictionary type


sd = pd.Series(dictionary)

Output:

Convert an Array to Pandas Series:


# Defining 2darray
Data = [[2, 3, 4], [5, 6, 7]]
# Creating series of 2darray
snd = pd.Series(Data)
Output:
Pandas DataFrames:
The DataFrame in Pandas is a two-dimensional (2-D) data structure defined in pandas which
consists of rows and columns.

# Program to Create DataFrame

# Import Library
import pandas as pd

# Create DataFrame with Data


a = pd.DataFrame(Data)
Here, Data can be:
1. One or more dictionaries
2. One or more Series
3. 2D-numpy Ndarray

Create a Pandas DataFrame from multiple Dictionary:


# Define Dictionary 1
dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
# Define Dictionary 2
dict2 = {'a': 5, 'b': 6, 'c': 7, 'd': 8, 'e': 9}
# Define Data with dict1 and dict2
Data = {'first': dict1, 'second': dict2}
# Create DataFrame
df = pd.DataFrame(Data)
df

Output:
Convert list of dictionaries to a Pandas DataFrame:

Here, we take three dictionaries and, with the help of from_dict(), convert them into a Pandas
DataFrame.

import pandas as pd
data_c = [
{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]

pd.DataFrame.from_dict(data_c, orient='columns')

Output:
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6

Create DataFrame from Multiple Series:


import pandas as pd
# Define series 1
s1 = pd.Series([1, 3, 4, 5, 6, 2, 9])
# Define series 2
s2 = pd.Series([1.1, 3.5, 4.7, 5.8, 2.9, 9.3])
# Define series 3
s3 = pd.Series(['a', 'b', 'c', 'd', 'e'])
# Define Data
Data ={'first':s1, 'second':s2, 'third':s3}
# Create DataFrame
dfseries = pd.DataFrame(Data)
dfseries

Output:
Convert an Array to a Pandas DataFrame:
One constraint has to be maintained while creating a DataFrame from 2D arrays – the
dimensions of the 2D arrays must be the same.
# Program to create DataFrame from 2D array

# Import Library
import pandas as pd

# Define 2d array 1
d1 =[[2, 3, 4], [5, 6, 7]]

# Define 2d array 2
d2 =[[2, 4, 8], [1, 3, 9]]

# Define Data
Data ={'first': d1, 'second': d2}

# Create DataFrame
df2d = pd.DataFrame(Data)

df2d

Output:

Scikit-learn:

Scikit-learn is an open-source Python library that implements a range of machine learning,
pre-processing, cross-validation, and visualization algorithms using a unified interface.

It is the most useful and robust library for machine learning in Python. It provides a selection
of efficient tools for machine learning.
Scikit-learn has emerged as a powerful and user-friendly Python library. Its simplicity and
versatility make it a better choice for both beginners and seasoned data scientists to build and
implement machine learning models.

Scikit-learn is an open-source Python library that implements a range of machine learning, pre-
processing, cross-validation, and visualization algorithms using a unified interface. It is an
open-source machine-learning library that provides a plethora of tools for various machine-
learning tasks such as Classification, Regression, Clustering, and many more.

Installation of Scikit- learn:


The latest version of Scikit-learn is 1.1 and it requires Python 3.8 or newer.
Scikit-learn requires:
 NumPy
 SciPy as its dependencies.
Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have
a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

!pip install -U scikit-learn

Let us get started with the modeling process now.


Step 1: Load a Dataset
A dataset is nothing but a collection of data. A dataset generally has two main components:
 Features: (also known as predictors, inputs, or attributes) they are simply the variables of
our data. They can be more than one and hence represented by a feature matrix (‘X’ is a
common notation to represent feature matrix). A list of all the feature names is
termed feature names.
 Response: (also known as the target, label, or output) This is the output variable depending
on the feature variables. We generally have a single response column and it is represented
by a response vector (‘y’ is a common notation to represent response vector). All the
possible values taken by a response vector are termed target names.
Loading exemplar dataset: scikit-learn comes loaded with a few example datasets like the iris
and digits datasets for classification and the boston house prices dataset for regression.
Given below is an example of how one can load an exemplar dataset:

# load the iris dataset as an example

from sklearn.datasets import load_iris

iris = load_iris()

# store the feature matrix (X) and response vector (y)

X = iris.data
y = iris.target

# store the feature and target names

feature_names = iris.feature_names

target_names = iris.target_names

# printing features and target names of our dataset

print("Feature names:", feature_names)

print("Target names:", target_names)

# X and y are numpy arrays

print("\nType of X is:", type(X))

# printing first 5 input rows

print("\nFirst 5 rows of X:\n", X[:5])

Output:
Feature names: ['sepal length (cm)','sepal width (cm)',
'petal length (cm)','petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Type of X is: <class 'numpy.ndarray'>
First 5 rows of X:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]]

Loading external dataset: Now, consider the case when we want to load an external dataset.
For this purpose, we can use the pandas library for easily loading and manipulating datasets.
To install pandas, use the following pip command:
! pip install pandas
In pandas, important data types are:
 Series: A Series is a one-dimensional labeled array capable of holding any data type.
 DataFrame: A DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table, or a dict of Series
objects. It is generally the most commonly used pandas object.

Note: The CSV file used in the example below can be downloaded from here: weather.csv

import pandas as pd

# reading csv file

data = pd.read_csv('weather.csv')

# shape of dataset

print("Shape:", data.shape)

# column names

print("\nFeatures:", data.columns)

# storing the feature matrix (X) and response vector (y)

X = data[data.columns[:-1]]

y = data[data.columns[-1]]

# printing first 5 rows of feature matrix

print("\nFeature matrix:\n", X.head())


# printing first 5 values of response vector

print("\nResponse vector:\n", y.head())

Output:
Shape: (366, 22)
Features: Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
'Temp3pm', 'RainToday', 'RISK_MM', 'RainTomorrow'],
dtype='object')
Feature matrix:
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir \
0 8.0 24.3 0.0 3.4 6.3 NW
1 14.0 26.9 3.6 4.4 9.7 ENE
2 13.7 23.4 3.6 5.8 3.3 NW
3 13.3 15.5 39.8 7.2 9.1 NW
4 7.6 16.1 2.8 5.6 10.6 SSE
WindGustSpeed WindDir9am WindDir3pm WindSpeed9am ... Humidity9am \
0 30.0 SW NW 6.0 ... 68
1 39.0 E W 4.0 ... 80
2 85.0 N NNE 6.0 ... 82
3 54.0 WNW W 30.0 ... 62
4 50.0 SSE ESE 20.0 ... 68
Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am \
0 29 1019.7 1015.0 7 7 14.4
1 36 1012.4 1008.4 5 3 17.5
2 69 1009.5 1007.2 8 7 15.4
3 56 1005.5 1007.0 2 7 13.5
4 49 1018.3 1018.5 7 7 11.1
Temp3pm RainToday RISK_MM
0 23.6 No 3.6
1 25.7 Yes 3.6
2 20.2 Yes 39.8
3 14.1 Yes 2.8
4 15.4 Yes 0.0
[5 rows x 21 columns]
Response vector:
0 Yes
1 Yes
2 Yes
3 Yes
4 No
Name: RainTomorrow, dtype: object
Step 2: Splitting the Dataset
One important aspect of all machine learning models is to determine their accuracy. Now, in
order to determine their accuracy, one can train the model using the given dataset and then
predict the response values for the same dataset using that model and hence, find the accuracy
of the model.
But this method has several flaws in it, like:
 The goal is to estimate the likely performance of a model on out-of-sample data.
 Maximizing training accuracy rewards overly complex models that won't necessarily generalize.
 Unnecessarily complex models may over-fit the training data.
A better option is to split our data into two parts: the first one for training our machine learning
model, and the second one for testing our model (see the sketch after the lists below).
To summarize
 Split the dataset into two pieces: a training set and a testing set.
 Train the model on the training set.
 Test the model on the testing set and evaluate how well our model did.
Advantages of train/test split
 The model can be trained and tested on different data than the one used for training.
 Response values are known for the test dataset; hence predictions can be evaluated.
 Testing accuracy is a better estimate than training accuracy of out-of-sample performance.
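
A minimal sketch of such a split using scikit-learn's train_test_split (assuming X and y from the
iris example above; the 60/40 split ratio is illustrative):

from sklearn.model_selection import train_test_split

# X and y as loaded from the iris dataset earlier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

print(X_train.shape)   # (90, 4)
print(X_test.shape)    # (60, 4)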

Reference : https://www.geeksforgeeks.org/learning-model-building-scikit-learn-python-machine-learning-library/

Scikit- Learn:
scikit-learn is a free and open-source machine learning library for the Python programming
language.
Also known as sklearn, it is a Python library to implement machine learning models and statistical
modelling. Through scikit-learn, we can implement various machine learning models for
regression, classification, clustering, and statistical tools for analyzing these models.
It is an open-source machine learning library that supports supervised and unsupervised learning.
It also provides various tools for model fitting, data preprocessing, model selection, model
evaluation, and many other utilities.
It provides dozens of built-in machine learning algorithms and models, called estimators. Each
estimator can be fitted to some data using its fit method.
# Linear Regression
Example 1:
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])   # LinearRegression()
reg.coef_
# array([0.5, 0.5])
Example 2:
import matplotlib.pyplot as plt
import numpy as np

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)

plt.xticks(())
plt.yticks(())
plt.show()

# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
X = [[ 1, 2, 3], [11, 12, 13]]  # 2 samples, 3 features
y = [0, 1]                      # classes of each sample
clf.fit(X, y)                   # RandomForestClassifier(random_state=0)

clf.predict(X)                  # predict classes of the training data
# array([0, 1])
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
# array([0, 1])

# Support Vector Machine
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)             # SVC()
clf.predict([[2., 2.]])
# array([1])

To find out support_vectors_, support_ and n_support_:

# get support vectors
clf.support_vectors_
# array([[0., 0.], [1., 1.]])

# get indices of support vectors
clf.support_
# array([0, 1]...)

# get number of support vectors for each class
clf.n_support_
# array([1, 1]...)

# Multiclass Classification
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(X, Y)
dec = clf.decision_function([[1]])
dec.shape[1]   # 6 classes: 4*3/2 = 6
clf.decision_function_shape = "ovr"
dec = clf.decision_function([[1]])
dec.shape[1]   # 4 classes

TensorFlow:

TensorFlow is a free and open-source software library for machine learning and artificial
intelligence. It can be used across a range of tasks but has a particular focus on training and
inference of deep neural networks. It was developed by the Google Brain team for Google's
internal use in research and production.

It makes it easy to create ML models that can run in any environment.

import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()


x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)


model.evaluate(x_test, y_test)

Reference: https://www.tensorflow.org/
(Click the Run quickstart button to run the program)

https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb#scrollTo=rYb6DrEH0GMv
PyTorch:
PyTorch is a machine learning library based on the Torch library, used for applications such as
computer vision and natural language processing, originally developed by Meta AI and now part
of the Linux Foundation umbrella.

It is a fully featured framework for building deep learning models, which is a type of machine
learning that's commonly used in applications like image recognition and language processing.
Written in Python, it's relatively easy for most machine learning developers to learn and use.

It is recognized as one of the two most popular machine learning libraries alongside
TensorFlow, offering free and open-source software released under the modified BSD license.
Although the Python interface is more polished and the primary focus of development,
PyTorch also has a C++ interface.

PyTorch provides two main features:

 An n-dimensional Tensor, similar to numpy but can run on GPUs


 Automatic differentiation for building and training neural networks
We will use the problem of fitting y = sin(x) with a third-order polynomial as our running example.
The network will have four parameters, and will be trained with gradient descent to fit random
data by minimizing the Euclidean distance between the network output and the true output.

We will first implement the network using numpy.

Numpy provides an n-dimensional array object, and many functions for manipulating these
arrays. Numpy is a generic framework for scientific computing; it does not know anything about
computation graphs, or deep learning, or gradients. However we can easily use numpy to fit a
third order polynomial to sine function by manually implementing the forward and backward
passes through the network using numpy operations:

# -*- coding: utf-8 -*-
import numpy as np
import math

# Create random input and output data
x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)

# Randomly initialize weights
a = np.random.randn()
b = np.random.randn()
c = np.random.randn()
d = np.random.randn()

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    # y = a + b x + c x^2 + d x^3
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d

print(f'Result: y = {a} + {b} x + {c} x^2 + {d} x^3')

PyTorch: Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations.
For modern deep neural networks, GPUs often provide speedups of 50x or greater, so
unfortunately numpy won’t be enough for modern deep learning.

A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional


array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes,
Tensors can keep track of a computational graph and gradients, but they’re also useful as a
generic tool for scientific computing.

Also unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations.
To run a PyTorch Tensor on GPU, you simply need to specify the correct device.

Here we use PyTorch Tensors to fit a third order polynomial to sine function. Like the numpy
example above we need to manually implement the forward and backward passes through the
network:
# -*- coding: utf-8 -*-
import torch
import math

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')
