Unit 3
NumPy stands for Numerical Python. It is a Python library for working with arrays. In plain
Python, a list can serve as an array, but it is slow to process. NumPy provides a powerful
N-dimensional array object (ndarray) together with linear algebra, Fourier transform, and random
number capabilities. Its array object is much faster than a traditional Python list.
Types of Array:
1. One Dimensional Array
2. Multi-Dimensional Array
One Dimensional Array:
A one-dimensional array is a type of linear array.
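As a minimal sketch of the one-dimensional case, the following builds a 1-D array from a list and inspects it. The names values and one_d are illustrative and are not taken from the original example.

```python
import numpy as np

# A plain Python list and the NumPy array built from it
values = [1, 2, 3, 4]
one_d = np.array(values)

print(type(values))  # the built-in list type
print(type(one_d))   # the NumPy ndarray type
print(one_d.ndim)    # number of dimensions of the array
print(one_d.shape)   # length along the single axis, as a tuple
```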
print(type(list_1))
print(type(sample_array))
Output:
<class 'list'>
<class 'numpy.ndarray'>
Pandas is most commonly used for data wrangling and data manipulation, whereas NumPy objects
are primarily used to create arrays or matrices that can be fed to deep learning (DL) or machine
learning (ML) models. Pandas creates heterogeneous, two-dimensional data objects; NumPy creates
N-dimensional homogeneous objects.
Multi-Dimensional Array:
Data in multidimensional arrays are stored in tabular form.
Output:
Numpy multi dimensional array in python
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
Reference : https://www.geeksforgeeks.org/basics-of-numpy-arrays/
Anatomy of an array :
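1. Axis: An axis of an array is one of its dimensions; the axis number gives the order of indexing into the array.
Axis 0 = first dimension (rows of a two-dimensional array)
Axis 1 = second dimension (columns of a two-dimensional array)
Axis 2 = third dimension (depth of a three-dimensional array)
Shape: The number of elements along each axis, expressed as a tuple.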
# importing numpy module
import numpy as np
# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# creating numpy array
sample_array = np.array([list_1,
list_2,
list_3])
print("Numpy array :")
print(sample_array)
Rank: The rank of an array is simply the number of axes (or dimensions) it has.
The one-dimensional array has rank 1.
Rank 1
Data type objects (dtype): A data type object (dtype) is an instance of the numpy.dtype class. It
describes how the bytes in the fixed-size block of memory corresponding to an array item
should be interpreted.
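# Import module
import numpy as np
# Creating the arrays (the second array's values are illustrative, not from the source)
sample_array_1 = np.array([[0, 4, 2]])
sample_array_2 = np.array([0.2, 0.4, 2.4])
print("Data type of the array 1 :", sample_array_1.dtype)
print("Data type of array 2 :", sample_array_2.dtype)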
Output:
Data type of the array 1 : int32
Data type of array 2 : float64
numpy.arange(): This is an inbuilt NumPy function that returns evenly spaced values within a
given interval.
Syntax: numpy.arange([start, ]stop, [step, ]dtype=None)
import numpy as np
np.arange(1, 20, 2, dtype=np.float32)
Output:
array([ 1., 3., 5., 7., 9., 11., 13., 15., 17., 19.], dtype=float32)
numpy.empty(): This function creates a new array of the given shape and type without initializing
its entries, so the values it contains are arbitrary.
Syntax: numpy.empty(shape, dtype=float, order='C')
import numpy as np
np.empty([4, 3],
dtype = np.int32,
order = 'f')
Output:
array([[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11],
[ 4, 8, 12]])
numpy.ones(): This function is used to get a new array of given shape and type, filled with
ones(1).
Syntax: numpy.ones(shape, dtype=None, order='C')
import numpy as np
np.ones([4, 3],
dtype = np.int32,
order = 'f')
Output:
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
numpy.zeros(): This function is used to get a new array of given shape and type, filled with
zeros(0).
Syntax: numpy.zeros(shape, dtype=float, order='C')
import numpy as np
np.zeros([4, 3],
dtype = np.int32,
order = 'f')
Output:
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
Array Attributes:
print(np_array.shape)
print(np_array.dtype)
print(np_array.ndim)
print(np_array.size)
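The attributes listed above can be tried on a small array. This is a minimal sketch; the contents of np_array are an assumption, since the source does not show how it was created.

```python
import numpy as np

# Example 3x4 array (values chosen for illustration)
np_array = np.array([[1, 2, 3, 4],
                     [5, 6, 7, 8],
                     [9, 10, 11, 12]])

print(np_array.shape)  # (3, 4): 3 rows along axis 0, 4 columns along axis 1
print(np_array.dtype)  # an integer type; the exact width depends on the platform
print(np_array.ndim)   # 2: the rank (number of axes)
print(np_array.size)   # 12: total number of elements
```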
Numpy Aggregations:
Numpy Aggregation Function:
numpy.sum: Computes the sum of array elements.
numpy.mean: Computes the mean (average) of array elements.
numpy.min and numpy.max: Compute the minimum and maximum values of an array.
numpy.median: Computes the median of array elements.
In pandas, the aggregate() function applies an aggregation across one or more columns. It accepts
a callable, a string, a dict, or a list of strings/callables. The most frequently used aggregations are:
sum: Return the sum of the values for the requested axis.
min: Return the minimum of the values for the requested axis.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
total_sum = np.sum(arr)
# Sum along a specific axis (axis=0 for columns, axis=1 for rows)
column_sums = np.sum(arr, axis=0)
average = np.mean(arr)
print("Mean:", average)
# numpy.min() and numpy.max() compute the minimum and maximum values of an array
min_value = np.min(arr)
max_value = np.max(arr)
median = np.median(arr)
print("Median:", median)
Arithmetic Operations:
array_power = np_array ** 2
Aggregation Functions:
array_sum = np.sum(np_array)
array_mean = np.mean(np_array)
array_min = np.min(np_array)
array_max = np.max(np_array)
transposed_array = reshaped_array.T
Matrix Multiplication:
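As a minimal sketch of matrix multiplication in NumPy, the following contrasts the element-wise product with the matrix product. The matrices A and B are illustrative values, not taken from the source.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Element-wise product: multiplies entries at matching positions
elementwise = A * B

# Matrix product: rows of A combined with columns of B
matrix_product = A @ B
same_product = np.matmul(A, B)  # function form of the @ operator

print(elementwise)
print(matrix_product)
```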
greater() returns element-wise True if the first value is greater than the second.
greater_equal() returns element-wise True if the first value is greater than or equal to the second.
import numpy as np
print("Array a: ", a)
print("Array b: ", b)
Output:
Reference : https://www.geeksforgeeks.org/how-to-compare-two-numpy-arrays/
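The comparison functions described above can be sketched as follows. The values of a and b are assumptions chosen for illustration, since the source does not show them.

```python
import numpy as np

# Two arrays of equal length (illustrative values)
a = np.array([101, 99, 87])
b = np.array([897, 99, 50])

print("Array a: ", a)
print("Array b: ", b)

# Element-wise comparisons return boolean arrays
gt = np.greater(a, b)        # True where a > b
ge = np.greater_equal(a, b)  # True where a >= b

# Whole-array comparison returns a single boolean
same = np.array_equal(a, b)

print("a > b :", gt)
print("a >= b:", ge)
print("a equals b:", same)
```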
Masks in numpy:
A mask is either nomask , indicating that no value of the associated array is invalid, or an array
of booleans that determines for each element of the associated array whether the value is valid
or not.
Masking comes up when you want to extract, modify, count, or otherwise manipulate values in
an array based on some criterion: for example, you might wish to count all values greater than
a certain value, or perhaps remove all outliers that are above some threshold.
A masked array is an array that has invalid or missing entries. By masking arrays, we can easily
handle the missing, invalid, or unwanted entries in an array or
dataset/dataframe.
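A minimal sketch of both uses of masking described above: a boolean mask for counting and filtering, and a masked array from numpy.ma that ignores the masked entries. The data values and the threshold of 10 are assumptions made for illustration.

```python
import numpy as np
import numpy.ma as ma

data = np.array([3, 15, 7, 42, 8, 100])

# Boolean mask: True where the value exceeds the threshold
mask = data > 10

# Count the values above the threshold
count_large = np.count_nonzero(mask)

# Remove the "outliers" by keeping only the unmasked positions
filtered = data[~mask]

# Masked array: entries where mask is True are treated as invalid
masked = ma.masked_array(data, mask=mask)
masked_mean = masked.mean()  # mean over the valid entries only

print(count_large, filtered, masked_mean)
```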
Logical AND:
Syntax:
numpy.logical_and(var1,var2)
Where var1 and var2 are single variables or lists/arrays.
Return type: Boolean value (True or False)
Output:
Operation between two lists: [False True True True True False]
Logical OR:
The NumPy module supports the logical_or operator, which also relates two variables. If both
variables are 0 the output is 0; if both are 1 the output is 1; and if one variable is 0 and the
other is 1 the output is 1.
Syntax:
numpy.logical_or(var1,var2)
Where, var1 and var2 are a single variable or a list/array.
Return type: Boolean value (True or False)
Logical NOT:
The logical_not operation inverts the truth value of its input. If the value is 0, the output is
True (1); if the value is any nonzero number, the output is False (0).
Syntax:
numpy.logical_not(var1)
Where var1 is a single variable or a list/array.
Return type: Boolean value (True or False)
Logical XOR:
The logical_xor performs the xor operation between two variables or lists. In this operation, if
two values are same it returns 0 otherwise 1.
Syntax:
numpy.logical_xor(var1,var2)
Where, var1 and var2 are a single variable or a list/array.
Return type: Boolean value (True or False)
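The four logical functions can be compared side by side on the same inputs. This is a minimal sketch; the arrays x, y, and nums are illustrative and do not come from the source.

```python
import numpy as np

x = np.array([True, False, True, False])
y = np.array([True, True, False, False])

and_result = np.logical_and(x, y)  # True only where both are True
or_result = np.logical_or(x, y)    # True where at least one is True
not_result = np.logical_not(x)     # inverts each truth value
xor_result = np.logical_xor(x, y)  # True where exactly one is True

# Numbers are treated as False when 0 and True when nonzero
nums = np.array([0, 1, 5, -2])
not_nums = np.logical_not(nums)

print(and_result, or_result, not_result, xor_result, not_nums)
```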
Reference : https://www.geeksforgeeks.org/numpy-array-logical-operations/
Logic functions:
greater_equal(x1, x2, /[, out, where, ...]) Return the truth value of (x1 >= x2) element-wise.
less(x1, x2, /[, out, where, casting, ...]) Return the truth value of (x1 < x2) element-wise.
less_equal(x1, x2, /[, out, where, casting, ...]) Return the truth value of (x1 <= x2) element-wise.
equal(x1, x2, /[, out, where, casting, ...]) Return (x1 == x2) element-wise.
not_equal(x1, x2, /[, out, where, casting, ...]) Return (x1 != x2) element-wise.
Reference : https://numpy.org/doc/stable/reference/routines.logic.html
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple
array elements at once.
In NumPy, fancy indexing allows us to use an array of indices to access multiple array
elements at once. Fancy indexing can perform more advanced and efficient array operations,
including conditional filtering, sorting, and so on.
Fancy indexing is a special type of indexing in which elements of an array are selected by an
array of indices. This means we pass the array of indices in brackets.
import numpy as np
array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
select_elements = array1[[1, 2, 5, 7]]
print(select_elements)
Output:
[2 3 6 8]
Reference : https://www.programiz.com/python-programming/numpy/fancy-indexing
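import numpy as np
array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
# simple indexing selects a single element
simple_indexing = array1[3]
# fancy indexing selects several elements at once
fancy_indexing = array1[[1, 2, 5, 7]]
print("Simple Indexing:",simple_indexing) # 4
print("Fancy Indexing:",fancy_indexing) # [2 3 6 8]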
Output:
Simple Indexing: 4
Fancy Indexing: [2 3 6 8]
Each record in the array student has a structure of class Struct. The array of a structure is
referred to as a struct as adding any new fields for a new struct in the array contains the empty
array.
Creating Structured Array in Python NumPy:
We can create a structured array in Python using the NumPy module.
import numpy as np
dtype=dt)
print(a)
Output
[('Sana', 2, 21.0) ('Mansi', 7, 29.0)]
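A minimal sketch of how a structured array like the one printed above might be built, sorted, and queried. The field names name and age come from the later examples; the third field name weight and the exact dtypes are assumptions, since the source does not show the dtype definition.

```python
import numpy as np

# Field layout: 'weight' is a hypothetical name for the third field
dt = np.dtype([('name', 'U10'), ('age', np.int32), ('weight', np.float64)])

a = np.array([('Sana', 2, 21.0), ('Mansi', 7, 29.0)], dtype=dt)
print(a)

# Access a single field across all records
print(a['name'])

# Sort the records by one field
by_name = np.sort(a, order='name')
print('Sorting according to the name', by_name)

# Aggregate over a numeric field
max_age = np.max(a['age'])
min_age = np.min(a['age'])
print('Max age =', max_age)
print('Min age =', min_age)
```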
A structured array can be sorted with the numpy.sort() function by passing the order
parameter, which names the field to sort by.
Example:
import numpy as np
b = np.sort(a, order='name')
print('Sorting according to the name', b)
b = np.sort(a, order='age')
Output
Sorting according to the name [('Mansi', 7, 29.0) ('Sana', 2, 21.0)]
Example:
import numpy as np
max_age = np.max(a['age'])
min_age = np.min(a['age'])
Output
Max age = 7
Min age = 2
Concatenating Structured Array:
We can use the np.concatenate() function to concatenate two structured arrays. Look at the
example below showing the concatenation of two structured arrays.
Example:
import numpy as np
c = np.concatenate((a, b))
print(c)
Output:
[('Sana', 2, 21.) ('Mansi', 7, 29.) ('Ayushi', 5, 30.)]
Example:
import numpy as np
print(reshaped_a)
Output:
[[('Sana', 2, 21.)]
[('Mansi', 7, 29.)]]
1) Grouping Data
NumPy’s structured arrays allow us to group data of different data types and sizes. Each field
in a structured array can contain data of any data type, making it a versatile tool for data
grouping.
2) Tabular Data
Structured arrays can be a great tool when dealing with tabular data. They allow us to store and
manipulate complex data structures with multiple fields, similar to a table or a spreadsheet.
3) Data Analysis
Structured arrays are very useful for data analysis. They provide efficient, flexible data
containers that allow us to perform operations on entire datasets at once.
4) Memory efficiency
Structured arrays are memory-efficient. They allow us to store complex, heterogeneous data in
a compact format, which can be important when working with large datasets.
Many Python libraries, such as Pandas and Scikit-learn, are built on top of NumPy and can
work directly with structured arrays. This makes structured arrays a good choice when you
need to integrate your code with other libraries.
Structured arrays are particularly useful in scenarios involving tabular or structured data. Some
common use cases include:
1) Data Import/Export
When working with structured data from external sources like CSV files or databases, we can
use structured arrays to read, manipulate, and process the data efficiently.
2) Data Analysis
Structured arrays provide a convenient way to perform various data analysis tasks. We can use
them to filter, sort, group, and aggregate data based on different fields, enabling us to gain
insights and extract meaningful information from the data.
In scientific simulations or modeling tasks, structured arrays can be used to represent different
variables or parameters. This allows us to organize and manipulate the data efficiently,
facilitating complex calculations and simulations.
Structured arrays are useful for record-keeping applications or when working with small
databases. They provide an organized and efficient way to store, query, and modify records with
multiple fields.
Reference : https://www.tutorialspoint.com/structured-array-in-numpy
Pandas data manipulation is the process of cleaning, transforming, and aggregating data using
the Pandas library. Pandas provides a variety of functions for performing these tasks, making it a
powerful and versatile tool for data analysis.
In Machine Learning, the model requires a dataset to operate, i.e. to train and test. But data
doesn’t come fully prepared and ready to use. There are discrepancies like Nan/ Null / NA
values in many rows and columns. Sometimes the data set also contains some of the rows and
columns which are not even required in the operation of our model. In such conditions, it
requires proper cleaning and modification of the data set to make it an efficient input for our
model. We achieve that by practicing Data Wrangling before giving data input to the model.
Installing Pandas:
pip install pandas
Creating DataFrame:
Output:
Name Age Student
0 Abhijit 20 False
1 Smriti 19 True
2 Akash 20 True
3 Roshni 14 False
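The table above could be produced by code along these lines. This is a sketch reconstructed from the printed output; the variable name student_register is taken from the later examples, and building the frame from a dictionary is an assumption.

```python
import pandas as pd

# Column data copied from the output table above
student_register = pd.DataFrame({
    'Name': ['Abhijit', 'Smriti', 'Akash', 'Roshni'],
    'Age': [20, 19, 20, 14],
    'Student': [False, True, True, False],
})

print(student_register)
```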
Adding data in DataFrame using Append Function:
Output:
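Note that DataFrame.append() was removed in pandas 2.0, so on current versions rows are added with pd.concat() instead. The following is a minimal sketch under that assumption; the new row's values are hypothetical.

```python
import pandas as pd

student_register = pd.DataFrame({
    'Name': ['Abhijit', 'Smriti'],
    'Age': [20, 19],
    'Student': [False, True],
})

# The row to add, as a one-row DataFrame (hypothetical values)
new_rows = pd.DataFrame({'Name': ['Mansi'], 'Age': [21], 'Student': [True]})

# ignore_index=True renumbers the rows 0..n-1 after concatenation
student_register = pd.concat([student_register, new_rows], ignore_index=True)

print(student_register)
```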
Three members of the DataFrame help here: the .shape attribute gives the dimensions of the
table, the .info() method summarizes its rows and columns, and the .corr() method computes the
correlation between numerical columns.
# dimension of the dataframe
print('Shape: ')
print(student_register.shape)
print('--------------------------------------')
# showing info about the data
print('Info: ')
print(student_register.info())
print('--------------------------------------')
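# correlation between numeric columns
# (numeric_only=True skips the text column, which recent pandas versions would otherwise reject)
print('Correlation: ')
print(student_register.corr(numeric_only=True))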
Output:
Shape:
(4, 3)
--------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 4 non-null object
1 Age 4 non-null int64
2 Student 4 non-null bool
dtypes: bool(1), int64(1), object(1)
memory usage: 196.0+ bytes
None
--------------------------------------
Correlation:
Age Student
Age 1.000000 0.502519
Student 0.502519 1.000000
Getting Statistical Analysis of Data:
Before processing and wrangling any data, you need an overall picture of it, which includes
statistics such as the standard deviation (std), the mean, and its quartile distribution.
# for showing the statistical
# info of the dataframe
print('Describe')
print(student_register.describe())
Output:
Describe
Age
count 4.000000
mean 18.250000
std 2.872281
min 14.000000
25% 17.750000
50% 19.500000
75% 20.000000
max 20.000000
print(students.head())
Output:
Name Student
0 Abhijit False
1 Smriti True
2 Akash True
3 Roshni False
To drop a row from the data, use the drop() method of the pandas DataFrame with
axis = 0 for rows.
print(students.head())
Output:
Name Student
0 Abhijit False
1 Smriti True
3 Roshni False
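The two outputs above are consistent with first dropping the Age column and then dropping the row labelled 2. A minimal sketch of both steps follows; it assumes the same data as the earlier table, and the exact calls used in the source are not shown.

```python
import pandas as pd

student_register = pd.DataFrame({
    'Name': ['Abhijit', 'Smriti', 'Akash', 'Roshni'],
    'Age': [20, 19, 20, 14],
    'Student': [False, True, True, False],
})

# Drop a column: axis=1 refers to columns
students = student_register.drop('Age', axis=1)
print(students.head())

# Drop a row by its index label: axis=0 refers to rows
students = students.drop(2, axis=0)
print(students.head())
```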
Reference : https://www.geeksforgeeks.org/data-manipulattion-in-python-using-pandas/
Indexing and Selecting Data with Pandas:
Indexing in Pandas :
Indexing in pandas means simply selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the
rows and all of the columns, or some of each of the rows and columns. Indexing can also be
known as Subset Selection.
Pandas Indexing using [ ], .loc[ ], .iloc[ ], .ix[ ]
There are many ways to pull elements, rows, and columns from a DataFrame. The indexing
methods in Pandas look very similar but behave very differently. Pandas supports four types of
multi-axis indexing:
Dataframe.[ ] : Also known as the indexing operator.
Dataframe.loc[ ] : Used for label-based selection.
Dataframe.iloc[ ] : Used for position-based (integer) selection.
Dataframe.ix[ ] : Used for both label-based and integer-based selection (removed in pandas 1.0).
Collectively, they are called the indexers. They are by far the most common ways to index
data and to get elements, rows, and columns from a
DataFrame.
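The difference between the indexers that are still available can be sketched on a small frame. The data below is hypothetical and stands in for the nba.csv file used in the examples; only the column names Team, Number, and Position are borrowed from them.

```python
import pandas as pd

# Small stand-in table indexed by player name (hypothetical values)
data = pd.DataFrame(
    {'Team': ['Red', 'Blue', 'Green'],
     'Number': [7, 11, 23],
     'Position': ['PG', 'SF', 'C']},
    index=['Asha', 'Ben', 'Chen'])

# Indexing operator [ ]: select one column or a list of columns
teams = data['Team']
subset = data[['Team', 'Number']]

# .loc[ ]: select by row and column labels
row_by_label = data.loc['Ben']
cell = data.loc['Ben', 'Number']

# .iloc[ ]: select by integer position
row_by_position = data.iloc[1]
block = data.iloc[0:2, 0:2]

print(teams, subset, row_by_label, cell, row_by_position, block, sep='\n')
```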
Output:
Selecting multiple columns:
# importing pandas package
import pandas as pd
first
Output:
Indexing a DataFrame using .loc[ ] :
This function selects data by the label of the rows and columns. The df.loc indexer selects data
in a different way than just the indexing operator. It can select subsets of rows or columns. It
can also simultaneously select subsets of rows and columns.
import pandas as pd
Output:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving multiple rows by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
print(first)
Output:
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving all rows and some columns by loc method
first = data.loc[:, ["Team", "Number", "Position"]]
print(first)
Output:
Indexing a DataFrame using .iloc[ ] :
This function allows us to retrieve rows and columns by position. In order to do that, we’ll
need to specify the positions of the rows that we want, and the positions of the columns that we
want as well. The df.iloc indexer is very similar to df.loc but only uses integer locations to
make its selections.
Selecting a single row:
import pandas as pd
print(row2)
Output:
Selecting multiple rows:
Indexing a DataFrame using .ix[ ] :
The .ix indexer could select both by label and by integer location. While it was
versatile, it caused a lot of confusion because it was not explicit. Integers can also be
labels for rows or columns, so there were cases where the selection was ambiguous. Generally, .ix
was label based and acted just like the .loc indexer. However, .ix also supported integer
selections (as in .iloc) when passed an integer. This only worked where the index of the DataFrame
was not integer based. .ix accepted any of the inputs of .loc and .iloc.
Note: The .ix indexer was deprecated in pandas 0.20.0 and removed in pandas 1.0, so the examples
below run only on older versions.
import pandas as pd
print(first)
Output:
import pandas as pd
first = data.ix[1]
print(first)
Output:
Methods for indexing in DataFrame:
Function Description
DataFrame.get() Get item from object for given key (DataFrame column, Panel slice, etc.).
Reference : https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/
Another Reference :
https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html
SciPy:
SciPy is a library of numerical routines for the Python programming language that provides
It is a Python library useful for solving many mathematical equations and algorithms. It is
built on top of the NumPy library and extends it with scientific and mathematical routines
such as matrix rank, matrix inverse, polynomial equations, and LU decomposition.
It is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific
Python. It provides more utility functions for optimization, stats and signal processing.
It is a library used by scientists, analysts, and engineers doing scientific computing and
technical computing. It contains modules for optimization, linear algebra, integration,
interpolation, special functions, FFT, signal and image processing, ODE solvers, and other
tasks common in science and engineering.
It is an open-source library built on top of the foundational library NumPy (Numerical Python). It
extends the capabilities of NumPy by adding a vast collection of high-level functions and routines
that are essential for scientific computing, data analysis, and engineering.
It covers a broad spectrum of domains including linear algebra, optimization, signal processing,
statistics, integration, interpolation, and more.
1. Linear Algebra
The `scipy.linalg` module provides functions for performing linear algebra operations, such as
solving linear systems, computing eigenvalues and eigenvectors, and matrix factorizations. These
operations are fundamental to various scientific and engineering applications, including data
2. Optimization
The `scipy.optimize` module offers a range of optimization algorithms for finding the minimum
or maximum of functions. These algorithms are crucial for parameter estimation, model fitting,
and solving optimization problems across different fields. From simple gradient-based methods to
filtering, convolution, image manipulation, and feature extraction. These tools are vital for
4. Statistics
The `scipy.stats` module provides a comprehensive suite of statistical functions for probability
distributions, hypothesis testing, descriptive statistics, and more. Researchers and data analysts
can leverage these tools to gain insights from data and make informed decisions.
5. Integration and Interpolation
The `scipy.integrate` module offers methods for numerical integration, while the
`scipy.interpolate` module provides tools for interpolating data.
6. Special Functions
Scientific and mathematical computations often involve special functions like Bessel functions,
gamma functions, and hypergeometric functions. The `scipy.special` module offers a collection of
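import numpy as np
from scipy import linalg
# Coefficient matrix (example values; the source does not show A)
A = np.array([[2, 1], [1, 3]])
b = np.array([5, 8])
x = linalg.solve(A, b)
print("Solution:", x)
In this example, the `linalg.solve` function from SciPy's `linalg` module is used to solve the
linear system A x = b.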
SimPy:
Example : A clock process that prints the current simulation time at each step
Reference : https://pypi.org/project/simpy/
Pandas is one of the most popular Python libraries for data analysis. It offers highly
optimized performance, with back-end source code written in C or Python.
We can analyze data in Pandas with:
Pandas Series
Pandas DataFrames
Pandas Series
A Series is a one-dimensional (1-D) labeled array defined in pandas that can store any
data type.
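A minimal sketch of the two ways of creating a Series covered in this section: from a list with an explicit index, and from a dictionary whose keys become the index. All values and labels are illustrative assumptions.

```python
import pandas as pd

# From a list, with an explicit index of labels
data = [10, 20, 30]
index = ['a', 'b', 'c']
from_list = pd.Series(data, index=index)

# From a dictionary: keys become the index, values become the data
from_dict = pd.Series({'x': 1.5, 'y': 2.5})

print(from_list)
print(from_dict)

# Elements are accessed by label
print(from_list['b'])
```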
import pandas as pd
a = pd.Series(Data, index=Index)
si = pd.Series(Data, Index)
Output:
Create Pandas Series from Dictionary:
Output:
# Import Library
import pandas as pd
Output:
Convert list of dictionaries to a Pandas DataFrame:
Here, we are taking three dictionaries and with the help of from_dict() we convert them into
Pandas DataFrame.
import pandas as pd
data_c = [
{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
pd.DataFrame.from_dict(data_c, orient='columns')
Output:
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Output:
Convert an Array to Pandas DataFrame:
One constraint must hold when creating a DataFrame from 2D arrays: the
dimensions of the 2D arrays must be the same.
# Program to create DataFrame from 2D array
# Import Library
import pandas as pd
# Define 2d array 1
d1 =[[2, 3, 4], [5, 6, 7]]
# Define 2d array 2
d2 =[[2, 4, 8], [1, 3, 9]]
# Define Data
Data ={'first': d1, 'second': d2}
# Create DataFrame
df2d = pd.DataFrame(Data)
df2d
Output:
Scikit-learn:
It is the most useful and robust library for machine learning in Python. It provides a selection
of efficient tools for machine learning.
Scikit-learn has emerged as a powerful and user-friendly Python library. Its simplicity and
versatility make it a better choice for both beginners and seasoned data scientists to build and
implement machine learning models.
Scikit-learn is an open-source Python library that implements a range of machine learning, pre-
processing, cross-validation, and visualization algorithms using a unified interface. It is an
open-source machine-learning library that provides a plethora of tools for various machine-
learning tasks such as Classification, Regression, Clustering, and many more.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
Output:
Feature names: ['sepal length (cm)','sepal width (cm)',
'petal length (cm)','petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Type of X is:
First 5 rows of X:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]]
Loading external dataset: Now, consider the case when we want to load an external dataset.
For this purpose, we can use the pandas library for easily loading and manipulating datasets.
To install pandas, use the following pip command:
! pip install pandas
In pandas, important data types are:
Series: Series is a one-dimensional labeled array capable of holding any data type.
DataFrame: A DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table, or a dict of Series
objects. It is generally the most commonly used pandas object.
Note: The CSV file used in the example below can be downloaded from here: weather.csv
import pandas as pd
data = pd.read_csv('weather.csv')
# shape of dataset
print("Shape:", data.shape)
# column names
print("\nFeatures:", data.columns)
X = data[data.columns[:-1]]
y = data[data.columns[-1]]
Output:
Shape: (366, 22)
Features: Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
'Temp3pm', 'RainToday', 'RISK_MM', 'RainTomorrow'],
dtype='object')
Feature matrix:
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir \
0 8.0 24.3 0.0 3.4 6.3 NW
1 14.0 26.9 3.6 4.4 9.7 ENE
2 13.7 23.4 3.6 5.8 3.3 NW
3 13.3 15.5 39.8 7.2 9.1 NW
4 7.6 16.1 2.8 5.6 10.6 SSE
WindGustSpeed WindDir9am WindDir3pm WindSpeed9am ... Humidity9am \
0 30.0 SW NW 6.0 ... 68
1 39.0 E W 4.0 ... 80
2 85.0 N NNE 6.0 ... 82
3 54.0 WNW W 30.0 ... 62
4 50.0 SSE ESE 20.0 ... 68
Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am \
0 29 1019.7 1015.0 7 7 14.4
1 36 1012.4 1008.4 5 3 17.5
2 69 1009.5 1007.2 8 7 15.4
3 56 1005.5 1007.0 2 7 13.5
4 49 1018.3 1018.5 7 7 11.1
Temp3pm RainToday RISK_MM
0 23.6 No 3.6
1 25.7 Yes 3.6
2 20.2 Yes 39.8
3 14.1 Yes 2.8
4 15.4 Yes 0.0
[5 rows x 21 columns]
Response vector:
0 Yes
1 Yes
2 Yes
3 Yes
4 No
Name: RainTomorrow, dtype: object
Step 2: Splitting the Dataset
One important aspect of all machine learning models is to determine their accuracy. Now, in
order to determine their accuracy, one can train the model using the given dataset and then
predict the response values for the same dataset using that model and hence, find the accuracy
of the model.
But this method has several flaws in it, like:
The goal is to estimate the likely performance of a model on out-of-sample data.
Maximizing training accuracy rewards overly complex models that won't necessarily
generalize to new data.
Unnecessarily complex models may over-fit the training data.
A better option is to split our data into two parts: the first one for training our machine learning
model, and the second one for testing our model.
To summarize
Split the dataset into two pieces: a training set and a testing set.
Train the model on the training set.
Test the model on the testing set and evaluate how well our model did.
Advantages of train/test split
The model can be tested on data different from the data used for training.
Response values are known for the test dataset; hence predictions can be evaluated.
Testing accuracy is a better estimate than training accuracy of out-of-sample performance.
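The train/test procedure summarized above can be sketched with scikit-learn's train_test_split. The choice of the iris dataset, the 30% test size, the fixed random_state, and the k-nearest-neighbours classifier are all assumptions made for illustration; the source does not specify them.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Feature matrix X and response vector y
X, y = load_iris(return_X_y=True)

# Split the dataset into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Train the model on the training set only
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Test the model on the unseen testing set and evaluate it
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Testing accuracy:", test_accuracy)
```

Because the testing set is never seen during fitting, the printed accuracy estimates out-of-sample performance rather than how well the model memorized its training data.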
Reference : https://www.geeksforgeeks.org/learning-model-building-scikit-learn-python-machine-learning-library/
Scikit-learn:
scikit-learn is a free and open-source machine learning library for the Python programming
language.
Also known as sklearn, it is a Python library for implementing machine learning models and
statistical modelling. Through scikit-learn, we can implement various machine learning models for
regression, classification, clustering, and statistical tools for analyzing these models.
It is an open-source machine learning library that supports supervised and unsupervised learning.
It also provides various tools for model fitting, data preprocessing, model selection, model
evaluation, and many other utilities.
It provides dozens of built-in machine learning algorithms and models, called estimators. Each
estimator can be fitted to some data using its fit method.
# Linear Regression
Example1:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
# fit() returns the fitted estimator: LinearRegression()
reg.coef_
# array([0.5, 0.5])
Example2:
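import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Setup reconstructed from the variable names used below:
# load the diabetes dataset and use only one feature
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets (last 20 samples for testing)
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes_y[:-20], diabetes_y[-20:]
# Train a linear regression model and predict on the testing set
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)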
# The coefficients
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
# Multiclass Classification
from sklearn import svm
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(X, Y)
dec = clf.decision_function([[1]])
dec.shape[1] # 6 classes: 4*3/2 = 6
clf.decision_function_shape = "ovr"
dec = clf.decision_function([[1]])
dec.shape[1] # 4 classes
TensorFlow:
TensorFlow is a free and open-source software library for machine learning and artificial
intelligence. It can be used across a range of tasks but has a particular focus on training and
inference of deep neural networks. It was developed by the Google Brain team for Google's
internal use in research and production.
import tensorflow as tf
mnist = tf.keras.datasets.mnist
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
Reference: https://www.tensorflow.org/
(Click Run quickstart Button to run the program)
https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb#scrollTo=rYb6DrEH0GMv
PyTorch:
PyTorch is a machine learning library based on the Torch library, used for applications such as
computer vision and natural language processing, originally developed by Meta AI and now part
of the Linux Foundation umbrella.
It is a fully featured framework for building deep learning models, which is a type of machine
learning that's commonly used in applications like image recognition and language processing.
Written in Python, it's relatively easy for most machine learning developers to learn and use.
It is recognized as one of the two most popular machine learning libraries alongside
TensorFlow, offering free and open-source software released under the modified BSD license.
Although the Python interface is more polished and the primary focus of development,
PyTorch also has a C++ interface.
Numpy provides an n-dimensional array object, and many functions for manipulating these
arrays. Numpy is a generic framework for scientific computing; it does not know anything about
computation graphs, or deep learning, or gradients. However we can easily use numpy to fit a
third order polynomial to sine function by manually implementing the forward and backward
passes through the network using numpy operations:
learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    # y = a + b x + c x^2 + d x^3
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    # Update weights
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d
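The fragment above omits the data setup, the weight initialization, and the backward pass. A self-contained sketch of the whole procedure might look like this; the 2000 sample points on [-pi, pi], the fixed random seed, and the squared-error loss are assumptions, since the source does not show them.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatable runs

# Training data: sample y = sin(x) on [-pi, pi]
x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)

# Randomly initialize the four polynomial coefficients
a, b, c, d = rng.standard_normal(4)

learning_rate = 1e-6
losses = []
for t in range(2000):
    # Forward pass: y = a + b x + c x^2 + d x^3
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    losses.append(np.square(y_pred - y).sum())

    # Backward pass: gradients of the squared-error loss w.r.t. a, b, c, d
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights by gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d

print(f"Result: y = {a} + {b} x + {c} x^2 + {d} x^3")
print("First loss:", losses[0], "Last loss:", losses[-1])
```

Each gradient follows from the chain rule: the loss depends on y_pred, and y_pred depends linearly on each coefficient through the corresponding power of x.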
PyTorch: Tensors
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations.
For modern deep neural networks, GPUs often provide speedups of 50x or greater, so
unfortunately numpy won’t be enough for modern deep learning.
Also unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations.
To run a PyTorch Tensor on GPU, you simply need to specify the correct device.
Here we use PyTorch Tensors to fit a third order polynomial to sine function. Like the numpy
example above we need to manually implement the forward and backward passes through the
network:
# -*- coding: utf-8 -*-
import torch
import math
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3