Machine Learning Notes
MACHINE LEARNING
MANDATORY COURSE
PRAMOTH SABARI S
COURSE PLAN AND PORTION:
INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Artificial Intelligence:
    Why AI? - Demystifying Intelligent Systems
    What is AI? - Technical Overview
    AI and Business
    Sneak Peek into Popular AI Platforms

History of AI
Modern AI traces back to the Dartmouth Conference in 1956, organized by John McCarthy.

According to Marvin Minsky, AI is the science of making machines do things that would require intelligence if done by men. It is the ability of a machine to mimic the cognitive functions associated with humans.

Why AI
    It is self-learning and adaptive.
    AI finds patterns and becomes knowledgeable on its own.
    It is able to interpret and analyze huge amounts of data in a short duration of time. The data can be images, videos, folders, documents, etc.
    It solves problems for which a formula or procedure does not exist.
Note: Video analysis and natural language analysis are typical examples of such problems.

Types of AI
There are two types of AI:
    Weak AI (Artificial Narrow Intelligence) - e.g., self-driving cars, Google Assistant.
    Strong AI (Artificial General Intelligence) - machines that exhibit human intelligence and can perform any task that humans do.

Aspects of AI
    Planning - decision-making tasks.
    Speech recognition - the ability of a computer to recognize speech and convert it into text.
    Natural Language Processing - understanding human spoken language.
    Robotics - giving intelligent behavior to a robot.
    Expert Systems - give solutions based on the huge amount of data provided to them.
    Machine Learning
    Vision - making machines understand images and videos.

AI Business Opportunity
AI helps to optimize data, analyze the trend for a particular product or solution, and give solutions accordingly. Walmart and Rolls-Royce are some of the companies that use AI technology to prevent losses and increase their profits.

AI is helping them achieve:
    1. Creation of new business lines
    2. Competitive advantages
    3. Cost reduction, increased productivity, process improvements
    4. Personalized communication at scale
    5. AI-centered innovations of products and processes

Robotic Process Automation
Robotic process automation (RPA) is a software technology that makes it easy to build, deploy, and manage software robots that emulate human actions interacting with digital systems and software.
Machine Learning
Machine learning learns patterns from given data and applies them to new data. It includes Supervised Learning, Unsupervised Learning and Reinforcement Learning, and it deals with predictive analytics.

Machine learning learns patterns from past data and builds a model. This model is then used to make predictions for new data; this is how machine learning predicts.

The choice of machine learning algorithm depends on the type of the problem and the type of data we have.

Supervised Learning
Input variables (X) are mapped to output variables. It is used when the data is well labelled.

Types of Supervised Learning
    Classification - predicts discrete values.
    Regression - predicts continuous numeric values.

Classification
Support Vector Machine - a widely used classification technique that helps in deciding the optimal classifier between the classes of data.

Decision Tree - represents the data in the form of a tree containing decision nodes and leaf nodes. The more input data there is, the greater the number of nodes in the decision tree, and the higher the accuracy of the output.

KNN - widely used for classification. It compares the K nearest neighbors among the existing data points with the new data point, where K is defined as the number of neighbors.

Logistic Regression - predicts the probability of occurrence of an event; the predicted value lies between 0 and 1. It uses the logistic function, which is S-shaped, to fit the data and convert the occurrence of the event into a probability.

Regression
Regression learns the relation between the given input data and the output: for newly given data it predicts the output based on the older data. Regression can be of two types based on the number of features:
    Single Linear Regression
    Multiple Linear Regression

Deep Learning or Artificial Neural Networks
Artificial Neural Networks are inspired by biological neurons. They contain three kinds of nodes: input nodes, hidden nodes and output nodes. The connection and combination of many such nodes is called a neural network. This often produces more accurate results than traditional machine learning methods.

A neural network model is created by feeding in different types of input, which are mapped to the correct output through the connections formed across the network.

Types of ANN - Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

Overfitting - happens when a model learns the details as well as the noise in the training data: good performance on the training data but bad performance on the test data.
Underfitting - happens when the model cannot capture the underlying trends in the given training data.

Both overfitting and underfitting give poor predictions; the more common problem is overfitting.

Ensemble method - combines multiple learning models and arrives at one prediction to increase the accuracy of the output.

Timeseries forecasting - predicting future outputs from past input data.

Stages involved in machine learning (a small sketch of stage 5 follows this list):
1. Define the goal
2. Identify the machine learning technique that should be used
3. Data is collected for input
4. Feature engineering
5. The data is divided into two - training data and testing data
6. Choosing the ML method according to the ML technique
7. Model validation
8. Final model - new data
9. Test prediction in real time
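As a small illustration of stage 5, here is a minimal sketch using scikit-learn's train_test_split on made-up toy data:

    from sklearn.model_selection import train_test_split

    # Hypothetical toy data: 10 records with one feature and a binary label
    X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
    y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

    # Divide the data into training data and testing data (80/20 split)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(len(X_train), len(X_test))   # 8 training records, 2 testing records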
Two aspects of error: bias and variance.

Bias - refers to errors made by the model due to erroneous assumptions.

Variance - refers to errors made by the model due to sensitivity to small fluctuations in the training data set.

When the complexity of the model is low - high bias and low variance. With increasing complexity - low bias and high variance.

Unsupervised Learning
We train the algorithm with unlabeled data. It groups the input data based on patterns, similarities and differences.

Types - Clustering and Association Mining.

Clustering - processes the input data and finds the inherent clusters or groups.

AI Architecture
Data / Information Management
    Databases
    Spark
    Hadoop
AI Technologies
    Machine Learning
    NLP
    Robotics
    Image analytics
    Expert systems

AI in Manufacturing
    Digital Twins
    Adaptive Manufacturing
    Predictive Maintenance
    Automated Quality Control
    Demand-Driven Production

AI in Energy
    Smart Exploration
    Better Development
    Faster Production
INTRODUCTION TO DATA SCIENCE

Data Science
Data science is an amalgamation of different scientific methods, algorithms and systems which enable us to gain insights and derive knowledge from data in various forms.

Data science is a cross-functional field, emerging at an intersection of probability and statistics, linear algebra, calculus and other mathematical branches, machine learning, and computer science.

There is very big data that must be analyzed properly for proper usage, and different organizations need to methodically align data to their advantage.

Adoption of data science led to the following benefits in business:
    Cost reduction
    Increase in productivity
    Reduction in time taken to solve problems
    Process improvements
    Competitive advantages

Methodical Alignment of Data
The very beginning of this method is to start by asking a number of questions.

What is Data Science
Data science is primarily a combination of Data & Science. Data Science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.

Components of Data Science
1. Probability and Statistics
2. Linear Algebra
3. Machine Learning
4. Computer Science

Probability and Statistics
Probability is a mathematical subject which enables us to determine or predict how likely it is that an event will happen.

Statistics is another mathematical subject which deals primarily with data. It helps us draw inferences from data by having procedures in place for collecting, classifying and presenting the data in an organized manner.
Linear Algebra
This deals with the theory of systems of linear
equations, matrices, vector spaces and linear
transformations.
Most complex science problems are converted into problems of vectors and matrices and then solved using linear models. Linear algebra works as a computational engine for most data science problems because of its performance advantages over iterative methods.

Machine Learning
Making the machine learn so that it can make its own decisions. Its main paradigms are:
Supervised
Semi-supervised
Unsupervised
Reinforcement

Life Cycle of a Data Science Project
Computer Science
Computer Science provides us with the necessary
programming languages, database management
systems, statistical analysis and machine learning
tools.
To solve a given business problem, many building blocks must be integrated. The major steps are:
1. Writing the core algorithm
2. Algorithm that uses linear algebra
3. Statistical Computations should be done on
the given data
4. All structured, unstructured and semi-structured data should be managed using data management systems.
PYTHON FOR DATA SCIENCE

Python Libraries
Python libraries are collections of pre-written code to perform specific tasks. This eliminates the need to rewrite code from scratch. Below are key Python libraries for machine learning.

NumPy
Numeric Python (NumPy) is a Python library used for working with arrays; it is also known as an array processing package. It is used for numeric and scientific operations and serves as a building block for many other libraries available in Python.

Data Structure in NumPy
The main data structure of NumPy is the ndarray or n-dimensional array: a multidimensional container of elements of the same type.
    dtype - the data type of the elements (for example, an integer type).
    Accessing using index - elements are accessed using square brackets.
    Images as a NumPy matrix - an image can be represented as a matrix of pixel values.
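A minimal sketch of these ideas, using small made-up arrays:

    import numpy as np

    # A 1-D ndarray: every element shares a single dtype
    arr = np.array([10, 20, 30, 40])
    print(arr.dtype)   # the data type of the elements, e.g. int64
    print(arr[0])      # accessing using an index with square brackets -> 10

    # A grayscale image can be viewed as a 2-D NumPy matrix of pixel values
    image = np.array([[0, 128], [255, 64]])
    print(image.shape)   # (2, 2)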
Pandas
Pandas is an open-source library for real-world data analysis in Python. Using Pandas, data can be cleaned, transformed, manipulated, and analyzed. The steps involved in performing data analysis using Pandas are:

o Read the data - reading can be done from multiple formats such as '.csv', '.json', '.xlsx' etc.
o Explore the data
o Perform operations on the data - grouping, sorting, masking, merging, concatenating
o Visualize the data - to get a clear picture of the various relationships among the data: scatter plot, box plot, bar plot, histogram and many more.
o Generate insights - all the above steps together generate insight about the data.

Advantages of Pandas:
1. Has the capability to load huge sizes of data easily
2. Provides us with extremely streamlined forms of data representation
3. Can handle heterogeneous data, has an extensive set of data manipulation features, and makes data flexible and customizable

Pandas Objects
Pandas objects are advanced versions of NumPy structured arrays in which the rows and columns are identified with labels instead of simple integer indices. The basic data structures of Pandas are the Series and the DataFrame.

Series - a one-dimensional labelled array.
Syntax: pd.Series(data, index, dtype)
    data - can be a list, a list of lists or even a dictionary
    index - the index can be explicitly defined for different values if required
    dtype - represents the data type used in the series
o By default, a Series creates an integer index; a custom index can be defined.

DataFrame - a collection of Series where each Series represents a column of a table.
Syntax: pd.DataFrame(data, index, columns)
    data - can contain Series or list-like objects. If data is a dictionary, column order follows insertion order.
    index - the index for the DataFrame that is created. By default it will be RangeIndex(0, 1, 2, …, n) if no explicit index is provided.
    columns - if data contains column labels, it will use the same; else, defaults to RangeIndex(0, 1, 2, …, n).

There are different approaches to create a DataFrame:
o From a single Series object
o From a list of dictionaries
o From a dictionary of Series objects
o From an existing file

The axis Keyword
One of the important parameters used while performing operations on DataFrames is 'axis'. Axis takes two values: 0 and 1.
    axis = 0 represents row-specific operations.
    axis = 1 represents column-specific operations.

Pandas can read a variety of files.
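The following sketch (with made-up values) shows both objects and the axis keyword in action:

    import pandas as pd

    # Series with an explicit custom index and dtype
    s = pd.Series([25, 30, 35], index=["a", "b", "c"], dtype="int64")
    print(s["b"])   # -> 30

    # DataFrame from a dictionary; column order follows insertion order
    df = pd.DataFrame({"name": ["Asha", "Ravi"], "score": [82, 91]})

    print(df.sum(axis=0, numeric_only=True))   # axis=0: operate down the rows (per column)
    print(df.drop("score", axis=1))            # axis=1: operate across columns (drop a column)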
Common methods for exploring a DataFrame:
1. Head and tail - to view the first few or last few rows
2. Describe - used to generate a quick summary of data statistics
3. Info - to know the datatypes and the number of rows containing null values for the respective columns
4. Dropping null values
5. Selecting a subset of the data

'iloc' and 'loc' are two indexing techniques that help us in selecting specific rows and columns.
    iloc - access by integer index. Syntax: df.iloc[rows, columns]
    loc - access a group by custom index

Operations in Pandas
Apply - this method is used to apply a function along an axis of the DataFrame.
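A minimal sketch of these operations on a small made-up DataFrame (standing in for one read with pd.read_csv):

    import pandas as pd

    df = pd.DataFrame({"age": [25, 32, None, 41],
                       "city": ["Chennai", "Pune", "Delhi", None]})

    print(df.head())       # first few rows (df.tail() gives the last few)
    print(df.describe())   # quick summary statistics
    df.info()              # dtypes and non-null counts per column

    clean = df.dropna()                  # drop rows containing null values
    print(clean.iloc[0:2, 0:1])          # iloc: integer-position indexing
    print(clean.loc[clean["age"] > 30])  # loc: label/boolean indexing
    print(clean["age"].apply(lambda a: a * 12))   # apply a function along a column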
Two parameters often used when customizing plots:
    marker - the shape used in specific plots like a scatter plot
    kind - the type of plot

Matplotlib
The graphical representation of data or information using visual elements like graphs, charts and maps is known as data visualization.

Plot - the basic visualization element that helps to visualize the data.
Figure - the top-level container that acts as the window or page on which everything is drawn.
Axes - the area on which data is plotted.
The plot comprises several elements such as title, label, axes, legend, etc.

For plotting using the object-oriented approach, the following methods are followed (a minimal sketch appears below).
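A minimal sketch of the object-oriented approach with made-up data:

    import matplotlib.pyplot as plt

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 6]

    fig, ax = plt.subplots()        # Figure: top-level container; Axes: plotting area
    ax.scatter(x, y, marker="o")    # draw a scatter plot on the axes
    ax.set_title("Sample scatter plot")
    ax.set_xlabel("x values")
    ax.set_ylabel("y values")
    plt.show()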
Types of Plots
1. Box plot - gives a good indication of the distribution of data about the median. Boxplots are a standardized way of displaying the distribution of data based on the five-number summary: "minimum", first quartile (Q1), median, third quartile (Q3), and "maximum".
2. Scatter plot - uses dots and markers to represent values on the axes. It is the simplest plot that can accept both quantitative and qualitative values, with a wide variety of applications in primitive data analysis.
   Syntax: ax.scatter(x, y, marker)
3. Bar chart - a graph with rectangular bars that usually compares different categories.
   Syntax: ax.bar(x, height, width, bottom, align)
4. Histogram - represents data as rectangular bars. It is used for continuous data.
   Syntax: ax.hist(x, bins)
5. Pie chart - divides the entire dataset into distinct groups.
   Syntax: ax.pie(x, labels)
   To enhance the analysis, other parameters can be customized:
       explode - to get an elevated view of the selected pie slice.
       colors - to customize the colors of the plot.
       autopct - to add the percentage of the distribution to the pie chart.
       shadow - to add a shadow to the plot.
       startangle - to change the starting angle of the pie chart.
6. Line chart - drawn by interconnecting all data points using straight line segments in sequential order.
   Line style - Matplotlib supports many line styles like dashed lines, dotted lines, dash-dot, etc.

Text annotations - text can be added to describe the plots.

For saving the plot as an image, plt.savefig("filename.jpg", dpi=200) is used.

Machine Learning using sklearn
Scikit-learn (also referred to as sklearn) is a Python library widely used for machine learning. It is characterized by a clean, uniform and streamlined API.

The objective here is to introduce the usage of the scikit-learn library for the different stages of ML model building. Follow this order (a minimal end-to-end sketch appears below):
    Loading the data
    Preprocessing the data
    Training the model
    Testing the model
    Evaluating model performance
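A minimal sketch of this workflow, using the built-in iris dataset and logistic regression as one possible model choice:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Loading the data
    X, y = load_iris(return_X_y=True)

    # Preprocessing the data
    X = StandardScaler().fit_transform(X)

    # Training the model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    # Testing the model and evaluating performance
    y_pred = model.predict(X_test)
    print(accuracy_score(y_test, y_pred))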
DATA VISUALIZATION USING PYTHON

Data Visualization
Data visualization is the concept of graphical representation of data or information using visual elements like graphs, charts, and maps.

o It helps in finding patterns and connections between variables
o Requires less effort from the reader to understand the visuals
o Condenses a large amount of information into a small space for quick analysis
o Provides relevant answers and clarity on certain questions swiftly

When huge data is given and a pattern is to be found, doing it manually or by any other method is time consuming. Data visualization using plots can reduce the time and complexity of finding the pattern in the given data set.

Data Visualization Stakeholders:
There could potentially be two types of visualizations based on the types of stakeholders involved.
1. For self-consumption during data exploration, feature engineering, etc.
2. For presenting or communicating the insights (from the data) to a target audience, typically decision makers. This sort of visualization is usually performed to prepare the results/reports that may enable the target audience in decision making.

Types of Data Collected
i. Temporal data - data with a time component attached to it.
ii. Geospatial data - data with a physical location as an attribute.
iii. Topical data - data concerned with topics.
iv. Network data - data in the form of nodes and links between nodes.
v. Tree data - data which is basically network data but with some hierarchy in it.

Two types of data
i. Qualitative/Categorical Data
    Binary
    Nominal
    Ordinal
ii. Quantitative/Numerical Data
    Discrete
    Continuous

Different kinds of plots that can be used for data visualization:
    Box plot
    Scatter plot
    Line chart
    Bar graph
    Histogram
    Dist plot
    Pie chart
    Joint plot
    Pair plot
    Heat map
Outliers - outliers are the extreme values present in the dataset. They affect properties of the data like the mean and variance which are used in model building. Hence, they may impact the accuracy of the model.

Quartiles - divide the data points into four equal-sized groups, or quarters.

Inter-quartile Range - also called mid-spread, H-spread, or IQR, it indicates where most of the data is lying. The upper and lower limits are calculated using the standard 1.5 x IQR rule:

    Lower limit = Q1 - 1.5 x IQR
    Upper limit = Q3 + 1.5 x IQR

Any value lying outside these limits is called an outlier.

Some of the plots listed above in more detail:

Pie chart - divides the entire dataset into distinct groups. The chart consists of a circle split into slices, and each slice represents a group.

Dist plot - depicts the variation in a data distribution. It represents the overall distribution of continuous data variables.

Joint plot - a combination of two univariate plots and one bivariate plot. The bivariate plot (in the center) helps in analyzing the relationship between two variables, while the univariate plots describe the distribution of data in each variable as marginal plots.

Pair plot - depicts pairwise relationships between all the variables in a dataset in a matrix format. Each row and column in the matrix represents a variable in the dataset.

Heat map - a graphical representation of data where similar values are depicted by the same colors. The colors vary based on the intensity of the results.

Python libraries for visualization: Matplotlib, Seaborn, Plotly.
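A minimal sketch of the IQR rule on made-up data:

    import numpy as np

    data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 30])   # 30 is the likely outlier

    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # lower limit
    upper = q3 + 1.5 * iqr   # upper limit

    print(data[(data < lower) | (data > upper)])   # values outside the limits -> [30]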
Coefficient of Determination (R2)
For target values yᵢ, predictions ŷᵢ, and mean target value ȳ:

    SSE = Σ(yᵢ - ŷᵢ)²   (sum of squared differences between actual and predicted values)
    SSR = Σ(ŷᵢ - ȳ)²   (sum of squared differences between predicted and mean target values)
    SST = Σ(yᵢ - ȳ)²   (total sum of squares)

Relation between SST, SSR and SSE: SST = SSR + SSE, and R2 = SSR / SST = 1 - SSE / SST.
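A small worked sketch with made-up target values and predictions:

    import numpy as np

    y_true = np.array([3.0, 5.0, 7.0, 9.0])
    y_pred = np.array([2.8, 5.3, 6.9, 9.2])
    y_mean = y_true.mean()

    sse = np.sum((y_true - y_pred) ** 2)   # actual vs predicted
    ssr = np.sum((y_pred - y_mean) ** 2)   # predicted vs mean
    sst = np.sum((y_true - y_mean) ** 2)   # actual vs mean

    # SST = SSR + SSE holds exactly for a least-squares fit;
    # with these made-up predictions it holds only approximately.
    print(sst, ssr + sse)
    print(1 - sse / sst)   # R^2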
MULTIPLE LINEAR REGRESSION
This has multiple predictors and one dependent variable. Steps involved in creating a model:

    Visualizing the dataset
    Building a Multiple Linear Regression model
    Visualizing the Multiple Linear Regression model
    Finding the correlation between the predictors
    Finding the coefficient of determination
    Adjusted R-squared (increasing it makes the model more valid)

Multicollinearity - in a multiple regression model it is possible that one predictor can be linearly predicted from the others with a substantial degree of accuracy. In such a situation, the predictors are said to be highly correlated. In statistics, this phenomenon is called multicollinearity, or in other words collinearity between variables.

Note: The obtained best-fit model is valid only if the predictor variables are linearly independent of each other. They are linearly dependent if the correlation values are close to -1 or 1.

Variance Inflation Factor (VIF) - used to determine whether the predictor variables are independent of each other:
1. VIF = 1 -> no correlation between variables
2. VIF between 1 and 5 -> slightly correlated
3. VIF greater than 5 -> highly correlated

R2 values can be inflated by including more and more predictor variables. The adjusted R2 corrects for this; for n observations and p predictors it is defined as:

    Adjusted R2 = 1 - (1 - R2)(n - 1) / (n - p - 1)
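A minimal sketch using the variance_inflation_factor helper from statsmodels, on made-up predictors where x2 is almost a linear copy of x1:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = 2 * x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
    x3 = rng.normal(size=100)

    X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
    for i in range(1, X.shape[1]):   # skip the constant column
        print(X.columns[i], variance_inflation_factor(X.values, i))
    # Expect very high VIF for x1 and x2, and VIF near 1 for x3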
Classification algorithms covered next:
    Logistic Regression
    Decision Trees
    K-Nearest Neighbors (kNN)
    Support Vector Machines (SVM)

LOGISTIC REGRESSION
Logistic Regression is a supervised machine learning algorithm, primarily used for binary classification. It computes the probability of a sample belonging to each of the classes. The probabilities are computed using the non-linear sigmoid or logistic function:

    sigmoid(z) = 1 / (1 + e^(-z))
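A minimal sketch of the logistic function itself:

    import numpy as np

    def sigmoid(z):
        # maps any real number into the (0, 1) range, i.e. a probability
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0))                # 0.5
    print(sigmoid(-4), sigmoid(4))   # close to 0, close to 1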
DECISION TREES
This is another kind of algorithm for building a model from data, one that can be properly visualized as a graph. A decision is taken on how to split the data at each node of the tree, which is why this algorithm is called a Decision Tree.
A decision tree is a tree-like structure in which:
    the root node and each internal node represent a "test" on an attribute of an instance in the dataset;
    the outcome of each test is represented by the corresponding branches;
    a node that does not branch further is called a leaf node and represents a class label.

K-NEAREST NEIGHBORS
The k-Nearest Neighbors (k-NN) algorithm determines the target value of a new data point or instance by comparing it with the existing data points or instances that are closest to it.

Euclidean Distance
The Euclidean distance between two tuples X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n) can be computed as:

    d(X1, X2) = sqrt((x11 - x21)² + (x12 - x22)² + ... + (x1n - x2n)²)
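A minimal sketch of the distance formula and of k-NN classification on made-up points (using scikit-learn's KNeighborsClassifier):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Euclidean distance between two tuples
    x1 = np.array([1.0, 2.0, 3.0])
    x2 = np.array([4.0, 6.0, 3.0])
    print(np.sqrt(np.sum((x1 - x2) ** 2)))   # -> 5.0

    # k-NN: a new point takes the majority label of its k closest neighbours
    X = [[1, 1], [1, 2], [5, 5], [6, 5]]
    y = [0, 0, 1, 1]
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X, y)
    print(knn.predict([[5, 6]]))   # -> [1]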
The dataset has a small number of classes (preferably 2) in the target variable.
The dataset is high dimensional.
The dataset is balanced.
Hyperparameters are model properties which guide the training process, i.e., they cannot be learnt from the training data. While building machine learning models, situations such as overfitting or underfitting are encountered; these can be controlled by tuning the hyperparameters of the model.
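A minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV, using k-NN's n_neighbors as the hyperparameter and the built-in iris data:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # n_neighbors is set before training, not learnt from the training data
    grid = GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)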
Ensemble methods - these are techniques that aim at improving the prediction accuracy of models by creating and combining multiple models instead of using a single model. Two commonly used ensemble methods:

o Bagging - multiple models are trained using the same algorithm on different subsets of the training data. Once multiple models are trained in this manner, they are aggregated using majority voting or simple aggregation methods such as averaging.
o Boosting - another ensemble learning technique where the models are built sequentially. Each new model is built by taking into account the mistakes made by the previous model in predicting the target value.
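A minimal sketch of both methods with scikit-learn, using decision trees as the base models and the built-in iris data:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Bagging: one algorithm, different subsets of the training data,
    # predictions aggregated by voting
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    print(cross_val_score(bagging, X, y, cv=5).mean())

    # Boosting: models built sequentially, each focusing on the
    # mistakes of the previous one
    boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
    print(cross_val_score(boosting, X, y, cv=5).mean())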
Clustering Analysis
Unsupervised learning deals with historical data which contains no labels. The aim of clustering is to group similar records together and to make sure that the members of different groups are significantly different from each other.
Clustering can be performed using several
algorithms and one of the widely used clustering
algorithms is the K-means algorithm.
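A minimal sketch of K-means on made-up unlabeled points:

    import numpy as np
    from sklearn.cluster import KMeans

    # Two visibly separated groups of unlabeled records
    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    print(kmeans.fit_predict(X))      # cluster assignment for each record
    print(kmeans.cluster_centers_)    # centre of each discovered group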