Machine Learning Notes

This document provides an overview of machine learning and artificial intelligence concepts. It discusses the history and types of AI, including weak and strong AI. Machine learning is described as learning from patterns in data to make predictions. Supervised learning techniques like classification, regression, and decision trees are covered. Deep learning using artificial neural networks is also summarized. The stages of a machine learning project are outlined as defining the goal, selecting a technique, collecting data, model validation, and testing predictions.

Uploaded by

Kaushik S

2022

MACHINE LEARNING
MANDATORY COURSE
PRAMOTH SABARI S
COURSE PLAN AND PORTION:

INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Artificial Intelligence:
Why AI? - Demystifying Intelligent Systems
What is AI? - Technical Overview
AI and Business
Sneak Peek into Popular AI Platforms

Why AI
• It is self-learning and adaptive.
• AI mines patterns and becomes knowledgeable on its own.
• It can interpret and analyze a huge amount of data in a short time.
• The data can be images, videos, folders, documents, etc.
• It solves problems for which a formula or procedure does not exist.
Note: Video analysis, natural language analysis

AI Business Opportunity
• It helps to optimize data, analyze the trend for a particular product or solution, and suggest a solution accordingly.
• Walmart and Rolls-Royce are some of the companies that use AI technology to prevent losses and increase profit.
• AI is helping them achieve:
1. Creation of new business lines
2. Competitive advantages
3. Cost reduction, increased productivity, process improvements
4. Personalized communication at scale
5. AI-centered innovations of products and processes

History of AI
The Dartmouth Conference in 1956 was organized by John McCarthy.
According to Marvin Minsky, AI is the science of making machines do things that would require intelligence if done by men.
It is the ability of a machine to mimic the cognitive functions associated with humans.

Types of AI
Two types of AI
• Weak AI (Artificial Narrow Intelligence)
  • Self-driving cars, Google Assistant
• Strong AI (Artificial General Intelligence)
  • Machines that exhibit human intelligence and can perform any task that humans do.

Aspects of AI
• Planning – Decision-making tasks
• Speech recognition – The ability of a computer to recognize speech and convert it into text
• Natural Language Processing – Understanding human spoken language
• Robotics – Giving intelligent behavior to robots
• Expert Systems – Give solutions based on the large amount of data provided to them
• Machine Learning
• Vision – Making machines understand images and videos

Robotic Process Automation
Robotic process automation (RPA) is a software technology that makes it easy to build, deploy, and manage software robots that emulate human actions interacting with digital systems and software.

Machine Learning
It learns patterns from the given data and applies them to new data. Machine Learning includes Supervised Learning, Unsupervised Learning and Reinforcement Learning.
It deals with predictive analytics.
Machine learning learns the patterns from past data and builds a model. This model is used to make predictions for new data. This is the way machine learning predicts.
The choice of machine learning algorithm depends on the type of problem and the type of data we have.

Supervised learning
Input variables (X) are mapped to an output variable.
It is used when the data is well labelled.

Types of Supervised learning
• Classification – Predicts discrete values
• Regression – Predicts continuous numeric values

Classification
Support Vector Machine
A widely used classification technique that helps in deciding the optimal classifier between the classes of data.
Decision Tree – It represents the data in the form of a tree containing decision nodes and leaf nodes.
The more input data there is, the greater the number of nodes in the decision tree. More training data generally improves accuracy, although a very large tree can also overfit.
KNN – It is widely used for classification. It compares the new data point with its K nearest neighbors among the existing data points.
K is the number of neighbors.
Logistic Regression – Predicts the probability of occurrence of an event; the predicted value will be between 0 and 1.
It uses the logistic function, which is S-shaped. The function is fitted to the data and converts the occurrence of the event into a probability.

Regression
It learns the relation between the given input data and the output. Given a new input, the fitted model predicts the output based on the relation learned from the older data.
Regression can be of two types based on the number of features:
• Simple Linear Regression
• Multiple Linear Regression

Deep Learning or Artificial Neural Networks
Artificial Neural Networks are inspired by biological neurons.
They contain three kinds of nodes: input nodes, hidden nodes and output nodes. The connection and combination of many such nodes is called a neural network.
They can produce more accurate results than traditional machine learning methods, particularly on large datasets.
A neural network model is created by feeding in different types of input, which are mapped to the correct output through many connected layers of nodes.
Types of ANN – Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
Overfitting – It happens when a model learns the details as well as the noise in the training data. The result is good performance on training data but bad performance on test data.
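The overfitting behaviour described above can be illustrated with a small sketch. The data here is made up (a straight line plus random noise), and polynomial fitting with NumPy stands in for a generic model. It is only an assumption-laden illustration, not part of the original course material. A degree-9 polynomial has enough freedom to pass through all ten training points, so its training error is almost zero, while a degree-1 fit follows the real trend.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: a straight line (y = 2x) plus random noise.
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2.0 * x_test + rng.normal(scale=0.2, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


simple_fit = np.polyfit(x_train, y_train, deg=1)   # follows the underlying trend
complex_fit = np.polyfit(x_train, y_train, deg=9)  # flexible enough to memorize the noise

for name, fit in [("degree 1", simple_fit), ("degree 9", complex_fit)]:
    print(name,
          "train MSE:", mse(fit, x_train, y_train),
          "test MSE:", mse(fit, x_test, y_test))
```

The degree-9 model always has the lower training error here because it contains the degree-1 model as a special case. On unseen data an overfitted model typically does worse, which is the pattern the notes describe as good training performance with bad test performance.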

Underfitting – It happens when the model cannot capture the underlying trends in the given training data.
Both of these give poor predictions. Overfitting is the more common problem.
Ensemble method – It combines multiple learning models and arrives at one prediction to increase the accuracy of the output.

Two aspects of error: Bias and variance
Bias – Refers to errors made by the model due to erroneous assumptions.
Variance – Refers to errors made by the model due to sensitivity to small fluctuations in the training data set.
Low model complexity – High bias and low variance
Increasing complexity – Low bias and high variance

Unsupervised Learning
We train the algorithm with unlabeled data. It groups the input data based on patterns, similarities and differences.
Types – Clustering and Association mining
Clustering – Processes the input data and finds the inherent clusters or groups.
Association mining – Depends on association rules that help discover relationships in a large data set.

Reinforcement Learning
The algorithm is not explicitly told how to perform the task. It learns by experiencing reward and pain.
It learns by itself to maximize the rewards and minimize the penalty. It is used when training data is not available, and learning is done by interacting with the environment.

Timeseries forecasting – Predicting future outputs from past input data.

Stages involved in machine learning
1. Define the goal
2. Identify the machine learning technique that should be used
3. Collect the input data
4. Feature engineering
5. Divide the data into two parts: training data and testing data
6. Choose the ML method according to the ML technique
7. Model validation
8. Final model – applied to new data
9. Test predictions in real time

AI Architecture
Data
Information Management
• Databases
• Spark
• Hadoop
AI Technologies
• Machine Learning
• NLP
• Robotics
• Image analytics
• Expert systems
Models
Action
FNOL – First notice of loss.

AI in Business
Managing data well can help a business improve its profit and value. The use cases are taken from retail, health care, banking, manufacturing and energy.

AI in Retail
• Conversational Commerce
• Contextual Commerce
• Actionable analytics – getting access to relevant data in the correct context
• Predictive Marketing – Marketing that uses big data to develop accurate forecasts of future customer behavior
• Guided Sales – understanding needs and suggesting the best match

AI in Healthcare
• Automated Image Diagnosis
• Robot Assisted Surgery
• Virtual Nursing Assistants
• Fraud Detection
• Preliminary Diagnosis
• Administrative Workflow Assistant

AI in Banking
• Risk assessment
• Investment Management
• Trading
• Credit Approval Process
• Customer Support
• Regulations and Compliance

AI in Manufacturing
• Digital Twins
• Adaptive Manufacturing
• Predictive Maintenance
• Automated Quality Control
• Demand Driven Production

AI in Energy
• Smart Exploration
• Better Development
• Faster production

Some of the programming languages, tools and frameworks popularly used are listed below:
• Programming Languages: Python, Java, R, Scala, SQL
• Big Data Tools: Spark, HBase, Kafka, HDFS, Hive, Hadoop, MapReduce, Pig
• Machine Learning Frameworks: TensorFlow, PyTorch, Keras, Theano, Caffe, MXNet
• Cloud Platforms: AWS, Azure, GCP, IBM Cloud
• Computer Vision: Keras, OpenCV, PyTorch
• Natural Language Processing: NLTK, Gensim, SpaCy, Keras, Stanford CoreNLP, TextBlob
• ML Operations (MLOps): Flask, MLflow, Apache Airflow, Kubeflow, Seldon
• AI Governance: What-if Tool, LIME, DeepLIFT, Skater, Shapley, AIX360
INTRODUCTION TO DATA SCIENCE
Data Science
Data science is an amalgamation of different scientific methods, algorithms and systems which enable us to gain insights and derive knowledge from data in various forms.
Data science is a cross-functional field, emerging at an intersection of probability and statistics, linear algebra, calculus and other mathematical branches, machine learning, and computer science.
Very large volumes of data must be analyzed properly to be put to proper use. Organizations need to methodically align data to their advantage.
Adoption of Data Science has led to the following benefits in business:
• Cost reduction
• Increase in productivity
• Reduction in time taken to solve problems
• Process improvements
• Competitive advantages

Methodical Alignment of data
This method begins by asking a number of questions.

What is Data Science
Data science is primarily a combination of Data & Science.
Data Science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.

Components of Data Science
1. Probability and Statistics
2. Linear Algebra
3. Machine Learning
4. Computer Science

Probability and Statistics
Probability is a mathematical subject which enables us to determine or predict how likely it is that an event will happen.
Statistics is another mathematical subject which deals primarily with data. It helps us draw inferences from data by having procedures in place for collecting, classifying and presenting the data in an organized manner.

Linear Algebra
This deals with the theory of systems of linear equations, matrices, vector spaces and linear transformations.
Most complex science problems are converted into problems of vectors and matrices and then solved using linear models. Linear Algebra works as a computational engine for most data science problems because of its performance advantages over iterative methods.

Machine learning
Making the machine learn and make its own decisions.
• Supervised
• Semi-supervised
• Unsupervised
• Reinforcement

Characteristics of a successful data science project
1. A clear, verifiable and quantifiable goal
2. Realistic expectations set for all stakeholders
3. An unbiased data set covering examples that are sufficient to represent the entire data
4. The right model based on the business scenario
5. A correctly represented and deployed model in order to meet the predefined goal

Computer Science
Computer Science provides us with the necessary programming languages, database management systems, statistical analysis and machine learning tools.
To solve a given business problem, many blocks should be integrated. The major steps are:
1. Writing the core algorithm
2. Implementing the algorithm using linear algebra
3. Performing statistical computations on the given data
4. Managing all structured, unstructured and semi-structured data using data management systems

Life Cycle of a Data Science Project

PYTHON FOR DATA SCIENCE
Python Libraries
Python libraries are collections of pre-written code that perform specific tasks. This eliminates the need to rewrite the code from scratch.
Python libraries for machine learning:
• NumPy – Scientific Computation
• Pandas – Data Structure and Data Analysis
• Matplotlib – Plotting and Visualization
• Scikit-Learn – Machine Learning Tools

NumPy
It is a Python library used for working with arrays. It is also known as an array processing package.
Numeric Python (NumPy) is a Python library that is used for numeric and scientific operations. It serves as a building block for many libraries available in Python.

Data Structure in NumPy
The main data structure of NumPy is the ndarray or n-dimensional array, a multidimensional container of elements of the same type.

Advantage of NumPy
• NumPy executes operations on whole arrays in optimized, vectorized form, so programs run faster than equivalent Python loops, especially as the array size increases.
• It has many optimized built-in mathematical functions.
• It has a multidimensional array data structure that can represent vectors and matrices.

NumPy object creation
Syntax: numpy.array(object, dtype)
object – a Python object (for example, a list)
dtype – the data type of the array elements (for example, an integer)

Accessing using index
Elements are accessed using square brackets.

Images as a NumPy Matrix
Images are stored as arrays of hundreds, thousands or even millions of picture elements called pixels.
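As a minimal sketch of array creation and indexing, the example below builds arrays from Python lists and reads elements with square brackets. The values, including the tiny 2 x 2 "image", are made up for illustration.

```python
import numpy as np

# Create an ndarray from a Python list, stating the element type explicitly.
a = np.array([1, 2, 3, 4], dtype=np.int64)
print(a.dtype)       # int64
print(a[0], a[-1])   # first and last element: 1 4

# A two-dimensional array (matrix) built from a list of lists.
m = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
print(m.shape)       # (2, 3): two rows, three columns
print(m[1, 2])       # row 1, column 2 -> 6.0
first_column = m[:, 0]

# A hypothetical 2 x 2 grayscale image: each element is one pixel (0 = black, 255 = white).
image = np.array([[0, 255], [128, 64]], dtype=np.uint8)
print(image.shape)   # (2, 2)
brightest = image.max()
```

Indexing is zero-based, so `m[1, 2]` refers to the second row and third column.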

Pandas
Pandas is an open-source library for real-world data analysis in Python. Using Pandas, data can be cleaned, transformed, manipulated, and analyzed.
The steps involved in performing data analysis using Pandas are:
o Read the data – Data can be read in multiple formats such as '.csv', '.json', '.xlsx', etc.
o Explore the data
o Perform operations on the data – Grouping, sorting, masking, merging, concatenating
o Visualize the data – to get a clear picture of the various relationships in the data, using scatter plots, box plots, bar plots, histograms and many more
o Generate insights – All the above steps generate insight about the data.

Advantage of Pandas:
1. Has the capability to load huge amounts of data easily
2. Provides us with extremely streamlined forms of data representation
3. Can handle heterogeneous data, has an extensive set of data manipulation features, and makes data flexible and customizable

Pandas Objects
Pandas objects are advanced versions of NumPy structured arrays in which the rows and columns are identified with labels instead of simple integer indices. The basic data structures of Pandas are Series and DataFrame.

Series – a one-dimensional labelled array
Syntax: pandas.Series(data, index, dtype)
data – it can be a list, a list of lists or even a dictionary
index – the index can be explicitly defined for different values if required
dtype – the data type used in the series
o By default, a Series creates an integer index. A custom index can also be defined.

DataFrame – a collection of Series where each Series represents a column of a table
Syntax: pandas.DataFrame(data, index, columns)
data – data can contain Series or list-like objects. If data is a dictionary, column order follows insertion order.
index – the index for the DataFrame that is created. By default, it will be RangeIndex(0, 1, 2, …, n) if no explicit index is provided.
columns – if data contains column labels, the same labels are used. Otherwise they default to RangeIndex(0, 1, 2, …, n).

There are different approaches to creating a DataFrame:
o From a single Series object
o From a list of dictionaries
o From a dictionary of Series objects
o From an existing file

The axis Keyword
One of the important parameters used while performing operations on DataFrames is 'axis'. Axis takes two values: 0 and 1.
• axis = 0 refers to the rows (the index). For example, dropping with axis = 0 removes a row, and an aggregation with axis = 0 runs down the rows and gives one result per column.
• axis = 1 refers to the columns. For example, dropping with axis = 1 removes a column, and an aggregation with axis = 1 runs across the columns and gives one result per row.

Pandas can read a variety of file formats.
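The Series and DataFrame constructors and the axis keyword can be sketched as follows. The student names and marks are invented for illustration only.

```python
import pandas as pd

# A Series with a custom (label) index instead of the default integer index.
marks = pd.Series([72, 45, 88], index=["asha", "ravi", "meena"], dtype="int64")
print(marks["ravi"])  # access by label

# A DataFrame built from a dictionary of lists: each key becomes a column.
df = pd.DataFrame(
    {"maths": [72, 45, 88], "physics": [65, 30, 91]},
    index=["asha", "ravi", "meena"],
)

# axis=0 runs down the rows: one total per column.
col_totals = df.sum(axis=0)
# axis=1 runs across the columns: one total per row.
row_totals = df.sum(axis=1)

# With drop, axis=0 removes a row label and axis=1 removes a column label.
without_ravi = df.drop("ravi", axis=0)
without_physics = df.drop("physics", axis=1)

print(col_totals)
print(row_totals)
```

Both `sum` and `drop` return new objects, so `df` itself is left unchanged.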

1. Head and tail – to view the first few or last few rows
2. Describe – used to generate a quick summary of data statistics
3. Info – to know the data types and the number of non-null values in the respective columns
4. Dropping null values
5. Selecting a subset of the data

'iloc' and 'loc' are two indexing techniques that help in selecting specific rows and columns.
iloc – access by integer index
• Syntax – DataFrame.iloc[rows, columns]
loc – access a group of rows and columns by custom index (labels)

Operations in Pandas
Masking – The masking operation replaces values where the condition is True.
• Syntax – DataFrame.mask(cond, other=nan, inplace=False, axis=None)
• f = marks_df < 33
• marks_df.mask(f, 'Fail')
Index Preservation – Pandas preserves the index and column labels in the output. For binary operations such as addition and multiplication, Pandas automatically aligns indices when passing the objects to the functions.
Broadcasting refers to a set of rules for operating between data of different sizes and shapes.
Apply – This method is used to apply a function along an axis of the DataFrame.

Pandas Plot
It can visualize the data in the form of plots.
• Syntax – DataFrame.plot(x, y, marker, kind)
x = value on the x axis
y = value on the y axis
marker = marker shape, for specific plots such as a scatter plot
kind = type of plot

Matplotlib
The graphical representation of data or information using visual elements like graphs, charts and maps is known as data visualization.
Plot – The plot is the basic visualization element that helps to visualize the data.
• Figure – It is the top-level container that acts as the window or page on which everything is drawn.
• Axes – The axes are the area on which data is plotted.
• The plot comprises several elements such as title, label, axes, legend, etc.

Matplotlib is a Python library for data visualization. matplotlib.pyplot is used for two-dimensional graphics in Python programming.
There are two approaches to plotting in Matplotlib:
1. The MATLAB way of plotting using matplotlib.pyplot. It is simple to use.
2. The object-oriented way of plotting, for more control and customization.
The library is imported by import matplotlib.pyplot as plt.
Syntax – plt.plot(x, y)

Plotting using the object-oriented approach follows these steps:
• Creating a figure
• Setting up the axes
• Creating a plot using the axes object
• Creating multiple plots using the same axes object
• Setting up the title, label, and legend for a plot
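The Pandas indexing and operations described above (iloc, loc, masking, broadcasting and apply) can be sketched on a small, made-up marks table. The name marks_df follows the masking example in the notes. The data and the other variable names are assumptions for illustration.

```python
import pandas as pd

marks_df = pd.DataFrame(
    {"maths": [72, 25, 88], "physics": [65, 30, 91]},
    index=["asha", "ravi", "meena"],
)

# iloc selects by integer position, loc selects by label.
first_row = marks_df.iloc[0]                # row at position 0
ravi_maths = marks_df.loc["ravi", "maths"]  # row label "ravi", column label "maths"
subset = marks_df.iloc[0:2, 0:1]            # first two rows, first column

# Masking: replace every value where the condition is True.
f = marks_df < 33
masked = marks_df.mask(f, "Fail")

# Broadcasting: the scalar 5 is applied to every element.
with_bonus = marks_df + 5

# Apply: run a function along an axis (axis=0 gives one result per column).
spread = marks_df.apply(lambda col: col.max() - col.min(), axis=0)

print(masked)
print(spread)
```

Because `mask` is called with the default `inplace=False`, it returns a new DataFrame and `marks_df` keeps its numeric values.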

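The object-oriented plotting steps listed above can be sketched as follows. The data and labels are made up. The "Agg" backend and the in-memory buffer are choices made here so the sketch runs without a display and without writing a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig = plt.figure(figsize=(6, 4))   # creating a figure
ax = fig.add_subplot(1, 1, 1)      # setting up the axes

ax.plot(x, y, marker="o", label="y = x^2")    # creating a plot using the axes object
ax.plot(x, x, linestyle="--", label="y = x")  # a second plot on the same axes

ax.set_title("Example plot")       # setting up the title, labels and legend
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Save the figure as a PNG image into an in-memory buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=200)
```

Passing a file name instead of the buffer to `fig.savefig` would write the image to disk.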
Types of plots
1. Boxplot – Gives a good indication of the distribution of data about the median. Boxplots are a standardized way of displaying the distribution of data based on the five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum").
2. Scatter plot – Uses dots and markers to represent a value on the axes. It is the simplest plot that can accept both quantitative and qualitative values, with a wide variety of applications in preliminary data analysis.
Syntax – plt.scatter(x, y, marker)
3. Bar Chart – A graph with rectangular bars that usually compares different categories.
Syntax – plt.bar(x, height, width, bottom, align)
4. Histogram – Represents data as rectangular bars. It is used for continuous data.
Syntax – plt.hist(x, bins)
5. Pie chart – Divides the entire dataset into distinct groups.
Syntax – plt.pie(x, labels)
The chart can be customized with other parameters:
• explode – To get an elevated view of the selected slice.
• colors – To customize the colors of the plot.
• autopct – To add the percentage of the distribution to the pie chart.
• shadow – To add a shadow to the plot.
• startangle – To change the starting angle of the pie chart.
6. Line Chart – Drawn by interconnecting all data points using straight line segments in sequential order.
Line style – Matplotlib supports many line styles such as dashed lines, dotted lines, dash-dot, etc.
Text annotations – Text can be added to describe the plots.
For saving the plot as an image, plt.savefig(filename, dpi=200) is used.

Machine Learning using sklearn
Scikit-learn (also referred to as sklearn) is a Python library widely used for machine learning. It is characterized by a clean, uniform and streamlined API.
The objective here is to introduce the usage of the scikit-learn library for the different stages of ML model building.
Follow this order:
• Loading the data
• Preprocessing the data
• Training the model
• Testing the model
• Evaluating model performance
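The five stages listed above can be sketched with scikit-learn as follows. The choice of the bundled iris dataset, standard scaling and a k-nearest-neighbors classifier is an assumption made for this illustration and is not prescribed by the notes.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Loading the data (a small dataset bundled with scikit-learn).
X, y = load_iris(return_X_y=True)

# Split into training data and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 2. Preprocessing: fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Training the model.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)

# 4. Testing the model on data it has not seen.
y_pred = model.predict(X_test_scaled)

# 5. Evaluating model performance.
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)
```

Most scikit-learn estimators follow the same `fit` and `predict` pattern, which is what the notes mean by a uniform API.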

DATA VISUALIZATION USING PYTHON
Data Visualization
It is the concept of graphical representation of data or information using visual elements like graphs, charts, and maps.
o It helps in finding patterns and connections between variables
o Requires less effort from the reader to understand the visuals
o Condenses a large amount of information into a small space for quick analysis
o Provides relevant answers and clarity on certain questions swiftly
When a huge amount of data is given and a pattern is to be found, it is time consuming to do so manually or with any other method. Data visualization using plots can reduce the time and complexity of finding the pattern in the given data set.

Data Visualization Stakeholders:
There could potentially be two types of visualizations based on the types of stakeholders involved.
1. For self-consumption during data exploration, feature engineering, etc.
2. For presenting or communicating the insights (from the data) to a target audience, typically decision makers. This sort of visualization is usually performed to prepare the results/reports that may enable the target audience in decision making.

Types of Data collected
i. Temporal Data – Data with a time component attached to it.
ii. Geospatial Data – Data with a physical location as an attribute.
iii. Topical Data – Data concerned with topics.
iv. Network Data – Data in the form of nodes and links between nodes.
v. Tree Data – Data which is basically network data but with some hierarchy in it.

Two types of data
i. Qualitative/Categorical Data
• Binary
• Nominal
• Ordinal
ii. Quantitative/Numerical Data
• Discrete
• Continuous

Different kinds of plots that can be used for data visualization
• Box plot
• Scatter plot
• Line chart
• Bar graph
• Histogram
• Dist plot
• Pie chart
• Joint plot
• Pair plot
• Heat map

Outliers – Outliers are the extreme values present in the dataset. They affect properties of the data such as mean and variance, which are used in model building. Hence, they may impact the accuracy of the model.
Quartiles – Divide the data points into four equal-sized groups, or quarters.
Inter-quartile Range – Also called mid-spread, H-spread, or IQR, it indicates where the middle half of the data lies. The upper limit and lower limit are conventionally calculated using the following formula:
IQR = Q3 − Q1
Lower limit = Q1 − 1.5 × IQR
Upper limit = Q3 + 1.5 × IQR
Any value that lies outside these limits is called an outlier.

Plot Type | Description
Box plot | A box plot gives a good indication of the distribution of data about the median. Boxplots are a standardized way of displaying the distribution of data based on a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum).
Scatter plot | Uses dots or markers to represent a value on the plane. The scatter plot is one of the simplest plots which can accept both quantitative and qualitative values, with a wide variety of applications in preliminary data analysis.
Line chart | A line chart is drawn by interconnecting all data points using straight line segments. It is used to analyze historic variations and trends in data.
Bar chart | A bar chart is a graph with rectangular bars that compares different categories. Each bar represents a particular category, and the length of a bar indicates the total number of values or items in that category.
Histogram | It represents data as rectangular bars. Unlike the bar chart, it is used for continuous data. Each bar groups the numbers into intervals (bins) and the height of the bar is based on the number of values that fall into the corresponding interval.
Pie chart | It divides the entire dataset into distinct groups. The chart consists of a circle split into slices and each slice represents a group.
Dist plot | Depicts the variation in a data distribution. It represents the overall distribution of continuous data variables.
Joint plot | A joint plot is a combination of two univariate plots and one bivariate plot. The bivariate plot (in the center) helps in analyzing the relationship between two variables. The univariate plots describe the distribution of data in each variable as a marginal plot.
Pair plot | Depicts pairwise relationships between all the variables in a dataset in a matrix format. Each row and column in the matrix represents a variable in the dataset.
Heat map | A graphical representation of data where similar values are depicted by the same colors. The colors vary based on the intensity of the results.

Network – It is a set of objects (called nodes or vertices) that are connected to each other. The connections between the nodes are called edges or links.
If the edges in a network are directed, it is called a directed network (arrows are drawn to indicate the direction). If all edges are bidirectional or have no direction, the network is an undirected network.
A Word Cloud is a visual representation of free-form text, which is like a collage. It is typically used to depict keyword metadata of websites, articles, reviews, feedback, etc.
The frequency and significance of the words are depicted by the font, font size and color of the text in the cluster.
A choropleth map is a pictorial representation of data on a geographical map. The intensity of color in a region on the map corresponds to the respective values.
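The outlier limits based on the inter-quartile range, discussed earlier in this section, can be sketched as follows. The sketch assumes the conventional 1.5 × IQR rule for the lower and upper limits, and the values are made up for illustration.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([12, 14, 15, 15, 16, 18, 19, 20, 21, 95])

# First and third quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional limits: 1.5 times the IQR beyond each quartile (an assumed rule).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Any value outside the limits is flagged as an outlier.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Limits:", lower_limit, upper_limit)
print("Outliers:", outliers)
```

A box plot of the same values would draw these flagged points individually beyond the whiskers.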
Data Visualization Using Python
The Python libraries used for data visualization are:

• Matplotlib
• Seaborn
• Plotly

EXPLORE MACHINE LEARNING USING PYTHON

Machine Learning
Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.
Machine Learning techniques are classified into two types:
• Supervised Learning – works on labelled data, mapping the input to the output.
• Unsupervised Learning – has no explicitly defined output. The idea is to discover knowledge or structure in the data. The task of discovering inherent clusters or groups in the data is known as Clustering.

Supervised learning is further classified as
1. Regression – When the output variable can take continuous numerical values, e.g., price of a car, delivery time, credit limit.
2. Classification – When the output variable takes categorical or discrete (non-continuous) values, e.g., whether an email is spam, whether a transaction is fraudulent, etc.

Introduction to Regression
Regression is a supervised machine learning technique that helps in predicting continuous numerical values or quantities.

Regression Technique
Regression analysis is a statistical process for estimating the relationships between variables. It can be used to build a model to predict the value of the target variable from the predictor variables.
• y = f(X), where y is the target or dependent variable and X is the set of predictors or independent variables
• One predictor variable – Simple Linear Regression model
• Multiple predictor variables – Multiple Linear Regression model

SIMPLE LINEAR REGRESSION
Steps in working with the Regression Models
1. Creating regression models
2. Visualizing the speculated regression models
3. Analyzing the speculated models
4. Analyzing models
5. Finding the best fit model manually
6. Visualizing the best fit model

Best Fit Model – The goal of linear regression is to create a model that predicts the value accurately and consequently has the lowest sum of squared errors (SSE), that is, the lowest sum of squared differences between the actual and predicted target values.
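A best fit line in the sense above (lowest SSE) can be sketched with NumPy using the standard least-squares formulas for slope and intercept. The data points are made up. The sketch also computes SST, SSR and the coefficient of determination, which the notes define in the section that follows.

```python
import numpy as np

# Hypothetical data: one predictor x and one target y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates for the line y = intercept + slope * x.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
y_pred = intercept + slope * x

# Sum of squared errors of the best fit line.
sse = np.sum((y - y_pred) ** 2)

# Any other line, for example one with a slightly different slope, has a larger SSE.
sse_other = np.sum((y - (intercept + (slope + 0.5) * x)) ** 2)

# Total and regression sums of squares, and the coefficient of determination.
sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_pred - y.mean()) ** 2)
r_squared = 1 - sse / sst

print("slope:", slope, "intercept:", intercept)
print("SSE:", sse, "SSE of perturbed line:", sse_other)
print("R squared:", r_squared)
```

For a least-squares line with an intercept, SST equals SSR plus SSE up to floating-point rounding, which matches the relation stated in the notes.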

Coefficient of Determination (R2)
SST – the sum of squared differences between actual and mean target values
$SST = \sum_{i} (y_i - \bar{y})^2$
SSR – the sum of squared differences between predicted and mean target values
$SSR = \sum_{i} (\hat{y}_i - \bar{y})^2$
Relation between SST, SSR and SSE: SST = SSR + SSE
The coefficient of determination is then $R^2 = SSR / SST = 1 - SSE / SST$.

MULTIPLE LINEAR REGRESSION
This has multiple predictors and one dependent variable. Steps involved in creating a model:
• Visualizing the dataset
• Building a Multiple Linear Regression model
• Visualizing the Multiple Linear Regression model
• Finding the correlation between the predictors
• Finding the coefficient of determination
• Adjusted R-squared (a higher value indicates a more valid model)

Multicollinearity – In a multiple regression model it is possible that one predictor can be linearly predicted from the others with a substantial degree of accuracy. In such a situation, the predictors are said to be highly correlated. In statistics, this phenomenon is called multicollinearity, or in other words collinearity between variables.
Note: The obtained best fit model is valid only if the predictor variables are linearly independent of each other. Predictors are considered linearly dependent if their correlation values are close to –1 or 1.
Variance Inflation Factor – used to determine if the predictor variables are independent of each other.

For predictor $j$, $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ is obtained by regressing predictor $j$ on the other predictors.
1. VIF = 1 -> no correlation between variables
2. 1 to 5 -> slightly correlated
3. Greater than 5 -> highly correlated

R2 values can be inflated by including more and more predictor variables.
The adjusted R2 is defined as
$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$
where n is the number of observations and k is the number of predictor variables in the model.

Categorical variables: The variables which take labels as values. In Python, categorical predictor variables generally cannot be used directly to build a machine learning model. They must first be encoded as numbers.
The get_dummies() function of the Pandas library can help us get one-hot encoding done easily.

Model performance on train and test data is poor (high RMSE) – Underfit
Model performance on train data is good (low RMSE, high R-squared) but on test data is poor (high RMSE) – Overfit

Classification
Classification is a supervised Machine Learning technique that helps in predicting categorical or discrete (non-continuous) outputs.
• Logistic Regression
• Decision Trees
• K-Nearest Neighbors (kNN)
• Support Vector Machines (SVM)

LOGISTIC REGRESSION
Logistic Regression is a supervised Machine Learning algorithm, primarily used for binary classification. It computes the probability of a sample belonging to each of the classes.
The probabilities are computed using the non-linear sigmoid or logistic function given below:
$\sigma(z) = \frac{1}{1 + e^{-z}}$

Measuring Model Performance using Confusion Matrix – A confusion matrix helps in assessing how good a model is by comparing the actual target values with the predicted target values.
From the confusion matrix, precision, recall and F1 score can be derived. Below, TP, FP and FN are the numbers of true positives, false positives and false negatives for class A.
Precision – The precision for a class A indicates how accurate the model is in identifying class A.
$Precision = \frac{TP}{TP + FP}$
Recall – The recall for a class A indicates how good the model is in fetching/retrieving instances of class A.
$Recall = \frac{TP}{TP + FN}$
F1-score – This metric is the harmonic mean of precision and recall and can indicate how good the model is in classifying instances of a particular class.
$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$

DECISION TREES
This is another kind of algorithm that builds a model from data which can be properly visualized as a graph. A decision is taken on how to split the data at each node of the tree. This algorithm is called a Decision Tree.
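Returning to the confusion matrix metrics described above for classification, precision, recall and F1-score can be computed from the counts of true and false positives and negatives. The label lists below are made up, with 1 standing for class A and 0 for any other class. The formulas used are the standard definitions of these metrics.

```python
# Hypothetical actual and predicted labels: 1 = class A, 0 = not class A.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))

# The four cells of the confusion matrix.
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)  # of everything predicted as A, how much really is A
recall = tp / (tp + fn)     # of everything that really is A, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("precision:", precision, "recall:", recall, "F1:", f1)
```

In practice the same numbers can be obtained from a library such as scikit-learn. The manual version is shown here only to make the definitions explicit.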

A decision tree is a tree-like structure in which:
• the root node and each internal node represent a "test" on an attribute of an instance in the dataset.
• the outcome of each test is represented by the corresponding branches.
• a node that does not branch further is called a leaf node and represents a class label.

Splitting the dataset – A subset which contains instances belonging to only one class label is called a pure (homogeneous) subset. The predictor attribute on which the dataset is split to obtain the maximum number of pure subsets is called the best attribute.
A decision tree algorithm takes the following inputs:
1. Instances – the set of instances or samples for which class labels are already known.
2. Target_Attribute – the class label attribute.
3. Attributes_List – the list of predictor attributes.

The k-Nearest Neighbors (k-NN) algorithm determines the target value of a new data point or instance by comparing it with the existing data points or instances that are closest to it.

Euclidean Distance
The Euclidean distance between two tuples, X1 (x11, x12, x13, ..., x1n) and X2 (x21, x22, x23, ..., x2n), can be computed as:
$d(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$
Limitation: attributes with larger ranges contribute more to the Euclidean distance. So the numerical attributes should be normalized before being used for computing the Euclidean distance.

SUPPORT VECTOR MACHINE (SVM)
It classifies data into one category or the other by using hyperplanes. There can be many separating planes between two classes of data, so it is required to find an optimal hyperplane.
When data is not separable by a line or a plane, SVM maps the data into a higher dimensional space, where it can be separated using a linear hyperplane. The mapping function used to transform the data into the higher dimensional space is called the 'kernel' function.
Multi-class classification - SVM can also be used
Attribute selection measures compare different for multi-class classification. This can be done in 2
predictor attributes and rank them for the purpose ways
of model building.
1. One-vs-One Classification: It builds a binary
Three of the most commonly used attribute classification model for each pair of classes.
selection measures to induct a decision tree are: 2. One-vs-All Classification: It compares every
 Information gain class with the remaining classes thereby
 Gain ratio building a model for every class.
 Gini Index SVM usually performs well where -
CLASSIFICATION USING k-NN

19 | P a g e
 The dataset has fewer number of classes
(preferably 2) in the target variable.
 The dataset is high dimensional.
 The dataset is balanced.
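
The idea of mapping data into a higher dimensional space, mentioned above for data that is not linearly separable, can be illustrated with a small sketch. It does not train an SVM and does not use a real kernel function; it only shows, under assumed toy data and a hypothetical explicit mapping x -> (x, x^2), that points which cannot be split by a single threshold in one dimension can be split by a straight line after the mapping.

```python
def feature_map(x):
    """Hypothetical explicit mapping from 1-D to 2-D: x -> (x, x squared)."""
    return (x, x * x)

def separable_by_threshold(values, labels):
    """True if a single threshold on `values` puts each class on its own side."""
    by_value = {}
    for value, label in zip(values, labels):
        by_value.setdefault(value, set()).add(label)
    # Equal values with different labels can never be separated.
    if any(len(found) > 1 for found in by_value.values()):
        return False
    ordered = [next(iter(by_value[v])) for v in sorted(by_value)]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

# Assumed toy data: class 1 lies on both sides of class 0.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, 0, 0, 0, 1, 1]

mapped = [feature_map(x) for x in xs]
second_coordinate = [point[1] for point in mapped]

separable_1d = separable_by_threshold(xs, ys)
separable_mapped = separable_by_threshold(second_coordinate, ys)
print(separable_1d, separable_mapped)
```

In the mapped space a horizontal line such as x^2 = 2.5 separates the two classes, which is the kind of linear hyperplane an SVM would look for after the transformation.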
Hyperparameters are model properties which guide the training process, i.e., they cannot be learnt from the training data and have to be set before training. While building machine learning models, situations such as overfitting or underfitting are often encountered. These can be controlled by tuning the hyperparameters of the model.
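
As one illustration of a hyperparameter, the number of neighbours k in k-NN is chosen by the user rather than learnt from the data. The sketch below implements a plain-Python k-NN classifier with Euclidean distance and shows that the prediction for the same query can change with k. The toy data, the helper names and the use of min-max scaling for normalization are assumptions made for this example.

```python
import math
from collections import Counter

def fit_min_max(rows):
    """Return a function that rescales each column of `rows` to the range [0, 1]."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]

    def scale(row):
        return tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )

    return scale

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Majority class among the k training points closest to `query`."""
    neighbours = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical data: the second attribute has a much larger range than the first.
train_X = [(1.0, 100.0), (2.0, 200.0), (3.0, 300.0),
           (2.5, 250.0), (8.0, 800.0), (9.0, 900.0)]
train_y = ["a", "a", "a", "b", "b", "b"]
query = (2.6, 260.0)

scale = fit_min_max(train_X)
scaled_X = [scale(row) for row in train_X]

pred_k1 = knn_predict(scaled_X, train_y, scale(query), k=1)
pred_k3 = knn_predict(scaled_X, train_y, scale(query), k=3)
print(pred_k1, pred_k3)
```

With k=1 the single closest point decides the class, so one unusual training point can flip the prediction (a sign of overfitting); with k=3 the vote of the surrounding points outweighs it. In practice k would be tuned on validation data rather than picked by hand.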
Ensemble methods – These are techniques that aim at improving the prediction accuracy of models by creating and combining multiple models instead of using a single model. Two commonly used ensemble methods are:
o Bagging - Multiple models are trained using the same algorithm on different subsets of the training data. Once the models are trained in this manner, their predictions are aggregated using maximum voting or simple aggregation methods such as averaging.
o Boosting - Boosting is another ensemble learning technique where the models are built sequentially. Each new model is built by taking into account the mistakes made by the previous model in predicting the target value.
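
The bagging procedure described above can be sketched in plain Python as follows. The base model here is a very simple one-dimensional threshold classifier, and each subset is drawn by sampling with replacement (bootstrap sampling), which is a common choice but an assumption of this example, as are the toy data and all function names.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D threshold classifier at the midpoint of the two class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # The sample contains a single class: always predict that class.
        only = 0 if zeros else 1
        return lambda x: only
    mean0 = sum(zeros) / len(zeros)
    mean1 = sum(ones) / len(ones)
    threshold = (mean0 + mean1) / 2
    upper = 1 if mean1 > mean0 else 0  # class found above the threshold
    return lambda x: upper if x > threshold else 1 - upper

def bagging_fit(data, n_models, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # sampling with replacement
        models.append(train_stump(sample))
    return models

def majority_vote(predictions):
    """Aggregate class predictions by maximum voting."""
    return Counter(predictions).most_common(1)[0][0]

def bagging_predict(models, x):
    return majority_vote([model(x) for model in models])

# Hypothetical data: class 0 at small values, class 1 at large values.
data = [(float(x), 0) for x in range(5)] + [(float(x), 1) for x in range(10, 15)]

models = bagging_fit(data, n_models=25)
prediction = bagging_predict(models, 12.0)
print(prediction)
```

For a regression task the same structure applies, with majority_vote replaced by an average of the individual predictions.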

Clustering Analysis
Unsupervised learning deals with historical data
which contains no labels. The aim of clustering is to
group similar records together and make sure that
the members of different groups are significantly
different from each other.
Clustering can be performed using several
algorithms and one of the widely used clustering
algorithms is the K-means algorithm.
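
The notes do not spell out the steps of K-means, so the following is only a minimal sketch of the usual formulation: assign every point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until the assignments stop changing. The toy points and the choice of initial centroids are assumptions made for the example.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Group points around centroids; returns (labels, centroids)."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical unlabelled records forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The number of clusters (two here, fixed by the number of initial centroids) is itself a hyperparameter that has to be chosen before running the algorithm.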
