
III Year/V Semester

AD 8552 - Machine Learning

Unit-2 Course WorkBook


Index Sheet - Coursework Book

1. Course Notes, MCQ
2. Content Beyond the Syllabus
3. Assignment, Resources
4. Question Bank (Part-A, B, C)


UNIT II MACHINE LEARNING METHODS
Unit Syllabus Content

Linear methods – Regression – Classification – Perceptron and Neural networks – Decision trees – Support vector machines – Probabilistic models – Unsupervised learning – Featurization

1.Linear Methods
In general, machine learning algorithms are divided into two types:
1. Supervised learning algorithms
2. Unsupervised learning algorithms

Supervised learning algorithms address problems that involve learning with guidance. In other words, the training data in supervised learning methods needs labeled samples. For instance, for a classification problem we need samples with class labels, and for a regression problem we need samples with the desired output value for every sample. The underlying mathematical model then learns its parameters using the labeled samples so that it is able to make predictions on samples the model has not seen before, also referred to as test samples.

Most applications in machine learning involve some form of supervision, and hence most of the chapters in this course focus on the various supervised learning methods. Unsupervised learning deals with problems that involve data without labels. In some sense one can argue that this is not really a machine learning problem, as there is no labeled knowledge to be learned from past experience. Unsupervised approaches try to find some structure or some form of trend in the training data. Some unsupervised algorithms try to model the origin of the data itself. A common example of unsupervised learning is clustering.

2.Regression & Classification


Regression Analysis is defined as a set of statistical processes for estimating the
relationship between a dependent variable and one or more independent variables.

Case Study
Let's try to understand regression analysis with an example. Imagine you have made plans
with friends after a long time and you wish to go out, but you are not sure whether it will
rain or not. It’s the monsoon season, but your mom says the air feels dry today, and
therefore the probability of raining today is less. On the contrary, your sister believes
because it rained yesterday it’s likely that it will rain today. Considering you are no Lord of
Thunder and you have no control over the weather, how will you decide whose opinion to
take more seriously, keeping in mind the fact that you are impartial towards both?
Regression Analysis might come to your rescue. There are many factors on which rain depends, like geography, time of the year, precipitation, and wind speed, but unless you are the weather department or Sheldon you wouldn't want to work with all these values. So, you would take the humidity level and the previous day's precipitation to decide today's precipitation level (or the amount of rainfall). You can get both of these values easily on the internet. I know you can get the weather forecast for today too, but we are trying to learn something here.

In our example, what we are trying to predict is today's precipitation level, which depends on the level of humidity and the rain received yesterday; hence it is called the dependent variable. The variables on which it depends are called independent variables.
What we try to do with regression analysis is to model or quantify the relationship between these two kinds of variables and hence predict one with the help of the other, with some level of certainty. An informed guess is better than random guessing, right? To solve our problem with a simple linear regression, we would collect the humidity level and precipitation level for the previous month and plot them.
Even without doing any math, we can infer that humidity and rainfall (precipitation) are
linearly correlated. An increase in the value of one leads to an increase in the other value
too. But here you can see we have oversimplified the problem by making a lot of
assumptions, the major one being humidity is the only or the most important factor in
deciding rainfall. In real-world (not-so simplified) business problems, there are many
variables with complex relationships between them.

Terminology
Outliers: Outliers are values or data points that lie far away from the general population or distribution of the data. Outliers can skew the results of an ML model toward themselves. Therefore, it is necessary to detect them early on or to use algorithms that are resistant to outliers.

Overfitting: Overfitting happens when the ML model learns the training data such that it
memorizes every little detail and noise, making it less generalizable and therefore the
model performs badly on any unknown dataset and is ridiculously complex.

Heteroscedasticity: This term is hard to even read, let alone understand, so we'll take an example. We have humidity, which predicts rainfall or precipitation. Now, as humidity increases, the amount by which precipitation increases or decreases is variable and not fixed. If the level of precipitation had increased or decreased at a constant rate as the humidity was increasing, we would have called it homoscedasticity.

Linear Regression Analysis:


The type of regression we observed above is linear regression. There is assumed to be a linear relationship between the variable we want to predict and the explanatory variable. Linear regression attempts to model the relationship between two variables by fitting a linear equation of the form y = a + bX to the observed data. In our case, y is the level of precipitation and X is humidity, while a and b are regression coefficients. For all the observed points (X, y) we try to find the values of a and b that best fit our equation. In a slightly more complex scenario, there would be many variables affecting rain, like temperature, day of the year, amount of precipitation the previous day, etc. For such cases with more than one independent variable, we have multiple linear regression, and the equation for it goes like this:

y = a + bX1 + cX2 + dX3 + ……

where X1, X2, X3 are the explanatory variables and a, b, c, d are regression coefficients. A positive coefficient tells how much positive influence a predictor has on the dependent variable, and a negative coefficient indicates the opposite.
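To make this concrete, here is a minimal sketch of multiple linear regression with scikit-learn, using hypothetical humidity and previous-day rainfall values (the numbers are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: columns are humidity (%) and yesterday's rainfall (mm)
X = np.array([[70, 2.0], [85, 5.5], [60, 0.0], [90, 8.1], [75, 3.2]])
y = np.array([1.5, 6.0, 0.2, 9.3, 2.8])   # today's precipitation (mm)

model = LinearRegression().fit(X, y)
print("Intercept a:", model.intercept_)
print("Coefficients b, c:", model.coef_)

# Predict today's rainfall for 80% humidity and 4 mm of rain yesterday
print("Prediction:", model.predict([[80, 4.0]]))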

Benefits and Applications of Linear Regression Analysis


Despite being so simple, Linear Regression is a very powerful technique that can be used to generate insights on consumer behavior and to understand the business factors influencing profitability. It can be used in business to evaluate trends and make estimates or forecasts.

Linear regression can also increase the operational efficiency of the business by data-driven
decision-making. A bike rental company can avoid overstocking or understocking of bikes
by modeling the relationship between the number of bikes rented and factors like time of
the day, traffic on the road, weather, etc.

Limitations and Assumptions of Linear Regression Analysis


Linear regression assumes a linear relationship between the dependent and independent
variables which is often not the case in real-world scenarios. This is when you would want
to go with other regression techniques that account for non-linearity.
Linear Regression also won't do well if the independent variables are related to each other, in other words if there is multicollinearity. To avoid this, keep only one of the correlated independent variables.
It also assumes different observations are independent of each other. Today’s rainfall is
independent of yesterday’s rainfall, which again is not a very realistic assumption.

Polynomial Regression Analysis


The need for polynomial regression stems from the need to model the relationship between the dependent and independent variables when it is non-linear, which is often the case in practical applications. The equation for polynomial regression is, naturally, a polynomial in the explanatory variable:

y = a + bX + cX² + dX³ + ……
Before we go any further, let us familiarize ourselves with the concept of a loss function, used to assess how well our regression algorithm fits the data. While fitting our regression line to the data, we position the line in such a way that the sum of squared vertical distances (residuals) of the data points from the line is minimized.

The Root Mean Squared Error is closely related: it squares these residuals (the distances of the points from the line), averages them, and takes the square root:

RMSE = sqrt( (1/N) Σ (Predicted(i) − Actual(i))² )

Here Predicted(i) are the model's predictions and Actual(i) are the observed values. RMSE tells you how well the regression line fits the data.

Now, coming back to polynomial regression, when the relationship between variables is not
linear, it’s hard to fit a line on the data and minimize our cost function. This is when we
need Polynomial Regression.
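A minimal sketch of polynomial regression with scikit-learn, on hypothetical non-linear data, together with the RMSE it achieves:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Hypothetical non-linear data: y roughly follows a cubic in X plus noise
rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 3 - X.ravel() + rng.normal(scale=1.0, size=50)

# Degree-3 polynomial regression: expand features to [X, X^2, X^3], then fit linearly
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)

rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
print("RMSE on the training data:", rmse)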

Benefits and Applications of Polynomial Regression Analysis


Besides providing the best level of approximation between the variables, polynomial
regression also provides a broad range of functions that can fall under its hood.
In industry, polynomial regression can be used for all cases where linear regression is used
but with a greater degree of reliability, since we are not violating the linearity assumption.

Limitations and Assumptions of Polynomial Regression Analysis


Most of the assumptions made for linear regression are still valid here. It is assumed that the data is not multicollinear, that observations are independent of one another, and that there is no heteroscedasticity.
Polynomial Regression is also sensitive to the presence of outliers. It’s also hard to detect
outliers with an algorithm and their presence skews the result in their direction.
Logistic Regression Analysis
A common feature of the above two methods was that the dependent variable was continuous; in Logistic Regression the dependent variable is discrete (or categorical) while the independent variables can be discrete or continuous. It is named after the function at its core, the logistic function. The equation goes like this:

p = 1 / (1 + e^-(b0 + b1x1 + b2x2 + b3x3 + ……))

where x1, x2, x3 are independent variables and b0, b1, b2, b3 are regression coefficients. In a binary classification problem, p gives the probability that the sample belongs to the positive class. When logistic regression is applied to a real-world problem, such as detecting cancer, p would give the probability that the person has cancer. A p less than 0.5 is mapped to "no cancer" and a p greater than that is mapped to "cancer". Logistic regression is a linear method, but the predictions are transformed using the logistic function, so its curve follows the S-shaped logistic curve.
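For illustration, a minimal scikit-learn sketch of logistic regression on hypothetical one-feature data might look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a single feature (e.g. a diagnostic measurement) and a 0/1 label
X = np.array([[1.2], [2.3], [3.1], [4.8], [5.0], [6.7], [7.4], [8.9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba returns p for each class; the second column is P(class = 1)
print("P(class 1) for x = 4.5:", clf.predict_proba([[4.5]])[0, 1])
print("Predicted label for x = 4.5:", clf.predict([[4.5]])[0])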

Benefits and Applications of Logistic Regression Analysis


● Logistic Regression is one of the most widely used algorithms out there. It is easy to
implement and versatile. From petal recognition to text classification, it is used on
top(as a classification layer) of some of the most sophisticated deep Learning
architectures out there.
● It can be used with a variety of output classes and it also outputs the magnitude of
association with a class. It can also interpret model coefficients as indicators of
feature importance.
● Unlike other classification algorithms like SVM, it is relatively faster to train.

Limitations and Assumptions of Logistic Regression Analysis


● Logistic Regression is, at its core, a linear algorithm; thus, it follows most of the assumptions of linear regression, such as a linear relationship between the input variables and the (log-odds of the) output variable and the absence of autocorrelation.
● Logistic Regression can overfit if the number of observations is less than the number
of independent variables.
● It is also sensitive to outliers and noise in the data.

Quantile Regression Analysis


In probability distributions, quantiles are points dividing the range of the distribution into continuous intervals with equal probabilities. For a normal distribution, for example, 25% of the data points lie to the left of Q1 (the first quartile) and 75% lie to the left of Q3 (the third quartile).

Ordinary Least Squares Regression or Linear Regression is modeled around the mean of the
dependent variable. Quantile regression allows us to understand relationships between
variables outside of the mean of the data, making it useful in understanding outcomes that
are non-normally distributed and that have non-linear relationships with predictor
variables.
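One way to fit a quantile regression is with statsmodels; here is a minimal sketch on hypothetical skewed data (assuming statsmodels is installed):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical right-skewed data
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.exponential(scale=2.0, size=200)
df = pd.DataFrame({"x": x, "y": y})

# Fit the median (q = 0.5) and an upper quantile (q = 0.9)
median_fit = smf.quantreg("y ~ x", df).fit(q=0.5)
upper_fit = smf.quantreg("y ~ x", df).fit(q=0.9)
print(median_fit.params)
print(upper_fit.params)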
Benefits and Applications of Quantile Regression Analysis
● Quantile regression can be used when the assumptions of linear regression are not
met. It is robust to outliers and can be used when heteroscedasticity is present.
● It is also useful when data is skewed as it does not depend on measures of mean but
quantiles. In any business, it is likely that the amount of money spent by customers
is skewed and the business might be more interested in the top quantiles rather
than the mean.

Limitations and Assumptions of Quantile Regression Analysis


● If all assumptions of the Linear Regression model are met, Quantile Regression is
less efficient than the alternative.
● Unlike Logistic Regression, Quantile Regression works on predicting continuous
variables and does so less precisely since it makes predictions over quantiles.

Ordinal Regression Analysis


When the dependent variables are ordinal, this technique is used. Ordinal variables are
categorical variables, but the categories are ordered/ranked like Low, Moderate, High.
Ordinal Regression can be seen as an intermediate problem between regression and
classification. The model for Ordinal Regression comes from the Generalized Linear Model framework; a common choice is the cumulative logit (proportional odds) model.

Benefits and Applications of Ordinal Regression Analysis


● Ordinal regression turns up often in the social sciences, for example in the modeling
of human levels of preference (on a scale from, say, 1–5 for "very poor" through
"excellent"), as well as in information retrieval.
● It serves as the best technique for predicting multiclass ordered variables.

Limitations and Assumptions of Ordinal Regression Analysis


● Parallel lines assumption: There is one regression equation for each category except the last. The probability of the last category can be obtained as 1 minus the cumulative probability of the second-to-last category.
● Estimates are sometimes implausible, suggesting that the data are being spread too
thin and another method is needed.

Support Vector Regression


Let's take an example of a 2D dataset having 2 features (independent variables) and 2 classes. We can easily plot the points in a 2D space.

The red dots correspond to one class and the green dots to the other. These classes can be easily separated by a line in 2D space. But for SVM, it can't be just any line. The distance between the closest points of the two classes (the margin) is taken, and the line passing mid-way through it is the optimal dividing boundary. The points that play the major role in deciding the position of this separator are called support vectors, and hence the whole technique is called Support Vector Machine. In more realistic cases, we have an n-dimensional space, where n is the number of features, and the decision boundary is a hyperplane rather than a line (and it need not be linear if kernels are used).

In Support Vector Regression, instead of having a discrete dependent variable we have a continuous one, and instead of a decision boundary we have a regression line (or hyperplane) to fit the data. The way we find the best-fit line is a little different from what we did above. Again, for the purpose of simplification, consider a 2D plane in which the points are distributed. SVR fits a line together with a tube of tolerance (of width epsilon) around it; the points that fall on or outside this tube are the support vectors, and the best-fit line is the one that keeps as many points as possible inside the tube while deviating from them as little as possible.
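A minimal scikit-learn sketch of SVR on hypothetical one-dimensional data, showing the epsilon tube and C parameters:

import numpy as np
from sklearn.svm import SVR

# Hypothetical noisy sine data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# epsilon controls the width of the tolerance tube, C the penalty on points outside it
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)

print("Number of support vectors:", len(svr.support_))
print("Prediction at x = 2.5:", svr.predict([[2.5]])[0])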

Benefits and Applications of Support Vector Regression Analysis


● While being robust to outliers, SVR works much better in high-dimensional spaces than the linear regression model.
● You can define a level of tolerance (the epsilon margin) while training, along with a penalty parameter C for points that fall outside it. This is useful in real-world systems that do not require a very precise prediction, but rather a prediction within a tolerance interval.
● It is very easy to accommodate new data points in support vector regression
analysis.

Limitations and Assumptions of Support Vector Regression Analysis


● They take a lot of time to train and are not suitable for larger datasets.
● SVR will seriously underperform if the number of samples is less than the number of
features. There are no probabilistic explanations for the predictions.

Poisson Regression Analysis


Poisson distribution is a discrete probability distribution covering the number of events
occurring in a period of time, given the average number of times that event has occurred in
that period. When the dependent variable follows Poisson distribution or is count-based,
we use Poisson Regression. Count-based data contains events that occur at a certain rate.
The rate of occurrence may change over time or from one observation to the next. The
instance we stated above is an example of this. The formula for the Poisson distribution follows this probability mass function:

P(X = k) = ( (λt)^k · e^(−λt) ) / k!

where P(X = k) is the probability of seeing k events in time t, λ is the event rate (the number of events happening per unit time), and k is the number of events.
Consider a small-scale restaurant where we are recording the number of customers walking in during the hour between 10 a.m. and 11 a.m.; on average 5 customers arrive in this hour. With this information, we can calculate the probability that there will be no customer between 10 a.m. and 11 a.m. as follows:

P(X = 0) = (5^0 · e^(−5)) / 0! = e^(−5) ≈ 0.0067, i.e. roughly a 0.7% chance of seeing no customers in that hour.
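This value can be checked numerically; a small sketch assuming scipy is available:

import math
from scipy.stats import poisson

rate = 5  # average customers per hour (lambda * t)

# Probability of exactly 0 customers in the hour
print("P(k = 0) =", poisson.pmf(0, rate))          # ~0.0067
print("Check with the formula:", math.exp(-rate))  # (5**0 * e**-5) / 0! = e**-5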

Benefits and Applications of Poisson Regression Analysis


● A lot of businesses rely on count-based data like the number of bikes rented in an
hour, the number of calls received in a call center at a particular time in the day, or
the number of pizzas ordered during a particular time in the month.
● It is useful when the data is skewed and sparse.
● It is used to determine the probable maximum and the minimum number of times
the event will occur within the specified time frame.

Limitations and Assumptions of Poisson Regression Analysis


● The assumptions of Poisson Regression are that the dependent variable must be count-based, the observations must be independent of one another, and the mean of the Poisson random variable must be equal to its variance.
● Poisson regression may not perform well in situations where the conditional
variance is greater than the conditional mean, a phenomenon known as
overdispersion.

Negative Binomial Regression:


Like Poisson Regression, Negative Binomial Regression also works on count data. In a way, Negative Binomial Regression is better than Poisson regression because it does not make the assumption that the mean is equal to the variance. This strict assumption is often not satisfied by real-world data. In real-world data, the variance is either greater than the mean (called overdispersion) or less than the mean (called under-dispersion).
The shape of the distribution is much like that of the Poisson distribution. Negative Binomial Regression can
be considered as a generalization of Poisson regression since it has the same mean
structure as Poisson regression, and it has an extra parameter to model the
over-dispersion.
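A minimal sketch of a negative binomial regression fit, assuming statsmodels is available (the generated data and the dispersion value alpha are hypothetical, for illustration only):

import numpy as np
import statsmodels.api as sm

# Hypothetical overdispersed count data with one predictor
rng = np.random.RandomState(0)
x = rng.uniform(0, 2, 200)
mu = np.exp(0.5 + 1.2 * x)
counts = rng.negative_binomial(n=2, p=2 / (2 + mu))   # variance > mean

X = sm.add_constant(x)
model = sm.GLM(counts, X, family=sm.families.NegativeBinomial(alpha=0.5))
result = model.fit()
print(result.params)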
Benefits and Applications of Negative Binomial Regression Analysis
● A use case for this would be school administrators studying the attendance behavior of high school juniors at two schools, where predictors of the number of days of absence include the type of program in which the student is enrolled and how well he/she does in a standardized math test.
● It has a clear advantage over Poisson Regression since it does not make the mean
equal to variance assumption.

Limitations and Assumptions of Negative Binomial Regression Analysis


● When the number of samples is small, negative binomial regression may not be a
good choice.
● The outcome variable cannot have negative numbers.

Principal Components Regression:


This regression technique is based on Principal Component Analysis. In PCR, instead of
regressing the dependent variable on the explanatory variables directly, the principal
components of the explanatory variables are used as regressors. PCA is a dimensionality-reduction method that reduces the number of features of large datasets while preserving as much of the information as possible. A little accuracy is traded for simplicity.

PCA can, for example, project points from a 2D space onto a 1D line along the direction of maximum variance.


In PCR, the steps followed are as follows:

● Perform PCA on the explanatory variables to obtain principal components and then
choose a subset from this.
● Using this subset and our dependent variable, fit a linear regression model to get a
vector of estimated regression coefficients.
● Transform this vector back to the scale of the original independent variables.
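These steps can be sketched as a scikit-learn pipeline; the data and the number of retained components below are hypothetical choices for illustration:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: 10 explanatory variables, some of them strongly correlated
rng = np.random.RandomState(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base + 0.1 * rng.normal(size=(100, 3)),
               rng.normal(size=(100, 4))])
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Standardize, keep the first 4 principal components, then regress on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())
pcr.fit(X, y)
print("R^2 of the PCR model:", pcr.score(X, y))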

Benefits and Applications of Principal Component Regression Analysis


● One of the greatest advantages of PCR is the consistency check that one gets on the
raw data, which you don’t have for MLR. PCR is also way less prone to overfitting.
● PCR can be used even when the explanatory variables are correlated. It can also be
run when there are more features than observations.

Limitations and Assumptions of Principal Component Regression Analysis


● One of the biggest disadvantages of PCR is that it does not consider the dependent
variable when deciding which principal components to drop. The decision to drop
components is based only on the magnitude of the variance of the components.
● PCA does not normalize the data by itself, so it is sensitive to the scale of the features. Changing the scale of a feature can completely change the results of PCA, so features are usually standardized first.

Partial Least Squares Regression


It is an extension of Principal Components Regression. Instead of finding hyperplanes of
maximum variance between the dependent and independent variables, it finds a linear
regression model by projecting the predicted variables and the observable variables to a
new space. Both kinds of variables are mapped into a new space, hence it overcomes a
limitation of PCA. A PLS regression model will try to find the multidimensional direction in
the X space that explains the maximum multidimensional variance direction in the Y space.
The mathematical model is given by:

X = T Pᵀ + E
Y = U Qᵀ + F

where X is a matrix of independent variables and Y is a matrix of dependent variables; T and U are matrices that are, respectively, projections of X and of Y; P and Q are, respectively, orthogonal loading matrices; and the matrices E and F are the error terms, assumed to be independent and identically distributed random normal variables. The decompositions of X and Y are made so as to maximize the covariance between T and U.

Benefits and Applications of Partial Least Squares Regression


● PLS can be used for the detection of outliers. Like PCR, it can also handle more
features than observations.
● It also provides more predictive accuracy and a lower risk of finding correlation on
chance. It has most of the same benefits as PCR.

Limitations and Assumptions of Partial Least Squares Regression


● The major limitations are a higher risk of overlooking real correlations and
sensitivity to the relative scaling of the descriptor(independent) variables.
● This technique is again sensitive to scaling.

Tobit Regression Analysis


In Tobit Regression, the observed or known range of the dependent variable is censored in some way. In statistics, censoring is a condition in which the value of a variable is only partially known. Censoring (or clipping) can occur in the following ways. Censoring from above takes place when cases with a value at or above some threshold all take on the value of that threshold, so that the true value might be equal to the threshold, but it might also be higher. In the case of censoring from below, values that fall at or below some threshold are censored.

Let’s look at an example of Tobit analysis-

A research project is studying the level of lead in home drinking water as a function of the
age of a house and family income. The water testing kit cannot detect lead concentrations
below 5 parts per billion (ppb). The EPA considers levels above 15 ppb to be dangerous.
These data are an example of left-censoring.
Benefits and Applications of Tobit Regression Analysis
● Tobit's method can be easily extended to handle truncated and other non-randomly
selected samples. Tobit models have been applied in demand analysis to
accommodate observations with zero expenditures on some goods.
● It has also been applied to estimate factors that impact grant receipt, including
financial transfers distributed to sub-national governments who may apply for these
grants. In these cases, grant recipients cannot receive negative amounts, and the
data is thus left-censored.

Limitations and Assumptions of Tobit Regression Analysis


● One limitation of the Tobit model is its assumption that the processes in both regimes of the outcome are equal up to a constant of proportionality.
● If we have a fundamentally bounded dependent variable rather than a truncated one, we might want to move to a generalized linear model framework with one of the (less often chosen) distributions for Y, e.g. log-normal, gamma, or exponential, which respect that lower bound.

Cox Regression Analysis


The Cox regression model is commonly used in medical research for studying the
association between the survival time of patients and one or more predictor
variables(values on which survival time is dependent). The purpose of the model is to
evaluate simultaneously the effect of several factors on survival. In other words, it allows us
to examine how specific factors influence the rate of a particular event happening (e.g.,
infection, death) at a particular point in time. This rate is commonly referred to as the
hazard rate. Predictor variables (or factors) are usually termed covariates in the
survival-analysis literature. The Cox model is expressed by the hazard function, denoted h(t). Briefly, the hazard function can be interpreted as the risk of dying at time t. It can be estimated as follows:

h(t) = h0(t) × exp(b1x1 + b2x2 + …… + bpxp)

where t stands for survival time, h(t) is the hazard function, the coefficients b1, b2, …, bp measure the impact of the covariates x1, x2, …, xp, and h0(t) is the baseline hazard.

Benefits and Applications of Cox Regression Analysis


● It can be used in investigating the impact of diet, amount of exercise, hours of sleep,
age on the survival time after a person has been diagnosed with a disease such as
cancer. Survival data usually has censored data and the distribution is highly
skewed. Because of these two problems, Multiple Regression cannot be used.
● It uses a multivariate approach and can account for the impact of each variable on
the outcome.

Limitations and Assumptions of Cox Regression Analysis


● If the proportionality of the hazard assumption is not met, the outcome of
regression is incorrect.
● The model also assumes that each covariate has a multiplicative effect in the hazards
function that is constant over time.

Ridge Regression Analysis


Before going any further with this let’s understand the concept of regularization.
Regularization is a technique used to deal with overfitting. It adds an additional error term
to the loss function that penalizes overfitting and promotes generalization. So, in addition
to optimizing the model coefficients for loss, we also optimize for the regularization term,
so we get a well-fit model. There are basically 2 kinds of regularizations – L1 and L2. We’ll
better understand them as we’ll go through the regression models that use them. Ridge
Regression uses L2 regularization also called the L2 penalty which is the square of the
magnitude of the model coefficients added to the error term. It is merely an extension of the linear regression model with better control over overfitting. The ridge regression model equation remains the same as in multiple linear regression:

y = a + bX1 + cX2 + dX3 + ……

If the loss function we have chosen is RMSE, then the error becomes:

Error = RMSE + λ (a² + b² + c² + ………)

Here, λ controls the level of regularization.
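A minimal scikit-learn sketch of ridge regression on hypothetical, nearly collinear data; scikit-learn's alpha parameter plays the role of λ above:

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data with two highly correlated features
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=50)      # nearly collinear
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)                  # alpha is the regularization level
print("Shrunken coefficients:", ridge.coef_)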

Benefits and Applications of Ridge Regression Analysis


● Deals with overfitting and makes the model generalize well.
● Shrinks the model coefficients, reducing model complexity and multicollinearity.
In most real-world scenarios, Ridge Regression tends to be a better method than plain Linear Regression because of its ability to learn general patterns rather than noise.
Limitations and Assumptions of Ridge Regression Analysis
● Ridge Regression is at the heart of a linear regression model and thus can only be
used to model linear relations. It makes most assumptions of the linear regression
model.
● Doesn’t deal well with sparse data

Lasso Regression
Lasso is also an extension of Linear Regression, but it implements L1 regularization instead of L2. The only difference between L1 and L2 is that instead of the squares of the coefficients, their absolute magnitudes are penalized.

The error term now is:

Error = RMSE + λ (|a| + |b| + |c| + ………)

Here, λ is used to control the level of regularization. The goal of lasso regression is to
obtain the subset of predictors that minimizes prediction error for a quantitative
response(dependent) variable. It does this by imposing a constraint on the model
parameters that causes regression coefficients for some variables to shrink toward zero.
Variables with a regression coefficient equal to zero after the shrinkage process are
excluded from the model. Variables with non-zero regression coefficient variables are most
strongly associated with the response variable. Thus, it helps in feature selection.
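A minimal scikit-learn sketch showing how lasso shrinks the coefficients of irrelevant (hypothetical) features to exactly zero:

import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first 2 of 10 features actually matter
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                  # alpha is the regularization level
print("Coefficients:", np.round(lasso.coef_, 3))
print("Selected features:", np.flatnonzero(lasso.coef_ != 0))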

Benefits and Applications of Lasso Regression Analysis


● It avoids overfitting and can be used when the number of features is more than the
number of samples. Lasso regression is well suited for building forecasting models
when the number of potential covariates is large, and the number of observations is
small or roughly equal to the number of covariates.
● It does feature selection and reduces model complexity through it. It’s also fast, in
terms of training and inference on test data.

Limitations and Assumptions of Lasso Regression Analysis


● Since it’s a linear model at the core, it follows most of the assumptions of a linear
model. It also fails to do grouped selection. It tends to select one variable from a
group and ignore the others.
● It’s not very intuitive, in the sense that there is no way to know why it selected the
features it did. It might lose some important independent variable on the way, but
this is dependent on the level of regularization λ.
ElasticNet Regression
ElasticNet is a combination of Lasso and Ridge Regression in the sense that it uses both L1
and L2 regularization. The feature selection of Lasso can be too dependent on data and thus
unstable, therefore ElasticNet combines the two approaches to give the best of both worlds.

The error term goes like this :

Error = RMSE + λ (α × L1 penalty + (1 − α) × L2 penalty)

Here, λ is used to control the level of regularization as usual, while α gives the relative weight of the L1 and L2 penalties. Its value always lies between 0 and 1.
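A minimal scikit-learn sketch of ElasticNet on hypothetical data; scikit-learn's l1_ratio corresponds to the weight α above:

import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical data with a pair of correlated, relevant features
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 8))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)        # correlated pair
y = 3 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.5, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2 penalties
print("Coefficients:", np.round(enet.coef_, 3))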

Benefits and Applications of ElasticNet Regression Analysis


● Deals with overfitting and can also do feature selection with L1 regularization.
● Can perform grouped selection because of the presence of L2 penalty. It has
interesting applications in sparse PCA and new support kernel machines.
● It is also used in Cancer prognosis and portfolio optimization.

Limitations and Assumptions of ElasticNet Regression Analysis


● Regularization leads to dimensionality reduction, which means the machine learning
model is built using a lower-dimensional dataset. This generally leads to a high bias
error.
● Again, it follows all the assumptions of a linear model

KNN Algorithm - Finding Nearest Neighbors

Introduction
K-nearest neighbors (KNN) is a type of supervised ML algorithm which can be used for both classification and regression predictive problems. However, it is mainly used for classification problems in industry. The following two properties define KNN well −
● Lazy learning algorithm − KNN is a lazy learning algorithm because it does
not have a specialized training phase and uses all the data at classification time.
● Non-parametric learning algorithm − KNN is also a non-parametric learning
algorithm because it doesn’t assume anything about the underlying data.

Working of KNN Algorithm


K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new
data points which further means that the new data point will be assigned a value based on
how closely it matches the points in the training set. We can understand its working with
the help of following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of
KNN, we must load the training as well as test data.
Step 2 − Next, we need to choose the value of K i.e. the nearest data points. K can be any
integer.
Step 3 − For each point in the test data do the following −
● 3.1 − Calculate the distance between test data and each row of training data
with the help of any of the methods, namely: Euclidean, Manhattan or
Hamming distance. The most commonly used method to calculate distance is
Euclidean.
● 3.2 − Now, based on the distance value, sort them in ascending order.
● 3.3 − Next, it will choose the top K rows from the sorted array.
● 3.4 − Now, it will assign a class to the test point based on the most frequent
class of these rows.

Step 4 − End
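The steps above can be sketched directly in a few lines of NumPy, using hypothetical toy points, Euclidean distance and K = 3:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 3.1: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Steps 3.2-3.3: indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 3.4: majority vote among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data with two classes
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y_train = np.array(["red", "red", "red", "blue", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([7, 7]), k=3))   # "blue"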
Example
The following is an example to understand the concept of K and working of KNN algorithm

Suppose we have a dataset which can be plotted as follows −
Now, we need to classify a new data point with black dot (at point 60,60) into a blue or red
class. We are assuming K = 3 i.e. it would find three nearest data points. It is shown in the
next diagram −

We can see in the above diagram the three nearest neighbors of the data point with the black dot. Among those three, two of them lie in the Red class, hence the black dot will also be assigned to the Red class.

KNN as Classifier
First, start with importing necessary python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, download the iris dataset from its weblink as follows −


path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −


headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, we need to read dataset to pandas dataframe as follows −


dataset = pd.read_csv(path, names = headernames)
dataset.head()
sepal-length sepal-width petal-length petal-width Class

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

Data Preprocessing will be done with the help of following script lines.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

Next, we will divide the data into train and test split. Following code will split the dataset
into 60% training data and 40% of testing data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)

Next, data scaling will be done as follows −


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, train the model with the help of KNeighborsClassifier class of sklearn as follows −
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)

At last we need to make prediction. It can be done with the help of following script −
y_pred = classifier.predict(X_test)
Next, print the results as follows −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60

Accuracy: 0.8833333333333333

KNN as Regressor
First, start with importing necessary Python packages −
import numpy as np
import pandas as pd

Next, download the iris dataset from its web link as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −


headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read dataset to pandas dataframe as follows −
data = pd.read_csv(path, names = headernames)
array = data.values
# Use the first two columns as features and the third (petal-length) as the target
X = array[:, :2].astype(float)
y = array[:, 2].astype(float)
data.shape
output:(150, 5)
Next, import KNeighborsRegressor from sklearn to fit the model −
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 10)
knnr.fit(X, y)

At last, we can find the MSE as follows −


print ("The MSE is:",format(np.power(y-knnr.predict(X),2).mean()))

Output
The MSE is: 0.12226666666666669

Pros and Cons of KNN


Pros
● It is a very simple algorithm to understand and interpret.
● It is very useful for nonlinear data because there is no assumption about data
in this algorithm.
● It is a versatile algorithm as we can use it for classification as well as
regression.
● It has relatively high accuracy but there are much better supervised learning
models than KNN.

Cons
● It is computationally a bit expensive algorithm because it stores all the
training data.
● High memory storage required as compared to other supervised learning
algorithms.
● Prediction is slow in case of big N.
● It is very sensitive to the scale of data as well as irrelevant features.

Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in the banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by comparing with the
persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'", "Will Vote for Party 'BJP'".
Other areas in which KNN algorithms can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
Perceptron
A neural network link that contains computations to track features and uses Artificial
Intelligence in the input data is known as Perceptron. This neural links to the artificial
neurons using simple logic gates with binary outputs. An artificial neuron invokes the
mathematical function and has node, input, weights, and output equivalent to the cell
nucleus, dendrites, synapse, and axon, respectively, compared to a biological neuron.
Biological Neuron
A human brain has billions of neurons. Neurons are interconnected nerve cells in the
human brain that are involved in processing and transmitting chemical and electrical
signals. Dendrites are branches that receive information from other neurons.
Cell nucleus or Soma processes the information received from dendrites. Axon is a cable
that is used by neurons to send information. Synapse is the connection between an axon
and other neuron dendrites.
Researchers Warren McCulloch and Walter Pitts published their first concept of a simplified brain cell in 1943. This was called the McCulloch-Pitts (MCP) neuron. They described such a nerve cell as a simple logic gate with binary outputs.
Multiple signals arrive at the dendrites and are then integrated into the cell body, and, if the
accumulated signal exceeds a certain threshold, an output signal is generated that will be
passed on by the axon. In the next section, let us talk about the artificial neuron
Artificial Neuron
An artificial neuron is a mathematical function based on a model of biological neurons,
where each neuron takes inputs, weighs them separately, sums them up and passes this
sum through a nonlinear function to produce output.

.
Biological Neuron vs. Artificial Neuron
The biological neuron is analogous to artificial neurons in the following terms:

Biological Neuron Artificial Neuron

Cell Nucleus (Soma) Node


Dendrites Input

Synapse Weights or interconnections

Axon Output

The artificial neuron has the following characteristics:


● A neuron is a mathematical function modeled on the working of biological
neurons
● It is an elementary unit in an artificial neural network
● One or more inputs are separately weighted
● Inputs are summed and passed through a nonlinear function to produce output
● Every neuron holds an internal state called activation signal
● Each connection link carries information about the input signal
● Every neuron is connected to another neuron via connection link

Perceptron
Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron
learning rule based on the original MCP neuron. A Perceptron is an algorithm for
supervised learning of binary classifiers. This algorithm enables neurons to learn and
process elements in the training set one at a time.

Types of Perceptron:
1. Single layer: Single layer perceptron can learn only linearly separable patterns.
2. Multilayer: Multilayer perceptrons can learn about two or more layers having a
greater processing power.
The Perceptron algorithm learns the weights for the input signals in order to draw a linear
decision boundary.
Note: Supervised Learning is a type of Machine Learning used to learn models from labeled
training data. It enables output prediction for future or unseen data. Let us focus on the
Perceptron Learning Rule in the next section.
Perceptron in Machine Learning
The most commonly used term in Artificial Intelligence and Machine Learning (AIML) is
Perceptron. It is the beginning step of learning coding and Deep Learning technologies,
which consists of input values, scores, thresholds, and weights implementing logic gates.
Perceptron is the nurturing step of an Artificial Neural Link. In 19h century, Mr. Frank
Rosenblatt invented the Perceptron to perform specific high-level calculations to detect
input data capabilities or business intelligence. However, now it is used for various other
purposes.
Perceptron Model in Machine Learning
A machine-based algorithm used for supervised learning of various binary sorting (classification) tasks is called a Perceptron. Furthermore, the Perceptron also has an essential role as an artificial neuron or neural link in detecting certain input data computations in business intelligence. A perceptron model is also classified as one of the simplest types of Artificial Neural Networks. Being a supervised learning algorithm for binary classifiers, we can also consider it a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
Working of Perceptron
As discussed earlier, the Perceptron is considered a single-layer neural network with four main parameters. The perceptron model begins by multiplying all input values by their weights and adding them up to create the weighted sum. This weighted sum is then passed to the activation function 'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f'.

This step function or activation function is vital in ensuring that the output is mapped between (0, 1) or (-1, 1). Take note that the weight of an input indicates the strength of that node; similarly, the bias value gives the ability to shift the activation function curve up or down.
Step 1: Multiply all input values with their corresponding weight values and then add them up to calculate the weighted sum. The following is the mathematical expression of it:

∑ wi*xi = x1*w1 + x2*w2 + x3*w3 + …… + xn*wn

Add a term called the bias 'b' to this weighted sum to improve the model's performance.
Step 2: An activation function is applied to the above-mentioned weighted sum, giving us an output either in binary form or as a continuous value, as follows:

Y = f(∑ wi*xi + b)
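These two steps can be written directly in NumPy; a minimal sketch with hypothetical inputs, weights, bias, and a step activation:

import numpy as np

def perceptron_output(x, w, b):
    # Step 1: weighted sum of the inputs plus bias
    weighted_sum = np.dot(w, x) + b
    # Step 2: step activation function maps the sum to a binary output
    return 1 if weighted_sum > 0 else 0

x = np.array([1.0, 0.0, 1.0])      # hypothetical input values
w = np.array([0.5, -0.6, 0.4])     # hypothetical weights
b = -0.5                           # hypothetical bias
print(perceptron_output(x, w, b))  # 0.5 + 0.4 - 0.5 = 0.4 > 0  ->  1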
Types of Perceptron models
We have already discussed the types of Perceptron models in the Introduction. Here, we
shall give a more profound look at this:
1. Single Layer Perceptron model: One of the easiest ANN(Artificial Neural
Networks) types consists of a feed-forward network and includes a threshold
transfer inside the model. The main objective of the single-layer perceptron
model is to analyze the linearly separable objects with binary outcomes. A
Single-layer perceptron can learn only linearly separable patterns.
2. Multi-Layered Perceptron model: It is mainly similar to a single-layer
perceptron model but has more hidden layers.
Forward Stage: Activation functions start from the input layer and terminate at the output layer.
Backward Stage: In the backward stage, the weight and bias values are modified per the model's requirement; the error between the actual output and the desired output is propagated backward, starting from the output layer. A multilayer perceptron model has greater processing power and can handle linear and non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, XNOR, and NOR.
Advantages:
● A multi-layered perceptron model can solve complex nonlinear problems.
● It works well with both small and large input data.
● Helps us to obtain quick predictions after the training.
● Helps us obtain the same accuracy ratio with big and small data.
Disadvantages:
● In multi-layered perceptron models, computations are time-consuming and
complex.
● It is tough to predict how much the dependent variable affects each independent
variable.
● The model functioning depends on the quality of training.
Characteristics of the Perceptron Model
1. It is a machine learning algorithm that uses supervised learning of binary
classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and then the decision is made
whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the function is more
significant than zero.
5. The linear decision boundary is drawn, enabling the distinction between the two
linearly separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, it must have
an output signal; otherwise, no output will be shown.

Limitation of Perceptron Model


The following are the limitation of a Perceptron model:
1. The output of a perceptron can only be a binary number (0 or 1) due to the
hard-edge transfer function.
2. It can only be used to classify the linearly separable sets of input vectors. If the
input vectors are non-linear, it is not easy to classify them correctly.
Perceptron Learning Rule
Perceptron Learning Rule states that the algorithm would automatically learn the optimal
weight coefficients. The input features are then multiplied with these weights to determine
if a neuron fires or not.

The Perceptron receives multiple input signals, and if the sum of the input signals exceeds a
certain threshold, it either outputs a signal or does not return an output. In the context of
supervised learning and classification, this can then be used to predict the class of a sample.
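A minimal sketch of this learning rule on a hypothetical, linearly separable dataset (an AND-like problem); the weights are nudged by the prediction error at a fixed learning rate:

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            error = target - pred              # 0 if correct, +1/-1 otherwise
            w += lr * error * xi               # adjust weights toward the target
            b += lr * error
    return w, b

# Hypothetical linearly separable data (AND truth table)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])   # expected [0, 0, 0, 1]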

Perceptron Function
Perceptron is a function that maps its input "x", multiplied with the learned weight coefficients, to an output value "f(x)":

f(x) = 1 if w · x + b > 0, and 0 otherwise

In the equation given above:
● “w” = vector of real-valued weights
● “b” = bias (an element that adjusts the boundary away from origin without any
dependence on the input value)
● “x” = vector of input x values

● “m” = number of inputs to the Perceptron


The output can be represented as “1” or “0.” It can also be represented as “1” or “-1”
depending on which activation function is used.
Inputs of a Perceptron
A Perceptron accepts inputs, moderates them with certain weight values, then applies the
transformation function to output the final result. The image below shows a Perceptron
with a Boolean output.

A Boolean output is based on inputs such as salaried, married, age, past credit profile, etc. It
has only two values: Yes and No or True and False. The summation function “∑” multiplies
all inputs of “x” by weights “w” and then adds them up as follows:

Activation Functions of Perceptron


The activation function applies a step rule (convert the numerical output into +1 or -1) to
check if the output of the weighting function is greater than zero or not.
For example:
If ∑ wi*xi > 0 => final output "o" = 1 (issue bank loan)
Else, final output "o" = -1 (deny bank loan)
Step function gets triggered above a certain value of the neuron output; else it outputs zero.
Sign Function outputs +1 or -1 depending on whether neuron output is greater than zero or
not. Sigmoid is the S-curve and outputs a value between 0 and 1.
Output of Perceptron
Perceptron with a Boolean output:
Inputs: x1…xn
Output: o(x1….xn)

Weights: wi=> contribution of input xi to the Perceptron output;


w0=> bias or threshold
If ∑w.x > 0, output is +1, else -1. The neuron gets triggered only when weighted input
reaches a certain threshold value.

An output of +1 specifies that the neuron is triggered. An output of -1 specifies that the
neuron did not get triggered.
“sgn” stands for sign function with output +1 or -1.
Error in Perceptron
In the Perceptron Learning Rule, the predicted output is compared with the known output.
If it does not match, the error is propagated backward to allow weight adjustment to
happen.
Perceptron: Decision Function
A decision function φ(z) of Perceptron is defined to take a linear combination of x and w
vectors.

The value z in the decision function is the net input, given by:

z = w1x1 + w2x2 + …… + wmxm = wᵀx

The decision function φ(z) is +1 if z is greater than a threshold θ, and -1 otherwise.

This is the Perceptron algorithm.


Bias Unit
For simplicity, the threshold θ can be brought to the left and represented as w0x0, where
w0= -θ and x0= 1.

The value w0 is called the bias unit.


The decision function then becomes:

z = w0x0 + w1x1 + …… + wmxm = wᵀx, with φ(z) = +1 if z ≥ 0 and -1 otherwise.

The decision function squashes wᵀx to either +1 or -1, and this can be used to discriminate between two linearly separable classes.
Perceptron at a Glance
Perceptron has the following characteristics:
● Perceptron is an algorithm for Supervised Learning of single layer binary linear
classifiers.
● Optimal weight coefficients are automatically learned.
● Weights are multiplied with the input features and decision is made if the neuron
is fired or not.
● Activation function applies a step rule to check if the output of the weighting
function is greater than zero.
● Linear decision boundary is drawn enabling the distinction between the two
linearly separable classes +1 and -1.
● If the sum of the input signals exceeds a certain threshold, it outputs a signal;
otherwise, there is no output.
Types of activation functions include the sign, step, and sigmoid functions.
Implement Logic Gates with Perceptron
Perceptron - Classifier Hyperplane
The Perceptron learning rule converges if the two classes can be separated by the linear
hyperplane. However, if the classes cannot be separated perfectly by a linear classifier, it
could give rise to errors.
As discussed in the previous topic, the classifier boundary for a binary output in a Perceptron is represented by the equation given below:

w · x + b = 0

For a two-input Perceptron, this decision surface is a straight line in the input plane.
Observation:
● In Fig(a) above, examples can be clearly separated into positive and negative
values; hence, they are linearly separable. This can include logic gates like AND,
OR, NOR, NAND.
● Fig (b) shows examples that are not linearly separable (as in an XOR gate).
● Diagram (a) is a set of training examples and the decision surface of a Perceptron
that classifies them correctly.
● Diagram (b) is a set of training examples that are not linearly separable, that is,
they cannot be correctly classified by any straight line.
● X1 and X2 are the Perceptron inputs.
Logic gates are the building blocks of a digital system, especially neural networks. In short,
they are the electronic circuits that help in addition, choice, negation, and combination to
form complex circuits. Using the logic gates, Neural Networks can learn on their own
without you having to manually code the logic. Most logic gates have two inputs and one
output.
Each terminal has one of the two binary conditions, low (0) or high (1), represented by
different voltage levels. The logic state of a terminal changes based on how the circuit
processes data.
Based on this logic, logic gates can be categorized into seven types:
● AND
● NAND
● OR
● NOR
● NOT
● XOR
● XNOR
Implementing Basic Logic Gates With Perceptron
The logic gates that can be implemented with Perceptron are discussed below.
1. AND
If the two inputs are TRUE (+1), the output of Perceptron is positive, which amounts to
TRUE.
This is the desired behavior of an AND gate.
x1= 1 (TRUE), x2= 1 (TRUE)
w0 = -.8, w1 = 0.5, w2 = 0.5
=> o(x1, x2) => -.8 + 0.5*1 + 0.5*1 = 0.2 > 0
2. OR
If either of the two inputs are TRUE (+1), the output of Perceptron is positive, which
amounts to TRUE.
This is the desired behavior of an OR gate.
x1 = 1 (TRUE), x2 = 0 (FALSE)
w0 = -.3, w1 = 0.5, w2 = 0.5
=> o(x1, x2) => -.3 + 0.5*1 + 0.5*0 = 0.2 > 0
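These AND and OR computations can be checked with a tiny script that reuses the same weights, treating w0 as the bias:

def perceptron(x1, x2, w0, w1, w2):
    # Output +1 (TRUE) when the weighted sum exceeds 0, else -1 (FALSE)
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

# AND gate: w0 = -0.8, w1 = w2 = 0.5
for x1 in (0, 1):
    for x2 in (0, 1):
        print("AND", x1, x2, "->", perceptron(x1, x2, -0.8, 0.5, 0.5))

# OR gate: w0 = -0.3, w1 = w2 = 0.5
for x1 in (0, 1):
    for x2 in (0, 1):
        print("OR ", x1, x2, "->", perceptron(x1, x2, -0.3, 0.5, 0.5))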
3. XOR
A XOR gate, also called an Exclusive OR gate, has two inputs and one output.

The gate returns a TRUE as the output if and ONLY if one of the input states is true.
XOR Truth Table

A   B   Output
0   0   0
0   1   1
1   0   1
1   1   0

XOR Gate with Neural Networks


Unlike the AND and OR gate, an XOR gate requires an intermediate hidden layer for
preliminary transformation in order to achieve the logic of an XOR gate.
An XOR gate assigns weights so that XOR conditions are met. It cannot be implemented
with a single layer Perceptron and requires Multi-layer Perceptron or MLP.
H represents the hidden layer, which allows XOR implementation.
I1, I2, H3, H4, O5 are 0 (FALSE) or 1 (TRUE)
t3 = threshold for H3; t4 = threshold for H4; t5 = threshold for O5
H3 = sigmoid(I1*w13 + I2*w23 – t3); H4 = sigmoid(I1*w14 + I2*w24 – t4)
O5 = sigmoid(H3*w35 + H4*w45 – t5)
Sigmoid Activation Function
The diagram below shows a Perceptron with sigmoid activation function. Sigmoid is one of
the most popular activation functions.

A Sigmoid Function is a mathematical function with a Sigmoid Curve ("S" Curve). It is a special case of the logistic function and is defined by the function given below:

f(z) = 1 / (1 + e^(−z))

Here, the value of z is the net input, i.e. the weighted sum of the inputs plus the bias:

z = ∑ wi*xi + b
Sigmoid Curve
The curve of the Sigmoid function is called the "S Curve".

This is called a logistic sigmoid and leads to a probability of the value between 0 and 1.
This is useful as an activation function when one is interested in probability mapping rather
than precise values of input parameter t.
The sigmoid output is close to zero for highly negative inputs. This can be a problem in neural network training, leading to slow learning and the model getting trapped in local minima. Hence, the hyperbolic tangent is often preferred as an activation function in the hidden layers of a neural network.
Sigmoid Logic for Sample Data

For a sample input, the Perceptron output (after the sigmoid) is 0.888, which indicates the probability of the output y being 1.
If the sigmoid outputs a value greater than 0.5, the output is marked as TRUE. Since the
output here is 0.888, the final output is marked as TRUE.
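
A minimal sketch of this computation is given below. The value of z is hypothetical and is chosen only so that the output reproduces the 0.888 quoted above.

import math

def sigmoid(z):
    # Logistic sigmoid: squashes any real z into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

z = 2.07                         # hypothetical net input w0 + w1*x1 + w2*x2
print(round(sigmoid(z), 3))      # ~0.888 -> greater than 0.5, so output is TRUE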
In the next section, let us focus on the rectifier and softplus functions.
Rectifier and Softplus Functions
Apart from the Sigmoid and Sign activation functions seen earlier, other common activation
functions are ReLU and Softplus. They eliminate negative units, because the output of the max
function is 0 for all units that are 0 or less.
A rectifier, or ReLU (Rectified Linear Unit), is a commonly used activation function. It
allows one to eliminate negative units in an ANN and is the most popular
activation function used in deep neural networks.
● A smooth approximation to the rectifier is the Softplus function.
● The derivative of Softplus is the logistic (sigmoid) function.
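
The sketch below (not from the original text) shows ReLU, Softplus, and the fact that the derivative of Softplus is the logistic sigmoid.

import numpy as np

def relu(z):
    # Rectifier: outputs 0 for z <= 0 and z otherwise, eliminating negative units
    return np.maximum(0.0, z)

def softplus(z):
    # Smooth approximation to the rectifier: log(1 + e^z)
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # negative units become 0
print(softplus(z))    # smooth and always positive
print(sigmoid(z))     # equals the derivative of softplus at z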
In the next section, let us discuss the advantages of the ReLu function.
Advantages of ReLU Functions
The advantages of the ReLU function are as follows:
● Allows faster and more effective training of deep neural architectures on large
and complex datasets
● Sparse activation: only about 50% of the units in a neural network are active
(negative units are eliminated)
● One-sided, and hence considered more biologically plausible, compared with the antisymmetry of tanh
● Efficient gradient propagation, which greatly reduces vanishing-gradient problems
● Efficient computation, requiring only comparison, addition, and multiplication
● Scales well

Limitations of ReLU Functions

● Non-differentiable at zero - being non-differentiable at zero means that values close to
zero may give inconsistent or intractable results.
● Non-zero centered - being non-zero centered creates asymmetry around the data
(only positive values are handled), leading to uneven handling of the data.
● Unbounded - the output value has no limit and can lead to computational issues
when large values are passed through.
● Dying ReLU problem - when the learning rate is too high, ReLU neurons can
become inactive and "die."
Softmax Function
Another very popular activation function is the Softmax function. The Softmax outputs the
probability of the result belonging to a certain set of classes. It is akin to a categorization
logic at the end of a neural network. For example, it may be used at the end of a neural
network that is trying to determine if the image of a moving object contains an animal, a
car, or an airplane.
In Mathematics, the Softmax or normalized exponential function is a generalization of the
logistic function that squashes a K-dimensional vector of arbitrary real values to a
K-dimensional vector of real values in the range (0, 1) that add up to 1.
In probability theory, the output of the Softmax function represents a probability
distribution over K different outcomes.
In Softmax, the probability that a particular sample with net input z belongs to the i-th class
is computed with a normalization term in the denominator, that is, the sum over all M classes:

P(y = i | z) = e^(z_i) / (e^(z_1) + e^(z_2) + ... + e^(z_M))

The Softmax function is used in ANNs and Naïve Bayes classifiers.


For example, if we take an input of [1, 2, 3, 4, 1, 2, 3], the Softmax of that is [0.024, 0.064, 0.175,
0.475, 0.024, 0.064, 0.175]. The output places most of its weight where the original input is '4'.
This function is normally used for:
● Highlighting the largest values
● Suppressing values that are significantly below the maximum value.
The Softmax function is demonstrated below: the code implements the softmax formula and prints the
probability of belonging to each class, and the sum of the probabilities across all classes is 1.
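
Since the original code listing is not reproduced here, the sketch below is a stand-in implementation of the same idea. The [1, 2, 3, 4, 1, 2, 3] example reproduces the values quoted above, and a hypothetical three-class score vector shows that the probabilities sum to 1.

import numpy as np

def softmax(z):
    # Normalized exponential; subtracting the max improves numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(np.round(softmax(np.array([1, 2, 3, 4, 1, 2, 3], dtype=float)), 3))
# -> [0.024 0.064 0.175 0.475 0.024 0.064 0.175]

scores = np.array([2.0, 1.0, 0.1])      # hypothetical net inputs for three classes
probs = softmax(scores)
print(np.round(probs, 3), probs.sum())  # probabilities sum to 1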
Let us talk about Hyperbolic functions in the next section.
Hyperbolic Functions
1. Hyperbolic Tangent
Hyperbolic or tanh function is often used in neural networks as an activation function. It
provides output between -1 and +1. This is an extension of logistic sigmoid; the difference
is that output stretches between -1 and +1 here.
The advantage of the hyperbolic tangent over the logistic function is that it has a broader
output spectrum and ranges in the open interval (-1, 1), which can improve the
convergence of the backpropagation algorithm.
2. Other Hyperbolic Activation Functions
Apart from tanh, other hyperbolic functions such as sinh and cosh can, in principle, also be used as
activation functions.

Based on the desired output, a data scientist can decide which of these activation functions
need to be used in the Perceptron logic.
3. Hyperbolic Tangent in Code

The code below implements the tanh formula and then calls both the logistic and tanh functions on the
same z values. The tanh function has an output space twice as large as that of the logistic function.

With a larger output space and symmetry around zero, the tanh function leads to a more
even handling of the data, and it is easier for training to make progress on the loss function.
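
As with the softmax listing, the original code image is not reproduced here; the sketch below is a stand-in that evaluates both functions on the same inputs to show the doubled output range.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # tanh(z) = 2 * logistic(2z) - 1, stretching the output to (-1, 1)
    return np.tanh(z)

z = np.linspace(-3, 3, 7)
print(np.round(logistic(z), 3))   # values in (0, 1)
print(np.round(tanh(z), 3))       # values in (-1, 1), i.e. twice the range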
Activation Functions at a Glance
Various activation functions that can be used with Perceptron are shown below:

The activation function to be used is a subjective decision taken by the data scientist, based
on the problem statement and the form of the desired results. If the learning process is
slow or has vanishing or exploding gradients, the data scientist may try to change the
activation function to see if these problems can be resolved.

The Radial Basis (RBF) kernel is a kernel function that is used in machine learning to find a
non-linear classifier or regression line.

What is a Kernel Function?

A kernel function implicitly transforms an n-dimensional input into an m-dimensional feature space,
where m is much higher than n, and computes the dot product in that higher-dimensional space
efficiently. The main idea behind using a kernel is this: a linear classifier or regression curve in the
higher-dimensional space corresponds to a non-linear classifier or regression curve in the original,
lower-dimensional space.

Mathematical Definition of the Radial Basis Kernel:

K(x, x') = exp( -||x - x'||^2 / (2 * sigma^2) )

where x and x' are points in any fixed-dimensional space.

If we expand the exponential expression, it contains terms up to infinite powers of x and x',
because the series expansion of e^x has infinitely many terms. The RBF kernel therefore
corresponds to an implicit feature space of infinite dimension.

If we apply an algorithm such as the Perceptron algorithm or linear regression with this
kernel, we are effectively applying the algorithm to the new infinite-dimensional data points
we have created. The result is a hyperplane in that infinite-dimensional space, which gives a
very strong non-linear classifier or regression curve when mapped back to our original
dimensions.

So, although we are applying a linear classifier/regression, it yields a non-linear classifier
or regression curve that behaves like a polynomial of unbounded degree. For this reason the
Radial Basis kernel is a very powerful kernel, which can fit a curve to very complex datasets.
Why is the Radial Basis Kernel so powerful?

The main motive of using a kernel is to do the calculations in a d-dimensional space with d > 1,
so that we can obtain a quadratic, cubic, or higher-degree polynomial equation for our
classification/regression line. Since the Radial Basis kernel uses an exponential, and the
expansion of e^x gives a polynomial of unbounded degree, the resulting regression/classification
line is extremely flexible.

Complex datasets can be fitted easily using the RBF kernel:
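
A minimal sketch (not from the original text) of fitting a non-linearly separable dataset with an RBF-kernel SVM in scikit-learn; the make_moons dataset and the gamma/C values are illustrative choices.

from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=500, noise=0.2, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# gamma controls the width of the Gaussian; C trades off margin vs. errors
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0)
rbf_svm.fit(X_train, y_train)
print("Test accuracy:", rbf_svm.score(X_test, y_test))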


Probabilistic Models
Bayesian Belief Network

A Bayesian belief network is a key technique for dealing with probabilistic events and for
solving problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model.

Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.

Bayesian Network can be used for building models from data and experts opinions, and it
consists of two parts:

● Directed Acyclic Graph

● Table of conditional probabilities.

The generalized form of Bayesian network that represents and solves decision problems
under uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and arcs (directed links), where:
● Each node corresponds to a random variable, and a variable can be continuous or
discrete.

● Arcs, or directed arrows, represent the causal relationships or conditional probabilities
between random variables. These directed links or arrows connect pairs of nodes
in the graph.
A link represents that one node directly influences the other node; if there is
no directed link between two nodes, they are independent of each other.

○ In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.

○ If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of node B.

○ Node C is independent of node A.

The Bayesian network has mainly two components:


● Causal Component

● Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which determines the effect of the parents on that node.

The Bayesian network is based on Joint probability distribution and conditional probability.
So let's first understand the joint probability distribution:

Joint probability distribution:

If we have variables x1, x2, x3, ..., xn, then the probability of a particular combination of
x1, x2, x3, ..., xn is given by the joint probability distribution.

P[x1, x2, x3, ..., xn] can be factorized using the chain rule of probability:

P[x1, x2, x3, ..., xn] = P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
                      = P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]

In a Bayesian network, for each variable Xi this simplifies to:

P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
responds reliably to a burglary, but it also responds to minor earthquakes. Harry
has two neighbors, David and Sophia, who have taken on the responsibility of informing Harry at
work when they hear the alarm. David always calls Harry when he hears the alarm, but
sometimes he confuses the phone ringing with the alarm and calls then too. On the other
hand, Sophia likes to listen to loud music, so sometimes she misses the alarm. Here
we would like to compute the probability of the burglary alarm going off.

Problem:

Calculate the probability that the alarm has sounded, but there is neither a burglary, nor an
earthquake occurred, and David and Sophia both called Harry.

Solution:
● The Bayesian network for the above problem is given below. The network structure
shows that Burglary and Earthquake are the parent nodes of Alarm and directly
affect the probability of the alarm going off, whereas David's and Sophia's calls depend
only on the alarm.

● The network represents the assumptions that David and Sophia do not directly perceive the
burglary, do not notice minor earthquakes, and do not confer with each other
before calling.

● The conditional distribution for each node is given as a conditional probability
table, or CPT.

● Each row in a CPT must sum to 1 because the entries in the row
represent an exhaustive set of cases for the variable.

● In a CPT, a boolean variable with k boolean parents has 2^k rows of probabilities. Hence, if
there are two parents, the CPT will contain 4 rows of probability values.

List of all events occurring in this network:

● Burglary (B)

● Earthquake(E)

● Alarm(A)

● David Calls(D)

● Sophia calls(S)

We can write the events of problem statement in the form of probability: P[D, S, A, B, E], can
rewrite the above probability statement using joint probability distribution:

P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]

=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]

= P [D| A]. P [ S| A, B, E]. P[ A, B, E]

= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]


= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True)= 0.001, which is the probability of a minor earthquake

P(E= False)= 0.999, Which is the probability that an earthquake did not occur.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The conditional probability of Alarm A depends on Burglary and Earthquake:

B       E       P(A = True)    P(A = False)
True    True    0.94           0.06
True    False   0.95           0.05
False   True    0.31           0.69
False   False   0.001          0.999

Conditional probability table for David's calls:

The conditional probability that David calls depends on the probability of the Alarm.

A       P(D = True)    P(D = False)
True    0.91           0.09
False   0.05           0.95

Conditional probability table for Sophia's calls:

The conditional probability that Sophia calls depends on its parent node, "Alarm".

A       P(S = True)    P(S = False)
True    0.75           0.25
False   0.02           0.98

From the formula for the joint distribution, we can write the problem statement as:

P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ∧ ¬E) * P(¬B) * P(¬E)

= 0.75 * 0.91 * 0.001 * 0.998 * 0.999

= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
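
The query above can be reproduced with a short script. The sketch below simply encodes the CPTs from the tables and multiplies the factors of the joint distribution; it is an illustration, not a general-purpose Bayesian network library.

# Encode the CPTs from the tables above
P_B = {True: 0.002, False: 0.998}                     # P(B)
P_E = {True: 0.001, False: 0.999}                     # P(E)
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}    # P(A=True | B, E)
P_D = {True: 0.91, False: 0.05}                       # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                       # P(S=True | A)

# P(S, D, A, not B, not E) = P(S|A) * P(D|A) * P(A|not B, not E) * P(not B) * P(not E)
prob = P_S[True] * P_D[True] * P_A[(False, False)] * P_B[False] * P_E[False]
print(prob)   # ~0.00068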

The semantics of Bayesian networks:

There are two ways to understand the semantics of a Bayesian network:

1. To understand the network as a representation of the joint probability distribution.
This view is helpful for understanding how to construct the network.

2. To understand the network as an encoding of a collection of conditional independence
statements.
This view is helpful for designing inference procedures.

Maximum Likelihood Estimation

Suppose that we have a random sample from a population of interest. We may have a
theoretical model for the way that the population is distributed. However, there may be
several population parameters of which we do not know the values. Maximum likelihood
estimation is one way to determine these unknown parameters.

The basic idea behind maximum likelihood estimation is that we determine the values of
these unknown parameters. We do this in such a way to maximize an associated joint
probability density function or probability mass function. We will see this in more detail in
what follows. Then we will calculate some examples of maximum likelihood estimation.

Steps for Maximum Likelihood Estimation

The above discussion can be summarized by the following steps:

1. Start with a sample of independent random variables X1, X2, ..., Xn from a
common distribution, each with probability density function f(x; θ1, ..., θk). The
thetas are unknown parameters.
2. Since our sample is independent, the probability of obtaining the specific sample
that we observe is found by multiplying our probabilities together. This gives us the
likelihood function L(θ1, ..., θk) = f(x1; θ1, ..., θk) f(x2; θ1, ..., θk) ... f(xn; θ1, ..., θk)
= Π f(xi; θ1, ..., θk).
3. Next, we use Calculus to find the values of theta that maximize our likelihood
function L.
4. More specifically, we differentiate the likelihood function L with respect to θ if
there is a single parameter. If there are multiple parameters we calculate partial
derivatives of L with respect to each of the theta parameters.
5. To continue the process of maximization, set the derivative of L (or partial
derivatives) equal to zero and solve for theta.
6. We can then use other techniques (such as a second derivative test) to verify that
we have found a maximum for our likelihood function.
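
The steps above can also be carried out numerically when a closed-form solution is inconvenient. The sketch below (with hypothetical data, not the exercise data) maximizes the log-likelihood of an exponential(λ) model by minimizing the negative log-likelihood with SciPy, and compares the result with the closed-form estimate λ_hat = 1 / (sample mean).

import numpy as np
from scipy.optimize import minimize_scalar

lifetimes = np.array([1.5, 2.2, 0.8, 3.1])   # hypothetical observed lifetimes

def neg_log_likelihood(lam):
    # For exponential(lam): log L = n*log(lam) - lam * sum(x)
    return -(len(lifetimes) * np.log(lam) - lam * lifetimes.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print(res.x)                   # numerical MLE of lambda
print(1 / lifetimes.mean())    # closed-form MLE for comparison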

A coin is flipped 100 times. Given that there were 55 heads, find the maximum likelihood
estimate for the probability p of heads on a single toss

Suppose that the lifetime of Badger brand light bulbs is modeled by an exponential
distribution with (unknown) parameter λ. We test 5 bulbs and find they have lifetimes of 2,
3, 1, 3, and 4 years, respectively. What is the MLE for λ?

Suppose 10 animals are captured, tagged and released. A few months later, 20 animals are
captured, examined, and released. 4 of these 20 are found to be tagged. Estimate the size of
the wild population using the MLE for the probability that a wild animal is tagged.

Suppose we have two light bulbs whose lifetimes follow an exponential(λ) distribution.
Suppose also that we independently measure their lifetimes and get data x1 = 2 years and
x2 = 3 years. Find the value of λ that maximizes the probability of this data.

Activity : Run the code available in the following link and share the output in github.

https://python.quantecon.org/mle.html

Clustering Algorithms

Clustering algorithms are used to group data points based on certain similarities. There is
no single criterion for what counts as a good clustering; it depends mainly on the specific user
and the scenario. Clustering determines the grouping from unlabelled data.

Typical cluster models include:

● Connectivity models – like hierarchical clustering, which builds models based on


distance connectivity.
● Centroid models – like K-Means clustering, which represents each cluster with a
single mean vector.
● Distribution models – here, clusters are modeled using statistical distributions.
● Density models – like DBSCAN and OPTICS, which define clustering as a connected
dense region in data space.
● Group models – these models don’t provide refined results. They only offer grouping
information.
● Graph-based models – a subset of nodes in the graph such that an edge connects
every two nodes in the subset can be considered as a prototypical form of cluster.
● Neural models – self-organizing maps are one of the most commonly known
Unsupervised Neural networks (NN), and they’re characterized as similar to one or
more models above.

Note, there are different types of clustering:

● Hard clustering – the data point either entirely belongs to the cluster, or doesn’t. For
example, consider customer segmentation with four groups. Each customer can
belong to either one of four groups.
● Soft clustering – a probability score is assigned to data points to be in those clusters.

Types of clustering algorithms and how to select one for your use case

Hierarchical clustering algorithms (connectivity-based clustering)

The main idea of hierarchical clustering is based on the concept that nearby objects are
more related than objects that are farther away. Let’s take a closer look at various aspects of
these algorithms:

● The algorithms connect “objects” to form “clusters” based on their distance.


● A cluster can be defined by the max distance needed to connect to the parts of the
cluster.
● Dendrograms can represent different clusters formed at different distances,
explaining where the name “hierarchical clustering” comes from. These algorithms
provide a hierarchy of clusters that at certain distances are merged.
● In the dendrogram, the y-axis marks the distance at which clusters merge. The
objects are placed beside the x-axis such that clusters don’t mix.
Hierarchical clustering is a family of methods that compute distance in different ways.
Popular choices are known as single-linkage clustering, complete linkage clustering, and
UPGMA. Furthermore, hierarchical clustering can be:

1. Agglomerative – it starts with an individual element and then groups them into
single clusters.
2. Divisive – it starts with a complete dataset and divides it into partitions.

Agglomerative hierarchical clustering (AHC)

In this section, I’ll be explaining the AHC algorithm which is one of the most important
hierarchical clustering techniques. The steps to do it are:

1. Each data point is treated as a single cluster, so we start with K clusters, where K is
the number of data points.
2. Form a bigger cluster by joining the two closest data points; this leaves a total of
K-1 clusters.
3. Join the two closest clusters to form a new cluster; this results in K-2 clusters in total.
4. Repeat the above steps until all points have been merged into one big cluster and no
more data points are left to join.
5. Once one big cluster has been formed, we can use the dendrogram to split it into
multiple clusters, depending on the use case.

The image below gives an idea of the hierarchical clustering approach.


Advantages of AHC:
● AHC is easy to implement, it can also provide object ordering, which can be
informative for the display.
● We don’t have to pre-specify the number of clusters. It’s easy to decide the number
of clusters by cutting the dendrogram at the specific level.
● In the AHC approach smaller clusters will be created, which may uncover
similarities in data.

Disadvantages of AHC:

● Objects that are grouped wrongly in the early steps cannot be reassigned later.
● Hierarchical clustering algorithms don't provide a unique partitioning of the dataset;
instead they give a hierarchy from which clusters can be chosen.
● They don't handle outliers well. Outliers tend to end up as new clusters of their own,
or are sometimes merged into other clusters.

The Agglomerative Hierarchical Cluster Algorithm is a form of bottom-up clustering, where


each data point is assigned to a cluster. Those clusters then get connected together. Similar
clusters are merged at each iteration until all the data points are part of one big root cluster.

Clustering dataset

Getting started with clustering in Python through Scikit-learn is simple. Once the library is
installed, a variety of clustering algorithms can be chosen.

We will be using the `make_classification` function to generate a data set from the `sklearn`
library to demonstrate the use of different clustering algorithms. The `make_classification`
function accepts the following arguments:

● The number of samples.


● A total number of features.
● The number of informative features.
● The number of redundant features.
● The number of duplicate features drawn randomly from redundant and informative
features.
● The number of clusters per class.
from numpy import where
from numpy import unique
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plot

# Initializing data
train_data, _ = make_classification(n_samples=1000,
                                    n_features=2,
                                    n_informative=2,
                                    n_redundant=0,
                                    n_clusters_per_class=1,
                                    random_state=4)

agg_mdl = AgglomerativeClustering(n_clusters=4)

# Each data point is assigned to a cluster
agg_result = agg_mdl.fit_predict(train_data)

# Obtain all the unique clusters
agg_clusters = unique(agg_result)

# Plot the clusters
for agg_cluster in agg_clusters:
    # Fetch the data points that fall in this cluster
    index = where(agg_result == agg_cluster)
    plot.scatter(train_data[index, 0], train_data[index, 1])

# Agglomerative hierarchy plot
plot.show()

Clusters obtained by Hierarchical Cluster Algorithm

Hierarchical clustering is often used for descriptive modeling rather than
predictive modeling. It doesn't scale well to large datasets and provides the best results only in
some cases. Sometimes it is also difficult to read off the correct number of clusters from a
dendrogram.

Centroid-based clustering algorithms / Partitioning clustering algorithms

In centroid/partitioning clustering, clusters are represented by a central vector, which may
not necessarily be a member of the dataset. In this type of clustering, too, the value
of K needs to be chosen. This is an optimization problem: finding the number of centroids
(the value of K) and assigning the objects to nearby cluster centers so that the sum of
squared distances of the points from their cluster centers is minimized.

One of the most widely used centroid-based clustering algorithms is K-Means, and one of its
drawbacks is that you need to choose a K value in advance.

K-Means clustering algorithm

The K-Means algorithm splits the given dataset into a predefined(K) number of clusters
using a particular distance metric. The center of each cluster/group is called the centroid.
How does the K-Means algorithm work?

Let’s see how the K-Means algorithm works:

● Initially, a number K of centroids is chosen. There are different methods for selecting
the right value of K.
● Shuffle the data and initialize the centroids: randomly select K data points as centroids,
without replacement.
● Assign each data point to its nearest centroid; K-Means clustering uses the Euclidean
distance to measure the distance between points and centroids.
● Create new centroids by calculating the mean value of all the samples assigned to
each previous centroid.
● Repeat the assignment and update steps until the centroids no longer change, i.e. until
the assignment of data points to clusters stops changing.
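
For consistency with the other clustering listings in this unit, the sketch below (not from the original text) runs K-Means on the same kind of synthetic dataset.

from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
import matplotlib.pyplot as plot

train_data, _ = make_classification(n_samples=1000, n_features=2,
                                    n_informative=2, n_redundant=0,
                                    n_clusters_per_class=1, random_state=4)

# Fit K-Means with a predefined number of clusters
kmeans_mdl = KMeans(n_clusters=4, n_init=10, random_state=4)
kmeans_res = kmeans_mdl.fit_predict(train_data)

# Plot each cluster with a different colour
for cluster in unique(kmeans_res):
    index = where(kmeans_res == cluster)
    plot.scatter(train_data[index, 0], train_data[index, 1])
plot.show()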


Note: An example of K-Means clustering is explained with customer segmentation examples in


the use cases section below.
There are two methods to choose the correct value of K: Elbow and Silhouette.

The Elbow method

The Elbow method tries a range of K values and picks the best among them. It calculates
the Within-Cluster Sum of Squares (WCSS) for different values of K, i.e. the sum of squared
distances of the points from their cluster centroids:

WCSS = Σi (Xi - Yi)^2

where Yi is the centroid for the observation Xi. The value of K should be chosen at the point
where WCSS stops diminishing rapidly; in the plot of WCSS versus K, this shows up as an
elbow.

The Silhouette method

The Silhouette score/coefficient (SC) is calculated for each sample from the average
intra-cluster distance (m) and the average distance to the nearest other cluster (n):

SC = (n - m) / max(m, n)

Here, n is the average distance between a data point and the points of the nearest cluster that the
data point is not part of. We can calculate the average SC over all the samples and use this as a
metric to decide the number of clusters.

The SC value ranges between -1 and 1. A value near 1 means the clusters are well separated and
distinguishable; a value near -1 means points have been assigned to the wrong clusters.
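
Both methods are easy to compute with scikit-learn. The sketch below (an illustration, not from the original text) prints the WCSS (exposed as inertia_) and the average silhouette score for a range of K values; one would look for the elbow in the inertia values and for the highest silhouette score.

from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

train_data, _ = make_classification(n_samples=1000, n_features=2,
                                    n_informative=2, n_redundant=0,
                                    n_clusters_per_class=1, random_state=4)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=4).fit(train_data)
    sc = silhouette_score(train_data, km.labels_)
    print(k, round(km.inertia_, 1), round(sc, 3))   # WCSS and silhouette for each K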

These are some of the advantages K-Means poses over other algorithms:

● It’s straightforward to implement.


● It’s scalable to massive datasets and also faster for large datasets.
● It adapts easily to new examples.

K-Medians is another clustering algorithm related to K-Means, except that the cluster
centers are recomputed using the median instead of the mean. The K-Medians algorithm is less
sensitive to outliers, but it is slower for large datasets because sorting is required to calculate
the median vector.

K-Means has some disadvantages: the algorithm may produce different clustering results on
different runs, because K-Means begins with a random initialization of the cluster centers,
so the results obtained might not be repeatable.

Other drawbacks posed by the K-Means algorithm are:

● K-Means clustering is good at capturing the structure of the data if the clusters have
a roughly spherical shape; it always tries to construct a nice spherical region around the
centroid. This means that as soon as the clusters have different geometric shapes,
K-Means does a poor job of clustering the data.
● K-Means implicitly assumes that the points in a cluster lie close to its centroid, so it
handles poorly clusters whose members are spread far from one another.
● The K-Means algorithm is sensitive to outliers.
● As the number of dimensions increases, scalability decreases.

Here are some points to remember when using K-Means for clustering:

● Standardize the data when applying the K-Means algorithm, because it will help you
get good-quality clusters and improve the performance of the clustering algorithm.
Since K-Means uses a distance-based measure to find the similarity between data
points, it is good to standardize the data to have a standard deviation of one and a
mean of zero. Usually, the features present in a dataset have different units
of measurement, for example income versus age.
● K-Means gives more weight to bigger clusters.
● The elbow method used to select the number of clusters doesn't always work well,
because the error function decreases for every increase in K.
● If there’s overlap between clusters, K-Means doesn’t have an intrinsic measure for
uncertainty for the examples belonging to the overlapping region to determine
which cluster to assign each data point.
● K-Means clusters data even if it can’t be clustered, such as data that comes from
uniform distributions.

Mini-Batch K-Means clustering algorithm

K-Means is one of the popular clustering algorithms, mainly because of its good time
performance. When the size of the data set increases, K-Means will result in a memory issue
since it needs the entire dataset. For those reasons, to reduce the time and space
complexity of the algorithm, an approach called Mini-Batch K-Means was proposed.

The Mini-Batch K-Means algorithm tries to fit the data in the main memory in a way where
the algorithm uses small batches of data that are of fixed size chosen at random. Here are a
couple of points to note about the Mini-Batch K-Means algorithm:

● Clusters are updated (depending on the previous location of cluster centroids) in


each iteration by obtaining new arbitrary samples from the dataset, and these steps
are repeated until convergence.
● Some of the research suggests that this method saves significant computational time
with a trade-off, a little bit of loss in cluster quality. But intense research hasn’t been
done to quantify the number of clusters or their size that will impact the cluster
quality.

● The location of the clusters is updated based on the new points from each batch.
● The update made is the gradient descent update, which is notably faster than normal
batch K-Means.
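
A minimal sketch (not from the original text) of Mini-Batch K-Means with scikit-learn; the batch_size value is an illustrative choice.

from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans

train_data, _ = make_classification(n_samples=1000, n_features=2,
                                    n_informative=2, n_redundant=0,
                                    n_clusters_per_class=1, random_state=4)

# batch_size controls how many samples are drawn per update step
mbk = MiniBatchKMeans(n_clusters=4, batch_size=100, n_init=10, random_state=4)
labels = mbk.fit_predict(train_data)
print(labels[:10])   # cluster assignment of the first few points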

Density-based clustering algorithms

Density-based clustering connects areas of high example density into clusters. This allows
for arbitrary shape distributions as long as dense regions are connected. With the higher
dimension data and data of varying densities, these algorithms run into issues. By design,
these algorithms don’t assign outliers to clusters.

DBSCAN

The most popular density-based method is Density-Based Spatial Clustering of Applications


with Noise (DBSCAN). It features a well-defined cluster model called “density reachability”.

This type of clustering technique connects data points that satisfy particular density
criteria (minimum number of objects within a radius). After DBSCAN clustering is
complete, there are three types of points: core, border, and noise.

A core point is a point that has at least some number (m) of points within a
particular distance (n) from itself. A border point is a point that has at least one core point within
distance n.

Noise is a point that is neither a border point nor a core point. Data points in sparse areas, which
are needed to separate clusters, are treated as noise or border points.

DBSCAN uses two parameters to determine how clusters are defined:

● minPts: for a region to be considered dense, the minimum number of points


required is `minPts`.
● eps: to locate data points in the neighborhood of any points, `eps (ε)` is used as a
distance measure.

Here’s a step-by-step explanation of the DBSCAN algorithm:

● DBSCAN starts with a random data point (non-visited points).


● The neighborhood of this point is extracted using a distance epsilon ε.
● The clustering procedure starts if there are sufficient data points within this area;
the current data point becomes the first point of the new cluster. Otherwise, the
point is marked as noise and as visited.
● Every point within the ε-distance neighborhood of the first point in the new cluster
also becomes part of the same cluster. This step of adding all the points in the
ε-neighborhood is then repeated for every new data point added to the cluster.
● The above two steps are repeated until all points in the cluster are determined, that is,
until all points within the ε-neighborhood of the cluster have been visited and labeled. Once
we are done with the current cluster, a new unvisited point is retrieved and
processed, leading to the discovery of a further cluster or of noise. The procedure is
repeated until all the data points are marked as visited.

An exciting property of DBSCAN is its low complexity. It requires a linear number of range
queries on the database.

The main problem with DBSCAN is:

● It expects some kind of density drop to detect cluster borders. DBSCAN connects
areas of high example density. The algorithm is better than K-Means when it comes
to oddly shaped data.

One great thing about the DBSCAN algorithm is:

● It doesn’t require a predefined number of clusters. It also identifies noise and


outliers. Furthermore, arbitrarily sized and shaped clusters are found pretty well by
the algorithm.

from numpy import where
from numpy import unique
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plot

# Initialize data
train_data, _ = make_classification(n_samples=1000,
                                    n_features=2,
                                    n_informative=2,
                                    n_redundant=0,
                                    n_clusters_per_class=1,
                                    random_state=4)

# Define the model
dbscan_model = DBSCAN(eps=0.25, min_samples=9)

# Train the model
dbscan_model.fit(train_data)

# Assign each data point to a cluster
dbscan_res = dbscan_model.fit_predict(train_data)

# Obtain all the unique clusters
dbscan_clstrs = unique(dbscan_res)

# Plot the DBSCAN clusters
for dbscan_clstr in dbscan_clstrs:
    # Obtain the data points that belong to this cluster
    index = where(dbscan_res == dbscan_clstr)
    plot.scatter(train_data[index, 0], train_data[index, 1])

# Show the plot
plot.show()

Clusters obtained by DBSCAN Cluster Algorithm

Distribution-based clustering algorithms

The clustering model that’s closely related to statistics is based on distribution models.
Clusters can then be defined as objects that belong to the same distribution. This approach
closely resembles how artificial datasets are generated, by sampling random objects from
distribution.

While the theoretical aspects of these methods are pretty good, these models suffer from
overfitting.

Gaussian mixture model


Gaussian mixture model (GMM) is one of the types of distribution-based clustering. These
clustering approaches assume data is composed of distributions, such as Gaussian
distributions. In the figure below, the distribution-based algorithm clusters data into three
Gaussian distributions. As the distance from the distribution increases, the probability that
the point belongs to the distribution decreases.

GMM can be used to find clusters in the same way as K-Means. The probability that a point
belongs to a distribution decreases as its distance from the distribution's center
increases; the bands in the figure show this decrease in probability. Since GMM
contains a probabilistic model under the hood, we can also obtain probabilistic (soft) cluster
assignments. When you don't know the type of distribution in the data, you should use a
different algorithm.

Example of distribution-based clustering

Let’s see how GMM calculates probabilities and assigns it to data point:

● A Gaussian mixture is a function composed of several Gaussians, each identified by k ∈ {1, …, K},
where K is the number of clusters. Each Gaussian k in the mixture is described by the
following parameters:
○ A mean μ that defines its center.
○ A covariance Σ that defines its width.
○ A mixing probability π that defines how big or small the Gaussian
component is.

These parameters can be seen in the below image:


To find the covariance, mean, variance and weights of clusters, GMM uses the Expectation
Maximization technique.

Suppose we need to fit K clusters, meaning K Gaussian distributions, with
mean and covariance values μ1, μ2, ..., μK and Σ1, Σ2, ..., ΣK. There is another
parameter, πi, which represents the mixing weight (density) of each distribution.

To define the Gaussian distributions we need to find the values of these parameters. We have
already decided on the number of clusters and assigned initial values for the means,
covariances, and mixing weights. Next come the Expectation step and the Maximization step of
the EM algorithm.

Advantages of GMM

● One advantage of GMM over K-Means is that K-Means doesn't account for
variance (here, variance refers to the width of the bell-shaped curve), whereas GMM
returns the probability that a data point belongs to each of the K clusters.
● When clusters overlap, hard-assignment algorithms such as K-Means struggle, since
each point must be placed in exactly one cluster.
● GMM uses a probabilistic approach and provides, for each data point, the probability
of belonging to each cluster.

Disadvantages of GMM

● Mixture models are computationally expensive if the number of distributions is
large or if the dataset contains few observed data points.
● GMM needs large datasets, and it is hard to estimate the number of clusters.

Let’s now look at how GMM clusters data. The code below helps you to:

● create the data,


● fit the data to the `GaussianMixture` model,
● find the data points that are assigned to a cluster,
● obtain the unique clusters, and
● plot the clusters as shown below.

from numpy import where
from numpy import unique
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plot

# Initialize data
train_data, _ = make_classification(n_samples=1200,
                                    n_features=3,
                                    n_informative=2,
                                    n_redundant=0,
                                    n_clusters_per_class=1,
                                    random_state=4)

gaussian_mdl = GaussianMixture(n_components=3)

# Model training
gaussian_mdl.fit(train_data)

# Data points assigned to a cluster
gaussian_res = gaussian_mdl.fit_predict(train_data)

# Get the unique clusters (from the GMM result, not the DBSCAN result)
gaussian_clstr = unique(gaussian_res)

# Plot the clusters
for gaussian_cluser in gaussian_clstr:
    index = where(gaussian_res == gaussian_cluser)
    plot.scatter(train_data[index, 0], train_data[index, 1])

# Show the plot
plot.show()

Clusters obtained by the Gaussian Mixture Model algorithm

Applications of clustering in different fields

Some of the domains in which clustering can be applied are:

● Marketing: customer segment discovery.


● Library: to cluster different books based on topics and information.
● Biology: classification among different species of plants and animals.
● City planning: analyze the value of houses based on location.
● Document Analysis: various research data and documents can be grouped according
to certain similarities. Labeling large data is really difficult. Clustering can be helpful
in these cases to cluster text & group it into various categories. Unsupervised
techniques like LDA are also beneficial in these cases to find hidden topics in a large
corpus.

Issues with the unsupervised modeling approach

These are some issues you may encounter when applying clustering techniques:

● The results may be less accurate since data isn’t labeled in advance and input data
isn’t known.
● The learning phase of the algorithm might take a lot of time as it calculates and
analyses all possibilities.
● Without any prior knowledge the model is learning from raw data.
● As the number of features increases, complexity increases.
● Some projects involving live data may require continuous data feeding to the model,
resulting in time-consuming and inaccurate results.

Factors to consider when choosing clustering algorithms

● Choose a clustering algorithm that scales well to your dataset. Not all
clustering algorithms scale efficiently, and datasets in machine learning can have
millions of examples.
● Many clustering algorithms work by computing the similarity between all pairs of
examples, so their run time grows with the square of the number of samples n, denoted
O(n^2) in complexity notation. O(n^2) isn't practical when the number of
examples is in the millions. K-Means, by contrast, has
complexity O(n), meaning the algorithm scales linearly with n.

Multiple Choice Questions

1. Which of the following metrics are used to evaluate classification models?

A. Area under the ROC curve

B. F1 score

C. Confusion matrix

D. All of the above

2. Which one is a classification algorithm?

A. Logistic regression

B. Linear regression

C. Polynomial regression
D. None

3. Classification is-

A. Unsupervised learning

B. Reinforcement learning

C. Supervised learning

D. None

4. You have a dataset of different flowers containing their petal lengths and color. Your
model has to predict the type of flower for given petal lengths and color. This is a-

A. Regression task

B. Classification task

C. Clustering task

D. None

5. A classifier-

A. Inputs a vector of continuous values and outputs a single discrete value

B. Inputs a vector of discrete values and outputs a single discrete value

C. Both A and B

D. None

6. Classification is appropriate when you-

A. Try to predict a continuous valued output

B. Try to predict a class or discrete output

C. Both A and B for different contexts

D. None

7. With the help of a confusion matrix, we can compute-


A. Recall

B. Precision

C. Accuracy

D. All of the above

8. What does recall refer to in classification?

A. The proportion of all the relevant data points

B. The proportion of only the correct data points

C. The proportion of all data points

D. None

9. False negatives are-

A. Predicted negatives that are actually positives

B. Predicted positives that are actually negatives

C. Predicted negatives that are actually negatives

D. Predicted positives that are actually positives

10. Suppose your classification model predicted true for a class which actual value was
false. Then this is a-

A. False positive

B. False negative

C. True positive

D. True negative

11. The false negative value is 5 and the true positive value is 20. What will be the value of
recall-

A. 0.2

B. 0.6
C. 0.8

D. 0.3

12. The true positive value is 10 and the false positive value is 15. Calculate the value of
precision-

A. 0.6

B. 0.4

C. 0.5

D. None

13. If the precision is 0.6 and the recall value is 0.4. What will be the f measure?

A. 0.5

B. 0.6

C. 0.4

D. 0.3

14. Which one is a different algorithm?

A. Logistic Regression

B. Support Vector Regression

C. Polynomial Regression

D. None

15. What is a support vector?

A. The distance between any two data points

B. The average distance between all the data points

C. The distance between two boundary data points

D. The minimum distance between any two data points


16. Which of the following is a lazy learning algorithm?

A. SVM

B. KNN

C. Decision tree

D. All of the above

17. Which of the following is not a lazy learning algorithm?

A. SVM

B. Decision tree

C. Random forest

D. All of the above

18. What is the most widely used distance metric in KNN?

A. Euclidean distance

B. Manhattan distance

C. Perpendicular distance

D. All of the above

19. Which of the following is the best algorithm for text classification?

A. KNN

B. Decision tree

C. Random forest

D. Naive Bayes

20. What does k stand for in the KNN algorithm?

A. Number of neighbors

B. Number of output classes


C. Number of input features

D. None

21. Support Vector Machine is-

A. a discriminative classifier

B. a lazy learning classifier

C. a probabilistic classifier

D. None

22. What are hyperplanes?

A. Decision boundaries

B. Decision functions

C. Mapping functions

D. None

23. What is a kernel?

A. A function that calculates the distance of two boundary data points

B. A function that maps the value from one dimension to the other

C. A function that predicts the output value of a regression

D. None

24. Which of the following is not a kernel?

A. Polynomial Kernel

B. Gaussian Kernel

C. Sigmoid Kernel

D. None

25. Why Naive Bayes is called naive?


A. Because its assumption may or may not be true

B. Because it’s a bad classifier

C. The accuracy is very poor

D. All of the above

26. For two events A and B, the Bayes theorem will be-

A. P(A | B) = P(B) * P(B | A) / P(A)

B. P(A | B) = P(A) * P(B | A) / P(B)

C. P(A | B) = P(B) * P(A | B) / P(A)

D. P(A | B) = P(A) * P(A | B) / P(B)

27. How does a decision tree work?

A. Minimizes the information gain and maximizes the entropy

B. Maximizes the information gain and minimizes the entropy

C. Minimizes the information gain and minimizes the entropy

D. Maximizes the information gain and maximizes the entropy

28. Suppose you have a dataset that is randomly distributed. What will be the best
algorithm for that dataset?

A. Support vector machine

B. Naive Bayes

C. K nearest neighbors

D. Decision tree

29. Which pair of the algorithms are similar in operation?

A. SVM and KNN

B. Decision tree and Random forest


C. SVM and Naive Bayes

D. All of the above

30. Which metric is not used for evaluating classification models?

A. AUC ROC score

B. Accuracy

C. Mean absolute error

D. Precision
Articles
https://doi.org/10.1038/s42256-022-00520-5

Deep learning-based robust positioning for all-weather autonomous driving
Yasin Almalioglu, Mehmet Turan, Niki Trigoni and Andrew Markham

Interest in autonomous vehicles (AVs) is growing at a rapid pace due to increased convenience, safety benefits and potential
environmental gains. Although several leading AV companies predicted that AVs would be on the road by 2020, they are still
limited to relatively small-scale trials. The ability to know their precise location on the map is a challenging prerequisite for
safe and reliable AVs due to sensor imperfections under adverse environmental and weather conditions, posing a formidable
obstacle to their widespread use. Here we propose a deep learning-based self-supervised approach for ego-motion estimation
that is a robust and complementary localization solution under inclement weather conditions. The proposed approach is
a geometry-aware method that attentively fuses the rich representation capability of visual sensors and the weather-immune
features provided by radars using an attention-based learning technique. Our method predicts reliability masks for the sensor
measurements, eliminating the deficiencies in the multimodal data. In various experiments we demonstrate the robust
all-weather performance and effective cross-domain generalizability under harsh weather conditions such as rain, fog and
snow, as well as day and night conditions. Furthermore, we employ a game-theoretic approach to analyse the interpretability
of the model predictions, illustrating the independent and uncorrelated failure modes of the multimodal system. We anticipate
our work will bring AVs one step closer to safe and reliable all-weather autonomous driving.

sensors, radars do not require optical lenses and can be integrated stereo) under challenging conditions. The key supervision signal
into plastic housings, making them highly resilient to water and to train the neural networks for depth and pose prediction comes
dust ingress. We therefore believe that odometry approaches utiliz- from the new view-reconstruction algorithm: given a multimodal
ing millimetre-wave radars will allow robust ego-motion estimation input view of a scene, it reconstructs a new view of the scene cap-
under diverse settings such as day, night, rain, fog and snow, and tured from a different position. The visual-reconstruction algo-
address the challenges in implementing radars (which are described rithm uses the predicted per-pixel depth and ego-motion, whereas
in Supplementary Note 3). The introduction of a high-resolution the range-reconstruction algorithm uses the predicted ego-motion
radar in AV datasets created new opportunities for ego-motion and range measurements, both making use of multimodal masks.
estimation under challenging conditions. Despite the improved The spatial transformer module of GRAMME implements the view
measurements, the radar measurements are still much coarser and reconstruction in a fully differentiable manner compatible with the
noisier than those of lidars and cameras. As a result, ego-motion ego-motion, depth and mask prediction neural networks.
techniques developed for lidars cause large motion errors. Although At a high level, GRAMME has a modular design to enable inde-
further information in the full AV software stack from passive sen- pendent operation for each modality during both training and
sors (for example, wheel encoders and inertial measurement units) inference, which improves the robustness of the system to achieve a
and intermediate predictions of software modules (for example, loop minimal risk condition14. Although we train the modules for depth,
closure and bundle adjustment) can supplement the ego-motion pose and mask predictions jointly, they can directly operate on the
estimation module, perception sensors such as the camera, lidar input frames separately from each other during test time, lead-
and radar play a pivotal role in the performance12. Ego-motion esti- ing to independent and uncorrelated failure modes for the mod-
mation methods should therefore exploit the advantages of cam- ules. Moreover, the modular design enables the performance gains
eras (rich, dense visual information), lidars (fine granularity within achieved during multimodal training to be maintained at inference
visible range) and radars (immunity to inclement weather) while time even when the complementary modalities are partially or
addressing their relative shortcomings. Although deep learning entirely unavailable. We use a reciprocal multimodal training tech-
models offer state-of-the-art solutions for ego-motion estimation nique to enhance the predictions on individual modalities, provid-
tasks (Supplementary Note 4), adverse weather conditions pose a ing information flow across submodules. Furthermore, the range
host of substantial challenges such as reduced sensing capability (for measurements of radar can directly capture strong patterns related
example, due to the occlusions caused by precipitation) and a wide to the geometry of the scene, whereas a simple colour value of
range of domain shifts (for example, due to the discrepancy between camera pixels is associated with the geometry through an accurate
a training dataset and the data encountered during deployment). depth estimation of the pixel. As the camera and radar measure-
Here we propose a novel self-supervised deep learning frame- ments are perceptually different, we exploit a late multimodal deep
work, geometry-aware multimodal ego-motion estimation fusion technique, which also facilitates the modular design. The
(GRAMME; Fig. 1) that addresses the key ego-motion estimation multilayer perceptron-based late fusion layer uses the unaligned
challenges for AVs outlined above. Our novel multimodal geo- ego-motion predictions from multiple modalities to predict the
metric reconstruction algorithm and reciprocal training technique ultimate motion. Due to the tight formulation of ego-motion and
create a supervisory signal for the self-supervised neural network. depth prediction, the multimodal fusion technique substantially
Under five diverse settings (day, night, rain, fog and snow) using improves the depth predictions as well. The fusion consists of two
publicly available independent datasets, we show that our multi- stages: first, the individual ego-motion predictions are used to
modal approach provides robustness to unfavourable weather con- reconstruct the corresponding camera and range views; second,
ditions and outperforms state-of-the-art ego-motion estimation the predictions of each modality are interchangeably used in the
approaches. Following a challenging experimental protocol, we counterpart view-reconstruction algorithms for both visual and
show that the proposed modular design improves the performance range reconstructions.
of individual modalities even if the other modalities are unavailable In the proceeding sections, we demonstrate the generalizability,
at test time, providing robustness to sensor failures. Furthermore, data efficiency and interpretability of GRAMME in five different
we demonstrate the generalization capability of GRAMME by diverse settings such as day, night, rain, fog and snow. We quali-
showing that models trained on regular sequences typically targeted tatively and quantitatively evaluate the state-of-the-art ego-motion
by self-supervised studies can directly be applied to challenging estimation and depth prediction performance on multiple datasets,
sequences. We employ different sensors with various resolutions emphasizing the effect of modular design on individual modalities.
and beamwidths in the experiments and show that GRAMME is
sensor agnostic. Furthermore, we use a game-theoretic approach to
Results
visualize the learnt feature space and illustrate the independent Evaluation of model performance. We evaluated the depth and
and uncorrelated failure modes of the proposed multimodal sys- ego-motion prediction performance of GRAMME in five adverse
tem, and show that GRAMME focuses on the relevant details in settings such as day, night, rain, fog and snow using fivefold
the environment. GRAMME is publicly available as an easy-to-use cross-validation. To quantitatively measure the generalization per-
Python package13. formance of GRAMME, we conducted an effective and reliable—
yet rather challenging—cross-condition evaluation on the Robotcar
Self-supervised artificial intelligence for all-weather dataset, enabled by the modular design of GRAMME. We trained
ego-motion estimation the models on typical day sequences15 (training dataset) and directly
GRAMME is a deep learning-based self-supervised method that evaluated them on more challenging conditions (night, rain, fog and
uses multiple sensors such as cameras, lidars and radars to estimate snow)16 (test dataset). For each cross-validated fold, we randomly
the ego-motion of AVs by reconstructing the three-dimensional partitioned each public AV training dataset into a training set (80%
scene geometry under diverse settings such as day, night, rain, fog of sequences), a validation set (10% of sequences) and a test set (10%
and snow. GRAMME is sensor agnostic and designed to support of sequences). Each set contains the time-synchronized matching
sensors with various configurations in terms of resolution, beam- frames from each modality used for the training. The proportions of
width and field-of-view (Supplementary Note 5). GRAMME uses different settings (in terms of the number of frames) were kept con-
a novel differentiable view-reconstruction algorithm to incorpo- stant in each set during partitioning. In each fold, we monitored the
rate the measurements of range sensors (for example, lidars and model’s performance on the validation set during training and used
radars), mitigating the limitations of cameras (both monocular and the validation set for model selection while the test set was held-out

[Figure 1 graphic: panel a, 'Multimodal measurements under diverse settings', shows camera, lidar and radar frames captured in day, night, rain, fog, sleet and snow; panel b, 'Self-supervised neural network', shows the DepthNet, VisionNet, RangeNet, MaskNet and FusionNet modules feeding the spatial transformer, which produces the losses Lsmooth, Lgeo, Lradar and Lcam.]

Fig. 1 | Overview of the GRAMME conceptual framework and architecture. a, The publicly available independent AV datasets are collected using
multiple sensors such as camera, lidar and radar under diverse settings such as variable ambient illumination and precipitation. Example multimodal
measurements from the RADIATE dataset17 are shown to illustrate the data types and the degradation in sensor measurements caused by adverse
conditions. b, Architecture overview for self-supervised estimation of scene geometry and ego-motion. DepthNet and VisionNet modules predict the
pixel-wise depth map of each camera frame and the ego-motion between consecutive camera frames, respectively. In parallel, the RangeNet and MaskNet
modules operate on range sensors (that is, lidar and radar) to predict ego-motion and input masks, respectively. FusionNet collects the unaligned individual
motion predictions as input and predicts the ultimate motion. Finally, the spatial transformer module uses the multimodal predictions and geometrically
reconstructs the scene, creating a supervisory signal (L).
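To make the modular design in Fig. 1b concrete, the following minimal PyTorch sketch shows how independent pose branches (in the spirit of VisionNet and RangeNet) can feed a small multilayer-perceptron fusion head. It is an illustration only: the class names, layer sizes and the 6-degree-of-freedom pose vector are our assumptions, not the released GRAMME implementation.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PoseBranch(nn.Module):
    # Regresses a 6-DoF relative pose (3 translations, 3 rotations) from a pair of
    # frames stacked along the channel dimension, as a VisionNet/RangeNet-style branch.
    def __init__(self, in_channels):
        super().__init__()
        self.encoder = resnet18(num_classes=256)
        # Adapt the first convolution to the stacked input channels.
        self.encoder.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.head = nn.Linear(256, 6)

    def forward(self, frame_t, frame_s):
        x = torch.cat([frame_t, frame_s], dim=1)
        return self.head(torch.relu(self.encoder(x)))

class FusionHead(nn.Module):
    # Late fusion: maps the unaligned per-modality pose estimates to a single pose.
    def __init__(self, n_modalities=2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6 * n_modalities, 64), nn.ReLU(), nn.Linear(64, 6))

    def forward(self, poses):
        return self.mlp(torch.cat(poses, dim=1))

# Example: camera pair (RGB) and radar pair (single-channel bird's-eye view).
cam_branch, radar_branch = PoseBranch(in_channels=6), PoseBranch(in_channels=2)
fusion = FusionHead(n_modalities=2)
cam_t, cam_s = torch.rand(1, 3, 192, 320), torch.rand(1, 3, 192, 320)
rad_t, rad_s = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
pose = fusion([cam_branch(cam_t, cam_s), radar_branch(rad_t, rad_s)])  # shape (1, 6)

Because each branch produces a usable pose on its own, dropping a modality at test time degrades the fused estimate but does not break the pipeline, which is the behaviour the modular design targets.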

and referred to just once after training was complete to evaluate the with ground-truth pose information (as explained in the 'Datasets'
performance of the model on day sequences. The final models are section) following the same evaluation protocol. As shown in Fig. 2,
directly evaluated on the test datasets that contain adverse condi- ground-truth supervision reduces the generalization capability
tions never observed by the models during training. compared with self-supervision. Moreover, the relative multimodal
performance of the supervised models is even worse than the
Multimodal, modular and generalizable depth prediction. In self-supervised models. Although the camera-only self-supervised
the first set of experiments we analyse the depth prediction perfor- models are trained, validated and tested on day sequences and
mance, which is a critical component of the self-supervision signal. lead to overall performance improvement, challenging conditions
GRAMME formulates multimodal ego-motion using a tight con- involving glare and non-Lambertian surfaces still suffer from a
nection between depth prediction and ego-motion estimation to considerable performance loss; however, the models trained on
eliminate the need for the labelled data. The geometry-aware mul- additional range sensors (that is, lidar and radar) are much more
timodal self-supervised architecture improves the generalization immune to such effects. Although stereo camera-based models are
performance of the model to diverse conditions. Figure 2 shows slightly better than their monocular counterparts, we have a similar
the depth prediction performance for the model trained using: observation on the other test conditions that the models trained only
(1) a monocular camera, (2) a stereo camera, (3) a lidar–camera on camera are dramatically prone to failure. Moreover, although
(stereo) and (4) a radar–camera (stereo). Note that we use only the lidar- and radar-based models provide qualitatively similar results
day sequences in the training set for each training experiment on the and generally improve the overall performance, the model trained
modalities. Also, for each experiment, we use only the modalities with radar data provides greater immunity to precipitation. Under
labelled on the training modality column. Owing to GRAMME's foggy conditions, the lidar measurements contribute relatively less
modular design, the vision and range modules can make predic- to the generalization performance than to the other test conditions
tions directly and separately on the camera, lidar and radar inputs. with higher error variance; this is caused by poorer measurements
To evaluate the generalizability of the depth module, we use the due to water droplets condensed on the sensor surface. On the other
monocular sequences to test the depth prediction performance of hand, the depth prediction of the model trained with lidar–camera
the DepthNet module. The camera-only experiments also demon- fusion achieves better performance than the radar–camera model.
strate the robustness of the system to sensor deficiencies. We also As the lidar measurements are invulnerable to the lighting condi-
demonstrate the effect of external supervision by training the model tions and provide dense measurements, the model has an advantage


[Figure 2 graphic: panel a shows RGB inputs and predicted depth maps (colour scale 0 m to 1 km) for monocular, stereo, lidar–camera and radar–camera models under day, night, rain, fog and snow; panels b ('Ground-truth supervision') and c ('Self-supervision') plot the relative absolute error (roughly 0.22–0.34) for each training modality and test condition.]

Fig. 2 | Multimodal, modular, and generalizable depth prediction performance. a, Qualitative results and sample test frames16 to visualize the
generalization ability of GRAMME on depth prediction. We train each model using the day sequences in the training set and test them under diverse
conditions to analyse the generalization performance. GRAMME successfully exploits the complementary aspects of the sensors. b, Comparatively weaker
generalization performance of the supervised models. c, Quantitative results to compare the self-supervised generalization performance of GRAMME
with respect to ground-truth supervision and intra-modality performance. The models trained only on camera are dramatically prone to failure in all of the
challenging test conditions. Although lidar- and radar-based models provide qualitatively similar results and generally improve the overall performance,
the model trained with radar provides greater immunity to precipitation. Error bars represent the depth prediction errors with respect to the ground truth.
Camera fusion models employ the stereo setting.
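The error bars in Fig. 2b,c are relative absolute depth errors. Below is a small NumPy sketch of this metric, including the median scaling applied to scale-ambiguous monocular predictions and the 60 m evaluation cap described in the Methods; the function and variable names are ours.

import numpy as np

def abs_rel_error(pred, gt, max_depth=60.0, median_scale=True):
    """Relative absolute depth error |d - d_gt| / d_gt over valid pixels."""
    valid = (gt > 0) & (gt <= max_depth)
    pred, gt = pred[valid], gt[valid]
    if median_scale:
        # Monocular predictions are scale-ambiguous: align medians with the ground truth.
        pred = pred * np.median(gt) / np.median(pred)
    pred = np.clip(pred, 1e-3, max_depth)
    return np.mean(np.abs(pred - gt) / gt)

# Example with random data standing in for a predicted and a lidar ground-truth depth map.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 60.0, size=(192, 320))
pred = gt * rng.uniform(0.8, 1.2, size=gt.shape)
print(abs_rel_error(pred, gt))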

over the radar-based version. GRAMME exploits the multimodal Sensor-dependent masking, multisensory fusion and generaliz-
system design effectively, unlike the past work focusing mainly on able ego-motion estimation. View reconstruction provides the
either deep network architecture or objective function. The results key supervision signal for the model training. In this set of experi-
show the benefits of multimodal fusion on depth prediction as an ments we investigate the effectiveness of the masking system as the
additional supervision signal, improving the generalization ability major geometrically consistent element in the reconstruction. We
of the model under diverse settings. Moreover, we test the gener- then provide the overall generalization performance of the mul-
alization performance of GRAMME to different datasets, repeat- tisensor ego-motion estimation coupled with the masking system.
ing the same training, validation and test protocol on the publicly As the view reconstruction is based on sampling from the adjacent
available RADIATE dataset17. We exhibit both the depth predic- frames, and occluded areas cannot be sampled by definition, recon-
tion and ego-motion estimation performance in Extended Data structed occlusion areas might corrupt the supervisory signal. The
Fig. 1. Although the dataset contains shorter sequences with high inherent heterogeneous radar artefacts such as ghost objects, phase
variations in scene appearance and structure, GRAMME achieves and amplitude stability, speckle and saturation are other sources of
remarkable domain adaptation performance on this challenging inconsistency for view reconstruction (see Supplementary Note 3).
dataset (the observations on the RADIATE dataset are provided in Furthermore, the adverse weather conditions pose further challenges
Supplementary Note 6). for camera and lidar that inhibit the underlying scene consistency

[Figure 3 graphic: panel a shows paired inputs and predicted masks for the camera, lidar and radar under day, night, rain, fog and snow; panel b shows box plots of the translational error (%) and rotational error (° per 100 m) for monocular, stereo, lidar–camera and radar–camera models under each condition.]

Fig. 3 | Sensor-dependent mask predictions and performance evaluations on generalizable multisensory ego-motion estimation under diverse settings.
a, Illustration of sample frames16, multimodal measurements and the corresponding predicted masks. Each row shows a pair of input measurements
and predicted masks of each modality. White and dark regions represent the valid and invalid points in the measurements to effectively capture the
multisensory degradation resulting from both adverse weather and inherent sensor deficiencies, respectively. b, Multimodal performance evaluation on
ego-motion estimation and multisensory fusion. The box plots show the median, first and third quartiles, as well as the minimum and maximum quartiles
to show the errors in motion predictions. The error distribution in motion predictions in terms of the error quartiles are shown for translation and rotation
components of motion for each modality. Sensor fusion greatly boosts the overall motion estimation performance.

assumptions. Poor weather introduces sharp intensity fluctuations For example, intense glare caused by direct exposure to sunlight sat-
in camera images, which degrade the consistency across frames. It urates most of the camera pixels and restrains the frame matching.
is therefore important to detect the imperfect and unreliable regions The predicted camera mask captures the glaring regions and excludes
in measurements and exclude them from the view reconstruction. them from the view reconstruction to prevent an incorrect consis-
GRAMME predicts a mask that is a combination of learnt and geo- tency calculation that might corrupt the loss values computed during
metric masks to remove the invalid parts. The former is predicted training. Similarly, although stereo camera provides binocular vision
by GRAMME's mask module, whereas the latter is based on the and is marginally less susceptible to occlusions than the monocu-
geometric inconsistency between consecutive multimodal frames lar one, both camera types are still considerably prone to occlusions
that accounts for motion explanation, the nearly identical frames, and poor visibility due to precipitation and weak illumination. For
and dynamic objects. We show that the predicted masks improve lidar, the reflections from the ground cause unreliable regions in the
the performance of GRAMME by eliminating the imperfections on measurements that cannot be consistently matched across consecu-
each modality. Figure 3 shows example frames and the correspond- tive frames, which are detected and eliminated by the lidar masks.
ing mask predictions for each modality under challenging condi- The mask also identifies false detections caused by fog droplets. On
tions. We use the stereo setting for the camera fusion models, which the other hand, although radar is more resistant to weather condi-
provides additional information due to binocular vision. To show tions, the radar measurements still suffer from the inherent artefacts
that the masks eliminate the effect of unfavourable weather on each discussed above. The radar masks seamlessly detect the imperfect
sensor, we trained the model using the individual modalities only. measurements and filter them from the radar frames.
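For illustration, here is a minimal PyTorch sketch of the geometric part of such a consistency mask, which simply discards pixels whose reconstruction residual is large; the threshold and tensor shapes are our assumptions, and the learnt part of the mask (predicted by MaskNet) is not shown.

import torch

def geometric_consistency_mask(warped_src, target, threshold=0.2):
    """Binary mask that keeps pixels whose reconstruction residual is small.

    warped_src, target: tensors of shape (B, C, H, W) in [0, 1].
    Regions violating the static-scene assumption (occlusions, dynamic objects,
    sensor artefacts) tend to have large residuals and are masked out.
    """
    residual = (warped_src - target).abs().mean(dim=1, keepdim=True)  # per-pixel error
    return (residual < threshold).float()                             # 1 = valid, 0 = invalid

target = torch.rand(1, 3, 192, 320)
warped = target + 0.05 * torch.randn_like(target)  # a mostly consistent reconstruction
mask = geometric_consistency_mask(warped, target)
print(mask.mean())  # fraction of pixels kept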


Following the same experimental protocol described above, we MonoDepth2; ref. 21) varies depending on the modality and the test
evaluate the generalization performance of the overall ego-motion condition. For example, fusion models achieve good performance
estimation system. We show an ablation study of GRAMME in terms with a dataset size of at least 50% in all test conditions. However, the
of the contribution of the fusion module to individual modalities model trained with cameras needs at least 75% of the training data-
and contribution of different sensors under unique test conditions. set. Notably, the performance of fusion models might deteriorate
Figure 3 shows the translational and rotational errors of different with access to very limited data (for example, with only 25% of the
ablation schemes, averaged over the day, night, rain, fog and snow dataset). The increased complexity needed to implement the mul-
test conditions. To evaluate the benefits of multisensor fusion, we timodal architecture makes the model more data-dependent than
train camera-only models in monocular and stereo settings. In sep- those with single modalities. Furthermore, although radar–camera
arate training experiments, we fuse lidar and radar modalities with fusion provides more immunity to adverse weather than lidar–cam-
the stereo camera. As shown in Fig. 3, lidar–camera and radar–cam- era fusion, the latter performs relatively well under the poor illu-
era fusion significantly improves both the translational and rota- mination in the night sequences. Both lidar and radar modalities
tional motion prediction performance compared with camera-only are not affected by the illumination, but the lidar model utilizes the
models. The GRAMME model trained with lidar has notably higher dense measurements of the lidar sensor and achieves better perfor-
errors in fog, showing the negative impact of fog on lidar data. mances, accordingly.
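The translational and rotational errors reported in Fig. 3b follow the usual odometry practice of averaging relative errors over fixed-length trajectory segments (the KITTI-style protocol referenced in the Methods). Below is a simplified sketch for planar trajectories; the helper names and the endpoint-based drift approximation are ours and omit the full benchmark machinery.

import numpy as np

def segment_errors(gt_xy, gt_yaw, est_xy, est_yaw, segment_length=100.0):
    """Average translational (%) and rotational (deg per segment) drift over segments."""
    # Cumulative path length along the ground-truth trajectory.
    dists = np.concatenate([[0.0], np.cumsum(np.linalg.norm(np.diff(gt_xy, axis=0), axis=1))])
    t_errs, r_errs = [], []
    for start in range(len(gt_xy)):
        # First index that is segment_length metres further along the path.
        end = np.searchsorted(dists, dists[start] + segment_length)
        if end >= len(gt_xy):
            break
        gt_dt, est_dt = gt_xy[end] - gt_xy[start], est_xy[end] - est_xy[start]
        t_errs.append(np.linalg.norm(gt_dt - est_dt) / segment_length * 100.0)  # percent
        dyaw = (est_yaw[end] - est_yaw[start]) - (gt_yaw[end] - gt_yaw[start])
        r_errs.append(abs(np.degrees(np.arctan2(np.sin(dyaw), np.cos(dyaw)))))
    return float(np.mean(t_errs)), float(np.mean(r_errs))

# Toy example: a straight 500 m drive with a slightly drifting estimate.
n = 501
gt_xy = np.stack([np.linspace(0, 500, n), np.zeros(n)], axis=1)
gt_yaw = np.zeros(n)
est_xy = gt_xy + np.stack([np.zeros(n), 0.002 * np.arange(n)], axis=1)
est_yaw = gt_yaw + 1e-4 * np.arange(n)
print(segment_errors(gt_xy, gt_yaw, est_xy, est_yaw))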

Interpretability and dataset size-dependent performance. Human Discussion


interpretability of the trained self-supervised deep learning AV We showed that GRAMME addresses five key challenges in autono-
model not only serves to validate that the predictive basis of the mous driving. The first is multimodal self-supervision: we trained
model aligns with the intuitive geometry perception for depth and models with self-supervision using only the sensor measurements
ego-motion prediction, but also promotes trust for end-users and captured by the camera, lidar and radar sensors. We formulated a
liability for regulatory bodies18,19. We use a game-theoretic approach differentiable view-reconstruction algorithm to create a supervi-
to visualize the contribution scores of each pixel by decomposing sory signal from range scanning sensors (that is, lidar and radar).
the output prediction of the DepthNet module on a specific input We demonstrated that multimodality improves the robustness
by back-propagating the contributions of all neurons in the network of the model to poor illumination and adverse weather, while
to every feature of the input frame. The visualization is based on self-supervision eliminates the need for cumbersome ground-truth
Shapley additive explanations (SHAP)20 that assigns each feature collection and improves the generalization capability compared with
an importance value based on SHAP values for a particular predic- supervised approaches. A possible explanation for the poor general-
tion. GRAMME models make the multimodal predictions by first ization performance of the supervised models is that they are opti-
identifying and aggregating regions in the vision and range sensor mized to learn the relationship between the input frames and the
measurements that are of high predictive importance (high SHAP ground truth rather than the underlying geometry. We also dem-
values, red) while ignoring regions of low relevance (low SHAP val- onstrated that multimodal self-supervision achieves state-of-the-art
ues, blue); see Fig. 4 for the visualization of SHAP values on sample depth reconstruction and ego-motion estimation results compared
multimodal inputs for different training modalities. Although the with the established self-supervised approaches. Although radars
higher SHAP values on the camera-only models are concentrated provide a reliable complementary perception, the imaging radars
around static objects, they are usually scattered across input images. are still sparse and the resolutions are limited. We argue that the
Besides, the lower SHAP values are more frequent than the higher development of higher resolution radars in three-dimensions will
values and concentrate around the imperfections on the input such be a milestone enabler for all-weather AVs. The second challenge is
as glare and occlusions. However, when the model is trained with modularity: we trained models using different modalities in various
lidar and radar sensors, the SHAP values focus on the object region settings. We showed that the modules could be trained and validated
with geometric structures (for example, cars and static objects), with partial availability of the intermediate outcomes and the other
and the layout (house silhouette and road boundaries). The fusion modules, resulting in a more robust system under diverse settings.
model focuses on the structural representations that reflect essential We further showed that the modules could transfer the improved
information for depth estimation, which is semantically more con- capabilities acquired during multimodal training to test time even
sistent between various unfavourable conditions such as night, rain, when the other modalities are partially or completely unavailable at
fog and snow. Note that the fusion models are trained with mul- test time, leading to independent and uncorrelated failure modes.
tiple modalities, but the tests are conducted on the camera depth We argue that modularity is an essential capability to achieve a min-
prediction without access to the data from the additional sensors. imal risk condition, improving the safety of AVs in case of hardware
The camera fusion models refer to the stereo setting. Although the or software failures of the components. Although a unitary design
DepthNet module trained with the camera struggles to find consis- with tight connections might result in performance gains, it should
tent and depth-related points, the fusion of additional sensors that not come at the cost of safety. The third challenge is generalizability:
are more resistant to environmental changes helps the DepthNet unlike most past self-supervised studies, we also focused on gen-
focus on geometric structure and object boundaries even when it eralization to all of the weather conditions. We trained the model
does not have access to the lidar and radar data at test time. using only day sequences in the training set and directly evaluated
Motivated by the inadequacy of accurate ground-truth data in it against the other conditions. Camera-only models showed poor
diverse, multimodal datasets at scale, we evaluated the performance generalization due to the degraded performance of cameras under
of GRAMME with sequentially sampled subsets of training data challenging conditions; however, we showed that models trained
under different test conditions (25%, 50%, 75% and 100%) while with multiple modalities (that is, lidar and radar) achieve a substan-
keeping the validation and test sets the same to investigate the tial performance boost in terms of generalizability to unseen condi-
dependency of the model’s performance on the amount of training tions during training. Although we use a diverse dataset including
data available. Figure 4 shows the relative absolute error for multi- regular operation, it is hardly feasible to cover all kinds of adverse
modal depth prediction in diverse settings, visualizing the median, regular operation, it is beyond feasible to cover all kinds of adverse
first and third quartile of errors. When supervising GRAMME mod- conditions in an AV dataset. Research on generalization perfor-
els with the smaller, sampled subsets of training data, we observed mance under unfavourable weather conditions is thus particularly
that the number of frames required to achieve satisfactory perfor- crucial for the development of AVs. Furthermore, the existing pub-
mance (with respect to the baseline monocular performance of lic AV datasets in the literature cover a wide range of conditions,

[Figure 4 graphic: panel a shows RGB inputs and SHAP attributions for monocular, stereo, lidar–camera and radar–camera models under day, night, rain, fog and snow; panel b plots the relative absolute error against training dataset size (25%, 50%, 75% and 100%) for each modality and test condition.]

Fig. 4 | Interpretability and dataset size-dependent performance. a, A game-theoretic visualization of GRAMME to interpret the depth predictions based
on the SHAP values for sample frames16. Pixels annotated by red points increase the depth prediction accuracy, whereas blue points lower the accuracy.
The challenging conditions such as glare, poor illumination and adverse weather lead to concentrated blue regions around the occluded pixels. However,
the training with lidar and radar data helps the model focus on more semantically invariant pixels across diverse test conditions, as visualized by the red
points around static objects and road edges. The distribution of the values illustrates the independent and uncorrelated failure modes of the proposed
multimodal system. b, Dataset size-dependent performance of GRAMME in terms of mean depth prediction error, with standard deviation with respect
to the depth ground truth. Although the lidar–camera (stereo) and radar–camera (stereo) fusions improve the overall performance, access to minimal
data (for example, only 25%) causes a worse performance than the camera due to the increased complexity of the model required for the multimodal
architecture. On the other hand, despite the model complexity, the lidar and radar-based models achieve good performances (compared with the baseline
approaches) with a dataset size of at least 50% in all test conditions.
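As a rough illustration of how such per-pixel attributions can be produced, the sketch below applies the shap library to a toy depth regressor whose dense output is reduced to a scalar. We use shap.GradientExplainer here as a lightweight stand-in for the Deep SHAP variant used in the paper; the network, input sizes and background set are placeholders of ours.

import torch
import torch.nn as nn
import shap  # pip install shap

class TinyDepthNet(nn.Module):
    # Toy dense depth regressor; the output is summarized to one scalar per frame so
    # that the explainer can attribute a single value back to the input pixels.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        depth = self.features(x)                        # dense (B, 1, H, W) "depth"
        return depth.flatten(1).mean(1, keepdim=True)   # scalar summary per frame

model = TinyDepthNet().eval()
background = torch.rand(8, 3, 64, 64)    # reference frames the attributions are measured against
test_frames = torch.rand(2, 3, 64, 64)

explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(test_frames)  # per-pixel contributions to the output
attributed = shap_values[0] if isinstance(shap_values, list) else shap_values
print(attributed.shape)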

but they do not extensively cover extreme conditions such as heavy complexity. However, the specific inherent vulnerability of sensors
downpours and large snowfalls, which is a limiting factor in eval- (such as lidar in fog) might deteriorate the performance with mini-
uating the generalization capability. Fourth, interpretability: we mal data availability (for example, 25% of a dataset). For depth
demonstrated that our models are interpretable and capable of and ego-motion estimation in adverse weather, we believe that the
capturing semantically and geometrically consistent regions. We diversity and accuracy of ground truth in the existing public datas-
visualized the extracted features using SHAP values and observed ets are still insufficient, which is likely to constitute a limiting factor;
that the camera-only model struggles to focus on consistent regions data-efficiency analysis is therefore important to understand how
across frames. On the other hand, multimodal training helps the sensitive is the performance of a deep learning model to the avail-
model to capture more consistent areas that are interpretable by ability of additional data.
humans. Although deep learning models are heavily deployed in an The key aspects discussed above address a critical issue of AVs:
AV software stack, interpretability remains a considerable challenge the ability to know precisely where they are on the map. Core AV
due to the lack of insightful and lucid interpretability approaches components such as prediction and planning rely on this local-
to analyse the complex deep learning architectures. Finally, data ization ability. In this study we showed that robust and accurate
efficiency: our quantitative experiments and comparative analysis ego-motion estimation provides a complementary solution to local-
demonstrated that GRAMME models trained with multiple modal- ization and is a critical component of autonomous driving to achieve
ities achieve satisfactory results compared with baseline methods, safety and reliability under adverse conditions. The high level of
even with dataset sizes limited by up to 50%, despite the increased location accuracy provided by GRAMME enables AVs to reliably


understand their environment and make safer decisions. We dem- alone is ambiguous, especially in low-textured regions due to the multiple matches
onstrated that the complementary and redundant perception that with one pixel. To prevent depth ambiguity due to incorrect pixel matches in
low-textured and occluded areas, we apply a regularization:
AVs gain from multiple sensors improves the reliability of vehicles ∑∑
in challenging situations, especially in unfavourable weather condi- Ls (D, 2) = 2
|∇d D(xt )|e
−α |∇d I(xt )|
(1)
tions. Furthermore, the self-supervised aspect of GRAMME enables xt d∈x,y

artificial intelligence systems deployed on AVs to learn localization Ls(D, 2) is a second-order spatial depth smoothness term that penalizes the
from orders of magnitude more data, which is important to quickly divergence of the depth prediction gradients along both the x and y directions22.
recognize and understand new driving conditions. We believe The regularization encourages the alignment of the depth values in the planar
AV technologies should meticulously involve these fundamental surface in the absence image gradients. For multiview projection between multiple
aspects to achieve safe and reliable autonomous driving. camera views, let D(xt) denote the depth value of the target image at coordinate
xt, and K be the camera intrinsics matrix. Assume a rigid transformation Tt→s is
In terms of future directions, the presented technology can be fur- the relative pose from the target view to source view, and h(x) is the homogeneous
ther improved in several directions. For example, the signal-to-noise coordinates given x. The perspective projection to find corresponding pixels in the
ratio of range sensors can be integrated into the masking component source view can be formulated as,
of GRAMME, providing an additional physical source of confidence
D(x_s) h(x_s) = K T_{t \to s} D(x_t) K^{-1} h(x_t) \qquad (2)
for the measurements. Moreover, the Doppler measurements from
radars can help the model better distinguish dynamic and static and the image coordinate xs can be obtained by de-homogenization of D(xs)h(xs); xs
objects in the scene, enabling a more consistent geometric and and xt are therefore a pair of matching coordinates in the source and target views,
semantic understanding of the environment. Moreover, GRAMME and the similarity between the two can be compared to validate the correctness of
structure. Given the pixel-wise matching pairs in Ict and Ics , we can reconstruct a
as a learning-based approach can be extended to higher level learn- c
target view Iˆs from the given source view as described in ref. 27, and calculate the
ing schemes of autonomous driving such as lifelong and continual final camera objective using the photometric error Lc = Lp (M, Îcs , Îct ) + λs Ls (D)
learning, resulting in AVs that continuously and collaboratively following the camera masking method offered in ref. 22. The camera module
improve autonomous driving artificial intelligence. is applicable to monocular and stereo cameras by exploiting the left–right
consistency21.
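A compact PyTorch rendering of the second-order, edge-aware smoothness term of equation (1) follows; the value of the edge-weighting constant and the tensor shapes are our assumptions.

import torch

def smoothness_loss(depth, image, alpha=10.0):
    """Second-order depth smoothness, down-weighted at image edges (cf. equation (1)).

    depth: (B, 1, H, W) predicted depth; image: (B, 3, H, W) RGB frame in [0, 1].
    """
    def grad_x(t): return t[..., :, 1:] - t[..., :, :-1]
    def grad_y(t): return t[..., 1:, :] - t[..., :-1, :]

    # Second-order depth gradients along x and y.
    d2x = grad_x(grad_x(depth))
    d2y = grad_y(grad_y(depth))
    # Edge-aware weights from first-order image gradients (averaged over channels).
    wx = torch.exp(-alpha * grad_x(image).abs().mean(1, keepdim=True))
    wy = torch.exp(-alpha * grad_y(image).abs().mean(1, keepdim=True))
    return (d2x.abs() * wx[..., :, 1:]).mean() + (d2y.abs() * wy[..., 1:, :]).mean()

depth = torch.rand(1, 1, 192, 320)
image = torch.rand(1, 3, 192, 320)
print(smoothness_loss(depth, image))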
Methods Range module. The range module is designed to predict ego-motion from radar
GRAMME. GRAMME is a self-supervised deep learning framework designed to
and lidar measurements that are represented by a bird’s-eye view in Cartesian
robustly estimate the ego-motion and depth map for an AV under diverse settings.
coordinates, consisting of two feature extractor networks based on ResNet18
GRAMME follows an end-to-end design and leverages data-driven learning to
followed by two fully connected layers to regress the relative pose. RangeNet
combat the inherent limitations of conventional and state-of-the-art ego-motion
predicts the relative pose Tt→s between source and target frames <Is, It>, whereas
estimation methods. GRAMME demonstrates the feasibility of multimodal
MaskNet individually predicts a mask M in parallel to detect the consistent regions
odometry under adverse weather conditions and proposes a multisensor fusion
in the frames. Finally, our view synthesis algorithm reconstructs the target view
framework, resulting in a robust ego-motion estimation system. The standard
using the predicted pose and mask.
self-supervised ego-motion prediction is based on monocular camera, and it
consists of two joint stages22,23. The first stage predicts a depth map for a given
View synthesis for range sensors. Given a source Is and target It views in Cartesian
camera frame, whereas the second stage predicts ego-motion between two
coordinates for radar and lidar measurements, we use the relative predicted
consecutive camera frames. Given the ego-motion and depth predictions, a c
pose Tt→s between the views to reconstruct a target view Iˆs through bilinear
spatial transformer algorithm reconstructs the target camera frame from the
interpolation. To reconstruct the value of Îs (xt ) from the value of Is(xs), we
source frames. The spatial transformer module builds on the idea presented by
use a differentiable bilinear sampling mechanism similar to the photometric
Jaderberg et al.24, explicitly allowing the spatial manipulation of multimodal data
approaches24, linearly interpolating the values of the four-pixel neighbours
within the network. The reconstruction quality establishes the supervisory signal
N = (top-left, top-right, bottom-left and bottom-right) of xs to approximate Is(xs),
to optimize the neural network. GRAMME builds on the self-supervised training ∑
that is, Îs (xt ) = Is (xs ) = i,j∈N wij Is (xijs ), where wij ∝ ∣xs − xt∣, and ∑i,jwij = 1;
idea and describes a multimodal architecture to promote complementary sensor
then, given the Lambertian and a static rigid scene assumptions, we can
behaviours, yielding a robust ego-motion estimation for AVs under diverse settings
calculate the average intensity error to refine the predicted relative pose. However,
such as day, night, rain, fog and snow. GRAMME introduces a novel differentiable
this assumption is not always true because of dynamic objects and sensor
range-reconstruction algorithm for range frames (that is, lidar and radar) as part
deficiencies, which might be further violated under adverse weather. We introduce
of its multimodal spatial transformer that is adaptable to the back-propagation
a consistency mask M to compensate for the regions violating the assumption.
during training of the deep learning architecture. The RangeNet module uses
Formally, the masked intensity loss for lidar (Ll) and radar (Lr) is,
two consecutive range frames to predict the ego-motion of AV, whereas MaskNet
predicts the reliable regions in individual frames. Given the ego-motion and
mask predictions, the spatial transformer algorithm uses the source frames to
reconstruct the target range frames. To exploit the complementary information
obtained from different sensors, GRAMME proposes a fusion method that consists
of the FusionNet layer and cross-modal training technique. The novel fusion
method enables information flow across different modalities due to the joint
L_{l,r}(M, \hat{I}_s, \hat{I}_t) = \sum_{s=1}^{S} \sum_{x_t} M_s(x_t) \left| I_t(x_t) - \hat{I}_s(x_t) \right|, \quad \text{such that } \forall x_t, s \; M_s(x_t) \in [0, 1] \qquad (3)
where \{\hat{I}_s\}_{s=1}^{S} is the set of reconstructed source views, \{M_s\} is a set of consistency
training technique, improving the robustness of individual modalities. Extended masks, and Ms(xt) ∈ [0, 1] provides a weight on the error at xt from source view s.
Data Fig. 2 shows the details of the architecture. The range-reconstruction algorithm is summarized in Algorithm 1. Moreover, the
explainability mask has a trivial solution in this formulation, assigning all mask
Problem definition. Each loosely time-synchronized triplet of consecutive camera values to zero. We apply a regularization term to encourage non-zero masks to
( < Ics,i−1 , Ict,i , Ics,i+1 >), lidar ( < Ils,i−1 , Ilt,i , Ils,i+1 >) and radar ( < Irs,i−1 , Irt,i , Irs,i+1 >) prevent the saturation in the network activation, using a cross-entropy loss for the
frames in the training set ( I = Ic ∪ Il ∪ Ir) represents a single data point at predicted masks:
time index i with unknown ego-motion and depth map of the camera source Is
and target It frames. Our goal is to estimate T, where the pose T_{t \to s} = [R \mid t] \in SE(3)
is a transformation between the target (t) and source (s) frames with rotation
L_m(M) = -\sum_{s} \sum_{x_t} \log P(M_s(x_t) = 1). \qquad (4)
is a transformation between the target (t) and source (s) frames with rotation
matrix R and translation vector t. Although the standard commercial radars In the bird’s-eye view, vehicles and large objects occupy smaller areas
are two-dimensional sensors, we formulate our problem in SE(3) to enable compared with the front-view. For example, a vehicle with an average size of
compatibility with other three-dimensional sensor modalities. Unlike existing 2.5 × 5.1 m occupies only a 13 × 26 pixels area with an input resolution of 0.2 m.
self-supervised radar approaches25, GRAMME directly predicts the pose between Downsampling the bird’s-eye view map through the encoder makes the region-wise
the consecutive frames without imposing strong motion prior factors. features vulnerable to quantization errors in the subsequent mask generator; thus,
GRAMME upsamples the coarse-grained feature map via a transposed convolution
Camera module. The camera module consists of two networks. DepthNet uses layer (decoder) and concatenates the output with the fine-grained feature map with
UNet style skip connections26 to predict per-pixel depth map D of a given RGB skip links, following the UNet design26.
image. In parallel, VisionNet follows ResNet18 architecture to predict the relative
pose Tt→s between source and target RGB images < Ics , Ict >. We use the predicted Multimodal fusion. GRAMME introduces a self-supervised fusion approach
depth and pose values in the spatial transformer algorithm to create a supervisory that involves an attention module, a fusion network and a training technique.
signal based on perspective projection. However, photometric error supervision The features extracted from range and camera modules are used in an attention

module to create weighted features. Lidar and radar features are not always equally is collected in a period of one year in Oxford, and around 1,000 km in total.
important, and their contributions to final pose prediction should be weighted We use several sensors attached to the Oxford RobotCar: a Bumblebee XB3 stereo
accordingly. We extract a weight vector from the concatenated features through camera, and a SICK LD-MRS three-dimensional lidar with a drastically limited
a ResNet18-based encoder followed by a fully connected layer and a SoftMax field of view, unlike the lidar on the newer version of the dataset. Within this
layer to predict the importance weights between [0, 1] for each input feature. configuration, the lidar and stereo camera yield a data stream on 11 fps and 16 fps,
The pose regressors then use the weighted features to predict the relative pose. respectively. On the other hand, although the RADIATE dataset17 is collected
FusionNet uses the unaligned relative pose values predicted by VisionNet and mainly for object detection, it is still an interesting dataset as it contains shorter
RangeNet, and predicts the ultimate ego-motion without any correction using sequences with high variation in scene appearance compared to the Robotcar
the extrinsic calibration among the sensors. Furthermore, during multimodal dataset. The RADIATE dataset involves a ZED stereo camera at 15 fps, which is
training, we impose a cross-modal fusion loss Lf, in which we use the fused pose protected by a waterproof housing under extreme weather conditions. The images
in the range-reconstruction algorithm and calculate the reconstruction error. The might have severe blur, haze or might be obstructed due to raindrops, dense fog or
cross-modal training technique not only increases the robustness of ego-motion snow flakes. A 32 channel, 10 Hz, Velodyne HDL-32e LiDAR gives 360° coverage.
prediction but also improves the predictions from the other modalities, such as The lidar data can be missing and noisy since the signal can be severely attenuated
depth prediction. and back-scattered by intervening fog or snow. The RADIATE dataset adopts the
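A toy PyTorch sketch of such importance weighting follows: a softmax-normalized weight per modality rescales the extracted features before pose regression. The two-modality setup, the feature dimension and the plain MLP used in place of the ResNet18-based weight encoder are simplifications of ours.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    # Predicts softmax-normalized importance weights for the concatenated modality
    # features and regresses the fused relative pose from the re-weighted features.
    def __init__(self, feat_dim=256, n_modalities=2):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(feat_dim * n_modalities, 64), nn.ReLU(),
            nn.Linear(64, n_modalities), nn.Softmax(dim=1),
        )
        self.pose_head = nn.Linear(feat_dim * n_modalities, 6)

    def forward(self, feats):                       # feats: list of (B, feat_dim) tensors
        stacked = torch.stack(feats, dim=1)         # (B, n_modalities, feat_dim)
        w = self.weight_net(torch.cat(feats, 1))    # (B, n_modalities) importance weights
        weighted = stacked * w.unsqueeze(-1)        # scale each modality's features
        return self.pose_head(weighted.flatten(1))  # fused 6-DoF pose

fusion = AttentionFusion()
lidar_feat, camera_feat = torch.rand(4, 256), torch.rand(4, 256)
print(fusion([lidar_feat, camera_feat]).shape)  # torch.Size([4, 6])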
same radar as the Robotcar dataset: the Navtech CTS350-X radar that is a scanning
Training details. During training, the triplet frames in the training set are randomly radar providing 360° high-resolution range-azimuth images without Doppler
sampled and provided to the model using a batch size of 16. We augment the information. The radar is set to 100 m maximum operating range with 0.175 m
lidar and radar scans with a random rotation around the vehicle centre by an range resolution, 1.8° azimuth resolution. For both datasets, we follow the original
angle in [−10, 10]∘ because a large fraction of the AV datasets consists of either implementation of the authors for the conversion of radar frames from polar to
driving straight or waiting in traffic. The total loss for a given sequence L is the Cartesian coordinates and bird’s eye view projection of lidar frames. Figures 2–4
sum of the individual modality losses and masking loss with optional scaling illustrate the large differences in weather conditions. Between the day and snow
factors λc for the camera component and λm for the mask component of the loss. conditions, there was significant dissimilarity in visual appearance. For example,
The final learning objective is given by: most of the lane lines are barely visible during the snow.

L(D, T, M) = L_{l,r}(M, \hat{I}_s, \hat{I}_t) + \lambda_c L_c(D, T, M) + \lambda_m L_m(M) \qquad (5)

Comparative analysis and ablation study. Ego-motion estimation. We compare the
ego-motion estimation performance of GRAMME with baseline methods, shown
in Extended Data Table 1. In accordance with the baselines, we use the same spatial
Given the objective functional, the photometric and intensity error is cross-validation setting suggested by Barnes et al.30, and report our results using
back-propagated to depth, pose and mask networks by applying the spatial the KITTI odometry metrics 31, which average the relative position and orientation
transform operation to supervise the learning process. We used λc = 30 and λm = 1 errors over every sub-sequence of length (100 m, 200 m, … , 800 m). Evaluation of
for all experiments. Our models are implemented in PyTorch28, trained for ego-motion estimation techniques based on full global trajectory end-points is
at least 50 epochs until 200 epochs using Adam29 unless an early stopping misleading because a large motion error in the earlier trajectory points leads to
criterion is met, which was chosen using the validation set explained in Section substantial errors in the end-points. We thus use trajectory segments to analyse
Results. The validation loss is monitored each epoch, and early stopping is rotation and translation errors, following the standard evaluation benchmarks
triggered when it has not decreased from the previous low for over five consecutive described in ref. 31 to allow for deeper insights into the qualities and failure
epochs. The saved model with the lowest validation loss is then evaluated modes of motion prediction. To provide a comprehensive analysis, we evaluate
on the test set. We use Adam optimizer using a learning rate of 1 × 10−4 with the competing approaches using the camera, lidar and radar modalities. To
an L2 weight decay of 1 × 10−5. provide a fair comparison, we report both the individual and fused modalities for
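A minimal sketch of this optimization setup (Adam with the stated learning rate and weight decay, per-epoch validation monitoring and patience-based early stopping); the model, data loaders and loss function are placeholders of ours.

import torch

def train(model, train_loader, val_loader, compute_loss, max_epochs=200, patience=5):
    # Adam with the stated learning rate and L2 weight decay.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
        # Monitor the validation loss once per epoch for model selection.
        model.eval()
        with torch.no_grad():
            val = sum(compute_loss(model, b).item() for b in val_loader) / max(len(val_loader), 1)
        if val < best_val:
            best_val, epochs_without_improvement = val, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:
            epochs_without_improvement += 1
            # Stop once the validation loss has not improved for `patience` epochs,
            # after a minimum of 50 epochs of training.
            if epoch >= 50 and epochs_without_improvement > patience:
                break
    return best_val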
GRAMME. We evaluate the proposed approaches in three different settings as
Algorithm 1. View reconstruction for range sensors grouped in Extended Data Table 1: camera, lidar and radar ego-motion estimation.
m, n ← Height, Width ⊳Input dimensions GRAMME models trained with additional data are indicated in parentheses,
C ← Stack((1, 2, ... , m), (1, 2, ... , n)) ⊳ Pixel coordinates in homogenous form where the camera fusion refers to the stereo setting. Regardless of the training
function InverseWarp(Is, ps, Ms) strategy of each method, each competing approach receives the same input without
Tt→s = Rodrigues2TransformationMatrix(ps) additional data at test time. Although we use exactly the same GRAMME camera
C̃ = Tt→s C ⊳ Transformed points model for each experiment, the model trained with additional data outperforms
C̃ = Normalize(C̃) ⊳ Normalized pixel coordinates in [−1, 1] the competing approaches by a considerable margin, proving the effectiveness
Îs = BilinearSample(C̃, Is ) ⊳ Reconstructed frame of the proposed multimodal approach. Besides, we independently report the
if Ms != None then
   M̃s = BilinearSample(C̃, Ms ) ⊳ Reconstructed mask camera-based evaluation, we compare GRAMME in monocular and stereo settings
else with the visual odometry method used in the Robotcar dataset, which employs
   M̃s = 1 an extensive number of features at the cost of a high computational burden33. We
return Îs , M̃s also compare our method with ORB-SLAM234, which loses the track and fails on
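A runnable PyTorch counterpart to Algorithm 1 for bird's-eye-view range frames, restricted to an SE(2) pose for brevity; the pose convention, the resolution and the masked L1 term appended at the end (which mirrors the masked intensity loss of equation (3)) reflect our assumptions rather than the released implementation.

import torch
import torch.nn.functional as F

def inverse_warp_range(source, pose, mask=None):
    """Reconstruct the target bird's-eye-view frame from a source frame.

    source: (B, 1, H, W) range intensities; pose: (B, 3) = (tx, ty, yaw) in pixels/radians.
    Returns the warped source frame (and warped mask) sampled with bilinear interpolation.
    """
    b, _, h, w = source.shape
    # Regular grid of target coordinates, normalized to [-1, 1] as required by grid_sample.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, h, w, 2)

    cos, sin = torch.cos(pose[:, 2]), torch.sin(pose[:, 2])
    rot = torch.stack([torch.stack([cos, -sin], -1), torch.stack([sin, cos], -1)], -2)  # (B, 2, 2)
    trans = torch.stack([2 * pose[:, 0] / w, 2 * pose[:, 1] / h], -1)                   # to [-1, 1] units
    warped_grid = torch.einsum("bij,bhwj->bhwi", rot, grid) + trans[:, None, None, :]

    warped = F.grid_sample(source, warped_grid, mode="bilinear", align_corners=True)
    warped_mask = (F.grid_sample(mask, warped_grid, mode="bilinear", align_corners=True)
                   if mask is not None else torch.ones_like(warped))
    return warped, warped_mask

# Masked intensity loss between the reconstructed and observed target frame (cf. equation (3)).
source, target = torch.rand(2, 1, 128, 128), torch.rand(2, 1, 128, 128)
pose = torch.zeros(2, 3); pose[:, 2] = 0.05          # small rotation
mask = torch.ones(2, 1, 128, 128)
warped, warped_mask = inverse_warp_range(source, pose, mask)
loss = (warped_mask * (target - warped).abs()).mean()
print(loss)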
the sequences shown in Extended Data Table 1. On the other hand, GRAMME
Datasets. We design GRAMME for multiple modalities such as camera, lidar in stereo setting successfully completes all of the test sequences and substantially
and radar, exploiting the complementary features of each sensor under diverse outperforms the baselines. Similar to ORB-SLAM2, the lidar-based approaches
settings. Although there are several publicly available AV datasets, they typically LOAM35, Lego-LOAM36 and SuMa37 fail to finish the whole sequences or rapidly
employ sparse lidar and radar sensors, useful for object detection, but not for deviate due to the challenging dynamics. Similar to LOAM, Lego-LOAM is tightly
three-dimensional perception. Moreover, they mostly consist of common daytime linked to the mapping and failed to perform odometry without the mapping
conditions only, which is not enough to evaluate the performance of an AV module module. We therefore report the results for Lego-LOAM with the mapping module
under diverse conditions. Under these requirements, we conduct our experiments enabled. The GRAMME lidar model outperformed the proposed approach on
on the Oxford Robotcar16 and the RADIATE17 datasets. Figure 1 shows samples the full field-of-view lidar setting. Hence we report their results up to the point
from the RADIATE dataset, whereas Figs. 2–4 show samples from the Robotcar where they lose tracking. Note that we project the six degrees-of-freedom pose
dataset. We follow the same coordinate reference systems as suggested by the predictions provided by the vision and lidar approaches onto the XY plane for
authors of these datasets to achieve a standard evaluation setup for comparisons. evaluation. It can be seen that the proposed GRAMME can achieve comparable
The Oxford Radar Robotcar (ORR) dataset is collected by a vehicle equipped with or better localization accuracy with enhanced robustness. On the other hand,
a NavTech CTS350-X radar, and two co-located Velodyne HDL-32E lidars at the although the GRAMME model trained on a low-cost lidar (for example, SICK
roof centre. The dataset contains the merged point clouds of the lidars, providing LD-MRS 3D LIDAR 85∘ HFoV) and labelled as narrow in the table shows worse
the ground truth for depth maps. The authors of the Oxford Radar Robotcar performance than the full field-of-view alternative, it successfully completes all
Dataset15 include visual odometry and loop closures into a large-scale optimization the test sequences. The poor performance is caused by the limited measurement
of their GPS/INS system to provide the ground-truth trajectory. The ORR radar capability of the sensor that is mainly designed for obstacle detection at a short
scans the 360° field of view at an angular step of 0.9° every 0.25 s, and the lidar range within a limited view. SuMa achieves similar performance to GRAMME
at a step of 0.33° every 0.05 s. The dataset provides the radar and lidar scanning in terms of odometry accuracy. The main reason is that GRAMME is designed
results transformed into a two-dimensional intensity map and three-dimensional for bird’s-eye-view images, whereas SuMa specializes in front-view lidar inputs.
point cloud, respectively. Both sensors share the same coordinate origin. The Specifically, SuMa constructs and reserves a surfel-based map of the environment,
ORR dataset contains 8,862 samples, which are split into 7,090 for training, 886 which embodies dense information for front-view lidar input, but sparse and
for validation and 886 for testing, without geographic overlapping. The dataset isolated information for bird’s eye view images. However, the GRAMME model
includes thirty-two traversals around 280 km of driving in total. We also evaluate based on lidar and camera shows a better performance with a noticeable margin,
our model on an earlier version of the ORR dataset, Oxford Robotcar dataset16 that demonstrating the effectiveness of the fusion approach. Moreover, we compare
contains the same lidar and camera sensors except for the radar. This version our radar-based GRAMME model with state-of-the-art radar odometry methods

provided by Cen et al.38, Barnes et al.30 and Hong et al.39. The results show that multimodal aspect of GRAMME. Figure 4a compares the SHAP values of
GRAMME radar-only model surpasses the performance of both geometry- and multimodal depth predictions under different test conditions, and explains the
learning-based approaches. Besides, GRAMME exceeds the performance of output of GRAMME trained on the Robotcar dataset. While the pixels marked
the supervised radar odometry approach30 without any need for ground-truth with red points increase the prediction accuracy, blue points decrease it. The input
supervision, which indicates the advantage of GRAMME deployment in regions RGB images are shown on the left, and we also place the transparent grey-scale
where a source of high-quality location information is unavailable such as a versions of them in the background of each explanation. The sum of the SHAP
GPS/INS system. values for each explanation equals the difference between the current model output
and the expected model output that is averaged over the background dataset.
Depth prediction. We comparatively evaluate the depth prediction performance of Note that the red points for camera predictions are highly scattered, and the blue
GRAMME using the error and accuracy metrics that were initially proposed by points are usually concentrated around the occluded and the glaring regions.
ref. 40 and widely adopted in the literature. Also, as a convention in the competing However, when the same DepthNet model is trained using multiple modalities
approaches, we evaluate the performance of depth prediction capped at 60 m as that are more immune to adverse conditions, the red points are focused more on
the measures without threshold can be sensitive to the great errors in depth caused geometrically meaningful and semantically consistent regions. For example,
by prediction errors at small disparity values. Although DepthNet predicts the lidar and radar fusion enables the model features to capture road boundaries,
dept maps within [0 − 1 km] range for better visualization, the reported errors are traffic signs, and static objects. Another notable difference is that although the
capped to achieve a common evaluation criteria. Note that the range of the depth road markings are not clearly visible in the snow, the range sensors attract the
prediction formulated in equation (2) is theoretically not limited. The error and model focus to road boundaries. On the other hand, the lidar-based model suffers
accuracy metrics used in the evaluations are defined as: from the water droplets in fog, visualized by the dense blue points around it.
∑ The results validate the effectiveness of GRAMME in exploiting the cross-domain
1 |D(x,y)−Dgt (x,y)|
AbsRel ≡ |Ω| Dgt (x,y) complementary features.
(x,y)∈Ω

1
∑ |D(x,y)−Dgt (x,y)|2 Visualizing feature space with SHAP values. SHAP20 approximates an interpretable,
SqRel ≡ |Ω| Dgt (x,y) explanation model g of the original, complex model f, to explain a prediction
(x,y)∈Ω
√ made by the model f(x). SHAP provides post-hoc model explanations for
1
∑ an individual output of f and is model-agnostic. SHAP is a game-theoretic
RMSE ≡ |Ω|
|D(x, y) − Dgt (x, y)|2
(x,y)∈Ω approach based on Shapley values45, which calculates the contribution
√ ∑
of each feature in the final prediction performance. We use a special
RMSElog ≡ 1
|Ω|
| log D(x, y) − log Dgt (x, y)|2 implementation of the SHAP approach, Deep SHAP method introduced
(x,y)∈Ω by Lundberg and colleagues20, which combines SHAP values computed for
∑ smaller components of the network into SHAP values for the whole network.
log10 ≡ 1
|Ω|
| log D(x, y) − log Dgt (x, y)| It defines DeepLIFT’s multipliers46 in terms of SHAP values, and recursively
(x,y)∈Ω
( ) passes the values backwards through the network. Deep SHAP exploits the
. D(x,y) Dgt (x,y) composition rule and the efficient analytical SHAP solutions for simple
Accuracy ≡ % of D(x, y) s.t. δ = max Dgt (x,y) , D(x,y) <τ
networks components such as linear, max pooling, or an activation function
with just one input, enabling a fast approximation of values for the whole
D(x, y) is the predicted depth at (x, y) ∈ Ω and Dgt(z, y) is the corresponding ground model. This approach helps us derive an effective linearisation from the
truth. We use the most common three different thresholds τ (1.25, 1.252 and 1.253) SHAP values computed for each component instead of heuristically choosing
in the accuracy metric. Since the monocular camera lacks the absolute scale, we ways to linearize components.
multiply the monocular depth predictions by a scaling factor, s, that matches the
median with the ground-truth depth map to solve the scale ambiguity issue, that is, Computational hardware and software. We stored the raw dataset files on
s = median(Dgt)/median(D). The depth prediction results in terms of those metrics multiple hard drives. We performed the demosaicing of camera images, the
are shown in Extended Data Table 2. We evaluate the depth prediction performance projection of lidar frames, and Cartesian conversion of radar measurements on
of the competing approaches under diverse settings such as day, night, rain, fog, Intel Xeon CPUs, which are then stored on a fast local SSD. We used two local
and snow, following the same training and test protocol. We use Monodepth2 NVIDIA RTX 3090 GPUs for each training experiment accelerated through
(ref. 21) as a baseline, which is the most similar architecture to the camera module batch parallelization and a local NVIDIA GTX 1080Ti GPU to evaluate
of GRAMME. We train, validate and test it using the same dataset split as run-time performance. We implement our multimodal processing pipeline in
GRAMME. Although Monodepth2 achieves comparable results in day sequences, Python and employ imaging processing libraries such as colour-demosaicing
it performs poorly in reduced visibility conditions due to the occlusions and low (v.0.1.6), and pillow (v.8.4.0). To train the deep learning models, and augment
lighting. Since the camera module of GRAMME is most similar to Monodepth2, the datasets, we used machine learning libraries such as PyTorch (version 1.8.0),
we provide the performance evaluation for GRAMME models trained using torchvision (v.0.9.1). We generated all plots using matplotlib (v3.5.0) and
range sensors (for example, lidar and radar), emphasizing the effectiveness of the seaborn (v.0.11.2). The Robotcar dataset is processed using Robotcar dataset
multimodal approach. Note that none of the models has access to additional sensor SDK (v.3.1), and the RADIATE dataset is processed with RADIATE dataset SDK
measurements at test time other than camera images. The results indicate that (commit dca2270).
Ablation study on the deep network. Deep learning models might benefit from larger and more complex networks to improve the prediction accuracy (ref. 41), which comes at a run-time cost. The encoders in GRAMME are based on the ResNet18 (ref. 42) architecture. We replace the encoder with commonly used networks such as MobileNet (ref. 43) and VGG16 to analyse the performance and latency of the models (ref. 44), which is shown in Extended Data Fig. 1b. We benchmark the models in terms of depth prediction performance and inference time for a minibatch size of four on an NVIDIA GTX 1080Ti consumer-grade GPU. The inference time is evaluated for the total of pose and depth predictions with an additional pose fusion for the multimodal tests. While the networks have the same inference time for different test conditions, the networks in fusion models have higher latency than the models for single modality. The multimodal input and parallel network branches for multiple modalities cause higher latency in fusion models. Although the overall run-time for ResNet is higher than MobileNet, ResNet achieves a significant performance boost. On the other hand, despite the slight performance gain of VGG in the monocular setting at the cost of four times the inference time of ResNet, VGG falls behind ResNet in the other test settings. ResNet efficiently trades off between accuracy and latency, and has a noticeably lower GPU run-time. We, therefore, select ResNet as our encoder.
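
A latency comparison of this kind can be approximated with off-the-shelf torchvision backbones, as in the hedged sketch below. The batch size of four mirrors the setup described above, but the code is illustrative rather than the benchmark used in the paper; it assumes a CUDA-capable GPU, and mobilenet_v2 stands in for the MobileNet encoder.

import time
import torch
from torchvision import models

def time_backbone(net, batch=4, iters=50, device="cuda"):
    # Rough GPU latency measurement for an encoder backbone (illustrative only).
    net = net.to(device).eval()
    x = torch.randn(batch, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(10):            # warm-up iterations
            net(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            net(x)
        torch.cuda.synchronize()
    return (time.time() - start) / iters * 1000.0  # milliseconds per minibatch

backbones = {
    "resnet18": models.resnet18(),
    "mobilenet_v2": models.mobilenet_v2(),
    "vgg16": models.vgg16(),
}
for name, net in backbones.items():
    print(name, f"{time_backbone(net):.1f} ms")
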
Interpretability. We visualize the feature space of the depth prediction module with respect to the camera, lidar and radar inputs to better understand the SHAP values computed for each component instead of heuristically choosing ways to linearize components.
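
The attribution analysis here relies on SHAP values (refs. 20,46). As a simpler, self-contained stand-in that conveys the same idea of per-input attributions, the sketch below computes a plain input-gradient saliency map for a stand-in convolutional predictor; the network and tensors are placeholders, not the GRAMME depth module or the authors' interpretability pipeline.

import torch
import torch.nn as nn

# Stand-in convolutional predictor (placeholder for the depth module).
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 1, 3, padding=1))

x = torch.randn(1, 3, 64, 64, requires_grad=True)  # dummy input image
score = net(x).mean()          # scalar summary of the dense prediction
score.backward()               # gradients with respect to the input pixels
saliency = x.grad.abs()        # per-pixel attribution map
print(saliency.shape)          # torch.Size([1, 3, 64, 64])
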
Computational hardware and software. We stored the raw dataset files on multiple hard drives. We performed the demosaicing of camera images, the projection of lidar frames, and the Cartesian conversion of radar measurements on Intel Xeon CPUs; the processed outputs are then stored on a fast local SSD. We used two local NVIDIA RTX 3090 GPUs for each training experiment, accelerated through batch parallelization, and a local NVIDIA GTX 1080Ti GPU to evaluate run-time performance. We implement our multimodal processing pipeline in Python and employ image processing libraries such as colour-demosaicing (v.0.1.6) and pillow (v.8.4.0). To train the deep learning models and augment the datasets, we used machine learning libraries such as PyTorch (v.1.8.0) and torchvision (v.0.9.1). We generated all plots using matplotlib (v.3.5.0) and seaborn (v.0.11.2). The Robotcar dataset is processed using the Robotcar dataset SDK (v.3.1), and the RADIATE dataset is processed with the RADIATE dataset SDK (commit dca2270).
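
The Cartesian conversion of radar measurements mentioned above can be illustrated with a minimal nearest-neighbour resampling of a polar scan (azimuth x range bins) onto a bird's-eye-view grid. The bin size, grid resolution and maximum range below are arbitrary illustrative values, not those of the datasets used in the paper.

import numpy as np

def polar_to_cartesian(scan, bin_size_m, grid_size=512, max_range=100.0):
    # Illustrative nearest-neighbour conversion of a polar radar scan
    # (azimuth x range bins) to a Cartesian bird's-eye-view grid.
    n_az, n_bins = scan.shape
    xs = np.linspace(-max_range, max_range, grid_size)
    ys = np.linspace(-max_range, max_range, grid_size)
    xx, yy = np.meshgrid(xs, ys)
    rr = np.sqrt(xx ** 2 + yy ** 2)                 # range of every grid cell
    aa = np.mod(np.arctan2(yy, xx), 2 * np.pi)      # azimuth of every grid cell
    az_idx = np.round(aa / (2 * np.pi) * n_az).astype(int) % n_az
    r_idx = np.round(rr / bin_size_m).astype(int)
    valid = r_idx < n_bins                          # cells inside the sensed range
    cart = np.zeros((grid_size, grid_size), dtype=scan.dtype)
    cart[valid] = scan[az_idx[valid], r_idx[valid]]
    return cart

# Example with a synthetic scan: 400 azimuths, 576 range bins, 0.17 m per bin.
cart = polar_to_cartesian(np.random.rand(400, 576), bin_size_m=0.17)
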
Data availability
The Oxford Robotcar Dataset (ref. 16) and the Oxford Radar RobotCar (ref. 15) datasets are available from the University of Oxford under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://robotcar-dataset.robots.ox.ac.uk/). The RADIATE dataset (ref. 17) is available from the Edinburgh Centre for Robotics, Heriot-Watt University, under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (http://pro.hw.ac.uk/radiate/). These references constitute the minimum datasets necessary to interpret, verify and extend the research in the article, and are transparent to readers.

Code availability
All code was implemented in Python using the deep learning framework PyTorch. Code, trained models and scripts reproducing the experiments of this paper are available at https://github.com/yasinalm/gramme (refs. 47–49). All source code is provided under the MIT license.

Received: 24 January 2022; Accepted: 12 July 2022; Published: xx xx xxxx

References
1. Safe driving cars. Nat. Mach. Intell. 4, 95–96 (2022).
2. Yang, G.-Z. et al. The grand challenges of Science Robotics. Sci. Robot. 3, eaar7650 (2018).
3. Fagnant, D. J. & Kockelman, K. Preparing a nation for autonomous vehicles: opportunities, barriers and policy recommendations. Transp. Res. 77, 167–181 (2015).
4. Cadena, C. et al. Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Trans. Robotics 32, 1309–1332 (2016).
5. Spielberg, N. A., Brown, M., Kapania, N. R., Kegelman, J. C. & Gerdes, J. C. Neural network vehicle models for high-performance automated driving. Sci. Robot. 4, aaw1975 (2019).
6. Hancock, P. A., Nourbakhsh, I. & Stewart, J. On the future of transportation in an era of automated and autonomous vehicles. Proc. Natl Acad. Sci. USA 116, 7684–7691 (2019).
7. Waldrop, M. M. Autonomous vehicles: no drivers required. Nature 518, 20–23 (2015).
8. Orr, I., Cohen, M. & Zalevsky, Z. High-resolution radar road segmentation using weakly supervised learning. Nat. Mach. Intell. 3, 239–246 (2021).
9. Geiger, A., Lenz, P., Stiller, C. & Urtasun, R. Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32, 1231–1237 (2013).
10. Zang, S. et al. The impact of adverse weather conditions on autonomous vehicles: how rain, snow, fog, and hail affect the performance of a self-driving car. IEEE Veh. Technol. Mag. 14, 103–111 (2019).
11. Orr, I. et al. Coherent, super-resolved radar beamforming using self-supervised learning. Sci. Robot. 6, eabk0431 (2021).
12. Zaffar, M., Ehsan, S., Stolkin, R. & Maier, K. M. Sensors, SLAM and long-term autonomy: a review. In 2018 NASA/ESA Conference on Adaptive Hardware and Systems (AHS) 285–290 (NASA, 2018).
13. yasinalm/gramme (GitHub, 2022); https://github.com/yasinalm/gramme
14. Committee, S. O.-R. A. V. S. et al. Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems. SAE Standard J. 3016, 1–16 (2014).
15. Barnes, D., Gadd, M., Murcutt, P., Newman, P. & Posner, I. The Oxford Radar RobotCar Dataset: a radar extension to the Oxford RobotCar Dataset. In 2020 IEEE International Conference on Robotics and Automation (ICRA) 6433–6438 (IEEE, 2020).
16. Maddern, W., Pascoe, G., Linegar, C. & Newman, P. 1 year, 1000 km: The Oxford RobotCar Dataset. Int. J. Robot. Res. 36, 3–15 (2017).
17. Sheeny, M. et al. RADIATE: a radar dataset for automotive perception in bad weather. In 2021 IEEE International Conference on Robotics and Automation (ICRA) 1–7 (IEEE, 2021).
18. Schwarting, W., Pierson, A., Alonso-Mora, J., Karaman, S. & Rus, D. Social behavior for autonomous vehicles. Proc. Natl Acad. Sci. USA 116, 24972–24978 (2019).
19. Choi, J. K. & Ji, Y. G. Investigating the importance of trust on adopting an autonomous vehicle. Int. J. Hum.-Comput. Int. 31, 692–702 (2015).
20. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems 4768–4777 (Curran Associates, 2017).
21. Godard, C., Aodha, O. M., Firman, M. & Brostow, G. Digging into self-supervised monocular depth estimation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 3827–3837 (CVF, 2019).
22. Zhou, T., Brown, M., Snavely, N. & Lowe, D. G. Unsupervised learning of depth and ego-motion from video. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Vol. 6, 6612–6619 (IEEE, 2017).
23. Almalioglu, Y., Santamaria-Navarro, A., Morrell, B. & Agha-Mohammadi, A.-A. Unsupervised deep persistent monocular visual odometry and depth estimation in extreme environments. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 3534–3541 (IEEE, 2021).
24. Jaderberg, M., Simonyan, K., Zisserman, A. & Kavukcuoglu, K. Spatial transformer networks. In Advances in Neural Information Processing Systems Vol. 28, 2017–2025 (NeurIPS, 2015).
25. Barnes, D., Weston, R. & Posner, I. Masking by moving: learning distraction-free radar odometry from pose information. In Conference on Robot Learning 303–316 (PMLR, 2020).
26. Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015 234–241 (Springer, 2015).
27. Garg, R., Kumar, V. B. G., Carneiro, G. & Reid, I. Unsupervised CNN for single view depth estimation: geometry to the rescue. In Computer Vision–ECCV 2016 740–756 (Springer, 2016).
28. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems Vol. 32 (Curran Associates, 2019).
29. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2017).
30. Barnes, D. & Posner, I. Under the radar: learning to predict robust keypoints for odometry estimation and metric localisation in radar. In 2020 IEEE International Conference on Robotics and Automation (ICRA) 9484–9490 (IEEE, 2020).
31. Geiger, A., Lenz, P. & Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition 3354–3361 (IEEE, 2012).
32. Adolfsson, D., Magnusson, M., Alhashimi, A., Lilienthal, A. J. & Andreasson, H. CFEAR Radarodometry–conservative filtering for efficient and accurate radar odometry. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 5462–5469 (IEEE, 2021).
33. Churchill, W. & Churchill, W. S. Experience Based Navigation: Theory, Practice and Implementation. Ph.D. thesis, Univ. Oxford (2016).
34. Mur-Artal, R. & Tardós, J. D. ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 33, 1255–1262 (2017).
35. Zhang, J. & Singh, S. LOAM: lidar odometry and mapping in real-time. In Robotics: Science and Systems Vol. 2, 9 (2014).
36. Shan, T. & Englot, B. LeGO-LOAM: lightweight and ground-optimized lidar odometry and mapping on variable terrain. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 4758–4765 (IEEE, 2018).
37. Behley, J. & Stachniss, C. Efficient surfel-based SLAM using 3D laser range data in urban environments. In Robotics: Science and Systems XIV (RSS, 2018).
38. Cen, S. H. & Newman, P. Precise ego-motion estimation with millimeter-wave radar under diverse and challenging conditions. In 2018 IEEE International Conference on Robotics and Automation (ICRA) 6045–6052 (IEEE, 2018).
39. Burnett, K., Yoon, D. J., Schoellig, A. P. & Barfoot, T. D. Radar odometry combining probabilistic estimation and unsupervised feature learning. In Robotics: Science and Systems XVII (RSS, 2021).
40. Eigen, D., Puhrsch, C. & Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems Vol. 27, 2366–2374 (NeurIPS, 2014).
41. Guizilini, V., Ambrus, R., Pillai, S., Raventos, A. & Gaidon, A. 3D packing for self-supervised monocular depth estimation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 2485–2494 (CVF, 2020).
42. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
43. Howard, A. G. et al. MobileNets: efficient convolutional neural networks for mobile vision applications. Preprint at https://arxiv.org/abs/1704.04861 (2017).
44. Almalioglu, Y. et al. SelfVIO: self-supervised deep monocular visual-inertial odometry and depth estimation. Neural Netw. 50, 119–136 (2022).
45. Shapley, L. S. in A Value for n-Person Games Ch. 17, 307–318 (Princeton Univ. Press, 2016).
46. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. Preprint at https://arxiv.org/abs/1704.02685 (2019).
47. Yasin, A. yasinalm/gramme: GRAMME (Zenodo, 2022); https://doi.org/10.5281/zenodo.6464055
48. Cen, S. H. & Newman, P. Radar-only ego-motion estimation in difficult settings via graph matching. In 2019 International Conference on Robotics and Automation (ICRA) 298–304 (IEEE, 2019).
49. Hong, Z., Petillot, Y. & Wang, S. RadarSLAM: radar based large-scale SLAM in all weathers. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 5164–5170 (IEEE, 2020).

Acknowledgements
This work is supported in part by NIST grant nos. 70NANB17H185 (received by Y.A., N.T. and A.M.) and UKRI EP/S030832/1 ACE-OPS (received by Y.A., N.T. and A.M.). M.T. thanks TUBITAK for the 2232 International Outstanding Researcher Fellowship and ULAKBIM for the High Performance and Grid Computing Center (TRUBA resources). Y.A. would like to thank the Ministry of National Education in Turkey for their funding and support, and the DeepMIA lab at the Institute of Biomedical Engineering, Bogazici University, for their GPU support. Y.A. and M.T. thank Hunter Gilbert for the critical review of the manuscript.

Author contributions
Y.A. conceived the study, designed the experiments, and performed the experimental analysis. Y.A. and M.T. analysed the results and prepared the manuscript. N.T. and A.M. supervised the research.

Competing interests
The authors declare no competing interests.


Additional information
Extended data is available for this paper at https://doi.org/10.1038/s42256-022-00520-5.

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s42256-022-00520-5.

Correspondence and requests for materials should be addressed to Yasin Almalioglu.

Peer review information Nature Machine Intelligence thanks Itai Orr and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2022


Extended Data Fig. 1 | Performance on the RADIATE dataset, ablation study on the deep network, and run-time performance. (a) To evaluate the
generalization of GRAMME to new datasets, we train, validate and test models on the publicly available RADIATE dataset (ref. 17), which contains shorter
sequences for ego-motion estimation with high variation in scene and structure appearance. We report the mean depth predictions errors with standard
deviations and the distribution of motion prediction errors with quartiles. Although GRAMME performs well in terms of depth prediction and ego-motion
estimation performance, we observe on this dataset that due to dense fog and heavy precipitation, the performance of the lidar&camera-based
GRAMME model drops significantly compared to the day sequences. (b) As an additional ablation study, we replace the UNet network in the GRAMME
modules with commonly used networks. The results show the basis for our choice of ResNet in the GRAMME architecture. We also report the run-time
requirements in milliseconds, indicating the real-time capability of GRAMME on a consumer-grade GPU.


Extended Data Fig. 2 | Detailed architecture design of the proposed geometry-aware, multimodal, modular, interpretable and self-supervised ego-motion estimation. The modules consisting of encoder and decoder networks are based on the UNet architecture with skip connections. Feature extractors with an encoder are based on the ResNet18 network, visualized on a sample input (ref. 17). FC layers represent fully connected layers. The pose fusion network is a multilayer perceptron. As part of the spatial transformer module, the inverse warp algorithm re-uses the input target frames to calculate the reconstruction loss. The camera input can be set to contain more frames than the range sensor due to the higher fps rate. The fused pose is the final output, optimized in a self-supervised manner without ground truth with respect to the intermediate pose and depth predictions.
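
As an illustration of the pose-fusion multilayer perceptron described in this caption, the sketch below concatenates per-modality 6-DoF pose predictions and regresses a fused pose. The layer sizes and the two-modality setting are assumptions made for the example, not the exact configuration of GRAMME.

import torch
import torch.nn as nn

class PoseFusionMLP(nn.Module):
    # Illustrative pose-fusion MLP: concatenates the 6-DoF pose predictions
    # from several modality branches and regresses a fused 6-DoF pose.
    def __init__(self, n_modalities=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6 * n_modalities, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),   # fused translation + rotation parameters
        )

    def forward(self, poses):
        # poses: (batch, n_modalities, 6) -> fused pose (batch, 6)
        return self.net(poses.flatten(start_dim=1))

fused = PoseFusionMLP()(torch.randn(4, 2, 6))
print(fused.shape)  # torch.Size([4, 6])
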

Extended Data Table 1 | Comparative evaluation with different sensor modalities. Results on the Oxford Radar RobotCar dataset (ref. 15) are given in tuples of (% translation error) / (deg per 100 m). The column 'Mean' shows the mean spatial cross-validation error over the test sequences, ensuring that test and training data are not correlated. Failed sequences are marked with 'x'. Although GRAMME models with individual modalities slightly outperform the baselines in most sequences, the fusion models significantly improve upon the most related state-of-the-art due to the effective use of additional data.
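
The (% translation error) / (deg per 100 m) figures follow the KITTI-style odometry convention. As a rough, simplified illustration (per-frame rather than the full segment-based protocol), relative pose errors can be accumulated from 4 x 4 homogeneous ground-truth and estimated poses as in the sketch below; it is not the evaluation code used for this table.

import numpy as np

def relative_pose_errors(gt_poses, est_poses):
    # Translation error as a percentage of distance travelled and
    # rotation error in degrees per 100 m (simplified, per-frame version).
    t_err, r_err, dist = 0.0, 0.0, 0.0
    for i in range(len(gt_poses) - 1):
        gt_rel = np.linalg.inv(gt_poses[i]) @ gt_poses[i + 1]    # ground-truth relative motion
        est_rel = np.linalg.inv(est_poses[i]) @ est_poses[i + 1]  # estimated relative motion
        err = np.linalg.inv(gt_rel) @ est_rel                     # residual transform
        t_err += np.linalg.norm(err[:3, 3])
        cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        r_err += np.degrees(np.arccos(cos_angle))
        dist += np.linalg.norm(gt_rel[:3, 3])
    return 100.0 * t_err / dist, 100.0 * r_err / dist
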


Extended Data Table 2 | Quantitative comparison with state-of-the-art methods in terms of depth prediction error and accuracy. A higher value is better for the accuracy columns, and a lower value is better for the others. The methods are trained on the daytime data of the Oxford Robotcar dataset (ref. 15) and directly tested under the weather conditions labelled in the test column. The modalities used to train the models are represented by monocular (M), stereo (S), lidar (L) and radar (R). The models are tested with monocular images only, without access to any additional sensor. Notably, the fusion GRAMME models significantly improve the generalization performance under adverse conditions compared to Monodepth2, which is the most similar architecture to the camera-only GRAMME model.



Unit-2 Assignment

With respect to the problem identified in Unit-1, write a regression/classification algorithm from Machine Learning to analyze it. Collect appropriate datasets and validate the model.
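
As a hedged starting point for this assignment, the sketch below shows one possible scikit-learn workflow for a classification variant of the task. The file name dataset.csv and the target column are placeholders to be replaced with your own data, and logistic regression is only one example model.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Placeholder dataset: replace 'dataset.csv' and 'target' with your own data.
df = pd.read_csv("dataset.csv")
X, y = df.drop(columns=["target"]), df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Validate on the held-out test split and with 5-fold cross-validation.
print(classification_report(y_test, model.predict(X_test)))
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
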

Resources
1. https://www.nature.com/articles/s42256-022-00516-1
2. Andrew Ng YouTube Tutorial - Machine Learning
3. Introduction to Machine Learning, IITM, NPTEL/Swayam Course.
Rajalakshmi Institute of Technology
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCES
AD8552- Machine Learning
Unit-2: Machine Learning Methods
Part A
Sl.No  Questions  CO  Blooms Level*
1 Compare Supervised and Unsupervised learning algorithms. CO2 B2
2 Define Generalized Linear Model. CO2 B1
3 Write the formula for linear regression and the optimal weight vector for prediction. CO2 B3
4 Define regularization. CO2 B1
5 Compare Lasso and ridge regression. CO2 B2
6 “Logistic regression is a Classification Technique”-Justify. CO2 B5
7 Write the algorithm of KNN. CO2 B3
8 Write the merits and demerits of KNN. CO2 B3
9 Define Perceptron. CO2 B1
10 Solve AND using Perceptron. CO2 B6
11 Write the training processing steps in MLP. CO2 B3
12 List the training methods in back propagation technique. CO2 B1
13 Define RBFNN. CO2 B1
14 Define overfitting. CO2 B1
15 Write the cost functions of L1 & L2 regularization techniques. CO2 B3
16 Write the merits of dropout regularization technique. CO2 B3
17 List the merits of decision tree algorithms. CO2 B1
18 Compare the algorithms used in building the decision trees. CO2 B2
19 Write the terminologies and formulae related to the regression tree. CO2 B3
20 Write the measures of classification trees. CO2 B3
21 Write the training steps in the decision tree. CO2 B3
22 List the types of ensembling methods. CO2 B1
23 Write the steps in bagging. CO2 B3
24 List the merits of random forest algorithms. CO2 B1
25 Define Decision Jungle. CO2 B1
26 Compare the boosting techniques in the decision tree. CO2 B2
27 Relate SVM and multi class classification. CO2 B2
28 Write the regularization version of SVM. CO2 B3
29 Write the formula of C-SVM. CO2 B3
30 List the different Kernel Functions. CO2 B1
31 List the classification of Probabilistic Models. CO2 B1
32 Compare the Methods of Discriminative Probabilistic Models. CO2 B2
33 List the types of distribution. CO2 B1
34 Write the training steps in a Bayesian Network. CO2 B3
35 Write the algorithm of MLE. CO2 B3
36 Define Clustering. Write the types of Clustering Algorithms. CO2 B1
37 Define ICA. CO2 B1
38 Define SOM. CO2 B1
39 Define Autoencoder. CO2 B1
40 Write the steps in K-Means Clustering CO2 B3
Part B
Sl.No  Questions  CO  Blooms Level*

1 Summarize the following eight points (with (x, y) representing locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as ρ(a, b) = |x2 – x1| + |y2 – y1|. Use the K-Means algorithm to find the three cluster centers after the second iteration. CO2 B5
2 Write the derivation of the classification function of SVM. CO2 B3
3 Relate Regularization Techniques and Overfitting. CO2 B3
4 Solve XOR using a Multi-Layer Perceptron. CO2 B6
5 Create a decision-making model for UG course selection for students who have passed out of school. CO2 B6
6 Design an algorithm to classify your handwriting against your sibling's. CO2 B6
Part C

Sl.No  Questions  CO  Blooms Level*

1 Propose an idea to design a Neural Network for a Flood Water Drainage System. CO2 B6
2 Propose a specific problem in Financial Applications and formulate your way of solving the problem using Machine Learning. CO2 B6
3 Propose a specific problem in Automobile Engineering and formulate your way of solving the problem using Machine Learning. CO2 B6
