Unit 2
1. Linear Methods
In general, machine learning algorithms are divided into two types:
1. Supervised learning algorithms
2. Unsupervised learning algorithms
Supervised learning algorithms address problems that involve learning with guidance. In other words, the training data in supervised learning methods must contain labeled samples. For example, for a classification problem we need samples with class labels, and for a regression problem we need samples with the desired output value for each sample. The underlying mathematical model then learns its parameters using the labeled samples so that it can make predictions on samples the model has not seen before, also referred to as test samples.
Most applications in machine learning involve some form of supervision, and thus most of the chapters in this course focus on the various supervised learning methods. Unsupervised learning deals with problems that involve data without labels. In some sense one can argue that this is not really a machine learning problem, as there is no labeled knowledge to be learned from past experience. Unsupervised approaches try to find some structure or some form of trend in the training data. Some unsupervised algorithms try to model the origin of the data itself. A common example of unsupervised learning is clustering.
Case Study
Let's try to understand regression analysis with an example. Imagine you have made plans
with friends after a long time and you wish to go out, but you are not sure whether it will
rain or not. It’s the monsoon season, but your mom says the air feels dry today, and
therefore the probability of raining today is less. On the contrary, your sister believes
because it rained yesterday it’s likely that it will rain today. Considering you are no Lord of
Thunder and you have no control over the weather, how will you decide whose opinion to
take more seriously, keeping in mind the fact that you are impartial towards both?
Regression Analysis might come to your rescue. There are many factors on which rain
depends like geography, time of the year, precipitation, wind speed but unless you are the
weather department or Sheldon, you wouldn't want to work with all these values. So, you would take the humidity level and the previous day's precipitation to predict today's precipitation level (the amount of rainfall). You can get both of these values easily on the
internet, I know you can get the weather forecast for today too, but we are trying to learn
something here.
In our example, what we are trying to predict is today's precipitation level, which depends on the level of humidity and the rain received yesterday; hence it is called the dependent variable. The variables on which it depends are called independent variables.
What we try to do with regression Analysis is to model or quantify the relationship
between these two kinds of variables and hence predict one with the help of the other with
a level of certainty. An informed guess is better than random guessing, right? To solve our problem with a simple linear regression, we would collect the humidity and precipitation levels for the previous month and plot them.
Even without doing any math, we can infer that humidity and rainfall(precipitation) are
linearly correlated. An increase in the value of one leads to an increase in the other value
too. But here you can see we have oversimplified the problem by making a lot of
assumptions, the major one being humidity is the only or the most important factor in
deciding rainfall. In real-world (not-so simplified) business problems, there are many
variables with complex relationships between them.
Terminology
Outliers: Outliers are values or data points that lie far away from the general population or distribution of the data. Outliers can skew the results of an ML model toward themselves. Therefore, it is necessary to detect them early on or to use algorithms that are resistant to outliers.
Overfitting: Overfitting happens when the ML model learns the training data so closely that it memorizes every little detail and noise. The resulting model is needlessly complex, generalizes poorly, and therefore performs badly on any unseen dataset.
Heteroscedasticity: This term is hard to even read, let alone to understand. So, we’ll take
an example. We have humidity which predicts rainfall or precipitation. Now as humidity
increases the amount by which precipitation increases or decreases is variable and is not
fixed. If the level of precipitation had increased or decreased at a constant rate as the humidity increased, we would have called it homoscedasticity.
In multiple linear regression, the dependent variable y is modeled as a linear combination of several explanatory variables:
y = a + bX1 + cX2 + dX3 + ……
where X1, X2, X3, … are the explanatory variables, a is the intercept, and b, c, d, … are the regression coefficients. A positive coefficient tells how much of a positive influence a predictor has on the dependent variable, and a negative coefficient indicates the opposite.
Linear regression can also increase the operational efficiency of a business through data-driven decision-making. A bike rental company can avoid overstocking or understocking of bikes by modeling the relationship between the number of bikes rented and factors like time of the day, traffic on the road, weather, etc.
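To make this concrete, here is a minimal sketch of fitting such a model with scikit-learn; the humidity and precipitation numbers are invented for illustration and are not taken from the case study data.
# A minimal sketch of multiple linear regression with scikit-learn
import numpy as np
from sklearn.linear_model import LinearRegression
# Each row: [humidity (%), previous day's precipitation (mm)] -- invented values
X = np.array([[60, 2], [70, 5], [80, 9], [85, 12], [90, 18]])
y = np.array([1.5, 4.0, 8.5, 11.0, 17.5])   # today's precipitation (mm)
model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)
print("coefficients b, c:", model.coef_)
print("prediction for humidity=75, rain yesterday=6 mm:", model.predict([[75, 6]]))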
Before we go any further, let us familiarize ourselves with the concept of a loss function, used to assess how well our regression algorithm fits the data. While fitting the regression line to our data, we position the line in such a way that the sum of squared vertical distances (residuals) of the data points from the line is minimized.
The Root Mean Squared Error (RMSE) is closely related to this: it squares the residuals (the vertical distance of each point from the line), averages them, and takes the square root:
RMSE = sqrt( (1/n) * Σ (Predicted(i) − Actual(i))² )
Here Predicted(i) is the value the regression line gives for the i-th point and Actual(i) is the observed value. RMSE tells you how well the regression line fits the data.
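A quick sketch of this computation with NumPy (the actual and predicted arrays are arbitrary example values):
# A minimal sketch of computing RMSE from predicted and actual values
import numpy as np
actual = np.array([1.5, 4.0, 8.5, 11.0, 17.5])       # observed values
predicted = np.array([2.0, 3.6, 8.9, 10.2, 18.1])    # values predicted by the fitted line
rmse = np.sqrt(np.mean((predicted - actual) ** 2))   # root of the mean squared residuals
print("RMSE:", rmse)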
Now, coming back to polynomial regression, when the relationship between variables is not
linear, it’s hard to fit a line on the data and minimize our cost function. This is when we
need Polynomial Regression.
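A minimal sketch of polynomial regression with scikit-learn; the synthetic data and the choice of degree 2 are assumptions made purely for illustration.
# Expand the feature into polynomial terms, then fit an ordinary linear model on them
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
X = np.linspace(0, 5, 20).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel() - 0.5 * X.ravel() ** 2 + np.random.normal(0, 0.2, 20)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[2.5]]))   # prediction from the fitted curve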
Logistic regression models the probability p that a sample belongs to the positive class as
p = 1 / (1 + e^−(b0 + b1*x1 + b2*x2 + b3*x3 + …))
where x1, x2, x3 are independent variables and b0, b1, b2, … are regression coefficients. In a binary classification problem, p gives the probability that the sample belongs to the positive class. When logistic regression is applied to real-world problems, such as detecting cancer, p gives the probability that the person has cancer. A p of less than 0.5 would be mapped to no cancer, and a value greater than that would map to cancer. Logistic regression is a linear method, but the predictions are transformed using the logistic function, so the curve follows the S-shaped logistic (sigmoid) curve.
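A minimal sketch of this idea with scikit-learn's LogisticRegression; the toy feature values and labels below are invented for illustration.
# Binary classification with logistic regression on a single invented feature
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.array([[1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5]])   # feature values
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                                   # 0 = negative class, 1 = positive class
clf = LogisticRegression().fit(X, y)
p = clf.predict_proba([[2.7]])[0, 1]   # probability of the positive class
print("P(positive class) =", p, "->", "positive" if p > 0.5 else "negative")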
In our probability distribution, 25% of the data points would lie on the left of Q1 and 75%
would lie to the left of Q3.
Ordinary Least Squares Regression or Linear Regression is modeled around the mean of the
dependent variable. Quantile regression allows us to understand relationships between
variables outside of the mean of the data, making it useful in understanding outcomes that
are non-normally distributed and that have non-linear relationships with predictor
variables.
Benefits and Applications of Quantile Regression Analysis
● Quantile regression can be used when the assumptions of linear regression are not
met. It is robust to outliers and can be used when heteroscedasticity is present.
● It is also useful when data is skewed as it does not depend on measures of mean but
quantiles. In any business, it is likely that the amount of money spent by customers
is skewed and the business might be more interested in the top quantiles rather
than the mean.
The red dots correspond to one class and green to the other. These classes can be easily
separated by a line in 2D space. But for SVM, it can’t be just any line. The distance between
the points in the two classes closest to each other is taken and the line passing mid-way
through it is the optimal dividing plane. These points that play a major role in deciding the
position of the separator line are called Support Vectors and hence the whole technique is
called Support Vector Machine. In more realistic cases, we have an n-dimensional space, where n is the number of features, and the separating boundary is a hyperplane rather than a simple line; with kernel functions it need not even be linear in the original feature space.
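A minimal sketch of a linear SVM on two small, linearly separable point clouds (the coordinates are invented for illustration):
# Fit a linear SVM and inspect the support vectors that fix the boundary
import numpy as np
from sklearn.svm import SVC
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
svm = SVC(kernel="linear").fit(X, y)
print("support vectors:\n", svm.support_vectors_)
print("prediction for (4, 4):", svm.predict([[4, 4]]))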
The Poisson distribution gives the probability of seeing k events in a time interval t:
PX(k) = ((λt)^k * e^(−λt)) / k!
where λ is the event rate (the number of events happening per unit time) and k is the number of events.
Consider a small-scale restaurant where we record the number of customers walking in between 10 a.m. and 11 a.m.; on average, 5 customers visit the restaurant in this hour. With this information, we can calculate the probability that there will be no customers between 10 a.m. and 11 a.m. as follows:
PX(0) = ((5)^0 * e^(−5)) / 0! = e^(−5) ≈ 0.0067
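A quick numerical check of this example using SciPy; the only input is the average rate of 5 customers per hour taken from the text.
# Probability of zero customers in the hour, from the Poisson pmf and from the formula
from scipy.stats import poisson
import math
rate = 5                      # expected customers between 10 a.m. and 11 a.m. (lambda * t)
print(poisson.pmf(0, rate))   # ~0.0067
print(math.exp(-5))           # same value, computed directly from e^(-5)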
● Perform PCA on the explanatory variables to obtain principal components and then
choose a subset from this.
● Using this subset and our dependent variable, fit a linear regression model to get a
vector of estimated regression coefficients.
● Transform this vector back to the scale of the original independent variables.
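The three steps above describe principal components regression; a minimal sketch with scikit-learn is shown below, where the synthetic data and the choice of three components are assumptions for illustration.
# Principal components regression: PCA, then linear regression on the components
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
pca = PCA(n_components=3)           # keep a subset of the principal components
Z = pca.fit_transform(X)
reg = LinearRegression().fit(Z, y)  # regression on the chosen components
beta_pc = reg.coef_
# Map the coefficient vector back to the original feature space (ignoring centering details)
beta_original = pca.components_.T @ beta_pc
print("coefficients in original feature space:", beta_original)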
A research project is studying the level of lead in home drinking water as a function of the
age of a house and family income. The water testing kit cannot detect lead concentrations
below 5 parts per billion (ppb). The EPA considers levels above 15 ppb to be dangerous.
These data are an example of left-censoring.
Benefits and Applications of Tobit Regression Analysis
● Tobit's method can be easily extended to handle truncated and other non-randomly
selected samples. Tobit models have been applied in demand analysis to
accommodate observations with zero expenditures on some goods.
● It has also been applied to estimate factors that impact grant receipt, including
financial transfers distributed to sub-national governments who may apply for these
grants. In these cases, grant recipients cannot receive negative amounts, and the
data is thus left-censored.
The Cox proportional hazards model has the form h(t) = h0(t) * exp(b1*x1 + b2*x2 + … + bp*xp), where t stands for survival time, h(t) is the hazard function, the coefficients b1, b2, …, bp measure the impact of the covariates x1, x2, …, xp, and h0(t) is the baseline hazard.
Lasso Regression
Lasso is also an extension of Linear Regression, but it implements L1 regularization instead of L2. The difference between L1 and L2 is that L1 penalizes the sum of the absolute values (magnitudes) of the coefficients, whereas L2 penalizes the sum of their squares.
Here, λ is used to control the level of regularization. The goal of lasso regression is to
obtain the subset of predictors that minimizes prediction error for a quantitative
response(dependent) variable. It does this by imposing a constraint on the model
parameters that causes regression coefficients for some variables to shrink toward zero.
Variables with a regression coefficient equal to zero after the shrinkage process are
excluded from the model. Variables with non-zero regression coefficient variables are most
strongly associated with the response variable. Thus, it helps in feature selection.
Elastic Net combines the two penalties. Here, λ is used to control the level of regularization as usual, while α gives the relative weights of the L1 and L2 penalties. The value of α always lies between 0 and 1.
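A minimal sketch of Lasso and Elastic Net with scikit-learn; note that λ corresponds to scikit-learn's alpha parameter and the L1/L2 mixing weight to l1_ratio, and the synthetic data is generated only for illustration.
# Lasso shrinks some coefficients toward (or exactly to) zero; Elastic Net mixes L1 and L2
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, ElasticNet
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=1)
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso coefficients:", lasso.coef_)           # many shrink toward or exactly to zero
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # 0 < l1_ratio < 1 mixes L1 and L2
print("Elastic Net coefficients:", enet.coef_)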
Introduction
The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification and regression predictive problems. However, it is mainly used for classification problems in industry. The following two properties define KNN well −
● Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase; it uses all of the training data at classification time.
● Non-parametric learning algorithm − KNN is also a non-parametric learning
algorithm because it doesn’t assume anything about the underlying data.
The working of the KNN algorithm can be summarized in the following steps:
Step 1 − Load the training as well as the test data.
Step 2 − Choose the value of K, i.e. the number of nearest data points to consider. K can be any integer.
Step 3 − For each point in the test data: calculate its distance to each row of the training data (for example, with the Euclidean distance), sort the distances in ascending order, take the top K rows, and assign the most frequent class among those rows (for classification) or their average value (for regression).
Step 4 − End
Example
The following is an example to understand the concept of K and working of KNN algorithm
−
Suppose we have a dataset which can be plotted as follows −
Now, we need to classify a new data point with black dot (at point 60,60) into a blue or red
class. We are assuming K = 3 i.e. it would find three nearest data points. It is shown in the
next diagram −
We can see in the above diagram the three nearest neighbors of the data point with the black dot. Among those three, two of them lie in the Red class, hence the black dot will also be assigned to the Red class.
KNN as Classifier
First, start with importing necessary python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, load the Iris dataset from the UCI repository; data preprocessing will be done with the help of the following script lines.
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(path, names = headernames)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into a train and test split. The following code will split the dataset into 60% training data and 40% test data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)
Next, train the model with the help of KNeighborsClassifier class of sklearn as follows −
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)
At last, we need to make predictions. It can be done with the help of the following script −
y_pred = classifier.predict(X_test)
Next, print the results as follows −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60
Accuracy: 0.8833333333333333
KNN as Regressor
First, start with importing necessary Python packages −
import numpy as np
import pandas as pd
Next, download the Iris dataset from its web link as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Output
The MSE is: 0.12226666666666669
Cons
● It is computationally a bit expensive algorithm because it stores all the
training data.
● High memory storage required as compared to other supervised learning
algorithms.
● Prediction is slow in case of big N.
● It is very sensitive to the scale of data as well as irrelevant features.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan approval, by checking whether that individual has characteristics similar to those of past defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by comparing with the
persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'", or "Will Vote for Party 'BJP'".
Other areas in which KNN algorithms can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
Perceptron
A Perceptron is a neural network unit that performs computations on the input data to detect features. It connects artificial neurons using simple logic gates with binary outputs. An artificial neuron implements a mathematical function and has a node, inputs, weights, and an output, which correspond to the cell nucleus, dendrites, synapses, and axon of a biological neuron, respectively.
Biological Neuron
A human brain has billions of neurons. Neurons are interconnected nerve cells in the
human brain that are involved in processing and transmitting chemical and electrical
signals. Dendrites are branches that receive information from other neurons.
Cell nucleus or Soma processes the information received from dendrites. Axon is a cable
that is used by neurons to send information. Synapse is the connection between an axon
and other neuron dendrites.
Researchers Warren McCulloch and Walter Pitts published their first concept of a simplified brain cell in 1943. This was called the McCulloch-Pitts (MCP) neuron. They described such a nerve cell as a simple logic gate with binary outputs.
Multiple signals arrive at the dendrites and are then integrated in the cell body; if the accumulated signal exceeds a certain threshold, an output signal is generated and passed on by the axon. In the next section, let us talk about the artificial neuron.
Artificial Neuron
An artificial neuron is a mathematical function based on a model of biological neurons,
where each neuron takes inputs, weighs them separately, sums them up and passes this
sum through a nonlinear function to produce output.
Biological Neuron vs. Artificial Neuron
The biological neuron is analogous to the artificial neuron in the following terms:
Dendrites – Inputs
Cell nucleus (Soma) – Node
Synapse – Weights
Axon – Output
Perceptron
Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron
learning rule based on the original MCP neuron. A Perceptron is an algorithm for
supervised learning of binary classifiers. This algorithm enables neurons to learn and
process elements in the training set one at a time.
Types of Perceptron:
1. Single layer: Single layer perceptron can learn only linearly separable patterns.
2. Multilayer: Multilayer perceptrons can learn about two or more layers having a
greater processing power.
The Perceptron algorithm learns the weights for the input signals in order to draw a linear
decision boundary.
Note: Supervised Learning is a type of Machine Learning used to learn models from labeled
training data. It enables output prediction for future or unseen data. Let us focus on the
Perceptron Learning Rule in the next section.
Perceptron in Machine Learning
The most commonly used term in Artificial Intelligence and Machine Learning (AIML) is Perceptron. It is the beginning step in learning about neural networks and Deep Learning technologies, and it consists of input values, scores, thresholds, and weights implementing logic gates. The Perceptron is the foundational building block of an artificial neural network. In 1957, Frank Rosenblatt invented the Perceptron to perform specific high-level calculations to detect capabilities of the input data or business intelligence. However, now it is used for various other purposes.
Perceptron Model in Machine Learning
A machine-based algorithm used for supervised learning of various binary sorting tasks is
called Perceptron. Furthermore, Perceptron also has an essential role as an Artificial
Neuron or Neural link in detecting certain input data computations in business intelligence.
A perceptron model is also classified as one of the best and most specific types of Artificial Neural Networks. Being a supervised learning algorithm for binary classifiers, we can also consider it a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
Working of Perceptron
As discussed earlier, the Perceptron is considered a single-layer neural network with four main parameters. The perceptron model begins by multiplying all input values with their
weights, then adds these values to create the weighted sum. Further, this weighted sum is
applied to the activation function ‘f’ to obtain the desired output. This activation function is
also known as the step function and is represented by ‘f.’
This step function or Activation function is vital in ensuring that output is mapped between
(0,1) or (-1,1). Take note that the weight of input indicates a node’s strength. Similarly, an
input value gives the ability to shift the activation function curve up or down.
Step 1: Multiply all input values with corresponding weight values and then add to calculate
the weighted sum. The following is the mathematical expression of it:
∑wi*xi = x1*w1 + x2*w2 + x3*w3 + …… + xn*wn
Add a term called bias ‘b’ to this weighted sum to improve the model’s performance.
Step 2: An activation function is applied with the above-mentioned weighted sum giving us
an output either in binary form or a continuous value as follows:
Y=f(∑wi*xi + b)
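A minimal sketch of these two steps in NumPy; the inputs, weights and bias are arbitrary illustrative values.
# Step 1: weighted sum plus bias; Step 2: step activation function
import numpy as np
def step(z):
    return 1 if z > 0 else 0          # maps the weighted sum to a binary output
x = np.array([1.0, 0.0, 1.0])         # input values
w = np.array([0.5, -0.4, 0.3])        # weights
b = -0.2                              # bias term
weighted_sum = np.dot(w, x) + b       # sum(wi * xi) + b
y = step(weighted_sum)                # apply the activation function
print(weighted_sum, y)                # ~0.6 -> 1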
Types of Perceptron models
We have already discussed the types of Perceptron models in the Introduction. Here, we
shall give a more profound look at this:
1. Single Layer Perceptron model: One of the easiest ANN(Artificial Neural
Networks) types consists of a feed-forward network and includes a threshold
transfer inside the model. The main objective of the single-layer perceptron
model is to analyze the linearly separable objects with binary outcomes. A
Single-layer perceptron can learn only linearly separable patterns.
2. Multi-Layered Perceptron model: It is mainly similar to a single-layer
perceptron model but has more hidden layers.
Forward Stage: In the forward stage, activation functions begin at the input layer and terminate at the output layer.
Backward Stage: In the backward stage, the weight and bias values are modified according to the model's requirement; the error between the actual output and the desired output is propagated backward, starting from the output layer. A multilayer perceptron model has greater processing power and can process linear and non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, XNOR, and NOR.
Advantages:
● A multi-layered perceptron model can solve complex nonlinear problems.
● It works well with both small and large input data.
● Helps us to obtain quick predictions after the training.
● Helps us obtain the same accuracy ratio with big and small data.
Disadvantages:
● In multi-layered perceptron models, computations are time-consuming and
complex.
● It is difficult to determine how much each independent variable affects the dependent variable.
● The model functioning depends on the quality of training.
Characteristics of the Perceptron Model
1. It is a machine learning algorithm that uses supervised learning of binary
classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and then the decision is made
whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the weighted sum of the inputs is greater than zero.
5. The linear decision boundary is drawn, enabling the distinction between the two
linearly separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, it must have
an output signal; otherwise, no output will be shown.
The Perceptron receives multiple input signals, and if the sum of the input signals exceeds a
certain threshold, it either outputs a signal or does not return an output. In the context of
supervised learning and classification, this can then be used to predict the class of a sample.
Perceptron Function
The Perceptron is a function that maps its input "x," multiplied by the learned weight coefficients, to an output value "f(x)": f(x) = 1 if w · x + b > 0, and 0 otherwise.
In the equation given above:
● “w” = vector of real-valued weights
● “b” = bias (an element that adjusts the boundary away from origin without any
dependence on the input value)
● “x” = vector of input x values
A Boolean output is based on inputs such as salaried, married, age, past credit profile, etc. It has only two values: Yes and No, or True and False. The summation function "∑" multiplies all inputs "x" by their weights "w" and then adds them up, giving the decision f(x) = sgn(w · x + b).
An output of +1 specifies that the neuron is triggered. An output of -1 specifies that the
neuron did not get triggered.
“sgn” stands for sign function with output +1 or -1.
Error in Perceptron
In the Perceptron Learning Rule, the predicted output is compared with the known output.
If it does not match, the error is propagated backward to allow weight adjustment to
happen.
Perceptron: Decision Function
A decision function φ(z) of Perceptron is defined to take a linear combination of x and w
vectors.
Output:
The figure shows how the decision function squashes wTx to either +1 or -1 and how it can
be used to discriminate between two linearly separable classes.
Perceptron at a Glance
Perceptron has the following characteristics:
● Perceptron is an algorithm for Supervised Learning of single layer binary linear
classifiers.
● Optimal weight coefficients are automatically learned.
● Weights are multiplied with the input features and decision is made if the neuron
is fired or not.
● Activation function applies a step rule to check if the output of the weighting
function is greater than zero.
● Linear decision boundary is drawn enabling the distinction between the two
linearly separable classes +1 and -1.
● If the sum of the input signals exceeds a certain threshold, it outputs a signal;
otherwise, there is no output.
Types of activation functions include the sign, step, and sigmoid functions.
Implement Logic Gates with Perceptron
Perceptron - Classifier Hyperplane
The Perceptron learning rule converges if the two classes can be separated by the linear
hyperplane. However, if the classes cannot be separated perfectly by a linear classifier, it
could give rise to errors.
As discussed in the previous topic, the classifier boundary for a binary output in a
Perceptron is represented by the equation given below:
The diagram above shows the decision surface represented by a two-input Perceptron.
Observation:
● In Fig(a) above, examples can be clearly separated into positive and negative
values; hence, they are linearly separable. This can include logic gates like AND,
OR, NOR, NAND.
● Fig (b) shows examples that are not linearly separable (as in an XOR gate).
● Diagram (a) is a set of training examples and the decision surface of a Perceptron
that classifies them correctly.
● Diagram (b) is a set of training examples that are not linearly separable, that is,
they cannot be correctly classified by any straight line.
● X1 and X2 are the Perceptron inputs.
Logic gates are the building blocks of a digital system, especially neural networks. In short,
they are the electronic circuits that help in addition, choice, negation, and combination to
form complex circuits. Using the logic gates, Neural Networks can learn on their own
without you having to manually code the logic. Most logic gates have two inputs and one
output.
Each terminal has one of the two binary conditions, low (0) or high (1), represented by
different voltage levels. The logic state of a terminal changes based on how the circuit
processes data.
Based on this logic, logic gates can be categorized into seven types:
● AND
● NAND
● OR
● NOR
● NOT
● XOR
● XNOR
Implementing Basic Logic Gates With Perceptron
The logic gates that can be implemented with Perceptron are discussed below.
1. AND
If the two inputs are TRUE (+1), the output of Perceptron is positive, which amounts to
TRUE.
This is the desired behavior of an AND gate.
x1= 1 (TRUE), x2= 1 (TRUE)
w0 = -.8, w1 = 0.5, w2 = 0.5
=> o(x1, x2) => -.8 + 0.5*1 + 0.5*1 = 0.2 > 0
2. OR
If either of the two inputs are TRUE (+1), the output of Perceptron is positive, which
amounts to TRUE.
This is the desired behavior of an OR gate.
x1 = 1 (TRUE), x2 = 0 (FALSE)
w0 = -.3, w1 = 0.5, w2 = 0.5
=> o(x1, x2) => -.3 + 0.5*1 + 0.5*0 = 0.2 > 0
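A minimal sketch that wires the AND and OR weights given above into a step-function perceptron and prints the full truth tables:
# Perceptron logic gates using the weights from the AND and OR examples above
def perceptron_gate(x1, x2, w0, w1, w2):
    return 1 if (w0 + w1 * x1 + w2 * x2) > 0 else 0
AND = lambda x1, x2: perceptron_gate(x1, x2, -0.8, 0.5, 0.5)
OR  = lambda x1, x2: perceptron_gate(x1, x2, -0.3, 0.5, 0.5)
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))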
3. XOR
A XOR gate, also called an Exclusive OR gate, has two inputs and one output.
The gate returns a TRUE as the output if and ONLY if one of the input states is true.
XOR Truth Table
A    B    Output
0    0    0
0    1    1
1    0    1
1    1    0
Sigmoid Curve
The curve of the Sigmoid function called “S Curve” is shown here.
This is called a logistic sigmoid and leads to a probability of the value between 0 and 1.
This is useful as an activation function when one is interested in probability mapping rather
than precise values of input parameter t.
The sigmoid output is close to zero for highly negative input. This can be a problem in
neural network training and can lead to slow learning and the model getting trapped in
local minima during training. Hence, hyperbolic tangent is more preferable as an activation
function in hidden layers of a neural network.
Sigmoid Logic for Sample Data
Output
The Perceptron output is 0.888, which indicates the probability of output y being a 1.
If the sigmoid outputs a value greater than 0.5, the output is marked as TRUE. Since the
output here is 0.888, the final output is marked as TRUE.
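The sample data referred to above is not reproduced here; the sketch below uses arbitrary assumed inputs, weights and bias, so its numerical output will differ from 0.888, but the thresholding logic is the same.
# Sigmoid activation on a weighted sum, thresholded at 0.5
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
x = np.array([0.8, 0.6])     # assumed inputs
w = np.array([1.2, 0.9])     # assumed weights
b = 0.3                      # assumed bias
output = sigmoid(np.dot(w, x) + b)          # probability that y = 1
print(output, "->", "TRUE" if output > 0.5 else "FALSE")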
In the next section, let us focus on the rectifier and softplus functions.
Rectifier and Softplus Functions
Apart from the Sigmoid and Sign activation functions seen earlier, other common activation functions are ReLU and Softplus. They eliminate negative units, since the output of the max function is 0 for all units that are 0 or less.
A rectifier or ReLU (Rectified Linear Unit) is a commonly used activation function. This
function allows one to eliminate negative units in an ANN. This is the most popular
activation function used in deep neural networks.
● A smooth approximation to the rectifier is the Softplus function.
● The derivative of Softplus is the logistic or sigmoid function.
In the next section, let us discuss the advantages of the ReLu function.
Advantages of ReLu Functions
The advantages of ReLu function are as follows:
● Allows faster and more effective training of deep neural architectures on large
and complex datasets
● Sparse activation of only about 50% of units in a neural network (as negative
units are eliminated)
● More plausible or one-sided, compared to anti-symmetry of tanh
● Efficient gradient propagation, which means no vanishing or exploding gradient
problems
● Efficient computation with the only comparison, addition, or multiplication
● Scales well
This code implements the softmax formula and prints the probability of belonging to one of
the three classes. The sum of probabilities across all classes is 1.
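The softmax code referred to above is not shown in the text; a minimal sketch of the softmax formula for three classes, using arbitrary example scores, would look like this:
# Softmax over three class scores; the probabilities sum to 1
import numpy as np
scores = np.array([2.0, 1.0, 0.1])                 # raw scores for three classes
exp_scores = np.exp(scores - np.max(scores))       # subtract the max for numerical stability
probs = exp_scores / exp_scores.sum()
print(probs)          # probability of belonging to each of the three classes
print(probs.sum())    # 1.0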
Let us talk about Hyperbolic functions in the next section.
Hyperbolic Functions
1. Hyperbolic Tangent
Hyperbolic or tanh function is often used in neural networks as an activation function. It
provides output between -1 and +1. This is an extension of logistic sigmoid; the difference
is that output stretches between -1 and +1 here.
The advantage of the hyperbolic tangent over the logistic function is that it has a broader
output spectrum and ranges in the open interval (-1, 1), which can improve the
convergence of the backpropagation algorithm.
2. Hyperbolic Activation Functions
The graph below shows the curve of these activation functions:
Apart from these, tanh, sinh, and cosh can also be used for activation functions.
Based on the desired output, a data scientist can decide which of these activation functions
need to be used in the Perceptron logic.
3. Hyperbolic Tangent
This code implements the tanh formula. Then it calls both logistic and tanh functions on the
z value. The tanh function has two times larger output space than the logistic function.
With a larger output space and symmetry around zero, the tanh function leads to more even handling of the data, and it is easier to reach the global minimum of the loss function.
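The tanh code referred to above is not shown in the text; a minimal sketch comparing the logistic and tanh activations on the same value z:
# Logistic output lies in (0, 1); tanh output lies in (-1, 1) and is symmetric around zero
import numpy as np
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))   # same as np.tanh(z)
z = 1.0
print("logistic(z):", logistic(z))
print("tanh(z):", tanh(z))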
Activation Functions at a Glance
Various activation functions that can be used with Perceptron are shown below:
The activation function to be used is a subjective decision taken by the data scientist, based
on the problem statement and the form of the desired results. If the learning process is
slow or has vanishing or exploding gradients, the data scientist may try to change the
activation function to see if these problems can be resolved.
Radial Basis Kernel is a kernel function that is used in machine learning to find a
non-linear classifier or regression line.
The RBF kernel is defined as K(x, x') = exp(−||x − x'||² / (2σ²)). If we expand this exponential expression, it contains terms of every power of x and x', because the series expansion of e^x has infinitely many terms; the kernel therefore implicitly maps the data into an infinite-dimensional feature space.
If we apply an algorithm such as the perceptron algorithm or linear regression with this kernel, we are effectively applying the algorithm to the new infinite-dimensional data points we have created. It therefore finds a hyperplane in infinite dimensions, which corresponds to a very strong non-linear classifier or regression curve once we return to the original dimensions.
So, although we are applying a linear classifier/regression, the result is a non-linear classifier or regression curve, equivalent to a polynomial of unbounded degree. Being such a flexible model, the Radial Basis kernel is a very powerful kernel that can fit curves to very complex datasets.
Why is the Radial Basis Kernel so powerful?
The main motive of the kernel is to do calculations in any d-dimensional space where d > 1,
so that we can get a quadratic, cubic or any polynomial equation of large degree for our
classification/regression line. Since the Radial basis kernel uses exponent and as we know
the expansion of e^x gives a polynomial equation of infinite power, so using this kernel, we
make our regression/classification line infinitely powerful too.
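A minimal sketch of the idea: an SVM with the RBF kernel separating two concentric rings that no straight line could separate (the dataset is synthetic and chosen only for illustration).
# RBF-kernel SVM on a dataset that is not linearly separable
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
print("training accuracy:", rbf_svm.score(X, y))   # close to 1.0 despite the non-linear boundary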
A Bayesian belief network is a key technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts: a directed acyclic graph and a set of conditional probability tables.
The generalized form of Bayesian network that represents and solves decision problems
under uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
● Each node corresponds to the random variables, and a variable can be continuous or
discrete.
● Arcs or directed arrows represent the causal relationships or conditional dependencies between the random variables.
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which determines the effect of the parents on that node.
The Bayesian network is based on Joint probability distribution and conditional probability.
So let's first understand the joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written in terms of conditional probabilities using the chain rule:
P[x1, x2, x3, ..., xn] = P[x1 | x2, x3, ..., xn] * P[x2 | x3, ..., xn] * ... * P[xn-1 | xn] * P[xn]
In general, for each variable Xi in the network we can write:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably responds at detecting a burglary but also responds for minor earthquakes. Harry
has two neighbors David and Sophia, who have taken a responsibility to inform Harry at
work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the phone ringing with the alarm and calls then too. On the other hand, Sophia likes to listen to loud music, so sometimes she misses the alarm. Here we would like to compute probabilities involving the burglar alarm.
Problem:
Calculate the probability that the alarm has sounded, but there is neither a burglary, nor an
earthquake occurred, and David and Sophia both called Harry.
Solution:
● The Bayesian network for the above problem is given below. The network structure
is showing that burglary and earthquake is the parent node of the alarm and directly
affecting the probability of alarm going off, but David and Sophia's calls depend on
alarm probability.
● The network is representing that our assumptions do not directly perceive the
burglary and also do not notice the minor earthquake, and they also not confer
before calling.
● The conditional distributions for each node are given as a conditional probabilities
table or CPT.
● Each row in the CPT must sum to 1 because all the entries in the row represent an exhaustive set of cases for the variable.
● Burglary (B)
● Earthquake(E)
● Alarm(A)
● David Calls(D)
● Sophia calls(S)
We can write the events of the problem statement in the form of the probability P[D, S, A, B, E]. Using the joint probability distribution, this can be rewritten as:
P[D, S, A, B, E] = P[D | A] * P[S | A] * P[A | B, E] * P[B] * P[E]
P(E= False)= 0.999, Which is the probability that an earthquake did not occur.
The Conditional probability of David that he will call depends on the probability of Alarm.
The Conditional probability of Sophia that she calls is depending on its Parent Node
"Alarm."
From the formula of the joint distribution, we can write the problem statement as a product of the relevant conditional probabilities:
P(S, D, A, ¬B, ¬E) = P(S | A) * P(D | A) * P(A | ¬B, ¬E) * P(¬B) * P(¬E) = 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
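As a quick check of the number above, the same product can be computed directly in Python; the CPT values below are assumed illustrative values, since the original tables are not reproduced in the text.
# P(D, S, A, ~B, ~E) = P(D|A) * P(S|A) * P(A|~B,~E) * P(~B) * P(~E)
p_not_B = 0.998                 # P(Burglary = False)       (assumed CPT value)
p_not_E = 0.999                 # P(Earthquake = False)     (assumed CPT value)
p_A_given_notB_notE = 0.001     # P(Alarm | no burglary, no earthquake)  (assumed)
p_D_given_A = 0.91              # P(David calls | Alarm)    (assumed)
p_S_given_A = 0.75              # P(Sophia calls | Alarm)   (assumed)
p = p_D_given_A * p_S_given_A * p_A_given_notB_notE * p_not_B * p_not_E
print(p)   # ~0.00068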
There are two ways to understand the semantics of a Bayesian network: as a representation of the joint probability distribution, or as an encoding of a collection of conditional independence statements.
Suppose that we have a random sample from a population of interest. We may have a
theoretical model for the way that the population is distributed. However, there may be
several population parameters of which we do not know the values. Maximum likelihood
estimation is one way to determine these unknown parameters.
The basic idea behind maximum likelihood estimation is that we determine the values of
these unknown parameters. We do this in such a way to maximize an associated joint
probability density function or probability mass function. We will see this in more detail in
what follows. Then we will calculate some examples of maximum likelihood estimation.
A coin is flipped 100 times. Given that there were 55 heads, find the maximum likelihood
estimate for the probability p of heads on a single toss
Suppose that the lifetime of Badger brand light bulbs is modeled by an exponential
distribution with (unknown) parameter λ. We test 5 bulbs and find they have lifetimes of 2,
3, 1, 3, and 4 years, respectively. What is the MLE for λ?
Suppose 10 animals are captured, tagged and released. A few months later, 20 animals are
captured, examined, and released. 4 of these 20 are found to be tagged. Estimate the size of
the wild population using the MLE for the probability that a wild animal is tagged.
Suppose we have two light bulbs whose lifetimes follow an exponential(λ) distribution.
Suppose also that we independently measure their lifetimes and get data x1 = 2 years and
x2 = 3 years. Find the value of λ that maximizes the probability of this data.
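A minimal numerical sketch of maximum likelihood estimation for the exponential light-bulb example above, maximizing the log-likelihood over λ (for the exponential distribution the closed-form MLE is n / sum(x)):
# Numerically maximize the exponential log-likelihood for the observed lifetimes
import numpy as np
from scipy.optimize import minimize_scalar
x = np.array([2, 3, 1, 3, 4], dtype=float)   # observed lifetimes in years
def neg_log_likelihood(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print("numerical MLE:", result.x)            # ~0.385
print("closed form n / sum(x):", len(x) / x.sum())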
Activity : Run the code available in the following link and share the output in github.
https://python.quantecon.org/mle.html
Clustering Algorithms
Clustering algorithms are used to group data points based on certain similarities. There is no single universal criterion for a good clustering; it depends mainly on the specific user and the scenario. Clustering determines the grouping from unlabelled data.
● Hard clustering – the data point either entirely belongs to the cluster, or doesn’t. For
example, consider customer segmentation with four groups. Each customer can
belong to either one of four groups.
● Soft clustering – a probability score is assigned to data points to be in those clusters.
Types of clustering algorithms and how to select one for your use case
The main idea of hierarchical clustering is based on the concept that nearby objects are
more related than objects that are farther away. Let’s take a closer look at various aspects of
these algorithms:
1. Agglomerative – it starts with each element as its own cluster and then successively merges clusters into larger ones.
2. Divisive – it starts with the complete dataset and recursively divides it into partitions.
In this section, I’ll be explaining the AHC algorithm which is one of the most important
hierarchical clustering techniques. The steps to do it are:
1. Each data point is treated as a single cluster. We have K clusters in the beginning. At
the start, the number of data points will also be K.
2. Now we need to form a big cluster by joining 2 closest data points in this step. This
will lead to total K-1 clusters.
3. Two closest clusters need to be joined now to form more clusters. This will result in
K-2 clusters in total.
4. Repeat the above steps until only one big cluster remains and no more data points are left to join.
5. After forming one big cluster at last, we can use dendrograms to split the clusters
into multiple clusters depending on the use case.
Advantages of AHC:
● AHC is easy to implement, it can also provide object ordering, which can be
informative for the display.
● We don’t have to pre-specify the number of clusters. It’s easy to decide the number
of clusters by cutting the dendrogram at the specific level.
● In the AHC approach smaller clusters will be created, which may uncover
similarities in data.
Disadvantages of AHC:
● Objects that are grouped wrongly in the early steps cannot be reassigned later.
● Hierarchical clustering algorithms don’t provide unique partitioning of the dataset,
but they give a hierarchy from which clusters can be chosen.
● They don’t handle outliers well. Whenever outliers are found, they will end up as a
new cluster, or sometimes result in merging with other clusters.
Clustering dataset
Getting started with clustering in Python through Scikit-learn is simple. Once the library is
installed, a variety of clustering algorithms can be chosen.
We will be using the `make_classification` function from the `sklearn` library to generate a dataset and demonstrate the use of different clustering algorithms. The arguments passed to `make_classification` can be seen in the code below.
# Imports needed for the snippet
from numpy import unique, where
from matplotlib import pyplot as plot
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
# Initializing data
train_data, _ = make_classification(n_samples=1000,
                                    n_features=2,
                                    n_informative=2,
                                    n_redundant=0,
                                    n_clusters_per_class=1,
                                    random_state=4)
# Fit the model and assign a cluster label to every point
agg_mdl = AgglomerativeClustering(n_clusters=4)
agg_result = agg_mdl.fit_predict(train_data)
agg_clusters = unique(agg_result)
# plot clusters
for agg_cluster in agg_clusters:
    index = where(agg_result == agg_cluster)[0]
    plot.scatter(train_data[index, 0], train_data[index, 1])
plot.show()
Hierarchical clustering is often used for descriptive modeling rather than predictive modeling. It doesn't work well on large datasets and provides the best results only in some cases. Sometimes it's also difficult to determine the correct number of clusters from a dendrogram.
One of the most widely used centroid-based clustering algorithms is K-Means, and one of its
drawbacks is that you need to choose a K value in advance.
The K-Means algorithm splits the given dataset into a predefined(K) number of clusters
using a particular distance metric. The center of each cluster/group is called the centroid.
How does the K-Means algorithm work?
● Initially, a number K of centroids is chosen. There are different methods for selecting the right value of K.
● Shuffle the data and initialize the centroids: randomly select K data points as centroids, without replacement.
● Assign each data point to its nearest centroid, then create new centroids by calculating the mean value of all the samples assigned to each previous centroid.
● Repeat the assignment and update steps until there is no change in the centroids, i.e. the assignment of data points to clusters no longer changes.
● K-Means clustering typically uses the Euclidean distance to measure the distance between points.
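There is no code for K-Means itself in this section; the sketch below runs the steps above on the same kind of synthetic data used for the other clustering algorithms, where K = 4 is an assumption chosen for illustration.
# K-Means clustering on synthetic two-dimensional data
from numpy import unique, where
from matplotlib import pyplot as plot
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
train_data, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                    n_redundant=0, n_clusters_per_class=1, random_state=4)
kmeans_mdl = KMeans(n_clusters=4, n_init=10, random_state=4)
kmeans_res = kmeans_mdl.fit_predict(train_data)
for cluster in unique(kmeans_res):
    idx = where(kmeans_res == cluster)[0]
    plot.scatter(train_data[idx, 0], train_data[idx, 1])
plot.show()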
The Elbow method picks a range of K values and takes the best among them. It calculates the Within-Cluster Sum of Squares (WCSS) for different values of K, i.e. the sum of squared distances between each point Xi and the centroid Yi of its cluster:
WCSS = Σ (Xi − Yi)²
The value of K should be chosen at the point where WCSS stops diminishing rapidly; in the plot of WCSS versus K, this shows up as an elbow.
The Silhouette Coefficient (SC) value ranges between -1 and 1. A value of 1 means the clusters are well separated and distinguishable; clusters are wrongly assigned if the value is close to -1.
K-Means also has some advantages over other algorithms: it is simple to implement, scales well to large datasets, and adapts easily to new examples.
K-Medians is another clustering algorithm related to K-Means, except that cluster centers are recomputed using the median. The K-Medians algorithm is less sensitive to outliers, but because sorting is required to calculate the median vector, it is slower for large datasets.
K-Means has some disadvantages; the algorithm may provide different clustering results on
different runs as K-Means begins with random initialization of cluster centers. The chances
are that the results obtained might not be repeated.
● K-Means clustering is good at capturing the structure of the data if the clusters have
a spherical-like shape. It always tries to construct a nice spherical shape around the
centroid. This means that the minute the clusters have different geometric shapes,
K-Means does a poor job clustering the data.
● Even when data points belong to the same true cluster, K-Means does not allow points that are far away from each other to share the same cluster.
● K-Means algorithm is sensitive to outliers.
● As the number of dimensions increases, scalability decreases.
Here are some points to remember when using K-Means for clustering:
● Standardize the data when applying the K-Means algorithm, because it will help you
get good quality clusters and improve the performance of the clustering algorithm.
Since K-Means use a distance-based measure to find the similarity between data
points, it’s good to standardize the data to have a standard deviation of one and a
mean of zero. Usually, the features present in any dataset would have different units
of measurement, for example, income versus age.
● K-Means gives more weight to bigger clusters.
● The elbow method used to select the number of clusters may not work well, because the error function decreases for all values of K.
● If there’s overlap between clusters, K-Means doesn’t have an intrinsic measure for
uncertainty for the examples belonging to the overlapping region to determine
which cluster to assign each data point.
● K-Means clusters data even if it can’t be clustered, such as data that comes from
uniform distributions.
K-Means is one of the popular clustering algorithms, mainly because of its good time
performance. When the size of the data set increases, K-Means will result in a memory issue
since it needs the entire dataset. For those reasons, to reduce the time and space
complexity of the algorithm, an approach called Mini-Batch K-Means was proposed.
The Mini-Batch K-Means algorithm tries to fit the data in the main memory in a way where
the algorithm uses small batches of data that are of fixed size chosen at random. Here are a
couple of points to note about the Mini-Batch K-Means algorithm:
● The location of the clusters is updated based on the new points from each batch.
● The update made is the gradient descent update, which is notably faster than normal
batch K-Means.
Density-based clustering connects areas of high example density into clusters. This allows
for arbitrary shape distributions as long as dense regions are connected. With the higher
dimension data and data of varying densities, these algorithms run into issues. By design,
these algorithms don’t assign outliers to clusters.
DBSCAN
This type of clustering technique connects data points that satisfy particular density
criteria (minimum number of objects within a radius). After DBSCAN clustering is
complete, there are three types of points: core, border, and noise.
If you look at the above figure, a core point is a point that has at least some number (m) of points within a particular distance (n) from itself. A border point is a point that has at least one core point within distance n.
Noise is a point that is neither a border nor a core point. Data points in sparse areas, which are needed to separate clusters, are treated as noise or border points.
An exciting property of DBSCAN is its low complexity. It requires a linear number of range
queries on the database.
● It expects some kind of density drop to detect cluster borders. DBSCAN connects
areas of high example density. The algorithm is better than K-Means when it comes
to oddly shaped data.
# Imports needed for the snippet
from numpy import unique, where
from matplotlib import pyplot as plot
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
# Initialize data
train_data, _ = make_classification(n_samples=1000,
                                    n_features=2,
                                    n_informative=2,
                                    n_redundant=0,
                                    n_clusters_per_class=1,
                                    random_state=4)
# Define model (eps and min_samples are illustrative values)
dbscan_model = DBSCAN(eps=0.30, min_samples=9)
dbscan_res = dbscan_model.fit_predict(train_data)
dbscan_clstrs = unique(dbscan_res)
# plot each cluster found by DBSCAN (label -1 marks noise points)
for dbscan_clstr in dbscan_clstrs:
    index = where(dbscan_res == dbscan_clstr)[0]
    plot.scatter(train_data[index, 0], train_data[index, 1])
# show plot
plot.show()
The clustering model most closely related to statistics is based on distribution models. Clusters are then defined as objects that most likely belong to the same distribution.
closely resembles how artificial datasets are generated, by sampling random objects from
distribution.
While the theoretical aspects of these methods are pretty good, these models suffer from
overfitting.
GMM can be used to find clusters in the same way as K-Means. The probability that a point
belongs to the distribution’s center decreases as the distance from the distribution center
increases. The bands show a decrease in probability in the below image. Since GMM
contains a probabilistic model under the hood, we can also find the probabilistic cluster
assignment. When you don’t know the type of distribution in data, you should use a
different algorithm.
Let’s see how GMM calculates probabilities and assigns them to data points:
To find the covariance, mean, variance and weights of clusters, GMM uses the Expectation
Maximization technique.
To define the Gaussian distribution we need to find the values for these parameters. We’ve
already decided on the number of clusters and have assigned the value for mean,
covariance, and density. Next are the Expectation step and Maximizations step, which you
can check out in this post.
Advantages of GMM
● One of the advantages of GMM over K-Means is that K-Means doesn’t account for
variance (here, variance refers to the width of the bell-shaped curve) and GMM
returns the probability that data points belong to each of K clusters.
● In the case of overlapping clusters, the clustering algorithms discussed above fail to handle them properly, whereas GMM can still separate them through its probabilistic assignments.
● GMM uses a probabilistic approach and provides probability for each data point that
belongs to the clusters.
Disadvantages of GMM
● The number of mixture components (clusters) must be specified in advance.
● The Expectation-Maximization procedure can converge to a local optimum, so the results depend on the initialization, and models with many components can overfit.
Let’s now look at how GMM clusters data using the code below.
# Imports needed for the snippet
from numpy import unique, where
from matplotlib import pyplot as plot
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
# init data
train_data, _ = make_classification(n_samples=1200,
                                    n_features=3,
                                    n_informative=2,
                                    n_redundant=0,
                                    n_clusters_per_class=1,
                                    random_state=4)
# model training
gaussian_mdl = GaussianMixture(n_components=3)
gaussian_mdl.fit(train_data)
gaussian_res = gaussian_mdl.fit_predict(train_data)
gaussian_clstr = unique(gaussian_res)
# plot the clusters using the first two features
for clstr in gaussian_clstr:
    index = where(gaussian_res == clstr)[0]
    plot.scatter(train_data[index, 0], train_data[index, 1])
plot.show()
These are some issues you may encounter when applying clustering techniques:
● The results may be less accurate since data isn’t labeled in advance and input data
isn’t known.
● The learning phase of the algorithm might take a lot of time as it calculates and
analyses all possibilities.
● Without any prior knowledge the model is learning from raw data.
● As the number of features increases, complexity increases.
● Some projects involving live data may require continuous data feeding to the model,
resulting in time-consuming and inaccurate results.
● Choose the clustering algorithm so that it scales well on the dataset. Not all
clustering algorithms scale efficiently. Datasets in machine learning can have
millions of examples.
● Many clustering algorithms work by computing the similarities between all pairs of examples, so their run time grows with the square of the number of samples n, denoted O(n^2) in complexity notation. O(n^2) isn't practical when the number of examples is in the millions. In contrast, the K-Means algorithm has complexity O(n), meaning it scales linearly with n.
B. F1 score
C. Confusion matrix
A. Logistic regression
B. Linear regression
C. Polynomial regression
D. None
3. Classification is-
A. Unsupervised learning
B. Reinforcement learning
C. Supervised learning
D. None
4. You have a dataset of different flowers containing their petal lengths and color. Your
model has to predict the type of flower for given petal lengths and color. This is a-
A. Regression task
B. Classification task
C. Clustering task
D. None
5. A classifier-
C. Both A and B
D. None
D. None
B. Precision
C. Accuracy
D. None
10. Suppose your classification model predicted true for a class whose actual value was false. Then this is a-
A. False positive
B. False negative
C. True positive
D. True negative
11. The false negative value is 5 and the true positive value is 20. What will be the value of
recall-
A. 0.2
B. 0.6
C. 0.8
D. 0.3
12. The true positive value is 10 and the false positive value is 15. Calculate the value of
precision-
A. 0.6
B. 0.4
C. 0.5
D. None
13. If the precision is 0.6 and the recall value is 0.4. What will be the f measure?
A. 0.5
B. 0.6
C. 0.4
D. 0.3
A. Logistic Regression
C. Polynomial Regression
D. None
A. SVM
B. KNN
C. Decision tree
A. SVM
B. Decision tree
C. Random forest
A. Euclidean distance
B. Manhattan distance
C. Perpendicular distance
19. Which of the following is the best algorithm for text classification?
A. KNN
B. Decision tree
C. Random forest
D. Naive Bayes
A. Number of neighbors
D. None
A. a discriminative classifier
C. a probabilistic classifier
D. None
A. Decision boundaries
B. Decision functions
C. Mapping functions
D. None
B. A function that maps the value from one dimension to the other
D. None
A. Polynomial Kernel
B. Gaussian Kernel
C. Sigmoid Kernel
D. None
26. For two events A and B, the Bayes theorem will be-
28. Suppose you have a dataset that is randomly distributed. What will be the best
algorithm for that dataset?
B. Naive Bayes
C. K nearest neighbors
D. Decision tree
B. Accuracy
D. Precision
Articles
https://doi.org/10.1038/s42256-022-00520-5
Interest in autonomous vehicles (AVs) is growing at a rapid pace due to increased convenience, safety benefits and potential environmental gains. Although several leading AV companies predicted that AVs would be on the road by 2020, they are still limited to relatively small-scale trials. The ability to know their precise location on the map is a challenging prerequisite for safe and reliable AVs due to sensor imperfections under adverse environmental and weather conditions, posing a formidable obstacle to their widespread use. Here we propose a deep learning-based self-supervised approach for ego-motion estimation that is a robust and complementary localization solution under inclement weather conditions. The proposed approach is a geometry-aware method that attentively fuses the rich representation capability of visual sensors and the weather-immune features provided by radars using an attention-based learning technique. Our method predicts reliability masks for the sensor measurements, eliminating the deficiencies in the multimodal data. In various experiments we demonstrate the robust all-weather performance and effective cross-domain generalizability under harsh weather conditions such as rain, fog and snow, as well as day and night conditions. Furthermore, we employ a game-theoretic approach to analyse the interpretability of the model predictions, illustrating the independent and uncorrelated failure modes of the multimodal system. We anticipate our work will bring AVs one step closer to safe and reliable all-weather autonomous driving.
Autonomous vehicles (AVs) have recently attracted considerable attention from academia, industry and the general public due to their potential to revolutionize transportation, accelerated by advances in artificial intelligence. The deployment of AVs in our environmental landscape has the potential to decrease road accidents and traffic congestion, as well as improve our mobility in overcrowded cities. Despite extraordinary efforts from many of the leading names in the AV industry and research, AVs are still out of reach except in limited trial programs due to key concerns on their reliability and safety1 (see Supplementary Note 1 for details on AV safety levels). Apart from the technical problems, adverse weather conditions such as rain, fog and snow pose substantial challenges for safe and reliable driverless technology2,3.

Autonomous vehicles are equipped with different types of sensors such as cameras, lidars, radars, ultrasound and GPS to achieve a higher level of awareness of the surroundings, leading to increased safety, efficiency and capabilities2,4. Along with multiple sensors, artificial intelligence methodologies, machine learning, deep learning and large datasets play major roles in the development of AVs with higher levels of intelligence and mobility5,6. Artificial intelligence systems efficiently process the vast amount of multisensory data to train and validate the family of machine learning models that underpin the perception, localization, prediction and motion planning capabilities of autonomous driving systems7,8. These systems make sense of the world and the objects in the environment and dictate the paths that the vehicles ultimately take.

The localization capability is responsible for precisely predicting the AV's position on a map. Most of the core components of AVs such as prediction and planning rely on precise localization to, for example, within a few centimetres. Although AVs heavily rely on signals from space-based global navigation satellite systems such as GPS for localization, radio signals can be lost or degraded in many environments due to obstacles or reflections. In particular, AV operation in urban areas surrounded by high-rise buildings remains highly challenging. In addition, GPS merely provides metre-level location accuracy without orientation information, which is potentially fatal for passengers of AVs or those in the surroundings. For example, an AV might detect itself in the wrong lane before a turn, or might stop too late at an intersection due to imprecise localization. Ego-motion estimation (also called odometry) with onboard sensors provides a complementary localization solution in challenging environments, predicting the accurate relative self-position of AVs. It is therefore an essential component that lies at the core of an autonomous driving algorithmic stack and serves as the basis for numerous algorithms such as localization, prediction and motion planning. A robust and reliable ego-motion estimation system should address the sensor vulnerabilities that might be caused by various factors such as poor environmental conditions and sensor imperfections.

Artificial intelligence in AV research and development relies heavily on the use of public datasets in the computer vision and robotics communities9. Although the datasets are ever-increasingly massive, the acquisition of accurate ground-truth data to supervise the artificial intelligence systems is limited due to the need for manual labelling and deficiencies of the existing sensors. Cameras and lidars constitute the two primary perception sensors that are commonly adopted in AV research; however, as these sensors operate in the visible and infrared spectrum, inclement weather dramatically disrupts their sensory data, causing attenuation, multiple scattering and absorption10 (Supplementary Note 2). Millimetre-wave radars provide a key advantage over visible spectrum sensors in their immunity to adverse conditions, for example, they are agnostic to scene illumination and airborne obscurants10,11. The wavelength of millimetre-wave radars is much larger than the tiny airborne particles that form fog, rain and snow, and hence easily penetrates or diffracts around them. Furthermore, as they are radiofrequency-based
sensors, radars do not require optical lenses and can be integrated into plastic housings, making them highly resilient to water and dust ingress. We therefore believe that odometry approaches utilizing millimetre-wave radars will allow robust ego-motion estimation under diverse settings such as day, night, rain, fog and snow, and address the challenges in implementing radars (which are described in Supplementary Note 3). The introduction of a high-resolution radar in AV datasets created new opportunities for ego-motion estimation under challenging conditions. Despite the improved measurements, the radar measurements are still much coarser and noisier than those of lidars and cameras. As a result, ego-motion techniques developed for lidars cause large motion errors. Although further information in the full AV software stack from passive sensors (for example, wheel encoders and inertial measurement units) and intermediate predictions of software modules (for example, loop closure and bundle adjustment) can supplement the ego-motion estimation module, perception sensors such as the camera, lidar and radar play a pivotal role in the performance12. Ego-motion estimation methods should therefore exploit the advantages of cameras (rich, dense visual information), lidars (fine granularity within visible range) and radars (immunity to inclement weather) while addressing their relative shortcomings. Although deep learning models offer state-of-the-art solutions for ego-motion estimation tasks (Supplementary Note 4), adverse weather conditions pose a host of substantial challenges such as reduced sensing capability (for example, due to the occlusions caused by precipitation) and a wide range of domain shifts (for example, due to the discrepancy between a training dataset and the data encountered during deployment).

Here we propose a novel self-supervised deep learning framework, geometry-aware multimodal ego-motion estimation (GRAMME; Fig. 1), that addresses the key ego-motion estimation challenges for AVs outlined above. Our novel multimodal geometric reconstruction algorithm and reciprocal training technique create a supervisory signal for the self-supervised neural network. Under five diverse settings (day, night, rain, fog and snow) using publicly available independent datasets, we show that our multimodal approach provides robustness to unfavourable weather conditions and outperforms state-of-the-art ego-motion estimation approaches. Following a challenging experimental protocol, we show that the proposed modular design improves the performance of individual modalities even if the other modalities are unavailable at test time, providing robustness to sensor failures. Furthermore, we demonstrate the generalization capability of GRAMME by showing that models trained on regular sequences typically targeted by self-supervised studies can directly be applied to challenging sequences. We employ different sensors with various resolutions and beamwidths in the experiments and show that GRAMME is sensor agnostic. Furthermore, we use a game-theoretic approach to visualize the learnt feature space and illustrate the independent and uncorrelated failure modes of the proposed multimodal system, and show that GRAMME focuses on the relevant details in the environment. GRAMME is publicly available as an easy-to-use Python package13.

Self-supervised artificial intelligence for all-weather ego-motion estimation
GRAMME is a deep learning-based self-supervised method that uses multiple sensors such as cameras, lidars and radars to estimate the ego-motion of AVs by reconstructing the three-dimensional scene geometry under diverse settings such as day, night, rain, fog and snow. GRAMME is sensor agnostic and designed to support sensors with various configurations in terms of resolution, beamwidth and field-of-view (Supplementary Note 5). GRAMME uses a novel differentiable view-reconstruction algorithm to incorporate the measurements of range sensors (for example, lidars and radars), mitigating the limitations of cameras (both monocular and stereo) under challenging conditions. The key supervision signal to train the neural networks for depth and pose prediction comes from the new view-reconstruction algorithm: given a multimodal input view of a scene, it reconstructs a new view of the scene captured from a different position. The visual-reconstruction algorithm uses the predicted per-pixel depth and ego-motion, whereas the range-reconstruction algorithm uses the predicted ego-motion and range measurements, both making use of multimodal masks. The spatial transformer module of GRAMME implements the view reconstruction in a fully differentiable manner compatible with the ego-motion, depth and mask prediction neural networks.

At a high level, GRAMME has a modular design to enable independent operation for each modality during both training and inference, which improves the robustness of the system to achieve a minimal risk condition14. Although we train the modules for depth, pose and mask predictions jointly, they can directly operate on the input frames separately from each other during test time, leading to independent and uncorrelated failure modes for the modules. Moreover, the modular design enables the performance gains achieved during multimodal training to be maintained at inference time even when the complementary modalities are partially or entirely unavailable. We use a reciprocal multimodal training technique to enhance the predictions on individual modalities, providing information flow across submodules. Furthermore, the range measurements of radar can directly capture strong patterns related to the geometry of the scene, whereas a simple colour value of camera pixels is associated with the geometry through an accurate depth estimation of the pixel. As the camera and radar measurements are perceptually different, we exploit a late multimodal deep fusion technique, which also facilitates the modular design. The multilayer perceptron-based late fusion layer uses the unaligned ego-motion predictions from multiple modalities to predict the ultimate motion. Due to the tight formulation of ego-motion and depth prediction, the multimodal fusion technique substantially improves the depth predictions as well. The fusion consists of two stages: first, the individual ego-motion predictions are used to reconstruct the corresponding camera and range views; second, the predictions of each modality are interchangeably used in the counterpart view-reconstruction algorithms for both visual and range reconstructions.

In the proceeding sections, we demonstrate the generalizability, data efficiency and interpretability of GRAMME in five diverse settings such as day, night, rain, fog and snow. We qualitatively and quantitatively evaluate the state-of-the-art ego-motion estimation and depth prediction performance on multiple datasets, emphasizing the effect of modular design on individual modalities.

Results
Evaluation of model performance. We evaluated the depth and ego-motion prediction performance of GRAMME in five adverse settings such as day, night, rain, fog and snow using fivefold cross-validation. To quantitatively measure the generalization performance of GRAMME, we conducted an effective and reliable, yet rather challenging, cross-condition evaluation on the Robotcar dataset, enabled by the modular design of GRAMME. We trained the models on typical day sequences15 (training dataset) and directly evaluated them on more challenging conditions (night, rain, fog and snow)16 (test dataset). For each cross-validated fold, we randomly partitioned each public AV training dataset into a training set (80% of sequences), a validation set (10% of sequences) and a test set (10% of sequences). Each set contains the time-synchronized matching frames from each modality used for the training. The proportions of different settings (in terms of the number of frames) were kept constant in each set during partitioning. In each fold, we monitored the model's performance on the validation set during training and used the validation set for model selection while the test set was held-out
[Fig. 1 schematic: multisensory perception inputs feed the DepthNet, VisionNet and RangeNet modules; FusionNet combines the motion predictions, and the spatial transformer produces the reconstruction losses Lsmooth, Lgeo, Lradar and Lcam.]
Fig. 1 | Overview of the GRAMME conceptual framework and architecture. a, The publicly available independent AV datasets are collected using
multiple sensors such as camera, lidar and radar under diverse settings such as variable ambient illumination and precipitation. Example multimodal
measurements from the RADIATE dataset17 are shown to illustrate the data types and the degradation in sensor measurements caused by adverse
conditions. b, Architecture overview for self-supervised estimation of scene geometry and ego-motion. DepthNet and VisionNet modules predict the
pixel-wise depth map of each camera frame and the ego-motion between consecutive camera frames, respectively. In parallel, the RangeNet and MaskNet
modules operate on range sensors (that is, lidar and radar) to predict ego-motion and input masks, respectively. FusionNet collects the unaligned individual
motion predictions as input and predicts the ultimate motion. Finally, the spatial transformer module uses the multimodal predictions and geometrically
reconstructs the scene, creating a supervisory signal (L).
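The geometric reconstruction step summarized in this caption can be sketched in a few lines of PyTorch. The following is a minimal, illustrative implementation of the camera-side inverse warp, assuming a known intrinsics matrix K, a predicted depth map and a predicted 4 x 4 relative pose; it follows the standard perspective projection (reproduced later in the Methods as equation (2)) and uses bilinear sampling via torch.nn.functional.grid_sample. It is a sketch of the general technique, not the GRAMME package's own code.

import torch
import torch.nn.functional as F

def inverse_warp(source: torch.Tensor, depth: torch.Tensor,
                 T_t2s: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Reconstruct the target view by sampling the source view at the pixels
    given by the projection D(x_s) h(x_s) = K T_{t->s} D(x_t) K^{-1} h(x_t).
    source: (B, C, H, W); depth: (B, 1, H, W); T_t2s: (B, 4, 4); K: (B, 3, 3)."""
    B, _, H, W = source.shape
    dev, dt = source.device, source.dtype
    ys, xs = torch.meshgrid(torch.arange(H, dtype=dt, device=dev),
                            torch.arange(W, dtype=dt, device=dev), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones]).reshape(1, 3, -1).expand(B, -1, -1)   # h(x_t)
    cam = torch.linalg.inv(K) @ pix * depth.reshape(B, 1, -1)               # back-project with depth
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, dtype=dt, device=dev)], dim=1)
    proj = K @ (T_t2s[:, :3, :] @ cam_h)                                    # project into the source view
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                         # de-homogenize
    # Normalize coordinates to [-1, 1] for grid_sample, then bilinearly sample.
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(source, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

The photometric (or range intensity) difference between the warped source view and the observed target view, weighted by the predicted masks, is what provides the supervisory signal L mentioned in the caption.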
and referred to just once after training was complete to evaluate the performance of the model on day sequences. The final models are directly evaluated on the test datasets that contain adverse conditions never observed by the models during training.

Multimodal, modular and generalizable depth prediction. In the first set of experiments we analyse the depth prediction performance, which is a critical component of the self-supervision signal. GRAMME formulates multimodal ego-motion using a tight connection between depth prediction and ego-motion estimation to eliminate the need for the labelled data. The geometry-aware multimodal self-supervised architecture improves the generalization performance of the model to diverse conditions. Figure 2 shows the depth prediction performance for the model trained using: (1) a monocular camera, (2) a stereo camera, (3) a lidar–camera (stereo) and (4) a radar–camera (stereo). Note that we use only the day sequences in the training set for each training experiment on the modalities. Also, for each experiment, we use only the modalities labelled on the training modality column. Owing to GRAMME's modular design, the vision and range modules can make predictions directly and separately on the camera, lidar and radar inputs. To evaluate the generalizability of the depth module, we use the monocular sequences to test the depth prediction performance of the DepthNet module. The camera-only experiments also demonstrate the robustness of the system to sensor deficiencies. We also demonstrate the effect of external supervision by training the model with ground-truth pose information (as explained in the 'Datasets' section) following the same evaluation protocol. As shown in Fig. 2, ground-truth supervision reduces the generalization capability compared with self-supervision. Moreover, the relative multimodal performance of the supervised models is even worse than the self-supervised models. Although the camera-only self-supervised models are trained, validated and tested on day sequences and lead to overall performance improvement, challenging conditions involving glare and non-Lambertian surfaces still suffer from a considerable performance loss; however, the models trained on additional range sensors (that is, lidar and radar) are much more immune to such effects. Although stereo camera-based models are slightly better than their monocular counterparts, we have a similar observation on the other test conditions that the models trained only on camera are dramatically prone to failure. Moreover, although lidar- and radar-based models provide qualitatively similar results and generally improve the overall performance, the model trained with radar data provides greater immunity to precipitation. Under foggy conditions, the lidar measurements contribute relatively less to the generalization performance than to the other test conditions with higher error variance; this is caused by poorer measurements due to water droplets condensed on the sensor surface. On the other hand, the depth prediction of the model trained with lidar–camera fusion achieves better performance than the radar–camera model. As the lidar measurements are invulnerable to the lighting conditions and provide dense measurements, the model has an advantage
Fig. 2 | Multimodal, modular, and generalizable depth prediction performance. a, Qualitative results and sample test frames16 to visualize the
generalization ability of GRAMME on depth prediction. We train each model using the day sequences in the training set and test them under diverse
conditions to analyse the generalization performance. GRAMME successfully exploits the complementary aspects of the sensors. b, Comparatively weaker
generalization performance of the supervised models. c, Quantitative results to compare the self-supervised generalization performance of GRAMME
with respect to ground-truth supervision and intra-modality performance. The models trained only on camera are dramatically prone to failure in all of the
challenging test conditions. Although lidar- and radar-based models provide qualitatively similar results and generally improve the overall performance,
the model trained with radar provides greater immunity to precipitation. Error bars represent the depth prediction errors with respect to the ground truth.
Camera fusion models employ the stereo setting.
over the radar-based version. GRAMME exploits the multimodal system design effectively, unlike the past work focusing mainly on either deep network architecture or objective function. The results show the benefits of multimodal fusion on depth prediction as an additional supervision signal, improving the generalization ability of the model under diverse settings. Moreover, we test the generalization performance of GRAMME to different datasets, repeating the same training, validation and test protocol on the publicly available RADIATE dataset17. We exhibit both the depth prediction and ego-motion estimation performance in Extended Data Fig. 1. Although the dataset contains shorter sequences with high variations in scene appearance and structure, GRAMME achieves remarkable domain adaptation performance on this challenging dataset (the observations on the RADIATE dataset are provided in Supplementary Note 6).

Sensor-dependent masking, multisensory fusion and generalizable ego-motion estimation. View reconstruction provides the key supervision signal for the model training. In this set of experiments we investigate the effectiveness of the masking system as the major geometrically consistent element in the reconstruction. We then provide the overall generalization performance of the multisensor ego-motion estimation coupled with the masking system. As the view reconstruction is based on sampling from the adjacent frames, and occluded areas cannot be sampled by definition, reconstructed occlusion areas might corrupt the supervisory signal. The inherent heterogeneous radar artefacts such as ghost objects, phase and amplitude stability, speckle and saturation are other sources of inconsistency for view reconstruction (see Supplementary Note 3). Furthermore, the adverse weather conditions pose further challenges for camera and lidar that inhibit the underlying scene consistency
Fig. 3 | Sensor-dependent mask predictions and performance evaluations on generalizable multisensory ego-motion estimation under diverse settings.
a, Illustration of sample frames16, multimodal measurements and the corresponding predicted masks. Each row shows a pair of input measurements
and predicted masks of each modality. White and dark regions represent the valid and invalid points in the measurements to effectively capture the
multisensory degradation resulting from both adverse weather and inherent sensor deficiencies, respectively. b, Multimodal performance evaluation on
ego-motion estimation and multisensory fusion. The box plots show the median, first and third quartiles, as well as the minimum and maximum values of the errors in motion predictions. The error distribution, in terms of these quartiles, is shown for the translation and rotation components of motion for each modality. Sensor fusion greatly boosts the overall motion estimation performance.
assumptions. Poor weather introduces sharp intensity fluctuations in camera images, which degrade the consistency across frames. It is therefore important to detect the imperfect and unreliable regions in measurements and exclude them from the view reconstruction. GRAMME predicts a mask that is a combination of learnt and geometric masks to remove the invalid parts. The former is predicted by GRAMME's mask module, whereas the latter is based on the geometric inconsistency between consecutive multimodal frames that accounts for motion explanation, the nearly identical frames, and dynamic objects. We show that the predicted masks improve the performance of GRAMME by eliminating the imperfections on each modality. Figure 3 shows example frames and the corresponding mask predictions for each modality under challenging conditions. We use the stereo setting for the camera fusion models, which provides additional information due to binocular vision. To show that the masks eliminate the effect of unfavourable weather on each sensor, we trained the model using the individual modalities only. For example, intense glare caused by direct exposure to sunlight saturates most of the camera pixels and restrains the frame matching. The predicted camera mask captures the glaring regions and excludes them from the view reconstruction to prevent an incorrect consistency calculation that might corrupt the loss values computed during training. Similarly, although the stereo camera provides binocular vision and is marginally less susceptible to occlusions than the monocular one, both camera types are still considerably prone to occlusions and poor visibility due to precipitation and weak illumination. For lidar, the reflections from the ground cause unreliable regions in the measurements that cannot be consistently matched across consecutive frames, which are detected and eliminated by the lidar masks. The mask also identifies false detections caused by fog droplets. On the other hand, although radar is more resistant to weather conditions, the radar measurements still suffer from the inherent artefacts discussed above. The radar masks seamlessly detect the imperfect measurements and filter them from the radar frames.

Following the same experimental protocol described above, we evaluate the generalization performance of the overall ego-motion estimation system. We show an ablation study of GRAMME in terms of the contribution of the fusion module to individual modalities and the contribution of different sensors under unique test conditions. Figure 3 shows the translational and rotational errors of different ablation schemes, averaged over the day, night, rain, fog and snow test conditions. To evaluate the benefits of multisensor fusion, we train camera-only models in monocular and stereo settings. In separate training experiments, we fuse lidar and radar modalities with the stereo camera. As shown in Fig. 3, lidar–camera and radar–camera fusion significantly improves both the translational and rotational motion prediction performance compared with camera-only models. The GRAMME model trained with lidar has notably higher errors in fog, showing the negative impact of fog on lidar data.

MonoDepth2; ref. 21) varies depending on the modality and the test condition. For example, fusion models achieve good performance with a dataset size of at least 50% in all test conditions. However, the model trained with cameras needs at least 75% of the training dataset. Notably, the performance of fusion models might deteriorate with access to very limited data (for example, with only 25% of the dataset). The increased complexity needed to implement the multimodal architecture makes the model more data-dependent than those with single modalities. Furthermore, although radar–camera fusion provides more immunity to adverse weather than lidar–camera fusion, the latter performs relatively well under the poor illumination in the night sequences. Neither the lidar nor the radar modality is affected by the illumination, but the lidar model utilizes the dense measurements of the lidar sensor and accordingly achieves better performance.
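The mask-weighted supervision described above can be written compactly. Below is a minimal PyTorch sketch, assuming a single source view and a per-pixel mask in [0, 1]; it mirrors the masked intensity loss and the cross-entropy mask regularizer given later as equations (3) and (4) in the Methods, but the function names and the use of a mean instead of a sum are our own simplifications.

import torch

def masked_intensity_loss(target: torch.Tensor,
                          recon_source: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Mask-weighted L1 error between the target frame and the view
    reconstructed from a source frame; mask values in [0, 1] down-weight
    occluded, glared or otherwise unreliable pixels."""
    return (mask * (target - recon_source).abs()).mean()

def mask_regularizer(mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Cross-entropy prior pulling the mask towards 1, which prevents the
    trivial all-zero mask solution."""
    return -(mask.clamp(min=eps)).log().mean()

# Usage with dummy radar bird's-eye-view frames:
target = torch.rand(1, 1, 256, 256)
recon = torch.rand(1, 1, 256, 256)
mask = torch.sigmoid(torch.randn(1, 1, 256, 256))   # predicted by a MaskNet-like module
loss = masked_intensity_loss(target, recon, mask) + 0.1 * mask_regularizer(mask)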
Fig. 4 | Interpretability and dataset size-dependent performance. a, A game-theoretic visualization of GRAMME to interpret the depth predictions based
on the SHAP values for sample frames16. Pixels annotated by red points increase the depth prediction accuracy, whereas blue points lower the accuracy.
The challenging conditions such as glare, poor illumination and adverse weather lead to concentrated blue regions around the occluded pixels. However,
the training with lidar and radar data helps the model focus on more semantically invariant pixels across diverse test conditions, as visualized by the red
points around static objects and road edges. The distribution of the values illustrates the independent and uncorrelated failure modes of the proposed
multimodal system. b, Dataset size-dependent performance of GRAMME in terms of mean depth prediction error, with standard deviation with respect
to the depth ground truth. Although the lidar–camera (stereo) and radar–camera (stereo) fusions improve the overall performance, access to minimal
data (for example, only 25%) causes a worse performance than the camera due to the increased complexity of the model required for the multimodal
architecture. On the other hand, despite the model complexity, the lidar and radar-based models achieve good performances (compared with the baseline
approaches) with a dataset size of at least 50% in all test conditions.
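Attributions like those in Fig. 4a can be produced with the shap Python package, the reference implementation of the Deep SHAP method the paper describes. The sketch below assumes a trained PyTorch depth network depth_net, a tensor of camera frames and a background batch; the scalar wrapper and variable names are our own, and the snippet illustrates the general workflow rather than the exact GRAMME analysis pipeline.

import shap
import torch
import torch.nn as nn

class MeanDepth(nn.Module):
    """Wrap a depth network so it returns one scalar per image; SHAP then
    attributes each input pixel's contribution to that summary prediction."""
    def __init__(self, depth_net: nn.Module):
        super().__init__()
        self.depth_net = depth_net
    def forward(self, x):
        return self.depth_net(x).mean(dim=(1, 2, 3)).unsqueeze(1)

def explain_depth(depth_net: nn.Module, frames: torch.Tensor,
                  background: torch.Tensor):
    wrapped = MeanDepth(depth_net).eval()
    explainer = shap.DeepExplainer(wrapped, background)   # background: reference frames
    return explainer.shap_values(frames)                  # per-pixel attributions

shap.image_plot can then visualize the attributions, with positive values (red) marking pixels that push the prediction up and negative values (blue) pulling it down, analogous to the points overlaid in Fig. 4a.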
but they do not extensively cover extreme conditions such as heavy downpours and large snowfalls, which is a limiting factor in evaluating the generalization capability. Fourth, interpretability: we demonstrated that our models are interpretable and capable of capturing semantically and geometrically consistent regions. We visualized the extracted features using SHAP values and observed that the camera-only model struggles to focus on consistent regions across frames. On the other hand, multimodal training helps the model to capture more consistent areas that are interpretable by humans. Although deep learning models are heavily deployed in an AV software stack, interpretability remains a considerable challenge due to the lack of insightful and lucid interpretability approaches to analyse the complex deep learning architectures. Finally, data efficiency: our quantitative experiments and comparative analysis demonstrated that GRAMME models trained with multiple modalities achieve satisfactory results compared with baseline methods, even with dataset sizes limited by up to 50%, despite the increased complexity. However, the specific inherent vulnerability of sensors (such as lidar in fog) might deteriorate the performance with minimal data availability (for example, 25% of a dataset). For depth and ego-motion estimation in adverse weather, we believe that the diversity and accuracy of ground truth in the existing public datasets are still insufficient, which is likely to constitute a limiting factor; data-efficiency analysis is therefore important to understand how sensitive the performance of a deep learning model is to the availability of additional data.

The key aspects discussed above address a critical issue of AVs: the ability to know precisely where they are on the map. Core AV components such as prediction and planning rely on this localization ability. In this study we showed that robust and accurate ego-motion estimation provides a complementary solution to localization and is a critical component of autonomous driving to achieve safety and reliability under adverse conditions. The high level of location accuracy provided by GRAMME enables AVs to reliably
understand their environment and make safer decisions. We demonstrated that the complementary and redundant perception that AVs gain from multiple sensors improves the reliability of vehicles in challenging situations, especially in unfavourable weather conditions. Furthermore, the self-supervised aspect of GRAMME enables artificial intelligence systems deployed on AVs to learn localization from orders of magnitude more data, which is important to quickly recognize and understand new driving conditions. We believe AV technologies should meticulously involve these fundamental aspects to achieve safe and reliable autonomous driving.

In terms of future directions, the presented technology can be further improved in several directions. For example, the signal-to-noise ratio of range sensors can be integrated into the masking component of GRAMME, providing an additional physical source of confidence for the measurements. Moreover, the Doppler measurements from radars can help the model better distinguish dynamic and static objects in the scene, enabling a more consistent geometric and semantic understanding of the environment. Moreover, GRAMME as a learning-based approach can be extended to higher level learning schemes of autonomous driving such as lifelong and continual learning, resulting in AVs that continuously and collaboratively improve autonomous driving artificial intelligence.

Methods
GRAMME. GRAMME is a self-supervised deep learning framework designed to robustly estimate the ego-motion and depth map for an AV under diverse settings. GRAMME follows an end-to-end design and leverages data-driven learning to combat the inherent limitations of conventional and state-of-the-art ego-motion estimation methods. GRAMME demonstrates the feasibility of multimodal odometry under adverse weather conditions and proposes a multisensor fusion framework, resulting in a robust ego-motion estimation system. The standard self-supervised ego-motion prediction is based on a monocular camera, and it consists of two joint stages22,23. The first stage predicts a depth map for a given camera frame, whereas the second stage predicts ego-motion between two consecutive camera frames. Given the ego-motion and depth predictions, a spatial transformer algorithm reconstructs the target camera frame from the source frames. The spatial transformer module builds on the idea presented by Jaderberg et al.24, explicitly allowing the spatial manipulation of multimodal data within the network. The reconstruction quality establishes the supervisory signal to optimize the neural network. GRAMME builds on the self-supervised training idea and describes a multimodal architecture to promote complementary sensor behaviours, yielding a robust ego-motion estimation for AVs under diverse settings such as day, night, rain, fog and snow. GRAMME introduces a novel differentiable range-reconstruction algorithm for range frames (that is, lidar and radar) as part of its multimodal spatial transformer that is adaptable to the back-propagation during training of the deep learning architecture. The RangeNet module uses two consecutive range frames to predict the ego-motion of the AV, whereas MaskNet predicts the reliable regions in individual frames. Given the ego-motion and mask predictions, the spatial transformer algorithm uses the source frames to reconstruct the target range frames. To exploit the complementary information obtained from different sensors, GRAMME proposes a fusion method that consists of the FusionNet layer and a cross-modal training technique. The novel fusion method enables information flow across different modalities due to the joint training technique, improving the robustness of individual modalities. Extended Data Fig. 2 shows the details of the architecture.

Problem definition. Each loosely time-synchronized triplet of consecutive camera (<I^c_{s,i-1}, I^c_{t,i}, I^c_{s,i+1}>), lidar (<I^l_{s,i-1}, I^l_{t,i}, I^l_{s,i+1}>) and radar (<I^r_{s,i-1}, I^r_{t,i}, I^r_{s,i+1}>) frames in the training set (I = I^c ∪ I^l ∪ I^r) represents a single data point at time index i with unknown ego-motion and depth map of the camera source I_s and target I_t frames. Our goal is to estimate T, where the pose T_{t→s} = [R|t] ∈ SE(3) is a transformation between the target (t) and source (s) frames with rotation matrix R and translation vector t. Although the standard commercial radars are two-dimensional sensors, we formulate our problem in SE(3) to enable compatibility with other three-dimensional sensor modalities. Unlike existing self-supervised radar approaches25, GRAMME directly predicts the pose between the consecutive frames without imposing strong motion prior factors.

Camera module. The camera module consists of two networks. DepthNet uses UNet-style skip connections26 to predict the per-pixel depth map D of a given RGB image. In parallel, VisionNet follows the ResNet18 architecture to predict the relative pose T_{t→s} between source and target RGB images <I^c_s, I^c_t>. We use the predicted depth and pose values in the spatial transformer algorithm to create a supervisory signal based on perspective projection. However, photometric error supervision alone is ambiguous, especially in low-textured regions due to the multiple matches with one pixel. To prevent depth ambiguity due to incorrect pixel matches in low-textured and occluded areas, we apply a regularization:

L_s(D) = \sum_{x_t} \sum_{d \in \{x, y\}} \left| \nabla_d^2 D(x_t) \right| e^{-\alpha \left| \nabla_d I(x_t) \right|}   (1)

L_s(D) is a second-order spatial depth smoothness term that penalizes the divergence of the depth prediction gradients along both the x and y directions22. The regularization encourages the alignment of the depth values in the planar surface in the absence of image gradients. For multiview projection between multiple camera views, let D(x_t) denote the depth value of the target image at coordinate x_t, and K be the camera intrinsics matrix. Assume a rigid transformation T_{t→s} is the relative pose from the target view to the source view, and h(x) is the homogeneous coordinates given x. The perspective projection to find corresponding pixels in the source view can be formulated as

D(x_s)\, h(x_s) = K\, T_{t→s}\, D(x_t)\, K^{-1} h(x_t)   (2)

and the image coordinate x_s can be obtained by de-homogenization of D(x_s)h(x_s); x_s and x_t are therefore a pair of matching coordinates in the source and target views, and the similarity between the two can be compared to validate the correctness of structure. Given the pixel-wise matching pairs in I^c_t and I^c_s, we can reconstruct a target view \hat{I}^c_s from the given source view as described in ref. 27, and calculate the final camera objective using the photometric error L_c = L_p(M, \hat{I}^c_s, \hat{I}^c_t) + λ_s L_s(D), following the camera masking method offered in ref. 22. The camera module is applicable to monocular and stereo cameras by exploiting the left–right consistency21.

Range module. The range module is designed to predict ego-motion from radar and lidar measurements that are represented by a bird's-eye view in Cartesian coordinates, consisting of two feature extractor networks based on ResNet18 followed by two fully connected layers to regress the relative pose. RangeNet predicts the relative pose T_{t→s} between source and target frames <I_s, I_t>, whereas MaskNet individually predicts a mask M in parallel to detect the consistent regions in the frames. Finally, our view synthesis algorithm reconstructs the target view using the predicted pose and mask.

View synthesis for range sensors. Given a source I_s and target I_t view in Cartesian coordinates for radar and lidar measurements, we use the relative predicted pose T_{t→s} between the views to reconstruct a target view \hat{I}_s through bilinear interpolation. To reconstruct the value of \hat{I}_s(x_t) from the value of I_s(x_s), we use a differentiable bilinear sampling mechanism similar to the photometric approaches24, linearly interpolating the values of the four pixel neighbours N = {top-left, top-right, bottom-left, bottom-right} of x_s to approximate I_s(x_s), that is, \hat{I}_s(x_t) = I_s(x_s) = \sum_{i,j \in N} w^{ij} I_s(x^{ij}_s), where w^{ij} ∝ |x_s − x_t| and \sum_{i,j} w^{ij} = 1; then, given the Lambertian and static rigid scene assumptions, we can calculate the average intensity error to refine the predicted relative pose. However, this assumption is not always true because of dynamic objects and sensor deficiencies, which might be further violated under adverse weather. We introduce a consistency mask M to compensate for the regions violating the assumption. Formally, the masked intensity loss for lidar (L_l) and radar (L_r) is

L_{l,r}(M, \hat{I}_s, \hat{I}_t) = \sum_{s=1}^{S} \sum_{x_t} M_s(x_t)\, \left| I_t(x_t) - \hat{I}_s(x_t) \right|, \quad \text{such that } \forall x_t, s:\; M_s(x_t) \in [0, 1]   (3)

where \{\hat{I}_s\}_{s=1}^{S} is the set of reconstructed source views, \{M_s\} is a set of consistency masks, and M_s(x_t) ∈ [0, 1] provides a weight on the error at x_t from source view s. The range-reconstruction algorithm is summarized in Algorithm 1. Moreover, the explainability mask has a trivial solution in this formulation, assigning all mask values to zero. We apply a regularization term to encourage non-zero masks to prevent the saturation in the network activation, using a cross-entropy loss for the predicted masks:

L_m(M) = -\sum_{s} \sum_{x_t} \log P(M_s(x_t) = 1)   (4)

In the bird's-eye view, vehicles and large objects occupy smaller areas compared with the front view. For example, a vehicle with an average size of 2.5 × 5.1 m occupies only a 13 × 26 pixel area with an input resolution of 0.2 m. Downsampling the bird's-eye view map through the encoder makes the region-wise features vulnerable to quantization errors in the subsequent mask generator; thus, GRAMME upsamples the coarse-grained feature map via a transposed convolution layer (decoder) and concatenates the output with the fine-grained feature map with skip links, following the UNet design26.

Multimodal fusion. GRAMME introduces a self-supervised fusion approach that involves an attention module, a fusion network and a training technique. The features extracted from range and camera modules are used in an attention
SqRel ≡ \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} \frac{|D(x,y) - D_{gt}(x,y)|^2}{D_{gt}(x,y)}

RMSE ≡ \sqrt{ \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} |D(x,y) - D_{gt}(x,y)|^2 }

RMSE_{log} ≡ \sqrt{ \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} |\log D(x,y) - \log D_{gt}(x,y)|^2 }

log10 ≡ \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} |\log_{10} D(x,y) - \log_{10} D_{gt}(x,y)|

Accuracy ≡ % of D(x,y) such that \delta = \max\left( \frac{D(x,y)}{D_{gt}(x,y)}, \frac{D_{gt}(x,y)}{D(x,y)} \right) < \tau

D(x, y) is the predicted depth at (x, y) ∈ Ω and D_{gt}(x, y) is the corresponding ground truth. We use the most common three different thresholds τ (1.25, 1.25^2 and 1.25^3) in the accuracy metric. Since the monocular camera lacks the absolute scale, we multiply the monocular depth predictions by a scaling factor, s, that matches the median with the ground-truth depth map to solve the scale ambiguity issue, that is, s = median(D_{gt})/median(D). The depth prediction results in terms of those metrics are shown in Extended Data Table 2. We evaluate the depth prediction performance of the competing approaches under diverse settings such as day, night, rain, fog, and snow, following the same training and test protocol. We use Monodepth2 (ref. 21) as a baseline, which is the most similar architecture to the camera module of GRAMME. We train, validate and test it using the same dataset split as GRAMME. Although Monodepth2 achieves comparable results in day sequences, it performs poorly in reduced visibility conditions due to the occlusions and low lighting. Since the camera module of GRAMME is most similar to Monodepth2, we provide the performance evaluation for GRAMME models trained using range sensors (for example, lidar and radar), emphasizing the effectiveness of the multimodal approach. Note that none of the models has access to additional sensor measurements at test time other than camera images. The results indicate that GRAMME models distinctly and consistently outperform the other approaches thanks to the fusion model design, and reiterate the robustness of GRAMME to the lack of modalities. The results indicate that exploiting the cross-modal relations is crucial for robust all-weather ego-motion estimation for AVs.

Ablation study on the deep network. Deep learning models might benefit from larger and more complex networks to improve the prediction accuracy41, which comes at a run-time cost. The encoders in GRAMME are based on the ResNet18 architecture42. We replace the encoder with commonly used networks such as MobileNet43 and VGG16 to analyse the performance and latency of the models44, which is shown in Extended Data Fig. 1b. We benchmark the models in terms of depth prediction performance and inference time for a minibatch size of four on an NVIDIA GTX 1080Ti consumer-grade GPU. The inference time is evaluated for the total of pose and depth predictions with an additional pose fusion for the multimodal tests. While the networks have the same inference time for different test conditions, the networks in fusion models have higher latency than the models for a single modality. The multimodal input and parallel network branches for multiple modalities cause a higher latency in fusion models. Although the overall run-time for ResNet is higher than MobileNet, ResNet achieves a significant performance boost. On the other hand, despite the slight performance gain of VGG in the monocular setting at the cost of four times the inference time of ResNet, VGG falls behind ResNet in the other test settings. ResNet efficiently trades off between accuracy and latency, and has a noticeably lower GPU run-time. We therefore select ResNet as our encoder.

Interpretability. We visualize the feature space of the depth prediction module with respect to the camera, lidar, and radar inputs to better understand the

Visualizing feature space with SHAP values. SHAP20 approximates an interpretable explanation model g of the original, complex model f, to explain a prediction made by the model f(x). SHAP provides post-hoc model explanations for an individual output of f and is model-agnostic. SHAP is a game-theoretic approach based on Shapley values45, which calculates the contribution of each feature in the final prediction performance. We use a special implementation of the SHAP approach, the Deep SHAP method introduced by Lundberg and colleagues20, which combines SHAP values computed for smaller components of the network into SHAP values for the whole network. It defines DeepLIFT's multipliers46 in terms of SHAP values, and recursively passes the values backwards through the network. Deep SHAP exploits the composition rule and the efficient analytical SHAP solutions for simple network components such as linear, max pooling, or an activation function with just one input, enabling a fast approximation of values for the whole model. This approach helps us derive an effective linearisation from the SHAP values computed for each component instead of heuristically choosing ways to linearize components.

Computational hardware and software. We stored the raw dataset files on multiple hard drives. We performed the demosaicing of camera images, the projection of lidar frames, and the Cartesian conversion of radar measurements on Intel Xeon CPUs, which are then stored on a fast local SSD. We used two local NVIDIA RTX 3090 GPUs for each training experiment, accelerated through batch parallelization, and a local NVIDIA GTX 1080Ti GPU to evaluate run-time performance. We implement our multimodal processing pipeline in Python and employ image processing libraries such as colour-demosaicing (v.0.1.6) and pillow (v.8.4.0). To train the deep learning models and augment the datasets, we used machine learning libraries such as PyTorch (v.1.8.0) and torchvision (v.0.9.1). We generated all plots using matplotlib (v.3.5.0) and seaborn (v.0.11.2). The Robotcar dataset is processed using the Robotcar dataset SDK (v.3.1), and the RADIATE dataset is processed with the RADIATE dataset SDK (commit dca2270).

Data availability
The Oxford Robotcar Dataset16 and the Oxford Robotcar Radar15 datasets are available from the University of Oxford under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://robotcar-dataset.robots.ox.ac.uk/). The RADIATE dataset17 is available from the Edinburgh Centre for Robotics, Heriot-Watt University, under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (http://pro.hw.ac.uk/radiate/). The referenced datasets represent the minimum data necessary to interpret, verify and extend the research in the article.

Code availability
All code was implemented in Python using the deep learning framework PyTorch. Code, trained models and scripts reproducing the experiments of this paper are available at https://github.com/yasinalm/gramme (refs. 47–49). All source code is provided under the MIT license.

Received: 24 January 2022; Accepted: 12 July 2022; Published: xx xx xxxx

References
1. Safe driving cars. Nat. Mach. Intell. 4, 95–96 (2022).
2. Yang, G.-Z. et al. The grand challenges of Science Robotics. Sci. Robot. 3, eaar7650 (2018).
Additional information
Extended data and supplementary information are available for this paper at https://doi.org/10.1038/s42256-022-00520-5. Correspondence and requests for materials should be addressed to Yasin Almalioglu. Peer review information: Nature Machine Intelligence thanks Itai Orr and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Reprints and permissions information is available at www.nature.com/reprints. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This article is published open access under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). © The Author(s) 2022
Extended Data Fig. 1 | Performance on the RADIATE dataset, ablation study on the deep network, and run-time performance. (a) To evaluate the generalization of GRAMME to new datasets, we train, validate and test models on the publicly available RADIATE dataset17, which contains shorter sequences for ego-motion estimation with high variation in scene and structure appearance. We report the mean depth prediction errors with standard deviations and the distribution of motion prediction errors with quartiles. Although GRAMME performs well in terms of depth prediction and ego-motion estimation performance, we observe on this dataset that, due to dense fog and heavy precipitation, the performance of the lidar–camera-based GRAMME model drops significantly compared to the day sequences. (b) As an additional ablation study, we replace the UNet network in the GRAMME modules with commonly used networks. The results show the basis for our choice of ResNet in the GRAMME architecture. We also report the run-time requirements in milliseconds, indicating the real-time capability of GRAMME on a consumer-grade GPU.
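The encoder ablation and run-time comparison in Extended Data Fig. 1b can be reproduced in outline with torchvision backbones. The following sketch is illustrative only: the pose head, input resolution and timing loop are our assumptions, not the authors' exact benchmarking protocol.

import time
import torch
import torch.nn as nn
import torchvision.models as models

def make_encoder(name: str) -> tuple[nn.Module, int]:
    """Return a feature extractor and its output width for a few common backbones."""
    if name == "resnet18":
        net = models.resnet18(weights=None); net.fc = nn.Identity(); return net, 512
    if name == "mobilenet_v2":
        net = models.mobilenet_v2(weights=None); net.classifier = nn.Identity(); return net, 1280
    if name == "vgg16":
        net = models.vgg16(weights=None); net.classifier = nn.Identity(); return net, 25088
    raise ValueError(name)

@torch.no_grad()
def benchmark(name: str, batch: int = 4, reps: int = 20) -> float:
    """Average forward-pass time (ms) for a minibatch, on GPU if available."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    encoder, width = make_encoder(name)
    model = nn.Sequential(encoder, nn.Linear(width, 6)).to(device).eval()
    x = torch.randn(batch, 3, 224, 224, device=device)
    model(x)                                   # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / reps * 1e3

for backbone in ("resnet18", "mobilenet_v2", "vgg16"):
    print(backbone, f"{benchmark(backbone):.1f} ms")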
Extended Data Fig. 2 | Detailed architecture design of the proposed geometry-aware, multimodal, modular, interpretable, and self-supervised ego-motion estimation. The modules consisting of encoder and decoder networks are based on the UNet architecture with skip connections. Feature extractors with an encoder are based on the ResNet18 network, visualized on a sample input17. FC layers represent fully connected layers. The pose fusion network is a multilayer perceptron. As part of the spatial transformer module, the inverse warp algorithm re-uses the input target frames to calculate the reconstruction loss. The camera input can be set to contain more frames than the range sensor due to the higher fps rate. The fused pose is the final output, optimized in a self-supervised manner without ground truth with respect to the intermediate pose and depth predictions.
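The modular, late-fusion layout in Extended Data Fig. 2 can be mirrored with a few small PyTorch modules. The sketch below is a simplified stand-in: class names such as PoseBranch and SimpleFusionMLP, the 6-DoF pose vectors and the layer sizes are our assumptions, chosen only to show how per-modality branches can run independently and still be fused by a small multilayer perceptron.

import torch
import torch.nn as nn
import torchvision.models as models

class PoseBranch(nn.Module):
    """ResNet18 feature extractor + FC head regressing a 6-DoF relative pose."""
    def __init__(self, in_channels: int = 6):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Adapt the first conv to a stacked frame pair (e.g. 2 x RGB = 6 channels).
        backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Identity()
        self.encoder = backbone
        self.head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 6))

    def forward(self, frame_pair: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(frame_pair))

class SimpleFusionMLP(nn.Module):
    """Late-fusion MLP mapping unaligned per-modality poses to a fused pose."""
    def __init__(self, n_modalities: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6 * n_modalities, 64), nn.ReLU(),
                                 nn.Linear(64, 6))

    def forward(self, poses: list) -> torch.Tensor:
        return self.mlp(torch.cat(poses, dim=1))

# Usage: each branch can run alone (modular inference) or be fused when both
# modalities are present.
camera_branch, radar_branch = PoseBranch(6), PoseBranch(2)
fusion = SimpleFusionMLP(n_modalities=2)
cam_pair = torch.randn(1, 6, 192, 320)    # two stacked RGB frames
rad_pair = torch.randn(1, 2, 256, 256)    # two stacked radar bird's-eye views
pose_cam = camera_branch(cam_pair)        # camera-only prediction
pose_fused = fusion([pose_cam, radar_branch(rad_pair)])

Because each branch produces a usable pose on its own, dropping one modality at test time degrades gracefully, which is the modularity property emphasized throughout the paper.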
Extended Data Table 2 | Quantitative comparison with state-of-the-art methods in terms of depth prediction error and accuracy.
A higher value is better for the accuracy columns, and a lower value is better for the others. The methods are trained on the
daytime data of the Oxford Robotcar dataset15 and directly tested under the weather conditions labelled in the test column. The
modalities used to train the models are represented by: Monocular (M), stereo (S), lidar (L), and radar (R). The models are tested
with monocular images only without access to any additional sensor. Notably, the fusion GRAMME models significantly improve
the generalization performance against adverse conditions compared to Monodepth2 that is the most similar architecture to the
camera-only GRAMME model
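The error and accuracy columns in Extended Data Table 2 follow the standard monocular-depth metrics defined in the Methods above. The NumPy sketch below computes them, including the median scaling used to resolve the monocular scale ambiguity; it is a generic implementation under the assumption that pred and gt are aligned depth maps, not the authors' evaluation script.

import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, median_scale: bool = True) -> dict:
    """Standard depth metrics (Abs Rel, Sq Rel, RMSE, RMSE log, log10, delta < 1.25^k)
    over the valid ground-truth pixels; median scaling resolves the unknown
    monocular scale, s = median(gt) / median(pred)."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    if median_scale:
        pred = pred * (np.median(gt) / np.median(pred))
    err = pred - gt
    metrics = {
        "abs_rel": np.mean(np.abs(err) / gt),
        "sq_rel": np.mean(err ** 2 / gt),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "log10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
    }
    ratio = np.maximum(pred / gt, gt / pred)
    for k in (1, 2, 3):
        metrics[f"delta<1.25^{k}"] = np.mean(ratio < 1.25 ** k)
    return metrics

# Example with synthetic data (a real evaluation would use lidar ground-truth depth):
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 80.0, size=(192, 640))
pred = np.clip(gt * rng.normal(1.0, 0.1, size=gt.shape), 0.1, None)
print(depth_metrics(pred, gt))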
Resources
1. https://www.nature.com/articles/s42256-022-00516-1
2. Andrew Ng YouTube Tutorial - Machine Learning
3. Introduction to Machine Learning, IITM, NPTEL/Swayam Course.
Rajalakshmi Institute of Technology
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCES
AD8552- Machine Learning
Unit-2: Machine Learning Methods
Part A
Sl.no  Questions  CO  Blooms Level*
1 Compare Supervised and Unsupervised learning algorithms. CO2 B2
2 Define Generalized Linear Model. CO2 B1
3 Write the formula for linear regression and optimal weight vector for prediction. CO2 B3
4 Define regularization. CO2 B1
5 Compare Lasso and ridge regression. CO2 B2
6 “Logistic regression is a Classification Technique”-Justify. CO2 B5
7 Write the algorithm of KNN. CO2 B3
8 Write the merits and demerits of KNN. CO2 B3
9 Define Perceptron. CO2 B1
10 Solve AND using Perceptron. CO2 B6
11 Write the training processing steps in MLP. CO2 B3
12 List the training methods in back propagation technique. CO2 B1
13 Define RBFNN. CO2 B1
14 Define overfitting. CO2 B1
15 Write the cost functions of L1 & L2 regularization techniques. CO2 B3
16 Write the merits of dropout regularization technique. CO2 B3
17 List the merits of decision tree algorithms. CO2 B1
18 Compare the algorithms used in building the decision trees. CO2 B2
19 Write the terminologies and formulae related to the regression tree. CO2 B3
20 Write the measures of classification trees. CO2 B3
21 Write the training steps in the decision tree. CO2 B3
22 List the types of ensembling methods. CO2 B1
23 Write the steps in bagging. CO2 B3
24 List the merits of random forest algorithms. CO2 B1
25 Define Decision Jungle. CO2 B1
26 Compare the boosting techniques in the decision tree. CO2 B2
27 Relate SVM and multi class classification. CO2 B2
28 Write the regularization version of SVM. CO2 B3
29 Write the formula of C-SVM. CO2 B3
30 List the different Kernel Functions. CO2 B1
31 List the classification of Probabilistic Models. CO2 B1
32 Compare the Methods of Discriminative Probabilistic Models. CO2 B2
33 List the types of distribution. CO2 B1
34 Write the training steps in Bayesian Network CO2 B3
35 Write the algorithm of MLE. CO2 B3
36 Define Clustering. Write the types of Clustering Algorithms. CO2 B1
37 Define ICA. CO2 B1
38 Define SOM CO2 B1
39 Define Autoencoder. CO2 B1
40 Write the steps in K-Means Clustering CO2 B3
Part B
Sl.no  Questions  CO  Blooms Level*
Level*