Interview Questions For DS & DA (ML)
Supervised machine learning requires training with labelled data. Let's discuss it in a bit more detail, starting with bias and variance.
Bias:
“Bias is the error introduced in your model due to oversimplification of the machine learning algorithm.” It can lead to underfitting. When you train your model, the model makes simplified assumptions to make the target function easier to understand.
Low bias machine learning algorithms — Decision Trees, k-NN and SVM
High bias machine learning algorithms — Linear Regression, Logistic Regression
Variance:
“Variance is the error introduced in your model due to a complex machine learning algorithm; your model learns noise from the training data set and performs badly on the test data set.” It can lead to high sensitivity and overfitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens up to a particular point. As you continue to make your model more complex, you end up overfitting your model, and hence it will start suffering from high variance.
The goal of any supervised machine learning algorithm is to have low bias and low
variance to achieve good prediction performance.
1. The k-nearest neighbours algorithm has low bias and high variance, but the
trade-off can be changed by increasing the value of k which increases the
number of neighbours that contribute to the prediction and in turn increases
the bias of the model.
2. The support vector machine algorithm has low bias and high variance, but
the trade-off can be changed by increasing the C parameter that influences
the number of violations of the margin allowed in the training data which
increases the bias but decreases the variance.
There is no escaping the relationship between bias and variance in machine learning.
Increasing the bias will decrease the variance. Increasing the variance will decrease the
bias.
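As a quick illustration of the k-NN trade-off described above, here is a minimal sketch (assuming scikit-learn is available, on a synthetic dataset): a very small k tends to overfit (high variance), while a very large k tends to underfit (high bias).

# Illustrative sketch of the bias-variance trade-off in k-NN (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for k in (1, 15, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:>3}  train accuracy={model.score(X_train, y_train):.2f}  "
          f"test accuracy={model.score(X_test, y_test):.2f}")
# Small k: near-perfect training accuracy but weaker test accuracy (high variance).
# Very large k: training and test accuracy both drop (high bias).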
Gradient:
Gradient is the direction and magnitude calculated during training of a neural network
that is used to update the network weights in the right direction and by the right amount.
“Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.” At an extreme, the values of the weights can become so large as to overflow and result in NaN values. This has the effect of making your model unstable and unable to learn from your training data.
The confusion matrix is a 2×2 table that contains 4 outputs provided by a binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity, precision and recall, are derived from it.
Fig: Confusion Matrix
A data set used for performance evaluation is called a test data set. It should contain the correct labels and the predicted labels.
The predicted labels will be exactly the same as the observed labels if the performance of the binary classifier is perfect.
The predicted labels usually match with part of the observed labels in real world
scenarios.
A binary classifier predicts all data instances of a test dataset as either positive or negative. This produces four outcomes: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN).
The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity (true positive rate) and the false positive rate.
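As an illustration of the four outcomes and the ROC curve described above, here is a minimal sketch, assuming scikit-learn is available, using synthetic data and a logistic regression classifier:

# Sketch: confusion matrix outcomes and ROC curve points for a binary classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)                   # hard labels for the confusion matrix
y_score = clf.predict_proba(X_test)[:, 1]      # scores for the ROC curve

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Sensitivity (TPR):", tp / (tp + fn))
print("Specificity (TNR):", tn / (tn + fp))

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # TPR vs FPR at various thresholds
print("AUC:", roc_auc_score(y_test, y_score))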
7. What is Selection Bias?
Selection bias occurs when the sample obtained is not representative of the population intended to be analysed.
1. Linear Kernel
2. Polynomial kernel
3. Radial basis kernel
4. Sigmoid kernel
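A minimal sketch of trying the four SVM kernels listed above, assuming scikit-learn's SVC is available, on synthetic data:

# Sketch: comparing SVM kernels with cross-validation (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(f"{kernel:>8} kernel: mean CV accuracy = {scores.mean():.2f}")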
11. Explain Decision Tree algorithm in detail.
Decision tree is a supervised machine learning algorithm mainly used for regression and classification. It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. Decision trees can handle both categorical and numerical data.
The core algorithm for building decision trees is called ID3. ID3 uses Entropy and Information Gain to construct a decision tree.
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into homogeneous subsets. ID3 uses entropy to check the homogeneity of a sample. If the sample is completely homogeneous then the entropy is zero, and if the sample is equally divided it has an entropy of one.
Information Gain
The Information Gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain.
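To make the two measures concrete, here is a small pure-Python sketch of entropy and information gain on an illustrative set of labels (the attribute split used is hypothetical):

# Sketch: entropy and information gain as used by ID3 (pure Python).
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

labels = ["yes"] * 9 + ["no"] * 5                 # example: 9 positive, 5 negative
print(round(entropy(labels), 3))                  # ~0.940: a mixed sample is close to 1
print(round(entropy(["yes"] * 5), 3))             # zero entropy for a completely homogeneous sample

# Splitting on a hypothetical attribute that produces two subsets:
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(round(information_gain(labels, split), 3))  # positive gain means a useful split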
13. What is pruning in Decision Tree?
When we remove sub-nodes of a decision node, this process is called pruning; it is the opposite of splitting.
Bagging
Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. In generalised bagging, you can use different learners on different populations. As you would expect, this helps us reduce the variance error.
Boosting
Boosting fits learners sequentially, with each new learner concentrating on the examples the previous learners got wrong; combining them this way reduces the bias error.
In Random Forest, we grow multiple trees as opposed to a single tree. To classify a new object based on attributes, each tree gives a classification. The forest chooses the classification having the most votes (over all the trees in the forest), and in the case of regression, it takes the average of the outputs of the different trees.
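A minimal sketch of a Random Forest in practice, assuming scikit-learn is available, on synthetic data:

# Sketch: a bagging-style ensemble of trees with RandomForestClassifier (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Grow many trees; a new object is classified by majority vote over the trees.
forest = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))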
16. What cross-validation technique would you use on a time series data set?
Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data — it is inherently ordered chronologically. In the case of time series data, you should use techniques like forward chaining, where you model on past data and then look at forward-facing data.
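A minimal sketch of forward chaining, assuming scikit-learn's TimeSeriesSplit is available; each fold trains only on the past and validates on the data that follows:

# Sketch: forward chaining on time-ordered data (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # 10 observations in chronological order
y = np.arange(10)

# Each split trains only on earlier observations and tests on the ones that follow.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "-> test:", test_idx)
# e.g. train [0 1 2 3] -> test [4 5], then train [0..5] -> test [6 7], then train [0..7] -> test [8 9]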
17. What is logistic regression? Or: state an example when you have used logistic regression recently.
Logistic Regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win the election or not. In this case, the outcome of the prediction is binary, i.e. 0 or 1 (Win/Lose). The predictor variables here would be the amount of money spent on election campaigning for a particular candidate, the amount of time spent campaigning, etc.
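A minimal sketch of the win/lose example, assuming scikit-learn is available; the campaign figures below are hypothetical:

# Sketch: predicting a binary win/lose outcome with LogisticRegression (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Predictors: [money spent on campaigning, weeks spent campaigning] (hypothetical values)
X = np.array([[10, 4], [50, 12], [25, 8], [5, 2], [60, 14], [30, 6], [8, 3], [45, 10]])
y = np.array([0, 1, 1, 0, 1, 1, 0, 1])   # 1 = win, 0 = lose

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[40, 9]]))           # predicted class for a new candidate
print(model.predict_proba([[40, 9]]))     # probability of losing vs winning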
The dependent variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique to transform non-normal dependent variables into a normal shape. Most statistical techniques assume normality, so if the given data is not normal, applying a Box-Cox transformation means that you can run a broader number of tests.
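A minimal sketch of applying a Box-Cox transformation to a positively skewed variable, assuming SciPy and NumPy are available:

# Sketch: Box-Cox transformation of a skewed, strictly positive variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)      # positively skewed data

transformed, fitted_lambda = stats.boxcox(skewed)   # Box-Cox requires positive values
print("fitted lambda:", round(fitted_lambda, 3))
print("skewness before:", round(stats.skew(skewed), 2),
      "after:", round(stats.skew(transformed), 2))  # much closer to 0 after transforming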
Though the clustering algorithm is not specified, this question will mostly be asked in reference to K-Means clustering, where “K” defines the number of clusters. For example, the following image shows three different groups.
Within Sum of Squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS for a range of numbers of clusters, you will get the plot shown below. The graph is generally known as the Elbow Curve.
The red circled point in the above graph, i.e. Number of Clusters = 6, is the point after which you don’t see any decrement in WSS. This point is known as the bending point and is taken as K in K-Means. This is the widely used approach, but a few data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.
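A minimal sketch of the elbow approach, assuming scikit-learn and matplotlib are available, plotting WSS (KMeans inertia) against k on synthetic blob data:

# Sketch: plotting WSS against k to find the elbow point.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=6, random_state=0)

ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares (WSS)")
plt.title("Elbow curve")
plt.show()   # choose k at the bend, beyond which WSS barely decreases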
Deep learning is a subfield of machine learning inspired by the structure and function of the brain, called artificial neural networks. We have a large number of algorithms under machine learning, like linear regression, SVM, neural networks etc., and deep learning is just an extension of neural networks. In neural nets we consider a small number of hidden layers, but when it comes to deep learning algorithms we consider a huge number of hidden layers to better understand the input-output relationship.
22. What are Recurrent Neural Networks (RNNs)?
Recurrent nets are a type of artificial neural network designed to recognise patterns in sequential data such as time series, stock market and government agency data, etc. To understand recurrent nets, you first have to understand the basics of feed-forward nets. Both of these networks, RNNs and feed-forward nets, are named after the way they channel information through a series of mathematical operations performed at the nodes of the network. One feeds information straight through (never touching the same node twice), while the other cycles it through a loop, and the latter are called recurrent.
Recurrent networks, on the other hand, take as their input not just the current input example they see, but also what they have perceived previously in time. The BTSXPE at the bottom of the drawing represents the input example in the current moment, and CONTEXT UNIT represents the output of the previous moment. The decision a recurrent neural network reached at time t-1 affects the decision that it will reach one moment later at time t. So recurrent networks have two sources of input, the present and the recent past, which combine to determine how they respond to new data, much as we do in life.
The error they generate will return via back propagation and be used to adjust their
weights until error can’t go any lower. Remember, the purpose of recurrent nets is to
accurately classify sequential input. We rely on the back propagation of error and
gradient descent to do so.
Back propagation in feed forward networks moves backward from the final error
through the outputs, weights and inputs of each hidden layer, assigning those weights
responsibility for a portion of the error by calculating their partial derivatives — ∂E/∂w,
or the relationship between their rates of change. Those derivatives are then used by
our learning rule, gradient descent, to adjust the weights up or down, whichever
direction decreases error.
Machine learning:
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning can be categorised into the following three categories: supervised learning, unsupervised learning and reinforcement learning.
Deep learning:
Deep Learning is a sub field of machine learning concerned with algorithms inspired by
the structure and function of the brain called artificial neural networks.
Reinforcement learning:
Reinforcement learning is learning by interacting with an environment: an agent learns to map situations to actions so as to maximise a numerical reward signal, discovering which actions yield the most reward by trying them.
Selection bias is the bias introduced by the selection of individuals, groups or data for
analysis in such a way that proper randomisation is not achieved, thereby ensuring that
the sample obtained is not representative of the population intended to be analysed. It
is sometimes referred to as the selection effect. The phrase “selection bias” most often
refers to the distortion of a statistical analysis, resulting from the method of collecting
samples. If the selection bias is not taken into account, then some conclusions of the
study may not be accurate.
A subclass of information filtering systems that are meant to predict the preferences or
ratings that a user would give to a product. Recommender systems are widely used in
movies, news, research articles, products, social tags, music, etc.
30. If you have 4GB RAM in your machine and you want to train your model on a 10GB data set, how would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?
First of all you have to ask which ML model you want to train.
For neural networks: a small batch size with a memory-mapped NumPy array will work.
Steps:
1. Load the whole data set as a memory-mapped NumPy array. A memory-mapped array keeps a mapping of the complete data set on disk; it doesn’t load the complete data set into memory (see the sketch below).
2. You can pass an index to the NumPy array to get the required data.
3. Pass this data to the neural network.
4. Keep the batch size small.
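A minimal sketch of this batching idea, assuming NumPy is available; the file names, array shapes and the train_on_batch call are hypothetical placeholders:

# Sketch: training in batches from a dataset larger than RAM using a NumPy memory map.
import numpy as np

n_samples, n_features, batch_size = 1_000_000, 100, 512

# np.memmap maps the file into memory; slices are read from disk only when indexed.
X = np.memmap("features.dat", dtype="float32", mode="r", shape=(n_samples, n_features))
y = np.memmap("labels.dat", dtype="float32", mode="r", shape=(n_samples,))

for start in range(0, n_samples, batch_size):
    X_batch = np.asarray(X[start:start + batch_size])   # only this slice is loaded into RAM
    y_batch = np.asarray(y[start:start + batch_size])
    # model.train_on_batch(X_batch, y_batch)             # placeholder: feed the batch to your network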
When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. The p-value is a number between 0 and 1; based on its value, it denotes the strength of the results. The claim which is on trial is called the null hypothesis.
A low p-value (≤ 0.05) indicates evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (≥ 0.05) indicates evidence in favour of the null hypothesis, which means we fail to reject the null hypothesis. A p-value of exactly 0.05 indicates the hypothesis could go either way. To put it another way: with high p-values your data are likely under a true null; with low p-values your data are unlikely under a true null.
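A minimal sketch of obtaining a p-value from a two-sample t-test, assuming SciPy is available, on synthetic samples:

# Sketch: p-value from a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=100)
group_b = rng.normal(loc=52, scale=5, size=100)   # true means differ, so expect a low p-value

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("p-value:", round(p_value, 4))
if p_value <= 0.05:
    print("Reject the null hypothesis (the difference is statistically significant).")
else:
    print("Fail to reject the null hypothesis.")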
The Naive Bayes Algorithm is based on the Bayes Theorem. Bayes’ theorem describes
the probability of an event, based on prior knowledge of conditions that might be related
to the event.
What is Naive?
The algorithm is ‘naive’ because it assumes that the features are conditionally independent of each other, an assumption that may or may not turn out to be correct.
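A minimal sketch of the naive (conditional independence) assumption in practice, assuming scikit-learn's GaussianNB is available, on the Iris data set:

# Sketch: Naive Bayes classification with GaussianNB.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each feature is treated as conditionally independent given the class.
nb = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", round(nb.score(X_test, y_test), 2))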
33. Why do we generally use the Softmax non-linearity function as the last operation in a network?
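Softmax maps a vector of raw scores (logits) to values between 0 and 1 that sum to 1, so the network's final outputs can be read directly as class probabilities, and it pairs naturally with a cross-entropy loss. A minimal NumPy sketch:

# Minimal softmax sketch: turns raw scores into a probability distribution.
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)     # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)          # e.g. approximately [0.66 0.24 0.10]
print(probs.sum())    # 1.0: a valid probability distribution over the classes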
Ranking algorithms like LTR (learning to rank) solve a ranking problem on a list of items. The aim of LTR is to come up with an optimal ordering of those items. As such, LTR doesn’t care much about the exact score that each item gets, but cares more about the relative ordering among all the items. RankNet, LambdaRank and LambdaMART are all LTR algorithms developed by Chris Burges and his colleagues at Microsoft Research.
1. RankNet — The cost function for RankNet aims to minimize the number of
inversions in ranking. RankNet optimizes the cost function using Stochastic
Gradient Descent.
2. LambdaRank — Burges et al. found that during the RankNet training procedure, you don’t need the costs, only the gradients (λ) of the cost with respect
to the model score. You can think of these gradients as little arrows attached
to each document in the ranked list, indicating the direction we’d like those
documents to move. Further they found that scaling the gradients by the
change in NDCG found by swapping each pair of documents gave good
results. The core idea of LambdaRank is to use this new cost function for
training a RankNet. On experimental datasets, this shows both speed and
accuracy improvements over the original RankNet.
3. LambdaMart — LambdaMART combines LambdaRank and MART (Multiple
Additive Regression Trees). While MART uses gradient boosted decision
trees for prediction tasks, LambdaMART uses gradient boosted decision
trees using a cost function derived from LambdaRank for solving a ranking
task. On experimental datasets, LambdaMART has shown better results than
LambdaRank and the original RankNet.
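As a rough sketch of the pairwise idea behind RankNet, the snippet below (pure NumPy, assuming document i should rank above document j) evaluates the logistic pairwise cost that RankNet minimizes for such a pair:

# Rough sketch of RankNet's pairwise cost for a pair where i should outrank j.
import numpy as np

def ranknet_pair_cost(s_i, s_j, sigma=1.0):
    """Low when s_i > s_j (correct ordering), high when the pair is inverted."""
    return np.log(1.0 + np.exp(-sigma * (s_i - s_j)))

print(ranknet_pair_cost(3.0, 1.0))   # correctly ordered pair -> small cost (~0.13)
print(ranknet_pair_cost(1.0, 3.0))   # inverted pair -> large cost (~2.13)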
35. What is the difference between Ridge and Lasso Regularisation?
Ridge and Lasso regression use two different penalty functions: Ridge uses L2, whereas Lasso uses L1. In ridge regression, the penalty is the sum of the squares of the coefficients, and for the Lasso, it’s the sum of the absolute values of the coefficients. Lasso is a shrinkage towards zero using an absolute value (L1 penalty) rather than a sum of squares (L2 penalty).
As we know, ridge regression can’t zero out coefficients: you end up keeping all of them, only shrunk. LASSO, on the other hand, does both parameter shrinkage and variable selection automatically because it zeroes out the coefficients of collinear variables. This helps select the relevant variable(s) out of the given n variables while performing lasso regression.
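A minimal sketch contrasting the L2 and L1 penalties, assuming scikit-learn is available, on synthetic data where only a few features are informative:

# Sketch: Ridge (L2) keeps all coefficients, Lasso (L1) zeroes many out.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 5 of them are actually informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically 0: shrinks but keeps all
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # many zeros: built-in variable selection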
41. How can you generate a random number between 1 and 7 with only a die?
● Any die has six sides from 1-6. There is no way to get seven equal outcomes
from a single rolling of a die. If we roll the die twice and consider the event of two
rolls, we now have 36 different outcomes.
● To get our 7 equal outcomes we have to reduce this 36 to a number divisible by
7. We can thus consider only 35 outcomes and exclude the other one.
● A simple scenario can be to exclude the combination (6,6), i.e., to roll the die
again if 6 appears twice.
● All the remaining combinations from (1,1) till (6,5) can be divided into 7 parts of 5
each. This way all the seven sets of outcomes are equally likely.
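A small simulation of this two-roll trick, in pure Python, to check that the seven outcomes come out roughly equally likely:

# Sketch: simulating the two-roll trick for a uniform number from 1 to 7.
import random
from collections import Counter

def roll_1_to_7():
    while True:
        first, second = random.randint(1, 6), random.randint(1, 6)
        if (first, second) == (6, 6):
            continue                             # discard (6,6) and roll again
        index = (first - 1) * 6 + (second - 1)   # 0..34 for the 35 kept outcomes
        return index % 7 + 1                     # 35 outcomes split into 7 groups of 5

counts = Counter(roll_1_to_7() for _ in range(70000))
print(sorted(counts.items()))                    # each of 1..7 appears roughly 10000 times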
42. A certain couple tells you that they have two children, at least one of which is a
girl. What is the probability that they have two girls?
In the case of two children, there are 4 equally likely possibilities: BB, BG, GB and GG, where B = Boy and G = Girl and the first letter denotes the first child.
From the question, we can exclude the first case of BB. Thus, from the remaining 3 equally likely possibilities of BG, GB & GG, we have to find the probability of the case with two girls, which is 1/3.
43. A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that it is the double-headed coin?
There are two ways of seeing 10 heads: one is to have picked the double-headed coin, and the other is to have picked a fair coin that happens to land heads 10 times. By Bayes' rule, P(double headed | 10 heads) = (1/1000 × 1) / (1/1000 × 1 + 999/1000 × (1/2)^10) ≈ 0.506.
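The same Bayes' rule computation, written out in pure Python:

# Sketch: Bayes' rule for the jar-of-coins problem.
p_double = 1 / 1000                 # prior: picking the double-headed coin
p_fair = 999 / 1000                 # prior: picking a fair coin
p_heads10_double = 1.0              # a double-headed coin always shows heads
p_heads10_fair = 0.5 ** 10          # a fair coin shows 10 heads with probability (1/2)^10

posterior = (p_double * p_heads10_double) / (
    p_double * p_heads10_double + p_fair * p_heads10_fair
)
print(round(posterior, 4))          # ~0.5062: about a 50% chance it is the double-headed coin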
● Python would be the best option because it has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.
● Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
● Data cleaning helps to increase the accuracy of the model in machine learning.
● It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data they generate.
● It might take up to 80% of the time just to clean the data, making it a critical part of the analysis task.
Univariate analyses are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, a pie chart of sales based on territory involves only one variable, and the analysis can be referred to as univariate analysis.
Bivariate analysis attempts to understand the difference between two variables at a time, as in a scatterplot. For example, analyzing the volume of sales and spending together can be considered an example of bivariate analysis.
Multivariate analysis deals with the study of more than two variables to understand the effect of the variables on the responses.
It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.
Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection, or cluster, of elements.
For example, a researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters depending on his research, through simple or systematic random sampling.
Let’s continue our Data Science Interview Questions blog with some more statistics
questions.
Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner, so once you reach the end of the list, it is progressed from the top again.
Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. An eigenvalue can be referred to as the strength of the transformation in the direction of its eigenvector, or the factor by which the compression occurs.
51. Can you cite some examples where a false positive is more important than a false negative?
Let us first understand what false positives and false negatives are.
● False Positives are the cases where you wrongly classified a non-event as an event, also called a Type I error.
● False Negatives are the cases where you wrongly classify events as non-events, also called a Type II error.
Example 1: In the medical field, assume you have to give chemotherapy to patients.
Assume a patient comes to that hospital and he is tested positive for cancer, based on
the lab prediction but he actually doesn’t have cancer. This is a case of false positive.
Here it is of utmost danger to start chemotherapy on this patient when he actually does
not have cancer. In the absence of cancerous cells, chemotherapy will do certain damage to his normal healthy cells and might lead to severe diseases, even cancer.
Example 2: Let’s say an e-commerce company decided to give $1000 Gift voucher to the
customers whom they assume to purchase at least $10,000 worth of items. They send
free voucher mail directly to 100 customers without any minimum purchase condition
because they assume they will make at least a 20% profit on sold items above $10,000. Now the issue is that we may send the $1000 gift vouchers to customers who have not actually purchased anything but are marked as having made $10,000 worth of purchases — these are the false positives.
52. Can you cite some examples where a false negative is more important than a false positive?
Example 1: Assume there is an airport ‘A’ which has received high-security threats, and based on certain characteristics it identifies whether a particular passenger is a threat or not. Due to a shortage of staff, they decide to scan only the passengers predicted as risk positives by their predictive model. What will happen if a true threat passenger is flagged as a non-threat by the model?
Example 3: What if you rejected to marry a very good person based on your predictive
model and you happen to meet him/her after a few years and realize that you had a
false negative?
53. Can you cite some examples where both false positives and false negatives are equally important?
In the banking industry, giving loans is the primary source of making money, but at the same time, if your repayment rate is not good you will not make any profit; rather, you will risk huge losses.
Banks don’t want to lose good customers, and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and the false negatives become very important to measure.
54. Can you explain the difference between a Validation Set and a Test Set?
A validation set can be considered as a part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built.
On the other hand, a test set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as: the training set is to fit the parameters (i.e. weights), and the test set is to assess the performance of the model (i.e. evaluate its predictive power and generalization).
55. Explain cross-validation.
Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in backgrounds where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.
The goal of cross-validation is to define a data set to test the model in the training phase (i.e. a validation data set) in order to limit problems like overfitting and to get an insight into how the model will generalize to an independent data set.
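A minimal sketch of k-fold cross-validation, assuming scikit-learn is available, on the Iris data set:

# Sketch: 5-fold cross-validation with cross_val_score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each fold is held out once as validation data while the rest trains the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores.round(2))
print("mean accuracy:", scores.mean().round(2))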
Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms devise complex models that lend themselves to a prediction, which in commercial use is known as predictive analytics.
Recommender systems are a subclass of information filtering systems meant to predict the preferences or ratings that a user would give to a product; they are widely used in movies, news, research articles, products, social tags, music, etc.
A common approach is collaborative filtering: for example, predicting the rating a user would give a movie based on his/her ratings for other movies and others’ ratings for all movies. Examples include product recommenders in e-commerce sites like Amazon, eBay & Flipkart, and YouTube video recommenders.
Outlier values can be identified by using univariate or any other graphical analysis method. If the number of outlier values is small, they can be assessed individually, but for a large number of outliers, the values can be substituted with either the 99th or the 1st percentile values.
3. Prepare the data for modelling by detecting outliers, treating missing values and transforming variables.
4. After data preparation, start running the model, analyze the result and tweak the approach. This is an iterative step until the best possible outcome is achieved.
6. Start implementing the model and track the results to analyze the performance of the model over a period of time.
The extent of the missing values is identified after identifying the variables with missing values. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights.
If there are no patterns identified, then the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. Getting into the data is important: a default value can be assigned, which can be the mean, minimum or maximum value. If it is a categorical variable, a default category is assigned; if the data follows a distribution, for example a normal distribution, the mean value can be assigned.
If 80% of the values for a variable are missing, then you can answer that you would be dropping the variable instead of treating the missing values.
65. How will you define the number of clusters in a clustering algorithm?
Though the clustering algorithm is not specified, this question is mostly asked in reference to K-Means clustering, where “K” defines the number of clusters. The objective of clustering is to group similar entities in a way that the entities within a group are similar to each other while the groups are different from one another.
Within Sum of Squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS for a range of numbers of clusters, you will get the plot shown below.
● The graph is generally known as the Elbow Curve.
● The red circled point in the above graph, i.e. Number of Clusters = 6, is the point after which you don’t see any decrement in WSS.
This is the widely used approach, but a few data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.
Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It is also used for dimensionality reduction and treats missing values and outlier values. It is a type of ensemble learning method where a group of weak models combine to form a powerful model.
In Random Forest, we grow multiple trees as opposed to a single tree. To classify a new object based on attributes, each tree gives a classification. The forest chooses the classification having the most votes (over all the trees in the forest), and in the case of regression, it takes the average of the outputs of the different trees.
69. What cross-validation technique would you use on a time series data set?
Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data — it is inherently ordered chronologically. In the case of time series data, you should use techniques like forward chaining, where you model on past data and then look at forward-facing data.
The dependent variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique to transform non-normal dependent variables into a normal shape. Most statistical techniques assume normality, so if your data isn’t normal, applying a Box-Cox transformation means that you are able to run a broader number of tests. The Box-Cox transformation is named after statisticians George Box and Sir David Roxbee Cox, who collaborated on a 1964 paper and developed the technique.
Data Analyst
Data Analysts deliver value to their companies by taking data, using it to answer
questions, and communicating the results to help make business decisions. Common
tasks done by data analysts include data cleaning, performing analysis and creating
data visualizations.
Data scientist
The data scientist is an individual who can provide immense value by tackling more open-ended questions: where the analyst focuses on understanding data from past and present perspectives, the scientist focuses on producing reliable predictions for the future.
Data Engineer
The data engineer establishes the foundation that the data analysts and scientists build
upon. Data engineers are responsible for constructing data pipelines and often have to
use complex tools and techniques to handle data at scale. Unlike the previous two
career paths, data engineering leans a lot more toward a software development skill set.
72. Why is the derivative/differentiation used?
When updating the curve, we need to know in which direction and by how much to change or update the curve, and that depends on the slope. That is why we use differentiation in almost every part of machine learning and deep learning.
Fig: Activation Function Cheatsheet
Fig: Derivative of Activation Functions
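A minimal pure-Python sketch of the idea: the derivative (slope) tells each update which direction to move in and by how much, as in gradient descent on a simple loss curve:

# Sketch: using the derivative to decide the direction and size of each update.
def f(w):
    return (w - 3) ** 2          # simple loss curve with its minimum at w = 3

def df_dw(w):
    return 2 * (w - 3)           # derivative of the loss with respect to w

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * df_dw(w)   # move against the slope, proportionally to its magnitude
print(round(w, 4))                  # converges close to 3, the minimum of the curve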
73) Mention the difference between Data Mining and Machine learning?
Machine learning relates to the study, design and development of algorithms that give computers the capability to learn without being explicitly programmed. Data mining, on the other hand, can be defined as the process in which unstructured data is mined to extract knowledge or interesting unknown patterns.
In machine learning, when a statistical model describes random error or noise instead of the underlying relationship, ‘overfitting’ occurs. Overfitting is normally observed when a model is excessively complex, because of having too many parameters with respect to the amount of training data. The model that has been overfit exhibits poor predictive performance.
The possibility of overfitting exists because the criterion used for training the model is not the same as the criterion used to judge the efficacy of the model.
Overfitting can be avoided by using a lot of data; overfitting happens easily when you have a small dataset and try to learn from it. But if you have a small dataset and are forced to build a model based on it, you can use a technique known as cross-validation. In this method, the dataset is split into two sections, a testing and a training dataset; the testing dataset only tests the model, while the training dataset is used to build it.
In this technique, a model is usually given a dataset of known data on which training is run (the training data set) and a dataset of unknown data against which the model is tested. The idea of cross-validation is to define a dataset to “test” the model in the training phase.
a) Decision Trees
c) Probabilistic networks
d) Nearest Neighbor
a) Supervised Learning
b) Unsupervised Learning
c) Semi-supervised Learning
d) Reinforcement Learning
e) Transduction
f) Learning to Learn
80) What are the three stages to build the hypotheses or model in machine
learning?
a) Model building
b) Model testing
c) Applying the model
The standard approach to supervised learning is to split the set of examples into the training set and the test set.
In various areas of information science, like machine learning, a set of data used to discover the potentially predictive relationship is known as the ‘training set’. The training set is the set of examples given to the learner, while the test set is used to test the accuracy of the hypotheses generated by the learner, and it is the set of examples held back from the learner.
a) Artificial Intelligence
b) Rule based inference
a) Classifications
b) Speech recognition
c) Regression
e) Annotate strings
88) What is the difference between artificial intelligence and machine learning?
Designing and developing algorithms according to behaviours based on empirical data is known as machine learning, while artificial intelligence, in addition to machine learning, also covers other aspects like knowledge representation, natural language processing, planning, robotics etc.
A classifier in machine learning is a system that inputs a vector of discrete or continuous feature values and outputs a single discrete value, the class.
The Naïve Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. Its main disadvantage is that it can’t learn interactions between features.
a) Computer Vision
b) Speech Recognition
c) Data Mining
d) Statistics
e) Information Retrieval
f) Bio-Informatics
92) What is Genetic Programming?
Genetic programming is one of the two techniques used in machine learning. The model
is based on the testing and selecting the best choice among a set of results.
Inductive Logic Programming (ILP) is a subfield of machine learning which uses logical programming to represent background knowledge and examples.
The process of selecting models among different mathematical models which are used to describe the same data set is known as model selection. Model selection is applied to the fields of statistics, machine learning and data mining.
95) What are the two methods used for the calibration in Supervised Learning?
The two methods used for predicting good probabilities in Supervised Learning are
a) Platt Calibration
b) Isotonic Regression
These methods are designed for binary classification, and it is not trivial to extend them to multi-class problems.
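A minimal sketch of both methods, assuming scikit-learn's CalibratedClassifierCV is available (method="sigmoid" corresponds to Platt calibration, method="isotonic" to isotonic regression), wrapped around a LinearSVC on synthetic data:

# Sketch: calibrating classifier scores into probabilities with Platt scaling and isotonic regression.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for method in ("sigmoid", "isotonic"):          # "sigmoid" = Platt calibration
    calibrated = CalibratedClassifierCV(LinearSVC(max_iter=5000), method=method, cv=3)
    calibrated.fit(X_train, y_train)
    print(method, calibrated.predict_proba(X_test)[:3, 1].round(2))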
97) What is the difference between heuristics for rule learning and heuristics for decision trees?
The difference is that the heuristics for decision trees evaluate the average quality of a number of disjoint sets, while rule learners only evaluate the quality of the set of instances that is covered by the candidate rule.
A Bayesian logic program consists of two components. The first component is a logical one; it consists of a set of Bayesian clauses, which capture the qualitative structure of the domain. The second component is a quantitative one; it encodes the quantitative information about the domain.
A Bayesian network is used to represent the graphical model for the probability relationship among a set of variables.
101) Why is an instance based learning algorithm sometimes referred to as a lazy learning algorithm?
Instance based learning algorithms are also referred to as lazy learning algorithms because they delay the induction or generalization process until classification is performed.
To solve a particular computational program, multiple models such as classifiers or experts are strategically generated and combined. This process is known as ensemble learning.
Ensemble learning is used when you build component classifiers that are more accurate and independent from each other. Ensemble methods combine several models built with a given learning algorithm in order to improve robustness over a single model.
Bagging is a method in ensemble learning for improving unstable estimation or classification schemes, while boosting methods are used sequentially to reduce the bias of the combined model. Boosting and bagging both can reduce errors by reducing the variance term.
The expected error of a learning algorithm can be decomposed into bias and variance. The bias term measures how closely the average classifier produced by the learning algorithm matches the target function. The variance term measures how much the learning algorithm’s prediction fluctuates for different training sets.
Incremental learning is the ability of an algorithm to learn from new data that may become available after the classifier has already been generated from an already available dataset.
110) What is PCA, KPCA and ICA used for?
PCA (Principal Component Analysis), KPCA (Kernel-based Principal Component Analysis) and ICA (Independent Component Analysis) are important feature extraction techniques used for dimensionality reduction.
In machine learning and statistics, dimension reduction is the process of reducing the number of random variables under consideration, and it can be divided into feature selection and feature extraction.
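A minimal sketch of feature extraction with PCA, assuming scikit-learn is available, on the Iris data set:

# Sketch: dimension reduction via feature extraction with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 4 original features

pca = PCA(n_components=2)                # extract 2 new components from the 4 features
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)                       # (150, 2)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))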
Support vector machines are supervised learning algorithms used for classification and
regression analysis.
a) Data Acquisition
d) Query Type
e) Scoring Metric
f) Significance Test
114) What are the different methods for Sequential Supervised Learning?
a) Sliding-window methods
115) What are the areas in robotics and information processing where the sequential prediction problem arises?
The areas in robotics and information processing where the sequential prediction problem arises are
a) Imitation Learning
b) Structured prediction
Statistical learning techniques allow learning a function or predictor from a set of observed data that can make predictions about unseen or future data. These techniques provide guarantees on the performance of the learned predictor on the future unseen data, based on a statistical assumption on the data generating process.
PAC (Probably Approximately Correct) learning is a learning framework that has been introduced to analyse learning algorithms and their statistical efficiency.
118) What are the different categories into which you can categorize the sequence learning process?
a) Sequence prediction
b) Sequence generation
c) Sequence recognition
d) Sequential decision
a) Genetic Programming
b) Inductive Learning
121) Give a popular application of machine learning that you see on a day-to-day basis.
The recommendation engines implemented by major e-commerce websites use machine learning.