Sutherland Interview Question and Answer
But what if the dataset is huge? Does this approach still make sense? Not really: measuring the weight of every student would be a tiresome, time-consuming process. So, what can we do instead? Let's look at an alternative approach.
In graph (a), you are looking at the residuals between the data points and the overall sample mean. In graph (c), you are looking at the residuals between the data points and the model you fitted to the data. In graph (b), however, you are looking at the residuals between the model and the overall sample mean.
The sum of squares measures how the residuals compare to the model or to the mean, depending on which one we are working with. There are three sums of squares that we are concerned with: the total sum of squares (SST), the residual sum of squares (SSR), and the sum of squares for the model itself.
It is important to note that while the equations may look the same at first glance, there is an important distinction. The SSR equation involves the predicted value, so the second Y has a caret over it (pronounced Y-hat). The SST equation involves the sample mean, so the second Y has a bar over it (pronounced Y-bar). Don't forget this very important distinction.
The difference between the two (SST - SSR) will tell you the overall sum of squares for the model itself, as in graph (b). This is what we are after in order to finally start calculating the actual F value.
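For reference, here is one standard way to write the three quantities, where y_i are the observed values, ŷ_i the model's predictions, and ȳ the sample mean (SSM denotes the model sum of squares; the last identity holds for ordinary least-squares fits with an intercept):

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
SSM = SST - SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2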
These sum-of-squares values give us a sense of how much the model deviates from the observed values, which helps in determining whether the model is really any good for prediction. The next step in the F-test process is to calculate the mean squares for the residuals and for the model.
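As a sketch, for a regression with n observations and p predictors, the mean squares divide each sum of squares by its degrees of freedom, and the F statistic is their ratio:

MSM = \frac{SSM}{p}, \qquad MSR = \frac{SSR}{n - p - 1}, \qquad F = \frac{MSM}{MSR}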
If your two groups are the same size and you are running a before-and-after style experiment, then you conduct what is called a Dependent or Paired Sample t-test.
If the two groups are different sizes or you are comparing the means of two separate groups, then you conduct an Independent Sample t-test, as sketched below.
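A minimal sketch of the two variants, assuming SciPy is available (the sample numbers below are made up purely for illustration):

from scipy import stats

# Paired (dependent) t-test: the same subjects measured before and after.
before = [72.1, 68.4, 75.0, 70.2, 69.8]
after = [70.5, 67.9, 73.8, 69.0, 69.1]
t_paired, p_paired = stats.ttest_rel(before, after)

# Independent t-test: two separate groups, possibly of different sizes.
group_a = [65.3, 70.1, 68.7, 72.4]
group_b = [71.2, 74.8, 69.9, 73.5, 70.6]
t_ind, p_ind = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's version

print(f"paired: t={t_paired:.3f}, p={p_paired:.3f}")
print(f"independent: t={t_ind:.3f}, p={p_ind:.3f}")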
Now, let us understand the degrees of freedom for the within-group and between-group variation, respectively.
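As a quick reference, for a standard one-way ANOVA with k groups and N total observations:

df_{between} = k - 1, \qquad df_{within} = N - k, \qquad F = \frac{SS_{between} / df_{between}}{SS_{within} / df_{within}}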
Random Forests are less likely to overfit, but it is still something that you want to make an explicit effort to avoid. The main thing you need to do is optimize a tuning parameter that governs the number of features randomly chosen to grow each tree from the bootstrapped data. Typically, you do this via k-fold cross-validation, where k ∈ {5, 10}, and choose the tuning parameter that minimizes test-sample prediction error. In addition, growing a larger forest will improve predictive accuracy, although there are usually diminishing returns once you get up to several hundred trees.
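A minimal sketch of that tuning loop, assuming scikit-learn (the dataset is synthetic and the max_features grid is just an illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Tune max_features (features considered at each split) via 5-fold CV.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    param_grid={"max_features": [2, 4, 8, "sqrt"]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)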
Person table:

ID | NAME
1  | iNeuron
2  | One neuron
3  | iNeuron

Expected Output:

NAME       | num
iNeuron    | 2
One neuron | 1
Some NAME values appear more than once, so to count how many times each NAME appears, we can use the following query:
select NAME, count(NAME) as num
from Person
group by NAME;
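If you only wanted the names that appear more than once, you could add HAVING count(NAME) > 1 after the GROUP BY clause.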
Question 7. Which regularisation technique is used for feature selection, and how are features reduced?
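One common answer is L1 (Lasso) regularisation, which can shrink some coefficients exactly to zero, effectively dropping those features from the model. A minimal sketch, assuming scikit-learn and a synthetic dataset:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# The L1 penalty drives uninformative coefficients to exactly zero,
# so the nonzero coefficients identify the selected features.
model = Lasso(alpha=1.0).fit(X, y)
selected = [i for i, c in enumerate(model.coef_) if c != 0]
print("kept features:", selected)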