
Sutherland Interview Question and Answer

Question 1. Central Limit Theorem (CLT)


Let’s start with an example. Consider that there are 15 sections in the science department of a university, and each section hosts around 100 students. Our task is to calculate the average weight of the students in the science department. Sounds simple, right?

The approach I usually get from aspiring data scientists is to simply calculate the average:

 First, measure the weights of all the students in the science department
 Add up all the weights
 Finally, divide the total sum of weights by the total number of students to get the average

But what if the size of the data is humongous? Does this approach
make sense? Not really – measuring the weight of all the students
will be a very tiresome and long process. So, what can we do
instead? Let’s look at an alternate approach.

 First, draw groups of students at random from the department. We will call each group a sample. We’ll draw multiple samples, each consisting of 30 students
 Calculate the individual mean of each of these samples
 Calculate the mean of these sample means
 This value will give us the approximate mean weight of the students in the science department
 Additionally, the histogram of the sample mean weights of the students will resemble a bell curve (or normal distribution), as the simulation sketch below shows
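A minimal simulation sketch of this procedure in Python (NumPy assumed; the weight distribution below is invented purely for illustration):

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 15 sections x 100 students, with skewed weights in kg
population = rng.gamma(shape=9.0, scale=7.0, size=1500)

# Draw many samples of 30 students and record each sample mean
sample_means = [rng.choice(population, size=30, replace=False).mean()
                for _ in range(1000)]

print("True population mean:", population.mean())
print("Mean of sample means:", np.mean(sample_means))
# A histogram of sample_means will look approximately normal (bell-shaped),
# even though the underlying weight distribution is skewed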
Significance of the Central Limit Theorem
The central limit theorem has both statistical significance as well as
practical applications. Isn’t that the sweet spot we aim for when
we’re learning a new concept?
We’ll look at both aspects to gauge where we can use them.

 Statistical Significance of CLT

 Analyzing data involves statistical methods like hypothesis testing and constructing confidence intervals. These methods assume that the population is normally distributed. In the case of unknown or non-normal distributions, we treat the sampling distribution as normal, according to the central limit theorem
 If we increase the size of the samples drawn from the population, the standard deviation of the sample means will decrease. This helps us estimate the population mean much more accurately
 Also, the sample mean can be used to create a range of values known as a confidence interval (one that is likely to contain the population mean)

Assumptions Behind the Central Limit Theorem

1. The data must follow the randomization condition. It must be sampled randomly
2. Samples should be independent of each other. One sample should not influence the other samples
3. The sample size should be no more than 10% of the population when sampling is done without replacement
4. The sample size should be sufficiently large. Now, how will we figure out how large this size should be? Well, it depends on the population. When the population is skewed or asymmetric, the sample size should be large. If the population is symmetric, then we can draw small samples as well. A common rule of thumb is a sample size of at least 30.

Question 2. Explain the F-test.


Sometimes we want to compare a model that we have calculated to a mean. For example, let’s say that you have calculated a linear regression model. Remember that the mean is also a model that can be used to explain the data. The F-test is a way to compare the model that we have calculated to the overall mean of the data. Similar to the t-test, if the F-statistic is higher than a critical value, then the model is better at explaining the data than the mean is.

Before we get into the nitty-gritty of the F-test, we need to talk about the sum of squares. Let’s take a look at an example of some data that already has a line of best fit on it.

The F-test compares what is called the mean sum of squares for the residuals of the model and the overall mean of the data. The residuals are the difference between the actual, or observed, data points and the predicted data points.

In the case of graph (a), you are looking at the residuals between the data points and the overall sample mean. In the case of graph (c), you are looking at the residuals between the data points and the model that you calculated from the data. And in the case of graph (b), you are looking at the residuals between the model and the overall sample mean.

The sum of squares is a measure of how the residuals compare to the model or to the mean, depending on which one we are working with. There are three sums of squares that we are concerned with.

The sum of squares of the residuals (SSR) is the sum of the squares of the residuals between the data points and the actual regression line, like graph (c). They are squared to compensate for the negative values. SSR is calculated by

SSR = Σ(Y − Ŷ)²

The sum of squares of the total (SST) is the sum of the squares of the residuals between the data points and the mean of the sample, like graph (a). They are squared to compensate for the negative values. SST is calculated by

SST = Σ(Y − Ȳ)²

It is important to note that while the equations may look the same at first glance, there is an important distinction. The SSR equation involves the predicted value, so the second Y has a little hat over it (pronounced Y-hat). The SST equation involves the sample mean, so the second Y has a little bar over it (pronounced Y-bar). Don’t forget this very important distinction.

The difference between the two (SSM = SST − SSR) will tell you the overall sum of squares for the model itself, like graph (b). This is what we are after in order to finally start to calculate the actual F value. These sum of squares values give us a sense of how much the model varies from the observed values, which comes in handy in determining if the model is really any good for prediction. The next step in the F-test process is to calculate the mean of squares for the residuals and for the model.

To calculate the mean of squares of the model, or MSM, you need to know the degrees of freedom for the model. Thankfully, it is pretty straightforward: the degrees of freedom for the model is the number of predictor variables in the model. Then follow the formula MSM = SSM ÷ dfmodel

To calculate the mean of squares of the residuals, or MSR, you need to know the degrees of freedom of the residuals. For a model with p predictors fitted to N observations, this is N − p − 1 (for a simple regression with one predictor, N − 2). Then simply follow the formula MSR = SSR ÷ dfresiduals
Ok, you have done a whole lot of math so far. I’m proud of you because I know that it is not super fun. But it is super important to know where these values come from, because it helps you understand how they work. Because now we are actually going to see how the F-statistic is calculated:

F = MSM ÷ MSR

This calculation gives you a ratio of the model’s prediction to the regular mean of the data. Then you compare this ratio to an F-distribution table, as you would the t-statistic. If the calculated value exceeds the critical value in the table, then the model is significantly different from the mean of the data, and therefore better at explaining the patterns in the data.
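Putting the pieces together, here is a small sketch of the whole calculation for a simple one-predictor regression, using NumPy and SciPy (the data is invented for illustration):

import numpy as np
from scipy import stats

# Invented data for illustration
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

# Fit a simple linear regression (one predictor, so df_model = 1)
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)   # residuals around the sample mean (Y-bar)
ssr = np.sum((y - y_hat) ** 2)      # residuals around the regression line (Y-hat)
ssm = sst - ssr                     # sum of squares explained by the model

df_model = 1                        # number of predictors
df_resid = len(y) - df_model - 1

f_stat = (ssm / df_model) / (ssr / df_resid)   # F = MSM / MSR
p_value = stats.f.sf(f_stat, df_model, df_resid)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")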

Question 3. Explain the t-test.


The t-test is a statistical test that compares the means of two different groups. There are a bunch of cases in which you may want to compare group performance, such as test scores, clinical trials, or even how happy different types of people are in different places. Of course, different groups and setups call for different types of tests. The type of t-test that you need depends on the type of sample that you have.

If your two groups are the same size and you are running a sort of before-and-after experiment, then you will conduct what is called a Dependent or Paired Sample t-test.

If the two groups are different sizes, or you are comparing the means of two separate groups, then you conduct an Independent Sample t-test.
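Both cases are one call in SciPy; a quick sketch (the scores are invented for illustration):

from scipy import stats

# Invented example data
before = [72, 85, 78, 90, 66, 81]
after = [75, 88, 80, 93, 70, 84]
group_a = [65, 70, 72, 68, 74]
group_b = [78, 82, 80, 85]

# Paired t-test: the same subjects measured before and after
t_paired, p_paired = stats.ttest_rel(before, after)

# Independent t-test: two separate groups (sizes may differ)
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

print(f"paired: t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")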

Question 4. Explain ANOVA.


ANOVA (Analysis of Variance) is used to check whether at least one of two or more groups has a statistically different mean. Now, the question arises: why do we need another test for checking the difference of means between independent groups? Why can we not use multiple t-tests to check for the difference in means?

The answer is simple. Multiple t-tests compound the error rate of the result. Performing the t-test three times at a 5% significance level gives a familywise error rate of about 14% (1 − 0.95³ ≈ 0.143), which is too high, whereas ANOVA keeps it at 5% for a 95% confidence level.

To perform an ANOVA, you must have a continuous response variable and at least one categorical factor with two or more levels. ANOVA requires data from approximately normally distributed populations with equal variances between factor levels. However, ANOVA procedures work quite well even if the normality assumption has been violated, unless one or more of the distributions are highly skewed or the variances are quite different.

ANOVA is measured using a statistic known as the F-ratio. It is defined as the ratio of the Mean Square (between groups) to the Mean Square (within groups).

Mean Square (between groups) = Sum of Squares (between groups) / degrees of freedom (between groups)

Mean Square (within groups) = Sum of Squares (within groups) / degrees of freedom (within groups)

where the two sums of squares are

Sum of Squares (between groups) = Σ n(X̄ᵢ − X̄)², summed over the groups i

Sum of Squares (within groups) = Σ Σ (Xᵢⱼ − X̄ᵢ)², summed over the observations j within each group i

Here, k represents the number of groups, n represents the number of observations in a group, X̄ᵢ represents the mean of a particular group, and X̄ represents the mean of all the observations.

Now, let us understand the degrees of freedom for within groups and between groups respectively.

Between groups: if there are k groups in the ANOVA model, only k − 1 of the group means are free to vary. Hence, k − 1 degrees of freedom.

Within groups: if N represents the total number of observations in the ANOVA (∑n over all groups) and k is the number of groups, then the k group means are fixed points. Hence, N − k degrees of freedom.
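A one-way ANOVA is a single call in SciPy; a sketch with invented group values:

from scipy import stats

# Invented example: test scores under three teaching methods
method_1 = [84, 79, 91, 85, 88]
method_2 = [76, 80, 74, 79, 81]
method_3 = [90, 92, 88, 94, 89]

# One-way ANOVA: F-ratio of between-group to within-group mean squares
f_ratio, p_value = stats.f_oneway(method_1, method_2, method_3)
print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")
# A small p-value suggests at least one group mean differs from the others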
Question 5. Random forest overfitting avoidance technique.

Random Forests are less likely to overfit than single decision trees, but it is still something that you want to make an explicit effort to avoid. The main thing you need to do is optimize a tuning parameter that governs the number of features that are randomly chosen to grow each tree from the bootstrapped data. Typically, you do this via k-fold cross-validation, where k ∈ {5, 10}, and choose the tuning parameter that minimizes test sample prediction error. In addition, growing a larger forest will improve predictive accuracy, although there are usually diminishing returns once you get up to several hundred trees.
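A sketch of this tuning with scikit-learn, where the parameter is called max_features, using 5-fold cross-validation (the dataset is just a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Candidate values for the number of features considered at each split
param_grid = {"max_features": ["sqrt", "log2", 0.3, 0.5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)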

Question 6. SQL code to print duplicates.


Input

ID NAME
1 iNeuron
2 One neuron
3 iNeuron

Expected Output:

NAME num
iNeuron 2
One neuron 1

A duplicated NAME exists more than once, so to count the times each NAME appears, we can use the following code:

select NAME, count(NAME) as num
from Person
group by NAME;
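If the task is strictly to print the duplicates, i.e. only the names that appear more than once, a HAVING clause filters the groups (a sketch against the same hypothetical Person table):

select NAME, count(NAME) as num
from Person
group by NAME
having count(NAME) > 1;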
Question 7. Regularisation technique for feature selection and how features are reduced.

Regularisation consists of adding a penalty to the different parameters of the machine learning model to reduce the freedom of the model, in other words, to avoid overfitting. In linear model regularisation, the penalty is applied to the coefficients that multiply each of the predictors.

There are several techniques for feature selection:

 Wrapper methods (forward, backward, and stepwise selection)
 Filter methods (ANOVA, Pearson correlation, variance thresholding)
 Embedded methods (Lasso, Ridge, Decision Tree)

Of these, Lasso (L1) regularisation is the one that reduces features directly: its penalty can shrink some coefficients exactly to zero, which effectively removes those features from the model, whereas Ridge (L2) only shrinks coefficients towards zero without eliminating them.
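A sketch of embedded feature selection with Lasso in scikit-learn (the dataset and alpha value are stand-ins):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)   # stand-in dataset
X = StandardScaler().fit_transform(X)   # scale so the penalty treats features fairly

# The L1 penalty shrinks some coefficients exactly to zero
model = Lasso(alpha=0.5).fit(X, y)

kept = np.flatnonzero(model.coef_)      # indices of the surviving features
print(f"kept {kept.size} of {X.shape[1]} features:", kept)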

Question 8. Heteroscedasticity and how it happens in regression.

Heteroscedasticity is a systematic change in the spread of the residuals over the range of measured values. Heteroscedasticity is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population that has a constant variance (homoscedasticity).

There are three common ways to fix heteroscedasticity in regression:

1. Transform the dependent variable. One way to fix heteroscedasticity is to transform the dependent variable in some way, for example by taking its logarithm.
2. Redefine the dependent variable. Another way to fix heteroscedasticity is to redefine the dependent variable, for example by modelling a rate (such as a count per capita) rather than a raw count.
3. Use weighted regression, which gives less weight to the observations with higher residual variance (sketched below).
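A sketch with statsmodels: detect heteroscedasticity with the Breusch-Pagan test, then apply weighted regression as one possible fix (the data is invented so that the noise grows with x):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 3 * x + rng.normal(0, x)        # noise spread grows with x: heteroscedastic

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value indicates heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X)
print(f"Breusch-Pagan p = {lm_pvalue:.4f}")

# Fix 3 from above: weighted least squares, down-weighting the noisier points
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()
print(wls_fit.params)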

Question 9. Odds ratio in logistic regression.

In logistic regression, the odds ratio for a predictor is the exponential of its coefficient (e^β): it tells you by how much the odds of the outcome multiply for a one-unit increase in that predictor. The important thing to remember about the odds ratio is that an odds ratio greater than 1 is a positive association (i.e., a higher value of the predictor means group 1 in the outcome), and an odds ratio less than 1 is a negative association (i.e., a higher value of the predictor means group 0 in the outcome).
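A sketch of reading odds ratios off a fitted logistic regression with statsmodels (the data is invented):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=300)

# Invented outcome: the probability of group 1 rises with x
p = 1 / (1 + np.exp(-(0.8 * x - 0.2)))
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
odds_ratios = np.exp(fit.params)   # e^coefficient = odds ratio per one-unit increase
print(odds_ratios)                 # > 1: positive association; < 1: negative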
