Crack Your Next Data Science Interview With 300+ Questions
For the last few weeks we have been working extensively with Data
Scientists to find out the types of questions they are asked in a Data Scientist
interview. Those who are aware of The Data Monk
(www.thedatamonk.com) already know that we have a pool of hundreds of Data
Scientists who have provided us with valuable input on how to get into the
best Data Science companies.
We are publishing this book with the help of their experience and expertise.
For any information or feedback, visit www.thedatamonk.com or mail us
at [email protected]
This does not imply that every company will have all these rounds. In
general there are 4-5 rounds in any recruitment drive.
P.S. – More than 80% of the questions given below were asked in the
recruitment drives of companies like Amazon, BookMyShow, Accenture,
Cognizant, Sapient, Deloitte, OYO Rooms, Flipkart, Myntra, etc.
We have tried to give an ample number of examples to help you get the
vibe of these interviews.
1. Telephonic Information Round
There is very little scope for asking anything other than these questions. The
call will end with a description of the next round and its mode
(Telephonic/Face-to-Face).
Bonus Tip – Do ask the HR about the complete recruitment process and
what you should work on for the interview. They might drop you a sample
assignment. Shot in the dark !!
2. Aptitude Test – Elimination Round
Nowadays, more than 50% of companies want to eliminate a major
chunk of candidates with the aptitude test. We have personally seen people
getting eliminated after scoring more than 70% marks in the test. So be
aware of it.
You can easily find these questions on various websites, but there is a
plethora of questions and you need to know the right type of questions
which are asked in these rounds. Whether it’s an online test or a centre test,
you need good speed to solve as many questions as possible. So
practice these questions before an aptitude round.
Time – 30 minutes
Questions – 25
1.
T is older than E.
C is older than T.
E is older than C.
If the first two statements are true, then the third statement is?
a. True
b. False
c. Uncertain
Ans.
B i.e False as E is the youngest of the three.
2. Two cards are drawn from a pack of 52 cards. What is the
probability that both of the cards are Kings?
Ans.
n(S) = 52C2 = (52*51)/(2*1) = 1326
E = Event of getting 2 kings out of 4
n(E) = 4C2 = 6
P(E) = n(E)/n(S) = 6/1326 = 1/221
Ans.
In the first pattern, 3 is added, in the second,2 is subtracted. So, the answer
is 10.
4. How many 5 letter words can be formed using BIHAR?
Ans.
5! = 120
Ans.
S = {HH,HT,TH,TT}
E = {HH,HT,TH}
P(E) = n(E)/n(S) = ¾
6. Find Cost Price when Selling Price is Rs. 40.60 and profit is 16%.
Ans.
C.P. = Rs.(100/116)*40.60 = Rs. 35
7. A bag contains 4 white, 5 red and 6 blue balls. Three balls are drawn
at random from the bag. The probability that all of them are red is:
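Questions of the "all drawn items come from one group" type can be checked by counting combinations directly. The sketch below is only illustrative: the helper name prob_all_from_group is made up for this example. It evaluates 5C3 / 15C3 for the three red balls, and the same helper reproduces the 4C2 / 52C2 count from the two-kings question.

```python
from fractions import Fraction
from math import comb


def prob_all_from_group(group_size, total, draws):
    """Probability that every one of `draws` items comes from one group."""
    # favourable selections / all possible selections
    return Fraction(comb(group_size, draws), comb(total, draws))


# 5 red balls among 4 + 5 + 6 = 15 balls, 3 drawn
p_all_red = prob_all_from_group(5, 15, 3)

# 4 kings among 52 cards, 2 drawn
p_two_kings = prob_all_from_group(4, 52, 2)

print(p_all_red, p_two_kings)
```

Under these counts the probability that all three balls are red is 10/455, which reduces to 2/91.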
8. A can finish a work in 18 days and B can do the same work in half
the time taken by A. Then, working together, what part of the same
work can they finish in a day?
A's 1 day's work = 1/18
B's 1 day's work = 1/9
(A + B)'s 1 day's work = (1/18) + (1/9) = 1/6
Ans.
1365 = 15C11
10. A takes twice as much time as B or thrice as much time as C to finish a
piece of work. Working together, they can finish the work in 2 days. B
can do the work alone in:
Suppose A, B and C take x, x/2 and x/3 days respectively to finish the work.
Then, (1/x) + (2/x) + (3/x) = ½, i.e. 6/x = ½
x = 12 days, B = x/2 = 12/2 = 6 days
Ans. (b)
12. Pipe A can fill a tank in 20 minutes and Pipe B in 30 minutes, and Pipe C can
empty the same tank in 40 minutes. If all of them work together, find the time
taken to fill the tank.
(a) 17 1/7 mins
(b) 20 mins
(c) 8 mins
(d) none of these
Ans. (a)
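The answer follows from adding the two fill rates and subtracting the drain rate. A minimal sketch of that arithmetic with exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

# Rates are in "tanks per minute"
fill_a = Fraction(1, 20)
fill_b = Fraction(1, 30)
drain_c = Fraction(1, 40)

net_rate = fill_a + fill_b - drain_c   # 7/120 of the tank per minute
time_to_fill = 1 / net_rate            # minutes needed for one full tank

whole_minutes = time_to_fill.numerator // time_to_fill.denominator
remainder = time_to_fill - whole_minutes
print(whole_minutes, remainder)
```

The net rate is 7/120 of the tank per minute, so the tank fills in 120/7 minutes, which is option (a), 17 1/7 minutes.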
13. (51+52+53+………+100) is equal to:
(a) 2525
(b) 2975
(c) 3225
(d) 3775
16. Find the average of all the numbers between 6 and 34 which
are divisible by 5
a. 18
b. 20
c. 24
d. 30
17. A man can row a boat at 10 km/hr in still water and the speed of
the stream is 8 km/hr. What is the time taken to row a distance of 90
km down the stream?
a) 8 hrs
b) 5 hrs
c) 15 hrs
d) 20 hrs
This is one of the most important rounds. You have to get the best feedback
out of this round (be it a telephonic round or a face-to-face round). We have
done extensive research and figured out the areas from which questions are
asked in the interview. The following questions will definitely
help you understand the type of questions asked in this round.
First we will deal with some very basic questions. We will be using one
table and there will be 30 questions on this table. The questions
start with the most basic ones; you can skip these 30 questions if you think
you are good with SQL.
Below is the employee(emp) table
24. List the detail of employees along with the annual salary, order it on
the annual salary
SELECT *, sal*12 as Annual_Income
FROM emp
ORDER BY Annual_Income;
29. Show all the employees who joined on 01-Aug-2018, 04-Aug-2018 or
29-Oct-2018, in descending order of Hire Date
SELECT *
FROM emp
WHERE HireDate IN ('01-Aug-2018','04-Aug-2018','29-Oct-2018')
ORDER BY HireDate DESC;
30. List the employees who joined in 2018
SELECT *
FROM emp
WHERE HireDate BETWEEN '01-Jan-2018' AND '31-Dec-2018';
32. List the employees with name starting with N and containing 5
alphabets
SELECT *
FROM emp
WHERE EName LIKE 'N____';
Or
SELECT *
FROM emp
WHERE EName LIKE 'N%' AND len(EName) = 5;
33. List the employee with the third alphabet in their name as K
SELECT *
FROM emp
WHERE Upper(EName) LIKE '__K%';
34. Show the name of the employees who joined in August month of
any year.
SELECT *
FROM emp
WHERE to_char(HireDate,'Mon') = 'Aug';
35. Show the employee details of those who were hired in the 90s
SELECT *
FROM emp
WHERE to_char(HireDate, 'yy') LIKE '9_';
36. Show the employees who were not hired in the month of October.
SELECT *
FROM emp
WHERE to_char(HireDate,'MON') NOT IN ('OCT');
37. List the total information of the employees along with DName and
Location of people working under 'Accounts'
SELECT *
FROM emp e
INNER JOIN dept d ON (e.DeptNo = d.DeptNo)
WHERE d.DName = 'Accounts';
38. List all the employees with more than 10 years of experience as of
now
SELECT *
FROM emp
WHERE TIMESTAMPDIFF(MONTH, HireDate, SYSDATE()) > 120;
39. List the detail of all the employees whose salary is less than that of
Aman
SELECT *
FROM emp
WHERE sal < (SELECT sal FROM emp WHERE EName = 'Aman');
40. Show the name of those employees who are senior to their own
Manager.
SELECT w.*
FROM emp w, emp m
WHERE w.MGR = m.EmpNo AND w.HireDate < m.HireDate;
Or
SELECT w.*
FROM emp w
INNER JOIN emp m ON w.MGR = m.EmpNo
WHERE w.HireDate < m.HireDate;
42. Show the employees who are senior to Aman and are working in
Delhi or Bangalore
SELECT *
FROM emp e, dept d
WHERE UPPER(d.loc) IN ('DELHI','BANGALORE') AND e.DeptNo = d.DeptNo
AND e.HireDate < (SELECT a.HireDate FROM emp a WHERE a.EName = 'Aman');
43. Show the employees with the same job as Aman or Amit.
SELECT *
FROM emp
WHERE job IN (SELECT job FROM emp WHERE EName IN ('Aman','Amit'));
45. Find the detail of the employee with the minimum pay
SELECT *
FROM emp
WHERE sal = (SELECT MIN(sal) FROM emp);
46. Show the detail of the recently hired employee working in Delhi
Try it yourself
SQL Tricky interview Questions
The above questions were to make sure that you are good with the basics.
The below questions are asked mostly in the interviews
49. How to find Third highest salary in Employee table using self-join?
SELECT * FROM Employee a WHERE 3 = (SELECT COUNT(DISTINCT b.Salary) FROM
Employee b WHERE a.Salary <= b.Salary);
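This approach can be tried out locally with Python's built-in sqlite3 module. The sketch below is an illustration only: the table contents are made-up sample data, and the inner copy of the table is given the alias b so that the comparison a.Salary <= b.Salary is well defined. A row qualifies when exactly three distinct salaries are greater than or equal to its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [
        ("Amit", 90000),     # hypothetical salaries
        ("Bhargav", 80000),
        ("Chirag", 80000),
        ("Dinesh", 70000),
        ("Esha", 60000),
    ],
)

query = """
SELECT a.EmpName, a.Salary
FROM Employee a
WHERE 3 = (SELECT COUNT(DISTINCT b.Salary)
           FROM Employee b
           WHERE a.Salary <= b.Salary)
"""
third_highest = conn.execute(query).fetchall()
print(third_highest)
conn.close()
```

Because the subquery counts distinct salaries, the tie between Bhargav and Chirag occupies a single position, so Dinesh comes out as the third highest in this sample.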
50. How to calculate number of rows in table without using count
function?
SELECT table_name, num_rows
FROM user_tables
WHERE table_name = 'EMPLOYEE';
It assigns a unique number to each row returned by the query, even if the
values being ordered are the same. Sample query:-
SELECT emp.*,
row_number() over (order by salary DESC) Row_Number
from Employee emp;
Even when the salary is the same for Bhargav and Chirag, they have a
different Row_Number. This means that the function row_number just gives
a running number to every row.
56. What is RANK() function?
RANK() function is used to give a rank and not a row number to the data
set. The basic difference between RANK() and ROW_NUMBER is that
Rank will give equal number/rank to the data points with same value. In the
above case, RANK() will give a value of 2 to both Bhargav and Chirag and
thus will rank Dinesh as 4. Similarly, it will give rank 5 to both Esha and
Farhan.
SELECT emp.*,
RANK() over (order by salary DESC) Ranking
from Employee emp;
SELECT emp.*,
NTILE(3) over (order by salary DESC) as GeneratedRank
from Employee emp
This will divide the complete data set into 3 groups from the top. So the
GeneratedRank will be 1 for Amit and Bhargav, 2 for Chirag and Dinesh, and 3
for Esha and Farhan.
58. What is DENSE_RANK() ?
This gives the rank of each row within a result set partition, with no gaps in
the ranking values. If the top 2 employees have
the same salary then they will both get the same rank, i.e. 1, much like the
RANK() function. But the third person will get a rank of 2 in
DENSE_RANK as there is no gap in ranking, whereas the third person will
get a rank of 3 when we use the RANK() function. Syntax below:-
SELECT emp.*,
DENSE_RANK() OVER (order by salary DESC) DenseRank
from Employee emp;
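The difference between ROW_NUMBER(), RANK() and DENSE_RANK() is easiest to see side by side. The sketch below emulates the three numbering schemes in plain Python rather than running SQL. The salaries are made up, chosen only so that the ties match the names used above (Bhargav ties with Chirag, Esha ties with Farhan).

```python
employees = [
    ("Amit", 90000),      # hypothetical salaries
    ("Bhargav", 80000),
    ("Chirag", 80000),
    ("Dinesh", 70000),
    ("Esha", 60000),
    ("Farhan", 60000),
]

# ORDER BY salary DESC (sorted() is stable, so ties keep their input order)
ordered = sorted(employees, key=lambda e: e[1], reverse=True)

rows = []
rank = 0
dense_rank = 0
previous_salary = None
for row_number, (name, salary) in enumerate(ordered, start=1):
    if salary != previous_salary:
        rank = row_number     # RANK jumps to the current row position
        dense_rank += 1       # DENSE_RANK only ever moves up by one
        previous_salary = salary
    rows.append((name, row_number, rank, dense_rank))

for row in rows:
    print(row)
```

Each tuple holds (name, row number, rank, dense rank). Dinesh is row 4 and rank 4, but dense rank 3, which is the "no gaps" behaviour described above.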
SELECT EmpID,EmpName
FROM Employee
where EmpName like '[aeiou]%'
60. Write a query to get employee name starting and ending with
vowels.
SELECT EmpID,EmpName
FROM Employee
where EmpName like '[aeiou]%[aeiou]'
61. What are the different types of statements supported in SQL?
A. Full Outer Join is a combination of Left Outer and Right Outer Join in
SQL
SELECT column_name(s)
FROM table_name
WHERE condition
ORDER BY column_name
OFFSET rows_to_skip ROWS;
There were some output-related questions where a table was given, mostly
on the group by, order by, top, etc. commands
ADVANCED EXCEL
73. What are the ways to create a dynamic range?
Creating a Table
Using OFFSET and COUNTA Functions
74. What is the order of operations that Excel uses while evaluating
formulas?
PEMDAS Rule
Parenthesis
Exponentiation
Multiplication/Division
Addition
Subtraction
79.VLOOKUP Vs INDEX-MATCH
VLOOKUP – Using VLOOKUP, we can retrieve the data from left to right
in the range/Table.
INDEX-MATCH – Using a combination of INDEX and MATCH, we can
retrieve the data from left to right/right to left in a range/table.
80. How would you get the data from different data sources?
Data > Get External Data section > Choose your data source
81. What is the use of Option Explicit in VBA?
Option Explicit will force the user to declare variables. If the user uses
undeclared variables, an error occurs while compiling the code.
90. What are the different types of errors that you can encounter in
Excel?
#N/A
#DIV/0!
#VALUE!
#REF!
#NAME?
#NUM!
91. What are volatile functions in Excel?
Volatile functions recalculate every time any change happens in the
worksheet, so heavy use of them slows down workbook performance.
Reports: Reports are not live; we use historical data to make reports.
Sometimes reports include visuals such as tables, graphs and
charts, text, numbers or anything else.
94. Name some of the file formats in which an Excel file can be saved.
.XLS
.XLSX
.XLSM (Macro-enabled workbook)
.XLSB (Binary format)
.CSV (Comma Separated Values)
95. How will you pass arguments to VBA Function?
In 2 ways we can pass arguments to VBA Functions
ByVal
ByRef
102. What are the ways in which a variable can be declared in the
VBScript language?
Implicit Declaration: Using a variable without declaring it is called
implicit declaration.
Explicit Declaration: Declaring a variable before using it is called explicit
declaration.
103. Name some of the operators available in VBA?
Arithmetic Operators, Comparison Operators, Logical Operators etc.
Ex: If you have taken a personal loan and are able to pay an EMI of
6K instead of 10K, how many months do you need to close your personal
loan?
115. How will you find the number of duplicate values in a range?
There are different ways to find duplicate values in a range.
One of them is to use the COUNTIF function.
117. How would you add a new column to an existing pivot table for
calculations?
Using Calculated Field
60+ (consume less than 2.5 packets per month, 2 packets): 15%
{which equals to (364*0.15*2) million packets per month = 109.2 million
packets per month}
Total approximate consumption = (145.6 + 709.8+109.2) million
packets/month = 964.6 million packets/month
A laptop is a costly product. I am assuming that people buy a laptop only when
they need one. That's why I am going to calculate the potential market for laptops
in India.
Total population of Bangalore = 18Mn ~ 20Mn
In rural areas assume that a new mobile is bought once in 3 years. Hence,
new mobiles bought in the current year: 55 Mn
Urban (30%): 176 Mn
Assume avg. no. of mobiles per person: 1.5
Urban mobile penetration: 265 Mn
Assuming that a new mobile is bought once in 1.5 years, new
mobiles in the current year: 176 Mn
Total new mobiles: 231 Mn
Assuming 3 out of 10 new mobiles are smartphones,
no. of smartphones sold = 70 Mn
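The urban part of this estimate is a short chain of multiplications, so it is easy to re-run with different assumptions. A minimal sketch using the figures stated above (the rural figure of 55 Mn is taken as given; the variable names are illustrative):

```python
# All quantities are in millions (Mn)
urban_users = 176
mobiles_per_person = 1.5
replacement_cycle_years = 1.5
rural_new_mobiles = 55          # taken directly from the estimate above
smartphone_share = 0.3          # 3 out of 10 new mobiles

urban_mobiles = urban_users * mobiles_per_person              # about 265 Mn in the text
urban_new_mobiles = urban_mobiles / replacement_cycle_years
total_new_mobiles = rural_new_mobiles + urban_new_mobiles
smartphones_sold = total_new_mobiles * smartphone_share

print(urban_mobiles, urban_new_mobiles, total_new_mobiles, round(smartphones_sold, 1))
```

The exact product is 69.3 Mn, which the estimate rounds to roughly 70 Mn smartphones a year.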
123. What is the total number of people who get a new job in India in a
year?
Observations:
Note:
Migrants working in India are negligible
Due to urbanization, very few go for work without completing their 10th
grade
Increased feminism has a significant effect on the estimates
OYO Rooms Case Study
Working class people: let's assume half are married and half remain
unmarried. So married -> 6 Mn and unmarried -> 6 Mn
Married couples:-
Number of married couples = 6 Mn/2 -> 3 Mn
I am assuming 10% belong to the rich class and prefer luxury cars and 20%
cannot afford a car. The remaining 70% have one car each.
70% of 3 Mn = 2.1 Mn
These cars are distributed equally among the above-mentioned 50 car models
across these 2.1 Mn couples. So the number of Swift cars right now is 2.1 Mn / 50 =
0.042 Mn. I am assuming the Swift comes in 10 colors. Hence the number of
red Swift cars among married couples is 0.0042 Mn -> 4,200
Unmarried people:-
Out of 6 Mn unmarried people, only 10% can afford mid-range non-luxury
cars. Hence no. of cars = 6 lakh. These are again divided into 50 models as
above and each model has 10 colors. So the number of red colored Swift cars
among unmarried people = 6 lakh / 500 -> 1,200
Senior citizens
Out of 2 Mn families (4 Mn people), 20% i.e. 0.4 Mn families own a car.
Again, as above, these cars are divided into 50 models with each model
having 10 colors. So 4 lakh/500 -> 800
Total number of red colored Swift cars in Delhi = 4,200 + 1,200 + 800
-> 6,200
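Since the whole estimate is a chain of percentages and divisions, it is worth carrying the arithmetic through exactly. The sketch below uses only the assumptions stated above (50 models, 10 colors, the stated ownership percentages); the variable names are illustrative.

```python
MODELS = 50
COLORS = 10

# Married couples: 6 Mn people -> 3 Mn couples, 70% own one car each
married_couples = 6_000_000 // 2
married_cars = married_couples * 70 // 100
red_swift_married = married_cars // MODELS // COLORS

# Unmarried people: 10% of 6 Mn can afford a mid-range car
unmarried_cars = 6_000_000 * 10 // 100
red_swift_unmarried = unmarried_cars // (MODELS * COLORS)

# Senior citizens: 20% of 2 Mn families own a car
senior_cars = 2_000_000 * 20 // 100
red_swift_senior = senior_cars // (MODELS * COLORS)

total_red_swift = red_swift_married + red_swift_unmarried + red_swift_senior
print(red_swift_married, red_swift_unmarried, red_swift_senior, total_red_swift)
```

Carried through exactly, these inputs give 4,200, 1,200 and 800 cars for the three segments, about 6,200 red Swift cars in total.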
130. The client of our company is The Minnesota Mining
Manufacturing Company of United States whose main office is in
Minnesota, United States. The company manufactures products such as
car décor products, medical products, adhesives, electronic products
and dental products etc. On a global level, the company is a thriving
one with its employee population of over 84000 and product types of
over 55000 and business in over 60 countries. One of their major
investments is in Brazil, which is the manufacturing of a particular
kind of steel that is only produced by two other companies in Brazil.
Throughout the world, Steel has an amazing market capture and
increasing demand. Now, the company has hired BCG to frame a plan
for the progress of this business only after acquiring a proper
knowledge of the market trend. What would you do about it?
Possible answer:
The candidate would begin by discussing the market dynamics in Brazil as
well as globally on which he/she has to base the suggestion. Furthermore,
an idea is to be framed up about the cost, market, value, customers,
transportation facility and price if the steel is to be exported. Also, Brazil
has some taxes on foreign goods export which would only add up to the
price. Since the local market is more profitable than the international trade,
it is advisable to try out the products first in the local market of Brazil since
there is a chance of price war.
R Language
So, this will return the name vector 4 times. First it prints the name and then
increase the temp to 7 and so on.
136. Define while.
In the while loop the condition is tested and the control goes into the body
only when the condition is true
Example
Name_of_function<- function(argument_1,argument_2,..)
{
function body
}
An argument is a placeholder: whenever a function is invoked, a value is
passed to the argument. Arguments are optional.
139. What is the use of sort() function? How to use the function to sort
in descending order?
Ans.) Elements in a vector can be sorted using the function sort(). To sort in descending order, pass decreasing = TRUE, e.g. sort(x, decreasing = TRUE).
Example.
b <- 4
f <- function(a)
{
b <- 3
b^3 + g(a)
}
g <- function(a)
{
a*b
}
Ans.) The global variable b has a value 4. The function f has an argument 2
and the function’s body has the local variable b with the value 3. So
function f(2) will return 3^3 + g(2) and g(2) will give the value 2*4 = 8
where 4 is the value of b.
Thus, the answer is 35
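The same lexical-scoping behaviour can be illustrated in Python. This is a rough analogue of the R code above, not a statement about R semantics in general: g looks up b where g was defined (the global scope), not in the function that called it.

```python
b = 4  # global variable


def g(a):
    # b is not defined inside g, so it resolves to the global b (4),
    # regardless of the local b in the caller f
    return a * b


def f(a):
    b = 3  # local to f; invisible to g
    return b ** 3 + g(a)


result = f(2)   # 3**3 + 2*4
print(result)
```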
142. How to set and get the working directory in R ?
Ans.)setwd() and getwd() functions are used to set a working directory and
get the working directory for R.
setwd() is used to direct R to perform any action in that location and to
directly import objects from there itself.
getwd() is used to see which is the current working directory of R
143. Get all the data of the person having maximum salary.
Ans.)
max_salary_person<- subset(data, salary == max(salary))
print(max_salary_person)
144. Now create an output file which will have data of all the people
who joined TCS in 2016 with salary more than 300000
Ans.)
*outer join
merge(x=df1, y=df2, by = "id", all = TRUE)
This all = TRUE will give you the outer join, so the new data set will have
all the value from both the data frame merged on the id
147. How to ideally use the read.csv() function?
Ans.) You might think it’s very easy to read a csv file by just putting
its name inside the read.csv() function. But in most cases we also
need to pass some extra arguments in order to get things right and make them
less frustrating for us.
Use the syntax below.
You can use other functions like max, min, sum, etc.
149. What is sapply() function in R?
Ans.)sapply() function is used when you want to apply a function to each
element of a list in turn, but you want a vector back, rather than a list.
Vector is useful sometimes because it will get you a set of values and you
can easily perform an operation on it.
Example.
x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
 a  b  c
 1  3 91
Where
x is the data set whose values are the horizontal coordinates
y is the data set whose values are the vertical coordinates
main is the title of the graph
xlab and ylab are the labels on the horizontal and vertical axes
xlim and ylim are the limits of values of x and y used in the plotting
axes indicates whether both axes should be drawn on the plot
plot(x = input$wt, y = input$mpg,
xlab = "Weight",
ylab = "Mileage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Mileage"
)
151. How to write a countdown function in R?
Ans.)
countdown <- function(time)
{
  print(time)
  while (time != 0)
  {
    Sys.sleep(1)
    time <- time - 1
    print(time)
  }
}
countdown(5)
[1] 5
[1] 4
[1] 3
[1] 2
[1] 1
[1] 0
if salary > 20000:
    print("Good Salary")
elif salary < 20000:
    print("Average Salary")
else:
    print("Salary is 20000")
154. WAP to create a dictionary and then iterate over it and print
lucky_number = {'Amit': 4, 'Rahul': 6, 'Nihar': 8}
for name, number in lucky_number.items():
    print(name + ' prefers ' + str(number))
lucky_number = {'Amit': 4, 'Rahul': 6, 'Nihar': 8}
for name in lucky_number.keys():
    print(name)
156. How to read a file and store the lines in your variable.
filename = 'abc.txt'
with open(filename) as file_object:
    lines = file_object.readlines()
for line in lines:
    print(line)
try:
    inp = int(inp)
except ValueError:
    print("Sorry, please try again later")
else:
    print("That's a beautiful age")
158. Try the following operations using List
TDM = ['The', 'Data', 'Monk']
a. Print the last object of the list
print(TDM[-1])
b. Change the last element to Monkey
TDM[-1] = 'Monkey'
c. Remove Monkey from the list
del TDM[-1]
d. Get Monk back to the list
TDM.append('Monk')
159. Print all the Prime numbers less than 20
i = 2
while i < 20:
    j = 2
    while j <= i / j:
        if not i % j:
            break
        j = j + 1
    if j > i / j:
        print(i, "is a prime number")
    i = i + 1
160. Write a function to print the square of all numbers from 0 to 11
sq = [x**2 for x in range(12)]
print(sq)
list_example = ['Amit','Sumit','Rahul']
print(list_example)
list_example[1] = 'Kamal'
print(list_example)
example = ['Cricket', 'Football', 'TT']
game_name(example)
164.When you don’t know how many arguments will be passed to a
function, then you need to pass a variable number of arguments. Show
by an example.
make_pizza('small', 'pepperoni')
make_pizza('large', 'bacon bits', 'pineapple')
make_pizza('medium', 'mushrooms', 'peppers', 'onions', 'extra cheese')
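The calls above assume a function that takes one required argument followed by any number of extra ones. A minimal sketch of such a function using Python's *args syntax is given below. Only the calls appear in the text, so the function body and the wording of the returned string are assumptions made for illustration.

```python
def make_pizza(size, *toppings):
    """Return a one-line summary of the pizza being made."""
    # *toppings collects every extra positional argument into a tuple,
    # so the caller can pass one topping, four toppings, or none at all
    return "Making a " + size + " pizza with: " + ", ".join(toppings)


print(make_pizza('small', 'pepperoni'))
print(make_pizza('large', 'bacon bits', 'pineapple'))
print(make_pizza('medium', 'mushrooms', 'peppers', 'onions', 'extra cheese'))
```

The first parameter is matched positionally as usual; everything after it lands in toppings, which is why the three calls with different argument counts all work.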
Splitting a dataset into train and test sets is one of the initial stages of most
machine learning workflows. Following is how you can split a dataset in
Python:-
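A minimal sketch of a train/test split using only the standard library. The function name split_train_test, the 80/20 ratio and the fixed seed are assumptions made for this illustration; in practice many people use scikit-learn's train_test_split for the same job.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)                   # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))                      # stand-in for real records
train, test = split_train_test(data, test_fraction=0.2)
print(len(train), len(test))
```

Shuffling before slicing matters: if the rows are ordered (for example by date or by class), taking the first 20% as the test set would give an unrepresentative sample.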
str='TheDataMonk'
print(str*3)
TheDataMonkTheDataMonkTheDataMonk
167.Plot a sin graph using line plot
import matplotlib.pyplot as plt
labels = 'Sachin', 'Dhoni', 'Kohli', 'Dravid'
size = [100, 25, 70, 50]
colors = ['pink', 'blue', 'red', 'orange']
explode = (0.1, 0, 0, 0)
plt.pie(size, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Project discussion - 1
I had this project on forecasting the number of tickets for our client. So, the
questions were mostly with respect to predictive modeling.
180. What were the problems that you faced in the project?
- There were different levels at which we were predicting the volume of
tickets. For the topmost level we were using Linear Regression and for the
lower levels we were using ARIMA and ARIMAX. The main pain point
was the normalization of the numbers, because the numbers predicted by Linear
Regression need not match the sum of all the sub-level predictions
from ARIMA and ARIMAX. So, we normalized the ARIMA and
ARIMAX predictions to the Linear Regression prediction.
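The answer does not say how the normalization was done. One common and simple option is proportional scaling: multiply every sub-level forecast by the same factor so that they add up to the top-level number. The sketch below assumes that approach; the function name, the queue names and all the numbers are hypothetical.

```python
def reconcile_to_total(sub_forecasts, top_forecast):
    """Scale sub-level forecasts proportionally so they sum to top_forecast."""
    sub_total = sum(sub_forecasts.values())
    scale = top_forecast / sub_total
    return {name: value * scale for name, value in sub_forecasts.items()}


# Hypothetical sub-level (e.g. ARIMA/ARIMAX) forecasts and a top-level forecast
sub_level = {"queue_a": 400.0, "queue_b": 350.0, "queue_c": 200.0}
top_level = 1000.0

adjusted = reconcile_to_total(sub_level, top_level)
print(adjusted)
```

Proportional scaling keeps each sub-level's share of the total unchanged, which is why it is a natural first choice when the top-level model is the one being trusted.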
181. What could have made the result better?
- We could have done more feature engineering for the Linear
Regression. So, there is scope to better the already 96% accuracy.
There were a few more questions on the basics of Statistics. You can easily
find the answers to these questions on the internet; the reason behind not
answering these 4 questions is that Amazon does not allow us to share
knowledge which is easily available on the internet:-
182. What is normal distribution?
183. How to calculate correlation?
184. Formula of Variance and Standard Deviation.
185. What is chi-square test?
Miscellaneous Questions
In the above sections we dealt with all the possible domains from which an
interviewer can ask you questions. The Miscellaneous Questions section
has mixed questions so that you can switch between sections quickly.
Solve or learn each concept.
186. How would you construct a feed to show relevant content for a site
that involves user interactions with items?
We can do so by building a recommendation engine. The easiest thing we can
do is to show content that is popular with other users, which is still a valid
strategy if, for example, the content is news articles. To be more accurate,
we can build content-based filtering or collaborative filtering. If there’s
enough user usage data, we can try collaborative filtering and recommend
content that other similar users have consumed. If there isn’t, we can
recommend similar items based on vectorization of items (content-based
filtering).
187. How would you suggest to a franchise where to open a new store?
Build a master dataset with local demographic information available for
each location.
-local income levels
-proximity to traffic
-weather
-population density
-proximity to other businesses
-a reference dataset on local, regional, and national macroeconomic
conditions (e.g. unemployment, inflation, prime interest rate, etc.)
-Any data on the local franchise owner-operators, to the degree the manager
-Identify a set of KPIs acceptable to the management that had requested the
analysis concerning the most desirable factors surrounding a franchise.
Quarterly operating profit, ROI, EVA, pay-down rate, etc.
-Run econometric models to understand the relative significance of each
variable
-Run machine learning algorithms to predict the performance of each
location candidate
188. You’re Uber and you want to design a heat map to recommend to
drivers where to wait for a passenger. How would you approach this?
-Based on the past pickup location of passengers around the same time of
the day, day of the week (month, year), construct a travel map
-Based on the number of past pickups
-Account for periodicity (seasonal, monthly, weekly, daily, hourly)
-Special events (concerts, festivals, etc.) from tweets
a. Find out the place where people have mostly searched for 5 or 7 star
hotels
b. Find the place where the average annual income is high, maybe
Bangalore, Pune, Delhi, Hyderabad, etc.
c. Look for that place which is known for tourism as it will attract foreign
customers
d. Look for that area which has good facilities around like popular
restaurants, pubs, malls, etc.
e. Look for that city where there are all the necessary facilities like airport
near the city, railway station, etc.
f. Look for that city where you can get good service from third party
vendors for basic services like laundry, service employees, security service,
etc.
190. Give an example of Normal Distribution from daily life.
Height of all the employees on this floor or in this office
191. How do you think TVF makes a profit? Was moving to its own
website advantageous to TVF?
Approach
Honestly, I did not expect such a topic in a case study. I took some 4-5
minutes to shape my idea. Following are the points which we discussed:-
1. TVF has some 10 million subscribers on YouTube, and it releases its videos
on YouTube a week after their original release on the TVF website. These
videos give it a good amount of money to keep the show running.
2. The main reason for TVF to move to its own website was to create an
ecosystem comparable to Netflix so that people buy a subscription to watch
the show.
3. Netflix charges some $9 for a subscription. TVF could be planning to
launch its series exclusively on any of these and get some part of the
subscription. Even a dollar per person can get them close to 10 million
dollars.
4. The estimated revenue of a YouTube channel with 10 million subscribers
is ~500,000 dollars per year.
5. Apart from these, a major chunk of the production cost is taken care of by
the sponsor of the show. For example Tiago in Tripling, Kingfisher in
Pitchers, etc. So the production cost is next to zero for the episodes.
6. TVF is also going for its own website and raising funding to acquire
customers and drive them to its website.
It’s hard to get a $10 subscription, but even a basic subscription or tie-up
with some other production can get them a handful of money.
192. The mean of a distribution is 20 and the standard deviation is 5.
What is the value of the coefficient of variation?
Coefficient of Variation
= (Standard Deviation/Mean)*100
= (5/20)*100
= 25%
193. When the mean is less than mode and median, then what type of
distribution is it?
Negatively Skewed
Zero
Continuous Value
Feedback of 100 customers about your website, rest all are discrete
Random error
216. Find the speed of the train, if a train 142 m long passes a pole in 6
seconds.
Speed = (142/6) m/sec ≈ 23.67 m/sec
= (23.67 * 18/5) km/hr
≈ 85.2 km/hr
220. What is an Imbalanced Data Set and how to handle it? Name a few
examples.
Fraud detection
Disease screening
An Imbalanced Data Set means that the population of one class is far
larger than that of the other
(Eg: Non-Fraud – 99% and Fraud – 1%)
An imbalanced dataset can be handled by oversampling, undersampling,
or a penalized Machine Learning algorithm.
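Of the options listed, random oversampling is the simplest to sketch: rows of the rare class are duplicated at random until every class is as large as the biggest one. The example below is a bare-bones illustration using only the standard library. The function name random_oversample and the toy data are assumptions, and real projects usually reach for a dedicated library instead.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # candidate rows of this class to draw duplicates from
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 99 legitimate transactions and a single fraudulent one
rows = list(range(100))
labels = ["non-fraud"] * 99 + ["fraud"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only. Duplicating rows before splitting would leak copies of the same record into the test set.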
221. If you are dealing with 10M Data, then will you go for Machine
learning (or) Deep learning Algorithm?
Machine learning algorithms suit small data well and might take a huge
amount of time to train on large data,
whereas Deep learning algorithms take less time to train on large data with
the help of GPUs (parallel processing).
225. How to select the important features in the given data set?
In Logistic Regression, we can use step() which gives AIC score of set of
features
In Decision Tree, We can use information gain(which internally uses
entropy)
In Random Forest, We can use varImpPlot
226. When does multicollinearity problem occur and how to handle it?
It exists when 2 or more predictors are highly correlated with each other.
Example: if the data set has both the grades of 2nd PUC and the marks of 2nd
PUC, then both capture the same trend, which might
hamper speed and training time. So we need to check whether multicollinearity
exists by using VIF (Variance Inflation Factor).
Note: if the Variance Inflation Factor is more than 4, then a multicollinearity
problem exists.
227. What is Variance Inflation Factor (VIF)?
It measures how much the variance of the estimated regression coefficients is
inflated as compared to when the predictor variables are not linearly related.
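For a predictor j, VIF is 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. With only two predictors that R² is just their squared correlation, which makes a dependency-free illustration possible. The sketch below covers that two-predictor special case only; the data and function names are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov ** 2 / (ss_x * ss_y)


def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model has exactly two predictors."""
    return 1.0 / (1.0 - r_squared(x1, x2))


marks = [1, 2, 3, 4, 5]
loosely_related = [2, 1, 4, 3, 5]    # correlation 0.8 with marks
nearly_duplicate = [1, 2, 3, 4, 6]   # almost the same information as marks

vif_weak = vif_two_predictors(marks, loosely_related)
vif_strong = vif_two_predictors(marks, nearly_duplicate)
print(round(vif_weak, 2), round(vif_strong, 2))
```

The loosely related pair stays under the threshold of 4 mentioned above, while the near-duplicate pair lands far above it.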
232. Give some example for false positive, false negative, true positive,
true negative
False Positive – A cancer screening test comes back positive, but you don’t
have cancer
False Negative – A cancer screening test comes back negative, but you have
cancer
True Positive – A Cancer Screening test comes back positive, and you have
cancer
True Negative – A Cancer Screening test comes back negative, and you
don’t have cancer
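The four outcomes can be counted directly from a list of actual labels and a list of predicted labels. A small sketch with made-up screening results, where 1 means "has the condition" or "test positive" and 0 means the opposite:

```python
# Made-up example data: 1 = positive, 0 = negative
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]

pairs = list(zip(actual, predicted))
true_positive = sum(1 for a, p in pairs if a == 1 and p == 1)
false_positive = sum(1 for a, p in pairs if a == 0 and p == 1)
false_negative = sum(1 for a, p in pairs if a == 1 and p == 0)
true_negative = sum(1 for a, p in pairs if a == 0 and p == 0)

print(true_positive, false_positive, false_negative, true_negative)
```

Every observation falls into exactly one of the four buckets, so the four counts always add up to the number of observations.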
Mean imputation
Median Imputation
MICE
missForest
Amelia
AIC is the measure of fit which penalizes model for the number of model
coefficients. Therefore, we always prefer model with minimum AIC value.
237. Suppose you have 10 samples, where 8 are positive and 2 are
negative, how to calculate Entropy (important to know)
E(S) = -(8/10)log2(8/10) - (2/10)log2(2/10) ≈ 0.722
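Entropy is the negative sum of p * log2(p) over the classes, so every term carries a minus sign. A small sketch that evaluates it for the 8 positive / 2 negative split (the helper name entropy is illustrative):

```python
import math


def entropy(class_counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(class_counts)
    # classes with zero count contribute nothing and are skipped
    return -sum(
        (count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0
    )


e_8_2 = entropy([8, 2])
e_5_5 = entropy([5, 5])
print(round(e_8_2, 3), e_5_5)
```

The 8/2 split comes out at roughly 0.72 bits, below the maximum of 1 bit that a perfectly balanced 5/5 split would give.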
Ensemble learning is used when you build component classifiers that are
more accurate and independent from each other.
243. When will you use SVM and when to use Random Forest?
SVM can be used if the data is outlier free, whereas Random Forest can be
used even if it has outliers.
SVM suits best for Text Classification Model and Random Forest suits for
Binomial/Multinomial Classification Problem.
Random Forest takes care of the overfitting problem by averaging many
trees built on bootstrap samples (bagging).
246. If you are given a use case – 'Predict the house price range in
the coming years', which algorithm would you choose?
Linear Regression
27
The order of precedence is ** then *. Thus 3**2 = 9 and then 9*3 = 27.
tup = ('the', 'Data', 'Monk')
list_example = list(tup)
print(list_example)
['the', 'Data', 'Monk']
249. Though you are comfortable with strings, do try to predict the
output of the following basic operations on a string.
str="TheDataMonk"
print (str)
print (str*2)
print (str[2:5])
print (str[3:])
print (str + ".com")
print ("www."+str+".com")
TheDataMonk
TheDataMonkTheDataMonk
eDa
DataMonk
TheDataMonk.com
www.TheDataMonk.com
250. Calculate IQR
Step 2: Subtract the mean from each value. This gives you the differences:
$1550 – $1158.33 = $391.67
$1700 – $1158.33 = $541.67
$900 – $1158.33 = -$258.33
$850 – $1158.33 = -$308.33
$1000 – $1158.33 = -$158.33
$950 – $1158.33 = -$208.33
Step 4: Add up all of the squares you found in Step 3 and divide by 5
(which is 6 – 1):
(153405.3889 + 293406.3889 + 66734.3889 + 95067.3889 + 25068.3889 +
43401.3889) / 5 = 135416.66668
Step 5: Find the square root of the number you found in Step 4 (the
variance):
√135416.66668 = 367.99
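The hand calculation can be cross-checked with Python's statistics module, whose variance and stdev functions divide by n - 1 just as Step 4 does. A minimal sketch using the six values from Step 2:

```python
import statistics

values = [1550, 1700, 900, 850, 1000, 950]

mean = statistics.mean(values)
sample_variance = statistics.variance(values)   # sum of squares / (n - 1)
sample_std = statistics.stdev(values)           # square root of the variance

print(round(mean, 2), round(sample_variance, 2), round(sample_std, 2))
```

Dividing by n - 1 rather than n gives the sample standard deviation. statistics.pstdev would divide by n and return a slightly smaller number.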
A. Mean is affected badly by outliers. It’s said that if a billionaire walks
into a cheap bar, the average patron becomes a millionaire.
253. Which of the following is not possible to compute for the following
data set?
Data set – 23,43,223,54,64,0,1,2
a. Median
b. Mean
c. Mode
d. Standard Deviation
e. Geometric Mean
f. Harmonic Mean
A. Whenever there is a data point 0, then you cannot compute the Harmonic
mean
A. Bi-modal
A. Descriptive Statistics
258. Which measure of average can have more than one value?
a. Mode
b. Mean
c. Median
d. Harmonic Mean
A. A series can have more than one values for mode. Example –
2,3,4,5,5,5,4,4,6,7,7 Here both 5 and 4 are mode
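Python's statistics.multimode (available from Python 3.8) returns every value that shares the highest frequency, which makes this easy to check on the series above:

```python
import statistics
from collections import Counter

series = [2, 3, 4, 5, 5, 5, 4, 4, 6, 7, 7]

modes = statistics.multimode(series)   # all values tied for most frequent
frequencies = Counter(series)

print(sorted(modes), frequencies[4], frequencies[5])
```

Both 4 and 5 occur three times, so the series is bimodal, whereas the mean and the median are always single numbers.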
262. The sum of the percent frequencies for all classes will always equal
A. one
B. the number of classes
C. the number of items in the study
D. 100
265. Display the department numbers with more than three employees in
each dept.
SELECT deptno, count(deptno)
FROM emp
GROUP BY deptno
HAVING count(*)>3;
266. In a single toss of 2 fair (evenly-weighted) six-sided dice, find the
probability that their sum will be at most 9.
When you roll two dice, you have 6 possibilities for each die (6 sides). This
gives 36 total combinations.
Let's list the combinations that result in sums greater than 9.
(4,6) (6,4) (5,5) (6,5) (5,6) (6,6)
That's 6 out of the 36 total possibilities. Therefore, the remaining 30 of the 36
possibilities fulfill the less than or equal to 9 requirement. Simplifying by a
factor of 6, that's a 5/6 chance.
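With only 36 outcomes, the count can also be verified by brute force. The sketch below enumerates every pair of faces and keeps those whose sum is at most 9:

```python
from fractions import Fraction
from itertools import product

# Every ordered outcome of two six-sided dice
rolls = list(product(range(1, 7), repeat=2))

at_most_nine = [roll for roll in rolls if sum(roll) <= 9]
more_than_nine = [roll for roll in rolls if sum(roll) > 9]

probability = Fraction(len(at_most_nine), len(rolls))
print(len(rolls), len(more_than_nine), probability)
```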
267. Let’s say you have a very tall father. On average, what would you
expect the height of his son to be? Taller, equal, or shorter? What if you
had a very short father?
269. What is R2? What are some other metrics that could be better
than R2 and why?
R2 is a goodness-of-fit measure: variance explained by the regression / total
variance.
Remember, the more predictors you add, the higher R^2 becomes. Hence
use adjusted R^2, which adjusts for the degrees of freedom, or train error
metrics.
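Adjusted R^2 is computed from plain R^2, the number of observations n and the number of predictors p as 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below shows how the same R^2 is penalized more heavily as predictors are added; the input numbers are made up for illustration.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


plain_r2 = 0.90   # hypothetical fit quality
adj_few = adjusted_r2(plain_r2, n_samples=50, n_predictors=2)
adj_many = adjusted_r2(plain_r2, n_samples=50, n_predictors=20)

print(round(adj_few, 3), round(adj_many, 3))
```

If adding a predictor does not raise plain R^2 enough to offset the lost degree of freedom, adjusted R^2 goes down, which is what makes it a fairer basis for comparing models of different sizes.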
271. Is more data always better?
Statistically, it depends on the quality of your data. For example, if your
data is biased, just getting more data won’t help. It also depends on your model.
If your model suffers from high bias, getting more data won’t improve your
test results beyond a point. You’d need to add more features, etc.
Practically, there’s also a tradeoff between having more data and the
additional storage, computational power and memory it requires. Hence, always
think about the cost of having more data.
272. You have several variables that are positively correlated with your
response, and you think combining all of the variables could give you a
good prediction of your response. However, you see that in the multiple
linear regression, one of the weights on the predictors is negative. What
could be the issue?
Multi collinearity refers to a situation in which two or more explanatory
variables in a multiple regression model are highly linearly related.
Leave the model as is, despite multi collinearity. The presence of multi
collinearity doesn’t affect the efficiency of extrapolating the fitted model to
new data provided that the predictor variables follow the same pattern of
multi collinearity in the new data as in the data on which the regression
model is based.
principal component regression
273. What are the tests which are performed on data sets?
There are many tests, few are:-
-A/B Test
-Student’s T Test
-Chi Square Test
-Fisher’s Exact Test
-Mann-Whitney Test
Human Resource Round
These questions are asked only at the end of the interview, when most
probably you have already made it to the final list. But whenever you have
time, give these questions a shot.