
Statistics Interview Questions & Answers for Data Scientists

Questions
Q1: Explain the central limit theorem and give examples of when you can use it in a real-world problem?
Q2: Briefly explain the A/B testing and its application? What are some common pitfalls encountered in A/B testing?
Q3: Describe briefly the hypothesis testing and p-value in layman’s term? And give a practical application for them ?
Q4: Given a left-skewed distribution that has a median of 60, what conclusions can we draw about the mean and the mode of the data?
Q5: What is the meaning of selection bias and how to avoid it?
Q6: Explain the long-tailed distribution and provide three examples of relevant phenomena that have long tails. Why are they important in
classification and regression problems?
Q7: What is the meaning of KPI in statistics
Q8: Say you flip a coin 10 times and observe only one head. What would be the null hypothesis and p-value for testing whether the coin is
fair or not?
Q9: You are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when doing this?
Q10: What general conditions must be satisfied for the central limit theorem to hold?
Q11: What is skewness discuss two methods to measure it?
Q12: You sample from a uniform distribution [0, d] n times. What is your best estimate of d?
Q13: Discuss the Chi-square, ANOVA, and t-test
Q14: Say you have two subsets of a dataset for which you know their means and standard deviations. How do you calculate the blended mean
and standard deviation of the total dataset? Can you extend it to K subsets?
Q15: What is the relationship between the significance level and the confidence level in Statistics?
Q16: What is the Law of Large Numbers in statistics and how it can be used in data science ?
Q17: What is the difference between a confidence interval and a prediction interval, and how do you calculate them?
Q18: What are the differences between the z-test and t-test?
Q19: When to use a z-test Vs a t-test?
Q20: Given a specific dataset, how do you calculate t-statistic or z-statistics?

Questions & Answers


Q1: Explain the central limit theorem and give examples of when you can use it in a real-world problem.

Answers:

The central limit theorem states that if any random variable, regardless of its distribution, is sampled a large enough number of times, the sample mean will be approximately normally distributed. This allows us to study the properties of any statistical distribution as long as there is a large enough sample size.

Important remark from Adrian Olszewski: ⚠ We can rely on the CLT with means (because it applies to any unbiased statistic) only if expressing the data in this way makes sense, and it makes sense ONLY in the case of unimodal and symmetric data coming from additive processes. So forget skewed, multi-modal data with mixtures of distributions, data coming from multiplicative processes, and non-trivial mean-variance relationships. Those are the cases where the arithmetic mean is meaningless. Thus, using the CLT or e.g. the bootstrap there will give a valid answer to an invalid question.

⚠ The distribution of the means isn't enough. Every kind of inference requires the entire test statistic to follow a certain distribution, and the test statistic also involves an estimate of the variance. Never assume that a sample size sufficient for the means will suffice for the entire test statistic. See the excerpt from Rand Wilcox. In particular, never believe in magic numbers like N = 30.

⚠ Think first about how to sensibly describe your data, state the hypothesis of interest, and then apply a valid method.

Examples of real-world usage of CLT:

1. The CLT can be used at any company with a large amount of data. Consider a company like Uber/Lyft that wants to test whether adding a new feature will increase booked rides, using hypothesis testing. If we have a large number of individual rides X, each of which is a Bernoulli random variable (since a rider either books a ride or not), we can estimate the statistical properties of the total number of bookings. Understanding and estimating these statistical properties plays a significant role in applying hypothesis testing to your data and knowing whether adding a new feature will increase the number of booked rides or not.

2. Manufacturing plants often use the central limit theorem to estimate how many products produced by the plant are defective.
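To make the ride-booking example concrete, here is a minimal simulation sketch (the booking probability and sample sizes are hypothetical, not from the original): it draws many samples of Bernoulli ride outcomes and shows that the distribution of sample means is approximately normal, as the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(42)
p_book = 0.3        # hypothetical probability that a rider books
n_rides = 1_000     # rides per sample
n_samples = 5_000   # number of repeated samples

# Each row is one sample of Bernoulli ride outcomes; take the mean of each row.
sample_means = rng.binomial(1, p_book, size=(n_samples, n_rides)).mean(axis=1)

# By the CLT, the sample means should be approximately N(p, p*(1-p)/n).
print(sample_means.mean())   # ~0.3
print(sample_means.std())    # ~sqrt(0.3*0.7/1000) ~ 0.0145
```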

Q2: Briefly explain A/B testing and its applications. What are some common pitfalls encountered in A/B testing?
A/B testing helps us determine whether a change in something causes a statistically significant change in performance. In other words, you aim to statistically estimate the impact of a given change within your digital product (for example). You measure success and counter metrics on at least one treatment group vs. one control group (there can be more than one experiment group for multivariate tests).

Applications:

1. Consider the example of a general store that has sold bread packets but not butter for a year. If we want to check whether bread sales depend on butter, suppose the store starts selling butter and sales for the next year are observed. We can then determine whether selling butter significantly increases, decreases, or does not affect the sale of bread.

2. While developing the landing page of a website, you create two different versions of the page. You define a success criterion, e.g., conversion rate. Then you define your hypotheses. Null hypothesis (H0): there is no difference between the performance of the two versions. Alternative hypothesis (H1): version A will perform better than version B.

NOTE: You will have to split your traffic randomly (to avoid sample bias) between the two versions. The split doesn't have to be symmetric; you just need to set a minimum sample size for each version to avoid an undersampled group.
Now, if version A gives better results than version B, we still have to statistically prove that the results derived from our sample represent the entire population. One of the most common tests used to do so is the two-sample t-test, where we use the significance level (alpha) and the p-value to decide which hypothesis is supported. If p-value < alpha, H0 is rejected. A sketch of this comparison is shown below.
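As a rough sketch of that comparison (the conversion counts below are made up), a two-proportion z-test can be run as follows:

```python
import numpy as np
from scipy import stats

# Hypothetical results: conversions out of visitors for versions A and B
conv_a, n_a = 620, 10_000
conv_b, n_b = 530, 10_000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                    # pooled rate under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
z = (p_a - p_b) / se
p_value = 2 * stats.norm.sf(abs(z))                         # two-sided p-value

print(z, p_value)   # reject H0 if p_value < alpha (e.g., 0.05)
```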

Common pitfalls:

1. Wrong success metrics that are inadequate for the business problem
2. Lack of a counter metric: you might add friction to the product alongside the positive impact
3. Sample mismatch: heterogeneous control and treatment groups, unequal variances
4. Underpowered test: too small a sample, or an experiment that runs for too short a time
5. Not accounting for network effects (these introduce bias into the measurement)

Q3: Briefly describe hypothesis testing and the p-value in layman's terms, and give a practical application for them.
In Layman's terms:

A hypothesis test is where you have a current state (the null hypothesis) and an alternative state (the alternative hypothesis). You assess the results of both states and observe some difference. You want to decide whether the difference is due to the alternative approach or not.

You use the p-value to decide this, where the p-value is the probability of observing results at least as extreme as those obtained by the alternative approach if you keep using the existing approach, i.e., the probability of finding such a result in the distribution of results you would get from the existing approach.

The rule of thumb is to reject the null hypothesis if the p-value < 0.05, which means that the probability of getting these results under the existing approach is less than 5%. This threshold changes according to the task and domain.

To explain hypothesis testing in layman's terms with an example, suppose we have two drugs, A and B, and we want to determine whether these two drugs are the same or different. This idea of trying to determine whether the drugs are the same or different is called hypothesis testing. The null hypothesis is that the drugs are the same, and the p-value helps us decide whether we should reject the null hypothesis or not.

p-values are numbers between 0 and 1, and in this particular case, they help us quantify how confident we should be in concluding that drug A is different from drug B. The closer the p-value is to 0, the more confident we are that drugs A and B are different.
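A minimal sketch of such a comparison, assuming hypothetical response measurements for drugs A and B:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical patient responses to the two drugs
drug_a = rng.normal(loc=10.0, scale=2.0, size=50)
drug_b = rng.normal(loc=11.0, scale=2.0, size=50)

t_stat, p_value = stats.ttest_ind(drug_a, drug_b)   # two-sample t-test
print(t_stat, p_value)   # a small p-value is evidence that the drugs differ
```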

Q4: Given a left-skewed distribution that has a median of 60, what conclusions can we draw about the mean and the
mode of the data?

Answer: A left-skewed distribution means the tail of the distribution is on the left and the peak is on the right. The mean, which is pulled toward outliers (very large or small values), will therefore be shifted toward the left, in other words, toward the tail.

The mode (which represents the most repeated value) will be near the peak, and the median, being the middle element regardless of the skewness, will lie between them: smaller than the mode and greater than the mean.

Therefore: Mean < 60, Median = 60, Mode > 60.

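A quick numerical check, using a Beta(5, 2) sample as a stand-in for a left-skewed distribution (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.beta(5, 2, size=200_000) * 100   # Beta(5, 2) is left-skewed; scaled to a 0-100 range

mean, median = x.mean(), np.median(x)
# Approximate the mode by the center of the densest histogram bin
counts, edges = np.histogram(x, bins=200)
mode = (edges[counts.argmax()] + edges[counts.argmax() + 1]) / 2

print(mean, median, mode)   # expect mean < median < mode
```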

Q5: What is the meaning of selection bias and how to avoid it?
Answer:

Selection bias (sampling bias) is the phenomenon that occurs when a research study design fails to collect a representative sample of a target population. This typically occurs because the selection criteria for respondents failed to capture a wide enough sampling frame to represent all viewpoints.

The cause of sampling bias almost always comes down to one of two conditions.

1. Poor methodology: In most cases, non-representative samples pop up when researchers set improper parameters for survey research. The
most accurate and repeatable sampling method is simple random sampling where a large number of respondents are chosen at random. When
researchers stray from random sampling (also called probability sampling), they risk injecting their own selection bias into recruiting
respondents.

2. Poor execution: Sometimes data researchers craft scientifically sound sampling methods, but their work is undermined when field workers
cut corners. By reverting to convenience sampling (where the only people studied are those who are easy to reach) or giving up on reaching
non-responders, a field worker can jeopardize the careful methodology set up by data scientists.

The best way to avoid sampling bias is to stick to probability-based sampling methods. These include simple random sampling, systematic
sampling, cluster sampling, and stratified sampling. In these methodologies, respondents are only chosen through processes of random selection—
even if they are sometimes sorted into demographic groups along the way.

Q6: Explain the long-tailed distribution and provide three examples of relevant phenomena that have long tails. Why
are they important in classification and regression problems?

Answer: A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.

Three examples of relevant phenomena that have long tails:

1. Frequencies of languages spoken


2. Population of cities
3. Pageviews of articles

All of these follow something close to the 80-20 rule: 80% of outcomes (or outputs) result from 20% of all causes (or inputs) for any given event. The remaining, infrequently occurring values form the long tail of the distribution.
It is important to be mindful of long-tailed distributions in classification and regression problems because the infrequently occurring values collectively make up a large share of the population. This can change the way you deal with outliers, and it also conflicts with machine learning techniques that assume the data is normally distributed.
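A small illustration of this pattern, using a Zipf-like sample as a stand-in for pageview counts (the distribution parameter is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
views = rng.zipf(a=2.0, size=100_000)      # heavy-tailed "pageview counts"

views_sorted = np.sort(views)[::-1]
top_20pct_share = views_sorted[: len(views_sorted) // 5].sum() / views_sorted.sum()
print(top_20pct_share)   # the top 20% of items account for most of the total views
```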

Q7: What is the meaning of KPI in statistics?


Answer:

KPI stands for key performance indicator, a quantifiable measure of performance over time for a specific objective. KPIs provide targets for teams
to shoot for, milestones to gauge progress, and insights that help people across the organization make better decisions. From finance and HR to
marketing and sales, key performance indicators help every area of the business move forward at the strategic level.

KPIs are an important way to ensure your teams are supporting the overall goals of the organization. Here are some of the biggest reasons why you
need key performance indicators.

Keep your teams aligned: Whether measuring project success or employee performance, KPIs keep teams moving in the same direction.
Provide a health check: Key performance indicators give you a realistic look at the health of your organization, from risk factors to financial
indicators.
Make adjustments: KPIs help you clearly see your successes and failures so you can do more of what’s working, and less of what’s not.
Hold your teams accountable: Make sure everyone provides value with key performance indicators that help employees track their progress
and help managers move things along.

Types of KPIs: Key performance indicators come in many flavors. While some are used to measure monthly progress against a goal, others have a
longer-term focus. The one thing all KPIs have in common is that they’re tied to strategic goals. Here’s an overview of some of the most common
types of KPIs.

Strategic: These big-picture key performance indicators monitor organizational goals. Executives typically look to one or two strategic KPIs
to find out how the organization is doing at any given time. Examples include return on investment, revenue and market share.
Operational: These KPIs typically measure performance in a shorter time frame, and are focused on organizational processes and
efficiencies. Some examples include sales by region, average monthly transportation costs and cost per acquisition (CPA).
Functional Unit: Many key performance indicators are tied to specific functions, such as finance or IT. While IT might track time to resolution
or average uptime, finance KPIs track gross profit margin or return on assets. These functional KPIs can also be classified as strategic or
operational.
Leading vs Lagging: Regardless of the type of key performance indicator you define, you should know the difference between leading
indicators and lagging indicators. While leading KPIs can help predict outcomes, lagging KPIs track what has already happened.
Organizations use a mix of both to ensure they’re tracking what’s most important.


Q8: Say you flip a coin 10 times and observe only one head. What would be the null hypothesis and p-value for testing
whether the coin is fair or not?

Answer:

The null hypothesis is that the coin is fair, and the alternative hypothesis is that the coin is biased. The p-value is the probability of observing the
results obtained given that the null hypothesis is true, in this case, the coin is fair.

In total for 10 flips of a coin, there are 2^10 = 1024 possible outcomes and in only 10 of them are there 9 tails and one head.

Hence, the probability of the observed result (exactly one head) is 10/1024 ≈ 0.0098; including the more extreme outcome of zero heads gives a one-sided p-value of 11/1024 ≈ 0.0107. Either way, with a significance level set, for example, at 0.05, we can reject the null hypothesis and conclude the coin is likely not fair.
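A quick check of these numbers with SciPy (a sketch; binomtest requires SciPy >= 1.7):

```python
from scipy import stats

n, k, p_fair = 10, 1, 0.5

prob_exactly_one = stats.binom.pmf(k, n, p_fair)     # 10/1024 ~ 0.0098
p_one_sided = stats.binom.cdf(k, n, p_fair)          # P(0 or 1 heads) = 11/1024 ~ 0.0107
p_two_sided = stats.binomtest(k, n, p_fair).pvalue   # ~0.0215

print(prob_exactly_one, p_one_sided, p_two_sided)
```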

Q9: You are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when
doing this?
Answer: The main consideration when we have a large number of tests is that the probability of getting a significant test result due to chance alone increases. This will increase the type 1 error (rejecting the null hypothesis when it's actually true).

Therefore we need to apply the Bonferroni correction, which accounts for the number of tests performed. For example, if our significance level is 0.05 and we run 100 tests, each individual test should be judged against a corrected significance level alpha* = significance level / K = 0.05 / 100 = 0.0005, where K is the number of tests, so that the overall (family-wise) chance of a false rejection stays near 0.05.
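A minimal sketch of applying the correction to a set of p-values (the p-values here are made up):

```python
import numpy as np

alpha, K = 0.05, 100
alpha_star = alpha / K          # Bonferroni-corrected per-test level: 0.0005

p_values = np.array([0.0003, 0.004, 0.02, 0.0001])   # hypothetical t-test p-values
print(p_values < alpha_star)    # only tests below the corrected threshold are declared significant
```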

Q10: What general conditions must be satisfied for the central limit theorem to hold?

Answer:

In order to apply the central limit theorem, there are four conditions that must be met:

1. Randomization: The data must be sampled randomly such that every member of the population has an equal probability of being selected for the sample.

2. Independence: The sample values must be independent of each other.

3. The 10% Condition: When the sample is drawn without replacement, the sample size should be no larger than 10% of the population.

4. Large Sample Condition: The sample size needs to be sufficiently large.

Q11: What is skewness? Discuss two methods to measure it.

Answer:
Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified as a representation of the extent to which a given distribution varies from a normal distribution. There are two main types of skewness: negative skew, which refers to a longer or fatter tail on the left side of the distribution, and positive skew, which refers to a longer or fatter tail on the right. These two skews describe the direction or weight of the distribution.

The mean of positively skewed data will be greater than the median. In a negatively skewed distribution, the exact opposite is the case: the mean of
negatively skewed data will be less than the median. If the data graphs symmetrically, the distribution has zero skewness, regardless of how long or
fat the tails are.

There are several ways to measure skewness. Pearson’s first and second coefficients of skewness are two common methods. Pearson’s first
coefficient of skewness, or Pearson mode skewness, subtracts the mode from the mean and divides the difference by the standard deviation.
Pearson’s second coefficient of skewness, or Pearson median skewness, subtracts the median from the mean, multiplies the difference by three, and
divides the product by the standard deviation.
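Both coefficients can be computed directly; a sketch on an arbitrary right-skewed sample (the mode is approximated from a histogram, since continuous data has no exact mode):

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(3).gamma(shape=2.0, scale=1.0, size=10_000)   # right-skewed sample

mean, median, std = x.mean(), np.median(x), x.std(ddof=1)
counts, edges = np.histogram(x, bins=100)
mode = (edges[counts.argmax()] + edges[counts.argmax() + 1]) / 2        # crude mode estimate

pearson_first = (mean - mode) / std          # Pearson mode skewness
pearson_second = 3 * (mean - median) / std   # Pearson median skewness
print(pearson_first, pearson_second, stats.skew(x))
```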

Q12: You sample from a uniform distribution [0, d] n times. What is your best estimate of d?
Answer:

Intuitively, it is the maximum of the sample points, which is also the maximum likelihood estimate of d. Since the sample maximum slightly underestimates d, an unbiased version scales it by (n + 1) / n.
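A quick simulation sketch comparing the two estimators (the values of d and n are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d_true, n, trials = 7.0, 20, 100_000

samples = rng.uniform(0.0, d_true, size=(trials, n))
mle = samples.max(axis=1)            # sample maximum (maximum likelihood estimate)
unbiased = (n + 1) / n * mle         # bias-corrected estimate

print(mle.mean(), unbiased.mean())   # the MLE averages below d_true; the corrected version ~ d_true
```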

Q13: Discuss the Chi-square, ANOVA, and t-test


Answer:

Chi-square test: a statistical method used to determine whether there is a significant difference or association between observed and expected frequencies of categorical variables in a dataset.

Example: A food delivery company wants to find the relationship between gender, location, and food choices of people.

It is used to determine whether the difference between two categorical variables is:

due to chance, or

due to a relationship between them.

Analysis of Variance (ANOVA) is a statistical method used to compare the means (or averages) of different groups by analyzing variances. It is used in a range of scenarios to determine whether there is any difference between the means of different groups.

The t-test is a statistical method for comparing the means of two groups drawn from normally distributed samples.

It comes in various types such as:

1. One sample t-test:

Used to compare the mean of a sample and the population.

2. Two sample t-tests:

Used to compare the means of two independent samples and determine whether their populations are statistically different.

3. Paired t-test:

Used to compare the means of two related samples, e.g., measurements on the same group before and after a treatment.
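Hedged SciPy sketches of the three tests (the contingency table and group data below are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Chi-square test of independence on a hypothetical gender x food-choice table
table = np.array([[30, 20, 10],
                  [25, 25, 15]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# One-way ANOVA across three hypothetical groups
g1, g2, g3 = rng.normal(0, 1, 30), rng.normal(0.2, 1, 30), rng.normal(0.5, 1, 30)
f_stat, p_anova = stats.f_oneway(g1, g2, g3)

# Two-sample and paired t-tests
t2, p2 = stats.ttest_ind(g1, g2)                     # independent samples
before, after = g1, g1 + rng.normal(0.3, 0.5, 30)    # hypothetical paired measurements
t_paired, p_paired = stats.ttest_rel(before, after)

print(p_chi2, p_anova, p2, p_paired)
```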

Q14: Say you have two subsets of a dataset for which you know their means and standard deviations. How do you
calculate the blended mean and standard deviation of the total dataset? Can you extend it to K subsets?
Answer:
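A sketch of the standard pooled formulas, assuming subset sizes n_i with means m_i and (population) standard deviations s_i: the blended mean is the size-weighted average of the subset means, and the blended variance comes from combining each subset's second moment n_i * (s_i^2 + m_i^2); this extends directly to K subsets.

```python
import numpy as np

def blended_mean_std(ns, means, stds):
    """Combine K subsets given their sizes, means, and (population) standard deviations."""
    ns, means, stds = map(np.asarray, (ns, means, stds))
    n_total = ns.sum()
    mean_total = (ns * means).sum() / n_total
    # E[X^2] of the whole dataset, built from each subset's second moment
    second_moment = (ns * (stds**2 + means**2)).sum() / n_total
    return mean_total, np.sqrt(second_moment - mean_total**2)

# Works for two subsets or any number K of subsets
print(blended_mean_std([100, 50], [10.0, 14.0], [2.0, 3.0]))
```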

Q15: What is the relationship between the significance level and the confidence level in Statistics?

Answer: Confidence level = 1 - significance level.

It's closely related to hypothesis testing and confidence intervals.

Significance Level according to the hypothesis testing literature means the probability of Type-I error one is willing to tolerate.

Confidence Level, according to the confidence interval literature, means the probability (in the repeated-sampling sense) that the constructed confidence interval contains the true parameter value. They are usually written as percentages.

Q16: What is the Law of Large Numbers in statistics and how can it be used in data science?
Answer: The law of large numbers states that as the number of trials in a random experiment increases, the average of the results obtained from the
experiment approaches the expected value. In statistics, it's used to describe the relationship between sample size and the accuracy of statistical
estimates.

In data science, the law of large numbers is used to understand the behavior of random variables over many trials. It's often applied in areas such as
predictive modeling, risk assessment, and quality control to ensure that data-driven decisions are based on a robust and accurate representation of
the underlying patterns in the data.

The law of large numbers helps to guarantee that the average of the results from a large number of independent and identically distributed trials will
converge to the expected value, providing a foundation for statistical inference and hypothesis testing.
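A minimal simulation sketch of this convergence, using die rolls (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
rolls = rng.integers(1, 7, size=1_000_000)   # fair six-sided die, expected value 3.5

running_mean = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)
for n in (10, 1_000, 1_000_000):
    print(n, running_mean[n - 1])            # approaches 3.5 as n grows
```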

Q17: What is the difference between a confidence interval and a prediction interval, and how do you calculate them?

Answer:

A confidence interval is a range of values that is likely to contain the true value of a population parameter with a certain level of confidence. It is
used to estimate the precision or accuracy of a sample statistic, such as a mean or a proportion, based on a sample from a larger population.

For example, if we want to estimate the average height of all adults in a certain region, we can take a random sample of individuals from that region
and calculate the sample mean height. Then we can construct a confidence interval for the true population mean height, based on the sample mean
and the sample size, with a certain level of confidence, such as 95%. This means that if we repeat the sampling process many times, 95% of the
resulting intervals will contain the true population mean height.

The formula for a confidence interval is: confidence interval = sample statistic +/- margin of error

The margin of error depends on the sample size, the standard deviation of the population (or the sample, if the population standard deviation is
unknown), and the desired level of confidence. For example, if the sample size is larger or the standard deviation is smaller, the margin of error will
be smaller, resulting in a narrower confidence interval.

A prediction interval is a range of values that is likely to contain a future observation or outcome with a certain level of confidence. It is used to
estimate the uncertainty or variability of a future value based on a statistical model and the observed data.
For example, if we have a regression model that predicts the sales of a product based on its price and advertising budget, we can use a prediction
interval to estimate the range of possible sales for a new product with a certain price and advertising budget, with a certain level of confidence, such
as 95%. This means that if we repeat the prediction process many times, 95% of the resulting intervals will contain the true sales value.

The formula for a prediction interval is: prediction interval = point estimate +/- margin of error

The point estimate is the predicted value of the outcome variable based on the model and the input variables. The margin of error depends on the
residual standard deviation of the model, which measures the variability of the observed data around the predicted values, and the desired level of
confidence. For example, if the residual standard deviation is larger or the level of confidence is higher, the margin of error will be larger, resulting in a wider prediction interval.
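A sketch of both intervals for a simple mean estimate, assuming an approximately normal sample (the height data is made up); the prediction interval for a single new observation widens the margin by the extra sqrt(1 + 1/n) factor:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
heights = rng.normal(170, 8, size=50)    # hypothetical sample of adult heights (cm)

n, mean, s = len(heights), heights.mean(), heights.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)    # 95% two-sided critical value

# Confidence interval for the population mean
ci = (mean - t_crit * s / np.sqrt(n), mean + t_crit * s / np.sqrt(n))
# Prediction interval for one new observation from the same population
pi = (mean - t_crit * s * np.sqrt(1 + 1 / n), mean + t_crit * s * np.sqrt(1 + 1 / n))

print(ci, pi)   # the prediction interval is much wider
```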

SQL & DB Interview Questions & Answers for Data Scientists


Questions
Q1: What are joins in SQL? Discuss their types.
Q2: Define the primary, foreign, and unique keys and the differences between them?
Q3: What is the difference between BETWEEN and IN operators in SQL?
Q4: Assume you have the given table below which contains information on user logins. Write a query to obtain the number of reactivated
users (Users who did not log in the previous month and then logged in the current month)
Q5: Describe the advantages and disadvantages of relational database vs NoSQL databases
Q6: Assume you are given the table below on user transactions. Write a query to obtain the third transaction of every user

Q7: What do you understand by Self Join? Explain using an example


Q8: Write an SQL query to join 3 tables
Q9: Write a SQL query to get the third-highest salary of an employee from employee_table and arrange them in descending order.
Q10: What is the difference between temporary tables and common table expressions?
Q11: Why use a RIGHT JOIN when a LEFT JOIN can satisfy the requirement?
Q12: Why does RANK skip sequence numbers?

Questions & Answers


Q1: What are joins in SQL? Discuss their types.
A JOIN clause is used to combine rows from two or more tables based on a related column between them. It is used to merge two tables or retrieve data from them. There are four types of joins: inner join, left join, right join, and full join.

Inner join: Inner Join in SQL is the most common type of join. It is used to return all the rows from multiple tables where the join condition is
satisfied.
Left Join: Left Join in SQL is used to return all the rows from the left table but only the matching rows from the right table where the join
condition is fulfilled.
Right Join: Right Join in SQL is used to return all the rows from the right table but only the matching rows from the left table where the join
condition is fulfilled.
Full Join: Full join returns all the records when there is a match in any of the tables. Therefore, it returns all the rows from the left-hand side
table and all the rows from the right-hand side table.

Q2: Define the primary, foreign, and unique keys and the differences between them?
Primary key: Is a key that is used to uniquely identify each row or record in the table, it can be a single column or composite pk that contains more
than one column

The primary key doesn't accept null or repeated values


The purpose of the primary key is to keep the Entity's integrity
There is only one PK in each table
Every row must have a unique primary key

Foreign key: Is a key that is used to identify, show or describe the relationship between tuples of two tables. It acts as a cross-reference between
tables because it references the primary key of another table, thereby establishing a link between them.

The purpose of the foreign key is to keep data integrity


It can contain null values or primary key values

Unique key: It's a key that can identify each row in the table, like the primary key, but it can contain one null value.

Every table can have more than one Unique key

Q3: What is the difference between BETWEEN and IN operators in SQL?

Answer:

The SQL BETWEEN operator selects values within a given range. It is inclusive: both the begin and end values are included. The values can be text, dates, numbers, or other types.

For example, select * from tablename where price BETWEEN 10 and 100;

The IN operator is used to select rows in which a certain value exists in a given field. It is used with the WHERE clause to match values in a list.

For example, select COLUMN from tablename where 'USA' in (country);

IN is mainly suited to categorical variables (it can be used with numerical ones as well), whereas BETWEEN is for numerical variables.

Q4: Assume you have the given table below which contains information on user logins. Write a query to obtain the
number of reactivated users (Users who did not log in the previous month and then logged in the current month)

Answer: First, we look at all the users who did not log in during the previous month. To obtain the previous month, we subtract an INTERVAL of 1 month from the current month's login date. Then, we use WHERE NOT EXISTS against the previous month's interval to keep only logins by users with no login in the previous month. Finally, we COUNT the number of such users per month.
-- Assumes the user_logins table has user_id and login_date columns.
SELECT
    DATE_TRUNC('month', current_month.login_date) AS current_month,
    COUNT(DISTINCT current_month.user_id) AS num_reactivated_users
FROM
    user_logins current_month
WHERE
    NOT EXISTS (
        SELECT
            *
        FROM
            user_logins last_month
        WHERE
            last_month.user_id = current_month.user_id
            AND DATE_TRUNC('month', last_month.login_date) =
                DATE_TRUNC('month', current_month.login_date) - INTERVAL '1 month'
    )
GROUP BY
    DATE_TRUNC('month', current_month.login_date)

Q5: Describe the advantages and disadvantages of relational database vs NoSQL databases

Answer:

Advantages of Relational Databases: Ensure data integrity through a defined schema and ACID properties. Easy to get started with and use for
small-scale applications. Lends itself well to vertical scaling. Uses an almost standard query language, making learning or switching between types
of relational databases easy.

Advantages of NoSQL Databases: Offers more flexibility in data formats and representations, which makes working with unstructured or semi-structured data easier. Hence, it is useful when the data schema is still evolving or when adding new features/functionality rapidly, as in a startup environment, and it scales well through horizontal scaling. Lends itself better to applications that need to be highly available.

Disadvantages of Relational Databases: The data schema needs to be known in advance. Altering schemas is possible, but frequent changes to the schema for large tables can cause performance issues. Horizontal scaling is relatively difficult, leading to eventual performance bottlenecks.

Disadvantages of NoSQL Databases: As outlined by the BASE framework, weaker guarantees of data correctness are made due to the soft-state and eventual-consistency properties. Managing consistency can also be difficult due to the lack of a predefined schema that's strictly adhered to. Depending on the type of NoSQL database, it can be challenging for the database to handle certain types of complex queries or access patterns.

Q6: Assume you are given the table below on user transactions. Write a query to obtain the third transaction of every
user

Answer: First, we obtain the transaction numbers for each user. We can do this by using the ROW_NUMBER window function, where we
PARTITION by the user_id and ORDER by the transaction_date fields, calling the resulting field a transaction number. From there, we can simply
take all transactions having a transaction number equal to 3.
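A rough pandas equivalent of the described ROW_NUMBER logic (the column names user_id and transaction_date are assumptions):

```python
import pandas as pd

def third_transactions(df: pd.DataFrame) -> pd.DataFrame:
    # Number each user's transactions by date (1, 2, 3, ...) and keep the third one
    df = df.sort_values(["user_id", "transaction_date"]).copy()
    df["transaction_number"] = df.groupby("user_id").cumcount() + 1
    return df[df["transaction_number"] == 3]
```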

Q7: What do you understand by Self Join? Explain using an example

Answer:

A self-join is, as its name implies, joining a table to itself in a database. This process may come in handy in a number of cases, such as:

1- comparing the table's rows to themselves:

It's like we have two copies of the same table and join them together on a given condition to reach the required output query.

Ex. If we have a store database with a client's data table holding a bunch of demographics, we could self-join the client's table to get clients who are
located in the same city/made a purchase on the same day/etc.

2- querying a table that has hierarchical data:

Meaning, the table has a primary key that has a one-to-many relationship with another foreign key inside the same table, in other words, the table
has data that refers to the same table. We could use self-join in order to have a clear look at the data by matching its keys.

Ex. The organizational structure of a company may contain an employee table that has an employee id and his manager id (who is also an
employee, hence has an employee id too) in the same table. Using self-join on this table would allow us to reference every employee directly to his
manager.

P.S. we would need to take care of duplicates that may occur and consider them in the conditions.

Q8: Write an SQL query to join 3 tables


Q9: Write a SQL query to get the third-highest salary of an employee from employee_table and arrange them in
descending order.
Answer:

Q10: What is the difference between temporary tables and common table expressions?

Answer:

𝗧𝗲𝗺𝗽𝗼𝗿𝗮𝗿𝘆 𝘁𝗮𝗯𝗹𝗲𝘀 and 𝗖𝗧𝗘s are both used to store intermediate results in MySQL, but there are some key differences between the two:

𝗗𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻: A temporary table is a physical table that is created in the database and persists until it is explicitly dropped or the session ends. A
CTE is a virtual table that is defined only within the scope of a single SQL statement.

𝗦𝘁𝗼𝗿𝗮𝗴𝗲: Temporary tables are stored in the database and occupy physical disk space. CTEs are not stored on disk and exist only in memory for
the duration of the query.

𝗔𝗰𝗰𝗲𝘀𝘀: Temporary tables can be accessed from any session that has the appropriate privileges. CTEs are only accessible within the scope of the
query in which they are defined.

𝗟𝗶𝗳𝗲𝘀𝗽𝗮𝗻: Temporary tables persist until they are explicitly dropped or the session ends. CTEs are only available for the duration of the query in
which they are defined and are then discarded.

𝗦𝘆𝗻𝘁𝗮𝘅: Temporary tables are created using the CREATE TEMPORARY TABLE statement, while CTEs are defined using the WITH clause.

𝗣𝘂𝗿𝗽𝗼𝘀𝗲: Temporary tables are typically used to store intermediate results that will be used in multiple queries, while CTEs are used to simplify
complex queries by breaking them down into smaller, more manageable parts.

In summary, temporary tables are physical tables that persist in the database and can be accessed from any session, while CTEs are virtual tables
that exist only within the scope of a single query and are discarded once the query is complete. Both temporary tables and CTEs can be useful tools
for simplifying complex queries and storing intermediate results.

Q11: Why use a RIGHT JOIN when a LEFT JOIN can satisfy the requirement?

Answer: In MySQL, the 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 𝗮𝗻𝗱 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 are used to retrieve data from multiple tables by joining them based on a specified
condition.

Generally, the 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 is used more frequently than the 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 because it returns all the rows from the left table and matching rows
from the right table, or NULL values if there is no match.

In most cases, a 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 is sufficient to meet the requirement of retrieving all the data from the left table and matching data from the right
table.

However, there may be situations where using a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 is more appropriate.

Here are a few examples:

𝟭. 𝗪𝗵𝗲𝗻 𝘁𝗵𝗲 𝗽𝗿𝗶𝗺𝗮𝗿𝘆 𝘁𝗮𝗯𝗹𝗲 𝗶𝘀 𝘁𝗵𝗲 𝗿𝗶𝗴𝗵𝘁 𝘁𝗮𝗯𝗹𝗲: If the right table contains the primary data that needs to be retrieved, and the left table
contains supplementary data, a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 can be used to retrieve all the data from the right table and matching data from the left table.

𝟮. 𝗪𝗵𝗲𝗻 𝘁𝗵𝗲 𝗾𝘂𝗲𝗿𝘆 𝗻𝗲𝗲𝗱𝘀 𝘁𝗼 𝗯𝗲 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗲𝗱: In some cases, a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 may be more efficient than a 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 because the
database optimizer can choose the most efficient join order based on the query structure and the available indexes.

𝟯. 𝗪𝗵𝗲𝗻 𝘂𝘀𝗶𝗻𝗴 𝗼𝘂𝘁𝗲𝗿 𝗷𝗼𝗶𝗻𝘀: If the query requires an outer join, a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 may be used to return all the rows from the right table,
including those with no matching rows in the left table. It's important to note that while a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 can provide additional functionality in
certain cases, it may also make the query more complex and difficult to read. In most cases, a 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 is the preferred method for joining
tables in MySQL.

Q12: Why does RANK skip sequence numbers?


Answer: In MySQL, the ranking functions may produce non-consecutive numbers depending on the function and the data. Both RANK() and DENSE_RANK() assign the same rank to duplicate (tied) values; the difference is that RANK() then leaves a gap equal to the number of ties, whereas DENSE_RANK() continues with the next consecutive rank.

Here are some of the reasons why the rank sequence may skip numbers in MySQL:

1. The RANK() function skips ranks when there are ties. For example, if two rows share the same value in the ranking column, both are assigned rank 1 and the next row receives rank 3; rank 2 is skipped.

2. The DENSE_RANK() function does not skip ranks after ties. For the same data, the row after the two tied rows receives rank 2, so dense ranks stay consecutive even though the same rank is repeated for duplicates.

3. The query may have filtering or grouping clauses that affect the ranking. For example, if a query filters out some rows or groups them by a
different column, the ranking may not be sequential.

It's important to note that the ranking function in MySQL behaves differently from the ranking function in other databases, so the same query may
produce different results in different database systems.

Resume Based Questions


Questions
Q1: Discuss a challenging problem you faced while working on a data science project and how did you solve it?
Q2: Explain a data science project that you are most proud to have worked on.

Python Questions
Questions:
Q1: Given two arrays, write a python function to return the intersection of the two? For example, X = [1,5,9,0] and Y = [3,0,2,9] it should
return [9,0]

Q2: Given an array, find all the duplicates in this array? For example: input: [1,2,3,1,3,6,5] output: [1,3]

Q3: Given an integer array, return the maximum product of any three numbers in the array?

Q4: Given an integer array, find the sum of the largest contiguous subarray within the array. For example, given the array A =
[0,-1,-5,-2,3,14] it should return 17 because of [3,14]. Note that if all the elements are negative it should return zero.

Q5: Define tuples and lists in Python. What are the major differences between them?

Q6: Compute the Euclidean Distance Between Two Series?

Q7: Given an integer n and an integer k, output a list of all of the combinations of k numbers chosen from 1 to n. For example, if n=3 and k=2, return [1,2], [1,3], [2,3]

Q8: Write a function to generate N samples from a normal distribution and plot them on the histogram

Q9: What is the difference between apply and applymap function in pandas?

Q10: Given a string, return the first recurring character in it, or “None” if there is no recurring character. Example: input =
"pythoninterviewquestion" , output = "n"

Q11: Given a positive integer X return an integer that is a factorial of X. If a negative integer is provided, return -1. Implement the solution by
using a recursive function.

Q12: Given an m-by-n matrix with positive integers, determine the length of the longest increasing path within the matrix. For example,
consider the input matrix:
[ 1 2 3 ]

[ 4 5 6 ]

[ 7 8 9 ]

The answer should be 5 since the longest path would be 1-2-5-6-9

Q13: 𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐰𝐡𝐚𝐭 𝐅𝐥𝐚𝐬𝐤 𝐢𝐬 𝐚𝐧𝐝 𝐢𝐭𝐬 𝐛𝐞𝐧𝐞𝐟𝐢𝐭𝐬

Q14: 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐭𝐡𝐞 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 𝐥𝐢𝐬𝐭𝐬, 𝐚𝐫𝐫𝐚𝐲𝐬, 𝐚𝐧𝐝 𝐬𝐞𝐭𝐬 𝐢𝐧 𝐏𝐲𝐭𝐡𝐨𝐧, 𝐚𝐧𝐝 𝐰𝐡𝐞𝐧 𝐲𝐨𝐮 𝐬𝐡𝐨𝐮𝐥𝐝 𝐮𝐬𝐞 𝐞𝐚𝐜𝐡 𝐨𝐟
𝐭𝐡𝐞𝐦?

Q15: What are some common ways to handle missing data in Python, and which method do you prefer and why?
Questions & Answers:
Q1: Given two arrays, write a python function to return the intersection of the two? For example, X = [1,5,9,0] and Y =
[3,0,2,9] it should return [9,0]

Answer:
list(set(X).intersection(set(Y)))   # or: set(X) & set(Y); returns {9, 0} as a list

Q2: Given an array, find all the duplicates in this array? For example: input: [1,2,3,1,3,6,5] output: [1,3]

Answer:
arr = [1, 2, 3, 1, 3, 6, 5]
seen = set()
res = set()
for i in arr:           # add to res any value that has been seen before
    if i in seen:
        res.add(i)
    else:
        seen.add(i)
print(res)              # {1, 3}

Q3: Given an integer array, return the maximum product of any three numbers in the array?

Answer:
import heapq

def max_three(arr):
    a = heapq.nlargest(3, arr)    # three largest numbers (positive case)
    b = heapq.nsmallest(2, arr)   # two smallest numbers (case of two large negatives)
    return max(a[2] * a[1] * a[0], b[1] * b[0] * a[0])

Q4: Given an integer array, find the sum of the largest contiguous subarray within the array. For example, given the
array A = [0,-1,-5,-2,3,14] it should return 17 because of [3,14]. Note that if all the elements are negative it should
return zero.
def max_subarray(arr):
    max_sum = 0        # start at 0 so an all-negative array returns zero, as required
    curr_sum = 0
    for value in arr:  # Kadane's algorithm
        curr_sum += value
        max_sum = max(max_sum, curr_sum)
        if curr_sum < 0:
            curr_sum = 0
    return max_sum

print(max_subarray([0, -1, -5, -2, 3, 14]))   # 17

Q5: Define tuples and lists in Python What are the major differences between them?
Answer:

Lists: In Python, a list is created by placing elements inside square brackets [], separated by commas. A list can have any number of items and they
may be of different types (integer, float, string, etc.). A list can also have another list as an item. This is called a nested list.

1. Lists are mutable


2. Lists are better for performing operations, such as insertion and deletion.
3. Lists consume more memory
4. Lists have several built-in methods

Tuples: A tuple is a collection of objects which is ordered and immutable. Tuples are sequences, just like lists. The differences between tuples and lists are that tuples cannot be changed, unlike lists, and tuples use parentheses, whereas lists use square brackets.

1. Tuples are immutable


2. Tuple data type is appropriate for accessing the elements
3. Tuples consume less memory as compared to the list
4. Tuple does not have many built-in methods.

Mutable = we can change, add, delete and modify stuff


Immutable = we cannot change, add, delete and modify stuff

Q6: Compute the Euclidean Distance Between Two Series?
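A minimal sketch using NumPy/pandas, assuming two numeric Series of equal length:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([4.0, 3.0, 2.0, 1.0])

dist = np.sqrt(((s1 - s2) ** 2).sum())   # or: np.linalg.norm(s1 - s2)
print(dist)
```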

Q7: Given an integer n and an integer K, output a list of all of the combination of k numbers chosen from 1 to n. For
example, if n=3 and k=2, return [1,2],[1,3],[2,3]

Answer
from itertools import combinations
def find_combination(k, n):
    list_num = []
    comb = combinations(range(1, n + 1), k)   # all k-element combinations of 1..n
    for i in comb:
        list_num.append(i)
    print("(k:{}, n:{}):".format(k, n))
    print(list_num, "\n")

find_combination(2, 3)   # [(1, 2), (1, 3), (2, 3)]

Q8: Write a function to generate N samples from a normal distribution and plot them on the histogram
Answer: Using built-in libraries:
import numpy as np
import matplotlib.pyplot as plt

N = 1000                     # number of samples
x = np.random.randn(N)       # N samples from the standard normal distribution
plt.hist(x, bins=30)
plt.show()

From scratch, one option is the Box–Muller transform, sketched below:
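A minimal sketch of that approach (pure Python apart from matplotlib; the sample size is arbitrary):

```python
import math
import random
import matplotlib.pyplot as plt

def normal_samples(n, mu=0.0, sigma=1.0):
    """Generate n normal samples via the Box-Muller transform."""
    samples = []
    while len(samples) < n:
        u1 = 1.0 - random.random()    # in (0, 1], so log(u1) is defined
        u2 = random.random()
        r = math.sqrt(-2.0 * math.log(u1))
        samples.append(mu + sigma * r * math.cos(2 * math.pi * u2))
        samples.append(mu + sigma * r * math.sin(2 * math.pi * u2))
    return samples[:n]

plt.hist(normal_samples(10_000), bins=30)
plt.show()
```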

Q9: What is the difference between apply and applymap function in pandas?

Answer:

Both the methods only accept callables as arguments but what sets them apart is that applymap is defined on dataframes and works element-wise.
While apply can be defined on data frames as well as series and can work row/column-wise as well as element-wise. In terms of use case, applymap
is used for transformations while apply is used for more complex operations and aggregations. Applymap only returns a dataframe while apply can
return a scalar value, series, or dataframe.
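A small illustrative comparison (the dataframe is made up; note that newer pandas versions also offer DataFrame.map for element-wise operations):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(df.applymap(lambda v: v * 10))     # element-wise on the whole dataframe
print(df.apply(sum, axis=0))             # column-wise aggregation -> returns a Series
print(df["a"].apply(lambda v: v ** 2))   # apply also works on a single Series
```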

Q10: Given a string, return the first recurring character in it, or “None” if there is no recurring character. Example:
input = "pythoninterviewquestion" , output = "n"
Answer:
input_string = "pythoninterviewquestion"

def first_recurring(input_str):
    a_str = ""
    for letter in input_str:
        a_str = a_str + letter
        if a_str.count(letter) > 1:   # letter already appeared earlier in the string
            return letter
    return None

print(first_recurring(input_string))   # "n"

Q11: Given a positive integer X return an integer that is a factorial of X. If a negative integer is provided, return -1.
Implement the solution by using a recursive function.

Answer:
def factorial(x):
    # Edge cases
    if x < 0:
        return -1
    if x == 0:
        return 1

    # Exit condition: x == 1
    if x == 1:
        return x
    else:
        # Recursive part
        return x * factorial(x - 1)

Q12: Given an m-by-n matrix with positive integers, determine the length of the longest increasing path within the
matrix. For example, consider the input matrix:
[ 1 2 3 ]

[ 4 5 6 ]

[ 7 8 9 ]

The answer should be 5 since the longest path would be 1-2-5-6-9

Answer:
MAX = 10

def longest_increasing_path(dp, mat, n, m, x, y):
    # If the value for this cell has not been calculated yet
    if dp[x][y] < 0:
        result = 0

        # If we reach the bottom-right cell, the path length is 1
        if x == n - 1 and y == m - 1:
            dp[x][y] = 1
            return dp[x][y]

        # If we reach the last row or last column of the matrix
        if x == n - 1 or y == m - 1:
            result = 1

        # If the value below is greater, move down
        if x + 1 < n and mat[x][y] < mat[x + 1][y]:
            result = 1 + longest_increasing_path(dp, mat, n, m, x + 1, y)

        # If the value to the right is greater, move right
        if y + 1 < m and mat[x][y] < mat[x][y + 1]:
            result = max(result, 1 + longest_increasing_path(dp, mat, n, m, x, y + 1))

        dp[x][y] = result
    return dp[x][y]

# Wrapper function
def wrapper(mat, n, m):
    dp = [[-1 for _ in range(MAX)] for _ in range(MAX)]
    return longest_increasing_path(dp, mat, n, m, 0, 0)

print(wrapper([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 3, 3))   # 5

Q13: 𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐰𝐡𝐚𝐭 𝐅𝐥𝐚𝐬𝐤 𝐢𝐬 𝐚𝐧𝐝 𝐢𝐭𝐬 𝐛𝐞𝐧𝐞𝐟𝐢𝐭𝐬


Answer:

Flask is a web framework. This means flask provides you with tools, libraries, and technologies that allow you to build a web application. This web
application can be some web pages, a blog, a wiki, or go as big as a web-based calendar application or a commercial website.

Benefits of Flask:

1. Scalable: Flask's status as a microframework means that it can be used to grow a tech project such as a web app very quickly.

2. Flexible: It allows the project to be rearranged and moved around, and it makes sure that the project structure does not collapse when a part is altered.

3. Easy to negotiate: At its core, the microframework is easy for web developers to understand, giving them more control over their code and what is possible.

4. Lightweight: Certain parts of the design of a tool/framework might need assembling and reassembling, and Flask does not rely on a large number of extensions to function, which gives web developers a certain level of control. Further, Flask also supports modular programming, where its functionality can be split into several interchangeable modules; each module acts as an independent entity and executes a part of the functionality.
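For illustration, a minimal Flask app following the standard quick-start pattern (the route and message are arbitrary):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    return "Hello from a minimal Flask app!"

if __name__ == "__main__":
    app.run(debug=True)   # starts a local development server
```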

Q14: 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐭𝐡𝐞 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 𝐥𝐢𝐬𝐭𝐬, 𝐚𝐫𝐫𝐚𝐲𝐬, 𝐚𝐧𝐝 𝐬𝐞𝐭𝐬 𝐢𝐧 𝐏𝐲𝐭𝐡𝐨𝐧, 𝐚𝐧𝐝 𝐰𝐡𝐞𝐧 𝐲𝐨𝐮 𝐬𝐡𝐨𝐮𝐥𝐝
𝐮𝐬𝐞 𝐞𝐚𝐜𝐡 𝐨𝐟 𝐭𝐡𝐞𝐦?

Answer:

All three are data structures that can store sequences of data. but with some differences.

A list is denoted by [ ], a set by { }, and a tuple by ( ); NumPy arrays are created with np.array(...).

𝐋𝐢𝐬𝐭: a built-in data type in Python that stores data in a sequence, with a very rich API that allows insertion, removal, retrieval, and expansion. One of its benefits is that it allows many data types in the same list, as it can store strings, integers, floats, or any other derived objects. One of its cons is that it is very slow when used for numerical computation.

𝐀𝐫𝐫𝐚𝐲: an array, on the other hand, can only store a single data type, such as integers only or floats only. But unlike lists, it is very efficient in terms of speed and memory usage (NumPy is one of the best libraries implementing array operations; it is a very rich library that solves many problems in numerical computation, like vectorization, broadcasting, etc.).

𝐒𝐞𝐭: also a built-in data type in Python that can store mixed data types, but it does not allow duplicates: if duplicates are inserted, only one copy is kept. Sets provide a lot of methods like unions, differences, and intersections.
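A short illustration of the three (the values are arbitrary):

```python
import numpy as np

items_list = [1, "two", 3.0, 3.0]      # mixed types, duplicates allowed, mutable
items_set = {1, 3.0, 3.0, "two"}       # duplicates collapse: {1, 3.0, 'two'}
items_array = np.array([1, 2, 3, 4])   # single dtype, fast vectorized math

print(items_list, items_set, items_array * 2)
```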

Probability Interview Questions & Answers for Data Scientists


Questions
Q1: You and your friend are playing a game with a fair coin. The two of you will continue to toss the coin until the sequence HH or TH shows up. If HH shows up first, you win, and if TH shows up first, your friend wins. What is the probability of you winning the game?
Q2: If you roll a dice three times, what is the probability to get two consecutive threes?
Q3: Suppose you have ten fair dice. If you randomly throw them simultaneously, what is the probability that the sum of all of the top faces is
divisible by six?
Q4: If you have three draws from a uniformly distributed random variable between 0 and 2, what is the probability that the median of three
numbers is greater than 1.5?
Q5: Assume you have a deck of 100 cards with values ranging from 1 to 100 and you draw two cards randomly without replacement, what is
the probability that the number of one of them is double the other?
Q6: What is the difference between the Bernoulli and Binomial distribution?
Q7: If there are 30 people in a room, what is the probability that everyone has different birthdays?
Q8: Assume two coins, one fair and the other unfair. You pick one at random, flip it five times, and observe that it comes up as tails all five times. What is the probability that you are flipping the unfair coin?
Q9: Assume you take a stick of length 1 and you break it uniformly at random into three parts. What is the probability that the three pieces
can be used to form a triangle?
Q10: Say you draw a circle and choose two chords at random. What is the probability that those chords will intersect?
Q11: If there’s a 15% probability that you might see at least one airplane in a five-minute interval, what is the probability that you might see
at least one airplane in a period of half an hour?
Q12: Say you are given an unfair coin, with an unknown bias towards heads or tails. How can you generate fair odds using this coin?
Q13: According to hospital records, 75% of patients suffering from a disease die from that disease. Find out the probability that 4 out of the 6
randomly selected patients survive.
Q14: Discuss some methods you will use to estimate the Parameters of a Probability Distribution
Q15: You have 40 cards in four colors, 10 reds, 10 greens, 10 blues, and ten yellows. Each color has a number from 1 to 10. When you pick
two cards without replacement, what is the probability that the two cards are not in the same color and not in the same number?
Q16: Can you explain the difference between frequentist and Bayesian probability approaches?
Q17: Explain the Difference Between Probability and Likelihood

Questions & Answers


Q1: You and your friend are playing a game with a fair coin. The two of you will continue to toss the coin until the sequence HH or TH shows up. If HH shows up first, you win, and if TH shows up first, your friend wins. What is the probability of you winning the game?

Answer:

If T is ever flipped, you cannot then reach HH before your friend reaches TH. Therefore, the only way for you to win is to flip HH on the first two tosses. The sample space for the first two tosses is {HH, HT, TH, TT}, so the probability of you winning is 1/4 and the probability of your friend winning is 3/4.
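A quick Monte Carlo check of the 1/4 result (the number of simulated games is arbitrary):

```python
import random

def you_win_one_game():
    prev = random.choice("HT")
    while True:
        cur = random.choice("HT")
        if prev == "H" and cur == "H":
            return True      # HH appeared first: you win
        if prev == "T" and cur == "H":
            return False     # TH appeared first: your friend wins
        prev = cur

games = 100_000
print(sum(you_win_one_game() for _ in range(games)) / games)   # ~0.25
```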

Q2: If you roll a dice three times, what is the probability to get two consecutive threes?

The right answer is 11/216

There are different ways to answer this question:

Method 1: If we roll a die three times, we can get two consecutive 3s in three ways:

1. The first two rolls are 3s and the third is any other number, with a probability of 1/6 * 1/6 * 5/6.

2. The first roll is not a 3 while the other two rolls are 3s, with a probability of 5/6 * 1/6 * 1/6.

3. All three rolls are 3s, with a probability of (1/6)^3.

So the final result is 2 * (5/6 * (1/6)^2) + (1/6)^3 = 11/216

By Inclusion-Exclusion Principle:

Probability of at least two consecutive threes = Probability of two consecutive threes in first two rolls + Probability of two consecutive threes in last
two rolls - Probability of three consecutive threes

= 2 * Probability of two consecutive threes in first two rolls - Probability of three consecutive threes = 2 * (1/6) * (1/6) - (1/6) * (1/6) * (1/6) =
11/216

It can be seen also like this:

The sample space is made of (x, y, z) tuples where each letter can take a value from 1 to 6, therefore the sample space has 6x6x6=216 values, and
the number of outcomes that are considered two consecutive threes is (3,3, X) or (X, 3, 3), the number of possible outcomes is therefore 6 for the
first scenario (3,3,1) till (3,3,6) and 6 for the other scenario (1,3,3) till (6,3,3) and subtract the duplicate (3,3,3) which appears in both, and this
leaves us with a probability of 11/216.

Q3: Suppose you have ten fair dice. If you randomly throw them simultaneously, what is the probability that the sum
of all of the top faces is divisible by six?
Answer: 1/6

Explanation: With 10 dice, the possible sums divisible by 6 are 12, 18, 24, 30, 36, 42, 48, 54, and 60. You don't actually need to calculate the probability of getting each of these sums, because no matter what the sum of the first nine dice is, exactly one of the six possible values of the last die makes the final sum divisible by 6. Therefore, we only care about the last die, and the probability of getting that value on the last die is 1/6. So the answer is 1/6.

Q4: If you have three draws from a uniformly distributed random variable between 0 and 2, what is the probability
that the median of three numbers is greater than 1.5?

The right answer is 5/32 or 0.156. There are different methods to solve it:

Method 1:

To get a median greater than 1.5 at least two of the three numbers must be greater than 1.5. The probability of one number being greater than 1.5 in
this distribution is 0.25. Then, using the binomial distribution with three trials and a success probability of 0.25 we compute the probability of 2 or
more successes to get the probability of the median is more than 1.5, which would be about 15.6%.

Method 2:

A median greater than 1.5 will occur when either all three uniformly distributed random numbers are greater than 1.5, or exactly one of them is between 0 and 1.5 and the other two are greater than 1.5.

So, the probability of the above event is {(2 - 1.5) / 2}^3 + (3 choose 1) * (1.5/2) * (0.5/2)^2 = 10/64 = 5/32

Method 3:

Using the Monte Carlo method, as in the sketch below:
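A minimal Monte Carlo sketch (the sample count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
draws = rng.uniform(0, 2, size=(1_000_000, 3))
medians = np.median(draws, axis=1)

print((medians > 1.5).mean())   # ~0.156 = 5/32
```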

Q5: Assume you have a deck of 100 cards with values ranging from 1 to 100 and you draw two cards randomly without
replacement, what is the probability that the number of one of them is double the other?
There are a total of (100 C 2) = 4950 ways to choose two cards at random from the 100 cards and there are only 50 pairs of these 4950 ways that
you will get one number and it's double. Therefore the probability that the number of one of them is double the other is 50/4950.

Q6: What is the difference between the Bernoulli and Binomial distribution?

Answer:

Bernoulli and Binomial are both types of probability distributions.

The Bernoulli probability mass function is given by:

p(x) = p^x * q^(1-x), x ∈ {0, 1}

Mean: p

Variance: p*(1-p)

The Binomial probability mass function is given by:

p(x) = nCx * p^x * q^(n-x), x ∈ {0, 1, 2, ..., n}

Mean: np

Variance: npq

Where p and q are the probability of success and probability of failure respectively, n is the number of independent trials and x is the number of
successes.

As we can see, the sample space (x) for the Bernoulli distribution is binary (2 outcomes), and there is just a single trial.

Eg: A loan sanction for a person can be either a success or a failure, with no other possibility. (Hence single trial).

Whereas for the Binomial, the sample space (x) ranges from 0 to n.

E.g., tossing a coin 6 times, what is the probability of getting 2 or fewer heads?

Here the outcomes of interest are x = 0, 1, 2, there is more than one trial, and n = 6 (finite).

In short, Bernoulli Distribution is a single trial version of Binomial Distribution.

Q7: If there are 30 people in a room, what is the probability that everyone has different birthdays?

The sample space is 365^30 and the number of favorable outcomes is 365P30 (permutations), because we need to choose birthdays without replacement so that everyone has a unique birthday. Therefore
the Prob = 365P30 / 365^30 ≈ 0.2937

A theoretical explanation is provided in the figure below thanks to Fazil Mohammed.

Interesting facts provided by Rishi Dey Chowdhury:

1. With just 23 people there is over a 50% chance of a birthday match, and with 57 people the match probability exceeds 99%. One intuition for why
the probability of a match is so high with such a low number of people: a match only requires some pair of people to share a birthday, and
23 choose 2 = 23*11 = 253 is a relatively large number of pairs, so 50% is a plausible match probability for this case.

2. Another interesting fact: if the assumption of an equal birthday probability across all 365 days is violated, and birthdays are non-uniformly
distributed over the year, then a birthday match becomes even more likely.
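The exact birthday probabilities quoted above can be reproduced with a few lines of Python (a sketch using only the standard library):

```python
import math

def p_all_distinct(n_people, days=365):
    """Exact probability that n_people all have different birthdays."""
    return math.perm(days, n_people) / days ** n_people

print(round(p_all_distinct(30), 4))      # ~0.2937 -> everyone distinct with 30 people
print(round(1 - p_all_distinct(23), 4))  # ~0.5073 -> over 50% chance of a match with 23 people
```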

Q8: Assume two coins, one fair and the other is unfair. You pick one at random, flip it five times, and observe that it
comes up as tails all five times. What is the probability that you are flipping the unfair coin?

Answer:

Let's use Bayes' theorem. Let U denote the case where you are flipping the unfair coin and F denote the case where you are flipping the fair coin.
Since the coin is chosen randomly, we know that P(U) = P(F) = 0.5. Let 5T denote the event of flipping 5 tails in a row.

Then, we are interested in solving for P(U|5T), the probability that you are flipping the unfair coin given that you obtained 5 tails. Assuming the unfair
coin always lands on tails, P(5T|U) = 1, and P(5T|F) = 1/2^5 = 1/32 by the definition of a fair coin.

Applying Bayes' theorem: P(U|5T) = P(5T|U) * P(U) / [P(5T|U) * P(U) + P(5T|F) * P(F)] = 0.5 / (0.5 + 0.5 * 1/32) ≈ 0.97
Therefore the probability that you picked the unfair coin is about 97%.

Q9: Assume you take a stick of length 1 and you break it uniformly at random into three parts. What is the probability
that the three pieces can be used to form a triangle?
Answer: The right answer is 0.25

Let's say, x and y are the lengths of the two parts, so the length of the third part will be 1-x-y

As per the triangle inequality theorem, the sum of any two sides must always be greater than the third side. Equivalently, no piece can be longer
than 1/2, so x < 1/2 and y < 1/2.

Applying the same condition to the third piece gives x + y > 1 - x - y, i.e. x + y > 1/2.

Plotting these conditions in the (x, y) plane, the region of valid break points splits into 4 equal triangles, and only one of them satisfies all the above
conditions. Therefore, the probability is 1/4.

Q10: Say you draw a circle and choose two chords at random. What is the probability that those chords will intersect?

Answer: To form 2 chords, 4 points on the circle are needed, and 4 points can be paired into two chords in 3 different ways. Out of these 3 pairings,
exactly one results in intersecting chords, hence the answer is 1/3. If P1, P2, P3, and P4 lie in that order around the circle, the three possible pairings
are (P1 P2)(P3 P4), (P1 P3)(P2 P4), and (P1 P4)(P2 P3), and only (P1 P3)(P2 P4) intersects.

Q11: If there’s a 15% probability that you might see at least one airplane in a five-minute interval, what is the
probability that you might see at least one airplane in a period of half an hour?
Answer:

Probability of at least one plane in a 5-minute interval = 0.15, so the probability of no plane in a 5-minute interval = 0.85.
Probability of seeing at least one plane in 30 minutes = 1 - probability of not seeing any plane in 30 minutes = 1 - (0.85)^6 = 0.6228

Q12: Say you are given an unfair coin, with an unknown bias towards heads or tails. How can you generate fair odds
using this coin?

Answer:

A classic solution is von Neumann's trick: flip the coin twice. If the two flips differ, take the first flip as the result (HT counts as heads, TH counts as
tails); if they are the same (HH or TT), discard the pair and flip twice again. Since the flips are independent, P(HT) = P(TH) = p(1-p), so the two
accepted outcomes are equally likely regardless of the unknown bias, which yields fair 50/50 odds. The more biased the coin, the more flips are
needed on average.
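A minimal simulation sketch of this procedure (the 80% heads bias and the helper names below are illustrative assumptions, not part of the original question):

```python
import random

def biased_flip(p_heads, rng):
    """One flip of the biased coin: 'H' with probability p_heads, else 'T'."""
    return 'H' if rng.random() < p_heads else 'T'

def fair_flip(p_heads, rng):
    """Von Neumann's trick: flip twice, keep only HT/TH, otherwise retry."""
    while True:
        a, b = biased_flip(p_heads, rng), biased_flip(p_heads, rng)
        if a != b:
            return a   # HT -> 'H', TH -> 'T', each with probability p*(1-p)

rng = random.Random(1)
results = [fair_flip(0.8, rng) for _ in range(100_000)]
print(results.count('H') / len(results))   # ~0.5 even though the coin is 80% heads
```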
Q13: According to hospital records, 75% of patients suffering from a disease die from that disease. Find out the
probability that 4 out of the 6 randomly selected patients survive.
Answer: This has to be a binomial since there are only 2 outcomes – death or life.

Here n =6, and x=4.

p = 0.25 (probability of survival), q = 0.75 (probability of death)

Using the binomial probability mass function:

P(x) = nCx * p^x * q^(n-x)

Then:

P(4) = 6C4 * (0.25)^4 * (0.75)^2 ≈ 0.033
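This can be verified with scipy.stats (a sketch; 0.033 is the exact binomial value rounded to three decimals):

```python
from scipy.stats import binom

# P(exactly 4 of 6 patients survive), with survival probability p = 0.25
print(binom.pmf(4, n=6, p=0.25))   # ~0.0330
```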

Q14: Discuss some methods you will use to estimate the Parameters of a Probability Distribution

Answer:

Common methods for estimating the parameters of a probability distribution include:

1. Maximum Likelihood Estimation (MLE): choose the parameter values that maximize the likelihood (or log-likelihood) of the observed data under the assumed distribution.
2. Method of Moments: equate the theoretical moments of the distribution (mean, variance, ...) to the corresponding sample moments and solve for the parameters.
3. Bayesian estimation (e.g., MAP): place a prior distribution on the parameters and combine it with the likelihood to obtain a posterior; the posterior mode (MAP) or mean is then used as the estimate.

Q15: You have 40 cards in four colors, 10 reds, 10 greens, 10 blues, and ten yellows. Each color has a number from 1 to
10. When you pick two cards without replacement, what is the probability that the two cards are not in the same color
and not in the same number?
Answer:

Since it doesn't matter how you choose the first card, choose one card at random. Now, all we have to care about is the restriction on the second
card: it can't have the same number (which rules out the 3 cards with that number in the other colors) and it can't have the same color (which rules
out the 9 remaining cards of that color, keeping in mind we have already picked one).

So, out of the 39 remaining cards, 39 - 12 = 27 are favorable, and the probability is 27/39 = 9/13.
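A brute-force check over all pairs confirms the result (a small sketch using only the standard library):

```python
from itertools import combinations

# Deck: 4 colors x numbers 1..10
deck = [(color, number) for color in range(4) for number in range(1, 11)]

pairs = list(combinations(deck, 2))
favorable = [(a, b) for a, b in pairs
             if a[0] != b[0] and a[1] != b[1]]   # different color AND different number

print(len(favorable) / len(pairs))   # 0.6923... = 9/13
```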
Q16: Can you explain the difference between frequentist and Bayesian probability approaches?
Answer:

The frequentist approach to probability defines probability as the long-run relative frequency of an event in an infinite number of trials. It views
probabilities as fixed and objective, determined by the data at hand. In this approach, the parameters of a model are treated as fixed and unknown
and estimated using methods like maximum likelihood estimation.

On the other hand, Bayesian probability defines probability as a degree of belief, or the degree of confidence, in an event. It views probabilities as
subjective and personal, representing an individual's beliefs. In this approach, the parameters of a model are treated as random variables with prior
beliefs, which are updated as new data becomes available to form a posterior belief.

In summary, the frequentist approach treats probabilities as fixed and objective and relies on methods such as maximum likelihood estimation and
confidence intervals, while the Bayesian approach treats probabilities as subjective degrees of belief and works by updating prior beliefs with new
data to obtain posterior distributions.

Q17: Explain the Difference Between Probability and Likelihood

Probability and likelihood are two concepts that are often used in statistics and data analysis, but they have different meanings and uses.

Probability measures how likely an event is to occur. It is a number between 0 and 1, with 0 indicating an impossible event and 1
indicating a certain event. For example, the probability of flipping a fair coin and getting heads is 0.5.

The likelihood, on the other hand, is the measure of how well a statistical model or hypothesis fits a set of observed data. It is not a probability, but
rather a measure of how plausible the data is given the model or hypothesis. For example, if we have a hypothesis that the average height of people
in a certain population is 6 feet, the likelihood of observing a random sample of people with an average height of 5 feet would be low.

Machine Learning Interview Questions & Answers for Data Scientists

Questions
Q1: Mention three ways to make your model robust to outliers?
Q2: Describe the motivation behind random forests and mention two reasons why they are better than individual decision trees?
Q3: What are the differences and similarities between gradient boosting and random forest? and what are the advantage and disadvantages of
each when compared to each other?
Q4: What are L1 and L2 regularization? What are the differences between the two?
Q5: What are the Bias and Variance in a Machine Learning Model and explain the bias-variance trade-off?
Q6: Mention three ways to handle missing or corrupted data in a dataset?
Q7: Explain briefly the logistic regression model and state an example of when you have used it recently?
Q8: Explain briefly batch gradient descent, stochastic gradient descent, and mini-batch gradient descent? and what are the pros and cons for
each of them?
Q9: Explain what is information gain and entropy in the context of decision trees?
Q10: Explain the linear regression model and discuss its assumption?
Q11: Explain briefly the K-Means clustering and how can we find the best value of K?
Q12: Define Precision, recall, and F1 and discuss the trade-off between them?
Q13: What are the differences between a model that minimizes squared error and the one that minimizes the absolute error? and in which
cases each error metric would be more appropriate?
Q14: Define and compare parametric and non-parametric models and give two examples for each of them?
Q15: Explain the kernel trick in SVM and why we use it and how to choose what kernel to use?
Q16: Define the cross-validation process and the motivation behind using it?
Q17: You are building a binary classifier and you found that the data is imbalanced, what should you do to handle this situation?
Q18: You are working on a clustering problem, what are different evaluation metrics that can be used, and how to choose between them?
Q19: What is the ROC curve and when should you use it?
Q20: What is the difference between hard and soft voting classifiers in the context of ensemble learners?
Q21: What is boosting in the context of ensemble learners? Discuss two famous boosting methods.
Q22: How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Q23: Define the curse of dimensionality and how to solve it.
Q24: In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?
Q25: Discuss two clustering algorithms that can scale to large datasets
Q26: Do you need to scale your data if you will be using the SVM classifier? Discuss your answer.
Q27: What are Loss Functions and Cost Functions? Explain the key Difference Between them.
Q28: What is the importance of batches in machine learning, and explain some batch-dependent gradient descent algorithms?
Q29: What are the different methods to split a tree in a decision tree algorithm?
Q30: Why boosting is a more stable algorithm as compared to other ensemble algorithms?
Q31: What is active learning and discuss one strategy of it?
Q32: What are the different approaches to implementing recommendation systems?
Q33: What are the evaluation metrics that can be used for multi-label classification?
Q34: What is the difference between concept and data drift and how to overcome each of them?
Q35: Can you explain the ARIMA model and its components?
Q36: What are the assumptions made by the ARIMA model?

Questions & Answers


Q1: Mention three ways to make your model robust to outliers?

Investigating the outliers is always the first step in understanding how to treat them. After you understand the nature of why the outliers occurred
you can apply one of the several methods mentioned here.

Q2: Describe the motivation behind random forests and mention two reasons why they are better than individual
decision trees?

The motivation behind random forests, or ensemble models in general, in layman's terms: let's say we have a question/problem to solve, we bring 100
people, ask each of them the question/problem, and record their solutions. The rest of the answer is here

Q3: What are the differences and similarities between gradient boosting and random forest? and what are the
advantage and disadvantages of each when compared to each other?
Similarities:

1. Both these algorithms are decision-tree based algorithms


2. Both these algorithms are ensemble algorithms
3. Both are flexible models and do not need much data preprocessing.

The rest of the answer is here

Q4: What are L1 and L2 regularization? What are the differences between the two?

Answer:

Regularization is a technique used to avoid overfitting by trying to make the model simpler. The rest of the answer is here

Q5: What are the Bias and Variance in a Machine Learning Model and explain the bias-variance trade-off?
Answer:

The goal of any supervised machine learning model is to estimate the mapping function (f) that predicts the target variable (y) given input (x). The
prediction error can be broken down into three parts: The rest of the answer is here

Q6: Mention three ways to handle missing or corrupted data in a dataset?

Answer:

In general, real-world data often has a lot of missing values. The cause of missing values can be data corruption or failure to record data. The rest of
the answer is here

Q7: Explain briefly the logistic regression model and state an example of when you have used it recently?

Answer:

Logistic regression is used to calculate the probability of occurrence of an event in the form of a dependent output variable based on independent
input variables. Logistic regression is commonly used to estimate the probability that an instance belongs to a particular class. If the probability is
bigger than 0.5 then it will belong to that class (positive) and if it is below 0.5 it will belong to the other class. This will make it a binary classifier.

It is important to remember that logistic regression by itself isn't a classification model; it is an ordinary regression algorithm that was
developed and used before machine learning, but it can be used for classification when we apply a threshold to its output to determine specific categories.

There are many classification applications for it: classifying emails as spam or not, identifying whether a patient is healthy or not, and so on.

Q8: Explain briefly batch gradient descent, stochastic gradient descent, and mini-batch gradient descent? and what are
the pros and cons for each of them?
Gradient descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of gradient
descent is to tweak parameters iteratively in order to minimize a cost function.

Batch Gradient Descent: In Batch Gradient descent the whole training data is used to minimize the loss function by taking a step towards the nearest
minimum by calculating the gradient (the direction of descent)

Pros: Since the whole data set is used to calculate the gradient, it will be stable and reach the minimum of the cost function without bouncing (if the
learning rate is chosen correctly).

Cons:

Since batch gradient descent uses all the training set to compute the gradient at every step, it will be very slow especially if the size of the training
data is large.

Stochastic Gradient Descent:

Stochastic Gradient Descent picks up a random instance in the training data set at every step and computes the gradient-based only on that single
instance.

Pros:

1. It makes the training much faster as it only works on one instance at a time.
2. It becomes easier to train on large datasets.

Cons:

Due to the stochastic (random) nature of this algorithm, it is much less regular than batch gradient descent. Instead of gently
decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close
to the minimum, but once it gets there it will continue to bounce around and never settle down. So once the algorithm stops, the final parameters are
good but not optimal. For this reason, it is important to use a learning rate schedule to overcome this randomness.

Mini-batch Gradient:

At each step instead of computing the gradients on the whole data set as in the Batch Gradient Descent or using one random instance as in the
Stochastic Gradient Descent, this algorithm computes the gradients on small random sets of instances called mini-batches.

Pros:

1. The algorithm's progress in parameter space is less erratic than with Stochastic Gradient Descent, especially with large mini-batches.
2. You can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

Cons:

1. It might be difficult to escape from local minima.

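To make the three variants concrete, here is a small NumPy sketch of mini-batch gradient descent for linear regression on synthetic data (everything below, including the data and hyperparameters, is an illustrative assumption; setting batch_size to len(X) gives batch gradient descent, and batch_size=1 gives stochastic gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def minibatch_gd(X, y, lr=0.05, batch_size=32, epochs=20):
    """Mini-batch gradient descent on the MSE loss of a linear model."""
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                           # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 / len(batch) * Xb.T @ (Xb @ w - yb)   # gradient of the batch MSE
            w -= lr * grad
    return w

print(minibatch_gd(X, y))   # close to [2.0, -1.0, 0.5]
```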

Q9: Explain what is information gain and entropy in the context of decision trees?

Entropy and Information Gain are two key metrics used when constructing a decision tree model: they determine how relevant a feature is for the
decision, which feature to place at each node, and the best way to split.

The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until we reach a small enough set that
contains data points that fall under one label.

Entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples. Entropy controls how a Decision Tree decides to split the data.
Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way. It is commonly used in the construction of
decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the
information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.
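A small sketch of how entropy and information gain can be computed for a candidate split (the example labels below are made up):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Reduction in entropy obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])
print(entropy(parent))                        # 1.0 bit for a 50/50 class mix
print(information_gain(parent, left, right))  # ~0.19 bits gained by this split
```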

Q10: Explain the linear regression model and discuss its assumption?
Linear regression is a supervised statistical model to predict dependent variable quantity based on independent variables. Linear regression is a
parametric model and the objective of linear regression is that it has to learn coefficients using the training data and predict the target value given
only independent values.

Some of the linear regression assumptions and how to validate them:

1. Linear relationship between the independent and dependent variables.

2. Independent residuals with constant variance (homoscedasticity) at every x. We can check for 1 and 2 by plotting the residuals (error terms) against the fitted
values (upper-left graph). Generally, we should look for a lack of patterns and a consistent variance across the horizontal line.
3. Normally distributed residuals. We can check for this using a couple of methods:

Q-Q plot (upper-right graph): if the data is normally distributed, points should roughly align with the 45-degree line.
Boxplot: it also helps visualize outliers.
Shapiro–Wilk test: if the p-value is lower than the chosen threshold, then the null hypothesis (data is normally distributed) is rejected.
4. Low multicollinearity.

You can calculate the VIF (Variance Inflation Factor) using your favorite statistical tool. If the value for each covariate is lower than 10
(some say 5), you're good to go.

The figure below summarizes these assumptions.
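As a sketch, the normality and multicollinearity checks can be run with statsmodels and scipy on a fitted OLS model (the synthetic data below is an assumption for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import shapiro

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()

# Normality of residuals: a small Shapiro-Wilk p-value would reject normality
print(shapiro(model.resid))

# Multicollinearity: VIF per feature (rule of thumb: below 5-10 is acceptable)
for i in range(1, X_const.shape[1]):        # skip the constant column
    print(variance_inflation_factor(X_const, i))
```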

Q11: Explain briefly the K-Means clustering and how can we find the best value of K?
K-Means is a well-known clustering algorithm. K-Means clustering is often used because it is easy to interpret and implement. The rest of the
answer is here

Q12: Define Precision, recall, and F1 and discuss the trade-off between them?

Precision and recall are two classification evaluation metrics that are used beyond accuracy. The rest of the answer is here

Q13: What are the differences between a model that minimizes squared error and the one that minimizes the absolute
error? and in which cases each error metric would be more appropriate?
Both the mean squared error (MSE) and the mean absolute error (MAE) measure the average distance between predictions and targets. MAE is expressed
in the units of the target variable, while MSE is in squared units (its square root, RMSE, brings it back to the target's units). Both can range from 0 to
infinity, and the lower they are, the better the model.

The main difference between them is that in MSE the errors are squared before being averaged, while in MAE they are not. This means that a large
weight is given to large errors, so MSE is useful when large errors in the model are to be avoided. It also means that outliers affect MSE
more than MAE, which is why MAE is more robust to outliers. Computation-wise, MSE is easier to use as its gradient calculation is more
straightforward than MAE's, whose minimization typically requires techniques such as linear programming or subgradient methods.

Q14: Define and compare parametric and non-parametric models and give two examples for each of them?

Answer:

Parametric models assume that the dataset comes from a certain function with some set of parameters that should be tuned to reach the optimal
performance. For such models, the number of parameters is determined prior to training, thus the degree of freedom is limited, and reduces the
chances of overfitting.

Ex. Linear Regression, Logistic Regression, LDA

Nonparametric models don't assume anything about the function from which the dataset was sampled. For these models, the number of parameters
is not determined prior to training, thus they are free to generalize the model based on the data. Sometimes these models overfit themselves while
generalizing. To generalize they need more data in comparison with Parametric Models. They are relatively more difficult to interpret compared to
Parametric Models.

Ex. Decision Tree, Random Forest.

Q15: Explain the kernel trick in SVM and why we use it and how to choose what kernel to use?

Answer: Kernels are used in SVM to map the original input data into a particular higher dimensional space where it will be easier to find patterns in
the data and train the model with better performance.

For example: if we have binary-class data which forms a ring-like pattern (inner and outer rings representing two different class instances) when plotted in
2D space, a linear SVM kernel will not be able to differentiate the two classes well compared to an RBF (radial basis function) kernel, which
maps the data into a particular higher-dimensional space where the two classes are clearly separable.

Typically without the kernel trick, in order to calculate support vectors and support vector classifiers, we need first to transform data points one by
one to the higher dimensional space, and do the calculations based on SVM equations in the higher dimensional space, then return the results. The
‘trick’ in the kernel trick is that we design the kernels based on some conditions as mathematical functions that are equivalent to a dot product in the
higher dimensional space without even having to transform data points to the higher dimensional space. i.e we can calculate support vectors and
support vector classifiers in the same space where the data is provided which saves a lot of time and calculations.

Having domain knowledge can be very helpful in choosing the optimal kernel for your problem; however, in the absence of such knowledge
the following default rule can be helpful: for linearly separable problems try a linear kernel, and for nonlinear problems the RBF (Gaussian)
kernel is a good default.


Q16: Define the cross-validation process and the motivation behind using it?

Cross-validation is a technique used to assess the performance of a learning model on several subsamples of the training data. In general, we split the
data into a train set used to train the model, a test set used to evaluate the performance of the model on unseen data, and a validation set used for
choosing the best hyperparameters. A random split is fine in most cases (for large datasets), but for smaller datasets it is susceptible to losing
important information that the model was never trained on. Hence cross-validation, though computationally a bit more expensive, combats this issue.

The process of cross-validation is as the following:

1. Define k, the number of folds.
2. Randomly shuffle the data and split it into k equally-sized blocks (folds).
3. For each fold i from 1 to k, train the model on all the folds except fold i and test it on fold i.
4. Average the k validation/test errors from the previous step to get an estimate of the error.

This process aims to accomplish the following:

1. Prevent overfitting during training by avoiding training and testing on the same subset of the data points.

2. Avoid information loss by using a certain subset of the data for validation only. This is important for small datasets.

Cross-validation is especially useful for small datasets; if used for large datasets, the computational cost will increase depending
on the number of folds.

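A minimal scikit-learn sketch of k-fold cross-validation (the dataset and model choice are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, evaluate on the held-out fold, repeat, then average
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())
```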

Q17: You are building a binary classifier and you found that the data is imbalanced, what should you do to handle this
situation?
Answer: If there is a data imbalance there are several measures we can take to train a fairer binary classifier:

1. Pre-Processing:

Check whether you can get more data or not.

Use sampling techniques (Up Sample minority class, Downsample majority class, can take the hybrid approach as well). We can also use data
augmentation to add more data points for the minority class but with little deviations/changes leading to new data points which are similar to
the ones they are derived from. The most common/popular technique is SMOTE (Synthetic Minority Oversampling technique)

Suppression: Though not recommended, we can drop off some features directly responsible for the imbalance.

Learning Fair Representation: Projecting the training examples to a subspace or plane minimizes the data imbalance.

Re-Weighting: We can assign some weights to each training example to reduce the imbalance in the data.

2. In-Processing:

Regularisation: We can add score terms that measure the data imbalance in the loss function and therefore minimizing the loss function will
also minimize the degree of imbalance with respect to the score chosen which also indirectly minimizes other metrics which measure the
degree of data imbalance.

Adversarial Debiasing: Here we use the adversarial notion to train the model where the discriminator tries to detect if there are signs of data
imbalance in the predicted data by the generator and hence the generator learns to generate data that is less prone to imbalance.

3. Post-Processing:

Odds Equalization: Here we try to equalize the odds across the classes with respect to which the data is imbalanced, to correct for the imbalance in the trained model.

Choose appropriate performance metrics. For example, accuracy is not a correct metric to use when classes are imbalanced; instead use
precision, recall, the F1 score, and the ROC curve. Usually, the F1 score is a good choice if both precision and recall are important.


Q18: You are working on a clustering problem, what are different evaluation metrics that can be used, and how to
choose between them?

Answer:

Clusters are evaluated based on some similarity or dissimilarity measure such as the distance between cluster points. If the clustering algorithm
separates dissimilar observations apart and similar observations together, then it has performed well. The two most popular metrics evaluation
metrics for clustering algorithms are the 𝐒𝐢𝐥𝐡𝐨𝐮𝐞𝐭𝐭𝐞 𝐜𝐨𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 and 𝐃𝐮𝐧𝐧’𝐬 𝐈𝐧𝐝𝐞𝐱.

𝐒𝐢𝐥𝐡𝐨𝐮𝐞𝐭𝐭𝐞 𝐜𝐨𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭

The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: The mean distance between a sample and all other points in the same cluster.
b: The mean distance between a sample and all other points in the next nearest cluster.

S = (b - a) / max(a, b)

The 𝐒𝐢𝐥𝐡𝐨𝐮𝐞𝐭𝐭𝐞 𝐜𝐨𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 for a set of samples is given as the mean of the Silhouette Coefficient for each sample. The score is bounded
between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters. The score is higher when
clusters are dense and well separated, which relates to a standard concept of a cluster.

Dunn’s Index

Dunn’s Index (DI) is another metric for evaluating a clustering algorithm. Dunn’s Index is equal to the minimum inter-cluster distance divided by
the maximum cluster size (i.e., the largest intra-cluster distance, or cluster diameter). Note that large inter-cluster distances (better separation) and
smaller cluster diameters (more compact clusters) lead to a higher DI value. A higher DI implies better clustering: it assumes that better clustering
means clusters that are compact and well separated from other clusters.

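A short scikit-learn sketch of using the silhouette coefficient to compare cluster counts (the blob data is a made-up example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # the score should peak near k = 4
```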

Q19: What is the ROC curve and when should you use it?
Answer:

The ROC curve (Receiver Operating Characteristic curve) is a graphical representation of a classifier's performance in which we plot the True Positive Rate
(TPR) against the False Positive Rate (FPR) for different threshold values between 0 and 1 applied to the model's output.
The ROC curve is mainly used to compare two or more models, as shown in the figure below. A reasonable model achieves a TPR that is higher than its FPR
at most thresholds, so its curve hugs the upper-left corner of the unit square spanned by the TPR and FPR axes.

The larger the AUC (area under the curve) of a model's ROC curve, the better the model is at trading off TPR against FPR.

Here are some benefits of using the ROC Curve :

Can help prioritize either true positives or true negatives depending on your case study (Helps you visually choose the best hyperparameters
for your case)

Can be very insightful when we have unbalanced datasets

Can be used to compare different ML models by calculating the area under the ROC curve (AUC)

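A small scikit-learn sketch of computing the ROC curve and AUC for a binary classifier (the imbalanced synthetic dataset is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]           # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, scores))
```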

Q20: What is the difference between hard and soft voting classifiers in the context of ensemble learners?
Answer:

Hard Voting: We take into account the class predictions for each classifier and then classify an input based on the maximum votes to a
particular class.

Soft Voting: We take into account the probability predictions for each class by each classifier and then classify an input to the class with
maximum probability based on the average probability (averaged over the classifier's probabilities) for that class.


Q21: What is boosting in the context of ensemble learners? Discuss two famous boosting methods.

Answer:

Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is
to train predictors sequentially, each trying to correct its predecessor.

There are many boosting methods available, but by far the most popular are:

Adaptive Boosting: One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the
predecessor under-fitted. This results in new predictors focusing more and more on the hard cases.
Gradient Boosting: Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting works by
sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at
every iteration as AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Q22: How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Answer:

Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much
information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality
reduction algorithms provide a reverse transformation.

Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest
classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information,
then the algorithm should perform just as well as when using the original dataset.

Q23: Define the curse of dimensionality and how to solve it.


Answer: The curse of dimensionality refers to the situation where the amount of data is too small for the dimensionality of the space it is represented in: the data
becomes highly scattered in that high-dimensional space, and it becomes more probable that we overfit it. If we increase the number of
features, we implicitly increase model complexity, and if we increase model complexity we need more data.

Possible solutions are to remove irrelevant features (features that do not discriminate between the classes, are highly correlated with other features, or do not result in much improvement). To do so, we can use:

Feature selection (select the most important features).

Feature extraction (transform the current features into a lower dimensionality while preserving as much information as possible, e.g., with
PCA).

Q24: In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?

Answer:

Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don't fit in memory, but
it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks when
you need to apply PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce
dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.

Q25: Discuss two clustering algorithms that can scale to large datasets
Answer:

Minibatch Kmeans: Instead of using the full dataset at each iteration, the algorithm is capable of using mini-batches, moving the centroids just
slightly at each iteration. This speeds up the algorithm typically by a factor of 3 or 4 and makes it possible to cluster huge datasets that do not fit in
memory. Scikit-Learn implements this algorithm in the MiniBatchKMeans class.

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first
generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered
instead of clustering the larger dataset.

Q26: Do you need to scale your data if you will be using the SVM classifier? Discuss your answer.

Answer: Yes, feature scaling is required for SVM and all margin-based classifiers since the optimal hyperplane (the decision boundary) is
dependent on the scale of the input features. In other words, the distance between two observations will differ for scaled and non-scaled cases,
leading to different models being generated.

This can be seen in the figure below, when the features have different scales, we can see that the decision boundary and the support vectors are only
classifying the X1 features without taking into consideration the X0 feature, however after scaling the data to the same scale the decision
boundaries and support vectors are looking much better and the model is taking into account both features.

To scale the data, normalization and standardization are the most popular approaches.
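A scikit-learn sketch comparing an RBF SVM with and without feature scaling (the breast cancer dataset is used only because its features have very different scales):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

unscaled = SVC(kernel="rbf")
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

print(cross_val_score(unscaled, X, y, cv=5).mean())  # noticeably lower accuracy
print(cross_val_score(scaled, X, y, cv=5).mean())    # scaling clearly helps the RBF SVM
```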

Q27: What are Loss Functions and Cost Functions? Explain the key Difference Between them.

Answer: The loss function is the measure of performance of the model on a single training example, whereas the cost function is the average loss
function over all training examples or across the batch in the case of mini-batch gradient descent.

Some examples of loss functions are Mean Squared Error, Binary Cross Entropy, etc.

Whereas, the cost function is the average of the above loss functions over training examples.
Q28: What is the importance of batches in machine learning, and explain some batch-dependent gradient descent algorithms?

Answer: A dataset can be loaded into memory either completely at once or in batches. If the dataset is huge, loading the
whole data into memory at once will reduce the training speed (or not fit at all), hence the term batch was introduced.

Example: an image dataset contains 100,000 images. We can load it as 3,125 batches where 1 batch = 32 images. So instead of loading all
100,000 images into memory, we can load 32 images 3,125 times, which requires much less memory.

In summary, a batch is important in two ways: (1) Efficient memory consumption. (2) Improve training speed.

There are 3 types of gradient descent algorithms based on batch size: (1) Stochastic gradient descent (2) Batch gradient descent (3) Mini Batch
gradient descent

If the whole dataset forms a single batch, it is called batch gradient descent. If each batch consists of a single data point (i.e., the number of batches
equals the number of data instances), it is called stochastic gradient descent. If the batch size is greater than 1 but smaller than the size of the whole
dataset, it is known as mini-batch gradient descent.

Q29: What are the different methods to split a tree in a decision tree algorithm?

Answer:

Decision trees can be of two types: regression and classification. For classification, using classification accuracy directly as a splitting criterion is
unstable, so the following criteria are used instead:

Gini's Index Gini impurity is used to predict the likelihood of a randomly chosen example being incorrectly classified by a particular node.
It’s referred to as an “impurity” measure because it demonstrates how the model departs from a simple division.

Cross-Entropy or Information Gain Information gain refers to the process of identifying the most important features/attributes that convey the
most information about a class. The entropy principle is followed with the goal of reducing entropy from the root node to the leaf nodes.
Information gain is the difference in entropy before and after splitting, which describes the impurity of in-class items.

For regression, the good old mean squared error serves as a good loss function which is minimized by splits of the input features and predicting the
mean value of the target feature on the subspaces resulting from the split. But finding the split that results in the minimum possible residual sum of
squares is computationally infeasible, so a greedy top-down approach is taken i.e. the splits are made at a level from top to down which results in
maximum reduction of RSS. We continue this until some max depth or number of leaves is attained.

Q30: Why boosting is a more stable algorithm as compared to other ensemble algorithms?

Answer:

Boosting algorithms keep focusing on the errors made in previous iterations until those errors have been reduced as much as possible, whereas in
bagging there is no such corrective loop. That’s why boosting is considered a more stable algorithm compared to other ensemble algorithms.

Q31: What is active learning and discuss one strategy of it?

Answer: Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other
information source) to label new data points with the desired outputs. In statistics literature, it is sometimes referred to as optimal experimental
design.

1. Stream-based sampling In stream-based selective sampling, unlabelled data is continuously fed to an active learning system, where the
learner decides whether to send the same to a human oracle or not based on a predefined learning strategy. This method is apt in scenarios
where the model is in production and the data sources/distributions vary over time.

2. Pool-based sampling In this case, the data samples are chosen from a pool of unlabelled data based on the informative value scores and sent
for manual labeling. Unlike stream-based sampling, oftentimes, the entire unlabelled dataset is scrutinized for the selection of the best
instances.
Q32: What are the different approaches to implementing recommendation systems?
Answer:

1. 𝐂𝐨𝐧𝐭𝐞𝐧𝐭-𝐁𝐚𝐬𝐞𝐝 𝐅𝐢𝐥𝐭𝐞𝐫𝐢𝐧𝐠: Content-Based Filtering depends on similarities of items and users' past activities on the website to
recommend any product or service.

This filter helps in avoiding a cold start for any new products as it doesn't rely on other users' feedback; it can recommend products based on
similarity factors. However, content-based filtering needs a lot of domain knowledge for the recommendations to be accurate.

2. 𝐂𝐨𝐥𝐥𝐚𝐛𝐨𝐫𝐚𝐭𝐢𝐯𝐞-𝐁𝐚𝐬𝐞𝐝 𝐅𝐢𝐥𝐭𝐞𝐫𝐢𝐧𝐠: The primary job of a collaborative filtering system is to overcome the shortcomings of content-
based filtering.

So, instead of focusing on just one user, the collaborative filtering system focuses on all the users and clusters them according to their interests.

Basically, it recommends a product 'x' to user 'a' based on the interest of user 'b'; users 'a' and 'b' must have had similar interests in the past, which is
why they are clustered together.

The domain knowledge that is required for collaborative filtering is less, recommendations made are more accurate and it can adapt to the changing
tastes of users over time. However, collaborative filtering faces the problem of a cold start as it heavily relies on feedback or activity from other
users.

3. 𝐇𝐲𝐛𝐫𝐢𝐝 𝐟𝐢𝐥𝐭𝐞𝐫𝐢𝐧𝐠: A mixture of content and collaborative methods. Uses descriptors and interactions.

More modern approaches typically fall into the hybrid filtering category and tend to work in two stages:

1. A candidate generation phase, where we coarsely narrow a corpus of hundreds of thousands, millions, or billions of items
down to a few hundred or thousand candidates.

2. A ranking phase, where we re-rank the candidates into a final top-n set to be shown to the user. Some systems employ multiple candidate
generation methods and rankers.

Q33: What are the evaluation metrics that can be used for multi-label classification?

Answer:

Multi-label classification is a type of classification problem where each instance can be assigned to multiple classes or labels simultaneously.

The evaluation metrics for multi-label classification are designed to measure the performance of a multi-label classifier in predicting the correct set
of labels for each instance. Some commonly used evaluation metrics for multi-label classification are:

1. Hamming Loss: Hamming Loss is the fraction of labels that are incorrectly predicted. It is defined as the average number of labels that are
predicted incorrectly per instance.

2. Accuracy: Accuracy is the fraction of instances that are correctly predicted. In multi-label classification, accuracy is calculated as the
percentage of instances for which all labels are predicted correctly.
3. Precision, Recall, F1-Score: These metrics can be applied to each label separately, treating the classification of each label as a separate binary
classification problem. Precision measures the proportion of predicted positive labels that are correct, recall measures the proportion of actual
positive labels that are correctly predicted, and F1-score is the harmonic mean of precision and recall.

4. Macro-F1, Micro-F1: Macro-F1 and Micro-F1 are two types of F1-score metrics that take into account the label imbalance in the dataset.
Macro-F1 calculates the F1-score for each label and then averages them, while Micro-F1 calculates the overall F1-score by aggregating the
true positive, false positive, and false negative counts across all labels.

There are other metrics that can be used such as:

Precision at k (P@k)
Average precision at k (AP@k)
Mean average precision at k (MAP@k)
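A short scikit-learn sketch of these metrics on a tiny multi-label example (the label matrices below are made up):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# 4 instances, 3 possible labels (1 = label present)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

print(hamming_loss(y_true, y_pred))               # fraction of individual labels predicted wrongly
print(accuracy_score(y_true, y_pred))             # exact-match (subset) accuracy
print(f1_score(y_true, y_pred, average="macro"))  # per-label F1, then averaged
print(f1_score(y_true, y_pred, average="micro"))  # F1 from global TP/FP/FN counts
```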

Q34: What is the difference between concept and data drift and how to overcome each of them?
Answer:

Concept drift and data drift are two different types of problems that can occur in machine learning systems.

Concept drift refers to changes in the underlying relationships between the input data and the target variable over time. This means that the
distribution of the data that the model was trained on no longer matches the distribution of the data it is being tested on. For example, a spam filter
model that was trained on emails from several years ago may not be as effective at identifying spam emails from today because the language and
tactics used in spam emails may have changed.

Data drift, on the other hand, refers to changes in the input data itself over time. This means that the values of the input features that the model was
trained on no longer match the values of the input features in the data it is being tested on. For example, a model that was trained on data from a
particular geographical region may not be as effective at predicting outcomes for data from a different region.

To overcome concept drift, one approach is to use online learning methods that allow the model to adapt to new data as it arrives. This involves
continually training the model on the most recent data while using historical data to maintain context. Another approach is to periodically retrain the
model using a representative sample of the most recent data.

To overcome data drift, one approach is to monitor the input data for changes and retrain the model when significant changes are detected. This may
involve setting up a monitoring system that alerts the user when the data distribution changes beyond a certain threshold.

Another approach is to preprocess the input data to remove or mitigate the effects of the features changing over time so that the model can continue
learning from the remaining features.
Q35: Can you explain the ARIMA model and its components?
Answer: The ARIMA model, which stands for Autoregressive Integrated Moving Average, is a widely used time series forecasting model. It
combines three key components: Autoregression (AR), Differencing (I), and Moving Average (MA).

Autoregression (AR): The autoregressive component captures the relationship between an observation in a time series and a certain number
of lagged observations. It assumes that the value at a given time depends linearly on its own previous values. The "p" parameter in ARIMA(p,
d, q) represents the order of autoregressive terms. For example, ARIMA(1, 0, 0) refers to a model with one autoregressive term.

Differencing (I): Differencing is used to make a time series stationary by removing trends or seasonality. It calculates the difference between
consecutive observations to eliminate any non-stationary behavior. The "d" parameter in ARIMA(p, d, q) represents the order of differencing.
For instance, ARIMA(0, 1, 0) indicates that differencing is applied once.

Moving Average (MA): The moving average component takes into account the dependency between an observation and a residual error from
a moving average model applied to lagged observations. It assumes that the value at a given time depends linearly on the error terms from
previous time steps. The "q" parameter in ARIMA(p, d, q) represents the order of the moving average terms. For example, ARIMA(0, 0, 1)
signifies a model with one moving average term.

By combining these three components, the ARIMA model can capture both autoregressive patterns, temporal dependencies, and stationary behavior
in a time series. The parameters p, d, and q are typically determined through techniques like the Akaike Information Criterion (AIC) or Bayesian
Information Criterion (BIC).

It's worth noting that there are variations of the ARIMA model, such as SARIMA (Seasonal ARIMA), which incorporates additional seasonal
components for modeling seasonal patterns in the data.

ARIMA models are widely used in forecasting applications, but they do make certain assumptions about the underlying data, such as linearity and
stationarity. It's important to validate these assumptions and adjust the model accordingly if they are not met.
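A minimal statsmodels sketch of fitting an ARIMA(1, 0, 0) model and forecasting (the synthetic AR(1) series below is an illustrative assumption):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic AR(1)-like series for illustration
rng = np.random.default_rng(0)
noise = rng.normal(size=200)
values = np.zeros(200)
for t in range(1, 200):
    values[t] = 0.7 * values[t - 1] + noise[t]
series = pd.Series(values)

model = ARIMA(series, order=(1, 0, 0))   # (p, d, q): one AR term, no differencing, no MA term
result = model.fit()
print(result.params)                     # estimated AR coefficient should be near 0.7
print(result.forecast(steps=5))          # forecast the next 5 points
```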
Q36: What are the assumptions made by the ARIMA model?
Answer:

The ARIMA model makes several assumptions about the underlying time series data. These assumptions are important to ensure the validity and
accuracy of the model's results. Here are the key assumptions:

Stationarity: The ARIMA model assumes that the time series is stationary. Stationarity means that the statistical properties of the data, such as the
mean and variance, remain constant over time. This assumption is crucial for the autoregressive and moving average components to hold. If the
time series is non-stationary, differencing (the "I" component) is applied to transform it into a stationary series.

Linearity: The ARIMA model assumes that the relationship between the observations and the lagged values is linear. It assumes that the future
values of the time series can be modeled as a linear combination of past values and error terms.

No Autocorrelation in Residuals: The ARIMA model assumes that the residuals (the differences between the predicted values and the actual values)
do not exhibit any autocorrelation. In other words, the errors are not correlated with each other.

Normally Distributed Residuals: The ARIMA model assumes that the residuals follow a normal distribution with a mean of zero. This assumption
is necessary for statistical inference, parameter estimation, and hypothesis testing.

It's important to note that while these assumptions are commonly made in ARIMA modeling, they may not always hold in real-world scenarios. It's
essential to assess the data and, if needed, apply transformations or consider alternative models that relax some of these assumptions. Additionally,
diagnostics tools, such as residual analysis and statistical tests, can help evaluate the adequacy of the assumptions and the model's fit to the data.

Deep Learning Interview Questions for Data Scientists


Questions
Deep Neural Networks
Q1: What are autoencoders? Explain the different layers of autoencoders and mention three practical usages of them?
Q2: What is an activation function and discuss the use of an activation function? Explain three different types of activation functions?
Q3: You are using a deep neural network for a prediction task. After training your model, you notice that it is strongly overfitting the training
set and that the performance on the test isn’t good. What can you do to reduce overfitting?
Q4: Why should we use Batch Normalization?
Q5: How to know whether your model is suffering from the problem of Exploding Gradients?
Q6: Can you name and explain a few hyperparameters used for training a neural network?
Q7: Can you explain the parameter sharing concept in deep learning?
Q8: Describe the architecture of a typical Convolutional Neural Network (CNN)?
Q9: What is the Vanishing Gradient Problem in Artificial Neural Networks and How to fix it?
Q10: When it comes to training an artificial neural network, what could be the reason why the loss doesn't decrease in a few epochs?
Q11: Why Sigmoid or Tanh is not preferred to be used as the activation function in the hidden layer of the neural network?
Q12: Discuss in what context it is recommended to use transfer learning and when it is not.
Q13: Discuss the vanishing gradient in RNN and How they can be solved.
Q14: What are the main gates in LSTM and what are their tasks?
Q15: Is it a good idea to use CNN to classify 1D signals?
Q16: How does L1/L2 regularization affect a neural network?
Q17: 𝐇𝐨𝐰 𝐰𝐨𝐮𝐥𝐝 𝐲𝐨𝐮 𝐜𝐡𝐚𝐧𝐠𝐞 𝐚 𝐩𝐫𝐞-𝐭𝐫𝐚𝐢𝐧𝐞𝐝 𝐧𝐞𝐮𝐫𝐚𝐥 𝐧𝐞𝐭𝐰𝐨𝐫𝐤 𝐟𝐫𝐨𝐦 𝐜𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐭𝐨 𝐫𝐞𝐠𝐫𝐞𝐬𝐬𝐢𝐨𝐧?
Q18: What might happen if you set the momentum hyperparameter too close to 1 (e.g., 0.9999) when using an SGD optimizer?
Q19: What are the hyperparameters that can be optimized for the batch normalization layer?
Q20: What is the effect of dropout on the training and prediction speed of your deep learning model?
Q21: What is the advantage of deep learning over traditional machine learning?
Q22: What is a depthwise Separable layer and what are its advantages?

Natural Language Processing


Q23: What is a transformer architecture, and why is it widely used in natural language processing tasks?
Q24: Explain the key components of a transformer model.
Q25: What is self-attention, and how does it work in transformers?
Q26: What are the advantages of transformers over traditional sequence-to-sequence models?
Q27: How does the attention mechanism help transformers capture long-range dependencies in sequences?
Q28: What are the limitations of transformers, and what are some potential solutions?
Q29: How are transformers trained, and what is the role of pre-training and fine-tuning?
Q30: What is BERT (Bidirectional Encoder Representations from Transformers), and how does it improve language understanding tasks?
Q31: Describe the process of generating text using a transformer-based language model.
Q32: What are some challenges or ethical considerations associated with large language models?
Q33: Explain the concept of transfer learning and how it can be applied to transformers.
Q34: How can transformers be used for tasks other than natural language processing, such as computer vision?

Computer Vision
Q35: What is computer vision, and why is it important?
Q36: Explain the concept of image segmentation and its applications.
Q37: What is object detection, and how does it differ from image classification?
Q38: Describe the steps involved in building an image recognition system.
Q39: What are the challenges in implementing real-time object tracking?
Q40: Can you explain the concept of feature extraction in computer vision?
Q41: What is optical character recognition (OCR), and what are its main applications?
Q42: How does a convolutional neural network (CNN) differ from a traditional neural network in the context of computer vision?
Q43: What is the purpose of data augmentation in computer vision, and what techniques can be used?
Q44: Discuss some popular deep learning frameworks or libraries used for computer vision tasks.

Questions & Answers


Q1: What are autoencoders? Explain the different layers of autoencoders and mention three practical usages of them?
Answer:

Autoencoders are a type of deep learning model used for unsupervised learning. The key layers of an autoencoder are the input layer, the
encoder, the bottleneck (hidden) layer, the decoder, and the output layer.

The three layers of the autoencoder are:-

1. Encoder - Compresses the input data to an encoded representation which is typically much smaller than the input data.
2. Latent Space Representation/ Bottleneck/ Code - Compact summary of the input containing the most important features
3. Decoder - Decompresses the knowledge representation and reconstructs the data back from its encoded form. Then a loss function is used at
the top to compare the input and output images. NOTE- It's a requirement that the dimensionality of the input and output be the same.
Everything in the middle can be played with.

Autoencoders have a wide variety of usage in the real world. The following are some of the popular ones:

1. Transformers and Big Bird (an autoencoder is one of the components in both architectures): text summarization, text generation
2. Image compression
3. Nonlinear version of PCA

Q2: What is an activation function and discuss the use of an activation function? Explain three different types of
activation functions?

Answer:

In mathematical terms, the activation function serves as a gate between the current neuron input and its output, going to the next level. Basically, it
decides whether neurons should be activated or not. It is used to introduce non-linearity into a model.

Activation functions are added to introduce non-linearity to the network, it doesn't matter how many layers or how many neurons your net has, the
output will be linear combinations of the input in the absence of activation functions. In other words, activation functions are what make a linear
regression model different from a neural network. We need non-linearity, to capture more complex features and model more complex variations that
simple linear models can not capture.

There are a lot of activation functions:

Sigmoid function: f(x) = 1/(1+exp(-x))

The output value of it is between 0 and 1, we can use it for classification. It has some problems like the gradient vanishing on the extremes, also it is
computationally expensive since it uses exp.

Relu: f(x) = max(0,x)

it returns 0 if the input is negative and the value of the input if the input is positive. It solves the problem of vanishing gradient for the positive side,
however, the problem is still on the negative side. It is fast because we use a linear function in it.

Leaky ReLU:

f(x) = ax for x < 0; f(x) = x for x >= 0

It solves the problem of vanishing gradient on both sides by returning a value “a” on the negative side and it does the same thing as ReLU for the
positive side.

Softmax: it is usually used at the last layer for a classification problem because it returns a set of probabilities, where the sum of them is 1.
Moreover, it is compatible with cross-entropy loss, which is usually the loss function for classification problems.
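For concreteness, here are minimal NumPy versions of the activation functions mentioned above (a sketch; the input vector is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x < 0, a * x, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), leaky_relu(z), softmax(z), sep="\n")
```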

Q3: You are using a deep neural network for a prediction task. After training your model, you notice that it is strongly
overfitting the training set and that the performance on the test isn’t good. What can you do to reduce overfitting?
To reduce overfitting in a deep neural network changes can be made in three places/stages: The input data to the network, the network architecture,
and the training process:
1. The input data to the network:

Check if all the features are available and reliable.
Check if the training sample distribution is the same as the validation and test set distributions; if they differ, the model is asked to predict
patterns it has never seen.
Check for train / validation data contamination (or leakage).
Check that the dataset size is sufficient; if not, try data augmentation to increase it.
Check that the dataset is balanced.

2. Network architecture:

Overfitting could be due to model complexity. Question each component:

Can fully connected layers be replaced with convolutional + pooling layers?
What is the justification for the number of layers and the number of neurons chosen? Given how hard these are to tune, can a pre-trained
model be used?
Add regularization - lasso (L1), ridge (L2), or elastic net (both).
Add dropout layers.
Add batch normalization.

3. The training process:

The validation loss should decide when to stop training. Use an early-stopping callback that halts training when there is no significant
improvement in the validation loss and restores the best weights (see the sketch below).
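
A minimal sketch of the early-stopping idea in Keras (assuming a compiled model and train/validation splits already exist; the patience value is illustrative):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=5,                  # stop after 5 epochs with no improvement
    restore_best_weights=True,   # roll back to the best epoch
)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=100,
#           callbacks=[early_stop])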

Q4: Why should we use Batch Normalization?


Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch.

Usually, a dataset is fed into the network in the form of batches, and the distribution of the inputs to a layer can shift from batch to batch and as the
earlier layers are updated. This can contribute to vanishing or exploding gradients during backpropagation. To combat these issues, we add a batch
normalization (BN) layer, usually after fully connected (or convolutional) layers and before the activation function.

Batch Normalisation has the following effects on the Neural Network:

1. Robust training of the deeper layers of the network.
2. A network architecture that is more robust to covariate shift.
3. A slight regularisation effect.
4. Centred and controlled activation values.
5. Helps prevent exploding/vanishing gradients.
6. Faster training/convergence to the minimum of the loss function.
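
A minimal sketch of the placement described above (batch normalization after the linear/fully connected part and before the activation), assuming Keras and illustrative layer sizes:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, use_bias=False, input_shape=(20,)),  # linear part of the layer
    layers.BatchNormalization(),                           # normalize the pre-activations per mini-batch
    layers.Activation("relu"),                             # activation applied after BN
    layers.Dense(10, activation="softmax"),
])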


Q5: How to know whether your model is suffering from the problem of Exploding Gradients?

By taking incremental steps towards the minimum, the gradient descent algorithm aims to minimize the error; the weights and biases of the
neural network are updated through this process. However, at times the steps grow excessively large, resulting in ever larger updates to the weights
and bias terms, to the point where the weights overflow (or become NaN, that is, Not a Number). This is an exploding gradient, and it makes
training unstable.

There are some subtle signs that you may be suffering from exploding gradients during the training of your network, such as:

1. The model is unable to get traction on your training data (e.g. poor loss).
2. The model is unstable, resulting in large changes in loss from update to update.
3. The model loss goes to NaN during training.

If you have these types of problems, you can dig deeper to see if you have a problem with exploding gradients. There are some less subtle signs that
you can use to confirm that you have exploding gradients:

1. The model weights quickly become very large during training.


2. The model weights go to NaN values during training.
3. The error gradient values are consistently above 1.0 for each node and layer during training.
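
As a small sketch of how one might watch for these signs in practice (assuming a Keras model; TerminateOnNaN is a built-in Keras callback, while the weight-norm logger is a hypothetical helper written here for illustration):

import numpy as np
import tensorflow as tf

class WeightNormLogger(tf.keras.callbacks.Callback):
    """Logs the global L2 norm of the weights after each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        norm = np.sqrt(sum(np.sum(np.square(w)) for w in self.model.get_weights()))
        print(f"epoch {epoch}: global weight norm = {norm:.3f}")

callbacks = [
    tf.keras.callbacks.TerminateOnNaN(),  # stop as soon as the loss becomes NaN
    WeightNormLogger(),                   # rapidly growing weights hint at exploding gradients
]
# model.fit(x_train, y_train, epochs=20, callbacks=callbacks)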

Q6: Can you name and explain a few hyperparameters used for training a neural network?
Answer:

Hyperparameters are parameters that affect the model's performance but, unlike parameters (weights and biases), are not learned from the data;
the only way to change them is manually, by the user.

1. Number of nodes: the number of neurons in each layer.

2. Batch normalization: normalization/standardization of inputs in a layer.

3. Learning rate: the rate at which weights are updated.

4. Dropout rate: percent of nodes to drop temporarily during the forward pass.

5. Kernel: the matrix (filter) with which the dot product over patches of the image array is performed.

6. Activation function: defines how the weighted sum of inputs is transformed into outputs (e.g. tanh, sigmoid, softmax, Relu, etc)
7. Number of epochs: number of passes an algorithm has to perform for training

8. Batch size: number of samples to pass through the algorithm individually. E.g. if the dataset has 1000 records and we set a batch size of 100
then the dataset will be divided into 10 batches which will be propagated to the algorithm one after another.

9. Momentum: Momentum can be seen as a learning rate adaptation technique that adds a fraction of the past update vector to the current update
vector. This helps dampen oscillations and speeds up progress towards the minimum.

10. Optimizers: They focus on getting the learning rate right.

Adagrad optimizer: Adagrad uses a large learning rate for infrequent features and a smaller learning rate for frequent features.

Other optimizers, like Adadelta, RMSProp, and Adam, make further improvements to fine-tuning the learning rate and momentum to get to
the optimal weights and bias. Thus getting the learning rate right is key to well-trained models.

11. Learning rate: controls how much the weight and bias (w + b) terms are updated after training on each batch. Several helpers are used to get the
learning rate right (see the sketch below).
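
A small sketch of where several of these hyperparameters appear in a typical Keras training setup (all values shown are illustrative, not recommendations):

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),  # number of nodes, activation function
    layers.Dropout(0.3),                                     # dropout rate
    layers.Dense(10, activation="softmax"),
])

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # learning rate, momentum
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")

# model.fit(x_train, y_train, epochs=30, batch_size=100)     # number of epochs, batch size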

Q7: Can you explain the parameter sharing concept in deep learning?
Answer: Parameter sharing is the method of sharing weights across all neurons in a particular feature map. It helps reduce the number of
parameters in the whole system, making it computationally cheap. It basically means that the same parameters are used to represent different
transformations in the system: the same matrix elements may be updated multiple times during backpropagation from varied gradients, and the
same set of elements facilitates transformations at more than one position instead of a single one, as would be conventional. Parameter sharing is
also used in architectures like Siamese networks that have parallel trunks trained simultaneously. In that case, using shared weights in a few
layers (usually the bottom layers) helps the model converge better. This behavior is attributed to the more diverse feature representations learned
by the system, since neurons corresponding to the same features are triggered in varied scenarios, which helps the model generalize better.

Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have
some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the
image than another.

One practical example is when the input is faces that have been centered in the image. You might expect that different eye-specific or hair-specific
features could (and should) be learned in different spatial locations. In that case, it is common to relax the parameter sharing scheme, and instead,
simply call the layer a Locally-Connected Layer.

Q8: Describe the architecture of a typical Convolutional Neural Network (CNN)?

Answer:

In a typical CNN architecture, a few convolutional layers are connected in a cascade style. Each convolutional layer is followed by a Rectified
Linear Unit (ReLU) layer or other activation function, then a pooling layer*, then one or more convolutional layers (+ReLU), then another pooling
layer.

The output of each convolution layer is a set of feature maps, each generated by a single kernel (filter). The feature maps define the input to the
next layer. A common trend is to keep increasing the number of filters as the spatial size of the image shrinks while it passes through the
convolutional and pooling layers. The kernel size is usually 3×3 because stacked 3×3 kernels can extract the same features as larger kernels
while being faster to compute.

After that, the final small image with a large number of filters(which is a 3D output from the above layers) is flattened and passed through fully
connected layers. At last, we use a softmax layer with the required number of nodes for classification or use the output of the fully connected layers
for some other purpose depending on the task.

The number of these layers can increase depending on the complexity of the data, and when they increase you need more data. Stride, padding,
filter size, type of pooling, etc. are all hyperparameters and need to be chosen (perhaps based on previously built successful models).

*Pooling: a way to reduce the number of features by choosing one number to represent a neighborhood. It has several types: max pooling, average
pooling, and global average pooling.

Max pooling: takes a window (2×2, for example), represents it by the maximum value inside it, then slides across the image repeating the
same operation.
Average pooling: the same as max pooling, but takes the average of the window.
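
The cascade described above could be sketched in Keras roughly as follows (the layer counts, filter numbers, input size, and 10-class softmax head are illustrative assumptions):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),   # more filters as the spatial size shrinks
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                               # 3D feature maps -> 1D vector
    layers.Dense(128, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),         # classification head
])
model.summary()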


Q9: What is the Vanishing Gradient Problem in Artificial Neural Networks and How to fix it?
Answer:

The vanishing gradient problem is encountered in artificial neural networks with gradient-based learning methods and backpropagation. In these
learning methods, each of the weights of the neural network receives an update proportional to the partial derivative of the error function with
respect to the current weight in each iteration of training. Sometimes, when the gradients become vanishingly small, this prevents the weights from
changing value.

When the neural network has many hidden layers, the gradients in the earlier layers will become very low as we multiply the derivatives of each
layer. As a result, learning in the earlier layers becomes very slow. 𝐓𝐡𝐢𝐬 𝐜𝐚𝐧 𝐜𝐚𝐮𝐬𝐞 𝐭𝐡𝐞 𝐧𝐞𝐮𝐫𝐚𝐥 𝐧𝐞𝐭𝐰𝐨𝐫𝐤 𝐭𝐨 𝐬𝐭𝐨𝐩 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠. This
problem of vanishing gradient descent happens when training neural networks with many layers because the gradient diminishes dramatically as it
propagates backward through the network.

Some ways to fix it are:


1. Use skip/residual connections.
2. Using ReLU or Leaky ReLU over sigmoid and tanh activation functions.
3. Use models that help propagate gradients to earlier time steps like in GRUs and LSTMs.
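
As a tiny sketch of the first fix (a skip/residual connection), assuming Keras; the 64-unit block width is chosen purely for illustration:

from tensorflow.keras import layers

def residual_block(x, units=64):
    """y = F(x) + x : the identity shortcut lets gradients flow past the block.
    Assumes x already has `units` features so the shapes match for the addition."""
    shortcut = x
    y = layers.Dense(units, activation="relu")(x)
    y = layers.Dense(units)(y)              # no activation before the addition
    y = layers.Add()([y, shortcut])         # skip connection
    return layers.Activation("relu")(y)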

Q10: When it comes to training an artificial neural network, what could be the reason why the loss doesn't decrease in
a few epochs?
Answer:

Some of the reasons why the loss doesn't decrease after a few Epochs are:

a. The model is under-fitting the training data.

b. The learning rate of the model is too large.

c. The initialization is not proper (for example, initializing all the weights to 0 prevents the network from learning any function).

d. The regularisation hyper-parameter is too large.

e. The classic case of vanishing gradients.

Q11: Why Sigmoid or Tanh is not preferred to be used as the activation function in the hidden layer of the neural
network?

Answer:

A common problem with Tanh or Sigmoid functions is that they saturate. Once saturated, the learning algorithm can barely update the weights and
improve the model. Thus, Sigmoid or Tanh activation functions prevent the neural network from learning effectively, leading to the vanishing
gradient problem. The vanishing gradient problem can be addressed by using the Rectified Linear Unit (ReLU) activation function instead
of sigmoid and Tanh.

Q12: Discuss in what context it is recommended to use transfer learning and when it is not.
Answer:

Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task. It is
a popular approach in deep learning where pre-trained models are used as the starting point for computer vision and natural language processing
tasks, given the vast computing and time resources required to develop neural network models on these problems from scratch and the huge jumps
in skill that they provide on related problems.

Transfer learning is used for tasks where the data is too scarce to train a full-scale model from scratch. In transfer learning, well-trained, well-
constructed networks that have already learned from large datasets are reused to boost performance on the new dataset.

𝐓𝐫𝐚𝐧𝐬𝐟𝐞𝐫 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐜𝐚𝐧 𝐛𝐞 𝐮𝐬𝐞𝐝 𝐢𝐧 𝐭𝐡𝐞 𝐟𝐨𝐥𝐥𝐨𝐰𝐢𝐧𝐠 𝐜𝐚𝐬𝐞𝐬:

1. The downstream task has a very small amount of data available, then we can try using pre-trained model weights by switching the last layer
with new layers which we will train.

2. In some cases, like in vision-related tasks, the initial layers have a common behavior of detecting edges, then a little more complex but still
abstract features and so on which is common in all vision tasks, and hence a pre-trained model's initial layers can be used directly. The same
thing holds for Language Models too, for example, a model trained in a large Hindi corpus can be transferred and used for other Indo-Aryan
Languages with low resources available.

𝐂𝐚𝐬𝐞𝐬 𝐰𝐡𝐞𝐧 𝐭𝐫𝐚𝐧𝐬𝐟𝐞𝐫 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐬𝐡𝐨𝐮𝐥𝐝 𝐧𝐨𝐭 𝐛𝐞 𝐮𝐬𝐞𝐝:

1. The first and most important consideration is cost: is it cost-effective, or can we achieve similar performance without it?

2. The pre-trained model has no relation to the downstream task.

3. If latency is a hard constraint (mostly in NLP), then transfer learning with a large model may not be the best option. However, with platforms
such as TensorFlow Lite and techniques such as model distillation, latency is much less of a problem nowadays.
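
A minimal Keras sketch of case 1 above (reusing a pre-trained vision backbone and training only a new head; the choice of MobileNetV2 and the 5-class head are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                          input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                      # freeze the pre-trained feature extractor

model = models.Sequential([
    base,
    layers.Dense(5, activation="softmax"),  # new task-specific head, trained from scratch
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])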

Q13: Discuss the vanishing gradient in RNN and How they can be solved.

Answer:

In sequence-to-sequence models such as RNNs, the input sentences might have long-term dependencies. For example, in "The boy who was
wearing a red t-shirt, blue jeans, black shoes, and a white cap and who lives at ... and is 10 years old ..., is genius", the verb (is) depends on the
subject (boy): if we said "The boys, ..., are genius", the verb would change. When training an RNN we backpropagate both through the layers and
backward through time. Without focusing too much on the mathematics, during backpropagation we repeatedly multiply gradients that are either
> 1 or < 1. If the gradients are < 1 and we take about 100 steps backward in time, multiplying 100 numbers that are < 1 results in a very, very tiny
gradient (0.1 * 0.1 * 0.1 * .... a 100 times = 10^(-100)), causing almost no change in the weights as we go backward in time. In our example, the
word "is" then fails to affect its main dependency, the word "boy", while the network learns the meanings of the words, because of the long
description in between.

Models like Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs) were proposed to address this; their main idea is to
use gates that help the network decide which information to keep and which to discard during learning. Later, Transformers were proposed,
relying on the self-attention mechanism to capture the dependencies between words in the sequence.

Q14: What are the main gates in LSTM and what are their tasks?
Answer: There are 3 main types of gates in an LSTM model, as follows:

Forget Gate
Input/Update Gate
Output Gate

1. Forget Gate: helps decide which information to keep and which to throw away.

2. Input/Update Gate: helps determine whether new information, computed from the previous hidden state and the new input data, should be
added to the long-term memory (cell state).
3. Output Gate: produces the new hidden state.

Common to all of these gates: they take as inputs the current input (the current temporal state/word/observation) and the previous hidden state
output, and a sigmoid activation is used in all of them.

Q15: Is it a good idea to use CNN to classify 1D signal?


Answer: For time-series data, where we assume temporal dependence between the values, convolutional neural networks (CNNs) are one possible
approach. The most popular approach for such data is recurrent neural networks (RNNs), but you can alternatively use CNNs, or a hybrid
approach (quasi-recurrent neural networks, QRNNs).

With CNN, you would use sliding windows of some width, that would look at certain (learned) patterns in the data, and stack such windows on top
of each other, so that higher-level windows would look for patterns within the lower-level patterns. Using such sliding windows may be helpful for
finding things such as repeating patterns within the data. One drawback is that it doesn't take into account the temporal or sequential aspect of the
1D signals, which can be very important for prediction.

With an RNN, you would use a cell that takes as input the previous hidden state and the current input value and returns an output and a new hidden
state, so information flows via the hidden states and the temporal dependencies are taken into account.

QRNN layers mix both approaches.
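
A small Keras sketch of the 1D-CNN approach described above (the window widths, filter counts, and sequence length are illustrative assumptions):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(32, kernel_size=5, activation="relu",
                  input_shape=(100, 1)),                    # sliding windows of width 5 over a length-100 signal
    layers.MaxPooling1D(2),
    layers.Conv1D(64, kernel_size=3, activation="relu"),    # higher-level patterns over lower-level ones
    layers.GlobalAveragePooling1D(),
    layers.Dense(1),                                        # e.g. regression output for the next value
])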

Q16: How does L1/L2 regularization affect a neural network?

Answer:

Overfitting occurs in more complex neural network models (many layers, many neurons), and the complexity of the network can be reduced by
using L1 and L2 regularization as well as dropout and data augmentation. L1 regularization forces weight parameters to become exactly zero.
L2 regularization (also called weight decay) pushes weight parameters towards zero but never exactly to zero.

Smaller weight parameters make some neurons negligible, so the neural network becomes less complex and overfits less.

Regularisation has the following benefits:

Reducing the variance of the model over unseen data.


Makes it feasible to fit much more complicated models without overfitting.
Reduces the magnitude of weights and biases.
L1 learns sparse models, that is, many weights turn out to be exactly 0.
Reference: https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/tree/main
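
A minimal sketch of how L1/L2 penalties are attached to layers in Keras (the penalty strengths of 0.01 are illustrative):

from tensorflow.keras import layers, regularizers

dense_l1 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))   # tends to produce sparse weights
dense_l2 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.01))   # weight decay: shrinks weights toward 0
dense_en = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))  # elastic net: both penalties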

Q17: 𝐇𝐨𝐰 𝐰𝐨𝐮𝐥𝐝 𝐲𝐨𝐮 𝐜𝐡𝐚𝐧𝐠𝐞 𝐚 𝐩𝐫𝐞-𝐭𝐫𝐚𝐢𝐧𝐞𝐝 𝐧𝐞𝐮𝐫𝐚𝐥 𝐧𝐞𝐭𝐰𝐨𝐫𝐤 𝐟𝐫𝐨𝐦 𝐜𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐭𝐨 𝐫𝐞𝐠𝐫𝐞𝐬𝐬𝐢𝐨𝐧?
Answer: Use transfer learning, where we reuse our knowledge about one task to do another. The first set of layers of a neural network are usually
feature-extraction layers and will be useful for all tasks with the same input distribution. So we should replace the last fully connected layer and
the softmax responsible for classification with a single output neuron for regression (optionally preceded by an extra fully connected layer).

We can optionally freeze the first set of layers if we have little data or want to converge fast. Then we can train the network with the data we have,
using a suitable loss for the regression problem (e.g. mean squared error), making use of the robust feature extraction (the first set of layers) of a
model pre-trained on huge data.
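
A rough Keras sketch of the head swap described above (pretrained_model is a placeholder for a hypothetical classifier whose last layer is a softmax):

from tensorflow.keras import layers, models

# pretrained_model: a hypothetical classifier ending in a softmax layer
backbone = models.Model(inputs=pretrained_model.input,
                        outputs=pretrained_model.layers[-2].output)  # drop the softmax head
backbone.trainable = False                                 # optional: freeze the feature extractor

x = layers.Dense(64, activation="relu")(backbone.output)   # optional extra fully connected layer
regression_output = layers.Dense(1)(x)                     # single linear neuron for regression

regressor = models.Model(backbone.input, regression_output)
regressor.compile(optimizer="adam", loss="mse")            # a regression loss instead of cross-entropy
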
Q18: What might happen if you set the momentum hyperparameter too close to 1 (e.g., 0.9999) when using an SGD
optimizer?

Answer:

If the momentum hyperparameter is set too close to 1 (e.g., 0.9999) when using an SGD optimizer, then the algorithm will likely pick up a lot of
speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum.

Then it will slow down and come back, accelerate again, overshoot again, and so on. It may oscillate this way many times before converging, so
overall it will take much longer to converge than with a smaller momentum value.

Also, since momentum updates the weights based on an "exponential moving average" of all the previous gradients instead of the current gradient
only, it combats, in some sense, the instability of the gradients that comes with stochastic gradient descent. The higher the momentum term, the
stronger the influence of previous gradients on the current optimization step (with the more recent gradients having an even stronger influence).
Setting a momentum term close to 1 results in an update that is almost a sum of all the previous gradients, which might result in an exploding
gradient scenario.
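
The update rule being described can be sketched as follows (one common plain-Python formulation of SGD with momentum, not any particular library's exact implementation):

# v: velocity (exponential moving average of past gradients), beta: momentum term
def sgd_momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad          # beta close to 1 -> past gradients dominate the update
    w = w - lr * v
    return w, v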

Q19: What are the hyperparameters that can be optimized for the batch normalization layer?

Answer: The \(\gamma\) and \(\beta\) parameters of the batch normalization layer are learned end to end by the network. In batch
normalization, the outputs of the intermediate layers are normalized to have a mean of 0 and a standard deviation of 1. Rescaling by \(\gamma\) and
shifting by \(\beta\) lets the network change the mean and standard deviation to other values. (Other settings of the layer, such as the momentum
used for the running statistics and \(\epsilon\), are conventional hyperparameters that can be tuned.)
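
In formula form, for a mini-batch with mean \(\mu_B\) and variance \(\sigma_B^2\), batch normalization computes \(\hat{x} = (x - \mu_B)/\sqrt{\sigma_B^2 + \epsilon}\) and then \(y = \gamma \hat{x} + \beta\), where \(\epsilon\) is a small constant for numerical stability and \(\gamma\), \(\beta\) are the learned scale and shift described above.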

Q20: What is the effect of dropout on the training and prediction speed of your deep learning model?

Answer: Dropout is a regularization technique that zeroes out some activations (unit outputs) and scales up the rest by a factor of 1/(1-p). For
example, if a Dropout layer is initialized with p=0.5, roughly half of the activations are zeroed out and the rest are scaled by a factor of 2. The layer
is only enabled during training and is disabled during validation and testing, so training is slightly slower while validation and testing run at full
speed. It works only during training because its purpose is to reduce the effective complexity of the model so that it doesn't overfit; once the model
is trained, it doesn't make sense to keep that layer enabled.
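
A small Keras sketch of the train/eval behaviour described above (the toy input is illustrative):

import numpy as np
import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = np.ones((1, 4), dtype="float32")

print(drop(x, training=True))   # roughly half the values zeroed, the rest scaled by 1/(1-0.5) = 2
print(drop(x, training=False))  # the layer is a no-op at inference time: [[1. 1. 1. 1.]]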

Q21: What is the advantage of deep learning over traditional machine learning?

Answer:

Deep learning offers several advantages over traditional machine learning approaches, including:

1. Ability to process large amounts of data: Deep learning models can analyze and process massive amounts of data quickly and accurately,
making it ideal for tasks such as image recognition or natural language processing.

2. Automated feature extraction: In traditional machine learning, feature engineering is a crucial step in the model building process. Deep
learning models, on the other hand, can automatically learn and extract features from the raw data, reducing the need for human intervention.

3. Better accuracy: Deep learning models have been shown to achieve higher accuracy in complex tasks such as speech recognition and image
classification when compared to traditional machine learning models.

4. Adaptability to new data: Deep learning models can adapt and learn from new data, making them suitable for use in dynamic and ever-
changing environments.

While deep learning does have its advantages, it also has some limitations, such as requiring large amounts of data and computational resources,
making it unsuitable for some applications.

Q22: What is a depthwise Separable layer and what are its advantages?
Answer:

Standard convolution layers involve a lot of multiplications, which makes them expensive to deploy on resource-constrained devices.
Consider, for example, an input image of 12x12x3 pixels to which we apply a 5x5 convolution (no padding, stride = 1). We stack 256 such
kernels so that we get an output of dimensions 8x8x256.

Here, there are 256 5x5x3 kernels that each move 8x8 times, which leads to 256x3x5x5x8x8 = 1,228,800 multiplications.

Depthwise separable convolution separates this process into two parts: a depthwise convolution and a pointwise convolution.

In depthwise convolution, we apply a kernel parallelly to each channel of the image.

We end up with 3 different 8x8x1 outputs (one per channel of the image), which are stacked together to form an 8x8x3 image.

Pointwise convolution then uses a 1x1x3 kernel to turn this 8x8x3 intermediate output into an 8x8x1 output.
Stacking 256 such 1x1x3 kernels gives us the same final output shape (8x8x256) as the standard convolution.

Total Number of multiplications:

For Depthwise convolution, we have 3 5x5x1 kernels moving 8x8 times, totalling 3x5x5x8x8=4800 multiplications.

In Pointwise convolution, we have 256 1x1x3 kernels moving 8x8 times, which is a total of 256x1x1x3x8x8=49152 multiplications.

Total number of multiplications = 4800 + 49152 = 53952 multiplications which is way lower than the standard convolution case.

Reference: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728
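
The multiplication counts above can be checked with a few lines of Python, and in Keras the whole operation is available as a single layer (SeparableConv2D):

# Verify the multiplication counts from the example above
standard  = 256 * 3 * 5 * 5 * 8 * 8          # 1,228,800
depthwise = 3 * 5 * 5 * 8 * 8                # 4,800
pointwise = 256 * 1 * 1 * 3 * 8 * 8          # 49,152
print(standard, depthwise + pointwise)       # 1228800 vs 53952

# Keras layer implementing depthwise + pointwise convolution in one step
from tensorflow.keras import layers
sep_conv = layers.SeparableConv2D(filters=256, kernel_size=(5, 5))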

Natural Language Processing


Q23: What is transformer architecture, and why is it widely used in natural language
processing tasks?
Answer: The key components of a transformer architecture are as follows:
1. Encoder: The encoder processes the input sequence, such as a sentence or a document, and transforms it into a set of representations that
capture the contextual information of each input element. The encoder consists of multiple identical layers, each containing a self-attention
mechanism and position-wise feed-forward neural networks. The self-attention mechanism allows the model to attend to different parts of the
input sequence while encoding it.

2. Decoder: The decoder takes the encoded representations generated by the encoder and generates an output sequence. It also consists of
multiple identical layers, each containing a self-attention mechanism and additional cross-attention mechanisms. The cross-attention
mechanisms enable the decoder to attend to relevant parts of the encoded input sequence when generating the output.

3. Self-Attention: Self-attention is a mechanism that allows the transformer to weigh the importance of different elements in the input sequence
when generating representations. It computes attention scores between each element and every other element in the sequence, resulting in a
weighted sum of the values. This process allows the model to capture dependencies and relationships between different elements in the
sequence.

4. Positional Encoding: Transformers incorporate positional encoding to provide information about the order or position of elements in the input
sequence. This encoding is added to the input embeddings and allows the model to understand the sequential nature of the data.

5. Feed-Forward Networks: Transformers utilize feed-forward neural networks to process the representations generated by the attention
mechanisms. These networks consist of multiple layers of fully connected neural networks with activation functions, enabling non-linear
transformations of the input representations.

The transformer architecture is widely used in NLP tasks due to several reasons:

Self-Attention Mechanism: Transformers leverage a self-attention mechanism that allows the model to focus on different parts of the input sequence
during processing. This mechanism enables the model to capture long-range dependencies and contextual information efficiently, making it
particularly effective for tasks that involve understanding and generating natural language.

Parallelization: Transformers can process the elements of a sequence in parallel, as opposed to recurrent neural networks (RNNs) that require
sequential processing. This parallelization greatly accelerates training and inference, making transformers more computationally efficient.

Scalability: Transformers scale well with the length of input sequences, thanks to the self-attention mechanism. Unlike RNNs, transformers do not
suffer from the vanishing or exploding gradient problem, which can hinder the modeling of long sequences. This scalability makes transformers
suitable for tasks that involve long texts or documents.

Transfer Learning: Transformers have shown great success in pre-training and transfer learning. Models like BERT (Bidirectional Encoder
Representations from Transformers) and GPT (Generative Pre-trained Transformer) are pre-trained on massive amounts of text data, enabling them
to learn rich representations of language. These pre-trained models can then be fine-tuned on specific downstream tasks with comparatively smaller
datasets, leading to better generalization and improved performance.

Contextual Understanding: Transformers excel in capturing the contextual meaning of words and sentences. By considering the entire input
sequence simultaneously, transformers can generate more accurate representations that incorporate global context, allowing for better language
understanding and generation.

Q24: Explain the key components of a transformer model.


Answer:

A transformer model consists of several key components that work together to process and generate representations for input sequences. The main
components of a transformer model are as follows:

Encoder: The encoder is responsible for processing the input sequence and generating representations that capture the contextual information
of each element. It consists of multiple identical layers, typically stacked on top of each other. Each layer contains two sub-layers: a self-
attention mechanism and a position-wise feed-forward neural network.
Self-Attention Mechanism: This mechanism allows the model to attend to different parts of the input sequence while encoding it. It
computes attention scores between each element and every other element in the sequence, resulting in a weighted sum of values. This
process allows the model to capture dependencies and relationships between different elements.

Position-wise Feed-Forward Neural Network: After the self-attention mechanism, a feed-forward neural network is applied to each
position separately. It consists of fully connected layers with activation functions, enabling non-linear transformations of the input
representations.

Decoder: The decoder takes the encoded representations generated by the encoder and generates an output sequence. It also consists of
multiple identical layers, each containing sub-layers such as self-attention, cross-attention, and position-wise feed-forward networks.

Self-Attention Mechanism: Similar to the encoder, the decoder uses self-attention to attend to different parts of the decoded sequence
while generating the output. It allows the decoder to consider the previously generated elements in the output sequence when
generating the next element.

Cross-Attention Mechanism: In addition to self-attention, the decoder employs cross-attention to attend to relevant parts of the encoded
input sequence. It allows the decoder to align and extract information from the encoded sequence when generating the output.

Self-Attention and Cross-Attention: These attention mechanisms are fundamental components of the transformer architecture. They enable
the model to weigh the importance of different elements in the input and output sequences when generating representations. Attention scores
are computed by measuring the compatibility between elements, and the weighted sum of values is used to capture contextual dependencies.

Positional Encoding: Transformers incorporate positional encoding to provide information about the order or position of elements in the input
sequence. It is added to the input embeddings and allows the model to understand the sequential nature of the data.

Residual Connections and Layer Normalization: Transformers employ residual connections and layer normalization to facilitate the flow of
information and improve gradient propagation. Residual connections enable the model to capture both high-level and low-level features,
while layer normalization normalizes the inputs to each layer, improving the stability and performance of the model.

These components collectively enable the transformer model to process and generate representations for input sequences in an efficient and
effective manner. The self-attention mechanisms, along with the feed-forward networks and positional encoding, allow the model to capture long-
range dependencies, handle the parallel processing, and generate high-quality representations, making transformers highly successful in natural
language processing tasks.

Q25: What is self-attention, and how does it work in transformers?


Answer:

Q26: What are the advantages of transformers over traditional sequence-to-sequence models?
Answer: Transformers have several advantages over traditional sequence-to-sequence models, such as recurrent neural networks (RNNs), when it
comes to natural language processing tasks. Here are some key advantages:

Long-range dependencies: Transformers are capable of capturing long-range dependencies in sequences more effectively compared to RNNs.
This is because RNNs suffer from vanishing or exploding gradient problems when processing long sequences, which limits their ability to
capture long-term dependencies. Transformers address this issue by using self-attention mechanisms that allow for capturing relationships
between any two positions in a sequence, regardless of their distance.

Parallelization: Transformers can process inputs in parallel, making them more efficient in terms of computational time compared to RNNs.
In RNNs, the sequential nature of computation limits parallelization since each step depends on the previous step's output. Transformers, on
the other hand, process all positions in a sequence simultaneously, enabling efficient parallelization across different positions.

Scalability: Transformers are highly scalable and can handle larger input sequences without significantly increasing computational
requirements. In RNNs, the computational complexity grows linearly with the length of the input sequence, making it challenging to process
long sequences efficiently. Transformers, with their parallel processing and self-attention mechanisms, maintain a constant computational
complexity, making them suitable for longer sequences.

Global context understanding: Transformers capture global context information effectively due to their attention mechanisms. Each position
in the sequence attends to all other positions, allowing for a comprehensive understanding of the entire sequence during the encoding and
decoding process. This global context understanding aids in various NLP tasks, such as machine translation, where the translation of a word
can depend on the entire source sentence.

Transfer learning and fine-tuning: Transformers facilitate transfer learning and fine-tuning, which is the ability to pre-train models on large-
scale datasets and then adapt them to specific downstream tasks with smaller datasets. Pretraining transformers on massive amounts of data,
such as in models like BERT or GPT, helps capture rich language representations that can be fine-tuned for a wide range of NLP tasks,
providing significant performance gains.

Q27: How does the attention mechanism help transformers capture long-range dependencies
in sequences?
Answer: The attention mechanism in transformers plays a crucial role in capturing long-range dependencies in sequences. It allows each position in
a sequence to attend to other positions, enabling the model to focus on relevant parts of the input during both the encoding and decoding stages.
Here's how the attention mechanism works in transformers:

Self-Attention: Self-attention, also known as intra-attention, is the key component of the attention mechanism in transformers. It computes
the importance, or attention weight, that each position in the sequence should assign to other positions. This attention weight determines how
much information a position should gather from other positions.
Query, Key, and Value: To compute self-attention, each position in the sequence is associated with three learned vectors: query, key, and
value. These vectors are derived from the input embeddings and transformed through linear transformations. The query vector is used to
search for relevant information, the key vector represents the positions to which the query attends, and the value vector holds the information
content of each position.

Attention Scores: The attention mechanism calculates attention scores between the query vector of a position and the key vectors of all other
positions in the sequence. The attention scores quantify the relevance or similarity between positions. They are obtained by taking the dot
product between the query and key vectors and scaling it by a factor of the square root of the dimensionality of the key vectors.

Attention Weights: The attention scores are then normalized using the softmax function to obtain attention weights. These weights determine
the contribution of each position to the final representation of the current position. Positions with higher attention weights have a stronger
influence on the current position's representation.

Weighted Sum: Finally, the attention weights are used to compute a weighted sum of the value vectors. This aggregation of values gives the
current position a comprehensive representation that incorporates information from all relevant positions, capturing the long-range
dependencies effectively.

By allowing each position to attend to other positions, the attention mechanism provides a mechanism for information to flow across the entire
sequence. This enables transformers to capture dependencies between distant positions, even in long sequences, without suffering from the
limitations of vanishing or exploding gradients that affect traditional recurrent neural networks. Consequently, transformers excel in modeling
complex relationships and dependencies in sequences, making them powerful tools for various tasks, including natural language processing and
computer vision.
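
The computation described above (scores from query/key dot products, scaled, softmax-normalized, then used to weight the values) can be written compactly; here is a NumPy sketch of scaled dot-product attention for a single head, with toy dimensions chosen for illustration:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (sequence_length, d_k) matrices of query, key, and value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # attention scores between all position pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                                        # weighted sum of the value vectors

Q = K = V = np.random.randn(4, 8)                             # a toy sequence of 4 positions, d_k = 8
print(scaled_dot_product_attention(Q, K, V).shape)            # (4, 8)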

Q28: What are the limitations of transformers, and what are some potential solutions?
Answer: While transformers have revolutionized many natural language processing tasks, they do have certain limitations. Here are some notable
limitations of transformers and potential solutions:

Sequential Computation: Transformers process the entire sequence in parallel, which limits their ability to model sequential information
explicitly. This can be a disadvantage when tasks require strong sequential reasoning. Potential solutions include incorporating recurrent
connections into transformers or using hybrid models that combine the strengths of transformers and recurrent neural networks.

Memory and Computational Requirements: Transformers consume more memory and computational resources compared to traditional
sequence models, especially for large-scale models and long sequences. This limits their scalability and deployment on resource-constrained
devices. Solutions involve developing more efficient architectures, such as sparse attention mechanisms or approximations, to reduce
memory and computational requirements without sacrificing performance significantly.

Lack of Interpretability: Transformers are often considered as black-box models, making it challenging to interpret the reasoning behind their
predictions. Understanding the decision-making process of transformers is an ongoing research area. Techniques such as attention
visualization, layer-wise relevance propagation, and saliency maps can provide insights into the model's attention and contribution to
predictions, enhancing interpretability.

Handling Out-of-Distribution Data: Transformers can struggle with data that significantly deviates from the distribution seen during training.
They may make overconfident predictions or produce incorrect outputs when faced with out-of-distribution samples. Solutions include
exploring uncertainty estimation techniques, robust training approaches, or incorporating external knowledge sources to improve
generalization and handle out-of-distribution scenarios.

Limited Contextual Understanding: Transformers rely heavily on context information to make predictions. However, they can still struggle
with understanding the broader context, especially in scenarios with complex background knowledge or multi-modal data. Incorporating
external knowledge bases, leveraging graph neural networks, or combining transformers with other modalities like images or graphs can help
improve contextual understanding and capture richer representations.

Training Data Requirements: Transformers typically require large amounts of labeled data for effective training due to their high capacity.
Acquiring labeled data can be expensive and time-consuming, limiting their applicability to domains with limited labeled datasets. Solutions
include exploring semi-supervised learning, active learning, or transfer learning techniques to mitigate the data requirements and leverage
pretraining on large-scale datasets.

Researchers and practitioners are actively working on addressing these limitations to further enhance the capabilities and applicability of
transformers in various domains. As the field progresses, we can expect continued advancements and novel solutions to overcome these challenges.

Q29: How are transformers trained, and what is the role of pre-training and fine-tuning?
Answer:

Q30: What is BERT (Bidirectional Encoder Representations from Transformers), and how
does it improve language understanding tasks?
Answer: BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based neural network model introduced by Google in
2018. It is designed to improve the understanding of natural language in various language processing tasks, such as question answering, sentiment
analysis, named entity recognition, and more.

BERT differs from previous language models in its ability to capture the context of a word by considering both the left and right context in a
sentence. Traditional language models, like the ones based on recurrent neural networks, process text in a sequential manner, making it difficult to
capture the full context.

BERT, on the other hand, is a "pre-trained" model that is trained on a large corpus of unlabeled text data. During pre-training, BERT learns to
predict missing words in sentences by considering the surrounding words on both sides. This bidirectional training allows BERT to capture
contextual information effectively.
Once pre-training is complete, BERT is fine-tuned on specific downstream tasks. This fine-tuning involves training the model on labeled data from
a particular task, such as sentiment analysis or named entity recognition. During fine-tuning, BERT adapts its pre-trained knowledge to the specific
task, further improving its understanding and performance.

The key advantages of BERT include:

1. Contextual understanding: BERT can capture the contextual meaning of words by considering both the preceding and following words in a
sentence, leading to better language understanding.

2. Transfer learning: BERT is pre-trained on a large corpus of unlabeled data, enabling it to learn general language representations. These pre-
trained representations can then be fine-tuned for specific tasks, even with limited labeled data.

3. Versatility: BERT can be applied to a wide range of natural language processing tasks. By fine-tuning the model on specific tasks, it can
achieve state-of-the-art performance in tasks such as question answering, text classification, and more.

4. Handling ambiguity: BERT's bidirectional nature helps it handle ambiguous language constructs more effectively. It can make more informed
predictions by considering the context from both directions.

Q31: Describe the process of generating text using a transformer-based language model.
Answer:

Q32: What are some challenges or ethical considerations associated with large language
models?
Answer:

Q33: Explain the concept of transfer learning and how it can be applied to transformers.
Answer:

Transfer learning is a machine learning technique where knowledge gained from training on one task is leveraged to improve performance on
another related task. Instead of training a model from scratch on a specific task, transfer learning enables the use of pre-trained models as a starting
point for new tasks.

In the context of transformers, transfer learning has been highly successful, particularly with models like BERT (Bidirectional Encoder
Representations from Transformers) and GPT (Generative Pre-trained Transformer).

Here's how transfer learning is applied to transformers:

1. Pre-training: In the pre-training phase, a transformer model is trained on a large corpus of unlabeled text data. The model is trained to predict
missing words in a sentence (masked language modeling) or to predict the next word in a sequence (causal language modeling). This process
enables the model to learn general language patterns, syntactic structures, and semantic relationships.

2. Fine-tuning: Once the transformer model is pre-trained, it can be fine-tuned on specific downstream tasks with smaller labeled datasets. Fine-
tuning involves retraining the pre-trained model on task-specific labeled data. The model's parameters are adjusted to optimize performance
on the specific task, while the pre-trained knowledge acts as a strong initialization for the fine-tuning process.

a. Task-specific architecture: During fine-tuning, the architecture of the pre-trained transformer model is often modified or extended to
accommodate the specific requirements of the downstream task. For example, in sentiment analysis, an additional classification layer may be
added on top of the pre-trained model to classify text sentiment.

b. Few-shot or zero-shot learning: Transfer learning with transformers allows for few-shot or even zero-shot learning scenarios. Few-shot
learning refers to training a model on a small amount of labeled data, which is beneficial when data availability is limited. Zero-shot learning
refers to using the pre-trained model directly on a task for which it hasn't been explicitly trained, but the model can still generate meaningful
predictions based on its understanding of language.

Transfer learning with transformers offers several advantages:

1. Reduced data requirements: Pre-training on large unlabeled datasets allows the model to capture general language understanding, reducing
the need for massive amounts of labeled task-specific data.

2. Improved generalization: The pre-trained model has learned rich representations of language from extensive pre-training, enabling it to
generalize well to new tasks and domains.

3. Efficient training: Fine-tuning a pre-trained model requires less computational resources and training time compared to training from scratch.

4. State-of-the-art performance: Transfer learning with transformers has achieved state-of-the-art performance on a wide range of NLP tasks,
including text classification, named entity recognition, question answering, machine translation, and more.

By leveraging the knowledge encoded in pre-trained transformers, transfer learning enables faster and more effective development of models for
specific NLP tasks, even with limited labeled data.
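
As a rough sketch of the fine-tuning workflow with the Hugging Face transformers library (the model name, number of labels, and toy inputs are illustrative assumptions; a real setup would wrap this in a training loop or a trainer):

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# the pre-trained encoder weights are reused; only the new classification head is randomly initialized

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
outputs = model(**batch, labels=labels)   # outputs.loss can be backpropagated during fine-tuning
print(outputs.loss)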

Q34: How can transformers be used for tasks other than natural language processing, such
as computer vision?
Answer:

Computer Vision
Q35: What is computer vision, and why is it important?
Answer:

Q36: Explain the concept of image segmentation and its applications.


Answer:

Q37: What is object detection, and how does it differ from image classification?
Answer:

Q38: Describe the steps involved in building an image recognition system.


Answer:

Q39: What are the challenges in implementing real-time object tracking?


Answer:

Q40: Can you explain the concept of feature extraction in computer vision?
Answer:

Q41: What is optical character recognition (OCR), and what are its main applications?
Answer:

Q42: How does a convolutional neural network (CNN) differ from a traditional neural
network in the context of computer vision?
Answer:

Q43: What is the purpose of data augmentation in computer vision, and what techniques can
be used?
Answer:
The purpose of data augmentation in computer vision is to artificially increase the size and diversity of a training dataset by applying various
transformations to the original images. Data augmentation helps prevent overfitting and improves the generalization ability of deep learning models
by exposing them to a broader range of variations and patterns present in the data. It also reduces the risk of the model memorizing specific
examples in the training data.

By applying different augmentation techniques, the model becomes more robust and capable of handling variations in the real-world test data that
may not be present in the original training set. Common data augmentation techniques include:

1. Horizontal Flipping: Flipping images horizontally, i.e., left to right, or vice versa. This is particularly useful for tasks where the orientation of
objects doesn't affect their interpretation, such as object detection or image classification.

2. Vertical Flipping: Similar to horizontal flipping but flipping images from top to bottom.

3. Random Rotation: Rotating images by a random angle. This can be helpful to simulate objects at different angles and orientations.

4. Random Crop: Taking random crops from the input images. This forces the model to focus on different parts of the image and helps in
handling varying object scales.

5. Scaling and Resizing: Rescaling images to different sizes or resizing them while maintaining the aspect ratio. This augmentation helps the
model handle objects of varying sizes.

6. Color Jittering: Changing the brightness, contrast, saturation, and hue of the images randomly. This augmentation can help the model become
more robust to changes in lighting conditions.

7. Gaussian Noise: Adding random Gaussian noise to the images, which simulates noisy environments and enhances the model's noise
tolerance.

8. Elastic Transformations: Applying local deformations to the image, simulating distortions that might occur due to variations in the imaging
process.

9. Cutout: Randomly masking out portions of the image with black pixels. This helps the model learn to focus on other informative parts of the
image.

10. Mixup: Combining two or more images and their corresponding labels in a weighted manner to create new training examples. This
encourages the model to learn from the combined patterns of multiple images.

It's important to note that the choice of data augmentation techniques depends on the specific computer vision task and the characteristics of the
dataset. Additionally, augmentation should be applied only during the training phase and not during testing or evaluation to ensure that the model
generalizes well to unseen data.
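
A brief Keras sketch combining a few of the techniques above as preprocessing layers applied only during training (the specific factors are illustrative, and these layers are available in recent versions of TensorFlow/Keras):

from tensorflow.keras import layers, models

augment = models.Sequential([
    layers.RandomFlip("horizontal"),          # technique 1: horizontal flipping
    layers.RandomRotation(0.1),               # technique 3: random rotation
    layers.RandomZoom(0.2),                   # related to scaling / resizing
    layers.RandomContrast(0.2),               # part of color jittering
])
# augmented = augment(images, training=True)  # active in training; a no-op when training=False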

Q44: Discuss some popular deep learning frameworks or libraries used for computer vision
tasks.
Answer:
