Data Science Interview QnAs by CloudyML
Data Science
Interview
Questions &
Answers
(Save It Now)
www.cloudyml.com
1. How is Data modeling different from Database design?
Data Modeling: This can be considered the first step towards the design of a
database. Data modeling creates a conceptual model based on the relationships
between various data entities. The process moves from the conceptual stage to
the logical model and then to the physical schema, applying data modeling
techniques systematically.
Database Design: This is the process of designing the database itself. Its output
is a detailed data model of the database. Strictly speaking, database design covers
the detailed logical model of a database, but it can also include physical design
choices and storage parameters.
2. What is dimensionality reduction? What are its benefits?
Dimensionality reduction reduces the number of dimensions, and hence the size, of
the entire dataset. It drops unnecessary features while keeping the overall
information in the data intact. Reducing the dimensions leads to faster processing
of the data.
Data with high dimensionality is considered difficult to deal with because
processing it and training a model on it consume a lot of time. Reducing
dimensions speeds up this process, removes noise, and can also lead to better
model accuracy.
3. What is stacking, and how is it different from bagging and boosting?
Just like bagging and boosting, stacking is an ensemble learning method. In
bagging and boosting, we can only combine weak models that use the same
learning algorithm, e.g., logistic regression. These models are called
homogeneous learners.
In stacking, however, we can combine weak models that use different learning
algorithms as well. These learners are called heterogeneous learners. Stacking
works by training multiple (and different) weak models or learners and then
combining them by training another model, called a meta-model, that makes
predictions based on the predictions returned by these multiple weak models.
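The idea above can be sketched with a toy example (pure Python, made-up data; the "weak learners" here are simple decision stumps standing in for heterogeneous models, and for brevity the meta-model is trained on in-sample base predictions, whereas real stacking typically uses out-of-fold predictions):

```python
# Toy stacking sketch: two base learners plus a meta-model trained on
# their outputs. Illustrative only, not a production implementation.

def train_stump(xs, ys):
    """Return the threshold on xs that best separates the two classes."""
    best_thr, best_acc = None, -1.0
    for thr in xs:
        acc = sum((x >= thr) == y for x, y in zip(xs, ys)) / len(ys)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr

# A tiny 2-feature dataset: class 1 when both features are large.
data = [(1.0, 1.0, 0), (2.0, 1.5, 0), (1.5, 4.0, 0), (4.0, 1.0, 0),
        (4.0, 4.0, 1), (5.0, 3.5, 1), (3.5, 5.0, 1), (5.0, 5.0, 1)]
f0 = [d[0] for d in data]
f1 = [d[1] for d in data]
ys = [d[2] for d in data]

# Base learners: one stump per feature.
thr0 = train_stump(f0, ys)
thr1 = train_stump(f1, ys)
base_preds = [(int(x0 >= thr0), int(x1 >= thr1)) for x0, x1, _ in data]

# Meta-model: a stump trained on the sum of the base predictions.
meta_thr = train_stump([p0 + p1 for p0, p1 in base_preds], ys)

def stacked_predict(x0, x1):
    p = int(x0 >= thr0) + int(x1 >= thr1)
    return int(p >= meta_thr)

print(stacked_predict(4.0, 1.0))  # -> 0, even though the feature-0 stump alone says 1
```

Note how the meta-model corrects the one point each base learner gets wrong, which is exactly the benefit stacking is after.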
4. What are loss functions and cost functions? Explain the key
difference between them.
When calculating the error for a single data point, we use the term loss function.
When calculating the aggregate error over multiple data points, we use the term
cost function; beyond that there is no major difference.
In other words, the loss function captures the difference between the actual and
predicted values for a single record, whereas the cost function aggregates that
difference over the entire training dataset.
The most commonly used loss functions are mean squared error and hinge loss.
Mean Squared Error (MSE): in simple terms, it measures how far the model's
predicted values are from the actual values, on average:
MSE = (1/n) * Σ (y_i − ŷ_i)²
Hinge loss is used for classification:
L = max(0, 1 − y · f(x))
where y = −1 or 1 indicates the two classes and f(x) represents the raw output of
the classifier. The most common cost function represents the total cost as the sum
of the fixed costs and the variable costs, as in the equation y = mx + b.
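The two loss functions above can be written as short Python functions (made-up numbers in the example calls):

```python
# Minimal sketches of the two loss functions discussed above.

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences (a cost
    function when computed over the whole dataset)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def hinge(y, score):
    """Hinge loss for one record: y is -1 or +1, score is the raw
    classifier output f(x)."""
    return max(0.0, 1.0 - y * score)

print(mse([3.0, 5.0], [2.0, 7.0]))  # (1 + 4) / 2 = 2.5
print(hinge(+1, 2.0))               # correct and confident -> 0.0
print(hinge(-1, 0.5))               # wrong side of the boundary -> 1.5
```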
5. What is SVM? What are some of the kernels used in SVM?
SVM stands for support vector machine. SVMs are used for classification and
prediction tasks. An SVM constructs a separating plane that discriminates between
the two classes of variables.
This separating plane is known as the hyperplane. Some of the kernels used in
SVM are:
Polynomial Kernel
Gaussian Kernel
Laplace RBF Kernel
Sigmoid Kernel
Hyperbolic Kernel
Data definition language (DDL): It defines the data structure that consists of
commands like CREATE, ALTER, DROP, etc.
Data control language (DCL): It controls access to the data stored in the database.
The commands in this category include GRANT and REVOKE.
In a shallow copy, the original and each subsequent copy hold pointers that refer
to the same underlying data. A deep copy clones the underlying data completely,
so nothing is shared between the original and the copy.
K-means clustering
Linear regression
K-NN (k-nearest neighbor)
Decision trees
The k-nearest neighbor algorithm can handle missing values because, when a
value is missing, it can simply compute the nearest neighbors based on all the
other features. With k-means clustering or linear regression, missing values must
be handled during pre-processing, otherwise the algorithms will fail. Decision trees
have the same problem, although there is some variance across implementations.
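A toy nearest-neighbour imputation sketch along those lines (made-up data; a real project would more likely reach for a library imputer such as scikit-learn's KNNImputer):

```python
# Fill each missing cell with the value from the nearest complete row,
# measuring distance only over the features present in both rows.

def nn_impute(rows, missing=None):
    filled = [list(r) for r in rows]
    complete = [r for r in rows if missing not in r]
    for row in filled:
        for j, v in enumerate(row):
            if v is missing:
                nearest = min(
                    complete,
                    key=lambda c: sum((a - b) ** 2
                                      for k, (a, b) in enumerate(zip(row, c))
                                      if k != j and row[k] is not missing),
                )
                row[j] = nearest[j]
    return filled

data = [[1.0, 2.0], [1.2, 2.4], [9.0, 9.5], [1.05, None]]
print(nn_impute(data))  # the None is filled from the closest row, [1.0, 2.0]
```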
This is a statistical hypothesis test for randomized experiments with two variants,
A and B. The objective of A/B testing is to evaluate changes to a web page in
order to maximize or increase the outcome of a strategy.
SWITCH YOUR CAREER TO
DATA SCIENCE & ANALYTICS
“Make your career in the fastest growing Data Science
industry without paying lakhs of rupees.”
Features of this course :-
✅Get Hands-on Practical Learning Experience
✅Topic Wise Structured Tutorial Videos
✅Guided Practice Assignments
✅Capstone End-to-End Projects
✅1-1 Doubt Clearance Support Everyday
✅One Month Internship Opportunity
✅Interview QnA PDF Collection
✅Course Completion Certificate
✅Lifetime Course Content Access
✅No Prior Coding Experience Required to Join
✅Resume Review Feature
✅Daily Interview QnA Mail Everyday
✅Job Opening Mail & More.
11. What is a star schema?
The star schema is the simplest and the most fundamental of the data mart
schemas. It is called a star because its physical model resembles a star shape,
with a fact table at its center and the dimension tables at its periphery
representing the star's points.
A Generative Adversarial Network takes inputs from a noise vector and sends
them forward to the Generator, and then to the Discriminator, which identifies and
differentiates real and fake inputs.
The CASE statement is used to construct logic in which one column's value
depends on the values of other columns.
A SQL Server CASE statement is made up of at least one pair of WHEN and
THEN clauses. The WHEN clause specifies the condition to be tested; if the
WHEN condition returns TRUE, the THEN clause says what to do.
When none of the WHEN conditions return true, the ELSE clause is executed.
The END keyword brings the CASE statement to a close.
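The WHEN/THEN/ELSE/END flow can be demonstrated with Python's built-in sqlite3 module (the CASE syntax shown is standard SQL; the table and values are made up):

```python
# CASE statement demo on an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 85), ("b", 60), ("c", 30)])

rows = conn.execute("""
    SELECT name,
           CASE
               WHEN marks >= 80 THEN 'distinction'
               WHEN marks >= 40 THEN 'pass'
               ELSE 'fail'          -- runs when no WHEN condition is true
           END AS result
    FROM scores
    ORDER BY name
""").fetchall()
print(rows)  # [('a', 'distinction'), ('b', 'pass'), ('c', 'fail')]
```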
15. What is pickling and unpickling?
The pickle module accepts any Python object, converts it into a byte-stream
representation, and dumps it into a file using the dump function; this process is
called pickling. The process of retrieving the original Python object from the stored
byte-stream representation is called unpickling.
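A minimal pickling/unpickling example (the file name is arbitrary; note that in Python 3 pickle produces bytes, so files are opened in binary mode):

```python
import os
import pickle
import tempfile

obj = {"model": "svm", "accuracy": 0.92}

path = os.path.join(tempfile.gettempdir(), "demo.pkl")  # arbitrary path
with open(path, "wb") as f:
    pickle.dump(obj, f)        # pickling: object -> byte stream in a file

with open(path, "rb") as f:
    restored = pickle.load(f)  # unpickling: byte stream -> original object

print(restored == obj)  # True
```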
16. What is a random forest?
A random forest combines multiple models to get the final output or, to be more
precise, it combines multiple decision trees to get the final output. Decision trees
are therefore the building blocks of the random forest model.
17. How are Data Science and Machine Learning related to each other?
Data Science and Machine Learning are two terms that are closely related but are
often misunderstood. Both of them deal with data. Data Science is a broad field
that deals with large volumes of data and allows us to draw insights out of this
voluminous data. Machine Learning, on the other hand, can be thought of as a
sub-field of Data Science. It also deals with data, but here, we are solely focused
on learning how to convert the processed data into a functional model, which can
be used to map inputs to outputs, e.g., a model that accepts an image as input
and tells us whether that image contains a flower.
19. What is cluster sampling?
Cluster sampling involves dividing the sample population into separate groups,
called clusters. Then, a simple random sample of clusters is selected from the
population. Analysis is conducted on data from the sampled clusters.
A macro is a set of actions that helps automate a task in Excel by recording and
playing back the steps taken to complete that task. Recording the steps creates a
macro, which can then be edited and played back as many times as the user
wants.
Macros are great for repetitive tasks and also help eliminate errors. For example,
if an account manager has to share monthly reports regarding company
employees' non-payment of dues, the task can be automated using a macro, with
minor changes made every month as needed.
A KPI is a quantifiable measure to evaluate whether the objectives are being met or
not.
It is a reliable metric to measure the performance level of an organisation or
individual.
An example of a KPI in an organisation is the expense ratio.
•Holdout method
•K-fold cross-validation
•Stratified k-fold cross-validation
•Leave p-out cross-validation
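The second method above, k-fold cross-validation, can be sketched as an index split (a simplified version without shuffling): each of the k folds serves once as the validation set while the remaining k−1 folds form the training set.

```python
# Minimal k-fold split over row indices.

def k_fold_indices(n, k):
    folds = [list(range(n))[i::k] for i in range(k)]  # round-robin split
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

for train, val in k_fold_indices(6, 3):
    print(train, val)  # each index appears in exactly one validation fold
```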
In the real world, Machine Learning models are built on top of features and
parameters. These features can be multidimensional and large in number.
Sometimes, the features may be irrelevant and it becomes a difficult task to
visualize them. This is where dimensionality reduction is used to cut down
irrelevant and redundant features with the help of principal variables. These
principal variables are a subgroup of the parent variables and conserve the
information in the original features.
If the p-value is greater than the critical value, we fail to reject H0. For example,
a p-value of 0.015 against a critical value of 0.05 is strong evidence against H0,
so we reject it.
29. What is six sigma in statistics?
In quality control, six sigma is a statistical approach to generating an error-free
data set. σ denotes the standard deviation; the lower the standard deviation, the
less likely a process is to perform inaccurately and commit errors. A process is
said to be at six sigma if it delivers 99.99966% error-free results. A six sigma
model is one that outperforms 1σ, 2σ, 3σ, 4σ, and 5σ processes and is sufficiently
reliable to deliver defect-free work.
Exploding gradients can result in NaN values, which make the model unstable and
unable to learn from the training data.
There are two forms of the law of large numbers, but the differences are primarily
theoretical.
The weak law of large numbers states that as n increases, the sample statistic of
the sequence converges in probability to the population value.
The strong law of large numbers states that the sample statistic converges almost
surely to the population value as the sample size or the number of trials
increases. For example, the sample mean will converge on the population mean
as the sample size increases.
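A quick simulation of the law of large numbers with fair die rolls (the population mean is 3.5; the seed is fixed only to make the run reproducible):

```python
# The sample mean of fair die rolls drifts toward 3.5 as n grows.
import random
import statistics

random.seed(0)  # fixed seed for a reproducible run

means = {}
for n in (10, 1000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    means[n] = statistics.mean(rolls)
    print(n, means[n])
```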
The goal of A/B testing is to pick the better of two variants. Use cases for this
kind of testing include web page or application responsiveness, landing page
redesigns, banner testing, marketing campaign performance, etc.
The first step is to define a conversion goal; statistical analysis is then used to
determine which alternative performs better for that conversion goal.
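One common way to do that statistical analysis is a two-proportion z-test, sketched below in pure Python with made-up conversion counts (real analyses usually rely on a statistics library):

```python
# Two-proportion z-test comparing conversion rates of variants A and B.
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for H0: rate_A == rate_B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p_value

# Hypothetical experiment: 200/2000 vs 260/2000 conversions.
z, p = ab_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(round(z, 2), round(p, 4))
```

Here the small p-value (well under 0.05) would suggest variant B's higher conversion rate is unlikely to be due to chance.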
Eigenvectors depict the directions along which a linear transformation acts by
compressing, flipping, or stretching. They are used to understand linear
transformations and are generally calculated for a correlation or covariance matrix.
The eigenvalue is the strength of the transformation in the direction of the
eigenvector.
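For a 2x2 matrix this can be illustrated directly from the characteristic polynomial (in practice one would use a routine such as numpy.linalg.eig; the matrix below is made up):

```python
# Eigenvalues/eigenvectors of a 2x2 covariance-style matrix.
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]

# For a 2x2 matrix, eigenvalues solve lambda^2 - trace*lambda + det = 0.
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(trace ** 2 - 4 * det)
eig1, eig2 = (trace + disc) / 2, (trace - disc) / 2
print(eig1, eig2)  # 3.0 and 1.0

# Check A v = lambda v for the eigenvector v = (1, 1) of eig1:
v = (1.0, 1.0)
Av = (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])
print(Av)  # (3.0, 3.0), i.e. eig1 * v: v is stretched by a factor of 3
```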
Data preprocessing transforms the data into a format that is more easily and
effectively processed in data mining, machine learning and other data science
tasks.
1. Data profiling.
2. Data cleansing.
3. Data reduction.
4. Data transformation.
5. Data enrichment.
6. Data validation.
37. What Are the Three Stages of Building a Model in Machine Learning?
Model Building: choosing a suitable algorithm for the model and training it
according to the requirements.
Model Testing: checking the accuracy of the model using the test data.
Applying the Model: making the required changes after testing and using the final
model for real-time projects.
A parameter is a dynamic value that a user can select, and you can use it to
replace constant values in calculations, filters, and reference lines.
For example, when creating a filter to show the top 10 products based on total
profit instead of the fixed value, you can update the filter to show the top 10, 20, or
30 products using a parameter.
The built-in filter() function filters a sequence using a function that checks whether
each element is truthy. filter() takes two arguments: function, the predicate to
apply to each element, and iterable, an iterable such as a set, list, or tuple.
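A short illustration (made-up list):

```python
# filter() keeps the elements for which the function returns a truthy value.
nums = [1, 4, 7, 10, 13]

evens = list(filter(lambda x: x % 2 == 0, nums))
print(evens)  # [4, 10]

# With function=None, filter() simply drops falsy elements.
print(list(filter(None, [0, "a", "", 3])))  # ['a', 3]
```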
First we have a List that contains duplicates. Create a dictionary, using the List
items as keys. This will automatically remove any duplicates because dictionaries
cannot have duplicate keys. Then, convert the dictionary back into a list.
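The steps above in code:

```python
# dict.fromkeys keeps the first occurrence of each item and, since
# Python 3.7, preserves insertion order, so the result is stable.
items = ["a", "b", "a", "c", "b"]
deduped = list(dict.fromkeys(items))
print(deduped)  # ['a', 'b', 'c']
```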
42. Explain TF/IDF vectorization.
The TF-IDF statistic, which stands for term frequency-inverse document frequency,
is a numerical measure of how important a word is to a document in a collection or
corpus. It is frequently used as a weighting factor in information retrieval, text
mining, and user modelling. The tf-idf value rises in proportion to the number of
times a word appears in a document and is offset by the number of documents in
the corpus that contain the term, which helps compensate for the fact that some
words appear more frequently than others.
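A bare-bones tf-idf computation over a made-up three-document corpus (libraries such as scikit-learn use refined variants with smoothing and normalisation):

```python
# tf-idf = (term frequency in the document) * log(N / document frequency).
import math

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)        # frequency within this document
    df = sum(term in d for d in docs)      # how many documents contain it
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("the", docs[0], docs))  # 0.0: appears everywhere, so idf = 0
print(round(tf_idf("cat", docs[0], docs), 3))  # the rarer term scores higher
```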
44. Give some statistical methods that are useful for data analysis.
Two main kinds of statistical methods are used in data analysis: descriptive
statistics, which summarize data using indexes such as the mean and median, and
inferential statistics, which draw conclusions from data using statistical tests such
as Student's t-test. Other such tests include ANOVA, the Kruskal-Wallis H test, the
Friedman test, etc.
DELETE is used to delete one or more rows (tuples) of a table. With the help of
the DROP command we can drop (delete) a whole structure in one go, i.e., it
removes the named element from the schema. TRUNCATE is used to delete all
the rows of a table while keeping its structure.
46. What is data transformation?
Transformation refers to operations that change data, which may include data
standardization, sorting, deduplication, validation, and verification. The ultimate
goal is to make it possible to analyze the data.
The first principal component axis is selected in a way such that it explains most of
the variation in the data and is closest to all n observations.
Hard-margin SVMs require the training data to be linearly separable; no data
points are allowed inside the margin area. This type of linear classification is
known as hard-margin classification.
Soft-margin SVMs handle training data that are not linearly separable. Allowing
margin violations means choosing a hyperplane that lets some data points stay
either inside the margin area or on the incorrect side of the hyperplane.
50. What is the empirical rule?
In statistics, the empirical rule states that almost all of the data in a normal
distribution lies within three standard deviations of the mean. It is also known as
the 68–95–99.7 rule. According to the empirical rule, the percentages of values
that lie in a normal distribution follow the 68%, 95%, and 99.7% rule: 68% of
values will fall within one standard deviation of the mean, 95% will fall within two
standard deviations, and 99.7% will fall within three standard deviations of the
mean.
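The rule can be checked numerically: for a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2):

```python
# Verifying the 68-95-99.7 rule from the normal CDF.
import math

for k in (1, 2, 3):
    pct = math.erf(k / math.sqrt(2)) * 100
    print(f"within {k} sigma: {pct:.2f}%")  # 68.27%, 95.45%, 99.73%
```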
In the left-skewed distribution, the left tail is longer than the right side.
Mean < median < mode
In the right-skewed distribution, the right tail is longer. It is also known as
positive-skew distribution.
Mode < median < mean
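A quick check of the right-skew ordering on a made-up sample:

```python
# In a right-skewed sample, the long right tail pulls the mean above the
# median, giving mode <= median < mean.
import statistics

right_skewed = [2, 3, 3, 3, 4, 5, 14]  # one large value in the right tail

mean = statistics.mean(right_skewed)
median = statistics.median(right_skewed)
mode = statistics.mode(right_skewed)
print(mode <= median < mean)  # True
```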
The coefficients and the odds ratios then represent the effect of each independent
variable controlling for all of the other independent variables in the model and each
coefficient can be tested for significance.
The first principal component formed by PCA will account for the maximum
variation in the data; PC2 does the second-best job of capturing the remaining
variation, and so on.
LD1, the first new axis created by Linear Discriminant Analysis, accounts for most
of the variation between the groups or categories, followed by LD2, and so on.
54. What is the difference between Type 1 and Type 2 errors?
Type 1 error is a false positive error that ‘claims’ that an incident has occurred
when, in fact, nothing has occurred. The best example of a false positive error is a
false fire alarm – the alarm starts ringing when there’s no fire. Contrary to this, a
Type 2 error is a false negative error that ‘claims’ nothing has occurred when
something has definitely happened. It would be a Type 2 error to tell a pregnant
lady that she isn’t carrying a baby.