Podar Pearl School: Chapter 1: Capstone Project Question and Answers
(Under the supervision of the Ministry of Education and Higher Education, Qatar)
Ans:
1) Which category? (Classification)
2) How much or how many? (Regression)
3) Which group? (Clustering)
4) Is this unusual? (Anomaly Detection)
5) Which option should be taken? (Recommendation)
It is important to determine which of these questions we are asking, and how
answering it helps us solve our problem.
6. Define Design Thinking
Ans:
Design Thinking is a design methodology that provides a solution-based approach
to solving problems. It’s extremely useful in tackling complex problems that are
ill-defined or unknown.
7. Mention the five stages in Design Thinking
Ans:
1) Empathize
2) Define
3) Ideate
4) Prototype
5) Test
8. Briefly explain the term ‘Business Understanding’
Ans:
Every project, regardless of its size, starts with business understanding, which lays
the foundation for successful resolution of the business problem. The business
sponsors who need the analytic solution play a critical role in this stage by
defining the problem, project objectives and solution requirements from a business
perspective. It is the first stage in foundational methodology for data science.
If the question is to show relationships, a descriptive approach may be
required.
11. How will you identify the data requirements as a part of solving problem?
Ans:
We can identify the data requirements by answering the following questions:
Who?
What?
Where?
When?
Why?
How?
12. What are the points that a data scientist needs to identify at the data
requirements stage in data science methodology?
Ans:
1. Which data ingredients are required?
2. How to source or collect them?
3. How to understand or work with them?
4. How to prepare the data to meet the desired outcome?
13. How does the data requirements stage play a vital role in data science
methodology?
Ans:
It is vital to define the data requirements for decision-tree classification prior to
undertaking the data collection and data preparation stages of the methodology.
This includes identifying the necessary data content, formats and sources for
initial data collection.
In this phase the data requirements are revised and decisions are made as to
whether more or less data is needed.
Once the data ingredients are collected, the data scientist will have a good
understanding of what they will be working with.
14. Why can techniques such as descriptive statistics and visualization be
applied to the data set?
Ans:
Techniques such as descriptive statistics and visualization can be applied to the
data set to assess its content, quality, and initial insights. Gaps in the data will be
identified, and plans to either fill them or make substitutions will have to be made.
15. What are the two types of AI models?
Ans:
The two types of AI models are,
a) Descriptive
For example, Netflix uses this type of analytics to see what genres and TV shows
interest its subscribers most.
b) Predictive
A predictive model tries to yield yes/no, or stop/go type outcomes. These models are
based on the analytic approach that was taken, either statistically driven or machine
learning driven.
Data Modelling focuses on developing models that are either descriptive or predictive.
The end goal is to move the data scientist to a point where a data model can be built to
answer the question.
18. What is train-test split Evaluation?
Ans:
The train-test split is a technique for evaluating the performance of a machine
learning algorithm.
It can be used for classification or regression problems and can be used for any
supervised learning algorithm.
The procedure involves taking a dataset and dividing it into two subsets, training
dataset and testing dataset.
Train Dataset: Used to fit the machine learning model.
Test Dataset: Used to evaluate the fit machine learning model.
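As a hedged illustration, the split described above might look like this with scikit-learn’s train_test_split (the toy data, the 30% test size, and the random_state are assumptions for demonstration, not from the chapter):

```python
# Sketch of a train-test split, assuming scikit-learn is installed.
from sklearn.model_selection import train_test_split

X = list(range(10))          # 10 input samples (made-up data)
y = [v * 2 for v in X]       # matching target values

# Hold out 30% of the rows for testing; fix random_state for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))   # 7 training rows, 3 test rows
```

The model would then be fit only on X_train/y_train and evaluated on X_test/y_test.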
27. What are the standard mathematical measures to evaluate model quality?
Ans:
RMSE – Root Mean Square Error Method
MSE - Mean Square Error Method
MAPE - Mean Absolute Percentage Error
MAE – Mean Absolute Error Method
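A minimal sketch computing all four measures by hand in plain Python (the sample actual and predicted values are invented for illustration):

```python
import math

y_true = [100.0, 200.0, 300.0, 400.0]   # made-up actual values
y_pred = [110.0, 190.0, 310.0, 390.0]   # made-up predicted values
n = len(y_true)

# Mean Absolute Error: average of absolute differences.
mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / n
# Mean Squared Error: average of squared differences.
mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / n
# Root Mean Squared Error: square root of MSE.
rmse = math.sqrt(mse)
# Mean Absolute Percentage Error: average of absolute errors relative to actuals.
mape = 100 * sum(abs(a - p) / a for a, p in zip(y_true, y_pred)) / n

print(mae, mse, rmse, mape)   # 10.0  100.0  10.0  ~5.21%
```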
29. What are the types of Classification Loss and Regression Loss?
Ans:
31. What is MSE (Mean Squared Error)?
Ans:
MSE is the most commonly used regression loss function. It is the average of the
squared differences between the target values and the predicted values.
Formula: MSE = (1/n) * Σ (actual_i − predicted_i)^2
Extra Questions
33. What is MAE and MAPE?
Ans:
MAE
The Mean Absolute Error is the mean of the absolute differences between the actual
values and the predicted values.
MAPE
Mean Absolute Percentage Error (MAPE) is a statistical measure to define the
accuracy of a machine learning algorithm on a particular dataset.
It represents the average of the absolute percentage errors of each entry in a dataset
to calculate how accurate the forecasted quantities were in comparison with the actual
quantities.
1. What are the steps to decompose a problem before writing code?
Ans:
1. Understand the problem and then restate the problem in your own words
Know what the desired inputs and outputs are
Ask questions for clarification
2. Break the problem down into a few large pieces. Write these down, either on paper
or as comments in a file.
3. Break complicated pieces down into smaller pieces. Keep doing this until all of the
pieces are small.
4. Code one small piece at a time.
Think about how to implement it
Write the code/query
Test it.
2. Imagine that you want to create your first app. How would you
decompose the task of creating an app?
Ans:
To decompose this task, we would need to know the answers to a series of smaller
problems:
What kind of app do you want to create?
What will your app look like?
Who is the target audience for your app?
What will the graphics look like?
What audio will you include?
What software will you use to build your app?
How will the user navigate your app?
How will you test your app?
This list has broken down the complex problem of creating an app into
much simpler problems that can now be worked out.
3. What is time series decomposition?
Ans:
Time series decomposition involves thinking of a series as a combination of level,
trend, seasonality, and noise components. Decomposition provides a useful abstract
model for thinking about time series generally and for better understanding problems
during time series analysis and forecasting.
These components are defined as follows:
Level: The average value in the series.
Trend: The increasing or decreasing value in the series.
Seasonality: The repeating short-term cycle in the series.
Noise: The random variation in the series.
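The four components can be sketched by building a toy additive series in plain Python (all numbers here are made up for demonstration; a real analysis would typically use a library such as statsmodels):

```python
import math
import random

random.seed(0)
n = 24  # two "years" of monthly observations (illustrative)

level = [50.0] * n                                   # the average value in the series
trend = [0.5 * t for t in range(n)]                  # steadily increasing value
seasonality = [5 * math.sin(2 * math.pi * t / 12)    # repeating 12-step cycle
               for t in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]       # random variation

# An additive model combines the four components by simple addition.
series = [level[t] + trend[t] + seasonality[t] + noise[t] for t in range(n)]
```

Decomposition works in the opposite direction: it starts from the observed series and estimates each of these components.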
4. Depict the Foundational Methodology of Data Science using a
diagram
Ans:
5. Explain the train-test split procedure
Ans:
The train-test split is a technique for evaluating the performance of a machine
learning algorithm.
It can be used for classification or regression problems and can be used for any
supervised learning algorithm.
The procedure involves taking a dataset and dividing it into two subsets. The
first subset is used to fit the model and is referred to as the training dataset. The
second dataset is referred to as the test dataset.
The test dataset is not used to train the model; instead, the input element of the
dataset is provided to the model, then predictions are made and compared to
the expected values.
The objective is to estimate the performance of the machine learning model on
new data – the data which is not used to train the model.
The idea is to fit the model on available data with known inputs and outputs,
then make predictions on new examples in the future where we do not have the
expected output or target values.
The train-test procedure is appropriate when there is a sufficiently large dataset
available.
6. Explain the procedure of K-fold cross validation
Ans:
Shuffle the dataset randomly.
Split the dataset into k groups
For example, if k=5, we divide the data into 5 pieces, each being 20% of the full
dataset.
We run an experiment called experiment 1 which uses the first fold as a holdout
set, and everything else as training data. This gives us a measure of model
quality based on a 20% holdout set.
We then run a second experiment, where we hold out data from the second fold
(using everything except the 2nd fold for training the model.) This gives us a
second estimate of model quality. We repeat this process, using every fold once
as the holdout. Putting this together, 100% of the data is used as a holdout at
some point.
Finally, summarize the skill of the model using the sample of model evaluation
scores.
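The steps above can be sketched in plain Python (the dataset and the placeholder "score" are assumptions for illustration; a real run would fit and evaluate a model in the marked spot):

```python
import random

random.seed(1)
data = list(range(20))
random.shuffle(data)             # 1) shuffle the dataset randomly

k = 5
fold_size = len(data) // k
# 2) split the dataset into k groups of equal size
folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]

scores = []
for i in range(k):
    holdout = folds[i]           # 3) fold i serves as the holdout set
    training = [x for j, f in enumerate(folds) if j != i for x in f]
    # ... fit a model on `training`, evaluate it on `holdout` ...
    scores.append(len(holdout) / len(data))   # placeholder "score"

# Every sample appears in a holdout set exactly once,
# so 100% of the data is used as a holdout at some point.
covered = sorted(x for f in folds for x in f)
```

The final model-quality estimate is the summary (typically the mean) of the k scores.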