
PODAR PEARL SCHOOL

(Under the supervision of the Ministry of Education and Higher Education, Qatar)

Chapter 1: Capstone Project

Question and Answers:

Short Answer questions:

1. What is a Capstone Project?

Ans: A capstone project is a project where students must research a topic independently to gain a deep understanding of the subject matter. It gives the student an opportunity to integrate all their knowledge and demonstrate it through a comprehensive project.
2. Mention some Capstone Project ideas
Ans:
1. Stock Prices Predictor
2. Develop a Sentiment Analyzer
3. Movie Ticket Price Predictor
4. Student Results Predictor
5. Human Activity Recognition using Smartphone Dataset
6. Classifying humans and animals in a photo

3. List down the steps involved in developing an AI Project
Ans:
1) Problem definition or Understanding the problem
2) Data gathering
3) Feature definition
4) AI model construction
5) Evaluation & refinements
6) Deployment

4. State the important criteria in understanding the problem
Ans:
When we begin formulating or understanding a problem using Machine Learning
techniques, we have to look for a pattern in the data. If there is no pattern, then the
problem cannot be solved with AI technology.

5. What are the five types of questions to be geared towards applying AI development techniques?
(or)
What are the five questions to be asked to begin with predictive analysis?
Ans:
1) Which category? (Classification)
2) How much or how many? (Regression)
3) Which group? (Clustering)
4) Is this unusual? (Anomaly Detection)
5) Which option should be taken? (Recommendation)
It is important to determine which of these questions we are asking, and how
answering it helps us solve our problem.
6. Define Design Thinking
Ans:
Design Thinking is a design methodology that provides a solution-based approach to solving problems. It is extremely useful in tackling complex problems that are ill-defined or unknown.
7. Mention the five stages in Design Thinking
Ans:
1) Empathize
2) Define
3) Ideate
4) Prototype
5) Test
8. Briefly explain the term ‘Business Understanding’
Ans:
Every project, regardless of its size, starts with business understanding, which lays
the foundation for successful resolution of the business problem. The business
sponsors who need the analytic solution play the critical role in this stage by
defining the problem, project objectives and solution requirements from a business
perspective. It is the first stage in foundational methodology for data science.

9. Briefly explain the term ‘Analytical approach’
Ans:
 After clearly stating a business problem, the data scientist can define the
analytic approach to solve it.
 It involves expressing the problem in the context of statistical and machine
learning techniques so that the data scientist can identify techniques suitable
for achieving the desired outcome.
 Selecting the right analytic approach depends on the question being asked.
 Once the problem to be addressed is defined, the appropriate analytic
approach for the problem is selected in the context of the business
requirements.
 This is the second stage of the data science methodology.
10. How is the appropriate analytic approach selected based on the type of question?
Ans:

 If the question is to determine probabilities of an action, then a predictive model might be used.
 If the question is to show relationships, a descriptive approach may be required.
 Statistical analysis applies to problems that require counts: if the question requires a yes/no answer, then a classification approach to predicting a response would be suitable.

11. How will you identify the data requirements as a part of solving a problem?
Ans:
We can identify the data requirements by answering the following questions:
 Who?
 What?
 Where?
 When?
 Why?
 How?
12. What are the points that a data scientist needs to identify at the data requirements stage in data science methodology?
Ans:
1. Which data ingredients are required?
2. How to source or collect them?
3. How to understand or work with them?
4. How to prepare the data to meet the desired outcome?

13. How does the data requirements stage play a vital role in data science methodology?
Ans:
 It is vital to define the data requirements for decision-tree classification prior to
undertaking the data collection and data preparation stages of the methodology.
 This includes identifying the necessary data content, formats and sources for
initial data collection.
 In this phase the data requirements are revised and decisions are made as to
whether or not the collection requires more or less data.
 Once the data ingredients are collected, the data scientist will have a good
understanding of what they will be working with.
14. Why are techniques such as descriptive statistics and visualization applied to the data set?
Ans:
Techniques such as descriptive statistics and visualization are applied to the data set to assess its content and quality and to gain initial insights about the data. Gaps in the data will be identified, and plans to either fill them or make substitutions will have to be made. A sketch of this first-pass assessment follows.
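As an illustration, here is a minimal sketch in Python of this first-pass assessment using pandas; the file name data.csv and its columns are hypothetical, and the histogram call assumes matplotlib is installed.

```python
# A quick first-pass assessment of a dataset with pandas.
# "data.csv" and its columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.describe())      # descriptive statistics: count, mean, std, quartiles
print(df.isnull().sum())  # gaps in the data: missing values per column
df.hist(figsize=(8, 6))   # quick visualization of each numeric column's distribution
```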

15. What are the two types of AI models?
Ans:
The two types of AI models are,
a) Descriptive
A descriptive model examines what has happened; for example, Netflix uses this type of analytics to see what genres and TV shows interest its subscribers most.

b) Predictive
A predictive model tries to yield yes/no, or stop/go type outcomes. These models are
based on the analytic approach that was taken, either statistically driven or machine
learning driven.
Data Modelling focuses on developing models that are either descriptive or predictive.

16. What is a training data set?
Ans:
A training set is a set of historical data in which the outcomes are already known. The
training set acts like a gauge to determine if the model needs to be calibrated. The
data scientist will use a training set for predictive modelling.
17. What are the steps to be undertaken for the success of data compilation,
preparation and modelling?
Ans:

 First, understand the question at hand.
 Second, select an analytic approach or method to solve the problem.
 Third, obtain, understand, prepare, and model the data.
The end goal is to move the data scientist to a point where a data model can be built to
answer the question.
18. What is Train-Test Split Evaluation?
Ans:
 The train-test split is a technique for evaluating the performance of a machine
learning algorithm.
 It can be used for classification or regression problems and can be used for any
supervised learning algorithm.
 The procedure involves taking a dataset and dividing it into two subsets, training
dataset and testing dataset.
 Train Dataset: Used to fit the machine learning model.
 Test Dataset: Used to evaluate the fit machine learning model.

19. What is the main parameter in configuring the Train-Test Split?
Ans:
 The main configuration parameter is the size of the train and test sets. This is most commonly expressed as a proportion between 0 and 1 for either the train or the test dataset.
 For example, a training set size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.
 There is no optimal split percentage. A minimal sketch of configuring the split is shown below.
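To illustrate questions 18 and 19, here is a minimal sketch using scikit-learn's train_test_split; the synthetic data from make_classification is a placeholder for a real project dataset.

```python
# A 67/33 train-test split with scikit-learn; the synthetic
# classification data stands in for a real project dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)

# test_size=0.33 assigns 33 percent to the test set, leaving 67 percent to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

print(X_train.shape, X_test.shape)  # (670, 20) (330, 20)
```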
20. What are the considerations in choosing the split percentage to meet our
project’s objectives?
Ans:

 Computational cost in training the model.
 Computational cost in evaluating the model.
 Training set representativeness.
 Test set representativeness.

21. What are the common split percentages?
Ans:
 The most common split percentage is Train: 80%, Test: 20%
The other split percentages include:
 Train: 67%, Test: 33%
 Train: 50%, Test: 50%
22. What are the prerequisites or Python libraries needed for Train-Test Split?
Ans:
 pandas
 scikit-learn (sklearn)
23. We split this data into labels and features. What do the terms ‘labels’ and
‘features’ refer to?
Ans:
Features - the data we use to predict labels
Labels - the data we want to predict
Using features, we predict labels.
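A minimal sketch of this split with pandas follows; the DataFrame and the label column name result are hypothetical.

```python
# Splitting a small DataFrame into features (X) and labels (y).
# The column names, including the label column "result", are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [2, 5, 1, 8],
    "attendance":    [60, 90, 50, 95],
    "result":        [0, 1, 0, 1],
})

X = df.drop(columns=["result"])  # features: the data we use to predict labels
y = df["result"]                 # labels: the data we want to predict
```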

24. What are the advantages and shortcomings of Train-Test Split?
Ans:
Advantages
 Easy to implement and interpret
 Less time consuming in execution
Disadvantages
 The train-test procedure is not appropriate when the dataset available is small.
The reason is that when the dataset is split into train and test sets, there will
not be enough data in the training dataset for the model to learn an effective
mapping of inputs to outputs. There will also not be enough data in the test
set to effectively evaluate the model performance. It will decrease the
accuracy of the predictive model.
 If the split is not random, the output of the evaluation metrics will be inaccurate.
 Can cause over-fitted predictive models.

25. What is K-fold Cross Validation?
Ans:
 Cross-validation is a statistical method used to estimate the skill of machine
learning models.
 The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. Hence the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
 It is a popular method because it is simple to understand and generally results
in a less biased estimate of the model skill than other methods, such as a
simple train/test split.
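As an illustration, here is a minimal sketch of 10-fold cross-validation using scikit-learn's KFold and cross_val_score; the logistic regression model and synthetic data are placeholders.

```python
# 10-fold cross-validation with scikit-learn; model and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=1)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)  # k=10
scores = cross_val_score(model, X, y, cv=kfold)           # one score per fold

# mean accuracy plus the minimum and maximum across folds
print(scores.mean(), scores.min(), scores.max())
```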

26. Differentiate Train-Test split and K-fold Cross Validation
Ans:
Train-Test Split:
 Model quality scores can be unreliable, because only a portion of the dataset is used to generate evaluation metrics.
 It runs faster.
 It is suitable for larger datasets.
 We cannot tell stakeholders the exact accuracy of the model.
K-fold Cross Validation:
 It gives a more accurate measure of model quality.
 It can take more time to run, because it estimates a model once for each fold.
 It is suitable for smaller datasets.
 We can report the mean accuracy to stakeholders, and also explain the minimum and maximum accuracy the model will predict.

27. What are the standard mathematical measures to evaluate model quality?
Ans:
 RMSE – Root Mean Square Error Method
 MSE - Mean Square Error Method
 MAPE - Mean Absolute Percentage Error
 MAE – Mean Absolute Error Method

28. Explain the terms
Ans:
a) Objective function:
All the algorithms in machine learning rely on minimizing or maximizing a function,
which we call “objective function”.
b) Loss Function:
The group of functions that are minimized are called “loss functions”. A loss function is
a measure of how good a prediction model does in terms of being able to predict the
expected outcome. Loss functions can be broadly categorized into 2 types:
Classification and Regression Loss.
c) Gradient Descent:
The most commonly used method of finding the minimum point of a function is “gradient descent”. A minimal sketch follows.
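Here is a minimal sketch of gradient descent applied to the simple function f(x) = (x - 3)^2; the starting point and learning rate are illustrative choices.

```python
# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
def f_prime(x):
    return 2 * (x - 3)  # derivative (gradient) of f

x = 0.0                 # arbitrary starting point
learning_rate = 0.1     # illustrative step size
for _ in range(100):
    x -= learning_rate * f_prime(x)  # step in the direction that decreases f

print(round(x, 4))      # converges towards 3.0, the minimum point
```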

29. What are the types of Classification Loss and Regression Loss?
Ans:
Regression losses include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). Classification losses include Cross-Entropy Loss (Log Loss) and Hinge Loss.
30. What is RMSE (Root Mean Squared Error)?
Ans:
RMSE is one of the methods to determine the accuracy of our model in predicting the target values. Mathematically, we take the square root of the mean of the squared errors between the test values and the predicted values.
Formula:
For a single value:
Let a = (predicted value - actual value)^2
Let b = mean of a = a (for a single value)
Then RMSE = square root of b
For a wide set of values:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
31. What is MSE (Mean Squared Error)?
Ans:
MSE is the most commonly used regression loss function. MSE is the mean of the squared distances between our target variable and the predicted values.
Formula:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
32. Why and when to use mean squared error?
Ans:
 MSE is sensitive to outliers; given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with Mean Absolute Error, where the optimal prediction is the median.
 MSE is thus good to use if we believe that our target data, conditioned on the input, is normally distributed around a mean value, and when it is important to penalize outliers/large errors much more heavily than small ones.

Extra Questions
33. What is MAE and MAPE?
Ans:
MAE
The Mean Absolute Error is the mean of the absolute differences between the actual values and the predicted values.

MAPE
Mean Absolute Percentage Error (MAPE) is a statistical measure to define the
accuracy of a machine learning algorithm on a particular dataset.
It represents the average of the absolute percentage errors of each entry in a dataset
to calculate how accurate the forecasted quantities were in comparison with the actual
quantities.
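As an illustration, here is a minimal sketch computing all four error measures from question 27; the true and predicted values are made up, and the MAPE helper assumes scikit-learn version 0.24 or later.

```python
# Computing MSE, RMSE, MAE and MAPE for illustrative values.
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

y_true = np.array([100, 150, 200, 250])   # actual values
y_pred = np.array([110, 140, 190, 260])   # predicted values

mse = mean_squared_error(y_true, y_pred)               # Mean Squared Error
rmse = np.sqrt(mse)                                    # Root Mean Squared Error
mae = mean_absolute_error(y_true, y_pred)              # Mean Absolute Error
mape = mean_absolute_percentage_error(y_true, y_pred)  # a fraction, e.g. 0.06 = 6%
print(mse, rmse, mae, mape)
```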

Long answer questions:


1. Explain the Problem decomposition steps
Ans:
Real computational tasks are complicated. To accomplish them we need to break
down the problem into smaller units before coding.

1. Understand the problem and then restate the problem in your own words
 Know what the desired inputs and outputs are
 Ask questions for clarification
2. Break the problem down into a few large pieces. Write these down, either on paper
or as comments in a file.
3. Break complicated pieces down into smaller pieces. Keep doing this until all of the
pieces are small.
4. Code one small piece at a time.
 Think about how to implement it
 Write the code/query
 Test it.
 Fix problems, if any.

2. Imagine that you want to create your first app. How would you
decompose the task of creating an app?

Ans:
To decompose this task, we would need to know the answer to a series of smaller
problems:
 What kind of app do you want to create?
 What will your app look like?
 Who is the target audience for your app?
 What will the graphics look like?
 What audio will you include?
 What software will you use to build your app?
 How will the user navigate your app?
 How will you test your app?

This list has broken down the complex problem of creating an app into
much simpler problems that can now be worked out.

3. Explain Time Series Decomposition

Ans:
Time series decomposition involves thinking of a series as a combination of level,
trend, seasonality, and noise components. Decomposition provides a useful abstract
model for thinking about time series generally and for better understanding problems
during time series analysis and forecasting.
These components are defined as follows:
Level: The average value in the series.
Trend: The increasing or decreasing value in the series.
Seasonality: The repeating short-term cycle in the series.
Noise: The random variation in the series.
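As an illustration, here is a minimal sketch of additive decomposition using statsmodels' seasonal_decompose; the monthly series is synthetic (a trend plus a 12-month cycle plus noise) purely for demonstration.

```python
# Decomposing a synthetic monthly series into trend, seasonality and noise.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(1)
t = np.arange(48)
values = 10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 48)
series = pd.Series(values, index=pd.date_range("2020-01-01", periods=48, freq="MS"))

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())   # trend: the increasing value in the series
print(result.seasonal.head())         # seasonality: the repeating 12-month cycle
print(result.resid.dropna().head())   # noise: the random variation left over
```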

4. Depict the Foundational Methodology of Data Science using a diagram
Ans:
The diagram shows the stages of the methodology flowing in sequence, with feedback loops back to earlier stages: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment → Feedback.
5. Explain the Train-Test Split Evaluation

Ans:
 The train-test split is a technique for evaluating the performance of a machine
learning algorithm.
 It can be used for classification or regression problems and can be used for any
supervised learning algorithm.
 The procedure involves taking a dataset and dividing it into two subsets. The
first subset is used to fit the model and is referred to as the training dataset. The
second dataset is referred to as the test dataset.
 The test dataset is not used to train the model; instead, the input element of the
dataset is provided to the model, then predictions are made and compared to
the expected values.
 The objective is to estimate the performance of the machine learning model on
new data – the data which is not used to train the model.
 The idea is to fit the model on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values.
 The train-test procedure is appropriate when there is a sufficiently large dataset available.
6. Explain the procedure of K-fold cross validation
Ans:
 Shuffle the dataset randomly.
 Split the dataset into k groups
 For example if k=5, we divide the data into 5 pieces, each being 20% of the full
dataset.
 We run an experiment called experiment 1 which uses the first fold as a holdout
set, and everything else as training data. This gives us a measure of model
quality based on a 20% holdout set.
 We then run a second experiment, where we hold out data from the second fold
(using everything except the 2nd fold for training the model.) This gives us a
second estimate of model quality. We repeat this process, using every fold once
as the holdout. Putting this together, 100% of the data is used as a holdout at
some point.
 Finally, summarize the skill of the model using the sample of model evaluation scores. A minimal sketch of this procedure follows.
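Here is a minimal sketch of this experiment loop using scikit-learn's KFold directly; the logistic regression model and synthetic data are placeholders, and with k=5 each fold is a 20% holdout used exactly once.

```python
# The experiment loop above, written out with scikit-learn's KFold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=1)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)  # shuffle, then split into 5 groups

for i, (train_idx, holdout_idx) in enumerate(kfold.split(X), start=1):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                # train on everything else
    score = model.score(X[holdout_idx], y[holdout_idx])  # evaluate on the holdout fold
    print(f"experiment {i}: accuracy = {score:.3f}")
```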
