
Unit 1: Capstone Project

A capstone project is a project in which students research a topic independently to gain a deep understanding of the subject matter. It gives students an opportunity to integrate all their knowledge and demonstrate it through a comprehensive project.

1) Understanding The Problem

Artificial Intelligence is perhaps the most transformative technology available today. At a high level, every AI project follows six steps:
1) Problem definition, i.e. understanding the problem
2) Data gathering
3) Feature definition
4) AI model construction
5) Evaluation & refinements
6) Deployment
The premise that underlies all Machine Learning disciplines is that there needs to be a pattern. If there is no pattern, then the problem cannot be solved with AI technology. It is fundamental to ask whether such a pattern exists before deciding to embark on an AI development journey. If it is believed that there is a pattern in the data, then AI development techniques may be employed. Applied uses of these techniques are typically geared towards answering five types of questions, all of which fall under the umbrella of predictive analysis:
1) Which category? (Classification)
2) How much or how many? (Regression)
3) Which group? (Clustering)
4) Is this unusual? (Anomaly Detection)
5) Which option should be taken? (Recommendation)

2) Decomposing The Problem Through DT Framework

Design Thinking is a design methodology that provides a solution-based approach to solving problems. It is extremely useful in tackling complex problems that are ill-defined or unknown.
The five stages of Design Thinking are as follows: Empathize, Define, Ideate, Prototype, and Test.
Real computational tasks are complicated. To accomplish them, you need to break the problem down into smaller units before coding.
Problem decomposition steps

1. Understand the problem and then restate it in your own words.
Know what the desired inputs and outputs are.
Ask questions for clarification (in class these questions might be directed to your instructor, but most of the time you will be asking either yourself or your collaborators).

2. Break the problem down into a few large pieces. Write these down, either on paper or as comments in a file.

3. Break complicated pieces down into smaller pieces. Keep doing this until all of the pieces are small.

4. Code one small piece at a time (see the small sketch below):
1. Think about how to implement it.
2. Write the code/query.
3. Test it… on its own.
4. Fix problems, if any.
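As an illustration of coding and testing one small piece at a time, here is a minimal Python sketch (the function and the values are hypothetical, chosen only for illustration):

# One small piece: compute the arithmetic mean of a list of numbers.
def mean(values):
    return sum(values) / len(values)

# Test this piece on its own before moving on to the next piece.
assert mean([2, 4, 6]) == 4
assert mean([5]) == 5
print("mean() works as expected")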

As an example of decomposing data itself, consider breaking a time series down into trend, seasonal, and residual components. Reviewing the line plot of such a series suggests that there may be a linear trend, but it is hard to be sure from eyeballing it. There is also seasonality, but the amplitude (height) of the cycles appears to be increasing, suggesting that it is multiplicative.
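The original example is not reproduced in these notes; a minimal sketch of such a decomposition, assuming a monthly series stored in a hypothetical CSV file named airline-passengers.csv and the statsmodels library, could look like this:

# Sketch: multiplicative decomposition of a monthly time series.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed CSV layout: a date column used as the index and one value column.
series = pd.read_csv("airline-passengers.csv", index_col=0,
                     parse_dates=True).squeeze("columns")

# Split the series into observed, trend, seasonal, and residual components.
result = seasonal_decompose(series, model="multiplicative", period=12)
result.plot()      # one panel per component
plt.show()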

Running the example plots the observed, trend, seasonal, and residual time series. We can see that the trend and seasonality information extracted from the series does seem reasonable. The residuals are also interesting, showing periods of high variability in the early and later years of the series.

3) Analytic Approach

Those who work in the domain of AI and Machine Learning solve problems and answer questions through data every day. They build models to predict outcomes or discover underlying patterns, all to gain insights leading to actions that will improve future outcomes.
Every project, regardless of its size, starts with business understanding, which lays the foundation for successful resolution of the business problem. The business sponsors needing the analytic solution play the critical role in this stage by defining the problem, project objectives and solution requirements from a business perspective. And, believe it or not, even with nine stages still to go, this first stage is the hardest.
After clearly stating a business problem, the data scientist
can define the analytic approach to solving it. Doing so
involves expressing the problem in the context of
statistical and machine learning techniques so that the
data scientist can identify techniques suitable for achieving
the desired outcome.
Selecting the right analytic approach depends on the
question being asked. Once the problem to be addressed
is defined, the appropriate analytic approach for the
problem is selected in the context of the business
requirements. This is the second stage of the data science
methodology.

If the question is to determine probabilities of an action, then a predictive model might be used.
If the question is to show relationships, a descriptive approach may be required.
Statistical analysis applies to problems that require counts: if the question requires a yes/no answer, then a classification approach to predicting a response would be suitable.

4) Data Requirement
If the problem that needs to be resolved is "a recipe", so to
speak, and data is "an ingredient", then the data scientist
needs to identify:

1. which ingredients are required?
2. how to source or collect them?
3. how to understand or work with them?
4. how to prepare the data to meet the desired outcome?

Prior to undertaking the data collection and data preparation stages of the methodology, it is vital to define the data requirements (here, for decision-tree classification). This includes identifying the necessary data content, formats and sources for initial data collection.
In this phase the data requirements are revised and decisions are made as to whether the collection requires more or less data. Once the data ingredients are collected, the data scientist will have a good understanding of what they will be working with. Techniques such as descriptive statistics and visualization can be applied to the data set to assess its content, quality, and initial insights. Gaps in the data will be identified, and plans to either fill them or make substitutions will have to be made.
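As a small illustration of applying descriptive statistics to assess the content and quality of the data, here is a minimal pandas sketch (the file name data.csv is a placeholder):

# Sketch: a quick look at the content and quality of a dataset with pandas.
import pandas as pd

df = pd.read_csv("data.csv")   # placeholder file name

print(df.shape)                # number of rows and columns
print(df.dtypes)               # data type of each column
print(df.describe())           # descriptive statistics for numeric columns
print(df.isnull().sum())       # gaps: missing values per column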
In essence, the ingredients are now sitting on the cutting
board.
5) Modeling Approach
Data Modeling focuses on developing models that are
either descriptive or predictive.
An example of a descriptive model might examine things
like: if a person did this, then they're likely to prefer that.
A predictive model tries to yield yes/no, or stop/go type
outcomes. These models are based on the analytic
approach that was taken, either statistically driven or
machine learning driven.

The data scientist will use a training set for predictive modelling. A training set is a set of historical data in which the outcomes are already known. The training set acts like a gauge to determine if the model needs to be calibrated. In this stage, the data scientist will experiment with different algorithms to ensure that the variables in play are actually required.

The success of data compilation, preparation and modelling depends on the understanding of the problem at hand, and on the appropriate analytical approach being taken. The data supports the answering of the question and, like the quality of the ingredients in cooking, sets the stage for the outcome.
Constant refinement, adjustment and tweaking are necessary within each step to ensure the outcome is solid.
The framework is geared to do three things:
First, understand the question at hand.
Second, select an analytic approach or method to solve the problem.
Third, obtain, understand, prepare, and model the data.
The end goal is to move the data scientist to a point where a data model can be built to answer the question.
6) How to validate model quality

6.1) Train-Test Split Evaluation

The train-test split is a technique for evaluating the performance of a machine learning algorithm. It can be used for classification or regression problems and for any supervised learning algorithm.
The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, predictions are made, and these are compared to the expected values. This second dataset is referred to as the test dataset.
Train Dataset: Used to fit the machine learning model.
Test Dataset: Used to evaluate the fitted machine learning model.

The objective is to estimate the performance of the machine learning model on new data: data not used to train the model. This is how we expect to use the model in practice: fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values.
The train-test procedure is appropriate when there is a sufficiently large dataset available.
How to Configure the Train-Test Split

The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a proportion between 0 and 1 for either the train or test dataset. For example, a training set with a size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.

There is no optimal split percentage. You must choose a split percentage that meets your project's objectives, with considerations that include:
Computational cost in training the model.
Computational cost in evaluating the model.
Training set representativeness.
Test set representativeness.
Nevertheless, common split percentages include:
Train: 80%, Test: 20%
Train: 67%, Test: 33%
Train: 50%, Test: 50%
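As a minimal sketch of such a split in scikit-learn (the arrays X and y are randomly generated placeholders, not data from these notes):

# Sketch: an 80/20 train-test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)    # placeholder feature matrix
y = np.random.rand(1000)       # placeholder target values

# test_size=0.20 assigns 20 percent of the rows to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

print(X_train.shape, X_test.shape)   # (800, 4) (200, 4)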
6.2) Introducing the Concept of Cross-Validation
You will face choices about which predictive variables to use, what types of models to use, what arguments to supply to those models, and so on. We make these choices in a data-driven way by measuring the model quality of the various alternatives.
You have already learned to use train_test_split to split the data, so you can measure model quality on the test data. Cross-validation extends this approach to model scoring (or "model validation"). Compared to train_test_split, cross-validation gives you a more reliable measure of your model's quality, though it takes longer to run.
The Shortcoming of Train-Test Split
Imagine you have a dataset with 5000 rows. The
train_test_split function has an argument for test_size that
you can use to decide how many rows go to the training
set and how many go to the test set. The larger the test
set, the more reliable your measures of model quality will
be. At an extreme, you could imagine having only 1 row of
data in the test set. If you compare alternative models,
which one makes the best predictions on a single data
point will be mostly a matter of luck.
You will typically keep about 20% as a test dataset. But
even with 1000 rows in the test set, there's some random
chance in determining model scores. A model might do
well on one set of 1000 rows, even if it would be
inaccurate on a different 1000 rows. The larger the test
set, the less randomness (aka "noise") there is in our
measure of model quality.
The Cross-Validation Procedure

In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality. For example, we could have 5 folds or experiments: we divide the data into 5 pieces, each being 20% of the full dataset.
We run experiment 1, which uses the first fold as a holdout set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set, much as we got from using the simple train-test split.
We then run a second experiment, where we hold out data from the second fold (using everything except the 2nd fold for training the model). This gives us a second estimate of model quality. We repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some point.
Returning to our example above from the train-test split: if we have 5000 rows of data, we end up with a measure of model quality based on 5000 rows of holdout (even if we don't use all 5000 rows simultaneously).
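A minimal sketch of 5-fold cross-validation in scikit-learn (the data and the choice of a decision-tree regressor are placeholders for illustration):

# Sketch: 5-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(5000, 4)    # placeholder features (5000 rows)
y = np.random.rand(5000)       # placeholder target

model = DecisionTreeRegressor(random_state=0)

# cv=5 runs five experiments, each holding out a different 20 percent fold.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(scores)          # one score per fold
print(scores.mean())   # average score across the folds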
Trade-offs Between Cross-Validation and Train-Test Split
Cross-validation gives a more accurate measure of model
quality, which is especially important if you are making a
lot of modeling decisions. However, it can take more time
to run, because it estimates models once for each fold. So
it is doing more total work.

Given these tradeoffs, when should you use each approach? On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with a train-test split. So, if your dataset is smaller, you should run cross-validation.
For the same reasons, a simple train-test split is sufficient for larger datasets. It will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

There's no simple threshold for what constitutes a large vs. small dataset. If your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation. If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth.
Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment gives similar results, a train-test split is probably sufficient.

7) Metrics of Model Quality by Simple Math and Examples

After you make predictions, you need to know if they are any good. There are standard measures that we can use to summarize how good a set of predictions actually is. Knowing how good a set of predictions is allows you to estimate how good a given machine learning model of your problem is.
You must estimate the quality of a set of predictions when
training a machine learning model.
Performance metrics like classification accuracy and root
mean squared error can give you a clear objective idea of
how good a set of predictions is, and in turn how good the
model is that generated them.
This is important as it allows you to tell the difference and
select among:
Different transforms of the data used to train the same
machine learning model.
Different machine learning models trained on the same
data.
Different configurations for a machine learning model
trained on the same data.
As such, performance metrics are a required building
block in implementing machine learning algorithms from
scratch.
All the algorithms in machine learning rely on minimizing or maximizing a function, which we call the "objective function". The group of functions that are minimized is called "loss functions". A loss function is a measure of how well a prediction model is able to predict the expected outcome. The most commonly used method of finding the minimum point of a function is "gradient descent". Think of the loss function as an undulating mountain and gradient descent as sliding down the mountain to reach its lowest point.
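As a tiny illustration, here is a hypothetical one-parameter example of gradient descent, minimizing the mean squared error of a single constant prediction w (all values chosen only for illustration):

# Sketch: gradient descent minimizing a simple MSE loss.
# The loss L(w) = mean((y - w)^2) is smallest at w = mean(y).
import numpy as np

y = np.array([2.0, 3.0, 5.0, 6.0])   # hypothetical target values
w = 0.0                              # starting point on the "mountain"
learning_rate = 0.1

for step in range(100):
    gradient = -2 * np.mean(y - w)   # slope of the loss at the current w
    w -= learning_rate * gradient    # slide a small step downhill

print(w, np.mean(y))                 # w converges towards mean(y) = 4.0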

Loss functions can be broadly categorized into two types: classification losses and regression losses.
7.1) RMSE (Root Mean Squared Error)

In this section, we will look at one of the methods to determine the accuracy of our model in predicting the target values. You must have heard about the term RMS, i.e. Root Mean Square, and you might also have used RMS values in statistics. In machine learning, when we want to look at the accuracy of our model, we take the root mean square of the error between the test values and the predicted values. Mathematically:
RMSE = sqrt( (1/N) * Σ (y_i - ŷ_i)^2 )
where y_i are the actual test values, ŷ_i are the predicted values, and N is the number of test examples.
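A minimal sketch of computing RMSE with NumPy and scikit-learn (the small arrays hold hypothetical values for illustration):

# Sketch: RMSE between actual test values and model predictions.
import numpy as np
from sklearn.metrics import mean_squared_error

y_test = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical actual values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical predictions

# By hand: square the errors, average them, then take the square root.
rmse_manual = np.sqrt(np.mean((y_test - y_pred) ** 2))

# With scikit-learn: square root of the mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_test, y_pred))

print(rmse_manual, rmse_sklearn)   # both print the same value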

7.2) MSE (Mean Squared Error)

Mean Squared Error (MSE) is the most commonly used regression loss function. MSE is the average of the squared differences between the target values and the predicted values:
MSE = (1/N) * Σ (y_i - ŷ_i)^2
Why use mean squared error?
MSE is sensitive towards outliers, and given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with Mean Absolute Error (MAE), where the optimal prediction is the median. MSE is thus a good choice if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it is important to penalize outliers heavily.
When to use mean squared error?
Use MSE when doing regression, believing that your target, conditioned on the input, is normally distributed, and you want large errors to be penalized significantly (quadratically) more than small ones.
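A small sketch comparing MSE and MAE on hypothetical predictions, showing how a single outlier error inflates MSE far more than MAE:

# Sketch: MSE penalizes one large (outlier) error much more than MAE.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])     # hypothetical targets
y_good = np.array([10.5, 11.5, 11.0, 13.5, 12.0])     # small errors everywhere
y_outlier = np.array([10.5, 11.5, 11.0, 13.5, 22.0])  # one large error

print(mean_absolute_error(y_true, y_good), mean_squared_error(y_true, y_good))
print(mean_absolute_error(y_true, y_outlier), mean_squared_error(y_true, y_outlier))
# The outlier raises MAE moderately but raises MSE quadratically.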
