Capstone Project

Copyright
© All Rights Reserved

UNIT 1: CAPSTONE PROJECT

CAPSTONE PROJECT
• A capstone project is a project where students must research a topic
independently to find a deep understanding of the subject matter.
• It gives an opportunity for the student to integrate all their knowledge
and demonstrate it through a comprehensive project.
• Capstone projects
1. Stock Prices Predictor
2. Develop A Sentiment Analyzer
3. Movie Ticket Price Predictor
4. Students Results Predictor
5. Human Activity Recognition using Smartphone Data set
6. Classifying humans and animals in a photo
• Artificial Intelligence is perhaps the most transformative technology
available today. At a high level, every AI project follows these
six steps:

1) Problem definition, i.e. understanding the problem
2) Data gathering
3) Feature definition
4) AI model construction
5) Evaluation & refinements
6) Deployment
UNDERSTANDING THE PROBLEM
• Begin formulating your problem by asking yourself this simple question:

Is there a pattern?
• The premise that underlies all Machine Learning disciplines is that
there needs to be a pattern.
• If there is no pattern, then the problem cannot be solved with AI
technology.
• Applied uses of these techniques are typically geared towards answering five types
of questions, all of which may be categorized as being within the umbrella of
predictive analysis:
1) Which category? (Classification)
2) How much or how many? (Regression)
3) Which group? (Clustering)
4) Is this unusual? (Anomaly Detection)
5) Which option should be taken? (Recommendation)

It is important to determine which of these questions you’re asking, and how answering
it helps you solve your problem.
DECOMPOSING THE PROBLEM THROUGH DT FRAMEWORK
• Design Thinking is a design methodology that provides a solution-
based approach to solving problems. It’s extremely useful in tackling
complex problems that are ill-defined or unknown.
PROBLEM DECOMPOSITION STEPS
• 1. Understand the problem and then restate the problem in your own
words
✓ Know what the desired inputs and outputs are
✓ Ask questions for clarification (in class these questions might be to
your instructor, but most of the time you will be asking either yourself
or your collaborators)
• 2. Break the problem down into a few large pieces. Write these down,
either on paper or as comments in a file.
• 3. Break complicated pieces down into smaller pieces. Keep doing this
until all of the pieces are small.
• 4. Code one small piece at a time.
1. Think about how to implement it
2. Write the code/query
3. Test it… on its own.
4. Fix problems, if any
• Example: Imagine that you want to create your first app. This is a complex problem.
How would you decompose the task of creating an app?
• To decompose this task, you would need to know the answers to a series of smaller
problems:
➢ What kind of app do you want to create?
➢ What will your app look like?
➢ Who is the target audience for your app?
➢ What will the graphics look like?
➢ What audio will you include?
➢ What software will you use to build your app?
➢ How will the user navigate your app?
➢ How will you test your app?
ANALYTIC APPROACH
• Those who work in the domain of AI and Machine Learning solve problems and
answer questions through data every day.
• They build models to predict outcomes or discover underlying patterns, all to gain
insights leading to actions that will improve future outcomes.
➢ If the question is to determine probabilities of an action, then a predictive model
might be used.
➢ If the question is to show relationships, a descriptive approach may be required.
➢ Statistical analysis applies to problems that require counts.
➢ If the question requires a yes/no answer, then a classification approach to predicting
a response would be suitable.
DATA REQUIREMENT
• Prior to undertaking the data collection and data preparation stages of the
methodology, it's vital to define the data requirements for decision-tree
classification.
• This includes identifying the necessary data content, formats and sources for
initial data collection.
• In this phase the data requirements are revised and decisions are made as to
whether or not the collection requires more or less data.
• Once the data ingredients are collected, the data scientist will have a good
understanding of what they will be working with.
• Techniques such as descriptive statistics and visualization can be applied to
the data set, to assess the content, quality, and initial insights about the data.
• Gaps in data will be identified and plans to either fill or make substitutions will
have to be made.
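As a rough illustration of the descriptive-statistics step above, the sketch below builds a small made-up table and uses pandas to summarize it and count the gaps. The column names and values are hypothetical, and filling gaps with the column median is only one possible substitution strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value (a "gap").
df = pd.DataFrame({
    "temp": [18.0, 21.5, np.nan, 25.0, 19.5],
    "wind": [4.0, 2.5, 3.0, np.nan, 5.5],
})

summary = df.describe()          # count, mean, std, min, quartiles, max per column
missing = df.isna().sum()        # number of gaps in each column
filled = df.fillna(df.median())  # assumed strategy: substitute the column median

print(summary)
print(missing)
```

Comparing the count row of the summary with the number of rows is a quick way to spot columns that need attention before modeling.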
MODELING APPROACH
• Data Modeling focuses on developing models that are either descriptive or
predictive.
➢ An example of a descriptive model might examine things like: if a person did this,
then they're likely to prefer that.
➢ A predictive model tries to yield yes/no, or stop/go type outcomes. These models
are based on the analytic approach that was taken, either statistically driven or
machine learning driven.
• The data scientist will use a training set for predictive modelling.
• A training set is a set of historical data in which the outcomes are already known.
• The training set acts like a gauge to determine if the model needs to be
calibrated.
• In this stage, the data scientist will play around with different algorithms to
ensure that the variables in play are actually required.
• The success of data compilation, preparation and modelling depends on the understanding of
the problem at hand and on the appropriate analytical approach being taken.

• Constant refinement, adjustments and tweaking are necessary within each step to ensure the
outcome is solid.
• The framework is geared to do 3 things:
➢ First, understand the question at hand.
➢ Second, select an analytic approach or method to solve the problem.
➢ Third, obtain, understand, prepare, and model the data.
HOW TO VALIDATE MODEL QUALITY
• Train-Test Split Evaluation

• The train-test split is a technique for evaluating the performance of a machine
learning algorithm.
• It can be used for classification or regression problems and can be used for any
supervised learning algorithm.
• The procedure involves taking a dataset and dividing it into two subsets.
• The first subset is used to fit the model and is referred to as the training
dataset.
• The second subset is not used to train the model; instead, the input features
of this subset are provided to the model, then predictions are made and
compared to the expected values.
• This second dataset is referred to as the test dataset.

➢ Train Dataset: Used to fit the machine learning model.
➢ Test Dataset: Used to evaluate the fitted machine learning model.
• The objective is to estimate the performance of the machine learning model
on new data: data not used to train the model.
• The procedure has one main configuration parameter, which is the size of
the train and test sets.

• This is most commonly expressed as a fraction between 0 and 1 for
either the train or test datasets.

• For example, a training set with the size of 0.67 (67 percent) means that
the remaining 0.33 (33 percent) is assigned to the test set.

• There is no optimal split percentage.
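To make the 0.67 / 0.33 example concrete, here is a minimal sketch using scikit-learn's train_test_split on synthetic data. The arrays are made up for illustration, and random_state is fixed only so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 150 rows, 2 feature columns, one label per row.
X = np.arange(300).reshape(150, 2)
y = np.arange(150)

# test_size=0.33 assigns about 33 percent of the rows to the test set;
# the remaining rows form the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

print(len(X_train), len(X_test))
```

Every row ends up in exactly one of the two subsets, so the test rows are genuinely unseen by a model fitted on the training rows.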


PREREQUISITES FOR TRAIN AND TEST DATA
• We will need the following Python libraries for this tutorial:
➢ pandas
➢ scikit-learn (imported as sklearn)

We can install these with pip:

pip install pandas
pip install scikit-learn

We use pandas to import the dataset and sklearn to perform the splitting. You can import these
packages as:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.datasets import load_iris
• Loading the dataset: let's load the forestfires dataset using pandas.
>>> data = pd.read_csv('forestfires.csv')
>>> data.head()

Let's split this data into features and labels. Features are the data we use to make
predictions; labels are the data we want to predict.
>>> y = data.temp
>>> x = data.drop('temp', axis=1)

Here temp is the label, so y holds the temperatures we want to predict; we use the drop()
function to keep all the other columns in x as features. We then split both into training
and test sets, holding out 20 percent of the rows for testing.
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
>>> x_train.head()
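Once the data is split, the training rows are used to fit a model and the test rows to score it. The sketch below shows one way this might look. Because forestfires.csv is not bundled with any library, it uses a made-up DataFrame as a stand-in: the columns feature_a and feature_b are hypothetical, and LinearRegression is just an assumed choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for forestfires.csv: two made-up feature columns
# and a 'temp' column that depends on them plus random noise.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feature_a": rng.uniform(0, 10, 200),
    "feature_b": rng.uniform(0, 5, 200),
})
data["temp"] = 2.0 * data["feature_a"] - data["feature_b"] + rng.normal(0, 1, 200)

y = data.temp
x = data.drop("temp", axis=1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(x_train, y_train)  # fit on the training rows only
predictions = model.predict(x_test)               # predict the unseen test rows
test_mse = mean_squared_error(y_test, predictions)

print(x_train.shape, x_test.shape, test_mse)
```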
CROSS-VALIDATION

• You've already learned to use train_test_split to split the data, so you can measure
model quality on the test data.
• Cross-validation extends this approach to model scoring (or "model validation.")
• Compared to train_test_split, cross-validation gives you a more reliable measure of
your model's quality, though it takes longer to run.
THE SHORTCOMING OF TRAIN-TEST SPLIT
• Imagine you have a dataset with 5000 rows.
• The train_test_split function has an argument for test_size that you can use to
decide how many rows go to the training set and how many go to the test set.
• The larger the test set, the more reliable your measures of model quality will
be.
• You will typically keep about 20% as a test dataset. But even with 1000 rows in
the test set, there's some random chance in determining model scores.
• A model might do well on one set of 1000 rows, even if it would be inaccurate
on a different 1000 rows.
• The larger the test set, the less randomness (aka "noise") there is in our
measure of model quality.
• In cross-validation, we run our modeling process on different subsets of the data to
get multiple measures of model quality.
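• For example, we could have 5 folds or experiments. We divide the data into 5 pieces,
each being 20% of the full dataset. Each piece is used once as the holdout set while the
model is trained on the other four, which gives five measures of model quality.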
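A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score follows. The data is synthetic and LinearRegression is only an assumed example model. Note that scikit-learn reports negated mean squared error for this scoring option, so the sign is flipped to get ordinary MSE values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: y is roughly 3 * x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# cv=5 divides the rows into five folds; each fold is held out once
# while the model is fitted on the other four.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
scores = -neg_mse  # one MSE value per fold

print(scores)
print(scores.mean())
```

Averaging the five scores gives a single quality measure that depends less on which rows happened to land in the holdout set.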
TRADE-OFFS BETWEEN CROSS-VALIDATION AND TRAIN-TEST SPLIT
• Cross-validation gives a more accurate measure of model quality, which is
especially important if you are making a lot of modeling decisions.
• However, it can take more time to run, because it estimates models once
for each fold.
• On small datasets, the extra computational burden of running cross-
validation isn't a big deal. These are also the problems where model
quality scores would be least reliable with train-test split.
• So, if your dataset is smaller, you should run cross-validation.
• For the same reasons, a simple train-test split is sufficient for larger
datasets.
• It will run faster, and you may have enough data that there's little need to
re-use some of it for holdout.
• Using cross-validation gives much better measures of model quality, with the added benefit
of cleaning up our code (we no longer need to keep track of separate train and test sets).

METRICS OF MODEL QUALITY BY SIMPLE MATH AND EXAMPLES
• After you make predictions, you need to know if they are any good.
• There are standard measures that we can use to summarize how good a set of predictions
actually are.
• Performance metrics like classification accuracy and root mean squared error can give you a
clear objective idea of how good a set of predictions is, and in turn how good the model is
that generated them.
• This is important as it allows you to tell the difference and select among:
➢ Different transforms of the data used to train the same machine learning model.
➢ Different machine learning models trained on the same data.
➢ Different configurations for a machine learning model trained on the same data.
• All the algorithms in machine learning rely on minimizing or
maximizing a function, which we call the “objective function”.
The group of functions that are minimized are called “loss
functions”.
• A loss function is a measure of how well a prediction model
is able to predict the expected outcome.
• The most commonly used method of finding the minimum point
of a function is “gradient descent”.
• Loss functions can be broadly categorized into 2 types: Classification
and Regression Loss.

• Regression functions predict a quantity, and classification functions
predict a label.
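As a small illustration of gradient descent minimizing a regression loss, the sketch below fits a single weight w in the model y = w * x by repeatedly stepping against the gradient of the mean squared error. The data, the starting value and the learning rate are all assumptions chosen to keep the example tiny.

```python
import numpy as np

# Toy data lying exactly on y = 2x, so the best weight is w = 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0               # initial guess
learning_rate = 0.05  # hypothetical step size

for _ in range(200):
    error = w * x - y
    gradient = 2.0 * np.mean(error * x)  # derivative of mean((w*x - y)^2) w.r.t. w
    w -= learning_rate * gradient        # step downhill on the loss

loss = np.mean((w * x - y) ** 2)
print(w, loss)
```

Each step moves w in the direction that reduces the loss, so w approaches 2 and the loss approaches 0.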
RMSE (ROOT MEAN SQUARED ERROR)
• In machine learning, when we want to measure the accuracy of a regression
model, we take the root of the mean squared error between the actual (test)
values and the predicted values. Mathematically:

• For a Single Value
• Let a = (predicted value - actual value)^2
• Let b = mean of a = a (for a single value)
• Then RMSE = square root of b
• For a set of n values, RMSE is defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\text{predicted value}_i - \text{actual value}_i\right)^2}$$
• In a scatter plot of such a model, the red dots are the actual values and the blue
line is the set of values predicted by our model. The distance X between an
actual value and the predicted line represents the error; similarly, we can draw straight
lines from each red dot to the blue line. Squaring each of those distances, taking the
mean of the squares and finally taking the square root gives us the
RMSE of our model.
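The a, b and square-root steps above can be sketched for several values at once with NumPy. The actual and predicted numbers here are made up for illustration.

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0])     # hypothetical actual values
predicted = np.array([2.0, 5.0, 9.0])  # hypothetical predicted values

squared_errors = (predicted - actual) ** 2  # a for each point: 1, 0, 4
mean_squared = squared_errors.mean()        # b: mean of the squared errors
rmse = np.sqrt(mean_squared)                # RMSE: square root of b

print(rmse)
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.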
MSE (MEAN SQUARED ERROR)

• Mean Square Error (MSE) is the most commonly used regression loss function.
• MSE is the mean of the squared distances between our target variable and the predicted
values.
from sklearn.metrics import mean_squared_error

# Given values
Y_true = [1, 1, 2, 2, 4]  # Y_true = Y (original values)
# Calculated values
Y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]  # Y_pred = Y'

# Calculation of Mean Squared Error (MSE)
print(mean_squared_error(Y_true, Y_pred))
Why use mean squared error
• MSE is sensitive to outliers. Given several examples with the same input
feature values, the optimal prediction under MSE is their mean target value.
• This should be compared with Mean Absolute Error, where the optimal prediction is
the median.
• MSE is thus good to use if you believe that your target data, conditioned on the input,
is normally distributed around a mean value, and when it’s important to penalize
outliers heavily.
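The mean-versus-median claim can be checked numerically. The sketch below takes a handful of made-up target values (one of them an outlier), tries many candidate constant predictions on a grid, and looks up which candidate gives the lowest MSE and which gives the lowest MAE. The grid search is only a brute-force illustration, not how these optima are normally found.

```python
import numpy as np

# Hypothetical targets sharing the same input; 100 is an outlier.
targets = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions from 0 to 100 in steps of 0.01.
candidates = np.linspace(0, 100, 10001)

diffs = targets[None, :] - candidates[:, None]  # one row per candidate
mse = (diffs ** 2).mean(axis=1)
mae = np.abs(diffs).mean(axis=1)

best_mse = candidates[mse.argmin()]  # should sit at the mean of the targets
best_mae = candidates[mae.argmin()]  # should sit at the median of the targets

print(best_mse, targets.mean())
print(best_mae, np.median(targets))
```

The outlier drags the MSE-optimal prediction far from the bulk of the data, while the MAE-optimal prediction stays with the middle value.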
WHEN TO USE MEAN SQUARED ERROR

• Use MSE when doing regression, if you believe that your target,
conditioned on the input, is normally distributed, and you want large errors
to be significantly (quadratically) more penalized than small ones.

• Example-1: You want to predict future house prices. The price is a
continuous value, and therefore we want to do regression. MSE can
be used here as the loss function.
• Example-2: Consider the given data points: (1,1), (2,1), (3,2), (4,2), (5,4)
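Read as (x, y) pairs, the y values of these five points match Y_true in the scikit-learn snippet earlier in this section. Assuming the intended exercise is to fit a least-squares straight line and then score it with MSE, a minimal sketch with NumPy could look like this (np.polyfit is an assumed choice of fitting routine):

```python
import numpy as np

# The five data points from Example-2, split into x and y.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)

# Least-squares straight line: y_pred = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

mse = np.mean((y - y_pred) ** 2)
print(slope, intercept, mse)
```

The fitted values come out close to the Y_pred list used in that earlier snippet, which suggests the list was produced by a line of this kind.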
MEAN ABSOLUTE PERCENTAGE ERROR
• One of the most common metrics of model prediction accuracy, mean absolute
percentage error (MAPE) is the percentage equivalent of mean absolute error
(MAE).
• MAPE is defined as the average absolute percentage difference between predicted
values and actual values:
$$\mathrm{MAPE} = \frac{100\%}{N}\sum_{t=1}^{N}\left|\frac{A_t - F_t}{A_t}\right|$$
• Where:
• N is the number of fitted points;
• A is the actual value;
• F is the forecast value; and
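Following the definition above, MAPE can be sketched as the mean of |A - F| / |A| over the N points, multiplied by 100. The actual and forecast values below are made up for illustration.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])    # A: hypothetical actual values
forecast = np.array([110.0, 180.0, 400.0])  # F: hypothetical forecast values

# Absolute percentage error of each point, then the average over N points.
abs_pct_errors = np.abs((actual - forecast) / actual)
mape = abs_pct_errors.mean() * 100

print(mape)
```

Note that the division by A means MAPE is undefined whenever an actual value is zero.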
