CAPSTONE PROJECT
• A capstone project is a project in which students research a topic independently to gain a deep understanding of the subject matter.
• It gives students an opportunity to integrate all of their knowledge and to demonstrate it through a comprehensive project.
• Example capstone projects:
1. Stock Prices Predictor
2. Develop A Sentiment Analyzer
3. Movie Ticket Price Predictor
4. Students Results Predictor
5. Human Activity Recognition using Smartphone Dataset
6. Classifying humans and animals in a photo
• Artificial Intelligence is perhaps the most transformative technology available today. At a high level, every AI project follows the same six steps.
IS THERE A PATTERN?
• The premise that underlies all Machine Learning disciplines is that
there needs to be a pattern.
• If there is no pattern, then the problem cannot be solved with AI technology.
• Applied uses of these techniques are typically geared towards answering five types of questions, all of which fall under the umbrella of predictive analytics:
1) Which category? (Classification)
2) How much or how many? (Regression)
3) Which group? (Clustering)
4) Is this unusual? (Anomaly Detection)
5) Which option should be taken? (Recommendation)
It is important to determine which of these questions you’re asking, and how answering
it helps you solve your problem.
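As an illustrative sketch, each question type corresponds to a family of scikit-learn estimators. The pairings below are assumptions for illustration, not part of the original material:

# A minimal sketch mapping four of the five question types to
# representative scikit-learn estimators; the pairings are illustrative
# assumptions, not an exhaustive list.
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

question_to_estimator = {
    "Which category?":       LogisticRegression(),  # classification
    "How much or how many?": LinearRegression(),    # regression
    "Which group?":          KMeans(n_clusters=3),  # clustering
    "Is this unusual?":      IsolationForest(),     # anomaly detection
}
# "Which option should be taken?" (recommendation) usually calls for a
# dedicated recommender library rather than a single sklearn estimator.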
DECOMPOSING THE PROBLEM THROUGH THE DESIGN THINKING (DT) FRAMEWORK
• Design Thinking is a design methodology that provides a solution-
based approach to solving problems. It’s extremely useful in tackling
complex problems that are ill-defined or unknown.
PROBLEM DECOMPOSITION STEPS
• 1. Understand the problem and then restate the problem in your own
words
• Know what the desired inputs and outputs are
• Ask questions for clarification (in class these questions might be directed to your instructor, but most of the time you will be asking yourself or your collaborators)
• 2. Break the problem down into a few large pieces. Write these down,
either on paper or as comments in a file.
• 3. Break complicated pieces down into smaller pieces. Keep doing this
until all of the pieces are small.
• 4. Code one small piece at a time.
1. Think about how to implement it
2. Write the code/query
3. Test it… on its own.
4. Fix problems, if any
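A minimal sketch of step 4: implement one small piece, then test it on its own. The normalize_column function here is a hypothetical piece of a larger data-cleaning task, not from the original material:

def normalize_column(values):
    """Scale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Test this one piece on its own before wiring it into the larger program.
assert normalize_column([2, 4, 6]) == [0.0, 0.5, 1.0]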
• Example: Imagine that you want to create your first app. This is a complex problem.
How would you decompose the task of creating an app?
• To decompose this task, you would need to know the answers to a series of smaller problems.
Statistical analysis applies to problems that require counts: if the question requires a
yes/ no answer, then a classification approach to predicting a response would be
suitable.
DATA REQUIREMENTS
• Prior to undertaking the data collection and data preparation stages of the
methodology, it's vital to define the data requirements for decision-tree
classification.
• This includes identifying the necessary data content, formats and sources for
initial data collection.
• In this phase the data requirements are revised and decisions are made as to whether or not the collection requires more or less data.
• Once the data ingredients are collected, the data scientist will have a good
understanding of what they will be working with.
• Techniques such as descriptive statistics and visualization can be applied to
the data set, to assess the content, quality, and initial insights about the data.
• Gaps in data will be identified and plans to either fill or make substitutions will
have to be made.
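A minimal sketch of this assessment step, assuming the forestfires.csv dataset that is used later in these slides:

import pandas as pd

data = pd.read_csv('forestfires.csv')

print(data.describe())     # descriptive statistics: content and ranges
print(data.dtypes)         # check the format of each column
print(data.isna().sum())   # identify gaps (missing values) per column

# One possible substitution for gaps: fill numeric gaps with the median.
data = data.fillna(data.median(numeric_only=True))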
MODELING APPROACH
• Data Modeling focuses on developing models that are either descriptive or
predictive.
An example of a descriptive model might examine things like: if a person did this,
then they're likely to prefer that.
A predictive model tries to yield yes/no or stop/go type outcomes. These models are based on the analytic approach that was taken: either statistically driven or machine learning driven.
• The data scientist will use a training set for predictive modelling.
• A training set is a set of historical data in which the outcomes are already known.
• The training set acts like a gauge to determine if the model needs to be
calibrated.
• In this stage, the data scientist will experiment with different algorithms to ensure that the variables in play are actually required.
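A minimal sketch of trying different algorithms on the same training set; the synthetic data and the two candidate models are assumptions for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic historical data in which the outcomes are already known.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit each candidate algorithm on the same training set and compare.
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=4, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))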
• The success of data compilation, preparation, and modelling depends on understanding the problem at hand and on taking the appropriate analytical approach.
• Constant refinement, adjustment, and tweaking are necessary within each step to ensure that the outcome is solid.
• The framework is geared to do 3 things:
First, understand the question at hand.
Second, select an analytic approach or method to solve the problem.
Third, obtain, understand, prepare, and model the data.
HOW TO VALIDATE MODEL QUALITY
• Train-Test Split Evaluation
• For example, a training set size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.
We use pandas to import the dataset and sklearn to perform the splitting. You can import these
packages as:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
• Loading the dataset: let's load the forestfires dataset using pandas.
>>> data = pd.read_csv('forestfires.csv')
>>> data.head()
Let's split this data into labels and features. Features are the data we use to make predictions; labels are the values we want to predict.
>>> y = data.temp
>>> x = data.drop('temp', axis=1)
Here temp is the label (the temperature value we want to predict), stored in y; the drop() function removes it so that x keeps all the other columns as features.
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
>>> x_train.head()
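Here test_size=0.2 keeps 20% of the rows for testing. As a hypothetical continuation of this session, we can fit a model on the training split and score it on the held-out test split (this assumes the non-numeric columns, such as month and day, have already been encoded or dropped):
>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression()
>>> model.fit(x_train, y_train)
>>> model.score(x_test, y_test)   # R^2 on unseen data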
CROSS-VALIDATION
• You've already learned to use train_test_split to split the data, so you can measure
model quality on the test data.
• Cross-validation extends this approach to model scoring (or "model validation").
• Compared to train_test_split, cross-validation gives you a more reliable measure of
your model's quality, though it takes longer to run.
THE SHORTCOMING OF TRAIN-TEST SPLIT
• Imagine you have a dataset with 5000 rows.
• The train_test_split function has an argument for test_size that you can use to
decide how many rows go to the training set and how many go to the test set.
• The larger the test set, the more reliable your measures of model quality will
be.
• You will typically keep about 20% as a test dataset. But even with 1000 rows in
the test set, there's some random chance in determining model scores.
• A model might do well on one set of 1000 rows, even if it would be inaccurate
on a different 1000 rows.
• The larger the test set, the less randomness (aka "noise") there is in our
measure of model quality.
• In cross-validation, we run our modeling process on different subsets of the data to
get multiple measures of model quality.
• For example, we could have 5 folds or experiments: we divide the data into 5 pieces, each being 20% of the full dataset, and each piece takes a turn as the test set.
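A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score; the synthetic dataset is an assumption for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

# cv=5 runs five experiments, each holding out a different 20% for scoring.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)          # one quality measure per fold
print(scores.mean())   # averaged into a single, more reliable measure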
TRADE-OFFS BETWEEN CROSS-VALIDATION AND TRAIN-TEST SPLIT
• Cross-validation gives a more accurate measure of model quality, which is
especially important if you are making a lot of modeling decisions.
• However, it can take more time to run, because it trains the model once for each fold.
• On small datasets, the extra computational burden of running cross-
validation isn't a big deal. These are also the problems where model
quality scores would be least reliable with train-test split.
• So, if your dataset is smaller, you should run cross-validation.
• For the same reasons, a simple train-test split is sufficient for larger
datasets.
• It will run faster, and you may have enough data that there's little need to
re-use some of it for holdout.
• Using cross-validation gave us much better measures of model quality, with the added benefit of cleaning up our code (no longer needing to keep track of separate train and test sets).
• Cross-validation is also useful for comparing:
Different transforms of the data used to train the same machine learning model.
Different machine learning models trained on the same data.
Different configurations for a machine learning model trained on the same data.
• All the algorithms in machine learning rely on minimizing or maximizing a function, which we call the "objective function". The group of functions that are minimized are called "loss functions".
• A loss function measures how well a prediction model performs in terms of predicting the expected outcome.
• The most commonly used method of finding the minimum point of a function is "gradient descent".
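A minimal sketch of gradient descent on a toy loss function f(w) = (w - 3)^2, whose gradient is 2(w - 3); the starting point and learning rate are arbitrary choices for illustration:

w = 0.0                 # starting guess
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (w - 3)           # derivative of (w - 3)^2
    w -= learning_rate * gradient    # step against the gradient
print(w)                # converges toward the minimum at w = 3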
• Loss functions can be broadly categorized into 2 types: classification loss and regression loss.
• Regression functions predict a quantity, and classification functions predict a label.
RMSE (ROOT MEAN SQUARED ERROR)
• In machine learning, when we want to look at the accuracy of a regression model, we take the root mean square of the error between the test values and the predicted values. Mathematically:
RMSE = sqrt( (1/n) * Σ (Y_true - Y_pred)^2 )
• Mean Square Error (MSE) is the most commonly used regression loss function.
• MSE is the average of the squared differences between our target variable and the predicted values.
from sklearn.metrics import mean_squared_error
import numpy as np

# Given values
Y_true = [1, 1, 2, 2, 4]               # Y_true = Y (original values)
# Calculated (predicted) values
Y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]  # Y_pred = Y'

mse = mean_squared_error(Y_true, Y_pred)  # mean squared error
print(mse, np.sqrt(mse))                  # MSE and its square root, RMSE