Unit 1: Capstone Project
A capstone project is a project in which students research a topic independently to gain a deep understanding of the subject matter. It gives students an opportunity to integrate all their knowledge and demonstrate it through a comprehensive project.
1) Understanding The Problem
Artificial Intelligence is perhaps the most transformative technology available today. At a high level, every AI project follows six steps:
1) Problem definition, i.e. understanding the problem
2) Data gathering
3) Feature definition
4) AI model construction
5) Evaluation and refinement
6) Deployment
The premise that underlies all Machine Learning disciplines is that there needs to be a pattern in the data. If there is no pattern, then the problem cannot be solved with AI technology, so it is fundamental that this question is asked before deciding to embark on an AI development journey. If it is believed that there is a pattern in the data, then AI development techniques may be employed. Applied uses of these techniques are typically geared towards answering five types of questions, all of which fall under the umbrella of predictive analysis:
1) Which category? (Classification)
2) How much or how many? (Regression)
3) Which group? (Clustering)
4) Is this unusual? (Anomaly Detection)
5) Which option should be taken? (Recommendation)
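As a rough illustration of how these five questions map to tooling, here is a sketch of common scikit-learn choices (the estimators named are typical examples, not prescribed by the text):

# a sketch: one typical scikit-learn estimator per question type
from sklearn.linear_model import LogisticRegression   # which category? (classification)
from sklearn.linear_model import LinearRegression     # how much or how many? (regression)
from sklearn.cluster import KMeans                    # which group? (clustering)
from sklearn.ensemble import IsolationForest          # is this unusual? (anomaly detection)
from sklearn.neighbors import NearestNeighbors        # which option? (a common building block for recommendation)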
2) Decomposing The Problem Through DT Framework
Design Thinking is a design methodology that provides a solution-based approach to solving problems. It is extremely useful in tackling complex problems that are ill-defined or unknown. The five stages of Design Thinking are as follows: Empathize, Define, Ideate, Prototype, and Test.
Real computational tasks are complicated. To accomplish them you need to break the problem down into smaller units before coding.
Problem decomposition steps
1. Understand the problem and then restate the problem in your own words. Know what the desired inputs and outputs are. Ask questions for clarification (in class these questions might be directed to your instructor, but most of the time you will be asking yourself or your collaborators).
2. Break the problem down into a few large pieces. Write
these down, either on paper or as comments in a file.
3. Break complicated pieces down into smaller pieces.
Keep doing this until all of the pieces are small.
4. Code one small piece at a time.
1. Think about how to implement it
2. Write the code/query
3. Test it on its own
4. Fix problems, if any
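A minimal sketch of this workflow in Python, using an invented task (counting word frequencies in a text) where each small piece is coded and tested on its own:

# a sketch: code one small piece at a time, test each piece on its own

def clean(text):
    # piece 1: lowercase the text and strip punctuation
    return ''.join(c for c in text.lower() if c.isalnum() or c.isspace())

def count_words(text):
    # piece 2: split the cleaned text and tally each word
    counts = {}
    for word in clean(text).split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# test each piece on its own, then fix problems, if any
assert clean('Hi, there!') == 'hi there'
assert count_words('a b a') == {'a': 2, 'b': 1}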
Decomposition applies to data as well as to code. Consider, for example, decomposing a time series into trend, seasonal, and residual components. Reviewing a line plot of such a series may suggest a linear trend, but it is hard to be sure from eyeballing alone. There may also be seasonality, and if the amplitude (height) of the cycles appears to increase over time, this suggests that the seasonality is multiplicative.
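A decomposition along these lines can be produced with the seasonal_decompose function from statsmodels; a minimal sketch, assuming a monthly series in a CSV file (the file name is a placeholder):

# a sketch: multiplicative seasonal decomposition with statsmodels
from pandas import read_csv
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib import pyplot

# 'series.csv' is a placeholder: a date column plus one value column
series = read_csv('series.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
result = seasonal_decompose(series, model='multiplicative', period=12)
result.plot()
pyplot.show()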
Running the example plots the observed, trend, seasonal,
and residual time series. We can see that the trend and seasonality information extracted from the series does seem reasonable. The residuals are also interesting, showing periods of high variability in the early and later years of the series.
3) Analytic Approach
Those who work in the domain of AI and Machine
Learning solve problems and answer questions through data every day. They build models to predict outcomes or discover underlying patterns, all to gain insights leading to actions that will improve future outcomes.
Every project, regardless of its size, starts with business understanding, which lays the foundation for successful resolution of the business problem. The business sponsors who need the analytic solution play the critical role in this stage by defining the problem, project objectives, and solution requirements from a business perspective. And, believe it or not, even with nine stages still to go, this first stage is the hardest.
After clearly stating a business problem, the data scientist can define the analytic approach to solving it. Doing so involves expressing the problem in the context of statistical and machine learning techniques, so that the data scientist can identify techniques suitable for achieving the desired outcome. Selecting the right analytic approach depends on the question being asked. Once the problem to be addressed is defined, the appropriate analytic approach is selected in the context of the business requirements. This is the second stage of the data science methodology.
If the question is to determine probabilities of an action, then a predictive model might be used. If the question is to show relationships, a descriptive approach may be required. Statistical analysis applies to problems that require counts; if the question requires a yes/no answer, then a classification approach to predicting a response would be suitable.
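To make the classification case concrete, a minimal sketch using scikit-learn's LogisticRegression (the data is invented for illustration):

# a sketch: a yes/no question framed as classification
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]   # illustrative feature values
y = [0, 0, 0, 1, 1, 1]               # illustrative yes (1) / no (0) outcomes

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[3.5]]))        # the yes/no answer
print(model.predict_proba([[3.5]]))  # the probabilities behind it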
4) Data Requirements
If the problem that needs to be resolved is "a recipe", so to speak, and data is "an ingredient", then the data scientist needs to identify:
1. which ingredients are required?
2. how to source or collect them?
3. how to understand or work with them?
4. how to prepare the data to meet the desired outcome?
Prior to undertaking the data collection and data preparation stages of the methodology, it is vital to define the data requirements (in this example, for decision-tree classification). This includes identifying the necessary data content, formats, and sources for initial data collection. In this phase the data requirements are revised and decisions are made as to whether the collection requires more or less data. Once the data ingredients are collected, the data scientist will have a good understanding of what they will be working with. Techniques such as descriptive statistics and visualization can be applied to the data set to assess its content and quality and to gain initial insights about the data. Gaps in the data will be identified, and plans will have to be made to either fill them or make substitutions. In essence, the ingredients are now sitting on the cutting board.
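As an illustration of this assessment step, a minimal sketch with pandas (the file name is a placeholder for whatever tabular data was collected):

# a sketch: a first look at freshly collected data
import pandas as pd

df = pd.read_csv('ingredients.csv')   # placeholder file name

print(df.describe())      # descriptive statistics: count, mean, spread, range
print(df.isnull().sum())  # gaps in the data: missing values per column
print(df.dtypes)          # formats: the type of each column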
5) Modeling Approach
Data Modeling focuses on developing models that are either descriptive or predictive. A descriptive model might examine things like: if a person did this, then they're likely to prefer that. A predictive model tries to yield yes/no or stop/go type outcomes. These models are based on the analytic approach that was taken, either statistically driven or machine learning driven.
The data scientist will use a training set for predictive modelling. A training set is a set of historical data in which the outcomes are already known. The training set acts like a gauge to determine if the model needs to be calibrated. In this stage, the data scientist will experiment with different algorithms to ensure that the variables in play are actually required.
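A minimal sketch of this experimentation, assuming a training set of historical data with known outcomes (the data and the two algorithms are illustrative):

# a sketch: trying different algorithms on the same training set
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# placeholder historical data in which the outcomes (y_train) are known
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=3)):
    model.fit(X_train, y_train)
    # accuracy against the known outcomes acts as a rough gauge
    print(type(model).__name__, model.score(X_train, y_train))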
The success of data compilation, preparation and modelling depends on understanding the problem at hand and taking the appropriate analytical approach. The data supports the answering of the question and, like the quality of the ingredients in cooking, sets the stage for the outcome. Constant refinement, adjustment and tweaking are necessary within each step to ensure the outcome is solid. The framework is geared to do three things:
First, understand the question at hand.
Second, select an analytic approach or method to solve the problem.
Third, obtain, understand, prepare, and model the data.
The end goal is to move the data scientist to a point where a data model can be built to answer the question.
6) How to Validate Model Quality
6.1) Train-Test Split Evaluation
The train-test split is a technique for evaluating the performance of a machine learning algorithm. It can be used for classification or regression problems and for any supervised learning algorithm. The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
Train Dataset: Used to fit the machine learning model.
Test Dataset: Used to evaluate the fit machine learning model.
The objective is to estimate the performance of the machine learning model on new data: data not used to train the model. This is how we expect to use the model in practice. Namely, to fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values. The train-test procedure is appropriate when there is a sufficiently large dataset available.
How to Configure the Train-Test Split
The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a percentage between 0 and 1 for either the train or test dataset. For example, a train set size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.
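A minimal sketch with scikit-learn's train_test_split (the synthetic dataset is illustrative):

# a sketch: a 67/33 train-test split with scikit-learn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)

# test_size=0.33 assigns 33 percent of the rows to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape)   # (670, 20) (330, 20)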
There is no optimal split percentage.
You must choose a split percentage that meets your project's objectives, with considerations that include:
Computational cost in training the model.
Computational cost in evaluating the model.
Training set representativeness.
Test set representativeness.
Nevertheless, common split percentages include:
Train: 80%, Test: 20%
Train: 67%, Test: 33%
Train: 50%, Test: 50%
6.2) The Concept of Cross-Validation
You will face choices about which predictive variables to use, what types of models to use, what arguments to supply to those models, and so on. We make these choices in a data-driven way by measuring the model quality of the various alternatives. You have already learned to use train_test_split to split the data, so you can measure model quality on the test data. Cross-validation extends this approach to model scoring (or "model validation"). Compared to train_test_split, cross-validation gives you a more reliable measure of your model's quality, though it takes longer to run.
The Shortcoming of Train-Test Split
Imagine you have a dataset with 5000 rows. The train_test_split function has a test_size argument that you can use to decide how many rows go to the training set and how many go to the test set. The larger the test set, the more reliable your measures of model quality will be. At an extreme, you could imagine having only 1 row of data in the test set; if you compared alternative models, which one makes the best predictions on a single data point would be mostly a matter of luck. You will typically keep about 20% as a test dataset. But even with 1000 rows in the test set, there is some random chance in determining model scores. A model might do well on one set of 1000 rows even if it would be inaccurate on a different 1000 rows. The larger the test set, the less randomness (aka "noise") there is in our measure of model quality.
The Cross-Validation Procedure
In cross-validation, we run our modeling process on
different subsets of the data to get multiple measures of model quality. For example, we could have 5 folds or experiments. We divide the data into 5 pieces, each being 20% of the full dataset.
We run an experiment called experiment 1 which uses the
first fold as a holdout set, and everything else as training data. This gives us a measure of model quality based on a 20% holdout set, much as we got from using the simple train-test split.
We then run a second experiment, where we hold out data from the second fold (using everything except the 2nd fold for training the model). This gives us a second estimate of model quality. We repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some point. Returning to our example above from train-test split: if we have 5000 rows of data, we end up with a measure of model quality based on 5000 rows of holdout (even though we don't use all 5000 rows simultaneously).
Trade-offs Between Cross-Validation and Train-Test Split
Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take longer to run, because it estimates a model once for each fold, so it is doing more total work.
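A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score (the data and model are illustrative):

# a sketch: 5-fold cross-validation, one score per holdout fold
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# cv=5: each fold serves once as the 20% holdout set
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # five estimates of model quality
print(scores.mean())   # their average is the cross-validated score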
Given these tradeoffs, when should you use each
approach? On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with train-test split. So, if your dataset is smaller, you should run cross-validation. For the same reasons, a simple train-test split is sufficient for larger datasets. It will run faster, and you may have enough data that there's little need to re-use some of it for holdout.
There's no simple threshold for what constitutes a large vs. small dataset. If your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation. If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth. Alternatively, you can run cross-validation and see whether the scores for each experiment seem close. If each experiment gives nearly the same results, a train-test split is probably sufficient.
7) Metrics of Model Quality by Simple Math and Examples
After you make predictions, you need to know if they are any good. There are standard measures that we can use to summarize how good a set of predictions actually is. Knowing how good a set of predictions is allows you to estimate how good a given machine learning model of your problem is; you must estimate the quality of a set of predictions when training a machine learning model. Performance metrics like classification accuracy and root mean squared error can give you a clear, objective idea of how good a set of predictions is, and in turn how good the model that generated them is. This is important, as it allows you to tell the difference and select among:
Different transforms of the data used to train the same machine learning model.
Different machine learning models trained on the same data.
Different configurations for a machine learning model trained on the same data.
As such, performance metrics are a required building block in implementing machine learning algorithms from scratch. All the algorithms in machine learning rely on minimizing or maximizing a function, which we call the "objective function". The group of functions that are minimized are called "loss functions". A loss function is a measure of how well a prediction model does in terms of being able to predict the expected outcome. The most commonly used method of finding the minimum point of a function is "gradient descent". Think of a loss function as an undulating mountain, and gradient descent as sliding down the mountain to reach the bottommost point.
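A minimal sketch of gradient descent on a simple one-dimensional loss, f(w) = (w - 3)^2 (the loss, learning rate, and starting point are all illustrative):

# a sketch: gradient descent sliding down a simple loss function
def loss(w):
    return (w - 3) ** 2       # minimum at w = 3

def gradient(w):
    return 2 * (w - 3)        # derivative of the loss

w = 0.0                       # arbitrary starting point
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # step downhill along the slope

print(w, loss(w))             # w approaches 3, loss approaches 0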
Loss functions can be broadly categorized into 2 types: Classification Loss and Regression Loss.
7.1) RMSE (Root Mean Squared Error)
In this section, we will look at one of the methods to determine the accuracy of our model in predicting the target values. You have likely heard of the term RMS, i.e. Root Mean Square, and may have used RMS values in statistics as well. In machine learning, when we want to look at the accuracy of our model, we take the root mean square of the error between the test values and the predicted values. Mathematically:
RMSE = sqrt( (1/n) * Σ (predicted_i - actual_i)^2 )
where n is the number of predictions, predicted_i is the i-th predicted value, and actual_i is the corresponding test value.
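A minimal sketch of this calculation by simple math (the values are illustrative):

# a sketch: RMSE computed step by step
from math import sqrt

actual = [3.0, -0.5, 2.0, 7.0]      # test values (illustrative)
predicted = [2.5, 0.0, 2.0, 8.0]    # model predictions (illustrative)

# mean of the squared errors, then the square root
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
rmse = sqrt(sum(squared_errors) / len(squared_errors))
print(rmse)   # 0.6123...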
7.2) MSE (Mean Squared Error)
Mean Squared Error (MSE) is the most commonly used regression loss function. MSE is the average of the squared differences between our target values and the predicted values:
MSE = (1/n) * Σ (predicted_i - actual_i)^2
Why use mean squared error
MSE is sensitive to outliers, and given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with Mean Absolute Error, where the optimal prediction is the median. MSE is thus good to use if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it is important to penalize outliers heavily.
When to use mean squared error
Use MSE when doing regression, if you believe that your target, conditioned on the input, is normally distributed, and you want large errors to be penalized significantly (quadratically) more than small ones.
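A minimal sketch contrasting MSE with Mean Absolute Error on data containing one outlier (the values are illustrative):

# a sketch: MSE penalizes the outlier quadratically, MAE does not
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 4.5, 6.5, 18.0]   # the last prediction is a large outlier

errors = [p - a for p, a in zip(predicted, actual)]
mse = sum(e ** 2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)
print(mse)   # 25.1875: dominated by the outlier's squared error of 100
print(mae)   # 2.875: the outlier contributes only its absolute size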