Unit IV
Validation data. During training, validation data introduces data that the
model has not evaluated before. It provides the first test against unseen data,
allowing data scientists to assess how well the model makes predictions on
new observations. Not all data scientists use a separate validation set, but it
can provide helpful information for optimizing hyperparameters, which
influence how the model assesses data.
Test data. After the model is built, the test data checks once again that it
can make accurate predictions. While the training and validation data carry
labels so that the model's performance metrics can be monitored, the test data
is treated as unlabeled, unseen input. It provides a final, real-world check
that the ML algorithm was trained effectively.
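As a rough illustration of the three-way split described above, the sketch below divides a dataset into training, validation and test portions with scikit-learn. The 60/20/20 proportions and the names X and y are assumptions for the example, not part of the slides.

# Minimal sketch of a train/validation/test split (assumed 60/20/20 proportions).
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=42):
    # First hold out 20% of the data as the final test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    # Then carve a validation set (25% of the remainder = 20% overall)
    # out of the training portion for hyperparameter tuning.
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test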
Data and Dataset
There is some semantic ambiguity between validation data and testing data: some
organizations call their testing datasets "validation datasets." Ultimately, when three
datasets are used to tune and check ML algorithms, the validation data typically helps tune
the algorithm and the testing data provides the final assessment.
• The limitation of such a method is that the error measured on the test
dataset can depend heavily on which observations end up in the training
and test sets (see the sketch after this list).
• Also, if the training or test dataset does not represent the complete
data well, the results obtained on the test set can be skewed.
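One way to see this dependence is to repeat the split with different random seeds and compare the resulting test errors. The dataset loader, the k-NN classifier and the number of repetitions below are assumptions made only for illustration.

# Repeating the holdout split shows how the test error varies with the split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
for seed in range(5):
    # Each seed produces a different random train/test partition.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = KNeighborsClassifier().fit(X_tr, y_tr)
    # The test error changes from split to split, which is exactly
    # the limitation described above.
    print(seed, 1 - model.score(X_te, y_te))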
False Negative Rate (FNR): FNR tells us what proportion of the positive class
got incorrectly classified by the classifier.
False Positive Rate (FPR): FPR tells us what proportion of the negative class
got incorrectly classified by the classifier.
A higher TNR and a lower FPR are desirable, since we want to correctly
classify the negative class.
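In terms of the confusion-matrix counts (TP, TN, FP, FN), these rates can be written as:

FNR = FN / (FN + TP)
FPR = FP / (FP + TN)
TNR = TN / (TN + FP)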
Compute all metrics for the confusion matrix made for a classifier
that classifies people based on whether they speak English or
Spanish.
https://www.inabia.com/learning/quiz/confusion-matrix-quiz/
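As a companion to the exercise, the sketch below computes the usual confusion-matrix metrics in Python. The English/Spanish counts used here (TP = 50, FN = 10, FP = 5, TN = 35) are hypothetical placeholders, not the values from the linked quiz.

# Hypothetical confusion-matrix counts for an English-vs-Spanish classifier.
# "English" is treated as the positive class; the numbers are made up
# for illustration and are NOT taken from the quiz.
TP, FN, FP, TN = 50, 10, 5, 35

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)          # also called TPR or sensitivity
tnr       = TN / (TN + FP)          # specificity
fpr       = FP / (FP + TN)
fnr       = FN / (FN + TP)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} Recall={recall:.3f}")
print(f"TNR={tnr:.3f} FPR={fpr:.3f} FNR={fnr:.3f} F1={f1:.3f}")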
The red distribution curve corresponds to the positive class (patients with the disease) and
the green distribution curve to the negative class (patients without the disease).
If, however, the AUC had been 0, then the classifier would be
predicting all Negatives as Positives, and all Positives as Negatives.
At the other extreme, an AUC of 1 is the ideal situation: the two
distribution curves do not overlap at all, the model has an ideal measure
of separability, and it is perfectly able to distinguish between the
positive class and the negative class.
AUC-ROC Curve
When 0.5 < AUC < 1, there is a high chance that the classifier will be
able to distinguish the positive class values from the negative class
values. This is because the classifier detects more true positives and
true negatives than false negatives and false positives.
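As a rough illustration of how AUC is obtained in practice, the sketch below scores a logistic-regression classifier with scikit-learn. The dataset loader and model choice are assumptions made for this example, not part of the slides.

# Computing the ROC curve and AUC for a simple binary classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
# Predicted probabilities for the positive class drive the ROC curve.
scores = model.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
print(f"AUC = {auc:.3f}")   # a value close to 1 indicates good separability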