
5. Design and Analysis of Machine Learning Experiments
Presentation Material
Department of Computer Science & Engineering
Course Code: Semester: V
Course Title: AI & Machine Learning Year: 2024
Faculty Name:
Indu Joseph Thoppil



MODULE 5

TEXTBOOK REFERRED

Chapter 19: Ethem Alpaydin, Introduction to Machine Learning (Adaptive
Computation and Machine Learning series), The MIT Press, Third Edition.



Introduction
• In machine learning, there are several classification algorithms and,
given a certain problem, more than one may be applicable.
• There is a need to examine how we can assess how good a selected
algorithm is.
• Also, we need a method to compare the performance of two or
more different classification algorithms. These methods help us
choose the right algorithm in a practical situation.
Issues Related to Analysing ML Algorithms
• Having trained a classification algorithm on a dataset drawn from a
specific application, can we confidently predict its future performance in
real-life scenarios? While a well-trained model can achieve high accuracy
on the training data, it's essential to consider several factors that may
influence its performance when deployed in the real world.
• How can we determine which of two learning algorithms has a lower
error rate for a given application? The algorithms may belong to different
categories (e.g., parametric vs. nonparametric) or use varying
hyperparameter settings. For example, we might want to compare a multilayer
perceptron with four hidden units to one with eight hidden units, or find
the optimal value of k for a k-nearest neighbor classifier.
Issues Related to Analysing ML Algorithms
• When evaluating machine learning models, relying on training set errors
is insufficient because these errors are always smaller than those on
unseen test data. Training errors cannot be used to compare algorithms,
as complex models tend to fit training data better than simpler ones,
regardless of their true performance.
• To make fair comparisons, a separate validation set is needed. However,
even a single validation run may not be enough due to two main reasons:
1. Small Dataset Bias: Limited training and validation sets may include noise or
outliers, which can distort evaluation.
2. Random Factors: Factors like random initialization of weights in algorithms like
multilayer perceptrons can lead to different outcomes, even with identical setups.
How can randomness and variability in machine learning
experiments impact model evaluation?
Single Training Run Limitation:
• When you train a model just once, the resulting learner and its
validation error are influenced by random factors such as:
• The specific training data subset selected.
• Initial model parameters (e.g., weights in neural networks).
• Stochastic elements in the learning process (e.g., mini-batch selection during
training).
• This single validation error provides only a snapshot, which may not
reliably represent the algorithm's true performance due to these
sources of randomness.
Averaging Over Multiple Runs:
• To account for variability, multiple learners are trained using the
same algorithm under slightly different conditions, such as:
• Different training or validation data splits (e.g., in cross-validation).
• Different initializations of the model's parameters.
• Each learner is tested on a separate validation set to compute its
validation error. This results in a distribution of validation errors
rather than a single value.
• Multiple Runs to Reduce Randomness:
Random factors, such as initial weights or stochastic training,
introduce variability in outcomes. To account for this, multiple
learners are trained on different subsets or configurations, and their
validation errors are averaged.
This helps generate a distribution of errors, offering a robust
comparison of algorithms.
Evaluating the Algorithm: The distribution of validation errors is used
to assess the algorithm's performance:
• Expected Error: The mean of the distribution gives an estimate of the
expected error rate for the algorithm on that problem.
• Variability and Robustness: The spread (e.g., standard deviation) of the
distribution reveals how consistent the algorithm is across different
conditions.
This approach also facilitates direct comparison between algorithms.
By comparing their respective error distributions, it’s possible to
identify which algorithm is more reliable or effective for a given
problem.
Significance: This methodology ensures that the evaluation accounts
for all potential sources of randomness. Instead of relying on a single
outcome, it provides a more statistically sound basis for comparing
algorithms or assessing their real-world performance.
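A rough sketch of how such an error distribution can be produced, assuming scikit-learn is available; the iris dataset and the k-NN classifier are placeholders for whatever problem and algorithm are under study:

```python
# Sketch: estimate the distribution of validation errors over multiple runs.
# Assumes scikit-learn; the dataset and classifier are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

errors = []
for seed in range(10):                          # 10 independent runs
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    errors.append(1.0 - clf.score(X_val, y_val))  # validation error of this run

errors = np.array(errors)
print(f"expected error ~ {errors.mean():.3f}, spread (std) ~ {errors.std():.3f}")
```

The mean of `errors` estimates the expected error, and its standard deviation indicates how consistent the algorithm is across runs.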
Following are the key principles in the design and analysis of machine
learning experiments, emphasizing best practices for training,
validation, and testing to ensure robust model evaluation and
meaningful insights.
• No Free Lunch Theorem (Wolpert 1995): The theorem states that no single
learning algorithm performs best across all datasets. The performance of an
algorithm depends on how well its inductive biases align with the properties
of the data at hand.
Example: A neural network may excel in tasks requiring non-linear
decision boundaries but may underperform in problems where
simpler models like k-NN suffice.
• Dataset Partitioning:
Proper data splitting into training, validation, and test sets is crucial:
• Training Set: Used to optimize the model parameters.
• Validation Set: Used to tune hyperparameters (e.g., number of layers in an
MLP, k in k-NN) or stop training.
• Test Set: Reserved for final performance evaluation, ensuring no prior
exposure during training or validation.
Example: In an experiment with an MLP, weights are trained on the
training set, hidden units and learning rate are fine-tuned on the
validation set, and final error is reported on the test set.
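A minimal sketch of this three-way protocol, assuming scikit-learn; the digits dataset and the candidate hidden-layer sizes are illustrative choices, not part of the original example:

```python
# Sketch: train/validation/test protocol for tuning an MLP's hidden-layer size.
# Assumes scikit-learn; the dataset and candidate sizes are illustrative only.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
# 60% train, 20% validation, 20% test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_size, best_val_acc = None, -1.0
for size in (4, 8, 16):                               # hyperparameter levels
    mlp = MLPClassifier(hidden_layer_sizes=(size,), max_iter=500,
                        random_state=0).fit(X_tr, y_tr)
    val_acc = mlp.score(X_val, y_val)                 # tune on validation set only
    if val_acc > best_val_acc:
        best_size, best_val_acc = size, val_acc

final = MLPClassifier(hidden_layer_sizes=(best_size,), max_iter=500,
                      random_state=0).fit(X_tr, y_tr)
print("test error:", 1.0 - final.score(X_te, y_te))   # reported once, at the end
```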
Criteria Beyond Accuracy:
Real-world decisions depend on multiple factors, not just error rates:
• Cost-sensitive Learning: Balancing false positives and false
negatives based on application-specific costs.
• Complexity: Training/testing time and space requirements.

• Interpretability: The ease of extracting insights from the model.

• Ease of Deployment: Simplicity in programming and integration.

Example: A support vector machine (SVM) might be accurate but less
interpretable compared to a decision tree.
Statistical Design of Experiments: Statistical principles should be
applied to experimental design and analysis to draw reliable
conclusions. This includes proper sampling, hypothesis testing, and
confidence intervals.
Applying statistical methodologies ensures meaningful conclusions
from experiments:
• Control for randomness by averaging multiple runs.
• Use proper statistical tests to validate significance between algorithms.
Some other criteria include:
• training time and space complexity,
• testing time and space complexity,
• interpretability, namely, whether the method allows knowledge extraction
which can be checked and validated by experts, and
• easy programmability.
The relative importances of these factors change depending on the application.
Principles of experimental design:
• There are three basic principles of design which were developed by
Sir Ronald A. Fisher:
(i) Randomization
(ii) Replication
(iii) Local control
(i) Randomization
• Randomization requires that the order in which the runs are carried out
should be randomly determined so that the results are independent.
• Randomization involves randomly assigning subjects, instances, or
conditions to experimental groups to eliminate bias and ensure each unit
has an equal chance of being in any group.
• Purpose: To avoid systematic errors and confounding effects that can
skew results. Randomization balances unknown factors across groups,
making results more generalizable.
• Example:
• In a clinical trial testing a new drug, patients are randomly assigned to either the
drug group or the placebo group. This ensures that any differences in outcomes are
more likely due to the drug itself, not pre-existing differences between groups.
• In machine learning, when splitting a dataset into training and testing sets, random
shuffling ensures that the split is representative of the overall dataset and reduces
bias.
Randomization
Randomization forms the basis of a valid experiment, but replication is
also needed for the experiment to be valid.
If the randomization process is such that every experimental unit has an
equal chance of receiving each treatment, it is called complete
randomization.
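A small NumPy sketch of complete randomization when splitting a dataset; the toy data and the 80/20 proportion are arbitrary choices:

```python
# Sketch: randomly shuffle instances before splitting so that every instance
# has an equal chance of landing in either set. Pure NumPy; toy data only.
import numpy as np

rng = np.random.default_rng(seed=42)         # seeded so the split is reproducible
X = rng.normal(size=(100, 5))                # toy data standing in for a real dataset
y = rng.integers(0, 2, size=100)

perm = rng.permutation(len(X))               # every instance equally likely in any position
cut = int(0.8 * len(X))                      # 80/20 split (arbitrary choice)
train_idx, test_idx = perm[:cut], perm[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test,  y_test  = X[test_idx],  y[test_idx]
```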
Replication
• Replication implies that for the same configuration of (controllable)
factors, the experiment should be run a number of times to average
over the effect of uncontrollable factors.
• ie it refers to repeating the experiment multiple times to confirm
the consistency and reliability of results. It can involve duplicating
the entire experiment or testing across different datasets or
conditions.
• In machine learning, this is typically done by running the same
algorithm on a number of resampled versions of the same dataset
which is known as cross-validation.
Replication
Purpose: To ensure results are not due to random chance and to
identify variations in data.
Example: In agriculture, testing the yield of a new crop variety in
different regions (under varied soil and weather conditions) ensures
that the findings are consistent and robust across environments.
In machine learning, running a model training multiple times with
different random initializations (e.g., weights in neural networks) and
averaging the performance metrics can account for randomness in the
training process.
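One way such replication over random initializations might look in code, assuming scikit-learn; the synthetic dataset and network size are placeholders:

```python
# Sketch: replicate a neural-network training run with different random
# initializations and average the resulting test accuracies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

accs = []
for seed in range(5):                                    # 5 replicated runs
    mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                        random_state=seed)               # only the initialization changes
    mlp.fit(X_tr, y_tr)
    accs.append(mlp.score(X_te, y_te))

print(f"mean accuracy {np.mean(accs):.3f} +/- {np.std(accs):.3f} over {len(accs)} runs")
```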
Replication

The relationship between the variance of a sample mean and the
sample size can be observed in machine learning, particularly in
bagging (Bootstrap Aggregating), a technique used in ensemble
learning methods like Random Forests.
Scenario: In a Random Forest, multiple decision trees are trained on
bootstrap samples (random subsets) of the training data. Each tree
predicts an output for a given input, and the final prediction is
obtained by averaging (for regression) or voting (for classification)
across the predictions of all trees.
Connection to Variance: If the variance of predictions made by a single
decision tree is σ², the variance of the mean prediction made by n trees is
approximately:
Var(Mean Prediction) = σ²/n
As 𝑛 (the number of trees) increases, the variance of the ensemble's
prediction decreases, leading to a more stable and robust model.
Practical Observation: With fewer trees (𝑛 small), the model's
predictions might be more sensitive to the noise in the training data,
resulting in higher variance.
As more trees are added (𝑛 increases), the predictions converge to a
more reliable output, as the variance of the sample mean diminishes.
Conclusion: This principle is foundational to ensemble methods,
where aggregating predictions reduces variance and improves model
generalization. Thus, increasing the sample size (number of
observations or base models) enhances stability and reduces error
variance, analogous to the statistical property of the sample mean.
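A quick numerical check of the Var(Mean Prediction) = σ²/n relationship using simulated, independent predictions (real trees in a forest are correlated, so the actual reduction is somewhat smaller):

```python
# Sketch: empirical check that averaging n independent predictions with
# variance sigma^2 yields a mean whose variance is roughly sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                                    # variance of a single "tree's" prediction
for n in (1, 10, 100):
    # 10,000 simulated ensembles, each averaging n independent predictions
    preds = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(10_000, n))
    ensemble_mean = preds.mean(axis=1)
    print(f"n={n:4d}  empirical Var(mean)={ensemble_mean.var():.3f}  "
          f"theory sigma^2/n={sigma2 / n:.3f}")
```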
Blocking (Local Control)
• Blocking is used to reduce or eliminate the variability due to
nuisance factors (confounding factors) that influence the response
but in which we are not interested.
• It is a technique used in experimental design to control for factors
that might influence the results but aren't of primary interest. It's
like grouping similar things together to isolate the effect of the
things you do care about.
Blocking in Machine Learning Experiments:
• When comparing machine learning algorithms:
• Objective: Ensure that the differences in performance are due to the
algorithms themselves and not due to variability introduced by different
subsets of data.
• Method: Use the same training and testing splits (or resampled subsets)
for all algorithms being compared.
Why Blocking is Important:
1. Without Blocking:
• If different algorithms use different training subsets, the observed differences in
accuracy might stem from the data split rather than the algorithms' performance.
• For example, one algorithm might perform better simply because it had a "lucky"
data split with easier examples.
2. With Blocking:
• By using identical subsets across replicated runs, the variability due to data splitting
is minimized, and the differences in performance reflect only the algorithms'
capabilities.
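A sketch of blocking on the data split, assuming scikit-learn: both classifiers are scored on exactly the same cross-validation folds, so their per-fold errors are directly comparable. The dataset and the two algorithms are placeholders:

```python
# Sketch: block on the data split by evaluating every algorithm on the
# identical cross-validation folds (same KFold object, same random_state).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)    # the shared "block"

for name, clf in [("k-NN", KNeighborsClassifier()),
                  ("tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=cv)            # same folds for both algorithms
    print(f"{name}: per-fold errors = {(1 - scores).round(3)}")
```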
Example from Machine Learning
• Pairing: If you're comparing two groups (like two different algorithms),
you might want to pair similar data points together. This helps you isolate
the effect of the algorithms and avoid confounding factors.

• Confounding Variables: A variable that is connected to both the
dependent and independent variables but is not a component of the
hypothesis being tested is referred to as a confounder. Confounding
factors have the potential to skew an experiment's findings and lead to
false conclusions. Confounding variables must be understood in
experimental design and taken into account when creating the
experiment.
Here's an example:
• Imagine you're testing two different fertilizers on plants. You want
to know which fertilizer helps plants grow taller. But you also know
that the amount of sunlight a plant gets can affect its growth. So,
you divide your plants into groups based on how much sunlight
they get (e.g., sunny, partly sunny, shady). This is blocking. By
grouping plants based on sunlight, you can isolate the effect of the
fertilizer and see if it really makes a difference.
• Here, sunlight is the confounding factor.


Guidelines for Machine Learning Experiments
Before we start experimentation, we need to have a good idea about
1. what it is we are studying,
2. how the data is to be collected,
3. how we are planning to analyze it.
A. Aim of the study
• We need to start by stating the problem clearly, defining what the objectives
are. In machine learning, there may be several possibilities.
There are various goals one might pursue:
1. Assessing Expected Error: Determine if the learning algorithm achieves an
acceptable error level on a specific problem.
2. Comparing Two Algorithms: Evaluate which of two algorithms has a lower
generalization error for a given dataset. This could involve comparing different
algorithms or an improved version of one (e.g., by using better features).
3. Ranking Multiple Algorithms: For a given dataset, compare the performance of
more than two algorithms and rank them based on error or other performance
measures.
4. Cross-Dataset Comparison: Evaluate and compare algorithms across multiple
datasets to understand their general performance and robustness.
B. Selection of the Response Variable
We need to decide on what we should use as the quality measure.
1. Misclassification Error: Used for classification tasks to measure the
rate of incorrect predictions.
2. Mean Square Error (MSE): Applied in regression problems to
quantify the average squared difference between predicted and
actual values.
3. Precision and Recall: Widely used in information retrieval tasks to
evaluate the relevance of retrieved items.
Choosing the right measure ensures that the evaluation aligns with
the objectives of the task and the specific application context.
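For reference, these response variables map onto standard library functions; a small sketch using scikit-learn's metrics module on made-up labels:

```python
# Sketch: computing the three response variables mentioned above on toy data.
from sklearn.metrics import accuracy_score, mean_squared_error, precision_score, recall_score

# Classification: misclassification error = 1 - accuracy
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("misclassification error:", 1 - accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))

# Regression: mean squared error
print("MSE:", mean_squared_error([2.5, 0.0, 2.1], [3.0, -0.1, 2.0]))
```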
C. Choice of Factors and Levels
The factors in a machine learning experiment depend on the study's
goal. Examples include:
1. Hyperparameters: If optimizing an algorithm, the hyperparameters
(e.g., k in k-nearest neighbors) are the factors.
2. Algorithms: If comparing different learning algorithms, they serve
as the factors.
3. Datasets: When analyzing performance across multiple datasets,
the datasets become factors.
C. Choice of Factors and Levels
• It is always good to try to normalize factor levels. For example, in
optimizing k of k-nearest neighbor, one can try values such as 1, 3,
5, and so on
• It is also important to investigate all factors and factor levels that
may be of importance and not be overly influenced by past
experience.
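A sketch of treating k as the experimental factor with levels 1, 3, 5, ... and measuring cross-validated error as the response at each level (scikit-learn assumed; the dataset is a placeholder):

```python
# Sketch: the hyperparameter k is the factor; its levels are 1, 3, 5, ...
# Cross-validated error is the response measured at each level.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 9):                            # factor levels
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean CV error = {1 - scores.mean():.3f}")
```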
D. Choice of Experimental Design
1. Factorial Design: Prefer factorial design as factors often interact. Avoid assuming
independence unless proven otherwise.
2. Replication:
1. Small datasets require more replication to ensure statistical reliability and valid comparisons.
2. Larger datasets may need fewer replicates but should still provide enough data to analyze
distributions effectively.
3. Dataset Division :
1. Separate data into test, training, and validation sets. Resampling techniques are often used to
improve reliability.
2. Small datasets tend to yield high variance in results, making conclusions less significant or
inconclusive.
4. Real-world Data:
1. Use real-world datasets collected under authentic conditions for experiments.
2. Synthetic or low-dimensional datasets may provide intuition but do not reliably represent
algorithm performance in high-dimensional, practical scenarios.

In short, proper design, realistic data, and careful replication enhance the validity and
applicability of machine learning experiments.
E. Performing the Experiment
Emphasizes the importance of planning, testing, and adhering to best
practices for reliable and unbiased machine learning experimentation.
It combines careful preparation with professional standards to
minimize errors, promote reproducibility, and ensure objective
evaluations of algorithms. Some important points to be checked here
are
• Trial Runs: Conduct a few preliminary runs with random settings to
ensure everything is working as expected before committing to a
large-scale experiment. This step helps catch potential errors early,
avoiding wasted effort on flawed setups
• Reproducibility and Backup: Save intermediate results or the random number
generator seeds to allow partial reruns of the experiment. Ensure all results are
reproducible, an essential aspect of scientific rigor (see the seeding sketch after
this list).
• Software Aging: Be cautious of issues like software aging in long-running
experiments, where performance may degrade over time due to bugs, memory
leaks, or system errors.
• Unbiased Experimentation: Maintain objectivity, especially when comparing
algorithms. Both your algorithm and competitors' should receive equal effort and
diligence. In large-scale studies, consider separating the roles of testers and
developers to minimize bias.
• Use Reliable Code: Prefer reliable, well-tested libraries over custom-built solutions
to leverage the robustness and optimization of established codebases.
• Documentation: Properly document experiments to ensure clarity and facilitate
collaboration, especially in group projects. Use standard software engineering
practices for quality and maintainability in machine learning experiments.
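A minimal sketch of the seeding and backup idea mentioned above; run_experiment, the file name, and the result structure are hypothetical placeholders for whatever routine the study actually runs:

```python
# Sketch: record the seeds alongside the results so any individual run can be
# reproduced later. run_experiment() is a placeholder, not a real routine.
import json
import random
import numpy as np

def run_experiment(seed: int) -> float:
    """Placeholder for the study's actual training + evaluation routine."""
    return random.random()            # stand-in for a real validation error

results = []
for seed in range(10):
    random.seed(seed)                 # seed every source of randomness in use
    np.random.seed(seed)
    error = run_experiment(seed=seed)
    results.append({"seed": seed, "error": error})

with open("results.json", "w") as f:  # intermediate results saved for partial reruns
    json.dump(results, f, indent=2)
```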
F. Statistical Analysis of the Data
Emphasizes the importance of using statistical rigor and visual aids to ensure
findings in machine learning experiments are reliable and interpretable.
• Objective Analysis: The goal is to derive conclusions that are objective and
not influenced by chance or bias. This involves framing research questions as
statistical hypotheses.
• Hypothesis Testing: Questions like "Is algorithm A better than algorithm B?"
are translated into testable hypotheses, such as "The average error of
algorithm A is significantly lower than that of algorithm B." Statistical
methods are used to determine whether the data supports this hypothesis
(see the paired-test sketch after this list).
• Visualization:
• Visual tools like histograms, box-and-whisker plots, and range plots
are helpful for exploring data and understanding error distributions.
• These visualizations complement statistical tests, providing an
intuitive sense of variability and differences.
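A sketch of turning such a question into a test, assuming SciPy: the per-fold errors of two algorithms obtained on identical folds (as in the blocking example) are compared with a paired t-test. The error values below are illustrative numbers only:

```python
# Sketch: paired t-test on per-fold errors of two algorithms evaluated on
# identical folds. errors_a / errors_b are illustrative numbers, not real results.
import numpy as np
from scipy import stats

errors_a = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.12, 0.16, 0.10, 0.12])
errors_b = np.array([0.15, 0.17, 0.14, 0.16, 0.13, 0.18, 0.15, 0.17, 0.14, 0.16])

t_stat, p_value = stats.ttest_rel(errors_a, errors_b)    # paired test, same folds
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("difference in mean error is statistically significant at the 5% level")
else:
    print("no significant difference detected")
```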
G. Conclusions and Recommendation
• Iterative Nature of Experiments: Machine learning experiments are
iterative processes. Initial experiments are exploratory, and only a
fraction (e.g., 25%) of resources should be invested initially. Further
experimentation is often required to refine methods and results.
• Statistical Hypotheses: Statistical testing evaluates how well the sample
data supports a hypothesis but does not confirm its absolute truth. Small,
noisy datasets increase the risk of inconclusive or erroneous conclusions.
• Learning from Failures: When results do not meet expectations,
analyzing deficiencies can lead to insights for improvement.
Enhancements in algorithms often stem from identifying shortcomings in
prior versions.
• Thorough Analysis Before Next Steps: Before testing improved versions,
ensure that all insights from the current experiment have been
thoroughly explored and understood. Ideas are valuable only when
rigorously tested, and testing demands resources and effort.
