Mod 1
Parameter | Supervised learning | Unsupervised learning
Human Supervision | Supervised learning algorithms need human supervision to train the model. | Unsupervised learning algorithms do not need any kind of supervision to train the model.
Complexity | Supervised machine learning methods are generally computationally simple. | Unsupervised machine learning methods are generally computationally complex.
Accuracy | Supervised machine learning methods tend to be highly accurate. | Unsupervised machine learning methods tend to be less accurate.
Aditi Deorukhakar
Overfitting happens when a model learns too much from the training data, including details that don’t
matter (like noise or outliers).
• For example, imagine fitting a very complicated curve to a set of points. The curve will go through
every point, but it won’t represent the actual pattern.
• As a result, the model works great on training data but fails when tested on new data.
Overfitting models are like students who memorize answers instead of understanding the topic. They do
well in practice tests (training) but struggle in real exams (testing).
Underfitting is the opposite of overfitting. It happens when a model is too simple to capture what’s going
on in the data.
• For example, imagine drawing a straight line to fit points that actually follow a curve. The line
misses most of the pattern.
• In this case, the model doesn’t work well on either the training or testing data.
Underfitting models are like students who don’t study enough. They don’t do well in practice tests or real
exams. Note: An underfitting model has high bias and low variance.
Reasons for underfitting:
1. The model is too simple, so it may not be capable of representing the complexities in the data.
2. The input features used to train the model are not adequate representations of the
underlying factors influencing the target variable.
4. Excessive regularization is used to prevent overfitting, which constrains the model from capturing
the data well.
Let’s visually understand the concept of underfitting, proper fitting, and overfitting.
• Underfitting : Straight line trying to fit a curved dataset but cannot capture the data’s patterns,
leading to poor performance on both training and test sets.
• Overfitting: A squiggly curve passing through all training points that fails to generalize, performing
well on training data but poorly on test data.
• Appropriate Fitting: Curve that follows the data trend without overcomplicating to capture the
true patterns in the data.
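The three cases above can be reproduced numerically. The following is a minimal sketch, not a definitive recipe: it swaps the curves for a k-nearest-neighbour regressor, where k = 1 memorizes the training points (overfitting), k equal to the dataset size always predicts the overall mean (underfitting), and a moderate k sits in between. The data, the noise pattern, and the function names `knn_predict` and `mse` are all made up for illustration.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN predictions on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Training data: y = x^2 plus an alternating +/-0.3 "noise" term (deterministic, made up).
train_x = [i * 0.1 for i in range(21)]
train_y = [x * x + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(train_x)]

# Test data: noise-free points halfway between the training points.
test_x = [i * 0.1 + 0.05 for i in range(20)]
test_y = [x * x for x in test_x]

results = {}
for name, k in [("overfit", 1), ("appropriate", 4), ("underfit", len(train_x))]:
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    results[name] = (train_error, test_error)
    print(f"{name:12s} k={k:2d}  train MSE={train_error:.4f}  test MSE={test_error:.4f}")
```

With k = 1 the training error is zero because every training point is its own nearest neighbour, yet the test error is worse than for the moderate k. With k equal to the dataset size both errors are large.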
What is Machine Learning? What are the steps in developing a machine learning application?
Machine learning is a branch of artificial intelligence that enables algorithms to uncover hidden patterns
within datasets. It allows them to make predictions on new, similar data without being explicitly programmed for each task.
Machine learning finds applications in diverse fields such as image and speech recognition, natural
language processing, recommendation systems, fraud detection, portfolio optimization, and automating
tasks.
Machine learning’s impact extends to autonomous vehicles, drones, and robots, enhancing their
adaptability in dynamic environments. This approach marks a breakthrough where machines learn from
data examples to generate accurate outcomes, closely intertwined with data mining and data science.
There are several types of machine learning, each with its own characteristics and applications. The
main types of machine learning algorithms are as follows:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Additionally, there is a more specific category called semi-supervised learning, which combines elements
of both supervised and unsupervised learning.
Types of Machine Learning
Supervised learning is when a model is trained on a “Labelled Dataset”. Labelled datasets
have both input and output parameters. In supervised learning, algorithms learn to map
inputs to the correct outputs. Both the training and validation datasets are labelled.
• Supervised Learning models can have high accuracy as they are trained on labelled data.
• It can often be used in pre-trained models which saves time and resources when developing new
models from scratch.
• It is limited to the patterns it has seen and may struggle with unseen or unexpected patterns that
are not present in the training data.
• It can be time-consuming and costly as it relies on labeled data only.
• Natural language processing: Extract information from text, such as sentiment, entities, and
relationships.
• It helps to discover hidden patterns and various relationships between the data.
• Used for tasks such as customer segmentation, anomaly detection, and data exploration.
• It does not require labeled data and reduces the effort of data labeling.
• Without using labels, it may be difficult to predict the quality of the model’s output.
• Clusters may be hard to interpret and may not correspond to meaningful groupings.
• It has techniques such as autoencoders and dimensionality reduction that can be used to extract
meaningful features from raw data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its essential
information.
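As an illustration of discovering groups without labels, here is a minimal sketch of k-means clustering on one-dimensional data. k-means is one common clustering choice and is an assumption here, since the notes do not name a specific algorithm. The `spend` values, the starting centroids, and the function name `kmeans_1d` are made up.

```python
def kmeans_1d(points, centroids, max_iters=100):
    """Plain k-means on 1-D data; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical customer spend values with two obvious groups and no labels.
spend = [1, 2, 3, 10, 11, 12]
labels, centroids = kmeans_1d(spend, centroids=[1, 12])
print(labels, centroids)  # [0, 0, 0, 1, 1, 1] [2.0, 11.0]
```

The algorithm never sees a label. It only groups points that are close together, which is why the meaning of each cluster still has to be interpreted afterwards.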
Reinforcement learning is a learning method in which an agent interacts with the environment by
producing actions and discovering errors. Trial and error and delayed feedback are the most relevant characteristics of
reinforcement learning. In this technique, the model keeps improving its performance by using reward
feedback to learn the behavior or pattern. These algorithms are specific to a particular problem, e.g. the
Google self-driving car, or AlphaGo, where a bot competes with humans and even with itself to become a better and
better player of the game of Go. Each time we feed in data, the algorithms learn and add the data to their
knowledge, which serves as training data. So, the more the model learns, the better trained and more experienced it becomes.
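The reward-feedback loop can be sketched with a very small example. This is an illustrative epsilon-greedy agent on a three-option problem, not how AlphaGo or self-driving systems work. The reward values, `epsilon`, the step count, and the function name `epsilon_greedy` are assumptions chosen for the demo.

```python
import random


def epsilon_greedy(rewards, steps=1000, epsilon=0.1, seed=0):
    """Learn by trial and error which action yields the highest reward."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # current belief about each action's reward
    counts = [0] * len(rewards)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore: try a random action
        else:
            action = max(range(len(rewards)), key=lambda a: estimates[a])  # exploit
        reward = rewards[action]  # feedback from the environment
        counts[action] += 1
        # Incremental average of the rewards seen so far for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical environment: action 2 pays the most, but the agent does not know that.
estimates, counts = epsilon_greedy([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("best action:", best_action, "times each action was tried:", counts)
```

The agent starts with no knowledge, occasionally explores, and gradually concentrates on the action whose observed reward is highest.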
Cross-validation is a statistical method used in machine
learning to evaluate how well a model performs on an
independent data set. It involves dividing the available
data into multiple folds or subsets, using one of these
folds as a validation set and training the model on the
remaining folds. This process is repeated multiple times,
each time using a different fold as the validation set.
Finally, the results from each validation step are
averaged to produce a more robust estimate of the
model’s performance.
The main purpose of cross-validation is to detect overfitting, which occurs when a model fits the
training data too closely and performs poorly on new, unseen data. By evaluating the model on multiple
validation sets, cross-validation provides a more realistic estimate of the model’s generalization
performance, i.e. its ability to perform well on new, unseen data. If you want to make sure your machine
learning model is not just memorizing the training data but is capable of adapting to real-world data,
cross-validation is a commonly used technique.
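The fold-by-fold procedure described above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions: folds are contiguous and not shuffled (real workflows usually shuffle first), the score is mean squared error, and the "model" is a hypothetical baseline that always predicts the mean of its training targets. The names `k_fold_splits`, `cross_validate`, `fit_mean`, and `predict_mean` are made up.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def cross_validate(xs, ys, k, fit, predict):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train, val in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in val]
        scores.append(sum(errors) / len(errors))
    return scores, sum(scores) / len(scores)


def fit_mean(train_x, train_y):
    """Hypothetical baseline model: remember the mean training target."""
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    """The baseline ignores x and always returns the stored mean."""
    return model


xs = list(range(10))
ys = [2.0 * x for x in xs]
fold_scores, mean_score = cross_validate(xs, ys, 5, fit_mean, predict_mean)
print("per-fold MSE:", fold_scores)
print("average MSE:", mean_score)
```

Every sample is used for validation exactly once, and the averaged score is the cross-validation estimate of performance on unseen data.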
If the algorithm is too simple (a hypothesis with a linear equation), it may have high bias and low
variance and thus be error-prone. If the algorithm is too complex (a hypothesis with a high-degree
equation), it may have high variance and low bias. In the latter condition, the model will not
perform well on new entries. There is a middle ground between these two conditions, known as the Bias-Variance
Trade-off. This trade-off in complexity is why there is a trade-off between bias and variance: an
algorithm can’t be more complex and less complex at the same time. For the graph, the perfect tradeoff
will be like this.
We try to minimize the total error of the model by using the Bias-Variance Tradeoff.
The best fit is given by the hypothesis at the tradeoff point. The error versus model-complexity graph
shows this trade-off.
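For reference, the total error discussed here is usually written as the standard bias-variance decomposition of the expected squared error. The notation is an assumption, since the notes give no formula: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the learned hypothesis.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

Increasing model complexity typically lowers the bias term and raises the variance term, so the total error is smallest at an intermediate complexity.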
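In a classification model, the values for the observations are as follows: True Negatives (TN) = 300,
True Positives (TP) = 500, False Positives (FP) = 50, False Negatives (FN) = 150. Evaluate the performance of
the model by finding the values of Accuracy, Precision, Recall and F1-Score:
Given:
• TP = 500, TN = 300, FP = 50, FN = 150
• Total observations = TP + TN + FP + FN = 1000
Formulas Used:
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1-Score = 2 × Precision × Recall / (Precision + Recall)
Calculations:
• Accuracy = (500 + 300) / 1000 = 0.80
• Precision = 500 / (500 + 50) = 500 / 550 ≈ 0.909
• Recall = 500 / (500 + 150) = 500 / 650 ≈ 0.769
• F1-Score = 2 × 0.909 × 0.769 / (0.909 + 0.769) ≈ 0.833
Final Results:
• Accuracy: 0.80
• Precision: 0.909
• Recall: 0.769
• F1-Score: 0.833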
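The four metrics can be recomputed directly from the confusion-matrix counts given in the question. The short sketch below uses the standard definitions. The function name `classification_metrics` is made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the question.
metrics = classification_metrics(tp=500, tn=300, fp=50, fn=150)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

F1 can equivalently be computed as 2·TP / (2·TP + FP + FN), which avoids rounding the intermediate precision and recall values.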
List various applications of machine learning. Describe SPAM and non SPAM email filtering application
in detail
Machine learning (ML) is a transformative technology with applications across various domains. Below is
an overview of its diverse applications, followed by an in-depth look at how ML is utilized in email spam
and non-spam filtering.
1. Email Spam and Malware Filtering: ML algorithms classify emails as spam or legitimate by
analyzing content, sender information, and patterns. Techniques include content-based filtering,
heuristic rules, and adaptive models.
2. Finance: ML is employed for fraud detection, risk assessment, and algorithmic trading. For
instance, PayPal uses ML to identify fraudulent transactions by analyzing patterns in user
behavior.
4. Autonomous Vehicles: Self-driving cars rely on ML to process data from sensors and make real-
time decisions, such as obstacle avoidance and route planning. Tesla's Autopilot is a notable
example.
5. Natural Language Processing (NLP): ML enables machines to understand and generate human
language, powering applications like chatbots, translation services, and sentiment analysis.
Google Translate and virtual assistants like Siri leverage NLP.
6. Cybersecurity: ML detects anomalies and potential threats by analyzing network traffic and user
behavior, enhancing the ability to prevent cyberattacks. Companies like Darktrace use ML for real-
time threat detection.
9. Education: Adaptive learning platforms use ML to tailor educational content to individual student
needs, enhancing learning outcomes. Duolingo, for example, personalizes language lessons based
on user performance.
10. Marketing and Advertising: ML analyzes consumer data to optimize marketing strategies,
personalize content, and improve customer segmentation, leading to more effective advertising
campaigns.
Email spam filtering is a critical application of ML, aiming to distinguish between legitimate emails and
unsolicited messages. The process involves analyzing various features of emails to classify them
accurately.
1. Content-Based Filtering: This method examines the actual content of emails, analyzing word
frequency and patterns to identify spam characteristics.
2. Case-Based Filtering: Utilizes a database of previously classified emails to compare and determine
the likelihood of a new email being spam.
3. Heuristic or Rule-Based Filtering: Applies predefined rules and patterns, such as specific phrases
or formats commonly found in spam, to assess emails.
5. Adaptive Filtering: Continuously learns and updates its filtering criteria based on new data,
improving accuracy over time.
Steps for SPAM and Non-SPAM Email Filtering using Machine Learning
1. Data Collection
• Example datasets: Enron Email Dataset, SpamAssassin, or UCI ML Email Spam dataset.
2. Data Preprocessing
• Text Cleaning:
3. Feature Extraction
4. Data Splitting
5. Model Selection
o Logistic Regression
6. Model Training
• The model learns patterns and relationships from the email text features.
7. Model Evaluation
o Accuracy
o Precision
o Recall
o F1-Score
o Confusion Matrix
• Helps assess how well the model distinguishes between spam and non-spam.
8. Hyperparameter Tuning
9. Deployment
• Regularly update the model using new labeled emails to keep up with evolving spam tactics.
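The steps above can be sketched end to end on a toy scale. This is a minimal illustration, not a production filter. It uses a multinomial Naive Bayes classifier with add-one smoothing, which is a common choice for text but is swapped in here for simplicity. The six training emails, the two test emails, and the function names `tokenize`, `train_naive_bayes`, and `predict` are all made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocessing and feature extraction: lowercase the text and split into words."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(emails):
    """Model training: count words per class. emails is a list of (text, label)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in emails:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label ('spam' or 'ham') with the higher log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Data collection and splitting: made-up labelled emails, kept in separate train and test sets.
train_emails = [
    ("Win a free prize now", "spam"),
    ("Free money claim your prize", "spam"),
    ("Cheap loans win cash now", "spam"),
    ("Meeting agenda for Monday", "ham"),
    ("Please review the project report", "ham"),
    ("Lunch on Monday with the team", "ham"),
]
test_emails = [
    ("Claim your free cash prize", "spam"),
    ("Project meeting on Monday", "ham"),
]

model = train_naive_bayes(train_emails)

# Model evaluation: accuracy on the held-out emails.
predictions = [predict(model, text) for text, _ in test_emails]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test_emails)) / len(test_emails)
print(predictions, accuracy)
```

Retraining `train_naive_bayes` on a growing set of labelled emails corresponds to the regular model updates mentioned in the deployment step.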