
Mod 1

The document provides a comprehensive overview of machine learning, differentiating between supervised and unsupervised learning, and discussing concepts like overfitting and underfitting. It outlines the steps in developing a machine learning application, various types of machine learning, and the bias-variance trade-off. Additionally, it highlights applications of machine learning, particularly in email spam filtering, and details the process involved in classifying emails.

Uploaded by

aditideo624

Differentiate between supervised and unsupervised learning.

Basis | Supervised Learning | Unsupervised Learning
Definition | Supervised learning algorithms train on data in which every input has a corresponding output. | Unsupervised learning algorithms find patterns in data that has no predefined labels.
Goal | Predict or classify based on input features. | Discover hidden patterns, structures and relationships.
Input Data | Labeled: input data with corresponding output labels. | Unlabeled: input data is raw and unlabeled.
Human Supervision | Needs human supervision (labeled examples) to train the model. | Does not need any kind of supervision to train the model.
Tasks | Regression, Classification | Clustering, Association and Dimensionality Reduction
Complexity | Often computationally simpler, since labels guide training. | Often computationally more complex.
Algorithms | Linear regression, K-Nearest Neighbors, Decision Trees, Naive Bayes, SVM | K-Means clustering, DBSCAN, Autoencoders
Accuracy | Generally more accurate, and accuracy can be checked against the labels. | Generally less accurate, and results are harder to verify.
Applications | Image classification, Sentiment Analysis, Recommendation systems | Customer Segmentation, Anomaly Detection, Recommendation Engines, NLP

Write short note on overfitting and underfitting of model

1. Overfitting in Machine Learning

Aditi Deorukhakar
Overfitting happens when a model learns too much from the training data, including details that don’t
matter (like noise or outliers).

• For example, imagine fitting a very complicated curve to a set of points. The curve will go through
every point, but it won’t represent the actual pattern.

• As a result, the model works great on training data but fails when tested on new data.

Overfitting models are like students who memorize answers instead of understanding the topic. They do
well in practice tests (training) but struggle in real exams (testing).

Reasons for Overfitting:

1. High variance and low bias.

2. The model is too complex.

3. The training dataset is too small.

2. Underfitting in Machine Learning

Underfitting is the opposite of overfitting. It happens when a model is too simple to capture what’s going
on in the data.

• For example, imagine drawing a straight line to fit points that actually follow a curve. The line
misses most of the pattern.

• In this case, the model doesn’t work well on either the training or testing data.

Underfitting models are like students who don’t study enough. They don’t do well in practice tests or real
exams.

Note: An underfitting model has high bias and low variance.

Reasons for Underfitting:

1. The model is too simple, so it may not be capable of representing the complexities in the data.

2. The input features used to train the model are not adequate representations of the
underlying factors influencing the target variable.

3. The training dataset is not large enough.

4. Excessive regularization is used to prevent overfitting, which constrains the model from capturing
the data well.

5. Features are not scaled.

Let’s visually understand the concept of underfitting, proper fitting, and overfitting.

• Underfitting : Straight line trying to fit a curved dataset but cannot capture the data’s patterns,
leading to poor performance on both training and test sets.

• Overfitting: A squiggly curve passes through all training points and fails to generalize, performing
well on training data but poorly on test data.

• Appropriate Fitting: A curve follows the data trend without overcomplicating, capturing the
true patterns in the data.
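
The three cases above can be reproduced on toy data. This is only a sketch under stated assumptions: the data (a noisy sine curve), the function names, and the choice of models are all illustrative and not from the notes. A straight line stands in for the underfitting model, and a 1-nearest-neighbour predictor stands in for the squiggly curve, since it reproduces every training point exactly.

```python
import math
import random

def make_data(n, rng, noise=0.3):
    """Sample n points from y = sin(2*pi*x) plus Gaussian noise (assumed toy data)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, noise) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Least-squares straight line: too simple for a curved pattern."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k):
    """Average of the k nearest training targets; k=1 memorizes the training set."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(40, rng)
test_x, test_y = make_data(200, rng)

models = {
    "underfit (line)": fit_line(train_x, train_y),
    "balanced (5-NN)": fit_knn(train_x, train_y, 5),
    "overfit (1-NN)": fit_knn(train_x, train_y, 1),
}
results = {name: (mse(m, train_x, train_y), mse(m, test_x, test_y)) for name, m in models.items()}
for name, (train_err, test_err) in results.items():
    print(f"{name:16s} train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The line should show a high error on both sets, while the 1-NN model has zero training error but a clearly non-zero test error, which is the gap that signals overfitting.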

Discuss the various steps of developing a Machine Learning Application.

What is Machine Learning? What are the steps in developing a machine learning application?

Machine learning is a branch of artificial intelligence that enables algorithms to uncover hidden patterns
within datasets. It allows them to make predictions on new, similar data without explicit programming for each task.
Machine learning finds applications in diverse fields such as image and speech recognition, natural
language processing, recommendation systems, fraud detection, portfolio optimization, and automating
tasks.

Machine learning’s impact extends to autonomous vehicles, drones, and robots, enhancing their
adaptability in dynamic environments. This approach marks a breakthrough in which machines learn from
data examples to generate accurate outcomes, and it is closely intertwined with data mining and data science.

Differentiate between supervised and unsupervised learning

Explain the overfitting and underfitting with example

Differentiate different Machine Learning approaches.

Types of Machine Learning

There are several types of machine learning, each with special characteristics and applications. Some of
the main types of machine learning algorithms are as follows:

1. Supervised Machine Learning

2. Unsupervised Machine Learning

3. Reinforcement Learning

Additionally, there is a more specific category called semi-supervised learning, which combines elements
of both supervised and unsupervised learning.


1. Supervised Machine Learning

Supervised learning is when a model is trained on a “labelled dataset”. Labelled datasets
contain both input and output parameters, and supervised learning algorithms learn to map
inputs to the correct outputs. Both the training and validation datasets are labelled.
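
As a rough illustration of learning a mapping from labelled inputs to outputs, the sketch below trains a nearest-centroid classifier. The dataset, the two features, and the function names are hypothetical, and nearest-centroid is just one simple supervised method chosen for brevity.

```python
from collections import defaultdict

def train_nearest_centroid(inputs, labels):
    """Learn one centroid (mean feature vector) per label from labelled examples."""
    groups = defaultdict(list)
    for features, label in zip(inputs, labels):
        groups[label].append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

def predict(centroids, features):
    """Map a new input to the label whose centroid is closest (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=dist)

# Hypothetical labelled dataset: [number of links, number of exclamation marks] -> label
X = [[8, 6], [7, 9], [9, 7], [1, 0], [0, 1], [2, 1]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = train_nearest_centroid(X, y)
print(predict(model, [6, 8]))  # spam
print(predict(model, [1, 1]))  # ham
```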

Advantages of Supervised Machine Learning

• Supervised Learning models can have high accuracy as they are trained on labelled data.

• The process of decision-making in supervised learning models is often interpretable.

• It can often build on pre-trained models, which saves time and resources compared with developing new
models from scratch.

Disadvantages of Supervised Machine Learning

• It is limited to the patterns it has seen and may struggle with unseen or unexpected patterns that
are not present in the training data.

• It can be time-consuming and costly as it relies on labeled data only.

• It may generalize poorly to new data.

Applications of Supervised Learning

Supervised learning is used in a wide variety of applications, including:

• Image classification: Identify objects, faces, and other features in images.

• Natural language processing: Extract information from text, such as sentiment, entities, and
relationships.

• Speech recognition: Convert spoken language into text.

2. Unsupervised Machine Learning


Unsupervised learning is a type of machine learning technique in which an
algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target outputs. The
primary goal of unsupervised learning is often to discover hidden patterns, similarities, or clusters
within the data, which can then be used for various purposes, such as data exploration, visualization,
dimensionality reduction, and more.
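
A minimal sketch of discovering clusters without labels, using plain k-means. The points are hypothetical, and starting the centroids at the first k points is a simplifying assumption made here so the run is deterministic.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (a simplifying assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Hypothetical unlabeled data: two visually separate groups, no labels supplied.
points = [[1, 1], [9, 9], [1, 2], [2, 1], [8, 9], [9, 8]]
centroids, assignment = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 0, 1, 1]
print(centroids)
```

The algorithm never sees a label, and the cluster numbers it returns carry no meaning by themselves, which is the interpretability issue noted in the disadvantages.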

Advantages of Unsupervised Machine Learning

• It helps to discover hidden patterns and various relationships between the data.

• Used for tasks such as customer segmentation, anomaly detection, and data exploration.

• It does not require labeled data and reduces the effort of data labeling.

Disadvantages of Unsupervised Machine Learning

• Without labels, it may be difficult to judge the quality of the model’s output.

• Clusters may not be clearly interpretable and may not have meaningful interpretations.

• It has techniques such as autoencoders and dimensionality reduction that can be used to extract
meaningful features from raw data.

Applications of Unsupervised Learning

Here are some common applications of unsupervised learning:

• Clustering: Group similar data points into clusters.

• Anomaly detection: Identify outliers or anomalies in data.

• Dimensionality reduction: Reduce the dimensionality of data while preserving its essential
information.

3. Reinforcement Machine Learning

Reinforcement learning is a learning method in which an algorithm interacts with an environment by
producing actions and discovering errors. Trial-and-error search and delayed reward are its most relevant
characteristics. In this technique, the model keeps improving its performance by using reward
feedback to learn a behavior or pattern. These algorithms are specific to a particular problem, e.g.
Google’s self-driving car, or AlphaGo, where a bot competes with humans and even with itself to become a
better and better performer at the game of Go. Each time data is fed in, the algorithm learns and adds that
data to its knowledge, which serves as training data. So, the more it learns, the better trained and more
experienced it becomes.
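
The idea of learning from reward feedback by trial and error can be sketched with an epsilon-greedy multi-armed bandit. This is a deliberately simplified setting (no states, no delayed reward), and the reward probabilities, parameter values, and function name are assumptions for illustration only.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)  # running average reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore a random action
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        # The environment returns a reward of 1 or 0 for the chosen action.
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # Incremental update of the average reward observed for this action.
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, counts, total_reward

# Hypothetical environment: three actions with hidden success probabilities.
values, counts, total_reward = run_bandit([0.2, 0.5, 0.8])
print("estimated values:", [round(v, 2) for v in values])
print("times each action was chosen:", counts)
```

The agent is never told which action is best. It only sees rewards, yet its estimates should end up ranking the third action highest.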

What is Cross Validation?

Cross-validation is a statistical method used in machine learning to evaluate how well a model performs
on an independent data set. It involves dividing the available data into multiple folds or subsets, using
one of these folds as a validation set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation set. Finally, the results from
each validation step are averaged to produce a more robust estimate of the model’s performance.

The main purpose of cross-validation is to detect overfitting, which occurs when a model fits the training
data too closely and performs poorly on new, unseen data. By evaluating the model on multiple
validation sets, cross-validation provides a more realistic estimate of the model’s generalization
performance, i.e. its ability to perform well on new, unseen data. If you want to make sure your machine
learning model is not just memorizing the training data but is capable of adapting to real-world data,
cross-validation is a commonly used technique.
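
The fold rotation described above can be sketched as follows. The function names and the toy mean-predictor model are hypothetical, and the folds are contiguous for simplicity; in practice the data is usually shuffled before splitting.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / k, scores

# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x for x in xs]
mean_score, fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse)
print("per-fold MSE:", fold_scores)
print("average MSE:", mean_score)
```

Every sample is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out estimate does.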

Discuss bias variance trade-off with suitable diagram.

Bias Variance Tradeoff

If the algorithm is too simple (a hypothesis with a linear equation), it may be in a high-bias, low-variance
condition and thus be error-prone. If the algorithm is too complex (a hypothesis with a high-degree
equation), it may be in a high-variance, low-bias condition, and the model will not perform well on new
entries. There is a middle ground between these two conditions, known as the Bias-Variance Trade-off.
The trade-off exists because an algorithm can’t be more complex and less complex at the same time.

[Figure: the ideal trade-off between bias and variance.]

We try to minimize the total error of the model by using the Bias-Variance Trade-off. The best fit is
given by the hypothesis at the trade-off point.

[Figure: error versus model complexity, showing the trade-off point.]
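
For squared-error loss, the total error mentioned here is commonly written as the decomposition below. The notation is assumed for this sketch and does not come from the notes: f is the true function, \hat{f} is the learned model, and \sigma^2 is the variance of the noise in y = f(x) + \varepsilon.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

Increasing model complexity typically lowers the bias term and raises the variance term, so the sum is smallest at an intermediate complexity.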

In the classification model, the values for the observations are as follows. True Negatives (TN)=300,
True Positive(TP)=500, False Positive (FP) = 50, False Negatives (FN)=150. Evaluate the performance of
the model by finding values of Accuracy, Precision, Recall and F1-Score:

Given:

• True Positives (TP) = 500

• True Negatives (TN) = 300

• False Positives (FP) = 50

• False Negatives (FN) = 150

Formulas Used:

1. Accuracy = (TP + TN) / (TP + TN + FP + FN)

2. Precision = TP / (TP + FP)

3. Recall (Sensitivity) = TP / (TP + FN)

4. F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

Calculations:

1. Accuracy = (500 + 300) / (500 + 300 + 50 + 150) = 800 / 1000 = 0.80

2. Precision = 500 / (500 + 50) = 500 / 550 = 0.909

3. Recall = 500 / (500 + 150) = 500 / 650 = 0.769

4. F1-Score = 2 × (0.909 × 0.769) / (0.909 + 0.769)

= 2 × 0.699 / 1.678 = 0.833

Final Results:

• Accuracy: 0.80

• Precision: 0.909

• Recall: 0.769

• F1-Score: 0.833
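
The same calculation can be checked in a few lines of code. The function name is hypothetical; the formulas are the ones listed above. Computed without intermediate rounding, the F1-Score is 1000 / 1200 ≈ 0.833.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = classification_metrics(tp=500, tn=300, fp=50, fn=150)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.800
# precision: 0.909
# recall: 0.769
# f1: 0.833
```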

List various applications of machine learning. Describe SPAM and non SPAM email filtering application
in detail

Machine learning (ML) is a transformative technology with applications across various domains. Below is
an overview of its diverse applications, followed by an in-depth look at how ML is utilized in email spam
and non-spam filtering.

Applications of Machine Learning

1. Email Spam and Malware Filtering: ML algorithms classify emails as spam or legitimate by
analyzing content, sender information, and patterns. Techniques include content-based filtering,
heuristic rules, and adaptive models.

2. Finance: ML is employed for fraud detection, risk assessment, and algorithmic trading. For
instance, PayPal uses ML to identify fraudulent transactions by analyzing patterns in user
behavior.

3. Retail and E-commerce: Personalized recommendations are generated using ML by analyzing
customer behavior and preferences. Amazon and Netflix utilize such systems to enhance user
engagement.

4. Autonomous Vehicles: Self-driving cars rely on ML to process data from sensors and make real-
time decisions, such as obstacle avoidance and route planning. Tesla's Autopilot is a notable
example.

5. Natural Language Processing (NLP): ML enables machines to understand and generate human
language, powering applications like chatbots, translation services, and sentiment analysis.
Google Translate and virtual assistants like Siri leverage NLP.

6. Cybersecurity: ML detects anomalies and potential threats by analyzing network traffic and user
behavior, enhancing the ability to prevent cyberattacks. Companies like Darktrace use ML for real-
time threat detection.

7. Healthcare and Medical Diagnosis: ML assists in diagnosing diseases, predicting patient
outcomes, and personalizing treatment plans by analyzing medical data. AI-driven tools help in
early detection of conditions like cancer.

8. Manufacturing and Automation: Predictive maintenance powered by ML forecasts equipment
failures, reducing downtime. Quality control processes also benefit from ML by identifying defects
in real-time.

9. Education: Adaptive learning platforms use ML to tailor educational content to individual student
needs, enhancing learning outcomes. Duolingo, for example, personalizes language lessons based
on user performance.

10. Marketing and Advertising: ML analyzes consumer data to optimize marketing strategies,
personalize content, and improve customer segmentation, leading to more effective advertising
campaigns.

Email Spam and Non-Spam Filtering Using Machine Learning

Email spam filtering is a critical application of ML, aiming to distinguish between legitimate emails and
unsolicited messages. The process involves analyzing various features of emails to classify them
accurately.

Key Filtering Techniques:

1. Content-Based Filtering: This method examines the actual content of emails, analyzing word
frequency and patterns to identify spam characteristics.

2. Case-Based Filtering: Utilizes a database of previously classified emails to compare and determine
the likelihood of a new email being spam.

3. Heuristic or Rule-Based Filtering: Applies predefined rules and patterns, such as specific phrases
or formats commonly found in spam, to assess emails.

4. Previous Likeness-Based Filtering: Employs algorithms like K-Nearest Neighbors (KNN) to
compare new emails with existing ones, classifying them based on similarity.

5. Adaptive Filtering: Continuously learns and updates its filtering criteria based on new data,
improving accuracy over time.
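
As one way to picture the similarity-based technique in the list above, the sketch below labels a new email by majority vote among its most similar labelled emails. The example emails and function names are hypothetical, and Jaccard word overlap is only one possible similarity measure.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_classify(email, labelled_emails, k=3):
    """Label a new email by majority vote among its k most similar labelled emails."""
    tokens = set(email.lower().split())
    ranked = sorted(
        labelled_emails,
        key=lambda item: jaccard(tokens, set(item[0].lower().split())),
        reverse=True,
    )
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical database of previously classified emails.
labelled = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("claim your free gift now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday with the team", "ham"),
]

print(knn_classify("claim your free prize", labelled))      # spam
print(knn_classify("project meeting on monday", labelled))  # ham
```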

Steps for SPAM and Non-SPAM Email Filtering using Machine Learning

1. Data Collection

• Collect a dataset of emails labeled as "spam" or "non-spam (ham)".

• Example datasets: Enron Email Dataset, SpamAssassin, or UCI ML Email Spam dataset.

2. Data Preprocessing

• Text Cleaning:

o Convert all text to lowercase.

o Remove special characters, punctuation, and stopwords.

• Tokenization: Split the email content into individual words.

• Stemming/Lemmatization: Reduce words to their root forms (e.g., "running" → "run").

3. Feature Extraction

• Convert text into numerical format using techniques like:

o Bag of Words (BoW): Frequency of words.

o TF-IDF (Term Frequency-Inverse Document Frequency): Weighs words based on their
importance.

o Word Embeddings (for advanced models): Represent words in vector space.

4. Data Splitting

• Divide data into:

o Training set (e.g., 80% of the data): To train the model.

o Testing set (e.g., 20%): To evaluate model performance.

5. Model Selection

• Choose suitable machine learning algorithms such as:

o Naïve Bayes (popular for text classification)

o K-Nearest Neighbors (KNN)

o Logistic Regression

o Support Vector Machine (SVM)

o Random Forest or XGBoost

6. Model Training

• Feed the training data into the selected model.

• The model learns patterns and relationships from the email text features.

7. Model Evaluation

• Use metrics like:

o Accuracy

o Precision

o Recall

o F1-Score

o Confusion Matrix

• These metrics help assess how well the model distinguishes between spam and non-spam.

8. Hyperparameter Tuning

• Optimize model parameters (e.g., number of neighbors in KNN, regularization in SVM).

• Use techniques like Grid Search or Random Search with cross-validation.

9. Deployment

• Integrate the trained model into an email system or application.

• New emails are classified in real time as spam or non-spam.

10. Continuous Learning (Adaptive Filtering)

• Regularly update the model using new labeled emails to keep up with evolving spam tactics.
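
Steps 2 to 7 can be sketched end to end on a tiny hand-made dataset. Everything here is illustrative: the emails, the stopword list, and the function names are hypothetical, the train/test split is fixed by hand rather than random, and the classifier is a multinomial Naïve Bayes over bag-of-words counts with add-one (Laplace) smoothing, which is an assumption of this sketch.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "the", "to", "for", "your", "is", "on", "and", "at", "of"}  # tiny illustrative list

def preprocess(text):
    """Step 2: lowercase, strip non-letters, tokenize, drop stopwords."""
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOPWORDS]

def train_naive_bayes(emails, labels):
    """Steps 3 and 6: bag-of-words counts per class, plus class frequencies."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(emails, labels):
        word_counts[label].update(preprocess(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(model, text):
    """Pick the class with the highest log-probability under add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        score = math.log(class_counts[label] / total)
        for token in preprocess(text):
            if token in vocab:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Steps 1 and 4: hypothetical labelled emails, split 80/20 by hand.
train = [
    ("Win a FREE prize now!!!", "spam"),
    ("Claim your free money today", "spam"),
    ("Cheap loans, click now", "spam"),
    ("Free gift card, click to claim", "spam"),
    ("Meeting agenda for Monday", "ham"),
    ("Please review the project report", "ham"),
    ("Lunch with the team on Friday", "ham"),
    ("Project meeting moved to Friday", "ham"),
]
test = [
    ("Claim your free prize now", "spam"),
    ("Review the meeting agenda", "ham"),
]

model = train_naive_bayes([t for t, _ in train], [l for _, l in train])

# Step 7: evaluate on the held-out emails.
predictions = [predict(model, text) for text, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

A test set of two emails is far too small to say anything about real performance. It is there only to show where evaluation fits in the sequence of steps.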

How to calculate Performance Measures by Measuring Quality of model.
