Unit1 ML
Introduction:
Machine learning is a subfield of artificial intelligence, which is broadly defined as the
capability of a machine to imitate intelligent human behaviour. Artificial intelligence systems
are used to perform complex tasks in a way that is similar to how humans solve problems.
The term machine learning was first introduced by Arthur Samuel in 1959. “Machine
Learning is the field of study that gives computers the capability to learn without being
explicitly programmed”. Machine learning (ML) is a field devoted to understanding and
building methods that let machines "learn" – that is, methods that leverage data to improve
computer performance on some set of tasks. Machines can learn from past data and
improve automatically. Machine learning is used to make decisions based on data: by
modelling algorithms on the basis of historical data, the algorithms find patterns and
relationships that are difficult for humans to detect. Machine learning (ML) is the process of
using mathematical models of data to help a computer learn without direct instruction. It’s
considered a subset of artificial intelligence (AI). Machine learning uses algorithms to
identify patterns within data, and those patterns are then used to create a data model that
can make predictions. With increased data and experience, the results of machine learning
are more accurate. Think, for example, of a supermarket chain that is selling thousands of
goods to millions of customers. The details of each transaction are stored: date, customer
id, goods bought and their amount, total money spent, and so forth. This typically amounts
to a lot of data every day. What the supermarket chain wants is to be able to predict which
customer is likely to buy which product, to maximize sales and profit. Similarly, each
customer wants to find the set of products best matching his/her needs.
Regular problems have a fixed input and output. For some tasks, however, we do
not have an algorithm. Predicting customer behaviour is one; another is to tell spam emails
from legitimate ones. Machine learning also helps us find solutions to many problems in
vision, speech recognition, and robotics. With the help of sample historical data, which is
known as training data, machine learning algorithms build a mathematical model that helps
in making predictions or decisions without being explicitly programmed. Machine learning
brings computer science and statistics together for creating predictive models. Note:
machine learning is primarily concerned with the accuracy and effectiveness of the
computer system.
Applications of machine learning include:
1. Image recognition
2. Product recommendations
3. Speech Recognition
4. Natural language processing
5. Online fraud detection
6. Email filtering
7. Medical diagnosis
8. Stock market trading and many more.
Machine learning life cycle or process: to implement machine learning and to understand
how it works, one can follow the stages in the approach:
The steps involved are:
• Gathering the data
• Preparing the data
• Choosing a model
• Train the model
• Evaluate the model
• Parameter tuning
• Make predictions
One can understand the whole process based on the diagrams given here:
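Under illustrative choices (scikit-learn's built-in iris toy dataset and a decision tree; the dataset and model are assumptions, not prescribed by the notes), the life-cycle steps above can be sketched end to end roughly as:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 1-2. Gather and prepare the data (a built-in toy dataset stands in here)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3-4. Choose a model and train it
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 5. Evaluate the model on held-out data
print("accuracy:", model.score(X_test, y_test))

# 6. Parameter tuning (a small grid search over tree depth)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 3, 4]}).fit(X_train, y_train)

# 7. Make predictions with the tuned model
print(search.best_params_, search.predict(X_test[:3]))
```

Real projects replace the toy dataset with the data-gathering and data-preparation steps, which usually dominate the effort.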
Learning Models:
Machine Learning is all about using the right features to build the right models that achieve
the right tasks.
• Features: the workhorses of Machine Learning.
• Models: the output of Machine Learning.
• Tasks: the problems that can be solved with Machine Learning.
Models are the central concept in machine learning as they are what one learns from data in
order to solve a given task.
Models are classified into the following:
1. Geometric Models
2. Logical models
3. Probabilistic models
4. Grouping and Grading
Logical Models: Use a logical expression to divide the instance space into segments and
hence construct grouping models. Here the instance space is a collection of all possible
instances to build the right model. A logical expression always results in a Boolean value,
TRUE or FALSE. There are 2 types of logical models:
Tree based and Rule based.
i) Tree Models: Here a tree structure is built to make the necessary model.
The tree consists of ellipses for features and rectangles for leaves. The leaves
consist of a class, value, or probabilities. If the value is a class, then the feature tree is a
Decision Tree.
The tree models use divide and conquer approach for making a tree. The popularly used
tree-based Machine Learning algorithms are- Decision Tree, Random Forest and XGBoost.
Ex-1: Here is an illustration of how the Decision Tree algorithm works in segmenting a set of
data points into 2 classes: “sold out” and “not sold out”. First, the algorithm will divide the
data into two parts using a horizontal or vertical line. In this case, the first step is done by
splitting the x-axis using a vertical line separating the price above and below $600. Next, the
algorithm splits the y-axis into the left and right sides. We can see that for the price above
$600, the products will be sold if the quality is above 60 and not sold if it is below 60. If the
price is below $600, the algorithm needs further segmentation.
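Under invented [price, quality] data points shaped like this example (the numbers are illustrative, not taken from the notes), the segmentation can be reproduced with scikit-learn's DecisionTreeClassifier:

```python
from sklearn.tree import DecisionTreeClassifier

# [price, quality] pairs; 1 = sold out, 0 = not sold out (invented data)
X = [[700, 80], [650, 70], [620, 55], [800, 30],   # price above $600
     [300, 90], [500, 50], [400, 20], [200, 10]]   # price below $600
y = [1, 1, 0, 0, 1, 1, 0, 0]

# A full-depth tree keeps splitting until every leaf is pure,
# so it fits these eight distinct training points exactly.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.score(X, y))
print(tree.predict([[700, 75]]))  # an expensive, high-quality product
```

Each internal node of the fitted tree is one of the horizontal or vertical splits described above.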
ii) Rule Models: These consist of IF-THEN rules. The 'if-part' defines a segment and the
'then-part' defines the behaviour of the model. Such models are called "logical" because they
can easily be translated into rules that humans can understand, such as: if lottery = 1 then
class = spam (in the case of email SPAM vs. HAM). Rule models use the separate-and-conquer
technique.
Ex:
if SAVINGS = MEDIUM then
    credit_risk = good
else if SAVINGS = HIGH then
    if INCOME = LOW then
        credit_risk = bad
    else
        credit_risk = good
else if ASSETS = LOW then
    credit_risk = bad
else
    credit_risk = good
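The rule list above can be transcribed directly into code; the function name is just an illustrative choice:

```python
def credit_risk(savings, income, assets):
    """Transcription of the example's separate-and-conquer rule list."""
    if savings == "MEDIUM":
        return "good"
    elif savings == "HIGH":
        # for high savings, the decision depends on income
        return "bad" if income == "LOW" else "good"
    elif assets == "LOW":
        return "bad"
    else:
        return "good"

print(credit_risk("HIGH", "LOW", "HIGH"))   # → bad
print(credit_risk("MEDIUM", "LOW", "LOW"))  # → good
```

Each IF-THEN branch covers one segment of the instance space, which is exactly what makes rule models easy for humans to read.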
Probabilistic Models:
Probabilistic models see features and target variables as random variables. The process of
modelling represents and manipulates the level of uncertainty with respect to these
variables. Probabilistic models use the idea of probability to classify new instances.
There are two types of probabilistic models:
Predictive and
Generative.
• Predictive analytics refers to the process of using statistical techniques, data mining, and
machine learning algorithms to analyse historical data and make predictions about future
events or trends. By uncovering patterns and relationships within datasets, predictive
analytics enables businesses and organizations to make data-driven decisions, anticipate
customer behaviour, optimize operations, and identify potential risks.
• Generative models are designed to address various purposes ranging from image
synthesis, text generation to drug discovery.
• Predictive probability models use the idea of a conditional probability distribution P(Y | X)
from which Y can be predicted from X.
• Generative models estimate the joint distribution P(Y, X). Once we know the joint
distribution, we can derive any conditional or marginal distribution involving the same
variables. Probabilistic models use the idea of probability to classify new entities; Naïve
Bayes is an example of a probabilistic classifier.
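A minimal sketch of a probabilistic classifier on the spam theme, using scikit-learn's Naïve Bayes; the word features and counts below are invented for illustration:

```python
from sklearn.naive_bayes import MultinomialNB

# Each row counts occurrences of the words ["lottery", "meeting"]
# in an email; labels: 1 = spam, 0 = ham (all counts invented).
X = [[3, 0], [2, 1], [4, 0],   # spam emails
     [0, 2], [0, 3], [1, 2]]   # ham emails
y = [1, 1, 1, 0, 0, 0]

clf = MultinomialNB().fit(X, y)
print(clf.predict([[5, 0]]))        # many "lottery" mentions → spam
print(clf.predict_proba([[5, 0]]))  # the conditional distribution P(Y | X)
```

`predict_proba` exposes exactly the conditional distribution P(Y | X) that predictive probability models are built around.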
1. Grouping (Clustering)
Definition: The process of dividing a dataset into meaningful clusters or groups
without predefined labels.
Type: Unsupervised Learning
Purpose: Identifies patterns and groups similar data points together based on
similarities.
Example Use Cases:
o Customer segmentation in marketing.
o Anomaly detection in network security.
o Grouping news articles based on topics.
Popular Clustering Algorithms:
K-Means Clustering – Divides data into K clusters.
Hierarchical Clustering – Builds a tree-like hierarchy of clusters.
DBSCAN – Identifies clusters of varying shapes and detects noise.
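A minimal K-Means sketch in the spirit of the customer-segmentation use case above; the customer features and values are invented:

```python
from sklearn.cluster import KMeans

# Hypothetical customers: [annual_spend, visits_per_month]
X = [[200, 2], [250, 3], [220, 2],      # low spenders
     [900, 10], [950, 12], [880, 11]]   # high spenders

# No labels are given; K-Means discovers the two groups on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster id assigned to each customer
```

The cluster ids themselves are arbitrary; only the grouping is meaningful, which is the defining property of unsupervised output.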
2. Grading (Classification)
Definition: Assigning labels or categories to data based on training from pre-labeled
examples.
Type: Supervised Learning
Purpose: Helps in predicting categories (grades, labels, or classes) based on input
features.
Example Use Cases:
o Email spam detection (Spam/Not Spam).
o Grading students based on their test scores (A, B, C, etc.).
o Diagnosing diseases based on symptoms (Healthy/Sick).
Popular Classification Algorithms:
Logistic Regression – Best for binary classification (Yes/No).
Decision Trees & Random Forests – Handle complex decision-making.
Support Vector Machines (SVM) – Separates classes using hyperplanes.
Neural Networks – Used for deep learning-based classification.
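As a small illustration of the first algorithm in the list, a logistic regression sketch for binary (pass/fail) classification on invented hours-studied data:

```python
from sklearn.linear_model import LogisticRegression

# Hours studied → pass (1) or fail (0); the data points are invented.
X = [[1], [2], [3], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2], [8]]))    # discrete class labels
print(clf.predict_proba([[5]]))   # class probabilities near the boundary
```

Note that the same model yields both a discrete label (`predict`) and a graded probability (`predict_proba`), the two output styles contrasted later under classification.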
Key Differences:
Feature        | Grouping (Clustering)        | Grading (Classification)
Learning Type  | Unsupervised                 | Supervised
Output         | Groups/clusters (no labels)  | Labeled categories
Use Case       | Customer segmentation        | Spam detection
Algorithms     | K-Means, DBSCAN              | Decision Trees, SVM
ML Tools:
To implement machine learning concepts, users can write their own code or take the
assistance of libraries. Python provides a rich set of libraries. The Python libraries commonly
used in machine learning are: NumPy, SciPy, Scikit-learn, TensorFlow, Keras, PyTorch,
Pandas, and Matplotlib.
Types of Learning:
Machine learning offers a variety of techniques and models you can choose based on your
application, the size of data you're processing, and the type of problem you want to solve.
Here is the machine learning process given in the diagram.
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning:
Supervised learning is a machine learning approach that’s defined by its use of labeled
datasets. These datasets are designed to train or “supervise” algorithms into classifying data
or predicting outcomes accurately. Using labeled inputs and outputs, the model can
measure its accuracy and learn over time.
Supervised learning is a type of machine learning in which machines are trained using
well-"labelled" training data, and on the basis of that data, machines predict the output.
Labelled data means some input data is already tagged with the correct output. Suppose we
have a dataset of different types of shapes which includes squares, rectangles, triangles, and
polygons. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a
Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
For example, your spam filter is a machine learning program that can learn to flag spam
after being given examples of spam emails that are flagged by users, and examples of
regular non-spam (also called “ham”) emails. The examples the system uses to learn are
called the training set. In this case, the task (T) is to flag spam for new emails, the experience
(E) is the training data, and the performance measure (P) needs to be defined. For example,
you can use the ratio of correctly classified emails as P. This particular performance measure
is called accuracy, and it is often used in classification tasks since this is a supervised
learning approach. Examples: Linear Regression, Logistic Regression, KNN classification,
Support Vector Machine (SVM), Decision Trees, Random Forest, and Naïve Bayes.
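The accuracy measure P described above is just the fraction of correctly classified emails. With hypothetical label vectors:

```python
# 1 = spam, 0 = ham; both vectors are invented for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # labels provided by users (experience E)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]  # the filter's predictions (task T)

# Performance measure P: ratio of correctly classified emails
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # → 0.75
```

Here 6 of the 8 predictions match the true labels, so P = 0.75.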
Classification:
Classification in machine learning categorizes data into predefined groups based on shared
characteristics. Machine learning models learn the characteristics of each group from the
input data and then generalize this information to new data points. You’d likely choose this
for tasks where your output falls into a category, meaning it’s a discrete outcome. For
example, you might determine the type of tumor from a medical scan, segment customers
based on purchasing behavior, categorize customer feedback, or classify email types.
What is classification used for?
The application of classification depends on the type of classification algorithm you use. You
can opt for different classification models, such as binary, multiclass, and multilabel.
Binary classification categorizes data into one of two groups. Examples of this might include
filtering spam emails (spam/not spam), detecting fraud (fraud/not fraud), or diagnosing a
disease (disease/no disease).
Multiclass classification groups data into several distinctive groups with no overlap. For
example, if you had an image classifier, you might group pictures of dogs into breeds such as
“dalmatian,” “collie,” and “poodle.” If you have overlap in your groups, this is a multilabel
classification. For example, if you classified videos, you could simultaneously label a movie
as “action” or “comedy” or label it as both genres.
Advantages and disadvantages of classification
To determine whether classification is the right machine learning method for your data, it’s
important to weigh the pros and cons to make an informed decision. Advantages and
disadvantages to consider as a starting point include:
Advantages
Versatile input data: You can use classification with various input styles, including
text, images, audio, or video.
Easily evaluated: You can evaluate the performance of your model with many
established indicators and toolkits, helping you make decisions to maximize
efficiency, speed, and quality.
Can make discrete or continuous predictions: You can use classification to make
discrete predictions, such as “disease” or “no disease,” as well as continuous
predictions, such as “25 percent probability of disease” and “75 percent probability
of no disease.”
Disadvantages
Risk of overfitting or underfitting: Without sufficient training data, your model may
“overfit” by aligning too closely with the training data and failing to generalize to
new data. Conversely, it could “underfit” by struggling to learn meaningful patterns
due to insufficient exposure.
Requires high-level training data: Your training data set must be structured in an
appropriate format to learn the categories and attribute features to them. This
involves some pre-processing and data management.
Regression:
Regression in machine learning is a technique used to predict a continuous outcome value
using the value of input variables. The algorithm analyzes the input data to understand the
relationship between independent variables and the dependent variable. For example, to
predict a student's future exam score (a continuous variable), you might use study time,
sleep hours, and previous grade averages as input variables. Regression models establish a
consistent framework for making accurate predictions of the dependent variable by
identifying patterns and relationships in the data.
What is regression used for?
Similar to classification methods, the application of regression depends on the type of
regression used. Often, you will work with simple linear regression or multiple linear
regression.
Simple linear regression models how independent variables relate to a dependent variable
by finding a straight line that best fits the data. For example, you might predict housing
prices based on square footage, product sales based on marketing budget, or disease spread
based on vaccination rates.
With multiple linear regression, you add more independent variables to predict your
response variable value. For example, you might predict housing prices based on square
footage, zip code, and number of available houses.
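A sketch of multiple linear regression with scikit-learn; the housing figures below are invented, and bedrooms stands in for the second input variable:

```python
from sklearn.linear_model import LinearRegression

# [square_footage, bedrooms] → price in dollars (invented data)
X = [[1000, 2], [1500, 3], [2000, 3], [2500, 4]]
y = [200_000, 290_000, 370_000, 460_000]

# Least-squares fit: one coefficient per independent variable, plus intercept
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
print(model.predict([[1800, 3]]))  # predicted price for a new house
```

The fitted coefficients are what make regression easily interpretable: each one estimates how much the predicted price moves per unit change in that variable, holding the others fixed.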
Advantages and disadvantages of regression
Regression is a powerful tool when used under the right conditions but has limitations.
Considering the advantages and disadvantages can help you make an educated choice
between regression and other machine learning methods.
Advantages
Handles continuous and categorical outputs: Linear regression allows you to handle
continuous outputs, while logistic regression works well with categorical outputs.
Easily interpretable: Regression outputs offer insight into the magnitude of the
association between variables, allowing a clearer understanding of your model.
Low computational requirement: Regression analysis is simpler and more efficient
to implement compared to other machine learning algorithms.
Disadvantages
Relies on accurate data and assumptions: If you have incorrect data or assumptions,
regression models can make inaccurate predictions, leading to poor decision-making.
Doesn’t imply causation: While regression results can show correlations between
variables, they do not necessarily imply that changes in certain variables cause
changes in another.
Requires careful variable selection: Accurate predictions for the dependent variable
rely on selecting the right set of independent variables to effectively capture the
underlying relationships in the data.
Unsupervised Learning:
Unsupervised learning is a branch of Machine Learning that deals with unlabeled data.
Unlike supervised learning, where the data is labeled with a specific category or outcome,
unsupervised learning algorithms are tasked with finding patterns and relationships within
the data without any prior knowledge of the data’s meaning. Unsupervised machine
learning algorithms find hidden patterns and data without any human intervention, i.e., we
don’t give output to our model. The training model has only input parameter values and
discovers the groups or patterns on its own.
The image shows a set of animals (elephants, camels, and cows) that represents the raw
data the unsupervised learning algorithm will process.
• The "Interpretation" stage signifies that the algorithm doesn't have predefined labels
or categories for the data. It needs to figure out how to group or organize the data
based on inherent patterns.
• The algorithm represents the core of the unsupervised learning process, using techniques
like clustering, dimensionality reduction, or anomaly detection to identify patterns
and structures in the data.
• The processing stage shows the algorithm working on the data.
• The output shows the results of the unsupervised learning process. In this case, the
algorithm might have grouped the animals into clusters based on their species (elephants,
camels, cows).
How does unsupervised learning work?
Unsupervised learning works by analyzing unlabeled data to identify patterns and
relationships. The data is not labeled with any predefined categories or outcomes, so the
algorithm must find these patterns and relationships on its own. This can be a challenging
task, but it can also be very rewarding, as it can reveal insights into the data that would not
be apparent from a labeled dataset.
The dataset in Figure A is mall data that contains information about clients who subscribe
to the mall. Once subscribed, a client is provided a membership card, and the mall has
complete information about the customer and his/her every purchase. Using this data and
unsupervised learning techniques, the mall can easily group clients based on the parameters
being fed in.
Reinforcement Learning:
Reinforcement Learning revolves around the idea that an agent (the learner or decision-
maker) interacts with an environment to achieve a goal. The agent performs actions and
receives feedback to optimize its decision-making over time.
Agent: The decision-maker that performs actions.
Environment: The world or system in which the agent operates.
State: The situation or condition the agent is currently in.
Action: The possible moves or decisions the agent can make.
Reward: The feedback or result from the environment based on the agent’s action.
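The agent/environment/state/action/reward loop can be made concrete with a tiny Q-learning sketch. The environment (a 1-D corridor with states 0..3, actions 0 = left and 1 = right, reward 1 for reaching state 3) and all parameter values are invented for illustration:

```python
import random

random.seed(0)
n_states, n_actions = 4, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # the agent's value estimates
alpha, gamma, epsilon = 0.5, 0.9, 0.2             # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != 3:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        # Environment dynamics: the action determines the next state and reward
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == 3 else 0.0
        # Feedback optimizes decision-making: move Q(s, a) toward
        # reward + gamma * max_a' Q(s', a')
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(3)]
print(policy)  # the learned greedy action for states 0, 1, 2
```

After enough episodes, the reward feedback has propagated back through the Q-values and the greedy policy heads right toward the goal from every state.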
Version Space:
The Version Space (VS) is a subset of hypotheses from a given hypothesis space H that are
consistent with all the training examples. In other words, it contains only the hypotheses
that correctly classify all the training instances seen so far.
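A tiny sketch of computing a version space by brute force. The hypothesis representation (conjunctions over two Boolean features, with "?" meaning "any value") and the training examples are assumptions chosen for illustration:

```python
from itertools import product

def consistent(h, example):
    """A hypothesis is consistent if its prediction matches the example's label."""
    x, label = example
    predicted = all(h[i] in ("?", x[i]) for i in range(len(x)))
    return predicted == label

# Hypothesis space H: every constraint tuple over two Boolean features
H = list(product(["?", 0, 1], repeat=2))

# Training examples: (instance, label); positives need feature 1 set
D = [((1, 0), True), ((1, 1), True), ((0, 1), False)]

# The version space keeps exactly the hypotheses consistent with all of D
version_space = [h for h in H if all(consistent(h, ex) for ex in D)]
print(version_space)  # → [(1, '?')]
```

Here only the hypothesis "first feature must be 1, second feature is anything" survives all three examples; every new training instance can only shrink the version space further.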
PAC Learning:
PAC Learning is a theoretical framework used in computational learning theory to
mathematically analyze learning algorithms. It helps answer key questions such as:
What concepts can be learned efficiently?
How many training samples are needed for a good hypothesis?
How accurate and confident can we be in the learned hypothesis?
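For the sample-size question, a standard bound for a finite hypothesis class in the realizable setting states: with probability at least 1 - δ, any hypothesis consistent with the training data has error at most ε, provided the number of training examples m satisfies

```latex
m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)
```

Here ε is the accuracy parameter ("approximately correct") and δ is the confidence parameter ("probably"), which is where the name PAC comes from.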
Key Definitions
1. Hypothesis Class (H)
o The set of all functions that a learning algorithm can choose from.
2. Shattering
o A hypothesis class H shatters a set of points if, for every possible labeling
(classification) of those points, there is a hypothesis h ∈ H that
correctly separates them.
3. VC Dimension (VC(H))
o The largest number of points that can be shattered by H.
o If there exists a set of n points that can be shattered by the classifier, but
no set of n+1 points can be shattered, then VC(H) = n.
Examples
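Shattering can be checked by brute force. The sketch below uses a hypothetical class of threshold classifiers on the real line (h_t(x) = 1 if x ≥ t, else 0); the point sets and thresholds are chosen for illustration:

```python
def shatters(points, hypotheses):
    """True if the hypotheses realize every possible labeling of the points."""
    achievable = {tuple(h(x) for x in points) for h in hypotheses}
    return len(achievable) == 2 ** len(points)

# Threshold classifiers: the extreme thresholds give the two constant labelings
thresholds = [-10, 0.5, 1.5, 10]
H = [lambda x, t=t: int(x >= t) for t in thresholds]

print(shatters([1.0], H))       # → True: both labels of one point are achievable
print(shatters([1.0, 2.0], H))  # → False: the labeling (1, 0) is impossible
```

One point can be shattered but no two points can (a threshold can never label the left point 1 and the right point 0), so the VC dimension of this class is 1.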