Unit 1 ML
1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component of the learning
process. Humans and computers alike utilize data storage as a foundation for advanced reasoning.
• In a human being, the data is stored in the brain and data is retrieved using electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices to store
data and use cables and other technology to retrieve data.
2. Abstraction
The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves creating general
concepts about the data as a whole. The creation of knowledge involves application of known models
and creation of new models.
The process of fitting a model to a dataset is known as training. When the model has been trained, the
data is transformed into an abstract form that summarizes the original information.
3. Generalization
The third component of the learning process is known as generalization.
Understanding data
The different types and forms of data encountered in the machine learning process are described below.
Unit of observation
By a unit of observation we mean the smallest entity with measured properties of interest for a study.
Examples
• A person, an object or a thing
• A time point
• A geographic region
• A measurement
Sometimes, units of observation are combined to form units such as person-years.
Examples and features
Datasets that store the units of observation and their properties can be imagined as collections of data
consisting of the following:
• Examples
An “example” is an instance of the unit of observation for which properties have been recorded.
An “example” is also referred to as an “instance”, or “case” or “record.” (It may be noted that
the word “example” has been used here in a technical sense.)
• Features
A “feature” is a recorded property or characteristic of an example.
Figure: Example of “examples” and “features” collected in a matrix format (the data relates to automobiles and
their features)
Spam e-mail
Let it be required to build a learning algorithm to identify spam e-mail.
(a) The unit of observation could be an e-mail message.
(b) The examples would be specific messages.
(c) The features might consist of the words used in the messages.
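As a rough illustration of the spam example, the sketch below turns a few messages into a matrix with one row per example and one column per feature (here, a word count). The messages and the helper name build_feature_matrix are hypothetical and chosen only for illustration; real spam filters use far richer features.

```python
from collections import Counter

def build_feature_matrix(messages):
    """Return (vocabulary, matrix): one row per message, one column per word."""
    tokenized = [message.lower().split() for message in messages]
    # The features are the distinct words seen in any message.
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical e-mail messages (the examples).
messages = ["win money now", "meeting at noon", "win a prize now"]
vocabulary, matrix = build_feature_matrix(messages)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix is one example and each column is one feature, which matches the matrix layout described for examples and features.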
Based on the method of learning, machine learning is mainly divided into four types,
which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-supervised learning
4. Reinforcement Learning
Figure: Illustration of a binary classification problem (plus, minus) with two feature variables (x1 and x2).
2. Unsupervised learning
In contrast to supervised learning, unsupervised learning is a branch of machine learning that is
concerned with unlabeled data. Common tasks in unsupervised learning are clustering analysis
(assigning group memberships) and dimensionality reduction (compressing data onto a lower-
dimensional subspace or manifold).
Figure Illustration of clustering, where the dashed lines indicate potential group membership assignments of
unlabeled data points.
4. Reinforcement learning
Reinforcement learning is the process of learning from rewards while performing a series of actions. In
reinforcement learning, we do not tell the learner or agent, for example, a (ro)bot, which action to take
but merely assign a reward to each action and/or the overall outcome. Instead of having a “correct/false”
label for each step, the learner must discover or learn a behavior that maximizes the reward for a series
CS1701 Machine Learning Department of CSE Unit I 9
of actions. In that sense, it is not a supervised setting and somewhat related to unsupervised learning;
however, reinforcement learning really is its own category of machine learning. Reinforcement
learning will not be covered further in this class.
Typical applications of reinforcement learning involve playing games (chess, Go, Atari video games)
and various kinds of robots, e.g., drones, warehouse robots, and more recently self-driving cars.
Supervised learning
● Supervised learning is a type of machine learning that uses labeled data to train machine
learning models.
Working:
Supervised learning algorithms take labeled inputs and map them to the known outputs, which
means the target variable is already known.
Supervised learning methods need external supervision to train machine learning models,
hence the name. They need guidance and additional information to return the
desired result.
In supervised learning, models are trained using a labeled dataset, from which the model learns about
each type of data. Once the training process is completed, the model is tested on
test data (a held-out portion of the labeled data that was not used for training), and then it predicts the output.
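The train-then-test workflow can be sketched as follows. This is a minimal illustration under stated assumptions: the data values are made up, and a 1-nearest-neighbour rule is used only as a stand-in learner because it is short; it is not one of the algorithms named in these notes.

```python
def nearest_neighbour_predict(train_x, train_y, x):
    """Predict the label of the training point closest to x (1-nearest neighbour)."""
    distances = [abs(x - tx) for tx in train_x]
    return train_y[distances.index(min(distances))]

# Hypothetical labeled data: (input value, known output label).
data = [
    (1.0, "low"), (1.5, "low"), (2.0, "low"),
    (8.0, "high"), (8.5, "high"), (9.0, "high"),
    (1.2, "low"), (8.8, "high"),
]

# Keep the last two labeled examples out of training and use them only for testing.
train, test = data[:6], data[6:]
train_x = [x for x, _ in train]
train_y = [y for _, y in train]

predictions = [nearest_neighbour_predict(train_x, train_y, x) for x, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Evaluating on examples that were not used for training gives a fairer estimate of how the model behaves on unseen data.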
a. Regression: -
Regression algorithms are used when the output variable is continuous and depends on the input
variables. They are used to predict continuous quantities, for example in weather forecasting and
market trend analysis.
Popular Regression algorithms which come under supervised learning:
Linear Regression
Regression Trees
Non-Linear Regression
Bayesian Linear Regression
Polynomial Regression
b. Classification: -
Classification algorithms are used when the output variable is categorical, which means it takes one of a
fixed set of classes, such as Yes-No, Male-Female, True-False, etc.
Popular classification algorithms which come under supervised learning:
Random Forest
Decision Trees
Logistic Regression
Support Vector Machines
Advantages of Supervised Learning: -
With the help of supervised learning, the model can predict the output on the basis of prior
experiences.
In supervised learning, we can have an exact idea about the classes of objects.
Supervised learning models help us to solve various real-world problems such as fraud
detection, spam filtering, etc.
1.3 CLASSIFICATION
The classification problem consists of taking input vectors and deciding which of N classes they belong
to, based on training from exemplars of each class.
Example 1: Credit scoring. Differentiating between low-risk and high-risk customers from their
income and savings.
In credit scoring, the bank calculates the risk given the amount of credit and the information about the
customer. The information about the customer includes data we have access to and is relevant in
calculating his or her financial capacity—namely, income, savings, collaterals, profession, age, past
financial history, and so forth. The bank has a record of past loans containing such customer data and
whether the loan was paid back or not. From this data of particular applications, the aim is to infer a
The most important point about the classification problem is that it is discrete—each example belongs
to precisely one class, and the set of classes covers the whole possible output space.
This is an example of a classification problem where there are two classes: low-risk and high-risk
customers. The information about a customer makes up the input to the classifier whose task is to
assign the input to one of the two classes. After training with the past data, a classification rule learned
may be of the form
IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
A discriminant is a function that separates the examples of different classes.
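The credit-scoring rule can be written directly as a small function. This is only a sketch: the threshold values used for θ1 and θ2 below are hypothetical, whereas in practice they would be learned from the bank's record of past loans.

```python
def classify_customer(income, savings, theta1, theta2):
    """Apply: IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

# Hypothetical thresholds, chosen only for illustration.
THETA1 = 30000
THETA2 = 5000

print(classify_customer(45000, 8000, THETA1, THETA2))  # both conditions hold
print(classify_customer(45000, 2000, THETA1, THETA2))  # savings too low
print(classify_customer(20000, 8000, THETA1, THETA2))  # income too low
```

The two thresholds define an axis-aligned boundary in the (income, savings) plane, which is one simple form a discriminant can take.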
FIGURE Left: A set of straight-line decision boundaries for a classification problem. Right: An alternative set of
decision boundaries that separates the plusses from the lightning strikes better, but requires a line that isn’t straight.
For example, if we tried to separate coins based only on colour, we wouldn’t get very far, because
the 20 ¢ and 50 ¢ coins are both silver and the $1 and $2 coins are both bronze. However, if we use colour
and diameter, we can do a pretty good job of the coin classification problem for NZ coins. There are
some features that are entirely useless. For example, knowing that the coin is circular doesn’t tell us
anything about NZ coins, which are all circular. In other countries, though, it could be very useful.
1.4 REGRESSION
In machine learning, a regression problem is the problem of predicting the value of a numeric
output variable based on observed values of the input variables.
The value of the output variable may be a number, such as an integer or a floating point value.
These are often quantities, such as amounts and sizes.
The input variables may be discrete or real-valued.
Example: Consider the data on car prices.
General approach
Let x denote the set of input variables and y the output variable. In machine learning, the general
approach to regression is to assume a model, that is, some mathematical relation between x and y,
involving some parameters, say θ, in the following form:
y = f(x, θ)
The function f(x, θ) is called the regression function. The machine learning algorithm optimizes the
parameters in the set θ such that the approximation error is minimized; that is, the estimates of the values
of the dependent variable y are as close as possible to the correct values given in the training set.
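To make the general approach concrete, the sketch below assumes the simplest possible regression function, a straight line f(x, θ) = θ0 + θ1·x with a single input variable, and takes the approximation error to be the sum of squared differences. Both choices are assumptions made for illustration, since the notes do not fix a particular model or error measure, and the data values are made up.

```python
def fit_line(xs, ys):
    """Least-squares estimates (theta0, theta1) for the model y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution that minimizes the sum of squared errors.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    theta1 = covariance / variance
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

def f(x, theta):
    """The assumed regression function f(x, theta) = theta0 + theta1 * x."""
    theta0, theta1 = theta
    return theta0 + theta1 * x

# Hypothetical training set that lies exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta = fit_line(xs, ys)
print(theta, f(5.0, theta))
```

With noisy data the fitted line would not pass through every point; the parameters would instead be the ones that make the squared error as small as possible.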
Regression: Applications
Loan Default Prediction
House Price Prediction
Stock Market Prediction
Market Sales Forecasting
Advertising
1.5 UNSUPERVISED LEARNING
Unsupervised learning
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.
The goal of unsupervised learning is to find the underlying structure of a dataset, group the data
according to similarities, and represent the dataset in a compressed format.
Working of Unsupervised Learning
The working of unsupervised learning can be understood from the diagram below:
Here, we have taken unlabeled input data, which means it is not categorized and the corresponding
outputs are not given. This unlabeled input data is fed to the machine learning model in order
to train it. First, the model interprets the raw data to find hidden patterns, and then a
suitable algorithm, such as k-means clustering, is applied. The
algorithm divides the data objects into groups according to the similarities and differences between the
objects.
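A minimal sketch of k-means clustering on two-dimensional points is given here as an illustration. The points, the starting centroids and the fixed number of iterations are all assumptions made to keep the example short and deterministic; practical implementations usually choose starting centroids at random and stop when the assignments no longer change.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate an assignment step and a centroid update step."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in cluster) / len(cluster),
             sum(p[1] for p in cluster) / len(cluster))
            if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(1, 1), (9, 8)])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the grouping emerges only from the distances between the points.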
1. Clustering: Clustering is a method of grouping objects into clusters such that the objects with the most
similarities remain in one group and have few or no similarities with the objects of another group.
Cluster analysis finds the commonalities between the data objects and categorizes them according to the
presence or absence of those commonalities.
2. Association: An association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that occur together
in the dataset. Association rules make marketing strategies more effective. For example, people who buy item X
(say, bread) also tend to purchase item Y (butter or jam). A typical example of association
rule learning is Market Basket Analysis.
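The bread-and-butter example can be quantified with two standard measures for association rules, support and confidence. These notes do not define them, so they are introduced here as an assumption: support is the fraction of transactions containing a set of items, and confidence of the rule X → Y is the fraction of transactions containing X that also contain Y. The transactions below are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
]

print(support(transactions, {"bread"}))                 # bread is in 3 of 4 baskets
print(support(transactions, {"bread", "butter"}))       # both are in 2 of 4 baskets
print(confidence(transactions, {"bread"}, {"butter"}))  # 2 of the 3 bread baskets
```

Algorithms such as Apriori search for all rules whose support and confidence exceed chosen minimum values, rather than checking one rule at a time as done here.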
Unsupervised Learning algorithms:
K-means clustering
Hierarchical clustering
Anomaly detection
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition
Machine learning models can be classified into two types: Discriminative and Generative.
In simple words, a discriminative model makes predictions on unseen data based on the
conditional probability P(Y|X) and can be used for either classification or regression
problems.
In contrast, a generative model focuses on the distribution of the dataset and returns a
probability for a given example.
They are related to known effects of causal direction, classification vs. inference learning, and
observational vs. feedback learning.
Problem Formulation
Suppose we are working on a classification problem where our task is to decide if an email is spam or
not spam based on the words present in a particular email. To solve this problem, we have a joint model
over:
● Labels: Y = y, and
● Features: X = {x1, x2, …, xn}
Therefore, the joint distribution of the model can be represented as
P(Y, X) = P(y, x1, x2, …, xn)
Now, our goal is to estimate the probability of spam email i.e., P(Y=1|X). Both generative and
discriminative models can solve this problem but in different ways.
The Approach of Generative Models
In the case of generative models, to find the conditional probability P(Y|X), they estimate the prior
probability P(Y) and the likelihood P(X|Y) with the help of the training data and use Bayes’
theorem to calculate the posterior probability P(Y|X):
P(Y|X) = P(X|Y) P(Y) / P(X)
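The generative approach can be sketched with a single binary feature. Bayes’ theorem states P(Y|X) = P(X|Y) P(Y) / P(X); the code below estimates the prior P(Y) and the likelihood P(X|Y) by counting and then applies that formula. The training data and the choice of feature (whether the message contains the word “free”) are hypothetical.

```python
# Hypothetical training data: (contains_word_free, is_spam) pairs.
training = [(1, 1), (1, 1), (0, 1), (1, 0), (0, 0), (0, 0), (0, 0), (0, 0)]

def posterior_spam(training, x):
    """Return P(Y=1 | X=x) via Bayes' theorem, with probabilities estimated by counting."""
    n = len(training)
    numerators = {}
    for y in (0, 1):
        rows = [xi for xi, yi in training if yi == y]
        prior = len(rows) / n                                 # P(Y=y)
        likelihood = sum(xi == x for xi in rows) / len(rows)  # P(X=x | Y=y)
        numerators[y] = likelihood * prior                    # P(X=x | Y=y) P(Y=y)
    evidence = sum(numerators.values())                       # P(X=x)
    return numerators[1] / evidence

print(posterior_spam(training, 1))  # message contains the word
print(posterior_spam(training, 0))  # message does not contain the word
```

A discriminative model would instead estimate P(Y|X) directly, without modelling P(X|Y) and P(Y) separately.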
Figure: Hypothesis h′ is more general than hypothesis h″ if and only if S″ ⊆ S′
6. Version space
Consider a binary classification problem. Let D be a set of training examples and H a hypothesis
space for the problem. The version space for the problem with respect to the set D and the space H is
the set of hypotheses from H consistent with D; that is, it is the set
VS(D, H) = {h ∈ H : h(x) = c(x) for all x ∈ D}
where c(x) denotes the class label of x given in D.
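For a small, finite hypothesis space the version space can be found by simply checking every hypothesis against every training example. The sketch below assumes a hypothetical hypothesis space of threshold rules, where hypothesis t labels x as positive when x ≥ t; the thresholds and training examples are made up for illustration.

```python
def h(threshold, x):
    """Threshold hypothesis: label 1 (positive) if x >= threshold, else 0 (negative)."""
    return 1 if x >= threshold else 0

def version_space(thresholds, examples):
    """Return the hypotheses (thresholds) that agree with every training example."""
    return [t for t in thresholds
            if all(h(t, x) == label for x, label in examples)]

# Hypothetical hypothesis space H: thresholds 1, 2, ..., 10.
H = range(1, 11)
# Hypothetical training set D of (x, label) pairs.
D = [(2, 0), (3, 0), (7, 1), (9, 1)]

print(version_space(H, D))  # thresholds 4 to 7 classify all of D correctly
```

Adding more training examples can only shrink the version space, since each new example may rule out further hypotheses.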
7. Noise
Noise and its sources
Noise is any unwanted anomaly in the data. Noise may arise due to several factors:
1. There may be imprecision in recording the input attributes, which may shift the data points in
the input space.
2. There may be errors in labeling the data points, which may relabel positive instances as negative
and vice versa. This is sometimes called teacher noise.
3. There may be additional attributes, which we have not taken into account, that affect the label of an
instance. Such attributes may be hidden or latent in that they may be unobservable. The effect of these
neglected attributes is thus modeled as a random component and is included in “noise.”
8. Learning multiple classes
So far we have been discussing binary classification problems. In a general case there may be more
than two classes. Two methods are generally used to handle such cases. These methods are known
by the names “one-against-all” and “one-against-one”.
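The two methods differ in how a problem with K classes is broken into binary problems: one-against-all needs K binary classifiers, while one-against-one needs K(K−1)/2. The sketch below only lists the binary problems that each method would create; the class names are hypothetical, and training the binary classifiers themselves is left out.

```python
from itertools import combinations

def one_against_all_tasks(classes):
    """One binary problem per class: that class versus all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def one_against_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

# Hypothetical class labels (K = 4).
classes = ["cat", "dog", "bird", "fish"]

ova = one_against_all_tasks(classes)
ovo = one_against_one_tasks(classes)
print(len(ova), ova[0])  # 4 problems, the first is cat versus the rest
print(len(ovo), ovo[0])  # 6 problems, the first is cat versus dog
```

One-against-one creates more classifiers, but each one is trained only on the examples of two classes.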
This section briefly examines the process by which machine learning algorithms can be selected, applied, and
evaluated for a problem.
Steps in Machine Learning Process
1. Data Collection and Preparation
2. Feature Selection
3. Algorithm Choice
4. Parameter and Model Selection
5. Training
6. Evaluation
FIGURE The reinforcement learning cycle: the learning agent performs action a_t in state s_t and receives reward r_{t+1}
from the environment, ending up in state s_{t+1}.
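The cycle in the figure can be sketched as a loop between an agent and an environment. Everything in the example is hypothetical: the environment is a line of four positions with a reward for reaching the last one, and the agent follows a fixed policy. The sketch shows only the interaction loop (state, action, reward, next state), not a learning algorithm.

```python
def step(state, action):
    """Hypothetical environment: positions 0..3 on a line, reward 1 for reaching 3."""
    next_state = max(0, min(3, state + action))
    reward = 1 if next_state == 3 else 0
    return next_state, reward

def run_episode(policy, start_state=0, max_steps=10):
    """Run the agent-environment loop; return a list of (s_t, a_t, r_t+1, s_t+1)."""
    history = []
    state = start_state
    for _ in range(max_steps):
        action = policy(state)                    # agent chooses a_t in state s_t
        next_state, reward = step(state, action)  # environment returns s_t+1 and r_t+1
        history.append((state, action, reward, next_state))
        state = next_state
        if state == 3:                            # the episode ends at the goal
            break
    return history

def always_right(state):
    """A fixed, hand-written policy: always move one position to the right."""
    return 1

history = run_episode(always_right)
total_reward = sum(reward for _, _, reward, _ in history)
print(history, total_reward)
```

A reinforcement learning algorithm would replace the fixed policy with one that is adjusted over many episodes so that the total reward increases.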
Applications
Reinforcement learning algorithms are widely used in the gaming industry to build games. They are also
used to train robots to do human tasks.
This is where the name ‘reinforcement learning’ comes from, since you repeat actions that are
reinforced by a feeling of satisfaction.