ML Unit 1 Pallav
UNIT-1
Basic Concepts
Definition of learning systems
A learning system is essentially a collection of artefacts that are brought together, in
an appropriate way, in order to create an environment that will facilitate various types
of learning processes. Learning systems can take a variety of different forms - for
example, a book, a mobile phone, a computer, an online forum, a school and a university.
Most learning systems will provide various types of learning resources and descriptions
of procedures for using these to achieve particular learning outcomes. They will also
embed various strategies for assessing the level and quality of their users' achievement.
Training data
Training data is the real fuel that drives the machine learning process. It provides the
actual inputs from which the algorithms learn certain patterns, and this training is then
used to predict the right results when similar data appears in real-life use.
Training data is generated through a labeling process, which involves image
annotation, text annotation and video annotation using certain techniques to make the
objects recognizable to computer vision for machine learning training.
Labeling such data is called "Data Annotation". It is done by well-trained and
experienced annotators who annotate images available in different formats. The best
tools and techniques are used to annotate the object of interest in an image while ensuring
the accuracy needed to train popular AI systems such as self-driving cars or robots.
• Text Annotation
• Video Annotation
• Image Annotation
Image annotation is used to train computer vision based perception models.
Different types of techniques are adopted in this image annotation service.
Bounding box, semantic segmentation, 3D cuboid, polygons, landmark annotation and
polylines annotation are the leading image annotation methods used in such tasks.
In fact, in machine learning, labeled or unlabeled training data is used depending on
whether the process is supervised or unsupervised. In supervised machine learning the object is
categorized, classified and segmented to make it recognizable to machines.
In unsupervised machine learning, the data is not labeled and the algorithm has to
group the objects according to its own understanding, forming its own segmentation so
that it can recognize the same objects when they appear in future.
Concept representation
In deep learning, feature representations are generally learned as a blob of ungrouped
features. However, an increasing number of visual applications benefit from inferring
knowledge from imagery, which requires scene understanding. Semantic segmentation is a task
that paves the way towards scene understanding. Deep semantic segmentation [17] uses deep
learning for semantic segmentation.
Deep semantic segmentation makes dense predictions inferring labels for every pixel. It can be
carried out at three different levels:
• Class segmentation: each pixel is labeled with the class of its enclosing object or region
• Instance segmentation: separate labels for different instances of the same class
• Part segmentation: decomposition of already segmented classes into their component
sub-classes
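As an illustration (not the CODL implementation), the following is a minimal sketch of dense
per-pixel class prediction with a toy convolutional network in PyTorch; the layer sizes, class
count and input size are illustrative assumptions.

import torch
import torch.nn as nn

# toy network producing per-pixel class logits (all sizes are assumptions)
num_classes = 3
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, num_classes, kernel_size=1),
)

image = torch.randn(1, 3, 64, 64)   # one RGB image
logits = model(image)               # shape: (1, num_classes, 64, 64)
labels = logits.argmax(dim=1)       # a class label for every pixel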
CODL extends and generalizes deep semantic segmentation. In CODL, feature representations
are always learned semantically segmented in a concept-oriented manner. Concept orientation
means that each feature representation is associated with a concept, an instance or an
attribute. These concepts, instances and attributes form a concept graph. In addition, the
concept graph is generally linked to Microsoft Concept Graph, thus leveraging and integrating
with the common conceptual knowledge and conceptual understanding capability provided by
Microsoft Concept Graph.
A concept representation consists of a concept, its instances and attributes, and all the feature
representations associated with the concept and its instances and attributes. If a concept has
sub-concepts, its concept representation also consists of the concept representations of its
sub-concepts. Concept representations, therefore, are the same as concept-oriented feature
representations, but provide a different view. The latter is data driven and provides a bottom-
up view starting from feature representations; the former is concept driven and provides a top-
down view starting from concepts. Due to the focus on concepts instead of low-level feature
representations, concept representations provide the proper view to work with in CODL.
Concept representations can be learned using supervised learning. Similar to deep semantic
segmentation, discussed above, it can be carried out at different levels:
• Concept level: each feature representation is labeled with the concept that owns the
feature
• Instance level: separate labels for different instances of the same concept
• Attribute level: separate labels for different attributes of the same concept
• Component level: decomposition of already learned concept representations into their
sub-concept representations
The concept, instance and attribute names used for labeling should be taken from Microsoft
Concept Graph, if available. This provides a direct link to Microsoft Concept Graph to leverage its
common conceptual knowledge and conceptual understanding capability.
Function approximation
In general, a function approximation problem asks us to select a function among a well-
defined class that closely matches ("approximates") a target function in a task-specific
way. The need for function approximations arises in many branches of applied
mathematics, and computer science in particular.
One can distinguish two major classes of function approximation problems:
First, for known target functions, approximation theory is the branch of numerical
analysis that investigates how certain known functions (for example, special functions)
can be approximated by a specific class of functions (for
example, polynomials or rational functions) that often have desirable properties
(inexpensive computation, continuity, integral and limit values, etc.).
Second, the target function, call it g, may be unknown; instead of an explicit formula,
only a set of points of the form (x, g(x)) is provided. Depending on the structure of
the domain and codomain of g, several techniques for approximating g may be
applicable. For example, if g is an operation on the real numbers, techniques
of interpolation, extrapolation, regression analysis, and curve fitting can be used. If
the codomain (range or target set) of g is a finite set, one is dealing with
a classification problem instead.
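As a small illustration of the second case, the sketch below fits a polynomial to sampled
points (x, g(x)); the choice of sine as the hidden target and the polynomial degree are
assumptions made only for this example.

import numpy as np

x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x)                        # pretend g is unknown; only these samples are seen

coeffs = np.polyfit(x, y, deg=5)     # select a degree-5 polynomial approximant
approx = np.poly1d(coeffs)
print(approx(1.0), np.sin(1.0))      # approximation vs. true value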
To some extent, the different problems (regression, classification, fitness approximation)
have received a unified treatment in statistical learning theory, where they are viewed
as supervised learning problems.
Types of Learning
Supervised learning
Supervised learning is the training of a machine using data that is well labelled, which
means the training data is already tagged with the correct answer. For example, suppose
the machine is given a basket filled with different fruits as training data:
• If the shape of the object is rounded with a depression at the top and the colour is red,
then it will be labelled as Apple.
• If the shape of the object is a long curving cylinder with green-yellow colour, then it will
be labelled as Banana.
Now suppose that after training on this data, you give the machine a new fruit, say a
banana, from the basket and ask it to identify it.
Since the machine has already learned from the previous data, it now has to use that
knowledge wisely. It will first classify the fruit by its shape and colour, confirm the
fruit name as BANANA and put it in the Banana category. Thus the machine learns from the
training data (the basket containing fruits) and then applies that knowledge to the test
data (the new fruit).
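A minimal sketch of this fruit example as a supervised classification task, assuming a
made-up numeric encoding of shape and colour and scikit-learn's DecisionTreeClassifier:

from sklearn.tree import DecisionTreeClassifier

# features: [is_rounded, has_top_depression, is_red]  (the encoding is an assumption)
X_train = [
    [1, 1, 1],   # rounded, depression at top, red      -> Apple
    [1, 1, 1],
    [0, 0, 0],   # long curving cylinder, green-yellow  -> Banana
    [0, 0, 0],
]
y_train = ["Apple", "Apple", "Banana", "Banana"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[0, 0, 0]]))      # new fruit from the basket -> ['Banana']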
Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable is a category,
such as “Red” or “blue” or “disease” and “no disease”.
• Regression: A regression problem is when the output variable is a real value, such
as “dollars” or “weight”.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled, allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns and differences without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training will be given to
the machine. Therefore the machine is left to find the hidden structure in unlabeled
data by itself.
For instance, suppose the machine is given images containing both dogs and cats which it
has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot label them as
dogs and cats. But it can categorize them according to their similarities, patterns and
differences, i.e., it can easily divide the pictures into two parts. The first part may
contain all the pictures having dogs in them and the second part may contain all the
pictures having cats in them. Here the machine has not learned anything beforehand,
which means there is no training data or examples.
Unsupervised learning is classified into two categories of algorithms:
• Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also tend
to buy Y.
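A small sketch of the clustering case, grouping customers by purchasing behaviour with
k-means; the two features and the data values are made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

# features: [annual_spend, visits_per_month]  (illustrative data)
customers = np.array([
    [200,  2], [220,  3], [250,  2],     # light buyers
    [900, 12], [950, 11], [880, 13],     # heavy buyers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)    # groupings discovered without any labels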
Overview of classification
Classification: Meaning
Classification is a type of supervised learning. It specifies the class to which data
elements belong and is best used when the output has finite and discrete values. It
also predicts a class for an input variable.
There are 2 types of Classification:
• Binomial
• Multi-Class
Classification: Example
Social media sentiment analysis is a binomial example: it has two potential outcomes,
positive or negative.
The Iris flower dataset is a multi-class example. It is classified into its three
sub-species, indicated by codes 0, 1 and 2, and new test data points are assigned to one
class or the other based on the trained classifier model.
Classification algorithms fall into two groups (a short sketch comparing one linear and one
nonlinear model follows this list):
• Linear Models
o Logistic Regression
o Support Vector Machines
• Nonlinear models
o K-nearest Neighbors (KNN)
o Kernel Support Vector Machines (SVM)
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
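A short sketch, assuming scikit-learn and the Iris dataset mentioned above, comparing one
linear model and one nonlinear model from the list:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))   # accuracy on the test split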
Training Dataset
The actual dataset that we use to train the model (the weights and biases in the case of
a neural network). The model sees and learns from this data.
Validation Dataset
The validation set is used to evaluate a given model frequently during development.
We as machine learning engineers use this data to fine-tune the model hyperparameters.
Hence the model occasionally sees this data, but never "learns" from it.
We (mostly humans, at least as of 2017) use the validation set results to update
higher-level hyperparameters. So the validation set does affect the model, but only
indirectly.
Test Dataset
Test Dataset: The sample of data used to provide an unbiased evaluation of a final
model fit on the training dataset.
The Test dataset provides the gold standard used to evaluate the model. It is only used
once a model is completely trained (using the train and validation sets). The test set is
generally what is used to evaluate competing models (for example, in many Kaggle
competitions the validation set is released initially along with the training set, the
actual test set is only released when the competition is about to close, and it is the result
of the model on the Test set that decides the winner). Often the validation set
is used as the test set, but this is not good practice. The test set is generally well curated. It
contains carefully sampled data that spans the various classes the model would face
when used in the real world.
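One possible way to carve a dataset into train, validation and test sets, sketched with
scikit-learn and an assumed 60/20/20 ratio:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# hold out 20% as the final test set first
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
# split the rest into train (60% of the total) and validation (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20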
How the data should be divided between these sets mainly depends on two things: first, the
total number of samples in your data, and second, the actual model you are training.
Some models need substantial data to train on, so in this case you would optimize for
larger training sets. Models with very few hyperparameters are easy to validate
and tune, so you can probably reduce the size of your validation set; but if your model has
many hyperparameters, you will want a large validation set as well (although
you should also consider cross validation). Also, if you happen to have a model with no
hyperparameters, or ones that cannot be easily tuned, you probably don't need a
validation set at all!
All in all, like many other things in machine learning, the train-test-validation split ratio
is quite specific to your use case, and it gets easier to make this judgement as you train
and build more and more models.
Note on Cross Validation: Often, people first split their dataset into two parts: Train
and Test. After this, they keep the Test set aside and randomly choose X% of their Train
dataset to be the actual Train set and the remaining (100-X)% to be
the Validation set, where X is a fixed number (say 80%). The model is then iteratively
trained and validated on these different sets. There are multiple ways to do this, and
this is commonly known as Cross Validation. Basically, you use your training set to generate
multiple splits of the Train and Validation sets. Cross validation helps avoid overfitting and
is getting more and more popular, with K-fold Cross Validation being the most popular
method of cross validation.
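A minimal sketch of K-fold cross validation with scikit-learn; the model and dataset are
just placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())   # one validation score per fold, and their average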
Classification Families
Discriminant Analysis
During a study, there are often questions that strike the researcher and must be
answered. These include questions like 'are the groups different?', 'on what variables
are the groups most different?', 'can one predict which group a person belongs to using
such variables?' etc. In answering such questions, discriminant analysis is quite helpful.
There are many examples that can explain when discriminant analysis fits. It
can be used to know whether heavy, medium and light users of soft drinks
are different in terms of their consumption of frozen foods. In the field of
psychology, it can be used to differentiate between price-sensitive and
non-price-sensitive buyers of groceries in terms of their psychological
attributes or characteristics. In the field of business, it can be used to
understand the characteristics or the attributes of a customer possessing
store loyalty and a customer who does not have store loyalty.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis
and projects data onto a new axis in a way to maximize the separation of the two
categories and hence, reducing the 2D graph into a 1D graph.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as
it becomes impossible for LDA to find a new axis that makes both classes linearly
separable. In such cases, we use non-linear discriminant analysis.
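A small sketch of this projection using scikit-learn's LinearDiscriminantAnalysis; the 2-D
data points are invented for illustration:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],    # class 0
              [4.0, 4.5], [4.2, 4.8], [3.8, 4.4]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
X_1d = lda.transform(X)                 # each 2-D point projected onto the new axis
print(X_1d.ravel(), lda.predict([[2.0, 2.5]]))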
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used,
such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of
the variance (actually covariance), moderating the influence of different variables on LDA.
Applications:
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular
application in which each face is represented by a very large number of pixel values.
Linear discriminant analysis (LDA) is used here to reduce the number of features to a
more manageable number before the process of classification. Each of the new
dimensions generated is a linear combination of pixel values, which form a template. The
linear combinations obtained using Fisher’s linear discriminant are called Fisher faces.
2. Medical: In this field, Linear discriminant analysis (LDA) is used to classify the patient
disease state as mild, moderate or severe based upon the patient's various parameters and
the medical treatment he is going through. This helps the doctors to intensify or reduce
the pace of their treatment.
3. Customer Identification: Suppose we want to identify the type of customers who are
most likely to buy a particular product in a shopping mall. By doing a simple question and
answers survey, we can gather all the features of the customers. Here, Linear discriminant
analysis will help us to identify and select the features which can describe the
characteristics of the group of customers that are most likely to buy that particular
product in the shopping mall.
Decision Trees
Tree models where the target variable can take a discrete set of
values are called classification trees. Decision trees where the
target variable can take continuous values (typically real numbers)
are called regression trees. Classification And Regression Tree
(CART) is a general term for both.
Throughout this section, I will try to explain using examples.
Data Format
Data comes in records of the form:
(x, Y) = (x1, x2, x3, ..., xk, Y)
where x1, x2, ..., xk are the feature values and Y is the label (the last column).
Example
training_data = [
['Green', 3, 'Apple'],
['Yellow', 3, 'Apple'],
['Red', 1, 'Grape'],
['Red', 1, 'Grape'],
['Yellow', 3, 'Lemon'],
]
# Header = ["Color", "diameter", "Label"]
# The last column is the label.
# The first two columns are features.
Information Gain
Information gain is used to decide which feature to split on at each
step in building the tree. Simplicity is best, so we want to keep our
tree small. To do so, at each step we should choose the split that
results in the purest daughter nodes. A commonly used measure of
purity is called information. For each node of the tree, the
information value measures how much information a feature
gives us about the class. The split with the highest
information gain will be taken as the first split and the
process will continue until all children nodes are pure,
or until the information gain is 0.
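A small sketch of this idea in Python, reusing the training_data list above; the entropy
measure and the candidate split on diameter are assumptions for illustration (a decision
tree library may use Gini impurity instead):

import math
from collections import Counter

def entropy(rows):
    # information value of a node: -sum p * log2(p) over the class labels
    counts = Counter(row[-1] for row in rows)       # label is the last column
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def info_gain(left, right, parent_entropy):
    # how much the split reduces impurity, weighted by branch sizes
    p = len(left) / (len(left) + len(right))
    return parent_entropy - p * entropy(left) - (1 - p) * entropy(right)

# candidate split: diameter >= 3 (an assumed split)
left = [row for row in training_data if row[1] >= 3]
right = [row for row in training_data if row[1] < 3]
print(info_gain(left, right, entropy(training_data)))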
Probabilistic classification
In machine learning, a probabilistic classifier is a classifier that is able to predict,
given an observation of an input, a probability distribution over a set of classes, rather
than only outputting the most likely class that the observation should belong to.
Probabilistic classifiers provide classification that can be useful in its own right[1] or when
combining classifiers into ensembles.
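A minimal sketch of a probabilistic classifier using scikit-learn's logistic regression:
predict_proba returns a probability distribution over the classes instead of only the top
class (the Iris dataset is used only as a placeholder):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict([X[0]]))           # only the most likely class
print(clf.predict_proba([X[0]]))     # full probability distribution over the classes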
Nearest Neighbour Classification
There are many techniques available for improving the performance and speed of a
nearest neighbour classification. One approach to this problem is to pre-sort the
training sets in some way (such as kd-trees or Voronoi cells). Another solution is to
choose a subset of the training data such that classification by the 1-NN rule (using
the subset) approximates the Bayes classifier. This can result in significant speed
improvements as k can now be limited to 1 and redundant data points have been
removed from the training set. These data modification techniques can also improve
the performance through removing points that cause mis-
classifications. Several dataset reduction techniques are discussed in the section on
target detection.
The above discussion focuses on binary classification problems; there are only two
possible output classes. In the digit recognition example there are ten output classes,
which changes things slightly. The labelling of training samples and computing the
distance are unchanged, but ties can now occur even with k odd. If all of the k nearest
neighbours are from different classes we are no closer to a decision than with the
single nearest neighbour rule. We will therefore revert to a 1-NN rule when there
is no majority within the k nearest neighbours.
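A small sketch of this voting rule in Python; the data, the distance metric and the
fallback to 1-NN on ties follow the description above, but the implementation details are
assumptions:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)        # distance to every training sample
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbours
    votes = Counter(y_train[nearest]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:  # no majority: revert to 1-NN
        return y_train[nearest[0]]
    return votes[0][0]

# tiny made-up example with three classes
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 2, 2])
print(knn_predict(X_train, y_train, np.array([1.5]), k=3))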
The nearest neighbour rule is quite simple, but very computationally intensive. For
the digit example, each classification requires 60,000 distance calculations between
784 dimensional vectors (28x28 pixels). The nearest neighbour code was therefore
written in C in order to speed up the Matlab testing. The files are given below, but
note that these are set up to read in the image database after it has been converted
from the format available on the MNIST web page.