Unit 1 Introduction To Machine Learning
Machine learning process
• The basic machine learning process can be divided into three parts.
1. Data Input: Past data or information is utilized as a basis for future
decision-making
2. Abstraction: The input data is represented in a broader way through
the underlying algorithm
3. Generalization: The abstracted representation is generalized to form
a framework for making decisions
Let’s consider a situation
• Consider the typical process of learning from the classroom and books
and preparing for an examination.
• Many students tend to memorize. This may work well when the questions
are simple and the scope of learning is not so vast.
• The questions asked in the examination get more complex and the scope
gets broader.
• The number of topics may become too vast to memorize. Also, the
capability to memorize varies from student to student.
• So in human learning, great memorizing and perfect recall, i.e. learning
based on knowledge input alone, is not enough.
So what is a better learning strategy?
• Students can do well in the examinations only up to a certain stage. Beyond
that, a better learning strategy needs to be adopted:
• to be able to deal with the vastness of the subject matter and the related issues in
memorizing it
• to be able to answer questions where a direct answer has not been learnt
• A good option is to figure out the key points or ideas amongst a vast pool
of knowledge.
• This helps in creating an outline of topics and a conceptual mapping of
those outlined topics with the entire knowledge pool.
• For example, a broad pool of knowledge may consist of all living animals
and their characteristics, outlined by grouping the animals into invertebrates
and vertebrates.
Classification: Animals
1. Invertebrate: Do not have backbones and skeletons
2. Vertebrate:
   1. Fishes: Always live in water and lay eggs
   2. Amphibians: Semi-aquatic, i.e. may live in water or on land; smooth skin; lay eggs
   3. Reptiles: Semi-aquatic like amphibians; scaly skin; lay eggs; cold-blooded
   4. Birds: Can fly; lay eggs; warm-blooded
   5. Mammals: Have hair or fur; have milk to feed their young; warm-blooded
Machine Learning Terminology
• The vast pool of knowledge is available from the data input.
• A concept map, like the animal group to characteristic mapping explained
above, is drawn from the input data.
• This is nothing but knowledge abstraction as performed by the machine.
• In the end, the abstracted mapping from the input data can be applied to
make critical conclusions.
• For example, if the group of an animal is given, its characteristics can be
inferred automatically. Conversely, if the characteristics of an unknown
animal are given, a conclusion can be drawn about the animal group it
belongs to. This is generalization in the context of machine learning.
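The two directions of this mapping can be sketched in a few lines of Python. This is only an illustration: the dictionary restates the vertebrate groups listed earlier, and the "most shared characteristics" rule in `guess_group` is an assumption made for the sketch, not a method given in these slides.

```python
# Abstraction: a concept map from animal group to characteristics,
# restating the vertebrate groups listed earlier in this unit.
GROUP_TRAITS = {
    "fish": {"lives in water", "lays eggs"},
    "amphibian": {"semi-aquatic", "smooth skin", "lays eggs"},
    "reptile": {"semi-aquatic", "scaly skin", "lays eggs", "cold-blooded"},
    "bird": {"can fly", "lays eggs", "warm-blooded"},
    "mammal": {"hair or fur", "feeds milk to young", "warm-blooded"},
}


def traits_of(group):
    """Group -> characteristics: a direct lookup in the concept map."""
    return GROUP_TRAITS[group]


def guess_group(observed):
    """Characteristics -> group: pick the group sharing the most traits.

    The overlap count is an assumed, simplistic matching rule.
    """
    observed = set(observed)
    return max(GROUP_TRAITS, key=lambda g: len(GROUP_TRAITS[g] & observed))


print(sorted(traits_of("bird")))
print(guess_group(["scaly skin", "lays eggs"]))
```

A real learner would build such a map from data rather than have it typed in by hand; the sketch only shows how an abstracted map is then used for generalization.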
Step 1: Model
• During the machine learning process, knowledge is fed in the form of input
data.
• Data cannot be used in the original shape and form.
• Abstraction helps in deriving a conceptual map based on the input data.
• This map, or model as it is known in the machine learning paradigm, is a
summarized representation of the knowledge in the raw data. The model may
be in any one of the following forms:
• Computational blocks like if/else rules
• Mathematical equations
• Specific data structures like trees or graphs
• Logical groupings of similar observations
Which model to choose?
• The decision related to the choice of model is taken based on multiple
aspects:
• The type of problem to be solved
• Nature of the input data
• Domain of the problem
Step 2: Training the model
• The next task is to fit the model based on the input data.
• Where the model is represented by a mathematical equation such as ‘y =
c1 + c2 x’, we have to find the values of c1 and c2 from the input data.
• Otherwise, the equation (or the model) is of no use.
• So, fitting the model, in this case, means finding the values of the unknown
coefficients or constants of the equation or the model.
• This process of fitting the model based on the input data is known as
training.
• The input data based on which the model is being finalized is known as
training data.
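As a minimal sketch of what training means for the equation ‘y = c1 + c2 x’, the following Python code finds c1 and c2 by ordinary least squares. The slides do not say which fitting method is used, so least squares is an assumption here, and the training data is hypothetical (generated from y = 1 + 2x).

```python
def fit_line(xs, ys):
    """Return (c1, c2) minimising the squared error of y = c1 + c2 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single input variable.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c2 = sxy / sxx
    c1 = mean_y - c2 * mean_x
    return c1, c2


# Hypothetical training data, generated from y = 1 + 2x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

c1, c2 = fit_line(train_x, train_y)
print(c1, c2)
```

Once c1 and c2 are known, the equation can be evaluated for any new x, which is what makes the fitted model usable.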
The two steps:
• Abstraction:
• The first part of the machine learning process is abstraction, i.e. abstracting
the knowledge that comes as input data into the form of a model.
• The abstraction process, more popularly called training the model, is just one
part of machine learning.
• Generalization:
• The other key part is to tune up the abstracted knowledge to a form which
can be used to take future decisions. This is achieved as a part of
generalization.
• This part is quite difficult to achieve. This is because the model is trained
based on a finite set of data, which may possess a limited set of
characteristics.
Problems with decision making
1. The trained model is aligned too closely with the training data and hence
may not portray the actual trend.
2. The test data possess certain characteristics that were absent from
the training data.
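Both problems can be seen in a toy comparison. The sketch below uses hypothetical data whose true trend is y = 2x, and contrasts a "model" that only memorizes the training pairs with one that captures the trend. The function names and data are illustrative assumptions.

```python
# Hypothetical data: the true trend is y = 2 * x.
train = {1: 2, 2: 4, 3: 6}
test = {4: 8, 5: 10}  # inputs never seen during training


def memorizer(x):
    """Over-aligned 'model': perfect recall of training pairs, nothing else."""
    return train.get(x)  # None for any input it has not seen


def trend_model(x):
    """Generalized model: the trend abstracted from the training pairs."""
    return 2 * x


def accuracy(model, data):
    """Fraction of records for which the model output equals the true value."""
    correct = sum(1 for x, y in data.items() if model(x) == y)
    return correct / len(data)


print(accuracy(memorizer, train), accuracy(memorizer, test))
print(accuracy(trend_model, train), accuracy(trend_model, test))
```

The memorizer is perfect on the training data and useless on the test data, much like the student who relies on recall alone.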
An approximate or heuristic approach:
• Resembles gut-feeling-based decision-making in humans, used where exact
reason-based decision-making is not possible
• Carries the risk of an incorrect decision, because certain assumptions
that are made may not be true in reality
The learning problem
• Step 1: What is the Problem?
• A number of details should be collected to understand what the problem is.
• Informal description of the problem, e.g. I need a program that will prompt
the next word as and when I type a word
• Formal description, for example:
• Task (T): Prompt the next word when I type a word.
• Experience (E): A corpus of commonly used English words and phrases.
• Performance (P): The number of correct words prompted considered as a
percentage (which in machine learning paradigm is known as learning
accuracy).
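The Task, Experience and Performance of this example can be made concrete with a small sketch. The approach below (prompt the word that most often followed the typed word in the corpus) is an assumed, deliberately simple technique, and the corpus and held-out pairs are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Experience (E): count which word follows which in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def prompt_next(follows, word):
    """Task (T): prompt the most frequent follower, or None if unseen."""
    counts = follows.get(word.lower())
    return counts.most_common(1)[0][0] if counts else None


def learning_accuracy(follows, pairs):
    """Performance (P): percentage of correctly prompted next words."""
    correct = sum(
        1 for word, expected in pairs if prompt_next(follows, word) == expected
    )
    return 100.0 * correct / len(pairs)


# Hypothetical toy corpus and held-out (typed word, actual next word) pairs.
corpus = [
    "good morning everyone",
    "good morning team",
    "thank you very much",
    "good night",
]
model = train_bigrams(corpus)
held_out = [("good", "morning"), ("thank", "you"), ("very", "well")]

print(prompt_next(model, "good"))
print(learning_accuracy(model, held_out))
```

Two of the three held-out prompts are correct here, so the learning accuracy is about 66.7 percent.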
• Step 2: Why does the problem need to be solved?
• What is the motivation for solving the problem? What requirement will it
fulfil?
• For example, does solving this problem address a long-standing business issue,
like finding potentially fraudulent transactions? Or is the purpose more trivial,
like suggesting some movies for the upcoming weekend?
• It is important to clearly understand the benefits of solving the problem.
These benefits can be articulated to sell the project.
• How will the solution to the problem be used, and what lifetime is the solution
expected to have?
• Step 3: How would I solve the problem?
• Try to explore how to solve the problem manually.
• Detail out step-by-step data collection, data preparation, and program design
to solve the problem.
• Collect all these details and update the previous sections of the problem
definition, especially the assumptions.
Summary
Step 1: What is the problem? Describe the problem informally and
formally and list assumptions and similar problems.
Step 2: Why does the problem need to be solved? List the motivation
for solving the problem, the benefits that the solution will provide and
how the solution will be used.
Step 3: How would I solve the problem? Describe how the problem
would be solved manually to flush out domain knowledge.
Types of machine learning
Supervised Learning
• For example, there may be a very high correlation between the number of
salespeople employed by a company, the number of stores they operate,
and the revenue the business generates.
• Training data: The basic input, or experience, from which the machine learns
• The machine builds a predictive model that can be used on test data to
assign a label for each record in the test data
• Prediction of future cases: Use the rule to predict the output for future
inputs
• Knowledge extraction: The rule is easy to understand
• Compression: The rule is simpler than the data it explains
• Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
Classification
• Example: Credit scoring
• Differentiating between low-risk and high-risk customers from their
income and savings
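One way such a classifier could look is an if/else rule, one of the model forms listed earlier. The sketch below is only an assumption about the shape of the rule: the two thresholds are hypothetical placeholders that would, in practice, be learned from labelled training data.

```python
# Hypothetical thresholds; in practice they would be learned from
# labelled training data rather than fixed by hand.
INCOME_THRESHOLD = 50_000
SAVINGS_THRESHOLD = 10_000


def credit_risk(income, savings):
    """Label a customer using a two-threshold if/else rule."""
    if income > INCOME_THRESHOLD and savings > SAVINGS_THRESHOLD:
        return "low-risk"
    return "high-risk"


print(credit_risk(income=80_000, savings=25_000))
print(credit_risk(income=80_000, savings=2_000))
```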
Classification: Applications
• Finding similarities, also known as pattern recognition
• Face recognition: Pose, lighting, occlusion (glasses, beard), make-up,
hair style
• Character recognition: Different handwriting styles.
• Speech recognition: Temporal dependency.
• Medical diagnosis: From symptoms to illnesses
• Biometrics: Recognition/authentication using physical and/or
behavioral characteristics: Face, iris, signature, etc
• Outlier/novelty detection
Face Recognition
[Figure: training examples of a person and test images, from the ORL dataset,
AT&T Laboratories, Cambridge UK]
Regression
Regression Applications
• Navigating a car: Angle of the steering
• Kinematics of a robot arm
• For example, there may be a very high correlation between the number of
salespeople employed by a company, the number of stores they operate, and
the revenue the business generates.
[Figure: robot arm kinematics. For a position (x, y), the two joint angles are
α1 = g1(x, y) and α2 = g2(x, y)]
Unsupervised Learning
Clustering
Associations
• Market Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y where X
and Y are products/services.
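The probability P (Y | X) can be estimated from past transactions as the number of baskets containing both X and Y divided by the number of baskets containing X. The sketch below shows this with a hypothetical list of baskets; the function name is an illustrative choice.

```python
def conditional_probability(transactions, x, y):
    """Estimate P(y | x) = (#baskets with x and y) / (#baskets with x)."""
    with_x = [basket for basket in transactions if x in basket]
    if not with_x:
        return None  # x was never bought, so the estimate is undefined
    return sum(1 for basket in with_x if y in basket) / len(with_x)


# Hypothetical purchase history, one set of products per basket.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

print(conditional_probability(transactions, "bread", "butter"))
```

Here two of the three baskets containing bread also contain butter, so the estimate of P (butter | bread) is 2/3.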
Reinforcement Learning
• A machine learns to act on its own to achieve the given goals
• Machines often learn to do tasks autonomously
• Learning a policy: A sequence of outputs
• No supervised output but delayed reward
• When a sub-task is accomplished successfully, a reward is given.
• Game playing
• Robot in a maze
• Multiple agents, partial observability, etc.
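To make "learning a policy from a delayed reward" concrete, here is a minimal sketch using tabular Q-learning. The slides do not name an algorithm, so Q-learning is a swapped-in choice, and the environment (a five-cell corridor standing in for a maze), the parameter values, and all names are assumptions for illustration.

```python
import random

N_STATES = 5  # cells 0..4 in a corridor; cell 4 is the goal
ACTIONS = ["left", "right"]


def step(state, action):
    """Move one cell; reward 1 only on reaching the goal (delayed reward)."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values by trial and error, without supervised outputs."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # explore with random actions
            nxt, reward = step(state, action)
            best_next = max(q[(nxt, a)] for a in ACTIONS)
            # Move the estimate towards reward + discounted future value.
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state = nxt
    return q


q = q_learning()
# The policy: in each non-goal cell, take the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

No cell is ever labelled with the correct move; the preference for moving right emerges only from the reward received at the goal and propagated backwards through the value estimates.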
Applications of machine learning
• Banking and finance
• Insurance
• Healthcare
Tools in machine learning
• Python
• R
• Matlab
• SAS
Problems in ML
• Data quality
• The complexity and quality trade-off
• Sampling bias in data
• Changing expectations and concept drift
• Monitoring and maintenance