Module 1
BCSML601
3:0:2
Prerequisites
• Linear Equations
• Probability and Statistics
• Graphs of Functions
Course Learning Objectives (CLO)
• This course will enable students to:
• Define machine learning and understand the basic theory underlying
machine learning.
• Differentiate between supervised, unsupervised and reinforcement learning.
• Understand the basic concepts of learning and decision trees.
• Understand neural networks and Bayesian techniques for problems that appear
in machine learning.
• Understand instance-based learning and statistical analysis of machine
learning techniques.
MODULE 1
• Introduction to Machine Learning
• Introduction
• What is Human Learning?
• Types of Human Learning
• What is Machine Learning?
• Types of Machine Learning
• Problems not to be solved using Machine Learning
• Applications of Machine Learning
• Issues in Machine Learning
• Prepare to Model
• Introduction
• Machine Learning activities
• Basic types of Data in Machine Learning
Evolution
Human Learning
• Data
• Information
• Knowledge
• Wisdom
• where ‘x’ is the predictor variable and ‘y’ is the target variable.
• Typical applications of regression can be seen in
• Demand forecasting in retail
• Sales prediction for managers
• Price prediction in real estate
• Weather forecasting
• Skill demand forecasting in the job market
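As a rough illustration of regression with one predictor variable, the sketch below fits a straight line y ≈ a + b·x by ordinary least squares and uses it to predict a new value. The straight-line form, the function name `fit_line`, and the toy advertising figures are assumptions made for this example; they are not taken from the course material.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: advertising spend (predictor x) and units sold (target y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [12.0, 14.0, 16.0, 18.0, 20.0]

a, b = fit_line(spend, units)
predicted = a + b * 6.0  # predict the target for an unseen predictor value
print(a, b, predicted)
```

Real demand or price data is noisy and usually needs more than one predictor, so this only shows the basic idea of learning a mapping from x to y.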
Unsupervised learning
• In unsupervised learning, there is no labelled training data to learn from and no
prediction to be made.
• In unsupervised learning, the objective is to take a dataset as input and try to
find natural groupings or patterns within the data elements or records.
• Therefore, unsupervised learning is often termed a descriptive model.
• The process of unsupervised learning is referred to as pattern discovery or
knowledge discovery.
• One critical application of unsupervised learning is customer segmentation.
• Clustering is the main type of unsupervised learning; it aims to group or
organize similar objects together.
• Association analysis is another type of unsupervised learning, in which the
association between data elements is identified.
• Objects belonging to the same cluster are similar to each other while objects
belonging to different clusters are dissimilar.
• Hence, the objective of clustering is to discover the intrinsic grouping of
unlabelled data and form clusters, as depicted in Figure 1.
• Different measures of similarity can be applied for clustering, of which the most
common is distance.
• Two data items are considered part of the same cluster if the distance
between them is small.
• In the same way, if the distance between the data items is large, the items do not
generally belong to the same cluster.
• This is also known as distance-based clustering.
• Figure 2 depicts the process of clustering at a high level.
Figure 2
Figure 1
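Distance-based clustering can be sketched with a small k-means style loop: assign each point to its nearest centre, move each centre to the mean of its members, and repeat. The choice of k-means, Euclidean distance, the function names, and the toy points are assumptions for illustration; the slides do not name a specific algorithm.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centroids, iterations=10):
    """Assign points to the nearest centroid, re-centre, and repeat."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # A small distance to a centroid means membership of that cluster.
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (unchanged if empty).
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else c
            for members, c in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical unlabelled 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

The starting centroids are picked by hand here; in practice they are usually chosen at random, and the result can depend on that choice.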
Reinforcement learning
• We have seen babies learn to walk without any prior knowledge of how to do it.
• Machines often learn to do tasks autonomously.
• Let’s try to understand this in the context of the example of a child learning to walk.
• The action being attempted is walking, the child is the agent, and the place
with hurdles on which the child is trying to walk is the environment.
• The agent tries to improve its performance of the task.
• When a sub-task is accomplished successfully, a reward is given. When a
sub-task is not executed correctly, no reward is given.
• This continues until the machine is able to complete execution of the whole task.
• This process of learning is known as reinforcement learning.
• The figure captures the high-level process of reinforcement learning.
• Key Concepts of Reinforcement Learning
• Environment: Everything the agent interacts with.
• State: A specific situation in which the agent finds itself.
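The agent, environment, state and reward loop can be sketched with tabular Q-learning on a tiny corridor, where the agent is rewarded only when it reaches the last cell. Q-learning, the corridor layout, and the values of `alpha` and `gamma` are assumptions chosen for this illustration; the slides describe reinforcement learning only at a high level.

```python
import random

N_STATES = 5            # states 0..4 in a line; state 4 is the goal
ACTIONS = (-1, +1)      # step left or step right
GOAL = N_STATES - 1


def step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values by trial and error, exploring with random actions."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            a = rng.randrange(len(ACTIONS))
            next_state, reward = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q = train()
# Greedy policy: in each non-goal state, pick the action with the higher value.
policy = [ACTIONS[row.index(max(row))] for row in q[:GOAL]]
print(policy)
```

After training, the greedy policy steps toward the goal from every state, which mirrors the idea that rewarded sub-tasks gradually shape the behaviour of the agent.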
Problems not to be solved using Machine Learning
• Machine learning should not be applied to tasks in which humans are very effective or
in which frequent human intervention is needed.
• For example, air traffic control is a very complex task needing intense human involvement.
• Simple rule-driven or formula-based applications, such as a price calculator engine or a
dispute tracking application, do not need machine learning techniques.
• Machine learning should be used only when the business process has some lapses.
• If the task is already optimized, incorporating machine learning will not justify the
return on investment.
• For situations where training data is not sufficient, machine learning cannot be used
effectively.
• This is because, with small training data sets, the impact of bad data is disproportionately large.
• For the quality of prediction or recommendation to be good, the training data should be
sizeable.
Applications
• Banking and Finance – fraudulent transactions are spotted and prevented right
at the time of occurrence
• Insurance – Two major areas in the insurance industry where machine learning
is used are risk prediction during new customer onboarding and claims
management.
• Health Care – Wearable device data forms a rich source for applying machine
learning to predict the health condition of a person in real time.
Issues of Machine Learning
• The biggest fear and issue arising out of machine learning is related to privacy and the breach
of it, as well as adverse reactions.
• The primary focus of learning is on analyzing data, both past and current, and coming up
with insight from the data. This data may be related to people, and the facts revealed might be
private enough to be kept confidential.
• Also, different people have different preferences when it comes to sharing information.
• While some people may be open to sharing some level of information publicly, other
people may not want to share it even with all friends and may keep it restricted just to family
members.
• When machine learning algorithms are applied to such information, people may
inadvertently be upset.
• A very critical consideration before applying machine learning is that proper human
judgement should be exercised before using any outcome from machine learning.
• Only then will the decision taken be beneficial and not result in any adverse impact.
Machine Learning Activities
• Machine learning activity starts with data.
• In the case of supervised learning, it is the labelled training data set followed by test data which
is not labelled.
• In the case of unsupervised learning, there is no question of labelled data; the task is to find
patterns in the input data.
• A thorough review and exploration of the data is needed to understand the type of the data,
the quality of the data and relationship between the different data elements.
• Based on that, multiple pre-processing activities may need to be done on the input data
before going ahead with core machine learning activities.
• Following are the typical preparation activities done once the input data comes
into the machine learning system:
• Understand the type of data in the given input data set.
• Explore the data to understand the nature and quality.
• Explore the relationships amongst the data elements, e.g. inter-feature relationship.
• Find potential issues in data.
• Do the necessary remediation, e.g. impute missing data values, etc., if needed.
• Apply pre-processing steps, as necessary.
• Once the data is prepared for modelling, the learning tasks start. As a part of
these tasks, the following activities are done:
• The input data is first divided into two parts: the training data and the test data (called
holdout). This is applicable to supervised learning only.
• Consider different models or learning algorithms for selection.
• For a supervised learning problem, train the model on the training data and apply it
to unknown data.
• For an unsupervised learning problem, directly apply the chosen unsupervised model
to the input data.
• After the model is selected, trained (for supervised learning), and applied on
input data, the performance of the model is evaluated.
• Based on options available, specific actions can be taken to improve the
performance of the model, if possible.
• The figure depicts the four-step process of machine learning.
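The preparation and learning activities described above can be sketched end to end on a toy labelled data set: fill in missing values, hold out part of the data for testing, fit a simple model, and measure its accuracy. The records, the mean-imputation strategy, the 70/30 split, and the 1-nearest-neighbour model are all assumptions picked to keep the example short; none of them is prescribed by the slides.

```python
import random

# Hypothetical labelled records: (hours_studied, attendance_pct, label).
# None marks a missing value that needs remediation.
records = [
    (2.0, 60.0, "fail"), (3.0, None, "fail"), (2.5, 55.0, "fail"),
    (1.0, 50.0, "fail"), (3.5, 65.0, "fail"), (8.0, 90.0, "pass"),
    (7.5, None, "pass"), (9.0, 95.0, "pass"), (8.5, 85.0, "pass"),
    (7.0, 88.0, "pass"),
]


def impute_mean(rows, column):
    """Replace missing values in one column with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    mean = sum(known) / len(known)
    return [
        tuple(mean if (i == column and v is None) else v for i, v in enumerate(r))
        for r in rows
    ]


def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then split into (training, test) data."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def predict_1nn(train_rows, features):
    """Predict the label of the closest training record (1-nearest neighbour)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[:-1], features))
    return min(train_rows, key=sq_dist)[-1]


prepared = impute_mean(records, column=1)                  # prepare the data
train, test = holdout_split(prepared)                      # training data and holdout
predictions = [predict_1nn(train, r[:-1]) for r in test]   # apply the model
accuracy = sum(p == r[-1] for p, r in zip(predictions, test)) / len(test)
print(f"holdout accuracy: {accuracy:.2f}")                 # evaluate performance
```

If the measured accuracy is poor, the improvement step could involve scaling the features, trying a different model, or collecting more data.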
BASIC TYPES OF DATA IN MACHINE LEARNING
• A data set is a collection of related information or records.
• The information may be on some entity or some subject area.
• Each row of a data set is called a record.
• Each data set also has multiple attributes, each of which gives information on a specific
characteristic.
• For example, in the data set on students, there are four attributes namely Roll Number, Name,
Gender, and Age, each of which is a specific characteristic about the student entity.
• Attributes can also be termed as feature, variable, dimension or field.
• Both the data sets, Student and Student Performance, have four features or dimensions;
hence they are said to have a four-dimensional data space.
• A row or record represents a point in the four-dimensional data space as each row has specific
values for each of the four attributes or features.
• The value of an attribute may vary from record to record.
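A data set of this kind can be represented as a list of records, each holding one value per attribute. The attribute names below follow the student example in the text; the record values are invented for illustration.

```python
# Attribute (feature) names from the student example.
ATTRIBUTES = ("Roll Number", "Name", "Gender", "Age")

# Hypothetical records: each row is one point in the four-dimensional data space.
students = [
    (101, "Asha", "F", 14),
    (102, "Ravi", "M", 15),
    (103, "Meera", "F", 14),
]

n_records = len(students)        # number of rows
n_dimensions = len(ATTRIBUTES)   # number of attributes per row

# Pair attribute names with values to read one attribute across all records.
ages = [dict(zip(ATTRIBUTES, row))["Age"] for row in students]
print(n_records, n_dimensions, ages)
```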
Data types
• Data can be broadly divided into the following two types:
1. Qualitative data
2. Quantitative data
Qualitative Data
• Qualitative data provides information about the quality of an object or information which
cannot be measured.
• For example, the quality of performance of students in terms of ‘Good’, ‘Average’, and
‘Poor’. Also, the name or roll number of students cannot be measured using any scale of
measurement.
• Qualitative data is also called categorical data.
• Qualitative data can be further subdivided into two types as follows:
1. Nominal data
2. Ordinal data
Nominal Data
• Nominal data has no numeric value, only a named value.
• It is used for assigning named values to attributes.
• Nominal values cannot be quantified.
• Examples of nominal data are
• Blood group: A, B, O, AB, etc.
• Nationality: Indian, American, British, etc.
• Gender: Male, Female, Other
• A special case of nominal data is when only two labels are possible, e.g.
pass/fail as a result of an examination.
• This sub-type of nominal data is called ‘dichotomous’.
• Mathematical operations and most statistical functions cannot be performed on
nominal data; only a basic count is possible.
Ordinal Data
• Ordinal data has the properties of nominal data and, in addition, can be naturally ordered.
• Ordinal data assigns named values to attributes, but these values can be arranged in a
sequence of increasing or decreasing value, so that one value can be said to be
better than or greater than another value.
• Examples of ordinal data are
• Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
• Grades: A, B, C, etc.
• Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
• Basic counting is possible for ordinal data; hence, the mode can be identified.
• Since ordering is possible in ordinal data, the median and quartiles can also be
identified.
• The mean still cannot be calculated.
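The difference between what can be computed for nominal and ordinal data can be sketched with Python's standard `statistics` module. The sample values and the explicit ordering list `SATISFACTION_ORDER` are assumptions made for this example.

```python
from statistics import median_low, mode

# Nominal data: values can only be counted, so the mode is available.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]
most_common_group = mode(blood_groups)

# Ordinal data: values have a natural order, listed here from lowest to highest.
SATISFACTION_ORDER = ["Unhappy", "Happy", "Very Happy"]
responses = ["Happy", "Very Happy", "Unhappy", "Happy", "Very Happy"]

# Map each response to its position in the order, take the middle position,
# and map it back to the named value.
ranks = [SATISFACTION_ORDER.index(r) for r in responses]
median_response = SATISFACTION_ORDER[median_low(ranks)]

print(most_common_group, median_response)
```

The ranks are positions only, not measured quantities, so averaging them would not give a meaningful mean.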
Quantitative Data
• Quantitative data relates to information about the quantity of an object – hence
it can be measured.
• For example, the attribute ‘marks’ can be measured using a scale of
measurement.
• Quantitative data is also termed as numeric data.
• There are two types of quantitative data:
1. Interval data
2. Ratio data
Interval Data
• Interval data is numeric data for which the order is known, and the exact
difference between values is also known.
• A typical example of interval data is temperature in Celsius, where the difference
between consecutive values remains the same.
• For example, the difference between 12°C and 18°C is measurable and
is 6°C, as is the difference between 15.5°C and 21.5°C.
• Other examples include date, time, etc.
• For interval data, mathematical operations are possible.
• Hence, for interval data, the central tendency can be measured by mean, median, or
mode, and the standard deviation can also be calculated.
• However, interval data do not have something called a ‘true zero’ value.
• For example, 0°C does not mean ‘no temperature’; there is nothing called ‘zero or null
temperature’.
• Hence, only addition and subtraction apply for interval data.
• The ratio cannot be applied. This means we can say that a temperature of 40°C is
20°C higher than a temperature of 20°C.
• However, we cannot say that a temperature of 40°C is twice as hot as a
temperature of 20°C.
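The point about ratios can be checked numerically by expressing the same two temperatures on a second scale. The Kelvin scale is brought in here only as an illustration (it is not mentioned in the slides): it shifts Celsius by 273.15, so differences are preserved while ratios change.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin (a shift of 273.15)."""
    return c + 273.15


low_c, high_c = 20.0, 40.0

# Differences survive the change of scale.
diff_c = high_c - low_c
diff_k = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)

# Ratios do not: "twice as hot" depends on where zero happens to sit.
ratio_c = high_c / low_c
ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```

Because the ratio depends on the arbitrary zero point, statements such as "twice as hot" are not meaningful for interval data.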
Ratio Data
• Ratio data represents numeric data for which exact value can be measured.
• Absolute zero is available for ratio data.
• Also, these variables can be added, subtracted, multiplied, or divided.
• The central tendency can be measured by mean, median, or mode, and dispersion
can be measured by methods such as standard deviation.
• Examples of ratio data include height, weight, age, salary, etc.
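The statistics available for each of the four data types can be collected in a small lookup table. The dictionary below is one reading of the preceding sections, not an authoritative rule set: the mode for nominal data is inferred from the fact that counting is possible, and the median and quartiles for interval and ratio data are inferred from the fact that their values are ordered. The names `PERMITTED_STATISTICS` and `is_permitted` are hypothetical.

```python
# Which summary statistics make sense for which data type.
PERMITTED_STATISTICS = {
    "nominal": {"count", "mode"},
    "ordinal": {"count", "mode", "median", "quartiles"},
    "interval": {"count", "mode", "median", "quartiles",
                 "mean", "standard deviation"},
    "ratio": {"count", "mode", "median", "quartiles",
              "mean", "standard deviation"},
}

# Only ratio data has a true zero, so only there do ratios make sense.
SUPPORTS_RATIOS = {"ratio"}


def is_permitted(data_type, statistic):
    """Return True if the statistic is meaningful for the given data type."""
    return statistic in PERMITTED_STATISTICS[data_type]


print(is_permitted("ordinal", "median"))   # ordering exists, so the median is fine
print(is_permitted("ordinal", "mean"))     # no measurable differences, so no mean
print("interval" in SUPPORTS_RATIOS)       # no true zero, so no ratios
```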
Summary of Types of Data
• Attributes can be either discrete or continuous, based on the number of
values that can be assigned.
• Discrete attributes can assume a finite or countably infinite number of values.
• Nominal attributes such as roll number, street number, pin code, etc. can have a
finite number of values whereas numeric attributes such as count, rank of
students, etc. can have countably infinite values.
• A special type of discrete attribute which can assume only two values is
called a binary attribute.
• Examples of binary attributes include male/female, positive/negative, yes/no, etc.
• Continuous attributes can assume any possible value which is a real number.
• Examples of continuous attributes include length, height, weight, price, etc.
• In general, nominal and ordinal attributes are discrete, and interval and ratio
attributes are continuous.