
1 Unit-1

The document outlines the types of data relevant to machine learning, including qualitative (categorical) and quantitative (numeric) data, along with their subcategories. It emphasizes the importance of data exploration, quality assessment, and pre-processing steps before applying machine learning algorithms. Additionally, it discusses various learning paradigms such as supervised, unsupervised, and reinforcement learning, highlighting their applications in fields like healthcare and finance.

UNIT-I

Types of data, exploring structure of data: Exploring and Plotting
numerical data, categorical data and relationship between variables,
data quality and remediation, data pre-processing: Dimensionality
reduction and feature selection.
Types of data
Learning Objectives:

• To understand the incoming data.
• To gain a basic understanding of the nature and quality
of the data.
Recap:
• Types of machine learning:
supervised, unsupervised, and reinforcement.
• Supervised learning: learning from past data (training
data), known values (classes).
• Supervised learning - guided learning from human
inputs.
Example :
Medical Data, Dataset: Disease diagnosis using patient
records.
Features: Medical test results, symptoms, patient history.
Labels: Diagnoses (e.g., diabetic, non-diabetic).
Dataset: X-ray image classification (e.g., pneumonia
detection).
Features: X-ray images.
Labels: Presence or absence of disease.
• Unsupervised machine learning: doesn’t have labelled
data to learn from.
• It finds patterns in unlabelled data, e.g. by grouping similar items.
• This learning is not guided by labelled inputs; it relies on
the structure found in the data itself.
Example :
Customer Behavior Data
•Dataset: E-commerce user behavior.
• Features: Browsing history, clickstream data, time
spent on pages.
• Task: Group users with similar purchasing patterns.
•Dataset: Social media interactions.
• Features: Likes, shares, and network connections.
• Task: Identify communities or influencers.
• Reinforcement learning: the machine tries to
learn by itself through a penalty/reward mechanism,
much in the same way as human self-learning happens.
• Applications of machine learning span different domains
such as banking and finance, insurance, and healthcare.
• Fraud detection is a critical business case which is
implemented in almost all banks across the world and
uses machine learning predominantly.
• Risk prediction for new customers is a similar critical
case in the insurance industry which finds the
application of machine learning.
• In the healthcare sector, disease prediction makes wide
use of machine learning, especially in the developed
countries.
Points to Ponder

No man is perfect. The same is applicable for machines. To increase the level
of accuracy of a machine, human participation should be added to the
machine learning process. In short, incorporating human intervention is the
recipe for the success of machine learning.
MACHINE LEARNING ACTIVITIES

• The first step in machine learning activity starts with
data.
• In case of supervised learning, it is the labelled training
data set followed by test data which is not labelled.
• In case of unsupervised learning, there is no question
of labelled data but the task is to find patterns in the
input data.
• A thorough review and exploration of the data is
needed
 ◦ To understand the type of the data,
 ◦ The quality of the data and
 ◦ Relationship between the different data elements.
• Based on that, multiple pre-processing activities may
need to be done on the input data before we can go
ahead with core machine learning activities.
• Following are the typical preparation activities done
once the input data comes into the machine learning
system:
• Understand the type of data in the given input data set.
• Explore the data to understand the nature and quality.
• Explore the relationships amongst the data elements,
e.g. inter-feature relationship.
• Find potential issues in data.
• Do the necessary remediation, e.g. impute missing data
values, etc., if needed.
• Apply pre-processing steps, as necessary.
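The preparation activities above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the student records, the use of None to mark a missing value, and mean imputation as the remediation are all hypothetical choices, not steps prescribed by the slides.

```python
from statistics import mean

# Hypothetical student records; None marks a missing value.
records = [
    {"roll": 1, "name": "Asha", "gender": "F", "age": 14},
    {"roll": 2, "name": "Ben", "gender": "M", "age": None},
    {"roll": 3, "name": "Chitra", "gender": "F", "age": 16},
]

# Understand the type of data: collect the Python types seen per attribute.
attribute_types = {
    key: {type(r[key]).__name__ for r in records if r[key] is not None}
    for key in records[0]
}

# Find potential issues: count missing values per attribute.
missing_counts = {
    key: sum(1 for r in records if r[key] is None) for key in records[0]
}

# Remediate: impute missing ages with the mean of the known ages
# (one possible strategy among several).
known_ages = [r["age"] for r in records if r["age"] is not None]
mean_age = mean(known_ages)
cleaned = [
    {**r, "age": mean_age if r["age"] is None else r["age"]} for r in records
]

print(attribute_types)
print(missing_counts)
print(cleaned[1]["age"])
```

Whether mean imputation is appropriate depends on the attribute and on how the values went missing; the point here is only the order of the activities.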
• Once the data is prepared for modelling, then the
learning tasks start off.
• As a part of it, do the following activities:
• The input data is first divided into parts – the training
data and the test data (called holdout). This step is
applicable for supervised learning only.
• Consider different models or learning algorithms for
selection. Train the model based on the training data for
supervised learning problem and apply to unknown
data.
• Directly apply the chosen unsupervised model on the
input data for unsupervised learning problem.
• After the model is selected, trained (for supervised
learning), and applied on the input data:
 ◦ The performance of the model is evaluated.
 ◦ Based on the options available, specific actions can be
taken to improve the performance of the model, if
possible.
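The holdout step mentioned above (dividing the input data into training and test parts) can be sketched as follows. The function name holdout_split, the 25% test fraction, and the toy data are assumptions made for illustration; the slides do not fix a split ratio.

```python
import random

def holdout_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy labelled data: (feature, label) pairs, purely illustrative.
data = [(x, x % 2) for x in range(20)]
train, test = holdout_split(data, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

A model would then be trained on train only and evaluated on test, so that the evaluation reflects data the model has not seen.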
Table 2.1 contains a summary of steps and activities
involved:
2.3 BASIC TYPES OF DATA IN MACHINE LEARNING

• Before starting with types of data, let’s first understand what a
data set is and what are the elements of a data set.
• A data set is a collection of related information or records. The
information may be on some entity or some subject area.
• For example, we may have a data set on students in which each
record consists of information about a specific student.
• Again, we can have a data set on student performance which has
records providing performance, i.e. marks on the individual
subjects.
• Each row of a data set is called a record. Each data set also has
multiple attributes, each of which gives information on a specific
characteristic.
• For example, in the data set on students, there are four attributes namely Roll Number, Name,
Gender, and Age, each of which understandably is a specific characteristic about the student
entity.
• Attributes can also be termed as feature, variable, dimension or field.
• Both the data sets, Student and Student Performance, have four
features or dimensions; hence they are said to have a four-dimensional
data space.
• A row or record represents a point in the four-dimensional data space
as each row has specific values for each of the four attributes or
features.
• Value of an attribute, quite understandably, may vary from record to
record.
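As a small illustration of records, attributes, and the idea of a record as a point in a four-dimensional data space, the Student data set could be represented like this. The attribute names come from the text; the two student rows are hypothetical.

```python
from collections import namedtuple

# The four attributes (features) named in the text.
Student = namedtuple("Student", ["roll_number", "name", "gender", "age"])

# Hypothetical records; each row is one student.
students = [
    Student(1, "Asha", "F", 14),
    Student(2, "Ben", "M", 15),
]

# Number of attributes = dimensionality of the data space.
n_dimensions = len(Student._fields)

# Each record is one point with a value on every dimension.
first_point = tuple(students[0])

print(n_dimensions)
print(first_point)
```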
Now that a context of data sets is given, let’s try to
understand the different types of data that we generally come
across in machine learning problems. Data can broadly be
divided into following two types:
1. Qualitative data
2. Quantitative data

Qualitative data provides information about the quality of
an object or information which cannot be measured. For
example, if we consider the quality of performance of students
in terms of ‘Good’, ‘Average’, and ‘Poor’, it falls under the
category of qualitative data. Also, name or roll number of
students are information that cannot be measured using some
scale of measurement. So they would fall under qualitative
data. Qualitative data is also called categorical data.
Qualitative data can be further subdivided into two types as
follows:
1. Nominal data
2. Ordinal data
Nominal data is one which has no numeric value, but a
named value. It is used for assigning named values to
attributes. Nominal values cannot be quantified. Examples of
nominal data are
1. Blood group: A, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female, Other
Obviously, mathematical operations such as addition,
subtraction, multiplication, etc. cannot be performed on
nominal data. For that reason, statistical functions such as
mean, variance, etc. cannot be applied to nominal data either.
However, a basic count is possible. So mode, i.e. most
frequently occurring value, can be identified for nominal data.
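A minimal sketch of what is and is not possible with nominal data, using Python's standard library. The list of blood groups is made-up sample data.

```python
from collections import Counter
from statistics import mode

# Hypothetical sample of a nominal attribute.
blood_groups = ["A", "B", "O", "A", "AB", "O", "A"]

# A basic count is possible for nominal data...
counts = Counter(blood_groups)

# ...and therefore so is the mode (most frequently occurring value).
most_common = mode(blood_groups)

# Arithmetic on named values is not defined, so a mean cannot be computed.
try:
    sum(blood_groups) / len(blood_groups)
    arithmetic_worked = True
except TypeError:
    arithmetic_worked = False

print(counts)
print(most_common)
print(arithmetic_worked)
```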
Ordinal data, in addition to possessing the properties of
nominal data, can also be naturally ordered. This means
ordinal data also assigns named values to attributes but unlike
nominal data, they can be arranged in a sequence of increasing
or decreasing value so that we can say whether a value is
better than or greater than another value. Examples of ordinal
data are
1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
2. Grades: A, B, C, etc.
3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.

Like nominal data, basic counting is possible for ordinal
data. Hence, the mode can be identified. Since ordering is
possible in case of ordinal data, the median and quartiles can be
identified in addition. The mean still cannot be calculated.
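The median of ordinal data can be found by mapping each label to its position in the natural order, as in the sketch below. The survey responses are hypothetical, and median_low is used here (an illustrative choice) because it always returns a value that actually occurs in the data rather than averaging two ranks.

```python
from statistics import median_low, mode

# Labels listed from lowest to highest; this ordering is what makes the data ordinal.
levels = ["Unhappy", "Happy", "Very Happy"]
rank = {label: i for i, label in enumerate(levels)}

# Hypothetical customer satisfaction responses.
responses = ["Happy", "Very Happy", "Unhappy", "Happy",
             "Very Happy", "Happy", "Unhappy"]

# Median: the middle value once the responses are ordered by rank.
ranks = sorted(rank[r] for r in responses)
median_label = levels[median_low(ranks)]

# Mode: still available, exactly as for nominal data.
modal_label = mode(responses)

print(median_label)
print(modal_label)
```

The ranks only encode order. The gaps between them carry no meaning, which is why averaging them to get a mean is not justified.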
Quantitative data relates to information about the quantity
of an object – hence it can be measured. For example, if we
consider the attribute ‘marks’, it can be measured using a scale
of measurement. Quantitative data is also termed as numeric
data. There are two types of quantitative data:
1. Interval data
2. Ratio data
Interval data is numeric data for which not only the order
is known, but the exact difference between values is also
known. An ideal example of interval data is Celsius
temperature. The difference between each value remains the
same in Celsius temperature. For example, the difference
between 12°C and 18°C is measurable and is 6°C, as is the
difference between 15.5°C and 21.5°C. Other
examples include date, time, etc.
For interval data, mathematical operations such as addition
and subtraction are possible. For that reason, for interval data,
the central tendency can be measured by mean, median, or
mode. Standard deviation can also be calculated.
However, interval data do not have a ‘true zero’ value. For
example, 0°C is just another point on the scale; it does not
mean ‘no temperature’. Hence, only addition and
subtraction apply to interval data; ratios are not
meaningful. This means we can say that a temperature of 40°C is
20°C higher than a temperature of 20°C.
However, we cannot say that a temperature of 40°C is
twice as hot as a temperature of 20°C.
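A short numeric check makes the point about ratios concrete. The conversion to Kelvin is brought in here only as an illustration (it is not part of the slides): Kelvin has a true zero, so ratios on that scale are meaningful and can be compared with the naive Celsius ratio.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a temperature scale with a true zero."""
    return celsius + 273.15

# Differences are meaningful on an interval scale.
diff_a = 18 - 12        # degrees Celsius
diff_b = 21.5 - 15.5    # degrees Celsius

# The naive Celsius ratio suggests "twice as hot"...
naive_ratio = 40 / 20

# ...but on a scale with a true zero the ratio is only about 1.07.
kelvin_ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)

print(diff_a, diff_b)
print(naive_ratio)
print(round(kelvin_ratio, 3))
```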
Ratio data represents numeric data for which exact value
can be measured. Absolute zero is available for ratio data.
Also, these variables can be added, subtracted, multiplied, or
divided. The central tendency can be measured by mean,
median, or mode, and dispersion by measures such as standard
deviation. Examples of ratio data include height, weight, age,
salary, etc.
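For ratio data all of the above statistics apply, and ratios between values are meaningful. A minimal sketch, assuming a made-up list of weights in kilograms and using the population standard deviation:

```python
from statistics import mean, median, pstdev

# Hypothetical weights in kg; 0 kg genuinely means "no weight" (a true zero).
weights_kg = [40.0, 50.0, 60.0, 80.0]

average = mean(weights_kg)
middle = median(weights_kg)
spread = pstdev(weights_kg)   # population standard deviation

# A ratio is meaningful: 80 kg is twice as heavy as 40 kg.
ratio = weights_kg[3] / weights_kg[0]

print(average, middle)
print(round(spread, 2))
print(ratio)
```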
Figure 2.4 gives a summarized view of different types of
data that we may find in a typical machine learning problem.
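The rules stated in this section can also be collected into a small lookup. This is a sketch that encodes only the statistics named in the text for each data type; the names PERMITTED_STATISTICS and is_permitted are hypothetical, and the table is not a reproduction of Figure 2.4.

```python
# Statistics the text names as applicable to each type of data.
PERMITTED_STATISTICS = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median", "quartiles"},
    "interval": {"mode", "median", "mean", "standard deviation"},
    "ratio": {"mode", "median", "mean", "standard deviation"},
}

def is_permitted(data_type, statistic):
    """Return True if the text lists the statistic for this data type."""
    return statistic in PERMITTED_STATISTICS[data_type]

print(is_permitted("nominal", "mean"))
print(is_permitted("ordinal", "median"))
print(is_permitted("ratio", "mean"))
```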
Apart from the approach detailed above, attributes can also
be categorized into types based on the number of values that
can be assigned. The attributes can be either discrete or
continuous based on this factor.
Discrete attributes can assume a finite or countably infinite
number of values. Nominal attributes such as roll number,
street number, pin code, etc. can have a finite number of
values whereas numeric attributes such as count, rank of
students, etc. can have countably infinite values. A special
type of discrete attribute which can assume two values only is
called binary attribute. Examples of binary attribute include
male/ female, positive/negative, yes/no, etc.
Continuous attributes can assume any possible value which
is a real number. Examples of continuous attribute include
length, height, weight, price, etc.
Note:

In general, nominal and ordinal attributes are discrete. On
the other hand, interval and ratio attributes are continuous,
barring a few exceptions, e.g. ‘count’ attribute.
