Machine Learning
UNIT 2: Preparing to Model
Daxa Patel
Assistant Professor
COMPUTER SCIENCE & ENGINEERING
Outline
■ Machine Learning activities
■ Types of data in Machine Learning
■ Structures of data
■ Data quality and remediation
■ Data Pre-Processing:
◻ Dimensionality reduction
◻ Feature subset selection.
Qualitative and Quantitative Data
■ Qualitative data relates to information about the quality of an object, or information which cannot be measured.
■ Quantitative data provides information about the quantity of an object – hence it can be measured.
Types of Qualitative Data
■ Ordinal data, in addition to possessing the properties of nominal data, can
also be naturally ordered.
■ They can be arranged in a sequence of increasing or decreasing value so
that we can say whether a value is better than or greater than another value.
◻ Examples of ordinal data are
1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
2. Grades: A, B, C, etc.
3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
Types of Quantitative Data
■ Interval data is numeric data for which not only the order is known, but the
exact difference between values is also known.
■ An ideal example of interval data is Celsius temperature.
■ Interval data do not have something called a ‘true zero’ value.
◻ For example, there is nothing called ‘0 temperature’ or ‘no temperature’.
■ However, mathematical operations such as addition and subtraction are possible on interval data. For that reason, the central tendency of interval data can be measured by mean, median, or mode. Standard deviation can also be calculated.
■ Ratio data represents numeric data for which the exact value can be measured. An absolute zero is available for ratio data.
◻ Examples of ratio data include height, weight, age, salary, etc.
Attribute types based on the number of values assigned
■ The attributes can be either discrete or continuous
■ Discrete attributes can assume a finite or countably infinite number of values.
■ Nominal attributes such as roll number, street number, pin code, etc. can have a finite number of values, whereas numeric attributes such as count, rank of students, etc. can have countably infinite values.
■ A special type of discrete attribute which can assume only two values is called a binary attribute.
■ Examples of binary attributes include male/female, positive/negative, yes/no, etc.
Attribute types based on the number of values assigned
■ Continuous attributes can assume any possible value which is a real number.
■ Examples of continuous attribute include length, height, weight, price, etc.
■ In general, nominal and ordinal attributes are discrete.
■ Interval and ratio attributes are continuous, barring a few exceptions, e.g. the ‘count’ attribute.
Exploring Structure of Data
■ In a data set, we need to understand which of the attributes are numeric and which are categorical in nature.
■ This is because the approach for exploring numeric data is different from the approach for exploring categorical data.
■ Consider the Auto MPG data set: all attributes except one are numeric and hence continuous in nature.
■ The only remaining attribute, ‘car name’, is of type categorical, or more specifically nominal.
■ This data set is about the prediction of fuel consumption in miles per gallon, i.e. the numeric attribute ‘mpg’ is the target attribute.
■ With this understanding of the data set attributes, we can start exploring the numeric and categorical attributes separately.
Exploring numerical data - Mean vs Median
■ For attributes such as ‘mpg’, ‘weight’, ‘acceleration’, and ‘model.year’, the deviation between mean and median is not significant, which means the chance of these attributes having too many outlier values is low.
■ However, the deviation is significant for the attributes ‘cylinders’, ‘displacement’, and ‘origin’. So we need to drill down further and look at some more statistics for these attributes.
■ Also, there is some problem in the values of the attribute ‘horsepower’, because of which the mean and median calculation is not possible. For that reason, the attribute ‘horsepower’ is not treated as numeric.
■ So we first have to remediate the missing values of the attribute ‘horsepower’ before being able to do any kind of exploration of it.
Understanding data spread
■ To drill down more, we need to look at the entire range of values of the attributes, though not at the level of individual data elements, as that may be too vast to review manually.
■ So we will take a granular view of the data spread in the form of
1. Dispersion of data
2. Position of the different data values
■ Consider the data values of two attributes:
◻ Attribute 1 values : 44, 46, 48, 45, and 47
◻ Attribute 2 values : 34, 46, 59, 39, and 52
Understanding data spread
■ To measure the extent of dispersion of a data, or to find out how much the different values of a data are spread out, the variance of the data is measured.
■ The variance of a data is measured using the formula given below:
◻ variance(x) = (1/n) × Σ (xᵢ − x̄)², where x̄ is the mean of the n data values (this is the population form; the standard deviation is the square root of the variance).
◻ For the two attributes above, both with mean 46: variance of attribute 1 = 10/5 = 2, while variance of attribute 2 = 398/5 = 79.6.
Understanding data spread
■ So it is quite clear from this measure that attribute 1 values are concentrated around the mean, while attribute 2 values are extremely spread out.
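As a quick check, here is a minimal Python sketch (standard library only) that reproduces these dispersion numbers:

import statistics

attr_1 = [44, 46, 48, 45, 47]
attr_2 = [34, 46, 59, 39, 52]

# Population variance: mean squared deviation from the mean (46 for both attributes)
print(statistics.pvariance(attr_1))  # 2.0  -> values tightly packed around the mean
print(statistics.pvariance(attr_2))  # 79.6 -> values widely spread around the same mean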
Measuring Data Value Position
■ Quantiles refer to specific points in a data set which divide the data set into equal parts or equally sized quantities.
■ There are specific variants of quantile, the one dividing the data set into four parts being termed as quartile.
■ Another such popular variant is percentile, which divides the data set into 100 parts.
■ However, these summary measures still cannot tell us whether there is any outlier present in the data.
■ For that, we can better adopt some means to visualize the data.
■ Box plot is an excellent visualization medium for numeric data.
■ Histogram is another plot which helps in effective visualization
Box plots (Box & whisker plot)
■ The box and whisker plot is a visual representation of the five number summary.
■ The five number summary consists of:
◻ The minimum value in the data set – Min
◻ The 1st quartile – Q1
◻ The median (2nd quartile) – Q2
◻ The 3rd quartile – Q3
◻ The maximum value in the data set – Max
Box plots (Box & whisker plot)
■ IQR (Inter-Quartile Range) = Q3 − Q1
■ Quartiles divide the data into four equal parts.
■ The lower whisker can extend at most down to (Q1 − 1.5 × IQR); its actual length depends on the lowest data value that falls within that limit.
■ Likewise, the actual length of the upper whisker depends on the highest data value that falls within (Q3 + 1.5 × IQR).
■ Data values lying beyond the lower or upper whisker are of unusually low or high value respectively. These are the outliers, which may deserve special consideration.
■ To check whether the minimum and maximum are outliers: any value in the data set outside the range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] is an outlier.
Box plots (Box & whisker plot)
■ Step 1 – Find the median. Remember, the median is the middle value in a data set.
18, 27, 34, 52, 54, 59, 61, (68), 78, 82, 85, 87, 91, 93, 100
■ Step 2 – Find the lower quartile. The lower quartile is the median of the data set to the left of 68.
(18, 27, 34, 52, 54, 59, 61), 68, 78, 82, 85, 87, 91, 93, 100
Box plots (Box & whisker plot)
■ Step 3 – Find the upper quartile.
■ The upper quartile is the median of the data set to the right of 68.
18, 27, 34, 52, 54, 59, 61, 68, (78, 82, 85, 87, 91, 93, 100)
Box plots (Box & whisker plot)
■ Step 4 – Note the minimum and maximum values: Min = 18, Max = 100.
■ Step 5 – Find the inter-quartile range (IQR).
■ The inter-quartile range (IQR) is the difference between the upper and lower quartiles.
◻ Upper Quartile = 87
◻ Lower Quartile = 52
◻ 87 – 52 = 35
◻ 35 = IQR
■ Organize the 5 number summary:
◻ Min – 18
◻ Q1 (lower quartile) – 52
◻ Q2 (median) – 68
◻ Q3 (upper quartile) – 87
◻ Max – 100
Box plots (Box & whisker plot)
■ Find the outlier range using [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR]:
◻ [52 − 1.5 × 35, 87 + 1.5 × 35] = [−0.5, 139.5]
◻ No element of the data set has a value outside this range, so there are no outliers.
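The whole procedure can be sketched in Python as below, using the median-of-halves convention followed on these slides (library routines such as numpy's percentile use a different interpolation by default, so they may return slightly different quartiles):

def five_number_summary(values):
    xs = sorted(values)
    n = len(xs)
    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2
    q1 = median(xs[:n // 2])        # lower half (median excluded when n is odd)
    q3 = median(xs[(n + 1) // 2:])  # upper half (median excluded when n is odd)
    return xs[0], q1, median(xs), q3, xs[-1]

data = [18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100]
mn, q1, q2, q3, mx = five_number_summary(data)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print((mn, q1, q2, q3, mx))                      # (18, 52, 68, 87, 100)
print([x for x in data if x < low or x > high])  # [] -> no outliers in [-0.5, 139.5]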
Box plots (Box & whisker plot)
■ Graphing The Data
■ Notice, the Box includes the lower quartile, median, and upper quartile.
■ The Whiskers extend from the Box to the max and min.
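A minimal sketch of how such a plot can be drawn, assuming matplotlib is available (matplotlib computes quartiles by linear interpolation, so the box edges may differ slightly from the hand-computed values):

import matplotlib.pyplot as plt

data = [18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100]

# whis=1.5 lets each whisker run to the farthest point within 1.5 x IQR of the box;
# anything beyond that would be drawn as an individual outlier marker
plt.boxplot(data, whis=1.5)
plt.ylabel('Value')
plt.title('Box and whisker plot')
plt.show()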
Practice
■ Use the following set of data to create the 5 number summary.
3, 7, 11, 11, 15, 21, 23, 39, 41, 45, 50, 61, 87, 99, 220
Histogram
■ Histogram is another plot which helps in effective visualization of numeric
attributes.
■ It helps in understanding the distribution of numeric data over a series of intervals, also termed as ‘bins’ (a small binning sketch follows the comparison list below).
■ The important difference between histogram and box plot is
◻ The focus of histogram is to plot ranges of data values (acting as ‘bins’), the number of
data elements in each range will depend on the data distribution. Based on that, the size
of each bar corresponding to the different ranges will vary.
◻ The focus of box plot is to divide the data elements in a data set into four equal portions,
such that each portion contains an equal number of data elements.
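A small binning sketch with numpy; the ages here are hypothetical, purely to illustrate how data elements are counted into bins:

import numpy as np

ages = [1, 29, 3, 19, 27, 22, 5, 51, 63, 58, 26, 9, 25, 42, 18, 6, 16, 4, 45]

# Bins of width 10: [0, 10), [10, 20), ..., [60, 70]
counts, edges = np.histogram(ages, bins=range(0, 71, 10))
print(counts)  # [6 3 5 0 2 2 1] -> frequency of ages falling in each bin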
General Histogram Shapes
[Figure: typical histogram shapes]
Histogram Example
■ Ages grouped into buckets of width 10, with the frequency of each bucket:
◻ 0-9: 6
◻ 10-19: 3
◻ 20-29: 5
◻ 30-39: 1
◻ 40-49: 2
◻ 50-59: 2
◻ 60-69: 1
Box Plot & Histogram
[Figure: box plot and histogram compared]
Exploring Categorical Data
■ The mode of a data set is the data value which appears most often.
■ In the context of a categorical attribute, it is the category which has the highest number of data values.
■ Since mean and median cannot be applied to categorical variables, mode is the sole measure of central tendency.
■ An attribute may have one or more modes.
■ A frequency distribution having a single mode is called ‘unimodal’, one with two modes is called ‘bimodal’, and one with multiple modes is called ‘multimodal’.
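A minimal pandas sketch, using a hypothetical categorical attribute, showing how the category frequencies and the mode(s) can be inspected:

import pandas as pd

origin = pd.Series(['USA', 'Europe', 'Asia', 'USA', 'USA', 'Asia'])

print(origin.value_counts())  # frequency of each category, highest first
print(origin.mode())          # the mode(s); more than one value => multimodal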
Exploring relationship between variables
■ Till now we have been exploring single attributes in isolation.
■ One more important angle of data exploration is to explore the relationship between attributes.
■ There are multiple plots that enable us to explore the relationship between variables.
■ The basic and most commonly used plot is scatter plot.
Scatter plot
■ A scatter plot helps in visualizing bivariate relationships, i.e. relationship
between two variables.
■ It is a two-dimensional plot in which points or dots are drawn on coordinates
provided by values of the attributes.
■ For example,
◻ in a data set there are two attributes – attr_1 and attr_2.
◻ We want to understand the relationship between the two attributes, i.e. with a change in the value of one attribute, say attr_1, how does the value of the other attribute, say attr_2, change.
■ As in a two-dimensional plot, attr_1 is said to be the independent variable and attr_2 the dependent variable.
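A minimal matplotlib sketch with hypothetical values for the two attributes:

import matplotlib.pyplot as plt

attr_1 = [1, 2, 3, 4, 5, 6]                # independent variable (x-axis)
attr_2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]  # dependent variable (y-axis)

plt.scatter(attr_1, attr_2)
plt.xlabel('attr_1')
plt.ylabel('attr_2')
plt.show()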
Scatter plot
■ For example, there is one data element which has an mpg of 37 for a displacement of 250.
■ This record is completely different from other data elements having a similar displacement value but an mpg value in the range of 15 to 25.
■ This gives an indication of the presence of outlier data values.
■ As you can see, in most of the cases, there is a significant relationship between the attribute pairs. However, in some cases, e.g. between the attributes ‘weight’ and ‘acceleration’, the relationship doesn’t seem to be very strong.
Two-way cross-tabulations
■ A cross-tab, very much like a scatter plot, helps to understand how much the data values of one attribute change with a change in the data values of another attribute.
Two-way cross-tabulations
■ Moving to the second cross-tab, it gives the number of 3, 4, 5, 6, or 8 cylinder cars in every region present in the sample data set.
Two-way cross-tabulations
■ The third cross-tab presents the number of 3, 4, 5, 6, or 8 cylinder cars for every year.
Two-way cross-tabulations
■ The cross-tab showing the relationship between the attributes ‘model.year’ and ‘origin’ helps us understand the number of vehicles per year in each of the regions North America, Europe, and Asia.
■ Looking at it in another way, we can get the count of vehicles per region over
the different years.
■ We may also want to create cross-tabs with a more summarized view, e.g. a cross-tab giving the number of cars having 4 or fewer cylinders and more than 4 cylinders in each region or by year. This can be done by rolling up data values by the attribute ‘cylinders’, as in the sketch below.
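A sketch of how such cross-tabs can be produced with pandas; the data frame below is a small hypothetical stand-in for the Auto MPG records:

import pandas as pd

df = pd.DataFrame({'cylinders': [4, 4, 6, 8, 4, 6],
                   'origin':    [1, 3, 1, 1, 2, 3]})

# Count of cars for every (cylinders, origin) combination
print(pd.crosstab(df['cylinders'], df['origin']))

# Rolled-up view: 4 or fewer cylinders vs. more than 4, per region
print(pd.crosstab(df['cylinders'] <= 4, df['origin']))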
Data quality
■ Success of machine learning depends largely on the quality of data.
■ Data which has the right quality helps to achieve better prediction accuracy, in the case of supervised learning.
■ Two common data quality problems are:
◻ Certain data elements without a value, i.e. data with a missing value.
◻ Data elements having a value surprisingly different from the other elements, which we term as outliers.
Data quality
■ Incorrect sample set selection:
■ The data may not reflect normal or regular quality due to incorrect selection of the sample set.
■ For example, suppose we select a sample set of sales transactions from a festive period and try to use that data to predict sales in the future.
■ In this case, the prediction will be far from the actual scenario, simply because the sample set was selected at the wrong time.
Data quality
■ Errors in data collection: resulting in outliers and missing values
■ In many cases, a person or a group of persons is responsible for the collection of data to be used in a learning activity.
■ In this manual process, there is the possibility of wrongly recording data
either in terms of value (say 20.67 is wrongly recorded as 206.7 or 2.067) or
in terms of a unit of measurement (say cm. is wrongly recorded as m. or
mm.).
■ This may result in data elements which have abnormally high or low value
from other elements. Such records are termed as outliers.
Data quality
■ Errors in data collection: resulting in outliers and missing values
■ It may also happen that the data is not recorded at all.
■ In case of a survey conducted to collect data, it is all the more possible as
survey responders may choose not to respond to a certain question.
■ So the data value for that data element in that responder’s record is missing.
Data remediation - Handling outliers
■ Data remediation is the process of cleansing, organizing, and migrating data.
■ We will discuss how to handle outliers and missing values.
■ Remove outliers: If the number of records which are outliers is not large, a simple approach may be to remove them.
■ Imputation: Another way is to impute, i.e. replace, the value with the mean, median, or mode. The value of the most similar data element may also be used for imputation.
■ Capping: For values that lie outside the 1.5 × IQR limits, we can cap them by replacing those observations:
◻ observations that lie below the lower limit, with the value of the 5th percentile;
◻ observations that lie above the upper limit, with the value of the 95th percentile.
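A sketch of IQR-based capping with numpy; the quartile convention here is numpy's default, and the 5th/95th percentile cut-offs follow the rule above:

import numpy as np

def cap_outliers(values):
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    p5, p95 = np.percentile(x, [5, 95])
    x = np.where(x < low, p5, x)    # cap unusually low values at the 5th percentile
    x = np.where(x > high, p95, x)  # cap unusually high values at the 95th percentile
    return x

print(cap_outliers([18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 200]))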
Data remediation - Handling outliers
■ If there is a significant number of outliers, they should be treated separately in the statistical model.
■ In that case, the data should be treated as two different groups, a model should be built for each group, and the outputs can then be combined.
Data remediation - Handling missing values
■ Eliminating records having a missing value of data elements
■ In case the proportion of data elements having missing values is within a
tolerable limit, a simple but effective approach is to remove the records
having such data elements.
■ In the case of Auto MPG data set, only in 6 out of 398 records, the value of
attribute ‘horsepower’ is missing. If we get rid of those 6 records, we will still
have 392 records, which is definitely a substantial number. So, we can very
well eliminate the records and keep working with the remaining data set.
■ However, this will not be possible if the proportion of records having data elements with missing values is really high, as that will reduce the power of the model because of the reduction in training data size.
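A sketch of this removal with pandas, assuming the data set sits in a hypothetical auto-mpg.csv file (in the UCI copy of this data set, missing ‘horsepower’ values are recorded as '?'):

import pandas as pd

df = pd.read_csv('auto-mpg.csv')
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')  # '?' -> NaN

df_clean = df.dropna(subset=['horsepower'])  # drop the records with missing horsepower
print(len(df), '->', len(df_clean))          # e.g. 398 -> 392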
Data remediation - Handling missing values
■ Imputing missing values
■ Imputation is a method to assign a value to the data elements having missing
values.
■ The mean/median/mode is the most frequently assigned value.
■ For quantitative attributes, all missing values are imputed with the mean,
median, or mode of the remaining values under the same attribute.
■ For qualitative attributes, all missing values are imputed by the mode of all
remaining values of the same attribute.
◻ For example, in the Auto MPG data set, ‘cylinders’ is the attribute logically most connected to ‘horsepower’, because with an increase in the number of cylinders of a car, the horsepower of the car is expected to increase. Missing ‘horsepower’ values can therefore be imputed group-wise, using records with the same ‘cylinders’ value, as sketched below.
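A sketch of such group-wise imputation with pandas, on a small hypothetical data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'cylinders':  [4, 4, 4, 6, 6, 8],
                   'horsepower': [70.0, 90.0, np.nan, 105.0, np.nan, 150.0]})

# Replace each missing 'horsepower' with the median horsepower
# of the cars having the same number of cylinders
df['horsepower'] = df['horsepower'].fillna(
    df.groupby('cylinders')['horsepower'].transform('median'))
print(df)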
Data remediation - Handling missing values
■ Estimate missing values
■ If there are data points similar to the ones with missing attribute values, then the attribute values from those similar data points can be used in place of the missing values.
■ For finding similar data points or observations, a distance function can be used.
◻ For example, let’s assume that the weight of a student having age 12 years and height 5
ft. is missing. Then the weight of any other student having age close to 12 years and
height close to 5 ft. can be assigned.
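One possible implementation of this idea, assuming scikit-learn is available, fills each gap from the single most similar record (nearest neighbour by distance):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical student records; one weight is missing
students = pd.DataFrame({'age':    [12, 13, 12, 11],
                         'height': [5.0, 5.2, 4.9, 4.6],
                         'weight': [40.0, 46.0, np.nan, 35.0]})

imputer = KNNImputer(n_neighbors=1)     # copy the value from the nearest neighbour
print(imputer.fit_transform(students))  # the NaN is replaced by a similar student's weight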
DATA PRE-PROCESSING
■ Dimensionality reduction
■ High-dimensional data sets need a high amount of computational space and
time.
■ At the same time, not all features are useful; irrelevant or redundant features can degrade the performance of machine learning algorithms.
■ Most of the machine learning algorithms perform better if the dimensionality of the data set, i.e. the number of features in the data set, is reduced.
■ Dimensionality reduction helps in reducing irrelevance and redundancy in
features.
■ Also, it is easier to understand a model if the number of features involved in
the learning activity is less.
DATA PRE-PROCESSING
■ Dimensionality reduction
■ Dimensionality reduction refers to the techniques of reducing the
dimensionality of a data set by creating new attributes by combining the
original attributes.
■ The most common approach for dimensionality reduction is known as
Principal Component Analysis (PCA)
◻ PCA is a statistical technique to convert a set of correlated variables into a set of
transformed, uncorrelated variables called principal components.
◻ The principal components are a linear combination of the original variables. They are
orthogonal to each other.
◻ Since principal components are uncorrelated, they capture the maximum amount of
variability in the data.
◻ However, the only challenge is that the original attributes are lost due to the transformation.
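A minimal PCA sketch, assuming scikit-learn and hypothetical data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # hypothetical data: 100 rows, 5 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)  # make two features strongly correlated

pca = PCA(n_components=2)                       # keep only the top 2 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                          # (100, 2)
print(pca.explained_variance_ratio_)            # share of variance captured per component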
DATA PRE-PROCESSING
■ Dimensionality reduction
■ Another commonly used technique for dimensionality reduction is Singular Value Decomposition (SVD).
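A minimal numpy sketch of SVD-based reduction on hypothetical data (for PCA-like use, the data is typically mean-centred first):

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))  # hypothetical data
X = X - X.mean(axis=0)                              # mean-centre the columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                              # keep the top-k singular directions
X_reduced = U[:, :k] * s[:k]       # low-dimensional representation of the rows
print(X_reduced.shape)             # (100, 2)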
DATA PRE-PROCESSING
■ Feature subset selection
■ Feature subset selection, or simply feature selection, both for supervised as well as unsupervised learning, tries to find the optimal subset of the entire feature set which significantly reduces computational cost without any major impact on learning accuracy.
■ It may seem that a feature subset may lead to loss of useful information, as certain features are going to be excluded from the final set of features used for learning.
■ However, only features which are irrelevant or redundant are selected for elimination.
DATA PRE-PROCESSING
■ Feature subset selection
■ A feature is considered as irrelevant if it plays an insignificant role (or
contributes almost no information) in classifying or grouping together a set of
data instances.
■ All irrelevant features are eliminated while selecting the final feature subset.
■ A feature is potentially redundant when the information it contributes is more or less the same as that of one or more other features.
■ Among a group of potentially redundant features, a small number of features can be selected as part of the final feature subset without causing any negative impact on the accuracy of the learned model.
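As an illustration, a simple filter-style sketch assuming scikit-learn; SelectKBest with a univariate relevance score is just one of many possible feature selection approaches:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features that score highest on a univariate relevance test
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X.shape, '->', X_selected.shape)  # (150, 4) -> (150, 2)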
Thank You