Feature representation
Linear classifiers are easy to work with and analyze, but they are a very restricted class of
hypotheses. If we have to make a complex distinction in low dimensions, then they are
unhelpful.
Our favorite illustrative example is the “exclusive or” (XOR) data set, the drosophila of machine-learning data sets. (D. melanogaster is a species of fruit fly, used as a simple system in which to study genetics since 1910.)
There is no linear separator for this two-dimensional dataset! But, we have a trick
available: take a low-dimensional data set and move it, using a non-linear transformation, into a higher-dimensional space, and look for a linear separator there. Let’s look at an
example data set that starts in 1-D:
These points are not linearly separable, but consider the transformation φ(x) = [x, x²]. (What’s a linear separator for data in 1-D? A point!) Putting the data in φ space, we see that it is now separable. There are lots of possible separators; we have just shown one of them here.
[Figure: the data mapped into φ space, with horizontal axis x and vertical axis x², and one linear separator drawn.]
A linear separator in φ space is a nonlinear separator in the original space! Let’s see how this plays out in our simple example. Consider the separator x² − 1 = 0, which labels the half-plane x² − 1 > 0 as positive. What separator does it correspond to in the original 1-D space? We have to ask the question: which x values have the property that x² − 1 = 0? The answer is +1 and −1, so those two points constitute our separator, back in the original space. And we can use the same reasoning to find the region of 1-D space that is labeled positive by this separator.
[Figure: the original 1-D space (axis x), with the two separator points at −1 and +1.]
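To make this concrete, here is a minimal sketch (in Python) of the feature map φ(x) = [x, x²] and of classification with the separator x² − 1 = 0; the sample points are made up for illustration:

```python
import numpy as np

def phi(x):
    """Map a 1-D point x to the feature vector [x, x^2]."""
    return np.array([x, x ** 2])

def classify(x):
    """The separator x^2 - 1 = 0 in phi space: label +1 iff x^2 - 1 > 0."""
    return 1 if phi(x)[1] - 1 > 0 else -1

# Illustrative 1-D points: those with |x| > 1 come out positive.
for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(x, classify(x))
```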
This is a very general and widely useful strategy. It’s the basis for kernel methods, a
powerful technique that we unfortunately won’t get to in this class, and can be seen as a
motivation for multi-layer neural networks.
There are many different ways to construct φ. Some are relatively systematic and do-
main independent. We’ll look at the polynomial basis in section 1 as an example of that.
Others are directly related to the semantics (meaning) of the original features, and we con-
struct them deliberately with our domain in mind. We’ll explore that strategy in section 2.
1 Polynomial basis
If the features in your problem are already naturally numerical, one systematic strategy for
constructing a new feature space is to use a polynomial basis. The idea is that, if you are using the kth-order basis (where k is a positive integer), you include a constant feature plus a feature for every possible product of k or fewer (not necessarily distinct) dimensions of your original input.
Here is a table illustrating the kth order polynomial basis for different values of k.
Order   d = 1                  in general
0       [1]                    [1]
1       [1, x]                 [1, x1, ..., xd]
2       [1, x, x²]             [1, x1, ..., xd, x1², x1x2, ...]
3       [1, x, x², x³]         [1, x1, ..., x1², x1x2, ..., x1x2x3, ...]
⋮       ⋮                      ⋮
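Here is a minimal sketch of how such a basis can be computed; the function name poly_basis is just for illustration, and it enumerates the constant feature plus every product of up to k (not necessarily distinct) coordinates:

```python
import itertools
import numpy as np

def poly_basis(x, k):
    """Order-k polynomial basis of a d-dimensional point x.

    Includes the constant feature 1 and every product of up to k
    (not necessarily distinct) coordinates of x.
    """
    x = np.asarray(x, dtype=float)
    feats = [1.0]
    for degree in range(1, k + 1):
        for idxs in itertools.combinations_with_replacement(range(len(x)), degree):
            feats.append(np.prod(x[list(idxs)]))
    return np.array(feats)

print(poly_basis([2.0, 3.0], 2))   # [1, x1, x2, x1^2, x1*x2, x2^2] -> [1. 2. 3. 4. 6. 9.]
```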
So, what if we try to solve the XOR problem using a polynomial basis as the feature
transformation? We can just take our two-dimensional data and transform it into a higher-
dimensional data set, by applying φ. Now, we have a classification problem as usual, and
we can use the perceptron algorithm to solve it.
Let’s try it for k = 2 on our XOR problem. The feature transformation is
φ([x1, x2]) = [1, x1, x2, x1², x1x2, x2²].
Study Question: If we use perceptron to train a classifier after performing this fea-
ture transformation, would we lose any expressive power if we let θ0 = 0 (i.e. trained
without offset instead of with offset)?
After 4 iterations, perceptron finds a separator with coefficients θ = (0, 0, 0, 0, 4, 0) and
θ0 = 0. This corresponds to the separator 0 + 0·x1 + 0·x2 + 0·x1² + 4·x1x2 + 0·x2² = 0, i.e., x1x2 = 0, and is plotted below, with the gray shaded region classified as negative and the white region classified as positive.
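As a quick sanity check, here is a minimal sketch that evaluates this separator on the four XOR points; the ±1 coordinates and labels assumed for the data are illustrative:

```python
import numpy as np

def phi2(x1, x2):
    # Order-2 polynomial basis for 2-D input: [1, x1, x2, x1^2, x1*x2, x2^2]
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

theta, theta_0 = np.array([0, 0, 0, 0, 4, 0]), 0.0

# XOR-style data with +/-1 coordinates (an assumed layout):
# same-sign points labeled +1, opposite-sign points labeled -1.
data = [((+1, +1), +1), ((-1, -1), +1), ((+1, -1), -1), ((-1, +1), -1)]
for (x1, x2), label in data:
    pred = np.sign(theta @ phi2(x1, x2) + theta_0)
    print((x1, x2), label, int(pred))   # prediction matches the label for all four points
```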
Study Question: It takes many more iterations to solve this version. Apply knowl-
edge of the convergence properties of the perceptron to understand why.
Here is a harder data set. After 200 iterations, we could not separate it with a second- or third-order basis representation. Shown below are the results after 200 iterations for bases
of order 2, 3, 4, and 5.
2 Hand-constructing features for real domains

An important factor in the success of an ML application is the way that the features are chosen to be encoded by the human who is framing the learning problem.

2.1 Discrete features

Suppose one of the features in our raw data is discrete, taking on one of k possible values. Here are some strategies for turning such a feature into (a vector of) real numbers:
• Numeric: Assign each of these values a number, say 1.0/k, 2.0/k, ..., 1.0. We might then want to do some further processing, as described in section 2.3. This is a sensible strategy only when the discrete values really do signify some sort of numeric quantity, so that these numerical values are meaningful.
• Thermometer code: If your discrete values have a natural ordering from 1, ..., k, but not a natural mapping into real numbers, a good strategy is to use a vector of k binary variables, where we convert discrete input value 0 < j ≤ k into a vector in which the first j values are 1.0 and the rest are 0.0. This does not necessarily imply anything about the spacing or numerical quantities of the inputs, but does convey something about ordering. (A short code sketch of this encoding follows the list.)
• Factored code: If your discrete values can sensibly be decomposed into two parts (say the “make” and “model” of a car), then it’s best to treat those as two separate features, and choose an appropriate encoding of each one from this list.
• Binary code: It might be tempting for the computer scientists among us to use some binary code, which would let us represent k values using a vector of length log k. This is a bad idea! Decoding a binary code takes a lot of work, and by encoding your inputs this way, you’d be forcing your system to learn the decoding algorithm.
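Here is a minimal sketch of the thermometer code described above; the helper name is purely illustrative:

```python
import numpy as np

def thermometer(j, k):
    """Encode a discrete value j (1 <= j <= k) as a length-k thermometer vector."""
    return np.array([1.0 if i < j else 0.0 for i in range(k)])

print(thermometer(3, 5))   # [1. 1. 1. 0. 0.]
```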
As an example, imagine that we want to encode blood types, which are drawn from the
set {A+, A−, B+, B−, AB+, AB−, O+, O−}. There is no obvious linear numeric scaling or
even ordering to this set. But there is a reasonable factoring, into two features: {A, B, AB, O}
and {+, −}. And, in fact, we can reasonably factor the first group into {A, notA}, {B, notB}. It is sensible (according to Wikipedia!) to treat O as having neither feature A nor feature B.

So, here are two plausible encodings of the whole set (both are sketched in code after this list):

• Use a 6-D vector, with two dimensions to encode each of the factors using a one-hot encoding.

• Use a 3-D vector, with one dimension for each factor, encoding its presence as 1.0 and absence as −1.0 (this is sometimes better than 0.0). In this case, AB+ would be (1.0, 1.0, 1.0) and O− would be (−1.0, −1.0, −1.0).
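Here is a minimal sketch of these two encodings; the factoring table and helper names are just one illustrative way to write them down:

```python
import numpy as np

# Each blood type factored into three binary attributes: (has A?, has B?, Rh positive?).
FACTORS = {
    "A+":  (True,  False, True),  "A-":  (True,  False, False),
    "B+":  (False, True,  True),  "B-":  (False, True,  False),
    "AB+": (True,  True,  True),  "AB-": (True,  True,  False),
    "O+":  (False, False, True),  "O-":  (False, False, False),
}

def encode_6d(blood_type):
    """One-hot encode each of the three binary factors: two dimensions per factor."""
    return np.concatenate([[1.0, 0.0] if f else [0.0, 1.0] for f in FACTORS[blood_type]])

def encode_3d(blood_type):
    """One dimension per factor: +1.0 for presence, -1.0 for absence."""
    return np.array([1.0 if f else -1.0 for f in FACTORS[blood_type]])

print(encode_3d("AB+"))   # [ 1.  1.  1.]
print(encode_3d("O-"))    # [-1. -1. -1.]
print(encode_6d("A+"))    # [1. 0. 0. 1. 1. 0.]
```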
2.2 Text
The problem of taking a text (such as a tweet or a product review, or even this document!)
and encoding it as an input for a machine-learning algorithm is interesting and compli-
cated. Much later in the class, we’ll study sequential input models, where, rather than
having to encode a text as a fixed-length feature vector, we feed it into a hypothesis word
by word (or even character by character!).
There are some simpler encodings that work well for basic applications. One of them is
the bag of words (BOW) model. The idea is to let d be the number of words in our vocabulary
(either computed from the training set or some other body of text or dictionary). We will
then make a binary vector (with values 1.0 and 0.0) of length d, where element j has value
1.0 if word j occurs in the document, and 0.0 otherwise.
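Here is a minimal sketch of a binary bag-of-words encoding; the toy vocabulary and whitespace tokenization are only for illustration (a real system would at least strip punctuation):

```python
import numpy as np

def bag_of_words(document, vocabulary):
    """Binary BOW: element j is 1.0 if vocabulary word j occurs in the document."""
    words = set(document.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocabulary])

vocab = ["great", "terrible", "battery", "screen"]   # a toy vocabulary
print(bag_of_words("great screen and great battery", vocab))   # [1. 0. 1. 1.]
```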
where a ⌢ b means the vector a concatenated with the vector b. What is the dimension of Percy’s representation? Under what assumptions about the original features is this a reasonable choice?