Text Classification: When Not To Use Machine Learning
We might choose, for example, every word and every two-word phrase in
the title to be a feature. Below are some examples of job titles and their
word-level and two-word level features.
Title                            Features
General Manager                  general, manager, general manager
Director, Software Engineering   director, software, engineering, software engineering
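As a concrete sketch of this featurization, here is one way to extract word and two-word-phrase features from a title. The lowercasing and regex tokenization are assumptions; the article does not specify how punctuation is handled.

```python
import re

def title_features(title):
    """Extract word-level and two-word-phrase features from a job title.
    Tokenization choice (lowercase, letters only) is an assumption."""
    words = re.findall(r"[a-z]+", title.lower())             # individual words
    bigrams = [" ".join(p) for p in zip(words, words[1:])]   # two-word phrases
    return words + bigrams

print(title_features("General Manager"))
# ['general', 'manager', 'general manager']
```

Note that this simple tokenizer also forms a two-word phrase across a comma (e.g. "director software" from Director, Software Engineering); a production featurizer might split on punctuation first.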
The difficulty with this approach is that, unless the training set is very
large and sufficiently diverse, a machine learning solution can
significantly overfit it.
The term overfit means the learned model does not perform adequately on titles not seen during training.
Below is a simple example that illustrates this. Imagine that the training set has an entry chief medical officer → C-level. Also imagine that no other title in the training set has the word medical in it. In this case, a machine learning algorithm is likely to learn the association that medical predicts C-level, which is clearly wrong.
Why does this happen? We are expecting the machine learning algorithm to automatically figure out which words, and which two-word phrases, predict specific ranks and which don't. Hundreds of thousands of different words can occur in the imagined universe of titles. (The contacts database at Data.com has more than ten million distinct titles.) That means on the order of hundreds of thousands squared possible two-word phrases. For the machine learning solution to automatically discover which of these words and two-word phrases predict specific ranks requires a very large training set.
Can we alleviate this issue by limiting our features to words? The reasoning: restricting features to individual words drastically reduces the universe of feature values, so a significantly smaller training set suffices to learn associations between words and ranks.
Yes, but we pay a price for it in reduced accuracy. Certain two-word phrases, for instance vice president, predict ranks more accurately than the independent combination of the words in them. (president predicts C-level; vice in and of itself does not strongly predict VP-level.)
Moreover, the number of distinct words in the universe of titles is still
rather large, so the requisite training set will still remain large.
If a very large training set is available, great. If not, as is often the case,
what to do? Let's revisit the keyword → rank rule-based approach.
A Rules-Based Solution
Consider the rule

manager → Manager-level

Interpret this as: if the title contains the word manager, classify it to the rank Manager-level.
This single rule correctly classifies most (but not all) titles that contain the word manager. In the parlance of machine learning, this one rule generalizes massively (albeit not perfectly).
To improve on this, the following mechanism helps:

If two rules fire on a particular title, and the antecedent of one is a subphrase of the antecedent of the other, the rule with the longer antecedent overrides the rule whose antecedent is the subphrase.
Let's see an example. Add the following rule:

general manager → VP-level

Consider the title General Manager, data.com. Both rules fire on this title. The general manager rule wins because manager is a subphrase of general manager. This results in the title getting classified to VP-level.
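The firing and override logic above can be sketched as follows. Using substring containment as a proxy for the subphrase test is an assumption for brevity; a stricter implementation would compare on word boundaries.

```python
def classify(title, rules):
    """Apply keyword -> rank rules. A rule whose antecedent contains
    another firing rule's antecedent as a subphrase overrides it."""
    t = title.lower()
    fired = [(kw, rank) for kw, rank in rules if kw in t]
    # Drop any rule whose antecedent is a subphrase of another firing rule's.
    survivors = [
        (kw, rank) for kw, rank in fired
        if not any(kw != kw2 and kw in kw2 for kw2, _ in fired)
    ]
    return survivors[0][1] if survivors else None

rules = [("manager", "Manager-level"), ("general manager", "VP-level")]
print(classify("General Manager, data.com", rules))  # VP-level
print(classify("Sales Manager", rules))              # Manager-level
```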
How do we ensure that Assistant to Vice President gets classified to Staff whereas Assistant Vice President gets classified to VP-level?
We need to add a simple mechanism: a numeric strength on each rule. To illustrate this, imagine that the rule set is as follows:
1. assistant → Staff (1)
2. vice president → VP-level (2)
3. assistant to → Staff (3)
Consider the title Assistant to Vice President. Rules 1, 2, and 3 all fire on this title. Rule 3 overrides rule 1. Next, rule 3 predicts Staff more strongly than rule 2 predicts VP-level. So Staff wins. Next, consider the title Assistant Vice President. Rules 1 and 2 fire. Rule 2 predicts VP-level more strongly than rule 1 predicts Staff. So VP-level wins.
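Putting the two mechanisms together, a classifier resolves conflicts by subphrase override first, then by strength. A minimal sketch (again using substring containment as the subphrase test, an assumption):

```python
def classify(title, rules):
    """Resolve competing (keyword, rank, strength) rules:
    subphrase override first, then greatest strength wins."""
    t = title.lower()
    fired = [(kw, rank, s) for kw, rank, s in rules if kw in t]
    # Subphrase override: drop rules subsumed by a longer firing antecedent.
    survivors = [
        (kw, rank, s) for kw, rank, s in fired
        if not any(kw != kw2 and kw in kw2 for kw2, _, _ in fired)
    ]
    if not survivors:
        return None
    # Among the survivors, the strongest rule wins.
    return max(survivors, key=lambda r: r[2])[1]

rules = [("assistant", "Staff", 1),
         ("vice president", "VP-level", 2),
         ("assistant to", "Staff", 3)]
print(classify("Assistant to Vice President", rules))  # Staff
print(classify("Assistant Vice President", rules))     # VP-level
```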
It turns out that by hand-crafting a couple of hundred such rules, one can achieve high classification accuracy on a large test set of titles.
Sure, hand-crafting a few hundred rules takes work. But putting together a training set of tens to hundreds of thousands of titles pre-classified to ranks might take a whole lot more.
Combining Rules and Machine Learning
The rules-based approach gives us massive generalization from a small set of rules. However, it doesn't automatically learn from its mistakes. If feedback is expected to arrive continually (even if at a low rate), automated learning from such feedback to improve classification accuracy is very attractive. The alternative of manually adjusting the rules from such feedback is more laborious, and injects humans into the loop. (Humans are intelligent, but don't scale.)
So a sensible combination would be to use the rule-based approach to
quickly get a decent classifier off the ground; then use machine learning
to automatically adjust the rules from feedback.
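One plausible way to "adjust the rules from feedback" is a perceptron-style update on the rule strengths. This is a hypothetical sketch of the idea, not a method the article spells out: each feedback example strengthens firing rules that predicted the correct rank and weakens firing rules that did not.

```python
def update_strengths(title, correct_rank, rules, step=1):
    """Hypothetical perceptron-style update (an assumption, not the
    article's stated method): for each rule that fires on the title,
    strengthen it if it predicts the correct rank, weaken it otherwise."""
    t = title.lower()
    for i, (kw, rank, s) in enumerate(rules):
        if kw in t:  # the rule fires on this title
            new_s = s + step if rank == correct_rank else max(0, s - step)
            rules[i] = (kw, rank, new_s)
    return rules

rules = [("assistant", "Staff", 1), ("vice president", "VP-level", 2)]
# Feedback arrives: "Assistant to Vice President" is actually Staff.
update_strengths("Assistant to Vice President", "Staff", rules)
print(rules)  # [('assistant', 'Staff', 2), ('vice president', 'VP-level', 1)]
```

Over many feedback examples, the strengths drift toward values that make the strength-based resolution pick the right rank more often, without a human editing the rules.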