Text Classification: When Not To Use Machine Learning
We might choose, for example, every word and every two-word phrase in
the title to be a feature. Below are some examples of job titles and their
word-level and two-word level features.
Title                            Features
General Manager                  general, manager, general manager
Director, Software Engineering   director, software, engineering, software engineering
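As a concrete sketch of this featurization, here is one way to extract word and two-word-phrase features from a title. The lowercasing and regex tokenization are assumptions; the article does not specify how punctuation is handled.

```python
import re

def title_features(title):
    """Extract word-level and two-word-phrase features from a job title.
    Tokenization choice (lowercase, letters only) is an assumption."""
    words = re.findall(r"[a-z]+", title.lower())             # individual words
    bigrams = [" ".join(p) for p in zip(words, words[1:])]   # two-word phrases
    return words + bigrams

print(title_features("General Manager"))
# ['general', 'manager', 'general manager']
```

Note that this simple tokenizer also forms a two-word phrase across a comma (e.g. "director software" from Director, Software Engineering); a production featurizer might split on punctuation first.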
The difficulty with this approach is that, unless the training set is very
large and sufficiently diverse, a machine learning solution can
significantly overfit it.
The term overfit means the learned model does not perform adequately on titles not seen during training.
Below is a simple example that illustrates this. Imagine that the training set has an entry chief medical officer → C-level. Also imagine that no other title in the training set has the word medical in it. In this case, a machine learning algorithm is likely to learn the association that medical predicts C-level, which is clearly wrong.
Why does this happen? We are expecting the machine learning algorithm to automatically figure out which words, and which two-word phrases, predict specific ranks and which don't. Hundreds of thousands of different words can occur in the imagined universe of titles. (The contacts database at Data.com has more than ten million distinct titles.) That means on the order of hundreds of thousands squared possible two-word phrases. For the machine learning solution to automatically discover which of these words and two-word phrases predict specific ranks requires a very large training set.
Can we alleviate this issue by limiting our features to words? The reasoning: restricting features to individual words drastically reduces the universe of feature values, so a significantly smaller training set suffices to learn associations between words and ranks.
Yes, but we pay a price for it in reduced accuracy. Certain two-word phrases, for instance vice president, predict ranks more accurately than the independent combination of the words in them. (president predicts C-level; vice in and of itself does not strongly predict VP-level.)
Moreover, the number of distinct words in the universe of titles is still
rather large, so the requisite training set will still remain large.
If a very large training set is available, great. If not, as is often the case,
what to do? Let's revisit the keyword → rank rule-based approach.
A Rules-Based Solution
Consider the rule

manager → Manager-level

Interpret this as: if the title contains the word manager, classify it to the rank Manager-level.
This single rule correctly classifies most (but not all) titles that contain the word manager. In the parlance of machine learning, this one rule generalizes massively (albeit not perfectly).
To improve on this, the following mechanism helps:

If two rules fire on a particular title, and the antecedent of one is a subphrase of the antecedent of the other, the rule with the longer antecedent overrides the rule whose antecedent is the subphrase.
Let's see an example. Add the following rule:

general manager → VP-level

Consider the title General Manager, data.com. Both rules fire on this title. The general manager rule wins because manager is a subphrase of general manager. This results in the title getting classified to VP-level.
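The firing and override logic above can be sketched as follows. Using substring containment as a proxy for the subphrase test is an assumption for brevity; a stricter implementation would compare on word boundaries.

```python
def classify(title, rules):
    """Apply keyword -> rank rules. A rule whose antecedent contains
    another firing rule's antecedent as a subphrase overrides it."""
    t = title.lower()
    fired = [(kw, rank) for kw, rank in rules if kw in t]
    # Drop any rule whose antecedent is a subphrase of another firing rule's.
    survivors = [
        (kw, rank) for kw, rank in fired
        if not any(kw != kw2 and kw in kw2 for kw2, _ in fired)
    ]
    return survivors[0][1] if survivors else None

rules = [("manager", "Manager-level"), ("general manager", "VP-level")]
print(classify("General Manager, data.com", rules))  # VP-level
print(classify("Sales Manager", rules))              # Manager-level
```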
How do we ensure that Assistant to Vice President gets classified to Staff whereas Assistant Vice President gets classified to VP-level?
We need to add a simple mechanism: a numeric strength on each rule. To illustrate this, imagine that the rule set is as follows:
1. assistant → Staff (1)
2. vice president → VP-level (2)
3. assistant to → Staff (3)
Consider the title Assistant to Vice President. Rules 1, 2, and 3 all fire on this title. Rule 3 overrides rule 1. Next, rule 3 predicts Staff more strongly than rule 2 predicts VP-level. So Staff wins. Next, consider the title Assistant Vice President. Rules 1 and 2 fire. Rule 2 predicts VP-level more strongly than rule 1 predicts Staff. So VP-level wins.
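Putting the two mechanisms together, a classifier resolves conflicts by subphrase override first, then by strength. A minimal sketch (again using substring containment as the subphrase test, an assumption):

```python
def classify(title, rules):
    """Resolve competing (keyword, rank, strength) rules:
    subphrase override first, then greatest strength wins."""
    t = title.lower()
    fired = [(kw, rank, s) for kw, rank, s in rules if kw in t]
    # Subphrase override: drop rules subsumed by a longer firing antecedent.
    survivors = [
        (kw, rank, s) for kw, rank, s in fired
        if not any(kw != kw2 and kw in kw2 for kw2, _, _ in fired)
    ]
    if not survivors:
        return None
    # Among the survivors, the strongest rule wins.
    return max(survivors, key=lambda r: r[2])[1]

rules = [("assistant", "Staff", 1),
         ("vice president", "VP-level", 2),
         ("assistant to", "Staff", 3)]
print(classify("Assistant to Vice President", rules))  # Staff
print(classify("Assistant Vice President", rules))     # VP-level
```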
It turns out that by hand-crafting a couple of hundred such rules, one can achieve high classification accuracy on a large test set of titles.
Sure, hand-crafting a few hundred rules takes work. But putting together a training set of tens to hundreds of thousands of titles pre-classified to ranks might take a whole lot more.
Combining Rules and Machine Learning
The rules-based approach gives us massive generalization from a small set of rules. However, it doesn't automatically learn from its mistakes. If feedback is expected to arrive continually (even if at a low rate), automated learning from such feedback to improve classification accuracy is very attractive. The alternative of manually adjusting the rules from such feedback is more laborious, and injects humans into the loop. (Humans are intelligent, but don't scale.)
So a sensible combination would be to use the rule-based approach to
quickly get a decent classifier off the ground; then use machine learning
to automatically adjust the rules from feedback.
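One plausible way to "adjust the rules from feedback" is a perceptron-style update on the rule strengths. This is a hypothetical sketch of the idea, not a method the article spells out: each feedback example strengthens firing rules that predicted the correct rank and weakens firing rules that did not.

```python
def update_strengths(title, correct_rank, rules, step=1):
    """Hypothetical perceptron-style update (an assumption, not the
    article's stated method): for each rule that fires on the title,
    strengthen it if it predicts the correct rank, weaken it otherwise."""
    t = title.lower()
    for i, (kw, rank, s) in enumerate(rules):
        if kw in t:  # the rule fires on this title
            new_s = s + step if rank == correct_rank else max(0, s - step)
            rules[i] = (kw, rank, new_s)
    return rules

rules = [("assistant", "Staff", 1), ("vice president", "VP-level", 2)]
# Feedback arrives: "Assistant to Vice President" is actually Staff.
update_strengths("Assistant to Vice President", "Staff", rules)
print(rules)  # [('assistant', 'Staff', 2), ('vice president', 'VP-level', 1)]
```

Over many feedback examples, the strengths drift toward values that make the strength-based resolution pick the right rank more often, without a human editing the rules.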