Tensorflow Implementation For Job Market Classification: Taras Mitran Jeff Waller

This document discusses using a convolutional neural network (CNN) to classify job descriptions by similar roles. Currently, job roles are classified manually which is inefficient as new roles are constantly being added. The document proposes using TensorFlow to build a CNN model that can learn embeddings and classifications from large datasets of job descriptions. It provides examples of CNN concepts like convolutions, pooling, activations and softmax. It also discusses preprocessing job data into fixed-length vectors and tuning hyperparameters like embedding size. Building the model requires GPU hardware, Linux, and libraries like CUDA and CUDNN from Nvidia. The document shares tips for installing TensorFlow and the challenges faced, like crashes, during initial model testing and development.

Uploaded by

subhanshu babbar

TensorFlow Implementation for Job Market Classification

Taras Mitran
Jeff Waller
HR Compensation Workflow
Scenario: ABC Corp wants to hire a statistician.

What is the market rate for this job at the 50th percentile? At the 60th percentile?

Issue: Almost every company’s job title and description for roughly the same “job” differs from those of other companies.
HR Compensation Workflow
1. ABC Corp submits all salaries with job information to salary survey
companies annually.
2. Survey companies aggregate data across companies.
3. They sell it back to ABC Corp in .csv format.
4. ABC Corp has to match several overlapping surveys to the job they
want to price.
HR Compensation Workflow
Instead of searching the entire dataset via keyword search:

Can similar or nearly identical jobs be clustered, so the user only has to
review a small subset of jobs?
HR Compensation Workflow
Luckily, this is being done.

The issue: it's being done manually, and tens of thousands of different
job descriptions are created each year.

Challenge: augment this work with machine learning


A sample short form job description:
Head of statistical programming co-ordinate and oversee activities of statistical
programming teams develop programming standards. develop utility sas macros for
use in project programs. develop programming technologies to increase efficiencies
in clinical study reporting. generate summary data tables and analyses as part of
clinical study reports and iss/ise documents for fda submission. key programmer for
the production of interim analysis tables needed by data monitoring boards.
collaborate with it department to enhance biostatistics technologies; serve as key
contact for sas issues. program and validate data transfers and data conversions
between sas and other software for clients and regulatory agencies. develop, test,
and validate sas interfaces to non-sas data sources. supervise, instruct, and mentor
junior staff. participate in business development activities. experience: likely to have
had a least 10 years' experience in statistical programming. qualifications: bsc in
computing, life sciences, mathematical or statistical subjects. line management
experience. level responsibility: line management - high - 6+. project - high - 3+
countries. financial - medium. technical - high - expert. alternative titles: director of
statistical programming. survey level 4.
I heard TensorFlow was good?
What is TensorFlow?
• Python-based neural network framework
• Google open source project on GitHub
• Released November 2015
• Runs highly optimized C++ code for actual calculations
• Higher-level APIs are available on top of TensorFlow, such as skflow, which fits within the scikit-learn API

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them.
Why TensorFlow?

Because Google?
Text-based Convolutional Neural Nets

• Udacity Course:
https://www.udacity.com/course/deep-learning--ud730
• All examples are in the main TensorFlow repository:
https://www.udacity.com/course/deep-learning--ud730
• ConvNet on Rotten Tomatoes data set guide:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
Facebook whitepaper on convolutional neural networks:

#TAGSPACE: Semantic Embeddings from Hashtags

• A convolutional neural network that learns feature representations for short textual posts using hashtags as a supervised signal. The proposed approach is trained on up to 5.5 billion words predicting 100,000 possible hashtags.
• http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf
Facebook’s neural network design:
How to get started?

• Step 1: pip install tensorflow


• Step 2: ??
• Step 3: Profit
Nvidia GPUs for NN

Neural networks are computationally simple, but require massive amounts of calculations, a problem well suited to computation on a GPU.
Nvidia GPUs for NN
• The Nvidia GTX 980 was, until the GTX 1080, the most efficient hardware for this. The GTX 980 is used by gamers, which is why the price is low: $520 prior to the introduction of the GTX 1080, $400 now. The GTX 1080 is $650.
• Neural networks do not need double-precision floats, so much more expensive GPUs such as the K80 are not needed.
• The GTX 980 uses the Maxwell architecture, has 4 GB of memory (224 GB/sec bandwidth), and is capable of 4.6 TFLOPS.
• The GTX 1080, introduced June 2016, uses the Pascal architecture, which comes with specialized support for neural networks (among other things, support for half-precision floats), 8 GB of memory (320 GB/sec bandwidth), and 8.9 TFLOPS.
Install notes
• TensorFlow can make use of the GPU, but this requires Linux as the OS. It’s best to set aside a video card for computation rather than making it double as compute engine and video display.
• It needs the latest CUDA as well as the latest cuDNN (Nvidia's CUDA Deep Neural Network library).
• These libraries must be obtained from the Nvidia developer site:
• https://developer.nvidia.com/cuda-downloads
• https://developer.nvidia.com/cudnn
• Follow the install instructions given on the Nvidia site.
Install Cont’d
• A straightforward install location is /usr/local:

jeffw@chill:/usr/local$ cat /etc/ld.so.conf.d/cuda.conf
/usr/local/cuda/lib64
jeffw@chill:/usr/local$ cat /etc/ld.so.conf.d/cudnn.conf
/usr/local/cudnn/lib64

• Install TensorFlow with pip
• Install the usual numerical packages, NumPy and scikit-learn
Install Cont’d.
• Use Ubuntu 16 to avoid the following problem:
https://github.com/tensorflow/tensorflow/issues/2190
The rig:
Post install
1. Do some ETL
2. Create a model
3. Run "python mymodel.py"
4. Tweak the model
5. Buy some more video cards to make it go faster*

* Has nothing to do with wanting to play games at 120 FPS on a 4k monitor
ETL

* A validation set should also be included; we just ran that in our final tool
ETL Cont’d.

• Map each document to a fixed-length vector of integers (vocabulary IDs) of length MAX_DOCUMENT_LENGTH
• Each word is assigned a unique integer
• Save the vocabulary mapping so IDs are consistent across multiple training runs
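
The mapping described above can be sketched in plain Python. This is only an illustration, not the code from the talk: the helper names (build_vocabulary, encode), the whitespace tokenizer, the use of ID 0 for both padding and unknown words, the small MAX_DOCUMENT_LENGTH, and the vocab.json file name are all assumptions.

```python
import json
import os
import tempfile

MAX_DOCUMENT_LENGTH = 8  # assumed small value, for illustration only
PAD_ID = 0               # assumption: ID 0 doubles as padding and "unknown word"


def build_vocabulary(documents):
    """Assign each distinct word a unique integer ID, starting at 1."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(document, vocab, max_len=MAX_DOCUMENT_LENGTH):
    """Map a document to exactly max_len IDs: truncate long ones, pad short ones."""
    ids = [vocab.get(word, PAD_ID) for word in document.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


docs = ["develop utility sas macros", "supervise and mentor junior staff"]
vocab = build_vocabulary(docs)
vectors = [encode(d, vocab) for d in docs]

# Persist the vocabulary so the same word gets the same ID in later runs.
vocab_path = os.path.join(tempfile.gettempdir(), "vocab.json")  # hypothetical location
with open(vocab_path, "w") as f:
    json.dump(vocab, f)
with open(vocab_path) as f:
    reloaded = json.load(f)
```

Reloading the saved mapping before a later run, rather than rebuilding it, is what keeps the IDs stable even if the documents arrive in a different order.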
We’ll come back to the model, but first an interlude:

Our experience getting the model up and running


First it crashed
And then it crashed again….

(Ubuntu 15 bug from install)


And again…
Then it burned

(our model didn’t perform so well)


What is a Convolutional Neural Network?
What is a Convolution?
• A sliding window function applied to a matrix.
• The window is called a kernel, patch, or filter.

Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
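
A minimal NumPy sketch of the sliding window may help make this concrete. The function name convolve2d_valid and the input values are illustrative assumptions, not taken from the slides. As in most neural network libraries, the kernel is applied without flipping (strictly speaking a cross-correlation), and the window never extends past the edge of the matrix.

```python
import numpy as np


def convolve2d_valid(matrix, kernel):
    """Slide the kernel over the matrix and sum the element-wise products."""
    m_rows, m_cols = matrix.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((m_rows - k_rows + 1, m_cols - k_cols + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = matrix[i:i + k_rows, j:j + k_cols]
            out[i, j] = np.sum(window * kernel)
    return out


# Illustrative 5x5 input and 3x3 kernel (assumed values).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```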
What is a Convolution?

Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Strides
• Stride = number of pixels/words/characters to shift when looping over the input matrix

https://www.udacity.com/course/deep-learning--ud730
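
To see the effect of stride on text, the sketch below lists the windows a filter would visit over a sequence of words. The helper sliding_windows is a hypothetical name, and the window width of 3 is an assumed value; the words come from the sample job description.

```python
def sliding_windows(tokens, window_size, stride):
    """Return the windows a filter of width window_size visits at the given stride."""
    last_start = len(tokens) - window_size
    return [tokens[start:start + window_size]
            for start in range(0, last_start + 1, stride)]


tokens = "develop utility sas macros for use in project programs".split()

dense = sliding_windows(tokens, window_size=3, stride=1)   # 7 windows
coarse = sliding_windows(tokens, window_size=3, stride=2)  # 4 windows

# Number of windows is (len(tokens) - window_size) // stride + 1.
print(len(dense), len(coarse))  # 7 4
```

A larger stride visits fewer windows, which is cheaper to compute but skips some word combinations entirely.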
Embeddings
“The cat purrs”

“The cat hunts mice”

“The kitten purred”


Do kittens hunt mice?
Word2vec math
• Since embeddings are vectors of floats:

Puppy – Dog + Cat = Kitten
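
The arithmetic can be tried on toy vectors. The two-dimensional values below are hand-picked for illustration (one axis loosely meaning "dog vs. cat", the other "young vs. adult"); they are not learned embeddings and do not come from word2vec.

```python
import numpy as np

# Hand-picked toy vectors (assumed values, not learned from data).
embeddings = {
    "dog":    np.array([1.0, 0.0]),
    "puppy":  np.array([1.0, 1.0]),
    "cat":    np.array([-1.0, 0.0]),
    "kitten": np.array([-1.0, 1.0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest(vector):
    """Return the word whose embedding is most similar to the given vector."""
    return max(embeddings, key=lambda word: cosine(vector, embeddings[word]))


query = embeddings["puppy"] - embeddings["dog"] + embeddings["cat"]
print(nearest(query))  # kitten
```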


Embeddings
• An alternative to one-hot encoding
• Scales better with thousands of categories, where one-hot values are sparse
• We learned embeddings from scratch instead of using word2vec
https://www.udacity.com/course/deep-learning--ud730
Embeddings

We tweaked the hyperparameter to:

EMBEDDING_SIZE = 32
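
The sketch below shows why an embedding lookup is a compact alternative to one-hot encoding: multiplying a one-hot matrix by the embedding matrix gives the same result as simply indexing its rows. The vocabulary size, the word IDs, and the random weights are assumptions for illustration; in a trained model the weights are learned.

```python
import numpy as np

VOCAB_SIZE = 5       # assumed tiny vocabulary, for illustration only
EMBEDDING_SIZE = 32  # the value given on the slide

rng = np.random.default_rng(0)
# Random placeholder weights; a real model learns these during training.
embedding_matrix = rng.normal(size=(VOCAB_SIZE, EMBEDDING_SIZE))

word_ids = np.array([3, 1, 4])

# One-hot route: a (3, VOCAB_SIZE) matrix that is almost entirely zeros.
one_hot = np.eye(VOCAB_SIZE)[word_ids]
via_one_hot = one_hot @ embedding_matrix

# Embedding route: index the rows directly, no sparse matrix needed.
via_lookup = embedding_matrix[word_ids]
```

Both routes produce the same (3, 32) array, but the lookup never materializes the one-hot matrix, which matters once the vocabulary has thousands of entries.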
Activation Function: non-linearity
• We’ll skim over these details, but we used a ReLU
• It’s a rectified linear unit
• ReLUs are an alternative to sigmoid-shaped functions such as tanh
• Linear where x > 0, 0 where x <= 0
• https://youtu.be/Opg63pan_YQ
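
In NumPy the definition above is a one-liner. The function name relu and the sample inputs are chosen here for illustration.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: x where x > 0, otherwise 0."""
    return np.maximum(0, x)


activations = relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]))
print(activations)  # [0.  0.  0.  0.5 2. ]
```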
Pooling: reduce complexity while maintaining accuracy

• Aggregate windows of values in the matrix to reduce the output
• Average pooling: similar to “blurring” an image
• Max pooling: use the maximum of all values in the neighborhood of each value

https://www.udacity.com/course/deep-learning--ud730
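
Both kinds of pooling can be sketched with one helper. This version assumes non-overlapping windows (the stride equals the window size), which is one common choice rather than something the slides specify; pool2d and the input values are illustrative.

```python
import numpy as np


def pool2d(matrix, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    rows, cols = matrix.shape[0] // size, matrix.shape[1] // size
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = matrix[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = reduce_fn(window)
    return out


feature_map = np.array([[1, 3, 2, 4],
                        [5, 7, 6, 8],
                        [4, 2, 0, 1],
                        [3, 1, 2, 5]])

max_pooled = pool2d(feature_map, 2, np.max)   # keeps the strongest response
avg_pooled = pool2d(feature_map, 2, np.mean)  # smooths, like a blur

print(max_pooled)  # [[7. 8.] [4. 5.]]
print(avg_pooled)  # [[4.  5. ] [2.5 2. ]]
```

Either way the 4x4 input shrinks to 2x2, which is the reduction in output the slide refers to.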
Pooling
• Allows you to reduce the stride and increase accuracy
• Lower strides increase computation cost
• Another hyperparameter!

• Note: windows cannot extend past the edges of the input, so either pad the input with zeros (‘same’ padding, which at stride 1 keeps the output the same size as the input) or accept a smaller output (‘valid’ padding).
Source: http://cs231n.github.io/convolutional-networks/#pool
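
The two padding modes can be compared by computing the output length along one dimension. The formulas follow TensorFlow's documented convention for 'same' and 'valid'; the helper name output_length and the sample row are assumptions for illustration.

```python
import math

import numpy as np


def output_length(input_length, window, stride, padding):
    """Output length along one dimension, following TensorFlow's convention."""
    if padding == "same":
        return math.ceil(input_length / stride)
    if padding == "valid":
        return math.ceil((input_length - window + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")


row = np.array([3, 1, 4, 1, 5])

valid_len = output_length(len(row), window=3, stride=1, padding="valid")  # 3
same_len = output_length(len(row), window=3, stride=1, padding="same")    # 5

# 'same' at stride 1 is equivalent to zero-padding, then running 'valid'.
padded = np.pad(row, 1, mode="constant")  # one zero on each side for a width-3 window
padded_valid_len = output_length(len(padded), window=3, stride=1, padding="valid")  # 5
```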
Softmax: convert scores to probabilities
Returns a NumPy array with the same shape as the input:

import numpy as np

def softmax(x):
    # Exponentiate each score, then normalize so the results sum to 1.
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=0)

scores = [1.0, 2.0, 3.0]
print(softmax(scores))
[0.09003057 0.24472847 0.66524096]
What is a Convolutional Neural Network?
Demo!
Further Steps
Most inaccurate predictions involved phrases like “responsible for supervising the collection of, managing, and reports on <x>”, where x might be “water samples” or “tax records”.

• Add layers by training on sentences instead of paragraphs, and then training on that output.
• Common structure in job descriptions
• Similar to a NN understanding edges, and then shapes
• Split classification into job function and career level as additional layers
• Pre-process with TF-IDF for job functionality
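
As a rough idea of what TF-IDF pre-processing would do with the problem phrases above, the sketch below uses one common unsmoothed variant (term frequency times log(N / document frequency)). The function name tf_idf and the whitespace tokenizer are assumptions, and this is not something the talk implemented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    "responsible for supervising the collection of water samples",
    "responsible for supervising the collection of tax records",
]
weights = tf_idf(docs)
```

Words shared by both descriptions ("responsible", "supervising", "collection") get a weight of zero, while the distinguishing words ("water", "samples", "tax", "records") keep a positive weight.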


Questions?
