NLP Text Classification (Week 4)
NLP Machine Learning

Text Classification

Natural Language Processing
1. Automatic or semi-automatic processing of human language
2. Can be used for various applications, such as:
   a. Sentiment Analysis
   b. Intent Classification
   c. Topic Labeling
General Process
Data → Features → Model → Prediction → Output
● Pre-process the data into the desired text format
● Transform the text into vectors (numbers)
● Feed the data to the model
● Set the prediction criteria once the model converges
● Output the class
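A minimal sketch of this pipeline in Python with scikit-learn (an assumed library choice for illustration; the slides do not prescribe a framework for this overview). The toy texts and labels are hypothetical.

# Minimal sketch of the general process: Data -> Features -> Model -> Prediction -> Output.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labeled data (hypothetical): pre-processed text plus a class tag.
texts = ["free prize, claim now", "meeting agenda for monday", "win money fast"]
labels = ["spam", "ham", "spam"]

pipeline = Pipeline([
    ("features", TfidfVectorizer()),   # transform the text into vectors (numbers)
    ("model", MultinomialNB()),        # feed the vectors to the model
])
pipeline.fit(texts, labels)

# Once trained, the model outputs the class for new text.
print(pipeline.predict(["claim your free money"]))   # e.g. ['spam']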
Dataset / Text Corpus
- The collection of texts (and its vocabulary) used to train the model
● Either tagged (for supervised learning) or untagged (for unsupervised learning)
● Size depends on the algorithm used
● Should be pre-processed to remove unwanted characters, convert to the desired format, etc.
Dataset / Text Corpus
- Open-source dataset samples
● Amazon Reviews
● NYSK Dataset (news articles)
● Enron Email Dataset
● Ling-Spam Dataset
Feature Extraction
- Transforms texts to numbers (vector
space model)
- Choices:
● One-hot encoding

● Bag-of-words + TF*IDF

● Word2vec
One-hot encoding
- Creates a binary vector for each word: a 1 at the word's index in the vocabulary, 0 elsewhere
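A minimal sketch of one-hot encoding, assuming a small hypothetical vocabulary:

# One-hot encoding: a binary vector with 1 at the word's index in the vocabulary.
vocab = ["eat", "live", "to"]                      # toy vocabulary (hypothetical)
index = {word: i for i, word in enumerate(vocab)}  # index of each word

def one_hot(word):
    """Return a binary vector with 1 at the word's index, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("live"))  # [0, 1, 0]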
Bag-of-words
- Uses the count of each word in a document as its feature value
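A minimal bag-of-words sketch, assuming scikit-learn's CountVectorizer and a hypothetical two-document corpus:

# Bag-of-words: each document becomes a vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["live to eat", "eat to live to eat"]       # toy corpus (hypothetical)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)            # one row of counts per document

print(vectorizer.get_feature_names_out())          # ['eat' 'live' 'to']
print(counts.toarray())                            # [[1 1 1]
                                                    #  [2 1 2]]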
TF*IDF
- Term Frequency * Inverse Document Frequency
● Words that occur frequently across documents are typically less important and get lower weights (stopwords such as "is", "are", "the", etc.)
● Weights are assigned per word
BOW + TF*IDF
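A minimal sketch of the BOW + TF*IDF combination, assuming scikit-learn (CountVectorizer plus TfidfTransformer) and a hypothetical toy corpus:

# BOW + TF*IDF: tf-idf(t, d) = tf(t, d) * idf(t), where idf downweights words
# that appear in many documents (e.g. stopwords such as "is", "are", "the").
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog ate the food"]  # toy corpus (hypothetical)

counts = CountVectorizer().fit_transform(docs)    # bag-of-words counts
tfidf = TfidfTransformer().fit_transform(counts)  # reweight counts by inverse document frequency

print(tfidf.toarray().round(2))                   # one weighted vector per document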
word2vec
- Uses the weights of the hidden layer of a neural network as the features of the words
● Can predict a word from its context, or the context from a word, based on nearby words in the corpus
● Uses a continuous bag-of-words or skip-gram model with a 1-1-1 (input-hidden-output) neural network

word2vec
- Gives better semantic/syntactic relationships between words through the learned vectors
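A minimal word2vec sketch, assuming the gensim library and a hypothetical toy corpus (a real corpus should be far larger):

# word2vec: the hidden-layer weights learned by the network become the word vectors.
from gensim.models import Word2Vec

# Toy tokenized corpus (hypothetical).
sentences = [
    ["live", "to", "eat"],
    ["eat", "to", "live"],
    ["dogs", "eat", "food"],
]

# sg=0 -> continuous bag-of-words, sg=1 -> skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["eat"][:5])                    # first dimensions of the embedding for "eat"
print(model.wv.most_similar("eat", topn=2))   # nearest words by cosine similarity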
Schedule
Machine Learning Model
- A classifier algorithm that transforms an input into the desired class
● Naive Bayes
● K-nearest neighbors
● Multilayer Perceptron
● Recurrent Neural Network + Long Short-Term Memory
Naive Bayes
- Probabilistic model that relies on word counts
● Uses bag-of-words as features
● Assumes that the position of words doesn't matter and that words are independent of each other
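A minimal Naive Bayes sketch over bag-of-words counts, assuming scikit-learn and hypothetical toy data:

# Naive Bayes over word counts: word order is discarded, matching the independence assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "project meeting at noon", "free money, win now"]  # hypothetical
labels = ["spam", "ham", "spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)         # bag-of-words features
model = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["free prize meeting"])
print(model.predict(test))                  # most probable class
print(model.predict_proba(test).round(3))   # per-class probabilities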
K-Nearest Neighbors
- Assigns the class based on the nearest distances to examples of known classes
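A minimal k-nearest-neighbors sketch, assuming scikit-learn, TF*IDF features, and hypothetical toy data:

# k-NN: the predicted class is the majority class among the k closest training vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

texts = ["cheap pills online", "team standup notes", "buy cheap meds", "sprint planning notes"]  # hypothetical
labels = ["spam", "ham", "spam", "ham"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(knn.predict(vectorizer.transform(["cheap pills for sale"])))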
Multilayer Perceptron
- A feed-forward neural network
● Has at least one hidden layer
● Sigmoid function - binary classification
● Softmax function - multiclass classification
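A minimal multilayer perceptron sketch, assuming scikit-learn's MLPClassifier and hypothetical toy data; scikit-learn uses a logistic (sigmoid) output for binary problems and softmax for multiclass automatically, while "logistic" below is the hidden-layer activation:

# Feed-forward MLP with two hidden layers over TF*IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["claim your prize", "quarterly report attached", "you won a prize", "see report attached"]  # hypothetical
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham (binary classification)

X = TfidfVectorizer().fit_transform(texts)

mlp = MLPClassifier(hidden_layer_sizes=(16, 8), activation="logistic",
                    max_iter=2000, random_state=0)
mlp.fit(X, labels)
print(mlp.predict(X))   # predicted classes for the training texts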
Assessment

Option 1
● Features: BOW + TF*IDF
● ML Algorithm: Naive Bayes
● Pros: Easier to implement
● Cons: Uses word counts instead of word sequence; e.g. "Live to eat" and "Eat to live" may be treated as the same

Option 2
● Features: word2vec word embeddings
● ML Algorithm: Multilayer Perceptron
● Pros: Produces better results, semantically and syntactically
● Cons: Needs a big labeled dataset to perform well
Main Blocks

ML.NET Learning Curve
- Still studying the framework.
- Not as well documented as Python frameworks/libraries.
- Ex.: It has a method called TextCatalog.FeaturizeText(), but there is no indication of the kind of feature extraction it performs.

Supervised Learning Needs Big Data
- We can use open-source datasets for benchmarking.
- But we need datasets with specific labels for the algorithm to work.
Main Blocks

Model Update Criteria
- Retraining the model for every unknown word is impractical.
- Suggestions (a small sketch follows this list):
  - Set a minimum number of occurrences of a new word before the model is retrained.
  - Ignore rare new words, since they may not affect the overall intent, sentiment, or meaning of the text.
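A minimal sketch of the suggested retraining criterion; the threshold value and helper names are assumptions for illustration, not part of the slides:

# Track occurrences of unseen words and retrain only when a new word is frequent enough.
from collections import Counter

known_vocab = {"meeting", "report", "prize", "free"}   # words the current model was trained on
new_word_counts = Counter()                            # occurrences of unseen words since last training

MIN_OCCURRENCES = 20   # assumed threshold before retraining is considered worthwhile

def observe(tokens):
    """Count unseen words; return True when retraining looks worthwhile."""
    for token in tokens:
        if token not in known_vocab:
            new_word_counts[token] += 1
    # Rare new words are ignored; only frequent ones trigger a retrain.
    return any(count >= MIN_OCCURRENCES for count in new_word_counts.values())

if observe("please review the onboarding checklist".split()):
    print("retrain the model")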
Implementation Plan
- Email Cleaner (a rough sketch follows this list)
  - Clean special characters, HTML tags, the header and footer of the email, etc.
  - Set a standard file format (tsv, csv, txt, etc., or transform to bin)
  - Use a spam dataset in the meantime as a benchmark (binary classification)
- Sentence Tokenizer + Feature Extraction
  - Divide emails into sentences + word2vec
- Create Neural Network
  - 1 input, 2 hidden, 1 output layer
  - Activation function: sigmoid
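A rough sketch of the email cleaner and a naive sentence tokenizer; the specific regex rules are assumptions for illustration, not a final design:

# Email cleaner: strip HTML, simple header lines, and special characters, then split into sentences.
import re

def clean_email(raw):
    text = re.sub(r"(?is)<style.*?</style>|<script.*?</script>", " ", raw)       # drop embedded CSS/JS
    text = re.sub(r"<[^>]+>", " ", text)                                          # strip remaining HTML tags
    text = re.sub(r"(?im)^(from|to|subject|date):.*$", " ", text)                 # drop simple header lines
    text = re.sub(r"[^A-Za-z0-9.,!?'\s]", " ", text)                              # remove special characters
    return re.sub(r"\s+", " ", text).strip()                                      # normalize whitespace

def split_sentences(text):
    """Naive sentence tokenizer; a library tokenizer could be used instead."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

raw = "Subject: Hello\n<p>Win a <b>free</b> prize! Claim it today.</p>"
print(split_sentences(clean_email(raw)))   # ['Win a free prize!', 'Claim it today.']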
References
[1] D. Jurafsky and J. H. Martin, Speech and Language Processing. Upper Saddle River, NJ: Pearson Prentice Hall, 2009.
[2] https://developers.google.com/machine-learning/
[3] Various Stack Overflow / Stack Exchange / Kaggle threads
[4] Various Medium posts
