Machine Learning, NLP: Text Classification Using Scikit-learn, Python and NLTK
Latest Update:
I have uploaded the complete code (Python and Jupyter notebook) on GitHub:
https://github.com/javedsha/text-classification
Disclaimer: I am new to machine learning and also to blogging (this is my first post). So, if there are any mistakes, please do let me know. All feedback is appreciated.
4. Running ML algorithms.
i. Open the command prompt in Windows and type ‘jupyter notebook’. This will open the notebook in the browser and start a session for you.
ii. Select New > Python 2. You can give the notebook a name, e.g. Text Classification Demo 1.
iii. Loading the data set: (this might take a few minutes, so be patient)
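A minimal sketch of this step (twenty_train is the variable name used in the rest of the example; the data comes from scikit-learn's built-in 20 newsgroups loader):

from sklearn.datasets import fetch_20newsgroups

# Load only the training split of the 20 newsgroups data set
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)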
Note: Above, we are only loading the training data. We will load the test data separately
later in the example.
iv. You can check the target names (categories) and some data files with the following commands.
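For example (a sketch, assuming the twenty_train object from the previous step):

# Print the 20 category names and peek at the first training document
print(twenty_train.target_names)
print("\n".join(twenty_train.data[0].split("\n")[:3]))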
Scikit-learn has a high-level component, ‘CountVectorizer’, which will create feature vectors for us. More about it here.
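A sketch of using it on the training data (the variable names here are just illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary and build a document-term matrix of raw word counts
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape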
TF: Just counting the number of words in each document has one issue: it will give more weight to longer documents than to shorter ones. To avoid this, we can use the frequency (TF, Term Frequency), i.e. count(word) / total words, in each document.
TF-IDF: Finally, we can even reduce the weight of very common words (the, is, an, etc.) which occur in almost all documents. This is called TF-IDF, i.e. Term Frequency times Inverse Document Frequency.
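Both TF and TF-IDF can be computed with scikit-learn's TfidfTransformer; a sketch (passing use_idf=False would give plain term frequencies instead):

from sklearn.feature_extraction.text import TfidfTransformer

# Re-weight the raw counts by term frequency times inverse document frequency
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape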
The last line will output the dimension of the Document-Term matrix -> (11314,
130107).
You can easily build an NB classifier in scikit-learn with the two lines of code below (note: there are many variants of NB, but a discussion of them is out of scope).
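A sketch of those two lines, training on the TF-IDF matrix from above:

from sklearn.naive_bayes import MultinomialNB

# Train a multinomial Naive Bayes classifier on the TF-IDF features
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)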
Building a pipeline: We can write less code and do all of the above by building a pipeline as follows:
The names ‘vect’, ‘tfidf’ and ‘clf’ are arbitrary but will be used later.
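A sketch of such a pipeline, using the step names mentioned above:

from sklearn.pipeline import Pipeline

# Chain vectorizing, TF-IDF weighting and the classifier into a single estimator
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)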
# Evaluate the NB pipeline on the held-out test split
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)
The accuracy we get is ~77.38%, which is not bad for a start and for a naive classifier. Also, congrats!!! You have now successfully written a text classification algorithm 👍
Support Vector Machines (SVM): Let’s try using a different algorithm, SVM, and see if we can get better performance. More about it here.
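One common choice is a linear SVM trained with scikit-learn's SGDClassifier (hinge loss); a sketch, with ‘clf-svm’ as an illustrative step name:

from sklearn.linear_model import SGDClassifier

# Same feature extraction steps, but with a linear SVM as the classifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, random_state=42))])
text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)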
Here, we are creating a list of parameters for which we would like to do performance tuning. All the parameter names start with the pipeline step name (remember the arbitrary names we gave). E.g. vect__ngram_range: here we are telling it to try unigrams and bigrams and choose whichever is optimal.
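A sketch of such a parameter grid (each key is the step name, a double underscore, then the parameter name):

# Candidate settings to try for each pipeline step
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}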
Next, we create an instance of the grid search by passing the pipeline, the parameters and n_jobs=-1, which tells it to use multiple cores on the user's machine.
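A sketch, using the NB pipeline text_clf from above:

from sklearn.model_selection import GridSearchCV

# Try every parameter combination, using all available CPU cores
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)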
This might take a few minutes to run depending on the machine configuration.
Lastly, to see the best mean score and the params, run the following code:
gs_clf.best_score_
gs_clf.best_params_
The accuracy has now increased to ~90.6% for the NB classifier (not so naive
anymore! 😄) and the corresponding parameters are {‘clf__alpha’: 0.01,
‘tfidf__use_idf’: True, ‘vect__ngram_range’: (1, 2)}.
Similarly, we get an improved accuracy of ~89.79% for the SVM classifier with the code below.
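A sketch of the analogous grid search for the SVM pipeline (the ‘clf-svm’ prefix matches the step name assumed earlier):

# Same idea for the SVM pipeline
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)
gs_clf_svm.best_score_
gs_clf_svm.best_params_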
Note: You can further optimize the SVM classifier by tuning other parameters. This is left for you to explore further.
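1. Removing stop words: very common words like ‘the’, ‘is’ and ‘an’ can be filtered out in the vectorizer; do this only when such words are not useful for your underlying problem. A sketch of the NB pipeline with stop-word removal (reusing the step names from before):

# Tip 1: filter common English stop words in the vectorizer
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])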
This is the pipeline we built for the NB classifier with stop words removed. Run the remaining steps as before. This improves the accuracy from 77.38% to 81.69% (which is quite good). You can try the same for SVM and also while doing grid search.
2. FitPrior=False: When set to false for MultinomialNB, a uniform prior will be used. This doesn't help that much, but it increases the accuracy from 81.69% to 82.14% (not much gain). Try it and see if it works for your data set.
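For example (a sketch, keeping the stop-word filtering from the previous tip):

# Tip 2: use a uniform class prior instead of one learned from the data
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(fit_prior=False))])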
3. Stemming: We need NLTK, which can be installed from here. NLTK comes with various stemmers (details of how stemmers work are out of scope for this article) which can help reduce words to their root form. Again, use this only if it makes sense for your problem.
Below I have used the Snowball stemmer, which works very well for the English language.
import nltk
nltk.download('stopwords')   # the Snowball stemmer below uses the stop word corpus

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Wrap the default analyzer so every token is reduced to its stem
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

# Pipeline with the stemming vectorizer (assumed to mirror the earlier NB pipeline)
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior=False))])
text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)

predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)
np.mean(predicted_mnb_stemmed == twenty_test.target)
The accuracy we get with stemming is ~81.67%, a marginal improvement in our case with the NB classifier. You can also try it out with SVM and other algorithms.
Update: If anyone tries a different algorithm, please share the results in the comments section; it will be useful for everyone.
Please let me know if there were any mistakes; feedback is welcome ✌
References:
http://scikit-learn.org/ (code)