Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK.
Javed Shaikh
Jul 24, 2017 · 7 min read


Latest Update:
I have uploaded the complete code (Python and Jupyter notebook) on GitHub:
https://github.com/javedsha/text-classification

Document/Text classification is one of the important and typical tasks in supervised machine learning (ML). Assigning categories to documents, which can be web pages, library books, media articles, galleries etc., has many applications, such as spam filtering, email routing and sentiment analysis. In this article, I would like to demonstrate how we can do text classification using python, scikit-learn and a little bit of NLTK.


Disclaimer: I am new to machine learning and also to blogging (this is my first post). So, if there are any mistakes, please do let me know. All feedback is appreciated.

Let’s divide the classification problem into the steps below:

1. Prerequisite and setting up the environment.

2. Loading the data set in jupyter.

3. Extracting features from text files.

4. Running ML algorithms.

5. Grid Search for parameter tuning.

6. Useful tips and a touch of NLTK.

Step 1: Prerequisite and setting up the environment


The prerequisites to follow this example are python version 2.7.3 and jupyter notebook. You can just install anaconda and it will get everything for you (a pip alternative is sketched below). Also, a little bit of python and ML basics, including text classification, is required. We will be using scikit-learn (python) libraries for our example.
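
If you prefer pip over Anaconda, installing just the packages used in this article should be enough (a sketch; pick versions appropriate for your setup):

pip install scikit-learn nltk jupyter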

Step 2: Loading the data set in jupyter.


The data set we will be using for this example is the famous “20 Newsgroups” data set. About the data, from the original website:

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup


documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my
knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning
to filter netnews paper, though he does not explicitly mention this collection. The 20
newsgroups collection has become a popular data set for experiments in text applications
of machine learning techniques, such as text classification and text clustering.

This data set is built into scikit-learn, so we don’t need to download it explicitly.

i. Open the command prompt in windows and type ‘jupyter notebook’. This will open the notebook in the browser and start a session for you.

ii. Select New > Python 2. You can give a name to the notebook - Text Classification
Demo 1

iii. Loading the data set: (this might take a few minutes, so be patient)

from sklearn.datasets import fetch_20newsgroups


twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

Note: Above, we are only loading the training data. We will load the test data separately
later in the example.

iv. You can check the target names (categories) and some data files with the following commands.

twenty_train.target_names #prints all the categories


print("\n".join(twenty_train.data[0].split("\n")[:3])) #prints first
line of the first data file
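
You can also check how many documents were loaded and how the labels are stored; the targets are integer ids that index into target_names (a small sketch):

len(twenty_train.data) #number of training documents (11314)
twenty_train.target[:10] #first ten labels as integer ids
twenty_train.target_names[twenty_train.target[0]] #category name of the first document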

Step 3: Extracting features from text files.


Text files are actually sequences of words (ordered). In order to run machine learning algorithms we need to convert the text files into numerical feature vectors. We will be using the bag of words model for our example. Briefly, we segment each text file into words (for English, splitting by space), count the number of times each word occurs in each document and finally assign each word an integer id. Each unique word in our dictionary will correspond to a feature (descriptive feature).

Scikit-learn has a high-level component which will create feature vectors for us: ‘CountVectorizer’. More about it here.

from sklearn.feature_extraction.text import CountVectorizer


count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

Here, by doing ‘count_vect.fit_transform(twenty_train.data)’, we are learning the vocabulary dictionary, and it returns a Document-Term matrix of shape [n_samples, n_features].
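
You can peek at the vocabulary that was just learned; the word ‘algorithm’ below is only an example token (any word from the corpus works):

count_vect.vocabulary_.get('algorithm') #returns the integer id assigned to this word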

TF: Just counting the number of words in each document has one issue: it will give more weightage to longer documents than to shorter documents. To avoid this, we can use the frequency (TF - Term Frequencies), i.e. #count(word) / #Total words, in each document.

TF-IDF: Finally, we can even reduce the weightage of more common words like (the, is, an etc.) which occur in all documents. This is called TF-IDF, i.e. Term Frequency times Inverse Document Frequency.
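
As a rough sketch of the underlying formulas (scikit-learn’s TfidfTransformer additionally smooths the idf term and normalizes each document vector by default):

tf(w, d) = (count of w in d) / (total words in d)
idf(w) = log(total documents / documents containing w)
tf-idf(w, d) = tf(w, d) * idf(w)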

We can achieve both using the below lines of code:

from sklearn.feature_extraction.text import TfidfTransformer


tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

The last line will output the dimension of the Document-Term matrix -> (11314,
130107).

Step 4. Running ML algorithms.


There are various algorithms which can be used for text classification. We will start with the simplest one, ‘Naive Bayes (NB)’ (don’t think it is too naive! 😃)

You can easily build an NB classifier in scikit using the below 2 lines of code: (note - there are many variants of NB, but discussion of them is out of scope)

from sklearn.naive_bayes import MultinomialNB


clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

This will train the NB classifier on the training data we provided.
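
As a quick sanity check, we can classify a couple of new documents by pushing them through the same two transforms (the two sentences below are made-up examples):

docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))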


Building a pipeline: We can write less code and do all of the above, by building a
pipeline as follows:

>>> from sklearn.pipeline import Pipeline


>>> text_clf = Pipeline([('vect', CountVectorizer()),
... ('tfidf', TfidfTransformer()),
... ('clf', MultinomialNB()),
... ])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

The names ‘vect’ , ‘tfidf’ and ‘clf’ are arbitrary but will be used later.

Performance of NB Classifier: Now we will test the performance of the NB classifier on the test set.

import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

The accuracy we get is ~77.38%, which is not bad for a start and for a naive classifier. Also, congrats!!! You have now successfully written a text classification algorithm 👍
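
Accuracy is a single number; if you want a per-category breakdown of precision, recall and F1, scikit-learn’s metrics module helps (a small sketch reusing the predictions above):

from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))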

Support Vector Machines (SVM): Let’s try using a different algorithm, SVM, and see if we can get any better performance. More about it here.

>>> from sklearn.linear_model import SGDClassifier

>>> text_clf_svm = Pipeline([('vect', CountVectorizer()),
...                          ('tfidf', TfidfTransformer()),
...                          ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
...                                                    alpha=1e-3, n_iter=5,
...                                                    random_state=42)),
...                          ])
# note: in newer scikit-learn versions the n_iter argument was removed; use max_iter there

>>> _ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

>>> predicted_svm = text_clf_svm.predict(twenty_test.data)
>>> np.mean(predicted_svm == twenty_test.target)


The accuracy we get is ~82.38%. Yippee, a little better 👌

Step 5. Grid Search


Almost all classifiers have various parameters which can be tuned to obtain optimal performance. Scikit-learn provides an extremely useful tool for this: ‘GridSearchCV’.

>>> from sklearn.model_selection import GridSearchCV


>>> parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
... 'tfidf__use_idf': (True, False),
... 'clf__alpha': (1e-2, 1e-3),
... }

Here, we are creating a list of parameters for which we would like to do performance tuning. All the parameter names start with the pipeline step name (remember the arbitrary names we gave). E.g. vect__ngram_range: here we are telling it to try both unigrams and bigrams and choose whichever is optimal.

Next, we create an instance of the grid search by passing the classifier, the parameters and n_jobs=-1, which tells it to use multiple cores of the user's machine.

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)


gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

This might take a few minutes to run depending on the machine configuration.

Lastly, to see the best mean score and the params, run the following code:

gs_clf.best_score_
gs_clf.best_params_

The best score has now increased to ~90.6% for the NB classifier (not so naive anymore! 😄) and the corresponding parameters are {‘clf__alpha’: 0.01, ‘tfidf__use_idf’: True, ‘vect__ngram_range’: (1, 2)}.
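
Note that gs_clf.best_score_ is a cross-validated score on the training data, not the test-set accuracy. Since GridSearchCV refits the best parameter combination on the full training set by default, the fitted gs_clf can be used directly as a classifier; a minimal sketch for checking the tuned NB model on the held-out test set:

predicted_gs = gs_clf.predict(twenty_test.data)
np.mean(predicted_gs == twenty_test.target)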

Similarly, we get an improved score of ~89.79% for the SVM classifier with the below code. Note: You can further optimize the SVM classifier by tuning other parameters. This is left for you to explore.

>>> from sklearn.model_selection import GridSearchCV


>>> parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
... 'tfidf__use_idf': (True, False),
... 'clf-svm__alpha': (1e-2, 1e-3),
... }
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)
gs_clf_svm.best_score_
gs_clf_svm.best_params_

Step 6: Useful tips and a touch of NLTK.


1. Removing stop words: (the, then etc.) from the data. You should do this only when stop words are not useful for the underlying problem. In most text classification problems, stop words are indeed not useful. Let’s see if removing them increases the accuracy. Update the code for creating the CountVectorizer object as follows:

>>> from sklearn.pipeline import Pipeline

>>> text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
...                      ])

This is the pipeline we built for the NB classifier. Run the remaining steps like before. This improves the accuracy from 77.38% to 81.69% (that is quite good). You can try the same for SVM and also while doing grid search.

2. fit_prior=False: When set to false for MultinomialNB, a uniform prior will be used. This doesn’t help that much, but it increases the accuracy from 81.69% to 82.14% (not much gain). Try it and see if it works for your data set; the change is sketched below.
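
For reference, here is the only change needed in the NB pipeline (a sketch; every other step stays the same):

>>> text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB(fit_prior=False)),
...                      ])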

3. Stemming: From Wikipedia, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. E.g. a stemming algorithm reduces the words “fishing”, “fished”, and “fisher” to the root word, “fish”.

We need NLTK, which can be installed from here. NLTK comes with various stemmers (details on how stemmers work are out of scope for this article) which can help reduce words to their root form. Again, use this if it makes sense for your problem.


Below I have used the Snowball stemmer, which works very well for the English language.

import nltk
nltk.download()

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior=False)),
                             ])

text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)

predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)

np.mean(predicted_mnb_stemmed == twenty_test.target)

The accuracy with stemming we get is ~81.67%, a marginal improvement in our case with the NB classifier. You can also try it out with SVM and other algorithms.

Conclusion: We have worked through a classic problem in NLP, text classification. We learned about important concepts like bag of words, TF-IDF and two important algorithms, NB and SVM. We saw that for our data set, both algorithms were almost equally matched when optimized. Sometimes, if the data set is large enough, the choice of algorithm can make hardly any difference. We also saw how to perform grid search for performance tuning and used the NLTK stemming approach. You can use this code on your data set and see which algorithms work best for you.

Update: If anyone tries a different algorithm, please share the results in the comments section; it will be useful for everyone.

Please let me know if there were any mistakes; feedback is welcome ✌


Recommend, comment, share if you liked this article.

References:
http://scikit-learn.org/ (code)

http://qwone.com/~jason/20Newsgroups/ (data set)
