
Identifying Tags from Millions of Text Questions

Chintan Parikh, [email protected]

Abstract—Identifying tags or keywords from text is an important class of application in text data mining. On question-and-answer sites such as Stack Overflow or Quora, tagging allows users to explore related content, build and showcase expertise in a given area, and in general gain more visibility for the question at hand. In this paper I take on the problem of identifying tags for questions asked on Stack Exchange sites, based on the title and body of the question. Vowpal Wabbit is used to build a set of discriminative classifiers, one for each tag in the training set; the tags for each test question are then predicted by running it through each of the classifiers.

Index Terms—Machine Learning, Clustering, Keyword Extraction, Text Analysis
I. BACKGROUND AND MOTIVATION
Tagging has become a popular way to categorize text and non-text information. With the advent of Twitter hashtags, people have started promoting the use of tags to categorize content and find related content easily. Stack Overflow is a popular site for discussing programming-related questions, and it now has a dataset of over 6 million questions. On Stack Overflow a user can apply up to five tags to each question, using existing tags or, in certain cases, creating a new one; the ability to create new tags is restricted by a reputation requirement.

In this problem, I look at the dataset obtained from the Kaggle competition [1], which contains questions and their associated tags from Stack Exchange sites. The training dataset has over 6 million questions with 42,049 unique tags, and each question has an average of 2.9 tags. Figure 1 shows one example from the training set with its associated tags.

The problem statement is to predict the tags for a test set of over 2 million questions using only the model learned from the training set.

The top ten tags, sorted by frequency, are shown in Table 1. The top 10 tags account for 17.26% of all the tags; similarly, the top 100 tags account for around 40% of all tags. This follows a power-law distribution, as shown in Figure 2.

Figure 2: Tag distribution

Tag          Occurrences   Percentage
c# 463,526 2.66
java 412,189 2.36
php 392,451 2.25
javascript 365,623 2.1
android 320,622 1.84
jquery 305,614 1.75
c++ 199,280 1.14
python 184,928 1.06
iphone 183,573 1.05
asp.net 177,334 1.02

Table 1: Top ten tags

Figure 1:Question with tags (c#, asp.net-mvc, linq, lambda)


II. APPROACH

This section details the approach tried out for predicting the tags for each question. In this discussion I consider only the Top500 tags, since they account for 61.8% of all the tags and this also speeds up the process of testing multiple hypotheses.
A. Setting up the baseline:
Kaggle supplied multiple baselines for comparing results. One, with a mean F1 score of 0.07, simply predicts the five most frequent tags (c#, java, python, php and javascript) for every question. To set up another naïve baseline, the data set was first cleaned by removing XML tags, punctuation and common English stop words using the Natural Language Toolkit [3]. The cleaned text was tokenized and fed through a simple classifier, which searches the post for each tag keyword and, if it is present, predicts that tag; the matches are then aggregated and the top three tags are selected for each question. This approach results in a mean F1 score of 0.19.
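A minimal sketch of this keyword-matching baseline is shown below. The helper names are hypothetical, and ranking matched tags by their training-set frequency is one plausible reading of the aggregation step; it assumes the NLTK "stopwords" corpus has been downloaded.

-------------------------------------------------------------------
import re
from nltk.corpus import stopwords   # Natural Language Toolkit [3]

STOP = set(stopwords.words("english"))

def clean(post):
    # Strip XML tags; keep word chars plus '#', '+', '.' so tags
    # like c# and asp.net survive; then drop English stop words.
    text = re.sub(r"<[^>]+>", " ", post.lower())
    text = re.sub(r"[^\w#+.\s]", " ", text)
    return [w for w in text.split() if w not in STOP]

def baseline_tags(post, tag_counts, k=3):
    # Predict a tag whenever its keyword appears verbatim in the
    # post, then keep the k most frequent matches.
    tokens = set(clean(post))
    matches = [t for t in tag_counts if t in tokens]
    matches.sort(key=lambda t: tag_counts[t], reverse=True)
    return matches[:k]
-------------------------------------------------------------------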
B. Building discriminative classifiers
Two approaches were identified for tag prediction: the first was to build a global multiclass classifier and use a method such as One-vs-All to select the final class; the second was to build a discriminative classifier for each tag and predict the final tags by choosing the most likely ones. Since we need to predict more than one tag per question, it was decided to use the second approach. The block diagram below explains the approach.

Figure X: Training of classifier for each tag

To build the discriminative classifiers, Vowpal Wabbit (vw) [4] was selected. It provides several loss functions as well as learning algorithms. vw accepts a sparse input format which easily accommodates bag-of-words models. vw also hashes the feature names into a 2^18-1 space, thereby reducing the dimension and allowing faster lookups.
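For illustration, a few lines of one tag's training file in vw's input format might look as follows. This is a hedged sketch: the feature words and the "text" namespace are invented, and the +1/-1 labels mark questions that do or do not carry the tag.

-------------------------------------------------------------------
+1 |text initquestion how do i read an xmldocument in a winform app
-1 |text printf output garbled when compiling with gcc
-------------------------------------------------------------------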
Feature Selection:
Feature selection was done using the vw wrapper vw-varinfo, which exposes all variables of a model in human-readable form. The output includes the input variable names (including name-spaces where applicable), the vw hash value, the range [min, max] of the variable values in the training set, the final model (regressor) weight, and the relative distance of each variable from the best constant prediction.

Using this output, we learn the relative importance of the words and can remove the features with 0% relative importance, so as to reduce the model size. The table below shows the features with the highest and lowest relative scores for the c# model.

Feature          Rel score    Feature             Rel score
c#               100.00%      qt                  -80.51%
initquestion     100.00%      autoslideinterval   -76.03%
winform          54.71%       andriod             -76.03%
copypathlist     51.33%       printf              -72.89%
xmldocument      46.29%       xcode               -70.12%
addin            43.13%       rails               -67.70%
button_neutral   43.04%       wikiversity         -65.32%
csharp           40.42%       gcc                 -65.32%
enumerable       39.48%       boost               -62.96%
containskey      38.08%       java                -61.70%

Table 2: Relative importance of features for the c# classifier

Choosing loss function:
For the classification task, vw supports two suitable loss functions, 'hinge' (SVM) and 'logistic'. The default regularization parameters and a learning rate of 0.1 give the best result for the SVM classifier. Figures 3 and 4 compare the performance of the SVM and logistic classifiers, on the metric of precision and recall, for the classifiers built for the Top100 tags.
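Assuming a training file csharp_train.vw in the format sketched earlier, one plausible pair of invocations for the two loss functions is shown below. The file and model names are hypothetical; --loss_function (loss), -b (hash-bit width), -l (learning rate) and -f (output model) are standard vw options.

-------------------------------------------------------------------
vw csharp_train.vw --loss_function hinge    -b 18 -l 0.1 -f csharp_svm.model
vw csharp_train.vw --loss_function logistic -b 18 -l 0.1 -f csharp_log.model
-------------------------------------------------------------------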

Choosing input samples for building models:
The SVM classifiers are sensitive to the ratio of positive (M) training documents to negative (K) training documents. A previous study [5] suggests that a discriminative model produces good results for a class with 60 positive and 1500 negative examples. In the initial phase of training, these fixed values were used to create the training set for each classifier, but the resulting total sample size of 2100 was not sufficient and produced high variance.

The approach implemented in the final classifier, which gave the maximum precision and recall, is discussed below:
Figure 3: Logistic loss function, mean P = 0.64, R = 0.42

Figure 4: Hinge loss function, mean P = 0.76, R = 0.72


-------------------------------------------------------------------
For each of the tags:
- Get the occurrence value (O).
- Choose M to be O/2 and K to be 25*M.
- Each of the M and K examples is chosen from questions with more
  than 800 characters, so as to have enough information in the model.
-------------------------------------------------------------------
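A sketch of this sampling scheme (hypothetical names; `questions` stands for an iterable of (text, tag set) pairs from the training data):

-------------------------------------------------------------------
import random

def build_training_sample(tag, questions, min_chars=800):
    # Keep only questions long enough to carry useful signal.
    pool = [(text, tags) for text, tags in questions
            if len(text) > min_chars]
    pos = [text for text, tags in pool if tag in tags]
    neg = [text for text, tags in pool if tag not in tags]
    m = len(pos) // 2   # M = O/2, where O is the occurrence count
    k = 25 * m          # K = 25*M negatives
    return (random.sample(pos, m),
            random.sample(neg, min(k, len(neg))))
-------------------------------------------------------------------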
C. Tag Suggestion
Once we have the models built for each of the Top500 tags, the next step is to predict multiple tags for each question. For this we use the algorithm below:

-------------------------------------------------------------------
Step 1: For each question in the test set:
- Run it through all the classifiers in the Top500 set.
- Add the SVM output of each classifier to a list.
Step 2: Sort the list to get the maximum 5 outputs for the
particular question, and assign these tags to the question.
-------------------------------------------------------------------

The output of the vw classifier is between -1 and 1; we choose the top 5 values in the list that are above a fixed threshold (set to 0.1) as our final tag output.
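A sketch of this suggestion step (hypothetical interface; each entry of `classifiers` maps a tag to a function returning that classifier's raw vw output for the question):

-------------------------------------------------------------------
def suggest_tags(question, classifiers, threshold=0.1, max_tags=5):
    # Score the question with every Top500 classifier (outputs lie
    # in [-1, 1]), then keep the top five scores that clear the
    # fixed threshold.
    scores = sorted(((tag, score(question))
                     for tag, score in classifiers.items()),
                    key=lambda pair: pair[1], reverse=True)
    return [tag for tag, s in scores[:max_tags] if s > threshold]
-------------------------------------------------------------------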
III. RESULTS

From the above section it is clear that the hinge loss function performs much better than the logistic loss function, hence it was used in the final classifier.

A. Evaluation metric:
Mean F1 score is used as the evaluation metric; it measures accuracy using the precision p and recall r. Precision is the ratio of true positives (tp) to all predicted positives (tp + fp). Recall is the ratio of true positives (tp) to all actual positives (tp + fn).

F1 = 2pr / (p + r)

So in order to maximize the F1 score, the algorithm should maximize both recall and precision simultaneously.
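The same definitions as a small helper, a straightforward transcription of the formulas above:

-------------------------------------------------------------------
def f1_score(tp, fp, fn):
    # Precision: fraction of predicted tags that are correct.
    p = tp / (tp + fp)
    # Recall: fraction of actual tags that are recovered.
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)
-------------------------------------------------------------------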
 

B. Some results from the tag suggestion

Original Tags -> Suggested Tags
php, image-processing, file-upload, upload, mime-types -> image, file, php
firefox -> firefox, windows
r, matlab, machine-learning -> ubuntu, apache, networking
c#, url, encoding -> c#, string, json
php, api, file-get-contents -> php, api, file
core-plot -> ios, iphone
c#, asp.net, windows-phone-7 -> windows, asp.net, c#
.net, javascript, code-generation -> javascript, c#, linq
visual-studio, makefile, gnu -> visual-studio, file
html, semantic, line-breaks -> html

Table 3: Original and suggested tags for questions from the test set

Note that in the above results the tags are predicted from the Top500 tags classifier set. We can see that the top tags such as c# and php are predicted with very high accuracy, whereas lower-occurring tags which are not part of the Top500 set are either missed or predicted as some synonym from the Top500 set, for example makefile -> file and windows-phone-7 -> windows.

C. Classifiers performance

From Figure 4 we see that we can build a fairly accurate classifier, with a mean precision of 0.76 and recall of 0.72. This value is obtained from a test set of 100,000 samples which were not part of the training set.

The tags with the lowest precision in the Top100 set were: file, windows, forms, list, api, oop and class. Their precision and recall are shown in the table below.

Tag        Precision   Recall
file       0.2044      0.5350
windows    0.3284      0.7286
forms      0.3526      0.7404
list       0.2594      0.7447
oop        0.3620      0.7159
api        0.2142      0.7138

Table 4: Tags with lowest precision values in Top100

The classifiers in the above list have noticeably low precision and higher recall values. This could mean that the algorithm is a bit too liberal in making the classification, leading to lower precision. This is plausible for tags which occur generically in the context of multiple other tags, for example api, list and file: these have multiple connotations and do not belong to one particular language or tag set.

D. Suggested tags performance
To measure the performance of the entire algorithm for predicting the suggested tags, we ran a test set of 100,000 samples through the Top500 classifier set. Following is the result of the run:

TP = 53219    P  = 0.6439
FP = 29426    R  = 0.2551
FN = 206221   F1 = 0.3648

Table 5: Final result of F1 score

The low recall is somewhat expected, as we are not classifying from the entire set of 43k tags.

E. Kaggle Submission:
The algorithm described in the above section was submitted to Kaggle, where the test set contains over 2 million test questions. The competition is particularly intense, as Facebook is conducting it for recruiting. The competition ends on 12/20/2013, and as of 12/12 the algorithm described above stood 74th out of 310 total teams.

The mean F1 score of the submission using the methods described above was 0.71132, compared to the top score of 0.81539. Note that the high F1 values compared to the above results are largely due to overlap of the test set with the training data set: for questions in the test set that also belonged to the training set, the same tags were predicted for them.

IV. CONCLUSION

Vowpal Wabbit was used extensively in the development of the classifiers, and its sparse input format, hashing trick and particularly the vw-varinfo wrapper have been very useful for debugging the models and coming up with valid features. The hinge loss function works much better than the logistic loss function.

As discussed in the earlier sections, it is possible to build a highly accurate classifier for each of the tags in the training set. The precision is higher for specific tags such as php and python, and it decreases for generic tags such as file, java etc. The results show that an average precision of 0.76 is obtained for the tags in the Top500 set. The recall is particularly low in the results, since we are not predicting tags from the entire tag set.

V. FUTURE WORK

The next immediate thing to try is to build classifier sets for the Top2000 tags, the Top10000 tags and all the tags, and see how the precision and recall values vary.

The expectation is that the mean F1 score should go up by a few percentage points.

The other thing to try could be to add more features, so as to improve accuracy on the existing Top500 tags. We could also look at techniques such as LDA to produce a list of topics for each document, which could then be used as features; a sketch of that idea follows.
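As an illustrative sketch only (this is not part of the present work; scikit-learn names, assuming the raw question texts are in `docs`), per-document LDA topic proportions could be derived and appended as extra features:

-------------------------------------------------------------------
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_features(docs, n_topics=50):
    # Bag-of-words counts, then per-document topic proportions from
    # LDA; each row could be passed to vw as extra numeric features.
    counts = CountVectorizer(max_features=20000,
                             stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=0)
    return lda.fit_transform(counts)
-------------------------------------------------------------------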
REFERENCES
[1] Kaggle, Facebook Recruiting III - Keyword Extraction competition, http://kaggle.com/c/facebook-recruiting-iii-keyword-extraction
[2] Kaggle leaderboard, http://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/leaderboard
[3] S. Bird, E. Loper and E. Klein, Natural Language Processing with Python. O'Reilly Media Inc., 2009.
[4] J. Langford, L. Li and A. Strehl, Vowpal Wabbit online learning project, http://hunch.net/?p=309, 2007.
[5] J. Wang and B. D. Davison, "Explorations in tag suggestion and query expansion," in Proceedings of the 2008 ACM Workshop on Search in Social Media (SSM '08), New York, NY, USA: ACM, 2008.
[6] A. Saha, R. Saha and K. Schneider, "A discriminative model approach for suggesting tags."
