
Identifying Tags from Millions of Text Questions

Chintan Parikh, [email protected]

Abstract—Identifying tags or keywords from text is an important class of application in text data mining. On question-and-answer sites such as Stack Overflow or Quora, tagging allows users to explore related content, build and showcase expertise in a given area, and in general gain more visibility for the question at hand. In this paper I take on the problem of identifying tags for questions asked on Stack Exchange sites, based on the title and body of the question. Vowpal Wabbit is used to build a set of discriminative classifiers, one for each tag in the training set; the tags for each test question are then predicted by running it through each of the classifiers.

Index Terms—Machine Learning, Clustering, Keyword Extraction, Text Analysis
I. BACKGROUND AND MOTIVATION
Tagging has become a popular way to categorize text and non-text information. With the advent of Twitter hashtags, people have started promoting the use of tags to categorize content and find related content easily. Stack Overflow is a popular site for discussing programming-related questions, and it now has a dataset of over 6 million questions. On Stack Overflow a user can apply up to five tags to each question, using existing tags or, in certain cases, creating a new one; the ability to create new tags is restricted by a reputation requirement.

In this problem, I look at the dataset obtained from the Kaggle competition [1], which contains questions and their associated tags from Stack Exchange sites. The training dataset has over 6 million questions with 42,049 unique tags, and each question has an average of 2.9 tags. Figure 1 shows one example from the training set with its associated tags.

The problem statement is to predict the tags for a test set of over 2 million questions using only the model learned from the training set.

The top ten tags, sorted by frequency, are shown in Table 1. The top 10 tags account for 17.26% of all the tags; similarly, the top 100 tags account for around 40% of all tags. This follows a power-law distribution, as shown in Figure 2.

Figure 2: Tag distribution

Tag          Occurrences   Percentage
c# 463,526 2.66
java 412,189 2.36
php 392,451 2.25
javascript 365,623 2.1
android 320,622 1.84
jquery 305,614 1.75
c++ 199,280 1.14
python 184,928 1.06
iphone 183,573 1.05
asp.net 177,334 1.02

Table 1: Top ten tags

Figure 1:Question with tags (c#, asp.net-mvc, linq, lambda)


II. APPROACH

This section details the approach tried out for predicting the tags for each question. In this discussion I consider only the Top500 tags, since they account for 61.8% of all the tags and this also speeds up the process of testing multiple hypotheses.
A. Setting up the baseline:
Kaggle supplied multiple baselines for comparing results. One, with a mean F1 score of 0.07, simply predicts the five most frequent tags (c#, java, python, php and javascript) for every question. To set up another naïve baseline, the data set was first cleaned by removing XML tags, punctuation and common English stop words using the Natural Language Toolkit [3]. The cleaned text was tokenized and fed through a simple classifier, which searches the post for each tag keyword and, if it is present, predicts that tag; the matches are then aggregated and the top three tags are selected for each question. This approach results in a mean F1 score of 0.19.
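A minimal sketch of this keyword-matching baseline is shown below. The helper names are hypothetical, and ranking matched tags by their training-set frequency is one plausible reading of the aggregation step; it assumes the NLTK "stopwords" corpus has been downloaded.

-------------------------------------------------------------------
import re
from nltk.corpus import stopwords   # Natural Language Toolkit [3]

STOP = set(stopwords.words("english"))

def clean(post):
    # Strip XML tags; keep word chars plus '#', '+', '.' so tags
    # like c# and asp.net survive; then drop English stop words.
    text = re.sub(r"<[^>]+>", " ", post.lower())
    text = re.sub(r"[^\w#+.\s]", " ", text)
    return [w for w in text.split() if w not in STOP]

def baseline_tags(post, tag_counts, k=3):
    # Predict a tag whenever its keyword appears verbatim in the
    # post, then keep the k most frequent matches.
    tokens = set(clean(post))
    matches = [t for t in tag_counts if t in tokens]
    matches.sort(key=lambda t: tag_counts[t], reverse=True)
    return matches[:k]
-------------------------------------------------------------------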
B. Building discriminative classifiers
Two approaches were identified for tag prediction: the first was to build a global multiclass classifier and use a method such as One-vs-All to select the final class; the second was to build a discriminative classifier for each tag and predict the final tags by choosing the most likely ones. Since we need to predict more than one tag per question, it was decided to use the second approach. The block diagram below explains the approach.

Figure X: Training of classifier for each tag

To build the discriminative classifiers, Vowpal Wabbit (vw) [4] was selected. It provides several loss functions as well as learning algorithms. vw accepts a sparse input format which easily accommodates bag-of-words models. vw also hashes the feature names into a 2^18-1 space, thereby reducing the dimension and allowing faster lookups.
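For illustration, a few lines of one tag's training file in vw's input format might look as follows. This is a hedged sketch: the feature words and the "text" namespace are invented, and the +1/-1 labels mark questions that do or do not carry the tag.

-------------------------------------------------------------------
+1 |text initquestion how do i read an xmldocument in a winform app
-1 |text printf output garbled when compiling with gcc
-------------------------------------------------------------------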
Feature Selection:
Feature selection was done using the vw wrapper vw-varinfo, which exposes all variables of a model in human-readable form. The output includes the input variable names (including name-spaces where applicable), the vw hash value, the range [min, max] of the variable values in the training set, the final model (regressor) weight, and the relative distance of each variable from the best constant prediction.

Using this output, we learn the relative importance of the words and can remove the features with 0% relative importance, so as to reduce the model size. The table below shows the features with the highest and lowest relative scores for the c# model.

Feature          Rel score    Feature             Rel score
c#               100.00%      qt                  -80.51%
initquestion     100.00%      autoslideinterval   -76.03%
winform          54.71%       andriod             -76.03%
copypathlist     51.33%       printf              -72.89%
xmldocument      46.29%       xcode               -70.12%
addin            43.13%       rails               -67.70%
button_neutral   43.04%       wikiversity         -65.32%
csharp           40.42%       gcc                 -65.32%
enumerable       39.48%       boost               -62.96%
containskey      38.08%       java                -61.70%

Table 2: Relative importance of features for the c# classifier

Choosing loss function:
For the classification task, vw supports two suitable loss functions, 'hinge' (SVM) and 'logistic'. The default regularization parameters and a learning rate of 0.1 give the best result for the SVM classifier. Figures 3 and 4 compare the performance of the SVM and logistic classifiers, on the metric of precision and recall, for the classifiers built for the Top100 tags.
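Assuming a training file csharp_train.vw in the format sketched earlier, one plausible pair of invocations for the two loss functions is shown below. The file and model names are hypothetical; --loss_function (loss), -b (hash-bit width), -l (learning rate) and -f (output model) are standard vw options.

-------------------------------------------------------------------
vw csharp_train.vw --loss_function hinge    -b 18 -l 0.1 -f csharp_svm.model
vw csharp_train.vw --loss_function logistic -b 18 -l 0.1 -f csharp_log.model
-------------------------------------------------------------------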

Choosing input samples for building models:
The SVM classifiers are sensitive to the ratio of positive (M) training documents to negative (K) training documents. A previous study [5] suggests that a discriminative model produces good results for a class with 60 positive and 1500 negative examples. In the initial phase of training, these fixed values were used to create the training set for each classifier, but the resulting total sample size of 2100 was not sufficient and produced high variance.

The approach implemented in the final classifier, which gave the maximum precision and recall, is discussed below:
Figure 3: Logistic loss function, mean P = 0.64, R = 0.42

Figure 4: Hinge loss function, mean P = 0.76, R = 0.72


-------------------------------------------------------------------
For each of the tags:
- Get the occurrence value (O).
- Choose M to be O/2 and K to be 25*M.
- Each of the M and K examples is chosen from questions with more
  than 800 characters, so as to have enough information in the model.
-------------------------------------------------------------------
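A sketch of this sampling scheme (hypothetical names; `questions` stands for an iterable of (text, tag set) pairs from the training data):

-------------------------------------------------------------------
import random

def build_training_sample(tag, questions, min_chars=800):
    # Keep only questions long enough to carry useful signal.
    pool = [(text, tags) for text, tags in questions
            if len(text) > min_chars]
    pos = [text for text, tags in pool if tag in tags]
    neg = [text for text, tags in pool if tag not in tags]
    m = len(pos) // 2   # M = O/2, where O is the occurrence count
    k = 25 * m          # K = 25*M negatives
    return (random.sample(pos, m),
            random.sample(neg, min(k, len(neg))))
-------------------------------------------------------------------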
C. Tag Suggestion
Once we have the models built for each of the Top500 tags, the next step is to predict multiple tags for each question. For this we use the algorithm below:

-------------------------------------------------------------------
Step 1: For each question in the test set:
- Run it through all the classifiers in the Top500 set.
- Add the SVM output of each classifier to a list.
Step 2: Sort the list to get the maximum 5 outputs for the
particular question, and assign these tags to the question.
-------------------------------------------------------------------

The output of the vw classifier is between -1 and 1; we choose the top 5 values in the list that are above a fixed threshold (set to 0.1) as our final tag output.
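A sketch of this suggestion step (hypothetical interface; each entry of `classifiers` maps a tag to a function returning that classifier's raw vw output for the question):

-------------------------------------------------------------------
def suggest_tags(question, classifiers, threshold=0.1, max_tags=5):
    # Score the question with every Top500 classifier (outputs lie
    # in [-1, 1]), then keep the top five scores that clear the
    # fixed threshold.
    scores = sorted(((tag, score(question))
                     for tag, score in classifiers.items()),
                    key=lambda pair: pair[1], reverse=True)
    return [tag for tag, s in scores[:max_tags] if s > threshold]
-------------------------------------------------------------------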
III. RESULTS

From the above section it is clear that the hinge loss function performs much better than the logistic loss function, hence it was used in the final classifier.

A. Evaluation metric:
Mean F1 score is used as the evaluation metric; it measures accuracy using the precision p and recall r. Precision is the ratio of true positives (tp) to all predicted positives (tp + fp). Recall is the ratio of true positives (tp) to all actual positives (tp + fn).

F1 = 2pr / (p + r)

So in order to maximize the F1 score, the algorithm should maximize both recall and precision simultaneously.
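The same definitions as a small helper, a straightforward transcription of the formulas above:

-------------------------------------------------------------------
def f1_score(tp, fp, fn):
    # Precision: fraction of predicted tags that are correct.
    p = tp / (tp + fp)
    # Recall: fraction of actual tags that are recovered.
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)
-------------------------------------------------------------------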
 

B. Some results from the tag suggestion

Original Tags -> Suggested Tags
php, image-processing, file-upload, upload, mime-types -> image, file, php
firefox -> firefox, windows
r, matlab, machine-learning -> ubuntu, apache, networking
c#, url, encoding -> c#, string, json
php, api, file-get-contents -> php, api, file
core-plot -> ios, iphone
c#, asp.net, windows-phone-7 -> windows, asp.net, c#
.net, javascript, code-generation -> javascript, c#, linq
visual-studio, makefile, gnu -> visual-studio, file
html, semantic, line-breaks -> html

Table 3: Original and suggested tags for questions from the test set

Note that in the above results the tags are predicted from the Top500 tags classifier set. We can see that the top tags such as c# and php are predicted with very high accuracy, whereas lower-occurring tags which are not part of the Top500 set are either missed or predicted as some synonym from the Top500 set, for example makefile -> file and windows-phone-7 -> windows.

C. Classifiers performance

From Figure 4 we see that we can build a fairly accurate classifier, with a mean precision of 0.76 and recall of 0.72. This value is obtained from a test set of 100,000 samples which were not part of the training set.

The tags with the lowest precision in the Top100 set were: file, windows, forms, list, api, oop and class. Their precision and recall are shown in the table below.

Tag        Precision   Recall
file       0.2044      0.5350
windows    0.3284      0.7286
forms      0.3526      0.7404
list       0.2594      0.7447
oop        0.3620      0.7159
api        0.2142      0.7138

Table 4: Tags with lowest precision values in Top100

The classifiers in the above list have noticeably low precision and higher recall values. This could mean that the algorithm is a bit too liberal in making the classification, leading to lower precision. This is plausible for tags which occur generically in the context of multiple other tags, for example api, list and file: these have multiple connotations and do not belong to one particular language or tag set.

D. Suggested tags performance
To measure the performance of the entire algorithm for predicting the suggested tags, we ran a test set of 100,000 samples through the Top500 classifier set. Following is the result of the run:

TP = 53219    P  = 0.6439
FP = 29426    R  = 0.2551
FN = 206221   F1 = 0.3648

Table 5: Final result of F1 score

The low recall is somewhat expected, as we are not classifying from the entire set of 43k tags.

E. Kaggle Submission:
The algorithm described in the above section was submitted to Kaggle, where the test set contains over 2 million test questions. The competition is particularly intense, as Facebook is conducting it for recruiting. The competition ends on 12/20/2013, and as of 12/12 the algorithm described above stood 74th out of 310 total teams.

The mean F1 score of the submission using the methods described above was 0.71132, compared to the top score of 0.81539. Note that the high F1 values compared to the above results are largely due to overlap of the test set with the training data set: for questions in the test set that also belonged to the training set, the same tags were predicted for them.

IV. CONCLUSION

Vowpal Wabbit was used extensively in the development of the classifiers, and its sparse input format, hashing trick and particularly the vw-varinfo wrapper have been very useful for debugging the models and coming up with valid features. The hinge loss function works much better than the logistic loss function.

As discussed in the earlier sections, it is possible to build a highly accurate classifier for each of the tags in the training set. The precision is higher for specific tags such as php and python, and it decreases for generic tags such as file, java etc. The results show that an average precision of 0.76 is obtained for the tags in the Top500 set. The recall is particularly low in the results, since we are not predicting tags from the entire tag set.

V. FUTURE WORK

The next immediate thing to try is to build classifier sets for the Top2000 tags, the Top10000 tags and all the tags, and see how the precision and recall values vary.

The expectation is that the mean F1 score should go up by a few percentage points.

The other thing to try could be to add more features, so as to improve accuracy on the existing Top500 tags. We could also look at techniques such as LDA to produce a list of topics for each document, which could then be used as features; a sketch of that idea follows.
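As an illustrative sketch only (this is not part of the present work; scikit-learn names, assuming the raw question texts are in `docs`), per-document LDA topic proportions could be derived and appended as extra features:

-------------------------------------------------------------------
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_features(docs, n_topics=50):
    # Bag-of-words counts, then per-document topic proportions from
    # LDA; each row could be passed to vw as extra numeric features.
    counts = CountVectorizer(max_features=20000,
                             stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=0)
    return lda.fit_transform(counts)
-------------------------------------------------------------------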
REFERENCES
[1] Kaggle, Facebook Recruiting III - Keyword Extraction competition, http://kaggle.com/c/facebook-recruiting-iii-keyword-extraction
[2] Kaggle leaderboard, http://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/leaderboard
[3] S. Bird, E. Loper and E. Klein, Natural Language Processing with Python. O'Reilly Media Inc., 2009.
[4] J. Langford, L. Li and A. Strehl, Vowpal Wabbit online learning project, http://hunch.net/?p=309, 2007.
[5] J. Wang and B. D. Davison, "Explorations in tag suggestion and query expansion," in Proceedings of the 2008 ACM Workshop on Search in Social Media (SSM '08), New York, NY, USA: ACM, 2008.
[6] A. Saha, R. Saha and K. Schneider, "A discriminative model approach for suggesting tags."
