Report in ML
BY
PRITI GUPTA
INDEX
6. FUTURE SCOPE
7. CONCLUSION
Guide for NLP Pre-processing
Python notebook using data from News Headlines Dataset For Sarcasm Detection
Introduction
This notebook is a guide to preprocessing and cleaning text before performing tasks such as sentiment
analysis, summary generation, and next-word prediction. It covers the basic concepts of NLP,
including:
Stemming
Lemmatization
Word embeddings
Bag of words models
We will either be working with custom examples, or using the news-headline-sarcasm dataset for
understanding real-world use cases.
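As a rough illustration of the cleaning and bag-of-words ideas listed above, the following sketch uses only the Python standard library. The stop-word list, the function names and the sample headline are assumptions made for this example and are not taken from the notebook.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "and", "for", "on"}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, split on whitespace,
    and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def bag_of_words(tokens):
    """Map each token to the number of times it occurs."""
    return Counter(tokens)


# Made-up headline used only for demonstration.
headline = "Scientists Discover the Secret of Sleeping: Sleep!"
tokens = preprocess(headline)
bow = bag_of_words(tokens)
print(tokens)
print(bow)
```

A bag-of-words model discards word order and keeps only counts, which is why "sleeping" and "sleep" remain separate entries here; stemming or lemmatization would be the step that merges them.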
OBJECTIVE:
The objective of this project is to develop a machine
learning model from the news headlines dataset, in which
each record consists of three attributes:
is_sarcastic: 1 if the record is sarcastic, otherwise 0
headline: the headline of the news article
article_link: link to the original news article, useful for
collecting supplementary data
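The three attributes can be read into Python along the following lines. This is a minimal sketch that assumes the dataset stores one JSON object per line; the sample lines, the example.com links and the function name parse_records are made up for illustration.

```python
import json

# Made-up sample lines in the assumed one-JSON-object-per-line layout.
SAMPLE_LINES = [
    '{"is_sarcastic": 1, "headline": "area man wins argument with himself", '
    '"article_link": "https://example.com/a"}',
    '{"is_sarcastic": 0, "headline": "city council approves new budget", '
    '"article_link": "https://example.com/b"}',
]


def parse_records(lines):
    """Turn JSON lines into (headline, label, link) tuples, skipping blanks."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        records.append(
            (obj["headline"], int(obj["is_sarcastic"]), obj["article_link"])
        )
    return records


records = parse_records(SAMPLE_LINES)
headlines = [r[0] for r in records]
labels = [r[1] for r in records]
print(headlines)
print(labels)
```

With a real file, the same function could be fed the lines of an opened file object instead of SAMPLE_LINES.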
The model built here is a supervised machine learning
classification model. This guide introduces the concepts of
natural language processing, its techniques, and their
implementation. The aim is to teach these concepts and
apply them to a real data set.
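To make "supervised classification" concrete, here is a small multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. Naive Bayes is chosen here purely for illustration and is not necessarily the model used in this project; the training headlines and labels are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class from labeled examples."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, y in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[y].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Return the class with the highest log-probability for the document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for y in sorted(class_counts):
        score = math.log(class_counts[y] / total_docs)  # class prior
        total_words = sum(word_counts[y].values())
        for tok in doc.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[y][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = y, score
    return best_label


# Invented toy data: 1 = sarcastic, 0 = not sarcastic.
train_docs = [
    "area man heroically finishes entire pizza alone",
    "nation shocked to learn monday follows sunday",
    "city council approves new transit budget",
    "scientists publish study on ocean temperatures",
]
train_labels = [1, 1, 0, 0]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "area man shocked by pizza"))
print(predict(model, "city council publish new study"))
```

Four training examples are far too few for a meaningful model; the point is only to show the train-then-predict shape of a supervised classifier.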
BACKGROUND:
Python is an interpreted high-level general-purpose programming language. Python's design
philosophy emphasizes code readability with its notable use of significant indentation. Its language
constructs as well as its object-oriented approach aim to help programmers write clear, logical code
for small and large-scale projects.
Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms,
including structured (particularly, procedural), object-oriented and functional programming. Python is
often described as a "batteries included" language due to its comprehensive standard library.
1. Text Classification
2. Language Modeling
3. Image Captioning
4. Machine Translation
5. Question Answering
6. Speech Recognition
7. Document Summarization
1. Text Classification
Text classification refers to labeling sentences or documents, such as email spam
classification and sentiment analysis.
Below are some good beginner text classification datasets.
2. Language Modeling
Language modeling involves developing a statistical model for predicting the next word in a
sentence, or the next letter in a word, given whatever has come before. It is a precursor to
tasks like speech recognition and machine translation.
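The idea of predicting the next word from what came before can be sketched with a bigram model, which conditions only on the single previous word. The corpus, the sentence boundary markers and the function names below are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it and how often."""
    follow = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            follow[prev][nxt] += 1
    return follow


def next_word_probs(model, prev):
    """Return the estimated P(next word | prev) as a dict; empty if unseen."""
    counts = model.get(prev.lower())
    if not counts:
        return {}
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def predict_next(model, prev):
    """Return the most likely next word, or None for an unseen word."""
    probs = next_word_probs(model, prev)
    return max(probs, key=probs.get) if probs else None


# Made-up toy corpus.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)
print(predict_next(model, "the"))
print(next_word_probs(model, "sat"))
```

Practical language models condition on much longer contexts, but the counting-and-normalizing step is the same statistical idea.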
3. Image Captioning
Image captioning is the task of generating a textual description for a given image.
Common Objects in Context (COCO). A collection of more than 120 thousand images with
descriptions.
Flickr 8K. A collection of 8 thousand described images taken from Flickr.
4. Machine Translation
Machine translation is the task of translating text from one language to another.
Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and
French.
European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs for a suite of
European languages.
There are a ton of standard datasets used for the annual machine translation challenges;
see:
6. Speech Recognition
TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its
wide use. Spoken American English and associated transcription.
VoxForge. Project to build an open source database for speech recognition.
LibriSpeech ASR corpus. Large collection of English audiobooks taken from LibriVox.
7. Document Summarization
Document summarization is the task of creating a short meaningful description of a larger
document.
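One simple way to approach this task is extractive summarization, which keeps the highest-scoring sentences of the original text. The sketch below scores each sentence by the average frequency of its words; the scoring rule, the stop-word list and the sample text are assumptions chosen for illustration, not a method taken from this report.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "to", "in", "and", "on"}


def content_words(text):
    """Lowercase alphabetic words with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, n_sentences=1):
    """Keep the n sentences whose words are most frequent in the whole text."""
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()
    ]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    keep = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


# Made-up sample document.
doc = (
    "Sarcasm detection is hard. "
    "Sarcasm detection models need labeled headlines. "
    "The weather was nice."
)
print(summarize(doc, 1))
print(summarize(doc, 2))
```

Abstractive summarization, which writes new sentences rather than selecting existing ones, needs a trained generative model and is outside the scope of this sketch.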
Coding
………………………………………………………….
SOFTWARE REQUIREMENT
PYTHON VERSION 3.8
FUTURE SCOPE:
The techniques applied to the news headline dataset extend to
other NLP tasks. Developers can use NLP for speech
recognition, sentiment analysis, translation, grammar
auto-correction while typing, and automated answer generation.
NLP is a challenging field because it deals with human language,
which is extremely diverse and can be expressed in many ways.
CONCLUSION:
From all the experiments, we observed that the PV-DBOW method of paragraph vectors has great
potential in sarcasm detection. The combination of SVM and the PV-DBOW method achieved an
accuracy of 91.26%. With manual features, the machine learning models SVC and RF also
performed well. The results demonstrate the capability of paragraph vectors, and they show that
the data set used is suitable for both the manual-feature and the word-embedding-based
approaches to sarcasm detection. Because the dataset used in this research differs from those of
the works discussed in section 2, it is difficult to compare results directly, but the methods that
were implemented have shown good results for this dataset. In future, larger sarcasm datasets can
be used. Cross-domain experiments are also needed to explore how the model performs in such
situations. Sarcasm is a persistent obstacle to obtaining true opinions. Hence data needs to be
collected from a variety of sources, using word embeddings as well as manual features wherever
applicable. For manual features, the dictionaries can be improved by adding more words, and the
polarity scores of emoticons can be included, since this research considers only the presence of
emoticons. Along with paragraph vectors, deep learning techniques can be used in future.