
Report in ML

The document provides a guide for preprocessing news headline data using machine learning techniques for sarcasm detection. It discusses the dataset containing headlines, links, and sarcasm labels. The objective is to build a supervised machine learning classification model to detect sarcasm. It also covers relevant background information on Python, machine learning, and natural language processing preprocessing concepts like stemming, lemmatization, word embeddings, and bag-of-words. Hardware and software requirements for the project are also listed.

Uploaded by

Priti Gupta

A REPORT ON

NEWS HEADLINE DATASET USING MACHINE LEARNING
22/07/2021
UNDER THE GUIDANCE OF
AISHWARYA SAXENA

BY
PRITI GUPTA

INDEX

SR NO.  TOPIC
1.      INTRODUCTION
2.      OBJECTIVE
3.      BACKGROUND
4.      HARDWARE REQUIREMENT & SOFTWARE REQUIREMENT
5.      CODING
6.      FUTURE SCOPE
7.      CONCLUSION

Guide for NLP Pre-processing
Python notebook using data from News Headlines Dataset For Sarcasm Detection

Introduction
This notebook is a guide to preprocessing and cleaning text before performing tasks such as sentiment
analysis, summary generation, and next-word prediction. It covers the basic concepts of NLP,
including:

• Stemming
• Lemmatization
• Word embeddings
• Bag-of-words models

We will either be working with custom examples, or using the news-headline-sarcasm dataset for
understanding real-world use cases.
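
As a rough illustration of the bag-of-words idea listed above, the sketch below turns a few headlines into word-count vectors using only the Python standard library. The `tokenize` and `bag_of_words` helpers and the two sample headlines are hypothetical, written for this example and not taken from the notebook.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of every token seen in any document.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Made-up headlines, used only to show the shape of the output.
headlines = [
    "Local man wins lottery",
    "Man loses lottery ticket, wins nothing",
]
vocabulary, vectors = bag_of_words(headlines)
```

Each vector has one entry per vocabulary word, so word order is discarded and only the counts remain. That is the defining trade-off of a bag-of-words model.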

Dataset link - https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection?select=Sarcasm_Headlines_Dataset.json

Each record consists of three attributes:

• is_sarcastic: 1 if the record is sarcastic, otherwise 0
• headline: the headline of the news article
• article_link: link to the original news article; useful for collecting supplementary data
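
A minimal sketch of reading records with these three attributes follows. It assumes the file stores one JSON object per line (JSON Lines). The sample records and the `parse_records` helper are hypothetical and only mimic the attribute names described above.

```python
import json

# Hypothetical records in JSON Lines form (one JSON object per line).
# The values are made up for illustration and are not taken from the dataset.
sample = """\
{"is_sarcastic": 1, "headline": "area man heroically finishes entire sandwich", "article_link": "https://example.com/a"}
{"is_sarcastic": 0, "headline": "city council approves new budget", "article_link": "https://example.com/b"}
"""


def parse_records(lines):
    """Parse an iterable of JSON Lines strings into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


records = parse_records(sample.splitlines())
headlines = [record["headline"] for record in records]
labels = [record["is_sarcastic"] for record in records]
```

Under the same assumption about the file layout, an open file handle for Sarcasm_Headlines_Dataset.json could be passed to `parse_records` in place of `sample.splitlines()`.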

OBJECTIVE:
The objective of this project is to develop a machine learning model from data in which
each record consists of three attributes: is_sarcastic (1 if the record is sarcastic,
otherwise 0), headline (the headline of the news article), and article_link (a link to the
original news article, useful for collecting supplementary data). We will either be working
with custom examples, or using the news-headline-sarcasm dataset for understanding
real-world use cases.
The model to be built is a supervised machine learning classification model. This guide
covers the concepts of natural language processing, its techniques, and their
implementation. The aim is to teach the concepts of natural language processing and
apply them to a real data set.
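
To make "supervised classification" concrete, here is a small sketch of a multinomial Naive Bayes classifier over word counts, written with the standard library only. Naive Bayes is chosen here purely for brevity and is an assumption of this example, not the model used in the report. The training headlines and the `NaiveBayes` class are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: 1 = sarcastic, 0 = not sarcastic.
train_texts = [
    "scientists thrilled to discover water is wet",
    "nation shocked that monday follows sunday",
    "government announces new tax policy",
    "stock markets close higher on friday",
]
train_labels = [1, 1, 0, 0]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("scientists shocked to discover monday")
```

The model learns from labelled examples (the supervised part) and assigns one of the known labels to a new headline (the classification part). Four training headlines are far too few for meaningful accuracy; the example only shows the workflow.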

BACKGROUND:
Python is an interpreted high-level general-purpose programming language. Python's design
philosophy emphasizes code readability with its notable use of significant indentation. Its language
constructs as well as its object-oriented approach aim to help programmers write clear, logical code
for small and large-scale projects.[30]
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms,
including structured (particularly, procedural), object-oriented and functional programming. Python is
often described as a "batteries included" language due to its comprehensive standard library.

Machine Learning is a sub-area of artificial intelligence, whereby the term refers to
the ability of IT systems to independently find solutions to problems by recognizing
patterns in databases. In other words: Machine Learning enables IT systems to
recognize patterns on the basis of existing algorithms and data sets and to develop
adequate solution concepts. Therefore, in Machine Learning, artificial knowledge is
generated on the basis of experience.

Datasets used for Natural Language Processing:

This section is divided into 7 parts; they are:

1. Text Classification
2. Language Modeling
3. Image Captioning
4. Machine Translation
5. Question Answering
6. Speech Recognition
7. Document Summarization

1. Text Classification
Text classification refers to labeling sentences or documents, such as email spam
classification and sentiment analysis.

Below are some good beginner text classification datasets.

• Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents
that appeared on Reuters in 1987, indexed by category. Also see RCV1, RCV2 and TRC2.
• IMDB Movie Review Sentiment Classification (Stanford). A collection of movie reviews from
the website imdb.com and their positive or negative sentiment.
• News Group Movie Review Sentiment Classification (Cornell). A collection of movie reviews
from the website imdb.com and their positive or negative sentiment.

For more, see the post:

• Datasets for single-label text categorization.

2. Language Modeling
Language modeling involves developing a statistical model for predicting the next word in a
sentence, or the next letter in a word, given whatever has come before. It is a precursor
task for applications such as speech recognition and machine translation.
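
One very simple statistical model of this kind is a bigram model, which predicts the next word from counts of which words followed the current word in training text. The sketch below assumes whitespace tokenization and a tiny made-up corpus. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up corpus, used only to illustrate the counting.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)
next_word = predict_next(model, "the")
```

In this corpus "cat" follows "the" more often than any other word, so it is the prediction for "the". Practical language models condition on longer histories and smooth the counts, which this sketch leaves out.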

3. Image Captioning
Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

• Common Objects in Context (COCO). A collection of more than 120 thousand images with
descriptions.
• Flickr 8K. A collection of 8 thousand described images taken from Flickr.

4. Machine Translation
Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

• Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and
French.
• European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs from a suite of
European languages.

There are many standard datasets used for the annual machine translation challenges;
see:

• Statistical Machine Translation


5. Question Answering
Question answering is a task where a sentence or sample of text is provided from which
questions are asked and must be answered.

Below are some good beginner question answering datasets.

• Stanford Question Answering Dataset (SQuAD). Question answering about Wikipedia
articles.
• Deepmind Question Answering Corpus. Question answering about news articles from the
Daily Mail.
• Amazon question/answer data.

6. Speech Recognition
Speech recognition is the task of transforming audio of a spoken language into
human-readable text.

Below are some good beginner speech recognition datasets.

• TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its
wide use. Spoken American English and associated transcription.
• VoxForge. Project to build an open source database for speech recognition.
• LibriSpeech ASR corpus. Large collection of English audiobooks taken from LibriVox.

7. Document Summarization
Document summarization is the task of creating a short meaningful description of a larger
document.

Coding

………………………………………………………….
'of',
'th

HARDWARE REQUIREMENT


HARDWARE TOOLS    MINIMUM REQUIREMENT
PROCESSOR         Intel Xeon E5-2630 v4 processor, 2.2 GHz
HARD DISK         2 TB hard disk (7200 RPM) + 512 GB SSD
RAM               128 GB DDR4, 2133 MHz
MOTHERBOARD       ASRock EPC612D8A
GPU               NVIDIA Titan X Pascal (12 GB VRAM)

SOFTWARE REQUIREMENT

SOFTWARE TOOLS      MINIMUM REQUIREMENT
OPERATING SYSTEM    Windows, Linux, macOS
TECHNOLOGY          Python, machine learning
VERSION             3.8
SCRIPTING LANGUAGE  Python
IDE                 PyCharm, Jupyter Notebook

FUTURE SCOPE:
The future scope of NLP on the news headline dataset is broad. Developers can use NLP to
perform tasks such as speech recognition, sentiment analysis, translation, automatic
grammar correction while typing, and automated answer generation. NLP is a challenging
field because it deals with human language, which is extremely diverse and can be
expressed in many ways.

CONCLUSION:

From all the experiments, we observed that the PV-DBOW method of paragraph vectors has great
potential in sarcasm detection. The combination of SVM and the PV-DBOW method achieved an
accuracy of 91.26%. With manual features, the machine learning models SVC and RF also performed
well. The results demonstrate the capability of paragraph vectors, and they show that the
data set used is suitable for both the manual-feature and the word-embedding-based approaches
to sarcasm detection. Because the dataset used in this research differs from those of the works
discussed in section 2, it is difficult to compare results directly, but the methods that were
implemented have shown good results for this dataset. In future, larger sarcasm datasets can be
used. Cross-domain experiments are also needed to explore how the model performs in such
situations. Sarcasm is a persistent obstacle to obtaining true opinion. Hence data needs to be
collected from a variety of sources, making use of word embeddings as well as manual features
wherever applicable. For manual features, the dictionaries can be improved by adding more words,
and the polarity score of emoticons can be included, since in this research only the presence of
emoticons is taken into consideration. Along with paragraph vectors, deep learning techniques
can be used in future.
