Report in ML

The document provides a guide for preprocessing news headline data using machine learning techniques for sarcasm detection. It discusses the dataset containing headlines, links, and sarcasm labels. The objective is to build a supervised machine learning classification model to detect sarcasm. It also covers relevant background information on Python, machine learning, and natural language processing preprocessing concepts like stemming, lemmatization, word embeddings, and bag-of-words. Hardware and software requirements for the project are also listed.

Uploaded by

Priti Gupta


A REPORT ON

NEWS HEADLINE DATASET USING MACHINE LEARNING
22/07/2021
UNDER THE GUIDANCE OF
AISHWARYA SAXENA

BY
PRITI GUPTA

INDEX

SR NO.  TOPIC
1.      INTRODUCTION
2.      OBJECTIVE
3.      BACKGROUND
4.      HARDWARE REQUIREMENT & SOFTWARE REQUIREMENT
5.      CODING
6.      FUTURE SCOPE
7.      CONCLUSION

Guide for NLP Pre-processing
Python notebook using data from News Headlines Dataset For Sarcasm Detection

Introduction
This notebook is a guide to preprocessing and cleaning text before performing tasks such as sentiment
analysis, summary generation, and next-word prediction. It covers the basic concepts of NLP, which
include:

• Stemming
• Lemmatization
• Word embeddings
• Bag-of-words models

We will either be working with custom examples or using the news-headline-sarcasm dataset to
understand real-world use cases.
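
As an illustration of the bag-of-words concept listed above, here is a minimal sketch that uses only the Python standard library. The helper names `tokenize` and `bag_of_words` and the two example headlines are made up for this illustration and are not taken from the original notebook.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Made-up headlines, used only for illustration.
example_headlines = [
    "Local man wins local award",
    "Man denies award exists",
]
vocabulary, vectors = bag_of_words(example_headlines)
print(vocabulary)
print(vectors)
```

Each vector position corresponds to one vocabulary word, so word order is discarded and only word counts remain. This is the property that gives the model its name.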

Dataset link - https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection?select=Sarcasm_Headlines_Dataset.json

Each record consists of three attributes:

• is_sarcastic: 1 if the record is sarcastic, otherwise 0
• headline: the headline of the news article
• article_link: link to the original news article. Useful for collecting supplementary data
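
The records described above could be read as follows. This is a sketch that assumes the file stores one JSON object per line. The helper name `read_records` and the two sample records are hypothetical and stand in for the real file.

```python
import io
import json


def read_records(lines):
    """Parse an iterable of JSON lines into a list of dicts, skipping blanks."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Two made-up records in the assumed format. For the real data, replace the
# StringIO object with open("Sarcasm_Headlines_Dataset.json", encoding="utf-8").
sample = io.StringIO(
    '{"article_link": "https://example.com/a", '
    '"headline": "area man wins award", "is_sarcastic": 1}\n'
    '{"article_link": "https://example.com/b", '
    '"headline": "council approves budget", "is_sarcastic": 0}\n'
)
records = read_records(sample)
headlines = [record["headline"] for record in records]
labels = [record["is_sarcastic"] for record in records]
print(len(records), "records,", sum(labels), "sarcastic")
```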

OBJECTIVE:
The objective of this project is to develop a machine learning model based on the News
Headlines dataset. Each record consists of three attributes: is_sarcastic (1 if the record is
sarcastic, otherwise 0), headline (the headline of the news article), and article_link (a link to
the original news article, useful for collecting supplementary data). We will either be working
with custom examples or using the news-headline-sarcasm dataset to understand real-world use
cases.
The model to be built is a supervised machine learning classification model. This guide covers
the concepts of natural language processing, its techniques, and their implementation. The aim
is to teach the concepts of natural language processing and apply them to a real dataset.
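
To make the idea of a supervised classification model concrete, here is a small sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. Naive Bayes is chosen purely for illustration and is not claimed to be the model used in this project. The function names and the four training headlines are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Collect per-class word counts, class frequencies and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log-posterior score."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: 1 = sarcastic, 0 = not sarcastic.
train_texts = [
    "area man heroically finishes sandwich",
    "nation shocked that monday follows sunday",
    "council approves new road budget",
    "city opens new public library",
]
train_labels = [1, 1, 0, 0]
model = train_naive_bayes(train_texts, train_labels)
print(predict("nation shocked by sandwich", model))
print(predict("city approves new library", model))
```

With real data, the headlines and labels would come from the dataset, and a held-out test split would be needed to measure accuracy.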

BACKGROUND:
Python is an interpreted, high-level, general-purpose programming language. Python's design
philosophy emphasizes code readability with its notable use of significant indentation. Its language
constructs as well as its object-oriented approach aim to help programmers write clear, logical code
for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms,
including structured (particularly, procedural), object-oriented and functional programming. Python is
often described as a "batteries included" language due to its comprehensive standard library.

Machine Learning is a sub-area of artificial intelligence, whereby the term refers to
the ability of IT systems to independently find solutions to problems by recognizing
patterns in databases. In other words: Machine Learning enables IT systems to
recognize patterns on the basis of existing algorithms and data sets and to develop
adequate solution concepts. Therefore, in Machine Learning, artificial knowledge is
generated on the basis of experience.

Datasets used for Natural Language Processing:

This section is divided into 7 parts; they are:

1. Text Classification
2. Language Modeling
3. Image Captioning
4. Machine Translation
5. Question Answering
6. Speech Recognition
7. Document Summarization

1. Text Classification
Text classification refers to labeling sentences or documents, such as email spam
classification and sentiment analysis.

Below are some good beginner text classification datasets.

• Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents
that appeared on Reuters in 1987, indexed by category. Also see RCV1, RCV2 and TRC2.
• IMDB Movie Review Sentiment Classification (Stanford). A collection of movie reviews from
the website imdb.com and their positive or negative sentiment.
• News Group Movie Review Sentiment Classification (Cornell). A collection of movie reviews
from the website imdb.com and their positive or negative sentiment.
For more, see the post:

• Datasets for single-label text categorization.

2. Language Modeling
Language modeling involves developing a statistical model for predicting the next word in a
sentence, or the next letter in a word, given whatever has come before. It is a precursor task
for tasks like speech recognition and machine translation.
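
Next-word prediction can be sketched with a bigram count model, which is one simple form of statistical language model. The function names and the three-sentence corpus below are made up for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" marks the start of a sentence.
        tokens = ["<s>"] + sentence.lower().split()
        for prev, curr in zip(tokens, tokens[1:]):
            following[prev][curr] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it is unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]
bigram_model = train_bigram_model(corpus)
print(predict_next(bigram_model, "the"))
print(predict_next(bigram_model, "sat"))
```

A bigram model conditions only on the single preceding word. Practical language models use longer contexts and smoothing, but the counting idea is the same.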

3. Image Captioning
Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

• Common Objects in Context (COCO). A collection of more than 120 thousand images with
descriptions.
• Flickr 8K. A collection of 8 thousand described images taken from Flickr.

4. Machine Translation
Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

• Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and
French.
• European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs for a suite of
European languages.
There are many standard datasets used for the annual machine translation challenges;
see:

• Statistical Machine Translation

5. Question Answering
Question answering is a task where a sentence or sample of text is provided from which
questions are asked and must be answered.

Below are some good beginner question answering datasets.

• Stanford Question Answering Dataset (SQuAD). Question answering about Wikipedia
articles.
• DeepMind Question Answering Corpus. Question answering about news articles from the
Daily Mail.
• Amazon question/answer data.

6. Speech Recognition
Speech recognition is the task of transforming audio of a spoken language into human-readable
text.

Below are some good beginner speech recognition datasets.

• TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its
wide use. Spoken American English and associated transcription.
• VoxForge. Project to build an open source database for speech recognition.
• LibriSpeech ASR corpus. Large collection of English audiobooks taken from LibriVox.

7. Document Summarization
Document summarization is the task of creating a short meaningful description of a larger
document.

Coding

………………………………………………………….
'of',

'th

HARDWARE REQUIREMENT

HARDWARE TOOLS    MINIMUM REQUIREMENT
PROCESSOR         Intel Xeon E263v4corprocessor, 2.2 GHz
HARD DISK         2 TB (7200 RPM) + 512 GB SSD
RAM               128 GB DDR4, 2133 MHz
MOTHERBOARD       ASRock EPC612D8A
GPU               NVIDIA Titan X Pascal (12 GB VRAM)

SOFTWARE REQUIREMENT

SOFTWARE TOOLS       MINIMUM REQUIREMENT
OPERATING SYSTEM     WINDOWS, LINUX, MACOS
TECHNOLOGY           PYTHON, MACHINE LEARNING
VERSION              3.8
SCRIPTING LANGUAGE   PYTHON
IDE                  PYCHARM, JUPYTER NOTEBOOK

FUTURE SCOPE:
Developers can use NLP to perform tasks such as speech recognition, sentiment analysis,
translation, automatic grammar correction while typing, and automated answer generation.
NLP is a challenging field because it deals with human language, which is extremely diverse
and can be expressed in many ways.

CONCLUSION:

From all the experiments, we observed that the PV-DBOW method of paragraph vectors has great
potential in sarcasm detection. The combination of SVM and PV-DBOW achieved an accuracy of
91.26%. With manual features, the machine learning models SVC and RF also performed well.
The results demonstrate the capability of paragraph vectors and show that the dataset used is
suitable for both manual-feature and word-embedding-based approaches to sarcasm detection.
Because the dataset used in this research differs from those used in the works discussed in
section 2, a direct comparison of results is difficult, but the methods that were implemented
have shown good results for this dataset.

In future, larger sarcasm datasets can be used. Cross-domain experiments are also needed to
explore how the model performs in such situations. Sarcasm is a persistent obstacle to
obtaining true opinion. Hence, data needs to be collected from a variety of sources, using word
embeddings as well as manual features wherever applicable. For manual features, the
dictionaries can be improved by adding more words, and the polarity score of emoticons can be
included, since this research considers only the presence of emoticons. Along with paragraph
vectors, deep learning techniques can be used in future.
