
Semantic Summarization of Web News

A Project Report

Submitted for the partial fulfilment


of B.Tech. Degree in
INFORMATION TECHNOLOGY

By

Saumitra Shukla (1705213047)

Sunny Jain (1705213051)

Vipul Sharma (1705213056)

Under the supervision of

Dr. Pawan Kumar Tiwari

Dr. Tulika Narang

Department of Computer Science and Engineering

Institute of Engineering and Technology

Dr. A.P.J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh.

Contents

DECLARATION i

CERTIFICATE ii

ACKNOWLEDGEMENT iii

ABSTRACT iv

LIST OF FIGURES v

LIST OF TABLES vi

1. INTRODUCTION

2. LITERATURE REVIEW

3. METHODOLOGY

4. EXPERIMENTAL RESULTS

5. CONCLUSIONS

6. REFERENCES

Declaration

We hereby declare that this submission is our own work and that, to the best of our knowledge and
belief, it contains no material previously published or written by another person, nor material
which to a substantial extent has been accepted for the award of any degree or diploma of a
university or other institute of higher learning, except where due acknowledgement has been
made in the text. The project has not been submitted by us at any other institute for the
requirement of any other degree.

Submitted by: - Date: 21-July-2021

(1) Name: Sunny Jain


Roll No.: 1705213051
Branch: Information Technology

Signature: SUNNY JAIN

(2) Name: Saumitra Shukla


Roll No.: 1705213047
Branch: Information Technology
Signature: SAUMITRA SHUKLA

(3) Name: Vipul Sharma


Roll No.: 1705213056
Branch: Information Technology

Signature: VIPUL SHARMA

Certificate

This is to certify that the project report titled Semantic Summarization of Web News, presented
by Sunny Jain, Saumitra Shukla and Vipul Sharma in partial fulfilment of the requirements for the
award of Bachelor of Technology in Computer Science and Engineering, is a record of work
carried out by them under my supervision and guidance at the Department of Computer Science
and Engineering, Institute of Engineering and Technology, Lucknow.

It is also certified that, to the best of my knowledge, this project has not been submitted at any
other institute for the award of any other degree.

Dr. Pawan Kumar Tiwari

Dr. Tulika Narang

Department of Computer Science and Engineering


Institute of Engineering and Technology, Lucknow
Acknowledgement

I would firstly like to thank my group members, Mr. Sunny Jain, Mr. Vipul Sharma and
Mr. Saumitra Shukla, for working diligently with me day in and day out; without them this
project would have been impossible.
I am deeply indebted to my mentors, Dr. Tulika Narang and Dr. Pawan Kumar Tiwari of the CSE
department, for their valuable guidance, keen interest, constructive criticism and encouragement
at various stages of my training period.
I would like to thank Dr. Promila Bahadur and Dr. Tulika Narang of the CSE department, the
project monitoring committee members, for delivering the guidelines and organising the online
presentations smoothly and on schedule.
Finally, I would like to conclude by expressing my heartfelt thanks to my supportive family and
friends for motivating me and contributing their ideas to this project.

Saumitra Shukla

Sunny Jain

Vipul Sharma

Abstract
In this thesis we describe our experience of creating a news segmenter for our final year project.
In the project we use a variety of methodologies, and the outcomes of these experiments are
compared and assessed. Because of the application context of a final year project, we used relaxed
error measures for performance evaluation.

We are expected to write a report on "Semantic Summarization of Web News" as part of our final
year project and to gain expertise in the field of data science. The primary goal of completing this
project report is to gain an understanding of various software engineering tools.

At present, citizens do not have easy access to this information. The goal of this project is
therefore to automate the extraction and presentation of essential information from newspaper
articles and make it accessible to a broader audience.

The completion of this project report also allows us to expand our understanding of consumer
attitudes toward reading web news. Along the way we gained a range of experience related to the
concepts behind our theme, and we learned the value of collaboration and of dedication to the
task.
List of Figures

Figure Number  Title  Page Number

3.1  Natural language processing  18
3.2  Software development lifecycle  29
3.3  Use case diagram of user's possible interaction with the system  30
3.4  Sequence diagram showing object interactions arranged in time sequence  31
3.5  The data flow of the proposed framework  33
4.1  Start page  35
4.2  List of newspapers  35
4.3  Select news priority-wise  36
List of Tables

Table Number  Title  Page Number

3.1  Commonly used tags in POS tagging  25
3.2  Meaning of arguments for sentences  26
4.1  Meaning of arguments for sentences  37
4.2  Experimentation accuracy results  39
Chapter 1

Introduction

1.1. Background

The World Wide Web is also a huge database. This large quantity of information gives human
users and algorithms rapid access to almost every conceivable kind of content, yet the
unstructured nature of most of the available data poses a major problem.

Although human beings can, in principle, best extract relevant information from posted
documents and texts, the huge amount of knowledge to be handled demands computerised
approaches. The exponential growth of the online industry has apparently made information
search and tracking easier and faster, but the massive overload of information requires algorithms
and tools that can track information quickly and easily.

In other words, the sheer amount of data is, on the one hand, the reason why everyone can access
it and, on the other hand, the source of the well-recognised difficulty of distinguishing valuable
information from worthless information. It is therefore necessary to summarise the wide range of
information available on the Internet so that users can absorb knowledge easily without spending
most of their time reading through the vast data. This is why we have built a summarization tool,
so that we are able to condense the information and improve the user's reading capacity.[3]

Our proposal is influenced by text summarization models based on the Maximum Coverage
Problem but, unlike them, we create a technique that blends both the syntactic and the semantic
structure of a text. Within the suggested model, the news is divided into several parts according to
its emotional content, starting with the most positive and the most negative items.
Applying semantic networks to the examined web source yields a semantic characterisation: the
natural-language text is mapped into an abstract representation that identifies the subjects
addressed inside the web resource itself. Using this abstract representation, a heuristic algorithm
then derives the required text segments from the original document.[3]

At the moment, individuals want to consume as much news as possible, from as many sources as
possible, on topics that are essential or of interest to them. Interactivity refers to the innate
inclination of the masses to curate their own news. Immediacy is another important characteristic:
individuals want to be notified about news without delay. The environment in which we live and
the technology we are familiar with allow individuals to benefit from these qualities by providing
them with quick news about occurrences in real time.

Online news sites have evolved efficient techniques for drawing the attention of the public.
Online news offers opinions on news entities, which may be individuals, places, or objects, in
reports of current occurrences. For this reason, many channels of different news websites offer
interactive rating services, i.e., a news item might be rated good, bad or neutral. Sentiment
analysis, or opinion mining, is a technique for discovering the polarity or strength of an opinion
(positive or negative) in writing, in this work an item in the news. Manual labelling of sentiment
words is a time-consuming procedure.
There are two prominent ways of automating sentiment analysis. The first uses a weighted word
lexicon, while the second is based on machine learning approaches. Lexicon-based methods
employ a dictionary of words and match the words of a text against it to detect polarity.
This methodology does not require preprocessing the data or training a classifier, as opposed to
machine learning approaches.
This investigation is based on a lexicon-based sentiment analysis technique for news articles. The
rest of this report is structured as follows:

The literature review on sentiment analysis of news articles is presented in Chapter 2. Chapter 3
presents the proposed methodology and experimental setup. Results are presented in Chapter 4,
followed by the conclusions in Chapter 5.

1.2. Scope of the Project

This project covers the news domain and the extraction of items of defined interest from
newspaper articles as structured templates. For a large range of consumers it is very important to
obtain information in a much better way, with the least use of resources, time and money; this
type of information is extremely important. Along with the extraction of knowledge, the system
stores all the gathered news information and offers it in effective and efficient ways that might
assist country-level decision making.
Different extraction systems were examined, and the one best matching the system requirements
was selected. The extraction process involves expressing the formalism of the selected extraction
system in an interchangeable manner. After the appropriate rules are written, implemented and
tested, the performance of the planned rules is assessed to determine whether the whole process
succeeded or failed.[1]
The project was started to meet both the academic and the corporate standards and criteria of the
IET Lucknow project. Users can utilise the system in various ways, such as:

● A news extraction tool.

● Database of news sources.

● A tool to analyse the extracted data meaningfully.

1.3. Problem Statement:

The world is changing rapidly, and the need to adapt to events keeps increasing for those who
wish to keep pace with globalisation. They therefore need to acquire an enormous quantity of
knowledge and understand it in less time. With news stories arriving from thousands of internet
sources, it is increasingly necessary to summarise this information, because not everybody has
time to read complete articles. Readers may browse the latest news from different news sources.
Our solution helps reduce this difficulty by gathering the news stories and extracting summary
information from them, so users do not have to dig into the full stories in order to learn about an
event. A huge quantity of data is available in electronic format in the current digital era; however,
we lack the tools and technology essential for condensing this information into meaningful
knowledge that can be used for crucial decisions.[1]

We thus seek to develop a platform where users can log in and receive recent news from many
reputable sources.
A user may explicitly choose the news source he/she wants.
We provide the user with semantically ordered news:
● Most negative
● Most positive
● Medium negative
● Medium positive
● Neutral

This uses the limited time of the user efficiently and brings the most alarming news to them first.
● We also provide the user with a means to share the content efficiently on social media in order
to raise awareness throughout the population and bring the news to the attention of the
authorities concerned.

Chapter 2

Literature Review

2.1 Works on Text Summarization

Text summarization refers to the process of extracting or gathering key information from a
source text and presenting it in a condensed form. In recent years the need for summarization has
been seen in a variety of contexts and domains, including news article summaries, email
summaries, short-message summaries of reports on mobile devices, data summaries for business
people and government, researchers receiving programme-generated summaries of relevant pages
found while searching online, and the medical field, where a patient's story is tracked for further
treatment.

Many examples may be found on the internet, such as article summarizers like Microsoft News,
Google, or Columbia Newsblaster. A few common biomedical summarising tools include
BaseLine, FreqDist, SumBasic, MEAD, AutoSummarize, and SWESUM [6]. Online summarising
tools include Text Compactor, Simplify, Tools4Noobs, FreeSummarizer, WikiSummarizer, and
SummarizeTool. Among the most commonly used open-source summarising programmes are
Open Text Summarizer, Classifier4J, NClassifier, and CNGLSummarizer.

The first automated summarizer was introduced in the late 1950s; it chooses significant sentences
from the text and puts them together, so that it takes less time to grasp the information inside a
large document. The goal of automated text summarization is to reduce the size of long texts
while preserving the vital information.
2.2 Works on semantic analysis

In sentiment analysis the quantity of documents is rising quickly, and in terms of substance the
field has also evolved over the years. Before the Internet supplied an endless number of texts and
opinions, studies focused on public or expert opinions rather than on the opinions of users and
consumers.

The seminal article, titled "Cross-Out Technique as public opinion analysis", was published in
1940. Articles published in a quarterly periodical in 1945 and 1947 covered the measurement of
public views in post-WWII nations (Japan, Italy, and Czechoslovakia) on what they had
experienced during the war. Computerised systems started to appear in the mid-90s.

The computer revolution also began to be reflected in research. As an example, in 1995 an article
was released on 'Elicitation, assessment, and pooling of expert judgments using possibility
theory', which used the pooling of expert opinions within the field of business safety as an
example[10]. However, the emergence of modern sentiment analysis was still over 10 years
away.

The work carried out within the Association for Computational Linguistics, founded in 1962,
also influenced the creation of contemporary sentiment analysis. In 1990, and subsequently in
1999, Wiebe suggested a gold standard that is still followed today. Computer-based sentiment
analysis came into existence largely within this community, where Wiebe first presented
techniques for detecting subjective sentences in narratives.
2.3 Related Work

Automatic summarization systems are classified as producing abstraction-based or
extraction-based summaries. An extraction-based summary involves selecting text fragments from
the source document, while an abstraction-based summary involves compressing and
reformulating sentences.

While an abstractive summary is closer to the intended result, it requires sophisticated language
processing techniques, whereas extractive summaries are more practical.[3]

Another important distinction is between summarizing a single document and summarizing
many. Additional problems arise because multi-document summarization systems must take into
account diverse characteristics, such as the differences and similarities among the sources as well
as the ordering of the collected information.

We categorise multi-document summarization through these principles:

(i) how the informative value of a text fragment is defined,

(ii) how a significant part of it is extracted, and

(iii) how several sources may be combined.

In order to extract the most important section of a text, the strategies presented exploit how often
or how seldom terms occur across a very large collection of documents, in the same way that
human summaries do.[3] In these systems it is also crucial to identify not just what information is
typically included in a summary, but also how the relevance of new information changes in view
of what has previously been included in the summary.

Several approaches are typically taken into account to achieve this:
In one probabilistic approach, the significance of a term is calculated using a regression model
based on multiple word characteristics, such as frequency and dispersion. For the final summary,
the words with the best effectiveness ratings are chosen. In addition, a cross-sentence word-value
overlap is calculated to eliminate phrases with equivalent semantic content and thus smooth out
the redundancy phenomenon. Key sentences are derived from the narrative paragraphs by
classifiers that learn how important each candidate statement is. Finally, a sentence-extraction
algorithm is constructed.

Another method aims to investigate and collect information in the form of semantic graphs built
from subject/verb/object triples extracted from the words. A generic framework is provided that
may integrate information on sentence-level structure with semantic similarity.

Finally, an extractive ontology summary takes the RDF statement as the key unit of summary. The
summary is achieved by extracting a set of remarkable RDF statements according to a novel
method.[3]
Chapter 3

Methodology

3.1. System-Architecture

Proposed methodology: we suggest a framework consisting of the following steps:

(i) Web search: using the accessible RSS feeds of the most common newspapers, various websites
are scraped and the news is saved in a local database.

(ii) Text extraction: linked textual sentences are extracted from the documents by parsing the
HTML pages.

(iii) Natural language processing: various NLP algorithms divide the retrieved text into sentences
and identify the functional elements (subject, verb, and object) of each phrase, as well as the
related form, which supports both information extraction and sentiment analysis.

(iv) Clustering: the OPTICS algorithm is used to cluster the sentence vectors.

(v) Summary creation: a summary is created based on the search terms and user preferences.
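
As a brief illustration of the clustering step (iv), the sketch below clusters sentence vectors with
scikit-learn's OPTICS implementation. The TF-IDF vectorization and the toy sentences are our own
assumptions for illustration; the exact vector representation used in the project is not fixed here.

# Illustrative sketch of step (iv): clustering sentence vectors with OPTICS.
# Assumption: sentences are represented as TF-IDF vectors; the exact
# representation used by the project is not specified in this report.
from sklearn.cluster import OPTICS
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The government announced a new economic policy today.",
    "Officials said the policy aims to boost economic growth.",
    "Heavy rain caused flooding in several coastal districts.",
    "Rescue teams were deployed to the flooded districts.",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(sentences).toarray()

# min_samples=2 suits this tiny toy corpus; with so few sentences OPTICS may
# still mark points as noise (label -1), whereas real runs use many sentences.
clustering = OPTICS(min_samples=2, metric="cosine").fit(vectors)

for sentence, label in zip(sentences, clustering.labels_):
    print(label, sentence)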

3.2 Logic-Flow

● We start by picking phrases from the clusters with the highest information score and the least
amount of textual redundancy.
● We rank the clusters according to their scores, then use a threshold to choose the most
essential ones subject to the length constraints.
● We choose the sentence that is most representative of its semantic clusters while also
reducing repetition.
● Because many statements may belong to different semantic clusters, we want to prevent the
same sentence from being presented several times in the summary.
● We penalise each cluster's common score if we have already examined it during the summary
generation process, and we compare the clusters' cumulative average score against the
threshold to see whether they contain any more valuable information.[3]
● After determining the best summary, the sentences are reordered to respect the partial
ordering of sentences inside the article.
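
A minimal sketch of this greedy selection logic follows, under simplifying assumptions: each
sentence maps to a set of cluster identifiers, cluster scores are precomputed, and a reused cluster's
score is halved. The penalty factor, scoring scheme and length limit are illustrative choices, not
the project's exact values.

# Simplified sketch of the cluster-based greedy sentence selection described above.
# Assumptions (not from the report): each sentence maps to a set of cluster ids,
# cluster scores are precomputed, and a fixed penalty halves a reused cluster's score.

def build_summary(sentences, sentence_clusters, cluster_scores, max_sentences=3, penalty=0.5):
    scores = dict(cluster_scores)          # working copy of the cluster scores
    chosen = []
    candidates = list(range(len(sentences)))

    def sentence_score(i):
        # Score a sentence by the current scores of the clusters it belongs to.
        return sum(scores[c] for c in sentence_clusters[i])

    while candidates and len(chosen) < max_sentences:
        best = max(candidates, key=sentence_score)
        chosen.append(best)
        candidates.remove(best)

        # Penalise the clusters already covered, which reduces redundancy.
        for c in sentence_clusters[best]:
            scores[c] *= penalty

    # Reorder the chosen sentences to follow their order in the article.
    return [sentences[i] for i in sorted(chosen)]

sentences = ["S1 about topic A.", "S2 also about topic A.", "S3 about topic B."]
sentence_clusters = {0: {"A"}, 1: {"A"}, 2: {"B"}}
cluster_scores = {"A": 2.0, "B": 1.5}
print(build_summary(sentences, sentence_clusters, cluster_scores, max_sentences=2))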

Figure 3.1: NLP Flow

3.3 Web-mining

The use of knowledge-mining techniques to automatically identify and extract information from
web documents and services is known as web mining; application areas include resource
discovery, information selection, generalisation, and data analysis. Machine-learning approaches
generally address the latter two goals. Web content mining, web structure mining, and web usage
mining are the three primary sub-areas of web mining. The first of these deals with the analysis of
web resource contents, which often include a variety of data sources such as texts, pictures,
videos, and audio; metadata and hyperlinks are frequently classed as text content. It has been
shown that unstructured text makes up a significant portion of web resources, resulting in
widespread use of text mining tools.[2]

There are several studies in the literature that focus on text mining for web page mining. We
looked into certain web mining approaches for online search, subject extraction, and web opinion
mining. Web content mining can help with tasks such as sentiment categorization, customer
review analysis and summarization, template identification, and page segmentation. By
establishing a framework for competitive intelligence, web content mining also addresses
corporate applications.
Web-content classification and word-level summarising approaches have been studied, and a
web-page analyser has been used to detect unwanted advertising. One reported study offered a
web-page recommendation system in which collaborative filtering techniques and learning
methods worked together to provide a web filter for effective user navigation.[2]
The method used in this study differs from previous work in two major ways: first, it uses
semantic-based approaches to pick and score the single phrases retrieved from the text; second, it
combines website segmentation with summarization. The suggested technique does not fall
within the category of semantic web mining, which refers to approaches that rely on particular
ontologies enhancing the original website material in a structured fashion. To the authors'
knowledge, there are just two studies in the literature that employ semantic information for
webpage mining.[2]

One of these studies described customised multimedia management systems and employed
semantic, ontology-based contextual data to understand customised content access and retrieval
behaviour. The other used the WordNet semantic network to provide innovative semantic
similarity metrics in a study on semantic-based feature extraction for web mining.
3.3.1 Web-scraping

Web pages are generally built for visual interaction and feature a variety of graphic parts that
convey different kinds of material. The goal of web page segmentation is to understand the page
structure and divide the information into visual pieces. This can be a difficult task with a
significant number of challenges, and several approaches for web page segmentation have been
proposed in recent years.[2]

Web scraping is the process of extracting data from a website. The data is gathered and then
exported in a format that is more useful to the user, whether a spreadsheet or an API. Although
web scraping can be done manually, automated tools are generally preferable for scraping web
data since they are more cost-effective and work at a faster rate. Web scraping, however, is not
always an easy process: because websites come in a variety of shapes and sizes, web scrapers
differ in their functionality and capabilities. For web scraping we use beautifulsoup4.
Beautiful Soup is a Python library for parsing HTML, XML, and other markup languages. Say
you come across some websites that display data important to your study, such as dates or
addresses, but do not allow you to download it directly; Beautiful Soup lets you extract specific
material from a web page, strip away the HTML markup, and save the data. It is a web scraping
library that helps you clean up and parse the pages you have pulled down from the web.[2]
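
As a small illustration of this step, the sketch below parses an HTML fragment with
beautifulsoup4 and pulls out the headline and paragraph text. The HTML snippet and its tag names
are invented for the example and do not correspond to any particular news site.

# Minimal BeautifulSoup sketch: extract a headline and paragraph text from HTML.
# The HTML fragment and its tag structure are illustrative assumptions only.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 class="headline">Sample headline about a news event</h1>
  <div class="article-body">
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
headline = soup.find("h1", class_="headline").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.select("div.article-body p")]

print(headline)
print(" ".join(paragraphs))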

Web page segmentation methods typically use heuristic algorithms that rely primarily on the
Document Object Model (DOM) tree structure associated with an online resource. As a result,
segmentation algorithms may not work effectively if such auxiliary features are not present or if
they do not match the web page's real semantic structure. The technique given in this chapter, on
the other hand, is based only on the processing of the textual information that can be acquired
from an online resource.[2]
3.3.2 RSS-Feedparser

We first attempted to scrape the sites directly; however, the scraping results are extremely
dependent on the layout of the site, which can change over time and cause the scraper to fail. As a
result, we moved on to RSS.
Rich Site Summary (RSS) is a web feed format that publishes regularly updated content such as
blog posts, news headlines, audio, and video. An RSS document (also known as a "feed", "web
feed", or "channel") contains full or summarised text as well as metadata such as the date and
name of the publication. It presents the data in an XML format, which is nearly identical to
HTML tags except that the names of the tags used in XML are different.[1]

Example of an RSS document:
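
A minimal illustrative RSS 2.0 document with a single news item is shown below; the feed URL,
dates and contents are placeholders rather than an excerpt from any of the project's actual sources.

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News Channel</title>
    <link>https://example.com/news</link>
    <description>Latest headlines from an example newspaper.</description>
    <item>
      <title>Sample headline about a news event</title>
      <link>https://example.com/news/sample-headline</link>
      <pubDate>Wed, 21 Jul 2021 09:00:00 GMT</pubDate>
      <description>Short summary of the news item.</description>
    </item>
  </channel>
</rss>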

We used the feedparser module to retrieve these RSS feeds into our web app. Feedparser is a
Python module that parses feeds in all of the common formats, such as Atom, RSS, and RDF, and
it supports Python versions 2.4 through 3.3. Because feedparser is automated, it obtains all of the
news articles from the specified URL without the need for human intervention.
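
A short sketch of this step follows; the feed URL below is a placeholder, not one of the project's
actual sources.

# Sketch of retrieving news items from an RSS feed with feedparser.
# The feed URL is a placeholder; any newspaper RSS feed URL can be substituted.
import feedparser

FEED_URL = "https://example.com/news/rss.xml"

feed = feedparser.parse(FEED_URL)

for entry in feed.entries:
    # Each entry typically exposes a title, link, summary and publication date.
    title = entry.get("title", "")
    link = entry.get("link", "")
    published = entry.get("published", "")
    summary = entry.get("summary", "")
    print(published, "-", title)
    print(link)
    print(summary)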

3.4. Data-Preprocessing

The raw RSS text data is not suitable for direct input into our information extraction engine, so
preprocessing of the data is required. In this system we use the following preprocessing
procedures:

sentence tokenization, word tokenization, POS tagging, lemmatization, date removal, Unicode
removal, and semantic role labelling.
3.4.1. Sentence-Tokenization

Tokenization is the process of breaking a character sequence within a given document unit into
pieces, called tokens, while possibly discarding some characters, such as punctuation.
Sentence tokenization is the process of breaking a paragraph or news item down into individual
sentences. This is crucial because we want individual sentences rather than entire news items as
input to word tokenization. Sentence tokenization is accomplished using the nltk library's
sent_tokenize() function.[1]
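
For example, a brief sketch with nltk (the sample text is invented):

# Splitting a news snippet into sentences with nltk's sent_tokenize().
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once

text = ("The government announced a new policy today. "
        "Officials said it will take effect next month.")

for sentence in sent_tokenize(text):
    print(sentence)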

3.4.2. Word-Tokenization

Word tokenization refers to breaking a sentence into individual words. This is crucial for POS
tagging, which takes the individual words of a sentence as input and assigns each of them a
tag[1]. Word tokenization is accomplished using the nltk library's word_tokenize() method.
3.4.3. POS-Tagging
POS tagging refers to attaching a POS (part of speech) tag to each word in a sentence[1]. POS
tagging is crucial for locating information about a sentence's context. The following are some of
the most often used tags:

Table 3.1: Commonly used tags in POS tagging
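
A combined sketch of word tokenization (Section 3.4.2) and POS tagging with nltk follows; the
tags produced come from the Penn Treebank tag set used by nltk's default tagger.

# Word tokenization followed by POS tagging with nltk.
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # default POS tagger model

sentence = "The government announced a new policy today."
words = word_tokenize(sentence)
print(words)           # ['The', 'government', 'announced', 'a', 'new', 'policy', 'today', '.']
print(pos_tag(words))  # e.g. [('The', 'DT'), ('government', 'NN'), ('announced', 'VBD'), ...]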

3.4.4. Semantic-Role-Labeling(SRL)

Semantic role labelling, also known as shallow semantic parsing, is a natural language processing
technique that assigns labels to words or phrases in a sentence to indicate their semantic role, such
as agent, goal, or result[1]. It entails identifying and classifying the semantic arguments
associated with a sentence's predicate or verb. The meaning of the arguments of a sentence is
shown in the table below.

Table 3.2: Meaning of arguments for sentences

3.5 Text-Summarization

A summary is a text created from one or more other texts that expresses the significant
information of the original texts while being less than half their length. Text summarization
approaches thus attempt to reduce reading effort by increasing the density of the information
presented to the reader.

Summarization strategies may be divided into two types: extractive and abstractive. Extractive
methods select and assemble sentences taken from the source documents, whereas abstractive
methods use natural language generation to create original summaries.[2]

Early approaches included word-frequency analysis, cue-word extraction, and phrase selection
based on position within the text. Tf-idf metrics (term frequency - inverse document frequency),
graph analysis, latent semantic analysis, machine learning approaches, and fuzzy systems have all
been employed in recent studies.

Other techniques took advantage of semantic processing: lexicon analysis and concept extraction
have been used to assist the reported studies.
The objective of abstractive summarization has also been addressed, with the goal of
understanding the major concepts in a document and then expressing those notions in natural
language.

The present work relies on a hybrid extractive-abstractive approach. First, the most informative
sentences are selected by using the co-occurrence of semantic domains, which constitutes the
extractive step. Then, abstractive information is produced by working out the most representative
domains for every document.[2]
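
To make the extractive step concrete, here is a much-simplified sketch in which each sentence is
scored by its word overlap with recurring word groups ("domains"). The domain word lists and
scoring rule are illustrative stand-ins for the semantic-domain machinery described above, not the
project's actual implementation.

# Much-simplified illustration of scoring sentences against word "domains".
# The domain word lists are invented placeholders; the real system derives
# semantic domains from semantic networks rather than fixed keyword lists.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

DOMAINS = {
    "economy": {"policy", "economic", "growth", "budget"},
    "weather": {"rain", "flood", "storm", "rescue"},
}

def score_sentence(sentence):
    words = {w.lower() for w in word_tokenize(sentence)}
    # A sentence's score is its total word overlap with all domains.
    return sum(len(words & domain_words) for domain_words in DOMAINS.values())

def extractive_summary(text, top_n=1):
    sentences = sent_tokenize(text)
    ranked = sorted(sentences, key=score_sentence, reverse=True)
    selected = set(ranked[:top_n])
    # Preserve the original sentence order in the output.
    return " ".join(s for s in sentences if s in selected)

text = ("The government announced a new economic policy. "
        "Heavy rain caused flooding and rescue teams were deployed. "
        "The weather was otherwise calm.")
print(extractive_summary(text, top_n=1))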

3.5.1 Comparisons:

The basic difficulties we face while evaluating the summary are:

● the document data set to compare against, and

● the existing tools to compare against.

There are several summarising tools available. After examining the efficiency of each
summarizer, we picked the following summarizers for comparison: (i) Copernic Summarizer,
(ii) Intellexer Summarizer Pro, (iii) Excellent Summary, (iv) Text Compactor, and (v) Tools4Noobs
Summarizer.[3]

3.6 Deployment info:

The web application is currently deployed on a localhost server and will be made publicly
available soon. PostgreSQL is used to store the data.
3.7. Software-Development-Model Info:

As the software development approach we used the iterative process, which begins with a simple
implementation of the requirements and improves the evolving version iteratively until the entire
system is implemented [1]. The iterative model is a software development life cycle (SDLC)
model that focuses on a simple initial implementation which gradually increases in complexity
and feature set until the final system is complete.

Figure 3.2: Software-Development-Lifecycle

A data extraction system based on the knowledge engineering method is built by first developing
a rule to extract a certain object or event, then implementing and testing it on new types of
articles before writing another rule.
When required, a rule is rewritten and re-implemented based on its performance until the desired
outcome is achieved. This step-by-step approach to rule development ensures that mistakes are
identified and corrected as soon as possible. Iterative development is the most adaptable approach
to development, allowing new requirements and changes to be easily accommodated.
3.8. Associated-Diagrams:

Various diagrams related to this method are included in this section. Use case diagrams,
entity relation diagrams, sequence diagrams, and several levels of data flowcharts are
among the diagrams provided.

3.8.1. Use-Case-Diagram:

The use case diagram for our system below depicts a set of activities (use cases) that the
system may execute in collaboration with one or more external users (actors).

Figure 3.3 : Use case diagram of user’s possible interaction with the system

3.8.3. Sequence-Diagram:

Figure 3.4 : Sequence diagram showing object interactions arranged in time sequence

3.9 Tools Used By Us:

GitHub (repository):
GitHub will be utilised for source code management and distributed version control. We will
maintain a public repository for our source code and update it there.

PostgreSQL database:
PostgreSQL is an object-relational database management system (ORDBMS) that focuses on
flexibility and compliance with industry standards[1]. In our project we will utilise Postgres
version 9.5, with pgAdmin as the graphical database administration tool.

Python Programming Language:


As our programming language we selected Python 3.7, since it provides many of the common
modules and packages that we use in our project.

Django Framework:
Django is a web framework written in Python. It employs the MVC approach, which speeds
up and simplifies development. We'll choose Django 3.11 because it includes many of the
standard libraries and packages we'll need for our project.
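
Because Django and PostgreSQL are used together, the database connection is declared in the
project's settings module. The snippet below is a generic example with placeholder names and
credentials, not the project's actual configuration.

# Generic Django settings fragment for a PostgreSQL connection (placeholder values).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "news_summarizer",   # placeholder database name
        "USER": "postgres",
        "PASSWORD": "change-me",     # placeholder credential
        "HOST": "localhost",
        "PORT": "5432",
    }
}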

Natural Language Toolkit:


The Natural Language Toolkit, or NLTK, is a collection of text processing tools that
includes categorization, tokenization, stemming, tagging, parsing, and semantic reasoning.
These libraries are used for a variety of applications in our project.

BeautifulSoup4:
BeautifulSoup4 is a Python module for extracting information from HTML and XML
documents. BeautifulSoup will be used to extract data from the pages referenced in an RSS feed,
and the extracted data will then be used to obtain the information we need. The feedparser Python
module allows syndicated feeds to be downloaded and parsed; it is RSS-capable (Rich Site
Summary).

Figure 3.5 : The data flow of the proposed framework

Chapter 4

Experimental Results

This chapter presents a detailed discussion of the set of experiments conducted on sentiment
labelling using the VADER package, based on the news categories of the dataset; the prediction of
sentiment labels is presented in the classification report.

4.1 Comparison of sentiment-labelling:

The sentiment labels are generated using the aspect-based tokenization method. For a better
understanding of the calculation, three typical review sentences are highlighted from the news
review data set, with the aspect terms shown as highlighted words. A tuple of polarity and
subjectivity is collected per word to calculate the polarity scores using aspect-based tokenization.
A rule included in the code then labels each item as negative or positive.
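
A sketch of this labelling step with NLTK's VADER implementation is given below; the
compound-score thresholds used to split items into the five categories of Section 1.3 are
illustrative choices, not necessarily those used in the project.

# Sketch of sentiment labelling with NLTK's VADER sentiment analyser.
# The thresholds mapping compound scores to the five categories are assumptions.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

def label(compound):
    if compound <= -0.6:
        return "most negative"
    if compound < -0.05:
        return "medium negative"
    if compound < 0.05:
        return "neutral"
    if compound < 0.6:
        return "medium positive"
    return "most positive"

analyzer = SentimentIntensityAnalyzer()
headlines = [
    "Floods devastate coastal towns, dozens missing.",
    "Local team wins championship after thrilling final.",
]
for headline in headlines:
    scores = analyzer.polarity_scores(headline)  # keys: neg, neu, pos, compound
    print(label(scores["compound"]), "-", headline)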

Figure 4.1 Start Page

Figure 4.2 List of Newspapers

Figure 4.3 Select News Priority Wise

Table no. 4.1

4.2 Prediction of sentiment labels from the results:

In our approach the results clearly distinguish between true positive and true negative values, but
the overall number of values differs somewhat because a small subset of neutral values branches
off from the true negative and true positive values within the VADER method's results.

4.3 Accuracy of Classification report

The classification report in Table 4.2 shows that the VADER method's positive precision does
not entirely outperform its negative precision, and that its negative recall is greater than its
positive recall. Our observations are as follows:

The positive recall indicates that the VADER approach correctly selects 63 percent of the positive
labels.
The negative recall indicates that negative labels are correctly picked 94 percent of the time, with
a weighted average recall of 93 percent.

The model is a decent classifier overall, but it appears to be poorer at classifying the positive
class, with a recall of 63 percent versus 93 percent for the negative class.
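
For reference, a per-class precision/recall report like Table 4.2 can be produced with scikit-learn;
the label lists below are small placeholders for illustration, not the project's actual predictions.

# How a per-class precision/recall report can be generated with scikit-learn.
# The true/predicted label lists are placeholders for illustration only.
from sklearn.metrics import classification_report

y_true = ["positive", "negative", "negative", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "neutral", "negative", "negative", "neutral"]

print(classification_report(y_true, y_pred))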

Table no 4.2

Chapter 5

Conclusions

5. 1 Conclusions

The study described here provides a paradigm that might help sophisticated web mining
technologies function more successfully.

The suggested system analyses textual data from an internet page and uses semantic networks to
accomplish a number of objectives:

1) the selection of the most important subjects;

2) the selection of the phrases most closely related to a specific topic;

3) the automatic summarization of a textual resource.

The final framework makes use of these features to tackle two tasks at once: text summarization
and page segmentation. The suggested technique includes as a key component a semantic
characterisation of the text, which relies on an abstract representation capturing, on a cognitive
foundation, the informative content of the underlying textual resource. [2]

However, because it does not rely on semantic information already contained in web resources,
the current technique cannot be classified as a Semantic Web approach. Semantic networks are
used in the proposed approach to describe the content of a textual resource through semantic
domains, which carry as much information as a conventional bag of words. Experiments have
shown that such an approach can yield a coarse-grained level of sense distinction, which promotes
the identification of the themes that are really discussed on the website. In this regard, the test
results revealed that the system can mimic human assessors in judging the importance of a text's
individual phrases.

An interesting feature of this work is that the page segmentation technique is based only on the
analysis of the textual part of the web resource. [2]

Combining the content-driven segmentation method with traditional segmentation engines,
which are more geared toward the analysis of the web page's underlying structure, might be a
future direction of this research. The resulting framework should be able to integrate the results of
the two modules to improve the performance of the segmentation procedure.

5.2 Future Works

The use of sentiment analysis in applicable fields will continue to grow, and as a result sentiment
analysis techniques will become an integrated element of many services and products. Advances
in natural language processing and machine learning, we believe, will enhance research
methodologies. Furthermore, we are witnessing a shift away from purely text-based sentiment
analysis approaches and toward methods that draw on other signals, such as voice, gaze, and
neuromarker analysis. However, we doubt whether sentiment analysis can repeat, within the next
ten years, the roughly 50-fold increase in the number of papers that occurred during the past ten
years (2005-2015), since that would result in over 250,000 papers on sentiment analysis being
published by the year 2025.

Extractive techniques are the most successful and adaptable methods employed in automatic
summarization to date: they attempt to choose the most relevant phrases from a collection of
original documents in order to produce a condensed text that conveys the essential pieces of
information. As we have seen, however, these approaches are far from ideal: in multi-document
summarization, selecting phrases from several sources leads to duplication, which must then be
removed. Furthermore, most of the time only a portion of a phrase is relevant, yet extracting only
sub-sentences is not practical. Lastly, extracting sentences from several different documents may
produce an inconsistent and/or hard-to-read summary.

References
[1] Mafiadoc.com, 2021.

[2] www.intechopen.com

[3] Flora Amato, Vincenzo Moscato, Antonio Picariello, Giancarlo Sperlí, Antonio D'Acierno and
Antonio Penta, "Semantic summarization of web news", Encyclopedia with Semantic Computing
and Robotic Intelligence, 2017.

[4] H. P. Luhn, "The automatic creation of literature abstracts", IBM J. Res. Dev., 159 (1958).

[5] R. McDonald, "A study of global inference algorithms in multi-document summarization",
Proc. 29th Eur. Conf. IR Res. (2007), pp. 557-564.

[6] U. Hahn and I. Mani, "The challenges of automatic summarization", Computer 29 (2000).

[7] R. McDonald and V. Hristidis, "A survey of text summarization techniques", Mining Text
Data, 43 (2012).

[8] V. Gupta and S. Gurpreet, "A survey of text summarization extractive techniques", J. Emerg.
Technol. Web Intel., 258 (2010).

[9] Deepali K. Gaikwad and C. Namrata Mahender, "A review paper on text summarization"
(2016).

[10] S. A. Sandri, D. Dubois and H. W. Kalfsbeek, "Elicitation, assessment, and pooling of expert
judgments using possibility theory", IEEE Transactions on Fuzzy Systems, vol. 3, no. 3,
pp. 313-335, Aug. 1995, doi: 10.1109/91.413236.

[11] Image references: Google.
