
Text mining in big data analytics

Submitted in partial fulfilment of the requirements of the degree of


Bachelor of Engineering

By
Mr. Shoaib Moosa ARMIET/BE/CS20MD218
Mr. Azim Momin ARMIET/BE/CS20MM229
Mr. Deepesh Panday ARMIET/BE/CS20MD219
Mr. Deevesh Panday ARMIET/BE/CS20TS220

Under the Guidance of


PROF. Vivek Pandey

ALAMURI RATNAMALA INSTITUTE OF ENGINEERING AND TECHNOLOGY

Affiliated to

UNIVERSITY OF MUMBAI

Department of Computer Engineering


Academic Year – 2022-2023
CERTIFICATE

This dissertation report entitled “Text Mining in big data analytics” by Mr. Shoaib Abdul
Razzak Moosa is approved for the degree of Bachelor of Engineering in Computer
Engineering for academic year 2022 - 2023.

Examiners

1.

2.

Supervisor

1.

(Prof. Archana Khelurkar)

Head of the Department Principal

Date:

Place:
Declaration
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.

Mr. Shoaib Moosa

Date:
ACKNOWLEDGEMENT
We have immense pleasure in presenting the report for our project entitled “Text Mining in
big data analytics”.

We would like to take this opportunity to express our gratitude to a number of people who
have been sources of help and encouragement during the course of this project.

We are very grateful and indebted to our project guide PROF. Vivek Pandey and our
respected HOD PROF. MAYANK MANGAL for their enduring patience, guidance and
invaluable suggestions. They never let our morale drop and always supported us through
thick and thin. They were a constant source of inspiration for us and took the utmost
interest in our project.

We would also like to thank all the staff members for their invaluable co-operation and for
permitting us to work in the computer lab.

We are also thankful to all the students for giving us their useful advice and immense
co-operation. Their support made working on this project very pleasant.
PREFACE
This project has been submitted in fulfilment of the requirements for the degree of Bachelor
of Engineering. We, the team members of this project, take pleasure in presenting the detailed
project report that reflects our efforts in the academic year 2022-23.

Our project involves designing a Grid framework for executing a web page, where the process
is divided into threads and the threads are executed by executors. The outputs generated by
the executors are returned to the manager, which in turn gives the results to the owner. This
is a dedicated grid in which the manager can select particular executors to run the web page.

Initially, the manager is started by connecting it to a storage application. The executors are
connected to the manager by providing the required credentials. Once the executors are
connected to the manager, execution of the required web page can be started.

Additionally, there is a Grid console which keeps track of the connected executors and the
web page running. A record of all the operations performed is maintained by the logger in a
log file.

Group Members:
1. Shoaib Moosa
2. Azim Momin
3. Deepesh Panday
4. Deevesh Panday
CONTENTS
CH. NO.   TOPIC NAME
1         INTRODUCTION
1.1       AIM AND OBJECTIVE
1.2       PROBLEM STATEMENT
2         REVIEW OF LITERATURE
3         EXISTING SYSTEM
4         SYSTEM ARCHITECTURE
5         FLOW CHART
6         PROPOSED SYSTEM
7         SYSTEM DESIGN
7.1       SOFTWARE REQUIREMENTS
7.2       HARDWARE REQUIREMENTS
8         IMPLEMENTATION
9         CONCLUSION
10        REFERENCES
LIST OF FIGURES

FIGURE NO.   FIGURE NAME
1            SYSTEM ARCHITECTURE
2            FLOW CHART
3            EXPECTED OUTCOME
4            STYLE YOUR APPLICATION
5            GENERATING A COMPANY INFORMATION AND GRAPHS
6            CREATING THE MACHINE LEARNING MODEL
7            DEPLOYING THE PROJECT ON HEROKU


Abstract

Text mining in big data analytics is emerging as a powerful tool for harnessing the power of
unstructured textual data by analyzing it to extract new knowledge and to identify significant
patterns and correlations hidden in the data. This study seeks to determine the state of text
mining research by examining developments in the published literature over the past years and
to provide valuable insights for practitioners and researchers on the predominant trends, methods,
and applications of text mining research. In accordance with this, more than 200 academic
journal articles on the subject are included and discussed in this review; the state-of-the-art text
mining approaches and techniques used for analyzing transcripts and speeches, meeting
transcripts, and academic journal articles, as well as websites, emails, blogs, and social media
platforms, across a broad range of application areas are also investigated. Additionally, the
benefits and challenges related to text mining are also briefly outlined.
Text Mining In Big Data Analysis

CHAPTER-1
INTRODUCTION

ALAMURI RATNAMALA INSTITUTE OF ENGINEERING & TECHNOLOGY



Introduction-

In recent years, we have witnessed an increase in the quantities of available digital textual
data, generating new insights and thereby opening up opportunities for research along new
channels. In this rapidly evolving field of big data analytic techniques, text mining has gained
significant attention across a broad range of applications. In both academia and industry,
there has been a shift towards research projects and more complex research questions that
mandate more than the simple retrieval of data. Due to the increasing importance of artificial
intelligence and its implementation on digital platforms, the application of parallel
processing, deep learning, and pattern recognition to textual information is crucial. All types
of business models, market research, marketing plans, political campaigns, or strategic
decision-making are facing an increasing need for text mining techniques in order to address
the competition.

Aim And Objective: -

Widely used in knowledge-driven organizations, text mining is the process of examining
large collections of documents to discover new information or help answer specific research
questions. Text mining identifies facts, relationships and assertions that would otherwise
remain buried in the mass of textual big data.

PROBLEM STATEMENT: -
Many issues arise during the text mining process and affect the efficiency and effectiveness
of decision making. Complexities can arise at the intermediate stage of text mining. In the
pre-processing stage, various rules are defined to standardize the text, which makes the text
mining process efficient. Before pattern analysis can be applied to a document, the
unstructured data must be converted into an intermediate form, but this stage of the mining
process has its own complications. Sometimes the real theme or meaning of the data loses its
importance because the text sequence has been modified.


CHAPTER-2
REVIEW OF LITERATURE


One study described gathering, extracting, pre-processing, text transformation, feature
extraction, pattern selection, and evaluation as the steps of the text mining process. In
addition, widely used text mining techniques, i.e., clustering, categorization, and decision tree
categorization, and their applications in diverse fields were surveyed. [8] highlighted the issues
in text mining applications and techniques. They discussed that dealing with unstructured text
is more difficult than dealing with structured or tabular data using traditional mining tools and
techniques. They showed applications of the text mining process in bioinformatics, business
intelligence, and national security systems. Natural language processing and entity recognition
techniques have reduced the issues that occur during the text mining process. However, there
remain issues that need attention.

Another study explored the MEDLINE biomedical database by integrating a framework for named
entity recognition, text classification, hypothesis generation and testing, relationship and
synonym extraction, and abbreviation extraction. This framework helps to eliminate unnecessary
details and extract valuable information. A further study analyzed text using text mining
patterns and showed that term-based approaches cannot properly handle synonymy and polysemy.
Moreover, a prototype model was designed that specifies patterns by assigning weights according
to their distribution. This approach helps to enhance the efficiency of the text mining process.
Another work presented a crime detection system using text mining tools, in which a relation
discovery algorithm was designed to correlate terms with their abbreviations.

Another study presented a top-down and bottom-up approach for web-based text mining. To combine
similar text documents, they apply the k-means clustering technique for bottom-up partitioning.
To measure similarity within the documents, the TF-IDF (Term Frequency-Inverse Document
Frequency) algorithm is used to find information regarding specific subjects. A further study
gave an overview of the applications, tools, and issues that arise when mining text. They
discussed that documents may be structured, semi-structured, or unstructured, and that extracting
useful information is a tiresome task. They presented a generic framework for concept-based
mining which can be visualized as text refinement and knowledge distillation phases. The
intermediate form used for entity representation depends on the specific domain.
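As a rough illustration of how TF-IDF weighting can be used to compare documents, the following is a minimal sketch in plain Python. It is not the implementation used in the surveyed work; the function names, the whitespace tokenizer, and the three sample documents are assumptions made only for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return one {term: weight} dictionary per document using TF-IDF weighting."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, invented for illustration only.
docs = [
    "stock price prediction with deep learning",
    "deep learning for stock market prediction",
    "customer segmentation with k-means clustering",
]
vectors = tf_idf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01 > sim_02)  # the two stock-related documents are the closer pair
```

Similarity scores of this kind are what a clustering step such as k-means would then group on; a production system would normally add stop-word removal and stemming before weighting.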

Another study presented innovative and efficient pattern discovery techniques. They used pattern
evolving and discovering techniques to enhance the effectiveness of discovering relevant and
appropriate information. They performed BM25 and support vector machine based filtering on
Reuters Corpus Volume 1 and Text Retrieval Conference data to estimate the effectiveness of the
suggested technique. A further study performed various classification experiments using
multi-word features on text. They proposed a hand-crafted method to extract multi-word features
from the data set. To classify the extracted multi-word features, they used linear and nonlinear
polynomial kernels in a support vector machine, which improved the effectiveness of the extracted
data.


CHAPTER-3
EXISTING SYSTEM


In the existing system, we propose that a company’s performance, in terms of its stock
price movement, can be predicted from internal communication patterns. To obtain early warning
signals, we believe that it is vital for patterns in company communication networks to
be detected early, so that significant stock price movements can be predicted and the possible
adversities that an organization could face in the stock market can be avoided, protecting
stakeholders’ interests as far as possible. Despite the potential
importance of such data regarding corporate communication, very little work has been done in
this direction. We attempt to bridge these research gaps by employing a data-mining
method to examine the linkage between a firm’s communication data and its share price. As
Enron Corporation’s e-mail messages constitute the only corpus available to the public, we
make use of Enron’s e-mail corpus as the training and testing data for our proposed
algorithm.

Predictions of stock and Forex have always been a trending and profitable area of study.
Deep learning applications have been shown to deliver better accuracy and returns in the
field of financial prediction and forecasting. In this survey, we selected research papers from
the Digital Bibliography & Library Project (DBLP) database for comparison and analysis.
We separated the papers according to the type of deep learning method used:
Convolutional Neural Network (CNN); Long Short-Term Memory (LSTM); Deep Neural
Network (DNN); Recurrent Neural Network (RNN); Reinforcement Learning; and other deep
learning methods such as Hybrid Attention Networks (HAN), self-paced learning mechanism
(NLP), and WaveNet. Furthermore, this paper examines the dataset, variables, model, and
results of each article. The survey presents the results through the most used
performance metrics: Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE),
Mean Absolute Error (MAE), Mean Square Error (MSE), accuracy, Sharpe ratio, and return rate.
We recognized that recent models combining LSTM with other methods, for example DNN, are
widely researched. Reinforcement learning and other deep learning methods delivered great
returns and performance. We conclude that, in recent years, the trend of using
deep-learning-based methods for financial modelling has been increasing exponentially.
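The error metrics named in the survey can be written out directly from their standard definitions. The sketch below is a plain-Python illustration; the function names and the price values are assumptions made for this example and are not taken from any surveyed paper.

```python
import math


def mse(actual, predicted):
    """Mean Square Error: average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean Absolute Error: average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent. Actual values must be non-zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical closing prices and model predictions, invented for illustration.
actual_prices = [100, 102, 105, 107]
predicted_prices = [101, 101, 106, 105]

for name, metric in [("MSE", mse), ("RMSE", rmse), ("MAE", mae), ("MAPE", mape)]:
    print(name, round(metric(actual_prices, predicted_prices), 4))
```

RMSE and MAE are in the same units as the price, while MAPE is scale-free, which is one reason surveys tend to report several of these metrics together.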


CHAPTER-4
SYSTEM ARCHITECTURE


FIGURE 1: - SYSTEM ARCHITECTURE


CHAPTER-5
FLOW CHART


FIGURE 2: - FLOW CHART


CHAPTER-6
PROPOSED SYSTEM


Manufacturing and product development:
Analysis of machine logs and maintenance tickets can pinpoint problems in the
manufacturing process, as well as in the finished product.

Email filtering:
Email system providers mine incoming email to identify distinctive characteristics of spam
and phishing messages, automatically deleting or quarantining messages before they are
delivered to employees. This helps businesses minimize the risk of cyberattacks.
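One common way to learn the distinctive characteristics of spam is a Naive Bayes word model. The sketch below is a minimal illustration of that technique, not a description of any particular email provider's filter; the function names, the "spam"/"ham" labels, and the four training messages are assumptions invented for this example.

```python
import math
from collections import Counter


def train(messages):
    """Build word counts per label from (text, label) pairs; label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the higher Naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training messages, invented for illustration only.
training_messages = [
    ("win a free prize now", "spam"),
    ("claim your free reward now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train(training_messages)
print(classify("free prize now", model))
print(classify("project meeting report", model))
```

A real filter would be trained on a large labelled corpus and would quarantine rather than delete borderline messages.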

Competitive marketing analysis:
Mining the sentiment of competitor reviews in sources such as Yelp enables a business to
assess the competition's strengths and weaknesses.

Human resources:
By analyzing the content of emails and other communications within the company, HR teams
can gain insights into employee concerns and measure employee engagement.


CHAPTER-7
SYSTEM DESIGN


7.1 Software Requirements


• Operating System: Windows 98, Windows XP, Windows 7 or better
• Language: Python

7.2 Hardware Requirements


• Random Access Memory (RAM): 1 GB or above
• Central Processing Unit (CPU): 1.7 GHz Processor and above
• Operating System (OS): Windows 8 and above

The system is designed such that it works in the following way:

1. In the exploration stage, we visualize customer habits and patterns from different
perspectives. This step should not be rushed; otherwise, the results will be messy
and disorganized.

2. The next step is to assemble the data to discover further patterns and biases
inside the datasets.

3. K-means clustering is a well-known method of unsupervised machine learning. This
method identifies the distinct “clusters” in the data and groups similar points together
while keeping each cluster as compact as possible.

4. Determining the best set of hyperparameters for the algorithm is the next step in
customer segmentation with ML, because it helps us obtain the most accurate and
meaningful customer segments.

5. Next, we visualize the results using the open-source Plotly Python library, a plotting
library for making interactive graphs, plots, and charts. We then interpret the charts
and graphs to inform decisions for our enterprise.

6. Finally, our script is deployed and can be accessed by anyone in the world.
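Steps 3 and 4 above can be sketched in plain Python as follows. This is a simplified illustration of k-means and of comparing the within-cluster sum of squares (inertia) across values of k; it is not the project's actual code, and the function names, the fixed random seed, and the (annual income, spending score) points are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(point, centroids[i]))
        clusters[nearest].append(point)
    return clusters


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, clusters, inertia) for k clusters."""
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    centroids = rng.sample(points, k)    # start from k distinct data points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:         # converged: no centroid moved
            break
        centroids = updated
    clusters = assign(points, centroids)
    inertia = sum(squared_distance(p, centroids[i])
                  for i, cluster in enumerate(clusters) for p in cluster)
    return centroids, clusters, inertia


# Hypothetical (annual income, spending score) pairs, invented for illustration.
customers = [(15, 20), (16, 18), (18, 22), (20, 19),
             (80, 85), (82, 90), (85, 88), (88, 92)]

# Hyperparameter search: compare inertia for several k and look for the "elbow".
for k in (1, 2, 3):
    _, _, inertia = k_means(customers, k)
    print(k, round(inertia, 2))

centroids, clusters, inertia_2 = k_means(customers, 2)
```

On data like this the inertia drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the usual signal for choosing k = 2.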


CHAPTER-8
IMPLEMENTATION


A. Expected Outcome:
In the first step of this data science project, we will perform data exploration. We will import
the essential packages required for this task and then read our data. Finally, we will go through
the input data to gain the necessary insights about it.

We will now display the first six rows of our dataset using the head() function and use the
summary() function to output a summary of it.
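The exploration step can be approximated with Python's standard library alone. The sketch below is only an analogue of the head() and summary() calls mentioned above; the helper names `load_rows`, `head`, and `summarize`, the column names, and the inline sample rows are assumptions invented for this example, not the project's dataset.

```python
import csv
import io
import statistics

# Hypothetical sample of a customer dataset, invented for illustration.
SAMPLE_CSV = """CustomerID,Gender,Age,AnnualIncome,SpendingScore
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
6,Female,22,17,76
7,Female,35,18,6
"""


def load_rows(source):
    """Read CSV text from a file-like object into a list of dictionaries."""
    return list(csv.DictReader(source))


def head(rows, n=6):
    """Return the first n rows, mirroring the 'first six rows' step."""
    return rows[:n]


def summarize(rows, column):
    """Return min, median, mean, and max of a numeric column."""
    values = [float(row[column]) for row in rows]
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


customer_data = load_rows(io.StringIO(SAMPLE_CSV))
for row in head(customer_data):
    print(row)
print(summarize(customer_data, "Age"))
```

To read a real file instead, `load_rows` can be passed an open file handle, for example `open(path, newline="")`, where `path` is whatever location the dataset is stored at.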


B. Customer Gender Visualizations:


In this step, we will create a bar plot and a pie chart to show the gender distribution across
our customer_data dataset.
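The counts and percentages behind such a bar plot and pie chart can be computed as in the sketch below. It prints a text-based bar chart instead of drawing a graphical plot so that it runs without any plotting library; the function names and the five sample rows are assumptions made for this example.

```python
from collections import Counter


def gender_distribution(rows):
    """Return {gender: (count, percentage)} for the 'Gender' column."""
    counts = Counter(row["Gender"] for row in rows)
    total = sum(counts.values())
    return {gender: (count, 100.0 * count / total)
            for gender, count in counts.items()}


def text_bar_chart(distribution, width=40):
    """Render the distribution as one text bar per gender."""
    lines = []
    for gender, (count, percent) in sorted(distribution.items()):
        bar = "#" * round(width * percent / 100)
        lines.append(f"{gender:<8}{bar} {count} ({percent:.1f}%)")
    return "\n".join(lines)


# Hypothetical rows standing in for customer_data, invented for illustration.
customer_data = [
    {"Gender": "Female"},
    {"Gender": "Male"},
    {"Gender": "Female"},
    {"Gender": "Female"},
    {"Gender": "Male"},
]
distribution = gender_distribution(customer_data)
print(text_bar_chart(distribution))
```

The counts correspond to the bar heights and the percentages to the pie slices, so the same dictionary can feed either chart in a plotting library.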

Code:

Screenshots

Output


C. Visualization of Age Distribution:


Let us plot a histogram to view the frequency distribution of customer ages. We
will first take a summary of the Age variable.
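The binning that a histogram performs can be sketched as follows. The ten-year bin width, the function name, and the sample ages are assumptions chosen for this example; a plotting library would draw the same counts as bars.

```python
from collections import Counter


def age_histogram(ages, bin_width=10):
    """Count how many ages fall into each bin of `bin_width` years."""
    bins = Counter((age // bin_width) * bin_width for age in ages)
    return dict(sorted(bins.items()))


# Hypothetical customer ages, invented for illustration.
ages = [19, 21, 20, 23, 31, 22, 35, 64, 30, 58]
BIN_WIDTH = 10

histogram = age_histogram(ages, BIN_WIDTH)
for start, count in histogram.items():
    # One '#' per customer whose age falls in the bin.
    print(f"{start}-{start + BIN_WIDTH - 1}: {'#' * count}")
```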

Code

Output


D. Analyzing Spending Score of the Customers:


Code

Output


CHAPTER-9
CONCLUSION


CONCLUSION
New technologies have facilitated access to immense quantities of digital text, recording an
ever-increasing share of human interaction, communication, and culture. Text mining
provides a framework to maximize the value of information within large quantities of text;
as a result, the use of text mining technologies has increased steadily in recent years and has
become highly diverse.


CHAPTER-10
REFERENCES


REFERENCES
[1] Blanchard, T., Bhatnagar, P., Behera, D. (2019). Data Science for Marketing Analytics:
Achieve your marketing goals with the data analytics power of Python. Packt Publishing Limited.

[2] Griva, A., Bardaki, C., Pramatari, K., Papakyriakopoulos, D. (2018). Retail business analytics:
Customer visit segmentation using market basket data. Expert Systems with Applications, 100, 1-16.

[3] Hong, T., Kim, E. (2011). Segmenting customers in online stores based on factors that affect the
customer's intention to purchase. Expert Systems with Applications, 39 (2), 2127-2131.

[4] Hwang, Y. H. (2019). Hands-On Data Science for Marketing: Improve your marketing strategies
with machine learning using Python and R. Packt Publishing Limited.

[5] Premkanth, P. (2012). Market Segmentation and Its Impact on Customer Satisfaction with
Special Reference to Commercial Bank of Ceylon PLC. Global Journal of Management and
Business Research. Publisher: Global Journals Inc. (USA). Print ISSN: 0975-5853. Volume 12,
Issue 1.

[6] Premkanth, P. (2012). Market Segmentation and Its Impact on Customer Satisfaction with
Special Reference to Commercial Bank of Ceylon PLC. Global Journal of Management and
Business Research. Publisher: Global Journals Inc. (USA). Print ISSN: 0975-5853. Volume 12,
Issue 1.

[7] Goyat, S. (2011). The basis of market segmentation: a critical review of literature. European
Journal of Business and Management, www.iiste.org. ISSN 2222-1905 (Paper), ISSN 2222-2839
(Online). Vol 3, No. 9.

[8] Thomas, J. W. (2007). Accessed at: www.decisionanalyst.com on July 12, 2015.
