Text Mining in Big Data Analytics
By
Mr. Shoaib Moosa ARMIET/BE/CS20MD218
Mr. Azim Momin ARMIET/BE/CS20MM229
Mr. Deepesh Panday ARMIET/BE/CS20MD219
Mr. Deevesh Panday ARMIET/BE/CS20TS220
Affiliated to
UNIVERSITY OF MUMBAI
This dissertation report entitled “Text Mining in big data analytics” by Mr. Shoaib Abdul
Razzak Moosa is approved for the degree of Bachelor of Engineering in Computer
Engineering for academic year 2022 - 2023.
Examiners
1.
2.
Supervisor
1.
Date:
Place:
Declaration
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity
in my submission. I understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.
Date:
ACKNOWLEDGEMENT
We have immense pleasure in presenting the report for our project entitled “Text Mining in
Big Data Analytics”.
We would like to take this opportunity to express our gratitude to a number of people who
have been sources of help & encouragement during the course of this project.
We are very grateful and indebted to our project guide, PROF. Vivek Pandey, and our
respected HOD, PROF. MAYANK MANGAL, for their enduring patience,
guidance, and invaluable suggestions. They never let our morale down and
always supported us through thick and thin. They were a constant source of inspiration.
We would also like to thank all the staff members for their invaluable co-operation.
We are also thankful to all the students for giving us their useful advice & immense co-
operation. Their support made the working of this project very pleasant.
PREFACE
This project has been submitted in fulfillment of the requirements for the degree of Bachelor of
Engineering. We, the team members of this project, take pleasure in presenting the detailed
project report that reflects our efforts in the academic year 2022-23.
Our project involves designing a Grid framework for executing a web page, where the process
is divided into threads and the threads are executed by the executors. The outputs
generated by the executors are returned to the manager, which in turn gives the results to
the owner. This is a dedicated framework in which the manager can select particular executors to run the
web page.
Additionally, there is a Grid console that keeps track of the connected executors and the
running web page. A record of all the operations performed by either of the loggers is
maintained in a log file.
Group Members:
1. Shoaib Moosa
2. Azim Momin
3. Deepesh Panday
4. Deevesh Panday
CONTENTS
CH. NO. TOPIC NAME
1 INTRODUCTION
1.1 AIM AND OBJECTIVE
1.2 PROBLEM STATEMENT
2 REVIEW OF LITERATURE
3 EXISTING SYSTEM
4 SYSTEM ARCHITECTURE
5 FLOW CHART
6 PROPOSED SYSTEM
7 SYSTEM DESIGN
7.1 SOFTWARE REQUIREMENTS
7.2 HARDWARE REQUIREMENTS
8 IMPLEMENTATION
9 CONCLUSION
10 REFERENCES
LIST OF FIGURES
2 FLOW CHART
3 EXPECTED OUTCOME
Text mining in big data analytics is emerging as a powerful tool for harnessing the power of
unstructured textual data by analyzing it to extract new knowledge and to identify significant
patterns and correlations hidden in the data. This study seeks to determine the state of text
mining research by examining the developments within published literature over the past years and
to provide valuable insights for practitioners and researchers on the predominant trends, methods,
and applications of text mining research. In accordance with this, more than 200 academic
journal articles on the subject are included and discussed in this review; the state-of-the-art text
mining approaches and techniques used for analyzing transcripts and speeches, meeting
transcripts, and academic journal articles, as well as websites, emails, blogs, and social media
platforms, across a broad range of application areas are also investigated. Additionally, the
benefits and challenges related to text mining are also briefly outlined.
CHAPTER-1
INTRODUCTION
In recent years, we have witnessed an increase in the quantities of available digital textual
data, generating new insights and thereby opening up opportunities for research along new
channels. In this rapidly evolving field of big data analytic techniques, text mining has gained
significant attention across a broad range of applications. In both academia and industry,
there has been a shift towards research projects and more complex research questions that
mandate more than the simple retrieval of data. Due to the increasing importance of artificial
intelligence and its implementation on digital platforms, the application of parallel
processing, deep learning, and pattern recognition to textual information is crucial. All types
of business models, market research, marketing plans, political campaigns, or strategic
decision-making are facing an increasing need for text mining techniques in order to address
the competition.
PROBLEM STATEMENT:
Many issues occur during the text mining process and affect the efficiency and effectiveness
of decision making. Complexities can arise at the intermediate stage of text mining. In the
pre-processing stage, various rules are defined to standardize the text, which makes the
text mining process efficient. Before pattern analysis can be applied to a document, the
unstructured data must be converted into an intermediate form, but at this stage the mining
process has its own complications. Sometimes the real theme or data loses its importance
because of modifications to the text sequence.
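The pre-processing stage described above can be illustrated with a minimal sketch. The report does not state which standardization rules were used, so the rules here (lowercasing, stripping punctuation, dropping a few stop words) and the names `STOP_WORDS` and `preprocess` are assumptions chosen for illustration only.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption, not from the report).
STOP_WORDS = {"a", "an", "and", "is", "in", "of", "the", "to"}


def preprocess(text):
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("Text Mining is the process of extracting patterns!")
print(tokens)  # ['text', 'mining', 'process', 'extracting', 'patterns']
```

This sketch preserves token order. Representations that discard order, such as a bag of words, are one way the original theme of a text can be lost, which is the complication noted above.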
CHAPTER-2
REVIEW OF LITERATURE
One study presented a top-down and bottom-up approach for the web-based text mining process. To group
similar text documents, the authors applied the k-means clustering technique for bottom-up partitioning.
To measure similarity within the documents, the TF-IDF (Term Frequency-Inverse Document
Frequency) algorithm was used to find information regarding specific subjects. Another study gave an
overview of the applications, tools, and issues that arise in mining text. The authors discussed that
documents may be structured, semi-structured, or unstructured, and that extracting useful
information is a tedious task. They presented a generic framework for concept-based mining
that can be viewed as two phases, text refinement and knowledge distillation. The intermediate
form used for entity representation depends on the specific domain.
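As a rough illustration of the TF-IDF weighting mentioned above, the sketch below computes one common variant: term frequency as count divided by document length, and inverse document frequency as log(N / df). The reviewed work may have used a different variant; the function name `tf_idf` and the sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["text", "mining", "big", "data"],
    ["text", "clustering"],
    ["big", "data", "analytics"],
]
scores = tf_idf(docs)
# "mining" occurs in one document only, so it outweighs "text" in the first document.
print(scores[0]["mining"] > scores[0]["text"])  # True
```

A term that appears in every document receives a weight of zero under this variant, which is why TF-IDF highlights subject-specific terms.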
A third study presented innovative and efficient pattern discovery techniques. The authors used pattern evolving
and pattern discovery techniques to improve the effectiveness of finding relevant and
appropriate information. They performed BM25 and support vector machine based filtering on
Reuters Corpus Volume 1 and Text REtrieval Conference (TREC) data to estimate the effectiveness of the
suggested technique. A fourth study performed various classification experiments using multi-word
features on text. The authors proposed a hand-crafted method to extract multi-word features from
the data set. To classify the extracted multi-word text, they used support vector machines with linear and nonlinear
polynomial kernels, which improved the effectiveness of the extracted
data.
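The BM25 filtering referred to above ranks documents against a query. The sketch below is a generic Okapi BM25 scorer, not the reviewed authors' implementation; the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF form, and the sample documents are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative (one widely used variant).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["pattern", "discovery", "text"],
    ["stock", "price"],
    ["text", "mining", "pattern", "pattern"],
]
ranking = bm25_scores(["pattern"], docs)
print(ranking)
```

In this example the third document scores highest because it contains the query term twice, and the second document scores zero because it does not contain the term.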
CHAPTER-3
EXISTING SYSTEM
In the existing system, we propose that a company’s performance, in terms of its stock
price movement, can be predicted from internal communication patterns. To obtain early warning
signals, we believe it is vital to detect patterns in company communication networks
early, so that significant stock price movements can be predicted and possible
adversities in the stock market avoided, protecting
stakeholders’ interests as far as possible. Despite the potential
importance of such data regarding corporate communication, very little work has been done in
this vital direction. We attempt to bridge this research gap by employing a data-mining
method to examine the linkage between a firm’s communication data and its share price. As
Enron Corporation’s e-mail messages constitute the only corpus available to the public, we
make use of Enron’s e-mail corpus as the training and testing data for our proposed
algorithm.
Prediction of stock and Forex prices has always been a popular and profitable area of study.
Deep learning applications have been shown to deliver better accuracy and returns in the
field of financial prediction and forecasting. In this survey, we selected research papers from
the Digital Bibliography & Library Project (DBLP) database for comparison and analysis.
We grouped the papers by type of deep learning method:
Convolutional Neural Network (CNN); Long Short-Term Memory (LSTM); Deep Neural
Network (DNN); Recurrent Neural Network (RNN); Reinforcement Learning; and other deep
learning methods such as Hybrid Attention Networks (HAN), self-paced learning mechanism
(NLP), and WaveNet. Furthermore, this paper examines the dataset, variables, model, and
results of each article. The survey presents the results through the most used
performance metrics: Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE),
Mean Absolute Error (MAE), Mean Square Error (MSE), accuracy, Sharpe ratio, and return rate.
We found that recent models combining LSTM with other methods, for example, DNN, are widely
researched. Reinforcement learning and other deep learning methods yielded strong returns
and performance. We conclude that, in recent years, the use of deep-learning-based
methods for financial modelling has been increasing exponentially.
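The error metrics named in the survey follow standard definitions, sketched below for reference. The sample price series is invented purely for illustration and does not come from the survey or from this project.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Square Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; undefined if any actual value is 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, while MAPE expresses the error relative to the size of the actual values.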
CHAPTER-4
SYSTEM ARCHITECTURE
CHAPTER-5
FLOW CHART
CHAPTER-6
PROPOSED SYSTEM
Email filtering:
Email system providers mine incoming email to identify distinctive characteristics of spam
and phishing messages, automatically deleting or quarantining messages before they are
delivered to employees. This helps businesses minimize the risk of cyberattacks.
Human resources:
By analyzing the content of emails and other communications within the company, HR teams
can gain insights into employee concerns and measure employee engagement.
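The email filtering use case above could be approached in many ways, and the report does not name a specific method. As one possible sketch, the code below uses a small multinomial Naive Bayes classifier with Laplace smoothing. The labels `spam` and `ham`, the function names, and the training messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Build word counts per label from (tokens, label) pairs.

    Assumes both labels, 'spam' and 'ham', occur in the training data.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for tokens, label in messages:
        word_counts[label].update(tokens)
        label_counts[label] += 1
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocab


def classify(tokens, model):
    """Return the label with the higher log posterior for the given tokens."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        # Log prior plus Laplace-smoothed log likelihood of each known word.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            if token in vocab:
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages, for illustration only.
training = [
    (["win", "free", "prize", "now"], "spam"),
    (["free", "offer", "click", "now"], "spam"),
    (["meeting", "agenda", "for", "monday"], "ham"),
    (["project", "report", "attached"], "ham"),
]
model = train(training)
print(classify(["free", "prize", "click"], model))  # spam
print(classify(["project", "meeting"], model))      # ham
```

A production filter would need far more training data and additional signals such as sender reputation and links; this sketch only shows the text-based part.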
CHAPTER-7
SYSTEM DESIGN
1. In the case study, we need to visualize consumer habits and styles from different
perspectives. This step should not be approached carelessly; otherwise, the results
will be messy and disorganized.
2. The next step is to assemble the data to discover further patterns and biases
within the datasets.
6. Finally, our script is deployed and can be accessed by anyone in the world.
CHAPTER-8
IMPLEMENTATION
A. Expected Outcome:
In the first step of this data science project, we will perform data exploration. We will import
the essential packages required for this task and then read our data. Finally, we will go through
the input data to gain the necessary insights about it.
We will then display the first six rows of our dataset using the head() function and use the
summary() function to output a summary of it.
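The exploration step described above can be sketched as follows. The project's actual code and dataset are not reproduced in this report, so this is a rough standard-library Python equivalent of the head() and summary() calls. The sample data, the column names `doc_id` and `word_count`, and the helper functions are hypothetical.

```python
import csv
import io
import statistics

# Hypothetical stand-in for the project's dataset, which the report does not show.
SAMPLE_CSV = """doc_id,word_count
1,120
2,85
3,200
4,150
5,95
6,60
7,180
8,110
"""


def head(rows, n=6):
    """Return the first n rows, six by default."""
    return rows[:n]


def summary(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
word_counts = [int(row["word_count"]) for row in rows]
stats = summary(word_counts)

for row in head(rows):
    print(row)
print(stats)
```

Reading from a file instead of the inline sample only requires replacing `io.StringIO(SAMPLE_CSV)` with an opened file object.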
Code and output: [screenshots not reproduced]
CHAPTER-9
CONCLUSION
New technologies have facilitated access to immense quantities of digital text, recording an
ever-increasing share of human interaction, communication, and culture. Text mining
provides a framework to maximize the value of information within large quantities of text.
As a result, the use of text mining technologies has increased steadily in recent years and has
become highly diverse.
CHAPTER-10
REFERENCES