
Volume 9, Issue 1, January – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Analytical Study on Unstructured Data Management in Application Data Base through NLP and Datamining

Anisha S¹ and Dr. S Thiyagarajan²
¹ Department of Computer Science, St. Joseph University, Dimapur, Nagaland, India
² Department of Computer Science, St. Joseph University, Dimapur, Nagaland, India

Abstract:- Business organizations are flooded with large pools of unstructured data. Loading these data into a business database requires many processes. Companies with BPO and KPO operations convert unstructured data into their software databases using huge resources: programming, multiple queries and many users. Dealing with such complex situations calls for an automated system that saves a large amount of time and resources. The aim of the present research is to analyse methodically the technical work relating to the application of data mining, artificial intelligence (AI) and machine learning (ML) in the software industry. This paper combines different disciplines: data mining techniques, ML and NLP. Its objective is to improve an organization's business intelligence process through maximum exploitation of the unstructured data it owns. The paper primarily examines the applicability of a combination of data mining techniques, NLP and ML in handling unstructured data, with the goal of reducing the burden on users by minimizing the use of multiple queries and making data extraction from large databases user-friendly.

Keywords:- Application Database, Data mining, ML, NLP.

I. INTRODUCTION

Unstructured data commonly appear in portals, blogs, bulk Excel files, emails, notes from call centers, and all forms of human communication, including system-to-system processing. All these processes and media produce large amounts of unstructured and semi-structured data. Creating value and extracting the right information from large sets of unstructured data is a tiresome process. Many large organizations such as IBM, GE, and Siemens have developed analytical tools for unstructured data management; these systems are superior at handling data using natural language. However, many mid-level software companies have not adopted such tools because of their complexity and out of hesitation. In this paper, some simple models for unstructured data management are proposed.

The objective of this paper is to provide insight into how to apply the principles of data mining and AI, via NLP, to the unstructured data in the application domain database.

• Heterogeneous Data
Data from multiple sources and in multiple formats need to be associated with organizational interest to be meaningful for decision-making and for subjects related to business activities. This study considers different sample data for examination. Even though unstructured data comes in various formats, we are concerned only with unstructured data in the form of text and Excel files. Extracting and classifying unstructured data according to subject and issue transforms it into more concrete and firm data that the organization can use effectively in its decision-making process. Optimal utilization and manipulation of unstructured data requires a good business intelligence model that associates the unstructured data with the subjects and issues related to organizational interest.

The purpose of this study is to provide an in-depth overview of the applicability of various data mining and ML algorithms to the application domain database, in place of SQL queries, for unstructured data management. This paper addresses the following research questions:
• Can AI, NLP and other data mining processes replace the skilled resources and long database queries used when converting structured and unstructured data into the application database?
• If so, is it reliable?
• Will it reduce the conversion time and the cost of testing the database?

II. CONVERTING UNSTRUCTURED DATA INTO BUSINESS-ORIENTED DATA

To create entities from unstructured data, the data need to be associated with the subject-related application database structure. The unstructured data is to be transformed into more concrete and firm data that organizations can use to develop databases in the application domain for the decision-making process. This study proposes five processes for this transformation from unstructured to structured data: Data Extraction, Word Embedding, Clustering, Classification and Data Mapping. Data Extraction is about identifying and analyzing unstructured data from multiple sources and formats. In Data Classification, which follows extraction, the unstructured data are classified or categorized according to the requirements. The two main steps involved are determining the main data classes and categorizing the data according to those classes. Categorization makes data searching much better by grouping unstructured data, equipped with metadata, into classes that share the same characteristics. Categorization is also essential for developing a repository for each data class and for mapping unstructured data from the main classes to thematic topics.

III. TEXT MINING THROUGH DEEP LEARNING

A. Word representation
In the mathematical model, each text is represented as a vector, and each dimension of the vector represents a single word. A word found in the sentence is flagged as '1', and as '0' if it is absent. The size of the vocabulary equals the dimension of the vectors.

$w_{dj} = (v_{1,j}, v_{2,j}, \ldots, v_{t,j})$

An embedding layer serves as a look-up table that takes the words' indexes in the vocabulary as input and outputs word vectors. The size of the table is determined by the vocabulary size and the embedding dimension.

B. RNN
Recurrent Neural Networks (RNNs) were designed to work with sequence prediction problems. Sequence prediction comes in the following forms:
• One-to-Many: a single observation as input is mapped to a multiclass or label output.
• Many-to-One: the input is a sequence of words; the output is one single class.
• Many-to-Many: the input is a sequence of words; the output is multiclass.

C. LSTM
LSTM is a form of RNN that can learn long-term dependencies during classification through an efficient gradient-based technique. LSTM is designed to get rid of the vanishing error problem [1]. It works extremely well on a large variety of problems and is now widely used. LSTMs also have the chain-like structure of an RNN, but the repeating module has a slightly different structure: there are multiple layers, interacting in a unique way. LSTM is more efficient than a simple RNN [2].

The key to LSTMs is the cell state, which is like a conveyor belt. It runs straight down the entire chain, with only minor linear interactions, so it is very easy for information to flow along it unchanged. The LSTM has the capacity to remove or add information to the cell state, carefully regulated by structures called gates [3].

Gates are a path for information to pass through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing pass," while a value of one means "let everything pass." An LSTM has the following three gates:
• Forget gate layer: the first step in an LSTM, which decides what information to throw away from the cell state.
• Input gate layer: decides what new information to store in the cell state. This has two components. First, a sigmoid layer called the "input gate layer" decides which values to update. Next, a layer creates a vector of new candidate values that could be added to the state. The next step combines these two to create an update to the state [3].
• Output gate: controls the value of the next hidden state, which contains information on previous inputs.

D. CNN
CNN models have achieved excellent results in semantic parsing and query retrieval and have been found effective for NLP [4],[5]. A CNN model is used as a feature extractor that encodes the semantic features of sentences before these features are fed to a classifier.

E. Support Vector Machine
The Support Vector Machine (SVM) approach is used to classify related documents using a vector method. This technique was studied as per ref. [11]. SVM treats classification as a two-class problem in which a hyperplane separates the data classes. In machine learning, support vector machines are learning models used to explore data. The optimal hyperplane, shown in Figure 1, is the one for which the distance from the plane to the nearest points (the margin) is maximized; this maximum-margin hyperplane best divides the data shown in the figure. Only the points closest to the boundary matter when choosing the hyperplane; all other points have no influence. These closest points are called support vectors, and the resulting model is known as a support vector classifier (SVC) because it assigns each point to one class or the other according to the side of the hyperplane on which it falls.

Fig. 1. SVM using hyper plane

• Proposed Frame Work: Data Loading Automation Model
A simple model framework is implemented that combines data pre-processing, clustering and classification algorithms for easy implementation in NLP (Python). The specified model classifies unstructured text into predefined classes and uses various sets of data as input. Even though unstructured data comes in various formats, this study considers only unstructured data in the form of text.
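The gate description in Section III.C can be made concrete with a single LSTM time step. The sketch below is illustrative only and is not the model trained in this study: it assumes scalar (one-dimensional) input, hidden state and cell state, and the function name `lstm_step`, the `params` dictionary and all weight values are hypothetical choices made for the example.

```python
import math


def sigmoid(x):
    """Squash x into (0, 1); every gate uses this."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state and cell state.

    params maps 'forget', 'input', 'candidate' and 'output' to a
    (w_x, w_h, bias) tuple. All weights are hypothetical.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # forget gate: how much old cell state to keep
    i = sigmoid(pre("input"))        # input gate: how much of the candidate to write
    g = math.tanh(pre("candidate"))  # candidate value that could be added
    o = sigmoid(pre("output"))       # output gate: how much of the cell to expose

    c = f * c_prev + i * g           # updated cell state (the "conveyor belt")
    h = o * math.tanh(c)             # next hidden state
    return h, c


# Hypothetical weights: (w_x, w_h, bias) per gate.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}

# Run the cell over a short made-up input sequence.
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
```

A forget gate near one together with an input gate near zero leaves the cell state untouched, which is the "information flows unchanged" behaviour described for the cell state.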
• Data Extraction
The first step in this model is the extraction of data, using the Beautiful Soup, pandas and sklearn libraries for text extraction. Metadata management is used to identify the file types and sources.

Fig. 2. Data Loading Automation model

IV. DATA PRE-PROCESSING

This process removes noise from the data, cleans and pads it, and fills blank values with the mean or a constant value according to the business logic. The sklearn pre-processing and pandas libraries are used for this. For word representation, a Gensim model is used, which in turn generates a word embedding layer. The output generated here is used as input for the next step.

In data pre-processing, the data set is split into training and test sets. For this purpose, we use the function provided by sklearn. If the data set has many dimensions and long sentences, a combination of clustering and a classification (LSTM) model is used. To extract semantic structure efficiently, we use CNN with LSTM.

A. Clustering with Classification (LSTM)
A classifier works well with clustered data. The accuracy of a classifier can be improved by clustering the data set before applying the classification algorithm [6]. The clustering algorithm used in the proposed framework is K-means, implemented through the sklearn library in Python. Sentence vectors are clustered into k subclasses; the data can be trained table-wise or column-wise according to the database structure. A cluster ID is assigned to the resulting clustered data, which is then used as input to the classification. For each method, training and testing of the data sets are conducted separately.

Fig. 3. Flow chart clustering with LSTM model

B. CNN with LSTM
The output generated from word representation, in the form of an embedding layer, is used here as input. The embedding layer is passed to a convolution layer, whose outputs are passed to the pooling layers. The resulting outputs are merged and reduced to a linear layer output, which is passed to the LSTM. The CNN learns spatial structure, and the learned spatial structure is passed to the LSTM layer for further learning.
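The CNN with LSTM flow (embedding, then convolution, then pooling, then a recurrent layer) can be illustrated by following how the sequence shrinks at each stage. The following is a toy sketch under stated assumptions, not the network used in this study: the embedding values are made up, a single filter `kernel` stands in for a full convolution layer, and the helper names `conv1d` and `max_pool` are hypothetical.

```python
def conv1d(sequence, kernel):
    """Slide one filter over a sequence of embedding vectors (no padding).

    sequence: list of T vectors, each of length D.
    kernel:   list of K vectors, each of length D (one hypothetical filter).
    Returns T - K + 1 scalar features after a ReLU.
    """
    k = len(kernel)
    features = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = sum(w * x
                    for w_vec, x_vec in zip(kernel, window)
                    for w, x in zip(w_vec, x_vec))
        features.append(max(0.0, total))  # ReLU keeps only positive responses
    return features


def max_pool(features, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]


# Six tokens embedded in two dimensions (made-up values).
embedded = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4],
            [0.7, 0.1], [-0.3, 0.2], [0.2, 0.6]]
kernel = [[1.0, 0.0], [0.0, 1.0]]    # filter width 2

conv_out = conv1d(embedded, kernel)  # length 6 - 2 + 1 = 5
pooled = max_pool(conv_out, 2)       # length 2
# `pooled` is the shortened sequence that a recurrent (LSTM) layer would read next.
```

A real layer would apply many filters in parallel, so each pooled position would be a vector rather than a single number; the reduction in sequence length before the LSTM is the point of the example.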
Fig. 4. Flow chart CNN with LSTM model

C. SVM
The preprocessing steps above were performed for prediction, and the resulting data set was classified using the SVM algorithm. The support vector machine in Figure 1 places the hyperplane so that data points belonging to two different classes are separated with the largest gap. The predicted values of the SVM model were observed to be very close to the actual values. The confidence interval of the SVM model is 0.986, and the model can be used for prediction and feature extraction.

V. MODEL ANALYSIS USING THE TITANIC DATASET AND IMDB

Deep neural networks were trained on the IMDB dataset. In our investigation, we used several deep learning techniques. Models that use the IMDB dataset for classification are successful in reaching validation accuracy levels. For data cleaning and clustering analysis, we used the Titanic data set. It includes details about the people on board the Titanic, such as their age, gender, class of travel, cabin, and survival status. In this project, we use Python to conduct preprocessing, clustering, and classification on the Titanic and IMDB datasets. The transformation of unstructured data into structured data requires data cleaning; we have to deal with the dataset's outliers, inconsistent values, and missing values.

VI. OBSERVATION AND ANALYSIS

NLP, machine learning techniques and data mining can map unstructured text into structured form and enable automatic identification and extraction of relevant information, which can then be loaded into the database of the application domain.

The procedure of loading unstructured data into the database through long queries and human intervention can be replaced by applying the same logic in code written with Python libraries and ML algorithms, with minimal coding. The cleaning done at conversion time can be replaced with data mining and ML algorithms. Data mapping can be done efficiently with AI techniques. Accuracy can be improved by applying a clustering technique before the classification algorithm on the data set and by combining CNN with LSTM.

Classification accuracy is calculated with the formula below, and the resulting output is displayed in Table 1:

$$\text{Accuracy} = \frac{\text{total no. of correct data}}{\text{total no. of test data}} \times 100 \qquad (1)$$

Table 1. Comparison of data loading process with different models

Model            Accuracy
LSTM             86%
CNN with LSTM    88%
SQL              NA

Fig. 5. Model accuracy Graph

Table 2. Comparison of data loading process with SQL

Data extraction
  SQL and programming:
    • Programs
    • Database queries
    • More human intervention
  Data mining with NLP and Python libraries:
    • Pandas
    • Beautiful Soup
    • sklearn libraries
    • Minimal human intervention

Data preprocessing and cleaning
  SQL and programming:
    • Long database procedures
    • More human intervention
  Data mining with NLP and Python libraries:
    • NLTK, Gensim

Classification
  SQL and programming:
    • More human intervention
    • Database queries and procedures
    • Application programs
    • Field mapping
  Data mining with NLP and Python libraries:
    • Clustering and classification algorithms
    • K-means, deep learning model
    • LSTM, CNN
    • Keras, TensorFlow
    • AI algorithms or labeling through classification algorithm

VII. FUTURE WORK

Future work will examine how automated field mapping can be done efficiently with AI features in accordance with organizational interest, and will evaluate the performance of different ML algorithms on different data sets for text classification.

VIII. CONCLUSION

The purpose of this paper is to scrutinize the applicability of data mining and ML techniques for extracting unstructured data in software firms into their application domain databases, using NLP and Python instead of SQL queries and programs. The model bridges the gap between the SQL developer and data mining algorithms.

In this model, we examined a few classification and clustering algorithms: the K-means algorithm, a deep learning model (LSTM), and CNN with LSTM. These combinations are applied after the preprocessing steps for better results. Implementing them through Python libraries (the Keras, TensorFlow and PyTorch frameworks) is less complex than writing long SQL queries and programs.

We conclude that whatever is done through SQL queries for unstructured data management in the application domain can be done through data mining and deep learning algorithms (RQ 1). Efficiency and accuracy depend on how the data set is trained and how the model is constructed (RQ 2). Performance efficiency depends on the input data and the choice of classification model (RQ 3).

Efficient data management enables programmers to spend minimal time writing programming code and to focus more time on aligning the right data to solve complex business issues. The study identified how to close the existing gap between theoretical research and application domain programmers, and thereby helps improve the decision-making process of organizations.

CONFLICT OF INTEREST

• Anisha S is a Part Time Research Scholar at St. Joseph University, Dimapur, Nagaland.
• Dr. S Thiyagarajan is an internal Research Supervisor at St. Joseph University, Dimapur, Nagaland.

This paper was written for academic purposes as part of pursuing a PhD; it is non-sponsored and involves no financial conflicts of interest. It presents the primary findings of the analysis and study.

REFERENCES

[1]. Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
[2]. Ilya Sutskever, Oriol Vinyals, Quoc V. Le. "Sequence to Sequence Learning with Neural Networks." Google.
[3]. Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs
[4]. Yih, X. He, C. Meek. 2014. "Semantic Parsing for Single-Relation Question Answering." 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference.
[5]. Shen, X. He, J. Gao, L. Deng, et al. "Learning Semantic Representations Using Convolutional Neural Networks for Web Search." In Proceedings of WWW 2014.
[6]. Yaswanth Kumar Alapati and Korrapati Sindhu. "Combining Clustering with Classification: A Technique to Improve Classification Accuracy." International Journal of Computer Science Engineering (IJCSE), Vol. 5 No. 06, Nov 2016.
[7]. Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, et al. "Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies." Wiley-IEEE Press; 2001.
[8]. Kyosuke Nishida, Kugatsu Sadamitsu, Ryuichiro Higashinaka, et al. "Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture." Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17).
[9]. Gers, F. A.; Schmidhuber, J.; and Cummins, F. A. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12(10):2451–2471.
[10]. Yelong Shen, Xiaodong He, et al. "Learning Semantic Representations Using Convolutional Neural Networks for Web Search." Microsoft.
[11]. Ertekin, Şeyda. Learning in extreme conditions: online and active learning with massive, imbalanced and noisy data; A Dissertation.

IJISRT24JAN1677 www.ijisrt.com 1790
