
Received 3 February 2024, accepted 28 February 2024, date of publication 1 March 2024, date of current version 8 March 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3372618

Manipulating Data Lakes Intelligently With Java Annotations

LAP MAN HOI 1, (Member, IEEE), WEI KE 1, (Member, IEEE), AND SIO KEI IM 1,2, (Member, IEEE)

1 Faculty of Applied Sciences, Macao Polytechnic University, Macau, SAR, China
2 Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence, Ministry of Education, Macao Polytechnic University, Macau, China

Corresponding author: Lap Man Hoi ([email protected])
This work was supported by Macao Polytechnic University under Project RP/ESCA-03/2020.

ABSTRACT Data lakes are typically large data repositories where enterprises store data in a variety of data
formats. From the perspective of data storage, data can be categorized into structured, semi-structured, and
unstructured data. On the one hand, due to the complexity of data forms and transformation procedures, many
enterprises simply pour valuable data into data lakes without organizing and managing them effectively.
This can create data silos (or data islands) or even data swamps, with the result that some data will be
permanently invisible. Although data are integrated into a data lake, they are simply physically stored in
the same environment and cannot be correlated with other data to leverage their precious value. On the
other hand, processing data from a data lake into a desired format is always a difficult and tedious task
that requires experienced programming skills, such as conversion from structured to semi-structured. In this
article, a novel software framework called Java Annotation for Manipulating Data Lakes (JAMDL) that can
manage heterogeneous data is proposed. This approach uses Java annotations to express the properties of
data in metadata (data about data) so that the data can be converted into different formats and managed
efficiently in a data lake. Furthermore, this article suggests using artificial intelligence (AI) translation
models to generate Data Manipulation Language (DML) operations for data manipulation and uses AI
recommendation models to improve the visibility of data when data precipitation occurs.

INDEX TERMS Data lake, data precipitation, data stewards, enterprise-level applications, impedance
mismatch, java annotations, JAMDL, object-oriented, ORMapping, software framework.

The associate editor coordinating the review of this manuscript and approving it for publication was Rahim Rahmani.

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 12, 2024

I. INTRODUCTION
Data has always been regarded as one of the most valuable assets in business and academia. Talent will retire, technology will age and become obsolete, and only data can provide the insights and evidence to remain in an irreplaceable position. In retrospect, when the Internet began to boom in the business world, data warehouses were developed to store the rapidly growing volume of data coming from Online Transaction Processing (OLTP) systems. With the popularity and practicality of Big Data (BD), Internet of Things (IoT), and AI technologies in recent years, data warehouses are no longer able to meet the requirements of storing diverse forms of data. Therefore, the concept of a data lake was introduced to create a large data repository for storing structured data (spreadsheet, tabular, etc.), semi-structured data (JSON, XML, YAML, etc.), and unstructured data (audio, corpus, image, log, etc.).

Generally speaking, data in a data warehouse is more tightly coupled, while data in a data lake is more loosely coupled. Data warehouse projects use the Extract-Transform-Load (ETL) approach to data processing, where everything is defined before writing, technically known as "schema-on-write". On the other hand, data lake projects use the Extract-Load-Transform (ELT) or "schema-on-read" approach [2]. Consequently, data in different formats

can be saved and exported more flexibly. The data lake concept is designed to support a variety of data operations, such as persistent log files for BD and IoT projects, and domain-specific output datasets for training AI projects. However, manipulating large volumes and complex forms of data has been a challenging problem for many years. Despite the many innovative solutions that continue to be offered, the demanding needs of data lakes are still not being met. Some of the outstanding issues are how to deal with differently structured data in a data lake with a single data model, which includes representing different data structures in an abstract model, transforming between them, retrieving data from different data sources, and storing them in a data lake. There are also questions about how to leverage modern AI technology to govern data lakes. Nevertheless, as more people become aware of these new requirements for data lakes, it will have a positive impact on the development of data lakes.

Therefore, this article focuses on managing diverse data structures in data lakes. A novel software framework, Java Annotation for Manipulating Data Lakes (JAMDL), is proposed to help people quickly develop web applications for manipulating data in data lakes. As a result, people use Java annotations to define data models to manage different data structures in the data lake. In pursuance of creating JAMDL, some interesting questions and fundamental requirements are discussed below.

FIGURE 1. Transaction data can be represented in JSON (semi-structured data format) and multiple database tables (structured data format).

A. IMPEDANCE MISMATCH
In computer science, impedance mismatch is the study of how to match two things with different properties together. For instance, people often want to map database data to programming objects. It is one of the core problems for Object-Relational Mapping (ORMapping), i.e., the disparity between the object-oriented application development model and the relational database model [3]. Neward claimed that ORMapping is the Vietnam of Computer Science: it represents a quagmire that starts well, gets more complicated as time passes, and before long entraps its users in a commitment that has no clear demarcation point, no clear win conditions, and no clear exit strategy [4]. For example, persisting and retrieving tree-structured data (e.g., JSON) into the flat database table shown in Fig. 1 is not a straightforward process. It not only involves programming techniques for recursively reading schema-oblivious trees but also applies data normalization concepts to persist data in database tables.

Over the years, there has been a plethora of systems catering to the new needs of businesses, resulting in a wide variety of data formats. Based on the exact nature of the records, datasets can be classified as structured, semi-structured, and unstructured [1]. If an ORMapping software framework is to be developed to solve the problem of data transformation in a data lake, then how to represent and transform data of different structures is the core problem to be overcome.

B. TEDIOUS CRUD TRANSACTIONS
In most cases, enterprise applications require at least one database engine to store transactional data. However, developing fundamental Create, Read, Update, and Delete (CRUD) transactions using SQL statements at the database level is a tedious task. Therefore, it is common practice to use ORMapping software frameworks for simplification. However, most ORMapping software frameworks only deal with structured data. There is relatively little research on ORMapping for semi-structured and unstructured data.

C. DATABASE NORMALIZATION PROBLEMS
Conforming to the database normalization design principles of reducing data redundancy and improving data integrity, the data in the transactions of an OLTP system needs to be decomposed and stored in multiple database tables [5]. ORMapping typically uses a single object to represent a relational database (RDB) table, which means that multiple objects are needed to represent a single transaction. Thus, if the transaction contains a substantial amount of information, the matching, persistence, and retrieval process can become very complex. As shown in Fig. 1, the student information is broken down into three tables and stored in the database.

A single transaction at the modern enterprise level usually contains a large amount of information. People are starting to debate whether it is really worth breaking up transactions into multiple tables for data persistence and then joining them for data retrieval. Some database engines, namely Not Only Structured Query Language (NoSQL) or Not Relational (Non-SQL) engines, recommend saving the entire transaction as an object and operating on it without modifying the structure [6]. As a result, this reduces the data granularity problems associated with database normalization and simplifies the data transformation process. Nonetheless, database normalization provides data integrity and powerful querying capabilities. Both are genuinely necessary, and using ORMapping objects to represent informative datasets that combine different data structures is a major challenge for framework developers.


FIGURE 2. The modular software architecture for manipulating the data lake.

Furthermore, most modern ORMapping software frameworks only use inner join methods to join database tables and do not provide other table join methods (left join, right join, intersect join, full join, etc.) for data integration. Consequently, developers either have to change the table structure or write complex SQL statements themselves. Likewise, data integrity, especially the merging of various data structures, is one of the difficulties in manipulating data lakes.

D. BIG PICTURE
In this article, the JAMDL framework is presented comprehensively; it is based on ORMapping for manipulating data in a data lake. The structure diagram is shown in Fig. 2. The software architecture is componentized into different modules (data modeling, data persisting, data retrieving, and data governing), and each module is discussed separately below.

The JAMDL framework is designed to address the problems listed in the previous sections and to provide a way for people to represent different data structures using an object model. As a result, one can use this object model to read, write, and convert data in a data lake to a desired format.

The rest of the article is organized as follows: Section I briefly describes the purpose of this study; Section II reviews state-of-the-art techniques and related research; Section III shows how to build a software framework for manipulating data lakes; Section IV evaluates and analyses the significance of all the results; and Section V concludes the research study and discusses future work.

II. LITERATURE REVIEW
A. DATA LAKES
Enterprises collect digital footprints from a wide range of activities, with data coming in heterogeneous forms. The predefined table schema of the data warehouse architecture cannot meet the needs of storing unstructured data such as images, videos, and corpus files. Data lakes are one of the solutions for persisting data in various formats. However, data is often not well organized due to the complexity and diversity of data in data lakes. As a result, it is difficult for data to be fully utilized and analyzed to help decision-makers identify interesting issues and govern.

In other words, a data lake is simply a repository of all data (including raw data) for people to access at one point. The terms used to describe this large data repository are varied and include data puddles, data ponds, data pools, data oceans, and more. They differ mainly in their size, maturity, and purposes [7]. Nevertheless, it is more appropriate to use the term "data lake" in this article, as it is more relevant to enterprise-level applications.

Starting in 2010, different architectures have been suggested for building data lakes. Recently, many service providers have adopted data lakes in the cloud. Some well-known companies, such as Amazon Web Services (AWS), Azure Data Lake Store, Google Cloud Platform (GCP), Alibaba Cloud, and the Data Cloud from Snowflake, even offer powerful tools and user-friendly service interfaces for enterprises to build their own data lakes. In academia, scholars have never stopped proposing innovative solutions for constructing data lakes. According to the structure and function of a data lake, data lakes usually consist of four layers (Ingestion, Maintenance, Exploration, and Storage) [8], [9], [10].

Nonetheless, people are more concerned with the architecture than with manipulating the content of the data in the data lake. The demand for comprehensive solutions for manipulating data in data lakes continues to exist.

B. JAVA ANNOTATIONS
Java annotations, first released in 2004, are a form of metadata that provide information about a program rather


than being part of the program itself [11]. Java annotations provide three different retention policies (source, class, and runtime) for specifying how long to retain annotations. Therefore, once a Java program is annotated, the annotation information can be read at different stages of the program (compile time, deployment time, and runtime). Moreover, annotations can even be used to dynamically generate code that outputs a Java program, which is very much in line with the needs of framework developers.

Although annotations are not part of the core programming language themselves, people often use them to extend the language's support for custom features such as compiler information, documentation, runtime logging, generating additional files, and so on [12]. In addition, some recent research studies have suggested the use of Java annotations for validation [13], [14], [15], mining [16], [17], [18], and maintenance [19].

People commonly use XML to define and configure different software frameworks. However, XML is considered heavyweight because it has too many tags and its tree structure is bulky when parsing content. Hence, Java annotations can be used to store configuration information of software systems instead of the verbose XML.

In the field of deep learning, data labeling is one of the important processes in machine-supervised training, and Java annotations can be the metadata describing various data structures in the data lake. As a result, the applications of Java annotations in different domains are yet to be explored by researchers.

C. TYPES OF SQL STATEMENTS
There are various types of Structured Query Language (SQL) statements used to process database records. The most commonly used are Data Control Language (DCL), Data Definition Language (DDL), Data Manipulation Language (DML), and Data Query Language (DQL) [20]. They are sub-languages that perform all the basic operations in the database engine.
• DCL operations grant access to all elements in the database. Typical DCL statements are the GRANT and REVOKE statements.
• DDL operations define elements such as schemas for storing data. Typical DDL statements are the CREATE and DROP statements.
• DML operations manipulate the contents of data records. Typical DML statements are the INSERT, DELETE, and UPDATE statements.
• DQL operations retrieve data records and combine them into a subset of data. A typical DQL statement is the SELECT statement.
These sub-languages have traditionally been used only for manipulating structured data, i.e., database records. Hence, researchers should improve these sub-languages so that they can handle other types of data (semi-structured and unstructured data) and assist the JAMDL framework in managing the heterogeneous data in data lakes.

D. ORMAPPING SOFTWARE FRAMEWORK
ORMapping is a mechanism for connecting classes in an object-oriented (OO) programming language to tables in a relational database [21]. ORMapping allows us to query and manipulate data stored in an RDB using OO approaches without the need to use DQL or DML [5]. As a result, programmers can retrieve and save data in a variety of database engines without having to write cumbersome SQL statements. In other words, developers can simply use their favorite programming language (Java, PHP, C#, etc.) without having to develop at the database level.

However, the popular ORMapping software frameworks on the market (Hibernate, MyBatis, TopLink, etc.) and the Java Persistence API (JPA) industry standard share many common issues that cannot fully satisfy the needs of developers [5]. For example, most ORMapping software frameworks are unable to address the impedance mismatch between structured, semi-structured, and unstructured data.

Although ORMapping has been around for a while, there are still some outstanding issues that cannot be fully settled [22]. A typical example is manipulating JavaScript Object Notation (JSON) data. JSON is one of the most commonly used data exchange formats in modern online systems, but existing ORMapping software frameworks only partially support the complex tree structure of JSON, which does not meet the expectations of modern developers. Most database engines that support JSON will keep the JSON objects intact, which loses flexibility and performance for searching and updating data.

To design a new ORMapping software framework to manipulate data in a data lake, conversion between different data formats is unequivocally a challenging problem. In general, there are three problems (mapping, retrieving, and persisting) that need to be overcome to design a new ORMapping software framework [23]. In the proposed solution, Java annotations are used as data objects to represent different data structures to manipulate the data in the data lake. Thus, the JAMDL framework based on ORMapping must overcome the following problems.
• Map data in different formats to data objects.
• Store data objects in different datasets.
• Retrieve data from multiple datasets and convert it back to a single data object.

E. DATA STEWARDS
Over the years, people have had different names for those who work with data, such as data engineers, data analysts, and data scientists. Roughly speaking, data engineers are responsible for underlying data processing, data analysts for business insight analysis, and data scientists for academic research. People usually categorize employees into specific roles based on their interests and job characteristics in the company. More recently, enterprises can even appoint data stewards for data governance, data quality control, data pipeline management, business definition regulation, glossary creation, and sensitive data operations.
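The way runtime-retained annotations (Section II-B) can drive ORMapping (Section II-D) is illustrated by the minimal sketch below. The annotation and class names (Table, Column, StudentRow) are invented for illustration and deliberately simpler than any real framework API, including JAMDL's.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.StringJoiner;

// Minimal, hypothetical mapping annotations. RUNTIME retention makes them
// readable through reflection, as discussed in Section II-B.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Table { String name(); }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Column { String name(); }

// An annotated class plays the role of the object model for one entity.
@Table(name = "student_info")
class StudentRow {
    @Column(name = "student_id")   int studentId;
    @Column(name = "student_name") String studentName;
}

public class AnnotationOrmSketch {
    // Generate a DQL SELECT for any annotated class by reading its metadata.
    static String selectFor(Class<?> type) {
        StringJoiner cols = new StringJoiner(", ");
        for (Field f : type.getDeclaredFields()) {
            Column c = f.getAnnotation(Column.class);
            if (c != null) cols.add(c.name());
        }
        return "SELECT " + cols + " FROM " + type.getAnnotation(Table.class).name();
    }

    public static void main(String[] args) {
        // e.g., SELECT student_id, student_name FROM student_info
        System.out.println(selectFor(StudentRow.class));
    }
}
```

No XML configuration is involved: the mapping lives next to the code it describes, which is the argument made above for annotations over verbose XML.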


Data governance is a very broad term that can cover many areas. It can be linked to plans, policies, and procedures for managing and implementing data accountability [24], [25], [26]. Data governance is now often used to represent the job responsibilities of a data steward. As the role of data stewards becomes more important, enterprises require them to have a more holistic view of the data lake, conducive to effectively managing its dynamic nature. It has also become necessary to use modern AI tools to help them in their governance efforts.

With the help of cutting-edge AI technologies, many hidden issues and problems can be detected in advance, and data stewards can react quickly. Therefore, modern software frameworks should also offer the ability to include AI technologies to predict and recommend management strategies to data stewards.

F. AI TECHNIQUES
Since the rise of neural networks and AI-related techniques, they have immediately dominated the field of academic research. They specialize in automation and prediction and can be applied to many different domains.

AI techniques can be used in many different areas to help the JAMDL framework provide powerful features for secondary developers. Some common neural networks that can help, among others, are the Convolutional Neural Network (CNN) for building classification models [27], [28]; the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) for building models that need to understand the context of sentences [29], [30]; the end-to-end (E2E) network for building translation models [32]; and Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) for voice and speaker recognition [31].

In this article, AI techniques are mainly used to generate SQL statements and to help data stewards manage data lakes. In a previous study [33], an E2E network was implemented to generate DQL-type SQL statements to query a database engine using natural language. E2E is very elegant and has been popularized for deep learning [32]. The idea of using a single model that specializes in predicting outputs directly from inputs can handle extremely complex systems and is arguably the most advanced deep learning technique. By the same token, people can use AI to generate complex DML-type SQL statements for processing enterprise-level transactions.

Data stewards also need AI technologies to provide alerts and recommendations and to create insightful analyses of business intelligence (BI) reports and visualization diagrams. Furthermore, people can build AI models that rate data based on its accessibility and validate the data in the data lakes after the JAMDL object models are built. Subsequently, many areas where AI can be used to enhance the JAMDL framework are yet to be explored by researchers.

III. METHODOLOGY
To propose a new type of ORMapping software that comprehensively addresses the salient issues of data lakes, it is necessary to address the fundamental issues of data manipulation (mapping, retrieval, and persistence). All the modules listed in Fig. 2 are discussed below. The JAMDL framework is designed to handle different datasets, which will be demonstrated below using the simple dataset mentioned in Fig. 1.

A. DATA MODELING
The mapping process is the most critical part of the proposed software framework. There are several conventional methods for mapping relational data to program objects. The common practice in today's software frameworks is to define an XML file for the object mapping process. However, popular software frameworks (Spring Boot, iBatis, etc.) require writing and managing many XML files for configuration. As a project evolves, the content and syntax become lengthy and complex. Instead, Java annotations are recommended to be used as objects to describe different types of data in a data lake.

1) OBJECT MODEL
The principle of OO is that everything can be object-based. Data in various formats can also be conceptually represented by corresponding objects. In a previous study [34], object models were illustrated to process unstructured data (parallel corpora). Therefore, the object model can help to represent and manage complex data. Abstraction refers to the basic characteristics of an object that distinguish it from all other types of objects, thus providing a clearly defined conceptual boundary relative to the perspective of the viewer [35]. This concept helps in discovering the characteristics of various data. Furthermore, it is also applicable to dealing with semi-structured data, which requires a higher level of abstract description.

The ease of use of Java annotations is unparalleled in the history of Java metadata. Java annotations are flexible enough to provide a retention policy that specifies how marked annotations are stored: whether they are stored only in code, compiled into classes, or available at runtime through reflection [11]. With Java annotations, the JAMDL framework is fully implemented based on ORMapping for manipulating data in the data lakes, the source code of which is available on GitHub [36]. Fig. 3 shows the essential classes used to form the business logic of this software framework.

2) MAPPING APPROACH
To the extent that Java objects represent various types of data, the most fundamental metadata encompass the field name, field type, entity type, and entity path. The terms "field" (table columns, log records, JSON attributes, etc.) and "entity" (database tables, log files, JSON files, etc.) here have more


FIGURE 3. Class diagram: Business logic of the software framework.

abstract meanings and are used to support different data structures.

Once this metadata is defined in a Java class, the software framework generates JavaBeans for developers to manipulate the data, and DML and/or DQL operations for the software framework to manipulate the data in the data lake. Note that DML and DQL here have been enhanced to support the processing of data other than database records using Plain Old Java Objects (POJOs).
• field name: This attribute is the name of the field that will be used to generate the JavaBeans and DML and/or DQL operations.
• field type: This attribute is the data type (double, integer, string, etc.) of the field that will be used to generate JavaBeans and DML and/or DQL operations. The field type is the data type of structured data table columns, semi-structured data attributes, and unstructured data records.
• entity type: This attribute indicates the type of entity (log file, JSON, RDB, etc.). Unlike traditional ORMapping software frameworks that can only handle structured data (i.e., database records), it can associate semi-structured and unstructured data.
• entity path: This attribute indicates the location of the entity (file path, RDB name, etc.) in the data lake.

3) CORE PROGRAMS
The implementation by developers begins with the construction of the DLMapper program, an interface class for developers to annotate metadata. DLManager is a superclass for constructing CRUD transactions to manipulate data in a data lake after the DLMapper class has been defined. Fig. 3 shows the architecture of the DLManager and its relationship to the auxiliary Java classes. All the roles of these auxiliary classes are summarized below.
• DLConverter: An interface class that contains APIs for converting JavaBeans to JSON objects, tabular database records to JavaBeans, and tabular database records to JSON objects.
• DLCRUD: An interface class that provides developers with a CRUD transaction API. It also provides advanced APIs for manipulating multiple tables and records at the same time.
• DLManager: An abstract class that implements CRUD transactions by generating DML and DQL operations. It also generates JavaBeans for developers.
• DLMapper: An interface class that stores annotation information provided by the developer for the ORMapping process.
• DLTree: A class that provides developers with an API for dynamically building JSON objects.
• DLViewer: An interface class for storing information to combine transactions from multiple tables using specific table join methods.

4) MAPPING PROCESS
The DLManager class triggers the generateSQL(), generateBean(), and generateService() APIs to take care of the tedious tasks for us. The following code demonstrates the mapping process using Java annotations in a subclass of DLManager. The database table (student_info) has two columns (student_id, student_name) that can be mapped to a JavaBean (StudentBean) whose properties (studentId, studentName) are of type integer and


string, respectively. Moreover, the mapping() API specifies the primary key of the table.

    public class StudentManager extends DLManager {
        @Override
        @DLMapper(tableName = "student_info",
            beanName = "dl.model.StudentBean",
            columns = {"student_id", "student_name"},
            properties = {"StudentId", "StudentName"},
            types = {DLData.Integer, DLData.String},
            path = "StudentDB", datasetType = 0)
        public void mapping() { super.setPkNum(1); }

        @Override
        public void joining() {}
    }

All of the mapping process is done in this ordinary Java subclass. The annotations occur in front of the mapping() API. Since Java annotations are defined to be retained in the software framework at compile time, JavaBeans and the related DML and DQL operations will be generated after the subclass is compiled.

B. DATA RETRIEVING
The object retrieval mechanism involves a READ transaction of CRUD. The software framework uses the industry-standard Java Database Connectivity (JDBC) for database connectivity, which is very mature and robust and supports most database engines.

There are four different types of JDBC connections (bridge, native API, middleware, and driver-only) to cater to the diversity of enterprise environments [37]. Furthermore, this software framework is designed to support the schema-on-read approach, which allows developers to self-define their special schema when querying a data lake. Since JAMDL is developed entirely in Java, the database connection methods use type four (driver-only), which is considered the most efficient. Consequently, unlike most of the well-known ORMapping software frameworks on the market that use other, more complex database connection types, the concise architecture of JAMDL should run faster than they do.

In most ORMapping software frameworks, the READ transaction can retrieve only one record from the database per query, which is insufficient for handling semi-structured and unstructured data. Therefore, this software framework provides a list() API to retrieve multiple records. It also provides developers with custom JSON objects to dynamically output data in the desired format.

1) RETRIEVING PROCESS
Once the mapping process is complete, CRUD transactions become very simple. The READ transaction can be implemented as follows:

    public class ReadStudent {
        public static void main(String[] args) {
            // Data Source Connection
            DSConn dsObject = new DSConn();
            StudentBean bean = new StudentBean();
            bean.setStudentId(123123);
            bean = (StudentBean) StudentManager.noSQL()
                    .read(dsObject, bean);
            System.out.println(bean.getStudentName());
        }
    }

The READ API provided by the software framework is very concise, as the abstract class DLManager already handles all the heavy lifting. First, it reads the annotation information from the mapping() API of the subclass. It then validates the number of attributes in the annotation and generates a JavaBean and a DQL operation. DQL operations are performed by using two APIs (callGetter() and callSetter()) to manipulate data in the data lake. Algorithm 1 and Algorithm 2 show the implementation of these two APIs inside the software framework.

Algorithm 1 Retrieving Data From a JavaBean
Input: bName, mName, bean
Output: value
1: function callGetter()
2:   Class<?> c ← Class.forName(bName)
3:   Method m ← c.getDeclaredMethod(mName)
4:   Object value ← m.invoke(bean)
5:   return value

Algorithm 2 Storing Data to a JavaBean
Input: bName, mName, bean, value
1: function callSetter()
2:   Class<?> c ← Class.forName(bName)
3:   Class<?>[] arg ← new Class[1]
4:   Method m ← c.getDeclaredMethod(mName, arg)
5:   m.invoke(bean, value)

2) OBJECT INTEGRATION
According to the database normalization concept, a single transaction can be separated into multiple tables for persistence [38]. Querying back a transaction requires consolidating them together. The query statements become very complex if a transaction involves more than three tables, and joining database tables is doubtless a challenge in building novel software frameworks. As in the previous example in Fig. 1, student information is normalized into three different database tables. To retrieve complete information about a student, these three tables need to be joined.
mented by simply calling StudentManager with a JavaBean. In this software framework, DLManager provides a
The following pseudocode demonstrates a READ transaction joining() API for developers to specify how to join tables
where StudentManager retrieves a record based on a primary together. Developers simply create a JoinBean, fill in the
key. basic table join information (data fields, table name, and join
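Algorithms 1 and 2 above are plain Java reflection. The self-contained sketch below mirrors them against a stand-in bean; the StudentBean class and helper names here are illustrative only, not the framework's actual types.

```java
import java.lang.reflect.Method;

public class ReflectionSketch {
    // A minimal stand-in for a generated JavaBean.
    public static class StudentBean {
        private String studentName;
        public String getStudentName() { return studentName; }
        public void setStudentName(String n) { studentName = n; }
    }

    // Algorithm 1: read a property through its getter, located by name.
    static Object callGetter(String bName, String mName, Object bean) throws Exception {
        Class<?> c = Class.forName(bName);
        Method m = c.getDeclaredMethod(mName);
        return m.invoke(bean);
    }

    // Algorithm 2: write a property through its setter, located by name and argument type.
    static void callSetter(String bName, String mName, Object bean, Object value) throws Exception {
        Class<?> c = Class.forName(bName);
        Method m = c.getDeclaredMethod(mName, value.getClass());
        m.invoke(bean, value);
    }

    public static void main(String[] args) throws Exception {
        StudentBean bean = new StudentBean();
        String cls = StudentBean.class.getName();
        callSetter(cls, "setStudentName", bean, "Alice");
        System.out.println(callGetter(cls, "getStudentName", bean)); // prints "Alice"
    }
}
```

Because the bean class and method are resolved by name at run time, the same two calls can serve every generated JavaBean without compile-time coupling.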

VOLUME 12, 2024 34909


L. M. Hoi et al.: Manipulating Data Lakes Intelligently With Java Annotations

The software framework then fetches the data from the database table accordingly, using the supplied data fields, table names, and join keys.

DLManager also provides developers with two result formats through the interface class DLCRUD: a JavaBean and tabular data (a list of strings).

    +queryJoin(n: String): List<DLBean>
    +queryView(n: String): List<List<String>>

Instead of retrieving all records directly, DLManager can choose to limit the records fetched. Developers can use the setFilters() API of JoinBean to insert the WHERE clause of a DQL operation.

    +setFilters(f: String[]): void

In addition, DLManager is capable of specifying different join methods. Since not all database engines support every join method, the JAMDL framework provides the most commonly used ones, and the default in this software framework is an inner join. These join methods (inner join, left join, right join, full outer join, intersect, union, minus, etc.) are commonly found in most database engines for processing queries [1]. Moreover, these join methods are slightly modified to accommodate all database engines. Table 1 summarizes the syntax of the different join methods provided.

TABLE 1. Syntax for different join methods.

The following pseudocode shows how to query student information by constructing a JoinBean. The table join method is specified in the setKeys() API on line 7.

    1  @Override
    2  public void joining() {
    3    JoinBean bean = new JoinBean();
    4    bean.setJoinName("JoinStudent");
    5    bean.setColumns({"ID", "NAME", "COURSE"});
    6    bean.setTables({"STUD", "RELA", "COURSE"});
    7    bean.setKeys({{"STUD.ID=", "RELA.ID"}...});
    8    bean.setFilters({"ID = 123456"});
    9    super.setJoinTables(bean);
    10 }

The most critical part of making this data retrieval mechanism work flawlessly is correctly generating the SQL statements (DQL). While query statements can be dynamic and varied, SQL statements can be broken down into different parts (data fields, tables, keys, and filters) [33].

There are two main SQL patterns for table joins in this software framework, and the syntax diagram is shown in Fig. 4. As mentioned earlier, the DQL keywords (SELECT, FROM, ON, HAVING, and WHERE) separate the SQL into different parts. The software framework then reads the object information from the annotations and uses these predefined SQL templates to generate the requested queries.

FIGURE 4. DQL-type SQL patterns for different join methods.

By the same token, semi-structured and unstructured data can be queried by implementing the joining() API. Unstructured datasets are retrieved in the form of a data field. The following DQL operations show enhanced versions that support data other than database records; POJOs handle these queries by processing data at the file system level. It is important to note that the syntax of the enhanced DQL is the same as that of a normal DQL SQL statement.

    SELECT row FROM abc.log WHERE lineNum = 6;
    SELECT attribute FROM def.json WHERE attribute = 'id' AND level = 3;

The first enhanced DQL retrieves a record from the log file. This result set is a subset of the unstructured data, that is, just a part of the log file. Developers can therefore manipulate the log file through the JAMDL framework.

3) SEMI-STRUCTURED DATA TRANSFORMATION
Since the JAMDL framework uses the JavaBean as the default data format for developers to manipulate data in the data lake, it provides advanced APIs for converting flat, two-dimensional data into the JSON tree format. This software framework provides two APIs, beanToJson() and toJSONTree(), from the DLConverter interface class.
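The flattened conversion that beanToJson() performs can be sketched as follows. This is an illustration only, with a hypothetical helper working over a plain map rather than the framework's generated JavaBeans, and without any string escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FlatJsonSketch {
    // Flatten one record into a one-level JSON object, in the spirit of
    // beanToJson(); numbers are emitted bare, everything else is quoted.
    static String toFlatJson(Map<String, Object> record) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, Object> e : record.entrySet()) {
            if (sb.length() > 1) sb.append(",");
            Object v = e.getValue();
            sb.append("\"").append(e.getKey()).append("\":")
              .append(v instanceof Number ? v : "\"" + v + "\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("StudentId", 123123);
        row.put("StudentName", "Alice");
        System.out.println(toFlatJson(row));
    }
}
```

A two-level (JSONArray) result is just a list of such one-level objects, one per retrieved record.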


These APIs are described in Fig. 3 and are used to construct JSON objects. The beanToJson() API converts a JavaBean or table data directly into flattened JSON objects. It is an overloaded method that can output single or multiple records from the database. In other words, the output JSON object has either a one-level (JSONObject) or a two-level (JSONArray) tree structure.

In reality, JSON objects can be very complex, with multiple layers, especially in enterprise systems. To address this need, the JAMDL framework provides the toJSONTree() API for dynamically constructing tree-structured objects. Moreover, the "Bracket Schema" notation is introduced so that the software framework can understand the self-defined JSON structure. Fig. 5 shows the syntax of the bracket notation, with the upper part representing JSONObject and the lower part representing JSONArray objects.

FIGURE 5. The "Bracket Schema" notation for building the JSON objects.

As a result, developers can dynamically define the JSON format, and the software framework generates the JSON on the fly, for example, by converting data records from three database tables into JSON objects, as shown in Fig. 1. Developers can dynamically create different tree structures by simply embedding these bracket symbols in a Java program (pseudocode shown below). In addition, the JSON structure can be changed by modifying lines 7 through 10.

    1  // build a JSON tree from the queried records
    2  public static void main(String[] args) {
    3    DSConn dsObject = new DSConn();
    4    List lBean = new ArrayList();
    5    lBean = StudentManager.noSQL().list(dsObject, bean);
    6    Bean2Tree b2t = new Bean2Tree();
    7    b2t.setFields(new String[] {
    8      "StudentID", "StudentName",
    9      "CourseTaken[CourseCode", "CourseName", "Credit]"
    10   });
    11   b2t.buildTree();
    12   b2t.setlRecord(lRow);
    13   JSONObject json = b2t.toJSON();
    14 }

Developers can change the bracket notation in plain text to represent more complex JSON object structures, such as nested brackets. Two additional JSON formats are demonstrated below, with the results shown in Fig. 6.

    // JSON format 1
    b2t.setFields(new String[] {
      "StudentID", "StudentName",
      "CourseTaken[CourseCode", "CourseName]",
      "TotalCredit"
    });
    // JSON format 2
    b2t.setFields(new String[] {
      "CourseCode", "CourseName", "Credit",
      "Students[StudentID", "StudentName]"
    });

FIGURE 6. Different JSON objects.

Inside the JAMDL framework, the toJSONTree() API is implemented in the DLTree class (also described in Fig. 3). As shown in Algorithm 3, it reads the data fields and determines the sequential number, tree level number, and database table column name for each data field. It then uses indirectly recursive methods to add either JSONObject or JSONArray objects to build the JSON tree. Algorithm 4 shows the business logic for transforming database records into JSON objects.

Algorithm 3 Parsing the Bracket Schema
1: function readNodes()
2:   levelId ← 1
3:   initialize nodeList
4:   while loop over each field do
5:     if field has left bracket then
6:       levelId ← levelId + 1
7:     if field has right bracket then
8:       levelId ← levelId − 1
9:     nodeList.add(field)
10:  return nodeList

C. DATA PERSISTING
This software framework provides two different approaches to handling the DML operations that persist data in a data lake.

1) PERSISTING PROCESS
DML operations (INSERT, UPDATE, and DELETE) are used to change the contents of database tables. After the data mapping process is complete, the DLManager reads the object information from the annotations.
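The kind of statement generation this implies can be illustrated with a minimal sketch. The helper below is hypothetical (it is not the framework's API) and quotes values naively, without escaping.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class DmlTemplateSketch {
    // Fill an UPDATE template of the Fig. 7 shape:
    // UPDATE <table> SET f1='v1',... WHERE <pk>='<pkValue>';
    static String update(String table, List<String> fields, List<String> values,
                         String pk, String pkValue) {
        String set = IntStream.range(0, fields.size())
                .mapToObj(i -> fields.get(i) + "='" + values.get(i) + "'")
                .collect(Collectors.joining(","));
        return "UPDATE " + table + " SET " + set + " WHERE " + pk + "='" + pkValue + "';";
    }

    public static void main(String[] args) {
        // Field names and primary key come from the @DLMapper annotation.
        System.out.println(update("student_info",
                List.of("student_name"), List.of("Alice"),
                "student_id", "123123"));
    }
}
```

In the framework itself, the table name, columns, and primary key position would come from the @DLMapper annotation rather than being passed in by hand.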


Algorithm 4 JSON Objects Transformation
1: function toTreeArr(levelId, array)
2:   TreeBean bean ← nodeList.get(levelId)
3:   JSONObject obj ← new JSONObject()
4:   obj.put(bean.getName(), bean.getRecord())
5:   if levelId + 1 = nodeList.size() then
6:     array.add(obj)
7:     return array
8:   else
9:     TreeBean next ← nodeList.get(levelId + 1)
10:    if next.getLevel() > bean.getLevel() then
11:      obj.put(next.getName(), toTreeArr(levelId + 1, new JSONArray()))
12:      array.add(obj)
13:    else
14:      array.add(toTreeObj(levelId + 1, obj))
15:  return array

16: function toTreeObj(levelId, obj)
17:   TreeBean bean ← nodeList.get(levelId)
18:   obj.put(bean.getName(), bean.getValue())
19:   if levelId + 1 = nodeList.size() then
20:     return obj
21:   else
22:     TreeBean next ← nodeList.get(levelId + 1)
23:     if next.getLevel() > bean.getLevel() then
24:       obj.put(next.getName(), toTreeArr(levelId + 1, new JSONArray()))
25:       return obj
26:     else
27:       return toTreeObj(levelId + 1, obj)

It then generates the DML needed to process the corresponding transactions. All SQL statements rely on the primary key defined in the mapping() API (described in Section III-A4) as the default generation method.

The business logic is similar to that of the READ transaction. DML templates (shown in Fig. 7) can be created by breaking down SQL statements using keywords (INSERT INTO, VALUES, UPDATE, SET, DELETE FROM, and WHERE). The basic DML can be further divided into different parts: an action (DELETE, INSERT, UPDATE), filters (WHERE), and arguments (entity names, field names, and field values) for automatic generation by the framework.

FIGURE 7. The DML operation template.

Developers can then use the create(), update(), and delete() APIs provided by the DLCRUD interface class to modify the content of entities.

Changing one record at a time cannot satisfy the requirements of an enterprise-level system. Therefore, this software framework provides three additional APIs (bCreate(), bUpdate(), and bDelete()) to modify multiple records in batch, or "transaction", mode. Most database engines offer transaction mode as an advanced feature. A transaction is a set of SQL queries that is treated atomically, as a single unit of work [39].

These three APIs first turn off the auto-commit feature of the database engine and then evaluate each SQL statement. If all SQL statements are executed successfully, the modified data are permanently stored in the database. Finally, auto-commit is turned on again. This transaction-mode process improves the overall performance significantly. The implementation is almost identical to a normal CRUD transaction, with the addition of the auto-commit configuration steps.

Semi-structured and unstructured data usually exist as files in the data lake. The framework treats them as normal data fields, with their data paths in the JavaBean. Persisting these types of data requires direct modification of files rather than database tables. Thus, a DML operation consists not only of an SQL statement that modifies a database record but also of a POJO that modifies a JSON or log file.

Although DML operations that modify semi-structured and unstructured data are more like commands than SQL statements, they follow the SQL syntax structure. For example, two DML operations, one deleting a specific line of a log file and one updating an attribute of a JSON object with a specific value, are shown below. The POJO then translates these command-type DML operations and applies them at the file system level.

    DELETE FROM abc.log WHERE lineNum = 6;
    UPDATE def.json SET name = value WHERE level = 3;

2) DML FOR JOINED TABLES
Section III-C1 demonstrates how the JAMDL framework handles simple DML operations. An enterprise-level transaction can indeed involve several data sources and various data formats. The updateJoin() API provided by the DLCRUD interface class is used to persist online transactions involving multiple database tables. The DLManager first reads the object information from the joining() API. It then generates DML operations from the corresponding entities, fields, and keys.
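Decomposing one multi-entity transaction into per-entity DML can be sketched as follows; the "entity.field=value" assignment format mirrors the parallel corpus shown in Section IV, and the helper is an illustration rather than the framework's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JoinedDmlSketch {
    // Split one joined transaction into one UPDATE per entity,
    // grouping "entity.field=value" assignments by entity name.
    static List<String> toDml(String[] assignments) {
        Map<String, List<String>> byEntity = new LinkedHashMap<>();
        for (String a : assignments) {
            int dot = a.indexOf('.');
            byEntity.computeIfAbsent(a.substring(0, dot), k -> new ArrayList<>())
                    .add(a.substring(dot + 1));
        }
        List<String> dml = new ArrayList<>();
        byEntity.forEach((entity, sets) ->
                dml.add("UPDATE " + entity + " SET " + String.join(",", sets) + ";"));
        return dml;
    }

    public static void main(String[] args) {
        toDml(new String[] {
                "tableA.fieldA=valueA", "tableA.fieldB=valueB", "tableB.fieldC=valueC"
        }).forEach(System.out::println);
    }
}
```

Each emitted statement targets exactly one entity, which is what allows a mixed transaction to touch database tables, JSON files, and log files in one pass.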


To handle more complex DML operations, the software framework uses AI technologies to generate them.

In previous research on building question answering (QA) systems, LSTM and GRU neural networks were shown to generate natural language questions and DQL answers with high accuracy [33]. To exploit this, the JAMDL framework also uses LSTM and GRU models to generate DML for accessing multiple data sources; their performance is discussed in detail in the next section.

D. DATA GOVERNING
Although the responsibilities of data stewards vary from enterprise to enterprise, their role is becoming increasingly important. In some countries, they also need to react quickly to new rules, policies, and instructions set by local governments. Since DCL operations can only control user access to structured data, the DCL can be augmented with POJOs to control the other data structures (semi-structured and unstructured data).

However, the role of data stewards is not only to safeguard data but also to provide high-quality data to the enterprise. The following areas can therefore utilize AI technology to help them better govern the data in the data lake.

1) DATA PRECIPITATION
In the field of chemistry, precipitation is the separation of a solid substance from a liquid. This process occurs naturally over time, without external intervention.

As data stays in the data lake for longer and longer periods, polarization occurs due to the increase in the number of visits, leading to data precipitation. Fig. 8 shows that data near the surface is accessed most frequently, while data located at the bottom is rarely accessed and is referred to as "zombie data". Data stewards are nevertheless responsible for the various definitions of data, including zombie data, which is data that has not been accessed within a specific time.

FIGURE 8. Data Precipitation.

Subsequently, there are many interesting things that can be done when data precipitation appears. On the one hand, the data can be ranked based on how often it is accessed. On the other hand, data that may end up as zombie data can be classified. With the help of AI technology, recommendation models can also be built that provide not only high-quality data but also rarely accessed data that may be useful to users. As a result, data that would otherwise form data silos can increase its visibility.

2) FAULT DETECTION
Most data stewards are probably more concerned about system failures and errors. However, many factors and combinations of factors can cause a system failure. Therefore, big data concepts can be used to build a linear regression model that incorporates all relevant parameters that may lead to system failures. Some common parameters include new system deployments, system updates, the age of hardware, new security threats, peak times, holidays, historically high-risk days, and the dates of special events and activities.

3) RESOURCE PLANNING
Modern system architectures rely on cloud computing to create virtual machines for each server node. Hardware configurations (CPU, memory, hard drives, etc.) can be adjusted flexibly as server usage changes. Consequently, a predictive model can be built to estimate the resources (hardware configuration, network bandwidth, etc.) consumed by each server node from past usage statistics and possible patterns.

IV. EXPERIMENTS AND DISCUSSIONS
A. GENERATING DML BY LSTM AND GRU NETWORKS
In most cases, building accurate AI models requires a large amount of high-quality training data. To prepare the training data, all possible DML operations for each entity (database tables, JSON, and log files) are created. This can be accomplished by writing a simple recursive program that iterates through every file and every column in the database tables. A transaction is then formed by randomly selecting some fields from different entities (files or tables). Simultaneously, the corresponding DMLs are generated for each entity. As a result, each transaction containing different fields from different entities is paired with multiple DML operations. The business logic of this program is shown in Algorithm 5.

Algorithm 5 Training Data Generation
Input: n ▷ Number of records
Output: train.txt, test.txt ▷ Output files
1: procedure gen_training_data(n)
2:   while loop over each file and table do
3:     list(field, entity) ← {files or tables}
4:   counter = 0
5:   while counter < n do
6:     randomly pick some fields from the list
7:     put all the fields together as a transaction
8:     iterate over all three DML types of operations
9:     generate a DML for each entity containing the fields
10:    output the transaction and DML pairs to files
11:  write 90% of the data to train.txt
12:  write 10% of the data to test.txt

Consequently, the training and testing datasets are prepared. Translation models are then built to allow the software framework to generate DML statements automatically.
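The 90/10 split at the end of Algorithm 5 can be sketched over an in-memory list (the algorithm itself writes directly to train.txt and test.txt); the names below are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Split transaction-DML pairs 90/10 into training and testing sets.
    static List<List<String>> split(List<String> pairs) {
        int cut = (int) (pairs.size() * 0.9);
        return List.of(pairs.subList(0, cut), pairs.subList(cut, pairs.size()));
    }

    public static void main(String[] args) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < 10; i++) pairs.add("transaction" + i + "\tDML" + i);
        List<List<String>> sets = split(pairs);
        System.out.println(sets.get(0).size() + " " + sets.get(1).size());
    }
}
```

In practice the pairs should be shuffled before the cut so that the test set is not biased toward the last entities visited.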


Translation models require source and target texts as pairs, that is, a parallel corpus, to train the AI model. Below is an example of the parallel corpus generated by Algorithm 5.

SOURCE TEXT (TRANSACTIONS):

    UPDATE logA.lineNum=num,
    tableA.fieldA=valueA, tableA.fieldB=valueB,
    tableB.fieldA=valueA, tableB.fieldC=valueC

TARGET TEXT (DML OPERATIONS):

    UPDATE logA.log SET lineNum=value;
    UPDATE tableA SET fieldA=valueA,fieldB=valueB;
    UPDATE tableB SET fieldA=valueA,fieldC=valueC;

B. EVALUATING LSTM AND GRU NETWORKS
The next step is to build translation models for evaluation. Since the emergence of the Google Transformer model in 2017 [28], it has rapidly come to dominate the deep learning field, particularly natural language processing (NLP) research. Hence, a sequence-to-sequence (S2S) transformer model is used to encode the transactions and decode the DML operations with the Keras APIs. The construction of the LSTM and GRU neural networks follows some textbooks [40], [41].

The translation model was built using a Keras GRU-based encoder and decoder with softmax as the output activation function, and the embedding dimension set to 256. Since the vocabulary of all the field names in the data lake is not rich, the maximum vocabulary size is set to 18,000. Moreover, the "forget gate" in the GRU is activated, which helps predict the words in a sequence accurately and efficiently [42].

During the experiments, six test cases were designed to test the performance of the different networks and dataset sizes (300,000, 600,000, and 900,000 records). After running each test case for 60 epochs, the results are summarized in Table 2.

TABLE 2. Test case results.

Fig. 9 shows the accuracy values for both networks. The curves show that the overall performance of the GRU neural network exceeds that of the LSTM. After more than 50 epochs of training, the score of the model was roughly stable, so training was stopped at 60 epochs. Furthermore, larger datasets lead to higher accuracy. As a consequence, this software framework uses GRU neural networks to generate DML operations.

FIGURE 9. Evaluation of GRU and LSTM neural networks.

C. PERFORMANCE EVALUATION
Section III introduced the features and usage of the JAMDL framework. This section provides a performance evaluation of the JAMDL framework and compares it with well-known ORMapping software frameworks on the market (Hibernate and MyBatis).

Admittedly, it is not easy to compare different ORMapping software frameworks fairly. Performance may depend on the environment (operating system, hardware configuration, database engine, data complexity, etc.). Since neither the Hibernate nor the MyBatis software framework fully supports semi-structured and unstructured data, performance testing can only be done on structured data.

Four test cases are designed for the evaluation. Case 1 has a table with fewer than a hundred data records. Case 2 has a table with about one million data records. Case 3 has three separate tables, each with fewer than a hundred data records. Case 4 has three separate tables, each with about one million data records. The read and write speeds of the three software frameworks are then measured for each case. The four cases are tested in the same environment, and the results are summarized in Table 3.

TABLE 3. Performance evaluation results.

The resulting data in Table 3 are then plotted as pie charts for comparison, as shown in Fig. 10. Each pie chart represents either a read or a write operation to the database for one of the four test cases. Moreover, the resulting data (database operation times) in the pie charts were converted into percentages for comparison and visualization, where the higher the percentage, the longer it took to complete the database operation.

The first impression of these pie charts in Fig. 10 is that the speeds of the read operations (on the left) are similar across the three frameworks.
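The per-case timings behind Table 3 can be collected with a harness of roughly this shape; the article does not show its actual measurement code, so this is only a sketch of the approach.

```java
public class TimingSketch {
    // Time a batch of repeated operations and report elapsed milliseconds.
    static long timeMillis(Runnable op, int repetitions) {
        long start = System.nanoTime();
        for (int i = 0; i < repetitions; i++) op.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // A cheap stand-in workload; a real run would wrap the framework's
        // read() or bCreate() calls against the test tables.
        StringBuilder sink = new StringBuilder();
        long ms = timeMillis(() -> sink.append('x'), 1_000_000);
        System.out.println(ms >= 0);
    }
}
```

Warm-up runs and repeated trials are advisable before recording numbers, since JIT compilation and database caching both distort the first iterations.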


FIGURE 10. Comparison of database running speeds of different software frameworks.

Meanwhile, JAMDL (in orange) takes less time in the write operations (on the right). There are several reasons why these pie charts are distributed in this pattern, which are summarized below.

1) DIFFERENT PURPOSES
These three software frameworks were designed for different purposes and are more concerned with functionality than with speed of operation alone. Hibernate is designed for larger applications, can be used in most development environments, and supports most database engines. It is therefore considered more heavyweight than the others because it has more libraries consuming memory. On the other hand, MyBatis is designed for small and medium-sized applications and is therefore easier to use. The JAMDL framework focuses on processing differently structured data in a data lake. Hence, it makes fewer compromises for particular database and operating system environments.

2) BATCH OPERATIONS
Since the JAMDL framework implements batch processing, it takes less time to write database records. Batch processing allows developers to combine related SQL statements into a single batch and commit it to the database with a single call [37]. As a result, it reduces communication overhead and logging, thereby improving performance. For example, if a transaction needs to write fifty data records to a table, the database engine treats the batch as a single operation, not fifty separate operations.

In another respect, Hibernate has heavily adopted database optimizers, and the caching and buffering features of database engines allow previously executed queries to perform significantly better and faster compared with the other software frameworks. The JAMDL framework should therefore also employ some database optimizers to improve its robustness.

3) TABLE JOIN
Since neither Hibernate nor MyBatis supports complex queries connecting multiple database tables, people usually need to write their own SQL statements. Performance therefore becomes similar, because the SQL is executed directly instead of being generated by the framework. However, JAMDL better supports the generation of complex queries, saving developers the time of writing SQL by hand. Consequently, using the JAMDL framework to manipulate multiple tables is more efficient than using the other two software frameworks.

4) IMPACT ON COMPLEXITY
The SQL statements generated by the JAMDL framework are ordinary high-level SQL. Database engines can use their built-in functions to optimize these statements, and the optimization results are similar to those obtained with manually written SQL. The purpose of the JAMDL framework is mainly to handle differently structured data in the data lake through Java annotations, and there should not be any impact on complexity.

D. DYNAMIC DATASETS
The manipulation of different structures in data lakes is fully demonstrated in Section III. One may realize that switching from one data structure to another as time goes by requires a huge amount of effort. Others may wonder about the feasibility of managing multiple data structures simultaneously.

The JAMDL framework abstracts data structures into an object model that can represent different data structures at the same time. In addition, the lightweight nature of Java annotations makes it easy to modify the object model to meet changing demands. The JAMDL framework can therefore address these concerns when managing complex data.

V. CONCLUSION
In this article, a software framework, JAMDL, based on ORMapping is comprehensively presented. JAMDL aims to provide a solution for manipulating different data structures in a data lake. It solves the problem of managing diverse data and overcomes the difficulty of transforming data between the different structures in the data lake.
VOLUME 12, 2024 34915


L. M. Hoi et al.: Manipulating Data Lakes Intelligently With Java Annotations

lake. Some important features of this software framework REFERENCES


are summarized below, and some of the main features are [1] A. Badia, SQL for Data Science: Data Cleaning, Wrangling and Analytics
compared in Table 4. With Relational Databases. Cham, Switzerland: Springer, Nov. 2020.
• Java annotations are used as objects that represent [2] J. Reis and M. H. Housley, Fundamentals of Data Engineering: Plan
and Build Robust Data Systems. Sebastopol, CA, USA: O’Reilly Media,
diverse data structures (structured, semi-structured, and Jul. 2022.
unstructured data). [3] D. Colley and C. Stanier, ‘‘Identifying new directions in database perfor-
• Java objects are used to read and write different data mance tuning,’’ Proc. Comput. Sci., vol. 121, pp. 260–265, Jan. 2017.
structures in a data lake. [4] T. Neward, ‘‘The Vietnam of computer science,’’ in Political Science,
Jun. 2006.
• In addition to inner joins, a variety of table joining [5] M. Keith, M. Schincariol, and M. Nardone, Pro JPA 2 in Java EE 8:
methods are available. An in-Depth Guide to Java Persistence, 3rd ed. Apress, Feb. 2018.
• By adopting the schema-on-read approach, JSON [6] L. H. Z. Santana and R. D. S. Mello, ‘‘Persistence of RDF data into NoSQL:
A survey and a reference architecture,’’ IEEE Trans. Knowl. Data Eng.,
objects are dynamically generated using the ‘‘Bracket vol. 34, no. 3, pp. 1370–1389, Mar. 2022.
Schema’’ notation. [7] A. Gorelik, The Enterprise Big Data Lake: Delivering the Promise of Big
• GRU neural networks are used to build E2E AI models Data and Data Science. Sebastopol, CA, USA: O’Reilly Media, Mar. 2019.
to generate DML operations. [8] R. Hai, C. Koutras, C. Quix, and M. Jarke, ‘‘Data lakes: A survey of
functions and systems,’’ IEEE Trans. Knowl. Data Eng., vol. 35, no. 12,
• A recommendation AI model is implemented to improve
pp. 12571–12590, Dec. 2023.
the visibility of data when data precipitation occurs. [9] E. Zagan and M. Danubianu, ‘‘Data lake architecture for storing and
transforming web server access log files,’’ IEEE Access, vol. 11,
pp. 40916–40929, 2023.
TABLE 4. ORMapping software framework comparison.
[10] A. A. Munshi and Y. A. I. Mohamed, ‘‘Data lake lambda archi-
tecture for smart grids big data analytics,’’ IEEE Access, vol. 6,
pp. 40463–40471, 2018.
[11] C. Horstmann, Core Java: Advanced Features, vol. 2. Oracle Press,
Apr. 2022.
[12] H. Rocha and M. T. Valente, ‘‘How annotations are used in Java: An
empirical study,’’ in Proc. 23rd Int. Conf. Softw. Eng. Knowl. Eng.,
Jul. 2011, pp. 426–431.
[13] S. Roubtsov, A. Serebrenik, and M. van den Brand, ‘‘Detecting modularity
‘smells’’ in dependencies injected with java annotations,’’ in Proc. 14th
Eur. Conf. Softw. Maintenance Reeng., Mar. 2010, pp. 244–247.
This article demonstrates how to prepare training and [14] Y. Liu, Y. Yan, C. Sha, X. Peng, B. Chen, and C. Wang, ‘‘DeepAnna: Deep
testing data for building LSTM and GRU neural networks. learning based Java annotation recommendation and misuse detection,’’
in Proc. IEEE Int. Conf. Softw. Anal., Evol. Reeng. (SANER), Mar. 2022,
In addition, six test cases were used to evaluate the pp. 685–696.
effectiveness of the resulting DML operations. The results show that GRU achieves the highest accuracy, 0.9589, when trained on 900,000 training samples. Moreover, the JAMDL framework writes data faster than the other software frameworks.

The job responsibilities of an enterprise data steward have opened many new and interesting areas of research, and there is ample room for further exploration to help stewards manage and deliver high-quality data.

The data lake has recently become one of the most important development projects in the enterprise. The JAMDL framework is designed for the rapid development of web applications and data-warehouse-style software that manipulate data of different structures in data lakes. Unlike the majority of data lake projects, which focus merely on architecture, JAMDL helps people govern big data holistically and efficiently. It also inspires people to manage data lakes from different perspectives using AI techniques.

Data are facts and evidence that deserve to be preserved forever. Even though computer science has evolved over many eras, data always tell new stories through new technologies. Unlike other fields of study, data are not confined to a specific domain; they are cross-domain and ubiquitous. As the appearance of data continues to change (structured, semi-structured, unstructured, etc.), people will continue to face new challenges: either to play with data or to be played by data.
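The annotation-driven style of data manipulation discussed above can be illustrated with a minimal, self-contained Java sketch. The annotation names (@LakeEntity, @LakeColumn) and the SQL generator below are hypothetical stand-ins chosen for illustration; they are not JAMDL's actual API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical annotation marking a class as a data-lake entity.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface LakeEntity {
    String table();
}

// Hypothetical annotation mapping a field to a column in the lake.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface LakeColumn {
    String name();
}

@LakeEntity(table = "customer")
class Customer {
    @LakeColumn(name = "customer_id")
    long id = 42L;

    @LakeColumn(name = "full_name")
    String name = "Ada";
}

public class AnnotationDemo {
    // Builds a simple INSERT statement by reading the annotations at runtime.
    static String toInsertSql(Object entity) {
        Class<?> cls = entity.getClass();
        String table = cls.getAnnotation(LakeEntity.class).table();
        StringBuilder cols = new StringBuilder();
        StringBuilder vals = new StringBuilder();
        try {
            for (Field f : cls.getDeclaredFields()) {
                LakeColumn col = f.getAnnotation(LakeColumn.class);
                if (col == null) continue; // skip unannotated fields
                if (cols.length() > 0) {
                    cols.append(", ");
                    vals.append(", ");
                }
                cols.append(col.name());
                f.setAccessible(true);
                Object v = f.get(entity);
                vals.append(v instanceof String ? "'" + v + "'" : v.toString());
            }
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        }
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + vals + ")";
    }

    public static void main(String[] args) {
        // Prints an INSERT statement assembled from the annotations.
        System.out.println(toInsertSql(new Customer()));
    }
}
```

Annotations with runtime retention let a framework derive DML operations from the object model via reflection, so the mapping lives beside the data class instead of in separate configuration; a production framework would use parameterized statements rather than string concatenation.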

34916 VOLUME 12, 2024
LAP MAN HOI (Member, IEEE) received the bachelor's degree in computer science from York University, Canada, and the master's degree in internet computing from the Queen Mary University of London. He is currently pursuing the Ph.D. degree in computer applied technology with the Faculty of Applied Sciences, Macao Polytechnic University (MPU). He was a researcher of gaming and entertainment. He is also a researcher of machine translation with the Faculty of Applied Sciences, MPU. His research interests include internet computing, data warehouses, data science, gaming, deep learning, machine translation, and voice recognition.

WEI KE (Member, IEEE) received the Ph.D. degree from the School of Computer Science and Engineering, Beihang University. He is currently a Professor with the Faculty of Applied Sciences, Macao Polytechnic University. His research interests include programming languages, image processing, computer graphics, tool support for object-oriented and component-based engineering and systems, the design and implementation of open platforms for applications of computer graphics and pattern recognition, including programming tools, environments, and frameworks.

SIO KEI IM (Member, IEEE) received the degree in computer science and the master's degree in enterprise information systems from King's College London, University of London, U.K., in 1998 and 1999, respectively, and the Ph.D. degree in electronic engineering from the Queen Mary University of London (QMUL), U.K., in 2007. He gained the position of Lecturer with the Computing Program, Macao Polytechnic Institute (MPI), in 2001. In 2005, he became the Operations Manager of the MPI-QMUL Information Systems Research Center jointly operated by MPI and QMUL, where he carried out signal processing work. He was promoted to Professor with MPI, in 2015. He was a Visiting Scholar with the School of Engineering, University of California, Los Angeles (UCLA), and an Honorary Professor with The Open University of Hong Kong.
