Manipulating Data Lakes Intelligently With Java Annotations
ABSTRACT Data lakes are typically large data repositories where enterprises store data in a variety of formats. From the perspective of data storage, data can be categorized as structured, semi-structured, or unstructured. On the one hand, due to the complexity of data forms and transformation procedures, many enterprises simply pour valuable data into data lakes without organizing and managing them effectively. This can create data silos (or data islands) or even data swamps, with the result that some data will be permanently invisible. Although the data are integrated into a data lake, they are merely stored physically in the same environment and cannot be correlated with other data to leverage their precious value. On the other hand, processing data from a data lake into a desired format is always a difficult and tedious task that requires experienced programming skills, such as converting structured data into semi-structured data. In this article, a novel software framework called Java Annotation for Manipulating Data Lakes (JAMDL) that can manage heterogeneous data is proposed. This approach uses Java annotations to express the properties of data as metadata (data about data) so that the data can be converted into different formats and managed efficiently in a data lake. Furthermore, this article suggests using artificial intelligence (AI) translation models to generate Data Manipulation Language (DML) operations for data manipulation, and AI recommendation models to improve the visibility of data when data precipitation occurs.
INDEX TERMS Data lake, data precipitation, data stewards, enterprise-level applications, impedance mismatch, Java annotations, JAMDL, object-oriented, ORMapping, software framework.
FIGURE 2. The modular software architecture for manipulating the data lake.
Furthermore, most modern ORMapping software frameworks only use inner join methods to join database tables and do not provide other table join methods (left join, right join, intersect join, full join, etc.) for data integration. Consequently, developers either have to change the table structure or write complex SQL statements themselves. Likewise, data integrity, especially the merging of various data structures, is one of the difficulties in manipulating data lakes.

D. BIG PICTURE
In this article, the JAMDL framework is presented comprehensively; it is based on ORMapping for manipulating data in a data lake. The structure diagram is shown in Fig. 2. The software architecture is componentized into different modules (data modeling, data persisting, data retrieving, and data governing), and each module is discussed separately below.
The JAMDL framework is designed to address the problems listed in the previous sections and to provide a way for people to represent different data structures using an object model. As a result, one can use this object model to read, write, and convert data in a data lake to a desired format.
The rest of the article is organized as follows: Section I briefly describes the purpose of this study; Section II reviews state-of-the-art techniques and related research; Section III shows how to build a software framework for manipulating data lakes; Section IV evaluates and analyses the significance of all the results; and Section V concludes the research study and discusses future work.

II. LITERATURE REVIEW

A. DATA LAKES
Enterprises collect digital footprints from a wide range of activities, with data coming in heterogeneous forms. The predefined table schema of the data warehouse architecture cannot meet the needs of storing unstructured data such as images, videos, and corpus files. Data lakes are one of the solutions for persisting data in various formats. However, data is often not well organized due to the complexity and diversity of data in data lakes. As a result, it is difficult for the data to be fully utilized and analyzed to help decision-makers identify interesting issues and govern effectively.
In other words, a data lake is simply a repository of all data (including raw data) that people can access in one place. The terms used to describe such a large data repository vary and include data puddles, data ponds, data pools, data oceans, and more. They differ mainly in their size, maturity, and purpose [7]. Nevertheless, the term "data lake" is the most appropriate for this article, as it is the most relevant to enterprise-level applications.
Starting in 2010, different architectures have been suggested for building data lakes. Recently, many service providers have adopted data lakes in the cloud. Some well-known companies, such as Amazon Web Services (AWS), Azure Data Lake Store, Google Cloud Platform (GCP), Alibaba Cloud, and the Data Cloud from Snowflake, even offer powerful tools and user-friendly service interfaces for enterprises to build their own data lakes. In academia, scholars have never stopped proposing innovative solutions for constructing data lakes. In terms of structure and function, data lakes usually consist of four layers (Ingestion, Maintenance, Exploration, and Storage) [8], [9], [10].
Nonetheless, people are more concerned with the architecture than with manipulating the content of the data in the data lake. The demand for comprehensive solutions for manipulating data in data lakes persists.

B. JAVA ANNOTATIONS
Java annotations, first released in 2004, are a form of metadata that provide information about a program rather than being part of the program itself [11]. Java annotations provide three different retention policies (source, class, and runtime) that specify how long annotations are retained. Therefore, once a Java program is annotated, the annotation information can be read at different stages of the program (compile time, deployment time, and runtime). Moreover, annotations can even be used to dynamically generate code that outputs a Java program, which is very much in line with the needs of framework developers.
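As a minimal illustration of retention policies, the following sketch defines a custom annotation that is kept until runtime and reads it back through reflection. The annotation name (TableInfo) and the demo class are illustrative only and are not part of JAMDL.

    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.reflect.Method;

    // A hypothetical annotation kept until runtime so that a
    // framework can inspect it through reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @interface TableInfo {
        String name();
    }

    public class RetentionDemo {
        @TableInfo(name = "student_info")
        public void mapping() {}

        public static void main(String[] args) throws Exception {
            Method m = RetentionDemo.class.getMethod("mapping");
            TableInfo info = m.getAnnotation(TableInfo.class);
            // Prints "student_info": the metadata survives compilation.
            System.out.println(info.name());
        }
    }

With SOURCE or CLASS retention, getAnnotation() would return null here, which is exactly the distinction the retention policy controls.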
Although annotations are not a core part of the programming language by themselves, people often use them to extend the language's support for custom features such as compiler information, documentation, runtime logging, generating additional files, and so on [12]. In addition, some recent research studies have suggested the use of Java annotations for validation [13], [14], [15], mining [16], [17], [18], and maintenance [19].
People commonly use XML to define and configure different software frameworks. However, XML is considered heavyweight because it has too many tags and its tree structure is bulky to parse. Hence, Java annotations can be used to store the configuration information of software systems instead of verbose XML.
In the field of deep learning, data labeling is one of the important processes in supervised training, and Java annotations can likewise serve as the metadata describing the various data structures in a data lake. As a result, the applications of Java annotations in different domains are yet to be explored by researchers.
C. TYPES OF SQL STATEMENTS
There are various types of Structured Query Language (SQL) statements used to process database records. The most commonly used are Data Control Language (DCL), Data Definition Language (DDL), Data Manipulation Language (DML), and Data Query Language (DQL) [20]. These sub-languages perform all the basic operations in the database engine.
• DCL operations grant access to elements in the database. Typical DCL statements are the GRANT and REVOKE statements.
• DDL operations define elements such as the schema for storing data. Typical DDL statements are the CREATE and DROP statements.
• DML operations manipulate the contents of data records. Typical DML statements are the INSERT, DELETE, and UPDATE statements.
• DQL operations retrieve data records and combine them into a subset of data. A typical DQL statement is the SELECT statement.
These sub-languages have traditionally been used only for manipulating structured data, i.e., database records. Hence, researchers should improve these sub-languages so that they can handle other types of data (semi-structured and unstructured) and assist the JAMDL framework in managing the heterogeneous data in data lakes.
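For concreteness, the following sketch issues one statement of each of the DDL, DML, and DQL types through plain JDBC. The in-memory H2 connection URL and the student table are placeholder assumptions; any JDBC-compliant engine would work with its own URL and driver on the classpath.

    import java.sql.*;

    public class SqlTypesDemo {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL: an in-memory H2 database is assumed here.
            try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement st = con.createStatement()) {
                // DDL: define the schema for storing data.
                st.execute("CREATE TABLE student (id INT PRIMARY KEY, name VARCHAR(50))");
                // DML: manipulate the contents of data records.
                st.executeUpdate("INSERT INTO student VALUES (1, 'Alice')");
                // DQL: retrieve a subset of records.
                try (ResultSet rs = st.executeQuery("SELECT name FROM student WHERE id = 1")) {
                    while (rs.next()) System.out.println(rs.getString("name"));
                }
                // DCL (GRANT/REVOKE) is omitted; it presupposes database user accounts.
            }
        }
    }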
D. ORMAPPING SOFTWARE FRAMEWORK
ORMapping is a mechanism for connecting classes in an object-oriented (OO) programming language to tables in a relational database [21]. ORMapping allows us to query and manipulate data stored in an RDB using OO approaches without the need to use DQL or DML [5]. As a result, programmers can retrieve and save data in a variety of database engines without having to write cumbersome SQL statements. In other words, developers can simply use their favorite programming language (Java, PHP, C#, etc.) without having to develop at the database level.
However, the popular ORMapping software frameworks on the market (Hibernate, MyBatis, TopLink, etc.) and the Java Persistence API (JPA) industry standard share many common issues and cannot fully satisfy the needs of developers [5]. For example, most ORMapping software frameworks are unable to address the impedance mismatch between structured, semi-structured, and unstructured data.
Although ORMapping has been around for a while, there are still some outstanding issues that cannot be fully settled [22]. A typical example is manipulating JavaScript Object Notation (JSON) data. JSON is one of the most commonly used data exchange formats in modern online systems, but existing ORMapping software frameworks only partially support the complex tree structure of JSON, which does not meet the expectations of modern developers. Most database engines that support JSON keep the JSON objects intact, which sacrifices flexibility and performance when searching and updating data.
To design a new ORMapping software framework that manipulates data in a data lake, conversion between different data formats is unequivocally a challenging problem. In general, there are three problems (mapping, retrieving, and persisting) that need to be overcome to design a new ORMapping software framework [23]. In the proposed solution, Java annotations are used as data objects to represent different data structures and to manipulate the data in the data lake. Thus, the JAMDL framework based on ORMapping must overcome the following problems.
• Map data in different formats to data objects.
• Store data objects to different datasets.
• Retrieve data from multiple datasets and convert it back to a single data object.

E. DATA STEWARDS
Over the years, people have had different names for those who work with data, such as data engineers, data analysts, and data scientists. Roughly speaking, data engineers are responsible for underlying data processing, data analysts for business insight analysis, and data scientists for academic research. People are usually assigned to specific roles based on their interests and job characteristics in the company. More recently, enterprises can even appoint data stewards for data governance, data quality control, data pipeline management, business definition regulation, glossary creation, and sensitive data operations.
Data governance is a very broad term that can cover many areas. It can be linked to plans, policies, and procedures for managing and implementing data accountability [24], [25], [26]. Data governance is now often used to describe the job responsibilities of a data steward. As the role of data stewards becomes more important, enterprises require them to have a more holistic view of the data lake in order to effectively manage its dynamic nature. It has also become necessary to use modern AI tools to help them in their governance efforts.
With the help of cutting-edge AI technologies, many hidden issues and problems can be detected in advance so that data stewards can react quickly. Therefore, modern software frameworks should also offer the ability to incorporate AI technologies that predict and recommend management strategies to data stewards.

F. AI TECHNIQUES
Since the rise of neural networks and AI-related techniques, they have come to dominate the field of academic research. They specialize in automation and prediction and can be applied to many different domains.
AI techniques can be used in many different areas to help the JAMDL framework provide powerful features for secondary developers. Common examples of neural networks that can help include the Convolutional Neural Network (CNN) for building classification models [27], [28]; the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) for building models that need to understand the context of sentences [29], [30]; the end-to-end (E2E) network for building translation models [32]; and Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) for voice and speaker recognition [31].
In this article, AI techniques are mainly used to generate SQL statements and to help data stewards manage data lakes. In a previous study [33], an E2E network was implemented to generate DQL-type SQL statements to query a database engine using natural language. E2E is very elegant and has been popularized for deep learning [32]. The idea of using a single model that specializes in predicting outputs directly from inputs can handle extremely complex systems and is arguably the most advanced deep learning technique. By the same token, people can use AI to generate complex DML-type SQL statements for processing enterprise-level transactions.
Data stewards also need AI technologies that provide alerts and recommendations for creating insightful analyses of business intelligence (BI) type reports and visualization diagrams. Furthermore, people can build AI models that rate data based on its accessibility and validate the data in the data lakes after the JAMDL object models are built. Subsequently, many areas where AI can be used to enhance the JAMDL framework are yet to be explored by researchers.

III. METHODOLOGY
To propose a new type of ORMapping software that comprehensively addresses the salient issues of data lakes, it is necessary to address the fundamental issues of data manipulation (mapping, retrieval, and persistence). All the modules listed in Fig. 2 are discussed below. The JAMDL framework is designed to handle different datasets, which will be demonstrated using the simple dataset mentioned in Fig. 1.

A. DATA MODELING
The mapping process is the most critical part of the proposed software framework. There are several conventional methods for mapping relational data to program objects. The common practice in today's software frameworks is to define an XML file for the object mapping process. However, popular software frameworks (Spring Boot, iBatis, etc.) require writing and managing many XML files for configuration. As a project evolves, their content and syntax become lengthy and complex. Instead, Java annotations are recommended as objects that describe the different types of data in a data lake.

1) OBJECT MODEL
The principle of OO is that everything can be object-based. Data in various formats can also be conceptually represented by corresponding objects. In a previous study [34], object models were illustrated to process unstructured data (parallel corpora). Therefore, the object model can help to represent and manage complex data. Abstraction refers to the basic characteristics of an object that distinguish it from all other types of objects, thus providing a clearly defined conceptual boundary relative to the perspective of the viewer [35]. This concept helps in discovering the characteristics of various data. Furthermore, it is also applicable to semi-structured data, which requires a higher level of abstract description.
The ease of use of Java annotations is unparalleled in the history of Java metadata. Java annotations are flexible enough to provide a retention policy that specifies how marked annotations are stored, whether they are stored only in code, compiled into classes, or available at runtime through reflection [11]. With Java annotations, the JAMDL framework is fully implemented based on ORMapping for manipulating data in data lakes; the source code is available on GitHub [36]. Fig. 3 shows the essential classes used to form the business logic of this software framework.

2) MAPPING APPROACH
To the extent that Java objects represent various types of data, the most fundamental metadata encompass field name, field type, entity type, and entity path. The terms "field" (table columns, log records, JSON attributes, etc.) and "entity" (database tables, log files, JSON files, etc.) here have more
abstract meanings and are used to support different data structures.
Once this metadata is defined in a Java class, the software framework generates JavaBeans for developers to manipulate the data, as well as DML and/or DQL operations for the software framework to manipulate the data in the data lake. Note that DML and DQL here have been enhanced to support the processing of data other than database records using Plain Old Java Objects (POJOs). The four metadata attributes are summarized below, followed by a sketch of how they might be declared.
• field name: This attribute is the name of the field that will be used to generate the JavaBeans and DML and/or DQL operations.
• field type: This attribute is the data type of the field (double, integer, string, etc.) that will be used to generate JavaBeans and DML and/or DQL operations. The field type is the data type of structured data table columns, semi-structured data attributes, and unstructured data records.
• entity type: This attribute indicates the type of entity (log file, JSON, RDB, etc.). Unlike traditional ORMapping software frameworks, which can only handle structured data (i.e., database records), it can associate semi-structured and unstructured data.
• entity path: This attribute indicates the location of the entity (file path, RDB name, etc.) in the data lake.
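Given the four attributes above and the way the annotation is applied in the mapping example of the next subsection, a plausible declaration of the DLMapper annotation type is sketched below. It is an inference from usage, not the released source: the attribute names match the StudentManager example, while the retention policy and the DLData enum are assumptions.

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    // Assumed field-type enum; the real framework may define this differently.
    enum DLData { Integer, String, Double }

    // Sketch of the mapping annotation inferred from its usage below.
    // Runtime retention is assumed; the framework could also process
    // the annotation at compile time with CLASS or SOURCE retention.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface DLMapper {
        String tableName();     // entity name, e.g., a table or file
        String beanName();      // fully qualified JavaBean to generate
        String[] columns();     // field names in the dataset
        String[] properties();  // corresponding JavaBean property names
        DLData[] types();       // field types of the columns
        String path();          // entity path: database name, file path, etc.
        int datasetType();      // kind of entity: RDB, JSON, log file, ...
    }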
3) CORE PROGRAMS
Development begins with the construction of the DLMapper program, an interface class for developers to annotate metadata. DLManager is a superclass for constructing CRUD transactions to manipulate data in a data lake after the DLMapper class has been defined. Fig. 3 shows the architecture of the DLManager and its relationship to the auxiliary Java classes. The roles of these auxiliary classes are summarized below.
• DLConverter: An interface class that contains APIs for converting JavaBeans to JSON objects, tabular database records to JavaBeans, and tabular database records to JSON objects.
• DLCRUD: An interface class that provides developers with a CRUD transaction API. It also provides advanced APIs for manipulating multiple tables and records at the same time.
• DLManager: An abstract class that implements CRUD transactions by generating DML and DQL operations. It also generates JavaBeans for developers.
• DLMapper: An interface class that stores the annotation information provided by the developer for the ORMapping process.
• DLTree: A class that provides developers with an API for dynamically building JSON objects.
• DLViewer: An interface class for storing the information needed to combine transactions from multiple tables using specific table join methods.

4) MAPPING PROCESS
The DLManager class triggers the generateSQL(), generateBean(), and generateService() APIs to take care of the tedious tasks for us. The following pseudocode demonstrates the mapping process using Java annotations in a subclass of DLManager. The database table (student_info) has two columns (student_id, student_name) that can be mapped to a JavaBean (StudentBean) whose properties (studentId, studentName) are of type integer and string, respectively. Moreover, the mapping() API specifies the primary key of the table.

    public class StudentManager extends DLManager {
        @Override
        @DLMapper(tableName = "student_info",
                  beanName = "dl.model.StudentBean",
                  columns = {"student_id", "student_name"},
                  properties = {"StudentId", "StudentName"},
                  types = {DLData.Integer, DLData.String},
                  path = "StudentDB", datasetType = 0)
        public void mapping() { super.setPkNum(1); }

        @Override
        public void joining() {}
    }
The entire mapping process is done in this ordinary Java subclass. The annotations appear in front of the mapping() API. Since the Java annotations are defined to be retained at compile time, the JavaBeans and the related DML and DQL operations will be generated after the subclass is compiled.
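For reference, the JavaBean generated from this mapping would essentially take the following shape; this is an illustrative reconstruction rather than the framework's actual output.

    // Illustrative shape of the generated bean for the student_info table.
    public class StudentBean {
        private Integer studentId;
        private String studentName;

        public Integer getStudentId() { return studentId; }
        public void setStudentId(Integer studentId) { this.studentId = studentId; }

        public String getStudentName() { return studentName; }
        public void setStudentName(String studentName) { this.studentName = studentName; }
    }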
B. DATA RETRIEVING
The object retrieval mechanism involves the READ transaction of CRUD. The software framework uses the industry-standard Java Database Connectivity (JDBC) for database connectivity, which is very mature and robust and supports most database engines.
There are four different types of JDBC connections (bridge, native API, middleware, and driver-only) to cater to the diversity of enterprise environments [37]. Furthermore, this software framework is designed to support the schema-on-read approach, which allows developers to define their own schema when querying a data lake. Since JAMDL is developed entirely in Java, the database connection methods use type four (driver-only), which is considered the most efficient. Consequently, unlike most of the well-known ORMapping software frameworks on the market that use other, more complex database connection types, the concise architecture of JAMDL should run faster than they do.
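A type-4 connection is a pure-Java driver that speaks the database's wire protocol directly, so opening it requires nothing but the driver jar on the classpath. The URL and credentials in the sketch below are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class Type4ConnectionDemo {
        public static void main(String[] args) {
            // With a type-4 driver (e.g., the MySQL or PostgreSQL JDBC driver)
            // on the classpath, DriverManager locates it from the URL scheme.
            String url = "jdbc:mysql://localhost:3306/StudentDB"; // placeholder
            try (Connection con = DriverManager.getConnection(url, "user", "password")) {
                System.out.println("Connected: " + !con.isClosed());
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }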
In most ORMapping software frameworks, the READ transaction can retrieve only one record from the database per query, which is insufficient for handling semi-structured and unstructured data. Therefore, this software framework provides a list() API to retrieve multiple records. It also provides developers with custom JSON objects to dynamically output data in the desired format.

1) RETRIEVING PROCESS
Once the mapping process is complete, CRUD transactions become very simple. The READ transaction can be implemented by simply calling StudentManager with a JavaBean. The following pseudocode demonstrates a READ transaction in which StudentManager retrieves a record based on a primary key.

    public class ReadStudent {
        public static void main(String[] args) {
            // Data Source Connection
            DSConn dsObject = new DSConn();
            StudentBean bean = new StudentBean();
            bean.setStudentId(123123);
            bean = (StudentBean) StudentManager.noSQL()
                    .read(dsObject, bean);
            System.out.println(bean.getStudentName());
        }
    }

The READ API provided by the software framework is very concise, as the abstract class DLManager already handles all the heavy lifting. First, it reads the annotation information from the mapping() API of the subclass. It then validates the number of attributes in the annotation and generates a JavaBean and a DQL operation. DQL operations are performed by using two APIs (callGetter() and callSetter()) to manipulate data in the data lake. Algorithm 1 and Algorithm 2 show the implementation of these two APIs inside the software framework.

Algorithm 1 Retrieving Data From a JavaBean
Input: bName, mName, bean
Output: value
1: function callGetter()
2:     Class<?> c ← Class.forName(bName)
3:     Method m ← c.getDeclaredMethod(mName)
4:     Object value ← m.invoke(bean)
5:     return value

Algorithm 2 Storing Data to a JavaBean
Input: bName, mName, bean, value
1: function callSetter()
2:     Class<?> c ← Class.forName(bName)
3:     Class<?>[] arg ← new Class[1]
4:     Method m ← c.getDeclaredMethod(mName, arg)
5:     m.invoke(bean, value)
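In plain Java, the two algorithms amount to the reflection calls below. The parameter-type lookup in callSetter is an assumption, since Algorithm 2 leaves the single element of the argument array unspecified.

    import java.lang.reflect.Method;

    // A direct Java rendering of Algorithms 1 and 2 (reflection-based accessors).
    public final class BeanAccess {

        // Algorithm 1: invoke a getter such as "getStudentName" on the bean.
        public static Object callGetter(String bName, String mName, Object bean)
                throws Exception {
            Class<?> c = Class.forName(bName);
            Method m = c.getDeclaredMethod(mName);
            return m.invoke(bean);
        }

        // Algorithm 2: invoke a setter such as "setStudentId" on the bean.
        // The setter's parameter type is assumed to be the runtime type of value.
        public static void callSetter(String bName, String mName, Object bean,
                                      Object value) throws Exception {
            Class<?> c = Class.forName(bName);
            Class<?>[] arg = new Class<?>[] { value.getClass() };
            Method m = c.getDeclaredMethod(mName, arg);
            m.invoke(bean, value);
        }
    }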
2) OBJECT INTEGRATION
According to the database normalization concept, a single transaction can be separated into multiple tables for persistence [38]. Querying back a transaction requires consolidating them together. The query statements become very complex if a transaction involves more than three tables, and joining database tables is doubtless a challenge in building novel software frameworks. As in the previous example in Fig. 1, student information is normalized into three different database tables. To retrieve complete information about a student, these three tables need to be joined.
In this software framework, DLManager provides a joining() API for developers to specify how to join tables together. Developers simply create a JoinBean, fill in the basic table join information (data fields, table name, and join key), and the software framework fetches the data from the database table accordingly, as sketched below.
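The following sketch shows what such a join specification could look like. JoinBean's setter names and the registration call are assumptions derived from the description above, not confirmed API, and the second table is hypothetical; only student_info is named in the running example.

    // Hypothetical join specification for the normalized student tables.
    public class StudentJoinManager extends DLManager {
        @Override
        public void mapping() {}

        @Override
        public void joining() {
            JoinBean join = new JoinBean();
            // Setter names assumed from "data fields, table name, and join key".
            join.setFields(new String[] {"student_id", "student_name", "course_name"});
            join.setTables(new String[] {"student_info", "student_course"});
            join.setJoinKey("student_id");
            super.addJoin("studentWithCourses", join); // registration call is assumed
        }
    }

The name under which the join is registered is presumably what the query APIs shown next accept as their argument.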
DLManager also provides developers with two result formats through the interface class DLCRUD: JavaBean and tabular data (a list of strings).

    +queryJoin(n: String): List<DLBean>
    +queryView(n: String): List<List<String>>

Instead of retrieving all records directly, DLManager can choose to limit the records fetched. Developers can use the setFilters() API of JoinBean to insert the where clause of a DQL operation.

    +setFilters(f: String[]): void

In addition, DLManager is capable of specifying different join methods. Since not all database engines support all join methods, the JAMDL framework provides the most commonly used methods so as to support most database engines; the default method in this software framework is the inner join. These join methods (inner join, left join, right join, full outer join, intersect, union, minus, etc.) are commonly found in most database engines for processing queries [1]. Moreover, these join methods are slightly modified to accommodate all database engines. Table 1 summarizes the syntax of the different join methods provided.
A SQL statement can be broken down into different parts (data fields, tables, keys, and filters) [33]. There are two main SQL patterns for table joins in this software framework, and the syntax diagram is shown in Fig. 4. As mentioned earlier, the DQL keywords (SELECT, FROM, ON, HAVING, and WHERE) separate the SQL into different parts. This software framework then reads the object information from the annotations and uses these predefined SQL templates to generate the requesting queries.
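As an illustration of such a template, the DQL generated for an inner join of two of the student tables might expand along the following lines; the student_course table and its columns are hypothetical placeholders, since only student_info is named in the running example.

    // Illustrative DQL a framework could emit for an inner join of two tables.
    String dql = "SELECT a.student_id, a.student_name, b.course_name "
               + "FROM student_info a "
               + "INNER JOIN student_course b "
               + "ON a.student_id = b.student_id "
               + "WHERE a.student_id = ?";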
FIGURE 10. Comparison of database running speeds of different software frameworks.

JAMDL (in orange) takes less time in the write operations on the right. There are several reasons why these pie charts are distributed in this pattern, which are summarized below.

1) DIFFERENT PURPOSES
These three software frameworks were designed for different purposes and are more concerned with functionality than with sheer speed of operation. Hibernate is designed for larger applications, can be used in most development environments, and supports most database engines. Therefore, it is considered to be more heavyweight than the others because it has more libraries consuming memory. On the other hand, MyBatis is designed for small and medium-sized applications and is therefore easier to use. The JAMDL framework focuses on processing data of different structures in a data lake. Hence, it is less constrained by database and operating system environments.

2) BATCH OPERATIONS
Since the JAMDL framework implements batch processing, it takes less time to write database records. Batch processing allows developers to combine related SQL statements into a batch and submit them to the database engine in a single call, reducing communication overhead, as sketched below.
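JDBC batching is the standard mechanism behind this behavior: related statements are queued on the client and submitted in one round trip. In the sketch below, the table and values are placeholders, and the code does not claim to be JAMDL's internal implementation.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class BatchWriteDemo {
        // Queue related INSERTs locally and submit them in one database call.
        public static void insertStudents(Connection con) throws SQLException {
            String dml = "INSERT INTO student_info (student_id, student_name) VALUES (?, ?)";
            con.setAutoCommit(false); // treat the whole batch as one transaction
            try (PreparedStatement ps = con.prepareStatement(dml)) {
                for (int id = 1; id <= 1000; id++) {
                    ps.setInt(1, id);
                    ps.setString(2, "student-" + id);
                    ps.addBatch();     // queued on the client side
                }
                ps.executeBatch();     // one round trip to the database engine
                con.commit();
            }
        }
    }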
3) TABLE JOIN
Since neither Hibernate nor MyBatis supports complex queries connecting multiple database tables, people usually need to write their own SQL statements. Therefore, performance becomes similar because the SQL is executed directly instead of going through the framework to generate the SQL statements. However, JAMDL better supports the generation of complex queries, saving developers time in developing SQL. Consequently, using the JAMDL framework for manipulating multiple tables is more efficient than using the other two software frameworks.

4) IMPACT ON COMPLEXITY
The SQL statements generated by the JAMDL framework are in a high-level language. Database engines can use their built-in functions to optimize these statements, and the optimization results are similar to those obtained with manually written SQL. The purpose of the JAMDL framework is mainly to handle data of different structures in the data lake through Java annotations, and there should not be any impact on complexity.

D. DYNAMIC DATASETS
The manipulation of different structures in data lakes is fully demonstrated in Section III. One may realize that switching from one data structure to another as time goes by requires a huge amount of effort. Others may be aware of the feasibility of managing multiple data structures simultaneously.
The JAMDL framework abstracts data structures into an object model that can represent different data structures at the same time. In addition, the lightweight nature of Java annotations makes it easy to modify the object model to meet changing demands. Therefore, the JAMDL framework can address these concerns when managing complex data.

V. CONCLUSION
In this article, a software framework JAMDL based on ORMapping is comprehensively presented. JAMDL aims to provide a solution for the manipulation of different data structures in a data lake. JAMDL solves the problem of managing diverse data and overcomes the difficulty of transforming data between different structures in the data lake.
[26] Y. Demchenko and L. Stoy, "Research data management and data stewardship competences in university curriculum," in Proc. IEEE Global Eng. Educ. Conf. (EDUCON), Apr. 2021, pp. 1717–1726.
[27] I. H. Sarker, "Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions," Social Netw. Comput. Sci., vol. 2, no. 6, pp. 1–20, Aug. 2021.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30. Red Hook, NY, USA: Curran Associates, Dec. 2017, pp. 5999–6009.
[29] F. Mortezapour Shiri, T. Perumal, N. Mustapha, and R. Mohamed, "A comprehensive overview and comparative analysis on deep learning models: CNN, RNN, LSTM, GRU," 2023, arXiv:2305.17473.
[30] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder–decoder approaches," in Proc. SSST EMNLP, Sep. 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:11336213
[31] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech, Oct. 2020, pp. 3830–3834.
[32] T. Glasmachers, "Limits of end-to-end learning," in Proc. 9th Asian Conf. Mach. Learn. (ACML), Apr. 2017, pp. 17–32.
[33] L. M. Hoi, W. Ke, and S. K. Im, "Data augmentation for building QA systems based on object models with star schema," in Proc. IEEE 3rd Int. Conf. Power, Electron. Comput. Appl. (ICPECA), Jan. 2023, pp. 244–249.
[34] L. M. Hoi, W. Ke, and S. K. Im, "Corpus database management design for Chinese-Portuguese bidirectional parallel corpora," in Proc. IEEE 3rd Int. Conf. Comput. Commun. Artif. Intell. (CCAI), May 2023, pp. 103–108.
[35] G. Booch, R. A. Maksimchuk, M. W. Engle, B. J. Young, J. Conallen, and K. A. Houston, Object-Oriented Analysis and Design With Applications, 3rd ed. Reading, MA, USA: Addison-Wesley Professional, Apr. 2008.
[36] L. M. Hoi. (Mar. 2023). An Open-Source Software Framework for Manipulating Data Lakes. [Online]. Available: https://github.com/LapmanHoi/Annotation
[37] Y. Bai, JDBC API and JDBC Drivers, 1st ed. Hoboken, NJ, USA: Wiley, May 2012.
[38] C. Beeri, P. A. Bernstein, and N. Goodman, "A sophisticate's introduction to database normalization theory," in Proc. 4th Int. Conf. Very Large Data Bases, Sep. 1978, pp. 113–124.
[39] S. Botros, High Performance MySQL: Proven Strategies for Operating at Scale, 4th ed. O'Reilly Media, Dec. 2021.
[40] A. Géron, Hands-on Machine Learning With Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed. O'Reilly Media, Oct. 2019.
[41] F. Chollet, Deep Learning With Python, 2nd ed. Manning, Dec. 2021.
[42] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-to-end memory networks," in Proc. Conf. Neural Inf. Process. Syst. (NIPS), 2015, pp. 2440–2448.

LAP MAN HOI (Member, IEEE) received the bachelor's degree in computer science from York University, Canada, and the master's degree in internet computing from the Queen Mary University of London. He is currently pursuing the Ph.D. degree in computer applied technology with the Faculty of Applied Sciences, Macao Polytechnic University (MPU). He was a researcher in gaming and entertainment. He is also a researcher in machine translation with the Faculty of Applied Sciences, MPU. His research interests include internet computing, data warehouses, data science, gaming, deep learning, machine translation, and voice recognition.

WEI KE (Member, IEEE) received the Ph.D. degree from the School of Computer Science and Engineering, Beihang University. He is currently a Professor with the Faculty of Applied Sciences, Macao Polytechnic University. His research interests include programming languages, image processing, computer graphics, tool support for object-oriented and component-based engineering and systems, the design and implementation of open platforms for applications of computer graphics, and pattern recognition, including programming tools, environments, and frameworks.

SIO KEI IM (Member, IEEE) received the degree in computer science and the master's degree in enterprise information systems from King's College London, University of London, U.K., in 1998 and 1999, respectively, and the Ph.D. degree in electronic engineering from the Queen Mary University of London (QMUL), U.K., in 2007. He became a Lecturer with the Computing Program, Macao Polytechnic Institute (MPI), in 2001. In 2005, he became the Operations Manager of the MPI-QMUL Information Systems Research Center, jointly operated by MPI and QMUL, where he carried out signal processing work. He was promoted to Professor with MPI in 2015. He was a Visiting Scholar with the School of Engineering, University of California, Los Angeles (UCLA), and an Honorary Professor with The Open University of Hong Kong.