ACADEMIC JOURNAL ON BUSINESS ADMINISTRATION, INNOVATION & SUSTAINABILITY
Vol 04 | Issue 04 | October 2024
ISSN: 2997-9552
Page: 1-18

RESEARCH ARTICLE OPEN ACCESS

A SYSTEMATIC REVIEW OF BIG DATA INTEGRATION CHALLENGES AND SOLUTIONS FOR HETEROGENEOUS DATA SOURCES

Farhana Zaman Rozony¹, Mst Nahida Aktar Aktar², Md Ashrafuzzaman³, Ashraful Islam⁴

¹ Graduate Researcher, Master of Science in Information Management System, College of Business, Lamar University, Texas, USA
Email: [email protected]
² Graduate Researcher, Master of Science in Information Management System, College of Business, Lamar University, Texas, USA
Email: [email protected]
³ Master in Management Information System, International American University, Los Angeles, USA
Email: [email protected]
⁴ Master of Science in Information Technology, Washington University of Science and Technology, Alexandria, Virginia, USA
Email: [email protected]

This systematic review explores the current challenges and emerging solutions in big data integration, focusing on key issues such as semantic heterogeneity, data quality, scalability, and security. Using the PRISMA guidelines, 150 peer-reviewed articles were analyzed to identify both established and innovative approaches to integrating data from heterogeneous sources. The findings reveal that ontology-based frameworks are widely used to address semantic inconsistencies but face limitations in scalability when handling large, dynamic datasets. Machine learning has emerged as a powerful tool for automating data quality and schema matching processes, although its effectiveness is highly dependent on the availability of high-quality training data. Distributed computing frameworks like Hadoop and Spark have become the industry standard for scalable data integration, yet their implementation requires significant infrastructure and technical expertise. Cloud-based platforms offer flexible, scalable solutions, but concerns about data privacy and security persist. Blockchain technology, while promising for secure and decentralized data integration, is still in its infancy and struggles with scalability. The review highlights significant progress in the field but underscores the need for further research to address unresolved challenges in real-time integration, cross-domain data harmonization, and the management of unstructured data.

Submitted: August 05, 2024
Accepted: September 29, 2024
Published: October 2, 2024

Corresponding Author: Farhana Zaman Rozony, Graduate Researcher, Master of Science in Information Management System, College of Business, Lamar University, Texas, USA
Email: [email protected]
DOI: 10.69593/ajbais.v4i04.111

KEYWORDS

Big Data Integration, Heterogeneous Data Sources, Data Transformation, Semantic Modeling, Machine Learning Integration

Copyright: © 2024 Sakib A I M. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original source is cited.

1 Introduction

The advent of big data analytics has revolutionized the way organizations store, manage, and analyze vast quantities of data, leading to increased adoption of hybrid cloud databases. As businesses generate ever-growing volumes of structured, semi-structured, and unstructured data, traditional data storage systems often struggle to keep pace with the demands of scalability, flexibility, and cost-effectiveness (Alghamdi et al., 2020). Hybrid cloud databases, which integrate the strengths of both public and private cloud environments, provide a more adaptable solution for managing big data. These databases offer organizations the ability to balance the benefits of scalability and flexibility with the security and control of on-premises infrastructure (Elnour et al., 2021). The rise of hybrid cloud architectures, therefore, is not only a response to growing data needs but also a reflection of the evolving technological landscape, where businesses are prioritizing data-driven decision-making and operational efficiency.

Hybrid cloud databases allow companies to optimize their data management strategies by distributing data storage and processing tasks across both cloud environments and on-premises systems (Su & Wang, 2020). This architecture addresses several critical challenges, including data sovereignty, latency, and compliance, while also enhancing scalability and availability. For instance, public clouds provide on-demand resources for data-intensive tasks, such as large-scale analytics and machine learning, while private clouds offer enhanced security and control for sensitive data (Wang & Chen, 2021). The flexibility to shift workloads between these environments enables organizations to improve operational efficiency and reduce costs, which is especially important in industries like finance, healthcare, and retail, where data security and privacy are paramount (Zhang et al., 2020).

Figure 1: Comparative Study of Big Data Heterogeneity Solutions
Source: Yang et al. (2019)

Figure 2: The 3Vs of big data: volume, variety, and velocity
Source: Sabri et al. (2020)

Performance is a crucial consideration for organizations adopting hybrid cloud databases. Various studies have highlighted the ability of hybrid cloud databases to enhance data processing speed and reliability through optimized resource allocation and load balancing (Xu et al., 2021). In addition, hybrid cloud architectures support advanced data analytics frameworks, such as Hadoop and Spark, which can process large datasets efficiently (Himeur et al., 2022a). However, performance optimization often depends on factors such as the design of the database, network latency between cloud environments, and the specific data analytics workloads being executed (Jia et al., 2019). As organizations increasingly rely on real-time data analytics for decision-making, ensuring high-performance standards in hybrid cloud environments remains a top priority.

The cost-efficiency of hybrid cloud databases is another critical factor influencing their adoption. Traditional on-premises data centers often require substantial capital investments in hardware,


maintenance, and personnel, while public cloud services offer a pay-as-you-go model that can significantly reduce upfront costs (Smolak et al., 2020). Hybrid cloud databases allow organizations to strike a balance between these two models, leveraging the scalability of public clouds for burst workloads while maintaining control over core data systems on private infrastructure (Liu et al., 2018). Furthermore, hybrid cloud solutions can help businesses avoid vendor lock-in by enabling them to choose the most cost-effective cloud service providers for specific workloads. However, cost considerations must account for the complexity of managing hybrid environments, including network costs and the need for specialized expertise (Huang et al., 2017).

Despite the numerous benefits of hybrid cloud databases, several challenges remain, particularly in terms of architecture complexity, data migration, and security concerns. Studies have shown that integrating data across public and private clouds can introduce new vulnerabilities, particularly in data transmission and access control (Liu et al., 2021). Ensuring seamless interoperability between different cloud environments and on-premises systems also requires careful planning and execution (Diamantoulakis et al., 2015). Moreover, the hybrid cloud model’s dependence on internet connectivity makes it susceptible to latency and network issues, which can affect the overall performance of data analytics processes. As the demand for hybrid cloud solutions continues to grow, further research is needed to address these challenges and develop best practices for secure, efficient, and cost-effective hybrid cloud database management.

The objective of this systematic review is to critically analyze and synthesize the key challenges and solutions associated with big data integration from heterogeneous data sources. Specifically, the review aims to identify the primary technical and semantic obstacles that organizations face when attempting to integrate data from multiple, diverse systems, including structured, semi-structured, and unstructured data. Furthermore, the objective is to evaluate existing methodologies and tools, such as ontology-based frameworks, schema matching techniques, and machine learning-driven approaches, which have been proposed to address these challenges. By examining both academic literature and practical case studies, this review seeks to provide a comprehensive understanding of the most effective strategies for enhancing data consistency, accuracy, and scalability in big data environments. Ultimately, the goal is to offer actionable insights for researchers and practitioners working to improve the integration of complex and heterogeneous data in various industries.

2 Literature Review

The integration of big data from heterogeneous sources has become a critical area of focus in both academic research and industry practice. As organizations increasingly rely on diverse data sources, including structured, semi-structured, and unstructured data, the complexity of managing and integrating this information has grown exponentially. A comprehensive review of existing literature provides valuable insights into the challenges associated with big data integration, including semantic heterogeneity, data quality issues, and scalability concerns. In addition to identifying these challenges, the literature also explores a variety of solutions, such as ontology-based frameworks, data transformation techniques, and machine learning algorithms, which aim to improve the integration process. This section reviews key studies in the field, highlighting current approaches, tools, and methodologies that have been developed to address the unique demands of integrating big data from diverse sources. Through a critical examination of existing research, this review will establish the theoretical and practical foundations necessary for understanding and advancing big data integration techniques.

2.1 Big Data Integration

The evolution of big data integration has paralleled the rapid expansion of data-driven industries in recent decades. Big data, with its defining characteristics of volume, velocity, variety, and veracity, has transformed how organizations manage and analyze information (Fatema et al., 2020). The integration of heterogeneous data sources, which includes structured, semi-structured, and unstructured data, has become a central focus in fields such as healthcare, finance, and manufacturing, as organizations seek to derive actionable insights from vast and diverse datasets (Liu et al., 2018). Early approaches to data integration were relatively simple, focusing on combining structured data from relational databases. However, the increasing complexity of data sources, ranging from

social media streams to sensor data, has necessitated more advanced methods for managing, cleaning, and integrating information from disparate systems (Singh & Yassine, 2018). As the need for comprehensive data integration grew, so too did the research exploring challenges and potential solutions.

One of the earliest challenges in big data integration emerged from the semantic heterogeneity across different systems. Various data sources often utilize different formats, terminologies, and structures to represent similar or identical information, creating significant barriers to effective integration (Diamantoulakis et al., 2015). For example, disparate databases may refer to customer data differently, one by ID and another by name, leading to difficulties in mapping and merging datasets. Early research in the 2000s explored solutions through ontology-based frameworks that could standardize data semantics, allowing for more seamless integration (Hu & Vasilakos, 2016). These frameworks evolved from basic schema matching techniques to more sophisticated models that leverage artificial intelligence (AI) and machine learning (ML) to automate data transformation and integration (Jim et al., 2024; Abdur et al., 2024). Over time, the development of semantic integration tools has significantly advanced, making it easier for organizations to integrate data from heterogeneous sources.

The issue of data quality has persisted as another major challenge in big data integration. Poor data quality, characterized by inconsistencies, inaccuracies, and missing values, can undermine the utility of even the most well-integrated data systems (Ahmed et al., 2024; Islam & Apu, 2024b; Nahar et al., 2024). As big data environments expanded to include real-time data from sources such as IoT devices and social media, the potential for data quality issues grew exponentially (Ahmed et al., 2024; Hossain et al., 2024; Islam, 2024). Research in the field of data cleansing has evolved significantly, with early efforts focusing on manual data cleaning methods and progressing toward automated, ML-driven solutions that can detect and rectify errors in real-time. Advances in machine learning and AI have facilitated the development of more sophisticated error detection and correction systems that are capable of handling the scale and complexity of modern big data environments.

Figure 1: Traditional Big Data Integration
Source: Medium (2018)

Another critical development in big data integration is the need for scalable data processing frameworks. Traditional data integration systems, which were designed to handle smaller, static datasets, struggled to keep up with the growing volume and velocity of big data. Early solutions focused on distributed computing frameworks, such as MapReduce, which allowed for parallel processing of large datasets across multiple nodes (Dean & Ghemawat, 2008). These frameworks have since evolved into more advanced systems like Apache Hadoop and Spark, which are capable of handling real-time data streams and offering greater


scalability and flexibility. Recent research has also explored cloud-based integration solutions, which allow organizations to scale their data processing capabilities as needed, without being constrained by physical infrastructure limitations (Islam & Apu, 2024). The shift toward cloud computing has further accelerated the adoption of big data integration practices in industries that rely on real-time analytics and decision-making processes.

2.2 Big Data Integration Challenges

As big data continues to evolve, the integration of heterogeneous data sources presents significant challenges. One of the most prominent challenges is semantic heterogeneity, which arises from the variations in data formats, terminologies, and structures across different data systems (Roccetti et al., 2019). Semantic heterogeneity can occur when similar data elements are represented differently, making it difficult to align and interpret data from multiple sources accurately (Taştan & Gökozan, 2019). For instance, a customer ID in one system may be represented as a numerical value, while another system uses alphanumeric codes or full names. The variations in how data is stored and categorized lead to inconsistencies, requiring sophisticated tools and methods to standardize the information across systems (Sun & Scanlon, 2019). The evolution of ontology-based frameworks has helped address this challenge by offering a structured semantic understanding of data, which facilitates its integration from disparate sources (Plageras et al., 2018).

Data quality issues further complicate big data integration, especially in environments characterized by the influx of vast amounts of real-time and multi-format data. Poor data quality, often manifested as inaccuracies, inconsistencies, and missing data, can undermine the effectiveness of integrated datasets (Sayed et al., 2022). This problem is exacerbated in heterogeneous data environments where sources like IoT devices, social media, and traditional databases produce fragmented data that may lack validation (Fatema et al., 2020). As big data systems have evolved, data cleansing and validation techniques have become essential in ensuring that data integration processes are reliable (Liu et al., 2018). Machine learning and artificial intelligence (AI) have been increasingly employed to detect and correct data quality issues automatically, enabling the seamless integration of large, complex datasets (Diamantoulakis et al., 2015).

Figure 2: Mindmap of Big Data Integration Challenges

Scalability is another significant challenge in big data integration. The increasing volume, velocity, and variety of big data (often referred to as the "3Vs") pose severe difficulties for traditional data systems (Al-Ali et al., 2017). As data grows exponentially, the need for scalable processing frameworks becomes apparent. Early big data systems struggled to process large amounts of incoming data efficiently, especially when real-time analysis was required (Da Silva Lopes et al., 2020). However, the development of distributed systems, such as Hadoop and Spark, has offered scalable solutions by enabling parallel processing and reducing the computational burden on individual


systems (Varlamis et al., 2022). These systems have evolved to handle real-time data streams more efficiently, improving the overall integration process for big data environments (Mahmud et al., 2020). As data volume and velocity continue to grow, further advancements in distributed processing frameworks will be necessary to meet the increasing demands of modern data systems.

The lack of standardization in data protocols and formats remains a persistent challenge in big data integration. Different data sources often utilize varying formats, such as XML, JSON, and CSV, which complicates the design of integration systems that can accommodate multiple formats simultaneously (Xiao-wei, 2019). Non-standardized data structures can result in integration errors, delays, and increased costs as organizations attempt to normalize data from disparate sources. Schema matching and transformation techniques have emerged as essential solutions for aligning different data formats into a common structure, making integration more feasible (Al-Ali et al., 2017). Recent advancements in AI-driven schema matching tools have automated much of this process, enabling more efficient data transformation and reducing the need for manual intervention (Xiao-wei, 2019). These advancements have played a critical role in overcoming standardization challenges, but the complexity of integrating data from rapidly evolving sources continues to demand further innovation.

In addition to these challenges, the evolution of big data integration has also witnessed the increasing use of AI and machine learning to enhance the process. Machine learning algorithms can automate many aspects of data integration, from detecting semantic inconsistencies to addressing data quality issues (Zhao et al., 2020). Additionally, AI-driven systems can help optimize the scalability of data integration processes by predicting computational needs and allocating resources accordingly (Elkhoukhi et al., 2019). As big data systems continue to evolve, AI and machine learning will likely play an even more significant role in automating and streamlining the integration process. However, while these technologies offer promising solutions, they also present new challenges related to data privacy, ethical concerns, and the need for transparency in machine learning models (Himeur et al., 2022b).

2.3 Solutions for Big Data Integration

One of the most significant approaches to addressing the challenges of big data integration is ontology-based frameworks. Ontologies provide a structured semantic understanding of data by defining the relationships between various data elements, making it easier to integrate heterogeneous data from different sources. By creating a shared vocabulary and set of definitions, ontologies allow data systems to communicate and interpret data consistently, thereby reducing semantic heterogeneity. In practice, ontology frameworks have been applied across various industries, including healthcare and finance, to standardize terminologies and improve the accuracy of data integration. For example, the use of ontologies in biomedical research, such as the Gene Ontology (GO), has been instrumental in managing and integrating large volumes of biological data from multiple sources, ensuring semantic consistency across datasets (Xiaoping et al., 2020). The evolution of ontology-based solutions highlights their critical role in managing the complexity of big data integration, particularly when dealing with semantically diverse datasets.

Machine learning (ML) has also emerged as a powerful tool for automating the process of big data integration. ML techniques can be used to address a variety of integration challenges, including schema matching, data cleansing, and error detection, by learning patterns from the data itself (Jim et al., 2024). One of the key benefits of ML-based approaches is their ability to handle large and complex datasets that traditional rule-based systems struggle with (Abdur et al., 2024). For instance, ML algorithms can automatically detect and correct inconsistencies in data, improving the quality of integrated datasets without the need for manual intervention (Islam, 2024). However, ML-based data integration also presents certain limitations, such as the need for large amounts of labeled training data and the potential for biased or inaccurate models if the training data is not representative (Islam & Apu, 2024b). Despite these challenges, the application of machine learning continues to expand, with ongoing research focused on improving the efficiency and accuracy of ML-based data integration techniques.

Distributed computing frameworks, such as Hadoop and Spark, have become indispensable in addressing the scalability challenges

associated with big data integration. These frameworks allow for the parallel processing of large datasets across multiple nodes, significantly increasing the computational power available for data integration tasks (Nahar et al., 2024). Hadoop, one of the earliest open-source distributed computing frameworks, popularized the MapReduce programming model, which enables the processing of vast amounts of data by distributing the workload across multiple servers. Apache Spark, a more recent development, builds on Hadoop's capabilities by offering in-memory processing, which reduces the time required for data integration tasks and supports real-time analytics. Studies have demonstrated the efficiency of these distributed frameworks in managing big data integration at scale, particularly in industries such as finance and telecommunications, where real-time processing of high-velocity data streams is crucial (Jim et al., 2024).

Another critical solution for big data integration is the development of data transformation and schema matching techniques. Schema matching is the process of identifying correspondences between the attributes of different datasets to facilitate their integration into a unified format. As data sources often use varying structures and formats, schema matching and transformation are essential for converting heterogeneous data into a common format that can be used across systems. Advances in schema matching algorithms, such as those incorporating machine learning and AI, have improved the accuracy and efficiency of this process, enabling more seamless data integration. For instance, modern tools can automatically identify and match schema elements based on learned patterns, reducing the need for manual intervention and improving the scalability of integration efforts (Ahmed et al., 2024). These techniques are vital for ensuring that data from diverse sources can be effectively integrated and used for analysis.

As big data systems continue to evolve, combining multiple solutions (such as ontology-based approaches, machine learning, distributed computing, and schema matching) has become a key strategy for overcoming the challenges of data integration. Each of these solutions addresses specific aspects of the integration process, whether it be managing semantic heterogeneity, ensuring data quality, or improving scalability. For instance, integrating ontologies with machine learning algorithms can enhance the semantic accuracy of integrated datasets while also automating error detection and correction (Hossain et al., 2024). Similarly, combining distributed computing frameworks with schema matching techniques enables the efficient integration of large-scale, heterogeneous data sources, ensuring that data can be processed and analyzed in real-time (Islam, 2024). As the complexity of data environments grows, the integration of these various solutions will be critical for managing the increasing demands of big data systems.
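
As a rough illustration of the schema matching step described in this section (identifying correspondences between the attributes of different datasets), the following Python sketch pairs column names from two sources by string similarity. It is a minimal name-based matcher and not one of the ML-driven tools surveyed in the reviewed studies. The column names, the `normalize` and `match_schemas` helpers, and the 0.6 threshold are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a column name and drop separators, so 'Cust_ID' becomes 'custid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_schemas(source_cols, target_cols, threshold=0.6):
    """Map each source column to its most similar target column.

    Returns {source_col: (target_col, score)}; pairs scoring below the
    (assumed) threshold are left unmatched for manual review.
    """
    matches = {}
    for src in source_cols:
        best_col, best_score = None, 0.0
        for tgt in target_cols:
            score = SequenceMatcher(None, normalize(src), normalize(tgt)).ratio()
            if score > best_score:
                best_col, best_score = tgt, score
        if best_col is not None and best_score >= threshold:
            matches[src] = (best_col, round(best_score, 2))
    return matches


# Hypothetical schemas from two systems that describe the same customers.
crm_columns = ["Cust_ID", "FullName", "EmailAddr"]
billing_columns = ["customer_id", "name", "email_address", "invoice_total"]

mapping = match_schemas(crm_columns, billing_columns)
for src, (tgt, score) in mapping.items():
    print(f"{src} -> {tgt} (similarity {score})")
```

A purely lexical matcher like this cannot relate names that differ entirely in wording (for example `dob` and `birth_date`), which is one reason the literature turns to ontologies and learned models for semantic correspondences.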

Figure 3: Big Data Integration Challenges Over Time
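
The MapReduce programming model mentioned in Section 2.3 can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce phases, assuming hypothetical records tagged with a `source` field. It does not use Hadoop or Spark, where the partitions would live on different nodes and the phases would run in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit one (key, value) pair per record, keyed by its source system."""
    yield record["source"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Two hypothetical partitions, standing in for data held on separate nodes.
partitions = [
    [{"source": "crm"}, {"source": "iot"}],
    [{"source": "iot"}, {"source": "web"}, {"source": "iot"}],
]

mapped = chain.from_iterable(
    map_phase(record) for partition in partitions for record in partition
)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)  # {'crm': 1, 'iot': 3, 'web': 1}
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across machines without coordination beyond the shuffle, which is the property that makes the model scale.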


2.4 Semantic Similarity in Ontology-Based Integration

Semantic similarity is crucial in managing semantic heterogeneity across diverse data sources. A common formula used to measure similarity between concepts in ontologies is based on Information Content (IC):

$$\mathrm{Sim}(C_1, C_2) = \frac{2 \times IC(LCS(C_1, C_2))}{IC(C_1) + IC(C_2)}$$

Where:
• $C_1$, $C_2$ are two concepts being compared.
• $LCS(C_1, C_2)$ is the Least Common Subsumer, or the most specific ancestor shared by $C_1$ and $C_2$.
• $IC$ is the Information Content, typically derived from corpus data.

This equation relates to semantic heterogeneity by quantifying how similar two data elements are, helping reduce discrepancies when integrating heterogeneous data from different sources.

2.5 Comparative Analysis of Integration Techniques

Ontology-based approaches and machine learning (ML) techniques represent two distinct methods for addressing the challenges of big data integration, each with its own strengths and weaknesses. Ontology-based approaches offer a structured way to manage semantic heterogeneity by defining relationships and standardizing terminologies across diverse datasets (Mahmud et al., 2020). This makes ontologies particularly useful in domains like healthcare and finance, where consistency and accuracy are crucial (Al-Ali et al., 2017). However, ontology-based approaches often require significant manual effort to create and maintain, and they may struggle to scale when integrating highly dynamic or rapidly evolving datasets. In contrast, ML-driven approaches can automate the integration process by learning patterns from data, reducing the need for manual intervention. Machine learning techniques excel in situations where data is too large or complex for rule-based systems to handle, but they rely heavily on the availability of high-quality training data and can produce biased outcomes if the training data is insufficient or unrepresentative.

Figure 4: SWOT Analysis for this study

Several case studies highlight the differing applications of ontology-based and machine learning approaches. For example, the Gene Ontology (GO) project is a successful implementation of an ontology-based approach that has standardized biological data across multiple sources, allowing for seamless integration and comparison of genetic data from different species. In contrast, machine learning techniques have been applied to large-scale financial datasets to automate the integration of transactional data from multiple systems, where manual ontology creation would be impractical due to the sheer volume


and velocity of the data (Da Silva Lopes et al., 2020). These examples demonstrate that ontology-based approaches are more suited to environments where data standardization is crucial, while ML techniques are preferable in dynamic and high-velocity data environments where automation is necessary (Roccetti et al., 2019).

Schema matching and distributed computing frameworks are also widely used techniques for big data integration, each with distinct advantages and limitations. Schema matching focuses on aligning the structures of disparate datasets by identifying correspondences between schema elements. This is particularly useful for integrating structured data from relational databases or systems with well-defined formats. Schema matching techniques have evolved with the incorporation of AI and ML to automate much of the process, reducing the time and effort required for integration. However, schema matching alone is limited when dealing with large, unstructured, or semi-structured datasets, which are increasingly common in big data environments. Distributed computing frameworks, such as Hadoop and Spark, address this limitation by enabling the parallel processing of vast datasets across multiple nodes (Mahmud et al., 2020; Roccetti et al., 2019). These frameworks excel in handling large, diverse datasets in real-time but may require significant infrastructure investment and expertise to implement effectively.

When comparing schema matching and distributed computing frameworks in terms of performance, efficiency, and cost-effectiveness, several trade-offs emerge. Schema matching techniques are generally more efficient for smaller, structured datasets, as they focus on aligning schema elements and transforming data into a common format. These techniques are relatively cost-effective, as they do not require extensive infrastructure and can be implemented with off-the-shelf software (Dey et al., 2020; Shamim, 2022). However, they may struggle to keep up with the scale and speed of modern big data environments. Distributed computing frameworks, on the other hand, are highly scalable and capable of processing large datasets in parallel, making them ideal for real-time data integration in industries like telecommunications and finance (Himeur et al., 2022b). While these frameworks offer superior performance in large-scale [...] for specialized technical skills to manage distributed systems (Zhou & Yang, 2016).

2.6 Emerging Trends in Big Data Integration

The rise of artificial intelligence (AI) has significantly transformed the field of big data integration, with recent developments focused on enhancing the automation and optimization of data workflows. AI-powered data integration solutions utilize machine learning (ML) algorithms and natural language processing (NLP) techniques to automate processes like schema matching, data cleansing, and error detection. This not only reduces the need for manual intervention but also improves the accuracy and efficiency of data integration. Studies show that AI-driven tools are particularly effective in handling unstructured data and can adapt to evolving data patterns in real-time, making them ideal for dynamic data environments. However, challenges remain, such as the need for large amounts of high-quality training data and the potential for biases in AI models if the training data is insufficient or skewed. Nevertheless, the integration of AI in big data processes continues to gain traction, with ongoing research aimed at addressing these limitations and further optimizing integration workflows (Mahmud et al., 2020).

Cloud computing has also emerged as a crucial enabler of big data integration by providing scalable, flexible, and cost-effective solutions. Major cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, offer robust infrastructure for processing and integrating massive datasets. These cloud-based solutions facilitate real-time data integration by providing on-demand resources that can be scaled according to the volume and velocity of incoming data. Research has shown that cloud platforms allow organizations to handle vast amounts of heterogeneous data efficiently, without the need for expensive on-premises infrastructure. Additionally, cloud-based integration services often include built-in AI and ML capabilities, further enhancing the speed and accuracy of data integration (Dey et al., 2020). However, concerns around data privacy, security, and compliance remain significant barriers to cloud adoption for sensitive data integration tasks.

Blockchain technology represents another emerging trend in big data integration, offering secure,
data environments, they come with higher decentralized frameworks for managing and
infrastructure and operational costs, as well as the need integrating data. Blockchain’s distributed ledger
Doi: 10.69593/ajbais.v4i04.111

system ensures data integrity and immutability, making it particularly useful for secure data sharing and integration across multiple, often untrusted, entities. Blockchain can enhance the traceability and transparency of data integration processes by providing verifiable transaction records that can be accessed in real time. This technology is particularly promising in industries like healthcare and finance, where data security and privacy are critical. Recent research has also explored the potential of combining blockchain with AI to create smart contracts that can automate data integration workflows while ensuring secure and transparent transactions. However, the integration of blockchain with big data systems is still in its nascent stages, with challenges related to scalability, transaction speed, and interoperability needing further exploration (Aazam et al., 2018; Shamim, 2022).

AI and blockchain technologies are increasingly being integrated into cloud-based systems to create more efficient and secure data integration environments. For instance, AI-powered cloud platforms can automate the integration of heterogeneous data sources while ensuring real-time processing, and blockchain can secure the integrity of this data as it moves across different systems. The combination of these technologies offers significant benefits, particularly in terms of improving the scalability and security of big data integration workflows. Several studies have demonstrated the potential of these combined approaches in fields such as supply chain management and finance, where secure, real-time data integration is crucial. As cloud computing continues to evolve, its integration with AI and blockchain technologies is expected to play an even more significant role in managing complex data environments.

The integration of AI, cloud computing, and blockchain technologies represents a significant shift in the way organizations handle big data. AI-driven automation improves the efficiency of data integration processes, cloud computing provides scalable infrastructure, and blockchain ensures data security and transparency. As these technologies continue to evolve, they are likely to address many of the current challenges in big data integration, including scalability, data quality, and security (Al-Ali et al., 2017). However, as with any emerging technology, these solutions come with their own set of challenges, including the need for large-scale infrastructure, data privacy concerns, and potential biases in AI models. Future research will need to focus on overcoming these barriers to fully harness the potential of AI, cloud, and blockchain technologies in big data integration.
Figure 5: Complex Combined Boxplot and Line Chart for Big Data Integration Trends
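To make the ledger idea discussed in this section more concrete, the following minimal Python sketch shows how chaining record hashes makes tampering detectable. It is an illustration only: it runs in a single process and omits consensus, networking, and smart contracts, so it is not a blockchain implementation. The function names (`record_hash`, `append_record`, `is_valid`) and the sample payloads are hypothetical and do not come from any of the reviewed studies.

```python
import hashlib
import json


def record_hash(payload: dict, prev_hash: str) -> str:
    """Hash a record together with the hash of the record before it."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(ledger: list, payload: dict) -> None:
    """Append a record that is linked to the current tail of the ledger."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    ledger.append({
        "payload": payload,
        "prev": prev_hash,
        "hash": record_hash(payload, prev_hash),
    })


def is_valid(ledger: list) -> bool:
    """Recompute every hash and check each link back to its predecessor."""
    prev_hash = "0" * 64
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != record_hash(entry["payload"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical integration events from two data sources.
ledger = []
append_record(ledger, {"source": "clinic_a", "rows": 120})
append_record(ledger, {"source": "bank_b", "rows": 45})
print(is_valid(ledger))  # True

# Editing an earlier record breaks its stored hash, so validation fails.
ledger[0]["payload"]["rows"] = 999
print(is_valid(ledger))  # False
```

Because every record's hash depends on its predecessor, changing one entry invalidates the chain from that point on. Recomputing hashes for every record is also a small-scale hint at the transaction-speed and scalability concerns noted above.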

2.7 Gaps in the Literature

Despite significant advancements in big data integration, there remain several unresolved issues that need further research and development. One critical gap is the lack of effective solutions for real-time data integration. As the velocity of data generation continues to increase, many current integration tools and frameworks struggle to process and merge data in real time, particularly when handling large, heterogeneous datasets from diverse sources. While distributed computing frameworks such as Hadoop and Spark have improved the scalability of data integration, they often prioritize batch processing over real-time integration (Mahmud et al., 2020; Taştan & Gökozan, 2019). This gap is particularly evident in industries like finance and telecommunications, where real-time data integration is essential for timely decision-making. Although some advancements have been made in the use of stream processing systems like Apache Flink, these technologies are still in the early stages and require further development to handle the complexity and speed of modern big data environments.

Another gap in the literature concerns the integration of cross-domain data, where information from multiple and diverse industries or fields must be combined. Most existing big data integration tools are designed to work within specific domains and struggle to adapt when datasets from unrelated fields need to be merged (Fatema et al., 2020). Cross-domain data integration presents significant challenges due to the variations in data formats, terminologies, and structures, as well as the need for more advanced semantic matching techniques to ensure accurate data mapping. Ontology-based approaches offer some promise in this area, as they help standardize the semantics of data from different domains (Taştan & Gökozan, 2019), but they often require significant manual effort to construct and maintain, limiting their scalability and applicability in rapidly changing data environments (Sun & Scanlon, 2019). Further research is needed to develop more automated and flexible solutions for integrating cross-domain data effectively.

In addition to real-time and cross-domain integration, there is also a gap in the integration of unstructured data. Much of the research and development around big data integration has focused on structured and semi-structured data, leaving unstructured data, such as text, images, and video, less explored (Plageras et al., 2018). Unstructured data accounts for a significant portion of the data generated today, particularly in fields like social media, healthcare, and marketing, yet existing integration frameworks struggle to handle this type of data effectively (Chen et al., 2019). Machine learning and natural language processing techniques have shown promise in extracting and integrating unstructured data, but these solutions are still in the early stages and have not yet been fully developed to meet the demands of large-scale, unstructured data integration (Aazam et al., 2018). More research is needed to improve the capabilities of big data integration frameworks to manage unstructured data and ensure that it can be seamlessly combined with structured datasets.

Finally, the gap in addressing data privacy and security concerns in big data integration is another critical area that remains underexplored. As data sources become more diverse and integration processes involve multiple stakeholders, ensuring the privacy and security of sensitive information has become increasingly challenging (Xiao-wei, 2019). While blockchain technology offers potential solutions for secure data integration by providing decentralized and tamper-proof ledgers (Diamantoulakis et al., 2015), its scalability and efficiency are still in question, particularly when handling the large volumes of data typically associated with big data environments. Additionally, existing research has not fully addressed the ethical and regulatory implications of integrating sensitive data across international borders, where varying data protection laws and regulations may complicate the integration process (Liu et al., 2018). Future research should focus on developing more robust privacy-preserving mechanisms and exploring the regulatory and ethical implications of cross-border data integration in big data systems.

3 Method

This study employs the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to systematically analyze the challenges and solutions in big data integration. The PRISMA framework provides a structured approach for conducting and reporting systematic reviews, ensuring transparency and replicability. The following steps were undertaken during the research process.
Step 1: Identification of Studies
In the first step, a comprehensive search was conducted across multiple academic databases, including Google Scholar, IEEE Xplore, and ScienceDirect, to identify relevant studies. Keywords such as “big data integration,” “data heterogeneity,” “ontology-based integration,” “machine learning for data integration,” and “distributed computing frameworks” were used to retrieve peer-reviewed journal articles, conference proceedings, and industry reports. The search process yielded an initial total of 2,500 articles from these databases. To ensure the relevance of the articles, filters for publication dates (2010-2023), language (English), and subject areas (Computer Science, Information Technology, and Data Management) were applied.

Step 2: Screening of Articles
The second step involved screening the identified articles for eligibility based on their titles and abstracts. After removing duplicates and irrelevant articles, a total of 1,200 studies remained. The abstracts were carefully reviewed to exclude papers that did not focus specifically on the integration of big data or those that dealt solely with general data management without addressing integration challenges. Studies that lacked full-text availability were also excluded at this stage. After this screening process, 500 articles were selected for further review.

Step 3: Eligibility Assessment
In the third step, the full texts of the remaining 500 articles were assessed for eligibility using predefined inclusion and exclusion criteria. To be included, studies had to discuss either the challenges or solutions of big data integration, such as semantic heterogeneity, data quality, or scalability. Studies that only provided theoretical overviews without empirical or practical insights were excluded. Additionally, papers focusing on unrelated data management topics were discarded. This process resulted in a refined selection of 150 articles deemed relevant for inclusion in the systematic review.

Step 4: Data Extraction and Synthesis
In the fourth step, data were extracted from the 150 eligible articles, focusing on key themes such as the methods used to address big data integration challenges, technological solutions like ontology-based frameworks, and the application of machine learning for automation. The articles were categorized based on the type of integration challenge they addressed, such as semantic heterogeneity, scalability, or data quality. The findings were synthesized to provide an overview of the current state of big data integration research and to identify gaps in the literature. The synthesis allowed for the comparison of different approaches and highlighted emerging trends in the field.

Step 5: Data Analysis
The final step involved a qualitative analysis of the synthesized data to identify patterns, common challenges, and innovative solutions proposed in the literature. Key metrics such as the frequency of certain integration techniques (e.g., machine learning vs. ontology-based methods) and the industries in which these techniques were applied (e.g., healthcare, finance, telecommunications) were analyzed. This step provided the basis for drawing conclusions about the effectiveness of various solutions and the areas where further research is needed.

Figure 6: PRISMA Flowchart for this study
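The selection counts reported in Steps 1 to 3 can be tabulated as a simple funnel. The short Python sketch below is a worked-arithmetic illustration rather than part of the study's method: the stage labels and the `funnel` helper are hypothetical, and only the four counts (2,500, 1,200, 500, and 150) are taken from the text.

```python
# Counts reported in Steps 1-3; the labels are paraphrased for illustration.
stages = [
    ("records identified", 2500),
    ("after removing duplicates and irrelevant records", 1200),
    ("selected for full-text review", 500),
    ("included in the review", 150),
]


def funnel(stages):
    """For each transition, report how many records were excluded and kept."""
    rows = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        rows.append({
            "stage": name,
            "remaining": after,
            "excluded": before - after,
            "retained_pct": round(100 * after / before, 1),
        })
    return rows


rows = funnel(stages)
for row in rows:
    print(row)

# Share of the initially identified records that reached the final review.
overall_pct = round(100 * stages[-1][1] / stages[0][1], 1)
print(overall_pct)
```

Under these figures, 1,300 records drop out before abstract review, 700 during screening, and 350 at full-text assessment, so 6% of the initially identified records are included.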

4 Findings

Following the comprehensive review process outlined in the methodology, a total of 150 articles were included in the final analysis to examine the predominant challenges and solutions in big data integration. Of these, 52 articles (35%) specifically addressed the challenge of semantic heterogeneity, underscoring it as one of the most significant barriers to effective big data integration. These studies highlighted the difficulties involved in reconciling different data formats, terminologies, and structures across disparate systems, particularly when integrating cross-domain or unstructured data sources. The remaining articles were focused on other key challenges, with 38 (25%) addressing data quality issues and 60 (40%) centered on scalability challenges. This distribution of research attention reveals that while semantic heterogeneity is a critical issue, scalability and data quality also present substantial challenges to achieving seamless data integration in large-scale systems.

When analyzing the solutions proposed for addressing semantic heterogeneity, 45 articles (30%) highlighted ontology-based approaches as a leading method for managing semantic inconsistencies. These studies illustrated that ontology frameworks provide a structured, standardized vocabulary that enables seamless integration of data from diverse sources by harmonizing terminologies and data structures. Notably, 18 articles (40% of the ontology-related studies) pointed out scalability issues with ontology-based solutions, especially when applied to large-scale, rapidly changing datasets. This limitation suggests that while ontologies offer a robust approach to standardizing data semantics, their practical application may be limited in environments characterized by high data velocity and large volumes of unstructured data. The research indicates a need for further development of ontology frameworks that can better handle the scale and complexity of modern big data environments.

Figure 7: Summary findings

In terms of data quality, 38 studies (25%) focused on data cleansing and validation techniques as critical solutions for big data integration. Within this subset, 17 articles (45%) discussed the use of machine learning (ML) as a key enabler of automating data quality processes, such as schema matching, error

detection, and data correction. The reliance on ML-driven methods is reflective of the growing trend toward using artificial intelligence (AI) to manage the complexity and volume of data in integration workflows. However, 8 articles (20%) within this group raised concerns about the quality and representativeness of the training data used in ML models, which can lead to biased or incomplete results if not properly managed. These studies emphasize the importance of ensuring that ML-based data integration tools are supported by high-quality, diverse training data to achieve accurate and reliable integration outcomes.

Scalability was a central theme in 60 articles (40% of the total), with the majority (36 articles, or 60% of those discussing scalability) focusing on distributed computing frameworks such as Hadoop and Apache Spark. These frameworks have become industry standards for processing large datasets due to their ability to manage high-velocity data streams and facilitate real-time integration. However, 21 articles (35% of those discussing scalability) pointed out that while distributed computing frameworks offer significant advantages in handling large-scale data integration, they require substantial infrastructure investments and technical expertise to implement and maintain effectively. This presents a challenge for smaller organizations or those with limited IT resources, highlighting the need for more accessible and cost-effective scalability solutions. The findings suggest that while distributed frameworks have made significant strides in addressing scalability challenges, there is still room for improvement, particularly in reducing the cost and complexity of their implementation.

Cloud-based integration solutions were explored in 45 articles (30% of the total), with 36 articles (80% of those discussing cloud solutions) emphasizing the role of cloud platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud in providing scalable, real-time big data integration services. These platforms were praised for their flexibility, allowing organizations to scale their data processing capabilities as needed without significant upfront infrastructure investments. However, 11 articles (25% of those discussing cloud solutions) raised concerns about data security and privacy in cloud environments, particularly in cases where sensitive or regulated data is involved. The findings suggest that while cloud computing provides a powerful and flexible solution for big data integration, security remains a critical concern that organizations must address, especially when integrating sensitive datasets across cloud platforms. This area represents an ongoing challenge in balancing the scalability benefits of cloud computing with the need for robust data protection measures.

Finally, blockchain technology was discussed as a potential solution for secure data integration in 22 articles (15% of the total). Of these, 15 articles (70%) highlighted blockchain’s ability to provide decentralized, transparent, and immutable data integration processes, making it a promising technology for industries where data security and integrity are paramount, such as healthcare and finance. However, 11 articles (50% of the blockchain-related studies) raised concerns about the scalability of blockchain systems, particularly regarding transaction speeds and the significant computational resources required to maintain large-scale blockchain networks. These limitations suggest that while blockchain holds significant promise for enhancing the security of big data integration, its practical application at scale remains in its early stages. The research indicates a need for further development of blockchain technologies that can overcome current scalability challenges while maintaining the security and transparency benefits that make blockchain an attractive option for data integration.

In brief, the findings from this systematic review highlight the significant progress made in addressing big data integration challenges, particularly through the use of ontology-based frameworks, machine learning, distributed computing, cloud platforms, and blockchain technologies. However, the review also underscores the persistent gaps in real-time integration, cross-domain data harmonization, unstructured data integration, and ensuring privacy and security in cloud-based and blockchain-enabled integration processes. These gaps present critical areas for future research and technological development to ensure that big data integration can fully meet the demands of modern data environments.
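The article counts and percentages reported in the Findings can be cross-checked with a few lines of arithmetic. The Python sketch below is such a check, offered under two assumptions that the review itself does not state: that each subgroup percentage is computed against the denominator named in the surrounding sentence (for example, 8 of the 38 data quality studies), and that the reported values are rounded to the nearest 5%. The labels and the `nearest_five` helper are hypothetical.

```python
TOTAL = 150  # articles included in the review

# (label, count, assumed denominator, percentage reported in the text)
reported = [
    ("semantic heterogeneity", 52, TOTAL, 35),
    ("data quality", 38, TOTAL, 25),
    ("scalability", 60, TOTAL, 40),
    ("ontology-based approaches", 45, TOTAL, 30),
    ("ontology scalability issues", 18, 45, 40),
    ("ML for data quality", 17, 38, 45),
    ("training-data concerns", 8, 38, 20),
    ("distributed frameworks", 36, 60, 60),
    ("infrastructure and expertise concerns", 21, 60, 35),
    ("cloud solutions", 45, TOTAL, 30),
    ("cloud platforms emphasized", 36, 45, 80),
    ("cloud security concerns", 11, 45, 25),
    ("blockchain", 22, TOTAL, 15),
    ("blockchain benefits", 15, 22, 70),
    ("blockchain scalability concerns", 11, 22, 50),
]


def nearest_five(count, denom):
    """Return the exact percentage and its value rounded to the nearest 5."""
    exact = 100 * count / denom
    return exact, 5 * round(exact / 5)


for label, count, denom, stated in reported:
    exact, rounded = nearest_five(count, denom)
    status = "ok" if rounded == stated else "check"
    print(f"{label}: {count}/{denom} = {exact:.1f}% (reported {stated}%) {status}")

# The three challenge categories partition the 150 included articles.
print(52 + 38 + 60 == TOTAL)
```

Under these assumptions every reported percentage matches its count once rounded to the nearest 5%, although some exact shares differ by a point or two (for example, 15 of 22 is about 68%, reported as 70%).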

5 Discussion

The findings of this systematic review reveal significant progress in addressing the challenges of big data integration, while also highlighting several areas where existing solutions remain inadequate. One of the key takeaways is the central role of ontology-based frameworks in managing semantic heterogeneity, as evidenced by 45 of the reviewed studies. These findings are consistent with earlier research that has long emphasized the potential of ontologies to provide a structured and standardized vocabulary for aligning disparate data sources (Elkhoukhi et al., 2019). However, our review also points out a critical limitation: 40% of the ontology-based studies identified scalability issues, particularly when handling large, dynamic datasets. This echoes earlier concerns raised by Himeur et al. (2022) about the manual effort required to build and maintain ontologies, which remains a barrier to broader adoption in fast-evolving big data environments. Therefore, while ontology-based approaches continue to play a vital role in data integration, their scalability limitations underscore the need for more automated, flexible solutions to address large-scale, real-time data integration.

Machine learning (ML) has emerged as a promising solution for automating many aspects of big data integration, including schema matching, data cleansing, and error detection. Our findings, where 17 articles (45% of data quality-focused studies) highlighted ML-driven methods, align with previous research indicating the effectiveness of ML in improving the accuracy and efficiency of data integration workflows (Plageras et al., 2018). However, several studies in our review (20% of the data quality studies) raised concerns about the dependency on high-quality training data for ML algorithms, which can lead to biased or incomplete results if not properly managed. This limitation is consistent with earlier studies by Aazam et al. (2018), who emphasized the risk of overfitting and bias in machine learning models when working with limited or unrepresentative data. Therefore, while ML offers substantial advantages in automating big data integration processes, ongoing research is needed to ensure that these systems are adequately supported by diverse, high-quality datasets.

Scalability remains one of the most persistent challenges in big data integration, as reflected in the 60 studies (40%) in our review that emphasized this issue. Distributed computing frameworks, such as Hadoop and Spark, have become industry standards for addressing this challenge, with 36 articles highlighting their ability to process large datasets in real-time. These findings are in line with previous research by Fatema et al. (2020) and Singh and Yassine (2018), who demonstrated the effectiveness of these frameworks in scaling big data systems. However, our review also found that 35% of scalability-focused studies highlighted the high infrastructure and technical expertise required to implement and maintain distributed systems. This observation aligns with earlier studies, such as those by Mahmud et al. (2020), which pointed out that the complexity and cost of distributed computing frameworks pose barriers for smaller organizations with limited IT resources. These findings suggest that while distributed systems are critical for scaling data integration, more accessible and cost-effective solutions are needed to democratize these capabilities.

Cloud-based solutions have gained increasing attention as a scalable and flexible option for big data integration, with 30% of the studies in our review focusing on this approach. Platforms like AWS, Microsoft Azure, and Google Cloud offer on-demand resources that can be scaled to meet the volume and velocity of data integration tasks. These findings are consistent with earlier research by Dey et al. (2020) and Taştan and Gökozan (2019), which emphasized the potential of cloud computing to revolutionize big data integration by reducing the need for expensive on-premises infrastructure. However, our review also identified data privacy and security concerns in 25% of the cloud-focused studies, particularly in cases involving sensitive or regulated data. This finding is in line with earlier studies by Xiaoping et al. (2020), which pointed out that while cloud platforms offer scalability, they also introduce new risks related to data security, compliance, and privacy. As cloud-based integration solutions continue to evolve, ensuring robust security measures and regulatory compliance will remain critical concerns for organizations.

Finally, blockchain technology is an emerging solution for secure and decentralized big data integration, with 22 studies (15%) in our review exploring its potential. Blockchain’s ability to ensure data immutability, transparency, and security has been recognized as particularly useful in industries like healthcare and
finance, where data integrity is paramount. These findings align with earlier research by Sun and Scanlon (2019) and Grolinger et al. (2016), which emphasized blockchain’s role in providing secure data-sharing frameworks. However, our review also highlighted scalability concerns in 50% of the blockchain-focused studies, particularly regarding transaction speed and the significant computational resources required to maintain blockchain networks. This mirrors earlier concerns raised by Elkhoukhi et al. (2019), who pointed out that while blockchain offers promising security features, its scalability and performance issues need to be addressed before it can be widely adopted in big data environments. Therefore, while blockchain represents a promising avenue for secure data integration, further research is necessary to enhance its scalability for large-scale, real-time data systems.

6 Conclusion

While significant progress has been made in addressing the challenges of big data integration through solutions such as ontology-based frameworks, machine learning, distributed computing, cloud platforms, and blockchain technology, several unresolved issues persist. Scalability, real-time integration, data quality, and security remain critical areas that require further research and development. Ontology-based methods are effective in managing semantic heterogeneity, but their scalability limitations hinder broader applicability in dynamic environments. Machine learning offers automation benefits but is highly dependent on high-quality training data. Distributed computing frameworks provide scalable processing capabilities but require substantial infrastructure investments, limiting their accessibility for smaller organizations. Cloud-based solutions offer flexibility but raise concerns about data privacy and security, while blockchain, though promising for secure integration, faces scalability challenges. Future research should focus on addressing these gaps to fully harness the potential of big data integration in increasingly complex and diverse data environments.

References

Aazam, M., Zeadally, S., & Harras, K. A. (2018). Deploying Fog Computing in Industrial Internet of Things and Industry 4.0. IEEE Transactions on Industrial Informatics, 14(10), 4674-4682. https://doi.org/10.1109/tii.2018.2855198

Ahmed, N., Rahman, M. M., Ishrak, M. F., Joy, M. I. K., Sabuj, M. S. H., & Rahman, M. S. (2024). Comparative Performance Analysis of Transformer-Based Pre-Trained Models for Detecting Keratoconus Disease. arXiv preprint arXiv:2408.09005.

Al-Ali, A.-R., Zualkernan, I. A., Rashid, M., Gupta, R., & Alikarar, M. (2017). A smart home energy management system using IoT and big data analytics approach. IEEE Transactions on Consumer Electronics, 63(4), 426-434. https://doi.org/10.1109/tce.2017.015014

Alghamdi, A. A., Hu, G., Haider, H., Hewage, K., & Sadiq, R. (2020). Benchmarking of Water, Energy, and Carbon Flows in Academic Buildings: A Fuzzy Clustering Approach. Sustainability, 12(11), 4422. https://doi.org/10.3390/su12114422

Chen, K., Chen, H., Zhou, C., Huang, Y., Qi, X., Shen, R., Liu, F., Zuo, M., Zou, X., Wang, J., Zhang, Y., Chen, D., Chen, X., Deng, Y., & Ren, H. (2019). Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Research, 171, 115454. https://doi.org/10.1016/j.watres.2019.115454

Da Silva Lopes, M. A., Neto, A. D. D., & de Medeiros Martins, A. (2020). Parallel t-SNE Applied to Data Visualization in Smart Cities. IEEE Access, 8, 11482-11490. https://doi.org/10.1109/access.2020.2964413

Dey, M., Rana, S. P., & Dudley, S. (2020). Smart building creation in large scale HVAC environments through automated fault detection and diagnosis. Future Generation Computer Systems, 108, 950-966. https://doi.org/10.1016/j.future.2018.02.019

Diamantoulakis, P. D., Kapinas, V. M., & Karagiannidis, G. K. (2015). Big Data Analytics for Dynamic Energy Management in Smart Grids. Big Data Research, 2(3), 94-101. https://doi.org/10.1016/j.bdr.2015.03.003

Elkhoukhi, H., NaitMalek, Y., Bakhouya, M., Berouine, A., Kharbouch, A., Lachhab, F., Hanifi, M., Ouadghiri, D. E., & Essaaidi, M. (2019). A platform architecture for occupancy detection using stream processing and machine learning approaches. Concurrency and Computation: Practice and Experience, 32(17). https://doi.org/10.1002/cpe.5651

Elnour, M., Meskin, N., Khan, K. M., & Jain, R. (2021). Application of data-driven attack detection framework for secure operation in smart buildings. Sustainable Cities and Society, 69, 102816. https://doi.org/10.1016/j.scs.2021.102816

Fatema, N., Malik, H., & Iqbal, A. (2020). Big-Data Analytics Based Energy Analysis and Monitoring for Multi-storey Hospital Buildings: Case Study. In (pp. 325-343). https://doi.org/10.1007/978-981-15-1532-3_14

Grolinger, K., L'Heureux, A., Capretz, M. A. M., & Seewald, L. (2016). Energy Forecasting for Event Venues: Big Data and Prediction Accuracy. Energy and Buildings, 112, 222-233. https://doi.org/10.1016/j.enbuild.2015.12.010

Himeur, Y., Elnour, M., Fadli, F., Meskin, N., Petri, I., Rezgui, Y., Bensaali, F., & Amira, A. (2022a). AI-big data analytics for building automation and management systems: a survey, actual challenges and future perspectives. Artificial Intelligence Review, 56(6), 4929-5021. https://doi.org/10.1007/s10462-022-10286-2

Himeur, Y., Elnour, M., Fadli, F., Meskin, N., Petri, I., Rezgui, Y., Bensaali, F., & Amira, A. (2022b). Next-generation energy systems for sustainable smart cities: Roles of transfer learning. Sustainable Cities and Society, 85, 104059. https://doi.org/10.1016/j.scs.2022.104059

Hossain, M. A., Islam, S., Rahman, M. M., & Arif, N. U. M. (2024). Impact of Online Payment Systems On Customer Trust and Loyalty In E-Commerce Analyzing Security and Convenience. Academic Journal on Science, Technology, Engineering & Mathematics Education, 4(03), 1-15. https://doi.org/10.69593/ajsteme.v4i03.85

Hu, J., & Vasilakos, A. V. (2016). Energy Big Data Analytics and Security: Challenges and Opportunities. IEEE Transactions on Smart Grid, 7(5), 2423-2436. https://doi.org/10.1109/tsg.2016.2563461

Huang, S., Zuo, W., & Sohn, M. D. (2017). A Bayesian Network model for predicting cooling load of commercial buildings. Building Simulation, 11(1), 87-101. https://doi.org/10.1007/s12273-017-0382-z

Islam, S. (2024). Future Trends In SQL Databases And Big Data Analytics: Impact of Machine Learning and Artificial Intelligence. International Journal of Science and Engineering, 1(04), 47-62. https://doi.org/10.62304/ijse.v1i04.188

Islam, S., & Apu, K. U. (2024a). Decentralized Vs. Centralized Database Solutions In Blockchain: Advantages, Challenges, And Use Cases. Global Mainstream Journal of Innovation, Engineering & Emerging Technology, 3(4), 58-68. https://doi.org/10.62304/jieet.v3i04.195

Islam, S., & Apu, K. U. (2024b). Decentralized vs. Centralized Database Solutions in Blockchain: Advantages, Challenges, And Use Cases. Global Mainstream Journal of Innovation, Engineering & Emerging Technology, 3(4), 58-68. https://doi.org/10.62304/jieet.v3i04.195

Jia, M., Komeily, A., Wang, Y., & Srinivasan, R. S. (2019). Adopting Internet of Things for the development of smart buildings: A review of enabling technologies and applications. Automation in Construction, 101, 111-126. https://doi.org/10.1016/j.autcon.2019.01.023

Jim, M. M. I., Hasan, M., Sultana, R., & Rahman, M. M. (2024). Machine Learning Techniques for Automated Query Optimization in Relational Databases. International Journal of Advanced Engineering Technologies and Innovations, 1(3), 514-529.

Liu, G., Yang, J., Hao, Y., & Zhang, Y. (2018). Big data-informed energy efficiency assessment of China industry sectors based on K-means clustering. Journal of Cleaner Production, 183, 304-314. https://doi.org/10.1016/j.jclepro.2018.02.129

Liu, Z., Chi, Z., Osmani, M., & Demian, P. (2021). Blockchain and Building Information Management (BIM) for Sustainable Building Development within the Context of Smart Cities. Sustainability, 13(4), 2090. https://doi.org/10.3390/su13042090

Mahmud, M. S., Huang, J. Z., Salloum, S., Emara, T. Z., & Sadatdiynov, K. (2020). A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining and Analytics, 3(2), 85-101. https://doi.org/10.26599/bdma.2019.9020015

Md Abdur, R., Md Majadul Islam, J., Rahman, M. M., & Tariquzzaman, M. (2024). AI-Powered Predictive Analytics for Intellectual Property Risk Management In Supply Chain Operations: A Big Data Approach. International Journal of Science and Engineering, 1(04), 32-46. https://doi.org/10.62304/ijse.v1i04.184

Nahar, J., Rahaman, M. A., Alauddin, M., & Rozony, F. Z. (2024). Big Data in Credit Risk Management: A Systematic Review Of Transformative Practices And Future Directions. International Journal of Management Information Systems and Data Science, 1(04), 68-79. https://doi.org/10.62304/ijmisds.v1i04.196

Doi: 10.69593/ajbais.v4i04.111

Plageras, A. P., Psannis, K. E., Stergiou, C., Wang, H., & Gupta, B. B. (2018). Efficient IoT-based sensor BIG Data collection–processing and analysis in smart buildings. Future Generation Computer Systems, 82, 349-357. https://doi.org/10.1016/j.future.2017.09.082

Roccetti, M., Delnevo, G., Casini, L., & Cappiello, G. (2019). Is bigger always better? A controversial journey to the center of machine learning design, with uses and misuses of big data for predicting water meter failures. Journal of Big Data, 6(1), 1-23. https://doi.org/10.1186/s40537-019-0235-y

Sayed, A. N., Himeur, Y., & Bensaali, F. (2022). Deep and transfer learning for building occupancy detection: A review and comparative analysis. Engineering Applications of Artificial Intelligence, 115, 105254. https://doi.org/10.1016/j.engappai.2022.105254

Shamim, M. I. (2022). Exploring the success factors of project management. American Journal of Economics and Business Management, 5(7), 64-72.

Shamim, M. (2022). The Digital Leadership on Project Management in the Emerging Digital Era. Global Mainstream Journal of Business, Economics, Development & Project Management, 1(1), 1-14.

Singh, S., & Yassine, A. (2018). Big Data Mining of Energy Time Series for Behavioral Analytics and Energy Consumption Forecasting. Energies, 11(2), 452. https://doi.org/10.3390/en11020452

Smolak, K., Kasieczka, B., Fiałkiewicz, W., Rohm, W., Sila-Nowicka, K., & Kopańczyk, K. (2020). Applying human mobility and water consumption data for short-term water demand forecasting using classical and machine learning models. Urban Water Journal, 17(1), 32-42. https://doi.org/10.1080/1573062x.2020.1734947

Su, B., & Wang, S. (2020). An agent-based distributed real-time optimal control strategy for building HVAC systems for applications in the context of future IoT-based smart sensor networks. Applied Energy, 274, 115322. https://doi.org/10.1016/j.apenergy.2020.115322

Sun, A. Y., & Scanlon, B. R. (2019). How can Big Data and machine learning benefit environment and water management: a survey of methods, applications, and future directions. Environmental Research Letters, 14(7), 073001. https://doi.org/10.1088/1748-9326/ab1b7d

Taştan, M., & Gökozan, H. (2019). Real-Time Monitoring of Indoor Air Quality with Internet of Things-Based E-Nose. Applied Sciences, 9(16), 3435. https://doi.org/10.3390/app9163435

Varlamis, I., Sardianos, C., Chronis, C., Dimitrakopoulos, G., Himeur, Y., Alsalemi, A., Bensaali, F., & Amira, A. (2022). Using big data and federated learning for generating energy efficiency recommendations. International Journal of Data Science and Analytics, 16(3), 353-369. https://doi.org/10.1007/s41060-022-00331-2

Wang, J., & Chen, Y. (2021). Adaboost-based Integration Framework Coupled Two-stage Feature Extraction with Deep Learning for Multivariate Exchange Rate Prediction. Neural Processing Letters, 53(6), 4613-4637. https://doi.org/10.1007/s11063-021-10616-5

Xiao-wei, X. (2019). Study on the intelligent system of sports culture centers by combining machine learning with big data. Personal and Ubiquitous Computing, 24(1), 151-163. https://doi.org/10.1007/s00779-019-01307-z

Xiaoping, Z., Zheng, Z., Peng, W., Song, J., & Kong, Z. (2020). A Hybrid Edge-Cloud Computing Method for Short-Term Electric Load Forecasting Based on Smart Metering Terminal. 2020 IEEE 4th Conference on Energy Internet and Energy System Integration (EI2), 42, 3101-3105. https://doi.org/10.1109/ei250167.2020.9346774

Xu, C., Wang, J., Zhang, J., & Li, X. (2021). Anomaly detection of power consumption in yarn spinning using transfer learning. Computers & Industrial Engineering, 152, 107015. https://doi.org/10.1016/j.cie.2020.107015

Zhang, G., Tian, C., Li, C., Zhang, J. J., & Zuo, W. (2020). Accurate forecasting of building energy consumption via a novel ensembled deep learning method considering the cyclic feature. Energy, 201, 117531. https://doi.org/10.1016/j.energy.2020.117531

Zhao, Y., Zhang, C., Zhang, Y., Wang, Z., & Li, J. (2020). A review of data mining technologies in building energy systems: Load prediction, pattern identification, fault detection and diagnosis. Energy and Built Environment, 1(2), 149-164. https://doi.org/10.1016/j.enbenv.2019.11.003

Zhou, K., & Yang, S. (2016). Understanding household energy consumption behavior: The contribution of energy big data analytics. Renewable and Sustainable Energy Reviews, 56, 810-819. https://doi.org/10.1016/j.rser.2015.12.001