ACADEMIC JOURNAL ON BUSINESS ADMINISTRATION, INNOVATION & SUSTAINABILITY
ISSN: 2997-9552
Page: 1-18
1 Farhana Zaman Rozony, 2 Mst Nahida Aktar Aktar, 3 Md Ashrafuzzaman, 4 Ashraful Islam
1 Graduate Researcher, Master of Science in Information Management System, College of Business, Lamar University, Texas, USA
Email: [email protected]
2 Graduate Researcher, Master of Science in Information Management System, College of Business, Lamar University, Texas, USA
Email: [email protected]
3 Master in Management Information System, International American University, Los Angeles, USA
Email: [email protected]
4 Master of Science in Information Technology, Washington University of Science and Technology, Alexandria, Virginia, USA
Email: [email protected]
This systematic review explores the current challenges and emerging solutions in big data integration, focusing on key issues such as semantic heterogeneity, data quality, scalability, and security. Using the PRISMA guidelines, 150 peer-reviewed articles were analyzed to identify both established and innovative approaches to integrating data from heterogeneous sources. The findings reveal that ontology-based frameworks are widely used to address semantic inconsistencies but face limitations in scalability when handling large, dynamic datasets. Machine learning has emerged as a powerful tool for automating data quality and schema matching processes, although its effectiveness is highly dependent on the availability of high-quality training data. Distributed computing frameworks like Hadoop and Spark have become the industry standard for scalable data integration, yet their implementation requires significant infrastructure and technical expertise. Cloud-based platforms offer flexible, scalable solutions, but concerns about data privacy and security persist. Blockchain technology, while promising for secure and decentralized data integration, is still in its infancy and struggles with scalability. The review highlights significant progress in the field but underscores the need for further research to address unresolved challenges in real-time integration, cross-domain data harmonization, and the management of unstructured data.
Submitted: August 05, 2024
Accepted: September 29, 2024
Published: October 2, 2024
Corresponding Author: Farhana Zaman Rozony, Graduate Researcher, Master of Science in Information Management System, College of Business, Lamar University, Texas, USA
Email: [email protected]
Doi: 10.69593/ajbais.v4i04.111
KEYWORDS
Copyright: © 2024 Sakib A I M. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original source is cited.
Vol 04 | Issue 04 | October 2024
maintenance, and personnel, while public cloud services offer a pay-as-you-go model that can significantly reduce upfront costs (Smolak et al., 2020). Hybrid cloud databases allow organizations to strike a balance between these two models, leveraging the scalability of public clouds for burst workloads while maintaining control over core data systems on private infrastructure (Liu et al., 2018). Furthermore, hybrid cloud solutions can help businesses avoid vendor lock-in by enabling them to choose the most cost-effective cloud service providers for specific workloads. However, cost considerations must account for the complexity of managing hybrid environments, including network costs and the need for specialized expertise (Huang et al., 2017).
Despite the numerous benefits of hybrid cloud databases, several challenges remain, particularly in terms of architecture complexity, data migration, and security concerns. Studies have shown that integrating data across public and private clouds can introduce new vulnerabilities, particularly in data transmission and access control (Liu et al., 2021). Ensuring seamless interoperability between different cloud environments and on-premises systems also requires careful planning and execution (Diamantoulakis et al., 2015). Moreover, the hybrid cloud model’s dependence on internet connectivity makes it susceptible to latency and network issues, which can affect the overall performance of data analytics processes. As the demand for hybrid cloud solutions continues to grow, further research is needed to address these challenges and develop best practices for secure, efficient, and cost-effective hybrid cloud database management.
The objective of this systematic review is to critically analyze and synthesize the key challenges and solutions associated with big data integration from heterogeneous data sources. Specifically, the review aims to identify the primary technical and semantic obstacles that organizations face when attempting to integrate data from multiple, diverse systems, including structured, semi-structured, and unstructured data. Furthermore, the objective is to evaluate existing methodologies and tools, such as ontology-based frameworks, schema matching techniques, and machine learning-driven approaches, which have been proposed to address these challenges. By examining both academic literature and practical case studies, this review seeks to provide a comprehensive understanding of the most effective strategies for enhancing data consistency, accuracy, and scalability in big data environments. Ultimately, the goal is to offer actionable insights for researchers and practitioners working to improve the integration of complex and heterogeneous data in various industries.
2 Literature Review
The integration of big data from heterogeneous sources has become a critical area of focus in both academic research and industry practice. As organizations increasingly rely on diverse data sources, including structured, semi-structured, and unstructured data, the complexity of managing and integrating this information has grown exponentially. A comprehensive review of existing literature provides valuable insights into the challenges associated with big data integration, including semantic heterogeneity, data quality issues, and scalability concerns. In addition to identifying these challenges, the literature also explores a variety of solutions, such as ontology-based frameworks, data transformation techniques, and machine learning algorithms, which aim to improve the integration process. This section reviews key studies in the field, highlighting current approaches, tools, and methodologies that have been developed to address the unique demands of integrating big data from diverse sources. Through a critical examination of existing research, this review will establish the theoretical and practical foundations necessary for understanding and advancing big data integration techniques.
2.1 Big Data Integration
The evolution of big data integration has paralleled the rapid expansion of data-driven industries in recent decades. Big data, with its defining characteristics of volume, velocity, variety, and veracity, has transformed how organizations manage and analyze information (Fatema et al., 2020). The integration of heterogeneous data sources, which includes structured, semi-structured, and unstructured data, has become a central focus in fields such as healthcare, finance, and manufacturing, as organizations seek to derive actionable insights from vast and diverse datasets (Liu et al., 2018). Early approaches to data integration were relatively simple, focusing on combining structured data from relational databases. However, the increasing complexity of data sources, ranging from
social media streams to sensor data, has necessitated more advanced methods for managing, cleaning, and integrating information from disparate systems (Singh & Yassine, 2018). As the need for comprehensive data integration grew, so too did the research exploring challenges and potential solutions.
One of the earliest challenges in big data integration emerged from the semantic heterogeneity across different systems. Various data sources often utilize different formats, terminologies, and structures to represent similar or identical information, creating significant barriers to effective integration (Diamantoulakis et al., 2015). For example, disparate databases may refer to customer data differently, one by ID and another by name, leading to difficulties in mapping and merging datasets. Early research in the 2000s explored solutions through ontology-based frameworks that could standardize data semantics, allowing for more seamless integration (Hu & Vasilakos, 2016). These frameworks evolved from basic schema matching techniques to more sophisticated models that leverage artificial intelligence (AI) and machine learning (ML) to automate data transformation and integration (Jim et al., 2024; Abdur et al., 2024). Over time, the development of semantic integration tools has significantly advanced, making it easier for organizations to integrate data from heterogeneous sources.
The issue of data quality has persisted as another major challenge in big data integration. Poor data quality, characterized by inconsistencies, inaccuracies, and missing values, can undermine the utility of even the most well-integrated data systems (Ahmed et al., 2024; Islam & Apu, 2024b; Nahar et al., 2024). As big data environments expanded to include real-time data from sources such as IoT devices and social media, the potential for data quality issues grew exponentially (Ahmed et al., 2024; Hossain et al., 2024; Islam, 2024). Research in the field of data cleansing has evolved significantly, with early efforts focusing on manual data cleaning methods and progressing toward automated, ML-driven solutions that can detect and rectify errors in real-time. Advances in machine learning and AI have facilitated the development of more sophisticated error detection and correction systems that are capable of handling the scale and complexity of modern big data environments. Another critical development in big data integration is the need for scalable data processing frameworks. Traditional data integration systems, which were designed to handle smaller, static datasets, struggled to keep up with the growing volume and velocity of big data. Early solutions focused on distributed computing frameworks, such as MapReduce, which allowed for parallel processing of large datasets across multiple nodes (Dean & Ghemawat, 2008). These frameworks have since evolved into more advanced systems like Apache Hadoop and Spark, which are capable of handling real-time data streams and offering greater
scalability and flexibility. Recent research has also explored cloud-based integration solutions, which allow organizations to scale their data processing capabilities as needed, without being constrained by physical infrastructure limitations (Islam & Apu, 2024). The shift toward cloud computing has further accelerated the adoption of big data integration practices in industries that rely on real-time analytics and decision-making processes.
2.2 Big Data Integration Challenges
As big data continues to evolve, the integration of heterogeneous data sources presents significant challenges. One of the most prominent challenges is semantic heterogeneity, which arises from the variations in data formats, terminologies, and structures across different data systems (Roccetti et al., 2019). Semantic heterogeneity can occur when similar data elements are represented differently, making it difficult to align and interpret data from multiple sources accurately (Taştan & Gökozan, 2019). For instance, a customer ID in one system may be represented as a numerical value, while another system uses alphanumeric codes or full names. The variations in how data is stored and categorized lead to inconsistencies, requiring sophisticated tools and methods to standardize the information across systems (Sun & Scanlon, 2019). The evolution of ontology-based frameworks has helped address this challenge by offering a structured semantic understanding of data, which facilitates its integration from disparate sources (Plageras et al., 2018).
Data quality issues further complicate big data integration, especially in environments characterized by the influx of vast amounts of real-time and multi-format data. Poor data quality, often manifested as inaccuracies, inconsistencies, and missing data, can undermine the effectiveness of integrated datasets (Sayed et al., 2022). This problem is exacerbated in heterogeneous data environments where sources like IoT devices, social media, and traditional databases produce fragmented data that may lack validation (Fatema et al., 2020). As big data systems have evolved, data cleansing and validation techniques have become essential in ensuring that data integration processes are reliable (Liu et al., 2018). Machine learning and artificial intelligence (AI) have been increasingly employed to detect and correct data quality issues automatically, enabling the seamless integration of large, complex datasets (Diamantoulakis et al., 2015).
Scalability is another significant challenge in big data integration. The increasing volume, velocity, and variety of big data, often referred to as the "3Vs", pose severe difficulties for traditional data systems (Al-Ali et al., 2017). As data grows exponentially, the need for scalable processing frameworks becomes apparent. Early big data systems struggled to process large amounts of incoming data efficiently, especially when real-time analysis was required (Da Silva Lopes et al., 2020). However, the development of distributed systems, such as Hadoop and Spark, has offered scalable solutions by enabling parallel processing and reducing the computational burden on individual
systems (Varlamis et al., 2022). These systems have evolved to handle real-time data streams more efficiently, improving the overall integration process for big data environments (Mahmud et al., 2020). As data volume and velocity continue to grow, further advancements in distributed processing frameworks will be necessary to meet the increasing demands of modern data systems.
The lack of standardization in data protocols and formats remains a persistent challenge in big data integration. Different data sources often utilize varying formats, such as XML, JSON, and CSV, which complicates the design of integration systems that can accommodate multiple formats simultaneously (Xiao-wei, 2019). Non-standardized data structures can result in integration errors, delays, and increased costs as organizations attempt to normalize data from disparate sources. Schema matching and transformation techniques have emerged as essential solutions for aligning different data formats into a common structure, making integration more feasible (Al-Ali et al., 2017). Recent advancements in AI-driven schema matching tools have automated much of this process, enabling more efficient data transformation and reducing the need for manual intervention (Xiao-wei, 2019). These advancements have played a critical role in overcoming standardization challenges, but the complexity of integrating data from rapidly evolving sources continues to demand further innovation.
In addition to these challenges, the evolution of big data integration has also witnessed the increasing use of AI and machine learning to enhance the process. Machine learning algorithms can automate many aspects of data integration, from detecting semantic inconsistencies to addressing data quality issues (Zhao et al., 2020). Additionally, AI-driven systems can help optimize the scalability of data integration processes by predicting computational needs and allocating resources accordingly (Elkhoukhi et al., 2019). As big data systems continue to evolve, AI and machine learning will likely play an even more significant role in automating and streamlining the integration process. However, while these technologies offer promising solutions, they also present new challenges related to data privacy, ethical concerns, and the need for transparency in machine learning models (Himeur et al., 2022b).
2.3 Solutions for Big Data Integration
One of the most significant approaches to addressing the challenges of big data integration is ontology-based frameworks. Ontologies provide a structured semantic understanding of data by defining the relationships between various data elements, making it easier to integrate heterogeneous data from different sources. By creating a shared vocabulary and set of definitions, ontologies allow data systems to communicate and interpret data consistently, thereby reducing semantic heterogeneity. In practice, ontology frameworks have been applied across various industries, including healthcare and finance, to standardize terminologies and improve the accuracy of data integration. For example, the use of ontologies in biomedical research, such as the Gene Ontology (GO), has been instrumental in managing and integrating large volumes of biological data from multiple sources, ensuring semantic consistency across datasets (Xiaoping et al., 2020). The evolution of ontology-based solutions highlights their critical role in managing the complexity of big data integration, particularly when dealing with semantically diverse datasets.
Machine learning (ML) has also emerged as a powerful tool for automating the process of big data integration. ML techniques can be used to address a variety of integration challenges, including schema matching, data cleansing, and error detection, by learning patterns from the data itself (Jim et al., 2024). One of the key benefits of ML-based approaches is their ability to handle large and complex datasets that traditional rule-based systems struggle with (Abdur et al., 2024). For instance, ML algorithms can automatically detect and correct inconsistencies in data, improving the quality of integrated datasets without the need for manual intervention (Islam, 2024). However, ML-based data integration also presents certain limitations, such as the need for large amounts of labeled training data and the potential for biased or inaccurate models if the training data is not representative (Islam & Apu, 2024b). Despite these challenges, the application of machine learning continues to expand, with ongoing research focused on improving the efficiency and accuracy of ML-based data integration techniques. Distributed computing frameworks, such as Hadoop and Spark, have become indispensable in addressing the scalability challenges
A SYSTEMATIC REVIEW OF BIG DATA INTEGRATION CHALLENGES AND SOLUTIONS FOR
HETEROGENEOUS DATA SOURCES
associated with big data integration. These frameworks allow for the parallel processing of large datasets across multiple nodes, significantly increasing the computational power available for data integration tasks (Nahar et al., 2024). Hadoop, one of the earliest open-source distributed computing frameworks, popularized the MapReduce programming model, which enables the processing of vast amounts of data by distributing the workload across multiple servers. Apache Spark, a more recent development, builds on Hadoop's capabilities by offering in-memory processing, which reduces the time required for data integration tasks and supports real-time analytics. Studies have demonstrated the efficiency of these distributed frameworks in managing big data integration at scale, particularly in industries such as finance and telecommunications, where real-time processing of high-velocity data streams is crucial (Jim et al., 2024).
Another critical solution for big data integration is the development of data transformation and schema matching techniques. Schema matching is the process of identifying correspondences between the attributes of different datasets to facilitate their integration into a unified format. As data sources often use varying structures and formats, schema matching and transformation are essential for converting heterogeneous data into a common format that can be used across systems. Advances in schema matching algorithms, such as those incorporating machine learning and AI, have improved the accuracy and efficiency of this process, enabling more seamless data integration. For instance, modern tools can automatically identify and match schema elements based on learned patterns, reducing the need for manual intervention and improving the scalability of integration efforts (Ahmed et al., 2024). These techniques are vital for ensuring that data from diverse sources can be effectively integrated and used for analysis.
As big data systems continue to evolve, combining multiple solutions, such as ontology-based approaches, machine learning, distributed computing, and schema matching, has become a key strategy for overcoming the challenges of data integration. Each of these solutions addresses specific aspects of the integration process, whether it be managing semantic heterogeneity, ensuring data quality, or improving scalability. For instance, integrating ontologies with machine learning algorithms can enhance the semantic accuracy of integrated datasets while also automating error detection and correction (Hossain et al., 2024). Similarly, combining distributed computing frameworks with schema matching techniques enables the efficient integration of large-scale, heterogeneous data sources, ensuring that data can be processed and analyzed in real-time (Islam, 2024). As the complexity of data environments grows, the integration of these various solutions will be critical for managing the increasing demands of big data systems.
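To make the idea of schema matching concrete, the following is a minimal sketch of a name-based matcher. It is an illustration only and is not taken from any of the reviewed studies: it uses plain string similarity from Python's standard `difflib` module rather than a learned model, and the column names, the `normalize` and `match_schemas` helpers, and the 0.6 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a column name and drop separators, so 'Cust_ID' becomes 'custid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_schemas(source_cols, target_cols, threshold=0.6):
    """Pair each source column with its most similar target column.

    Returns {source: (target, score)}. A column whose best score falls
    below `threshold` is mapped to None so that a person can review it.
    """
    mapping = {}
    for src in source_cols:
        best_target, best_score = None, 0.0
        for tgt in target_cols:
            score = SequenceMatcher(None, normalize(src), normalize(tgt)).ratio()
            if score > best_score:
                best_target, best_score = tgt, score
        mapping[src] = (best_target, best_score) if best_score >= threshold else None
    return mapping


# Hypothetical column names from two customer tables.
crm_columns = ["Cust_ID", "Full_Name", "E-mail", "Signup_Date"]
billing_columns = ["customer_id", "name", "email", "created_at"]

result = match_schemas(crm_columns, billing_columns)
for source, match in result.items():
    print(source, "->", match)
```

In this sketch `Signup_Date` is left unmatched because its name shares too little with `created_at`. That is the kind of case where purely syntactic matching falls short and where the ontology-based or ML-driven approaches discussed above would be expected to help.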
and velocity of the data (Da Silva Lopes et al., 2020). These examples demonstrate that ontology-based approaches are more suited to environments where data standardization is crucial, while ML techniques are preferable in dynamic and high-velocity data environments where automation is necessary (Roccetti et al., 2019).
Schema matching and distributed computing frameworks are also widely used techniques for big data integration, each with distinct advantages and limitations. Schema matching focuses on aligning the structures of disparate datasets by identifying correspondences between schema elements. This is particularly useful for integrating structured data from relational databases or systems with well-defined formats. Schema matching techniques have evolved with the incorporation of AI and ML to automate much of the process, reducing the time and effort required for integration. However, schema matching alone is limited when dealing with large, unstructured, or semi-structured datasets, which are increasingly common in big data environments. Distributed computing frameworks, such as Hadoop and Spark, address this limitation by enabling the parallel processing of vast datasets across multiple nodes (Mahmud et al., 2020; Roccetti et al., 2019). These frameworks excel in handling large, diverse datasets in real-time but may require significant infrastructure investment and expertise to implement effectively.
When comparing schema matching and distributed computing frameworks in terms of performance, efficiency, and cost-effectiveness, several trade-offs emerge. Schema matching techniques are generally more efficient for smaller, structured datasets, as they focus on aligning schema elements and transforming data into a common format. These techniques are relatively cost-effective, as they do not require extensive infrastructure and can be implemented with off-the-shelf software (Dey et al., 2020; Shamim, 2022). However, they may struggle to keep up with the scale and speed of modern big data environments. Distributed computing frameworks, on the other hand, are highly scalable and capable of processing large datasets in parallel, making them ideal for real-time data integration in industries like telecommunications and finance (Himeur et al., 2022b). While these frameworks offer superior performance in large-scale data environments, they come with higher infrastructure and operational costs, as well as the need for specialized technical skills to manage distributed systems (Zhou & Yang, 2016).
2.6 Emerging Trends in Big Data Integration
The rise of artificial intelligence (AI) has significantly transformed the field of big data integration, with recent developments focused on enhancing the automation and optimization of data workflows. AI-powered data integration solutions utilize machine learning (ML) algorithms and natural language processing (NLP) techniques to automate processes like schema matching, data cleansing, and error detection. This not only reduces the need for manual intervention but also improves the accuracy and efficiency of data integration. Studies show that AI-driven tools are particularly effective in handling unstructured data and can adapt to evolving data patterns in real-time, making them ideal for dynamic data environments. However, challenges remain, such as the need for large amounts of high-quality training data and the potential for biases in AI models if the training data is insufficient or skewed. Nevertheless, the integration of AI in big data processes continues to gain traction, with ongoing research aimed at addressing these limitations and further optimizing integration workflows (Mahmud et al., 2020).
Cloud computing has also emerged as a crucial enabler of big data integration by providing scalable, flexible, and cost-effective solutions. Major cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, offer robust infrastructure for processing and integrating massive datasets. These cloud-based solutions facilitate real-time data integration by providing on-demand resources that can be scaled according to the volume and velocity of incoming data. Research has shown that cloud platforms allow organizations to handle vast amounts of heterogeneous data efficiently, without the need for expensive on-premises infrastructure. Additionally, cloud-based integration services often include built-in AI and ML capabilities, further enhancing the speed and accuracy of data integration (Dey et al., 2020). However, concerns around data privacy, security, and compliance remain significant barriers to cloud adoption for sensitive data integration tasks.
Blockchain technology represents another emerging trend in big data integration, offering secure, decentralized frameworks for managing and integrating data. Blockchain’s distributed ledger
system ensures data integrity and immutability, making it particularly useful for secure data sharing and integration across multiple, often untrusted, entities. Blockchain can enhance the traceability and transparency of data integration processes by providing verifiable transaction records that can be accessed in real time. This technology is particularly promising in industries like healthcare and finance, where data security and privacy are critical. Recent research has also explored the potential of combining blockchain with AI to create smart contracts that can automate data integration workflows while ensuring secure and transparent transactions. However, the integration of blockchain with big data systems is still in its nascent stages, with challenges related to scalability, transaction speed, and interoperability needing further exploration (Aazam et al., 2018; Shamim, 2022).
AI and blockchain technologies are increasingly being integrated into cloud-based systems to create more efficient and secure data integration environments. For instance, AI-powered cloud platforms can automate the integration of heterogeneous data sources while ensuring real-time processing, and blockchain can secure the integrity of this data as it moves across different systems. The combination of these technologies offers significant benefits, particularly in terms of improving the scalability and security of big data integration workflows. Several studies have demonstrated the potential of these combined approaches in fields such as supply chain management and finance, where secure, real-time data integration is crucial. As cloud computing continues to evolve, its integration with AI and blockchain technologies is expected to play an even more significant role in managing complex data environments.
The integration of AI, cloud computing, and blockchain technologies represents a significant shift in the way organizations handle big data. AI-driven automation improves the efficiency of data integration processes, cloud computing provides scalable infrastructure, and blockchain ensures data security and transparency. As these technologies continue to evolve, they are likely to address many of the current challenges in big data integration, including scalability, data quality, and security (Al-Ali et al., 2017). However, as with any emerging technology, these solutions come with their own set of challenges, including the need for large-scale infrastructure, data privacy concerns, and potential biases in AI models. Future research will need to focus on overcoming these barriers to fully harness the potential of AI, cloud, and blockchain technologies in big data integration.
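The integrity property attributed to blockchain in this section can be illustrated with a hash-linked log. The sketch below is a deliberately reduced assumption-laden example: it shows only how chaining SHA-256 hashes makes tampering detectable, and it omits everything that makes a real blockchain decentralized (consensus, replication, signatures). The record fields and the helper names `record_hash`, `append_record`, and `verify_chain` are hypothetical.

```python
import hashlib
import json


def record_hash(payload: dict, prev_hash: str) -> str:
    """SHA-256 over the record payload together with the previous record's hash."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(chain: list, payload: dict) -> None:
    """Append a record whose hash commits to every record before it."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    chain.append({"payload": payload, "prev": prev_hash,
                  "hash": record_hash(payload, prev_hash)})


def verify_chain(chain: list) -> bool:
    """Recompute every hash in order; an edited record breaks the chain."""
    prev_hash = "0" * 64
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != record_hash(record["payload"], prev_hash):
            return False
        prev_hash = record["hash"]
    return True


# Hypothetical integration events exchanged between two organizations.
ledger = []
append_record(ledger, {"source": "hospital_a", "rows": 1200})
append_record(ledger, {"source": "insurer_b", "rows": 950})
intact = verify_chain(ledger)

ledger[0]["payload"]["rows"] = 9999  # simulate tampering with an earlier record
tampered_ok = verify_chain(ledger)

print("before tampering:", intact)
print("after tampering:", tampered_ok)
```

Because each hash depends on the previous one, changing an old record invalidates verification from that point on, which is the basis of the traceability claim. The scalability and transaction-speed limitations noted above arise from the distributed consensus layer that this sketch leaves out.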
Figure 5: Complex Combined Boxplot and Line Chart for Big Data Integration Trends
Step 1: Identification of Studies to identify gaps in the literature. The synthesis allowed
for the comparison of different approaches and
In the first step, a comprehensive search was
conducted across multiple academic databases, including Google Scholar, IEEE Xplore, and ScienceDirect, to identify relevant studies. Keywords such as “big data integration,” “data heterogeneity,” “ontology-based integration,” “machine learning for data integration,” and “distributed computing frameworks” were used to retrieve peer-reviewed journal articles, conference proceedings, and industry reports. The search process yielded an initial total of 2,500 articles from these databases. To ensure the relevance of the articles, filters for publication dates (2010-2023), language (English), and subject areas (Computer Science, Information Technology, and Data Management) were applied.
Step 2: Screening of Articles
The second step involved screening the identified articles for eligibility based on their titles and abstracts. After removing duplicates and irrelevant articles, a total of 1,200 studies remained. The abstracts were carefully reviewed to exclude papers that did not focus specifically on the integration of big data or those that dealt solely with general data management without addressing integration challenges. Studies that lacked full-text availability were also excluded at this stage. After this screening process, 500 articles were selected for further review.
Step 3: Eligibility Assessment
In the third step, the full texts of the remaining 500 articles were assessed for eligibility using predefined inclusion and exclusion criteria. To be included, studies had to discuss either the challenges or solutions of big data integration, such as semantic heterogeneity, data quality, or scalability. Studies that only provided theoretical overviews without empirical or practical insights were excluded. Additionally, papers focusing on unrelated data management topics were discarded. This process resulted in a refined selection of 150 articles deemed relevant for inclusion in the systematic review.
Step 4: Data Extraction and Synthesis
In the fourth step, data were extracted from the 150 eligible articles, focusing on key themes such as the methods used to address big data integration challenges, technological solutions like ontology-based frameworks, and the application of machine learning for automation. The articles were categorized based on the type of integration challenge they addressed, such as semantic heterogeneity, scalability, or data quality. The findings were synthesized to provide an overview of the current state of big data integration research and highlighted emerging trends in the field.
Step 5: Data Analysis
The final step involved a qualitative analysis of the synthesized data to identify patterns, common challenges, and innovative solutions proposed in the literature. Key metrics such as the frequency of certain integration techniques (e.g., machine learning vs. ontology-based methods) and the industries in which these techniques were applied (e.g., healthcare, finance, telecommunications) were analyzed. This step provided the basis for drawing conclusions about the effectiveness of various solutions and the areas where further research is needed.
Figure 6: PRISMA Flowchart for this study
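The article counts reported in Steps 1 through 3 can be tallied to see how many records were excluded at each stage. The following is a minimal sketch rather than part of the study's own tooling: the stage labels, the `Stage` class, and the `flow_summary` helper are illustrative assumptions, while the four counts (2,500, 1,200, 500, and 150) are taken directly from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of the PRISMA flow (labels are illustrative, not from the paper)."""
    name: str
    remaining: int


# Counts as reported in Steps 1-3 of the methodology.
STAGES = [
    Stage("Identified through database search", 2500),
    Stage("After removing duplicates and irrelevant articles", 1200),
    Stage("After title and abstract screening", 500),
    Stage("After full-text eligibility assessment", 150),
]


def flow_summary(stages):
    """Return (name, remaining, excluded since previous stage, share of initial pool)."""
    initial = stages[0].remaining
    previous = initial
    rows = []
    for stage in stages:
        excluded = previous - stage.remaining
        rows.append((stage.name, stage.remaining, excluded, stage.remaining / initial))
        previous = stage.remaining
    return rows


if __name__ == "__main__":
    for name, remaining, excluded, share in flow_summary(STAGES):
        print(f"{name:<52} n={remaining:>5}  excluded={excluded:>5}  retained={share:.1%}")
```

Under these assumptions, the per-stage exclusions are simply the differences between consecutive counts, which is the same bookkeeping a PRISMA flowchart presents graphically.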
detection, and data correction. The reliance on ML-driven methods is reflective of the growing trend toward using artificial intelligence (AI) to manage the complexity and volume of data in integration workflows. However, 8 articles (20%) within this group raised concerns about the quality and representativeness of the training data used in ML models, which can lead to biased or incomplete results if not properly managed. These studies emphasize the importance of ensuring that ML-based data integration tools are supported by high-quality, diverse training data to achieve accurate and reliable integration outcomes.
Scalability was a central theme in 60 articles (40% of the total), with the majority (36 articles, or 60% of those discussing scalability) focusing on distributed computing frameworks such as Hadoop and Apache Spark. These frameworks have become industry standards for processing large datasets due to their ability to manage high-velocity data streams and facilitate real-time integration. However, 21 articles (35% of those discussing scalability) pointed out that while distributed computing frameworks offer significant advantages in handling large-scale data integration, they require substantial infrastructure investments and technical expertise to implement and maintain effectively. This presents a challenge for smaller organizations or those with limited IT resources, highlighting the need for more accessible and cost-effective scalability solutions. The findings suggest that while distributed frameworks have made significant strides in addressing scalability challenges, there is still room for improvement, particularly in reducing the cost and complexity of their implementation.
Cloud-based integration solutions were explored in 45 articles (30% of the total), with 36 articles (80% of those discussing cloud solutions) emphasizing the role of cloud platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud in providing scalable, real-time big data integration services. These platforms were praised for their flexibility, allowing organizations to scale their data processing capabilities as needed without significant upfront infrastructure investments. However, 11 articles (25% of those discussing cloud solutions) raised concerns about data security and privacy in cloud environments, particularly in cases where sensitive or regulated data is involved. The findings suggest that while cloud computing provides a powerful and flexible solution for big data integration, security remains a critical concern that organizations must address, especially when integrating sensitive datasets across cloud platforms. This area represents an ongoing challenge in balancing the scalability benefits of cloud computing with the need for robust data protection measures.
Finally, blockchain technology was discussed as a potential solution for secure data integration in 22 articles (15% of the total). Of these, 15 articles (70%) highlighted blockchain’s ability to provide decentralized, transparent, and immutable data integration processes, making it a promising technology for industries where data security and integrity are paramount, such as healthcare and finance. However, 11 articles (50% of the blockchain-related studies) raised concerns about the scalability of blockchain systems, particularly regarding transaction speeds and the significant computational resources required to maintain large-scale blockchain networks. These limitations suggest that while blockchain holds significant promise for enhancing the security of big data integration, its practical application at scale remains in its early stages. The research indicates a need for further development of blockchain technologies that can overcome current scalability challenges while maintaining the security and transparency benefits that make blockchain an attractive option for data integration.
In brief, the findings from this systematic review highlight the significant progress made in addressing big data integration challenges, particularly through the use of ontology-based frameworks, machine learning, distributed computing, cloud platforms, and blockchain technologies. However, the review also underscores the persistent gaps in real-time integration, cross-domain data harmonization, unstructured data integration, and ensuring privacy and security in cloud-based and blockchain-enabled integration processes. These gaps present critical areas for future research and technological development to ensure that big data integration can fully meet the demands of modern data environments.
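The article counts and percentages quoted for the scalability, cloud, and blockchain themes can be recomputed from the reported numbers. The sketch below is only an illustrative check, not the authors' analysis: the dictionary layout, the sub-claim labels, and the helper names `share` and `compare` are assumptions, while every count and reported percentage is copied from the findings, and the corpus size of 150 comes from the methodology.

```python
TOTAL_ARTICLES = 150  # size of the final corpus reported in the review

# theme -> (articles on theme, reported % of total,
#           [(sub-claim label, article count, reported % of theme)])
# Labels are paraphrases chosen for this sketch; the numbers are from the text.
REPORTED = {
    "scalability": (60, 40, [
        ("distributed computing frameworks", 36, 60),
        ("infrastructure and expertise concerns", 21, 35),
    ]),
    "cloud": (45, 30, [
        ("role of cloud platforms", 36, 80),
        ("security and privacy concerns", 11, 25),
    ]),
    "blockchain": (22, 15, [
        ("decentralized and immutable integration", 15, 70),
        ("scalability concerns", 11, 50),
    ]),
}


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


def compare(reported, total):
    """Return (label, computed %, reported %) for each theme and sub-claim."""
    rows = []
    for theme, (count, pct_total, subclaims) in reported.items():
        rows.append((theme, share(count, total), pct_total))
        for label, sub_count, pct_theme in subclaims:
            rows.append((f"{theme}: {label}", share(sub_count, count), pct_theme))
    return rows


if __name__ == "__main__":
    for label, computed, stated in compare(REPORTED, TOTAL_ARTICLES):
        print(f"{label:<52} computed={computed:5.1f}%  reported={stated}%")
```

Most reported figures match exactly. Two appear to be rounded: 11 of 45 cloud articles is about 24.4% (reported as 25%), and 15 of 22 blockchain articles is about 68.2% (reported as 70%).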
finance, where data integrity is paramount. These findings align with earlier research by Sun and Scanlon (2019) and Grolinger et al. (2016), which emphasized blockchain’s role in providing secure data-sharing frameworks. However, our review also highlighted scalability concerns in 50% of the blockchain-focused studies, particularly regarding transaction speed and the significant computational resources required to maintain blockchain networks. This mirrors earlier concerns raised by Elkhoukhi et al. (2019), who pointed out that while blockchain offers promising security features, its scalability and performance issues need to be addressed before it can be widely adopted in big data environments. Therefore, while blockchain represents a promising avenue for secure data integration, further research is necessary to enhance its scalability for large-scale, real-time data systems.
6 Conclusion
While significant progress has been made in addressing the challenges of big data integration through solutions such as ontology-based frameworks, machine learning, distributed computing, cloud platforms, and blockchain technology, several unresolved issues persist. Scalability, real-time integration, data quality, and security remain critical areas that require further research and development. Ontology-based methods are effective in managing semantic heterogeneity, but their scalability limitations hinder broader applicability in dynamic environments. Machine learning offers automation benefits but is highly dependent on high-quality training data. Distributed computing frameworks provide scalable processing capabilities but require substantial infrastructure investments, limiting their accessibility for smaller organizations. Cloud-based solutions offer flexibility but raise concerns about data privacy and security, while blockchain, though promising for secure integration, faces scalability challenges. Future research should focus on addressing these gaps to fully harness the potential of big data integration in increasingly complex and diverse data environments.
References
Aazam, M., Zeadally, S., & Harras, K. A. (2018). Deploying Fog Computing in Industrial Internet of Things and Industry 4.0. IEEE Transactions on Industrial Informatics, 14(10), 4674-4682. https://doi.org/10.1109/tii.2018.2855198
Ahmed, N., Rahman, M. M., Ishrak, M. F., Joy, M. I. K., Sabuj, M. S. H., & Rahman, M. S. (2024). Comparative Performance Analysis of Transformer-Based Pre-Trained Models for Detecting Keratoconus Disease. arXiv preprint arXiv:2408.09005.
Al-Ali, A.-R., Zualkernan, I. A., Rashid, M., Gupta, R., & Alikarar, M. (2017). A smart home energy management system using IoT and big data analytics approach. IEEE Transactions on Consumer Electronics, 63(4), 426-434. https://doi.org/10.1109/tce.2017.015014
Alghamdi, A. A., Hu, G., Haider, H., Hewage, K., & Sadiq, R. (2020). Benchmarking of Water, Energy, and Carbon Flows in Academic Buildings: A Fuzzy Clustering Approach. Sustainability, 12(11), 4422. https://doi.org/10.3390/su12114422
Chen, K., Chen, H., Zhou, C., Huang, Y., Qi, X., Shen, R., Liu, F., Zuo, M., Zou, X., Wang, J., Zhang, Y., Chen, D., Chen, X., Deng, Y., & Ren, H. (2019). Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water research, 171, 115454. https://doi.org/10.1016/j.watres.2019.115454
Da Silva Lopes, M. A., Neto, A. D. D., & de Medeiros Martins, A. (2020). Parallel t-SNE Applied to Data Visualization in Smart Cities. IEEE Access, 8, 11482-11490. https://doi.org/10.1109/access.2020.2964413
Dey, M., Rana, S. P., & Dudley, S. (2020). Smart building creation in large scale HVAC environments through automated fault detection and diagnosis. Future Generation Computer Systems, 108, 950-966. https://doi.org/10.1016/j.future.2018.02.019
Diamantoulakis, P. D., Kapinas, V. M., & Karagiannidis, G. K. (2015). Big Data Analytics for Dynamic Energy Management in Smart Grids. Big Data Research, 2(3), 94-101. https://doi.org/10.1016/j.bdr.2015.03.003
Elkhoukhi, H., NaitMalek, Y., Bakhouya, M., Berouine, A., Kharbouch, A., Lachhab, F., Hanifi, M., Ouadghiri, D. E., & Essaaidi, M. (2019). A platform architecture for occupancy detection using stream processing and machine learning approaches. Concurrency and Computation: Practice and Experience, 32(17). https://doi.org/10.1002/cpe.5651
Elnour, M., Meskin, N., Khan, K. M., & Jain, R. (2021). Application of data-driven attack detection framework for secure operation in smart buildings. Sustainable Cities and Society, 69, 102816. https://doi.org/10.1016/j.scs.2021.102816
Fatema, N., Malik, H., & Iqbal, A. (2020). Big-Data Analytics Based Energy Analysis and Monitoring for Multi-storey Hospital Buildings: Case Study. (pp. 325-343). https://doi.org/10.1007/978-981-15-1532-3_14
Grolinger, K., L'Heureux, A., Capretz, M. A. M., & Seewald, L. (2016). Energy Forecasting for Event Venues: Big Data and Prediction Accuracy. Energy and Buildings, 112, 222-233. https://doi.org/10.1016/j.enbuild.2015.12.010
Himeur, Y., Elnour, M., Fadli, F., Meskin, N., Petri, I., Rezgui, Y., Bensaali, F., & Amira, A. (2022a). AI-big data analytics for building automation and management systems: a survey, actual challenges and future perspectives. Artificial intelligence review, 56(6), 4929-5021. https://doi.org/10.1007/s10462-022-10286-2
Himeur, Y., Elnour, M., Fadli, F., Meskin, N., Petri, I., Rezgui, Y., Bensaali, F., & Amira, A. (2022b). Next-generation energy systems for sustainable smart cities: Roles of transfer learning. Sustainable Cities and Society, 85, 104059. https://doi.org/10.1016/j.scs.2022.104059
Hossain, M. A., Islam, S., Rahman, M. M., & Arif, N. U. M. (2024). Impact of Online Payment Systems On Customer Trust and Loyalty In E-Commerce Analyzing Security and Convenience. Academic Journal on Science, Technology, Engineering & Mathematics Education, 4(03), 1-15. https://doi.org/10.69593/ajsteme.v4i03.85
Hu, J., & Vasilakos, A. V. (2016). Energy Big Data Analytics and Security: Challenges and Opportunities. IEEE Transactions on Smart Grid, 7(5), 2423-2436. https://doi.org/10.1109/tsg.2016.2563461
Huang, S., Zuo, W., & Sohn, M. D. (2017). A Bayesian Network model for predicting cooling load of commercial buildings. Building Simulation, 11(1), 87-101. https://doi.org/10.1007/s12273-017-0382-z
Islam, S. (2024). Future Trends In SQL Databases And Big Data Analytics: Impact of Machine Learning and Artificial Intelligence. International Journal of Science and Engineering, 1(04), 47-62. https://doi.org/10.62304/ijse.v1i04.188
Islam, S., & Apu, K. U. (2024a). Decentralized Vs. Centralized Database Solutions In Blockchain: Advantages, Challenges, And Use Cases. Global Mainstream Journal of Innovation, Engineering & Emerging Technology, 3(4), 58-68. https://doi.org/10.62304/jieet.v3i04.195
Islam, S., & Apu, K. U. (2024b). Decentralized vs. Centralized Database Solutions in Blockchain: Advantages, Challenges, And Use Cases. Global Mainstream Journal of Innovation, Engineering & Emerging Technology, 3(4), 58-68. https://doi.org/10.62304/jieet.v3i04.195
Jia, M., Komeily, A., Wang, Y., & Srinivasan, R. S. (2019). Adopting Internet of Things for the development of smart buildings: A review of enabling technologies and applications. Automation in Construction, 101, 111-126. https://doi.org/10.1016/j.autcon.2019.01.023
Jim, M. M. I., Hasan, M., Sultana, R., & Rahman, M. M. (2024). Machine Learning Techniques for Automated Query Optimization in Relational Databases. International Journal of Advanced Engineering Technologies and Innovations, 1(3), 514-529.
Liu, G., Yang, J., Hao, Y., & Zhang, Y. (2018). Big data-informed energy efficiency assessment of China industry sectors based on K-means clustering. Journal of Cleaner Production, 183, 304-314. https://doi.org/10.1016/j.jclepro.2018.02.129
Liu, Z., Chi, Z., Osmani, M., & Demian, P. (2021). Blockchain and Building Information Management (BIM) for Sustainable Building Development within the Context of Smart Cities. Sustainability, 13(4), 2090. https://doi.org/10.3390/su13042090
Mahmud, M. S., Huang, J. Z., Salloum, S., Emara, T. Z., & Sadatdiynov, K. (2020). A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining and Analytics, 3(2), 85-101. https://doi.org/10.26599/bdma.2019.9020015
Md Abdur, R., Md Majadul Islam, J., Rahman, M. M., & Tariquzzaman, M. (2024). AI-Powered Predictive Analytics for Intellectual Property Risk Management In Supply Chain Operations: A Big Data Approach. International Journal of Science and Engineering, 1(04), 32-46. https://doi.org/10.62304/ijse.v1i04.184
Nahar, J., Rahaman, M. A., Alauddin, M., & Rozony, F. Z. (2024). Big Data in Credit Risk Management: A Systematic Review Of Transformative Practices And Future Directions. International Journal of Management Information Systems and Data Science, 1(04), 68-79. https://doi.org/10.62304/ijmisds.v1i04.196
Plageras, A. P., Psannis, K. E., Stergiou, C., Wang, H., & Gupta, B. B. (2018). Efficient IoT-based sensor BIG Data collection–processing and analysis in smart buildings. Future Generation Computer Systems, 82, 349-357. https://doi.org/10.1016/j.future.2017.09.082
Roccetti, M., Delnevo, G., Casini, L., & Cappiello, G. (2019). Is bigger always better? A controversial journey to the center of machine learning design, with uses and misuses of big data for predicting water meter failures. Journal of Big Data, 6(1), 1-23. https://doi.org/10.1186/s40537-019-0235-y
Sayed, A. N., Himeur, Y., & Bensaali, F. (2022). Deep and transfer learning for building occupancy detection: A review and comparative analysis. Engineering Applications of Artificial Intelligence, 115, 105254. https://doi.org/10.1016/j.engappai.2022.105254
Shamim, M. I. (2022). Exploring the success factors of project management. American Journal of Economics and Business Management, 5(7), 64-72.
Shamim, M. (2022). The Digital Leadership on Project Management in the Emerging Digital Era. Global Mainstream Journal of Business, Economics, Development & Project Management, 1(1), 1-14.
Singh, S., & Yassine, A. (2018). Big Data Mining of Energy Time Series for Behavioral Analytics and Energy Consumption Forecasting. Energies, 11(2), 452. https://doi.org/10.3390/en11020452
Smolak, K., Kasieczka, B., Fiałkiewicz, W., Rohm, W., Sila-Nowicka, K., & Kopańczyk, K. (2020). Applying human mobility and water consumption data for short-term water demand forecasting using classical and machine learning models. Urban Water Journal, 17(1), 32-42. https://doi.org/10.1080/1573062x.2020.1734947
Su, B., & Wang, S. (2020). An agent-based distributed real-time optimal control strategy for building HVAC systems for applications in the context of future IoT-based smart sensor networks. Applied Energy, 274, 115322. https://doi.org/10.1016/j.apenergy.2020.115322
Sun, A. Y., & Scanlon, B. R. (2019). How can Big Data and machine learning benefit environment and water management: a survey of methods, applications, and future directions. Environmental Research Letters, 14(7), 073001. https://doi.org/10.1088/1748-9326/ab1b7d
Taştan, M., & Gökozan, H. (2019). Real-Time Monitoring of Indoor Air Quality with Internet of Things-Based E-Nose. Applied Sciences, 9(16), 3435. https://doi.org/10.3390/app9163435
Varlamis, I., Sardianos, C., Chronis, C., Dimitrakopoulos, G., Himeur, Y., Alsalemi, A., Bensaali, F., & Amira, A. (2022). Using big data and federated learning for generating energy efficiency recommendations. International Journal of Data Science and Analytics, 16(3), 353-369. https://doi.org/10.1007/s41060-022-00331-2
Wang, J., & Chen, Y. (2021). Adaboost-based Integration Framework Coupled Two-stage Feature Extraction with Deep Learning for Multivariate Exchange Rate Prediction. Neural Processing Letters, 53(6), 4613-4637. https://doi.org/10.1007/s11063-021-10616-5
Xiao-wei, X. (2019). Study on the intelligent system of sports culture centers by combining machine learning with big data. Personal and Ubiquitous Computing, 24(1), 151-163. https://doi.org/10.1007/s00779-019-01307-z
Xiaoping, Z., Zheng, Z., Peng, W., Song, J., & Kong, Z. (2020). A Hybrid Edge-Cloud Computing Method for Short-Term Electric Load Forecasting Based on Smart Metering Terminal. 2020 IEEE 4th Conference on Energy Internet and Energy System Integration (EI2), 42, 3101-3105. https://doi.org/10.1109/ei250167.2020.9346774
Xu, C., Wang, J., Zhang, J., & Li, X. (2021). Anomaly detection of power consumption in yarn spinning using transfer learning. Computers & Industrial Engineering, 152, 107015. https://doi.org/10.1016/j.cie.2020.107015
Zhang, G., Tian, C., Li, C., Zhang, J. J., & Zuo, W. (2020). Accurate forecasting of building energy consumption via a novel ensembled deep learning method considering the cyclic feature. Energy, 201, 117531. https://doi.org/10.1016/j.energy.2020.117531
Zhao, Y., Zhang, C., Zhang, Y., Wang, Z., & Li, J. (2020). A review of data mining technologies in building energy systems: Load prediction, pattern identification, fault detection and diagnosis. Energy and Built Environment, 1(2), 149-164. https://doi.org/10.1016/j.enbenv.2019.11.003
Zhou, K., & Yang, S. (2016). Understanding household energy consumption behavior: The contribution of energy big data analytics. Renewable and Sustainable Energy Reviews, 56, 810-819. https://doi.org/10.1016/j.rser.2015.12.001