
11466 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 24, NO. 10, OCTOBER 2023

DataFITS: A Heterogeneous Data Fusion Framework for Traffic and Incident Prediction

Philipp Zißner, Paulo H. L. Rettore, Bruno P. Santos, Johannes F. Loevenich, and Roberto Rigolin F. Lopes, Member, IEEE

Abstract: This paper introduces DataFITS (Data Fusion on Intelligent Transportation System), an open-source framework that collects and fuses traffic-related data from various sources, creating a comprehensive dataset. We hypothesize that a heterogeneous data fusion framework can enhance information coverage and quality for traffic models, increasing the efficiency and reliability of Intelligent Transportation System (ITS) applications. Our hypothesis was verified through two applications that utilized traffic estimation and incident classification models. DataFITS collected four data types from seven sources over nine months and fused them in a spatiotemporal domain. Traffic estimation models used descriptive statistics and polynomial regression, while incident classification employed the k-nearest neighbors (k-NN) algorithm with Dynamic Time Warping (DTW) and the Wasserstein metric as distance measures. Results indicate that DataFITS significantly increased road coverage by 137% and improved information quality for up to 40% of all roads through data fusion. Traffic estimation achieved an R² score of 0.91 using a polynomial regression model, while incident classification achieved 90% accuracy on binary tasks (incident or non-incident) and around 80% on classifying three different types of incidents (accident, congestion, and non-incident).

Index Terms: Intelligent transportation systems, heterogeneous data fusion, traffic estimation, incident classification.

Manuscript received 7 February 2023; revised 20 April 2023; accepted 25 May 2023. Date of publication 12 June 2023; date of current version 4 October 2023. This work was supported by the Bundeswehr through Federal Office of Bundeswehr Equipment, Information Technology, and In-Service Support (BAAINBw) and Bundeswehr Technical Center for Information Technology and Electronics (WTD81). The Associate Editor for this article was T. Tettamanti. (Corresponding author: Paulo H. L. Rettore.)
Philipp Zißner and Paulo H. L. Rettore are with the Communications Systems Department, Fraunhofer FKIE, 53177 Bonn, Germany (e-mail: [email protected]; [email protected]).
Bruno P. Santos is with the Department of Computer Science, Federal University of Bahia, Salvador 40170-110, Brazil (e-mail: [email protected]).
Johannes F. Loevenich is with the Communications Systems Department, Fraunhofer FKIE, 53177 Bonn, Germany, and also with the Department of Mathematics/Computer Science, University of Osnabrück, 49074 Osnabrück, Germany (e-mail: [email protected]).
Roberto Rigolin F. Lopes is with the Secure Communications and Information (SIX), Thales Deutschland, 71254 Ditzingen, Germany (e-mail: [email protected]).
Digital Object Identifier 10.1109/TITS.2023.3281752

I. INTRODUCTION

Data availability is a critical aspect in the design of modern Intelligent Transportation Systems (ITSs), which implement models to better understand various patterns of the transportation system [1], thus improving mobility and safety for people and goods. With modern society depending heavily on efficient and reliable transportation, the importance of these systems has increased rapidly in recent years. In Germany alone, both the number of registered cars and the number of passengers carried by public transportation have shown a substantial increase, reaching their all-time highs of 48.5 million cars (2022) and 12.7 billion carried passengers (2019, before the pandemic) [2], [3]. As a result, urban areas experience an increasing number of traffic-related incidents (e.g., congestion and accidents), increasing time delays, emissions, and fuel consumption [4].

For this reason, academia and industry have driven efforts to create the next generation of transportation systems that are eco-friendly, cost-efficient, and powered by data analysis and communication technology. We hypothesize that a heterogeneous data fusion framework can enhance the coverage and quality of information serving as input for traffic models, thus increasing the efficiency and reliability of ITS applications. Therefore, we propose the Data Fusion on Intelligent Transportation System (DataFITS) framework, providing a spatiotemporal fusion of data used to train models for two ITS applications: traffic estimation and incident classification. DataFITS collects and combines real heterogeneous data (e.g., weather, traffic, incident) from various sources (e.g., open databases, map applications), preparing them by fixing errors, adapting the data structure, and finally fusing them at the exact location and point in time. Our hypothesis is verified using data characterization to quantify the benefits of combining heterogeneous data sources and the proposal of two ITS applications. The performance of the two applications ratifies the benefits of larger data coverage/quality while estimating traffic and classifying incidents. Thus, the main contributions of this investigation are:

• An open-source framework, DataFITS, for heterogeneous spatiotemporal data fusion, covering the acquisition, processing, and fusion of data, available in a public code repository.¹
• The characterization of a heterogeneous dataset combining real traffic data from two cities in Germany, collected from seven sources over nine months and provided together with the repository.
• Two traffic estimation models, one using descriptive statistics and another using polynomial regression with different parameters such as time, road type, and weather, and a comparison between single and fused datasets.
• An incident classification model trained and evaluated on heterogeneous fused data using k-nearest neighbors (k-NN), with Dynamic Time Warping (DTW) and Wasserstein as distance methods.

¹ https://github.com/prettore/DataFITS

The rest of the paper is organized as follows. Section II reviews recent literature using data fusion to design applications like traffic estimation and incident classification and

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
ZIßNER et al.: DataFITS: A HETEROGENEOUS DATA FUSION FRAMEWORK 11467

compares them against our solution. The design of DataFITS and the traffic data applications are described in Section III. Section IV evaluates the performance of our framework and the effectiveness of our traffic estimation and incident classification models using the heterogeneous fused data, verifying our hypothesis. Finally, we conclude this paper in Section V, highlighting open problems for future investigations.

II. RELATED WORK

This section reviews the literature on three main topics related to our proposed solution: (i) data collection and fusion, (ii) traffic estimation, and (iii) incident classification. Finally, we summarize and compare the literature with our proposal.

A. Data Collection and Fusion

To develop ITS applications, significant data is required from real or virtual sensors [5]. Vitor et al. [4] present a platform to collect, process, and export heterogeneous data from smart city sensors, providing different statistics and visualizations. However, their platform concentrates on securing data. Similarly, [6] proposes a smart city data platform containing information from various cities. In contrast, our framework focuses on improving the quantity and quality of the information by fusing data, and we assess the advantages of using fused data through two ITS applications.

Data fusion combines data from multiple sources, enriching spatiotemporal information [7], [8], [9], [10]. Several applications benefit from data fusion, such as emergency management [11] and path planning [12]. However, fusing heterogeneous data requires additional preprocessing to combine various data types and features [13], [14]. This investigation focuses on two applications supported through data fusion, traffic estimation and incident classification, and on the methods to achieve their goals, such as data acquisition, fusion, machine learning, correlation, and different data types.

B. Traffic Estimation

Traffic estimation is a crucial smart city application for better transportation management. This review focuses on data fusion, spatiotemporal correlation, and machine learning techniques to achieve accurate and reliable traffic estimation using historical data. The increasing availability of open databases (kept by governmental authorities) and Application Programming Interfaces (APIs) to commercial applications (Bing, Google Maps, etc.) results in a vast collection of traffic-related data, making big data an opportunity for heterogeneous data fusion [15]. The challenge is to combine stationary sensor data (e.g., traffic cameras or loop detectors) and probe vehicle information (e.g., cameras, GPS, cellular data, or vehicular sensors). Anand et al. [16] used a Kalman filter to fuse traffic flow values (from cameras) and travel time (from GPS), improving a traffic estimation approach.

Many recent traffic estimation models use Machine Learning (ML) [17], [18], [19], [20], [21], [22], [23], [24], [25]. Reference [17] proposes an auto-regressive model that uses data from a traffic simulator and adapts to events like accidents. Their results showed that estimation up to 30 minutes ahead has an error of 12%. Meanwhile, [18] employs deep learning algorithms for traffic estimation, showing an improvement in accuracy and efficiency. These approaches discuss the usage of ML to create accurate models for traffic estimation, but do not consider further methods, such as data fusion or correlation.

Some ML approaches use spatiotemporal correlation to improve traffic estimation quality. In [19], a neural network (NN)-based estimation using Graph Convolutional Network (GCN) and Gated Recurrent Unit (GRU) models is proposed with full public access. The GCN captures spatial dependencies from the road network, and the GRU detects dynamic changes in traffic data and captures temporal dependencies. Other NN-based approaches, such as [20] and [21], show similar improvements in accuracy using data correlation. Wang et al. [22] propose an open-source deep learning framework using GCN to estimate network-wide traffic multiple steps ahead in time. Zheng et al. [23] introduce another open-source solution, the Graph Multi Attention Network (GMAN), using an encoder-decoder architecture to provide long-term traffic estimation up to one hour ahead. These approaches also include correlation to improve the discussed models and offer access to their data but do not propose a solution for collecting or fusing data.

Limited literature combines data fusion, spatiotemporal correlation, and ML to estimate traffic, similar to our solution. In [26], the authors fuse traffic data from stationary and dynamic sensors, considering the spatiotemporal correlation between traffic levels of road segments. A Multiple Linear Regression (MLR) model processes the fused information to enhance traffic estimation accuracy. Unlike our solution, this approach relies solely on traffic data from sensors and does not consider different data types and sources. Zhao et al. [24] propose a general platform for spatiotemporal data fusion to enhance traffic estimation. The approach introduces a fusion method to improve accuracy by combining direct and indirect traffic-related data as input for two different ML models. The indirect traffic-related data features contain information about weather and points of interest and are used to improve the estimation quality. However, their model uses pre-existing datasets, offering no solution for data collection, and our study focuses on incident-related data, while the authors in [24] consider points of interest and weather conditions.

In [27], the authors introduce a model to estimate traffic within a small urban area in Zurich, with data acquired as part of a video measurement campaign. Their solution fuses information from loop detectors, traffic lights, and other sensors (e.g., video plus license plate recognition, thermal cameras) and trains different MLR models with this data. Finally, they evaluate the various sensors' accuracy and robustness. In contrast to our solution, they investigate the quality of a regression model using different sensor data fused to stationary data. Furthermore, their data is acquired using sensors that are not publicly available, covering only a small urban area.

Finally, [25] proposes a traffic speed prediction by integrating heterogeneous data from various sensors, including exogenous data like weather, into a hybrid spatiotemporal feature space. The main contributions are a hybrid model
using Long Short-Term Memory (LSTM) and GRU, comparing the model against other well-known classical deep learning models, showing the highest efficiency and lowest error metric. In contrast to our study, this investigation focuses on the prediction using only vehicle speed and has no open access to their solution and the data.

TABLE I. RELATED WORK COMPARISON

C. Incident Classification

Numerous ML and deep-learning models are also used for incident classification [28], [29], [30], [31]. These models improve road safety in urban areas by facilitating traffic management, warning systems, and emergency rescue operations. Other applications, such as incident detection, are proposed in [32] and [33], which provide additional traffic management enhancements, including the ability to control traffic lights from emergency vehicles.

In [28], the authors introduce a Convolutional Neural Network (CNN) model to predict traffic accidents using a state matrix with influencing traffic features. Their solution achieves high prediction accuracy, but limited training data affects CNN model quality, which could be improved by using data fusion. Park et al. [29] propose a big data approach using the Hadoop framework to combine incident-related and other traffic data. The study classifies data into groups of traffic incidents. Data fusion benefits the approach, but incorporating spatiotemporal aspects could further increase model accuracy.

In [30], the authors propose a recurrent neural network to predict traffic accident risk by combining incident data with a spatiotemporal traffic correlation. The model has high accuracy and can be used for accident prevention and integrated into traffic control systems. However, its main limitation is the consideration of only directly-related incident data. Other traffic-related features (e.g., traffic flow, weather, vehicular data, etc.) could be fused to improve accuracy. Shang et al. [31] propose a hybrid approach for automatic incident detection using random forest-recursive feature elimination and an LSTM network with Bayesian optimization. Their approach provides an accurate binary classification of incidents, outperforming other state-of-the-art solutions, but does not classify them into different types.

Other approaches use location-based social media (LBSM) to improve the detection and classification of incidents. Rettore et al. [34] propose a framework containing two data services, one to detect traffic-related events. The framework collects data from social media platforms (e.g., Twitter), which is used in a road incident detection model based on heterogeneous data fusion to provide more descriptive transportation system data. The free access to user data through Twitter's API improves the availability of incident data, an essential aspect in developing ITS solutions. Also, in [35], the authors describe a real-time traffic event detection solution using Twitter posts. Their solution is based on a text classification algorithm to identify traffic-related tweets with their location and classify the information into different classes of events.

D. Comparison & Summary

Table I summarizes the reviewed literature, categorizing it into five applications: smart city, emergency, traffic estimation, incident classification, and our solution. The second and third columns list the key aspects and the corresponding references. The remaining columns denote the following labels: Data Acquisition, Data Fusion, ML, Correlation, Stationary, Probe, and LBSM. These labels indicate whether the approach collects data, uses data fusion techniques, utilizes ML and deep-learning models, incorporates data correlation, employs stationary sensor data, uses probe vehicle data, or utilizes georeferenced social media data. Moreover, we classify the availability of the source code and data of all solutions into three categories using different colors: no, limited, or yes public access. A paper labeled with no public access does not offer access to its data or solution, unlike solutions that provide full public access to source code and data. Limited public access describes the usage of datasets that are no longer accessible or solutions that plan to offer open access in theory but currently do not fulfill this aspect.

The last row of Table I compares our investigation with the literature, highlighting the coverage and contributions of our proposed solution. Compared with most of the literature, we provide a methodology that covers four of five stages of the data cycle (acquisition, preparation, processing, use) [13], providing an open-source framework¹ and access to the collected datasets. Making the models and datasets available, or the means to acquire and process them, is crucial to enable a fair comparison between models/methodologies, which we did not find in most literature. Moreover, the DataFITS framework is designed to support multiple data types, including stationary and probe data, and can potentially incorporate additional types of information like LBSM. We perform spatiotemporal data fusion to provide enriched information used as an input for two, but not limited to, data applications showing the benefit of using fused heterogeneous data. In contrast, other approaches in the literature focus on specialized solutions that combine only a subset of the listed features in the context of ITS.
Fig. 1. The general workflow.

Fig. 2. Workflow of DataFITS.

III. THE DESIGN

This research proposes a solution including two different modules: a data fusion framework, DataFITS, and two data applications, traffic estimation and incident classification. The DataFITS design follows a three-stage workflow, as presented in Fig. 1-A. It starts by gathering data from heterogeneous transportation-related data sources using APIs and web crawlers (1). Next, all acquired data are fused geographically by mapping them to road segments and aligned temporally (2). After fusing the data, we can perform data analysis to identify and visualize specific data characteristics (e.g., traffic and incident statistics) (3). DataFITS can export data, which can then be used as input for different applications, depicted in Fig. 1-B. In this article, we use the fused data in two applications, traffic estimation and incident classification, that can benefit from fused data (see Section III-B), providing a more comprehensive perspective of the results (4).

A. Data Fusion Framework

1) Data Acquisition: Within the data acquisition, Fig. 2 (1), DataFITS collects information from different predefined data sources according to a set of user-defined parameters (e.g., geographical area and time interval). Currently, DataFITS supports multiple methods to collect traffic, incident, vehicular, and weather data. In addition, the framework parses heterogeneous information and stores it in standardized CSV files. The acquisition follows a modular application design, ensuring easy expandability of the framework functionalities and allowing the specification of additional data sources.

2) Data Preparation: The compiled data undergo an additional preparation step as illustrated in Fig. 2 (2). The key component of the preparation stage is data standardization, converting different feature names and types into a uniform representation and a set of user-customizable data mappings to deliver consistent data types. Next, the data is prepared to be mapped onto geographical locations. Leveraging OpenStreetMap (OSM), a free map database, DataFITS gathers shapefiles according to the bounding box parameter specified in the data acquisition stage, using OSMNX [40]. A shapefile stores road network information identified by the primary key (fid) for each road segment, which is used in the map-matching procedure conducted within the data processing. DataFITS can also extract the road type and speed limit from shapefiles. Finally, the collected data is converted into "trip files", a representation of the input used by the map-matching.

3) Data Processing:

a) Temporal fusion: Fig. 2 (3) displays the temporal data fusion. This process groups the complete data within an arbitrary time window aggregation (e.g., hourly, daily, or 10 minutes for the results in this paper), adapting the time interval from the collection process.

b) Spatial fusion: DataFITS leverages the map-matching technique, taking GPS points and aligning them to established coordinates under a predetermined degree of accuracy based on an underlying road network. This results in a balanced level of accuracy and associates all geo-located data with the same road network.

Among different strategies of map-matching, DataFITS integrates Fast Map Matching (FMM), an open-source tool, which provides two different algorithms for achieving optimal performance based on the given road network size [41]. To this end, FMM uses the trip files and shapefiles created in the prior stage and connects all input data points to a corresponding road network. Each data entry within the trip file contains a Linestring representing the GPS coordinates (path) of a road segment, except for incident data entries, which only contain coordinates of a start and end point. In addition to the matched points of each input entry, the algorithm returns two arrays, opath and cpath, that contain a set of road identifiers (fids) from the OSM. The first array, opath, stores the fid for each matched point, representing a list of road segments that were matched to the input data entry (data source coordinates with OSM road map). The second array, cpath, stores the fid values that create a path between all matched road segments. This is necessary for creating the road trajectories for data entries that are only represented by a start and an end point within the data.
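As a rough illustration of the temporal and spatial fusion steps described above, the following sketch buckets records into fixed time windows and assigns each record to every road segment on its matched path. It is a minimal sketch under stated assumptions, not the DataFITS implementation: the record keys (`source`, `time`, `cpath`, `speed`), the function names, and the use of a plain average to merge values are all hypothetical choices made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

EPOCH = datetime(1970, 1, 1)


def window_start(ts, window=timedelta(minutes=10)):
    """Floor a naive timestamp to the start of its aggregation window."""
    return EPOCH + ((ts - EPOCH) // window) * window


def fuse(records, window=timedelta(minutes=10)):
    """Group records by (road segment id, time window) and merge them.

    Each record is a dict with hypothetical keys:
    "source" (str), "time" (datetime), "cpath" (list of fids), "speed" (float).
    """
    buckets = defaultdict(list)
    for rec in records:
        slot = window_start(rec["time"], window)
        # Spatial step: a record contributes to every segment on its matched path.
        for fid in rec["cpath"]:
            buckets[(fid, slot)].append(rec)

    fused = {}
    for key, recs in buckets.items():
        fused[key] = {
            # Averaging is an illustrative merge rule chosen for this sketch.
            "speed": mean(r["speed"] for r in recs),
            "sources": sorted({r["source"] for r in recs}),
        }
    return fused


records = [
    {"source": "a", "time": datetime(2022, 5, 1, 8, 3), "cpath": [1, 2], "speed": 40.0},
    {"source": "b", "time": datetime(2022, 5, 1, 8, 7), "cpath": [2, 3], "speed": 60.0},
    {"source": "a", "time": datetime(2022, 5, 1, 8, 12), "cpath": [2], "speed": 30.0},
]
fused = fuse(records)
```

In this toy input, segment 2 is reported by both sources within the 08:00 window, so its fused entry combines two observations, while the 08:12 record falls into the next window and stays separate.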
This process is performed on each record of the vehicular and incident data sources, while the geo-location of the traffic data sources is only matched once for each area, as those locations are static and do not change between data acquisitions. This strategy significantly reduces the execution time and computation required for map-matching. Instead of processing all data points within each acquisition, the bulk of the data points, namely the traffic-related information, is only matched once. For our nine-month data time frame and a 10-minute acquisition interval, this amounts to a single matching procedure instead of 38,800 procedures, significantly reducing the runtime and required computation power.

Fig. 3. Design of the traffic estimation application.
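The order of magnitude of the figure quoted above can be checked with quick arithmetic. The sketch below assumes that nine months are approximated as 9 × 30 = 270 days, which is an assumption made here and not a value stated in the paper; the function name is likewise hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def acquisitions(days, interval_minutes=10):
    """Number of acquisition cycles in `days` days at a fixed interval."""
    return days * MINUTES_PER_DAY // interval_minutes


# Assumption: nine months approximated as 9 * 30 = 270 days.
per_day = acquisitions(1)
total = acquisitions(270)
print(per_day, total)
```

Under this assumption there are 144 acquisitions per day and 38,880 in total, which is close to the roughly 38,800 matching procedures reported in the text.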
The spatial data fusion process combines the fused input dataset with the map-matching output, adding the enriched information about opath, cpath, and matched GPS points. To provide the data for all given road segments, the information is rearranged by extracting all fid values from the cpath array of each data entry. The amount of computing power and memory for grouping data points is a non-linear function of the input data size. Therefore, the framework splits the data into chunks, reducing the memory requirement and allowing multi-threading to speed up the process.

4) Data Usage: The last stage, (4) in Fig. 2, describes different use cases of the fused dataset, e.g., as an input to various data applications or being characterized through different types of statistics and visualizations for spatiotemporal data analysis. For example, DataFITS provides heat maps and density plots separated by each source and different features, such as the number of observations, traffic levels, speed, and types of incidents. In the scope of temporal analysis, DataFITS provides time-series statistics for a specific time window (e.g., by the hour, day of the week, month, and season) and shows the correlation between different features. Moreover, the fused data is exported in different data structures, allowing it to be used by various data applications, such as our proposed models or other third-party tools (e.g., ArcGIS).

B. Applications

1) Traffic Estimation: The proposed traffic estimation application is organized into two phases, as shown in Fig. 3. Phase (1) prepares the data, groups it by intersecting areas, identifies similar traffic regions based on correlating traffic patterns, and performs a train-test split. A traffic region is defined as the set of connected paths (road segments) reported from a data source, represented through unique road identifiers (fids). By intersecting areas, we obtain a list of unique traffic regions and are able to measure the similarities between them. In phase (2), the prepared data is used to create and evaluate two traffic estimation models using: i) descriptive statistics (naive); and ii) polynomial regression. Each model estimates traffic values for a single area within an arbitrarily defined time interval and can also utilize data from correlating regions with similar traffic behavior. Furthermore, the process considers optional input parameters like weekday, weather, and road type to create more specific models for the given characteristics. This research mainly focuses on the regression-based model but also discusses the model based on descriptive statistics and gives a comparative evaluation of both approaches in Section IV.

a) Preprocessing: The fused data from DataFITS is cleaned, removing all incident-related information, as it is not required by the model, and grouped into traffic areas containing one or multiple road segments. Using a data aggregation over the array of road identifiers (cpath), we create a list of areas contributing traffic information to the dataset.

After the traffic area grouping, the data may contain overlapping areas, because the data fusion merges traffic areas from different sources. Those intersecting areas describe the same spatial region but with minor differences in the covered road segments. Combining them removes potential duplicated areas, resulting in a final set of unique traffic areas. The underlying function iterates through all existing areas, calculates pairwise intersections, and combines them if the overlapping road segments exceed a predefined threshold $th_{overlap}$. Finally, the initial set of fused data is re-grouped according to the new set of combined traffic areas, resulting in an input dataset for the traffic estimation models that contains the combined information for each area.

Furthermore, the design covers a procedure to add data points from other regions that show similar traffic patterns based on correlation. The goal is to increase the volume of data points in areas with insufficient training data. Therefore, by correlating the traffic patterns (traffic level/relative speed) from different regions, the highly correlated areas can be merged, increasing the training dataset and thus benefiting the accuracy of the traffic estimation. To identify such regions, we calculate data similarity based on the traffic values, aiming for a more precise representation of the traffic situation within the original area. The corresponding function implements a modified version of the Pearson Correlation and the DTW to identify correlated traffic areas with similar traffic patterns. The correlation between two time series was defined in [36] and adapted for our proposed methodology.

$$X_{i,j} = \frac{\sum_{t=1}^{L} \left(S_i(t) - \bar{S}_i\right)\left(S_j(t) - \bar{S}_j\right)}{\sqrt{\sum_{t=1}^{L-t} \left(S_i(t) - \bar{S}_i\right)^2} \cdot \sqrt{\sum_{t=1}^{L-t} \left(S_j(t) - \bar{S}_j\right)^2}} \tag{1}$$

Using Eq. (1), we calculate the respective correlation between two time series of any traffic data feature $S_i(t)$ for
two regions $i$ and $j$ at a given time $t$. We compute this value between all regions and define a correlation threshold $th_{cor}$ to identify similarity. However, this type of correlation can solely describe a linear relationship between two variables, not considering the value variation. To overcome this issue, we use DTW to measure the distance between the two series and set a threshold $th_{dtw}$ to ensure that both correlating areas have similar values. DTW measures the similarity between two time series that are not synchronized. More precisely, the algorithm can use a temporal alignment of the data pattern, resulting in a more similar comparison than using, e.g., the Euclidean distance, which compares timestamps regardless of the feature values [42]. Calculating both correlation and DTW for the traffic and speed data, we can identify a set of areas that show similar patterns and satisfy Eq. (2):

$$\left(cor_{traf} \geq th_{cor} \wedge cor_{speed} \geq th_{cor}\right) \wedge \left(dtw_{traf} \leq th_{dtw} \wedge dtw_{speed} \leq th_{dtw}\right) \tag{2}$$

Finally, the dataset is filtered according to the chosen parameters (i.e., time frame, weekday, road type, and weather) to generate the final model input data. We perform a train-test split and generate estimations for each individual area.

b) The model: Our initial model to estimate traffic values is based on descriptive statistics, with the goal of verifying whether basic statistics with low computational cost can provide accurate predictions on a set of heterogeneous fused data. Eq. (3) describes the calculation of $Y(t)$, representing an estimated traffic value for a point in time $t$. In addition to the mean of the original region at time $t$, $x(t)$, we also add the average value of all correlating regions $i \in \{1, \ldots, n_{corr}\}$, represented by the second part of the equation. More details on the descriptive statistics approach can be found in [43].

$$Y(t) = x(t) + \frac{1}{n_{corr}} \ast \sum_{i=1}^{n_{corr}} x_i(t) \tag{3}$$

The second proposed traffic estimation model is based on ML and uses polynomial regression to estimate a continuous traffic value $Y$ (e.g., traffic level or speed), as shown in Eq. (4).

$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \ldots + \theta_d x^d + \epsilon \tag{4}$$

The input feature $x$ represents all traffic or speed values that are within the training dataset for a single traffic region, matching the parameters defined by the model. A higher-order polynomial, up to a degree $d$, can represent the dependent variable [...] degree depends on the input data used to train the model and is obtained during the model creation. Then, the data is transformed into a matrix of features to represent the given input in a higher-order feature space. For example, a 2-dimensional feature space $(X_1, X_2)$ is transformed to $(1, X_1, X_2, X_1^2, X_1 \cdot X_2, X_2^2)$. Therefore, the newly created feature contains the bias value of 1, all values raised to the power for each degree $\in \{0, \ldots, d\}$, and all combinations between every pair of features. Finally, the data points within the training dataset are fitted using polynomial regression, and the model is used to create traffic value estimations.

Fig. 4. Design of the incident classification application.

2) Incident Classification: Our second proposed application classifies traffic patterns into different incident types using a modified version of the k-NN algorithm. The application has three stages as shown in Fig. 4. It starts by applying preprocessing methods to prepare the input for the classification model, including collecting incident-related traffic data, defining a time frame for all incidents, and data validation mechanisms. Secondly, an input dataset is made by combining data from incidents and non-incident traffic situations. Also, a data interpolation process fills gaps in the traffic data and a train-test split is performed. Finally, we train a k-NN-like algorithm to perform binary classification (incident and non-incident) or a multi-class classification (accident, congestion, and non-incident).

a) Preprocessing: First, all incident-related information is collected from the heterogeneous dataset, filtering the data by incident information and grouping it into unique reports with a specific duration. Because the incident-related data sources do not include information about traffic data features (e.g., traffic level and speed), this type of information is added through spatiotemporal fusion. To gather all data related to one specific incident, a method extracts all traffic areas that have
able Y . The corresponding implementation calculates a least a spatial intersection with the incident region and combines
squares regression, resulting in an estimation that minimizes them in a temporal domain. The corresponding implementation
the sum of squares between the dependent and independent uses the dataset of incidents, a list containing the unique traffic
variables. The model is configured based on the input data areas, and an overlapping threshold to calculate the intersec-
created in the preprocessing phase, matching the previously tion between the incident area and corresponding regions with
defined thresholds and the following parameters: i) Required: available traffic information.
data feature, polynomial degree, and time frame; ii) Optional: The incident duration is key information to define the right
weekday, weather, and road type. time window, which includes traffic data related to a particular
For example, the model could be configured to estimate the incident. However, usually, this duration is not included in
traffic level for a traffic area on a Monday in a 10-minute time the data sources, requiring strategies to calculate it using
interval. another data source. Therefore, we developed two approaches:
The regression is implemented, generating a polynomial static and estimated incident start time. The static approach
feature matrix of a certain degree d, where the optimal collects traffic data in a time interval of 90 or 120 minutes
before and after the reported incident start time from the data source, representing a time interval that includes data prior to and after the measurable incident effect. We chose the values of 90 and 120 minutes based on an exploratory data analysis (briefly indicated in Table IV), showing that the impact of the majority of incidents on the traffic behavior did not exceed 120 minutes, except for a few cases during congestion and disabled vehicle incidents. The second approach tries to estimate the incident start time by iterating over the traffic data to find significant changes in the traffic pattern and setting the start point based on this observation. The estimated approach aims to provide a more realistic representation of the incident start time, which is evaluated later in Section IV.

Moreover, each incident report is validated by identifying samples with high noise in the corresponding traffic data. Noise has a negative contribution to the model and adds a potential bias to its accuracy. Therefore, it is detected and removed from the input dataset using three different strategies to validate each incident report, where at least one must be satisfied: i) Comparing the absolute difference between the traffic at the start and the end of the incident time interval, checking for a noticeable difference given by a deviation of more than a predefined threshold; or ii) Calculating the standard deviation over the entire incident time interval and comparing it to a certain threshold; or iii) Iterating over the data points close to the incident start and end time and checking for a point-wise traffic variation above a defined threshold. Based on these methods, we extract incidents that reflect a clear traffic pattern showing a measurable impact of the incident on the traffic behavior and remove all other patterns that could be confused with a non-incident traffic pattern or a biased sensor report.

Lastly, to create the final dataset for the classification model, three further data processes are performed:
• Add “normal traffic data”: Adding non-incident samples to the input data is essential in the design of our incident model. We add observations similar in time, weekday, and location to get the most comparable data. Therefore, these reports can accurately identify a non-incident situation for every incident in the dataset.
• Data Interpolation: As a result of measurement errors or other problems in the data collection, there is a possibility of missing traffic values within the incident duration. Therefore, we use linear data interpolation to fill gaps in the traffic data if required.
• Train-Test-Split: Finally, the input data is split into training and testing datasets. The former is used to train the ML model, allowing it to be generalized. Furthermore, the testing dataset evaluates the classification quality. The incident cases are randomly sampled and used within the training or testing dataset.

b) The model: We use the k-NN algorithm, a well-known supervised learning approach, to solve the classification problem. To train it, each incident entry has multiple features and a label referring to a particular incident type (accident, congestion, or non-incident). The data features represent a time series with the corresponding traffic level, speed (absolute and relative to the speed limit), and road type. k-NN was trained using two different distance metrics suitable for time series data. The DTW metric is used to measure the distance between two time series in the classification model. Furthermore, we train a model using the Wasserstein metric, a function to calculate the distance between two probability distributions $\mu$ and $\nu$, defined in Eq. 5 [44].

$$W_p(\mu, \nu) := \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R} \times \mathbb{R}} d(x, y)^p \, d\gamma(x, y) \right)^{1/p} \tag{5}$$

We based our model on the k-NN implementation K Nearest Neighbors with Dynamic Time Warping,² modified and extended to support the Wasserstein metric. k-NN uses a parameter n (the number of neighbors) and the maximum warping window for the DTW, limiting the number of elements to compare and therefore reducing the execution time. Moreover, data over- and undersampling are used to reduce the under-representation of accident samples in our imbalanced input dataset. To reduce the severity of this problem, two methods are used: i) Oversampling: Adds new samples of the minority class to the training data, using the information from the already existing data points. Our implementation includes Random Oversampling and SMOTE Oversampling; ii) Undersampling: Provides the opposite by removing samples from a majority class. Our implementation includes Nearmiss Undersampling. These data sampling approaches are evaluated later in Section IV, further comparing the model quality using an imbalanced dataset.

² github.com/markdregan/K-Nearest-Neighbors-with-Dynamic-Time-Warping

Finally, the classification model is created using the parameters k (number of neighbors), warping window (comparison constraint), and metric (DTW or Wasserstein). Next, the model is trained, and each test data sample is classified using the trained model, returning a class label together with the corresponding probability. This method is implemented by calculating a distance matrix that contains the respective distance (DTW or Wasserstein) between all data samples regarding the chosen data feature. Using this matrix, our proposed algorithm can find the k closest neighbors and extract the most representative label for all test data samples.

IV. EVALUATION

This section evaluates DataFITS by quantifying the improvements in data quality and quantity and presents a data characterization analysis from a real heterogeneous dataset. Finally, the enhanced fused data is used to evaluate the traffic estimation and incident classification models.

A. The Data Fusion Framework

1) The Data: The data acquisition process started on December 1st, 2021, and covers nine months of heterogeneous data from Bonn and Cologne. The acquired dataset from Bonn contains 13,700,000 entries with a total size of 14 GB, whereas the dataset from the neighboring city Cologne has 28,700,000 entries with a total size of 31 GB. The data is structured
into four types of information acquired from seven different data providers: i) Traffic data from the commercial service HERE and the open service Open Data (OD), containing data features like speed, traffic, and GPS coordinates; ii) Incident reports from the commercial services HERE, BING and OD, containing data features like the type of incident, GPS coordinates, and additional information; iii) Vehicular probe data from the Envirocar platform, providing in-depth data about the vehicle such as speed, fuel consumption, CO₂ emissions, torque, throttle position, and more; and iv) Weather data from Meteostat, providing the weather conditions. Commercial map services like Google, HERE, and Bing are the leading traffic data providers. They offer limited or paid access per user. In contrast, projects like Open Data (OD) provide open access to data from multiple information categories. The goal is to create a collaborative data infrastructure that can be used by industry, academia, government, and civilians to design intelligent data-driven systems.

2) The Fusion: Table II emphasizes the benefits of heterogeneous data fusion by tabulating the number of roads covered by all sources and the ones that are spatiotemporally covered by multiple data sources, labeled with Overlap. Thus, the overlapped data correspond to multiple sources reporting data in the same location and time. Each source covers a number of total roads and provides a portion of unique roads to the fused dataset of Bonn and Cologne. For instance, Traffic HERE contributes a proportion of 21% to the fused data for Bonn and 27% for Cologne. In contrast, OD contributes a significantly lower amount of information to the fused dataset, especially within the Cologne data at only 3.4%.

TABLE II. COVERED ROADS BY DATA SOURCE

Regarding the incident data, a significant amount of additionally covered roads was added to the fused dataset, contributing 20-25% of new information. Furthermore, there is a substantial amount of extra information provided by Envirocar, especially in Bonn, with the probe vehicles covering areas that are not equipped with sensors and contributing a portion of 11% to the fused data. This can be explained by the fact that the users contributing to the platform also collect data in many residential areas that are usually not equipped with sensors. The amount of overlapping road segments reaches 35% for Bonn and 39.5% in Cologne, revealing the potential of information enrichment using heterogeneous data fusion. These numerical results show that we can utilize fused data to better describe the transportation system status and improve the amount of information compared to only using a single data source. For instance, Traffic HERE alone covers 684 roads in Bonn and 2940 in Cologne, while the fused data covers 1619 and 5081 roads, respectively. This is a data enrichment of 137% in Bonn and 173% in Cologne.

3) Data Characterization: DataFITS presents general traffic statistics as a function of time, day, or road type. It helps in analyzing collected and fused data. The traffic values are grouped into levels: Low (0-1), Normal (>1-4), Increased (>4-7), and Jammed (>7-10) over three types of roads: Motorway, Main Road, and Residential. Additionally, it provides the characterization of incident data from Bonn, which includes four incident types, namely Accident, Congestion, Disabled Vehicle, and Road Hazard.

a) Traffic data: Table III lists the number of data entries, in Bonn, for each traffic level on different road types, including the average values for traffic, speed, and relative speed ($\frac{speed}{speed\ limit}$). One can observe a similar distribution of traffic for different road types. The low-traffic level presents more entries, but there is a variation between the distributions related to the various road types. On main roads, nearly 97% of all data entries represent a low or normal traffic level. On motorways, the amount is 92%, a difference of 5%, distributed in the other two levels (increased and jammed). The traffic in residential areas is mainly represented in a low (94%) or increased level (5%).

TABLE III. GENERAL TRAFFIC DATA STATISTICS

Table III also lists the traffic and speed (absolute and relative). The relative speed varies significantly on the three road types, decreasing from nearly 70% in a low traffic level to 30% in a jammed condition on a main road. A similar pattern is observed on motorways with higher relative speeds at the first two traffic levels. In residential areas, the speed reaches a maximum of 54% and significantly decreases to 3% when jammed. This characterization suggests that the city of Bonn has a low traffic level most of the time (>80% of all data entries). Higher traffic levels were observed in seven percent of all data entries, with the proportion of a jammed condition representing one percent. This depicts a realistic behavior, as congestion generally emerges during rush hours or in case of specific incidents. Moreover, Table III shows a significant reduction in speed for high traffic levels (seven or more), especially in the case of residential areas. A low average speed on the residential roads is also noticed, with less than 27 km h⁻¹, representing a safe value to reduce noise, pollution, and the probability of fatal accidents.

In summary, the traffic data characterization suggests that the city demands different traffic management strategies based
on road type and location to improve its quality and safety. Due to the limited space, more spatiotemporal data characterization is provided by DataFITS on the git repository.¹

b) Incident data: Table IV lists the number of reports for each type of incident, grouped by different road types. As expected, the amount of congestion reports surpasses the number of all other incident types within the dataset because congestion is a recurring event, generally emerging due to high traffic. Furthermore, a significant reduction in incidents is shown when comparing motorways to the other two discussed road types. Besides the higher traffic on motorways in general, this observation matches the results of the previous traffic data characterization, suggesting that there is a lower speed on main roads, especially in residential areas, reducing the probability of accidents. Finally, Table IV includes the average and maximum duration regarding each group of incident reports, showing significant differences, especially for the maximum duration of congestion on a motorway reaching 730 minutes, compared to not more than 210 minutes for the other groups.

TABLE IV. GENERAL INCIDENT DATA STATISTICS

By showing traffic and incident events together, it is possible to identify the effects of a single incident on the traffic levels in the surrounding area, as shown in Fig. 5. The accident (marked by ‘X’) was reported on a motorway at 17:30, and the traffic levels on the surrounding roads over time are shown with green to red colors, from no traffic (level 0) to high traffic (level 10), respectively. Before the accident, a low traffic level around the incident location can be observed, but the traffic increases in the directly connected areas 10 minutes before the incident is reported in the respective source. This condition can be observed further, escalating to a jammed road at 17:30 with very high traffic on the connected roads. It reverts to the initial state at 17:50, indicating that the incident lasts about 30 minutes. Therefore, the accident impacts the traffic pattern of many neighboring roads, especially between 17:30 and 17:40.

Fig. 5. Incident effect on the traffic level.

DataFITS includes further data characterization by combining incidents with different weather conditions and seasons, showing a more significant number of incidents reported in worse weather conditions (e.g., rain and snow) than in normal conditions. Moreover, analyzing the different road types, most incidents occur on motorways, mainly at two certain intersections that are important parts of the transportation system. For more data characterization, please see the DataFITS repository.¹

B. Traffic Estimation

This section reports a comparative evaluation of the estimation model introduced in Section III-B.1 and the model introduced in a previous paper [43]. Both were trained with the heterogeneous fused dataset characterized in the earlier sections. To measure the performance of each model, we calculate three different performance metrics: i) Coefficient of Determination ($R^2$) represents the proportion of the variance in the true values that is explained by the estimated values. The optimal value is one, and zero is given to a model that always predicts the average of the true value $y$. ii) Mean Absolute Error (MAE) is an error metric that describes the average of the absolute errors between the real and estimated values, aiming for a value close to zero. iii) Root Mean Squared Error (RMSE) denotes the root of the Mean Squared Error (MSE), a measurement for the average squared distance between the values estimated by the model and the real values within the dataset, also having a desired value of zero. We compare the models' performances against each other, using different input parameters and a fused vs. non-fused dataset.

The proposed model estimated the traffic level for 181 different traffic areas in Bonn and was trained using a dataset of more than 7 million entries with the following thresholds: $th_{overlap} = 0.50$, $th_{cor} = 0.90$, and $th_{dtw} = 0.25$. Within the experimental setup, we used a train-test split of 60-40 and a polynomial degree of 10, reflecting the optimal model configuration based on extensive experiments using different polynomial degrees. All estimations shown here use a time frame of 24 hours. The remaining input parameters are stated within each result throughout this section.

Fig. 6 depicts four examples of traffic estimation for two different regions (A and B), for an entire day, with the time displayed in hours on the x-axis. It shows the estimated regression line (red) on the training dataset (green dots). For better visualization, we grouped the observations within the training data and showed the mean values. The blue line depicts the test data, representing real-world traffic data. Noticeably, the estimation in Fig. 6(a) shows a high precision on both traffic (left) and speed (right), suggesting that the model fits well on the prepared input dataset. The second area depicted in Fig. 6(b) shows a similar result, scoring an even higher performance and providing a very close estimation compared to the ground truth.
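The evaluation setup described above (one polynomial fit per traffic area, a 60-40 train-test split, and the $R^2$, MAE, and RMSE metrics) can be sketched in a few lines of Python. This is a minimal illustration and not the DataFITS implementation: the synthetic daily traffic curve, the function names, and the use of `numpy.polynomial.Polynomial.fit` as the least squares solver are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, 0 matches a mean predictor."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute errors."""
    return np.mean(np.abs(y_true - y_pred))


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


def fit_and_score(t, y, degree=10, train_fraction=0.6, seed=0):
    """Fit a least squares polynomial on a random training subset, score on the rest."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(t))
    n_train = int(train_fraction * len(t))
    train, test = order[:n_train], order[n_train:]
    # Polynomial.fit rescales t internally, which keeps a degree-10 fit well conditioned.
    model = Polynomial.fit(t[train], y[train], deg=degree)
    y_hat = model(t[test])
    return {
        "r2": r2_score(y[test], y_hat),
        "mae": mae(y[test], y_hat),
        "rmse": rmse(y[test], y_hat),
    }


# Hypothetical data: one day in 10-minute steps with a morning and an evening peak.
t = np.linspace(0.0, 24.0, 144, endpoint=False)
noise = np.random.default_rng(1).normal(0.0, 0.02, size=t.size)
y = (0.3
     + 0.25 * np.exp(-((t - 8.0) / 1.5) ** 2)
     + 0.30 * np.exp(-((t - 17.0) / 2.0) ** 2)
     + noise)

scores = fit_and_score(t, y)
print(scores)
```

The degree and the split ratio default to the values reported in the text (10 and 60-40); the scores printed for the synthetic curve say nothing about the results on the real dataset.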
Fig. 6. Comparison: Estimation and real data.

Furthermore, the model can be configured to differentiate road types and weather conditions. Table V shows the overall performance of the traffic estimation from all 181 areas when using different configurations. The first line at the top represents the performance measures of the model using the entire dataset without any additional filter. In total, the model achieves a high $R^2$ score of 0.84, using speed or relative speed, while also performing well when estimating the traffic value, reaching a score of 0.76. Both error measures are low for each respective data feature, represented by values below 0.10. By differentiating the road types and weather conditions, it is noticeable that the model achieves the best performance estimating traffic on main roads with no specified weather conditions, achieving $R^2$ scores of 0.91 using the speed data and 0.81 using traffic. The general performance on motorways (orange circle) is slightly lower compared to both main roads and residential areas (green up arrow), reaching an $R^2$ score of up to 0.84.

TABLE V. PERFORMANCE OF THE REGRESSION MODEL ON VARIOUS STREETS AND WEATHER CONDITIONS

The same measurements using different weather conditions are shown in the last three rows of Table V. Noticeably, the performance is significantly lower in estimating traffic in case of rain, and especially snow (indicated by the red down arrows). This is due to the low amount of traffic observations within our data during rain or snow, significantly reducing the size of the training dataset. The estimation uses many interpolated data points instead of real values, reducing the overall quality. However, the model can accurately estimate traffic features in clear weather conditions (green up arrow).

This exploratory investigation suggests the model performs well on most input data parameters. However, using separate configurations to estimate the traffic based on a specific road type achieves the best results. In contrast, creating a model based on the weather conditions rain and snow reduces the quality of the model’s estimations.

TABLE VI. PERFORMANCE OF THE REGRESSION MODEL ON VARIOUS INPUT DATASETS

TABLE VII. COMPARISON OF STATISTICAL AND REGRESSION MODEL

1) Single Vs. Fused Data: Due to the limited public access to source code and data of most of the literature, as discussed in Section II, we compared the polynomial regression approach on the non-fused datasets, containing information that was obtained from a single source, with the fused dataset in order to compile quantitative evidence of the benefits of data fusion in precision and coverage, as listed in Table VI. When estimating traffic values, the single data source HERE scores the best results, indicating that the fused dataset is biased by the poor performance of the OD dataset. However, on the estimation of speed values, the fused dataset achieves the highest performance of 0.84 for the $R^2$ and error metrics of 0.06 and 0.08. By fusing multiple datasets, we can combine their individual benefits and achieve a minor improvement in the model quality. Furthermore, the area coverage within the combined data is much higher, comparing 94 unique areas covered by HERE, 123 by OD, and 217 areas covered by the fused information, represented through 181 unique and 36 overlapping areas from both datasets. These results demonstrate that although a single data source can perform better in some cases, it is still limited in spatiotemporal coverage, limiting the model’s generalization. A similar result was achieved by comparing the statistical model using the fused vs. non-fused dataset in our previous study [43].

2) Descriptive Vs. ML-Based Model: When evaluating the use of data from correlating areas in the training dataset, we noticed no further improvement in the polynomial regression model. However, the descriptive statistics model showed a substantial improvement in the estimation quality using the correlation approach. In general, the model based on ML,
proposed in this article, achieves significantly better results compared to the descriptive statistics approach presented in [43], as shown in Table VII. On the dataset containing nine months of data, we could improve the average $R^2$ score from 0.15 to 0.68 and achieve significantly lower error metrics. However, the descriptive statistics model does not rely on a large amount of data and, therefore, may be useful in the case of small input datasets. In conclusion, we could identify that using the polynomial regression model on our entire heterogeneous fused dataset provides promising results, reaching an $R^2$ score of up to 0.91. However, a less complex and less costly model (in computation and time), such as one using descriptive statistics, can be applied when only a reduced amount of data is available.
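To make the low cost of the descriptive statistics baseline concrete, the following sketch evaluates Eq. 3 exactly as it is printed: the value of the original region at each time step plus the average over the correlating regions. The function name, the array layout (one row per correlating region), and the sample numbers are assumptions for illustration only; the original implementation in [43] may differ.

```python
import numpy as np


def descriptive_estimate(x_region, x_correlated):
    """Eq. 3 as printed: Y(t) = x(t) + (1 / n_corr) * sum_i x_i(t).

    x_region:     values of the original region, one per time step.
    x_correlated: 2-D array-like, one row per correlating region (hypothetical layout).
    """
    x_region = np.asarray(x_region, dtype=float)
    x_correlated = np.atleast_2d(np.asarray(x_correlated, dtype=float))
    n_corr = x_correlated.shape[0]
    return x_region + x_correlated.sum(axis=0) / n_corr


# Hypothetical values for three time steps and two correlating regions.
estimate = descriptive_estimate([2.0, 3.0, 4.0], [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
print(estimate)  # [4. 6. 8.]
```

The whole estimator is a single pass over the data with no fitting step, which is why it remains applicable when too few observations are available to train the regression model.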
C. Incident Classification

We compare the performance and training of our proposed incident classification model for two different approaches: a binary classification (Incident or Non-Incident) and a multi-class approach (Accident, Congestion, and Non-Incident). The classification relies on the fused dataset combining the information on traffic and incidents. Therefore, all presented results are generated on the heterogeneous fused dataset. First, we evaluate the performance of the presented data preparations, referring to i) incident validation; ii) over- and undersampling; and iii) various time intervals.

The proposed incident validation approach, which is a part of the data preparation presented earlier in Section III-B.2, provides a major accuracy improvement. Precisely, it improves the overall performance by 26%, e.g., increasing the accuracy from 0.7 on a not validated input dataset to 0.86 after the validation.

Regarding the imbalanced dataset problem, we use SMOTE Oversampling to improve the precision of classifying accident data samples from 0.35 to 0.77 while keeping high precision at all other data classes. Using SMOTE improves the precision by 15% from 0.72 to 0.83. In contrast, using the Nearmiss Undersampling method also showed a minor benefit on the precision score of accident data samples but reduced the total score from 0.72 to 0.59. Based on this investigation, we apply the incident validation approach and the SMOTE oversampling in the final configuration of the classification model.

We compare the model’s performance using the originally reported start time of each incident against the idea of estimating a more realistic start time, obtained by iterating over the time series, as presented in Section III-B.2. Furthermore, we evaluated two different time intervals of 90 or 120 minutes before and after the incident start time (original start time or estimated start time), respectively. Fig. 7 shows all performance metrics over the different time interval strategies. Generally, the traffic patterns during the 90-minute time intervals achieve the highest overall scores, with an accuracy of 0.86 for the DTW and 0.81 for the Wasserstein distance metric. Furthermore, we noticed an advantage of using the (original) reported start time, in the 90-minute time interval, compared to our estimated incident start time approach. In contrast, the larger time interval (120 minutes) shows only minor performance differences between using the original and estimated starting point. Moreover, the Wasserstein metric achieves a better performance on the larger input intervals, compared to the DTW in the case of the estimated incident start time.

Fig. 7. Comparison: Different input strategies.

In conclusion, our final model uses input data covering a 90-minute time frame before and after the originally reported incident time, as this provides the overall best results. The conducted evaluation shows that using the iterative approach to estimate a more realistic start point of an incident does not benefit the model’s accuracy. Moreover, increasing the input time interval to 120 minutes shows a minor decrease in the overall performance but benefits the model in all test setups that use the Wasserstein metric, especially when working with the estimated start time.

Finally, we compare the performance of the three-class classification model with the binary classification. Evaluating the binary classification that distinguishes between Incident and Non-Incident, we achieve an overall accuracy and an F1 score of 90%, with no differences between the number of samples in both classes, due to a balanced input dataset. Furthermore, DTW shows a minor advantage compared to using the Wasserstein metric, improving 5% on all metrics on average. The three-class model adds complexity by classifying the types of incidents (Accident, Congestion, and Non-Incident) and achieves an average accuracy of 86%, slightly lower than the binary model. When considering the performance of each class, the problem of data imbalance remains even after using data oversampling. This is reflected by the significantly lower performance in identifying the Accident data, as shown by comparing the F1 scores of 0.56 to 0.9 on other classes. To obtain a more generalized and accurate model, we proposed a completely balanced dataset, combining over- and undersampling methods, achieving better accuracy scores of 80% and an F1 score of 0.78 on all data classes.

V. CONCLUSION

In this paper, we introduce DataFITS, an open-source data fusion framework that integrates diverse data by collecting, analyzing, and fusing it. We hypothesize that heterogeneous data fusion increases data quantity and quality, thereby improving datasets for ITS applications. To verify this, we developed two ITS applications: one used polynomial regression to estimate traffic levels, while the other combined
traffic and incident data to classify events into accident, [5] A. B. Campolina, P. H. L. Rettore, M. Do Val Machado, and
congestion, or non-incidents. A. A. F. Loureiro, “On the design of vehicular virtual sensors,” in Proc.
Using real heterogeneous data from two German cities, we quantified the advantages of DataFITS by compiling a fused dataset. Our results indicate that DataFITS integrated data from multiple sources for 40% of all roads, thereby increasing the overall road coverage by 137%. In addition, the traffic estimation model, which uses polynomial regression, outperformed our previous approach based on descriptive statistics, achieving a high R² score of 0.91 and low error metrics of 0.05, and provided accurate traffic estimations using the fused dataset. Compared to using a single-source dataset, the fused dataset estimation showed minor accuracy improvements but drastically improved the spatiotemporal coverage of the estimated areas. Our incident classification model relies on the fusion of traffic and incident data, achieving a 90% binary classification accuracy rate within our evaluation. Preprocessing the data, such as removing unclear traffic patterns, improved accuracy by an average of 29%. The classification of incidents into different categories resulted in a slightly lower accuracy of 86%, with unequal performance among classes indicated by F1 scores. To mitigate this problem, we oversampled the training dataset to create a more uniform representation of the data, resulting in an 80% accuracy for each class. Collecting more accident data can also solve this problem.
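The oversampling step can be sketched as follows. This is a minimal example of plain random oversampling, which may differ from the technique used in DataFITS; the class sizes, the feature matrix, and the function name `random_oversample` are hypothetical.

```python
import numpy as np


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many samples as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    selected = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Draw the missing samples with replacement from the same class.
        extra = rng.choice(idx, size=target - count, replace=True)
        selected.append(np.concatenate([idx, extra]))
    # Shuffle so that duplicated rows are not grouped by class.
    order = rng.permutation(np.concatenate(selected))
    return features[order], labels[order]


# Hypothetical imbalanced training set: few accidents, many non-incidents.
rng = np.random.default_rng(1)
labels = np.array(["non-incident"] * 80 + ["congestion"] * 15 + ["accident"] * 5)
features = rng.normal(size=(labels.size, 4))

balanced_x, balanced_y = random_oversample(features, labels)
balanced_classes, balanced_counts = np.unique(balanced_y, return_counts=True)
print(dict(zip(balanced_classes, balanced_counts)))
```

Only the training split should be resampled; the test split keeps its original class distribution so that the reported per-class accuracy and F1 scores are not inflated by duplicated samples.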
We plan to expand the DataFITS framework by collecting and fusing more data types, improving its performance and data quality, and expanding its data analysis. We focus on data types such as social media and images, which require methods such as Natural Language Processing (NLP) and image processing. For ITS applications, we aim to use automated machine learning to explore different models and hyper-parameters and compare them with our current models. We also plan to analyze the correlation between traffic and incidents and incorporate it into the traffic estimation models. In addition, we intend to explore the use of big data in military scenarios, combining information from the civilian and military fields to support strategic operations in urban warfare. To this end, our framework can be enhanced to collect and combine different types of information (image, text) to create common operational pictures and verify/authenticate information, thereby avoiding misinformation that may influence political decisions.
IEEE Int. Conf. Mobile Data Manage. (MDM), Jun. 2019, pp. 298–303.
[22] X. Wang, X. Guan, J. Cao, N. Zhang, and H. Wu, “Forecast
network-wide traffic states for multiple steps ahead: A deep learn-
R EFERENCES ing approach considering dynamic non-local spatial correlation and
non-stationary temporal dependency,” Transp. Res. C, Emerg. Tech-
[1] L. Zhu, F. R. Yu, Y. Wang, B. Ning, and T. Tang, “Big data analytics in nol., vol. 119, Oct. 2020, Art. no. 102763. [Online]. Available:
intelligent transportation systems: A survey,” IEEE Trans. Intell. Transp. https://www.sciencedirect.com/science/article/pii/S0968090X20306756
Syst., vol. 20, no. 1, pp. 383–398, Jan. 2019. [23] C. Zheng, X. Fan, C. Wang, and J. Qi, “GMAN: A graph multi-attention
[2] Umweltbundesamt. (2022). Verkehrsinfrastruktur und network for traffic prediction,” in Proc. AAAI, vol. 34, no. 1, 2020,
fahrzeugbestand. Accessed: Dec. 12, 2022. [Online]. Available: pp. 1234–1241.
https://www.umweltbundesamt.de/daten/verkehr/verkehrsinfrastruktur- [24] B. Zhao, X. Gao, J. Liu, J. Zhao, and C. Xu, “Spatiotemporal data fusion
fahrzeugbestand in graph convolutional networks for traffic prediction,” IEEE Access,
[3] German Federal Statistical Office (Destatis). (2022). Passengers vol. 8, pp. 76632–76641, 2020.
Carried in Germany. Accessed: Jul. 12, 2022. [Online]. [25] N. Zafar, I. U. Haq, J.-U.-R. Chughtai, and O. Shafiq, “Applying hybrid
Available: https://www.destatis.de/EN/Themes/Economic-Sectors- LSTM-GRU model based on heterogeneous data sources for traffic speed
Enterprises/Transport/Passenger-Transport/Tables/passengers- prediction in urban areas,” Sensors, vol. 22, no. 9, p. 3348, Apr. 2022.
carried.html [Online]. Available: https://www.mdpi.com/1424-8220/22/9/3348
[4] G. Vítor, P. Rito, and S. Sargento, “Smart city data platform for real-time [26] Z. Shan, Y. Xia, P. Hou, and J. He, “Fusing incomplete multisensor
processing and data sharing,” in Proc. IEEE Symp. Comput. Commun. heterogeneous data to estimate urban traffic,” IEEE MultimediaMag.,
(ISCC), Sep. 2021, pp. 1–7. vol. 23, no. 3, pp. 56–63, Jul. 2016.
Paulo H. L. Rettore received the B.Sc. and M.Sc. degrees in computer science in 2009 and 2012, respectively, and the Ph.D. degree in computer science from the Federal University of Minas Gerais (UFMG) in 2019. He is currently a Scientist with Fraunhofer FKIE, Bonn, Germany. Sitting with the Communication Systems Department (KOM), he has been focused on measuring the performance bounds of tactical systems over ever-changing scenarios. His research interests include computer networks, mobile ad-hoc networks, tactical networks, software-defined networking, ubiquitous computing, the Internet of Things, intelligent transportation systems, and smart mobility.

Bruno P. Santos received the bachelor’s degree from Universidade Estadual de Santa Cruz (UESC) and the M.S. and Ph.D. degrees in computer science from Universidade Federal de Minas Gerais (UFMG). He is currently a Professor of computer science with the Federal University of Bahia (UFBA). His research interests include computer networks, distributed systems, ubiquitous computing, the Internet of Things, intelligent transportation systems, and smart mobility.

Johannes F. Loevenich received the B.Sc. degree in computer science and the B.Sc. degree in mathematics from Rheinische Friedrich-Wilhelms-Universität Bonn. He is currently pursuing the Ph.D. degree in computer science/mathematics with the Distributed Systems Department, University of Osnabrück. He is a Scientist with the Communication Systems Department (KOM), Fraunhofer FKIE, Bonn, Germany. His research interests include computer systems, computer networks, distributed systems, data science, optimization theory, artificial intelligence, and game theory.

Philipp Zißner received the B.Sc. and M.Sc. degrees in computer science from Rheinische Friedrich-Wilhelms-Universität Bonn in 2020 and 2022, respectively. He is currently a Scientist with the Communication Systems Department (KOM), Fraunhofer FKIE, Bonn, Germany. His research interests include intelligent transportation systems, smart mobility, the Internet of Things, and tactical networks.

Roberto Rigolin F. Lopes (Member, IEEE) received the B.Sc. degree in computer science from UFMT, Brazil, the M.Sc. degree in computer science from UFSCar, Brazil, and the Ph.D. degree in computer science from USP, Brazil. During his Ph.D., he also visited Twente, The Netherlands, and Ottawa, Canada. After his Ph.D., he got a post-doctoral scholarship from the European Research Consortium for Informatics and Mathematics (ERCIM) to join NTNU, Norway, for four years, and a Scientist with Fraunhofer FKIE, Germany, for six years. He is currently a Scientist with Thales Deutschland, Ditzingen, Germany. Sitting with the Secure Communications and Information Systems (SIX), he has been attacking problems in computer networks and distributed systems with a particular interest in the performance bounds of tactical systems over ever-changing communication scenarios. His academic life triggered interesting life experiences, but he has been rebuilding his own education following curiosity freely by reading books on physics, mathematics, and philosophy.
