
Journal of Transport Geography 51 (2016) 36–44


Classification of automobile and transit trips from Smartphone data: Enhancing accuracy using spatial statistics and GIS
Akram Nour a,c,⁎, Bruce Hellinga a, Jeffrey Casello a,b
a Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, O.N. N2L 3G1, Canada
b School of Planning, University of Waterloo, Waterloo, O.N. N2L 3G1, Canada
c Urban and Engineering Research Department, The Custodian of the Two Holy Mosque Institute for Hajj and Umrah Research, Umm Al-Qura University, Makkah, Saudi Arabia

Article info

Article history:
Received 25 March 2015
Received in revised form 6 November 2015
Accepted 10 November 2015
Available online xxxx

Keywords:
Machine learning
Transportation
Mode identification
GIS
Spatiotemporal
Spatial statistics
Classification
Transit
Classification

Abstract

As the practices of transportation engineering and planning evolve from “data poor” to “data rich”, methods to automate the translation of data to information become increasingly important. A major field of study is the automatic identification of travel modes from passively collected GPS data. In previous work, the authors have developed a robust modal classification system using an optimized combination of statistical inference techniques. One problem that remains very difficult is the correct identification of transit travel, particularly when the system is operating in mixed traffic. This type of operation generates a wide range of values for many travel parameters (average speed, maximum speed, and acceleration for example) which have similar characteristics to other urban modes. In this paper, we supplement the previous research to improve the identification of transit trips. The method employed evaluates the likelihood that GPS travel data belong to transit by comparing the location and pattern of zero-travel speeds (stopping) to the presence of transit stops and signalized intersections. These comparisons are done in a GIS. The consideration of the spatial attributes of GPS data vastly improves the accuracy of transit travel prediction.

© 2015 Elsevier Ltd. All rights reserved.

⁎ Corresponding author at: Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, O.N. N2L 3G1, Canada.
E-mail addresses: [email protected] (A. Nour), [email protected] (B. Hellinga), [email protected] (J. Casello).

1. Introduction

The accurate and reliable evaluation of candidate transportation infrastructure investments or policies requires detailed knowledge of disaggregate travel behavior. Traditionally, travel behavior data have been collected through the use of travel diaries in which the survey participant is required to record details of their trips. Originally, the travel diaries were paper based and/or relied on phone interview surveys; however many contemporary travel surveys now make use of GPS loggers (Casas and Arce, 1999; Wolf et al., 2001). A unique attribute of GPS is the ability to obtain comprehensive spatial and temporal (spatiotemporal) data with high accuracy and minimal (or no) burden on participants. The use of GPS loggers faces several implementation challenges including: distribution and recovery of the GPS devices to/from the survey participants; recovery of the data from the loggers; survey participants' proficiency with the logger technology; and participants' failure to use the logger on many trips.

With the advances in wireless communications and technologies, smart-phones have become extremely common and, in some areas, nearly ubiquitous. As most smart phones are equipped with GPS, an opportunity has arisen for researchers to use these devices, rather than dedicated GPS loggers, to collect travel behavior data. These smart phones overcome most of the challenges with the use of dedicated GPS loggers: there is no need to distribute or collect devices; data can be transmitted directly from the phone to a server for analysis; and the generation of the data can be passive once a smart phone app has been installed. The use of smart phones to generate GPS points has made it significantly easier to collect and analyze travel data.

As a result, recent research efforts have focused on developing inference models to identify the transportation modes on the basis of the measured GPS data. Often, these models rely on travel parameters such as instantaneous or smoothed speed, acceleration, and distance traveled. Two general methods have emerged, namely: (1) using simple rule-based models that rely primarily on thresholds associated with the measured attributes (e.g. speed and acceleration) and (2) machine learning models that are trained by using a subset of labeled data. In most of these existing studies, the proposed methods have been moderately successful in being able to accurately infer the actual transportation mode.

In our previous work (Nour et al., 2015), we were able to develop a novel method that optimizes a classification model with respect to the following: (1) the number of attributes to consider in the model based on the ability of those attributes to distinguish between the available

http://dx.doi.org/10.1016/j.jtrangeo.2015.11.005
0966-6923/© 2015 Elsevier Ltd. All rights reserved.
A. Nour et al. / Journal of Transport Geography 51 (2016) 36–44 37

transportation modes; (2) the type of data processing employed; and (3) the model parameters. This method was employed using a set of labeled GPS data to develop an optimized mode inference model (denoted as the NCH model) which was shown to perform well in classifying modes, but performs best when differentiating at an aggregate level (i.e. motorized versus non-motorized modes).

The evaluation of the performance of the NCH model showed that, similar to other models proposed in the literature, the NCH model does not perform as well at the disaggregate level (i.e. discriminating one motorized mode such as public transit (bus) from another motorized mode such as personal automobile). The problem stems from the fact that transportation modes within each aggregate category (e.g. motorized or non-motorized modes) exhibit very similar distributions of basic attributes (e.g. speed and acceleration). Consequently, in this paper we utilize spatial information from the GPS data combined with the spatial attributes of the transportation network to improve the mode inference performance. More specifically, we have developed a method that compares the locations and patterns of stationary segments – a set of consecutive points with speeds below a moving threshold – from the GPS data, to locations of transit stops and signalized intersections within a GIS platform. These comparisons prove to be very effective in identifying transit trips. Furthermore, an additional algorithm is developed to detect the actual location (time and space) at which the traveler changes from the transit mode to another mode or from another mode to transit.

The remainder of the paper is organized as follows. The subsequent section describes relevant previous work on this topic. Section three describes conceptually the approach we take to enhance the mode identification using GIS, with detailed quantitative explanations as warranted. Section four demonstrates the performance of the model by applying the proposed method to real data generated by travelers on multiple modes: walking, cycling, on-street bus transit and private automobile. We then compare the results of this integrated model with our previous mode classifier to demonstrate the improvement gained. We conclude with comments on future work in this area.

2. Literature review

Recently, GPS data loggers and GPS embedded smart phones have been utilized for collecting travel survey data at pre-set time intervals. The main motivations for these GPS enabled devices are: (1) to reduce or eliminate the burden on survey participants with respect to data input; and (2) to generate higher accuracy compared to traditional trip diaries. In order to satisfy the first motivation, it is necessary to avoid requiring the survey participant to record the trip's characteristics such as the transportation mode(s) used for each trip, and instead to infer these characteristics directly from the raw GPS data. Several studies have found that transportation modes can be identified using speed and acceleration profiles gathered by the GPS device. While this is an easy and efficient approach for classifying some modes during certain conditions, it is often not sufficient to enable a clear distinction between certain modes. For example, consider a trip made on a freeway during uncongested traffic conditions. Using only speed and acceleration data, it is a simple matter to correctly classify that the trip was made using a motorized mode rather than a non-motorized mode (e.g. walk or bike). However, now consider a trip made on a highly congested arterial. The recorded speed and acceleration data for a motorized mode and a non-motorized mode under these conditions are likely very similar, confounding the ability to correctly classify the mode.

This section reviews the most often replicated method for transportation mode inference models. We also review literature that aims to improve classifier performance. In some cases, these approaches integrate GIS; in other cases, more disparate data sources are used.

Chung and Shalaby (2005) collected data using wearable GPS loggers and written detailed trip reports. The approach they developed concentrated on determining changes in modes which they assumed to coincide with a walking segment of at least 60 seconds. Hence, identifying transitions to and from walking was a major focus of the algorithm they employed. The authors define events including end-of-walk (EOW) and start-of-walk (SOW) while controlling for losses in GPS signals. They employ feature vectors to categorize SOW and EOW points based on speed and acceleration. A fuzzy logic-based model was used to classify the mode segments. This work has become the foundation upon which many other researchers have built their classification models.

Tsui and Shalaby (2006) extended the work by Chung and Shalaby (2005) by using GIS map and transit route service information. The fundamental approach is to match a travel segment (from the GPS) with the presence of a transit route (from GIS). Quantitatively, they developed a route searching algorithm that is only activated when the resulting membership of cycling and bus from the fuzzy logic classifier exceeds a threshold. When the route searching algorithm matches at least one transit route, the segment is labeled as transit. The addition of this route searching algorithm to their original classification model improved the accuracy of the classifier from 76% to 80%. Gong et al. (2011) and Schüssler (2010) also followed Chung and Shalaby in identifying transportation mode segments. To further distinguish transit trips, these authors classified travel segments as transit trips when a segment's start and end points (origins and destinations) are sufficiently close to transit stations; the term transit station indicates any location at which a transit vehicle is scheduled to stop to board and discharge passengers. In Gong's work, the researchers were aided by the properties of the metro network – fixed entry and exit points as well as known acceleration and speed patterns. Conversely, Gong et al. faced significantly greater challenges due to the complexity of the New York City metro area they attempted to model. Most importantly, Gong et al. experienced very similar challenges to our work in classifying between bus, car and walking, due to the slow speeds for buses in congestion. They report a success rate of 53% in a smaller data set which was further complicated by the “urban canyon” effect produced by building heights in the City.

Other researchers attempted to improve their classification models' performance by acquiring additional information. For example, Stenneth et al. (2011) not only compared GPS data to the locations of transit routes and stations, but also included a temporal component – comparing transit schedules with the recorded times from the GPS. The authors report very high identification accuracy. Rasmussen et al. (2013) described a method to identify transit segments using the percentage of stops occurring at transit stations. They establish appropriate thresholds based on the characteristics of individual transit routes; as the level of service on the route increases (implying less time delayed at signalized intersections or in congestion), the percentage of stops should also increase. Other researchers have used household transportation attributes, such as automobile and bike ownership, to enhance classification accuracy. Stopher et al. (2007) developed an algorithm that only assigns car or bike as a mode for a trip if the household indicates ownership of a car or bike. Moiseeva et al. (2010) developed a system called “TraceAnnotator” that complements GPS data such as speed and acceleration with ownership information for bike, auto, and motorcycle amongst others. Moiseeva completed the mode inference using a Bayesian belief network (BBN).

These methods have demonstrated improvements in identifying transit segments. However, opportunities exist to advance these approaches. For example, a more robust model will be able to correctly classify walking or cycling segments that occur immediately adjacent to a transit line. We also believe there is a benefit in relaxing the assumption made originally by Chung and Shalaby (2005) and adopted by others, that mode transfers only occur with a walking segment having a minimum duration of 60 s. Consider the example where someone is dropped off at a transit stop and boards an arriving vehicle shortly thereafter. In this case, a very short (in time and distance) walking trip will be observed. The performance of the classification models can

be improved by developing less rigid criteria for identifying a sequence of points as a transit segment. Finally, we aim to improve the identification of the transit mode segments without the use of more difficult to access information, such as bus schedules or household information, and instead concentrate on the integration of GPS and data that are readily available through GIS. We also believe that the performance of the classification models can be better evaluated when applied to larger data sets than those used in some of these previous publications such as Chung and Shalaby (2005).

3. Methods

The method we employ here builds upon previous research (Nour et al., 2015) in which we proposed a method for solving the transportation mode inference problem in three main stages: (1) data collection and processing; (2) training, testing, and optimizing the transportation mode classifier; and (3) model evaluation. We briefly review these steps here as an introduction to the new work proposed in this paper.

3.1. Previous model development

The GPS travel data used in this research consist of location (x,y), speed (v), and time (t) automatically uploaded from participants' smart-phones to a secure server every 5 seconds using a custom software application (Taghipour and Hellinga, 2012). Survey participants were also asked to label the mode used for each trip segment (they could do this in real-time via the smart phone application or afterwards using a web-based interface). In addition, an automatic algorithm was developed to screen the data to ensure that the user reported mode transfers occurred at times and locations that were feasible and logical – i.e. not while traveling at high speed, but rather while stationary. Fig. 1 presents a hypothetical trip with velocity plotted as a function of time. The algorithm corrects suspicious transportation mode transfer points – labeled as points 2, 5, and 6 in the diagram – by shifting the mode transfer label forward or backward along the time axis to the beginning or end of the adjacent stationary segment, where two or more observations have speed less than a threshold.

In total, data were collected for 658 trips – defined as a time series of GPS points bounded by two activities – consisting of 457,945 data points. Each data point was labeled by the survey participant as one of Walk, Bike, Transit, Auto, Wait (waiting at transit station) and Activity. We proposed a mode inference model containing the following four steps:

1. Segment a full trip into transportation mode segments (TMS), where each segment is a consecutive set of GPS measurements for which a single transportation mode was used;
2. Develop a feature vector (FV) of attributes – speed, acceleration, jerk (rate of change of acceleration) – for all TMS;
3. Sort the attributes in the FV based on the strength of their differentiating power, creating an ordered list of Adjusted Attribute Differentiating Power (AADP);
4. Identify the optimal classifier by choosing the best combination amongst:
• the number of attributes to be included in the FV;
• the level of correlation amongst attributes;
• the representation of attributes: continuous or discrete;
• the transformation of attributes: a determination is made if Principal Components Analysis (PCA) should be applied; and
• different classification models: Naïve Bayes (NB), K-Nearest Neighbor (KNN), and Discriminant Analysis (DA).

When we applied the proposed methodology to this set of travel data, we found that the optimal inference model had the following characteristics:
▪ 11 attributes were included in the classification model;
▪ The effect of correlation amongst variables was not significant;
▪ Data were not discretized;
▪ PCA was applied; and
▪ The KNN (k = 11) classification model was selected.

The model was evaluated using both recall and precision. Recall is computed as the number of points correctly classified as mode m divided by the total number of points that actually belong to mode m. Precision is computed as the number of points correctly classified as mode m divided by the total number of points classified as mode m. We elected to evaluate the model at the point level to avoid the case where many

Fig. 1. Algorithm for correcting identification of mode transfer points.
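
The screening step illustrated in Fig. 1 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the 0.75 m/s default threshold (borrowed from the stationary-segment threshold reported in Section 3.2.1), and the rule of snapping to the nearest stationary-segment boundary are all assumptions made for the example.

```python
from typing import List, Tuple


def stationary_runs(speeds: List[float], v_th: float = 0.75,
                    min_points: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of runs containing at least
    `min_points` consecutive observations with speed below `v_th`."""
    runs, start = [], None
    for i, v in enumerate(speeds):
        if v < v_th:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_points:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(speeds) - start >= min_points:
        runs.append((start, len(speeds) - 1))
    return runs


def correct_transfer_point(speeds: List[float], transfer_idx: int,
                           v_th: float = 0.75) -> int:
    """Move a reported mode-transfer index to the closest boundary of a
    stationary run (assumed rule). The index is returned unchanged if it
    already lies inside a stationary run or if the trip has none."""
    runs = stationary_runs(speeds, v_th)
    if not runs or any(s <= transfer_idx <= e for s, e in runs):
        return transfer_idx
    boundaries = [b for run in runs for b in run]
    return min(boundaries, key=lambda b: abs(b - transfer_idx))


# Hypothetical speed trace in m/s, one observation per 5 s interval.
speeds = [1.2, 1.3, 0.2, 0.1, 0.0, 8.0, 12.0, 11.0, 0.3, 0.2, 5.0]
corrected = correct_transfer_point(speeds, transfer_idx=7)
```

In this hypothetical trace a transfer reported while traveling at 11 m/s is moved to the start of the following stationary run, which mirrors the forward or backward shift described for points 2, 5, and 6 in Fig. 1.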



Table 1
Confusion matrix.

1a: Aggregate level

                              Classified as
Reality           Non-motorized    Motorized    Recall
Non-motorized     29,414           505          98.3%
Motorized         5558             61,288       91.7%
Precision         84.1%            99.2%

1b: Disaggregate level

                              Classified as
Reality       Walk       Bike       Transit    Auto       Total     Proportion   Recall
Walk          16,840     60         130        150        17,180    17.8%        98.0%
              (98.0%)    (0.3%)     (0.8%)     (0.9%)
              (73.0%)    (0.5%)     (6.2%)     (0.3%)
Bike          2649       9865       5          220        12,739    13.2%        77.4%
              (20.8%)    (77.4%)    (0.0%)     (1.7%)
              (11.5%)    (83.0%)    (0.2%)     (0.4%)
Transit       1489       514        1583       4597       8183      8.5%         19.3%
              (18.2%)    (6.3%)     (19.3%)    (56.2%)
              (6.5%)     (4.3%)     (75.2%)    (7.7%)
Auto          2103       1452       386        54,722     58,663    60.6%        93.3%
              (3.6%)     (2.5%)     (0.7%)     (93.3%)
              (9.1%)     (12.2%)    (18.3%)    (91.7%)
Total         23,081     11,891     2104       59,689
Proportion    23.9%      12.3%      2.2%       61.7%
Precision     73.0%      83.0%      75.2%      91.7%

(xx), first value in parentheses = number of points classified as mode n divided by total number of actual points of mode m times 100%.
(yy), second value in parentheses = number of mode m points divided by the total number of points classified as mode n times 100%.
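
The recall and precision values in Table 1b follow directly from the point counts. The short sketch below recomputes them from the disaggregate confusion matrix; the counts are taken from the table, while the variable and function names are illustrative assumptions.

```python
# Rows are the actual mode, columns the classified mode (counts from Table 1b).
modes = ["Walk", "Bike", "Transit", "Auto"]
confusion = [
    [16840, 60, 130, 150],     # actual Walk
    [2649, 9865, 5, 220],      # actual Bike
    [1489, 514, 1583, 4597],   # actual Transit
    [2103, 1452, 386, 54722],  # actual Auto
]


def recall(matrix, m):
    """Points correctly classified as mode m / points that actually belong to m."""
    return matrix[m][m] / sum(matrix[m])


def precision(matrix, m):
    """Points correctly classified as mode m / all points classified as m."""
    return matrix[m][m] / sum(row[m] for row in matrix)


for idx, name in enumerate(modes):
    print(f"{name}: recall {recall(confusion, idx):.1%}, "
          f"precision {precision(confusion, idx):.1%}")
```

Working at the point level, as the paper does, means each GPS observation contributes equally, so a long transit segment that is misclassified as auto weighs more heavily than a short one.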

misclassified short segments, for which classification is most difficult, skew the assessment of the model's performance.

The model performance, in terms of recall and precision, is summarized in the confusion matrix, presented as Table 1. The results are presented based on the actual aggregate (and disaggregate) mode to which the data belong compared to the mode to which the classifier assigns the data points. As noted earlier, the classifier model performed very well in assigning the correct aggregate label (Table 1a) with 98.3% of non-motorized (walking or cycling) and 91.7% of motorized (transit or auto) points correctly identified.

Table 1b shows the results at the disaggregate level (i.e. the model's ability to classify into each of the four transportation modes). An examination of the recall statistic shows that the classifier model performed very well for walk and auto modes, and reasonably well for bike mode, but poorly for transit mode (the model was able to correctly identify only 19.3% of transit points). The classifier most often incorrectly identified transit points as auto points (56.2%), but a non-trivial proportion of points were also misclassified as walking (18.2%) and bike (6.3%). This suggests that improvements can be made to the overall classifier performance by enhancing the model's ability to distinguish transit points from other modes. We expect, however, that efforts to improve recall will degrade the precision statistic – that is, points that were previously (and conservatively) judged to not be transit will now be considered transit, thereby misclassifying a greater proportion of walk and auto trips as transit. Our efforts in this paper concentrate on achieving balance in improving the recall statistic without significantly decreasing the precision.

3.2. Model enhancement with spatial data and GIS

To improve the classifier's performance, we begin by identifying all trip segments that may have been made by public transport. The optimized KNN model provides remarkably high accuracy at an aggregated level – determining if a trip segment is made by motorized or non-motorized modes. Given this accuracy, we eliminate from further analysis those trips for which all segments were labeled as non-motorized. This produces a subset of total trips for which at least one segment was labeled as motorized (auto or transit). We label these trips Initial Potential Transit Trips (IPTTs).

To determine if the IPTTs actually contain transit segments, we evaluate whether the IPTTs have characteristics that are unique to transit trips. Generally, we employ two distinctive traits. First, a transit segment must begin and end at locations that coincide with transit stop locations. Second, the stopping pattern for transit segments tends to be distinct from other motorized travel. When a trip contains these two characteristics, we then employ an algorithm to establish the limits of the transit portion of a trip. The proposed module includes the three steps shown in Fig. 2. The details for each step are presented in subsequent sections.

Fig. 2. Components of model enhancement.

3.2.1. Step 1: Identifying trips with stationary segments in proximity to transit stations

The traveler's GPS data identify locations where speeds fall below a certain threshold, vth, such that the traveler can be considered “stopped” over a set of points. We call these series of points stationary segments. For each IPTT, we employ GIS tools to determine if two or more stationary segments are within a specified distance of a transit station. In Fig. 3,

any stationary segments with x,y values that are located within the circle bounded by the Transit Threshold Distance, TDth, would return a positive match.

Fig. 3. A hypothetical transit station illustrates the distance threshold.

Mathematically, we find trips that contain at least two sets of points, S1 and S2, for which:

\[
v_{1,i} < v_{th} \ \text{for all } i \in S_1, \qquad
v_{2,j} < v_{th} \ \text{for all } j \in S_2, \qquad
TD_1 < TD_{th} \ \text{and} \ TD_2 < TD_{th}
\]

where
$v_{1,i}$ is the speed for observation i in S1 (and $v_{2,j}$ the speed for observation j in S2)
$v_{th}$ is the speed threshold (0.75 m/sec)
$TD_k$ is the Euclidean distance from the transit station to the centroid of the points in stationary segment k
$TD_{th}$ is the proximity threshold (30 m).

We chose 30 m as a threshold value for TDth to allow for error in the GPS data (which we observe to range between 3 and 10 m) as well as the range of actual stopping points for transit vehicles in the vicinity of the GIS point identifying the transit stop. We believe this value must be re-calibrated for different applications.

3.2.2. Step 2: Application of a transit stop rate criterion considering the influence of signalized intersections

The data set is refined a third time using information about the frequency and location of stationary segments. For transit trips, the stationary segments are expected to be more frequent and have a stronger spatial correlation with the location of transit stops than for private vehicles. Fig. 4 illustrates this concept graphically using a space–time diagram for a private auto and a transit vehicle. Thus, we increase our likelihood of correctly classifying transit trips by identifying those trip segments with a higher rate of stationary segments and stronger spatial correlations to transit stops.

As discussed previously, other researchers have proposed comparing the number of stationary segments per trip as a way to distinguish between auto and transit modes. We adopt a similar approach; however, we improve upon this metric by calculating the number of stationary segments per distance traveled, which we believe is a more robust differentiating factor, as the following example demonstrates.

Suppose two trips are made, one is 10 km and the other is 3 km long. In the first trip, the GPS data indicate four stationary segments near to transit stops of which three are in proximity to a signalized intersection
Fig. 4. Space–time diagram for a hypothetical example for two trips by private-auto and transit.
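
As a rough illustration of the Step 1 filter in Section 3.2.1, the sketch below groups consecutive low-speed points into stationary segments, computes each segment's centroid, and checks whether at least two centroids fall within TDth of a transit stop. It is a sketch under stated assumptions rather than the authors' GIS workflow: coordinates are assumed to be in a projected system measured in metres, a single low-speed point is allowed to form a segment, and all function names are hypothetical. The thresholds (0.75 m/s and 30 m) are the values reported in the paper.

```python
import math

V_TH = 0.75   # speed threshold in m/s (value reported in the paper)
TD_TH = 30.0  # proximity threshold in m (value reported in the paper)


def stationary_segments(points, v_th=V_TH):
    """Group consecutive (x, y, v) points with v < v_th into segments.
    Each segment is returned as a list of (x, y) coordinates."""
    segments, current = [], []
    for x, y, v in points:
        if v < v_th:
            current.append((x, y))
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def centroid(segment):
    """Arithmetic mean of the segment's coordinates."""
    n = len(segment)
    return (sum(x for x, _ in segment) / n, sum(y for _, y in segment) / n)


def near_transit_stop(segment, stops, td_th=TD_TH):
    """True if the segment centroid is within td_th of any stop (Euclidean)."""
    cx, cy = centroid(segment)
    return any(math.hypot(cx - sx, cy - sy) < td_th for sx, sy in stops)


def passes_step1(points, stops):
    """Keep a trip only if two or more stationary segments lie near a stop."""
    hits = sum(near_transit_stop(seg, stops)
               for seg in stationary_segments(points))
    return hits >= 2


# Hypothetical trip with two stops, each close to a (hypothetical) transit stop.
trip = [(0, 0, 0.1), (1, 0, 0.2), (50, 0, 8.0),
        (400, 0, 0.0), (401, 1, 0.3), (600, 0, 9.0)]
transit_stops = [(5, 5), (410, 0)]
keep_trip = passes_step1(trip, transit_stops)
```

In practice the nearest-stop search would normally be delegated to a GIS or a spatial index; the brute-force loop here is only meant to make the threshold logic explicit.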

and one is away from a traffic signal. For the second trip, the GPS data indicate three stationary segments, again with only one occurring away from a traffic signal. Both trips have one stationary segment unrelated to a traffic signal and therefore we must conclude that both trips are equally likely to have been made using transit. However, consider instead the ratio of the number of stationary segments that spatially coincide with a transit stop, but which do not spatially coincide with a traffic signal, to the length of the trip. We call this the transit

Fig. 5. Location of the Region of Waterloo and the Major Roadway Network.
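
The transit stop rate (TSR) of Section 3.2.2 reduces to a simple ratio, and the two-trip example from the text can be reproduced in a few lines. The sketch below is illustrative only: the function name, the boolean-pair representation of a stationary segment, and the choice to keep trips whose TSR is at or above the 0.3 stops per km threshold (rather than strictly above it) are assumptions.

```python
TSR_TH = 0.3  # stops per km; value used in the paper's example


def transit_stop_rate(segments, trip_length_km):
    """Number of stationary segments that are near a transit stop but not
    near a signalized intersection, divided by the trip length in km.

    `segments` is a list of (near_transit_stop, near_signal) booleans."""
    qualifying = sum(1 for near_stop, near_signal in segments
                     if near_stop and not near_signal)
    return qualifying / trip_length_km


# Trip 1: 10 km, four stationary segments near stops, three of them at signals.
trip1 = [(True, True)] * 3 + [(True, False)]
# Trip 2: 3 km, three stationary segments near stops, two of them at signals.
trip2 = [(True, True)] * 2 + [(True, False)]

tsr1 = transit_stop_rate(trip1, 10.0)
tsr2 = transit_stop_rate(trip2, 3.0)

# Assumed filter: retain trips whose TSR reaches the threshold.
candidates = [name for name, tsr in (("trip 1", tsr1), ("trip 2", tsr2))
              if tsr >= TSR_TH]
```

Normalizing by distance is what separates the two trips: both have exactly one qualifying stop, but only the shorter trip reaches the threshold.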

stop rate (TSR). For trip 1, the TSR is 0.1 per km, and for trip 2, the TSR is 0.33 per km. From these results we would conclude that trip 2 is more likely to have been made by transit than trip 1.

To operationalize this step, we need to define three threshold values. For proximity to a transit stop, we require that the centroid of the stationary segment be within 30 m. On the other hand, a stationary segment is deemed to be away from a signalized intersection when the centroid of the stationary segment is greater than 50 m from the traffic signal. Finally, to identify a threshold value, TSRth, above which a trip should be considered for further analysis, we rely on the empirical data. We establish an appropriate threshold by comparing the cumulative distribution functions for TSR for known transit and non-transit trips. The actual threshold value should be calibrated for local conditions. The value used in our example is 0.3 stops per km, the derivation of which is shown in Fig. 6.

3.2.3. Step 3: Calculating the start and end points of the transit segment

We now have a subset of trips for which we are confident that at least one segment was conducted by transit. We determine the start and end points of the transit component for each of these trips as follows:

We define the Potential Transit Starting Point (PTSP) as the first point in a non-stationary segment that satisfies the following conditions:
▪ The point is located within the proximity threshold of a transit station;
▪ The preceding segment was labeled by the original classifier as non-motorized (i.e. walk or bike to the transit stop);

We define the Potential Transit End Point (PTEP) as the last point in a non-stationary segment that satisfies the following conditions:
▪ The point is located within the proximity threshold of a transit station;
▪ The following segment is classified by the original classifier as a non-motorized segment.

Based on these definitions, the algorithm employs a stepwise (forward and backward) approach to determine limits of a transit mode segment. Suppose there are k stationary segments on a trip. The algorithm begins at stationary segment 1 and assesses whether this segment meets the criteria for a PTSP. If so, the segment is labeled as the beginning point of a transit segment. If not, then the algorithm iterates to stationary segment 2 and the process continues until either a transit starting point is found at segment i or all k segments have been evaluated and no transit start point has been identified. In the latter case, the trip is then eliminated from further consideration and the segment labels remain as determined by the original classification model. In the former case, the algorithm then seeks a transit segment end point beginning at point k. If point k satisfies the requirements, then the transit segment is bounded by points i and k. If not, the algorithm iterates to k − 1 and tests for suitability. This iterative process continues for n iterations such that k − n > i. If no suitable end point is found, the trip is eliminated; if a suitable end point is found in iteration n, then the transit segment is bounded by points i and k − n.

4. Model study area

The Region's transportation network is comprised of three freeways: Provincial Highway 401 connecting the Detroit/Windsor area in the west through Toronto to Ontario's eastern border with Quebec; Highway 8/85 that connects Highway 401 in the south through the cities to the northernmost parts of the Region; and Highway 7, also providing an east-west connection through the Region. Several multi-lane, major arterials carry significant traffic throughout the Region. Speed limits on these are normally 50–70 km per hour, depending on the adjacent land use density. The location of the Region and the major roadway network are shown in Fig. 5.

Transit in the Region is provided by a single, public sector agency – Grand River Transit. The transit agency has approximately 250 buses in service, operating on 66 regular routes. The network contains more than 2700 transit stops, with typical station spacings being 400 m. Annually, the system is operated for about 15 million vehicle kilometers nearly exclusively on arterials; only one route operates for a short segment on Highway 401 and Highway 8. In 2014, boardings on the system exceeded 22 million. The system is currently bus only; the Region is constructing a 19 km Light Rail Transit system, with an expected opening date late in 2017.

The data used in this study were gathered over two time periods. The first collection took place in October of 2011, with additional data gathered in March of 2013. Twenty individuals provided GPS data; each data record contained (x,y,z,t), and records were gathered at five second intervals. Prior to the application of the model, three pre-processing steps were taken. First, obviously erroneous points were deleted. These were generally identified by significant (and unrealistic) changes in heading, speed, or location. Next, very short trips – those with duration less than five minutes – were eliminated from the data set due to the inability to compute meaningful feature vectors for these trips. Finally, unlabeled trips were also eliminated as the effectiveness of the model could not be evaluated for them.

5. Model application and results

After data cleaning, we had 658 trips comprised of approximately 100,000 points representing transportation modes – i.e. not stationary or engaging in an activity (shopping, etc.). As presented in Table 1b, our dataset contained 8183 points for which the mode of travel was transit. The optimal classifier from the previous work did not make use of spatiotemporal data and correctly identified only 19% of these transit points. The purpose of this model is to improve these results.

We apply the process described in Section 3. The preliminary step is to eliminate all trips for which all non-stationary segments have been labeled as non-motorized. This reduces our number of trips to 501 and the number of points to 66,277.

Next, we compute the distances from the centroid of all stationary segments to the nearest transit stop. We then eliminate those trips that have no stationary segments within the 30 m threshold to transit stop locations. This further reduces the number of trips to 323.

We now apply the stopping rate filter as part of Step 2. Recall that this calculation is the number of stationary segments, per km of trip length, for which the centroid is within a threshold to a transit stop but beyond a threshold to the nearest signalized intersection. Fig. 6 shows the cumulative distribution functions for TSR generated from our data for transit trips and private automobile trips. About 80% of private automobile trips have a TSR less than 0.3 stops per km. Only 25% of transit trips have a TSR less than this value. As a result, we eliminate all trips from the data set with a
To test the effectiveness of the model presented here and in previous TSR less than 0.3 stops/km. This filter eliminates 236 trips from consid-
work (Nour, 2015), data were gathered in the Region of Waterloo, in eration, leaving us with 87 potential transit trips.
Ontario Canada. The Region is located approximately 100 km to the For these trips, we then applied the stepwise algorithm from Step 4
west of Toronto. The Region is made up of three cities – Cambridge, to identify the start and end points of the transit trip. Recall that one of
Kitchener, and Waterloo – as well as four rural townships – Wellesley, the criteria for identifying a potential transit start point involves the
Woolwich, Wilmot and North Dumfries. The total regional population maximum speed on the segment to which the point belongs. To
is about 530,000 but is expected to grow to 750,000 by 2031. determine an appropriate threshold value for maximum speed, we
created the cumulative distribution functions for motorized and non-motorized trips (Fig. 7).

In this case, we observe that 90% of non-motorized segments have a maximum speed that is less than 6.75 m/s (24 km/h). Less than 18% of motorized segments exhibit maximum speeds lower than this value. We select this value as the threshold, implying that the segment containing a potential transit start point must have a maximum speed greater than 6.75 m/s.

Fig. 6. TSR for transit and non-transit trips.

Fig. 7. Cumulative distribution of maximum segment speed for motorized versus non-motorized modes.

The application of the process results in 87 total trips being identified as potential transit trips. Of these 87 trips, 57 actually contain at least one transit segment. On the other hand, six trips that actually contain at least one transit segment were eliminated by our model, i.e. classified as not containing any transit segments. A further analysis of these trips suggests that the conservative value for the stopping rates as well as unusual GPS errors led to the misclassification. The overall accuracy is 91% at the disaggregated level and 95% at the aggregated level.

While we are interested in accuracy at the trip level, a better metric is accuracy at the point level. Table 2 provides the confusion matrix for the disaggregated results at the point level when the proposed method for including the spatiotemporal information is applied. The impact of the proposed method can be determined by comparing the results in Table 2 with those from the original KNN model (Table 1).

Recall that the objective was to improve the classification of transit points. The results in Table 2 demonstrate that we have achieved this goal without jeopardizing the accuracy results of other modes. The changes in the classification performance (recall and precision) are summarized in Fig. 8 (positive values indicate improvement). We have improved the recall of transit from 19% to 85% (an increase of approximately 65 percentage points compared to the original classifier). We also note that there is a small increase in the number of walk and auto points which are incorrectly labeled as transit. However, the overall precision for each transportation mode is still improved. The use of the spatiotemporal data also improves the model performance for auto trips: recall increased by approximately 1% and precision by 6%.

Fig. 8. Impact of proposed method for including spatiotemporal data on classification results.

6. Conclusions, limitations and recommendations

This paper has built upon previous work to classify travel modes from GPS data. In that previous work, we were successful in identifying trips (at the point and segment levels) as motorized and non-motorized. The model performed less satisfactorily in differentiating between auto and bus transit modes, primarily because bus transit in the study area operates in mixed traffic and therefore has properties which are very difficult to distinguish from private auto. In lieu of further complicating the original model, we elected to build a second, complementary procedure that improves the combined models' performance in correctly identifying transit trips.

The approach presented here integrated the classifications from the previous model with additional information derived from both the GPS data and GIS information. More specifically, we first limited our analysis to trips classified as containing at least one motorized segment. We then further refined the analysis set to contain only trips with stationary segments that spatially coincided with the location of transit stations. Next, we filtered the trip set again to eliminate those trips for which the transit stop rate (a metric we developed that isolates the stopping pattern at transit stations as opposed to traffic signals) failed to exceed an empirically derived threshold.

We then applied a stepwise algorithm that cycled through stationary segment locations to find logical transit starting and end points. All segments bounded by the start and end point were labeled as transit. If a start or end point was not found, the trip was eliminated from the analysis set. The final criterion was to ensure that any set of segments labeled as a transit trip was sufficiently long in duration.

The application of the proposed method resulted in a vast improvement in the classification of transit trips, with only minor degradation in the classification of other modes, walking in particular. With the application of the spatiotemporal methods described here, we improved the transit recall from 19% to 85%, an increase of approximately 65 percentage points. The proposed method can be applied to the results of the KNN model proposed in Nour et al. (2015), as well as to any other mode classification model. Model parameters need to be calibrated to local conditions using a small sample of labeled data as well as network characteristics.

Naturally, the model has limitations. The ability to correctly classify on-street transit modes used in this research depends heavily on two criteria: the presence of transit stop locations that are not co-located with intersections and the prevalence of different stopping rates for autos and transit vehicles. While we have not tested the model in other domains, our expectation is that the performance would be
degraded in networks where nearly all transit stop locations are within the buffer distance of intersections. Similarly, the model is likely to perform less well in transit networks with fewer stops and, as a result, longer stop spacings that produce stopping patterns that are more similar to private auto patterns. Further testing is necessary to assess the impacts of these conditions.

References

Casas, J., Arce, C., 1999. Trip Reporting in Household Travel Diaries: A Comparison to GPS-Collected Data. The 78th TRB Annual Meeting, Washington D.C.
Chung, E., Shalaby, A., 2005. A trip reconstruction tool for GPS-based personal travel surveys. Transp. Plan. Technol. 28 (5), 381–401.
Gong, H., Chen, C., Bialostozky, E., Lawson, C.T., 2011. A GPS/GIS method for travel mode detection in New York City. Comput. Environ. Urban. Syst. 36 (2), 131–139.
Moiseeva, A., Timmermans, H., Jessurun, J., 2010. Semi-automatic imputation of long-term activity-travel diaries using GPS traces: personal versus aggregate histories. 12th WCTR, Lisbon, Portugal.
Nour, A., 2015. Automating and Optimizing a Transportation Mode Classification Model for use on Smartphone Data. Ph.D. thesis, University of Waterloo, Waterloo, Ontario, Canada.
Nour, A., Casello, J., Hellinga, B., 2015. Developing and optimizing a transportation mode inference model utilizing data from GPS embedded smartphones. Presented in the 94th Annual Meeting of the Transportation Research Board - #15–5027.
Rasmussen, T., Ingvardson, J.B., Halldórsdóttir, K., Nielsen, O.A., 2013. Using wearable GPS devices in travel surveys: a case study in the Greater Copenhagen area. Transport Conference at Aalborg University (ISSN 1603–9696).
Schüssler, N., 2010. Accounting for similarities between alternatives in discrete choice models based on high-resolution observations of transport behaviour. Ph.D. thesis, ETH, Zurich, Switzerland.
Stenneth, L., Yu, P., Wolfson, O., Xu, B., 2011. Transportation mode detection from mobile phones and GIS information. ACM SIGSPATIAL GIS 2011.
Stopher, P., Clifford, E., Zhang, J., FitzGerald, C., 2007. Deducing mode and purpose from GPS Data. Paper presented at: 11th Transportation Research Board National Planning Applications Conference.
Taghipour, R., Hellinga, B., 2012. Acquiring multimodal travel behavior data using smart phones. Presented at the ITS Canada Annual Conference Held June 10–13, 2012 in Quebec City, Canada.
Tsui, S., Shalaby, A.S., 2006. Enhanced system for link and mode identification for personal travel surveys based on global positioning systems. J. Transp. Res. Board 1972 (−1), 38–45.
Wolf, J., Guensler, R., Bachman, W., 2001. Elimination of the travel diary: experiment to derive trip purpose from global positioning system travel data. J. Transp. Res. Board 1768 (−1), 125–134.

Table 2
Confusion matrix.

                                Classified as
Reality       Walk       Bike       Transit    Auto       Total      Proportion   Recall
Walk          16,763     78         253        86         17,180     17.8%        97.6%
              (97.6%)    (0.5%)     (1.5%)     (0.5%)
              (83.4%)    (0.6%)     (3.1%)     (0.2%)
Bike          1918       10,542     5          274        12,739     13.2%        82.8%
              (15.1%)    (82.8%)    (0.0%)     (2.2%)
              (9.5%)     (87.7%)    (0.1%)     (0.5%)
Transit       356        43         6934       851        8184       8.5%         84.7%
              (4.3%)     (0.5%)     (84.7%)    (10.4%)
              (1.8%)     (0.4%)     (85.2%)    (1.5%)
Auto          1056       1364       949        55,295     58,664     60.6%        94.3%
              (1.8%)     (2.3%)     (1.6%)     (94.3%)
              (5.3%)     (11.3%)    (11.7%)    (97.9%)
Total         20,093     12,027     8141       56,506
Proportion    20.8%      12.4%      8.4%       58.4%
Precision     83.4%      87.7%      85.2%      97.9%

First value in parentheses (xx) = number of actual mode m points classified as mode n divided by the total number of actual points of mode m, times 100%.
Second value in parentheses (yy) = number of actual mode m points classified as mode n divided by the total number of points classified as mode n, times 100%.
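The recall and precision values reported in Table 2 follow directly from the point counts in the confusion matrix. The short sketch below recomputes them as a consistency check; the variable and function names are illustrative choices made here and are not taken from the original study.

```python
# Point counts from Table 2: rows are the actual mode, columns the classified mode.
MODES = ["Walk", "Bike", "Transit", "Auto"]
CONFUSION = [
    [16763, 78, 253, 86],      # actual Walk
    [1918, 10542, 5, 274],     # actual Bike
    [356, 43, 6934, 851],      # actual Transit
    [1056, 1364, 949, 55295],  # actual Auto
]


def recall_by_mode(matrix):
    """Recall = correctly classified points / all actual points of that mode (row total)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def precision_by_mode(matrix):
    """Precision = correctly classified points / all points classified as that mode (column total)."""
    column_totals = [sum(row[j] for row in matrix) for j in range(len(matrix))]
    return [matrix[j][j] / column_totals[j] for j in range(len(matrix))]


if __name__ == "__main__":
    recall = recall_by_mode(CONFUSION)
    precision = precision_by_mode(CONFUSION)
    for mode, r, p in zip(MODES, recall, precision):
        print(f"{mode:8s} recall = {100 * r:.1f}%  precision = {100 * p:.1f}%")
```

Rounded to one decimal place, these ratios agree with the Recall column and the Precision row of Table 2.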

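The stepwise (forward and backward) search for transit segment limits described in Section 3 can be illustrated with a minimal sketch. It assumes that each candidate location has already been tested against the proximity threshold and that the labels of the neighbouring segments from the original classifier are available. The `Candidate` class, its field names, and the function names are hypothetical, and the maximum-speed criterion for the start point is omitted for brevity. Indices are zero-based here, whereas the text counts from 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical summary of one candidate location on a trip."""
    near_transit_stop: bool     # within the proximity threshold of a transit station
    prev_non_motorized: bool    # preceding segment labeled walk or bike
    next_non_motorized: bool    # following segment labeled walk or bike


def is_ptsp(c: Candidate) -> bool:
    # Potential Transit Start Point: near a stop and reached by a non-motorized mode.
    return c.near_transit_stop and c.prev_non_motorized


def is_ptep(c: Candidate) -> bool:
    # Potential Transit End Point: near a stop and followed by a non-motorized mode.
    return c.near_transit_stop and c.next_non_motorized


def find_transit_bounds(candidates: List[Candidate]) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices of the transit segment, or None if the trip is eliminated."""
    k = len(candidates)
    # Forward pass: the first candidate that qualifies as a start point.
    start = next((i for i in range(k) if is_ptsp(candidates[i])), None)
    if start is None:
        return None  # original segment labels are kept
    # Backward pass: from the last candidate down to start + 1 (the k - n > i condition).
    for j in range(k - 1, start, -1):
        if is_ptep(candidates[j]):
            return (start, j)
    return None  # no suitable end point, so the trip is eliminated
```

Searching forward for the start and backward for the end yields the widest span consistent with the two definitions, which matches the bounding by points i and k − n described in the text.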