3.2. Data Collection
The first component of the approach is in charge of collecting data from the user’s personal mobile device. Obtaining precise data about the geographical location of users can lead to high battery consumption, depending on the location mechanism used and the frequency with which the device is located. Thus, the location mechanism and the frequency with which the device is located vary according to the user’s context and activity, in order to reduce the impact on the battery.
In this context, an adaptive data collection component is proposed, which is detailed in Figure 2. This component periodically collects data on the physical activity that the user is performing and the geographical location from the services offered by the operating system. The approach generalizes the information in the following way. Each collected geographic location is represented as a 6-tuple consisting of the moment at which the location was recorded, the two geographic coordinates (latitude and longitude), the altitude, the location precision and the location mechanism used (e.g., GPS or Wi-Fi). Physical activity, in turn, is represented as a 3-tuple consisting of the moment at which the activity was recorded, the identified activity (still, walking, running, cycling, in vehicle or unknown) and the confidence in that activity (that is, how likely it is that the recorded activity matches the activity that the user is actually performing).
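As an illustration of the two tuples described above, the following minimal sketch models them as Python data classes. The class names, field names and units (epoch seconds, metres, a confidence in the range 0 to 1) are assumptions made for this example and are not taken from the approach itself.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Activity(Enum):
    """Physical activities listed in the text."""
    STILL = "still"
    WALKING = "walking"
    RUNNING = "running"
    CYCLING = "cycling"
    IN_VEHICLE = "in_vehicle"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class LocationRecord:
    """6-tuple describing one collected location (field names are hypothetical)."""
    timestamp: float   # moment the location was recorded (epoch seconds, assumed)
    latitude: float    # geographic coordinate, degrees
    longitude: float   # geographic coordinate, degrees
    altitude: float    # metres (assumed unit)
    precision: float   # expected margin of error in metres (assumed unit)
    mechanism: str     # location mechanism, e.g. "gps" or "wifi"


@dataclass(frozen=True)
class ActivityRecord:
    """3-tuple describing one detected activity (field names are hypothetical)."""
    timestamp: float   # moment the activity was recorded
    activity: Activity
    confidence: float  # likelihood that the activity is correct, 0.0 to 1.0 (assumed scale)


loc = LocationRecord(1702641600.0, -37.32, -59.13, 180.0, 12.0, "gps")
act = ActivityRecord(1702641600.0, Activity.WALKING, 0.93)
```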
To address the problem of deviations or errors in the location mechanism, the collection component has a series of filters and a buffer (labeled in Figure 2 with 1 and 2, respectively) that analyze the sequence of collected locations in search of anomalies or situations that can be considered an error or a deviation. During this analysis, not only the collected data are considered but also geographic information about the user’s environment. In this work, OpenStreetMap (https://www.openstreetmap.org/ (accessed on 15 December 2023)) and ASTER GDEM (https://asterweb.jpl.nasa.gov/gdem.asp (accessed on 15 December 2023)) were used as sources of geographic information, but any other available source could be used. Locations identified as errors or deviations are classified as unreliable and discarded (labeled in Figure 2 with 3), as they can introduce noise to the rest of the components.
Trusted locations (that is, those that were not previously discarded) are used together with the user’s physical activity data to keep the user’s mobility status up to date (labeled in
Figure 2 with 4). In the proposed approach, eight possible states of mobility are defined. Each of these states identifies a particular situation and provides useful information to the component to determine which location mechanism to use and its frequency. The description of the different mobility states and the location strategies used in each of them are detailed in
Section 3.2.1. The approach intelligently adapts the frequency and the locating mechanism used according to the situation (marked as 5 in
Figure 2), allowing us to improve the precision with which the user’s location is known and to save battery life when possible.
Finally, the trusted locations collected by this component and the user’s physical activity data are grouped into what we will call an
event (marked 6 in
Figure 2). Most approaches in the literature simply record the day, time and geographic location of the user at that time. The proposed approach also records the user’s physical activity in the event since it is useful to detect the user’s stay points during data processing. We currently detect the user’s physical activity (whether the user is walking, in a car, on a bicycle, or stationary) with different sensors and we could identify, in the future, the means of transport used. Identifying the means of transport requires additional information; for example, to detect if the user is using public transportation, it is necessary to combine information from the public transport routes in the city, then search for matches with the user’s path and eventually segment it into sections (for instance, walking for part of the journey, using public transport for another and then walking again).
Figure 2 shows an example event, indicating the data it encapsulates. The events generated by this first component are the input to the second component.
3.2.1. Mobility States
The data collection component keeps the user’s mobility status up to date. In this work, eight possible states of mobility were identified and defined, each of which identifies a different situation or context that is relevant when determining which location mechanism to use (GPS or networks) and how often to use it.
Figure 3 shows a diagram of the mobility states defined in this work and the transitions between them. The states are separated into three large groups: Rural Zone, Moving and Stopped.
The Rural Zone state indicates that the user is outside any urban area (city, town, etc.). The system enters this state if it is detected that the user moved away from the urban area (transition R1). To determine if the user is in an urban area or not, the last reliable geographic location collected is compared with geographic information on the location of cities, towns and other urban centers close to the user’s location. While the user is in a rural area, the collection component reduces data collection to save battery life until the user re-enters an urban area. Upon re-entering an urban area, the state of mobility changes from rural area to slow movement (transition L1).
The Stopped and Moving state groups occur when the user is in an urban area. The “Stopped” group has four states: “Still?”, “Still”, “Static?” and “Static”. These states occur when the user is not moving. If a user who was initially moving stops, the system changes to the “Still?” state (transition D1). This transition occurs only if the “standing still” physical activity is detected with a confidence of 90% or more. While the user remains still, the approach saves battery life by reducing the frequency with which the device is located. This reduction occurs gradually according to the user’s state. “Still?” is a transitional state indicating that the user is apparently staying still. If the user stays in this state for a certain time (5 min by default), the system enters the “Still” state (transition D2). The “Still” state (without question mark) indicates that no movement has been recorded for some time. In practice, this usually happens if the user remains seated somewhere or leaves the mobile device resting on a table or desk. After 5 min in the “Still” state, the system moves to the “Static?” state (transition D3). “Static?” is a transitional state indicating that the user has remained motionless for a considerable interval of time and is apparently at rest. If the user remains in this state of apparent rest for 10 min, the system enters the “Static” state (transition D4). The “Static” state (without question mark) indicates that the device has remained motionless for a long time and will probably stay that way for longer. In practice, this usually occurs when the user leaves the mobile device resting somewhere for a long time, such as on the night table before going to sleep or on the work desk.
We use four states to avoid losing information about the user’s mobility. The transitional states (those with a question mark) serve to confirm a transition before the data collection parameters are adjusted.
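The gradual progression through the “Stopped” group can be sketched as a function of the time the user has remained still. The dwell times (5, 5 and 10 min) come from the text; the function name, the use of a greater-or-equal comparison at the boundaries and the assumption that no movement interrupts the sequence are simplifications made for this example.

```python
STOPPED_SEQUENCE = ["Still?", "Still", "Static?", "Static"]

# Minutes to spend in each state before the next transition (D2, D3, D4).
DWELL_MINUTES = {"Still?": 5.0, "Still": 5.0, "Static?": 10.0}


def stopped_state(minutes_still: float) -> str:
    """Return the Stopped-group state reached after `minutes_still` minutes
    without movement, starting from "Still?" (entered through transition D1)."""
    state = "Still?"
    remaining = minutes_still
    # Advance while the time budget covers the dwell time of the current state.
    while state in DWELL_MINUTES and remaining >= DWELL_MINUTES[state]:
        remaining -= DWELL_MINUTES[state]
        state = STOPPED_SEQUENCE[STOPPED_SEQUENCE.index(state) + 1]
    return state
```

Under these assumptions, a device left untouched reaches the “Static” state after 20 min in total.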
If at any time physical activity indicating some type of movement is detected, one of the states of the “Moving” group is activated. The “Moving” group has three states: Slow, Intermediate and Fast. The Fast state indicates that the user is moving quickly, probably in a motorized vehicle. This state occurs when a speed greater than 20 km per hour is recorded (transitions R1, R2 and R3). The Intermediate state indicates that the user is moving slightly faster than at a normal walking pace, but not in a vehicle. This state occurs when physical activity such as running or cycling is detected, or if a speed greater than 7 km per hour is recorded (transitions I1 and I2). Finally, the Slow state indicates that the user is moving slowly, probably walking. It is the default state when the process starts, and it occurs in any situation where the user is not stopped but the conditions for entering the Intermediate or Fast states are not met. Note that, to exit the Intermediate or Fast states, 40 s must elapse without a related activity being identified or the required minimum speed being reached. This way, the system only exits these states if the user changes the means of transport, reducing the chances of erroneously entering the Slow or “Still?” states (transitions L2, L3 and D1) if the user stops momentarily at a traffic light or because of traffic.
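The rules for the “Moving” group can be sketched as follows. The 20 km/h and 7 km/h thresholds and the 40 s exit delay come from the text; the class and function names are hypothetical, and treating every move to a slower state as subject to the same 40 s delay is a simplifying assumption. Detection of the “Stopped” states is outside the scope of this sketch.

```python
RANK = {"Slow": 0, "Intermediate": 1, "Fast": 2}
EXIT_DELAY_S = 40.0  # time that must elapse before leaving Intermediate or Fast


def classify(activity: str, speed_kmh: float) -> str:
    """Instantaneous classification from the thresholds given in the text."""
    if speed_kmh > 20.0:
        return "Fast"
    if activity in ("running", "cycling") or speed_kmh > 7.0:
        return "Intermediate"
    return "Slow"


class MovingTracker:
    """Keeps the current Moving state, delaying transitions to slower states."""

    def __init__(self) -> None:
        self.state = "Slow"        # default state when the process starts
        self.last_confirmed = 0.0  # last time the current state was confirmed

    def update(self, t: float, activity: str, speed_kmh: float) -> str:
        candidate = classify(activity, speed_kmh)
        if RANK[candidate] >= RANK[self.state]:
            # Same or faster state: accept immediately and refresh the timer.
            self.state = candidate
            self.last_confirmed = t
        elif t - self.last_confirmed >= EXIT_DELAY_S:
            # Slower state: accept only after 40 s without confirmation.
            self.state = candidate
            self.last_confirmed = t
        return self.state
```

With this logic, a vehicle that stops for 20 s at a traffic light remains in the Fast state, which is the behavior the delay is meant to achieve.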
Table 2 summarizes the data collection policies according to the mobility states. The first column specifies the mobility state. The second and third columns indicate the preferred location mechanism to use while the user is in that state and how often the user’s device is located using this mechanism. The fourth and fifth columns indicate the alternative location mechanism and its frequency. This alternative mechanism is used when the first mechanism presents errors or deviations (detected by the filters and the buffer). In rural zones, we use the cell phone antenna to detect when the user returns to an urban zone.
3.2.2. Proposed Filters
Collected locations are often inaccurate, or even erroneous, due to GPS interference, unstable Wi-Fi signals and outdated Wi-Fi databases, among other factors. To deal with this, we propose a series of filters that allow us to determine whether a location is trustworthy. Each filter analyzes different aspects or situations to identify unreliable locations that should be discarded. The number of filters and the logic of each one can be set as needed. In our approach, four types of filters were defined: precision filter, speed filter, GPS altitude filter and outdoor network filter. When a filter identifies a location as untrusted, the location is discarded and data collection is adapted if necessary, for example, by requesting a new location or changing the location mechanism used (GPS or network location).
The precision filter is based on the precision that each collected location carries with it. This precision is a distance that indicates the expected margin of error for that location, and its calculation may vary depending on the location mechanism used (GPS or location by Wi-Fi networks). Considering these issues, a configurable precision filter was designed using three parameters: p, P and C. The parameters p and P represent two preference thresholds that define three ranges of precision: [0..p), [p..P) and [P..∞) (note that p < P). Locations within the range [0..p) are considered reliable by this filter and are evaluated by the following filter. Those in the range [P..∞) are automatically considered unreliable and discarded. Finally, locations in the intermediate range [p..P) are compared against the last known trusted location. If the distance between the new location and the last known trusted location is less than C (a trusted distance threshold), the location is considered trusted and advances to the next filter. Otherwise, the location is considered untrustworthy and is discarded.
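A minimal sketch of the precision filter is shown below. The parameter names p, P and C follow the text; the default values (20 m, 100 m and 50 m), the use of the haversine formula for distances and the decision to reject intermediate-range locations when no trusted location exists yet are assumptions made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def precision_filter(precision_m, new_latlon, last_trusted_latlon,
                     p=20.0, P=100.0, C=50.0):
    """Return True if the location passes the precision filter.

    p, P and C are in metres; the default values are hypothetical.
    """
    if precision_m < p:       # range [0..p): reliable
        return True
    if precision_m >= P:      # range [P..inf): unreliable
        return False
    # Range [p..P): accept only if close to the last trusted location.
    if last_trusted_latlon is None:
        return False          # assumption: nothing to compare against
    return haversine_m(*new_latlon, *last_trusted_latlon) < C
```

For instance, a location with 50 m precision is accepted when it lies about 33 m from the last trusted location, and rejected when it lies about 111 m away.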
In some cases, a location may be wrong despite reporting a good precision value. For example, this can occur due to interference in GPS signals, or if the database of the Wi-Fi hotspot used to locate the device is outdated. We designed a configurable speed filter using five parameters: the distance C, three speed limits (one each for the Fast, Intermediate and Slow mobility states) and a retention time for known bugs. As in the precision filter, C represents a confidence threshold distance. All locations whose distance to the last reliable location is less than C are considered reliable. If the location is outside the confidence threshold, the filter evaluates its reliability based on the user’s mobility state. We compute the estimated speed at which the user would have had to move to go from the last reliable location to the newly collected one, according to the time elapsed between both locations. This speed is then compared to the speed limit set for the registered mobility state. When the estimated speed is greater than the limit for the current mobility state, the location is discarded. When a wrong location is detected, we store it as a known bug for the duration of the retention time. If a new location is similar to a known bug, it is discarded. Once the retention time has elapsed, the known bugs are deleted.
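The core check of the speed filter can be sketched as follows. The function takes the distance to the last reliable location and the elapsed time as inputs; the speed limits and the value of C are hypothetical placeholders, since the text does not state the values used, and the memory of known bugs is omitted for brevity.

```python
# Hypothetical speed limits in km/h for each Moving state (not from the text).
SPEED_LIMITS_KMH = {"Fast": 150.0, "Intermediate": 40.0, "Slow": 15.0}


def speed_filter(distance_m, elapsed_s, mobility_state,
                 C=50.0, limits=SPEED_LIMITS_KMH):
    """Return True if the new location is plausible given the mobility state.

    distance_m: distance between the last reliable location and the new one.
    elapsed_s:  time elapsed between both locations, in seconds.
    C:          confidence threshold distance in metres (hypothetical default).
    """
    if distance_m < C:
        return True                  # within the confidence threshold
    if elapsed_s <= 0:
        return False                 # assumption: no time elapsed, implausible jump
    speed_kmh = (distance_m / elapsed_s) * 3.6  # m/s to km/h
    return speed_kmh <= limits[mobility_state]
```

For example, a 500 m displacement in 10 s implies 180 km/h, which is rejected in the Slow state, while the same displacement in 60 s (30 km/h) is accepted in the Fast state.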
The GPS altitude filter compares the altitude reported by the location mechanism with the altitude reported by an external source (in this work, ASTER GDEM). When the two are not consistent, the location is categorized as unreliable and discarded. We consider an altitude consistent if it lies between a lower and an upper bound defined relative to the altitude reported by ASTER GDEM. The intention of this filter is to remove data points with large errors. Small differences between the altitude reported by the location mechanism and that reported by ASTER GDEM for the same latitude and longitude may correspond to a tolerable GPS error, or may not be an error at all, for example, when the user is located on different floors of a building.
In outdoor places, such as parks, network location is often inaccurate due to the low presence of Wi-Fi access points and the instability of their signal strength. In these situations, a common problem is that the location obtained is biased in the direction of nearby building areas where Wi-Fi hotspots are present. To address this issue, we designed an outdoor network filter configurable through a single parameter, D. When the user is in an outdoor place, such as a park, a filtering area around the park area is set using D. To determine whether the user is within a park, the last known reliable location of the user is compared with the areas nearby. The areas near the park can be obtained from any source of geographic information. In particular, in this work we use OpenStreetMap, which allows us to consult and download this information freely. This filter is not used if the coordinate was obtained by GPS.
3.2.3. The Buffer
This component receives all the locations that have not been discarded by the filters and makes the final decision on whether each location is trustworthy. The main purpose of the buffer is to identify small jumps or deviations that are difficult to identify with the previous filters. For this, the buffer maintains a small memory that stores at most three locations, kept in their order of appearance. The first position always holds the last known reliable location. When the intermediate location presents a deviation with respect to the first and third locations, it is considered unreliable and is discarded. These deviations are normally identified when the first and third locations are close to each other and far from the intermediate one. The buffer is configurable through two parameters: C and a second parameter that positions a perpendicular line used to tell movement from jumps. The parameter C represents a distance threshold (similar to the one used in the previous filters), which allows the buffer to distinguish between situation A (user standing still) and situations B (user moving) or C (jump or detour).
Figure 4 shows a diagram of how the buffer uses the C parameter when a new location arrives. If the distance between the last reliable location and the new location is within the confidence threshold C, then the new location is considered trusted and becomes the new last reliable location (situation A). If the distance exceeds the confidence threshold C, then the new location is simply queued in the buffer as the intermediate location (situation B/C). In this case, the buffer waits for a third location that allows it to determine whether the intermediate location is a detour or whether the user is moving from one place to another.
The second parameter defines the position of a line perpendicular to the segment between the last reliable location and the intermediate location. When a third location enters the buffer, the buffer distinguishes between situation B (user moving) and situation C (jump or deviation) depending on which side of the perpendicular the third location falls. Figure 5 shows a diagram of how the buffer uses this parameter when a third location arrives. The position of the perpendicular (marked as a dotted line in the diagram) is defined by combining the value of the parameter with the distance between the last reliable location and the intermediate location. If the third location is on the side of the perpendicular that belongs to the intermediate location (blue area in the diagram), then the user is considered to be moving (situation B). In this case, the intermediate location is considered valid and becomes the new last reliable location. Otherwise, if the third location is on the side of the perpendicular that belongs to the last reliable location (yellow area in the diagram), then the intermediate location is considered to be a jump (situation C). In this case, the intermediate location is considered unreliable and is removed from the buffer. In both cases, the third location is then processed as if it were a new intermediate location, following the process previously detailed in Figure 4.
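The buffer decision can be sketched geometrically as follows. The sketch works on planar coordinates in metres and assumes that the perpendicular crosses the segment between the last reliable location and the intermediate location at a fraction `alpha` of its length; this interpretation, the name `alpha` and the function names are assumptions made for this example, since the text only states that the parameter is combined with the distance between both locations.

```python
import math


def within_threshold(reliable, candidate, C):
    """Situation A: the candidate is within distance C of the reliable location."""
    return math.hypot(candidate[0] - reliable[0], candidate[1] - reliable[1]) <= C


def classify_third(reliable, intermediate, third, alpha=0.5):
    """Return "moving" (situation B) or "jump" (situation C).

    All points are (x, y) pairs in metres. `alpha` (hypothetical) places the
    perpendicular at alpha * distance(reliable, intermediate) from `reliable`.
    The two points are assumed to be distinct (they are farther apart than C).
    """
    dx = intermediate[0] - reliable[0]
    dy = intermediate[1] - reliable[1]
    d = math.hypot(dx, dy)
    # Scalar projection of `third` onto the reliable -> intermediate axis.
    proj = ((third[0] - reliable[0]) * dx + (third[1] - reliable[1]) * dy) / d
    # Beyond the perpendicular: the third point sides with the intermediate one.
    return "moving" if proj > alpha * d else "jump"
```

With the last reliable location at (0, 0) and the intermediate location at (100, 0), a third location at (120, 10) is classified as movement, whereas a third location at (5, -3) marks the intermediate location as a jump.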
3.3. Data Processing
The second component of the approach is in charge of processing the events generated by the previous component in order to detect the user’s stays or visits and to identify the different places visited. The proposals in the literature for the detection of stay points and places have some shortcomings. First, to detect a stay, it is usually established that the user must remain within a maximum distance Dmax for more than a given minimum time Tmin. However, since users visit places of different sizes, no single value of Dmax correctly fits all of the user’s stays. A similar problem arises in the algorithms used to identify the places visited by the user from their stay points, which usually define a predetermined size for the places. In addition, most of the approaches proposed in the literature do not identify places as stay points are detected; instead, they are executed at the end of the data collection stage, once all the stay points have been detected.
In this context, the data processing component detailed in Figure 6 is proposed. Algorithm 1 presents the pseudocode of the work done by the “Data Processing” component, where the parenthesized numbers on some lines correspond to labels in
Figure 6. The inputs of this component are the events generated by the first component of the approach (labeled with 1 in
Figure 6). These events are processed by a state machine that is responsible for identifying the user’s trips and stays. The state machine has two main states, the active state and the passive state, and two states called transition or uncertain states, the uncertain active state and the uncertain passive state. As long as the user remains in the active state, she is considered to be traveling, moving from one place to another (for example, from home to work). When going from the active state to the passive state, it is considered that the user arrived somewhere and began the visit or stay in that place. The states of uncertainty occur when a potential transition between the main states is suspected, but not enough information is available yet. The initial state of the state machine is marked with a “*” in the figure.
Algorithm 1 Data Processing Component
procedure update_profile(Event e)    ▹ (1)
    state_machine.update(e)
    if state_machine.stay_ended() then
        new_stay = state_machine.get_last_stay()
        user_mobility_profile.personal_map.update(new_stay)    ▹ (4)
        user_mobility_profile.movement_history.register(new_stay)    ▹ (5)
    else if state_machine.commute_ended() then
        new_DMAX = user_mobility_profile.personal_map.getDmax(e)
        state_machine.set_DMAX(new_DMAX)    ▹ (2)
        new_TMIN = user_mobility_profile.personal_map.getTmin(e)
        state_machine.set_TMIN(new_TMIN)    ▹ (2)
        new_commute = state_machine.get_last_commute()
        user_mobility_profile.movement_history.register(new_commute)    ▹ (5)
    end if
end procedure
Initially, the proposed state machine detects stay points in a way similar to the definition in the literature: the user must stay within a maximum distance Dmax for more than a certain minimum time Tmin. Unlike the works in the literature, however, the values of these two parameters are not static; they vary depending on the location of the user. For this, the proposed approach has a user’s personal map, which provides information on the places near the current location of the user. Initially, this map has information extracted from external sources, such as OpenStreetMap (https://www.openstreetmap.org/ (accessed on 15 December 2023)). Then, as the mobility profile is built, the personal map will contain the places visited by each user and a personalized size for each place. The state machine uses this information to know if the user is in a large place (such as a park) and to adapt the values of the parameters Dmax and Tmin accordingly (labeled with 2 in
Figure 6). Additionally, the proposed state machine considers not only the latitude and longitude of the registered locations but also the precision of those locations and the physical activity of the user. This information, already available in the events generated in the first component, allows us to enrich the state machine and to make better decisions about when to go from the active state to the passive state and vice versa. The operation of the state machine and the conditions to carry out the transitions between its states are detailed in
Section 3.3.1.
When the state machine detects that the user finishes a stay, it proceeds to the learning of places (labeled with 4 in
Figure 6). If the geographic center of the stay does not correspond to any place previously visited by the user, then a new place is identified in the area visited. If, on the contrary, the area of the stay coincides with that of an already known place, then the stay is associated with that place and the location and area of the place are updated combining them with the area of the stay. As the areas of the places are updated, it can happen that two nearby places begin to overlap. Then, both places are combined into a single, larger place. This process of learning places (detailed in
Section 3.3.2) is incremental, allowing the identification of places as the user’s stays are detected. The identified places are used to gradually enrich and personalize the personal map. This makes it possible to improve the detection of the user’s stays when she returns to a previously visited place, especially if it is large, since it allows adjusting the values of Dmax and Tmin according to the size of the place.
The stays and trips or commutes detected by the state machine are stored in the user’s movement history (labeled with 5 in
Figure 6). The movement history keeps a record of each stay or trip with relevant information, such as the start and end date of the stay or commute, the place visited (in the case of stays) and the route traveled (in the case of trips). The movement history and the location map make up the user’s mobility profile.
3.3.1. State Machine
The proposed state machine is in charge of processing the events generated by the first component of the approach (Section 3.2) to detect the stays and trips made by the user. For this, the state machine requires four parameters. Two of these parameters establish the range of values that the state machine can use as the radius of the geographic area where the user must remain for a certain time, which is set dynamically between a lower and an upper bound.
Active State and Uncertain Active State
The active state occurs when it is detected that the user starts traveling from one place to another (for example, from home to work). This state stores the trajectory traveled by the user during the trip (that is, the list of events that make up the trip). As each new event arrives, the state machine evaluates whether the user continues traveling or whether there is any indication that she might be ending the trip. In addition, the event must not be located inside any place already known in the map of places, as this would indicate that the user might be entering that place. If the aforementioned conditions are met, the event is added to the trajectory collected up to that moment. Otherwise, given the suspicion that the user could be arriving somewhere, the state machine moves to the uncertain active state.
The uncertain active state is a state of uncertainty that works as a buffer. Its operation is shown in Figure 7. The state machine remains in this state until it collects enough information (sufficient events) to determine whether the user initiated a stay or only stopped momentarily before continuing the trip (for example, at a traffic light). Every time a new event arrives, the uncertain state adds it to the buffer and analyzes all events in the buffer to identify a stay. In the literature, this process typically consists of calculating the centroid of all locations, then checking that the distance between the centroid and each location is less than a maximum distance and that the time elapsed between the first and the last location is greater than a minimum time. The proposed approach alters this mechanism, granting more flexibility according to the size of the place that the user is visiting. The first step (labeled with 1 in Figure 7) is to calculate the centroid of the locations recorded in all buffered events. Because the frequency with which events are generated is not constant, the traditional centroid calculation can deviate from the location where the user actually stopped. To address this problem, the approach calculates the centroid by weighting each event according to the time elapsed between consecutive events: each event e is weighted by the sum of the time elapsed between e and the previous event and between e and the subsequent event. Once the centroid has been calculated, the second step consists of estimating an area of stay that contains all the events in the buffer. The approach then evaluates whether all the events are in fact contained within that area. If it is not possible to find an area of stay that contains all events, the path made up of the events in the buffer is considered too long to constitute a stay, and the state machine changes to the active state. If the area of stay contains all the events in the buffer, the approach evaluates whether the time elapsed between the first event in the buffer and the new event is enough to detect a stay. If a stay is detected, the state machine changes to the passive state; otherwise, it remains in the uncertain active state.
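The time-weighted centroid described above can be sketched as follows. Events are represented as (timestamp, latitude, longitude) triples sorted by time; this representation, the function name and the fallback to an unweighted mean when all timestamps coincide are assumptions made for this example. Averaging latitude and longitude directly is an approximation that is reasonable only for small areas.

```python
def weighted_centroid(events):
    """Centroid of buffered events, weighting each event by the time elapsed
    to its previous event plus the time elapsed to its subsequent event.

    events: list of (timestamp_s, lat, lon) tuples sorted by timestamp.
    Returns a (lat, lon) pair.
    """
    n = len(events)
    if n == 0:
        raise ValueError("at least one event is required")
    weights = []
    for i, (t, _, _) in enumerate(events):
        before = t - events[i - 1][0] if i > 0 else 0.0
        after = events[i + 1][0] - t if i < n - 1 else 0.0
        weights.append(before + after)
    total = sum(weights)
    if total == 0:
        # Single event or identical timestamps: fall back to a plain mean.
        weights = [1.0] * n
        total = float(n)
    lat = sum(w * e[1] for w, e in zip(weights, events)) / total
    lon = sum(w * e[2] for w, e in zip(weights, events)) / total
    return lat, lon
```

For the events (0, 0.0, 0.0), (10, 0.0, 0.0) and (100, 1.0, 1.0), the weights are 10, 100 and 90, so the weighted centroid is (0.45, 0.45), whereas the unweighted mean would be approximately (0.33, 0.33).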
Passive State and Uncertain Passive State
The passive and uncertain passive states imply that the user is in a stay, visiting some place. These states are in charge of periodically updating the area of stay where the user is and of detecting when the user starts to move to another place (i.e., she starts a journey). The passive state starts when it is detected that the user begins a stay or visit (for example, when the user arrives at home or work). This state is responsible for storing the centroid and the area of stay and keeping them up to date.
With each new event e, the passive state verifies whether the event is inside or outside the area of stay. If it is inside, the passive state updates the centroid of the visit and the area of stay by incorporating the event into the calculations. If the passive state receives an event that is outside the area of stay, the state machine moves to the uncertain passive state with e as the first event in the buffer. Like the uncertain active state, the uncertain passive state is a state of uncertainty that works as a buffer. The state machine remains in this state until it accumulates enough information (sufficient events) to determine whether the user initiated a trip or whether the events outside the area of stay are only small deviations.
3.3.2. Learning of Places
Each time a stay ends, the user’s personal map of places is updated. Most of the approaches in the literature use place extraction techniques that take as input all the stay points detected up to that moment. However, the dependence of mobile devices on a battery discourages the use of this type of technique, since the execution of these algorithms would become increasingly expensive with each new stay detected.
Thus, the use of an incremental algorithm is proposed. Every time a stay is detected, the proposed approach verifies whether its centroid is located inside some place already present in the map. If the stay does not match any place known so far, the approach creates a new place with an area identical to the one defined by the passive state at the time the stay ended. If, on the contrary, the stay coincides with an already known place, the geographic area of that place is updated. Updating a place consists of combining its geographic center and surface with those of the stay area. In this process, the area of the place is weighted over the stay area based on the number of previous stays used to calculate the current place area. If the centroid of the stay lies within the area of two or more different places, the approach selects the place whose centroid is closest to that of the stay.
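The incremental update can be sketched as follows. The sketch assumes circular places described by a planar center and a radius in metres, and it interprets the weighting as a running average in which the existing place counts as many times as the number of stays already merged into it; these modeling choices and all names are assumptions made for this example. The merging of overlapping places is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Place:
    x: float          # center, metres (planar approximation, assumed)
    y: float
    radius: float     # size of the place, metres (circular shape assumed)
    stays: int = 1    # number of stays used to compute the current area


def learn_place(places, sx, sy, sradius):
    """Update `places` with a finished stay centered at (sx, sy)."""
    containing = [p for p in places
                  if math.hypot(p.x - sx, p.y - sy) <= p.radius]
    if not containing:
        # Unknown area: create a place identical to the stay area.
        place = Place(sx, sy, sradius)
        places.append(place)
        return place
    # Several candidates: choose the place whose center is closest to the stay.
    place = min(containing, key=lambda p: math.hypot(p.x - sx, p.y - sy))
    n = place.stays
    # Weight the existing place by its number of previous stays.
    place.x = (n * place.x + sx) / (n + 1)
    place.y = (n * place.y + sy) / (n + 1)
    place.radius = (n * place.radius + sradius) / (n + 1)
    place.stays = n + 1
    return place
```

Because each update only touches the matched place, the cost of processing a stay does not grow with the number of stays already detected, which is the property that motivates the incremental design.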