
CCNC Paper CameraReadyVersion 7P

This document discusses using machine learning with partially labeled data to detect whether a mobile user's environment is indoor or outdoor. The authors aim to perform this indoor/outdoor detection (IOD) automatically at the network side using radio measurements collected from mobile devices in a crowdsourced manner. Their key contributions are: (1) using additional timing advance radio metrics collected by the network for IOD classification in addition to existing signal strength and quality measurements, and (2) designing a semi-supervised learning method to train an IOD classifier using a partially labeled crowdsourced dataset to minimize human labeling efforts.

Uploaded by

Obeid Allah

Machine Learning with partially labeled Data for Indoor Outdoor Detection
Illyyne Saffar, Marie-Line Alberi-Morel, Kamal Deep Singh, César Viho

To cite this version:

Illyyne Saffar, Marie-Line Alberi-Morel, Kamal Deep Singh, César Viho. Machine Learning with partially labeled Data for Indoor Outdoor Detection. CCNC 2019 - 16th IEEE Consumer Communications & Networking Conference, Jan 2019, Las Vegas, United States. pp.1-7, ⟨10.1109/CCNC.2019.8651736⟩. ⟨hal-02011454⟩

HAL Id: hal-02011454


https://hal.science/hal-02011454
Submitted on 12 Feb 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Machine Learning with partially labeled Data for Indoor Outdoor Detection

Illyyne Saffar, Service Automation, Nokia Bell Labs, Nozay, France
Marie Line Alberi Morel, Service Automation, Nokia Bell Labs, Nozay, France
Kamal Deep Singh, Laboratoire Hubert Curien, University of Saint-Etienne, Saint-Etienne, France
Cesar Viho, IRISA - INRIA, University of Rennes 1, Rennes, France

Abstract: This paper demonstrates the feasibility of a hybrid/semi-supervised classification method for detecting the environment of an active mobile phone, based on both labeled and unlabeled cellular radio data. Precisely, we answer the following question: what is the environment of the mobile user when it is/was experiencing a mobile service/application: indoor or outdoor? Implementing this method within the mobile network is interesting for mobile operators since it has low complexity, is less human intrusive (minimal intervention of mobile users) and is more accurate. The semi-supervised classification algorithm learns to identify the environment using large, real collected 3GPP signal measurements. Compared to existing work, in addition to the existing parameters used for classification, we propose to also use a radio metric called Timing Advance, which is computed within the mobile network. We empirically validate the innovative semi-supervised algorithm using new real-time radio measurements, with partial ground truth information, gathered daily, weekly, monthly, from indoor and outdoor locations and from multiple typical and diversified environments crossed by mobile users. The study confirms the effectiveness of the proposed scheme compared to the existing supervised classification methods including SVM and Deep Learning.

Index Terms: Environment classification, Machine Learning, Indoor Outdoor Detection, 3GPP radio measurement, crowdsourcing, real user activity.

I. INTRODUCTION

Recent technological breakthroughs have extended mobile phones' features, functions and capabilities, which are now used for more than just communicating or providing applications. Recently, mobile devices have been used to learn the consumption habits of individuals and communities [1], [2], [3]. Thus, our purpose is to inject this learned cognition into mobile 5G networks to help them grow smarter and be more efficient when faced with the increasing complexity of network management combined with numerous new applications and their heterogeneous needs.

As a first step to bring such additional knowledge to the network, we target Indoor/Outdoor Detection (IOD) in this paper. IOD refers to the estimation of the mobile users' environments, that is, inferring whether the user is indoor or outdoor. IOD is a cornerstone of user behavior contextualization, which in turn can be used for learning the user behavior, adapting mobile network resources, etc. [4], [5]. The idea is to have more information on the user, such as his environment type or his location.

IOD can be performed automatically and in real time using machine learning techniques, which in turn need data for learning. Thus, data collection is the first phase of designing an IOD solution based on machine learning. Recently, a new crowdsourcing approach [6], [7] has become popular for collecting and analyzing large, real network measurement datasets coming from mobile phones or any other connected devices. This method exploits smartphones (with built-in cellular network interface) with their various measurement sensors. Additionally, data obtained from smartphones has the natural mobility vector of people carrying them. This ensures cost-effective, continual and fine-grained spatio-temporal monitoring and analyses of mobile networks. For our work, we propose to investigate this concept of large and real crowdsourced measurements for IOD. We also propose to extend it to mobile networks to deal with the challenge of detecting the environmental context of mobile users from the network side. The idea is to collect data, which is measured or derived within the network, and then consider it as an input for the machine learning based classifier used for training, learning and then detection. The data measured by multiple UEs during their connection is sent to the eNB using standardized procedures.

Such solutions are interesting for mobile network operators that wish to exploit cognition of user behavior to optimize/customize their service delivery with minimal intervention of the users. Furthermore, such measurements, as an alternative to coverage modeling or drive tests [6], capture reality well and reveal the real life of a mobile user while at the same time being less expensive. This method can then be implemented by the operators in their networks, as a generic solution, independent of the implementations of particular manufacturers. Consequently, it allows the mobile network to exploit direct measurements at the user side to deduce contextual factors such as the user environment.

In 4G/5G cellular networks, such solutions are technically feasible since an enormous amount of mobile measurement data is collected by the mobile terminal. This data is regularly sent to the network using standardized protocols and interfaces during each UE's connection to the cell (on a per-procedure basis and on a network-defined event basis). This measurement data is referred to as LTE UE Measurement Data (LUMD) [8]. LUMD contains rich information on mobile performance and RF metrics such as signal strength (Reference Signal Received
Power or RSRP), signal plus interference and noise strength (Reference Signal Received Quality or RSRQ). It also includes the Channel Quality Indicator (CQI) that is a function of SINR.

In this work, we aim to achieve the following objectives:
• (1) infer the user environmental context from certain LUMD metrics collected in crowdsourcing mode and from the radio metric Timing Advance, assessed by the network when the user is connected to a session. The environment considered is divided into two main types:
  – Indoor: at home, in a restaurant, in a cafe, at work or in other building types, etc.
  – Outdoor: pedestrian, running or in a car moving at high speed.
• (2) consider the constraint that the inference shall be done at the network side with minimal human interaction or intervention.

To achieve (1) and (2), we design a method for training an automatic IOD classifier based on a weakly or partially labeled crowdsourced dataset. Such a dataset reduces human intervention to the lowest possible level. Indeed, the labeled data used for machine learning training is tagged either manually or automatically. Manual data tagging can be expensive, complex and even infeasible for mobile operators if they have to tag all the collected crowdsourced data.

In this paper, we are interested in Machine Learning (ML), one of the popular techniques for automatic IOD. Among ML families, we consider supervised learning and more particularly semi-supervised learning, which can be seen as a mix of supervised and unsupervised approaches. Supervised learning is better suited to classification tasks. It uses labeled data to learn the mapping between data and labels. Unsupervised learning looks for patterns and structures within the data for tasks such as clustering. Semi-supervised learning, which is a hybrid approach, is becoming popular with the growing abundance of data. It proposes a learning scheme based on a partially or weakly labeled dataset in order to achieve a classification task or a function approximation task. In our case, semi-supervised learning allows the mobile operator to use labeled data from a few users and combine it with a lot of unlabeled and easily available data collected from several users. This combination makes it possible to learn all possible environment types related to the user behavior.

The rest of this paper is organized as follows. Section II describes the main IOD works in the literature. In Section III, a comparative analysis of crowdsourcing and drive-test data collection modes is provided. In Section IV, results with supervised classification and clustering algorithms are given. Sections V and VI present a new Deep Learning-based semi-supervised learning approach proposed for IOD from the network side. Section VI discusses the results.

II. RELATED WORK

In the literature, the IOD issue has not been widely studied: only a few works address it. Proposed solutions are usually divided into two categories [9]. IOD is either considered as a statistical issue where a weighted score or a threshold is defined to determine the mobile environment, or as a classification problem sorting mobile users between multiple classes. In most of these works, only two classes are considered (Indoor/Outdoor) but, in some works, three classes are selected (e.g. Indoor/Semi-Outdoor/Outdoor). Figure 1 illustrates the whole dependency of existing classes.

Fig. 1. Example of IOD classification scheme: in 3 main classes

In addition to such categorization, the IOD problem can also be distinguished based on the location where IOD is performed, either at the mobile terminal side or at the mobile network side. In the following, we highlight some of the works dealing with the IOD issue, presenting them according to this classification.

In the first category, [10] looks at a threshold of signals collected from some phone sensors related to radio signals, cell signal strength, light intensity as well as the magnetic sensor to infer whether the mobile user is indoor or outdoor. However, this threshold is specific to the experimental settings where it is calculated. It is not generalizable to new environments. Thus, using just a threshold decreases the IOD accuracy. Similar to [10], the work in [5] uses the same signals, but also considers sound intensity, battery temperature and the proximity sensor. For IOD, they propose a semi-supervised approach: a co-training solution. They use 2 classifiers in parallel with a weighted score of classification probability to improve the final performance of IOD. For every classifier, they select a different set of sensors to learn different perspectives and patterns. This work shows high performance (more than 90% accuracy) in the detection of new instances in unknown environments. However, the impact of this work is limited since their database is not highly representative. Indeed, the data set used was only collected in three places (the campus area, city center, residential area), which are not enough to train a general IOD system.

The work in [4] proposes a video streaming optimization based on adaptation as a function of the user location in time. For that, IOD is computed via a Bayesian detector that combines measurements from two smartphone sensors to decide the user environment type.

In the second category, the authors of [11] optimize the use of radio measurements in wireless networks. Specifically, they use radio signal measurements collected in different situations
of mobility with varying speed (low, medium, high), namely (pedestrian, incar and unmoving). They dynamically estimate the signal attenuation. This in turn helps them to efficiently classify the mobile user environment (pedestrian, incar, unmoving) and finally improves the handover process. The authors assume that once the signal power attenuation is estimated correctly, one can easily classify whether the mobile user is pedestrian, in car or unmoving. This is because the measured signal power for an unmoving user does not show much variation, unlike the incar or pedestrian cases. Nevertheless, this proposition is still at an early stage and has not been thoroughly developed yet. In [8], the main issue is to localize the mobile user by estimating its longitude and latitude as accurately as possible. For this, they made the assumption that mobile users are outdoor, thus giving rise to the importance of IOD and the necessity to classify the user environment. For the classification task, they used RSRP and RSRQ signals and tested many algorithms: SVM, logistic regression and random forest. SVM was the retained solution since it performed best.

In this paper, we focus on IOD automation within the network side using machine learning algorithms. They are trained using a large real dataset while minimizing the mobile user interaction (minimal labels). We look at the performance in terms of F1-scores of supervised and semi-supervised IOD methods. The goal is to evaluate the minimal amount of labeled data required for obtaining good IOD performance.

III. COLLECTED DATA FOR IOD

In this section, we analyze the statistical differences by focusing on the empirical cumulative distribution function (CDF) between indoor and outdoor environments, using a large and real data set collected at multiple places and in many environments. We illustrate the impact of the two environments on the empirical CDFs, according to where the data is collected.

A. Data Description

Our large data set consists of Time, 3 LUMD radio signals, the metric Timing Advance (TA) and the label when it is known. Thus, it has a vector of 6 features with the label:
• Time: time of signal record.
• RSRP: the average received power of the Reference Signal (RS), between -140 dBm and -44 dBm [12], sent by the eNB.
• RSRQ: the ratio between RSRP and RSSI (Received Signal Strength Indicator), between -19.5 dB and -3 dB [12]. RSSI represents the total power of the received signal (including the transmitted signal, the noise and the interference).
• CQI: indicator reported by the UE to the eNB that gives the most appropriate modulation scheme and coding scheme to be used for transmission [13].
• TA: used to control uplink signal transmission timing. It is indicated by the eNB to the UE via a Timing Advance command [14].

The set of these signals was collected during 9 months, 24/7 (from October 2017 until June 2018), with an average of 1 measurement per 15 seconds while the mobile phone session is active and 1 measurement per 2 minutes otherwise. The dataset is made of 40% labeled data and 60% unlabeled data. The 9-month collection was performed in many different environments like mountains, beaches, forests, companies, cafes, streets, bars, parks, restaurants, lakes, etc. It was also performed in many cities and places like countryside, villages, small cities, metropolises, and different countries, but for this paper we only study the data collected in France (Figure 2). This long collection period allows us to have data reflecting all weather types (heavy rain, foggy, sunny, snowy, windy, rainy, ...), i.e. almost the 4 seasons. Therefore, with this data measurement campaign, we try to be as close as possible to the complexity and the variety of a mobile user moving in the real world.

Fig. 2. Data collection Points in France: multiple environments and places

B. Data collection: crowdsourcing vs. drive-test mode

In crowdsourcing mode, the collected data consists of signals measured by the mobile phone and sent to the eNB. Our dataset described in the previous subsection was collected using this mode. Figure 3 shows the empirical cumulative distribution functions (CDFs) of RSRP and CQI obtained with the dataset. The significant offset between the indoor and the outdoor curves results from the substantial difference and attenuation variation in radio signal propagation. It is mainly due to reflection, diffraction, dispersion and attenuation experienced in the indoor environment. However, we note that there is some overlap between the ranges of RSRP and CQI values. Also, the extreme values seen in the indoor and outdoor CDFs (located in the tails) become similar and the division between the two gets blurred. The behaviour at the juncture of extreme values can be explained by the ambiguous characteristics of the environment when a user is at high speed (train, car, ...) or when he is in a semi-indoor environment (like balconies, semi-open buildings, near a window, etc.). We argue that these points are ambiguous and will pose a good challenge for supervised classification, since they can equally be classed as indoor or outdoor.
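The empirical CDF comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the helper names `ecdf` and `max_cdf_gap` and the RSRP sample values are hypothetical, invented only to show how F(x) is computed and how the vertical offset between an indoor and an outdoor curve could be quantified. The offset measure used here, the largest vertical distance between the two curves, is an assumption of this sketch and is not a statistic reported in the paper.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return a function F where F(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("ecdf needs at least one sample")

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


def max_cdf_gap(samples_a, samples_b):
    """Largest vertical distance between the two empirical CDFs."""
    F_a, F_b = ecdf(samples_a), ecdf(samples_b)
    # Both curves are step functions, so the gap only changes at observed values.
    points = sorted(set(samples_a) | set(samples_b))
    return max(abs(F_a(x) - F_b(x)) for x in points)


# Hypothetical RSRP readings in dBm, made up for illustration only.
indoor_rsrp = [-118, -112, -109, -105, -101, -97]
outdoor_rsrp = [-104, -98, -92, -88, -83, -76]

F_in = ecdf(indoor_rsrp)
F_out = ecdf(outdoor_rsrp)
print("F_indoor(-100 dBm)  =", F_in(-100))
print("F_outdoor(-100 dBm) =", F_out(-100))
print("largest gap between curves =", max_cdf_gap(indoor_rsrp, outdoor_rsrp))
```

A large gap corresponds to well separated curves, while values shared by both sample sets (the overlapping ranges mentioned above) are where a classifier relying on that single metric would be expected to struggle.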
Fig. 3. Empirical CDF F(x) for measured RSRP in dBm (left) and CQI (right) in crowdsourcing mode: multiple environments and places - Indoor (red) and Outdoor (green).

Fig. 4. Empirical CDF F(x) for measured RSRP in dBm (left) and CQI (right) in drive-test type mode: specific environments and places - Indoor (red) and Outdoor (green).

An alternate data collection mode, widely used to collect data, is the drive-test mode. However, this mode limits how well the collected data captures reality. Such data collection campaigns are run for limited hours per day during a short period (a couple of weeks) and at some specific places. To model this way of collecting data, referred to as drive-test mode, we extract a portion of data (EPD) from the whole dataset. With this selected EPD, we aimed to be as close as possible to the type of places where the drive test was performed by one of the top 3 American operators in New York City in [8]. Therefore, to build EPD we consider data only in a metropolis (Paris and southern suburbs, see Figure 5). Indeed, Paris, as a metropolis, has a dense and specific architecture which allows better comparison with NYC. Concerning indoor data, we selected instances where the user was strictly indoor and, thus, not in "semi-indoor" positions like semi-open buildings or balconies, etc. For outdoor data, we chose the instances where the user was either a pedestrian or in a vehicle in different city streets (limited speed). Thus, to mimic drive tests, we ignored data coming from environments like subway, countryside, forest, beaches, mountains, etc. We did this to enable a fair comparison between the two modes. Figure 4 shows well separated RSRP empirical CDFs between the indoor and outdoor classes. The superimposed points of both CDFs that we judged conflicting have disappeared. The overlap between both CDFs, which previously led to ambiguity, has disappeared. This is due to the significant distance between the indoor and the outdoor curves. In the case of the CQI CDFs we notice a similar phenomenon. This analysis allows us to argue that supervised classification will run better on a labeled dataset collected in drive-test mode than on one obtained through crowdsourcing mode.

Fig. 5. The Data collection Points of EPD in drive-test like mode: Paris and southern suburbs

IV. CLASSIFICATION USING SUPERVISED LEARNING OR CLUSTERING

After analyzing the statistical properties of I/O environments, we first evaluate the accuracy and the performance of supervised classifiers for IOD. For this, we use the accuracy metric, which is the ratio of correctly classified instances to the total number of instances, and the F1-score metric, which is by definition the harmonic mean of Precision and Recall according to the following relation:

$$F_1\text{-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where precision is the number of correct positive results divided by the number of all positive results returned by the classifier, and recall is the number of correct positive results divided by the number of all relevant samples. The F1-score is one of the most used metrics in the case of unbalanced data classes. Indeed, the statistics of our data show that the data proportion between indoor and outdoor classes is unbalanced: 65% Indoor vs. 35% Outdoor. This reflects reality since people, in general, spend more time at home or in indoor environments than in outdoor environments. For the experiments, we divided the dataset as follows: 70% for training, 30% for validation and test. We evaluate the impact of both input pairs, (RSRP, RSRQ), which is the reference input for IOD in the literature, vs. (RSRP, CQI), in three cases:
• Training and evaluation on labeled EPD collected in drive-test like mode (see Table I),
• Training on labeled EPD and evaluation on the rest of the labeled data of crowdsourcing mode, thus operating with unknown environments (see Table II) and,
• Training and evaluation on labeled data collected in crowdsourcing mode (see Table III).

As shown in Table I, running either classification (SVM, Random Forest, Neural Network) or clustering (k-means) algorithms on EPD, obtained from drive-test like mode, shows an excellent performance with an F1-score of 99%, which is
close to the reference result found in the literature [8]. However, when the algorithm trained on EPD is used to perform IOD directly on crowdsourced data, a dramatic performance deterioration is observed, as seen in Table II. The best algorithm is SVM, which gives an F1-score of 61.7%. But this is not an acceptable performance for IOD. In the third case, the performance of the supervised classifier, where training as well as evaluation is performed on the crowdsourced labeled data, is shown only for SVM, which performed best. Table III shows a noticeable enhancement of the F1-score to 83.71%. This is a moderately acceptable performance. We are still far from the reference in the literature. For the target, we can assume that an error of 5-8% is tolerable for the IOD system. Indeed, when dimensioning mobile networks, an error of up to 10% is considered admissible. Additionally, Tables I and III show that (RSRP, CQI) as input provides similar results to (RSRP, RSRQ) when used for classifying EPD or crowdsourcing data. The results are even slightly better in the case of Table II with (RSRP, CQI). RSRQ and SINR (note that CQI is based on SINR) are both radio measurements that depend on signal and interference strength. The results show that the information contained in CQI is also useful for IOD and thus, (RSRP, CQI) is also a good candidate for IOD.

TABLE I
CLUSTERING AND CLASSIFICATION PERFORMANCE: TRAINING AND EVALUATION ON LABELED DATA (EPD) OF DRIVE-TEST LIKE MODE

Algorithm        RSRP-RSRQ               RSRP-CQI
                 Accuracy    F1-score    Accuracy    F1-score
k-means          99.68%      99.48%      99.67%      99.47%
SVM              99.75%      99.59%      99.76%      99.60%
Neural Network   99.50%      99.18%      99.57%      99.28%
Random Forest    99.83%      99.72%      99.77%      99.62%

TABLE II
CLUSTERING AND CLASSIFICATION PERFORMANCE: TRAINING ON EPD AND EVALUATION ON LABELED DATA OF CROWDSOURCING MODE

Algorithm        RSRP-RSRQ               RSRP-CQI
                 Accuracy    F1-score    Accuracy    F1-score
k-means          61.41%      60.07%      59.64%      57.77%
SVM              56.56%      36.13%      62.69%      61.71%
Neural Network   50.90%      44.58%      62.55%      61.54%
Random Forest    62.93%      61.99%      62.63%      61.59%

TABLE III
SVM PERFORMANCE: TRAINING AND EVALUATION ON LABELED DATA OF CROWDSOURCING MODE

Algorithm   RSRP-RSRQ               RSRP-CQI                RSRP-CQI-TA
            Accuracy    F1-score    Accuracy    F1-score    Accuracy    F1-score
SVM         85.48%      83.66%      85.54%      83.71%      90.17%      89.11%

As we guessed, the performance of IOD classification when trained only on EPD and then tested on the crowdsourced data drops in terms of F1-score and accuracy. This is due to the presence of ambiguous points combined with unknown environments not included in the drive-test data. Consequently, learning the user environment based only on drive-test data is not enough to learn the complexities of users' real life.

We continue the study with SVM as the reference supervised classifier. The inputs (RSRP, CQI) are more appropriate for doing IOD from the network infrastructure point of view since these signals are sent more regularly to the eNB than RSRQ. Lastly, we propose to add a new signal, called Timing Advance (TA), to enhance the IOD performance. The idea is to exploit the information on the distance between the eNB and the mobile users embedded in the TA parameter. This would help the supervised classifier to classify the ambiguous points (e.g. measurement points with low RSRP, but close to the eNB, etc.). So, in the case of (RSRP, CQI, TA), the SVM performance reaches an F1-score of 89.11% and an accuracy of 90.17% (Table III). As a result, with the addition of TA, IOD using SVM performs better, leading to a gain of 6%. TA notably contributes to solving the classification issue of some ambiguous points.

V. HYBRID/SEMI-SUPERVISED APPROACH

To avoid performance degradation when facing new unknown environments, it is preferable to train the IOD classifier using a more diversified dataset. From a data collection point of view, it is more interesting for the operator to collect massive partially tagged data. Indeed, first, during online labelling it reduces the network load by limiting the amount of UL signalling (all labels) sent to the eNB and, second, it reduces the complexity and the time for tagging data. Thus, the idea is to use the available tagged data, which is costly to obtain, and combine it with untagged data, which is easy to obtain, for classifier training. However, one of the main questions is: what percentage of tagged data is needed for satisfactory performance of the intelligent IOD system?

We suggest a hybrid/semi-supervised learning system (HSSL) that can learn additional new environments, without the need to involve more users to gather the ground truth (the indoor/outdoor tag). As in [5], [15], [16], our approach uses both tagged and untagged data in order to improve the IOD classifier training, while maintaining the same good performance for a given ratio of tagged and untagged data. The proposed system is composed of 2 main modules as shown in Figure 6. The role of the first module is to label the untagged data. It uses an unsupervised clustering algorithm, called "Bayesian Gaussian Mixture" (BGM), which is fast and efficient. The second module uses this tagged data output to learn the user environment via a supervised learning classifier that can be SVM or a Deep Learning algorithm.

Recently, Deep Learning (DL) approaches have emerged which show improvements compared to classical approaches such as SVM [17], [18]. Moreover, from an operator point of view, IOD is a complicated task since millions of users are considered during a longer period, which heavily increases the dataset size. Therefore we propose to also investigate Deep Learning (DL) over the huge crowdsourced dataset. After configuring DL using a Grid Search to find the parameters that best optimize the IOD classifier, we conduct
a comparative study between SVM, HSSL (using SVM), DL and HSSL (using DL).

Fig. 6. IOD Learning Scheme for Weakly Labelled data: a hybrid/Semi-supervised Machine Learning approach

In this approach, the first module detects 2 clusters. Once detected, they are employed to label the untagged data. An optimizer module then processes the data before sending it to the second module. It corrects and minimizes the labeling errors resulting from clustering. For this, we assume that a user cannot change his environment twice in 30 seconds. The idea can be explained with the following example. Imagine that we have three consecutive points, very near in time, in the dataset. If, for example, the first point is mapped as indoor, the second is mapped as outdoor and the next point, very near in time, is again mapped as indoor, then we assume that there is an error in mapping. This is because a user cannot change its environment two times so quickly. Thus, the optimizer module detects and corrects such errors.

Let E_t be the environment type of the user at time t, and let t-1, t and t+1 be consecutive measurement times. If the time difference between t-1 and t+1 is less than or equal to 60 s, then E_{t-1} = E_t = E_{t+1}. The optimizer parses the data tagged by the clustering, checks this assumption and corrects the BGM prediction if necessary, see Algorithm 1.

Algorithm 1: Time Optimizer
Data: output of the clustering: tagged data
Result: optimization and tag correction
for E_t in clustering-tagged dataset do
    if Diff(t-1, t+1) ≤ 60 s then
        if E_{t-1} = E_{t+1} and E_{t-1} ≠ E_t then
            E_t ← E_{t+1}
        end
    end
end

The clustering and the optimizer together form the first module of the HSSL system, which deals with labeling of the unknown data tags. The input of the first module consists of untagged data. The output

VI. RESULTS AND DISCUSSION

This section evaluates the performance of HSSL on the crowdsourced data. It answers the question of what amount of tagged data is required, alongside the untagged data, for the intelligent IOD system to achieve an F1-score higher than 90%.

We have used both scikit-learn [20] and keras [21] in python for the HSSL implementation. The DL module is a feed-forward neural network (fully connected) with 8 hidden layers using ReLU as the activation function. ReLU is the most widely used activation function when designing neural networks today. The main advantage of using ReLU over other activation functions is that it does not activate all the neurons at the same time. It leads to a sparse network that is efficient and easy to compute. As for the last layer (the output layer), we used a sigmoid activation function since we look for a binary classification, either 0 or 1 (for indoor/outdoor environments).

The HSSL evaluation is first done in 2 validation steps:
• (i) The first module (BGM + Optimizer) achieves an F1-score of 85.99%.
• (ii) The second module (supervised learning using SVM or DL) achieves an F1-score of 89.11% with SVM and of 92.81% with DL.

Once confirmed that both modules have convincing performance, we evaluate the whole HSSL system. The system receives both labeled and unlabeled data as inputs. The evaluation goal is to find out for what percentage of labeled data (and unlabeled data) the performance of HSSL goes above the target F1-score of 90%. For this, we compare HSSL (including SVM or DL) with SVM and DL alone, when trained over the same amount of tagged data (with the only difference that HSSL also uses untagged data). Figure 7 shows the performance of HSSL (DL or SVM) and of supervised SVM and DL for different percentages of labeled data. We observe that IOD performs better using DL than using SVM in both cases. We also observe that only DL and HSSL(DL) achieve the tolerable error of 5-8% for an IOD system in a mobile network. However, we note that HSSL is slightly better than DL for almost all percentages of labeled data. HSSL(DL) reaches the maximal F1-score of 93% for the distribution of 65% labeled data and 35% unlabeled data. Also, it can be seen that, for an operator, a dataset composed of only 10% of tagged data, approx. 1 month of collected data out of the total 9 months, is enough to learn the user environment.

To conclude, the proposed HSSL system, trained with a partially tagged data set, is able to make a good distinction of the user environment. We also note that supervised DL is
is considered as the first input of the second module dealing better than SVM. This is because DL uses several layers of
with the supervised classifier. This output is a vector of 4 neurons and is able to capture more mappings. HSSL(DL)
measurements [RSRP, CQI, TA, Class*], where Class* is the showed only slightly better performance as compared to DL.
estimated labels by the first module. The second input of the This is because we studied simple IOD with detection of
second module is composed of the labels (the ground truth) only 2 classes. In future, we plan to study detection of more
forming a measurement vector of size 4 [RSRP, CQI, TA, environments such as in-car, pedestrian, etc., instead of just
Class]. outdoor or indoor. We will compare HSSL(DL) with DL with
more classes, environments and yet more users and data. The hypothesis will be that adding unlabeled data might improve the HSSL performance more.

Fig. 7. Evaluation of the HSSL system: F1-score (%) vs. percentage of labeled data (%), with curves for HSSL(DL), DL, HSSL(SVM) and SVM.

Nevertheless, HSSL with SVM trains faster, with a duration of 32.95 s on a machine having 12 CPUs and 32 GB of RAM. Training with DL is slower: about 23.35 minutes using the same machine. SVM and HSSL with SVM converge quicker than DL. In future, we will try training using a GPU.

VII. CONCLUSION

In this paper, we investigated the problem of IOD performed at the network side using 3GPP signals and Timing Advance data collected inside the infrastructure. We first showed that using a drive test dataset is insufficient to mimic the real-world complexity and reveal the real user behavior. By diversifying the environments more (using a highly representative crowdsourced dataset) during the training phase, we showed that the more environments we have for the training phase, the better the supervised classifier performs. We also showed that adding a new parameter, Timing Advance, can improve IOD performance.

To address the fundamental issue of adapting the model to new and diversified environments without making it hard and expensive for the operators (especially due to the labelling task), we proposed a new hybrid/semi-supervised learning (HSSL) system. The HSSL system presents satisfactory performance even when facing unknown environments.

We plan to extend our work on IOD in future and address the IOD issue by considering systems that take time variations into account. Thus, using other algorithms with time correlation, like LSTM, would probably boost the HSSL system and decrease the portion of labeled data required to obtain an F1-score of 95%.

ACKNOWLEDGMENT

We would like to thank Xavier Lagrange and Jean-Marie Bonin of INRIA for insightful discussions on the IOD issue and user behavior contextualization. We would also like to thank Jakob Hoydis of Nokia Bell Labs for many helpful discussions on machine learning and deep learning topics.

REFERENCES

[1] BULUT, E., and SZYMANSKI, B. K., Understanding User Behavior via Mobile Data Analysis, Proc. IEEE ICC Workshops, Dynamic Social Networks, DYSON, London, June 8, 2015, pp. 1548-1553.
[2] B. O. Holzbauer, B. K. Szymanski, and E. Bulut, Impact of Socially Based Demand on the Efficiency of Caching Strategy, in Proceedings of IEEE PerCom 2014, IQ2S workshop, Budapest, Hungary, 2014.
[3] Q. Xu, Z. M. Mao, A. Arbor, J. Erman, F. Park, A. Gerber, J. Pang, S. Venkataraman, Identifying Diverse Usage Behaviors of Smartphone Apps, Proceedings of the 2011 ACM SIGCOMM conference, Internet measurement conference, 2011, pp. 329-344.
[4] MEKKI, Sami, KARAGKIOULES, Theodoros, and VALENTIN, Stefan. HTTP adaptive streaming with indoors-outdoors detection in mobile networks. arXiv preprint arXiv:1705.08809, 2017.
[5] RADU, Valentin, KATSIKOULI, Panagiota, SARKAR, Rik, et al. A semi-supervised learning approach for robust indoor-outdoor detection with smartphones. In: Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014. p. 280-294.
[6] MARINA, Mahesh K., RADU, Valentin, and BALAMPEKOS, Konstantinos. Impact of indoor-outdoor context on crowdsourcing based mobile coverage analysis. In: Proceedings of the 5th Workshop on All Things Cellular: Operations, Applications and Challenges. ACM, 2015. p. 45-50.
[7] CAINEY, Joe, GILL, Brendan, JOHNSTON, Samuel, et al. Modelling download throughput of LTE networks. In: Local Computer Networks Workshops (LCN Workshops), 2014 IEEE 39th Conference on. IEEE, 2014. p. 623-628.
[8] RAY, Avik, DEB, Supratim, and MONOGIOUDIS, Pantelis. Localization of LTE measurement records with missing information. In: Computer Communications, IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on. IEEE, 2016. p. 1-9.
[9] EDELEV, Sviatoslav, PRASAD, Sunaina Nelamane, KARNAL, Hemanth, et al. Knowledge-assisted location-adaptive technique for indoor-outdoor detection in e-learning. In: Pervasive Computing and Communication Workshops (PerCom Workshops), 2015 IEEE International Conference on. IEEE, 2015. p. 8-13.
[10] ZHOU, Pengfei, ZHENG, Yuanqing, LI, Zhenjiang, et al. Iodetector: A generic service for indoor outdoor detection. In: Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems. ACM, 2012. p. 113-126.
[11] ALAYA-FEKI, Afef Ben Hadj, LE CORNEC, Alain, and MOULINES, Eric. Optimization of Radio Measurements Exploitation in Wireless Mobile Networks. JCM, 2007, vol. 2, no 7, p. 59-67.
[12] 3GPP TS 36.133: "Evolved Universal Terrestrial Radio Access (E-UTRA); Requirements for support of radio resource management", Release 8.
[13] 3GPP TS 36.213: "Evolved Universal Terrestrial Radio Access (E-UTRA); Physical layer procedures", Release 8.
[14] 3GPP TS 36.321: "Evolved Universal Terrestrial Radio Access (E-UTRA); Medium Access Control (MAC) protocol specification", Release 8.
[15] ZHU, Xiaojin and GOLDBERG, Andrew B. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 2009, vol. 3, no 1, p. 1-130.
[16] ZHANG, Jun, CHEN, Xiao, XIANG, Yang, et al. Robust network traffic classification. IEEE/ACM Transactions on Networking (TON), 2015, vol. 23, no 4, p. 1257-1270.
[17] LECUN, Yann, BENGIO, Yoshua, and HINTON, Geoffrey. Deep learning. Nature, 2015, vol. 521, no 7553, p. 436.
[18] GOODFELLOW, Ian, BENGIO, Yoshua, COURVILLE, Aaron, et al. Deep learning. Cambridge: MIT Press, 2016.
[19] JIANG, Chunxiao, ZHANG, Haijun, REN, Yong, et al. Machine learning paradigms for next-generation wireless networks. IEEE Wireless Communications, 2017, vol. 24, no 2, p. 98-105.
[20] PEDREGOSA, Fabian, VAROQUAUX, Gaël, GRAMFORT, Alexandre, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011, vol. 12, no Oct, p. 2825-2830.
[21] CHOLLET, François, et al. Keras. 2015.
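
As an illustration of Algorithm 1 (Time Optimizer), the label-correction rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `time_optimizer`, the `(timestamp, label)` sample layout, the 0/1 label encoding, and the choice to sweep the sequence in time order while reusing already corrected neighbours are all hypothetical. The samples are assumed to belong to a single user and to be sorted by time.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, cluster label).
# Labels are 0/1; which value means indoor or outdoor is arbitrary here.
Sample = Tuple[float, int]


def time_optimizer(samples: List[Sample], max_gap_s: float = 60.0) -> List[Sample]:
    """Correct isolated label flips between two agreeing neighbours.

    Assumes `samples` come from one user and are sorted by timestamp.
    """
    corrected = list(samples)  # work on a copy, leave the input untouched
    for i in range(1, len(corrected) - 1):
        t_prev, e_prev = corrected[i - 1]
        t_cur, e_cur = corrected[i]
        t_next, e_next = corrected[i + 1]
        # Diff(t-1, t+1) <= 60 s and E_{t-1} = E_{t+1} != E_t  =>  E_t <- E_{t+1}
        if t_next - t_prev <= max_gap_s and e_prev == e_next and e_prev != e_cur:
            corrected[i] = (t_cur, e_next)
    return corrected


if __name__ == "__main__":
    tagged = [(0, 1), (10, 0), (20, 1), (200, 0), (210, 0)]
    print(time_optimizer(tagged))
    # -> [(0, 1), (10, 1), (20, 1), (200, 0), (210, 0)]
```

In this example only the second sample is relabelled: it disagrees with two neighbours that are 20 s apart. The change of label at 200 s is kept, because the gap between its neighbours exceeds the 60 s window.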
