Exploratory Sensor Data Analysis in Python - by Mabel González Castellanos - Towards Data Science
While a lot has been published about traditional EDA techniques, time series
data brings special challenges to the analysis. More specifically, sensor
data is time series data with some peculiar characteristics, which can be
summarised as:
https://towardsdatascience.com/exploratory-sensor-data-analysis-in-python-3a26d6931e67 1/19
09/02/2024, 03:38 Exploratory Sensor Data Analysis in Python | by Mabel González Castellanos | Towards Data Science
· Data is multidimensional: either the sensors have more than one channel,
or several sensors record at the same time, and sometimes both.
· Time series are long: the data is recorded for a certain period at a certain
frequency, which together determine the resulting number of data points.
· Several time series (files) form a dataset, as is the case for motion sensors.
In this domain, the number of time series depends on the number of
recordings, which is normally related to the number of movements and the
persons performing them.
In the scenario described, EDA is no longer a straightforward process, and the
goal of this post is to offer some practical steps to explore a sensor dataset. To
illustrate the proposed methodology I use a dataset [2] that is also available in
the UCI repository [3]. All relevant code and data used in this article are stored
in this GitHub repository (shortcut to the Jupyter Notebook file).
Table of contents:
1. Essential Visualisations
2. Correlations
3. Distribution Analysis
4. Final Remarks
5. References
1. Essential Visualisations
The dataset we will use contains the motion data of 14 people between 66
and 86 years old, who performed broadly scripted activities while wearing a
battery-less sensor on top of their clothing. Data were collected in two clinical
room settings (S1 and S2). Setting S1 uses 4 RFID reader antennas
around the room for data collection, whereas setting S2 uses 3
RFID reader antennas (two at ceiling level and one at wall level) for the
collection of motion data.
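# load data
import os

import pandas as pd


def load_data(dir_dataset: str):
    """Read every recording in dir_dataset into a dict of DataFrames keyed by file name."""
    features = ["time",
                "frontal",
                "vertical",
                "lateral",
                "id_antenna",
                "rssi",
                "phase",
                "frequency",
                "label"]
    sensor_ds = {}
    # Assumption: every entry in dir_dataset is a recording file
    # (the original snippet relies on a files_names list defined elsewhere).
    files_names = sorted(os.listdir(dir_dataset))

    for f in files_names:
        file_name = os.path.join(dir_dataset, f)
        # column names are supplied explicitly through `names`
        df = pd.read_csv(file_name, names=features)
        df.set_index('time', inplace=True)
        base_name = os.path.basename(f)
        sensor_ds[base_name] = df
    return sensor_ds


# DIR_DATASET is the path to the dataset folder, defined elsewhere in the notebook
sensor_ds = load_data(DIR_DATASET)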
Now, it is time to create the first visualisations of the data. I start by counting
the number of observations in every file contained in the dataset.
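The counting step can be sketched as follows. This is a minimal sketch, assuming `sensor_ds` is the dictionary returned by `load_data`; the helper name `count_observations` and the toy frames are illustrative and not taken from the original notebook.

```python
import pandas as pd


def count_observations(sensor_ds: dict) -> pd.Series:
    """Return the number of rows in every file, largest first."""
    counts = pd.Series({name: len(df) for name, df in sensor_ds.items()},
                       name="n_observations")
    return counts.sort_values(ascending=False)


# toy stand-in for the real dataset (hypothetical file names and values)
toy_ds = {
    "fileA": pd.DataFrame({"frontal": [0.1, 0.2, 0.3]}),
    "fileB": pd.DataFrame({"frontal": [0.5]}),
}
counts = count_observations(toy_ds)
print(counts)
# counts.plot.bar() would draw a per-file bar chart of these counts
```

Sorting the counts makes very short or very long recordings stand out before any further analysis.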
Now, let’s take one file as an example (“d2p01F”, recorded in room 2 with a female
volunteer) to recreate the activities recorded in it and their durations.
An activity plot is used to understand the script followed during the
recording.
# activity plot
from typing import Dict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def activit_plot(sensor_ds: Dict[str, pd.DataFrame], file_name: str):
    labels = []
    # row positions that belong to each activity label (1 to 4)
    for i in range(1, 5):
        labels.append(np.where(sensor_ds[file_name].label == i)[0])
    # translate row positions into timestamps taken from the time index
    labels = [sensor_ds[file_name].index[c] for c in labels]
    colors1 = ['C{}'.format(i) for i in range(len(labels))]
    # create a horizontal plot
    fig = plt.figure(figsize=(15, 7))
    lineoffsets = [1, 2, 3, 4]
    plt.eventplot(labels, colors=colors1, lineoffsets=lineoffsets)
    plt.yticks([1, 2, 3, 4], ["sit on bed", "sit on chair", "lying", "ambulating"])
    plt.title("Activities performed")
    return fig


file_name = "d2p01F"
activit_plot(sensor_ds, file_name)
Figure 2. Activities performed during one recording, file “d2p01F” — Image created by Author
· The lying state changes to sitting and then ambulating (lying -> sit on bed -> ambulate).
To see how those activities are reflected by the sensors, let’s focus on the
accelerometer and a traditional time series plot. The plot shows the
accelerometer values on the three axes during the recording time. Vertical lines
were added to mark the moments when an activity change occurs.
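A plot of this kind could be produced along the following lines. This is a sketch, assuming a DataFrame indexed by time with the columns `frontal`, `vertical`, `lateral` and `label` as loaded earlier; the function names and the toy recording are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def activity_change_times(df: pd.DataFrame) -> list:
    """Timestamps (index values) where the label differs from the previous row."""
    changed = df["label"].ne(df["label"].shift())
    changed.iloc[0] = False  # the first row starts the recording, it is not a change
    return df.index[changed.to_numpy()].tolist()


def plot_acceleration(df: pd.DataFrame):
    """Line plot of the three axes with a dashed vertical line at every activity change."""
    fig, ax = plt.subplots(figsize=(15, 5))
    df[["frontal", "vertical", "lateral"]].plot(ax=ax)
    for t in activity_change_times(df):
        ax.axvline(t, color="k", linestyle="--", linewidth=1)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("acceleration")
    return fig


# toy recording (hypothetical values): lying (3) -> sit on bed (1) -> ambulating (4)
toy = pd.DataFrame(
    {
        "frontal": [0.9, 0.9, 0.3, 0.3, 0.1],
        "vertical": [0.1, 0.1, 0.9, 0.9, 1.0],
        "lateral": [0.0, 0.1, 0.0, 0.1, 0.0],
        "label": [3, 3, 1, 1, 4],
    },
    index=pd.Index([0.0, 0.5, 1.0, 1.5, 2.0], name="time"),
)
changes = activity_change_times(toy)
fig = plot_acceleration(toy)
```

Comparing each label with its shifted copy avoids an explicit loop and works for any number of activity classes.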
Figure 3. Frontal, vertical and lateral acceleration values for file “d2p01F” — Image created by Author
· No values were recorded by the accelerometer between approximately 100
and 200 seconds.
· The frontal axis shows more variation during the lying period than the
other axes.
The gap observed in this recording seems to be related to the sudden
movement executed to get out of bed. To be aware of this kind of situation it
is important to analyse the sampling rate. An easy way to do so is to analyse
the time differences between consecutive records. Normally most of the
differences cluster around the same value, but outliers can
indicate anomalies during the recording.
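The time-difference check can be sketched as below, assuming the DataFrame index holds the timestamps in seconds. The helper names and the `factor` threshold are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def time_differences(df: pd.DataFrame) -> pd.Series:
    """Time elapsed between each observation and the previous one."""
    return pd.Series(np.diff(df.index.to_numpy()),
                     index=df.index[1:], name="time_diff")


def find_gaps(diffs: pd.Series, factor: float = 10.0) -> pd.Series:
    """Differences larger than `factor` times the median (an arbitrary threshold)."""
    return diffs[diffs > factor * diffs.median()]


# toy recording with one long pause (hypothetical timestamps)
toy = pd.DataFrame(
    {"frontal": range(6)},
    index=pd.Index([0.0, 0.25, 0.5, 0.75, 120.75, 121.0], name="time"),
)
diffs = time_differences(toy)
gaps = find_gaps(diffs)
print(gaps)
# diffs.plot.box() would draw the boxplot of the differences
```

The returned index tells where in the recording each suspicious gap ends, which helps to relate it to the activity script.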
Figure 4. Time differences between every consecutive pair of observations, file “d2p01F” — Image created by
Author
· The box is almost a line, which means most of the values are concentrated
around the median (0.25).
· There are some outliers (circles) close to this value, except for one of more
than 120 seconds.
Figure 5. Boxplot of the time differences for all the files in the dataset — Image created by Author
2. Correlations
We can measure correlation at different levels. At the file level we
investigate the relations between the features within a file by analysing the
multivariate time series recorded in it. At the time window level, on the other
hand, a rolling correlation explores the relation between two time series as a
rolling window calculation. This section addresses both levels.
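At the file level, the pairwise correlation can be sketched with pandas and a plain matplotlib heatmap. The choice of Pearson correlation, the column list and the helper names are assumptions made for illustration; the toy data is synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def file_correlation(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Pairwise Pearson correlation between the selected columns of one file."""
    return df[columns].corr(method="pearson")


def plot_heatmap(corr: pd.DataFrame):
    """Draw the correlation matrix on a fixed [-1, 1] colour scale."""
    fig, ax = plt.subplots()
    im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(im, ax=ax, label="Pearson correlation")
    return fig


# synthetic stand-in: vertical mirrors frontal, lateral is independent noise
rng = np.random.default_rng(0)
frontal = rng.normal(size=200)
toy = pd.DataFrame({
    "frontal": frontal,
    "vertical": -frontal + rng.normal(scale=0.1, size=200),
    "lateral": rng.normal(size=200),
})
corr = file_correlation(toy, ["frontal", "vertical", "lateral"])
fig = plot_heatmap(corr)
```

Fixing the colour scale to [-1, 1] keeps heatmaps from different files visually comparable.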
Figure 6. Pairwise correlation between all numerical features, file “d2p01F” — Image created by Author
· There are no strong correlations in this example, neither negative nor
positive.
To get a general idea about the correlation in this dataset we need to extend
the analysis to the rest of the files. Multiple heatmap plots are used to show
the results.
Figure 7. Pairwise correlation computed for all the files — Image created by Author
· The correlations vary between files. For example, the pair frontal-vertical
is uncorrelated (“d2p05F”), negatively correlated (“d2p06F”) or positively
correlated (“d1p12F”), depending on the file.
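To complement the grid of heatmaps, the correlation of a single pair can be collected across files into one table. This is a sketch assuming `sensor_ds` maps file names to DataFrames; the helper name and the toy files are hypothetical.

```python
import pandas as pd


def pair_correlation_by_file(sensor_ds: dict, col_a: str, col_b: str) -> pd.Series:
    """Pearson correlation between two columns, computed separately for every file."""
    return pd.Series(
        {name: df[col_a].corr(df[col_b]) for name, df in sensor_ds.items()},
        name=f"{col_a}-{col_b}",
    )


# toy files (hypothetical): one perfectly positive, one perfectly negative relation
toy_ds = {
    "pos_file": pd.DataFrame({"frontal": [1.0, 2.0, 3.0, 4.0],
                              "vertical": [2.0, 4.0, 6.0, 8.0]}),
    "neg_file": pd.DataFrame({"frontal": [1.0, 2.0, 3.0, 4.0],
                              "vertical": [8.0, 6.0, 4.0, 2.0]}),
}
pair_corr = pair_correlation_by_file(toy_ds, "frontal", "vertical")
print(pair_corr.sort_values())
```

Sorting the result shows at a glance which files sit at the negative and positive extremes for that pair.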
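file_name = "d2p01F"
# rolling correlation over a window of 50 observations
rolling_corr = sensor_ds[file_name].frontal.rolling(50).corr(sensor_ds[file_name].vertical)
# plot_rolling_corr is a plotting helper from the accompanying notebook
plot_rolling_corr(rolling_corr);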
Figure 8. Rolling a correlation window between frontal and vertical features in file “d2p01F” — Image created
by Author
· There are no valid numbers for the first 49 observations (coloured white)
because the window size is 50.
· Between observations 250 and 300, the correlations are strong and
positive (blue) due to the person’s movement.
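Both observations can be reproduced on synthetic data. The sketch below assumes two random series that are forced to move together over the last 50 samples; the values are made up and only the window size matches the example above.

```python
import numpy as np
import pandas as pd

WINDOW = 50
n = 300
rng = np.random.default_rng(42)

frontal = pd.Series(rng.normal(size=n))
vertical = pd.Series(rng.normal(size=n))
# force the two series to move together in the last 50 samples
vertical.iloc[250:] = frontal.iloc[250:].to_numpy() * 2.0 + 0.1

rolling_corr = frontal.rolling(WINDOW).corr(vertical)

# a window needs WINDOW observations, so the first WINDOW - 1 results are NaN
n_missing = int(rolling_corr.isna().sum())
# the last window covers exactly the samples that move together
last_value = float(rolling_corr.iloc[-1])
print(n_missing, round(last_value, 3))
```

A smaller window reacts faster to changes but gives noisier estimates, so the window size is worth varying during exploration.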
3. Distribution Analysis
To gain more insight into the sensor dataset it is important to know
which distributions we are working with. In addition, many Machine
Learning models were designed under the assumption of specific
distributions. To determine the distribution that best fits the data we use
the distfit package [4]. It tries out many well-known distributions and returns
the one that best matches the data. It also provides a plot function
to visualise the results. The method is applied to the feature lateral from the
example file.
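The idea of trying several candidate distributions can be illustrated with `scipy.stats` alone. This is a simplified stand-in and not the distfit API: it fits a handful of candidates and ranks them by the residual sum of squares against the histogram. The candidate list, the number of bins and the scoring are assumptions, and the package's own choices may differ.

```python
import numpy as np
from scipy import stats

# small, hand-picked candidate set (an assumption for this sketch)
CANDIDATES = {
    "norm": stats.norm,
    "t": stats.t,
    "dweibull": stats.dweibull,
    "uniform": stats.uniform,
}


def rank_distributions(values, bins: int = 50) -> list:
    """Fit every candidate and rank them by RSS against the empirical histogram."""
    density, edges = np.histogram(values, bins=bins, density=True)
    centres = (edges[:-1] + edges[1:]) / 2
    scores = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(values)          # maximum likelihood estimate
        pdf = dist.pdf(centres, *params)   # fitted density at the bin centres
        scores[name] = float(np.sum((density - pdf) ** 2))
    return sorted(scores.items(), key=lambda kv: kv[1])


# synthetic sample standing in for one sensor feature
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=2000)
ranking = rank_distributions(sample)
best_name, best_rss = ranking[0]
print(ranking)
```

A low score only means the candidate is the least bad of the set, so the fitted curve should still be compared visually with the empirical distribution.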
Figure 9. Fitting of lateral feature with a t-distribution, file “d2p01F” — Image created by Author
Figure 11. Fitting of frontal feature with a t-distribution, file “d1p02M” — Image created by Author
· The distribution is not unimodal: several peaks are clearly visible in the
empirical distribution.
· The Dweibull distribution was selected as the best match, but there
are big differences with respect to the empirical distribution.
To better understand what caused this shape, I use kernel density
estimation to fit the data, this time splitting the distribution by
class.
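A per-class density estimate can be sketched with `scipy.stats.gaussian_kde`. The function name, the default bandwidth and the toy values are assumptions; the original figure may have been produced with a different plotting library.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde


def kde_by_class(df: pd.DataFrame, feature: str, label_col: str = "label",
                 n_points: int = 200):
    """Evaluate one Gaussian KDE per class on a grid shared by all classes."""
    grid = np.linspace(df[feature].min(), df[feature].max(), n_points)
    curves = {}
    for label, group in df.groupby(label_col):
        values = group[feature].to_numpy()
        if len(values) < 2 or values.max() == values.min():
            continue  # a KDE needs at least two distinct values
        curves[label] = gaussian_kde(values)(grid)
    return grid, curves


# synthetic stand-in: class 3 sits around 1.0, class 1 around 0.2
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "frontal": np.concatenate([rng.normal(1.0, 0.05, 200),
                               rng.normal(0.2, 0.05, 200)]),
    "label": [3] * 200 + [1] * 200,
})
grid, curves = kde_by_class(toy, "frontal")
peaks = {label: float(grid[np.argmax(dens)]) for label, dens in curves.items()}
print(peaks)
# plt.plot(grid, curves[label]) for every label would overlay the class densities
```

Evaluating all classes on the same grid makes the curves directly comparable when they are overlaid.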
Figure 12. Kernel density estimation plot for frontal feature, file “d1p02M” — Image created by Author
· The peaks correspond to different classes, which explains the
difficulty of fitting the data with a single distribution.
· The separation between classes is most noticeable for the class lying (3),
whose values are very different from those of the other three classes, which
imply movement.
On the other hand, if the goal is to compare different files while considering all
numeric variables, a more advanced plot is needed. We discuss a solution
based on the plot proposed in [5]. The main advantage is that
the distributions from different series partially overlap. This layout
accommodates multiple time series, which is very convenient for time series with
high dimensionality. The original code can be found here. The next figure shows
the distributions for three files: “d2p01F”, “d2p02F” and “d2p03F”. All time
series were z-normalized, which centres the data around 0. The number to
the right of every pdf shows the x-value with the highest probability density. It
helps to compare different distributions of a common feature.
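The two ingredients of this comparison, z-normalization and the x-value with the highest density, can be sketched as follows. The density is estimated here with a Gaussian KDE, which is an assumption; the original plot code may compute it differently. The helper names and the synthetic sample are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde


def z_normalize(series: pd.Series) -> pd.Series:
    """Subtract the mean and divide by the standard deviation."""
    return (series - series.mean()) / series.std()


def density_peak(series: pd.Series, n_points: int = 512) -> float:
    """x-value at which a Gaussian KDE of the series reaches its maximum."""
    values = series.dropna().to_numpy()
    grid = np.linspace(values.min(), values.max(), n_points)
    density = gaussian_kde(values)(grid)
    return float(grid[np.argmax(density)])


# synthetic right-skewed sample standing in for one feature of one file
rng = np.random.default_rng(3)
raw = pd.Series(rng.exponential(scale=1.0, size=2000))
z = z_normalize(raw)
peak = density_peak(z)
print(round(peak, 1))
```

For a skewed feature the peak lies away from 0 even after normalization, which is why this number is informative when comparing files.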
Figure 13. Distributions of numerical features for three files: “d2p01F”, “d2p02F” and “d2p03F” — Image
created by Author
· The feature frequency looks similar across the different files, and the
numbers [-0.9, -1.0, -0.9] are close to each other.
· The feature rssi is clearly bimodal, but the location of the highest peak
changes in the third file (“d2p03F”).
· The feature vertical shows values of -0.4, 0.6 and -0.2. These differ, and
the shapes show dissimilarities that we can confirm visually.
· In contrast, the feature lateral shows values of -0.4, -0.4 and -0.3.
These are close to each other and the distribution shapes are not very
different.
4. Final Remarks
Sensor data analysis is not a trivial task. I hope you have found some
inspiration and ideas on how to handle it. Some points to keep in
mind:
· Don’t rush to mix all the files; instead, try to understand what you have got
in the first place.
· Treat every file as a mini dataset; understanding the data at that level will
help to understand the whole picture later.
· Don’t forget to analyse correlations. Even if you are confident about how
the data should behave, you may be in for some surprises.
5. References
[1] Tukey, John W., Exploratory Data Analysis (1977), Pearson. ISBN
978-0201076165.
[2] Shinmoto Torres, R. L., Ranasinghe, D. C., Shi, Q., Sample, A. P., Sensor
enabled wearable RFID technology for mitigating the risk of falls near beds
(2013), In 2013 IEEE International Conference on RFID (pp. 191–198). IEEE.
[3] Dua, D. and Graff, C., UCI Machine Learning Repository (2019), Irvine,
CA: University of California, School of Information and Computer Science.