(Paper) Time Series Feature Extraction On Basis of Scalable Hypothesis Tests (Tsfresh-A Python Package)
All content following this page was uploaded by Maximilian Christ on 29 May 2018.
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article info

Article history:
Received 31 May 2017
Revised 22 March 2018
Accepted 23 March 2018
Available online xxx
Communicated by Dr. Francesco Dinuzzo

Keywords:
Feature engineering
Time series
Feature extraction
Feature selection
Machine learning

Abstract

Time series feature engineering is a time-consuming process because scientists and engineers have to consider the multifarious algorithms of signal processing and time series analysis for identifying and extracting meaningful features from time series. The Python package tsfresh (Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests) accelerates this process by combining 63 time series characterization methods, which by default compute a total of 794 time series features, with feature selection on the basis of automatically configured hypothesis tests. By identifying statistically significant time series characteristics in an early stage of the data science process, tsfresh closes feedback loops with domain experts and fosters the development of domain specific features early on. The package implements standard APIs of time series and machine learning libraries (e.g. pandas and scikit-learn) and is designed for both exploratory analyses and straightforward integration into operational data science applications.

© 2018 Published by Elsevier B.V.
1. Introduction

Trends such as the Internet of Things (IoT) [1], Industry 4.0 [2], and precision medicine [3] are driven by the availability of cheap sensors and advancing connectivity, which among other things increases the availability of temporally annotated data. The resulting time series are the basis for machine learning applications like the classification of hard drives into risk classes concerning a specific defect [4], the analysis of the human heart beat [5], the optimization of production lines [6], the log analysis of server farms for detecting intruders [7], or the identification of patients with high infection risk [8]. Examples for regression tasks are the prediction of the remaining useful life of machinery [9] or the estimation of conditional event occurrence in complex event processing applications [10]. Other frequently occurring temporal data are event series from processes, which could be transformed to uniformly sampled time series via process evolution functions [11]. Time series feature extraction plays a major role during the early phases of data science projects in order to rapidly extract and explore different time series features and evaluate their statistical significance for predicting the target. The Python package tsfresh supports this process by providing automated time series feature extraction and selection on the basis of the FRESH algorithm [12].

2. Problems and background

A time series is a sequence of observations taken sequentially in time [13]. In order to use a set of time series $D = \{\chi_i\}_{i=1}^{N}$ as input for supervised or unsupervised machine learning algorithms, each time series $\chi_i$ needs to be mapped into a well-defined feature space with problem specific dimensionality $M$ and feature vector $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,M})$. In principle, one might decide to map the time series of set $D$ into a design matrix of $N$ rows and $M$ columns by choosing $M$ data points from each time series $\chi_i$ as elements of feature vector $x_i$. However, from the perspective of pattern recognition [14], it is much more efficient and effective to characterize the time series with respect to the distribution of data points, correlation properties, stationarity, entropy, and nonlinear time series analysis [15]. Therefore the feature vector $x_i$ can be constructed by applying time series characterization methods $f_j : \chi_i \to x_{i,j}$ to the respective time series $\chi_i$, which results in the feature vector $x_i = (f_1(\chi_i), f_2(\chi_i), \ldots, f_M(\chi_i))$. This feature vector can be

∗ Corresponding author at: Department of Engineering Science, University of Auckland, New Zealand.
E-mail addresses: [email protected] (M. Christ), [email protected] (N. Braun), [email protected] (J. Neuffer), [email protected] (A.W. Kempa-Liehr).
https://doi.org/10.1016/j.neucom.2018.03.067
0925-2312/© 2018 Published by Elsevier B.V.
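The mapping described in Section 2 can be sketched in a few lines of plain Python. The sketch is illustrative only: the three characterization methods (mean, standard deviation, lag-1 autocorrelation) and the names `CHARACTERIZATION_METHODS` and `design_matrix` are assumptions chosen for this example, not the feature calculators shipped with tsfresh. It applies each $f_j$ to each $\chi_i$ and collects the results in a design matrix of $N$ rows and $M$ columns.

```python
from statistics import fmean, pstdev


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation; returns 0.0 for a constant series."""
    mu = fmean(series)
    denom = sum((v - mu) ** 2 for v in series)
    if denom == 0:
        return 0.0
    num = sum((a - mu) * (b - mu) for a, b in zip(series, series[1:]))
    return num / denom


# Hypothetical characterization methods f_1, ..., f_M (here M = 3).
CHARACTERIZATION_METHODS = [fmean, pstdev, lag1_autocorrelation]


def design_matrix(time_series_set, methods=CHARACTERIZATION_METHODS):
    """Map N time series to an N x M matrix with entries x[i][j] = f_j(chi_i)."""
    return [[f(chi) for f in methods] for chi in time_series_set]


# Two toy time series of different length; both map to vectors of length M.
D = [[1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0, 5.0]]
X = design_matrix(D)
```

Because every $f_j$ returns a single number, time series of different lengths end up in the same $M$-dimensional feature space, which is what makes the design matrix usable by standard learning algorithms.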
Please cite this article as: M. Christ et al., Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python
package), Neurocomputing (2018), https://doi.org/10.1016/j.neucom.2018.03.067
JID: NEUCOM
ARTICLE IN PRESS [m5G;May 16, 2018;13:36]
Fig. 1. The three steps of the tsfresh algorithm are feature extraction (1.), calculation of p-values (2.) and a multiple testing procedure (3.) [12]. Steps 1 and 2 are highly parallelized in tsfresh, whereas step 3 has a negligible runtime. For step 1, the public function extract_features is provided. Steps 2 and 3 can be executed by the public function select_features. The function extract_relevant_features combines all three steps. By default, the hypothesis tests of step 2 are configured automatically depending on the type of supervised machine learning problem (classification/regression) and the feature type (categorical/continuous) [12].
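The public functions named in the caption of Fig. 1 might be used as follows. This is a minimal sketch that assumes tsfresh and pandas are installed. The toy data, the helper names `make_toy_rows` and `run_three_steps`, and the column names `id`, `time` and `value` are invented for illustration. With so few and such short series, the selection step may well keep no features at all.

```python
import math


def make_toy_rows(n_series=20, length=50):
    """Build long-format rows (id, time, value) and a binary target per id.

    Hypothetical toy data: series with an odd id oscillate faster.
    """
    rows, target = [], {}
    for i in range(n_series):
        label = i % 2
        target[i] = label
        for t in range(length):
            rows.append({"id": i, "time": t,
                         "value": math.sin(0.1 * (1 + label) * t)})
    return rows, target


def run_three_steps(rows, target):
    """Run steps 1 to 3 of Fig. 1.

    Imports are local so that make_toy_rows stays usable without tsfresh.
    """
    import pandas as pd
    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    df = pd.DataFrame(rows)
    y = pd.Series(target)  # the index must match the ids in df

    # Step 1: feature extraction (one row per id, one column per feature).
    features = extract_features(df, column_id="id", column_sort="time")
    features = impute(features)  # replace NaN and inf before selection

    # Steps 2 and 3: p-values followed by the multiple testing procedure.
    return select_features(features, y)
```

Calling `run_three_steps(*make_toy_rows())` returns the filtered feature matrix. The function `extract_relevant_features(df, y, column_id="id", column_sort="time")` from the same package covers all three steps in one call.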
Fig. 2. Memory consumption of extracting and selecting time series features from 30 time series on a MacBook Pro, 2.7 GHz Intel Core i5, with tsfresh v0.11.0 (Table 1). Each time series has a length of 1000 data points. (a) One core, (b) four cores.
5. Conclusions

The Python based machine learning library tsfresh is a fast and standardized machine learning library for automatic time series feature extraction and selection. It is the only Python based machine learning library for this purpose. The only alternative is the Matlab based package hctsa [26], which extracts more than 7700 time series features. Because tsfresh implements the application programming interface of scikit-learn, it can be easily integrated into complex machine learning pipelines.

The widespread adoption of the tsfresh package shows that there is a pressing need to automatically extract features, originating from e.g. financial, biological or industrial applications. We expect that, due to the increasing availability of temporally annotated data, the interest in such tools will continue to grow.

Current code version

Table 1
Code metadata of tsfresh.
Nr. Code metadata description Please fill in this column

Acknowledgments

The authors would like to thank M. Feindt, who inspired the development of the FRESH algorithm, U. Kerzel and F. Kienle for their valuable feedback, Blue Yonder GmbH and its CTO J. Karstens, who approved the open sourcing of the tsfresh package, and the contributors to tsfresh: M. Frey, earthgecko, N. Haas, St. Müller, M. Gelb, B. Sang, V. Tang, D. Spathis, Ch. Holdgraf, H. Swaffield, D. Gomez, A. Loosley, F. Malkowski, Ch. Chow, E. Kruglick, T. Klerx, G. Koehler, M. Tomlein, F. Aspart, S. Shepelev, J. White, jkleint. The development of tsfresh was funded in part by the German Federal Ministry of Education and Research under Grant number 01IS14004 (project iPRODICT).

Appendix A. Detailed runtime of time series feature extraction

The average runtime has been obtained from a sample of 30 different time series for which all features had been computed three times. The time series were simulated beforehand from the following sequence:

$$x_{t+1} = x_t + 0.0045\,(y - 1/0.3)\,x_t - 325\,x_t^3 + 6.75 \cdot 10^{-5}\,\eta_t \qquad \text{(A.1)}$$

with $\eta_t \sim N(0, 1)$ [25, p. 164] and $y$ being the target.
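Eq. (A.1) can be iterated directly. The following is a sketch in plain Python, not the benchmark code behind Fig. A.1. The equation is read term by term as printed. The initial value `x0`, the seeds, the target value `y = 4.0`, the `noise_scale` parameter and the function name `simulate_series` are assumptions made for illustration, since the text does not state them. The length of 1000 points is borrowed from the caption of Fig. 2.

```python
import random


def simulate_series(y, length=1000, x0=0.0, noise_scale=6.75e-5, seed=None):
    """Iterate Eq. (A.1):

    x[t+1] = x[t] + 0.0045*(y - 1/0.3)*x[t] - 325*x[t]**3 + noise_scale*eta[t],
    with eta[t] drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    x = x0
    series = [x]
    for _ in range(length - 1):
        eta = rng.gauss(0.0, 1.0)
        x = x + 0.0045 * (y - 1 / 0.3) * x - 325 * x ** 3 + noise_scale * eta
        series.append(x)
    return series


# A sample of 30 series; y = 4.0 is a hypothetical target value.
sample = [simulate_series(y=4.0, seed=k) for k in range(30)]
```

Setting `noise_scale=0.0` switches off the stochastic term, which is convenient for checking a single update step by hand.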
Fig. A.1. Average runtime of time series feature extraction methods documented in http://tsfresh.readthedocs.io/en/latest/text/list_of_features.html.
References

[1] J. Gubbi, R. Buyya, S. Marusic, M. Palaniswami, Internet of Things (IoT): a vision, architectural elements, and future directions, Future Gener. Comput. Syst. 29 (7) (2013) 1645–1660.
[2] M. Hermann, T. Pentek, B. Otto, Design principles for Industrie 4.0 scenarios, in: Proceedings of 2016 49th Hawaii International Conference on System Sciences (HICSS), 2016, p. 3928.
[3] F.S. Collins, H. Varmus, A new initiative on precision medicine, N. Engl. J. Med. 372 (9) (2015) 793–795, doi:10.1056/NEJMp1500523.
[4] R.K. Mobley, An introduction to predictive maintenance, second ed., Elsevier Inc., Woburn, MA, 2002.
[5] B.D. Fulcher, M.A. Little, N.S. Jones, Highly comparative time-series analysis: the empirical structure of time series and their methods, J. R. Soc. Interface 10 (83) (2013) 20130048.
[6] M. Christ, F. Kienle, A.W. Kempa-Liehr, Time series analysis in industrial applications, in: Proceedings of Workshop on Extreme Value and Time Series Analysis, KIT Karlsruhe, 2016, doi:10.13140/RG.2.1.3130.7922.
[7] A.L. Buczak, E. Guven, A survey of data mining and machine learning methods for cyber security intrusion detection, IEEE Commun. Surv. Tutor. 18 (2) (2016) 1153–1176.
[8] J. Wiens, E. Horvitz, J.V. Guttag, Patient risk stratification for hospital-associated c. diff as a time-series classification task, in: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 467–475.
[9] J. Yan, Machinery Prognostics and Prognosis Oriented Maintenance Management, John Wiley & Sons, Singapore, 2015.
[10] M. Christ, J. Krumeich, A.W. Kempa-Liehr, Integrating predictive analytics into complex event processing by using conditional density estimations, in: Proceedings of IEEE 20th International Enterprise Distributed Object Computing Workshop (EDOCW), IEEE Computer Society, Los Alamitos, CA, USA, 2016, pp. 1–8, doi:10.1109/EDOCW.2016.7584363.
[11] A. Kempa-Liehr, Performance analysis of concurrent workflows, J. Big Data 2 (10) (2015) 1–14, doi:10.1186/s40537-015-0017-0.
[12] M. Christ, A.W. Kempa-Liehr, M. Feindt, Distributed and parallel time series feature extraction for industrial big data applications, Asian Machine Learning Conference (ACML) 2016, Workshop on Learning on Big Data (WLBD), Hamilton (New Zealand), arXiv preprint arXiv:1610.07717v1.
[13] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, G.M. Ljung, Time Series Analysis: Forecasting and Control, fifth ed., John Wiley & Sons, Hoboken, New Jersey, 2016.
[14] C.M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, New York, 2006.
[15] B.D. Fulcher, Feature-Based Time-Series Analysis, Cornell University Library, 2017. arXiv:1709.08055v2.
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[17] S.V.D. Walt, S.C. Colbert, G. Varoquaux, The numpy array: a structure for efficient numerical computation, Comput. Sci. Eng. 13 (2) (2011) 22–30.
[18] W. McKinney, Data structures for statistical computing in Python, in: S. van der Walt, J. Millman (Eds.), Proceedings of the 9th Python in Science Conference, 2010, pp. 51–56.
[19] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: open source scientific tools for Python, 2001. http://www.scipy.org/.
[20] F. Chollet, Keras, 2015. https://github.com/fchollet/keras.
[21] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, Tensorflow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[22] M. Rocklin, Dask: Parallel computation with blocked algorithms and task scheduling, in: K. Huff, J. Bergstra (Eds.), Proceedings of the 14th Python in Science Conference, 2015, pp. 130–136.
[23] L.S. Lopes, Robot learning at the task level: a study in the assembly domain, Ph.D. thesis, Universidade Nova de Lisboa, Portugal, 1997.
[24] M. Lichman, UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml.
[25] A.W. Liehr, Dissipative solitons in reaction diffusion systems, in: Mechanisms, Dynamics, Interaction, 70, Springer Series in Synergetics, Berlin, 2013, doi:10.1007/978-3-642-31251-9.
[26] B.D. Fulcher, N.S. Jones, hctsa: a computational framework for automated time-series phenotyping using massive feature extraction, Cell Syst. 5 (5) (2017) 527–531.e3, doi:10.1016/j.cels.2017.10.001.

Maximilian Christ received his M.S. degree in Mathematics and Statistics from Heinrich-Heine-Universität Düsseldorf, Germany, in 2014. In his daily work as a Data Science Consultant at Blue Yonder GmbH he optimizes business processes through data driven decisions. Besides his business work he is pursuing a Ph.D. in collaboration with University of Kaiserslautern, Germany. His research interests relate to how to deliver business value through Machine Learning based optimizations.

Nils Braun is a Ph.D. student in High Energy Particle Physics at Karlsruher Institut of Technology (KIT), Germany. His research is focussed on developing and optimizing scientific software for analysing and processing large amounts of recorded data efficiently. He received his M.S. degree in Physics at KIT in 2016. He has worked at Blue Yonder GmbH as a Data Science Engineer, where he developed platforms and utilities for various data science tasks.

Julius Neuffer is a Software Developer at Blue Yonder GmbH. He has a background in computer science and philosophy. Aside from software development, he takes interest in applying machine learning to real-world problems.

Andreas W. Kempa-Liehr is a Senior Lecturer at the Department of Engineering Science of the University of Auckland, New Zealand, and an Associate Member of the Freiburg Materials Research Center (FMF) at the University of Freiburg, Germany. Andreas received his doctorate from the University of Münster in 2004 and continued his research as head of service group Scientific Information Processing at FMF. From 2009 to 2016 he was working in different data science roles at EnBW Energie Baden-Württemberg AG and Blue Yonder GmbH.