Machine Learning Introduction Presentation
Assumptions
Univariate time series
Time series databases
A sequential scan costs O(nd) per query (n series of length d); finding the nearest neighbor for each time series in the database is prohibitive.
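The sequential scan baseline can be sketched as follows (a minimal version assuming equal-length series and Euclidean distance):

```python
import math

def nearest_neighbor(query, database):
    """Sequential scan: compare the query against every series.

    With n series of length d this is O(n*d) per query, which is
    why indexing and dimensionality reduction matter at scale.
    """
    best, best_dist = None, math.inf
    for i, series in enumerate(database):
        # Euclidean distance between equal-length series
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, series)))
        if dist < best_dist:
            best, best_dist = i, dist
    return best, best_dist

db = [[0, 0, 0, 0], [1, 2, 3, 4], [1, 2, 3, 5]]
idx, d = nearest_neighbor([1, 2, 3, 4], db)
```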
Similarity Search
Clustering and classification methods perform many similarity calculations
Some require storage of the k nearest neighbors of each data instance
Critical that these calculations be fast
Indexing
Faster than a sequential scan
Insertions and deletions do not require rebuilding the entire index
Partition the data into regions
Search regions that contain a likely match
Requires a similarity metric that obeys the triangle inequality
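The triangle-inequality requirement is what makes pruning safe. A minimal sketch using a single hypothetical pivot (the names here are illustrative, not from any particular index structure):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pivot_search(query, database, pivot_dists, pivot):
    """Nearest-neighbor search pruned with the triangle inequality.

    For any x: |d(q, p) - d(p, x)| <= d(q, x), so when that lower
    bound already exceeds the best distance found, x cannot win and
    the full distance computation is skipped.
    """
    dq = dist(query, pivot)
    best, best_dist = None, math.inf
    for i, series in enumerate(database):
        if abs(dq - pivot_dists[i]) > best_dist:
            continue  # pruned without touching the raw series
        d = dist(query, series)
        if d < best_dist:
            best, best_dist = i, d
    return best, best_dist

db = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
pivot = [0.0, 0.0]
pivot_dists = [dist(pivot, s) for s in db]  # precomputed once
idx, d = pivot_search([3.0, 3.9], db, pivot_dists, pivot)
```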
Indexing
R-trees
kd-trees
Linear quad-trees
Grid-files
Dimensionality Reduction
Reduces the size of the time series
Distance on transformed data should lower bound the original distance
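One such transform is Piecewise Aggregate Approximation (PAA), whose distance lower-bounds the Euclidean distance. A minimal sketch (assuming the segment count divides the series length):

```python
import math

def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width frame."""
    n = len(series)
    w = n // segments  # assumes segments divides n
    return [sum(series[i * w:(i + 1) * w]) / w for i in range(segments)]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def paa_dist(pa, pb, n):
    """Distance on the reduced representation; never exceeds euclid()."""
    m = len(pa)
    return math.sqrt((n / m) * sum((x - y) ** 2 for x, y in zip(pa, pb)))

x = [0, 2, 4, 6, 8, 6, 4, 2]
y = [1, 1, 3, 5, 9, 7, 5, 3]
lb = paa_dist(paa(x, 4), paa(y, 4), len(x))
exact = euclid(x, y)
# lb <= exact: the transformed distance lower-bounds the original,
# so pruning on it never discards the true nearest neighbor
```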
Gemini Framework
Faloutsos et al., 1994
Map each time series to a lower dimension
Store in a multi-dimensional indexing structure
C. Faloutsos et al.: Fast Subsequence Matching in Time-Series Databases. SIGMOD Conference 1994: 419-429
Eamonn J. Keogh, et al.: Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst. 3(3): 263-286 (2001)
Fig: Eamonn J. Keogh, et al.: HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. ICDM 2005: 226-233
Segmentation
Represent the time series in smaller, less complex segments.
Piecewise Linear Approximation (PLA)
Minimum Bounding Rectangles (MBR)
Fig: A. Anagnostopoulos et al: Global distance-based segmentation of trajectories. SIGKDD Conference 2006: 34-43
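PLA, for example, can be sketched as one least-squares line per fixed-width window (an illustrative simplification; practical PLA algorithms choose segment boundaries adaptively):

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]) .. (k-1, ys[k-1])."""
    k = len(ys)
    mx = (k - 1) / 2
    my = sum(ys) / k
    sxy = sum((x - mx) * (y - my) for x, y in zip(range(k), ys))
    sxx = sum((x - mx) ** 2 for x in range(k))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx  # (slope, intercept)

def pla(series, window):
    """Represent the series as one (slope, intercept) pair per window."""
    return [fit_line(series[i:i + window])
            for i in range(0, len(series), window)]

segments = pla([0, 1, 2, 3, 10, 10, 10, 10], 4)
# first window fits the line y = x; second fits the constant y = 10
```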
Discretization
Transforms a real-valued time series into a sequence of characters from a discrete alphabet
Dimensionality reduction is implicit
Allows use of string functions on time series
SAX
Jessica Lin et al.: A symbolic representation of time series, with implications for streaming algorithms. DMKD 2003: 2-11
Fig: Eamonn J. Keogh, et al.: HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. ICDM 2005: 226-233
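A minimal SAX sketch: z-normalize, reduce with PAA, then map each segment mean to a letter via the standard Gaussian breakpoint table (shown here for a four-letter alphabet; the series length is assumed divisible by the segment count):

```python
import math
from bisect import bisect

# Standard SAX breakpoints for an alphabet of size 4: equiprobable
# regions under the standard normal distribution
BREAKPOINTS = [-0.67, 0.0, 0.67]

def sax(series, segments, alphabet="abcd"):
    """SAX word: z-normalize, apply PAA, discretize each mean."""
    n = len(series)
    mu = sum(series) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in series) / n) or 1.0
    z = [(v - mu) / sd for v in series]
    w = n // segments  # assumes segments divides n
    means = [sum(z[i * w:(i + 1) * w]) / w for i in range(segments)]
    return "".join(alphabet[bisect(BREAKPOINTS, m)] for m in means)

word = sax([1, 1, 2, 2, 8, 8, 9, 9], 4)
```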
Cross Correlation
Cross correlation with convolution can find the optimal phase shift to maximize similarity
Fig: P. Protopapas et al.: Finding outlier light-curves in catalogs of periodic variable stars. Mon. Not. Roy. Astron. Soc. 369 (2006) 677-696
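A brute-force version of the shift search (circular shifts are assumed here; an FFT-based convolution computes the same correlations in O(n log n)):

```python
def best_shift(x, y):
    """Circular shift of y that maximizes its correlation with x."""
    n = len(x)
    best, best_corr = 0, float("-inf")
    for s in range(n):
        # correlation of x with y shifted left by s samples
        corr = sum(x[i] * y[(i + s) % n] for i in range(n))
        if corr > best_corr:
            best, best_corr = s, corr
    return best

x = [0, 1, 0, 0]
y = [0, 0, 0, 1]
s = best_shift(x, y)  # shifting y left by two samples aligns the peaks
```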
Cross Correlation
Optimal phase shift (to left) of solid line is 0.3
Fig: P. Protopapas et al.: Finding outlier light-curves in catalogs of periodic variable stars. Mon. Not. Roy. Astron. Soc. 369 (2006) 677-696
Warped
Time Axis
Fig: Y. Sakurai, et al.: FTW: fast similarity search under the time warping distance. PODS 2005: 326-337
D. J. Berndt and J. Clifford: Finding Patterns in Time Series: A Dynamic Programming Approach. Advances in Knowledge Discovery and Data Mining 1996: 229-248
DTW Algorithm
Fig: Y. Sakurai, et al.: FTW: fast similarity search under the time warping distance. PODS 2005: 326-337
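The standard dynamic-programming recurrence can be sketched as (a minimal version using absolute difference as the local cost):

```python
import math

def dtw(x, y):
    """Classic O(len(x) * len(y)) dynamic-programming DTW distance."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three allowed warping moves
            D[i][j] = cost + min(D[i - 1][j],      # stretch x
                                 D[i][j - 1],      # stretch y
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# The second series is a time-warped copy of the first: DTW is 0,
# while Euclidean distance point-by-point would be nonzero.
d = dtw([1, 2, 3, 3, 4], [1, 2, 2, 3, 4])
```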
Drawbacks of DTW
Computationally expensive
Does not obey the triangle inequality, so it cannot be used for indexing
Sakoe-Chiba Band
Itakura Parallelogram
Y. Sakurai, et al.: FTW: fast similarity search under the time warping distance. PODS 2005: 326-337
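The Sakoe-Chiba band restricts the warping path to cells within a fixed distance of the diagonal, cutting the computation and ruling out pathological warpings. A minimal sketch (the band width is a free parameter):

```python
import math

def dtw_band(x, y, band):
    """DTW restricted to a Sakoe-Chiba band: only cells with
    |i - j| <= band are filled; everything outside stays infinite."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],
                                 D[i][j - 1],
                                 D[i - 1][j - 1])
    return D[n][m]

# the warping needed here stays within one cell of the diagonal,
# so a band of 1 still recovers the unconstrained distance
d = dtw_band([1, 2, 3, 3, 4], [1, 2, 2, 3, 4], band=1)
```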
Thesis Research
Anomaly detection methods
Fast
Preserve interesting features
Thank You