Abstract
We present afqn (Approximate Fast Qn), a novel algorithm for approximate computation of the Qn scale estimator in a streaming setting, in the sliding window model. It is well-known that computing the Qn estimator exactly may be too costly for some applications, and the problem is a fortiori exacerbated in the streaming setting, in which the time available to process incoming data stream items is short. In this paper we show how to efficiently and accurately approximate the Qn estimator. As an application, we show the use of afqn for fast detection of outliers in data streams. In particular, the outliers are detected in the sliding window model, with a simple check based on the Qn scale estimator. Extensive experimental results on synthetic and real datasets confirm the validity of our approach, showing up to three times more updates per second. Our contributions are the following: (i) to the best of our knowledge, we present the first approximation algorithm for online computation of the Qn scale estimator in a streaming setting and in the sliding window model; (ii) we show how to take advantage of our UDDSketch algorithm for quantile estimation in order to quickly compute the Qn scale estimator; (iii) as an example of a possible application of the Qn scale estimator, we discuss how to detect outliers in an input data stream.
1 Introduction
In this paper we deal with the problem of computing approximately the Qn scale estimator when the input is a data stream. Qn is a robust measure of dispersion [17]; it is a rank-based estimator proposed by Rousseeuw and Croux, a statistic based on absolute pairwise differences which does not require location estimation.
It is worth recalling here that statistical robustness comes at a cost, since computing the Qn estimator is computationally expensive, as we shall see. The challenge is therefore to design a streaming algorithm working in the sliding window model [6, 15], able to quickly provide an approximation of the Qn estimator for each item in the input data stream. Moreover, such an approximation must be highly accurate. We shall show how to approximately compute the Qn estimator by using our UDDSketch algorithm [7] for quantile estimation.
As an application, we deal with the problem of analyzing an input data stream to detect anomalies, also known as outliers. Formally, we will denote by σ a data stream consisting of a sequence of n items drawn from a universe \(\mathcal {U}\). In general, depending on the specific application, items may be duplicated and may correspond to abstract or real entities, such as IP addresses, graph edges, points, geographical coordinates, numbers etc. A detailed survey discussing data streams and fundamental streaming algorithms is [15].
Given the input stream, we process its items using the sliding window model [6, 15]. In this model, the freshness of recent items is captured either by a time window, i.e., a temporal interval of fixed size in which only the most recent items are taken into account, or by an item window, i.e., a window containing a predefined number of recent items; detection of outliers is strictly related to those items falling in the window. The items in the stream become stale over time, since the window periodically slides forward.
We are provided with a potentially infinite input stream whose items are numbers, and our task is to determine the outliers. Distinguishing inliers and outliers is a notoriously difficult problem even in the simplest case of a static input dataset. An outlier is traditionally defined as an observation which markedly deviates from the other members of a dataset, and searching for outliers requires finding those observations which appear to be inconsistent with the rest of the data [11]. Therefore, outliers are often thought of as anomalies; they may arise because of human or instrument errors, fraudulent behaviour, system changes or faults, natural deviations in populations, etc.
Detecting an outlier may indicate an abnormal system running condition such as an engine defect, an anomalous object in an image, an intrusion with malicious intent inside a system, a fault in a production line, etc. An outlier detection system accomplishes the task of monitoring data in order to reveal anomalous instances. A comprehensive list of outlier detection use-cases is given in [11].
Owing to the underlying nature of the input stream, outlier detection becomes particularly challenging and relevant. A data stream is usually characterized by the high rate of item arrivals and by its length, which may be unbounded. As an immediate consequence, processing the items of a stream to compute a function of the input is quite hard, given that an algorithm is only allowed a single pass over the data stream items. In particular, each item is seen just once, and must be quickly processed and discarded. Typically, processing an item must be done in constant time. Another problem is strictly related to the potentially unbounded length of the stream, which implies that the data items can not be stored, making it infeasible to process the input at a later time. Indeed, in many cases the task requires near real-time processing of the stream.
Determining the outlierness of an item centered in the current window requires computing a score based on the Qn estimator; we shall show that our afqn algorithm is both fast and accurate and is therefore a valid alternative to the exact but computationally expensive Qn estimator.
To recap, our contributions are the following:
- to the best of our knowledge, we present the first approximation algorithm for online computation of the Qn scale estimator in a streaming setting and in the sliding window model;
- we show how to take advantage of our UDDSketch algorithm for quantile estimation in order to quickly compute the Qn scale estimator;
- as an example of a possible application of the Qn scale estimator, we discuss how to detect outliers in an input data stream.
This paper is organized as follows. In Section 2, we introduce the Qn estimator and discuss related work. We describe our afqn algorithm in Section 3. As an application, we present in Section 4 a robust statistical approach to outlier detection. Section 5 provides extensive experimental results. Our conclusions are drawn in Section 6.
2 Related work
The Qn estimator is a robust statistical method for univariate data. Given a set \(\{x_{1}, x_{2},\dots , x_{n}\}\), the Qn statistic was initially defined by its authors [17] as

$$Q_n = d_n \left\{\left|x_{i}-x_{j}\right| ; i<j\right\}_{(k)}, \qquad (1)$$

where \(k \approx {n\choose 2}/ 4\) and the notation \(\{\cdot\}_{(k)}\) denotes computing the k-th order statistic of the set. However, the authors slightly modified the definition in (1) by taking into account that

$${n \choose 2}/4 \approx {h \choose 2}, \qquad (2)$$

where h = ⌊n/2⌋ + 1. The final definition is

$$Q_n = d_n \left\{\left|x_{i}-x_{j}\right| ; i<j\right\}_{(k)}, \qquad (3)$$

where \(k = {h \choose 2}\) and \(d_n\) is a correction factor which depends on n. From a statistical perspective, the breakdown point of Qn is 50%, i.e., the estimator is robust enough to counter the negative effects of almost 50% of large outliers without becoming extremely biased. In addition, its Gaussian efficiency is about 82%, i.e., it is an efficient estimator since it needs fewer observations than a less efficient one to achieve a given performance. To better understand how efficient the Qn estimator is, it is worth recalling here that the MAD (Median Absolute Deviation about the median of the data) estimator [9] provides an efficiency of only about 36%.
Consider a static dataset consisting of n items. Computing the Qn estimator naively requires determining the set of the absolute pairwise differences, whose size is quadratic in n. Then, the differences must be sorted in order to determine the k th order statistic. Therefore, the naive approach requires in the worst case \(O(n^{2} \lg n)\) time. A slightly better algorithm requires O(n2) time in the worst case, by using the Select algorithm [1] which is linear in the input size in the worst case; again, the input size refers to the size of the set of the absolute pairwise differences. Since the Select algorithm is only of theoretical interest, selecting the k th order statistic is usually done with the QuickSelect algorithm [8, 10] which is extremely fast. However, QuickSelect is linear in the input size only on average (expected computational complexity).
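To make these costs concrete, here is a minimal C++ sketch of the naive approach (the function name is ours and the correction factor \(d_n\) is omitted): it materializes all of the pairwise absolute differences and then selects the k-th smallest with std::nth_element, a QuickSelect-style routine.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive Qn core: build all O(n^2) absolute pairwise differences, then
// select the k-th smallest with QuickSelect (std::nth_element).
// Assumes n >= 2; the correction factor d_n is deliberately omitted.
double naive_qn_statistic(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> diffs;
    diffs.reserve(n * (n - 1) / 2);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            diffs.push_back(std::fabs(x[i] - x[j]));
    const std::size_t h = n / 2 + 1;        // h = floor(n/2) + 1
    const std::size_t k = h * (h - 1) / 2;  // k = C(h, 2)
    // k-th order statistic (1-based) sits at index k-1 after selection
    std::nth_element(diffs.begin(), diffs.begin() + (k - 1), diffs.end());
    return diffs[k - 1];
}
```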
Croux and Rousseeuw [5] proposed the first offline algorithm with worst case complexity \(O(n \lg n)\), taking advantage of an algorithm designed by Johnson and Mizoguchi [12] to determine the k-th order statistic in a matrix which is required to have nonincreasing rows and columns, of the form

$$U = X + Y = \left\{ u_{ij} \right\} = \left\{ x_i + y_j \right\}, \quad 1 \leq i, j \leq n,$$

where the vectors X and Y are sorted using \(O(n \lg n)\) time in the worst case. The key idea is to use the matrix U of order n without actually computing all of its entries, since this would require \(O(n^2)\) time. This is coupled with a pruning strategy which allows discarding those numbers that can not be the k-th order statistic.
In order to reduce the problem of computing the Qn estimator to the problem of selecting an order statistic using the matrix U, Croux and Rousseeuw noted that

$$\left\{\left|x_{i}-x_{j}\right| ; i<j\right\}_{(k)}=\left\{x_{(i)}-x_{(j)} ; 1 \leq i, j \leq n\right\}_{(k^{*})},$$

where \(k^{*} = k+n+{n \choose 2} \).

In the previous equation, \(x_{(1)} \leq \dots \leq x_{(n)}\) are the sorted observations (we recall that \(x_1, \dots, x_n\) are the unsorted observations). It follows that, defining the vectors \(X = \{x_{(1)}, \dots, x_{(n)}\}\) and \(Y = \{-x_{(n)}, \dots, -x_{(1)}\}\), Croux and Rousseeuw can apply the Johnson and Mizoguchi algorithm to the matrix whose entries, for indexes such that 1 ≤ i, j ≤ n, are

$$u_{ij} = x_{(i)} + y_{(j)} = x_{(i)} - x_{(n-j+1)}.$$
A minor difference is that the Johnson and Mizoguchi algorithm requires that the vectors X and Y be in nonincreasing order; Croux and Rousseeuw use, instead, the nondecreasing order.
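To illustrate why the reduction saves space, the following sketch (with our own naming, not code taken from [5] or [12]) generates an entry of U on demand from the sorted observations, so the n × n matrix is never materialized:

```cpp
#include <cstddef>
#include <vector>

// Implicit access to U = X + Y: entry u(i, j) = x_(i) - x_(n-j+1),
// computed on demand from the sorted observations, with no O(n^2)
// storage. Indexes i, j are 1-based, as in the text.
struct ImplicitU {
    const std::vector<double>& sorted_x;  // x_(1) <= ... <= x_(n)
    double operator()(std::size_t i, std::size_t j) const {
        const std::size_t n = sorted_x.size();
        return sorted_x[i - 1] - sorted_x[n - j];  // x_(i) - x_(n-j+1)
    }
};
```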
The first algorithm working in a streaming setting was proposed in [16]. The algorithm works in the sliding window model; let s be the window size. The window is initially empty, and incoming observations from the stream are added to the window until it reaches its full size. Once the window is full, a new incoming observation triggers a window's update: the new observation is added to the window, and the oldest observation is removed from it.
In order to compute the Qn estimator during a window's update, the algorithm reuses the same consideration of Croux and Rousseeuw: given \(X=\left \{ x_{1}, \ldots , x_{s} \right \}\), \(k^{\prime } = {\lfloor s / 2\rfloor + 1 \choose 2}\) and \(k=k^{\prime }+s + {s \choose 2}\), it holds that

$$\left\{\left|x_{i}-x_{j}\right| ; i<j\right\}_{(k^{\prime})} = \left\{x_{(i)}-x_{(j)} ; 1 \leq i, j \leq s\right\}_{(k)}.$$
It follows that in this case the algorithm must compute the k-th order statistic of U = X + (−X). The algorithm maintains a buffer \({\mathscr{B}}\) of size b = O(s), storing the matrix entries \(u_{(k-\lfloor (b-1)/2 \rfloor)}, \ldots, u_{(k+\lfloor b/2 \rfloor)}\), centered on the k-th order statistic. The buffer is initially populated using a variant of the Croux and Rousseeuw algorithm. The implementation relies on the use of AVL trees, respectively for X, −X and the buffer \({\mathscr{B}}\). Insertion and deletion of an item in the trees triggers a computation of the new position of the k-th order statistic in \({\mathscr{B}}\). If the k-th order statistic is no longer in \({\mathscr{B}}\), then \({\mathscr{B}}\) is recomputed using the offline algorithm of Croux and Rousseeuw. In the worst case, [16] requires \(O(s \lg s)\) time. The authors prove that, “for a constant signal with stationary noise, the expected amortized time per update is \(O(\lg s)\)”. However, it is worth remarking here that, in order to achieve this expected amortized time, the authors assume that the rank of each data point in the set of all data points is equiprobable, which of course is not always the case.
In [2] we presented fqn (Fast Qn), a novel streaming algorithm working in the sliding window model for computing the Qn estimator with worst case O(s) running time, where s is the window's size. fqn outperforms [16] with regard to speed and does not assume anything about the underlying distribution of the input stream. Our algorithm maintains two different permutations of the current window: the former is the permutation related to the actual order of arrival of the observations from the stream, whilst the latter is the sorted permutation. Maintaining the sorted permutation can be done in O(s) time by mimicking the way InsertionSort [4] inserts an item. Then, in order to compute the k-th order statistic of the absolute pairwise differences, we cleverly reuse, adapting it to our context, an algorithm by Mirzaian and Arjomandi [14]. This algorithm determines the k-th order statistic in O(s) worst case time.
3 The AFQN algorithm
Our afqn algorithm dynamically maintains and processes consecutive windows arising from the input stream. Let W be our sliding window and \(x_i\) the i-th item in the input stream. We let the size of W be s = 2w + 1, where w is the semi-window size.
Besides the window, afqn exploits a sketch data structure provided by our recent UDDSketch algorithm [7] for quantile estimation. The sketch is used to insert and remove, as needed, the differences required to compute the Qn estimator. Formally, we recall that given a set \(\{x_{1}, x_{2},\dots , x_{n}\}\), we need a clever way to compute the k-th order statistic of the set \(\left \{\left |x_{i}-x_{j}\right | ; i<j\right \}\) where \(k = {\lfloor n/2 \rfloor + 1 \choose 2}\). In particular, the size of the set of differences is \(n(n-1)/2 = O(n^2)\).
When the algorithm starts, the window W is empty. We insert the items in W one at a time; after inserting s items, the window is full. Each time we insert an item \(x_i\), we also insert its corresponding i − 1 differences with the previous items into the sketch. Therefore, for each item we insert at most O(s) differences, so that when the window is full for the first time the sketch contains \(s(s-1)/2 = O(s^2)\) differences.
Once the first window has been processed, each time a new item arrives from the stream the window and the sketch are updated as follows. The algorithm processes the stream in windows \(W = \langle x_{i-2w}, \ldots, x_i \rangle\) of size s (implemented as a circular buffer). The window slides one item ahead when a new item \(x_{i+1}\) arrives from the input stream: we remove from W the least recent (in the temporal sequence of item arrivals) item \(x_{i-2w}\), and we also delete from the sketch the s − 1 differences between \(x_{i-2w}\) and the other s − 1 items in W. We then proceed by inserting the new item \(x_{i+1}\) into W, along with the s − 1 differences between \(x_{i+1}\) and the other s − 1 items in W into the sketch.
As explained, handling a new item's arrival only requires O(s) sketch insertions and deletions, so that we avoid computing all of the \(O(s^2)\) differences each time a new item arrives from the input stream. In order to compute the k-th order statistic related to the current window W, we simply query the sketch, determining the quantile corresponding to \(k = {\lfloor s/2 \rfloor + 1 \choose 2}\). Figure 1 depicts how the algorithm works, whilst Algorithm 5 refers to the pseudocode for afqn.
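The following minimal C++ sketch illustrates this per-item update. For self-containment it uses an exact multiset in place of the UDDSketch data structure (all names are ours, not the published implementation); in afqn the insert/remove/select operations are served by the sketch instead.

```cpp
#include <cmath>
#include <cstddef>
#include <deque>
#include <iterator>
#include <set>

// Stand-in for the quantile sketch: an exact multiset of differences.
// In afqn this role is played by UDDSketch; an exact structure is used
// here only to keep the example self-contained.
struct DiffStore {
    std::multiset<double> diffs;
    void insert(double d) { diffs.insert(d); }
    void remove(double d) { diffs.erase(diffs.find(d)); }
    double kth(std::size_t k) const {            // 1-based order statistic
        auto it = diffs.begin();
        std::advance(it, k - 1);
        return *it;
    }
};

// One sliding-window step: evict the oldest item, admit the new one,
// keep the store of pairwise absolute differences in sync, and return
// the k-th order statistic with k = C(floor(s/2) + 1, 2).
double afqn_step(std::deque<double>& window, std::size_t s,
                 DiffStore& store, double new_item) {
    if (window.size() == s) {
        double oldest = window.front();
        window.pop_front();
        for (double x : window) store.remove(std::fabs(oldest - x));
    }
    for (double x : window) store.insert(std::fabs(new_item - x));
    window.push_back(new_item);

    if (store.diffs.empty()) return 0.0;  // window not yet large enough
    std::size_t h = window.size() / 2 + 1;
    std::size_t k = h * (h - 1) / 2;
    return store.kth(k);
}
```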
We now discuss our UDDSketch algorithm in detail.
UDDSketch is based on the same data structure used by the DDSketch algorithm [13]. This data structure supports the following operations: inserting an item, deleting an item, collapsing the data structure if required and querying it for quantile estimation. However, DDSketch guarantees good accuracy only for selected input distributions, whilst our algorithm provides better accuracy for almost all of the possible input distributions. We achieved this result by engineering a new collapsing procedure which is able to uniformly distribute the error committed. The accuracy is defined as follows.
Definition 1
q-quantile. Denoting with S a multi-set of size n defined over \(\mathbb {R}\) and with R(x) the rank of an item x, i.e., the number of items in S which are smaller than or equal to x, the lower (respectively upper) q-quantile item \(x_q \in S\) is the item x whose rank R(x) in S is ⌊1 + q(n − 1)⌋ (respectively ⌈1 + q(n − 1)⌉), for 0 ≤ q ≤ 1.
As an example, \(x_0\) and \(x_1\) are respectively the minimum and maximum item in S, whilst \(x_{0.5}\) corresponds to the median. Relative accuracy is defined as follows.
Definition 2
Relative accuracy. An item \(\tilde {x}_{q}\) is an α-accurate q-quantile if, for a given q-quantile item \(x_q \in S\), \(\lvert \tilde{x}_{q} - x_{q} \rvert \leq \alpha x_{q}\). A sketch data structure is an α-accurate \((q_0, q_1)\)-sketch if it can output α-accurate q-quantiles for \(q_0 \leq q \leq q_1\).
The sketch data structure is a set of buckets. In the sequel, we shall assume without loss of generality that the input consists of items \(x \in \mathbb R_{\geq 0}\) (owing to the fact that the absolute differences are greater than or equal to zero; nevertheless, the algorithm can also handle negative values: this requires using another sketch in which an item \(x \in \mathbb R_{< 0}\) is handled by inserting − x). The algorithm must be initialized using two input parameters, α and m. The former is the user-defined accuracy, and the latter is the maximum number of buckets that can be used. If inserting an item requires adding a new bucket and the number of buckets exceeds m, then a collapsing procedure is executed in order to satisfy the constraint on m; collapsing the sketch reduces the number of buckets to at most m.
Bucket boundaries are defined with regard to the quantity \(\gamma = \frac {1+\alpha }{1-\alpha }\). Let the i-th bucket be \(B_i\), with index \(i = \lceil \lg _{\gamma } x \rceil \). The input items x such that \(\gamma^{i-1} < x \leq \gamma^{i}\) fall in the bucket \(B_i\), which is just a counter variable initialized to zero. To insert an item x into the sketch, if the corresponding bucket already exists the algorithm simply increments that bucket's counter by one, otherwise the new bucket is added to the sketch, and then its counter is set to one. Symmetrically, an item is deleted by decrementing the corresponding bucket's counter by one; if the counter's value becomes zero, its bucket is removed from the sketch. At the beginning, the sketch is empty, and the buckets are dynamically added or removed as required. We point out here that the bucket indexes are dynamic as well, since they depend on both the input items and the γ value.
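A minimal sketch of this bucket bookkeeping, assuming a map from bucket index to counter (our own representation, not the authors' published code; it assumes x > 0, consistent with the discussion above):

```cpp
#include <cmath>
#include <map>

// DDSketch-style bucket store: bucket i counts items x with
// gamma^(i-1) < x <= gamma^i, where gamma = (1 + alpha) / (1 - alpha).
// Assumes x > 0; negative values go into a mirrored second sketch.
struct BucketStore {
    double gamma;
    std::map<int, long> buckets;  // index -> counter; absent means zero

    int index_of(double x) const {               // i = ceil(lg_gamma(x))
        return static_cast<int>(std::ceil(std::log(x) / std::log(gamma)));
    }
    void insert(double x) { ++buckets[index_of(x)]; }
    void remove(double x) {
        auto it = buckets.find(index_of(x));
        if (it != buckets.end() && --(it->second) == 0)
            buckets.erase(it);                   // drop empty buckets
    }
};
```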
The insertion procedure may of course cause the sketch to grow without bounds. To prevent this, the algorithm executes a collapsing procedure when the number of buckets exceeds the maximum number m of buckets. The pseudocode for UDDSketch insertion of an item x into the sketch \(\mathcal {S}\) is shown as Algorithm 1, whilst Algorithm 2 refers to the pseudocode for deleting an item.
In the original DDSketch algorithm the collapse is applied to the first two buckets whose counters' values are greater than zero (or, alternatively, it can be done on the last two buckets). Denoting respectively by \(B_y\) and \(B_z\), with y < z, the first two such buckets, the collapsing procedure adds the count stored by \(B_y\) to \(B_z\), then \(B_y\) is removed from the sketch.
Our UDDSketch algorithm uses a carefully designed uniform collapsing procedure. Instead of collapsing only the first two buckets with counts greater than zero, we collapse them all, in pairs. Consider a pair of indices (i, i + 1), with i being an odd index and \(B_i \neq 0\) or \(B_{i+1} \neq 0\). When processing this pair of buckets we create and add to the sketch a new bucket whose index is \(j = \lceil \frac {i}{2} \rceil \), and set its counter's value to the sum of the \(B_i\) and \(B_{i+1}\) counters' values. This new bucket replaces the two collapsed buckets, which are then evicted from the sketch. The pseudocode of our uniform collapse procedure is shown as Algorithm 3.
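Under the same map-based representation, a hedged sketch of the uniform collapse could look as follows; the accompanying update of γ (and hence of α) performed by the full algorithm in [7] is omitted here.

```cpp
#include <map>

// Uniform collapse sketch: every pair of buckets (i, i+1), with i odd,
// is merged into a single bucket with index j = ceil(i / 2), so the
// resolution loss is spread uniformly across the whole sketch. The
// caller is expected to update gamma accordingly afterwards, as in [7].
std::map<int, long> uniform_collapse(const std::map<int, long>& buckets) {
    std::map<int, long> collapsed;
    for (const auto& kv : buckets) {
        const int b = kv.first;
        // ceil(b / 2.0) in integer arithmetic, valid for negative b too;
        // both members of a pair (odd i, i+1) land on the same j.
        const int j = (b >= 0) ? (b + 1) / 2 : b / 2;
        collapsed[j] += kv.second;
    }
    return collapsed;
}
```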
Quantile estimation can be done by using the Query procedure, whose pseudocode is shown as Algorithm 4. The input is a quantile 0 ≤ q ≤ 1. To estimate the quantile value, the procedure determines the index of the first bucket with count greater than zero, and then it sums the subsequent nonzero counters' values as long as the running sum does not exceed q(n − 1). Letting i be the index of the last bucket considered, the final estimation is obtained as \(\tilde {x}_{q} = 2\gamma ^{i}/({\gamma +1})\). A theoretical bound on the accuracy and the actual accuracy achieved experimentally by UDDSketch on several distributions are available in [7].
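A corresponding sketch of the query, again under the map-based representation (tie-breaking details may differ from the published pseudocode of Algorithm 4):

```cpp
#include <cmath>
#include <map>

// Estimate the q-quantile from the bucket counters: walk the buckets in
// increasing index order while the cumulative count does not exceed
// q * (n - 1), then return 2 * gamma^i / (gamma + 1) for the last bucket
// index i considered. Assumes a non-empty sketch with n inserted items.
double query(const std::map<int, long>& buckets, double gamma,
             long n, double q) {
    const double rank = q * static_cast<double>(n - 1);
    long cumulative = 0;
    int i = buckets.begin()->first;
    for (const auto& kv : buckets) {
        cumulative += kv.second;
        i = kv.first;
        if (static_cast<double>(cumulative) > rank) break;
    }
    return 2.0 * std::pow(gamma, i) / (gamma + 1.0);
}
```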
Since the computational complexity of inserting an item into the sketch is O(1) in the worst case, it follows that inserting the s differences requires at most O(s) time. Therefore, from a theoretical perspective, afqn shares the same complexity as fqn; however, as we shall show in Section 5, in practice afqn is faster.
4 A robust statistical approach to outlier detection
An outlier detector working in streaming can be designed using a temporal window which slides forward one item at a time. Denoting with W the current window and by w the semi-window size, the algorithm processes the stream in windows \(W = \langle \sigma_{i-2w}, \ldots, \sigma_i \rangle\) of size s = 2w + 1. Once the first window has been processed, the new one is obtained by sliding the window one item ahead when the next item arrives from the stream. Letting i − 2w be the index of the first item in a window W (i.e., the oldest one), the item under test in W is the one located at index i − w.
Traditionally, outlier detection has been based on the z-scores of the observations given by \(z_{score}=\frac {x-\mu }{\sigma }\) where x denotes the observation under test, and μ and σ denote respectively the mean and the standard deviation of the observations. A different outlierness test has been proposed in [18], using robust estimators such as the median and the MAD (Median Absolute Deviation from the median of the observations): (x − median(W))/MAD.
As an example application for outlier detection, we use a slightly different z-score, in which we substitute the Qn estimator in place of MAD, obtaining the following outlierness test: |x − median(W)|/Qn. The reasons for preferring the Qn estimator to MAD are its greater Gaussian efficiency (82% versus 36%) and its ability to deal with skewed distributions [17].
Denoting the item under test with \(x = \sigma_{i-w}\), in order to determine whether x is an outlier we proceed as follows. We begin by computing the quantile q according to Eq. 3, considering only the values in the current window. Next, we determine med, the value of W corresponding to the median order statistic, and compute qn, the Qn dispersion for the window W.
Let t be a scalar integer acting as a multiplier of the Qn dispersion. In practice, t is used to control the degree of outlierness of an item. To determine if an item x is an outlier, we check the following condition (corresponding to the devised outlierness test): |x − med| > t ⋅ qn; if the condition is true, then x is an outlier, otherwise x is an inlier (i.e., a normal observation). In practice, the condition |x − med| > t ⋅ qn identifies as outliers those points that are not within t times the Qn dispersion from the sample median; regarding t, a commonly used value is t = 3 [3, 19]. The pseudocode is shown as Algorithm 6.
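The test itself amounts to a single comparison; a minimal sketch, assuming med and qn have already been computed for the current window as described above:

```cpp
#include <cmath>

// Outlierness test: flag x when it lies more than t times the Qn
// dispersion away from the window median (t = 3 is a common choice).
bool is_outlier(double x, double med, double qn, double t = 3.0) {
    return std::fabs(x - med) > t * qn;
}
```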
5 Experimental results
In this Section, we present and discuss experimental results, thoroughly comparing our algorithms fqn and afqn. The former exactly determines the Qn values for its input data stream, whilst the latter provides an approximation for the Qn values. However, we do not directly compare the two algorithms with regard to the obtained Qn values, owing to the fact that the accuracy of UDDSketch has been thoroughly analyzed, both theoretically and experimentally in [7]. Rather, we shall compare the results obtained by the two algorithms when the computed Qn values are used for outlier detection as discussed in Section 4.
In particular, we shall compare and contrast the algorithms with regard to their speed, measured as the number of updates per second, and with regard to their accuracy related to the detection of the outliers in the input data stream. For this purpose, the following metrics shall be reported: Recall, Precision, F1 score and Jaccard similarity between the sets of outliers determined by the algorithms. We shall assume that the set of outliers determined by fqn is the ground truth, i.e., the reported outliers are the actual outliers.
Table 1 reports the metrics used. Recall is related to the number of false negatives, and ranges between 0 and 1. It is 1 when there are no false negatives, i.e., all of the true outliers have been correctly reported in output, and it is 0 when no true outlier is reported. Similarly, precision is related to false positives. Its value ranges between 0 and 1: it is 1 when there are no false positives, i.e., no false outlier is reported in output, and it is 0 when all of the items reported are false outliers. The F1 measure takes into account both recall and precision, being their harmonic mean. F1 is 1 when both precision and recall are 1, and is 0 if either the precision or the recall is 0. Finally, the Jaccard similarity J(A, B) of two sets A and B is defined as the ratio \(\frac {|A \cap B|}{|A \cup B|}\); when the sets A and B are both empty, the Jaccard similarity is defined as J(A, B) = 1.
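For concreteness, a small sketch of how these metrics can be computed from the two sets of reported outlier positions (helper names are ours):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Jaccard similarity of two outlier sets, with J(A, B) = 1 when both are empty.
double jaccard(const std::set<long>& a, const std::set<long>& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::vector<long> inter;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(inter));
    const double i = static_cast<double>(inter.size());
    return i / (static_cast<double>(a.size() + b.size()) - i);
}

// F1 score (harmonic mean of precision and recall) of the reported
// outliers against the ground truth set.
double f1_score(const std::set<long>& truth, const std::set<long>& reported) {
    std::vector<long> inter;
    std::set_intersection(truth.begin(), truth.end(),
                          reported.begin(), reported.end(),
                          std::back_inserter(inter));
    const double tp = static_cast<double>(inter.size());
    const double precision = reported.empty() ? 0.0 : tp / reported.size();
    const double recall = truth.empty() ? 0.0 : tp / truth.size();
    return (precision + recall > 0.0)
               ? 2.0 * precision * recall / (precision + recall) : 0.0;
}
```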
The afqn algorithm has been implemented in C++. The source code has been compiled using the Apple clang compiler v11.0.3 on macOS Big Sur version 11.1 with the following flags: -Os -std=c++14. The tests have been carried out on an Apple MacBook Pro laptop equipped with 32 GB of RAM and a 2.6 GHz hexa-core Intel Core i7 processor with 12 MB of level 3 cache. The source code is freely available for inspection and for reproducibility of results. The tests have been performed on synthetic datasets, according to the distributions shown in Table 2; moreover, we also perform our experiments on a real dataset, kindly provided by Yahoo as part of its Webscope program: Yahoo! Synthetic and real time-series with labeled anomalies, version 1.0. In particular, we have added the results obtained running the experiments on the dataset real_17, available in the A1Benchmark within the whole dataset, considering this dataset both with and without the provided human curated annotations which explicitly report the anomalies. This dataset contains 1424 items, 227 of which have been annotated as outliers.
It is worth noting here that, once again, computing the median in Algorithm 6 could be done approximately by using another instance of an UDDSketch data structure, since the median is just another quantile. However, we recall that our purpose is to compare, ceteris paribus, the results obtained by the two algorithms when the computed Qn values are used for outlier detection; since in fqn we compute the median exactly, had we approximately computed the median in afqn we would have unfairly biased the obtained results.
The median of the window W can be computed exactly in the worst case in O(s) time by using the Select algorithm [1] but, as already observed in Section 2, the QuickSelect algorithm [8, 10] is much faster and is therefore the algorithm used in practice, despite being linear in the input size only on average.
In the streaming setting, the median can be computed exactly in O(1); the key idea is to maintain the window in sorted order, so that the median can be directly accessed in constant time. This requires handling two different permutations of the window: one is W, which represents the actual order in which incoming items arrive from the stream, and the other is π, which represents the sorted permutation of the items in the window. Therefore, a tradeoff is in place: we use additional space (2s total space instead of s, but note that the amount of space used is still O(s)) for the data structure representing π, but in return we are able to compute the median exactly in O(1).
In our implementation π is an array, storing the items \(\pi_1, \cdots, \pi_s\). To maintain π in sorted order, items are inserted in π as in InsertionSort [4]. Note that we do not sort π: upon a new item's arrival, we remove the least recent (in the temporal sequence of item arrivals) item; since the previous window was already sorted, removing the least recent item leaves the window sorted. The new item is inserted using the InsertionSort insertion procedure, which requires O(s) worst case time. Indeed, we can simply use a backward linear scan of π from right to left until we find the index, call it j, where the new item must be inserted. Then, we insert it by sliding all of the items \(\pi_{j+1}, \cdots, \pi_s\) one position to the right. Alternatively, a binary search [4] can be used to determine the index j in time \(O(\lg s)\), but the total time required for the insertion in π is still O(s), owing to the need to shift all of the items \(\pi_{j+1}, \cdots, \pi_s\) one position to the right. When the window's size s increases, this implementation is, in practice, experimentally faster than using the QuickSelect algorithm, even though the computational complexity is the same, O(s). Figure 2 depicts the implementation.
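A minimal sketch of this sorted-window maintenance, using the binary-search variant to locate positions (helper names are ours; the O(s) shifts are performed by the vector operations):

```cpp
#include <algorithm>
#include <vector>

// Maintain the sorted permutation pi of the window: remove the evicted
// item (pi stays sorted), then insert the new item InsertionSort-style.
// Both steps are O(s) due to element shifts; the median is read in O(1).
// Assumes the evicted value is actually present in pi.
void slide_sorted(std::vector<double>& pi, double evicted, double incoming) {
    auto it = std::lower_bound(pi.begin(), pi.end(), evicted);
    pi.erase(it);  // O(s) shift; the remaining items stay sorted
    pi.insert(std::upper_bound(pi.begin(), pi.end(), incoming), incoming);
}

double window_median(const std::vector<double>& pi) {
    return pi[pi.size() / 2];  // s = 2w + 1 is odd, so this is exact
}
```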
For each distribution, the algorithms have been executed three times and their results have been averaged. Figures 3, 4, 5 and 6 depict the precision, recall, F1 score and Jaccard similarity achieved by our afqn algorithm, varying its initial accuracy parameter and fixing the number m of buckets to s/2, where s is the window's size. As shown, afqn provides excellent accuracy in practice for all of the distributions under test, even using a very small number of buckets, equal to one half of the window's size. Only for the halfnormal and normal distributions is the use of 100 or 200 buckets not enough to achieve accuracy values greater than 0.9 (we recall here that 1.0 is the maximum score for each of the accuracy metrics under consideration). However, even for these two distributions the accuracy results are still extremely high.
The accuracy values obtained must also be discussed from the perspective of the actual running time required to achieve them. Figures 7, 8, 9 and 10 depict the running time of both fqn and afqn with regard to the number of updates per second (higher values are better). In particular, we vary both the initial accuracy parameter and the number m of buckets (s/2, s, 3s/2 and 2s).
Regarding afqn, we show the updates per second when varying its initial accuracy parameter. As shown, afqn is always faster than fqn, with the notable exception of the poisson and zipf distributions. Even though the computational complexity of both fqn and afqn is the same, i.e., linear in the window's size, the constant hidden by the asymptotic notation O(s) (in which s denotes the window's size) is different, leading to these experimental results, in which afqn is shown to be empirically faster in general than fqn. Regarding the specific behaviour of fqn on the poisson and zipf distributions, the reason why fqn beats afqn with regard to the number of updates per second is that these are discrete distributions with a few distinct items. In this case, fqn has an advantage owing to how its selection algorithm works [2]. In particular, afqn is up to three times faster than fqn when using a very small number of buckets (equal to 100) and up to two times faster when using a greater number of buckets (in the range 200-500). Therefore, the experiments confirm the validity of our approach: afqn proves to be extremely fast whilst providing, simultaneously, excellent accuracy in almost all of the cases of practical interest.
Regarding the real dataset, Figs. 11 and 12 depict the experimental results. In particular, the former figure refers to the dataset without the provided human curated annotations which explicitly report the anomalies; in this case, we assume that the set of outliers determined by fqn is the ground truth, i.e., the reported outliers are the actual outliers. The latter figure refers instead to the dataset when considering the annotations as ground truth. It is worth noting here that, owing to its length, we choose to process the dataset using an appropriate window size. Therefore, in our experiments we vary the window's size from 301 to 501 using a step size equal to 50.
As shown in Fig. 11, the behaviour of afqn does not change when processing a real dataset. Finally, Fig. 12 shows that the F1 score when considering the annotations is consistently above 0.8, and therefore the Qn scale estimator provides very good outlier recognition.
6 Conclusions
In this paper we have introduced afqn (Approximate Fast Qn), a novel algorithm for approximate computation of the Qn scale estimator in a streaming setting. The need for approximate estimation of the Qn estimator arises owing to the fact that exact computation may be too costly for some applications, and the problem is a fortiori exacerbated in the streaming setting, in which the time available to process incoming data stream items is short. We designed afqn to approximate the Qn estimator quickly and with high accuracy. As an application, we have also shown the use of afqn for fast detection of outliers in data streams. The incoming items are processed in the sliding window model, with a simple check based on the Qn scale estimator. Extensive experimental results on synthetic and real datasets have confirmed the validity of our approach, since afqn is actually fast and can provide almost exact results.
References
Blum M, Floyd RW, Pratt VR, Rivest RL, Tarjan RE (1973) Time bounds for selection. J. Comput. Syst. Sci. 7(4):448–461
Cafaro M, Melle C, Pulimeno M, Epicoco I (2021) Fast online computation of the qn estimator with applications to the detection of outliers in data streams. Expert Syst Appl 164:113831. https://doi.org/10.1016/j.eswa.2020.113831. http://www.sciencedirect.com/science/article/pii/S0957417420306424
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: A survey. ACM Comput Surv 41(3):1–58. https://doi.org/10.1145/1541880.1541882
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms, Third Edition, 3rd edn. The MIT Press
Croux C, Rousseeuw PJ (1992) Time-efficient algorithms for two highly robust estimators of scale. In: Dodge Y, Whittaker J (eds) Computational statistics. Physica-Verlag HD, Heidelberg, pp 411–428
Datar M, Gionis A, Indyk P, Motwani R (2002) Maintaining stream statistics over sliding windows: (extended abstract). In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’02. Society for Industrial and Applied Mathematics, Philadelphia, pp 635–644
Epicoco I, Melle C, Cafaro M, Pulimeno M, Morleo G (2020) UDDSketch: Accurate tracking of quantiles in data streams. IEEE Access 8:147604–147617. https://doi.org/10.1109/ACCESS.2020.3015599
Floyd RW, Rivest RL (1975) Expected time bounds for selection. Commun ACM 18(3):165–172
Hampel FR (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69 (346):383–393. https://doi.org/10.1080/01621459.1974.10482962
Hoare CAR (1961) Algorithm 65: find. Commun ACM 4(7):321–322
Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126. https://doi.org/10.1023/B:AIRE.0000045502.10941.a9
Johnson D, Mizoguchi T (1978) Selecting the k-th element in X + Y and X_1 + X_2 + ⋯ + X_m. SIAM J Comput 7(2):147–153. https://doi.org/10.1137/0207013
Masson C, Rim JE, Lee HK (2019) DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees. Proc VLDB Endow 12(12):2195–2205. https://doi.org/10.14778/3352063.3352135
Mirzaian A, Arjomandi E (1985) Selection in X + Y and matrices with sorted rows and columns. Inf Process Lett 20(1):13–17. https://doi.org/10.1016/0020-0190(85)90123-1
Muthukrishnan S (2005) Data streams: Algorithms and applications. Found Trends Theor Comput Sci 1(2):117–236. https://doi.org/10.1561/0400000002
Nunkesser R, Schettlinger K, Fried R (2008) Applying the qn estimator online. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R (eds) Data analysis, machine learning and applications. Springer, Berlin, pp 277–284
Rousseeuw PJ, Croux C (1993) Alternatives to the median absolute deviation. J Am Stat Assoc 88(424):1273–1283. https://doi.org/10.1080/01621459.1993.10476408
Rousseeuw PJ, Hubert M (2011) Robust statistics for outlier detection. WIREs Data Min Knowl Discov 1(1):73–79. https://doi.org/10.1002/widm.2
Shewhart WA (1931) Economic control of quality of manufactured product. Macmillan And Co Ltd, London
Funding
Open access funding provided by Università del Salento within the CRUI-CARE Agreement.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.