CHAPTER 3
MINING DATA STREAMS

INSIDE THIS CHAPTER
After reading this chapter, you should be able to understand:
* Introduction to the Stream Concept
* Data Streams in Data Mining Techniques
* Tools and Software for Data Streams in Data Mining
* Stream Data Model and Architecture
* Streaming Architecture Patterns
* Stream Computing
* Big Data
* Sampling Data in a Stream
* Filtering Streams
* Counting Distinct Elements in a Stream
* Estimating Moments
* Counting Oneness in a Window
* Decaying Window
* Real Time Analytics Platform (RTAP)
* Case Studies

3.1 INTRODUCTION TO THE STREAM CONCEPT
With the advent of real-time online applications, data repositories on the World Wide Web are growing faster than ever before. As data has grown exponentially, applications have started using data mining techniques to analyse this huge amount of data and to surface the trends and patterns required for business intelligence, which in turn supports well-informed decisions. Mining data streams has therefore become an important and active research area for real-time decision-making and is widespread across several fields of computer science and engineering. Data mining techniques thus help handle the challenges of storing and processing huge amounts of data.

Imagine a factory with 500 sensors, each capturing 10 KB of information every second: the factory collects roughly 18 GB of data in an hour and about 432 GB every day. This massive amount of information needs to be analysed in real time (or in the shortest time possible) to detect irregularities or deviations in the system and to react quickly. Stream mining enables the analysis of large amounts of data in real time.

Streaming data refers to data that is continuously generated, usually in high volumes and at high velocity. A streaming data source typically consists of continuous, time-stamped logs that record events as they happen, such as a user clicking on a link on a web page or a sensor reporting the current temperature.

In recent years, advances in hardware technology have facilitated new ways of collecting data continuously. In many applications, such as network monitoring, the volume of such data is so large that it may be impossible to store it all on disk. Furthermore, even when the data can be stored, the volume of incoming data may be so large that it cannot be processed more than once, which makes the underlying problems even more difficult from an algorithmic and computational point of view.

* A stream is a sequence of data elements or symbols made available over time.
* A data stream is transmitted from a source and received at the processing end in a network.
* A continuous stream of data flows between the source and the receiver's end and is processed in real time.
* The term also refers to the communication of bytes or characters over sockets in a computer network.
* A program uses a stream as an underlying data type in inter-process communication channels.

Data streams are a continuous flow of data. Examples of data streams include network traffic, sensor data, call centre records, and so on. Their sheer volume and speed pose a great challenge for the data mining community. Data streams exhibit several unique properties: infinite length, concept-drift, concept-evolution, feature-evolution and limited labelled data. Concept-drift occurs when the underlying concept of the data changes over time. Concept-evolution occurs when new classes evolve in the stream. Feature-evolution occurs when the feature set itself varies with time.
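To make the single-pass, bounded-memory nature of stream processing concrete, the following sketch consumes an unbounded stream one element at a time, keeps only a running count and mean, and flags a possible concept-drift when the mean of a recent window moves away from the long-run mean. This is an illustrative sketch only; the stream source, window size and threshold are assumptions, not part of any particular system.

    import random
    from collections import deque

    def sensor_stream():
        """Hypothetical unbounded stream of sensor readings (an assumption)."""
        while True:
            yield random.gauss(20.0, 1.0)           # e.g. temperature readings

    def monitor(stream, window=100, threshold=3.0):
        count, mean = 0, 0.0                        # long-run statistics, O(1) memory
        recent = deque(maxlen=window)               # bounded buffer of recent values
        for x in stream:                            # each element is seen exactly once
            count += 1
            mean += (x - mean) / count              # incremental (running) mean
            recent.append(x)
            if len(recent) == window:
                recent_mean = sum(recent) / window
                if abs(recent_mean - mean) > threshold:
                    print(f"possible concept drift after {count} elements")

    # monitor(sensor_stream())                      # would run forever on the infinite stream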
Data streams also suffer scarcity of labeled data, as it is not possible to manually label all the data points in the stream. Each of these properties adds a challenge to data stream mining. Data Stream is a continuous, fast-changing, and ordered chain of data transmitted at a very high speed. It is an ordered sequence of information for a specific interval. The sender's data is transferred from the sender's side and immediately shows in data streaming at the receiver's side. Streaming does not mean downloading the data or storing the information on storage devices. Data Stream Mining is the process of extracting knowledge from continuous rapid data tecords which comes into the system in a stream. A Data Stream is an ordered sequence of instances in time. | Data Stream > ee Mining | Techniques Fig. 3.1. Data Stream in Data Mining DATA ANnayy aD ining is extracting knowledge and valuable insights from Data streams in data mi stream processing software. Data Streams in Data Mining us stream of data using I concepts of machine learning, knowledge extraction, ang t of general : i large amount of ams in data mining, data analysis of a larg data Needs continuol be considered a subset i data stree A cted in data steam Mining r data mining. In f knowledge is extra ePresenieg in real-time. The structure o! be done in real-time. infinite streams of information. in the case of models and patterns of “ vdered {implicitly by entrance the or oka ae ain titoms It isnot feasible to control the order in which units arrive, ie lparviia cee capture steam in its entirety. It is enormous volumes of data; items that arrive at a high speed. 3.1.1 Types of Data Streams A data stream is a (possibly unchained) sequence of tuples. Each tuple comprises a set of attributes, similar to a row in a database table. (i) Transactional Data Stream: It is a log interconnection between entities 1. Credit card — purchases by consumers from producer 2. Telecommunications — phone calls by callers to the dialed parties 3. Web — accesses by clients of information at servers (ii) Measurement Data Streams: 1. Sensor Networks ~ a physical natural phenomenon, road traffic 2. IP Network — traffic at router interfaces 3. Earth climate - temperature, humidity level at weather stations The Stream Data Model » We can view a stream processor as a kind of lat aarti system, the high-level organisation of which is shown in below Fig. 3.2. Any number of streams can enter the system. Each stream can Provide elements at its own schedule: they need not have the same data rates or data types, and the time between elements of one stream need not be uniform. The fact that the ri the control of the system distinguish on within a database-management yun OT STREAMS 133 ‘Ad-hoc Queries Streams entering 1,5,2,7,4,0,3,5 Qwerty uio 0, 1, 1,0, 1,0, 0,0 Output streams Stream Processor + time Working Storage Archival Storage Fig. 3.2 Stream data model Examples of Stream Sources 1. Sensor Data In navigation systems, sensor data is used. Imagine a temperature sensor floating about in the ocean, sending back to the base station a reading of the surface temperature each hour. The data generated by this sensor is a stream of real numbers. We have 3.5 terabytes arriving every day and we, for sure, need to think about what can be kept continuing and what can only be archived. 2. Image Data Satellites frequently send down-to-earth streams containing many terabytes of images per day. 
Surveillance cameras generate images with lower resolution than satellites, but there can be numerous of them, each producing a stream of images at a break of 1 second each. 3. Internet and Web Traffic A bobbing node in the center of the internet receives streams of IP packets from many inputs and paths them to their outputs. Websites receive streams of heterogeneous types. For example, Google receives a hundred million search queries per day. 3.1.2 Characteristics of Data Streams 1. Large volumes of continuous data, possibly infinite 2. Steady changing and requires a fast, real-time response. 3. Data stream captures nicely our data processing needs. 4. Random access is expensive with a single scan algorithm. cE: so far. Me ry of the data seen / | 5, Store only the _ are at a much lower level or are multidimensionay i | . Maximum stream ee i‘ tt n ore. | 6. needs multilevel and multidimensional treatment oy, | 3.1.3 Applications of Data Streams 1. Fraud perception 2. Real-time goods dealing 3. Consumer enterprise 4. Observing and describing on inside IT systems 3.1.4 Advantages of Data Streams * This data is helpful in upgrading sales Help in recognising the fallacy © Helps in minimising costs * It provides details to react swiftly to risk 3.1.5 Disadvantages of Data Streams * Lack of security of data in the cloud * Hold cloud donor subordination * Off-premises warehouse of details introduces the probable for disconnection Bee DATA STREAMS IN DATA MINING TECHNIQUES Data A , - : ° Steams in Data Mining techniques are implemented to extract patterns and insights from ‘ata stream. A vast range of algorithms is available for stream mining qt Data Streams in Data Mining Techniques ere are four main algorithms used for D ata Streams in Data Mini i 1. Classification neem it - > Lazy Classifier or « Lazy Classifier oF K-NN K-NN * Naive Bayes * Naive Bayes © Decision Trees * Decision Trees * Logistic Regression * Linear Regression * Ensembles ao * Ensembles Data Stream Mining Techniques N Frequent Clustering Pattern Minin: + K-means + Apriori | * Hierarchical + Eclat + FP-growth | based | Lo Fig. 3.3 Data Streams in Data Mining Techniques Generally speaking, a stream miniug classifier is ready to do either one of the tasks at any moment: «Receive an unlabeled item ani «Receive labels for past known items an‘ ithms gorithms for predicting the labels for data streams. 1d predict it based on its current model. d use them for training the model. The best Known Classification Algori' Let's discuss the best-known classification al + Lazy Classifier or k-Nearest Neighbour The k-Nearest Neighbour or k-NN classifier predicts the new items’ class labels based on the dlass label of the closest instances. In particular, the lazy classifier outputs the majority class label of the k instances closest to the one to predict. + Naive Bayes Naive Bayes is a classifier based on Bayes’ theorem. It is a probabilistic model called ‘naive’ because it assumes conditional independence between input features. The basic idea is to compute a probability for each one of the class labels based on the attribute values and select the class with the highest probability as the label for the new item. * Decision Trees isthe name signifies, the decision tree builds a tree structure from training data, and then the Data An, Aes luce messages that consist of 148 rces prod wn “- Event splitter err data source pe used tosplita Business event into ied . 
ent splitter P* ler events can be split into multiple events be anaes commerce ord to te te Order Order Order ltem1 Item 2 Item 3 xampl order item "or al Spiiter New Order Fig. 3.8 Event Splitter + Claim-check pattern: A messaging" -based architecture must be capable of sending, receiving, and manipulating large messages. These use cases can be related to image recognition, video processing, etc. Sending such large messages to the message bus directly is not recommended. The solution is to send the claim check to the messaging platform and store the payload to an external service. Message with Claim Check Sending Receiving mI eee | A [Saar Input Stream | cae Store | Fig. 3.9 Claim-check Pattern — Output Stream . Event grouper: Sometimes events become significant only after they’ve happened seu a limes. For example parcel delivery will be attempted three times before we ask the. collect it from the depot. How can we wait for N logically similat The solution is to consider re elated events as a grot by a given key, and then count the occurrences of that key, Fortis, we peed fo group ia Parcel ID Fig. 3.10(a) Event Grouper Vv Dara STREAMS ;™ id group ws For time-based grouping, we can group related even dows: for example, 5 minutes or 24 hours. wi - «+ Event aggregator: Combining multiple events into a single encompassing event, —— window _ calculating the average, median, or percentile on the incoming business data, is a common task in event streaming and real-time analytics. How can multiple related events be aggregated to produce a new event? We can combine event groupers and event aggregators. The groupers prepare the input events as needed for the subsequent aggregation step, e.g. by grouping the events based on the data field by which the aggregation is computed (such as order ID) and/or by grouping the events into time windows (such as 5-minute windows). The aggregators then compute the desired aggregation for each group of events, e.g., by computing the average or sum. of each 5-minute window. ts into automatically created time Fig. 3.10(b) Event Grouper Event Grouper Grouped Stream INSERT Event Aggregator UPDATE Be a DELETE ——._____—__»> Fig. 3.11. Event Aggregator * Gateway routing: When a client requires multiple services, in that case, setting up a separate endpoint for each service and having the client to manage each endpoint can be challenging. For example, an e-commerce application might provide services such as search, reviews, cart, checkout, and order history. Each service has a unique API with which the client communicates, and also that the client is aware of all endpoints in order to connect to the services fan API is changed, the client must also be updated. When a service is refactored into two or more distinct services, the code in case of the service and the client must change. Gateway ‘Outing allows the client applications to know about and communicate with a single endpoint 146 Using a Single API Web UI (browser) Fig. 3.12 Gateway Routing * CQRS: It is common in traditional architectures to use the same data model for effective way In more complex applications, this approach both querying and updating databases. That is a straightforward an for basic CRUD operations. However, i ‘Some options form Consuming events Item key Item name. Quantity i a Materialized View Query for cur Replayed events sores Fig. 3.13(a) Command and Query Responsibility ‘Segregation (CQRS) DATA STREAMS Gay 147 Mw can become cumbersome. 
For example, the application May runa variety of 1 f 'ous shapes. Command and Query Responsibiig, Segregation one, 's a pattern for separating read and Spee emotity data store. Cline an improve the performance, scalability, and security of sean application. Migrating to CRS gives a system more flexibility over time, and t prevents update commands from causing merge conflicts at the domain level Writes to Module A. J <_ oO ee Poy + Other Writes Zt fe] and Reads Client Change Vata Reads from Capture (CDC) module A Service A naar fea — oft —oGce = Fig. 3.13(b) Command and Query Responsibility Segregation (CORS) 3.6 STREAM COMPUTING Traditional Power Grid is transformed to Smart Grid by utilising the technological advancements of Information and Communication Technology. In Smart Grid, the data is flowing between different components of the system. Advanced and online analytic on this massive data is required to trigger instantaneous action for grid operation and management. Traditional Bis Data handling techniques store and analyse high volume data with variety, but failed to hand le high velocity data. Stream computing can analyse high velocity data with variety which is essentially required in online data analytics. A high-performance computer system that analyses multiple [jose data streams from many sources live. The word ‘stream’ in stream | Stream Computing computing is used to mean pulling in streams of data: processing them and streaming them back out as a single flow. Stream computing uses ° software algorithms that analyse the data in real time as it steams into | increasing the speed and accuracy while dealing with data handling | and analysis, DATA ANaurnics “ I tinuous analysis of massive volumes of streaming data with sub. Enables con i : Hh yonse times. i re ge instruction multiple data computing paradigm to solve Certain Jsage o! sc blems. computational prot . i it reads data from Stream computing is a computing paradigm that reads data fro collections of oftware or hardware sensors in stream form and computes continuous data stream, Stream computing uses software programs that compute continuous data streams Stream computing uses software algorithm that analyse the data in real time Stream computing is an effective way to support Big Data by Providing extremely low-latency velocities with massively parallel processing architectures, It is becoming the fastest and most efficient ways to obtain useful knowledge from Big Data . | | Stream computing, the long-held dream of ‘high real-time computing’ and ‘high. throughput computing’, with programs that compute continuous data streams, have opened up a new era of future computing through Big Data, which is a datasets that is large, fast, dispersed, unstructured, and also beyond the ability of available hardware and software facilities to undertake their acquisition, access, analytics, and application in a reasonable amount of time and space. Stream computing is a computing paradigm that reads data from collections of software or hardware sensors in a stream form and computes continuous data streams, where feedback results should be in a real-time data stream as well. A data stream is a sequence of datasets, and a continuous stream of an infinite sequence of datasets, and parallel streams that have more than one stream to be processed at the same time. 
Stream computing is one effective way to support Big Data by providing extremely low-latency velocities with massively parallel processing architectures, and is becoming the fastest and most efficient way to obtain useful knowledge from big data, allowing organisations to react quickly to the problems appearing or to predict new trends of predictions, in the near future. aes BIG DATA Big data is defined as a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a ‘massive scale’. Big Data is now characterised by four Vs — volume, velocity, variety and value. Including all four Vs we can say that big data extract deep value from hi i a dots wingers ab P value from high velocity, volume and vat 3.7.1 Big Data Analytics and Tools Big Data analytics is the use of advanced analytical techniques [jose t : agains very age, diverse datasets that include different types such Big Data | u instructured and sizes varying from terra bytes to zeta in Action \ Pattern analysis or continuous queries Scan QR Code_ nin Dasa STREAMS Ga) ” tems process continuous unbounded stream of data. This type of processing is called such muti, especially, required for certain application areas like trading, fraud detection, 120 onitoring, outage detection and management, demand response programs in smart od Big Data architecture combines, Data warehouses (DWH), Hadoop and real timne gr sgsing. Big Data architecture comprises tools for storage, complex computing, immediate ion, and monitoring of streaming data in real time. Data warehouses (DWH) are used for to pache Hadoop is an open source software library, is a framework that allows for the 1M ed processing of large datasets across clusters of commodity hardware using simple en amming models. It is designed to scale up from single servers to thousands of machine Fach offering local computation and storage. This follows a transaction based architecture, where events are stored on main frame or database, analysed and action performed. At the same time stream computing tools monitor millions of events in a specific time window to react proactively, they are behaviour based architecture where events are analysed in real time and action performed and then stored in databases for further analytics. 3.7.2 Big Data Stream Computing «Stream computing is a way to analyse and process Big Data in real time to gain current insights to take appropriate decisions or to predict new trends in the immediate future « Implements in a distributed clustered environment «High rate of receiving data in stream 3.7.3 Need of Stream Computing ‘Advances in information technology have facilitated large volume, high-velocity of data, to be stored continuously leading to several computational challenges. Due to the nature of Big Data in terms of volume, velocity, variety, variability, veracity, volatility, and value that are being generated recently, Big data computing is a new trend for future computing. Big Data computing can be generally categorised into two types based on the processing requirements, which are — Big Data batch computing and Big Data stream computing. Big data batch processing is not sufficient when it comes to analysing real-time application scenarios. Most of the data generated in a real-time data stream need real-time data analysis. In addition, the output must be generated with low-latency. 
3.7.4 Key Issues in Big Data Stream Analysis Big Data stream analysis is relevant when there is a need to obtain useful knowledge from Current happenings in an efficient and speedy manner in order to enable organisations to react to situations promptly, or to detect new trends which can help improve their performance However, there are some challenges such as scalability, integration, fault-tclerance, timeliness, Consistency, heterogeneity and incompleteness, load balancing, privacy issues, and accuracy Which arise from the nature of Big Data streams that must be dealt with. Scalability i of the main challenges in Big Data streaming analysis is the issue of scalability. The big er ‘am is experiencing exponential growth in a way much faster than computer resources. should ben follow Moore's law, but the size of data is exploding. Therefore, research efforts eared towards developing scalable frameworks and algorithms that will accommodate _ oo" DATA Avni data stream computing mode, effective resource allocation strategy, and parallelisatign to cope with data's ever-growing size and complexity, Integration Building a distributed system wherein each node has a view of the data flow, that is ve node performing analysis with a small numberof sources, then aggregating these views pep a global view is nonstrivial. An integration technique should be designed to enable eflicen operations across different datasets. : Fault-tolerance High fault-tolerance is required in life-critical systems. As data is real-time and infinite in Big Dat, stream computing environments, a good scalable high fault-tolerance strategy is required thay allows an application to continue working despite component failure without interruption, Issey Timeliness Time is of the essence for time-sensitive processes such as mitigating security threats, thwarting fraud, or responding to a natural disaster. There is a need for scalable architectures or platform, that will enable continuous processing of data streams which can be used to maximise thy timeliness of data. The main challenge is to implement a distributed architecture that wi aggregate local views of data into global view with minimal latency between communicating Consistency Achieving high consistency (i.e. stability) in Big Data stream computing environments is noo- trivial as it is difficult to determine which data is needed and which nodes should be consistent. Hence, a good system structure is required Heterogeneity and Incompleteness Big Data streams are heterogeneous in structure, organisations, semantics, accessibility and granularity. The challenge here is how to handle an ever-increasing data, extract meaningful content out of it, aggregate and correlate the streaming data from multiple sources in real-time A competent data presentation should be designed to reflect the structure, diversity and hierarchy of the streaming data. Load Balancing A Big Data stream computing system is expected to be self-adaptive to data stream changes and avoid load shedding. This is challenging as dedicating resources to cover peak loads 247 is impossible and load shedding is not feasible when the variance between the average load and the peak load is high. As a result, a distributed environment that automatically stra partial data streams to a global centre when local resources become insufficient are required. 
High Throughput A decision with respect to identifying the sub-graph that needs replication, how many replie® are needed and what the portion of the data stream to be assigned to each replica, is 2" ive in a big data stream computing environment. There is a need for good multiple instance for replication, if high throughput is to be achieved. he S wee eee 151 privacy Big Data stream analytics created opportunities for analysing a huge amount of data in real. time but also Created a big threat to the individual privacy. According to the International Dele Cooperation tore more than half of the entire information that needs Protection is effectively protected. The main challenge is proposing techniques for protecting a bj stream dataset before its analysis, pean 4 , Sse daa Accuracy One of the main objectives of big data stream analysis is to develop effective techniques that can accurately predict future observations. However, as a result of inherent characteristics of big data such as volume, velocity, variety, variability, veracity, volatility, and value, big data analysis strongly constrain processi ing algorithms spatio-temporally and hence stream-specific requitements must be taken into consideration to enst ” sure high accuracy. Stream Computing Applications « Financial sectors * Business intelligence « Risk management * Marketing management * Search engines * Social network analysis SAMPLING DATA IN A STREAM. Nowadays, users generate a huge amount of data on the internet. This data, which is increasing over time, is called a data stream. There are some important properties about the data stream that distinguish it from other data. First, we do not know the entire dataset in advance. At any time step, what the following data looks like is invisible for us. Second, the data elements enter one by one, forming a time series. Third, the data stream is typically very large so it is not possible to store the entire data stream. These properties prevent the traditional deterministic algorithms from being applied and finding out the objectives or characteristics of the data stream, is therefore difficult. However, analysation of data streams is very important as has many applications. The data stream techniques are used to track the online query tendency, unusual user behaviour, social networks news feeds, sensor networks, telephone call records, IP packets monitoring, etc. Therefore, what we are concerned about are the methods of making critical calculations in the data stream using limited amount of memory. Stream sampling is the process of collecting a representative To See sample of the elements of a data stream. The sample is usually much _ | Sampling Data in a Stream smaller than the entire stream, but can be designed to retain many in Aston important characteristic of the stream, and can be used to estimate Many important aggregates in the stream. Unlike sampling from a stored dataset, stream sampling must be performed online, when the data arrives. Any element that is not stored within the sample is lost forever and cannot be retrieved. Good examples of data streams are _ N as) Data ANatyncy Google search queries or trending items on Twitter. These huge datasets are worthwhile to sty, ~ Google queries for flu symptoms allows for efficient tracking of the flu virus, for example 4 applications involving data streams, the data often comes at an overwhelmingly fast r: \ there’s noway to know the size of the data in advance. 
For these reasons, it is conve, think of the stream as a dataset of infinite size. We'll make this notion more formal bel the punch line is that we cannot store the data. ate ang Nient to low, by, 3.8.1 Applications We list several examples of important applications: 1. Mining query streams: Google wants to know certain queries which are more fre today than yesterday. 2. Mining click streams: Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour. 3. Mining social network news feeds: For example look for trending topics on Twitter and Facebook. 4. Sensor networks: Many sensors feed into a central controller. 5. Telephone call records: Data feeds into customer bills as well as settlements between, telephone companies. 6. IP packets monitored at a switch: Gather information for optimal routing and detection of denial-of-service attacks. ‘quent FILTERING STREAMS A filtered stream is constructed on another 'stream (the underlying stream). The read method in a readable filter stream reads input from the underlying stream, filters it, and passes on the filtered data to the caller. Data filtering in IT can refer to a wide range of strategies or solutions for refining datasets. This means the datasets are refined into simply what a user (or set of users) needs, without including other data that can be repetitive, irrelevant or even sensitive The randomised algorithms and data structures we have seen so far always produce the correct answer but have a small probability of being slow. Generally, we are interested in trade offs between the (likely) efficiency of the algorithm and the (likely) quality of its output. 3.9.1 Bloom Filter ABloom filter is defined as a data structure designed to identify of an A specific data structure named as probabilistic data structure Filtering Steams element's presence in a set in a rapid and memory-efficient manner. To see | in dion | is implemented as bloom filter. This data structure helps us to identify that an element is either present or absent in a set. | A Bloom filter is a space-efficient probabilistic data structure | that is used to test whether an element is a member of a set. For example, checking the availability of a user name is a set membership problem, where the set is the list of all registered user names. The , \— ininc Dara STREAMS Gs3) fic in nature,which means, there might be some ight appear — that a given user name is already rice we pay for efficiency is that it is probabilist Pise positives. False positive means that it mi taken, but actually it is not. properties of Bloom Filters + Unlike a standard hash table, a Bloom an arbitrarily large number of elements, Adding an element never fails, However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result. filter of a fixed size can represent a set with Bloom filters never generate false negative results, i.e., telling you that a username. doesn't exist when it actually does. Deleting elements from filter is not possible, since, deletion of a single element by clearing bits at indices generated by k hash functions, might cause the deletion of a few other elements. Example ~ if we delete ‘geeks’ (in given example below) by clearing bit at 1, 4 and 7, we might end up deleting ‘nerd’ also Because bit at index 4 becomes 0 and bloom filter claims that ‘nerd’ is not present. 
Working of Bloom Filters : An empty bloom filter is a bit array of m bits, all set to zero, like this - Oo} o}o]o}o}]o};ojo}ojo 12 3 4 5 6 7 8 9 10 3.14(a) Bloom Filter Fi We need k number of hash functions to calculate the hashes for a given input. When we want to add an item in the filter, the bits at k indices h, (x), hy(x), ... h,(x) are set, where indices are calculated using hash functions. Example: Suppose we want to enter ‘geeks’ in the filter, we are using three hash functions and a bit array of length 10, all set to 0, initially. First we'll calculate the hashes as follows: h1(“geeks”) % 10 = 1 h2(“geeks”) % 10 = 4 h3(“geeks”) % 10 = 7 Note: These outputs are random for explanation only. Now we will set the bits at indices 1, 4 and 7 to 1 greeks r}olo;1)o}o]1)o} 0 +1 2 3 4 5 6 7 8 9 Fig. 3.14(b) Bloom Filter 154 Dara Any Again we want to enter “nerd”, similarly, we'll calculate hashes “es hi (“nerd”) % 10 = 3 h2(“nerd”) % 10 = 5 h3(‘nerd”) % 10 = 4 Set the bits at indices 3, 5 and 4 to 1 nerd PrP PPh] 304 5 6 7 8 9 10 Fig. 3.14(c) Bloom Filter 12 Now, if we want to check ‘geeks’ is present in filter, or not. We'll do the same process this time in reverse order. We will calculate respective hashes using h1, h2 and h3 and chew, all these indices are set to 1 in the bit array. If all the bits are set then we can say that ‘geo. probably present. If any of the bits at these indices are 0 then ‘geeks’ is definitely not preseny False Positive in Bloom Filters The question is why we said ‘probably present’, why this uncertainty. Let’s understand thy with an example. Suppose we want to check whether “cat” is present or not. We'll calcula. hashes using h1, h2 and h3 hl(“cat”) % 10 = 1 h2(“cat”) % 10 = 3 h3(“cat") % 10 = 7 If we check the bit array, bits at these indices are set to 1 but we know that “cat” was never added to the filter. Bits at index 1 and 7 were set when we added ‘geeks’ and bit 3 was set we added ‘nerd’. . So, because bits at calculated indices are already set by some other item, bloom fiter erroneously claims that ‘cat'is present and generating a false positive result. Depending on the application, it could be huge downside or relatively okay. cat LEE] | 1 l ° l rfa}a4 “4 2 3 4 5 6 7 8 9 Fig. 3.14(d) Bloom Filter una DATA ‘STREAMS Css) Mi We can control the probability of getting a false positive by controlling the size of the oom filler More space means fewer false positives. If we want to decrease probability of false , we have to use more number of hash functions and larger bit array. This would postive result add Jatency In operations that a Bloom Filter supports « insert(x): To insert an element in the Bloom Filter. ¢ lookup(x): to check whether an element is already present ii positive false probability. « Probability of False positivity: Let m be the size of bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter, then, the probability of false positive p can be calculated as: kn \K r-(-b-ar) m. « Size of Bit Array: If expected number of elements n is known and desi positive probability is p then the size of bit array m can be calculated as: _ _ninP (ina)? ctions: The number of hash functions k must be of elements to be inserted, addition to the item and checking membership. in Bloom Filter with a ired false + Optimum number of hash fun a positive integer. 
If m is size of bit array and n is number then k can be calculated as: k= @in2 n Space Efficiency IF we want to store a large list of items in a set for purpose of set membership, we can store it in hashmap, tries or simple array or linked list. All these methods require storing of items, itself, but it is not very memory efficient. For ‘example, if we want to store ‘geeks’ in hashmap we have to store actual string ‘geeks’ as a key value pair {some_key: ‘geeks’ }. Bloom filters do not store the data item at all. As we have seen they use bit array which allow hash collision. Without hash collision, it would not be compact. '340 COUNTING DISTINCT ELEMENTS IN A STREAM In computer science, the count-distinct problem is the problem of finding the number of distinct elements in a data stream with repeated elements, This is a well-known problem with numerous applications. ‘The elements might represent IP addresses of packets passing through a router; unique visitors to a website, elements in a large database; motifs ina DNA sequence; or elements of RFID/sensor networks. ‘Suppose stream elements are chosen from some universal set. We would like to know how many different elements have appeared To See Counting Distinct Elements in a Stream sé) D. Aan, in the stream, counting either from the beginning of the stream or from some kno, ‘s the past. 0 time 'n general, the size of the set under consideration (which we will hencefori, Universe) is enormous. For example, if we build a system to identify denial of serviey “the the set could consist of all IP V4 and V6 addresses. Another common use case is tg number of unique visitors on popular websites like Twitter or Facebook. An obvious approach if the number of elements is not very large would be to Tain ‘set’, We can check if the set contains the element when a new element arrives. If not the element to the set, The size of the set would give the number of distinct elements, Fi, if the number of elements is vast or we are maintaining counts for multiple streams, be infeasible to maintain the set in memory. Storing the data on disc would be an tion if are only interested in offline computation using batch processing frameworks like Map, Counting distinct elements is also approximate, with an error threshold that can be tweaked by changing the algorithm's parameters. Instance: A stream of elements x;, x9 .... Xs with repititions and an integer m, Ly, be the number of distinct elements, namely n= |{X;, x, .-. X,}|, and let these elements {@), @y ..., en} Objective: Find an estimate fi and n using only m storage units, where m « p, An example of an instance for the cardinality problem is the steream: a, b, a, ¢, 4, ¢ For this instance, n = | {a, b, c,d} = 4. : 3.10.1 Flajolet-Martin Algorithm The first algorithm for counting distinct elements is the Flajolet-Martin algorithm, named after the algorithm's creators. The Flajolet-Martin algorithm is a single-pass algorithm. If there are m distinct elements in a universe comprising of n elements, the algorithm runs in O(n) time and O(log(m)) space complexity. The Following Steps Define the Algorithm First, we pick a hash function h that takes stream elements as input and outputs bit string. The length of the bit strings is large enough such that the result of the hash function is much larger than the size of the universe. We require at least log n bits i there are n elements in the universe. 
* 1(a) is used to denote the number of trailing zeros in the binary representation of h(a) for an element a in the stream. + R denotes the maximum value of r seen in the stream so far. * The estimate of the number of distinct elements in the stream is (2°). To intuitively understand why the algorithm works, consider the following. The probability that h(a) ends in at least i zeros is exactly (2-!). For example, for i = 0, there is a probability 1 that the tail has at least 0 zeros. For = 1, there is a probability 1/2 that the last bit is zero; for i = 2, the probability is 1/4 that the last two bits are zeros. an¢ so on. The probability of the rightmost set bit drops by a factor of 1/2 with every position ft \e ‘least significant bit’ to the ‘most significant bit’. in 10 county tain a Mininc DATA STREAMS 4157 This probability should become 0 when bit Position R > log m while it should be non-zero when R <= logm. Hence, if we find the right most unset bit Position R such that the probability is 0, we can say that the number of unique elements will approximately be 28, The Flajolet-Martin uses a multiplicative hash function to transform the non-uniform set space into a uniform distribution. The general form of the hash function is (ax + b) mod c where a and b are odd numbers and c is the length of the hash range. The Flajolet-Martin algorithm is Sensitive to the hash function used, and results vary widely based on the dataset and the hash function. Hence, there are better algorithms that utilise more than one hash functions, These algorithms use the average and median values to reduce skew and increase the predictability of the result, 3.10.2 Flajolet-Martin Psuedocode and Explanation 1. L = 64 (size of the bitset), B= bitset of size L 2. hash_func = (ax + b) mod 2k 3. for each item x in stream . hash(x) * r= get_righmost_set_bit(y) * set_bit(B, r) 4. R = get_righmost_unset_bit(B) 5. return 28 We define a hash rangebig enough to hold the maximum number of possible unique values, something as big as 2. Every stream element is passed through a hash function that transforms the elements into a uniform distribution. Foreach hash value, we find the Position of the rightmost set bit and mark the ‘corresponding Position in the bitset to 1. Once all elements are processed, the bit vector will have 1s at all the Positions corresponding to the position of every rightmost set bit for all elements in the stream, Now we find R, the rightmost 0 in this bit vector, This position R corresponds to the rightmost set bit that we have not seen while processing the elements. This corresponds to the Probability 0 and will help in approximating the cardinality of unique elements as 2°. 1) ESTIMATING MOMENTS 3:11 ESTIMATING MOMENTS Frequency Moments Consider a stream S = {a,, a... a,,) with elements from a domain D = AV, Voy sony v,}. Let mi denote the frequency (also sometimes called multiplicity) of value veD:; ‘e. the number of times v, appears in S. The kth frequency moment of the stream is defined as F,= Dm f To See Estimating Moments in Action Wewill discuss algorithms that can approximate F, by making one pass of the stream and using a small amount of memory o(n + m). Frequency moments have a number of applications. Fy ‘presents the number of distinct elements in the streams (which the FM-sketch from last class estimates using O(log n) space. F, 'sthe number of elements in the stream m. F, is used in database ) @ optimisation engines to estimate self join size. 
Consider the query, “return all pairs of that are in the same location”. Such a query has cardinality equal to Pmai2, Whe individu, number of individuals at a location. Depending on the estimated size of the Query, the thy can decide (without actually evaluating the answer) which query answering strate be Suited. F, is also used to measure the information in a stream. In general, F, repre jo be degree of skew in the data. If F/Fy is large, then there are some values in tke omen’ he beat more frequently than the rest. Estimating the skew in the data also helps winen how to partition data in a distributed system Cin What is the use of Moments? — These are very useful in statistics because they tell you much about your data, — The four commonly used moments in statistics are as: the mean, variance, skewn, and kurtosis. * To be ready to compare different datasets we will describe them using the Primary fo, statistical moments, % Let's discuss each of the moments in an exceedingly detailed manner: The First Moment — The expected value, also known as an expectation, mathematical expectation, mean, or average, is the first central moment. — It measures the location of the central point. — It measures the location of the central point. Case 1: When all outcomes have the same probability of occurrence It is defined as the sum of all the values the variable can take times the probability of that value occurring. Intuitively, we can understand this as the arithmetic mean. Case 2: When all outcomes don't have the same probability of occurrence This is the more general equation that includes the probability of each outcome andis defined as the summation of all the variables multiplied by the corresponding probability 3.42 COUNTING ONENESS IN A WINDOW Counting the number of distinct elements in a data stream (distinct counting) is a fundamental aggregation task in database query processing, query optimisation, and network monitoring On a stream of elements, it is commonly needed to compute an aggregate over only the mo# recent elements, leading to the problem of distinct counting over a ‘sliding window’ of the sn LetS be a stream of identifiers, each chosen from a universe U. We consider the protien maintaining the number of distinct identifiers in S in a single pass through $ using limited men a problem we henceforth refer to as ‘distinct counting’. Distinct counting is a funda i problem in databases with a wide variety of applications in database query process a optimisation and network monitoring and is one of the earliest problems studied in of streaming algorithms. An example application of distinct counting in network monitoring is to trac! distinct network connections established by a source IP address. Tracking sources the au that eso inc DATA STREAMS ; we ma of stint connections can help identify network anomalies such as worm ‘ince a network monitor has to simultaneously monitor a number of sources, it cannot ford use much Memory for each source and needs a small-space data structure for counting umber of distinct identifiers per source. Further, it is necessary to count the number of identifiers within a subsequence of the stream consisting of the most recently observed commonly modelled using a ‘sliding window’ in the stream, Aggregation over a sliding ‘window arises naturally in real-time monitoring situations such as network traffic engineering, telecom analytics, and cyber security. 
For instance, in network traffic engineering, current network performance is monitored over a sliding window to adjust the bandwidth of the etwork dynamically. ‘A time-based sliding window of length T is defined as the set of the stream elements that nave arrived within the last T time units, for some parameter T. The abstraction of a sliding window is well accepted today and has found its way into the query processing interface of major stream-processing systems, including IBM Infosphere Streams and Apache Spark Streaming. For instance, in IBM Infosphere Streams, it is possible to apply each streaming aggregation operator (including distinct counting) over a sliding window. elements, 3.12.1 Methods There are two types of sliding windows commonly considered — count-based windows and time-based windows. A count-based window of size W is the set of the W's most recent elements in the stream. A time-based window of size T is the set of all stream elements that have arrived ‘within the T's most recent time units. We consider a time-based window, since a count-based window is a special case of a time-based window. An algorithm for a time-based window can also tbe used for a count-based window by setting the timestamp to be equal to the stream position. 3.12.2 Algorithms We present an overview of the algorithms that we consider. For the following discussion, we assume that the domain of elements is IN] = (1, 2, ...., N} and that N is a power of 2. Each element of the stream is a tuple (e, t), where e € [NJ and t 2 0 is an integer timestamp. We assume that timestamps are in a non-decreasing order, but not necessarily consecutive. When a query is posed at time t, the requirement is fo estimate the number of distinct elements within a timestamp based sliding window of size T, ie., those elements with timestamps r such that (t-T+1) {1,2, ...,logN}, such that foreach e e [N], and be (1,2, ..., log, N}, Pr {hle) = bl = ‘Initially, all bits of B are set to 0. When an element e arrives, B[h(e)] is set to 1. The intuition is that approximately 2' distinct elements must be seen before B[i] is set to 1. When there is a query for the number of distinct elements, the bits of B are scanned from position 1 onwards, to find the index of the lowest bit x that is not set. The estimate returned is 1.29281 x 2**} To adapt this to a sliding window, instead of a bit vector B, we use a vector M of length log, N, indexed from 1 till logy IN, to store timestamps. Initially, all entries ofM r Data Avauyn are set to. 0. When an element (et) arrives, M[h(e)] is set to t. Note that Mii , are ett timestamp at which an element hashes to index i. When there ig q "%% number of distinct elements within a time-based sliding window of ey fo find the smallest index x such that either M[x] is 0 ont for the T + 1), where tis the current tim, The the algorithm scans M t timestamp of x has expired, i.e., MIx] < (t — estimate returned is 1.29281 x 2*+1, as before. We implement an enhancement of the above basic scheme, based on stochac: ic averaging (PCSA). In PCSA, k copies of the above data structure are used. n elements are first partitioned into k non-overlapping groups, using a hash function an element (e, t) is forwarded to one of the k data structures, according to g(e} Th final estimate is 1.29281 xkx 2x” + 11.29281 xkx2x* +1, where x°x* is the aver, . of the individual xs obtained from the k different data structures. 
Similar to PCsa ge an infinite window, the processing time per element of PCSA for a sliding windows (1), and the query time is O(k log N). Suppose that a timestamp can be stored P TT bits. The space taken by the sliding window version is O(TT k log N) bits, which is a factor @(TT) larger than the infinite window version which takes O(k log N) bits of space. Linear Counting: Linear Counting, due to Whang et al. (1990), uses a bit vector B of size n = Dyg,/p, where Drug, is an upper bound on the maximum number of distinct elements in the data stream, and p is a constant called the ‘load factor’. The algorithm uses a hash function h: [N] - {1, 2, ..., N} such that for each e [N}, and be {1,2,..., n}, Pr [h(e) = b] = 1/n. Initially, all bits in B are set to 0. Each element e of the data stream is uniformly and independently hashed to an index in the bit vector, and the corresponding bit is set to 1. When a query is made, the number of distinct elements is estimated as m In (n/m) where m is the number of bits in B that are still 0. Whang et al. (1990), shows that accurate estimates can be obtained when p < 12. However, when p is significantly >12, the estimates are poor due to a large density of 1s in the bit array, We extend the above to a sliding window as follows. Instead of a bit vector, we usea vector of timestamps, M, of size n, indexed from 1 till n. When element (e, t) arrives, M{h(e)] is set to t. When a query is made for the number of distinct elements within the window, the entire vector M is scanned and the number of indices that either have value 0 or whose timestamps have expired is used instead of m in the above formula. Note that the processing time per element is O(1) and the time to answer a query is O(n), which is expensive since n is linear in the number of distinct elements. The total time is still reasonable if the frequency of queries is small when compared with the frequency of updates (infrequent queries), but poor if queries are more frequent. For the case of frequent queries, we modified LC by introducing a data structure in addition to M - a list L that comprises tuples of the form (t, a) and is ordered according to t, the timestamp of observation, and a is the value to which the element hashes! In the vector M, in addition to a timestamp t, there is also a pointer to the occurrent of t in L, so that, if an element with a new timestamp hashes to an index in M io corresponding entry with an older timestamp is deleted from the list, and the entry with the current timestamp is made at the head _ | M sync DATA STREAMS The modified version of LC, dubbed ‘LC2,’ necessitates not only 2TT bytes to keep two copies of the timestamp per index of M, but also an overhead for keeping a list, which can be twice the pointer size in a typical implementation such as the C++ Standard Template Library, The expired timestamp is determined from the tail of L in constant time. A single variable can keep track of the number of indexes with expired timestamps or with an initial value of zero. When a query is posed, the number of relevant bits can be determined in O(1) time. Overall, we get O(1) time for an update as well as a query, but at the cost of a significant space overhead. A significant drawback of LC is that p cannot exceed 12 (Whang et al., 1990), so that the space used by the algorithm is at least D,,,/12. 
The accuracy of the estimate falls drastically as p increases, ms (iii) Durand-Flajolet: The Loglog algorithm by Durand and Flajolet (2003) for infinite window derives its name from the space cost of the algorithm which is O(log log N). However, the sliding window version of the Loglog algorithm does not have a space complexity of O(log log N), due to the need to maintain timestamps, and is more expensive. Hence, the name ‘Loglog’ is not applicable here, and we simply call it the ‘DF algorithm’. The stream element is hashed to a binary stringy with a length of O(log N), and the algorithm computes the rank of the first 1-bit from the left in y, r(y). It finds the stream. elements with the highest r(y), say r. This requires only O(log log(N)) space, since a single variable needs to be maintained to keep a track of the maximum. Similar to PCSA, DF uses | different bit vectors and does a stochastic averaging to find the average of maximum r from all bit vectors. When a query is posed, the estimate of the distinct count is returned as 0.39701 x | x Qleva(maxin)+1)_ However, in the sliding window case, there is no easy way to maintain r over all bits set by active elements, since this value is not a non-decreasing number, like in the case of the infinite window. Instead, similar to the PCSA algorithm, we use a vector, M, of length TT to store timestamps. In particular, each index i of the vector M maintains the most recent timestamp during which an element was hashed to y, such that 1(y) = ti, The space taken by this data structure is no longer O(log log N). In answering a query, max(r) is determined as the rank of the highest index in M which contains a non-expired timestamp. (S938 _DEcavING wiNDow This algorithm identifies the most popular elements (trending, in ements To See other words) in an incoming data stream. This algorithm not only ecaying tindow in Action tracks the most recurring elements in an incoming data stream but also discounts any random spikes or spam requests that might have boosted an element's frequency. In a decaying window, you assign a score or weight to every element of the incoming data stream. Furthermore, you need to calculate the aggregate sum for each distinct element by adding all Scan OR Code the weights assigned to that element. The element with the highest total score is listed as ‘trending’ or ‘the most popular’. —_ INA « Assign each element a weight/score. J Calculate agaregate sum for each distinct element by adding all the weights as Mn to that element. sy tacks the most ret ; ed it tracl e mos! urring elem, The decaying window algorithm not only NI in incoming data stream, but also discounts any random spikes or spam requests that might ha reed an elements frequency. In a decaying window, you assign a score or weight tg en element of the incoming data stream. y 3.13.1 Advantages of Decaying Window Algorithm * Sudden spikes or spam data is taken care. New element is given more weight by this mechanism, to achieve right tending output. Ina decaying window algorithm, you assign more weight to newer elements. For a ney element, you first reduce the weight of all the existing elements by a constant factor k and ther, assign the new element a specific weight. The aggregate sum of the decaying exponential weights can be calculated using the following formula: St - i = Oat — i(1 - oi Here, cis usually a smalll constant of the order 10 or 10-°. 
Whenever a new element, say at +1, arrives in the data stream you perform the following steps to achieve an updated sum: 1. Multiply the current sum/score by the value (1 — c). 2. Add the weight corresponding to the new element. weight decays exponentially over time time Fig. 3.15 Weight Decays Exponentially Over Time dist In a data stream consists of various elements, you maintain a separate sum for each by sae eerie For er incoming element, you multiply the sum of all the existing elemen's wor c). Further, you add the weight of the incoming element to its corresponding f oe an be kept toignore elements of weight lesser than that. inally, the element wit i isli ‘se: nt with the highest aggregate score is listed as the most popular elemen® For example, consider ° + a sequence of twitter tags below: Fifa, ipl, fifa, ipl, ipl, ipl, fifa $8 Pelow Mininc DATA ‘STREAMS Also, let's say each element in sequence has weight of 1. Let ¢ be 0, we The aggregate sum of each tag in the end of above stream ws be ‘ fie e calculated as below: fifa: 1* (1-0.1) = 0.9 0.9% ae Ya oon. i * adding 0 because current tag is different than fifa) 729 * (1 - o1) if _ esectting 1 because current tag is fifa only) ipl: 1.5561 * (1-0.1) +0 = 1.4005 ipl: 1.4005 * (1-0.1) + 0 = 1.2605 fifa: 1.2605 * (1-0.1) +1 = 2.135 ipl fifa: 0* (1-0.1) =0 ip: 0*(1-0.1)+1=1 fifa: 1 * (10.1) + 0 = 0.9 (addin. pr 09* (1-001) + 1 a ig 0 because current tag is different than ipl) ipl: 1.81 * (1-0.01) + 1 = 2.7919 ipl: 2.7919 * (1-0.01) + 1 = 3.764 fifa: 3.764 * (1 - 0.01) + 0 = 3.7264 At the end of the sequence, we can see the score of fifa is 2.135 but ipl is 3.7264 So, ipl is more trending then fifa Even though both of them occurred for the same number of times in input but still there score is still different. 3.14 REAL TIME ANALYTICS PLATFORM (RTAP) Real-Time Analytics Real-time analytics is defined as the ability of users to see, analyse and assess data as soon as it appears in a system. In order to provide users with insights (rather than raw data), logic, mathematics and algorithms are applied. The output is a visually cohesive and easy-to-understand dashboard and/or report. It encompasses the technology and processes that quickly enable users to leverage data the second it enters the database. It includes data measurement, management and analytics. For businesses, analytics analytics in real-time can be used to — ae] meet a variety of needs including enhancing workflows, boosting the Real Time Analytics relationship between marketing and sales, understanding customer Platform (RTAP) behaviour, finalising financial close procedures and more. a a Understanding live analytics is best done by breaking down zt the terms: «© Real-time: operations are performed in milliseconds before it becomes available to the user. DATA Ana, « Analytics: a software capability to pull data from various sources ang in Meg analyse and transform it into a format that is comprehensive. ere, Without real-time analytics, a business may absorb a ton of data which may in the shuffle. Leading a finance team means leveraging data for both — financial ste procurement; as well as to understand insights about the business and the customers The ability to work in real-time and respond to customer's needs or prevent issues befo,. 1 arise which can result in benefits by reducing risk and enhancing accuracy. they Batch Layer Real-Time Layer Data Fig. 
In real time, the analysis of data allows users to view, analyse and understand data as soon as it has entered the system. Mathematical reasoning and logic are incorporated into the data, which gives users a view of the data in real time for making decisions. Real-time analytics permits businesses to gain awareness and take action on data immediately, or as soon as the data enters their system. Real-time analytics applications answer queries within seconds; they handle a large amount of data arriving at high velocity with low reaction time. For example, real-time big data analytics uses data in financial databases to inform trading decisions.

Analytics can be on-demand or continuous. On-demand analytics delivers results when the user requests them. Continuous analytics updates users as events happen and can be programmed to respond automatically to certain events. For example, real-time web analytics might alert an administrator if page-load performance goes outside the preset boundary.

Examples: Examples of real-time customer analytics include the following:
* Real-time credit scoring, which helps financial institutions decide promptly whether to extend credit or not.
* Customer relationship management (CRM), which maximises satisfaction and business results during each interaction with customers.
* Fraud detection at points of sale.
* Targeting individual customers in retail outlets with promotions and incentives while the customers are in the store and next to the merchandise.
* Monitoring orders as they take place, to trace them better and determine the type of clothing demanded.
* Updating customer interactions on a regular basis — such as the number of page views and shopping-cart usage — to better understand the behaviour of users.
* Selecting customers according to their shopping habits in a shop, impacting decisions in real time.

3.14.1 How Does Real-Time Analytics Work
Real-time data analytics works by pushing or pulling data into the system. In order to push Big Data into a system, streaming needs to be in place. However, streaming can require a lot of resources and may be impractical for certain uses. Instead, you may set data to be pulled at intervals, from seconds to hours. Given these choices, outputs from real-time analytics can be produced in just seconds to minutes.

In order for real-time data analytics to work, the software generally includes the following components:
* Aggregator: pulls real-time data from various sources.
* Analytics Engine: compares the values of data and streams it together while performing analysis.
* Broker: makes the data available to consumers.
* Stream Processor: executes logic and performs analytics in real time by receiving and sending data.

Real-time analytics is also made possible with the aid of technologies such as in-database analytics, processing in memory (PIM), in-memory analytics and massively parallel processing (MPP).

With all the data flowing into an organisation, it is only of use when the information can be transformed into insights. Without automation tools, you will need to hire experts (coders, data analysts, etc.) and wait for the manual production of reports from the data. The required time, effort and opportunity cost can be detrimental to a business' bottom line and decision-making abilities. However, with the aid of automation solutions, a cloud software tool like SolveXia performs real-time data analytics and specifically offers financial teams deep insights from data in just seconds.
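As a rough illustration of the stream-processor role described above, the sketch below consumes a stream of page-load events and raises an alert as soon as a rolling average response time crosses a preset boundary — the kind of continuous, automatic response mentioned in the web-analytics example. The event format, the function name, the window size and the threshold are all hypothetical choices for this sketch, not part of any specific product or API.

```python
from collections import deque

# Hypothetical real-time check: alert when the rolling average page-load time
# over the last `window` events exceeds `threshold_ms`.
def monitor_page_loads(events, window=100, threshold_ms=800):
    recent = deque(maxlen=window)          # keeps only the most recent observations
    for event in events:                   # `events` can be any iterable or generator of dicts
        recent.append(event["load_ms"])
        rolling_avg = sum(recent) / len(recent)
        if rolling_avg > threshold_ms:
            yield {"alert": "page load degraded",
                   "rolling_avg_ms": round(rolling_avg, 1),
                   "page": event.get("page")}

# Usage with a small simulated stream of steadily slowing page loads:
if __name__ == "__main__":
    simulated = [{"page": "/home", "load_ms": 400 + 30 * i} for i in range(30)]
    for alert in monitor_page_loads(simulated, window=10, threshold_ms=800):
        print(alert)
```

In a production pipeline this logic would sit behind a broker (for example a message queue) rather than iterate over an in-memory list, but the push/pull structure — receive an event, update state, emit a result — is the same.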
3.14.2 Benefits of Real-Time Data Analytics
Real-time data analytics allows a business to thrive and reach optimal productivity standards. You can minimise risks, reduce costs, and understand more about your employees and customers, along with the overall financial health of the business, with the help of real-time data.

Here are Some of the Key Benefits
* Data Visualisation: With historical data, you can get a snapshot of information displayed in a chart. However, with real-time data, you can use data visualisations that reflect changes within the business as they occur, in real time. This means that dashboards are interactive and accurate at any given moment. With custom dashboards, you can also share data easily with relevant stakeholders so that decision-making never gets put on hold.
