Q 1.
(a)
Explain the Stream Data Model Architecture with a neat diagram.
In analogy to a database-management system, we can view a stream processor as a kind of
data-management system, the high-level organization of which is suggested in Fig.
Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between elements
of one stream need not be uniform. The fact that the rate of arrival of stream elements is not
under the control of the system distinguishes stream processing from the processing of data
that goes on within a database-management system. The latter system controls the rate at
which data is read from the disk, and therefore never has to worry about data getting lost as it
attempts to execute queries. Streams may be archived in a large archival store, but we assume
it is not possible to answer queries from the archival store. It could be examined only under
special circumstances using time-consuming retrieval processes. There is also a working store,
into which summaries or parts of streams may be placed, and which can be used for answering
queries. The working store might be disk, or it might be main memory, depending on how fast
we need to process queries. But either way, it is of sufficiently limited capacity that it cannot
store all the data from all the streams.
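The limited working store can be illustrated with a small sketch. The class name `WorkingStore`, its capacity, and the choice of a running mean as the summary are assumptions made for illustration, not part of the model itself; the point is only that queries are answered from a bounded window plus summaries, never from the full stream.

```python
from collections import deque


class WorkingStore:
    """Bounded store: keeps the most recent elements plus a running summary."""

    def __init__(self, capacity):
        self.window = deque(maxlen=capacity)  # oldest elements are evicted automatically
        self.count = 0      # summary: number of elements seen so far
        self.total = 0.0    # summary: running sum of all elements seen

    def ingest(self, element):
        self.window.append(element)
        self.count += 1
        self.total += element

    def mean_all(self):
        """Answer a standing query from the summary, without the full stream."""
        return self.total / self.count if self.count else 0.0

    def mean_recent(self):
        """Answer an ad-hoc query from the retained window only."""
        return sum(self.window) / len(self.window) if self.window else 0.0


store = WorkingStore(capacity=3)
for x in [4, 8, 6, 10, 2]:      # elements arrive at a rate the store does not control
    store.ingest(x)

print(list(store.window))   # [6, 10, 2]
print(store.mean_all())     # 6.0
print(store.mean_recent())  # 6.0
```

Elements 4 and 8 are no longer available for ad-hoc queries once they leave the window, which mirrors the assumption that the archival store is not used to answer queries.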
2. What is a Bloom filter? Determine the probability of a false positive in a Bloom filter.
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values
to n buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S,
while rejecting most of the stream elements whose keys are not in S.
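A minimal sketch of this structure is given below. The way the k hash functions are derived (salting SHA-256 with the hash index) and the names `BloomFilter`, `add`, and `might_contain` are illustrative assumptions; any family of reasonably independent hash functions would serve.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [0] * n_bits          # array of n bits, initially all 0

    def _positions(self, key):
        # Assumed hash family: SHA-256 salted with the hash index i, reduced mod n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key):
        """Insert a member of S by setting the bit for each of its k hash values."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """True if every hashed bit is 1; members of S always pass, others usually fail."""
        return all(self.bits[pos] for pos in self._positions(key))


allowed = ["alice@example.com", "bob@example.com", "carol@example.com"]  # the set S
bf = BloomFilter(n_bits=64, k_hashes=3)
for key in allowed:
    bf.add(key)

print(all(bf.might_contain(key) for key in allowed))  # True: no false negatives
```

A key outside S is let through only if all k of its bit positions happen to be set, which is the false-positive case.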
The model to use is throwing darts at targets. Suppose we have x targets and y darts. Any dart
is equally likely to hit any target. After throwing the darts, how many targets can we expect to
be hit at least once?
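• The probability that a given dart will not hit a given target is $(x-1)/x$.
• The probability that none of the $y$ darts will hit a given target is $\left(\frac{x-1}{x}\right)^{y}$.
• We can write this expression as $\left(1-\frac{1}{x}\right)^{x(y/x)}$.
• Using the approximation $(1-\epsilon)^{1/\epsilon} = 1/e$ for small $\epsilon$, we conclude that the probability that none of the $y$ darts hit a given target is $e^{-y/x}$.
For the Bloom filter, the targets are the $n$ bits and the darts are the $km$ hash values of the $m$ members of S, so $x = n$ and $y = km$. A given bit remains 0 with probability $e^{-km/n}$ and is 1 with probability $1 - e^{-km/n}$. A key not in S is a false positive only if all $k$ of its hashed bits are 1, so the probability of a false positive is $\left(1 - e^{-km/n}\right)^{k}$.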
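The dart model can be checked numerically. The sketch below assumes the usual mapping onto a Bloom filter (the n bits are the targets, the k·m hash values of the members of S are the darts) and treats the k probed bits as independent, which is the standard simplification. The function names and the sample sizes are illustrative.

```python
import math


def false_positive_exact(n, m, k):
    """Probability that all k probed bits are 1, using (1 - 1/n)^(k*m) for a bit staying 0."""
    return (1 - (1 - 1 / n) ** (k * m)) ** k


def false_positive_approx(n, m, k):
    """Same quantity with the approximation (1 - 1/n)^(k*m) ~ e^(-k*m/n)."""
    return (1 - math.exp(-k * m / n)) ** k


# n = 1000 bits, m = 100 keys in S, varying the number of hash functions k.
for k in (1, 2, 3):
    print(k, false_positive_exact(1000, 100, k), false_positive_approx(1000, 100, k))
```

The two columns agree closely, which is why the $e^{-y/x}$ form is normally used in place of the exact power.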
3. Explain the Girvan-Newman algorithm. Detect communities for the following graph using the
Girvan-Newman algorithm (edge betweenness is given in the graph).
• To find edge betweenness, we need to count the shortest paths that go through each of
the edges.
• The Girvan-Newman algorithm visits each node X once and computes the number of
shortest paths from X to each of the other nodes that go through each of the edges.
• The algorithm begins by performing a breadth-first search (BFS) of the graph, starting
at the node X.
• Edges that go between nodes at the same level can never be part of a shortest path
from X.
• Edges between levels are called DAG edges. Each DAG edge will be part of at least one
shortest path from the root X.
• To complete the betweenness calculation, we have to repeat this calculation for every
node as the root and sum the contributions. Since every shortest path is then counted
twice, once from each of its endpoints, each total is divided by 2.
• After these calculations, the following graph shows the final betweenness values:
• We can cluster by taking the edges in order of increasing betweenness and adding them
to the graph one at a time.
• Alternatively, we can remove the edge with the highest betweenness to split the graph
into clusters.
• In the example graph, we remove edge BD to get two communities as follows:
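The BFS-and-credit procedure described in the steps above can be sketched as follows. The figure for this question is not reproduced here, so the seven-node graph in the code is an assumed stand-in in which B-D is the only link between the two groups; the function names are illustrative as well.

```python
from collections import defaultdict, deque


def edge_betweenness(graph):
    """Sum Girvan-Newman edge credits over every root, then halve the totals."""
    totals = defaultdict(float)
    for root in graph:
        # BFS from the root: record levels, shortest-path counts and DAG parents.
        level, paths, parents, order = {root: 0}, defaultdict(int), defaultdict(list), []
        paths[root] = 1
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in graph[node]:
                if nbr not in level:
                    level[nbr] = level[node] + 1
                    queue.append(nbr)
                if level[nbr] == level[node] + 1:      # DAG edge (goes between levels)
                    paths[nbr] += paths[node]
                    parents[nbr].append(node)
        # Bottom-up pass: every node starts with credit 1 and shares it with its parents
        # in proportion to the number of shortest paths reaching each parent.
        credit = {node: 1.0 for node in order}
        for node in reversed(order):
            for parent in parents[node]:
                share = credit[node] * paths[parent] / paths[node]
                totals[frozenset((parent, node))] += share
                credit[parent] += share
    return {edge: value / 2 for edge, value in totals.items()}


def components(graph):
    """Connected components found by depth-first search."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        result.append(comp)
    return result


# Assumed example graph (not taken from the missing figure).
graph = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
         "D": {"B", "E", "F", "G"}, "E": {"D", "F"}, "F": {"D", "E", "G"}, "G": {"D", "F"}}

scores = edge_betweenness(graph)
top_edge = max(scores, key=scores.get)          # edge with the highest betweenness
u, v = tuple(top_edge)
graph[u].discard(v)
graph[v].discard(u)
communities = components(graph)
print(sorted(top_edge), scores[top_edge])
print([sorted(c) for c in communities])
```

On this assumed graph, every shortest path between {A, B, C} and {D, E, F, G} crosses B-D, so that edge collects the largest score and removing it separates the two groups.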
4. Define PageRank. Calculate PageRank for the following graph.
5. Explain the Flajolet-Martin algorithm. Perform FM for the stream 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1, ...
The Flajolet-Martin algorithm approximates the number of unique objects in a stream or a
database in one pass. If the stream contains n elements with m of them unique, this algorithm
runs in O(n) time and needs O(log(m)) memory.
Algorithm:
1. Create a bit vector (bit array) of sufficient length L, such that $2^L > n$, the number
of elements in the stream. Usually a 64-bit vector is sufficient since $2^{64}$ is quite
large for most purposes.
2. The i-th bit in this vector/array represents whether we have seen a hash function value
whose binary representation ends in $0^i$ (that is, in i zeros). So initialize each bit to 0.
3. For each stream element a, compute the hash value h(a) and let r(a) be the number of
trailing 0's in its binary representation. Set bit r(a) of the vector to 1.
4. After the whole stream has been read, let R = max(r(a)) over all elements a seen, and
estimate the number of distinct elements as $2^R$.
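Example: S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5
Assume |b| = 5, i.e., each hash value is written as a 5-bit string.
h(1) = 7 mod 5 = 2 = 00010, so r = 1
h(2) = 13 mod 5 = 3 = 00011, so r = 0
h(3) = 19 mod 5 = 4 = 00100, so r = 2
h(4) = 25 mod 5 = 0 = 00000, so r = 5 (all five bits are 0)
R = max( r(a) ) = 5
So the estimated number of distinct elements is $N = 2^R = 2^5 = 32$.
The stream actually contains 4 distinct elements; the overestimate comes from the all-zero hash value of element 4.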
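The example can be reproduced with a short sketch. It assumes the convention implied by R = 5 with |b| = 5, namely that an all-zero hash value counts as having |b| trailing zeros; the helper names `trailing_zeros` and `flajolet_martin` are illustrative.

```python
def trailing_zeros(value, bits):
    """Count trailing zero bits; an all-zero value is treated as `bits` zeros (assumed convention)."""
    if value == 0:
        return bits
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count


def flajolet_martin(stream, hash_fn, bits):
    """Return (R, 2**R), where R is the largest tail length seen in one pass over the stream."""
    R = max(trailing_zeros(hash_fn(a), bits) for a in stream)
    return R, 2 ** R


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]


def h(x):
    return (6 * x + 1) % 5


tails = {a: trailing_zeros(h(a), 5) for a in sorted(set(stream))}
R, estimate = flajolet_martin(stream, h, bits=5)
print(tails)         # {1: 1, 2: 0, 3: 2, 4: 5}
print(R, estimate)   # 5 32
```

Under the alternative convention that an all-zero hash contributes a tail length of 0, the same stream would give R = 2 and an estimate of 4, so the result is sensitive to how that case is handled.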
6. Write pseudocode for PageRank calculation using MapReduce. What is the role of combiners
in performing the PageRank calculation?
Combiners: (2 Marks)
There are two reasons:
1. We might wish to add terms for $v'_i$, the ith component of the result vector $v'$, at the
Map tasks. This improvement is the same as using a combiner, since the Reduce
function simply adds terms with a common key. Recall that for a MapReduce
implementation of matrix–vector multiplication, the key is the value of i for which a
term $m_{ij} v_j$ is intended.
2. We might not be using MapReduce at all, but rather executing the iteration step at a
single machine or a collection of machines.
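One iteration step with a combiner can be sketched in memory as below. This is not a real MapReduce job: the functions `map_task`, `combine`, and `reduce_task`, the three-page link structure, and the split of the matrix into two chunks are all assumptions for illustration, and the taxation term often used in PageRank is left out to keep the focus on the combiner.

```python
from collections import defaultdict


def map_task(matrix_chunk, v):
    """Emit (i, m_ij * v_j) for every nonzero entry (i, j, m_ij) in this chunk of M."""
    return [(i, m_ij * v[j]) for i, j, m_ij in matrix_chunk]


def combine(pairs):
    """Combiner: pre-sum the terms for each key i at the Map task, before the shuffle."""
    partial = defaultdict(float)
    for i, term in pairs:
        partial[i] += term
    return list(partial.items())


def reduce_task(pairs, size):
    """Reducer: add every term (or partial sum) with key i to obtain v'_i."""
    result = [0.0] * size
    for i, term in pairs:
        result[i] += term
    return result


# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (column j holds page j's out-links).
chunk_a = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 1.0)]   # entries handled by one Map task
chunk_b = [(0, 2, 1.0)]                             # entries handled by another Map task
v = [1 / 3, 1 / 3, 1 / 3]

combined_a = combine(map_task(chunk_a, v))
combined_b = combine(map_task(chunk_b, v))
v_next = reduce_task(combined_a + combined_b, size=3)

print(len(combined_a))   # 2 pairs leave the first Map task instead of 3
print(v_next)
```

Because addition is associative and commutative, summing at the Map task and again at the Reduce task gives the same vector while sending fewer key-value pairs across the network.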
7. Explain CURE clustering algorithm with an example.
The CURE (Clustering Using REpresentatives) algorithm is a large-scale clustering algorithm
in the point-assignment class that assumes a Euclidean space. It does not assume anything
about the shape of clusters; they need not be normally distributed, and can even have strange
bends, S-shapes, or even rings.
Instead of representing clusters by their centroid, it uses a collection of representative points,
as the name implies.
The CURE algorithm is divided into two phases:
1. Initialization in CURE
2. Completion of the CURE Algorithm
Initialization in CURE:
1. Take a small sample of the data and cluster it in main memory. In principle, any
clustering method could be used, but as CURE is designed to handle oddly shaped
clusters, it is often advisable to use a hierarchical method in which clusters are merged
when they have a close pair of points.
2. Select a small set of points from each cluster to be representative points. These points
should be chosen to be as far from one another as possible, using the same farthest-point
selection that is used to pick initial centroids for K-means.
3. Move each of the representative points a fixed fraction of the distance between its
location and the centroid of its cluster. Perhaps 20% is a good fraction to choose. Note
that this step requires a Euclidean space, since otherwise, there might not be any notion
of a line between two points.
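Steps 2 and 3 can be sketched for a single cluster as follows. The greedy farthest-point rule, the choice to start from the point farthest from the centroid, and the sample cluster are assumptions made for illustration; the 20% fraction is the one suggested above.

```python
import math


def centroid(points):
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def pick_representatives(points, count):
    """Greedy farthest-point selection: start with the point farthest from the centroid,
    then keep adding the point whose nearest chosen representative is farthest away."""
    center = centroid(points)
    reps = [max(points, key=lambda p: math.dist(p, center))]
    while len(reps) < min(count, len(points)):
        candidates = [p for p in points if p not in reps]
        reps.append(max(candidates, key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps


def shrink(reps, center, fraction=0.2):
    """Move each representative `fraction` of the way along the line to the centroid."""
    return [tuple(r[d] + fraction * (center[d] - r[d]) for d in range(len(r))) for r in reps]


cluster = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]   # hypothetical sampled cluster
center = centroid(cluster)
reps = pick_representatives(cluster, count=3)
moved = shrink(reps, center)
print(reps)
print(moved)
```

Each moved point ends up at 80% of its original distance from the centroid, which pulls the representatives slightly inside the cluster boundary.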
Completion of the CURE Algorithm:
The next phase of CURE is to merge two clusters if they have a pair of representative points,
one from each cluster, that are sufficiently close. The user may pick the distance that defines
“close.” This merging step can repeat, until there are no more sufficiently close clusters.
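The merging phase can be sketched as a loop that repeats until no pair of clusters qualifies. The function name, the threshold value, and the sample representatives are hypothetical, and a merged cluster here simply keeps the union of both sets of representatives, which is a simplification rather than a statement of how CURE must handle it.

```python
import math


def merge_close_clusters(cluster_reps, threshold):
    """Repeatedly merge clusters that have a pair of representatives within `threshold`."""
    clusters = [list(reps) for reps in cluster_reps]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                close = any(math.dist(p, q) <= threshold
                            for p in clusters[a] for q in clusters[b])
                if close:
                    clusters[a].extend(clusters[b])   # simplification: keep all representatives
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break                                 # restart the scan after every merge
    return clusters


# Hypothetical representatives for three clusters; the user-chosen "close" distance is 1.0.
reps_per_cluster = [[(1, 1), (2, 2)], [(2.5, 2.5), (4, 4)], [(10, 10), (11, 11)]]
result = merge_close_clusters(reps_per_cluster, threshold=1.0)
print(len(result))   # 2
print(result)
```

The first two clusters merge because (2, 2) and (2.5, 2.5) are about 0.71 apart, while the third stays separate.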

Bigdata analytics
