Module 4 Techniques in Big Data Analytics

This document discusses techniques for analyzing big data streams, including finding similar items and calculating Jaccard similarity. It provides examples of applying these techniques to recommendations systems, e-commerce, and text mining. Applications of nearest neighbor searches are described for tasks like optical character recognition, content-based image retrieval, collaborative filtering, and document similarity analysis. The need for data stream mining over static datasets is explained, and a typical architecture for a data stream management system is outlined.

Uploaded by King Bavisi
© All Rights Reserved

Techniques in Big Data Analytics
MODULE 4 (ELEX ENGG.)

Finding Similar Items (Similarity and Correlation)
K-Nearest Neighbours
Jaccard Similarity for Two Sets

• The Jaccard similarity of two sets A and B is the ratio of the size of their intersection to the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

• The Jaccard distance is the complement of the similarity: d(A, B) = 1 − J(A, B).
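As a minimal sketch (in Python, with invented example term sets), both quantities follow directly from set operations:

```python
def jaccard_similarity(a: set, b: set) -> float:
    """J(A, B) = |A intersect B| / |A union B|; defined as 0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def jaccard_distance(a: set, b: set) -> float:
    """d(A, B) = 1 - J(A, B)."""
    return 1.0 - jaccard_similarity(a, b)

# Hypothetical example: term sets of two short documents.
doc1 = {"big", "data", "stream", "mining"}
doc2 = {"big", "data", "analytics"}
print(jaccard_similarity(doc1, doc2))  # 2 shared terms / 5 distinct terms = 0.4
```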
Applications of Jaccard Similarity

• Text mining: find the similarity between two text documents using the number of terms used in both documents.
• E-Commerce: from a market database of thousands of customers and millions of items, find similar customers via their purchase history.
• Recommendation System: movie recommendation algorithms employ the Jaccard coefficient to find similar customers if they rented or rated highly many of the same movies.
Applications of Nearest Neighbor Search

• Optical Character Recognition (OCR): OCR software uses NN classifiers; for example, the k-NN algorithm is used to compare image features with stored glyph features and choose the nearest match.
• Content-based image retrieval: these systems typically provide example images, and the systems find similar images using an NN approach.
• Collaborative filtering: the process of filtering for information or patterns using a set of collaborating agents, viewpoints, data sources, etc.
• Document Similarity: many applications, such as web search engines and news aggregators, need to identify textually similar documents from a large corpus of documents such as web pages, a collection of news articles, a collection of tweets, etc.
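The OCR use of k-NN above can be sketched in Python. The 2-D "glyph features" and character labels below are invented for illustration; a real OCR system would use far richer feature vectors:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: a feature vector.
    Returns the majority label among the k nearest stored examples (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))
    top_k_labels = [label for _, label in nearest[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Hypothetical glyph features: (width ratio, ink density) with character labels.
train = [((0.2, 0.9), "l"), ((0.25, 0.85), "l"),
         ((0.8, 0.4), "o"), ((0.75, 0.45), "o"), ((0.78, 0.5), "o")]
print(knn_classify(train, (0.77, 0.42), k=3))  # nearest three examples are all "o"
```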
Similarity of Documents

• Storage and retrieval of records in a large enterprise
• Automated analysis and organization of large document repositories
• Maintaining and retrieving a variety of patient-related data in a large hospital
• Web search engines
• Identifying trending topics on Twitter
Document Similarity

• A key issue in document management is the quantitative assessment of document similarity:
• How similar are the two text documents?
• Are two patient histories similar?
• Which documents match a given query best?
Applications for text-based similarity

• Quality in search engines: near-duplicate detection improves the quality of search results.
• Finding similar employees: Human Resources applications, such as automated matching of CVs against job descriptions.
• Patent research: matching potential patent applications against a corpus of existing patent grants.
Applications for text-based similarity (cont.)

• Document clustering: auto-categorization using seed documents.
• Security scrubbing: finding documents with very similar content, but with different access control lists.
Data Stream Mining: Introduction

• A stream is defined as a possibly unbounded sequence of data items or records that may or may not be related to, or correlated with, each other.
• Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes).
• Data stream mining is the process of extracting knowledge structures from continuous, rapid data records.
Data Stream Mining: Need for Data Stream Mining

• In traditional data mining-based applications, we know the entire dataset in advance. Moreover, the data was static and persistent in nature. Hence, this model was adequate for most older and legacy applications.
• Many current and emerging applications, such as Facebook, Twitter, sensor networks, network monitoring, etc., generate continuous, rapid, time-varying, unpredictable and unbounded streams of data.
Need for Data Stream Mining (cont.)

• Traditional DBMS and data mining applications are not designed for rapid and continuous loading of data items.
• Further, the data in the data stream is lost forever if not processed immediately or stored.
Need for Data Stream Mining (cont.)

• Moreover, it is not possible to store all the arriving data and then interact with it at a time of your choice.
• Hence, there is a need for a system that handles these types of data under the strict constraints of time and space.
Data Stream Mining: Data Stream Management System

• A data stream management system (DSMS) is a computer software system to manage continuous data streams.
• A DSMS also offers flexible query processing so that the information need can be expressed using queries.
• A DSMS executes a continuous query that is not only performed once, but is permanently installed. Therefore, the query is continuously executed until it is explicitly uninstalled.
• Since most DSMSs are data-driven, a continuous query produces new results as long as new data arrive at the system.
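The data-driven behaviour of a continuous query can be mimicked in plain Python with a generator that emits a fresh result for every arriving record. This is only a sketch of the idea, not a real DSMS; the running-average query is an invented example:

```python
def continuous_average(stream):
    """Continuous query: emit the running average after each arriving record."""
    total, count = 0.0, 0
    for record in stream:   # a new result is produced as long as data arrive
        total += record
        count += 1
        yield total / count

# Hypothetical stream of numeric sensor readings.
results = list(continuous_average(iter([10, 20, 30])))
print(results)  # [10.0, 15.0, 20.0]
```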
Data Stream Mining: Facts of Data Stream Mining

• Any number of streams can enter the system. Each stream can provide elements at its own schedule; they need not have the same data rates or data types, and the time between elements of one stream need not be uniform.
• Any query that requires backtracking over a data stream is infeasible due to storage and performance constraints.
Facts of Data Stream Mining (cont.)

• Streaming query plans must not use any operators that require the entire input before any results are produced. Such operators will block the query processor indefinitely.
• At times, it is not possible to store the entire data stream. Hence, approximate summary structures are used. As a result, queries over the summaries may not return exact answers.
Data Stream Mining: Working

Abstract architecture for a typical DSMS: input streams pass through an input regulator to the query processor, which answers user queries with the help of a metadata repository and three kinds of storage:

1. Temporary working storage (e.g., for window queries).
2. Summary storage.
3. Static storage for metadata (e.g., the physical location of each source).
Data Stream Mining: Characteristics

• The data model and query processor must allow both order-based and time-based operations (e.g., queries over a 10-minute moving window, or queries of the form "which are the most frequently occurring data before a particular event").
• The inability to store a complete stream indicates that some approximate summary structures must be used. As a result, queries over the summaries may not return exact answers.
• Streaming query plans must not use any operators that require the entire input before any results are produced. Such operators will block the query processor indefinitely.
Data Stream Mining: Characteristics (cont.)

• Any query that requires backtracking over a data stream is infeasible due to the storage and performance constraints imposed by a data stream. Thus, any online stream algorithm is restricted to making only one pass over the data.
• Applications that monitor streams in real time must react quickly to unusual data values. Thus, long-running queries must be prepared for changes in system conditions at any time during their execution lifetime (e.g., they may encounter variable stream rates).
• Scalability requirements dictate that parallel and shared execution of many continuous queries must be possible.
Data Stream Application: Sensor Network

• A sensor network is used in numerous situations that require constant monitoring of several variables, based on which important decisions are made.
• In many cases, alerts and alarms may be generated as a response to the information received from a series of sensors.
• To perform such analysis, aggregation and joins over multiple streams corresponding to the various sensors are required.
Sensor Network: some representative queries

1. Perform a join of several data streams, such as temperature streams and ocean current streams, at weather stations to give alerts or warnings of disasters like cyclones and tsunamis. It can be noted here that such information can change very rapidly based on the vagaries of nature.
2. Constantly monitor a stream of recent power usage statistics reported to a power station, and group them by location, user type, etc., to manage power distribution efficiently.
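The second query above (grouping recent power usage by location) can be sketched as a sliding-window aggregation. The record format, window size, and location names below are assumptions for illustration:

```python
from collections import deque, defaultdict

class WindowedGroupBy:
    """Keep only the last `window` records and aggregate usage by location."""
    def __init__(self, window=4):
        self.records = deque(maxlen=window)   # old records fall out automatically

    def add(self, location, usage_kw):
        self.records.append((location, usage_kw))

    def totals(self):
        agg = defaultdict(float)
        for location, usage_kw in self.records:
            agg[location] += usage_kw
        return dict(agg)

w = WindowedGroupBy(window=3)
for rec in [("north", 5.0), ("south", 2.0), ("north", 3.0), ("south", 4.0)]:
    w.add(*rec)
print(w.totals())  # first record has expired: {'south': 6.0, 'north': 3.0}
```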
Data Stream Application: Network Traffic Analysis

• Network service providers can constantly get information about Internet traffic, heavily used routes, etc., to identify and predict potential congestion.
• Streams of network traffic can also be analyzed to identify potentially fraudulent activities, e.g., by an intrusion detection system.
• If a particular server on the network becomes a victim of a denial-of-service attack, that route can become heavily congested within a short period of time.
Example queries

• Check whether a current stream of actions over a time window is like a previously identified intrusion on the network.
• Check if several routes over which traffic is moving have several common intermediate nodes, which may potentially indicate congestion on that route.
Link Analysis

Spamdexing
• Spamdexing is the practice of keyword stuffing, or otherwise manipulating an index for a website, with the intention of increasing the website's ranking with search engines.
• Search Engine Optimization (SEO) is an industry that attempts to make a website attractive to the major search engines and thus increase its ranking.

• Two popular techniques of spamdexing:
1. Cloaking
2. Use of "doorway" pages.
Cloaking

• Cloaking is a technique where a website shows one version of a URL, page, or piece of content to the search engines for ranking purposes while showing another to its actual visitors.
Doorway

• A doorway page is a page on your website which has been created to rank for specific search queries.
• It is a "doorway" to the main content and is not at all useful to users.
• Sometimes this tactic involves redirection to another page: when the user clicks on it, the meta refresh fires and there is a very quick redirect to another page.
Page Rank

• Improves Web search by analyzing the hyperlinks and the graph structure of the Web.
• Link analysis is one of many factors considered by Web search engines in computing a composite score for a Web page on any given query.
Dangling Links

• A dangling page does not contribute to the Web page rank calculation.
• E.g., Web page D is dangling.
[Diagram: "Institute", "Affiliated Person", "Rank"]

Page Rank simulator: https://tools.withcode.uk/pagerank/
Example: three pages A, B, and C, where A links to B and C, B links to A and C, and C links to A (so C(A) = 2, C(B) = 2, C(C) = 1), with damping factor d = 0.85.

Iteration 0:
PR(A) = 1, PR(B) = 1, PR(C) = 1

Iteration 1:
PR(A) = (1-d) + d(PR(B)/C(B) + PR(C)/C(C)) = 0.150 + 0.85(1.000/2 + 1.000/1) = 1.425
PR(B) = (1-d) + d(PR(A)/C(A)) = 0.150 + 0.85(1.425/2) = 0.756
PR(C) = (1-d) + d(PR(A)/C(A) + PR(B)/C(B)) = 0.150 + 0.85(1.425/2 + 0.756/2) = 1.077

Iteration 2:
PR(A) = (1-d) + d(PR(B)/C(B) + PR(C)/C(C)) = 0.150 + 0.85(0.756/2 + 1.077/1) = 1.386
PR(B) = (1-d) + d(PR(A)/C(A)) = 0.150 + 0.85(1.386/2) = 0.739
PR(C) = (1-d) + d(PR(A)/C(A) + PR(B)/C(B)) = 0.150 + 0.85(1.386/2 + 0.739/2) = 1.053

Iteration 3:
PR(A) = (1-d) + d(PR(B)/C(B) + PR(C)/C(C)) = 0.150 + 0.85(0.739/2 + 1.053/1) = 1.360
PR(B) = (1-d) + d(PR(A)/C(A)) = 0.150 + 0.85(1.360/2) = 0.728
PR(C) = (1-d) + d(PR(A)/C(A) + PR(B)/C(B)) = 0.150 + 0.85(1.360/2 + 0.728/2) = 1.037
Page Rank Calculation

Page Rank simulator: https://computerscience.chemeketa.edu/cs160Reader/_static/pageRankApp/index.html
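The hand iteration above can be reproduced in Python. The sketch below follows the same in-place update order the worked example uses (A, then B, then C, each seeing the latest values) with d = 0.85, so the results match the slides to three decimals:

```python
def pagerank_step(pr, links, d=0.85):
    """One in-place sweep of PR(p) = (1-d) + d * sum(PR(q)/C(q)) over in-links q,
    where C(q) is the number of out-links of page q."""
    for page in pr:                                   # update A, then B, then C
        inlinks = [q for q, outs in links.items() if page in outs]
        pr[page] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in inlinks)
    return pr

links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}   # out-links per page
pr = {"A": 1.0, "B": 1.0, "C": 1.0}                      # iteration 0
for it in range(1, 4):
    pagerank_step(pr, links)
    print(it, {p: round(v, 3) for p, v in pr.items()})
# 1 {'A': 1.425, 'B': 0.756, 'C': 1.077}
# 2 {'A': 1.386, 'B': 0.739, 'C': 1.053}
# 3 {'A': 1.36, 'B': 0.728, 'C': 1.037}
```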
Apriori Algorithm

Association Rules
• The objective is to find affinities between products, i.e., which products sell together often.
• Exercise: the support level is set at 33% and the confidence level is set at 50%.

Support and Confidence
• Support of an itemset is the fraction of transactions that contain all items in the itemset.
• Confidence of a rule X ⇒ Y is support(X ∪ Y) / support(X), i.e., the fraction of transactions containing X that also contain Y.
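As a sketch, support and confidence for a candidate rule can be computed by simple counting. The market-basket transactions below are invented, since the exercise's dataset is not reproduced here:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of lhs => rhs: support(lhs union rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical market-basket data (one set of items per transaction).
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"},
                {"bread", "milk"}, {"butter"}]
s = support({"bread", "milk"}, transactions)        # 3 of 6 baskets -> 0.5
c = confidence({"bread"}, {"milk"}, transactions)   # 0.5 / (4/6) -> 0.75
print(s, c)
# Against thresholds of 33% support and 50% confidence, bread => milk qualifies.
```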