Module 4 Techniques in Big Data Analytics

This document discusses techniques for analyzing big data streams, including finding similar items and calculating Jaccard similarity. It provides examples of applying these techniques to recommendations systems, e-commerce, and text mining. Applications of nearest neighbor searches are described for tasks like optical character recognition, content-based image retrieval, collaborative filtering, and document similarity analysis. The need for data stream mining over static datasets is explained, and a typical architecture for a data stream management system is outlined.

Uploaded by King Bavisi
© All Rights Reserved

Techniques in Big Data Analytics
MODULE 4 (ELEX ENGG.)

Finding Similar Items (Similarity and Correlation)
K-Nearest Neighbours
Jaccard Similarity for Two Sets

• The Jaccard similarity of two sets A and B is the ratio of the size of their intersection to the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

• The Jaccard distance is the complement of the similarity: d(A, B) = 1 − J(A, B).
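As a minimal sketch (in Python, with invented example term sets), both quantities follow directly from set operations:

```python
def jaccard_similarity(a: set, b: set) -> float:
    """J(A, B) = |A intersect B| / |A union B|; defined as 0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def jaccard_distance(a: set, b: set) -> float:
    """d(A, B) = 1 - J(A, B)."""
    return 1.0 - jaccard_similarity(a, b)

# Hypothetical example: term sets of two short documents.
doc1 = {"big", "data", "stream", "mining"}
doc2 = {"big", "data", "analytics"}
print(jaccard_similarity(doc1, doc2))  # 2 shared terms / 5 distinct terms = 0.4
```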
Applications of Jaccard Similarity

• Text mining: find the similarity between two text documents using the number of terms used in both documents.
• E-Commerce: from a market database of thousands of customers and millions of items, find similar customers via their purchase history.
• Recommendation System: movie recommendation algorithms employ the Jaccard coefficient to find similar customers if they rented or rated highly many of the same movies.
Applications of Nearest Neighbor Search

• Optical Character Recognition (OCR): OCR software uses NN classifiers; for example, the k-NN algorithm is used to compare image features with stored glyph features and choose the nearest match.
• Content-based image retrieval: these systems typically provide example images, and the systems find similar images using an NN approach.
• Collaborative filtering: the process of filtering for information or patterns using a set of collaborating agents, viewpoints, data sources, etc.
• Document Similarity: many applications, such as web search engines and news aggregators, need to identify textually similar documents from a large corpus of documents such as web pages, a collection of news articles, a collection of tweets, etc.
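The OCR use of k-NN above can be sketched in Python. The 2-D "glyph features" and character labels below are invented for illustration; a real OCR system would use far richer feature vectors:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: a feature vector.
    Returns the majority label among the k nearest stored examples (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))
    top_k_labels = [label for _, label in nearest[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Hypothetical glyph features: (width ratio, ink density) with character labels.
train = [((0.2, 0.9), "l"), ((0.25, 0.85), "l"),
         ((0.8, 0.4), "o"), ((0.75, 0.45), "o"), ((0.78, 0.5), "o")]
print(knn_classify(train, (0.77, 0.42), k=3))  # nearest three examples are all "o"
```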
Similarity of Documents

• Storage and retrieval of records in a large enterprise
• Automated analysis and organization of large document repositories
• Maintaining and retrieving a variety of patient-related data in a large hospital
• Web search engines
• Identifying trending topics on Twitter
Document Similarity

• A key issue in document management is the quantitative assessment of document similarity:
• How similar are the two text documents?
• Are two patient histories similar?
• Which documents match a given query best?
Applications for text-based similarity

• Quality in search engines: near-duplicate detection improves the quality of search results.
• Finding similar employees: Human Resources applications, such as automated matching of CVs against job descriptions.
• Patent research: matching potential patent applications against a corpus of existing patent grants.
Applications for text-based similarity (cont.)

• Document clustering: auto-categorization using seed documents.
• Security scrubbing: finding documents with very similar content, but with different access control lists.
Data Stream Mining: Introduction

• A stream is defined as a possibly unbounded sequence of data items or records that may or may not be related to, or correlated with, each other.
• Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes).
• Data stream mining is the process of extracting knowledge structures from continuous, rapid data records.
Data Stream Mining: Need for Data Stream Mining

• In traditional data mining-based applications, we know the entire dataset in advance. Moreover, the data was static and persistent in nature. Hence, this model was adequate for most older and legacy applications.
• Many current and emerging applications, such as Facebook, Twitter, sensor networks, network monitoring, etc., generate continuous, rapid, time-varying, unpredictable and unbounded streams of data.
Need for Data Stream Mining (cont.)

• Traditional DBMS and data mining applications are not designed for rapid and continuous loading of data items.
• Further, the data in the data stream is lost forever if not processed immediately or stored.
Need for Data Stream Mining (cont.)

• Moreover, it is not possible to store all the arriving data and then interact with it at a time of your choice.
• Hence, there is a need for a system that handles these types of data under the strict constraints of time and space.
Data Stream Mining: Data Stream Management System

• A data stream management system (DSMS) is a computer software system to manage continuous data streams.
• A DSMS also offers flexible query processing so that the information need can be expressed using queries.
• A DSMS executes a continuous query that is not only performed once, but is permanently installed. Therefore, the query is continuously executed until it is explicitly uninstalled.
• Since most DSMSs are data-driven, a continuous query produces new results as long as new data arrive at the system.
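The data-driven behaviour of a continuous query can be mimicked in plain Python with a generator that emits a fresh result for every arriving record. This is only a sketch of the idea, not a real DSMS; the running-average query is an invented example:

```python
def continuous_average(stream):
    """Continuous query: emit the running average after each arriving record."""
    total, count = 0.0, 0
    for record in stream:   # a new result is produced as long as data arrive
        total += record
        count += 1
        yield total / count

# Hypothetical stream of numeric sensor readings.
results = list(continuous_average(iter([10, 20, 30])))
print(results)  # [10.0, 15.0, 20.0]
```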
Data Stream Mining: Facts of Data Stream Mining

• Any number of streams can enter the system. Each stream can provide elements at its own schedule; they need not have the same data rates or data types, and the time between elements of one stream need not be uniform.
• Any query that requires backtracking over a data stream is infeasible due to storage and performance constraints.
Facts of Data Stream Mining (cont.)

• Streaming query plans must not use any operators that require the entire input before any results are produced. Such operators will block the query processor indefinitely.
• At times, it is not possible to store the entire data stream. Hence, approximate summary structures are used. As a result, queries over the summaries may not return exact answers.
Data Stream Mining: Working

Abstract architecture for a typical DSMS: input streams pass through an input regulator to the query processor, which answers user queries with the help of a metadata repository and three kinds of storage:

1. Temporary working storage (e.g., for window queries).
2. Summary storage.
3. Static storage for metadata (e.g., the physical location of each source).
Data Stream Mining: Characteristics

• The data model and query processor must allow both order-based and time-based operations (e.g., queries over a 10-minute moving window, or queries of the form "which are the most frequently occurring data before a particular event").
• The inability to store a complete stream indicates that some approximate summary structures must be used. As a result, queries over the summaries may not return exact answers.
• Streaming query plans must not use any operators that require the entire input before any results are produced. Such operators will block the query processor indefinitely.
Data Stream Mining: Characteristics (cont.)

• Any query that requires backtracking over a data stream is infeasible due to the storage and performance constraints imposed by a data stream. Thus, any online stream algorithm is restricted to making only one pass over the data.
• Applications that monitor streams in real time must react quickly to unusual data values. Thus, long-running queries must be prepared for changes in system conditions at any time during their execution lifetime (e.g., they may encounter variable stream rates).
• Scalability requirements dictate that parallel and shared execution of many continuous queries must be possible.
Data Stream Application: Sensor Network

• A sensor network is used in numerous situations that require constant monitoring of several variables, based on which important decisions are made.
• In many cases, alerts and alarms may be generated as a response to the information received from a series of sensors.
• To perform such analysis, aggregation and joins over multiple streams corresponding to the various sensors are required.
Sensor Network: some representative queries

1. Perform a join of several data streams, such as temperature streams and ocean current streams, at weather stations to give alerts or warnings of disasters like cyclones and tsunamis. It can be noted here that such information can change very rapidly based on the vagaries of nature.
2. Constantly monitor a stream of recent power usage statistics reported to a power station, and group them by location, user type, etc., to manage power distribution efficiently.
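The second query above (grouping recent power usage by location) can be sketched as a sliding-window aggregation. The record format, window size, and location names below are assumptions for illustration:

```python
from collections import deque, defaultdict

class WindowedGroupBy:
    """Keep only the last `window` records and aggregate usage by location."""
    def __init__(self, window=4):
        self.records = deque(maxlen=window)   # old records fall out automatically

    def add(self, location, usage_kw):
        self.records.append((location, usage_kw))

    def totals(self):
        agg = defaultdict(float)
        for location, usage_kw in self.records:
            agg[location] += usage_kw
        return dict(agg)

w = WindowedGroupBy(window=3)
for rec in [("north", 5.0), ("south", 2.0), ("north", 3.0), ("south", 4.0)]:
    w.add(*rec)
print(w.totals())  # first record has expired: {'south': 6.0, 'north': 3.0}
```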
Data Stream Application: Network Traffic Analysis

• Network service providers can constantly get information about Internet traffic, heavily used routes, etc., to identify and predict potential congestion.
• Streams of network traffic can also be analyzed to identify potentially fraudulent activities, e.g., by an intrusion detection system.
• If a particular server on the network becomes a victim of a denial-of-service attack, that route can become heavily congested within a short period of time.
Example queries

• Check whether a current stream of actions over a time window is like a previously identified intrusion on the network.
• Check if several routes over which traffic is moving have several common intermediate nodes, which may potentially indicate congestion on that route.
Link Analysis

Spamdexing
• Spamdexing is the practice of keyword stuffing, or otherwise manipulating an index for a website, with the intention of increasing the website's ranking with search engines.
• Search Engine Optimization (SEO) is an industry that attempts to make a website attractive to the major search engines and thus increase its ranking.

• Two popular techniques of spamdexing:
1. Cloaking
2. Use of "doorway" pages.
Cloaking

• Cloaking is a technique where a website shows one version of a URL, page, or piece of content to the search engines for ranking purposes while showing another to its actual visitors.
Doorway

• A doorway page is a page on your website which has been created to rank for specific search queries.
• It is a "doorway" to the main content and is not at all useful to users.
• Sometimes this tactic involves redirection to another page: when the user clicks on it, the meta refresh fires and there is a very quick redirect to another page.
Page Rank

• Improves Web search by analyzing the hyperlinks and the graph structure of the Web.
• Link analysis is one of many factors considered by Web search engines in computing a composite score for a Web page on any given query.
Dangling Links

• A dangling page does not contribute to the Web page rank calculation.
• E.g., Web page D is dangling.
[Diagram: "Institute", "Affiliated Person", "Rank"]

Page Rank simulator: https://tools.withcode.uk/pagerank/
Example: three pages A, B, and C, where A links to B and C, B links to A and C, and C links to A (so C(A) = 2, C(B) = 2, C(C) = 1), with damping factor d = 0.85.

Iteration 0:
PR(A) = 1, PR(B) = 1, PR(C) = 1

Iteration 1:
PR(A) = (1-d) + d(PR(B)/C(B) + PR(C)/C(C)) = 0.150 + 0.85(1.000/2 + 1.000/1) = 1.425
PR(B) = (1-d) + d(PR(A)/C(A)) = 0.150 + 0.85(1.425/2) = 0.756
PR(C) = (1-d) + d(PR(A)/C(A) + PR(B)/C(B)) = 0.150 + 0.85(1.425/2 + 0.756/2) = 1.077

Iteration 2:
PR(A) = (1-d) + d(PR(B)/C(B) + PR(C)/C(C)) = 0.150 + 0.85(0.756/2 + 1.077/1) = 1.386
PR(B) = (1-d) + d(PR(A)/C(A)) = 0.150 + 0.85(1.386/2) = 0.739
PR(C) = (1-d) + d(PR(A)/C(A) + PR(B)/C(B)) = 0.150 + 0.85(1.386/2 + 0.739/2) = 1.053

Iteration 3:
PR(A) = (1-d) + d(PR(B)/C(B) + PR(C)/C(C)) = 0.150 + 0.85(0.739/2 + 1.053/1) = 1.360
PR(B) = (1-d) + d(PR(A)/C(A)) = 0.150 + 0.85(1.360/2) = 0.728
PR(C) = (1-d) + d(PR(A)/C(A) + PR(B)/C(B)) = 0.150 + 0.85(1.360/2 + 0.728/2) = 1.037
Page Rank Calculation

Page Rank simulator: https://computerscience.chemeketa.edu/cs160Reader/_static/pageRankApp/index.html
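The hand iteration above can be reproduced in Python. The sketch below follows the same in-place update order the worked example uses (A, then B, then C, each seeing the latest values) with d = 0.85, so the results match the slides to three decimals:

```python
def pagerank_step(pr, links, d=0.85):
    """One in-place sweep of PR(p) = (1-d) + d * sum(PR(q)/C(q)) over in-links q,
    where C(q) is the number of out-links of page q."""
    for page in pr:                                   # update A, then B, then C
        inlinks = [q for q, outs in links.items() if page in outs]
        pr[page] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in inlinks)
    return pr

links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}   # out-links per page
pr = {"A": 1.0, "B": 1.0, "C": 1.0}                      # iteration 0
for it in range(1, 4):
    pagerank_step(pr, links)
    print(it, {p: round(v, 3) for p, v in pr.items()})
# 1 {'A': 1.425, 'B': 0.756, 'C': 1.077}
# 2 {'A': 1.386, 'B': 0.739, 'C': 1.053}
# 3 {'A': 1.36, 'B': 0.728, 'C': 1.037}
```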
Apriori Algorithm

Association Rules
• The objective is to find affinities between products, i.e., which products sell together often.
• Exercise: the support level is set at 33% and the confidence level is set at 50%.

Support and Confidence
• Support of an itemset is the fraction of transactions that contain all items in the itemset.
• Confidence of a rule X ⇒ Y is support(X ∪ Y) / support(X), i.e., the fraction of transactions containing X that also contain Y.
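As a sketch, support and confidence for a candidate rule can be computed by simple counting. The market-basket transactions below are invented, since the exercise's dataset is not reproduced here:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of lhs => rhs: support(lhs union rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical market-basket data (one set of items per transaction).
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"},
                {"bread", "milk"}, {"butter"}]
s = support({"bread", "milk"}, transactions)        # 3 of 6 baskets -> 0.5
c = confidence({"bread"}, {"milk"}, transactions)   # 0.5 / (4/6) -> 0.75
print(s, c)
# Against thresholds of 33% support and 50% confidence, bread => milk qualifies.
```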