Module 4 Techniques in Big Data Analytics
Module 4 (ELEX ENGG.)
Finding Similar Items (Similarity and Correlation)
K-Nearest Neighbors (k-NN)
Jaccard Similarity for Two Sets
Jaccard Similarity
Calculate the Jaccard Distance
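The slide's formula did not survive extraction, so the following is a minimal sketch of the standard definitions: Jaccard similarity is the size of the intersection divided by the size of the union, and Jaccard distance is one minus that similarity. The function names, the example sets, and the convention of returning 1.0 for two empty sets are assumptions made for illustration.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |A intersect B| / |A union B| for two sets."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical example sets (terms appearing in two short documents).
A = {"big", "data", "stream", "mining"}
B = {"data", "stream", "analytics"}

sim = jaccard_similarity(A, B)   # 2 shared terms out of 5 distinct terms
dist = jaccard_distance(A, B)
print(f"similarity={sim:.2f}, distance={dist:.2f}")
```

Here the two sets share 2 of 5 distinct terms, so the similarity is 2/5 and the distance is 3/5.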
Application of Jaccard Similarity
• Text mining: find the similarity between two text documents using the number of terms used in both documents.
• E-Commerce: from a market database of thousands of customers and millions of items, find similar customers via their purchase history.
• Recommendation System: movie recommendation algorithms employ the Jaccard coefficient to find similar customers if they rented or rated highly many of the same movies.
Applications of Nearest Neighbor Search
• Optical Character Recognition (OCR): OCR software uses NN classifiers; for example, k-NN algorithms are used to compare image features with stored glyph features and choose the nearest match.
• Content-based image retrieval: these systems are typically given example images, and they find similar images using an NN approach.
• Collaborative filtering: the process of filtering for information or patterns using a set of collaborating agents, viewpoints, and data sources.
• Document similarity: many applications, such as web search engines and news aggregators, need to identify textually similar documents from a large corpus of documents like web pages, a collection of news articles, a collection of tweets, etc.
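As an illustration of nearest neighbor search applied to document similarity, here is a minimal brute-force sketch. It assumes documents are compared as sets of lowercase words using Jaccard distance; the tokenizer, the function names, and the three-document corpus are all hypothetical, and a linear scan like this is only practical for small corpora.

```python
def tokens(text: str) -> set:
    """Very simple tokenizer (assumption): lowercase, split on whitespace."""
    return set(text.lower().split())


def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


def k_nearest(query: str, corpus: list, k: int = 2) -> list:
    """Return the k documents closest to the query as (document, distance) pairs."""
    q = tokens(query)
    # Score every document, then keep the k smallest distances.
    scored = sorted((jaccard_distance(q, tokens(doc)), i) for i, doc in enumerate(corpus))
    return [(corpus[i], d) for d, i in scored[:k]]


# Hypothetical corpus of news headlines.
corpus = [
    "stock markets fall as oil prices rise",
    "oil prices rise and stock markets fall sharply",
    "local team wins the football final",
]

nearest = k_nearest("stock markets fall on oil prices", corpus, k=2)
for doc, d in nearest:
    print(f"{d:.3f}  {doc}")
```

The two finance headlines come back as the nearest neighbors, while the unrelated sports headline shares no words with the query and sits at the maximum distance of 1.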
Similarity of Documents
Automated analysis and organization of large document repositories:
• Storage and retrieval of records in a large enterprise
• Maintaining and retrieving a variety of patient-related data in a large hospital
• Web search engines
• Identifying trending topics on Twitter
Document Similarity
• A key issue in document management is the quantitative assessment of document similarity.
Document clustering
• Auto-categorization using seed documents.
Security scrubbing
• Finding documents with very similar content, but with different access control lists.
Data Stream Mining
A stream is defined as a possibly unbounded sequence of data items or records that may or may not be related.
Moreover, data was static and persistent in nature.
Need for Data Stream Mining
• [...] and data mining applications are not designed for rapid and continuous loading of data items.
• [...] in the data stream is lost forever if not processed immediately or stored.
Need for Data Stream Mining (cont.)
• Moreover, it is not possible to store all the arriving data and then interact with it at a time of your choice.
• Hence, there is a need for a system that handles this type of data under strict constraints of time and space.
A data stream management system (DSMS) is a
computer software system to manage continuous
data streams.
Data Stream Mining
• [...] can enter the system. Each stream can provide elements at its own schedule; they need not have the same data rates or data types, and the time between elements of one stream need not be uniform.
• Any query that requires backtracking over a data stream is infeasible due to storage and performance constraints.
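Because backtracking over the stream is infeasible, computations are usually arranged so that each element is seen exactly once. The sketch below is one hedged illustration of that idea: it keeps a running count, mean, and maximum in constant memory. The generator `sensor_readings` and its values are hypothetical stand-ins for a real stream source.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, max) after each element, using O(1) memory."""
    count = 0
    mean = 0.0
    maximum = float("-inf")
    for x in stream:
        count += 1
        # Incremental mean update: no past elements are stored or revisited.
        mean += (x - mean) / count
        maximum = max(maximum, x)
        yield count, mean, maximum


def sensor_readings() -> Iterator[float]:
    """Hypothetical stream source; a real one would be unbounded."""
    for value in [21.5, 22.0, 23.5, 21.0]:
        yield value


last = None
for last in running_stats(sensor_readings()):
    pass  # each element is consumed once and then discarded

print(last)
```

Aggregates such as count, sum, mean, and maximum fit this single-pass pattern directly; queries that need to revisit earlier elements do not.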
Facts of Data Stream Mining
[Figure: Abstract architecture for a typical DSMS. Labeled components: Input Stream, Input Regulator, Stream Summary, Query Processor, Storage, Metadata Repository, Query Storage, User Queries.]
1. Temporary working storage (e.g., for window queries).
2. Summary storage.
Characteristics
The inability to store a complete stream indicates that some approximate summary structures must be used. As a result, queries over the summaries may not return exact answers.
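The slides do not name a particular summary structure. As one possible illustration, the sketch below uses reservoir sampling (a technique chosen here, not taken from the source) to keep a fixed-size uniform random sample of a stream; queries such as the mean are then answered approximately from the sample. The stream contents, the sample size, and the seed are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream: the integers 1..10000, consumed one at a time.
stream = (x for x in range(1, 10001))
sample = reservoir_sample(stream, k=100, rng=random.Random(42))

# Approximate answer computed from the summary instead of the full stream.
estimate = sum(sample) / len(sample)
print(f"estimated mean from {len(sample)} sampled items: {estimate:.1f}")
```

The estimate is close to, but generally not equal to, the true mean of the full stream, which is the trade-off the slide describes.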
Data Stream Application
Sensor Network
In many cases, alerts and alarms may be generated as a response to the information received from a series of sensors. To perform such analysis, aggregation and joins over multiple streams corresponding to the various sensors are required.
1. Perform a join of several data streams, like temperature streams, ocean current streams, etc., at weather stations to give alerts or warnings of disasters like cyclones and tsunamis. Note that such information can change very rapidly based on the vagaries of nature.
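A join over two sensor streams can be sketched as a time-windowed join: readings from each stream are buffered briefly and paired with readings from the other stream that arrived within the same window. The example below is a minimal sketch under assumed conditions: both streams are time-ordered lists of `(timestamp_seconds, value)` pairs, and the window length and alert thresholds are hypothetical values, not figures from the source.

```python
import heapq
from collections import deque


def windowed_join_alerts(temps, currents, window=60.0,
                         temp_limit=30.0, current_limit=2.5):
    """Yield (timestamp, temperature, current) when both exceed their limits
    for readings that arrived within `window` seconds of each other."""
    tagged_t = ((ts, "temp", v) for ts, v in temps)
    tagged_c = ((ts, "current", v) for ts, v in currents)
    recent = {"temp": deque(), "current": deque()}
    # Interleave the two time-ordered streams by timestamp.
    for ts, kind, value in heapq.merge(tagged_t, tagged_c):
        other = "current" if kind == "temp" else "temp"
        # Evict buffered readings that have fallen out of the window.
        for buf in recent.values():
            while buf and ts - buf[0][0] > window:
                buf.popleft()
        # Join the new reading against recent readings of the other stream.
        for _, other_value in recent[other]:
            t, c = (value, other_value) if kind == "temp" else (other_value, value)
            if t > temp_limit and c > current_limit:
                yield ts, t, c
        recent[kind].append((ts, value))


# Hypothetical readings: (seconds, degrees C) and (seconds, metres per second).
temps = [(0, 28.0), (30, 31.5), (90, 32.0)]
currents = [(10, 1.0), (40, 3.0), (200, 3.2)]

alerts = list(windowed_join_alerts(temps, currents))
for alert in alerts:
    print("ALERT", alert)
```

Only the bounded buffers are kept in memory, so the join respects the storage constraints discussed earlier; the last current reading raises no alert because no temperature reading falls within its window.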