Big Data Analysis Patterns

The document discusses big data analysis patterns and solutions. It provides an overview of technologies like Apache Hadoop, Mahout, Drill, Storm, Titan, Solr and Lucene. It then focuses on how to determine the best solution by considering factors like data size, query size, and response time needed. Examples are given of telecommunications, credit card, and waste management use cases and how different techniques like ETL, analytics, recommendations, and alerts could apply.

Big Data Analysis Patterns
Atlanta Big Data User Group
8/15/2013
1
whoami
Brad Anderson
Solutions Architect at MapR (Atlanta)
ATLHUG co-chair
NoSQL East Conference 2009
boorad most places (twitter, github)
[email protected]
2
Announcements
Next ATLHUG Meeting - Sept. 26
How Google Does Big Data

Wednesday MapR Data Warehouse Offload Roadshow

MapR Upcoming Training
MapR M7 & HBase for Developers on August 27 in Campbell, CA
MapR M7 & HBase for Developers on Sept 17 in Reston, VA
MapR M5 for Administrators on Oct 3 in Campbell, CA

3
BIG DATA
4
5
Big Data is not new!
but the tools are.

6
The Good News in Big Data:

Simple algorithms and lots of data trump complex models

Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems

7
The Challenge: So Many Solutions!

What solutions fit your business problem?


For example, do you need
Apache Hadoop?
Apache Mahout?
Storm?
Apache Solr/Lucene?
Apache HBase (or MapR M7)?
Apache Drill (or Impala?)
d3.js or Tableau?
Node.js?
Titan?
8
Ask a Different Question

It may be more useful to define the problem better by asking some of these questions:
How large is the data to be stored?
How large is the data to be queried? (the analysis volume)
What time frame is appropriate for your query response?
How fast is data arriving? (bursts or continuously?)
Are queries by sophisticated users?
Are you looking for common patterns or outliers?
How are your data sources structured?

9
Picking the Best Solution

Your responses to these questions can help you better:


define the problem
recognize the analysis pattern to which it belongs
guide the choice of solutions to try

But first, here's a quick review of a few of the technologies you might choose. Then we will focus on three of the questions as part of the landscape.

10
Apache Solr/Lucene

Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries over data such as
Full text
Geographical data
Statistically weighted data
Solr is a small data tool that has flourished in a big data world

11
Apache Mahout

Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.
Mahout algorithms are mainly used for
Recommendation (collaborative filtering)
Clustering
Classification
Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr

12
Apache Drill

Google Dremel clone


Pluggable Query Languages
Starts with ANSI SQL 2003
Hive, Pig, Cascading, MongoQL,
Pluggable Storage Backends
Hadoop, HBase
MongoDB (BSON)
RDBMS?
Bypasses MapReduce

13
Storm

Realtime Stream Computation Engine


Horizontal Scalability
Guaranteed Data Processing
Fault Tolerance
Higher level abstraction over:
Message Queues
Worker Logic

The Hadoop of Realtime

14
Titan
Distributed Graph Database
Property Graph
Pluggable Backend Storage
HBase or M7
Cassandra
Berkeley DB
Search Integrated
Solr/Lucene
Elasticsearch
Faunus
Batch processing of large graphs
Fulgora
Graph traversals on subset
In-memory
15
Using the Answers to Guide Your Choices

For simplicity, let's focus on the first three questions:
How large is the data to be stored?
How large is the data to be queried? (the analysis volume)
What time frame is appropriate for your query response?

16
Big Data Decision Tree
How big is your data?
  <10 GB: A
  mid: ?
  >200 GB: What size queries?
    Single element at a time: B (Big storage)
    One pass over 100%: C (Streaming)
    Multiple passes over big chunks: Response time?
      < 100s (human scale): D
      throughput not response: E
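
The decision tree on this slide can be sketched as a small function. This is only an illustration under stated assumptions: the function name, the argument names, and the string values for the query shape are hypothetical, and the pairing of B with "Big storage" and C with "Streaming" is read from the slide layout. The slide gives no outcome for the mid range, so the sketch returns "?" there.

```python
def pick_branch(data_gb, query_shape=None, human_scale_response=None):
    """Return the slide's leaf label (A to E), or "?" for the open mid range.

    All parameter names and the query_shape strings are hypothetical;
    only the thresholds and leaf labels come from the slide.
    """
    if data_gb < 10:
        return "A"
    if data_gb <= 200:
        return "?"  # the slide leaves the 10-200 GB range undecided
    if query_shape == "single_element":
        return "B"  # single element at a time ("Big storage")
    if query_shape == "one_pass":
        return "C"  # one pass over 100% of the data ("Streaming")
    if query_shape == "multiple_passes":
        if human_scale_response is None:
            raise ValueError("response-time requirement is needed for this branch")
        # under 100 s (human scale) leads to D; throughput-oriented work leads to E
        return "D" if human_scale_response else "E"
    raise ValueError("unknown query_shape: %r" % (query_shape,))


print(pick_branch(5))
print(pick_branch(500, "multiple_passes", human_scale_response=True))
```

The two calls print A and D. The point of the sketch is the order of the questions: storage size first, then query size, then response time.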

17
Use Cases
Company
Data Shape
Technique(s)
Business Value
18
Business Value

19
Business Value
20
Telecommunications Giant

ETL Offload
21
Telecommunications
Data Shape
Lots of Data
Lots of Queries across Large Sets
Throughput important

22
Telecommunications
Techniques
ETL -> Analytics

23
Telecommunications
Techniques

ETL (Hadoop) -> Analytics (Teradata)


24
Telecommunications
Business Value

25
Credit Card
Issuer

26
Credit Card
Issuer

Data Shape
Customer Purchase History (big)
Merchant Designations
Merchant Special Offers
Throughput important
Recommendations
27
Credit Card
Issuer

Techniques
A Recommendation Engine with Mahout and Solr/Lucene

History matrix

One row per user

One column per thing
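
A history matrix of this shape can be sketched in a few lines. This is a minimal illustration in plain Python, not the Mahout data format: the function name and the sample users and purchases are made up, and a cell is simply 1 when the user has bought the thing at least once.

```python
def build_history_matrix(purchases):
    """Build a binary user-by-thing matrix from (user, thing) pairs.

    Returns (users, things, matrix) with one row per user and one
    column per thing. Names and sample data are hypothetical.
    """
    users = sorted({user for user, _ in purchases})
    things = sorted({thing for _, thing in purchases})
    row_of = {user: r for r, user in enumerate(users)}
    col_of = {thing: c for c, thing in enumerate(things)}
    matrix = [[0] * len(things) for _ in users]
    for user, thing in purchases:
        matrix[row_of[user]][col_of[thing]] = 1
    return users, things, matrix


purchases = [
    ("alice", "coffee"), ("alice", "books"),
    ("bob", "coffee"), ("bob", "fuel"),
    ("carol", "books"), ("carol", "coffee"),
]
users, things, history = build_history_matrix(purchases)
for user, row in zip(users, history):
    print(user, row)
```

With the columns ordered books, coffee, fuel, the rows are [1, 1, 0] for alice, [0, 1, 1] for bob, and [1, 1, 0] for carol.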

28
Credit Card
Issuer

Techniques

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing
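
As a rough sketch of how a history matrix turns into an item-item mapping, the code below counts, for each pair of things, how many users have both. It uses raw counts only, which is an assumption for illustration; a production recommender would typically also filter out pairs that co-occur by chance. The function name and the sample matrix are hypothetical.

```python
def cooccurrence(history):
    """Count, for every pair of things, the users who have both.

    `history` is a binary matrix with one row per user and one column
    per thing. The result has one row and one column per thing; the
    diagonal is left at zero. Raw counts only (an assumption).
    """
    n_things = len(history[0]) if history else 0
    counts = [[0] * n_things for _ in range(n_things)]
    for row in history:
        bought = [j for j, flag in enumerate(row) if flag]
        for a in bought:
            for b in bought:
                if a != b:
                    counts[a][b] += 1
    return counts


# Hypothetical history: columns are books, coffee, fuel.
history = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]
item_item = cooccurrence(history)
for row in item_item:
    print(row)
```

Here books and coffee co-occur for two users and coffee and fuel for one, so the printed rows are [0, 2, 0], [2, 0, 1], and [0, 1, 0].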


29
Credit Card
Issuer

Techniques

Cooccurrence matrix can also be implemented as a search index
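
One way to picture the cooccurrence matrix as a search index: each thing becomes a document whose indicator field lists the things it co-occurs with, and a user's history becomes the query. The sketch below imitates that in plain Python and does not call Solr. The function names, the `min_count` cutoff, and the overlap-count scoring are assumptions for illustration; a real search engine would apply its own relevance scoring.

```python
def build_indicator_index(things, counts, min_count=1):
    """Map each thing to the set of things it co-occurs with.

    Plays the role of one indexed document per thing. `min_count`
    is a hypothetical cutoff, not a value from the slides.
    """
    index = {}
    for i, thing in enumerate(things):
        index[thing] = {
            things[j]
            for j, count in enumerate(counts[i])
            if j != i and count >= min_count
        }
    return index


def recommend(index, user_history, top_n=3):
    """Query the index with a user's history and rank unseen things."""
    seen = set(user_history)
    scores = {}
    for thing, indicators in index.items():
        if thing in seen:
            continue
        overlap = len(indicators & seen)  # simple stand-in for search relevance
        if overlap:
            scores[thing] = overlap
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


things = ["books", "coffee", "fuel"]
counts = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
index = build_indicator_index(things, counts)
print(recommend(index, ["books"]))
print(recommend(index, ["coffee"]))
```

A user who has only bought books is offered coffee, and a user who has only bought coffee is offered books and fuel. Serving recommendations then becomes an ordinary search query.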

30
Credit Card
Issuer

Techniques
Diagram: Complete history -> Cooccurrence (Mahout) -> Solr Indexers (Solr indexing) -> Index shards, with Item meta-data
Timings shown: 20 Hrs, 3 Hrs

31
Credit Card
Issuer

Techniques
Diagram: User history -> Web tier -> Solr search -> Index shards, with Solr Indexers and Item meta-data
Timings shown: 8 Hrs, 3 Min

32
Credit Card
Issuer

Techniques
Diagram: Purchase History, Merchant Information, and Merchant Offers feed the Recommendation Engine (Mahout) on Hadoop. Results move through Export (4 hrs) and Import (4 hrs) into the Presentation Data Store (DB2), which serves the Apps.
33
Credit Card
Issuer

Techniques
Diagram: Purchase History, Merchant Information, and Merchant Offers feed the Recommendation Engine (Mahout) on Hadoop. Results reach the Recommendation Search Index (Solr) through an Index Update (3 min), and the Apps query that index.

34
Credit Card
Issuer

Business Value

35
Waste & Recycling Leader

Idle Alerts
36
Data Shape
Truck Geolocation Data
20,000 trucks
5 sec interval (arriving quickly)
Landfill Geographic Boundaries
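
The arrival rate implied by these numbers is worth working out. The sketch below assumes every one of the 20,000 trucks reports once every 5 seconds around the clock, which is a simplification; the constant names are hypothetical.

```python
TRUCKS = 20_000          # from the slide
INTERVAL_SECONDS = 5     # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Assumes all trucks report continuously, which overstates a real fleet.
readings_per_second = TRUCKS // INTERVAL_SECONDS
readings_per_day = readings_per_second * SECONDS_PER_DAY

print(readings_per_second)  # 4000
print(readings_per_day)     # 345600000
```

Roughly 4,000 readings per second, or about 346 million per day at most, is why this counts as data that is "arriving quickly".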

37
Techniques
Truck Geolocation Data, Hadoop Storage
Realtime Stream Computation (Storm) -> Immediate Alerts
Batch Computation (MapReduce) -> Tax Reduction Reporting
Shortest Path Graph Algorithm (Titan) -> Route Optimization
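
The slides do not define what triggers an idle alert, so the following is only a guess at the kind of per-reading logic a stream worker might run. It is plain Python rather than Storm code, and the class name, the three-reading window, and the 10 m movement threshold are all hypothetical. A truck is flagged once its last few readings all sit within the threshold of each other.

```python
import math
from collections import defaultdict, deque


def distance_m(a, b):
    """Approximate distance in meters between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6_371_000 * math.hypot(x, y)  # equirectangular approximation


class IdleDetector:
    """Flag a truck whose recent readings have not moved (hypothetical rule)."""

    def __init__(self, idle_readings=3, move_threshold_m=10.0):
        self.move_threshold_m = move_threshold_m
        # One fixed-length window of recent positions per truck.
        self.recent = defaultdict(lambda: deque(maxlen=idle_readings))

    def observe(self, truck_id, position):
        """Record one reading; return True if the truck now looks idle."""
        window = self.recent[truck_id]
        window.append(position)
        if len(window) < window.maxlen:
            return False  # not enough readings yet
        first = window[0]
        return all(distance_m(first, p) <= self.move_threshold_m for p in window)


detector = IdleDetector()
parked = (33.7490, -84.3880)
for _ in range(3):
    idle = detector.observe("truck-1", parked)
print(idle)
print(detector.observe("truck-1", (33.7600, -84.3880)))
```

The third identical reading makes the truck look idle, and the next reading, about a kilometer away, clears it. With readings every 5 seconds, a three-reading window corresponds to roughly 10 seconds of standing still; a real rule would use a much longer window.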

38
Business Value

39
Beverage Company

Social Engagement Application

40
Data Shape

Tweets, FB Messages
Person, Activity links
Graph Traversal

41
Consumer Activity Graph

Wal*Mart.com

Ebay
Shopping.com
Sams

Ebay Motors

Dollar General

StubHub
Toys R Us
CVS

42
Techniques

Social Activity Stream
Property Graph (Titan)
Graph Traversal (Faunus/Fulgora)
Key/Value Store (MapR M7)

43
Business Value

44
Fraud Detection
Data Lake
45
Data Sources

Anti-Money Laundering
Consumer Transactions

46
Techniques
Anti-Money Laundering System
Consumer Transactions System

47
Techniques

AML and Consumer Transactions -> Data Lake (Hadoop) -> Suspicious Events -> Analyst
Latent Dirichlet Allocation, Bayesian Learning Neural Network, Peer Group Analysis
48
Business Value

49
Machine Learning
Search Relevance
DNA Matching
50
Data Sources

Birth, Death, Census, Military, Immigration records
Search Behavior Activity
DNA SNP (snips)
51
Techniques
Record Linking
Search Relevance
Clickstream Behavior
Security Forensics
DNA Matching
52
Business Value

53
Traffic Analytics
54
Data Sources

Inrix Road Segment Data
Avg Speed / minute / segment
Reference Speeds
Road Segment Geolocation Data
55
Techniques
Bottleneck Detection Algorithm
Time Offset Correlations
Alternate Routes
Predictive Congestion Analysis
Growth & Term Assumptions
56
57
58
Business Value

59
Similar Characteristics
Lots of Data
Structured, Semi-Structured, Unstructured
Varied Systems Interoperating
Hadoop, Storm, Solr, MPP, Visualizations

Increase Revenue
Decrease Costs

60
Questions?

61
