
Big Data Analysis Patterns

The document discusses big data analysis patterns and solutions. It provides an overview of technologies like Apache Hadoop, Mahout, Drill, Storm, Titan, Solr and Lucene. It then focuses on how to determine the best solution by considering factors like data size, query size, and response time needed. Examples are given of telecommunications, credit card, and waste management use cases and how different techniques like ETL, analytics, recommendations, and alerts could apply.

Uploaded by

Mahdy Zia Uzzaman


Big Data Analysis Patterns
Atlanta Big Data User Group
8/15/2013
1
whoami
Brad Anderson
Solutions Architect at MapR (Atlanta)
ATLHUG co-chair
NoSQL East Conference 2009
boorad most places (twitter, github)
[email protected]
2
Announcements
Next ATLHUG Meeting - Sept. 26
How Google Does Big Data

Wednesday MapR Data Warehouse Offload Roadshow

MapR Upcoming Training

MapR M7 & HBase for Developers on August 27 in Campbell, CA
MapR M7 & HBase for Developers on Sept 17 in Reston, VA
MapR M5 for Administrators on Oct 3 in Campbell, CA

3
BIG DATA
4
5
Big Data is not new!
but the tools are.

6
The Good News in Big Data:

Simple algorithms and lots of data trump complex models

Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

7
The Challenge: So Many Solutions!

What solutions fit your business problem?

For example, do you need
Apache Hadoop?
Apache Mahout?
Storm?
Apache Solr/Lucene?
Apache HBase (or MapR M7)?
Apache Drill (or Impala)?
d3.js or Tableau?
Node.js?
Titan?
8
Ask a Different Question

It may be more useful to better define the problem by asking some of these questions:
How large is the data to be stored?
How large is the data to be queried? (the analysis volume)
What time frame is appropriate for your query response?
How fast is data arriving? (bursts or continuously?)
Are queries by sophisticated users?
Are you looking for common patterns or outliers?
How are your data sources structured?

9
Picking the Best Solution

Your responses to these questions can help you better:

define the problem
recognize the analysis pattern to which it belongs
guide the choice of solutions to try

But first, here's a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.

10
Apache Solr/Lucene

Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries including data such as
Full text
Geographical data
Statistically weighted data
Solr is a small data tool that has flourished in a big data world

11
Apache Mahout

Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.
Mahout algorithms are mainly used for
Recommendation (collaborative filtering)
Clustering
Classification
Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr

12
Apache Drill

Google Dremel clone

Pluggable Query Languages
Starts with ANSI SQL 2003
Hive, Pig, Cascading, MongoQL,
Pluggable Storage Backends
Hadoop, HBase
MongoDB (BSON)
RDBMS?
Bypasses MapReduce

13
Storm

Realtime Stream Computation Engine

Horizontal Scalability
Guaranteed Data Processing
Fault Tolerance
Higher level abstraction over:
Message Queues
Worker Logic

The Hadoop of Realtime

14
Titan
Distributed Graph Database
Property Graph
Pluggable Backend Storage
HBase or M7
Cassandra
Berkeley DB
Search Integrated
Solr/Lucene
Elasticsearch
Faunus
Batch processing of large graphs
Fulgora
Graph traversals on subset
In-memory
15
Using the Answers to Guide Your Choices

For simplicity, let's focus on the first three questions:

How large is the data to be stored?
How large is the data to be queried? (the analysis volume)
What time frame is appropriate for your query response?

16
Big Data Decision Tree
How big is your data?
  <10 GB: A
  mid: ?
  >200 GB: What size queries?
    Single element at a time: B (Big storage)
    One pass over 100%: C (Streaming)
    Multiple passes over big chunks: Response time?
      < 100s (human scale): D
      throughput not response: E
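The decision tree above can be sketched as a small function. This is only an illustration: the function name, argument names, and the string labels for query size are hypothetical, and the pairing of B with "big storage" and C with "streaming" is read from the slide layout rather than stated in the text.

```python
def pick_pattern(data_gb, query_size=None, human_scale_response=None):
    """Walk the slide's decision tree and return the leaf label.

    data_gb: size of the stored data in gigabytes.
    query_size: "single_element", "one_pass", or "multiple_passes"
                (hypothetical labels for the three query branches).
    human_scale_response: True if answers are needed in under ~100 s,
                          False if throughput matters more than latency.
    """
    if data_gb < 10:
        return "A"
    if data_gb <= 200:
        return "?"  # the slide leaves the middle range open
    if query_size == "single_element":
        return "B"  # big storage
    if query_size == "one_pass":
        return "C"  # streaming
    if query_size == "multiple_passes":
        return "D" if human_scale_response else "E"
    raise ValueError("query_size is required for data over 200 GB")


print(pick_pattern(5))
print(pick_pattern(5000, "multiple_passes", human_scale_response=False))
```

Keeping the three questions as separate arguments mirrors the slide: each answer narrows the set of tools worth trying, rather than naming a single product.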

17
Use Cases
Company
Data Shape
Technique(s)
Business Value
18
Business Value

19
Business Value
20
Telecommunications Giant

ETL Offload
21
Telecommunications
Data Shape
Lots of Data
Lots of Queries across Large Sets
Throughput important

22
Telecommunications
Techniques
ETL Analytics

23
Telecommunications
Techniques

ETL (Hadoop) Analytics (Teradata)

24
Telecommunications
Business Value

25
Credit Card
Issuer

26
Credit Card
Issuer

Data Shape
Customer Purchase History (big)
Merchant Designations
Merchant Special Offers
Throughput important
Recommendations
27
Credit Card
Issuer

Techniques
A Recommendation Engine with Mahout and Solr/Lucene

History matrix

One row per user

One column per thing

28
Credit Card
Issuer

Techniques

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing
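A minimal sketch of the history and cooccurrence matrices described on these slides, in plain Python rather than Mahout. The purchase data and function names are made up for illustration, and the raw counts are a simplification: production systems typically weight or filter them (for example with a log-likelihood ratio test) before using them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical history "matrix": one row per user, one column per thing.
history = {
    "u1": {"coffee", "bagel", "juice"},
    "u2": {"coffee", "bagel"},
    "u3": {"coffee", "juice"},
    "u4": {"tea", "bagel"},
}


def cooccurrence(history):
    """Count how often each pair of things appears in the same user's row."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user_items, counts, top_n=3):
    """Score unseen things by their cooccurrence with the user's things."""
    scores = defaultdict(int)
    for item in user_items:
        for other, n in counts[item].items():
            if other not in user_items:
                scores[other] += n
    return sorted(scores, key=lambda k: (-scores[k], k))[:top_n]


counts = cooccurrence(history)
print(recommend({"juice"}, counts))
```

The result is an item-item mapping with one row and one column per thing, so recommending for a user only needs that user's own history plus the shared matrix.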

29
Credit Card
Issuer

Techniques

Cooccurrence matrix can also be implemented as a search index
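One way to picture the idea: each item becomes a document whose "indicators" field lists the items that cooccur with it, and a user's recent history becomes the query. The sketch below uses a plain Python inverted index as a stand-in for a search engine; it does not call Solr, and the field name, data, and scoring (a simple count of matching terms) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical item documents; "indicators" holds cooccurring item ids.
item_docs = {
    "bagel": {"indicators": ["coffee", "juice", "tea"]},
    "coffee": {"indicators": ["bagel", "juice"]},
    "juice": {"indicators": ["bagel", "coffee"]},
    "tea": {"indicators": ["bagel"]},
}


def build_inverted_index(docs):
    """Map each indicator term to the set of documents that contain it."""
    index = {}
    for doc_id, doc in docs.items():
        for term in doc["indicators"]:
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query_terms, exclude=()):
    """Rank documents by how many query terms match their indicators."""
    hits = Counter()
    for term in query_terms:
        for doc_id in index.get(term, ()):
            if doc_id not in exclude:
                hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))


index = build_inverted_index(item_docs)
user_history = ["coffee", "juice"]
print(search(index, user_history, exclude=user_history))
```

Framing recommendation as a query means the serving side is an ordinary search request, which is what lets a search index stand in for a separate results store.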

30
Credit Card
Issuer

Techniques
[Diagram labels: Complete history; Cooccurrence (Mahout); Item meta-data; Solr indexing; Solr Indexers; Index shards; timings 20 Hrs and 3 Hrs]

31
Credit Card
Issuer

Techniques
[Diagram labels: User history; Web tier; Solr search; Solr Indexers; Item meta-data; Index shards; timings 8 Hrs and 3 Min]

32
Credit Card
Issuer

Techniques
[Diagram labels: Purchase History; Merchant Information; Merchant Offers; Import (4 hrs); Hadoop; Recommendation Engine (Mahout); Results; Export (4 hrs); Presentation Data Store (DB2); Apps]
33
Credit Card
Issuer

Techniques
[Diagram labels: Purchase History; Merchant Information; Merchant Offers; Hadoop; Recommendation Engine (Mahout); Recommendation Results; Index Update (3 min); Recommendation Search Index (Solr); Apps]

34
Credit Card
Issuer

Business Value

35
Waste & Recycling Leader

Idle Alerts
36
Data Shape
Truck Geolocation Data
20,000 trucks
5 sec interval (arriving quickly)
Landfill Geographic Boundaries
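As a rough illustration of what this data shape implies, the sketch below works out the arrival rate and shows one possible idle check. The slides only name "Idle Alerts", so the rule itself is an assumption: a truck is flagged when its recent positions have not moved and the spot lies outside a landfill boundary. The thresholds, coordinates, and function names are all hypothetical.

```python
TRUCKS = 20_000
INTERVAL_S = 5
EVENTS_PER_SECOND = TRUCKS / INTERVAL_S  # position reports arriving per second


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def idle_alert(pings, landfill, max_move=0.0001, min_pings=60):
    """pings: chronological (x, y) positions for one truck, 5 s apart.

    Returns True when the last min_pings positions stay within max_move
    of the first of them and that spot is outside the landfill polygon.
    With the assumed default, 60 pings is five minutes.
    """
    if len(pings) < min_pings:
        return False
    window = pings[-min_pings:]
    x0, y0 = window[0]
    stationary = all(
        abs(x - x0) <= max_move and abs(y - y0) <= max_move for x, y in window
    )
    return stationary and not point_in_polygon(x0, y0, landfill)


landfill = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(EVENTS_PER_SECOND)
print(idle_alert([(2.0, 2.0)] * 60, landfill))
```

In a stream processor this check would run per truck as each position arrives; here it is a plain function so the logic can be read on its own.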

37
Techniques
Truck Geolocation Data
Realtime Stream Computation (Storm): Immediate Alerts
Hadoop Storage
Batch Computation (MapReduce): Tax Reduction Reporting
Shortest Path Graph Algorithm (Titan): Route Optimization
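The shortest-path step named on this slide can be illustrated with Dijkstra's algorithm in plain Python. This is a generic sketch and not Titan's API; the road graph, node names, and edge costs are invented for the example.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over graph = {node: {neighbor: cost}}.

    Returns (total_cost, path) or (inf, []) if goal is unreachable.
    """
    queue = [(0, start, [start])]
    best = {start: 0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return float("inf"), []


# Hypothetical road graph with travel costs between stops.
roads = {
    "depot": {"a": 4, "b": 1},
    "b": {"a": 2, "landfill": 8},
    "a": {"landfill": 5},
}
print(shortest_path(roads, "depot", "landfill"))
```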

38
Business Value

39
Beverage Company

Social Engagement Application

40
Data Shape

Tweets, FB Messages
Person, Activity links
Graph Traversal

41
Consumer Activity Graph

Wal*Mart.com

Ebay
Shopping.com
Sams

Ebay Motors

Dollar General

StubHub
Toys R Us
CVS

42
Techniques

Social Activity Stream
Property Graph (Titan)
Graph Traversal (Faunus/Fulgora)
Key/Value Store (MapR M7)
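To make "property graph" and "graph traversal" concrete, here is a small breadth-first traversal in plain Python. It is not Titan, Faunus, or Fulgora code; the vertices, edge labels, and the `people_within` helper are hypothetical and only show the shape of person-activity links.

```python
from collections import deque

# Hypothetical property graph: vertices carry properties, edges carry a label.
vertices = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "carol": {"type": "person"},
    "tweet1": {"type": "activity", "channel": "twitter"},
    "post1": {"type": "activity", "channel": "facebook"},
}
edges = [
    ("alice", "posted", "tweet1"),
    ("bob", "liked", "tweet1"),
    ("bob", "posted", "post1"),
    ("carol", "liked", "post1"),
]


def neighbors(vertex):
    """Yield vertices linked to vertex, ignoring edge direction."""
    for src, _label, dst in edges:
        if src == vertex:
            yield dst
        elif dst == vertex:
            yield src


def people_within(start, max_hops):
    """Breadth-first traversal returning people reachable within max_hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        vertex, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nxt in neighbors(vertex):
            if nxt not in seen:
                seen.add(nxt)
                if vertices[nxt]["type"] == "person":
                    found.append(nxt)
                frontier.append((nxt, hops + 1))
    return found


print(people_within("alice", 2))
```

Two hops (person, activity, person) finds people who engaged with the same activity; widening the hop count reaches friends of friends.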

43
Business Value

44
Fraud Detection
Data Lake
45
Data Sources

Anti-Money Laundering
Consumer Transactions

46
Techniques
Anti-Money Laundering System
Consumer Transactions System

47
Techniques

AML
Consumer Transactions
Data Lake (Hadoop)
Suspicious Events
Analyst

Latent Dirichlet Allocation, Bayesian Learning Neural Network, Peer Group Analysis
48
Business Value

49
Machine Learning
Search Relevance
DNA Matching
50
Data Sources

Birth, Death, Census, Military, Immigration records
Search Behavior Activity
DNA SNP (snips)
51
Techniques
Record Linking
Search Relevance
Clickstream Behavior
Security Forensics
DNA Matching
52
Business Value

53
Traffic Analytics
54
Data Sources

Inrix Road Segment Data

Avg Speed / minute / segment
Reference Speeds
Road Segment Geolocation Data
55
Techniques
Bottleneck Detection Algorithm
Time Offset Correlations
Alternate Routes
Predictive Congestion Analysis
Growth & Term Assumptions
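The slides do not describe the bottleneck detection algorithm, so the following is only one plausible reading of the data sources: compare each segment's per-minute average speed with its reference speed and flag segments that stay well below it for several consecutive minutes. The 60% ratio, the three-minute run length, the sample speeds, and all function names are assumptions.

```python
def congested_minutes(speeds, reference, ratio=0.6):
    """Flag each minute whose average speed falls below ratio * reference."""
    return [s < ratio * reference for s in speeds]


def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def find_bottlenecks(speed_by_segment, reference_by_segment,
                     ratio=0.6, min_minutes=3):
    """Return segments congested for at least min_minutes in a row."""
    result = []
    for segment, speeds in speed_by_segment.items():
        flags = congested_minutes(speeds, reference_by_segment[segment], ratio)
        if longest_run(flags) >= min_minutes:
            result.append(segment)
    return sorted(result)


# Hypothetical per-minute average speeds and reference speeds per segment.
speeds = {"seg1": [60, 30, 28, 25, 58], "seg2": [55, 54, 30, 56, 57]}
reference = {"seg1": 60, "seg2": 55}
print(find_bottlenecks(speeds, reference))
```

Requiring a sustained run rather than a single slow minute keeps one noisy reading from being reported as a bottleneck.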
56
57
58
Business Value

59
Similar Characteristics
Lots of Data
Structured, Semi-Structured, Unstructured
Varied Systems Interoperating
Hadoop, Storm, Solr, MPP, Visualizations

Increase Revenue
Decrease Costs

60
Questions?
