NoSQL and SQL - Open Analytics Summit

OANYC Summit

Uploaded by

James Moore

NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs

Allen Day MapR Technologies

Me
Allen Day
Principal Data Scientist @ MapR
Human Genomics / Bioinformatics (PhD, UCLA School of Medicine)

@allenday

[email protected] [email protected]

You
I'm assuming that the typical attendee:
is a software developer
is interested in and familiar with open source
is familiar with Hadoop and relational DBs
has heard of or has used some NoSQL technology

Big Data Workloads

Offline
ETL
Model creation & clustering & indexing
Web crawling
Batch reporting

Online
Lightweight OLTP
Classification & anomaly detection
Stream processing
Interactive reporting
SQL

What is NoSQL? Why use it?

Traditional storage (relational DBs) is unable to accommodate the increasing number and variety of observations
Culprits: sensors, event logs, electronic payments

Solution: stay responsive by relaxing ACID storage requirements
Denormalize (number)
Loosen schema (variety), loosen consistency

This is the essence of NoSQL
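
As a rough illustration of the "denormalize, loosen schema" idea on this slide, the sketch below folds a normalized pair of collections into self-contained documents. The data and the names `users`, `payments` and `denormalize` are hypothetical and chosen only for illustration.

```python
# Hypothetical data: names and fields are illustrative, not from the talk.
users = {1: {"name": "Jane"}, 2: {"name": "Raj"}}
payments = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 5.5},
    {"user_id": 2, "amount": 12.0, "currency": "EUR"},  # extra field: loose schema
]

def denormalize(users, payments):
    """Fold each user's payments into one self-contained document."""
    docs = {uid: {"name": u["name"], "payments": []} for uid, u in users.items()}
    for p in payments:
        # Copy everything except the foreign key; fields may vary per record.
        item = {k: v for k, v in p.items() if k != "user_id"}
        docs[p["user_id"]]["payments"].append(item)
    return docs

docs = denormalize(users, payments)
# One lookup now answers "what did user 1 pay?" without a join.
total_user_1 = sum(p["amount"] for p in docs[1]["payments"])
```

The trade-off is the one the slide names: reads get cheaper, while consistency across copies and a fixed schema are no longer enforced by the store.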

NoSQL Impact on Business Processes

Traditional business intelligence (BI) tech stack assumes relational DB storage
Company decisions depend on this (reports, charts)

Data collected in NoSQL stores aren't in the relational DB
Data volume/variety is still increasing
Tech and methods are still in flux

Decoupled data storage and decision support systems
BI can't access freshest, largest data sets
Very high opportunity cost to business

Ideal Solution Features

Scalable & Reliable (Hadoop FS, Map/Reduce, YARN)
Distributed replicated storage
Distributed parallel processing

BI application support (SQL interface)
Ad-hoc, interactive queries
Real-time responsiveness

Flexible (extensible for NoSQL, advanced analytics)
Handles rapid storage and schema evolution
Handles new analytics methods and functions

From Ideals to Possibilities

Migrate NoSQL data/processing to SQL
High cost to marshal NoSQL data to SQL storage
SQL systems lack advanced analytics capabilities

Migrate SQL data to NoSQL
Breaks compatibility for BI-dependent functions, e.g. financial reporting
Limited support for relational operations (joins), high latency
NoSQL tech is still in flux (continuity)

Other Approaches?
Yes. First let's consider a SQL/NoSQL use case

Interactive Queries & Hadoop

Impala (low-latency)

Example Problem: Marketing Campaign

Jane is an analyst at an e-commerce company
How does she figure out good targeting segments for the next marketing campaign?

She has some ideas and lots of data
Transaction information
User profiles
Access logs

Traditional System Solution 1: RDBMS

ETL the data from MongoDB and Hadoop into the RDBMS
MongoDB data must be flattened, schematized, filtered and aggregated
Hadoop data must be filtered and aggregated

Query the data using any SQL-based tool

(Diagram labels: User profiles, Transaction information, Access logs)
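
To make the "flattened, schematized" step concrete, here is a minimal sketch that splits one nested profile into relational rows and loads them into a database. The profile fields and table layout are assumptions, and Python's built-in `sqlite3` merely stands in for the RDBMS.

```python
import sqlite3

# Hypothetical MongoDB-style user profile; field names are illustrative.
profile = {"_id": 7, "name": "Jane", "address": {"city": "NYC"},
           "interests": ["shoes", "books"]}

def flatten_profile(doc):
    """Split one nested document into a parent row and child rows."""
    user_row = (doc["_id"], doc["name"], doc.get("address", {}).get("city"))
    interest_rows = [(doc["_id"], i) for i in doc.get("interests", [])]
    return user_row, interest_rows

conn = sqlite3.connect(":memory:")  # sqlite3 stands in for the RDBMS
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, city TEXT)")
conn.execute("CREATE TABLE interests (user_id INTEGER, interest TEXT)")

user_row, interest_rows = flatten_profile(profile)
conn.execute("INSERT INTO users VALUES (?, ?, ?)", user_row)
conn.executemany("INSERT INTO interests VALUES (?, ?)", interest_rows)

# Once loaded, ordinary SQL works on the flattened copy.
rows = conn.execute(
    "SELECT u.name, COUNT(*) FROM users u "
    "JOIN interests i ON i.user_id = u.id GROUP BY u.name"
).fetchall()
```

Every new field in the source documents means revisiting both the flattening code and the table definitions, which is the maintenance cost this approach carries.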

Traditional System Solution 2: Hadoop

ETL the data from Oracle and MongoDB into Hadoop
MongoDB data must be flattened and schematized

Work with the MapReduce team to write custom code to generate the desired analyses

(Diagram labels: User profiles, Access logs, Transaction information)
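
For a sense of what "custom code" means here, the following is a toy, single-process imitation of the map, shuffle and reduce phases, counting requests per user. It is not Hadoop code; the log format and function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical access-log lines: "<user_id> <url>"; the format is an assumption.
log_lines = ["u1 /shoes", "u2 /books", "u1 /cart", "u1 /shoes"]

def map_phase(line):
    """Emit (key, value) pairs: one count per request, keyed by user."""
    user_id, _url = line.split()
    yield user_id, 1

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

pairs = [kv for line in log_lines for kv in map_phase(line)]
views_per_user = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Even this one-line question ("requests per user") needs three functions, which is why each new analysis tends to go through a dedicated team.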

Traditional System Solution 3: Hive

ETL the data from Oracle and MongoDB into Hadoop
MongoDB data must be flattened and schematized

But HiveQL queries are slow and BI tool support is limited

(Diagram labels: User profiles, Access logs, Transaction information, Marshaling/Coding)

What Would Google Do?

                         Google      Open source
Distributed File System  GFS         HDFS
NoSQL                    BigTable    HBase
Interactive analysis     Dremel      ???
Batch processing         MapReduce   Hadoop MapReduce

Build Apache Drill to provide a true open source solution to interactive analysis of Big Data

Apache Drill Overview

Interactive analysis of Big Data using standard SQL

Fast
Low latency queries
Complement native interfaces and MapReduce/Hive/Pig

Open
Community driven open source project
Under Apache Software Foundation

Modern
Standard ANSI SQL:2003 (select/into)
Nested data support
Schema is optional
Supports RDBMS, Hadoop and NoSQL

Apache Drill: interactive queries (data analyst, reporting), 100 ms-20 min
MapReduce, Hive, Pig: data mining, modeling, large ETL, 20 min-20 hr

How Does It Work?

SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1

Drill Client
Tableau, MicroStrategy, Crystal Reports
Drill ODBC Driver

Drillbit (Coordinator)
SQL Query Parser
Query Planner

Drillbit (Executor)

How Does It Work?

Drillbits run on each node, designed to maximize data locality
Processing is done outside MapReduce paradigm (but possibly within YARN)
Queries can be fed to any Drillbit
Coordination, query planning, optimization, scheduling, and execution are distributed

SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
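
The flow described on this slide can be sketched as a toy model in which every node can act as coordinator and each node scans only its own data. The `Drillbit` class, its methods and the sample rows are hypothetical and do not reflect Drill's actual API.

```python
# Hypothetical model of the flow described above; class and method names are
# illustrative and are not Drill's actual API.
class Drillbit:
    def __init__(self, node, local_rows):
        self.node = node
        self.local_rows = local_rows  # data stored on this node
        self.peers = []

    def execute_fragment(self, predicate):
        """Executor role: scan only local data (data locality)."""
        return [row for row in self.local_rows if predicate(row)]

    def submit(self, predicate, limit):
        """Coordinator role: any Drillbit can scatter work and gather results.

        Fragments run sequentially here; a real engine would run them in parallel.
        """
        results = []
        for bit in [self] + self.peers:
            results.extend(bit.execute_fragment(predicate))
            if len(results) >= limit:
                break  # stop early once the LIMIT is satisfied
        return results[:limit]

bits = [Drillbit("node1", [{"user": "a", "amount": 5}]),
        Drillbit("node2", [{"user": "b", "amount": 50}]),
        Drillbit("node3", [{"user": "c", "amount": 70}])]
for bit in bits:
    bit.peers = [b for b in bits if b is not bit]

# The same query can be sent to any node.
big_spenders = bits[1].submit(lambda r: r["amount"] > 10, limit=10)
first_only = bits[0].submit(lambda r: r["amount"] > 10, limit=1)
```

Because no node is special, there is no single master to become a bottleneck, and a `LIMIT` lets the coordinator stop asking peers as soon as it has enough rows.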

Apache Drill: Key Features

Full ANSI SQL:2003 support
Use any SQL-based tool

Nested data support

Flattening is error-prone and often impossible

Schema-less data source support

Schema can change rapidly and may be record-specific

Extensible
DSLs, UDFs
Custom operators (e.g. k-means clustering)
Well-documented data source & file format APIs

How Does Impala Fit In?

Impala Strengths
Beta currently available
Easy install and setup on top of Cloudera
Faster than Hive on some queries
SQL-like query language

Questions
Open Source Lite
Lacks RDBMS support
Lacks NoSQL support beyond HBase
Early row materialization increases footprint and reduces performance
Limited file format support
Query results must fit in memory!
Rigid schema is required
No support for nested data
SQL-like (not SQL)

Many important features are coming soon. Architectural foundation is constrained. No community development.

Drill Status: Alpha Available July

Heavy active development by multiple organizations
Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho

Available
Logical plan syntax and interpreter
Reference interpreter

In progress
SQL interpreter
Storage engine implementations for Accumulo, Cassandra, HBase and various file formats

Significant community momentum
Over 200 people on the Drill mailing list
Over 200 members of the Bay Area Drill User Group
Drill meetups across the US and Europe

Beta: Q3

Why Apache Drill Will Be Successful

Resources
Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho

Community
Development done in the open
Active contributors from multiple companies
Rapidly growing

Architecture
Full SQL
New data support
Extensible APIs
Full Columnar Execution
Beyond Hadoop

Bottom Line: Apache Drill enables NoSQL and SQL to work side-by-side to tackle real-time Big Data needs

Me
Allen Day
Principal Data Scientist @ MapR

@allenday
[email protected] [email protected]

ADDITIONAL SLIDES

Full SQL (ANSI SQL:2003)

Drill supports SQL (ANSI SQL:2003 standard)
Correlated subqueries, analytic functions
SQL-like is not enough

Use any SQL-based tool with Apache Drill
Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL
Standard ODBC and JDBC drivers

Client: Tableau, MicroStrategy, Excel, SAP Crystal Reports
Drill ODBC Driver
Drillbit: SQL Query Parser, Query Planner
Drillbits: Drill Worker, Drill Worker
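
For readers unfamiliar with the term, a correlated subquery is one whose inner query refers to the current row of the outer query. The sketch below runs one against Python's built-in `sqlite3`, used here only as a convenient stand-in engine and not as Drill; the table and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, not Drill
conn.execute("CREATE TABLE transactions (user_id TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [("a", 10), ("a", 30), ("b", 5), ("b", 7), ("b", 100)])

# Correlated subquery: the inner query references the outer row (t.user_id),
# so each transaction is compared with that same user's average.
above_avg = conn.execute(
    "SELECT t.user_id, t.amount FROM transactions t "
    "WHERE t.amount > (SELECT AVG(x.amount) FROM transactions x "
    "                  WHERE x.user_id = t.user_id) "
    "ORDER BY t.user_id, t.amount"
).fetchall()
```

BI tools generate queries of this shape routinely, so an engine that accepts only a SQL-like subset may reject what the tool sends.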

Nested Data
Nested data is becoming prevalent
JSON, BSON, XML, Protocol Buffers, Avro, etc.

The data source may or may not be aware
MongoDB supports nested data natively
A single HBase value could be a JSON document (compound nested type)

Google Dremel's innovation was efficient columnar storage and querying of nested data

Flattening nested data is error-prone and often impossible
Think about repeated and optional fields at every level

Apache Drill supports nested data
Extensions to ANSI SQL:2003

JSON
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}

Avro
enum Gender { MALE, FEMALE }
record User {
  string name;
  Gender gender;
  long followers;
}
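
A small sketch of why flattening is error-prone with repeated and optional fields, using a record modeled on the Homer example from this slide. The second record ("Ned", with no children) and the `flatten` helper are hypothetical additions for illustration.

```python
# Records modeled on the JSON example above; "Ned" is an added hypothetical
# record with no "children" field (an optional, repeated field).
records = [
    {"name": "Homer", "followers": 100,
     "children": [{"name": "Bart"}, {"name": "Lisa"}]},
    {"name": "Ned", "followers": 40},
]

def flatten(record):
    """Naive flattening: one output row per child."""
    return [{"name": record["name"], "followers": record["followers"],
             "child": c["name"]} for c in record.get("children", [])]

rows = [row for r in records for row in flatten(r)]

true_total = sum(r["followers"] for r in records)   # computed on nested data
flat_total = sum(row["followers"] for row in rows)  # computed on flat rows
```

Here the flat rows count Homer's followers twice (once per child) and drop Ned entirely, so the two totals disagree. With several repeated fields at different levels, these effects compound.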

Schema is Optional
Many data sources do not have rigid schemas
Schemas change rapidly
Each record may have a different schema, may be sparse/wide

Apache Drill supports querying against unknown schemas
Query any HBase, Cassandra or MongoDB table

User can define the schema or let the system discover it automatically
System of record may already have schema information
No need to manage schema evolution

Row Key            CF contents               CF anchor
"com.cnn.www"      contents:html = "<html>"  anchor:my.look.ca = "CNN.com"
                                             anchor:cnnsi.com = "CNN"
"com.foxnews.www"  contents:html = "<html>"  anchor:en.wikipedia.org = "Fox News"
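
One way to picture "let the system discover it automatically" is to treat the schema as the union of columns observed so far. The sketch below does this over the two HBase-style rows from this slide; `discover_schema` and `select` are hypothetical helpers, not part of any real API.

```python
# Rows taken from the slide's example, keyed by row key; values are sparse maps
# from "family:qualifier" to cell value.
rows = {
    "com.cnn.www": {"contents:html": "<html>",
                    "anchor:my.look.ca": "CNN.com",
                    "anchor:cnnsi.com": "CNN"},
    "com.foxnews.www": {"contents:html": "<html>",
                        "anchor:en.wikipedia.org": "Fox News"},
}

def discover_schema(rows):
    """Union of column names seen so far, grouped by column family."""
    schema = {}
    for columns in rows.values():
        for qualified in columns:
            family, _, qualifier = qualified.partition(":")
            schema.setdefault(family, set()).add(qualifier)
    return schema

def select(rows, column):
    """Project one column; rows lacking it yield None instead of failing."""
    return {key: cols.get(column) for key, cols in rows.items()}

schema = discover_schema(rows)
cnnsi = select(rows, "anchor:cnnsi.com")
```

Nothing had to be declared up front: a row that introduces a new qualifier simply widens the discovered schema, and rows without a column return a null for it.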

Flexible and Extensible Architecture

Apache Drill is designed for extensibility
Well-documented APIs and interfaces

Data sources and file formats
Implement a custom scanner to support a new source/format

Query languages
SQL:2003 is the primary language
Implement a custom Parser to support a Domain Specific Language
UDFs

Optimizers
Drill will have a cost-based optimizer
Clear surrounding APIs support easy optimizer exploration

Operators
Custom operators can be implemented (e.g. k-Means clustering)
Operator push-down to data source (RDBMS)
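
The scanner and push-down ideas can be illustrated with a toy interface. Everything here (`ListScanner`, `run_filter`, the `supports_pushdown` flag) is a hypothetical sketch in Python and does not mirror Drill's real extension APIs.

```python
from typing import Callable, Iterable, Iterator, Optional

Row = dict
Predicate = Callable[[Row], bool]

class ListScanner:
    """Hypothetical scanner over an in-memory source; not Drill's real API."""
    def __init__(self, rows: Iterable[Row], supports_pushdown: bool):
        self.rows = list(rows)
        self.supports_pushdown = supports_pushdown
        self.rows_shipped = 0  # rows handed to the engine, for illustration

    def scan(self, pushed: Optional[Predicate] = None) -> Iterator[Row]:
        for row in self.rows:
            if pushed is None or pushed(row):
                self.rows_shipped += 1
                yield row

def run_filter(scanner: ListScanner, predicate: Predicate) -> list:
    """Push the filter to the source if it can take it, else filter after."""
    if scanner.supports_pushdown:
        return list(scanner.scan(pushed=predicate))
    return [row for row in scanner.scan() if predicate(row)]

data = [{"amount": n} for n in (5, 50, 500)]
is_large = lambda row: row["amount"] > 10

rdbms_like = ListScanner(data, supports_pushdown=True)
file_like = ListScanner(data, supports_pushdown=False)
same_answer = run_filter(rdbms_like, is_large) == run_filter(file_like, is_large)
```

Both sources give the same answer, but the push-down source ships fewer rows to the engine (`rows_shipped` is lower), which is the point of delegating operators to a capable source such as an RDBMS.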
