NoSQL and SQL - Open Analytics Summit
Me
Allen Day
Principal Data Scientist @ MapR Human Genomics / Bioinformatics (PhD, UCLA School of Medicine)
@allenday
[email protected] [email protected]
You
I'm assuming that the typical attendee:
is a software developer
ETL
Model creation & clustering & indexing
Web crawling
Batch reporting
Lightweight OLTP
Classification & anomaly detection
Stream processing
Interactive reporting
SQL
Online
BI application support
Ad-hoc, interactive queries
Real-time responsiveness
Flexible
Handles rapid storage and schema evolution
Handles new analytics methods and functions
Other Approaches?
Yes. First, let's consider a SQL/NoSQL use case
Impala
low-latency
User profiles
Access logs
Transaction information
Access logs
Work with the MapReduce team to write custom code to generate the desired analyses
Access logs
Transaction information
Access logs
BigTable
HDFS
HBase
???
Hadoop MapReduce
Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
Drillbit (Coordinator)
SQL Query Parser
Query Planner
Drillbit (Executor)
Drillbit (Executor)
Drillbit (Executor)
Driver
Extensible
DSLs, UDFs
Custom operators (e.g. k-means clustering)
Well-documented data source & file format APIs
Questions
Open Source Lite
Lacks RDBMS support
Lacks NoSQL support beyond HBase
Early row materialization increases footprint and reduces performance
Limited file format support
Query results must fit in memory!
Rigid schema is required
No support for nested data
SQL-like (not SQL)
Many important features are coming soon.
Architectural foundation is constrained.
No community development.
Available
Logical plan syntax and interpreter
Reference interpreter
In progress
SQL interpreter
Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
Beta: Q3
Bottom Line: Apache Drill enables NoSQL and SQL to work side by side to tackle real-time Big Data needs
Me
Allen Day
Principal Data Scientist @ MapR
@allenday
[email protected] [email protected]
ADDITIONAL SLIDES
Drillbit
MicroStrategy
Drill ODBC Driver
Excel
SAP Crystal Reports
Driver
SQL Query Parser
Query Planner
Nested Data
Nested data is becoming prevalent
JSON, BSON, XML, Protocol Buffers, Avro, etc.
The data source may or may not be aware
MongoDB supports nested data natively
A single HBase value could be a JSON document (compound nested type)
JSON
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}
Google Dremel's innovation was efficient columnar storage and querying of nested data
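To make "columnar storage of nested data" concrete, here is a minimal Python sketch that splits the Homer record into one column per leaf path. It is a simplification for illustration only: Dremel additionally stores repetition and definition levels so records can be reassembled exactly, which this sketch does not attempt. The helper names `column_paths` and `to_columns` are hypothetical and are not part of Dremel or Drill.

```python
import json
from collections import defaultdict


def column_paths(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for one nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            yield from column_paths(child, child_prefix)
    elif isinstance(value, list):
        # Repeated fields share the path of their parent field.
        for child in value:
            yield from column_paths(child, prefix)
    else:
        yield prefix, value


def to_columns(records):
    """Group leaf values by path, tagging each with its record index."""
    columns = defaultdict(list)
    for row_id, record in enumerate(records):
        for path, leaf in column_paths(record):
            columns[path].append((row_id, leaf))
    return dict(columns)


homer = json.loads(
    '{"name": "Homer", "gender": "Male", "followers": 100,'
    ' "children": [{"name": "Bart"}, {"name": "Lisa"}]}'
)
columns = to_columns([homer])
for path, values in columns.items():
    print(path, values)
```

A query that only touches `children.name` can then read that one column and skip the rest of each record, which is the benefit columnar layouts aim for.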
Avro
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}
Schema is Optional
Many data sources do not have rigid schemas
Schemas change rapidly
Each record may have a different schema, may be sparse/wide
User can define the schema or let the system discover it automatically
System of record may already have schema information
No need to manage schema evolution
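As a rough illustration of letting the system discover the schema, the following Python sketch scans sparse records and collects the value types seen for each field. The function name `discover_schema` and the sample records are assumptions made for this example; this is not how Drill implements discovery.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical sparse records: no two share exactly the same fields.
records = [
    {"name": "Homer", "followers": 100},
    {"name": "Bart", "gender": "Male"},
    {"name": "Lisa", "followers": "many"},
]

schema = discover_schema(records)
for field, types in sorted(schema.items()):
    print(field, sorted(types))
```

Fields that appear with more than one type (here `followers`) are exactly the cases where a rigid, up-front schema would have rejected the data.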
Row Key "com.cnn.www"
CF contents
contents:html = "<html>"
CF anchor
anchor:my.look.ca = "CNN.com"
anchor:cnnsi.com = "CNN"
Row Key "com.foxnews.www"
CF contents
contents:html = "<html>"
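The row key / column family / qualifier layout above can be modelled as nested dictionaries. The Python sketch below is only an in-memory analogy for the data model and does not use the HBase client API; `get_cell` and `scan_qualifiers` are hypothetical helpers.

```python
# row key -> column family -> qualifier -> value
table = {
    "com.cnn.www": {
        "contents": {"html": "<html>"},
        "anchor": {"my.look.ca": "CNN.com", "cnnsi.com": "CNN"},
    },
    "com.foxnews.www": {
        "contents": {"html": "<html>"},
    },
}


def get_cell(table, row_key, column):
    """Look up 'family:qualifier' for a row; None if the cell is absent."""
    family, _, qualifier = column.partition(":")
    return table.get(row_key, {}).get(family, {}).get(qualifier)


def scan_qualifiers(table, family):
    """List every qualifier used in a column family across all rows."""
    return sorted({q for row in table.values() for q in row.get(family, {})})


print(get_cell(table, "com.cnn.www", "anchor:cnnsi.com"))
print(get_cell(table, "com.foxnews.www", "anchor:cnnsi.com"))
print(scan_qualifiers(table, "anchor"))
```

The second row has no `anchor` family at all, which shows why rows in such a store can be sparse and why the set of columns is only known by looking at the data.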
Query languages
SQL:2003 is the primary language
Implement a custom Parser to support a Domain Specific Language
UDFs
Optimizers
Drill will have a cost-based optimizer
Clear surrounding APIs support easy optimizer exploration
Operators
Custom operators can be implemented (e.g. k-Means clustering)
Operator push-down to data source (RDBMS)
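For readers unfamiliar with the k-means example, here is a minimal pure-Python sketch of the logic such a custom operator might wrap (Lloyd's algorithm with caller-supplied starting centroids). It does not use or imitate any Drill operator API; the function name `kmeans`, the sample points, and the fixed iteration count are assumptions for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Run Lloyd's algorithm from the given centroids; return final centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum(
                    (a - b) ** 2 for a, b in zip(point, centroids[i])
                ),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster
            else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
result = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(result)
```

In a query engine, the assignment and update steps would run over batches of rows produced by upstream operators rather than over an in-memory list.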