Real-Time Distributed and Reactive Systems
with Apache Kafka and Apache Accumulo
Joe Stein
• Developer, Architect & Technologist
• Founder & Principal Consultant => Big Data Open Source Security LLC - http://stealth.ly
Big Data Open Source Security LLC provides professional services and product solutions for the collection, storage,
transfer, real-time analytics, batch processing and reporting for complex data streams, data sets and distributed
systems. BDOSS is all about the "glue" and helping companies to not only figure out what Big Data Infrastructure
Components to use but also how to change their existing (or build new) systems to work with them.
• CEO => Elodina, Inc.
Expanding BDOSS from just consulting, Elodina is an ISV & SaaS provider of stream solutions & open source software.
Elodina helps make data streams actionable.
• Apache Kafka Committer & PMC member
• Blog & Podcast - http://allthingshadoop.com
• Twitter @allthingshadoop
Overview
● Real-time distributed reactive systems
● Quick Intro to Apache Kafka
● Quick Intro to Apache Mesos
● Kafka on Mesos
● Accumulo & HDFS on Mesos
● Real-time distributed reactive systems
● Bringing it all together with Accumulo
Real-Time Distributed and Reactive Systems
A distributed system for asynchronous stream processing with
non-blocking back pressure where complex event processing
systems can influence the response without coupling the
business logic of processing. The response can be calculated
by parallel operations with concurrent orthogonal processing
engines computing their influence towards the final result.
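The definition above can be sketched in a few lines. This is a minimal illustration, not the architecture from the talk: the engine names (fraud_engine, loyalty_engine), the event fields and the scoring rules are all hypothetical. A bounded asyncio queue stands in for non-blocking back pressure, and independent engines each compute their own influence on the final response.

```python
import asyncio

# Hypothetical engines: each inspects an event on its own and returns
# its "influence" on the final response. Neither knows about the other.
async def fraud_engine(event: dict) -> float:
    await asyncio.sleep(0)  # stand-in for non-blocking I/O
    return -1.0 if event["amount"] > 1000 else 0.0

async def loyalty_engine(event: dict) -> float:
    await asyncio.sleep(0)
    return 0.5 if event["customer"] == "returning" else 0.0

ENGINES = [fraud_engine, loyalty_engine]

async def producer(queue: asyncio.Queue, events: list) -> None:
    for event in events:
        # put() suspends this coroutine (without blocking the thread)
        # while the bounded queue is full: that is the back pressure.
        await queue.put(event)
    await queue.put(None)  # sentinel marking the end of the stream

async def processor(queue: asyncio.Queue) -> list:
    responses = []
    while (event := await queue.get()) is not None:
        # Run every engine concurrently and combine their influences.
        influences = await asyncio.gather(*(engine(event) for engine in ENGINES))
        responses.append((event["id"], sum(influences)))
    return responses

async def run(events: list) -> list:
    queue = asyncio.Queue(maxsize=2)  # small bound so back pressure kicks in
    _, responses = await asyncio.gather(producer(queue, events), processor(queue))
    return responses

EVENTS = [
    {"id": 1, "amount": 50, "customer": "returning"},
    {"id": 2, "amount": 5000, "customer": "new"},
    {"id": 3, "amount": 2000, "customer": "returning"},
]
RESULTS = asyncio.run(run(EVENTS))
```

Adding another engine only means appending to ENGINES; the producer and the combining step stay untouched, which is the decoupling the definition asks for.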
Real-Time Distributed and Reactive Systems
http://kafka.apache.org
Apache Kafka
• Apache Kafka
o http://kafka.apache.org
• Apache Kafka Source Code
o https://github.com/apache/kafka
• Documentation
o http://kafka.apache.org/documentation.html
• Wiki
o https://cwiki.apache.org/confluence/display/KAFKA/Index
Producers, Consumers, Brokers
• Producers - ** push **
o Batching
o Compression
o Sync (Ack), Async (auto batch)
o Replication
o Sequential writes, guaranteed ordering within each partition
• Consumers - ** pull **
o No state held by broker
o Consumers control reading from the stream
• Zero Copy for producers and consumers to and from the broker
http://kafka.apache.org/documentation.html#maximizingefficiency
• Messages stay on disk after they are consumed; they are removed when their TTL expires or by log compaction
https://kafka.apache.org/documentation.html#compaction
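The push/pull split above can be modelled with a toy in-memory log. This is a sketch for intuition only, not the Kafka API: the Broker and Consumer classes, the byte-sum partitioner and the compact helper are all hypothetical. It shows per-partition ordering, consumers that keep their own offsets, messages that remain after being read, and compaction keeping the latest record per key.

```python
from collections import defaultdict

class Broker:
    """Toy append-only, partitioned log (illustrative, not Kafka itself)."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> tuple:
        # Same key always lands in the same partition, so order per key holds.
        # A byte sum stands in for a real partitioner hash.
        partition = sum(key.encode()) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1  # (partition, offset)

    def fetch(self, partition: int, offset: int, max_messages: int = 10) -> list:
        # The broker only serves a slice; it does not remember who read what.
        return self.partitions[partition][offset:offset + max_messages]

class Consumer:
    """Pulls from the broker and tracks its own read position."""

    def __init__(self, broker: Broker):
        self.broker = broker
        self.offsets = defaultdict(int)

    def poll(self, partition: int) -> list:
        batch = self.broker.fetch(partition, self.offsets[partition])
        self.offsets[partition] += len(batch)
        return batch

def compact(log: list) -> list:
    """Keep only the most recent record for each key, in log order."""
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [record for i, record in enumerate(log) if last_index[record[0]] == i]

broker = Broker(num_partitions=2)
for key, value in [("a", "1"), ("b", "2"), ("a", "3")]:
    broker.append(key, value)

first = Consumer(broker)
second = Consumer(broker)
first_read = first.poll(1)    # reads both "a" records, in write order
second_read = second.poll(1)  # an independent consumer sees the same records
```

Because the read position lives in the consumer, a second consumer (or the same one after resetting its offset) can replay the partition from the start.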
Kafka decouples data-pipelines
Client Libraries
Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
• Python - Pure Python implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
• C - High performance C library with full protocol support
• C++ - Native C++ library with protocol support for Metadata, Produce, Fetch, and Offset.
• Go (aka golang) Pure Go implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
• Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy
compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
• Clojure - Clojure DSL for the Kafka API
• JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation
• stdin & stdout
Wire Protocol Developers Guide
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
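To give a feel for what a client library does underneath, here is a sketch of framing a version 0 metadata request by hand. It assumes the layout described in the protocol guide for the original request format: a 4-byte size prefix, then api key, api version, correlation id and client id, with big-endian integers and strings prefixed by a 16-bit length. Later protocol versions differ, so treat the guide as the reference. The function names and the "demo" client id are made up for the example.

```python
import struct

API_KEY_METADATA = 3  # metadata request, as listed in the protocol guide
API_VERSION = 0       # assumption: original (v0) request layout

def encode_string(text: str) -> bytes:
    """Protocol string: 16-bit big-endian length followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">h", len(data)) + data

def encode_metadata_request(correlation_id: int, client_id: str, topics: list) -> bytes:
    # Request header: api key (int16), api version (int16),
    # correlation id (int32), client id (string).
    header = struct.pack(">hhi", API_KEY_METADATA, API_VERSION, correlation_id)
    header += encode_string(client_id)
    # v0 body: an array of topic names (int32 count, then each string).
    body = struct.pack(">i", len(topics)) + b"".join(encode_string(t) for t in topics)
    payload = header + body
    # Every request is prefixed with its size in bytes (int32).
    return struct.pack(">i", len(payload)) + payload

frame = encode_metadata_request(correlation_id=7, client_id="demo", topics=["events"])
```

The correlation id is echoed back in the response, which is how a client matches replies to requests on a shared connection.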
Really Quick Start (Scala)
1) Install Vagrant http://www.vagrantup.com/
2) Install Virtual Box https://www.virtualbox.org/
3) git clone https://github.com/stealthly/scala-kafka
4) cd scala-kafka
5) vagrant up
Zookeeper will be running on 192.168.86.5
BrokerOne will be running on 192.168.86.10
All the tests in ./src/test/scala/* should pass (which is also /vagrant/src/test/scala/* in the vm)
6) ./gradlew test
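If the tests fail, a quick connectivity check from the host can rule out networking problems. The sketch below uses the two addresses from the steps above; the ports are assumptions (2181 and 9092 are the usual ZooKeeper and Kafka defaults, but the Vagrant setup may use others), and the helper names are hypothetical.

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Addresses are from the quick start; ports are assumed defaults.
ENDPOINTS = {
    "zookeeper": ("192.168.86.5", 2181),
    "brokerOne": ("192.168.86.10", 9092),
}

def check_all(endpoints: dict = ENDPOINTS) -> dict:
    """Map each endpoint name to whether it accepts connections."""
    return {name: is_reachable(host, port) for name, (host, port) in endpoints.items()}
```

Calling check_all() after vagrant up should report True for both entries if the VMs are up and the assumed ports match the configuration.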
Really Quick Start (Go)
1) Install Vagrant http://www.vagrantup.com/
2) Install Virtual Box https://www.virtualbox.org/
3) git clone https://github.com/stealthly/go-kafka
4) cd go-kafka
5) vagrant up
6) vagrant ssh brokerOne
7) cd /vagrant
8) sudo ./test.sh
Apache Mesos
http://mesos.apache.org
Origins
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
http://static.usenix.org/event/nsdi11/tech/full_papers/Hindman_new.pdf
Google Borg - https://research.google.com/pubs/pub43438.html
Google Omega: flexible, scalable schedulers for large compute clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
Static Partition == Idle Resources
Operating System === Datacenter
Mesos => data center “kernel”
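The "Static Partition == Idle Resources" point can be made concrete with a little arithmetic. The demand figures below are invented for illustration: three workloads whose peaks fall at different hours. Statically partitioned clusters must each be sized for their own peak, while a shared pool only needs to cover the peak of the combined demand.

```python
# Hypothetical CPU demand (cores) per workload over four time slots.
DEMAND = {
    "kafka":    [8, 8, 6, 2],
    "accumulo": [2, 4, 8, 8],
    "batch":    [6, 2, 2, 6],
}

def static_capacity(demand: dict) -> int:
    """One dedicated cluster per workload, each sized for its own peak."""
    return sum(max(series) for series in demand.values())

def shared_capacity(demand: dict) -> int:
    """One shared pool sized for the peak of the combined demand."""
    return max(sum(slot) for slot in zip(*demand.values()))

def utilization(demand: dict, capacity: int) -> float:
    """Fraction of provisioned core-slots that are actually used."""
    slots = len(next(iter(demand.values())))
    used = sum(sum(series) for series in demand.values())
    return used / (capacity * slots)

static_cores = static_capacity(DEMAND)   # 8 + 8 + 6 = 22 cores
shared_cores = shared_capacity(DEMAND)   # busiest slot needs 16 cores
static_util = utilization(DEMAND, static_cores)
shared_util = utilization(DEMAND, shared_cores)
```

With these made-up numbers the shared pool needs 16 cores instead of 22, and the same work keeps it far busier than the partitioned clusters.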
Apache Mesos
● Scalability to 10,000s of nodes
● Fault-tolerant replicated master and slaves using ZooKeeper
● Support for Docker containers
● Native isolation between tasks with Linux Containers
● Multi-resource scheduling (memory, CPU, disk, and ports)
● Java, Python and C++ APIs for developing new parallel applications
● Web UI for viewing cluster state
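Multi-resource scheduling is easier to picture with a small model. The sketch below is not the Mesos API: Offer, Task and schedule are hypothetical names, and first-fit is just one simple placement policy. It illustrates a framework scheduler receiving resource offers from agents and accepting the ones that fit its pending tasks on both CPU and memory.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    agent: str
    cpus: float
    mem: int  # megabytes

@dataclass
class Task:
    name: str
    cpus: float
    mem: int  # megabytes

def schedule(offers: list, tasks: list) -> dict:
    """First-fit: place each task on the first offer with enough CPU and memory."""
    # Work on copies so the caller's offers are left unchanged.
    remaining = [Offer(o.agent, o.cpus, o.mem) for o in offers]
    placements = {}
    for task in tasks:
        for offer in remaining:
            if offer.cpus >= task.cpus and offer.mem >= task.mem:
                offer.cpus -= task.cpus
                offer.mem -= task.mem
                placements[task.name] = offer.agent
                break  # task placed; unplaced tasks are simply left out
    return placements

OFFERS = [Offer("agent-1", 4.0, 8192), Offer("agent-2", 2.0, 4096)]
TASKS = [
    Task("broker-0", 2.0, 4096),
    Task("broker-1", 2.0, 4096),
    Task("tserver-0", 2.0, 4096),
    Task("extra", 1.0, 1024),
]
PLACEMENTS = schedule(OFFERS, TASKS)
```

A task that fits no offer stays pending, which in a real framework would mean waiting for the next round of offers.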
Sample Frameworks
C++ - https://github.com/apache/mesos/tree/master/src/examples
Java - https://github.com/apache/mesos/tree/master/src/examples/java
Python - https://github.com/apache/mesos/tree/master/src/examples/python
Scala - https://github.com/mesosphere/scala-sbt-mesos-framework.g8
Go - https://github.com/mesosphere/mesos-go
Kafka on Mesos
● The Mesos Kafka framework https://github.com/mesos/kafka
○ Smart broker.id assignment.
○ Preservation of broker placement.
○ Ability to make configuration changes.
○ Rolling restarts.
○ Auto-scaling the cluster up and down.
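Two of the features listed above, broker.id assignment and rolling restarts, can be sketched as plain functions. This is not how the mesos/kafka framework implements them; the function names, the lowest-free-id policy and the restart and is_healthy callbacks are assumptions made for illustration.

```python
def assign_broker_ids(existing: set, count: int) -> list:
    """Pick the lowest non-negative ids that are not already in use."""
    ids = []
    candidate = 0
    while len(ids) < count:
        if candidate not in existing:
            ids.append(candidate)
        candidate += 1
    return ids

def rolling_restart(brokers: list, restart, is_healthy) -> tuple:
    """Restart brokers one at a time, stopping at the first that stays unhealthy.

    Returns (brokers restarted successfully, failed broker or None).
    """
    done = []
    for broker in brokers:
        restart(broker)
        if not is_healthy(broker):
            return done, broker  # leave the remaining brokers untouched
        done.append(broker)
    return done, None

new_ids = assign_broker_ids(existing={0, 2}, count=3)

restart_log = []
restarted, failed = rolling_restart(
    brokers=[0, 1, 2],
    restart=restart_log.append,           # stand-in for a real restart call
    is_healthy=lambda broker: broker != 1,  # pretend broker 1 fails to come back
)
```

Stopping at the first unhealthy broker keeps the rest of the cluster serving traffic while the failure is investigated.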
Accumulo on Mesos
No framework yet, but you can use Marathon, no problem!
Marathon https://github.com/mesosphere/marathon is a cluster-wide
init and control system for services in cgroups or Docker
containers, based on Apache Mesos
HDFS on Mesos https://github.com/mesosphere/hdfs (more on
this in a bit)
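As a rough idea of what running an Accumulo process under Marathon might involve, the sketch below builds an app definition as JSON. The field names (id, cmd, cpus, mem, instances) are standard Marathon app fields, but the app id, install path, command and sizing are hypothetical and were not part of the talk. Such a document would typically be submitted to Marathon's apps endpoint; check the Marathon documentation for the version in use.

```python
import json

def marathon_app(app_id: str, cmd: str, cpus: float, mem: int, instances: int) -> dict:
    """Build a minimal Marathon app definition as a plain dictionary."""
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": cpus,
        "mem": mem,            # megabytes
        "instances": instances,
    }

# Hypothetical values: the path and sizing depend entirely on the deployment.
tserver_app = marathon_app(
    app_id="/accumulo/tserver",
    cmd="/opt/accumulo/bin/accumulo tserver",
    cpus=2.0,
    mem=4096,
    instances=3,
)
payload = json.dumps(tserver_app, indent=2)
```

Scaling the tablet servers up or down would then be a matter of changing instances and resubmitting the definition.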
Real-Time Distributed and Reactive Systems
Where does Accumulo fit in?
● Iterators
○ Accumulo iterators are a real time processing framework with
“reduce like” functionality
● Multi HDFS Volume Support
○ Spin up HDFS clusters when they are needed
● Streaming Large Blobs
○ Post files in producers, process and respond to scans
● More!
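The "reduce like" behaviour of iterators can be illustrated with a small model. Real Accumulo iterators are Java classes that run inside the tablet servers; the Python generator below is only an analogy, with a hypothetical name and made-up data. It walks a key-sorted stream of entries and emits one summed value per key, which is roughly what a summing combiner does at scan time.

```python
from itertools import groupby

def summing_iterator(sorted_entries):
    """Yield one (key, total) pair per key from a key-sorted stream of (key, int)."""
    for key, group in groupby(sorted_entries, key=lambda entry: entry[0]):
        yield key, sum(value for _, value in group)

# Entries as they might be written over time; sorting by key mimics the
# sorted order in which a scan sees them.
ENTRIES = sorted([
    ("page:/home", 1),
    ("page:/docs", 1),
    ("page:/home", 1),
    ("page:/home", 3),
])
TOTALS = list(summing_iterator(ENTRIES))
```

Because the input is already sorted by key, the aggregation needs only one pass and constant memory per key, which is why this style of processing fits naturally into a scan.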
Real-Time Distributed and Reactive Systems
Questions?
/*******************************************
Joe Stein
CEO, Elodina, Inc
http://www.stealth.ly
Twitter: @allthingshadoop
********************************************/
