Streaming Ecosystem
Streaming Ecosystem
Reza Farivar
Capital One
[email protected]
Components of a streaming ecosystem
• Gather the data
• Funnel
• Distributed Queue
• Real-Time Processing
• Semi-Real-Time Processing
• Real-time OLAP
Step 1: Gather the Data
• Apache NiFi is a good distributed funnel
• Was made in NSA
• Over 8 years of development
• Open sourced in 2014 and picked up by HortonWorks
• Great visual UI to design a data flow
• Has many many processor types in the box
• But not very good for heavy weight distributed processing
• Same graph is executed on all the nodes
NiFi Components
• FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
• Processor
• Performs the work, can access FlowFiles
• Connection
• Links between processors
• Queues that can be dynamically prioritized
• Process Group
• Set of processors and their connections
• Receive data via input ports, send data via output ports
NiFi GUI
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processor & connections
NiFi Site-to-Site
• Site-to-site allows very easy pushing of data from one data center to
another
• Makes it a great choice for
distributed funnel
Step 2: Distributed Queue
• Pub-sub model
Producer publish(topic, msg) Consumer
subscribe
• Kafka a very poular
example Topic
1
Topic msg
2
Topic
3
Publish subscribe
system Consumer
Producer
msg
Kafka Architecture
• Distributed, high-throughput,
pub-sub messaging system
• Fast, Scalable, Durable Producer Producer