
DEV 3500 – Real-Time Processing with MapR Streams Slide Guide


Spring 2016 – Version 5.1.0

For use with the following courses:


DEV 3500 – Real-time Stream Processing with MapR
DEV 350 – MapR Streams Essentials
DEV 351 – Developing MapR Streams Applications
This Guide is protected under U.S. and international copyright laws, and is the exclusive property of
MapR Technologies, Inc.

© 2016, MapR Technologies, Inc. All rights reserved. All other trademarks cited here are the property of
their respective owners.

PROPRIETARY AND CONFIDENTIAL INFORMATION


©2016 MapR Technologies, Inc. All Rights Reserved.
®

Welcome to DEV 350 – MapR Streams Essentials. This course is designed to give
developers and administrators the basic concepts necessary to deploy MapR
Streams on a MapR Distribution.

1  
®

2  
®

By the end of this course, you will be able to:


• Summarize the motivation behind MapR Streams
• Define core components of MapR Streams
• Summarize the life of a message in MapR Streams
• Build a simple producer and consumer application

3  
®

Prior to taking this course, you should have:

A basic understanding of big data concepts,
a basic understanding of the MapR platform,
and a basic understanding of application development principles.

4  
®

In Lesson 1, we'll introduce the motivation behind MapR Streams and learn how to
apply MapR Streams to some common use cases.

5  
Welcome to MapR Streams Essentials, Lesson 1 – Introduction to MapR Streams.
This lesson describes why people use MapR Streams, what MapR Streams is, and
provides a brief overview of some MapR Streams use cases.

1  
®

2  
By the end of this lesson you will be able to:
Summarize the motivation behind MapR Streams.
Why do people use MapR Streams? What problems does it solve?

Explain what makes MapR Streams different

What is MapR Streams? How is it different from other messaging systems?

And apply MapR Streams to common use cases

How do people use MapR Streams? When do people use MapR Streams?

3  
Let's start by describing some features of MapR Streams and some of the problems it
can solve.

4  
Many big data sources are event-oriented. For example, stock prices, user activity,
and sensor data all trigger events.

5  
Today’s applications need to process this high-velocity, high-volume data in real-time
to gain insights.

6  
MapR Streams adds event streaming to the MapR platform, allowing events to be
ingested, moved, and processed as they happen, as part of a real-time data pipeline.

7  
The need to process massive amounts of data in real-time has grown rapidly in recent
years.
For example, social media users expect alerts when their friends contact them.

8  
Likewise, people want notifications from their bank when they are low on funds or at
risk of fraud.

9  
Smart cars, GPS software, and traffic apps all depend on geospatial data to give
up-to-date navigation information.

10  
Networked sensors that make up the Internet of Things or log files from application
metrics may communicate important information about mission-critical events on oil
rigs or factory lines.

11  
Retail websites may use clickstream data to provide real-time advertisements.

12  
All this data used to be stored as log files. Now, MapR Streams can be used to
transport this data as events, enabling real-time analytics.

13  
What if you want to analyze data as it arrives?
Big data is typically sent in batches to HDFS, and then analyzed by distributed
processing frameworks like MapReduce, Apache Hive, or Apache Pig.
Batch processing can give great insights into things that happened in the past, but
lacks the ability to answer the question of "what is happening right now?”

 
 
 
 
 
It is becoming important to process events as they arrive for real-time insights.
Analyzing data as it arrives requires several distributed applications to be linked
together in real-time.
In this example, MapR Streams helps provide real-time insights based on sensor
data, notifying the user to turn on the air conditioning!

 
 
 
 
 
What if you want to organize data as it arrives?  
In a large data warehouse, there may be many different sources of data, and many
different applications which want to process, share and analyze this data. Integrating
data sources and applications by sharing files can quickly become unorganized,
complicated, and tightly coupled.

 
 
 
 
 
Topics organize events into categories. Topics are logical collections of messages
that are managed by MapR Streams. Topics decouple producers, which are the
sources of data, from consumers, which are the applications that process, analyze,
and share data.

 
 
 
 
 
Producers publish to a topic. In this example, we see a thermostat, an oil rig, and an
alert application all publishing event data to their respective topics.

 
 
 
 
 
Consumers subscribe to the topics that interest them.

Different consumer applications monitor the topics.

 
 
 
 
 
What if you need to process a high volume of data as it arrives?
With the emergence of the Internet of Things, there are more and more sensors,
clicks, log files, and other data sources that are sending billions of data points that
need to be processed as they arrive.

Traditional message queues cannot handle the volume of data for the Internet of
Things, with millions of sources, hundreds of destinations, and the demand for
real-time analytics.

A new approach is needed that can process messages from all these sources.

23  
With MapR Streams, topics are partitioned for throughput and scalability. Partitions
make topics scalable by spreading the load for a topic across multiple servers.

 
 
 
 
 
Producers are load balanced between partitions.

 
 
 
 
 
Consumers can be grouped to read in parallel from multiple partitions within a topic
for faster performance.

 
 
 
 
 
What if you need to recover messages in case of server failure?
As more and more people depend on their data for mission-critical tasks, it is
important to have fault-tolerance built into your architecture.

If there were no replicated copy of “Topic: Warnings – Partition 2" and Server 2 went
down, then we would lose all of the messages in Partition 2.

 
 
 
 
 
With MapR Streams, each partition and all of its messages are replicated for fault
tolerance.

 
 
 
 
 
The server owning the primary partition for the topic replicates the message to replica
containers.
Producers and clients send to and read from the primary partition, shown here in
orange. Replicas are used for fault tolerance.

 
 
 
 
 
Here, Server 2, owning the primary partition for "Topic Warnings – Partition 2," went
down.

 
 
 
 
 
A replica partition will become the new primary for the failed partition. In this case, the
"Partition 2: Warning Replica" on Server 1 became the primary, now shown in
orange. Producers and consumers will automatically be re-routed to the new primary.

MapR Streams is 100% reliable, synchronously replicating all writes to at least three
nodes while maintaining high performance of up to a billion messages per second at
millisecond-level delivery times.

 
 
 
 
 
What if you need real-time access to live data distributed across multiple clusters and
multiple data centers?
What if you would like to have consumers and producers in different locations
worldwide? What if you would like high availability if a cluster goes down?
If there were no replicated copy of your topics outside of the cluster, and the whole
cluster went down, then message servers would not be available for producers and
consumers.
 

 
 
 
 
 
Topics are grouped into streams which can be asynchronously replicated between
MapR clusters, with publishers and listeners existing anywhere, enabling truly global
applications.
The streams replication feature gives your users real-time access to live data
distributed across multiple clusters and multiple data centers around the world.

 
 
 
 
 
With streams replication, you can create a backup copy of a stream for producers and
consumers to fail over to if the original stream goes offline.
This feature significantly reduces the risk of data loss should a site-wide disaster
occur, making it essential for your disaster recovery strategy.

 
 
 
 
 
®

37  
Replication :: fault tolerance

Partitioning :: scalability

Topics :: organization

38  
®

39  
Next, we'll learn what MapR Streams does.

40  
®

MapR Streams provides global, Internet-of-Things-scale publish-subscribe event
streaming integrated into the MapR platform

41  
®

–alongside MapR’s distributed file system (MapR-FS)

42  
®

and NoSQL database (MapR-DB)–

43  
®

creating the industry’s first Converged Data Platform.

The MapR Converged Data Platform integrates Apache Hadoop, Apache Spark, real-
time database capabilities, global event streaming, and big data enterprise storage,
for developing and running innovative data applications.

44  
MapR Streams brings integrated publish/subscribe messaging to the MapR platform.

Producer applications can publish messages to topics that are managed by MapR
Streams.
Consumer applications can then read those messages at their own pace. All
messages published to MapR Streams are persisted, allowing future consumers to
“catch-up” on processing, and analytics applications to process historical data.
In addition to reliably delivering messages to applications within a single data center,
MapR Streams can continuously replicate data between multiple clusters, delivering
messages globally.
®

MapR Streams combines massively scalable messaging with security, reliability, fault
tolerance and global real-time delivery. MapR Streams implements the Apache Kafka
Producer/Consumer APIs, which many developers may already be familiar with.
Topics in MapR Streams are grouped into streams, to which administrators can apply
security, retention, and replication policies. Combined with MapR-FS and MapR-DB in
the MapR platform, streams allow organizations to create a centralized, secure data
lake that unifies files, database tables, and message topics.

48  
MapR Streams is the only messaging platform built on top of an enterprise-class big
data platform.

MapR Streams can reliably handle billions of messages from millions of producers,
and sort them into thousands of topics. It reliably delivers these messages to millions
of consumers, globally. This massive, global scale is built to handle the growing
Internet of Things.

Topics in MapR Streams are grouped into streams, which administrators can apply
security, retention, and replication policies to. MapR Streams can continuously
replicate data between multiple clusters, providing fault-tolerance.

Like other MapR services, MapR Streams has a distributed, scale-out design.
Combined with MapR-FS and MapR-DB in the MapR platform, streams allow users to
create a centralized data lake that can be processed and analyzed in real time or in
bulk batches.
®

50  
Answer: A, B, C (D is wrong)

51  
®

52  
Finally, we'll learn how to apply MapR Streams to some common use cases.

53  
Next, we'll take a look at a few enterprise-level use cases: health care, advertising,
credit cards, and the Internet of Things.

54  
Let's say we need to build a flexible, secure healthcare database.

This presents many challenges. First, there will be many different data models in the
healthcare industry, ranging from insurance claims to test results. Second, there will
be many issues concerning security and privacy of this data.
MapR Streams can meet each of these challenges.

With MapR Streams, the data lineage portion of the compliance challenge is solved
because the stream becomes a system of record by being an infinite, immutable log
of each data change.

To illustrate the challenge of different data formats and representations, a patient
record may be consumed in different ways – a document representation, a graph
representation, or search – by different users, such as pharmaceutical companies,
hospitals, clinics, and physicians.

By streaming data changes in real-time to the document, graph, and search
databases, users always have the most up-to-date view of data in the most
appropriate format. Further, by implementing this service on the MapR Converged
Data Platform, Liaison is able to secure all of the data components together, avoiding
data and security silos that alternate solutions require.
Let's say we need to monitor oil rig sensor data and create real-time alerts.

This objective also presents many challenges. First, the data will be coming from
many different sources, spread all over the globe. Second, it is crucial that alerts are
issued in real-time. If the pressure or temperature gets too high or too low at an oil rig,
it can cause serious problems. Third, there is a need to audit data. It is not enough to
monitor data in real-time; we will want to create reports analyzing past trends as well.

To predict potential equipment failures before they can occur, the solution must be
able to:

• Store and process real-time and historical sensor data


• Ingest and analyze real-time sensor and historical data alongside maintenance
data generated from industrial equipment
• Proactively learn patterns of normal and errant behavior across various types of
equipment to provide warnings of minor degradation
Again, MapR Streams can meet each of these challenges. First, MapR Streams can
handle global streaming of data at an Internet of Things scale. Second, MapR
Streams can directly send data to applications like Spark for analysis. Finally, MapR
Streams replicates streams, creating rewindable and auditable data.

The solution is able to capture massive amounts of data in a cost-effective and
scalable way, easily ingesting both structured and unstructured data. The centralized
data center supports applications with a predictable scaling model. The solution
provides a powerful mechanism to rapidly analyze all variable inputs against existing
failure rate models and make alerts before equipment is likely to fail.
Let's say we want to manage a global online ad market campaign, using real-time
analytics.

Creating targeted ads presents several challenges. Information about each individual
user, including their recent and long-term browsing trends, can help inform the
algorithms which will update which ads are shown to them. This requires a complex
data pipeline, which can maintain consistent streams of data for each unique user, at
a global scale.
MapR Streams' ability to handle global message streams and real-time analytics with
a streaming ETL makes it suitable for this task.

By using MapR Streams, you can provide real-time global insights for advertising.

The solution has sensors triggering off events. These events flow into the streaming
system.

Software is listening to these streams and reacts in real time. For example, if a
product is moving quickly off the shelves when there are hundreds of people in the
store, it can issue an alert so that product gets restocked as fast as possible.

The solution gets all of the data from stores to global headquarters in order to do
more with it. The data is streamed, replicated between data centers, providing the
ability to fix and optimize any issues with supply chain or shipping from global
logistics management. In addition, the application provides the ability to build
solutions around targeted advertising for customers.

This use case fits nearly any business. It works in manufacturing, retail, or any
company that needs to deliver more real-time solutions to its customers.
Let's say we want to build a reliable, real-time service bus for financial transactions.

Like our healthcare data, there are major concerns about the security of this
information. The streams must also be reliable for accurate, real-time updates to
financial transactions, and need to be queried to detect fraud.
Again, MapR Streams can handle this task. MapR Streams reliably delivers
information, and can have security settings enabled.
MapR Streams enables a faster closed-loop process between operational
applications and analytics by reliably running both operational and analytical
applications on one platform.

In this example, we see the clickstream data from online shopping is being used
simultaneously by several different applications. The browsing behavior from the
online shoppers is the primary data source. This data is analyzed by several different
applications, including fraud models and recommendation tools, all on a single
cluster.

63  
®

64  
®

INSTRUCTORS: Some ideas:

Manufacturing & Internet of Things: Real-time, adaptive analysis of machine data
(e.g., sensors, control parameters, alarms, notifications, maintenance logs, and
imaging results) from industrial systems (e.g., equipment, plant, fleet) for visibility into
asset health, proactive maintenance planning, and optimized operations.

Marketing & Sales: Analysis of customer engagement and conversion, powering real-
time recommendations while customers are still on the site or in the store

Customer Service & Billing: Analysis of contact center interactions, enabling accurate
remote trouble shooting before expensive field technicians are dispatched

Information Technology: Log processing to detect unusual events occurring in
stream(s) of data, so that IT can take remedial action before service quality degrades

65  
Congratulations! You have completed Lesson 1: Introduction to MapR Streams. In
Lesson 2, you will learn more about MapR Streams architecture.

66  
Welcome to MapR Streams Essentials, Lesson 2 – MapR Streams Architecture. This
lesson takes an in-depth look at the features and components of MapR Streams, and
how it fits in the big data architecture.

1  
®

2  
By the end of this lesson, you will be able to:

define core components of MapR Streams

summarize the life of a message in MapR Streams

and explain how MapR Streams fits in the complete data architecture

3  
Let's start with defining the core components of MapR Streams.

4  
®

Messages are key/value pairs, where keys are optional. We will describe the use of
keys later. Values contain the data payload, which can be text or bytes.

In this example, we see a single data point, in nested JSON format, from an oil rig.
This includes sensor readings such as pressure, temperature, and oil percentage.

5  
Producers are data-generating applications that you create, such as sensors in an oil
rig.
Producers publish or send messages to topics.
Consumers are also applications that you create, such as analytics applications using
Apache Spark.
Consumers request unread messages from topics that they are subscribed to.
Topics are logical collections of messages that are managed by MapR Streams.

8  
Topics organize events: producers publish to a relevant topic and consumers
subscribe to the topics of interest to them.
For example, you might have an application that monitors the logs for an oil rig. Your
monitoring application could send oil well sensor data to topics named Pressure,
Temperature, and Warnings.

9  
One consumer application might subscribe to the Pressure topic to analyze the oil
pressure, issuing an alert if the pressure becomes dangerously high or low. A different
consumer might subscribe to the Temperature topic to generate a report of
temperature trends over time.

10  
Topics are partitioned for throughput and scalability. Partitions, which exist within
topics, are parallel, ordered, sequences of messages that are continually appended
to.

11  
Consumer applications can read in parallel from multiple partitions within a topic.
This is faster and more scalable than reading from one partition per topic.

12  
When a message is published to a partition, it is assigned an offset, which is the
message ID. Offsets are numbered in increasing order, and are local to each partition.
This way, the order of messages is preserved within individual partitions, but not
across partitions.

13  
A stream is a collection of topics that you can manage together. In this example, the
Pressure, Temperature, and Warnings topics are all collected together in one stream.
There are a number of different ways to manage a stream. For example, you can
apply security settings, time-to-live, or the default number of partitions for all topics
within a stream.
You can replicate streams to other streams in different clusters. For example, you can
create a backup copy of a stream for producers and consumers to fail over to if the
original stream goes offline.

14  
The server manages streams, topics, and partitions. The server also handles
requests from the producer client library and the consumer client library.

15  
Producer applications produce messages and send them to the producer client
library. This client library buffers incoming messages and then sends them in batches
to the server. The server then publishes the messages to the topics that the
producers have specified.

16  
Consumers request unread messages from topics that they are interested in. A
consumer client library sends unread messages. Then, consumer applications extract
data from these messages.

17  
®

18  
Answer: A

19  
Let's review.
Producer applications publish messages to topics using industry standard Kafka APIs.
Topics organize events into categories. Topics are partitioned for scalability and
replicated for reliability.
Consumers can subscribe to a topic using industry standard Kafka APIs.
MapR Streams handles requests for messages from consumers for topics that they
are subscribed to. Global delivery happens in real time.
Finally, consumer applications can read those messages at their own pace. All
messages published to MapR Streams are persisted, allowing future consumers to
“catch-up” on processing, and analytics applications to process historical data.
®

25  
Next, we'll learn about the life of a message in MapR Streams.

26  
To show you how these concepts fit together, we will go through an example of the
flow of messages from a producer to a consumer.
Imagine that you are using MapR Streams as part of a system to monitor oil wells
globally.
Your producers include sensors in the oil rigs, weather stations, and an application
which generates warning messages.
Your consumers are various analytical and reporting tools.
In a volume in a MapR cluster, you create the stream /path/oilpump_metrics. In that
stream, you create the topics pressure, temperature, and warnings.
Of all of the sensors (producers) that your system uses to monitor oil wells and
related data, let's choose a sensor that is in Venezuela. We'll follow a message that is
generated by this sensor and published in the Pressure topic. When you created this
topic, you also created several partitions within it to help spread the load among the
different nodes in your MapR cluster and to help improve the performance of your
consumers. For simplicity in this example, we'll assume that each topic has only one
partition.

27  
How are messages sent?
The sensor producer application sends messages to the Pressure topic using the
MapR Streams producer client library.
The client library buffers incoming messages.

28  
When the client has a large enough number of messages buffered, or after an interval
of time has expired, the client batches and sends the messages in the buffer. The
messages in the batch are published in the Pressure topic’s partitions on a MapR
Streams server.

29  
When the messages are published to a topic partition, they are appended in order.
Each message is given an offset, which is a sequentially numbered ID. Older messages
will have lower-numbered offsets, while the newest messages will have the highest
numbers. The numbering of offsets is local to each partition, so the order of
messages is preserved within individual partitions, but not across partitions.
 

30  
Each partition and all of its messages are replicated for fault tolerance.
The server owning the primary partition for the topic assigns the offset ID to
messages and replicates the message to replica containers within the MapR cluster.
Replication rules are controlled at the volume level within the MapR cluster. By
default, partitions are replicated three times.
 
 

31  
The server then acknowledges receiving the batch of messages and sends the offset
IDs that it assigned to them back to the producer.
 
 

32  
There are three different ways of choosing a partition for a message:
First, if the producer specifies a partition ID, the MapR Streams server publishes the
message to the partition specified.
In this diagram, each producer specifies a partition to publish to within the same topic.
For example, Producer A specifies that it should publish to Partition 4.

33  
The second way to specify a partition is with a key.
If the producer provides a key, the MapR Streams server hashes the key and sends
the message to the partition that corresponds to the hash.
In this example, Producer A specifies a key, and its messages go to Partition 3.

34  
Last, if neither a partition ID nor a key is specified, the MapR Streams server sends
messages in a sticky round-robin fashion. For any given topic, the server randomly
chooses an initial partition. For example, suppose that for the Warnings topic, the
server chooses the partition with the ID 1. The server first sends messages to
partition 1, then to partition 2, and so on.

35  
®

36  
Answer: B

37  
A consumer application that correlates oil well pressure with weather conditions is
subscribed to the Pressure topic.
When the consumer application is ready for more data, it issues a request, using the
MapR Streams client library, to poll the Pressure topic for messages that the
application has not yet read.

Next, the client library asks if there are any messages more recent than what the
consumer application has already read.

38  
Once the request for unread messages is received, the primary partition of the topic
returns the unread messages to the client.

The original messages remain on the partition and are available to other consumers.

The client library passes the messages to the consumer application, which can then
extract and process the data.

If more unread messages remain in the partition, the process repeats with the client
library requesting messages.

39  
Since messages remain in the partition even after delivery to a consumer, when are
they deleted?
When you create a stream, you can set the time-to-live for messages.
Once a message has been in the partition for the specified time-to-live, it is expired.
An automatic process reclaims the disk space that the expired messages are using.
The time-to-live can be as long as you need it to be. Messages will not expire if the
time-to-live is zero, and will remain in the partition indefinitely.
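For illustration only (this command is not shown in the course notes), the time-to-live
can be supplied when the stream is created; the path is hypothetical and the value is
given in seconds (one week here), assuming the maprcli -ttl option:

maprcli stream create -path /path/oilpump_metrics -ttl 604800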

40  
®

You don’t have to worry about partitions getting too big to store on a single server;
partitions will be re-allocated every 2GB to balance storage. MapR Streams can
intelligently move partitions around in a cluster in order to spread the data out,
allowing topics to be infinite, persistent storage.

41  
®

42  
Answer: C

43  
Messages can be read in parallel, for fast and scalable processing.
In order to take advantage of parallelism when reading, you can group consumers
together by setting the same value for the group.id configuration parameter when you
create a consumer.
The partitions in each topic are assigned dynamically to the consumers in a consumer
group in round-robin fashion.
For example, suppose that there are three consumers in a group called Oil Wells.
Each consumer is subscribed to the same topic, Warnings. This Warnings topic has
five partitions.
MapR Streams assigns each partition to a consumer, using the round robin pattern.
The first partition will be assigned to Consumer A, the second to Consumer B, and the
third to Consumer C. Then, the assignment pattern starts over.
This way, if you have a group of similar consumers all subscribed to the same topic,
you can distribute message processing across your consumers for more efficient
processing.
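As a minimal sketch of this (assuming the Java Kafka consumer API that MapR Streams
implements, a hypothetical group name, and a props object that already holds the
deserializer settings), every consumer in the group sets the same group.id before it is
constructed:

props.put("group.id", "oil-wells");   // same value for every consumer in the group
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);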

44  
If one of the consumers goes offline, the partitions are reassigned dynamically among
the remaining consumers in the group.
In this example, Consumer B has gone offline. The partitions that had previously been
assigned to B must be reassigned. Partition 2 gets reassigned to Consumer C, while
partition 5 gets reassigned to Consumer A.
If the offline consumer comes back online or a different consumer is added to the
group, again the partitions are redistributed among the consumers in the group.
This parallelism and dynamic reassignment is possible only if no consumer in the
group subscribes to an individual partition.

45  
®

The MapR Streams server keeps track of the messages that consumers have read
with cursors.
There is one cursor per partition per consumer group, and there are two types of
cursors.
The first type of cursor is the read cursor. This refers to the offset ID of the most
recent message that MapR Streams has sent to a consumer from a partition.
In this example, the read cursor has the offset ID of 3.

46  
®

The second type of cursor is the committed cursor.


Consumers that are part of a consumer group can save the current position of their
read cursor. Consumers can do this either automatically or manually. The saved
cursor is called a committed cursor because it indicates that the consumer has
processed all the messages in a partition up to and including the one with this offset.
In this example, the consumer has set the committed cursor to the current
position of its read cursor, with the offset ID of 3.
Cursors are useful in case of failover. If a consumer fails and MapR Streams reassigns the
consumer’s partitions to other consumers in a group, those consumers can start
reading from the next offset after the committed cursor in each of those partitions.
If a stream is replicated to another cluster for backup, committed cursors are also
replicated. If a cluster fails, consumers that are redirected to the standby copy of a
stream can start reading from the next offset after committed cursors.

47  
®

You can replicate streams to other MapR clusters worldwide or to other streams
within a MapR cluster.
There are many different scenarios in which replicating streams can be useful.
For example, suppose that your oil drilling company has a refinery in Venezuela, and
sensors in the pipeline equipment track different metrics.
With replication, the factory could create a stream in the Venezuela cluster and
maintain a backup of the stream in the Venezuela_HA cluster.
This type of replication is called master-slave replication. In this example, Venezuela
is the master, and Venezuela_HA is the slave.

48  
®

Suppose that your company's headquarters are in Houston, and you want data
analysts there to be able to analyze all data company-wide.

49  
®

You have two metrics streams, one in each oil well: Venezuela and Mexico.

50  
®

You can replicate each of these streams to a metrics stream in the Houston cluster.
In this scenario, the replica is the metrics stream in the Houston cluster. This replica
has two upstream sources: the metrics streams that are replicated from the two
factories.
This type of replication, called many-to-one replication, requires that the topics in
each stream have unique names, so that message offsets do not conflict. However,
the streams may have the same name.
In this example, we give each topic within the metrics stream a unique identifier, so
that the data from Mexico does not conflict with the data from Venezuela.

51  
®

Another type of replication that can be useful is multi-master replication. You can
use it when you need two streams to both send updates to and receive updates from
the other stream. Each stream is a replica and an upstream source. MapR Streams
keeps both streams synchronized with each other.
As with many-to-one replication, the names of the topics in each stream must be
unique across both streams, so that offsets for messages do not conflict.

52  
®

You can replicate streams in a master-slave, many-to-one, or multi-master
configuration between thousands of geographically distributed clusters interconnected
arbitrarily – in a tree, a ring, a star, or a mesh. MapR Streams detects loops and
prevents message duplication.

53  
®

54  
Answer: C & D

55  
®

56  
Finally, we'll take a look at how MapR Streams fits in the MapR ecosystem and works
with other tools in the big data pipeline.

57  
A complete big data pipeline includes many different components.
First, you need data sources. In our pipeline, this includes the oil well sensors and
other data-generating applications known as producers.
A big data architecture will also include stream processing applications like Apache
Storm and Apache Spark, as well as bulk processing applications like Apache Hive or
Apache Drill. It may also include end applications like dashboards to display data or
issue alerts.
All of these components must interact with each other. MapR Streams acts as a
message bus between these components. MapR Streams manages sending and
receiving data between the many components of a complete data architecture.
There are several advantages of having MapR Streams on the same cluster as all the
other components. For example, maintaining only one cluster means less
infrastructure to provision, manage, and monitor. Likewise, having producers and
consumers on the same cluster means fewer delays related to copying and moving
data between clusters, and between applications.
®

61  
Answer: D

62  
®

Congratulations! You have completed Lesson 2: MapR Streams Architecture. Visit
learn.mapr.com for more courses.

63  
®

Welcome to DEV 351 – Developing MapR Streams Applications. Lesson 1:
Introduction to Producers and Consumers

1  
®

2  
®

When you have finished this lesson, you will be able to:
create a stream,
develop a Java producer,
and develop a Java consumer

3  
®

Let's start by creating a stream.

4  
®

The first step of creating a MapR Streams application is to create a stream.

This example shows the maprcli command for creating a stream, which is run in a
MapR cluster node terminal.
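The slide's command is not reproduced in these notes; a minimal, hypothetical
equivalent would look like this (the stream path is an example):

maprcli stream create -path /path/oilpump_metrics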

5  
®

consumeperm, produceperm, and topicperm are optional parameters to grant security
permissions.

By default, these permissions are granted to the user ID that created a stream. You
only need to use these parameters if you plan to run the producer and consumer with
user IDs that are different from the user ID that created the stream.

6  
®

The produceperm determines which users can publish messages to topics in a
stream.

7  
®

The consumeperm determines which users can read topics from a stream.

8  
®

Topicperm determines which users can create topics in a stream or remove them. A
producer must have this permission for automatic topic creation.

9  
®

The optional -defaultpartitions parameter determines how many partitions are created
in each new topic in the stream. By default, each new topic is created with one
partition. In this example, it is created with three.

For more information about creating streams refer to the MapR documentation.
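Putting the optional parameters together, a hypothetical stream-creation command
with explicit permissions and three default partitions might look like the following
(the path and user names are placeholders):

maprcli stream create -path /path/oilpump_metrics \
    -produceperm u:sensorapp \
    -consumeperm u:analytics \
    -topicperm u:sensorapp \
    -defaultpartitions 3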

10  
®

Topics can be created manually or automatically. By default, a topic is created
automatically when a producer creates the first message for it, if the topic does not
already exist. A producer must have topicperm permissions for automatic topic
creation, as shown earlier.

You can create topics manually using the maprcli stream topic create command, as
shown here.
For example, if you already planned a number of topics for your stream, you could
create these topics after creating the stream.
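For example, a hypothetical command to create a topic named pressure in the stream
created earlier, with three partitions, might be:

maprcli stream topic create -path /path/oilpump_metrics -topic pressure -partitions 3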

11  
®

12  
®

INSTRUCTORS: There are two methods by which topics can be created in a stream:
Automatic creation
By default, a topic is created automatically when a producer creates the first message for
it and the topic does not already exist. For example, you might have the stream
anonymous_usage that is intended to collect data about the use of a software application
that is about to be released. The administrator did not create any topics when creating the
stream, but producers will create topics automatically by publishing to topics for ranges of
IP addresses. After the software is released to the public, at some point a producer
application starts publishing messages to a topic that is created based on the range within
which the producer's IP address falls. At another point in time, a producer starts
publishing messages to a topic for a different range of IP addresses. Eventually, the
stream contains a number of topics for different ranges, and multiple producers are
publishing to each topic. You can turn off the automatic creation of topics for a stream. If
you do this, the publishing of a message to a non-existent topic fails. You can use the
command maprcli stream edit to change this setting after you create a stream.
Manual creation
The other method of creating topics is to use the maprcli stream topic create command.
For example, if you are creating a stream to collect operational metrics from systems in
your enterprise, you might have already planned on a set number of topics based on
system, location, company department, project, or some other criterion. You could create
these topics after creating the stream. When you create a topic this way, you can either
accept the default number of partitions for all new topics in the current stream, or you can
override that default and specify a number of partitions for the new topic.

13  
®

14  
®

Next, we'll learn how to develop a Java producer.

15  
®

Here is a rough outline of the steps for a producer to send messages. Now, we will go
through each step. Let's start by setting the producer properties.

16  
®

The first step is to set the producer properties, which will be used later to instantiate a
KafkaProducer for publishing messages to topics.

The key.serializer and value.serializer properties specify which class to use to
serialize the key and value of a message. In this case, we configure the
StringSerializer, because we are sending String messages. Another option is to use
the ByteArraySerializer to send byte messages.

For more information about the producer configuration properties, refer to the
documentation.
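A minimal sketch of this step, using the Kafka serializer classes named above (the
properties object is a standard java.util.Properties):

Properties props = new Properties();
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");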

17  
®

You can control some of the aspects of how producers publish messages by setting
various properties when you create a producer. This table lists some basic properties
you can set.

We will go over more properties for a producer sending messages in Lesson 4.

-----------------
key.serializer
The class for a producer written in Java to use to serialize keys, when keys are
included in messages. You can use the included
org.apache.kafka.common.serialization.ByteArraySerializer or
org.apache.kafka.common.serialization.StringSerializer to turn simple string or byte
types into bytes.

value.serializer
The class for a producer written in Java to use to serialize the value of each
message. You can use the included
org.apache.kafka.common.serialization.ByteArraySerializer or
org.apache.kafka.common.serialization.StringSerializer to turn simple string or byte
types into bytes.

client.id (optional)
Producers can tag records with a client ID that identifies the producer. Consumers

18  
®

The next step is to create a producer.

19  
®

You instantiate a KafkaProducer by providing a set of key-value pairs to configure the
properties as shown here. These are the properties we just discussed.

The KafkaProducer class is used to send messages to a topic.

Note that the KafkaProducer<K,V> is a Java generic class. You need to specify the
type parameters as the type of the key-value of the messages that the producer will
send.

The first type parameter is the type of the partition key. The second is the type of the
message value. In this example they are both strings, which should match the
Serializer type in the properties.
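For example, with the String serializers configured earlier, the producer is created
like this:

// type parameters match the serializer classes set in props
KafkaProducer<String, String> producer = new KafkaProducer<>(props);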

20  
®

The next step is to build the message.

21  
®

Here is the code for building the message.

The ProducerRecord is a key-value pair to be sent to Kafka. It consists of a topic
name to which the record is being sent, an optional partition number, and an optional
key and value.
The ProducerRecord is also a Java generic class, whose type parameters should
match the serialization properties set before.

In this example, we instantiate the ProducerRecord with a topic name and message
text as the value, which will create a record with no key.

----------------
Method used:
public ProducerRecord(java.lang.String topic, V value)
Create a record with no key
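A sketch of that constructor in use, with a hypothetical MapR Streams topic path of
the form /stream:topic and an illustrative message value:

String topic = "/path/oilpump_metrics:pressure";   // hypothetical stream:topic path
ProducerRecord<String, String> record = new ProducerRecord<>(topic, "Hello MapR Streams");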

22  
®

Finally, we will send the message.

23  
®

Here is the code for sending the message.

We call the producer send() method, which will asynchronously send a record to a
topic. This method returns a Java Future object, which will eventually contain the
response information.

The asynchronous send() method adds the record to a buffer of pending records to
send, and immediately returns. This allows sending records in parallel without waiting
for the responses, and allows the records to be batched for efficiency.
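A minimal example of the asynchronous send, continuing the record created above:

producer.send(record);   // buffered and sent asynchronously; returns a Future<RecordMetadata>
// calling get() on the returned Future would block until the server acknowledges the record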

24  
®

If you want to get the response information, then you should provide a Callback that
will be invoked when the request is complete.
The Callback is optional. There is a send method without a callback which is
equivalent to send(record, null).

25  
®

You should call producer.close after use, in order to release resources from the
producer client library.

We will show an implementation of the Callback next.

26  
®

Here is an example implementation of the callback discussed before.

The onCompletion method will be called when the record sent to the server has been
acknowledged. This example prints the RecordMetadata, which specifies the partition
the record was sent to and the offset it was assigned.

----------------------
From Kafka API doc methods used:
public interface Callback
A callback interface that the user can implement to allow code to execute when the
request is complete. This callback will generally execute in the background I/O thread
so it should be fast.

void onCompletion(RecordMetadata metadata, java.lang.Exception exception)
A callback method the user can implement to provide asynchronous handling of
request completion. This method will be called when the record sent to the server has
been acknowledged. Exactly one of the arguments will be non-null.
Parameters:
metadata - The metadata for the record that was sent (i.e. the partition and offset).
Null if an error occurred.
exception - The exception thrown during processing of this record. Null if no error
occurred.
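A sketch of such a callback, passed directly to send(); Callback and RecordMetadata
come from org.apache.kafka.clients.producer, and the success branch prints the
partition and offset from the RecordMetadata:

producer.send(record, new Callback() {
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            exception.printStackTrace();          // the send failed
        } else {
            System.out.println("partition = " + metadata.partition()
                    + ", offset = " + metadata.offset());
        }
    }
});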

27  
®

Here is the code all together for a simple producer to send three messages. You will
write a similar piece of code and run it in the lab.

-----------------
This code:
• Sets the stream name and topic to publish to
• Declares a producer variable
• Calls the setupProducer() method which instantiates a producer with configuration
properties
• Loops 3 times. Each loop:
• Creates a ProducerRecord specifying the topicName and the message
text.
• Calls the send method on the producer passing the record.
• Finally, calls the close method on the producer to release resources. This method
blocks until all in-flight requests complete.
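The slide's code itself is not reproduced in these notes; the following is a hedged
sketch of the outline above (the stream path, topic name, and class name are
placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    // Hypothetical stream:topic path; use the stream you created with maprcli
    private static final String TOPIC = "/path/oilpump_metrics:pressure";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        for (int i = 1; i <= 3; i++) {
            // Create a record with no key and send it asynchronously
            producer.send(new ProducerRecord<String, String>(TOPIC, "Msg" + i));
            System.out.println("Sent: Msg" + i);
        }
        producer.close();   // blocks until all in-flight requests complete
    }
}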

28  
®

The final step is to build your code into a jar file and run it. In the lab, a maven project
with a pom.xml file is provided for building your code.

You run the producer code using the command shown here.

`mapr classpath` is a utility command which sets the other jar files needed in the
classpath.

The output of the print statement in the code is shown here.


__________________

java -cp `mapr classpath`:<name of your jar file> <package name>.<class name>

`mapr classpath` is a utility command which sets the other jar files needed in the
classpath.

29  
®

30  
®

Answer: E

ProducerRecord(java.lang.String topic, java.lang.Integer partition, K key, V value)
Creates a record to be sent to a specified topic and partition
ProducerRecord(java.lang.String topic, K key, V value)
Create a record to be sent to Kafka
ProducerRecord(java.lang.String topic, V value)
Create a record with no key

31  
®

32  
®

Finally, we'll learn how to develop a Java consumer.

33  
®

This is an outline for how consumers receive messages. We will go through each of
the steps, starting with setting the consumer properties.

34  
®

As the first step in writing the consumer code, you need to set the consumer
properties, which will be used later to instantiate a KafkaConsumer for reading
messages from Topics.

The value.deserializer configuration parameter specifies which class to use to
deserialize the value of each message, in this case the StringDeserializer, because
we are receiving String messages. Another option is to use the ByteArrayDeserializer
to receive byte messages.
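A minimal sketch of the consumer properties, mirroring the producer example
(key.deserializer is included as well, since keys are also deserialized):

Properties props = new Properties();
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");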

35  
®

You can control some of the aspects of how consumers read messages by setting
various properties when you create a consumer.

We will go over more properties in Lesson 4.

36  
®

If a consumer that is not associated with a consumer-group ID fails, it can either start
reading its partitions from the earliest offsets or from the latest offset. The choice is
determined by the auto.offset.reset configuration parameter.

The earliest offset is the offset of the message that has been in the partition the
longest, without being deleted because of the time-to-live. If the consumer reads from
the earliest offset in a partition, it might re-read a large number of messages before
reading messages that were published after it failed.

The latest offset is the offset of the most current message at the time the consumer
requests new messages from MapR Streams. If the consumer reads from the latest
offset in a partition, the consumer starts off up-to-date, but skips over the messages
between its time of failure and the current time.
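For example, a consumer that should replay from the oldest retained message would
set (the value shown is illustrative):

props.put("auto.offset.reset", "earliest");   // or "latest" to skip to the most recent messages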

37  
®

Next, we will create a consumer.

38  
®

The KafkaConsumer class is used to read messages from a Topic.

A KafkaConsumer is instantiated by providing the set of configuration properties we
just discussed.

Note that the KafkaConsumer is a Java Generic class and you need to specify the
two type parameters. The first is the type of the Partition key; the second is the type
of the message value. In this example, they are both Strings, which should match the
Deserializer type in the properties.
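With the String deserializers configured earlier, the consumer is created like this:

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);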

39  
®

Next, we'll subscribe to a topic.

40  
®

The consumer calls the subscribe method, passing a list of topic names.

When a consumer subscribes to a topic, it reads messages from all of the partitions
that are in the topic, except when a consumer is part of a consumer group. Consumer
groups will be explained in Lesson 4.

Consumers can subscribe to topics in two ways: by name or by a regular expression.

The ability to use regular expressions is helpful when the -autocreate parameter
for a stream is set to true and producers are allowed to create topics automatically at
runtime.

----------------
public void subscribe(java.util.List<java.lang.String> topics)
Subscribe to the given list of topics to get dynamically assigned partitions. This list will
replace the current assignment (if there is one).
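A sketch of subscribing by name, using a hypothetical stream:topic path
(java.util.Arrays supplies the list):

consumer.subscribe(Arrays.asList("/path/oilpump_metrics:pressure"));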

41  
®

Once a consumer is subscribed to a topic, it can poll from that topic.

42  
®

The next step is to poll for messages; this is typically done in a loop.

The poll command fetches data for the topics or partitions specified when subscribing.

--------
public ConsumerRecords<K,V> poll(long timeout)
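A minimal polling sketch; the timeout value is illustrative only:

ConsumerRecords<String, String> records = consumer.poll(1000);   // wait up to 1000 ms for new messages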

43  
®

Finally, we'll process the returned messages.

44  
®

The poll returns a ConsumerRecords object, which contains a list of
ConsumerRecord objects for each topic partition from which messages were
returned. Remember, you can subscribe to multiple topics.

Next, we call the ConsumerRecords iterator method, which returns an iterator of
ConsumerRecord objects.

--------------
public java.util.Iterator<ConsumerRecord<K,V>> iterator()
Specified by:
iterator in interface java.lang.Iterable<ConsumerRecord<K,V>>

45  
®

Here we loop through the iterator of ConsumerRecords, printing out the contents of
the ConsumerRecord.

The ConsumerRecord contains the topic name and partition number, from which the
record was received, and an offset in the partition. The ConsumerRecord value
contains the message.

An example of record.toString() is shown here.

-----------
For example: topic=/events:sensor, partition=0, offset=192, key=null, value=Msg1
Reference:
public final class ConsumerRecord<K,V>
extends java.lang.Object
A key/value pair to be received from Kafka. This consists of a topic name and a
partition number, from which the record is being received and an offset that points to
the record in a Kafka partition.

Note that MapR Streams message values can be either bytes or String, as specified
by the V in the ConsumerRecord<K,V> declaration. In this example our messages are
of type String, but they could be arbitrary bytes.
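A sketch of iterating over the returned records and printing each one (the enhanced
for loop uses the iterator described above):

for (ConsumerRecord<String, String> record : records) {
    System.out.println(record.toString());   // topic, partition, offset, key, and value
}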

46  
®

Here is the code that we just went over for a simple consumer to receive messages.
You will code this and run it in the lab.
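As with the producer, the slide's code is not reproduced here; this is a hedged sketch
that combines the steps above (the stream path and class name are placeholders, and
the loop runs until the process is stopped):

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    // Hypothetical stream:topic path; must match what the producer publishes to
    private static final String TOPIC = "/path/oilpump_metrics:pressure";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(TOPIC));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.toString());
            }
        }
    }
}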

47  
®

The final step is to build your code into a jar file and run it. In the lab, a maven project
with a pom.xml file is provided for building your code.

You can run the consumer code using the following command shown here.

The output of the print statement in the code is shown here.

----------------
java -cp <name of your jar file>:`mapr classpath` <package name>.<class name>

`mapr classpath` is a utility command which sets the other jar files needed in the
classpath.

48  
®

49  
®

In this lab, you will complete, build, and run the code for a producer and a consumer.

50  
®

51  
®

INSTRUCTORS: Take a moment to discuss the lab with your students, and give them
a chance to ask any questions about this lesson so far.

52  
®

Congratulations! You have completed DEV 351: Lesson 3.

53  
®

Welcome to DEV 351 – Developing MapR Streams Applications. Lesson 4: Producer
and Consumer Details

1  
®

2  
®

By the end of this lesson, you will be able to:


describe producer properties and options for buffering and batching of messages, as
well as publishing to partitions
describe consumer properties and options for fetching data, consumer groups, and
read cursors,
and explain messaging semantics.

3  
®

Let's start by describing producer properties and options for buffering and batching of
messages.

4  
®

First, a brief review of producer and client library buffering.

5  
Remember, the producer application sends messages to a topic using the MapR
Streams producer client library.

The producer client library has a pool of buffer space that holds records that haven't
yet been transmitted to the server.

Next, we will go over some options for buffering messages and sending them in
batches to the server.

6  
®

You can control some of the aspects of how producers buffer when publishing
messages by setting various properties when you create a producer.
Here are definitions of these properties, which we will go over in more detail next.

7  
The producer client library will send buffered records to the server when any of these
four conditions are met:
1. If producer client library has buffered enough messages to make an efficient RPC
to the server.
2. If a message has been buffered for the amount of time that is specified for the
streams.buffer.max.time.ms configuration parameter. The default interval for
flushes is 3000 milliseconds.
3. If producer client library has buffered messages beyond the value of the
buffer.memory configuration parameter. Making this larger can result in more
messages buffered, but requires more memory.
4. If the application explicitly flushes messages by calling producer.flush()

--------------
API Method Reference:
public void flush()
Invoking this method makes all buffered records immediately available to send
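For illustration, the two buffering parameters from conditions 2 and 3 can be set
alongside the other producer properties; the flush interval shown is the default
mentioned above, and the buffer size is an assumed value:

props.put("streams.buffer.max.time.ms", "3000");   // flush interval in milliseconds
props.put("buffer.memory", "33554432");            // total bytes the client library may buffer (assumed value)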

8  
The producer client library batches messages into publish requests that it sends to
the MapR Streams server.

By default, the client sends multiple publish requests without waiting for
acknowledgements from the MapR Streams server.

This behavior is determined by the producer property
streams.parallel.flushers.per.partition, which defaults to true.

With this default behavior, it is possible for messages to arrive to partitions out of
order due to the presence of multiple network interface controllers, network errors, or
retries.

For example, suppose an oil rig sensor producer is sending messages that are
specifically for Partition 1, in the Pressure topic. The producer client library buffers the
messages and sends a batch to Partition 1. Meanwhile, the producer keeps sending
messages for Partition 1, and the client library continues to buffer them. The next time
the client library has enough messages for Partition 1, the client sends another batch,
whether or not the server has acknowledged the previous batch.

If you always want messages to arrive at partitions in the order in which they were
sent, set the configuration parameter streams.parallel.flushers.per.partition to
false. The producer client library will then wait for an acknowledgement from the
MapR Streams server before sending the next batch of messages for that partition.
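In code, this is a single producer property (shown here as a sketch):

props.put("streams.parallel.flushers.per.partition", "false");   // preserve send order per partition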

9  
®

As a review, messages are assigned offsets when published to partitions. Offsets are
monotonically increasing and are local to partitions. The order of messages is
preserved within individual partitions, but not across partitions. Messages are
delivered from a partition to a consumer in order.

However, if streams.parallel.flushers.per.partition is set to the default of true,
the producer may have multiple parallel send requests to the server for a topic
partition. With this setting, it is possible for messages to be sent out of order. If this is
an issue for your application, then you can set this to false.

10  
®

11  
®

Answer: A
(default true) If enabled, producer may have multiple parallel send requests to the
server for each topic partition. If this setting is set to true, it is possible for messages
to be sent out of order.

12  
®

13  
®

Next we will discuss producer options for publishing to partitions.

14  
®

Let's review how the server chooses which partition to publish to.

15  
Recall from Lesson 2: if the producer specifies a partition ID, the MapR Streams
server publishes the message to the partition specified.

16  
If the producer provides a key when sending a message, the MapR Streams server
hashes the key and sends the message to the partition that corresponds to the hash.
A key is used to group related messages by partition within a stream.

17  
Last, if neither a partition ID nor a key is specified, the MapR Streams server sends
messages in a sticky round-robin fashion. For any given topic, the server randomly
chooses an initial partition.

18  
®

The code here shows how a producer can specify a key when it creates a record.

The key can be semantic, based on something relevant to the message; for example,
a sensor ID. All messages with the same key hash to the same partition, so keys let
you group related messages by partition.
-----------
From the Java doc for Reference:
Constructor ProducerRecord(java.lang.String topic, K key, V value)
Creates a record with a key and value to be sent to a specified topic

public final class ProducerRecord<K,V> extends java.lang.Object
A key/value pair to be sent to Kafka. This consists of a topic name to which the record
is being sent, an optional partition number, and an optional key and value.
If a valid partition number is specified that partition will be used when sending the
record. If no partition is specified but a key is present a partition will be chosen using
a hash of the key. If neither key nor partition is present a partition will be assigned in a
round-robin fashion.
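
The slide's code is not reproduced in this guide; below is a minimal sketch of sending
a record with a key. The stream:topic path and the sensor ID used as the key are
hypothetical.

--------------
Example (sketch):

String topic = "/streams/pump:pressure";
String key = "sensor-42";     // every message for sensor-42 hashes to the same partition
String value = "psi=872";

ProducerRecord<String, String> record =
    new ProducerRecord<String, String>(topic, key, value);
producer.send(record);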

19  
®

The KafkaProducer can specify a partition to publish to when it creates a producer
record, as shown here.

--------------
Constructor for reference
ProducerRecord(java.lang.String topic, java.lang.Integer partition, K key, V value)
Creates a record to be sent to a specified topic and partition
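
A sketch of specifying the partition explicitly when creating the record; the partition
number, topic path, key, and value are all hypothetical.

--------------
Example (sketch):

// Publish directly to partition 1 of the topic
ProducerRecord<String, String> record = new ProducerRecord<String, String>(
    "/streams/pump:pressure", 1, "sensor-42", "psi=872");
producer.send(record);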

20  
®

With this property, you can set how often the producer will be refreshed with metadata
about new topics and partitions. This can be useful, for example, when the producer
is specifying the partition.

------------------
metadata.max.age.ms: (default 600 * 1000 msec) The producer generally refreshes
the topic metadata from the server when there is a failure. It also polls for this data
regularly. This polling occurs automatically; no explicit API calls are required.
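
For example, to poll for new topic and partition metadata every minute instead of
every 10 minutes (the value shown is illustrative only):

--------------
Example (sketch):

// Refresh topic and partition metadata every 60 seconds
props.put("metadata.max.age.ms", "60000");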

21  
®

If you would like to write custom algorithms for determining which topic and partition
to use for messages that match specific criteria, you can provide a class which
implements the StreamsPartitioner interface. This also allows you to hash using more
than just the key.

The streams.partitioner.class property specifies the class that implements the
StreamsPartitioner interface. Use this configuration parameter only for producers that
are written in Java.
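
Configuring a custom partitioner might look like this sketch; com.example.SensorPartitioner
is a hypothetical class name (a sketch of such a class appears with the next slide), and
the class must be on the producer's classpath.

--------------
Example (sketch):

props.put("streams.partitioner.class", "com.example.SensorPartitioner");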

22  
®

Here is some sample code for a Partitioner class. The class implements the
Partitioner interface, and implements the partition method. The partition method input
parameters provide information about the topic, key, and cluster. The method can use
these input parameters to compute and then return the partition to send the message
to. This allows you to use more than just the key to group messages by partition.
In this example, the partition is calculated as the mod of the key and the number of
partitions for the topic.
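
The slide's sample code is not reproduced in this guide, so here is a minimal sketch
along the lines described, written against the Kafka Partitioner interface shown in the
reference below. It takes a non-negative hash of the key bytes mod the number of
partitions for the topic; the class name and the fallback behavior for a null key are
assumptions of this sketch.

--------------
Example (sketch):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class SensorPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // How many partitions does this topic currently have?
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0;   // no key: this sketch simply uses partition 0
        }
        // Partition = (non-negative hash of the key) mod (number of partitions)
        int hash = java.util.Arrays.hashCode(keyBytes) & Integer.MAX_VALUE;
        return hash % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}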

----------------
Java doc for reference
int partition(java.lang.String topic,
java.lang.Object key,
byte[] keyBytes,
java.lang.Object value,
byte[] valueBytes,
Cluster cluster)
Computes the partition
Parameters:
topic - The topic name
key - The key to partition on (or null if no key)
keyBytes - The serialized key to partition on( or null if no key)
value - The value to partition on or null
valueBytes - The serialized value to partition on or null

23  
®

24  
®

Answer: B

25  
®

26  
®

Next we will go over some consumer properties for fetching data.

27  
®

Let's start with a brief review of consumers polling for messages.

28  
When the consumer application is ready for more data, it issues a request, using the
MapR Streams client library, to poll a topic for messages that the application has not
yet read.
Once the request for unread messages is received, the partition returns the unread
messages to the client library, which passes the messages to the consumer
application.

29  
®

With these properties you can set some options for how much data to return.

--------------------
max.partition.fetch.bytes
(Default 64KB) The number of bytes of message data to attempt to fetch for each
partition in each poll request. These bytes will be read into memory for each partition,
so this parameter helps control the memory that the consumer uses. The size of the
poll request must be at least as large as the maximum message size that the server
allows or else it is possible for producers to send messages that are larger than the
consumer can fetch.
fetch.min.bytes
(Default 1 byte) The minimum amount of data the server should return for a fetch
request. If insufficient data is available, the server will wait for this minimum amount of
data to accumulate before answering the request.
This minimum applies to the totality of what a consumer has subscribed to.
Works in conjunction with the timeout interval that is specified in the poll function. If
the minimum number of bytes is not reached by the time that the interval expires, the
poll returns with nothing.
For example, suppose the value is set to 6 bytes and the timeout on a poll is set to
100ms. If there are 5 bytes available and no further bytes come in before the 100ms
expire, the poll returns with nothing.
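
A sketch of setting these two fetch properties on a consumer; the values are
illustrative only.

--------------
Example (sketch):

// Read at most 128 KB of message data per partition in each poll request
props.put("max.partition.fetch.bytes", "131072");
// Return from a poll only once at least 64 bytes are available,
// or the poll timeout expires, whichever comes first
props.put("fetch.min.bytes", "64");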

30  
®

31  
®

Next we will go over some consumer properties for consumer groups.

32  
®

Let's start with a brief review of consumer groups.

33  
As a review from Lesson 2, in order to take advantage of parallelism when reading,
you can group consumers together by setting the same value for the group.id
configuration parameter when you create a consumer.

The partitions in each topic are assigned dynamically to the consumers in a consumer
group in round-robin fashion.

This way, if you have a group of similar consumers all subscribed to the same topic,
you can distribute message processing across your consumers for more efficient
processing.

Note that consumer groups are also necessary for cursor persistence.

34  
Consumer groups enable cursor persistence. This can be useful even if a group has
only one member.

In this example, the group Oil1 only has one member, Consumer A, which is
subscribed to the Pressure topic and will receive messages from Pressure Partitions
1 and 2. Consumer B in the group Oil2 is subscribed to the Warnings topic and will
receive messages from Warnings Partition 1.

Even though the groups Oil1 and Oil2 each have only one member, they both benefit
from cursor persistence.

35  
®

Here is an example of creating a consumer in a group by setting the value for the
group.id configuration parameter when instantiating a consumer.

If you create three consumers and give each of them the group ID pumppressure,
together these consumers form the consumer group pumppressure. MapR Streams
does not generate IDs for consumer groups. You can create IDs that make sense for
your purposes. IDs are strings that can be up to 2457 bytes long.
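
The code from the slide is not reproduced here; a minimal sketch of creating one such
consumer follows. The group ID pumppressure comes from this example, while the
topic path and deserializer settings are placeholders.

--------------
Example (sketch):

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("group.id", "pumppressure");   // all consumers with this ID form one group
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("/streams/pump:pressure"));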

36  
®

Let's review consumer group assignment.

37  
If a consumer in a group is added, goes offline, comes back online, or if partitions are
added, then the partitions are reassigned dynamically among the consumers in the
group.

Consumers can be notified when group assignment happens through the
ConsumerRebalanceListener. This allows a consumer to finish things like state
cleanup or manual offset commits.

We will look at the ConsumerRebalanceListener next.

38  
®

Here is an example implementation of the ConsumerRebalanceListener interface, and
the corresponding methods, which are called when the set of partitions assigned to
the consumer changes.

39  
®

In this example implementation, the Listener constructor saves the associated
consumer object, in order to use it later to commit offsets with the consumer
commitSync method.

40  
®

The onPartitionsRevoked method is called before a partition rebalance.


Implementing this method allows the consumer to commit or save offsets on the start
of a rebalance operation.

41  
®

The onPartitionsAssigned method will be called after partition re-assignment
completes and before the consumer starts fetching data.
A consumer can implement this method to handle offsets on completion of partition
re-assignment.

42  
®

You provide an instance of the ConsumerRebalanceListener implementation with the
consumer subscribe method, as shown here.
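
The listener code from the slides is not reproduced in this guide, so below is a minimal
sketch along the lines described, assuming the Kafka ConsumerRebalanceListener
interface and manual offset commits. The class name and topic path are hypothetical.

--------------
Example (sketch):

import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SavingRebalanceListener implements ConsumerRebalanceListener {

    private final KafkaConsumer<String, String> consumer;

    // Keep a reference to the consumer so offsets can be committed later
    public SavingRebalanceListener(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before the rebalance: commit what has been processed so far
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after reassignment completes, before fetching resumes
    }
}

// Registering the listener when subscribing:
consumer.subscribe(java.util.Arrays.asList("/streams/pump:pressure"),
    new SavingRebalanceListener(consumer));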

43  
®

If new consumers join or leave a group, or new partitions are added, then the
partitions are redistributed among the consumers in the group. There may be a slight
pause in message delivery while this happens, and if the consumer has not committed
its cursor, messages could be redelivered. We will discuss the commit cursor further in
the next section.

44  
®

45  
®

Answer: D

46  
®

47  
®

Next we will go over some consumer properties for read cursors.

48  
®

Let's review cursors.

49  
®

The MapR Streams server keeps track of the messages that consumers have read
with cursors.
There is one cursor per partition per consumer group, and there are two types of
cursors.
The first type of cursor is the read cursor. This refers to the offset ID of the most
recent message that MapR Streams has sent to a consumer from a partition.
In this example, the read cursor has the offset ID of 3.

50  
®

The second type of cursor is the committed cursor.


Consumers that are part of a consumer group can save the current position of their
read cursor. Consumers can do this either automatically or manually. The saved
cursor is called a committed cursor because it indicates that the consumer has
processed all the messages in a partition up to and including the one with this offset.
In this example, the consumer has set the committed cursor to the current position of
its read cursor, with the offset ID of 3.

Cursors are useful in case of failover. If a consumer fails and MapR Streams
reassigns the consumer’s partitions to other consumers in a group, those consumers
can start reading from the next offset after the committed cursor in each of those
partitions.

NOTE: There could be a gap between the last committed offset and the read cursor at
the time a consumer fails. In such a situation, the messages within that gap will be
read a second time. Consumer applications have to be able to tolerate such
duplication.

If a stream is replicated to another cluster for backup, committed cursors are also
replicated. If a cluster fails, consumers that are redirected to the standby copy of a
stream can start reading from the next offset after committed cursors.

51  
®

How often a consumer should commit depends on how much read duplication you
are willing to tolerate. The more often a consumer commits, the less read duplication
the consumer must contend with. With MapR Streams, we recommend committing
often since MapR Streams can handle high-volume cursor commits better than Kafka.

52  
®

The length of time since the failed consumer last committed and the rate at which
messages are published determine how many messages are read a second time.
For example, suppose that the auto-commit interval is five seconds. A consumer
saves its commit cursor and then fails after three seconds. During those three
seconds, the consumer's read cursor has continued to move through the messages.
When its partitions are reassigned to other consumers in the group, those consumers
will read three seconds of messages that the failed consumer already read.

53  
®

These are the properties for the commit cursor.

The auto.commit.interval.ms configuration parameter determines the frequency of the
commits in milliseconds. The default value is 1000.

54  
®

Whether the MapR server commits the cursors for a consumer that is in a consumer
group is determined by the enable.auto.commit configuration parameter. You can set
it to true, which enables auto.commit, or false. The default is true.

We recommend setting enable.auto.commit to false, so that the consumer can
commit the messages that have actually been processed.
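
A sketch of the two settings together; the interval value is illustrative only.

--------------
Example (sketch):

// Option 1: let the cursor be committed automatically every 500 ms
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "500");

// Option 2 (recommended here): disable auto-commit and commit manually
// with commitSync() after processing
props.put("enable.auto.commit", "false");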

55  
®

The code above is an example of committing the offset manually.
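
That code is not reproduced in this guide; a minimal sketch of a poll loop with a
manual commit might look like this. The poll timeout and the processing step are
placeholders.

--------------
Example (sketch):

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(200);
    for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.value());   // application processing goes here
    }
    // Commit only after the fetched messages have actually been processed
    consumer.commitSync();
}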

56  
®

57  
®

Answer: C

The more often a consumer commits, the less read duplication the consumer must
contend with.
With MapR Streams we recommend committing often since MapR Streams can
handle high-volume cursor commits better than Kafka.

58  
®

59  
®

Finally, we'll learn about messaging semantics.

60  
®

MapR Streams provides “at least once” message delivery, which means messages
will not be lost once published, but messages may be duplicated.

Duplicates may be caused on the Producer side or the Consumer side. First we will
go over causes on the Producer side. Then we will go over causes on the consumer
side.

61  
®

If the client library does not receive an acknowledgement from the MapR Streams
server, then the client library will resend the messages which were not acknowledged.
This could lead to duplicates. For example, if some messages were received by the
MapR Server, but the acknowledgement was lost due to network failure, then the
library would resend the unacknowledged messages.

62  
®

If a producer has sent messages but crashes before the client library buffer has
flushed these messages, then those messages will not be sent.

However, lost messages can be avoided with good producer app design. A producer
can be sure messages were sent by providing a send Callback that will be invoked
when the send request is complete, as explained in Lesson 3. If a producer crashes
before receiving the callback, then on restart the producer could resend these
messages to guarantee delivery.
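
As a reminder of the callback approach from Lesson 3, a send with a completion
callback might look like the sketch below; what the application does on success or
failure is illustrative only.

--------------
Example (sketch):

producer.send(record, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            // The send failed: the application can log this and resend after a restart
            exception.printStackTrace();
        } else {
            // The send succeeded: safe to mark these messages as delivered
            System.out.println("Acknowledged at offset " + metadata.offset());
        }
    }
});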

63  
®

If a Producer fails and resends previously sent messages on restart, then this could
cause message duplicates, since even if a producer provides a callback, the ack
could be lost.

Remember that each message in a partition is assigned a unique offset ID. Duplicate
messages will have different offset IDs. If it is important for an application to not have
duplicates, then the Producer should embed a unique ID in the message record in
order to remove duplicates during consumer processing.

64  
®

As discussed before, when partitions are redistributed among the consumers in a
group, messages could be redelivered if the consumer has not committed the read
cursor. This can be avoided with a good implementation of the
ConsumerRebalanceListener.

65  
®

Also as discussed before, messages could be redelivered if the consumer crashes
after getting a message but before committing the cursor.

66  
®

If duplicate messages are a problem for your application, then you could design your
applications for idempotent messages.

In order to detect duplicates, the Producer can embed a unique ID in the message
record. Consumers can use this unique ID to identify duplicates during processing.
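
For example, a producer could put a UUID at the front of each message value, and a
consumer could keep a set of IDs it has already processed. The "id|payload" format
and all names below are hypothetical conventions for this sketch.

--------------
Example (sketch):

// Producer side: embed a unique ID in the message payload
String messageId = java.util.UUID.randomUUID().toString();
producer.send(new ProducerRecord<String, String>(
    "/streams/pump:pressure", messageId + "|" + "psi=872"));

// Consumer side: remember IDs already seen and skip repeats
java.util.Set<String> seenIds = new java.util.HashSet<String>();
for (ConsumerRecord<String, String> record : consumer.poll(200)) {
    String[] parts = record.value().split("\\|", 2);
    if (seenIds.add(parts[0])) {
        System.out.println(parts[1]);   // process only first-time messages
    }
}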

67  
®

68  
®

Answer: E

69  
®

70  
®

In Lab 4, we'll try out some of the properties we have discussed in this lesson.

71  
®

72  
