DEV3500 SlideGuide FMx8S2P
© 2016, MapR Technologies, Inc. All rights reserved. All other trademarks cited here are the property of
their respective owners.
Welcome to DEV 350 – MapR Streams Essentials. This course is designed to give
developers and administrators the basic concepts necessary to deploy MapR
Streams on a MapR Distribution.
In Lesson 1, we'll introduce the motivation behind MapR Streams and learn how to
apply MapR Streams to some common use cases.
Welcome to MapR Streams Essentials, Lesson 1 – Introduction to MapR Streams.
This lesson describes why people use MapR Streams, what MapR Streams is, and
provides a brief overview of some MapR Streams use cases.
By the end of this lesson you will be able to:
Summarize the motivation behind MapR Streams.
Why do people use MapR Streams? What problems does it solve?
How do people use MapR Streams? When do people use MapR Streams?
Let's start by describing some features of MapR Streams and some of the problems it
can solve.
Many big data sources are event-oriented. For example, stock prices, user activity,
and sensor data all trigger events.
Today’s applications need to process this high-velocity, high-volume data in real-time
to gain insights.
MapR Streams adds event streaming to the MapR platform, allowing events to be
ingested, moved, and processed as they happen, as part of a real-time data pipeline.
The need to process massive amounts of data in real-time has grown rapidly in recent
years.
For example, social media users expect alerts when their friends contact them.
Likewise, people want notifications from their bank when they are low on funds or at
risk of fraud.
Smart cars, GPS software, and traffic apps all depend on geospatial data to give up-to-date navigation information.
Networked sensors that make up the Internet of Things, along with log files from application metrics, may communicate important information about mission-critical events on oil rigs or factory lines.
Retail websites may use clickstream data to provide real-time advertisements.
All this data used to be stored as log files. Now, MapR Streams can be used to
transport this data as events, enabling real-time analytics.
What if you want to analyze data as it arrives?
Big data is typically sent in batches to HDFS, and then analyzed by distributed
processing frameworks like MapReduce, Apache Hive, or Apache Pig.
Batch processing can give great insights into things that happened in the past, but lacks the ability to answer the question of "what is happening right now?"
It is becoming important to process events as they arrive for real-time insights.
Analyzing data as it arrives requires several distributed applications to be linked
together in real-time.
In this example, MapR Streams helps provide real-time insights based on sensor
data, notifying the user to turn on the air conditioning!
What if you want to organize data as it arrives?
In a large data warehouse, there may be many different sources of data, and many
different applications which want to process, share and analyze this data. Integrating
data sources and applications by sharing files can quickly become unorganized,
complicated, and tightly coupled.
Topics organize events into categories. Topics are logical collections of messages
that are managed by MapR Streams. Topics decouple producers, which are the
sources of data, from consumers, which are the applications that process, analyze,
and share data.
Producers publish to a topic. In this example, we see a thermostat, an oil rig, and an alert application all publishing event data to their respective topics.
Consumers subscribe to the topics that interest them.
What if you need to process a high volume of data as it arrives?
With the emergence of the Internet of Things, there are more and more sensors,
clicks, log files, and other data sources that are sending billions of data points that
need to be processed as they arrive.
Traditional message queues cannot handle the volume of data for the Internet of Things, with millions of sources, hundreds of destinations, and the demand for real-time analytics.
A new approach is needed that can process messages from all these sources.
With MapR Streams, topics are partitioned for throughput and scalability. Partitions
make topics scalable by spreading the load for a topic across multiple servers.
Producers are load balanced between partitions.
Consumers can be grouped to read in parallel from multiple partitions within a topic
for faster performance.
What if you need to recover messages in case of server failure?
As more and more people depend on their data for mission-critical tasks, it is
important to have fault-tolerance built into your architecture.
If there were no replicated copy of "Topic: Warnings – Partition 2" and Server 2 went down, then we would lose all of the messages in Partition 2.
With MapR Streams, each partition and all of its messages are replicated for fault
tolerance.
The server owning the primary partition for the topic replicates the message to replica
containers.
Producers and consumers send to and read from the primary partition, shown here in orange. Replicas are used for fault tolerance.
Here, Server 2, owning the primary partition for "Topic Warnings – Partition 2," went
down.
A replica partition will become the new primary for the failed partition. In this case, the "Partition 2: Warning Replica" on Server 1 became the primary, now shown in orange. Producers and consumers will automatically be re-routed to the new primary.
What if you need real-time access to live data distributed across multiple clusters and
multiple data centers?
What if you would like to have consumers and producers in different locations
worldwide? What if you would like high availability if a cluster goes down?
If there were no replicated copy of your topics outside of the cluster, and the whole
cluster went down, then message servers would not be available for producers and
consumers.
Topics are grouped into streams which can be asynchronously replicated between
MapR clusters, with publishers and listeners existing anywhere, enabling truly global
applications.
The streams replication feature gives your users real-time access to live data
distributed across multiple clusters and multiple data centers around the world.
With streams replication, you can create a backup copy of a stream for producers and
consumers to fail over to if the original stream goes offline.
This feature significantly reduces the risk of data loss should a site-wide disaster
occur, making it essential for your disaster recovery strategy.
Replication :: fault tolerance
Partitioning :: scalability
Topics :: organization
Next, we'll learn what MapR Streams does.
The MapR Converged Data Platform integrates Apache Hadoop, Apache Spark, real-time database capabilities, global event streaming, and big data enterprise storage, for developing and running innovative data applications.
MapR Streams brings integrated publish/subscribe messaging to the MapR platform.
Producer applications can publish messages to topics that are managed by MapR
Streams.
Consumer applications can then read those messages at their own pace. All
messages published to MapR Streams are persisted, allowing future consumers to
“catch-up” on processing, and analytics applications to process historical data.
In addition to reliably delivering messages to applications within a single data center,
MapR Streams can continuously replicate data between multiple clusters, delivering
messages globally.
MapR Streams combines massively scalable messaging with security, reliability, fault
tolerance and global real-time delivery. MapR Streams implements the Apache Kafka
Producer/Consumer APIs, which many developers may already be familiar with.
Topics in MapR Streams are grouped into streams, which administrators can apply security, retention, and replication policies to. Combined with MapR-FS and MapR-DB in the MapR platform, streams allow organizations to create a centralized, secure data lake that unifies files, database tables, and message topics.
MapR Streams is the only messaging platform built on top of an enterprise-class big
data platform.
MapR Streams can reliably handle billions of messages from millions of producers,
and sort them into thousands of topics. It reliably delivers these messages to millions
of consumers, globally. This massive, global scale is built to handle the growing
Internet of Things.
Topics in MapR Streams are grouped into streams, which administrators can apply
security, retention, and replication policies to. MapR Streams can continuously
replicate data between multiple clusters, providing fault-tolerance.
Like other MapR services, MapR Streams has a distributed, scale-out design.
Combined with MapR-FS and MapR-DB in the MapR platform, streams allow users to
create a centralized data lake that can be processed and analyzed in real time or in
bulk batches.
Answer: A, B, C (D is wrong)
Finally, we'll learn how to apply MapR Streams to some common use cases.
Next, we'll take a look at a few enterprise-level use cases: health care, advertising,
credit cards, and the Internet of Things.
Let's say we need to build a flexible, secure healthcare database.
This presents many challenges. First, there will be many different data models in the
healthcare industry, ranging from insurance claims to test results. Second, there will
be many issues concerning security and privacy of this data.
MapR Streams can meet each of these challenges.
With MapR Streams, the data lineage portion of the compliance challenge is solved
because the stream becomes a system of record by being an infinite, immutable log
of each data change.
This objective also presents many challenges. First, the data will be coming from
many different sources, spread all over the globe. Second, it is crucial that alerts are
issued in real-time. If the pressure or temperature gets too high or too low at an oil rig,
it can cause serious problems. Third, there is a need to audit data. It is not enough to
monitor data in real-time; we will want to create reports analyzing past trends as well.
To predict potential equipment failures before they can occur, the solution must be
able to:
Creating targeted ads presents several challenges. Information about each individual
user, including their recent and long-term browsing trends, can help inform the
algorithms which will update which ads are shown to them. This requires a complex
data pipeline, which can maintain consistent streams of data for each unique user, at
a global scale.
MapR Streams' ability to handle global message streams and real-time analytics with
a streaming ETL makes it suitable for this task.
By using MapR Streams, you can provide real-time global insights for advertising.
The solution has sensors triggering off events. These events flow into the streaming
system.
Software is listening to these streams and reacts in real time. For example, if a
product is moving quickly off the shelves when there are hundreds of people in the
store, it can issue an alert so that product gets restocked as fast as possible.
The solution gets all of the data from stores to global headquarters in order to do more with it. The data is streamed and replicated between data centers, providing the ability to fix and optimize any issues with supply chain or shipping from global logistics management. In addition, the application provides the ability to build solutions around targeted advertising for its customers.
This use case fits nearly any business. It works in manufacturing, retail, or any company that needs to deliver more real-time solutions to its customers.
Let's say we want to build a reliable, real-time service bus for financial transactions.
Like our healthcare data, there are major concerns about the security of this
information. The streams must also be reliable for accurate, real-time updates to
financial transactions, and need to be queried to detect fraud.
Again, MapR Streams can handle this task. MapR Streams reliably delivers
information, and can have security settings enabled.
MapR Streams enables a faster closed-loop process between operational
applications and analytics by reliably running both operational and analytical
applications on one platform.
In this example, we see the clickstream data from online shopping is being used
simultaneously by several different applications. The browsing behavior from the
online shoppers is the primary data source. This data is analyzed by several different
applications, including fraud models and recommendation tools, all on a single
cluster.
Marketing & Sales: Analysis of customer engagement and conversion, powering real-
time recommendations while customers are still on the site or in the store
Customer Service & Billing: Analysis of contact center interactions, enabling accurate
remote trouble shooting before expensive field technicians are dispatched
Congratulations! You have completed Lesson 1: Introduction to MapR Streams. In
Lesson 2, you will learn more about MapR Streams architecture.
Welcome to MapR Streams Essentials, Lesson 2 – MapR Streams Architecture. This
lesson takes an in-depth look at the features and components of MapR Streams, and
how it fits in the big data architecture.
By the end of this lesson, you will be able to:
Define the core components of MapR Streams, describe the life of a message, and explain how MapR Streams fits in the complete data architecture.
Let's start with defining the core components of MapR Streams.
Messages are key/value pairs, where keys are optional. We will describe the use of
keys later. Values contain the data payload, which can be text or bytes.
In this example, we see a single data point, in nested JSON format, from an oil rig.
This includes sensor readings such as pressure, temperature, and oil percentage.
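A message value along these lines might look like the following (field names and values are illustrative, not taken from the slide):
{
  "timestamp": "2016-03-22T08:30:00Z",
  "sensors": { "pressure": 980.2, "temperature": 142.7, "oil_pct": 61.3 }
}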
Producers are data-generating applications that you create, such as sensors in an oil
rig.
Producers publish or send messages to topics.
Consumers are also applications that you create, such as analytics applications using
Apache Spark.
Consumers request unread messages from topics that they are subscribed to.
Topics are logical collections of messages that are managed by MapR Streams.
Topics organize events: producers publish to a relevant topic and consumers
subscribe to the topics of interest to them.
For example, you might have an application that monitors the logs for an oil rig. Your
monitoring application could send oil well sensor data to topics named Pressure,
Temperature, and Warnings.
One consumer application might subscribe to the Pressure topic to analyze the oil
pressure, issuing an alert if the pressure becomes dangerously high or low. A different
consumer might subscribe to the Temperature topic to generate a report of
temperature trends over time.
Topics are partitioned for throughput and scalability. Partitions, which exist within topics, are parallel, ordered sequences of messages that are continually appended to.
Consumer applications can read in parallel from multiple partitions within a topic. This is faster and more scalable than reading from one partition per topic.
When a message is published to a partition, it is assigned an offset, which is the
message ID. Offsets are numbered in increasing order, and are local to each partition.
This way, the order of messages is preserved within individual partitions, but not
across partitions.
A stream is a collection of topics that you can manage together. In this example, the
Pressure, Temperature, and Warnings topics are all collected together in one stream.
There are a number of different ways to manage a stream. For example, you can
apply security settings, time-to-live, or the default number of partitions for all topics
within a stream.
You can replicate streams to other streams in different clusters. For example, you can
create a backup copy of a stream for producers and consumers to fail over to if the
original stream goes offline.
The server manages streams, topics, and partitions. The server also handles
requests from the producer client library and the consumer client library.
Producer applications produce messages and send them to the producer client
library. This client library buffers incoming messages and then sends them in batches
to the server. The server then publishes the messages to the topics that the
producers have specified.
Consumers request unread messages from topics that they are interested in. The consumer client library retrieves the unread messages and passes them along. Then, consumer applications extract data from these messages.
Answer: A
Let's review.
Producer applications publish messages to topics using industry standard Kafka APIs.
Topics organize events into categories. Topics are partitioned for scalability and
replicated for reliability.
Consumers can subscribe to a topic using industry standard Kafka APIs.
MapR Streams handles requests for messages from consumers for topics that they
are subscribed to. Global delivery happens in real time.
Finally, consumer applications can read those messages at their own pace. All
messages published to MapR Streams are persisted, allowing future consumers to
“catch-up” on processing, and analytics applications to process historical data.
Next, we'll learn about the life of a message in MapR Streams.
To show you how these concepts fit together, we will go through an example of the
flow of messages from a producer to a consumer.
Imagine that you are using MapR Streams as part of a system to monitor oil wells
globally.
Your producers include sensors in the oil rigs, weather stations, and an application
which generates warning messages.
Your consumers are various analytical and reporting tools.
In a volume in a MapR cluster, you create the stream /path/oilpump_metrics. In that
stream, you create the topics pressure, temperature, and warnings.
Of all of the sensors (producers) that your system uses to monitor oil wells and
related data, let's choose a sensor that is in Venezuela. We'll follow a message that is
generated by this sensor and published in the Pressure topic. When you created this
topic, you also created several partitions within it to help spread the load among the
different nodes in your MapR cluster and to help improve the performance of your
consumers. For simplicity in this example, we'll assume that each topic has only one
partition.
How are messages sent?
The sensor producer application sends messages to the Pressure topic using the
MapR Streams producer client library.
The client library buffers incoming messages.
When the client has a large enough number of messages buffered, or after an interval
of time has expired, the client batches and sends the messages in the buffer. The
messages in the batch are published in the Pressure topic’s partitions on a MapR
Streams server.
When the messages are published to a topic partition, they are appended in order. Each message is given an offset, which is a sequentially numbered ID. Older messages will have lower numbered offsets, while the newest messages will have the highest numbers. The numbering of offsets is local to each partition, so the order of messages is preserved within individual partitions, but not across partitions.
Each partition and all of its messages are replicated for fault tolerance. The server owning the primary partition for the topic assigns the offset ID to messages and replicates the message to replica containers within the MapR cluster. Replication rules are controlled at the volume level within the MapR cluster. By default, partitions are replicated three times.
The server then acknowledges receiving the batch of messages and sends the offset
IDs that it assigned to them back to the producer.
There are three different ways of choosing a partition for a message:
First, if the producer specifies a partition ID, the MapR Streams server publishes the
message to the partition specified.
In this diagram, each producer specifies a partition to publish to within the same topic.
For example, Producer A specifies that it should publish to Partition 4.
The second way to specify a partition is with a key.
If the producer provides a key, the MapR Streams server hashes the key and sends
the message to the partition that corresponds to the hash.
In this example, Producer A specifies a key, and its messages go to Partition 3.
Last, if neither a partition ID nor a key is specified, the MapR Streams server sends
messages in a sticky round-robin fashion. For any given topic, the server randomly
chooses an initial partition. For example, suppose that for the Warnings topic, the
server chooses the partition with the ID 1. The server first sends messages to
partition 1, then to partition 2, and so on.
Answer: B
A consumer application that correlates oil well pressure with weather conditions is subscribed to the Pressure topic.
When the consumer application is ready for more data, it issues a request, using the
MapR Streams client library, to poll the Pressure topic for messages that the
application has not yet read.
Next, the client library asks if there are any messages more recent than what the
consumer application has already read.
Once the request for unread messages is received, the primary partition of the topic
returns the unread messages to the client.
The original messages remain on the partition and are available to other consumers.
The client library passes the messages to the consumer application, which can then
extract and process the data.
If more unread messages remain in the partition, the process repeats with the client
library requesting messages.
Since messages remain in the partition even after delivery to a consumer, when are
they deleted?
When you create a stream, you can set the time-to-live for messages.
Once a message has been in the partition for the specified time-to-live, it is expired.
An automatic process reclaims the disk space that the expired messages are using.
The time-to-live can be as long as you need it to be. Messages will not expire if the
time-to-live is zero, and will remain in the partition indefinitely.
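For example, a stream could be created with a seven-day time-to-live using the maprcli command covered in Lesson 3 (a sketch; the path and value are illustrative):
maprcli stream create -path /path/oilpump_metrics -ttl 604800
Here -ttl is given in seconds; setting it to 0 would keep messages indefinitely.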
You don’t have to worry about partitions getting too big to store on a single server; partitions are re-allocated every 2 GB to balance storage. MapR Streams can intelligently move partitions around in a cluster in order to spread the data out, allowing topics to act as effectively infinite, persistent storage.
Answer: C
Messages can be read in parallel, for fast and scalable processing.
In order to take advantage of parallelism when reading, you can group consumers
together by setting the same value for the group.id configuration parameter when you
create a consumer.
The partitions in each topic are assigned dynamically to the consumers in a consumer
group in round-robin fashion.
For example, suppose that there are three consumers in a group called Oil Wells.
Each consumer is subscribed to the same topic, Warnings. This Warnings topic has
five partitions.
MapR Streams assigns each partition to a consumer, using the round robin pattern.
The first partition will be assigned to Consumer A, the second to Consumer B, and the
third to Consumer C. Then, the assignment pattern starts over.
This way, if you have a group of similar consumers all subscribed to the same topic,
you can distribute message processing across your consumers for more efficient
processing.
If one of the consumers goes offline, the partitions are reassigned dynamically among
the remaining consumers in the group.
In this example, Consumer B has gone offline. The partitions that had previously been
assigned to B must be reassigned. Partition 2 gets reassigned to Consumer C, while
partition 5 gets reassigned to Consumer A.
If the offline consumer comes back online or a different consumer is added to the
group, again the partitions are redistributed among the consumers in the group.
This parallelism and dynamic reassignment is possible only if no consumer in the
group subscribes to an individual partition.
The MapR Streams server keeps track of the messages that consumers have read
with cursors.
There is one cursor per partition per consumer group, and there are two types of
cursors.
The first type of cursor is the read cursor. This refers to the offset ID of the most
recent message that MapR Streams has sent to a consumer from a partition.
In this example, the read cursor has the offset ID of 3.
You can replicate streams to other MapR clusters worldwide or to other streams
within a MapR cluster.
There are many different scenarios in which replicating streams can be useful.
For example, suppose that your oil drilling company has a refinery in Venezuela, and
sensors in the pipeline equipment track different metrics.
With replication, the factory could create a stream in the Venezuela cluster and
maintain a backup of the stream in the Venezuela_HA cluster.
This type of replication is called master-slave replication. In this example, Venezuela
is the master, and Venezuela_HA is the slave.
Suppose that your company's headquarters are in Houston, and you want data
analysts there to be able to analyze all data company-wide.
You have two metrics streams, one in each oil well: Venezuela and Mexico.
You can replicate each of these streams to a metrics stream in the Houston cluster.
In this scenario, the replica is the metrics stream in the Houston cluster. This replica
has two upstream sources: the metrics streams that are replicated from the two
factories.
This type of replication, called many-to-one replication, requires that the topics in
each stream have unique names, so that message offsets do not conflict. However,
the streams may have the same name.
In this example, we give each topic within the metrics stream a unique identifier, so
that the data from Mexico does not conflict with the data from Venezuela.
Another type of replication that can be useful is multi-master replication. You can
use it when you need two streams to both send updates to and receive updates from
the other stream. Each stream is a replica and an upstream source. MapR Streams
keeps both streams synchronized with each other.
As with many-to-one replication, the names of the topics in each stream must be
unique across both streams, so that offsets for messages do not conflict.
Answer: C & D
Finally, we'll take a look at how MapR Streams fits in the MapR ecosystem and works
with other tools in the big data pipeline.
A complete big data pipeline includes many different components.
First, you need data sources. In our pipeline, this includes the oil well sensors and
other data-generating applications known as producers.
A big data architecture will also include stream processing applications like Apache
Storm and Apache Spark, as well as bulk processing applications like Apache Hive or
Apache Drill. It may also include end applications like dashboards to display data or
issue alerts.
All of these components must interact with each other. MapR Streams acts as a
message bus between these components. MapR Streams manages sending and
receiving data between the many components of a complete data architecture.
There are several advantages of having MapR Streams on the same cluster as all the
other components. For example, maintaining only one cluster means less
infrastructure to provision, manage, and monitor. Likewise, having producers and
consumers on the same cluster means fewer delays related to copying and moving
data between clusters, and between applications.
Answer: D
When you have finished this lesson, you will be able to:
create a stream,
develop a Java producer,
and develop a Java consumer
This example shows the maprcli command for creating a stream, which is run in a
MapR cluster node terminal.
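A minimal sketch of the command (the stream path is illustrative):
maprcli stream create -path /path/oilpump_metrics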
By default, these permissions are granted to the user ID that created a stream. You
only need to use these parameters if you plan to run the producer and consumer with
user IDs that are different from the user ID that created the stream.
The -consumeperm parameter determines which users can read topics from a stream.
The -topicperm parameter determines which users can create topics in a stream or remove them. A producer must have this permission for automatic topic creation.
The optional -defaultpartitions parameter determines how many partitions are created
in each new topic in the stream. By default, each new topic is created with one
partition. In this example, it is created with three.
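Putting these parameters together, a sketch of a full create command (the path and user IDs are illustrative; the permission values are MapR Access Control Expressions):
maprcli stream create -path /path/oilpump_metrics \
  -produceperm u:producer1 \
  -consumeperm u:consumer1 \
  -topicperm u:producer1 \
  -defaultpartitions 3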
For more information about creating streams refer to the MapR documentation.
You can create topics manually using the maprcli stream topic create command, as
shown here.
For example, if you already planned a number of topics for your stream, you could
create these topics after creating the stream.
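For example (the path, topic name, and partition count are illustrative):
maprcli stream topic create -path /path/oilpump_metrics -topic pressure -partitions 3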
INSTRUCTORS: There are two methods by which topics can be created in a stream:
Automatic creation
By default, a topic is created automatically when a producer creates the first message for
it and the topic does not already exist. For example, you might have the stream
anonymous_usage that is intended to collect data about the use of a software application
that is about to be released. The administrator did not create any topics when creating the
stream, but producers will create topics automatically by publishing to topics for ranges of
IP addresses. After the software is released to the public, at some point a producer
application starts publishing messages to a topic that is created based on the range within
which the producer's IP address falls. At another point in time, a producer starts
publishing messages to a topic for a different range of IP addresses. Eventually, the
stream contains a number of topics for different ranges, and multiple producers are
publishing to each topic. You can turn off the automatic creation of topics for a stream. If
you do this, the publishing of a message to a non-existent topic fails. You can use the
command maprcli stream edit to change this setting after you create a stream.
Manual creation
The other method of creating topics is to use the maprcli stream topic create command.
For example, if you are creating a stream to collect operational metrics from systems in
your enterprise, you might have already planned on a set number of topics based on
system, location, company department, project, or some other criterion. You could create
these topics after creating the stream. When you create a topic this way, you can either
accept the default number of partitions for all new topics in the current stream, or you can
override that default and specify a number of partitions for the new topic.
Here is a rough outline of the steps for a producer to send messages. Now, we will go
through each step. Let's start by setting the producer properties.
The first step is to set the producer properties, which will be used later to instantiate a
KafkaProducer for publishing messages to topics.
For more information about the producer configuration properties, refer to the
documentation.
You can control some of the aspects of how producers publish messages by setting
various properties when you create a producer. This table lists some basic properties
you can set.
-----------------
key.serializer
The class for a producer written in Java to use to serialize keys, when keys are
included in messages. You can use the included
org.apache.kafka.common.serialization.ByteArraySerializer or
org.apache.kafka.common.serialization.StringSerializer to turn simple string or byte
types into bytes.
value.serializer
The class for a producer written in Java to use to serialize the value of each
message. You can use the included
org.apache.kafka.common.serialization.ByteArraySerializer or
org.apache.kafka.common.serialization.StringSerializer to turn simple string or byte
types into bytes.
client.id (optional)
Producers can tag records with a client ID that identifies the producer. Consumers can then use this ID to identify the source of each record.
Note that the KafkaProducer<K,V> is a Java generic class. You need to specify the type parameters as the types of the key and value of the messages that the producer will send.
The first type parameter is the type of the partition key. The second is the type of the
message value. In this example they are both strings, which should match the
Serializer type in the properties.
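Taken together, these two steps might look like the following minimal sketch, using the serializer properties described earlier:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");

// Both type parameters are String, matching the serializers above
KafkaProducer<String, String> producer = new KafkaProducer<>(props);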
In this example, we instantiate the ProducerRecord with a topic name and message
text as the value, which will create a record with no key.
----------------
Method used:
public ProducerRecord(java.lang.String topic, V value)
Create a record with no key
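For example (the stream and topic names are illustrative; in MapR Streams a topic is addressed as /stream-path:topic-name):
String topicName = "/path/oilpump_metrics:pressure";
// Record with no key: only the topic name and the message value are given
ProducerRecord<String, String> record =
    new ProducerRecord<>(topicName, "pressure reading 980.2");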
We call the producer send() method, which will asynchronously send a record to a
topic. This method returns a Java Future object, which will eventually contain the
response information.
The asynchronous send() method adds the record to a buffer of pending records to
send, and immediately returns. This allows sending records in parallel without waiting
for the responses, and allows the records to be batched for efficiency.
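A sketch of both styles (this assumes the record and producer from the previous steps; Future.get() throws checked exceptions that real code must handle):
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.RecordMetadata;

// Asynchronous: returns immediately; batching happens in the background
Future<RecordMetadata> future = producer.send(record);

// Optionally block until the server acknowledges the record
RecordMetadata metadata = future.get();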
If you want to get the response information, then you should provide a Callback that
will be invoked when the request is complete.
The Callback is optional. There is a send method without a callback which is
equivalent to send(record, null).
You should call producer.close() after use, in order to release resources from the producer client library.
The onCompletion method will be called when the record sent to the server has been
acknowledged. This example prints the RecordMetadata, which specifies the partition
the record was sent to and the offset it was assigned.
----------------------
From Kafka API doc methods used:
public interface Callback
A callback interface that the user can implement to allow code to execute when the
request is complete. This callback will generally execute in the background I/O thread
so it should be fast.
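A sketch of a send with an inline callback:
producer.send(record, new Callback() {
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) {
            e.printStackTrace();  // the send failed
        } else {
            // Partition the record was sent to, and the offset it was assigned
            System.out.println("partition=" + metadata.partition()
                + ", offset=" + metadata.offset());
        }
    }
});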
Here is the code all together for a simple producer to send three messages. You will
write a similar piece of code and run it in the lab.
-----------------
This code:
• Sets the stream name and topic to publish to
• Declares a producer variable
• Calls the setupProducer() method which instantiates a producer with configuration
properties
• Loops 3 times. Each loop:
• Creates a ProducerRecord specifying the topicName and the message
text.
• Calls the send method on the producer passing the record.
• Finally, calls the close method on the producer to release resources. This method
blocks until all in-flight requests complete.
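Since the slide code is not reproduced here, the following is a sketch along the lines just described (stream, topic, and class names are illustrative):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    // Topic within the stream created earlier (illustrative path)
    private static final String TOPIC = "/path/oilpump_metrics:pressure";
    private static KafkaProducer<String, String> producer;

    public static void main(String[] args) {
        setupProducer();
        for (int i = 1; i <= 3; i++) {
            // Create a record with the topic name and message text, no key
            ProducerRecord<String, String> record =
                new ProducerRecord<>(TOPIC, "Msg" + i);
            producer.send(record);  // asynchronous; the client library batches
        }
        // Release resources; blocks until all in-flight requests complete
        producer.close();
    }

    private static void setupProducer() {
        Properties props = new Properties();
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }
}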
The final step is to build your code into a jar file and run it. In the lab, a maven project
with a pom.xml file is provided for building your code.
You run the producer code using the command shown here.
`mapr classpath` is a utility command which sets the other jar files needed in the classpath.
java -cp `mapr classpath`:<name of your jar file> <package name>.<class name>
Answer: E
This is an outline for how consumers receive messages. We will go through each of
the steps, starting with setting the consumer properties.
As the first step in writing the consumer code, you need to set the consumer
properties, which will be used later to instantiate a KafkaConsumer for reading
messages from topics.
You can control some of the aspects of how consumers read messages by setting
various properties when you create a consumer.
If a consumer that is not associated with a consumer-group ID fails, it can either start
reading its partitions from the earliest offsets or from the latest offset. The choice is
determined by the auto.offset.reset configuration parameter.
The earliest offset is the offset of the message that has been in the partition the
longest, without being deleted because of the time-to-live. If the consumer reads from
the earliest offset in a partition, it might re-read a large number of messages before
reading messages that were published after it failed.
The latest offset is the offset of the most current message at the time the consumer
requests new messages from MapR Streams. If the consumer reads from the latest
offset in a partition, the consumer starts off up-to-date, but skips over the messages
between its time of failure and the current time.
Note that the KafkaConsumer is a Java Generic class and you need to specify the
two type parameters. The first is the type of the Partition key; the second is the type
of the message value. In this example, they are both Strings, which should match the
Deserializer type in the properties.
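A minimal sketch of the consumer setup, using properties discussed in this course (the group ID matches the example used later in Lesson 4):
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("group.id", "pumppressure");      // consumer group; see Lesson 4
props.put("auto.offset.reset", "earliest"); // where to start without a cursor
props.put("enable.auto.commit", "true");    // server commits cursors automatically
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

// Both type parameters are String, matching the deserializers above
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);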
The consumer calls the subscribe method, passing a list of topic names.
When a consumer subscribes to a topic, it reads messages from all of the partitions
that are in the topic, except when a consumer is part of a consumer group. Consumer
groups will be explained in Lesson 4.
The ability to use regular expressions is helpful when the -autocreate parameter for a stream is set to true and producers are allowed to create topics automatically at runtime.
----------------
public void subscribe(java.util.List<java.lang.String> topics)
Subscribe to the given list of topics to get dynamically assigned partitions. This list will
replace the current assignment (if there is one).
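For example (topic paths illustrative):
consumer.subscribe(java.util.Arrays.asList(
    "/path/oilpump_metrics:pressure",
    "/path/oilpump_metrics:warnings"));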
The next step is to poll for messages; this is typically done in a loop.
The poll command fetches data for the topics or partitions specified when subscribing.
--------
public ConsumerRecords<K,V> poll(long timeout)
--------------
public java.util.Iterator<ConsumerRecord<K,V>> iterator()
Specified by:
iterator in interface java.lang.Iterable<ConsumerRecord<K,V>>
Here we loop through the iterator of ConsumerRecords, printing out the contents of
the ConsumerRecord.
The ConsumerRecord contains the topic name and partition number, from which the
record was received, and an offset in the partition. The ConsumerRecord value
contains the message.
-----------
For example: topic=/events:sensor, partition=0, offset=192, key=null, value=Msg1
Reference:
public final class ConsumerRecord<K,V>
extends java.lang.Object
A key/value pair to be received from Kafka. This consists of a topic name and a
partition number, from which the record is being received and an offset that points to
the record in a Kafka partition.
Note that MapR Streams message values can be either bytes or String, as specified by the V in the ConsumerRecord<K,V> declaration. In this example our messages are of type String, but they could be arbitrary bytes.
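A sketch of the poll loop described above (the timeout value is illustrative):
while (true) {
    // Fetch any unread messages, waiting up to 1000 ms
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("topic=" + record.topic()
            + ", partition=" + record.partition()
            + ", offset=" + record.offset()
            + ", value=" + record.value());
    }
}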
Here is the code that we just went over for a simple consumer to receive messages.
You will code this and run it in the lab.
The final step is to build your code into a jar file and run it. In the lab, a maven project
with a pom.xml file is provided for building your code.
You can run the consumer code using the command shown here.
----------------
java -cp <name of your jar file>:`mapr classpath` <package name>.<class name>
`mapr classpath` is a utility command which sets the other jar files needed in the
classpath.
In this lab, you will complete, build, and run the code for a producer and a consumer.
INSTRUCTORS: Take a moment to discuss the lab with your students, and give them
a chance to ask any questions about this lesson so far.
Let's start by describing producer properties and options for buffering and batching of
messages.
Remember, the producer application sends messages to a topic using the MapR
Streams producer client library.
The producer client library has a pool of buffer space that holds records that haven't
yet been transmitted to the server.
Next, we will go over some options for buffering messages and sending them in
batches to the server.
You can control some of the aspects of how producers buffer when publishing
messages by setting various properties when you create a producer.
Here are definitions of these properties, which we will go over in more detail next.
The producer client library will send buffered records to the server when any of these four conditions is met:
1. The producer client library has buffered enough messages to make an efficient RPC to the server.
2. A message has been buffered for the amount of time that is specified for the streams.buffer.max.time.ms configuration parameter. The default interval for flushes is 3000 milliseconds.
3. The producer client library has buffered messages beyond the value of the buffer.memory configuration parameter. Making this larger can result in more messages buffered, but requires more memory.
4. The application explicitly flushes messages by calling producer.flush().
--------------
API Method Reference:
public void flush()
Invoking this method makes all buffered records immediately available to send
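A sketch of how these settings appear in the producer properties, plus an explicit flush (the buffer.memory value is illustrative; 3000 ms is the stated default flush interval):
props.put("streams.buffer.max.time.ms", "3000"); // flush interval
props.put("buffer.memory", "33554432");          // total buffer space in bytes

producer.flush(); // explicitly make all buffered records available to send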
The producer client library batches messages into publish requests that it sends to
the MapR Streams server.
By default, the client sends multiple publish requests without waiting for
acknowledgements from the MapR Streams server.
With this default behavior, it is possible for messages to arrive at partitions out of order due to the presence of multiple network interface controllers, network errors, or retries.
For example, suppose an oil rig sensor producer is sending messages that are
specifically for Partition 1, in the Pressure topic. The producer client library buffers the
messages and sends a batch to Partition 1. Meanwhile, the producer keeps sending
messages for Partition 1, and the client library continues to buffer them. The next time
the client library has enough messages for Partition 1, the client sends another batch,
whether or not the server has acknowledged the previous batch.
If you always want messages to arrive at partitions in the order in which they were sent, set the configuration parameter streams.parallel.flushers.per.partition to false. The producer client library will then wait for acknowledgements from the MapR Streams server before sending the next batch.
As a review, messages are assigned offsets when published to partitions. Offsets are
monotonically increasing and are local to partitions. The order of messages is
preserved within individual partitions, but not across partitions. Messages are
delivered from a partition to a consumer in order.
Answer: A
(default true) If enabled, producer may have multiple parallel send requests to the
server for each topic partition. If this setting is set to true, it is possible for messages
to be sent out of order.
Let's review how the server chooses which partition to publish to.
Recall from Lesson 2: if the producer specifies a partition ID, the MapR Streams
server publishes the message to the partition specified.
If the producer provides a key when sending a message, the MapR Streams server
hashes the key and sends the message to the partition that corresponds to the hash.
A key is used to group related messages by partition within a stream.
Last, if neither a partition ID nor a key is specified, the MapR Streams server sends
messages in a sticky round-robin fashion. For any given topic, the server randomly
chooses an initial partition.
The code here shows how a producer can specify a key when it creates a record.
The key can be semantic based on something relevant to the message; for example,
by sensor ID. All messages for the same key will hash to the same partition, so keys
allow partitioning by key.
-----------
From the Java doc for Reference:
Constructor ProducerRecord(java.lang.String topic, K key, V value)
Creates a record with a key, to be sent to the specified topic
--------------
Constructor for reference
ProducerRecord(java.lang.String topic, java.lang.Integer partition, K key, V value)
Creates a record to be sent to a specified topic and partition
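A sketch of both constructors (the topic path, key, and partition number are illustrative):
// Key-based: all records with the same sensor ID hash to the same partition
ProducerRecord<String, String> byKey =
    new ProducerRecord<>("/path/oilpump_metrics:pressure", "sensor42", "980.2");

// Partition-based: send the record to partition 1 explicitly
ProducerRecord<String, String> byPartition =
    new ProducerRecord<>("/path/oilpump_metrics:pressure", 1, "sensor42", "980.2");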
With this property, you can set how often the producer will be refreshed with metadata
about new topics and partitions. This can be useful, for example, when the producer
is specifying the partition.
------------------
metadata.max.age.ms: (default 600 * 1000 msec) The producer generally refreshes
the topic metadata from the server when there is a failure. It will also poll for this data
regularly. This polling occurs automatically. There does not have to be any explicit
calls to any APIs.
If you would like to write custom algorithms for determining which topic and partition
to use for messages that match specific criteria, you can provide a class which
implements the StreamsPartitioner interface. This also allows you to hash using more
than just the key.
Here is some sample code for a Partitioner class. The class implements the
Partitioner interface, and implements the partition method. The partition method input
parameters provide information about the topic, key, and cluster. The method can use
these input parameters to compute and then return the partition to send the message
to. This allows you to use more than just the key to group messages by partition.
In this example, the partition is calculated as the mod of the key and the number of
partitions for the topic.
----------------
Java doc for reference
int partition(java.lang.String topic,
java.lang.Object key,
byte[] keyBytes,
java.lang.Object value,
byte[] valueBytes,
Cluster cluster)
Computes the partition
Parameters:
topic - The topic name
key - The key to partition on (or null if no key)
keyBytes - The serialized key to partition on( or null if no key)
value - The value to partition on or null
valueBytes - The serialized value to partition on or null
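Since the slide code is not reproduced here, the following is a sketch of such a class against the Kafka Partitioner interface (the class name and mod-based logic are illustrative; hashCode() stands in for a numeric key value):
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class KeyModPartitioner implements Partitioner {
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        // The mod of the key and the number of partitions for the topic
        return Math.abs(key.hashCode()) % numPartitions;
    }

    public void configure(Map<String, ?> configs) { }  // no configuration needed
    public void close() { }                            // nothing to release
}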
Answer: B
When the consumer application is ready for more data, it issues a request, using the
MapR Streams client library, to poll a topic for messages that the application has not
yet read.
Once the request for unread messages is received, the partition returns the unread
messages to the client library, which passes the messages to the consumer
application.
With these properties you can set some options for how much data to return.
--------------------
max.partition.fetch.bytes
(Default 64KB) The number of bytes of message data to attempt to fetch for each
partition in each poll request. These bytes will be read into memory for each partition,
so this parameter helps control the memory that the consumer uses. The size of the
poll request must be at least as large as the maximum message size that the server
allows or else it is possible for producers to send messages that are larger than the
consumer can fetch.
fetch.min.bytes
(Default 1 byte) The minimum amount of data the server should return for a fetch
request. If insufficient data is available, the server will wait for this minimum amount of
data to accumulate before answering the request.
This minimum applies to the totality of what a consumer has subscribed to.
Works in conjunction with the timeout interval that is specified in the poll function. If
the minimum number of bytes is not reached by the time that the interval expires, the
poll returns with nothing.
For example, suppose the value is set to 6 bytes and the timeout on a poll is set to
100ms. If there are 5 bytes available and no further bytes come in before the 100ms
expire, the poll returns with nothing.
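For example, in the consumer properties (values match the defaults described above):
props.put("max.partition.fetch.bytes", "65536"); // 64 KB per partition per poll
props.put("fetch.min.bytes", "1");               // return as soon as any data is available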
As a review from Lesson 2, in order to take advantage of parallelism when reading,
you can group consumers together by setting the same value for the group.id
configuration parameter when you create a consumer.
The partitions in each topic are assigned dynamically to the consumers in a consumer
group in round-robin fashion.
This way, if you have a group of similar consumers all subscribed to the same topic,
you can distribute message processing across your consumers for more efficient
processing.
Note that consumer groups are also necessary for cursor persistence.
Consumer groups enable cursor persistence. This can be useful even if a group has
only one member.
In this example, the group Oil1 only has one member, Consumer A, which is
subscribed to the Pressure topic and will receive messages from Pressure Partitions
1 and 2. Consumer B in the group Oil2 is subscribed to the Warnings topic and will
receive messages from Warnings Partition 1.
Even though the groups Oil1 and Oil2 each have only one member, they both benefit from cursor persistence.
Here is an example of creating a consumer in a group by setting the value for the
group.id configuration parameter when instantiating a consumer.
If you create three consumers and give each of them the group ID pumppressure,
together these consumers form the consumer group pumppressure. MapR Streams
does not generate IDs for consumer groups. You can create IDs that make sense for
your purposes. IDs are strings that can be up to 2457 bytes long.
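A minimal sketch, reusing the properties object from Lesson 3:
props.put("group.id", "pumppressure");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);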
If a consumer in a group is added, goes offline, comes back online, or if partitions are
added, then the partitions are reassigned dynamically among the consumers in the
group.
If new consumers join or leave a group, or new partitions are added, then the
partitions are redistributed among the consumers in the group. There may be a slight
pause in message delivery while this happens, if the consumer has not committed the
cursor. This could cause messages to be redelivered. We will discuss the commit
cursor further in the next section.
Answer: D
The MapR Streams server keeps track of the messages that consumers have read
with cursors.
There is one cursor per partition per consumer group, and there are two types of
cursors.
The first type of cursor is the read cursor. This refers to the offset ID of the most
recent message that MapR Streams has sent to a consumer from a partition.
In this example, the read cursor has the offset ID of 3.
Cursors are useful in case of failover. If a consumer fails and MapR Streams
reassigns the consumer’s partitions to other consumers in a group, those consumers
can start reading from the next offset after the committed cursor in each of those
partitions.
NOTE: There could be a gap between the last committed offset and the read cursor at
the time a consumer fails. In such a situation, the messages within that gap will be
read a second time. Consumer applications have to be able to tolerate such
duplication.
If a stream is replicated to another cluster for backup, committed cursors are also
replicated. If a cluster fails, consumers that are redirected to the standby copy of a
stream can start reading from the next offset after committed cursors.
How often a consumer should commit depends on how much read duplication you
are willing to tolerate. The more often a consumer commits, the less read duplication
the consumer must contend with. With MapR Streams, we recommend committing
often since MapR Streams can handle high-volume cursor commits better than Kafka.
The length of time since the failed consumer last committed and the rate at which
messages are published determine how many messages are read a second time.
For example, suppose that the auto-commit interval is five seconds. A consumer
saves its commit cursor and then fails after three seconds. During those three
seconds, the consumer's read cursor has continued to move through the messages.
When its partitions are reassigned to other consumers in the group, those consumers
will read three seconds of messages that the failed consumer already read.
Whether the MapR server commits the cursors for a consumer that is in a consumer
group is determined by the enable.auto.commit configuration parameter. You can set
it to true, which enables auto.commit, or false. The default is true.
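For example (auto.commit.interval.ms is the standard Kafka parameter name for the commit interval; five seconds matches the earlier example):
props.put("enable.auto.commit", "true");      // the default
props.put("auto.commit.interval.ms", "5000"); // commit every five seconds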
Answer: C
The more often a consumer commits, the less read duplication the consumer must
contend with.
With MapR Streams we recommend committing often since MapR Streams can
handle high-volume cursor commits better than Kafka.
MapR Streams provides “at least once” message delivery, which means messages
will not be lost once published, but messages may be duplicated.
Duplicates may be caused on the Producer side or the Consumer side. First we will
go over causes on the Producer side. Then we will go over causes on the consumer
side.
If the client library does not receive an acknowledgement from the MapR Streams
server, then the client library will resend the messages which were not acknowledged.
This could lead to duplicates. For example, if some messages were received by the
MapR Server, but the acknowledgement was lost due to network failure, then the
library would resend the unacknowledged messages.
If a producer has sent messages but crashes before the client library buffer has
flushed these messages, then those messages will not be sent.
However, lost messages can be avoided with good producer app design. A producer
can be sure messages were sent by providing a send Callback that will be invoked
when the send request is complete, as explained in Lesson 3. If a producer crashes
before receiving the callback, then on restart the producer could resend these
messages to guarantee delivery.
If a producer fails and resends previously sent messages on restart, this could cause message duplicates, since even if a producer provides a callback, the acknowledgement could be lost.
Remember that each message in a partition is assigned a unique offset ID. Duplicate
messages will have different offset IDs. If it is important for an application to not have
duplicates, then the Producer should embed a unique ID in the message record in
order to remove duplicates during consumer processing.
If duplicate messages are a problem for your application, then you could design your
applications for idempotent messages.
In order to detect duplicates, the Producer can embed a unique ID in the message
record. Consumers can use this unique ID to identify duplicates during processing.
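A sketch of this pattern (names are illustrative; seenIds would be a java.util.Set<String> maintained by the consumer):
// Producer side: embed a unique ID, here as the record key
String msgId = java.util.UUID.randomUUID().toString();
producer.send(new ProducerRecord<>(topicName, msgId, messageText));

// Consumer side: skip records whose ID has already been processed
if (!seenIds.add(record.key())) {
    continue;  // duplicate message; already handled
}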
Answer: E
In Lab 4, we'll try out some of the properties we have discussed in this lesson.