Kinesis
Kinesis easily collects, processes, and analyses video and data streams in real time.
Data Streams -
Streaming data is generated continuously by thousands of data sources, which typically send data records simultaneously and in small sizes.
Amazon Kinesis Data Streams is used to collect and process large streams of
data records in real time. You can create Kinesis Data Streams applications,
which are data-processing applications that read data records from a data
stream. These applications use the Kinesis Client Library and can run on
Amazon EC2 instances. Processed records can be sent to dashboards, used to
generate alerts, sent to other AWS services, or used to dynamically change
advertising and pricing strategies.
1. Durability: It minimises data loss through synchronous replication of
streaming data across three Availability Zones in the AWS Region.
2. Security: Sensitive data can be encrypted at rest within KDS, and you can
access your data privately through Amazon Virtual Private Cloud (VPC)
endpoints.
3. Easy to use and low cost: Components such as connectors, agents, and the
Kinesis Client Library (KCL) help you build streaming applications quickly
and effectively. There is no upfront cost for Kinesis Data Streams; you pay
only for the resources you use.
4. Elasticity and Real-Time Performance: According to SNDK Corp, you can
dynamically scale your applications from gigabytes to terabytes of data per
hour by adjusting the throughput. Real-time analytics applications can be
supplied with streaming data within a very short time of its being
collected.
A Kinesis data stream is a set of shards. Each shard has a sequence of data
records. Each data record has a sequence number that is assigned by
Kinesis Data Streams.
Data Record
A data record is the unit of data stored in a Kinesis data stream. Data records
are composed of a sequence number, a partition key and a data blob, which
is an immutable sequence of bytes. Kinesis Data Streams does not inspect,
interpret, or change the data in the blob in any way. A data blob can be up to
1 MB.
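The limits above (a 1 MB data blob, a 256-character partition key) can be checked on the client before a record is sent. A minimal sketch, with a hypothetical helper name that is not part of any AWS SDK:

```python
# Sketch: client-side validation of a Kinesis data record against the
# documented limits. `validate_record` is an illustrative helper, not an
# AWS API.
MAX_BLOB_BYTES = 1024 * 1024  # data blob may be up to 1 MB
MAX_KEY_CHARS = 256           # partition key may be up to 256 characters

def validate_record(partition_key: str, data: bytes) -> None:
    """Raise ValueError if the record violates Kinesis Data Streams limits."""
    if not partition_key or len(partition_key) > MAX_KEY_CHARS:
        raise ValueError("partition key must be 1-256 characters")
    if len(data) > MAX_BLOB_BYTES:
        raise ValueError("data blob must be at most 1 MB")

validate_record("sensor-42", b"temperature=21.5")  # passes silently
```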
Capacity Mode
A data stream's capacity mode determines how capacity is managed and how
you are charged for the use of your data stream. Currently, in Kinesis Data
Streams, you can choose between an on-demand mode and
a provisioned mode for your data streams.
With the provisioned mode, you must specify the number of shards for the
data stream. The total capacity of a data stream is the sum of the capacities
of its shards. You can increase or decrease the number of shards in a data
stream as needed and you are charged for the number of shards at an hourly
rate.
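The shard arithmetic above can be sketched in Python. The per-shard limits used below (1 MB/s and 1,000 records/s for writes, 2 MB/s for reads) are the standard Kinesis Data Streams shard capacities; the helper name `shards_needed` is our own, not part of any AWS SDK.

```python
# Sketch: sizing a provisioned stream. Total stream capacity is the sum
# of its shards' capacities, so size for the most demanding dimension.
import math

WRITE_MB_PER_SHARD = 1.0        # ingest: 1 MB per second per shard
WRITE_RECORDS_PER_SHARD = 1000  # ingest: 1,000 records per second per shard
READ_MB_PER_SHARD = 2.0         # egress: 2 MB per second per shard

def shards_needed(write_mb_s: float, records_s: float, read_mb_s: float) -> int:
    """Return the number of shards needed to cover every capacity dimension."""
    return max(
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
        1,  # a stream always has at least one shard
    )

print(shards_needed(write_mb_s=4.5, records_s=3000, read_mb_s=6.0))  # 5
```

Here the 4.5 MB/s write rate dominates (it needs 5 shards), so the stream is provisioned with 5 shards even though 3 would cover the record rate and read rate.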
Retention Period
The retention period is the length of time that data records are accessible
after they are added to the stream. A stream's retention period defaults to
24 hours after creation and can be increased to up to 365 days.
Producer
Producers put records into Amazon Kinesis Data Streams. For example, a
web server sending log data to a stream is a producer.
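A minimal producer sketch, assuming the boto3 SDK and an existing stream named "web-logs" (both illustrative). The record-building helper is our own; `put_record` is the real boto3 Kinesis client operation.

```python
# Sketch: a producer putting one log record into a Kinesis data stream.
import json

def build_put_record(stream_name: str, partition_key: str, payload: dict) -> dict:
    """Build the keyword arguments for a Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        "PartitionKey": partition_key,
        "Data": json.dumps(payload).encode("utf-8"),  # the data blob
    }

if __name__ == "__main__":
    import boto3  # imported lazily; running this requires AWS credentials

    kinesis = boto3.client("kinesis")
    params = build_put_record("web-logs", "host-1",
                              {"path": "/index.html", "status": 200})
    resp = kinesis.put_record(**params)
    # Kinesis assigns the shard and sequence number on the server side.
    print(resp["ShardId"], resp["SequenceNumber"])
```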
Consumer
Consumers get records from Amazon Kinesis Data Streams and process
them. These consumers are known as Amazon Kinesis Data Streams
applications.
Shard
A shard is a uniquely identified sequence of data records in a stream. Each
shard provides a fixed unit of capacity: up to 1 MB per second and 1,000
records per second for writes, and up to 2 MB per second for reads.
Partition Key
A partition key is used to group data by shard within a stream. Kinesis Data
Streams segregates the data records belonging to a stream into multiple
shards. It uses the partition key that is associated with each data record to
determine which shard a given data record belongs to. Partition keys are
Unicode strings, with a maximum length limit of 256 characters for each key.
An MD5 hash function is used to map partition keys to 128-bit integer values
and to map associated data records to shards using the hash key ranges of
the shards. When an application puts data into a stream, it must specify a
partition key.
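The mapping above can be sketched directly: MD5 hashes the partition key to a 128-bit integer, and each shard owns a contiguous hash-key range. The even split of the hash space below mirrors how ranges are divided when a stream is created with N shards; the function name is our own.

```python
# Sketch: mapping a partition key to a shard via MD5, as Kinesis does.
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard whose hash-key range contains the key."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = HASH_SPACE // num_shards
    # min() guards the top of the hash space when the split is uneven.
    return min(hash_key // range_size, num_shards - 1)

# All records with the same partition key land on the same shard:
assert shard_for_key("user-123", 4) == shard_for_key("user-123", 4)
```

This is why a hot partition key can overload a single shard: every record with that key hashes into the same range.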
Sequence Number
Each data record has a sequence number that is unique per partition-key
within its shard. Kinesis Data Streams assigns the sequence number after
you write to the stream with client.putRecords or client.putRecord. Sequence
numbers for the same partition key generally increase over time. The longer
the time period between write requests, the larger the sequence numbers
become.
The Kinesis Client Library is compiled into your application to enable fault-
tolerant consumption of data from the stream. The Kinesis Client Library
ensures that for every shard there is a record processor running and
processing that shard. The library also simplifies reading data from the
stream. The Kinesis Client Library uses an Amazon DynamoDB table to store
control data. It creates one table per application that is processing data.
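To see what the Kinesis Client Library automates, here is the bare per-shard read loop it wraps, sketched with boto3 (the stream name "web-logs" and the `process` function are illustrative). The KCL adds checkpointing to the DynamoDB control table, failover, and shard rebalancing on top of this loop.

```python
# Sketch: the raw per-shard polling loop that the KCL manages for you.
def process(record: dict) -> tuple:
    """Stand-in record processor: KCL runs one of these per shard."""
    return (record["PartitionKey"], record["Data"])

if __name__ == "__main__":
    import boto3  # imported lazily; running this requires AWS credentials

    kinesis = boto3.client("kinesis")
    shard_id = kinesis.describe_stream(StreamName="web-logs")[
        "StreamDescription"]["Shards"][0]["ShardId"]
    it = kinesis.get_shard_iterator(
        StreamName="web-logs", ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]  # oldest record
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=100)
        for record in resp["Records"]:
            print(process(record))
        it = resp.get("NextShardIterator")
```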
AWS Kinesis vs Apache Kafka:
Differences -
The biggest difference between the two is that Amazon Kinesis is a managed
service that requires minimal setup and configuration. Kafka is an open-
source solution, requiring significant time, investment and knowledge to
configure.
Kinesis:
• Data Producers are the source devices that emit Data Records.
• The Data Consumer retrieves the Data Records from shards in the stream.
The consumer is the app or service that makes use of the stream data.
Kafka:
• With Kafka, records are immutable from the outset and sent sequentially to
ensure continuous flow without data degradation.
• Kafka has four key APIs. The Producer API sends streams of data to
Topics in the Kafka cluster. The Consumer API reads the streams of data
from topics. The Streams API transforms streams of data from input to
output Topics. The Connect API implements connectors that pull from
source systems and push from Kafka to other systems/services/
applications.
• Kafka has SDK support for Java, while Kinesis supports Android, Java,
Go, and .NET, among others. Because Kafka is open source, however, new
integrations are in development every day.
• But while Kinesis may currently offer more flexibility in integrations, it is less
flexible in terms of configuration: it only allows the retention period (in days)
and the number of shards to be configured, and it writes synchronously to three
machines across different Availability Zones (this standard configuration can
constrain throughput performance). Kafka is more flexible, allowing more control
over configuration, letting you set the replication factor, and, when
configured properly for a use case, can be even more scalable and offer
greater throughput.
• Kinesis allows real-time processing of streaming big data and the ability to
read and replay records to multiple Amazon Kinesis applications.
• Amazon Kinesis Client Library (KCL) delivers all records for a given
partition key to the same record processor, making it easier to build
multiple applications that read from the same Amazon Kinesis stream
(for example, to perform counting, aggregation, and filtering).
Amazon SQS