Lecture 7 - Data Acquisition
Lecture Outline
• Data Acquisition Considerations
• Publish-Subscribe Messaging Frameworks
• Big Data Collection Systems
• Messaging Queues
• Custom Connectors
Review
NoSQL
• Key-value databases
• Document databases
• Column family databases
• Graph databases
Data Acquisition
• Data acquisition is the process of gathering data from various sources and converting it into a format that can be used for further analysis.
• It is a critical part of the data science process, as it largely determines the quality of the data and its usefulness for a given analysis.
• Data acquisition can range from collecting data from a variety of sources (such as websites, databases, or physical documents) to cleaning and preparing the data for further analysis.
• Data acquisition can also involve the use of tools such as data mining, data wrangling, and data visualization to help understand the data.
• Data acquisition also covers the capture of real-time data from sensors, machines, and other sources.
Data Acquisition Considerations
Source Type
Data Acquisition Considerations
Ingestion Mechanism
• The data ingestion mechanism can be either a push or a pull mechanism.
• The choice of the specific tool or framework for data ingestion is driven by the data consumer.
• If the consumer has the capability (or requirement) to pull data, publish-subscribe messaging frameworks or messaging queues can be used.
• The data producers push data to a messaging framework or a queue from which the consumers can pull the data.
Data Acquisition
Publish-Subscribe Messaging
• Publish-subscribe is a communication model that comprises publishers, brokers and consumers.
• Publishers are the sources of data. Publishers send data to topics which are managed by the broker. Publishers are not aware of the consumers.
• Consumers subscribe to the topics which are managed by the broker.
• When the broker receives data for a topic from a publisher, it sends the data to all the subscribed consumers.
• Alternatively, the consumers can pull data for specific topics from the broker.
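As a conceptual illustration of these roles (not tied to any particular broker product), the sketch below models an in-memory broker in Python; the class and method names are invented for illustration only.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker that manages topics and fans out messages."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of consumer callbacks

    def subscribe(self, topic, callback):
        # Consumers register interest in a topic; they never see the publishers.
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The broker forwards data received for a topic to all subscribed consumers.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
broker.subscribe("sensor-readings", lambda msg: print("consumer A got:", msg))
broker.subscribe("sensor-readings", lambda msg: print("consumer B got:", msg))
broker.publish("sensor-readings", {"temp": 21.5})   # delivered to both consumers
```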
Publish-Subscribe Messaging
Apache Kafka
• Apache Kafka is a high-throughput distributed messaging system.
• Kafka can also be considered as a distributed, partitioned, replicated commit log service.
• Kafka can be used for applications such as stream processing, messaging, website activity tracking, metrics collection and monitoring, log aggregation, etc.
Publish-Subscribe Messaging
Apache Kafka Architecture
A Kafka system includes the following components:
• Topic: A topic is a user-defined category to which messages are published.
• Producer: A producer is a component that publishes messages to one or more topics.
• Consumer: A consumer is a component that subscribes to one or more topics and processes the messages.
• Broker: A broker is a component that manages the topics and handles the persistence, partitioning, and replication of the data.
A Kafka cluster can have multiple Kafka brokers (or servers), with each broker managing multiple topics.
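A minimal sketch of a producer and a consumer using the third-party kafka-python client; the broker address localhost:9092 and the topic name "site-events" are assumptions for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes messages to a topic managed by the broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")   # assumed broker address
producer.send("site-events", b'{"page": "/home", "user": "u1"}')  # assumed topic name
producer.flush()

# Consumer: subscribes to the topic and processes messages as they arrive.
consumer = KafkaConsumer(
    "site-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the beginning if no committed offset exists
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```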
Publish-Subscribe Messaging
Apache Kafka Partitions
• Kafka topics are subdivided into multiple partitions.
• Each partition is an ordered and immutable sequence of messages.
• Topics are stored on disk in the form of partitioned logs.
• The benefit of using partitions is that a topic can scale to massive sizes that would not fit on the disk of a single server.
• Partitions also allow multiple consumers to consume messages in parallel.
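As a sketch of how the partition count is specified when a topic is created, using kafka-python's admin client (the broker address, topic name, and partition count are assumptions):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")   # assumed broker address
# Create a topic split into three partitions so it can be spread across
# brokers and consumed in parallel by up to three consumers in a group.
admin.create_topics([
    NewTopic(name="site-events", num_partitions=3, replication_factor=1)
])
```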
Publish-Subscribe Messaging
Apache Kafka
Publishing Messages
• Producers publish messages to topics.
• A producer decides which message should be published to which partition of a topic.
• Producers can either publish messages to different partitions of a topic (for load balancing purposes) or publish messages with specific keys to specific partitions.
Consuming Messages
• Each message in a partition is assigned a sequence ID called the offset.
• Offsets are used by the consumers to track which messages have been consumed.
• The consumers increment the offset as they consume the messages in sequence.
• Consumers can be grouped together into consumer groups.
• Each message which is published to a topic by a producer is delivered to one consumer within a consumer group.
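A sketch of keyed publishing and consumer groups with kafka-python (the topic, key, and group names are assumptions): messages with the same key are routed to the same partition, and each message is delivered to one consumer within a group.

```python
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Messages with the same key go to the same partition, so all events
# for user "u1" stay ordered relative to each other.
producer.send("site-events", key=b"u1", value=b"clicked /home")
producer.send("site-events", key=b"u1", value=b"clicked /cart")
producer.flush()

# Consumers in the same group share the partitions of the topic;
# each message is delivered to exactly one consumer in the group.
consumer = KafkaConsumer(
    "site-events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",            # assumed consumer-group name
    enable_auto_commit=True,         # consumed offsets are committed automatically
)
for record in consumer:
    print(record.key, record.value, "offset:", record.offset)
```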
Publish-Subscribe Messaging
Apache Kafka Log Storage
• Kafka is structured to store messages in the form of append-only logs.
• Messages are made available to consumers only after they have been committed to the log.
• Unlike queuing systems, which delete messages after they have been consumed, Kafka retains the messages in the log files.
• The log for a topic partition is stored as a directory of segment files.
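Because committed messages are retained in the log rather than deleted on consumption, a consumer can rewind a partition and re-read it; a sketch with kafka-python (topic name and partition number are assumptions):

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partition = TopicPartition("site-events", 0)   # assumed topic, partition 0
consumer.assign([partition])
# Rewind to the start of the retained log and replay every message
# still inside the topic's retention window.
consumer.seek_to_beginning(partition)
for record in consumer:
    print(record.offset, record.value)
```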
Data Acquisition
Big Data Collection Systems
• Data collection systems allow collecting, aggregating and moving data
• from various sources (such as server logs, databases, social media, streaming sensor data from Internet of Things devices and other sources)
• into a centralized data store (such as a distributed file system or a NoSQL database).
Big Data Collection Systems
Apache Flume
Apache Flume is a distributed, reliable, and available system for collecting, aggregating, and moving large amounts of data from different data sources into a centralized data store.
Flume Architecture
Flume’s architecture is based on data flows and includes the following components:
• Source:
• A source is the component which receives data from external sources. A Flume data flow starts from a source.
• For example, a Flume source can receive data from a social media network (using streaming APIs).
• Channel:
• After the data is received by a Flume source, the data is transmitted to a channel.
• Each channel in a data flow is connected to one sink to which the data is drained.
• A data flow can comprise multiple channels, where a source writes the data to multiple channels.
Big Data Collection Systems
Apache Flume Architecture
• Sink:
• A sink is the component which drains data from a channel to a data store (such as a distributed file system) or to another agent.
• Each sink in a data flow is connected to a channel. Sinks either deliver data to their final destination or to other agents.
• Agent:
• A Flume agent is a collection of sources, channels and sinks.
• An agent is a process that hosts the sources, channels and sinks through which data moves from an external source to its final destination.
• Event:
• An event is a unit of data flow having a payload and an optional set of attributes.
• Flume sources consume events generated by external sources.
Big Data Collection Systems
Apache Flume
• Flume uses a data flow model which includes sources, channels and sinks, encapsulated into agents (see the sketch after this list).
• The simplest data flow has one source, one channel and one sink.
• Sources can multiplex data to multiple channels, either for load balancing purposes or for parallel processing.
• More complex data flows can be created by chaining multiple agents, where the sink of one agent delivers data to a source of another agent.
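The sketch below is a toy Python model of the source → channel → sink flow inside a single agent; it is only a conceptual illustration (it does not use Flume's actual Java API), and all names are invented for illustration.

```python
from queue import Queue

class Source:
    """Receives events from an external system and writes them to a channel."""
    def __init__(self, channel):
        self.channel = channel
    def receive(self, event):
        self.channel.put(event)

class Sink:
    """Drains events from a channel into a destination store."""
    def __init__(self, channel, store):
        self.channel = channel
        self.store = store
    def drain(self):
        while not self.channel.empty():
            self.store.append(self.channel.get())

# An "agent" is simply the collection of source, channel, and sink.
store = []
channel = Queue()
source = Source(channel)
sink = Sink(channel, store)

source.receive({"body": "log line 1"})
source.receive({"body": "log line 2"})
sink.drain()
print(store)   # both events delivered to the destination store
```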
Big Data Collection Systems
Flume Sources
Flume comes with multiple built-in sources that allow collecting and aggregating data from a wide range of external systems. Flume also provides the flexibility to add custom sources.
Avro Source:
• Apache Avro is a data serialization system that provides a compact and fast binary data format.
• Avro schemas are defined with JSON, and the schema is always stored with the data, which allows the programs reading the data to interpret it.
• Avro provides serialization functionality similar to other systems such as Thrift and Protocol Buffers.
• The Flume Avro source receives events from external Avro client streams.
Big Data Collection Systems
Thrift Source:
• Apache Thrift is a serialization framework similar to Avro.
• Thrift provides a software stack and a code generation engine to build services that work with multiple programming languages.
Exec Source:
• The Exec source runs a given command and ingests the data that the command writes to standard output.
• When an agent with an Exec source is started, the agent runs the command and continues to receive data from its standard output.
• It is typically used to ingest data from command-line tools, for example tailing continuously growing log files.
Big Data Collection Systems
JMS Source:
• Java Message Service (JMS) is a messaging service that can be used by Java applications to create, send, receive, and read messages.
• The JMS source receives messages from a JMS queue or topic.
• The destination type can be either a queue or a topic.
Spooling Directory Source:
• The Spooling Directory source is useful for ingesting log files.
• A spool directory is set up on the disk, from which the Spooling Directory source ingests the files.
• The Spooling Directory source parses the files and creates events.
Big Data Collection Systems
Twitter Source:
• The Flume Twitter source connects to the Twitter streaming API and receives tweets in real time.
• The Twitter source converts the tweet objects to Avro format before sending them to the downstream channel.
• Before setting up the Twitter source, you will need to create a Twitter application from the Twitter developer account and obtain the consumer and access tokens and secrets for the application.
NetCat Source:
• NetCat reads and writes data across network connections, using the TCP or UDP protocol.
• The NetCat source listens on a specific port to which the data is written by a NetCat client, and turns each line of text received into a Flume event.
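A sketch of a client pushing newline-terminated text to a NetCat source over TCP; the host and port (localhost:44444) are assumptions and must match whatever the agent's configuration actually specifies.

```python
import socket

# Each line of text sent to the NetCat source becomes one Flume event.
with socket.create_connection(("localhost", 44444)) as sock:   # assumed agent address
    sock.sendall(b"first log line\n")
    sock.sendall(b"second log line\n")
```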
Big Data Collection Systems
Apache Flume Interceptors
Interceptors can inspect and modify the events flowing from a source to a channel. Flume provides several built-in interceptors:
Timestamp Interceptor:
• The Timestamp interceptor adds the current timestamp to the headers of the events processed.
Host Interceptor:
• The Host interceptor adds the hostname of the Flume agent to the headers of the events processed.
Static Interceptor:
• The Static interceptor adds a static header to the events processed.
UUID Interceptor:
• The UUID interceptor adds a universally unique identifier to the headers of the events processed.
Regex Filtering Interceptor:
• The Regex Filtering interceptor applies a regular expression to the event body and filters the matching events.
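As a conceptual illustration of what the Timestamp and Regex Filtering interceptors do (plain Python functions, not Flume's interceptor API; the event structure is assumed for illustration):

```python
import re
import time

def timestamp_interceptor(event):
    # Add the current timestamp (in milliseconds) to the event headers.
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

def regex_filtering_interceptor(event, pattern=r"ERROR"):
    # Keep only events whose body matches the regular expression.
    return event if re.search(pattern, event["body"]) else None

event = {"headers": {}, "body": "ERROR disk full on node-3"}
event = timestamp_interceptor(event)
print(regex_filtering_interceptor(event))                          # passes the filter
print(regex_filtering_interceptor({"headers": {}, "body": "INFO ok"}))  # filtered out (None)
```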
Data Acquisition
Messaging Queues
• Messaging queues are useful for push-pull messaging, where the producers push data to the queues and the consumers pull the data from the queues.
• The producers and consumers do not need to be aware of each other.
• Messaging queues allow decoupling of the producers of data from the consumers.
Messaging Queues
RabbitMQ
• RabbitMQ is an open-source message broker that implements the Advanced Message Queuing Protocol (AMQP).
• AMQP brokers provide four types of exchanges:
• direct exchange (for point-to-point messaging),
• fanout exchange (for multicast messaging),
• topic exchange (for publish-subscribe messaging),
• headers exchange (which uses header attributes for making routing decisions).
• AMQP is an application-level protocol that uses TCP for reliable delivery.
• A logical connection between a producer or consumer and a broker is called a channel.
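A minimal sketch using the pika client for RabbitMQ; the broker address, exchange name, and queue name are assumptions for illustration.

```python
import pika

# Open a connection and a channel (the logical connection to the broker).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare a fanout exchange (multicast) and a queue bound to it.
channel.exchange_declare(exchange="metrics", exchange_type="fanout")
channel.queue_declare(queue="metrics_consumer_1")
channel.queue_bind(exchange="metrics", queue="metrics_consumer_1")

# Producer side: push a message to the exchange.
channel.basic_publish(exchange="metrics", routing_key="", body=b"cpu=0.75")

# Consumer side: pull one message from the queue
# (returns (None, None, None) if the queue is empty).
method, properties, body = channel.basic_get(queue="metrics_consumer_1", auto_ack=True)
print(body)
connection.close()
```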
Messaging Queues
ZeroMQ
• ZeroMQ is a high-performance asynchronous messaging library that provides message queue functionality without requiring a dedicated message broker.
Messaging Queues
RestMQ
• RestMQ is an open-source message queue system that exposes a fully RESTful API.
• RestMQ can be used for a wide range of applications, such as managing microservices, IoT applications, mobile apps, web applications and more.
• RestMQ is based on a simple JSON-based protocol and uses HTTP as the transport.
• The queues are organized as REST resources.
• RestMQ can be used by any client which can make HTTP calls.
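Since RestMQ is driven entirely over HTTP, any HTTP client will do; the sketch below uses the requests library, and the server address and queue resource path /q/orders are hypothetical examples (check the RestMQ documentation for the actual routes).

```python
import requests

BASE = "http://localhost:8888"    # assumed RestMQ address
QUEUE = f"{BASE}/q/orders"        # hypothetical queue resource path

# Producer: add a message to the queue with an HTTP POST.
requests.post(QUEUE, data={"value": '{"order_id": 42}'})

# Consumer: fetch the next message from the queue with an HTTP GET.
response = requests.get(QUEUE)
print(response.status_code, response.text)
```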
Custom Connectors
REST-based Connectors
• The connector exposes a REST web service.
• Data producers can publish data to the connector using HTTP POST requests which contain the data payload.
• The request data received by the connector is stored to the sink.
• The data sinks in the connector provide the functionality for processing the HTTP requests and storing the data to the destination data store.
• The HTTP headers add to the request overhead, making this method unsuitable for high-throughput and real-time applications.
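A sketch of a REST-based connector endpoint using Flask; the route path, port, and the store_to_sink helper are assumptions for illustration. A producer would then POST its payload, e.g. with curl or the requests library, to http://localhost:8080/ingest.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def store_to_sink(payload):
    # Placeholder for the connector's sink logic
    # (e.g. write to a distributed file system or a NoSQL store).
    print("storing:", payload)

@app.route("/ingest", methods=["POST"])     # assumed endpoint path
def ingest():
    payload = request.get_json(force=True)  # data payload from the producer's POST request
    store_to_sink(payload)
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=8080)
```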
Custom Connectors
WebSocket-based Connectors
• The connector exposes a WebSocket web service.
• The Web Application Messaging Protocol (WAMP), which is a sub-protocol of WebSocket, can be used for creating a WebSocket-based connector.
• WAMP provides publish-subscribe and remote procedure call (RPC) messaging patterns.
• Clients (or data producers) establish a TCP connection with the connector and send data frames.
• Data producers publish data to the WebSocket endpoints which are exposed by the connector.
• The subscribers subscribe to the WebSocket endpoints and receive data from the WebSocket web service.
• Unlike request-response communication with REST, WebSockets allow full-duplex communication and do not require a new connection to be set up for each message to be sent.
• WebSocket communication begins with a connection setup request sent by the client to the server.
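A sketch of a data producer pushing messages to a WebSocket-based connector using the third-party websockets library; the endpoint URL is an assumption, and WAMP, if used, would sit on top of this connection as a sub-protocol.

```python
import asyncio
import websockets

async def produce():
    # One long-lived, full-duplex connection; no new connection per message.
    async with websockets.connect("ws://localhost:8765/ingest") as ws:  # assumed endpoint
        await ws.send('{"sensor": "s1", "temp": 21.5}')
        await ws.send('{"sensor": "s1", "temp": 21.7}')
        ack = await ws.recv()   # the connector can push responses back on the same connection
        print("connector replied:", ack)

asyncio.run(produce())
```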
Custom Connectors
MQTT-based Connectors
• MQTT (MQ Telemetry Transport) is a lightweight publish-subscribe messaging protocol designed for constrained devices.
• MQTT is suitable for Internet of Things (IoT) applications that involve devices sending sensor data to a server or cloud-based analytics backends to be processed and analyzed.
• The entities involved in MQTT include:
• Publisher: The publisher is the component which publishes data to the topics managed by the broker.
• Broker/Server: The broker manages the topics and forwards the data received on a topic to all the subscribers which are subscribed to the topic.
• Subscriber: The subscriber is the component which subscribes to the topics and receives the data published on the topics by the publishers.
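A sketch of an MQTT publisher and subscriber using the paho-mqtt client (1.x-style callback API is assumed); the broker address and topic name are assumptions for illustration.

```python
import paho.mqtt.client as mqtt

TOPIC = "sensors/room1/temperature"   # assumed topic name

def on_message(client, userdata, message):
    # Called whenever the broker forwards a message published
    # on a topic this subscriber is subscribed to.
    print(message.topic, message.payload.decode())

# Subscriber
subscriber = mqtt.Client()             # assumes paho-mqtt 1.x constructor
subscriber.on_message = on_message
subscriber.connect("localhost", 1883)  # assumed broker address and default MQTT port
subscriber.subscribe(TOPIC)
subscriber.loop_start()                # handle network traffic in a background thread

# Publisher
publisher = mqtt.Client()
publisher.connect("localhost", 1883)
publisher.publish(TOPIC, payload="21.5")

subscriber.loop_stop()
```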
Next lecture
• Batch Analysis
Assignment
Deadline
Previous Deadline