
BIG DATA | Master IT | Lecture 7 | Course code: M25331

Data Acquisition

Dr. Ali Haider Shamsan

1
Lecture Outlines
• Data Acquisition Considerations
• Publish - Subscribe Messaging Frameworks
• Big Data Collection Systems
• Messaging Queues
• Custom Connectors

Review

• Key-value databases
• Document databases
• Column family databases
• Graph databases
Keywords

big data, database, data acquisition

2
Review
NoSQL

• Key-value databases
• Document databases
• Column family databases
• Graph databases

3
Data Acquisition
• Data acquisition is the process of gathering data from various sources and
converting it into a format that can be used for further analysis.
• It is a critical part of the data science process, as it helps to determine the
quality of the data and its usefulness for a given analysis.
• Data acquisition can involve anything:
• from collecting data from a variety of sources, such as websites, databases, or
physical documents,
• to cleaning and preparing the data for further analysis.
• Data acquisition can also involve the use of tools
• such as data mining, data wrangling, and data visualization to help understand
the data.
• Data acquisition involves the capture of real-time data from
sensors, machines, and other sources.
4
Data Acquisition

• It is an important part of any data analysis project.


• Data acquisition systems collect, store, and process data to
make it available for further analysis.
• They can also be used to acquire digital data from digital
sources such as computers and networks.
• Data acquisition systems are used to collect data for
research, manufacturing, engineering, and other
applications.

5
Data Acquisition

• For example, a data scientist may acquire data from a variety of


sources such as retail websites, a customer survey, and a financial
statement.
• The data scientist might then use data wrangling techniques to
clean the data, such as removing missing values, duplicates, and
outliers.
• The data acquisition process is important for any data science
project,
• as it ensures the data is of high quality and ready to use for further
analysis.

6
Data Acquisition Considerations
Source Type

• The type of the data source has to be taken into consideration when choosing a data connector.
• Data sources can publish bulk data in batches, data in small batches (micro-batches), or streaming real-time data.
• Some examples of batch data sources are:
• Files
• Logs
• Relational databases
• Some examples of real-time data sources are:
• Machines generating sensor data
• Internet of Things (IoT) systems sending real-time data
• Social media feeds
• Stock market feeds
7
Data Acquisition Considerations
Velocity
• The velocity of data refers to how fast the data is generated and how
frequently it varies.
• For data with a high velocity (real-time or streaming data),
communication mechanisms which have low overhead and low
latency are required.
• WebSocket-based and MQTT-based connectors can be used for ingesting real-time and streaming data.
• For such applications, distributed publish-subscribe messaging
frameworks such as Apache Kafka are also good choices as they
support high throughput and low latency communication.

8
Data Acquisition Considerations
Ingestion
Mechanism
• The data ingestion mechanism can either be a push or pull
mechanism.
• The choice of the specific tool or framework for data ingestion will be driven by the data consumer.
• If the consumer has the capability (or requirement) to pull data, publish-subscribe messaging frameworks or messaging queues can be used.
• The data producers push data to a messaging framework or a queue from which the consumers can pull the data.

9
Data Acquisition Considerations
Ingestion
Mechanism

10
Data Acquisition
Publish - Subscribe
Messaging
• Publish-Subscribe is a communication model that comprises publishers,
brokers and consumers.
• Publishers are the sources of data. Publishers send data to topics which
are managed by the broker. Publishers are not aware of the consumers.
• Consumers subscribe to the topics which are managed by the broker.
• When the broker receives data for a topic from a publisher, it sends the
data to all the subscribed consumers.
• Alternatively, the consumers can pull data for specific topics from the
broker.

11
Publish - Subscribe Messaging
Apache
Kafka
• Apache Kafka is a high throughput distributed messaging
system.
• Kafka can also be considered as a distributed, partitioned,
replicated commit log service.
• Kafka can be used for applications such as stream processing,
messaging, website activity tracking, metrics collection and
monitoring, log aggregation, etc.

12
Publish - Subscribe Messaging
Apache
Kafka
Architecture
A Kafka system includes the following
components:
• Topic: A topic is a user-defined category to
which messages are published.
• Producer: Producer is a component that
publishes messages to one or more topics.
• Consumer: Consumer is a component that
subscribes to one or more topics and processes
the messages.
• Broker: Broker is a component that manages
the topics and handles the persistence,
partitioning, and replication of the data.
A Kafka cluster can have multiple Kafka Brokers (or servers), with each Broker managing multiple topics.
13
Publish - Subscribe Messaging
Apache
Kafka
Partitions
• Kafka topics are subdivided into multiple
partitions.
• Each partition is an ordered and
immutable sequence of messages.
• Topics are stored on the disk in the form of
partitioned logs.
• The benefit of using partitions is that a topic can scale to sizes which would not fit on the disk of a single server.
• Partitions also allow multiple consumers to consume messages in parallel.
14
Publish - Subscribe Messaging
Apache
Kafka
Publishing Messages
• Producers publish messages to topics.
• A producer decides which message should be published to which partition of a topic.
• Producers can either publish messages to different partitions of a topic (for load balancing purposes) or publish messages with specific keys to specific partitions.
Consuming Messages
• Each message in a partition is assigned a sequence ID called the offset.
• Offsets are used by the consumers to track which messages have been consumed.
• The consumers increment the offset as they consume the messages in a sequence.
• Consumers can be grouped together into consumer groups.
• Each message which is published to a topic by a producer is delivered to one consumer within a consumer group.
15
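To make the producer and consumer roles above concrete, here is a minimal sketch using the kafka-python client. The broker address localhost:9092, the topic name sensor-readings, and the group id analytics are illustrative assumptions, not part of the lecture.

    # pip install kafka-python  -- sketch assumes a broker running at localhost:9092
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: messages published with the same key go to the same partition,
    # so ordering is preserved per key.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sensor-readings", key=b"device-42", value=b'{"temp": 21.5}')
    producer.flush()

    # Consumer: joining a consumer group lets several consumers share the
    # partitions of a topic; each message is delivered to one member of the group.
    consumer = KafkaConsumer(
        "sensor-readings",
        bootstrap_servers="localhost:9092",
        group_id="analytics",
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        print(msg.partition, msg.offset, msg.key, msg.value)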
Publish - Subscribe Messaging
Apache
Kafka
Log Storage
• Kafka is structured to store messages in the form of append-only logs.
• Messages are made available to consumers only after they have been
committed to the log.
• Unlike queuing systems, which delete messages after they have been consumed, Kafka retains the messages in the log files.
• Log for a topic partition is stored as a directory of segment files.

16
Data Acquisition
Big Data Collection
Systems
• Data collection systems allow collecting, aggregating and
moving data
• from various sources (such as server logs, databases, social
media, streaming sensor data from Internet of Things devices
and other sources)
• into a centralized data store (such as a distributed file system or
a NoSQL database).

17
Big Data Collection Systems

Apache Flume
Apache Flume is a distributed, reliable, and available system for collecting, aggregating,
and moving large amounts of data from different data sources into a centralized data
store.
Flume Architecture
Flume’s architecture is based on data flows and includes the following components:
• Source:
• Source is the component which receives data from external sources. A Flume data flow starts
from a source.
• For example, Flume source can receive data from a social media network (using streaming APIs).
• Channel:
• After the data is received by a Flume source, the data is transmitted to a channel.
• Each channel in a data flow is connected to one sink to which the data is drained.
• A data flow can comprise multiple channels, where a source writes the data to multiple channels.
18
Big Data Collection Systems
Apache Flume
Architecture
• Sink:
• Sink is the component which drains data from a channel to a data store (such as a
distributed file system or to another agent).
• Each sink in a data flow is connected to a channel. Sinks either deliver data to its
final destination or to other agents.
• Agent:
• A Flume agent is a collection of sources, channels and sinks.
• Agent is a process that hosts the sources, channels and sinks from which the data
moves from an external source to its final destination.
• Event:
• An event is a unit of data flow having a payload and an optional set of attributes.
• Flume sources consume events generated by external sources.
19
Big Data Collection Systems

Apache Flume

• Flume uses a data flow model which includes sources, channels and sinks,
encapsulated into agents.
• The simplest data flow has one source, one channel and one sink.
• Sources can multiplex data to multiple channels for either load balancing
purposes, or, for parallel processing.
• More complex data flows can be created by chaining multiple agents, where the sink of one agent delivers data to a source of another agent (a sample agent configuration is sketched below).

20
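To illustrate the data flow model described above (and referenced in the last bullet), here is a minimal sketch of a Flume agent configuration in Java properties format, wiring one NetCat source to one logger sink through one memory channel. The agent name a1, the port 44444, and the capacity value are illustrative assumptions.

    # flume-agent.properties -- hypothetical file name
    a1.sources  = src1
    a1.channels = ch1
    a1.sinks    = snk1

    # NetCat source: turns each line of text received on the port into a Flume event
    a1.sources.src1.type = netcat
    a1.sources.src1.bind = localhost
    a1.sources.src1.port = 44444
    a1.sources.src1.channels = ch1

    # Memory channel: buffers events between the source and the sink
    a1.channels.ch1.type = memory
    a1.channels.ch1.capacity = 1000

    # Logger sink: writes events to the agent's log (useful for testing)
    a1.sinks.snk1.type = logger
    a1.sinks.snk1.channel = ch1

The agent could then be started with something like: flume-ng agent --conf conf --conf-file flume-agent.properties --name a1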
Big Data Collection Systems

Apache Flume

21
Big Data Collection Systems

Apache Flume

22
Big Data Collection Systems

Apache Flume Source

Flume Sources
Flume comes with multiple built-in sources that allow collecting and
aggregating data from a wide range of external systems. Flume also
provides the flexibility to add custom sources.
Avro Source:
• Apache Avro is a data serialization system that provides a compact and fast binary
data format.
• Avro schemas are defined in JSON, and the schema is always stored with the data, which allows programs reading the data to interpret it.
• Avro provides serialization functionality similar to other systems such as Thrift and
Protocol Buffers.
• The Flume Avro source receives events from external Avro client streams.
23
Big Data Collection Systems

Apache Flume Source

Thrift Source:
• Apache Thrift is a serialization framework similar to Avro.
• Thrift provides a software stack and a code generation engine to build services that
work with multiple programming languages.
Exec Source:
• Exec source runs a configured command and ingests the data the command writes to its standard output.
• When an agent with an Exec source is started, the command is executed and the source continues to receive data from its standard output.
• A typical use is tailing a log file, or running a script that combines data from multiple sources into a single stream.

24
Big Data Collection Systems

Apache Flume Source

JMS Source:
• Java Message Service (JMS) is a messaging service that can be used by Java
applications to create, send, receive, and read messages.
• The JMS source receives messages from a JMS queue or topic.
• The destination type can either be a queue or a topic.
Spooling Directory Source:
• Spooling Directory source is useful for ingesting log files.
• A spool directory is set up on the disk from where the Spooling Directory source
ingests the files.
• The Spooling Directory source parses the files and creates events.

25
Big Data Collection Systems

Apache Flume Source

Twitter Source:
• The Flume Twitter source connects to the Twitter streaming API and receives
tweets in real-time.
• The Twitter source converts the tweet objects to Avro format before sending
them to the downstream channel.
• Before setting up the Twitter source, you will need to create a Twitter
application from the Twitter developer account and obtain the consumer and
access tokens and secrets for the application.
NetCat Source:
• NetCat reads and writes data across network connections, using the TCP or UDP protocol.
• The NetCat source listens on a specific port to which the data is written by a NetCat client and turns each line of text received into a Flume event (any TCP client can play the NetCat role, as in the sketch below).
26
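As noted above, any TCP client can feed a NetCat source. The sketch below assumes a NetCat source listening on localhost:44444 (the same hypothetical port used in the earlier example configuration).

    import socket

    # Open a TCP connection to the (assumed) NetCat source and send newline-terminated
    # lines; the source turns each received line into a Flume event.
    with socket.create_connection(("localhost", 44444)) as sock:
        sock.sendall(b"first line of data\n")
        sock.sendall(b"second line of data\n")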
Big Data Collection Systems

Apache Flume Source


Sequence Generator Source:
• Sequence Generator source generates events with a sequence of numbers starting
from 0 and incremented by 1.
• This source is mainly used for testing purposes.
Syslog Source:
• Syslog source is used for ingesting syslog data.
HTTP Source:
• HTTP source receives HTTP events (POST or GET requests) and converts them into Flume events (a request sketch follows this slide).
Custom Source:
• Flume allows custom sources to be integrated into the system.
• Custom sources are implemented in Java.
27
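As referenced above, a client can push events to an HTTP source with a plain POST request. The sketch below assumes an HTTP source configured with its default JSON handler and listening on localhost:8081; the port and handler choice are assumptions about the agent configuration.

    import json
    import requests

    # The default JSON handler (assumed here) expects a JSON array of events,
    # each with an optional "headers" map and a string "body".
    events = [
        {"headers": {"origin": "demo"}, "body": "first event"},
        {"headers": {"origin": "demo"}, "body": "second event"},
    ]
    resp = requests.post("http://localhost:8081",
                         data=json.dumps(events),
                         headers={"Content-Type": "application/json"})
    print(resp.status_code)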
Big Data Collection Systems

Apache Flume Sinks

• Flume comes with multiple built-in sinks.


• Each sink in a Flume agent connects to a channel and drains the data from
the channel to a data store.
HDFS Sink:
• The Hadoop Distributed File System (HDFS) Sink drains events from a channel to
HDFS.
• HDFS sink supports SequenceFile, DataStream and CompressedStream file types.
• The data can be stored in various formats, such as CSV, JSON, Avro, and Parquet.

28
Big Data Collection Systems

Apache Flume Sinks


Avro Sink:
• An Avro sink retrieves events from a channel and drains the events to a downstream host.
Thrift Sink:
• A Thrift sink retrieves events from a channel and drains the events to a downstream host.
File Roll Sink:
• A File Roll sink drains the events to a file on the local filesystem.
Logger Sink:
• A Logger sink retrieves events from a channel and logs the events.
IRC Sink:
• IRC stands for Internet Relay Chat, a popular protocol for real-time text-based communication.
• An IRC sink retrieves events from a channel and drains the events to an IRC host.
Hbase Sink:
• An HBase sink retrieves events from a channel and drains the events to an HBase table.
Custom Sink:
• Flume allows custom sinks to be integrated into the system. Custom sinks are implemented in Java.
29
Big Data Collection Systems
Apache Flume
Channels
Memory Channel:
• Memory channel stores the events in memory and provides high throughput.
File Channel:
• File channel stores the events in files on the local filesystem.
JDBC (Java Database Connectivity) Channel:
• JDBC channel stores the events in a database, which makes the channel durable and recoverable.
Spillable Memory Channel:
• Spillable Memory channel stores events in an in-memory queue and when the queue
fills up, the events are spilled onto the disk.
• This channel provides high throughput and fault tolerance.
Custom Channel:
• Flume allows custom channels to be integrated into the system. Custom channels are implemented in Java.
30
Big Data Collection Systems
Apache Flume
Channel Selectors
• Flume agents can have a single source connected to multiple channels.
• In such cases, the channel selector defines policy about distributing the
events among the channels connected to a single source.
Replicating Channel Selector:
• The default channel selector is the replicating selector, which replicates events received from the source to all the connected channels.
Multiplexing Channel Selector:
• Multiplexing channel selector routes events from a source to specific connected channels based on the value of a header attribute in the event.
Custom Channel Selector:
• Flume allows custom channel selectors to be integrated into the system. Custom channel selectors are implemented in Java.
31
Big Data Collection Systems
Apache Flume Sink
Processors
• A sink processor defines how the events are drained from a channel to a
sink.
• Sink processors enable parallelism, priorities, and automatic failover.
Load balancing Sink Processor:
• The load balancing sink processor allows load balancing of events drained from a
channel between the sinks in the attached sink group.
• The load is distributed among the list of sinks specified using a round robin or
random selection mechanism.
Failover Sink Processor:
• With the Failover Sink processor, priorities can be assigned to sinks in a sink group.
• The attached channel then drains the events to the highest priority sink.
• When the highest priority sink fails, the events are drained to the sink with the next lower priority, providing automatic failover.
32
Big Data Collection Systems
Apache Flume
Interceptors
• Flume interceptors allow events to be modified, filtered or
dropped as they flow from the source to a channel.
• Interceptors can be used to selectively filter events, add or
modify header fields, or enrich event data with external
information.
• Interceptors are connected to the source. Interceptors can also
be chained to each other.

33
Big Data Collection Systems
Apache Flume
Interceptors
Timestamp Interceptor:
• The Timestamp interceptor adds the current timestamp to the headers of the events
processed.
Host Interceptor:
• The Host interceptor adds the hostname of the Flume agent to the headers of the events
processed.
Static Interceptor:
• Static interceptor adds a static header to the events processed.
UUID Interceptor:
• The UUID interceptor adds a universally unique identifier to the headers of the events processed.
Regex Filtering Interceptor:
• Regex Filtering interceptor applies a regular expression to the event body and filters the
matching events.
34
Big Data Collection Systems
Messaging
Queues
• Messaging queues are useful for push-pull messaging where the
producers push data to the queues, and the consumers pull the data from
the queues.
• The producers and consumers do not need to be aware of each other.
• Messaging queues allow decoupling of producers of data from the
consumers.

35
Messaging Queues

RabbitMQ

• RabbitMQ implements the Advanced Message Queuing Protocol (AMQP), which is an open standard that defines a protocol for the exchange of messages between systems.
• AMQP clients can either be producers or consumers.
• The clients can communicate with each other through brokers.
• Broker is a middleware application that receives messages from
producers and routes them to consumers.
• The producers publish messages to the exchanges, which then distribute
the messages to queues based on the defined routing rules.

36
Messaging Queues

RabbitMQ
• AMQP brokers provide four types of exchanges:
• direct exchange (for point-to-point messaging),
• fanout exchange (for multicast messaging),
• topic exchange (for publish-subscribe messaging), and
• headers exchange (which uses header attributes for making routing decisions).
• AMQP is an application level protocol that uses TCP for reliable delivery.
• A logical connection between a producer or consumer and a broker is
called a Channel.

37
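Below is a minimal sketch of an AMQP producer and consumer using the pika client; the exchange name acquisition, the queue name readings, the routing key, and the localhost broker are illustrative assumptions.

    # pip install pika  -- sketch assumes a RabbitMQ broker running on localhost
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()   # logical connection (Channel) to the broker

    # Producer side: publish a message to a direct exchange with a routing key.
    channel.exchange_declare(exchange="acquisition", exchange_type="direct")
    channel.queue_declare(queue="readings")
    channel.queue_bind(queue="readings", exchange="acquisition", routing_key="sensor")
    channel.basic_publish(exchange="acquisition", routing_key="sensor",
                          body=b'{"temp": 21.5}')

    # Consumer side: pull a single message from the queue (returns None values if empty).
    method, properties, body = channel.basic_get(queue="readings", auto_ack=True)
    print(body)
    connection.close()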
Messaging Queues

ZeroMQ

• ZeroMQ is a high-performance messaging library which provides tools to


build a messaging system.
• Unlike other message queuing systems, ZeroMQ can work without a
message broker.
• ZeroMQ provides various messaging patterns such as Request-Reply,
Publish-Subscribe, Push-Pull and Exclusive Pair.

38
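Below is a minimal sketch of ZeroMQ's Push-Pull pattern using the pyzmq binding; running both sockets in one process and the port 5557 are illustrative assumptions.

    # pip install pyzmq  -- no broker is needed; the sockets connect to each other
    import zmq

    context = zmq.Context()

    # Producer side: a PUSH socket distributes messages to connected PULL sockets.
    pusher = context.socket(zmq.PUSH)
    pusher.bind("tcp://*:5557")

    # Consumer side: a PULL socket receives messages from the pusher.
    puller = context.socket(zmq.PULL)
    puller.connect("tcp://localhost:5557")

    pusher.send_string("sensor reading 42")
    print(puller.recv_string())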
Messaging Queues

RestMQ
• RestMQ is an open source message queue system that exposes a fully RESTful API.
• RestMQ can be used for a wide range of applications, such as managing microservices, IoT applications, mobile apps, web applications, and more.
• RestMQ is based on a simple JSON protocol and uses HTTP as the transport.
• Queues are organized as REST resources.
• RestMQ can be used by any client which can make HTTP calls.

39
Custom Connectors
REST-based
Connectors
• The connector exposes a REST web service.
• Data producers can publish data to the connector using HTTP POST
requests which contain the data payload.
• The request data received by the connector is stored in the sink.
• The data sinks in the connector provide the functionality for processing
the HTTP request and storing the data to the sink.
• The HTTP headers add to the request overhead, making this method unsuitable for high-throughput and real-time applications (a minimal REST connector sketch follows).

40
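The sketch below shows how such a REST-based connector could look using Flask; the /ingest path, the port 8080, and appending each payload to a local file in place of a real data sink are illustrative assumptions.

    # pip install flask  -- a toy connector; a real sink would be HDFS, a NoSQL store, etc.
    import json
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/ingest", methods=["POST"])
    def ingest():
        record = request.get_json(force=True)        # data payload sent by the producer
        with open("ingested.jsonl", "a") as sink:    # local file stands in for the data sink
            sink.write(json.dumps(record) + "\n")
        return {"status": "ok"}, 200

    if __name__ == "__main__":
        app.run(port=8080)

A data producer could then push a record with an HTTP POST, for example: requests.post("http://localhost:8080/ingest", json={"temp": 21.5})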
Custom Connectors
REST-based
Connectors

41
Custom Connectors
WebSocket-based
Connectors
• The connector exposes a WebSocket web service.
• The Web Application Messaging Protocol (WAMP) which is a sub-protocol of WebSocket can be
used for creating a WebSocket-based connector.
• WAMP provides publish-subscribe and remote procedure call (RPC) messaging patterns.
• Clients (or data producers) establish a TCP connection with the connector and send data
frames.
• Data producers publish data to the WebSocket endpoints which are published by the
connector.
• The subscribers subscribe to the WebSocket endpoints and receive data from the WebSocket
web service.
• Unlike request-response communication with REST, WebSockets allow full-duplex communication and do not require a new connection to be set up for each message to be sent.
• WebSocket communication begins with a connection setup request sent by the client to the server (a minimal WebSocket connector sketch follows this slide).
42
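As referenced on the previous slide, here is a minimal sketch of a WebSocket endpoint using the Python websockets library. It uses plain WebSockets rather than the WAMP sub-protocol mentioned above, and the port 8765 and echo-style acknowledgement are illustrative assumptions.

    # pip install websockets  -- plain WebSocket endpoint, not a full WAMP router
    import asyncio
    import websockets

    async def ingest(websocket):
        # Full-duplex connection: the producer keeps sending frames on the same
        # connection, and each received frame is treated as one data message.
        # (Older versions of the library pass a second `path` argument to handlers.)
        async for message in websocket:
            print("received:", message)
            await websocket.send("ack")      # optional acknowledgement back to the client

    async def main():
        async with websockets.serve(ingest, "localhost", 8765):
            await asyncio.Future()           # run until the process is stopped

    asyncio.run(main())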
Custom Connectors
WebSocket-based
Connectors

43
Custom Connectors
MQTT-based
Connectors
• MQTT (MQ Telemetry Transport) is a lightweight publish-subscribe
messaging protocol designed for constrained devices.
• MQTT is suitable for Internet of Things (IoT) applications that involve
devices sending sensor data to a server or cloud-based analytics backends
to be processed and analyzed.
• The entities involved in MQTT include:
• Publisher: Publisher is the component which publishes data to the topics managed
by the Broker.
• Broker/Server: Broker manages the topics and forwards the data received on a topic to all the subscribers that are subscribed to the topic.
• Subscriber: Subscriber is the component which subscribes to the topics and receives data published on the topics by the publishers (a minimal MQTT sketch follows this slide).
44
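Below is a minimal sketch of an MQTT publisher and subscriber using the paho-mqtt client; the broker at localhost:1883 (for example Mosquitto) and the topic sensors/temperature are illustrative assumptions.

    # pip install paho-mqtt  -- sketch assumes an MQTT broker on localhost:1883
    # (with paho-mqtt 2.x, pass mqtt.CallbackAPIVersion.VERSION2 to Client())
    import paho.mqtt.client as mqtt

    def on_message(client, userdata, msg):
        # Called by the network loop for every message on a subscribed topic.
        print(msg.topic, msg.payload)

    subscriber = mqtt.Client()
    subscriber.on_message = on_message
    subscriber.connect("localhost", 1883)
    subscriber.subscribe("sensors/temperature")
    subscriber.loop_start()                  # handle network traffic in a background thread

    publisher = mqtt.Client()
    publisher.connect("localhost", 1883)
    publisher.publish("sensors/temperature", payload="21.5")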
Next lecture

• Batch Analysis

Assignment

Explain the data acquisition processes in AWS and Azure.

Deadline

Previous Deadline

Setting up Big Data stack and framework.


45
