
Ingesting Data

1 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Lesson Objectives
After completing this lesson, students should be able to:
⬢ Describe data ingestion

⬢ Describe Batch/Bulk ingestion options


– Ambari HDFS Files View
– CLI & WebHDFS
– NFS Gateway
– Sqoop

⬢ Describe streaming framework alternatives


– Flume
– Storm
– Spark Streaming
– HDF / NiFi

2 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ingestion Overview
Batch/Bulk Ingestion
Streaming Alternatives

3 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Data Input Options

Figure: data can enter HDFS through several paths, including the NFS Gateway, hdfs dfs -put, MapReduce jobs, WebHDFS, the HDFS APIs, and vendor connectors.

4 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Real-Time Versus Batch Ingestion Workflows

Real-time and batch processing are very different.

Factor                   Real-Time                                Batch
Data: Age                Usually less than 15 minutes old         Historical; usually more than 15 minutes old
Data: Location           Primarily in memory; moved to disk       Primarily on disk; moved to memory
                         after processing                         for processing
Processing: Speed        Sub-second to a few seconds              A few seconds to hours
Processing: Frequency    Always running                           Sporadic to periodic
Clients: Who             Automated systems only                   Human and automated systems
Clients: Type            Primarily operational applications       Primarily analytical applications

5 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ingestion Overview
Batch/Bulk Ingestion
Streaming Alternatives

6 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ambari Files View

The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS.

Create a directory.

7 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ambari Files View

The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS.

Upload a file.

8 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ambari Files View

The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS.

Rename a directory.

9 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ambari Files View

The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS.

Go up one directory.

10 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ambari Files View

The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS.

Delete to Trash or permanently.

11 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ambari Files View

The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS.

Move to another directory.

12 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ambari Files View

The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS.

Go to directory.

13 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ambari Files View

The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS.

Download to local system.

14 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


The Hadoop Client

⬢ The put command uploads data to HDFS

⬢ Perfect for copying local files into HDFS
⬢ Useful in batch scripts (see the sketch below)
⬢ Usage:

hdfs dfs -put mylocalfile /some/hdfs/path
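For example, a minimal batch-script sketch (the local directory and HDFS path below are hypothetical placeholders, not part of the original slides) that uploads every file in a local landing directory:

#!/bin/bash
# Hypothetical locations; adjust for your environment.
LOCAL_DIR=/var/landing
HDFS_DIR=/data/incoming

# Create the target directory if needed, then upload each local file.
hdfs dfs -mkdir -p "$HDFS_DIR"
for f in "$LOCAL_DIR"/*; do
  hdfs dfs -put "$f" "$HDFS_DIR"/ && echo "Uploaded $f"
done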

15 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


WebHDFS

⬢ REST API for accessing all of the HDFS file system interfaces:

– http://host:port/webhdfs/v1/test/mydata.txt?op=OPEN

– http://host:port/webhdfs/v1/user/train/data?op=MKDIRS

– http://host:port/webhdfs/v1/test/mydata.txt?op=APPEND
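A minimal sketch of invoking these operations with curl; the host names, the ports (50070 for the NameNode and 50075 for a DataNode are common Hadoop 2.x defaults), and the user.name value are assumptions, not values from the original slides:

# Read a file (HTTP GET; -L follows the redirect to a DataNode)
curl -i -L "http://host:50070/webhdfs/v1/test/mydata.txt?op=OPEN&user.name=train"

# Create a directory (HTTP PUT against the NameNode)
curl -i -X PUT "http://host:50070/webhdfs/v1/user/train/data?op=MKDIRS&user.name=train"

# Append to an existing file: step 1 returns a 307 redirect naming a DataNode...
curl -i -X POST "http://host:50070/webhdfs/v1/test/mydata.txt?op=APPEND&user.name=train"
# ...step 2 sends the data to the Location URL returned above (placeholder shown)
curl -i -X POST -T localdata.txt "http://datanode:50075/webhdfs/v1/test/mydata.txt?op=APPEND&user.name=train"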

16 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


NFS Gateway

⬢ Uses NFS standard and supports all HDFS commands


⬢ No random writes
Figure: an application writes files through a local NFSv3 client to the NFS Gateway; the gateway's embedded DFSClient talks to the NameNode (NN) over ClientProtocol and streams data to the DataNodes (DN) over DataTransferProtocol.
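A minimal sketch of mounting HDFS through the gateway from a Linux client; the gateway host name and local mount point are placeholders, and the mount options follow the standard HDFS NFS Gateway documentation:

# On the client machine (as root): mount the gateway's exported root
mkdir -p /hdfs_mount
mount -t nfs -o vers=3,proto=tcp,nolock,sync nfs-gateway-host:/ /hdfs_mount

# Files copied into the mount are written to HDFS by the gateway
cp /var/log/app.log /hdfs_mount/data/logs/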

17 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Sqoop: Database Import/Export

Sqoop moves data between Hadoop and external stores such as relational databases, enterprise data warehouses, and document-based systems.

1. The client executes a sqoop command.
2. Sqoop executes the command as a MapReduce job on the Hadoop cluster, using map-only tasks.
3. Plugins provide connectivity to the various data sources.

18 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


The Sqoop Import Tool

The import command has the following requirements:

⬢ Must specify a connect string using the --connect argument

⬢ Credentials can be included in the connect string, or supplied using the --username and --password arguments

⬢ Must specify either a table to import using --table or the result of a SQL query using --query

19 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Importing a Table
sqoop import \
  --connect jdbc:mysql://host/nyse \
  --table StockPrices \
  --target-dir /data/stockprice/ \
  --as-textfile

20 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Importing Specific Columns
sqoop import \
  --connect jdbc:mysql://host/nyse \
  --table StockPrices \
  --columns StockSymbol,Volume,High,ClosingPrice \
  --target-dir /data/dailyhighs/ \
  --as-textfile \
  --split-by StockSymbol \
  -m 10

21 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Importing from a Query
sqoop import \
  --connect jdbc:mysql://host/nyse \
  --query "SELECT * FROM StockPrices s WHERE s.Volume >= 1000000 AND \$CONDITIONS" \
  --target-dir /data/highvolume/ \
  --as-textfile \
  --split-by StockSymbol

22 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


The Sqoop Export Tool

⬢ The export command transfers data from HDFS to a database:

– Use --table to specify the database table
– Use --export-dir to specify the data to export

⬢ Rows are appended to the table by default
⬢ If you define --update-key, existing rows will be updated with the new data
⬢ Use --call to invoke a stored procedure (instead of specifying the --table argument)

23 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Exporting to a Table
sqoop export \
  --connect jdbc:mysql://host/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --input-fields-terminated-by "\t"

24 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Ingestion Overview
Batch/Bulk Ingestion
Streaming Alternatives

25 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Flume: Data Streaming

A Flume agent is a background process. Log data, event data, social media feeds, etc. enter the agent through a Source and are delivered to the Hadoop cluster through a Sink. Flume uses a Channel between the Source and Sink to decouple the processing of events from the storing of events. (A minimal agent configuration sketch follows below.)
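As an illustration only (not from the original slides), a minimal agent definition in the style of the Flume user guide, wiring a netcat source to a logger sink through a memory channel; the names a1, r1, c1, and k1 are arbitrary:

# Write a minimal agent definition: netcat source -> memory channel -> logger sink
cat > example.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
EOF

# Start the agent in the foreground, logging received events to the console
flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console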

26 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Storm Topology Overview

⬢ Storm data processing occurs in a topology

⬢ A topology consists of spout and bolt components connected by streams

⬢ Spouts bring data into the topology

⬢ Bolts can (but are not required to) persist data, including to HDFS

Figure: a Storm topology in which spouts emit streams into bolts, and bolts emit further streams to downstream bolts.

27 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Message Queues

Various types of message queues are often the source of the data processed by real-time processing engines like Storm.

Figure: real-time data sources (operating systems, services and applications, sensors) emit log entries, events, errors, status messages, etc. into a message queue (Kestrel, RabbitMQ, AMQP, Kafka, JMS, others); the queue is read by Storm.
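For illustration only, publishing and reading messages with Kafka's console tools; the broker host, port 6667, topic name, and the /usr/hdp script locations are assumptions based on a typical HDP-style install:

# Publish a few test messages to a topic (auto-created if the broker allows it)
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh \
  --broker-list broker1:6667 --topic ingest-test

# In another terminal, read the messages back from the beginning of the topic
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh \
  --bootstrap-server broker1:6667 --topic ingest-test --from-beginning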

28 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Spark Streaming

⬢ Streaming applications consist of the same components as a Spark Core application, but add the concept of a receiver

⬢ The receiver is a process running on an executor

Figure: streaming data flows into the Receiver, which produces DStreams; the DStreams are processed by Spark Core to produce output.

29 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Spark Streaming’s Micro-Batch Approach

⬢ Micro-batches are created at regular time intervals

– The receiver takes the data and starts filling up a batch
– After the batch duration completes, the batch is shipped off for processing
– Each batch forms a collection of data entities that are processed together (see the example below)
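One quick way to watch the micro-batch model in action is to run Spark's bundled streaming word-count example against a local socket; the example path below assumes an HDP-style Spark 2 install and is illustrative only (the bundled script uses a 1-second batch interval):

# Terminal 1: generate a simple text stream on port 9999
nc -lk 9999

# Terminal 2: run the bundled streaming example, which counts words per micro-batch
spark-submit \
  /usr/hdp/current/spark2-client/examples/src/main/python/streaming/network_wordcount.py \
  localhost 9999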

30 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


HDF with HDP – A Complete Big Data Solution

Figure: data from the Internet of Anything flows through Hortonworks DataFlow (HDF), powered by Apache NiFi, which delivers perishable insights; HDF stores data and metadata in the Hortonworks Data Platform (HDP), powered by Apache Hadoop, which enriches context and delivers historical insights. Hortonworks DataFlow and the Hortonworks Data Platform together deliver the industry's most complete Big Data solution.

31 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Big Data Ingestion with HDF

HDF workflows and Storm/Spark streaming workflows can be coupled

Figure: HDF collects raw network streams, network metadata streams, syslog, raw application logs, and other streaming telemetry and publishes them to Kafka; Storm and Spark consume from Kafka and write to data stores (Phoenix/HBase, Hive, SOLR) and HDFS, all running on YARN in the Hadoop cluster.

32 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Knowledge Check

33 © Hortonworks Inc. 2011 – 2018. All Rights Reserved



Questions
1. What tool is used for importing data from a RDBMS?
2. List two ways to easily script moving files into HDFS.
3. True/False? Storm operates on micro-batches.
4. Name the popular open-source messaging component that is
bundled with HDP.

37 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Summary

38 © Hortonworks Inc. 2011 – 2018. All Rights Reserved


Summary

⬢ There are many different ways to ingest data, including custom solutions written against the HDFS APIs as well as vendor connectors
⬢ Streaming and batch workflows can work together in a holistic system
⬢ The NFS Gateway may help some legacy systems populate data into HDFS
⬢ Sqoop's configurable number of database connections can overload an RDBMS
⬢ The following are streaming frameworks:
– Flume
– Storm
– Spark Streaming
– HDF / NiFi

39 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
