Hadoop Tutorials
Daniel Lanza
Zbigniew Baranowski
4 sessions
• Hadoop Foundations (today)
• Data Ingestion (20-July)
• Spark (3-Aug)
• Data Analytic tools and techniques (31-Aug)
Hadoop Foundations
Goals for today
• Introduction to Hadoop
• Explore and run reports on example data with
Apache Impala (SQL)
• Visualize the result with HUE
• Evaluate different data formats and
techniques to improve performance
Hands-on setup
• 12 node virtualized cluster
– 8GB of RAM, 4 cores per node
– 20GB of SSD storage per node
• Access (haperf10[1-12].cern.ch)
– Everybody who subscribed should have the access
– Try: ssh haperf105 'hdfs dfs -ls'
• List of commands and queries to be used
$> sh /afs/cern.ch/project/db/htutorials/tutorial_follow_up
What is Hadoop?
• A framework for large scale data processing
What is Hadoop? Architecture
• Data locality (shared nothing) – scales out
[Diagram: shared-nothing architecture – independent nodes, each with its own CPU and memory, connected by an interconnect network]
What is Hadoop? Set of components
• HDFS – distributed file system
• YARN – resource management
• Zookeeper – coordination
• Flume – log data collector
• Oozie – workflow manager
• Pig – scripting
• Hive – SQL
• Impala – SQL
• Spark – large scale data processing
• HBase – key-value store
[Diagram: a Hadoop cluster – every node runs a YARN NodeManager and an HDFS DataNode]
HDFS in a nutshell
• Distributed file system for Hadoop
– Fault tolerant -> multiple replicas of data spread across
a cluster
– Scalable -> designed to deliver high throughput,
sacrificing access latency
– Files cannot be modified in place
• Architecture
– NameNode -> maintains and manages file system
metadata (in RAM)
– DataNodes -> store and manipulate the data (blocks)
How HDFS stores the data
1) File to be stored on HDFS is split into blocks (256 MB each, plus a smaller final block, here 102 MB)
[Diagram: the blocks and their replicas are distributed across DataNode1, DataNode2, DataNode3 and DataNode4]
Interacting with HDFS
• Command line (examples)
hdfs dfs -ls                 #listing home dir
hdfs dfs -ls /user           #listing user dir
hdfs dfs -du -h /user        #space used
hdfs dfs -mkdir newdir       #creating dir
hdfs dfs -put myfile.csv .   #storing a file on HDFS
hdfs dfs -get myfile.csv .   #getting a file from HDFS
• Programming bindings
– Java, Python, C++
Using Hadoop for data processing
Example data
• Source: Meetup.com RSVPs
• Streaming API
– curl -s http://stream.meetup.com/2/rsvps
Using Hadoop for data processing
• Get/produce the data
• Load data to Hadoop
• (optional) restructure it into optimized form
• Process the data (SQL, Scala, Java)
• Present/visualise the results
Loading the data with HDFS command
• e.g. hdfs dfs -put meetup.json .   #as in the put example above
Pre-processing required
• Convert JSON to Parquet
– SparkSQL
> spark-shell
scala> // read the JSON dump into a DataFrame (schema is inferred)
scala> val meetup_data = sqlContext.read.json("meetup.json")
scala> // rename the "group" column (GROUP is a reserved word in SQL)
scala> val sel = meetup_data.select("*").withColumnRenamed("group","group_info")
scala> // write the result to HDFS as Parquet
scala> sel.saveAsParquetFile("meetup_parquet")
Using Hadoop for data processing
• Produce the data
• Load data to Hadoop
• (optional) restructure it into optimized form
• Process the data (SQL, Scala, Java)
• Visualise the results
Why SQL?
• It is simple and powerful
– interactive, ad-hoc
– declarative data processing
– no need to compile
• Good for data exploration and reporting
• Structured data
– organization of the data in table abstractions
– optimized processing
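For example, one short ad-hoc query already answers a reporting question (a sketch; it assumes the meetup_csv table used later in the hands-on):
SELECT group_country, count(*) AS rsvps
FROM meetup_csv
GROUP BY group_country
ORDER BY rsvps DESC
LIMIT 10;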
Apache Impala
• MPP SQL query engine running on Apache Hadoop
• Low latency SQL queries on
– Files stored on HDFS , Apache HBase and Apache Kudu
• Faster than Map-Reduce (Hive)
• C++, no Java GC
[Diagram: an application submits SQL queries through ODBC and receives the results from Impala]
HUE – Hadoop User Experience
• Web interface to main Hadoop components
– HDFS, Hive, Impala, Sqoop, Oozie, Solr etc.
• http://haperf100.cern.ch:8888/
How to check a profile of the execution
• Impala has a built-in query profile feature
$ impala-shell
> SELECT event_name, event_url, member_name, venue_name, venue_lat,
venue_lon FROM meetup_csv
WHERE time BETWEEN unix_timestamp("2016-07-06 10:30:00")*1000
AND unix_timestamp("2016-07-06 12:00:00")*1000;
> profile;
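The profile output is quite verbose; impala-shell also offers a condensed per-operator view (a sketch, assuming a reasonably recent Impala release):
> summary;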
• Binary format? -> Apache Avro
Apache Avro data file
• Fast, binary serialization format
• Internal schema with multiple data types
including nested ones
– scalars, arrays, maps, structs, etc
• Schema in JSON
{
  "type": "record",
  "name": "test",
  "fields" : [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "string"}
  ]
}
• Example: Record {a=27, b='foo'}
Encoded (hex): 36 06 66 6f 6f
(36 = long value 27 as a variable-length zigzag, 06 = string length, 66 6f 6f = string chars)
Creating Avro table in Impala
• Creating table
CREATE TABLE meetup_avro
LIKE meetup_csv
STORED AS avro;
• Populating the table
INSERT INTO meetup_avro
SELECT * FROM meetup_csv;
Data partitioning (horizontal)
• Group data by certain attribute(s) in separate
directories
• Will reduce amount of data to be read
Day  Month  Year   No of customers
10   Aug    2013   17   \
11   Aug    2013   15    -> /user/zaza/mydata/Aug2013/data
12   Aug    2013   21   /
2    Dec    2014   30   \
3    Dec    2014   34    -> /user/zaza/mydata/Dec2014/data
4    Dec    2014   31   /
17   Feb    2015   12   \
18   Feb    2015   16    -> /user/zaza/mydata/Feb2015/data
Partitioning the data with Impala
• Create a new partitioned table
CREATE TABLE meetup_avro_part
(event_id string, event_name string,
time bigint, event_url string,
group_id bigint, group_name string,
group_city string, group_country string,
group_lat double, group_lon double,
group_state string, group_urlname string,
guests bigint, member_id bigint,
member_name string, photo string,
mtime bigint, response string,
rsvp_id bigint, venue_id bigint,
venue_name string, venue_lat double,
venue_lon double)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS avro;
Partitioning the data with Impala
• Populating the partitioned table
– the data needs to be reloaded
INSERT INTO meetup_avro_part
PARTITION (year, month, day)
SELECT *,
year(from_unixtime(cast(time/1000 as bigint))),
month(from_unixtime(cast(time/1000 as bigint))),
day(from_unixtime(cast(time/1000 as bigint)))
FROM meetup_avro;
– Impala will automatically create directories like:
/user/zaza/mydata/year=2016/month=7/day=6/data
Pushdowns
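Predicates on the partition columns are pushed down, so Impala reads only the matching directories. A sketch against the table created above, with the partition values taken from the example path:
SELECT count(*)
FROM meetup_avro_part
WHERE year = 2016 AND month = 7 AND day = 6;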
Slicing and dicing
• Horizontal and vertical partitioning – for
efficient data processing
Horizontal and vertical partitioning
• Create a new table
CREATE TABLE meetup_parquet_part
(event_id string, event_name string,
time bigint, event_url string,
group_id bigint, group_name string,
group_city string, group_country string,
group_lat double, group_lon double,
group_state string, group_urlname string,
guests bigint, member_id bigint,
member_name string, photo string,
mtime bigint, response string,
rsvp_id bigint, venue_id bigint,
venue_name string, venue_lat double,
venue_lon double)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS parquet;
Horizontal and vertical partitioning
• Populating the partitioned table
– the data needs to be reloaded
INSERT INTO meetup_parquet_part
PARTITION (year, month, day)
SELECT *,
year(from_unixtime(cast(time/1000 as bigint))),
month(from_unixtime(cast(time/1000 as bigint))),
day(from_unixtime(cast(time/1000 as bigint)))
FROM meetup_avro;
– Size 42MB
• Run queries
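For instance, a report touching only a few columns of a single day's partition reads just those column chunks from the matching directory (a sketch that mirrors the earlier profile query):
SELECT event_name, venue_name, venue_lat, venue_lon
FROM meetup_parquet_part
WHERE year = 2016 AND month = 7 AND day = 6;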
Can we query faster? (4)
• Use compression? (see the sketch below)
– Snappy – lightweight, with a decent compression ratio
– Gzip – saves more space but affects performance
• Use an index?
• In Hadoop there is a 'format' that has an index
-> HBase
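On compression: the Parquet codec can be picked per impala-shell session before writing the data; a sketch (meetup_parquet_gzip is a hypothetical table name; SNAPPY is Impala's default for Parquet):
SET COMPRESSION_CODEC=gzip;          -- or snappy (default), none
CREATE TABLE meetup_parquet_gzip     -- hypothetical table name
STORED AS parquet
AS SELECT * FROM meetup_parquet_part;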
HBase in a nutshell
• HBase is a key-value store on top of HDFS
– horizontal (regions) + vertical (col. families) partitioning
– row key values are indexed within regions
– type-free – data stored as byte arrays
• Fast random data access by key
• Stored data can be modified (updated, deleted)
• Has multiple bindings
– SQL (Impala/Hive, Phoenix), Java, Python
• Very good for massive concurrent random data access
• ..but not good for big data sequential processing!
HBase: master-slaves architecture
• HBase master
– assigns table regions/partitions to region servers
– maintains metadata and table schemas
• HBase region servers
– serve client requests (reading and writing)
– maintain and store the region data on HDFS
– write a WAL (write-ahead log) in order to recover the data
after a failure
– perform region splitting when needed
HBase table data organisation
• Run queries
$ impala-shell
> SELECT *
FROM meetup_hbase
WHERE key BETWEEN "1462060800" AND "1467331200";
> SELECT *
FROM meetup_hbase
WHERE key BETWEEN
cast(unix_timestamp("2016-07-06 10:30:00") as string)
AND cast(unix_timestamp("2016-07-06 12:00:00") as string);
Formats summary
• Hands-on results
[Charts: data size (MB) and query time (s) per format – CSV, Avro, Avro partitioned, Parquet partitioned, HBase; sizes shrink from 770 MB (CSV) down to 42 MB (partitioned Parquet), query times from ~1.9 s down to ~0.5 s]
• Production data
When to use what?
• Partitioning -> always when possible
• Fast full data (all columns) processing -> Avro
• Fast analytics on subset of columns -> Parquet
• Lookups with predicates on the key columns only -> HBase
(data deduplication, low latency, parallel access)
Summary
• Hadoop is a framework for distributed data
processing
– designed to scale out
– optimized for sequential data processing
– HDFS is the core of the system
– many components with multiple functionalities
• You do not have to be a Java guru to start using it
• Choosing the data format and partitioning scheme is key
to achieving good performance and optimal resource
utilisation
Questions & feedback