0% found this document useful (0 votes)

46 views49 pages

Hadoop Tutorials: Daniel Lanza Zbigniew Baranowski

This document provides an overview of a 4-part Hadoop tutorial series covering Hadoop foundations, data ingestion, Spark, and data analytics tools. The first session introduces Hadoop and its components, demonstrates exploring sample data with Impala and visualizing results with HUE on a 12-node virtual cluster, and discusses different data formats and techniques to improve performance. Hands-on exercises include loading meetup RSVP data from a stream into HDFS and converting it to Parquet and CSV formats for analysis.

Uploaded by

Ravi Kumar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

46 views49 pages

Hadoop Tutorials: Daniel Lanza Zbigniew Baranowski

Uploaded by

Ravi Kumar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 49

Hadoop Tutorials

Daniel Lanza
Zbigniew Baranowski

Z
4 sessions
• Hadoop Foundations (today)
• Data Ingestion (20-July)
• Spark (3-Aug)
• Data Analytic tools and techniques (31-Aug)

Z
Hadoop Foundations

Z
Goals for today
• Introduction to Hadoop
• Explore and run reports on example data with
Apache Impala (SQL)
• Visualize the result with HUE
• Evaluate different data formats and
techniques to improve performance

Z
Hands-on setup
• 12 node virtualized cluster
– 8GB of RAM, 4 cores per node
– 20GB of SSD storage per node
• Access (haperf10[1-12].cern.ch)
– Everybody who subscribed should have the access
– Try: ssh haperf105 'hdfs dfs -ls‘
• List of commands and queries to be used
$> sh /afs/cern.ch/project/db/htutorials/tutorial_follow_up

Z
What is Hadoop?
• A framework for large scale data processing

– Data Volume (Terabytes, Zettabytes)

– Data Variety (Structured, Unstructured)

– Data Velocity ( Stream processing)

6
Z
What is Hadoop? Architecture
• Data locality (shared nothing) – scales out

Interconnect network
CPU CPU CPU CPU CPU CPU
MEMORY MEMORY MEMORY MEMORY MEMORY MEMORY

Disks Disks Disks Disks Disks Disks

Node 1 Node 2 Node 3 Node 4 Node 5 Node X

7
Z
Z
Zookeeper
Coordination

Flume
Log data collector

HDFS
Impala
SQL

Spark
Large scale data proceesing

YARN

Hadoop Distributed File System

Mahout
Machine learning

Oozie
Workflow manager

Cluster resource manager

Sqoop
Data exchange with RDBMS
MapReduce

Pig
Scripting

Hive
SQL
What is Hadoop? Set of components

HBase
8

NoSql columnar store

Hadoop cluster architecture
• Master and slaves approach
Interconnect network
HDFS YARN Hive
NameNode ResourceManager metastore

Various Various Various

Various Various Various component component component
component component component agents and agents and agents and
agents and agents and agents and demons demons demons
masters masters demons

YARN Node YARN Node YARN Node YARN Node YARN Node YARN Node
Manager Manager Manager Manager Manager Manager
HDFS HDFS HDFS HDFS HDFS HDFS
DataNode DataNode DataNode DataNode DataNode DataNode

Node 1 Node 2 Node 3 Node 4 Node 5 Node X

9
Z
HDFS in nutshell
• Distributed files system for Hadoop
– Fault tolerant -> multiple replicas of data spread across
a cluster
– Scalable -> design to deliver high throughputs,
sacrificing an access latency
– Files cannot be modified in place
• Architecture
– NameNode -> maintains and manages file system
metadata (in RAM)
– DataNodes -> store and manipulate the data (blocks)

Z
How HDFS stores the data
1) File to be stored on HDFS

102 2) Splitting into 256MB

256MB 256MB 1.1GB
256MB 256MB MB blocks
3) Ask NameNode
4) Blocks with their replicas (by default 3) are where to put them
distributed across Data Nodes

256MB 256MB 256MB 256MB

256MB 102
256MB 256MB
MB
102 102
256MB
MB MB
Z DataNode1 DataNode2 DataNode3 DataNode4
Interacting with HDFS
• Command line (examples)
hdfs dfs –ls #listing home dir
hdfs dfs –ls /user #listing user dir…
hdfs dfs –du –h /user #space used
hdfs dfs –mkdir newdir #creating dir
hdfs dfs –put myfile.csv . #storing a file on HDFS
hdfs dfs –get myfile.csv . #getting a file fr HDFS

• Programing bindings
– Java, Python, C++

Z More about HDFS: https://indico.cern.ch/event/404527/

Using Hadoop for data processing

• Get/produce the data

• Load data to Hadoop
• (optional) restructure it into optimized form
• Process the data (SQL, Scala, Java)
• Present/visualise the results

D
Using Hadoop for data processing

• Get/produce the data

• Load data to Hadoop
• (optional) restructure it into optimized form
• Process the data (SQL, Scala, Java)
• Present/visualise the results

D
Example data
• Source

– Meetups are: neighbours getting together to learn

something, do something, share something…

• Streaming API
– curl -s http://stream.meetup.com/2/rsvps
D
Using Hadoop for data processing
• Get/produce the data
• Load data to Hadoop
• (optional) restructure it into optimized form
• Process the data (SQL, Scala, Java)
• Present/visualise the results

D
Loading the data with HDFS command

• Store it locally and then move it to HDFS

– curl -s http://stream.meetup.com/2/rsvps -o meetup_data.json
• Ctrl + C
– hdfs dfs -moveFromLocal meetup_data.json meetup.json
• Directly
– curl -s http://stream.meetup.com/2/rsvps | head -10 |
hdfs dfs -put - meetup.json
• Showing
– hdfs dfs -cat meetup.json

D
Pre-proccesing required
• Convert JSON to Parquet
– SparkSQL
> spark-shell
scala> val meetup_data = sqlContext.read.json("meetup.json")
scala> val sel = meetup_data.select("*").withColumnRenamed("group","group_info")
scala> sel.saveAsParquetFile("meetup_parquet")

• Convert to CSV with Impala

– Create external table
CREATE EXTERNAL TABLE meetup_parquet
LIKE PARQUETFILE '/user/<user_name>/meetup_parquet/<any_parquet_file>.gz.parquet'
STORED AS parquet
LOCATION '/user/<user_name>/meetup_parquet/';
– Create table as select
CREATE TABLE meetup_csv
row format delimited fields terminated by '\t' ESCAPED BY '"' LINES TERMINATED BY '\n'
AS SELECT
... all interesting columns ...
FROM meetup_parquet;

D
Using Hadoop for data processing
• Produce the data
• Load data to Hadoop
• (optional) restructure it into optimized form
• Process the data (SQL, Scala, Java)
• Visualise the results

D
Why SQL?
• It is simple and powerful
– interactive, ad-hoc
– declarative data processing
– no need to compile
• Good for data exploration and reporting
• Structured data
– organization of the data in table abstractions
– optimized processing

D
Apache Impala
• MPP SQL query engine running on Apache Hadoop
• Low latency SQL queries on
– Files stored on HDFS , Apache HBase and Apache Kudu
• Faster than Map-Reduce (Hive)
• C++, no Java GC
Application

ODBC
Q

Res
SS

LL
Q

ult
Res
ult

Query Planner Query Planner Query Planner

Query Coordinator Query Coordinator Query Coordinator

Query Executor Query Executor Query Executor

HDFS HDFS HDFS

D More about Impala and Hive: https://indico.cern.ch/event/434650/

Creating our own table
• Create table
CREATE TABLE meetup_csv
(event_id string, event_name string, ...);

CREATE TABLE meetup_csv

LIKE meetup_csv;
• Populate table
INSERT INTO meetup_csv
SELECT * FROM meetup_csv;
• Create table as select
CREATE TABLE meetup_csv
AS SELECT * from meetup_csv;
D
Querying the data
• Counting records (SQL Hello world!)
SELECT count(*) FROM meetup_csv;
• Most interesting meetups
SELECT DISTINCT event_name, group_name, venue_name
FROM meetup_csv
WHERE event_id IN
(SELECT event_id FROM meetup_csv
GROUP BY event_id ORDER BY count(*) desc
LIMIT 10);

• Not interesting meetings (people did not accept)

SELECT event_name, response, count(*)
FROM meetup_csv
WHERE response='no'
GROUP BY event_name, response
ORDER BY 3 desc;
D
Using Hadoop for data processing
• Produce the data
• Load data to Hadoop
• (optional) restructure it into optimized form
• Process the data (SQL, Scala, Java)
• Visualise the results

D
HUE – Hadoop User Experience
• Web interface to main Hadoop components
– HDFS, Hive, Impala, Sqoop, Oozie, Solr etc.

• HDFS: FS browser, permission and ACLs

configuration, file uploading

• SQL: query execution, results visualisation

• http://haperf100.cern.ch:8888/

D
How to check a profile of the execution
• Impala has build in query profile feature
$ impala-shell
> SELECT event_name, event_url, member_name, venue_name, venue_lat,
venue_lon FROM meetup_csv
WHERE time BETWEEN unix_timestamp("2016-07-06 10:30:00")*1000
AND unix_timestamp("2016-07-06 12:00:00")*1000;
> profile;

• See execution plan

• Per machine or cluster average
– How much data was read from HDFS
– How much CPU time was spent on certain operations
– etc.
Z
profile
• Execution plan profile

• Details for HDFS SCAN fragment (averaged)

Can we optimize the execution?

• Reading all the data: 159.57MB

• Data are stored as text -> not optimally!

• Binary format?
• Apache Avro

Z
Apache Avro data file
• Fast, binary serialization format
• Internal schema with multiple data types
including nested ones
– scalars, arrays, maps, structs, etc
• Schema in JSON
{
"type": "record",
"name": "test", Record {a=27, b=‘foo’}
"fields" : [
{"name": "a", "type":
"long"},
Encoded (hex): 36 06 66 6f 6f
{"name": "b", "type":
"string"}
long – variable- String
Z ] String chars
length zigzag length
Creating Avro table in Impala
• Creating table
CREATE TABLE meetup_avro
LIKE meetup_csv
STORED AS avro;
• Populating the table
INSERT INTO meetup_avro
SELECT * FROM meetup_csv;

• Data size in Avro: 76MB (in CSV was 159MB)

• Run the queries
– ~1.4s (in CSV was ~2s)
Z
Can we do it better? (2)
• Still reading more (all) data than needed!

• What if data is stored in such a way that only a

subset needs to be read
• Use partitioning!

Z
Data partitioning (horizontal)
• Group data by certain attribute(s) in separate
directories
• Will reduce amount of data to be read
Day Month Year No of customers
Aug 2013
10 Aug 2013 17
11 Aug 2013 15
/user/zaza/mydata/Aug2013/data
12 Aug 2013 21
2 Dec 2014 30
3 Dec 2014 34 Dec 2014
4 Dec 2014 31
17 Feb 2015 12 /user/zaza/mydata/Dec2014/data
18 Feb 2015 16
Feb 2015
Z /user/zaza/mydata/Dec2015/data
Partitioning the data with Impala
• Create a new partitioning table
CREATE TABLE meetup_avro_part
(event_id string, event_name string,
time bigint, event_url string,
group_id bigint, group_name string,
group_city string, group_country string,
group_lat double, group_lon double,
group_state string, group_urlname string,
guests bigint, member_id bigint,
member_name string, photo string,
mtime bigint, response string,
rsvp_id bigint, venue_id bigint,
venue_name string, venue_lat double,
venue_lon double)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS avro;

Z
Partitioning the data with Impala
• Populating partitioning table
– the data needs to be reload
INSERT INTO meetup_avro_part
PARTITION (year, month, day)
SELECT *,
year(from_unixtime(cast(time/1000 as bigint))),
month(from_unixtime(cast(time/1000 as bigint))),
day(from_unixtime(cast(time/1000 as bigint)))
FROM meetup_avro;

/user/zaza/mydata/year=2016/month=7/day=6/data
– Impala will create automatically directories like:

• Filter predicates has to be specified on partitioning columns

– where year=2016 and month=7 and day=6
Z
Can we do even better? (3)
• We are interested in reading certain columns
• But we are always reading entire row data
• Solution?
• Columnar store

Col1 Col2 Col3 Col4

D
Parquet data format

• Based on Google “Dremel”

Columnar storage

Pushdowns

D
Slicing and dicing
• Horizontal and vertical partitioning – for
efficient data processing

Col1 Col2 Col3 Col4

D
Horizontal and vertical partitioning
• Create a new table
CREATE TABLE meetup_parquet_part
(event_id string, event_name string,
time bigint, event_url string,
group_id bigint, group_name string,
group_city string, group_country string,
group_lat double, group_lon double,
group_state string, group_urlname string,
guests bigint, member_id bigint,
member_name string, photo string,
mtime bigint, response string,
rsvp_id bigint, venue_id bigint,
venue_name string, venue_lat double,
venue_lon double)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS parquet;

D
Horizontal and vertical partitioning
• Populating partitioning table
– the data needs to be reload
INSERT INTO meetup_parquet_part
PARTITION (year, month, day)
SELECT *,
year(from_unixtime(cast(time/1000 as bigint))),
month(from_unixtime(cast(time/1000 as bigint))),
day(from_unixtime(cast(time/1000 as bigint)))
FROM meetup_avro;

– Size 42MB
• Run queries

D
Can we query faster? (4)
• Use compression?
– Snappy – lightweight with decent compression rate
– Gzip – to save more space but affect performance

• Using an index?
• In Hadoop there is a ‘format’ that has an index
-> HBase

Z
HBase in a nutshell
• HBase is a key-value store on top of HDFS
– horizontal (regions) + vertical (col. families) partitioning
– row key values are indexed within regions
– data typefree – data stored in bytes arrays
• Fast random data access by key
• Stored data can be modified (updated, deleted)
• Has multiple bindings
– SQL (Impala/Hive, Phoenix), Java, Python
• Very good for massive concurrent random data access
• ..but not good for big data sequential processing!
Z
HBase: master-slaves architecture
• HBase master
– assigns table regions/partitions to region servers
– maintains metadata and table schemas
• HBase region servers
– servers clients requests (reading and writing)
– maintain and store the region data on HDFS
– writes WAL in order to recover the data after a
failure
– performs region splitting when needed
HBase table data organisation

More about HBase: https://indico.cern.ch/event/439742/

Creating and loading data to an HBase table with SQL
• Creating HBase table (with 4 column families)
$ hbase shell
> create 'meetup_<username>', 'event', 'group', 'member', 'venue'
> quit

• Mapping Hive/Impala table to HBase table

$ hive
> CREATE EXTERNAL TABLE meetup_hbase
(key string, event_id string, event_name string, time bigint, ...)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler’
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,event:event_name,event:time,event:event_url,group:group_id,group:gro
up_name,group:group_city,group:group_country,group:group_lat,… ")
TBLPROPERTIES("hbase.table.name" = "meetup_<username>");

• Populating the table (key=event_time,even_if,modification time)

$ hive
$ hive
> INSERT INTO meetup_hbase SELECT concat(
cast(nvl(time, 0) as string), event_id, cast(mtime as string)), *
FROM meetup_csv;
Z
Query data by key on HBase through Impala/Hive

• Run queries
$ impala-shell
> SELECT *
FROM meetup_hbase
WHERE key BETWEEN "1462060800" AND "1467331200";

> SELECT *
FROM meetup_hbase
WHERE key BETWEEN
cast(unix_timestamp("2016-07-06 10:30:00") as string)
AND cast(unix_timestamp("2016-07-06 12:00:00") as string);

> SELECT * FROM meetup_hbase

WHERE key = '14679936000002319268721467404430663'

Z
Formats summary
• Hands-on results
data size (MB)
900
query time (s)
770 2 1.9
800
700 1.4
600 1.5
500 0.87 0.8
1
400
300 0.5
159 0.5
200
76 76 42
100 0
0 CSV Avro Avro Parquet HBase
CSV Avro Avro partitioned Parquet HBase partitioned partitioned
partitioned

• Production data
When to use what?
• Partitioning -> always when possible
• Fast full data (all columns) processing -> Avro
• Fast analytics on subset of columns -> Parquet
• Only when predicates on the same key columns -> HBase
(data deduplication, low latency, parallel access)

• Compression in order to further reduce the data volume

– without sacrificing performance -> Snappy
– when data access is sporadic -> Gzip/Bzip or derived

Z
Summary
• Hadoop is a framework for distributed data
processing
– designed to scale out
– optimized for sequential data processing
– HDFS is the core of the system
– many components with multiple functionalities
• You do not have to be a Java guru to start using it
• Choosing data format, partitioning scheme is a key
to achieve good performance and optimal resource
utilisation
Z
Questions & feedback

[email protected]

IT Key Metrics Data 752522 NDX
No ratings yet
IT Key Metrics Data 752522 NDX
28 pages
CMPE 226 Section 2 Database Systems: Spring Semester 2016
No ratings yet
CMPE 226 Section 2 Database Systems: Spring Semester 2016
7 pages
Hadoop Ecosystem
No ratings yet
Hadoop Ecosystem
58 pages
BDT - Unit - II - Hdfs and Hadoop Io
No ratings yet
BDT - Unit - II - Hdfs and Hadoop Io
42 pages
IV-UNIT - BIG - DATA (2 Files Merged)
No ratings yet
IV-UNIT - BIG - DATA (2 Files Merged)
25 pages
Case Study: Hadoop
No ratings yet
Case Study: Hadoop
46 pages
Hadoop Ecosystem
No ratings yet
Hadoop Ecosystem
55 pages
In9040 PHD Presentation Selimozcan 2
No ratings yet
In9040 PHD Presentation Selimozcan 2
36 pages
Hadoop
No ratings yet
Hadoop
71 pages
Unit 2 Part A
No ratings yet
Unit 2 Part A
34 pages
Hadoop Overview Training Material
No ratings yet
Hadoop Overview Training Material
44 pages
1 - HADOOP Crash Course
No ratings yet
1 - HADOOP Crash Course
52 pages
W Java132
No ratings yet
W Java132
14 pages
DBMS Unit-5
No ratings yet
DBMS Unit-5
92 pages
Printing Big Data Hadoop
No ratings yet
Printing Big Data Hadoop
24 pages
Hadoop Notes
No ratings yet
Hadoop Notes
8 pages
Hadoop Ecosystem PDF
No ratings yet
Hadoop Ecosystem PDF
55 pages
Hadoop Ecosystem PDF
No ratings yet
Hadoop Ecosystem PDF
55 pages
DC Hadoop
No ratings yet
DC Hadoop
48 pages
Big Data Analytics - Unit 4
No ratings yet
Big Data Analytics - Unit 4
32 pages
6 Frequently Asked Hadoop Interview Questions and Answers: Q1.What Is Hadoop?
No ratings yet
6 Frequently Asked Hadoop Interview Questions and Answers: Q1.What Is Hadoop?
8 pages
KCC Institute of Technology and Management: Big Data and Analytics Lab File BCDS651
No ratings yet
KCC Institute of Technology and Management: Big Data and Analytics Lab File BCDS651
30 pages
Unit 1 Haoop Architecture
No ratings yet
Unit 1 Haoop Architecture
26 pages
Hadoop Ecosystem
No ratings yet
Hadoop Ecosystem
56 pages
An Introduction To Hadoop Presentation PDF
100% (1)
An Introduction To Hadoop Presentation PDF
91 pages
Unit IV Hadoop
No ratings yet
Unit IV Hadoop
90 pages
Hadoop: A Software Framework For Data Intensive Computing Applications
No ratings yet
Hadoop: A Software Framework For Data Intensive Computing Applications
47 pages
Apache Hadoop: Getting Started With
No ratings yet
Apache Hadoop: Getting Started With
7 pages
Big Data Testing
100% (1)
Big Data Testing
34 pages
BDA Lab Manual
No ratings yet
BDA Lab Manual
62 pages
2-Introduction To Hadoop Eco System
No ratings yet
2-Introduction To Hadoop Eco System
35 pages
Hadoop
No ratings yet
Hadoop
83 pages
Attachment
No ratings yet
Attachment
11 pages
Lecture 2
No ratings yet
Lecture 2
70 pages
Unit # 2
No ratings yet
Unit # 2
23 pages
DSCI 5350 - Lecture 2 PDF
No ratings yet
DSCI 5350 - Lecture 2 PDF
54 pages
Unit II Hadoop and Map Reduce Overview
No ratings yet
Unit II Hadoop and Map Reduce Overview
136 pages
Big Data-UNIT-2
No ratings yet
Big Data-UNIT-2
46 pages
Chapter 2
No ratings yet
Chapter 2
19 pages
Hadoop Echosystem and Ibm Big Insights: Rafie Tarabay Eng - Rafie@Mans - Edu.Eg
No ratings yet
Hadoop Echosystem and Ibm Big Insights: Rafie Tarabay Eng - Rafie@Mans - Edu.Eg
112 pages
Chap4 BigDataStorageAndManagement
No ratings yet
Chap4 BigDataStorageAndManagement
46 pages
Lez.d-01-Hadoop (C)
No ratings yet
Lez.d-01-Hadoop (C)
29 pages
Introduction To The Big Data Ecosystem
No ratings yet
Introduction To The Big Data Ecosystem
13 pages
1 - Big Data and Hadoop Framework
No ratings yet
1 - Big Data and Hadoop Framework
40 pages
5.apache Hadoop Updated
No ratings yet
5.apache Hadoop Updated
57 pages
DS Unit 4.1
No ratings yet
DS Unit 4.1
14 pages
BDA Unit 2
No ratings yet
BDA Unit 2
52 pages
3.1 Hadoop Ecosystem
No ratings yet
3.1 Hadoop Ecosystem
48 pages
6 - HDFS
No ratings yet
6 - HDFS
37 pages
Unit 5 - Introduction To Hadoop
No ratings yet
Unit 5 - Introduction To Hadoop
50 pages
Lecture 4 Introduction To Hadoop
No ratings yet
Lecture 4 Introduction To Hadoop
25 pages
Apache Hadoop Developer Training PDF
100% (1)
Apache Hadoop Developer Training PDF
397 pages
Bda Unit 2
No ratings yet
Bda Unit 2
79 pages
4 Hadoop and HDFS
No ratings yet
4 Hadoop and HDFS
33 pages
Analyzing Big Data in Hadoop Spark
No ratings yet
Analyzing Big Data in Hadoop Spark
30 pages
Sample
No ratings yet
Sample
30 pages
Big Data
No ratings yet
Big Data
67 pages
Unit 5 - Introduction To Hadoop
No ratings yet
Unit 5 - Introduction To Hadoop
50 pages
Bda Lab 1
No ratings yet
Bda Lab 1
9 pages
Big Data Analytics
From Everand
Big Data Analytics
Nitin Kumar Yadav
No ratings yet
Learning Hadoop 2
From Everand
Learning Hadoop 2
Garry Turkington
4/5 (1)
Exploring Hadoop Ecosystem (Volume 1): Batch Processing
From Everand
Exploring Hadoop Ecosystem (Volume 1): Batch Processing
Wei Liu
No ratings yet
Unit 8 DBMS
No ratings yet
Unit 8 DBMS
82 pages
Seminar Smart Health Record Using Machine Learning Report
No ratings yet
Seminar Smart Health Record Using Machine Learning Report
23 pages
Mapping The New Transdisciplinary Theme Conceptually.
No ratings yet
Mapping The New Transdisciplinary Theme Conceptually.
30 pages
Gridgain® In-Memory Computing Platform: Feature Comparison: Pivotal Gemfire®
No ratings yet
Gridgain® In-Memory Computing Platform: Feature Comparison: Pivotal Gemfire®
14 pages
Geomatics 02 00019
No ratings yet
Geomatics 02 00019
17 pages
Why Do We Use This " in Update Task " ??
No ratings yet
Why Do We Use This " in Update Task " ??
2 pages
Theoretical Foundation in Nursing 1
100% (2)
Theoretical Foundation in Nursing 1
57 pages
Live Expand RedHat-based Linux LVM Volume and Filesystem On VMWare Virtual Machines
No ratings yet
Live Expand RedHat-based Linux LVM Volume and Filesystem On VMWare Virtual Machines
6 pages
1232-Article Text-2726-2-10-20240615
No ratings yet
1232-Article Text-2726-2-10-20240615
22 pages
Tableau Pulse Datasheet
No ratings yet
Tableau Pulse Datasheet
2 pages
Questions Template - Libby Boxes and Pitching Template Design Tool
No ratings yet
Questions Template - Libby Boxes and Pitching Template Design Tool
6 pages
Dataset - Website Content Crawler - 2024 12 03 - 08 25 43 824
No ratings yet
Dataset - Website Content Crawler - 2024 12 03 - 08 25 43 824
235 pages
PP ATL Skills
100% (2)
PP ATL Skills
4 pages
Evaluating HRD Programs
No ratings yet
Evaluating HRD Programs
53 pages
Etl in Data Warehouse PDF
No ratings yet
Etl in Data Warehouse PDF
2 pages
Assignment 3
No ratings yet
Assignment 3
4 pages
Research Manuscript Checklist: Level of Efficiency of Highway Patrol Group in Enforcing Traffic Laws in Silang, Cavite
No ratings yet
Research Manuscript Checklist: Level of Efficiency of Highway Patrol Group in Enforcing Traffic Laws in Silang, Cavite
4 pages
AWS Certified Machine Learning Engineer - Associate MLA-C01 Exam - Free Exam Q&as, Page 3 - ExamTopics
No ratings yet
AWS Certified Machine Learning Engineer - Associate MLA-C01 Exam - Free Exam Q&as, Page 3 - ExamTopics
3 pages
Auto Invoice Program To Create AR Invoices in Oracle EBS - iBizSoft Knowledge
No ratings yet
Auto Invoice Program To Create AR Invoices in Oracle EBS - iBizSoft Knowledge
5 pages
Learning at The Edge of The Magic Circle: A Case For Playful Learning
No ratings yet
Learning at The Edge of The Magic Circle: A Case For Playful Learning
50 pages
Data Models
No ratings yet
Data Models
28 pages
CH 5.0ok
No ratings yet
CH 5.0ok
27 pages
Advantages and Disadvantages of Case Study
100% (1)
Advantages and Disadvantages of Case Study
18 pages
CMF Module 1 - Grade12 - PR2
No ratings yet
CMF Module 1 - Grade12 - PR2
4 pages
Q2 - W2-DLL-3Is-May 1-May5-2023
No ratings yet
Q2 - W2-DLL-3Is-May 1-May5-2023
3 pages
Manual Plaxis Network
No ratings yet
Manual Plaxis Network
11 pages
Navyac2023 FINALPHASE LastRanks
No ratings yet
Navyac2023 FINALPHASE LastRanks
46 pages
7 Best Practices For Snowflake Data Governance
No ratings yet
7 Best Practices For Snowflake Data Governance
13 pages

Hadoop Tutorials: Daniel Lanza Zbigniew Baranowski

Uploaded by

Hadoop Tutorials: Daniel Lanza Zbigniew Baranowski

Uploaded by

Hadoop Tutorials

– Data Volume (Terabytes, Zettabytes)

– Data Variety (Structured, Unstructured)

– Data Velocity ( Stream processing)

Disks Disks Disks Disks Disks Disks

Node 1 Node 2 Node 3 Node 4 Node 5 Node X

Hadoop Distributed File System

Cluster resource manager

NoSql columnar store

Various Various Various

Node 1 Node 2 Node 3 Node 4 Node 5 Node X

102 2) Splitting into 256MB

256MB 256MB 256MB 256MB

256MB 256MB 256MB 256MB

Z More about HDFS: https://indico.cern.ch/event/404527/

• Get/produce the data

• Get/produce the data

– Meetups are: neighbours getting together to learn

• Store it locally and then move it to HDFS

• Convert to CSV with Impala

Query Planner Query Planner Query Planner

Query Coordinator Query Coordinator Query Coordinator

Query Executor Query Executor Query Executor

HDFS HDFS HDFS

D More about Impala and Hive: https://indico.cern.ch/event/434650/

CREATE TABLE meetup_csv

• Not interesting meetings (people did not accept)

• HDFS: FS browser, permission and ACLs

• SQL: query execution, results visualisation

• See execution plan

• Details for HDFS SCAN fragment (averaged)

• Reading all the data: 159.57MB

• Data size in Avro: 76MB (in CSV was 159MB)

• What if data is stored in such a way that only a

• Filter predicates has to be specified on partitioning columns

Col1 Col2 Col3 Col4

• Based on Google “Dremel”

Col1 Col2 Col3 Col4

Col1 Col2 Col3 Col4

Col1 Col2 Col3 Col4

More about HBase: https://indico.cern.ch/event/439742/

• Mapping Hive/Impala table to HBase table

• Populating the table (key=event_time,even_if,modification time)

> SELECT * FROM meetup_hbase

• Compression in order to further reduce the data volume

You might also like