
Apache Hive

Sunbeam Infotech

Sunbeam Infotech www.sunbeaminfo.com


Bucketing

Data in bucketed tables is divided into multiple files (buckets).
Rows are assigned to buckets by a hash partitioner: bucket number = hash(bucketing column) % number of buckets, like a hash table of key-value pairs that enables fast searching.
When data is processed using an MR job, the number of reducers will be the same as the number of buckets.
To insert data into a bucketed table, it must be uploaded via a staging table.
Usually buckets are created on unique column(s) to divide data uniformly across multiple reducers.
Bucketing provides better sampling and speeds up map-side joins.
It is mandatory for DML (transactional) operations.
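The hash-partitioner idea above can be sketched in plain Python (this is only an illustration; Hive uses its own hash function, and the sample ids/names are taken loosely from the slide's annotations):

```python
# Minimal sketch of how a hash partitioner maps each row's bucketing-column
# value to one of N bucket files.
def bucket_for(key, num_buckets):
    # hash() stands in for Hive's actual hash function.
    return hash(key) % num_buckets

rows = [(12, "A"), (24, "B"), (35, "C"), (52, "D")]
buckets = {}
for emp_id, name in rows:
    # Each row lands in exactly one bucket, determined only by its key.
    buckets.setdefault(bucket_for(emp_id, 4), []).append((emp_id, name))
print(buckets)
```

Because the bucket of a row depends only on the key's hash, a join or sample on the bucketing column can read just the matching bucket files instead of the whole table.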



Hive Indexes
An optimization technique available in Hive 2.x.
Similar to an RDBMS index.
Used to speed up SELECT queries (searching & grouping).
Indexes internally store addresses of records for given column values. For example, an index on the emp "job" column maps each value (CLERK, MANAGER, PRESIDENT, ...) to the addresses of the matching records.
Creating an index is a time-consuming job (for huge data). If indexing is done while the server is under load, client query performance becomes too low.
In Hive, indexes are created deferred for build: the CREATE INDEX query doesn't build the index, it only registers it so it can be built later (using an ALTER statement).
Index building should be triggered explicitly, when the server is less loaded.
Note: indexes are not supported in Hive 3.x (see the next slide).



Hive Indexes

In Hive, indexes are stored in HDFS (as Hive tables).

These indexes are built by different index handlers, e.g. the compact and bitmap handlers.

Compact:
Stores combinations of indexed column value & its HDFS block id.
Bitmap:
Stores combinations of indexed column value & the list of matching rows as a bitmap.
Bitmap indexes work faster than compact indexes.
Hive indexes are not supported from Hive 3.x onwards. Use materialized views
instead to improve the performance.
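The two index layouts can be illustrated with a plain-Python sketch (not Hive's actual storage format; the row-to-block mapping here is hypothetical):

```python
# Rows of the indexed "job" column, by row position.
rows = ["CLERK", "MANAGER", "CLERK", "PRESIDENT", "CLERK"]

# Compact-style index: value -> set of HDFS block ids containing it.
# (Pretend, for illustration, that row i lives in block i // 2.)
compact = {}
for i, job in enumerate(rows):
    compact.setdefault(job, set()).add(i // 2)

# Bitmap-style index: value -> bitmap over row positions
# (bit i is set if row i holds that value).
bitmap = {}
for i, job in enumerate(rows):
    bitmap[job] = bitmap.get(job, 0) | (1 << i)

print(compact["CLERK"])
print(bin(bitmap["CLERK"]))
```

A lookup for CLERK then reads only the listed blocks (compact) or tests bits (bitmap) instead of scanning every record, which is why bitmap operations on low-cardinality columns tend to be faster.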



Apache Spark
Sunbeam Infotech



Introduction

Spark is a distributed computing framework that can process huge amounts of data.
Spark can be used as part of the Hadoop ecosystem or as an independent distributed computing framework.
Developed by UC Berkeley's AMPLab (it began as a research project); open-sourced under Apache.
Further developed and maintained by Databricks, which provides enterprise Spark support & cloud hosting.
Popular Spark vendors: Databricks, AWS EMR, Cloudera, MapR.
Applications can be written in Scala, Java, Python (PySpark) and R (SparkR).

Spark philosophy:
Unified: similar APIs for any language and workload (SQL, streaming, MLlib, GraphX), with performance in the high-level APIs.
Compute engine: works with any distributed storage, e.g. HDFS, S3, Azure Blob.
Libraries: third-party libraries via spark-packages.org.

Spark toolkit: the high-level APIs (DataFrames) are built on top of the low-level APIs (RDD & DAG).



Hadoop vs Spark

Hadoop: a distributed framework providing distributed storage + distributed computing. Spark: a distributed framework for distributed computing only; not tied up with a particular storage.

Hadoop is developed in Java (JVM based). Spark is developed in Scala (JVM based).

Hadoop is designed for commodity hardware. Spark needs a better hardware config.

In Hadoop, data is processed in RAM and spills to disk. In Spark, data is processed fully in RAM to achieve faster execution.

In a MapReduce job, mappers & reducers are executed as independent JVM processes. In a Spark job, tasks are executed as threads in an Executor process.



PySpark Development

terminal> python3 -m pip install pyspark


In ~/.profile
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
export SPARK_HOME=$HOME/.local/lib/python3.6/site-packages/pyspark
export PATH=$HOME/.local/bin:$PATH
terminal> pyspark
file = sc.textFile("/home/nilesh/spark-2.4.4-bin-hadoop2.7/LICENSE")  # read file as an RDD of lines
lines = file.map(lambda line: line.lower())                           # lowercase each line
words = lines.flatMap(lambda line: line.split())                      # split lines into individual words
word1s = words.map(lambda word: (word, 1))                            # pair each word with the count 1
wordcounts = word1s.reduceByKey(lambda acc, cnt: acc + cnt)           # sum the counts per word
result = wordcounts.collect()                                         # action: gather results to the driver
print(result)
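The same map / flatMap / reduceByKey pipeline can be mimicked in plain Python (no Spark needed) to see what each step produces; since the LICENSE path above is environment-specific, this sketch uses an in-memory list of lines instead:

```python
from collections import Counter

lines = ["Apache Spark", "Apache Hive and Apache Spark"]

# map: lowercase each line
lowered = [line.lower() for line in lines]
# flatMap: split each line into words, flattening into one list
words = [word for line in lowered for word in line.split()]
# map + reduceByKey: pair each word with 1, then sum counts per word
wordcounts = Counter(words)
print(dict(wordcounts))
```

In Spark the same steps run in parallel across partitions, with reduceByKey shuffling pairs so that all counts for one word meet on the same reducer.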



PySpark Development (PyCharm)

PyCharm -> New Project


Select project location
Existing interpreter -> Python3.x
Create Python file (hello.py)
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf().setAppName("Demo01").setMaster("local")
sc = SparkContext(conf=conf)
file = sc.textFile("/home/nilesh/spark-2.4.4-bin-hadoop2.7/LICENSE")
lines = file.map(lambda line: line.lower())
words = lines.flatMap(lambda line: line.split())
word1s = words.map(lambda word: (word,1))
wordcounts = word1s.reduceByKey(lambda acc,cnt: acc + cnt)
result = wordcounts.collect()
print(result)



Spark RDD

Resilient Distributed Dataset

Resilient: fault-tolerant — lost partitions can be recomputed from the lineage (DAG) of transformations.
Distributed: data is partitioned across multiple nodes in the cluster.
Dataset: a collection of records.

RDD characteristics
Immutable: transformations produce new RDDs instead of modifying existing ones.
Lazily evaluated: transformations are computed only when an action is invoked.
Resilient: recoverable on failure via lineage.
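Lazy evaluation can be illustrated with a plain-Python analogy (this is not Spark itself): a generator expression defines a pipeline that does no work until its results are requested, much as RDD transformations run only when an action is called.

```python
calls = []

def expensive(x):
    # Record each invocation so we can see when work actually happens.
    calls.append(x)
    return x * x

data = [1, 2, 3]
pipeline = (expensive(x) for x in data)  # like a transformation: nothing runs yet
assert calls == []                       # no work has been done so far
result = list(pipeline)                  # like an action: triggers the computation
print(result)
```

Deferring work this way lets Spark inspect the whole chain of transformations and optimize it (e.g. pipeline several maps into one pass) before anything executes.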



Spark Installation Modes

Local mode, standalone cluster mode, or on a cluster manager such as YARN, Mesos, or Kubernetes.



Thank you!
Nilesh Ghule <[email protected]>

