
Apache Hive

Sunbeam Infotech

Sunbeam Infotech www.sunbeaminfo.com


Bucketing

Data in bucketed tables is divided into multiple files (buckets).
Rows are assigned to buckets by a hash partitioner: bucket number = hash(bucketing column) % number of buckets, like a hash table of key-value pairs that enables fast searching.
When data is processed using an MR job, the number of reducers will be the same as the number of buckets.
To insert data into a bucketed table, it must be uploaded via a staging table.
Usually buckets are created on unique column(s) to divide data uniformly across multiple reducers.
Bucketing provides better sampling and speeds up map-side joins.
It is mandatory for DML (transactional) operations.
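The hash-partitioner idea above can be sketched in plain Python (this is only an illustration; Hive uses its own hash function, and the sample ids/names are taken loosely from the slide's annotations):

```python
# Minimal sketch of how a hash partitioner maps each row's bucketing-column
# value to one of N bucket files.
def bucket_for(key, num_buckets):
    # hash() stands in for Hive's actual hash function.
    return hash(key) % num_buckets

rows = [(12, "A"), (24, "B"), (35, "C"), (52, "D")]
buckets = {}
for emp_id, name in rows:
    # Each row lands in exactly one bucket, determined only by its key.
    buckets.setdefault(bucket_for(emp_id, 4), []).append((emp_id, name))
print(buckets)
```

Because the bucket of a row depends only on the key's hash, a join or sample on the bucketing column can read just the matching bucket files instead of the whole table.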



Hive Indexes
An optimization technique available in Hive 2.x.
Similar to an RDBMS index.
Used to speed up SELECT queries (searching & grouping).
Indexes internally store addresses of records for given column values. For example, an index on the emp "job" column maps each value (CLERK, MANAGER, PRESIDENT, ...) to the addresses of the matching records.
Creating an index is a time-consuming job (for huge data). If indexing is done while the server is under load, client query performance becomes too low.
In Hive, indexes are created deferred for build: the CREATE INDEX query doesn't build the index, it only registers it so it can be built later (using an ALTER statement).
Index building should be triggered explicitly, when the server is less loaded.
Note: indexes are not supported in Hive 3.x (see the next slide).



Hive Indexes

In Hive, indexes are stored in HDFS (as Hive tables).

These indexes are built by different index handlers, e.g. the compact and bitmap handlers.

Compact:
Stores combinations of indexed column value & its HDFS block id.
Bitmap:
Stores combinations of indexed column value & the list of matching rows as a bitmap.
Bitmap indexes work faster than compact indexes.
Hive indexes are not supported from Hive 3.x onwards. Use materialized views
instead to improve the performance.
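The two index layouts can be illustrated with a plain-Python sketch (not Hive's actual storage format; the row-to-block mapping here is hypothetical):

```python
# Rows of the indexed "job" column, by row position.
rows = ["CLERK", "MANAGER", "CLERK", "PRESIDENT", "CLERK"]

# Compact-style index: value -> set of HDFS block ids containing it.
# (Pretend, for illustration, that row i lives in block i // 2.)
compact = {}
for i, job in enumerate(rows):
    compact.setdefault(job, set()).add(i // 2)

# Bitmap-style index: value -> bitmap over row positions
# (bit i is set if row i holds that value).
bitmap = {}
for i, job in enumerate(rows):
    bitmap[job] = bitmap.get(job, 0) | (1 << i)

print(compact["CLERK"])
print(bin(bitmap["CLERK"]))
```

A lookup for CLERK then reads only the listed blocks (compact) or tests bits (bitmap) instead of scanning every record, which is why bitmap operations on low-cardinality columns tend to be faster.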



Apache Spark
Sunbeam Infotech



Introduction

Spark is a distributed computing framework that can process huge amounts of data.
Spark can be used as part of the Hadoop ecosystem or as an independent distributed computing framework.
Developed by UC Berkeley's AMPLab (it began as a research project); open-sourced under Apache.
Further developed and maintained by Databricks, which provides enterprise Spark support & cloud hosting.
Popular Spark vendors: Databricks, AWS EMR, Cloudera, MapR.
Applications can be written in Scala, Java, Python (PySpark) and R (SparkR).

Spark philosophy:
Unified: similar APIs for any language and workload (SQL, streaming, MLlib, GraphX), with performance in the high-level APIs.
Compute engine: works with any distributed storage, e.g. HDFS, S3, Azure Blob.
Libraries: third-party libraries via spark-packages.org.

Spark toolkit: the high-level APIs (DataFrames) are built on top of the low-level APIs (RDD & DAG).



Hadoop vs Spark

Hadoop: a distributed framework providing distributed storage + distributed computing. Spark: a distributed framework for distributed computing only; not tied up with a particular storage.

Hadoop is developed in Java (JVM based). Spark is developed in Scala (JVM based).

Hadoop is designed for commodity hardware. Spark needs a better hardware config.

In Hadoop, data is processed in RAM and spills to disk. In Spark, data is processed fully in RAM to achieve faster execution.

In a MapReduce job, mappers & reducers are executed as independent JVM processes. In a Spark job, tasks are executed as threads in an Executor process.



PySpark Development

terminal> python3 -m pip install pyspark


In ~/.profile
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
export SPARK_HOME=$HOME/.local/lib/python3.6/site-packages/pyspark
export PATH=$HOME/.local/bin:$PATH
terminal> pyspark
file = sc.textFile("/home/nilesh/spark-2.4.4-bin-hadoop2.7/LICENSE")  # read file as an RDD of lines
lines = file.map(lambda line: line.lower())                           # lowercase each line
words = lines.flatMap(lambda line: line.split())                      # split lines into individual words
word1s = words.map(lambda word: (word, 1))                            # pair each word with the count 1
wordcounts = word1s.reduceByKey(lambda acc, cnt: acc + cnt)           # sum the counts per word
result = wordcounts.collect()                                         # action: gather results to the driver
print(result)
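The same map / flatMap / reduceByKey pipeline can be mimicked in plain Python (no Spark needed) to see what each step produces; since the LICENSE path above is environment-specific, this sketch uses an in-memory list of lines instead:

```python
from collections import Counter

lines = ["Apache Spark", "Apache Hive and Apache Spark"]

# map: lowercase each line
lowered = [line.lower() for line in lines]
# flatMap: split each line into words, flattening into one list
words = [word for line in lowered for word in line.split()]
# map + reduceByKey: pair each word with 1, then sum counts per word
wordcounts = Counter(words)
print(dict(wordcounts))
```

In Spark the same steps run in parallel across partitions, with reduceByKey shuffling pairs so that all counts for one word meet on the same reducer.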



PySpark Development (PyCharm)

PyCharm -> New Project


Select project location
Existing interpreter -> Python3.x
Create Python file (hello.py)
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf().setAppName("Demo01").setMaster("local")
sc = SparkContext(conf=conf)
file = sc.textFile("/home/nilesh/spark-2.4.4-bin-hadoop2.7/LICENSE")
lines = file.map(lambda line: line.lower())
words = lines.flatMap(lambda line: line.split())
word1s = words.map(lambda word: (word,1))
wordcounts = word1s.reduceByKey(lambda acc,cnt: acc + cnt)
result = wordcounts.collect()
print(result)



Spark RDD

Resilient Distributed Dataset

Resilient: fault-tolerant — lost partitions can be recomputed from the lineage (DAG) of transformations.
Distributed: data is partitioned across multiple nodes in the cluster.
Dataset: a collection of records.

RDD characteristics
Immutable: transformations produce new RDDs instead of modifying existing ones.
Lazily evaluated: transformations are computed only when an action is invoked.
Resilient: recoverable on failure via lineage.
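Lazy evaluation can be illustrated with a plain-Python analogy (this is not Spark itself): a generator expression defines a pipeline that does no work until its results are requested, much as RDD transformations run only when an action is called.

```python
calls = []

def expensive(x):
    # Record each invocation so we can see when work actually happens.
    calls.append(x)
    return x * x

data = [1, 2, 3]
pipeline = (expensive(x) for x in data)  # like a transformation: nothing runs yet
assert calls == []                       # no work has been done so far
result = list(pipeline)                  # like an action: triggers the computation
print(result)
```

Deferring work this way lets Spark inspect the whole chain of transformations and optimize it (e.g. pipeline several maps into one pass) before anything executes.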



Spark Installation Modes

Local mode, standalone cluster mode, or on a cluster manager such as YARN, Mesos, or Kubernetes.



Thank you!
Nilesh Ghule <[email protected]>

